Remember those three components we discussed: the brain (weights), blueprint (config), and translator (tokenizer)? Today, we’re getting our hands dirty and exploring what they actually look like when you download a real AI model to your computer.
By the end of this post, you’ll have seen the raw files that make up artificial intelligence. No black boxes, no mystery. Just the actual math and data structures that power modern AI.
Meet TinyLlama: The Perfect AI to Dissect
For our exploration, we’re using TinyLlama-1.1B-Chat , a compact AI model that’s perfect for understanding how these systems work. At just 1.1 billion parameters (about 760MB when optimized), it’s small enough to run on any modern laptop but sophisticated enough to hold real conversations and answer questions.
Think of TinyLlama as AI’s equivalent of a high-performance motorcycle compared to the eighteen-wheeler trucks that are GPT-4, Claude, or Gemini. It’s nimble, efficient, and you can actually see how all the parts work together. This makes it perfect for education and experimentation.
Why Start Small? TinyLlama vs. The Giants
Before we dive into the files, let’s put TinyLlama in context. When you use ChatGPT, Claude, or Gemini, you’re interacting with models that have:
- 100+ billion parameters (compared to TinyLlama’s 1.1 billion)
- Massive server farms (TinyLlama runs on your laptop)
- Proprietary architectures (TinyLlama is fully open-source)
- Corporate filtering (TinyLlama gives you raw AI responses)
Here’s the fascinating part: TinyLlama uses the same fundamental architecture as those giants, the transformer model that powers virtually every modern AI. It’s like comparing a Formula 1 car to a go-kart. The go-kart has the same basic engineering principles, just at a scale you can actually understand and tinker with.
My workshop teachers were amazed: “Wait, this tiny model running on my laptop uses the same technology as ChatGPT?” Yes, just with fewer parameters and less training data.
The Download: What Actually Hits Your Hard Drive
Let’s see what happens when you download TinyLlama. Here’s what lands in your folder:
TinyLlama-1.1B-Chat-v1.0/
├── model-00001-of-00002.safetensors (548 MB)
├── model-00002-of-00002.safetensors (222 MB)
├── model.safetensors.index.json (14.7 KB)
├── config.json (665 bytes)
├── generation_config.json (147 bytes)
├── tokenizer.model (500 KB)
├── tokenizer.json (1.84 MB)
├── tokenizer_config.json (1.19 KB)
├── special_tokens_map.json (435 bytes)
└── README.md (4.21 KB)
That’s it. 760MB total. No executable files. No installation wizards. No complex software packages.
Compare this to installing Microsoft Word (several gigabytes) or even a modern video game. An entire artificial intelligence, capable of conversations, reasoning, and creative writing, fits in less space than a single high-resolution movie.
When one of my teachers first saw this, she laughed: “My phone’s photo library is bigger than this AI!”
The Brain: Exploring TinyLlama’s Weights
Those .safetensors files contain TinyLlama’s “learned knowledge.” 1.1 billion numbers that encode everything it knows about language, reasoning, and the world.
Here’s what it looks like when you peek inside:
from safetensors import safe_open
# Open the weights file
with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
# Look at the layer names and shapes
for key in list(f.keys())[:5]:
tensor = f.get_tensor(key)
print(f"{key}: {tensor.shape}")
from safetensors import safe_open
# Open the weights file
with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
# Look at the layer names and shapes
for key in list(f.keys())[:5]:
tensor = f.get_tensor(key)
print(f"{key}: {tensor.shape}")
Output:
model.embed_tokens.weight: torch.Size([32000, 2048])
model.layers.0.self_attn.q_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.k_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.v_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.o_proj.weight: torch.Size([2048, 2048])
model.embed_tokens.weight: torch.Size([32000, 2048])
model.layers.0.self_attn.q_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.k_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.v_proj.weight: torch.Size([2048, 2048])
model.layers.0.self_attn.o_proj.weight: torch.Size([2048, 2048])
Each of these represents millions of carefully tuned parameters. That first layer (embed_tokens.weight) alone contains 65.5 million numbers (32,000 × 2,048).
Here’s what a tiny sample of TinyLlama’s “thoughts” looks like as raw numbers:
tensor([[ 0.0234, -0.0127, 0.0089, ..., -0.0045],
[-0.0156, 0.0089, -0.0234, ..., 0.0123],
[ 0.0067, -0.0198, 0.0134, ..., -0.0089],
...])
tensor([[ 0.0234, -0.0127, 0.0089, ..., -0.0045],
[-0.0156, 0.0089, -0.0234, ..., 0.0123],
[ 0.0067, -0.0198, 0.0134, ..., -0.0089],
...])
One teacher stared at this output and said, “So when TinyLlama writes a poem or explains photosynthesis, it’s all just… these decimal numbers doing math?”
Exactly. Every creative response, every logical deduction, every moment of seeming understanding, all emerge from mathematical operations on these billion carefully arranged numbers.
Size Comparison: TinyLlama vs. The Giants
To put this in perspective:
| Model | Parameters | Size | Capability |
|---|---|---|---|
| TinyLlama | 1.1 B | 760MB | Good conversations, basic reasoning |
| GPT-3.5 | ~175B | ~350GB | Very capable, what powered early ChatGPT |
| GPT-4 | ~1.7T | ~3.4TB | Extremely capable, current ChatGPT |
TinyLlama has roughly 0.6% the parameters of GPT-3.5, yet it can still hold conversations, answer questions, and even write simple code. This showcases how remarkably efficient the transformer architecture is, even at tiny scales.
TinyLlama is literally a miniature version of the same architecture powering the AI giants. It’s like having a working scale model of a Formula 1 engine, all the same principles, just smaller.
The Blueprint: TinyLlama’s Config File
TinyLlama’s config file reveals its architectural DNA:
{
"architectures": ["LlamaForCausalLM"],
"attention_bias": false,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 2048,
"intermediate_size": 5632,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 22,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.34.0",
"use_cache": true,
"vocab_size": 32000
}
{
"architectures": ["LlamaForCausalLM"],
"attention_bias": false,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 2048,
"intermediate_size": 5632,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 22,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.34.0",
"use_cache": true,
"vocab_size": 32000
}
Notice how this compares to larger models:
- hidden_size: 2048 (GPT-4 likely uses 8192+)
- num_hidden_layers: 22 (GPT-4 probably has 80+)
- max_position_embeddings: 2048 (newer models handle 32K+ tokens)
Decoding the Blueprint: What These Numbers Actually Control
Let’s break down what these seemingly cryptic settings actually mean. Think of this as reading the specifications on a car engine. Each number tells you something important about how this AI is built:
The Core Architecture
- ” architectures”: [“LlamaForCausalLM”] - This tells software “build a LLaMA-style model that predicts the next word in a sequence.” It’s like specifying “V8 engine” vs “electric motor.”
- ” model_type”: “llama” - Confirms we’re using the LLaMA architecture family (the same foundation as ChatGPT’s competitors).
The Brain’s Dimensions
- ” hidden_size”: 2048 - This is how “wide” the AI’s thinking is. Each layer processes information in chunks of 2,048 numbers. Bigger models use 4,096 or even 8,192—more width means more sophisticated thinking.
- ” num_hidden_layers”: 22 - The AI has 22 layers of processing, like a 22-story building where each floor does more complex thinking. GPT-4 probably has 80+ layers.
- ” intermediate_size”: 5632 - In each layer, the AI briefly expands its thinking to this size before compressing back down. It’s like having a workspace bigger than your desk for complex tasks.
The Attention System
- ” num_attention_heads”: 32 - The AI can focus on 32 different aspects of the text simultaneously. Think of it as 32 spotlight operators, each highlighting different parts of what you wrote.
- ” num_key_value_heads”: 4 - A newer optimization that shares some attention work, making the model more efficient without losing much capability.
Memory and Context
- ” max_position_embeddings”: 2048 - The AI can remember roughly 2,048 tokens of conversation (about 1,500 words). Once you exceed this, it starts “forgetting” the beginning of your chat.
- ” vocab_size”: 32000 - The AI knows 32,000 different token “words” (including word pieces and punctuation).
Technical Optimizations
- ” torch_dtype”: “float16” - Uses 16-bit numbers instead of 32-bit for efficiency. It’s like using a smaller but still precise measuring cup. This saves memory with minimal accuracy loss.
- ” bos_token_id”: 1, “eos_token_id”: 2 - Special markers for “beginning of sentence” and “end of sentence.” Token #1 means “start here,” token #2 means “I’m done talking.”
- ” hidden_act”: “silu” - The mathematical function used to process information between layers. Different activation functions change how the AI “thinks” about problems.
Performance Tweaks
- ” use_cache”: true - Remembers previous calculations to speed up responses during conversation.
- ” attention_bias”: false - A technical detail about how attention calculations work—disabled here for efficiency.
The fascinating part? Change any of these numbers, and you change how the AI behaves. Bump num_hidden_layers from 22 to 44, and you’ve doubled the model’s depth (and size). Increase hidden_size from 2048 to 4096, and you’ve made it much more capable but also much hungrier for memory.
This config file is literally the DNA of artificial intelligence. A recipe that transforms billions of numbers into something that can be understood and generate human language.
The Translator: How TinyLlama Sees Text
TinyLlama uses the same tokenizer as its bigger siblings. Let’s watch it work:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Let's see how TinyLlama breaks down text
text = "TinyLlama is a small but capable AI model!"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=False)
print("Original text:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids[:10], "...") # Just show first 10
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Let's see how TinyLlama breaks down text
text = "TinyLlama is a small but capable AI model!"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=False)
print("Original text:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids[:10], "...") # Just show first 10
Output:
Original text: TinyLlama is a small but capable AI model!
Tokens: ['▁T', 'iny', 'L', 'lama', '▁is', '▁a', '▁small', '▁but', '▁cap', 'able', '▁AI', '▁model', '!']
Token IDs: [323, 4901, 365, 26653, 338, 263, 2319, 541, 2117, 519, 319, 1992, 29991]
Original text: TinyLlama is a small but capable AI model!
Tokens: ['▁T', 'iny', 'L', 'lama', '▁is', '▁a', '▁small', '▁but', '▁cap', 'able', '▁AI', '▁model', '!']
Token IDs: [323, 4901, 365, 26653, 338, 263, 2319, 541, 2117, 519, 319, 1992, 29991]
Interesting observations:
- “TinyLlama” gets split into pieces: “▁T”, “iny”, “L”, “lama”
- Common words like “is” and “a” get their own tokens
- “capable” splits into “▁cap” and “able”
This tokenization is identical to what GPT-3.5, ChatGPT, and other LLaMA-based models use. The difference isn’t in how they see text. It’s in how many parameters they have to process that text.
Real-World Performance: What Can TinyLlama Actually Do?
Is it as sophisticated as ChatGPT? No. But it demonstrates that the fundamental capability of AI emerges even at small scales. One teacher said, “It’s like having a smart middle schooler who can help with homework. Not a PhD researcher, but still genuinely useful!”
Understanding TinyLlama helps you appreciate what you’re getting with larger models:
- Speed: TinyLlama responds in milliseconds on a laptop. ChatGPT takes seconds on massive server farms.
- Complexity: TinyLlama handles simple tasks well but struggles with complex reasoning chains that larger models master.
- Knowledge: TinyLlama has broad but shallow knowledge. Larger models have both breadth and depth.
- Context: TinyLlama remembers ~2K tokens of conversation. GPT-4 can handle 32K+ tokens.
- Accuracy: TinyLlama makes more mistakes but gets basic facts right. Larger models are more reliable.
The Surprising Discoveries
Here’s what amazed the teachers most when we explored TinyLlama:
- ” It’s the same technology as ChatGPT”: The architecture is identical, just scaled down. This demystified how the AI giants work.
- ” Size vs. capability isn’t linear”: TinyLlama with 0.6% of GPT-3.5’s parameters still delivers maybe 40% of the usefulness for many tasks.
- ” Local AI is actually practical”: Running your own AI isn’t just possible. It’s fast, private, and surprisingly capable.
- ” The files are surprisingly small”: 760MB for an entire artificial intelligence felt impossibly compact.
Why This Matters for Understanding AI
TinyLlama serves as the perfect model for understanding AI:
- Accessible Scale: You can actually download, run, and experiment with it.
- Real Performance: It’s not a toy; it genuinely demonstrates AI capabilities.
- Same Principles: Everything you learn applies to understanding larger models.
- Hands-On Learning: You can modify configs, fine-tune weights, and see immediate results.
When teachers ask me, “How do the AI giants work?” I now point them to TinyLlama. It’s like learning about cars by studying a motorcycle engine. The same fundamental principles, just at a scale where you can see how all the parts work together.
Your Turn: Explore TinyLlama
Want to crack open your own AI? Download TinyLlama and start exploring:
- Download the model from Hugging Face
- Examine the file sizes and compare them to the software you know
- Load the tokenizer and see how it splits different texts
- Peek at the config and imagine tweaking the parameters
- Run it locally and marvel that it works on your laptop
The next time someone talks about AI as some incomprehensible black box running in distant data centers, you can smile knowingly.
Because you’ve seen what’s actually possible with just 760MB of carefully organized math. And if TinyLlama can hold conversations and write code on your laptop, imagine what those giants with 1000x more parameters can accomplish.
The magic isn’t in the mystery. It’s in the math, brilliantly scaled from tiny to tremendous.