One-sentence summary: The Transformer forward pass is a pipeline — text → tokens → embeddings + position → N blocks (Attention + FFN) → linear projection to vocabulary → Softmax probabilities → predicted next token. Understand this pipeline and you understand how GPT "thinks."
15.1 The Big Picture: Decoder-Only Architecture
15.1.1 GPT, LLaMA, Claude Are All Decoder-Only
Modern language models — GPT, LLaMA, Claude — all use a Decoder-Only architecture. Unlike the original Transformer's Encoder-Decoder design, Decoder-Only keeps only the decoder stack and focuses exclusively on autoregressive generation.
15.1.2 GPT-2 vs GPT-1 Architecture Comparison
These two models share the Decoder-Only skeleton. The main difference is LayerNorm placement:
| Aspect | GPT-1 (Post-Norm) | GPT-2 (Pre-Norm) |
|---|---|---|
| LayerNorm position | After Attention/FFN | Before Attention/FFN |
| Training stability | Less stable | More stable |
| Modern models | — | LLaMA, GPT-3, and nearly everything since |
Pre-Norm is now the default. We'll use GPT-2 as our reference and trace the data flow from bottom to top.
15.1.3 Complete Pipeline Overview
Input text: "The agent opened a pull request for"
|
Step 1: Tokenization
|
Step 2: Word Embeddings
|
Step 3: Positional Encoding
|
Steps 4-6: N × Transformer Block (GPT-2 Pre-Norm style)
(LayerNorm → Attention → Residual → LayerNorm → FFN → Residual)
|
Step 7: Final Layer Norm
|
Step 8: Linear → Softmax → Output Probability
|
Predicted next token: "review" (highest probability)
15.2 Steps 1-3: Input Processing
15.2.1 Step 1: Tokenization
Text becomes numbers. Using tiktoken with cl100k_base:
Input: "The agent opened a pull request for"
Token IDs: [791, 8479, 9107, 264, 6958, 1715, 369]
Length: 7 tokens
Each word or subword piece maps to a unique integer ID. The tokenizer does not see characters — it sees learned vocabulary entries.
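The vocabulary-lookup idea can be sketched in a few lines. This is a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary, not tiktoken's actual BPE (which learns ~100k subword entries and segments text differently), but it shows how text becomes a list of integer IDs:

```python
# Toy illustration of tokenization as vocabulary lookup.
# Real tokenizers (e.g. tiktoken's cl100k_base BPE) learn ~100k subword
# entries; this hypothetical 8-entry vocab just shows the mapping idea.
TOY_VOCAB = {"The": 0, " agent": 1, " open": 2, "ed": 3, " a": 4,
             " pull": 5, " request": 6, " for": 7}

def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation into known vocab entries."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining piece first, shrink until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i:]!r}")
    return ids

print(toy_encode("The agent opened a pull request for"))
# [0, 1, 2, 3, 4, 5, 6, 7] -- note: "opened" splits as " open" + "ed"
```

The toy vocab happens to split the sentence into 8 pieces; tiktoken's learned merges produce the 7 IDs shown above. The key point is the same: the model only ever sees integer IDs.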
15.2.2 Step 2: Word Embeddings
Token IDs enter the model via a lookup table:
Token IDs [7]
| Embedding lookup (vocab_size × d_model matrix)
Token Embeddings [7, 512]
Each token becomes a 512-dimensional vector carrying semantic information about that token. This is the first place where meaning lives as geometry.
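The lookup itself is just row indexing. A minimal numpy sketch, with a randomly initialized table standing in for the learned embedding matrix (vocab shrunk from 100,256 to 10,000 to keep the array small, still large enough to cover the example IDs):

```python
import numpy as np

# Embedding lookup: token IDs index rows of a (vocab_size, d_model) table.
# vocab_size is shrunk here to keep memory low; weights are random
# stand-ins for the learned matrix.
vocab_size, d_model = 10_000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(0, 0.02, size=(vocab_size, d_model))

token_ids = np.array([791, 8479, 9107, 264, 6958, 1715, 369])  # [7]
token_embeddings = embedding_table[token_ids]                  # [7, 512]
print(token_embeddings.shape)  # (7, 512)
```

No matrix multiply happens here at all: an ID is literally a row number.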
15.2.3 Step 3: Positional Encoding
Attention has no built-in sense of order. Without position information, the model cannot distinguish "the agent tagged the reviewer" from "the reviewer tagged the agent." Position encoding fixes that:
Token Embeddings [7, 512]
| + Positional Encoding [7, 512]
Input Vectors [7, 512]
Two common strategies:
- Original Transformer: fixed sinusoidal functions (no training required)
- GPT series: learned positional embeddings (trained end-to-end)
Either way, every position gets a unique encoding, and nearby positions get similar encodings.
Output: each token is now a 512-dimensional vector that simultaneously encodes what it is and where it sits.
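The fixed sinusoidal variant is easy to write down. A sketch of the original Transformer's encoding (GPT-2 instead learns a max_len × d_model table end-to-end; either way the result is added elementwise to the token embeddings):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed sinusoidal positional encoding from the original Transformer.
    Even dims get sin, odd dims get cos, at geometrically spaced frequencies."""
    pos = np.arange(seq_len)[:, None]         # [seq, 1] position index
    dim = np.arange(0, d_model, 2)[None, :]   # [1, d_model/2] frequency index
    angles = pos / (10000 ** (dim / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_pe(7, 512)
print(pe.shape)  # (7, 512) -- same shape as the token embeddings it joins
```

Because the frequencies vary smoothly, neighboring positions produce similar vectors while distant positions diverge, which is exactly the property described above.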
15.3 Step 4: Inside a Transformer Block
15.3.1 Block Structure
Each Transformer Block has two sub-layers:
Input X [7, 512]
|
+-------------------------------+
| LayerNorm |
| | |
| Multi-Head Attention | <- understands relationships between tokens
| | |
| Dropout -> + X (residual) |
+-------------------------------+
|
+-------------------------------+
| LayerNorm |
| | |
| Feed Forward Network | <- feature transformation
| | |
| Dropout -> + X (residual) |
+-------------------------------+
|
Output [7, 512]
The critical property: input is [7, 512], output is still [7, 512]. The dimension does not change through any block. Only the final projection breaks that invariant.
15.3.2 Multi-Head Attention in Detail
Attention is the core of the Transformer. Let me break it down step by step.
Step 4.1: Generate Q, K, V
Input X [7, 512]
| × Wq, Wk, Wv (three weight matrices)
Q, K, V each [7, 512]
| split across heads
Each head: Q, K, V each [7, 64] (assuming 8 heads)
Step 4.2: Compute Attention Scores
The dot product of Q and K measures similarity — how much token i should attend to token j. In practice the scores are also divided by √d_k so they stay in a healthy range for Softmax.
Step 4.3: Visualize the Attention Matrix
The raw Q × K^T result is a 7×7 matrix. Each cell is the similarity score between one pair of token positions; in a heatmap of this matrix, darker cells mean higher similarity.
Step 4.4: Apply Causal Mask
The lower-triangular mask sets the upper triangle to -inf. After Softmax, -inf becomes 0. This is the Causal Mask — each position can only see tokens that come before it (or itself). This is what makes the model safe to train with teacher forcing and honest at inference time.
Step 4.5: Softmax and Weighted Sum
Attention Weights = Softmax(Masked Scores)
Output = Attention Weights × V
Each position's output is a weighted average of all V vectors, where the weights are the attention scores. The model learned what to pay attention to during training.
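Steps 4.2 through 4.5 fit in one short function. A single-head numpy sketch with random Q, K, V standing in for the projected inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: [seq, d_head]. A sketch of steps 4.2-4.5 above."""
    seq, d_head = Q.shape
    scores = Q @ K.T / np.sqrt(d_head)             # [seq, seq] similarities
    mask = np.triu(np.ones((seq, seq)), k=1)       # upper triangle = future
    scores = np.where(mask == 1, -np.inf, scores)  # future positions -> -inf
    weights = softmax(scores, axis=-1)             # -inf -> 0, rows sum to 1
    return weights @ V, weights                    # weighted average of V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 64)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(out.shape)  # (7, 64)
print(w[0])       # position 0 can only attend to itself: [1, 0, 0, ...]
```

Multi-head attention runs 8 copies of this in parallel on [7, 64] slices, then concatenates the outputs back to [7, 512] and applies one more linear projection.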
15.3.3 Feed Forward Network
The FFN is a simple two-layer network that operates independently on each token position:
Input [7, 512]
| Linear: 512 → 2048 (4× expansion)
| ReLU activation
| Linear: 2048 → 512 (back to model width)
Output [7, 512]
The FFN stores the majority of the model's "factual knowledge." It accounts for nearly half of all parameters — more on that in the parameter count section below.
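The FFN is small enough to write out in full. A numpy sketch with random stand-in weights, using the chapter's 512 → 2048 → 512 dimensions:

```python
import numpy as np

# Position-wise FFN: the same two-layer MLP applied to every token
# independently. Weights are random stand-ins for learned parameters.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """x: [seq, d_model] -> [seq, d_model]; expand, ReLU, project back."""
    h = np.maximum(0, x @ W1 + b1)   # [seq, 2048], ReLU activation
    return h @ W2 + b2               # [seq, 512], back to model width

x = rng.normal(size=(7, d_model))
print(ffn(x).shape)  # (7, 512)
```

Note there is no mixing across positions here — each of the 7 tokens passes through the same weights independently. That is why the FFN is called "position-wise."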
15.4 Steps 5-6: Residual Connections and LayerNorm
15.4.1 Why Residual Connections?
Every sub-layer wraps its computation in a residual connection:
output = x + sublayer(x) # not: output = sublayer(x)
Benefits:
- Gradients can flow directly backward through the identity path, avoiding vanishing
- The network can stack many layers without training instability
- If a sub-layer learns nothing useful, the original signal passes through unchanged
This is the bypass lane from Chapter 13. Without it, 48-layer GPT-2 would not converge.
15.4.2 LayerNorm Placement
GPT-2 uses Pre-Norm:
# Pre-Norm (GPT-2, LLaMA, most modern models)
output = x + attention(layernorm(x))
# Post-Norm (original Transformer, 2017)
output = layernorm(x + attention(x))
Pre-Norm normalizes the input before the sublayer, not the combined output. This stabilizes training, especially in the early steps when parameter scales are unpredictable.
15.5 Step 7: Stacking Multiple Blocks
15.5.1 N Repetitions
GPT-2 stacks 12 to 48 blocks depending on model size:
Block 1 [7, 512] → Block 2 [7, 512] → ... → Block 12 [7, 512]
Each block:
- preserves the shape [seq, d_model]
- refines the representation with another round of Attention + FFN
- builds increasingly abstract features as depth increases
Early blocks tend to handle syntax and local patterns. Later blocks handle longer-range dependencies and more abstract semantics. This is not a design decision — it emerged from training.
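The stacking pattern and the shape invariant can be demonstrated without real attention at all. In this sketch, a toy tanh sublayer stands in for both Attention and FFN (any [seq, d_model] → [seq, d_model] map works), so the focus is purely on the Pre-Norm residual structure repeated N times:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq = 512, 12, 7

def layernorm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance.
    Learned gain/bias omitted for brevity."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def toy_sublayer(x, W):
    """Stand-in for Attention or FFN: any shape-preserving map works here."""
    return np.tanh(x @ W)

# Each block is Pre-Norm: x = x + sublayer(layernorm(x)), applied twice.
Ws = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(2 * n_layers)]
x = rng.normal(size=(seq, d_model))
for layer in range(n_layers):
    x = x + toy_sublayer(layernorm(x), Ws[2 * layer])      # "attention" slot
    x = x + toy_sublayer(layernorm(x), Ws[2 * layer + 1])  # "FFN" slot
print(x.shape)  # (7, 512) -- shape preserved through all 12 blocks
```

The residual `x + ...` form is what lets 12 (or 48) of these stack without the signal or gradient dying out.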
15.5.2 Where All the Parameters Live
| Component | Parameter formula | Example (d_model=512, vocab=100,256) |
|---|---|---|
| Word Embedding | vocab × d_model | ~51M |
| Attention (×12) | 4 × d_model² × 12 | ~12.6M |
| FFN (×12) | 8 × d_model² × 12 | ~25.2M |
| Output Linear | d_model × vocab | ~51M |
15.6 Step 8: Output Mapping
15.6.1 Final LayerNorm
After all blocks, one more LayerNorm before projection:
Block 12 output [7, 512]
| LayerNorm
Normalized output [7, 512]
15.6.2 Linear Layer: Projecting to Vocabulary
The key step: map the 512-dimensional hidden vector to a 100,256-dimensional logit vector.
Input [batch, seq, d_model] = [4, 7, 512]
| @ Wp [d_model, vocab_size]
Output [batch, seq, vocab_size] = [4, 7, 100256]
15.6.3 What the Wp Matrix Means
Think of Wp as: every token in the vocabulary has a d_model-dimensional signature vector. The output logit for token i is the dot product between the current hidden state and token i's signature — a similarity score.
High dot product → high logit → model thinks this token is a likely next token.
15.6.4 Softmax to Probabilities
logits [7, 100256]
| Softmax (over last dimension)
probs [7, 100256]
Now every position has a probability distribution over the vocabulary:
- all probabilities sum to 1
- the highest probability token is the model's prediction
- the full distribution is what sampling strategies use
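The projection, Softmax, and prediction fit in a few lines. To keep the arrays small this sketch shrinks the vocabulary from 100,256 to 1,000 and uses a random stand-in for the learned Wp:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Shrunken vocab (1,000 instead of 100,256) to keep the example light.
d_model, vocab_size, seq = 512, 1_000, 7
rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq, d_model))         # final LayerNorm output
Wp = rng.normal(0, 0.02, (d_model, vocab_size))  # stand-in projection matrix

logits = hidden @ Wp               # [7, 1000] one score per vocab token
probs = softmax(logits)            # [7, 1000] rows are distributions
next_id = int(probs[-1].argmax())  # predict from the LAST position only
print(probs.shape)  # (7, 1000)
```

Every position gets a full distribution, but for next-token generation only the last position's row is used.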
15.7 Full Shape Tracking
15.7.1 From Input to Output
token_ids: [batch=4, seq=7]
Steps 1-2: Embedding: [4, 7, 512]
Step 3: + Position: [4, 7, 512]
Steps 4-6: Blocks 1-12: [4, 7, 512] <- shape never changes!
Step 7: Final LayerNorm:[4, 7, 512]
Step 8: Linear: [4, 7, 100256]
Softmax: [4, 7, 100256] <- now probabilities
Take last position: [4, 100256]
argmax: [4] <- predicted token ID per sequence
The dimension is stable at d_model throughout every block. It only explodes to vocab_size at the very end.
15.7.2 Key Dimension Parameters
| Parameter | Meaning | GPT-2 Small | GPT-2 Large |
|---|---|---|---|
| d_model | model width | 768 | 1280 |
| n_layers | block count | 12 | 36 |
| n_heads | attention heads | 12 | 20 |
| d_ff | FFN hidden dim | 3072 | 5120 |
| vocab_size | vocabulary size | 50,257 | 50,257 |
15.8 Parameter Count
15.8.1 Per-Component Breakdown
Using GPT-2 Small as the example (d_model=768, n_layers=12, vocab_size=50,257):
| Component | Formula | Parameters |
|---|---|---|
| Token Embedding | vocab × d_model | ~38.6M |
| Position Embedding | max_len × d_model | ~0.8M |
| Attention (×12) | 4 × d_model² × 12 | ~28.3M |
| FFN (×12) | 2 × d_model × d_ff × 12 | ~56.6M |
| LayerNorm (×25) | 2 × d_model × 25 | ~0.04M |
| Output Projection | shared with Token Embedding | 0* |
*Output projection usually shares weights with the token embedding table (weight tying). This is not an optimization — it is a modeling choice that says "the same geometry that encodes token meaning should also be used to score token likelihood."
Total: approximately 124 million parameters.
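The table above can be recomputed directly from the formulas. This counts weights only (per-layer biases are omitted), so it lands near, not exactly on, the published 124M figure:

```python
# Recomputing the GPT-2 Small breakdown above (weights only, biases
# omitted, so the total approximates the published 124M).
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, max_len = 50_257, 1024

token_emb = vocab_size * d_model              # 38,597,376
pos_emb = max_len * d_model                   # 786,432
attention = 4 * d_model**2 * n_layers         # Wq, Wk, Wv, Wo per block
ffn = 2 * d_model * d_ff * n_layers           # two linear layers per block
layernorm = 2 * d_model * (2 * n_layers + 1)  # gain+bias, 25 LayerNorms
output_proj = 0                               # tied with token embedding

total = token_emb + pos_emb + attention + ffn + layernorm + output_proj
print(f"{total:,}")  # 124,356,864 -- approximately 124M
```

Changing d_model, n_layers, or vocab_size in this script is an easy way to sanity-check the scaling of other model sizes.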
15.8.2 Parameter Distribution
Embedding: ~31% ||||||||
Attention: ~23% ||||||
FFN: ~46% ||||||||||||
LayerNorm: <1%
FFN holds nearly half the parameters. This is why people say FFN layers store the model's knowledge — there is simply more room there.
15.9 Backpropagation During Training
15.9.1 Loss Function
During training, we know the target (the actual next token), so we can compute cross-entropy loss:
Loss = CrossEntropy(predicted_probs, target_token)
15.9.2 Gradient Flow
The loss propagates backward through every component:
Loss
|
Output Projection (Wp) <- update
|
LayerNorm <- update
|
Block 12 (Attention, FFN) <- update
|
...
|
Block 1 <- update
|
Embeddings <- update
Residual connections are critical here. They provide gradient highways that bypass each sub-layer, preventing the vanishing gradients that would otherwise stall training at depth.
15.10 Chapter Summary
15.10.1 Eight-Step Forward Pass
| Step | Operation | Shape transition |
|---|---|---|
| 1 | Tokenization | text → token IDs |
| 2 | Embedding | IDs → vectors [seq, d_model] |
| 3 | + Position | add positional signal |
| 4 | Attention | capture token relationships |
| 5 | Residual + Norm | stabilize training |
| 6 | FFN | feature transformation |
| 7 | × N Blocks | repeat steps 4-6 |
| 8 | Linear + Softmax | output probability distribution |
15.10.2 Parameter Distribution
| Component | Share of parameters | Role |
|---|---|---|
| Embedding | ~30% | semantic token representations |
| Attention | ~25% | capturing inter-token relationships |
| FFN | ~45% | knowledge storage, feature transformation |
15.10.3 Core Insight
The Transformer forward pass is an elegant pipeline. Tokens become vectors, pass through N layers of Attention (context understanding) plus FFN (feature extraction), then project to a vocabulary-sized probability distribution. The shape stays fixed at d_model through every block — only the final projection breaks the invariant.
Chapter Checklist
After this chapter you should be able to:
- Describe all eight steps of the Transformer forward pass.
- Track tensor shapes from token IDs through to logits.
- Explain what the causal mask does and why it is necessary.
- Explain why FFN accounts for nearly half of all parameters.
- Estimate per-component parameter counts given d_model, n_layers, and vocab_size.
Code Implementation
The complete forward pass described here is implemented step by step in Part 5 (Chapters 18-20):
- Chapter 18: model.py — model definition
- Chapter 19: train.py — training loop
- Chapter 20: inference.py — inference logic
See You in the Next Chapter
That is the complete forward pass. If you can trace a tensor from input text to output probabilities without looking at the diagram, you are ready for Chapter 16.
Chapter 16 compares training and inference — the same forward pass, but operating in two very different modes. Understanding that distinction is where a lot of production confusion lives.