One-sentence summary: The Transformer forward pass is a pipeline — text → tokens → embeddings + position → N blocks (Attention + FFN) → linear projection to vocabulary → Softmax probabilities → predicted next token. Understand this pipeline and you understand how GPT "thinks."
15.1 The Big Picture: Decoder-Only Architecture
15.1.1 GPT, LLaMA, Claude Are All Decoder-Only
Modern language models — GPT, LLaMA, Claude — all use a Decoder-Only architecture. Unlike the original Transformer's Encoder-Decoder design, Decoder-Only keeps only the decoder stack and focuses exclusively on autoregressive generation.
15.1.2 GPT-2 vs GPT-1 Architecture Comparison
These two models share the Decoder-Only skeleton. The main difference is LayerNorm placement:
| Aspect | GPT-1 (Post-Norm) | GPT-2 (Pre-Norm) |
|---|---|---|
| LayerNorm position | After Attention/FFN | Before Attention/FFN |
| Training stability | Less stable | More stable |
| Modern models | — | LLaMA, GPT-3, and nearly everything since |
Pre-Norm is now the default. We'll use GPT-2 as our reference and trace the data flow from bottom to top.
15.1.3 Complete Pipeline Overview
Input text: "The agent opened a pull request for"
|
Step 1: Tokenization
|
Step 2: Word Embeddings
|
Step 3: Positional Encoding
|
Steps 4-6: N × Transformer Block (GPT-2 Pre-Norm style)
(LayerNorm → Attention → Residual → LayerNorm → FFN → Residual)
|
Step 7: Final Layer Norm
|
Step 8: Linear → Softmax → Output Probability
|
Predicted next token: "review" (highest probability)
15.2 Steps 1-3: Input Processing
15.2.1 Step 1: Tokenization
Text becomes numbers. Using tiktoken with cl100k_base:
Input: "The agent opened a pull request for"
Token IDs: [791, 8479, 9107, 264, 6958, 1715, 369]
Length: 7 tokens
Each word or subword piece maps to a unique integer ID. The tokenizer does not see characters — it sees learned vocabulary entries.
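The vocabulary-lookup idea can be sketched in a few lines. This is a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary, not tiktoken's actual BPE (which learns ~100k subword entries and segments text differently), but it shows how text becomes a list of integer IDs:

```python
# Toy illustration of tokenization as vocabulary lookup.
# Real tokenizers (e.g. tiktoken's cl100k_base BPE) learn ~100k subword
# entries; this hypothetical 8-entry vocab just shows the mapping idea.
TOY_VOCAB = {"The": 0, " agent": 1, " open": 2, "ed": 3, " a": 4,
             " pull": 5, " request": 6, " for": 7}

def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation into known vocab entries."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining piece first, shrink until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i:]!r}")
    return ids

print(toy_encode("The agent opened a pull request for"))
# [0, 1, 2, 3, 4, 5, 6, 7] -- note: "opened" splits as " open" + "ed"
```

The toy vocab happens to split the sentence into 8 pieces; tiktoken's learned merges produce the 7 IDs shown above. The key point is the same: the model only ever sees integer IDs.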
15.2.2 Step 2: Word Embeddings
Token IDs enter the model via a lookup table:
Token IDs [7]
| Embedding lookup (vocab_size × d_model matrix)
Token Embeddings [7, 512]
Each token becomes a 512-dimensional vector carrying semantic information about that token. This is the first place where meaning lives as geometry.
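The lookup itself is just row indexing. A minimal numpy sketch, with a randomly initialized table standing in for the learned embedding matrix (vocab shrunk from 100,256 to 10,000 to keep the array small, still large enough to cover the example IDs):

```python
import numpy as np

# Embedding lookup: token IDs index rows of a (vocab_size, d_model) table.
# vocab_size is shrunk here to keep memory low; weights are random
# stand-ins for the learned matrix.
vocab_size, d_model = 10_000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(0, 0.02, size=(vocab_size, d_model))

token_ids = np.array([791, 8479, 9107, 264, 6958, 1715, 369])  # [7]
token_embeddings = embedding_table[token_ids]                  # [7, 512]
print(token_embeddings.shape)  # (7, 512)
```

No matrix multiply happens here at all: an ID is literally a row number.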
15.2.3 Step 3: Positional Encoding
Attention has no built-in sense of order. Without position information, the model cannot distinguish "the agent tagged the reviewer" from "the reviewer tagged the agent." Position encoding fixes that:
Token Embeddings [7, 512]
| + Positional Encoding [7, 512]
Input Vectors [7, 512]
Two common strategies:
- Original Transformer: fixed sinusoidal functions (no training required)
- GPT series: learned positional embeddings (trained end-to-end)
Either way, every position gets a unique encoding, and nearby positions get similar encodings.
Output: each token is now a 512-dimensional vector that simultaneously encodes what it is and where it sits.
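The fixed sinusoidal variant is easy to write down. A sketch of the original Transformer's encoding (GPT-2 instead learns a max_len × d_model table end-to-end; either way the result is added elementwise to the token embeddings):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed sinusoidal positional encoding from the original Transformer.
    Even dims get sin, odd dims get cos, at geometrically spaced frequencies."""
    pos = np.arange(seq_len)[:, None]         # [seq, 1] position index
    dim = np.arange(0, d_model, 2)[None, :]   # [1, d_model/2] frequency index
    angles = pos / (10000 ** (dim / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_pe(7, 512)
print(pe.shape)  # (7, 512) -- same shape as the token embeddings it joins
```

Because the frequencies vary smoothly, neighboring positions produce similar vectors while distant positions diverge, which is exactly the property described above.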
15.3 Step 4: Inside a Transformer Block
15.3.1 Block Structure
Each Transformer Block has two sub-layers:
Input X [7, 512]
|
+-------------------------------+
| LayerNorm |
| | |
| Multi-Head Attention | <- understands relationships between tokens
| | |
| Dropout -> + X (residual) |
+-------------------------------+
|
+-------------------------------+
| LayerNorm |
| | |
| Feed Forward Network | <- feature transformation
| | |
| Dropout -> + X (residual) |
+-------------------------------+
|
Output [7, 512]
The critical property: input is [7, 512], output is still [7, 512]. The dimension does not change through any block. Only the final projection breaks that invariant.
15.3.2 Multi-Head Attention in Detail
Attention is the core of the Transformer. Let me break it down step by step.
Step 4.1: Generate Q, K, V
Input X [7, 512]
| × Wq, Wk, Wv (three weight matrices)
Q, K, V each [7, 512]
| split across heads
Each head: Q, K, V each [7, 64] (assuming 8 heads)
Step 4.2: Compute Attention Scores
The dot product of Q and K measures similarity — how much token i should attend to token j. In practice the scores are also divided by √d_k so they stay in a healthy range for Softmax.
Step 4.3: Visualize the Attention Matrix
The raw Q × K^T result is a 7×7 matrix. Each cell is the similarity score between one pair of token positions; in a heatmap of this matrix, darker cells mean higher similarity.
Step 4.4: Apply Causal Mask
The lower-triangular mask sets the upper triangle to -inf. After Softmax, -inf becomes 0. This is the Causal Mask — each position can only see tokens that come before it (or itself). This is what makes the model safe to train with teacher forcing and honest at inference time.
Step 4.5: Softmax and Weighted Sum
Attention Weights = Softmax(Masked Scores)
Output = Attention Weights × V
Each position's output is a weighted average of all V vectors, where the weights are the attention scores. The model learned what to pay attention to during training.
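Steps 4.2 through 4.5 fit in one short function. A single-head numpy sketch with random Q, K, V standing in for the projected inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: [seq, d_head]. A sketch of steps 4.2-4.5 above."""
    seq, d_head = Q.shape
    scores = Q @ K.T / np.sqrt(d_head)             # [seq, seq] similarities
    mask = np.triu(np.ones((seq, seq)), k=1)       # upper triangle = future
    scores = np.where(mask == 1, -np.inf, scores)  # future positions -> -inf
    weights = softmax(scores, axis=-1)             # -inf -> 0, rows sum to 1
    return weights @ V, weights                    # weighted average of V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 64)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(out.shape)  # (7, 64)
print(w[0])       # position 0 can only attend to itself: [1, 0, 0, ...]
```

Multi-head attention runs 8 copies of this in parallel on [7, 64] slices, then concatenates the outputs back to [7, 512] and applies one more linear projection.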
15.3.3 Feed Forward Network
The FFN is a simple two-layer network that operates independently on each token position:
Input [7, 512]
| Linear: 512 → 2048 (4× expansion)
| ReLU activation
| Linear: 2048 → 512 (back to model width)
Output [7, 512]
The FFN stores the majority of the model's "factual knowledge." It accounts for nearly half of all parameters — more on that in the parameter count section below.
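The FFN is small enough to write out in full. A numpy sketch with random stand-in weights, using the chapter's 512 → 2048 → 512 dimensions:

```python
import numpy as np

# Position-wise FFN: the same two-layer MLP applied to every token
# independently. Weights are random stand-ins for learned parameters.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """x: [seq, d_model] -> [seq, d_model]; expand, ReLU, project back."""
    h = np.maximum(0, x @ W1 + b1)   # [seq, 2048], ReLU activation
    return h @ W2 + b2               # [seq, 512], back to model width

x = rng.normal(size=(7, d_model))
print(ffn(x).shape)  # (7, 512)
```

Note there is no mixing across positions here — each of the 7 tokens passes through the same weights independently. That is why the FFN is called "position-wise."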
15.4 Steps 5-6: Residual Connections and LayerNorm
15.4.1 Why Residual Connections?
Every sub-layer wraps its computation in a residual connection:
output = x + sublayer(x) # not: output = sublayer(x)
Benefits:
- Gradients can flow directly backward through the identity path, avoiding vanishing
- The network can stack many layers without training instability
- If a sub-layer learns nothing useful, the original signal passes through unchanged
This is the bypass lane from Chapter 13. Without it, 48-layer GPT-2 would not converge.
15.4.2 LayerNorm Placement
GPT-2 uses Pre-Norm:
# Pre-Norm (GPT-2, LLaMA, most modern models)
output = x + attention(layernorm(x))
# Post-Norm (original Transformer, 2017)
output = layernorm(x + attention(x))
Pre-Norm normalizes the input before the sublayer, not the combined output. This stabilizes training, especially in the early steps when parameter scales are unpredictable.
15.5 Step 7: Stacking Multiple Blocks
15.5.1 N Repetitions
GPT-2 stacks 12 to 48 blocks depending on model size:
Block 1 [7, 512] → Block 2 [7, 512] → ... → Block 12 [7, 512]
Each block:
- preserves the shape [seq, d_model]
- refines the representation with another round of Attention + FFN
- builds increasingly abstract features as depth increases
Early blocks tend to handle syntax and local patterns. Later blocks handle longer-range dependencies and more abstract semantics. This is not a design decision — it emerged from training.
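The stacking pattern and the shape invariant can be demonstrated without real attention at all. In this sketch, a toy tanh sublayer stands in for both Attention and FFN (any [seq, d_model] → [seq, d_model] map works), so the focus is purely on the Pre-Norm residual structure repeated N times:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq = 512, 12, 7

def layernorm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance.
    Learned gain/bias omitted for brevity."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def toy_sublayer(x, W):
    """Stand-in for Attention or FFN: any shape-preserving map works here."""
    return np.tanh(x @ W)

# Each block is Pre-Norm: x = x + sublayer(layernorm(x)), applied twice.
Ws = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(2 * n_layers)]
x = rng.normal(size=(seq, d_model))
for layer in range(n_layers):
    x = x + toy_sublayer(layernorm(x), Ws[2 * layer])      # "attention" slot
    x = x + toy_sublayer(layernorm(x), Ws[2 * layer + 1])  # "FFN" slot
print(x.shape)  # (7, 512) -- shape preserved through all 12 blocks
```

The residual `x + ...` form is what lets 12 (or 48) of these stack without the signal or gradient dying out.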
15.5.2 Where All the Parameters Live
| Component | Parameter formula | Example (d_model=512, vocab=100,256) |
|---|---|---|
| Word Embedding | vocab × d_model | ~51M |
| Attention (×12) | 4 × d_model² × 12 | ~12.6M |
| FFN (×12) | 8 × d_model² × 12 | ~25.2M |
| Output Linear | d_model × vocab | ~51M |
15.6 Step 8: Output Mapping
15.6.1 Final LayerNorm
After all blocks, one more LayerNorm before projection:
Block 12 output [7, 512]
| LayerNorm
Normalized output [7, 512]
15.6.2 Linear Layer: Projecting to Vocabulary
The key step: map the 512-dimensional hidden vector to a 100,256-dimensional logit vector.
Input [batch, seq, d_model] = [4, 7, 512]
| @ Wp [d_model, vocab_size]
Output [batch, seq, vocab_size] = [4, 7, 100256]
15.6.3 What the Wp Matrix Means
Think of Wp as: every token in the vocabulary has a d_model-dimensional signature vector. The output logit for token i is the dot product between the current hidden state and token i's signature — a similarity score.
High dot product → high logit → model thinks this token is a likely next token.
15.6.4 Softmax to Probabilities
logits [7, 100256]
| Softmax (over last dimension)
probs [7, 100256]
Now every position has a probability distribution over the vocabulary:
- all probabilities sum to 1
- the highest probability token is the model's prediction
- the full distribution is what sampling strategies use
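The projection, Softmax, and prediction fit in a few lines. To keep the arrays small this sketch shrinks the vocabulary from 100,256 to 1,000 and uses a random stand-in for the learned Wp:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Shrunken vocab (1,000 instead of 100,256) to keep the example light.
d_model, vocab_size, seq = 512, 1_000, 7
rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq, d_model))         # final LayerNorm output
Wp = rng.normal(0, 0.02, (d_model, vocab_size))  # stand-in projection matrix

logits = hidden @ Wp               # [7, 1000] one score per vocab token
probs = softmax(logits)            # [7, 1000] rows are distributions
next_id = int(probs[-1].argmax())  # predict from the LAST position only
print(probs.shape)  # (7, 1000)
```

Every position gets a full distribution, but for next-token generation only the last position's row is used.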
15.7 Full Shape Tracking
15.7.1 From Input to Output
token_ids: [batch=4, seq=7]
Steps 1-2: Embedding: [4, 7, 512]
Step 3: + Position: [4, 7, 512]
Steps 4-6: Blocks 1-12: [4, 7, 512] <- shape never changes!
Step 7: Final LayerNorm:[4, 7, 512]
Step 8: Linear: [4, 7, 100256]
Softmax: [4, 7, 100256] <- now probabilities
Take last position: [4, 100256]
argmax: [4] <- predicted token ID per sequence
The dimension is stable at d_model throughout every block. It only explodes to vocab_size at the very end.
15.7.2 Key Dimension Parameters
| Parameter | Meaning | GPT-2 Small | GPT-2 Large |
|---|---|---|---|
| d_model | model width | 768 | 1280 |
| n_layers | block count | 12 | 36 |
| n_heads | attention heads | 12 | 20 |
| d_ff | FFN hidden dim | 3072 | 5120 |
| vocab_size | vocabulary size | 50,257 | 50,257 |
15.8 Parameter Count
15.8.1 Per-Component Breakdown
Using GPT-2 Small as the example (d_model=768, n_layers=12, vocab_size=50,257):
| Component | Formula | Parameters |
|---|---|---|
| Token Embedding | vocab × d_model | ~38.6M |
| Position Embedding | max_len × d_model | ~0.8M |
| Attention (×12) | 4 × d_model² × 12 | ~28.3M |
| FFN (×12) | 2 × d_model × d_ff × 12 | ~56.6M |
| LayerNorm (×25) | 2 × d_model × 25 | ~0.04M |
| Output Projection | shared with Token Embedding | 0* |
*Output projection usually shares weights with the token embedding table (weight tying). This is not an optimization — it is a modeling choice that says "the same geometry that encodes token meaning should also be used to score token likelihood."
Total: approximately 124 million parameters.
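The table above can be recomputed directly from the formulas. This counts weights only (per-layer biases are omitted), so it lands near, not exactly on, the published 124M figure:

```python
# Recomputing the GPT-2 Small breakdown above (weights only, biases
# omitted, so the total approximates the published 124M).
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, max_len = 50_257, 1024

token_emb = vocab_size * d_model              # 38,597,376
pos_emb = max_len * d_model                   # 786,432
attention = 4 * d_model**2 * n_layers         # Wq, Wk, Wv, Wo per block
ffn = 2 * d_model * d_ff * n_layers           # two linear layers per block
layernorm = 2 * d_model * (2 * n_layers + 1)  # gain+bias, 25 LayerNorms
output_proj = 0                               # tied with token embedding

total = token_emb + pos_emb + attention + ffn + layernorm + output_proj
print(f"{total:,}")  # 124,356,864 -- approximately 124M
```

Changing d_model, n_layers, or vocab_size in this script is an easy way to sanity-check the scaling of other model sizes.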
15.8.2 Parameter Distribution
Embedding: ~31% ||||||||
Attention: ~23% ||||||
FFN: ~46% ||||||||||||
LayerNorm: <1%
FFN holds nearly half the parameters. This is why people say FFN layers store the model's knowledge — there is simply more room there.
15.9 Backpropagation During Training
15.9.1 Loss Function
During training, we know the target (the actual next token), so we can compute cross-entropy loss:
Loss = CrossEntropy(predicted_probs, target_token)
15.9.2 Gradient Flow
The loss propagates backward through every component:
Loss
|
Output Projection (Wp) <- update
|
LayerNorm <- update
|
Block 12 (Attention, FFN) <- update
|
...
|
Block 1 <- update
|
Embeddings <- update
Residual connections are critical here. They provide gradient highways that bypass each sub-layer, preventing the vanishing gradients that would otherwise stall training at depth.
15.10 Chapter Summary
15.10.1 Eight-Step Forward Pass
| Step | Operation | Shape transition |
|---|---|---|
| 1 | Tokenization | text → token IDs |
| 2 | Embedding | IDs → vectors [seq, d_model] |
| 3 | + Position | add positional signal |
| 4 | Attention | capture token relationships |
| 5 | Residual + Norm | stabilize training |
| 6 | FFN | feature transformation |
| 7 | × N Blocks | repeat steps 4-6 |
| 8 | Linear + Softmax | output probability distribution |
15.10.2 Parameter Distribution
| Component | Share of parameters | Role |
|---|---|---|
| Embedding | ~30% | semantic token representations |
| Attention | ~25% | capturing inter-token relationships |
| FFN | ~45% | knowledge storage, feature transformation |
15.10.3 Core Insight
The Transformer forward pass is an elegant pipeline. Tokens become vectors, pass through N layers of Attention (context understanding) plus FFN (feature extraction), then project to a vocabulary-sized probability distribution. The shape stays fixed at d_model through every block — only the final projection breaks the invariant.
Chapter Checklist
After this chapter you should be able to:
- Describe all eight steps of the Transformer forward pass.
- Track tensor shapes from token IDs through to logits.
- Explain what the causal mask does and why it is necessary.
- Explain why FFN accounts for nearly half of all parameters.
- Estimate per-component parameter counts given d_model, n_layers, and vocab_size.
Code Implementation
The complete forward pass described here is implemented step by step in Part 5 (Chapters 18-20):
- Chapter 18: model.py — model definition
- Chapter 19: train.py — training loop
- Chapter 20: inference.py — inference logic
See You in the Next Chapter
That is the complete forward pass. If you can trace a tensor from input text to output probabilities without looking at the diagram, you are ready for Chapter 16.
Chapter 16 compares training and inference — the same forward pass, but operating in two very different modes. Understanding that distinction is where a lot of production confusion lives.