One-sentence summary: Residual connections give information a bypass lane directly to later layers, mitigating vanishing gradients in deep networks; Dropout randomly zeroes activations during training to prevent overfitting. Together, these two techniques are central to stable Transformer training.
13.1 Revisiting the Transformer Block
Before diving into residual connections and Dropout, let us look at the complete Transformer block structure:
Input X
↓
Layer Norm
↓
Masked Multi-Head Attention
↓
Dropout
↓
Residual connection (+ X) ← first residual
↓
Layer Norm
↓
Feed Forward Network (FFN)
↓
Dropout
↓
Residual connection (+ previous output) ← second residual
↓
Output
Each block has two residual connections and two Dropout layers. This chapter explains what they do and why they exist.
13.2 Residual Connections: Information's Bypass Lane
13.2.1 The Problem with Deep Networks
As neural networks get deeper, a serious problem emerges: vanishing gradients.
Imagine information flowing from layer 1 through to layer 12:
Layer 1 → Layer 2 → Layer 3 → ... → Layer 12
Each layer processes the signal. After 12 layers:
- The original signal may be severely distorted
- Gradients shrink at each layer during backpropagation
- Layers near the input receive near-zero gradient updates and essentially learn nothing
This was a well-known problem in deep learning before residual connections.
13.2.2 The Fix: A Bypass Lane
The residual connection idea is simple: let the input skip the layer and be added directly to the output.
Input X ──────────────────────┐
↓ │ (bypass lane)
Sublayer │
↓ │
Output ←────────────────────┘ + X
The formula:
output = sublayer(X) + X
Instead of only outputting sublayer(X), we add the original input back.
13.2.3 Numeric Example
Here is what the computation looks like with concrete example values (tensors of shape [4, 16, 512] = [batch, sequence, d_model]):
Attention output (after Dropout):
[4, 16, 512] tensor
First values: -0.07005, 0.09600, 0.03522, ...
Original input X:
[4, 16, 512] tensor
First values: 0.50748, -1.96800, 5.14941, ...
After residual connection:
output = Attention_output + X
= [-0.07005 + 0.50748, 0.09600 + (-1.96800), ...]
= [ 0.43743, -1.87200, ...]
It is element-wise addition. Nothing exotic.
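The addition above can be sketched in a few lines of plain Python. The three values per tensor are illustrative stand-ins for the 512-dimensional vectors, not outputs of a real model:

```python
# Element-wise residual addition, sketched with plain Python lists.
attn_out = [-0.07005, 0.09600, 0.03522]   # sublayer(X), after Dropout
x        = [0.50748, -1.96800, 5.14941]   # original input X

residual = [round(a + b, 5) for a, b in zip(attn_out, x)]
print(residual)  # [0.43743, -1.872, 5.18463]
```

In a real model the same addition runs over every element of the [4, 16, 512] tensor at once; the shape never changes.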
13.2.4 Why Residual Connections Work
1. Gradient flow
During backpropagation, the gradient through a residual connection is:
d(output)/dX = d(sublayer(X))/dX + 1
Even if d(sublayer(X))/dX is close to zero (the vanishing gradient problem), the +1 ensures the gradient still flows. The bypass lane carries the gradient directly.
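The +1 can be checked numerically. The "sublayer" below is a toy function with a deliberately tiny slope, standing in for a layer whose own gradient has almost vanished:

```python
# Numerical check: a residual connection adds 1 to the local gradient.
def sublayer(x):
    return 0.001 * x  # toy layer with a near-vanished gradient (~0.001)

def with_residual(x):
    return sublayer(x) + x  # output = sublayer(x) + x

# Central finite difference approximates d(output)/dx.
eps = 1e-6
x = 2.0
grad = (with_residual(x + eps) - with_residual(x - eps)) / (2 * eps)
print(grad)  # ~1.001: the +1 from the bypass keeps the gradient alive
```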
2. Identity mapping as a fallback
If a layer does not yet know what to learn, it can default to outputting near-zero:
sublayer(X) ≈ 0
output = 0 + X = X
This effectively makes the layer a no-op. The information passes through unchanged. This is much easier to achieve than learning a perfect identity transformation from scratch.
3. Information preservation
Original information always survives. No matter how many layers exist, the input signal is never completely overwritten — it is always being added back.
13.2.5 Where Residual Connections Sit in the Transformer
First residual connection: after Attention
X → LayerNorm → Attention → Dropout → (+X) → output1
Second residual connection: after FFN
output1 → LayerNorm → FFN → Dropout → (+output1) → output2
13.2.6 PyTorch Implementation
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # First residual connection
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)  # residual here
        # Second residual connection
        ffn_output = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_output)  # residual here
        return x
The two key lines are x = x + .... Everything else is the sublayer computation.
13.3 Dropout: Random Removal Prevents Overfitting
13.3.1 The Overfitting Problem
Neural networks have a tendency to overfit: they memorize training data rather than learning generalizable patterns.
Think of a model like a student who memorizes every practice problem without understanding the underlying concepts. The student scores 100% on practice tests, then fails on any unfamiliar exam question.
Overfitting is the model's version of that.
13.3.2 Dropout: Random Deactivation
Dropout's idea is surprising in its simplicity: randomly disable some neurons during training.
During each forward pass, Dropout creates a random binary mask. Some activations are zeroed out. Others pass through normally.
Normal: [0.5, 0.3, 0.8, 0.2, 0.6] ← all neurons active
Dropout: [0.5, 0.3, 0.0, 0.2, 0.6] ← 0.8 is zeroed out
Which neurons get dropped? Different ones each time, randomly.
13.3.3 The Intuition
Think of a software team:
Without Dropout: one engineer is brilliant and ends up doing everything. When that engineer leaves, the team collapses.
With Dropout: every day, some team members are randomly "on leave." Everyone must learn to cover multiple responsibilities. The team becomes more robust because no single person is a single point of failure.
In neural networks:
- Dropout forces each neuron to function without relying on any specific partner
- Each neuron learns to be useful independently
- The network becomes more resilient — more distributed representations
13.3.4 The Math
During training:
mask = random binary tensor, 1 with probability (1 - dropout_rate)
output = input * mask / (1 - dropout_rate)
Example with dropout_rate = 0.1 (dropping 10%):
input = [0.5, 0.3, 0.8, 0.2, 0.6]
mask = [1, 1, 0, 1, 1 ] # 0.8 gets dropped
output = [0.5, 0.3, 0.0, 0.2, 0.6] / 0.9
= [0.56, 0.33, 0.00, 0.22, 0.67]
During inference:
output = input # no dropout, pass everything through
13.3.5 Why the Rescaling?
The division by (1 - dropout_rate) keeps the expected value consistent between training and inference.
If 10% of neurons are dropped during training, the remaining 90% are rescaled by 1/0.9 ≈ 1.11. During inference, all 100% of neurons are active, no scaling needed. The expected output magnitude matches in both modes.
Without the rescaling, outputs would be systematically smaller during training than during inference, causing a distribution shift that degrades model quality.
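The training-time math above (mask, then rescale by 1/(1 - dropout_rate)) can be sketched in plain Python. This is a minimal illustration of "inverted dropout" using the standard random module, not PyTorch's implementation:

```python
import random

def dropout(values, rate, training=True):
    """Inverted dropout: during training, zero each value with probability
    `rate` and rescale survivors by 1/(1 - rate); at inference, pass through."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if random.random() >= rate else 0.0 for v in values]

# Inference mode is a no-op:
print(dropout([0.5, 0.3], 0.1, training=False))  # [0.5, 0.3]

# Expected-value check: averaged over many random masks, the mean output
# matches the input despite the random zeroing -- that is what the
# rescaling buys us.
random.seed(0)
n = 100_000
mean = sum(dropout([0.8], 0.1)[0] for _ in range(n)) / n
print(round(mean, 2))  # close to 0.8
```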
13.3.6 Where Dropout Sits in the Transformer
- After Attention:
  Attention → Dropout → residual connection
- After FFN:
  FFN → Dropout → residual connection
Dropout always appears before the residual addition.
13.3.7 PyTorch Implementation
import torch.nn as nn
# Create Dropout layer
dropout = nn.Dropout(p=0.1) # drop 10% of activations
# Training mode (model.train() activates dropout)
output = dropout(input) # randomly drops activations
# Inference mode (model.eval() disables dropout)
output = dropout(input) # passes everything through unchanged
PyTorch handles the training/inference mode switch automatically. Call model.train() before training, model.eval() before inference.
13.4 Pre-Norm vs Post-Norm
13.4.1 Two Layouts
One subtle architectural choice is where LayerNorm sits relative to the residual connection.
Post-Norm (original Transformer, 2017):
X → Attention → Add(+X) → LayerNorm → FFN → Add → LayerNorm → output
LayerNorm comes after the residual addition.
Pre-Norm (GPT-2 and later):
X → LayerNorm → Attention → Add(+X) → LayerNorm → FFN → Add → output
LayerNorm comes before each sublayer (before Attention, before FFN).
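The two layouts can be written side by side as plain functions. Here `norm`, `attn`, and `ffn` are stand-ins for LayerNorm, attention, and the feed-forward network (the toy scalar versions at the bottom are only to show that the two orderings genuinely compute different things):

```python
# Post-Norm (original 2017 layout): Add first, then LayerNorm.
def post_norm_block(x, attn, ffn, norm):
    x = norm(x + attn(x))
    x = norm(x + ffn(x))
    return x

# Pre-Norm (GPT-2 layout): LayerNorm first, residual path stays untouched.
def pre_norm_block(x, attn, ffn, norm):
    x = x + attn(norm(x))
    x = x + ffn(norm(x))
    return x

# Toy scalar stand-ins for the sublayers and the normalization.
attn = lambda v: 0.5 * v
ffn = lambda v: 0.25 * v
norm = lambda v: v / 2

print(post_norm_block(1.0, attn, ffn, norm))  # 0.46875
print(pre_norm_block(1.0, attn, ffn, norm))   # 1.40625
```

Note how in the Pre-Norm version the raw `x` is added back without ever passing through `norm` — that is the "cleaner residual path" discussed below.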
13.4.2 Why Pre-Norm Is Now Standard
Research and practice have converged on Pre-Norm for modern LLMs:
- More stable gradients: normalizing the input before each sublayer prevents pathological activations from building up
- Cleaner residual path: the residual addition does not pass through a normalization step, so the bypass lane carries the raw signal
- Better convergence on deep stacks: especially important for models with 20+ blocks
GPT-2, GPT-3, LLaMA, and essentially all modern decoder-only LLMs use Pre-Norm.
13.5 How Residual Connections and Dropout Work Together
13.5.1 Tracing Data Through the Block
Here is a complete data flow trace for one Transformer block:
Input X [4, 16, 512]
↓
LayerNorm(X) # stabilize the input
↓
Attention(LayerNorm(X)) # compute context-aware updates
↓
Dropout(Attention(...)) # drop some updates (training only)
↓
X + Dropout(...) # residual: original signal + updates
↓
Output1 [4, 16, 512] # shape preserved
The second sub-block (FFN) follows the same pattern.
13.5.2 Why This Combination Works
| Technique | Problem Solved | Mechanism |
|---|---|---|
| Residual connection | Gradient vanishing | Direct bypass for gradient flow |
| Dropout | Overfitting | Random deactivation forces robustness |
| LayerNorm | Numerical instability | Normalizes activations to stable range |
Together:
- LayerNorm stabilizes the input before computation
- Attention or FFN learns the features
- Dropout adds regularization
- The residual connection ensures the original signal survives
Remove any one of these, and training becomes measurably harder — or fails entirely for deep stacks.
13.6 Dropout Rates in Practice
13.6.1 Common Configurations
| Model | Dropout rate | Notes |
|---|---|---|
| GPT-2 | 0.1 | Standard configuration |
| GPT-3 | 0.0 – 0.1 | Varies across experiments |
| BERT | 0.1 | Standard configuration |
| LLaMA | 0.0 | No Dropout used |
The trend is clear: larger models use less Dropout, and sometimes none at all.
Why? Large models with massive parameter counts trained on huge datasets are less prone to overfitting — the data diversity itself acts as regularization. Additionally, large-scale training runs are expensive enough that practitioners prefer not to risk degraded convergence from aggressive Dropout.
13.6.2 Residual Variants
The original Transformer uses plain addition. Some research has explored variations:
Scaled residual:
x = x + 0.1 * sublayer(x) # scale down the residual contribution
Gated residual:
gate = torch.sigmoid(linear(x))
x = x + gate * sublayer(x) # learn how much to trust the sublayer output
These can improve stability in certain settings, but the standard Transformer sticks with simple addition. Simpler tends to generalize better at scale.
13.7 Chapter Summary
13.7.1 Key Concepts
| Concept | Purpose | Formula / Effect |
|---|---|---|
| Residual connection | Prevent gradient vanishing, preserve information | output = sublayer(x) + x |
| Dropout | Prevent overfitting, force robustness | zero random activations during training |
| Pre-Norm | Stable training for deep stacks | LayerNorm before each sublayer |
13.7.2 Block Layout
Input X
↓
LayerNorm → Attention → Dropout → (+X) → output1
↑
residual here
output1
↓
LayerNorm → FFN → Dropout → (+output1) → output2
↑
residual here
13.7.3 Core Takeaway
Residual connections and Dropout are the engineering scaffolding that makes deep Transformer training practical. Residual connections give gradients a bypass route so early layers can learn; Dropout prevents the model from memorizing its training data. Neither is glamorous, but remove either one and training quality drops noticeably. The three pieces — residuals, Dropout, and LayerNorm — work together to keep deep stacks stable.
Chapter Checklist
After this chapter, you should be able to:
- Explain why residual connections prevent gradient vanishing.
- Describe the identity-mapping fallback that residual connections enable.
- Explain how Dropout prevents overfitting.
- State where residual connections and Dropout sit inside a Transformer block.
- Distinguish Pre-Norm from Post-Norm and explain why Pre-Norm is preferred for modern LLMs.
See You in the Next Chapter
The residual connection adds the Attention output back to the original input. But that original input is a combination of two things: the token embedding and the positional encoding.
Chapter 14 asks a question that seems obvious but turns out to be subtle: why do we combine these two signals by adding them, rather than concatenating them?