One-sentence summary: LayerNorm is the stabilizer that keeps training from going sideways; Softmax is the converter that turns arbitrary scores into valid probability distributions.
6.1 Why These Two Tools Matter
Every Transformer block runs matrix multiplication dozens of times per token. Stack thirty-two of those blocks and the numbers can drift dramatically: some activations explode to tens of thousands, others collapse to near zero. When that happens, gradients either vanish or overflow, and training fails.
Two lightweight operations prevent this:
- LayerNorm appears inside every Transformer block — twice per block in the original design — and keeps activation magnitudes sensible.
- Softmax appears inside the Attention mechanism and again at the final output layer, converting raw score vectors into proper probability distributions.
Neither is complicated. Both have outsized impact on whether the system trains at all.
6.2 LayerNorm: Keep the Scale Reasonable
6.2.1 The Problem: Numbers Drift
Neural network layers multiply and add. Do it enough times and you get runaway growth:
too large: 10,000 → 100,000 → overflow, NaN during training
too small: 0.0001 → 0.00001 → vanishing gradient, model stops learning
The solution is normalization: bring the numbers back to a predictable range at each layer boundary.
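The drift is easy to demonstrate. The sketch below (illustrative only: random weight matrices with an arbitrary 0.2 scale, not a trained model) pushes one vector through 32 simulated layers with and without re-normalization after each step:

```python
import torch

torch.manual_seed(0)
d = 64
x = torch.randn(d)
y = x.clone()

for _ in range(32):                      # 32 stacked "layers"
    W = torch.randn(d, d) * 0.2          # arbitrary weight scale (assumption)
    x = W @ x                            # no normalization: scale drifts
    y = W @ y
    y = (y - y.mean()) / y.std()         # re-normalize: scale stays near 1

print("without norm, std after 32 layers:", x.std().item())
print("with norm,    std after 32 layers:", y.std().item())
```

Without normalization the standard deviation explodes by orders of magnitude; with it, the scale stays at 1 no matter how many layers are stacked.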
6.2.2 What LayerNorm Does
Layer Normalization normalizes the feature values within a single token vector. For each token:
- Compute the mean across all d_model dimensions.
- Compute the variance.
- Subtract the mean and divide by the standard deviation.
- Apply learned scale γ and shift β parameters.
The formula:
y = (x - μ) / √(σ² + ε) × γ + β
Breaking it down:
- x - μ: subtract the mean, so the result is centered at zero.
- / √(σ² + ε): divide by the standard deviation, so the spread becomes one. The small ε (typically 1e-5) prevents division by zero.
- × γ + β: apply learnable scale and bias. This lets the model choose a different output range if that turns out to be useful. γ and β start at 1 and 0 and are updated during training.
6.2.3 Worked Example
Suppose a token's four-dimensional activation vector is [22, 5, 6, 8].
Step 1: compute the mean
μ = (22 + 5 + 6 + 8) / 4 = 41 / 4 = 10.25
Step 2: compute the variance
σ² = ((22 - 10.25)² + (5 - 10.25)² + (6 - 10.25)² + (8 - 10.25)²) / 4
= (138.06 + 27.56 + 18.06 + 5.06) / 4
= 47.19
Step 3: normalize
dim0: (22 - 10.25) / √47.19 = 11.75 / 6.87 ≈ 1.71
dim1: ( 5 - 10.25) / √47.19 = -5.25 / 6.87 ≈ -0.76
dim2: ( 6 - 10.25) / √47.19 = -4.25 / 6.87 ≈ -0.62
dim3: ( 8 - 10.25) / √47.19 = -2.25 / 6.87 ≈ -0.33
Result: [1.71, -0.76, -0.62, -0.33]. Mean ≈ 0, variance ≈ 1. Regardless of how extreme the input was, the output is well-behaved.
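The three steps above can be checked in a few lines of plain Python (with γ = 1 and β = 0, i.e. before any learned scale and shift):

```python
import math

x = [22.0, 5.0, 6.0, 8.0]
mu = sum(x) / len(x)                            # step 1: mean = 10.25
var = sum((v - mu) ** 2 for v in x) / len(x)    # step 2: variance ≈ 47.19
eps = 1e-5                                      # avoids division by zero
y = [(v - mu) / math.sqrt(var + eps) for v in x]  # step 3: normalize

print([round(v, 2) for v in y])  # [1.71, -0.76, -0.62, -0.33]
```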
6.2.4 PyTorch Implementation
import torch
import torch.nn as nn
layer_norm = nn.LayerNorm(normalized_shape=4)
x = torch.tensor([[22.0, 5.0, 6.0, 8.0]])
y = layer_norm(x)
print(y) # approximately [1.71, -0.76, -0.62, -0.33]
nn.LayerNorm implements the formula above, including the learnable γ and β. The normalized_shape argument tells it which trailing dimension to normalize over. In a real Transformer, that dimension is d_model.
6.2.5 Why "Layer" Norm?
The normalization is applied per token, across the feature (layer) dimension — not across the batch. Each token's vector is independently normalized. Different tokens do not interfere with each other's statistics.
This contrasts with Batch Normalization, which normalizes across the batch dimension. Batch Norm is common in vision models, but it struggles with variable-length sequences. LayerNorm is the standard choice for Transformers.
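The independence claim is easy to verify: running one token through nn.LayerNorm alone and then again inside a larger batch gives the identical result, because other tokens' statistics never enter the computation. A quick check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(4)

a = torch.randn(1, 4)             # one token on its own
b = torch.randn(3, 4)             # three unrelated tokens
batch = torch.cat([a, b], dim=0)  # the same token inside a larger batch

# Row 0 normalizes identically either way: per-token statistics only.
print(torch.allclose(ln(a)[0], ln(batch)[0]))  # True
```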
6.2.6 Where LayerNorm Appears
Inside each Transformer block, LayerNorm appears twice:
Input
↓
LayerNorm <- first application
↓
Masked Multi-Head Attention
↓
Residual connection
↓
LayerNorm <- second application
↓
Feed Forward Network (FFN)
↓
Residual connection
↓
Output
In the original Transformer paper this was "post-norm" (LayerNorm after the sub-layer). Most modern LLMs use "pre-norm" (LayerNorm before the sub-layer, as shown above) because it is more stable during training.
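The pre-norm block diagram above can be sketched as a PyTorch module. This is an illustrative skeleton, not a production implementation: it uses nn.MultiheadAttention without the causal mask, and the FFN sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm before each sub-layer."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # first application
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)   # second application
        self.ffn = nn.Sequential(            # FFN covered in Chapter 7
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)                            # normalize, then attend
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                             # residual connection
        x = x + self.ffn(self.norm2(x))              # normalize, FFN, residual
        return x

block = PreNormBlock(d_model=16)
x = torch.randn(2, 5, 16)   # (batch, seq_len, d_model)
print(block(x).shape)       # torch.Size([2, 5, 16])
```

Note that the residual additions sit outside the normalized path: the input flows around each sub-layer unchanged, which is exactly what makes pre-norm training more stable.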
6.3 Softmax: Turning Scores Into Probabilities
6.3.1 The Problem: We Need Probabilities
Two places in the Transformer need probability distributions:
- Inside Attention: after computing raw similarity scores between Query and Key vectors, we need weights that sum to 1.
- At the final output: after projecting the final hidden state onto the vocabulary, we need the model to express a distribution over all possible next tokens.
Probabilities have two requirements: every value is between 0 and 1, and all values sum to exactly 1. Raw scores from matrix multiplication satisfy neither.
6.3.2 What Softmax Does
Softmax transforms any vector of real numbers into a valid probability distribution.
Using the vocabulary output as an example — the model's raw scores (logits) for four candidate next tokens:
Input logits: request = 3.01, tab = 0.09, quote = 2.48, other = 1.95
Output probabilities: request = 50.28%, tab = 2.71%, quote = 29.59%, other = 17.42%
After Softmax:
- Every value is between 0 and 1.
- All values sum to 100%.
6.3.3 The Softmax Formula
P(i) = e^(x_i) / Σ_j e^(x_j)

In words:
- Raise e ≈ 2.718 to the power of each score x_i.
- Divide each result by the sum of all the exponentiated scores.
6.3.4 Worked Example
Input logits: [3.01, 0.09, 2.48, 1.95]
Step 1: exponentiate
e^3.01 = 20.29
e^0.09 = 1.09
e^2.48 = 11.94
e^1.95 = 7.03
Step 2: sum
total = 20.29 + 1.09 + 11.94 + 7.03 = 40.35
Step 3: divide
request: 20.29 / 40.35 = 0.5028 = 50.28%
tab: 1.09 / 40.35 = 0.0271 = 2.71%
quote: 11.94 / 40.35 = 0.2959 = 29.59%
other: 7.03 / 40.35 = 0.1742 = 17.42%
6.3.5 Three Properties Worth Knowing
1. Amplifies differences: a gap of Δ in logit space becomes a multiplicative factor of e^Δ in probability space — a gap of 2 means a probability ratio of e² ≈ 7.4. The largest logit dominates.
2. Preserves order: if logit A > logit B, then P(A) > P(B). The highest score remains the highest probability.
3. Handles negative inputs: exponentiation always returns a positive number (e^x > 0 for all x), so even negative logits produce valid probabilities.
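One practical detail worth knowing: e^x overflows for large x (math.exp(1000) fails in plain Python), so real implementations subtract the maximum logit before exponentiating. This shifts every input, but the e^max factor cancels in the division, so the output is unchanged. A sketch:

```python
import math

def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]     # overflows for large inputs
    return [e / sum(exps) for e in exps]

def stable_softmax(xs):
    m = max(xs)                          # shift so the largest input is 0
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

print([round(p, 4) for p in stable_softmax([3.01, 0.09, 2.48, 1.95])])
# [0.5028, 0.0271, 0.2959, 0.1742] — same result as the worked example

print(stable_softmax([1000.0, 999.0]))  # works; naive_softmax would overflow
```

Library implementations such as F.softmax apply this max-subtraction trick internally, which is one reason to prefer them over a hand-rolled version.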
6.3.6 PyTorch Implementation
import torch
import torch.nn.functional as F
logits = torch.tensor([3.01, 0.09, 2.48, 1.95])
probs = F.softmax(logits, dim=0)
print(probs) # tensor([0.5028, 0.0271, 0.2959, 0.1742])
print(probs.sum()) # tensor(1.0000)
6.4 Where They Sit in the Architecture
6.4.1 The Full Output Flow
From the final Transformer block to next-token prediction:
Transformer block output
↓
Final LayerNorm
↓
Linear projection (d_model → vocab_size)
↓
Softmax (logits → probabilities)
↓
Sample or argmax to select next token
The linear projection maps the final hidden state from d_model dimensions to vocab_size dimensions. If d_model = 4096 and vocab_size = 100,256, this is a large matrix: 4096 × 100,256 ≈ 410 million parameters. It is often called the LM Head (Language Model Head).
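The full output flow can be traced with toy sizes (here d_model = 8 and vocab_size = 10; the weights are random, so the "prediction" is meaningless — this only illustrates the shapes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 8, 10

final_norm = nn.LayerNorm(d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # the LM Head

hidden = torch.randn(1, d_model)       # final hidden state for one token
logits = lm_head(final_norm(hidden))   # (1, vocab_size) raw scores
probs = torch.softmax(logits, dim=-1)  # valid distribution over the vocab
next_token = probs.argmax(dim=-1)      # greedy selection

print(probs.shape)        # torch.Size([1, 10])
print(float(probs.sum())) # ≈ 1.0
```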
6.4.2 How They Work Together
LayerNorm and Softmax do different jobs at different points, but they cooperate:
- LayerNorm stabilizes activations between sub-layers, keeping the numbers in a range where matrix multiplication remains well-conditioned.
- After all the computation, the final hidden state passes through one more LayerNorm and then the LM Head.
- Softmax at the end converts the LM Head output into a valid next-token distribution.
Inside Attention, Softmax does the same job on a smaller scale: it converts the raw Q · K similarity scores into attention weights that sum to 1.
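That smaller-scale job can be sketched with random Q and K matrices (scaled dot-product attention, dividing by √d_k as in the original Transformer; the mask is omitted for brevity):

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 16
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)        # raw similarity scores
weights = torch.softmax(scores, dim=-1)  # one distribution per Query row

print(weights.sum(dim=-1))  # every row sums to ≈ 1
```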
6.5 Temperature: Controlling the Distribution Shape
When you call an inference API and set temperature, you are modifying how Softmax behaves at the output layer.
6.5.1 The Temperature Formula
Softmax with temperature divides each logit by T before exponentiating:

P(i) = e^(x_i / T) / Σ_j e^(x_j / T)

The value of T changes the shape of the resulting distribution:
- T < 1 (low temperature): dividing by a small number makes large logits even larger in relative terms. The distribution sharpens — the top token dominates.
- T = 1 (default): standard Softmax, no modification.
- T > 1 (high temperature): logit differences shrink. The distribution flattens — lower-probability tokens get more weight.
6.5.2 Numerical Example
Logits: [3.0, 1.0, 0.5]
| Temperature | Probabilities (approx.) | Character |
|---|---|---|
| T = 0.5 | [0.98, 0.02, 0.01] | Very decisive — almost always picks the top token |
| T = 1.0 | [0.82, 0.11, 0.07] | Standard distribution |
| T = 2.0 | [0.60, 0.22, 0.17] | More spread out — lower tokens get more chances |
Values computed via softmax(logits / T), rounded to two decimal places.
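The table can be reproduced in a few lines:

```python
import torch

logits = torch.tensor([3.0, 1.0, 0.5])

for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=0)  # divide by T, then Softmax
    print(f"T = {T}: {[round(p, 2) for p in probs.tolist()]}")
```

Note that the order never changes — only how concentrated the probability mass is on the top token.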
6.5.3 Practical Implications
When you use a model API:
- temperature = 0: greedy decoding — always pick the highest-probability token. (Dividing by zero is undefined, so APIs treat T = 0 as a special case meaning argmax.) Deterministic but sometimes repetitive.
- temperature = 0.7: a common production value. Balances coherence and variety.
- temperature = 1.0+: more diverse, creative, and potentially incoherent output.
This is the part people often misdiagnose. If a model output seems dull or repetitive, the first question should be whether the sampling settings are the problem, not whether the model lacks capability. If the output is incoherent, the question is the same. Temperature is a dial on the decoder, not a measure of model intelligence.
6.6 Chapter Summary
6.6.1 Side-by-Side Comparison
| Property | LayerNorm | Softmax |
|---|---|---|
| Purpose | normalize activations | convert scores to probabilities |
| Output range | mean = 0, std = 1 (before scale/shift) | [0, 1] per element, sums to 1 |
| Where it appears | twice per Transformer block | inside Attention, at final output |
| Learnable params | yes (γ and β) | no (but temperature is a hyperparameter) |
6.6.2 Formula Reference
LayerNorm:
y = (x - mean(x)) / std(x) × γ + β
Softmax:
P(i) = e^(x_i) / Σ_j e^(x_j)
Softmax with temperature:
P(i) = e^(x_i / T) / Σ_j e^(x_j / T)
6.6.3 Core Takeaway
LayerNorm is the Transformer's stabilizer: it keeps intermediate activations in a well-behaved range so that training converges. Softmax is the probability converter: it turns raw scores into valid distributions both inside Attention and at the final prediction step. These two tools are small in code, large in impact.
Chapter Checklist
After this chapter, you should be able to:
- Explain why LayerNorm is needed: stacked matrix multiplications cause activation drift.
- Work through a LayerNorm calculation by hand given a small input vector.
- Explain Softmax as an exponentiate-then-normalize operation.
- Work through a Softmax calculation by hand given a small logit vector.
- State where LayerNorm and Softmax each appear in the Transformer architecture.
- Explain what temperature controls and the practical effect of low vs. high values.
See You in the Next Chapter
That covers the two lightweight tools that quietly hold the whole system together. If you can sketch where LayerNorm and Softmax appear in a Transformer block diagram, you are ready to move forward.
Chapter 7 introduces the Feed Forward Network — the other major component inside each Transformer block, the one that holds most of the model's parameters. The good news: once you understand matrix multiplication and activation functions, the FFN is straightforward.