One-sentence summary: LayerNorm is the stabilizer that keeps training from going sideways; Softmax is the converter that turns arbitrary scores into valid probability distributions.
6.1 Why These Two Tools Matter
Every Transformer block runs matrix multiplication dozens of times per token. Stack thirty-two of those blocks and the numbers can drift dramatically: some activations explode to tens of thousands, others collapse to near zero. When that happens, gradients either vanish or overflow, and training fails.
Two lightweight operations prevent this:
- LayerNorm appears inside every Transformer block — twice per block in the original design — and keeps activation magnitudes sensible.
- Softmax appears inside the Attention mechanism and again at the final output layer, converting raw score vectors into proper probability distributions.
Neither is complicated. Both have outsized impact on whether the system trains at all.
6.2 LayerNorm: Keep the Scale Reasonable
6.2.1 The Problem: Numbers Drift
Neural network layers multiply and add. Do it enough times and you get runaway growth:
too large: 10,000 → 100,000 → overflow, NaN during training
too small: 0.0001 → 0.00001 → vanishing gradient, model stops learning
The solution is normalization: bring the numbers back to a predictable range at each layer boundary.
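The drift is easy to demonstrate. The sketch below (illustrative only: random weight matrices with an arbitrary 0.2 scale, not a trained model) pushes one vector through 32 simulated layers with and without re-normalization after each step:

```python
import torch

torch.manual_seed(0)
d = 64
x = torch.randn(d)
y = x.clone()

for _ in range(32):                      # 32 stacked "layers"
    W = torch.randn(d, d) * 0.2          # arbitrary weight scale (assumption)
    x = W @ x                            # no normalization: scale drifts
    y = W @ y
    y = (y - y.mean()) / y.std()         # re-normalize: scale stays near 1

print("without norm, std after 32 layers:", x.std().item())
print("with norm,    std after 32 layers:", y.std().item())
```

Without normalization the standard deviation explodes by orders of magnitude; with it, the scale stays at 1 no matter how many layers are stacked.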
6.2.2 What LayerNorm Does
Layer Normalization normalizes the feature values within a single token vector. For each token:
- Compute the mean across all d_model dimensions.
- Compute the variance.
- Subtract the mean and divide by the standard deviation.
- Apply learned scale γ and shift β parameters.
The formula:
y = (x - μ) / √(σ² + ε) × γ + β
Breaking it down:
- x - μ: subtract the mean, so the result is centered at zero.
- / √(σ² + ε): divide by the standard deviation, so the spread becomes one. The small ε (typically 1e-5) prevents division by zero.
- × γ + β: apply learnable scale and bias. This lets the model choose a different output range if that turns out to be useful. γ and β start at 1 and 0 and are updated during training.
6.2.3 Worked Example
Suppose a token's four-dimensional activation vector is [22, 5, 6, 8].
Step 1: compute the mean
μ = (22 + 5 + 6 + 8) / 4 = 41 / 4 = 10.25
Step 2: compute the variance
σ² = ((22 - 10.25)² + (5 - 10.25)² + (6 - 10.25)² + (8 - 10.25)²) / 4
= (138.06 + 27.56 + 18.06 + 5.06) / 4
= 47.19
Step 3: normalize
dim0: (22 - 10.25) / √47.19 = 11.75 / 6.87 ≈ 1.71
dim1: ( 5 - 10.25) / √47.19 = -5.25 / 6.87 ≈ -0.76
dim2: ( 6 - 10.25) / √47.19 = -4.25 / 6.87 ≈ -0.62
dim3: ( 8 - 10.25) / √47.19 = -2.25 / 6.87 ≈ -0.33
Result: [1.71, -0.76, -0.62, -0.33]. Mean ≈ 0, variance ≈ 1. Regardless of how extreme the input was, the output is well-behaved.
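The three steps above can be checked in a few lines of plain Python (with γ = 1 and β = 0, i.e. before any learned scale and shift):

```python
import math

x = [22.0, 5.0, 6.0, 8.0]
mu = sum(x) / len(x)                            # step 1: mean = 10.25
var = sum((v - mu) ** 2 for v in x) / len(x)    # step 2: variance ≈ 47.19
eps = 1e-5                                      # avoids division by zero
y = [(v - mu) / math.sqrt(var + eps) for v in x]  # step 3: normalize

print([round(v, 2) for v in y])  # [1.71, -0.76, -0.62, -0.33]
```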
6.2.4 PyTorch Implementation
import torch
import torch.nn as nn
layer_norm = nn.LayerNorm(normalized_shape=4)
x = torch.tensor([[22.0, 5.0, 6.0, 8.0]])
y = layer_norm(x)
print(y) # approximately [1.71, -0.76, -0.62, -0.33]
nn.LayerNorm implements the formula above, including the learnable γ and β. The normalized_shape argument tells it which trailing dimension to normalize over. In a real Transformer, that dimension is d_model.
6.2.5 Why "Layer" Norm?
The normalization is applied per token, across the feature (layer) dimension — not across the batch. Each token's vector is independently normalized. Different tokens do not interfere with each other's statistics.
This contrasts with Batch Normalization, which normalizes across the batch dimension. Batch Norm is common in vision models, but it struggles with variable-length sequences. LayerNorm is the standard choice for Transformers.
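The independence claim is easy to verify: running one token through nn.LayerNorm alone and then again inside a larger batch gives the identical result, because other tokens' statistics never enter the computation. A quick check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(4)

a = torch.randn(1, 4)             # one token on its own
b = torch.randn(3, 4)             # three unrelated tokens
batch = torch.cat([a, b], dim=0)  # the same token inside a larger batch

# Row 0 normalizes identically either way: per-token statistics only.
print(torch.allclose(ln(a)[0], ln(batch)[0]))  # True
```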
6.2.6 Where LayerNorm Appears
Inside each Transformer block, LayerNorm appears twice:
Input
↓
LayerNorm <- first application
↓
Masked Multi-Head Attention
↓
Residual connection
↓
LayerNorm <- second application
↓
Feed Forward Network (FFN)
↓
Residual connection
↓
Output
In the original Transformer paper this was "post-norm" (LayerNorm after the sub-layer). Most modern LLMs use "pre-norm" (LayerNorm before the sub-layer, as shown above) because it is more stable during training.
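The pre-norm block diagram above can be sketched as a PyTorch module. This is an illustrative skeleton, not a production implementation: it uses nn.MultiheadAttention without the causal mask, and the FFN sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm before each sub-layer."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # first application
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)   # second application
        self.ffn = nn.Sequential(            # FFN covered in Chapter 7
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)                            # normalize, then attend
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                             # residual connection
        x = x + self.ffn(self.norm2(x))              # normalize, FFN, residual
        return x

block = PreNormBlock(d_model=16)
x = torch.randn(2, 5, 16)   # (batch, seq_len, d_model)
print(block(x).shape)       # torch.Size([2, 5, 16])
```

Note that the residual additions sit outside the normalized path: the input flows around each sub-layer unchanged, which is exactly what makes pre-norm training more stable.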
6.3 Softmax: Turning Scores Into Probabilities
6.3.1 The Problem: We Need Probabilities
Two places in the Transformer need probability distributions:
- Inside Attention: after computing raw similarity scores between Query and Key vectors, we need weights that sum to 1.
- At the final output: after projecting the final hidden state onto the vocabulary, we need the model to express a distribution over all possible next tokens.
Probabilities have two requirements: every value is between 0 and 1, and all values sum to exactly 1. Raw scores from matrix multiplication satisfy neither.
6.3.2 What Softmax Does
Softmax transforms any vector of real numbers into a valid probability distribution.
Using the vocabulary output as an example — the model's raw scores (logits) for four candidate next tokens:
Input logits: request = 3.01, tab = 0.09, quote = 2.48, other = 1.95
Output probabilities: request = 50.28%, tab = 2.71%, quote = 29.59%, other = 17.42%
After Softmax:
- Every value is between 0 and 1.
- All values sum to 100%.
6.3.3 The Softmax Formula
P(i) = e^(x_i) / Σ_j e^(x_j)

In words:
- Raise e ≈ 2.718 to the power of each score x_i.
- Divide each result by the sum of all the exponentiated scores.
6.3.4 Worked Example
Input logits: [3.01, 0.09, 2.48, 1.95]
Step 1: exponentiate
e^3.01 = 20.29
e^0.09 = 1.09
e^2.48 = 11.94
e^1.95 = 7.03
Step 2: sum
total = 20.29 + 1.09 + 11.94 + 7.03 = 40.35
Step 3: divide
request: 20.29 / 40.35 = 0.5028 = 50.28%
tab: 1.09 / 40.35 = 0.0271 = 2.71%
quote: 11.94 / 40.35 = 0.2959 = 29.59%
other: 7.03 / 40.35 = 0.1742 = 17.42%
6.3.5 Three Properties Worth Knowing
1. Amplifies differences: a gap of Δ in logit space becomes a multiplicative factor of e^Δ in probability space — a gap of 2 means a probability ratio of e² ≈ 7.4. The largest logit dominates.
2. Preserves order: if logit A > logit B, then P(A) > P(B). The highest score remains the highest probability.
3. Handles negative inputs: exponentiation always returns a positive number (e^x > 0 for all x), so even negative logits produce valid probabilities.
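One practical detail worth knowing: e^x overflows for large x (math.exp(1000) fails in plain Python), so real implementations subtract the maximum logit before exponentiating. This shifts every input, but the e^max factor cancels in the division, so the output is unchanged. A sketch:

```python
import math

def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]     # overflows for large inputs
    return [e / sum(exps) for e in exps]

def stable_softmax(xs):
    m = max(xs)                          # shift so the largest input is 0
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

print([round(p, 4) for p in stable_softmax([3.01, 0.09, 2.48, 1.95])])
# [0.5028, 0.0271, 0.2959, 0.1742] — same result as the worked example

print(stable_softmax([1000.0, 999.0]))  # works; naive_softmax would overflow
```

Library implementations such as F.softmax apply this max-subtraction trick internally, which is one reason to prefer them over a hand-rolled version.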
6.3.6 PyTorch Implementation
import torch
import torch.nn.functional as F
logits = torch.tensor([3.01, 0.09, 2.48, 1.95])
probs = F.softmax(logits, dim=0)
print(probs) # tensor([0.5028, 0.0271, 0.2959, 0.1742])
print(probs.sum()) # tensor(1.0000)
6.4 Where They Sit in the Architecture
6.4.1 The Full Output Flow
From the final Transformer block to next-token prediction:
Transformer block output
↓
Final LayerNorm
↓
Linear projection (d_model → vocab_size)
↓
Softmax (logits → probabilities)
↓
Sample or argmax to select next token
The linear projection maps the final hidden state from d_model dimensions to vocab_size dimensions. If d_model = 4096 and vocab_size = 100,256, this is a large matrix: 4096 × 100,256 ≈ 410 million parameters. It is often called the LM Head (Language Model Head).
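The full output flow can be traced with toy sizes (here d_model = 8 and vocab_size = 10; the weights are random, so the "prediction" is meaningless — this only illustrates the shapes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 8, 10

final_norm = nn.LayerNorm(d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # the LM Head

hidden = torch.randn(1, d_model)       # final hidden state for one token
logits = lm_head(final_norm(hidden))   # (1, vocab_size) raw scores
probs = torch.softmax(logits, dim=-1)  # valid distribution over the vocab
next_token = probs.argmax(dim=-1)      # greedy selection

print(probs.shape)        # torch.Size([1, 10])
print(float(probs.sum())) # ≈ 1.0
```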
6.4.2 How They Work Together
LayerNorm and Softmax do different jobs at different points, but they cooperate:
- LayerNorm stabilizes activations between sub-layers, keeping the numbers in a range where matrix multiplication remains well-conditioned.
- After all the computation, the final hidden state passes through one more LayerNorm and then the LM Head.
- Softmax at the end converts the LM Head output into a valid next-token distribution.
Inside Attention, Softmax does the same job on a smaller scale: it converts the raw Q · K similarity scores into attention weights that sum to 1.
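That smaller-scale job can be sketched with random Q and K matrices (scaled dot-product attention, dividing by √d_k as in the original Transformer; the mask is omitted for brevity):

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 16
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)        # raw similarity scores
weights = torch.softmax(scores, dim=-1)  # one distribution per Query row

print(weights.sum(dim=-1))  # every row sums to ≈ 1
```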
6.5 Temperature: Controlling the Distribution Shape
When you call an inference API and set temperature, you are modifying how Softmax behaves at the output layer.
6.5.1 The Temperature Formula
Softmax with temperature divides each logit by T before exponentiating:

P(i) = e^(x_i / T) / Σ_j e^(x_j / T)

The value of T changes the shape of the resulting distribution:
- T < 1 (low temperature): dividing by a small number makes large logits even larger in relative terms. The distribution sharpens — the top token dominates.
- T = 1 (default): standard Softmax, no modification.
- T > 1 (high temperature): logit differences shrink. The distribution flattens — lower-probability tokens get more weight.
6.5.2 Numerical Example
Logits: [3.0, 1.0, 0.5]
| Temperature | Probabilities (approx.) | Character |
|---|---|---|
| T = 0.5 | [0.98, 0.02, 0.01] | Very decisive — almost always picks the top token |
| T = 1.0 | [0.82, 0.11, 0.07] | Standard distribution |
| T = 2.0 | [0.60, 0.22, 0.17] | More spread out — lower tokens get more chances |
Values computed via softmax(logits / T), rounded to two decimal places.
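The table can be reproduced in a few lines:

```python
import torch

logits = torch.tensor([3.0, 1.0, 0.5])

for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=0)  # divide by T, then Softmax
    print(f"T = {T}: {[round(p, 2) for p in probs.tolist()]}")
```

Note that the order never changes — only how concentrated the probability mass is on the top token.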
6.5.3 Practical Implications
When you use a model API:
- temperature = 0: greedy decoding — always pick the highest-probability token. (Dividing by zero is undefined, so APIs treat T = 0 as a special case meaning argmax.) Deterministic but sometimes repetitive.
- temperature = 0.7: a common production value. Balances coherence and variety.
- temperature = 1.0+: more diverse, creative, and potentially incoherent output.
This is the part people often misdiagnose. If a model output seems dull or repetitive, the first question should be whether the sampling settings are the problem, not whether the model lacks capability. If the output is incoherent, the question is the same. Temperature is a dial on the decoder, not a measure of model intelligence.
6.6 Chapter Summary
6.6.1 Side-by-Side Comparison
| Property | LayerNorm | Softmax |
|---|---|---|
| Purpose | normalize activations | convert scores to probabilities |
| Output range | mean = 0, std = 1 (before scale/shift) | [0, 1] per element, sums to 1 |
| Where it appears | twice per Transformer block | inside Attention, at final output |
| Learnable params | yes (γ and β) | no (but temperature is a hyperparameter) |
6.6.2 Formula Reference
LayerNorm:
y = (x - mean(x)) / std(x) × γ + β
Softmax:
P(i) = e^(x_i) / Σ_j e^(x_j)
Softmax with temperature:
P(i) = e^(x_i / T) / Σ_j e^(x_j / T)
6.6.3 Core Takeaway
LayerNorm is the Transformer's stabilizer: it keeps intermediate activations in a well-behaved range so that training converges. Softmax is the probability converter: it turns raw scores into valid distributions both inside Attention and at the final prediction step. These two tools are small in code, large in impact.
Chapter Checklist
After this chapter, you should be able to:
- Explain why LayerNorm is needed: stacked matrix multiplications cause activation drift.
- Work through a LayerNorm calculation by hand given a small input vector.
- Explain Softmax as an exponentiate-then-normalize operation.
- Work through a Softmax calculation by hand given a small logit vector.
- State where LayerNorm and Softmax each appear in the Transformer architecture.
- Explain what temperature controls and the practical effect of low vs. high values.
See You in the Next Chapter
That covers the two lightweight tools that quietly hold the whole system together. If you can sketch where LayerNorm and Softmax appear in a Transformer block diagram, you are ready to move forward.
Chapter 7 introduces the Feed Forward Network — the other major component inside each Transformer block, the one that holds most of the model's parameters. The good news: once you understand matrix multiplication and activation functions, the FFN is straightforward.