One-sentence summary: Token embeddings carry meaning; positional encoding carries location — and the Transformer needs both before it can understand anything.
5.1 Why Position Matters
The previous chapter gave us this pipeline:
raw text -> token IDs -> embeddings -> Transformer blocks
We have embeddings now. But there is a quiet problem lurking in them.
5.1.1 A Critical Gap
Consider two sentences from our running example:
The agent tagged the reviewer.
The reviewer tagged the agent.
Both sentences contain the same tokens. If we only hand the model token embeddings, the representation of "agent" is identical in both sentences — same vector, same numbers. Same for "reviewer". Same for "tagged".
From the model's point of view, these two sentences look the same. They absolutely do not mean the same thing.
The gap is missing position. The Transformer processes all tokens in parallel, which is one reason it scales so efficiently. But that parallel processing is also why the model, without extra help, treats the input like a bag of tokens with no order.
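The "bag of tokens" problem can be made concrete. Below is a minimal sketch (the 4-dimensional embedding values are made up for illustration): any order-insensitive summary of the two sentences, such as a mean over the token vectors, comes out identical, so nothing downstream can tell them apart.

```python
import numpy as np

# Hypothetical token embeddings, d_model = 4 (toy values for illustration)
emb = {
    "The":      np.array([0.62, -0.51,  0.09,  0.85]),
    "agent":    np.array([0.62, -0.51,  0.09,  0.85]),
    "tagged":   np.array([0.07,  0.31, -0.44,  0.12]),
    "the":      np.array([0.43,  0.18,  0.66, -0.30]),
    "reviewer": np.array([1.30, -0.72,  0.55,  0.41]),
    ".":        np.array([0.98,  0.03, -0.11,  0.74]),
}

s1 = ["The", "agent", "tagged", "the", "reviewer", "."]
s2 = ["The", "reviewer", "tagged", "the", "agent", "."]

# Mean-pool the embeddings: one simple order-insensitive summary
pooled1 = np.mean([emb[t] for t in s1], axis=0)
pooled2 = np.mean([emb[t] for t in s2], axis=0)
print(np.allclose(pooled1, pooled2))  # True: the sentences are indistinguishable
```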
5.1.2 Three Things Positional Encoding Fixes
Positional encoding solves three related problems:
- Absolute location: the model learns that this token is first, this one is fifteenth, this one is the last in the sequence.
- Relative distance: "agent" and "tagged" are adjacent; "agent" and "reviewer" are separated by one token. That spacing carries grammatical and semantic information.
- Learnable patterns: because the encoding follows a consistent formula, the model can generalize position rules to longer sequences it did not see during training.
5.2 The Simple but Broken Idea: Raw Integers
Before explaining what the Transformer actually does, let's look at the most naive approach so you understand why it doesn't work.
5.2.1 Stacking Integer Position Numbers
The simplest possible idea: give each position a number, then add it to the token embedding at that position.
Here is an example with the tokens ["The", "agent", "tagged", "the", "reviewer", "."] using a toy d_model = 4:
Embedding vectors (semantic content):
| token | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| The | 0.62 | -0.51 | 0.09 | 0.85 |
| agent | 0.62 | -0.51 | 0.09 | 0.85 |
| tagged | 0.07 | 0.31 | -0.44 | 0.12 |
| the | 0.43 | 0.18 | 0.66 | -0.30 |
| reviewer | 1.30 | -0.72 | 0.55 | 0.41 |
| . | 0.98 | 0.03 | -0.11 | 0.74 |
Integer position vectors:
| position | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 |
| 4 | 4 | 4 | 4 | 4 |
| 5 | 5 | 5 | 5 | 5 |
| 6 | 6 | 6 | 6 | 6 |
After adding:
| token | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| The | 0.62+1 | -0.51+1 | 0.09+1 | 0.85+1 |
| agent | 0.62+2 | -0.51+2 | 0.09+2 | 0.85+2 |
| ... | ... | ... | ... | ... |
5.2.2 Why This Breaks
Two real problems:
- Unbounded values: position 1000 adds 1000 to every dimension. The embedding values are usually around zero; suddenly some tokens are shifted by a thousand. Gradient flow collapses, training is unstable.
- No learnable structure: 1, 2, 3, 4 grows linearly with no pattern the model can generalize. The model cannot extrapolate to longer sequences or learn that "two positions apart" has a consistent meaning.
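The value-explosion problem is easy to see numerically. A quick sketch of the naive scheme (random toy embeddings, sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4
emb = rng.normal(0.0, 0.5, (1001, d_model))   # toy embeddings clustered near zero
pos = np.arange(1, 1002).reshape(-1, 1)       # integer positions 1, 2, ..., 1001
naive_input = emb + pos                       # broadcast the position into every dim

print(naive_input[0])     # near [1, 1, 1, 1]: the semantics are still visible
print(naive_input[1000])  # near [1001, 1001, 1001, 1001]: the meaning is drowned out
```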
The original Transformer paper needed something more principled.
5.3 The Transformer's Answer: Sinusoidal Encoding
5.3.1 Core Idea
The original Transformer uses sine and cosine waves at different frequencies to construct positional vectors. The way I remember this: every position gets a barcode made from stacked waves. Low-frequency waves encode broad location; high-frequency waves encode fine-grained distance between nearby positions.
The formula:
even dimensions: PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
odd dimensions: PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Do not panic at the formula. There are only two ideas inside it:
- Even dimensions use sine, odd dimensions use cosine.
- Each pair of dimensions uses a different frequency, controlled by the exponent 2i / d_model.
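The formula is short enough to implement directly. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a [seq_len, d_model] matrix of sinusoidal position encodings."""
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (i / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=6, d_model=64)
print(pe[1, :2])  # dims 0-1 at pos 1: [sin(1), cos(1)] ~ [0.84, 0.54]
```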
5.3.2 Wave Visualization
If you plot the position encoding along a sequence, each dimension traces out a wave. A low-frequency dimension changes slowly across positions — like a slow clock. A high-frequency dimension oscillates quickly — like a fast clock. Put many clocks together and each position in the sequence has a unique combination of readings, which is what we need.
5.3.3 Concrete Numbers
Here are the values for the first four dimensions, for our six-token sequence (positions 0 through 5). One note: so that dimensions 2 and 3 vary visibly over six positions, the frequency term below uses d_model = 64 rather than the toy d_model = 4 (with d_model = 4, the dim2/dim3 angles would be pos/100 and would barely move):
| token | pos | dim0 (sin) | dim1 (cos) | dim2 (sin) | dim3 (cos) |
|---|---|---|---|---|---|
| The | 0 | 0.00 | 1.00 | 0.00 | 1.00 |
| agent | 1 | 0.84 | 0.54 | 0.68 | 0.73 |
| tagged | 2 | 0.90 | -0.41 | 0.99 | 0.07 |
| the | 3 | 0.14 | -0.98 | 0.77 | -0.62 |
| reviewer | 4 | -0.75 | -0.65 | 0.14 | -0.98 |
| . | 5 | -0.95 | 0.28 | -0.57 | -0.82 |
Note: positions are zero-indexed (pos = 0, 1, 2, ...). At pos = 0, sin(0) = 0 and cos(0) = 1.
Observations:
- All values stay in [-1, 1], the natural range of sine and cosine. No value explosion at long positions.
- Each row is unique: every position has a distinctive fingerprint.
- The patterns change smoothly: nearby positions are numerically similar.
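All three observations can be checked numerically. A small sketch, using d_model = 64 as one example width:

```python
import numpy as np

seq_len, d_model = 6, 64
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / (10000 ** (i / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

print(np.abs(pe).max() <= 1.0)                      # True: values stay bounded
print(len(np.unique(pe.round(6), axis=0)) == 6)     # True: every row is distinct
# Smoothness: position 1 is numerically closer to position 0 than position 5 is
print(np.linalg.norm(pe[1] - pe[0]) < np.linalg.norm(pe[5] - pe[0]))  # True
```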
5.3.4 Why Sine and Cosine Specifically?
The paper authors chose sin/cos for three reasons:
- Bounded values: always in [-1, 1]. Position 10,000 does not blow up the numbers.
- Extrapolation in theory: a model trained on short sequences can in principle generalize to longer ones because the wave patterns continue predictably.
- Relative position via linear transformation: there is a mathematical property that the encoding at position pos + k can be expressed as a linear function of the encoding at pos. This means the model can learn to attend to tokens at a fixed offset.
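The linear-transformation property can be demonstrated for a single sin/cos pair. For a fixed offset k, the encoding at pos + k is a rotation of the encoding at pos, and the rotation matrix depends only on k, never on pos (this follows from the angle-addition identities for sine and cosine):

```python
import numpy as np

omega = 1.0  # frequency of this dimension pair (dims 0-1 use omega = 1)
k = 3        # fixed offset

# Rotation matrix determined by the offset alone:
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for pos in range(5):
    pe_pos  = np.array([np.sin(omega * pos), np.cos(omega * pos)])
    pe_posk = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])
    assert np.allclose(M @ pe_pos, pe_posk)  # the same M works for every pos

print("PE(pos + k) = M(k) @ PE(pos) holds for all pos")
```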
Later research found that sinusoidal encoding's extrapolation capability is more limited in practice than the theory suggests. Newer schemes like RoPE and ALiBi handle long contexts better. We cover those in Chapter 25. For now, sinusoidal encoding is the clearest way to understand why position encoding is needed and how it works.
5.4 Embedding + Position = Input
Now let's see the full addition step.
5.4.1 Vector Addition
Three matrices, one shape each: [seq_len, d_model].
Embedding matrix (semantic content):
The: [0.62, -0.51, 0.09, 0.85]
agent: [0.62, -0.51, 0.09, 0.85] <- same as "The" in this toy example
tagged: [0.07, 0.31, -0.44, 0.12]
...
Position matrix (location):
pos 0: [0.00, 1.00, 0.00, 1.00]
pos 1: [0.84, 0.54, 0.68, 0.73]
pos 2: [0.90, -0.41, 0.99, 0.07]
...
Input embeddings (their sum):
The (pos 0): [0.62+0.00, -0.51+1.00, 0.09+0.00, 0.85+1.00]
agent (pos 1): [0.62+0.84, -0.51+0.54, 0.09+0.68, 0.85+0.73]
...
The critical observation: if the same token appears at two different positions, its embedding vector is identical but its positional vector is different, so the combined input is different. The model can now distinguish "The agent tagged the reviewer" from "The reviewer tagged the agent" even though they share tokens.
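This observation can be checked directly with the toy numbers from the tables above: same token embedding, two different position vectors, two different inputs.

```python
import numpy as np

agent_emb = np.array([0.62, -0.51, 0.09, 0.85])   # token embedding for "agent"
pe_pos1 = np.array([0.84, 0.54, 0.68, 0.73])      # position 1 ("agent" in sentence 1)
pe_pos4 = np.array([-0.75, -0.65, 0.14, -0.98])   # position 4 ("agent" in sentence 2)

input_s1 = agent_emb + pe_pos1
input_s2 = agent_emb + pe_pos4
print(np.allclose(input_s1, input_s2))  # False: the model can now tell them apart
```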
5.4.2 Geometric Intuition
In a 2D sketch, vector addition follows the parallelogram rule:
embedding vector = [1, 3] (blue arrow: semantic direction)
position vector = [2, 1] (red arrow: positional shift)
input vector = [3, 4] (result: diagonal of the parallelogram)
The resulting vector carries both pieces of information, encoded in its direction and magnitude. In 768 or 4096 dimensions, there is far more room for this combined representation to remain coherent.
5.4.3 Relative Distance Matters
For many tasks the model cares about relative distance, not just absolute position. Whether "pull" comes just before "request" matters. The sinusoidal scheme preserves some of that structure mathematically, and learned position schemes preserve it empirically. Chapter 25 goes into the specifics of each approach.
5.5 Training: What Learns and What Doesn't
5.5.1 Fixed Encodings vs. Learned Parameters
There is an important asymmetry in the original Transformer:
- Embedding matrix: trainable parameters. Every training step, gradients flow back through the embedding table and update the token vectors. The model learns to place semantically related tokens near each other in embedding space.
- Position matrix (sinusoidal version): fixed. Not a parameter. Computed once from the formula and never updated. No gradient flows through it.
During training, the model learns how to interpret the position signal embedded in the vectors — but it does not change the signal itself.
Some models, including BERT, use learned positional embeddings: the position matrix is a parameter just like the token embedding table and is updated by gradient descent. The tradeoff is that learned embeddings often do not generalize beyond the training context length, while sinusoidal ones can in principle.
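The asymmetry can be sketched side by side. The NumPy version below only mimics the training behavior conceptually (the "gradient" is a stand-in; in a real framework the learned table would be a trainable parameter):

```python
import numpy as np

seq_len, d_model = 512, 64

# Fixed (sinusoidal): computed once from the formula, never updated by training.
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / (10000 ** (i / d_model))
pe_fixed = np.zeros((seq_len, d_model))
pe_fixed[:, 0::2], pe_fixed[:, 1::2] = np.sin(angles), np.cos(angles)

# Learned (BERT-style): a parameter table, initialized randomly, then
# nudged by gradient descent like any other weight.
rng = np.random.default_rng(0)
pe_learned = rng.normal(0.0, 0.02, (seq_len, d_model))  # trainable parameter
fake_grad = rng.normal(0.0, 0.01, (seq_len, d_model))   # stand-in for a real gradient
pe_learned -= 0.1 * fake_grad                           # pe_fixed never gets this step

# The practical difference: pe_fixed can be recomputed for any length,
# while pe_learned simply has no rows past position seq_len - 1.
print(pe_learned.shape[0])  # 512
```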
5.5.2 Where It Sits in the Architecture
The flow in the full model:
raw text
|
| tokenization
v
token IDs
|
| embedding lookup
v
embedding matrix [seq_len, d_model]
|
| + positional encoding [seq_len, d_model]
v
input embeddings [seq_len, d_model]
|
| feed into Transformer blocks
v
...
Positional encoding happens before the first Transformer block. It is part of the input preprocessing, not inside any block.
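The preprocessing portion of that flow fits in a few lines. A sketch (the vocabulary size, the token IDs, and the embedding initialization are all made up for illustration; the embedding table would be trainable in practice):

```python
import numpy as np

vocab_size, d_model = 100, 64
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 0.02, (vocab_size, d_model))

def sinusoidal_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

token_ids = np.array([5, 17, 42, 5, 88, 3])   # pretend tokenizer output
embeddings = embedding_table[token_ids]       # lookup: [seq_len, d_model]
inputs = embeddings + sinusoidal_encoding(len(token_ids), d_model)

print(inputs.shape)  # (6, 64): ready for the first Transformer block
```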
5.6 Why Add Instead of Concatenate?
This question comes up every time I teach this chapter. The intuition is worth spelling out.
5.6.1 Concatenation vs. Addition
Concatenation:
- Append the position vector after the embedding vector.
- Result: [embedding | position], a vector twice as wide.
- Clean separation of information.
- Downside: doubles the vector width. Every subsequent matrix must handle 2 × d_model features. Parameter count and compute cost scale accordingly.
Addition:
- Add embedding and position element-wise.
- Result: same shape [d_model], no dimension change.
- Downside: the two signals are mixed in the same dimensions.
5.6.2 Why Addition Works
The justification is that high-dimensional spaces are surprisingly spacious. In 768 or 4096 dimensions, the semantic content of a token and its positional signal can occupy largely orthogonal subspaces. The network has enough capacity to learn to disentangle them after the addition.
Think of it as two engineers sharing a whiteboard instead of each having their own. It sounds messy, but if you have a large enough whiteboard and organized people, it works fine — and you saved the cost of a second whiteboard.
Empirically, addition works. The architecture is simpler and the parameter count stays the same. That is a good engineering trade.
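The parameter cost is easy to quantify. A back-of-the-envelope sketch, counting only a single square projection matrix that consumes the input (the width is illustrative):

```python
d_model = 768  # e.g. a BERT-base-sized model

params_addition = d_model * d_model        # projection sees d_model features
params_concat = (2 * d_model) * d_model    # projection sees 2 * d_model features

print(params_addition)  # 589824
print(params_concat)    # 1179648: double, and this repeats for every such matrix
```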
5.7 Chapter Summary
5.7.1 Key Concepts
| Concept | Meaning |
|---|---|
| Positional Encoding | a vector added to each token embedding to encode its sequence position |
| Sinusoidal Encoding | uses sin/cos waves at multiple frequencies to generate position vectors |
| Addition | embedding + position = input embedding, same shape, no dimension change |
| Fixed vs. Learned | sinusoidal is fixed; some models (BERT) use learned position parameters |
| Relative position | the model can learn to interpret distance, not just absolute index |
5.7.2 Data Flow
Embedding [seq_len, d_model] <- semantic content
+
Position [seq_len, d_model] <- sinusoidal position encoding
=
Input [seq_len, d_model] <- fed into the first Transformer block
5.7.3 Core Takeaway
Positional encoding solves the Transformer's "positional blindness." By adding a position vector to each token embedding, the model can distinguish "The agent tagged the reviewer" from "The reviewer tagged the agent" — same tokens, different meaning, different position vectors.
Chapter Checklist
After this chapter, you should be able to:
- Explain why the Transformer needs explicit position information, given that it processes all tokens in parallel rather than sequentially.
- Describe sinusoidal encoding as a multi-frequency wave barcode: each position gets a unique combination of sine and cosine values.
- Reproduce the formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and explain the two main ideas it contains.
- Explain why addition is used instead of concatenation, and what the practical trade-off is.
- Distinguish between fixed sinusoidal encodings (original Transformer) and learned positional embeddings (BERT, etc.).
See You in the Next Chapter
That is enough for positional encoding. If you can explain why "The agent tagged the reviewer" and "The reviewer tagged the agent" produce different model outputs even though they share every token, you have internalized this chapter.
The input to the Transformer blocks is now complete: semantic information from embeddings, plus location information from positional encoding.
Chapter 6 introduces two small but essential mathematical tools that appear everywhere inside the Transformer: LayerNorm, which keeps numbers in a well-behaved range, and Softmax, which turns raw scores into probability distributions.