One-sentence summary: Token embeddings carry meaning; positional encoding carries location — and the Transformer needs both before it can understand anything.
5.1 Why Position Matters
The previous chapter gave us this pipeline:
raw text -> token IDs -> embeddings -> Transformer blocks
We have embeddings now. But there is a quiet problem lurking in them.
5.1.1 A Critical Gap
Consider two sentences from our running example:
The agent tagged the reviewer.
The reviewer tagged the agent.
Both sentences contain the same tokens. If we only hand the model token embeddings, the representation of "agent" is identical in both sentences — same vector, same numbers. Same for "reviewer". Same for "tagged".
From the model's point of view, these two sentences look the same. They absolutely do not mean the same thing.
The gap is missing position. The Transformer processes all tokens in parallel, which is one reason it scales so efficiently. But that parallel processing is also why the model, without extra help, treats the input like a bag of tokens with no order.
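The "bag of tokens" problem can be made concrete. Below is a minimal sketch (the 4-dimensional embedding values are made up for illustration): any order-insensitive summary of the two sentences, such as a mean over the token vectors, comes out identical, so nothing downstream can tell them apart.

```python
import numpy as np

# Hypothetical token embeddings, d_model = 4 (toy values for illustration)
emb = {
    "The":      np.array([0.62, -0.51,  0.09,  0.85]),
    "agent":    np.array([0.62, -0.51,  0.09,  0.85]),
    "tagged":   np.array([0.07,  0.31, -0.44,  0.12]),
    "the":      np.array([0.43,  0.18,  0.66, -0.30]),
    "reviewer": np.array([1.30, -0.72,  0.55,  0.41]),
    ".":        np.array([0.98,  0.03, -0.11,  0.74]),
}

s1 = ["The", "agent", "tagged", "the", "reviewer", "."]
s2 = ["The", "reviewer", "tagged", "the", "agent", "."]

# Mean-pool the embeddings: one simple order-insensitive summary
pooled1 = np.mean([emb[t] for t in s1], axis=0)
pooled2 = np.mean([emb[t] for t in s2], axis=0)
print(np.allclose(pooled1, pooled2))  # True: the sentences are indistinguishable
```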
5.1.2 Three Things Positional Encoding Fixes
Positional encoding solves three related problems:
- Absolute location: the model learns that this token is first, this one is fifteenth, this one is the last in the sequence.
- Relative distance: "agent" and "tagged" are adjacent; "agent" and "reviewer" are separated by one token. That spacing carries grammatical and semantic information.
- Learnable patterns: because the encoding follows a consistent formula, the model can generalize position rules to longer sequences it did not see during training.
5.2 The Simple but Broken Idea: Raw Integers
Before explaining what the Transformer actually does, let's look at the most naive approach so you understand why it doesn't work.
5.2.1 Stacking Integer Position Numbers
The simplest possible idea: give each position a number, then add it to the token embedding at that position.
Here is an example with the tokens ["The", "agent", "tagged", "the", "reviewer", "."] using a toy d_model = 4:
Embedding vectors (semantic content):
| token | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| The | 0.62 | -0.51 | 0.09 | 0.85 |
| agent | 0.62 | -0.51 | 0.09 | 0.85 |
| tagged | 0.07 | 0.31 | -0.44 | 0.12 |
| the | 0.43 | 0.18 | 0.66 | -0.30 |
| reviewer | 1.30 | -0.72 | 0.55 | 0.41 |
| . | 0.98 | 0.03 | -0.11 | 0.74 |
Integer position vectors:
| position | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 |
| 4 | 4 | 4 | 4 | 4 |
| 5 | 5 | 5 | 5 | 5 |
| 6 | 6 | 6 | 6 | 6 |
After adding:
| token | dim0 | dim1 | dim2 | dim3 |
|---|---|---|---|---|
| The | 0.62+1 | -0.51+1 | 0.09+1 | 0.85+1 |
| agent | 0.62+2 | -0.51+2 | 0.09+2 | 0.85+2 |
| ... | ... | ... | ... | ... |
5.2.2 Why This Breaks
Two real problems:
- Unbounded values: position 1000 adds 1000 to every dimension. The embedding values are usually around zero; suddenly some tokens are shifted by a thousand. Gradient flow collapses, training is unstable.
- No learnable structure: 1, 2, 3, 4 grows linearly with no pattern the model can generalize. The model cannot extrapolate to longer sequences or learn that "two positions apart" has a consistent meaning.
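The value-explosion problem is easy to see numerically. A quick sketch of the naive scheme (random toy embeddings, sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4
emb = rng.normal(0.0, 0.5, (1001, d_model))   # toy embeddings clustered near zero
pos = np.arange(1, 1002).reshape(-1, 1)       # integer positions 1, 2, ..., 1001
naive_input = emb + pos                       # broadcast the position into every dim

print(naive_input[0])     # near [1, 1, 1, 1]: the semantics are still visible
print(naive_input[1000])  # near [1001, 1001, 1001, 1001]: the meaning is drowned out
```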
The original Transformer paper needed something more principled.
5.3 The Transformer's Answer: Sinusoidal Encoding
5.3.1 Core Idea
The original Transformer uses sine and cosine waves at different frequencies to construct positional vectors. The way I remember this: every position gets a barcode made from stacked waves. Low-frequency waves encode broad location; high-frequency waves encode fine-grained distance between nearby positions.
The formula:
even dimensions: PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
odd dimensions: PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Do not panic at the formula. There are only two ideas inside it:
- Even dimensions use sine, odd dimensions use cosine.
- Each pair of dimensions uses a different frequency, controlled by the exponent 2i / d_model.
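The formula is short enough to implement directly. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a [seq_len, d_model] matrix of sinusoidal position encodings."""
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (i / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=6, d_model=64)
print(pe[1, :2])  # dims 0-1 at pos 1: [sin(1), cos(1)] ~ [0.84, 0.54]
```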
5.3.2 Wave Visualization
If you plot the position encoding along a sequence, each dimension traces out a wave. A low-frequency dimension changes slowly across positions — like a slow clock. A high-frequency dimension oscillates quickly — like a fast clock. Put many clocks together and each position in the sequence has a unique combination of readings, which is what we need.
5.3.3 Concrete Numbers
Here are the values for the first four dimensions, for our six-token sequence (positions 0 through 5). One note: so that dimensions 2 and 3 vary visibly over six positions, the frequency term below uses d_model = 64 rather than the toy d_model = 4 (with d_model = 4, the dim2/dim3 angles would be pos/100 and would barely move):
| token | pos | dim0 (sin) | dim1 (cos) | dim2 (sin) | dim3 (cos) |
|---|---|---|---|---|---|
| The | 0 | 0.00 | 1.00 | 0.00 | 1.00 |
| agent | 1 | 0.84 | 0.54 | 0.68 | 0.73 |
| tagged | 2 | 0.90 | -0.41 | 0.99 | 0.07 |
| the | 3 | 0.14 | -0.98 | 0.77 | -0.62 |
| reviewer | 4 | -0.75 | -0.65 | 0.14 | -0.98 |
| . | 5 | -0.95 | 0.28 | -0.57 | -0.82 |
Note: positions are zero-indexed (pos = 0, 1, 2, ...). At pos = 0, sin(0) = 0 and cos(0) = 1.
Observations:
- All values stay in [-1, 1], the natural range of sine and cosine. No value explosion at long positions.
- Each row is unique: every position has a distinctive fingerprint.
- The patterns change smoothly: nearby positions are numerically similar.
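All three observations can be checked numerically. A small sketch, using d_model = 64 as one example width:

```python
import numpy as np

seq_len, d_model = 6, 64
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / (10000 ** (i / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

print(np.abs(pe).max() <= 1.0)                      # True: values stay bounded
print(len(np.unique(pe.round(6), axis=0)) == 6)     # True: every row is distinct
# Smoothness: position 1 is numerically closer to position 0 than position 5 is
print(np.linalg.norm(pe[1] - pe[0]) < np.linalg.norm(pe[5] - pe[0]))  # True
```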
5.3.4 Why Sine and Cosine Specifically?
The paper authors chose sin/cos for three reasons:
- Bounded values: always in [-1, 1]. Position 10,000 does not blow up the numbers.
- Extrapolation in theory: a model trained on short sequences can in principle generalize to longer ones because the wave patterns continue predictably.
- Relative position via linear transformation: there is a mathematical property that the encoding at position pos + k can be expressed as a linear function of the encoding at pos. This means the model can learn to attend to tokens at a fixed offset.
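The linear-transformation property can be demonstrated for a single sin/cos pair. For a fixed offset k, the encoding at pos + k is a rotation of the encoding at pos, and the rotation matrix depends only on k, never on pos (this follows from the angle-addition identities for sine and cosine):

```python
import numpy as np

omega = 1.0  # frequency of this dimension pair (dims 0-1 use omega = 1)
k = 3        # fixed offset

# Rotation matrix determined by the offset alone:
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for pos in range(5):
    pe_pos  = np.array([np.sin(omega * pos), np.cos(omega * pos)])
    pe_posk = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])
    assert np.allclose(M @ pe_pos, pe_posk)  # the same M works for every pos

print("PE(pos + k) = M(k) @ PE(pos) holds for all pos")
```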
Later research found that sinusoidal encoding's extrapolation capability is more limited in practice than the theory suggests. Newer schemes like RoPE and ALiBi handle long contexts better. We cover those in Chapter 25. For now, sinusoidal encoding is the clearest way to understand why position encoding is needed and how it works.
5.4 Embedding + Position = Input
Now let's see the full addition step.
5.4.1 Vector Addition
Three matrices, one shape each: [seq_len, d_model].
Embedding matrix (semantic content):
The: [0.62, -0.51, 0.09, 0.85]
agent: [0.62, -0.51, 0.09, 0.85] <- same as "The" in this toy example
tagged: [0.07, 0.31, -0.44, 0.12]
...
Position matrix (location):
pos 0: [0.00, 1.00, 0.00, 1.00]
pos 1: [0.84, 0.54, 0.68, 0.73]
pos 2: [0.90, -0.41, 0.99, 0.07]
...
Input embeddings (their sum):
The (pos 0): [0.62+0.00, -0.51+1.00, 0.09+0.00, 0.85+1.00]
agent (pos 1): [0.62+0.84, -0.51+0.54, 0.09+0.68, 0.85+0.73]
...
The critical observation: if the same token appears at two different positions, its embedding vector is identical but its positional vector is different, so the combined input is different. The model can now distinguish "The agent tagged the reviewer" from "The reviewer tagged the agent" even though they share tokens.
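This observation can be checked directly with the toy numbers from the tables above: same token embedding, two different position vectors, two different inputs.

```python
import numpy as np

agent_emb = np.array([0.62, -0.51, 0.09, 0.85])   # token embedding for "agent"
pe_pos1 = np.array([0.84, 0.54, 0.68, 0.73])      # position 1 ("agent" in sentence 1)
pe_pos4 = np.array([-0.75, -0.65, 0.14, -0.98])   # position 4 ("agent" in sentence 2)

input_s1 = agent_emb + pe_pos1
input_s2 = agent_emb + pe_pos4
print(np.allclose(input_s1, input_s2))  # False: the model can now tell them apart
```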
5.4.2 Geometric Intuition
In a 2D sketch, vector addition follows the parallelogram rule:
embedding vector = [1, 3] (blue arrow: semantic direction)
position vector = [2, 1] (red arrow: positional shift)
input vector = [3, 4] (result: diagonal of the parallelogram)
The resulting vector carries both pieces of information, encoded in its direction and magnitude. In 768 or 4096 dimensions, there is far more room for this combined representation to remain coherent.
5.4.3 Relative Distance Matters
For many tasks the model cares about relative distance, not just absolute position. Whether "pull" comes just before "request" matters. The sinusoidal scheme preserves some of that structure mathematically, and learned position schemes preserve it empirically. Chapter 25 goes into the specifics of each approach.
5.5 Training: What Learns and What Doesn't
5.5.1 Fixed Encodings vs. Learned Parameters
There is an important asymmetry in the original Transformer:
- Embedding matrix: trainable parameters. Every training step, gradients flow back through the embedding table and update the token vectors. The model learns to place semantically related tokens near each other in embedding space.
- Position matrix (sinusoidal version): fixed. Not a parameter. Computed once from the formula and never updated. No gradient flows through it.
During training, the model learns how to interpret the position signal embedded in the vectors — but it does not change the signal itself.
Some models, including BERT, use learned positional embeddings: the position matrix is a parameter just like the token embedding table and is updated by gradient descent. The tradeoff is that learned embeddings often do not generalize beyond the training context length, while sinusoidal ones can in principle.
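The asymmetry can be sketched side by side. The NumPy version below only mimics the training behavior conceptually (the "gradient" is a stand-in; in a real framework the learned table would be a trainable parameter):

```python
import numpy as np

seq_len, d_model = 512, 64

# Fixed (sinusoidal): computed once from the formula, never updated by training.
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / (10000 ** (i / d_model))
pe_fixed = np.zeros((seq_len, d_model))
pe_fixed[:, 0::2], pe_fixed[:, 1::2] = np.sin(angles), np.cos(angles)

# Learned (BERT-style): a parameter table, initialized randomly, then
# nudged by gradient descent like any other weight.
rng = np.random.default_rng(0)
pe_learned = rng.normal(0.0, 0.02, (seq_len, d_model))  # trainable parameter
fake_grad = rng.normal(0.0, 0.01, (seq_len, d_model))   # stand-in for a real gradient
pe_learned -= 0.1 * fake_grad                           # pe_fixed never gets this step

# The practical difference: pe_fixed can be recomputed for any length,
# while pe_learned simply has no rows past position seq_len - 1.
print(pe_learned.shape[0])  # 512
```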
5.5.2 Where It Sits in the Architecture
The flow in the full model:
raw text
|
| tokenization
v
token IDs
|
| embedding lookup
v
embedding matrix [seq_len, d_model]
|
| + positional encoding [seq_len, d_model]
v
input embeddings [seq_len, d_model]
|
| feed into Transformer blocks
v
...
Positional encoding happens before the first Transformer block. It is part of the input preprocessing, not inside any block.
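The preprocessing portion of that flow fits in a few lines. A sketch (the vocabulary size, the token IDs, and the embedding initialization are all made up for illustration; the embedding table would be trainable in practice):

```python
import numpy as np

vocab_size, d_model = 100, 64
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 0.02, (vocab_size, d_model))

def sinusoidal_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

token_ids = np.array([5, 17, 42, 5, 88, 3])   # pretend tokenizer output
embeddings = embedding_table[token_ids]       # lookup: [seq_len, d_model]
inputs = embeddings + sinusoidal_encoding(len(token_ids), d_model)

print(inputs.shape)  # (6, 64): ready for the first Transformer block
```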
5.6 Why Add Instead of Concatenate?
This question comes up every time I teach this chapter. The intuition is worth spelling out.
5.6.1 Concatenation vs. Addition
Concatenation:
- Append the position vector after the embedding vector.
- Result: [embedding | position], a vector twice as wide.
- Clean separation of information.
- Downside: doubles the vector width. Every subsequent matrix must handle 2 × d_model features. Parameter count and compute cost scale accordingly.
Addition:
- Add embedding and position element-wise.
- Result: same shape [d_model], no dimension change.
- Downside: the two signals are mixed in the same dimensions.
5.6.2 Why Addition Works
The justification is that high-dimensional spaces are surprisingly spacious. In 768 or 4096 dimensions, the semantic content of a token and its positional signal can occupy largely orthogonal subspaces. The network has enough capacity to learn to disentangle them after the addition.
Think of it as two engineers sharing a whiteboard instead of each having their own. It sounds messy, but if you have a large enough whiteboard and organized people, it works fine — and you saved the cost of a second whiteboard.
Empirically, addition works. The architecture is simpler and the parameter count stays the same. That is a good engineering trade.
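The parameter cost is easy to quantify. A back-of-the-envelope sketch, counting only a single square projection matrix that consumes the input (the width is illustrative):

```python
d_model = 768  # e.g. a BERT-base-sized model

params_addition = d_model * d_model        # projection sees d_model features
params_concat = (2 * d_model) * d_model    # projection sees 2 * d_model features

print(params_addition)  # 589824
print(params_concat)    # 1179648: double, and this repeats for every such matrix
```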
5.7 Chapter Summary
5.7.1 Key Concepts
| Concept | Meaning |
|---|---|
| Positional Encoding | a vector added to each token embedding to encode its sequence position |
| Sinusoidal Encoding | uses sin/cos waves at multiple frequencies to generate position vectors |
| Addition | embedding + position = input embedding, same shape, no dimension change |
| Fixed vs. Learned | sinusoidal is fixed; some models (BERT) use learned position parameters |
| Relative position | the model can learn to interpret distance, not just absolute index |
5.7.2 Data Flow
Embedding [seq_len, d_model] <- semantic content
+
Position [seq_len, d_model] <- sinusoidal position encoding
=
Input [seq_len, d_model] <- fed into the first Transformer block
5.7.3 Core Takeaway
Positional encoding solves the Transformer's "positional blindness." By adding a position vector to each token embedding, the model can distinguish "The agent tagged the reviewer" from "The reviewer tagged the agent" — same tokens, different meaning, different position vectors.
Chapter Checklist
After this chapter, you should be able to:
- Explain why the Transformer needs explicit position information, given that it processes all tokens in parallel rather than sequentially.
- Describe sinusoidal encoding as a multi-frequency wave barcode: each position gets a unique combination of sine and cosine values.
- Reproduce the formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and explain the two main ideas it contains.
- Explain why addition is used instead of concatenation, and what the practical trade-off is.
- Distinguish between fixed sinusoidal encodings (original Transformer) and learned positional embeddings (BERT, etc.).
See You in the Next Chapter
That is enough for positional encoding. If you can explain why "The agent tagged the reviewer" and "The reviewer tagged the agent" produce different model outputs even though they share every token, you have internalized this chapter.
The input to the Transformer blocks is now complete: semantic information from embeddings, plus location information from positional encoding.
Chapter 6 introduces two small but essential mathematical tools that appear everywhere inside the Transformer: LayerNorm, which keeps numbers in a well-behaved range, and Softmax, which turns raw scores into probability distributions.