One-sentence summary: Positional encoding evolved from "add an absolute vector before Attention" to "rotate Q and K inside Attention" and "penalize Attention by distance," with each generation buying longer context and better relative-position behavior.


25.1 The Big Picture

In Chapter 5 we met the original Sinusoidal encoding from the 2017 paper. It was clever and it worked well enough for 512-token sequences. Fast-forward to 2024 and production models routinely handle 128k tokens. That seven-year gap forced the field to completely rethink how position information flows into the model.

[Figure: Timeline of positional encoding methods from Sinusoidal through RoPE, ALiBi, and YaRN]

25.1.1 What went wrong with the original scheme

Sinusoidal encoding has two problems that compound each other.

The first is absolute position. The model learns patterns tied to specific slot numbers: "position 37 looks like this." At inference time, the moment you feed a sequence longer than the training length, the model encounters absolute positions it has never seen. Performance collapses.

The second is the injection point. Sinusoidal adds a position vector to the embedding before Attention. By the time Q and K compute their dot product, the position signal has been linearly mixed with the semantic signal through the weight matrices:

Q = (x_m + PE_m) × W_Q
K = (x_n + PE_n) × W_K
Q · K = cross-terms that tangle semantic and position

You get four cross-terms when you expand that dot product. Position and content are inseparable, which makes it hard for the model to learn clean relative patterns.
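Writing x_m and x_n for the token embeddings at positions m and n, the expansion looks like this (a sketch in the row-vector convention used above):

```latex
% Expanding Q . K with Q = (x_m + PE_m) W_Q and K = (x_n + PE_n) W_K:
Q K^\top
  = \underbrace{x_m W_Q W_K^\top x_n^\top}_{\text{content--content}}
  + \underbrace{x_m W_Q W_K^\top PE_n^\top}_{\text{content--position}}
  + \underbrace{PE_m W_Q W_K^\top x_n^\top}_{\text{position--content}}
  + \underbrace{PE_m W_Q W_K^\top PE_n^\top}_{\text{position--position}}
```

Only the first term is purely semantic; the other three mix position into every score through the same learned weight matrices.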

Traditional learned relative encodings (T5-style) fixed the relative problem but broke KV Cache compatibility. Every new token would require recomputing attention over the full history because the relative position table changes as the sequence grows.

The question the field kept asking: can we get relative position behavior and keep KV Cache working?

25.1.2 The five mainstream schemes

Method       Full name                       Representative models
Sinusoidal   Sine/Cosine Position Embedding  original Transformer
T5 Relative  Learned Relative Embeddings     T5, mT5
RoPE         Rotary Position Embedding       LLaMA, GPT-NeoX, Mistral
YaRN         Yet another RoPE extensioN      Code Llama, Qwen
ALiBi        Attention with Linear Biases    BLOOM, MPT

This chapter covers RoPE, ALiBi, and YaRN in depth. The others appear where needed for contrast.


25.2 Sinusoidal: The Additive Baseline

Quick recap before we move forward.

25.2.1 The formula

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Each position gets a deterministic vector with the same dimension as the token embedding. The model adds this vector to the embedding table lookup before anything else happens.
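A minimal sketch of the formula in pure Python (the function name sinusoidal_pe is ours, not from any library):

```python
import math

def sinusoidal_pe(pos, d_model):
    # One deterministic position vector of length d_model.
    # Even indices get sin, odd indices get cos, per the formula above.
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

# Position 0 is all (sin(0), cos(0)) = (0, 1) pairs:
print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```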

Embedding vector (semantic content):

Token      dim 1   dim 2   dim 3   dim 4
"agent"     0.62   -0.51    0.09    0.85
"opened"    0.07    0.23   -0.40    0.11

Position vector (slot information):

Position   dim 1   dim 2   dim 3   dim 4
1           0.00    1.00    0.00    1.00
2           0.84    0.54    0.68    0.73
3           0.90   -0.41    0.99    0.07

Input to Transformer = Embedding + Position.

25.2.2 Why "before Attention" is the weak point

Once you add position before Attention and pass through W_Q and W_K, relative distance becomes implicit. The model can learn to extract it, but it has to work harder. And when you ask the model to process position 5000 when it was trained only up to 4096, it encounters absolute slots it has never seen. There is no graceful degradation, just confusion.

Early GPT models had strict context limits for exactly this reason. You get what you train on and nothing beyond.


25.3 RoPE: Rotary Position Embedding

RoPE was proposed by Su Jianlin in 2021 and quickly became the dominant method for decoder-only LLMs. LLaMA 1, LLaMA 2, Mistral, GPT-NeoX, and many others use it.

[Figure: RoPE rotates Q and K vectors by position-dependent angles in 2D subspaces]

25.3.1 The revolutionary idea: from addition to rotation

Instead of adding a position vector to the token embedding, RoPE rotates the Q and K vectors inside Attention using a rotation matrix that depends on position.

The contrast in one sentence:

  • Sinusoidal: (embedding + position_vector) → multiply through W_Q and W_K
  • RoPE: multiply through W_Q and W_K → rotate by position angle

This sounds like a small difference. The consequences are large.

25.3.2 2D geometric intuition

Start in 2D. You have two vectors w1 and w2 in the plane. If you rotate both by the same angle theta, their relative angle is unchanged. And since dot product depends only on the angle between vectors and their magnitudes:

w1 · w2 = |w1| |w2| cos(angle)

rotating both by the same theta leaves the dot product the same. Rotating by different amounts changes the dot product in a way that depends on the difference of the rotation angles.

That is the key geometric fact. If you rotate Q at position m by angle m * theta and rotate K at position n by angle n * theta, their dot product ends up depending on (n - m) * theta---the relative distance.

The 2D rotation matrix is:

[w1']   [cos(theta)  -sin(theta)] [w1]
[w2'] = [sin(theta)   cos(theta)] [w2]

25.3.3 Extending to high dimensions

Real query and key vectors have head_dim dimensions (typically 64, 128, or more). RoPE handles this by splitting the vector into head_dim / 2 pairs of dimensions and rotating each pair independently:

  • dimensions 1-2: angle m * theta_1
  • dimensions 3-4: angle m * theta_2
  • ...
  • dimensions (head_dim-1) to head_dim: angle m * theta_{head_dim/2}

Each pair uses a different base frequency:

theta_i = 10000^(-2(i-1)/d),   i = 1, 2, ..., d/2

For a d=6 vector, the three pairs get angles:

Position m   Pair 1 (theta=0.1)   Pair 2 (theta=0.2)   Pair 3 (theta=0.4)
m=0          0.0                  0.0                  0.0
m=1          0.1                  0.2                  0.4
m=2          0.2                  0.4                  0.8
m=3          0.3                  0.6                  1.2

25.3.4 The full rotation matrix

For a d-dimensional vector at position m, RoPE applies a block-diagonal rotation matrix R_m:

        [cos(m*θ₁)  -sin(m*θ₁)    0              0          ...  0              0         ]
        [sin(m*θ₁)   cos(m*θ₁)    0              0          ...  0              0         ]
R_m =   [    0            0    cos(m*θ₂) -sin(m*θ₂)  ...  0              0         ]
        [    0            0    sin(m*θ₂)  cos(m*θ₂)  ...  0              0         ]
        [   ...          ...       ...        ...     ...  ...            ...        ]
        [    0            0        0              0    ... cos(m*θ_{d/2}) -sin(m*θ_{d/2})]
        [    0            0        0              0    ... sin(m*θ_{d/2})  cos(m*θ_{d/2})]

Each 2×2 block is an independent rotation. The matrix is sparse and efficient to apply.

25.3.5 Why relative position emerges automatically

This is the elegant part. Let q_m be the query at position m and k_n be the key at position n.

After applying RoPE:

  • q_m' = R_m * q_m
  • k_n' = R_n * k_n

The Attention score becomes:

q_m' · k_n' = (R_m * q_m) · (R_n * k_n)
            = q_mᵀ * R_mᵀ * R_n * k_n
            = q_mᵀ * R_{n-m} * k_n

The last step uses the rotation matrix property R_mᵀ R_n = R_{n-m}. The Attention score depends only on the relative distance (n - m), not on the absolute positions m and n separately. Relative position emerges from the math without any extra machinery.

KV Cache still works because k_n' is a deterministic function of k_n and position n alone. When you extend the sequence by one token, you compute k_n' for that token and append it to the cache. No recomputation of older keys required.
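The relative-position claim is easy to check numerically. This sketch (helper names are ours) applies per-pair rotation to toy 4-dimensional q and k vectors and verifies that the score matches for two position pairs with the same offset n - m:

```python
import math

def rope(vec, m, thetas):
    # Rotate each 2D pair of vec by m * theta_i.
    out = []
    for i, theta in enumerate(thetas):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(m * theta), math.sin(m * theta)
        out += [c * a - s * b, s * a + c * b]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

thetas = [1.0, 0.01]  # two frequency pairs, d = 4
q = [0.2, 0.7, -0.3, 0.5]
k = [0.9, -0.1, 0.4, 0.8]

score_a = dot(rope(q, 2, thetas), rope(k, 5, thetas))    # m=2,  n=5
score_b = dot(rope(q, 10, thetas), rope(k, 13, thetas))  # m=10, n=13
assert abs(score_a - score_b) < 1e-9  # both offsets are n - m = 3
```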

25.3.6 Long-distance decay

RoPE has one more nice property: as the relative distance increases, the upper bound on the Attention score decreases. This matches the empirical observation that nearby tokens are usually more relevant than distant ones. The model gets a soft locality prior without any hand-engineered falloff.

25.3.7 Efficient implementation with complex numbers

Multiplying dense rotation matrices is expensive. The standard implementation uses complex arithmetic:

import torch

def apply_rope(x, freqs):
    # x: [batch, seq_len, n_heads, head_dim]
    # freqs: [seq_len, head_dim // 2]

    # treat pairs of reals as complex numbers
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))

    # build e^(i * theta) for each position and dimension pair
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)

    # reshape to [1, seq_len, 1, head_dim // 2] so it broadcasts over batch and heads
    freqs_complex = freqs_complex.unsqueeze(0).unsqueeze(2)

    # complex multiply = rotation
    x_rotated = x_complex * freqs_complex

    # back to reals
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)

Complex multiplication (a + bi)(c + di) = (ac - bd) + (ad + bc)i is exactly the 2D rotation formula. You get the rotation at the cost of four multiplications and two additions per pair---much cheaper than a full matrix multiply.
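You can verify the equivalence without torch: Python's built-in complex type gives the same numbers as the 2×2 rotation matrix (a pure-Python check, not the library implementation):

```python
import cmath
import math

theta = 0.7
w1, w2 = 0.6, -0.8

# Rotation-matrix version
r1 = math.cos(theta) * w1 - math.sin(theta) * w2
r2 = math.sin(theta) * w1 + math.cos(theta) * w2

# Complex-multiplication version: (w1 + i*w2) * e^(i*theta)
z = complex(w1, w2) * cmath.exp(1j * theta)

assert abs(z.real - r1) < 1e-9 and abs(z.imag - r2) < 1e-9
```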


25.4 ALiBi: Attention with Linear Biases

Press et al. proposed ALiBi in 2021. It takes a completely different philosophy from RoPE.

[Figure: ALiBi adds a linear distance bias directly to Attention logits]

25.4.1 The idea: penalize distance directly

ALiBi does not touch embeddings or Q/K vectors at all. It adds a penalty to the Attention scores after the Q·K dot product:

Standard:  Attention = softmax(Q Kᵀ / sqrt(d)) V
ALiBi:     Attention = softmax(Q Kᵀ / sqrt(d) + m * bias) V

The bias matrix encodes relative distance with a simple triangular structure:

bias =  [  0                   ]  (query at position 1)
        [ -1   0               ]  (query at position 2)
        [ -2  -1   0           ]  (query at position 3)
        [ -3  -2  -1   0       ]  (query at position 4)
        [ -4  -3  -2  -1   0   ]  (query at position 5)

m is a per-head slope. The slopes are not learned---they are fixed at initialization as powers of 2 spaced across the number of heads.
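For example, with 8 heads the slopes come out as successive halvings (a quick check of the schedule above; the ALiBi paper uses exactly this geometric sequence for power-of-two head counts):

```python
n_heads = 8
slopes = [2 ** (-8 / n_heads * k) for k in range(1, n_heads + 1)]
print(slopes)  # [0.5, 0.25, 0.125, ..., 0.00390625]
assert slopes[0] == 0.5 and slopes[-1] == 2 ** -8
```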

25.4.2 What this does to Attention

Say the current query is at position 5, attending to positions 1 through 5. With slope m=1:

Target       Distance   Bias   Effect after softmax
position 1   4          -4     strongly suppressed
position 2   3          -3     significantly suppressed
position 3   2          -2     moderately suppressed
position 4   1          -1     slightly suppressed
position 5   0           0     no change

Nearby tokens get more weight. The bias implements locality without any learned parameters.
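The suppression is easy to see in numbers. With equal raw scores, the bias alone tilts the softmax toward recent positions (a pure-Python sketch with slope m = 1):

```python
import math

raw = [1.0] * 5             # pretend Q.K gave equal scores everywhere
bias = [-4, -3, -2, -1, 0]  # distance penalty for a query at position 5
logits = [r + b for r, b in zip(raw, bias)]

exps = [math.exp(l) for l in logits]
total = sum(exps)
weights = [e / total for e in exps]

# Weights rise monotonically toward the most recent position.
assert all(weights[i] < weights[i + 1] for i in range(4))
```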

25.4.3 Why ALiBi extrapolates well

The bias matrix is deterministic and position-invariant. You can compute it for any sequence length without training data. A model trained at 1024 tokens encounters positions it has never seen at inference time for lengths of 2048 or 4096---but the bias formula is the same. Experiments show ALiBi models trained at 1024 can extrapolate to 2048+ with minimal degradation.

That is the reason BLOOM (176B) and MPT-7B chose ALiBi for their architecture. They wanted aggressive long-context extrapolation without additional fine-tuning.

25.4.4 Implementation

import torch

def alibi_bias(n_heads, seq_len):
    # per-head slopes: 2^(-8/n_heads * 1), 2^(-8/n_heads * 2), ...
    slopes = 2 ** (-8 / n_heads * torch.arange(1, n_heads + 1, dtype=torch.float32))

    # relative distance matrix: entry [i, j] = j - i
    # (negative for past positions; future positions are removed by the causal mask)
    positions = torch.arange(seq_len)
    distances = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]

    # bias = slope * distance, broadcast across heads
    bias = slopes.view(-1, 1, 1) * distances.unsqueeze(0)  # [heads, seq, seq]
    return bias

No extra parameters. No calibration data. The simplicity is the point.

25.4.5 Tradeoffs

Strengths:

  • Essentially free to implement
  • No additional parameters
  • Strong zero-shot extrapolation

Weaknesses:

  • Simple linear penalty may be too coarse for tasks needing precise long-range information
  • Some retrieval-heavy tasks do better with RoPE
  • The slope schedule is a fixed hyperparameter choice, not tunable per task

25.5 YaRN: Extending RoPE Beyond Training Length

RoPE's relative-position behavior is excellent within the training context length. Beyond it, the model encounters rotation angles it has never been trained on, and performance degrades. YaRN (Yet another RoPE extensioN) is designed specifically to fix this.

[Figure: YaRN rescales RoPE frequencies to handle context lengths beyond training]

25.5.1 The extrapolation problem

Suppose a model is trained with a 4k context. At inference with 8k tokens:

  • Positions 1-4000: familiar rotation angles
  • Positions 4001-8000: rotation angles the model has never produced during training

The Attention patterns for the unfamiliar part of the sequence can collapse.

25.5.2 Position Interpolation (PI): the simple fix

The straightforward approach is to compress longer sequences into the trained range:

f'(x_m, m, theta) = f(x_m, m * L / L', theta)

where L is the training length (4k) and L' is the target length (8k). Position 8000 maps to position 4000. The model sees only familiar angles.

The cost: high-frequency pairs lose precision. Adjacent tokens that differ by one position now differ by half a position unit. Fine-grained local information gets blurred.
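In code the remapping is a single multiply (a sketch; the names follow the L and L' notation above):

```python
def interpolate_position(m, train_len, target_len):
    # Compress position m from the target range back into the trained range.
    return m * train_len / target_len

# Trained at 4k, running at 8k: position 8000 maps to position 4000.
assert interpolate_position(8000, 4000, 8000) == 4000.0
# Adjacent tokens now differ by half a position unit:
assert interpolate_position(1, 4000, 8000) == 0.5
```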

25.5.3 NTK-Aware Interpolation: smarter scaling

NTK-Aware interpolation applies different scale factors to different frequency bands:

  • Low-frequency pairs (long-range): scale more aggressively. They handle coarse positional identity.
  • High-frequency pairs (short-range): scale less. They encode local distinctions that must stay precise.

This is the scheme used in Code Llama, Qwen 7B, and several other models when their context windows were extended.
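One common way to express the idea is the "NTK-aware" scaled base described in the YaRN paper: instead of rescaling positions, grow the RoPE base so that low frequencies stretch by nearly the full factor while high frequencies barely move. A sketch (not any specific model's exact recipe):

```python
def ntk_scaled_base(base, scale, head_dim):
    # b' = b * s^(d / (d - 2)): the highest-frequency pair (exponent 0)
    # is unchanged, while the lowest-frequency pair shrinks by ~1/s.
    return base * scale ** (head_dim / (head_dim - 2))

new_base = ntk_scaled_base(10000.0, 2.0, 128)
assert new_base > 10000.0
```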

25.5.4 YaRN's complete formula

YaRN adds one more ingredient beyond NTK-Aware: Attention temperature scaling.

f'(x_m, m, theta) = f(x_m, g(m), h(theta))

The temperature adjustment:

softmax(Q_mᵀ K_n / (t * sqrt(d)))

where: sqrt(1/t) = 0.1 * ln(s) + 1,  s = L' / L

As the scale factor s grows (longer context), t adjusts the softmax temperature to keep the Attention distribution from becoming too diffuse across many more tokens.
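Solving the relation above for t gives a feel for the magnitudes (a sketch using the formula as stated):

```python
import math

def yarn_temperature(s):
    # sqrt(1/t) = 0.1 * ln(s) + 1  =>  t = 1 / (0.1 * ln(s) + 1)^2
    return 1.0 / (0.1 * math.log(s) + 1.0) ** 2

# No extension (s = 1): t = 1, standard softmax.
assert abs(yarn_temperature(1) - 1.0) < 1e-9
# 8x extension (4k -> 32k): t drops below 1, scaling logits up
# to sharpen an otherwise too-diffuse distribution.
assert yarn_temperature(8) < 1.0
```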

The practical win: a model trained on 4k tokens can be YaRN-extended to 32k or 128k with less than 0.1% of the original pretraining token count used for fine-tuning. For example:

  • Original: 4k context, trained on 1T tokens
  • YaRN extension: 32k context, fine-tuned on ~1B tokens (0.1%)

That is why Code Llama's extended variants and Qwen's long-context versions exist without full retraining.


25.6 Comparison

25.6.1 Technical comparison

Feature               Sinusoidal   RoPE                       ALiBi              YaRN
Injection point       embedding    Q and K inside Attention   Attention scores   Q and K inside Attention
Position type         absolute     relative                   relative           relative
Extrapolation         poor         medium                     strong             strong
KV Cache compatible   yes          yes                        yes                yes
Extra parameters      none         none                       none               none
Compute overhead      low          medium                     low                medium

25.6.2 Which model uses what

Model        Encoding             Context
GPT-3        learned absolute     2048
LLaMA 1      RoPE                 2048
LLaMA 2      RoPE                 4096
Code Llama   RoPE + YaRN          16384
Mistral 7B   RoPE                 8192
BLOOM        ALiBi                2048
MPT-7B       ALiBi                65536
Qwen         RoPE + Dynamic NTK   8192-32768

25.6.3 Decision guide

Use RoPE if:

  • You need precise local and mid-range position information
  • You are working within the training context length
  • You want compatibility with the LLaMA/Mistral ecosystem

Use ALiBi if:

  • You need strong zero-shot length extrapolation
  • You want the simplest possible implementation
  • Memory and compute are tight

Use YaRN if:

  • You have an existing RoPE model and need to extend context length
  • You have a small fine-tuning budget (1B tokens or less)
  • Target length is 16k, 32k, or higher

25.7 Implementation Reference

25.7.1 RoPE frequencies

import torch

def precompute_freqs(head_dim, max_seq_len, theta=10000.0):
    # theta_i = 1 / (theta ^ (2i / head_dim))
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    # outer product: [seq_len, head_dim // 2]
    return torch.outer(positions, freqs)

25.7.2 ALiBi slopes

import torch

def get_alibi_slopes(n_heads):
    # ratio between adjacent slopes: 2^(-8/n_heads)
    ratio = 2 ** (-8 / n_heads)
    # slopes for each head: ratio, ratio^2, ratio^3, ...
    return ratio ** torch.arange(1, n_heads + 1, dtype=torch.float32)

25.8 Chapter Summary

25.8.1 Key concepts

Concept                Meaning
RoPE                   Rotates Q and K by position-dependent angle; relative distance falls out of the dot product
ALiBi                  Adds a linear distance penalty to Attention scores after Q·K
YaRN                   Rescales RoPE frequency bands to handle context beyond training length
Absolute vs relative   Sinusoidal encodes slot number; RoPE and ALiBi encode distance
Extrapolation          Behavior when inference length exceeds training length

25.8.2 The evolution in one diagram

2017: Sinusoidal
  | Problem: absolute position, poor extrapolation
  v
2021: RoPE (Su Jianlin)
  | Problem: degrades beyond training length
  v
2021: ALiBi (Press et al.)
  | Simple linear falloff, strong extrapolation
  v
2023: YaRN (Peng et al.)
  | Extends RoPE to 100k+ with minimal retraining
  v
Current: 128k+ contexts are standard

25.8.3 Core takeaway

The job of positional encoding is to tell the model who comes before whom. Sinusoidal uses addition to assign absolute addresses. RoPE uses rotation so relative distance appears from inside Attention. ALiBi uses a penalty to make far tokens speak more quietly. None is universally best---choose based on your context length requirements and ecosystem.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why Sinusoidal encoding struggles at long context.
  • Describe the geometric intuition behind RoPE using 2D rotation.
  • Derive why RoPE Attention scores depend on relative position (n-m) rather than absolute positions m and n.
  • Explain ALiBi's mechanism: what the bias matrix looks like and why it extrapolates.
  • Describe what Position Interpolation and NTK-Aware Interpolation do differently.
  • Explain YaRN's temperature adjustment and its role in long-context extension.
  • Choose the right encoding scheme for a given model architecture and context requirement.

See You in the Next Chapter

That is enough for position encoding. If you can explain why RoPE Attention scores are a function of (n-m) and not of m and n separately, you have internalized the core idea.

Now we move from architecture choices to adaptation. Chapter 26 covers LoRA and QLoRA---the practical workhorses for fine-tuning large models on commodity hardware.

Cite this page
Zhang, Wayland (2026). Chapter 25: Positional Encoding Evolution - Sinusoidal to RoPE to ALiBi. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-25-positional-encoding-evolution
@incollection{zhang2026transformer_chapter_25_positional_encoding_evolution,
  author = {Zhang, Wayland},
  title = {Chapter 25: Positional Encoding Evolution - Sinusoidal to RoPE to ALiBi},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-25-positional-encoding-evolution}
}