One-sentence summary: Q is "what I am looking for," K is "what label I carry," and V is "what I actually contain." Attention uses Q to match against K, finds the relevant V vectors, and computes a weighted sum.
10.1 What This Chapter Covers
The previous chapter built the geometric intuition: dot product as a similarity measure.
But several questions are still open:
- What do Q, K, and V actually mean?
- How are they generated from the input?
- How does the shape change at each step of the computation?
This chapter traces the complete Attention computation, step by step, keeping track of every dimension change.
10.2 The Input Shape
10.2.1 Understanding the Input Dimensions
In practice, training processes multiple sequences at once. The input tensor has shape:
X: [batch_size, ctx_length, d_model]
Using concrete numbers from the diagram:
- batch_size = 4: four sequences processed in parallel
- ctx_length = 16: each sequence has 16 tokens
- d_model = 512: each token is represented as a 512-dimensional vector
10.2.2 A Concrete Example
Imagine four prompts going through training at the same time:
1. "The agent opened a pull request for the..."
2. "The reviewer left a comment on line..."
3. "A tool call returned an error because..."
4. "The workflow passed all checks and was..."
Each prompt is cut into 16 tokens, and each token is represented as a 512-dimensional vector.
Total input shape: [4, 16, 512]
- 4 sequences
- 16 positions each
- 512 dimensions per position
10.2.3 The Three Dimensions
| Dimension | Name | Meaning |
|---|---|---|
| batch_size | batch | how many sequences we process at once |
| ctx_length | context length | how many tokens per sequence |
| d_model | model dimension | width of each token vector |
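To make the shape concrete, here is a minimal sketch that builds a tensor with the chapter's running dimensions; torch.randn stands in for real token embeddings:

```python
import torch

# Dimensions from the chapter's running example.
batch_size, ctx_length, d_model = 4, 16, 512

# Random stand-in for a batch of token embeddings.
X = torch.randn(batch_size, ctx_length, d_model)

print(X.shape)  # torch.Size([4, 16, 512])
```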
10.3 Generating Q, K, and V
10.3.1 The Core Idea
Q, K, and V all come from the same input X, through three different weight matrices:
Q = X @ Wq
K = X @ Wk
V = X @ Wv
These weight matrices — Wq, Wk, and Wv — are learnable parameters. They start randomly initialized and get tuned during training.
10.3.2 Dimension Calculation
Take generating Q as an example:
X: [4, 16, 512] (batch_size, ctx_length, d_model)
Wq: [512, 512] (d_model, d_model)
Q: [4, 16, 512] (batch_size, ctx_length, d_model)
The matrix multiplication rule is [..., A, B] @ [B, C] = [..., A, C], so:
[4, 16, 512] @ [512, 512] = [4, 16, 512]
Q, K, and V have exactly the same shape as the input X.
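The three projections can be sketched directly; random matrices stand in for the trained Wq, Wk, Wv:

```python
import torch

batch_size, ctx_length, d_model = 4, 16, 512
X = torch.randn(batch_size, ctx_length, d_model)

# Random stand-ins for the learned projection matrices.
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

Q = X @ Wq  # [4, 16, 512]
K = X @ Wk  # [4, 16, 512]
V = X @ Wv  # [4, 16, 512]

print(Q.shape)  # torch.Size([4, 16, 512]) -- same shape as X
```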
10.3.3 Why Three Different Matrices?
You might ask: if the output shape is the same, why bother with three separate matrices?
Because Q, K, and V play different roles:
- Q (Query): "What information am I looking for?"
- K (Key): "What information can I be found by?"
- V (Value): "What content do I offer when selected?"
By learning separate Wq, Wk, and Wv, the model learns to project the same input token into three distinct semantic spaces — one for searching, one for being searched, one for carrying content.
10.3.4 An Analogy Worth Keeping
Think of a code review workflow:
| Role | Analogy | Function |
|---|---|---|
| Query (Q) | The reviewer's comment | "I'm looking for context about this function" |
| Key (K) | Each file's label or header | "This file handles authentication" |
| Value (V) | The actual file content | The full implementation text |
When you review:
- Your query (comment/question) matches against each file's key (description)
- Files with high match scores get their content surfaced
- The most relevant content is aggregated into the response
Attention works the same way.
10.4 First Matrix Multiplication: Q @ K^T
10.4.1 Computing the Similarity Matrix
With Q and K in hand, the next step is computing their similarity:
scores = Q @ K^T
K needs to be transposed (K^T) so that Q's rows (one per token) dot-product against K's rows (also one per token).
10.4.2 Dimension Change
Q: [4, 16, 128] (batch_size, ctx_length, d_key)
K^T: [4, 128, 16] (batch_size, d_key, ctx_length)
Result: [4, 16, 16] (batch_size, ctx_length, ctx_length)
The d_key = 128 here is because Multi-Head Attention splits d_model across heads. Each head gets d_key = d_model / num_heads = 512 / 4 = 128. We cover that split in the next chapter.
10.4.3 What the Result Means
The result is a [4, 16, 16] tensor:
- 4 sequences
- Each sequence has a 16×16 "score matrix"
- Position (i, j) holds: how much token i should attend to token j
Here is a simplified 4×4 layout for a short prompt "agent reviews PR." (four tokens, counting the period):

          agent    reviews  PR       .
agent   [ q0·k0    q0·k1    q0·k2    q0·k3 ]
reviews [ q1·k0    q1·k1    q1·k2    q1·k3 ]
PR      [ q2·k0    q2·k1    q2·k2    q2·k3 ]
.       [ q3·k0    q3·k1    q3·k2    q3·k3 ]

Each cell (i, j) is the dot product of Q row i and K row j (which becomes column j of K^T).
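A quick shape check of this step, using the per-head d_key = 128 from the text:

```python
import torch

batch_size, ctx_length, d_key = 4, 16, 128
Q = torch.randn(batch_size, ctx_length, d_key)
K = torch.randn(batch_size, ctx_length, d_key)

# Swap the last two dims of K so each Q row dots against each K row.
scores = Q @ K.transpose(-2, -1)

print(scores.shape)  # torch.Size([4, 16, 16])
```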
10.5 Scale: Why Divide by sqrt(d_key)
10.5.1 The Scaling Step
The raw scores from Q @ K^T need to be scaled down:
Attention Scores = (Q @ K^T) / sqrt(d_key)
After this division, values that might have been in the range of tens or hundreds compress into a much tighter range.
10.5.2 Why Scale?
The problem: when d_key is large (say, 128), dot products get large too.
dot product = sum(q_i × k_i) # summing 128 multiplied pairs
If each q_i and k_i has variance 1, the dot product has variance approximately d_key.
The consequence: large values drive Softmax to extremes.
Softmax([100, 1, 2]) ≈ [1.000, 0.000, 0.000] # one winner takes all
Softmax([1.0, 0.1, 0.2]) ≈ [0.54, 0.22, 0.24] # smooth distribution
Sharp Softmax means near-zero gradients for most positions. Training stalls.
The fix: divide by sqrt(d_key) to return the scores to a reasonable variance.
dot product / sqrt(128) ≈ dot product / 11.3
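The variance argument can be verified empirically. Under the assumption that each entry of q and k is drawn with variance 1, the raw dot products land near variance d_key, and scaling pulls them back to roughly 1:

```python
import torch

torch.manual_seed(0)
d_key = 128
n = 100_000

# Entries with variance 1, as in the argument above.
q = torch.randn(n, d_key)
k = torch.randn(n, d_key)

dots = (q * k).sum(dim=-1)       # n raw dot products
scaled = dots / d_key ** 0.5     # the Scale step

print(dots.var().item())    # close to 128
print(scaled.var().item())  # close to 1
```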
10.5.3 Where It Sits in the Formula
The division by sqrt(d_key) is the Scale step in the full formula:

Attention(Q, K, V) = Softmax((Q @ K^T) / sqrt(d_key)) @ V
10.6 Mask: Preventing "Peeking" at the Future
10.6.1 Why Masking Is Needed
In a GPT-style autoregressive model, predicting the next token must not use future tokens. But Q @ K^T computes similarity across all positions, including future ones.
Think of it this way: if a model is training on "The agent opened a pull request and merged it," it must not, when processing the word "merged," be able to see what came after it in the training sequence.
10.6.2 How the Mask Works
The solution is a triangular mask that fills future positions with negative infinity:
Before mask: After mask:
[0.3, 0.2, 0.1, 0.4] → [0.3, -inf, -inf, -inf]
[0.2, 0.5, 0.2, 0.1] → [0.2, 0.5, -inf, -inf]
[0.1, 0.3, 0.4, 0.2] → [0.1, 0.3, 0.4, -inf]
[0.2, 0.1, 0.3, 0.4] → [0.2, 0.1, 0.3, 0.4 ]
The upper-right triangle (future positions) becomes -inf.
10.6.3 Why -inf?
Because Softmax maps -inf to exactly 0:
Softmax([0.3, -inf, -inf, -inf]) = [1.0, 0.0, 0.0, 0.0]
After Softmax, future positions carry zero weight. The model cannot read from them.
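Mask and Softmax together can be sketched on the 4×4 example above, using torch.triu to select the future positions:

```python
import torch
import torch.nn.functional as F

# The 4x4 score matrix from the example above.
scores = torch.tensor([[0.3, 0.2, 0.1, 0.4],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.1, 0.3, 0.4, 0.2],
                       [0.2, 0.1, 0.3, 0.4]])

# True strictly above the diagonal = future positions to block.
future = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
masked = scores.masked_fill(future, float('-inf'))

weights = F.softmax(masked, dim=-1)
print(weights[0])  # tensor([1., 0., 0., 0.])
```

The first row collapses to [1, 0, 0, 0]: with every other entry at -inf, Softmax gives the one surviving position all the weight.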
10.7 Softmax: Converting Scores to Probabilities
10.7.1 The Conversion
After masking, we apply Softmax row-by-row:
Before Softmax: [0.32, 0.04, -inf, -inf, ...]
After Softmax: [0.57, 0.43, 0.00, 0.00, ...]
10.7.2 What Softmax Does
- Normalizes: each row sums to 1
- Amplifies differences: larger values become even larger relative to smaller ones
- Handles -inf: maps them to exactly 0
10.7.3 Reading the Pattern
For the first token in a masked sequence, it can only attend to itself, so its row becomes [1.00, 0.00, 0.00, ...]. The second token can attend to positions 0 and 1, so its row might look like [0.61, 0.39, 0.00, ...]. Later tokens have more positions to spread attention across, so their rows are more diffuse.
This is the attention weight matrix — how much each position should draw from every other position.
10.8 Second Matrix Multiplication: Attention Weights @ V
10.8.1 Weighted Sum
With the attention weights computed, the final step is using them to blend the value vectors:
Output = Attention_Weights @ V
10.8.2 Dimension Change
Attention_Weights: [4, 4, 16, 16] (batch, heads, ctx_len, ctx_len)
V: [4, 4, 16, 128] (batch, heads, ctx_len, d_key)
Output: [4, 4, 16, 128] (batch, heads, ctx_len, d_key)
The multi-head structure here (4 heads) is the topic of the next chapter.
10.8.3 What This Step Does
Each output position is a weighted average of all V vectors, where the weights come from the attention scores:
output[i] = sum(attention_weight[i, j] × V[j])
If token i puts 70% of its attention on token j and 30% on token k:
output[i] = 0.7 × V[j] + 0.3 × V[k]
The output is a context-aware blend, not a copy of any single token.
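The 70/30 blend from the text, with two small hypothetical value vectors:

```python
import torch

# Hypothetical 2-dimensional value vectors for tokens j and k.
V_j = torch.tensor([1.0, 2.0])
V_k = torch.tensor([3.0, 4.0])

# 70% attention on token j, 30% on token k.
output_i = 0.7 * V_j + 0.3 * V_k

print(output_i)  # tensor([1.6000, 2.6000])
```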
10.9 What the Attention Output Means
10.9.1 Output Dimensions
After Attention @ V, each token has a new vector:
Output: [batch_size, ctx_length, d_key] = [4, 16, 128]
(In the multi-head case, outputs from all heads are concatenated — that is the next chapter.)
10.9.2 The Semantic Shift
Here is the important part. Before Attention, each token's vector only encoded its own information. After Attention, each token's vector is a weighted combination of the whole context.
The token embedding for "merged" started as a representation of that word in isolation. After Attention, it carries information from the surrounding tokens — who performed the action, what PR was involved, what came before.
That is how the model "understands" context. Not magic — just a learned, weighted average of value vectors.
10.9.3 The Loop
The Attention output:
- Replaces the input embedding for that position
- Feeds into the next component (FFN or the next block)
- Gets refined layer by layer
Each block adds more context into each token's representation.
10.10 The Full Attention Computation
10.10.1 Step-by-Step
Step 1: Generate Q, K, V
Q = X @ Wq [4, 16, 512]
K = X @ Wk [4, 16, 512]
V = X @ Wv [4, 16, 512]
↓
Step 2: Compute similarity
scores = Q @ K^T [4, 16, 16]
↓
Step 3: Scale
scores = scores / sqrt(d_key) [4, 16, 16]
↓
Step 4: Mask (decoder-only models)
scores[future] = -inf [4, 16, 16]
↓
Step 5: Softmax
weights = softmax(scores) [4, 16, 16]
↓
Step 6: Weighted sum
output = weights @ V [4, 16, 512]
10.10.2 PyTorch Implementation
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention.

    Q: [batch, seq_len, d_k]
    K: [batch, seq_len, d_k]
    V: [batch, seq_len, d_v]
    mask: [batch, 1, seq_len] or [batch, seq_len, seq_len]

    Returns:
        output: [batch, seq_len, d_v]
        attention_weights: [batch, seq_len, seq_len]
    """
    d_k = Q.size(-1)

    # Step 2: Q @ K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))

    # Step 3: Scale
    scores = scores / (d_k ** 0.5)

    # Step 4: Mask
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 5: Softmax
    attention_weights = F.softmax(scores, dim=-1)

    # Step 6: Weighted sum
    output = torch.matmul(attention_weights, V)

    return output, attention_weights
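A quick sanity check of this function (restated here so the snippet runs on its own), called with a causal mask built from torch.tril. Using X directly as Q, K, and V skips the projections purely for illustration:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Same six steps as the chapter's implementation.
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

batch, seq_len, d_model = 4, 16, 512
X = torch.randn(batch, seq_len, d_model)

# Causal mask: 1 on and below the diagonal (current and past positions).
causal = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

output, weights = attention(X, X, X, mask=causal)
print(output.shape)       # torch.Size([4, 16, 512])
print(weights[0, 0, :2])  # first token attends only to itself: [1., 0.]
```

Each row of weights sums to 1, and row 0 puts all its weight on position 0, matching the masked-Softmax pattern described in 10.7.3.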
10.11 Deeper Understanding of Q, K, and V
10.11.1 Role Summary
| Role | Generated by | Purpose | Used in |
|---|---|---|---|
| Q | X @ Wq | "What I am looking for" | Q @ K^T |
| K | X @ Wk | "What I advertise" | Q @ K^T |
| V | X @ Wv | "What I carry" | Attention @ V |
10.11.2 Why Separate K from V?
K and V both come from the same input. Why use two different matrices?
To decouple matching from extraction.
- K controls which positions get attention
- V controls what information flows when attention is given
This separation gives the model flexibility. A token can make itself easy to find (high K dot-product with many queries) while carrying very different actual content (V).
10.11.3 An Example
Consider: "The agent merged the pull request after review."
When processing "merged":
- Q("merged") is looking for: "who performed this action?"
- K("agent") signals: "I am a subject performing an action"
- V("agent") carries: the agent's semantic content
Q and K matching tells the model that "merged" should attend to "agent." V delivers the actual information.
10.12 Chapter Summary
10.12.1 Key Concepts
| Concept | Shape | Meaning |
|---|---|---|
| X | [batch, seq, d_model] | Input tensor |
| Wq / Wk / Wv | [d_model, d_model] | Learnable projection matrices |
| Q | [batch, seq, d_model] | Query: what I am looking for |
| K | [batch, seq, d_model] | Key: what I advertise |
| V | [batch, seq, d_model] | Value: what I carry |
| Scores | [batch, seq, seq] | Similarity matrix |
| Weights | [batch, seq, seq] | Attention probabilities (rows sum to 1) |
| Output | [batch, seq, d_model] | Context-aware token representations |
10.12.2 Computation Flow
X → [Wq, Wk, Wv] → Q, K, V
↓
Q @ K^T (similarity)
↓
/ sqrt(d_key) (scale)
↓
Mask (block the future)
↓
Softmax (normalize)
↓
@ V (weighted sum)
↓
Output
10.12.3 Core Takeaway
Q, K, and V are the three players in Attention. Q queries, K labels, V carries content. Match Q against K to find relevant positions; weight-average V by those scores to produce a context-enriched representation. That is how Attention helps the model understand language.
Chapter Checklist
After this chapter, you should be able to:
- Explain what Q, K, and V each represent.
- Describe how they are generated from the same input X.
- Trace the dimension changes through each step of Attention.
- Explain the causal mask and why it uses -inf.
- Explain why the sqrt(d_key) scale factor exists.
See You in the Next Chapter
That was single-head Attention, end to end.
Real Transformers run this whole process from multiple angles simultaneously. Chapter 11 explains Multi-Head Attention: how the model looks at relationships in parallel and combines everything back together.