One-sentence summary: Q is "what I am looking for," K is "what label I carry," and V is "what I actually contain." Attention uses Q to match against K, finds the relevant V vectors, and computes a weighted sum.
10.1 What This Chapter Covers
The previous chapter built the geometric intuition: dot product as a similarity measure.
But several questions are still open:
- What do Q, K, and V actually mean?
- How are they generated from the input?
- How does the shape change at each step of the computation?
This chapter traces the complete Attention computation, step by step, keeping track of every dimension change.
10.2 The Input Shape
10.2.1 Understanding the Input Dimensions
In practice, training processes multiple sequences at once. The input tensor has shape:
X: [batch_size, ctx_length, d_model]
Using concrete numbers from the diagram:
- batch_size = 4: four sequences processed in parallel
- ctx_length = 16: each sequence has 16 tokens
- d_model = 512: each token is represented as a 512-dimensional vector
10.2.2 A Concrete Example
Imagine four prompts going through training at the same time:
1. "The agent opened a pull request for the..."
2. "The reviewer left a comment on line..."
3. "A tool call returned an error because..."
4. "The workflow passed all checks and was..."
Each prompt is cut into 16 tokens, and each token is represented as a 512-dimensional vector.
Total input shape: [4, 16, 512]
- 4 sequences
- 16 positions each
- 512 dimensions per position
10.2.3 The Three Dimensions
| Dimension | Name | Meaning |
|---|---|---|
| batch_size | batch | how many sequences we process at once |
| ctx_length | context length | how many tokens per sequence |
| d_model | model dimension | width of each token vector |
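To make the shape concrete, here is a minimal sketch that builds a tensor with the chapter's running dimensions; torch.randn stands in for real token embeddings:

```python
import torch

# Dimensions from the chapter's running example.
batch_size, ctx_length, d_model = 4, 16, 512

# Random stand-in for a batch of token embeddings.
X = torch.randn(batch_size, ctx_length, d_model)

print(X.shape)  # torch.Size([4, 16, 512])
```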
10.3 Generating Q, K, and V
10.3.1 The Core Idea
Q, K, and V all come from the same input X, through three different weight matrices:
Q = X @ Wq
K = X @ Wk
V = X @ Wv
These weight matrices — Wq, Wk, and Wv — are learnable parameters. They start randomly initialized and get tuned during training.
10.3.2 Dimension Calculation
Take generating Q as an example:
X: [4, 16, 512] (batch_size, ctx_length, d_model)
Wq: [512, 512] (d_model, d_model)
Q: [4, 16, 512] (batch_size, ctx_length, d_model)
The matrix multiplication rule is [..., A, B] @ [B, C] = [..., A, C], so:
[4, 16, 512] @ [512, 512] = [4, 16, 512]
Q, K, and V have exactly the same shape as the input X.
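The three projections can be sketched directly; random matrices stand in for the trained Wq, Wk, Wv:

```python
import torch

batch_size, ctx_length, d_model = 4, 16, 512
X = torch.randn(batch_size, ctx_length, d_model)

# Random stand-ins for the learned projection matrices.
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

Q = X @ Wq  # [4, 16, 512]
K = X @ Wk  # [4, 16, 512]
V = X @ Wv  # [4, 16, 512]

print(Q.shape)  # torch.Size([4, 16, 512]) -- same shape as X
```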
10.3.3 Why Three Different Matrices?
You might ask: if the output shape is the same, why bother with three separate matrices?
Because Q, K, and V play different roles:
- Q (Query): "What information am I looking for?"
- K (Key): "What information can I be found by?"
- V (Value): "What content do I offer when selected?"
By learning separate Wq, Wk, and Wv, the model learns to project the same input token into three distinct semantic spaces — one for searching, one for being searched, one for carrying content.
10.3.4 An Analogy Worth Keeping
Think of a code review workflow:
| Role | Analogy | Function |
|---|---|---|
| Query (Q) | The reviewer's comment | "I'm looking for context about this function" |
| Key (K) | Each file's label or header | "This file handles authentication" |
| Value (V) | The actual file content | The full implementation text |
When you review:
- Your query (comment/question) matches against each file's key (description)
- Files with high match scores get their content surfaced
- The most relevant content is aggregated into the response
Attention works the same way.
10.4 First Matrix Multiplication: Q @ K^T
10.4.1 Computing the Similarity Matrix
With Q and K in hand, the next step is computing their similarity:
scores = Q @ K^T
K needs to be transposed (K^T) so that Q's rows (one per token) dot-product against K's rows (also one per token).
10.4.2 Dimension Change
Q: [4, 16, 128] (batch_size, ctx_length, d_key)
K^T: [4, 128, 16] (batch_size, d_key, ctx_length)
Result: [4, 16, 16] (batch_size, ctx_length, ctx_length)
The d_key = 128 here is because Multi-Head Attention splits d_model across heads. Each head gets d_key = d_model / num_heads = 512 / 4 = 128. We cover that split in the next chapter.
10.4.3 What the Result Means
The result is a [4, 16, 16] tensor:
- 4 sequences
- Each sequence has a 16×16 "score matrix"
- Position (i, j) holds: how much token i should attend to token j
Here is a simplified 4×4 layout for a short prompt "agent reviews PR." (four tokens, counting the period):

          agent    reviews  PR       .
agent   [ q0·k0    q0·k1    q0·k2    q0·k3 ]
reviews [ q1·k0    q1·k1    q1·k2    q1·k3 ]
PR      [ q2·k0    q2·k1    q2·k2    q2·k3 ]
.       [ q3·k0    q3·k1    q3·k2    q3·k3 ]

Each cell (i, j) is the dot product of Q row i and K row j (which becomes column j of K^T).
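A quick shape check of this step, using the per-head d_key = 128 from the text:

```python
import torch

batch_size, ctx_length, d_key = 4, 16, 128
Q = torch.randn(batch_size, ctx_length, d_key)
K = torch.randn(batch_size, ctx_length, d_key)

# Swap the last two dims of K so each Q row dots against each K row.
scores = Q @ K.transpose(-2, -1)

print(scores.shape)  # torch.Size([4, 16, 16])
```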
10.5 Scale: Why Divide by sqrt(d_key)
10.5.1 The Scaling Step
The raw scores from Q @ K^T need to be scaled down:
Attention Scores = (Q @ K^T) / sqrt(d_key)
After this division, values that might have been in the range of tens or hundreds compress into a much tighter range.
10.5.2 Why Scale?
The problem: when d_key is large (say, 128), dot products get large too.
dot product = sum(q_i × k_i) # summing 128 multiplied pairs
If each q_i and k_i has variance 1, the dot product has variance approximately d_key.
The consequence: large values drive Softmax to extremes.
Softmax([100, 1, 2]) ≈ [1.000, 0.000, 0.000] # one winner takes all
Softmax([1.0, 0.1, 0.2]) ≈ [0.54, 0.22, 0.24] # smooth distribution
Sharp Softmax means near-zero gradients for most positions. Training stalls.
The fix: divide by sqrt(d_key) to return the scores to a reasonable variance.
dot product / sqrt(128) ≈ dot product / 11.3
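The variance argument can be verified empirically. Under the assumption that each entry of q and k is drawn with variance 1, the raw dot products land near variance d_key, and scaling pulls them back to roughly 1:

```python
import torch

torch.manual_seed(0)
d_key = 128
n = 100_000

# Entries with variance 1, as in the argument above.
q = torch.randn(n, d_key)
k = torch.randn(n, d_key)

dots = (q * k).sum(dim=-1)       # n raw dot products
scaled = dots / d_key ** 0.5     # the Scale step

print(dots.var().item())    # close to 128
print(scaled.var().item())  # close to 1
```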
10.5.3 Where It Sits in the Formula
The division by sqrt(d_key) is the Scale step in the full formula:

Attention(Q, K, V) = Softmax((Q @ K^T) / sqrt(d_key)) @ V
10.6 Mask: Preventing "Peeking" at the Future
10.6.1 Why Masking Is Needed
In a GPT-style autoregressive model, predicting the next token must not use future tokens. But Q @ K^T computes similarity across all positions, including future ones.
Think of it this way: if a model is training on "The agent opened a pull request and merged it," it must not, when processing the word "merged," be able to see what came after it in the training sequence.
10.6.2 How the Mask Works
The solution is a triangular mask that fills future positions with negative infinity:
Before mask: After mask:
[0.3, 0.2, 0.1, 0.4] → [0.3, -inf, -inf, -inf]
[0.2, 0.5, 0.2, 0.1] → [0.2, 0.5, -inf, -inf]
[0.1, 0.3, 0.4, 0.2] → [0.1, 0.3, 0.4, -inf]
[0.2, 0.1, 0.3, 0.4] → [0.2, 0.1, 0.3, 0.4 ]
The upper-right triangle (future positions) becomes -inf.
10.6.3 Why -inf?
Because Softmax maps -inf to exactly 0:
Softmax([0.3, -inf, -inf, -inf]) = [1.0, 0.0, 0.0, 0.0]
After Softmax, future positions carry zero weight. The model cannot read from them.
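Mask and Softmax together can be sketched on the 4×4 example above, using torch.triu to select the future positions:

```python
import torch
import torch.nn.functional as F

# The 4x4 score matrix from the example above.
scores = torch.tensor([[0.3, 0.2, 0.1, 0.4],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.1, 0.3, 0.4, 0.2],
                       [0.2, 0.1, 0.3, 0.4]])

# True strictly above the diagonal = future positions to block.
future = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
masked = scores.masked_fill(future, float('-inf'))

weights = F.softmax(masked, dim=-1)
print(weights[0])  # tensor([1., 0., 0., 0.])
```

The first row collapses to [1, 0, 0, 0]: with every other entry at -inf, Softmax gives the one surviving position all the weight.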
10.7 Softmax: Converting Scores to Probabilities
10.7.1 The Conversion
After masking, we apply Softmax row-by-row:
Before Softmax: [0.32, 0.04, -inf, -inf, ...]
After Softmax: [0.57, 0.43, 0.00, 0.00, ...]
10.7.2 What Softmax Does
- Normalizes: each row sums to 1
- Amplifies differences: larger values become even larger relative to smaller ones
- Handles -inf: maps them to exactly 0
10.7.3 Reading the Pattern
For the first token in a masked sequence, it can only attend to itself, so its row becomes [1.00, 0.00, 0.00, ...]. The second token can attend to positions 0 and 1, so its row might look like [0.61, 0.39, 0.00, ...]. Later tokens have more positions to spread attention across, so their rows are more diffuse.
This is the attention weight matrix — how much each position should draw from every other position.
10.8 Second Matrix Multiplication: Attention Weights @ V
10.8.1 Weighted Sum
With the attention weights computed, the final step is using them to blend the value vectors:
Output = Attention_Weights @ V
10.8.2 Dimension Change
Attention_Weights: [4, 4, 16, 16] (batch, heads, ctx_len, ctx_len)
V: [4, 4, 16, 128] (batch, heads, ctx_len, d_key)
Output: [4, 4, 16, 128] (batch, heads, ctx_len, d_key)
The multi-head structure here (4 heads) is the topic of the next chapter.
10.8.3 What This Step Does
Each output position is a weighted average of all V vectors, where the weights come from the attention scores:
output[i] = sum(attention_weight[i, j] × V[j])
If token i puts 70% of its attention on token j and 30% on token k:
output[i] = 0.7 × V[j] + 0.3 × V[k]
The output is a context-aware blend, not a copy of any single token.
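The 70/30 blend from the text, with two small hypothetical value vectors:

```python
import torch

# Hypothetical 2-dimensional value vectors for tokens j and k.
V_j = torch.tensor([1.0, 2.0])
V_k = torch.tensor([3.0, 4.0])

# 70% attention on token j, 30% on token k.
output_i = 0.7 * V_j + 0.3 * V_k

print(output_i)  # tensor([1.6000, 2.6000])
```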
10.9 What the Attention Output Means
10.9.1 Output Dimensions
After Attention @ V, each token has a new vector:
Output: [batch_size, ctx_length, d_key] = [4, 16, 128]
(In the multi-head case, outputs from all heads are concatenated — that is the next chapter.)
10.9.2 The Semantic Shift
Here is the important part. Before Attention, each token's vector only encoded its own information. After Attention, each token's vector is a weighted combination of the whole context.
The token embedding for "merged" started as a representation of that word in isolation. After Attention, it carries information from the surrounding tokens — who performed the action, what PR was involved, what came before.
That is how the model "understands" context. Not magic — just a learned, weighted average of value vectors.
10.9.3 The Loop
The Attention output:
- Replaces the input embedding for that position
- Feeds into the next component (FFN or the next block)
- Gets refined layer by layer
Each block adds more context into each token's representation.
10.10 The Full Attention Computation
10.10.1 Step-by-Step
Step 1: Generate Q, K, V
Q = X @ Wq [4, 16, 512]
K = X @ Wk [4, 16, 512]
V = X @ Wv [4, 16, 512]
↓
Step 2: Compute similarity
scores = Q @ K^T [4, 16, 16]
↓
Step 3: Scale
scores = scores / sqrt(d_key) [4, 16, 16]
↓
Step 4: Mask (decoder-only models)
scores[future] = -inf [4, 16, 16]
↓
Step 5: Softmax
weights = softmax(scores) [4, 16, 16]
↓
Step 6: Weighted sum
output = weights @ V [4, 16, 512]
10.10.2 PyTorch Implementation
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention.

    Q: [batch, seq_len, d_k]
    K: [batch, seq_len, d_k]
    V: [batch, seq_len, d_v]
    mask: [batch, 1, seq_len] or [batch, seq_len, seq_len]

    Returns:
        output: [batch, seq_len, d_v]
        attention_weights: [batch, seq_len, seq_len]
    """
    d_k = Q.size(-1)

    # Step 2: Q @ K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))

    # Step 3: Scale
    scores = scores / (d_k ** 0.5)

    # Step 4: Mask
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 5: Softmax
    attention_weights = F.softmax(scores, dim=-1)

    # Step 6: Weighted sum
    output = torch.matmul(attention_weights, V)

    return output, attention_weights
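A quick sanity check of this function (restated here so the snippet runs on its own), called with a causal mask built from torch.tril. Using X directly as Q, K, and V skips the projections purely for illustration:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Same six steps as the chapter's implementation.
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

batch, seq_len, d_model = 4, 16, 512
X = torch.randn(batch, seq_len, d_model)

# Causal mask: 1 on and below the diagonal (current and past positions).
causal = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

output, weights = attention(X, X, X, mask=causal)
print(output.shape)       # torch.Size([4, 16, 512])
print(weights[0, 0, :2])  # first token attends only to itself: [1., 0.]
```

Each row of weights sums to 1, and row 0 puts all its weight on position 0, matching the masked-Softmax pattern described in 10.7.3.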
10.11 Deeper Understanding of Q, K, and V
10.11.1 Role Summary
| Role | Generated by | Purpose | Used in |
|---|---|---|---|
| Q | X @ Wq | "What I am looking for" | Q @ K^T |
| K | X @ Wk | "What I advertise" | Q @ K^T |
| V | X @ Wv | "What I carry" | Attention @ V |
10.11.2 Why Separate K from V?
K and V both come from the same input. Why use two different matrices?
To decouple matching from extraction.
- K controls which positions get attention
- V controls what information flows when attention is given
This separation gives the model flexibility. A token can make itself easy to find (high K dot-product with many queries) while carrying very different actual content (V).
10.11.3 An Example
Consider: "The agent merged the pull request after review."
When processing "merged":
- Q("merged") is looking for: "who performed this action?"
- K("agent") signals: "I am a subject performing an action"
- V("agent") carries: the agent's semantic content
Q and K matching tells the model that "merged" should attend to "agent." V delivers the actual information.
10.12 Chapter Summary
10.12.1 Key Concepts
| Concept | Shape | Meaning |
|---|---|---|
| X | [batch, seq, d_model] | Input tensor |
| Wq / Wk / Wv | [d_model, d_model] | Learnable projection matrices |
| Q | [batch, seq, d_model] | Query: what I am looking for |
| K | [batch, seq, d_model] | Key: what I advertise |
| V | [batch, seq, d_model] | Value: what I carry |
| Scores | [batch, seq, seq] | Similarity matrix |
| Weights | [batch, seq, seq] | Attention probabilities (rows sum to 1) |
| Output | [batch, seq, d_model] | Context-aware token representations |
10.12.2 Computation Flow
X → [Wq, Wk, Wv] → Q, K, V
↓
Q @ K^T (similarity)
↓
/ sqrt(d_key) (scale)
↓
Mask (block the future)
↓
Softmax (normalize)
↓
@ V (weighted sum)
↓
Output
10.12.3 Core Takeaway
Q, K, and V are the three players in Attention. Q queries, K labels, V carries content. Match Q against K to find relevant positions; weight-average V by those scores to produce a context-enriched representation. That is how Attention helps the model understand language.
Chapter Checklist
After this chapter, you should be able to:
- Explain what Q, K, and V each represent.
- Describe how they are generated from the same input X.
- Trace the dimension changes through each step of Attention.
- Explain the causal mask and why it uses -inf.
- Explain why the sqrt(d_key) scale factor exists.
See You in the Next Chapter
That was single-head Attention, end to end.
Real Transformers run this whole process from multiple angles simultaneously. Chapter 11 explains Multi-Head Attention: how the model looks at relationships in parallel and combines everything back together.