One-sentence summary: QKV Attention is not predicting anything — it is continuously adjusting each token's embedding vector so that it becomes meaningful in context.


12.1 A Short Chapter With a Big Payoff

This chapter wraps up Multi-Head Attention. In the previous chapters, we derived the output tensor A step by step. You should have a clear picture of how it is computed.

What many explanations skip is what A actually means — and what the model is doing when it produces it.

Three things to cover:

  1. Concatenate: merging multiple heads back into a single tensor
  2. Linear transform Wo: the final matrix multiplication
  3. The essence of training: what QKV is actually adjusting

Once these are clear, Layer Normalization and residual connections become much easier to understand.


12.2 The Shape of A: A Four-Dimensional Tensor

12.2.1 Attention Visualization

Q matches against K to produce attention weights, which blend V vectors

The diagram shows the core Attention process:

  • Q (Query): the query vector for the current token being processed
  • K (Key): the key matrix for all tokens in the context
  • V (Value): the value matrix for all tokens in the context

Q dot-products against each row of K to produce attention weights, then those weights blend the rows of V. That blend is the output for this token: Q queries K, finds relevant positions, retrieves and aggregates from V.

12.2.2 Shape Breakdown

The Multi-Head Attention output A is a four-dimensional tensor:

A: [4, 4, 16, 128]
            └── per-head dimension (d_head = 512 / 4 = 128)
         └────── sequence length (seq_len = 16)
      └────────── number of heads (num_heads = 4)
    └───────────── batch size (batch_size = 4)

Breaking it down:

  • First 4: 4 sequences in the batch
  • Second 4: the 512-dimensional model width split into 4 heads
  • 16: 16 tokens per sequence
  • 128: each head's subspace dimension

The actual computation happens on each [16, 128] slice — one per head per sequence. The four-dimensional shape is just the packaging.

┌─────────────────────────────────────┐
│  Sequence 1                         │
│  ┌──────┬──────┬──────┬──────┐      │
│  │Head 1│Head 2│Head 3│Head 4│      │
│  │16×128│16×128│16×128│16×128│      │
│  └──────┴──────┴──────┴──────┘      │
├─────────────────────────────────────┤
│  Sequence 2  (same structure)       │
├─────────────────────────────────────┤
│  Sequence 3  (same structure)       │
├─────────────────────────────────────┤
│  Sequence 4  (same structure)       │
└─────────────────────────────────────┘
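A quick way to see this packaging, using the chapter's toy sizes:

```python
import torch

batch_size, num_heads, seq_len, d_head = 4, 4, 16, 128
A = torch.randn(batch_size, num_heads, seq_len, d_head)

print(A.shape)        # torch.Size([4, 4, 16, 128])
print(A[0, 0].shape)  # one head of one sequence: torch.Size([16, 128])
```

Indexing one batch element and one head pulls out exactly the [16, 128] slice the computation operates on.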

12.2.3 Real Model Sizes

For GPT-2 (117M parameters):

  • d_model = 768, num_heads = 12, d_head = 64

For LLaMA-7B:

  • d_model = 4096, num_heads = 32, d_head = 128

The same four-dimensional structure, just much larger in practice.


12.3 Concatenate: Merging Heads Back Together

12.3.1 The Merge Operation

Concatenate (often called "concat") is the step that reassembles the heads into a single tensor. It is the inverse of the split from Chapter 11.

Before concat: [4, 4, 16, 128]      4 heads, each 128-dimensional
After concat:  [4, 16, 512]         1 unified 512-dimensional tensor

We first swap the heads and sequence dimensions ([4, 4, 16, 128] → [4, 16, 4, 128]), then merge the last two dimensions [4, 128] back into [512]. That is it — a transpose followed by a reshape.
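A minimal sketch of the merge in PyTorch. The heads dimension and the sequence dimension have to be swapped first, so the operation is a transpose followed by a reshape:

```python
import torch

A = torch.randn(4, 4, 16, 128)  # [batch, heads, seq_len, d_head]

# Bring the two per-token pieces next to each other, then fuse them.
merged = A.transpose(1, 2).reshape(4, 16, 512)  # [batch, seq_len, d_model]
print(merged.shape)  # torch.Size([4, 16, 512])
```

`reshape` (rather than `view`) is used because the transpose leaves the tensor non-contiguous.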

12.3.2 Why Split and Merge at All?

The question I had when I first learned this: why cut the vector into pieces, compute Attention on each piece, then glue them back? What does the detour accomplish?

The answer is multi-perspective representation.

Each head operates in a different subspace of the full model dimension. Each head works on its own slice of the projection matrices, so no two heads see the same parameters. Over training, they tend to specialize:

  • Head 1 might learn syntax-sensitive patterns
  • Head 2 might learn semantic similarity
  • Head 3 might learn positional proximity
  • Head 4 might learn topic continuity

Splitting forces this specialization. Merging allows those specialized views to inform a single representation.

A tradeoff to keep in mind: more heads means richer representational capacity, but also more parameters and more computation. The empirical sweet spot is usually d_head = 64 or d_head = 128. There is no theoretical formula for the right number of heads — it is tuned experimentally.


12.4 Wo: The Final Linear Transform

12.4.1 What Wo Is

After concatenation, one final matrix multiplication:

Shape of Wo: [512, 512]
Operation:   A @ Wo → final output

Wo (the Output weight matrix) is structurally identical to Wq, Wk, and Wv:

  • Shape: [d_model, d_model] = [512, 512]
  • Initialization: random
  • Type: trainable parameters
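As a minimal sketch with this chapter's toy sizes, the projection can be written with `nn.Linear`:

```python
import torch
import torch.nn as nn

# Toy sizes from this chapter: d_model = 512, batch = 4, seq_len = 16.
Wo = nn.Linear(512, 512, bias=False)  # trainable, randomly initialized
A_concat = torch.randn(4, 16, 512)    # the concatenated head outputs

out = Wo(A_concat)  # applies x @ W.T to every token vector
print(out.shape)    # torch.Size([4, 16, 512])
```

The shape does not change; Wo's job is to mix the four heads' subspaces back into one coherent 512-dimensional representation.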

12.4.2 Weight Sharing Rules

This confused me when I was learning, so I want to be explicit.

Within one Transformer block: all heads share one Wq, one Wk, one Wv, and there is one Wo. The heads are not separate modules — they share the projection matrices (via the reshape trick from Chapter 11), and Wo recombines their outputs.

Across Transformer blocks: each block has its own independent set of Wq, Wk, Wv, and Wo. A 12-block model has 12 separate sets of these matrices, each learning something slightly different.

Block 1:  Wq₁, Wk₁, Wv₁, Wo₁    first set
Block 2:  Wq₂, Wk₂, Wv₂, Wo₂    second set
...
Block 12: Wq₁₂, Wk₁₂, Wv₁₂, Wo₁₂    twelfth set

Each block computes its own Attention with its own weights, refining the representation one level deeper.

12.4.3 In PyTorch

When using PyTorch's nn.MultiheadAttention, the weight matrices are handled internally:

self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=4)

# Internally:
# self.attn.in_proj_weight    packs Wq, Wk, Wv together — shape [3 * 512, 512]
# self.attn.out_proj.weight   this is Wo — shape [512, 512]

The Hugging Face transformers library wraps this further, but the same four matrices are there.
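A quick usage sketch with this chapter's dimensions (a self-attention call, so Q, K, and V all come from the same tensor; `batch_first=True` is assumed here so inputs are [batch, seq_len, d_model]):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=4, batch_first=True)
x = torch.randn(4, 16, 512)  # [batch, seq_len, d_model]

out, weights = attn(x, x, x)  # self-attention: Q, K, V all derived from x
print(out.shape)      # torch.Size([4, 16, 512])
print(weights.shape)  # torch.Size([4, 16, 16]) — averaged over heads by default
```

The split, per-head attention, concat, and Wo projection all happen inside this one call.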


12.5 What Q × K Is Actually Computing

12.5.1 The Score Matrix

Let's revisit what Q multiplied by K produces. Taking the first batch, first head, we get a 16×16 square matrix. (The values shown below are the weights after scaling and Softmax, so each row sums to 1.)

           Token1 Token2 Token3 ... Token16
Token1  [  0.20   0.10   0.05  ...  0.01  ]
Token2  [  0.15   0.30   0.10  ...  0.02  ]
Token3  [  0.08   0.12   0.25  ...  0.03  ]
...
Token16 [  0.01   0.02   0.01  ...  0.40  ]

12.5.2 Geometric Intuition

How to read this matrix:

  • Each row: one token's attention perspective
  • Each column: one token's visibility to others
  • Each cell: the attention weight from row token to column token

After Softmax, each row sums to 1 — it is a probability distribution over the sequence.

So Q × K is computing: for every token, what percentage of attention should it give to every other token?

The matrix form is what makes this efficient. We compute all pairwise relationships at once instead of looping.
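A small sketch of that computation for a single head, using this chapter's toy sizes:

```python
import math
import torch

seq_len, d_head = 16, 128
Q = torch.randn(seq_len, d_head)  # one head, one sequence
K = torch.randn(seq_len, d_head)

scores = Q @ K.T / math.sqrt(d_head)     # [16, 16] raw pairwise similarities
weights = torch.softmax(scores, dim=-1)  # each row becomes a distribution
print(weights.shape)                     # torch.Size([16, 16])
print(weights.sum(dim=-1))               # every row sums to 1
```

One matrix multiply computes all 16 × 16 = 256 pairwise relationships at once — this is the "no looping" point made above.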

12.5.3 A Concrete Example

Imagine the model processing "The agent merged the pull request after review."

When attending from "merged":

  • "merged" → "agent": maybe 30% (subject of the action)
  • "merged" → "pull request": maybe 25% (object of the action)
  • "merged" → "review": maybe 20% (context for the action)
  • "merged" → remaining tokens: the remaining 25%

These percentages come from Q × K. They tell the model where to look.


12.6 What V Does: Applying the Attention to Content

12.6.1 V as the Carrier

The score matrix from Q × K is a "map" — it says where to look but carries no content itself. V is the content.

In our setup:

  • 16 tokens in the sequence
  • Each token has a 128-dimensional V vector (in this head's subspace)

12.6.2 What the Multiplication Does

Multiplying the score matrix by V:

(Q × K) × V      shape still 16×128, unchanged

This is the core operation: use the attention percentages to update each token's vector.

Each token's output vector is a weighted sum of all V vectors, where the weights are the attention scores. Tokens that got high attention contribute more of their V content to the output.

The original embeddings started as random initializations with no semantic meaning. After this operation — and after thousands of training steps — these values become meaningful: they encode what each token represents in the context of the surrounding sequence.
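A minimal sketch of the weighted sum for one head (random stand-in weights, since only the shapes matter here):

```python
import torch

seq_len, d_head = 16, 128
weights = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)  # attention weights
V = torch.randn(seq_len, d_head)                                # content vectors

out = weights @ V  # row i of `out` blends all 16 V rows, weighted by row i
print(out.shape)   # torch.Size([16, 128]) — same shape as V
```

Tokens with higher attention weights contribute proportionally more of their V vector to each output row.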


12.7 Training: The Two Things Being Adjusted

12.7.1 One Step at a Time

During training, each forward pass makes small adjustments. Then the next forward pass makes more adjustments. This continues for tens of thousands of steps (or more, for large models).

12.7.2 A Concrete Example of Token Embedding Updates

Imagine the training corpus includes many sequences containing the word "agent."

First training step: the token embedding for "agent" is random — no meaningful values.

After the first forward pass and backward pass: the embedding shifts slightly toward values that help predict what comes next after "agent" in context.

Second step: the next time "agent" appears, we use the updated embedding from step 1. Another small adjustment follows.

Step N: the embedding for "agent" now encodes rich information — not just "this is the word agent" but "an entity that takes actions, appears in agentic contexts, is often followed by verbs like 'opened' or 'merged'."

Initial "agent" embedding: [random numbers]
After step 1:              [slightly adjusted]
After step 2:              [more meaningful]
...
After step N:              [semantically rich]

This is why we call it an Embedding — the word gets embedded into a meaningful vector space.
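A toy sketch of one such update step (the vocabulary size, the token ID for "agent", and the loss below are all made up purely for illustration):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=100, embedding_dim=8)  # random at init
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

agent_id = torch.tensor([7])  # hypothetical ID for "agent"
before = emb(agent_id).detach().clone()

loss = emb(agent_id).pow(2).sum()  # stand-in loss, not a real LM objective
loss.backward()                    # gradient flows only to row 7 of the table
opt.step()

after = emb(agent_id).detach()
print(torch.allclose(before, after))  # False — the row moved slightly
```

Only the rows for tokens that actually appear in the batch receive gradient, which is why frequent words like "agent" accumulate meaning fastest.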

12.7.3 Two Kinds of Parameters Being Updated

QKV Attention simultaneously refines two different sets of parameters:

Part 1: Token Embeddings

  • The lookup table that maps token IDs to vectors
  • Updated so each token's vector captures its meaning in context
  • Shared across all layers (one embedding per token ID)

Part 2: Weight Matrices

  • Wq, Wk, Wv, Wo — the linear transforms inside Attention
  • Updated so the Attention mechanism finds useful relationships
  • Separate per block (12 sets for a 12-block model)

These two update each other. Better token embeddings lead to better Attention scores. Better weight matrices lead to better token embedding updates. They converge together.


12.8 The Full Picture: What Multi-Head Attention Does

12.8.1 Placing It in the Architecture

The Multi-Head Attention module inside each Transformer block does:

  1. Updates token embeddings: every pass through this module adjusts the embedding vectors based on what surrounds them in the sequence
  2. Updates weight matrices: Wq, Wk, Wv, and Wo are all trained parameters that improve through backpropagation

12.8.2 Parameter Count

For a 12-block model with d_model = 512:

  • Each block: 4 weight matrices × 512² = 4 × 262,144 = 1,048,576 parameters
  • 12 blocks: 12 × 1,048,576 ≈ 12.6 million parameters (Attention only)
  • Plus: Token Embedding table, FFN layers, Layer Norms, and the output head that maps hidden states back to vocabulary logits
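The arithmetic can be checked in a few lines:

```python
d_model, num_blocks = 512, 12

per_block = 4 * d_model * d_model   # Wq, Wk, Wv, Wo, each [512, 512]
total_attn = num_blocks * per_block

print(per_block)   # 1048576
print(total_attn)  # 12582912, i.e. ≈ 12.6M Attention parameters
```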

After enough training steps, these parameters settle into values that make the model capable of coherent text generation.

12.8.3 Model Scale Reference

Model          Layers   d_model   Total Parameters
GPT-2 Small    12       768       117M
GPT-2 Medium   24       1024      345M
GPT-2 Large    36       1280      774M
LLaMA-7B       32       4096      7B
LLaMA-70B      80       8192      70B

Chapter Checklist

After this chapter, you should be able to:

  • Describe the four-dimensional shape [batch, heads, seq_len, d_head] and what each dimension means.
  • Explain what Concatenate does and why the split-then-merge is useful.
  • Explain what Wo does and how weights are shared within and across blocks.
  • Describe what Q × K computes (attention percentages).
  • Explain what (Q × K) × V does (updating token vectors using attention weights).
  • Describe the two things training simultaneously adjusts: token embeddings and weight matrices.

See You in the Next Chapter

Understanding what Attention outputs makes the next two components much easier to reason about.

Chapter 13 covers residual connections and Dropout — the engineering tricks that let deep Transformers train stably. Now that you know what Attention produces, you will immediately see why the residual connection pattern makes sense.

Cite this page
Zhang, Wayland (2026). Chapter 12: What the QKV Output Really Means. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-12-qkv-output
@incollection{zhang2026transformer_chapter_12_qkv_output,
  author = {Zhang, Wayland},
  title = {Chapter 12: What the QKV Output Really Means},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-12-qkv-output}
}