One-sentence summary: KV Cache stores computed keys and values from past tokens so the model never recomputes them — turning O(n²) repeated work into O(n) per step and delivering 5× measured inference speedup.
22.1 Why KV Cache Exists
22.1.1 The Autoregressive Waste
In Chapter 20 we established the autoregressive loop: the model generates one token, appends it to the sequence, and runs the full forward pass again. That "full forward pass" is the problem.
Consider an agent writing a response to a long pull request description. Suppose the prompt is 400 tokens and the response is 300 tokens. Without any optimization, the model must run Attention over a growing sequence at every step:
Step 1: attend over 400 tokens → output token 1
Step 2: attend over 401 tokens → output token 2
Step 3: attend over 402 tokens → output token 3
...
Step 300: attend over 699 tokens → output token 300
The 400 prompt tokens get reprocessed 300 times. That is the waste KV Cache eliminates.
22.1.2 What Attention Actually Computes
Recall the Attention formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

This involves two matrix multiplications:
- QK^T — scores: how much each position attends to every other position
- softmax(scores) V — weighted sum of values
At every new token, Q, K, and V all grow. But here is the key observation: the K and V rows for old tokens do not change. The model weights are frozen during inference, so the K and V vectors for token 1 on step 100 are identical to what they were on step 1. Computing them again is pure waste.
22.1.3 A Concrete Walkthrough
Let the prompt be "The agent opened a pull request" (6 tokens). The model generates a response word by word.
Without KV Cache, each step reprocesses the full sequence:
Step 1 — generate first response token:
Sequence: [The, agent, opened, a, pull, request] length: 6
Recompute K and V for all 6 tokens.
QK^T shape: 6×6
Step 2 — generate second response token:
Sequence: [The, agent, opened, a, pull, request, token_1] length: 7
Recompute K and V for all 7 tokens — including "The" through "request" again.
QK^T shape: 7×7
Step 20 — generate twentieth response token:
Sequence: length 25
Recompute K and V for 25 tokens — the original 6 prompt tokens have been recomputed 20 times.
Every prompt token's K and V vectors are computed from scratch on every step. They are the same every time because the model weights and the token embeddings do not change.
22.1.4 How Much Waste Is It?
Let us count the Attention matrix multiplications needed to generate N tokens:
| Step | Sequence length | QK^T cost (proportional) |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 4 |
| 3 | 3 | 9 |
| ... | ... | ... |
| n | n | n² |
Total over N steps without cache:

1 + 4 + 9 + ... + N² = N(N+1)(2N+1)/6 ≈ N³/3

With KV Cache each step only computes a 1-row Q against the cached K: step n costs O(n), so the total is 1 + 2 + ... + N = N(N+1)/2 ≈ N²/2. At N = 1000, the per-step Attention cost drops from ~N² to ~N — roughly a 1000× reduction in Attention work per step.
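The arithmetic is simple enough to check in a few lines of Python:

```python
# Attention cost (proportional to n²) summed over N generation steps
N = 1000
without_cache = sum(n * n for n in range(1, N + 1))  # Σ n² = N(N+1)(2N+1)/6
with_cache = sum(n for n in range(1, N + 1))         # Σ n  = N(N+1)/2

assert without_cache == N * (N + 1) * (2 * N + 1) // 6
assert with_cache == N * (N + 1) // 2
print(f"total ratio: {without_cache / with_cache:.0f}x")  # prints "total ratio: 667x"
```

The total ratio is (2N+1)/3 ≈ 667× at N = 1000; the per-step ratio at the last step is N²/N = 1000×.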
22.2 What Gets Cached and Why
22.2.1 The Four Rules
- KV Cache applies to inference only. During training the whole target sequence is known, so all positions can be processed in parallel — no caching needed.
- KV Cache lives only in decoder blocks. Encoder blocks (when present) process the input once in parallel; they are not autoregressive.
- It accelerates the two Attention matmuls. With cached K and V, only the new token's single query row is needed, so QK^T shrinks from a growing n×n matrix to a 1×n vector.
- It increases memory. There is no free lunch: storing the cache costs GPU memory proportional to sequence length.
22.2.2 Why Only K and V, Not Q?
During autoregressive generation, we only care about the output for the last position. To compute that output, the new token's Q attends over all positions' K and V. The Q for position n+1 is computed fresh every step; there is nothing to cache. K and V for all past positions, however, are used again and again — those are the ones worth caching.
The way I think about this: Query is the new question the model is asking. Keys and Values are the knowledge base it is consulting. You keep the knowledge base on your desk; you do not need to fetch old questions.
22.3 Without vs With KV Cache
22.3.1 Without KV Cache
Let the prompt be "The agent opened" (3 tokens). Generation:
Step 1 — predict token after "The agent opened":
Q = [The, agent, opened] size: 3×d
K = [The, agent, opened] size: 3×d ← computed from scratch
V = [The, agent, opened] size: 3×d ← computed from scratch
QK^T = 3×3 matrix
Step 2 — predict token after "The agent opened a":
Q = [The, agent, opened, a] size: 4×d
K = [The, agent, opened, a] size: 4×d ← The/agent/opened recomputed!
V = [The, agent, opened, a] size: 4×d ← The/agent/opened recomputed!
QK^T = 4×4 matrix
The first three rows of K and V are identical in both steps. Standard implementation throws them away and recomputes them anyway.
22.3.2 With KV Cache
Step 1 — process full prompt, populate cache:
Q = [The, agent, opened] size: 3×d
K = [The, agent, opened] size: 3×d → cached
V = [The, agent, opened] size: 3×d → cached
Step 2 — predict next token:
Q = [a] size: 1×d ← only the new token
K = [The, agent, opened, a] size: 4×d ← read cache + append new row
V = [The, agent, opened, a] size: 4×d ← read cache + append new row
QK^T = 1×4 vector ← only one query row!
Colors in the diagram: gray = cached (read from memory), orange = newly computed K row, green = newly computed V row, pink = causally masked positions.
22.3.3 Compute Savings
| Step n | Without cache | With cache | Saving |
|---|---|---|---|
| 1 | 1×1 = 1 | 1×1 = 1 | 0% |
| 2 | 2×2 = 4 | 1×2 = 2 | 50% |
| 4 | 4×4 = 16 | 1×4 = 4 | 75% |
| 10 | 100 | 10 | 90% |
| 1000 | 1,000,000 | 1000 | 99.9% |
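The cached computation is not an approximation: the last row of the full n×n causal Attention output equals the 1×n cached-query version exactly. A minimal NumPy sketch, with random matrices standing in for Q, K, and V:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Without cache: full n×n causal attention, then keep only the last row
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # causal mask
full_out = softmax(scores) @ V

# With cache: the new token's 1-row Q against all n cached K/V rows
# (the last position attends to everything, so no mask is needed)
cached_out = softmax(Q[-1:] @ K.T / np.sqrt(d)) @ V

assert np.allclose(full_out[-1], cached_out[0])
```

This is why the savings come for free: we only ever needed the last row, and the cache lets us compute just that row.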
22.4 Memory Cost of KV Cache
22.4.1 The Formula
KV Cache memory =
2 (K and V)
× batch_size
× context_length
× n_layers
× n_heads
× d_head
× bytes_per_element
Where bytes_per_element is:
- FP32 → 4 bytes
- BF16 / FP16 → 2 bytes
- INT8 / FP8 → 1 byte
22.4.2 Worked Example: Llama-7B
Configuration: n_layers = 32, n_heads = 32, d_head = 128, FP16, context_length = 4096, batch = 1.
KV Cache = 2 × 1 × 4096 × 32 × 32 × 128 × 2 bytes
= 2,147,483,648 bytes
≈ 2 GB
Llama-7B itself weighs about 14 GB in FP16. A single 4096-token conversation requires a 2 GB KV Cache on top of that.
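The formula is easy to wrap in a helper so you can plug in other configurations (the function name `kv_cache_bytes` is ours, not from any library):

```python
def kv_cache_bytes(batch, ctx, n_layers, n_heads, d_head, bytes_per_elem):
    """2 (K and V) x batch x context x layers x heads x head dim x element size."""
    return 2 * batch * ctx * n_layers * n_heads * d_head * bytes_per_elem

# Llama-7B in FP16 at a 4096-token context
llama_7b = kv_cache_bytes(batch=1, ctx=4096, n_layers=32,
                          n_heads=32, d_head=128, bytes_per_elem=2)
assert llama_7b == 2_147_483_648  # exactly 2 GiB
print(f"{llama_7b / 2**30:.0f} GiB")  # prints "2 GiB"
```

Doubling the context or the batch size doubles the cache; halving precision (FP16 → FP8) halves it.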
22.4.3 Deployment Constraints
Suppose you serve Llama-7B on an NVIDIA A10 GPU (24 GB memory):
| Item | Memory |
|---|---|
| Model weights (FP16) | ~14 GB |
| Remaining for KV Cache | ~10 GB |
| Max KV Cache window | ~20k tokens |
| Max concurrent users at 4k context | ~5 |
This is the real constraint in production: KV Cache memory, not model parameters, often determines how many users you can serve simultaneously. That pressure is exactly what motivated MQA and GQA, which we cover in Chapter 23.
22.5 Multi-Turn Conversations
22.5.1 The Compounding Problem
Chat applications accumulate context over multiple turns. Without smart caching, each turn forces the model to reprocess the entire conversation history.
With KV Cache the conversation history is computed once:
Turn 1: [Q1][A1]
└── computed and cached
Turn 2: [Q1][A1] [Q2] [A2]
└── reuse ──┘ └── new
↑ only these tokens need new K, V
Turn 3: [Q1][A1][Q2][A2] [Q3] [A3]
└───── reuse ───┘ └── new
Each turn adds only the new tokens' K and V rows. The accumulation is linear, not quadratic.
22.5.2 Memory Grows Linearly, Not Quadratically
Without KV Cache, every turn re-runs Attention over the entire history: turn k has a context of length proportional to k, so it costs O(k²) Attention work, and the total over n turns is O(n³). With KV Cache, turn k computes K and V rows only for its new tokens — O(k) work per turn and O(n²) total. Memory, meanwhile, grows linearly: the cache holds exactly one K row and one V row per token of history, no matter how many turns produced them.
22.5.3 The "Typewriter" Effect
The streaming token-by-token output you see in chat interfaces is partly a product of KV Cache: each new token is cheap to generate because the bulk of the K/V computation happened during prefill. Without KV Cache, the latency per token would grow with every step as the sequence got longer.
22.6 The Two Phases of Inference
22.6.1 Prefill and Decode
| Phase | Prefill | Decode |
|---|---|---|
| Input | Full prompt (all at once) | Previous generated token |
| Computation | Parallel over all prompt tokens | Sequential, one token at a time |
| KV Cache | Populated | Extended one row at a time |
| Bottleneck | Compute-bound (many tokens in parallel) | Memory-bandwidth-bound |
22.6.2 Why Decode Is Memory Bound
During prefill, the GPU processes many tokens simultaneously — it is doing real batched matrix multiplication. During decode, we process exactly one token per step. Most GPU compute is idle; the limiting factor becomes how fast we can read the KV Cache from HBM.
This is why decode is called memory-bandwidth-bound, and why reducing KV Cache size (via MQA, GQA, quantization) has such direct impact on decode throughput.
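A back-of-envelope sketch of that bandwidth ceiling, assuming batch size 1 and that every byte of weights and cache is read once per decoded token (a deliberately crude model; the ~600 GB/s figure is the A10's published peak bandwidth):

```python
# Rough decode-throughput ceiling from memory bandwidth alone
bandwidth = 600e9  # bytes/s, approx. A10 peak HBM bandwidth
weights = 14e9     # Llama-7B FP16 weights, read once per token
kv_cache = 2e9     # 4k-token FP16 KV Cache, read once per token

bytes_per_token = weights + kv_cache
max_tokens_per_s = bandwidth / bytes_per_token  # 37.5 tokens/s
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound")
```

Two things fall out of this model: a bigger KV Cache directly lowers the ceiling, and real decode throughput scales with batching precisely because the weight reads are amortized across the batch.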
22.7 Code: Measuring the Speedup
```python
import time

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model.eval()

inputs = tokenizer("The agent opened a pull request.", return_tensors="pt").to(device)

for use_cache in (True, False):
    times = []
    for _ in range(10):
        start = time.time()
        with torch.no_grad():
            model.generate(
                **inputs,
                use_cache=use_cache,
                max_new_tokens=1000,
                pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
            )
        times.append(time.time() - start)
    print(f"{'with' if use_cache else 'without'} KV cache: "
          f"{np.mean(times):.1f} ± {np.std(times):.1f} s")
```
Expected output:

```
with KV cache: 11.x ± x.x s
without KV cache: 56.x ± x.x s
```
That is roughly a 5× end-to-end speedup on GPT-2 at ~1000 generated tokens. It is smaller than the theoretical reduction in Attention FLOPs because the no-cache forward passes still process all positions in parallel on the GPU, and part of the wall-clock time goes to work the cache does not touch.
22.7.1 Cache Data Structure
```python
# Per-layer KV Cache structure (pseudocode: compute_query / compute_key /
# compute_value / attention stand in for the real per-layer projections)
n_layers = 96

key_cache = [[] for _ in range(n_layers)]
value_cache = [[] for _ in range(n_layers)]

# Prefill: process the prompt, populate the cache
for token in input_sequence:
    for layer in range(n_layers):
        key_cache[layer].append(compute_key(layer, token))
        value_cache[layer].append(compute_value(layer, token))

# Decode: generate new tokens one at a time
for new_token in generation_loop:
    for layer in range(n_layers):
        query = compute_query(layer, new_token)
        # Extend the cache first so the new token can attend to itself
        key_cache[layer].append(compute_key(layer, new_token))
        value_cache[layer].append(compute_value(layer, new_token))
        # Attention over the full history via the cache
        output = attention(query, keys=key_cache[layer], values=value_cache[layer])
```
Two things worth noting: each layer has an independent cache because each layer's projection weights are different; and in production frameworks the cache is a pre-allocated tensor, not a Python list.
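A minimal sketch of that pre-allocated variant, using NumPy arrays with illustrative shapes — a write cursor advances through a fixed-size buffer instead of appending to a Python list:

```python
import numpy as np

n_layers, max_ctx, n_heads, d_head = 4, 128, 8, 16

# Pre-allocated cache: one fixed tensor each for K and V, filled in place
k_cache = np.zeros((n_layers, max_ctx, n_heads, d_head), dtype=np.float16)
v_cache = np.zeros_like(k_cache)
cur_len = 0  # write cursor: number of valid token slots

def append_kv(k_new, v_new):
    """Write one token's K/V rows into the next free slot (no reallocation)."""
    global cur_len
    k_cache[:, cur_len] = k_new
    v_cache[:, cur_len] = v_new
    cur_len += 1

rng = np.random.default_rng(0)
for _ in range(10):  # ten decode steps
    append_kv(rng.standard_normal((n_layers, n_heads, d_head)),
              rng.standard_normal((n_layers, n_heads, d_head)))

# Attention at each layer reads only the valid prefix k_cache[:, :cur_len]
assert k_cache[:, :cur_len].shape == (4, 10, 8, 16)
```

Pre-allocation avoids per-step memory allocation and keeps the cache contiguous for fast reads; the cost is committing `max_ctx` worth of memory up front, which is exactly the fragmentation problem PagedAttention later addresses.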
22.8 Common Questions
Does KV Cache affect accuracy? No. It stores the exact same K and V values that would otherwise be recomputed, so the output matches the no-cache version (in practice, up to tiny floating-point differences from the differently shaped matmuls).
Does training use KV Cache? No. Training processes the full known sequence in parallel using causal masking. The autoregressive structure is enforced by the mask, not by sequential generation. The KV Cache concept only makes sense in the sequential decode setting.
What models support it? All Decoder-only models (GPT family, LLaMA, Mistral, Gemma, Qwen) and Encoder-Decoder models during decoding. Encoder-only models (BERT, RoBERTa) do not generate autoregressively and do not need it.
What are the downsides? Three: higher memory per request; limited concurrency on memory-constrained hardware; and prefill latency cannot be reduced by the cache (the first step still processes the full prompt with no shortcut).
Does context length limit matter more with KV Cache? Yes. Without KV Cache, long context just means more recomputation. With KV Cache, long context means more memory — the cache grows linearly. A 128k-token context on Llama-7B requires:
2 × 1 × 131072 × 32 × 32 × 128 × 2 bytes ≈ 64 GB
That is more than the entire model on an A100 80GB. This is the root pressure that drove MQA, GQA, and quantized KV representations.
22.9 KV Cache Optimization Directions
KV Cache is standard in every production deployment, but it has spun off a family of optimizations:
Reduce the number of KV heads (less to cache):
- MQA — all query heads share one K/V pair
- GQA — groups of query heads share one K/V pair
- Covered in Chapter 23.
Reduce KV precision (smaller per-element footprint):
- INT8 or FP8 quantization of cached K and V
- Memory roughly halved with negligible quality loss
Evict or compress old KV entries:
- Sliding Window Attention — cache only recent window
- StreamingLLM — keep "attention sink" tokens + recent window
- PagedAttention (vLLM) — manage KV pages like OS virtual memory
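The eviction idea behind Sliding Window Attention reduces to a fixed-capacity buffer: once the window is full, the oldest K/V rows are dropped. A toy sketch with string stand-ins for the real key vectors:

```python
from collections import deque

window = 4  # cache only the most recent `window` tokens' K/V rows
k_cache = deque(maxlen=window)  # deque with maxlen evicts the oldest entry

for token_idx in range(10):
    k_cache.append(f"K_row_{token_idx}")  # stand-in for a real key vector

# Only the last `window` entries survive; older rows were evicted
assert list(k_cache) == ["K_row_6", "K_row_7", "K_row_8", "K_row_9"]
```

StreamingLLM refines this by pinning the first few "attention sink" tokens so they are never evicted alongside the sliding window.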
22.10 Chapter Summary
| Concept | Key Point |
|---|---|
| Why cache K and V | Old tokens' K and V are constant during inference; recomputing them is O(N²) total waste |
| Why not cache Q | Q is only needed for the current (new) token |
| Compute savings | Per-step cost drops from O(N²) to O(N); measured 5× speedup |
| Memory cost formula | 2 × batch × ctx × layers × heads × d_head × bytes |
| Llama-7B at 4k ctx | ~2 GB KV Cache per batch element |
| Prefill vs Decode | Prefill: compute-bound. Decode: memory-bandwidth-bound |
| Hugging Face toggle | use_cache=True (default) |
Chapter Checklist
After this chapter, you should be able to:
- Explain why autoregressive generation creates redundant computation without a cache.
- Explain why only K and V are cached, not Q.
- Calculate the KV Cache memory for a given model configuration.
- Describe the prefill and decode phases and their respective bottlenecks.
- Run the benchmark code and interpret the speedup.
Further Reading
- Efficient Transformers: A Survey (Tay et al., 2020) — arXiv 2009.06732
- vLLM: Easy, Fast, and Cheap LLM Serving (PagedAttention) — github.com/vllm-project/vllm
- FlashAttention-2 (Dao, 2023) — fast exact Attention kernels; its Flash-Decoding follow-up accelerates Attention over the KV Cache during decode
See You in the Next Chapter
KV Cache memory scales with the number of attention heads: 32 heads means storing 32 sets of K and 32 sets of V per layer. The natural follow-up question is whether all 32 sets actually need to be independent.
Chapter 23 covers the evolution from MHA (every head its own K/V) through MQA (all heads share one K/V) to GQA (groups of heads share K/V) — the architecture changes that let modern models serve longer contexts without running out of memory.