One-sentence summary: KV Cache stores computed keys and values from past tokens so the model never recomputes them — turning O(n²) repeated work into O(n) per step and delivering 5× measured inference speedup.
22.1 Why KV Cache Exists
22.1.1 The Autoregressive Waste
In Chapter 20 we established the autoregressive loop: the model generates one token, appends it to the sequence, and runs the full forward pass again. That "full forward pass" is the problem.
Consider an agent writing a response to a long pull request description. Suppose the prompt is 400 tokens and the response is 300 tokens. Without any optimization, the model must run Attention over a growing sequence at every step:
Step 1: attend over 400 tokens → output token 1
Step 2: attend over 401 tokens → output token 2
Step 3: attend over 402 tokens → output token 3
...
Step 300: attend over 699 tokens → output token 300
The 400 prompt tokens get reprocessed 300 times. That is the waste KV Cache eliminates.
22.1.2 What Attention Actually Computes
Recall the Attention formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

This involves two matrix multiplications:
- QK^T — scores: how much each position attends to every other position
- softmax(scores) V — weighted sum of values
At every new token, Q, K, and V all grow. But here is the key observation: the K and V rows for old tokens do not change. The model weights are frozen during inference, so the K and V vectors for token 1 on step 100 are identical to what they were on step 1. Computing them again is pure waste.
22.1.3 A Concrete Walkthrough
Let the prompt be "The agent opened a pull request" (6 tokens). The model generates a response word by word.
Without KV Cache, each step reprocesses the full sequence:
Step 1 — generate first response token:
Sequence: [The, agent, opened, a, pull, request] length: 6
Recompute K and V for all 6 tokens.
QK^T shape: 6×6
Step 2 — generate second response token:
Sequence: [The, agent, opened, a, pull, request, token_1] length: 7
Recompute K and V for all 7 tokens — including "The" through "request" again.
QK^T shape: 7×7
Step 20 — generate twentieth response token:
Sequence: length 25
Recompute K and V for 25 tokens — the original 6 prompt tokens have been recomputed 20 times.
Every prompt token's K and V vectors are computed from scratch on every step. They are the same every time because the model weights and the token embeddings do not change.
22.1.4 How Much Waste Is It?
Let us count the Attention matrix multiplications needed to generate N tokens:
| Step | Sequence length | QK^T cost (proportional) |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 4 |
| 3 | 3 | 9 |
| ... | ... | ... |
| n | n | n² |
Total over N steps without cache:

1 + 4 + 9 + ... + N² = N(N+1)(2N+1)/6 ≈ N³/3

With KV Cache each step only computes a 1-row Q against the cached K: step n costs O(n), so the total is 1 + 2 + ... + N = N(N+1)/2 ≈ N²/2. At N = 1000, the per-step Attention cost drops from ~N² to ~N — roughly a 1000× reduction in Attention work per step.
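The arithmetic is simple enough to check in a few lines of Python:

```python
# Attention cost (proportional to n²) summed over N generation steps
N = 1000
without_cache = sum(n * n for n in range(1, N + 1))  # Σ n² = N(N+1)(2N+1)/6
with_cache = sum(n for n in range(1, N + 1))         # Σ n  = N(N+1)/2

assert without_cache == N * (N + 1) * (2 * N + 1) // 6
assert with_cache == N * (N + 1) // 2
print(f"total ratio: {without_cache / with_cache:.0f}x")  # prints "total ratio: 667x"
```

The total ratio is (2N+1)/3 ≈ 667× at N = 1000; the per-step ratio at the last step is N²/N = 1000×.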
22.2 What Gets Cached and Why
22.2.1 The Four Rules
- KV Cache applies to inference only. During training the whole target sequence is known, so all positions can be processed in parallel — no caching needed.
- KV Cache lives only in decoder blocks. Encoder blocks (when present) process the input once in parallel; they are not autoregressive.
- It accelerates the two Attention matmuls. With cached K and V, only the new token's single query row is needed, so QK^T shrinks from a growing n×n matrix to a 1×n vector.
- It increases memory. There is no free lunch: storing the cache costs GPU memory proportional to sequence length.
22.2.2 Why Only K and V, Not Q?
During autoregressive generation, we only care about the output for the last position. To compute that output, the new token's Q attends over all positions' K and V. The Q for position n+1 is computed fresh every step; there is nothing to cache. K and V for all past positions, however, are used again and again — those are the ones worth caching.
The way I think about this: Query is the new question the model is asking. Keys and Values are the knowledge base it is consulting. You keep the knowledge base on your desk; you do not need to fetch old questions.
22.3 Without vs With KV Cache
22.3.1 Without KV Cache
Let the prompt be "The agent opened" (3 tokens). Generation:
Step 1 — predict token after "The agent opened":
Q = [The, agent, opened] size: 3×d
K = [The, agent, opened] size: 3×d ← computed from scratch
V = [The, agent, opened] size: 3×d ← computed from scratch
QK^T = 3×3 matrix
Step 2 — predict token after "The agent opened a":
Q = [The, agent, opened, a] size: 4×d
K = [The, agent, opened, a] size: 4×d ← The/agent/opened recomputed!
V = [The, agent, opened, a] size: 4×d ← The/agent/opened recomputed!
QK^T = 4×4 matrix
The first three rows of K and V are identical in both steps. Standard implementation throws them away and recomputes them anyway.
22.3.2 With KV Cache
Step 1 — process full prompt, populate cache:
Q = [The, agent, opened] size: 3×d
K = [The, agent, opened] size: 3×d → cached
V = [The, agent, opened] size: 3×d → cached
Step 2 — predict next token:
Q = [a] size: 1×d ← only the new token
K = [The, agent, opened, a] size: 4×d ← read cache + append new row
V = [The, agent, opened, a] size: 4×d ← read cache + append new row
QK^T = 1×4 vector ← only one query row!
Colors in the diagram: gray = cached (read from memory), orange = newly computed K row, green = newly computed V row, pink = causally masked positions.
22.3.3 Compute Savings
| Step n | Without cache | With cache | Saving |
|---|---|---|---|
| 1 | 1×1 = 1 | 1×1 = 1 | 0% |
| 2 | 2×2 = 4 | 1×2 = 2 | 50% |
| 4 | 4×4 = 16 | 1×4 = 4 | 75% |
| 10 | 100 | 10 | 90% |
| 1000 | 1,000,000 | 1000 | 99.9% |
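The cached computation is not an approximation: the last row of the full n×n causal Attention output equals the 1×n cached-query version exactly. A minimal NumPy sketch, with random matrices standing in for Q, K, and V:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Without cache: full n×n causal attention, then keep only the last row
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # causal mask
full_out = softmax(scores) @ V

# With cache: the new token's 1-row Q against all n cached K/V rows
# (the last position attends to everything, so no mask is needed)
cached_out = softmax(Q[-1:] @ K.T / np.sqrt(d)) @ V

assert np.allclose(full_out[-1], cached_out[0])
```

This is why the savings come for free: we only ever needed the last row, and the cache lets us compute just that row.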
22.4 Memory Cost of KV Cache
22.4.1 The Formula
KV Cache memory =
2 (K and V)
× batch_size
× context_length
× n_layers
× n_heads
× d_head
× bytes_per_element
Where bytes_per_element is:
- FP32 → 4 bytes
- BF16 / FP16 → 2 bytes
- INT8 / FP8 → 1 byte
22.4.2 Worked Example: Llama-7B
Configuration: n_layers = 32, n_heads = 32, d_head = 128, FP16, context_length = 4096, batch = 1.
KV Cache = 2 × 1 × 4096 × 32 × 32 × 128 × 2 bytes
= 2,147,483,648 bytes
≈ 2 GB
Llama-7B itself weighs about 14 GB in FP16. A single 4096-token conversation requires a 2 GB KV Cache on top of that.
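The formula is easy to wrap in a helper so you can plug in other configurations (the function name `kv_cache_bytes` is ours, not from any library):

```python
def kv_cache_bytes(batch, ctx, n_layers, n_heads, d_head, bytes_per_elem):
    """2 (K and V) x batch x context x layers x heads x head dim x element size."""
    return 2 * batch * ctx * n_layers * n_heads * d_head * bytes_per_elem

# Llama-7B in FP16 at a 4096-token context
llama_7b = kv_cache_bytes(batch=1, ctx=4096, n_layers=32,
                          n_heads=32, d_head=128, bytes_per_elem=2)
assert llama_7b == 2_147_483_648  # exactly 2 GiB
print(f"{llama_7b / 2**30:.0f} GiB")  # prints "2 GiB"
```

Doubling the context or the batch size doubles the cache; halving precision (FP16 → FP8) halves it.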
22.4.3 Deployment Constraints
Suppose you serve Llama-7B on an NVIDIA A10 GPU (24 GB memory):
| Item | Memory |
|---|---|
| Model weights (FP16) | ~14 GB |
| Remaining for KV Cache | ~10 GB |
| Max KV Cache window | ~20k tokens |
| Max concurrent users at 4k context | ~5 |
This is the real constraint in production: KV Cache memory, not model parameters, often determines how many users you can serve simultaneously. That pressure is exactly what motivated MQA and GQA, which we cover in Chapter 23.
22.5 Multi-Turn Conversations
22.5.1 The Compounding Problem
Chat applications accumulate context over multiple turns. Without smart caching, each turn forces the model to reprocess the entire conversation history.
With KV Cache the conversation history is computed once:
Turn 1: [Q1][A1]
└── computed and cached
Turn 2: [Q1][A1] [Q2] [A2]
└── reuse ──┘ └── new
↑ only these tokens need new K, V
Turn 3: [Q1][A1][Q2][A2] [Q3] [A3]
└───── reuse ───┘ └── new
Each turn adds only the new tokens' K and V rows. The accumulation is linear, not quadratic.
22.5.2 Memory Grows Linearly, Not Quadratically
Without KV Cache, every turn re-runs Attention over the entire history: turn k has a context of length proportional to k, so it costs O(k²) Attention work, and the total over n turns is O(n³). With KV Cache, turn k computes K and V rows only for its new tokens — O(k) work per turn and O(n²) total. Memory, meanwhile, grows linearly: the cache holds exactly one K row and one V row per token of history, no matter how many turns produced them.
22.5.3 The "Typewriter" Effect
The streaming token-by-token output you see in chat interfaces is partly a product of KV Cache: each new token is cheap to generate because the bulk of the K/V computation happened during prefill. Without KV Cache, the latency per token would grow with every step as the sequence got longer.
22.6 The Two Phases of Inference
22.6.1 Prefill and Decode
| Phase | Prefill | Decode |
|---|---|---|
| Input | Full prompt (all at once) | Previous generated token |
| Computation | Parallel over all prompt tokens | Sequential, one token at a time |
| KV Cache | Populated | Extended one row at a time |
| Bottleneck | Compute-bound (many tokens in parallel) | Memory-bandwidth-bound |
22.6.2 Why Decode Is Memory Bound
During prefill, the GPU processes many tokens simultaneously — it is doing real batched matrix multiplication. During decode, we process exactly one token per step. Most GPU compute is idle; the limiting factor becomes how fast we can read the KV Cache from HBM.
This is why decode is called memory-bandwidth-bound, and why reducing KV Cache size (via MQA, GQA, quantization) has such direct impact on decode throughput.
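A back-of-envelope sketch of that bandwidth ceiling, assuming batch size 1 and that every byte of weights and cache is read once per decoded token (a deliberately crude model; the ~600 GB/s figure is the A10's published peak bandwidth):

```python
# Rough decode-throughput ceiling from memory bandwidth alone
bandwidth = 600e9  # bytes/s, approx. A10 peak HBM bandwidth
weights = 14e9     # Llama-7B FP16 weights, read once per token
kv_cache = 2e9     # 4k-token FP16 KV Cache, read once per token

bytes_per_token = weights + kv_cache
max_tokens_per_s = bandwidth / bytes_per_token  # 37.5 tokens/s
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound")
```

Two things fall out of this model: a bigger KV Cache directly lowers the ceiling, and real decode throughput scales with batching precisely because the weight reads are amortized across the batch.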
22.7 Code: Measuring the Speedup
```python
import time

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model.eval()

inputs = tokenizer("The agent opened a pull request.", return_tensors="pt").to(device)

for use_cache in (True, False):
    times = []
    for _ in range(10):
        start = time.time()
        with torch.no_grad():
            model.generate(
                **inputs,
                use_cache=use_cache,
                max_new_tokens=1000,
                pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
            )
        times.append(time.time() - start)
    print(f"{'with' if use_cache else 'without'} KV cache: "
          f"{np.mean(times):.1f} ± {np.std(times):.1f} s")
```
Expected output:

```
with KV cache: 11.x ± x.x s
without KV cache: 56.x ± x.x s
```
That is roughly a 5× end-to-end speedup on GPT-2 at ~1000 generated tokens. It is smaller than the theoretical reduction in Attention FLOPs because the no-cache forward passes still process all positions in parallel on the GPU, and part of the wall-clock time goes to work the cache does not touch.
22.7.1 Cache Data Structure
```python
# Per-layer KV Cache structure (pseudocode: compute_query / compute_key /
# compute_value / attention stand in for the real per-layer projections)
n_layers = 96

key_cache = [[] for _ in range(n_layers)]
value_cache = [[] for _ in range(n_layers)]

# Prefill: process the prompt, populate the cache
for token in input_sequence:
    for layer in range(n_layers):
        key_cache[layer].append(compute_key(layer, token))
        value_cache[layer].append(compute_value(layer, token))

# Decode: generate new tokens one at a time
for new_token in generation_loop:
    for layer in range(n_layers):
        query = compute_query(layer, new_token)
        # Extend the cache first so the new token can attend to itself
        key_cache[layer].append(compute_key(layer, new_token))
        value_cache[layer].append(compute_value(layer, new_token))
        # Attention over the full history via the cache
        output = attention(query, keys=key_cache[layer], values=value_cache[layer])
```
Two things worth noting: each layer has an independent cache because each layer's projection weights are different; and in production frameworks the cache is a pre-allocated tensor, not a Python list.
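A minimal sketch of that pre-allocated variant, using NumPy arrays with illustrative shapes — a write cursor advances through a fixed-size buffer instead of appending to a Python list:

```python
import numpy as np

n_layers, max_ctx, n_heads, d_head = 4, 128, 8, 16

# Pre-allocated cache: one fixed tensor each for K and V, filled in place
k_cache = np.zeros((n_layers, max_ctx, n_heads, d_head), dtype=np.float16)
v_cache = np.zeros_like(k_cache)
cur_len = 0  # write cursor: number of valid token slots

def append_kv(k_new, v_new):
    """Write one token's K/V rows into the next free slot (no reallocation)."""
    global cur_len
    k_cache[:, cur_len] = k_new
    v_cache[:, cur_len] = v_new
    cur_len += 1

rng = np.random.default_rng(0)
for _ in range(10):  # ten decode steps
    append_kv(rng.standard_normal((n_layers, n_heads, d_head)),
              rng.standard_normal((n_layers, n_heads, d_head)))

# Attention at each layer reads only the valid prefix k_cache[:, :cur_len]
assert k_cache[:, :cur_len].shape == (4, 10, 8, 16)
```

Pre-allocation avoids per-step memory allocation and keeps the cache contiguous for fast reads; the cost is committing `max_ctx` worth of memory up front, which is exactly the fragmentation problem PagedAttention later addresses.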
22.8 Common Questions
Does KV Cache affect accuracy? No. It stores the exact same K and V values that would otherwise be recomputed, so the output matches the no-cache version (in practice, up to tiny floating-point differences from the differently shaped matmuls).
Does training use KV Cache? No. Training processes the full known sequence in parallel using causal masking. The autoregressive structure is enforced by the mask, not by sequential generation. The KV Cache concept only makes sense in the sequential decode setting.
What models support it? All Decoder-only models (GPT family, LLaMA, Mistral, Gemma, Qwen) and Encoder-Decoder models during decoding. Encoder-only models (BERT, RoBERTa) do not generate autoregressively and do not need it.
What are the downsides? Three: higher memory per request; limited concurrency on memory-constrained hardware; and prefill latency cannot be reduced by the cache (the first step still processes the full prompt with no shortcut).
Does context length limit matter more with KV Cache? Yes. Without KV Cache, long context just means more recomputation. With KV Cache, long context means more memory — the cache grows linearly. A 128k-token context on Llama-7B requires:
2 × 1 × 131072 × 32 × 32 × 128 × 2 bytes ≈ 64 GB
That is more than the entire model on an A100 80GB. This is the root pressure that drove MQA, GQA, and quantized KV representations.
22.9 KV Cache Optimization Directions
KV Cache is standard in every production deployment, but it has spun off a family of optimizations:
Reduce the number of KV heads (less to cache):
- MQA — all query heads share one K/V pair
- GQA — groups of query heads share one K/V pair
- Covered in Chapter 23.
Reduce KV precision (smaller per-element footprint):
- INT8 or FP8 quantization of cached K and V
- Memory roughly halved with negligible quality loss
Evict or compress old KV entries:
- Sliding Window Attention — cache only recent window
- StreamingLLM — keep "attention sink" tokens + recent window
- PagedAttention (vLLM) — manage KV pages like OS virtual memory
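The eviction idea behind Sliding Window Attention reduces to a fixed-capacity buffer: once the window is full, the oldest K/V rows are dropped. A toy sketch with string stand-ins for the real key vectors:

```python
from collections import deque

window = 4  # cache only the most recent `window` tokens' K/V rows
k_cache = deque(maxlen=window)  # deque with maxlen evicts the oldest entry

for token_idx in range(10):
    k_cache.append(f"K_row_{token_idx}")  # stand-in for a real key vector

# Only the last `window` entries survive; older rows were evicted
assert list(k_cache) == ["K_row_6", "K_row_7", "K_row_8", "K_row_9"]
```

StreamingLLM refines this by pinning the first few "attention sink" tokens so they are never evicted alongside the sliding window.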
22.10 Chapter Summary
| Concept | Key Point |
|---|---|
| Why cache K and V | Old tokens' K and V are constant during inference; recomputing them is O(N²) total waste |
| Why not cache Q | Q is only needed for the current (new) token |
| Compute savings | Per-step cost drops from O(N²) to O(N); measured 5× speedup |
| Memory cost formula | 2 × batch × ctx × layers × heads × d_head × bytes |
| Llama-7B at 4k ctx | ~2 GB KV Cache per batch element |
| Prefill vs Decode | Prefill: compute-bound. Decode: memory-bandwidth-bound |
| Hugging Face toggle | use_cache=True (default) |
Chapter Checklist
After this chapter, you should be able to:
- Explain why autoregressive generation creates redundant computation without a cache.
- Explain why only K and V are cached, not Q.
- Calculate the KV Cache memory for a given model configuration.
- Describe the prefill and decode phases and their respective bottlenecks.
- Run the benchmark code and interpret the speedup.
Further Reading
- Efficient Transformers: A Survey (Tay et al., 2020) — arXiv 2009.06732
- vLLM: Easy, Fast, and Cheap LLM Serving (PagedAttention) — github.com/vllm-project/vllm
- FlashAttention-2 (Dao, 2023) — fast exact Attention kernels; its Flash-Decoding follow-up accelerates Attention over the KV Cache during decode
See You in the Next Chapter
KV Cache memory scales with the number of attention heads: 32 heads means storing 32 sets of K and 32 sets of V per layer. The natural follow-up question is whether all 32 sets actually need to be independent.
Chapter 23 covers the evolution from MHA (every head its own K/V) through MQA (all heads share one K/V) to GQA (groups of heads share K/V) — the architecture changes that let modern models serve longer contexts without running out of memory.