One-sentence summary: Inference is load model → encode prompt → autoregressive generation → decode output. Thirty lines of code, but this is the moment the model speaks.
Complete code repository: github.com/waylandzhang/Transformer-from-scratch
20.1 Inference vs Training
20.1.1 Recap from Chapter 16
| | Training | Inference |
|---|---|---|
| Goal | learn parameters | generate text |
| Input | full sequence + targets | prompt only |
| Output | loss value | generated text |
| Parameter updates | yes | no |
| Dropout | on | off |
20.1.2 Core Inference Flow
1. Load the trained model
2. Encode the prompt to token IDs
3. Autoregressive generation (one token at a time)
4. Decode token IDs back to text
20.2 Loading the Model
20.2.1 Load the Checkpoint
```python
# load model
import torch
import tiktoken
from model import Model

# load checkpoint
checkpoint = torch.load('model/model.ckpt')

# restore hyperparameters from checkpoint
h_params = checkpoint['h_params']

# reconstruct model architecture
model = Model(h_params)

# load parameters
model.load_state_dict(checkpoint['model_state_dict'])

# switch to evaluation mode
model.eval()

# move to the correct device
model.to(h_params['device'])
```
20.2.2 Why model.eval()?
Calling model.eval() does two things:
- Disables Dropout: inference does not need random dropping
- Fixes BatchNorm: uses statistics accumulated during training
Without switching to evaluation mode, Dropout stays active, so the same prompt can produce different logits on every call. That is usually not what you want during inference.
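The effect is easy to see with a standalone Dropout layer, independent of our model (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()    # training mode: entries are randomly zeroed (survivors scaled by 1/(1-p) = 2)
print(dropout(x))

dropout.eval()     # evaluation mode: dropout is a no-op
print(dropout(x))  # identical to the input
```

Running the train-mode line twice gives two different masks; the eval-mode line is deterministic.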
20.3 Preparing the Input
20.3.1 Encode the Prompt
```python
# encode input
encoding = tiktoken.get_encoding("cl100k_base")

# what do you want the model to continue?
start = "农夫山泉 "

# encode to token IDs
start_ids = encoding.encode(start)
print(f"Prompt: {start}")
print(f"Token IDs: {start_ids}")

# convert to Tensor
x = torch.tensor(start_ids, dtype=torch.long, device=h_params['device'])
x = x.unsqueeze(0)  # add batch dimension: [seq_len] -> [1, seq_len]
print(f"Input shape: {x.shape}")
```
Sample output:
```
Prompt: 农夫山泉
Token IDs: [161, 253, 109, 26288, 239, 103]
Input shape: torch.Size([1, 6])
```
The model never sees the raw string. It sees integers mapped to rows of the embedding table. The unsqueeze(0) adds the batch dimension — the model expects [batch, seq] even for a single prompt.
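To make "integers mapped to rows of the embedding table" concrete, here is a standalone sketch using an untrained nn.Embedding with the same shape as this chapter's model (100,256-token cl100k_base vocabulary, 80-dimensional embeddings):

```python
import torch
import torch.nn as nn

# same shape as this chapter's model: cl100k_base vocab size x 80 dimensions
emb = nn.Embedding(num_embeddings=100256, embedding_dim=80)

token_ids = torch.tensor([[161, 253, 109]])  # [batch=1, seq=3]
vectors = emb(token_ids)                     # each ID selects one row of the table
print(vectors.shape)                         # torch.Size([1, 3, 80])
```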
20.4 Generating Text
20.4.1 Call the Generation Function
```python
# generate text
with torch.no_grad():  # no gradient computation
    y = model.generate(
        x,
        max_new_tokens=200,  # generate up to 200 new tokens
        temperature=0.5,     # temperature: lower = more deterministic
        top_k=None           # no top-k filtering
    )

# decode
output_text = encoding.decode(y[0].tolist())
print('---------------')
print(output_text)
print('---------------')
```
20.4.2 Sample Generation Output
```
---------------
农夫山泉 天然水 550ml 瓶装
农夫山泉 东方树叶 茉莉花茶 500ml
农夫山泉 NFC 橙汁 300ml
农夫山泉 维他命水 柠檬味 500ml
---------------
```
The model has learned to generate text that looks like product names — the style of the training data. This is important to remember: the model learns patterns in data, not understanding of content.
20.5 Generation Parameters in Detail
20.5.1 Temperature
```python
y = model.generate(x, temperature=0.5)
```
Temperature shapes the output distribution:
| Temperature | Effect | Use case |
|---|---|---|
| 0.1-0.3 | very deterministic, repetitive | factual Q&A |
| 0.5-0.7 | balanced randomness and determinism | general use |
| 0.8-1.0 | more varied, creative | creative writing |
| > 1.0 | very random, may be incoherent | experimentation |
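Mechanically, temperature just divides the logits before the softmax: low values sharpen the distribution, high values flatten it. A minimal sketch with made-up logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical scores for 4 tokens

for t in (0.2, 1.0, 2.0):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: max prob = {probs.max().item():.3f}")
```

At T=0.2 almost all mass sits on the top token; at T=2.0 the distribution is much flatter, so sampling becomes more varied.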
20.5.2 Top-K Sampling
```python
y = model.generate(x, top_k=50)
```
Sample only from the K highest-probability tokens:
```
Original distribution:
  "天" = 0.30, "矿" = 0.20, "冰" = 0.15, ... (100k tokens)

After Top-K=3:
  "天" = 0.46, "矿" = 0.31, "冰" = 0.23
  (renormalized over just these 3 tokens: e.g. 0.30 / (0.30 + 0.20 + 0.15) ≈ 0.46)
```
Why it helps: prevents the model from occasionally sampling bizarre low-probability tokens that ruin an otherwise coherent output.
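In code, top-k filtering is usually applied to the logits rather than the probabilities: everything below the K-th largest logit is set to -inf, so the softmax automatically renormalizes over the survivors. A minimal sketch (the logit values are made up):

```python
import torch

logits = torch.tensor([[2.0, 1.5, 1.0, 0.2, -0.5, -1.0]])  # [batch, vocab]
k = 3

v, _ = torch.topk(logits, k)            # the k largest logits per row
cutoff = v[:, [-1]]                     # the k-th largest value, kept as [batch, 1]
filtered = logits.masked_fill(logits < cutoff, float('-inf'))
probs = torch.softmax(filtered, dim=-1)
print(probs)  # exactly 3 nonzero entries, summing to 1
```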
20.5.3 Max New Tokens
```python
y = model.generate(x, max_new_tokens=200)
```
Controls generation length:
- Too short: may produce incomplete output
- Too long: wastes compute, increases chance of repetition
20.6 Inspecting Model Parameters
20.6.1 Print Parameter Count
```python
# count parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model param size: {total_params:,}")
```
Sample output:
```
Model param size: 8,234,560
```
20.6.2 Inspect Layer-by-Layer
```python
# print each layer's name and shape
for name, param in model.state_dict().items():
    print(f"{name}: {param.shape}")
```
Sample output:
```
token_embedding_lookup_table.weight: torch.Size([100256, 80])
transformer_blocks.0.ln1.weight: torch.Size([80])
transformer_blocks.0.ln1.bias: torch.Size([80])
transformer_blocks.0.mha.heads.0.Wq.weight: torch.Size([20, 80])
transformer_blocks.0.mha.heads.0.Wk.weight: torch.Size([20, 80])
transformer_blocks.0.mha.heads.0.Wv.weight: torch.Size([20, 80])
...
model_out_linear_layer.weight: torch.Size([100256, 80])
model_out_linear_layer.bias: torch.Size([100256])
```
This is the full parameter layout of the model — every matrix has a name and a shape. You can cross-reference these names with the class definitions in Chapter 18.
20.7 Complete inference.py
```python
# -*- coding: utf-8 -*-
"""
Sample from a trained model
"""
import torch
import tiktoken
from model import Model

# load model and hyperparameters
checkpoint = torch.load('model/model.ckpt')
h_params = checkpoint['h_params']
model = Model(h_params)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
model.to(h_params['device'])

# load tokenizer
encoding = tiktoken.get_encoding("cl100k_base")

# input prompt
start = "农夫山泉 "
start_ids = encoding.encode(start)
x = torch.tensor(start_ids, dtype=torch.long, device=h_params['device'])[None, ...]

# generate
with torch.no_grad():
    y = model.generate(x, max_new_tokens=200, temperature=0.5, top_k=None)
print('---------------')
print(encoding.decode(y[0].tolist()))
print('---------------')

# print parameter count
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model param size: {total_params:,}")

# print parameter shapes
for name, param in model.state_dict().items():
    print(name, param.shape)
```
20.8 Varying the Prompt
20.8.1 Try Different Inputs
```python
# try different prompts
prompts = [
    "农夫山泉",
    "可口可乐",
    "奥利奥",
    "蒙牛"
]

for prompt in prompts:
    x = torch.tensor(encoding.encode(prompt), dtype=torch.long, device=h_params['device'])[None, ...]
    with torch.no_grad():
        y = model.generate(x, max_new_tokens=50, temperature=0.5)
    print(f"Prompt: {prompt}")
    print(f"Output: {encoding.decode(y[0].tolist())}")
    print("---")
```
20.8.2 Observing the Output
The model generates in the style of its training data:
- If training data was product names, it generates product-name-style text
- If training data was fiction, it generates fiction-style text
- If training data was code, it generates code-style text
The model learns the statistical patterns in data, not the "meaning" behind them. This is both the source of its capability and its limitation.
20.9 Visualizing Autoregressive Generation
20.9.1 Stepping Through Generation
```python
# visualize generation process
def generate_with_trace(model, x, max_new_tokens=10, temperature=1.0):
    """Generation with step-by-step tracing"""
    encoding = tiktoken.get_encoding("cl100k_base")
    print(f"Initial prompt: {encoding.decode(x[0].tolist())}")
    print("---")
    for i in range(max_new_tokens):
        # forward pass
        with torch.no_grad():
            logits, _ = model(x[:, -model.context_length:])
        # get last-position predictions
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        # get top-5 candidates
        top5_probs, top5_ids = torch.topk(probs[0], 5)
        print(f"Step {i+1} candidates:")
        for prob, idx in zip(top5_probs, top5_ids):
            print(f"  '{encoding.decode([idx.item()])}': {prob.item():.3f}")
        # sample
        idx_next = torch.multinomial(probs, num_samples=1)
        x = torch.cat((x, idx_next), dim=1)
        print(f"  -> selected: '{encoding.decode([idx_next[0].item()])}'")
        print(f"  current sequence: {encoding.decode(x[0].tolist())}")
        print("---")
    return x
```
20.9.2 Sample Trace Output
```
Initial prompt: 农夫山泉
---
Step 1 candidates:
  '天': 0.312
  '矿': 0.198
  '有': 0.087
  '纯': 0.076
  '水': 0.065
  -> selected: '天'
  current sequence: 农夫山泉天
---
Step 2 candidates:
  '然': 0.421
  '山': 0.156
  '地': 0.089
  '的': 0.067
  '下': 0.054
  -> selected: '然'
  current sequence: 农夫山泉天然
---
...
```
This trace makes the autoregressive process concrete. Each step sees the accumulated sequence and produces a probability distribution. The sampling makes it stochastic — running it twice may produce different text.
20.10 Common Issues
20.10.1 Repetitive Output
Symptom: the model keeps repeating the same word or phrase.
Causes:
- temperature too low
- training data itself has repetitive patterns
- model has overfit
Fixes:
- raise temperature
- use top-k or top-p sampling
- add a repetition penalty
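Of these fixes, the repetition penalty is the only one not already exposed by this chapter's generate call. One common scheme (introduced with the CTRL model) divides positive logits of already-emitted tokens by a constant and multiplies negative ones, making repeats less likely either way. A hypothetical helper sketch, not part of this repository's model:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Hypothetical helper: down-weight tokens that already appear in the output.
    # Positive logits shrink (divide), negative logits grow more negative (multiply).
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] = logits[token_id] / penalty
        else:
            logits[token_id] = logits[token_id] * penalty
    return logits

logits = torch.tensor([3.0, -1.0, 0.5])
apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.5)
print(logits)  # tokens 0 and 1 penalized: [2.0, -1.5, 0.5]
```

In a real loop you would apply this to the last-position logits each step, right before the softmax.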
20.10.2 Incoherent Output
Symptom: output is garbled or nonsensical.
Causes:
- model undertrained
- prompt is far outside the training distribution
- temperature too high
Fixes:
- train for more steps
- use a more in-distribution prompt
- lower temperature
20.10.3 Slow Generation
Symptom: each token takes a long time.
Causes:
- no GPU
- no KV Cache
- model is too large
Fixes:
- use GPU if available
- implement KV Cache (Chapter 22)
- use a smaller model
20.11 Chapter Summary
20.11.1 Three-Step Inference
1. Load model

```python
checkpoint = torch.load('model.ckpt')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```

2. Encode prompt

```python
start_ids = encoding.encode(prompt)
x = torch.tensor(start_ids)[None, ...]
```

3. Generate

```python
with torch.no_grad():
    y = model.generate(x, max_new_tokens=200)
output = encoding.decode(y[0].tolist())
```
20.11.2 Key Parameters
| Parameter | Role | Recommended range |
|---|---|---|
| max_new_tokens | maximum generation length | 50-500 |
| temperature | randomness control | 0.5-0.8 |
| top_k | restrict candidate tokens | 50-100 |
20.11.3 Core Insight
inference.py is 30 lines of code, but it is the destination of the entire journey: the moment the model speaks. Load parameters, encode the prompt, generate autoregressively, decode the output. That is how every GPT-style model, from our educational toy to GPT-4, responds to a prompt.
Chapter Checklist
After this chapter you should be able to:
- Load a trained model checkpoint correctly.
- Explain what model.eval() does and why it matters.
- Use temperature and top-k to control generation.
- Run the inference script end to end.
Part 5 Summary
You have completed the code implementation section:
| Chapter | Content | Code size |
|---|---|---|
| Ch. 18 | model.py — model definition | ~200 lines |
| Ch. 19 | train.py — training loop | ~100 lines |
| Ch. 20 | inference.py — inference logic | ~30 lines |
Under 400 lines of code to implement a complete working Transformer.
These are simplified compared to production LLMs, but they contain the same core logic. Once you understand this code, you can read Hugging Face Transformers, LLaMA, and GPT-NeoX source code and recognize every component.
Complete Code
The full Part 5 implementation is on GitHub: github.com/waylandzhang/Transformer-from-scratch
Includes:
- model.py — complete model definition
- train.py — training script
- inference.py — inference script
- step-by-step.ipynb — annotated Jupyter notebook
See You in the Next Chapter
Our model works, but it is slow. Every token generation requires running the full forward pass over the entire sequence — recomputing every Key and Value matrix from scratch each time. That is wasteful.
Part 6 covers production optimization. We will look at Flash Attention and KV Cache, the two most impactful speedups in modern Transformer inference systems. KV Cache eliminates the redundant computation you just saw — instead of recomputing K and V for all previous tokens, it keeps them in memory and only computes the new token's contribution. That single optimization can make inference several times faster.
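As a preview, the core data movement of a KV Cache fits in a few lines (a conceptual sketch only; the real implementation, wired into attention, comes in Part 6):

```python
import torch

d_k = 20                            # per-head dimension, matching this chapter's model
k_cache = torch.randn(1, 6, d_k)    # keys for the 6 prompt tokens, computed once
v_cache = torch.randn(1, 6, d_k)    # values for the 6 prompt tokens

# at each generation step, only the NEW token is projected...
k_new, v_new = torch.randn(1, 1, d_k), torch.randn(1, 1, d_k)

# ...and appended to the cache instead of recomputing all previous K and V
k_cache = torch.cat([k_cache, k_new], dim=1)
v_cache = torch.cat([v_cache, v_new], dim=1)
print(k_cache.shape)  # torch.Size([1, 7, 20])
```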