One-sentence summary: Inference is load model → encode prompt → autoregressive generation → decode output. Thirty lines of code, but this is the moment the model speaks.
Complete code repository: github.com/waylandzhang/Transformer-from-scratch
20.1 Inference vs Training
20.1.1 Recap from Chapter 16
| | Training | Inference |
|---|---|---|
| Goal | learn parameters | generate text |
| Input | full sequence + targets | prompt only |
| Output | loss value | generated text |
| Parameter updates | yes | no |
| Dropout | on | off |
20.1.2 Core Inference Flow
1. Load the trained model
2. Encode the prompt to token IDs
3. Autoregressive generation (one token at a time)
4. Decode token IDs back to text
20.2 Loading the Model
20.2.1 Load the Checkpoint
```python
# load model
import torch
import tiktoken
from model import Model

# load checkpoint
checkpoint = torch.load('model/model.ckpt')

# restore hyperparameters from checkpoint
h_params = checkpoint['h_params']

# reconstruct model architecture
model = Model(h_params)

# load parameters
model.load_state_dict(checkpoint['model_state_dict'])

# switch to evaluation mode
model.eval()

# move to the correct device
model.to(h_params['device'])
```
20.2.2 Why model.eval()?
Calling model.eval() does two things:
- Disables Dropout: inference does not need random dropping
- Fixes BatchNorm: uses statistics accumulated during training
Without switching to evaluation mode, Dropout stays active, so the same prompt can produce different logits on every call. That is usually not what you want during inference.
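The effect is easy to see with a standalone Dropout layer, independent of our model (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()    # training mode: entries are randomly zeroed (survivors scaled by 1/(1-p) = 2)
print(dropout(x))

dropout.eval()     # evaluation mode: dropout is a no-op
print(dropout(x))  # identical to the input
```

Running the train-mode line twice gives two different masks; the eval-mode line is deterministic.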
20.3 Preparing the Input
20.3.1 Encode the Prompt
```python
# encode input
encoding = tiktoken.get_encoding("cl100k_base")

# what do you want the model to continue?
start = "农夫山泉 "

# encode to token IDs
start_ids = encoding.encode(start)
print(f"Prompt: {start}")
print(f"Token IDs: {start_ids}")

# convert to Tensor
x = torch.tensor(start_ids, dtype=torch.long, device=h_params['device'])
x = x.unsqueeze(0)  # add batch dimension: [seq_len] -> [1, seq_len]
print(f"Input shape: {x.shape}")
```
Sample output:
```
Prompt: 农夫山泉
Token IDs: [161, 253, 109, 26288, 239, 103]
Input shape: torch.Size([1, 6])
```
The model never sees the raw string. It sees integers mapped to rows of the embedding table. The unsqueeze(0) adds the batch dimension — the model expects [batch, seq] even for a single prompt.
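To make "integers mapped to rows of the embedding table" concrete, here is a standalone sketch using an untrained nn.Embedding with the same shape as this chapter's model (100,256-token cl100k_base vocabulary, 80-dimensional embeddings):

```python
import torch
import torch.nn as nn

# same shape as this chapter's model: cl100k_base vocab size x 80 dimensions
emb = nn.Embedding(num_embeddings=100256, embedding_dim=80)

token_ids = torch.tensor([[161, 253, 109]])  # [batch=1, seq=3]
vectors = emb(token_ids)                     # each ID selects one row of the table
print(vectors.shape)                         # torch.Size([1, 3, 80])
```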
20.4 Generating Text
20.4.1 Call the Generation Function
```python
# generate text
with torch.no_grad():  # no gradient computation
    y = model.generate(
        x,
        max_new_tokens=200,  # generate up to 200 new tokens
        temperature=0.5,     # temperature: lower = more deterministic
        top_k=None           # no top-k filtering
    )

# decode
output_text = encoding.decode(y[0].tolist())
print('---------------')
print(output_text)
print('---------------')
```
20.4.2 Sample Generation Output
```
---------------
农夫山泉 天然水 550ml 瓶装
农夫山泉 东方树叶 茉莉花茶 500ml
农夫山泉 NFC 橙汁 300ml
农夫山泉 维他命水 柠檬味 500ml
---------------
```
The model has learned to generate text that looks like product names — the style of the training data. This is important to remember: the model learns patterns in data, not understanding of content.
20.5 Generation Parameters in Detail
20.5.1 Temperature
```python
y = model.generate(x, temperature=0.5)
```
Temperature shapes the output distribution:
| Temperature | Effect | Use case |
|---|---|---|
| 0.1-0.3 | very deterministic, repetitive | factual Q&A |
| 0.5-0.7 | balanced randomness and determinism | general use |
| 0.8-1.0 | more varied, creative | creative writing |
| > 1.0 | very random, may be incoherent | experimentation |
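Mechanically, temperature just divides the logits before the softmax: low values sharpen the distribution, high values flatten it. A minimal sketch with made-up logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical scores for 4 tokens

for t in (0.2, 1.0, 2.0):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: max prob = {probs.max().item():.3f}")
```

At T=0.2 almost all mass sits on the top token; at T=2.0 the distribution is much flatter, so sampling becomes more varied.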
20.5.2 Top-K Sampling
```python
y = model.generate(x, top_k=50)
```
Sample only from the K highest-probability tokens:
```
Original distribution:
  "天" = 0.30, "矿" = 0.20, "冰" = 0.15, ... (100k tokens)

After Top-K=3:
  "天" = 0.46, "矿" = 0.31, "冰" = 0.23
  (renormalized over just these 3 tokens: e.g. 0.30 / (0.30 + 0.20 + 0.15) ≈ 0.46)
```
Why it helps: prevents the model from occasionally sampling bizarre low-probability tokens that ruin an otherwise coherent output.
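In code, top-k filtering is usually applied to the logits rather than the probabilities: everything below the K-th largest logit is set to -inf, so the softmax automatically renormalizes over the survivors. A minimal sketch (the logit values are made up):

```python
import torch

logits = torch.tensor([[2.0, 1.5, 1.0, 0.2, -0.5, -1.0]])  # [batch, vocab]
k = 3

v, _ = torch.topk(logits, k)            # the k largest logits per row
cutoff = v[:, [-1]]                     # the k-th largest value, kept as [batch, 1]
filtered = logits.masked_fill(logits < cutoff, float('-inf'))
probs = torch.softmax(filtered, dim=-1)
print(probs)  # exactly 3 nonzero entries, summing to 1
```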
20.5.3 Max New Tokens
```python
y = model.generate(x, max_new_tokens=200)
```
Controls generation length:
- Too short: may produce incomplete output
- Too long: wastes compute, increases chance of repetition
20.6 Inspecting Model Parameters
20.6.1 Print Parameter Count
```python
# count parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model param size: {total_params:,}")
```
Sample output:
```
Model param size: 8,234,560
```
20.6.2 Inspect Layer-by-Layer
```python
# print each layer's name and shape
for name, param in model.state_dict().items():
    print(f"{name}: {param.shape}")
```
Sample output:
```
token_embedding_lookup_table.weight: torch.Size([100256, 80])
transformer_blocks.0.ln1.weight: torch.Size([80])
transformer_blocks.0.ln1.bias: torch.Size([80])
transformer_blocks.0.mha.heads.0.Wq.weight: torch.Size([20, 80])
transformer_blocks.0.mha.heads.0.Wk.weight: torch.Size([20, 80])
transformer_blocks.0.mha.heads.0.Wv.weight: torch.Size([20, 80])
...
model_out_linear_layer.weight: torch.Size([100256, 80])
model_out_linear_layer.bias: torch.Size([100256])
```
This is the full parameter layout of the model — every matrix has a name and a shape. You can cross-reference these names with the class definitions in Chapter 18.
20.7 Complete inference.py
```python
# -*- coding: utf-8 -*-
"""
Sample from a trained model
"""
import torch
import tiktoken
from model import Model

# load model and hyperparameters
checkpoint = torch.load('model/model.ckpt')
h_params = checkpoint['h_params']
model = Model(h_params)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
model.to(h_params['device'])

# load tokenizer
encoding = tiktoken.get_encoding("cl100k_base")

# input prompt
start = "农夫山泉 "
start_ids = encoding.encode(start)
x = torch.tensor(start_ids, dtype=torch.long, device=h_params['device'])[None, ...]

# generate
with torch.no_grad():
    y = model.generate(x, max_new_tokens=200, temperature=0.5, top_k=None)
print('---------------')
print(encoding.decode(y[0].tolist()))
print('---------------')

# print parameter count
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model param size: {total_params:,}")

# print parameter shapes
for name, param in model.state_dict().items():
    print(name, param.shape)
```
20.8 Varying the Prompt
20.8.1 Try Different Inputs
```python
# try different prompts
prompts = [
    "农夫山泉",
    "可口可乐",
    "奥利奥",
    "蒙牛"
]

for prompt in prompts:
    x = torch.tensor(encoding.encode(prompt), dtype=torch.long, device=h_params['device'])[None, ...]
    with torch.no_grad():
        y = model.generate(x, max_new_tokens=50, temperature=0.5)
    print(f"Prompt: {prompt}")
    print(f"Output: {encoding.decode(y[0].tolist())}")
    print("---")
```
20.8.2 Observing the Output
The model generates in the style of its training data:
- If training data was product names, it generates product-name-style text
- If training data was fiction, it generates fiction-style text
- If training data was code, it generates code-style text
The model learns the statistical patterns in data, not the "meaning" behind them. This is both the source of its capability and its limitation.
20.9 Visualizing Autoregressive Generation
20.9.1 Stepping Through Generation
```python
# visualize generation process
def generate_with_trace(model, x, max_new_tokens=10, temperature=1.0):
    """Generation with step-by-step tracing"""
    encoding = tiktoken.get_encoding("cl100k_base")
    print(f"Initial prompt: {encoding.decode(x[0].tolist())}")
    print("---")
    for i in range(max_new_tokens):
        # forward pass
        with torch.no_grad():
            logits, _ = model(x[:, -model.context_length:])
        # get last-position predictions
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        # get top-5 candidates
        top5_probs, top5_ids = torch.topk(probs[0], 5)
        print(f"Step {i+1} candidates:")
        for prob, idx in zip(top5_probs, top5_ids):
            print(f"  '{encoding.decode([idx.item()])}': {prob.item():.3f}")
        # sample
        idx_next = torch.multinomial(probs, num_samples=1)
        x = torch.cat((x, idx_next), dim=1)
        print(f"  -> selected: '{encoding.decode([idx_next[0].item()])}'")
        print(f"  current sequence: {encoding.decode(x[0].tolist())}")
        print("---")
    return x
```
20.9.2 Sample Trace Output
```
Initial prompt: 农夫山泉
---
Step 1 candidates:
  '天': 0.312
  '矿': 0.198
  '有': 0.087
  '纯': 0.076
  '水': 0.065
  -> selected: '天'
  current sequence: 农夫山泉天
---
Step 2 candidates:
  '然': 0.421
  '山': 0.156
  '地': 0.089
  '的': 0.067
  '下': 0.054
  -> selected: '然'
  current sequence: 农夫山泉天然
---
...
```
This trace makes the autoregressive process concrete. Each step sees the accumulated sequence and produces a probability distribution. The sampling makes it stochastic — running it twice may produce different text.
20.10 Common Issues
20.10.1 Repetitive Output
Symptom: the model keeps repeating the same word or phrase.
Causes:
- temperature too low
- training data itself has repetitive patterns
- model has overfit
Fixes:
- raise temperature
- use top-k or top-p sampling
- add a repetition penalty
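Of these fixes, the repetition penalty is the only one not already exposed by this chapter's generate call. One common scheme (introduced with the CTRL model) divides positive logits of already-emitted tokens by a constant and multiplies negative ones, making repeats less likely either way. A hypothetical helper sketch, not part of this repository's model:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Hypothetical helper: down-weight tokens that already appear in the output.
    # Positive logits shrink (divide), negative logits grow more negative (multiply).
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] = logits[token_id] / penalty
        else:
            logits[token_id] = logits[token_id] * penalty
    return logits

logits = torch.tensor([3.0, -1.0, 0.5])
apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.5)
print(logits)  # tokens 0 and 1 penalized: [2.0, -1.5, 0.5]
```

In a real loop you would apply this to the last-position logits each step, right before the softmax.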
20.10.2 Incoherent Output
Symptom: output is garbled or nonsensical.
Causes:
- model undertrained
- prompt is far outside the training distribution
- temperature too high
Fixes:
- train for more steps
- use a more in-distribution prompt
- lower temperature
20.10.3 Slow Generation
Symptom: each token takes a long time.
Causes:
- no GPU
- no KV Cache
- model is too large
Fixes:
- use GPU if available
- implement KV Cache (Chapter 22)
- use a smaller model
20.11 Chapter Summary
20.11.1 Three-Step Inference
1. Load model

```python
checkpoint = torch.load('model.ckpt')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```

2. Encode prompt

```python
start_ids = encoding.encode(prompt)
x = torch.tensor(start_ids)[None, ...]
```

3. Generate

```python
with torch.no_grad():
    y = model.generate(x, max_new_tokens=200)
output = encoding.decode(y[0].tolist())
```
20.11.2 Key Parameters
| Parameter | Role | Recommended range |
|---|---|---|
| max_new_tokens | maximum generation length | 50-500 |
| temperature | randomness control | 0.5-0.8 |
| top_k | restrict candidate tokens | 50-100 |
20.11.3 Core Insight
inference.py is 30 lines of code, but it is the destination of the entire journey: the moment the model speaks. Load parameters, encode the prompt, generate autoregressively, decode the output. That is how every GPT-style model, from our educational toy to GPT-4, responds to a prompt.
Chapter Checklist
After this chapter you should be able to:
- Load a trained model checkpoint correctly.
- Explain what model.eval() does and why it matters.
- Use temperature and top-k to control generation.
- Run the inference script end to end.
Part 5 Summary
You have completed the code implementation section:
| Chapter | Content | Code size |
|---|---|---|
| Ch. 18 | model.py — model definition | ~200 lines |
| Ch. 19 | train.py — training loop | ~100 lines |
| Ch. 20 | inference.py — inference logic | ~30 lines |
Under 400 lines of code to implement a complete working Transformer.
These are simplified compared to production LLMs, but they contain the same core logic. Once you understand this code, you can read Hugging Face Transformers, LLaMA, and GPT-NeoX source code and recognize every component.
Complete Code
The full Part 5 implementation is on GitHub: github.com/waylandzhang/Transformer-from-scratch
Includes:
- model.py — complete model definition
- train.py — training script
- inference.py — inference script
- step-by-step.ipynb — annotated Jupyter notebook
See You in the Next Chapter
Our model works, but it is slow. Every token generation requires running the full forward pass over the entire sequence — recomputing every Key and Value matrix from scratch each time. That is wasteful.
Part 6 covers production optimization. We will look at Flash Attention and KV Cache, the two most impactful speedups in modern Transformer inference systems. KV Cache eliminates the redundant computation you just saw — instead of recomputing K and V for all previous tokens, it keeps them in memory and only computes the new token's contribution. That single optimization can make inference several times faster.
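As a preview, the core data movement of a KV Cache fits in a few lines (a conceptual sketch only; the real implementation, wired into attention, comes in Part 6):

```python
import torch

d_k = 20                            # per-head dimension, matching this chapter's model
k_cache = torch.randn(1, 6, d_k)    # keys for the 6 prompt tokens, computed once
v_cache = torch.randn(1, 6, d_k)    # values for the 6 prompt tokens

# at each generation step, only the NEW token is projected...
k_new, v_new = torch.randn(1, 1, d_k), torch.randn(1, 1, d_k)

# ...and appended to the cache instead of recomputing all previous K and V
k_cache = torch.cat([k_cache, k_new], dim=1)
v_cache = torch.cat([v_cache, v_new], dim=1)
print(k_cache.shape)  # torch.Size([1, 7, 20])
```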