One-sentence summary: Decoding is how we turn the model's probability distribution over the vocabulary into the next token.
B.1 Why Decoding Matters
At each step the model outputs a probability distribution over its vocabulary. The decoder chooses one token from it.
B.2 Greedy
Pick the single most likely token at every step.
Good for deterministic tasks such as extraction or code completion. Bad for variety, and prone to repetitive loops in open-ended text.
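A minimal sketch of greedy selection over a toy probability vector (the probabilities are illustrative, not from any real model):

```python
def greedy_pick(probs):
    """Return the index of the most likely token (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Toy distribution over a 3-token vocabulary.
probs = [0.1, 0.6, 0.3]
print(greedy_pick(probs))  # index of the highest-probability token
```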
B.3 Random Sampling
Sample a token in proportion to its probability. This produces more natural variation, but also admits low-probability mistakes.
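A sketch of sampling from the same kind of toy distribution, using Python's standard `random` module:

```python
import random

def sample_pick(probs, rng):
    """Sample a token index in proportion to its probability."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding

# With many draws, token frequencies track the distribution.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_pick([0.1, 0.6, 0.3], rng)] += 1
```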
B.4 Temperature
Temperature divides the logits before the softmax, reshaping the distribution.
| temperature | effect on distribution |
|---|---|
| low (< 1.0) | sharper; approaches greedy |
| 1.0 | unchanged |
| high (> 1.0) | flatter; approaches uniform |
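The effect can be sketched as a temperature-scaled softmax in pure Python (the logits here are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                     # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Lower temperatures concentrate mass on the top token; higher ones spread it out.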
B.5 Top-K
Sample only from the k most likely tokens, with their probabilities renormalized.
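A sketch of top-k filtering, assuming the surviving probabilities are renormalized before sampling:

```python
def top_k_filter(probs, k):
    """Zero out all but the k most likely tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]

print(top_k_filter([0.5, 0.3, 0.1, 0.1], 2))  # only the top 2 survive
```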
B.6 Top-P
Also called nucleus sampling. Keep the smallest set of tokens whose cumulative probability exceeds p, then sample from that set after renormalizing.
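A sketch of nucleus (top-p) filtering under the same renormalization assumption:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:                 # walk tokens from most to least likely
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    z = sum(filtered)
    return [q / z for q in filtered]

print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9))  # the long tail is cut off
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a confident model keeps few tokens, an uncertain one keeps many.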
B.7 Beam Search
Beam search tracks multiple candidate sequences. It is useful in some structured generation tasks but less central for open-ended chat.
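A toy beam search sketch; `toy_next_probs` is a hypothetical stand-in for a real model's next-token distribution:

```python
import math

def toy_next_probs(seq):
    """Hypothetical stand-in for a model: a fixed distribution over 3 tokens."""
    return [0.5, 0.3, 0.2]

def beam_search(beam_width, steps):
    """Track the beam_width highest log-probability sequences at each step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in enumerate(toy_next_probs(seq)):
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams

print(beam_search(beam_width=2, steps=3))
```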
B.8 Repetition Penalty
Repetition penalties reduce the chance of looping. They are a practical fix for a practical failure mode.
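One common form divides positive logits (and multiplies negative ones) for tokens that have already been generated; a sketch, with the penalty value an assumption:

```python
def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Push down the logits of already-generated tokens.

    Dividing a positive logit and multiplying a negative one both
    reduce the token's probability after the softmax.
    """
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Tokens 0 and 2 were already generated; their logits are penalized.
print(apply_repetition_penalty([2.0, 1.0, -0.5], generated=[0, 2]))
```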
B.9 Practical Default
For many assistant tasks:
- temperature: low to medium
- top_p: around 0.9
- max tokens: explicit
- stop conditions: explicit
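These defaults can be captured as a configuration sketch; every concrete value below is an assumption, a starting point to tune, and the key names mirror common sampling APIs without being tied to any specific library:

```python
# Hypothetical starting-point settings for an assistant-style task.
assistant_defaults = {
    "temperature": 0.7,      # low to medium (assumed value)
    "top_p": 0.9,            # nucleus cutoff, per the list above
    "max_tokens": 512,       # explicit output budget (assumed value)
    "stop": ["</answer>"],   # explicit stop sequence (example only)
}
```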
Then evaluate on your product, not someone else's demo.
Checklist
- Explain greedy decoding.
- Explain temperature.
- Compare top-k and top-p.
- Explain why decoding settings can change perceived model quality.