One-sentence summary: Decoding is how we turn a probability distribution into the next token.


B.1 Why Decoding Matters

The model outputs probabilities. The decoder chooses.

A decoding strategy is the rule that turns that distribution into a concrete choice of next token.

B.2 Greedy

Pick the most likely token every time.

Good for deterministic tasks such as extraction or classification. Bad for variety: the output is fixed for a given prompt, and greedy decoding is prone to repetitive loops.

B.3 Random Sampling

Sample according to the probability distribution. This can produce more natural variation, but also more mistakes.
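Sampling can be sketched with the standard library; the toy distribution is again hypothetical:

```python
import random

# Draw the next token in proportion to its probability.
def sample_token(probs, rng=random):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # seeded only so the sketch is reproducible
token = sample_token([0.1, 0.6, 0.2, 0.1], rng)
```

Token 1 is drawn most often, but every token with nonzero probability can appear, including unlikely (and possibly wrong) ones.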

B.4 Temperature

Temperature reshapes the distribution.

temperature    effect
low (< 1)      sharper (more deterministic)
1.0            unchanged
high (> 1)     flatter (more random)
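Concretely, temperature divides the logits before the softmax. A small self-contained sketch (toy logits, no assumptions about any particular model):

```python
import math

def apply_temperature(logits, temperature):
    # Divide logits by T, then softmax. T < 1 sharpens the
    # distribution; T > 1 flattens it; T = 1 leaves it unchanged.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

With logits [2.0, 1.0], a temperature of 0.5 pushes the top token's probability up toward 0.88, while a temperature of 2.0 pulls it down toward 0.62.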

B.5 Top-K

Only sample from the k most likely tokens.
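A sketch of the filtering step: zero out everything outside the top k, then renormalize before sampling (toy values):

```python
def top_k_filter(probs, k):
    # Keep the k most likely tokens, zero out the rest, renormalize.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    z = sum(kept)
    return [p / z for p in kept]
```

With [0.1, 0.6, 0.2, 0.1] and k = 2, only tokens 1 and 2 survive, renormalized to 0.75 and 0.25.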

B.6 Top-P

Also called nucleus sampling. Keep the smallest set of tokens whose cumulative probability exceeds p.
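The nucleus can be built by walking the tokens in descending probability until the cumulative mass reaches p (toy values again):

```python
def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p (the "nucleus"), then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    z = sum(filtered)
    return [q / z for q in filtered]
```

Unlike top-k, the number of surviving tokens adapts to the distribution: a confident distribution keeps few tokens, a flat one keeps many.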

B.7 Beam Search

Beam search tracks multiple candidate sequences in parallel, keeping only the highest-scoring ones at each step. It is useful in some structured generation tasks, such as translation, but less central for open-ended chat.
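A compact sketch of beam search over a hypothetical toy "model" (`toy_log_probs` is invented here; a real decoder would query the network for next-token log-probabilities):

```python
import math

# Hypothetical toy model: given a prefix, return log-probs over a
# 3-token vocabulary, where token 2 acts as end-of-sequence.
def toy_log_probs(prefix):
    table = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.3, 0.5]}
    probs = table.get(prefix[-1] if prefix else 0, [0.4, 0.4, 0.2])
    return [math.log(p) for p in probs]

def beam_search(beam_width=2, max_len=4, eos=2):
    # Each candidate is (cumulative log-score, token list).
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:
                candidates.append((score, seq))  # finished, carry over
                continue
            for tok, lp in enumerate(toy_log_probs(seq)):
                candidates.append((score + lp, seq + [tok]))
        # Prune: keep only the top-scoring beam_width candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams
```

The pruning step is what distinguishes it from exhaustive search: only `beam_width` partial sequences survive each round.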

B.8 Repetition Penalty

Repetition penalties reduce the chance of looping. They are a practical fix for a practical failure mode.
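One common form (a sketch, not the only variant) rescales the logits of tokens that have already been generated before sampling:

```python
def penalize_repeats(logits, generated, penalty=1.2):
    # Make already-seen tokens less likely: divide positive logits
    # by the penalty, multiply negative logits by it.
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

A penalty of 1.0 is a no-op; larger values push the model harder away from repeating itself, at some cost to legitimate repetition (names, code identifiers).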

B.9 Practical Default

For many assistant tasks:

  • temperature: low to medium
  • top_p: around 0.9
  • max tokens: explicit
  • stop conditions: explicit
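As a config fragment, those defaults might look like this; the field names are illustrative and vary between APIs, so treat the values as a starting point rather than a recommendation:

```python
# Hypothetical request parameters matching the defaults above.
decode_config = {
    "temperature": 0.7,      # low to medium
    "top_p": 0.9,
    "max_tokens": 512,       # explicit cap
    "stop": ["\n\nUser:"],   # explicit stop condition (example only)
}
```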

Then evaluate on your product, not someone else's demo.


Checklist

  • Explain greedy decoding.
  • Explain temperature.
  • Compare top-k and top-p.
  • Explain why decoding settings can change perceived model quality.
Cite this page
Zhang, Wayland (2026). Appendix B: Decoding Strategies. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies
@incollection{zhang2026transformer_appendix_b_decoding_strategies,
  author = {Zhang, Wayland},
  title = {Appendix B: Decoding Strategies},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies}
}