Transformer Architecture Book — English Edition Now Online

April 25, 2026

Why an English Edition

The original LLM Transformer Book was written in Chinese, and over the past months readers from outside the Chinese-speaking world kept asking when an English version would land. The short answer is: today.

The English edition is now fully online — 32 chapters and 3 appendices, all freely readable on the web.


What's In the Book

The book is organized into 9 parts plus three appendices:

Part 1 — Build Intuition

  • Chapter 1: What is GPT? A short history of LLMs
  • Chapter 2: Large models are two files (weights + runner)
  • Chapter 3: The Transformer map — a single mental model that fits the whole architecture

Part 2 — Core Components

  • Tokenization, positional encoding, LayerNorm, Softmax, FFN — the small parts that make the whole thing work

Part 3 — Attention, Properly

  • Linear transforms, attention geometry, the meaning of Q/K/V, multi-head attention, the output projection — five chapters that build the intuition step by step

Part 4 — Full Architecture

  • Residual connections, the embedding-plus-position bookkeeping, the complete forward pass, training vs inference, learning-rate schedules
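Of the Part 4 topics, learning-rate schedules are easy to show in a few lines. Below is a toy sketch of linear warmup followed by cosine decay — my illustration, not the book's code, and the warmup/total/peak/floor values are arbitrary placeholders, not recommendations:

```python
import math

def lr(step, warmup=1000, total=100000, peak=3e-4, floor=3e-5):
    # Linear warmup from 0 to the peak rate over `warmup` steps,
    # then cosine decay from peak down to a floor over the rest.
    # All four hyperparameters here are illustrative values only.
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

Warmup avoids unstable early updates at full learning rate; the cosine tail lets the model settle into a minimum.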

Part 5 — Code Implementation

  • Three hand-written files: model.py, train.py, inference.py. Not API calls. Real code you can read and run.

Part 6 — Production Optimization

  • Flash Attention — why GPU memory hierarchy is the real bottleneck
  • KV Cache — the trick that turns O(N²) per-token inference cost into O(N)
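To make the KV-cache idea concrete, here is a toy sketch in plain Python — my illustration, not the book's implementation. Keys and values are appended once per decode step, so step t attends over t cached entries instead of re-encoding and re-scoring the entire prefix:

```python
import math

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector q
    # against cached key list K and value list V.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

class KVCache:
    """Append-only cache: each decode step adds one key/value pair,
    so step t costs O(t) attention work instead of recomputing all pairs."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        return attend(q, self.K, self.V)
```

A real implementation stores per-head tensors on the GPU and pre-allocates the cache, but the access pattern is exactly this.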

Part 7 — Architecture Variants

  • MHA → MQA → GQA, sparse and infinite attention, positional encoding evolution from sinusoidal through RoPE, ALiBi, and YaRN
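Of the positional-encoding variants listed above, RoPE is the easiest to sketch. The toy function below — an illustration, not the book's code — rotates each (even, odd) dimension pair by a position-dependent angle, which is what makes attention scores depend only on relative position:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional embedding: rotate each (even, odd) pair of
    dimensions by an angle that depends on position and pair index."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))      # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because 2D rotations compose, the dot product of a query rotated to position m and a key rotated to position n depends only on n − m, which is the property that lets RoPE generalize across sequence lengths.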

Part 8 — Deployment and Fine-Tuning

  • LoRA and QLoRA, model quantization (GPTQ, AWQ, GGUF)
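As a flavor of what the quantization chapter covers, here is the simplest possible scheme — symmetric round-to-nearest with one scale per tensor. This is a toy sketch of my own, not the book's code; GPTQ and AWQ are considerably more sophisticated, but the quantize/dequantize round trip is the same basic shape:

```python
def quantize(weights, bits=8):
    # Symmetric round-to-nearest: map floats onto signed ints in
    # [-(2^(bits-1) - 1), 2^(bits-1) - 1] using a single scale factor.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Recover approximate floats; error is bounded by scale / 2 per weight.
    return [qi * scale for qi in q]
```

The point of the fancier methods is choosing scales (and which weights to round up vs down) so that the *model's outputs*, not just the weights, stay close to the full-precision ones.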

Part 9 — Frontier Progress

  • Prompt engineering, RLHF and DPO, Mixture-of-Experts, reasoning models (o1, o3, R1, K1.5), post-Transformer architectures (Mamba, RWKV, hybrids)

Appendix

  • Scaling laws and compute estimation
  • Decoding strategies (greedy, sampling, beam search, top-k, top-p)
  • 37-question FAQ for self-assessment
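As a taste of the decoding-strategies appendix, here is a toy greedy decoder and top-k sampler over raw logits — an illustration of the ideas, not the book's code:

```python
import math
import random

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    # Always pick the single highest-logit token: deterministic, but
    # prone to repetitive text.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng=random):
    # Keep only the k highest logits, renormalize, and sample from them.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]
```

Top-p (nucleus) sampling is the same idea with a dynamic cutoff: keep the smallest set of tokens whose probabilities sum past p, rather than a fixed k.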

What Makes This Book Different

Intuition first, formulas second. Every chapter builds the mental picture before the math. Once the picture is right, the equations are just precise descriptions of what you already understand.

From-scratch code. The implementation chapters write the model in plain PyTorch — no nn.MultiheadAttention shortcuts. The complete code lives at github.com/waylandzhang/Transformer-from-scratch.

Current through 2025. The book covers OpenAI o1/o3, DeepSeek R1, Claude Opus 4.7, Kimi K1.5, Gemini 2.5, Flash Attention 3, and the post-Transformer architectures shipping in 2025. Reasoning models, MoE-at-scale, and Mamba get full chapters.

The English edition is its own pass. This is not machine translation. The whole book was rewritten with English-native phrasing, English-friendly analogies, and a tighter voice. Where the Chinese edition leaned on idioms or cultural references that wouldn't carry, the English edition rebuilds the explanation from first principles.


Who Should Read This

  • Developers who use the OpenAI / Anthropic APIs and want to understand what's behind them
  • ML engineers who've read scattered Transformer posts but still don't have a single mental model that fits
  • Anyone implementing a Transformer from scratch or fine-tuning one in production
  • Practitioners trying to track 2024-2025 frontier progress without drowning in arXiv

Start Reading

The book is completely free to read online:

Read the Transformer Book — English Edition →

The Chinese edition is still maintained at /llm-transformer-book — the two editions track each other.

If you find a mistake or have a suggestion, the source lives in the same repo as this site. Pull requests welcome.


This English edition was produced with multi-pass review and a final tier-1/3 editing sweep. The companion AI Agent book is also available in English at /ai-agent-book-en.