Name: Transformer Architecture: From Intuition to Implementation
Author: Wayland Zhang

Why This Book Exists

At the end of 2023, I published a Chinese video series explaining Transformer architecture. I originally made it as a way to document my own learning process. Then viewers kept asking for something more durable: a text version they could revisit, search, and keep beside them while coding.

This book is that text version.

But it is not a transcript. While turning the videos into a book, I rethought the order of the explanations, added details that were only touched briefly in the videos, corrected places where my earlier understanding was not precise enough, and added developments from 2024–2026: reasoning models (OpenAI o1 and o3, DeepSeek R1, Kimi K1.5), frontier families (GPT-5, Claude Opus 4.7 and Sonnet 4.5, Gemini 2.5 with thinking mode), tool use and MCP-style integrations, preference learning (PPO, DPO, KTO), Mixture of Experts (Mixtral, DeepSeek-MoE), and Mamba-style architectures.

AI moved faster than almost anyone expected. A book about Transformers has to acknowledge that pace, while still giving you a stable foundation.

How This Book Teaches

Intuition first, formulas second.

Too many technical explanations begin with notation before the reader has any mental picture. This book uses the same teaching rhythm again and again:

Start with why: what problem does this component solve?
Build intuition: use analogy, geometry, and diagrams.
Then read the formula: once the idea is clear, the math becomes compact language.
Finally write code: runnable code is the test of understanding.

If you can explain a chapter in your own words after reading it, instead of only repeating the formula, the chapter has done its job.

Who Should Read This

This book is for you if:

You have used ChatGPT and want to understand what happens under the hood.
You have read Transformer introductions but still feel the architecture is blurry.
You want to implement a small GPT-style model instead of only calling APIs.
You are an engineer who wants a practical reference for LLM internals.
You want a map that connects GPT, LLaMA, Gemini, Claude, and modern agent systems.

This book may not be for you if:

You are completely new to neural networks.
You need formal mathematical proofs.
You only want to call an existing model as quickly as possible.

How to Read It

Fast orientation (1–2 days)

Read Part 1 in full (Chapters 1–3), then jump to Chapter 10 (QKV) and Chapter 15 (the full forward pass).

Systematic study (1–2 weeks)

Read Parts 1–5 in order (Chapters 1–20), type the code in Chapters 18–20 yourself, and check each chapter's closing self-test before moving on.

Production optimization

After the fundamentals, focus on Part 6 (Chapters 21–22: Flash Attention and KV Cache), Part 8 (Chapters 26–27: LoRA and quantization), and the Flash-Attention variants across Chapters 23–24.

Frontier tracking

Use Part 9 (Chapters 28–32) as a map for prompt engineering, RLHF and preference learning, Mixture of Experts, reasoning models, and post-Transformer architectures.

Every chapter ends with a short checklist. Treat it as a self-test: can you explain the idea without looking?

About The Code

The code in this book is meant to run. I prefer writing the important pieces from scratch because it reveals what the framework normally hides.

# This is convenient:
output = nn.MultiheadAttention(embed_dim, num_heads)(query, key, value)

# This is more educational:
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)
attention_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attention_weights, V)

When you can write the second version and explain every line, Attention stops being mysterious.

Acknowledgements

Thanks to everyone who watched the original Chinese videos, asked questions, and pointed out unclear explanations. Many improvements in this book came directly from those questions. English readers can follow the written edition here, use the GitHub repository for issues and corrections, and treat the original video series as source material rather than the only entry point.

Thanks to Geoffrey Hinton, Ilya Sutskever, Andrej Karpathy, and many other teachers and researchers whose public lectures, courses, papers, and code made this field easier to learn.

Thanks to my family for tolerating the late nights and weekends I spent talking to a monitor while recording videos and writing this book.

And thanks to you for reading. I hope this book helps you understand Transformer architecture instead of merely recognizing its vocabulary.

Wayland Zhang

Original videos recorded from December 2023 to March 2024. Chinese text edition organized in January 2026. English edition started as a localized adaptation after that.

"The best way to learn is to teach."

Errata and Feedback

If you find mistakes or have suggestions, please reach out through any of these channels:

GitHub: github.com/WaylandZhang
Bilibili (original Chinese videos): @LLM张老师

Technical books always have rough edges, and careful readers make them better.