One-sentence summary: Reasoning models discovered that "think longer" is a second scaling axis — orthogonal to "train bigger," and sometimes more efficient per compute dollar.
31.1 The Breakthrough: A Number That Changed Everything
31.1.1 The Benchmark That Exposed the Gap
In 2024, researchers tested GPT-4o — then the strongest available model — on AIME 2024 (the American Invitational Mathematics Examination). AIME is the second stage of the US olympiad pipeline, sitting between the AMC and the USAMO; roughly the top 5% of AMC participants qualify. It has 15 problems, each with an integer answer from 0 to 999, and no calculator is allowed.
GPT-4o scored 12–13%. That is fewer than two problems out of fifteen.
In September 2024, OpenAI released o1-preview. Same AIME 2024 problems.
o1-preview scored 74–83%.
AIME 2024 accuracy:
GPT-4o ████ 12–13%
o1-preview █████████████████████████████ 74–83%
o1 (final) ██████████████████████████████ 83–93%
o3 (Dec 2024) ████████████████████████████████ 96.7%
From 12% to 83%: a 6–7× jump with no comparable increase in model size. This was not a gradual improvement on the existing scaling curve. It was a different phenomenon.
31.1.2 Why This Is Qualitatively Different
Previous capability gains followed a pattern:
- GPT-3 → GPT-3.5: better training, more data, ~50% relative improvement
- GPT-3.5 → GPT-4: larger model, longer context, ~100% relative improvement on hard tasks
- GPT-4 → GPT-4o: efficiency gains, multimodal, ~10–20% relative improvement
Each step came from the same lever: more parameters, more data, better training.
o1 did not become dramatically larger than GPT-4o. The jump came from a different lever entirely.
The core insight: model capability is a function of both the model's weights and the compute spent at inference time. Until o1, almost all effort went into training. o1 invested heavily in test-time compute.
31.1.3 AIME in Context
AIME is a particularly clean benchmark for measuring reasoning capability:
- Problems are created fresh each year, eliminating training-data contamination
- Each problem requires multi-step reasoning with no shortcut
- There is a clear human reference point: top US high school mathematicians typically score 8–12/15
o3's 96.7% means it averages fewer than one wrong answer per test. It has surpassed all but the top-tier IMO competitors.
31.2 Test-Time Compute Scaling
31.2.1 The Traditional Scaling Law
Before reasoning models, the dominant performance-improvement strategy was train-time compute scaling: make the model bigger, train it on more data, and spend more FLOPs before deployment.
This drove the arms race:
- GPT-3: 175B parameters
- PaLM: 540B parameters
- GPT-4: ~1.8T parameters (rumored MoE)
- Llama 3.1: 405B parameters (dense)
But the returns were diminishing. Doubling parameters from 100B to 200B improved performance by 10–20%, not 100%. Each increment got more expensive.
31.2.2 The New Axis: Test-Time Compute
In August 2024, DeepMind and UC Berkeley published: "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters."
The central experiment: take a small model and a large model (14× size difference). Give the small model more inference compute. The small model + more thinking time outperforms the large model's direct answer.
Experiment setup:
Large model = PaLM 2-L (~14× bigger)
Small model = PaLM 2-S + extra test-time compute
Result:
Large model, direct answer: score X
Small model, extra compute: score > X (on many tasks)
This does not mean small models always beat big ones. It means test-time compute is a real scaling axis, not a toy.
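A toy calculation makes the tradeoff concrete. Under the idealized assumption of a verifier that can recognize a correct answer, best-of-N sampling turns a per-sample accuracy p into 1 - (1 - p)^N. The accuracy numbers below are invented for illustration, not taken from the paper:

```python
# Sketch: why extra samples can substitute for extra parameters, assuming
# an (idealized) verifier that recognizes correct answers.

def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

small_model_p = 0.30   # hypothetical per-sample accuracy of the small model
large_model_p = 0.55   # hypothetical single-pass accuracy of the large model

for n in (1, 4, 16):
    print(n, round(best_of_n_success(small_model_p, n), 3))
# With enough samples, the small model's best-of-N success crosses the large
# model's single-pass accuracy (0.55): at n=4 it is already ~0.76.
```

The catch is the verifier: without a reliable way to pick the correct sample, the curve flattens, which is why reward models and majority voting matter so much in practice.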
31.2.3 Why Longer Thinking Helps
Standard next-token generation produces an answer in a single pass. For hard problems, the single pass often fails because:
- The first approach may be wrong — but the model commits to it
- Intermediate steps are not verified before the final answer
- No backtracking when an approach hits a dead end
Extended thinking allows:
- Generating multiple candidate approaches
- Checking intermediate steps
- Backtracking and trying a different path
- Verifying the answer before outputting it
The compute cost is proportional to how many tokens the model generates in its "thinking" phase.
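The loop described above — propose, check, backtrack — can be sketched in a few lines. Here `generate` and `verify` are hypothetical stand-ins for a model call and a step checker, and the token counts are invented:

```python
# Minimal sketch of the extended-thinking loop: propose a candidate,
# verify it, and backtrack to a new attempt on failure.

def solve_with_backtracking(problem, generate, verify, max_attempts=4):
    """Try candidate solutions until one passes verification."""
    thinking_tokens = 0
    for attempt in range(max_attempts):
        candidate, tokens_used = generate(problem, attempt)
        thinking_tokens += tokens_used
        if verify(problem, candidate):       # check the work before answering
            return candidate, thinking_tokens
    return None, thinking_tokens             # every approach hit a dead end

# Toy usage: "solve" x + 3 = 10 by guessing, verifying each guess.
guesses = [5, 6, 7]
gen = lambda problem, i: (guesses[i], 100)   # each attempt "costs" 100 tokens
ver = lambda problem, x: x + 3 == 10
answer, cost = solve_with_backtracking("x + 3 = 10", gen, ver)
print(answer, cost)   # 7 300
```

The token counter makes the cost statement above literal: each extra attempt adds thinking tokens, so harder problems bill more.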
31.2.4 Two Scaling Curves
Performance
▲
│ ╭── Test-time scaling
│ ╭───╯ (more thinking tokens)
│ ╭────╯
│ ╭────╯
│ ╭────╯
│ ╭────╯ ← Train-time scaling
│ ╭───╯ (more parameters + data)
│─╯
└──────────────────────────────────▶ Compute
Both curves are real. The optimal strategy combines both: train a capable base model, then spend inference compute wisely on hard problems.
31.3 OpenAI o1 and o3
31.3.1 What o1 Actually Does
OpenAI's official framing: o1 was "trained to reason." The technical implementation is not fully disclosed, but the observable behavior is clear:
When you ask o1 a hard question, it spends time generating an internal "reasoning trace" before producing the visible answer. You see a summary; the full chain is hidden.
Traditional model response:
User question → immediate answer
o1 response:
User question → [internal reasoning trace, ~3,000–10,000 tokens]
→ final answer (visible)
The reasoning trace may include:
- Setting up the problem from multiple angles
- Testing a hypothesis and finding a contradiction
- Backtracking and trying an alternative approach
- Verifying the final answer against the original problem
31.3.2 Training Approach (Inferred)
OpenAI has not published o1's training details. Based on public information and research community analysis, the approach likely involves:
- Large-scale RL with process rewards: the reward model evaluates not just final answers but intermediate reasoning steps
- Process Reward Models (PRM): a separate model trained to judge whether each step in a reasoning chain is correct
- Search over reasoning paths: generating multiple candidate chains and selecting the best
Inferred o1 training loop:
Problem
↓
Base model generates N reasoning chains
↓
Process Reward Model scores each step of each chain
↓
Best chain identified
↓
RL update: increase probability of successful chain patterns
The key insight in PRMs: evaluating "was this intermediate step correct?" is more informative than evaluating only "was the final answer correct?" A model that takes 8 correct steps and then one wrong step at the end deserves a different signal than a model that guesses correctly by chance.
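A sketch of the selection step under these assumptions: score every step of every chain with a PRM, aggregate per chain (taking the minimum step score is one common choice in the PRM literature, since one bad step invalidates the chain), and keep the best. `prm_score` is a hypothetical stand-in for a trained Process Reward Model:

```python
# Hedged sketch of PRM-guided best-of-N selection over reasoning chains.

def select_best_chain(chains, prm_score):
    """chains: list of reasoning chains, each a list of step strings."""
    def chain_score(chain):
        # A chain is only as strong as its weakest step.
        return min(prm_score(step) for step in chain)
    return max(chains, key=chain_score)

# Toy example with hand-assigned step scores standing in for a real PRM.
fake_scores = {"setup": 0.9, "algebra": 0.8, "wrong-sign": 0.2, "verify": 0.9}
chains = [
    ["setup", "wrong-sign", "verify"],   # min step score 0.2
    ["setup", "algebra", "verify"],      # min step score 0.8
]
best = select_best_chain(chains, fake_scores.get)
print(best)   # ['setup', 'algebra', 'verify']
```

Note how the per-step minimum penalizes the chain with one wrong step even though its other steps score well, exactly the signal an outcome-only reward cannot provide.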
31.3.3 o3: Another Step Change
In December 2024, OpenAI announced o3. The jump from o1 to o3 was not as dramatic as GPT-4o to o1, but it was still substantial:
| Benchmark | GPT-4o | o1 | o3 |
|---|---|---|---|
| AIME 2024 | 12–13% | 83.3% | 96.7% |
| MATH-500 | 76.6% | 94.8% | 97.9% |
| Codeforces Rating | ~1200 | ~1800 | 2727 |
| ARC-AGI | ~5% | 32% | 87.5% |
ARC-AGI is worth noting. François Chollet designed this benchmark specifically to resist memorization — it tests novel pattern recognition. Human average is ~85%. o3 at 87.5% is the first AI to exceed human average on this test.
31.3.4 The Cost Structure
Reasoning tokens are expensive because you pay for the thinking:
One o1 request for a hard math problem:
User input: 200 tokens
Internal thinking: 3,000–10,000 tokens ← you pay for these
Final output: 500 tokens
Total billed: 3,700–10,700 tokens
API pricing (approximate at time of writing):
- GPT-4o: $2.50/1M input, $10/1M output
- o1-pro: $150/1M input, $600/1M output
The 60× cost premium is not arbitrary. The model genuinely uses 10–50× more compute to answer hard questions. For problems where the answer matters — competitive coding, mathematical proofs, complex debugging — the price is often worth it.
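Putting the token counts and prices above together makes the premium concrete (thinking tokens are billed as output tokens; prices are the approximate per-million figures quoted above):

```python
# Back-of-the-envelope request cost from per-million-token prices.

def request_cost(in_tokens, out_tokens, price_in, price_out):
    return in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out

# GPT-4o: 200 input tokens, 500 output tokens at $2.50 / $10 per 1M.
print(round(request_cost(200, 500, 2.50, 10.0), 5))        # 0.0055

# o1-pro: same request, but ~10,000 thinking tokens billed with the output,
# at $150 / $600 per 1M.
print(round(request_cost(200, 10_500, 150.0, 600.0), 3))   # 6.33
```

At these token counts the per-request gap is roughly a thousand-fold: the 60× price ratio understates the difference once thinking tokens dominate the bill.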
31.3.5 o1/o3 Limitations
Speed: GPT-4o answers in 1–3 seconds. o1 takes 10–60 seconds for hard problems. For conversational use or real-time applications, this is often unacceptable.
Overthinking easy questions: adding irrelevant information to simple problems reduces o1's accuracy significantly. Apple researchers (the GSM-Symbolic study) showed that appending an inconsequential detail to a simple arithmetic word problem caused meaningful accuracy drops. The model seems to process noise as potential signal.
Opacity: OpenAI deliberately hides the reasoning trace. Users cannot inspect how the model arrived at its answer. This is both a product choice (reasoning traces are verbose and confusing) and a safety decision (OpenAI's system card rates o1 models at medium risk on CBRN-related knowledge).
31.4 DeepSeek-R1: Open-Source Reasoning
31.4.1 Why R1 Matters
In January 2025, DeepSeek released DeepSeek-R1, achieving performance comparable to OpenAI o1 with one crucial difference: fully open weights and an openly published training methodology.
| Property | OpenAI o1 | DeepSeek-R1 |
|---|---|---|
| Model weights | Closed | Open |
| Training method | Undisclosed | Published paper |
| API cost | High | Low |
| Local deployment | No | Yes |
| AIME 2024 | 83.3% | 79.8% |
| MATH-500 | 94.8% | 97.3% |
R1 made reasoning models impossible to treat as proprietary magic. The approach was reproducible.
31.4.2 R1-Zero: Reasoning From Pure RL
The most surprising result: DeepSeek trained a reasoning model without any supervised reasoning demonstrations. They called this R1-Zero.
Traditional training pipeline:
- Pretraining (language modeling)
- SFT on curated reasoning examples
- RLHF
DeepSeek-R1-Zero pipeline:
- Pretraining (language modeling)
- Directly apply RL — no SFT, no human-written reasoning traces
They started from DeepSeek-V3-Base (the pretrained model) and applied RL with a simple rule-based reward: +1 for a correct final answer, 0 otherwise, plus a small format reward for keeping the reasoning inside designated tags. No process reward models, no human demonstrations.
What emerged from training:
R1-Zero AIME accuracy during RL training:
Step 0: 15.6%
Step 2000: ~30%
Step 6000: ~55%
Step 10000: 71.0% (pass@1)
Majority vote over 64 samples at step 10000: 86.7%
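The majority-vote figure is self-consistency: sample many final answers and report the most common one. A minimal sketch (the sample answers below are invented for illustration):

```python
# Majority voting over sampled final answers (self-consistency).
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "42", "17", "42", "35", "42"]   # the real eval uses 64 samples
print(majority_vote(samples))   # 42
```

This is another test-time compute knob: 64 samples cost 64× the inference compute and lift R1-Zero from 71.0% to 86.7%.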
31.4.3 Emergent Reasoning Behaviors
R1-Zero was not told how to reason. These behaviors emerged from the RL training signal alone:
Self-verification:
...so the answer is 42.
Wait, let me check the third step again.
3 × 14 = 42 ✓
Yes, the answer is 42.
Backtracking:
This approach seems to be getting complicated.
Let me try a different method...
Strategy selection:
Direct computation is too complex here.
Let me consider n=1 as a special case first,
find the pattern, then generalize.
These are not programmed behaviors. The model discovered that these strategies lead to higher reward. This is a remarkable result: reasoning capability as an emergent property of RL on outcome rewards.
31.4.4 GRPO: Training Without a Critic
Standard PPO requires a Critic model to estimate the value of each state. For a 671B-parameter policy, the Critic is also ~671B parameters. You are running two massive models simultaneously.
DeepSeek's solution: GRPO (Group Relative Policy Optimization).
PPO:
advantage = reward - critic_model(state) ← requires Critic
GRPO:
For each prompt, generate G responses:
rewards = [r_1, r_2, ..., r_G]
baseline = mean(rewards)
advantage_i = r_i - baseline ← no Critic needed
Instead of comparing against a learned value function, compare against the group average. The intuition: if this response scored above average for this prompt, it was good. No separate model required.
GRPO example:
Prompt: "Prove that 1 + 2 + ... + n = n(n+1)/2"
8 responses generated:
Response 1: score 0.8
Response 2: score 0.3
Response 3: score 0.9
Response 4: score 0.5
Response 5: score 1.0
Response 6: score 0.2
Response 7: score 0.8
Response 8: score 0.5
Group mean: 0.625
Advantages:
+0.175, -0.325, +0.275, -0.125, +0.375, -0.425, +0.175, -0.125
Positive advantage → increase probability of that response pattern
Negative advantage → decrease probability
Removing the Critic model cuts training memory roughly in half and reduces compute accordingly. This is part of how DeepSeek-V3 was trained for a reported ~$5.5M: architectural choices that save compute compound.
31.4.5 Cold-Start Data and the Final R1
R1-Zero proved the concept but had usability problems:
- Reasoning chains were verbose and disorganized
- The model mixed languages mid-response (Chinese and English)
- Formatting was inconsistent
The production R1 added a small amount of cold-start data: a few thousand high-quality reasoning examples that demonstrate clean formatting and language consistency. The RL training then runs on top of this.
The result: same strong reasoning capability, readable output format.
This is an important pattern. RL discovers the capability. SFT on a small curated set shapes how that capability is expressed.
31.4.6 What Open-Source Reasoning Changes
R1's open-source release had immediate practical effects:
Research: the training methodology is reproducible, so other groups can verify it, extend it, and build on it.
Deployment cost: running R1 locally is possible. At scale, the savings versus o1-pro API are enormous.
Domain specialization: you can fine-tune R1's reasoning on medical diagnosis, legal analysis, or competitive programming — something you cannot do with a closed API.
Ecosystem: the distilled R1 models (see below) created a wave of capable small reasoning models.
31.5 Distillation: Making Reasoning Accessible
31.5.1 The Problem with Large Reasoning Models
DeepSeek-R1 is based on a 671B-parameter MoE model (37B active per token). Deploying it requires:
- Hundreds of GB of VRAM (a full multi-GPU server node)
- $80,000+ in hardware at H100 prices
- Significant engineering overhead
For most applications, this is inaccessible.
31.5.2 Knowledge Distillation for Reasoning
The core idea: the large reasoning model generates high-quality reasoning traces on many problems. A small model is then fine-tuned on those traces — learning to produce similar reasoning patterns.
Teacher (DeepSeek-R1, 671B):
Problem → [Extended reasoning trace] → Solution
Student (Qwen-2.5-7B):
Trained on (Problem, Reasoning trace) pairs
Learns to produce similar traces despite 100× fewer parameters
This is just SFT on the teacher's reasoning traces. No RL required on the student side.
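A sketch of the data side of that pipeline: the teacher's (problem, trace, answer) triples become ordinary SFT examples. Here `teacher_generate` is a hypothetical stand-in for sampling DeepSeek-R1, and the chat format and `<think>` tags are illustrative:

```python
# Hedged sketch: turning teacher reasoning traces into SFT data.

def build_distillation_set(problems, teacher_generate):
    examples = []
    for problem in problems:
        trace, answer = teacher_generate(problem)
        examples.append({
            "messages": [
                {"role": "user", "content": problem},
                # The student learns to emit the trace *and* the answer.
                {"role": "assistant",
                 "content": f"<think>{trace}</think>\n{answer}"},
            ]
        })
    return examples

fake_teacher = lambda p: ("try small cases, spot the pattern", "n(n+1)/2")
data = build_distillation_set(["Sum 1..n?"], fake_teacher)
print(data[0]["messages"][1]["content"].startswith("<think>"))   # True
```

From here, any standard SFT trainer applies; the student never sees a reward signal, only the teacher's demonstrations.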
31.5.3 DeepSeek Distilled Model Performance
| Model | Base | AIME 2024 | MATH-500 | Note |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen-2.5-1.5B | 28.9% | 83.9% | Laptop-deployable |
| R1-Distill-Qwen-7B | Qwen-2.5-7B | 55.5% | 92.8% | RTX 4090 fits |
| R1-Distill-Llama-8B | Llama-3.1-8B | 50.4% | 89.1% | |
| R1-Distill-Qwen-14B | Qwen-2.5-14B | 69.7% | 93.9% | |
| R1-Distill-Qwen-32B | Qwen-2.5-32B | 72.6% | 94.3% | Better than o1-mini |
| R1-Distill-Llama-70B | Llama-3.3-70B | 70.0% | 94.5% | |
The 7B distilled model at 55.5% on AIME outperforms QwQ-32B-Preview, a dedicated 32B reasoning model. The 32B distilled model outperforms OpenAI o1-mini on several metrics.
Distillation comparison vs direct RL on the student:
| Method | 32B model, AIME 2024 | Cost |
|---|---|---|
| Direct RL on 32B | 67.2% | High (full RL training) |
| Distill from R1 | 72.6% | Low (just SFT on traces) |
Distillation is both cheaper and better performing. The teacher already discovered good reasoning strategies; the student just needs to learn to execute them.
31.5.4 Deployment Reality
Hardware requirements:
DeepSeek-R1 (full, 671B): ~700 GB VRAM in FP8 → an 8-GPU node of H200-class accelerators
R1-Distill-Qwen-32B: ~70 GB VRAM → 2× A100 80GB
R1-Distill-Qwen-7B: ~14 GB VRAM → 1× RTX 4090 (consumer)
R1-Distill-Qwen-1.5B: ~4 GB VRAM → CPU inference feasible
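The rule of thumb behind these figures: weights-only VRAM is roughly parameter count times bytes per parameter (2 for FP16/BF16), ignoring KV cache and activation overhead:

```python
# Weights-only VRAM estimate; real deployments need headroom on top
# for the KV cache, activations, and framework overhead.

def weights_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * bytes_per_param   # 1B params x 2 bytes ~ 2 GB

for name, params in [("1.5B", 1.5), ("7B", 7), ("32B", 32)]:
    print(name, weights_vram_gb(params), "GB")
# 1.5B -> 3 GB, 7B -> 14 GB (fits a 24 GB RTX 4090), 32B -> 64 GB
```

Quantizing to 4-bit roughly halves these numbers again, which is how the 7B distill ends up running comfortably on consumer hardware.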
A 1.5B model that runs on a laptop and scores 28.9% on AIME, better than GPT-4o, is a striking demonstration of how much the field has changed.
31.6 The 2025 Ecosystem
31.6.1 Where the Field Stands Now
The Chinese-language original of this chapter was written when o1 and R1 were the primary examples. As of April 2026, the ecosystem has expanded substantially. The core technical ideas remain the same; the landscape of deployed systems has grown.
OpenAI o3 and o4-mini: o3 represents a significant step beyond o1. o4-mini achieves o3-level performance on many benchmarks at substantially lower cost, making extended reasoning economically viable for a broader range of applications. Both models show better calibration on when to spend more thinking time.
Claude Opus 4.5's extended thinking mode: Anthropic integrated reasoning into Claude's product line. Extended thinking is visible to the user and controllable by the developer. Unlike o1's hidden traces, Claude shows the reasoning summary. This reflects a different product philosophy: transparency over polish.
Gemini 2.5 Pro with Thinking: Google's response to reasoning models. The distinctive feature is a thinking budget — a developer-controllable parameter that sets the maximum number of thinking tokens:
response = gemini.generate(
prompt="...",
thinking_budget=0, # disabled: fast, cheap
# thinking_budget=8192, # medium: balanced
# thinking_budget=24576 # maximum: best on hard problems
)
This creates a cost-performance knob: spend $0.01 on fast answers, or $0.06 on deep reasoning, depending on the query. Gemini 2.5 Pro reaches 92.0% on AIME 2024, ahead of o1 at 83.3%.
Gemini Deep Think achieved gold-medal level at the 2025 International Mathematical Olympiad. For context: IMO problems require creative mathematical insight over multiple hours. This is no longer a benchmark artifact.
The pattern across all these systems: extended reasoning is becoming a standard capability, not a specialty product.
31.6.2 Performance Comparison
| Model | AIME 2024 | MATH-500 | Codeforces | Open? | Cost tier |
|---|---|---|---|---|---|
| GPT-4o | 12–13% | 76.6% | ~1200 | No | Low |
| o1 | 83.3% | 94.8% | ~1800 | No | High |
| o3 | 96.7% | 97.9% | 2727 | No | Very high |
| DeepSeek-R1 | 79.8% | 97.3% | 2029 | Yes | Low |
| Gemini 2.5 Pro | 92.0% | ~95% | — | No | Medium |
| R1-Distill-32B | 72.6% | 94.3% | 1691 | Yes | Lowest |
31.7 When to Use Reasoning Models
31.7.1 The Cost vs Benefit Calculation
Single hard math problem (AIME difficulty):
GPT-4o: ~$0.01 response time ~2s
o1: ~$0.50 response time ~30s
Cost ratio: 50×
Whether o1 is worth it depends on what "correct" is worth to you. For a one-time calculation, $0.50 for reliability may be cheap. For a system that makes millions of such calls, you need the distilled models.
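One way to make that judgment concrete, assuming your answers are checkable so a wrong answer just means paying for a retry: compare expected cost per correct answer, price per call divided by accuracy. The accuracy figures below are the AIME numbers from this chapter; a real decision would use accuracy measured on your own task:

```python
# Expected cost per correct answer, assuming correctness is verifiable
# (so incorrect calls are simply repeated).

def cost_per_correct(price_per_call: float, accuracy: float) -> float:
    return price_per_call / accuracy

print(round(cost_per_correct(0.01, 0.13), 3))   # GPT-4o on AIME: ~0.077
print(round(cost_per_correct(0.50, 0.83), 3))   # o1 on AIME:     ~0.602
```

On AIME-difficulty problems the gap per correct answer (~8×) is far smaller than the 50× gap per call; on easy tasks, where both models are near-perfect, the cheap model wins outright.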
31.7.2 Good Use Cases for Reasoning
- Competitive programming: algorithm design, edge case handling, proving correctness
- Mathematical proofs: multi-step formal reasoning, counterexample generation
- Complex debugging: reasoning about what a program does before changing it
- Scientific reasoning: hypothesis generation, experiment design, data interpretation
- Legal/medical analysis: careful multi-factor reasoning with explicit uncertainty
31.7.3 Poor Use Cases for Reasoning
- Real-time conversation: 30-second response times are product-breaking
- Simple lookups: reasoning overhead wasted on "what is the capital of France"
- Creative writing: the systematic thinking style can make prose feel mechanical
- High-volume APIs: cost scales with difficulty, not traffic
31.7.4 Practical Selection Guide
| Task | Recommended model | Reason |
|---|---|---|
| Real-time chat | GPT-4o / Claude | Speed critical |
| Simple Q&A | GPT-4o-mini / Haiku | Cost critical |
| Complex math | o1 / R1 | Accuracy critical |
| Code generation | Claude / R1 | Balanced |
| Research analysis | o3 / Gemini 2.5 | Long chains needed |
| Edge deployment | R1-Distill-7B | Resource constrained |
31.8 Chapter Summary
31.8.1 Core Concepts
Test-Time Compute Scaling: performance is a function of both model weights and inference compute. Spending more tokens on internal reasoning improves hard-task accuracy.
o1/o3: OpenAI's systems that made this pattern commercially visible. AIME: 12% → 83% → 96.7%. The reasoning trace is internal and hidden.
DeepSeek-R1: open-source reasoning model matching o1 quality. Key findings: pure RL can induce reasoning capability (R1-Zero), GRPO eliminates the Critic model, cold-start data shapes formatting without damaging capability.
GRPO: eliminates the Critic model by using group-relative advantage estimation. Cuts training cost by ~50% for RL-based reasoning training.
Distillation: fine-tune a small model on the large model's reasoning traces. R1-Distill-7B outperforms o1-mini on many tasks. Distillation beats direct RL on the small model.
2025 landscape: o3, o4-mini, Claude Opus 4.5 extended thinking, Gemini 2.5 thinking mode. Extended reasoning is now a standard capability tier.
31.8.2 Key Benchmarks
AIME 2024 tells the story:
GPT-4o: 12% ← where we started
o1: 83% ← test-time compute, hidden trace
DeepSeek-R1: 80% ← open-source, reproducible
Gemini 2.5: 92% ← competitive landscape, April 2025
o3: 97% ← current frontier
31.8.3 My Take
The R1-Zero result is the most important insight from this chapter. Nobody programmed self-verification or backtracking into the model. Those behaviors appeared because the RL training signal rewarded correct final answers, and the model discovered that checking your work is a good strategy for getting correct answers. That is, in some meaningful sense, the model learned to think.
The distillation story is equally important for practitioners. If your application needs reasoning and you cannot afford o3, you have a credible path: run R1-Distill-7B locally, get performance that would have been frontier-level two years ago, and deploy it on a single consumer GPU. The democratization of reasoning capability happened faster than most expected.
Chapter Checklist
After this chapter, you should be able to:
- Explain why GPT-4o scored 12% on AIME 2024 and o1 scored 83%.
- Describe test-time compute scaling and how it differs from train-time scaling.
- Explain what a Process Reward Model is and why it helps reasoning training.
- Describe R1-Zero: what it proved, and how GRPO made it compute-efficient.
- Explain why distillation from R1 outperforms direct RL on the student model.
- Name the 2025 reasoning model ecosystem (o3, o4-mini, Claude extended thinking, Gemini 2.5).
- Choose between reasoning models based on latency, cost, and task difficulty.
See You in the Next Chapter
Reasoning models use more inference compute. But they still use Transformer foundations. The next question is whether Transformers are the right foundation at all for very long sequences.
Chapter 32 examines what happens when the O(N²) Attention cost becomes the binding constraint, and how State Space Models, Mamba, and hybrid architectures are trying to provide a better answer.