One-sentence summary: Reasoning models discovered that "think longer" is a second scaling axis — orthogonal to "train bigger," and sometimes more efficient per compute dollar.
31.1 The Breakthrough: A Number That Changed Everything
31.1.1 The Benchmark That Exposed the Gap
In 2024, researchers tested GPT-4o — then the strongest available model — on AIME 2024 (the American Invitational Mathematics Examination). AIME is the second stage of the US olympiad pipeline, sitting between the AMC and the USAMO; roughly the top 5% of AMC participants qualify. It has 15 problems, each with an integer answer from 0 to 999, and no calculator is allowed.
GPT-4o scored 12–13%. That is fewer than two problems out of fifteen.
In September 2024, OpenAI released o1-preview. Same AIME 2024 problems.
o1-preview scored 74–83%.
AIME 2024 accuracy:
GPT-4o ████ 12–13%
o1-preview █████████████████████████████ 74–83%
o1 (final) ██████████████████████████████ 83–93%
o3 (Dec 2024) ████████████████████████████████ 96.7%
From 12% to 83%: a 6–7× jump with no comparable increase in model size. This was not a gradual improvement on the existing scaling curve. It was a different phenomenon.
31.1.2 Why This Is Qualitatively Different
Previous capability gains followed a pattern:
- GPT-3 → GPT-3.5: better training, more data, ~50% relative improvement
- GPT-3.5 → GPT-4: larger model, longer context, ~100% relative improvement on hard tasks
- GPT-4 → GPT-4o: efficiency gains, multimodal, ~10–20% relative improvement
Each step came from the same lever: more parameters, more data, better training.
o1 did not become dramatically larger than GPT-4o. The jump came from a different lever entirely.
The core insight: model capability is a function of both the model's weights and the compute spent at inference time. Until o1, almost all effort went into training. o1 invested heavily in test-time compute.
31.1.3 AIME in Context
AIME is a particularly clean benchmark for measuring reasoning capability:
- Problems are created fresh each year, eliminating training-data contamination
- Each problem requires multi-step reasoning with no shortcut
- There is a clear human reference point: top US high school mathematicians typically score 8–12/15
o3's 96.7% means it averages fewer than one wrong answer per test. It has surpassed all but the top-tier IMO competitors.
31.2 Test-Time Compute Scaling
31.2.1 The Traditional Scaling Law
Before reasoning models, the dominant performance-improvement strategy was train-time compute scaling: make the model bigger, train it on more data, and spend more FLOPs before deployment.
This drove the arms race:
- GPT-3: 175B parameters
- PaLM: 540B parameters
- GPT-4: ~1.8T parameters (rumored MoE)
- Llama 3.1: 405B parameters (dense)
But the returns were diminishing. Doubling parameters from 100B to 200B improved performance by 10–20%, not 100%. Each increment got more expensive.
31.2.2 The New Axis: Test-Time Compute
In August 2024, DeepMind and UC Berkeley published: "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters."
The central experiment: take a small model and a large model (14× size difference). Give the small model more inference compute. The small model + more thinking time outperforms the large model's direct answer.
Experiment setup:
Large model = PaLM 2-L (~14× bigger)
Small model = PaLM 2-S + extra test-time compute
Result:
Large model, direct answer: score X
Small model, extra compute: score > X (on many tasks)
This does not mean small models always beat big ones. It means test-time compute is a real scaling axis, not a toy.
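A toy calculation makes the tradeoff concrete. Under the idealized assumption of a verifier that can recognize a correct answer, best-of-N sampling turns a per-sample accuracy p into 1 - (1 - p)^N. The accuracy numbers below are invented for illustration, not taken from the paper:

```python
# Sketch: why extra samples can substitute for extra parameters, assuming
# an (idealized) verifier that recognizes correct answers.

def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

small_model_p = 0.30   # hypothetical per-sample accuracy of the small model
large_model_p = 0.55   # hypothetical single-pass accuracy of the large model

for n in (1, 4, 16):
    print(n, round(best_of_n_success(small_model_p, n), 3))
# With enough samples, the small model's best-of-N success crosses the large
# model's single-pass accuracy (0.55): at n=4 it is already ~0.76.
```

The catch is the verifier: without a reliable way to pick the correct sample, the curve flattens, which is why reward models and majority voting matter so much in practice.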
31.2.3 Why Longer Thinking Helps
Standard next-token generation produces an answer in a single pass. For hard problems, the single pass often fails because:
- The first approach may be wrong — but the model commits to it
- Intermediate steps are not verified before the final answer
- No backtracking when an approach hits a dead end
Extended thinking allows:
- Generating multiple candidate approaches
- Checking intermediate steps
- Backtracking and trying a different path
- Verifying the answer before outputting it
The compute cost is proportional to how many tokens the model generates in its "thinking" phase.
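The loop described above — propose, check, backtrack — can be sketched in a few lines. Here `generate` and `verify` are hypothetical stand-ins for a model call and a step checker, and the token counts are invented:

```python
# Minimal sketch of the extended-thinking loop: propose a candidate,
# verify it, and backtrack to a new attempt on failure.

def solve_with_backtracking(problem, generate, verify, max_attempts=4):
    """Try candidate solutions until one passes verification."""
    thinking_tokens = 0
    for attempt in range(max_attempts):
        candidate, tokens_used = generate(problem, attempt)
        thinking_tokens += tokens_used
        if verify(problem, candidate):       # check the work before answering
            return candidate, thinking_tokens
    return None, thinking_tokens             # every approach hit a dead end

# Toy usage: "solve" x + 3 = 10 by guessing, verifying each guess.
guesses = [5, 6, 7]
gen = lambda problem, i: (guesses[i], 100)   # each attempt "costs" 100 tokens
ver = lambda problem, x: x + 3 == 10
answer, cost = solve_with_backtracking("x + 3 = 10", gen, ver)
print(answer, cost)   # 7 300
```

The token counter makes the cost statement above literal: each extra attempt adds thinking tokens, so harder problems bill more.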
31.2.4 Two Scaling Curves
Performance
▲
│ ╭── Test-time scaling
│ ╭───╯ (more thinking tokens)
│ ╭────╯
│ ╭────╯
│ ╭────╯
│ ╭────╯ ← Train-time scaling
│ ╭───╯ (more parameters + data)
│─╯
└──────────────────────────────────▶ Compute
Both curves are real. The optimal strategy combines both: train a capable base model, then spend inference compute wisely on hard problems.
31.3 OpenAI o1 and o3
31.3.1 What o1 Actually Does
OpenAI's official framing: o1 was "trained to reason." The technical implementation is not fully disclosed, but the observable behavior is clear:
When you ask o1 a hard question, it spends time generating an internal "reasoning trace" before producing the visible answer. You see a summary; the full chain is hidden.
Traditional model response:
User question → immediate answer
o1 response:
User question → [internal reasoning trace, ~3,000–10,000 tokens]
→ final answer (visible)
The reasoning trace may include:
- Setting up the problem from multiple angles
- Testing a hypothesis and finding a contradiction
- Backtracking and trying an alternative approach
- Verifying the final answer against the original problem
31.3.2 Training Approach (Inferred)
OpenAI has not published o1's training details. Based on public information and research community analysis, the approach likely involves:
- Large-scale RL with process rewards: the reward model evaluates not just final answers but intermediate reasoning steps
- Process Reward Models (PRM): a separate model trained to judge whether each step in a reasoning chain is correct
- Search over reasoning paths: generating multiple candidate chains and selecting the best
Inferred o1 training loop:
Problem
↓
Base model generates N reasoning chains
↓
Process Reward Model scores each step of each chain
↓
Best chain identified
↓
RL update: increase probability of successful chain patterns
The key insight in PRMs: evaluating "was this intermediate step correct?" is more informative than evaluating only "was the final answer correct?" A model that takes 8 correct steps and then one wrong step at the end deserves a different signal than a model that guesses correctly by chance.
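A sketch of the selection step under these assumptions: score every step of every chain with a PRM, aggregate per chain (taking the minimum step score is one common choice in the PRM literature, since one bad step invalidates the chain), and keep the best. `prm_score` is a hypothetical stand-in for a trained Process Reward Model:

```python
# Hedged sketch of PRM-guided best-of-N selection over reasoning chains.

def select_best_chain(chains, prm_score):
    """chains: list of reasoning chains, each a list of step strings."""
    def chain_score(chain):
        # A chain is only as strong as its weakest step.
        return min(prm_score(step) for step in chain)
    return max(chains, key=chain_score)

# Toy example with hand-assigned step scores standing in for a real PRM.
fake_scores = {"setup": 0.9, "algebra": 0.8, "wrong-sign": 0.2, "verify": 0.9}
chains = [
    ["setup", "wrong-sign", "verify"],   # min step score 0.2
    ["setup", "algebra", "verify"],      # min step score 0.8
]
best = select_best_chain(chains, fake_scores.get)
print(best)   # ['setup', 'algebra', 'verify']
```

Note how the per-step minimum penalizes the chain with one wrong step even though its other steps score well, exactly the signal an outcome-only reward cannot provide.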
31.3.3 o3: Another Step Change
In December 2024, OpenAI announced o3. The jump from o1 to o3 was not as dramatic as GPT-4o to o1, but it was still substantial:
| Benchmark | GPT-4o | o1 | o3 |
|---|---|---|---|
| AIME 2024 | 12–13% | 83.3% | 96.7% |
| MATH-500 | 76.6% | 94.8% | 97.9% |
| Codeforces Rating | ~1200 | ~1800 | 2727 |
| ARC-AGI | ~5% | 32% | 87.5% |
ARC-AGI is worth noting. François Chollet designed this benchmark specifically to resist memorization — it tests novel pattern recognition. Human average is ~85%. o3 at 87.5% is the first AI to exceed human average on this test.
31.3.4 The Cost Structure
Reasoning tokens are expensive because you pay for the thinking:
One o1 request for a hard math problem:
User input: 200 tokens
Internal thinking: 3,000–10,000 tokens ← you pay for these
Final output: 500 tokens
Total billed: 3,700–10,700 tokens
API pricing (approximate at time of writing):
- GPT-4o: $2.50/1M input, $10/1M output
- o1-pro: $150/1M input, $600/1M output
The 60× cost premium is not arbitrary. The model genuinely uses 10–50× more compute to answer hard questions. For problems where the answer matters — competitive coding, mathematical proofs, complex debugging — the price is often worth it.
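Putting the token counts and prices above together makes the premium concrete (thinking tokens are billed as output tokens; prices are the approximate per-million figures quoted above):

```python
# Back-of-the-envelope request cost from per-million-token prices.

def request_cost(in_tokens, out_tokens, price_in, price_out):
    return in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out

# GPT-4o: 200 input tokens, 500 output tokens at $2.50 / $10 per 1M.
print(round(request_cost(200, 500, 2.50, 10.0), 5))        # 0.0055

# o1-pro: same request, but ~10,000 thinking tokens billed with the output,
# at $150 / $600 per 1M.
print(round(request_cost(200, 10_500, 150.0, 600.0), 3))   # 6.33
```

At these token counts the per-request gap is roughly a thousand-fold: the 60× price ratio understates the difference once thinking tokens dominate the bill.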
31.3.5 o1/o3 Limitations
Speed: GPT-4o answers in 1–3 seconds. o1 takes 10–60 seconds for hard problems. For conversational use or real-time applications, this is often unacceptable.
Overthinking easy questions: adding irrelevant information to simple problems reduces o1's accuracy significantly. Apple researchers (the GSM-Symbolic study) showed that appending an inconsequential detail to a simple arithmetic word problem caused meaningful accuracy drops. The model seems to process noise as potential signal.
Opacity: OpenAI deliberately hides the reasoning trace. Users cannot inspect how the model arrived at its answer. This is both a product choice (reasoning traces are verbose and confusing) and a safety decision (OpenAI's system card rates o1 models at medium risk on CBRN-related knowledge).
31.4 DeepSeek-R1: Open-Source Reasoning
31.4.1 Why R1 Matters
In January 2025, DeepSeek released DeepSeek-R1, achieving performance comparable to OpenAI o1 with one crucial difference: fully open weights and an openly published training methodology.
| Property | OpenAI o1 | DeepSeek-R1 |
|---|---|---|
| Model weights | Closed | Open |
| Training method | Undisclosed | Published paper |
| API cost | High | Low |
| Local deployment | No | Yes |
| AIME 2024 | 83.3% | 79.8% |
| MATH-500 | 94.8% | 97.3% |
R1 made reasoning models impossible to treat as proprietary magic. The approach was reproducible.
31.4.2 R1-Zero: Reasoning From Pure RL
The most surprising result: DeepSeek trained a reasoning model without any supervised reasoning demonstrations. They called this R1-Zero.
Traditional training pipeline:
- Pretraining (language modeling)
- SFT on curated reasoning examples
- RLHF
DeepSeek-R1-Zero pipeline:
- Pretraining (language modeling)
- Directly apply RL — no SFT, no human-written reasoning traces
They started from DeepSeek-V3-Base (the pretrained model) and applied RL with a simple rule-based reward: +1 for a correct final answer, 0 otherwise, plus a small format reward for keeping the reasoning inside designated tags. No process reward models, no human demonstrations.
What emerged from training:
R1-Zero AIME accuracy during RL training:
Step 0: 15.6%
Step 2000: ~30%
Step 6000: ~55%
Step 10000: 71.0% (pass@1)
Majority vote over 64 samples at step 10000: 86.7%
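The majority-vote figure is self-consistency: sample many final answers and report the most common one. A minimal sketch (the sample answers below are invented for illustration):

```python
# Majority voting over sampled final answers (self-consistency).
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "42", "17", "42", "35", "42"]   # the real eval uses 64 samples
print(majority_vote(samples))   # 42
```

This is another test-time compute knob: 64 samples cost 64× the inference compute and lift R1-Zero from 71.0% to 86.7%.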
31.4.3 Emergent Reasoning Behaviors
R1-Zero was not told how to reason. These behaviors emerged from the RL training signal alone:
Self-verification:
...so the answer is 42.
Wait, let me check the third step again.
3 × 14 = 42 ✓
Yes, the answer is 42.
Backtracking:
This approach seems to be getting complicated.
Let me try a different method...
Strategy selection:
Direct computation is too complex here.
Let me consider n=1 as a special case first,
find the pattern, then generalize.
These are not programmed behaviors. The model discovered that these strategies lead to higher reward. This is a remarkable result: reasoning capability as an emergent property of RL on outcome rewards.
31.4.4 GRPO: Training Without a Critic
Standard PPO requires a Critic model to estimate the value of each state. For a 671B-parameter policy, the Critic is also ~671B parameters. You are running two massive models simultaneously.
DeepSeek's solution: GRPO (Group Relative Policy Optimization).
PPO:
advantage = reward - critic_model(state) ← requires Critic
GRPO:
For each prompt, generate G responses:
rewards = [r_1, r_2, ..., r_G]
baseline = mean(rewards)
advantage_i = r_i - baseline ← no Critic needed
Instead of comparing against a learned value function, compare against the group average. The intuition: if this response scored above average for this prompt, it was good. No separate model required.
GRPO example:
Prompt: "Prove that 1 + 2 + ... + n = n(n+1)/2"
8 responses generated:
Response 1: score 0.8
Response 2: score 0.3
Response 3: score 0.9
Response 4: score 0.5
Response 5: score 1.0
Response 6: score 0.2
Response 7: score 0.8
Response 8: score 0.5
Group mean: 0.625
Advantages:
+0.175, -0.325, +0.275, -0.125, +0.375, -0.425, +0.175, -0.125
Positive advantage → increase probability of that response pattern
Negative advantage → decrease probability
Removing the Critic model cuts training memory roughly in half and reduces compute accordingly. This is part of how DeepSeek-V3 was trained for a reported ~$5.5M: architectural choices that save compute compound.
31.4.5 Cold-Start Data and the Final R1
R1-Zero proved the concept but had usability problems:
- Reasoning chains were verbose and disorganized
- The model mixed languages mid-response (Chinese and English)
- Formatting was inconsistent
The production R1 added a small amount of cold-start data: a few thousand high-quality reasoning examples that demonstrate clean formatting and language consistency. The RL training then runs on top of this.
The result: same strong reasoning capability, readable output format.
This is an important pattern. RL discovers the capability. SFT on a small curated set shapes how that capability is expressed.
31.4.6 What Open-Source Reasoning Changes
R1's open-source release had immediate practical effects:
Research: the training methodology is reproducible, so other groups can verify it, extend it, and build on it.
Deployment cost: running R1 locally is possible. At scale, the savings versus o1-pro API are enormous.
Domain specialization: you can fine-tune R1's reasoning on medical diagnosis, legal analysis, or competitive programming — something you cannot do with a closed API.
Ecosystem: the distilled R1 models (see below) created a wave of capable small reasoning models.
31.5 Distillation: Making Reasoning Accessible
31.5.1 The Problem with Large Reasoning Models
DeepSeek-R1 is based on a 671B-parameter MoE model (37B active per token). Deploying it requires:
- Hundreds of GB of VRAM (a full multi-GPU server node)
- $80,000+ in hardware at H100 prices
- Significant engineering overhead
For most applications, this is inaccessible.
31.5.2 Knowledge Distillation for Reasoning
The core idea: the large reasoning model generates high-quality reasoning traces on many problems. A small model is then fine-tuned on those traces — learning to produce similar reasoning patterns.
Teacher (DeepSeek-R1, 671B):
Problem → [Extended reasoning trace] → Solution
Student (Qwen-2.5-7B):
Trained on (Problem, Reasoning trace) pairs
Learns to produce similar traces despite 100× fewer parameters
This is just SFT on the teacher's reasoning traces. No RL required on the student side.
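A sketch of the data side of that pipeline: the teacher's (problem, trace, answer) triples become ordinary SFT examples. Here `teacher_generate` is a hypothetical stand-in for sampling DeepSeek-R1, and the chat format and `<think>` tags are illustrative:

```python
# Hedged sketch: turning teacher reasoning traces into SFT data.

def build_distillation_set(problems, teacher_generate):
    examples = []
    for problem in problems:
        trace, answer = teacher_generate(problem)
        examples.append({
            "messages": [
                {"role": "user", "content": problem},
                # The student learns to emit the trace *and* the answer.
                {"role": "assistant",
                 "content": f"<think>{trace}</think>\n{answer}"},
            ]
        })
    return examples

fake_teacher = lambda p: ("try small cases, spot the pattern", "n(n+1)/2")
data = build_distillation_set(["Sum 1..n?"], fake_teacher)
print(data[0]["messages"][1]["content"].startswith("<think>"))   # True
```

From here, any standard SFT trainer applies; the student never sees a reward signal, only the teacher's demonstrations.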
31.5.3 DeepSeek Distilled Model Performance
| Model | Base | AIME 2024 | MATH-500 | Note |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen-2.5-1.5B | 28.9% | 83.9% | Laptop-deployable |
| R1-Distill-Qwen-7B | Qwen-2.5-7B | 55.5% | 92.8% | RTX 4090 fits |
| R1-Distill-Llama-8B | Llama-3.1-8B | 50.4% | 89.1% | |
| R1-Distill-Qwen-14B | Qwen-2.5-14B | 69.7% | 93.9% | |
| R1-Distill-Qwen-32B | Qwen-2.5-32B | 72.6% | 94.3% | Better than o1-mini |
| R1-Distill-Llama-70B | Llama-3.3-70B | 70.0% | 94.5% | |
The 7B distilled model at 55.5% on AIME outperforms QwQ-32B-Preview, a dedicated 32B reasoning model. The 32B distilled model outperforms OpenAI o1-mini on several metrics.
Distillation comparison vs direct RL on the student:
| Method | 32B model, AIME 2024 | Cost |
|---|---|---|
| Direct RL on 32B | 67.2% | High (full RL training) |
| Distill from R1 | 72.6% | Low (just SFT on traces) |
Distillation is both cheaper and better performing. The teacher already discovered good reasoning strategies; the student just needs to learn to execute them.
31.5.4 Deployment Reality
Hardware requirements:
DeepSeek-R1 (full, 671B): ~700 GB VRAM in FP8 → an 8-GPU node of H200-class accelerators
R1-Distill-Qwen-32B: ~70 GB VRAM → 2× A100 80GB
R1-Distill-Qwen-7B: ~14 GB VRAM → 1× RTX 4090 (consumer)
R1-Distill-Qwen-1.5B: ~4 GB VRAM → CPU inference feasible
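The rule of thumb behind these figures: weights-only VRAM is roughly parameter count times bytes per parameter (2 for FP16/BF16), ignoring KV cache and activation overhead:

```python
# Weights-only VRAM estimate; real deployments need headroom on top
# for the KV cache, activations, and framework overhead.

def weights_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * bytes_per_param   # 1B params x 2 bytes ~ 2 GB

for name, params in [("1.5B", 1.5), ("7B", 7), ("32B", 32)]:
    print(name, weights_vram_gb(params), "GB")
# 1.5B -> 3 GB, 7B -> 14 GB (fits a 24 GB RTX 4090), 32B -> 64 GB
```

Quantizing to 4-bit roughly halves these numbers again, which is how the 7B distill ends up running comfortably on consumer hardware.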
A 1.5B model that runs on a laptop and scores 28.9% on AIME, better than GPT-4o, is a striking demonstration of how much the field has changed.
31.6 The 2025 Ecosystem
31.6.1 Where the Field Stands Now
The Chinese-language original of this chapter was written when o1 and R1 were the primary examples. As of April 2026, the ecosystem has expanded substantially. The core technical ideas remain the same; the landscape of deployed systems has grown.
OpenAI o3 and o4-mini: o3 represents a significant step beyond o1. o4-mini achieves o3-level performance on many benchmarks at substantially lower cost, making extended reasoning economically viable for a broader range of applications. Both models show better calibration on when to spend more thinking time.
Claude Opus 4.5's extended thinking mode: Anthropic integrated reasoning into Claude's product line. Extended thinking is visible to the user and controllable by the developer. Unlike o1's hidden traces, Claude shows the reasoning summary. This reflects a different product philosophy: transparency over polish.
Gemini 2.5 Pro with Thinking: Google's response to reasoning models. The distinctive feature is a thinking budget — a developer-controllable parameter that sets the maximum number of thinking tokens:
response = gemini.generate(
prompt="...",
thinking_budget=0, # disabled: fast, cheap
# thinking_budget=8192, # medium: balanced
# thinking_budget=24576 # maximum: best on hard problems
)
This creates a cost-performance knob: spend $0.01 on fast answers, or $0.06 on deep reasoning, depending on the query. Gemini 2.5 Pro reaches 92.0% on AIME 2024, ahead of o1 at 83.3%.
Gemini Deep Think achieved gold-medal level at the 2025 International Mathematical Olympiad. For context: IMO problems require creative mathematical insight over multiple hours. This is no longer a benchmark artifact.
The pattern across all these systems: extended reasoning is becoming a standard capability, not a specialty product.
31.6.2 Performance Comparison
| Model | AIME 2024 | MATH-500 | Codeforces | Open? | Cost tier |
|---|---|---|---|---|---|
| GPT-4o | 12–13% | 76.6% | ~1200 | No | Low |
| o1 | 83.3% | 94.8% | ~1800 | No | High |
| o3 | 96.7% | 97.9% | 2727 | No | Very high |
| DeepSeek-R1 | 79.8% | 97.3% | 2029 | Yes | Low |
| Gemini 2.5 Pro | 92.0% | ~95% | — | No | Medium |
| R1-Distill-32B | 72.6% | 94.3% | 1691 | Yes | Lowest |
31.7 When to Use Reasoning Models
31.7.1 The Cost vs Benefit Calculation
Single hard math problem (AIME difficulty):
GPT-4o: ~$0.01 response time ~2s
o1: ~$0.50 response time ~30s
Cost ratio: 50×
Whether o1 is worth it depends on what "correct" is worth to you. For a one-time calculation, $0.50 for reliability may be cheap. For a system that makes millions of such calls, you need the distilled models.
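One way to make that judgment concrete, assuming your answers are checkable so a wrong answer just means paying for a retry: compare expected cost per correct answer, price per call divided by accuracy. The accuracy figures below are the AIME numbers from this chapter; a real decision would use accuracy measured on your own task:

```python
# Expected cost per correct answer, assuming correctness is verifiable
# (so incorrect calls are simply repeated).

def cost_per_correct(price_per_call: float, accuracy: float) -> float:
    return price_per_call / accuracy

print(round(cost_per_correct(0.01, 0.13), 3))   # GPT-4o on AIME: ~0.077
print(round(cost_per_correct(0.50, 0.83), 3))   # o1 on AIME:     ~0.602
```

On AIME-difficulty problems the gap per correct answer (~8×) is far smaller than the 50× gap per call; on easy tasks, where both models are near-perfect, the cheap model wins outright.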
31.7.2 Good Use Cases for Reasoning
- Competitive programming: algorithm design, edge case handling, proving correctness
- Mathematical proofs: multi-step formal reasoning, counterexample generation
- Complex debugging: reasoning about what a program does before changing it
- Scientific reasoning: hypothesis generation, experiment design, data interpretation
- Legal/medical analysis: careful multi-factor reasoning with explicit uncertainty
31.7.3 Poor Use Cases for Reasoning
- Real-time conversation: 30-second response times are product-breaking
- Simple lookups: reasoning overhead wasted on "what is the capital of France"
- Creative writing: the systematic thinking style can make prose feel mechanical
- High-volume APIs: cost scales with difficulty, not traffic
31.7.4 Practical Selection Guide
| Task | Recommended model | Reason |
|---|---|---|
| Real-time chat | GPT-4o / Claude | Speed critical |
| Simple Q&A | GPT-4o-mini / Haiku | Cost critical |
| Complex math | o1 / R1 | Accuracy critical |
| Code generation | Claude / R1 | Balanced |
| Research analysis | o3 / Gemini 2.5 | Long chains needed |
| Edge deployment | R1-Distill-7B | Resource constrained |
31.8 Chapter Summary
31.8.1 Core Concepts
Test-Time Compute Scaling: performance is a function of both model weights and inference compute. Spending more tokens on internal reasoning improves hard-task accuracy.
o1/o3: OpenAI's systems that made this pattern commercially visible. AIME: 12% → 83% → 96.7%. The reasoning trace is internal and hidden.
DeepSeek-R1: open-source reasoning model matching o1 quality. Key findings: pure RL can induce reasoning capability (R1-Zero), GRPO eliminates the Critic model, cold-start data shapes formatting without damaging capability.
GRPO: eliminates the Critic model by using group-relative advantage estimation. Cuts training cost by ~50% for RL-based reasoning training.
Distillation: fine-tune a small model on the large model's reasoning traces. R1-Distill-7B outperforms o1-mini on many tasks. Distillation beats direct RL on the small model.
2025 landscape: o3, o4-mini, Claude Opus 4.5 extended thinking, Gemini 2.5 thinking mode. Extended reasoning is now a standard capability tier.
31.8.2 Key Benchmarks
AIME 2024 tells the story:
GPT-4o: 12% ← where we started
o1: 83% ← test-time compute, hidden trace
DeepSeek-R1: 80% ← open-source, reproducible
Gemini 2.5: 92% ← competitive landscape, April 2025
o3: 97% ← current frontier
31.8.3 My Take
The R1-Zero result is the most important insight from this chapter. Nobody programmed self-verification or backtracking into the model. Those behaviors appeared because the RL training signal rewarded correct final answers, and the model discovered that checking your work is a good strategy for getting correct answers. That is, in some meaningful sense, the model learned to think.
The distillation story is equally important for practitioners. If your application needs reasoning and you cannot afford o3, you have a credible path: run R1-Distill-7B locally, get performance that would have been frontier-level two years ago, and deploy it on a single consumer GPU. The democratization of reasoning capability happened faster than most expected.
Chapter Checklist
After this chapter, you should be able to:
- Explain why GPT-4o scored 12% on AIME 2024 and o1 scored 83%.
- Describe test-time compute scaling and how it differs from train-time scaling.
- Explain what a Process Reward Model is and why it helps reasoning training.
- Describe R1-Zero: what it proved, and how GRPO made it compute-efficient.
- Explain why distillation from R1 outperforms direct RL on the student model.
- Name the 2025 reasoning model ecosystem (o3, o4-mini, Claude extended thinking, Gemini 2.5).
- Choose between reasoning models based on latency, cost, and task difficulty.
See You in the Next Chapter
Reasoning models use more inference compute. But they still use Transformer foundations. The next question is whether Transformers are the right foundation at all for very long sequences.
Chapter 32 examines what happens when the O(N²) Attention cost becomes the binding constraint, and how State Space Models, Mamba, and hybrid architectures are trying to provide a better answer.