One-sentence summary: Learning rate decides how far to step on each parameter update — too large and the model oscillates, too small and it crawls, and only a well-chosen rate with a good schedule produces stable, efficient training.
17.1 What Is a Learning Rate?
17.1.1 The Basic Parameter Update Formula
Neural network training is gradient descent: compute how the loss changes with respect to each parameter, then move the parameters in the direction that reduces loss.
The basic update rule:
new_weight = old_weight - learning_rate × gradient
Or in math:
θ_new = θ_old - lr × ∂Loss/∂θ
17.1.2 A Concrete Example
learning_rate = 0.1
old_weight = 0.90
gradient = -0.4
new_weight = 0.90 - 0.1 × (-0.4)
= 0.90 + 0.04
= 0.94
The gradient is negative, which means increasing this weight reduces loss. So the weight moves upward from 0.90 to 0.94. The learning rate controls how much of the gradient we actually apply.
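The same arithmetic in a few lines of Python, using the example's numbers:

```python
# One parameter update with the numbers from the example above.
learning_rate = 0.1
old_weight = 0.90
gradient = -0.4  # negative gradient: increasing this weight reduces loss

new_weight = old_weight - learning_rate * gradient
print(round(new_weight, 2))  # 0.94
```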
17.1.3 What Learning Rate Controls
Learning rate (lr) decides the step size for each update:
- lr = 0.1: take 10% of the gradient each step
- lr = 0.01: take 1% of the gradient
- lr = 0.001: take 0.1% of the gradient
Every iteration of training applies this to every parameter in the model.
17.2 Three Cases
17.2.1 Visualizing the Loss Landscape
Too small:
- Every step is tiny
- Training takes a very long time to reach the minimum
- May get stuck in a shallow local minimum
- Training curve looks like it is barely moving
Too large:
- Every step is huge
- The optimizer overshoots and bounces around the minimum
- Loss oscillates instead of decreasing
- In the worst case, loss diverges entirely
Just right:
- Steps are moderate
- Loss descends steadily to the minimum
- The training curve slopes smoothly downward
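The three regimes are easy to reproduce on a toy quadratic loss L(θ) = θ², whose gradient is 2θ; the specific learning rates below are illustrative choices, not values from the text:

```python
def descend(lr, theta=1.0, steps=50):
    """Run gradient descent on L(theta) = theta**2 (gradient: 2 * theta)."""
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

print(abs(descend(lr=0.01)))  # too small: after 50 steps, still far from the minimum
print(abs(descend(lr=1.1)))   # too large: every step overshoots, |theta| grows (diverges)
print(abs(descend(lr=0.3)))   # just right: essentially at the minimum (theta = 0)
```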
17.2.2 Andrej Karpathy's Observation
There is a famous tweet from Andrej Karpathy:
"3e-4 is the best learning rate for Adam, hands down." "(i just wanted to make sure that people understand that this is a joke...)"
The joke lands because 3 × 10⁻⁴ (0.0003) really is a reasonable starting point for Adam-family optimizers on many tasks. The reason it feels like a universal answer is that a lot of standard architectures and batch sizes converge in a similar ballpark. But the actual best learning rate depends on model size, batch size, data, and schedule — there is no universal constant.
17.2.3 The Loss Landscape
The loss surface is a high-dimensional function of all model parameters. Visualizing it in 3D:
- X, Y axes: two parameter values
- Z axis: loss value
The surface has valleys and ridges. The goal is to descend into a valley. Learning rate determines how fast we descend — and whether we skip past the valley entirely.
17.3 Which Parameters Update?
17.3.1 Every Trainable Parameter in the Transformer
During full training, every trainable parameter is updated:
1. Word Embedding
- Each token's vector representation
- Parameters:
vocab_size × d_model
2. Attention weights
- Wq, Wk, Wv: produce Q, K, V from the input
- Wo: output projection after Attention
- Per-layer parameters:
4 × d_model²
3. FFN weights
- Two linear layers with ReLU between them
- Per-layer parameters:
2 × d_model × d_ff
4. Output projection Wp
- Maps hidden states to vocabulary logits
- Parameters:
d_model × vocab_size
- Often weight-tied to the token embedding
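These counts can be tallied for a concrete configuration. The GPT-2-small-like numbers below (d_model = 768, d_ff = 3072, vocab_size = 50257, 12 layers) are illustrative, and the tally ignores biases, LayerNorm, and positional embeddings:

```python
vocab_size, d_model, d_ff, n_layers = 50257, 768, 3072, 12

embedding = vocab_size * d_model   # 1. token embedding
attention = 4 * d_model ** 2       # 2. Wq, Wk, Wv, Wo (per layer)
ffn = 2 * d_model * d_ff           # 3. two FFN linear layers (per layer)
output_proj = 0                    # 4. weight-tied to the token embedding

total = embedding + n_layers * (attention + ffn) + output_proj
print(f"{total:,}")  # 123,532,032 -- roughly the ~124M of GPT-2 small
```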
17.3.2 All Parameters Update Together
The backward pass computes a gradient for every parameter simultaneously. Then a single optimizer step updates all of them at once, each using its own gradient and the shared learning rate.
Compute loss (compare prediction to target)
|
Backpropagation
|
Compute gradient for every parameter
|
Update all parameters simultaneously
17.4 PyTorch Implementation
17.4.1 Code Example
import torch
import torch.nn.functional as F

# 1. compute loss
loss = F.cross_entropy(input=logits_reshaped, target=targets_reshaped)
print(loss.item())  # output: 11.515044212341309

# 2. inspect weights before update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0321, -0.0234, ...

# 3. create optimizer and run one step
optimizer = torch.optim.AdamW(Wq.parameters(), lr=0.0001)
optimizer.zero_grad()  # clear accumulated gradients
loss.backward()        # compute gradients
optimizer.step()       # update parameters

# 4. inspect gradients
for name, param in Wq.named_parameters():
    print(name, param.grad)
# weight tensor([[-1.7941e-09, -2.7664e-10, -2.7543e-09, ...

# 5. inspect weights after update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0320, -0.0233, ...
17.4.2 The Three Critical Lines
- optimizer.zero_grad(): clear gradients (PyTorch accumulates gradients by default, so they must be cleared before each backward pass)
- loss.backward(): compute gradients for all parameters via automatic differentiation
- optimizer.step(): apply the update rule with the current learning rate
17.4.3 Observing Parameter Changes
The parameter shift in this example:
Before: 0.0474, -0.0321, -0.0234, ...
After: 0.0474, -0.0320, -0.0233, ...
The changes are tiny — fourth decimal place — because:
- learning rate is small (0.0001)
- gradients are small (~10⁻⁹)
But over thousands of steps, these tiny adjustments accumulate into genuine learning. That is the core idea of gradient descent.
17.5 Common Optimizers
17.5.1 SGD (Stochastic Gradient Descent)
The baseline:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
Update rule:
θ = θ - lr × gradient
Simple and theoretically clean. For Transformer training at scale, it usually converges too slowly. Still useful for understanding the fundamentals.
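One optimizer step on a single made-up tensor confirms that torch.optim.SGD applies exactly this formula:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

loss = (w ** 2).sum()  # d(loss)/dw = 2w = 2.0
loss.backward()
optimizer.step()       # w = 1.0 - 0.01 * 2.0 = 0.98

print(w.item())  # 0.98 (up to float32 precision)
```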
17.5.2 Adam
The most widely used optimizer in deep learning:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Key properties:
- Maintains first moment (mean) and second moment (variance) of gradients
- Adapts the effective learning rate per parameter
- Handles sparse gradients well
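A minimal sketch of the Adam update for one scalar parameter (hyperparameter names follow the original paper; the gradient values are made up). Note how the first step moves by about lr regardless of the gradient's scale, which is the per-parameter adaptation at work:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square,
    bias-corrected, then a step scaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
print(theta)  # ~0.999: the first step moves by about lr, whatever the gradient scale
```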
17.5.3 AdamW
Adam with a corrected weight decay:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)
Weight decay in vanilla Adam is slightly broken — it conflates the decay with the gradient adaptation. AdamW fixes this. AdamW is the default choice for training large Transformer models.
17.5.4 Common Configurations
| Model | Optimizer | Initial LR | Weight Decay |
|---|---|---|---|
| GPT-2 | Adam | 2.5e-4 | 0.01 |
| GPT-3 | Adam | 6e-5 ~ 2e-4 | 0.1 |
| LLaMA | AdamW | 3e-4 | 0.1 |
17.6 Learning Rate Schedules
17.6.1 Why Schedules Exist
A single fixed learning rate is rarely optimal throughout training:
- Early training: larger steps help find a good region quickly
- Late training: smaller steps help fine-tune within that region
17.6.2 Warmup + Cosine Decay
The most common schedule in modern LLM training:
Learning Rate
| /\
| / \__
| / \___
| / \____
+--------------------------> Training Steps
| |
warmup cosine decay
- Warmup: learning rate increases linearly from a small value to the target
- Cosine Decay: learning rate decreases following a cosine curve until a minimum value
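The same shape written as a closed-form function of the step (the peak rate, floor, and step counts below are illustrative):

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=1e-5, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500))      # halfway through warmup: 1.5e-4
print(lr_at(1_000))    # peak: 3e-4
print(lr_at(100_000))  # end of training: the 1e-5 floor
```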
17.6.3 PyTorch Implementation
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# warmup: linear increase for first 1000 steps
warmup_scheduler = LinearLR(
optimizer,
start_factor=0.1, # start at lr × 0.1
total_iters=1000
)
# cosine decay: reduce over the remaining steps
cosine_scheduler = CosineAnnealingLR(
optimizer,
T_max=100000, # total training steps
eta_min=1e-5 # minimum learning rate
)
# combine
scheduler = SequentialLR(
optimizer,
schedulers=[warmup_scheduler, cosine_scheduler],
milestones=[1000]
)
# in training loop
for step in range(total_steps):
loss = train_step(...)
optimizer.step()
scheduler.step() # update learning rate
17.6.4 Why Warmup?
At the start of training:
- Parameters are randomly initialized — their scales are arbitrary
- Gradient direction estimates are unreliable, based on minimal history
- Large steps can cause irreversible damage to the model's initial representations
Warmup gives the model time to find stable internal scales before applying the full learning rate. Without warmup, large models often diverge in the first few hundred steps.
17.7 Learning Rate and Other Hyperparameters
17.7.1 Batch Size
Larger batch → can use larger learning rate.
The empirical rule (Linear Scaling Rule):
if batch size doubles, learning rate can also double
Reasoning: a larger batch gives a more accurate gradient estimate. A more accurate gradient means a larger step is safe.
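A sketch of the rule (the base values are made up; in practice the scaled rate is paired with a fresh warmup, and the rule breaks down at very large batch sizes):

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(3e-4, base_batch_size=256, new_batch_size=512))   # doubles to 6e-4
print(scaled_lr(3e-4, base_batch_size=256, new_batch_size=1024))  # quadruples to 1.2e-3
```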
17.7.2 Model Size
Larger model → typically needs smaller learning rate.
| Model scale | Parameters | Typical learning rate |
|---|---|---|
| Small | ~100M | 3e-4 ~ 1e-3 |
| Medium | ~1B | 1e-4 ~ 3e-4 |
| Large | ~10B | 3e-5 ~ 1e-4 |
| XL | ~100B+ | 1e-5 ~ 3e-5 |
Larger models have more parameters and more complex loss surfaces. Smaller steps reduce the risk of destabilizing them.
17.7.3 Weight Decay
Weight decay is an L2-style regularization term added to each update:
θ = θ - lr × (gradient + weight_decay × θ)
(In AdamW the decay term is applied separately from the adaptive gradient step rather than folded into the gradient, but the effect is the same: every update shrinks the weights slightly.)
It pushes weights toward zero, which prevents individual parameters from growing very large and helps prevent overfitting.
Common values: 0.01 to 0.1. PyTorch's AdamW default is 0.01; most LLM training setups use 0.1.
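With the loss gradient held at zero, the decay term alone shrinks a weight geometrically; a toy sketch in the SGD form (all numbers made up):

```python
def decayed_step(theta, grad, lr=0.01, weight_decay=0.1):
    """SGD-style update with weight decay folded into the gradient."""
    return theta - lr * (grad + weight_decay * theta)

theta = 5.0
for _ in range(1000):
    theta = decayed_step(theta, grad=0.0)  # no loss gradient: pure decay
print(theta)  # shrunk toward zero: 5.0 * (1 - 0.001) ** 1000 ≈ 1.84
```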
17.8 Practical Guidance
17.8.1 How to Choose a Learning Rate
1. Start from defaults:
   - Adam/AdamW on Transformers: 3e-4 is a reasonable first guess
   - Larger models: 1e-4 or smaller
2. Read the training curve:
   - Loss decreasing steadily → learning rate is good
   - Loss oscillating → learning rate is too large
   - Loss barely moving → learning rate is too small
3. Use a learning rate finder:
   - Start very small, increase exponentially over a few hundred steps
   - Find the point where loss decreases fastest
   - Use a value slightly below that
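A toy version of the range test on the quadratic L(θ) = θ² (the sweep bounds and stopping rule are arbitrary choices for illustration):

```python
def lr_range_test(grad_fn, loss_fn, theta=1.0, lr_min=1e-6, lr_max=10.0, n=100):
    """Grow the learning rate exponentially each step, recording (lr, loss)."""
    history = []
    for i in range(n):
        lr = lr_min * (lr_max / lr_min) ** (i / (n - 1))
        theta = theta - lr * grad_fn(theta)
        history.append((lr, loss_fn(theta)))
        if loss_fn(theta) > 10.0:  # loss has blown up: end the sweep
            break
    return history

history = lr_range_test(grad_fn=lambda t: 2 * t, loss_fn=lambda t: t * t)

# find the lr at which the loss dropped fastest, then use something a bit smaller
biggest_drop = max(zip(history, history[1:]), key=lambda pair: pair[0][1] - pair[1][1])
candidate_lr = biggest_drop[1][0]
print(candidate_lr)
```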
17.8.2 Common Diagnostics
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss not decreasing | LR too small | increase LR |
| Loss oscillating sharply | LR too large | reduce LR |
| Loss becomes NaN | LR too large or numerical overflow | reduce LR, check data |
| Loss falls then rises | overfitting or LR not decaying | add decay, add regularization |
17.8.3 My Recommended Configuration
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # initial learning rate
    betas=(0.9, 0.95),  # Adam momentum parameters
    weight_decay=0.1    # weight decay
)

total_steps = 100000
warmup_steps = 2000

# get_cosine_schedule_with_warmup comes from the Hugging Face transformers library:
# from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)
Using betas=(0.9, 0.95) instead of the Adam default of (0.9, 0.999) follows Karpathy's recommendation for language model training: the lower second-moment decay rate makes the optimizer respond faster to recent gradient information.
17.9 Chapter Summary
17.9.1 Core Formula
new_weight = old_weight - learning_rate × gradient
θ_new = θ_old - lr × ∂Loss/∂θ
17.9.2 Learning Rate Effects
| Learning rate | Effect |
|---|---|
| Too large | oscillation, divergence |
| Too small | slow convergence, may get stuck |
| Well chosen | stable, efficient convergence |
17.9.3 Recommended Configuration
| Component | Recommendation |
|---|---|
| Optimizer | AdamW |
| Initial LR | 1e-4 ~ 3e-4 |
| Weight Decay | 0.01 ~ 0.1 |
| Schedule | Warmup + Cosine Decay |
| Warmup steps | 1-5% of total steps |
17.9.4 Core Insight
Learning rate is probably the most important training hyperparameter. It controls the size of every parameter update. Too large causes oscillation; too small causes sluggish convergence. Pair it with warmup and cosine decay for stable, efficient runs. For large Transformer models, AdamW with lr ≈ 3e-4 and weight_decay ≈ 0.1 is a solid default.
Chapter Checklist
After this chapter you should be able to:
- State the weight update formula and explain what each term does.
- Describe the symptoms of a too-large and too-small learning rate.
- Explain why AdamW is preferred over vanilla Adam for Transformer training.
- Explain warmup and cosine decay and when each applies.
Part 4 Summary
You have completed Part 4: Complete Architecture.
| Chapter | Topic | Core idea |
|---|---|---|
| Ch. 13 | Residual connections and Dropout | gradient highways, overfitting prevention |
| Ch. 14 | Embeddings + positional encoding | sum, not concatenate |
| Ch. 15 | Full forward pass | text to probabilities, shape tracking |
| Ch. 16 | Training vs inference | parallel training, serial inference |
| Ch. 17 | Learning rate | step size control and scheduling |
You now understand how a Transformer works end to end — every piece and how they connect.
See You in the Next Chapter
That is enough theory for now. If you can explain warmup + cosine decay without looking at the chart, you are ready for Part 5.
Part 5 is where we write code. Chapter 18 defines the model in PyTorch, from scratch, one class at a time. Chapter 19 trains it. Chapter 20 runs inference. By the end, you will have a working GPT in under 400 lines of code.