One-sentence summary: Learning rate decides how far to step on each parameter update — too large and the model oscillates, too small and it crawls, and only a well-chosen rate with a good schedule produces stable, efficient training.
17.1 What Is a Learning Rate?
17.1.1 The Basic Parameter Update Formula
Neural network training is gradient descent: compute how the loss changes with respect to each parameter, then move the parameters in the direction that reduces loss.
The basic update rule:
new_weight = old_weight - learning_rate × gradient
Or in math:
θ_new = θ_old - lr × ∂Loss/∂θ
17.1.2 A Concrete Example
learning_rate = 0.1
old_weight = 0.90
gradient = -0.4
new_weight = 0.90 - 0.1 × (-0.4)
= 0.90 + 0.04
= 0.94
The gradient is negative, which means increasing this weight reduces loss. So the weight moves upward from 0.90 to 0.94. The learning rate controls how much of the gradient we actually apply.
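The same arithmetic in a few lines of Python, using the example's numbers:

```python
# One parameter update with the numbers from the example above.
learning_rate = 0.1
old_weight = 0.90
gradient = -0.4  # negative gradient: increasing this weight reduces loss

new_weight = old_weight - learning_rate * gradient
print(round(new_weight, 2))  # 0.94
```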
17.1.3 What Learning Rate Controls
Learning rate (lr) decides the step size for each update:
- lr = 0.1: take 10% of the gradient each step
- lr = 0.01: take 1% of the gradient
- lr = 0.001: take 0.1% of the gradient
Every iteration of training applies this to every parameter in the model.
17.2 Three Cases
17.2.1 Visualizing the Loss Landscape
Too small:
- Every step is tiny
- Training takes a very long time to reach the minimum
- May get stuck in a shallow local minimum
- Training curve looks like it is barely moving
Too large:
- Every step is huge
- The optimizer overshoots and bounces around the minimum
- Loss oscillates instead of decreasing
- In the worst case, loss diverges entirely
Just right:
- Steps are moderate
- Loss descends steadily to the minimum
- The training curve slopes smoothly downward
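The three regimes are easy to reproduce on a toy quadratic loss L(θ) = θ², whose gradient is 2θ; the specific learning rates below are illustrative choices, not values from the text:

```python
def descend(lr, theta=1.0, steps=50):
    """Run gradient descent on L(theta) = theta**2 (gradient: 2 * theta)."""
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

print(abs(descend(lr=0.01)))  # too small: after 50 steps, still far from the minimum
print(abs(descend(lr=1.1)))   # too large: every step overshoots, |theta| grows (diverges)
print(abs(descend(lr=0.3)))   # just right: essentially at the minimum (theta = 0)
```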
17.2.2 Andrej Karpathy's Observation
There is a famous tweet from Andrej Karpathy:
"3e-4 is the best learning rate for Adam, hands down." "(i just wanted to make sure that people understand that this is a joke...)"
The joke lands because 3 × 10⁻⁴ (0.0003) really is a reasonable starting point for Adam-family optimizers on many tasks. The reason it feels like a universal answer is that a lot of standard architectures and batch sizes converge in a similar ballpark. But the actual best learning rate depends on model size, batch size, data, and schedule — there is no universal constant.
17.2.3 The Loss Landscape
The loss surface is a high-dimensional function of all model parameters. Visualizing it in 3D:
- X, Y axes: two parameter values
- Z axis: loss value
The surface has valleys and ridges. The goal is to descend into a valley. Learning rate determines how fast we descend — and whether we skip past the valley entirely.
17.3 Which Parameters Update?
17.3.1 Every Trainable Parameter in the Transformer
During full training, every trainable parameter is updated:
1. Word Embedding
- Each token's vector representation
- Parameters:
vocab_size × d_model
2. Attention weights
- Wq, Wk, Wv: produce Q, K, V from the input
- Wo: output projection after Attention
- Per-layer parameters:
4 × d_model²
3. FFN weights
- Two linear layers with ReLU between them
- Per-layer parameters:
2 × d_model × d_ff
4. Output projection Wp
- Maps hidden states to vocabulary logits
- Parameters:
d_model × vocab_size
- Often weight-tied to the token embedding
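These counts can be tallied for a concrete configuration. The GPT-2-small-like numbers below (d_model = 768, d_ff = 3072, vocab_size = 50257, 12 layers) are illustrative, and the tally ignores biases, LayerNorm, and positional embeddings:

```python
vocab_size, d_model, d_ff, n_layers = 50257, 768, 3072, 12

embedding = vocab_size * d_model   # 1. token embedding
attention = 4 * d_model ** 2       # 2. Wq, Wk, Wv, Wo (per layer)
ffn = 2 * d_model * d_ff           # 3. two FFN linear layers (per layer)
output_proj = 0                    # 4. weight-tied to the token embedding

total = embedding + n_layers * (attention + ffn) + output_proj
print(f"{total:,}")  # 123,532,032 -- roughly the ~124M of GPT-2 small
```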
17.3.2 All Parameters Update Together
The backward pass computes a gradient for every parameter simultaneously. Then a single optimizer step updates all of them at once, each using its own gradient and the shared learning rate.
Compute loss (compare prediction to target)
|
Backpropagation
|
Compute gradient for every parameter
|
Update all parameters simultaneously
17.4 PyTorch Implementation
17.4.1 Code Example
import torch
import torch.nn.functional as F

# 1. compute loss
loss = F.cross_entropy(input=logits_reshaped, target=targets_reshaped)
print(loss.item())  # output: 11.515044212341309

# 2. inspect weights before update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0321, -0.0234, ...

# 3. create optimizer and run one step
optimizer = torch.optim.AdamW(Wq.parameters(), lr=0.0001)
optimizer.zero_grad()  # clear accumulated gradients
loss.backward()        # compute gradients
optimizer.step()       # update parameters

# 4. inspect gradients
for name, param in Wq.named_parameters():
    print(name, param.grad)
# weight tensor([[-1.7941e-09, -2.7664e-10, -2.7543e-09, ...

# 5. inspect weights after update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0320, -0.0233, ...
17.4.2 The Three Critical Lines
- optimizer.zero_grad(): clear gradients (PyTorch accumulates gradients by default, so they must be cleared before each backward pass)
- loss.backward(): compute gradients for all parameters via automatic differentiation
- optimizer.step(): apply the update rule with the current learning rate
17.4.3 Observing Parameter Changes
The parameter shift in this example:
Before: 0.0474, -0.0321, -0.0234, ...
After: 0.0474, -0.0320, -0.0233, ...
The changes are tiny — fourth decimal place — because:
- learning rate is small (0.0001)
- gradients are small (~10⁻⁹)
But over thousands of steps, these tiny adjustments accumulate into genuine learning. That is the core idea of gradient descent.
17.5 Common Optimizers
17.5.1 SGD (Stochastic Gradient Descent)
The baseline:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
Update rule:
θ = θ - lr × gradient
Simple and theoretically clean. For Transformer training at scale, it usually converges too slowly. Still useful for understanding the fundamentals.
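One optimizer step on a single made-up tensor confirms that torch.optim.SGD applies exactly this formula:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

loss = (w ** 2).sum()  # d(loss)/dw = 2w = 2.0
loss.backward()
optimizer.step()       # w = 1.0 - 0.01 * 2.0 = 0.98

print(w.item())  # 0.98 (up to float32 precision)
```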
17.5.2 Adam
The most widely used optimizer in deep learning:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Key properties:
- Maintains first moment (mean) and second moment (variance) of gradients
- Adapts the effective learning rate per parameter
- Handles sparse gradients well
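A minimal sketch of the Adam update for one scalar parameter (hyperparameter names follow the original paper; the gradient values are made up). Note how the first step moves by about lr regardless of the gradient's scale, which is the per-parameter adaptation at work:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square,
    bias-corrected, then a step scaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
print(theta)  # ~0.999: the first step moves by about lr, whatever the gradient scale
```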
17.5.3 AdamW
Adam with a corrected weight decay:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)
Weight decay in vanilla Adam is slightly broken — it conflates the decay with the gradient adaptation. AdamW fixes this. AdamW is the default choice for training large Transformer models.
17.5.4 Common Configurations
| Model | Optimizer | Initial LR | Weight Decay |
|---|---|---|---|
| GPT-2 | Adam | 2.5e-4 | 0.01 |
| GPT-3 | Adam | 6e-5 ~ 2e-4 | 0.1 |
| LLaMA | AdamW | 3e-4 | 0.1 |
17.6 Learning Rate Schedules
17.6.1 Why Schedules Exist
A single fixed learning rate is rarely optimal throughout training:
- Early training: larger steps help find a good region quickly
- Late training: smaller steps help fine-tune within that region
17.6.2 Warmup + Cosine Decay
The most common schedule in modern LLM training:
Learning Rate
| /\
| / \__
| / \___
| / \____
+--------------------------> Training Steps
| |
warmup cosine decay
- Warmup: learning rate increases linearly from a small value to the target
- Cosine Decay: learning rate decreases following a cosine curve until a minimum value
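The same shape written as a closed-form function of the step (the peak rate, floor, and step counts below are illustrative):

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=1e-5, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500))      # halfway through warmup: 1.5e-4
print(lr_at(1_000))    # peak: 3e-4
print(lr_at(100_000))  # end of training: the 1e-5 floor
```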
17.6.3 PyTorch Implementation
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# warmup: linear increase for first 1000 steps
warmup_scheduler = LinearLR(
optimizer,
start_factor=0.1, # start at lr × 0.1
total_iters=1000
)
# cosine decay: reduce over the remaining steps
cosine_scheduler = CosineAnnealingLR(
optimizer,
T_max=100000, # total training steps
eta_min=1e-5 # minimum learning rate
)
# combine
scheduler = SequentialLR(
optimizer,
schedulers=[warmup_scheduler, cosine_scheduler],
milestones=[1000]
)
# in training loop
for step in range(total_steps):
loss = train_step(...)
optimizer.step()
scheduler.step() # update learning rate
17.6.4 Why Warmup?
At the start of training:
- Parameters are randomly initialized — their scales are arbitrary
- Gradient direction estimates are unreliable, based on minimal history
- Large steps can cause irreversible damage to the model's initial representations
Warmup gives the model time to find stable internal scales before applying the full learning rate. Without warmup, large models often diverge in the first few hundred steps.
17.7 Learning Rate and Other Hyperparameters
17.7.1 Batch Size
Larger batch → can use larger learning rate.
The empirical rule (Linear Scaling Rule):
if batch size doubles, learning rate can also double
Reasoning: a larger batch gives a more accurate gradient estimate. A more accurate gradient means a larger step is safe.
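A sketch of the rule (the base values are made up; in practice the scaled rate is paired with a fresh warmup, and the rule breaks down at very large batch sizes):

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(3e-4, base_batch_size=256, new_batch_size=512))   # doubles to 6e-4
print(scaled_lr(3e-4, base_batch_size=256, new_batch_size=1024))  # quadruples to 1.2e-3
```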
17.7.2 Model Size
Larger model → typically needs smaller learning rate.
| Model scale | Parameters | Typical learning rate |
|---|---|---|
| Small | ~100M | 3e-4 ~ 1e-3 |
| Medium | ~1B | 1e-4 ~ 3e-4 |
| Large | ~10B | 3e-5 ~ 1e-4 |
| XL | ~100B+ | 1e-5 ~ 3e-5 |
Larger models have more parameters and more complex loss surfaces. Smaller steps reduce the risk of destabilizing them.
17.7.3 Weight Decay
Weight decay is an L2-style regularization term added to each update:
θ = θ - lr × (gradient + weight_decay × θ)
(In AdamW the decay term is applied separately from the adaptive gradient step rather than folded into the gradient, but the effect is the same: every update shrinks the weights slightly.)
It pushes weights toward zero, which prevents individual parameters from growing very large and helps prevent overfitting.
Common values: 0.01 to 0.1. PyTorch's AdamW default is 0.01; most LLM training setups use 0.1.
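With the loss gradient held at zero, the decay term alone shrinks a weight geometrically; a toy sketch in the SGD form (all numbers made up):

```python
def decayed_step(theta, grad, lr=0.01, weight_decay=0.1):
    """SGD-style update with weight decay folded into the gradient."""
    return theta - lr * (grad + weight_decay * theta)

theta = 5.0
for _ in range(1000):
    theta = decayed_step(theta, grad=0.0)  # no loss gradient: pure decay
print(theta)  # shrunk toward zero: 5.0 * (1 - 0.001) ** 1000 ≈ 1.84
```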
17.8 Practical Guidance
17.8.1 How to Choose a Learning Rate
1. Start from defaults:
   - Adam/AdamW on Transformers: 3e-4 is a reasonable first guess
   - Larger models: 1e-4 or smaller
2. Read the training curve:
   - Loss decreasing steadily → learning rate is good
   - Loss oscillating → learning rate is too large
   - Loss barely moving → learning rate is too small
3. Use a learning rate finder:
   - Start very small, increase exponentially over a few hundred steps
   - Find the point where loss decreases fastest
   - Use a value slightly below that
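A toy version of the range test on the quadratic L(θ) = θ² (the sweep bounds and stopping rule are arbitrary choices for illustration):

```python
def lr_range_test(grad_fn, loss_fn, theta=1.0, lr_min=1e-6, lr_max=10.0, n=100):
    """Grow the learning rate exponentially each step, recording (lr, loss)."""
    history = []
    for i in range(n):
        lr = lr_min * (lr_max / lr_min) ** (i / (n - 1))
        theta = theta - lr * grad_fn(theta)
        history.append((lr, loss_fn(theta)))
        if loss_fn(theta) > 10.0:  # loss has blown up: end the sweep
            break
    return history

history = lr_range_test(grad_fn=lambda t: 2 * t, loss_fn=lambda t: t * t)

# find the lr at which the loss dropped fastest, then use something a bit smaller
biggest_drop = max(zip(history, history[1:]), key=lambda pair: pair[0][1] - pair[1][1])
candidate_lr = biggest_drop[1][0]
print(candidate_lr)
```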
17.8.2 Common Diagnostics
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss not decreasing | LR too small | increase LR |
| Loss oscillating sharply | LR too large | reduce LR |
| Loss becomes NaN | LR too large or numerical overflow | reduce LR, check data |
| Loss falls then rises | overfitting or LR not decaying | add decay, add regularization |
17.8.3 My Recommended Configuration
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # initial learning rate
    betas=(0.9, 0.95),  # Adam momentum parameters
    weight_decay=0.1    # weight decay
)

total_steps = 100000
warmup_steps = 2000

# get_cosine_schedule_with_warmup comes from the Hugging Face transformers library:
# from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)
Using betas=(0.9, 0.95) instead of the Adam default of (0.9, 0.999) follows Karpathy's recommendation for language model training: the lower second-moment decay rate makes the optimizer respond faster to recent gradient information.
17.9 Chapter Summary
17.9.1 Core Formula
new_weight = old_weight - learning_rate × gradient
θ_new = θ_old - lr × ∂Loss/∂θ
17.9.2 Learning Rate Effects
| Learning rate | Effect |
|---|---|
| Too large | oscillation, divergence |
| Too small | slow convergence, may get stuck |
| Well chosen | stable, efficient convergence |
17.9.3 Recommended Configuration
| Component | Recommendation |
|---|---|
| Optimizer | AdamW |
| Initial LR | 1e-4 ~ 3e-4 |
| Weight Decay | 0.01 ~ 0.1 |
| Schedule | Warmup + Cosine Decay |
| Warmup steps | 1-5% of total steps |
17.9.4 Core Insight
Learning rate is probably the most important training hyperparameter. It controls the size of every parameter update. Too large causes oscillation; too small causes sluggish convergence. Pair it with warmup and cosine decay for stable, efficient runs. For large Transformer models, AdamW with lr ≈ 3e-4 and weight_decay ≈ 0.1 is a solid default.
Chapter Checklist
After this chapter you should be able to:
- State the weight update formula and explain what each term does.
- Describe the symptoms of a too-large and too-small learning rate.
- Explain why AdamW is preferred over vanilla Adam for Transformer training.
- Explain warmup and cosine decay and when each applies.
Part 4 Summary
You have completed Part 4: Complete Architecture.
| Chapter | Topic | Core idea |
|---|---|---|
| Ch. 13 | Residual connections and Dropout | gradient highways, overfitting prevention |
| Ch. 14 | Embeddings + positional encoding | sum, not concatenate |
| Ch. 15 | Full forward pass | text to probabilities, shape tracking |
| Ch. 16 | Training vs inference | parallel training, serial inference |
| Ch. 17 | Learning rate | step size control and scheduling |
You now understand how a Transformer works end to end — every piece and how they connect.
See You in the Next Chapter
That is enough theory for now. If you can explain warmup + cosine decay without looking at the chart, you are ready for Part 5.
Part 5 is where we write code. Chapter 18 defines the model in PyTorch, from scratch, one class at a time. Chapter 19 trains it. Chapter 20 runs inference. By the end, you will have a working GPT in under 400 lines of code.