One-sentence summary: Learning rate decides how far to step on each parameter update — too large and the model oscillates, too small and it crawls, and only a well-chosen rate with a good schedule produces stable, efficient training.


17.1 What Is a Learning Rate?

17.1.1 The Basic Parameter Update Formula

Figure: Weight update formula with learning rate

Neural network training is gradient descent: compute how the loss changes with respect to each parameter, then move the parameters in the direction that reduces loss.

The basic update rule:

new_weight = old_weight - learning_rate × gradient

Or in math:

θ_new = θ_old - lr × ∂Loss/∂θ

17.1.2 A Concrete Example

learning_rate = 0.1
old_weight    = 0.90
gradient      = -0.4

new_weight = 0.90 - 0.1 × (-0.4)
           = 0.90 + 0.04
           = 0.94

The gradient is negative, which means increasing this weight reduces loss. So the weight moves upward from 0.90 to 0.94. The learning rate controls how much of the gradient we actually apply.
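The same arithmetic, checked in plain Python:

```python
learning_rate = 0.1
old_weight = 0.90
gradient = -0.4

# new_weight = old_weight - learning_rate * gradient
new_weight = old_weight - learning_rate * gradient
print(round(new_weight, 2))  # 0.94
```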

17.1.3 What Learning Rate Controls

Learning rate (lr) decides the step size for each update:

  • lr = 0.1: take 10% of the gradient each step
  • lr = 0.01: take 1% of the gradient
  • lr = 0.001: take 0.1% of the gradient

Every iteration of training applies this to every parameter in the model.


17.2 Three Cases

17.2.1 Visualizing the Loss Landscape

Figure: Loss landscape with too-small, ideal, and too-large learning rates

Too small (left):

  • Every step is tiny
  • Training takes a very long time to reach the minimum
  • May get stuck in a shallow local minimum
  • Training curve looks like it is barely moving

Too large (right):

  • Every step is huge
  • The optimizer overshoots and bounces around the minimum
  • Loss oscillates instead of decreasing
  • In the worst case, loss diverges entirely

Just right (center):

  • Steps are moderate
  • Loss descends steadily to the minimum
  • The training curve slopes smoothly downward
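All three cases can be reproduced on a toy one-dimensional loss, f(x) = x², whose gradient is 2x (a sketch I wrote for illustration, not code from this chapter):

```python
def descend(lr, steps=20, x=5.0):
    # gradient descent on f(x) = x^2; the gradient at x is 2x
    for _ in range(steps):
        x = x - lr * (2 * x)
    return x

print(descend(0.001))  # too small: barely moves from the start at 5.0
print(descend(0.1))    # just right: converges close to the minimum at 0
print(descend(1.1))    # too large: each step overshoots and |x| grows
```

With lr = 1.1 each step multiplies x by (1 - 2.2) = -1.2, so the iterate flips sign and grows: the divergence case in miniature.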

17.2.2 Andrej Karpathy's Observation

There is a famous tweet from Andrej Karpathy:

"3e-4 is the best learning rate for Adam, hands down."

He followed up with: "(i just wanted to make sure that people understand that this is a joke...)"

The joke lands because 3 × 10⁻⁴ (0.0003) really is a reasonable starting point for Adam-family optimizers on many tasks. The reason it feels like a universal answer is that a lot of standard architectures and batch sizes converge in a similar ballpark. But the actual best learning rate depends on model size, batch size, data, and schedule — there is no universal constant.

17.2.3 The Loss Landscape

The loss surface is a high-dimensional function of all model parameters. Visualizing it in 3D:

  • X, Y axes: two parameter values
  • Z axis: loss value

The surface has valleys and ridges. The goal is to descend into a valley. Learning rate determines how fast we descend — and whether we skip past the valley entirely.


17.3 Which Parameters Update?

17.3.1 Every Trainable Parameter in the Transformer

Figure: Backpropagation through the Transformer architecture

During full training, every trainable parameter is updated:

1. Word Embedding

  • Each token's vector representation
  • Parameters: vocab_size × d_model

2. Attention weights

  • Wq, Wk, Wv: produce Q, K, V from the input
  • Wo: output projection after Attention
  • Per-layer parameters: 4 × d_model²

3. FFN weights

  • Two linear layers with ReLU between them
  • Per-layer parameters: 2 × d_model × d_ff

4. Output projection Wp

  • Maps hidden states to vocabulary logits
  • Parameters: d_model × vocab_size
  • Often weight-tied to the token embedding
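Plugging GPT-2-small-like numbers into the formulas above gives a rough total (a sketch that ignores biases, LayerNorms, and positional embeddings, and assumes the output projection is weight-tied to the embedding):

```python
vocab_size, d_model, d_ff, n_layers = 50257, 768, 3072, 12

embedding = vocab_size * d_model     # word embedding: vocab_size × d_model
attention = 4 * d_model ** 2         # Wq, Wk, Wv, Wo per layer: 4 × d_model²
ffn       = 2 * d_model * d_ff       # two FFN linear layers: 2 × d_model × d_ff

total = embedding + n_layers * (attention + ffn)
print(f"{total:,}")  # roughly 124M, in the ballpark of GPT-2 small
```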

17.3.2 All Parameters Update Together

The backward pass computes a gradient for every parameter simultaneously. Then a single optimizer step updates all of them at once, each using its own gradient and the shared learning rate.

Compute loss (compare prediction to target)
         |
    Backpropagation
         |
Compute gradient for every parameter
         |
    Update all parameters simultaneously

17.4 PyTorch Implementation

17.4.1 Code Example

PyTorch training code
import torch
import torch.nn.functional as F

# 1. compute loss
loss = F.cross_entropy(input=logits_reshaped, target=targets_reshaped)
print(loss.item())  # output: 11.515044212341309

# 2. inspect weights before update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0321, -0.0234, ...

# 3. create optimizer and run one step
optimizer = torch.optim.AdamW(Wq.parameters(), lr=0.0001)
optimizer.zero_grad()     # clear accumulated gradients
loss.backward()           # compute gradients
optimizer.step()          # update parameters

# 4. inspect gradients
for name, param in Wq.named_parameters():
    print(name, param.grad)
# weight tensor([[-1.7941e-09, -2.7664e-10, -2.7543e-09, ...

# 5. inspect weights after update
for name, param in Wq.named_parameters():
    print(name, param)
# weight Parameter containing:
# tensor([[ 0.0474, -0.0320, -0.0233, ...

17.4.2 The Three Critical Lines

  1. optimizer.zero_grad(): clear accumulated gradients (PyTorch accumulates by default, so this must run before each backward pass)
  2. loss.backward(): compute gradients for all parameters via autodiff
  3. optimizer.step(): apply the update rule with the current learning rate

17.4.3 Observing Parameter Changes

The parameter shift in this example:

Before: 0.0474, -0.0321, -0.0234, ...
After:  0.0474, -0.0320, -0.0233, ...

The changes are tiny — fourth decimal place — because:

  • learning rate is small (0.0001)
  • gradients are small (~10⁻⁹)

But over thousands of steps, these tiny adjustments accumulate into genuine learning. That is the core idea of gradient descent.


17.5 Common Optimizers

17.5.1 SGD (Stochastic Gradient Descent)

The baseline:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

Update rule:

θ = θ - lr × gradient

Simple and theoretically clean. For Transformer training at scale, it usually converges too slowly. Still useful for understanding the fundamentals.

17.5.2 Adam

The most widely used optimizer in deep learning:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Key properties:

  • Maintains first moment (mean) and second moment (variance) of gradients
  • Adapts the effective learning rate per parameter
  • Handles sparse gradients well
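A single-parameter sketch of the standard Adam update with bias correction (variable names are mine, not from any library):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # first moment: exponential moving average of gradients
    m = b1 * m + (1 - b1) * grad
    # second moment: exponential moving average of squared gradients
    v = b2 * v + (1 - b2) * grad ** 2
    # bias correction for the zero-initialized moving averages
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # adaptive step: gradient normalized by its recent magnitude
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# on the first step the update size is ~lr, regardless of gradient scale
theta, m, v = adam_step(theta=0.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta)  # ≈ -0.001
```

Dividing by the second-moment estimate is what "adapts the effective learning rate per parameter": parameters with consistently large gradients get smaller steps, and vice versa.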

17.5.3 AdamW

Adam with a corrected weight decay:

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)

Weight decay in vanilla Adam is slightly broken — it conflates the decay with the gradient adaptation. AdamW fixes this. AdamW is the default choice for training large Transformer models.

17.5.4 Common Configurations

Model     Optimizer   Initial LR      Weight Decay
GPT-2     Adam        2.5e-4          0.01
GPT-3     Adam        6e-5 ~ 2e-4     0.1
LLaMA     AdamW       3e-4            0.1

17.6 Learning Rate Schedules

17.6.1 Why Schedules Exist

A single fixed learning rate is rarely optimal throughout training:

  • Early training: larger steps help find a good region quickly
  • Late training: smaller steps help fine-tune within that region

17.6.2 Warmup + Cosine Decay

The most common schedule in modern LLM training:

Learning Rate
  |    /\
  |   /  \__
  |  /      \___
  | /           \____
  +--------------------------> Training Steps
    |        |
  warmup   cosine decay
  1. Warmup: learning rate increases linearly from a small value to the target
  2. Cosine Decay: learning rate decreases following a cosine curve until a minimum value
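The shape sketched above can be written as a small pure-Python function (a hand-rolled sketch; the step counts are illustrative):

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup_steps=1000, total_steps=100000):
    # linear warmup: 0 -> max_lr over the first warmup_steps
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # cosine decay: max_lr -> min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(1000), lr_at(100000))  # 0.0, ~3e-4, ~1e-5
```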

17.6.3 PyTorch Implementation

from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# warmup: linear increase for first 1000 steps
warmup_scheduler = LinearLR(
    optimizer,
    start_factor=0.1,   # start at lr × 0.1
    total_iters=1000
)

# cosine decay: reduce over the remaining steps
cosine_scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100000,  # total training steps
    eta_min=1e-5   # minimum learning rate
)

# combine
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[1000]
)

# in training loop
for step in range(total_steps):
    optimizer.zero_grad()
    loss = train_step(...)    # forward pass returning the loss
    loss.backward()
    optimizer.step()
    scheduler.step()          # advance the learning rate schedule

17.6.4 Why Warmup?

At the start of training:

  • Parameters are randomly initialized — their scales are arbitrary
  • Gradient direction estimates are unreliable, based on minimal history
  • Large steps can cause irreversible damage to the model's initial representations

Warmup gives the model time to find stable internal scales before applying the full learning rate. Without warmup, large models often diverge in the first few hundred steps.


17.7 Learning Rate and Other Hyperparameters

17.7.1 Batch Size

Larger batch → can use larger learning rate.

The empirical rule (Linear Scaling Rule):

if batch size doubles, learning rate can also double

Reasoning: a larger batch gives a more accurate gradient estimate. A more accurate gradient means a larger step is safe.
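With illustrative numbers (the base of lr 3e-4 at batch size 256 is an assumption for the example, not a universal anchor):

```python
base_lr, base_batch = 3e-4, 256

# linear scaling rule: learning rate scales with batch size
for batch in (256, 512, 1024):
    print(batch, base_lr * batch / base_batch)
```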

17.7.2 Model Size

Larger model → typically needs smaller learning rate.

Model scale   Parameters   Typical learning rate
Small         ~100M        3e-4 ~ 1e-3
Medium        ~1B          1e-4 ~ 3e-4
Large         ~10B         3e-5 ~ 1e-4
XL            ~100B+       1e-5 ~ 3e-5

Larger models have more parameters and more complex loss surfaces. Smaller steps reduce the risk of destabilizing them.

17.7.3 Weight Decay

Weight decay shrinks every weight toward zero on each update. In its classic L2-regularization form it is folded into the gradient:

θ = θ - lr × (gradient + weight_decay × θ)

(AdamW instead applies the decay term directly to the weight, separately from the adaptive gradient; see 17.5.3.) Either way, it prevents individual parameters from growing very large and helps reduce overfitting.
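The pull toward zero is easy to see in isolation: with a zero gradient, the decay term alone shrinks a weight geometrically (a toy sketch):

```python
theta, lr, wd = 1.0, 1e-3, 0.1

for _ in range(10000):
    # zero gradient: only the decay term acts, so theta *= (1 - lr * wd)
    theta = theta - lr * wd * theta

print(theta)  # shrinks to roughly 1/e ≈ 0.37 after 10,000 steps
```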

Common values: 0.01 to 0.1. The AdamW default in most LLM training setups is 0.1.


17.8 Practical Guidance

17.8.1 How to Choose a Learning Rate

  1. Start from defaults:

    • Adam/AdamW on Transformers: 3e-4 is a reasonable first guess
    • Larger models: 1e-4 or smaller
  2. Read the training curve:

    • Loss decreasing steadily → learning rate is good
    • Loss oscillating → learning rate is too large
    • Loss barely moving → learning rate is too small
  3. Use a learning rate finder:

    • Start very small, increase exponentially over a few hundred steps
    • Find the point where loss decreases fastest
    • Use a value slightly below that
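The sweep can be sketched on the same toy loss f(x) = x² (function names and the candidate range are illustrative, not a real finder implementation):

```python
def improvement_after_one_step(lr, x0=5.0):
    # take one gradient step on f(x) = x^2 and measure the loss drop
    x = x0 - lr * (2 * x0)
    return x0 ** 2 - x ** 2

# exponentially spaced candidates from 1e-4 up to 1
candidates = [10 ** (-4 + i / 4) for i in range(17)]
drops = [improvement_after_one_step(lr) for lr in candidates]
best = candidates[drops.index(max(drops))]
print(best)  # the fastest-improving candidate; use a value slightly below it
```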

17.8.2 Common Diagnostics

Symptom                     Likely cause                         Fix
Loss not decreasing         LR too small                         increase LR
Loss oscillating sharply    LR too large                         reduce LR
Loss becomes NaN            LR too large or numerical overflow   reduce LR, check data
Loss falls then rises       overfitting or LR not decaying       add decay, add regularization
17.8.3 A Reference Configuration

Putting the pieces together (get_cosine_schedule_with_warmup comes from the Hugging Face transformers library):

from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # initial learning rate
    betas=(0.9, 0.95),    # Adam momentum parameters
    weight_decay=0.1      # weight decay
)

total_steps = 100000
warmup_steps = 2000

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

Using betas=(0.9, 0.95) instead of the Adam default (0.9, 0.999) is a common choice for language model training (GPT-3 used it, and Karpathy's nanoGPT popularized it as a default): the faster second-moment decay lets the optimizer respond more quickly to recent gradient magnitudes.


17.9 Chapter Summary

17.9.1 Core Formula

new_weight = old_weight - learning_rate × gradient
θ_new = θ_old - lr × ∂Loss/∂θ

17.9.2 Learning Rate Effects

Learning rate   Effect
Too large       oscillation, divergence
Too small       slow convergence, may get stuck
Well chosen     stable, efficient convergence
17.9.3 Recommended Defaults

Component      Recommendation
Optimizer      AdamW
Initial LR     1e-4 ~ 3e-4
Weight Decay   0.01 ~ 0.1
Schedule       Warmup + Cosine Decay
Warmup steps   1-5% of total steps

17.9.4 Core Insight

Learning rate is probably the most important training hyperparameter. It controls the size of every parameter update. Too large causes oscillation; too small causes sluggish convergence. Pair it with warmup and cosine decay for stable, efficient runs. For large Transformer models, AdamW with lr ≈ 3e-4 and weight_decay ≈ 0.1 is a solid default.


Chapter Checklist

After this chapter you should be able to:

  • State the weight update formula and explain what each term does.
  • Describe the symptoms of a too-large and too-small learning rate.
  • Explain why AdamW is preferred over vanilla Adam for Transformer training.
  • Explain warmup and cosine decay and when each applies.

Part 4 Summary

You have completed Part 4: Complete Architecture.

Chapter   Topic                               Core idea
Ch. 13    Residual connections and Dropout    gradient highways, overfitting prevention
Ch. 14    Embeddings + positional encoding    sum, not concatenate
Ch. 15    Full forward pass                   text to probabilities, shape tracking
Ch. 16    Training vs inference               parallel training, serial inference
Ch. 17    Learning rate                       step size control and scheduling

You now understand how a Transformer works end to end — every piece and how they connect.


See You in the Next Chapter

That is enough theory for now. If you can explain warmup + cosine decay without looking at the chart, you are ready for Part 5.

Part 5 is where we write code. Chapter 18 defines the model in PyTorch, from scratch, one class at a time. Chapter 19 trains it. Chapter 20 runs inference. By the end, you will have a working GPT in under 400 lines of code.

Cite this page
Zhang, Wayland (2026). Chapter 17: Learning Rate - The Key to Training Stability. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-17-learning-rate
@incollection{zhang2026transformer_chapter_17_learning_rate,
  author = {Zhang, Wayland},
  title = {Chapter 17: Learning Rate - The Key to Training Stability},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-17-learning-rate}
}