One-sentence summary: Scaling laws are not prophecy, but they give engineers a useful way to estimate model size, data, compute, and cost.
A.1 Why Estimate?
You do not need perfect numbers to make better decisions. Rough estimates tell you whether an idea is laptop-scale, single-GPU-scale, cluster-scale, or frontier-lab-scale.
A.2 Parameter Count
A rough dense Transformer estimate:
parameters ~= layers x hidden_width^2 x constant
The constant depends on architecture details: attention projections, FFN expansion factor, embeddings, and the output head. For a standard dense Transformer with a 4x FFN expansion, the per-layer constant works out to roughly 12 (4 x hidden_width^2 for the Q/K/V/output projections plus 8 x hidden_width^2 for the FFN), with embeddings added on top.
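The estimate above can be sketched as a small function. The vocabulary size and FFN expansion factor below are illustrative defaults, not fixed properties of any particular model:

```python
def estimate_params(layers: int, d_model: int,
                    vocab: int = 50_000, ffn_mult: int = 4) -> int:
    """Rough dense-Transformer parameter count (ignores biases, norms)."""
    attn = 4 * d_model ** 2               # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model ** 2     # up-projection + down-projection
    embeddings = vocab * d_model          # token embedding table
    return layers * (attn + ffn) + embeddings

# A GPT-2-small-shaped model (12 layers, width 768) lands near 124M:
print(f"{estimate_params(12, 768):,}")  # -> 123,334,656
```

The point is not the exact number: dropping biases, layer norms, or weight tying shifts the total by a few percent, which does not change which hardware tier you are in.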
A.3 Training Compute
A common rough estimate:
training FLOPs ~= 6 x parameters x training_tokens
The factor of 6 comes from roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass. It is not exact, but it is useful for scale intuition.
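As a worked example, the rule of thumb is a one-liner. The 7B-parameter, 2T-token figures below are chosen only to illustrate the arithmetic:

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

# A hypothetical 7B-parameter model trained on 2T tokens:
flops = training_flops(7e9, 2e12)
print(f"{flops:.2e}")  # -> 8.40e+22
```

Comparing the result against a GPU's sustained throughput immediately tells you whether a run is hours, weeks, or out of reach.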
A.4 Cost
Cost depends on:
- GPU type
- utilization
- run length
- failed runs
- engineering time
- data pipeline
- evaluation
If someone quotes only "GPU hours", remember the hidden costs.
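The GPU-hour portion of the bill can at least be sketched from the FLOP estimate. All inputs below (per-GPU throughput, utilization, hourly price) are hypothetical placeholders; real numbers vary widely by hardware, contract, and how well the run is tuned:

```python
def gpu_hours_and_cost(total_flops: float,
                       flops_per_gpu_per_s: float,
                       utilization: float,
                       price_per_gpu_hour: float) -> tuple[float, float]:
    """Convert a FLOP budget into GPU-hours and raw compute cost.

    Utilization is the fraction of peak throughput actually achieved;
    real training runs often sit well below 1.0.
    """
    seconds = total_flops / (flops_per_gpu_per_s * utilization)
    hours = seconds / 3600.0
    return hours, hours * price_per_gpu_hour

# Hypothetical: 8.4e22 FLOPs, 1 PFLOP/s peak per GPU, 40% utilization, $2/hr.
hours, cost = gpu_hours_and_cost(8.4e22, 1e15, 0.4, 2.0)
print(f"{hours:,.0f} GPU-hours, ${cost:,.0f}")
```

Note what this does not include: failed runs, engineering salaries, data work, and evaluation, which is exactly why a GPU-hour quote alone understates the total.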
A.5 Quick Table
| model scale | rough deployment feeling |
|---|---|
| millions | educational |
| billions | practical small models |
| tens of billions | serious serving |
| hundreds of billions+ | frontier-scale systems |
A.6 Wayland Note
Back-of-the-envelope math is not about pretending to know exact budgets. It is about catching nonsense early.
Checklist
- Estimate training FLOPs from parameters and tokens.
- Explain why GPU price alone is not total cost.
- Separate rough planning numbers from exact benchmark claims.