One-sentence summary: Scaling laws are not prophecy, but they give engineers a useful way to estimate model size, data, compute, and cost.


A.1 Why Estimate?

You do not need perfect numbers to make better decisions. Rough estimates tell you whether an idea is laptop-scale, single-GPU-scale, cluster-scale, or frontier-lab-scale.

Scaling estimates connect parameters, data, compute, and cost.

A.2 Parameter Count

A rough dense Transformer estimate:

parameters ~= layers x hidden_width^2 x constant

The constant depends on architecture details: attention projections, FFN expansion, embeddings, and the output head.
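The estimate above can be sketched in a few lines. This is a rough counting exercise, not an exact formula: the `ffn_mult=4` expansion factor and the `vocab` size are illustrative assumptions, and real architectures add terms (biases, norms, positional embeddings) that this ignores.

```python
def estimate_params(layers, hidden, vocab=50_000, ffn_mult=4):
    """Rough dense Transformer parameter count (illustrative constants)."""
    attn = 4 * hidden * hidden            # Q, K, V, and output projections
    ffn = 2 * ffn_mult * hidden * hidden  # FFN up- and down-projections
    per_layer = attn + ffn
    embeddings = vocab * hidden           # input embeddings (often tied to the output head)
    return layers * per_layer + embeddings

# A GPT-2-small-like shape: 12 layers, width 768 -> on the order of 10^8 parameters
print(f"{estimate_params(12, 768):,}")
```

Note that for small widths the embedding term dominates; as `hidden` grows, the `layers x hidden^2` term takes over, which is why the one-line estimate works at scale.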

A.3 Training Compute

A common rough estimate:

training FLOPs ~= 6 x parameters x training_tokens

It is not exact, but it is useful for scale intuition.
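As a sketch, with a hypothetical run (a 7e9-parameter model on 2e12 tokens; both numbers are made up for illustration):

```python
def training_flops(params, tokens):
    # C ~= 6 * N * D: roughly 2 FLOPs/param on the forward pass
    # plus ~4 FLOPs/param on the backward pass, per training token
    return 6 * params * tokens

flops = training_flops(7e9, 2e12)
print(f"{flops:.2e} FLOPs")  # 8.40e+22
```

The useful part is the exponent, not the digits: knowing a run is ~1e23 FLOPs rather than ~1e20 changes which hardware tier you are even discussing.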

A.4 Cost

Cost depends on:

  • GPU type
  • utilization
  • run length
  • failed runs
  • engineering time
  • data pipeline
  • evaluation

If someone quotes only "GPU hours", remember the hidden costs.
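Converting FLOPs to money makes the hidden costs concrete. A minimal sketch, assuming a hypothetical 1 PFLOP/s GPU, 40% utilization, and a flat overhead multiplier standing in for failed runs, evaluation, and pipeline work (all of these numbers are assumptions, not quotes):

```python
def gpu_hours(total_flops, peak_flops_per_gpu, utilization):
    """Convert training FLOPs to GPU-hours at an assumed utilization."""
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 3600

def total_cost(hours, price_per_gpu_hour, overhead_mult=1.5):
    """overhead_mult loosely covers failed runs, evals, and data pipeline work."""
    return hours * price_per_gpu_hour * overhead_mult

# Hypothetical: 8.4e22 FLOPs, 1 PFLOP/s peak, 40% utilization, $2/GPU-hour
hours = gpu_hours(8.4e22, 1e15, 0.4)
print(round(hours), "GPU-hours, ~$", round(total_cost(hours, 2.0)))
```

Halving utilization doubles the bill, which is why utilization belongs on the cost list alongside GPU type.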

A.5 Quick Table

model scale              rough deployment feeling
millions                 educational
billions                 practical small models
tens of billions         serious serving
hundreds of billions+    frontier-scale systems

A.6 Wayland Note

Back-of-the-envelope math is not about pretending to know exact budgets. It is about catching nonsense early.


Checklist

  • Estimate training FLOPs from parameters and tokens.
  • Explain why GPU price alone is not total cost.
  • Separate rough planning numbers from exact benchmark claims.
Cite this page
Zhang, Wayland (2026). Appendix A: Scaling Laws and Compute Estimates. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/appendix-a-scaling-laws-compute
@incollection{zhang2026transformer_appendix_a_scaling_laws_compute,
  author = {Zhang, Wayland},
  title = {Appendix A: Scaling Laws and Compute Estimates},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/appendix-a-scaling-laws-compute}
}