Scaling Laws — Why Bigger Models Work Better
The Chinchilla paper (Hoffmann et al., 2022) showed that scaling model size alone, as GPT-3 did, is suboptimal: for a fixed compute budget you should scale data AND parameters together, at a ratio of roughly 20 training tokens per parameter. For example, a 70B-parameter model is compute-optimal at about 1.4T training tokens. Understanding scaling laws tells you how to allocate a training compute budget most efficiently.
Scaling Laws & Compute Allocation
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# SCALING LAWS (Kaplan et al. 2020 + Chinchilla 2022)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Kaplan power laws (each factor varied alone, with the others unconstrained):
#   L(N) ∝ N^(-0.076),  L(D) ∝ D^(-0.095),  L(C_min) ∝ C^(-0.050)
# N = model parameters, D = training tokens, C = compute FLOPs
# Kaplan (OpenAI) finding: Scale N (params) more than D (tokens)
# → Led to GPT-3: 175B params trained on 300B tokens (much undertrained)
# Chinchilla (Google DeepMind) finding:
# For a given compute budget C, the optimal strategy is:
# N_opt ∝ C^0.5 (scale params with sqrt of compute)
# D_opt ∝ C^0.5 (scale data with sqrt of compute)
# RULE: ~20 tokens per parameter for optimal training
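# --- Illustrative sketch, not code from either paper ---
# Assuming the common approximation C ≈ 6·N·D training FLOPs plus the
# Chinchilla rule D ≈ 20·N, we get C ≈ 120·N², so:
#   N_opt ≈ sqrt(C / 120),  D_opt ≈ 20 · N_opt
def chinchilla_optimal(compute_flops):
    """Rough compute-optimal (params, tokens) split for a FLOP budget."""
    n_opt = (compute_flops / 120) ** 0.5  # parameters
    d_opt = 20 * n_opt                    # training tokens
    return n_opt, d_opt

# Example: ~5.9e23 FLOPs (≈ 6 × 70B × 1.4T, roughly Chinchilla's budget)
n_opt, d_opt = chinchilla_optimal(5.9e23)
print(f"Compute-optimal: ~{n_opt / 1e9:.0f}B params, ~{d_opt / 1e12:.1f}T tokens")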
scaling_examples = {
    "GPT-3 (2020)": {
        "params": "175B",
        "training_tokens": "300B",
        "tokens_per_param": 300 / 175,
        "note": "Kaplan scaling — undertrained by Chinchilla standards",
    },
    "Chinchilla (2022)": {
        "params": "70B",
        "training_tokens": "1.4T",
        "tokens_per_param": 1400 / 70,
        "note": "Compute-optimal by the Chinchilla law — beats Gopher (280B)",
    },
    "LLaMA 3 8B (2024)": {
        "params": "8B",
        "training_tokens": "15T",
        "tokens_per_param": 15000 / 8,
        "note": "1875 tokens/param — trained far past Chinchilla-optimal for inference efficiency",
    },
    "LLaMA 3 70B (2024)": {
        "params": "70B",
        "training_tokens": "15T",
        "tokens_per_param": 15000 / 70,
        "note": "~214 tokens/param — approaches GPT-4-level quality as an open-weights model",
    },
}
print("Scaling Law Examples:")
for name, stats in scaling_examples.items():
    tpp = stats.get("tokens_per_param", "?")
    if isinstance(tpp, float):
        tpp = f"{tpp:.1f}"
    print(f" {name:25s} | {stats['params']:6s} params | {stats['training_tokens']:5s} tokens | {tpp} tok/param")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# EMERGENT ABILITIES — capabilities that appear at scale
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Some capabilities are NOT present in small models but suddenly
# appear when crossing a parameter threshold — "emergent behaviors"
emergent_abilities = {
    "In-context learning": "~10B params — learn from examples in the prompt without weight updates",
    "Chain-of-thought": "~100B params — reason step-by-step when prompted 'think step by step'",
    "Instruction following": ">1B params with RLHF — reliably follow natural language instructions",
    "Code generation": ">20B params — write correct, runnable code from natural language",
    "Multi-step math": ">100B params — solve multi-step arithmetic reliably",
    "Calibration": ">10B params — know when they don't know something (uncertainty)",
}
print("\nEmergent Abilities at Scale:")
for ability, threshold in emergent_abilities.items():
print(f" {ability:30s}: {threshold}")Tip
Tip
Practice these scaling-law calculations in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of the scaling-law calculations above from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with scaling-law calculations is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code, as in the sketch below.
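As a minimal sketch of that kind of validation (the helper name is illustrative, not part of the lesson code):

def safe_tokens_per_param(params, training_tokens):
    """Return tokens/param, or None if the inputs are missing or invalid."""
    if params in (None, 0) or training_tokens is None:
        return None  # guard against empty, zero, or null inputs
    return training_tokens / params

print(safe_tokens_per_param(70e9, 1.4e12))   # 20.0
print(safe_tokens_per_param(None, 1.4e12))   # None (invalid input handled)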