Optimizers — SGD, Adam, AdamW & Learning Rate Schedulers
The optimizer updates weights using the gradients computed during backpropagation. Adam is a solid default for most deep learning tasks, and AdamW is the standard for Transformers/LLMs. Learning rate is the most sensitive hyperparameter: too large and training diverges or oscillates, too small and it converges painfully slowly or stalls before reaching a good minimum.
Optimizers & Schedulers in Practice
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(100, 10)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# OPTIMIZERS
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 1. SGD (Stochastic Gradient Descent)
# W = W - lr * gradient
# Simple, predictable, good with momentum for image tasks (ResNet paper used this)
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
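# Illustrative sketch (not part of the configs above): one SGD-with-momentum step done
# by hand, following PyTorch's convention d_p = grad + wd*W; buf = momentum*buf + d_p;
# W = W - lr * buf. On the very first step the momentum buffer is just d_p.
w = torch.tensor([1.0, -2.0], requires_grad=True)
(w ** 2).sum().backward()                  # toy loss, so grad = 2 * w
with torch.no_grad():
    d_p = w.grad + 1e-4 * w                # gradient plus weight-decay term
    buf = d_p.clone()                      # momentum buffer on step 1
    w -= 0.01 * buf                        # lr * update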
# 2. Adam (Adaptive Moment Estimation) — 2014, Kingma & Ba
# Adapts the step size per parameter using a first moment m (running mean of gradients)
# and a second moment v (running mean of squared gradients)
# m = beta1 * m + (1-beta1) * grad (exponential moving avg of gradients)
# v = beta2 * v + (1-beta2) * grad^2 (exponential moving avg of squared grads)
# W = W - lr * m_hat / (sqrt(v_hat) + eps) (bias-corrected update)
adam = optim.Adam(
    model.parameters(),
    lr=1e-3,              # default — usually a good starting point
    betas=(0.9, 0.999),   # momentum and RMS decay coefficients
    eps=1e-8,
)
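# Hedged sketch: one Adam update written out by hand for a single gradient, to make the
# m_hat / v_hat bias correction above concrete (g, m, v, t are local illustration names).
g = torch.tensor([0.1, -0.3])                      # pretend gradient at step t = 1
m = torch.zeros_like(g); v = torch.zeros_like(g)
beta1, beta2, lr, eps, t = 0.9, 0.999, 1e-3, 1e-8, 1
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** t)                       # bias correction (largest effect early on)
v_hat = v / (1 - beta2 ** t)
update = lr * m_hat / (v_hat.sqrt() + eps)         # amount subtracted from the weights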
# 3. AdamW — Adam + PROPER weight decay (decoupled)
# WHY AdamW over Adam: Adam applies weight decay INSIDE the adaptive update
# AdamW applies weight decay SEPARATELY — correct for L2 regularization
# USE FOR: Transformers, LLMs, BERT fine-tuning, GPT training
adamw = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,    # decoupled weight decay, regularizes against overfitting
)
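# Sketch of where the decay term enters in each optimizer (scalar toy numbers, not
# library internals):
#   Adam (coupled L2):  wd * W is added to the gradient, so it gets rescaled by the
#     adaptive 1 / (sqrt(v_hat) + eps) factor; weights with large gradient history decay less.
#   AdamW (decoupled):  W = W - lr * wd * W is applied outside the adaptive update,
#     so every weight shrinks by the same relative amount each step.
w_toy, g_toy, wd, lr_toy = 2.0, 0.5, 0.01, 1e-3
adam_style_grad = g_toy + wd * w_toy       # decay folded into the gradient (Adam)
adamw_decay_step = lr_toy * wd * w_toy     # decay applied directly to the weight (AdamW)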
# 4. Layer-wise different learning rates — important for fine-tuning
# Pre-trained transformer layers need LOWER lr to not forget what they learned
optimizer_llm = optim.AdamW([
    {"params": model.parameters(), "lr": 1e-4},  # NOTE: model is simplified here
    # In real LLM fine-tuning:
    # {"params": model.encoder.parameters(), "lr": 1e-5},     # backbone: very low lr
    # {"params": model.classifier.parameters(), "lr": 1e-3},  # head: normal lr
])
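# Hedged sketch of real per-layer groups on a toy model ("backbone" and "head" are
# hypothetical attribute names; substitute whatever your model actually exposes):
class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(100, 50)   # stands in for a pre-trained encoder
        self.head = nn.Linear(50, 10)        # freshly initialized task head

toy = ToyClassifier()
finetune_opt = optim.AdamW([
    {"params": toy.backbone.parameters(), "lr": 1e-5},  # low lr: preserve pre-training
    {"params": toy.head.parameters(), "lr": 1e-3},      # higher lr: learn the new head
])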
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LEARNING RATE SCHEDULERS
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 1. Cosine Annealing — most popular for Transformers
# lr starts at max, smoothly decays to near-zero following cosine curve
cosine_scheduler = optim.lr_scheduler.CosineAnnealingLR(
    adamw, T_max=100, eta_min=1e-6
)
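# Quick check (illustrative): the LR follows a cosine curve over T_max scheduler steps.
# A throwaway optimizer/scheduler pair is used here so the ones above stay untouched.
demo_opt = optim.AdamW(model.parameters(), lr=1e-4)
demo_sched = optim.lr_scheduler.CosineAnnealingLR(demo_opt, T_max=100, eta_min=1e-6)
for _ in range(5):
    # print(demo_sched.get_last_lr()[0])     # 1e-4 smoothly decaying toward 1e-6
    demo_opt.step()                          # PyTorch order: optimizer first...
    demo_sched.step()                        # ...then scheduler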
# 2. Linear Warmup + Cosine Decay (BERT, GPT standard)
# Critical: LLMs need warmup because Adam's moment estimates are noisy in the first steps,
# so large early updates can destabilize training
import math
from torch.optim.lr_scheduler import LambdaLR

def get_linear_warmup_cosine_decay(optimizer, warmup_steps: int, total_steps: int):
    """The standard LR schedule for training Transformers."""
    def lr_lambda(current_step: int) -> float:
        if current_step < warmup_steps:
            return current_step / warmup_steps  # linear warmup: 0 → 1
        progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay: 1 → 0
    return LambdaLR(optimizer, lr_lambda)
scheduler = get_linear_warmup_cosine_decay(adamw, warmup_steps=1000, total_steps=10000)
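# Sanity check (sketch): lr_lambda is a multiplier on the optimizer's base lr (1e-4 for
# adamw above), so the effective lr ramps from 0 to 1e-4 over the first 1000 steps and
# then decays toward 0 by step 10000.
# for step in range(10_000):
#     scheduler.step()
#     if step in (0, 999, 5_000, 9_999):
#         print(step, scheduler.get_last_lr()[0])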
# In training loop:
# optimizer.zero_grad()
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # gradient clipping
# optimizer.step()
# scheduler.step() # update LR
# GRADIENT CLIPPING — essential for RNNs and Transformers
# Prevents exploding gradients by scaling down if norm > max_norm
def training_step(model, optimizer, scheduler, X, y, loss_fn, max_grad_norm=1.0):
    optimizer.zero_grad()
    pred = model(X)
    loss = loss_fn(pred, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    scheduler.step()
    return loss.item()
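# Example usage (sketch) with dummy data matching the nn.Linear(100, 10) model above:
X_batch = torch.randn(32, 100)                     # batch of 32 feature vectors
y_batch = torch.randint(0, 10, (32,))              # integer class labels
loss_value = training_step(
    model, adamw, scheduler, X_batch, y_batch,
    loss_fn=nn.CrossEntropyLoss(), max_grad_norm=1.0,
)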
Tip
Practice optimizers and learning rate schedulers in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of an SGD/Adam/AdamW optimizer with a learning rate scheduler from scratch, without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with optimizers and learning rate schedulers is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.