Multi-GPU Training & DeepSpeed ZeRO
Training LLMs on a single GPU quickly runs into a VRAM ceiling. Hugging Face Accelerate scales the same training script across multiple GPUs and machines, and DeepSpeed ZeRO Stage 3 shards optimizer state, gradients, AND parameters across GPUs, enabling training of models much larger than any single GPU's VRAM.
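To see why Stage 3 helps, it is worth doing the rough memory arithmetic from the ZeRO paper: with Adam in mixed precision, each parameter costs about 16 bytes of model state (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights plus Adam moments), and Stage 3 divides that total across GPUs. A minimal sketch of the estimate (activations and fragmentation are ignored, so treat these numbers as a lower bound):

def zero3_model_state_gb(num_params: float, num_gpus: int) -> float:
    # Standard ZeRO approximation: 2 (bf16 params) + 2 (grads)
    # + 12 (fp32 master weights + Adam moments) bytes per parameter,
    # sharded evenly across GPUs under Stage 3.
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_gpus / 1e9

print(zero3_model_state_gb(7e9, 1))  # ~112 GB -- far beyond a single 80GB A100
print(zero3_model_state_gb(7e9, 8))  # ~14 GB per GPU -- comfortably fits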
Multi-GPU Training with Accelerate and DeepSpeed
from accelerate import Accelerator
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
    log_with="wandb",
)

model = nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# train_loader = DataLoader(...)  # your dataset

# Accelerate wraps everything -- handles multi-GPU placement automatically
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
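Once everything has gone through accelerator.prepare(), the training loop needs only two changes: wrap each step in accelerator.accumulate() so gradient accumulation is handled for you, and call accelerator.backward() instead of loss.backward(). A minimal sketch, assuming train_loader yields (inputs, labels) batches matching the toy model above:

criterion = nn.CrossEntropyLoss()
for inputs, labels in train_loader:
    with accelerator.accumulate(model):  # skips gradient sync on intermediate steps
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        accelerator.backward(loss)  # use instead of loss.backward(); handles scaling/sync
        optimizer.step()
        optimizer.zero_grad()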
# ZeRO Stage 3 config: shards optimizer state + gradients + parameters across GPUs
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
}
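One way to hand this config to Accelerate, rather than maintaining a separate JSON file by hand, is to dump the dict and wire it in through DeepSpeedPlugin. A sketch (the ds_config.json name matches the deepspeed launch command below):

import json
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Write the config out so the launch commands below can reference it
with open("ds_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

# Point Accelerate at the DeepSpeed config
ds_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

Note that the CPU offload entries trade GPU memory for PCIe transfer time; they are worth enabling only when the model states genuinely do not fit across the available GPUs.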
# Launch: accelerate launch --num_processes 4 train.py
# With DeepSpeed: deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
training_costs = {
    "Google Colab A100 (Pro)": "$10/mo -- can fine-tune 7B model",
    "RunPod A100 80GB": "$1.99/hr -- fine-tune 13B easily",
    "Lambda Labs A10G": "$0.60/hr -- fine-tune 7B with QLoRA",
    "Vast.ai (spot H100)": "$1.50/hr -- cheapest H100 option",
}

print("Training cost options:")
for option, cost in training_costs.items():
    print(f"  {option:40s}: {cost}")
Tip
Practice multi-GPU training and DeepSpeed ZeRO in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of multi-GPU training with DeepSpeed ZeRO from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with multi-GPU training and DeepSpeed ZeRO is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.