Multi-GPU Training & DeepSpeed ZeRO
Training LLMs on a single GPU quickly runs into a VRAM ceiling. Hugging Face Accelerate scales the same training script across multiple GPUs and machines, and DeepSpeed ZeRO Stage 3 shards optimizer state, gradients, AND parameters across GPUs, enabling training of models much larger than any single GPU's VRAM.
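To see why Stage 3 helps, it is worth doing the rough memory arithmetic from the ZeRO paper: with Adam in mixed precision, each parameter costs about 16 bytes of model state (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights plus Adam moments), and Stage 3 divides that total across GPUs. A minimal sketch of the estimate (activations and fragmentation are ignored, so treat these numbers as a lower bound):

def zero3_model_state_gb(num_params: float, num_gpus: int) -> float:
    # Standard ZeRO approximation: 2 (bf16 params) + 2 (grads)
    # + 12 (fp32 master weights + Adam moments) bytes per parameter,
    # sharded evenly across GPUs under Stage 3.
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_gpus / 1e9

print(zero3_model_state_gb(7e9, 1))  # ~112 GB -- far beyond a single 80GB A100
print(zero3_model_state_gb(7e9, 8))  # ~14 GB per GPU -- comfortably fits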
Multi-GPU Training with Accelerate and DeepSpeed
from accelerate import Accelerator
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
    log_with="wandb",
)

model = nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# train_loader = DataLoader(...)  # your dataset

# Accelerate wraps everything -- handles multi-GPU placement automatically
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
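Once everything has gone through accelerator.prepare(), the training loop needs only two changes: wrap each step in accelerator.accumulate() so gradient accumulation is handled for you, and call accelerator.backward() instead of loss.backward(). A minimal sketch, assuming train_loader yields (inputs, labels) batches matching the toy model above:

criterion = nn.CrossEntropyLoss()
for inputs, labels in train_loader:
    with accelerator.accumulate(model):  # skips gradient sync on intermediate steps
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        accelerator.backward(loss)  # use instead of loss.backward(); handles scaling/sync
        optimizer.step()
        optimizer.zero_grad()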
# ZeRO Stage 3 config: shards optimizer state + gradients + parameters across GPUs
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
}
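One way to hand this config to Accelerate, rather than maintaining a separate JSON file by hand, is to dump the dict and wire it in through DeepSpeedPlugin. A sketch (the ds_config.json name matches the deepspeed launch command below):

import json
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Write the config out so the launch commands below can reference it
with open("ds_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

# Point Accelerate at the DeepSpeed config
ds_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

Note that the CPU offload entries trade GPU memory for PCIe transfer time; they are worth enabling only when the model states genuinely do not fit across the available GPUs.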
# Launch: accelerate launch --num_processes 4 train.py
# With DeepSpeed: deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
training_costs = {
    "Google Colab A100 (Pro)": "$10/mo -- can fine-tune 7B model",
    "RunPod A100 80GB": "$1.99/hr -- fine-tune 13B easily",
    "Lambda Labs A10G": "$0.60/hr -- fine-tune 7B with QLoRA",
    "Vast.ai (spot H100)": "$1.50/hr -- cheapest H100 option",
}

print("Training cost options:")
for option, cost in training_costs.items():
    print(f"  {option:40s}: {cost}")
Tip
Practice multi-GPU training and DeepSpeed ZeRO in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of multi-GPU training with DeepSpeed ZeRO from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with multi-GPU training and DeepSpeed ZeRO is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.