PPO — Proximal Policy Optimization
PPO (Schulman et al., 2017) is one of the most widely used deep RL algorithms: it trains on-policy (learning only from transitions collected by the current policy), uses a clipped surrogate objective to prevent destructively large policy updates, and adds an entropy bonus to encourage exploration. Used for: RLHF (e.g., ChatGPT), robotics, game playing, autonomous driving.
PPO Implementation with Stable-Baselines3
import gymnasium as gym  # SB3 >= 2.0 targets Gymnasium, the maintained successor to gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import SubprocVecEnv
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# PPO -- conceptual understanding
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Objective: maximize expected return while staying close to old policy
#
# ratio = pi_new(a|s) / pi_old(a|s) (how much policy changed)
# advantage = Q(s,a) - V(s) (how much better than average was this action?)
#
# Clipped objective:
# L = min(ratio * advantage, clip(ratio, 1-epsilon, 1+epsilon) * advantage)
# clip prevents ratio from going too far from 1.0 (conservative updates)
# epsilon=0.2: the ratio is clipped to [0.8, 1.2], so moving the policy
# further than that in a single update earns no additional objective gain
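# --- Illustrative sketch of the clipped surrogate loss in PyTorch.
# (Assumption: function and argument names are ours for illustration; this
# is not SB3's internal code, just the math above made executable.)
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)                   # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return -torch.min(unclipped, clipped).mean()                     # negate: optimizers minimize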
# ACTOR-CRITIC: PPO has two heads on same network:
# - Actor (policy head): outputs action probabilities
# - Critic (value head): estimates V(s) for advantage computation
# A shared backbone (CNN for pixels, MLP for vectors) lets most parameters
# serve both heads; the exact split depends on the chosen architecture
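# --- Illustrative sketch of a two-headed actor-critic network in PyTorch.
# (Assumption: a minimal architecture of our own for clarity; SB3's actual
# policy classes are more configurable and their defaults differ.)
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())  # shared
        self.actor = nn.Linear(hidden, n_actions)   # policy head: action logits
        self.critic = nn.Linear(hidden, 1)           # value head: V(s)

    def forward(self, obs):
        h = self.backbone(obs)
        return self.actor(h), self.critic(h)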
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# STABLE-BASELINES3 -- production RL library
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Vectorized environments: train on 8 envs in parallel for higher throughput
n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs, vec_env_cls=SubprocVecEnv)
eval_env = Monitor(gym.make("CartPole-v1"))
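# Optional: VecNormalize standardizes observations and rewards, which often
# stabilizes training on harder (especially continuous-control) tasks.
# Not needed for CartPole, so shown here commented out:
# from stable_baselines3.common.vec_env import VecNormalize
# env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)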
model = PPO(
    "MlpPolicy",          # MlpPolicy: MLP for vector obs; CnnPolicy: CNN for pixels
    env,
    n_steps=2048,         # rollout steps collected per environment before each update
    batch_size=64,        # mini-batch size for the gradient steps
    n_epochs=10,          # passes over the rollout buffer per update
    learning_rate=3e-4,
    gamma=0.99,           # discount factor
    gae_lambda=0.95,      # GAE: generalized advantage estimation smoothing
    clip_range=0.2,       # epsilon for the clipped ratio
    ent_coef=0.01,        # entropy coefficient (encourages exploration)
    vf_coef=0.5,          # value function loss coefficient
    max_grad_norm=0.5,    # gradient clipping threshold
    verbose=1,
    tensorboard_log="./ppo_cartpole_tb",
)
# Callbacks
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model",
    eval_freq=5000,        # evaluate every 5000 env.step() calls (per env when vectorized)
    n_eval_episodes=10,
    deterministic=True,    # act greedily during evaluation
)
model.learn(
    total_timesteps=200_000,
    callback=eval_callback,
)
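# Evaluate and persist the trained policy. evaluate_policy is SB3's built-in
# evaluation helper; the save path "ppo_cartpole" is an arbitrary choice.
from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
model.save("ppo_cartpole")            # writes ppo_cartpole.zip
# model = PPO.load("ppo_cartpole")    # reload later for inference or more training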
# Algorithm comparison:
algos = {
    "DQN": {"type": "Off-policy", "spaces": "Discrete actions only", "best_for": "Atari, simple games"},
    "PPO": {"type": "On-policy", "spaces": "Discrete + Continuous", "best_for": "Most tasks, stable training"},
    "A2C": {"type": "On-policy", "spaces": "Discrete + Continuous", "best_for": "Fast training, parallel envs"},
    "SAC": {"type": "Off-policy", "spaces": "Continuous only", "best_for": "Robotics, MuJoCo, sample efficient"},
    "TD3": {"type": "Off-policy", "spaces": "Continuous only", "best_for": "Deterministic continuous control"},
    "DDPG": {"type": "Off-policy", "spaces": "Continuous only", "best_for": "Baseline for continuous control"},
}
print("RL Algorithm Comparison:")
for name, info in algos.items():
print(f" {name:6s} ({info['type']:11s}) | {info['spaces']:25s} | {info['best_for']}")Tip