RL Foundations — MDP, Rewards, and Policies
Reinforcement Learning is a learning paradigm where an agent learns by interacting with an environment: taking actions, receiving rewards, and updating its policy to maximize cumulative reward. Unlike supervised learning, there are no labeled examples; feedback comes from the environment through trial and error.
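The core interaction loop is the same no matter which algorithm sits on top of it. Here is a minimal sketch using a random policy on Gymnasium's FrozenLake (the same environment trained on below); the Q-learning agent in the next section simply replaces the random choice with a learned one.
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random stand-in for a learned policy pi(a|s)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward              # cumulative reward (the return with gamma = 1)
    done = terminated or truncated
print(f"Episode return: {total_reward}")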
MDP and Q-Learning from Scratch
import numpy as np
import gymnasium as gym  # maintained successor of OpenAI Gym; matches the reset()/step() API used below
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# KEY RL CONCEPTS
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
rl_terminology = {
    "Agent": "The learner/decision-maker (AI player)",
    "Environment": "What the agent interacts with (game, robot, simulator)",
    "State (s)": "Current situation description (pixels, sensor readings)",
    "Action (a)": "What the agent can do (move left, press button, turn)",
    "Reward (r)": "Scalar feedback signal (goal: maximize cumulative reward)",
    "Policy (pi)": "Strategy: maps states to actions. pi(a|s) = P(action | state)",
    "Value V(s)": "Expected total future reward from state s under policy pi",
    "Q-value Q(s,a)": "Expected future reward of taking action a from state s, then following pi",
    "Episode": "One complete sequence from start state to terminal state",
    "Return Gt": "Discounted sum of future rewards: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...",
    "gamma": "Discount factor (0-1): how much to value future rewards vs immediate rewards",
}
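# Sanity-check the "Return Gt" definition above with illustrative numbers
# (not from any environment): G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
rewards = [1.0, 0.0, 0.0, 10.0]
gamma_demo = 0.99
G_front = sum(gamma_demo**i * r for i, r in enumerate(rewards))
G_back = 0.0
for r in reversed(rewards):              # recursive form: G_t = r_t + gamma * G_{t+1}
    G_back = r + gamma_demo * G_back
assert abs(G_front - G_back) < 1e-9      # both equal 1 + 0.99**3 * 10 ~= 10.703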
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Q-LEARNING -- tabular (for small state/action spaces)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class QLearningAgent:
    '''
    Q-Learning: model-free, off-policy tabular RL.
    Learns Q(s,a) for every state-action pair.
    Works for small discrete state+action spaces.
    '''
    def __init__(self, n_states: int, n_actions: int,
                 lr: float = 0.1, gamma: float = 0.99,
                 epsilon: float = 1.0, epsilon_decay: float = 0.995):
        self.q_table = np.zeros((n_states, n_actions))  # Q(s, a) initialized to 0
        self.lr = lr                    # learning rate alpha
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration probability
        self.epsilon_decay = epsilon_decay
        self.n_actions = n_actions

    def choose_action(self, state: int) -> int:
        '''Epsilon-greedy: explore randomly or exploit best known action.'''
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # explore
        return int(self.q_table[state].argmax())      # exploit

    def update(self, s: int, a: int, r: float, s_next: int, done: bool) -> None:
        '''TD update toward the Bellman target: r + gamma * max_a' Q(s', a').'''
        target = r + (0 if done else self.gamma * self.q_table[s_next].max())
        td_error = target - self.q_table[s, a]    # temporal-difference error
        self.q_table[s, a] += self.lr * td_error  # move Q(s,a) a step toward the target
        self.epsilon = max(0.01, self.epsilon * self.epsilon_decay)  # decay exploration each step
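# Quick hand-check of the update rule with the default lr=0.1 (illustrative numbers):
# Q(s,a) starts at 0; a terminal step with r=1 gives target = 1, td_error = 1 - 0 = 1,
# so the new Q(s,a) = 0 + 0.1 * 1 = 0.1.
_demo = QLearningAgent(n_states=2, n_actions=1)
_demo.update(s=0, a=0, r=1.0, s_next=1, done=True)
assert abs(_demo.q_table[0, 0] - 0.1) < 1e-9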
# Train on FrozenLake (4x4 grid: reach the goal without falling into a hole)
env = gym.make("FrozenLake-v1", is_slippery=False)
agent = QLearningAgent(n_states=16, n_actions=4)
for episode in range(5000):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    if (episode + 1) % 1000 == 0:
        print(f"Episode {episode+1} | Epsilon: {agent.epsilon:.3f}")
# Test the learned policy (greedy, no exploration)
success = 0
for _ in range(100):
    state, _ = env.reset()
    done = False
    while not done:
        action = int(agent.q_table[state].argmax())
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    success += int(reward)
print(f"Success rate: {success}/100")  # should be close to 100% on the non-slippery map
Tip
Practice these RL foundations (MDPs, rewards, and policies) in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working tabular Q-learning example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake when implementing these RL foundations is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.