RL Foundations — MDP, Rewards, and Policies
Reinforcement Learning is a learning paradigm where an agent learns by interacting with an environment: taking actions, receiving rewards, and updating its policy to maximize cumulative reward. Unlike supervised learning, there are no labeled examples; feedback comes from the environment through trial and error.
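The core interaction loop is the same no matter which algorithm sits on top of it. Here is a minimal sketch using a random policy on Gymnasium's FrozenLake (the same environment trained on below); the Q-learning agent in the next section simply replaces the random choice with a learned one.
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random stand-in for a learned policy pi(a|s)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward              # cumulative reward (the return with gamma = 1)
    done = terminated or truncated
print(f"Episode return: {total_reward}")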
MDP and Q-Learning from Scratch
import numpy as np
import gymnasium as gym  # maintained successor of OpenAI Gym; matches the reset()/step() API used below
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# KEY RL CONCEPTS
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
rl_terminology = {
    "Agent": "The learner/decision-maker (AI player)",
    "Environment": "What the agent interacts with (game, robot, simulator)",
    "State (s)": "Current situation description (pixels, sensor readings)",
    "Action (a)": "What the agent can do (move left, press button, turn)",
    "Reward (r)": "Scalar feedback signal (goal: maximize cumulative reward)",
    "Policy (pi)": "Strategy: maps states to actions. pi(a|s) = P(action | state)",
    "Value V(s)": "Expected total future reward from state s under policy pi",
    "Q-value Q(s,a)": "Expected future reward of taking action a from state s, then following pi",
    "Episode": "One complete sequence from start state to terminal state",
    "Return Gt": "Discounted sum of future rewards: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...",
    "gamma": "Discount factor (0-1): how much to value future rewards vs immediate rewards",
}
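# Sanity-check the "Return Gt" definition above with illustrative numbers
# (not from any environment): G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
rewards = [1.0, 0.0, 0.0, 10.0]
gamma_demo = 0.99
G_front = sum(gamma_demo**i * r for i, r in enumerate(rewards))
G_back = 0.0
for r in reversed(rewards):              # recursive form: G_t = r_t + gamma * G_{t+1}
    G_back = r + gamma_demo * G_back
assert abs(G_front - G_back) < 1e-9      # both equal 1 + 0.99**3 * 10 ~= 10.703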
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Q-LEARNING -- tabular (for small state/action spaces)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class QLearningAgent:
    '''
    Q-Learning: model-free, off-policy tabular RL.
    Learns Q(s,a) for every state-action pair.
    Works for small discrete state+action spaces.
    '''
    def __init__(self, n_states: int, n_actions: int,
                 lr: float = 0.1, gamma: float = 0.99,
                 epsilon: float = 1.0, epsilon_decay: float = 0.995):
        self.q_table = np.zeros((n_states, n_actions))  # Q(s, a) initialized to 0
        self.lr = lr                    # learning rate alpha
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration probability
        self.epsilon_decay = epsilon_decay
        self.n_actions = n_actions

    def choose_action(self, state: int) -> int:
        '''Epsilon-greedy: explore randomly or exploit best known action.'''
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # explore
        return int(self.q_table[state].argmax())      # exploit

    def update(self, s: int, a: int, r: float, s_next: int, done: bool) -> None:
        '''TD update toward the Bellman target: r + gamma * max_a' Q(s', a').'''
        target = r + (0 if done else self.gamma * self.q_table[s_next].max())
        td_error = target - self.q_table[s, a]    # temporal-difference error
        self.q_table[s, a] += self.lr * td_error  # move Q(s,a) a step toward the target
        self.epsilon = max(0.01, self.epsilon * self.epsilon_decay)  # decay exploration each step
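# Quick hand-check of the update rule with the default lr=0.1 (illustrative numbers):
# Q(s,a) starts at 0; a terminal step with r=1 gives target = 1, td_error = 1 - 0 = 1,
# so the new Q(s,a) = 0 + 0.1 * 1 = 0.1.
_demo = QLearningAgent(n_states=2, n_actions=1)
_demo.update(s=0, a=0, r=1.0, s_next=1, done=True)
assert abs(_demo.q_table[0, 0] - 0.1) < 1e-9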
# Train on FrozenLake (4x4 grid: reach the goal without falling into a hole)
env = gym.make("FrozenLake-v1", is_slippery=False)
agent = QLearningAgent(n_states=16, n_actions=4)
for episode in range(5000):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    if (episode + 1) % 1000 == 0:
        print(f"Episode {episode+1} | Epsilon: {agent.epsilon:.3f}")
# Test the learned policy (greedy, no exploration)
success = 0
for _ in range(100):
    state, _ = env.reset()
    done = False
    while not done:
        action = int(agent.q_table[state].argmax())
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    success += int(reward)
print(f"Success rate: {success}/100")  # should be close to 100% on the non-slippery map
Tip
Practice these RL foundations (MDPs, rewards, and policies) in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working tabular Q-learning example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake when implementing these RL foundations is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.