Implementing Q Learning from Scratch

Have you ever wondered how an AI agent can learn to play a game such as Snake, Pac-Man, or even chess purely by trial and error? Behind the scenes, a technique called Q-learning is often at work. Q-learning is a model-free reinforcement learning algorithm: the agent doesn't need to know the rules of the environment in advance, because it learns them from experience.

Open In Colab

AI Summary

This blog introduces Q-learning, a reinforcement learning algorithm that enables agents to learn optimal actions through trial and error. It covers the key components—states, actions, rewards, and the Q-table—and explains how the agent updates its knowledge using the Bellman equation. With a balance of exploration and exploitation, Q-learning helps agents improve their behavior over time without needing a model of the environment.

What is Q-Learning?

Q-learning is a type of reinforcement learning algorithm that helps an agent learn what action to take in each state to maximize its total reward—purely through trial and error. It doesn’t require a model of the environment, making it model-free.

Key Components in Q-Learning

  • Agent: The decision-maker (e.g., a snake in the Snake game).

  • Environment: The world the agent interacts with (e.g., a grid).

  • State \( s \): A snapshot of the environment.

  • Action \( a \): A move the agent can take.

  • Reward \( r \): Feedback received after an action.

  • Q-table \( Q(s, a) \): Stores the expected future rewards for each action in each state.
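To make these components concrete, here is a minimal sketch (separate from the full implementation later in this post) of how a Q-table can be represented in Python, with one row of Q-values per state and one entry per action. The state key here is purely illustrative:

import numpy as np
from collections import defaultdict

# Q-table as a dictionary: one row of Q-values per state, one entry per action
q_table = defaultdict(lambda: np.zeros(4))   # 4 actions: UP, DOWN, LEFT, RIGHT

state = "example-state"        # hypothetical state key, just for illustration
q_table[state][0] += 0.5       # nudge the Q-value of action 0 (UP) in this state
print(q_table[state])          # -> [0.5 0.  0.  0. ]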

How Q-Learning Works

The agent learns by updating Q-values with the following rule, which is based on the Bellman equation (a worked numeric example follows the list of symbols below):

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

Where:

  • \( \alpha \): Learning rate (how strongly new information overrides the old estimate)

  • \( \gamma \): Discount factor (importance of future rewards)

  • \( r \): Immediate reward

  • \( s' \): New state after taking action \( a \)

  • \( \max_{a'} Q(s', a') \): Best estimated future reward from the next state
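To see the update with concrete numbers, suppose \( \alpha = 0.1 \), \( \gamma = 0.9 \), the current estimate is \( Q(s, a) = 0 \), the agent receives \( r = 10 \), and the best value in the next state is \( \max_{a'} Q(s', a') = 5 \) (values chosen purely for illustration). The update then gives:

\[ Q(s, a) \leftarrow 0 + 0.1 \left[ 10 + 0.9 \times 5 - 0 \right] = 1.45 \]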

Over time, the agent balances trying unfamiliar actions (exploration) against choosing the best-known actions (exploitation), gradually improving the Q-table until it converges toward optimal behavior.
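A common way to implement this trade-off is an epsilon-greedy policy: with probability \( \epsilon \) pick a random action, otherwise pick the action with the highest Q-value. The agent implemented later in this post uses this idea; here is a minimal standalone sketch:

import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # q_values: Q-values for the current state, one per action
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore: random action
    return int(np.argmax(q_values))              # exploit: best-known action

print(epsilon_greedy(np.array([0.0, 1.5, -0.2, 0.3]), epsilon=0.1))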

Implementation of Q-Learning

1. Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
import random
from collections import defaultdict

2. Declare Constant Values

# Constants
GRID_SIZE = 10          # width/height of the square grid
EPISODES = 500          # number of training episodes
MAX_STEPS = 100         # step limit per episode
EPSILON_DECAY = 0.995   # multiplicative decay of the exploration rate per episode
MIN_EPSILON = 0.01      # floor for the exploration rate
ALPHA = 0.1             # learning rate
GAMMA = 0.9             # discount factor

# Directions; DIRECTION_VECTORS[i] is the (row, col) offset for action i
UP = 0
DOWN = 1
LEFT = 2
RIGHT = 3
DIRECTION_VECTORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

# RGB colors used when rendering the grid
RED = [255, 0, 0]     # snake body
GREEN = [0, 255, 0]   # food

3. Creating the Snake's Environment

class SnakeEnv:
    def __init__(self, size=GRID_SIZE):
        self.size = size
        self.reset()

    def reset(self):
        self.snake = [(self.size // 2, self.size // 2)]
        self.direction = random.choice([UP, DOWN, LEFT, RIGHT])
        self.place_food()
        self.done = False
        return self.get_state()

    def place_food(self):
        while True:
            self.food = (random.randint(0, self.size - 1), random.randint(0, self.size - 1))
            if self.food not in self.snake:
                break

    def get_state(self):
        # State = (current direction vector, sign of the food offset, danger flags)
        head = self.snake[0]
        dir_vector = DIRECTION_VECTORS[self.direction]
        food_dir = (np.sign(self.food[0] - head[0]), np.sign(self.food[1] - head[1]))
        danger = self.check_danger()
        return (dir_vector, food_dir, danger)

    def check_danger(self):
        # For each of the four directions, flag 1 if the next cell is a wall or the snake's body
        head = self.snake[0]
        danger = []
        for d in range(4):
            dx, dy = DIRECTION_VECTORS[d]
            nx, ny = head[0] + dx, head[1] + dy
            if (nx < 0 or nx >= self.size or ny < 0 or ny >= self.size or (nx, ny) in self.snake):
                danger.append(1)
            else:
                danger.append(0)
        return tuple(danger)

    def step(self, action):
        if self.done:
            return self.get_state(), 0, self.done

        self.direction = action
        dx, dy = DIRECTION_VECTORS[self.direction]
        head = self.snake[0]
        new_head = (head[0] + dx, head[1] + dy)

        # Hitting a wall or the snake's own body ends the episode with a penalty
        if (new_head in self.snake or
            not (0 <= new_head[0] < self.size) or
            not (0 <= new_head[1] < self.size)):
            self.done = True
            return self.get_state(), -10, True

        self.snake.insert(0, new_head)

        if new_head == self.food:
            # Eating food: keep the new head (the snake grows) and place new food
            self.place_food()
            reward = 10
        else:
            # Normal move: drop the tail and apply a small step penalty
            self.snake.pop()
            reward = -0.1

        return self.get_state(), reward, self.done

    def render(self):
        grid = np.zeros((self.size, self.size, 3), dtype=np.uint8)
        grid[:, :] = 255
        for (x, y) in self.snake:
            grid[x, y] = RED
        fx, fy = self.food
        grid[fx, fy] = GREEN
        return grid
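Before training, it can help to drive the environment with a few random actions and check that the state, reward, and done flag look sensible. This quick sanity check is not part of the training code:

# Quick sanity check of the environment with random actions
env = SnakeEnv()
state = env.reset()
for _ in range(5):
    action = random.choice([UP, DOWN, LEFT, RIGHT])
    state, reward, done = env.step(action)
    print(state, reward, done)
    if done:
        break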

4. The Q-Learning Agent

class QLearningAgent:
    def __init__(self, actions):
        # Q-table: maps a state key to an array of Q-values, one per action
        self.q_table = defaultdict(lambda: np.zeros(len(actions)))
        self.actions = actions
        self.epsilon = 1.0  # start fully exploratory; decayed during training

    def get_action(self, state):
        # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit
        state_key = str(state)
        if np.random.rand() < self.epsilon:
            return random.choice(self.actions)
        else:
            return int(np.argmax(self.q_table[state_key]))

    def learn(self, state, action, reward, next_state):
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        state_key = str(state)
        next_state_key = str(next_state)
        predict = self.q_table[state_key][action]
        target = reward + GAMMA * np.max(self.q_table[next_state_key])
        self.q_table[state_key][action] += ALPHA * (target - predict)
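To see how the agent and environment fit together, here is a single interaction step written out explicitly; the training loop in the next section simply repeats this pattern for many episodes:

env = SnakeEnv()
agent = QLearningAgent(actions=[UP, DOWN, LEFT, RIGHT])

state = env.reset()
action = agent.get_action(state)                 # epsilon-greedy action choice
next_state, reward, done = env.step(action)      # apply the action in the environment
agent.learn(state, action, reward, next_state)   # one Q-table update
print(agent.q_table[str(state)])                 # Q-values stored for the visited state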

5. Training the Agent In The Environment

env = SnakeEnv()
agent = QLearningAgent(actions=[UP, DOWN, LEFT, RIGHT])

for episode in range(EPISODES):
    state = env.reset()
    total_reward = 0
    for _ in range(MAX_STEPS):
        action = agent.get_action(state)                # epsilon-greedy action
        next_state, reward, done = env.step(action)     # interact with the environment
        agent.learn(state, action, reward, next_state)  # Q-table update
        state = next_state
        total_reward += reward
        if done:
            break
    # Decay exploration after each episode, but never below MIN_EPSILON
    agent.epsilon = max(MIN_EPSILON, agent.epsilon * EPSILON_DECAY)
    if (episode + 1) % 100 == 0:
        print(f"Episode {episode+1}, Total reward: {total_reward:.3f}, Epsilon: {agent.epsilon:.3f}")
Episode 100, Total reward: -14.200, Epsilon: 0.606
Episode 200, Total reward: -11.500, Epsilon: 0.367
Episode 300, Total reward: 8.100, Epsilon: 0.222
Episode 400, Total reward: -0.600, Epsilon: 0.135
Episode 500, Total reward: 16.900, Epsilon: 0.082
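As an optional extra step (not in the original training code), you can gauge the learned policy without exploration by temporarily setting epsilon to zero and running one greedy episode:

# Optional: run one greedy episode (no exploration) to gauge the learned policy
saved_epsilon = agent.epsilon
agent.epsilon = 0.0
state = env.reset()
total = 0.0
for _ in range(MAX_STEPS):
    action = agent.get_action(state)
    state, reward, done = env.step(action)
    total += reward
    if done:
        break
print(f"Greedy-policy reward for one episode: {total:.1f}")
agent.epsilon = saved_epsilon   # restore the exploration rate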

6. Visualizing The Training

frames = []
state = env.reset()

fig = plt.figure(figsize=(5, 5))
plt.axis("off")

# Roll out the trained agent for up to 100 steps, recording one frame per step
for _ in range(100):
    grid = env.render()
    im = plt.imshow(grid, animated=True)
    frames.append([im])
    action = agent.get_action(state)
    state, _, done = env.step(action)
    if done:
        break

ani = animation.ArtistAnimation(fig, frames, interval=200, blit=True)
plt.close()
HTML(ani.to_jshtml())

Snake QLearning Gif
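If you also want a standalone GIF file of the rollout, matplotlib's Animation.save can write one through the Pillow writer (this assumes the pillow package is installed; the original notebook does not include this step):

# Export the animation to a GIF file (requires pillow)
ani.save("snake_qlearning.gif", writer="pillow", fps=5)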

Conclusion

Q-learning is a simple yet powerful way for agents to learn optimal behavior by interacting with their environment. With just a table of values and a smart update rule, it allows agents to improve over time—learning entirely from rewards, not instructions. It's a foundational technique in reinforcement learning and a great starting point for building intelligent systems.
