Implementing Q-Learning from Scratch
Have you ever wondered how an AI agent can learn to play a game—like Snake, Pac-Man, or even chess—just by trial and error? Behind the scenes, a powerful concept called Q-learning is often at work. Q-learning is a model-free reinforcement learning algorithm. That means the agent doesn’t need to know the rules of the environment—it learns them by experience.
AI Summary
This blog introduces Q-learning, a reinforcement learning algorithm that enables agents to learn optimal actions through trial and error. It covers the key components—states, actions, rewards, and the Q-table—and explains how the agent updates its knowledge using the Bellman equation. With a balance of exploration and exploitation, Q-learning helps agents improve their behavior over time without needing a model of the environment.
What is Q-Learning?
Q-learning is a reinforcement learning algorithm that helps an agent learn which action to take in each state to maximize its total reward, purely through trial and error. Because it never needs a model of the environment's dynamics, it is called model-free.
Key Components in Q-Learning
- Agent: The decision-maker (e.g., the snake in the Snake game).
- Environment: The world the agent interacts with (e.g., a grid).
- State \( s \): A snapshot of the environment.
- Action \( a \): A move the agent can take.
- Reward \( r \): Feedback received after an action.
- Q-table \( Q(s, a) \): Stores the expected future reward for each action in each state (sketched in code right after this list).
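Concretely, the Q-table can be as simple as a dictionary mapping each state to an array of per-action values. This minimal sketch (with a placeholder state name) mirrors how the agent later in this post stores it:

import numpy as np
from collections import defaultdict

q_table = defaultdict(lambda: np.zeros(4))    # 4 actions; unseen states start at zero
state = "toy_state"                           # any hashable key works
q_table[state][2] += 0.5                      # nudge the value of action 2 in this state
best_action = int(np.argmax(q_table[state]))  # greedy choice: action 2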
How Q-Learning Works
The agent learns by updating Q-values using the Bellman update rule:

\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]

Where:
- \( \alpha \): Learning rate (how quickly new estimates override old ones)
- \( \gamma \): Discount factor (how much future rewards matter)
- \( r \): Immediate reward
- \( s' \): New state after taking action \( a \)
- \( \max_{a'} Q(s', a') \): Best possible future reward from the next state
Over time, the agent balances trying new actions (exploration) against choosing the best-known ones (exploitation), here via an epsilon-greedy strategy, and keeps refining the Q-table until it converges to optimal behavior. A single update is worked through numerically below.
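To make the update rule concrete, here is one hand-computed step with made-up numbers, using the same \( \alpha \) and \( \gamma \) as the code below:

ALPHA, GAMMA = 0.1, 0.9
q_sa = 0.0       # current estimate Q(s, a)
r = 10.0         # immediate reward (made-up)
best_next = 5.0  # max over a' of Q(s', a') (made-up)
target = r + GAMMA * best_next         # 10 + 0.9 * 5 = 14.5
q_sa = q_sa + ALPHA * (target - q_sa)  # 0 + 0.1 * (14.5 - 0) = 1.45
print(q_sa)  # 1.45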
Implementation of Q-Learning
1. Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
import random
from collections import defaultdict
2. Declaring Constant Values
# Constants
GRID_SIZE = 10          # width and height of the square grid
EPISODES = 500          # number of training episodes
MAX_STEPS = 100         # step cap per episode
EPSILON_DECAY = 0.995   # multiplicative decay of the exploration rate
MIN_EPSILON = 0.01      # floor for the exploration rate
ALPHA = 0.1             # learning rate
GAMMA = 0.9             # discount factor

# Directions (action indices)
UP = 0
DOWN = 1
LEFT = 2
RIGHT = 3
DIRECTION_VECTORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # (row, col) offset per action

# Colors (RGB)
RED = [255, 0, 0]    # snake
GREEN = [0, 255, 0]  # food
3. Creating the Snake's Environment
class SnakeEnv:
    def __init__(self, size=GRID_SIZE):
        self.size = size
        self.reset()

    def reset(self):
        # Start as a one-cell snake in the center, facing a random direction
        self.snake = [(self.size // 2, self.size // 2)]
        self.direction = random.choice([UP, DOWN, LEFT, RIGHT])
        self.place_food()
        self.done = False
        return self.get_state()

    def place_food(self):
        # Drop food on a random cell that the snake does not occupy
        while True:
            self.food = (random.randint(0, self.size - 1), random.randint(0, self.size - 1))
            if self.food not in self.snake:
                break

    def get_state(self):
        # Compact state: current heading, sign of the offset to the food,
        # and which of the four neighboring cells are deadly
        head = self.snake[0]
        dir_vector = DIRECTION_VECTORS[self.direction]
        food_dir = (np.sign(self.food[0] - head[0]), np.sign(self.food[1] - head[1]))
        danger = self.check_danger()
        return (dir_vector, food_dir, danger)

    def check_danger(self):
        head = self.snake[0]
        danger = []
        for d in range(4):
            dx, dy = DIRECTION_VECTORS[d]
            nx, ny = head[0] + dx, head[1] + dy
            if (nx < 0 or nx >= self.size or ny < 0 or ny >= self.size or (nx, ny) in self.snake):
                danger.append(1)
            else:
                danger.append(0)
        return tuple(danger)

    def step(self, action):
        if self.done:
            return self.get_state(), 0, self.done
        self.direction = action
        dx, dy = DIRECTION_VECTORS[self.direction]
        head = self.snake[0]
        new_head = (head[0] + dx, head[1] + dy)
        # Hitting a wall or the snake's own body ends the episode
        if (new_head in self.snake or
                not (0 <= new_head[0] < self.size) or
                not (0 <= new_head[1] < self.size)):
            self.done = True
            return self.get_state(), -10, True
        self.snake.insert(0, new_head)
        if new_head == self.food:
            self.place_food()
            reward = 10    # ate the food: grow and collect a big reward
        else:
            self.snake.pop()
            reward = -0.1  # small step penalty to discourage wandering
        return self.get_state(), reward, self.done

    def render(self):
        # White grid, snake in red, food in green
        grid = np.zeros((self.size, self.size, 3), dtype=np.uint8)
        grid[:, :] = 255
        for (x, y) in self.snake:
            grid[x, y] = RED
        fx, fy = self.food
        grid[fx, fy] = GREEN
        return grid
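Before wiring in the agent, it is worth sanity-checking the environment by driving it with random actions for a few steps (a throwaway sketch that reuses the imports and constants above, not part of the training code):

env = SnakeEnv()
state = env.reset()
for _ in range(5):
    action = random.choice([UP, DOWN, LEFT, RIGHT])
    state, reward, done = env.step(action)
    print(state, reward, done)
    if done:
        break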
4. The Q-Learning Agent
class QLearningAgent:
    def __init__(self, actions):
        # Unseen states default to a zero vector, one entry per action
        self.q_table = defaultdict(lambda: np.zeros(len(actions)))
        self.actions = actions
        self.epsilon = 1.0  # start fully exploratory

    def get_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        state_key = str(state)
        if np.random.rand() < self.epsilon:
            return random.choice(self.actions)
        else:
            return int(np.argmax(self.q_table[state_key]))

    def learn(self, state, action, reward, next_state):
        # One-step Q-learning update (the Bellman update rule from above)
        state_key = str(state)
        next_state_key = str(next_state)
        predict = self.q_table[state_key][action]
        target = reward + GAMMA * np.max(self.q_table[next_state_key])
        self.q_table[state_key][action] += ALPHA * (target - predict)
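Since learn keys the table with str(state), you can watch a single update in isolation by feeding it a hand-built state (toy values, reusing the class above):

agent = QLearningAgent(actions=[UP, DOWN, LEFT, RIGHT])
s = ((0, 1), (1, -1), (0, 0, 1, 0))  # (direction, food_dir, danger), as get_state() returns
agent.learn(s, RIGHT, 10, s)         # one TD update with a toy reward
print(agent.q_table[str(s)])         # [0. 0. 0. 1.]; only RIGHT's entry moved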
5. Training the Agent in the Environment
env = SnakeEnv()
agent = QLearningAgent(actions=[UP, DOWN, LEFT, RIGHT])

for episode in range(EPISODES):
    state = env.reset()
    total_reward = 0
    for _ in range(MAX_STEPS):
        action = agent.get_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        if done:
            break
    agent.epsilon = max(MIN_EPSILON, agent.epsilon * EPSILON_DECAY)
    if (episode + 1) % 100 == 0:
        print(f"Episode {episode+1}, Total reward: {total_reward:.3f}, Epsilon: {agent.epsilon:.3f}")
Episode 100, Total reward: -14.200, Epsilon: 0.606
Episode 200, Total reward: -11.500, Epsilon: 0.367
Episode 300, Total reward: 8.100, Epsilon: 0.222
Episode 400, Total reward: -0.600, Epsilon: 0.135
Episode 500, Total reward: 16.900, Epsilon: 0.082
6. Visualizing the Training
frames = []
state = env.reset()
fig = plt.figure(figsize=(5, 5))
plt.axis("off")

for _ in range(100):
    grid = env.render()
    im = plt.imshow(grid, animated=True)
    frames.append([im])
    action = agent.get_action(state)
    state, _, done = env.step(action)
    if done:
        break

ani = animation.ArtistAnimation(fig, frames, interval=200, blit=True)
plt.close()
HTML(ani.to_jshtml())
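If you want to keep the rollout outside the notebook, the same animation object can also be written to disk; for example, matplotlib can export a GIF when Pillow is installed (an optional extra, not part of the original code):

ani.save("snake.gif", writer="pillow", fps=5)  # assumes the Pillow package is available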
Conclusion
Q-learning is a simple yet powerful way for agents to learn optimal behavior by interacting with their environment. With just a table of values and a smart update rule, it allows agents to improve over time—learning entirely from rewards, not instructions. It's a foundational technique in reinforcement learning and a great starting point for building intelligent systems.