Q-learning • Deep Q Networks (DQN) • Policy Gradient Methods • Actor-Critic Methods
Reinforcement Learning (RL) is a branch of AI where an agent learns by interacting with an environment.
It learns through trial and error, just like humans learn to ride a bicycle or play a video game.
When we combine RL with Deep Learning, we get powerful systems that can:
- Play games better than humans
- Drive cars
- Control robots
- Make smart decisions over time
This chapter explains Q-learning, Deep Q-Networks, Policy Gradient Methods, and Actor–Critic models in simple, clear language suitable for grades 10–11.
1. What Is Reinforcement Learning?
In RL, an agent learns by:
- Taking actions
- Receiving rewards
- Improving behavior over time
Real-world examples
- A robot learning to walk
- A self-driving car learning to stay in its lane
- A computer learning to play chess or Minecraft
- A drone learning to balance in the air
Key terms
- Agent: the learner (robot, AI program)
- Environment: the world the agent interacts with
- Action: what the agent does
- Reward: feedback (good or bad)
- Goal: maximize long-term reward
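To see these terms working together, here is a minimal sketch of the agent–environment loop. The world here is made up (a short corridor where position 5 is the goal), and the reward numbers are illustrative assumptions, not a standard benchmark:

import random

# Made-up environment: the agent walks along positions 0..5 and wants to reach 5
position = 0          # the environment's state
total_reward = 0      # the agent's goal is to make this as large as possible

for step in range(20):
    action = random.choice([-1, +1])              # the agent acts (randomly, for now)
    position = min(5, max(0, position + action))  # the environment changes
    if position == 5:
        reward = 1                                # good: reached the goal
    else:
        reward = -0.1                             # small cost for every other step
    total_reward += reward                        # feedback the agent learns from

print("Total reward:", total_reward)

A learning agent would do better than this random one by using the rewards to improve its choice of action, which is exactly what Q-learning does next.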
2. Q-Learning
Q-learning is one of the simplest and most important RL algorithms.
Intuition
Q-learning teaches the agent how good each action is in each situation.
The agent keeps a table called a Q-table with values:
- High value = good action
- Low value = bad action
Simple Example: Maze Game
- If the agent moves closer to the goal → positive reward
- If it hits a wall → negative reward
- Over time, the Q-table improves (a full training loop is sketched after this list)
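Here is a minimal sketch of that maze idea as a complete Q-learning loop. The "maze" is simplified to a corridor of 5 cells where cell 4 is the goal; the corridor, the rewards, and the exploration rate are illustrative assumptions.

import numpy as np

n_states, n_actions = 5, 2                # corridor of 5 cells; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))       # the Q-table, all zeros at the start
learning_rate, discount, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    state = 0                             # start at the left end
    for step in range(100):               # cap the episode length
        # epsilon-greedy: usually pick the best known action, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 10 if next_state == n_states - 1 else -1   # goal is good, wandering costs a little
        # the same update rule shown in the code example below
        Q[state, action] += learning_rate * (
            reward + discount * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
        if state == n_states - 1:         # reached the goal, episode over
            break

print(Q)   # "move right" should end up with the higher value in every cell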
Why Q-Learning Is Useful
- Easy to understand
- Works for small problems
- Forms the foundation for Deep Q Networks (DQN)
Code Example (Basic Q-Learning Table)
import numpy as np

# Q-table with 5 states and 2 actions
Q = np.zeros((5, 2))

state = 0         # current state
action = 1        # action the agent took
next_state = 1    # state reached after taking the action
reward = 10       # reward received for that step
learning_rate = 0.5
discount = 0.9

# Q-learning update rule: move Q[state, action] toward the reward plus
# the discounted value of the best action in the next state
Q[state, action] = Q[state, action] + learning_rate * (
    reward + discount * np.max(Q[next_state]) - Q[state, action]
)

print(Q)

3. Deep Q Networks (DQN)
Q-learning works well only for small environments.
Real problems, such as video games, can have millions of states.
Storing and updating a Q-table for all of them becomes impractical.
Solution: Use a Deep Neural Network
DQN replaces the Q-table with a deep neural network that predicts Q-values.
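Here is a minimal sketch of that idea, assuming a small Keras network like the one in the code example below and a made-up 4-number state: instead of looking up Q[state, action] in a table, the network predicts all the Q-values for a state in one pass, and training nudges the predicted value toward a target built from the reward.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Tiny Q-network standing in for the Q-table (same shape as the code example below)
model = Sequential([
    Dense(24, activation='relu', input_shape=(4,)),
    Dense(2, activation='linear')
])
model.compile(optimizer='adam', loss='mse')

state = np.array([[0.1, 0.0, 0.2, 0.0]])        # made-up state with 4 numbers
next_state = np.array([[0.2, 0.0, 0.1, 0.0]])
reward, discount, action = 1.0, 0.9, 1

q_values = model.predict(state, verbose=0)      # network's guesses for both actions
# Same idea as the Q-learning update: nudge Q(state, action) toward the target
target = q_values.copy()
target[0, action] = reward + discount * np.max(model.predict(next_state, verbose=0))
model.fit(state, target, verbose=0)             # one small training step

A real DQN also uses a replay buffer and a separate "target network" to keep training stable; this sketch leaves those out.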
Why DQN Is Powerful
- Learns from pixels (like Atari or Minecraft games)
- Learns complex strategies
- Uses a memory buffer (a replay buffer, sketched below) to store past experiences
Simple Explanation
DQN looks at the game screen → predicts best action → gets reward → learns from mistakes.
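The memory buffer mentioned above is usually called a replay buffer. Here is a minimal sketch of one; the buffer size, batch size, and the example experience are illustrative assumptions.

import random
from collections import deque

# Replay buffer: stores (state, action, reward, next_state, done) tuples
buffer = deque(maxlen=10000)   # old experiences fall out when the buffer is full

def store(experience):
    """Save one step of experience for training later."""
    buffer.append(experience)

def sample_batch(batch_size=32):
    """Pick random past experiences; the random order breaks up correlations between steps."""
    return random.sample(buffer, batch_size)

# Example: store one made-up step of experience
store(([0.1, 0.0, 0.2, 0.0], 1, 1.0, [0.2, 0.0, 0.1, 0.0], False))
# once the buffer holds at least 32 experiences, a DQN samples a batch every training step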
Real Example
DeepMind trained DQN to play Atari games:
- Breakout
- Pac-Man
- Space Invaders
On many of these games, it scored at or above human level!
Code Example (Simple DQN Structure)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Small Q-network: takes a state made of 4 numbers, outputs one Q-value per action
model = Sequential([
    Dense(24, activation='relu', input_shape=(4,)),
    Dense(24, activation='relu'),
    Dense(2, activation='linear')  # Q-values for 2 actions
])

model.compile(optimizer='adam', loss='mse')
model.summary()

4. Policy Gradient Methods
Q-learning tries to estimate values (goodness of each action).
Policy Gradient methods directly learn a policy: a rule that maps each state to the action to take (or to probabilities over actions).
Intuition
Instead of saying: "This action has value 8."
The agent learns: "When in this situation, choose this action with 80% probability."
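In code, "choose this action with 80% probability" is just sampling from the policy's probabilities (the numbers here are made up):

import numpy as np

probs = np.array([0.8, 0.2])                      # policy output for one state
action = np.random.choice(len(probs), p=probs)    # picks action 0 about 80% of the time
print(action)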
Benefits
- Works well for continuous actions
- Better for robotics and control tasks
- Smoother learning
Real Example
A robot arm learning to pick up objects:
- It learns smooth joint movements
- Not just "action 1 or 2"
- Policy gradients are a natural fit here (a small update sketch follows this list)
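Here is a minimal sketch of one policy-gradient update in the REINFORCE style, using a tiny softmax network like the one in the code example below. The episode data (states, actions, returns) is made up, and the learning rate is an illustrative choice.

import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Tiny policy network, same shape as in the code example below
policy = Sequential([
    Dense(32, activation='relu', input_shape=(4,)),
    Dense(2, activation='softmax')
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Made-up data from one finished episode: 3 states, the actions taken, the returns earned
states = tf.constant(np.random.rand(3, 4), dtype=tf.float32)
actions = tf.constant([0, 1, 1])
returns = tf.constant([2.0, 1.5, 1.0])

with tf.GradientTape() as tape:
    probs = policy(states)                                    # action probabilities per state
    chosen = tf.gather(probs, actions, axis=1, batch_dims=1)  # probability of each action taken
    # REINFORCE: make actions that led to high returns more likely
    loss = -tf.reduce_mean(tf.math.log(chosen) * returns)

grads = tape.gradient(loss, policy.trainable_variables)
optimizer.apply_gradients(zip(grads, policy.trainable_variables))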
Code Example (Simple Policy Network)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Policy network: turns a state of 4 numbers into probabilities over 2 actions
policy = Sequential([
    Dense(32, activation='relu', input_shape=(4,)),
    Dense(2, activation='softmax')  # probabilities of actions
])

policy.compile(optimizer='adam', loss='categorical_crossentropy')
policy.summary()

5. Actor–Critic Methods
Actor–Critic combines the best of both:
- Actor → decides actions (policy)
- Critic → evaluates actions (value)
Why This Is Powerful
- Reduces training instability (the critic's feedback steadies the actor's updates)
- Usually learns faster than pure policy gradient methods
- Handles problems that plain Q-learning struggles with, such as continuous actions
- Scales to large environments
Simple Explanation
Imagine a student (actor) and a teacher (critic):
- Student tries an action
- Teacher tells how good it is
- Student improves based on feedback (a small update sketch follows this list)
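Here is a minimal sketch of a single actor–critic update for one made-up step of experience, using tiny networks like the ones in the code example below; the state values, reward, and learning rates are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Tiny actor (policy) and critic (value) networks, as in the code example below
actor = Sequential([Dense(32, activation='relu', input_shape=(4,)),
                    Dense(2, activation='softmax')])
critic = Sequential([Dense(32, activation='relu', input_shape=(4,)),
                     Dense(1)])
actor_opt = tf.keras.optimizers.Adam(learning_rate=0.01)
critic_opt = tf.keras.optimizers.Adam(learning_rate=0.01)

# One made-up step of experience
state = tf.constant([[0.1, 0.0, 0.2, 0.0]])
next_state = tf.constant([[0.2, 0.0, 0.1, 0.0]])
action, reward, discount = 1, 1.0, 0.99

with tf.GradientTape(persistent=True) as tape:
    value = critic(state)[0, 0]                    # teacher's guess for this state
    next_value = critic(next_state)[0, 0]          # teacher's guess for the next state
    # advantage: how much better the step went than the teacher expected
    advantage = reward + discount * tf.stop_gradient(next_value) - value
    critic_loss = tf.square(advantage)             # teacher improves its own estimate
    prob = actor(state)[0, action]                 # probability of the action the student took
    actor_loss = -tf.math.log(prob) * tf.stop_gradient(advantage)  # student follows the feedback

actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                              actor.trainable_variables))
critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                               critic.trainable_variables))
del tape   # a persistent tape should be released once the gradients are taken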
Real Example
Used in:
- AlphaGo
- Robotics
- Drone flight stabilization
- Walking robots
Code Example (Actor & Critic Networks)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Actor network: outputs action probabilities (the policy)
actor = Sequential([
    Dense(32, activation='relu', input_shape=(4,)),
    Dense(2, activation='softmax')
])

# Critic network: outputs a single number, the value of the state
critic = Sequential([
    Dense(32, activation='relu', input_shape=(4,)),
    Dense(1)  # value of the state
])

print("Actor:")
actor.summary()
print("Critic:")
critic.summary()

6. Summary Table
| Method | Description | Good For |
|---|---|---|
| Q-learning | Learns a table of action values (Q-table) | Small problems |
| DQN | Neural network predicts Q-values | Video games, complex tasks |
| Policy Gradient | Learns probabilities of actions | Robotics, continuous actions |
| Actor–Critic | Combines value + policy | Advanced RL (Go, robots) |