Deep Reinforcement Learning
Combining deep neural networks with RL algorithms to solve problems with high-dimensional state spaces — DQN, PPO, A3C, SAC, and more.
Why Deep RL?
Tabular methods (Q-tables) cannot handle large or continuous state spaces. A game screen with 210x160 pixels and 3 color channels (256 values each) has 256^100800 possible states — vastly more than the number of atoms in the observable universe. Deep neural networks serve as powerful function approximators that can generalize across similar states.
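To make that concrete, a quick back-of-the-envelope calculation (assuming 8-bit color channels) shows why a Q-table is hopeless here:

```python
import math

# One Atari frame: 210 x 160 pixels, 3 color channels, 256 values each
pixels = 210 * 160 * 3

# The number of distinct screens is 256^pixels; count its decimal digits
digits = int(pixels * math.log10(256)) + 1
print(digits)  # a number with roughly 240,000 digits

# Atoms in the observable universe: roughly 10^80 (about 81 digits)
print(digits > 80)  # True
```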
Deep Q-Network (DQN)
DQN (2013, DeepMind) was the breakthrough that started modern deep RL. It uses a neural network to approximate the Q-function, with two key innovations:
- Experience Replay: Store transitions (s, a, r, s') in a buffer and sample random mini-batches for training. This breaks correlations between consecutive samples and improves stability.
- Target Network: Use a separate, slowly-updated copy of the Q-network to compute target values. This prevents the "moving target" problem where the network chases its own predictions.
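The experience replay idea above can be sketched in a few lines of plain Python. This is a minimal illustrative buffer (class and method names are my own, not from any library); a real DQN would pair it with a target network that is a second copy of the Q-network, synced only every N steps:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # deque drops the oldest transition automatically when full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3 -- only the newest transitions are kept
```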
```python
from stable_baselines3 import DQN
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Train DQN agent
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=32,
    gamma=0.99,
    target_update_interval=500,
    verbose=1,
)
model.learn(total_timesteps=100000)

# Evaluate
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
Proximal Policy Optimization (PPO)
PPO (2017, OpenAI) is the most widely used deep RL algorithm today. It is an actor-critic method with a clipped objective that prevents excessively large policy updates:
- Clipped Objective: Limits how much the policy can change in a single update step, preventing catastrophic performance drops.
- Multiple Epochs: Reuses collected data for multiple gradient steps, improving sample efficiency over vanilla policy gradient.
- Simple and Robust: Works well across a wide range of tasks with minimal hyperparameter tuning.
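The clipped objective in the first bullet can be written out directly. This is a small NumPy sketch of the surrogate loss (function name is my own; real implementations compute this on batched tensors with autograd):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps], bounding each update step
    return np.minimum(unclipped, clipped).mean()

adv = np.array([1.0, -1.0])
# A doubled probability ratio is clipped to 1.2 where the advantage is
# positive, but the unclipped (worse) value is kept where it is negative
print(clipped_surrogate(np.log([2.0, 2.0]), np.log([1.0, 1.0]), adv))
```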
```python
from stable_baselines3 import PPO
import gymnasium as gym

# PPO works great for both discrete and continuous actions
env = gym.make("LunarLander-v3")
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1,
)
model.learn(total_timesteps=500000)
model.save("ppo_lunar_lander")
```
A3C and A2C
Asynchronous Advantage Actor-Critic (A3C) uses multiple parallel workers that each interact with their own copy of the environment. This provides diverse experience and stabilizes training without replay buffers.
A2C is the synchronous version: all workers finish collecting data, then a single gradient update is performed. It is simpler to implement and often performs just as well as A3C.
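The synchronous pattern can be illustrated with a toy NumPy sketch. Everything here is a stand-in (the "gradient" just pulls parameters toward 1.0, with noise playing the role of per-worker environment randomness); the point is the structure: every worker contributes an estimate, then one averaged update is applied:

```python
import numpy as np

def worker_gradient(theta, seed):
    """Stand-in for one worker's policy-gradient estimate on its own env copy."""
    noise = np.random.default_rng(seed).normal(0, 0.1, size=theta.shape)
    return -(theta - 1.0) + noise  # toy gradient pulling theta toward 1.0

theta = np.zeros(4)
lr = 0.5
for step in range(20):
    # A2C: wait for all 8 workers, then apply one averaged (synchronous) update.
    # A3C would instead let each worker apply its gradient as soon as it is ready.
    grads = [worker_gradient(theta, seed=step * 8 + w) for w in range(8)]
    theta = theta + lr * np.mean(grads, axis=0)

print(np.round(theta, 2))  # close to [1, 1, 1, 1]
```

Averaging over workers reduces gradient variance, which is why the parallel rollouts stabilize training without a replay buffer.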
Soft Actor-Critic (SAC)
SAC maximizes both expected return and entropy (randomness) of the policy. This encourages exploration and leads to more robust policies. SAC is the go-to algorithm for continuous control tasks (robotics, locomotion).
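SAC's entropy-regularized objective, E[Q(s, a) - alpha * log pi(a|s)], can be shown numerically. This is a minimal sketch (function name is my own); it demonstrates that for equal Q-values, a more random policy scores higher:

```python
import numpy as np

def soft_objective(q_values, log_probs, alpha=0.2):
    """SAC-style actor objective: expected Q plus an entropy bonus."""
    return np.mean(q_values - alpha * log_probs)

q = np.array([1.0, 1.0])
# A diffuse policy (lower action probabilities, hence lower log-probs)
# beats a peaked one when their Q-values are identical
peaked = soft_objective(q, log_probs=np.log([0.9, 0.9]))
diffuse = soft_objective(q, log_probs=np.log([0.5, 0.5]))
print(diffuse > peaked)  # True
```

The temperature alpha trades off reward against entropy; modern SAC implementations tune it automatically.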
Algorithm Comparison
| Algorithm | Type | Action Space | Best For |
|---|---|---|---|
| DQN | Off-policy, value-based | Discrete only | Atari games, simple discrete tasks |
| PPO | On-policy, actor-critic | Discrete + Continuous | General purpose, most tasks |
| A3C/A2C | On-policy, actor-critic | Discrete + Continuous | Parallel environments, fast training |
| SAC | Off-policy, actor-critic | Continuous | Robotics, continuous control |
| TD3 | Off-policy, actor-critic | Continuous | Continuous control (deterministic) |
DQN Variants and Improvements
- Double DQN: Uses the online network to select actions and the target network to evaluate them. Reduces overestimation bias.
- Dueling DQN: Splits the Q-value into a state value V(s) and an advantage A(s,a), letting the network judge how good a state is independently of the action chosen.
- Prioritized Experience Replay: Samples important transitions more frequently, based on TD error magnitude.
- Rainbow: Combines six DQN improvements into one agent, achieving state-of-the-art Atari performance.
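The Double DQN idea from the first bullet is a one-line change to the target computation. A NumPy sketch (function name is my own) of how action selection and evaluation are split between the two networks:

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    """Double DQN targets: online net selects the action, target net evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)  # selection (online net)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluation (target net)
    # No bootstrapping past terminal states
    return rewards + gamma * evaluated * (1.0 - dones)

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online_next = np.array([[0.2, 0.8], [0.5, 0.1]])  # online net prefers different actions
q_target_next = np.array([[0.3, 0.4], [0.6, 0.2]])  # ...than the target net's estimates
print(double_dqn_targets(rewards, q_online_next, q_target_next, dones))
```

Vanilla DQN would both select and evaluate with the target network's max, so any overestimated Q-value is chosen and propagated; decoupling the two reduces that bias.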