Advanced

Deep Reinforcement Learning

Combining deep neural networks with RL algorithms to solve problems with high-dimensional state spaces — DQN, PPO, A3C, SAC, and more.

Why Deep RL?

Tabular methods (Q-tables) cannot handle large or continuous state spaces. An Atari game screen of 210x160 pixels with 3 color channels has more possible states than there are atoms in the observable universe. Deep neural networks serve as powerful function approximators that generalize across similar states.
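To make that claim concrete, here is a rough back-of-the-envelope count, assuming 8-bit pixel values:

```python
import math

# An Atari frame: 210 x 160 pixels, 3 color channels, 256 values each.
pixels = 210 * 160 * 3
num_states = 256 ** pixels  # distinct possible frames as a (huge) integer

# Compare orders of magnitude: atoms in the observable universe ~ 10^80.
digits = pixels * math.log10(256)  # log10 of the state count
print(f"~10^{digits:.0f} possible frames")  # vastly more than 10^80
```

A Q-table would need one row per state, which is hopeless at this scale; a neural network instead maps similar frames to similar values.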

Deep Q-Network (DQN)

DQN (2013, DeepMind) was the breakthrough that started modern deep RL. It uses a neural network to approximate the Q-function, with two key innovations:

  • Experience Replay: Store transitions (s, a, r, s') in a buffer and sample random mini-batches for training. This breaks correlations between consecutive samples and improves stability.
  • Target Network: Use a separate, slowly-updated copy of the Q-network to compute target values. This prevents the "moving target" problem where the network chases its own predictions.
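These two mechanisms are simple enough to sketch in plain Python. This is an illustrative fragment, not a full DQN; the parameter lists passed to sync_target stand in for network weight arrays:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores transitions and samples uncorrelated mini-batches."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive steps.
        batch = random.sample(self.buffer, batch_size)
        return map(np.array, zip(*batch))

def sync_target(online_params, target_params, tau=1.0):
    """Update the target network's weights toward the online network's.
    tau=1.0 is a hard copy (classic DQN); tau < 1 gives a soft/Polyak update."""
    return [tau * w + (1 - tau) * t
            for w, t in zip(online_params, target_params)]
```

The target network is only synced every few hundred steps (or softly every step), so the bootstrap targets change slowly while the online network trains.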
Python - DQN with Stable Baselines3
from stable_baselines3 import DQN
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Train DQN agent
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=32,
    gamma=0.99,
    target_update_interval=500,
    verbose=1
)
model.learn(total_timesteps=100000)

# Evaluate
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

Proximal Policy Optimization (PPO)

PPO (2017, OpenAI) is the most widely used deep RL algorithm today. It is an actor-critic method that uses a clipped surrogate objective to prevent overly large policy updates:

  • Clipped Objective: Limits how much the policy can change in a single update step, preventing catastrophic performance drops.
  • Multiple Epochs: Reuses collected data for multiple gradient steps, improving sample efficiency over vanilla policy gradient.
  • Simple and Robust: Works well across a wide range of tasks with minimal hyperparameter tuning.
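The clipped objective itself is compact enough to sketch directly. In this illustrative NumPy fragment, ratio is the probability ratio pi_new(a|s) / pi_old(a|s) and advantage is the estimated advantage for each sample:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantage
    # Taking the minimum makes the objective pessimistic: the policy gains
    # nothing from moving the ratio outside the [1-eps, 1+eps] band.
    return np.minimum(unclipped, clipped).mean()
```

For example, with a positive advantage a ratio of 2.0 is clipped down to 1.2, so there is no incentive to push the policy further than 20% away from the old one in a single update.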
Python - PPO Training
from stable_baselines3 import PPO
import gymnasium as gym

# PPO handles both discrete and continuous action spaces
env = gym.make("LunarLander-v3")

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)
model.learn(total_timesteps=500000)
model.save("ppo_lunar_lander")

A3C and A2C

Asynchronous Advantage Actor-Critic (A3C) uses multiple parallel workers that each interact with their own copy of the environment. This provides diverse experience and stabilizes training without replay buffers.

A2C is the synchronous version: all workers collect data, then a single gradient update is performed. It is simpler to implement and often performs just as well as A3C.
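The "advantage" both methods estimate can be sketched as discounted returns minus the critic's value baseline. This is an illustrative NumPy fragment for a single worker's rollout, not the generalized advantage estimation (GAE) variant:

```python
import numpy as np

def n_step_advantages(rewards, values, last_value, gamma=0.99):
    """Discounted n-step returns minus the critic's value estimates."""
    returns = np.zeros(len(rewards))
    running = last_value  # bootstrap from the value of the final state
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Positive advantage: the action did better than the critic expected.
    return returns - values
```

In A2C, advantages from all parallel workers are concatenated into one batch before the single synchronous gradient update.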

Soft Actor-Critic (SAC)

SAC maximizes both expected return and entropy (randomness) of the policy. This encourages exploration and leads to more robust policies. SAC is the go-to algorithm for continuous control tasks (robotics, locomotion).
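The entropy bonus shows up directly in SAC's Bellman target. A minimal sketch, with alpha as the entropy temperature (illustrative scalar version; real implementations work on batches of critic outputs):

```python
import math

def soft_td_target(reward, next_q, next_log_prob,
                   alpha=0.2, gamma=0.99, done=False):
    """Entropy-regularized TD target: y = r + gamma * (Q' - alpha * log pi').
    The -alpha * log_prob term pays the agent for staying random."""
    soft_value = next_q - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)
```

Because low-probability actions have very negative log-probabilities, a high-entropy policy earns a larger bonus, which is what keeps SAC exploring.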

Algorithm Comparison

Algorithm | Type                     | Action Space          | Best For
DQN       | Off-policy, value-based  | Discrete only         | Atari games, simple discrete tasks
PPO       | On-policy, actor-critic  | Discrete + continuous | General purpose, most tasks
A3C/A2C   | On-policy, actor-critic  | Discrete + continuous | Parallel environments, fast training
SAC       | Off-policy, actor-critic | Continuous            | Robotics, continuous control
TD3       | Off-policy, actor-critic | Continuous            | Continuous control (deterministic)

DQN Variants and Improvements

  • Double DQN: Uses the online network to select actions and the target network to evaluate them. Reduces overestimation bias.
  • Dueling DQN: Decomposes the Q-value as Q(s,a) = V(s) + A(s,a), learning the state value and the per-action advantage separately. Gives better estimates of state quality, especially when the choice of action matters little.
  • Prioritized Experience Replay: Samples important transitions more frequently, based on TD error magnitude.
  • Rainbow: Combines six DQN improvements into one agent, achieving state-of-the-art Atari performance.
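Double DQN's select-with-online, evaluate-with-target trick fits in a few lines. In this illustrative NumPy fragment, q_online_next and q_target_next are the two networks' Q-values at the next state:

```python
import numpy as np

def double_dqn_target(reward, q_online_next, q_target_next,
                      gamma=0.99, done=False):
    """Online network selects the action; target network evaluates it."""
    best_action = np.argmax(q_online_next)   # selection: online network
    bootstrap = q_target_next[best_action]   # evaluation: target network
    return reward + (0.0 if done else gamma * bootstrap)
```

If q_online_next is [1, 5] and q_target_next is [10, 2], the target bootstraps from 2 (the target net's value for the online net's choice), whereas vanilla DQN would take max(q_target_next) = 10 and overestimate.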
Key takeaway: Deep RL combines neural networks with RL algorithms to handle complex, high-dimensional problems. PPO is the default choice for most tasks. SAC excels at continuous control. DQN variants work well for discrete action games. Use Stable Baselines3 for quick experimentation.