Deep Reinforcement Learning
Combining deep neural networks with RL algorithms to solve problems with high-dimensional state spaces — DQN, PPO, A3C, SAC, and more.
Why Deep RL?
Tabular methods (Q-tables) cannot handle large or continuous state spaces. A game screen with 210x160 pixels and 3 color channels (256 values each) has 256^100800 possible states — vastly more than the number of atoms in the observable universe. Deep neural networks serve as powerful function approximators that can generalize across similar states.
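To make that concrete, a quick back-of-the-envelope calculation (assuming 8-bit color channels) shows why a Q-table is hopeless here:

```python
import math

# One Atari frame: 210 x 160 pixels, 3 color channels, 256 values each
pixels = 210 * 160 * 3

# The number of distinct screens is 256^pixels; count its decimal digits
digits = int(pixels * math.log10(256)) + 1
print(digits)  # a number with roughly 240,000 digits

# Atoms in the observable universe: roughly 10^80 (about 81 digits)
print(digits > 80)  # True
```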
Deep Q-Network (DQN)
DQN (2013, DeepMind) was the breakthrough that started modern deep RL. It uses a neural network to approximate the Q-function, with two key innovations:
- Experience Replay: Store transitions (s, a, r, s') in a buffer and sample random mini-batches for training. This breaks correlations between consecutive samples and improves stability.
- Target Network: Use a separate, slowly-updated copy of the Q-network to compute target values. This prevents the "moving target" problem where the network chases its own predictions.
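The experience replay idea above can be sketched in a few lines of plain Python. This is a minimal illustrative buffer (class and method names are my own, not from any library); a real DQN would pair it with a target network that is a second copy of the Q-network, synced only every N steps:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # deque drops the oldest transition automatically when full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3 -- only the newest transitions are kept
```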
```python
from stable_baselines3 import DQN
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Train DQN agent
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=32,
    gamma=0.99,
    target_update_interval=500,
    verbose=1,
)
model.learn(total_timesteps=100000)

# Evaluate
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
Proximal Policy Optimization (PPO)
PPO (2017, OpenAI) is the most widely used deep RL algorithm today. It is an actor-critic method with a clipped objective that prevents excessively large policy updates:
- Clipped Objective: Limits how much the policy can change in a single update step, preventing catastrophic performance drops.
- Multiple Epochs: Reuses collected data for multiple gradient steps, improving sample efficiency over vanilla policy gradient.
- Simple and Robust: Works well across a wide range of tasks with minimal hyperparameter tuning.
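The clipped objective in the first bullet can be written out directly. This is a small NumPy sketch of the surrogate loss (function name is my own; real implementations compute this on batched tensors with autograd):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps], bounding each update step
    return np.minimum(unclipped, clipped).mean()

adv = np.array([1.0, -1.0])
# A doubled probability ratio is clipped to 1.2 where the advantage is
# positive, but the unclipped (worse) value is kept where it is negative
print(clipped_surrogate(np.log([2.0, 2.0]), np.log([1.0, 1.0]), adv))
```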
```python
from stable_baselines3 import PPO
import gymnasium as gym

# PPO works great for both discrete and continuous actions
env = gym.make("LunarLander-v3")
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1,
)
model.learn(total_timesteps=500000)
model.save("ppo_lunar_lander")
```
A3C and A2C
Asynchronous Advantage Actor-Critic (A3C) uses multiple parallel workers that each interact with their own copy of the environment. This provides diverse experience and stabilizes training without replay buffers.
A2C is the synchronous version: all workers finish collecting data, then a single gradient update is performed. It is simpler to implement and often performs just as well as A3C.
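The synchronous pattern can be illustrated with a toy NumPy sketch. Everything here is a stand-in (the "gradient" just pulls parameters toward 1.0, with noise playing the role of per-worker environment randomness); the point is the structure: every worker contributes an estimate, then one averaged update is applied:

```python
import numpy as np

def worker_gradient(theta, seed):
    """Stand-in for one worker's policy-gradient estimate on its own env copy."""
    noise = np.random.default_rng(seed).normal(0, 0.1, size=theta.shape)
    return -(theta - 1.0) + noise  # toy gradient pulling theta toward 1.0

theta = np.zeros(4)
lr = 0.5
for step in range(20):
    # A2C: wait for all 8 workers, then apply one averaged (synchronous) update.
    # A3C would instead let each worker apply its gradient as soon as it is ready.
    grads = [worker_gradient(theta, seed=step * 8 + w) for w in range(8)]
    theta = theta + lr * np.mean(grads, axis=0)

print(np.round(theta, 2))  # close to [1, 1, 1, 1]
```

Averaging over workers reduces gradient variance, which is why the parallel rollouts stabilize training without a replay buffer.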
Soft Actor-Critic (SAC)
SAC maximizes both expected return and entropy (randomness) of the policy. This encourages exploration and leads to more robust policies. SAC is the go-to algorithm for continuous control tasks (robotics, locomotion).
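SAC's entropy-regularized objective, E[Q(s, a) - alpha * log pi(a|s)], can be shown numerically. This is a minimal sketch (function name is my own); it demonstrates that for equal Q-values, a more random policy scores higher:

```python
import numpy as np

def soft_objective(q_values, log_probs, alpha=0.2):
    """SAC-style actor objective: expected Q plus an entropy bonus."""
    return np.mean(q_values - alpha * log_probs)

q = np.array([1.0, 1.0])
# A diffuse policy (lower action probabilities, hence lower log-probs)
# beats a peaked one when their Q-values are identical
peaked = soft_objective(q, log_probs=np.log([0.9, 0.9]))
diffuse = soft_objective(q, log_probs=np.log([0.5, 0.5]))
print(diffuse > peaked)  # True
```

The temperature alpha trades off reward against entropy; modern SAC implementations tune it automatically.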
Algorithm Comparison
| Algorithm | Type | Action Space | Best For |
|---|---|---|---|
| DQN | Off-policy, value-based | Discrete only | Atari games, simple discrete tasks |
| PPO | On-policy, actor-critic | Discrete + Continuous | General purpose, most tasks |
| A3C/A2C | On-policy, actor-critic | Discrete + Continuous | Parallel environments, fast training |
| SAC | Off-policy, actor-critic | Continuous | Robotics, continuous control |
| TD3 | Off-policy, actor-critic | Continuous | Continuous control (deterministic) |
DQN Variants and Improvements
- Double DQN: Uses the online network to select actions and the target network to evaluate them. Reduces overestimation bias.
- Dueling DQN: Splits the Q-value into a state value V(s) and an advantage A(s,a), letting the network judge how good a state is independently of the action chosen.
- Prioritized Experience Replay: Samples important transitions more frequently, based on TD error magnitude.
- Rainbow: Combines six DQN improvements into one agent, achieving state-of-the-art Atari performance.
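The Double DQN idea from the first bullet is a one-line change to the target computation. A NumPy sketch (function name is my own) of how action selection and evaluation are split between the two networks:

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    """Double DQN targets: online net selects the action, target net evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)  # selection (online net)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluation (target net)
    # No bootstrapping past terminal states
    return rewards + gamma * evaluated * (1.0 - dones)

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online_next = np.array([[0.2, 0.8], [0.5, 0.1]])  # online net prefers different actions
q_target_next = np.array([[0.3, 0.4], [0.6, 0.2]])  # ...than the target net's estimates
print(double_dqn_targets(rewards, q_online_next, q_target_next, dones))
```

Vanilla DQN would both select and evaluate with the target network's max, so any overestimated Q-value is chosen and propagated; decoupling the two reduces that bias.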