Deep Q-Learning (DQN): From Theory to Implementation
Imagine teaching a computer to master complex video games without explicitly programming the rules—just by watching pixels on a screen and learning from trial and error. This isn’t science fiction; it’s the power of deep q learning, a groundbreaking technique that combines classical reinforcement learning with deep neural networks. The deep q network (DQN) represents one of the most significant breakthroughs in artificial intelligence, enabling machines to achieve human-level performance on tasks that were once thought impossible for computers to learn autonomously.

In this comprehensive guide, we’ll explore everything about deep q learning, from its theoretical foundations rooted in the q learning algorithm to practical implementation details. Whether you’re a researcher, developer, or AI enthusiast, understanding DQN opens doors to building intelligent agents capable of solving complex decision-making problems.
1. Understanding the foundations of q learning
Before diving into deep q learning, we need to understand its predecessor: the classical q learning algorithm. This fundamental reinforcement learning technique laid the groundwork for the revolutionary advances that followed.
What is reinforcement learning?
Reinforcement learning is a paradigm where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, where we provide labeled examples, reinforcement learning agents learn through experience. The agent takes actions, receives rewards or penalties, and gradually improves its behavior to maximize cumulative rewards over time.
The key components of any reinforcement learning system include:
- Agent: The learner or decision-maker
- Environment: The world the agent interacts with
- State: A representation of the current situation
- Action: Choices available to the agent
- Reward: Feedback signal indicating how good an action was
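In code, this interaction takes the shape of a simple loop: the agent observes a state, picks an action, and the environment returns a reward and the next state. The sketch below is a minimal illustration using hypothetical env and agent objects (reset, step, and act are assumed interfaces here, not part of any particular library):
# Minimal sketch of the agent-environment interaction loop
state = env.reset()                              # environment provides the initial state
done = False
total_reward = 0.0
while not done:
    action = agent.act(state)                    # agent chooses an action for this state
    next_state, reward, done = env.step(action)  # environment returns feedback
    total_reward += reward                       # accumulate the reward signal
    state = next_state                           # continue from the new state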
The q function and q table
At the heart of q learning lies the q function, also known as the action-value function. This function estimates the expected cumulative reward for taking a specific action in a given state and following the optimal policy thereafter. Mathematically, we express this as \(Q(s, a)\), where \(s\) represents the state and \(a\) represents the action.
In classical q learning, we store these q values in a q table—essentially a lookup table where rows represent states, columns represent actions, and each cell contains the estimated value of taking that action in that state. For a simple grid world with 100 states and 4 possible actions (up, down, left, right), our q table would be a 100×4 matrix.
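As a quick illustration, such a q table is nothing more than a two-dimensional array indexed by state and action; a minimal sketch in NumPy:
import numpy as np

n_states, n_actions = 100, 4                 # grid world: 100 states, 4 moves
q_table = np.zeros((n_states, n_actions))    # every estimate starts at zero

state, action = 42, 3                        # e.g. state 42, action "right"
print(q_table[state, action])                # current estimate: 0.0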
The Bellman equation
The bellman equation provides the mathematical foundation for updating q values. It expresses the relationship between the value of a state-action pair and the values of subsequent state-action pairs. The equation states:
$$ Q(s, a) = r + \gamma \max_{a'} Q(s', a') $$
where:
- \(r\) is the immediate reward
- \(\gamma\) is the discount factor (typically between 0.8 and 0.99)
- \(s'\) is the next state
- \(a'\) represents possible actions in the next state
- \(\max_{a'} Q(s', a')\) is the maximum q value for the next state
The discount factor \(\gamma\) determines how much the agent values future rewards compared to immediate ones. A value close to 1 makes the agent far-sighted, while a value close to 0 makes it myopic.
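As a small numeric example (values chosen purely for illustration), suppose the immediate reward is \(r = 1\), the discount factor is \(\gamma = 0.9\), and the best q value available in the next state is 5. The Bellman target then evaluates to:
$$ Q(s, a) = 1 + 0.9 \times 5 = 5.5 $$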
The q learning algorithm
The q learning algorithm iteratively updates the q table using the temporal difference learning approach. After each action, we update the q value using:
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$
where \(\alpha\) is the learning rate controlling how quickly the agent incorporates new information. The term in brackets represents the temporal difference error—the difference between the new estimate and the old one.
Here’s a simple Python implementation of q learning for a grid world:
import numpy as np
import random

class QLearningAgent:
    def __init__(self, n_states, n_actions, learning_rate=0.1,
                 discount_factor=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.n_actions = n_actions

    def choose_action(self, state):
        # Epsilon-greedy policy
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        else:
            return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state):
        # Q-learning update rule
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.lr * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state, action] = new_q

    def train(self, env, episodes=1000):
        for episode in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                self.update(state, action, reward, next_state)
                state = next_state
This implementation showcases the core q learning algorithm with epsilon-greedy exploration, where the agent occasionally takes random actions to explore the environment while mostly exploiting its current knowledge.
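To see the agent learn, it can be paired with any environment exposing the reset/step interface the train method expects. The toy corridor environment below is a hypothetical example (not part of the original implementation): the agent starts at cell 0 and earns a reward for reaching the rightmost cell.
class CorridorEnv:
    # Tiny 1-D corridor: start at cell 0, reward for reaching the last cell
    def __init__(self, n_states=10):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, action 1 = move right
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(self.n_states - 1, self.state + 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else -0.01          # small step cost encourages short paths
        return self.state, reward, done

env = CorridorEnv()
agent = QLearningAgent(n_states=10, n_actions=2)
agent.train(env, episodes=500)
print(agent.q_table[0])                          # "move right" should dominate at the start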
2. The limitations of classical q learning
While the q learning algorithm works remarkably well for simple problems, it faces severe limitations when confronting real-world complexity. Understanding these constraints is crucial to appreciating why deep q learning emerged as a transformative solution.
The curse of dimensionality
The most critical limitation is the curse of dimensionality. As the state space grows, maintaining a q table becomes computationally infeasible. Consider playing Atari games: a single frame has dimensions of 210×160 pixels with 128 possible color values per pixel. The number of possible states is astronomical—far too many to store in a table.
Even for moderately complex problems like chess, where the number of possible board positions exceeds \(10^{43}\), a q table approach is completely impractical. The memory requirements alone would exceed what any computer could handle.
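A quick back-of-the-envelope calculation makes the scale concrete (illustrative only):
import math

n_pixels = 210 * 160                               # 33,600 pixels per Atari frame
digits = n_pixels * math.log10(128)                # decimal digits in 128 ** 33,600
print(f"about 10^{digits:.0f} possible frames")    # on the order of 10^70,000 distinct raw screens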
Lack of generalization
Classical q learning treats each state independently without any generalization capability. If an agent learns that a particular action is good in one state, this knowledge doesn’t transfer to similar states. This means the agent must visit every state multiple times to learn effectively, which is impossibly time-consuming for large state spaces.
Imagine teaching someone to play tennis by having them practice every possible configuration of ball position, velocity, and spin independently. They’d never finish learning! Humans naturally generalize from similar experiences, but q tables cannot.
Inability to handle continuous state spaces
Many real-world problems involve continuous state spaces. For example, a robot’s position might be represented by continuous x and y coordinates, or a self-driving car must process continuous sensor readings. Discretizing these continuous spaces loses important information and still results in enormous state spaces.
3. Deep q network: Bridging neural networks and q learning
The deep q network (DQN) revolutionized reinforcement learning by replacing the q table with a deep neural network. This elegant solution addresses all the major limitations of classical q learning while maintaining its theoretical soundness.
Neural networks as function approximators
Instead of storing q values in a table, a deep q network learns to approximate the q function using a neural network. The network takes a state as input and outputs q values for all possible actions. This approach provides several crucial advantages:
Generalization: Neural networks naturally generalize across similar states. If the network learns that moving right is valuable in one game state, it can apply this knowledge to similar states without experiencing them directly.
Compact representation: Rather than storing millions or billions of q values, we only need to store the neural network’s weights, which is dramatically more memory-efficient.
Continuous state handling: Neural networks can process continuous inputs directly, whether raw pixels, sensor readings, or any other representation.
The DQN architecture
A typical DQN for processing visual input uses convolutional neural networks. Here’s the architecture used in the seminal work “playing atari with deep reinforcement learning”:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()
        # Convolutional layers for processing visual input
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # Calculate size after convolutions
        def conv2d_size_out(size, kernel_size, stride):
            return (size - kernel_size) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(
            input_shape[1], 8, 4), 4, 2), 3, 1)
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(
            input_shape[2], 8, 4), 4, 2), 3, 1)
        linear_input_size = convw * convh * 64

        # Fully connected layers
        self.fc1 = nn.Linear(linear_input_size, 512)
        self.fc2 = nn.Linear(512, n_actions)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # Flatten
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # Output Q-values for each action
The convolutional layers extract spatial features from the input frames, while the fully connected layers combine these features to estimate action values. The network outputs one q value for each possible action.
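As a quick sanity check, the network can be instantiated with the stacked-frame input shape used later in this article (4 frames of 84×84 pixels) and run on a dummy batch; this is only an illustrative usage sketch:
import torch

net = DQN(input_shape=(4, 84, 84), n_actions=6)  # e.g. 6 discrete actions
dummy_state = torch.zeros(1, 4, 84, 84)          # a batch containing one state
q_values = net(dummy_state)                      # shape (1, 6): one Q-value per action
print(q_values.shape, q_values.argmax(dim=1))    # greedy action index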
Experience replay
One of the key innovations in DQN is experience replay. Instead of learning from experiences immediately and discarding them, the agent stores transitions \((s, a, r, s')\) in a replay buffer. During training, the agent samples random mini-batches from this buffer to update the network.
Experience replay provides two critical benefits:
Breaking correlation: Consecutive experiences are highly correlated. Training on them sequentially can cause the network to overfit to recent experiences and forget earlier learning. Random sampling breaks this correlation.
Data efficiency: Each experience can be used multiple times for learning, making the agent much more data-efficient.
Here’s an implementation of a replay buffer:
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
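A minimal usage illustration (with placeholder transition values) shows the intended workflow: push transitions as they occur, then sample a random mini-batch once enough are stored.
buffer = ReplayBuffer(capacity=1000)

# Store a few placeholder transitions (state, action, reward, next_state, done)
for i in range(100):
    buffer.push(i, 0, 1.0, i + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=32)
print(len(buffer), len(states))                  # 100 stored, 32 sampled at random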
Target networks
Another crucial innovation is the use of target networks. When computing the target q values for training, using the same network for both current and target values creates a moving target problem—like trying to catch your own shadow. The values being updated and the targets are constantly changing, leading to instability.
The solution is to maintain two networks: the main Q-network being trained and a target network for computing target values. The target network’s weights are periodically copied from the main network (every few thousand steps), providing stable targets during training:
$$ \text{Target} = r + \gamma \max_{a'} Q_{\text{target}}(s', a') $$
The loss function for training becomes:
$$ L = \mathbb{E}_{(s, a, r, s') \sim \text{Replay Buffer}} \left[ \left( r + \gamma \max_{a'} Q_{\text{target}}(s', a') - Q(s, a) \right)^2 \right] $$
This is essentially mean squared error between the predicted q values and the target q values.
4. Implementing a complete DQN agent
Now let’s bring everything together with a complete DQN implementation that can learn to play games or solve other sequential decision-making tasks.
The DQN agent class
import torch
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class DQNAgent:
    def __init__(self, state_shape, n_actions, learning_rate=0.00025,
                 gamma=0.99, epsilon_start=1.0, epsilon_end=0.01,
                 epsilon_decay=0.995, buffer_size=100000, batch_size=32):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.n_actions = n_actions
        self.gamma = gamma
        self.batch_size = batch_size

        # Epsilon-greedy parameters
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        # Networks
        self.policy_net = DQN(state_shape, n_actions).to(self.device)
        self.target_net = DQN(state_shape, n_actions).to(self.device)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()

        # Optimizer
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=learning_rate)

        # Replay buffer
        self.replay_buffer = ReplayBuffer(buffer_size)

        self.steps = 0
        self.update_target_frequency = 10000

    def select_action(self, state):
        # Epsilon-greedy action selection
        if np.random.random() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        else:
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
                q_values = self.policy_net(state_tensor)
                return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.push(state, action, reward, next_state, done)

    def train_step(self):
        if len(self.replay_buffer) < self.batch_size:
            return None

        # Sample from replay buffer
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)

        # Convert to tensors
        states = torch.FloatTensor(np.array(states)).to(self.device)
        actions = torch.LongTensor(actions).to(self.device)
        rewards = torch.FloatTensor(rewards).to(self.device)
        next_states = torch.FloatTensor(np.array(next_states)).to(self.device)
        dones = torch.FloatTensor(dones).to(self.device)

        # Current Q values
        current_q_values = self.policy_net(states).gather(1, actions.unsqueeze(1))

        # Target Q values
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0]
            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values

        # Compute loss
        loss = F.mse_loss(current_q_values.squeeze(), target_q_values)

        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 10)
        self.optimizer.step()

        # Update target network
        self.steps += 1
        if self.steps % self.update_target_frequency == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())

        # Decay epsilon
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

        return loss.item()
Training loop
Here’s a complete training loop that ties everything together:
def train_dqn(env, agent, episodes=1000, max_steps=10000):
    episode_rewards = []

    for episode in range(episodes):
        state = env.reset()
        episode_reward = 0

        for step in range(max_steps):
            # Select and perform action
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)

            # Store transition
            agent.store_transition(state, action, reward, next_state, done)

            # Train the agent
            loss = agent.train_step()

            episode_reward += reward
            state = next_state

            if done:
                break

        episode_rewards.append(episode_reward)

        # Logging
        if (episode + 1) % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}, "
                  f"Epsilon: {agent.epsilon:.3f}")

    return episode_rewards
Preprocessing for visual inputs
When working with visual inputs like Atari games, preprocessing is essential. The original work on “human-level control through deep reinforcement learning” used several preprocessing steps:
import cv2
import numpy as np
from collections import deque

class AtariPreprocessor:
    def __init__(self, frame_stack=4):
        self.frame_stack = frame_stack
        self.frames = deque(maxlen=frame_stack)

    def preprocess_frame(self, frame):
        # Convert to grayscale
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        # Resize to 84x84
        resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
        # Normalize
        normalized = resized / 255.0
        return normalized

    def reset(self, initial_frame):
        processed = self.preprocess_frame(initial_frame)
        for _ in range(self.frame_stack):
            self.frames.append(processed)
        return np.stack(self.frames, axis=0)

    def step(self, frame):
        processed = self.preprocess_frame(frame)
        self.frames.append(processed)
        return np.stack(self.frames, axis=0)
The preprocessing converts frames to grayscale, resizes them to 84×84 pixels, and stacks four consecutive frames to capture motion information. This dramatically reduces the input dimensionality while preserving essential information.
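For example, the preprocessor can be wrapped around a Gym-style Atari environment. The sketch below assumes the classic gym API, where reset() returns a raw RGB frame and step() returns a four-tuple, and an installed Atari environment id; adjust accordingly for newer gymnasium releases:
import gym

env = gym.make("BreakoutDeterministic-v4")       # assumed Atari environment id
preprocessor = AtariPreprocessor(frame_stack=4)

frame = env.reset()                              # raw RGB frame, shape (210, 160, 3)
state = preprocessor.reset(frame)                # stacked grayscale frames, shape (4, 84, 84)

raw_frame, reward, done, info = env.step(env.action_space.sample())
next_state = preprocessor.step(raw_frame)        # next stacked state fed to the agent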
5. Advanced techniques and improvements
Since the original DQN publication, researchers have developed numerous enhancements that significantly improve performance and training stability.
Double DQN
Standard DQN tends to overestimate action values because it uses the maximum value for computing targets. Double DQN addresses this by using the policy network to select actions but the target network to evaluate them:
$$ \text{Target} = r + \gamma Q_{\text{target}}(s', \arg\max_{a'} Q_{\text{policy}}(s', a')) $$
Implementation requires only a small modification to the training step:
# In the train_step method, replace target Q value computation with:
with torch.no_grad():
    # Policy network selects action
    next_actions = self.policy_net(next_states).argmax(1)
    # Target network evaluates that action
    next_q_values = self.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze()
    target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
Prioritized experience replay
Not all experiences are equally valuable for learning. Prioritized experience replay samples transitions based on their temporal difference error—transitions where the agent’s prediction was most wrong are sampled more frequently:
class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha  # How much prioritization to use
        self.beta = beta    # Importance sampling correction
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        max_priority = self.priorities.max() if self.buffer else 1.0

        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
        else:
            self.buffer[self.position] = (state, action, reward, next_state, done)

        self.priorities[self.position] = max_priority
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        if len(self.buffer) == self.capacity:
            priorities = self.priorities
        else:
            priorities = self.priorities[:len(self.buffer)]

        # Calculate sampling probabilities
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()

        # Sample indices
        indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)

        # Calculate importance sampling weights
        weights = (len(self.buffer) * probabilities[indices]) ** (-self.beta)
        weights /= weights.max()

        batch = [self.buffer[idx] for idx in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones, indices, weights

    def update_priorities(self, indices, priorities):
        for idx, priority in zip(indices, priorities):
            self.priorities[idx] = priority
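To use the prioritized buffer, the sampled importance weights must scale the loss, and new priorities (typically the absolute temporal difference errors) must be written back after each update. The fragment below is a hedged sketch of how the train_step method shown earlier could be adapted; the names simply mirror the code above:
# Sketch: inside a modified train_step, using the prioritized buffer
states, actions, rewards, next_states, dones, indices, weights = \
    self.replay_buffer.sample(self.batch_size)

# Convert to tensors exactly as in the original train_step
states = torch.FloatTensor(np.array(states)).to(self.device)
actions = torch.LongTensor(actions).to(self.device)
rewards = torch.FloatTensor(rewards).to(self.device)
next_states = torch.FloatTensor(np.array(next_states)).to(self.device)
dones = torch.FloatTensor(dones).to(self.device)
weights = torch.FloatTensor(weights).to(self.device)

current_q = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze()
with torch.no_grad():
    next_q = self.target_net(next_states).max(1)[0]
    targets = rewards + (1 - dones) * self.gamma * next_q

td_errors = targets - current_q                  # per-transition temporal difference error
loss = (weights * td_errors.pow(2)).mean()       # importance-sampled squared error

# Write back new priorities: |TD error| plus a small constant so none become zero
new_priorities = td_errors.abs().detach().cpu().numpy() + 1e-6
self.replay_buffer.update_priorities(indices, new_priorities)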
Dueling DQN
The dueling architecture separates the estimation of state value and action advantages. Some states are inherently valuable regardless of the action taken, while in other states, the choice of action matters significantly. The dueling network explicitly models this:
$$ Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a') $$
where \(V(s)\) is the state value and \(A(s, a)\) is the advantage of action \(a\) in state \(s\).
class DuelingDQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DuelingDQN, self).__init__()
        # Shared convolutional layers
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # Calculate size after convolutions
        convw = self._conv_size_out(self._conv_size_out(
            self._conv_size_out(input_shape[1], 8, 4), 4, 2), 3, 1)
        convh = self._conv_size_out(self._conv_size_out(
            self._conv_size_out(input_shape[2], 8, 4), 4, 2), 3, 1)
        linear_input_size = convw * convh * 64

        # Value stream
        self.value_stream = nn.Sequential(
            nn.Linear(linear_input_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        # Advantage stream
        self.advantage_stream = nn.Sequential(
            nn.Linear(linear_input_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _conv_size_out(self, size, kernel_size, stride):
        return (size - kernel_size) // stride + 1

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)

        value = self.value_stream(x)
        advantages = self.advantage_stream(x)

        # Combine value and advantages
        q_values = value + (advantages - advantages.mean(dim=1, keepdim=True))
        return q_values
Noisy networks
Instead of using epsilon-greedy exploration, noisy networks add parametric noise to the network weights, allowing the network to learn when and how to explore:
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super(NoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.sigma_init = sigma_init

        # Learnable parameters
        self.weight_mu = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.FloatTensor(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.FloatTensor(out_features))
        self.bias_sigma = nn.Parameter(torch.FloatTensor(out_features))

        # Register noise buffers
        self.register_buffer('weight_epsilon', torch.FloatTensor(out_features, in_features))
        self.register_buffer('bias_epsilon', torch.FloatTensor(out_features))

        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        mu_range = 1.0 / np.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-mu_range, mu_range)
        self.weight_sigma.data.fill_(self.sigma_init / np.sqrt(self.in_features))
        self.bias_mu.data.uniform_(-mu_range, mu_range)
        self.bias_sigma.data.fill_(self.sigma_init / np.sqrt(self.out_features))

    def reset_noise(self):
        epsilon_in = self._scale_noise(self.in_features)
        epsilon_out = self._scale_noise(self.out_features)
        self.weight_epsilon.copy_(epsilon_out.outer(epsilon_in))
        self.bias_epsilon.copy_(epsilon_out)

    def _scale_noise(self, size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_epsilon
            bias = self.bias_mu + self.bias_sigma * self.bias_epsilon
        else:
            weight = self.weight_mu
            bias = self.bias_mu
        return F.linear(x, weight, bias)
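In a network, NoisyLinear layers simply replace the ordinary fully connected layers, and fresh noise is drawn before each training step so exploration adapts as learning progresses. A minimal sketch of such a head (a variant not shown in the original code, assuming the convolutional feature size computed earlier):
class NoisyDQNHead(nn.Module):
    # Fully connected head with noisy layers in place of nn.Linear
    def __init__(self, linear_input_size, n_actions):
        super(NoisyDQNHead, self).__init__()
        self.fc1 = NoisyLinear(linear_input_size, 512)
        self.fc2 = NoisyLinear(512, n_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

    def reset_noise(self):
        # Call once per training step so each update sees freshly sampled noise
        self.fc1.reset_noise()
        self.fc2.reset_noise()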
6. Applications and real-world impact
Deep q learning has transcended its origins in game-playing to impact numerous domains, demonstrating the versatility and power of this approach.
Game playing and beyond
The breakthrough moment for DQN came with its performance on Atari games. The work on “playing atari with deep reinforcement learning” demonstrated that a single algorithm could learn to play dozens of different games using only pixel inputs and game scores. Later research on “human-level control through deep reinforcement learning” showed that DQN could achieve superhuman performance on many of these games.
What makes this remarkable isn’t just the performance level but the generality. The same network architecture and hyperparameters worked across games as diverse as Breakout, Space Invaders, and Pong. The agent discovered effective strategies without any game-specific engineering or domain knowledge.
Robotics and control
In robotics, DQN enables robots to learn complex manipulation tasks. Researchers have applied DQN variants to teach robots to grasp objects, navigate environments, and perform assembly tasks. The ability to learn directly from visual inputs makes DQN particularly valuable for vision-based robotic control.
For instance, a robot arm can learn to pick and place objects by receiving camera images as input and trying different grasping strategies. Through trial and error with DQN, the robot discovers effective grasping techniques without explicit programming of hand-eye coordination or object geometry.
Resource management
DQN has proven effective for complex resource allocation problems. In data centers, DQN agents can learn to optimize cooling systems, reducing cooling energy consumption by as much as 40%. The agent learns to predict thermal dynamics and make control decisions that balance temperature regulation with energy efficiency.
Cloud computing platforms use DQN for virtual machine placement and migration decisions, learning to optimize resource utilization while maintaining performance guarantees. The agent considers factors like CPU usage, memory requirements, network bandwidth, and predicted workload patterns.
Autonomous systems
Self-driving vehicles and drones use DQN variants for decision-making in navigation and control. While end-to-end learning from pixels to steering commands remains challenging, DQN excels at higher-level decision tasks like lane changing, intersection navigation, and path planning.
A DQN agent for autonomous driving might learn when to change lanes, when to overtake slow vehicles, and how to handle complex traffic scenarios. The state representation includes information about nearby vehicles, road geometry, traffic signals, and the vehicle’s current velocity and position.
Trading and finance
In financial markets, DQN agents learn trading strategies by treating price movements and market conditions as states and buy/sell/hold decisions as actions. While the stochastic nature of financial markets presents challenges, DQN can discover profitable trading patterns that exploit market inefficiencies.
These agents consider multiple factors: historical prices, trading volumes, technical indicators, market sentiment, and macroeconomic variables. The reward signal comes from portfolio returns adjusted for risk, encouraging the agent to find strategies with favorable risk-reward profiles.
Healthcare optimization
Medical applications include treatment planning, where DQN learns to sequence treatments for chronic diseases like diabetes or cancer. The agent considers patient history, current health state, treatment side effects, and long-term outcomes to recommend personalized treatment strategies.
Hospital resource management systems use DQN to optimize bed allocation, operating room scheduling, and staff assignment. The agent learns to balance competing objectives like minimizing patient wait times, maximizing resource utilization, and maintaining quality of care.
7. Challenges and future directions
Despite its successes, deep q learning faces several challenges that researchers continue to address through ongoing innovation.
Sample efficiency
DQN typically requires millions of environment interactions to learn effective policies. For real-world applications where data collection is expensive or time-consuming, this sample inefficiency becomes prohibitive. A robot learning through physical trial and error cannot afford millions of attempts, and simulated environments don’t always transfer well to reality.
Current research focuses on improving sample efficiency through better exploration strategies, incorporating prior knowledge, learning from demonstrations, and transfer learning from related tasks. Model-based approaches that learn environment dynamics can also improve sample efficiency by enabling planning.
Stability and hyperparameter sensitivity
Training DQN can be unstable, with performance varying dramatically based on hyperparameter choices. Learning rate, network architecture, replay buffer size, target network