Reinforcement Learning

Double Deep Q-Learning

Double DQN

Wednesday, November 14, 2018, 12:00 AM

@author: Ziyu Zhou

Double Q-learning

Double Q-learning, proposed by Hado van Hasselt et al. in their paper Deep Reinforcement Learning with Double Q-learning, addresses the overestimation problem of the Q-learning algorithm.

DQN with target network

Let’s first briefly review the deep Q-learning algorithm.

In the deep Q network (DQN), based on the Bellman optimality equation, the target is defined as

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t),$$

where $\theta_t$ denotes the parameters of the network at time $t$.

To update the parameters, we take a stochastic gradient descent (SGD) step

$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^{\text{DQN}} - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t),$$

where $\alpha$ is a scalar step size. Here the Q target $Y_t^{\text{DQN}}$ and the Q value $Q(S_t, A_t; \theta_t)$ use the same parameters and are updated together. This introduces strong correlations between the Q target and the Q value, which makes training very unstable: at every training step the Q values shift, but the target shifts as well.
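
To make the update concrete, here is a minimal PyTorch sketch of this single-network update. The network size, batch, and hyperparameters are illustrative, not from the paper; the point is that both $Y_t^{\text{DQN}}$ and $Q(S_t, A_t; \theta_t)$ come from the same `q_net`, which is exactly the coupling described above.

```python
# Minimal single-network DQN update (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # small Q network
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Dummy transition batch (S_t, A_t, R_{t+1}, S_{t+1}, done) for illustration.
s, a = torch.randn(32, 4), torch.randint(0, 2, (32, 1))
r, s_next = torch.randn(32, 1), torch.randn(32, 4)
done = torch.zeros(32, 1)

with torch.no_grad():  # Y_t is treated as a fixed regression target (semi-gradient)
    # Note: the target is computed with the SAME q_net used below -> correlated target.
    y = r + gamma * (1 - done) * q_net(s_next).max(dim=1, keepdim=True).values

q_sa = q_net(s).gather(1, a)            # Q(S_t, A_t; theta_t)
loss = 0.5 * (y - q_sa).pow(2).mean()   # squared TD error
optimizer.zero_grad()
loss.backward()
optimizer.step()                        # the SGD step on theta_t
```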

To break this kind of correlation, Mnih et al. propose to use a separate target network whose parameters are copied every $\tau$ steps from the original Q value network and are kept fixed on all other steps.

The target is now

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^{-}),$$
where $\theta_t^{-}$ represents the parameters of the target network.

Every $\tau$ steps, the parameters of the target network are updated as

$$\theta_t^{-} \leftarrow \theta_t.$$

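A hedged sketch of how this periodic copy is typically implemented; the names `q_net`, `target_net`, and the value of `tau` are assumptions for illustration, not from the paper.

```python
# Periodic target-network copy: theta^- is frozen between copies and
# overwritten with the online parameters theta every tau steps.
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)        # theta_0^- := theta_0
for p in target_net.parameters():
    p.requires_grad_(False)              # target parameters are kept fixed

tau = 1000                               # copy period (illustrative value)
for step in range(1, 5001):
    # ... compute Y_t with target_net and take an SGD step on q_net ...
    if step % tau == 0:
        target_net.load_state_dict(q_net.state_dict())  # theta_t^- <- theta_t
```
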
Problem of DQN

The max operator in the target

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^{-})$$

uses the same values both to select and to evaluate an action. The problem is that the action with the highest estimated Q value is not necessarily the best action to take in the next state. As illustrated in Thomas Simonini’s blog:

At the beginning of the training we don’t have enough information about the best action to take. Therefore, taking the maximum Q value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly given a higher Q value than the optimal best action, the learning will be complicated.
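
A tiny NumPy experiment (not from any of the referenced papers) makes this concrete: even when every action has the same true value, taking the maximum over noisy estimates is biased upwards.

```python
# Max over noisy Q estimates is biased upward, even with identical true values.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)                                    # all actions are equally good
noisy_estimates = true_q + rng.normal(scale=1.0, size=(10000, 4))

print(noisy_estimates.max(axis=1).mean())               # roughly 1.03 > 0: overestimation
print(true_q.max())                                     # 0.0, the true maximum
```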

Double DQN

Double DQN addresses the above problem by decoupling action selection and action evaluation:

  • Use our Q value network, i.e., the original online network, to select the best action to take for the next state (the action with the highest Q value).
  • Use our target network to calculate the target Q value of taking that action at the next state.

Now $Y_t$ becomes

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^{-} \right).$$

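In code, the decoupling amounts to an `argmax` through the online network followed by a `gather` through the target network. The following is a minimal sketch with assumed variable names, not the authors' implementation; in practice `target_net` would be the periodic copy shown earlier.

```python
# Double DQN target: online network selects the action, target network evaluates it.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # theta_t (online)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # theta_t^- (target)
gamma = 0.99

r, s_next = torch.randn(32, 1), torch.randn(32, 4)   # dummy batch
done = torch.zeros(32, 1)

with torch.no_grad():
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)                 # arg max_a Q(S_{t+1}, a; theta_t)
    y = r + gamma * (1 - done) * target_net(s_next).gather(1, best_a)  # Q(S_{t+1}, best_a; theta_t^-)
```
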
The following table shows the difference between the target $Y_t$ for DQN and Double DQN:

| Algorithm  | Target $Y_t$ |
| ---------- | ------------ |
| DQN        | $Y_t^{\text{DQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t \right)$ |
| Double DQN | $Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^{-} \right)$ |

References

  1. Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-learning.
  2. Mnih et al. Human Level Control Through Deep Reinforcement Learning.
  3. Thomas Simonini. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets.
