Reinforcement Learning

Double Deep Q-Learning

Double DQN

Wednesday, November 14, 2018, 12:00 AM

@author: Ziyu Zhou

Double Q-learning

Double Q-learning, proposed by Hado van Hasselt et al. in their paper Deep Reinforcement Learning with Double Q-learning, addresses the overestimation problem of the Q-learning algorithm.

DQN with target network

Let’s first briefly review the deep Q-learning algorithm.

In the deep Q network (DQN), based on the Bellman optimality equation, the target is defined as

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t),$$

where $\theta_t$ denotes the parameters of the network at time $t$.

To update the parameters, we take a stochastic gradient descent (SGD) step

$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^{\text{DQN}} - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t),$$

where $\alpha$ is a scalar step size. Here the Q target $Y_t^{\text{DQN}}$ and the Q value $Q(S_t, A_t; \theta_t)$ use the same parameters and are updated together. This introduces strong correlations between the Q target and the Q value, which makes training very unstable: at every training step the Q values shift, but the target shifts as well.
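
To make the update concrete, here is a minimal PyTorch sketch of this single-network update. The network size, batch, and hyperparameters are illustrative, not from the paper; the point is that both $Y_t^{\text{DQN}}$ and $Q(S_t, A_t; \theta_t)$ come from the same `q_net`, which is exactly the coupling described above.

```python
# Minimal single-network DQN update (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # small Q network
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Dummy transition batch (S_t, A_t, R_{t+1}, S_{t+1}, done) for illustration.
s, a = torch.randn(32, 4), torch.randint(0, 2, (32, 1))
r, s_next = torch.randn(32, 1), torch.randn(32, 4)
done = torch.zeros(32, 1)

with torch.no_grad():  # Y_t is treated as a fixed regression target (semi-gradient)
    # Note: the target is computed with the SAME q_net used below -> correlated target.
    y = r + gamma * (1 - done) * q_net(s_next).max(dim=1, keepdim=True).values

q_sa = q_net(s).gather(1, a)            # Q(S_t, A_t; theta_t)
loss = 0.5 * (y - q_sa).pow(2).mean()   # squared TD error
optimizer.zero_grad()
loss.backward()
optimizer.step()                        # the SGD step on theta_t
```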

To break this kind of correlation, Mnih et al. propose to use a separate target network whose parameters are copied every $\tau$ steps from the original Q value network and are kept fixed on all other steps.

The target is now

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^{-}),$$
where $\theta_t^{-}$ represents the parameters of the target network.

Every $\tau$ steps, the parameters of the target network are updated as

$$\theta_t^{-} \leftarrow \theta_t.$$

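A hedged sketch of how this periodic copy is typically implemented; the names `q_net`, `target_net`, and the value of `tau` are assumptions for illustration, not from the paper.

```python
# Periodic target-network copy: theta^- is frozen between copies and
# overwritten with the online parameters theta every tau steps.
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)        # theta_0^- := theta_0
for p in target_net.parameters():
    p.requires_grad_(False)              # target parameters are kept fixed

tau = 1000                               # copy period (illustrative value)
for step in range(1, 5001):
    # ... compute Y_t with target_net and take an SGD step on q_net ...
    if step % tau == 0:
        target_net.load_state_dict(q_net.state_dict())  # theta_t^- <- theta_t
```
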
Problem of DQN

The max operator in the target

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^{-})$$

uses the same values both to select and to evaluate an action. The problem is that the action with the highest estimated Q value is not necessarily the best action to take in the next state. As illustrated in Thomas Simonini’s blog:

At the beginning of the training we don’t have enough information about the best action to take. Therefore, taking the maximum Q value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly given a higher Q value than the optimal best action, the learning will be complicated.
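
A tiny NumPy experiment (not from any of the referenced papers) makes this concrete: even when every action has the same true value, taking the maximum over noisy estimates is biased upwards.

```python
# Max over noisy Q estimates is biased upward, even with identical true values.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)                                    # all actions are equally good
noisy_estimates = true_q + rng.normal(scale=1.0, size=(10000, 4))

print(noisy_estimates.max(axis=1).mean())               # roughly 1.03 > 0: overestimation
print(true_q.max())                                     # 0.0, the true maximum
```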

Double DQN

Double DQN addresses the above problem by decoupling action selection and action evaluation:

  • Use our Q value network, i.e., the original online network, to select the best action to take for the next state (the action with the highest Q value).
  • Use our target network to calculate the target Q value of taking that action at the next state.

Now $Y_t$ becomes

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^{-} \right).$$

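In code, the decoupling amounts to an `argmax` through the online network followed by a `gather` through the target network. The following is a minimal sketch with assumed variable names, not the authors' implementation; in practice `target_net` would be the periodic copy shown earlier.

```python
# Double DQN target: online network selects the action, target network evaluates it.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # theta_t (online)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # theta_t^- (target)
gamma = 0.99

r, s_next = torch.randn(32, 1), torch.randn(32, 4)   # dummy batch
done = torch.zeros(32, 1)

with torch.no_grad():
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)                 # arg max_a Q(S_{t+1}, a; theta_t)
    y = r + gamma * (1 - done) * target_net(s_next).gather(1, best_a)  # Q(S_{t+1}, best_a; theta_t^-)
```
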
The following table shows the difference between the target $Y_t$ for DQN and Double DQN:

| Algorithm  | Target $Y_t$ |
| ---------- | ------------ |
| DQN        | $Y_t^{\text{DQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t \right)$ |
| Double DQN | $Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q \left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^{-} \right)$ |

References

  1. Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-learning.
  2. Mnih et al. Human Level Control Through Deep Reinforcement Learning.
  3. Thomas Simonini. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets.
