Deep Q Networks
author: Zhenye Na
The goal of reinforcement learning (Sutton and Barto, 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.
Background
Under a given policy $\pi$, the true value of an action $a$ in a state $s$ is
$$Q_\pi(s, a) \equiv \mathbb{E}\left[ R_1 + \gamma R_2 + \dots \mid S_0 = s, A_0 = a, \pi \right],$$
where $\gamma \in [0,1]$ is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then $Q_*(s, a) = \max_\pi Q_\pi(s, a)$.
An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
The standard Q-learning update for the parameters after taking action $A_t$ in state $S_t$ and observing the immediate reward $R_{t+1}$ and resulting state $S_{t+1}$ is then
$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^Q - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t),$$
where $\alpha$ is a scalar step size and the target $Y_t^Q$ is defined as
$$Y_t^Q \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t).$$
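To make the update concrete, here is a minimal NumPy sketch for the special case of a linear value function $Q(s, a; \theta) = \theta^\top \phi(s, a)$; the feature function `phi`, the action set, and the step-size values are illustrative assumptions, not part of the definitions above.

```python
import numpy as np

def q_learning_update(theta, phi, s_t, a_t, r_tp1, s_tp1, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step for a linear Q(s, a; theta) = theta . phi(s, a)."""
    # Current estimate Q(S_t, A_t; theta_t)
    q_sa = theta @ phi(s_t, a_t)
    # Target Y_t^Q = R_{t+1} + gamma * max_a Q(S_{t+1}, a; theta_t)
    y_t = r_tp1 + gamma * max(theta @ phi(s_tp1, a) for a in actions)
    # For a linear Q, the gradient w.r.t. theta is just the feature vector
    grad = phi(s_t, a_t)
    # theta_{t+1} = theta_t + alpha * (Y_t^Q - Q(S_t, A_t; theta_t)) * grad
    return theta + alpha * (y_t - q_sa) * grad
```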
Deep Q Networks
A deep Q network (DQN) is a multi-layered neural network that for a given state $s$ outputs a vector of action values $Q(s, \cdot; \theta)$, where $\theta$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$.
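As a rough sketch, such a network could be written in PyTorch as below; the fully connected architecture and layer sizes are illustrative assumptions (the original DQN used a convolutional network over stacked image frames).

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps an n-dimensional state to a vector of m action values Q(s, .; theta)."""
    def __init__(self, n_state_dims, m_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state_dims, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, m_actions),  # one output per action
        )

    def forward(self, state):
        return self.net(state)

# Usage: Q-values for a batch of 4-dimensional states and 2 actions
q_net = DQN(n_state_dims=4, m_actions=2)
q_values = q_net(torch.randn(32, 4))  # shape (32, 2)
```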
DQN overcomes unstable learning mainly through four techniques:
- Experience Replay
- Target Network
- Clipping Rewards
- Skipping Frames
Experience Replay
A deep neural network easily overfits to the most recent episodes, and once it is overfitted it is hard to produce diverse experiences. To solve this problem, experience replay stores experiences, including state transitions, rewards and actions, which are the data needed to perform Q-learning, and samples mini-batches from them to update the neural network.
For the experience replay (Lin, 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm (Mnih et al., 2015).
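A minimal replay buffer along these lines might look like the sketch below; the capacity and batch size are illustrative values, and the buffer simply stores (state, action, reward, next state, done) transitions and samples them uniformly.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples uniform mini-batches for Q-learning updates."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of consecutive transitions
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```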
Target Network
The target network, with parameters $\theta^-$, is the same as the online network except that its parameters are copied every $\tau$ steps from the online network, so that then $\theta_t^- \equiv \theta_t$, and kept fixed on all other steps. The target used by DQN is then
$$Y_t^{\mathrm{DQN}} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-).$$
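In code, this amounts to keeping a frozen copy of the online network, computing the target from that copy, and overwriting its parameters every $\tau$ steps. The PyTorch sketch below reuses the illustrative `DQN` module from above; the value of $\tau$ and the batched tensor shapes are assumptions.

```python
import copy
import torch

q_net = DQN(n_state_dims=4, m_actions=2)   # online network, parameters theta
target_net = copy.deepcopy(q_net)          # target network, parameters theta^-
target_net.eval()

TAU = 10_000   # copy period in steps (illustrative value)
gamma = 0.99

def dqn_target(reward, next_state, done):
    """Y_t^DQN = R_{t+1} + gamma * max_a Q(S_{t+1}, a; theta^-)."""
    with torch.no_grad():
        max_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * max_q

def maybe_sync(step):
    # Every TAU steps, copy theta into theta^-; it stays fixed on all other steps
    if step % TAU == 0:
        target_net.load_state_dict(q_net.state_dict())
```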
Clipping Rewards
Each game has a different score scale. For example, in Pong a player gets +1 point for winning a rally and -1 point for losing one, whereas in SpaceInvaders a player gets 10–30 points for defeating an invader. This difference in scale would make training unstable, so the reward clipping technique sets all positive rewards to +1 and all negative rewards to -1.
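In code, reward clipping is just taking the sign of the raw game score, as in this small sketch:

```python
import numpy as np

def clip_reward(reward):
    """Map any positive score to +1, any negative score to -1, and keep 0 as 0."""
    return float(np.sign(reward))

# e.g. Pong: clip_reward(1.0) -> 1.0; SpaceInvaders: clip_reward(30.0) -> 1.0
```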
Skipping Frames
ALE is capable of rendering 60 frames per second, but human players do not act that often, and the agent does not need to calculate Q-values on every frame. With the frame skipping technique, DQN calculates Q-values and selects an action only every 4 frames, and uses the past 4 frames as the network input. This reduces the computational cost and gathers more experience.
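A minimal environment wrapper illustrating frame skipping and frame stacking might look like the sketch below; the simplified `step`/`reset` interface returning `(frame, reward, done)` is an assumption for illustration.

```python
from collections import deque
import numpy as np

class SkipAndStack:
    """Repeat each chosen action for `skip` frames and stack the last `stack` frames as the state."""
    def __init__(self, env, skip=4, stack=4):
        self.env = env          # assumed Gym-style environment with step/reset
        self.skip = skip
        self.frames = deque(maxlen=stack)

    def reset(self):
        frame = self.env.reset()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames)

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.skip):      # Q-values are only computed once per `skip` frames
            frame, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(frame)
        return np.stack(self.frames), total_reward, done
```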
References
[1] Hado van Hasselt, Arthur Guez and David Silver. “Deep Reinforcement Learning with Double Q-learning”
[2] Thomas Simonini. “An introduction to Deep Q-Learning: let’s play Doom”