Reinforcement Learning Digest Part 4: Deep Q-Networks (DQN) and Double Deep Q-Networks (DDQN)

In the last article, we discussed Q-learning and saw its desirable convergence properties. Nevertheless, Q-learning has one fundamental limitation that prevents it from being applied to more complex RL tasks. During learning, Q-learning stores the Q-value for every state-action pair. In FrozenLake with a 4x4 grid, there are 16 states and 4 actions, leading to a Q-table of size 16x4 = 64. The size of the Q-table grows linearly with the number of states, which becomes limiting very quickly for RL tasks with much larger state spaces.

Usage of Neural Networks

So clearly we need a better way to approximate the Q-value function, one whose memory requirements are not directly proportional to the size of the state space. Neural networks are known to be good function approximators, so it seems natural to use a NN to learn the Q-value function by regressing it toward the Q-learning target.
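For a transition (s, a, r, s'), the standard Q-learning target, written with network parameters θ and discount factor γ, is:

$$
y = r + \gamma \max_{a'} Q(s', a'; \theta)
$$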

During back-propagation, the NN weights are updated according to the following rule:
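Written in the standard form, as a gradient step on the squared TD error with learning rate α, the rule is:

$$
\theta \leftarrow \theta + \alpha \,\big(y - Q(s, a; \theta)\big)\, \nabla_{\theta} Q(s, a; \theta)
$$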

Experience Replay Memory

During training, the agent continuously interacts with the environment to gather new experiences, which are used to train the NN to learn the Q-value function. Over long training runs, the NN weights are continuously updated, causing the network to eventually forget old experiences. Additionally, because experiences are collected in sequences of consecutive time steps, the NN can wrongly learn correlations between them. Experience replay memory is used to alleviate both problems. The idea is to store experience tuples (state, action, reward, next-state, terminal flag) from every time step in a memory data structure. Then, at every time step, the agent interacts with the environment, stores the resulting experience in memory, and randomly samples a mini-batch from it. Q-value targets are calculated for the mini-batch and the NN is trained on it; during back-propagation the NN updates its weights to better approximate the targets until convergence.
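A minimal sketch of such a replay memory in Python, assuming a fixed capacity and uniform random sampling (the class and method names here are illustrative, not taken from any particular implementation):

```python
import random
from collections import deque, namedtuple

# One experience tuple: (state, action, reward, next_state, done)
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayMemory:
    """Fixed-size buffer that stores experiences and samples them uniformly at random."""

    def __init__(self, capacity):
        # deque drops the oldest experience once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```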

I tried Q-learning using a single DQN in the CartPole-v1 environment, and it did not converge after 200 episodes.

The issue with a single DQN is that the Q-value targets of the Q-learning update rule are estimated using the very same NN weights that are being updated in the same epoch. As the NN weights get updated, the next Q-targets are likely to change as well. This creates a moving target for the NN and causes a lot of instability during training.

Double Deep Q-Networks (DDQN)

In order to provide fixed targets for the policy DQN during training, a second DQN is introduced to approximate the Q-value targets. At the beginning, the weights of both NNs are initialized identically. Q-targets from the target DQN are used to train the policy DQN, and every tau time steps the weights of the target DQN are updated from the policy DQN. This provides stability during training and eliminates the moving-target issue.
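A minimal sketch of this setup in PyTorch; the network architecture, layer sizes, and the value of tau below are illustrative assumptions, not prescriptions:

```python
import torch.nn as nn

def build_q_net(state_dim, n_actions):
    # Small fully connected Q-network: state in, one Q-value per action out
    return nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, n_actions),
    )

state_dim, n_actions = 4, 2          # CartPole-v1 dimensions
policy_net = build_q_net(state_dim, n_actions)
target_net = build_q_net(state_dim, n_actions)

# Identical initialization for both networks
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

TAU = 1000  # illustrative sync interval, in time steps

def maybe_sync(step):
    # Every TAU steps, copy the policy weights into the target network
    if step % TAU == 0:
        target_net.load_state_dict(policy_net.state_dict())
```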

One important detail concerns the Q-target calculation. During training of the policy DQN, the agent interacts with the environment at every time step, saves the obtained experience into memory, and then samples a mini-batch from memory. What is new here is how the Q-value target is calculated: the policy DQN predicts the Q-values for the next state of each experience and is used to select the maximal action, while the target DQN also predicts the Q-values of the next states. The Q-target is then formed from the target DQN's Q-value for the action selected by the policy network.
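A sketch of this target calculation for a sampled mini-batch, assuming the policy_net/target_net pair from the previous snippet and batched tensors rewards, next_states, and dones (all names and the discount factor value are illustrative):

```python
import torch

GAMMA = 0.99  # discount factor, illustrative value

def ddqn_targets(policy_net, target_net, rewards, next_states, dones):
    """Compute DDQN Q-value targets for a mini-batch of transitions."""
    with torch.no_grad():
        # Policy network selects the greedy action in each next state ...
        next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates that action
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions contribute only the immediate reward
        return rewards + GAMMA * next_q * (1.0 - dones)
```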

Training DDQN in the CartPole-v1 environment led to much better convergence characteristics, and the agent performed well during testing as well.
