Reinforcement Learning Digest Part 4: Deep Q-Network (DQN) and Double Deep Q-Networks (DDQN)
In the last article, we discussed Q-learning and saw its desirable convergence properties. Nevertheless, Q-learning has one fundamental limitation that prevents it from being applied to more complex RL tasks. During learning, Q-learning keeps a Q-value for every state-action pair. In FrozenLake with a 4x4 grid, there are 16 states and 4 actions, leading to a Q-table of size 4x4x4 = 64. The size of the Q-table grows in direct proportion to the number of states. This becomes limiting very quickly for RL tasks with much larger state spaces.
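To make the growth concrete, here is a minimal sketch of the table-size arithmetic. The helper name `q_table_entries` and the 1000x1000 grid are hypothetical and chosen only for illustration; the 4x4 and 8x8 sizes correspond to the standard FrozenLake maps.

```python
def q_table_entries(n_states: int, n_actions: int) -> int:
    """Number of Q-values a tabular agent has to store (one per state-action pair)."""
    return n_states * n_actions


# FrozenLake 4x4: 16 grid cells (states), 4 actions.
print(q_table_entries(4 * 4, 4))        # 64

# FrozenLake 8x8: 64 states, 4 actions.
print(q_table_entries(8 * 8, 4))        # 256

# Hypothetical 1000x1000 grid world with the same 4 actions.
print(q_table_entries(1000 * 1000, 4))  # 4000000
```

The table is still manageable for grid worlds, but the same arithmetic becomes impractical once states are described by many variables or by raw observations such as images.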
Usage of Neural Networks
So clearly we need a better way to approximate the Q-value function, one whose memory requirements are not directly proportional to the size of the state space. Neural networks (NNs) are known to be good at learning to approximate functions. It seems natural to use a NN to learn the Q-value function, using the Q-target as its training target.
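The Q-target mentioned here is commonly written as the one-step Q-learning target below. The symbols are assumptions, since the text does not define them: $\theta$ stands for the NN weights, $\gamma$ for the discount factor, and $(s, a, r, s')$ for a single transition.

```latex
y =
\begin{cases}
  r, & \text{if } s' \text{ is terminal} \\
  r + \gamma \max_{a'} Q(s', a'; \theta), & \text{otherwise}
\end{cases}
```

The NN then plays the role of the Q-table: instead of looking up $Q(s, a)$, the agent evaluates $Q(s, a; \theta)$, so memory scales with the number of weights rather than with the number of states.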
During back-propagation, the NN weights are updated to reduce the error between the predicted Q-value and the Q-target.
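One common way to write this update, assuming a squared-error loss and a learning rate $\alpha$ (neither is named in the text), is the semi-gradient step below. Here $y$ is the Q-target for the transition, treated as a constant when taking the gradient, and $\theta$ denotes the NN weights.

```latex
L(\theta) = \tfrac{1}{2} \bigl( y - Q(s, a; \theta) \bigr)^2

\theta \leftarrow \theta + \alpha \bigl( y - Q(s, a; \theta) \bigr) \nabla_{\theta} Q(s, a; \theta)
```

In practice the gradient is computed by back-propagation and the step is applied by an optimizer, so this formula is best read as a sketch of the plain gradient-descent case rather than the exact rule any particular implementation uses.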
Experience Replay Memory
During training, the agent continuously interacts with the environment to collect new experiences, which are used to train the NN to learn the Q-value function. Over long training runs, the NN weights are updated continuously, which can cause the NN to eventually forget older experiences. In addition, because experiences are collected in sequence from consecutive time steps, they are strongly correlated, and the NN can wrongly learn from that correlation. Experience replay memory is used to alleviate both problems. The idea is to store an experience tuple (state, action, reward, next-state, terminal flag) from every time step in a memory data structure. At every time step, the agent interacts with the environment, stores the resulting experience in memory, and then randomly samples a mini-batch from the memory. Q-value targets are then calculated for the mini-batch, and the NN is trained on it. During back-propagation the NN updates its weights to better approximate the targets, and this repeats until convergence.
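The replay mechanism described above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the names `Experience`, `ReplayMemory`, `q_targets` and `q_fn` are hypothetical, `q_fn` is a stand-in for the NN's forward pass, and the fixed-capacity buffer that evicts the oldest entries is an assumption the text does not state.

```python
import random
from collections import deque, namedtuple

# One experience tuple, matching the fields listed in the text.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayMemory:
    """Fixed-capacity buffer; the oldest experiences are dropped first (assumption)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_targets(batch, q_fn, gamma):
    """One-step targets: r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise.

    q_fn(state) is a hypothetical stand-in for the network: it returns a list
    with one Q-value per action.
    """
    targets = []
    for exp in batch:
        if exp.done:
            targets.append(exp.reward)
        else:
            targets.append(exp.reward + gamma * max(q_fn(exp.next_state)))
    return targets


# Usage with dummy integer states and a constant stand-in for the network.
memory = ReplayMemory(capacity=1000)
for t in range(10):
    memory.store(state=t, action=t % 4, reward=1.0, next_state=t + 1, done=(t == 9))

batch = memory.sample(4)
targets = q_targets(batch, q_fn=lambda s: [0.0, 0.5, 1.0, 0.25], gamma=0.9)
print(len(batch), targets)
```

In a real agent, `q_fn` would be the network's prediction for the next state, and the returned targets would be passed to the training step for the sampled mini-batch.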