Reinforcement Learning Digest Part 3: SARSA & Q-learning

Ahmed El-Khouly
Nov 22, 2020

In the last article I explained the generalized policy iteration process and described our first reinforcement learning algorithm: Monte Carlo. In this article we will discuss the drawbacks of Monte Carlo and explore two other algorithms that help the agent overcome them.

The Monte Carlo algorithm learns from complete episodes. This has the following drawbacks:

  • Monte Carlo cannot be used for continuing (non-terminating) tasks, because there is no end of episode to learn from.
  • Monte Carlo can be very slow in environments with long episodes, since every update has to wait until the episode ends.

SARSA

RL algorithms that can update Q-value estimates without having to wait for complete episodes are called temporal-difference (TD) methods. SARSA is an on-policy TD algorithm in which the Q-value is updated at every time-step where the state is not terminal. Each update therefore uses the quintuple S, A, R, S′, A′, which gives the algorithm its name. The update rule for the Q-function is as follows:
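Q(S, A) ← Q(S, A) + α · [ R + γ · Q(S′, A′) − Q(S, A) ]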

Recall the Bellman equation for the Q-function:
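Qπ(s, a) = Eπ[ R + γ · Qπ(S′, A′) | S = s, A = a ]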

Therefore, the TD-Error becomes:
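δ = R + γ · Q(S′, A′) − Q(S, A)

so the update rule can be written compactly as Q(S, A) ← Q(S, A) + α · δ.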

One last remark: updates are weighted by the learning rate (α), which controls how quickly the Q-function forgets old values in favour of new experiences.
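To make the procedure concrete, below is a minimal sketch of tabular SARSA in Python. It is illustrative only: it assumes a discrete environment object (not defined here) exposing reset() → state and step(action) → (next_state, reward, done), and a simple ε-greedy helper for action selection.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Run one episode of SARSA, updating the Q-table in place at every non-terminal step."""
    rng = rng or np.random.default_rng()
    state = env.reset()
    action = epsilon_greedy(Q, state, epsilon, rng)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        if done:
            td_error = reward - Q[state, action]  # terminal state: no bootstrapping
            Q[state, action] += alpha * td_error
        else:
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            td_error = reward + gamma * Q[next_state, next_action] - Q[state, action]  # the TD-error from the text
            Q[state, action] += alpha * td_error  # update weighted by the learning rate
            state, action = next_state, next_action
    return Q
```

Note that the next action A′ is selected with the same ε-greedy behaviour policy before the Q-value is updated, which is exactly what makes SARSA on-policy.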
