Reinforcement Learning Digest Part 3: SARSA & Q-learning

Ahmed El-Khouly
3 min readNov 22, 2020

In the last article I have explained generalized policy iteration process and described our first reinforcement learning algorithm: Mote Carlo. In this article we will discuss the drawbacks of Monte Carlo and explore two other algorithms that can help the agent overcome shortcomings of Monte Carlo.

Monte Carlo algorithm learns from complete episodes. This can have the following drwabacks:

  • Monte Carlo cannot be used for continuous tasks.
  • Monte Carlo can be very slow for environments with long episodes.


RL algorithms that can update Q-value estimates without having to wait for complete episodes are called Temproal difference (TD). SARSA algorithm is an on-policy algorithm where Q-value is updated at every time-step where the state is not terminal. . Therefore, this learning in this algorithm happens for quintuple: S,A,R,S’,A’. The update rule for Q functions is as follows:

Remember Bellman equation for Q-function:

Therefore, the TD-Error becomes:

One last remark, updates are weighted by learning rate (α) the controls how fast do Q-function forgets old values in favour of new experiences.

Trying SARSA algorithm in FrozenLake-v0 environment, proved to converge much faster than Monte Carlo on-policy algorithm as it required almost order of magnitude less training episodes.


Q-Learning is another Temporal Difference algorithm that update Q-value every time step. What differentiates Q-Learning from SARSA is that the policy used to generate experiences, behaviour policy, is different from the policy being improved. Specifically, behaviour policy used in Q-Learning is ϵ-Greedy, just like Monte Carlo and SARSA, but Q-value is updated using greedy policy estimating Bellman optimality equation. In other words, the Q-value update rule in Q-Learning is estimating the Q-value target to be:

Ahmed El-Khouly

Technical lead of IBM Cognos recommenders system