Reinforcement Learning Digest Part 2: Bellman Equations, Generalized Policy Iteration and Monte Carlo

Ahmed El-Khouly
5 min read · Nov 22, 2020

In the last article, I introduced the Markov Decision Process (MDP) framework for reinforcement learning, discounted expected rewards, and the definitions of the value and policy functions. In this article, we continue with the MDP framework by explaining the Bellman equations and the Bellman optimality equations. We will also describe our first reinforcement learning algorithm: Monte Carlo. So let us start…

Bellman equations

In the last article, value and Q functions were defined as:
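In standard notation, these definitions can be written as follows. The symbols are assumptions, since they are not spelled out in this text: G_t is the discounted return from time-step t, γ is the discount factor, π is the policy, and S_t, A_t, R_t are the state, action and reward at time-step t.

```latex
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
     = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\
V_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
Q_\pi(s, a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s,\, A_t = a \right]
\end{aligned}
```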

One important property of value and Q functions is that they can be expressed recursively. The value of a state at time-step t can be expressed in terms of the immediate reward and the value of the state reached at time-step t+1:
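A sketch of this recursion in the standard notation (assumed here: G_t is the discounted return, γ the discount factor, and V_π the value function under policy π). It relies on the fact that the return itself is recursive, G_t = R_{t+1} + γ G_{t+1}:

```latex
\begin{aligned}
V_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s \right]
\end{aligned}
```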

The last equation can be rewritten as:
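Expanding the expectation over the actions chosen by the policy and over the environment dynamics gives the usual form of the Bellman equation for the value function. The notation is an assumption: π(a|s) is the probability of taking action a in state s, and p(s', r | s, a) is the probability of landing in state s' with reward r.

```latex
V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
           \left[ r + \gamma V_\pi(s') \right]
```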

Similarly, the Q-function can be expressed as:
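In the same assumed notation (π(a|s) for the policy, p(s', r | s, a) for the dynamics, γ for the discount factor), the recursive form of the Q-function is commonly written as below. The first action a is fixed, and the policy only enters when choosing the next action a':

```latex
\begin{aligned}
Q_\pi(s, a) &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right] \\
            &= \sum_{s',\, r} p(s', r \mid s, a)
               \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \right]
\end{aligned}
```

The two functions are linked by V_π(s) = Σ_a π(a|s) Q_π(s, a), so either one can be recovered from the other when the policy is known.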

Bellman Optimality equations

Now that we have learnt about value and policy functions and the Bellman equations, we can start defining optimal value and policy functions. By optimal we mean that they lead our agent to achieve its goal, which is maximizing the discounted cumulative return.
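A small numerical sketch can make "better policy" concrete. It applies the Bellman equation for the value function repeatedly until the values stop changing (iterative policy evaluation), then compares two policies state by state. The two-state MDP, the arrays `P` and `R`, and the function name `evaluate_policy` are all hypothetical and made up for illustration; rewards are simplified to an expected reward per state-action pair.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s2] = probability of moving to state s2 after action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0: action 0, action 1
    [[0.7, 0.3], [0.1, 0.9]],  # from state 1: action 0, action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])
GAMMA = 0.9  # discount factor


def evaluate_policy(policy, P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat V(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)]."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iters):
        q = R + gamma * (P @ v)            # q[s, a]: one-step lookahead
        v_new = (policy * q).sum(axis=1)   # average over the policy's actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


# policy[s, a] = probability of choosing action a in state s.
always_0 = np.array([[1.0, 0.0], [1.0, 0.0]])
always_1 = np.array([[0.0, 1.0], [0.0, 1.0]])

v0 = evaluate_policy(always_0, P, R, GAMMA)
v1 = evaluate_policy(always_1, P, R, GAMMA)

print("V under always-0:", v0)
print("V under always-1:", v1)
print("always-1 at least as good in every state:", bool(np.all(v1 >= v0)))
```

In this toy setup one policy has a higher value in every state, which is the kind of comparison the definition of an optimal policy is built on.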

Since the value function is the expected discounted return from a state, it leads to a natural definition of the optimal policy with respect to the value function:
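One common way to formalize this, following the standard textbook treatment, is to order policies by their value functions. The symbols are assumptions here: 𝒮 is the set of states, π_* an optimal policy, and V_* the optimal value function. A policy is at least as good as another if its value is at least as high in every state, and an optimal policy is at least as good as all others:

```latex
\begin{aligned}
\pi \geq \pi' \quad &\Longleftrightarrow \quad V_\pi(s) \geq V_{\pi'}(s) \quad \text{for all } s \in \mathcal{S} \\
V_*(s) &= \max_{\pi} V_\pi(s) \quad \text{for all } s \in \mathcal{S}
\end{aligned}
```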