Reinforcement Learning Digest Part 2: Bellman Equations, Generalized Policy Iteration and Monte Carlo

Ahmed El-Khouly
5 min readNov 22, 2020

In the last article, I have introduced Reinforcement learning Markov Decision Process (MDP) framework, discounted expected rewards and value and policy functions definitions. In this article, we will continue the definition of the MDP framework explaining Bellman and Bellman optimality equations. Additionally we will have describe our first reinforcement learning algrithm: Monte Carlo. So let us start…

Bellman equations

In the last article, value and Q functions were defined as:

One important property of value and Q functions is that they can be expressed recursively. The value of state at time-step t can be expressed in terms of time-step t+1:

the last equation can rewritten as:

Similarly, Q-function can be expressed as:

Bellman Optimality equations

Now we have learnt about value and policy functions and bellman equations, we can start defining optimal value and policy functions. By optimal here we mean that they lead our agent to achieve its goal which is maximizing discounted cumulative returns.

Since value function is expressed as the expected discounted returns of states, this can lead us to a very logical definition of optimal policy with respect to value function:

Policy π is said to be better equal to policy π’ if and only if the

Please note that there can be more than one optimal policies. Optimal policy have associated value functions which can be expressed as:

As we have seen previously, Q-function for policy π expresses the expected cumulative returns for taking action a in state s and following the policy thereafter. It must then follow that the optimal value function can be defined as taking the maximum action on state s and following the optimal policy thereafter. Therefore, the optimal value function can be expressed as follows:

Then by following Q-function Bellman equation. the optimal value fucntion can be rewritten as:

Similarity, Q- function can expressed in since q∗ gives expected return for taking an action a in state s and follow the optimal policy thereafter:

Then by substitution of v* by LHS of eq 1, we can express Q* as follows: