# Reinforcement Learning Digest Part 1: Introduction & Finite Markov Decision Process framework

**Introduction**

Reinforcement learning is an important type of machine learning used in a vast range of applications and fields, including robotics, genetics, finance, and recommendation systems, to mention a few. In this series of articles, I aim to take the reader on a journey through this topic. The goal is to build knowledge of reinforcement learning starting from basic principles and gradually progress to more advanced aspects. The articles will balance theory with practical demos that help to practice the theory learnt and cement understanding. So let us start the journey…

**Definition**

Reinforcement learning can be defined as follows:

“Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward.”

- Wikipedia

From this definition, we see that we have a software agent that interacts with an environment by taking actions, each of which results in an immediate reward. The goal of reinforcement learning is for the agent to learn how to maximize the cumulative reward obtained from taking a sequence of such actions. One should note that the actions with the highest immediate rewards will not necessarily result in the best overall reward. Therefore, the goal of reinforcement learning is to learn how to maximize overall rewards, not just immediate ones.

## How is reinforcement learning different from other types of machine learning?

Supervised machine learning algorithms receive labeled samples. The label can be a class for classification tasks or a numeric value for regression tasks. The aim is to learn to provide labels for examples the algorithm has not seen before. The input samples are independent of each other, and during training they are sampled with equal probability. In unsupervised learning, the input is unlabeled samples, and the aim is to identify clusters or associations within the sample population.

Reinforcement learning is different from both types of ML in the following ways:

- **Input**: the input to a reinforcement learning algorithm is a representation of the state of the environment.

- **Learning method**: learning is achieved by receiving a numerical signal, the reward, for every action taken. Reinforcement learning agents learn from interaction with the environment.

- **Learning objective**: an optimal policy for making a sequence of actions that will maximize cumulative rewards in the long run.

- **Active algorithm**: given a state, the reinforcement learning agent takes actions that change the state of the environment; the very same environment the agent is trying to learn how to interact with.

- **State dependency**: states are not independent. The probability of ending up in a certain state depends on previous states.

Reinforcement learning algorithms can be categorized as model-based and model-free. In model-based algorithms, the agent uses a readily available predictive model to predict the outcome of actions in a specific environment. Such models are built for specific environments and often do not generalize to other environments. The other class is model-free reinforcement learning, where the agent does not know much about the environment it is interacting with and learns from trial and error. In this series of articles, we will focus on model-free reinforcement learning algorithms.

# Reinforcement Learning task types

RL tasks can be categorized as:

**Episodic tasks:** the interaction between agent and environment, the trajectory, can be divided into subsequences, called episodes, where every episode has a well-defined terminal state.

**Continuing tasks:** the interaction between agent and environment cannot be divided into episodes and tends to continue without limit.

**Finite Markov Decision Process**

The finite Markov Decision Process (MDP) provides a mathematical framework to formalize reinforcement learning problems. An MDP comprises the following components:

- **Environment:** the embodiment of the problem the agent interacts with through time

- **Agent:** the entity learning which actions to take to maximize overall rewards

- **State:** a representation of the environment at a specific time step. There is a finite set **S** of all states.

- **Reward:** the numerical signal the agent receives for taking actions

- **Action:** the agent's decision at each step. There is a finite set **A** of all actions.

In this framework, the agent interacts with the environment over discrete time steps t = 0, 1, 2, … . At each time step t, the agent is presented with the state of the environment and has to decide on the next action to take. Executing an action on the environment results in a reward, the state of the environment changes to the next state at the next time step, and the cycle repeats. This interaction produces a trajectory S0, A0, R1, S1, A1, R2, S2, … . In the MDP framework, the probability of the next state can be determined given the current state and action only:

$$p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\}$$
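The interaction loop above can be sketched in a few lines of Python. The two-state environment, its actions, and the random behaviour policy here are all made up for illustration; the point is only the shape of the loop, collecting the trajectory S0, A0, R1, S1, A1, R2, … :

```python
import random

# A toy environment (hypothetical): states 0, 1, 2, where state 2 is
# terminal. Action 1 moves the agent one state forward; action 0 stays put.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        if action == 1:
            self.state += 1
        reward = 1.0 if self.state == 2 else 0.0  # reward only at terminal state
        done = self.state == 2
        return self.state, reward, done

env = ToyEnvironment()
trajectory = []              # will hold (S_t, A_t, R_{t+1}) triples
state, done = env.state, False
while not done:
    action = random.choice([0, 1])               # a random behaviour policy
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state
```

Every episode of this toy task ends in the terminal state, so it is an episodic task in the sense defined above.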

Similarly, the next state and reward pair have a well-defined probability that depends only on the state and action in the preceding time step:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

This is significant, as it means that the current state alone is sufficient to make determinations about next states, without the need to keep track of the full history of state transitions leading up to the current state.


As the agent interacts with the environment, the rewards returned at each time step are accumulated into the return:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$$

But rewards from future time steps can be uncertain. For example, the environment may terminate before the agent reaches a certain time step, and the further a time step lies in the future, the higher the degree of uncertainty associated with it. This gives rise to discounting, where future rewards are discounted exponentially: the reward received k steps ahead, at step t+k+1, is multiplied by $\gamma^k$, where the discount rate $\gamma$ is between 0 and 1. Therefore, the expected discounted return can be expressed as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

which can be rewritten recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

From the last equation, the discounted return at step t is expressed as the sum of $R_{t+1}$ and the discounted return of step t+1.
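The discounted return and its recursive form can be checked numerically. The reward values and $\gamma = 0.9$ below are arbitrary, chosen only to illustrate the computation:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1},
    where rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Walking backwards applies the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 3.0]   # arbitrary example rewards
gamma = 0.9
g0 = discounted_return(rewards, gamma)       # G_t
g1 = discounted_return(rewards[1:], gamma)   # G_{t+1}
# The recursive identity holds: G_t == R_{t+1} + gamma * G_{t+1}
assert abs(g0 - (rewards[0] + gamma * g1)) < 1e-12
```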

**Functions of reinforcement learning**

Almost all reinforcement learning algorithms involve estimating value and policy functions, which are used to maximize the expected cumulative return.

**Policy function**

The policy function determines how the agent is going to behave given any state at any time step. In other words, it determines the probability of taking an action for any given state:

$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$$
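A stochastic policy can be represented concretely as a table mapping each state to a probability distribution over actions, from which the agent samples. The state and action names below are invented purely for illustration:

```python
import random

# A hypothetical stochastic policy pi(a | s): for each state, a
# distribution over actions (probabilities sum to 1 per state).
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Draw an action according to pi(. | state)."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(policy, "s0")   # "right" with probability 0.8
```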

**Value Function**

The value function assesses how “good” a given state is for the agent in the long run when following policy $\pi$. The “goodness” of a state is measured in terms of the expected discounted return:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

**Q Function**

The Q function measures how good it is to take a certain action in state s and follow policy $\pi$ thereafter. Once again, the “goodness” of a state-action pair is measured in terms of the expected discounted return:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
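Since $q_\pi(s, a)$ is an expectation over returns, one simple way to estimate it is to average the discounted returns observed after taking action a in state s. The sampled returns below are made-up numbers standing in for returns collected from real episodes:

```python
# Hypothetical sampled discounted returns G_t observed after each
# (state, action) pair while following some policy pi.
sampled_returns = {
    ("s0", "right"): [4.0, 5.0, 6.0],
    ("s0", "left"): [1.0, 2.0],
}

# Monte Carlo estimate: q_pi(s, a) ~= average of the sampled returns.
q = {}
for (state, action), returns in sampled_returns.items():
    q[(state, action)] = sum(returns) / len(returns)
```

Comparing the resulting estimates already suggests which action looks better in state "s0"; later articles in the series build algorithms around exactly this kind of estimate.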

In the next article of this series, I am going to discuss the Bellman and Bellman optimality equations, which are fundamental to any RL algorithm, cover generalized policy iteration, and then introduce our first RL algorithm.

# Articles in this series:

Reinforcement Learning Digest Part 3: SARSA & Q-learning

Reinforcement Learning Digest Part 4: Deep Q-Network(DQN) and Double Deep Q-Networks(DDQN)

# References:

Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto: http://incompleteideas.net/book/RLbook2020.pdf