Markov Decision Process, MDP

Photo by Marcus Hjelm on Unsplash

A Markov Decision Process (MDP) provides the rigorous mathematical foundation underlying all policy evaluation and policy improvement methods in Reinforcement Learning (RL). Through an MDP, we can formally describe the interaction between an agent and its environment, and define the value of a policy in terms of its expected return.

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a model for sequential decision-making in situations where outcomes are uncertain. Originating in operations research in the 1950s, the MDP framework has since become influential across many fields. Its name reflects its connection to Markov chains, introduced by the Russian mathematician Andrey Markov.

An MDP is the formal mathematical model of reinforcement learning. It provides a simple yet expressive framework for learning from interactions to achieve a goal. These interactions refer to the continual exchange between an agent and its environment, as illustrated below.

The agent and the environment interact over a sequence of time steps, t=0,1,2,3,\cdots. At each time step t, the agent receives a state representation from the environment, S_t \in \mathcal{S}. Based on this state, the agent selects an action, A_t \in \mathcal{A}(s). At the next time step t+1, the agent receives a reward, R_{t+1} \in \mathcal{R} \subset \mathbb{R}, and simultaneously observes its new state S_{t+1}.

The agent–environment interaction in a Markov decision process. (source: Reinforcement Learning: An Introduction, 2nd)

Thus, the MDP together with the agent generates a trajectory of the following form:

S_0, A_0,R_1,S_1, A_1,R_2,S_2, A_2,R_3,\cdots

A complete MDP is defined by the following five components:

(\mathcal{S}, \mathcal{A}, p, r, \gamma)

State Space

The state space, denoted by \mathcal{S}, represents all possible states. At each time step t, the agent occupies some state S_t \in \mathcal{S}. The current state S_t contains all information observable to the agent at time step t, and is used to determine the next action A_t. Therefore, the state must include all features of past agent–environment interactions that may influence future outcomes. When this condition is satisfied, the state is said to possess the Markov property:

\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_0, A_0, \cdots, S_t, A_t)

In other words, when the agent is in state S_t, it can discard all previous states and actions, (S_0, A_0, \cdots, S_{t-1}, A_{t-1}). The future depends only on the present, not the past.

Action Space

The action space, denoted by \mathcal{A}, defines the set of actions available to the agent.
For each state s, the set of allowable actions is denoted by \mathcal{A}(s).

State-Transition Probabilities

After choosing an action a in state s, the environment may transition to various possible next states. This is governed by the state-transition dynamics p, which determine the next state s^\prime and reward r.

\displaystyle p(s^\prime, r \mid s, a) \doteq \Pr\{S_t=s^\prime, R_t=r \mid S_{t-1}=s, A_{t-1}=a\} \\\\ \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s^\prime, r \mid s, a) = 1, \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \\\\ p(s^\prime \mid s, a) \doteq \Pr\{S_t=s^\prime \mid S_{t-1}=s, A_{t-1}=a\} = \sum_{r \in \mathcal{R}} p(s^\prime, r \mid s, a)
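To make this concrete, the dynamics can be stored as a lookup table. The sketch below is purely illustrative: the two-state MDP, its state and action names, and all probabilities and rewards are invented. It checks that p(s^\prime, r \mid s, a) sums to 1 for every (s, a) and computes the marginal p(s^\prime \mid s, a):

```python
# A hypothetical two-state MDP; all names, probabilities, and rewards
# are invented for illustration. P[s][a] is a list of
# (probability, next_state, reward) triples encoding p(s', r | s, a).
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}

def check_normalized(P):
    """Verify sum over (s', r) of p(s', r | s, a) = 1 for every (s, a)."""
    for s, actions in P.items():
        for a, outcomes in actions.items():
            total = sum(prob for prob, _, _ in outcomes)
            assert abs(total - 1.0) < 1e-12, (s, a, total)

def marginal_next_state(P, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    probs = {}
    for prob, s_next, _ in P[s][a]:
        probs[s_next] = probs.get(s_next, 0.0) + prob
    return probs

check_normalized(P)
print(marginal_next_state(P, "low", "search"))  # {'high': 0.6, 'low': 0.4}
```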

Reward

The set of possible rewards is denoted by \mathcal{R} \subset \mathbb{R}. A reward r \in \mathcal{R} is the immediate feedback produced when the agent takes action a in state s and transitions to a new state s^\prime.

\displaystyle r(s, a, s^\prime) \doteq \mathbb{E} \left[ R_t \mid S_{t-1}=s, A_{t-1}=a, S_t=s^\prime \right] = \sum_{r \in \mathcal{R}} r \frac{p(s^\prime, r \mid s, a)}{p(s^\prime \mid s, a)} \\\\ r(s, a) \doteq \mathbb{E} \left[ R_t \mid S_{t-1}=s, A_{t-1}=a \right] = \sum_{r \in \mathcal{R}} r \sum_{s^\prime \in \mathcal{S}} p(s^\prime, r \mid s, a)
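Both expected-reward formulas can be computed directly from a table of p(s^\prime, r \mid s, a). This is a sketch over hypothetical dynamics (all numbers invented for illustration):

```python
# A hypothetical dynamics table; P[s][a] is a list of
# (probability, next_state, reward) triples encoding p(s', r | s, a).
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}

def expected_reward(P, s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(prob * r for prob, _, r in P[s][a])

def expected_reward_given_next(P, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    num = sum(prob * r for prob, sn, r in P[s][a] if sn == s_next)
    den = sum(prob for prob, sn, _ in P[s][a] if sn == s_next)
    return num / den

# r("low", "search") = 0.6 * 2.0 + 0.4 * (-1.0) = 0.8
print(expected_reward(P, "low", "search"))
```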

Discount Factor

The discount factor, denoted \gamma \in \left[ 0,1 \right], quantifies how much future rewards are valued. When \gamma = 0, only immediate rewards matter; when \gamma = 1, all future rewards are valued equally with present rewards.

Returns and Episodes

The agent’s long-term objective is to maximize the cumulative reward; formally, it seeks to maximize the expected return. The return at time step t, denoted G_t, is in the simplest case the sum of future rewards:

\displaystyle G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T = \sum_{k=t+1}^{T} R_k

where T is the final time step.

When the interaction between the agent and the environment naturally breaks into subsequences, each subsequence is called an episode. Every episode ends in a special state known as a terminal state, and episodes are independent of one another. Such tasks are referred to as episodic tasks, as in the case of a single game.

When the interaction does not naturally break into episodes and continues indefinitely, the task is referred to as a continuing task, such as in long-lived robotic applications. In such cases, the final time step becomes T = \infty in the expression for G_t . Without additional structure, maximizing the return would cause G_t to diverge.

To address this issue, we apply discounting. The agent instead maximizes the expected discounted return:

\displaystyle G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where \gamma \in \left[ 0,1 \right] is the discount factor.

Returns from successive time steps are closely related, a property that is crucial in RL theory and algorithms:

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\\\ \hphantom{G_t} = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots) \\\\ \hphantom{G_t} = R_{t+1} + \gamma G_{t+1}

Although the discounted return consists of infinitely many terms, it remains finite as long as the reward is bounded and \gamma < 1 . For example, if the reward is constantly +1 , then:

\displaystyle G_t = \sum_{k=0}^\infty \gamma^k = \frac{1}{1 - \gamma}
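Both the recursion G_t = R_{t+1} + \gamma G_{t+1} and this geometric-series limit are easy to check numerically. A minimal sketch, using an arbitrary made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t computed backwards via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

gamma = 0.9
rewards = [1.0, 2.0, 3.0]  # arbitrary example rewards
G0 = discounted_return(rewards, gamma)
G1 = discounted_return(rewards[1:], gamma)
# The return at t equals the next reward plus the discounted return at t+1.
assert abs(G0 - (rewards[0] + gamma * G1)) < 1e-12

# With a constant reward of +1, a long truncation approaches 1 / (1 - gamma) = 10.
print(discounted_return([1.0] * 500, gamma))
```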

Policy

A policy, denoted \pi, specifies how the agent selects an action from the action set \mathcal{A}(s) available in each state s. Formally, a policy is a probability distribution over actions conditioned on the current state.

At time step t, when the agent is in state s, the probability that it chooses action a according to policy \pi is:

\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)

A policy may be deterministic, meaning the policy always selects a single specific action:

\pi(a \mid s) = \left \{ \begin{matrix} 1 & a = \pi(s) \\ 0 & \text{otherwise} \end{matrix} \right.

A policy may also be stochastic, in which case each available action has some probability of being selected.
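Both kinds of policy can be represented directly in code. In the sketch below, the state and action names and the probabilities are invented for illustration:

```python
import random

# Hypothetical state and action names; all probabilities are invented.
def deterministic_policy(s):
    """A deterministic policy: exactly one action per state."""
    return {"low": "search", "high": "wait"}[s]

# A stochastic policy: pi(a | s) is a probability distribution over actions.
stochastic_policy = {
    "low":  {"wait": 0.8, "search": 0.2},
    "high": {"wait": 0.3, "search": 0.7},
}

def sample_action(pi, s, rng=random):
    """Draw A_t ~ pi(. | s)."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy("low"))               # search
print(sample_action(stochastic_policy, "high"))  # wait or search
```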

Value Function

A value function evaluates a policy by estimating the expected return an agent can obtain when following that policy. In other words, under a policy \pi, the value function quantifies how much cumulative reward the agent can expect in the future.

State-Value Function

Under policy \pi, the state-value function measures the expected return when starting from state s. It is denoted by v_\pi(s):

\displaystyle v_\pi(s) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t=s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t=s \right], \forall s \in \mathcal{S}

Action-Value Function

Under policy \pi, the action-value function measures the expected return when starting from state s and taking action a. It is denoted by q_\pi(s, a), commonly referred to as the q-function:

\displaystyle q_\pi(s,a) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t=s, A_t=a \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t=s, A_t=a \right]

Relationship Between State-Value Function and Action-Value Function

The relationship between the state-value function v and the action-value function q is fundamental throughout reinforcement learning. The formulas in this section will appear repeatedly across RL algorithms.

Below is the backup diagram for v_\pi, where hollow circles represent states and solid circles represent actions. This diagram summarizes all elements discussed so far. In state s , an action a is selected according to policy \pi . Then, according to the state-transition probabilities p , the next state s^\prime is determined and a reward r is generated for reaching s^\prime.

Backup diagram (source: Reinforcement Learning: An Introduction, 2nd).

First, consider the relationship from the state-value function to the action-value function. In a given state s , the value v_\pi(s) is the expectation over all possible actions a weighted by the policy \pi(a \mid s) . In other words, the state-value function can be viewed as the expectation of the action-value function.

\displaystyle v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a)

Next, consider the recursive relationship from the action-value function back to the state-value function. After taking action a in state s , the value q_\pi(s, a) is the expectation over all possible next states s^\prime and corresponding rewards r , weighted by the state-transition probabilities p(s^\prime, r \mid s, a) . For each possible pair (s^\prime, r) , we sum the immediate reward r and the discounted future value \gamma v_\pi(s^\prime) . Thus, the action-value function equals the immediate reward plus the discounted expectation of the state-value function.

\displaystyle q_\pi(s, a) = \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_\pi(s^\prime) \right]
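These two formulas translate directly into code. In the sketch below, the dynamics table, the uniform-random policy, and the all-zero starting values are hypothetical choices for illustration; one backup from zero values makes q reduce to the expected immediate reward:

```python
# Hypothetical dynamics and a uniform-random policy, for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}
pi = {s: {a: 0.5 for a in P[s]} for s in P}
gamma = 0.9

def q_from_v(P, V, gamma):
    """q(s, a) = sum_{s', r} p(s', r | s, a) * [r + gamma * v(s')]."""
    return {s: {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]} for s in P}

def v_from_q(pi, Q):
    """v(s) = sum_a pi(a | s) * q(s, a)."""
    return {s: sum(pi[s][a] * Q[s][a] for a in Q[s]) for s in Q}

# One backup from all-zero values: q(s, a) is just the expected immediate reward.
Q = q_from_v(P, {s: 0.0 for s in P}, gamma)
V = v_from_q(pi, Q)
```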

Bellman Expectation Equation

The Bellman Equation, named after Richard E. Bellman, is a fundamental technique in Dynamic Programming. Bellman’s principle of optimality states that an optimization problem can be decomposed into a recursive structure of subproblems. The value of any state can be expressed as the immediate reward resulting from an initial choice plus the value of the subsequent subproblem induced by that choice.

Therefore, the essential idea of the Bellman equation is that the value of a state can be written as the expected immediate reward plus the expected value of the next state.

Bellman Expectation Equation for State-Value Function

Under a policy \pi , the state-value function v_\pi can be written recursively as:

\displaystyle v_\pi(s) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t=s \right] \\\\ \hphantom{v_\pi(s)} = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s \right] \\\\ \hphantom{v_\pi(s)} = \sum_a \pi(a \mid s) \sum_{s^\prime} \sum_r p(s^\prime, r \mid s, a) \Big[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s^\prime \right] \Big] \\\\ \hphantom{v_\pi(s)} = \sum_a \pi(a \mid s) \sum_{s^\prime} \sum_r p(s^\prime, r \mid s, a) \Big[ r + \gamma v_\pi(s^\prime) \Big], \forall s \in \mathcal{S}

This equation states that the value v_\pi(s) is the expectation over all possible actions a weighted by the policy \pi(a \mid s). Each action’s value is, in turn, the expected immediate reward plus the discounted value of the next state, weighted by the transition probabilities p(s^\prime, r \mid s, a). For this reason, the equation is known as the Bellman Expectation Equation.
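Read as an assignment rather than an equality, the Bellman expectation equation yields a simple fixed-point iteration for v_\pi (iterative policy evaluation). A minimal sketch, with invented dynamics and a uniform-random policy:

```python
# Hypothetical dynamics and a uniform-random policy, for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}
pi = {s: {a: 0.5 for a in P[s]} for s in P}
gamma = 0.9

def policy_evaluation(P, pi, gamma, tol=1e-10):
    """Sweep the Bellman expectation update until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = policy_evaluation(P, pi, gamma)
print(V)
```

At convergence, V satisfies the Bellman expectation equation in every state, which is easy to verify by recomputing the right-hand side.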

Bellman Expectation Equation for Action-Value Function

Applying the same recursive reasoning to the action-value function q_\pi(s,a) yields:

\displaystyle q_\pi(s, a) \doteq \mathbb{E}_\pi \left[ G_t \mid S_t=s, A_t=a \right] \\\\ \hphantom{q_\pi(s, a)} = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a \right] \\\\ \hphantom{q_\pi(s, a)} = \sum_{s^\prime} \sum_r p(s^\prime, r \mid s, a) \Big[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s^\prime \right] \Big] \\\\ \hphantom{q_\pi(s, a)} = \sum_{s^\prime} \sum_r p(s^\prime, r \mid s, a) \Big[ r + \gamma v_\pi(s^\prime) \Big], \forall s \in \mathcal{S}, a \in \mathcal{A}(s)

This equation describes the value of taking action a in state s as the expectation over the immediate reward r and the discounted future state-value, weighted by the environment’s transition probabilities p(s^\prime, r \mid s, a).

Optimal Policy

Solving a reinforcement learning task means discovering a policy that achieves high cumulative reward over the long run. For a finite MDP, an optimal policy can be defined precisely. Value functions impose a partial ordering over policies: if, for every state, the expected return of one policy \pi is greater than or equal to that of another policy \pi^\prime, then policy \pi is said to be better than or equal to \pi^\prime.

\pi \ge \pi^\prime \text{ if and only if } v_\pi(s) \ge v_{\pi^\prime}(s), \forall s \in \mathcal{S}

There always exists at least one policy that is better than or equal to all other policies. Such a policy is called an optimal policy. Although there may be multiple optimal policies, we denote the entire set of them by \pi_*.

All optimal policies share the same optimal state-value function, denoted v_*:

\displaystyle v_*(s) \doteq \max_{\pi} v_\pi(s), \forall s \in \mathcal{S}

They also share the same optimal action-value function, denoted q_*.

\displaystyle q_*(s, a) \doteq \max_{\pi} q_\pi (s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)

Bellman Optimality Equation

Bellman Optimality Equation for Optimal State-Value Function

Under an optimal policy \pi_*, the optimal state-value function v_* can be written in the following recursive form:

\displaystyle v_*(s)=\max_{a \in \mathcal{A}(s)} q_* (s, a) \\\\ \hphantom{v_*(s)} = \max_a \mathbb{E}_{\pi_*} \left[ G_t \mid S_t=s, A_t=a \right] \\\\ \hphantom{v_*(s)} = \max_a \mathbb{E}_{\pi_*} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a \right] \\\\ \hphantom{v_*(s)} = \max_a \mathbb{E} \left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t=s, A_t=a \right] \\\\ \hphantom{v_*(s)} = \max_a \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_*(s^\prime) \right]

Here, the operator \displaystyle \max_a indicates that, in each state, the agent chooses the action that maximizes the expected value. This captures the essence of an optimal policy. This recursive relationship is the Bellman Optimality Equation for v_*.
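Treating the Bellman optimality equation as an update rule gives a fixed-point iteration for v_* (the idea behind value iteration). A sketch over an invented two-state MDP:

```python
# Hypothetical dynamics, for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, tol=1e-10):
    """Sweep the Bellman optimality update (max over actions) to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = value_iteration(P, gamma)
print(V)
```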

Bellman Optimality Equation for Optimal Action-Value Function

Under an optimal policy \pi_*, the optimal action-value function q_* satisfies the following recursive form:

\displaystyle q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a^\prime} q_*(S_{t+1}, a^\prime) \mid S_t=s, A_t=a \right] \\\\ \hphantom{q_*(s, a)} = \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma \max_{a^\prime} q_*(s^\prime, a^\prime) \right]

This expression is the Bellman Optimality Equation for q_*.
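The same fixed-point iteration can be run directly on q_*, and once q_* is known, an optimal policy follows by acting greedily in each state. A sketch with invented dynamics:

```python
# Hypothetical dynamics, for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "search": [(0.6, "high", 2.0), (0.4, "low", -1.0)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "search": [(0.9, "high", 2.0), (0.1, "low", 2.0)]},
}
gamma = 0.9

def q_value_iteration(P, gamma, tol=1e-10):
    """Sweep the Bellman optimality update for q to a fixed point."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                q_new = sum(p * (r + gamma * max(Q[s2].values()))
                            for p, s2, r in P[s][a])
                delta = max(delta, abs(q_new - Q[s][a]))
                Q[s][a] = q_new
        if delta < tol:
            return Q

def greedy_policy(Q):
    """A deterministic optimal policy: act greedily with respect to q*."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

Q = q_value_iteration(P, gamma)
print(greedy_policy(Q))
```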

Conclusion

The MDP framework provides a precise and operational mathematical description for decision-making problems in reinforcement learning. It enables us to define policies, evaluate their long-term performance, and compare the quality of different policies. All subsequent control methods and optimal policy–search techniques ultimately trace back to this foundational set of definitions established by the MDP framework.
