Generalized Policy Iteration, GPI

Photo by Alsu Vershinina on Unsplash

Generalized Policy Iteration (GPI) is not a single algorithm, but rather the fundamental framework underlying all Reinforcement Learning (RL) methods. It integrates policy evaluation and policy improvement, enabling algorithms to steadily approach the optimal policy \pi_* and the optimal state-value function v_* even under limited information.

Markov Decision Process, MDP

Before reading this article, the reader should already be familiar with the Markov Decision Process (MDP). If the MDP framework is still unfamiliar, please refer to the following article first.

Generalized Policy Iteration, GPI

In a Markov Decision Process (MDP), the long-term objective of the agent is to maximize the expected return. In other words, the fundamental goal of an MDP is to find a policy \pi_* that achieves the highest expected return across all states. This task is known as the control problem.

When solving the control problem, all algorithms follow a common structure in which policy evaluation and policy improvement continuously interact and alternate with one another. This unified framework, where evaluation and improvement mutually drive each other, is called Generalized Policy Iteration (GPI).

In practice, we often begin with an arbitrary or poorly informed policy whose value function is inaccurate; at the same time, we do not yet know what the optimal policy will be. GPI provides a general principle that allows any algorithm to progressively improve its policy by repeatedly performing two steps:

  • Policy Evaluation: Estimate the value of the current policy (i.e., its long-term performance).
  • Policy Improvement: Update the policy based on this value estimate, making it better than before.
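The two alternating steps can be sketched as a short loop. The following is a minimal illustration, not a production implementation: the 2-state, 2-action MDP, the transition table `P`, and the helper names `evaluate` and `improve` are all assumptions made up for this example.

```python
GAMMA = 0.9
# A toy MDP for illustration: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(pi, theta=1e-8):
    """Policy evaluation: iterate the Bellman expectation update to estimate v_pi."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def improve(v):
    """Policy improvement: act greedily with respect to the current estimate v."""
    return {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * v[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

pi = {0: 0, 1: 0}   # start from an arbitrary deterministic policy
while True:         # GPI: evaluate, then improve, until the policy is stable
    v = evaluate(pi)
    new_pi = improve(v)
    if new_pi == pi:
        break
    pi = new_pi
```

On this toy MDP the loop stabilizes at the policy that always takes action 1, with v(1) = 2 / (1 - 0.9) = 20 and v(0) = 1 + 0.9 × 20 = 19.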

Through the repeated interaction of these two processes, the policy gradually approaches the optimal policy \pi_*.

The diagram below illustrates how GPI alternates between policy evaluation and policy improvement, eventually converging to the optimal policy \pi_* and the optimal value function v_*.

GPI (source: Reinforcement Learning: An Introduction, 2nd).

The next diagram shows an alternative perspective. As GPI repeatedly performs policy evaluation and policy improvement, policy evaluation pulls the estimate toward the upper boundary v = v_\pi, while policy improvement pulls it toward the lower boundary \pi = \text{greedy}(v). The combined process moves progressively closer to the optimum.

GPI (source: Reinforcement Learning: An Introduction, 2nd).

Policy Evaluation

Given an arbitrary policy \pi, we compute its state-value function v_\pi. This process is called policy evaluation, and we treat it as a prediction problem.

Policy evaluation uses the Bellman expectation equation to assess the state-value function. It tells us how good the current policy \pi is.

\displaystyle v_\pi(s) \doteq \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s \right] \\\\ \hphantom{v_\pi(s)} = \sum_a \pi(a \mid s) \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_\pi(s^\prime) \right]
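This update can be turned directly into iterative policy evaluation: sweep over the states, applying the right-hand side until the value function stops changing. The sketch below runs on a made-up 2-state, 2-action MDP (the table `P` and the tolerance `theta` are illustrative assumptions).

```python
# A hypothetical MDP for illustration: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def policy_evaluation(pi, P, gamma=0.9, theta=1e-8):
    """Sweep v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(pi[s][a] * sum(p * (r + gamma * v[s2])
                                       for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:   # stop when the largest update is negligible
            return v

# Evaluate the uniform random policy pi(a|s) = 0.5.
random_pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
v = policy_evaluation(random_pi, P)
```

For this particular MDP the fixed point can be solved by hand (v(0) = 7.25, v(1) = 7.75), which is a useful sanity check on the iteration.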

Policy Improvement

Given an arbitrary policy \pi and its estimated state-value function v_\pi, we search for a better policy \pi^\prime such that \pi^\prime \ge \pi. This process is called policy improvement.

The idea behind policy improvement is as follows: suppose we have a deterministic policy \pi. For a given state s, we want to determine whether we should modify the policy so that it deterministically selects another action a \neq \pi(s). If the alternative action is better, we update (improve) the policy accordingly.

\displaystyle \pi^\prime(s) = \text{argmax}_a q_\pi(s, a)

More concretely, under policy \pi and state s, we can use the Bellman expectation equation of the action-value function to evaluate whether another action yields a higher return than the action originally selected by the policy.

\displaystyle q_\pi(s, a) \doteq \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a \right] \\\\ \hphantom{q_\pi(s, a)} = \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_\pi(s^\prime) \right]
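Given a model and a value estimate, this one-step lookahead is a single weighted sum per action. The snippet below is a sketch under assumed inputs: the transition table `P` and the value estimate `v` are invented for illustration, not taken from any real environment.

```python
# A hypothetical MDP for illustration: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def q_from_v(v, P, s, gamma=0.9):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]."""
    return {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]}

v = {0: 0.0, 1: 10.0}   # an assumed value estimate, chosen for round numbers
q0 = q_from_v(v, P, 0)  # q0[1] = 1 + 0.9 * 10 = 10.0, versus q0[0] = 0.0
```

Comparing `q0[a]` against `v[0]` is exactly the test the text describes: if some action's q-value exceeds the state's current value, switching to that action improves the policy.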

The Policy Improvement Theorem states that if

q_\pi(s, \pi^\prime(s)) \ge v_\pi(s), \forall s \in \mathcal{S}

then policy \pi^\prime must be at least as good as policy \pi. That is, \pi^\prime obtains an expected return greater than or equal to that of \pi in every state s \in \mathcal{S}.

v_{\pi^\prime}(s) \ge v_\pi(s), \forall s \in \mathcal{S}

The policy improvement theorem guarantees that any policy improvement step will never make a policy worse. It will only make it better or keep it the same. This provides GPI with the assurance that policy evaluation and policy improvement will not diverge; instead, they will continue to progress toward the optimal policy.
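This guarantee can be checked numerically. The snippet below is a sketch on an invented 2-state MDP: it evaluates a base policy, switches one state to an action whose q-value is at least the state's current value, and confirms the new policy's values dominate everywhere.

```python
GAMMA = 0.9
# A toy MDP for illustration: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(pi, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy pi."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v_old = evaluate({0: 0, 1: 0})   # base policy: always action 0 (v_pi is 0 everywhere)
# In state 0, q_pi(0, 1) = 1 + 0.9 * 0 = 1 >= v_pi(0) = 0, so switching is allowed.
v_new = evaluate({0: 1, 1: 0})   # improved policy differs only in state 0

# The theorem's conclusion: the new values dominate in every state.
assert all(v_new[s] >= v_old[s] for s in P)
```

Here v_new(0) = 1 / (1 - 0.9²) ≈ 5.26 and v_new(1) ≈ 4.74, both strictly above the old values of 0, so the single-state switch improved the policy everywhere.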

Therefore, according to the policy improvement theorem, for each state, we select the action that produces a higher expected return. By using this greedy policy, we perform a one-step lookahead and choose the action that appears best in the short term. We can thus obtain a new policy \pi^\prime through the greedy policy. Consequently, policy improvement will always yield a better policy, unless the original policy is already the optimal policy.

\displaystyle \pi^\prime(s) \doteq \text{argmax}_a q_\pi(s, a) \\\\ \hphantom{\pi^\prime(s)} = \text{argmax}_a \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a \right] \\\\ \hphantom{\pi^\prime(s)} = \text{argmax}_a \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_\pi(s^\prime) \right]

Here, \text{argmax}_a denotes the action a that maximizes the expression.
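The greedy step itself is a one-line argmax per state. The sketch below assumes a made-up 2-state MDP (`P`) and a value estimate `v`; both are illustrative, not from any real environment.

```python
# A hypothetical MDP for illustration: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def greedy(v, P, gamma=0.9):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    pi = {}
    for s in P:
        q = {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        pi[s] = max(q, key=q.get)   # pick the action with the highest q-value
    return pi

v = {0: 7.25, 1: 7.75}  # an assumed value estimate for illustration
new_pi = greedy(v, P)   # selects action 1 in both states for these values
```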

If the new greedy policy \pi^\prime turns out to be as good as, but not better than, the old policy \pi, then v_{\pi^\prime} = v_\pi, and this value function satisfies the Bellman optimality equation. In that case, both \pi and \pi^\prime must already be optimal policies.

\displaystyle v_{\pi^\prime}(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma v_{\pi^\prime}(S_{t+1}) \mid S_t=s, A_t=a \right] \\\\ \hphantom{v_{\pi^\prime}(s)} = \max_a \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_{\pi^\prime}(s^\prime) \right]

Conclusion

All algorithms in RL can be viewed as concrete instances of GPI. Their differences lie only in how they estimate the value function and how they improve the policy, while the underlying framework remains the same. The key insight of GPI is that policy evaluation does not need to fully converge, and policy improvement does not need to reach optimality in a single step. As long as the two processes continue to drive each other forward, the system will reliably converge toward the optimal solution. This enables RL to discover highly efficient policies even when the model is incomplete or when exact computation is infeasible, simply through iterative interaction.
