On-Policy Control with Approximation

Photo by Paladuta Stefan on Unsplash
In practical control problems, the state and action spaces are often high-dimensional, continuous, and noisy, which makes tabular reinforcement learning algorithms difficult to apply directly. Once function approximation is introduced, the two components that theory keeps cleanly separated, value evaluation and policy improvement, become tightly intertwined, bringing with them challenges of stability and variance. This article focuses on on-policy control methods under function approximation, with particular attention to Sarsa.

The complete code for this chapter can be found in .

Control Problem

In control problems, the agent’s objective is no longer merely to evaluate a given policy, but to simultaneously perform value estimation and policy improvement through continuous interaction with the environment, ultimately approaching an optimal policy.

For episodic tasks, we first define the discounted return starting from time step t as:

\displaystyle G_t \doteq \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}

where \gamma \in [0,1] is the discount factor and T denotes the terminal time step of the episode. Based on this definition, the action-value function under a policy \pi can be written as:

q_\pi(s, a) \doteq \mathbb{E}_\pi [ G_t \mid S_t=s, A_t=a]

The goal of the control problem is to find an optimal policy \pi_* such that, for all states, the action selected by this policy maximizes the corresponding action-value. Formally, we seek

\displaystyle \pi_*(s) \in \text{argmax}_a q_*(s, a) \\\\ q_*(s, a) \doteq \max_\pi q_\pi(s, a)

That is, the optimal policy is implicitly defined by the optimal action-value function.

Under tabular methods, we can maintain an independent value Q(s, a) for each state–action pair (s, a), and gradually approximate q_* through TD control–based algorithms while interacting with the environment. This approach relies on the assumption that both the state space and the action space are sufficiently small, so that all state–action pairs can be explicitly enumerated and stored.
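As a baseline for contrast, the tabular case can be sketched in a few lines; the states, actions, and transition below are hypothetical toy values, not tied to any particular environment:

```python
from collections import defaultdict

# Tabular action-value store: one independent entry per (state, action) pair.
Q = defaultdict(float)

def sarsa_update(s, a, r, s_prime, a_prime, alpha=0.1, gamma=0.99):
    """One tabular Sarsa backup toward the TD target R + gamma * Q(s', a')."""
    target = r + gamma * Q[(s_prime, a_prime)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example transition on a toy grid: states and actions are just labels.
sarsa_update(s=(0, 0), a="right", r=-1.0, s_prime=(0, 1), a_prime="right")
```

Each update touches exactly one table entry, which is precisely the independence that function approximation gives up.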

Once the state space or action space becomes large, or even continuous, this assumption no longer holds. In such cases, tabular representations cease to be practical, and the control problem must move into the framework of function approximation, where the action-value function is represented in a parameterized form and the interaction between policy learning and value learning must be reconsidered.

In prediction problems, function approximation primarily affects the accuracy of value estimation. In control problems, however, approximation errors in the value function are directly fed back into policy improvement, thereby influencing subsequent data distributions and learning stability. This feedback loop is precisely why control problems under function approximation are inherently more challenging than their prediction counterparts.

Value Function Approximation

When the state space or action space becomes large, or even continuous, tabular representations are no longer feasible. In such cases, the action-value function must instead be approximated by a parameterized function:

\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)

where \mathbf{w} \in \mathbb{R}^d denotes a vector of learnable parameters. The object of learning is no longer the value associated with an individual state–action pair, but rather the representation of the entire action-value function in parameter space, that is, the parameter vector \mathbf{w} itself.

The introduction of function approximation brings about two fundamental changes.

First, different state–action pairs share the same set of parameters. As a result, an update derived from a single experience can simultaneously affect the value estimates of multiple states and actions. This parameter sharing endows the agent with generalization capability, allowing limited experience to be extrapolated to states that are rarely or never visited. At the same time, it also means that updates for different states are no longer independent; interference between states may occur, potentially leading to less stable learning dynamics.

Second, once function approximation is used, value estimation can no longer be viewed as the exact solution of a well-defined Bellman operator. Because the estimated value function is restricted to a finite-dimensional function class, the learning process can only search within this space for a solution that approximately satisfies Bellman consistency, rather than converging to the true fixed point.

Consequently, when function approximation is introduced into control problems, the learning objective itself changes. Instead of recovering the true optimal action-value function q_*, the goal becomes to learn an approximate action-value function, within the chosen feature representation and function class, that is sufficient to support effective policy improvement. It is precisely this shift that causes approximate control problems to be inherently accompanied by issues of stability and bias, making them a central challenge in both algorithm analysis and design.

It is worth emphasizing that, in prediction problems, function approximation primarily affects the accuracy of value estimation. In control problems, however, approximation errors are fed back through policy improvement into the subsequent data distribution, forming a closed feedback loop. This interaction explains why the combination of function approximation, bootstrapping, and policy improvement leads to significant difficulties in both theory and practice.

On-Policy

Although the representational capacity and theoretical guarantees change once function approximation is introduced, the control problem can still be understood, at a structural level, as an instance of Generalized Policy Iteration (GPI): an iterative process in which policy evaluation and policy improvement alternate. The key difference is that when the action-value function is approximated by \hat{q}(s, a, \mathbf{w}), neither step is exact; both can only be carried out approximately within the chosen function class.

Under the on-policy setting, the behavior policy is identical to the policy being evaluated. Concretely, the agent interacts with the environment according to an exploratory policy \pi (for example, an \varepsilon-greedy policy with respect to \hat{q}), and this same policy is used to define the TD target for updates. In other words, data generation and value-function evaluation are always governed by the same policy.

Policy improvement then proceeds by adjusting the policy based on the current approximate action-value function, gradually reducing the degree of exploration while moving toward a policy that is more greedy with respect to \hat{q}. As the parameter vector \mathbf{w} is updated, the behavior policy changes accordingly, which in turn affects the distribution of future experience.

Consequently, the entire on-policy control process can be viewed as an online, incremental approximation of GPI: at each step, only a small amount of recent experience is used to perform local updates to both the value function and the policy, rather than waiting for policy evaluation to fully converge before improvement. It is precisely this tightly interleaved update scheme that allows on-policy methods to exhibit relatively good stability when combined with function approximation.

Action-Value Function Approximation

Before moving on to specific control algorithms, it is necessary to clarify a fundamental question: once function approximation is introduced, how is the action-value function represented, and how is it updated?

Under the setting of linear function approximation, the basic assumption is that each state–action pair (s, a) can be mapped to a fixed-dimensional feature vector \mathbf{x}(s, a) \in \mathbb{R}^d. With this assumption, the action-value function is represented as a linear combination of features parameterized by a vector \mathbf{w} \in \mathbb{R}^d:

\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s, a)

In this representation, all states and actions share the same parameter vector \mathbf{w}. It is precisely this parameter sharing that enables function approximation to generalize: a single parameter update simultaneously affects all state–action pairs that are similar in feature space. However, this shared representation also comes at a cost: updates for different states may interfere with one another, a phenomenon that does not arise in tabular methods.

One important advantage of linear approximation is the simplicity of its gradient. Taking the gradient of the above approximation with respect to the parameters yields:

\nabla_\mathbf{w} \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)

As a result, all gradient-based TD control methods can express their parameter updates in a unified form:

\delta_t \doteq U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \\\\ \mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \\\\ \hphantom{\mathbf{w}_{t+1}} = \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)

Here, \alpha denotes the step size, while the TD target U_t and the corresponding TD error \delta_t are determined by the specific control algorithm being used. For this reason, in practical implementations, the critical aspect of linear approximation is not how the gradient is computed, but how the features are designed.

From a representational perspective, linear action-value approximation is equivalent to assuming that the true action-value function q(s, a) can be projected onto the linear subspace spanned by the feature vectors. The outcome of learning is therefore not the recovery of the true optimal action-value function q_*, but rather the identification of a parameter vector \mathbf{w} such that \hat{q}(s, a, \mathbf{w}) satisfies an approximate form of Bellman consistency within the chosen feature space.
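A minimal numerical sketch of this linear form, using a hypothetical random feature map (the map itself is an assumption for illustration, not a recommended feature design):

```python
import numpy as np

d = 4  # feature dimension (illustrative)
w = np.random.default_rng(0).normal(size=d)

def x(s, a):
    """Hypothetical feature map: a fixed pseudo-random vector per (s, a)."""
    seed = hash((s, a)) % (2**32)
    return np.random.default_rng(seed).normal(size=d)

def q_hat(s, a, w):
    # Linear action-value approximation: q_hat(s, a, w) = w^T x(s, a).
    return w @ x(s, a)

# The gradient of q_hat with respect to w is simply the feature vector,
# which we can confirm against a finite-difference estimate.
features = x("s0", 0)
eps = 1e-6
num_grad = np.array([
    (q_hat("s0", 0, w + eps * np.eye(d)[i]) - q_hat("s0", 0, w)) / eps
    for i in range(d)
])
```

The finite-difference estimate `num_grad` matches `features` component by component, confirming \nabla_\mathbf{w} \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a).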

Episodic Semi-Gradient Sarsa

Under the tabular setting, the Sarsa update is given by:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \Big]

This update can be interpreted as using the action A_{t+1} actually taken in the next state to estimate the action-value under the current policy \pi, and then correcting the current estimate accordingly. For this reason, Sarsa is a canonical on-policy TD control method.

When the state–action space becomes too large to maintain Q(s, a) in a table, we instead represent the action-value function using linear function approximation:

\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s,a)

In this setting, the object being learned is no longer a scalar value associated with a particular (s, a), but the parameter vector \mathbf{w} itself. Our goal is to adjust \mathbf{w} so that, during interaction,

\hat{q}(S_t, A_t, \mathbf{w}) \approx q_\pi(S_t, A_t)

Therefore, once function approximation is introduced, the control problem becomes a parameter-learning problem: how to use the TD error to incrementally refine \mathbf{w} so that the approximate action-value function better matches the true values induced by the current policy.

For episodic tasks, the TD target U_t and TD error \delta_t in Sarsa are defined as:

U_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) \\\\ \delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \\\\ \hphantom{\delta_t} = U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)

Because the action-value function is represented in a parameterized form, updates require taking gradients with respect to the parameters. Combining this with the gradient of the linear approximation, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) = \mathbf{x}(S_t, A_t), the parameter update for episodic Sarsa can be written as:

\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \\\\ \hphantom{\mathbf{w}_{t+1}} = \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \mathbf{x}(S_t, A_t)

This method is referred to as episodic semi-gradient one-step Sarsa. The term semi-gradient indicates that, during the update, the component \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) appearing in the TD target is treated as a constant and its gradient is not taken. Although this approach deviates from full gradient descent on a well-defined error function, it nevertheless exhibits good practical stability under on-policy learning with linear function approximation.

Episodic Semi-Gradient Sarsa (source: Reinforcement Learning: An Introduction, 2nd).
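With linear features, one step of this update is only a few lines; the feature vectors, reward, and hyperparameters below are placeholders chosen for illustration:

```python
import numpy as np

alpha, gamma = 0.5, 0.9
w = np.zeros(3)

# Placeholder feature vectors for (S_t, A_t) and (S_{t+1}, A_{t+1}).
x_t = np.array([1.0, 0.0, 1.0])
x_tp1 = np.array([0.0, 1.0, 1.0])
r_tp1 = 1.0

# TD target bootstraps from the next pair's value; being a semi-gradient
# method, no gradient flows through this term.
u_t = r_tp1 + gamma * (w @ x_tp1)
delta_t = u_t - (w @ x_t)

# Semi-gradient Sarsa update: w <- w + alpha * delta_t * x(S_t, A_t).
w = w + alpha * delta_t * x_t
```

Starting from w = 0, the target is 1.0, the TD error is 1.0, and the update moves only the components active in x(S_t, A_t).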

Expected Sarsa with Function Approximation

Episodic Semi-Gradient Sarsa uses the action value corresponding to the actual next action taken as the TD target. Conceptually, this faithfully reflects the current behavior policy. However, this design also introduces a practical issue: the variance of the TD target is directly influenced by exploratory behavior.

When the policy is \varepsilon-greedy, even after the value function has largely converged, the agent will still take non-greedy actions with probability \varepsilon. These occasional exploratory actions can cause significant fluctuations in Q(S_{t+1}, A_{t+1}), leading to highly unstable TD targets across time steps. As a result, the direction of parameter updates becomes sensitive to random sampling, which can degrade both learning efficiency and stability.

Expected Sarsa is an improved variant proposed to address this issue. Its core idea is that, since the randomness of the TD target arises from sampling the next action, and since the distribution over actions is already explicitly defined by the current policy \pi, one can take the expectation over action values directly, rather than performing another random sample. Specifically, at state S_{t+1}, the Expected Sarsa update is given by:

\displaystyle Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]

Under this formulation, exploratory behavior is still fully preserved in the policy \pi, but its effect is incorporated in an averaged manner through the expectation, rather than being injected directly into the TD target as high-variance noise.

Combining linear function approximation with the semi-gradient update scheme, the parameter update for Expected Sarsa can be written as:

\displaystyle \mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \\\\ \hphantom{\mathbf{w}_{t+1}} = \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \mathbf{x}(S_t, A_t)

From an implementation perspective, this update is almost identical to that of semi-gradient Sarsa; the only difference lies in how the TD target is computed. Despite being a seemingly minor modification, this change has a substantial impact on learning behavior. Because the TD target no longer depends on a single randomly selected action, Expected Sarsa typically produces smoother update directions and is less sensitive to the choice of step size. In environments with high reward variance or high exploratory risk, Expected Sarsa often exhibits more stable convergence behavior than standard Sarsa.
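The expected TD target under an \varepsilon-greedy policy can be computed directly from the next-state action values; the Q-values and hyperparameters here are illustrative only:

```python
import numpy as np

epsilon, gamma = 0.1, 0.9
r_tp1 = 0.0
q_next = np.array([1.0, 3.0, 2.0])  # q_hat(S_{t+1}, a, w) for each action

# epsilon-greedy probabilities: uniform epsilon mass plus the greedy bonus.
n_actions = len(q_next)
pi = np.full(n_actions, epsilon / n_actions)
pi[np.argmax(q_next)] += 1.0 - epsilon

# Expected Sarsa target: average over the policy instead of sampling A_{t+1}.
expected_q = pi @ q_next
target = r_tp1 + gamma * expected_q
```

Because the expectation is taken analytically, the target is deterministic given S_{t+1}, which is exactly the variance reduction discussed above.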

Example

Implementation of Semi-Gradient Sarsa Using Neural Network

In the following code, we implement Semi-Gradient Sarsa as an on-policy TD control method and use a neural network to approximate the action-value function \hat{q}(s, a, \mathbf{w}). Because the TD target itself contains an estimate of the next step (i.e., bootstrapping), and this estimate is also produced by the current approximator, propagating gradients through the target would cause the update to deviate from the standard TD formulation.

Therefore, this implementation adopts a semi-gradient update: the TD target is treated as a constant, and gradients are taken only with respect to the current prediction \hat{q}(S_t, A_t, \mathbf{w}) when updating the parameters, while no gradient is propagated through the target term. In other words, backpropagation is applied solely to the current estimate, preserving the characteristic update structure and learning behavior of TD control.

import numpy as np
import torch
from torch import nn, optim


class MLPQNetwork(nn.Module):
    def __init__(self, observation_dim: int, n_actions: int, hidden_dim: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(observation_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.network(s)


class MLPSarsa:
    def __init__(
        self,
        observation_dim: int,
        n_actions: int,
        *,
        hidden_dim: int,
        lr=1e-4,
        grad_clip=5.0,
        gamma=0.99,
        epsilon=0.1,
    ):
        self.observation_dim = observation_dim
        self.n_actions = n_actions
        self.grad_clip = grad_clip
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_network = MLPQNetwork(observation_dim, n_actions, hidden_dim)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

    @torch.no_grad()
    def q_s(self, s: np.ndarray) -> np.ndarray:
        x = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
        q = self.q_network(x).squeeze(0)
        return q.numpy()

    def act(self, s: np.ndarray, greedy: bool = False) -> tuple[int, np.ndarray]:
        q = self.q_s(s)
        if not greedy and np.random.rand() < self.epsilon:
            a = np.random.randint(self.n_actions)
        else:
            ties = np.flatnonzero(np.isclose(q, q.max()))
            a = np.random.choice(ties)
        p = np.ones(self.n_actions) * (self.epsilon / self.n_actions)
        greedy_a = np.argmax(q)
        p[greedy_a] += 1.0 - self.epsilon
        return a, p

    def update(self, s: np.ndarray, a: int, r: float, s_prime: np.ndarray, terminated: bool) -> float:
        s_t = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
        s_tp1 = torch.tensor(s_prime, dtype=torch.float32).unsqueeze(0)

        q_t_s = self.q_network(s_t).squeeze(0)
        q_t_sa = q_t_s[a]

        with torch.no_grad():
            if terminated:
                target = torch.tensor(r, dtype=torch.float32)
            else:
                # The next action is re-sampled from the current epsilon-greedy
                # policy; an independent draw from the same policy keeps the
                # TD target on-policy in expectation.
                a_tp1, _ = self.act(s_prime)
                q_tp1_s = self.q_network(s_tp1).squeeze(0)
                target = r + self.gamma * q_tp1_s[a_tp1]

        loss = 0.5 * (target - q_t_sa) ** 2

        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.q_network.parameters(), self.grad_clip)
        self.optimizer.step()

        return float(loss.item())

Implementation of Semi-Gradient Expected Sarsa Using Neural Network

In the following code, we implement Semi-Gradient Expected Sarsa as an on-policy TD control method and use a neural network to approximate the action-value function \hat{q}(s, a, \mathbf{w}). The overall training procedure is identical to that of Semi-Gradient Sarsa; the only difference lies in how the TD target is defined.

Because the TD target contains the expected action-value at the next state, and this expectation is also computed using the current function approximator, the update still follows a semi-gradient approach. Specifically, the TD target is treated as a constant, and gradients are backpropagated only through the current prediction \hat{q}(S_t, A_t, \mathbf{w}), with no gradient propagated through the target term.

In other words, when implemented with a neural network, Expected Sarsa does not alter the fundamental structure of the parameter update. Instead, it replaces the single-sample next-action estimate used in Sarsa with an expectation over action values. This design yields a smoother TD target and, while preserving the on-policy framework, further reduces the variance of the updates.

import numpy as np
import torch
from torch import nn, optim

from mlp_q_network import MLPQNetwork


class MLPExpectedSarsa:
    def __init__(
        self,
        observation_dim: int,
        n_actions: int,
        *,
        hidden_dim: int,
        lr=1e-4,
        grad_clip=5.0,
        gamma=0.99,
        epsilon=0.1,
    ):
        self.observation_dim = observation_dim
        self.n_actions = n_actions
        self.grad_clip = grad_clip
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_network = MLPQNetwork(observation_dim, n_actions, hidden_dim)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

    @torch.no_grad()
    def q_values(self, s: np.ndarray) -> np.ndarray:
        x = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
        q = self.q_network(x).squeeze(0)
        return q.numpy()

    def act(self, s: np.ndarray, greedy: bool = False) -> tuple[int, np.ndarray]:
        q = self.q_values(s)
        if not greedy and np.random.rand() < self.epsilon:
            a = np.random.randint(self.n_actions)
        else:
            ties = np.flatnonzero(np.isclose(q, q.max()))
            a = np.random.choice(ties)
        p = np.ones(self.n_actions) * (self.epsilon / self.n_actions)
        greedy_a = np.argmax(q)
        p[greedy_a] += 1.0 - self.epsilon
        return a, p

    def update(self, s: np.ndarray, a: int, r: float, s_prime: np.ndarray, terminated: bool) -> float:
        s_t = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
        s_tp1 = torch.tensor(s_prime, dtype=torch.float32).unsqueeze(0)

        q_t_s = self.q_network(s_t).squeeze(0)
        q_t_sa = q_t_s[a]

        with torch.no_grad():
            if terminated:
                target = torch.tensor(r, dtype=torch.float32)
            else:
                q_tp1_s = self.q_network(s_tp1).squeeze(0)
                # epsilon-greedy distribution over actions at S_{t+1}
                p = torch.ones(self.n_actions, dtype=torch.float32) * (self.epsilon / self.n_actions)
                a_star = torch.argmax(q_tp1_s).item()
                p[a_star] += 1.0 - self.epsilon

                expected_q_tp1 = torch.sum(q_tp1_s * p)
                target = r + self.gamma * expected_q_tp1

        loss = 0.5 * (target - q_t_sa) ** 2

        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.q_network.parameters(), self.grad_clip)
        self.optimizer.step()

        return float(loss.item())

Lunar Lander

LunarLander is a classic rocket-trajectory optimization problem. According to Pontryagin’s maximum principle, under an optimal policy the engine is either fired at full thrust or kept completely off. Accordingly, this environment uses a discrete action space, where each action corresponds to turning an engine on or off.

The action space contains four actions:

  • 0: Do nothing.
  • 1: Fire the left engine.
  • 2: Fire the main engine.
  • 3: Fire the right engine.

Each state is an 8-dimensional vector consisting of:

  • The lander’s x and y position.
  • The lander’s x and y linear velocity.
  • The lander’s angle.
  • The lander’s angular velocity.
  • Two boolean values indicating whether the left and right legs are in contact with the ground.

At each time step, the reward is shaped as follows:

  • The closer the lander is to the landing pad, the higher the reward; the farther away, the lower the reward.
  • The slower the lander is moving, the higher the reward; the faster it is moving, the lower the reward.
  • The more the lander is tilted, the lower the reward.
  • A bonus of +10 is given for each leg that is in contact with the ground.
  • A penalty of −0.03 is applied for each frame in which a side engine fires.
  • A penalty of −0.3 is applied for each frame in which the main engine fires.

When an episode terminates by either crashing or landing safely, an additional terminal reward is given:

  • Crash: −100.
  • Safe landing: +100.

An episode is considered solved if its total reward reaches 200 or higher.

In the following code, we train Expected Sarsa and Sarsa for 1000 episodes.

import time

import gymnasium as gym

from expected_sarsa_nn import MLPExpectedSarsa
from sarsa_nn import MLPSarsa

GYM_ID = "LunarLander-v3"

N_EPISODES = 1000
MAX_STEPS = 10_000

ALGO = "Sarsa"
# ALGO = "Expected Sarsa"


def play_game(agent: MLPSarsa | MLPExpectedSarsa, episodes=1):
    visual_env = gym.make(GYM_ID, render_mode="human")

    for episode in range(episodes):
        state, _ = visual_env.reset()
        terminated = False
        truncated = False
        total_reward = 0.0
        step_count = 0

        input("Press Enter to play game")

        print(f"Episode {episode + 1} starts")
        while not terminated and not truncated:
            action, _ = agent.act(state, greedy=True)
            state, reward, terminated, truncated, _ = visual_env.step(action)
            total_reward += reward
            step_count += 1
            visual_env.render()

        print(f"Episode {episode + 1} is finished: Total reward is {total_reward}, steps = {step_count}")
        time.sleep(1)

    visual_env.close()


def train(env: gym.Env, agent: MLPSarsa | MLPExpectedSarsa):
    for i_episode in range(1, N_EPISODES + 1):
        print(f"\rEpisode: {i_episode}/{N_EPISODES}", end="", flush=True)

        s, _ = env.reset()
        done = False
        G = 0.0
        steps = 0

        a, _ = agent.act(s)

        while not done and steps < MAX_STEPS:
            s_prime, r, terminated, truncated, _ = env.step(a)
            G += r
            done = terminated or truncated
            if done:
                a_prime = 0
            else:
                a_prime, _ = agent.act(s_prime)

            # Bootstrap unless the episode truly terminated; a truncated
            # episode still has value beyond the cutoff.
            agent.update(s, a, r, s_prime, terminated)
            s, a = s_prime, a_prime
            steps += 1


if __name__ == "__main__":
    env = gym.make(GYM_ID)
    if ALGO == "Sarsa":
        agent = MLPSarsa(
            observation_dim=env.observation_space.shape[0],
            n_actions=env.action_space.n,
            hidden_dim=256,
        )
    else:
        agent = MLPExpectedSarsa(
            observation_dim=env.observation_space.shape[0],
            n_actions=env.action_space.n,
            hidden_dim=256,
        )
    train(env, agent)
    env.close()
    play_game(agent)

Under the same experimental setup, we trained agents using Semi-Gradient Sarsa and Semi-Gradient Expected Sarsa in the LunarLander environment for a total of 1000 episodes, and observed markedly different learning behaviors. Agents trained with Sarsa tended to crash frequently during the early stages of training. In contrast, agents trained with Expected Sarsa were more inclined to adopt conservative strategies, gradually learning to decelerate and correct their attitude, and ultimately achieving relatively stable landings.

This difference can be directly understood from the way the TD target is defined in the two methods. In Semi-Gradient Sarsa, each update uses the action-value corresponding to the actual next action taken, A_{t+1}, as the learning target. As a result, the TD target is strongly influenced by the randomness introduced by exploratory behavior. In a high-risk continuous control environment such as LunarLander, even when the exploration rate \varepsilon is relatively small, occasional non-greedy actions can still induce drastic changes in attitude or velocity near the ground, leading to extreme negative rewards.

These high-variance TD targets are injected directly into the parameter updates. Under function approximation, a large error from a single sample does not only affect the current state, but also propagates to other states and actions through shared parameters. Consequently, the learned policy tends to overcorrect or exhibit aggressive control behaviors, manifesting as fast but highly unstable actions and ultimately resulting in frequent crashes.

By contrast, Expected Sarsa does not rely on a single randomly sampled action when performing TD updates. Instead, it takes the expectation over the action values of all possible actions at the next state. This averages the risk introduced by exploratory behavior into the learning signal, thereby substantially reducing the variance of the TD target. In the context of function approximation, such smoother update directions help prevent violent oscillations in the parameters across successive updates, allowing the agent to more readily learn gradual deceleration, fine-grained attitude corrections, and stable control near the ground.

As a result, although Expected Sarsa may learn more slowly in the early stages, it ultimately exhibits safer behavior that better aligns with the task’s requirement for precise control.

Overall, these experimental results clearly demonstrate that, in the setting of on-policy control with function approximation, the critical distinction between algorithms does not lie in whether a neural network is used, but in how the TD target handles the uncertainty introduced by exploratory behavior. Through its expectation-based update mechanism, Expected Sarsa achieves a better balance between stability and learning efficiency, making it particularly well suited for control problems like LunarLander, where action continuity and precision are highly critical.

Lunar Lander by Sarsa with Approximation Using Neural Network Trained for 1000 Episodes.
Lunar Lander by Expected Sarsa with Approximation Using Neural Network Trained for 1000 Episodes.

Conclusion

Once control problems enter the regime of function approximation, learning behavior is no longer determined solely by the name of the algorithm, but by the design of the TD target and how it handles the uncertainty introduced by exploration. Within the on-policy framework, Semi-Gradient Sarsa and Expected Sarsa share the same update structure and implementation pipeline; yet, the simple distinction of whether the TD target takes an expectation over the next action leads to markedly different stability and behavioral characteristics. As evidenced by the LunarLander experiments, Expected Sarsa trades smoother update directions for more stable and safer control behavior, clearly illustrating that, in on-policy control with function approximation, stability is often more critical than short-term learning speed. This observation reinforces the idea that understanding TD methods should go beyond the update equations themselves, and instead focus on how learning signals are constructed and how they shape the overall learning dynamics.
