
Bradley-Terry Model

In many machine learning and decision-making systems, what we encounter is not a directly measurable quality score, but rather a large number of preference judgments in the form of pairwise comparisons, that is, judgments about which of two options is better. Although such pairwise comparison data is simple in form, it implicitly contains rich structural information. Starting from its probabilistic semantics, this article explains step by step how the Bradley–Terry model can transform these preference comparisons into a learnable representation of latent utilities.
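As a quick sketch of the core idea (the function name and scores below are illustrative, not from the article): the Bradley–Terry model assigns each option a latent utility, and the probability that option i beats option j is a logistic function of the utility difference.

```python
import math

def bt_prob(s_i: float, s_j: float) -> float:
    """Bradley-Terry: P(i beats j) as a logistic function of the
    latent-utility difference s_i - s_j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

# Equal utilities give even odds; a higher utility gives a higher win probability.
p_even = bt_prob(1.0, 1.0)   # 0.5
p_high = bt_prob(2.0, 0.0)   # ~0.88
```

Because the comparison likelihood is a sigmoid of a score difference, fitting the utilities by maximum likelihood reduces to logistic regression on the observed wins and losses.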

Entropy

In probabilistic modeling and machine learning, entropy is a fundamental concept for quantifying uncertainty. It not only describes the inherent randomness of data, but also implicitly captures the minimum information cost required in prediction and modeling. Many learning objectives that may appear different on the surface, such as maximizing log-likelihood or designing loss functions, can in fact be traced back and understood through the lens of entropy.
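As a minimal illustration (the helper below is a sketch, not code from the article): Shannon entropy H(p) = -Σ p_i log2 p_i measures, in bits, how unpredictable a distribution is.

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i).
    Terms with p_i = 0 contribute nothing, by convention."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

h_fair = entropy([0.5, 0.5])   # 1.0 bit: maximal uncertainty over two outcomes
h_skew = entropy([0.9, 0.1])   # < 1 bit: a biased coin is more predictable
```

The same quantity is the minimum expected code length per symbol, which is why log-likelihood objectives can be read as minimizing an information cost.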

Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a frequency-based symbol merging algorithm that was originally proposed as a data compression method. In natural language processing (NLP), BPE has been reinterpreted as a subword tokenization technique that strikes a balance between characters and full words. By automatically learning high-frequency fragments from data, BPE can effectively construct a scalable vocabulary without relying on any language-specific knowledge.
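One merge iteration can be sketched as follows (the corpus and helper names are illustrative): count adjacent symbol pairs over the corpus, pick the most frequent pair, and replace it everywhere with a new merged symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of tokenized words.
    Keys are tuples of symbols; values are word frequencies."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = "".join(pair)
    out = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out[tuple(new_word)] = freq
    return out

corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 3}
pair = most_frequent_pair(corpus)        # ('l', 'o'), total count 8
corpus = merge_pair(corpus, pair)        # {('lo', 'w'): 5, ('lo', 't'): 3}
```

Repeating this loop a fixed number of times yields the learned merge table; the vocabulary size is controlled simply by the number of merges.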

Policy Gradient

In RL control problems, most methods take value functions as the core learning object, improving the policy indirectly by estimating long-term returns. However, when the state or action space becomes continuous, or when the policy itself must remain stochastic, this approach becomes less direct. Policy gradient methods adopt a different perspective by treating the policy itself as the object of optimization, directly performing gradient ascent on the expected return.
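As a toy sketch of this idea (everything below, including the bandit setup and function names, is illustrative rather than from the article): REINFORCE on a two-armed bandit performs stochastic gradient ascent on expected reward through a softmax policy over action preferences.

```python
import math
import random

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(true_means, steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a multi-armed bandit: sample an action from a
    softmax policy, observe a noisy reward, and ascend the score-function
    gradient  grad log pi(a)  weighted by that reward."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)
    for _ in range(steps):
        probs = softmax(prefs)
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        r = rng.gauss(true_means[a], 1.0)
        # grad of log pi(a) w.r.t. preference k is 1{k == a} - probs[k]
        for k in range(len(prefs)):
            prefs[k] += alpha * r * ((1.0 if k == a else 0.0) - probs[k])
    return softmax(prefs)

probs = reinforce_bandit([1.0, 0.0])  # the policy should come to favor arm 0
```

Note that the policy stays stochastic throughout, and no value function is ever estimated; the expected return is improved directly through the policy parameters.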

On-Policy Control with Approximation

In practical control problems, the state and action spaces are often high-dimensional, continuous, and noisy, which makes tabular reinforcement learning algorithms difficult to apply directly. Once function approximation is introduced, two components that are cleanly separated in theory, value evaluation and policy improvement, become tightly intertwined, bringing challenges related to stability and variance. This article focuses on on-policy control methods under function approximation, with particular attention to Sarsa.
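A single semi-gradient Sarsa update with a linear action-value function can be sketched like this (the feature vectors and names below are illustrative):

```python
def q_hat(w, x):
    """Linear action-value estimate q(s, a) = w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sarsa_update(w, x, r, x_next, alpha, gamma):
    """One semi-gradient Sarsa step: the TD target bootstraps on the
    features x_next of the *next on-policy action*; for a linear q the
    gradient with respect to w is just x."""
    td_error = r + gamma * q_hat(w, x_next) - q_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = sarsa_update(w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0], alpha=0.5, gamma=0.9)
# td_error = 1 + 0.9 * 0 - 0 = 1  ->  w = [0.5, 0.0]
```

Because the next action is chosen by the same (e.g. epsilon-greedy) policy being evaluated, evaluation and improvement happen through the same stream of experience, which is exactly where the stability questions enter.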

On-Policy Prediction with Approximation

This chapter focuses on on-policy prediction with approximation, systematically organizing the learning objectives for value estimation under this setting, the feasible learning methods, and the solutions to which they actually converge. By contrasting Gradient Monte Carlo with Semi-Gradient TD(0), we will see the unavoidable trade-offs that arise between theoretically well-defined objectives and methods that are practically viable.
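The contrast can be made concrete with the two update rules side by side, for a linear value function v(s) = w . x(s) (the code below is a sketch with illustrative names, not from the chapter):

```python
def gradient_mc_update(w, x, G, alpha):
    """Gradient Monte Carlo: regress v(s) = w . x toward the observed
    return G. This is true SGD on the mean-squared value error."""
    err = G - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + alpha * err * xi for wi, xi in zip(w, x)]

def semi_gradient_td0_update(w, x, r, x_next, alpha, gamma):
    """Semi-gradient TD(0): the target r + gamma * v(s') bootstraps on w
    itself, and that dependence is ignored when taking the gradient,
    hence 'semi-gradient'."""
    v = lambda feat: sum(wi * xi for wi, xi in zip(w, feat))
    err = r + gamma * v(x_next) - v(x)
    return [wi + alpha * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = gradient_mc_update(w, x=[1.0, 0.0], G=2.0, alpha=0.5)  # err = 2 -> w = [1.0, 0.0]
```

The MC update has an unbiased target but high variance and must wait for the return; the TD update is available every step but converges to a fixed point of the bootstrapped objective rather than the minimum of the value error.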

Dyna Architecture

In reinforcement learning (RL), an agent often needs to learn an effective decision policy under conditions where real interactions with the environment are limited and costly. Relying solely on real experience is conceptually straightforward, but it is often constrained by poor data efficiency and slow learning speed. Conversely, relying entirely on planning with a model may introduce bias when the model is inaccurate. The Dyna architecture was proposed to strike a balance between these two extremes by integrating acting, learning, and planning within a single learning process.
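The integration of acting, learning, and planning can be sketched as a Dyna-Q loop on a deterministic toy MDP (the MDP, the function, and all names below are illustrative assumptions, not from the article):

```python
import random

def dyna_q(transitions, start, goal, episodes=50, n_planning=20,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Dyna-Q on a deterministic toy MDP given as
    transitions[(s, a)] = (next_state, reward).
    Each real step does (1) a direct Q-learning update and
    (2) n_planning simulated updates replayed from a learned model."""
    rng = random.Random(seed)
    Q, model = {}, {}
    actions = sorted({a for (_, a) in transitions})
    for _ in range(episodes):
        s = start
        while s != goal:
            # epsilon-greedy over the actions available in s
            acts = [a for a in actions if (s, a) in transitions]
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda act: Q.get((s, act), 0.0))
            s2, r = transitions[(s, a)]
            # (1) learn from the real transition
            best = max((Q.get((s2, b), 0.0) for b in actions), default=0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))
            # (2) remember it, then plan from the model
            model[(s, a)] = (s2, r)
            for _ in range(n_planning):
                (ps, pa), (ps2, pr) = rng.choice(list(model.items()))
                pb = max((Q.get((ps2, b), 0.0) for b in actions), default=0.0)
                Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * pb - Q.get((ps, pa), 0.0))
            s = s2
    return Q

# A three-state chain: 0 -> 1 -> 2 (goal), with reward 1 on reaching the goal.
transitions = {
    (0, "right"): (1, 0.0), (0, "left"): (0, 0.0),
    (1, "right"): (2, 1.0), (1, "left"): (0, 0.0),
}
Q = dyna_q(transitions, start=0, goal=2)
```

Each unit of real experience is thus reused many times through the model, which is exactly how Dyna buys data efficiency, at the cost of inheriting any model bias.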

Temporal-Difference Learning (TD)

In Reinforcement Learning (RL), Dynamic Programming (DP) offers the most complete and mathematically explicit solution framework. However, its reliance on a known environment model makes it difficult to apply directly to real-world settings. Monte Carlo (MC) methods, in contrast, learn from experience without requiring a model, but they must wait until the end of an entire episode before performing updates, resulting in relatively coarse learning granularity. Temporal Difference (TD) learning represents a compromise between these two approaches: it does not require a model, yet it can update value estimates incrementally after each interaction step.
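The tabular TD(0) update makes this incremental character explicit (a minimal sketch; the states and numbers are illustrative):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Tabular TD(0): after a single transition (s, r, s'), move V(s)
    toward the bootstrapped target r + gamma * V(s')."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V

V = {"A": 0.0, "B": 0.5}
V = td0_update(V, "A", r=1.0, s_next="B", alpha=0.1, gamma=0.9)
# target = 1.0 + 0.9 * 0.5 = 1.45  ->  V["A"] = 0.0 + 0.1 * 1.45 = 0.145
```

Unlike Monte Carlo, the update happens immediately after one step, using the current estimate V(s') in place of the rest of the return.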

Incremental Implementation

In Reinforcement Learning (RL), many algorithms may appear different in form, yet their core update mechanisms are highly similar. At the implementation level, they all rely on a common numerical estimation approach. This approach is not an independent algorithm, but rather a computational technique for gradually approximating an expectation. Understanding this mechanism helps clarify the fundamental differences among various reinforcement learning methods.
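The shared mechanism is the incremental sample-average update, which can be sketched in a few lines (names are mine, not from the article):

```python
def incremental_mean(Q, n, reward):
    """Incremental sample average:
    Q_{n+1} = Q_n + (1/n) * (R_n - Q_n),
    equal to the mean of the first n rewards but using O(1) memory."""
    return Q + (reward - Q) / n

Q, n = 0.0, 0
for r in [2.0, 4.0, 6.0]:
    n += 1
    Q = incremental_mean(Q, n, r)
# Q == 4.0, the mean of the three rewards
```

Replacing the shrinking step size 1/n with a constant alpha gives the familiar "old estimate plus step size times error" form that nearly every RL update rule instantiates.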

Monte Carlo Methods (MC)

In Dynamic Programming (DP), having a complete environment model is a prerequisite for exact computation. However, this assumption rarely holds in most real-world problems. Monte Carlo (MC) methods choose to forgo reliance on an explicit model and instead learn directly from complete experiences generated through interaction with the environment. By sampling and averaging episode returns, MC provides a practical pathway for estimating value functions grounded in actual experience.
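First-visit MC prediction can be sketched as follows (the episode encoding and names below are illustrative assumptions): compute the return backward through each complete episode, then average the returns that follow the first visit to each state.

```python
def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction. Each episode is a list of
    (state, reward) steps, where reward is received on leaving the state.
    V(s) is the average return following the first visit to s."""
    returns = {}
    for episode in episodes:
        # compute returns G_t backward through the episode
        G, gs = 0.0, []
        for s, r in reversed(episode):
            G = r + gamma * G
            gs.append((s, G))
        gs.reverse()
        # record only the return from the first visit to each state
        seen = set()
        for s, G in gs:
            if s not in seen:
                seen.add(s)
                returns.setdefault(s, []).append(G)
    return {s: sum(v) / len(v) for s, v in returns.items()}

V = first_visit_mc([[("A", 0.0), ("B", 1.0)], [("A", 2.0)]])
# episode 1: G(A) = 0 + 1 = 1, G(B) = 1; episode 2: G(A) = 2
# -> V["A"] = 1.5, V["B"] = 1.0
```

No transition model appears anywhere: the estimate is built entirely from sampled episode returns, which is the defining trait of MC methods.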