
Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a frequency-based symbol merging algorithm that was originally proposed as a data compression method. In natural language processing (NLP), BPE has been reinterpreted as a subword tokenization technique that strikes a balance between character-level and word-level representations. By automatically learning high-frequency fragments from data, BPE can effectively construct a scalable vocabulary without relying on any language-specific knowledge.
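The merge-learning loop can be sketched in a few lines. The following is a minimal illustration using the classic toy corpus of space-separated character sequences; the function names and the corpus format are illustrative, not from a particular tokenizer library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a dict of space-separated words -> counts."""
    vocab = {tuple(word.split()): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # highest-frequency pair
        merges.append(best)
        vocab = {tuple(merge_pair(list(symbols), best)): freq
                 for symbols, freq in vocab.items()}
    return merges, vocab

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, vocab = learn_bpe(corpus, 3)
```

On this toy corpus the first learned merges are ("e", "s"), ("es", "t"), and ("l", "o"): high-frequency fragments such as "est" emerge purely from counting, with no linguistic rules involved.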

Policy Gradient

In reinforcement learning (RL) control problems, most methods treat value functions as the core learning object, improving the policy indirectly by estimating long-term returns. However, when the state or action space becomes continuous, or when the policy itself must remain stochastic, this approach becomes less direct. Policy gradient methods adopt a different perspective by treating the policy itself as the object of optimization, directly performing gradient ascent on the expected return.
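The idea of gradient ascent on the expected return can be shown with a REINFORCE-style sketch on a hypothetical two-armed bandit. The deterministic rewards, the step size, and the one-step-episode setup are all invented for illustration; real problems use sampled returns over multi-step trajectories.

```python
import math, random

random.seed(0)

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical bandit: arm 1 pays 1.0, arm 0 pays 0.2 (deterministic).
rewards = [0.2, 1.0]
prefs = [0.0, 0.0]           # policy parameters theta (action preferences)
alpha = 0.1                  # step size

for _ in range(2000):
    probs = softmax(prefs)
    a = random.choices([0, 1], weights=probs)[0]   # sample from pi(.|theta)
    G = rewards[a]           # return of this one-step episode
    # REINFORCE update: theta += alpha * G * grad log pi(a|theta).
    # For a softmax policy, grad log pi(a) w.r.t. pref i is 1[i==a] - pi(i).
    for i in range(2):
        prefs[i] += alpha * G * ((1.0 if i == a else 0.0) - probs[i])

final_probs = softmax(prefs)
```

After training, the policy concentrates its probability on the higher-paying arm, without ever estimating a value function explicitly.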

On-Policy Control with Approximation

In practical control problems, the state and action spaces are often high-dimensional, continuous, and noisy, which makes tabular reinforcement learning algorithms difficult to apply directly. Once function approximation is introduced, the two components that theory keeps cleanly separated, policy evaluation and policy improvement, become tightly intertwined, bringing with them challenges of stability and variance. This article focuses on on-policy control methods under function approximation, with particular attention to Sarsa.
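The core of the method is a single weight update. Below is a minimal sketch of one semi-gradient Sarsa step with a linear action-value function q(s, a) = w . x(s, a); the episodic loop, epsilon-greedy action selection, and feature construction are omitted, and the feature vectors are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def semi_gradient_sarsa_update(w, x_sa, r, x_next, alpha, gamma):
    """One semi-gradient Sarsa update for linear q(s, a) = w . x(s, a).

    x_sa and x_next are the feature vectors of (S_t, A_t) and
    (S_{t+1}, A_{t+1}), where A_{t+1} is chosen by the current policy.
    """
    td_error = r + gamma * dot(w, x_next) - dot(w, x_sa)
    # Semi-gradient: only q(S_t, A_t) is differentiated, so the gradient is x_sa.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x_sa)]

w = semi_gradient_sarsa_update([0.0, 0.0, 0.0],
                               x_sa=[1.0, 0.0, 0.0], r=1.0,
                               x_next=[0.0, 1.0, 0.0], alpha=0.5, gamma=0.9)
```

Because the bootstrapped target r + gamma * q(S_{t+1}, A_{t+1}) itself depends on w but is not differentiated, this is a "semi-gradient" rather than a true gradient method, which is exactly where the stability questions discussed in this article come from.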

On-Policy Prediction with Approximation

This chapter focuses on on-policy prediction with approximation, systematically organizing the learning objectives for value estimation under this setting, the feasible learning methods, and the solutions to which they actually converge. By contrasting Gradient Monte Carlo with Semi-Gradient TD(0), we will see the unavoidable trade-offs that arise between theoretically well-defined objectives and methods that are practically viable.
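The contrast between the two methods comes down to their update rules. The sketch below shows both for a linear state-value function v(s) = w . x(s); the feature vectors and targets in the demo calls are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_mc_update(w, x_s, G, alpha):
    """Gradient Monte Carlo: G is the full episode return, so this is a
    true (unbiased) SGD step on the value error, but it must wait for
    the episode to end."""
    err = G - dot(w, x_s)
    return [wi + alpha * err * xi for wi, xi in zip(w, x_s)]

def semi_gradient_td0_update(w, x_s, r, x_next, alpha, gamma):
    """Semi-gradient TD(0): bootstraps on v(S_{t+1}) but ignores that
    the target also depends on w, hence 'semi-gradient'."""
    err = r + gamma * dot(w, x_next) - dot(w, x_s)
    return [wi + alpha * err * xi for wi, xi in zip(w, x_s)]

w_mc = gradient_mc_update([0.0, 0.0], x_s=[1.0, 0.0], G=2.0, alpha=0.1)
w_td = semi_gradient_td0_update([0.0, 0.0], x_s=[1.0, 0.0], r=1.0,
                                x_next=[0.0, 1.0], alpha=0.1, gamma=0.9)
```

Gradient MC descends a well-defined objective and converges to a local optimum of the value error, while semi-gradient TD(0) is not a true gradient method and converges (in the linear case) to a different fixed point; this is the trade-off the chapter examines.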

Dyna Architecture

In reinforcement learning (RL), an agent often needs to learn an effective decision policy under conditions where real interactions with the environment are limited and costly. Relying solely on real experience is conceptually straightforward, but it is often constrained by poor data efficiency and slow learning speed. Conversely, relying entirely on planning with a model may introduce bias when the model is inaccurate. The Dyna architecture was proposed to strike a balance between these two extremes by integrating acting, learning, and planning within a single learning process.
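The integration of acting, learning, and planning can be sketched as one Dyna-Q step. The function below assumes a deterministic environment (so the model stores a single outcome per state-action pair); the state and action names in the demo call are invented for illustration.

```python
import random

random.seed(0)

def dyna_q_step(Q, model, s, a, r, s2, alpha, gamma, n_planning, actions):
    """One Dyna-Q step: direct RL update, model learning, then planning."""
    def q_update(s, a, r, s2):
        best = max(Q.get((s2, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * best - q)

    q_update(s, a, r, s2)            # direct RL: learn from real experience
    model[(s, a)] = (r, s2)          # model learning (deterministic env)
    for _ in range(n_planning):      # planning: replay simulated experience
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)

Q, model = {}, {}
dyna_q_step(Q, model, s=0, a="right", r=1.0, s2=1,
            alpha=0.5, gamma=0.9, n_planning=5, actions=["left", "right"])
```

A single real transition here produces six value updates (one direct, five planned), which is exactly how Dyna squeezes more learning out of limited real experience; when the model is wrong, those extra updates are what introduce bias.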

Temporal-Difference Learning, TD

In Reinforcement Learning (RL), Dynamic Programming (DP) offers the most complete and mathematically explicit solution framework. However, its reliance on a known environment model makes it difficult to apply directly to real-world settings. Monte Carlo (MC) methods, in contrast, learn from experience without requiring a model, but they must wait until the end of an entire episode before performing updates, resulting in relatively coarse learning granularity. Temporal-Difference (TD) learning represents a compromise between these two approaches: it does not require a model, yet it can update value estimates incrementally after each interaction step.
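The per-step update is the whole idea. A minimal tabular TD(0) sketch, applied to a hypothetical two-step episode (the states, rewards, and step size are invented for illustration):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)].
    Unvisited (and terminal) states default to value 0."""
    v = V.get(s, 0.0)
    td_target = r + gamma * V.get(s_next, 0.0)
    V[s] = v + alpha * (td_target - v)

# Hypothetical episode A -> B -> terminal, with reward 1 on the final step.
V = {}
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "terminal")]:
    td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0)  # update after EACH step
```

Each transition triggers an update immediately, using the current estimate of the next state as a bootstrap target; an MC method would instead wait for the episode to finish and use the full observed return.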

Monte Carlo Methods, MC

In Dynamic Programming (DP), having a complete environment model is a prerequisite for exact computation. However, this assumption rarely holds in most real-world problems. Monte Carlo (MC) methods choose to forgo reliance on an explicit model and instead learn directly from complete experiences generated through interaction with the environment. By sampling and averaging episode returns, MC provides a practical pathway for estimating value functions grounded in actual experience.
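"Sampling and averaging episode returns" can be made concrete with a first-visit MC prediction sketch. The flat trajectory format, with each episode as a list of (S_t, R_{t+1}) pairs, is an illustrative simplification.

```python
def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo prediction of state values.

    episodes: list of trajectories, each a list of (S_t, R_{t+1}) pairs.
    """
    returns = {}
    for episode in episodes:
        # Compute the return G_t following each time step, walking backwards.
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        # Record the return only for the FIRST visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns.setdefault(s, []).append(G[t])
    # Estimate: the average of sampled returns per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = first_visit_mc([[("A", 0.0), ("B", 1.0)],
                    [("B", 2.0)]], gamma=1.0)
```

No transition probabilities or reward model appear anywhere: the estimate is built entirely from complete sampled episodes, which is precisely what lets MC work where DP's model assumption fails.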

Dynamic Programming, DP

In Reinforcement Learning (RL), Dynamic Programming (DP) is the earliest and most complete solution framework. Although DP is almost impossible to apply directly to practical high-dimensional or continuous environments, it reveals the mathematical foundations of all core concepts in modern RL. At a fundamental level, the convergence objectives and update rules of all RL algorithms are derived from the Bellman Equations and the Generalized Policy Iteration (GPI) framework used in DP.
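To make the Bellman machinery concrete, here is a minimal value-iteration sketch, which repeatedly applies the Bellman optimality backup. The MDP encoding (P[s][a] as a list of (prob, next_state, reward) outcomes) and the toy chain are illustrative conventions, not from a specific library.

```python
def value_iteration(P, gamma, theta=1e-8):
    """Value iteration: in-place sweeps of the Bellman optimality backup
    V(s) <- max_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:            # stop when the sweep changed nothing
            return V

# Toy deterministic chain: s0 -> s1 -> goal, reward 1 on entering goal.
P = {
    "s0":   {"right": [(1.0, "s1", 0.0)]},
    "s1":   {"right": [(1.0, "goal", 1.0)]},
    "goal": {"stay":  [(1.0, "goal", 0.0)]},
}
V = value_iteration(P, gamma=0.9)
```

Note that the full transition model P must be known in advance; this is the assumption that makes DP exact here and inapplicable in most real environments.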

Chain-of-Thought, CoT

The performance of LLMs on reasoning tasks has undergone substantial change in recent years with the introduction of Chain-of-Thought (CoT) prompting. This technique guides an LLM to produce step-by-step intermediate reasoning, enabling the model to exhibit a human-like structure of thought. As task complexity increases, however, the limitations of traditional CoT have become more apparent, motivating a series of follow-up methods designed to address these issues. This article presents an overview of CoT and its extensions.
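The difference between standard few-shot prompting and CoT prompting is easiest to see side by side. The arithmetic word problem below is a commonly used style of illustration; the exact wording and numbers are invented here, not taken from a specific benchmark.

```python
# Standard few-shot prompting: the exemplar shows only the final answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
    "Q: I have 3 apples and buy 2 more bags of 4 apples each. "
    "How many apples do I have now?\n"
    "A:"
)

# Chain-of-Thought prompting: the same exemplar, but the answer walks
# through intermediate reasoning steps, which the model then imitates
# when it answers the new question.
cot_prompt = standard_prompt.replace(
    "A: The answer is 11.",
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 balls. 5 + 6 = 11. The answer is 11.")
```

The only change is in the exemplar's answer: by demonstrating the intermediate steps rather than just the result, the prompt elicits step-by-step reasoning from the model at inference time.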

Vision Transformer Model

In the field of image recognition, Convolutional Neural Networks (CNNs) have long been the dominant architecture. In recent years, Transformer models have achieved great success in Natural Language Processing (NLP), which has led researchers to consider applying the Transformer architecture to image processing tasks. Vision Transformer (ViT) is a model designed for image understanding based on the Transformer framework.
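ViT's key move is to turn an image into a sequence of tokens by splitting it into fixed-size patches. The sketch below shows only that front-end step, on a toy single-channel "image" built from nested lists; a real ViT would then linearly project each flattened patch, prepend a class token, and add position embeddings before feeding the sequence to a Transformer encoder.

```python
def image_to_patches(image, patch_size):
    """Split an H x W single-channel image (nested lists) into flattened
    non-overlapping patches, row-major, as in the ViT front end."""
    H, W = len(image), len(image[0])
    assert H % patch_size == 0 and W % patch_size == 0
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            # Flatten one patch_size x patch_size block into a vector.
            patch = [image[i + di][j + dj]
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            patches.append(patch)
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = image_to_patches(image, 2)   # 4 patches of 4 pixels each
```

Once the image is a sequence of patch vectors, it can be processed exactly like a sequence of word embeddings, which is what allows the NLP Transformer architecture to transfer to vision with so little modification.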