Byte-Pair Encoding (BPE) is a frequency-based symbol merging algorithm that was originally proposed as a data compression method. In natural language processing (NLP), BPE has been reinterpreted as a subword tokenization technique that strikes a balance between characters and full words. By automatically learning high-frequency fragments from data, BPE can effectively construct a scalable vocabulary without relying on any language-specific knowledge.
The complete code for this chapter can be found in .
Overview
BPE was originally introduced by Philip Gage in 1994 in A New Algorithm for Data Compression. In that paper, Gage defined BPE as a purely frequency-based, substitution-style data compression algorithm. Its design goal was not language modeling, but rather the efficient reduction of storage costs for symbol sequences.
The core idea of BPE is straightforward. The algorithm repeatedly identifies the most frequent byte pair in the data and replaces that pair with a new symbol, thereby gradually shortening the overall representation. This design exhibits several key characteristics:
- No language knowledge required: the algorithm relies solely on the symbol sequence itself and its frequency statistics, without any semantic or syntactic assumptions.
- Greedy strategy: each replacement selects the pair that yields the greatest immediate compression benefit, without backtracking or global optimization.
- Reversibility: as long as the complete set of replacement rules is preserved, the original data can be fully reconstructed from the compressed representation.
It is worth noting that although the original formulation of BPE was strictly focused on byte-level data compression, the underlying idea of learning high-frequency fragments directly from data and treating them as new symbols was later naturally extended to text processing and the NLP domain, where it became a foundational approach for subword representations.
Algorithm
In NLP, BPE is commonly reinterpreted as a subword segmentation method that provides a compromise between character-level and word-level representations. Its basic procedure is as follows:
- Initialize the symbol inventory by treating each character or byte in the corpus as an independent token.
- Count the frequencies of all adjacent token pairs.
- Merge the token pair with the highest frequency.
- Update the corpus representation and recompute pair frequencies.
- Repeat the above steps until a predefined vocabulary size or a maximum number of merges is reached.
To illustrate how BPE operates on an actual corpus, consider the following simple sentence:
the lowest price is lower than the last one
At the initial stage, BPE treats every character in the sentence as an independent token:
t h e l o w e s t p r i c e i s l o w e r t h a n t h e l a s t o n e
At this point, the vocabulary consists only of single characters, and the model has not yet learned any longer structural fragments.
Across the entire corpus, certain adjacent character pairs appear repeatedly, such as (l, o), (o, w), and (t, h). Suppose the statistics indicate that (t, h) is currently the most frequent pair. The first merge is then performed:
th e l o w e s t p r i c e i s l o w e r th a n th e l a s t o n e
At this stage, th becomes a new token in the vocabulary.
The algorithm then recomputes statistics over the new token pairs. At this stage, pairs such as (th, e), (l, o), (o, w), and (w, e) may all share the same highest frequency. Because BPE is a greedy algorithm, one of these pairs is selected according to implementation details, such as order of appearance or tie-breaking rules. Here, assume that (l, o) is chosen as the second merge:
th e lo w e s t p r i c e i s lo w e r th a n th e l a s t o n e
After recomputing the statistics again, common high-frequency pairs may include (th, e), (lo, w), and (w, e). Suppose that (th, e) is selected for the next merge, yielding:
the lo w e s t p r i c e i s lo w e r th a n the l a s t o n e
At this point, the vocabulary contains a mixture of:
- The original single-character tokens
- Merged tokens such as th, the, and lo
BPE then continues to iterate through this cycle of recounting and merging, gradually learning longer subwords that occur more frequently in the corpus, until the predefined vocabulary size or merge limit is reached.
This process clearly illustrates that BPE does not rely on any linguistic rules. Instead, it incrementally constructs an efficient symbol inventory purely by exploiting recurring patterns in the data.
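The iteration above can be reproduced in a few lines of Python. The following is a minimal sketch, not the full implementation: it treats each word of the example sentence as a sequence of single-character tokens, counts adjacent pairs, and applies the single most frequent merge. Note that ties are broken by Counter's insertion order here, so later merges may differ from the walkthrough.

```python
from collections import Counter

sentence = "the lowest price is lower than the last one"
# Treat each word as a list of single-character tokens.
words = [list(w) for w in sentence.split()]

def most_frequent_pair(words):
    # Count all adjacent token pairs across words.
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0]

def merge(words, pair):
    # Replace every occurrence of `pair` with the concatenated token.
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

pair, count = most_frequent_pair(words)
print(pair, count)  # ('t', 'h') 3 -- twice in "the", once in "than"
words = merge(words, pair)
print(words[0], words[5])  # ['th', 'e'] ['th', 'a', 'n']
```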
Implementation
Below is a simplified implementation example of BPE that corresponds to the algorithmic process described in the previous section. In this implementation, token_to_id represents the final learned vocabulary, where the keys are the actual byte sequences and the values are the corresponding token IDs.
The overall implementation can be divided into three main parts: initialization, training, and encoding.
At initialization, a byte-level representation is used as the minimal unit, and all 256 possible byte values (0–255) are preloaded into the vocabulary. This design ensures that:
- The vocabulary is always complete and usable under all circumstances.
- Any input string can be encoded without encountering out-of-vocabulary (OOV) issues.

At this stage, each token corresponds to a single byte, and no subword structure has yet been introduced.
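The no-OOV property follows directly from this initialization: any string, once encoded as UTF-8, is a sequence of byte values in the range 0–255, each of which is already in the vocabulary. A quick sketch:

```python
# Initial byte-level vocabulary: every possible byte value gets an ID.
token_to_id = {bytes([i]): i for i in range(256)}

# Any text, including accented characters and non-Latin scripts,
# decomposes into bytes that are all already known.
for text in ["hello", "naïve", "日本語"]:
    ids = [token_to_id[bytes([b])] for b in text.encode("utf-8")]
    assert all(0 <= i < 256 for i in ids)
    print(text, "->", ids)
```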
Before BPE training begins, the implementation applies a simple pretokenization rule to split raw text into coarser-grained units, such as English words, numbers, punctuation, and whitespace. This step is not part of the original BPE algorithm; rather, it is a common preprocessing practice in modern NLP tokenizers, intended to prevent BPE merges from crossing clearly unreasonable boundaries.
After pretokenization, each token is converted into a UTF-8 byte sequence. Its frequency of occurrence in the corpus is then used as a weight for subsequent token-pair frequency calculations.
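To make the effect of this rule concrete, here is the pretokenization pattern from the implementation applied to a short sentence; contractions, space-prefixed words, numbers, and punctuation each become separate coarse units:

```python
import re

# Same GPT-2-style pattern used by the BPE class in this chapter.
PRETOKENIZE_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d|\s?[A-Za-z]+|\s?\d+|\s?[^A-Za-z\d\s]+|\s+"
)

text = "Don't panic, it's 2024!"
pieces = [m.group(0) for m in PRETOKENIZE_PATTERN.finditer(text)]
print(pieces)  # ['Don', "'t", ' panic', ',', ' it', "'s", ' 2024', '!']
```

Because a leading space is kept attached to the following word, BPE can later learn space-prefixed tokens such as " the", which is visible in the merge list produced by the experiment below.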
The training stage corresponds directly to the core BPE algorithm described in the previous section:
- For each token’s byte sequence, count the frequencies of adjacent byte pairs.
- Treat the most frequent pair as the next merge candidate.
- Create a new token and replace all occurrences of that pair across the tokens.
- Record the merge rule in sequence.
This procedure is repeated until the vocabulary size reaches a predefined limit or until the frequency of the most common pair falls below a minimum threshold. It is important to note that the merge strategy is greedy: at each step, the algorithm makes a locally optimal choice based solely on the current statistics, without revisiting earlier decisions.
When encoding new input text, the process mirrors the logic used during training:
- The input text is first processed using the same pretokenization.
- Each token is converted into a byte sequence.
- The merge rules recorded during training are applied sequentially, in order.
- The final output is a sequence of corresponding token IDs.
Because the merge rules are applied in a fixed order, the encoding process is deterministic, ensuring consistency between training and inference.
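As a small illustration of this replay (using hypothetical token IDs, not ones learned from a real corpus), applying two recorded merges in order collapses the bytes of "the" step by step:

```python
# Hypothetical merge table: ('t','h') -> id 256, then ('th','e') -> id 257.
id_to_token = {i: bytes([i]) for i in range(256)}
id_to_token[256] = b"th"
id_to_token[257] = b"the"
merges = [(116, 104), (256, 101)]  # (b't', b'h'), then (b'th', b'e')

def apply_merges(symbols, merges):
    # Replay merge rules in training order; each rule rewrites the sequence.
    token_to_id = {v: k for k, v in id_to_token.items()}
    for left, right in merges:
        new_id = token_to_id[id_to_token[left] + id_to_token[right]]
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(new_id)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(apply_merges(list(b"the"), merges))   # [257]: the whole word is one token
print(apply_merges(list(b"that"), merges))  # [256, 97, 116]: only 'th' merges
```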
```python
import re
from collections import Counter
from typing import Iterable

# GPT-2-style pretokenization: contractions, words, numbers,
# punctuation runs, and whitespace runs.
PRETOKENIZE_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d|\s?[A-Za-z]+|\s?\d+|\s?[^A-Za-z\d\s]+|\s+"
)


class BPE:
    def __init__(self):
        # Byte-level initialization: all 256 byte values are preloaded,
        # so any UTF-8 input can be encoded without OOV issues.
        self.token_to_id: dict[bytes, int] = {bytes([i]): i for i in range(256)}
        self.id_to_token: dict[int, bytes] = {
            token_id: token for token, token_id in self.token_to_id.items()
        }
        self.merges: list[tuple[int, int]] = []

    def train(self, corpus: Iterable[str], vocab_size: int, min_frequency: int = 2):
        # Count pretokenized words; frequencies weight the pair statistics.
        word_counts = Counter()
        for line in corpus:
            for token in self.pretokenize(line):
                b = token.encode("utf-8")
                if b:
                    word_counts[b] += 1
        word_symbols: dict[bytes, list[int]] = {word: list(word) for word in word_counts}
        while len(self.token_to_id) < vocab_size:
            pair_frequencies = Counter()
            for word, symbols in word_symbols.items():
                word_count = word_counts[word]
                for i in range(len(symbols) - 1):
                    pair_frequencies[symbols[i], symbols[i + 1]] += word_count
            if not pair_frequencies:
                break
            (symbol_left, symbol_right), pair_frequency = pair_frequencies.most_common(1)[0]
            if pair_frequency < min_frequency:
                break
            new_id = len(self.token_to_id)
            new_token = self.id_to_token[symbol_left] + self.id_to_token[symbol_right]
            self.token_to_id[new_token] = new_id
            self.id_to_token[new_id] = new_token
            self.merges.append((symbol_left, symbol_right))
            self._merge(word_symbols, symbol_left, symbol_right, new_id)

    @classmethod
    def pretokenize(cls, text: str) -> list[str]:
        return [m.group(0) for m in PRETOKENIZE_PATTERN.finditer(text)]

    @classmethod
    def _merge(cls, word_symbols: dict[bytes, list[int]], left: int, right: int, new_id: int) -> None:
        # Replace every (left, right) occurrence with the new token ID.
        for word, symbols in word_symbols.items():
            new_symbols = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                    new_symbols.append(new_id)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            word_symbols[word] = new_symbols

    def encode(self, text: str) -> list[int]:
        ids = []
        for token in self.pretokenize(text):
            symbols = list(token.encode("utf-8"))
            symbols = self._apply_bpe(symbols)
            ids.extend(symbols)
        return ids

    def _apply_bpe(self, symbols: list[int]) -> list[int]:
        # Apply merge rules in the order they were learned.
        for left, right in self.merges:
            new_token = self.id_to_token[left] + self.id_to_token[right]
            new_id = self.token_to_id[new_token]
            i = 0
            output = []
            while i < len(symbols):
                if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                    output.append(new_id)
                    i += 2
                else:
                    output.append(symbols[i])
                    i += 1
            symbols = output
        return symbols
```

Example
To validate the practical behavior of the BPE implementation described above, WikiText-2 is used as the training corpus. WikiText-2 is a widely used benchmark dataset for language modeling; it contains a little over thirty thousand lines of minimally cleaned Wikipedia text, preserving common punctuation, whitespace, and structural patterns found in natural language. This makes it well suited as a test corpus for tokenizer training.
In this experiment, all text lines from the training split are used as input, and the vocabulary size is set to 2000. Because the implementation employs byte-level initialization, this setting implies that, in addition to the original 256 byte tokens, up to 1,744 (2000 - 256) additional subword tokens can be learned.
```python
from datasets import load_dataset

from bpe import BPE


def main():
    print("Loading WikiText-2...")
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
    train_text = dataset["train"]["text"]
    print(f"Number of training lines: {len(train_text)}")

    print("Training BPE...")
    bpe = BPE()
    bpe.train(corpus=train_text, vocab_size=2000)
    print("Training finished")
    print("Vocab size:", len(bpe.token_to_id))
    print("Number of merges:", len(bpe.merges))

    print("\nFirst 20 merges (as bytes):")
    for i, (a, b) in enumerate(bpe.merges[:20]):
        ta = bpe.id_to_token[a]
        tb = bpe.id_to_token[b]
        print(f"{i:02d}: {ta!r} + {tb!r}")

    example = "Natural language processing is interesting"
    encoded = bpe.encode(example)
    print()
    print(f"Example text: {example}")
    print(f"Encoded token ids: {encoded}")
    print(f"Decoded tokens: {[bpe.id_to_token[i] for i in encoded]}")


if __name__ == "__main__":
    main()
```

After training, several key observations can be made:
- The final vocabulary size is 2000.
- The number of merges performed is 1744, which is exactly 2000 - 256.
- In other words, training stopped because the vocabulary reached its predefined limit, not because the corpus ran out of sufficiently frequent pairs to merge.
Through purely frequency-based statistics and greedy merging, the model is able to automatically learn a subword vocabulary from raw byte sequences that effectively captures the structural patterns of the corpus.
```
Loading WikiText-2...
Number of training lines: 36718
Training BPE...
Training finished
Vocab size: 2000
Number of merges: 1744

First 20 merges (as bytes):
00: b' ' + b't'
01: b'h' + b'e'
02: b' ' + b'a'
03: b'i' + b'n'
04: b' t' + b'he'
05: b'e' + b'r'
06: b'o' + b'n'
07: b' ' + b','
08: b'r' + b'e'
09: b' ' + b's'
10: b'e' + b'd'
11: b' ' + b'o'
12: b' ' + b'w'
13: b'n' + b'd'
14: b'a' + b't'
15: b' ' + b'.'
16: b'o' + b'r'
17: b'i' + b't'
18: b' ' + b'c'
19: b'e' + b'n'

Example text: Natural language processing is interesting
Encoded token ids: [78, 270, 1221, 312, 630, 117, 553, 1971, 444, 285, 363, 824, 390, 285]
Decoded tokens: [b'N', b'at', b'ural', b' l', b'ang', b'u', b'age', b' proc', b'ess', b'ing', b' is', b' inter', b'est', b'ing']
```
Conclusion
The experimental results show that BPE provides a highly practical trade-off between performance and implementation complexity. Compared with word-level tokenization, BPE significantly reduces vocabulary size while effectively eliminating the OOV problem. Owing to its simplicity, high degree of control, and ability to learn structure directly from the corpus, BPE remains widely used in many tokenizer designs, such as the byte-level BPE adopted by GPT-2 and related variants carried forward in subsequent models. Even as newer tokenization methods continue to emerge, BPE remains a baseline approach for understanding tokenization design in modern language models.
References
- Philip Gage. 1994. A new algorithm for data compression. C Users Journal, Volume 12, Issue 2, pages 23–38.