Bradley-Terry Model

In many machine learning and decision-making systems, what we encounter is not a directly measurable quality score, but rather a large number of preference judgments in the form of pairwise comparisons, that is, decisions about which of two options is better. Although such pairwise comparison data is simple in form, it implicitly contains rich structural information. Starting from a probabilistic perspective, this article explains step by step how the Bradley–Terry model can transform these preference comparisons into a learnable representation of latent utilities.

Problem

In many real-world scenarios, when comparing two options, we are often unable to directly assign each option a meaningful absolute score. However, it is relatively easy to answer a simpler question: between the two options, which one is better? This type of problem is known as a pairwise comparison problem. Common examples include:

  • Between two options, which one does the user prefer?
  • When two players compete, who is more likely to win?
  • In RLHF or preference learning, which response will a human annotator choose between two model-generated outputs?

A shared characteristic of such data is that we only observe whether A beats B, rather than an absolute quality rating for each individual option.

Our goal is to infer, from a large collection of comparison outcomes A vs. B, the latent utility parameters associated with each item, and then answer the question: Given any two items i, j, what is the probability that i is preferred over j? Equivalently, what is the probability that i beats j?

To answer this question, we first formalize the setting.

Assume there are N items in total. For each item i, we assign a real-valued latent utility parameter \theta_i \in \mathbb{R}. This parameter represents the intrinsic utility level of item i under preference comparisons. However, in practice, we cannot directly observe \theta_i; we only observe the outcomes of pairwise comparisons. Therefore, \theta_i is an unobserved latent variable.

Next, we would like to construct a probabilistic model that describes, given the latent utility parameters \theta_i, \theta_j, the probability that item i beats item j:

P( i \succ j \mid \theta_i, \theta_j) \in [0, 1]

This probabilistic model should satisfy several intuitive and reasonable properties:

  1. If \theta_i > \theta_j, then 0.5 < P(i \succ j) \le 1; moreover, the larger the gap between \theta_i and \theta_j, the closer P(i \succ j) is to 1.
  2. If \theta_i = \theta_j, then P(i \succ j) = 0.5, meaning that the two items are equally preferred.
  3. If \theta_i < \theta_j, then 0 \le P(i \succ j) < 0.5; moreover, the larger the gap, the closer P(i \succ j) is to 0.
  4. For any pair (i, j), the comparison outcome must be mutually exclusive and collectively exhaustive, thus satisfying P(i \succ j) + P(j \succ i) = 1.

Together, these conditions characterize the fundamental behavior we expect from a pairwise comparison model, and they provide a clear and consistent theoretical starting point for introducing the concrete form of the Bradley–Terry model.

Odds

In probability-based comparisons, directly using the difference in probabilities can be semantically misleading. For example, consider the two probability changes 0.6 \rightarrow 0.7 and 0.9 \rightarrow 1.0. Although both have the same numerical difference of 0.1, they convey different meanings: the former is a significant change under moderate uncertainty, whereas the latter moves an already likely outcome to a near-certain one.

This illustrates that, when comparing the likelihood of two events, using probability differences as a measure is not always appropriate. In many situations, a more intuitive and semantically consistent question is: how many times more likely is one event than the other?

Motivated by this, we introduce the concept of odds.

For the event i \succ j, its odds is defined as:

\displaystyle \text{odds}(i \succ j) = \frac{P(i \succ j)}{1 -P(i \succ j)} = \frac{P(i \succ j)}{P(j \succ i)}

Semantically, this asks: how many times more likely is choosing i compared to choosing j?

Odds is a ratio, rather than a difference. Therefore, when describing the relative strength of preference, it can more faithfully reflect the relative relationship between probabilities, without being distorted by the absolute scale of the probabilities.

Furthermore, by taking the logarithm of odds, we obtain the log-odds:

\displaystyle \text{log-odds}(i \succ j) = \log \frac{P(i \succ j)}{1 - P(i \succ j)} = \log \frac{P(i \succ j)}{P(j \succ i)}

An important property of log-odds is that it maps a probability originally constrained to the interval (0, 1) onto the entire real line \mathbb{R}, allowing relative advantage to be expressed additively. This makes log-odds a particularly natural and convenient representation when constructing pairwise comparison models in later sections.
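To make the contrast concrete, here is a minimal Python sketch (the helper names `odds` and `log_odds` are ours) comparing probability changes on the odds scale. Since odds diverge as the probability approaches 1, we use 0.99 rather than 1.0 for the second change:

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def log_odds(p):
    """Log-odds (logit): maps (0, 1) onto the whole real line R."""
    return math.log(odds(p))

# The same 0.1 probability gap looks very different on the odds scale:
print(odds(0.7) / odds(0.6))    # 0.6 -> 0.7 multiplies the odds by ~1.56
print(odds(0.99) / odds(0.9))   # 0.9 -> 0.99 multiplies the odds by ~11
print(log_odds(0.5))            # equal preference sits at log-odds 0
```

Note that `log_odds(0.5) = 0` is exactly the symmetry property we required earlier: equal preference corresponds to the origin of the real line.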

Bradley-Terry Model

The Bradley–Terry model was proposed in 1952 by Ralph A. Bradley and Milton E. Terry. It is a classic probabilistic model designed to handle pairwise comparison data.

At its core, the Bradley–Terry model is a log-odds difference model. For each item i, the model assigns a real-valued latent utility parameter \theta_i \in \mathbb{R}. It is important to emphasize that \theta_i is not a probability; rather, it represents the item’s relative utility strength on the log scale in preference comparisons.

The key assumption of the model is that the log-odds of the event i \succ j equals the difference between the corresponding latent utilities:

\displaystyle \log \frac{P(i \succ j)}{1 - P(i \succ j)} = \theta_i - \theta_j

This assumption directly links the log-odds introduced in the previous section to the latent utility parameters \theta_i, allowing relative preference strength to be expressed in a linear form.

Next, we can derive the corresponding probability form from the log-odds expression. From the equation above, we obtain:

\displaystyle \log \frac{P(i \succ j)}{1 - P(i \succ j)} = \theta_i - \theta_j \\\\ \implies \frac{P(i \succ j)}{1 - P(i \succ j)} = e^{\theta_i - \theta_j} \\\\ \implies P(i \succ j) = \frac{e^{\theta_i - \theta_j}}{1 + e^{\theta_i - \theta_j}}

Rearranging further yields:

\displaystyle \begin{aligned} P(i \succ j) &= \frac{e^{\theta_i - \theta_j}}{1 + e^{\theta_i - \theta_j}} \\ &= \frac{1}{1 + e^{-(\theta_i - \theta_j)}} \\ &= \sigma(\theta_i - \theta_j), \quad \sigma(z) = \frac{1}{1 + e^{-z}} \end{aligned}

where \sigma(\cdot) denotes the sigmoid function. This shows that the Bradley–Terry model can be viewed as a logistic model that takes the latent utility difference as its input.

Intuitively, if \theta_i - \theta_j = \log 3, then the odds of i beating j equals 3; under long-run repeated comparisons, i is chosen three times as often as j. When \theta_i = \theta_j, the model naturally reduces to the symmetric case P(i \succ j) = 0.5.
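The closed form above is a one-liner in code. A minimal sketch (the function names are ours) that checks the symmetry and complementarity properties required in the problem setup:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_win(theta_i, theta_j):
    """Bradley-Terry probability that item i beats item j."""
    return sigmoid(theta_i - theta_j)

# Equal utilities -> 50/50, as the symmetry property requires.
print(p_win(1.0, 1.0))                    # 0.5
# A utility gap of log(3) -> odds of 3:1 in favour of i, i.e. P = 0.75.
print(p_win(math.log(3.0), 0.0))          # ~0.75
# Complementarity: P(i > j) + P(j > i) = 1 for any pair.
print(p_win(2.0, 0.5) + p_win(0.5, 2.0))  # ~1.0
```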

Aligning with Maximum Likelihood Estimation (MLE)

In practice, we do not know the true probability distribution P underlying the pairwise comparison data. What we can obtain is only a finite number of observed samples, for example, when comparing (i, j), we observe a binary label indicating whether the outcome is i \succ j or the opposite.

Under this setting, the Bradley–Terry model provides a parameterized probabilistic model P_\theta(i \succ j), which is used to describe the probability of observing any comparison result. However, the latent utility parameters in the model, \theta = (\theta_1, \dots, \theta_N), are still unknown and must be estimated from data.

If we assume that these pairwise comparison samples are drawn independently and identically distributed (i.i.d.) from some fixed but unknown true distribution, then a natural strategy is to choose parameters \hat{\theta} that maximize the probability of observing the dataset under the model. This is exactly Maximum Likelihood Estimation (MLE).

Concretely, suppose the dataset consists of M pairwise comparison samples (we write M to avoid clashing with N, the number of items). Each sample can be represented as (i_k, j_k, y_k), where y_k \in \{0, 1\} indicates whether, in the k-th comparison, item i_k is chosen over item j_k. Under the Bradley–Terry model assumption, the MLE problem of minimizing the average negative log-likelihood becomes:

\displaystyle \hat{\theta} = \arg \min_\theta \frac{1}{M} \sum_{k=1}^M -\log P_\theta(y_k \mid i_k, j_k)

This objective function is, in form, equivalent to minimizing the cross-entropy between the model distribution P_\theta and the empirical data distribution. Therefore, MLE can be viewed as, within a specified model family, finding an approximate distribution \hat{P} that best matches the true data-generating mechanism.

Combining this with the sigmoid form derived in the previous section for the Bradley–Terry model, the learning task is ultimately reduced to a standard differentiable optimization problem, enabling us to efficiently estimate each item’s latent utility parameter \theta_i using gradient-based methods.
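The objective above is straightforward to write down. A minimal sketch (function and variable names are ours), where each sample is a tuple (i, j, y) with y = 1 if item i won:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(theta, comparisons):
    """Average negative log-likelihood of pairwise outcomes.

    theta: list of latent utilities, one per item.
    comparisons: list of (i, j, y) with y = 1 if i beat j, else 0.
    """
    total = 0.0
    for i, j, y in comparisons:
        p = sigmoid(theta[i] - theta[j])
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(comparisons)

# With theta_i = theta_j the model predicts 0.5, so each sample costs log 2.
data = [(0, 1, 1), (0, 1, 0)]
print(nll([0.0, 0.0], data))  # log(2) ~ 0.693
```

Because the loss is smooth in \theta, any gradient-based optimizer can minimize it; in practice one would use an autodiff framework rather than hand-coded gradients.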

Example 1

To concretely illustrate how the Bradley–Terry model works in practice when combined with MLE, we consider an extremely simple pairwise comparison example.

Suppose we have only three items A, B, C, and we collect the following pairwise comparison results (which can be viewed as coming from user preferences or human labeling):

  • A \succ B: observed 8 times.
  • B \succ A: observed 2 times.
  • A \succ C: observed 9 times.
  • C \succ A: observed 1 time.
  • B \succ C: observed 6 times.
  • C \succ B: observed 4 times.

These data only tell us who tends to beat whom more often, and do not provide any absolute quality score for each item.

In the Bradley–Terry model, we assign each item a latent utility parameter \theta_A, \theta_B, \theta_C, and assume that in any single comparison, the probability of observing i \succ j is:

\displaystyle P(i \succ j) = \sigma(\theta_i - \theta_j)

The role of MLE is then to adjust these latent utility parameters \theta so that the win probabilities predicted by the model match the observed empirical comparison frequencies as closely as possible.

Take A and B as an example. The data show that out of 10 comparisons, the proportion of times A beats B is approximately 0.8. Therefore, the model will tend to adjust the parameters such that:

\displaystyle \sigma(\theta_A - \theta_B) \approx 0.8

This implies that \theta_A - \theta_B must be positive, and its magnitude must be large enough for the sigmoid output to be close to 0.8.

Similarly, since A almost always beats C, the difference \theta_A - \theta_C will be pushed to be even larger. In contrast, the comparison results between B and C indicate that the gap between them is smaller, and therefore the corresponding utility difference will be relatively closer as well.

Under the joint constraints imposed by the full dataset, MLE considers all pairwise comparison samples simultaneously and searches for a set of parameters (\theta_A, \theta_B, \theta_C) that maximizes the joint likelihood of all observed outcomes. The resulting latent utility parameters \theta are not determined by any single comparison; instead, they reflect a global and consistent preference ordering and strength structure induced by all pairwise information.

It is worth noting that the Bradley–Terry model depends only on differences \theta_i - \theta_j. Therefore, adding the same constant to all \theta_i does not affect the model’s predicted probabilities. This implies that the model suffers from parameter non-identifiability. In practice, this degree of freedom is typically removed by fixing one parameter to a constant value or by introducing appropriate regularization constraints.

This example demonstrates that the Bradley–Terry model provides a mechanism for recovering a global and consistent latent utility representation from local pairwise comparison outcomes. MLE, in turn, is the key step that operationalizes this mechanism in a data-driven manner.
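The whole example fits in a few lines of Python. The sketch below (all names are ours) runs plain gradient descent on the negative log-likelihood of the counts above, pinning \theta_C = 0 to remove the shift non-identifiability just discussed:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Aggregated comparison counts from the example: (winner, loser, count).
counts = [("A", "B", 8), ("B", "A", 2), ("A", "C", 9),
          ("C", "A", 1), ("B", "C", 6), ("C", "B", 4)]
items = ["A", "B", "C"]
theta = {k: 0.0 for k in items}

# Plain gradient descent on the negative log-likelihood.
# theta["C"] is pinned to 0 to fix the additive degree of freedom.
lr = 0.1
for _ in range(2000):
    grad = {k: 0.0 for k in items}
    for w, l, n in counts:
        p = sigmoid(theta[w] - theta[l])
        # d/dtheta_w of -n * log p is -n * (1 - p); mirrored for the loser.
        grad[w] -= n * (1.0 - p)
        grad[l] += n * (1.0 - p)
    for k in items:
        theta[k] -= lr * grad[k]
    theta["C"] = 0.0

print({k: round(v, 2) for k, v in theta.items()})
# Fitted win probability of A over B, to compare with the empirical 0.8:
print(round(sigmoid(theta["A"] - theta["B"]), 2))
```

The recovered ordering is \theta_A > \theta_B > \theta_C, and the fitted P(A \succ B) lands near, but not exactly at, the raw frequency 0.8: the MLE must reconcile all three pairings jointly, which is precisely the global consistency discussed above.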

Example 2

In the RLHF (Reinforcement Learning from Human Feedback) pipeline, human labelers do not assign absolute scores to model responses. Instead, they perform pairwise comparisons. Given the same prompt, the model generates two responses, and the labeler only needs to answer which one is better.

Consider the following scenario. For a given prompt, the language model generates three responses A, B, C, and we collect preference data from human labelers:

  • In the comparison between A and B, labelers mostly choose A.
  • In the comparison between A and C, labelers almost always choose A.
  • In the comparison between B and C, labelers slightly prefer B.

These data contain only preference judgments about which response is better, and do not include any form of absolute quality rating.

Under the Bradley–Terry model, we assign each response i a latent utility parameter \theta_i, and assume that when humans compare two responses i, j, the probability of choosing i is:

\displaystyle P(i \succ j) = \sigma(\theta_i - \theta_j)

In this context, \theta_i can be interpreted as the implicit utility strength under human preference. It is not directly observable, and can only be inferred indirectly from pairwise comparison outcomes.

Next, MLE is responsible for adjusting these latent utility parameters \theta based on the observed labels. For example, if in comparisons between A and B, labelers frequently choose A, then MLE will push the parameters such that:

\displaystyle \sigma(\theta_A - \theta_B) \simeq \text{observed preference frequency of } A \succ B

Similarly, if A almost always beats C, then \theta_A - \theta_C will be pushed to be larger. Meanwhile, the comparison results between B and C suggest that the gap between them is smaller, and thus the corresponding difference in latent utility will be relatively closer. By considering all comparison samples jointly, MLE ultimately learns a set of latent utility parameters such that the model’s predictions over all pairwise preferences best match the preference distribution implied by human labels.

In RLHF implementations, this step corresponds to training the reward model.

The Bradley–Terry model provides a structural assumption that maps latent utility differences to preference probabilities, while MLE enables this assumption to be estimated from human comparison data, thereby learning a differentiable and generalizable reward representation.

Therefore, from this perspective, the reward model in RLHF is not learning a scoring function from scratch. Instead, it is grounded in a clear probabilistic model: human preferences are treated as random variables, and the Bradley–Terry model is one such hypothesis for describing this stochastic preference behavior.
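In scalar form, the reward-model training loss is commonly written as -\log \sigma(r_{\text{chosen}} - r_{\text{rejected}}), which is exactly the Bradley–Terry negative log-likelihood with the network's scalar scores playing the role of \theta. A minimal sketch (the function name is ours; real implementations compute this over batches with a neural scorer):

```python
import math

def bt_reward_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss for reward-model training:
    -log sigma(r_chosen - r_rejected), where r_* are scalar reward scores."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# An indifferent model (equal scores) pays log 2 per preference pair.
print(bt_reward_loss(0.0, 0.0))   # log(2) ~ 0.693
# The loss shrinks as the reward gap widens in favour of the chosen response.
print(bt_reward_loss(2.0, -1.0))  # small
# Ranking the pair the wrong way costs more than indifference.
print(bt_reward_loss(-1.0, 2.0))  # large
```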

Conclusion

Starting from the pairwise comparison problem, we first introduced odds and log-odds as a natural language for describing relative preference strength, and explained why this representation is more semantically consistent than directly comparing probability differences. Building on this foundation, the Bradley–Terry model adopts the simplest structural assumption over log-odds, modeling preference behavior as a function of latent score differences. Through MLE, the model can learn a consistent latent utility representation from real pairwise comparison data. In the context of RLHF and preference learning, the Bradley–Terry model not only provides a theoretically clear probabilistic interpretation, but also establishes a solid and implementable foundation for reward modeling.
