Entropy

Photo by Daniel Seßler on Unsplash
In probabilistic modeling and machine learning, entropy is a fundamental concept for quantifying uncertainty. It not only describes the inherent randomness of data, but also implicitly captures the minimum information cost required in prediction and modeling. Many learning objectives that may appear different on the surface, such as maximizing log-likelihood or designing loss functions, can in fact be traced back and understood through the lens of entropy.

Self-Information

In information theory, self-information refers to the amount of information associated with the occurrence of a specific event, determined by its probability of happening. It is also commonly called information content. This concept originates from the information-theoretic framework established by Claude Shannon in 1948, and it is one of the most fundamental definitions in the entire theory.

Suppose the probability of an event x is p(x). The self-information I(x) is defined as:

I(x) = - \log p(x)

The base of the logarithm determines the unit of information. If a base-2 logarithm is used, the unit is bits; if the natural logarithm is used, the unit is nats. Throughout this article, we use \log_2 as the default.

Let us consider an intuitive example. Suppose the probability that David passes an exam is p_{\text{D}}(x_{pass}) = 0.3, while the probability that Wayne passes the exam is p_{\text{W}}(x_{pass}) = 0.9. When David passes the exam, this is a relatively rare event and therefore produces a higher sense of surprise. In contrast, when Wayne passes the exam, the event is highly likely, and we are hardly surprised at all. This intuitive notion of surprise is precisely what self-information attempts to quantify. The corresponding information values are:

I_{\text{D}}(x_{\text{pass}}) = - \log_2 p_{\text{D}}(x_{\text{pass}}) = - \log_2 0.3 \approx 1.74\\\\ I_{\text{W}}(x_{\text{pass}}) = - \log_2 p_{\text{W}}(x_{\text{pass}}) = - \log_2 0.9 \approx 0.15

As we can see, events with lower probabilities have higher self-information, whereas events with higher probabilities carry less information. In other words, self-information quantifies the degree of surprise associated with the occurrence of an event, and is therefore often referred to in the literature as surprisal.
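The two surprisal values above can be checked with a few lines of Python (a minimal sketch; the helper name `self_information` is our own):

```python
import math

def self_information(p: float, base: float = 2) -> float:
    """Self-information I(x) = -log p(x); bits when base = 2."""
    return -math.log(p, base)

# David passes with probability 0.3, Wayne with probability 0.9.
print(round(self_information(0.3), 2))  # 1.74 bits: rare event, high surprise
print(round(self_information(0.9), 2))  # 0.15 bits: likely event, little surprise
```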

Beyond serving as a measure of surprise, self-information is also directly related to coding. Given a set of 8 equally likely events {x_1, x_2, \cdots, x_8}, each event can be represented using 3 bits:

\log_2 8 = 3

Since the events are equiprobable, we have p(x_i) = \frac{1}{8}, \quad i = 1, \cdots, 8. The self-information of each event is therefore:

I(x_i) = - \log_2 p(x_i) = - \log_2 \frac{1}{8} = 3

That is, in this case, the amount of information carried by each event exactly corresponds to the shortest fixed-length code required to represent it, namely 3 bits.
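This correspondence between code length and self-information is easy to verify numerically:

```python
import math

# Eight equiprobable events: a fixed-length code needs log2(8) = 3 bits,
# and the self-information of each event, -log2(1/8), is also 3 bits.
print(math.log2(8))       # 3.0
print(-math.log2(1 / 8))  # 3.0
```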

When event probabilities are not equal, fixed-length coding is no longer optimal. According to results from information theory, the optimal coding strategy assigns shorter bit representations to high-probability events and longer bit representations to low-probability events. Under such an optimal coding scheme, the self-information - \log p(x) can be interpreted as the minimum code length required for that event in an optimal average sense.

Entropy

Consider a random source composed of a set of events {x_1, x_2, \cdots}, where each event x_i occurs with probability p(x_i). In this setting, although the overall probability distribution p is known, before the event actually occurs we cannot know in advance which specific event x_i will be realized, and therefore cannot determine beforehand which particular self-information value will be associated with it.

In other words, prior to the occurrence of an event, the system is in a state of uncertainty. What we care about is not the information content of a single event, but rather, on average, how much information the random source as a whole will produce. Given that the probability distribution is known, the only reasonable and consistent way to quantify this is to take the expectation of the self-information over all possible events, weighted by their probabilities.

Based on this motivation, entropy is defined as the expected value of the self-information of a random variable:

\displaystyle H(X) = \mathbb{E}_{x \sim p} [I(x)] = \sum_{x \in X} p(x) (- \log p(x)) = - \sum_{x \in X} p(x) \log p(x)

Therefore, entropy measures the average amount of information required when a result is randomly drawn from the probability distribution before any event has occurred. This quantity was introduced by Claude Shannon in 1948 as one of the foundational concepts of information theory. To distinguish it from other generalized forms of entropy, it is commonly referred to as Shannon entropy.

From the definition, the relationship between entropy and self-information is clear:

  • Self-information describes the amount of information produced when a particular event occurs.
  • Entropy describes the average information content of the entire random source before any event occurs, that is, the degree of its uncertainty.

Let us return to the earlier example of David and Wayne taking an exam. Let the random variable X represent the exam outcome, with the event set:

X = {\text{pass}, \text{fail}}

The entropies of David’s and Wayne’s exam outcomes are then given by:

H_{\text{D}}(X) = \mathbb{E}_{x \sim p(x)} [I(x)] \\\\ \phantom{H_{\text{D}}(X)} = \sum_{x \in X} p(x) (- \log_2 p(x)) \\\\ \phantom{H_{\text{D}}(X)} = 0.3 (-\log_2 0.3) + 0.7 (-\log_2 0.7) \approx 0.88 \\\\ H_{\text{W}}(X) = 0.9 (-\log_2 0.9) + 0.1 (-\log_2 0.1) \approx 0.47

We can see that Wayne’s exam outcome has a significantly lower entropy. This indicates that the outcome is highly predictable, with most of the probability mass concentrated on passing, and therefore the uncertainty faced before the event occurs is relatively lower.
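Both entropies can be reproduced with a small Python sketch (the `entropy` helper is our own naming):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(round(entropy([0.3, 0.7]), 2))  # 0.88, David: outcome hard to predict
print(round(entropy([0.9, 0.1]), 2))  # 0.47, Wayne: outcome nearly certain
```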

Conditional Entropy

Consider two discrete random variables X and Y. In the absence of any additional information, the uncertainty of the random variable Y can be measured by its entropy H(Y). However, if we have already observed the value of X, the probability distribution of Y is updated from the marginal distribution p(y) to the conditional distribution p(y \mid x). In this case, the uncertainty of Y should be reassessed.

Given X = x, the self-information and entropy of Y under this condition are defined as:

I(y \mid x) = -\log p(y \mid x) \\\\ \displaystyle H(Y \mid X=x) = \mathbb{E}_{y \sim p(\cdot \mid x)} [I(y \mid x)] = \sum_{y \in Y} p(y \mid x) (-\log p(y \mid x))

However, before the event occurs, we do not know which value X will take. Therefore, in order to measure the overall average uncertainty of Y given knowledge of X, we must take an expectation over all possible values of X, weighted by their probabilities. Based on this motivation, conditional entropy is defined as:

\displaystyle H(Y \mid X) = \mathbb{E}_{x \sim p} [H(Y \mid X=x)] \\\\ \hphantom{H(Y \mid X)} = \mathbb{E}_{x \sim p} \Big[ \mathbb{E}_{y \sim p(\cdot \mid x)} [I(y \mid x)] \Big] \\\\ \hphantom{H(Y \mid X)} = \sum_{x \in X} \sum_{y \in Y} p(x, y) (-\log p(y \mid x))

Thus, conditional entropy measures the average amount of information required for Y given that the random variable X is known. In other words, it quantifies the average remaining uncertainty about Y when X is given.

The relationship between conditional entropy and Shannon entropy is as follows:

  • H(Y): the uncertainty about Y when no information is known.
  • H(Y \mid X): the remaining uncertainty about Y after X is known.

Since additional information can only reduce, and never increase, uncertainty, we necessarily have:

H(Y \mid X) \le H(Y)

If X and Y are independent, that is, p(y \mid x) = p(y), then observing X provides no information about Y, and we have:

H(Y \mid X) = H(Y)

Returning to the earlier example of David and Wayne taking an exam, if the two belong to different classes, or even different grades, their exam outcomes are independent. In this case, observing Wayne’s exam result does not change our assessment of David’s performance, and therefore:

H(\text{D} \mid \text{W}) = H(\text{D})

However, if David and Wayne are in the same class and take the same exam, the situation is different. Suppose that when Wayne fails the exam, it often indicates that the exam was particularly difficult, which significantly reduces the probability that David passes. In this case, observing \text{W} = \text{fail} does indeed reduce our uncertainty about David’s exam outcome.

Assume that under such circumstances:

p(\text{D} = \text{pass} \mid \text{W} = \text{fail}) = 0.03 \\\\ p(\text{D} = \text{fail} \mid \text{W} = \text{fail}) = 0.97

Then, under the condition \text{W} = \text{fail}, the entropy of David’s exam outcome is:

H(\text{D} \mid \text{W} = \text{fail}) = 0.97 (-\log_2 0.97) + 0.03 (-\log_2 0.03) \approx 0.19

We can see that, given the information that Wayne failed the exam, David’s outcome is almost certainly a failure as well, and the uncertainty is significantly reduced. This example clearly illustrates that conditional entropy characterizes the remaining uncertainty after some information has been observed.
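The conditional entropy at a fixed observation is just the entropy of the updated distribution, so it can be computed with the same helper used earlier (a minimal sketch):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution, in bits by default."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# H(D | W=fail): David's distribution updated to (pass=0.03, fail=0.97).
print(round(entropy([0.03, 0.97]), 2))  # 0.19
```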

Information Gain

Consider two discrete random variables X and Y.

  • When X is unknown, the uncertainty of the random variable Y can be measured by its entropy H(Y).
  • When a specific event X = x is known, the probability distribution of Y is updated to p(y \mid x), and the remaining uncertainty is given by H(Y \mid X = x).

Therefore, the reduction in uncertainty of Y after observing the specific event X = x can be naturally defined as the information gain:

IG(Y ; X = x) = H(Y) - H(Y \mid X = x)

Information gain measures the amount of information we actually obtain about Y after observing the particular outcome X = x, that is, the extent to which the uncertainty of Y is reduced. If observing X = x makes the distribution of Y more concentrated, the information gain will be larger; conversely, if the conditional distribution is almost identical to the original distribution, the information gain will be close to zero.

Let us return to the earlier example of David and Wayne taking an exam. Suppose that before knowing Wayne’s exam result, the entropy of David’s exam outcome is:

H(\text{D}) = 0.3 (-\log_2 0.3) + 0.7 (-\log_2 0.7) \approx 0.88

After observing that Wayne failed the exam, the conditional entropy of David’s exam outcome is:

H(\text{D} \mid \text{W} = \text{fail}) = 0.97 (-\log_2 0.97) + 0.03 (-\log_2 0.03) \approx 0.19

Thus, after learning that \text{W} = \text{fail}, the information gain about David’s exam outcome is:

IG(\text{D} ; \text{W} = \text{fail}) = H(\text{D}) - H(\text{D} \mid \text{W} = \text{fail}) \approx 0.88 - 0.19 = 0.69

This indicates that merely observing the single event that Wayne failed the exam reduces our average uncertainty about David’s exam outcome by approximately 0.69. This example clearly illustrates that information gain captures the informational value that a specific observation provides about another random variable.
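The information gain is simply the difference of the two entropies computed above (a quick Python check):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution, in bits by default."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h_prior = entropy([0.3, 0.7])    # H(D), before any observation
h_cond = entropy([0.03, 0.97])   # H(D | W=fail)
info_gain = h_prior - h_cond
print(round(info_gain, 2))  # 0.69
```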

Mutual Information

Since before an event occurs we do not know which specific value the random variable X will take, if we wish to measure the amount of information that X provides about Y in an overall average sense, we must take an expectation over all possible values of x, weighted by their probabilities. In this setting, the expected value of information gain gives rise to mutual information:

I(Y ; X) = \mathbb{E}_{x \sim p} [ IG(Y ; X = x)] = H(Y) - H(Y \mid X)

Mutual information is symmetric, and therefore we have:

\begin{aligned} I(Y ; X) &= I(X ; Y) \\ H(Y) - H(Y \mid X) &= H(X) - H(X \mid Y) \end{aligned}

This shows that mutual information does not favor either random variable, but instead describes the amount of information shared between X and Y. That is, the portion of uncertainty about one variable that can be eliminated by knowing the other.

From another perspective, mutual information can also be written in an equivalent form that directly involves probability distributions:

\displaystyle I(Y ; X) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

This expression clearly shows that mutual information measures the discrepancy between the joint distribution p(x, y) and the distribution p(x)p(y) under the assumption of independence. If X and Y are independent, these two distributions coincide and the mutual information is zero.

Returning to the earlier example of David and Wayne taking an exam, suppose that:

  • p(\text{W}=\text{pass}) = 0.9, and p(\text{W}=\text{fail}) = 0.1.
  • p(\text{D}=\text{pass}) = 0.3.
  • When Wayne fails, p(\text{D}=\text{pass} \mid \text{W}=\text{fail}) = 0.03.
  • When Wayne passes, p(\text{D}=\text{pass} \mid \text{W}=\text{pass}) = 0.33.

First, the entropy of David’s exam outcome is:

H(\text{D}) = 0.3 (-\log_2 0.3) + 0.7 (-\log_2 0.7) \approx 0.88

Next, we compute the entropy under different conditions:

H(\text{D} \mid \text{W}=\text{pass}) = 0.33 (-\log_2 0.33) + 0.67 (-\log_2 0.67) \approx 0.91 \\\\ H(\text{D} \mid \text{W}=\text{fail}) = 0.03 (-\log_2 0.03) + 0.97 (-\log_2 0.97) \approx 0.19

Taking the expectation over Wayne’s exam outcome yields the conditional entropy:

\begin{aligned} H(\text{D} \mid \text{W}) &= p(\text{W} = \text{pass}) H(\text{D} \mid \text{W} = \text{pass}) \\ &\hphantom{=} + p(\text{W} = \text{fail}) H(\text{D} \mid \text{W} = \text{fail}) \\ &\approx 0.9 \times 0.91 + 0.1 \times 0.19 \approx 0.84 \end{aligned}

Therefore, the mutual information between David’s and Wayne’s exam outcomes is:

I(\text{D} ; \text{W}) = H(\text{D}) - H(\text{D} \mid \text{W}) \approx 0.88 - 0.84 = 0.04

This indicates that Wayne’s exam result does indeed provide information about David’s outcome. When Wayne fails, David almost certainly fails as well; when Wayne passes, David’s probability of passing also increases slightly. However, because Wayne almost always passes, the highly informative event occurs with very low probability. As a result, in an overall average sense, the mutual information remains quite limited.

Cross-Entropy

We already know that Shannon entropy can be used to measure the average amount of information of a probability distribution p:

H(X) = \mathbb{E}_{x \sim p} [ I(x) ] = \mathbb{E}_{x \sim p} [ -\log p(x) ]

However, in practical settings, the true data-generating distribution p is often unknown. Typically, we can only obtain an approximate distribution q through some model, which is used to describe or predict the data-generating process. In other words, although the data are actually generated according to p(x), during encoding or prediction we assign probabilities according to q(x).

In this situation, a natural question arises: when data are generated from the true distribution p but described using the model distribution q, how much information is required on average?

Motivated by this question, cross-entropy is defined as:

\displaystyle H(p, q) = \mathbb{E}_{x \sim p} [-\log q(x)] = \sum_{x \in X} p(x) (-\log q(x))

Thus, cross-entropy measures the average information cost incurred when data are generated from the true distribution p but encoded or predicted using the model distribution q.

If the model distribution happens to be equal to the true distribution, that is, q = p, then cross-entropy reduces to entropy itself:

\displaystyle H(p, p) = \sum_{x \in X} p(x) (-\log p(x)) = H(p)

From an intuitive perspective, we can summarize the relationship as follows:

  • If the model distribution q is very close to the true distribution p, then H(p, q) will be slightly larger than, but very close to, H(p).
  • If the model distribution q differs significantly from the true distribution p, then H(p, q) will be much larger than H(p).

Therefore, entropy H(p) can be regarded as a theoretical lower bound of cross-entropy H(p, q):

H(p, q) \ge H(p)

This inequality reflects the fact that using an incorrect probability distribution to describe data inevitably incurs additional average information cost.

Let us return once again to David’s exam example, but this time consider only David. Suppose that a teacher from a neighboring class observes David looking toward the front during class (while actually daydreaming), mistakenly assumes that he is very attentive, and subjectively believes that David’s probability of passing the exam is 0.8. In this case, we have:

\begin{aligned} p_{\text{D}}(\text{pass}) = 0.3 &, \quad p_{\text{D}}(\text{fail}) = 0.7 \\ q_{\text{D}}(\text{pass}) = 0.8 &, \quad q_{\text{D}}(\text{fail}) = 0.2 \end{aligned}

First, the entropy of the true distribution p is:

H(p) = 0.3 (-\log_2 0.3) + 0.7 (-\log_2 0.7) \approx 0.88

This means that if the homeroom teacher uses the correct distribution p to predict David’s exam outcome, the long-term average surprisal incurred per prediction is 0.88.

Next, if the neighboring-class teacher uses the incorrect model distribution q to make predictions, the cross-entropy is:

H(p, q) = 0.3 (-\log_2 0.8) + 0.7 (-\log_2 0.2) \approx 1.72

When the neighboring-class teacher uses the incorrect model distribution q to predict David’s exam outcome, the average surprisal incurred per prediction rises to 1.72. Compared with the homeroom teacher, who uses the correct distribution, this corresponds to an extra information cost of 1.72 - 0.88 = 0.84 per prediction under long-term prediction.
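Both quantities can be reproduced with a short Python sketch (the helper name `cross_entropy` is our own):

```python
import math

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum_x p(x) log q(x): average cost of coding data from p using q."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

p = [0.3, 0.7]   # true distribution of David's result (pass, fail)
q = [0.8, 0.2]   # the neighboring teacher's mistaken model
print(round(cross_entropy(p, q), 2))  # 1.72
print(round(cross_entropy(p, p), 2))  # 0.88, which equals H(p)
```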

Kullback–Leibler Divergence (KL Divergence)

We have already seen that for a true probability distribution p, its entropy is defined as:

\displaystyle H(p) = \sum_{x \in X} p(x) (-\log p(x))

This quantity describes the average uncertainty that is unavoidable even when the true distribution p is fully known.

On the other hand, when data are still generated from the true distribution p, but we use another model distribution q to describe or encode the data, the average information cost is given by the cross-entropy:

\displaystyle H(p, q) = \sum_{x \in X} p(x) (-\log q(x))

In this setting, the difference between H(p, q) and H(p) corresponds precisely to the following question: beyond the unavoidable uncertainty inherent in the data itself, how much extra information cost is caused by using an incorrect distribution q?

Based on this motivation, the Kullback–Leibler divergence (KL divergence) is defined as the difference between cross-entropy and entropy:

\displaystyle D_{\text{KL}}(p \| q) = H(p, q) - H(p) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \\\\ \text{where } q(x) > 0 \text{ for all } x \text{ such that } p(x) > 0

Thus, KL divergence measures the average extra amount of information required when the true distribution is p but the data are described using q, relative to using the correct distribution p. Because it quantifies the additional cost incurred relative to a correct description, KL divergence is also referred to as relative entropy in information theory.

It is important to note that although KL divergence is often used to measure the difference between two distributions, it is not a distance. In general, D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p), and it satisfies neither symmetry nor the triangle inequality. Nevertheless, from the perspective of information theory and probabilistic modeling, it remains the most natural and operationally meaningful measure of distributional mismatch.

Let us once again return to David’s exam example. The homeroom teacher uses the true distribution p to predict David’s exam outcome, while the teacher from the neighboring class uses the incorrect model distribution q. As computed earlier:

H(p) = 0.3 (-\log_2 0.3) + 0.7 (-\log_2 0.7) \approx 0.88 \\\\ H(p, q) = 0.3 (-\log_2 0.8) + 0.7 (-\log_2 0.2) \approx 1.72

Therefore, the difference in average information cost between these two description strategies is:

D_{\text{KL}}(p \| q) = H(p, q) - H(p) = 1.72 - 0.88 = 0.84

This indicates that because the neighboring-class teacher uses an incorrect probability distribution to predict David’s exam outcome, they incur an additional average information cost of 0.84 per prediction. This value represents the degree of information inefficiency of the model distribution q relative to the true distribution p.

Maximum Likelihood Estimation (MLE)

Consider a set of independent and identically distributed (i.i.d.) observations {x_1, x_2, \cdots, x_n}. Given a model distribution q_\theta, the likelihood of the entire dataset under the model is defined as:

\displaystyle \mathcal{L}(\theta) = \prod_{i=1}^n q_\theta (x_i)

Here, i.i.d. means that each observation is independent of the others (independent) and that all observations are randomly drawn from the same probability distribution (identically distributed).

The goal of Maximum Likelihood Estimation (MLE) is to choose a set of parameters \theta such that the model assigns the highest probability to the observed data:

\displaystyle \hat{\theta} = \arg \max_\theta \prod_{i=1}^n q_\theta(x_i)

Because the product form is inconvenient for numerical computation and optimization, in practice we usually take the logarithm of the likelihood, yielding the log-likelihood:

\displaystyle \log \mathcal{L}(\theta) = \sum_{i=1}^n \log q_\theta(x_i)

Therefore, MLE can equivalently be written as:

\displaystyle \hat{\theta} = \arg \max_\theta \sum_{i=1}^n \log q_\theta(x_i)

Multiplying the above maximization problem by -1 converts it into an equivalent minimization problem:

\displaystyle \hat{\theta} = \arg \min_\theta \Big( - \sum_{i=1}^n \log q_\theta(x_i) \Big)

Here, -\log q_\theta(x_i) is precisely the self-information of the event x_i under the model distribution q_\theta. Therefore, the meaning of MLE can be restated as choosing model parameters such that the total self-information assigned by the model to the observed data is minimized.

In other words, MLE is not merely a technique for probability maximization; in an information-theoretic sense, it is an attempt to describe the data as efficiently as possible using the model distribution.

However, the log-likelihood above still reflects only a single finite sampling outcome. Ideally, what we truly care about is the average information cost that the model incurs under the true data-generating distribution p. That is, the following expectation:

\displaystyle \mathbb{E}_{x \sim p}[-\log q_\theta(x)]

Since the true distribution p is unknown, this expectation cannot be computed directly. Given that we can only observe a finite dataset, the only feasible approach is to approximate the true distribution using the empirical distribution induced by the samples:

\displaystyle \hat{p}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i = x\}

According to the law of large numbers, when the sample size n is sufficiently large and the data are i.i.d., for any integrable function f(x), the expectation computed under the empirical distribution converges to the expectation under the true distribution p:

\displaystyle \mathbb{E}_{x \sim \hat{p}_n}[f(x)] = \frac{1}{n} \sum_{i=1}^n f(x_i) \longrightarrow \mathbb{E}_{x \sim p}[f(x)]

By taking f(x) = -\log q_\theta(x), we obtain:

\displaystyle \frac{1}{n} \sum_{i=1}^n -\log q_\theta(x_i) \approx \mathbb{E}_{x \sim p}\big[-\log q_\theta(x)\big] = H(p, q_\theta)

That is, the empirical average of the negative log-likelihood is precisely an approximation of the cross-entropy between the true distribution p and the model distribution q_\theta.
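This convergence can be illustrated with a quick Monte Carlo check (the sample size is a hypothetical choice, and the generator is seeded for reproducibility):

```python
import math
import random

random.seed(0)
p = [0.3, 0.7]   # true distribution (pass, fail)
q = [0.8, 0.2]   # model distribution

# Draw a large i.i.d. sample from p and compare the empirical average
# negative log-likelihood under q with the exact cross-entropy H(p, q).
n = 100_000
sample = random.choices([0, 1], weights=p, k=n)
empirical = sum(-math.log2(q[x]) for x in sample) / n
exact = -sum(px * math.log2(qx) for px, qx in zip(p, q))
print(round(exact, 2))                 # 1.72
print(abs(empirical - exact) < 0.05)   # True: close for large n
```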

Therefore, maximizing the log-likelihood is equivalent to minimizing the cross-entropy:

\displaystyle \hat{\theta} = \arg\min_\theta H(p, q_\theta)

Recalling the earlier relationship:

H(p, q_\theta) = H(p) + D_{\text{KL}}(p \| q_\theta)

where H(p) is determined solely by the true data-generating distribution p and is independent of the model parameters \theta. Hence, when optimizing with respect to \theta, we have:

\displaystyle \arg\min_\theta H(p, q_\theta) = \arg\min_\theta D_{\text{KL}}(p \| q_\theta)

This leads to an important conclusion: MLE is, in essence, equivalent to minimizing the KL divergence between the model distribution q_\theta and the true distribution p. From an information-theoretic perspective, MLE can be understood as choosing a model distribution that minimizes the average information cost required to describe samples generated by the true data-generating mechanism.
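As an illustration, consider a toy Bernoulli model with hypothetical data matching David's 0.3 pass rate; a grid search over the parameter recovers the empirical frequency, where the average negative log-likelihood equals H(p) \approx 0.88 bits:

```python
import math

def neg_log_likelihood(theta, data):
    """Average negative log-likelihood (bits) of i.i.d. Bernoulli data under q_theta."""
    return -sum(math.log(theta if x == 1 else 1 - theta, 2) for x in data) / len(data)

# Hypothetical sample: 30% of outcomes are "pass" (1), matching p(pass) = 0.3.
data = [1] * 3 + [0] * 7

# Scan candidate parameters: the minimum sits at theta = 0.3, the empirical
# frequency, which is also where the cross-entropy H(p, q_theta) is minimized.
best = min((t / 100 for t in range(1, 100)),
           key=lambda t: neg_log_likelihood(t, data))
print(best)  # 0.3
```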

This also explains why, in modern machine learning, maximizing likelihood, minimizing cross-entropy, and minimizing KL divergence repeatedly appear in different forms. They all describe the same objective, viewed from different perspectives.

Conclusion

Through this sequence of derivations, we can see that entropy characterizes the unavoidable uncertainty inherent in the data itself, while cross-entropy and KL divergence quantify the additional information cost introduced by model choice. MLE is not merely a formal procedure for maximizing probability; in an information-theoretic sense, it represents an attempt to describe the data-generating mechanism as efficiently as possible using a model distribution. From this perspective, minimizing cross-entropy and minimizing KL divergence are simply different expressions of the same objective: approximating the true distribution. This also explains why these information-theoretic quantities repeatedly appear in loss functions and training objectives in modern machine learning. Understanding this theoretical thread helps us not only know how to apply models, but also why such practices are fundamentally well justified.
