Information Theory Cheat-Sheet

Entropy, divergence, mutual information, etc. are central concepts in statistical machine learning. This post ties them together in a short collection of elementary information-theoretic results.

Below, we consider random variables $\mathrm{X}, \mathrm{Y}$ that take values in some discrete sets $\mathcal{X}$ and $\mathcal{Y}$. We denote by $p_{\tiny\mathrm{X}}\in\Delta_\mathcal{X}$ and $p_{\tiny\mathrm{Y}}\in\Delta_\mathcal{Y}$ their respective probability mass functions. Their joint pmf is denoted $p_{\tiny\mathrm{XY}}$.

$\quad$ Discrete support is for simplicity: one can replace sums with integrals in the continuous case.

Entropy

The information entropy of $\mathrm{X}$ measures the unpredictability of this random variable. The surprise of the event $\{\mathrm{X}=x\}$ can be measured via $-\log p_{\tiny\mathrm{X}}(x)$: the lower this event’s probability, the higher the surprise when it actually happens. The entropy is the expected value of this score: $$ \tag{1} \mathcal{H}(\mathrm{X}) := \mathbb{E}_{p_{\tiny\mathrm{X}}}\left[-\log p_{\tiny\mathrm{X}}(\mathrm{X})\right] = - \sum_{x\in\mathcal{X}} p_{\tiny\mathrm{X}}(x)\log p_{\tiny\mathrm{X}}(x)\;. $$ The entropy is maximal when $p_{\tiny\mathrm{X}}$ is uniform over $\mathcal{X}$, and in the discrete case it is also non-negative. The entropy is measured in nats when using the natural logarithm (or in bits, when the base of the logarithm is 2).

Proof
On positivity
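
As a quick numerical sanity check of (1), here is a minimal sketch in Python with NumPy; the `entropy` helper below is ours (illustrative, not from any particular library). It computes the entropy of a random pmf and of the uniform pmf, which attains the maximum $\log\vert\mathcal{X}\vert$:

```python
import numpy as np

def entropy(p, base=np.e):
    """Entropy (1) of a discrete pmf `p`; nats by default, bits for base=2."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))           # a random pmf over 4 symbols
uniform = np.full(4, 0.25)

print(entropy(p))                       # non-negative, at most log(4) ≈ 1.386 nats
print(entropy(uniform))                 # equals log(4): the uniform pmf is maximal
print(entropy(uniform, base=2))         # 2 bits
```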

Conditional entropy

One can define the joint entropy $\mathcal{H}(\mathrm{X},\mathrm{Y})$ of $\mathrm{X}$ and $\mathrm{Y}$ by measuring the entropy of the joint pmf $p_{\tiny\mathrm{X}\mathrm{Y}}$. The conditional entropy of $\mathrm{X}$ given $\mathrm{Y}$ is the expected value of the entropy of $\mathrm{X}$ conditioned on the realisation of $\mathrm{Y}$: $$ \tag{2} \mathcal{H}(\mathrm{X}\vert \mathrm{Y}) := -\sum_{y} p_{\tiny\mathrm{Y}}(y) \sum_{x} p_{\tiny\mathrm{X}\vert \mathrm{Y}}(x \vert y)\log p_{\tiny\mathrm{X}\vert \mathrm{Y}}(x \vert y)\;. $$ There is a chain rule for the entropy, in the sense that $\mathcal{H}(\mathrm{X},\mathrm{Y}) = \mathcal{H}(\mathrm{X}\vert\mathrm{Y}) + \mathcal{H}(\mathrm{Y})$. When $\mathrm{X}$ and $\mathrm{Y}$ are independent, conditioning on $\mathrm{Y}$ does not change the entropy of $\mathrm{X}$, so the joint entropy is the sum of the marginal entropies: $\mathcal{H}(\mathrm{X},\mathrm{Y}) = \mathcal{H}(\mathrm{X}) + \mathcal{H}(\mathrm{Y})$.

Proof
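
Definition (2) and the chain rule are easy to verify numerically; below is a minimal sketch (the array names and the `entropy` helper are ours):

```python
import numpy as np

def entropy(p):
    """Entropy in nats of a pmf given as any array summing to one."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(6)).reshape(2, 3)   # joint pmf of (X, Y) on a 2x3 grid
p_y = p_xy.sum(axis=0)                           # marginal pmf of Y

# H(X|Y) from definition (2): average over y of the entropy of p_{X|Y=y}
h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(3))

# chain rule: H(X, Y) = H(X|Y) + H(Y)
assert np.isclose(entropy(p_xy), h_x_given_y + entropy(p_y))
```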

Cross entropy

When $\mathcal{X}=\mathcal{Y}$, one can define the cross entropy of $\mathrm{Y}$ relative to $\mathrm{X}$ as $ \mathcal{C}(\mathrm{X}, \mathrm{Y}) := -\sum_{x} p_{\tiny\mathrm{X}}(x)\log p_{\tiny\mathrm{Y}}(x) \;. $ A standard result known as Gibbs' inequality states that it is lower-bounded by the entropy of $\mathrm{X}$: $$ \mathcal{C}(\mathrm{X}, \mathrm{Y}) \geq \mathcal{H}(\mathrm{X}) \; .\tag{3} $$

Proof
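
Gibbs' inequality (3) is also easy to check numerically; here is a minimal sketch (helper names are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross entropy of q relative to p, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

rng = np.random.default_rng(2)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))

assert cross_entropy(p, q) >= entropy(p)             # Gibbs' inequality (3)
assert np.isclose(cross_entropy(p, p), entropy(p))   # equality when the pmfs match
```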

Relative entropy

In this section we assume that $\mathcal{X}=\mathcal{Y}$. The relative entropy of $\mathrm{X}$ with respect to $\mathrm{Y}$, also known as the Kullback-Leibler divergence of $\mathrm{X}$ from $\mathrm{Y}$, is defined as the difference between the cross-entropy and the entropy: $$ \text{KL}(p_{\tiny\mathrm{X}} \|\, p_{\tiny\mathrm{Y}}) := \mathcal{C}(\mathrm{X}, \mathrm{Y}) - \mathcal{H}(\mathrm{X})\;. $$ A direct consequence of (3) is that the relative entropy is non-negative. It is zero if and only if $p_{\tiny \mathrm{X}} =p_{\tiny \mathrm{Y}}$. It is only a divergence, not a distance: it is not symmetric and does not satisfy the triangle inequality. It does, however, upper-bound the total variation distance (this is Pinsker's inequality, not proven here): $$ \text{TV}(p_{\tiny\mathrm{X}}, p_{\tiny\mathrm{Y}}) \leq \sqrt{\text{KL}(p_{\tiny\mathrm{X}} \|\, p_{\tiny\mathrm{Y}})/2}\;. $$

Proof
Tighter bounds
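
The non-negativity, the asymmetry and Pinsker's inequality can all be observed numerically; a minimal sketch (the `kl` and `tv` helpers are ours):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete pmfs, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def tv(p, q):
    """Total variation distance: half the L1 distance between the pmfs."""
    return 0.5 * np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

rng = np.random.default_rng(3)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))

assert kl(p, q) >= 0 and np.isclose(kl(p, p), 0.0)   # non-negative, zero iff equal
print(kl(p, q), kl(q, p))                            # generally two different values
assert tv(p, q) <= np.sqrt(kl(p, q) / 2)             # Pinsker's inequality
```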

Mutual information

The mutual information $\mathcal{I}$ between $\mathrm{X}$ and $\mathrm{Y}$ is defined as the difference between the entropy and the conditional entropy, in an effort to measure the mutual dependence between the two random variables: $$ \mathcal{I}(\mathrm{X}, \mathrm{Y}) := \mathcal{H}(\mathrm{X}) - \mathcal{H}(\mathrm{X}\vert \mathrm{Y})\;. $$ The definition is of course symmetric (the roles of the two variables can be swapped). A useful identity, which makes clear how this measures the information about one variable contained in the other, involves the relative entropy between the joint distribution and the product of the marginals. Slightly abusing notation: $$ \tag{4} \mathcal{I}(\mathrm{X}, \mathrm{Y}) = \text{KL}(p_{\tiny\mathrm{XY}}\|\,p_{\tiny\mathrm{X}}\cdot p_{\tiny\mathrm{Y}})\;. $$ Observe that this identity proves that the mutual information is non-negative.

Proof
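
Identity (4) can be verified numerically from any joint pmf; a minimal sketch (helpers are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(4)
p_xy = rng.dirichlet(np.ones(6)).reshape(2, 3)       # joint pmf of (X, Y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)        # marginals

# I(X, Y) = H(X) - H(X|Y), using H(X|Y) = H(X, Y) - H(Y)
mi = entropy(p_x) - (entropy(p_xy) - entropy(p_y))

assert np.isclose(mi, kl(p_xy, np.outer(p_x, p_y)))  # identity (4)
assert mi >= 0                                       # non-negativity
```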

Conditional Mutual Information

The goal of this section is to prove the data processing inequality, a result formalising the idea that no information can be gained by mere processing. Before that, we need to introduce the notion of conditional mutual information. Let $\mathrm{Z}$ be some random variable taking values in $\mathcal{Z}$. The conditional mutual information between $\mathrm{X}$ and $\mathrm{Y}$ given $\mathrm{Z}$ is defined as the expected mutual information between $\mathrm{X}$ and $\mathrm{Y}$ given realisations of $\mathrm{Z}$. Concretely: $$ \mathcal{I}(\mathrm{X}, \mathrm{Y} \vert \mathrm{Z}) := \mathcal{H}(\mathrm{X}\vert \mathrm{Z}) - \mathcal{H}(\mathrm{X}\vert \mathrm{Y}, \mathrm{Z})\;. $$

We can give an equivalent but more intuitive characterisation that follows (4), slightly abusing notation again (the divergence is evaluated at each realisation of $\mathrm{Z}$ and averaged under $p_{\tiny\mathrm{Z}}$): $ \mathcal{I}(\mathrm{X}, \mathrm{Y} \vert \mathrm{Z}) = \mathbb{E}_{p_{\tiny\mathrm{Z}}}\left[\text{KL}(p_{\tiny\mathrm{XY\vert Z}}\|\,p_{\tiny\mathrm{X\vert Z}}\cdot p_{\tiny\mathrm{Y\vert Z}})\right]\;. $ The conditional mutual information is involved in an identity called the chain rule for mutual information: $$ \tag{5} \mathcal{I}(\mathrm{X}, (\mathrm{Y}, \mathrm{Z})) = \mathcal{I}(\mathrm{X}, \mathrm{Z}) + \mathcal{I}(\mathrm{X}, \mathrm{Y}\vert \mathrm{Z})\;. $$

Proof
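
The chain rule (5) can likewise be checked on a random joint pmf of $(\mathrm{X}, \mathrm{Y}, \mathrm{Z})$; a minimal sketch (all helpers and array names are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi(p_ab):
    """I(A, B) from a joint pmf, via I = H(A) + H(B) - H(A, B)."""
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

rng = np.random.default_rng(5)
p_xyz = rng.dirichlet(np.ones(12)).reshape(2, 3, 2)  # joint pmf of (X, Y, Z)
p_xz, p_yz = p_xyz.sum(axis=1), p_xyz.sum(axis=0)    # joint pmfs of (X, Z) and (Y, Z)

i_x_yz = mi(p_xyz.reshape(2, 6))                     # I(X, (Y, Z)), merging Y and Z
i_x_z = mi(p_xz)                                     # I(X, Z)
# I(X, Y | Z) = H(X|Z) - H(X|Y,Z), each term via the chain rule for entropy
i_x_y_given_z = (entropy(p_xz) - entropy(p_xz.sum(axis=0))) \
              - (entropy(p_xyz) - entropy(p_yz))

assert np.isclose(i_x_yz, i_x_z + i_x_y_given_z)     # chain rule (5)
```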

We are now ready to state and prove the data-processing inequality.

Data-processing inequality
$\quad$ Let $\mathrm{X}, \mathrm{Y}$ and $\mathrm{Z}$ form the Markov chain $\mathrm{X}\rightarrow \mathrm{Y}\rightarrow \mathrm{Z}$. Then: $$ \mathcal{I}(\mathrm{X}, \mathrm{Y}) \geq \mathcal{I}(\mathrm{X}, \mathrm{Z})\;. $$

The Markov chain $ \mathrm{X}\rightarrow \mathrm{Y}\rightarrow \mathrm{Z}$ means that $\mathrm{Z}$ is conditionally independent of $\mathrm{X}$ given $\mathrm{Y}$. This captures settings where one post-processes a signal $\mathrm{Y}$ that was encoded from $\mathrm{X}$. The data-processing inequality establishes that no additional information about $\mathrm{X}$ can be created out of thin air; in other words, post-processing can only decrease the mutual information with $\mathrm{X}$.

Proof
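
To see the inequality at work, here is a minimal sketch that builds a Markov chain $\mathrm{X}\rightarrow \mathrm{Y}\rightarrow \mathrm{Z}$ from two random channels and compares the two mutual informations (the channel matrices `k_yx`, `k_zy` and the helpers are ours, chosen for illustration):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi(p_ab):
    """I(A, B) from a joint pmf, via I = H(A) + H(B) - H(A, B)."""
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

rng = np.random.default_rng(6)
p_x = rng.dirichlet(np.ones(3))              # pmf of X
k_yx = rng.dirichlet(np.ones(4), size=3)     # encoding channel p(y|x), rows sum to 1
k_zy = rng.dirichlet(np.ones(2), size=4)     # post-processing channel p(z|y)

p_xy = p_x[:, None] * k_yx                   # joint pmf of (X, Y)
p_xz = p_xy @ k_zy                           # joint pmf of (X, Z): Z sees X only through Y

assert mi(p_xy) >= mi(p_xz)                  # data-processing inequality
```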