Information Entropy

In the field of Information Theory, we can define the information $I(E)$ conveyed by the occurrence of an event $E$ as
$$I(E) = \log \frac{1}{p(E)},$$
where $p(E)$ is the probability of the event $E$. Note that the more unlikely the event, the more information its occurrence conveys. The base of the logarithm sets the unit: base 2 gives bits.
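
As a quick illustration, here is a minimal Python sketch of this formula (the helper name `information` is my own; base-2 logs measure the result in bits):

```python
import math

def information(p: float) -> float:
    """Self-information I(E) = log2(1 / p(E)), in bits."""
    return math.log2(1.0 / p)

print(information(0.5))   # 1.0 bit: a fair coin flip
print(information(0.01))  # ~6.64 bits: a rare event conveys much more
```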

Suppose that we have an information source $X$, which can be thought of as a discrete random variable. $X$ can take on many "states" $\{x_i\}$. We can define the entropy of the source as
$$H(X) = \sum_{i=1}^{n}  p(x_i)I(x_i),$$
where $n$ is the number of states that $X$ can take. An example of $X$ is a source emitting letters from an alphabet.
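
Translating the definition directly into code (a sketch building on the `information` helper above; `entropy` is my own name):

```python
def entropy(probs: list[float]) -> float:
    """Shannon entropy H(X) = sum_i p(x_i) * I(x_i), in bits."""
    # States with p = 0 contribute nothing, by the convention 0 * log(1/0) = 0.
    return sum(p * information(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over four states
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits: a highly predictable source
```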
Qualitatively, the entropy of an information source measures its unpredictability. We can see this by observing the following:

If the probabilities of the states of the source are uniform, the entropy is maximised.

That is, when we have no prior information about the likelihood of each state, we expect the entropy of the information source to be large. We can prove this fact in a few ways.

One way is to use Jensen's inequality. Since $\log$ is concave,
\begin{equation}
H(X) = E\Big[\log \frac{1}{p(X)}\Big] \leq  \log E\Big[ \frac{1}{p(X)}\Big] = \log n,
\end{equation}
where the last equality holds because $E\big[\frac{1}{p(X)}\big] = \sum_{i=1}^{n} p(x_i) \frac{1}{p(x_i)} = n$. Equality is attained precisely when $p(x_i) = \frac{1}{n}$ for all $i$, so the uniform distribution achieves the maximum entropy $\log n$.
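
A quick numeric check of this bound, using the `entropy` helper above ($n = 4$ is an arbitrary choice):

```python
import math
import random

n = 4
bound = math.log2(n)  # the Jensen bound, log n (base 2 here)

# Entropy of a few random distributions never exceeds the bound...
for _ in range(3):
    weights = [random.random() for _ in range(n)]
    probs = [w / sum(weights) for w in weights]
    assert entropy(probs) <= bound + 1e-9

# ...and the uniform distribution attains it exactly.
print(entropy([1 / n] * n), bound)  # 2.0 2.0
```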

Another way is to use Lagrange multipliers. We want to maximise $H$ subject to the constraint that the $p(x_i)$ form a probability distribution. Setting the partial derivatives of the Lagrangian $\mathcal{L} = -\sum_{i=1}^{n} p(x_i) \log p(x_i) + \lambda \big( \sum_{i=1}^{n} p(x_i) - 1 \big)$ to zero leads to the equations
$$\log p(x_i) = \lambda - 1$$
and
$$\sum_{i=1}^{n} p(x_i) = 1.$$
The first equation says that $\log p(x_i)$, and hence $p(x_i)$, is the same constant for every $i$; the second is the normalisation condition on a probability distribution. Together they force $p(x_i) = \frac{1}{n}$, the same conclusion we reached above.
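
The same maximisation can be verified numerically; here is a sketch using `scipy.optimize.minimize` with an equality constraint (the setup and names are my own illustration):

```python
import numpy as np
from scipy.optimize import minimize

n = 4

def neg_entropy(p):
    # Minimising -H is the same as maximising H; the clip avoids log(0).
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

result = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(n)),  # random starting distribution
    bounds=[(0.0, 1.0)] * n,             # each p(x_i) lies in [0, 1]
    constraints={"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    method="SLSQP",
)
print(result.x)  # approximately [0.25, 0.25, 0.25, 0.25]
```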

Of course, this only applies to zero-memory sources, that is, information sources in which the emitted symbols are independent of one another and, in particular, do not depend on previous symbols. Things get more complicated when we consider $m$th-order Markov sources, which produce each symbol with a probability that depends on the $m$ previously encountered symbols.
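
For the first-order case ($m = 1$) the entropy rate averages the entropy of each state's outgoing transition probabilities over the chain's stationary distribution, $H = \sum_i \pi_i \sum_j P_{ij} \log \frac{1}{P_{ij}}$. A sketch (the two-state transition matrix `P` is an arbitrary example):

```python
import numpy as np

# P[i, j] = probability of emitting symbol j given the previous symbol was i.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution pi solves pi P = pi: the left eigenvector of P
# associated with eigenvalue 1, normalised to sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

# Entropy rate: the entropy of each row of P, weighted by pi.
row_entropies = -np.sum(P * np.log2(P), axis=1)
print(pi @ row_entropies)  # ~0.57 bits/symbol, below the zero-memory H(pi) ~0.72
```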
