Entropy (Information Theory)
In information theory, the major goal is for one party (the transmitter) to convey some message (over a channel) to another party (the receiver). To do so, the transmitter sends a series of (possibly just one) partial messages that give clues towards the original message. The information content of one of these partial messages is a measure of how much uncertainty it resolves for the receiver. For instance,
- A partial message that cuts the number of possibilities in half transmits one bit of information about the message. For instance, if the transmitter were trying to transmit the result of a randomly chosen digit to the receiver, the partial message "the number is odd" would transmit one bit of information.
- A partial message that doesn't reduce the number of possibilities at all transmits no information at all; for instance, the partial message "the number is less than 10" transmits zero bits of information.
In essence, the "information content" can be viewed as how much useful information the message actually contains.
The entropy, in this context, is the expected number of bits of information contained in each message, taken over all possibilities for the transmitted message. For example, suppose the transmitter wanted to inform the receiver of the result of a 4-person tournament, where some of the players are better than others. The entropy of this message would be a weighted average of the amount of information each of the possible messages (i.e. "player 1 won", "player 2 won", "player 3 won", or "player 4 won") provides.
In essence, the "entropy" can be viewed as how much useful information a message is expected to contain. It also provides a lower bound for the "size" of an encoding scheme: the expected number of bits transmitted under any lossless encoding scheme (one from which the original message can be reconstructed exactly) is at least the entropy of the message. An encoding scheme that achieves this lower bound is called optimal.
Formal Definition (Information)
Before defining $H$ formally, let us list the properties $H$ should satisfy:
1) Nonnegativity: $H(X) \ge 0.$
2) Conditioning reduces entropy: $H(X \mid Y) \le H(X).$
3) Invariance under relabeling: $H(X)=H(f(X))$ for any bijective $f.$
A situation with several possible outcomes can be modeled by a random variable $X$, where for any possible outcome $x_i$ that has probability $p_i$ of occurring:
$\text{Pr}(X=x_i) = p_i.$
Communication can be viewed as giving information about this random variable $X$.
The simplest type of message is one that specifies the value of $X$. The information content of a message specifying that $X$ resulted in $x_i$ is
$I(x_i)=\log_2\left(\frac{1}{p_i}\right).$
This makes intuitive sense: specifying that a high-probability event occurred does not transmit much information (it doesn't resolve much uncertainty, since the result was "expected"), while specifying that a low-probability event occurred transmits a significant amount of information. The reason $I$ is logarithmic stems from the simple requirement that $I$ be additive over independent events; i.e.
$I(x_ix_j)=I(x_i)+I(x_j)$
meaning that, in essence, if the situation were run twice and the information about both results were transmitted simultaneously, the information content would be the same as if information about the results were separately transmitted. It is clear that the above condition is satisfied by the logarithmic function, hence the definition.
It is also worth noting that the definition is satisfying for two reasons:
- Transmitting a result that had probability $\frac{1}{2}$ transmits $\log_2\left(\frac{1}{\frac{1}{2}}\right)=1$ bit of information, which agrees with the introduction (that cutting the number of possibilities by half is transmitting 1 bit of information).
- Transmitting a result that had probability 1 transmits $\log_2\left(\frac{1}{1}\right)=0$ bits of information, which also agrees with the introduction (that transmitting a message that doesn't reduce the number of possibilities transmits no information).
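These two checks can be reproduced directly. The sketch below uses `information_content`, a hypothetical helper name chosen for illustration, to compute $I(x_i)=\log_2(1/p_i)$:

```python
import math

def information_content(p: float) -> float:
    """Bits of information in learning that an event of probability p occurred."""
    return math.log2(1 / p)

# An event with probability 1/2 (halving the possibilities) carries one bit.
print(information_content(0.5))   # 1.0
# A certain event (probability 1) carries no information.
print(information_content(1.0))   # 0.0
```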
Furthermore, this definition continues to hold even when more complex messages are received. For instance, in the previous example of transmitting the result of a random digit, the message "the digit is at least 3" had a $\frac{7}{10}$ chance of being received; hence, the information content of such a message is
$\log_2\left(\frac{1}{\frac{7}{10}}\right)=\log_2\left(\frac{10}{7}\right).$
It is also worth examining what happens upon the further message "the digit is at most 3": conditioned on the first message, this has a $\frac{1}{7}$ chance of being received, so the information content of this message is
$\log_2\left(\frac{1}{\frac{1}{7}}\right)=\log_2 7.$
Therefore, the information content of these messages taken together is
$\log_2\left(\frac{10}{7}\right)+\log_2 7=\log_2 10,$
which is exactly the information content of the message "the digit is 3." This agrees with the additivity of information over successive messages (the probability of the second message being conditioned on the first), and affirms the choice of a logarithmic function.
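The arithmetic above can be verified numerically; a minimal check:

```python
import math

# Information from "the digit is at least 3" (7 of the 10 digits qualify) ...
first = math.log2(10 / 7)
# ... plus the follow-up "the digit is at most 3" (1 of the remaining 7) ...
second = math.log2(7 / 1)
# ... equals the information in the single message "the digit is 3" (1 of 10).
direct = math.log2(10 / 1)
print(first + second, direct)  # the two values agree
```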
Formal Definition (Entropy)
The entropy of a message is defined as the expected amount of information to be transmitted about the random variable $X$ defined in the previous section. More formally, if $X$ takes on the states $x_1, x_2, \ldots, x_n$, the entropy is defined as
$\sum_{i=1}^n p_i\log_2\left(\frac{1}{p_i}\right).$
This comes from repeating the scenario involving $X$ a large number $N$ times. The expected number of times $x_i$ is transmitted is $n_i=Np_i$, and the amount of information received from that message is $I(x_i)=\log_2\left(\frac{1}{p_i}\right)$. Hence the total amount of information received over the $N$ messages (recall that information is additive) is
$\sum_{i=1}^n n_iI(x_i)=\sum_{i=1}^nNp_iI(x_i)=N\sum_{i=1}^np_iI(x_i),$
so the expected amount of information per message is $\sum_{i=1}^np_iI(x_i)$, as expected.
A small issue worth pointing out immediately: in the case where $p_i=0$, it is conventional to take $p_i\log_2\left(\frac{1}{p_i}\right)=0$, which is justified since $\lim_{x \to 0^+} x\log x=0$. This also makes intuitive sense: if an event never occurs, it cannot contribute to the entropy, as it is never expected to occur.
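The definition above translates directly into a short function. This is a minimal sketch (the name `entropy` is chosen for illustration), including the convention that zero-probability outcomes contribute nothing:

```python
import math

def entropy(probabilities) -> float:
    """Expected information content, in bits, of a random variable with the
    given outcome probabilities. Zero-probability terms contribute 0 by convention."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# A fair coin has one bit of entropy per flip.
print(entropy([0.5, 0.5]))   # 1.0
# A certain outcome carries no uncertainty, hence zero entropy.
print(entropy([1.0, 0.0]))   # 0.0
```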
Application to Encoding
A major goal of information theory is to encode messages as binary strings in such a way that
- a string of transmitted bits (binary digits) can be reconstructed into a sequence of messages, and
- the number of bits necessary to transmit each message is as small as possible.
For instance, the encoding scheme
| Letter | Encoding |
|--------|----------|
| A      | 0        |
| B      | 1        |
| C      | 01       |
is efficient but extremely flawed, as the message "AB" and the message "C" are indistinguishable (both would be transmitted as 01). In contrast, the encoding scheme
| Letter | Encoding |
|--------|----------|
| A      | 00001    |
| B      | 00111    |
| C      | 11111    |
is correct (any stream of bits can be reconstructed into the original message), but extremely inefficient, as significantly fewer bits suffice to reliably transmit the same messages.
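The contrast between the two schemes can be demonstrated concretely. The sketch below (with `decode_fixed` as an illustrative helper name) shows that the first scheme produces ambiguous bit streams, while the second decodes uniquely because every codeword has the same length:

```python
# The flawed scheme: "AB" and "C" encode to the same bit string.
flawed = {"A": "0", "B": "1", "C": "01"}
assert "".join(flawed[ch] for ch in "AB") == flawed["C"] == "01"

# The correct (but inefficient) scheme: fixed 5-bit codewords.
fixed = {"A": "00001", "B": "00111", "C": "11111"}
decode = {bits: ch for ch, bits in fixed.items()}

def decode_fixed(stream: str) -> str:
    """Every codeword is exactly 5 bits, so the stream splits unambiguously."""
    return "".join(decode[stream[i:i + 5]] for i in range(0, len(stream), 5))

print(decode_fixed("000010011111111"))  # "ABC"
```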
It is natural to ask what the "best" encoding scheme is, and while the answer strongly depends on application (for example, transmissions over noisy channels would do well to employ error correcting codes), one natural answer is "the correct encoding scheme that minimizes the expected length of a message." That theoretical minimum is given by the entropy of the message.
For example, suppose a message details the value of a random variable $X$, defined by
| $x$ | $\text{Pr}(X = x)$ |
|-----|--------------------|
| A   | 0.5                |
| B   | 0.25               |
| C   | 0.125              |
| D   | 0.125              |
The entropy of this message is
$\begin{aligned} \sum_{i=1}^{n} p_i\log_2\left(\frac{1}{p_i}\right) &= 0.5\log_2\left(\frac{1}{0.5}\right) + 0.25\log_2\left(\frac{1}{0.25}\right) + 0.125\log_2\left(\frac{1}{0.125}\right) + 0.125\log_2\left(\frac{1}{0.125}\right) \\ &=0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 \\ &= 1.75, \end{aligned}$
so the expected length of any correct encoding scheme is at least $1.75$ bits. In this case, this can be realized with a relatively simple scheme:
| Letter | Encoding | Length |
|--------|----------|--------|
| A      | 1        | 1      |
| B      | 01       | 2      |
| C      | 001      | 3      |
| D      | 000      | 3      |
It can be verified that the above encoding is correct (it is an example of Huffman coding), and it has expected length
$1 \cdot 0.5 + 2 \cdot 0.25 + 3 \cdot 0.125 + 3 \cdot 0.125 = 1.75,$
which is equal to the entropy of the message.
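Both quantities can be computed from the tables above; a minimal check that the expected code length meets the entropy bound exactly:

```python
import math

probabilities = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "1", "B": "01", "C": "001", "D": "000"}

# Entropy of the message, in bits.
entropy = sum(p * math.log2(1 / p) for p in probabilities.values())
# Expected number of bits per transmitted symbol under the code above.
expected_length = sum(p * len(code[x]) for x, p in probabilities.items())

print(entropy, expected_length)  # 1.75 1.75
```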
Application to Data Compression
A ubiquitous application of encoding schemes, and thus entropy, is data compression: the act of transforming a large file into a smaller, equivalent file for storage (but usually not human readability). One simple example of such a scheme is a run-length code, which replaces each run of repeated bits with two numbers: the bit and the number of times it appears. For instance, the message "0011100011" could become "02130312".
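The run-length example above can be sketched in a few lines (the name `run_length_encode` is illustrative; this naive format assumes no run is 10 bits or longer, since each length occupies a single digit):

```python
from itertools import groupby

def run_length_encode(bits: str) -> str:
    """Replace each run of repeated bits with the bit followed by the run length."""
    return "".join(f"{bit}{len(list(run))}" for bit, run in groupby(bits))

print(run_length_encode("0011100011"))  # "02130312"
```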
In essence, this kind of data compression is doing the same thing entropy was meant to measure: events that occur with higher likelihood give less information content, and thus should be encoded with fewer bits. Analogously, data compression identifies inherent structure in the large file, and replaces the more commonly appearing structures with smaller compressions. For instance^{[1]},
That Sam-I-Am!
That Sam-I-Am!
I do not like that Sam-I-Am!
Do you like green eggs and ham?
I do not like them, Sam-I-Am!
I do not like green eggs and ham!
Would you like them here or there?
I would not like them here or there.
I would not like them anywhere.
I do not like green eggs and ham.
I do not like them, Sam-I-Am.
can be compressed to
1=hat
2=Sam-I-Am
3=I do not like
4=green eggs and ham
5=you like
6=here or there
7=I would not like them
T1 2!
T1 2!
3 t1 2!
Do 5 4?
3 4!
Would 5 them 6?
7 6.
7 anywhere.
3 4.
3 them, 2.
which is a shortening from 322 to 186 characters (and the entire story would experience even better compression, since the substitution table is amortized over more text).
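General-purpose compressors exploit exactly this kind of repeated structure. As a rough demonstration (the particular text and repetition count are arbitrary), Python's standard-library `zlib` shrinks highly repetitive input dramatically:

```python
import zlib

# Repetitive text, like the verse above, compresses well because the
# compressor replaces recurring substrings with short back-references.
text = b"I do not like green eggs and ham. " * 10
compressed = zlib.compress(text)
print(len(text), len(compressed))  # the compressed form is far shorter
```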
References
- Garcia, E. *How does file compression work?* Retrieved March 16, 2016, from http://qr.ae/RS2Y2m