### 6.1.5 Information Theory

Probability forms the basis of information theory. In this section, we give a brief overview of information theory.

A **bit** is a binary digit. Because a bit has two possible
values, it can be used to distinguish two items. Often the two
values are written as *0* and *1*, but they can be any two different
values.

Two bits can distinguish four items, each associated with
either *00*, *01*, *10*, or *11*. Similarly, three bits can distinguish
eight items. In general, *n* bits can distinguish *2 ^{n}* items. Thus, we
can distinguish

*n*items with

*log*bits. It may be surprising, but we can do better than this by taking probabilities into account.

_{2}n**Example 6.9:**Suppose you want to design a code to distinguish the elements of the set

*{a,b,c,d}*, with

*P(a)=(1)/(2)*,

*P(b)=(1)/(4)*,

*P(c)=(1)/(8)*, and

*P(d)=(1)/(8)*. Consider the following code:

a 0 c 110 b 10 d 111

This code sometimes uses 1 bit and sometimes uses 3 bits. On average, it uses

P(a) ×1 + P(b) ×2 + P(c) ×3 + P(d) ×3 = (1)/(2) + (2)/(4) + (3)/(8) + (3)/(8)= 1(3)/(4) bits.

For example, the string *aacabbda* with 8 characters has code *00110010101110*, which uses 14 bits.

With this code, *- log _{2} P(a)=1* bit is required to distinguish

*a*from the other symbols. To distinguish

*b*, you must have

*- log*bits. To distinguish

_{2}P(b)=2*c*, you must have

*- log*bits.

_{2}P(c)=3It is possible to build a code that, to identify *x*, requires *- log _{2} P(x)*
bits (or the integer greater than this, if

*x*is the only thing being transmitted). Suppose there is a sequence of symbols we want to transmit or store and we know the probability distribution over the symbols. A symbol

*x*with probability

*P(x)*uses

*- log*bits. To transmit a sequence, each symbol requires, on average,

_{2}P(x)∑_{x}-P(x)× log_{2}P(x)

bits to send it. This value just depends on the probability distribution of
the symbols.
This is called the **information content** or **entropy** of the distribution.

Analogously to conditioning in probability,
the expected number of bits it takes to describe a distribution given
evidence *e* is

I(e) = ∑_{x}-P(x|e)× log_{2}P(x|e).

If a test exists that can distinguish the cases where *α* is
true from the cases where *α* is false, the
**information gain** from this test is

I(true) - ( P(α) ×I(α) + P(¬α) ×I(¬α)),

where *I(true)* is the expected number of bits needed before the test and
*P(α) ×I(α) + P(¬α) ×I(¬α)* is the expected number of bits after the test.

In later sections, we use the notion of information for a number of tasks:

- In diagnosis, an agent can choose a test that provides the most information.
- In decision-tree learning, information theory provides a useful criterion for choosing which property to split on: split on the property that provides the greatest information gain. The elements it must distinguish are the different values in the target concept, and the probabilities are obtained from the proportion of each value in the training set remaining at each node.
- In Bayesian learning, information theory provides a basis for deciding which is the best model given some data.