8.1 Probability

8.1.5 Information

The information theory box discussed how to represent information using bits. For $x \in \mathrm{domain}(X)$, it is possible to build a code that, to identify $x$, uses $-\log_2 P(x)$ bits (or the integer greater than this). The expected number of bits needed to transmit a value for $X$ is then

$$H(X) = \sum_{x \in \mathrm{domain}(X)} -P(X{=}x) \log_2 P(X{=}x)$$

This is the information content or entropy of random variable X.

[Note that, unlike the notation used elsewhere in the book, H is a function of the variable, not a function of the values of the variable. Thus, for a variable X, the entropy H(X) is a number, unlike P(X), which is a function that, given a value for X, returns a number.]
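As a concrete illustration, here is a minimal Python sketch (not code from the book; the dictionary representation and function names are assumptions for this example) that computes the $-\log_2 P(x)$ code length for a value and the entropy $H(X)$ of a distribution given as a mapping from values to probabilities.

```python
import math

def code_length(p):
    """Bits needed to identify a value that has probability p."""
    return -math.log2(p)

def entropy(dist):
    """H(X) = sum over x in domain(X) of -P(X=x) * log2 P(X=x).

    dist maps each value x to its probability P(X=x).
    """
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

# A fair eight-valued spinner (as in Example 8.11 below):
spinner = {i: 1 / 8 for i in range(1, 9)}
print(code_length(1 / 8))   # 3.0 bits to identify any one outcome
print(entropy(spinner))     # 3.0 bits expected to transmit a value
```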

The entropy of $X$ given the observation $Y{=}y$ is

$$H(X \mid Y{=}y) = \sum_x -P(X{=}x \mid Y{=}y) \log_2 P(X{=}x \mid Y{=}y).$$

Before observing $Y$, the expectation of this entropy over $Y$,

$$H(X \mid Y) = \sum_y P(Y{=}y) \sum_x -P(X{=}x \mid Y{=}y) \log_2 P(X{=}x \mid Y{=}y),$$

is called the conditional entropy of $X$ given $Y$.

For a test that determines the value of $Y$, the information gain from this test is $H(X) - H(X \mid Y)$, which is the expected number of bits needed to describe $X$ minus the expected number of bits needed to describe $X$ after the value of $Y$ is learned. The information gain is never negative.
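The conditional entropy and information gain can be computed in the same style as the entropy sketch above. The code below assumes the joint distribution of $X$ and $Y$ is given as a dictionary mapping $(x, y)$ pairs to probabilities; this representation and the noisy-sensor usage example are illustrative choices, not the book's.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Entropy of a distribution given as {value: probability}."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(X | Y) = sum_y P(Y=y) * sum_x -P(X=x|Y=y) log2 P(X=x|Y=y).

    joint maps (x, y) pairs to probabilities P(X=x, Y=y).
    """
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_y[y] += p
    total = 0.0
    for y, py in p_y.items():
        cond = {x: p / py for (x, y2), p in joint.items() if y2 == y}
        total += py * entropy(cond)
    return total

def information_gain(joint):
    """H(X) - H(X | Y): expected reduction in bits to describe X after observing Y."""
    p_x = defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
    return entropy(p_x) - conditional_entropy(joint)

# A fair coin X and a hypothetical sensor Y that agrees with X 75% of the time:
joint = {("h", "h"): 0.375, ("h", "t"): 0.125,
         ("t", "h"): 0.125, ("t", "t"): 0.375}
print(round(information_gain(joint), 3))   # about 0.189 bits
```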

Example 8.11.

Suppose spinning a wheel in a game produces a number in the set $\{1, 2, \dots, 8\}$, each with equal probability. Let $S$ be the outcome of a spin. Then $H(S) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits.

Suppose there is a sensor $G$ that detects whether the outcome is greater than 6, so $G{=}\mathit{true}$ if $S > 6$. Then $H(S \mid G) = -0.25 \log_2 \frac{1}{2} - 0.75 \log_2 \frac{1}{6} \approx 2.19$. The information gain of $G$ is thus $3 - 2.19 = 0.81$ bits. A fraction of a bit makes sense in that it is possible to design a code that uses about 219 bits to describe 100 outcomes once the sensor values are known.

For an “even” sensor $E$, where $E{=}\mathit{true}$ if $S$ is even, $H(S \mid E) = -0.5 \log_2 \frac{1}{4} - 0.5 \log_2 \frac{1}{4} = 2$. The information gain of $E$ is thus 1 bit.
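The numbers in Example 8.11 can be checked directly; the following is a sketch of that arithmetic, not code from the book.

```python
import math

# Entropy of the fair eight-valued spin S:
H_S = -sum((1 / 8) * math.log2(1 / 8) for _ in range(8))          # 3.0 bits

# "Greater than 6" sensor G: with probability 2/8 it leaves 2 equally
# likely outcomes, with probability 6/8 it leaves 6 equally likely outcomes.
H_S_given_G = -0.25 * math.log2(1 / 2) - 0.75 * math.log2(1 / 6)  # about 2.19
gain_G = H_S - H_S_given_G                                        # about 0.81 bits

# "Even" sensor E: either answer leaves 4 equally likely outcomes.
H_S_given_E = -0.5 * math.log2(1 / 4) - 0.5 * math.log2(1 / 4)    # 2.0
gain_E = H_S - H_S_given_E                                        # 1.0 bit

print(round(H_S, 2), round(gain_G, 2), round(gain_E, 2))          # 3.0 0.81 1.0
```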

The notion of information is used for a number of tasks:

  • In diagnosis, an agent could choose a test that provides the most information.

  • In decision tree learning, information theory provides a useful criterion for choosing which property to split on: split on the property that provides the greatest information gain. The elements it must distinguish are the different values in the target concept, and the probabilities are obtained from the proportion of each value in the training set remaining at each node (a minimal sketch of this criterion is given after this list).

  • In Bayesian learning, information theory provides a basis for deciding which is the best model given some data.
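As an illustration of the decision-tree criterion mentioned above, the following sketch chooses the feature with the greatest information gain about the target. The dataset, feature names, and helper functions are hypothetical, not taken from the book.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of the empirical distribution of target values in labels."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, feature, target="target"):
    """Entropy of the target minus its expected entropy after splitting on feature."""
    base = entropy_of_labels([e[target] for e in examples])
    n = len(examples)
    remainder = 0.0
    for value in set(e[feature] for e in examples):
        subset = [e[target] for e in examples if e[feature] == value]
        remainder += len(subset) / n * entropy_of_labels(subset)
    return base - remainder

def best_split(examples, features, target="target"):
    """The feature providing the greatest information gain about the target."""
    return max(features, key=lambda f: information_gain(examples, f, target))

# Hypothetical training data: "weather" separates the target perfectly,
# "day" tells us nothing, so best_split chooses "weather".
data = [
    {"weather": "sun",  "day": "sat", "target": "out"},
    {"weather": "sun",  "day": "sun", "target": "out"},
    {"weather": "rain", "day": "sat", "target": "in"},
    {"weather": "rain", "day": "sun", "target": "in"},
]
print(best_split(data, ["weather", "day"]))   # weather
```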