7.8 Learning as Refining the Hypothesis Space

7.8.2 Probably Approximately Correct Learning

Rather than just studying different learning algorithms that happen to work well, computational learning theory investigates general principles that can be proved to hold for classes of learning algorithms.

Some relevant questions that we can ask about a theory of computational learning include the following:

  • Is the learner guaranteed to converge to the correct hypothesis as the number of examples increases?

  • How many examples are required to identify a concept?

  • How much computation is required to identify a concept?

In general, the answer to the first question is “no,” unless it can be guaranteed that the examples always eventually rule out all but the correct hypothesis. An adversary out to trick the learner could choose examples that do not help discriminate correct hypotheses from incorrect ones. If such an adversary cannot be ruled out, a learner cannot be guaranteed to converge to the correct hypothesis. However, given randomly chosen examples, a learner that always chooses a consistent hypothesis can get arbitrarily close to the correct concept. Making this precise requires a notion of closeness and a specification of what counts as a randomly chosen example.

Consider a learning algorithm that chooses a hypothesis consistent with all of the training examples. Assume a probability distribution over possible examples and that the training examples and the test examples are chosen from the same distribution. The distribution does not have to be known. We will prove a result that holds for all distributions.

The error of hypothesis h on instance space I, written error(I,h), is defined to be the probability of choosing an element i of I such that h(i) ≠ Y(i), where h(i) is the predicted value of target variable Y on possible example i, and Y(i) is the actual value of Y on example i. That is,

error(I,h) = P(h(i) ≠ Y(i) | i ∈ I).

An agent typically does not know P or Y(i) for all i and, thus, does not actually know the error of a particular hypothesis.

Given ϵ>0, hypothesis h is approximately correct if error(I,h) ≤ ϵ.
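The definitions above can be made concrete with a small sketch. The following Python fragment (illustrative only; the function and variable names, the sampling routine, and the toy target are choices made here, not part of the text) estimates error(I,h) by drawing instances from the example distribution. Since an agent does not know P or Y exactly, such an estimate from held-out data is all it can ever compute.

```python
import random

def estimate_error(h, Y, sample_instance, num_samples=10_000):
    """Monte Carlo estimate of error(I, h): the fraction of randomly drawn
    instances i on which the prediction h(i) differs from the target Y(i)."""
    draws = (sample_instance() for _ in range(num_samples))
    return sum(h(i) != Y(i) for i in draws) / num_samples

def approximately_correct(h, Y, sample_instance, epsilon, num_samples=10_000):
    """h is approximately correct if its (estimated) error is at most epsilon."""
    return estimate_error(h, Y, sample_instance, num_samples) <= epsilon

# Toy illustration: instances are 5 random bits, the target is
# "the first two bits are both 1", and the hypothesis only tests the first bit.
sample_instance = lambda: tuple(random.randint(0, 1) for _ in range(5))
Y = lambda i: i[0] == 1 and i[1] == 1
h = lambda i: i[0] == 1
print(estimate_error(h, Y, sample_instance))                        # roughly 0.25
print(approximately_correct(h, Y, sample_instance, epsilon=0.05))   # False
```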

We make the following assumption.

Assumption 7.3.

The training and test examples are chosen independently from the same probability distribution as the population.

It is still possible that the examples do not distinguish hypotheses that are far away from the concept. It is just very unlikely that they do not. A learner that chooses a hypothesis consistent with the training examples is probably approximately correct if, for an arbitrary number δ (0 < δ ≤ 1), the algorithm is not approximately correct in at most δ of the cases. That is, the hypothesis generated is approximately correct at least 1-δ of the time.

Under the preceding assumption, for arbitrary ϵ and δ, we can guarantee that an algorithm that returns a consistent hypothesis will find a hypothesis with error at most ϵ in at least 1-δ of the cases. Moreover, this result does not depend on the probability distribution.

Proposition 7.4.

Given Assumption 7.3, if a hypothesis is consistent with at least

(1/ϵ) (ln |H| + ln(1/δ))

training examples, it has error at most ϵ, at least 1-δ of the time.

Proof.

Suppose ϵ>0 and δ>0 are given. Partition the hypothesis space H into

H0 = {h : error(I,h) ≤ ϵ}
H1 = {h : error(I,h) > ϵ}.

We want to guarantee that the learner does not choose an element of H1 in more than δ of the cases.

Suppose h ∈ H1; then

P(h is wrong for a single example) ≥ ϵ
P(h is correct for a single example) ≤ 1-ϵ
P(h is correct for m random examples) ≤ (1-ϵ)^m.

Therefore,

P(H1 contains a hypothesis that is correct for m random examples)
    ≤ |H1| (1-ϵ)^m
    ≤ |H| (1-ϵ)^m
    ≤ |H| e^(-ϵm)

using the inequality (1-ϵ) ≤ e^(-ϵ) if 0 ≤ ϵ ≤ 1.

If we ensure that |H| e^(-ϵm) ≤ δ, we guarantee that H1 does not contain a hypothesis that is correct for m examples in more than δ of the cases. So, in all but δ of the cases, every hypothesis that is correct for the m examples is in H0.

Solving for m gives

m ≥ (1/ϵ) (ln |H| + ln(1/δ))

which proves the proposition. ∎
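The “solving for m” step compresses a little algebra. Written out (the same rearrangement, sketched in LaTeX), taking logarithms preserves the inequality because ln is increasing, and dividing by -ϵ reverses it:

```latex
\begin{align*}
|H|\, e^{-\epsilon m} \le \delta
  &\iff e^{-\epsilon m} \le \frac{\delta}{|H|}
   \iff -\epsilon m \le \ln \delta - \ln |H| \\
  &\iff m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right).
\end{align*}
```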

The number of examples required to guarantee this error bound is called the sample complexity. The number of examples required according to this proposition is a function of ϵ, δ, and the size of the hypothesis space.

Example 7.31.

Suppose the hypothesis space H is the set of conjunctions of literals on n Boolean variables. In this case |H| = 3^n + 1 because, for each conjunction, each variable is in one of three states: (1) it is unnegated in the conjunction, (2) it is negated, or (3) it does not appear; the “+1” is needed to represent false, which is the conjunction of any atom and its negation. Thus, the sample complexity is (1/ϵ)(n ln 3 + ln(1/δ)) examples, which is polynomial in n, 1/ϵ, and ln(1/δ).

If we want to guarantee at most a 5% error 99% of the time and have 30 Boolean variables, then ϵ = 1/20, δ = 1/100, and n = 30. The bound says that we can guarantee this performance if we find a hypothesis that is consistent with 20 * (30 ln 3 + ln 100) ≈ 752 examples. This is much fewer than the number of possible instances, which is 2^30 = 1,073,741,824, and the number of hypotheses, which is 3^30 + 1 = 205,891,132,094,650.
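The count |H| = 3^n + 1 can be sanity-checked by brute force for small n. The sketch below (illustrative code, not from the text) enumerates every syntactic conjunction, letting each variable be absent, unnegated, negated, or both negated and unnegated, and counts the distinct Boolean functions that result; the 3^n satisfiable conjunctions denote distinct functions, and every contradictory conjunction collapses to the single function false.

```python
from itertools import product

def truth_table(states):
    """Truth table of the conjunction described by states, where states[i]
    says how variable i appears: 'absent', 'pos', 'neg', or 'both'."""
    n = len(states)
    def value(assignment):
        for i, s in enumerate(states):
            if s == "pos" and not assignment[i]:
                return False
            if s == "neg" and assignment[i]:
                return False
            if s == "both":          # x_i and (not x_i): always false
                return False
        return True
    return tuple(value(a) for a in product([False, True], repeat=n))

n = 3
distinct = {truth_table(states)
            for states in product(["absent", "pos", "neg", "both"], repeat=n)}
print(len(distinct), 3**n + 1)       # both are 28 for n = 3
```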

Example 7.32.

If the hypothesis space H is the set of all Boolean functions on n variables, then |H| = 2^(2^n); thus, we require (1/ϵ)(2^n ln 2 + ln(1/δ)) examples. The sample complexity is exponential in n.

If we want to guarantee at most a 5% error 99% of the time and have 30 Boolean variables, then ϵ = 1/20, δ = 1/100, and n = 30. The bound says that we can guarantee this performance if we find a hypothesis consistent with 20 * (2^30 ln 2 + ln 100) ≈ 14,885,222,452 examples.
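The arithmetic in Examples 7.31 and 7.32 is Proposition 7.4 evaluated at particular numbers. A minimal helper is sketched below; the name sample_bound and the decision to pass ln|H| rather than |H| are choices made here (the latter so that the astronomically large hypothesis space of Example 7.32 never has to be constructed).

```python
import math

def sample_bound(ln_hyp_size, epsilon, delta):
    """Examples sufficient by Proposition 7.4: (1/epsilon)(ln|H| + ln(1/delta)),
    rounded up to a whole number of examples."""
    return math.ceil((ln_hyp_size + math.log(1 / delta)) / epsilon)

# Example 7.31: conjunctions of literals on n = 30 variables, ln|H| ≈ 30 ln 3.
print(sample_bound(30 * math.log(3), epsilon=1/20, delta=1/100))      # 752

# Example 7.32: all Boolean functions on n = 30 variables, ln|H| = 2^30 ln 2.
print(sample_bound(2**30 * math.log(2), epsilon=1/20, delta=1/100))   # 14885222452
```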

Consider the third question raised at the start of this section, namely how quickly a learner can find the probably approximately correct hypothesis. First, if the sample complexity is exponential in the size of some parameter (e.g., n above), the computational complexity must also be exponential, because an algorithm must at least consider each example. To show that an algorithm runs in polynomial time, you need to find a hypothesis space with polynomial sample complexity and show that the algorithm uses polynomial time per example.
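As an illustration of the last point, the sketch below (a standard elimination strategy, not an algorithm given in the text) finds a conjunction of literals consistent with the training examples, for the hypothesis space of Example 7.31, using O(n) work per example: start with the conjunction of all 2n literals and drop any literal contradicted by a positive example, keeping the negative examples only to check consistency at the end.

```python
def learn_conjunction(examples, n):
    """Return a conjunction of literals over n Boolean variables that is
    consistent with the labelled examples, using O(n) work per example.

    examples: iterable of (x, y) pairs, where x is a tuple of n bits (0/1)
    and y is the Boolean label.  A literal is a pair (i, s): variable i is
    required to equal s (s = 1 for the unnegated variable, 0 for negated).
    """
    # Most specific hypothesis: the conjunction of all 2n literals.
    literals = {(i, s) for i in range(n) for s in (0, 1)}
    negatives = []
    for x, y in examples:
        if y:
            # A positive example removes every literal it contradicts.
            literals = {(i, s) for (i, s) in literals if x[i] == s}
        else:
            negatives.append(x)

    def h(x):
        return all(x[i] == s for (i, s) in literals)

    # If the examples really come from some conjunction, the most specific
    # conjunction consistent with the positives already classifies every
    # negative example as false; this check guards against inconsistent data.
    assert not any(h(x) for x in negatives), "no consistent conjunction exists"
    return h

# Toy run: the hidden target is x0 AND (NOT x2) on n = 3 variables.
train = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = learn_conjunction(train, 3)
print([h(x) for x, _ in train])      # [True, True, False, False], matching the labels
```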