Artificial Intelligence - foundations of computational agents -- 7.7.2 Probably Approximately Correct Learning

Third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including the full text).

7.7.2 Probably Approximately Correct Learning

So far, we have seen a number of different learning algorithms. This section covers some of the theoretical aspects of learning, developed in an area called computational learning theory.

Some relevant questions that we can ask about a theory of computational learning include the following:

Is the learner guaranteed to converge to the correct hypothesis as the number of examples increases?
How many examples are required to identify a concept?
How much computation is required to identify a concept?

In general, the answer to the first question is "no," unless it can be guaranteed that the examples always eventually rule out all but the correct hypothesis. Someone out to trick the learner could choose examples that do not help discriminate correct hypotheses from incorrect hypotheses. So if such a person cannot be ruled out, a learner cannot guarantee to find a consistent hypothesis. However, given randomly chosen examples, a learner that always chooses a consistent hypothesis can get arbitrarily close to the correct concept. This requires a notion of closeness and a specification of what is a randomly chosen example.

Consider a learning algorithm that chooses a hypothesis consistent with all of the training examples. Assume a probability distribution over possible examples and that the training examples and the test examples are chosen from the same distribution. The distribution does not have to be known. We will prove a result that holds for all distributions.

Define the error of hypothesis h∈H, written error(h), to be the probability of choosing an element i of I such that h(i)≠val(i,Y), where h(i) is the predicted value of target variable Y on possible example i, and val(i,Y) is the actual value of Y. Recall that I, the instance space, is the set of all possible examples. That is,

error(h)=P(h(i)≠val(i,Y)| i ∈I) .

An agent typically does not know P or val(i,Y) for all i and, thus, does not actually know the error of a particular hypothesis.

Given ε>0, hypothesis h is approximately correct if error(h) ≤ ε.

We make the following assumption.

Assumption. The training and test examples are chosen independently from the same probability distribution as the population.

It is still possible that the examples do not distinguish hypotheses that are far away from the concept - it is just very unlikely that they do not. A learner that chooses a hypothesis that is consistent with the training examples is probably approximately correct if, for an arbitrary number δ (0<δ ≤ 1), the algorithm is not approximately correct in at most δ of the cases. That is, the hypothesis generated is approximately correct at least 1-δ of the time.

Under the preceding assumption, for arbitrary ε and δ, we can guarantee that an algorithm that returns a consistent hypothesis will find a hypothesis with error less than ε, in at least 1-δ of the cases. Moreover, this outcome does not depend on the probability distribution.

Suppose ε>0 and δ>0 are given. Partition the hypothesis space H into

H₀ = {h∈H: error(h) ≤ ε}

H₁ = {h∈H: error(h) > ε} .

We want to guarantee that the learner does not choose an element of H₁ in more than δ of the cases.

Suppose h ∈H₁, then

P(h is wrong for a single example) ≥ ε
P(h is correct for a single example) ≤ 1- ε
P(h is correct for m examples) ≤ (1- ε)^m

Thus,

P(H₁ contains a hypothesis that is correct for m examples)
      ≤ |H₁| (1- ε)^m
      ≤ |H| (1- ε)^m
      ≤ |H| e^-εm

using the inequality that (1-ε) ≤ e^-ε if 0 ≤ ε ≤ 1.

Thus, if we ensure that |H| e^-εm ≤ δ, we guarantee that H₁ does not contain a hypothesis that is correct for m examples in more than δ of the cases. Thus, H₀ contains all of the correct hypotheses in all but δ is the cases.

Solving for m gives

m ≥ (1)/(ε)(ln|H|+ ln(1)/(δ)).

Thus, we can conclude the following proposition.

Proposition. If a hypothesis is consistent with at least
(1)/(ε)(ln|H|+ ln(1)/(δ))

training examples, it has error at most ε, at least 1-δ of the time.

The number of examples required to guarantee this error bound is called the sample complexity. The number of examples required according to this proposition is a function of ε, δ, and the size of the hypothesis space.

Example 7.28: Suppose the hypothesis space H is the set of conjunctions of literals on n Boolean variables. In this case |H|=3ⁿ+1 because, for each conjunction, each variable in is one of three states: (1) it is unnegated in the conjunction, (2) it is negated, or (3) it does not appear; the "+1" is needed to represent false, which is the conjunction of any atom and its negation. Thus, the sample complexity is (1)/(ε)(nln3 + ln(1)/(δ)) examples, which is polynomial in n, (1)/(ε), and ln(1)/(δ).

If we want to guarantee at most a 5% error 99% of the time and have 30 Boolean variables, then ε= 1/20, δ=1/100, and n=30. The bound says that we can guarantee this performance if we find a hypothesis that is consistent with 20×(30ln3 + ln100) approx 752 examples. This is much less than the number of possible instances, which is 2³⁰=1,073,741,824, and the number of hypotheses, which is 3³⁰+1=205,891,132,094,650.

Example 7.29: If the hypothesis space H is the set of all Boolean functions on n variables, then |H|=2^2ⁿ; thus, we require (1)/(ε)(2ⁿln2+ ln(1)/(δ)) examples. The sample complexity is exponential in n.

Consider the third question raised at the start of this section, namely, how quickly a learner can find the probably approximately correct hypothesis. First, if the sample complexity is exponential in the size of some parameter (e.g., n above), the computational complexity must be exponential because an algorithm must at least consider each example. To show an algorithm with polynomial complexity, we must find a hypothesis space with polynomial sample complexity and show that the algorithm uses polynomial time for each example.