### 7.7.2 Probably Approximately Correct Learning

So far, we have seen a number of different learning algorithms. This
section covers some of the theoretical aspects of learning,
developed in an area called **computational learning theory**.

Some relevant questions that we can ask about a theory of computational learning include the following:

- Is the learner guaranteed to converge to the correct hypothesis as the number of examples increases?
- How many examples are required to identify a concept?
- How much computation is required to identify a concept?

In general, the answer to the first question is "no," unless it can be guaranteed that the examples always eventually rule out all but the correct hypothesis. Someone out to trick the learner could choose examples that do not help discriminate correct hypotheses from incorrect hypotheses. So if such a person cannot be ruled out, a learner cannot guarantee to find a consistent hypothesis. However, given randomly chosen examples, a learner that always chooses a consistent hypothesis can get arbitrarily close to the correct concept. This requires a notion of closeness and a specification of what is a randomly chosen example.

Consider a learning algorithm that chooses a hypothesis consistent with all of the training examples. Assume a probability distribution over possible examples and that the training examples and the test examples are chosen from the same distribution. The distribution does not have to be known. We will prove a result that holds for all distributions.

Define the **error of hypothesis** *h∈ H*, written

*error(h)*, to be the probability of choosing an element

*i*of

*I*such that

*h(i)≠val(i,Y)*, where

*h(i)*is the predicted value of target variable

*Y*on possible example

*i*, and

*val(i,Y)*is the actual value of

*Y*. Recall that

*I*, the instance space, is the set of all possible examples. That is,

error(h)=P(h(i)≠val(i,Y)| i ∈I) .

An agent typically does not know *P* or *val(i,Y)* for all *i* and, thus, does not actually know
the error of a particular hypothesis.

Given *ε>0*, hypothesis *h* is **approximately
correct** if *error(h) ≤ ε*.

We make the following assumption.

**Assumption.**

*The training and test examples are chosen independently from the same probability distribution as the population.*

It is still possible that the examples do not distinguish hypotheses
that are far away from the concept - it is just very unlikely that
they do not. A learner that chooses a hypothesis that is consistent
with the training examples is **probably
approximately correct** if, for an arbitrary number
*δ* (*0<δ ≤ 1*),
the algorithm is not approximately correct in at most
*δ* of the cases. That is, the hypothesis generated is approximately correct at least
*1-δ* of the time.

Under the preceding assumption, for arbitrary *ε* and *δ*, we can
guarantee that an algorithm that returns a consistent hypothesis will
find a hypothesis with error less than *ε*, in at least
*1-δ* of the cases. Moreover, this outcome does not depend on the probability
distribution.

Suppose *ε>0* and *δ>0* are given. Partition the
hypothesis space * H* into

H_{0}= {h∈ H: error(h) ≤ ε}H_{1}= {h∈ H: error(h) > ε} .

We want to guarantee that the learner does not choose an element of * H_{1}* in more
than

*δ*of the cases.

Suppose *h ∈ H_{1}*, then

*P(h is wrong for a single example) ≥ ε*

*P(h is correct for a single example) ≤ 1- ε*

*P(h is correct for*m

*examples) ≤ (1- ε)*

^{m}Thus,

*P(*m

**H**_{1}contains a hypothesis that is correct for*examples)*

*≤ |*

**H**_{1}| (1- ε)^{m}*≤ |*

**H**| (1- ε)^{m}*≤ |*

**H**| e^{-εm}using the inequality that *(1-ε) ≤ e ^{-ε}* if

*0 ≤ ε ≤ 1*.

Thus, if we ensure that *| H| e^{-εm} ≤ δ*, we guarantee that

*does not contain a hypothesis that is correct for*

**H**_{1}*m*examples in more than

*δ*of the cases. Thus,

*H*contains all of the correct hypotheses in all but

_{0}*δ*is the cases.

Solving for *m* gives

m ≥ (1)/(ε)(ln|H|+ ln(1)/(δ)).

Thus, we can conclude the following proposition.

**Proposition.**

*If a hypothesis is consistent with at least*

(1)/(ε)(ln|H|+ ln(1)/(δ))

*training examples, it has error at most ε,
at least 1-δ of the time.
*

The number of examples required to guarantee this error bound is
called the **sample complexity**.
The number of examples required according to this proposition is a function of
*ε*, *δ*, and the size of the hypothesis space.

**Example 7.28:**Suppose the hypothesis space

*is the set of conjunctions of literals on*

**H***n*Boolean variables. In this case

*|*because, for each conjunction, each variable in is one of three states: (1) it is unnegated in the conjunction, (2) it is negated, or (3) it does not appear; the "

**H**|=3^{n}+1*+1*" is needed to represent false, which is the conjunction of any atom and its negation. Thus, the sample complexity is

*(1)/(ε)(nln3 + ln(1)/(δ))*examples, which is polynomial in

*n*,

*(1)/(ε)*, and

*ln(1)/(δ)*.

If we want to guarantee at most a *5%* error *99%* of
the time and have *30* Boolean variables, then *ε= 1/20*,
*δ=1/100*, and *n=30*. The bound says that we can guarantee this
performance if we find a hypothesis that is consistent with *20×(30ln3 + ln100) approx 752* examples. This is much less than the number of
possible instances, which is *2 ^{30}=1,073,741,824*, and the number of
hypotheses, which is

*3*.

^{30}+1=205,891,132,094,650**Example 7.29:**If the hypothesis space

*is the set of all Boolean functions on*

**H***n*variables, then

*|*; thus, we require

**H**|=2^{2n}*(1)/(ε)(2*examples. The sample complexity is exponential in

^{n}ln2+ ln(1)/(δ))*n*.

If we want
to guarantee at most a *5%* error *99%* of the time and have *30*
Boolean variables, then *ε= 1/20*, *δ=1/100*, and *n=30*.
The bound says that we can guarantee this performance if we find a
hypothesis that is consistent with *20×(2 ^{30}ln2 + ln100)
approx 14,885,222,452* examples.

Consider the third question raised at the start of this section,
namely, how quickly a learner
can find the probably approximately correct hypothesis. First, if the
sample complexity is exponential in the size of some parameter
(e.g., *n* above), the computational complexity must be exponential because an
algorithm must at least consider each example. To show an algorithm
with polynomial complexity, we must find a hypothesis space with
polynomial sample complexity and show that the algorithm uses
polynomial time for each example.