7.8.2 Probably Approximately Correct Learning

Rather than just studying different learning algorithms that happen to work well, computational learning theory investigates general principles that can be proved to hold for classes of learning algorithms.

Some relevant questions that we can ask about a theory of computational learning include the following:

• Is the learner guaranteed to converge to the correct hypothesis as the number of examples increases?

• How many examples are required to identify a concept?

• How much computation is required to identify a concept?

In general, the answer to the first question is “no,” unless it can be guaranteed that the examples always eventually rule out all but the correct hypothesis. An adversary out to trick the learner could choose examples that do not help discriminate correct hypotheses from incorrect ones. If an adversary cannot be ruled out, a learner cannot be guaranteed to converge to the correct hypothesis. However, given randomly chosen examples, a learner that always chooses a consistent hypothesis can get arbitrarily close to the correct concept. This requires a notion of closeness and a specification of what counts as a randomly chosen example.

Consider a learning algorithm that chooses a hypothesis consistent with all of the training examples. Assume a probability distribution over possible examples and that the training examples and the test examples are chosen from the same distribution. The distribution does not have to be known. We will prove a result that holds for all distributions.

The error of hypothesis $h\in\mathcal{H}$ on instance space $I$, written $error(I,h)$, is defined to be the probability of choosing an element $i$ of $I$ such that $h(i)\neq{Y}({i})$, where $h(i)$ is the predicted value of target variable $Y$ on possible example $i$, and ${Y}({i})$ is the actual value of $Y$ on example $i$. That is,

 $error(I,h)=P(h(i)\neq Y(i)\mid i\in I).$

An agent typically does not know $P$ or ${Y}({i})$ for all $i$ and, thus, does not actually know the error of a particular hypothesis.

Given $\epsilon>0$, hypothesis $h$ is approximately correct if $error(I,h)\leq\epsilon$.
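For a finite instance space under a uniform distribution, $error(I,h)$ is simply the fraction of instances on which $h$ and $Y$ disagree. The following sketch, with a hypothetical target concept and hypothesis on three Boolean variables, illustrates the definition by direct enumeration:

```python
from itertools import product

# Instance space: all 3-bit Boolean tuples, assumed uniformly distributed.
instances = list(product([0, 1], repeat=3))

def Y(i):
    # Hypothetical target concept: true when the first two bits agree.
    return int(i[0] == i[1])

def h(i):
    # Hypothetical hypothesis: also true whenever the last bit is 1.
    return int(i[0] == i[1] or i[2] == 1)

# error(I, h) = P(h(i) != Y(i) | i in I) under the uniform distribution.
err = sum(h(i) != Y(i) for i in instances) / len(instances)
```

Here `err` is $0.25$, so $h$ is approximately correct for any $\epsilon\geq 0.25$. Of course, an agent cannot normally enumerate the true labels like this; the point of the theory is to bound the error from samples alone.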

We make the following assumption.

Assumption 7.3.

The training and test examples are chosen independently from the same probability distribution as the population.

It is still possible that the examples do not distinguish hypotheses that are far away from the concept; it is just very unlikely that they do not. A learner that chooses a hypothesis consistent with the training examples is probably approximately correct if, for arbitrary $\delta$ ($0<\delta\leq 1$), its hypothesis fails to be approximately correct in at most $\delta$ of the cases. That is, the hypothesis generated is approximately correct at least $1-\delta$ of the time.

Under the preceding assumption, for arbitrary $\epsilon$ and $\delta$, we can guarantee that an algorithm that returns a consistent hypothesis will find a hypothesis with error less than $\epsilon$, in at least $1-\delta$ of the cases. Moreover, this result does not depend on the probability distribution.

Proposition 7.4.

Given Assumption 7.3, if a hypothesis is consistent with at least

 $\frac{1}{\epsilon}\left(\ln\left|\mathcal{H}\right|+\ln\frac{1}{\delta}\right)$

training examples, it has error at most $\epsilon$, at least $1-\delta$ of the time.
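The bound in Proposition 7.4 is easy to evaluate. The following sketch (the function name and the ceiling are ours) computes the smallest integer number of examples the proposition requires:

```python
import math

def sample_complexity(H_size, eps, delta):
    """Number of examples sufficient so that any hypothesis consistent
    with them has error at most eps with probability >= 1 - delta."""
    return math.ceil((1 / eps) * (math.log(H_size) + math.log(1 / delta)))
```

Because the dependence on $|\mathcal{H}|$ is logarithmic, squaring the size of the hypothesis space only doubles the $\ln|\mathcal{H}|$ term of the bound.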

Proof.

Suppose $\epsilon>0$ and $\delta>0$ are given. Partition the hypothesis space $\mathcal{H}$ into

 $\displaystyle\mathcal{H}_{0}=\{h\in\mathcal{H}:error(I,h)\leq\epsilon\}$ $\displaystyle\mathcal{H}_{1}=\{h\in\mathcal{H}:error(I,h)>\epsilon\}.$

We want to guarantee that the learner does not choose an element of $\mathcal{H}_{1}$ in more than $\delta$ of the cases.

Suppose $h\in\mathcal{H}_{1}$; then

 $\displaystyle P(h\mbox{ is wrong for a single example})\geq\epsilon$ $\displaystyle P(h\mbox{ is correct for a single example})\leq 1-\epsilon$ $\displaystyle P(h\mbox{ is correct for }m\mbox{ random examples})\leq(1-\epsilon)^{m}.$

Therefore,

 $\displaystyle P(\mathcal{H}_{1}\mbox{ contains a hypothesis that is correct for }m\mbox{ random examples})$ $\displaystyle\ \ \ \ \leq\left|\mathcal{H}_{1}\right|(1-\epsilon)^{m}\leq\left|\mathcal{H}\right|(1-\epsilon)^{m}\leq\left|\mathcal{H}\right|e^{-\epsilon m}$

using the inequality $(1-\epsilon)\leq e^{-\epsilon}$ if $0\leq\epsilon\leq 1$.
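The inequality $(1-\epsilon)\leq e^{-\epsilon}$, which in fact holds for all real $\epsilon$, can be spot-checked numerically:

```python
import math

# 1 - eps is the first-order Taylor expansion of exp(-eps),
# and exp is convex, so exp(-eps) lies on or above that line.
for eps in [0.0, 0.01, 0.05, 0.1, 0.5, 0.9, 1.0]:
    assert 1 - eps <= math.exp(-eps)
```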

If we ensure that $\left|\mathcal{H}\right|e^{-\epsilon m}\leq\delta$, we guarantee that $\mathcal{H}_{1}$ contains a hypothesis that is correct for all $m$ examples in at most $\delta$ of the cases. Thus, in all but $\delta$ of the cases, every hypothesis consistent with the $m$ training examples is in $\mathcal{H}_{0}$.

Taking logarithms and solving for $m$ gives

 $m\geq\frac{1}{\epsilon}\left(\ln\left|\mathcal{H}\right|+\ln\frac{1}{\delta}\right)$

which proves the proposition. ∎
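The proposition can also be checked empirically. The following simulation, a sketch with arbitrarily chosen small parameters (3 Boolean variables, so 256 possible hypotheses), draws $m$ examples per the bound, picks the first consistent hypothesis, and counts how often its true error exceeds $\epsilon$:

```python
import itertools
import math
import random

random.seed(0)

n = 3
instances = list(itertools.product([0, 1], repeat=n))       # 8 instances
# Hypothesis space: every Boolean function on n variables, as a truth table.
H = list(itertools.product([0, 1], repeat=len(instances)))  # 256 hypotheses

eps, delta = 0.2, 0.1
# Sample size from Proposition 7.4.
m = math.ceil((1 / eps) * (math.log(len(H)) + math.log(1 / delta)))

def error(h, target):
    # True error under the uniform distribution over the instance space.
    return sum(a != b for a, b in zip(h, target)) / len(instances)

trials, failures = 200, 0
for _ in range(trials):
    target = random.choice(H)
    sample = [random.randrange(len(instances)) for _ in range(m)]
    # First hypothesis consistent with all m labelled examples
    # (the target itself is in H, so one always exists).
    consistent = next(h for h in H
                      if all(h[i] == target[i] for i in sample))
    if error(consistent, target) > eps:
        failures += 1
```

The observed failure rate is well below $\delta$; the bound is loose, since it charges every hypothesis in $\mathcal{H}_{1}$ the full $(1-\epsilon)^{m}$ via the union bound.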

The number of examples required to guarantee this error bound is called the sample complexity. The number of examples required according to this proposition is a function of $\epsilon$, $\delta$, and the size of the hypothesis space.

Example 7.31.

Suppose the hypothesis space $\mathcal{H}$ is the set of conjunctions of literals on $n$ Boolean variables. In this case $\left|\mathcal{H}\right|=3^{n}+1$ because, for each conjunction, each variable is in one of three states: (1) it is unnegated in the conjunction, (2) it is negated, or (3) it does not appear; the “$+1$” is needed to represent false, which is the conjunction of any atom and its negation. Thus, the sample complexity is $\frac{1}{\epsilon}\left(n\ln 3+\ln\frac{1}{\delta}\right)$ examples, which is polynomial in $n$, $\frac{1}{\epsilon}$, and $\ln\frac{1}{\delta}$.

If we want to guarantee at most a $5\%$ error $99\%$ of the time and have $30$ Boolean variables, then $\epsilon=1/20$, $\delta=1/100$, and $n=30$. The bound says that we can guarantee this performance if we find a hypothesis that is consistent with $20\times(30\ln 3+\ln 100)\approx 752$ examples. This is far fewer than the number of possible instances, which is $2^{30}=1,073,741,824$, and the number of hypotheses, which is $3^{30}+1=205,891,132,094,650$.

Example 7.32.

If the hypothesis space $\mathcal{H}$ is the set of all Boolean functions on $n$ variables, then $\left|\mathcal{H}\right|=2^{2^{n}}$; thus, we require $\frac{1}{\epsilon}\left(2^{n}\ln 2+\ln\frac{1}{\delta}\right)$ examples. The sample complexity is exponential in $n$.

If we want to guarantee at most a $5\%$ error $99\%$ of the time and have $30$ Boolean variables, then $\epsilon=1/20$, $\delta=1/100$, and $n=30$. The bound says that we can guarantee this performance if we find a hypothesis consistent with $20\times(2^{30}\ln 2+\ln 100)\approx 14,885,222,452$ examples.
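Both examples can be reproduced in a few lines. Working with $\ln|\mathcal{H}|$ directly avoids forming the astronomically large number $2^{2^{30}}$; the helper name is ours, and $\ln(3^{n}+1)$ is approximated by $n\ln 3$ as in Example 7.31:

```python
import math

def bound(log_H_size, eps, delta):
    # Takes ln|H| directly, since |H| = 2^(2^30) is far too large
    # to represent as an ordinary number.
    return (1 / eps) * (log_H_size + math.log(1 / delta))

n, eps, delta = 30, 1/20, 1/100

conj = bound(n * math.log(3), eps, delta)        # conjunctions of literals
all_fns = bound(2**n * math.log(2), eps, delta)  # all Boolean functions
```

`conj` evaluates to about $752$ while `all_fns` is about $1.5\times 10^{10}$, matching the figures above and making the polynomial-versus-exponential contrast in $n$ concrete.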

Consider the third question raised at the start of this section: how quickly can a learner find a probably approximately correct hypothesis? First, if the sample complexity is exponential in the size of some parameter (e.g., $n$ above), the computational complexity must also be exponential, because an algorithm must at least consider each example. To show that an algorithm has polynomial complexity, you need to find a hypothesis space with polynomial sample complexity and show that the algorithm uses polynomial time per example.