foundations of computational agents
Rather than just studying different learning algorithms that happen to work well, computational learning theory investigates general principles that can be proved to hold for classes of learning algorithms.
Some relevant questions that we can ask about a theory of computational learning include the following:
Is the learner guaranteed to converge to the correct hypothesis as the number of examples increases?
How many examples are required to identify a concept?
How much computation is required to identify a concept?
In general, the answer to the first question is “no,” unless it can be guaranteed that the examples always eventually rule out all but the correct hypothesis. An adversary out to trick the learner could choose examples that do not help discriminate correct hypotheses from incorrect hypotheses. If an adversary cannot be ruled out, a learner cannot guarantee to find a consistent hypothesis. However, given randomly chosen examples, a learner that always chooses a consistent hypothesis can get arbitrarily close to the correct concept. This requires a notion of closeness and a specification of what is a randomly chosen example.
Consider a learning algorithm that chooses a hypothesis consistent with all of the training examples. Assume a probability distribution over possible examples and that the training examples and the test examples are chosen from the same distribution. The distribution does not have to be known. We will prove a result that holds for all distributions.
The error of hypothesis on instance space , written , is defined to be the probability of choosing an element of such that , where is the predicted value of target variable on possible example , and is the actual value of on example . That is,
An agent typically does not know or for all and, thus, does not actually know the error of a particular hypothesis.
Given , hypothesis is approximately correct if .
We make the following assumption.
The training and test examples are chosen independently from the same probability distribution as the population.
It is still possible that the examples do not distinguish hypotheses that are far away from the concept. It is just very unlikely that they do not. A learner that chooses a hypothesis consistent with the training examples is probably approximately correct if, for an arbitrary number (), the algorithm is not approximately correct in at most of the cases. That is, the hypothesis generated is approximately correct at least of the time.
Under the preceding assumption, for arbitrary and , we can guarantee that an algorithm that returns a consistent hypothesis will find a hypothesis with error less than , in at least of the cases. Moreover, this result does not depend on the probability distribution.
Given Assumption 7.3, if a hypothesis is consistent with at least
training examples, it has error at most , at least of the time.
Suppose and are given. Partition the hypothesis space into
We want to guarantee that the learner does not choose an element of in more than of the cases.
Suppose , then
using the inequality if .
If we ensure that , we guarantee that does not contain a hypothesis that is correct for examples in more than of the cases. So contains all of the correct hypotheses in all but of the cases.
Solving for gives
which proves the proposition. ∎
The number of examples required to guarantee this error bound is called the sample complexity. The number of examples required according to this proposition is a function of , , and the size of the hypothesis space.
Suppose the hypothesis space is the set of conjunctions of literals on Boolean variables. In this case because, for each conjunction, each variable is in one of three states: (1) it is unnegated in the conjunction, (2) it is negated, or (3) it does not appear; the “” is needed to represent false, which is the conjunction of any atom and its negation. Thus, the sample complexity is examples, which is polynomial in , , and .
If we want to guarantee at most a error of the time and have Boolean variables, then , , and . The bound says that we can guarantee this performance if we find a hypothesis that is consistent with examples. This is much fewer than the number of possible instances, which is , and the number of hypotheses, which is .
If the hypothesis space is the set of all Boolean functions on variables, then ; thus, we require examples. The sample complexity is exponential in .
If we want to guarantee at most a error of the time and have Boolean variables, then , , and . The bound says that we can guarantee this performance if we find a hypothesis consistent with examples.
Consider the third question raised at the start of this section, namely, how quickly can a learner find the probably approximately correct hypothesis. First, if the sample complexity is exponential in the size of some parameter (e.g., above), the computational complexity must be exponential because an algorithm must at least consider each example. To show an algorithm with polynomial complexity, you need to find a hypothesis space with polynomial sample complexity and show that the algorithm uses polynomial time for each example.