7.7 Learning as Refining the Hypothesis Space

So far, learning is either choosing the best representation - for example, the best decision tree or the best values for parameters in a neural network - or predicting the value of the target features of a new case from a database of previous cases. This section considers a different notion of learning, namely learning as delineating those hypotheses that are consistent with the examples. Rather than choosing a hypothesis, the aim is to find all hypotheses that are consistent. This investigation will shed light on the role of a bias and provide a mechanism for a theoretical analysis of the learning problem.

We make three assumptions:

  • There is a single target feature, Y, that is Boolean. This is not really a restriction for classification, because any discrete feature can be made into Boolean features using indicator variables.
  • The hypotheses make definitive predictions, predicting true or false for each example, rather than making probabilistic predictions.
  • There is no noise in the data.

Given these assumptions, it is possible to write a hypothesis in terms of a proposition, where the primitive propositions are assignments to the input features.

Example 7.22: The decision tree of Figure 7.4 can be seen as a representation of Reads defined by the proposition
val(e,Reads) = val(e,Short) ∧ (val(e,New) ∨ val(e,Known)) .

For the rest of this section, we write this more simply as

reads ↔ short ∧ (new ∨ known) .

The goal is to try to find a proposition on the input features that correctly classifies the training examples.
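Such a proposition is just a Boolean function of the input features. As a minimal sketch (the dictionary keys are assumed names for the features of Example 7.22), the hypothesis reads ↔ short ∧ (new ∨ known) can be written directly:

```python
def reads(e):
    """Hypothesis of Example 7.22: reads <-> short and (new or known)."""
    return e["short"] and (e["new"] or e["known"])

# A short, known article is predicted to be read:
e = {"short": True, "new": False, "known": True}
print(reads(e))  # True
```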

Example 7.23: Consider the trading agent trying to infer which books or articles the user reads based on keywords supplied in the article. Suppose the learning agent has the following data:
article  Crime  Academic  Local  Music  Reads
a1       true   false     false  true   true
a2       true   false     false  false  true
a3       false  true      false  false  false
a4       false  false     true   false  false
a5       true   true      false  false  true

The aim is to learn which articles the user reads.

In this example, reads is the target feature, and the aim is to find a definition such as

reads ↔ crime ∧ (¬academic ∨ ¬music) .

This definition may be used to classify the training examples as well as future examples.
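The definition can be checked against the table directly. A small Python sketch (the lowercased feature names are assumed identifiers):

```python
# Training data from Example 7.23: (crime, academic, local, music, reads)
examples = {
    "a1": (True,  False, False, True,  True),
    "a2": (True,  False, False, False, True),
    "a3": (False, True,  False, False, False),
    "a4": (False, False, True,  False, False),
    "a5": (True,  True,  False, False, True),
}

def h(crime, academic, local, music):
    """Candidate definition: reads <-> crime and (not academic or not music)."""
    return crime and (not academic or not music)

# The hypothesis agrees with the observed value of Reads on every example:
consistent = all(h(c, a, l, m) == reads for (c, a, l, m, reads) in examples.values())
print(consistent)  # True
```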

Hypothesis space learning assumes the following sets:

  • I, the instance space, is the set of all possible examples.
  • H, the hypothesis space, is a set of Boolean functions on the input features.
  • E ⊆ I is the set of training examples. Values for the input features and the target feature are given for each training example.

If h∈H and i∈I, we write h(i) to mean the value that h predicts for i on the target variable Y.

Example 7.24: In Example 7.23, I is the set of the 2⁵ = 32 possible examples, one for each combination of values for the features.

The hypothesis space H could be all Boolean combinations of the input features or could be more restricted, such as conjunctions or propositions defined in terms of fewer than three features.
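To get a rough sense of scale (a side computation, not from the text): with five Boolean input features there are 2⁵ = 32 instances, so the space of all Boolean functions on them has 2³² members, whereas conjunctions of literals number only 3⁵ = 243, since each feature may appear positively, negatively, or not at all:

```python
n = 5                        # number of Boolean input features
instances = 2 ** n           # |I| = 32 possible examples
all_boolean = 2 ** instances # one truth value per instance: 2^32 functions
conjunctions = 3 ** n        # each feature: positive, negative, or absent
print(instances, all_boolean, conjunctions)  # 32 4294967296 243
```

Restricting H, as the smaller count suggests, is one way of imposing a bias.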

In Example 7.23, the training examples are E={a1,a2,a3,a4,a5}. The target feature is Reads. Because the table specifies the value of this feature only for the training examples, and the learner must make predictions on unseen cases, the learner requires a bias. In hypothesis space learning, the bias is imposed by the hypothesis space.

Hypothesis h is consistent with a set of training examples E if ∀e ∈ E, h accurately predicts the target feature of e; that is, h(e) = val(e,Y): the predicted value is the same as the actual value for each example. The problem is to find a subset of H, or just an element of H, consistent with all of the training examples.

Example 7.25: Consider the data of Example 7.23, and suppose H is the set of conjunctions of literals. An example hypothesis in H that is consistent with {a1} is ¬academic ∧ music. This hypothesis means that the person reads an article if and only if ¬academic ∧ music is true of the article. This concept is not the target concept because it is inconsistent with {a1, a2}.
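The consistency check of Example 7.25 can be sketched by enumerating conjunctions of literals: each of the four features appears positively (+1), negatively (−1), or not at all (0). The encoding and variable names below are assumptions for illustration:

```python
from itertools import product

features = ["crime", "academic", "local", "music"]

# Training examples from Example 7.23: input feature values plus the target Reads.
data = [
    ({"crime": True,  "academic": False, "local": False, "music": True},  True),   # a1
    ({"crime": True,  "academic": False, "local": False, "music": False}, True),   # a2
    ({"crime": False, "academic": True,  "local": False, "music": False}, False),  # a3
    ({"crime": False, "academic": False, "local": True,  "music": False}, False),  # a4
    ({"crime": True,  "academic": True,  "local": False, "music": False}, True),   # a5
]

def predicts(conj, e):
    """True iff every literal in the conjunction is satisfied by example e."""
    return all(e[f] == (sign > 0) for f, sign in conj.items() if sign != 0)

def consistent(conj, examples):
    return all(predicts(conj, e) == y for e, y in examples)

# The hypothesis ¬academic ∧ music:
h = {"crime": 0, "academic": -1, "local": 0, "music": 1}
print(consistent(h, data[:1]))  # True: consistent with {a1}
print(consistent(h, data[:2]))  # False: inconsistent with {a1, a2}

# All conjunctions of literals consistent with the full training set:
space = [dict(zip(features, signs)) for signs in product([1, -1, 0], repeat=4)]
survivors = [c for c in space if consistent(c, data)]
print(len(survivors))  # 2: crime, and crime ∧ ¬local
```

Exhaustive enumeration like this is only feasible because the restricted space has 3⁴ = 81 hypotheses; it illustrates the idea of delineating all consistent hypotheses rather than choosing one.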