foundations of computational agents
So far, learning has involved either choosing the best representation (such as the best decision tree or the best values for the weights in a neural network) or predicting the value of the target features of a new case from a database of previous cases. This section considers a different notion of learning, namely learning as delineating those hypotheses that are consistent with the examples. Rather than choosing a single hypothesis, the aim is to find a description of all hypotheses that are consistent with the data. This investigation sheds light on the role of bias and provides a mechanism for a theoretical analysis of the learning problem.
We make three assumptions:
There is a single Boolean target feature, $Y$.
The hypotheses make definitive predictions, predicting true or false for each example, rather than probabilistic predictions.
There is no noise in the data.
Given these assumptions, it is possible to write a hypothesis in terms of a proposition, where the primitive propositions are assignments to the input features.
The decision tree of Figure 7.6 can be seen as a representation of $Reads$ defined by the proposition

$$\widehat{Reads}(e) \iff Short(e) \wedge (New(e) \vee Known(e)).$$
This is the prediction that the person read an article if and only if the article is short and either new or known.
The goal is to try to find a proposition on the input features that correctly classifies the training examples.
Consider a trading agent trying to infer which articles a user reads based on keywords for the article. Suppose the learning agent has the following data:
| article | Crime | Academic | Local | Music | Reads |
|---|---|---|---|---|---|
| $a_1$ | true | false | false | true | true |
| $a_2$ | true | false | false | false | true |
| $a_3$ | false | true | false | false | false |
| $a_4$ | false | false | true | false | false |
| $a_5$ | true | true | false | false | true |
The aim is to learn which articles the user reads.
In this example, $Reads$ is the target feature, and the aim is to find a definition such as

$$\widehat{Reads}(e) \iff Crime(e) \wedge (\neg Academic(e) \vee \neg Music(e)).$$
This definition may be used to classify the training examples as well as future examples.
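As a sketch (the dictionary encoding and function names below are mine, not the book's), this candidate definition can be checked mechanically against the five training examples:

```python
# Training data from the table above; each example maps feature names to
# Boolean values, with Reads as the target feature.
examples = {
    "a1": {"Crime": True,  "Academic": False, "Local": False, "Music": True,  "Reads": True},
    "a2": {"Crime": True,  "Academic": False, "Local": False, "Music": False, "Reads": True},
    "a3": {"Crime": False, "Academic": True,  "Local": False, "Music": False, "Reads": False},
    "a4": {"Crime": False, "Academic": False, "Local": True,  "Music": False, "Reads": False},
    "a5": {"Crime": True,  "Academic": True,  "Local": False, "Music": False, "Reads": True},
}

def predicted_reads(e):
    """The candidate definition: Crime(e) and (not Academic(e) or not Music(e))."""
    return e["Crime"] and (not e["Academic"] or not e["Music"])

# The definition correctly classifies every training example.
for e in examples.values():
    assert predicted_reads(e) == e["Reads"]
```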
Hypothesis space learning assumes the following sets:
$I$, the instance space, is the set of all possible examples.
$\mathscr{H}$, the hypothesis space, is a set of Boolean functions on the input features.
$E\subseteq I$ is the set of training examples. Values for the input features and the target feature are given for each training example.
If $h\in \mathscr{H}$ and $i\in I$, then $h(i)$ is the value that $h$ predicts for $i$.
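These sets can be sketched concretely for the reading example (a sketch under my own encoding: instances as dictionaries, hypotheses as Boolean functions; the restricted hypothesis space of single literals is chosen purely for illustration):

```python
from itertools import product

features = ["Crime", "Academic", "Local", "Music"]

# I: the instance space, one instance per assignment of Booleans
# to the four input features.
I = [dict(zip(features, vals)) for vals in product([False, True], repeat=len(features))]
assert len(I) == 2 ** len(features)  # 16 instances

# A restricted hypothesis space H: the single-literal hypotheses,
# one per feature and one per negated feature (8 in all).
H = [lambda e, f=f: e[f] for f in features] + \
    [lambda e, f=f: not e[f] for f in features]

h = H[0]       # the hypothesis "Crime(e)"
i = I[0]       # the all-false instance
print(h(i))    # h(i): the value h predicts for i
```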
In Example 7.26, $I$ is the set of the $2^4 = 16$ possible examples, one for each combination of values for the input features.
The hypothesis space $\mathscr{H}$ could be all Boolean combinations of the input features, or it could be more restricted, such as conjunctions of literals, or propositions defined in terms of fewer than three features.
In Example 7.26, the training examples are $E = \{a_1, a_2, a_3, a_4, a_5\}$. The target feature is $Reads$. Because the table specifies the values of this feature only for the training examples, and the learner will make predictions on unseen cases, the learner requires a bias. In hypothesis space learning, the bias is imposed by the hypothesis space.
Hypothesis $h$ is consistent with a set of training examples $E$ if the value predicted by $h$ is the value of the target feature $Y$ for each example in $E$. That is, if $\forall e\in E$, $h(e)=Y(e)$.
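This consistency condition translates directly into a predicate (a sketch; the function name and example encoding are mine):

```python
def consistent(h, E, target="Reads"):
    """True if hypothesis h predicts the target value for every example in E."""
    return all(h(e) == e[target] for e in E)

# Two training examples from the table above.
E = [
    {"Crime": True,  "Academic": False, "Local": False, "Music": True,  "Reads": True},   # a1
    {"Crime": False, "Academic": True,  "Local": False, "Music": False, "Reads": False},  # a3
]

h = lambda e: e["Crime"]    # the hypothesis Reads(e) <-> Crime(e)
print(consistent(h, E))     # h matches Reads on both examples
```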
The problem in hypothesis space learning is to find the subset of $\mathscr{H}$ consisting of those hypotheses that are consistent with all of the training examples.
Consider the data of Example 7.26, and suppose $\mathscr{H}$ is the set of conjunctions of literals. An example hypothesis in $\mathscr{H}$ that is consistent with the set $\{a_1\}$, consisting of a single example, is $\widehat{Reads}(e) \iff \neg Academic(e) \wedge Music(e)$. This hypothesis means that the person reads an article if and only if the article is not academic and it is about music. This concept is not the target concept because it is inconsistent with $\{a_1, a_2\}$.
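The consistent hypotheses can be found by brute force when $\mathscr{H}$ is small. A sketch (the sign encoding and all names are mine): represent each conjunction of literals by a map from features to $+1$ (positive literal), $-1$ (negated literal), or $0$ (absent), and keep the conjunctions consistent with all five examples.

```python
from itertools import product

features = ["Crime", "Academic", "Local", "Music"]

# The five training examples from the table above.
E = [
    {"Crime": True,  "Academic": False, "Local": False, "Music": True,  "Reads": True},   # a1
    {"Crime": True,  "Academic": False, "Local": False, "Music": False, "Reads": True},   # a2
    {"Crime": False, "Academic": True,  "Local": False, "Music": False, "Reads": False},  # a3
    {"Crime": False, "Academic": False, "Local": True,  "Music": False, "Reads": False},  # a4
    {"Crime": True,  "Academic": True,  "Local": False, "Music": False, "Reads": True},   # a5
]

def conjunction(signs):
    """Build the Boolean function for a conjunction of literals.
    signs maps each feature to +1 (positive), -1 (negated), or 0 (absent)."""
    def h(e):
        return all(e[f] if s == 1 else not e[f]
                   for f, s in signs.items() if s != 0)
    return h

# Enumerate all 3^4 = 81 conjunctions and keep those consistent with E.
consistent_signs = [
    dict(zip(features, ss))
    for ss in product([1, -1, 0], repeat=len(features))
    if all(conjunction(dict(zip(features, ss)))(e) == e["Reads"] for e in E)
]
print(consistent_signs)
```

With this data, exactly two conjunctions survive, $\widehat{Reads}(e) \iff Crime(e)$ and $\widehat{Reads}(e) \iff Crime(e) \wedge \neg Local(e)$, showing that more than one hypothesis can be consistent with the same training examples.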