7 Supervised Machine Learning

7.1 Learning Issues

The following components are part of any learning problem:

Task

The behavior or task that is being improved

Data

The experiences that are used to improve performance in the task, usually in the form of a sequence of examples

Measure of improvement

How the improvement is measured – for example, new skills that were not present initially, increasing accuracy in prediction, or improved speed

Consider the agent internals of Figure 2.9. The problem of learning is to take in prior knowledge and data (e.g., about the experiences of the agent) and to create an internal representation (the knowledge base) that is used by the agent as it acts.

Learning techniques face the following issues:

Task

Virtually any task for which an agent can get data or experiences can be learned. The most commonly studied learning task is supervised learning: given some input features, some target features, and a set of training examples where values for both the input features and the target features are specified, predict the values of the target features for new examples given the values of their input features. This task is called classification when the target features are discrete and regression when the target features are continuous.
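
As a concrete (if trivial) illustration, the following Python sketch sets up toy classification and regression data; the feature names and the two baseline learners, which ignore the inputs entirely, are assumptions made for this example, not methods from this chapter:

```python
# Toy supervised-learning setup: each example pairs input features with a
# target. The feature names and the baseline learners are illustrative
# assumptions.
from statistics import mean, mode

# Discrete target -> classification.
classification_data = [
    ({"length": "long", "thread": "new"}, "skips"),
    ({"length": "short", "thread": "new"}, "reads"),
    ({"length": "short", "thread": "followup"}, "reads"),
]
# Continuous target -> regression.
regression_data = [({"x": 1.0}, 1.9), ({"x": 2.0}, 4.1), ({"x": 3.0}, 6.0)]

def learn_classifier(examples):
    """Baseline: always predict the most common target, ignoring the inputs."""
    majority = mode(target for _, target in examples)
    return lambda inputs: majority

def learn_regressor(examples):
    """Baseline: always predict the mean target value, ignoring the inputs."""
    average = mean(target for _, target in examples)
    return lambda inputs: average

predict_action = learn_classifier(classification_data)
print(predict_action({"length": "long", "thread": "followup"}))  # "reads"
predict_value = learn_regressor(regression_data)
print(predict_value({"x": 4.0}))                                 # 4.0
```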

Other learning tasks include learning classifications when the examples do not have targets defined (unsupervised learning), learning what to do based on rewards and punishments (reinforcement learning), learning to reason faster (analytic learning), and learning richer representations such as logic programs (inductive logic programming).

Feedback

Learning tasks can be characterized by the feedback given to the learner. In supervised learning, what has to be learned is specified for each training example. Supervised classification occurs when a trainer provides the classification for each example. Supervised learning of actions occurs when the agent is given immediate feedback about the value of an action in the current situation. Unsupervised learning occurs when no classifications are given and the learner must discover categories and regularities in the data. Feedback often falls between these extremes, such as in reinforcement learning, where the feedback in terms of rewards and punishments occurs after a sequence of actions. This leads to the credit assignment problem of determining which actions were responsible for the rewards or punishments. For example, a user could give rewards to the delivery robot without telling it exactly what it is being rewarded for. The robot then must either learn what it is being rewarded for or learn which actions are preferred in which situations. It is possible that it can learn which actions to perform without actually determining which consequences of the actions are responsible for rewards.

Representation

For an agent to use its experiences, the experiences must affect the agent’s internal representation. This internal representation could be the raw experiences themselves, but it is typically a compact representation that generalizes the data.

The problem of inferring an internal representation based on examples is called induction in contrast to deduction, which is deriving consequences of a knowledge base, and abduction, which is hypothesizing what may be true about a particular case.

There are two principles that are at odds in choosing a representation:

  • The richer the representation, the more useful it is for subsequent problem solving. For an agent to learn a way to solve a task, the representation must be rich enough to express a way to solve the task.

  • The richer the representation, the more difficult it is to learn. A very rich representation is difficult to learn because it requires a great deal of data, and often many different hypotheses are consistent with the data.

The representations required for intelligence are a compromise between many desiderata. The ability to learn the representation is one of them, but it is not the only one.

Much of machine learning is studied in the context of particular representations (e.g., decision trees, neural networks, or case bases). This chapter presents some standard representations to show the common features behind learning.

Online and offline

In offline learning, all of the training examples are available to an agent before it needs to act. In online learning, training examples arrive as the agent is acting. An agent that learns online requires some representation of its previously seen examples before it has seen all of its examples. As new examples are observed, the agent must update its representation. Typically, an agent never sees all of the examples it could possibly see. Active learning is a form of online learning in which the agent acts to acquire useful examples from which to learn. In active learning, the agent reasons about which examples would be useful to learn from and acts to collect those examples.
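
The following sketch illustrates the online setting, under the assumption of a toy learner whose compact representation is a single running mean, updated as each example arrives:

```python
# Sketch of online learning: the agent keeps only a compact summary (a
# running mean, used as its prediction) and updates it per example, never
# storing the examples themselves. The learner is an assumption made for
# illustration.
class OnlineMeanLearner:
    def __init__(self):
        self.count = 0
        self.prediction = 0.0

    def update(self, target):
        """Incorporate one newly observed target value."""
        self.count += 1
        # Incremental mean: new = old + (target - old) / n
        self.prediction += (target - self.prediction) / self.count

learner = OnlineMeanLearner()
for target in [3.0, 5.0, 4.0]:       # examples arrive while the agent acts
    learner.update(target)
    print(learner.prediction)        # 3.0, 4.0, 4.0
```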

Measuring success

Learning is defined in terms of improving performance based on some measure. To know whether an agent has learned, we must define a measure of success. The measure is usually not how well the agent performs on the training data, but how well it performs on new data.

In classification, being able to correctly classify all training examples is not the goal. For example, consider predicting a Boolean (true/false) feature based on a set of examples. Suppose there are two agents, P and N. Agent P claims that all of the negative examples seen are the only negative examples and that every other instance is positive. Agent N claims that the positive examples in the training set are the only positive examples and that every other instance is negative. Both agents correctly classify every example in the training set but disagree on every other example. Success in learning should not be judged on correctly classifying the training set but on being able to correctly classify unseen examples. Thus, the learner must generalize: go beyond the specific given examples to classify unseen examples.
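
Agents P and N can be written down directly. In this hypothetical sketch, each memorizes the training set and defaults to the opposite answer on every other example:

```python
# The agents P and N from the text, as a hypothetical sketch: both fit the
# training set perfectly yet disagree on every unseen example.
train = {("long", "new"): True, ("short", "old"): False}

def agent_P(example):
    """Only the negatives seen in training are negative; all else is positive."""
    return train.get(example, True)

def agent_N(example):
    """Only the positives seen in training are positive; all else is negative."""
    return train.get(example, False)

# Both classify the training set perfectly...
print(all(agent_P(e) == c and agent_N(e) == c for e, c in train.items()))
# ...but disagree on every unseen example.
print(agent_P(("short", "new")), agent_N(("short", "new")))  # True False
```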

A standard way to evaluate a learning procedure is to divide the examples into training examples and test examples. A representation is built using the training examples, and the predictive accuracy is measured on the test examples. To properly evaluate the method, the test cases should not be known while the training is occurring. Of course, using a test set is only an approximation of what is wanted; the real measure is its performance on some future task.
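
A sketch of this protocol in Python, assuming a hypothetical learn function that maps training examples to a predictor; the random split and the accuracy measure are assumptions made for illustration:

```python
# Sketch of the train/test evaluation protocol. The learn argument is any
# function mapping training examples to a predictor; the trivial majority
# learner below is only for demonstration.
import random
from statistics import mode

def evaluate(learn, examples, test_fraction=0.2, seed=0):
    """Hold out a test set, train on the rest, report accuracy on the test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    predictor = learn(train)       # the test cases stay unseen during training
    return sum(predictor(x) == y for x, y in test) / len(test)

def learn_majority(train):
    """Trivial learner: always predict the most common target."""
    majority = mode(y for _, y in train)
    return lambda x: majority

examples = [(i, i >= 5) for i in range(10)]
print(evaluate(learn_majority, examples))   # accuracy on held-out examples
```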

Bias

The tendency to prefer one hypothesis over another is called a bias. Consider the agents N and P defined earlier. Saying that a hypothesis is better than N’s or P’s hypothesis is not something that is obtained from the data – both N and P accurately predict all of the data given – but is something external to the data. Without a bias, an agent will not be able to make any predictions on unseen examples. The hypotheses adopted by P and N disagree on all further examples, and, if a learning agent cannot choose some hypotheses as better, the agent will not be able to resolve this disagreement. To have any inductive process make predictions on unseen data, an agent requires a bias. What constitutes a good bias is an empirical question about which biases work best in practice; we do not imagine that either P’s or N’s biases work well in practice.

Learning as search

Given a representation and a bias, the problem of learning can be reduced to one of search. Learning is a search through the space of possible representations, trying to find the representation or representations that best fit the data given the bias. Unfortunately, the search spaces are typically prohibitively large for systematic search, except for the simplest of examples. Nearly all of the search techniques used in machine learning can be seen as forms of local search through a space of representations. The definition of the learning algorithm then becomes one of defining the search space, the evaluation function, and the search method.
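
As a hypothetical sketch of this reduction, take the search space to be the hypotheses y = w * x, the evaluation function to be squared error on the training data, and the search method to be greedy local search over w:

```python
# Learning as local search, a hypothetical sketch. Search space: hypotheses
# y = w * x. Evaluation function: squared error on the training data.
# Search method: greedy hill climbing over w.
def squared_error(w, data):
    return sum((w * x - y) ** 2 for x, y in data)

def hill_climb(data, w=0.0, step=0.1, iters=1000):
    for _ in range(iters):
        # Evaluate the current hypothesis and its two neighbours; keep the best.
        best = min([w - step, w, w + step],
                   key=lambda c: squared_error(c, data))
        if best == w:
            break                  # local optimum at this step size
        w = best
    return w

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(hill_climb(data))            # approximately 2.0, the best-fitting slope
```

The local-search methods used in practice replace the greedy step with stochastic moves or gradient information, but the three ingredients are the same.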

Noise

In most real-world situations, the data are not perfect. There can be noise where the observed features are not adequate to predict the classification, missing data where the observations of some of the features for some or all of the examples are missing, and errors where some of the features have been assigned wrong values. One of the important properties of a learning algorithm is its ability to handle noisy data in all of its forms.

Interpolation and extrapolation

For domains with a natural interpretation of “between,” such as where the features are about time or space, interpolation involves making a prediction between cases for which there are data. Extrapolation involves making a prediction that goes beyond the seen examples. Extrapolation is usually much less accurate than interpolation. For example, in early astronomy, the Ptolemaic system and the heliocentric system of Copernicus made detailed models of the movement of the solar system in terms of epicycles (cycles within cycles). The parameters for the models could be made to fit the data very well, and the models were very good at interpolation; however, they were very poor at extrapolation. As another example, it is often easy to predict a stock price on a certain day given data about the prices on the days before and the days after that day. It is very difficult to predict what the price of a stock will be tomorrow, and it would be very profitable to be able to do so. An agent must be careful if its test cases mostly involve interpolating between data points, but the learned model is used for extrapolation.
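
The gap is easy to demonstrate. In the following sketch, the curve, the sample points, and the polynomial model are all invented for the illustration: a polynomial fitted exactly through six samples tracks the curve between the samples but fails badly beyond them.

```python
# Demonstration of the interpolation/extrapolation gap on an invented curve.
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.sin(x) for x in xs]        # the "true" process being modelled

def lagrange(x):
    """Evaluate the unique degree-5 polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

print(lagrange(2.5), math.sin(2.5))   # interpolation: close (~0.60 vs 0.60)
print(lagrange(8.0), math.sin(8.0))   # extrapolation: ~-0.98 vs 0.99, wrong sign
```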

Why Should We Believe an Inductive Conclusion?

When learning from data, an agent makes predictions beyond what the data give it. From observing the sun rising each morning, people predict that the sun will rise tomorrow. From observing unsupported objects repeatedly falling, a child may conclude that unsupported objects always fall (until she comes across helium-filled balloons). From observing many swans, all of which were black, someone may conclude that all swans are black. From the data of Figure 7.1, an algorithm may learn a representation that predicts the user action for a case where the author is unknown, the thread is new, the length is long, and it was read at work. The data do not tell us what the user does in this case. The question arises of why an agent should ever believe any conclusion that is not a logical consequence of its knowledge.

When an agent adopts a bias, or chooses a hypothesis, it is going beyond the data – even making the same prediction about a new case that is identical to an old case in all measured respects goes beyond the data. So why should an agent believe one hypothesis over another? By what criteria can it possibly go about choosing a hypothesis?

The most common method is to choose the simplest hypothesis that fits the data by appealing to Ockham’s razor. William of Ockham was an English philosopher who was born in about 1285 and died, apparently of the plague, in 1349. (The often-used spelling “Occam” is the French spelling of the English town “Ockham.”) He argued for economy of explanation: “What can be done with fewer [assumptions] is done in vain with more.”

Why should one believe the simplest hypothesis, especially when which hypothesis counts as simplest depends on the language used to express it?

First, it is reasonable to assume that there is structure in the world and that an agent should discover this structure to act appropriately. A reasonable way to discover the structure of the world is to search for it. An efficient search strategy is to search from simpler hypotheses to more complicated ones. If there is no structure to be discovered, nothing will work! The fact that much structure has been found in the world (e.g., all of the structure discovered by physicists) would lead us to believe that this is not a futile search.

The fact that simplicity is language dependent should not necessarily make us suspicious. Language has evolved because it is useful; it allows people to express the structure of the world. Thus, we would expect that simplicity in everyday language would be a good measure of complexity.

The most important reason for believing inductive hypotheses is that it is useful to believe them. An agent that does not learn that it should not fling itself from heights will not survive long. The “simplest hypothesis” heuristic is useful because it works.