7.1 Learning Issues

The following components are part of any learning problem:

Task

The behavior or task that is being improved.

Data

The experiences that are used to improve performance in the task, either as a bag of examples or as a temporal sequence of examples. A bag is a set that allows for repeated elements, also known as a multiset. A bag is used when the order of the examples conveys no information.

Measure of improvement

How the improvement is measured – for example, new skills that were not present initially, increasing accuracy in prediction, or improved speed.

Consider the agent internals of Figure 2.10. The problem of learning is to take in prior knowledge and data (including the experiences of the agent) and to create an internal representation (a model) that is used by the agent as it acts.

Learning techniques face the following issues:

Task

Virtually any task for which an agent can get data or experiences might be learned. The most commonly studied learning task is supervised learning: given some input features, some target output features, and a set of training examples in which the values of the input features and target features are specified, predict the value of each target feature for new examples given their values on the input features. This is called classification when the target features are discrete and regression when the target features are continuous. Other variants include structured prediction, where the target is a more complex data structure, such as the constituents and shape of a molecule.
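As an illustration, here is a minimal sketch of a supervised learning setup; the feature names and values are hypothetical and not taken from any particular dataset.

```python
# A hypothetical supervised learning setup. Each training example specifies
# values for the input features and for the target feature.

# Classification: the target ("reads") is discrete (Boolean).
classification_examples = [
    {"length": "long",  "thread": "new",      "reads": True},
    {"length": "short", "thread": "followup", "reads": False},
    {"length": "long",  "thread": "followup", "reads": True},
]

# Regression: the target ("rating") is continuous.
regression_examples = [
    {"length": "long",  "thread": "new",      "rating": 4.5},
    {"length": "short", "thread": "followup", "rating": 1.5},
]

# The prediction task: given only the input features of a new example,
# predict the value of its target feature.
new_example = {"length": "short", "thread": "new"}
```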

Other learning tasks include learning when the examples do not have targets defined (unsupervised learning), learning what to do based on rewards and punishments (reinforcement learning), and learning richer representations such as graphs (graph learning) or programs (inductive programming). These are considered in later chapters.

Feedback

Learning tasks can be characterized by the feedback given to the learner. In supervised learning, the value of what should be learned is specified for each training example. Unsupervised learning occurs when no targets are given and the learner must discover clusters and regularities in the data. Feedback often falls between these extremes, such as in reinforcement learning, where the feedback in terms of rewards and punishments can occur after a sequence of actions.

Measuring success

Learning is defined in terms of improving performance based on some measure. To know whether an agent has learned, you need a measure of success. The measure is usually not how well the agent performs on the training data, but how well the agent performs for new data.

In classification, being able to correctly classify all training examples is not the goal. For example, consider predicting a Boolean (true/false) feature based on a set of examples. Suppose there are two agents, P and N. Agent P claims that all of the negative examples seen are the only negative examples and that every other instance is positive. Agent N claims that the positive examples in the training set are the only positive examples and that every other instance is negative. Both agents correctly classify every example in the training set but disagree on every other example. Success in learning should not be judged on correctly classifying the training set but on being able to correctly classify unseen examples. Thus, the learner must generalize: go beyond the specific given examples to classify unseen examples.
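A minimal sketch of the agents P and N, using a made-up Boolean training set, shows how both can fit the training data perfectly while disagreeing everywhere else:

```python
# Hypothetical training set: (input features, Boolean target) pairs.
train = [((0, 0), False), ((0, 1), False), ((1, 1), True)]

neg_seen = {x for x, y in train if not y}
pos_seen = {x for x, y in train if y}

def agent_P(x):
    # P: the negative examples seen are the only negatives; all else is positive.
    return x not in neg_seen

def agent_N(x):
    # N: the positive examples seen are the only positives; all else is negative.
    return x in pos_seen

# Both agents classify every training example correctly ...
assert all(agent_P(x) == y and agent_N(x) == y for x, y in train)

# ... yet they disagree on every unseen example, such as (1, 0).
print(agent_P((1, 0)), agent_N((1, 0)))   # True False
```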

To measure a prediction, a loss (or error) function specifies how close the prediction is to the correct answer; utility provides a measure of preferences, often in terms of rewards, when there is no correct answer.
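For concreteness, here is a sketch of two common loss functions; the section does not commit to any particular choice.

```python
def zero_one_loss(prediction, actual):
    # 0/1 loss for classification: 0 if the prediction is correct, 1 otherwise.
    return 0 if prediction == actual else 1

def squared_loss(prediction, actual):
    # Squared error for a real-valued prediction.
    return (prediction - actual) ** 2

print(zero_one_loss(True, False))   # 1
print(squared_loss(0.2, 1.0))       # 0.64
```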

A standard way to evaluate a learning method is to divide the given examples into training examples and test examples. A predictive model is built using the training examples, and the predictions of the model are measured on the test examples. To evaluate the method properly, the test cases should not be used in any way for the training. Using a test set is only an approximation of what is wanted; the real measure is the model's performance on future tasks. In deployment of a model in an application – using a prediction to make decisions – an explicit test set is not usually required. A learned model is applied to new examples, and when the ground truth of these is eventually determined, the model can be evaluated using those examples in an ongoing manner. It is typical to relearn on all of the data, including the old data and the new data. In some applications, it is appropriate to remove old data, particularly when the world is changing or the quality of data is improving.
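The train/test methodology can be sketched as follows, assuming hypothetical learn, predict, and loss functions supplied by the learner being evaluated:

```python
import random

def evaluate(examples, learn, predict, loss, test_fraction=0.2, seed=0):
    # examples: list of (inputs, target) pairs.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    model = learn(train)                 # the test examples play no role in training
    errors = [loss(predict(model, x), y) for x, y in test]
    return sum(errors) / len(errors)     # average loss on unseen examples
```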

Bias

The tendency to prefer one hypothesis over another is called a bias. Consider the agents N and P introduced earlier. Saying that a hypothesis is better than the hypotheses of N or P is not something that is obtained from the data – both N and P accurately predict all of the data given – but is something external to the data. Without a bias, an agent cannot make any predictions on unseen examples [see Section 7.6].

The set of all assumptions that enable generalization to unseen examples is called the inductive bias. What constitutes a good bias is an empirical question about which biases work best in practice; we do not imagine that either P’s or N’s biases work well in practice.

Representation

For an agent to use its experiences, the experiences must affect the agent’s internal representation. This internal representation could be the raw experiences themselves, but it is typically a model, a compact representation that generalizes the data. The choice of the possible representations for models provides a representation bias – only models that can be represented are considered. The preference for one model over another is called a preference bias.

There are two principles that are at odds in choosing a representation:

  • The richer the representation, the more useful it is for subsequent problem solving. For an agent to learn a way to solve a task, the representation must be rich enough to express a way to solve the task.

  • The richer the representation, the more difficult it is to learn. A very rich representation is difficult to learn because it requires a great deal of data, and often many different hypotheses are consistent with the data.

The representations required for intelligence are a compromise among many desiderata. The ability to learn the representation is one of them, but it is not the only one.

Much of machine learning is studied in the context of particular representations such as decision trees, linear models, or neural networks.

Learning as search

Given a class of representations and a bias, the problem of learning can be reduced to one of search. Learning becomes a search through the space of possible models, trying to find a model or models that best fit the data given the bias. The search spaces are typically prohibitively large for systematic search, except for the simplest of cases. Nearly all of the search techniques used in machine learning can be seen as forms of local search through a space of representations. The definition of the learning algorithm then becomes one of defining the search space, the evaluation function, and the search method. A search bias occurs when the search only returns one of the many possible models that could be found.
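As a concrete (and deliberately simple) sketch, the following local search fits a hypothetical one-parameter model, a threshold on a single numeric feature: the search space is the set of thresholds, the evaluation function is training error, and the search method is random local steps.

```python
import random

def training_error(threshold, data):
    # Evaluation function: number of training examples the threshold misclassifies.
    return sum(1 for x, label in data if (x >= threshold) != label)

def local_search(data, steps=1000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    threshold = rng.uniform(0.0, 1.0)                 # search space: threshold values
    for _ in range(steps):
        candidate = threshold + rng.uniform(-step_size, step_size)   # local move
        if training_error(candidate, data) <= training_error(threshold, data):
            threshold = candidate                     # accept moves that are no worse
    return threshold                                  # one of the many models that fit

data = [(0.1, False), (0.3, False), (0.7, True), (0.9, True)]
print(local_search(data))
```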

Imperfect data

In most real-world situations, data are not perfect. There can be noise, where the observed features are not adequate to predict the classification or the process generating the data is inherently noisy; missing data, where the values of some features are not observed for some or all of the examples; and errors, where some features have been assigned wrong values. One of the important properties of a learning algorithm is its ability to handle imperfect data in all of its forms.
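As one illustration (not a general recommendation), missing values of a numeric feature can be imputed with the mean of the observed values; a sketch:

```python
def impute_with_mean(examples, feature):
    # examples: list of dicts mapping feature names to values; None marks missing.
    observed = [e[feature] for e in examples if e[feature] is not None]
    mean = sum(observed) / len(observed)
    return [{**e, feature: mean if e[feature] is None else e[feature]}
            for e in examples]

examples = [{"age": 30}, {"age": None}, {"age": 50}]
print(impute_with_mean(examples, "age"))   # the missing age becomes 40.0
```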

Interpolation and extrapolation

For domains with a natural interpretation of “between,” such as where the features are about time or space, interpolation involves making a prediction between cases for which there are examples. Extrapolation involves making a prediction that goes beyond the seen examples. Extrapolation is usually less accurate than interpolation. For example, in ancient astronomy, the Ptolemaic system, developed around 150 CE, provided detailed models of the movement of the solar system in terms of epicycles (cycles within cycles). The parameters of these models could be made to fit the data very well, and the models were very good at interpolation; however, they were very poor at extrapolation. As another example, it is often easy to predict a stock price on a certain day given data about the prices on the days before and after that day, but it is very difficult to predict what a stock's price will be tomorrow given only historical data, although it would be very profitable to do so. An agent must be careful if its test cases mostly involve interpolating between data points but the learned model is used for extrapolation.
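The following sketch makes the contrast concrete: a flexible polynomial fit to noisy samples of a sine curve interpolates well inside the range of the training data but typically extrapolates poorly beyond it. The function, polynomial degree, and ranges are arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 6, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.05, size=x_train.shape)

model = Polynomial.fit(x_train, y_train, deg=9)   # a rich model that fits the data well

print(model(3.0), np.sin(3.0))   # interpolation (inside [0, 6]): close to the truth
print(model(9.0), np.sin(9.0))   # extrapolation (outside [0, 6]): typically far off
```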

Curse of dimensionality

When examples are described in terms of features, each feature can be seen as a dimension. As the dimensionality increases, the number of possible examples grows exponentially, and even large datasets can be very sparse in large dimensions. For example, a single frame in a 4k video has about 8 million pixels, where each pixel value can be considered a dimension. The space is so large that it is extremely unlikely that any example is between the other examples in all dimensions of the multidimensional space. Any one of the dimensions is often not important; for example, changing just one of the pixels in an image does not change what the image is. In predicting diseases based on electronic health records, with images, text, laboratory results, and other information, the number of dimensions varies for different patients and can be enormous.
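A small simulation suggests how quickly this sparsity sets in: with a fixed number of uniformly random data points in the unit hypercube, the chance that a new random point has any data point within a small distance in every dimension collapses as the number of dimensions grows. The dataset size and radius below are arbitrary.

```python
import random

def fraction_with_close_neighbor(dims, n_data=1000, n_test=200, radius=0.1, seed=0):
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(dims)] for _ in range(n_data)]
    hits = 0
    for _ in range(n_test):
        p = [rng.random() for _ in range(dims)]
        # A "close neighbor" must be within the radius in every dimension.
        if any(all(abs(a - b) <= radius for a, b in zip(p, q)) for q in data):
            hits += 1
    return hits / n_test

for dims in (1, 2, 5, 10):
    print(dims, fraction_with_close_neighbor(dims))   # drops towards 0 as dims grows
```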

Online and offline

In offline learning, all of the training examples are available to an agent before it needs to act. In online learning, training examples arrive as the agent is acting. An agent that learns online requires some representation of its previously seen examples before it has seen all of its examples. As new examples are observed, the agent must update its representation. Typically, an agent never sees all of the examples it could possibly see, and often cannot store all of the examples seen. Active learning is a form of online learning in which the agent acts to acquire new useful examples from which to learn. In active learning, the agent reasons about which examples would be most useful to learn from and acts to collect those examples.
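As a minimal sketch of online learning, consider an agent whose entire model is a running mean of the values seen so far; it updates the model as each example arrives, without storing the examples themselves.

```python
class OnlineMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0          # current model: predict the mean of values seen so far

    def observe(self, value):
        self.n += 1
        self.mean += (value - self.mean) / self.n   # incremental update, no stored data

    def predict(self):
        return self.mean

learner = OnlineMean()
for value in [4.0, 6.0, 5.0]:    # examples arrive while the agent is acting
    learner.observe(value)
    print(learner.predict())     # 4.0, 5.0, 5.0
```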