Intelligence 2E

foundations of computational agents

Decision trees, and (squashed) linear functions provide the basis for many other supervised learning techniques. Although decision trees can represent any discrete function, many simple functions have very complicated decision trees. Linear functions and linear classifiers by themselves are very restricted in what they can represent. However, layers of linear functions, separated by non-linear activation functions, forming neural networks, can approximate many more functions (including discrete functions and continuous functions over a compact set).

One way to make the linear function more powerful is to have the inputs to the linear function be some non-linear function of the original inputs. Adding these new features can increase the dimensionality, making some functions that were not linear (or linearly separable) in the lower-dimensional space linear in the higher-dimensional space.

The exclusive-or function, ${{x}}_{{\mathrm{1}}}{\mathit{}}{\mathit{\text{xor}}}{\mathit{}}{{x}}_{{\mathrm{2}}}$, is linearly separable in the space where the dimensions are ${{x}}_{{\mathrm{1}}}$, ${{x}}_{{\mathrm{2}}}$, and ${{x}}_{{\mathrm{1}}}{\mathit{}}{{x}}_{{\mathrm{2}}}$, where ${{x}}_{{\mathrm{1}}}{\mathit{}}{{x}}_{{\mathrm{2}}}$ is a feature that is true when both ${{x}}_{{\mathrm{1}}}$ and ${{x}}_{{\mathrm{2}}}$ are true. To visualize this, consider Figure 7.11; with the product as the third dimension, the top-right point will be lifted out of the page, allowing for a linear separator (in this case a plane) to go underneath it.

A kernel function is a function that is applied to the input features to create new features. For example, a product of features could either replace or augment the existing features. Adding such features can allow for linear separators where there was none before. Another example is, for a feature $x$, adding ${x}^{2}$ and ${x}^{3}$ to the features allows the learner to find the best degree-3 polynomial fit. Note that when the feature space is augmented, overfitting can become more of a problem.

Are some linear separators better than others?

A support vector machine (SVM) is used for classification. It uses functions of the original inputs as the inputs of the linear function. These functions are called kernel functions. Many different kernel functions are used. An example kernel function is the product of original features. Adding the products of features is enough to enable the representation of the exclusive-or function. Increasing the dimensionality can, however, cause overfitting. An SVM constructs a decision surface, which is a hyperplane that divides the positive and negative examples in this higher-dimensional space. Define the margin to be the minimum distance from the decision surface to any of the examples. An SVM finds the decision surface with maximum margin. The examples that are closest to the decision surface are those that support (or hold up) the decision surface. In particular, these examples, if removed, would change the decision surface. Overfitting is avoided because these support vectors define a surface that can be defined in fewer parameters than there are examples. For detailed description of SVMs see the references at the end of this chapter.

Neural networks allow the inputs to the (squashed) linear function to be a squashed linear function with weights to be tuned. Having multiple layers of squashed linear functions as inputs to (squashed) linear functions allows more complex functions to be represented.

Another nonlinear representation is a regression tree, which is a decision tree with a (squashed) linear function at the leaves of the decision tree. This can represent a piecewise linear approximation. It is even possible to have neural networks or other classifiers at the leaves of the decision tree. To classify a new example, the example is filtered down the tree, and the classifier at the leaves is then used to classify the example.

Another possibility is to use a number of classifiers that have each been trained on the data and to combine these using some mechanism such as voting or averaging. These techniques are known as ensemble learning.