foundations of computational agents
Decision trees, and (squashed) linear functions provide the basis for many other supervised learning techniques. Although decision trees can represent any discrete function, many simple functions have very complicated decision trees. Linear functions and linear classifiers by themselves are very restricted in what they can represent. However, layers of linear functions, separated by non-linear activation functions, forming neural networks, can approximate many more functions (including discrete functions and continuous functions over a compact set).
One way to make the linear function more powerful is to have the inputs to the linear function be some non-linear function of the original inputs. Adding these new features can increase the dimensionality, making some functions that were not linear (or linearly separable) in the lower-dimensional space linear in the higher-dimensional space.
The exclusive-or function, , is linearly separable in the space where the dimensions are , , and , where is a feature that is true when both and are true. To visualize this, consider Figure 7.11; with the product as the third dimension, the top-right point will be lifted out of the page, allowing for a linear separator (in this case a plane) to go underneath it.
A kernel function is a function that is applied to the input features to create new features. For example, a product of features could either replace or augment the existing features. Adding such features can allow for linear separators where there was none before. Another example is, for a feature , adding and to the features allows the learner to find the best degree-3 polynomial fit. Note that when the feature space is augmented, overfitting can become more of a problem.
Neural networks allow the inputs to the (squashed) linear function to be a squashed linear function with weights to be tuned. Having multiple layers of squashed linear functions as inputs to (squashed) linear functions allows more complex functions to be represented.
Another nonlinear representation is a regression tree, which is a decision tree with a (squashed) linear function at the leaves of the decision tree. This can represent a piecewise linear approximation. It is even possible to have neural networks or other classifiers at the leaves of the decision tree. To classify a new example, the example is filtered down the tree, and the classifier at the leaves is then used to classify the example.
Another possibility is to use a number of classifiers that have each been trained on the data and to combine these using some mechanism such as voting or averaging. These techniques are known as ensemble learning.