
## 7.4 Composite Models

Decision trees, (squashed) linear functions, and Bayesian classifiers provide the basis for many other supervised learning techniques. Linear classifiers are very restricted in what they can represent. Although decision trees can represent any discrete function, many simple functions have very complicated decision trees. Bayesian classifiers make a priori modeling assumptions that may not be valid.

One way to make the linear function more powerful is to have the inputs to the linear function be some non-linear function of the original inputs. Adding these new features can increase the dimensionality, making some functions that were not linear (or linearly separable) in the lower-dimensional space linear in the higher-dimensional space.

**Example 7.14:** The exclusive-or function, *x₁* xor *x₂*, is linearly separable in the space where the dimensions are *x₁*, *x₂*, and *x₁x₂*, where *x₁x₂* is a feature that is true when both *x₁* and *x₂* are true. To visualize this, consider Figure 7.8; with the product as the third dimension, the top-right point will be lifted out of the page, allowing for a linear separator (in this case a plane) to go underneath it.
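The idea in this example can be checked numerically. Below is a minimal sketch: exclusive-or is not linearly separable in the original two inputs, but after adding the product feature as a third dimension, a single plane separates the classes. The weights and bias are hand-picked for illustration, not learned.

```python
def phi(x1, x2):
    """Map the original inputs into the higher-dimensional feature space
    by adding the product feature x1*x2 as a third dimension."""
    return (x1, x2, x1 * x2)

# One hand-picked separating plane: w . phi(x) + b > 0 means "xor is true".
w = (1.0, 1.0, -2.0)
b = -0.5

def xor_predict(x1, x2):
    f = phi(x1, x2)
    return sum(wi * fi for wi, fi in zip(w, f)) + b > 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_predict(x1, x2))
```

The plane gives value −0.5 at (0,0) and (1,1), and +0.5 at (0,1) and (1,0), so the two positive points lie on one side and the two negative points on the other.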

A **support vector machine (SVM)** is used for
classification. It uses functions of the
original inputs as the inputs of the linear function. These functions
are called **kernel functions**. Many possible kernel
functions exist. An example kernel function is
the product of original features; adding the products of features is
enough to enable the representation of the exclusive-or
function. Increasing the dimensionality can, however, cause
overfitting.
An SVM constructs a decision surface, which is a
hyperplane that divides the positive and negative examples in this
higher-dimensional space.
Define the **margin** to be the minimum distance
from the decision surface to any of the examples. An SVM
finds the decision surface with maximum margin. The examples that are
closest to the decision surface are those that support (or hold up)
the decision surface. In particular, these examples, if
removed, would change the decision surface. Overfitting is avoided because these support
vectors define a surface that can be described with fewer parameters than
there are examples. For a detailed description of
SVMs, see the references at the end of this chapter.
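To show the kernel idea concretely without the maximum-margin optimization, here is a kernel perceptron: a simpler relative of the SVM that also works in the feature space induced by a kernel, but finds *some* separating surface rather than the maximum-margin one. The quadratic kernel below implicitly includes the product feature, so exclusive-or becomes learnable. The kernel choice and data are illustrative, not from the chapter.

```python
def kernel(a, b):
    """Quadratic kernel (a.b + 1)^2; its implicit feature space includes
    the product of the original features."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return (dot + 1) ** 2

# Exclusive-or, with labels in {-1, +1}.
examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

def train(examples, epochs=100):
    alpha = [0] * len(examples)          # one dual weight per example
    for _ in range(epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(examples):
            f = sum(a * yi * kernel(xi, xj)
                    for a, (xi, yi) in zip(alpha, examples))
            if yj * f <= 0:              # misclassified (or on the boundary)
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:                # all examples correctly classified
            break
    return alpha

def predict(alpha, examples, x):
    f = sum(a * yi * kernel(xi, x)
            for a, (xi, yi) in zip(alpha, examples))
    return 1 if f > 0 else -1

alpha = train(examples)
print([predict(alpha, examples, x) for x, _ in examples])  # → [-1, 1, 1, -1]
```

Examples with a nonzero dual weight are exactly those that the algorithm had to correct on, which is analogous to (though not the same as) the support vectors of an SVM.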

Neural networks allow the inputs to the (squashed) linear function to themselves be squashed linear functions, with parameters that are tuned. Having multiple layers of squashed linear functions as inputs to (squashed) linear functions that predict the target variables allows more complex functions to be represented. Neural networks are described in more detail later.

Another nonlinear representation is a **regression tree**, which is a
decision tree with a (squashed) linear function at the leaves of the decision tree. This can
represent a piecewise linear approximation. It is even possible to
have neural networks or SVMs at the leaves of the decision tree. To classify a new example, the example is
filtered down the tree, and the classifier at the leaves is then used
to classify the example.
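A minimal sketch of such a tree, with made-up split point and leaf coefficients: an example is filtered down the tree by the test at each internal node, and the linear model at the leaf it reaches produces the prediction, giving a piecewise linear function overall.

```python
# Each internal node tests the input against a threshold; each leaf holds
# (intercept, slope) of a linear model.  The numbers are illustrative.
tree = ("split", 2.0,
        ("leaf", (1.0, 0.5)),      # used when x < 2.0
        ("leaf", (4.0, -1.0)))     # used when x >= 2.0

def predict(node, x):
    kind = node[0]
    if kind == "leaf":
        intercept, slope = node[1]
        return intercept + slope * x
    _, threshold, left, right = node
    # Filter the example down the tree, then apply the leaf's linear model.
    return predict(left, x) if x < threshold else predict(right, x)

print(predict(tree, 0.0), predict(tree, 4.0))  # → 1.0 0.0
```

Replacing the `(intercept, slope)` pairs at the leaves with neural networks or SVMs gives the more elaborate hybrids mentioned above.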

The naive Bayesian classifier can be expanded to allow some input
features to be parents of the classification and to allow some to be
children. The probability of the classification given its parents can
be represented as a decision tree or a squashed linear function or a
neural network. The children of the classification do not have to be
independent. One representation of the children is as a **tree
augmented naive Bayesian (TAN) network**, where the children are
allowed to have at most one other parent in addition to the classification
(as long as the resulting graph is acyclic). This allows for a simple
model that accounts for interdependencies among the children. An
alternative is to put structure in the class variable. A
**latent tree model** decomposes the class variable into a number of
latent variables that are connected together in a tree structure. Each
observed variable is a child of one of the latent variables. The
latent variables allow a model of the dependence between the observed variables.
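A small sketch of inference in a TAN-style model with two Boolean children, where the second child has the first as an extra parent. All of the probabilities are made-up numbers for illustration; in practice they would be learned from data.

```python
# Factorization: P(C, x1, x2) = P(C) * P(x1 | C) * P(x2 | C, x1).
# The extra parent x1 lets the model capture dependence between the children.
p_c = {True: 0.4, False: 0.6}                     # P(C = true/false)
p_x1_true = {True: 0.8, False: 0.3}               # P(x1 = true | C)
p_x2_true = {(True, True): 0.9, (True, False): 0.5,
             (False, True): 0.4, (False, False): 0.1}  # P(x2 = true | C, x1)

def posterior(x1, x2):
    """P(C = true | x1, x2), by summing the joint over the class values."""
    def joint(c):
        px1 = p_x1_true[c] if x1 else 1 - p_x1_true[c]
        px2t = p_x2_true[(c, x1)]
        px2 = px2t if x2 else 1 - px2t
        return p_c[c] * px1 * px2
    return joint(True) / (joint(True) + joint(False))

print(posterior(True, True))   # → 0.8 with these illustrative numbers
```

With a naive Bayesian classifier the table for `x2` would depend on `C` alone; the extra `x1` index is exactly what the tree augmentation adds.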

Another possibility is to train a number of classifiers on the data and to combine their predictions using some mechanism such as voting or a linear function. These techniques are known as **ensemble learning**.
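The voting mechanism can be sketched as follows; the three base classifiers here are deliberately crude, made-up threshold rules, chosen only to show that a majority vote can be right even when an individual classifier is wrong.

```python
# Three (made-up) base classifiers, each a simple threshold rule.
def clf_a(x):
    return x > 2

def clf_b(x):
    return x > 4

def clf_c(x):
    return x > 3

def ensemble(x):
    """Combine the base classifiers by majority vote."""
    votes = [clf_a(x), clf_b(x), clf_c(x)]
    return sum(votes) >= 2

print(ensemble(3.5))  # → True, even though clf_b alone says False
```

A weighted linear combination of the classifiers' outputs, with weights learned from data, is the other common combination mechanism mentioned above.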