### 7.4.2 Ensemble Learning

In **ensemble learning**, an agent takes a number of learning
algorithms and combines their output to make a prediction. The algorithms being
combined are called **base-level algorithms**.

The simplest case of ensemble learning is to train the base-level algorithms on random subsets of the data and either let these vote for the most popular classification (for definitive predictions) or average the predictions of the base-level algorithm. For example, one could train a number of decision trees, each on random samples of, say, 50% of the training data, and then either vote for the most popular classification or average the numerical predictions. The outputs of the decision trees could even be inputs to a linear classifier, and the weights of this classifier could be learned.

This approach works well when the base-level algorithms are **unstable**: they
tend to produce different representations depending on which subset of
the data is chosen. Decision trees and neural networks are unstable, but linear
classifiers tend to be stable and so would not work well with ensembles.

In **bagging**, if there are *m* training examples, the base-level algorithms are trained on sets of *m* randomly drawn, with
replacement, sets of the training examples. In each of these sets,
some examples are not chosen, and some are duplicated. On average, each
set contains about 63% of the original examples.

In **boosting** there is a sequence of classifiers in which each
classifier uses a weighted set of examples. Those examples that the previous classifiers
misclassified are
weighted more. Weighting examples can either be incorporated into the
base-level algorithms or can affect which examples are
chosen as training examples for the future classifiers.

Another way to create base-level classifiers is to manipulate the input features. Different base-level classifiers can be trained on different features. Often the sets of features are hand-tuned.

Another way to get diverse base-level classifiers is to randomize the algorithm. For example, neural network algorithms that start at different parameter settings may find different local minima, which make different predictions. These different networks can be combined.