7.6 Composite Models

7.6.2 Ensemble Learning

In ensemble learning, an agent takes a number of learners and combines their predictions to make a prediction for the ensemble. The algorithms being combined are called base-level algorithms. Random forests are an example of an ensemble method, where the base-level algorithm is a decision tree, and the predictions of the individual trees are averaged or used to vote for a prediction.
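
As a minimal sketch of this averaging step (not code from the book), the following trains several scikit-learn DecisionTreeRegressor base learners on bootstrap resamples of the data, assuming NumPy arrays, and averages their predictions; a full random forest would additionally randomize the features considered at each split. The names bagged_trees and ensemble_predict are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_trees=10, seed=0):
    """Train n_trees regression trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # sample examples with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def ensemble_predict(trees, X):
    """The ensemble's prediction is the average of the individual tree predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)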

In boosting, there is a sequence of learners in which each one learns from the errors of the previous ones. The features of a boosting algorithm are:

  • There is a sequence of base learners (which can be different from each other or the same as each other), such as small decision trees or (squashed) linear functions.

  • Each learner is trained to fit the examples that the previous learners did not fit well.

  • The final prediction is a mix (e.g., sum, weighted average, or mode) of the predictions of each learner.

The base learners can be weak learners, in that they do not need to be very good; they just need to be better than random. These weak learners are then boosted to be components in the ensemble that performs better than any of them.

A simple boosting algorithm is functional gradient boosting, which can be used for regression as follows. The final prediction, as a function of the inputs, is the sum

$$p_0(X) + d_1(X) + \dots + d_k(X)$$

where $p_0(X)$ is an initial prediction, say the mean, and each $d_i$ is the difference from the previous prediction. Let the $i$th prediction be $p_i(X) = p_0(X) + d_1(X) + \dots + d_i(X)$. Then $p_i(X) = p_{i-1}(X) + d_i(X)$. Each $d_i$ is constructed so that the error of $p_i$ is minimal, given that $p_{i-1}$ is fixed. At each stage, the base learner learns $d_i$ to minimize

$$\sum_{e} error\bigl(Y(e) - p_i(e)\bigr) = \sum_{e} error\bigl(Y(e) - p_{i-1}(e) - d_i(e)\bigr).$$

The $i$th learner can learn $d_i(e)$ to best fit $Y(e) - p_{i-1}(e)$. This is equivalent to learning from a modified data set, where the previous prediction is subtracted from the actual value of the training set. In this way, each learner is made to correct the errors of the previous prediction.
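
As a concrete illustration (with made-up numbers), suppose $Y(e) = 7$ and the initial prediction is $p_0(e) = 5$. The first base learner is then trained with target

$$Y(e) - p_0(e) = 7 - 5 = 2$$

for example $e$; if it learns $d_1(e) = 1.5$, the updated prediction is $p_1(e) = 6.5$ and the second learner's target for $e$ becomes $Y(e) - p_1(e) = 0.5$.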

1: procedure Boosting_learner(Xs, Y, Es, L, k)
2:      Inputs
3:          Xs: set of input features
4:          Y: target feature
5:          Es: set of examples from which to learn
6:          L: base learner
7:          k: number of components in the ensemble
8:      Output
9:          function to make prediction on examples
10:      mean := (∑_{e ∈ Es} Y(e)) / |Es|
11:      define p_0(e) = mean
12:      for each i from 1 to k do
13:          let E_i = {(Xs(e), Y(e) - p_{i-1}(e)) for each e ∈ Es}
14:          let d_i = L(E_i)
15:          define p_i(e) = p_{i-1}(e) + d_i(e)
16:      return p_k
Figure 7.19: Functional gradient boosting regression learner

The algorithm is shown in Figure 7.19. Each $p_i$ is a function that, given an example, returns a prediction for that example. $E_i$ is a new set of examples, where for each $e \in Es$ the latest prediction, $p_{i-1}(e)$, is subtracted from the value of the target feature, $Y(e)$. The new learner is therefore learning from the errors of the old learner. The function $d_i$ is computed by applying the base learner to the examples $E_i$. Note that $E_i$ does not need to be stored; its examples can be generated as needed.
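
Figure 7.19 translates almost directly into code. Below is a minimal sketch, not the book's published code (which is available from the book website), assuming NumPy arrays for the data and scikit-learn's DecisionTreeRegressor as the base learner L; the names boosting_learner and base_learner are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_learner(X, y, base_learner=None, k=20):
    """Functional gradient boosting for regression, following Figure 7.19.

    X: 2D array of input feature values, y: 1D array of target values,
    base_learner: zero-argument factory returning an unfitted regressor,
    k: number of components in the ensemble.
    """
    if base_learner is None:
        base_learner = lambda: DecisionTreeRegressor(max_depth=3)  # small trees
    mean = y.mean()                  # p_0(e) = mean
    residual = y - mean              # Y(e) - p_0(e)
    components = []
    for _ in range(k):
        d_i = base_learner().fit(X, residual)    # fit d_i to the current residuals
        residual = residual - d_i.predict(X)     # now holds Y(e) - p_i(e)
        components.append(d_i)

    def predict(X_new):
        # p_k: the initial mean plus the sum of the learned corrections
        return mean + sum(d.predict(X_new) for d in components)

    return predict

The returned function plays the role of p_k in the pseudocode. Keeping a running residual means the intermediate data sets E_i are never stored, matching the note above that they can be generated as needed.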

Figure 7.20: Error of functional gradient boosting of decision trees
Example 7.22.

Figure 7.20 shows a plot of the sum-of-squares error as the number of trees increases, for functional gradient boosting of decision trees. The different lines are for different thresholds on the minimum number of examples required for a split, expressed as a proportion of the total number of examples. With one tree, it is just the decision tree algorithm. Boosting eventually makes the trees with a 1% threshold better than those with a 0.1% threshold, even though the two are about the same for a single tree. The code is available from the book website.
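
As an independent sketch of this kind of experiment (not a reproduction of Figure 7.20 or of the book's code), scikit-learn's GradientBoostingRegressor can trace the sum-of-squares test error after each added tree, with min_samples_split given as a fraction of the training examples. The data arrays and the function name sse_curve are placeholders.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def sse_curve(X_train, y_train, X_test, y_test, min_split_fraction, n_trees=100):
    """Sum-of-squares test error after each added tree, for one split threshold."""
    model = GradientBoostingRegressor(
        n_estimators=n_trees,
        min_samples_split=min_split_fraction,   # fraction of the training examples
        learning_rate=1.0,                      # plain residual fitting, no shrinkage
    ).fit(X_train, y_train)
    return [np.sum((y_test - pred) ** 2) for pred in model.staged_predict(X_test)]

# For example, compare thresholds of 1% and 0.1% of the examples:
# for frac in (0.01, 0.001):
#     errors = sse_curve(X_train, y_train, X_test, y_test, frac)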