If you are viewing this file in preview mode, some links won't work. Find the fully featured Jupyter Notebook file on the website of Prof. Jens Flemming at Zwickau University of Applied Sciences. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Ensemble methods

To increase overall prediction quality we could train several different models and somehow aggregate their prediction results. There are many different ways to realize this idea. Machine learning methods exploiting more than one trained model are called ensemble methods. Here we consider three classes of ensemble methods: bootstrap aggregation (bagging), boosting, stacking.

Imagine you have a difficult problem and ask an expert for advice. Why not ask several experts? One expert could tell you something wrong, but you would not realize that he or she is wrong, because you aren't an expert. If we have a list of experts at our disposal, what can we do with their advice?

We could average all their answers (this is the idea behind bagging), we could confront each expert with the errors of the previous one and let the answer improve step by step (boosting), or we could ask yet another expert to combine all answers into a final one (stacking). All three strategies have in common that each expert could be rather weak (not really an expert), but the final answer will be quite accurate.

Bagging

Although bagging (short for bootstrap aggregation) in principle can be applied to a set of very different machine learning models, usually it is used with a set of identical models. The aim of bagging is to reduce variance (that is, prediction error due to overfitting) by averaging results from many high variance models. This is the aggregation part.

Here is the bootstrap part: If we train identical models on identical training data, the models will yield more or less identical predictions. Thus, we have to train each model on a different data set. We could divide the data set into as many subsets as we have models, but then each subset would be rather small. Instead we use a method known as bootstrapping in statistics. We sample subsets from the original data set with replacement. Thus, samples may occur several times in a subset. The advantage of replacement is that the distributions of samples in the subsets are independent of each other, making the trained models independent of each other. Bootstrapping yields a list of subsets which on the one hand follow more or less the same distribution as the original data set and on the other hand can be (at least in principle) arbitrarily large.
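As a small illustration of bootstrapping, the following sketch draws several subsets with replacement from a toy data set; the data and the subset count and size are arbitrary choices for demonstration only.

```python
# Bootstrapping sketch: draw subsets with replacement from a toy data set.
import numpy as np

rng = np.random.default_rng(0)

n = 10                    # size of the original data set
x = np.arange(n)          # stand-in for the training samples
n_subsets, subset_size = 3, 10

# indices drawn with replacement, so samples may occur several times per subset
subsets = [rng.choice(n, size=subset_size, replace=True) for _ in range(n_subsets)]
for subset in subsets:
    print(x[subset])
```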

Scikit-Learn supports bagging for regression tasks with BaggingRegressor from Scikit-Learn's ensemble module. BaggingRegressor objects have the usual fit and predict interface. When creating the regressor we may pass the following arguments:

- estimator (called base_estimator in older Scikit-Learn versions), the base model to be copied and trained,
- n_estimators, the number of models in the ensemble,
- max_samples, the size of the bootstrap subsets,
- bootstrap, whether subsets are sampled with or without replacement.

There is also a max_features argument to restrict the number of features to consider in each model. Instead of training each model on a different data set we might train models on different sets of features (random subspace method). BaggingRegressor also supports some techniques similar to bagging that we did not introduce here.
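A minimal usage sketch for BaggingRegressor might look as follows; the synthetic data set and all hyperparameter values are assumptions chosen for illustration, not prescriptions.

```python
# Hedged sketch: bagging of decision trees with BaggingRegressor.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(500, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.2, size=500)    # noisy synthetic data

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),   # base model (base_estimator in older Scikit-Learn versions)
    n_estimators=50,                     # number of models in the ensemble
    max_samples=0.8,                     # size of each bootstrap subset
    bootstrap=True,                      # sample with replacement
    random_state=0,
)
model.fit(x_train, y_train)
print('test score:', model.score(x_test, y_test))
```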

If bagging is used with decision trees as base model, then we have a random forest (a forest is a collection of trees). Scikit-Learn has some specialized routines for training random forests: RandomForestRegressor.

Random forests can be exploited for feature selection. Having a trained random forest at hand we calculate for each feature the total decrease in the loss function caused by splittings with respect to that feature. The influence of each split is weighted by the number of training samples reaching the node. Thus, nodes close to a root are more influential than deeper nodes. The resulting importance measure is larger for features with more influence on the predictions. In Scikit-Learn we have access to the feature importances via the feature_importances_ attribute of the RandomForestRegressor object after training.
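The following sketch shows how feature importances of a random forest could be inspected; the synthetic data set (with one purely random feature) and the hyperparameters are assumptions for illustration.

```python
# Hedged sketch: feature importances of a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = 2 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(0, 0.1, size=1000)   # feature 2 does not influence y

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x, y)

# importances sum to 1; the noise feature should get a value close to 0
for idx, importance in enumerate(forest.feature_importances_):
    print(f'feature {idx}: {importance:.3f}')
```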

Boosting

There exist many different implementations of boosting. The fundamental idea is to use information about the prediction quality of a trained model to improve the training of another model. This process can be repeated to produce a sequence of models. Finally, either the predictions of the last model in the sequence are used or some averaging is done. Although there is no restriction on the chosen models, one typically uses identical models for boosting. Boosting tends to reduce bias (prediction error due to insufficient model complexity). Thus, we may use very weak models like decision stumps (trees with only two leaves). Note that this is in contrast to bagging, which tends to reduce variance.

Here we consider two boosting algorithms in more detail: AdaBoost.R2 and gradient boosting. AdaBoost was originally developed for binary classification. Later an adaptation to regression appeared (AdaBoost.R) and eventually a modified regression version was published (AdaBoost.R2). Scikit-Learn implements AdaBoost.R2. Thus, we restrict attention to that version. For classification AdaBoost is a special case of gradient boosting, but for regression this is not true. AdaBoost.R2 was introduced in Improving Regressors using Boosting Techniques by Harris Drucker.

Adaptive boosting for regression (AdaBoost.R2)

In adaptive boosting we associate a weight with each training sample. A small weight marks the sample as unimportant, a high weight as very important. We may use such weights in two ways:

- We may sample the training subset for a model according to the weights, so that important samples appear more often.
- We may weight each sample's contribution to the loss function minimized during training.

The first variant is always applicable. The second variant works for models trained by minimizing some loss function. That includes all models we have seen so far. AdaBoost.R2 only uses the first variant. But a weighted loss function is used for updating the weights.

Step 1 (initialization): Assign weights $w_1,\ldots,w_n$ to the training samples $(x_1,y_1),\ldots,(x_n,y_n)$ and initialize all weights to $\frac{1}{n}$.

Step 2 (subset selection): Calculate probabilities \begin{equation*} p_l:=\frac{w_l}{\sum\limits_{\lambda=1}^n w_\lambda},\qquad l=1,\ldots,n, \end{equation*} and choose $N<n$ training samples according to the probabilities $p_1,\ldots,p_n$ with replacement.

Step 3 (training): Train a model on the $N$ samples.

Step 4 (stopping criteria): Get predictions $y_{\mathrm{pred},1},\ldots,y_{\mathrm{pred},n}$ for all (!) samples and calculate normalized squared errors \begin{equation*} e_l:=\frac{\bigl(y_{\mathrm{pred},l}-y_l\bigr)^2}{\sum\limits_{\lambda=1}^n\bigl(y_{\mathrm{pred},\lambda}-y_\lambda\bigr)^2}\in[0,1]. \end{equation*} Normalization simplifies the formulas below. If the denominator is zero we have perfect fitting (overfitting) and should stop the procedure, because there is no more need for improvement. But that's rarely seen in practice. Else calculate the weighted loss \begin{equation*} L:=\sum_{l=1}^n p_l\,e_l\in[0,1]. \end{equation*} Stop if $L\geq\frac{1}{2}$. Without weighting (that is, with all $p_l$ replaced by 1) we would have $L=1$. With weighting $L$ expresses the average loss on the more important samples. Thus, $L$ close to 1 (for instance $L\geq\frac{1}{2}$) indicates that the current model is not able to fit important samples much better than less important ones. So we stop the boosting procedure because no more improvement can be expected.

Step 5 (weight update): Multiply weight $w_l$ by \begin{equation*} \left(\frac{L}{1-L}\right)^{1-e_l} \end{equation*} for $l=1,\ldots,n$ and go to step 2. The function $L\mapsto\frac{L}{1-L}$ is monotonically increasing and maps $[0,\frac{1}{2})$ to $[0,1)$. Small $L$ decreases weights more than $L$ close to $\frac{1}{2}$. With $L=\frac{1}{2}$ weights would remain unchanged. The closer an individual error $e_l$ is to zero the closer the corresponding update factor is to $\frac{L}{1-L}$. The larger $e_l$ the closer the update factor is to 1 (weight remains almost unchanged). Thus, weights of well fitted samples are decreased whereas weights of less well fitted ones remain almost unchanged.

After stopping (aggregation): After stopping the iteration we have a list of trained hypotheses $f_{\mathrm{approx},1},\ldots,f_{\mathrm{approx},q}$, where we exclude the last one, which satisfied the stopping criterion. Based on this list we define the final hypothesis $f_{\mathrm{approx}}$ as follows: for an input $x$ we compute the predictions $f_{\mathrm{approx},1}(x),\ldots,f_{\mathrm{approx},q}(x)$ and take their weighted median, where hypothesis $f_{\mathrm{approx},j}$ gets the weight $\ln\frac{1-L_j}{L_j}$ and $L_j$ denotes the weighted loss observed when training the $j$th model. Hypotheses with small weighted loss thus have more influence on the final prediction than hypotheses with loss close to $\frac{1}{2}$.

Rules for weight update, stopping and aggregation are somewhat arbitrary, but yield good results in practice.
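To make the steps concrete, here is a compact from-scratch sketch of the AdaBoost.R2 loop with decision stumps as weak models. The synthetic data, the number of rounds, the subset size, and the helper names are assumptions chosen for illustration; the sketch mirrors the steps above but is not a reference implementation.

```python
# Hedged sketch: AdaBoost.R2 with decision stumps as weak models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=300)

n = len(y)
n_rounds, subset_size = 20, 200      # illustrative choices
w = np.full(n, 1 / n)                # step 1: uniform initial weights
models, betas = [], []

for _ in range(n_rounds):
    p = w / w.sum()                                           # step 2: probabilities
    idx = rng.choice(n, size=subset_size, replace=True, p=p)  # step 2: subset with replacement
    model = DecisionTreeRegressor(max_depth=1)                # step 3: train a weak model (stump)
    model.fit(x[idx], y[idx])
    e = (model.predict(x) - y) ** 2                           # step 4: squared errors on all samples
    if e.sum() == 0:                                          # perfect fit, stop
        break
    e = e / e.sum()                                           # normalized errors
    L = (p * e).sum()                                         # weighted loss
    if L >= 0.5:                                              # stopping criterion
        break
    beta = L / (1 - L)
    w = w * beta ** (1 - e)                                   # step 5: weight update
    models.append(model)
    betas.append(beta)

def predict(x_new):
    """Aggregate predictions by the weighted median with model weights log(1/beta)."""
    preds = np.array([m.predict(x_new) for m in models])   # shape (number of models, number of inputs)
    weights = np.log(1 / np.array(betas))
    order = np.argsort(preds, axis=0)                      # sort predictions per input
    cum = np.cumsum(weights[order], axis=0)                # cumulative weights in sorted order
    median_idx = np.argmax(cum >= 0.5 * weights.sum(), axis=0)
    sorted_preds = np.take_along_axis(preds, order, axis=0)
    return sorted_preds[median_idx, np.arange(x_new.shape[0])]

print('predictions:', predict(x[:5]))
print('true labels:', y[:5])
```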

Scikit-Learn implements adaptive boosting in AdaBoostRegressor in the ensemble module.
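A hedged usage sketch for AdaBoostRegressor follows; the data set and all parameter values are again illustrative assumptions. The loss='square' option corresponds to the squared errors used in the description above (Scikit-Learn's default is 'linear').

```python
# Hedged sketch: adaptive boosting with AdaBoostRegressor.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=300)

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),  # weak base model (base_estimator in older versions)
    n_estimators=100,
    loss='square',                                 # squared errors as in AdaBoost.R2 above
    random_state=0,
)
ada.fit(x, y)
print('training score:', ada.score(x, y))
```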

Gradient boosting

Gradient boosting is an approximate gradient descent for a loss function. The approximations of the gradients are chosen to be predictions of (simple) models on the training set. Thus, we minimize a loss by improving an initial model. In each step a new model is added, yielding a weighted sum of models. Prediction performance of the overall model will be much better than that of each single model. The procedure is repeated until some stopping criterion (performance on a validation set, for instance) is satisfied. That's the basic idea. Now we have to fill in the details.

Denote training samples by $(x_1,y_1),\ldots,(x_n,y_n)$ and let $L:\mathbb{R}^n\rightarrow\mathbb{R}$ be the loss with respect to the true labels $y_1,\ldots,y_n$. For mean squared error loss we have \begin{equation*} L(z_1,\ldots,z_n)=\frac{1}{n}\,\sum_{l=1}^{n}(z_l-y_l)^2. \end{equation*} When discussing classification tasks in more detail we will meet more involved loss functions. So we stick to the general case here.

Denote the initial hypothesis by $f_{\mathrm{approx},1}$ (some trained model). Given a hypothesis $f_{\mathrm{approx},j}$ we denote the vector of predictions by \begin{equation*} y_{\mathrm{pred},j}:=\bigl(f_{\mathrm{approx},j}(x_1),\ldots,f_{\mathrm{approx},j}(x_n)\bigr). \end{equation*} We want to improve the training loss $L\bigl(y_{\mathrm{pred},j}\bigr)$ by doing a gradient step. Step direction is $-\nabla L\bigl(y_{\mathrm{pred},j}\bigr)$ and with step length $s$ we obtain the updated predictions \begin{equation*} y_{\mathrm{pred},j}-s\,\nabla L\bigl(y_{\mathrm{pred},j}\bigr). \end{equation*} The problem is that in general this is not a prediction of some model of interest. Thus, we fit a new model to samples \begin{equation*} \left(x_1,\frac{\partial L}{\partial z_1}\bigl(y_{\mathrm{pred},j}\bigr)\right),\ldots,\left(x_n,\frac{\partial L}{\partial z_n}\bigl(y_{\mathrm{pred},j}\bigr)\right) \end{equation*} and set \begin{equation*} f_{\mathrm{approx},j+1}:=f_{\mathrm{approx},j}-s\,f_{\mathrm{grad}}, \end{equation*} where $f_{\mathrm{grad}}$ denotes the hypothesis fitted to the gradient.

The described iterative procedure finally yields a weighted sum of many models. Each model approximates a gradient. Samples very well fitted by a model yield small gradient components. Thus, the next model will not modify the corresponding predictions too much. Conversely, the next model will concentrate on samples with large prediction error in the previous round.

For mean squared error loss we have \begin{equation*} \nabla L\bigl(y_{\mathrm{pred},j}\bigr)=\frac{2}{n}\bigl(y_{\mathrm{pred},j}-y\bigr), \end{equation*} the difference between predicted and true labels, also known as the residual vector (neglecting the factor $\frac{2}{n}$). Consequently, the gradient approximation model is fitted to the residual vector of the previous model. For a perfectly fitting model $f_{\mathrm{grad}}$ and step length $s=\frac{n}{2}$ we would have \begin{equation*} f_{\mathrm{approx},j+1}(x_l)=f_{\mathrm{approx},j}(x_l)-s\,f_{\mathrm{grad}}(x_l) =f_{\mathrm{approx},j}(x_l)-\bigl(f_{\mathrm{approx},j}(x_l)-y_l\bigr) =y_l \end{equation*} for $l=1,\ldots,n$. In other words, the next model would perfectly fit the training data.
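The following is a minimal from-scratch sketch of gradient boosting with mean squared error loss, following the update rule above with regression trees as gradient models. The constant initial hypothesis, the tree depth, and the step length are assumptions made for illustration.

```python
# Hedged sketch: gradient boosting for MSE loss, fitting trees to the gradient.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=300)

n = len(y)
s = 0.1 * n / 2          # step length (s = n/2 would cancel the factor 2/n exactly)
n_rounds = 100

pred = np.full(n, y.mean())            # initial hypothesis: constant model
trees = []

for _ in range(n_rounds):
    grad = 2 / n * (pred - y)                   # gradient of the MSE loss at the current predictions
    tree = DecisionTreeRegressor(max_depth=2)   # fit a simple model to the gradient
    tree.fit(x, grad)
    pred = pred - s * tree.predict(x)           # gradient step on the training predictions
    trees.append(tree)

def predict(x_new):
    """Evaluate the boosted model: initial constant minus s times each gradient model."""
    result = np.full(x_new.shape[0], y.mean())
    for tree in trees:
        result = result - s * tree.predict(x_new)
    return result

print('training MSE:', np.mean((predict(x) - y) ** 2))
```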

Scikit-Learn implements gradient boosting in GradientBoostingRegressor in the ensemble module.
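A hedged usage sketch for GradientBoostingRegressor; data and hyperparameters are again illustrative assumptions.

```python
# Hedged sketch: gradient boosting with GradientBoostingRegressor.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=300)

gb = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting rounds
    learning_rate=0.1,   # step length
    max_depth=2,         # depth of each regression tree
    random_state=0,
)
gb.fit(x, y)
print('training score:', gb.score(x, y))
```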

Stacking

Given a list of models trained on the same task we train a 'meta-model' to combine the predictions of all models. The meta-model usually is an ANN. Training data should be split: either one subset for training the weak models and one for training the meta-model, or an individual subset for each model. Stacking is used with heterogeneous weak models. For instance, we could combine results from an ANN, from SVR, and from $k$-NN regression. Scikit-Learn supports stacking with StackingRegressor from the ensemble module.
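A hedged stacking sketch combining SVR and $k$-NN regression with a small ANN as meta-model; all model choices and parameters are assumptions for illustration. Note that StackingRegressor produces the training data for the meta-model internally via cross-validation, so no manual split is needed here.

```python
# Hedged sketch: stacking heterogeneous models with StackingRegressor.
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=300)

stack = StackingRegressor(
    estimators=[('svr', SVR()), ('knn', KNeighborsRegressor())],   # weak models
    final_estimator=MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0),
    cv=5,    # the meta-model is trained on cross-validated predictions of the weak models
)
stack.fit(x, y)
print('training score:', stack.score(x, y))
```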