Boosting#

The fundamental idea of boosting is to use information about the prediction quality of a trained model to improve the training of a new model. Information about the prediction quality of this second model is then used to train a third model, and so on. This process yields a sequence of models, each making up for weaknesses of the previous one. In the end, either the predictions of the last model in the sequence are used or the predictions of all models are combined, for instance by averaging.

Although there are no restrictions on the choice of models, one typically uses models of the same type for boosting. Boosting tends to reduce bias (prediction error due to lacking model complexity). Thus, we may use very weak models like decision stumps (trees with only two leaves). Note that this is in contrast to bagging, which tends to reduce variance.

There exist many different implementations of boosting. Here we consider three boosting algorithms in more detail: AdaBoost, gradient boosting, and XGBoost.

Adaptive Boosting for Regression (AdaBoost.R2)#

AdaBoost was originally developed for binary classification. Later an adaptation to regression appeared (AdaBoost.R), and subsequently a modified regression version was published (AdaBoost.R2). Scikit-Learn implements AdaBoost.R2, so we restrict attention to that version. AdaBoost.R2 was introduced in Improving Regressors using Boosting Techniques by Harris Drucker.

In adaptive boosting we associate a weight with each training sample. A small weight marks the sample as unimportant, a high weight as very important. We may use such weights in two ways:

  • train a model only on samples with high weight or

  • include weights into the training procedure (weighted mean squared error as loss function).

The first variant is always applicable. The second variant works only for models trained by minimizing some loss function. AdaBoost.R2 uses only the first variant, but a weighted loss function is used for updating the weights.

Step 1 (initialization): Assign weights \(w_1,\ldots,w_n\) to the training samples \((x_1,y_1),\ldots,(x_n,y_n)\) and initialize all weights to \(\frac{1}{n}\).

Step 2 (subset selection): Calculate probabilities

\[\begin{equation*} p_l:=\frac{w_l}{\sum\limits_{\lambda=1}^n w_\lambda},\qquad l=1,\ldots,n, \end{equation*}\]

and choose \(N<n\) training samples according to the probabilities \(p_1,\ldots,p_n\) with replacement.

Step 3 (training): Train a model on the \(N\) samples.

Step 4 (stopping criteria): Get predictions \(y_{\mathrm{pred},1},\ldots,y_{\mathrm{pred},n}\) for all (!) samples and calculate normalized squared errors

\[\begin{equation*} e_l:=\frac{\bigl(y_{\mathrm{pred},l}-y_l\bigr)^2}{\sum\limits_{\lambda=1}^n\bigl(y_{\mathrm{pred},\lambda}-y_\lambda\bigr)^2}\in[0,1]. \end{equation*}\]

Normalization simplifies the formulas below. If the denominator is zero, the model fits all samples perfectly (overfitting) and we should stop the procedure, because there is no more room for improvement. But that is rarely seen in practice. Otherwise, calculate the weighted loss

\[\begin{equation*} L:=\sum_{l=1}^n p_l\,e_l\in[0,1]. \end{equation*}\]

Stop if \(L\geq\frac{1}{2}\). Without weighting we would have \(L=1\). With weighting \(L\) expresses the average loss on the more important samples. Thus, \(L\) close to 1 (for instance \(L\geq\frac{1}{2}\)) indicates that the current model is not able to fit important samples much better than less important ones. So we stop the boosting procedure because no more improvement can be expected.

Step 5 (weight update): Multiply weight \(w_l\) by

\[\begin{equation*} \left(\frac{L}{1-L}\right)^{1-e_l} \end{equation*}\]

for \(l=1,\ldots,n\) and go to step 2. The function \(L\mapsto\frac{L}{1-L}\) is monotonically increasing and maps \([0,\frac{1}{2})\) to \([0,1)\). Small \(L\) decreases weights more than \(L\) close to \(\frac{1}{2}\). With \(L=\frac{1}{2}\) weights would remain unchanged. The closer an individual error \(e_l\) is to zero the closer the corresponding update factor is to \(\frac{L}{1-L}\). The larger \(e_l\) the closer the update factor is to 1 (weight remains almost unchanged). Thus, weights of well fitted samples are decreased whereas weights of less well fitted ones remain almost unchanged.
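Steps 2 to 5 of one boosting round can be sketched in a few lines of NumPy and Scikit-Learn code. This is only an illustration, not Scikit-Learn's implementation; the base model (a depth-restricted decision tree) and all names are our own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(x, y, w, N, rng):
    # step 2: sampling probabilities and subset selection (with replacement)
    p = w / w.sum()
    idx = rng.choice(len(y), size=N, replace=True, p=p)

    # step 3: train a weak model on the chosen subset
    model = DecisionTreeRegressor(max_depth=3)
    model.fit(x[idx], y[idx])

    # step 4: normalized squared errors on all samples and weighted loss
    y_pred = model.predict(x)
    sq_err = (y_pred - y) ** 2
    if sq_err.sum() == 0:       # perfect fit, stop boosting
        return model, w, 0.0, True
    e = sq_err / sq_err.sum()
    L = (p * e).sum()
    if L >= 0.5:                # stopping criterion
        return model, w, L, True

    # step 5: decrease weights of well fitted samples
    w = w * (L / (1 - L)) ** (1 - e)
    return model, w, L, False
```

Here the weights would be initialized as `w = np.full(n, 1 / n)` (step 1) and `rng` could be `np.random.default_rng(0)`. Models and their losses \(L\) are collected round by round until the stop flag is set.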

After stopping (aggregation): After stopping the iteration we have a list of trained hypotheses \(f_{\mathrm{approx},1},\ldots,f_{\mathrm{approx},q}\), where we exclude the last one, which satisfied the stopping criterion. Based on this list we define the final hypothesis \(f_{\mathrm{approx}}\) as follows:

  • For a feature vector \(x\) get predictions \(f_{\mathrm{approx},1}(x),\ldots,f_{\mathrm{approx},q}(x)\) from all models.

  • For each model calculate

    \[\begin{equation*} \ln\left(\frac{1}{L}-1\right) \end{equation*}\]

    with model-dependent \(L\) as introduced in step 4. Denote these numbers by \(a_1,\ldots,a_q\). They will be used as weights for the models. The function \(L\mapsto\frac{1}{L}-1\) is monotonically decreasing and maps \(0\) and \(\frac{1}{2}\) to \(+\infty\) and \(1\), respectively. The logarithm maps \([1,\infty]\) monotonically to \([0,\infty]\). Thus, \(L\) close to zero yields a high weight for the model, \(L\) close to \(\frac{1}{2}\) yields a small weight.

  • Calculate the weighted median of \(f_{\mathrm{approx},1}(x),\ldots,f_{\mathrm{approx},q}(x)\) with weights \(a_1,\ldots,a_q\):

    \[\begin{equation*} f_{\mathrm{approx}}(x):=\inf\left\{y\in\mathbb{R}:\,\sum_{\mu=1\atop{f_{\mathrm{approx},\mu}(x)\leq y}}^q a_\mu\geq\frac{1}{2}\,\sum_{\mu=1}^q a_\mu\right\}. \end{equation*}\]

Rules for weight update, stopping and aggregation are somewhat arbitrary, but yield good results in practice.
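As an illustration, the weighted median in the last step could be computed as follows, assuming the predictions \(f_{\mathrm{approx},1}(x),\ldots,f_{\mathrm{approx},q}(x)\) and the weights \(a_1,\ldots,a_q\) are given as NumPy arrays (again a sketch, not Scikit-Learn's implementation):

```python
import numpy as np

def weighted_median(predictions, a):
    # sort predictions and accumulate the corresponding weights
    order = np.argsort(predictions)
    cum_weights = np.cumsum(a[order])
    # smallest prediction whose cumulative weight reaches half of the total weight
    k = np.searchsorted(cum_weights, 0.5 * a.sum())
    return predictions[order][k]
```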

Scikit-Learn implements adaptive boosting in AdaBoostRegressor in the ensemble module.
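Basic usage might look as follows (hyperparameter values are examples only; in recent Scikit-Learn versions the base model is passed via the `estimator` parameter, older versions use `base_estimator`):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

model = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),  # weak base model
    n_estimators=100,                              # maximum number of boosting rounds
    loss='square',                                 # squared loss as in step 4 above
    random_state=0,
)
# model.fit(x_train, y_train)
# y_test_pred = model.predict(x_test)
```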

Adaptive Boosting for Classification (AdaBoost-SAMME)#

Above we considered AdaBoost for regression tasks. For classification problems, AdaBoost originated in 1995 with a focus on binary classification. In 2009 the now standard AdaBoost algorithm for multiclass problems was proposed (see Multi-class AdaBoost by Zhu, Zou, Rosset, Hastie), known as AdaBoost-SAMME. An AdaBoost model outputs class labels (not probabilities), and base models are required to yield class labels, too.

As for regression, each training sample is assigned a weight. Depending on the prediction quality of the model, weights are modified to control fitting of the next model. The better the previous fit, the smaller the weight. Thus, fitting concentrates on difficult samples. The detailed procedure for \(C\) classes is as follows:

Step 1 (initialization): Set all weights \(w_1,\ldots,w_n\) to \(\frac{1}{n}\).

Step 2 (training): Train a model on the weighted samples.

Step 3 (weighted classification error): Calculate

\[\begin{equation*} e:=\frac{\text{sum of weights of misclassified samples}}{\text{sum of all weights}}. \end{equation*}\]

Step 4 (stopping criteria): Typically AdaBoost is stopped after a fixed number of iterations (100, for instance) or if \(e=0\) (early stopping). Alternatively one may stop AdaBoost if \(e\geq\frac{C-1}{C}\), that is, if the error is not better than the error of random guessing.

Step 5 (weight update): Multiply all weights of misclassified samples by

\[\begin{equation*} (C-1)\,\frac{1-e}{e} \end{equation*}\]

and go to step 2. The update factor is greater than 1 if and only if the error \(e\) is smaller than the error for randomly assigned classes \(\frac{C-1}{C}\).
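Steps 3 to 5 may be sketched as follows (an illustration only; `y` and `y_pred` contain true and predicted class labels, `w` the current weights):

```python
import numpy as np

def weight_update(y, y_pred, w, C):
    # step 3: weighted classification error
    miss = y_pred != y
    e = w[miss].sum() / w.sum()

    # step 4: stop on perfect fit or if not better than random guessing
    if e == 0 or e >= (C - 1) / C:
        return w, e, True

    # step 5: increase weights of misclassified samples
    w = w.copy()
    w[miss] *= (C - 1) * (1 - e) / e
    return w, e, False
```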

After stopping (aggregation): After stopping the iteration we have a list of trained hypotheses \(f_{\mathrm{approx},1},\ldots,f_{\mathrm{approx},q}\) (step 2) and a list of weighted errors \(e_1,\ldots,e_q\) (step 3). We define the final hypothesis \(f_{\mathrm{approx}}\) as follows:

  • Given a sample \(x\) for each model \(f_{\mathrm{approx},\mu}\) define a vector \(a_\mu\in\{0,1\}^C\) with components

    \[\begin{equation*} a_\mu^{(i)}:=\begin{cases}0,&\text{if }f_{\mathrm{approx},\mu}(x)\neq i,\\ 1,&\text{if }f_{\mathrm{approx},\mu}(x)=i, \end{cases}\qquad i=1,\ldots,C. \end{equation*}\]

    The vectors \(a_1,\ldots,a_q\) can be regarded as one-hot encoded predictions of the corresponding models.

  • Calculate the weighted sum

    \[\begin{equation*} a:=\sum_{\mu=1}^q\log\left((C-1)\,\frac{1-e_\mu}{e_\mu}\right)\,a_\mu. \end{equation*}\]

    Components of the vector \(a\) can be regarded as scores for each class. Coefficients are 0 if the error \(e_\mu\) equals the error from random guessing. The smaller the error the larger the coefficient.

  • Predict class \(i\) for \(x\), where \(i\) is the index of the largest component of \(a\).
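This aggregation could be sketched like this, assuming integer class labels \(0,\ldots,C-1\), a single feature vector `x` (NumPy array), and lists `models` and `errors` of trained classifiers and their weighted errors (again only an illustration):

```python
import numpy as np

def predict_class(x, models, errors, C):
    scores = np.zeros(C)   # one score per class
    for model, e in zip(models, errors):
        a = np.zeros(C)    # one-hot encoded prediction of the current model
        a[model.predict(x.reshape(1, -1))[0]] = 1
        scores += np.log((C - 1) * (1 - e) / e) * a   # weighted vote
    return np.argmax(scores)   # class with the largest score
```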

Gradient Boosting#

Gradient boosting is an approximate gradient descent for a loss function. Approximations of the gradients are chosen to be predictions of (simple) models on the training set. Thus, we minimize a loss by improving an initial model. In each step a new model is added, yielding a weighted sum of models. Prediction performance of the overall model will be much better than that of each single model. The procedure is repeated until some stopping criterion (performance on a validation set, for instance) is satisfied. That is the basic idea. Now we have to fill in the details.

Denote training samples by \((x_1,y_1),\ldots,(x_n,y_n)\) and let \(L:\mathbb{R}^n\rightarrow\mathbb{R}\) be the loss with respect to the true labels \(y_1,\ldots,y_n\). For mean squared error loss we have

\[\begin{equation*} L(z_1,\ldots,z_n)=\frac{1}{n}\,\sum_{l=1}^{n}(z_l-y_l)^2, \end{equation*}\]

but we stick to the general case here.

Denote the initial hypothesis by \(f_{\mathrm{approx},1}\) (some trained model). Given a hypothesis \(f_{\mathrm{approx},j}\) we denote the vector of predictions by

\[\begin{equation*} y_{\mathrm{pred},j}:=\bigl(f_{\mathrm{approx},j}(x_1),\ldots,f_{\mathrm{approx},j}(x_n)\bigr). \end{equation*}\]

We want to improve the training loss \(L\bigl(y_{\mathrm{pred},j}\bigr)\) by doing a gradient step. In other words: how should we modify the predictions \(y_{\mathrm{pred},j}\) to make \(L\) smaller? The step direction is \(-\nabla L\bigl(y_{\mathrm{pred},j}\bigr)\), and with step length \(s\) we obtain the updated predictions

\[\begin{equation*} y_{\mathrm{pred},j}-s\,\nabla L\bigl(y_{\mathrm{pred},j}\bigr). \end{equation*}\]

The problem is that, in general, this updated vector is not the prediction of any model we could evaluate on new data. Thus, we fit a new model to the samples

\[\begin{equation*} \left(x_1,\frac{\partial L}{\partial z_1}\bigl(y_{\mathrm{pred},j}\bigr)\right),\ldots,\left(x_n,\frac{\partial L}{\partial z_n}\bigl(y_{\mathrm{pred},j}\bigr)\right) \end{equation*}\]

and set

\[\begin{equation*} f_{\mathrm{approx},j+1}:=f_{\mathrm{approx},j}-s\,f_{\mathrm{grad}}, \end{equation*}\]

where \(f_{\mathrm{grad}}\) denotes the hypothesis fitted to the gradient.

The described iterative procedure finally yields a weighted sum of many models, each approximating a gradient. Samples very well fitted by a model have small gradient components, so the next model will not modify the corresponding predictions too much. Conversely, the next model concentrates on samples with large prediction error of the previous model.

For mean squared error loss we have

\[\begin{equation*} \nabla L\bigl(y_{\mathrm{pred},j}\bigr)=\frac{2}{n}\bigl(y_{\mathrm{pred},j}-y\bigr), \end{equation*}\]

the difference between predicted and true labels, also known as the residual vector (neglecting the factor \(\frac{2}{n}\)). Consequently, the gradient approximation model is fitted to the residual vector of the previous model. For a perfectly fitting model \(f_{\mathrm{grad}}\) and step length \(s=\frac{n}{2}\) we would have

\[\begin{equation*} f_{\mathrm{approx},j+1}(x_l)=f_{\mathrm{approx},j}(x_l)-s\,f_{\mathrm{grad}}(x_l) =f_{\mathrm{approx},j}(x_l)-\bigl(f_{\mathrm{approx},j}(x_l)-y_l\bigr) =y_l \end{equation*}\]

for \(l=1,\ldots,n\). In other words, the next model would perfectly fit the training data.
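For squared error loss the whole procedure thus boils down to repeatedly fitting a model to the residuals of the current ensemble. A minimal sketch with regression trees as gradient models (constant initial hypothesis; step length, number of rounds, and tree depth are arbitrary choices here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(x, y, n_rounds=100, step=0.1):
    f0 = y.mean()                    # initial hypothesis: constant prediction
    y_pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - y_pred                        # negative gradient (up to the factor 2/n)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(x, residuals)                        # fit a model to the gradient information
        y_pred = y_pred + step * tree.predict(x)      # gradient step on the training predictions
        trees.append(tree)
    return f0, trees

def predict_boosted(x, f0, trees, step=0.1):
    # the final hypothesis is a weighted sum of all gradient models
    return f0 + step * sum(tree.predict(x) for tree in trees)
```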

Scikit-Learn implements gradient boosting in GradientBoostingRegressor and GradientBoostingClassifier in the ensemble module.
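Basic usage of the regressor might look as follows (hyperparameter values are examples only):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,    # number of boosting rounds
    learning_rate=0.1,   # step length s
    max_depth=3,         # depth of the regression trees fitted to the gradients
)
# model.fit(x_train, y_train)
# y_test_pred = model.predict(x_test)
```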

Note that the output of a gradient boosted model is some real number due to its additive nature. To solve binary classification tasks with gradient boosted models, a sigmoid or some similar function has to be applied to the outputs.

Multiclass classification tasks usually are solved in a one-versus-rest manner if boosting shall be applied. If we have \(C\) classes, then gradient boosting for binary classification is done \(C\) times in parallel. If the final model's output shall be probabilities, softmax is applied to the predictions.

In principle, gradient boosting can be applied directly to a multiclass problem (without one-versus-rest). But this is less efficient, because a more complex model with \(C\) outputs has to be trained in each step. Moreover, each gradient step yields a smaller decrease of the loss function than with a one-versus-rest approach. The reason for the latter observation is the structure of most loss functions: the loss is calculated for each class individually based on class probabilities and then summed up. The summands are mutually independent, and minimizing each summand individually typically yields faster descent than minimizing the whole sum as one function.

AdaBoost versus Gradient Boosting#

AdaBoost-SAMME is a special case of gradient boosting. This relation was discovered in 2000, five years after the invention of the original AdaBoost algorithm. With \(C\) being the number of classes, class labels \(y\) have to be encoded in a one-hot-like manner

\[\begin{equation*} \tilde{y}:=\bigl(\tilde{y}^{(1)},\ldots,\tilde{y}^{(C)}\bigr),\qquad \tilde{y}^{(i)}:=\begin{cases}1,&\text{if }y=i,\\-\frac{1}{C-1},&\text{if }y\neq i,\end{cases} \end{equation*}\]

and the loss function for gradient boosting is

\[\begin{equation*} L(z_1,\ldots,z_n):=\frac{1}{n}\,\sum_{l=1}^n\mathrm{e}^{-\frac{1}{C}\bigl(\tilde{y}_l^{(1)}\,z_l^{(1)}+\cdots+\tilde{y}_l^{(C)}\,z_l^{(C)}\bigr)}. \end{equation*}\]

The full proof that AdaBoost-SAMME is equivalent to gradient boosting is provided by the authors of AdaBoost-SAMME in the above mentioned article.

XGBoost#

XGBoost originated in 2016 and became very popular due to its success in several machine learning contests. XGBoost uses decision trees to boost an initial model and aims at large-scale high-performance computing.

The basic boosting idea is similar to gradient boosting, but instead of gradient descent for the loss function, XGBoost uses a second-order minimization method involving second partial derivatives (Newton's method).

See XGBoost: A Scalable Tree Boosting System for details on the algorithm and XGBoost Python Package for a Python package.
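The XGBoost package ships a Scikit-Learn compatible interface; usage might look as follows (hyperparameter values are examples only):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=100,    # number of boosted trees
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=3,         # depth of the trees
)
# model.fit(x_train, y_train)
# y_test_pred = model.predict(x_test)
```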