Bagging#

Bagging (short for bootstrap aggregation) averages the predictions of many simple models to obtain a more accurate prediction than any single one of these models can provide. The aim of bagging is to reduce variance (that is, prediction error due to overfitting) by averaging the results of many high-variance models.
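As a rough sketch of why averaging helps (a standard argument, not part of the text above): if $B$ models produce predictions $\hat f_1(x),\dots,\hat f_B(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$, then

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

For uncorrelated models ($\rho = 0$) the variance of the average shrinks like $\sigma^2/B$, which is why bagging tries to make the individual models as independent as possible.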

Although bagging can in principle be applied to a set of very different machine learning models, it is usually used with a set of identical models.

Bootstrapping#

If we train identical models on identical training data, the models will yield more or less identical predictions. Thus, we have to train each model on a different data set. We could divide the data set into as many subsets as we have models, but then each subset would be rather small. Instead we use a method known in statistics as bootstrapping: we sample new data sets from the original data set with replacement, so a sample may occur several times in a new set. Sampling with replacement has the advantage that the new sets are drawn independently of each other, which makes the trained models (almost) independent of each other. Bootstrapping yields a list of data sets which, on the one hand, follow more or less the same distribution as the original data set and, on the other hand, can (at least in principle) be arbitrarily large.
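A minimal sketch of drawing one bootstrap sample with NumPy (the data and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# original data set: 100 samples, 3 features (made up for illustration)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# draw a bootstrap data set of the same size as the original;
# sampling with replacement, so some samples appear several times
indices = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[indices], y[indices]
```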

Bagging with Scikit-Learn#

Scikit-Learn supports bagging with BaggingRegressor (for regression tasks) and BaggingClassifier (for classification tasks) from Scikit-Learn’s ensemble module. The corresponding estimator objects have the usual fit and predict interface. When creating the estimator we may pass the following arguments:

  • estimator: a Scikit-Learn estimator object (linear regression, ANN, decision tree, and so on) to be trained several times,

  • n_estimators: how many models to train,

  • max_samples: the size of the bootstrapped training subsets (an absolute number of samples or a fraction of the training set).

There is also a max_features argument to restrict the number of features each model sees. Instead of training each model on a different data set, we might train the models on different sets of features (random subspace method).

Note that BaggingRegressor and BaggingClassifier also support some bagging-like techniques we do not introduce here.
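A minimal usage sketch, assuming a recent Scikit-Learn version in which the base model is passed via the estimator argument (the data and parameter values are made up for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# toy regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# bag 50 decision trees, each trained on a bootstrap sample
# containing 80 percent of the training data
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    max_samples=0.8,
    random_state=0,
)
bagging.fit(X, y)

print(bagging.predict(X[:5]))
```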

Random Forests#

If bagging is used with decision trees as the base model, we obtain a random forest (a forest is a collection of trees). In addition, random forests typically consider only a random subset of the features when searching for the best split at each node. Scikit-Learn has specialized routines for training random forests: RandomForestRegressor and RandomForestClassifier.

The standard behavior is to grow trees to their maximum size. For complex data sets, growing a forest of maximum-size trees may exhaust memory; the max_depth argument limits the size of the trees.
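A short sketch of training a random forest with limited tree depth (the data and the chosen max_depth value are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)

# 100 trees; max_depth caps the tree size to keep memory usage bounded
forest = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
forest.fit(X, y)

print(forest.score(X, y))
```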

Random Forests for Feature Selection#

Random forests can also be used for feature selection. Given a trained random forest, the importance of a feature is calculated as follows:

  1. For all trees, find all nodes that split with respect to the feature.

  2. For each node from step 1, calculate the decrease in the impurity measure (variance, misclassification rate,…) caused by the split.

  3. Calculate the weighted sum of all these decreases, where the weight of a node is the number of samples reaching it.

This procedure ensures that

  • features decreasing impurity more than others have higher importance.

  • features corresponding to nodes close to the root of a tree (more samples in the node) have higher importance.

In Scikit-Learn, random-forest-based feature importances are available via the feature_importances_ attribute of the RandomForestRegressor or RandomForestClassifier object after training the forest.
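A small sketch of reading the importances from a trained forest (again with made-up data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# one importance value per feature; Scikit-Learn normalizes them to sum to 1
for idx, importance in enumerate(forest.feature_importances_):
    print(f"feature {idx}: {importance:.3f}")
```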