If you are viewing this file in preview mode, some links won't work. Find the fully featured Jupyter Notebook file on the website of Prof. Jens Flemming at Zwickau University of Applied Sciences. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Regression trees

Classification and regression trees (CART) are a class of machine learning techniques based on decision trees. A tree is an abstract structure made of nodes and edges. Nodes are connected by edges. Each node has exactly one parent node and may have several child nodes; the only exception is the root node, which has no parent. Nodes without children are called leaves. A subtree is a node together with all its descendants (children, grandchildren, and so on).

tree

In a decision tree each node represents a condition on a feature in a learning task. A feature vector is passed through the tree, starting at the root and finally arriving at one of the tree's leaves, which represents the prediction. At each node the corresponding condition is evaluated for the concrete feature vector. Based on the result the feature vector is passed on to one of the node's children. Evaluating a node may have several possible outcomes, but often conditions are either satisfied or not, yielding a binary tree (two children per node).

decision tree
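To make the idea concrete, here is a minimal hand-written sketch of such a pass through a small binary decision tree (feature indices, thresholds, and leaf values are invented for this illustration):

```python
def predict(x):
    """Pass a feature vector x = (x1, x2) through a small decision tree.
    Inner nodes are if/else conditions, return values are the leaves."""
    if x[0] <= 3.0:          # root node: condition on the first feature
        if x[1] <= 1.5:      # child node: condition on the second feature
            return 2.0       # leaf
        return 5.0           # leaf
    return 7.5               # leaf

print(predict((2.0, 1.0)))   # follows the left branches and prints 2.0
```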

The training phase consists of building a decision tree and the prediction phase consists of passing feature vectors through the tree. Prediction is fast and simple, but for training we have to answer difficult questions:

- Which feature should be used for splitting at a node?
- Which condition on the chosen feature should be used?
- When should we stop growing the tree?
- Which prediction should a leaf yield?

Major advantages of decision trees:

- Predictions are easy to interpret: the path from root to leaf can be read as a chain of simple rules.
- Both numerical and categorical features can be handled.
- Features do not have to be scaled or otherwise heavily preprocessed.

Training

Training a regression tree is a relatively complex task. We start with an overview, then consider a concrete algorithm, and finally discuss regularization techniques.

Overview

There exist many techniques to grow decision trees. The overall procedure is as follows: We start with a tree containing only the root node. We select one of the features and a condition involving only the selected feature. Then we split the training data set according to the condition into disjoint subsets. For each subset we create a child node and process each child node in the same way as the root node, but with the full data set replaced by the subset corresponding to the child node. This splitting procedure is repeated for all leaves not satisfying some stopping criterion. Common stopping criteria are:

- the leaf's subset contains fewer samples than some predefined number,
- the tree has reached some predefined maximum depth,
- the targets in the leaf's subset are (almost) identical, so further splitting cannot improve predictions.

After stopping the growth process, each leaf corresponds to a small set of samples (the ones satisfying all conditions on the path to the leaf). The prediction corresponding to a leaf is the mean of all targets of the samples in that set. The smaller the variance of targets at each leaf, the more accurate the predictions on the training data set are.
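A compact sketch of this growth procedure is given below (function names and stopping parameters are invented; the split rule is only a placeholder, since choosing features and conditions is discussed in the following paragraphs):

```python
import numpy as np

def grow(X, y, depth=0, max_depth=3, min_samples=5):
    """Recursively grow a regression tree on features X and targets y.
    Returns a nested dict; leaves store the mean target as prediction."""
    # stopping criteria: maximum depth reached or too few samples in the subset
    if depth >= max_depth or len(y) < min_samples:
        return {'prediction': y.mean()}
    # placeholder split rule: cycle through the features and split at the median;
    # a data-driven rule (variance reduction) is discussed below
    feature = depth % X.shape[1]
    threshold = np.median(X[:, feature])
    mask = X[:, feature] <= threshold
    if mask.all() or not mask.any():   # split would leave one side empty
        return {'prediction': y.mean()}
    return {'feature': feature, 'threshold': threshold,
            'left': grow(X[mask], y[mask], depth + 1, max_depth, min_samples),
            'right': grow(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict_one(tree, x):
    """Pass a single feature vector through the tree until a leaf is reached."""
    while 'prediction' not in tree:
        tree = tree['left'] if x[tree['feature']] <= tree['threshold'] else tree['right']
    return tree['prediction']

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=100)
tree = grow(X, y)
print(predict_one(tree, np.array([2.0, 5.0])))
```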

For numerical features conditions are formulated as a single inequality, so the feature's range is split into two disjoint intervals. Since we only have finitely many samples, there are only finitely many sensible splitting points: at most one between each pair of neighboring distinct feature values, so fewer splitting points than samples.
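For example (feature values invented), the sensible splitting points for a numerical feature can be taken as the midpoints between neighboring distinct values:

```python
import numpy as np

feature_values = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])   # one numerical feature
distinct = np.unique(feature_values)                         # sorted distinct values
candidates = (distinct[:-1] + distinct[1:]) / 2              # midpoints between neighbors
print(candidates)   # [2.  3.5 4.5 7. ]  -> fewer candidates than samples
```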

For categorical features with few categories, splitting into as many child nodes as there are categories is feasible. Otherwise some condition with binary result should be considered. There are at most $2^{\text{number of categories}}$ subsets of categories and, thus, at most that many conditions with binary result for a categorical feature; since a subset and its complement yield the same split and trivial subsets can be ignored, only $2^{\text{number of categories}-1}-1$ of these splits are essentially different.
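As an illustration of this count (category names invented), the binary splits of a small categorical feature can be enumerated as follows:

```python
from itertools import combinations

categories = ['red', 'green', 'blue']

# each subset defines the condition "category in subset";
# iterating only over subsets containing the first category avoids
# counting a subset and its complement twice
rest = categories[1:]
splits = []
for r in range(len(rest) + 1):
    for combo in combinations(rest, r):
        left = {categories[0], *combo}
        right = set(categories) - left
        if right:                      # skip the trivial split (all categories on one side)
            splits.append((left, right))

print(len(splits))   # 2**(len(categories) - 1) - 1 = 3 essentially different splits
for left, right in splits:
    print(left, '|', right)
```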

Choosing features and conditions is the hard part. There exist many techniques to do this, but most of them are closely related to classification tasks discussed in more detail next semester. A common technique for regression trees is variance reduction.

Variance reduction

Consider one node and the corresponding subset $S$ of the training data set $\{(x_1,y_1),\ldots,(x_n,y_n)\}$. Let \begin{equation*} I_S:=\{k\in\{1,\ldots,n\}:\,(x_k,y_k)\in S\} \end{equation*} be the index set holding all indices of samples in $S$ and denote mean and variance of targets in $S$ by \begin{equation*} m(S):=\frac{1}{|S|}\,\sum_{k\in I_S}y_k \end{equation*} and \begin{equation*} v(S):=\frac{1}{|S|}\,\sum_{k\in I_S}\bigl(y_k-m(S)\bigr)^2, \end{equation*} respectively.

Given a split $S=L\cup R$ into disjoint subsets $L$ and $R$, the individual variances $v(L)$ and $v(R)$ may be smaller or larger than $v(S)$, but their size-weighted average never exceeds the variance of the whole subset: \begin{equation*} \frac{|L|}{|S|}\,v(L)+\frac{|R|}{|S|}\,v(R)\leq v(S). \end{equation*} Thus, splitting reduces variance. Since we aim at low variance in the tree's leaves we would like to choose a split which minimizes both variances $v(L)$ and $v(R)$ at once. Optimization problems with multiple objective functions are hard to handle. So we look for an objective combining both variances. We could simply use the sum $v(L)+v(R)$, but then small subsets with low variance would have the same weight as large subsets with low variance. Leaves corresponding to large subsets with low variance are good because they yield good predictions on many training samples, whereas leaves corresponding to small subsets do not matter so much. A better idea for a joint objective is the weighted sum of variances with weights representing subset sizes: \begin{equation*} \frac{|L|}{|S|}\,v(L)+\frac{|R|}{|S|}\,v(R). \end{equation*}

A script for comparing splits resulting from the joint variance without and with weights illustrates the difference.
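A minimal sketch of such a script, using a small synthetic target vector (values invented for illustration) and trying every split of the sorted targets:

```python
import numpy as np

# synthetic targets: a large homogeneous group plus two clearly different values
y = np.sort(np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 5.0, 5.2]))

def unweighted(left, right):
    """Plain sum of the two variances."""
    return np.var(left) + np.var(right)

def weighted(left, right):
    """Weighted sum of variances, weights proportional to subset sizes."""
    n = len(left) + len(right)
    return len(left) / n * np.var(left) + len(right) / n * np.var(right)

# try every possible split position of the sorted targets
for i in range(1, len(y)):
    left, right = y[:i], y[i:]
    print(f'{len(left)} vs. {len(right)} samples:  '
          f'unweighted joint variance {unweighted(left, right):6.3f},  '
          f'weighted joint variance {weighted(left, right):6.3f}')
```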

Note that the formula for the joint variance can be rewritten as \begin{equation*} \frac{|L|}{|S|}\,v(L)+\frac{|R|}{|S|}\,v(R) =\frac{1}{|S|}\,\left(\sum_{k\in I_L}\bigl(y_k-m(L)\bigr)^2+\sum_{k\in I_R}\bigl(y_k-m(R)\bigr)^2\right). \end{equation*} Up to the constant factor $\frac{1}{|S|}$, which does not influence the choice of the split, each of the two summands can be regarded as a sum of squared errors (SSE) if we consider $m(L)$ and $m(R)$ as predictions and the $y_k$ as targets. Don't confuse SSE with MSE (mean squared error).
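Building on the growth sketch above, a split-selection function based on this SSE form of the objective could look as follows (again only an illustrative sketch, not Scikit-Learn's implementation):

```python
import numpy as np

def sse(y):
    """Sum of squared errors when predicting the mean of y."""
    return ((y - y.mean()) ** 2).sum()

def best_split(X, y):
    """Return (feature, threshold) minimizing SSE(L) + SSE(R), or None.
    Minimizing SSE(L) + SSE(R) is equivalent to minimizing the weighted
    joint variance, because |S| is fixed for the node under consideration."""
    best, best_score = None, np.inf
    for feature in range(X.shape[1]):
        values = np.unique(X[:, feature])
        for threshold in (values[:-1] + values[1:]) / 2:   # midpoint candidates
            mask = X[:, feature] <= threshold
            score = sse(y[mask]) + sse(y[~mask])
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = np.where(X[:, 0] <= 5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))   # should report a threshold near 5 on feature 0
```

Plugging such a best_split function into the growth sketch above, in place of the placeholder split rule, yields a basic variance-reduction regression tree.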

Regularization (pruning)

Small trees aren't able to represent complex hypotheses. Large trees tend to overfit the training data. Thus, growth of trees has to be stopped at the right moment by some stopping criterion (see above). A more complex regularization technique is pruning. Here we grow a very complex tree, which overfits the training data, and then remove some nodes together with all their descendants. Removing a node means that we replace it by a leaf, as if the splitting had never happened. We try to remove nodes which can be removed without affecting prediction accuracy on a validation data set too much.

Here we only consider one pruning algorithm in more detail, so-called cost-complexity pruning. Passing the validation data through the tree, we obtain a subset of validation samples at each leaf. For those subsets we calculate leafwise prediction errors (SSE). If we remove a node (and all its descendants) from the tree, the prediction error of the tree increases or remains unchanged. If the error remains unchanged, then we can safely remove the node. If the error increases, we have to decide whether the increase is acceptable compared to the decrease in complexity. The complexity of a tree can be expressed as its number of leaves. The trade-off between prediction error and complexity can be expressed by the cost-complexity measure (CCM).

The CCM of a node, or rather of the corresponding subtree, is completely determined by the leaves of the subtree: \begin{equation*} CCM(\text{subtree}):=\sum_{\text{leaves of}\atop{\text{subtree}}}\text{validation SSE of leaf} +\alpha\cdot\text{number of leaves in subtree}. \end{equation*} The first summand expresses the prediction error, the second the complexity of the subtree. The regularization parameter $\alpha$ controls the trade-off between error and complexity. The CCM of a leaf (a subtree containing only one node) is \begin{equation*} CCM(\text{leaf})=\text{validation SSE of leaf}+\alpha. \end{equation*}
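A tiny numerical illustration (all SSE values invented) of how CCM compares a subtree with the leaf that would replace it:

```python
# validation SSE values of the subtree's leaves (invented numbers)
leaf_sse = [4.0, 3.0, 2.5]
# validation SSE of the node if the whole subtree is replaced by a single leaf
node_sse = 12.0
alpha = 1.5

ccm_subtree = sum(leaf_sse) + alpha * len(leaf_sse)   # error term + complexity term
ccm_leaf = node_sse + alpha                           # a single leaf counts once
print(ccm_subtree, ccm_leaf)   # 14.0 13.5 -> for this alpha pruning lowers CCM
```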

If we replace a subtree by a leaf, the first summand (error) in CCM increases and the second (complexity) decreases, because the number of leaves drops to one. For $\alpha=0$ we have $CCM(\text{subtree})\leq CCM(\text{leaf})$, because the prediction error dominates CCM. For very large $\alpha$ we have $CCM(\text{subtree})>CCM(\text{leaf})$, because the complexity dominates CCM. So we may ask for the $\alpha$ with \begin{equation*} CCM(\text{subtree})=CCM(\text{leaf}). \end{equation*} Such an $\alpha$ is called effective $\alpha$. The formula is \begin{equation*} \alpha=\frac{{\text{validation SSE of node}\atop\text{before splitting}}-\sum\limits_{\text{leaves of}\atop{\text{subtree}}}\text{validation SSE of leaf}}{\text{number of leaves in subtree}-1}. \end{equation*}
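Continuing the invented numbers from above, the effective $\alpha$ follows directly from this formula:

```python
leaf_sse = [4.0, 3.0, 2.5]   # validation SSE of the subtree's leaves (invented)
node_sse = 12.0              # validation SSE of the node before splitting (invented)

effective_alpha = (node_sse - sum(leaf_sse)) / (len(leaf_sse) - 1)
print(effective_alpha)   # 1.25: for alpha above this value pruning decreases CCM
```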

For the effective $\alpha$ the CCM does not change if we replace the subtree by a leaf. A small effective $\alpha$ indicates that the prediction error increases only slightly while the complexity decreases considerably when replacing the subtree. Based on this observation we compute the effective $\alpha$ for all subtrees (nodes) and remove the subtrees with effective $\alpha$ below some predefined value.

Pruning versus penalization

The formula for CCM is very similar to the formulas used for regularizing loss function based learning methods (linear regression, ANNs). For loss function based methods we looked for minimizers of \begin{equation*} \text{loss(predictions, targets)}+\alpha\cdot\text{penalty}, \end{equation*} where $\alpha$ is fixed. Following this idea, in regression tree learning we could ask for a tree minimizing \begin{equation*} \text{loss(predictions, targets)}+\alpha\cdot\text{number of leaves} \end{equation*} over the set of all regression trees. Formulating such an optimization problem is not hard, but how do we solve it? The objective is not differentiable and the search space (the set of all trees) is extremely large.

Cost-complexity pruning solves this optimization problem over a much smaller search space. The pruned tree minimizes the objective over the set of all trees which can be obtained from the unpruned tree by removing nodes. The regularization parameter $\alpha$ in the objective acts as the lower bound for the effective $\alpha$ values of the subtrees that are kept. In this sense cost-complexity pruning fits well into the usual regularization framework.

Regression trees with Scikit-Learn

Scikit-Learn implements regression trees in DecisionTreeRegressor.

DecisionTreeRegressor takes several parameters for stopping growth of the tree and also supports cost-complexity pruning. For the latter the ccp_alpha parameter has to be specified; subtrees with effective $\alpha$ below this value will be removed. Scikit-Learn's default values for the stopping criteria lead to trees with one training sample per leaf, that is, to maximum complexity.
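A short usage sketch with synthetic 1d data (parameter values chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# with default parameters the tree grows until each leaf holds one training sample
full_tree = DecisionTreeRegressor().fit(X, y)
print(full_tree.get_n_leaves())

# stopping criteria and cost-complexity pruning limit the tree's size
small_tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5,
                                   ccp_alpha=0.001).fit(X, y)
print(small_tree.get_n_leaves())
```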

To find good splits, Scikit-Learn uses variance reduction by default, but other techniques are available (parameter criterion). Further, instead of considering all possible splits, we may reduce computation time by considering fewer features or fewer splitting points (parameters splitter and max_features).

Regression trees always yield piecewise constant hypotheses. The number of plateaus corresponds to the number of leaves.
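This behavior can be observed by evaluating a fitted tree on a fine grid; the following sketch (again with synthetic 1d data) counts the distinct prediction values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

grid = np.linspace(0, 2 * np.pi, 1000).reshape(-1, 1)
predictions = tree.predict(grid)

# a tree with at most 2**3 = 8 leaves yields at most 8 distinct prediction values
print(tree.get_n_leaves(), np.unique(predictions).size)
```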

Scikit-Learn provides the plot_tree function to visualize decision trees.
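A minimal usage sketch (Matplotlib is required for plotting; data and parameters are invented):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(50, 1))
y = np.sin(X[:, 0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 4))
plot_tree(tree, feature_names=['x'], filled=True, ax=ax)   # draws nodes with split conditions
plt.show()
```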