Basics#

Decision trees, also known as classification and regression trees (CART), are a class of machine learning techniques based on tree data structures.

Decision Tree Structure#

A tree is an abstract structure made of nodes and edges. Nodes are connected by edges. Each node has exactly one parent node and may have several child nodes. There is one node without parent node, called the root node. Nodes without children are leaves. A subtree is a node together with all its descendants (children, grandchildren and so on).

Fig. 46 A tree consists of nodes and edges. Some nodes are special, like the root node and the leaves.#

In a decision tree each node represents a condition on a feature in a learning task. A feature vector is passed through the tree starting at the root and finally arriving at one of the tree’s leaves, which represent the predictions. At each node the corresponding condition is evaluated for the concrete feature vector. Based on the result, the feature vector is passed on to one of the node’s children. Evaluating a node may have several possible outcomes, but often conditions are either satisfied or not, yielding a binary tree (two children per node).

Fig. 47 Each node in a decision tree, which is not a leaf, represents a condition on the samples passed down the tree.#
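
To make the traversal concrete, here is a minimal sketch of the prediction phase for a binary tree. The dict-based node layout (keys feature, threshold, left, right for internal nodes and prediction for leaves) is an illustrative assumption, not a fixed API, and is reused in the later sketches.

```python
def predict(node, x):
    """Pass a single feature vector x through the tree down to a leaf."""
    while "prediction" not in node:              # internal node: evaluate its condition
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]                  # condition satisfied
        else:
            node = node["right"]                 # condition not satisfied
    return node["prediction"]                    # leaf reached


# Example: a tree of depth 2 applied to a feature vector with two components.
tree = {"feature": 0, "threshold": 2.5,
        "left": {"prediction": 0},
        "right": {"feature": 1, "threshold": 1.0,
                  "left": {"prediction": 1},
                  "right": {"prediction": 0}}}

print(predict(tree, [3.0, 0.5]))   # 1
```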

The training phase consists of building a decision tree and the prediction phase consists of passing feature vectors through the tree. Prediction is fast and simple, but for training we have to answer difficult questions:

  • Which features should we consider in the nodes? Which one first?

  • Which conditions should we check on the features?

  • How large should the tree be?

Major advantages of decision trees:

  • They can be applied to arbitrary data types, including categorical data.

  • They not only yield predictions but also a list of human-readable decisions leading to that prediction (see the sketch below).
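
As a rough illustration of the second advantage, the sketch below records every evaluated condition while a feature vector is passed down the tree. It reuses the dict-based layout from above; the feature names and the tiny example tree are made up for illustration.

```python
def predict_with_explanation(node, x, feature_names):
    """Pass x through the tree and record every evaluated condition."""
    decisions = []
    while "prediction" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if x[feature] <= threshold:
            decisions.append(f"{feature_names[feature]} <= {threshold}")
            node = node["left"]
        else:
            decisions.append(f"{feature_names[feature]} > {threshold}")
            node = node["right"]
    return node["prediction"], decisions


tree = {"feature": 0, "threshold": 30,
        "left": {"prediction": "low risk"},
        "right": {"feature": 1, "threshold": 2,
                  "left": {"prediction": "low risk"},
                  "right": {"prediction": "high risk"}}}

label, path = predict_with_explanation(tree, [45, 3], ["age", "prior claims"])
print(label)                 # high risk
print(" and ".join(path))    # age > 30 and prior claims > 2
```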

Training#

Training a decision tree is a relatively complex task. We start with general remarks and then provide concrete algorithms in subsequent sections.

Growing a Tree#

There exist many techniques to grow decision trees. The overall procedure is as follows:

  1. Start with a tree containing only the root node.

  2. Select one of the features and a condition involving only the selected feature.

  3. Split the training data set according to the condition into disjoint subsets.

  4. For each subset create a child node.

  5. Process each child node in the same way as the root node (that is, go to 2), but with the full data set replaced by the subset corresponding to the child node.

This splitting procedure is repeated until all leaves satisfy some stopping criterion. Common stopping criteria are:

  • variance in the leaf is small,

  • only a few samples correspond to the leaf,

  • a predefined depth of the tree is reached,

  • the maximum number of leaves is reached.

After stopping the growth process, each leaf corresponds to a small set of training samples (the ones satisfying all conditions on the path to the leaf). The prediction corresponding to a leaf is

  • the mean of all targets of the samples in that set in case of regression,

  • the class most samples in that set belong to in case of classification.
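
As a rough illustration of the whole growing procedure, the following sketch builds a small regression tree with binary splits on numerical features. The split is chosen by weighted variance reduction here, which is only one of many possible criteria (concrete criteria are discussed in subsequent sections); function and parameter names such as grow_tree, max_depth and min_samples are illustrative assumptions.

```python
import numpy as np


def grow_tree(X, y, depth=0, max_depth=3, min_samples=5):
    # Stopping criteria: maximum depth reached, few samples, or small variance.
    if depth >= max_depth or len(y) < min_samples or np.var(y) < 1e-12:
        return {"prediction": float(np.mean(y))}   # leaf: mean target (regression)

    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature])[:-1]:
            mask = X[:, feature] <= threshold
            # weighted variance of the two disjoint subsets (smaller is better)
            score = mask.sum() * np.var(y[mask]) + (~mask).sum() * np.var(y[~mask])
            if best is None or score < best[0]:
                best = (score, feature, threshold, mask)

    if best is None:                               # no admissible split exists
        return {"prediction": float(np.mean(y))}

    _, feature, threshold, mask = best
    return {"feature": feature, "threshold": threshold,
            "left": grow_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}


# Example: fit a small regression tree to noisy samples of a step function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] > 5).astype(float) + rng.normal(0, 0.1, size=200)
tree = grow_tree(X, y)   # nested dicts as in the prediction sketch above
```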

For numerical features conditions are formulated as a single inequality, so the feature’s range is split into two disjoint intervals. Since we only have finitely many samples, there are at most as many sensible splitting points as we have samples.
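
The following lines illustrate this observation: only thresholds lying between consecutive distinct sample values lead to different splits, so candidates can be enumerated, for instance as the midpoints of these gaps (the midpoint choice is a common convention, not the only one).

```python
import numpy as np

values = np.array([3.1, 1.2, 4.8, 1.2, 2.7])      # observed values of one feature
distinct = np.unique(values)                       # sorted distinct values
thresholds = (distinct[:-1] + distinct[1:]) / 2    # one candidate threshold per gap
print(thresholds)                                  # [1.95 2.9  3.95]
```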

For categorical features with few categories, splitting into as many child nodes as there are categories is feasible. Otherwise, some condition with a binary result should be considered. There are at most \(2^{\text{number of categories}}\) different conditions with binary result for a categorical feature.
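
A small sketch of this counting argument: every subset of the categories defines a condition of the form “feature value lies in the subset”, which yields \(2^{\text{number of categories}}\) conditions (many of them redundant, since a subset and its complement induce the same split).

```python
from itertools import combinations

categories = ["red", "green", "blue"]

# every subset S of the categories defines a condition "feature value in S"
conditions = [set(subset)
              for r in range(len(categories) + 1)
              for subset in combinations(categories, r)]

print(len(conditions))   # 8 == 2 ** 3
```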

Choosing features and conditions is the hard part. There exist many techniques to do this. Some prominent ones will be considered below.

Pruning#

Small trees aren’t able to represent complex hypotheses. Large trees tend to overfit training data. Thus, growth of trees has to be stopped at the right moment by some stopping criterion (see above). A more complex regularization technique is pruning. Here we grow a very complex tree, which overfits the training data, and then remove some nodes together with all their descendants. Removing a node means that we replace it by a leaf as if splitting had never happened. We try to remove nodes which can be removed without affecting prediction accuracy on a validation data set too much. Concrete pruning algorithms will be considered below.
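
A hedged sketch of the idea, in the spirit of reduced-error pruning: an internal node is tentatively replaced by a leaf, and the replacement is kept only if accuracy on a validation set does not decrease. The dict-based tree layout follows the earlier sketches; the additional key majority (the majority class of the training samples reaching a node) is an illustrative assumption that would have to be stored while growing the tree.

```python
import numpy as np


def predict_one(node, x):
    while "prediction" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]


def accuracy(tree, X_val, y_val):
    return np.mean([predict_one(tree, x) == y for x, y in zip(X_val, y_val)])


def prune(node, tree, X_val, y_val):
    """Bottom-up: try to turn each internal node into a leaf and keep the
    change only if validation accuracy of the whole tree does not decrease."""
    if "prediction" in node:
        return
    prune(node["left"], tree, X_val, y_val)
    prune(node["right"], tree, X_val, y_val)

    before = accuracy(tree, X_val, y_val)
    saved = dict(node)                      # remember the split
    node.clear()
    node["prediction"] = saved["majority"]  # replace the node by a leaf
    if accuracy(tree, X_val, y_val) < before:
        node.clear()
        node.update(saved)                  # the split was useful, restore it


# Tiny example: the split inside the right subtree does not help on this
# validation set, so that node is replaced by a leaf.
tree = {"feature": 0, "threshold": 5, "majority": 0,
        "left": {"prediction": 0},
        "right": {"feature": 0, "threshold": 7, "majority": 1,
                  "left": {"prediction": 0},
                  "right": {"prediction": 1}}}
X_val = [[2], [6], [8]]
y_val = [0, 1, 1]
prune(tree, tree, X_val, y_val)
```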