Classification Trees#

Classification trees work much the same as regression trees, but there are several standard splitting rules for growing a tree.

Splitting Strategies#

All splitting rules described here define an error measure and choose the split that minimizes this error.

Splitting by Misclassification Rate#

For each possible split we determine the number of misclassified samples in each resulting leaf. That is, we count all samples not belonging to their leaf’s majority class (the leaf’s prediction). The lower the resulting number, the better the split.
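
As a minimal sketch, the counting might look as follows in Python; the helper misclassified and the toy leaves are our own illustration, not part of any library:

```python
from collections import Counter

def misclassified(labels):
    """Number of samples in a leaf that do not belong to the leaf's majority class."""
    counts = Counter(labels)
    return len(labels) - max(counts.values())

# a candidate split produces two leaves; the split's error is the total count
left = ["A", "A", "B"]           # majority A, 1 misclassified
right = ["B", "B", "B", "A"]     # majority B, 1 misclassified
print(misclassified(left) + misclassified(right))   # 2
```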

Note that sometimes different splits have identical misclassification rates, even though manual inspection would clearly favor one of them.

Example

Consider 200 samples, 100 belonging to class A, 100 belonging to class B, and two splits

\[\begin{equation*} \text{25 A / 75 B}\quad\text{and}\quad\text{75 A / 25 B},\qquad\qquad\text{50 A / 100 B}\quad\text{and}\quad\text{50 A / 0 B}. \end{equation*}\]

Both splits misclassify 50 samples, but the second one has a pure leaf and thus should be preferred. For the first split we would need at least two further splitting steps to make all leaves pure; for the second split one further split might suffice.

Another issue with splitting by misclassification is that there might be no split that decreases the misclassification count, which suggests stopping the growth process to avoid overfitting. But other measures (see below) may justify further splitting.

Example

Consider three samples with feature values \(x_1=1\), \(x_2=2\), \(x_3=3\) belonging to classes A, B, A, respectively. Then the original node and the two possible splits

\[\begin{equation*} \text{A}\quad\text{and}\quad\text{BA},\qquad\qquad\text{AB}\quad\text{and}\quad\text{A} \end{equation*}\]

each have exactly one misclassified sample. Thus, splitting does not reduce misclassification.

Splitting by Gini Impurity#

Gini impurity of a leaf is the probability that two samples randomly chosen (with replacement) from the subset corresponding to the leaf belong to different classes. Gini impurity of a split is the weighted sum of all new leaves’ Gini impurities, with weights proportional to leaf size (samples in the leaf divided by samples in the parent).

Given \(C\) classes, consider a leaf with \(n\) samples: \(n_1\) samples from class 1, \(n_2\) samples from class 2, and so on. Then the Gini impurity of the leaf is

\[\begin{equation*} \sum_{i=1}^C\frac{n_i}{n}\,\left(1-\frac{n_i}{n}\right) =\sum_{i=1}^C\left(\frac{n_i}{n}-\left(\frac{n_i}{n}\right)^2\right) =\sum_{i=1}^C\frac{n_i}{n}-\sum_{i=1}^C\left(\frac{n_i}{n}\right)^2 =1-\sum_{i=1}^C\left(\frac{n_i}{n}\right)^2. \end{equation*}\]

Gini impurity of a pure leaf is 0. Gini impurity of a leaf with the same number of samples from all classes is \(1-\frac{1}{C}\). In particular, Gini impurity is always below one. The lower the Gini impurity, the better the split.
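
As a small illustration, the formula translates into Python as sketched below; the helper names gini_leaf and gini_split are our own:

```python
def gini_leaf(class_counts):
    """Gini impurity of a leaf given the number of samples per class."""
    n = sum(class_counts)
    return 1 - sum((n_i / n) ** 2 for n_i in class_counts)

def gini_split(leaves):
    """Weighted Gini impurity of a split; leaves is a list of per-class count lists."""
    n = sum(sum(counts) for counts in leaves)
    return sum(sum(counts) / n * gini_leaf(counts) for counts in leaves)

print(gini_leaf([10, 0]))            # pure leaf: 0.0
print(gini_leaf([5, 5]))             # balanced two-class leaf: 0.5
print(gini_split([[5, 0], [5, 5]]))  # weighted: 5/15 * 0 + 10/15 * 0.5 = 1/3
```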

Example

Consider the 200 samples from above again, 100 belonging to class A, 100 belonging to class B, and two splits

\[\begin{equation*} \text{25 A / 75 B}\quad\text{and}\quad\text{75 A / 25 B}\qquad\qquad\text{50 A / 100 B}\quad\text{and}\quad\text{50 A / 0 B}. \end{equation*}\]

The Gini impurity of the first split is

\[\begin{equation*} \frac{100}{200}\,\left(1-\left(\frac{25}{100}\right)^2-\left(\frac{75}{100}\right)^2\right) +\frac{100}{200}\,\left(1-\left(\frac{75}{100}\right)^2-\left(\frac{25}{100}\right)^2\right) =0.3750. \end{equation*}\]

For the second split we have

\[\begin{equation*} \frac{150}{200}\,\left(1-\left(\frac{50}{150}\right)^2-\left(\frac{100}{150}\right)^2\right) +\frac{50}{200}\,\left(1-\left(\frac{50}{50}\right)^2-\left(\frac{0}{50}\right)^2\right) =0.3333. \end{equation*}\]

Looking at Gini impurity, we would choose the second split.
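
The two values can be reproduced with a few lines of plain Python arithmetic:

```python
# Gini impurity of the two candidate splits from the example above
split_1 = 100/200 * (1 - 0.25**2 - 0.75**2) + 100/200 * (1 - 0.75**2 - 0.25**2)
split_2 = 150/200 * (1 - (50/150)**2 - (100/150)**2) + 50/200 * (1 - 1**2 - 0**2)
print(round(split_1, 4), round(split_2, 4))   # 0.375 0.3333
```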

Example

Consider the three samples from above again belonging to classes A, B, A. Then the original node has Gini impurity

\[\begin{equation*} 1-\left(\frac{1}{3}\right)^2-\left(\frac{2}{3}\right)^2=\frac{4}{9} \end{equation*}\]

and the two possible splits

\[\begin{equation*} \text{A}\quad\text{and}\quad\text{BA},\qquad\qquad\text{AB}\quad\text{and}\quad\text{A} \end{equation*}\]

each have Gini impurity

\[\begin{equation*} \frac{1}{3}\cdot 0+\frac{2}{3}\,\left(1-\left(\frac{1}{2}\right)^2-\left(\frac{1}{2}\right)^2\right)=\frac{1}{3}. \end{equation*}\]

Thus, both splits reduce Gini impurity whereas the number of misclassifications remains the same.

Splitting by Entropy#

Entropy is an alternative to Gini impurity and behaves very similarly. The entropy of a leaf is

\[\begin{equation*} -\sum_{i=1}^C\frac{n_i}{n}\,\log\frac{n_i}{n}\qquad\text{with}\quad 0\cdot\log 0:=0. \end{equation*}\]
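
A minimal sketch of this formula in Python (the function name entropy_leaf is our own; the convention \(0\cdot\log 0:=0\) is implemented by skipping empty classes):

```python
from math import log

def entropy_leaf(class_counts):
    """Entropy (in nats) of a leaf given the number of samples per class."""
    n = sum(class_counts)
    # the convention 0 * log 0 := 0 is implemented by skipping empty classes
    return sum(-(n_i / n) * log(n_i / n) for n_i in class_counts if n_i > 0)

print(entropy_leaf([10, 0]))   # pure leaf: 0.0
print(entropy_leaf([5, 5]))    # two equally frequent classes: log(2) = 0.693...
```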

Entropy is a concept from information theory, motivated by entropy in physics, and is a way to quantify information. Many data scientists use the term, but only a few understand it, so we spend some time explaining the ideas behind it.

Probabilities versus information: Consider a bag of colored balls. We do not know the exact number of balls of each color, but we know that on average (over many bags with colored balls) 50 percent of the balls are red, 30 percent are green, 15 percent are blue, and 5 percent are yellow. If we randomly take one ball out of the bag, we may ask: what do we learn from this one ball about the contents of the bag? If the ball is red, then we know that there are red balls in the bag. That’s not surprising, since we knew that on average half the balls are red; we did not learn anything really new about the contents of the bag. But if the ball is yellow, then we know that there is at least one yellow ball in the bag. The probability for yellow was only 5 percent, so it would not have been unlikely that the bag contained no yellow ball at all. Hence, from finding a yellow ball we obtain much more new information about the bag’s contents than from finding a red ball. We may state this observation as follows: the less likely an observed event is, the more information it contains.

Measuring information: To express the information content of an event, we transform the event’s probability by a function satisfying the following criteria.

  • Information is nonnegative.

  • Information is 0 if and only if probability is 1.

  • The higher the probability, the lower the information obtained from observing the event (monotonicity).

  • Information obtained from observing two independent events is the sum of information obtained from each of the two events.

The only function satisfying all four requirements is the negative logarithm (to an arbitrary base). So we define the amount of information obtained from observing an event as the negative logarithm of the event’s probability. If \(p\) is the event’s probability, then the corresponding information is

\[\begin{equation*} -\log p. \end{equation*}\]

Choosing a concrete base just scales the measure, because

\[\begin{equation*} \log_a p=(\log_a b)\,\log_b p. \end{equation*}\]

For base 2 the unit for information is bits, for base \(\mathrm{e}\) it’s nats, for base 10 it’s dits or bans.

In the above example, finding a red ball yields 0.69 nats of information, while finding a yellow ball yields 3.00 nats.

Entropy: Entropy is defined as the mean information content. It’s the weighted sum of the information of all events, with the probabilities as weights, so more likely events have higher weight. In the above example the bag’s entropy is

\[\begin{equation*} -0.5\,\log 0.5-0.3\,\log 0.3-0.15\,\log 0.15-0.05\,\log 0.05=1.14. \end{equation*}\]
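
A few lines of Python confirm these numbers (natural logarithm, hence nats):

```python
from math import log

probs = {"red": 0.5, "green": 0.3, "blue": 0.15, "yellow": 0.05}

# information content (in nats) of observing each color
info = {color: -log(p) for color, p in probs.items()}
print(round(info["red"], 2), round(info["yellow"], 2))   # 0.69 3.0

# entropy is the mean information content, weighted by the probabilities
entropy = sum(p * -log(p) for p in probs.values())
print(round(entropy, 2))                                  # 1.14
```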

Entropy measures disorder. The highest form of order is that only one event can occur (with probability 1); then entropy is 0. The highest form of disorder is that all events are equally likely; with \(n\) events, entropy then is \(\log n\). The more equally likely events there are, the higher the disorder.

Entropy for classification: In classification contexts we have one event per class. Analogously to the colored balls example above, we randomly pick one sample out of a leaf and look at its label; the probabilities are the relative class counts. The entropy of a split is, as for Gini impurity, the weighted sum of the new leaves’ entropies, and we choose the split with the lowest entropy. Put the other way round, we choose the split with the greatest decrease in the information obtainable from looking at concrete samples. The lowest information is reached if all samples in a leaf have the same label, so that we cannot learn anything new from looking at a specific sample.
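
As a quick check in plain Python, the three-sample example from the misclassification section shows entropy decreasing even though the misclassification count stays the same:

```python
from math import log

# labels A, B, A: entropy of the parent node vs. the split  A | BA  (in nats)
parent = -(2/3) * log(2/3) - (1/3) * log(1/3)
split = 1/3 * 0 + 2/3 * (-(1/2) * log(1/2) - (1/2) * log(1/2))
print(round(parent, 2), round(split, 2))   # 0.64 0.46 -> entropy decreases
```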

Pruning#

Fully grown trees should be pruned to avoid overfitting. Reduced error pruning is a simple and fast pruning technique for classification trees. One by one, beginning at the leaves, subtrees are replaced by leaves and the resulting misclassification rate on validation data is calculated. If the misclassification rate does not increase, the change is kept. This way we obtain a tree that cannot be improved by removing further subtrees.
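
The following rough sketch illustrates the idea, assuming a toy tree representation as nested dicts in which every node stores the majority class of its training samples under prediction and internal nodes additionally store feature, threshold, left, and right; the structure and helper names are our own and not a Scikit-Learn API:

```python
def predict(tree, x):
    """Predict the class of a single sample x (a sequence of feature values)."""
    if "left" not in tree:                       # leaf node
        return tree["prediction"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict(tree["left"], x)
    return predict(tree["right"], x)

def error(tree, X_val, y_val):
    """Misclassification count of a (sub)tree on validation data."""
    return sum(predict(tree, x) != y for x, y in zip(X_val, y_val))

def prune(tree, X_val, y_val):
    """Reduced error pruning: bottom-up, replace a subtree by a leaf whenever
    the misclassification count on the validation samples reaching the node
    does not increase."""
    if "left" in tree:                           # internal node
        go_left = [x[tree["feature"]] <= tree["threshold"] for x in X_val]
        X_l = [x for x, g in zip(X_val, go_left) if g]
        y_l = [y for y, g in zip(y_val, go_left) if g]
        X_r = [x for x, g in zip(X_val, go_left) if not g]
        y_r = [y for y, g in zip(y_val, go_left) if not g]
        prune(tree["left"], X_l, y_l)
        prune(tree["right"], X_r, y_r)
        leaf = {"prediction": tree["prediction"]}   # majority class of the node
        if error(leaf, X_val, y_val) <= error(tree, X_val, y_val):
            tree.clear()                            # replace the subtree by the leaf
            tree.update(leaf)
    return tree
```

Passing only the validation samples that reach a node down the recursion keeps the comparison local to the subtree under consideration.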

An alternative is cost-complexity pruning based on some classification error measure (misclassification rate, Gini impurity, entropy, …). See Regression Trees for details.

Classification Trees with Scikit-Learn#

Scikit-Learn provides the DecisionTreeClassifier class in its sklearn.tree module.
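
A typical usage sketch with synthetic data; criterion selects Gini impurity or entropy as the splitting measure and ccp_alpha controls cost-complexity pruning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic two-class data set, only for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion may be "gini" (default) or "entropy";
# ccp_alpha > 0 enables cost-complexity pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=5, ccp_alpha=0.01)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on the test set
```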