This is Part 2. of my decision tree series. Here we will see how we can build a decision tree algorithmically using Leo Breiman’s (One of the big, big names in decision trees) CART algorithm.
Ok, so there’s going to be a little maths in this post, so for those who are not sure what all the notations mean let’s go over them. (if you already know all this feel free to skip ahead 😄)
Anything in between brackets \({}\) is a set of objects (generally numbers).
Sums are represented by the \(\Sigma\) symbol (read as “sigma”). The value under the sigma is the start point, and the value above is the end point of the sum. So :
$$ \sum^3_{i=1} i^2 = 1^2 + 2^2 + 3^2 = 14 $$
If you have any more questions about mathematical notation I would suggest you consult this Wikipedia page which is quite comprehensive.
For a classification tree \(Y_i \in {1,2,\cdots,k}\), where \(k\) is the number of possible classes.
On the other hand for a regression tree \(Y_i \in \mathbb{R}\) (with \(\mathbb{R}\) the set of real numbers)
(NB. in any post I might use “variable” and “feature” interchangeably because they essentially mean the same thing)
In decision trees nodes represent a split in data, in this post I will usually call splits \(S\) and represent them as the 2 opposite sets \(\mathcal{A}\) and \(\overline{\mathcal{A}}\), where \(\mathcal{A}\) is the set that goes to the left branch of the split and \(\overline{\mathcal{A}}\) the set that goes to the right branch of the split. For example, the first split of part 1’s simple tree, that has the condition \(x_3 \leq 1.9\) will be noted as:
$$ S = {x_3 \leq 1.9},\ {x_3 > 1.9} $$
CART stands for Classification And Regression Trees, it is an algorithm developed by Leo Breiman et. al in 1984 designed for inferring decision trees. It relies on evaluating possible splits at each node in our tree and choosing the best one.
To infer a tree we need to choose the best possible split at each node, and to do that we need to know what those splits are. As we saw in part 1 a split is only ever done on one feature
There are two types of features, hence two types of splits:
(NB. there can be ordered categorical values, like grades for example \(A > B > C > D\) but they can be assimilated to numerical values which is why I didn’t mention them in the list above)
For a given node there are \(n\) data points. Let us consider the numerical explanatory feature \(X_{num}\). In this case we have a maximum number of splits \(n-1\) corresponding to all the splits:
$$ S = { X_{num} \leq x_i},{X_{num} > x_i}\quad\quad i \in {1,\cdots,n-1} $$
We stop at \(n-1\) because, for the maximum value of \(X_{num}\) all points are smaller or equal to it, meaning we send all of our data points to one side of the split and leave the other side empty. This of course is not split, so we don’t count that possibility.
Let us consider now the categorical feature \(X_{cat}\), which has \(k\) possible levels. the possible number of splits is \(2^{k-1} - 1\). A given split can be defined as:
$$ S = {X_{cat} \in \mathcal{A}},{X_{cat} \in \mathcal{\overline{A}}}\quad\quad \mathcal{A}\subset{1,\cdots,k} $$
How many subsets \(\mathcal{A}\) are there?
Let’s consider the case of a feature called \(C\) with 3 levels:
$$ C \in {red,\ blue,\ green} $$
To create a subset we must choose which values are in or not the subset. For example the subset \(\mathcal{A} = {red,\ green}\) what we are saying is \(red\in \mathcal{A}\) and \(green\in \mathcal{A}\) and \(blue\notin \mathcal{A}\). So each subset is a set of 3 values indicating the presence or absence of a given level of \(C\) in the subset. With this we can easily calculate the number of possible subsets.
For the first value (presence or absence of \(red\) in the subset) there are \(2\) possible options, for the second value we also have \(2\) potential values and the same for the third values.
So our total number of possible subsets is:
$$ N_{sets}=2\times2\times2 = 2^3 $$
And if we have \(k\) possible levels, our number of possible subsets becomes:
$$ N_{sets}=2^k $$
but further up it was \(2^{k-1} -1\) why?
Well this comes from the fact that we are looking for splits, not just for subsets. And splits are symmetrical. For our example above, if we have:
It is easy to see that \(S_1 = S_2\), that symmetry is why on out number of possible splits we have \(2^{k-1}\) instead of simply \(2^k\). The explanation for why it is \(2^{k-1}-1\) and not \(2^{k-1}\) is the same as for numerical features, because if \(\mathcal{A}={red,\ blue,\ green}\) then Our splits has all the points one one side and and empty set on the other, so it is not a split, so we remove that possibility and that’s how we end up with \(2^{k-1}-1\) (thats also why we used \(\mathcal{A} \subset {1,\cdots,k}\) instead of \(\mathcal{A} \subseteq {1,\cdots,k}\) when we defined our categorical splits ).
So now we have all of the possible splits in our data, but how do we choose the best one?
Since we want to use the tree to predict either a class or a value, we want the leaves of the tree to be as “pure” as possible, meaning we want the examples in each leaf to be similar. To get that we need a way to measure the “purity” of a node, so how similar all data points are in that node.
Therefore, to chose the best split, we choose the one that maximizes this “purity” measure in the child nodes. In practice we don’t measure “purity” but rather “impurity”. There are several of these measures. In this whole section let’s consider a node \(t\), and for an impurity measure \(i\) we can define \(i(t)\) as the impurity of this node.
The Gini index is a way to measure impurity in classification trees. Let \(p_i\) be the probability of having class \(i\) in our node, and \(k\) the number of classes in the node. The Gini index \(G\) is:
$$ G(t) = 1 - \sum^k_{i=1} p_i^2 $$
Since we don’t know the real probability \(p_i\) for a given class, we just approximate it by the frequency of said class in the node:
$$ p_i = \frac{n_i}{n} $$
with \(n_i\) the number of examples of class \(i\) in the considered node and \(n\) the total number of examples in said node.
Entropy is also an impurity measure that is commonly used in CART with classification trees. If we use the same notation as for the Gini index we can define the Entropy of a node, \(E\), as:
$$ E(t) = \sum^k_{i=1} p_i \log(p_i) $$
usually the base \(2\) logarithm is used.
The Residual Sum of Squares (RSS) is used in regression trees, where examples have an outcome value instead of an outcome class. For a given node, let \(n\) be the number of examples at that node, \(y_i\) the outcome value of the \(i^{th}\) example, and \(\mu\) the mean outcome values of all examples of the node. We can define :
$$ RSS(t) = \sum^n_{i=1} (y_i - \mu)^2 $$
To then choose the optimal split we choose the one that leads to the maximal decrease in impurity (which can be either the Gini index or the entropy for classification trees or the RSS for regression trees). So if we are in a node \(t\) and a given split defines nodes \(L\) and \(R\) the left and right child nodes of \(t\) respectively. We can define the decrease in impurity \(\Delta i\) as:
$$ \Delta i = i(t) - p_L\cdot i(L) - p_R\cdot i(R) $$
with \(i(L)\) and \(i(R)\) the impurities of the child nodes and \(p_L\) and \(p_R\) the probabilities of a sample from node \(t\) will go to node \(L\) or node \(R\), again we equate this to the proportions of cases that go to nodes \(R\) and \(L\) so \(p_L = \frac{n_L}{n_t}\) (with \(n_L\) and \(n_t\) the number of cases in nodes \(L\) and \(t\))
The basic algorithm is actually quite simple:
We can implement it in a recursive manner thusly:
|
|
This is only a part of the algorithm, it results in a tree that grows until all leaves are pure (meaning that they contain examples of only one class), and the CART algorithm restricts that so we can have a more generalizable tree (no overfitting).
I hope you learned something on how to build decision trees, and go check out part 3 to see a Python implementation of this algorithm.