Decision tree
Introduction
Decision trees are widely used in machine learning for classification and regression tasks. They operate by recursively partitioning the data based on the most informative features until a stopping criterion is met. The resulting model can be visualized as a tree-like structure in which each internal node asks a question about a feature and each leaf assigns a label.
Consider a dataset \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\), where \(\mathbf{x}_i \in \mathbb{R}^d\), and let \(\{y_1, \ldots, y_n\}\) be the corresponding labels, where \(y_i \in \{0, 1\}\). The decision tree algorithm takes this labeled dataset as input and outputs a constructed decision tree.
Prediction: Given a test sample \(\mathbf{x}_{\text{test}}\), we traverse the decision tree from the root until we reach a leaf node, and the label stored at that leaf is returned as \(y_{\text{test}}\).
The following diagram depicts a decision tree:
Here, a question refers to a (feature, value) pair. For example, \(\text{height} \le 180\,\text{cm}\)?
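To make the prediction procedure concrete, here is a minimal Python sketch of routing a test sample down a tree; the `Node` class and the field names `feature`, `value`, `left`, `right`, and `label` are illustrative choices, not part of the original description.

```python
class Node:
    """A node of a binary decision tree.

    An internal node stores a question (feature index, threshold value) and
    two children; a leaf node stores only the predicted label.
    """
    def __init__(self, feature=None, value=None, left=None, right=None, label=None):
        self.feature = feature  # index of the feature the question asks about
        self.value = value      # threshold: "is x[feature] <= value?"
        self.left = left        # subtree followed when the answer is "yes"
        self.right = right      # subtree followed when the answer is "no"
        self.label = label      # predicted label (leaf nodes only)

    def is_leaf(self):
        return self.label is not None


def predict(node, x_test):
    """Traverse the tree from the root to a leaf and return the leaf's label."""
    while not node.is_leaf():
        node = node.left if x_test[node.feature] <= node.value else node.right
    return node.label
```

For example, a single-question tree for the height question above could be built as `Node(feature=0, value=180, left=Node(label=0), right=Node(label=1))`; `predict` would then answer that one question and return the corresponding leaf label.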
Goodness of a Question
To evaluate the quality of a question, we need a measure of “impurity” for a set of labels \(\{y_1, \ldots, y_n\}\). Various measures can be employed, but we will use the Entropy function.
The Entropy function is defined as:
\[ \text{Entropy}(\{y_1, \ldots, y_n\}) = \text{Entropy}(p) = -\left( p\log(p)+(1-p)\log(1-p) \right) \]
where \(p\) is the fraction of labels equal to \(1\). By convention, \(0\log(0)\) is treated as \(0\), so a pure set of labels has zero entropy.
Pictorial representation of the Entropy function:
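The definition above translates directly into code. Below is a minimal sketch of the entropy of a set of binary labels; the function name and the choice of base-2 logarithm are my own (any fixed base only rescales the values).

```python
import math

def entropy(labels):
    """Entropy of a list of binary labels (0/1), with the 0*log(0) = 0 convention."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of samples with label 1
    if p == 0.0 or p == 1.0:     # pure set of labels: zero impurity
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

With base-2 logarithms, `entropy([0, 0, 1, 1])` gives \(1.0\), the maximum for binary labels, while `entropy([1, 1, 1, 1])` gives \(0.0\).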
Information gain is then used to measure the quality of a split. It quantifies the reduction in entropy (impurity) of a dataset after the dataset is split according to a given question; a question with high information gain separates the classes well and therefore leads to more accurate predictions.
Information gain is calculated as:
\[ \text{Information Gain}(\text{feature}, \text{value}) = \text{Entropy}(D) - \left[\gamma \cdot \text{Entropy}(D_{\text{yes}}) + (1-\gamma) \cdot \text{Entropy}(D_{\text{no}}) \right] \]
where \(D\) is the set of labels at the current node, \(D_{\text{yes}}\) and \(D_{\text{no}}\) are the subsets of \(D\) for which the answer to the question is yes and no, respectively, and \(\gamma\) is the fraction of samples that answer yes:
\[ \gamma = \frac{|D_{\text{yes}}|}{|D|} \]
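Combining the two formulas, here is a sketch of how the information gain of a single (feature, value) question could be computed, reusing the `entropy` helper above; the argument names are illustrative.

```python
def information_gain(X, y, feature, value):
    """Information gain of splitting dataset (X, y) on the question x[feature] <= value."""
    D_yes = [yi for xi, yi in zip(X, y) if xi[feature] <= value]  # labels answering "yes"
    D_no  = [yi for xi, yi in zip(X, y) if xi[feature] >  value]  # labels answering "no"
    if not D_yes or not D_no:     # degenerate question: nothing is separated
        return 0.0
    gamma = len(D_yes) / len(y)   # fraction of samples answering "yes"
    return entropy(y) - (gamma * entropy(D_yes) + (1 - gamma) * entropy(D_no))
```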
Decision Tree Algorithm
The decision tree algorithm follows these steps (a code sketch is given after the list):
- For each feature, generate candidate questions by discretizing its observed range \([\min, \max]\) into a set of threshold values.
- Select the question that provides the highest information gain.
- Repeat the procedure for subsets \(D_{\text{yes}}\) and \(D_{\text{no}}\).
- Stop growing the tree when a node becomes sufficiently “pure” according to a predefined criterion.
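A minimal recursive sketch of this procedure, reusing the `Node`, `entropy`, and `information_gain` helpers from above. As a simplification, candidate thresholds are taken directly from the observed feature values rather than from a discretized \([\min, \max]\) grid, and a node is treated as pure when its entropy is zero.

```python
def grow_tree(X, y):
    """Greedily grow a decision tree by maximizing information gain at each node."""
    if entropy(y) == 0.0:                     # node is pure: make a leaf
        return Node(label=y[0])

    # Search all candidate (feature, value) questions for the highest gain.
    best_gain, best_question = 0.0, None
    for feature in range(len(X[0])):
        for value in {xi[feature] for xi in X}:
            gain = information_gain(X, y, feature, value)
            if gain > best_gain:
                best_gain, best_question = gain, (feature, value)

    if best_question is None:                 # no question helps: majority-label leaf
        return Node(label=max(set(y), key=y.count))

    feature, value = best_question
    yes = [(xi, yi) for xi, yi in zip(X, y) if xi[feature] <= value]
    no  = [(xi, yi) for xi, yi in zip(X, y) if xi[feature] >  value]
    return Node(feature=feature, value=value,
                left=grow_tree([xi for xi, _ in yes], [yi for _, yi in yes]),
                right=grow_tree([xi for xi, _ in no], [yi for _, yi in no]))
```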
Different measures, such as the Gini Index, can also be employed to evaluate the quality of a question.
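For instance, a sketch of the Gini index for binary labels, which could replace `entropy` in the gain computation above; the closed form \(2p(1-p)\) follows from \(1 - p^2 - (1-p)^2\).

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0/1): 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of samples with label 1
    return 2.0 * p * (1.0 - p)
```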
Pictorial depiction of the decision boundary and its decision tree: