4 minute read

Components, How it Works & Maximizing Information Gain (1) - Gini Impurity



(Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition)

Decision tree classifiers are highly valued for their ability to provide interpretability. This model operates by breaking down data through a series of decision-making questions. These questions can involve real numbers, allowing the model to ask binary questions such as “Is the sepal width greater than or equal to 2.8 cm?” This approach enables a clear and understandable process for decision-making based on the input data.
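As a quick illustration (a minimal sketch using scikit-learn's built-in Iris dataset, my own example rather than code from the book), fitting a shallow tree and printing its rules shows exactly these threshold-style questions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris data and fit a shallow, easy-to-read tree
iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
clf.fit(iris.data, iris.target)

# Every printed rule is a binary question of the form "feature <= threshold"
print(export_text(clf, feature_names=iris.feature_names))
```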

Components of a Decision Tree:

  1. Root Node: The topmost node in a decision tree. It represents the entire dataset and is the first node to be split into two or more subsets.
  2. Internal Nodes: These nodes represent the dataset’s features and form the decision rules. Each internal node corresponds to a feature, and the tree splits based on the values of these features.
  3. Branches: These are the outcomes of the decision rules at each internal node, leading to the next node.
  4. Leaf Nodes: These terminal nodes represent the final classification or regression outcome. For a classification task, leaf nodes represent class labels, while for regression tasks, they represent continuous values.
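To make these components concrete, the sketch below (my own illustration using scikit-learn's public `tree_` attributes) fits a small tree and labels each node as the root, an internal node, or a leaf; the `children_left`/`children_right` arrays encode the branches leading out of each split:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == t.children_right[node]:  # both -1 => no children
        kind = f"leaf ({t.n_node_samples[node]} training examples)"
    elif node == 0:
        kind = f"root (splits on feature {t.feature[node]} at {t.threshold[node]:.2f})"
    else:
        kind = f"internal (splits on feature {t.feature[node]} at {t.threshold[node]:.2f})"
    print(f"node {node}: {kind}")
```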


When using the decision tree algorithm, the process starts at the tree’s root node by splitting the data based on the feature that yields the highest information gain (IG). This splitting procedure is then applied iteratively at each child node until the leaves contain entirely homogeneous data, meaning that all the training examples at a leaf belong to the same class. In practice, this can result in a very deep tree with many nodes, which increases the risk of overfitting the model to the training data. To address this, it is common practice to prune the tree, for example by limiting its maximum depth, to obtain a more generalized model.
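As a rough sketch of this effect (again using the Iris data; the exact numbers depend on the split and random seed), we can compare a tree grown until every leaf is pure with one whose depth is capped:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Fully grown tree: keeps splitting until every leaf is pure
full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
# Pre-pruned tree: growth stops once the depth limit is reached
pruned = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

for name, model in [("fully grown", full), ("max_depth=3", pruned)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train acc={model.score(X_tr, y_tr):.2f}, test acc={model.score(X_te, y_te):.2f}")
```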


How Decision Trees Work in Detail:

  1. Splitting: The dataset is split into subsets based on an attribute value test. Depending on the nature of the problem (classification or regression), the split is chosen by maximizing information gain, measured with an impurity criterion such as Gini impurity, entropy, or classification error, or by other criteria such as the chi-square statistic or the reduction in mean squared error (a scikit-learn sketch of these options follows after this list).

  2. Stopping Criteria: The splitting continues until a stopping criterion is met. Common stopping criteria include:

    • All the instances in a node belong to the same class (for classification).
    • The maximum depth of the tree is reached.
    • No further improvement can be made by splitting.
    • A minimum number of instances are left in a node.
  3. Pruning: This is the process of removing unnecessary branches from the tree. Pruning helps in reducing the complexity of the model and prevents overfitting. There are two types of pruning:

    • Pre-pruning: Stops the tree from growing by setting conditions to halt the splitting process.
    • Post-pruning: Removes branches from a fully grown tree by evaluating their impact on the performance of the tree.
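In scikit-learn, these ideas map onto constructor parameters of `DecisionTreeClassifier` (a hedged sketch; the specific values below are arbitrary): `criterion` selects the impurity measure, the stopping criteria and pre-pruning correspond to parameters such as `max_depth`, `min_samples_leaf`, and `min_impurity_decrease`, and post-pruning is available through cost-complexity pruning with `ccp_alpha`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Splitting criterion plus pre-pruning (early-stopping) conditions
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",          # impurity measure; "gini" is the default
    max_depth=4,                  # stop when the maximum depth is reached
    min_samples_leaf=5,           # stop if a split would leave too few examples in a node
    min_impurity_decrease=1e-3,   # stop if splitting yields no meaningful improvement
    random_state=0,
).fit(X, y)

# Post-pruning: compute the cost-complexity pruning path of a fully grown tree,
# then refit with a chosen ccp_alpha to remove the weakest branches
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```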

Maximizing IG: Getting the Best Possible Value with the Best Efficiency

To split the nodes on the most informative features, we need to define an objective function that the tree learning algorithm optimizes.

Here is the objective function to maximize the information gain at each split.

$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m}\frac{N_j}{N_p}I(D_j)$

The variable $f$ represents the feature used to perform the split; $D_p$ and $D_j$ are the datasets of the parent and $j^{th}$ child node; $I$ is our impurity measure; $N_p$ is the total number of training examples at the parent node, and $N_j$ is the number of examples in the $j^{th}$ child node. Information gain is the difference between the impurity of the parent node and the weighted sum of the child node impurities. The lower the impurities of the child nodes, the larger the information gain. However, most libraries, including scikit-learn, implement binary decision trees, for simplicity and to reduce the combinatorial search space.

This means that each parent node is split into two child nodes, $D_{left}$ and $D_{right}$:

$IG(D_p, f) = I(D_p)-\frac{N_{left}}{N_p}I(D_{left})-\frac{N_{right}}{N_p}I(D_{right})$
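To make the binary-split formula concrete, here is a small self-contained sketch (plain NumPy, my own illustration rather than the book's code) that computes $IG(D_p, f)$ from the class labels reaching the parent and the two child nodes, using Gini impurity as the measure $I$:

```python
import numpy as np

def gini(y):
    """Gini impurity I_G of the class labels reaching a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right, impurity=gini):
    """IG(D_p, f) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)."""
    n_p = len(y_parent)
    return (impurity(y_parent)
            - len(y_left) / n_p * impurity(y_left)
            - len(y_right) / n_p * impurity(y_right))

# Example: a parent node with 40 examples of each class, split into a pure
# left child (30x class 0) and a mixed right child (10x class 0, 40x class 1)
parent = np.array([0] * 40 + [1] * 40)
left, right = parent[:30], parent[30:]
print(information_gain(parent, left, right))  # 0.5 - 3/8 * 0.0 - 5/8 * 0.32 = 0.30
```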

The three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity ($I_G$), entropy ($I_H$), and the classification error ($I_E$).


1. Gini Impurity ($I_G$)

The Gini impurity can be understood as a criterion to minimize the probability of misclassification:

$I_G(t) = \sum_{i=1}^{c} p(i \mid t)(1-p(i \mid t)) = 1 - \sum_{i=1}^{c}p(i \mid t)^2$


Like entropy, the Gini impurity is maximal when the classes are perfectly mixed. For example, in a binary class setting ($c=2$) with $p(i \mid t) = 0.5$ for both classes:

$I_G(t) = 1-\sum_{i=1}^{c}0.5^2=0.5$


In practice, both Gini impurity and entropy usually yield very similar results, so it is often more worthwhile to experiment with different pruning cut-offs than to spend much time evaluating trees with different impurity criteria.

Calculation Example

Assume we have a node with the following class distribution:

  • Class 0: 30 instances
  • Class 1: 70 instances

The probabilities are:

$p_0 = \frac{30}{100} = 0.3$, $\hspace{20 mm}$ $p_1 = \frac{70}{100} = 0.7$


Then, the entropy is:

$\text{Entropy} = - (0.3 \log_2 0.3 + 0.7 \log_2 0.7) $
$= - (0.3 \cdot (-1.737) + 0.7 \cdot (-0.514)) $
$= - (-0.521 - 0.360) $
$= 0.881 $
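Since this part of the series focuses on the Gini criterion, it is worth computing the Gini impurity of the same node for comparison (the arithmetic below simply applies the formula from above):

$I_G = 1 - (p_0^2 + p_1^2) = 1 - (0.3^2 + 0.7^2) = 1 - 0.58 = 0.42$

Both measures agree that the node is impure but not maximally so; note that their maxima differ in the binary case (1 for entropy versus 0.5 for Gini impurity), so the absolute values are not directly comparable.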


