
Information Gain (2) - Entropy & Classification Error



Maximizing IG: Getting the Best Possible Value with the Best Efficiency

To split nodes effectively on the most informative features, we need an objective function that the tree learning algorithm can optimize.

2. Entropy ($I_H$)

Here is the definition of entropy, where the sum runs over all non-empty classes, i.e., classes with $p(i \mid t) \neq 0$:

$I_H(t)=-\sum_{i=1}^{c}p(i \mid t)\log_2 p(i \mid t)$

In the context of decision trees, the term $p(i\mid t)$ represents the proportion of examples belonging to the class $i$ at a specific node, denoted as $t$. The concept of entropy measures the impurity or uncertainty at a node. It is 0 when all examples at a node belong to the same class and increases as the class distribution becomes more uniform.

Specifically, in a binary class scenario,

  • the entropy is 0 when $p(i=1 \mid t)=1$ or $p(i=1 \mid t)=0$, i.e., when all examples belong to a single class.
  • Conversely, if the classes are evenly distributed with $p(i=1 \mid t)=0.5$ and $p(i=0 \mid t)=0.5$, the entropy is 1, indicating maximum uncertainty.

In summary, the entropy criterion in decision tree construction aims to maximize the mutual information between each split and the class labels. Every such split reduces uncertainty about the class and enhances the tree’s overall effectiveness in classifying new instances.
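
A minimal Python sketch of this formula (the `entropy` helper below is our own, written directly from the definition, not taken from any library) confirms the two binary cases just described:

```python
import math

def entropy(probs):
    """I_H(t) = -sum_i p(i|t) * log2 p(i|t), skipping empty classes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))  # 0.0 -> pure node, no uncertainty
print(entropy([0.5, 0.5]))  # 1.0 -> evenly split binary node, maximum uncertainty
```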

Calculation Example

Let’s assume we have a node with the following class distribution:

  • Class 0: 30 instances
  • Class 1: 70 instances

The probabilities are:

$p_0= \frac{30}{100} = 0.3 $, $\hspace{20 mm}$ $p_1=\frac{70}{100} = 0.7$

Then, the entropy is:

$I_H = -(0.3\log_2 0.3 + 0.7\log_2 0.7) \approx 0.521 + 0.360 \approx 0.881$
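
As a quick numerical check (again just a sketch, with our own variable names), the same value falls out of the formula directly:

```python
import math

p0, p1 = 0.3, 0.7
entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(entropy, 3))  # 0.881
```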



3. Classification Error

The classification error is measured as follows:

$I_E(t) = 1 - \max_i \{ p(i \mid t) \}$


This criterion is useful for pruning but not recommended for growing a decision tree because it is less sensitive to changes in the class probabilities of the nodes.

Calculation Example

Again, using the same node as before:

  • Class 0: 30 instances
  • Class 1: 70 instances

The probabilities are:

$p_0 = \frac{30}{100} = 0.3$, $\hspace{20 mm}$ $p_1 = \frac{70}{100} = 0.7$

The classification error is:

$ \text{Error} = 1 - \max(0.3, 0.7) = 1 - 0.7 = 0.3 $
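
The same kind of one-line check works here (the variable names are ours):

```python
probs = [0.3, 0.7]               # class proportions at the node
error = 1 - max(probs)           # I_E(t) = 1 - max_i p(i|t)
print(round(error, 3))           # 0.3
```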


Summary of Metrics

  • Gini Impurity: Measures the impurity of a node, with lower values indicating purer nodes.
  • Entropy: Measures the randomness or impurity of a node, with lower values indicating purer nodes.
  • Classification Error: Measures the misclassification rate at a node, with lower values indicating better classification.

Each of these measures can guide the decision tree algorithm in selecting the best feature and threshold to split the data at each node, aiming to create pure or homogeneous nodes, leading to an effective decision tree.
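
To illustrate how the three criteria behave, here is a minimal sketch (the helper names are ours, not from any particular library; `gini` implements the standard Gini impurity $1-\sum_i p(i \mid t)^2$) that evaluates all three measures on a binary node as the class balance shifts. All three are minimized at a pure node, while Gini and entropy react more smoothly to small changes in the class probabilities than the classification error does.

```python
import math

def gini(probs):
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def classification_error(probs):
    return 1.0 - max(probs)

# Evaluate all three impurity measures as the binary class balance shifts.
for p1 in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    probs = (1.0 - p1, p1)
    print(f"p1={p1:.1f}  gini={gini(probs):.3f}  "
          f"entropy={entropy(probs):.3f}  error={classification_error(probs):.3f}")
```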


