Day142 - STAT Review: Statistical Machine Learning (4)
Practical Statistics for Data Scientists: Tree Models (2) (A Simple Example, The Recursive Partitioning Algorithm, & Measuring Homogeneity or Impurity)
A decision tree is a flowchart-like structure used to make predictions. It splits the dataset into smaller and smaller groups based on feature values until a final decision (prediction) is made.
A Simple Example
The main packages for fitting tree models in R are rpart and tree. The rpart package fits a model to a sample of 3,000 loan records, using payment_inc_ratio and borrower_score.
- In R:

    library(rpart)
    loan_tree <- rpart(outcome ~ borrower_score + payment_inc_ratio, data=loan3000,
                       control=rpart.control(cp=0.005))
    plot(loan_tree, uniform=TRUE, margin=0.05)
    text(loan_tree)
- In Python, sklearn.tree.DecisionTreeClassifier implements a decision tree, and the dmba package offers a function for visualization:

    from sklearn.tree import DecisionTreeClassifier
    from dmba import plotDecisionTree

    predictors = ['borrower_score', 'payment_inc_ratio']
    outcome = 'outcome'
    X = loan3000[predictors]
    y = loan3000[outcome]

    loan_tree = DecisionTreeClassifier(random_state=1, criterion='entropy',
                                       min_impurity_decrease=0.003)
    loan_tree.fit(X, y)
    plotDecisionTree(loan_tree, feature_names=predictors, class_names=loan_tree.classes_)
The result will look as follows.
The tree works like a series of “if-then” rules. For example:

- If borrower_score >= 0.575 → predict paid off
- Else if borrower_score < 0.575 and payment_inc_ratio < 10.4 → check further splits…

Each node shows:
- How many records are in that node (n)
- Misclassifications (loss) at that node
- The majority class prediction
- The probability of each class (e.g., 0.53 paid off, 0.47 default)
- From R, the result is as follows:

    loan_tree
    ---
    n= 3000

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

     1) root 3000 1445 paid off (0.5183333 0.4816667)
       2) borrower_score>=0.575 878 261 paid off (0.7027335 0.2972665) *
       3) borrower_score< 0.575 2122 938 default (0.4420358 0.5579642)
         6) borrower_score>=0.375 1639 802 default (0.4893228 0.5106772)
          12) payment_inc_ratio< 10.42265 1157 547 paid off (0.5272256 0.4727744)
            24) payment_inc_ratio< 4.42601 334 139 paid off (0.5838323 0.4161677) *
            25) payment_inc_ratio>=4.42601 823 408 paid off (0.5042527 0.4957473)
              50) borrower_score>=0.475 418 190 paid off (0.5454545 0.4545455) *
              51) borrower_score< 0.475 405 187 default (0.4617284 0.5382716) *
          13) payment_inc_ratio>=10.42265 482 192 default (0.3983402 0.6016598) *
         7) borrower_score< 0.375 483 136 default (0.2815735 0.7184265) *
The indentation shows the depth of the tree. Each node corresponds to a provisional classification determined by the prevailing outcome in that partition. The “loss” is the number of misclassifications produced by that provisional classification within the partition. For instance, in node 2, there were 261 misclassifications out of 878 records.
The values in parentheses indicate the proportion of records that are either paid off or in default. For instance, in node 13, which predicts default, over 60 percent of the records are loans currently in default.
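As a quick arithmetic check, node 13 contains 482 records with a loss of 192, so its proportions are $192/482 \approx 0.398$ paid off and $1 - 192/482 \approx 0.602$ default, matching the reported (0.3983402 0.6016598).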
- In Python:

    print(textDecisionTree(loan_tree))
    --
    node=0 test node: go to node 1 if 0 <= 0.5750000178813934 else to node 6
    node=1 test node: go to node 2 if 0 <= 0.32500000298023224 else to node 3
    node=2 leaf node: [[0.785, 0.215]]
    node=3 test node: go to node 4 if 1 <= 10.42264986038208 else to node 5
    node=4 leaf node: [[0.488, 0.512]]
    node=5 leaf node: [[0.613, 0.387]]
    node=6 test node: go to node 7 if 1 <= 9.19082498550415 else to node 10
    node=7 test node: go to node 8 if 0 <= 0.7249999940395355 else to node 9
    node=8 leaf node: [[0.247, 0.753]]
    node=9 leaf node: [[0.073, 0.927]]
    node=10 leaf node: [[0.457, 0.543]]
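To make the if-then rules concrete, here is a minimal sketch (not from the book) that scores a hypothetical new borrower with the fitted Python loan_tree from above; the feature values are made up for illustration.

```python
import pandas as pd

# Hypothetical new loan record; the column order matches the predictors list used to fit the tree.
new_loan = pd.DataFrame([[0.60, 8.0]], columns=['borrower_score', 'payment_inc_ratio'])

print(loan_tree.predict(new_loan))        # majority class of the leaf the record lands in
print(loan_tree.predict_proba(new_loan))  # class proportions of that leaf
```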
The Recursive Partitioning Algorithm
The algorithm for constructing a decision tree, known as recursive partitioning, is straightforward and intuitive. The data is repeatedly divided using predictor values that most effectively separate the data into relatively homogeneous partitions.
The figure below illustrates the partitions created for the tree we previously reviewed. The first rule, borrower_score >= 0.575, segments the right portion of the plot. The second rule, borrower_score < 0.375, segments the left portion.

1. Start with the entire dataset.
2. For each feature, check all possible split points.
3. For each candidate split, compute how well it separates the classes (i.e., how homogeneous the resulting partitions are).
4. Choose the feature and split value that minimize impurity (i.e., increase purity).
5. Split the data into two subsets.
6. Repeat steps 2–5 recursively on each subset.
It stops when any of the following holds (a minimal sketch of the whole procedure follows this list):
- there is no further improvement in impurity,
- the minimum number of samples per node is reached, or
- the maximum depth is reached.
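The following is a minimal, illustrative sketch of recursive partitioning for a binary outcome, not the algorithm used by rpart or scikit-learn. It assumes a NumPy feature matrix X and a 0/1 label vector y (both hypothetical names) and scores splits with Gini impurity.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 0/1 label vector: 2p(1 - p)."""
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # a valid split must put records on both sides
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best[2]:
                best = (j, t, score)
    return best

def grow_tree(X, y, depth=0, max_depth=3, min_samples=20):
    """Recursively partition the data; leaves store the proportion of 1s."""
    if gini(y) == 0 or len(y) < min_samples or depth >= max_depth:
        return {'prob': y.mean()}          # leaf: estimated Prob(Y = 1)
    j, t, _ = best_split(X, y)
    if j is None:                          # no valid split found
        return {'prob': y.mean()}
    left = X[:, j] <= t
    return {'feature': j, 'threshold': t,
            'left':  grow_tree(X[left],  y[left],  depth + 1, max_depth, min_samples),
            'right': grow_tree(X[~left], y[~left], depth + 1, max_depth, min_samples)}
```

Calling grow_tree(X, y) returns a nested dictionary of split rules ending in leaf probabilities, mirroring the textual tree output shown earlier.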
Tree models predict a binary outcome and provide a probability estimate based on the counts of 0s and 1s in a partition: the estimate is the number of 1s (or 0s) divided by the total number of observations in that partition.
The estimated $\text{Prob}(Y=1)$ can then be converted to a binary decision; for example, set the estimate to $1$ if $\text{Prob}(Y = 1) > 0.5$.
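Continuing the sketch above, converting a leaf's estimated probability into a 0/1 decision is just a cutoff comparison; the 0.5 cutoff is the book's example and can be changed.

```python
def classify(prob_y1, cutoff=0.5):
    """Convert an estimated Prob(Y = 1) into a 0/1 decision."""
    return 1 if prob_y1 > cutoff else 0

print(classify(0.47))  # -> 0
print(classify(0.72))  # -> 1
```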
Measuring Homogeneity or Impurity
Tree models recursively partition the records into sets $A$, each predicting an outcome of $Y=0$ or $Y=1$. We need a measure of homogeneity, or class purity, within each partition, or equivalently, of its impurity.
A simple candidate is the proportion $p$ of misclassified records in a partition, which ranges from 0 (perfect) to 0.5 (no better than random guessing). This misclassification rate, however, is not an adequate measure of impurity. Instead, the Gini impurity and information entropy are two widely used impurity metrics.
Gini Impurity — “How often would we misclassify a point?”
- Formula:
$Gini(p) = 1 - (p^2 + (1 - p)^2)$
Or rewritten:
$Gini(p) = 2p(1 - p)$
This value:
- Is 0 when a node is pure (p = 0 or p = 1)
- Peaks at 0.5 when the classes are perfectly mixed
Example Values:
p (probability of class 1) | Gini(p) |
---|---|
0.0 | 0.00 |
0.25 | 0.375 |
0.5 | 0.50 |
0.75 | 0.375 |
1.0 | 0.00 |
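A quick check of the values in the table above, using the simplified form $2p(1-p)$:

```python
def gini(p):
    """Gini impurity for a binary outcome with Prob(class 1) = p."""
    return 2 * p * (1 - p)

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, gini(p))   # 0.0, 0.375, 0.5, 0.375, 0.0
```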
Entropy — “How uncertain are we about the class?”
- Formula:
$Entropy(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$
- This comes from information theory: it measures a random variable’s average information content (surprise).
- If we’re totally certain (p = 0 or 1), entropy = 0 → no surprise
- If we’re totally unsure (p = 0.5), entropy is highest → maximum uncertainty
Example Values:
p (probability of class 1) | Entropy(p) |
---|---|
0.0 | 0.00 |
0.25 | 0.811 |
0.5 | 1.000 |
0.75 | 0.811 |
1.0 | 0.00 |
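A quick check of the entropy values in the table above (log base 2, with the convention that $0 \log_2 0 = 0$):

```python
from math import log2

def entropy(p):
    """Binary entropy for Prob(class 1) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, round(entropy(p), 3))   # 0.0, 0.811, 1.0, 0.811, 0.0
```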
Graph Comparison

Plotting both functions shows that:
- Both are symmetric around 0.5.
- Both hit 0 when the node is pure.
- Entropy changes more steeply as p moves away from 0.5 (its maximum is 1.0, versus 0.5 for Gini), so in absolute terms it penalizes impurity more heavily.
This means entropy is more sensitive to changes in the class proportions near p = 0.5, whereas the Gini curve is flatter; in practice, the two criteria usually lead to very similar splits.
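A minimal matplotlib sketch to reproduce a comparison plot like the one above (note the different maxima: 0.5 for Gini, 1.0 for entropy; some texts rescale entropy by 1/2 to put the curves on the same scale):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)          # avoid log2(0) at the endpoints
gini = 2 * p * (1 - p)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, gini, label='Gini impurity')
plt.plot(p, entropy, label='Entropy')
plt.xlabel('p (probability of class 1)')
plt.ylabel('Impurity')
plt.legend()
plt.show()
```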
Which One Should We Use?
Criterion | Gini | Entropy |
---|---|---|
Computation | Faster (no log computation) | Slightly slower (uses log₂) |
Interpretation | Based on misclassification | Based on information gain |
Sensitivity | Less harsh around p = 0.5 | More sensitive around p = 0.5 |
Effect | Tends to create purer nodes early on | May give more balanced splits |
Please note that scikit-learn uses Gini by default, but we can easily switch to entropy.
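For example, the criterion argument controls this choice:

```python
from sklearn.tree import DecisionTreeClassifier

tree_gini = DecisionTreeClassifier()                        # criterion='gini' is the default
tree_entropy = DecisionTreeClassifier(criterion='entropy')  # switch to entropy
```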