
Practical Statistics for Data Scientists: Tree Models (3) (Dealing With Overfitting Problems in R and Python & Predicting a Continuous Value)



Stopping the Tree from Growing

When we let a decision tree grow without restriction, it can create a split for almost every variation in the training data. This leads to:

  • Pure leaves (100% accuracy on training data)
  • But terrible performance on new, unseen data

This is called overfitting — the tree learns not just the signal but the noise, too.

We need a way to decide when to stop growing the tree so that the result still generalizes to new data. There are several approaches to limiting splits in R and Python:

  1. Minimum Partition Size:
    • Do not split a partition if the subpartition or terminal leaf is too small. In rpart(R), this is controlled by minsplit (default 20) and minbucket (default 7). In Python’s DecisionTreeClassifier, use min_samples_split (default 2) and min_samples_leaf (default 1).
  2. Impurity-Based Stopping:
    • Avoid splitting a partition if it doesn’t significantly reduce impurity. In rpart, this is managed by the complexity parameter cp, which attaches a penalty to each additional split: the higher the cp value, the more aggressively growth is limited and the simpler the resulting tree. DecisionTreeClassifier in Python includes min_impurity_decrease, which allows a split only if it reduces the weighted impurity by at least that amount; smaller values result in more complex trees.

These methods use arbitrary rules and are helpful for exploration, but determining the values that maximize predictive accuracy on new data is challenging. We need to combine cross-validation with systematically varying the model parameters, or modify the tree through pruning.

In summary (a short Python sketch of these settings follows the list):

  • Minimum Partition Size: Preventing the tree from making splits that produce tiny groups
    • In R (rpart):
      • minsplit: Minimum number of observations in a node to consider splitting (default = 20)
      • minbucket: Minimum number of observations allowed in a leaf (default = 7)
    • In Python (DecisionTreeClassifier):
      • min_samples_split: Minimum number of samples required to split a node
      • min_samples_leaf: Minimum samples allowed in a leaf node
  • Impurity-Based Stopping: Only split if it significantly reduces impurity.
    • In R:
      • cp (complexity parameter): Penalizes overly complex trees
      • Low cp → big tree, may overfit
      • High cp → small tree, may underfit
    • In Python:
      • min_impurity_decrease: Split only if impurity drops by a minimum value
      • ccp_alpha (cost-complexity pruning): Controls post-training pruning
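
To make the Python side of this summary concrete, here is a minimal sketch of where these stopping rules appear in DecisionTreeClassifier. The data is synthetic, and the threshold values (apart from 20 and 7, which mirror the rpart defaults mentioned above) are arbitrary choices for illustration, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data purely for illustration
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    tree = DecisionTreeClassifier(
        min_samples_split=20,         # like minsplit: do not split nodes smaller than 20
        min_samples_leaf=7,           # like minbucket: every leaf keeps at least 7 records
        min_impurity_decrease=0.001,  # split only if weighted impurity drops by this much (arbitrary value)
        ccp_alpha=0.0,                # values > 0 turn on cost-complexity pruning after growth
        random_state=0,
    )
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())

Loosening any of these settings lets the tree grow deeper; tightening them stops splits earlier.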


Controlling Tree Complexity in R

With the complexity parameter cp, we can estimate what size of tree will perform best on new data. If cp is too small, the tree will overfit, capturing noise rather than signal. Conversely, if cp is too large, the tree will be too small and have limited predictive power. The rpart default is 0.01, which may be too large for larger datasets. In the previous example, cp was set to 0.005 because the default led to a tree with only one split. During exploratory analysis, trying a few values is sufficient.

Finding the optimal cp is an instance of the bias-variance trade-off. The most common way to estimate a good cp value is cross-validation:

  1. Split data into training and validation sets.
  2. Grow the tree with training data, pruning it step by step while recording cp at each step.
  3. Identify the cp corresponding to minimum error on validation data.
  4. Re-split data and repeat the growing, pruning, and cp recording process.
  5. Average the cps reflecting minimum error for each tree.
  6. Finally, use the original or future data to grow a tree, stopping at the optimum cp value.

In rpart, the cptable of a fitted model lists candidate cp values along with their cross-validation error (xerror), from which you can identify the cp value with the lowest error.


Controlling Tree Complexity in Python

In scikit-learn’s decision tree, the complexity parameter is ccp_alpha. It defaults to 0, which means no pruning; increasing it results in smaller trees.
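
For something closer to rpart's cptable workflow, scikit-learn can enumerate the candidate alpha values with cost_complexity_pruning_path, and each candidate can then be scored by cross-validation. This is only a sketch: X_train and y_train are assumed placeholders for your existing training predictors and labels.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Candidate alphas from the cost-complexity pruning path (analogous to rpart's cp table)
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Cross-validate a tree for each candidate alpha and keep the best one
    scores = [
        cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                        X_train, y_train, cv=5).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    print(best_alpha)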

Use GridSearchCV to find the optimal value, as sketched below. Other parameters, such as max_depth (5 to 30) and min_samples_split (20 to 100), also control tree size. GridSearchCV in scikit-learn is a convenient way to combine exhaustive grid search with cross-validation, selecting the best parameter set based on model performance.
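
A minimal sketch of such a search, again assuming X_train and y_train already exist; the grid values for max_depth and min_samples_split follow the ranges above, while the ccp_alpha candidates are arbitrary illustrative choices.

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {
        'max_depth': [5, 10, 15, 20, 25, 30],        # range suggested above
        'min_samples_split': [20, 40, 60, 80, 100],  # range suggested above
        'ccp_alpha': [0.0, 0.001, 0.01],             # illustrative pruning strengths
    }

    # Exhaustive search over the grid, scored by 5-fold cross-validation
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)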


Predicting Continuous Values with Trees (Regression Trees)

In classification trees, the goal is to assign each record into a category like “default” or “paid off.” However, in regression trees, the task shifts slightly:

Now the tree must predict a continuous numeric value, like the price of a house, the amount of a loan, or the expected number of sales.

  1. At each split, the tree still tries to separate the data into “purer” groups.
  2. But instead of class labels, the focus is on minimizing the spread of numeric values within each node.
  3. Impurity is measured using variance or Mean Squared Error (MSE).
    • Variance: How spread out are the numbers inside the node?
    • MSE: Average of the squared differences from the node mean.

Therefore:

  • Each leaf (final group) will predict the average value of all the records inside that leaf.

Example: If a leaf contains house sale prices of $250,000, $260,000, and $255,000, the tree will predict $255,000 for any new house falling into that leaf.
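
A quick numeric check of that example, using the three sale prices above:

    import numpy as np

    prices = np.array([250_000, 260_000, 255_000])
    leaf_prediction = prices.mean()                             # 255,000: the value the leaf predicts
    within_leaf_mse = ((prices - leaf_prediction) ** 2).mean()  # the spread a split tries to reduce
    print(leaf_prediction, within_leaf_mse)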

  • In Python, we use DecisionTreeRegressor instead of DecisionTreeClassifier

    from sklearn.tree import DecisionTreeRegressor
      
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    
    • X_train contains the predictors (features), and y_train contains the continuous target variable.
    • After training, the model can predict a numeric output for new records, as sketched below.
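
A minimal sketch of generating predictions and measuring error on held-out data; X_test and y_test are assumed placeholders for a separate evaluation set.

    from sklearn.metrics import mean_squared_error

    # X_test / y_test: a held-out evaluation set (placeholders)
    y_pred = model.predict(X_test)  # each prediction is the mean target value of the matching leaf
    print(mean_squared_error(y_test, y_pred))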


Why Use Decision Trees (Even Single Simple Trees)?

Even though single decision trees may not always win predictive competitions, they remain incredibly valuable tools for exploration, communication, and understanding data.

  • Major advantages
    1. Highly Interpretable (“White Box” Model)
      • A decision tree’s structure is easy to explain: a series of understandable “if-then” rules.
      • You can trace exactly why the model made a specific prediction (see the short rule-printing sketch after this list).
      • Useful when stakeholders (managers, clients) want transparency.
    2. Visual and Intuitive
      • You can draw a decision tree.
      • Each split represents a simple decision based on a feature.
      • Non-technical audiences can quickly grasp patterns and key decision points from the visual tree structure.
    3. Handles Nonlinear Relationships Automatically
      • Trees naturally model complex interactions without requiring feature engineering.
      • If the true relationship between variables is nonlinear (curved, thresholds, etc.), trees can discover it without you specifying it.
    4. No Need for Feature Scaling or Normalization
      • Unlike models like KNN, SVMs, or Logistic Regression, trees are invariant to the scale of features.
      • A variable measured in dollars and another measured in percentages are treated equally when deciding splits.
  • Key Limitations
    1. Prone to Overfitting
      • If left unrestricted (deep trees with many branches), trees can fit training data perfectly.
      • This leads to poor generalization on new, unseen data (they “memorize” rather than “learn”).
    2. Less Accurate Compared to Ensemble Methods
      • Single decision trees often have higher variance (performance fluctuates with different datasets).
      • Models like Random Forests (averaging many trees) and Boosted Trees (sequential correction of mistakes) consistently outperform a single tree in predictive power.
    3. Unstable with Small Changes
      • A small change in the data can result in a completely different tree structure.
      • This instability makes them sensitive unless they are combined (again, ensembles help).
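
As a small illustration of the “white box” point above, scikit-learn can print a fitted tree as plain if-then rules with export_text. This sketch fits a deliberately small tree on the built-in iris data purely for demonstration.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # Print the fitted tree as nested if-then rules that non-technical readers can follow
    print(export_text(clf, feature_names=list(iris.feature_names)))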


| Aspect | Single Tree Strengths | Single Tree Weaknesses |
| --- | --- | --- |
| Interpretability | Clear and easy to explain | - |
| Model Complexity | Captures complex, nonlinear patterns | Overfits if unrestricted |
| Feature Engineering | Minimal (trees handle it) | - |
| Predictive Performance | Decent for small data and exploratory tasks | Outperformed by Random Forests / Boosting for final deployment |
| Robustness | - | Prone to instability; ensemble methods improve it |


