Day146 - STAT Review: Statistical Machine Learning (8)
Practical Statistics for Data Scientists: Boosting (1) (Key Concepts & XGBoost)
Boosting
Ensemble models have become essential in predictive modeling. Boosting is a general method for forming an ensemble of models. Like bagging, boosting is most often applied to decision trees. Despite these similarities, however, boosting takes a distinct approach, one that involves many more features and complexities. As a result, while bagging can be done with relatively little tuning, boosting requires much greater care in its application.
We already noticed that in bagging (as in random forests):
- We build various models independently.
- Then, we combine their predictions at the end by averaging or voting.
Boosting, on the other hand, adopts a distinctly different strategy. Rather than constructing multiple models simultaneously, it develops them one at a time, sequentially, like a relay race. Each model learns from the mistakes of the previous model.
Boosting = “Let’s focus more and more on the difficult cases.”
It’s like a team of doctors:
- The first doctor makes an initial diagnosis.
- The second doctor reviews the mistakes and corrects them.
- The third doctor improves on what the second missed.
- And so on.
Each step learns from the previous step’s failures. Several algorithm variants are commonly used: Adaboost, gradient boosting, and stochastic gradient boosting. Stochastic gradient boosting is the most general and the most widely used; indeed, with the proper choice of parameters, it can replicate a random forest.
Key Terms for Boosting
- Ensemble
- Forming a prediction by using a collection of models.
- = Model Averaging
- Boosting
- A general technique for fitting a sequence of models by giving greater weight to the records with large residuals in each successive round.
- Adaboost
- An early version of boosting that reweights the data based on the residuals.
- Gradient Boosting
- A more general form of boosting that is cast as the minimization of a cost function.
- Stochastic Gradient Boosting
- The most general form of boosting, which incorporates resampling of records and columns in each iteration.
- Regularization
- A technique to avoid overfitting by adding a penalty term to the cost function based on the number of parameters in the model.
- Hyperparameters
- Parameters that must be established before fitting the algorithm.
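To make the last two terms concrete, here is a small illustrative snippet (not from the book) using the `XGBClassifier` interface that appears later in this post; the parameter values are arbitrary and only show where hyperparameters and regularization penalties get set before fitting.

```python
from xgboost import XGBClassifier

# All of these are hyperparameters: they are chosen before fitting.
# reg_alpha / reg_lambda add L1 / L2 penalty terms to the cost function
# (regularization), which discourages overly complex models.
model = XGBClassifier(
    n_estimators=100,     # number of boosting rounds (trees)
    learning_rate=0.1,    # shrinkage applied to each tree
    max_depth=3,          # maximum size of each tree
    reg_alpha=0.0,        # L1 regularization penalty
    reg_lambda=1.0,       # L2 regularization penalty
)
# model.fit(X, y)  # X, y: any feature matrix and binary labels (placeholders)
```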
The Boosting Algorithm
There are various boosting algorithms, and the basic idea behind them is the same.
Step-by-Step
- Start simple: Train a basic model (for example, a small decision tree) on the full dataset.
- Check where we made mistakes: Look at which examples were misclassified (or had large errors).
- Focus harder on the mistakes: Increase the importance (weights) of the misclassified examples.
- It’s like telling the model, “Pay extra attention to these tricky cases next time.”
- Train the next model: The subsequent model is now trained with the updated weights.
It tries harder to predict the examples that were missed previously.
- Repeat: Keep adding models. Each one focuses on fixing the mistakes of the ones before.
- Combine all models together:
The final prediction is a weighted vote (classification) or weighted average (regression) of all models.
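As a minimal sketch of this whole loop in Python (an AdaBoost-style reweighting scheme with decision stumps; `X` is assumed to be a numeric feature matrix and `y` a NumPy array of labels coded as -1/+1, not data from the book):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_boost(X, y, n_rounds=10):
    """Toy boosting loop: reweight the mistakes, combine by weighted vote."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start simple: equal weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # train on the weighted data
        pred = stump.predict(X)
        miss = pred != y                         # check where we made mistakes
        err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # better models get a bigger say
        w = w * np.exp(-alpha * y * pred)        # focus harder on the mistakes
        w = w / w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X):
    # combine all models: a weighted vote across the sequence
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```

The ingredients mirror the steps above: sample weights that grow on the hard cases, and a final prediction that is a weighted vote of every model in the sequence.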
Variants of Boosting
Boosting comes in several flavors, but they all follow this core idea of sequential improvement.
| Variant | Key Idea |
|---|---|
| Adaboost | Increases the weight of misclassified points after each model. |
| Gradient Boosting | Frames the problem as minimizing a loss function (like MSE or cross-entropy) step by step. |
| Stochastic Gradient Boosting | Adds randomness: samples rows and features at each step to make the process more robust and faster. |
Boosting vs. Bagging: The Main Difference
| Aspect | Bagging (Random Forest) | Boosting |
|---|---|---|
| Training | Models are built independently. | Models are built sequentially. |
| Focus | Reduces variance (stabilizes models). | Reduces bias (fixes mistakes step by step). |
| Sensitivity | More stable and easier to run. | More sensitive and powerful, but trickier to tune. |
The Boosting Algorithm in Action (Example: Adaboost)
Imagine you have a set of loans and you’re predicting whether each one will default or be paid off. Here’s how Adaboost would work:
- First tree: guesses 70% right, 30% wrong.
- We increase the weights on those wrong cases.
- Second tree: trained more heavily on those 30% misclassified loans.
- Now it gets 80% correct.
- We boost again: the third tree focuses on what’s still wrong.
- And so on.
Each tree is like a correction layer for the previous mistakes.
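To see this in code, here is a hedged sketch using scikit-learn’s `AdaBoostClassifier` on synthetic data (a stand-in for the loan data; the printed accuracies will not match the 70%/80% numbers above, which are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "default vs. paid off" (not the book's loan data)
X, y = make_classification(n_samples=3000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X, y)

# Training accuracy after 1, 5, and 50 trees: each added tree acts as a
# correction layer on top of the previous ones.
for i, staged_pred in enumerate(ada.staged_predict(X), start=1):
    if i in (1, 5, 50):
        print(f'{i:>2} trees: train accuracy = {accuracy_score(y, staged_pred):.3f}')
```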
What About Gradient Boosting?
In Gradient Boosting, we directly minimize a loss function (like Mean Squared Error or Log Loss) instead of manually adjusting observation weights. Each new model fits the residuals (what’s left to be predicted after previous models).
Residual = True value - Current prediction
Each new model tries to predict the residuals (errors) from the previous step — meaning it keeps improving prediction accuracy.
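A minimal sketch of this residual-fitting idea for squared-error loss, assuming `X` and `y` are plain NumPy arrays (this shows only the core mechanism, with a shrinkage step but none of the other refinements a real library adds):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_mse(X, y, n_rounds=100, learning_rate=0.1):
    """Each new tree is fit to the residuals left over by the models so far."""
    base = y.mean()
    prediction = np.full(len(y), base)        # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction             # what is left to predict
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)                 # the new model targets the residuals
        prediction += learning_rate * tree.predict(X)  # small corrective step
        trees.append(tree)
    return base, trees

def gb_predict(base, trees, X, learning_rate=0.1):
    # Final prediction: the constant start plus every tree's correction
    return base + learning_rate * sum(t.predict(X) for t in trees)
```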
Stochastic Gradient Boosting goes even further. It adds randomness by training each model on a random subset of data and features. This makes it faster and reduces overfitting.
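scikit-learn’s `GradientBoostingClassifier` exposes both kinds of randomness directly, so a hedged sketch of the stochastic variant might look like this (the parameter values are illustrative, and `X`, `y` are placeholders rather than the book’s data):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Stochastic gradient boosting: each tree is fit on a random fraction of the
# rows, and each split considers only a random subset of the features.
sgb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.63,        # fraction of rows sampled for each tree
    max_features='sqrt',   # features considered at each split
    random_state=0,
)
# sgb.fit(X, y)  # X, y: any feature matrix and binary labels (placeholders)
```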
XGBoost
The most widely used public-domain software for boosting is XGBoost, “Extreme Gradient Boosting.” It is a fast, highly optimized implementation of boosting that gained fame because it performs extremely well on massive datasets and handles missing data gracefully. It also supports regularization to avoid overfitting.
XGBoost operates similarly to traditional boosting by building models sequentially and correcting the errors of previous models. However, it incorporates more efficient techniques (such as fast memory usage and parallel processing), provides greater flexibility (allowing for precise tuning), and includes additional regularization (to minimize the risk of overfitting). In short,
XGBoost is not a new algorithm but an optimized, supercharged boosting implementation.
Important Parameters in XGBoost
We must tune these parameters carefully to get the best result.
- Subsample refers to the “fraction of rows randomly sampled for each tree.” It introduces randomness to improve generalization (similar to Random Forest).
- eta (`learning_rate` in Python) refers to the “shrinkage applied after each tree.” It controls how quickly the model learns; a small eta results in slower, more careful learning, which helps avoid overfitting.
For instance, we will train XGBoost on the loan default data that we have used previously.
- In R:

```r
predictors <- data.matrix(loan3000[, c('borrower_score', 'payment_inc_ratio')])
label <- as.numeric(loan3000[, 'outcome']) - 1
xgb <- xgboost(data=predictors, label=label, objective="binary:logistic",
               params=list(subsample=0.63, eta=0.1), nrounds=100)
```

```
[1]   train-error:0.358333
[2]   train-error:0.346333
[3]   train-error:0.347333
...
[99]  train-error:0.239333
[100] train-error:0.241000
```
`xgboost` does not support the formula syntax, so the predictors must be converted to a `data.matrix`, and the response needs to be converted to a 0/1 variable. The `objective` argument tells `xgboost` the type of problem; based on this, `xgboost` will choose a metric to optimize.

Here’s what each part means:
- `objective="binary:logistic"` → Binary classification (yes/no, default or not).
- `subsample=0.63` → Use 63% of the data randomly for each tree.
- `eta=0.1` → Small learning steps to avoid overfitting.
- `nrounds=100` → Build 100 trees.
As the boosting progresses, the error on the training data decreases.
- In Python, `xgboost` has two interfaces: a `scikit-learn` API (with the classes `XGBClassifier` and `XGBRegressor`) and a more functional interface like in R. (`eta` is replaced with `learning_rate` here.)

```python
predictors = ['borrower_score', 'payment_inc_ratio']
outcome = 'outcome'
X = loan3000[predictors]
y = loan3000[outcome]

xgb = XGBClassifier(objective='binary:logistic', subsample=0.63)
xgb.fit(X, y)
```

```
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.63, verbosity=1)
```
From each part:
- `learning_rate` = 0.1 by default (instead of `eta`; Python uses the newer naming).
- The `objective` is set to `binary:logistic`, meaning it’s solving a binary classification task.
Predictions and Plotting
- In R, the predicted values can be obtained from the `predict` function and plotted versus the predictors since there are only two variables.

```r
pred <- predict(xgb, newdata=predictors)
xgb_df <- cbind(loan3000, pred_default = pred > 0.5, prob_default = pred)
ggplot(data=xgb_df, aes(x=borrower_score, y=payment_inc_ratio,
                        color=pred_default, shape=pred_default, size=pred_default)) +
  geom_point(alpha=.8) +
  scale_color_manual(values = c('FALSE'='#b8e186', 'TRUE'='#d95f02')) +
  scale_shape_manual(values = c('FALSE'=0, 'TRUE'=1)) +
  scale_size_manual(values = c('FALSE'=0.5, 'TRUE'=2))
```
- The same figure can be created in Python using the code below.

```python
fig, ax = plt.subplots(figsize=(6, 4))
xgb_df.loc[xgb_df.prediction=='paid off'].plot(
    x='borrower_score', y='payment_inc_ratio', style='.',
    markerfacecolor='none', markeredgecolor='C1', ax=ax)
xgb_df.loc[xgb_df.prediction=='default'].plot(
    x='borrower_score', y='payment_inc_ratio', style='o',
    markerfacecolor='none', markeredgecolor='C0', ax=ax)
ax.legend(['paid off', 'default'])
ax.set_xlim(0, 1)
ax.set_ylim(0, 25)
ax.set_xlabel('borrower_score')
ax.set_ylabel('payment_inc_ratio')
```
