
Practical Statistics for Data Scientists: Bagging and the Random Forest (1)



Bagging and the Random Forest

Imagine hundreds of people at a county fair trying to guess the weight of an ox. Each person’s guess is slightly wrong—some overestimate, some underestimate. But when you average all the guesses, you get astonishingly close to the actual weight.

This is called the wisdom of crowds — many weak guesses can combine to make one strong prediction. In machine learning, we apply this same idea. Instead of relying on a single predictive model (like one decision tree), we build many models, each a little different. We combine their results (by averaging or voting) and make much better predictions.


Key Terms for Bagging and the Random Forest

  • Ensemble
    • Creating a prediction through an ensemble of models.
    • = Model averaging
  • Bagging
    • A common method for generating a set of models through data bootstrapping.
    • = Bootstrap aggregation
  • Random Forest
    • A bagged estimate derived from decision tree models.
    • = Bagged decision trees
  • Variable importance
    • A metric that indicates the significance of a predictor variable in the model’s performance.

The ensemble approach has been used in various modeling methods, notably in the Netflix Prize, where Netflix offered $1 million for a model achieving a 10% improvement in predicting customer movie ratings. The basic concept of ensembles is:

  1. Develop a predictive model and document its predictions for a specific data set.
  2. Do this for various models using the same data.
  3. Average the predictions (or take a weighted average, or use majority voting) for each record that needs a prediction.
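
A minimal sketch of these three steps in Python, assuming scikit-learn and placeholder training data X, y and new records X_new (none of which come from the original example):

    # Sketch of simple model averaging: several different models, one data set.
    # X, y are placeholder training data; X_new holds the records to score.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    models = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5),
              GaussianNB()]

    # Steps 1-2: fit each model on the same training data
    for model in models:
        model.fit(X, y)

    # Step 3: average the predicted probabilities for every record to score
    avg_prob = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
    ensemble_pred = (avg_prob >= 0.5).astype(int)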

Ensemble methods have been applied most effectively to decision trees. Ensemble tree models are powerful enough that they provide a way to build good predictive models with relatively little effort.

Beyond the simple ensemble algorithm, there are two main ensemble models: bagging and boosting. In ensemble tree models, these are known as random forest and boosted tree models.


Bagging

Bagging, which stands for “bootstrap aggregating,” begins with the assumption that we have a response $Y$ and $P$ predictor variables $\textbf{X} = X_1, X_2, \dots, X_P$ with $N$ records.

Bagging is like the basic algorithm for ensembles, except that each new model is fitted to a bootstrap resample instead of fitting the various models to the same data. Here is the algorithm presented more formally:

  1. Initialize $M$, the number of models to be fit, and $n$, the number of records to select in each bag ($n < N$). Set the iteration $m = 1$.
  2. Take a bootstrap resample (i.e., with replacement) of $n$ records from the training data to form a subsample $Y_m$ and $\textbf{X}_m$ (the bag).
  3. Train a model using $Y_m$ and $\textbf{X}_m$ to create a set of decision rules $\hat{f_m}(\textbf{X})$.
  4. Increment the model counter: $m = m + 1$. If $m \le M$, go back to step 2.
  5. When we need a prediction:
    • Average all the models’ outputs (for regression),
    • Or vote among their predictions (for classification).
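
A hedged sketch of this loop in Python, assuming scikit-learn and a response y coded 0/1 (the objects X, y, X_new are placeholders, not the loan data):

    # Sketch of the bagging algorithm above (X, y are placeholder training
    # data with N records; X_new holds the records to score; y is coded 0/1).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    M = 100           # number of models to fit
    n = len(X) // 2   # number of records per bag (n < N)
    models = []

    for m in range(M):
        # Step 2: bootstrap resample (with replacement) of n records
        X_m, y_m = resample(X, y, n_samples=n, replace=True, random_state=m)
        # Step 3: fit a model to the bag
        models.append(DecisionTreeClassifier().fit(X_m, y_m))

    # Step 5: average (regression) or vote (classification) across the models
    votes = np.array([tree.predict(X_new) for tree in models])
    bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote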


Why Bagging is Beneficial

  • Different Trees Exhibit Unique Errors. Each individual decision tree in a bagging ensemble tends to make its own mistakes, stemming from the varying data samples they are trained on.
  • By averaging the predictions of these trees, random errors effectively cancel each other out. This process reduces the noise and enhances the overall prediction accuracy.
  • Consequently, the overall variance—which refers to the model’s sensitivity to small fluctuations in the training dataset—decreases. This leads to the development of a more robust model that performs consistently across different datasets and conditions.
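
A toy simulation (not the loan data) makes the variance reduction concrete: if each model's prediction is the truth plus independent noise with variance $\sigma^2$, the average of $M$ such predictions has variance close to $\sigma^2 / M$.

    # Toy simulation: averaging M independent noisy predictions shrinks the
    # variance by roughly a factor of M.
    import numpy as np

    rng = np.random.default_rng(0)
    M, sigma, truth = 50, 1.0, 10.0

    # 10,000 repetitions of "M models each make a noisy prediction"
    preds = truth + rng.normal(0, sigma, size=(10_000, M))

    print(preds[:, 0].var())         # single model: close to sigma^2 = 1.0
    print(preds.mean(axis=1).var())  # ensemble average: close to 1.0 / 50 = 0.02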


Random Forest

The random forest method applies bagging to decision trees, introducing one important extension: in addition to sampling the records, the algorithm also samples the variables.

In a traditional decision tree, the algorithm chooses the variable and the split point that best partition the data by minimizing a criterion such as Gini impurity over all candidate splits. With random forests, at each stage of the algorithm, the choice of variable is limited to a random subset of the variables.

A random forest is just bagging applied to decision trees, plus one extra random twist:

  • At each split, instead of considering all variables, the tree randomly chooses a subset of variables to consider.

This does two things:

  • Forces trees to be even more diverse.
  • Prevents trees from all focusing on the same “obvious” strong predictor (e.g., borrower_score).

The result is that the trees are different enough that averaging them provides a much better, more stable model.
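
To make the twist concrete, here is an illustration of what happens at a single split (the predictor names beyond borrower_score and payment_inc_ratio are made-up placeholders; real implementations handle this internally):

    # Illustration only: at each split, only a random subset of predictors
    # competes for the best split point.
    import numpy as np

    rng = np.random.default_rng(42)
    all_predictors = ['borrower_score', 'payment_inc_ratio', 'loan_amnt',
                      'term', 'annual_inc', 'dti', 'revol_bal', 'revol_util',
                      'purpose', 'home_ownership']       # P = 10 predictors
    m_try = int(np.sqrt(len(all_predictors)))            # about sqrt(P) = 3

    candidates = rng.choice(all_predictors, size=m_try, replace=False)
    print(candidates)   # a subset that may well exclude borrower_score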


Random Forest - Step by Step

Imagine building a forest:

  1. Bootstrap sample the records (pick random loans, with replacement).
  2. For each tree:
    • Whenever a node needs to be split:
      • Randomly choose a subset of features (say, 3 out of 10).
      • Find the best split only among those 3 features.
    • Continue splitting until the tree is grown.
  3. Repeat this process to build hundreds of trees (maybe 500).
  4. Then, for a new borrower:
    • Ask all the trees what they think (paid off or default).
    • Take a vote.
    • Majority wins.
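
A toy illustration of the final voting step (the individual tree votes here are made up):

    # Each tree casts a vote for the new borrower; the majority class wins.
    from collections import Counter

    tree_votes = ['paid off', 'default', 'paid off', 'paid off', 'default']
    tally = Counter(tree_votes)
    print(tally)                       # Counter({'paid off': 3, 'default': 2})
    print(tally.most_common(1)[0][0])  # 'paid off' wins the vote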

How many variables should be sampled at each split? A rule of thumb is to choose $\sqrt{P}$, where $P$ is the number of predictor variables.
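
In scikit-learn, for example, this rule of thumb corresponds to the max_features parameter; setting it to 'sqrt' considers about $\sqrt{P}$ predictors at each split:

    # The sqrt(P) rule of thumb, expressed via scikit-learn's max_features.
    import math
    from sklearn.ensemble import RandomForestClassifier

    P = 10
    print(math.floor(math.sqrt(P)))   # -> 3 variables tried at each split

    rf_sqrt = RandomForestClassifier(n_estimators=500, max_features='sqrt')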

The randomForest package implements the random forest in R. The following applies this package to the loan data introduced in earlier posts.

  • In R

    library(randomForest)

    rf <- randomForest(outcome ~ borrower_score + payment_inc_ratio,
                       data=loan3000)
    rf
    ---
    Call:
     randomForest(formula = outcome ~ borrower_score + payment_inc_ratio,
                  data = loan3000)
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 1

            OOB estimate of error rate: 39.17%
    Confusion matrix:
             default paid off class.error
    default      873      572  0.39584775
    paid off     603      952  0.38778135
    
  • In Python, we use the sklearn.ensemble.RandomForestClassifier class.

    from sklearn.ensemble import RandomForestClassifier

    predictors = ['borrower_score', 'payment_inc_ratio']
    outcome = 'outcome'

    X = loan3000[predictors]
    y = loan3000[outcome]

    rf = RandomForestClassifier(n_estimators=500, random_state=1, oob_score=True)
    rf.fit(X, y)
    

    Here 500 trees are trained, matching the default of R's randomForest. Since there are only two variables in the predictor set, the algorithm chooses the splitting variable from a random subset of size 1 at each split (following the $\sqrt{P}$ rule, since $\lfloor\sqrt{2}\rfloor = 1$).
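
As a follow-up sketch (not from the original example), the fitted rf object exposes the out-of-bag accuracy via oob_score_, and a new borrower can be scored directly; the feature values below are made up for illustration:

    # Inspect the out-of-bag error and score a hypothetical new borrower.
    import pandas as pd

    print('OOB estimate of error rate:', 1 - rf.oob_score_)

    new_borrower = pd.DataFrame({'borrower_score': [0.65],
                                 'payment_inc_ratio': [4.0]})
    print(rf.predict(new_borrower))        # majority vote across the 500 trees
    print(rf.predict_proba(new_borrower))  # averaged class probabilities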


