Day144 - STAT Review: Statistical Machine Learning (6)
Practical Statistics for Data Scientists: Bagging and the Random Forest (1)
Bagging and the Random Forest
Imagine hundreds of people at a county fair trying to guess the weight of an ox. Each person’s guess is slightly wrong—some overestimate, some underestimate. But when you average all the guesses, you get astonishingly close to the actual weight.
This is called the wisdom of crowds — many weak guesses can combine to make one strong prediction. In machine learning, we apply this same idea. Instead of relying on a single predictive model (like one decision tree), we build many models, each a little different. We combine their results (by averaging or voting) and make much better predictions.
Key Terms for Bagging and the Random Forest
- Ensemble
- Creating a prediction through an ensemble of models.
- = Model averaging
- Bagging
- A common method for generating a set of models through data bootstrapping.
- = Bootstrap aggregation
- Random Forest
- A bagged estimate derived from decision tree models.
- = Bagged decision trees
- Variable importance
- A metric that indicates the significance of a predictor variable in the model’s performance.
The ensemble approach has been used in various modeling methods, notably in the Netflix Prize, where Netflix offered $1 million for a model achieving a 10% improvement in predicting customer movie ratings. The basic concept of ensembles is:
- Develop a predictive model and document its predictions for a specific data set.
- Do this for various models using the same data.
- Take the average of the predictions (or a weighted average, or a majority vote) for each record that needs a prediction.
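To make this concrete, here is a minimal sketch of those three steps in Python (my own illustration, not code from the book); the synthetic dataset and the three model choices are arbitrary.

```python
# Minimal sketch of the basic ensemble recipe: fit several different models
# on the same data, then combine their predictions by majority vote.
# Dataset and model choices below are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    KNeighborsClassifier(n_neighbors=15),
]
for m in models:
    m.fit(X, y)                                       # steps 1-2: same data, different models

preds = np.array([m.predict(X) for m in models])      # one row of 0/1 predictions per model
majority = (preds.mean(axis=0) > 0.5).astype(int)     # step 3: majority vote per record
```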
Ensemble methods have been applied most effectively to decision trees. Ensemble tree models are powerful enough that they provide a way to build robust predictive models with relatively little effort.
Beyond the simple ensemble algorithm, there are two main ensemble models: bagging and boosting. In ensemble tree models, these are known as random forest and boosted tree models.
Bagging
Bagging, which stands for “bootstrap aggregating,” begins with the assumption that we have a response $Y$ and $P$ predictor variables $\textbf{X} = X_1, X_2, \dots, X_P$ with $N$ records.
Bagging is like the basic algorithm for ensembles, except that each new model is fitted to a bootstrap resample instead of fitting the various models to the same data. Here is the algorithm presented more formally:
- Initialize $M$, the number of models to be fit, and $n$, the number of records to select in each resample ($n < N$). Set the iteration counter $m=1$.
- Take a bootstrap resample (i.e., with replacement) of $n$ records from the training data to form a subsample $Y_m$ and $\textbf{X}_m$ (the bag).
- Train a model using $Y_m$ and $\textbf{X}_m$ to create a set of decision rules $\hat{f_m}(\textbf{X})$.
- Increment the model counter: $m = m + 1$. If $m \le M$, go to step 2.
- When we need a prediction:
- Average all the models’ outputs (for regression),
- Or vote among their predictions (for classification).
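A minimal sketch of this bagging loop with decision trees might look like the following (my own illustration, assuming a pandas DataFrame `X` of predictors and a 0/1 Series `y`; the names and parameter values are hypothetical, and this is not the randomForest implementation).

```python
# Bagging sketch: fit M trees, each on its own bootstrap resample (the "bag"),
# then combine the trees' predictions. Assumes X (DataFrame) and y (0/1 Series).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
M = 100                  # number of models to fit
n = len(y)               # records per bag (the text allows n < N; n = N is also common)

models = []
for m in range(M):
    idx = rng.integers(0, len(y), size=n)     # bootstrap resample: draw n indices with replacement
    tree = DecisionTreeClassifier()
    tree.fit(X.iloc[idx], y.iloc[idx])        # fit f_m on the bag (Y_m, X_m)
    models.append(tree)

def bagged_predict(X_new):
    """Average the trees' class-1 probabilities (voting would work similarly)."""
    probs = np.mean([t.predict_proba(X_new)[:, 1] for t in models], axis=0)
    return (probs > 0.5).astype(int)
```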
Why Bagging is Beneficial
- Different trees make different errors. Each individual decision tree in a bagging ensemble tends to make its own mistakes, because each is trained on a different sample of the data.
- By averaging the predictions of these trees, random errors effectively cancel each other out. This process reduces the noise and enhances the overall prediction accuracy.
- Consequently, the overall variance—which refers to the model’s sensitivity to small fluctuations in the training dataset—decreases. This leads to the development of a more robust model that performs consistently across different datasets and conditions.
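A quick numerical illustration of the variance-reduction argument (a toy example of my own, not from the book): averaging many independent, equally noisy estimates shrinks the variance roughly in proportion to the number of estimates averaged.

```python
# Toy illustration: the variance of an average of 50 independent noisy
# "models" is roughly 1/50 of the variance of a single model.
import numpy as np

rng = np.random.default_rng(0)
truth = 10.0
single = truth + rng.normal(0, 2, size=100_000)                          # one noisy estimate
averaged = (truth + rng.normal(0, 2, size=(100_000, 50))).mean(axis=1)   # average of 50 estimates

print(round(single.var(), 2))     # ~4.0  (sigma^2)
print(round(averaged.var(), 2))   # ~0.08 (sigma^2 / 50)
```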
Random Forest
The random forest method applies bagging to decision trees, introducing one important extension: in addition to sampling the records, the algorithm also samples the variables.
In traditional decision trees, the algorithm selects the variable and split point for partitioning the data by minimizing a criterion such as Gini impurity. With random forests, at each stage of the algorithm, the choice of variable is limited to a random subset of variables.
A Random Forest is just Bagging applied to Decision Trees, plus one extra random twist:
- At each split, instead of considering all variables, the tree randomly chooses a subset of variables to consider.
This does two things:
- Forces trees to be even more diverse.
- Prevents trees from all focusing on the same “obvious” strong predictor (e.g., borrower_score).
The result is that the trees are different enough that averaging them provides a much better, more stable model.
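One way to see the difference in scikit-learn terms (an illustration of mine, not the book's code): plain bagging lets every split consider all predictors, while a random forest restricts each split to a random subset via the max_features parameter. The parameter values below are illustrative.

```python
# Bagged trees vs. a random forest: the extra "twist" is per-split feature
# sampling, controlled in scikit-learn by max_features.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Plain bagging: each tree sees a bootstrap sample, but every split may
# consider all of the predictors.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, random_state=1)

# Random forest: each split considers only a random subset of predictors,
# which makes the trees more diverse.
forest = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=1)
```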
Random Forest - Step by Step
Imagine building a forest:
- Bootstrap sample the records (pick random loans, with replacement).
- For each tree:
  - When we need to split the tree:
    - Randomly choose a subset of features (say, 3 out of 10).
    - Find the best split only among those 3 features.
  - Continue splitting until the tree is grown.
- Repeat this process to build hundreds of trees (maybe 500).

Then, for a new borrower:
- Ask all the trees what they think (paid off or default).
- Take a vote.
- Majority wins.
How many variables to sample at each step? A rule of thumb is to choose $\sqrt{P}$ where $P$ is the number of predictor variables.
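As a quick check of this rule of thumb (my own arithmetic, values illustrative): with $P = 10$ predictors it gives about 3 variables per split, and with the two-predictor loan model below it gives 1, which matches the “No. of variables tried at each split” reported by R.

```python
# Rule-of-thumb arithmetic: sample roughly sqrt(P) variables at each split.
import math

print(math.floor(math.sqrt(10)))   # 3 -> consider 3 of 10 predictors at each split
print(math.floor(math.sqrt(2)))    # 1 -> with two predictors, only 1 is tried per split
```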
The randomForest package implements the random forest in R. The following applies this package to the loan data that we covered previously.
- In R:

```r
library(randomForest)

rf <- randomForest(outcome ~ borrower_score + payment_inc_ratio,
                   data=loan3000)
rf
---
Call:
 randomForest(formula = outcome ~ borrower_score + payment_inc_ratio,
              data = loan3000)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 39.17%
Confusion matrix:
         default paid off class.error
default      873      572  0.39584775
paid off     603      952  0.38778135
```
- In Python, we use the method sklearn.ensemble.RandomForestClassifier:

```python
from sklearn.ensemble import RandomForestClassifier

predictors = ['borrower_score', 'payment_inc_ratio']
outcome = 'outcome'

X = loan3000[predictors]
y = loan3000[outcome]

rf = RandomForestClassifier(n_estimators=500, random_state=1, oob_score=True)
rf.fit(X, y)
```
By default, 500 trees are trained. Since there are only two variables in the predictor set, the algorithm randomly selects the variable on which to split at each stage (i.e., a random subset of size 1).
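As a small follow-up (assuming the fitted rf object above): with oob_score=True, scikit-learn reports the out-of-bag accuracy rather than the error rate, so the figure comparable to R's OOB error is one minus that score.

```python
# OOB error rate comparable to the R output above (rf is the fitted model).
print(f'OOB error rate: {1 - rf.oob_score_:.4f}')
```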