
Implementation Step by Step



Let’s talk about the random forest algorithm, which is based on decision trees and is known for its scalability and ease of use. A random forest is an ensemble of decision trees (other ensemble methods will be discussed later). The idea behind a random forest is to combine multiple (deep) decision trees, each of which may have high variance on its own, into a more reliable model with better generalization performance and a reduced risk of overfitting.

The random forest algorithm can be summarized in the following steps:

  1. Draw a random bootstrap sample of size n by randomly selecting n examples from the training dataset with replacement.
  2. Grow a decision tree from the bootstrap sample. At each node:
    • Randomly select d features without replacement.
    • Split the node using the feature that provides the best split according to the objective function, for instance, maximizing the information gain.
  3. Repeat steps 1-2 k times.
  4. Aggregate the predictions from all trees to assign the class label by majority vote. (We will discuss majority voting in more detail later.)


We need to slightly modify step 2 when training the individual decision trees: instead of evaluating all features to determine the best split at each node, we only consider a random subset of them, as sketched below.
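To make the four steps concrete, here is a minimal from-scratch sketch in Python. The helper names fit_random_forest and predict_random_forest are hypothetical (not part of any library), scikit-learn's DecisionTreeClassifier is used as the base learner, and X and y are assumed to be NumPy arrays with integer class labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, k=25, d=None, random_state=1):
    """Grow k decision trees on bootstrap samples (hypothetical helper)."""
    rng = np.random.RandomState(random_state)
    n, m = X.shape
    if d is None:
        d = max(1, int(np.sqrt(m)))   # size of the random feature subset per split
    trees = []
    for _ in range(k):
        # Step 1: bootstrap sample of size n, drawn with replacement
        idx = rng.choice(n, size=n, replace=True)
        # Step 2: grow a tree; max_features=d makes each split consider only
        # a random subset of d features (Gini impurity as the split criterion)
        tree = DecisionTreeClassifier(criterion='gini', max_features=d,
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        trees.append(tree)            # Step 3: repeat k times
    return trees

def predict_random_forest(trees, X):
    # Step 4: majority vote over the predictions of all k trees
    # (assumes integer class labels)
    preds = np.array([tree.predict(X) for tree in trees])
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                               axis=0, arr=preds)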


Hyperparameters

Random forests offer less interpretability than decision trees, but a big advantage is that we don’t have to worry as much about choosing good hyperparameter values. We usually don’t need to prune the random forest, since the ensemble model is quite robust to noise. The only parameter that really matters in practice is the number of trees, denoted as $k$, that we choose for the random forest.

In most implementations, such as the RandomForestClassifier in scikit-learn, the size of the bootstrap sample is set equal to the number of training examples in the original dataset, which usually provides a good bias-variance tradeoff. For the number of features, denoted as $d$, considered at each split, we want a value that is smaller than the total number of features in the training dataset. A reasonable default used in scikit-learn and other implementations is $d=\sqrt{m}$, where $m$ is the number of features in the training dataset.
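Translated into scikit-learn's parameter names, $k$ corresponds to n_estimators and $d$ to max_features; the short sketch below simply makes those defaults explicit (the value of 100 trees is an arbitrary choice for illustration).

from sklearn.ensemble import RandomForestClassifier

# k = 100 trees; d = sqrt(m) features per split; bootstrap samples of size n
forest = RandomForestClassifier(n_estimators=100,
                                max_features='sqrt',
                                bootstrap=True,
                                random_state=1)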


from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='gini', n_estimators=25,
                                random_state=1, n_jobs=2)
forest.fit(X_train, y_train)
# plot_decision_regions, X_combined, and y_combined are assumed to be defined earlier
plot_decision_regions(X_combined, y_combined, classifier=forest,
                      test_idx=range(105, 150))


In the code above, we trained a random forest of 25 decision trees via the n_estimators parameter and used the Gini impurity measure as the criterion to split the nodes. Although we are growing a small random forest from a small training dataset, we set the n_jobs parameter for demonstration purposes; it allows us to parallelize the model training across multiple cores of our computer (here, two cores).
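Assuming held-out X_test and y_test splits exist from earlier in the workflow, we can check how the fitted forest generalizes with a quick accuracy check:

from sklearn.metrics import accuracy_score

y_pred = forest.predict(X_test)   # majority vote over the 25 trees
print(f'Test accuracy: {accuracy_score(y_test, y_pred):.3f}')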


