
Implementation Step by Step



Let's talk about the random forest algorithm, which is based on decision trees and is known for its scalability and ease of use. A random forest is an ensemble of decision trees (other ensemble methods will be discussed later). The idea behind a random forest is to combine multiple (deep) decision trees, each of which may have high variance on its own, into a more reliable model with better generalization performance and a reduced risk of overfitting.

The Algorithm in Four Steps

  1. Draw a bootstrap sample of size n by randomly selecting n examples from the training dataset with replacement.
  2. Grow a decision tree from the bootstrap sample. At each node:
    • Randomly select d features without replacement.
    • Split the node using the feature that provides the best split according to the objective function, for instance, maximizing the information gain.
  3. Repeat steps 1-2 k times.
  4. Aggregate the predictions made by each tree to assign the class label by majority vote. We will discuss majority voting in more detail later.


Note the slight modification of step 2 compared to training an individual decision tree: instead of evaluating all features to determine the best split at each node, we consider only a random subset of them.
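
To make the four steps concrete, here is a minimal from-scratch sketch (not how scikit-learn implements it internally). It assumes a NumPy feature matrix X and integer-encoded class labels y, and it reuses scikit-learn's DecisionTreeClassifier for the individual trees, whose max_features argument handles the random feature subset at each split; the fit_random_forest and predict_random_forest helper names are made up for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, k=25, d='sqrt', random_state=1):
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        # Step 1: bootstrap sample of size n, drawn with replacement
        idx = rng.choice(n, size=n, replace=True)
        # Step 2: grow a tree that considers only a random subset of d features per split
        tree = DecisionTreeClassifier(max_features=d, random_state=rng)
        tree.fit(X[idx], y[idx])
        trees.append(tree)  # Step 3: repeat k times
    return trees

def predict_random_forest(trees, X):
    # Step 4: aggregate the per-tree predictions by majority vote
    preds = np.array([tree.predict(X) for tree in trees])
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), axis=0, arr=preds)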


Hyperparameters

Random forests offer less interpretability than individual decision trees, but we don't have to worry as much about choosing good hyperparameter values. We usually don't need to prune the random forest since the ensemble model is quite robust to noise from the individual trees. The only parameter that really matters in practice is the number of trees, denoted as $k$, that we choose for the random forest.

In most implementations, such as the RandomForestClassifier in scikit-learn, the size of the bootstrap sample is set to be equal to the number of training examples in the original dataset, which usually provides a good balance between bias and variance. As for the number of features, denoted as $d$, considered at each split, we should pick a value that is smaller than the total number of features in the training dataset. A reasonable default used in scikit-learn and other implementations is $d=\sqrt{m}$, where $m$ is the number of features in the training dataset.
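
For reference, here is how these defaults map onto the RandomForestClassifier parameters in recent scikit-learn versions; the values below are simply the library's documented defaults spelled out explicitly.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # k: the number of trees in the forest
    bootstrap=True,       # draw bootstrap samples ...
    max_samples=None,     # ... of the same size as the training dataset
    max_features='sqrt',  # d = sqrt(m) features considered at each split
)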


from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_combined, y_combined, and plot_decision_regions are
# assumed to be defined beforehand (train/test split and plotting helper).
forest = RandomForestClassifier(criterion='gini', n_estimators=25, random_state=1, n_jobs=2)
forest.fit(X_train, y_train)
plot_decision_regions(X_combined, y_combined, classifier=forest, test_idx=range(105, 150))


In the code above, we trained a random forest with 25 decision trees via the n_estimators parameter and used the Gini impurity measure as the criterion for splitting the nodes. Although we are growing a small random forest from a small training dataset, we set the n_jobs parameter for demonstration purposes; it allows us to parallelize the model training across multiple cores of our computer (in this case, 2 cores).
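
As a quick follow-up, and assuming X_test and y_test come from the same train/test split as X_train and y_train, we can sanity-check the fitted forest and inspect which features it relied on most:

print('Test accuracy: %.3f' % forest.score(X_test, y_test))
print('Feature importances:', forest.feature_importances_)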


