Day37 ML Review - Decision Tree (3) & Random Forest (1)
Building a Decision Tree & Random Forest (1) - Key Concepts & How it Works
(Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition)
Decision trees can build complex decision boundaries by dividing the feature space into rectangles. However, we must be careful: the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. We will now train a decision tree with a maximum depth of 4, using Gini impurity as the impurity criterion.
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
tree_model.fit(X_train, y_train)
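The snippet above assumes that X_train and y_train have already been prepared. A minimal, self-contained sketch, assuming the Iris dataset (petal length and petal width) and a 70/30 stratified split purely for illustration, might look like this:

# Hypothetical data preparation for the decision tree example above
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target  # petal length, petal width (assumed features)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

tree_model = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
tree_model.fit(X_train, y_train)
print('Test accuracy: %.3f' % tree_model.score(X_test, y_test))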
- Parameters
  - criterion: specifies the function to measure the quality of a split; gini and entropy can be used.
  - max_depth: specifies the maximum depth of the tree. Limiting the depth of the tree helps prevent overfitting.
    - If None, the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
  - min_samples_split: the minimum number of samples required to split an internal node.
    - Any positive integer (default is 2).
    - A float between 0 and 1 represents a fraction of the samples.
A nicer visualization of the trained tree model can be obtained as shown below.
from sklearn import tree
import matplotlib.pyplot as plt
tree.plot_tree(tree_model)  # draws the fitted decision tree
plt.show()
Combining Multiple Decision Trees via Random Forest
Random Forest is an ensemble learning method that is used for classification, regression, and other tasks. It works by creating multiple decision trees during the training process and then producing the mode (for classification) or mean prediction (for regression) from the individual trees. Random Forests address the tendency of decision trees to overfit their training data.
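In scikit-learn, a random forest is available as RandomForestClassifier. A minimal sketch, reusing the X_train/y_train split from the decision tree example above and using illustrative parameter values, might look like this:

from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees; the other values here are illustrative choices
forest = RandomForestClassifier(n_estimators=100,
                                criterion='gini',
                                max_features='sqrt',  # random subset of features at each split
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train, y_train)
print('Test accuracy: %.3f' % forest.score(X_test, y_test))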
Key Concepts
- Ensemble Learning: Combining the predictions of multiple models to improve the overall performance. Random Forest is a type of ensemble learning.
- Bagging (Bootstrap Aggregating): Random Forest uses bagging, which means it builds multiple decision trees on different samples of the dataset. Each sample is drawn with replacement, meaning some data points can appear multiple times in a single sample.
- Random Subspace Method: When splitting nodes during the construction of the trees, a Random Forest considers a random subset of features. This ensures diversity among the trees.
How Random Forest Works
- Bootstrapping: Create multiple bootstrap samples from the original dataset. Each bootstrap sample is a random sample with replacement.
- Tree Construction: For each bootstrap sample, grow a decision tree. During the construction of each tree, a random subset of features is selected at each split. This introduces additional randomness and prevents the trees from being too correlated.
- Prediction Aggregation:
- Classification: Each tree votes for a class, and the class with the most votes is the final prediction.
- Regression: The final prediction is the average of the predictions from all trees.
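To make the three steps above concrete, here is a rough from-scratch sketch for classification. The function names and parameter choices are assumptions for illustration only; in practice, scikit-learn's RandomForestClassifier should be preferred.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, random_state=1):
    """Grow n_trees decision trees, each on a bootstrap sample, with a
    random subset of features considered at every split.
    Assumes X, y are NumPy arrays and y holds non-negative integer labels."""
    rng = np.random.RandomState(random_state)
    trees = []
    n_samples = X.shape[0]
    for _ in range(n_trees):
        # 1. Bootstrapping: draw n_samples indices with replacement
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        # 2. Tree construction: max_features='sqrt' makes each split
        #    consider only a random subset of the features
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1_000_000))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """3. Prediction aggregation: majority vote over the trees' predictions."""
    all_preds = np.array([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
    # For each sample, pick the most frequently predicted class
    return np.array([np.bincount(col).argmax() for col in all_preds.T])

# Usage (reusing X_train, y_train, X_test from the examples above):
# trees = fit_forest(X_train, y_train)
# y_pred = predict_forest(trees, X_test)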
Side Note
Sampling With Replacement
Sampling “with replacement” means that each data point from the original dataset can be selected multiple times in the same bootstrap sample. Here’s how it works:
- Original Dataset: Consider an original dataset with $n$ data points.
- Bootstrap Sample: To create a bootstrap sample, you randomly select $n$ data points from the original dataset. However, each time you select a data point, you replace it back into the dataset before the next selection. This means each data point can be chosen more than once, and some data points from the original dataset might not be chosen at all.
Example
Suppose you have a dataset with the following five data points: [1,2,3,4,5].
- Step 1: Randomly select a data point, say $3$. Add it to the bootstrap sample.
- Step 2: Replace $3$ back into the original dataset.
- Step 3: Randomly select another data point, say $2$. Add it to the bootstrap sample.
- Step 4: Replace $2$ back into the original dataset.
- Continue: Repeat this process until you have $n$ data points in the bootstrap sample.
An example of a bootstrap sample might be: [3,2,3,5,1].
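The same procedure can be expressed in one line with NumPy. The seed here is arbitrary, so the resulting sample will differ from the hand-worked one above:

import numpy as np

data = np.array([1, 2, 3, 4, 5])
rng = np.random.RandomState(42)  # arbitrary seed
bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)  # some values repeat, others are left out entirely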
Importance of Sampling with Replacement
- Diversity of Trees: By creating multiple bootstrap samples with replacement, each decision tree in the forest is trained on a different subset of the original data. This introduces variability among the trees, which is crucial for reducing overfitting and improving the generalization ability of the Random Forest.
- Bias-Variance Trade-off: The randomness introduced by sampling with replacement helps to balance the bias-variance trade-off. Individual trees might be high-variance (they overfit the bootstrap sample), but by averaging their predictions (in regression) or taking a majority vote (in classification), the Random Forest achieves lower variance and, consequently, a more robust model.
- Out-of-Bag Error: Since each bootstrap sample is drawn with replacement, on average, about 63.2% of the original data points are included in each bootstrap sample. The remaining data points, known as “out-of-bag” (OOB) samples, can be used to estimate the error of the Random Forest model without needing a separate validation set. This provides a built-in cross-validation mechanism.
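The 63.2% figure comes from the probability that a given point is never drawn in $n$ draws with replacement, $(1 - 1/n)^n \to e^{-1} \approx 0.368$, so roughly $1 - e^{-1} \approx 0.632$ of the points end up in each bootstrap sample. A quick numerical check (the dataset size and number of trials below are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
n = 1_000       # dataset size (arbitrary)
trials = 200    # number of bootstrap samples to average over (arbitrary)

fractions = []
for _ in range(trials):
    idx = rng.choice(n, size=n, replace=True)
    fractions.append(len(np.unique(idx)) / n)  # fraction of distinct points drawn

print(np.mean(fractions))   # ~0.632
print(1 - np.exp(-1))       # 0.6321..., the theoretical limit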