Day49 ML Review - Data Preprocessing (3)
Partitioning a Dataset into Training & Test Datasets, Feature Scaling, and Feature Selection
Partitioning a Dataset
A convenient way to randomly split a dataset into separate training and test datasets is to use the train_test_split function from scikit-learn's model_selection submodule.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
Providing the class label array y as an argument to stratify ensures that the training and test datasets have the same class proportions as the original dataset.
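A quick way to confirm that the split is stratified is to compare the class counts before and after splitting, for example with NumPy's bincount. This is a minimal sketch that assumes y holds integer-encoded class labels:

import numpy as np

# With stratify=y, the class proportions in y_train and y_test
# should mirror those in the original label array y.
print('Label counts in y:      ', np.bincount(y))
print('Label counts in y_train:', np.bincount(y_train))
print('Label counts in y_test: ', np.bincount(y_test))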
Choosing an Appropriate Ratio
Dividing a dataset into training and test datasets involves a trade-off: withholding more data for testing yields a more reliable estimate of generalization performance, but leaves less data for the model to learn from. Commonly used splits are 60:40, 70:30, or 80:20, chosen according to the size of the initial dataset. For large datasets, 90:10 or even 99:1 splits are common and appropriate, since even a small fraction still provides enough examples for a robust evaluation of the model's performance.
After training and evaluating a model, it is often beneficial to retrain the classifier on the entire dataset rather than discarding the held-out test data, since this can improve the model's predictive performance. However, if the dataset is small or the test data contains outliers, retraining on the entire dataset can lead to worse generalization performance. Furthermore, once the model has been retrained on the complete dataset, no independent data remains to assess its performance.
Feature Scaling
It’s essential to ensure that features are on the same scale for most machine learning and optimization algorithms to perform well. This is especially important for gradient descent-based optimization: when features are on very different scales, the weights are updated unevenly and convergence slows down.
Normalization and standardization are common approaches to bringing different features onto the same scale.
Normalization
Normalization usually involves rescaling the features to a range of [0,1] and can be thought of as a form of min-max scaling.
To normalize our data, we can apply min-max scaling to each feature column, where the new value $x^{(i)}_{norm}$ of an example $x^{(i)}$ can be calculated as follows:

$$x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$

Here, $x^{(i)}$ is a particular example, $x_{min}$ is the smallest value in a feature column, and $x_{max}$ is the largest value.
The min-max scaling procedure is implemented in scikit-learn and can be used as follows.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
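As a sanity check, the same transformation can be written out directly with NumPy. The sketch below assumes X_train is a numeric NumPy array and should reproduce MinMaxScaler's output column by column:

import numpy as np

# Min-max scaling by hand: x_norm = (x - x_min) / (x_max - x_min),
# computed per feature column on the training data.
X_min = X_train.min(axis=0)
X_max = X_train.max(axis=0)
X_train_manual = (X_train - X_min) / (X_max - X_min)

# Should agree with the scikit-learn result up to floating-point precision.
print(np.allclose(X_train_manual, X_train_norm))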
Standardization
In various linear models, such as logistic regression and support vector machines (SVMs), optimization starts by initializing the weights to 0 or small random values close to 0. Standardization transforms each feature column to have a mean of 0 and a standard deviation of 1 (the parameters of a standard normal distribution), which makes it easier to learn the weights. Moreover, standardization retains useful information about outliers and makes the model less sensitive to these extreme values, in contrast to min-max scaling, which compresses the data into a fixed range of values.
The standardization can be expressed by the following equation:

$$x^{(i)}_{std} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$

Here, $\mu_x$ is the sample mean of a particular feature column, and $\sigma_x$ is the corresponding standard deviation.
Again, it is important to highlight that we fit the StandardScaler class only once, on the training data, and use those parameters to transform the test dataset or any new data point.
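The usage follows the same pattern as MinMaxScaler above; a minimal sketch:

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# Fit on the training data only; reuse the learned mean and standard
# deviation to transform the test data and any future data points.
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)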
Scikit-learn also offers more robust feature scaling methods, such as RobustScaler, which is particularly useful for small datasets that contain outliers. It operates on each feature column independently, removing the median and scaling the data according to the 1st and 3rd quartiles (the interquartile range), which minimizes the impact of extreme values and outliers.
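A minimal usage sketch, following the same fit-on-training-data pattern as above:

from sklearn.preprocessing import RobustScaler

rbs = RobustScaler()
# Subtracts the per-column median and divides by the interquartile range
# (3rd quartile minus 1st quartile), so outliers have less influence on
# the scaling parameters.
X_train_robust = rbs.fit_transform(X_train)
X_test_robust = rbs.transform(X_test)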
Selecting Meaningful Features
When a model performs significantly better on the training dataset than on the test dataset, it is a strong indicator of overfitting. Overfitting occurs when a model fits its parameters too closely to the particular observations in the training dataset but fails to generalize well to new data; we say the model has high variance. Overfitting happens because the model is too complex for the given training data. Two common ways to reduce it are regularization and dimensionality reduction.
We have already covered these topics in earlier posts:
- Regularization: Day17 ML Review - Regularization
- Dimensionality reduction (partly): Day46 ML Review - K-Nearest Neighbors (3)
L2 regularization is a method used to decrease the complexity of a model by penalizing large individual weights. Another approach to reduce model complexity is L1 regularization, which involves replacing the square of the weights with the sum of the absolute values of the weights. In contrast to L2 regularization, L1 regularization typically results in sparse feature vectors, with most feature weights being zero. This sparsity can be beneficial when dealing with high-dimensional datasets containing many irrelevant features, especially when there are more irrelevant dimensions than training examples. From this perspective, L1 regularization is a technique for feature selection.
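As an illustration, an L1-penalized logistic regression can be fitted with scikit-learn's liblinear solver. This is a hedged sketch that assumes the standardized arrays X_train_std and X_test_std and the labels y_train and y_test from the scaling examples above:

from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength: a smaller C means a
# stronger L1 penalty and therefore more weights driven exactly to zero.
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:    ', lr.score(X_test_std, y_test))
print('Weight vector (sparse):')
print(lr.coef_)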
Sequential Feature Selection Algorithms
Reducing the number of features through dimensionality reduction is an effective way to simplify a model and prevent overfitting, and it is particularly beneficial for non-regularized models. There are two primary types of dimensionality reduction techniques: feature selection and feature extraction. With feature selection, we choose a subset of the original features, while feature extraction derives information from the feature set to construct a new feature subspace.
Sequential feature selection algorithms reduce an initial $d$-dimensional feature space to a $k$-dimensional feature subspace with $k < d$. They automatically select the features most relevant to the problem, which improves computational efficiency and can reduce the generalization error; this is especially useful for algorithms that do not support regularization. A classic sequential feature selection algorithm is Sequential Backward Selection (SBS), which aims to reduce the dimensionality of the initial feature space with a minimum decay in classifier performance, thereby improving computational efficiency. In some cases, SBS can even improve a model's predictive power when it suffers from overfitting. The classic SBS algorithm is not implemented as such in scikit-learn (although a related SequentialFeatureSelector is available in its feature_selection module), so we will implement it from scratch.
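A minimal from-scratch sketch of SBS is shown below; the class name, the internal validation split, and the accuracy-based scoring are illustrative implementation choices, not a scikit-learn API:

from itertools import combinations
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

class SBS:
    """Sequential Backward Selection (minimal sketch)."""
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.scoring = scoring
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Internal validation split used to score candidate feature subsets.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        self.scores_ = [self._calc_score(X_train, y_train,
                                         X_test, y_test, self.indices_)]
        # Greedily drop one feature at a time until k_features remain,
        # keeping the best-scoring subset at each step.
        while dim > self.k_features:
            scores, subsets = [], []
            for p in combinations(self.indices_, r=dim - 1):
                scores.append(self._calc_score(X_train, y_train,
                                               X_test, y_test, p))
                subsets.append(p)
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            self.scores_.append(scores[best])
            dim -= 1
        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        return self.scoring(y_test, y_pred)

For example, SBS(KNeighborsClassifier(n_neighbors=5), k_features=3).fit(X_train_std, y_train) would track the best subset at each size via the subsets_ and scores_ attributes.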
In scikit-learn, we can find a wide range of other feature selection algorithms. These include recursive feature elimination (RFE), which recursively removes features based on their weights; tree-based methods that select features by importance; and univariate statistical tests.
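The sketch below illustrates these three options; the choice of estimator and the number of selected features are arbitrary assumptions for demonstration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Recursive feature elimination: repeatedly fits the estimator and drops
# the weakest features until the requested number remains.
rfe = RFE(forest, n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)

# Tree-based selection: keeps features whose importance exceeds a threshold.
sfm = SelectFromModel(forest, threshold='median')
X_train_sfm = sfm.fit_transform(X_train, y_train)

# Univariate statistical test: scores each feature against the label with
# an ANOVA F-test and keeps the k highest-scoring features.
skb = SelectKBest(score_func=f_classif, k=5)
X_train_skb = skb.fit_transform(X_train, y_train)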