Day48 ML Review - Data Preprocessing (2)
Handling Categorical Data - Converting, Ordinal Encoding, and One-Hot Encoding
Dealing with Categorical Data
When discussing categorical data, we must distinguish between ordinal and nominal features. Ordinal features are categorical values that can be sorted or ordered, such as t-shirt sizes (XL > L > M). Nominal features, on the other hand, do not imply any order, like t-shirt colors. Let's create an example DataFrame as below.
import pandas as pd

# Example data: one nominal feature (color), one ordinal feature (size),
# one numeric feature (price), and a class label
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
1. Mapping Ordinal Features
We need to convert the categorical string values into integers to ensure that the learning algorithm interprets the ordinal features correctly. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually.
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}
# Map the string sizes to integers; df['size'] is now 1, 2, 3
df['size'] = df['size'].map(size_mapping)
We can create a reverse-mapping dictionary to convert the integer values back to their original string representation later on. This dictionary, inv_size_mapping, can be defined as:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
We can then use this dictionary with the pandas map method on the transformed feature column, similar to how we used the size_mapping dictionary previously.
df['size'].map(inv_size_mapping)
2. Encoding Class Labels
It is good practice to provide class labels as integer arrays, since many machine learning libraries expect them in that form and doing so avoids technical glitches. To encode the class labels, we can use an approach similar to the one used for mapping ordinal features.
It's important to note that class labels are not ordinal, so it doesn't matter which integer we assign to a specific label; we can simply enumerate the class labels, starting at 0.
import numpy as np

# Enumerate the unique class labels, starting at 0
class_mapping = {label: idx for idx, label
                 in enumerate(np.unique(df['classlabel']))}
# class_mapping is now {'class1': 0, 'class2': 1}
Then, we can use the mapping dictionary to convert the class labels into integers.
df['classlabel'] = df['classlabel'].map(class_mapping)
We can reverse the key-value pairs in the mapping dictionary to map the converted class labels back to their original string representation.
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
Alternatively, we can use the convenient LabelEncoder class implemented directly in scikit-learn to achieve the same result. It's important to note that the fit_transform method is just a shortcut for calling fit and transform separately.
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
We can use the inverse_transform method to transform the integer class labels back into their original string representation.
class_le.inverse_transform(y)
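For our three examples, LabelEncoder assigns the integers in sorted label order, so class1 becomes 0 and class2 becomes 1; a quick round trip then looks like this (expected output shown as comments):

y
# array([1, 0, 1])
class_le.inverse_transform(y)
# array(['class2', 'class1', 'class2'], dtype=object)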
3. Performing One-Hot Encoding on Nominal Features
In the earlier section on Mapping ordinal features, we converted the ordinal size feature into integers using a simple dictionary-mapping approach. As scikit-learn’s classification estimators treat class labels as categorical data without any implied order (nominal), we conveniently used the `LabelEncoder` to encode the string labels into integers. It might seem that we could employ a similar approach to transform the nominal color column of our dataset as follows:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
After running the code, the first column of the NumPy array X now contains the new color values: blue = 0, green = 1, and red = 2. Although the color values are not in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. This assumption is incorrect; the algorithm could still produce useful results, but those results would not be optimal.
To handle this issue, we can use a technique known as one-hot encoding. This involves creating a new binary feature for each unique value in the nominal feature column. For instance, for a color feature with the values blue, green, and red, we would create three new binary features and, for each example, use binary values to indicate the presence of a particular color (e.g., blue = 1, green = 0, red = 0). The OneHotEncoder class in scikit-learn's preprocessing module can be used to carry out this transformation.
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
# reshape(-1, 1) turns the 1D color column into the 2D array OneHotEncoder
# expects; toarray() converts the sparse result into a regular NumPy array
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
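As a side note, an even more convenient way to create these dummy features is pandas' get_dummies function, which one-hot encodes only the string columns of a DataFrame and leaves all numeric columns untouched. A minimal sketch on our example data:

# get_dummies encodes only the string column ('color' here);
# 'size' and 'price' are already numeric and pass through unchanged
pd.get_dummies(df[['price', 'color', 'size']])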
If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, columns) tuples as follows.
from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('nothing', 'passthrough', [1, 2])])
# Cast the result to float after encoding (the input still contains strings)
c_transf.fit_transform(X).astype(float)
In the preceding code, we specified that we wanted to modify only the first column and leave the other two columns untouched via the passthrough argument.
When using one-hot encoding on datasets, it's important to be aware of potential multicollinearity issues. These can be problematic for certain methods, such as those that require matrix inversion: when features are highly correlated, matrices become computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate this, one approach is to simply remove one feature column from the one-hot encoded array. Note that we don't lose any information by doing so; for example, if we remove the color_blue column, observing color_green = 0 and color_red = 0 still implies that the color must be blue.
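As a rough sketch of that column-dropping step, OneHotEncoder accepts a drop parameter, which can be combined with the ColumnTransformer from above:

# drop='first' removes the first (redundant) dummy column of each encoded feature
color_ohe = OneHotEncoder(categories='auto', drop='first')
c_transf = ColumnTransformer([
    ('onehot', color_ohe, [0]),
    ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)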