Day48 ML Review - Data Preprocessing (2)
Handling Categorical Data - Converting, Ordinal Encoding, and One-Hot Encoding
Dealing with Categorical Data
When discussing categorical data, we must distinguish between ordinal and nominal features. Ordinal features are categorical values that can be sorted or ordered, such as t-shirt sizes (XL > L > M). Nominal features, on the other hand, do not imply any order, like t-shirt colors. Let's create an example DataFrame as below.
import pandas as pd

# Example data: one nominal feature (color), one ordinal feature (size),
# one numeric feature (price), and a class label
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['red', 'L', 13.5, 'class1'],
    ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
1. Mapping Ordinal Features
We need to convert the categorical string values into integers to ensure that the learning algorithm interprets the ordinal features correctly. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually.
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}
# Map the string sizes to integers; df['size'] is now 1, 2, 3
df['size'] = df['size'].map(size_mapping)
We can create a reverse-mapping dictionary to convert the integer values back to their original string representation later on. This dictionary, inv_size_mapping, can be defined as:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
We can then use this dictionary with the pandas map method on the transformed feature column, similar to how we used the size_mapping dictionary previously.
df['size'].map(inv_size_mapping)
2. Encoding Class Labels
It is good practice to provide class labels as integer arrays, since many machine learning libraries expect them in that form and doing so avoids technical glitches. To encode the class labels, we can use an approach similar to the one used for mapping ordinal features.
It's important to note that class labels are not ordinal, so it doesn't matter which integer we assign to a specific label; we can simply enumerate the class labels, starting at 0.
import numpy as np

# Enumerate the unique class labels, starting at 0
class_mapping = {label: idx for idx, label
                 in enumerate(np.unique(df['classlabel']))}
# class_mapping is now {'class1': 0, 'class2': 1}
Then, we can use the mapping dictionary to convert the class labels into integers.
df['classlabel'] = df['classlabel'].map(class_mapping)
We can reverse the key-value pairs in the mapping dictionary to map the converted class labels back to their original string representation.
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
Alternatively, we can use the convenient LabelEncoder class implemented directly in scikit-learn to achieve the same result. It's important to note that the fit_transform method is just a shortcut for calling fit and transform separately.
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
We can use the inverse_transform method to transform the integer class labels back into their original string representation.
class_le.inverse_transform(y)
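For our three examples, LabelEncoder assigns the integers in sorted label order, so class1 becomes 0 and class2 becomes 1; a quick round trip then looks like this (expected output shown as comments):

y
# array([1, 0, 1])
class_le.inverse_transform(y)
# array(['class2', 'class1', 'class2'], dtype=object)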
3. Performing One-Hot Encoding on Nominal Features
In the earlier section on Mapping ordinal features, we converted the ordinal size feature into integers using a simple dictionary-mapping approach. As scikit-learn’s classification estimators treat class labels as categorical data without any implied order (nominal), we conveniently used the `LabelEncoder` to encode the string labels into integers. It might seem that we could employ a similar approach to transform the nominal color column of our dataset as follows:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
After running the code, the first column of the NumPy array X now contains the new color values: blue = 0, green = 1, and red = 2. Although the color values are not in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. This assumption is incorrect; the algorithm could still produce useful results, but those results would not be optimal.
To handle this issue, we can use a technique known as one-hot encoding. This involves creating a new binary feature for each unique value in the nominal feature column. For instance, for a color feature with the values blue, green, and red, we would create three new binary features and, for each example, use binary values to indicate the presence of a particular color (e.g., blue = 1, green = 0, red = 0). The OneHotEncoder class in scikit-learn's preprocessing module can be used to carry out this transformation.
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
# reshape(-1, 1) turns the 1D color column into the 2D array OneHotEncoder
# expects; toarray() converts the sparse result into a regular NumPy array
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
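As a side note, an even more convenient way to create these dummy features is pandas' get_dummies function, which one-hot encodes only the string columns of a DataFrame and leaves all numeric columns untouched. A minimal sketch on our example data:

# get_dummies encodes only the string column ('color' here);
# 'size' and 'price' are already numeric and pass through unchanged
pd.get_dummies(df[['price', 'color', 'size']])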
If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, columns) tuples as follows.
from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('nothing', 'passthrough', [1, 2])])
# Cast the result to float after encoding (the input still contains strings)
c_transf.fit_transform(X).astype(float)
In the preceding code, we specified that we wanted to modify only the first column and leave the other two columns untouched via the passthrough argument.
When using one-hot encoding on datasets, it's important to be aware of potential multicollinearity issues. These can be problematic for certain methods, such as those that require matrix inversion: when features are highly correlated, matrices become computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate this, one approach is to simply remove one feature column from the one-hot encoded array. Note that we don't lose any information by doing so; for example, if we remove the color_blue column, observing color_green = 0 and color_red = 0 still implies that the color must be blue.
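As a rough sketch of that column-dropping step, OneHotEncoder accepts a drop parameter, which can be combined with the ColumnTransformer from above:

# drop='first' removes the first (redundant) dummy column of each encoded feature
color_ohe = OneHotEncoder(categories='auto', drop='first')
c_transf = ColumnTransformer([
    ('onehot', color_ohe, [0]),
    ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)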