
Practical Statistics for Data Scientists: Strategies for Imbalanced Data (1) (Undersampling & Oversampling)



Strategies for Imbalanced Data

When working with imbalanced data—where one class (typically the one we care about the most, like fraud, purchase, or loan default) is much rarer than the other—traditional modeling and evaluation techniques often fall short. This section outlines strategies for handling imbalanced datasets, making our models more effective and fair.

Why Is Imbalanced Data a Problem?

Imagine 99% of loans are repaid and only 1% default. A model that always predicts “paid off” is 99% accurate but misses all the defaults you care about most.
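
To make this concrete, here is a minimal sketch (using made-up labels, not the actual loan data) showing that a classifier that always predicts the majority class reaches 99% accuracy while recalling none of the defaults:

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # 1,000 hypothetical loans: 99% paid off (0), 1% in default (1)
    y_true = np.array([0] * 990 + [1] * 10)
    y_pred = np.zeros_like(y_true)   # always predict "paid off"

    print('accuracy:', accuracy_score(y_true, y_pred))        # 0.99
    print('default recall:', recall_score(y_true, y_pred))    # 0.0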

Hence, new strategies are needed for:

  • Training the model more effectively on the rare cases
  • Choosing more relevant evaluation metrics


Key Terms for Imbalanced Data

  • Undersample
    • Use fewer of the prevalent class records in the classification model.
    • = Downsample
  • Oversample
    • Use more rare class records in the classification model, bootstrapping if necessary.
    • = Upsample
  • Up-weight or down-weight
    • Attach more (or less) weight to the rare (or prevalent) class in the model.
  • Data generation
    • Similar to bootstrapping, but each new bootstrapped record differs slightly from its source.
  • z-score
    • The value obtained after standardization.
  • K
    • The number of neighbors included in the nearest neighbor calculation.


Undersampling

If we have enough data, as with the loan data, one solution is to undersample (or downsample) the prevalent class to create a more balanced dataset of 0s and 1s. The basic idea is that the dominant class contains many redundant records, so dropping some of them loses little information. A smaller, balanced dataset enhances model performance and simplifies data preparation, exploration, and testing.

How much data is enough? It depends on the application, but tens of thousands of records for the less dominant class are usually sufficient. The more easily the 1s are distinguished from the 0s, the less data is needed.
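
As a rough sketch of how undersampling could be done with pandas (assuming the full_train_set data frame and the outcome labels 'default' and 'paid off' used in the examples below), we keep every rare-class record and draw an equally sized random sample from the prevalent class:

    import pandas as pd

    defaults = full_train_set[full_train_set.outcome == 'default']
    paid_off = full_train_set[full_train_set.outcome == 'paid off']

    # keep all rare-class records; sample the same number from the prevalent class
    paid_off_down = paid_off.sample(n=len(defaults), random_state=1)
    balanced_train_set = pd.concat([defaults, paid_off_down])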

For example, the loan data used in “Logistic Regression” was a balanced training set: half of the loans were paid off, and the other half were in default. The predicted values were split similarly: about half of the probabilities were below 0.5 and half above.

  • In the full data set, only about 19% of the loans were in default, as shown in R:

    mean(full_train_set$outcome=='default')
    ---
    [1] 0.1889455
    
  • In Python:

    import numpy as np

    print('percentage of loans in default: ',
          100 * np.mean(full_train_set.outcome == 'default'))
    

What if we use the entire data set to train the model?

  • Let’s see this in R:

    full_model <- glm(outcome ~payment_inc_ratio + purpose_ + home_ +
                     emp_len_ + dti + revol_bal + revol_util,
                     data=full_train_set, family='binomial')
    pred <- predict(full_model)
    mean(pred > 0)
    ---
    [1] 0.003942094
    
  • In Python,

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    predictors = ['payment_inc_ratio', 'purpose_', 'home_', 'emp_len_', 
                  'dti', 'revol_bal', 'revol_util']
    outcome = 'outcome'

    # one-hot encode the categorical predictors
    X = pd.get_dummies(full_train_set[predictors], prefix='', prefix_sep='', 
                       drop_first=True)
    y = full_train_set[outcome]

    # a very large C effectively turns off regularization
    full_model = LogisticRegression(penalty='l2', C=1e42, solver='liblinear')
    full_model.fit(X, y)
    print('Percentage of loans predicted to default: ',
          100 * np.mean(full_model.predict(X) == 'default'))
    

Only 0.39% of the loans are predicted to default, less than 1/47 of the expected number. Because the model is trained on all the data equally, the loans that were paid off overwhelm the loans in default.

Intuitively, the sheer number of non-defaulting loans, combined with the inevitable variability in the predictor data, means that even a defaulting loan is likely to resemble some non-defaulting loans purely by chance, and the model classifies it accordingly. With a balanced sample, by contrast, roughly 50% of the loans were predicted to be in default.


Oversampling and Up/Down Weighting

A criticism of undersampling is that it throws away data and does not use all the information available. If the data set is relatively small and the rarer class contains only a few hundred or a few thousand records, undersampling the dominant class risks discarding useful information. In that case, instead of downsampling the dominant class, we should oversample (upsample) the rarer class by drawing additional rows with replacement (bootstrapping).
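
A minimal sketch of bootstrapped oversampling with pandas (again assuming full_train_set and the outcome labels 'default' and 'paid off'): sample the rare class with replacement until it matches the prevalent class in size.

    import pandas as pd

    defaults = full_train_set[full_train_set.outcome == 'default']
    paid_off = full_train_set[full_train_set.outcome == 'paid off']

    # bootstrap the rare class: sample with replacement up to the size of the prevalent class
    defaults_up = defaults.sample(n=len(paid_off), replace=True, random_state=1)
    upsampled_train_set = pd.concat([paid_off, defaults_up])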

Weighting the data can achieve a similar effect. Many classification algorithms accept a weight argument that lets you up-weight the rare class (or down-weight the prevalent one).

For example, a weight vector can be applied to the loan data using the weight argument to glm in R.

  • In R

    wt <- ifelse(full_train_set$outcome=='default',
                 1 / mean(full_train_set$outcome == 'default'), 1)
    full_model <- glm(outcome ~ payment_inc_ratio + purpose_ + home_ +
                                emp_len_+ dti + revol_bal + revol_util,
                      data=full_train_set, weight=wt, family='quasibinomial')
    pred <- predict(full_model)
    mean(pred > 0)
    ---
    [1] 0.5767208
    
  • In Python, most scikit-learn methods allow the specification of weights in the fit function by using the keyword argument sample_weight.

    # weight each defaulted loan by 1/p, where p is the overall default rate;
    # paid-off loans keep weight 1
    default_wt = 1 / np.mean(full_train_set.outcome == 'default')
    wt = [default_wt if outcome == 'default' else 1
          for outcome in full_train_set.outcome]
      
    full_model = LogisticRegression(penalty='l2', C=1e42, solver='liblinear')
    full_model.fit(X, y, sample_weight=wt)
    print('percentage of loans predicted to default (weighting): ',
          100 * np.mean(full_model.predict(X) == 'default'))
    

The weights for defaulted loans are set to $\frac{1}{p}$, where $p$ is the probability of default; non-defaulting loans keep a weight of 1. This makes the total weights for defaulting and non-defaulting loans roughly comparable, and the mean of the predicted values is now around 58% instead of 0.39%. Note that weighting provides an alternative to both upsampling the rarer class and downsampling the dominant class.
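
The arithmetic behind "roughly comparable": with p ≈ 0.19, each of the n·p defaults gets weight 1/p, so the defaults contribute a total weight of n, while the n·(1 - p) paid-off loans contribute about 0.81·n. A quick check of this (my own, reusing the wt vector built above):

    import numpy as np

    wt_arr = np.array(wt)
    is_default = (full_train_set.outcome == 'default').values

    # defaults: (n * p) * (1/p) = n; paid-off: n * (1 - p), about 81% of n
    print('total weight, defaults: ', wt_arr[is_default].sum())
    print('total weight, paid off: ', wt_arr[~is_default].sum())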


Adapting the Loss Function

Many classification and regression algorithms optimize specific criteria or loss functions. For instance, logistic regression aims to minimize deviance. In the literature, some researchers suggest modifying the loss function to address issues caused by rare classes. This can be challenging in practice: classification algorithms can be complex and difficult to modify. Weighting offers a straightforward way to change the loss function, discounting errors from records with low weights in favor of records with higher weights.
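
For example, scikit-learn's LogisticRegression (like many of its estimators) exposes a class_weight parameter; setting it to 'balanced' reweights each class inversely to its frequency, which has roughly the same effect as the explicit sample_weight vector above. A sketch, reusing the X and y built from full_train_set earlier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # 'balanced' weights each class inversely to its frequency in y
    weighted_model = LogisticRegression(penalty='l2', C=1e42, solver='liblinear',
                                        class_weight='balanced')
    weighted_model.fit(X, y)
    print('percentage predicted to default (class_weight): ',
          100 * np.mean(weighted_model.predict(X) == 'default'))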


