Day136 - STAT Review: Classification (6)
Practical Statistics for Data Scientists: Evaluating Classification Models (Confusion Matrix, ROC-AUC & Lift)
Evaluation of Classification Models
Evaluating a classification model involves more than simply checking how many predictions are correct. Particularly in real-world scenarios where classes are imbalanced, such as fraud detection or disease diagnosis, we need more nuanced metrics to assess a model’s performance effectively.
In predictive modeling, candidate models are trained and then assessed on holdout samples to evaluate their performance. Once the models have been tuned, a fresh holdout sample is used to estimate how the selected model will perform on data it has not seen. Different disciplines use the terms validation and test interchangeably for these holdout samples. The underlying goal is to determine which model yields the most accurate predictions.
Key Terms for Evaluating Classification Models
- Accuracy
- The percent (or proportion) of cases classified correctly.
- Confusion Matrix
- A tabular display (2 × 2 in the binary case) showing the record counts by their predicted and actual classification status.
- Sensitivity
- The percentage (or proportion) of all 1s correctly classified as 1s.
- Also known as recall.
- Specificity
- The percentage (or proportion) of all 0s correctly classified as 0s.
- Precision
- The percentage (or proportion) of predicted 1s that are 1s.
- ROC curve
- A plot of sensitivity versus specificity.
- Lift
- A measure of the model’s effectiveness in identifying (comparatively rare) 1s at various probability cutoffs.
1. Accuracy
A fundamental metric for evaluating classification is accuracy, defined as the proportion of all predictions that are correct:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Accuracy is simply a measure of total error. In many classification algorithms, each case is assigned an "estimated probability of being a 1." The standard decision threshold, or cutoff, is typically 0.50 (50%): if the probability exceeds 0.5, the case is classified as a "1"; otherwise it is a "0." Another natural default cutoff is the prevalence of 1s in the data (the observed proportion of 1s).
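To make the cutoff idea concrete, here is a minimal Python sketch (the probability and label arrays are hypothetical toy values, not the loan data):

```python
import numpy as np

# Hypothetical predicted probabilities of being a 1, and the true labels
p_hat = np.array([0.9, 0.3, 0.6, 0.1, 0.8])
y     = np.array([1,   0,   1,   0,   0  ])

cutoff = 0.5                           # default decision threshold
y_pred = (p_hat > cutoff).astype(int)  # classify as 1 when the probability exceeds the cutoff

accuracy = (y_pred == y).mean()        # proportion of cases classified correctly
print(accuracy)                        # 0.8 for this toy example
```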
2. Confusion Matrix
At the core of classification metrics lies the confusion matrix, which is a table that displays the number of correct and incorrect predictions, organized by type of response.
A 2×2 table (for binary classification):
| Actual \ Predicted | Predicted 1 (Yes) | Predicted 0 (No) |
|---|---|---|
| Actual 1 (Yes) | TP (True Pos.) | FN (False Neg.) |
| Actual 0 (No) | FP (False Pos.) | TN (True Neg.) |
From this table, we can define more detailed metrics.

To explain the confusion matrix, let's consider a `logistic_gam` model trained on a balanced dataset with an equal number of defaulted and paid-off loans. Following the usual convention, $Y = 1$ denotes the event of interest (for example, default), whereas $Y = 0$ denotes the negative or typical event (a loan being paid off).
In R, the following computes the confusion matrix for the `logistic_gam` model applied to the entire (unbalanced) training set:

```r
pred <- predict(logistic_gam, newdata=train_set)
pred_y <- as.numeric(pred > 0)
true_y <- as.numeric(train_set$outcome=='default')
true_pos <- (true_y==1) & (pred_y==1)
true_neg <- (true_y==0) & (pred_y==0)
false_pos <- (true_y==0) & (pred_y==1)
false_neg <- (true_y==1) & (pred_y==0)
conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                     sum(false_neg), sum(true_neg)), 2, 2)
colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
rownames(conf_mat) <- c('Y = 1', 'Y = 0')
conf_mat
```

```
      Yhat = 1 Yhat = 0
Y = 1    14295     8376
Y = 0     8052    14619
```
3. Recall (Sensitivity)
- Definition: Proportion of actual 1s correctly predicted.
- Formula: $\text{Recall} = \frac{TP}{TP + FN}$
4. Specificity
- Definition: Proportion of actual 0s correctly predicted.
- Formula: $\text{Specificity} = \frac{TN}{TN + FP}$
5. Precision
- Definition: Proportion of predicted 1s that are actually 1s.
- Formula: $\text{Precision} = \frac{TP}{TP + FP}$
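As a quick Python sketch, the three metrics above can be computed directly from the four cell counts of the confusion matrix shown earlier:

```python
# Cell counts from the confusion matrix above
tp, fn = 14295, 8376   # actual 1s: predicted 1 / predicted 0
fp, tn = 8052, 14619   # actual 0s: predicted 1 / predicted 0

recall      = tp / (tp + fn)   # sensitivity, ~0.63
specificity = tn / (tn + fp)   # ~0.64
precision   = tp / (tp + fp)   # ~0.64

print(recall, specificity, precision)
```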
The Rare Class Problem
In many applications, the minority class (e.g., fraud, disease, customer purchase) is the most important but underrepresented.
- For example, if only 0.1% of cases are class 1 (e.g., fraud), a model that always predicts class 0 is 99.9% accurate but completely useless, as the sketch after this list shows.
- In such cases, you want high precision and high recall for the minority class, even if overall accuracy drops.
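To make this concrete, here is a minimal sketch (with synthetic data, assuming scikit-learn is available) of how a model that always predicts the majority class scores near-perfect accuracy while catching none of the positives:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced data: 10 positives out of 10,000 records (0.1%)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts the majority class 0
y_pred = np.zeros(10_000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.999 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0   -- it finds none of the 1s
```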
ROC Curve (Receiver Operating Characteristic)
- Y-axis: Recall (Sensitivity)
- X-axis: 1 - Specificity (False Positive Rate)
- Purpose: Shows the trade-off between correctly identifying 1s and avoiding false alarms.
- Shape: The closer the curve is to the top-left corner, the better the model.
- Diagonal line: Represents random guessing (AUC = 0.5)
Computing the ROC curve in R is straightforward. The following code computes the ROC curve for the loan data:

```r
idx <- order(-pred)
recall <- cumsum(true_y[idx] == 1) / sum(true_y == 1)
specificity <- (sum(true_y == 0) - cumsum(true_y[idx] == 0)) / sum(true_y == 0)
roc_df <- data.frame(recall = recall, specificity = specificity)
ggplot(roc_df, aes(x=specificity, y=recall)) +
  geom_line(color='blue') +
  scale_x_reverse(expand=c(0, 0)) +
  scale_y_continuous(expand=c(0, 0)) +
  geom_line(data=data.frame(x=(0:100) / 100), aes(x=x, y=1-x),
            linetype='dotted', color='red')
```
In Python, we can use the scikit-learn function `sklearn.metrics.roc_curve` to calculate the required information for the ROC curve:

```python
import pandas as pd
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y, logit_reg.predict_proba(X)[:, 0],
                                 pos_label='default')
# Assemble recall/specificity pairs for plotting
roc_df = pd.DataFrame({'recall': tpr, 'specificity': 1 - fpr})

ax = roc_df.plot(x='specificity', y='recall', figsize=(4, 4), legend=False)
ax.set_ylim(0, 1)
ax.set_xlim(1, 0)
ax.plot((1, 0), (0, 1))
ax.set_xlabel('specificity')
ax.set_ylabel('recall')
```
The dotted diagonal line represents a classifier that performs no better than random chance. An effective classifier (in medical diagnostics, for instance) has an ROC curve that hugs the upper-left corner: it correctly identifies many 1s without misclassifying many 0s as 1s.
AUC (Area Under the Curve)
While the ROC curve is a useful graphical tool, by itself it does not give a single-number summary of a classifier's performance. It can, however, be used to compute the area under the curve (AUC) metric.
- A single-number summary of the ROC curve.
- Range: 0.5 (random guessing) to 1 (perfect classifier).
- Rough interpretation:
  - AUC ≈ 0.7 = weak model
  - AUC ≈ 0.8–0.9 = good
  - AUC ≈ 0.95+ = excellent
The model achieves an AUC of approximately 0.69, indicating it functions as a weak classifier.
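As a sketch of how such a number is obtained (the `y_true` and `y_proba` arrays below are hypothetical, standing in for the loan-data labels and predicted probabilities), scikit-learn's `roc_auc_score` gives the AUC directly, and integrating the ROC curve numerically gives the same value:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of class 1
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.7, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2])

# Single-number summary straight from scikit-learn
print(roc_auc_score(y_true, y_proba))   # 0.75 for this toy example

# Equivalent manual computation: area under recall vs. (1 - specificity)
fpr, tpr, _ = roc_curve(y_true, y_proba)
print(np.trapz(tpr, fpr))               # numerical integration of the ROC curve
```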

Precision-Recall Curve
- Useful for imbalanced data
- Shows precision vs. recall as the cutoff changes
- Helpful to understand model behavior when 1s are rare
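A minimal sketch of a precision-recall curve with scikit-learn's `precision_recall_curve` (the labels and scores below are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities of class 1
y_true  = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_proba = np.array([0.8, 0.6, 0.55, 0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.1])

# Precision and recall at every possible cutoff
precision, recall, thresholds = precision_recall_curve(y_true, y_proba)

plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()
```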
Python & R: Metric Calculation
Both R and Python (`scikit-learn`) provide tools to compute:
- Confusion matrix
- Precision, Recall, Specificity
- ROC/AUC scores and curves
In Python:

```python
from sklearn.metrics import (confusion_matrix, roc_curve, roc_auc_score,
                             precision_recall_fscore_support)

# Confusion matrix
confusion_matrix(y_true, y_pred)

# Precision, Recall, F1-score
precision_recall_fscore_support(y_true, y_pred)

# ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

# AUC
roc_auc_score(y_true, y_proba)
```
Lift
Lift is a powerful and intuitive concept used to evaluate the performance of classification models—especially in rare event prediction, where the outcome of interest (class 1) is uncommon (e.g., fraud detection, loan default, campaign conversion, etc.).
Lift answers the question:
“How much better is my model at identifying positives (1s) than random guessing?”
Why Not Just Use Accuracy or AUC?
- Accuracy doesn’t help with imbalanced data (e.g., if only 1% of emails are spam, a model that always predicts “not spam” has 99% accuracy but is useless).
- AUC shows how well the model separates classes across different thresholds—but doesn’t tell you how much better your model is at finding 1s than chance.
- Lift fills that gap: it shows how concentrated the 1s are in the top predictions from your model.
Understanding Lift with an Example
Imagine:
- Your model ranks all customers based on the probability they’ll respond to a campaign.
- You only want to target the top 10% (due to budget).
- Overall response rate = 1%
- But in the top 10% ranked by your model, the response rate is 3%.
Then:
\(\textbf{Lift at 10\%} = \frac{3\%}{1\%} = 3\)
Your model is 3× better than random targeting.
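Here is a minimal Python sketch of the same calculation (the `scores` and `y_true` arrays are hypothetical stand-ins for a model's predicted probabilities and the observed responses):

```python
import numpy as np

def lift_at(scores, y_true, top_frac=0.10):
    """Lift for the top `top_frac` of records ranked by model score."""
    n_top = int(len(scores) * top_frac)
    top = np.argsort(-scores)[:n_top]           # indices of the highest-scored records
    return y_true[top].mean() / y_true.mean()   # top-slice response rate vs. overall rate

# Tiny illustration: 20 customers, 2 responders (10% overall response rate),
# and the model ranks one responder at the very top.
scores = np.linspace(1.0, 0.05, 20)
y_true = np.zeros(20, dtype=int)
y_true[[0, 3]] = 1

print(lift_at(scores, y_true))   # 5.0 -- the top 10% responds at 5x the overall rate
```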