Day107 Deep Learning Lecture Review - HW4 - Adjusting Probabilities to the Real World
HW4: Model Calibration (Platt Scaling & Label Smoothing) and Conformal Prediction (Naive and Adaptive Prediction Sets)
Model Calibration ensures that predicted probabilities align with actual likelihood, which is crucial in high-stakes applications like medical diagnosis, finance, and autonomous systems. A well-calibrated model provides accurate uncertainty estimation, improving reliability and trustworthiness.
In this posting, I will explore:
- Model Calibration: Using Platt Scaling and Label Smoothing to adjust a model’s confidence levels.
- Conformal Prediction (CP): Generating prediction sets instead of single predictions to quantify uncertainty.
1. Model Calibration
Model calibration ensures that a model’s predicted probabilities match real-world occurrences. For example, if a model predicts a 90% likelihood of an event, that event should happen 90% of the time.
Metrics for Calibration
- Reliability Curve: A plot of predicted probabilities vs. actual frequencies.
- Expected Calibration Error (ECE): Measures the difference between predicted probabilities and actual frequencies.
- Maximum Calibration Error (MCE): The worst-case calibration error.
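For reference, here is a minimal sketch of how ECE and MCE can be computed from binary predicted probabilities using equal-width confidence bins; the function name and bin count are illustrative and not part of the assignment code.

```python
import numpy as np

def calibration_errors(probs, labels, n_bins=10):
    """Minimal ECE/MCE sketch for binary predictions with equal-width bins."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()            # mean predicted probability in the bin
        accuracy = labels[mask].mean()             # observed frequency of the positive class
        gap = abs(confidence - accuracy)
        ece += (mask.sum() / len(probs)) * gap     # ECE: bin-weighted average gap
        mce = max(mce, gap)                        # MCE: worst-case bin gap
    return ece, mce
```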
1.1 Platt Scaling
Platt Scaling is a post-processing method that applies logistic regression to recalibrate probabilities. It refines the raw model output (logit $z$) into well-calibrated probabilities using:

$$P(y = 1 \mid z) = \frac{1}{1 + \exp(Az + B)}$$

where $A, B$ are learned from a validation set.
Implementation Steps
- Train a ResNet-18 binary classifier on the CIFAR-10 (dog vs. cat) dataset.
- Extract logits from a validation set.
- Fit a logistic regression model to the logits to learn $A,B$.
- Apply the transformation to obtain calibrated probabilities.
- Plot reliability curves before and after Platt Scaling.
Results

- Before Platt Scaling: The reliability curve deviated from the diagonal, indicating overconfidence.
- After Platt Scaling: The curve aligned closely with the diagonal, demonstrating improved calibration.
- Conclusion: Platt Scaling brought the validation reliability curve close to the ideal diagonal, confirming better-calibrated probabilities.
```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Obtain logits and labels from the validation set
model.eval()
val_logits = []
val_labels = []
with torch.no_grad():
    for inputs, labels in val_loader:
        inputs = inputs.cuda()
        outputs = model(inputs)
        val_logits.extend(outputs.cpu().numpy().ravel())
        val_labels.extend(labels.numpy())

# Fit logistic regression for Platt Scaling
val_logits = np.array(val_logits).reshape(-1, 1)
val_labels = np.array(val_labels)
platt_scaler = LogisticRegression(solver='lbfgs')
platt_scaler.fit(val_logits, val_labels)

# Apply Platt Scaling to obtain calibrated probabilities
def platt_scaled_probability(logit):
    logit = np.array(logit).reshape(-1, 1)
    return platt_scaler.predict_proba(logit)[:, 1]
```
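To visualize the before/after comparison, here is a minimal plotting sketch that reuses `val_logits`, `val_labels`, and `platt_scaler` from the block above. It assumes the single-logit binary setup, where the uncalibrated probability is the sigmoid of the logit; adjust accordingly if your model outputs two logits.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

raw_probs = 1.0 / (1.0 + np.exp(-val_logits.ravel()))       # uncalibrated sigmoid probabilities
platt_probs = platt_scaler.predict_proba(val_logits)[:, 1]  # Platt-scaled probabilities

for name, probs in [("Before Platt Scaling", raw_probs), ("After Platt Scaling", platt_probs)]:
    frac_pos, mean_pred = calibration_curve(val_labels, probs, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=name)

plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```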
1.2 Label Smoothing
Label Smoothing **modifies one-hot labels** by redistributing probability mass, preventing the model from becoming overconfident.
Instead of assigning 100% probability to the correct class ($y = 1$), label smoothing adjusts the target to:

$$y_k^{LS} = (1 - \alpha)\, y_k + \frac{\alpha}{K}$$

where $\alpha$ controls the smoothing factor and $K$ is the number of classes. For example, with $\alpha = 0.2$ and $K = 2$, the correct class receives a target of 0.9 instead of 1, and the incorrect class receives 0.1 instead of 0.
Implementation Steps
- Train a ResNet-18 model from scratch with smoothing values of 0.1, 0.2, and 0.3.
- Evaluate reliability curves for different smoothing values.
- Compare smoothing with and without Platt Scaling.
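As a concrete reference, the sketch below shows one way to train with label smoothing using the built-in `label_smoothing` argument of PyTorch's `nn.CrossEntropyLoss` (available in PyTorch 1.10+); `model`, `train_loader`, the learning rate, and the smoothing value are placeholders rather than the exact assignment settings.

```python
import torch
import torch.nn as nn

# Label smoothing via the built-in CrossEntropyLoss argument (PyTorch >= 1.10).
# `model` and `train_loader` are assumed to exist; hyperparameters are illustrative.
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Soft targets: 1 - alpha + alpha/K for the true class, alpha/K for the others
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```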
Results

**Apply Label Smoothing After Training**

Smoothing | Reliability Curve Alignment | Overconfidence Reduction |
---|---|---|
0.1 | Slight improvement | Still overconfident |
0.2 | Best balance | Reduced overconfidence |
0.3 | Close to ideal, but noisy | Might underfit |

- Smoothing of 0.2 provides the best balance, significantly reducing overconfidence while maintaining model accuracy.

**Train the Model from Scratch Using Label Smoothing**

Smoothing | Reliability Curve Alignment | Overconfidence Reduction |
---|---|---|
0.1 | Improved, but still some deviations at high probabilities | Still overconfident |
0.2 | Best balance | Reduced overconfidence |
0.3 | Close to ideal, but slightly fluctuating | Might underfit |

- Applying smoothing at levels 0.2 and 0.3 enhances calibration, but Platt Scaling from the previous setup (applying label smoothing after training) shows an even closer alignment with the diagonal.

**Comparison with Platt Scaling and Label Smoothing**
- While label smoothing (especially at 0.2 and 0.3) improves calibration, Platt Scaling directly aligns the reliability curve with the diagonal. Platt Scaling is a more targeted post-processing approach for calibration. In contrast, label smoothing is integrated during training.
- After training without label smoothing, Platt Scaling achieves the closest alignment to the perfect calibration line. However, training from scratch with label smoothing (0.2 or 0.3) provides a reasonable alternative, yielding a well-calibrated model without post-processing.
1.3 Combining Platt Scaling with Label Smoothing
- Smoothing values of 0.2 and 0.3, combined with Platt Scaling, provide the most accurate calibration. They align the model’s predictions closely with actual outcomes, even in the higher probability regions.
- Higher smoothing values (0.2 and 0.3) help with initial calibration, mainly before Platt Scaling is applied, making the model less overconfident in its predictions.
- Also, Platt Scaling effectively improves the model’s calibration for all levels of label smoothing, ensuring that the reliability curves closely match the perfectly calibrated line.
2. Conformal Prediction
Uncertainty quantification is essential in deep learning applications, especially in critical domains like healthcare and autonomous systems. I explored Conformal Prediction (CP), a framework that provides prediction sets with reliable coverage guarantees. The objective was to implement the Naive and Adaptive Prediction Set algorithms using a pre-trained ResNet model.
Understanding Conformal Prediction
Traditional deep learning models provide a single-point prediction with a confidence score (e.g., a softmax probability). However, these confidence scores can be miscalibrated and fail to deliver reliable uncertainty estimates. Conformal Prediction addresses this limitation by generating a prediction set of plausible labels with a guaranteed confidence level.
Mathematically, conformal prediction ensures that:

$$P\big(Y \in \tau(X)\big) \geq 1 - \alpha$$

where:
- $\alpha$ is the significance level,
- $X$ is the input,
- $Y$ is the true label,
- $\tau (X)$ is the prediction set.
The main concept is to utilize a scoring function to assess how closely the model’s predictions align with the actual labels and then establish quantiles to form prediction sets.
2.1 The Naïve Prediction Set Algorithm
The naïve method constructs a prediction set by including classes until the cumulative probability surpasses a predefined threshold.
Implementation Steps
- Prepare the datasets
  - Used `softmax_outputs.npy` for the predicted probabilities.
  - Used `correct_classes.npy` for the ground-truth labels.
- Split the data
  - The first 2000 samples are used for calibration.
  - The remaining samples are used for validation.
- Calculate scores
  - The score function is one minus the probability assigned to the true class: $s(X, Y) = 1 - \hat{f}(X)_Y$
  - Compute the quantile threshold $\hat{q}$ using: $\hat{q} = \mathrm{Quantile}(s_1, \dots, s_n;\ 1 - \alpha)$
- Generate prediction sets (a code sketch follows this list)
  - Include the top $k$ classes until their cumulative probability exceeds $1 - \hat{q}$.
  - Iterate through the softmax outputs until the cumulative probability surpasses the threshold.
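Below is a minimal sketch of these steps, not the exact homework code: the file names come from the dataset step, the 2000-sample split from the data-split step, and $\alpha = 0.01$ is an assumed significance level (suggested by the 99% target in the results, but not stated explicitly).

```python
import numpy as np

# Naive prediction sets: load softmax outputs and labels (shapes assumed (N, K) and (N,))
softmax_outputs = np.load("softmax_outputs.npy")
correct_classes = np.load("correct_classes.npy")

n_cal, alpha = 2000, 0.01  # calibration split size; alpha is an assumption
cal_probs, cal_labels = softmax_outputs[:n_cal], correct_classes[:n_cal]
val_probs, val_labels = softmax_outputs[n_cal:], correct_classes[n_cal:]

# Score: one minus the probability assigned to the true class
cal_scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_hat = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# For each validation sample, add classes in descending probability order
# until the cumulative probability exceeds the threshold 1 - q_hat
prediction_sets = []
for probs in val_probs:
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    k = int(np.searchsorted(cumulative, 1.0 - q_hat)) + 1  # number of classes to keep
    prediction_sets.append(set(order[:min(k, len(order))]))

coverage = np.mean([y in s for y, s in zip(val_labels, prediction_sets)])
print(f"Empirical coverage: {coverage:.4f}")
```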
Results
- Empirical coverage: 98.66%
- Key observation: The naïve method produces small prediction sets but can occasionally miss the true class, affecting reliability.
- Coverage slightly below 99% means that for some samples the true class was left out of the prediction set.
2.2 The Adaptive Prediction Set Algorithm
The adaptive method ensures that the true label is always included by dynamically adjusting the prediction set.
Implementation Steps
- Score function
  - Instead of considering only the probability of the true label, accumulate the sorted class probabilities until the true label is reached.
  - This guarantees the true label is accounted for in the score: $s(X, Y) = \sum_{j=1}^{k} \hat{f}(X)_{\pi_j}$, where $\pi_j$ is the class with the $j$-th highest predicted probability and $k$ is the rank of the true label.
- Compute the quantile threshold
  - Compute the quantile threshold $\hat{q}$ based on the sorted scores.
- Generate adaptive prediction sets (a code sketch follows this list)
  - Continue adding classes, in descending probability order, until the cumulative probability reaches $\hat{q}$, so that the true label is included.
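A minimal sketch of the adaptive variant, reusing the same files, split, and assumed $\alpha = 0.01$ from the naive sketch above:

```python
import numpy as np

# Adaptive prediction sets: same data and split as the naive sketch; alpha is an assumption
softmax_outputs = np.load("softmax_outputs.npy")
correct_classes = np.load("correct_classes.npy")
n_cal, alpha = 2000, 0.01
cal_probs, cal_labels = softmax_outputs[:n_cal], correct_classes[:n_cal]
val_probs, val_labels = softmax_outputs[n_cal:], correct_classes[n_cal:]

# Score: cumulative probability of the sorted classes up to and including the true label
cal_order = np.argsort(cal_probs, axis=1)[:, ::-1]               # classes by descending probability
cal_cum = np.cumsum(np.take_along_axis(cal_probs, cal_order, axis=1), axis=1)
true_rank = np.argmax(cal_order == cal_labels[:, None], axis=1)  # position of the true label
cal_scores = cal_cum[np.arange(n_cal), true_rank]

q_hat = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Prediction set: keep adding sorted classes until the cumulative probability reaches q_hat
val_order = np.argsort(val_probs, axis=1)[:, ::-1]
val_cum = np.cumsum(np.take_along_axis(val_probs, val_order, axis=1), axis=1)
set_sizes = np.minimum((val_cum < q_hat).sum(axis=1) + 1, val_probs.shape[1])
prediction_sets = [set(val_order[i, :k]) for i, k in enumerate(set_sizes)]

coverage = np.mean([y in s for y, s in zip(val_labels, prediction_sets)])
print(f"Empirical coverage: {coverage:.4f}")
```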
Results
- Empirical Coverage: 99.72%
- Higher coverage than the naïve method, ensuring the true label is always included.
- Larger prediction sets, which reduces interpretability.
2.3 Comparing Naïve vs. Adaptive Methods
Aspect | Naïve Method | Adaptive Method |
---|---|---|
Coverage | 98.66% | 99.72% |
Prediction Set Size | Smaller | Larger |
Reliability | Occasionally misses true labels | Always includes true labels |
Efficiency | Faster | Slightly slower due to set expansion |
Main Insights
- Naïve method is efficient but unreliable
- It sometimes omits the true label, leading to misclassification risks.
- Smaller sets improve interpretability, but at the cost of lower coverage.
- Adaptive method ensures full coverage but at a cost
- Always includes the correct label, making it more reliable.
- Larger sets reduce interpretability, as multiple labels might be included unnecessarily.
- Trade-off: Interpretability vs. Coverage
- If interpretability is crucial, the naïve method is preferable.
- If accuracy is the priority, the adaptive method is the best choice.
Conclusion
Conformal Prediction provides a mathematically rigorous approach to uncertainty quantification. Through this homework, I learned:
- How to implement naïve and adaptive prediction sets.
- The importance of quantile calibration in deep learning.
- Trade-offs between compactness and reliability.
As deep learning models are increasingly deployed in real-world applications, techniques like conformal prediction will be essential for building reliable and interpretable AI systems.