Day107 Deep Learning Lecture Review - HW4 - Adjusting Probabilities to the Real World
HW4: Model Calibration (Platt Scaling & Label Smoothing) and Conformal Prediction (Naive and Adaptive Prediction Sets)
Model Calibration ensures that predicted probabilities align with actual likelihood, which is crucial in high-stakes applications like medical diagnosis, finance, and autonomous systems. A well-calibrated model provides accurate uncertainty estimation, improving reliability and trustworthiness.
In this posting, I will explore:
- Model Calibration: Using Platt Scaling and Label Smoothing to adjust a model’s confidence levels.
- Conformal Prediction (CP): Generating prediction sets instead of single predictions to quantify uncertainty.
1. Model Calibration
Model calibration ensures that a model’s predicted probabilities match real-world occurrences. For example, if a model predicts a 90% likelihood of an event, that event should happen 90% of the time.
Metrics for Calibration
- Reliability Curve: A plot of predicted probabilities vs. actual frequencies.
- Expected Calibration Error (ECE): Measures the difference between predicted probabilities and actual frequencies.
- Maximum Calibration Error (MCE): The worst-case calibration error.
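For reference, here is a minimal sketch of how ECE and MCE can be computed from binary predicted probabilities using equal-width confidence bins; the function name and bin count are illustrative and not part of the assignment code.

```python
import numpy as np

def calibration_errors(probs, labels, n_bins=10):
    """Minimal ECE/MCE sketch for binary predictions with equal-width bins."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()            # mean predicted probability in the bin
        accuracy = labels[mask].mean()             # observed frequency of the positive class
        gap = abs(confidence - accuracy)
        ece += (mask.sum() / len(probs)) * gap     # ECE: bin-weighted average gap
        mce = max(mce, gap)                        # MCE: worst-case bin gap
    return ece, mce
```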
1.1 Platt Scaling
Platt Scaling is a post-processing method that applies logistic regression to recalibrate probabilities. It refines the raw model output (logit $z$) into well-calibrated probabilities using:

$$P(y = 1 \mid z) = \frac{1}{1 + \exp(Az + B)}$$

where $A, B$ are learned from a validation set.
Implementation Steps
- Train a ResNet-18 binary classifier on the CIFAR-10 (dog vs. cat) dataset.
- Extract logits from a validation set.
- Fit a logistic regression model to the logits to learn $A,B$.
- Apply the transformation to obtain calibrated probabilities.
- Plot reliability curves before and after Platt Scaling.
Results

- Before Platt Scaling: The reliability curve deviated from the diagonal, indicating overconfidence.
- After Platt Scaling: The curve aligned closely with the diagonal, demonstrating improved calibration.
- Conclusion: Platt Scaling brought the validation reliability curve close to the ideal diagonal, confirming better-calibrated probabilities.
```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Obtain logits and labels from the validation set
model.eval()
val_logits = []
val_labels = []
with torch.no_grad():
    for inputs, labels in val_loader:
        inputs = inputs.cuda()
        outputs = model(inputs)
        val_logits.extend(outputs.cpu().numpy().ravel())
        val_labels.extend(labels.numpy())

# Fit logistic regression for Platt Scaling
val_logits = np.array(val_logits).reshape(-1, 1)
val_labels = np.array(val_labels)
platt_scaler = LogisticRegression(solver='lbfgs')
platt_scaler.fit(val_logits, val_labels)

# Apply Platt Scaling to obtain calibrated probabilities
def platt_scaled_probability(logit):
    logit = np.array(logit).reshape(-1, 1)
    return platt_scaler.predict_proba(logit)[:, 1]
```
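To visualize the before/after comparison, here is a minimal plotting sketch that reuses `val_logits`, `val_labels`, and `platt_scaler` from the block above. It assumes the single-logit binary setup, where the uncalibrated probability is the sigmoid of the logit; adjust accordingly if your model outputs two logits.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

raw_probs = 1.0 / (1.0 + np.exp(-val_logits.ravel()))       # uncalibrated sigmoid probabilities
platt_probs = platt_scaler.predict_proba(val_logits)[:, 1]  # Platt-scaled probabilities

for name, probs in [("Before Platt Scaling", raw_probs), ("After Platt Scaling", platt_probs)]:
    frac_pos, mean_pred = calibration_curve(val_labels, probs, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=name)

plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```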
1.2 Label Smoothing
Label Smoothing **modifies one-hot labels** by redistributing probability mass, preventing the model from becoming overconfident.
Instead of assigning 100% probability to the correct class ($y = 1$), label smoothing adjusts the target to:

$$y_k^{LS} = (1 - \alpha)\, y_k + \frac{\alpha}{K}$$

where $\alpha$ controls the smoothing factor and $K$ is the number of classes. For example, with $\alpha = 0.2$ and $K = 2$, the correct class receives a target of 0.9 instead of 1, and the incorrect class receives 0.1 instead of 0.
Implementation Steps
- Train a ResNet-18 model from scratch with smoothing values of 0.1, 0.2, and 0.3.
- Evaluate reliability curves for different smoothing values.
- Compare smoothing with and without Platt Scaling.
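As a concrete reference, the sketch below shows one way to train with label smoothing using the built-in `label_smoothing` argument of PyTorch's `nn.CrossEntropyLoss` (available in PyTorch 1.10+); `model`, `train_loader`, the learning rate, and the smoothing value are placeholders rather than the exact assignment settings.

```python
import torch
import torch.nn as nn

# Label smoothing via the built-in CrossEntropyLoss argument (PyTorch >= 1.10).
# `model` and `train_loader` are assumed to exist; hyperparameters are illustrative.
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Soft targets: 1 - alpha + alpha/K for the true class, alpha/K for the others
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```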
Results

**Apply Label Smoothing After Training**

Smoothing | Reliability Curve Alignment | Overconfidence Reduction |
---|---|---|
0.1 | Slight improvement | Still overconfident |
0.2 | Best balance | Reduced overconfidence |
0.3 | Close to ideal, but noisy | Might underfit |

- Smoothing of 0.2 provides the best balance, significantly reducing overconfidence while maintaining model accuracy.

**Train the Model from Scratch Using Label Smoothing**

Smoothing | Reliability Curve Alignment | Overconfidence Reduction |
---|---|---|
0.1 | Improved, but still some deviations at high probabilities | Still overconfident |
0.2 | Best balance | Reduced overconfidence |
0.3 | Close to ideal, but slightly fluctuating | Might underfit |

- Applying smoothing at levels 0.2 and 0.3 enhances calibration, but Platt Scaling from the previous setup (applying label smoothing after training) shows an even closer alignment with the diagonal.

**Comparison with Platt Scaling and Label Smoothing**
- While label smoothing (especially at 0.2 and 0.3) improves calibration, Platt Scaling directly aligns the reliability curve with the diagonal. Platt Scaling is a more targeted post-processing approach for calibration. In contrast, label smoothing is integrated during training.
- After training without label smoothing, Platt Scaling achieves the closest alignment to the perfect calibration line. However, training from scratch with label smoothing (0.2 or 0.3) provides a reasonable alternative, yielding a well-calibrated model without post-processing.
1.3 Combining Platt Scaling with Label Smoothing
- Smoothing values of 0.2 and 0.3, combined with Platt Scaling, provide the most accurate calibration. They align the model’s predictions closely with actual outcomes, even in the higher probability regions.
- Higher smoothing values (0.2 and 0.3) help with initial calibration, mainly before Platt Scaling is applied, making the model less overconfident in its predictions.
- Also, Platt Scaling effectively improves the model’s calibration for all levels of label smoothing, ensuring that the reliability curves closely match the perfectly calibrated line.
2. Conformal Prediction
Uncertainty quantification is essential in deep learning applications, especially in critical domains like healthcare and autonomous systems. I explored Conformal Prediction (CP), a framework that provides prediction sets with reliable coverage guarantees. The objective was to implement the Naive and Adaptive Prediction Set algorithms using a pre-trained ResNet model.
Understanding Conformal Prediction
Traditional deep learning models provide a single-point prediction with a confidence score (e.g., a softmax probability). However, these confidence scores can be miscalibrated and fail to deliver reliable uncertainty estimates. Conformal Prediction addresses this limitation by generating a prediction set of plausible labels with a guaranteed confidence level.
Mathematically, conformal prediction ensures that:

$$P\big(Y \in \tau(X)\big) \geq 1 - \alpha$$

where:
- $\alpha$ is the significance level,
- $X$ is the input,
- $Y$ is the true label,
- $\tau (X)$ is the prediction set.
The main concept is to utilize a scoring function to assess how closely the model’s predictions align with the actual labels and then establish quantiles to form prediction sets.
2.1 The Naïve Prediction Set Algorithm
The naïve method constructs a prediction set by including classes until the cumulative probability surpasses a predefined threshold.
Implementation Steps
- Prepare the datasets
  - Used `softmax_outputs.npy` for the predicted probabilities.
  - Used `correct_classes.npy` for the ground-truth labels.
- Split the data
  - The first 2000 samples are used for calibration.
  - The remaining samples are used for validation.
- Calculate scores
  - The score function is one minus the probability assigned to the true class: $s(X, Y) = 1 - \hat{f}(X)_Y$
  - Compute the quantile threshold $\hat{q}$ using: $\hat{q} = \mathrm{Quantile}(s_1, \dots, s_n;\ 1 - \alpha)$
- Generate prediction sets (a code sketch follows this list)
  - Include the top $k$ classes until their cumulative probability exceeds $1 - \hat{q}$.
  - Iterate through the softmax outputs until the cumulative probability surpasses the threshold.
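Below is a minimal sketch of these steps, not the exact homework code: the file names come from the dataset step, the 2000-sample split from the data-split step, and $\alpha = 0.01$ is an assumed significance level (suggested by the 99% target in the results, but not stated explicitly).

```python
import numpy as np

# Naive prediction sets: load softmax outputs and labels (shapes assumed (N, K) and (N,))
softmax_outputs = np.load("softmax_outputs.npy")
correct_classes = np.load("correct_classes.npy")

n_cal, alpha = 2000, 0.01  # calibration split size; alpha is an assumption
cal_probs, cal_labels = softmax_outputs[:n_cal], correct_classes[:n_cal]
val_probs, val_labels = softmax_outputs[n_cal:], correct_classes[n_cal:]

# Score: one minus the probability assigned to the true class
cal_scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_hat = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# For each validation sample, add classes in descending probability order
# until the cumulative probability exceeds the threshold 1 - q_hat
prediction_sets = []
for probs in val_probs:
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    k = int(np.searchsorted(cumulative, 1.0 - q_hat)) + 1  # number of classes to keep
    prediction_sets.append(set(order[:min(k, len(order))]))

coverage = np.mean([y in s for y, s in zip(val_labels, prediction_sets)])
print(f"Empirical coverage: {coverage:.4f}")
```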
Results
- Empirical coverage: 98.66%
- Key observation: The naïve method produces small prediction sets but can occasionally miss the true class, affecting reliability.
- Coverage slightly below 99% means that for some samples the true class was left out of the prediction set.
2.2 The Adaptive Prediction Set Algorithm
The adaptive method ensures that the true label is always included by dynamically adjusting the prediction set.
Implementation Steps
- Score function
  - Instead of considering only the probability of the true label, accumulate the sorted class probabilities until the true label is reached.
  - This guarantees the true label is accounted for in the score: $s(X, Y) = \sum_{j=1}^{k} \hat{f}(X)_{\pi_j}$, where $\pi_j$ is the class with the $j$-th highest predicted probability and $k$ is the rank of the true label.
- Compute the quantile threshold
  - Compute the quantile threshold $\hat{q}$ based on the sorted scores.
- Generate adaptive prediction sets (a code sketch follows this list)
  - Continue adding classes, in descending probability order, until the cumulative probability reaches $\hat{q}$, so that the true label is included.
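A minimal sketch of the adaptive variant, reusing the same files, split, and assumed $\alpha = 0.01$ from the naive sketch above:

```python
import numpy as np

# Adaptive prediction sets: same data and split as the naive sketch; alpha is an assumption
softmax_outputs = np.load("softmax_outputs.npy")
correct_classes = np.load("correct_classes.npy")
n_cal, alpha = 2000, 0.01
cal_probs, cal_labels = softmax_outputs[:n_cal], correct_classes[:n_cal]
val_probs, val_labels = softmax_outputs[n_cal:], correct_classes[n_cal:]

# Score: cumulative probability of the sorted classes up to and including the true label
cal_order = np.argsort(cal_probs, axis=1)[:, ::-1]               # classes by descending probability
cal_cum = np.cumsum(np.take_along_axis(cal_probs, cal_order, axis=1), axis=1)
true_rank = np.argmax(cal_order == cal_labels[:, None], axis=1)  # position of the true label
cal_scores = cal_cum[np.arange(n_cal), true_rank]

q_hat = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Prediction set: keep adding sorted classes until the cumulative probability reaches q_hat
val_order = np.argsort(val_probs, axis=1)[:, ::-1]
val_cum = np.cumsum(np.take_along_axis(val_probs, val_order, axis=1), axis=1)
set_sizes = np.minimum((val_cum < q_hat).sum(axis=1) + 1, val_probs.shape[1])
prediction_sets = [set(val_order[i, :k]) for i, k in enumerate(set_sizes)]

coverage = np.mean([y in s for y, s in zip(val_labels, prediction_sets)])
print(f"Empirical coverage: {coverage:.4f}")
```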
Results
- Empirical Coverage: 99.72%
- Higher coverage than the naïve method, ensuring the true label is always included.
- Larger prediction sets, which reduces interpretability.
2.3 Comparing Naïve vs. Adaptive Methods
Aspect | Naïve Method | Adaptive Method |
---|---|---|
Coverage | 98.66% | 99.72% |
Prediction Set Size | Smaller | Larger |
Reliability | Occasionally misses true labels | Always includes true labels |
Efficiency | Faster | Slightly slower due to set expansion |
Main Insights
- Naïve method is efficient but unreliable
- It sometimes omits the true label, leading to misclassification risks.
- Smaller sets improve interpretability, but at the cost of lower coverage.
- Adaptive method ensures full coverage but at a cost
- Always includes the correct label, making it more reliable.
- Larger sets reduce interpretability, as multiple labels might be included unnecessarily.
- Trade-off: Interpretability vs. Coverage
- If interpretability is crucial, the naïve method is preferable.
- If accuracy is the priority, the adaptive method is the best choice.
Conclusion
Conformal Prediction provides a mathematically rigorous approach to uncertainty quantification. Through this homework, I learned:
- How to implement naïve and adaptive prediction sets.
- The importance of quantile calibration in deep learning.
- Trade-offs between compactness and reliability.
As deep learning models are increasingly deployed in real-world applications, techniques like conformal prediction will be essential for building reliable and interpretable AI systems.