Day204 - DL Review: Revisiting Optimizers, CNNs & Data Drifts
Optimizers in Neural Networks, Parameter Sharing in CNNs, and Data & Concept Drifts

Optimizers
1. What Optimizers Do
Optimizers are algorithms that adjust the model's weights to minimize loss. The choice of optimizer affects:
- Convergence Speed (how fast the model learns)
- Stability (whether training oscillates, diverges, or converges smoothly)
- Generalization (how well the model performs on unseen data)
2. Core Optimizer Families
- Stochastic Gradient Descent (SGD)
- Mechanism: Updates weights using mini-batch gradients; **momentum** accelerates learning by smoothing updates.
- Pros: Simple, good generalization, widely used in practice (e.g., ResNet).
- Cons: Can be slow to converge, sensitive to learning rate.
- Adaptive Methods
- AdaGrad: Adapts per-parameter learning rates, which works well for sparse features, but its accumulated squared gradients cause learning rates to decay aggressively.
- RMSProp: Uses an exponential moving average of squared gradients to keep learning rates from decaying to zero (fixing AdaGrad's main weakness).
- **Adam (Adaptive Moment Estimation)**: Combines momentum and adaptive learning rates, often the default choice in practice.
- AdamW: Variant of Adam with proper weight decay regularization (preferred in modern NLP/CV models).
"SGD with momentum is robust and straightforward, often yielding better generalization in vision. Adam adapts learning rates per parameter, converging faster and making it popular in NLP and large-scale training. AdamW improves regularization, which is why it's now widely used in transformers."
3. Trade-offs in Optimizer Choice
- SGD (with momentum): Often better for generalization, used in vision tasks.
- Adam / AdamW: Converges faster, stable on noisy or sparse gradients, dominant in NLP and transformers.
- RMSProp: Historically significant in RNNs.
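As a rough illustration, here is how these three optimizers are instantiated in PyTorch (the model and hyperparameter values below are placeholders, not tuned recommendations):

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module works the same way.
model = nn.Linear(128, 10)

# SGD with momentum: a common default for vision backbones like ResNet.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adam: per-parameter adaptive learning rates from first/second moment estimates.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW: decouples weight decay from the gradient update (standard in transformers).
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```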
4. From an MLOps Perspective
Optimizers affect compute cost and reproducibility. In distributed training, large batch optimizers like LAMB or AdamW are critical. Hyperparameter tuning (learning rate, momentum, betas) should be logged and versioned for reproducibility.
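A minimal sketch of that logging idea, assuming a plain JSON file as the artifact (in a real pipeline this would go to an experiment tracker such as MLflow or Weights & Biases rather than a bare file):

```python
import json

# Hypothetical run metadata: every optimizer hyperparameter that affects
# reproducibility is captured alongside the training run.
optimizer_config = {
    "optimizer": "AdamW",
    "lr": 1e-3,
    "betas": [0.9, 0.999],
    "weight_decay": 0.01,
    "batch_size": 256,
}

with open("run_optimizer_config.json", "w") as f:
    json.dump(optimizer_config, f, indent=2)
```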
Parameter Sharing
Parameter sharing is the idea that, instead of learning a unique weight for every pixel connection (as in a fully connected layer), CNNs learn a set of weights (a filter/kernel) that is applied across the entire input image.
- In a fully connected network for an image of size $256 \times 256$, a single hidden neuron connected to all pixels would already need $65,536$ weights.
- In contrast, CNNs use a small filter (e.g., $3 \times 3$) with only $9$ weights shared across all spatial positions.
So, instead of having millions of unique parameters, CNNs reuse the same parameters to detect the same feature (like an edge, corner, or texture) at different parts of the image.
Why Do We Need It?
- Parameter Efficiency
- Dramatically reduces the number of parameters → faster training and less risk of overfitting.
- Example: A fully connected layer mapping a flattened $256 \times 256$ image to just $256$ hidden units already has $\sim 16.7M$ weights; a $3 \times 3$ convolutional layer with $64$ filters has only $64 \times 9 = 576$ weights (ignoring biases).
- Translation Invariance
- A filter detecting an "edge" in the top-left corner will also detect the same edge in the bottom-right.
- This allows CNNs to generalize across spatial locations.
- Locality & Hierarchy
- CNNs focus on local features first (edges, textures), and stacking layers allows them to build hierarchical features (object parts → complete objects).
How It Works Mathematically
Let's say we have a filter $K$ of size $3 \times 3$.
For each location $(i, j)$ in the image $X$, the convolution computes:

$$(X * K)(i, j) = \sum_{m=0}^{2} \sum_{n=0}^{2} X(i+m,\ j+n) \cdot K(m, n)$$

Here, the same filter $K$ is reused (shared) across all $(i, j)$. This is parameter sharing.
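A direct numpy translation of this sum (using the cross-correlation convention that deep learning frameworks call "convolution"; the array sizes are arbitrary):

```python
import numpy as np

def conv2d_valid(X, K):
    """Slide the SAME filter K over every location of X (valid padding)."""
    kh, kw = K.shape
    H, W = X.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The identical 9 weights in K are reused at every (i, j).
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)
    return out

X = np.random.rand(8, 8)   # toy "image"
K = np.random.rand(3, 3)   # one shared filter: only 9 parameters
print(conv2d_valid(X, K).shape)  # (6, 6)
```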
Example: Without vs. With Sharing
- Without sharing (fully connected): Each neuron learns its own weights → millions of parameters.
- With sharing (CNN): The same small filter slides (convolves) over the input → at most a few thousand parameters per layer.
This difference is why CNNs are practical for computer vision tasks.
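This is easy to verify by counting parameters directly; a small PyTorch sketch, with layer sizes chosen purely for illustration:

```python
import torch.nn as nn

# Fully connected: flattened 256x256 grayscale image -> 256 hidden units.
fc = nn.Linear(256 * 256, 256)

# Convolution: 64 filters of size 3x3 on a single-channel input.
conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 16,777,472 (weights + biases)
print(count(conv))  # 640 (64 * 9 shared weights + 64 biases)
```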
MLOps & Deployment Relevance
Parameter sharing isn't just a theory - it impacts real-world deployment:
- Smaller model size: easier to deploy on edge devices (phones, IoT).
- Less computation: faster inference, lower cost in production.
- Hardware acceleration: GPUs/TPUs are optimized for convolutions, making CNNs very efficient at scale.
Types of Drift in Deep Learning Models
Drift occurs when the data distribution seen in production diverges from the training data. The three main types are covariate shift (changes in the input distribution), label shift (changes in class priors), and concept drift (changes in the relationship between inputs and outputs). In practice, drift detection involves monitoring statistical changes in features or embedding spaces and tracking model performance over time. Mitigation typically involves retraining pipelines, fine-tuning, and implementing robust monitoring frameworks.
Covariate Shift (Feature Drift)
- The input distribution $P(X)$ changes, but the relationship $P(Y \vert X)$ remains the same.
- Example: A model trained on medical images from one hospital (scanner type A) but deployed in another hospital (scanner type B), where the pixel intensity distribution is different.
- Impact: Predictions degrade because the model encounters inputs unlike those in the training data.
Prior Probability Shift (Label Shift)
- The distribution of labels $P(Y)$ changes, but the conditional $P(X \vert Y)$ stays the same.
- Example: In fraud detection, the proportion of fraudulent transactions may increase from 1% to 5% over time.
- Impact: The model becomes poorly calibrated because class priors are mismatched.
Concept Drift
- The underlying relationship $P(Y \vert X)$ changes over time.
- Example: In recommendation systems, user preferences evolve (e.g., seasonal shopping patterns).
- Impact: Even if the inputs look the same, the mapping from inputs to labels has changed, rendering the model fundamentally outdated.
Why Drifts Are Critical in Deep Learning
- DL models are data-hungry; shifts cause significant accuracy drops.
- They often act as black boxes, so drift may not be immediately explainable.
- In production, drift can cause bias reintroduction, safety issues (healthcare, finance), and loss of trust.
Detecting Drift in MLOps
Statistical Methods
- Covariate Drift: KS-test, Chi-square test, Maximum Mean Discrepancy (MMD), KL Divergence between feature distributions.
- Label Drift: Compare class frequencies over time with expected priors.
- Concept Drift: Monitor accuracy over time, or use two-sample tests between prediction errors in past vs. present.
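A minimal sketch of per-feature covariate-drift detection with a two-sample KS test from scipy (the significance threshold and the synthetic data are assumptions for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_col, prod_col, alpha=0.01):
    """Flag drift if two samples of one feature differ significantly."""
    statistic, p_value = ks_2samp(train_col, prod_col)
    return p_value < alpha

# Toy example: production feature values shifted relative to training.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5_000)
prod = rng.normal(0.5, 1.0, size=5_000)
print(detect_feature_drift(train, prod))  # True (drift detected)
```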
Embedding-Based Monitoring (DL-Specific)
- Use embeddings from intermediate layers of a neural net → monitor distribution shifts in feature space.
- Example: in NLP, track drift in sentence embeddings from a BERT model.
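A rough sketch of this idea using an RBF-kernel Maximum Mean Discrepancy between a reference batch and a production batch of embeddings (the bandwidth heuristic is an assumption, and random vectors stand in for real model embeddings here):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=None):
    """Biased estimate of squared MMD between two embedding batches."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # simple bandwidth heuristic (assumption)
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 64))   # e.g., training-time embeddings
production = rng.normal(0.3, 1.0, size=(200, 64))  # shifted production embeddings
print(mmd_rbf(reference, production))  # larger value => stronger drift signal
```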
Drift Detection Libraries/Tools
- Evidently AI, Fiddler AI, WhyLabs, AWS SageMaker Model Monitor, TFX Data Validation.