Day204 - DL Review: Revisiting Optimizers, CNNs & Data Drifts
Optimizers in Neural Networks, Parameter Sharing in CNNs, and Data & Concept Drifts

Optimizers
1. What Optimizers Do
Optimizers are algorithms that adjust the model's weights to minimize loss. The choice of optimizer affects:
- Convergence Speed (how fast the model learns)
- Stability (whether training oscillates, diverges, or converges smoothly)
- Generalization (how well the model performs on unseen data)
2. Core Optimizer Families
- Stochastic Gradient Descent (SGD)
- Mechanism: Updates weights using mini-batch gradients; **momentum** accelerates learning by smoothing updates.
- Pros: Simple, good generalization, widely used in practice (e.g., ResNet).
- Cons: Can be slow to converge, sensitive to learning rate.
- Adaptive Methods
- AdaGrad: Adapts per-parameter learning rates, which works well for sparse features, but its accumulated squared gradients cause learning rates to decay aggressively.
- RMSProp: Uses an exponential moving average of squared gradients to keep learning rates from decaying to zero (fixing AdaGrad's main weakness).
- **Adam (Adaptive Moment Estimation)**: Combines momentum and adaptive learning rates, often the default choice in practice.
- AdamW: Variant of Adam with proper weight decay regularization (preferred in modern NLP/CV models).
"SGD with momentum is robust and straightforward, often yielding better generalization in vision. Adam adapts learning rates per parameter, converging faster and making it popular in NLP and large-scale training. AdamW improves regularization, which is why it's now widely used in transformers."
3. Trade-offs in Optimizer Choice
- SGD (with momentum): Often better for generalization, used in vision tasks.
- Adam / AdamW: Converges faster, stable on noisy or sparse gradients, dominant in NLP and transformers.
- RMSProp: Historically significant in RNNs.
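As a rough illustration, here is how these three optimizers are instantiated in PyTorch (the model and hyperparameter values below are placeholders, not tuned recommendations):

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module works the same way.
model = nn.Linear(128, 10)

# SGD with momentum: a common default for vision backbones like ResNet.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adam: per-parameter adaptive learning rates from first/second moment estimates.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW: decouples weight decay from the gradient update (standard in transformers).
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```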
4. From an MLOps Perspective
Optimizers affect compute cost and reproducibility. In distributed training, large batch optimizers like LAMB or AdamW are critical. Hyperparameter tuning (learning rate, momentum, betas) should be logged and versioned for reproducibility.
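A minimal sketch of that logging idea, assuming a plain JSON file as the artifact (in a real pipeline this would go to an experiment tracker such as MLflow or Weights & Biases rather than a bare file):

```python
import json

# Hypothetical run metadata: every optimizer hyperparameter that affects
# reproducibility is captured alongside the training run.
optimizer_config = {
    "optimizer": "AdamW",
    "lr": 1e-3,
    "betas": [0.9, 0.999],
    "weight_decay": 0.01,
    "batch_size": 256,
}

with open("run_optimizer_config.json", "w") as f:
    json.dump(optimizer_config, f, indent=2)
```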
Parameter Sharing
Parameter sharing is the idea that, instead of learning a unique weight for every pixel connection (as in a fully connected layer), CNNs learn a set of weights (a filter/kernel) that is applied across the entire input image.
- In a fully connected network for an image of size $256 \times 256$, a single hidden neuron connected to all pixels would already need $65,536$ weights.
- In contrast, CNNs use a small filter (e.g., $3 \times 3$) with only $9$ weights shared across all spatial positions.
So, instead of having millions of unique parameters, CNNs reuse the same parameters to detect the same feature (like an edge, corner, or texture) at different parts of the image.
Why Do We Need It?
- Parameter Efficiency
- Dramatically reduces the number of parameters → faster training and less risk of overfitting.
- Example: A fully connected layer mapping a flattened $256 \times 256$ image to just $256$ hidden units already has $\sim 16.7M$ weights; a $3 \times 3$ convolutional layer with $64$ filters has only $64 \times 9 = 576$ weights (ignoring biases).
- Translation Invariance
- A filter detecting an "edge" in the top-left corner will also detect the same edge in the bottom-right.
- This allows CNNs to generalize across spatial locations.
- Locality & Hierarchy
- CNNs focus on local features first (edges, textures), and stacking layers allows them to build hierarchical features (object parts → complete objects).
How It Works Mathematically
Let's say we have a filter $K$ of size $3 \times 3$.
For each location $(i, j)$ in the image $X$, the convolution computes:

$$(X * K)(i, j) = \sum_{m=0}^{2} \sum_{n=0}^{2} X(i+m,\ j+n) \cdot K(m, n)$$

Here, the same filter $K$ is reused (shared) across all $(i, j)$. This is parameter sharing.
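A direct numpy translation of this sum (using the cross-correlation convention that deep learning frameworks call "convolution"; the array sizes are arbitrary):

```python
import numpy as np

def conv2d_valid(X, K):
    """Slide the SAME filter K over every location of X (valid padding)."""
    kh, kw = K.shape
    H, W = X.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The identical 9 weights in K are reused at every (i, j).
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)
    return out

X = np.random.rand(8, 8)   # toy "image"
K = np.random.rand(3, 3)   # one shared filter: only 9 parameters
print(conv2d_valid(X, K).shape)  # (6, 6)
```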
Example: Without vs. With Sharing
- Without sharing (fully connected): Each neuron learns its own weights → millions of parameters.
- With sharing (CNN): The same small filter slides (convolves) over the input → at most a few thousand parameters per layer.
This difference is why CNNs are practical for computer vision tasks.
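This is easy to verify by counting parameters directly; a small PyTorch sketch, with layer sizes chosen purely for illustration:

```python
import torch.nn as nn

# Fully connected: flattened 256x256 grayscale image -> 256 hidden units.
fc = nn.Linear(256 * 256, 256)

# Convolution: 64 filters of size 3x3 on a single-channel input.
conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 16,777,472 (weights + biases)
print(count(conv))  # 640 (64 * 9 shared weights + 64 biases)
```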
MLOps & Deployment Relevance
Parameter sharing isn't just a theory - it impacts real-world deployment:
- Smaller model size: easier to deploy on edge devices (phones, IoT).
- Less computation: faster inference, lower cost in production.
- Hardware acceleration: GPUs/TPUs are optimized for convolutions, making CNNs very efficient at scale.
Types of Drift in Deep Learning Models
Drift occurs when the data distribution seen in production diverges from the training data. The three main types are covariate shift (changes in the input distribution), label shift (changes in class priors), and concept drift (changes in the relationship between inputs and outputs). In practice, drift detection involves monitoring statistical changes in features or embedding spaces and tracking model performance over time. Mitigation typically involves retraining pipelines, fine-tuning, and implementing robust monitoring frameworks.
Covariate Shift (Feature Drift)
- The input distribution $P(X)$ changes, but the relationship $P(Y \vert X)$ remains the same.
- Example: A model trained on medical images from one hospital (scanner type A) but deployed in another hospital (scanner type B), where the pixel intensity distribution is different.
- Impact: Predictions degrade because the model encounters inputs unlike those in the training data.
Prior Probability Shift (Label Shift)
- The distribution of labels $P(Y)$ changes, but the conditional $P(X \vert Y)$ stays the same.
- Example: In fraud detection, the proportion of fraudulent transactions may increase from 1% to 5% over time.
- Impact: The model becomes poorly calibrated because class priors are mismatched.
Concept Drift
- The underlying relationship $P(Y \vert X)$ changes over time.
- Example: In recommendation systems, user preferences evolve (e.g., seasonal shopping patterns).
- Impact: Even if the inputs look the same, the mapping from inputs to labels has changed, rendering the model fundamentally outdated.
Why Drifts Are Critical in Deep Learning
- DL models are data-hungry; shifts cause significant accuracy drops.
- They often act as black boxes, so drift may not be immediately explainable.
- In production, drift can cause bias reintroduction, safety issues (healthcare, finance), and loss of trust.
Detecting Drift in MLOps
Statistical Methods
- Covariate Drift: KS-test, Chi-square test, Maximum Mean Discrepancy (MMD), KL Divergence between feature distributions.
- Label Drift: Compare class frequencies over time with expected priors.
- Concept Drift: Monitor accuracy over time, or use two-sample tests between prediction errors in past vs. present.
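A minimal sketch of per-feature covariate-drift detection with a two-sample KS test from scipy (the significance threshold and the synthetic data are assumptions for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_col, prod_col, alpha=0.01):
    """Flag drift if two samples of one feature differ significantly."""
    statistic, p_value = ks_2samp(train_col, prod_col)
    return p_value < alpha

# Toy example: production feature values shifted relative to training.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5_000)
prod = rng.normal(0.5, 1.0, size=5_000)
print(detect_feature_drift(train, prod))  # True (drift detected)
```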
Embedding-Based Monitoring (DL-Specific)
- Use embeddings from intermediate layers of a neural net → monitor distribution shifts in feature space.
- Example: in NLP, track drift in sentence embeddings from a BERT model.
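A rough sketch of this idea using an RBF-kernel Maximum Mean Discrepancy between a reference batch and a production batch of embeddings (the bandwidth heuristic is an assumption, and random vectors stand in for real model embeddings here):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=None):
    """Biased estimate of squared MMD between two embedding batches."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # simple bandwidth heuristic (assumption)
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 64))   # e.g., training-time embeddings
production = rng.normal(0.3, 1.0, size=(200, 64))  # shifted production embeddings
print(mmd_rbf(reference, production))  # larger value => stronger drift signal
```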
Drift Detection Libraries/Tools
- Evidently AI, Fiddler AI, WhyLabs, AWS SageMaker Model Monitor, TFX Data Validation.