
Designing Machine Learning Systems: Causes of ML System Failures (2) (Correcting Degenerate Feedback Loops) & Data Distribution Shifts



Continuing from the previous post...

How to Correct Degenerate Feedback Loops

Because degenerate feedback loops are such a common problem, there are many proposed methods for correcting them. As discussed previously, degenerate feedback loops can cause a system’s outputs to become more homogeneous over time.

Two widely used strategies are:

1. Randomization

Core idea: Add randomness to break the loop.

  • Show some random items to users, even if they aren’t top-ranked.
  • Use the feedback from this exploration traffic to estimate true item quality, not just popularity.
  • In recommender systems, instead of only showing users the highest-ranked items, TikTok shows some random items and uses the resulting feedback to assess their true quality:
    • Every new video is placed in an initial random traffic pool (e.g., 100 impressions).
    • Based on early user feedback (likes, watch time), TikTok decides whether the video is promoted to a larger pool or deemed irrelevant.

Tradeoff:

  • Fully random recommendations can hurt the user experience and risk losing users. Smarter exploration strategies, such as contextual bandits, can increase item diversity while largely preserving accuracy.
    • Schnabel et al. combine a small amount of randomness with causal inference techniques to evaluate songs fairly, thereby improving the fairness of recommender systems.
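
To make the randomization idea concrete, here is a minimal sketch of blending a model’s top-ranked items with a small exploration budget. The `score_fn`, the candidate list, and the 10% exploration rate are all hypothetical placeholders for illustration, not how TikTok’s system is actually implemented.

```python
import random

def recommend(user, candidates, score_fn, k=10, exploration_rate=0.1):
    """Return a slate of k items that mixes top-ranked items with random ones."""
    # Rank candidates by the (hypothetical) relevance model.
    ranked = sorted(candidates, key=lambda item: score_fn(user, item), reverse=True)

    n_explore = max(1, int(k * exploration_rate))  # slots reserved for exploration
    n_exploit = k - n_explore

    exploit_slate = ranked[:n_exploit]
    # Exploration items come from outside the top-ranked set, so unpopular or
    # new items still receive (unbiased) exposure and feedback.
    explore_pool = ranked[n_exploit:]
    explore_slate = random.sample(explore_pool, min(n_explore, len(explore_pool)))

    slate = exploit_slate + explore_slate
    random.shuffle(slate)  # avoid always pushing explored items to the bottom
    return slate
```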


2. Positional Features

Users are more likely to click items shown at the top of the screen. Therefore, if you don’t account for position bias, your model will learn incorrect patterns.

We’ve discussed how degenerate feedback loops are caused by user feedback that is biased by where predictions are displayed. For example, in a recommender system that suggests five songs, the top song is the most likely to be clicked. It’s unclear whether the top song was clicked because the model ranked it well or simply because it was shown first.

How to fix this:

  1. During training, record position as a feature:

    • e.g., “Is this item shown in position 1?”
    • If the position of a prediction affects its feedback, encode position using features. These can be numerical (e.g., 1, 2, 3) or Boolean (e.g., first position or not).
  2. Train your model to learn how position affects clicks.

  3. At inference time, neutralize the position bias:

    • Set the positional feature (e.g., “1st Position”) to False for all candidates.
    • This enables your model to estimate item relevance without considering position influence.
  4. Then rank the items by these scores and decide final positions.

    • Example Table:

      | ID | Song      | 1st Position | Click |
      |----|-----------|--------------|-------|
      | 1  | Shallow   | False        | No    |
      | 2  | Good Vibe | False        | No    |
      | 3  | In Bloom  | True         | Yes   |
      | 4  | Shallow   | True         | Yes   |
    • When making predictions, the goal is to determine whether a user will click on a song, regardless of its position in the recommendations. Set the “1st Position” feature to false, then analyze the model’s predictions for each user to determine the song presentation order.
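
Below is a minimal sketch of this recipe using scikit-learn and a toy dataset mirroring the table above; the single “relevance” value standing in for each song’s real features is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data mirroring the table: [relevance_feature, is_1st_position].
# A single hypothetical "relevance" feature stands in for the real song features.
X_train = np.array([
    [0.2, 0],   # Shallow,   not 1st position -> no click
    [0.4, 0],   # Good Vibe, not 1st position -> no click
    [0.9, 1],   # In Bloom,  1st position     -> click
    [0.3, 1],   # Shallow,   1st position     -> click
])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# At inference time, neutralize position bias: set is_1st_position to 0 (False)
# for every candidate, then rank by the predicted click probability alone.
candidates = np.array([[0.2, 0], [0.4, 0], [0.9, 0], [0.3, 0]])
scores = model.predict_proba(candidates)[:, 1]
ranking = np.argsort(-scores)  # best candidate first
```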


Advanced Fix: Two-Model Strategy

A more robust approach uses two models:

  1. Exposure Model (with position as a feature):
    • Predicts the probability a user sees/clicks the item given its position
  2. Interest Model (without position):
    • Predicts genuine user interest assuming equal exposure

Final ranking is based on the interest model, not the exposure-biased one.
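
A rough sketch of the two-model idea, again with made-up logged data. This just trains the two models as described above and ranks with the interest model; a more rigorous variant would also weight the interest model’s training examples by the inverse of the exposure model’s probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical click logs: item features, the position each item was shown at,
# and whether it was clicked.
item_features = np.array([[0.2], [0.4], [0.9], [0.3], [0.8], [0.1]])
positions     = np.array([1, 2, 1, 3, 2, 3])
clicks        = np.array([0, 0, 1, 0, 1, 0])

# Exposure model: includes position, so it captures how placement drives clicks.
X_exposure = np.hstack([item_features, positions.reshape(-1, 1)])
exposure_model = LogisticRegression().fit(X_exposure, clicks)

# Interest model: ignores position, approximating genuine interest under equal exposure.
interest_model = LogisticRegression().fit(item_features, clicks)

# Final ranking uses only the interest model's scores.
candidates = np.array([[0.2], [0.4], [0.9], [0.3]])
interest_scores = interest_model.predict_proba(candidates)[:, 1]
final_order = np.argsort(-interest_scores)
```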


Data Distribution Shifts

Data distribution shift refers to the phenomenon in supervised learning where the data a model works with changes over time, causing the model’s predictions to become less accurate as time passes.

A data distribution shift happens when the statistical properties of the input data (or input–output relationship) change between training and deployment. That is:

  • Source Distribution: Training data (what the model learns from)
  • Target Distribution: Inference or real-world data (what the model predicts on)

This change can seriously degrade performance—even if the model was excellent during training and validation.


Types of Distribution Shifts

Mathematically, let:

  • $X$: Input features
  • $Y$: Target labels
  • $P(X, Y)$: Joint distribution the model assumes
  • $P(Y \vert X)$: Conditional distribution modeled by supervised learning

We can decompose this joint distribution in two ways:

$P(X, Y) = P(X) \cdot P(Y \vert X) = P(Y) \cdot P(X \vert Y)$

1. Covariate Shift
  • Definition: $P(X)$ changes, but $P(Y \vert X)$ stays the same
  • Covariate shift is a well-studied type of data distribution shift. In statistics, a covariate is an independent variable that influences an outcome but is not the main focus. For instance, if you’re examining how locations affect housing prices, the price is your primary interest, while square footage is a covariate. In supervised machine learning, the label is the main interest, and the input features are the covariates.
  • In other words, the distribution of the inputs changes, while the conditional probability of an output given an input remains the same.
  • Example: A breast cancer detection model is trained mostly on data from women over 40, but at inference time it is applied to a younger population. The probability of having cancer given a particular age stays the same.
  • Common Causes:
    • Biased data collection
      • For example, if you collect breast cancer data from a clinic, it may be skewed toward women over 40, as doctors recommend regular checkups for this age group.
    • Active learning
      • Rather than randomly selecting samples for model training, we choose the most beneficial ones based on specific heuristics. This alters the training input distribution, leading to covariate shifts compared to the real-world input distribution.
    • Real-world behavioral shifts
      • Suppose that you have a model predicting how likely a free user will convert to a paid user, with income level as one feature. Recently, your marketing department targeted a more affluent demographic, changing the model's input distribution. However, the conversion probability for users at each income level remains the same.
  • Mitigation: Importance weighting: reweight the training data so that it reflects the real-world (target) input distribution, as sketched below.
    • Estimate the density ratio between the real-world input distribution and the training input distribution.
    • Weight the training data according to this ratio and train an ML model on this weighted data.
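
One common way to estimate the density ratio in practice (among several) is to train a domain classifier that distinguishes serving data from training data; the classifier’s odds approximate the ratio of the serving density to the training density. The age distributions below are made-up numbers for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical inputs: training data skews older, serving data skews younger.
X_train = rng.normal(loc=50, scale=8, size=(1000, 1))   # ages seen at training time
X_serve = rng.normal(loc=35, scale=8, size=(1000, 1))   # ages seen in production

# Domain classifier: label 0 = training sample, 1 = serving sample.
X_domain = np.vstack([X_train, X_serve])
y_domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_serve))])
domain_clf = LogisticRegression().fit(X_domain, y_domain)

# The classifier's odds approximate the density ratio p_serve(x) / p_train(x).
p_serve = domain_clf.predict_proba(X_train)[:, 1]
importance_weights = p_serve / np.clip(1.0 - p_serve, 1e-6, None)

# The task model is then trained on the reweighted training data, e.g.:
# task_model.fit(X_train, y_train, sample_weight=importance_weights)
```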


2. Label Shift (Prior Shift)
  • Definition: $P(Y)$ changes, but $P(X \vert Y)$ stays the same. When the output distribution changes but, for a given output, the input distribution stays the same.
  • Because a change in the input distribution usually changes the output distribution as well, covariate shift often comes with label shift, and the two can occur simultaneously.
    • For example, in a breast cancer study, there may be more women over 40 in the training data than in the inference data, leading to a higher percentage of positive labels in the training set. However, if we randomly select two individuals with breast cancer—one from the training data and one from the test data—both have the same probability of being over 40. This consistency in conditional probability, $P(X \vert Y)$, indicates that this is also a case of label shift.
    • However, not all label shifts are the result of covariate shifts. For example, imagine a preventive drug that reduces breast cancer risk for all women. Here, the probability $P(Y \vert X)$ decreases across all ages while the input distribution $P(X)$ is unchanged, so this is not a covariate shift. But among those already diagnosed with breast cancer, the age distribution remains the same, so it is still a case of label shift.
  • Example: The breast cancer rate declines thanks to medical advances, so the label distribution $P(Y)$ changes, but the age distribution among cancer-positive patients stays the same.
  • Common Causes:
    • External interventions
    • Temporal trends
  • Mitigation: Estimate $P(Y)$ in the new data using techniques like Black Box Shift Estimation (BBSE) and adapt model outputs.
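
A minimal sketch of the BBSE idea: combine the fixed (“black box”) classifier’s confusion matrix on labeled source validation data with its prediction distribution on unlabeled target data to estimate how $P(Y)$ has shifted. The function and variable names are hypothetical.

```python
import numpy as np

def bbse_label_weights(val_preds, val_labels, target_preds, n_classes):
    """Estimate w[y] = P_target(y) / P_source(y) via Black Box Shift Estimation.

    val_preds / val_labels: the classifier's hard predictions (integer class
    indices) and true labels on a held-out source validation set;
    target_preds: its predictions on unlabeled target data. BBSE solves
    C @ w = mu, where C is the source joint confusion matrix and mu is the
    prediction distribution on the target data.
    """
    C = np.zeros((n_classes, n_classes))
    for pred, true in zip(val_preds, val_labels):
        C[pred, true] += 1.0
    C /= len(val_labels)                       # C[i, j] = P_source(pred=i, y=j)

    mu = np.bincount(target_preds, minlength=n_classes) / len(target_preds)

    w = np.linalg.solve(C, mu)                 # assumes C is invertible
    return np.clip(w, 0.0, None)
```

The resulting weights can be used to reweight the training data for retraining, or to rescale the model’s predicted class probabilities (multiply $P(y \vert x)$ by $w[y]$ and renormalize).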


3. Concept Drift (Posterior Shift)
  • Definition: $P(Y \vert X)$ changes, but $P(X)$ remains the same.
    • Concept drift, or posterior shift, happens when the input distribution stays the same, but the output given that input changes. This can be thought of as “the same input, but a different output.”
    • For example, before COVID-19, a three-bedroom apartment in San Francisco could cost $2,000,000, but during the pandemic, its price dropped to $1,500,000 as many residents left the city. While the features of the houses remained consistent, the price distribution changed.
    • Concept drifts are often cyclic or seasonal, like rideshare prices which vary from weekdays to weekends, or flight ticket prices that rise during holidays. Companies typically use separate models to account for these variations.
  • Example: Same apartment used to cost $2M pre-pandemic; now it’s $1.5M. Features didn’t change, but pricing logic did.
  • Common Causes:
    • Market dynamics
    • Policy changes
    • Seasonality
  • Mitigation:
    • Drift-aware retraining
    • Specialized models per season/context
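
Detecting a change in $P(Y \vert X)$ ultimately requires (possibly delayed) ground-truth labels, so one pragmatic retraining trigger is watching the realized error climb above its training-era baseline. This is a minimal sketch under that assumption; the tolerance value is arbitrary.

```python
import numpy as np

def concept_drift_alarm(baseline_errors, recent_errors, tolerance=0.05):
    """Trigger retraining when recent error exceeds the baseline by a margin.

    baseline_errors: per-example losses measured around training time;
    recent_errors: per-example losses on recent traffic once ground-truth
    labels arrive. A gap larger than `tolerance` is treated as evidence that
    P(Y | X) has moved and the model should be retrained.
    """
    return float(np.mean(recent_errors)) > float(np.mean(baseline_errors)) + tolerance
```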

General Data Distribution Shifts

There are other types of changes in the real world that, even though not well studied in research, can still degrade your models’ performance. These are practical concerns not captured by the above math:

Feature Change

Feature changes can happen when new features are added, old ones removed, or the possible values of a feature change. For example, if a model switches from using years to months for the “age” feature, the value range has shifted. Once, our model’s performance dropped significantly due to a bug that resulted in a feature returning NaNs (not a number).

  • Changes in feature format (e.g., age in years → age in months)
  • Introduction/removal of features
  • Bugs causing NaNs
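
A small sketch of the kind of feature validation check that would catch the changes listed above; the expected ranges are hypothetical and would normally come from training-time statistics.

```python
import pandas as pd

def validate_features(batch: pd.DataFrame, expected_ranges: dict) -> list:
    """Return a list of problems found in an inference batch.

    expected_ranges maps feature name -> (min, max) observed at training time.
    """
    problems = []
    for column, (lo, hi) in expected_ranges.items():
        if column not in batch.columns:
            problems.append(f"missing feature: {column}")
            continue
        values = batch[column]
        if values.isna().any():
            problems.append(f"NaNs in feature: {column}")
        if not values.dropna().between(lo, hi).all():
            problems.append(f"values outside training range for: {column}")
    return problems

# Example: catches the years -> months switch for "age", since ages in months
# fall far outside the 0-120 range seen during training.
issues = validate_features(pd.DataFrame({"age": [300, 420]}), {"age": (0, 120)})
```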

Label Schema Change

A label schema change happens when the set of possible label values $(Y)$ changes. In label shift, only the distribution of labels, $P(Y)$, changes, while the relationship between features and labels, $P(X\vert Y)$, remains the same. However, in a label schema change, both $P(Y)$ and $P(X\vert Y)$ are affected.

A schema defines the structure of data, with the label schema describing how labels are organized. For example, a dictionary mapping classes to integers, like {“POSITIVE”: 0, “NEGATIVE”: 1}, illustrates a schema.

  • New classes added (e.g., “ANGRY” split from “NEGATIVE”)
  • Regression label range changed
  • May require full model re-engineering
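
A tiny illustration of a label schema change, extending the sentiment mapping above; the specific split of “ANGRY” out of “NEGATIVE” mirrors the example in the list.

```python
# Old schema: two sentiment classes.
OLD_SCHEMA = {"POSITIVE": 0, "NEGATIVE": 1}

# New schema after splitting "ANGRY" out of "NEGATIVE".
NEW_SCHEMA = {"POSITIVE": 0, "NEGATIVE": 1, "ANGRY": 2}

# A model trained against OLD_SCHEMA has a two-way output and has never seen
# the NEGATIVE vs. ANGRY distinction, so both P(Y) and P(X | Y) change:
# existing NEGATIVE examples need relabeling, and the model's output layer
# must be resized and retrained before it can emit the new class.
```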


