Day166 - MLOps Review: Feature Engineering (3)
Designing Machine Learning Systems: Data Leakage (Definition, Common Causes, Detecting and Preventing it)
Data Leakage
Data leakage is one of the most dangerous and subtle pitfalls in the development of machine learning systems. It occurs when information that would not be available at inference time (deployment) is inadvertently included during model training. This can lead to overly optimistic evaluation results and poor real-world performance.
What Is Data Leakage?
At its core, data leakage occurs when the training features contain information that reveals the label, directly or indirectly, even though that information would not be available during real-world use. This can make the model appear highly accurate during testing, only to fail dramatically once deployed.
In July 2021, MIT Technology Review published an article highlighting how many AI tools designed to predict COVID-19 risks from medical scans failed in real-world use. For example, some models were trained on scans taken when patients were lying down or standing, and the models learned to associate a patient’s position with the severity of COVID-19, which isn’t a reliable indicator. Other models picked up on hospital-specific fonts used to label scans, inadvertently making font style a predictor of COVID risk. These issues are examples of data leakage, where information not available during actual prediction is mistakenly included in the model's features, leading to misleading results.
What makes data leakage particularly challenging is that it is often non-obvious: it can go unnoticed even with rigorous cross-validation and testing, and it has tripped up both novice practitioners and leading researchers.
Real-World Examples of Data Leakage
1. Cancer Prediction from CT Scans
A model trained on data from Hospital A performed poorly when tested on data from Hospital B. It turned out that Hospital A routed patients with suspected cancer to specialized CT machines, so the model learned to associate the scan characteristics of those machines, rather than actual tumor indicators, with the cancer label. Because Hospital B's workflow had no such pattern, performance dropped sharply.
2. Kaggle Ion Switching Competition
In the 2020 Ion Switching competition, the test data was generated from the training data. Some participants reverse-engineered the generation process, which allowed them to recover the test labels. Although this isn't a leakage scenario you would typically face in production, it highlights how easily overlooked design choices in data creation can be exploited.
Common Causes of Data Leakage and How to Prevent Them
1. Random Splits on Time-Correlated Data
- Random splitting of time-correlated data lets future information leak into the training set, which inflates evaluation metrics and masks how the model will actually perform. For instance, stock market or recommendation data often exhibits trends specific to certain days. If such data is split randomly, the model effectively gets to "learn" from the future it will later be evaluated on.
- Prevention: Split your dataset based on time, not randomly. For instance, train on weeks 1–4 and test on week 5. This simulates real-world conditions more accurately.
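As a rough sketch, a time-based split might look like the following, assuming a pandas DataFrame with an event_time timestamp and a label column (both names are illustrative):

```python
import pandas as pd

# Illustrative dataset with an event_time timestamp and a label column.
df = pd.read_csv("events.csv", parse_dates=["event_time"]).sort_values("event_time")

# Train on the first four weeks, test on the fifth, instead of splitting randomly.
cutoff = df["event_time"].min() + pd.Timedelta(weeks=4)
train = df[df["event_time"] < cutoff]
test = df[df["event_time"] >= cutoff]

X_train, y_train = train.drop(columns=["label", "event_time"]), train["label"]
X_test, y_test = test.drop(columns=["label", "event_time"]), test["label"]
```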
2. Scaling Before Splitting
- Feature scaling requires global statistics, such as the mean and variance. If these are computed on the full dataset before the train-test split, the model ends up having information from the test set baked into its preprocessing pipeline.
- Prevention: Always split the data first and compute scaling parameters using only the training set. Apply those parameters to both the validation and test sets.
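A minimal sketch with scikit-learn's StandardScaler, assuming a feature matrix X and labels y are already loaded:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then compute scaling statistics on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/variance from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics, no refit
```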
3. Filling Missing Values with Global Statistics
- A common imputation strategy is to fill missing values with the mean or median of a feature. If these statistics are computed on the entire dataset before splitting, information from the test set leaks into training.
- Prevention: Compute imputation values using only the training split.
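For example, with scikit-learn's SimpleImputer, fit on the training split and only transform the others (the X_train and X_test names carry over from the previous sketch):

```python
from sklearn.impute import SimpleImputer

# Median computed from the training split only, then reused everywhere else.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
```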
4. Data Duplication
- If duplicate or near-identical records appear in both training and test sets, it can artificially inflate performance metrics. This is especially common in datasets that aggregate multiple sources or have undergone oversampling.
- Prevention: Check for duplicates before and after splitting the data.
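One way to check, assuming the train and test DataFrames from earlier, is to count exact duplicates within and across the splits:

```python
import pandas as pd

# Exact duplicates inside each split.
print("train duplicates:", train.duplicated().sum())
print("test duplicates:", test.duplicated().sum())

# Rows that appear in both splits (matched on every shared column).
overlap = pd.merge(train, test, how="inner")
print("train/test overlap:", len(overlap))
```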
5. Group Leakage
- Sometimes, strongly related samples end up in both the training and test sets. For example, two CT scans from the same patient are likely to share the same label. If one scan lands in training and the other in testing, the model can effectively memorize the patient, and test performance will look better than it really is.
- Prevention: Use group-based splitting to ensure all samples from a related group (e.g., one patient) are assigned to the same split.
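scikit-learn's group-aware splitters handle this directly; the sketch below assumes NumPy arrays X and y plus a patient_ids array holding one group identifier per sample:

```python
from sklearn.model_selection import GroupShuffleSplit

# One group identifier per sample, e.g. the patient each scan belongs to.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

X_train, X_test = X[train_idx], X[test_idx]  # every patient lands in exactly one split
y_train, y_test = y[train_idx], y[test_idx]
```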
6. Leakage from Data Collection Processes
- Features that result from specific operational procedures—like scanning machines or workflow routing—can contain information that correlates with the label. These correlations don’t hold in other environments.
- Prevention: Understand the complete data generation process. Normalize or standardize differences across data sources. For example, ensure that scan images have a consistent resolution, regardless of the type of machine used.
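As an illustration only, scans from different machines could be resampled to a shared resolution before training; the target size and helper function below are assumptions for the sketch, not a prescription from the book:

```python
from pathlib import Path
from PIL import Image

TARGET_SIZE = (512, 512)  # assumed shared resolution across all machines

def standardize_scan(path: Path, out_dir: Path) -> None:
    """Resample a scan so machine-specific resolutions can't act as a label proxy."""
    img = Image.open(path).convert("L")   # single-channel grayscale
    img.resize(TARGET_SIZE).save(out_dir / path.name)
```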
Detecting Data Leakage
Data leakage can happen at different stages, including data creation, collection, sampling, splitting, and feature engineering. It’s crucial to watch for data leakage throughout the entire lifecycle of a machine learning project.
1. Analyze Feature Predictiveness
Measure the predictive power of each feature relative to the label. If a feature has an unexpectedly high correlation, investigate how it’s generated and whether its presence is justified.
Sometimes, two features independently do not leak label information, but together they do. For example, the combination of start_date and end_date reveals the target variable tenure.
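One hedged way to screen for suspiciously predictive features is to rank them by mutual information with the label, for instance with scikit-learn (assuming a numeric feature DataFrame X and a label series y):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Rank features by how much information they carry about the label.
mi = mutual_info_classif(X, y, random_state=42)
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))  # unusually dominant features deserve a closer look
```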
2. Conduct Ablation Studies
Remove features one by one and observe the impact on model performance. A significant drop suggests that the feature may be highly informative or leaking information.
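A simple leave-one-feature-out loop, sketched here with a random forest and cross-validation purely for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)
baseline = cross_val_score(model, X_train, y_train, cv=5).mean()

# Drop one feature at a time and compare against the baseline score.
for col in X_train.columns:
    score = cross_val_score(model, X_train.drop(columns=[col]), y_train, cv=5).mean()
    print(f"{col}: {baseline:.3f} -> {score:.3f}")
```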
3. Monitor Sudden Accuracy Spikes
If adding a new feature significantly improves validation performance, scrutinize whether it contains information that is unavailable during real-world inference.
4. Keep the Test Set Untouched
Use the test set only once for final model evaluation. Do not use it for feature engineering, hyperparameter tuning, or exploratory analysis. Doing so can cause unintentional data leakage.