5 minute read

Designing Machine Learning Systems: Sampling (Nonprobability, Simple Random, Stratified, Weighted, Reservoir, and Importance Sampling)



Training Data

How data quality, sampling, and representation shape model performance

In the previous chapter, we explored how data moves between systems. Now, we shift focus to something arguably even more fundamental in ML: the training data itself.

Despite the buzz around state-of-the-art models and algorithms, most experienced practitioners agree: models are only as good as the data they’re trained on. Yet, many ML courses emphasize modeling and treat data preparation as an afterthought. This chapter confronts that imbalance and provides tools and insights for managing training data effectively.

Why Training Data Matters

Training data isn’t just the data you use to fit a model—it’s everything that shapes how your model behaves in the real world. It includes:

  • Data for training, validation, and testing (often split into different subsets).
  • Labeled or unlabeled examples, structured or unstructured inputs.
  • All data collected or sampled before a model sees the world.

In short, training data is the raw material of machine learning. If it’s biased, noisy, or insufficient, no model—no matter how advanced—can compensate.

A Word of Caution: Data Biases

Training data is riddled with biases, many of which are invisible at first glance. These include sampling bias, which refers to collecting data that isn't representative; historical bias, which involves using data embedded with human or systemic prejudice; and labeling bias, arising from inconsistent or subjective labeling by annotators. While it is essential to use data, one should never trust it blindly, as biases can creep in during collection, annotation, storage, and even model evaluation.


Sampling: The Foundation of Training Data

Sampling refers to how we select data to include in training and is an essential process that occurs throughout a machine learning (ML) project. It is employed to build the training set from raw data, create validation and test splits, and monitor models during production.

Why do we sample?

We often don’t have access to all real-world data, and even when we do, processing all of it is infeasible. Sampling enables quicker experimentation and faster iteration.


1. Nonprobability Sampling

In nonprobability sampling, data is not selected according to any formal probability criteria. These methods are convenient, but the lack of randomness makes it hard to judge how representative the resulting data is, so their limitations and potential biases should be kept in mind.

Examples:

  • Convenience Sampling: Use what’s easy to access (e.g., Wikipedia for NLP).
  • Snowball Sampling: Start with a few samples, and expand from there (e.g., following Twitter accounts).
  • Judgment Sampling: Experts handpick examples.
  • Quota Sampling: Fixed counts for specific groups, without randomness.

⚠️ These methods are quick, but they often introduce bias and fail to represent real-world distributions. Still, they’re commonly used in practice, especially in early-stage projects or when high-quality labeled data is scarce.


2. Random (Probability-Based) Sampling

These methods rely on statistical rigor to produce more representative selections from the underlying population.

  • Simple Random Sampling
    • Every data point in the population has an equal probability of being selected. The method is valued for its straightforwardness and ease of implementation, which makes it a popular choice across fields such as social science, healthcare, and market research.
    • Limitation: Rare categories can be overlooked or missed entirely, skewing results for niche populations or small subgroups. Combining simple random sampling with stratified or cluster sampling helps ensure those groups are represented.
  • Stratified Sampling
    • Stratified sampling involves dividing data into distinct groups, known as strata, and sampling from each group separately. This technique ensures that each group is represented in the sample, which is particularly important when certain groups, even relatively small ones, may be underrepresented in the population.
    • This method is commonly utilized in classification tasks where certain classes, particularly minority ones, may be overlooked without careful sampling strategies. It helps improve the robustness and reliability of the insights gained from the data.
    • Limitation: It is challenging to stratify the data appropriately when dealing with overlapping classes or multilabel classification scenarios, where multiple classes may pertain to a single sample.
  • Weighted Sampling
    • Each sample is assigned a weight that determines its probability of being selected. This is useful when specific subgroups, such as recent users or high-value customers, matter more than others.
    • It becomes essential when the data distribution diverges from real-world expectations, such as during seasonal trends or demographic shifts.
    • This concept relates to sample weighting during training, which affects the loss function and decision boundaries of a model, and in turn its overall performance and interpretability (a short sketch of simple random, stratified, and weighted sampling follows this list).
  • Reservoir Sampling (for streaming data)
    • It is needed when:
      • You don’t know the total size of the dataset.
      • You can’t fit all the data in memory.
      • You want to maintain a representative sample over time, even as new data continuously arrives.
      • The data comes from an unpredictable source where you can't access the entire dataset simultaneously (e.g., Twitter).
      • You require that each element has an equal chance of being included in the final sample, preventing bias in your results.
    • Reservoir Sampling keeps a fixed-size sample with equal probability from a potentially infinite stream (e.g., Twitter).
  • How it works:
    1. Fill the reservoir with the first k elements.
    2. For each new element (the n-th seen so far), generate a random integer j between 1 and n.
    3. If j ≤ k, replace the j-th item in the reservoir with the new element; otherwise, discard it.
  • Reservoir sampling ensures uniform sampling from a data stream and is helpful in real-time or resource-constrained environments (a minimal sketch follows below).
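
The steps above are essentially the classic Algorithm R. Here is a minimal Python sketch under that interpretation; the function name and arguments are illustrative, not from the book:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Step 1: fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Step 2: pick a random position j between 1 and n.
            j = rng.randint(1, n)
            # Step 3: if j falls inside the reservoir, replace the item at that position.
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

# Example: sample 5 items uniformly from a "stream" of 10,000 events.
print(reservoir_sample(range(10_000), k=5, seed=42))
```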

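For the other probability-based methods above, here is a short sketch of simple random, stratified, and weighted sampling with pandas and NumPy; the DataFrame, column names, and weighting scheme are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical dataset: a heavily imbalanced label plus a recency feature.
df = pd.DataFrame({
    "label": rng.choice(["common", "rare"], size=1_000, p=[0.95, 0.05]),
    "days_since_signup": rng.integers(0, 365, size=1_000),
})

# Simple random sampling: every row has the same chance of selection.
simple = df.sample(n=100, random_state=0)

# Stratified sampling: draw 10% from each label group so the rare class is kept.
stratified = df.groupby("label").sample(frac=0.10, random_state=0)

# Weighted sampling: make recent users more likely to be selected.
weights = 1.0 / (1.0 + df["days_since_signup"])
weighted = df.sample(n=100, weights=weights, random_state=0)

print(simple["label"].value_counts())
print(stratified["label"].value_counts())
print(weighted["days_since_signup"].median())
```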

3. Importance Sampling

Sometimes, you want to simulate sampling from one distribution, but only have access to another. That’s where importance sampling comes in.

You sample from a more convenient or efficient distribution $Q(x)$, but correct the bias by reweighting samples using the ratio $\frac{P(x)}{Q(x)}$.

You trade sampling ease for statistical correction.
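
As a minimal illustration of the reweighting idea, the sketch below estimates $\mathbb{E}_P[f(x)]$ by drawing from a convenient proposal $Q$ and weighting each draw by $\frac{P(x)}{Q(x)}$; the specific Gaussians and choice of $f$ are toy assumptions, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target P: N(3, 1). Proposal Q: N(0, 2) -- assume we can only sample from Q.
mu_p, sd_p = 3.0, 1.0
mu_q, sd_q = 0.0, 2.0

def f(x):
    return x ** 2  # quantity whose expectation under P we want

# Draw from Q, then correct each sample with the weight w(x) = P(x) / Q(x).
x = rng.normal(mu_q, sd_q, size=100_000)
w = normal_pdf(x, mu_p, sd_p) / normal_pdf(x, mu_q, sd_q)
estimate = np.mean(w * f(x))

print(f"importance-sampling estimate of E_P[x^2]: {estimate:.2f} (exact: 10.0)")
```

This only works because $Q$ places probability everywhere $P$ does (both are Gaussians), which is exactly the key condition listed below.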

  • When is it used?
    1. Reinforcement Learning: Estimating new policy performance using past experiences from an older policy.
    2. Rare Event Modeling: Crucial for estimating the probabilities of events that are too rare to observe often under direct sampling.
    3. Simulations: Often employed when collecting real-world data is impractical due to high costs or slow data collection.
  • Additional Uses:
    1. Robotics: Reinforcement learning is used to train robots to navigate intricate environments adeptly, enhancing their ability to perform tasks autonomously.
    2. Finance: It is vital in evaluating risks and constructing investment strategies based on historical data and predicted future trends.
    3. Healthcare: This technique is utilized to forecast patient outcomes by analyzing past medical records and treatment results, aiding in better patient care and decision-making.
  • Key Condition:

    • Q(x) > 0 wherever P(x) ≠ 0: the proposal distribution Q must assign nonzero probability everywhere the target distribution P does. Otherwise, some outcomes that matter under P can never be drawn, and the reweighted estimate silently ignores those regions.


