
Designing Machine Learning Systems: Data Augmentation (Simple Label-Preserving Transformations, Perturbation, and Data Synthesis)



Data Augmentation

Data augmentation refers to the process of expanding a training dataset by generating new, synthetic examples from the existing data. Initially used to combat data scarcity, data augmentation has now become a staple in modern ML pipelines due to its benefits in enhancing generalization, mitigating overfitting, and improving model robustness to noise or adversarial perturbations.

Its application spans both computer vision (CV) and natural language processing (NLP), although the methods vary significantly due to the structure of the data.


Simple Label-Preserving Transformations

These transformations modify the input data without changing the associated label. They help expand datasets while maintaining the integrity of class labels.

  • In computer vision, the most straightforward data augmentation technique is to randomly modify an image while preserving its label. Techniques include cropping, flipping, rotating, inverting, erasing parts of the image, and more. For example, a rotated dog is still a dog. ML frameworks like PyTorch, TensorFlow, and Keras have built-in support for image augmentation (a minimal sketch follows this list).

  • Within NLP, you can randomly replace a word with a similar one, assuming the change doesn’t alter the sentence’s meaning or sentiment. Similar words can be found in a thesaurus or by looking for words whose embeddings are close together in an embedding space. For example,

    • Original sentence: I’m so happy to see you.
    • Generated sentences: I’m so glad to see you. / I’m so happy to see y’all. / I’m very happy to see you.
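
As a concrete illustration of the computer vision bullet above, here is a minimal sketch of a random, label-preserving augmentation pipeline using torchvision; the particular transforms and parameter values are illustrative choices, not a recommended recipe.

```python
from torchvision import transforms

# A label-preserving augmentation pipeline: each transform changes pixels,
# not the class of the object in the image (a rotated dog is still a dog).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(p=0.5),  # flip left-right half of the time
    transforms.RandomRotation(degrees=15),   # small random rotation
    transforms.ToTensor(),                   # PIL image -> float tensor in [0, 1]
    transforms.RandomErasing(p=0.25),        # randomly erase a rectangular patch
])

# Usage: augmented = augment(pil_image); each call yields a different variant,
# so every training epoch sees slightly different versions of the same images.
```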


Perturbation: Injecting Controlled Noise to Enhance Robustness

Neural networks are typically sensitive to noise. In computer vision, this means that even a small amount of noise added to an image can lead to misclassification by a neural network.

A study (by Su et al.) showed that 67.97% of natural images in the Kaggle CIFAR-10 test dataset and 16.04% of ImageNet test images can be misclassified by altering just one pixel.

Adversarial attacks involve using deceptive inputs to trick a neural network into making incorrect predictions. Adding noise to samples is a common technique for generating adversarial samples, and adversarial attacks become even more effective as image resolution increases.

  • In computer vision, adding noisy samples to training data helps models identify weak spots in their learned decision boundaries and improves their performance. Noisy samples can be created either by adding random noise or by using a search strategy. Moosavi-Dezfooli et al. proposed DeepFool, an algorithm that finds the minimum amount of noise needed to cause a misclassification with high confidence. This type of augmentation is called adversarial augmentation (see the sketches after this list).

  • Adversarial augmentation is less common in NLP: adding random pixels to a picture of a bear still leaves it looking like a bear, but inserting random characters into a sentence often turns it into gibberish. However, perturbation has been used to make models more robust.

    • BERT-style pretraining randomly selects 15% of tokens; of those, 80% are replaced with [MASK], 10% with a random word, and 10% are left unchanged (see the masking sketch after this list):
    • Example: “My dog is hairy” → “My dog is [MASK]” or “My dog is apple.”
    • Although some outputs are nonsensical, the model learns to utilize context for prediction, thereby improving generalization.
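
DeepFool searches for the minimal perturbation that flips a prediction; as a simpler stand-in, the sketch below uses the fast gradient sign method (FGSM) to generate noisy training samples. It assumes a PyTorch classifier, images scaled to [0, 1], and an illustrative epsilon.

```python
import torch
import torch.nn.functional as F

def fgsm_examples(model, images, labels, epsilon=0.03):
    """Create adversarially perturbed copies of a batch of images.

    Fast gradient sign method (FGSM): take one step in the direction that
    increases the classification loss, then clamp back to the valid pixel
    range. The perturbed images keep their original labels and can be
    appended to the training set as adversarial augmentation.
    """
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    perturbed = images + epsilon * images.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```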
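
On the NLP side, here is a rough sketch of BERT-style token masking; the [MASK] string, the vocabulary list, and whitespace tokenization are simplifying assumptions, while the 80/10/10 split follows the original BERT recipe.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; real pipelines use the tokenizer's mask id

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Randomly corrupt ~15% of tokens for masked-language-model training.

    Of the selected tokens: 80% become [MASK], 10% become a random word,
    and 10% are left unchanged.
    """
    corrupted = list(tokens)
    for i in range(len(corrupted)):
        if random.random() < mask_prob:
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = MASK_TOKEN
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return corrupted

# Example: mask_tokens("my dog is hairy".split(), vocab=["apple", "car", "blue"])
```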


Data Synthesis: Creating New Examples from Scratch

Because data collection is costly, time-consuming, and privacy-sensitive, training models on synthesized data is an appealing alternative. While we can’t yet synthesize all of a model’s training data, generating some of it can still boost performance.

Synthetic data generation expands the training set using templated logic, statistical mixing, or generative models. It’s beneficial when hand-labeled data is limited or sensitive.

  • In NLP (Template-Based Generation)

    • Handwritten templates can bootstrap intent classification datasets (a sketch appears at the end of this section).
    • Template: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION].”
    • You can generate thousands of training queries from a template using lists of cuisines, reasonable distances (likely not exceeding 1,000 miles), and locations (such as home, office, landmarks, or exact addresses) for each city.
    • Examples:
      • “Find me a Thai restaurant within 2 miles of my home.”
      • “Find me a Mexican restaurant within 5 miles of Times Square.”
    • This strategy supports chatbots, virtual assistants, and search applications.
  • In Computer Vision (Mixup and GANs):

    • Mixup: combine existing examples and their discrete labels by linear interpolation to create new examples with continuous labels.

      • Consider a task of classifying images with two possible labels: DOG (encoded as 0) and CAT (encoded as 1). From example $x_1$ with label DOG and example $x_2$ with label CAT, you can generate a new example $x'$ such that:

        $x' = \gamma x_1 + (1-\gamma)x_2$
      • The label of $x'$ is the corresponding combination of the labels of $x_1$ and $x_2$: $\gamma \times 0 + (1 - \gamma) \times 1$. This method is called mixup.

      • Mixup can improve models’ generalization, reduce their memorization of corrupt labels, increase their robustness to adversarial examples, and stabilize the training of generative adversarial networks (a sketch appears at the end of this section).

    • CycleGAN and similar models allow realistic domain translation (e.g., day ↔ night, CT ↔ MRI), producing synthetic samples for underrepresented classes.
    • Sandfort et al. demonstrated improved medical image segmentation performance using GAN-augmented datasets.

Data synthesis provides high flexibility and scalability, particularly in industrial and privacy-sensitive applications.
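
Picking up the template-based generation example, the sketch below expands the restaurant template with illustrative value lists; the cuisines, distances, and locations here are placeholders, not a real inventory.

```python
from itertools import product

# Illustrative value lists; a production system would draw these from real
# cuisine, distance, and location inventories for each city.
CUISINES = ["Thai", "Mexican", "Italian"]
DISTANCES = [1, 2, 5, 10]
LOCATIONS = ["my home", "my office", "Times Square"]

TEMPLATE = "Find me a {cuisine} restaurant within {distance} miles of {location}."

queries = [
    TEMPLATE.format(cuisine=c, distance=d, location=l)
    for c, d, l in product(CUISINES, DISTANCES, LOCATIONS)
]
# 3 * 4 * 3 = 36 queries here; larger value lists yield thousands of examples.
```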
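
And here is a minimal sketch of mixup for a batch of images with one-hot labels, assuming PyTorch tensors; drawing gamma from a Beta(alpha, alpha) distribution follows the original mixup paper, with an illustrative alpha.

```python
import torch

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix each example with a randomly chosen partner from the same batch.

    gamma is drawn from a Beta(alpha, alpha) distribution; inputs and labels
    are combined with the same coefficient, producing continuous (soft)
    labels such as 0.7 * DOG + 0.3 * CAT.
    """
    gamma = torch.distributions.Beta(alpha, alpha).sample()
    permutation = torch.randperm(images.size(0))
    mixed_images = gamma * images + (1 - gamma) * images[permutation]
    mixed_labels = gamma * one_hot_labels + (1 - gamma) * one_hot_labels[permutation]
    return mixed_images, mixed_labels
```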

Training data is the foundation of modern machine learning (ML) algorithms. Poor-quality training data leads to subpar algorithm performance, so investing in high-quality data is essential for effective learning.

When dealing with tasks that lack natural labels, companies often use human annotators, but this can be slow and costly. To overcome the shortage of labeled data, we explored methods such as weak supervision, semi-supervised learning, transfer learning, and active learning.

ML algorithms work best with balanced data distributions. Significant class imbalance, a common issue in the real world, makes learning difficult. We discussed the challenges posed by class imbalance and various techniques to address it, including selecting appropriate metrics, resampling data, and modifying the loss function to focus on specific samples.

We also discussed data augmentation strategies, which can improve models’ accuracy and effectiveness, with examples from computer vision and NLP tasks.


