6 minute read

Designing Machine Learning Systems: Auto ML (Hyperparameter Tuning & NAS), Model Offline Evaluation (1) (Establishing Baselines)



Auto ML: Automating the ML Lifecycle

AutoML (Automated Machine Learning) aims to reduce the human effort needed to design and train machine learning models. It attempts to automate tasks like model selection, feature engineering, hyperparameter tuning, and even neural architecture design.

Soft AutoML: Hyperparameter Tuning

This is the most common form of AutoML in practice. It involves automatically searching for the best values of hyperparameters, such as:

  • Learning rate
  • Batch size
  • Dropout rate
  • Number of layers
  • Number of hidden units
  • Optimizer types and their internal parameters (e.g., $\beta_1$, $\beta_2$ in Adam)

With different hyperparameters, the same model can exhibit significantly different performance on the same dataset. Melis et al. (2018) demonstrated that weaker, well-tuned models can outperform stronger, more sophisticated ones.

The most popular tuning methods are listed below; a minimal random-search sketch follows the list.

  • Grid Search: Tries all combinations in a predefined grid. Exhaustive but expensive.
  • Random Search: Randomly samples from the space. Often more efficient than grid search.
  • Bayesian Optimization: Models the performance function and iteratively chooses promising hyperparameters using a probabilistic model.
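For concreteness, here is a minimal random-search sketch using scikit-learn’s RandomizedSearchCV; the estimator, parameter ranges, and synthetic dataset are placeholders for your own setup.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Placeholder dataset; in practice this is your training split.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Search space: learning rate and regularization strength sampled log-uniformly.
param_distributions = {
    "alpha": loguniform(1e-6, 1e-2),    # regularization strength
    "eta0": loguniform(1e-4, 1e-1),     # initial learning rate
    "max_iter": randint(500, 2000),
}

search = RandomizedSearchCV(
    SGDClassifier(learning_rate="constant", random_state=0),
    param_distributions,
    n_iter=25,      # number of random samples from the space
    cv=5,           # cross-validation on the training data, never the test set
    scoring="f1",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Grid search and Bayesian optimization plug into the same pattern: swap RandomizedSearchCV for GridSearchCV, or for a dedicated library such as Optuna, while keeping the validation-based scoring.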

Always remember that

  • Always tune on a validation set, never on the test set, to avoid overfitting to it.
  • Some hyperparameters (e.g., the learning rate) are more sensitive than others and deserve a finer-grained search.


Hard AutoML: Neural Architecture Search (NAS) and Learned Optimizers

Some teams take hyperparameter tuning to the next level: What if we treat other components of a model, or even the entire model, as hyperparameters?

The size of a convolution layer or whether to have a skip layer can be considered a hyperparameter. Instead of manually adding pooling after convolution or ReLU after linear, you provide your algorithm with these building blocks and let it determine how to combine them. This research area, known as neural architecture search (NAS), seeks to identify the optimal model architecture.

Neural Architecture Search (NAS)

Instead of manually choosing model structures (such as the number of CNN layers or where to place pooling layers), NAS automates the design of the architecture. It has three core components (a minimal random-search sketch follows the list):

  • Search Space: Defines possible components (e.g., conv, pooling, activation) and how they can be connected.
  • Search Strategy: Reinforcement learning, evolutionary algorithms, or random search.
  • Performance Estimation: Uses proxies (e.g., training a smaller model or training for only a few epochs) to estimate performance cheaply.
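To make the three components concrete, here is a minimal, hypothetical sketch in PyTorch: a hand-defined search space, random search as the search strategy, and a stubbed-out proxy score standing in for cheap performance estimation. All names and choices are illustrative, not a real NAS system.

```python
import random

import torch.nn as nn

# Hypothetical search space: the building blocks the search may combine.
SEARCH_SPACE = {
    "num_conv_blocks": [1, 2, 3],
    "channels": [16, 32, 64],
    "kernel_size": [3, 5],
    "use_pooling": [True, False],
    "activation": [nn.ReLU, nn.GELU],
}

def sample_architecture():
    # Search strategy: plain random search over the space.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def build_model(cfg, in_channels=3, num_classes=10):
    # Assemble a small CNN from the sampled configuration.
    layers, c_in = [], in_channels
    for _ in range(cfg["num_conv_blocks"]):
        layers += [
            nn.Conv2d(c_in, cfg["channels"], cfg["kernel_size"], padding="same"),
            cfg["activation"](),
        ]
        if cfg["use_pooling"]:
            layers.append(nn.MaxPool2d(2))
        c_in = cfg["channels"]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c_in, num_classes)]
    return nn.Sequential(*layers)

def proxy_score(model):
    # Performance estimation: in practice, train `model` for a few epochs on a
    # data subset and return validation accuracy. A random score keeps this
    # sketch self-contained.
    return random.random()

best_cfg = max((sample_architecture() for _ in range(20)),
               key=lambda cfg: proxy_score(build_model(cfg)))
print(best_cfg)
```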

Google’s EfficientNet is a successful NAS result: it reaches top-tier accuracy with up to 10× better compute efficiency.

In typical ML training, you have a model and a learning procedure: an algorithm that finds a set of parameters that minimizes an objective function on the data. Gradient descent is the most common procedure for neural networks, and an optimizer specifies how the weights are updated given the gradients. Popular optimizers include Adam, Momentum, and SGD. In principle, the choice of optimizer can be folded into NAS, but this is difficult because optimizers are sensitive to their hyperparameter settings, and the default settings often don’t transfer across architectures.

This prompts a research question: what if we replace the functions that define the update rule with a neural network? That network would compute how much to update the model’s weights, giving us learned optimizers instead of hand-designed ones.


Learned Optimizers

Rather than relying on hand-designed optimizers like SGD or Adam, learned optimizers are themselves neural networks that learn how to update weights (a minimal sketch follows the list below).

  • These optimizers are meta-learned by being trained across many tasks.
  • Once trained, they generalize to unseen datasets and architectures.
  • They can also train new, better learned optimizers, resulting in recursive self-improvement.
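A minimal sketch of the idea, assuming a coordinate-wise MLP as the update rule; the input features and architecture are illustrative, and real learned optimizers use richer inputs, recurrent networks, and meta-training that is omitted here.

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """A tiny MLP that maps each parameter's gradient (plus a scale feature)
    to an additive update, applied coordinate-wise."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    @torch.no_grad()
    def step(self, params):
        for p in params:
            if p.grad is None:
                continue
            g = p.grad.reshape(-1, 1)
            feats = torch.cat([g, g.abs().log1p()], dim=1)  # gradient + magnitude feature
            p.add_(self.net(feats).reshape(p.shape))        # learned rule replaces -lr * grad

# Usage (hypothetical): after loss.backward() on your model,
#   learned_opt.step(model.parameters())
# Meta-training of self.net across many tasks is what makes the rule useful.
```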


Four Phases of ML Model Development

This framework helps guide the adoption of ML depending on maturity, available resources, and problem complexity:

Phase 1: Before ML (Baseline Heuristics)

Start with simple rules, especially if this is your first time solving a problem with ML.

  • Examples: Recommend the most popular items, use median values, or sort posts chronologically.
  • Heuristics can achieve 50% or more of your desired performance.

Tip: Avoid overcomplicating things too early. A sound, rule-based system may suffice for many practical tasks.
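For instance, the “recommend the most popular items” heuristic is only a few lines of Python (the interaction log below is hypothetical):

```python
from collections import Counter

# Hypothetical interaction log: one (user, item) pair per event.
interactions = [("u1", "itemA"), ("u2", "itemB"), ("u3", "itemA"), ("u1", "itemC")]

popularity = Counter(item for _, item in interactions)

def recommend_most_popular(k=2):
    # Phase 1 baseline: ignore the user entirely, return the k most popular items.
    return [item for item, _ in popularity.most_common(k)]

print(recommend_most_popular())  # ['itemA', 'itemB']
```

If a baseline like this already reaches a large share of the target metric, it also becomes the bar any later ML model has to beat.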

Phase 2: Simplest ML Models

Once a baseline is insufficient:

  • Start with interpretable models, such as logistic regression, decision trees, or gradient boosting.
  • These give insight into data and feature importance.
  • They’re easy to debug and quick to deploy.

Goal: Build an end-to-end system with this simple model (including data ingestion, training, evaluation, and deployment).
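A minimal sketch of such a first model, assuming a tabular dataset with hypothetical numeric and categorical columns (a scikit-learn Pipeline around logistic regression):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature layout for a user table.
numeric_cols = ["age", "sessions_per_week"]
categorical_cols = ["country", "device"]

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(train_df[numeric_cols + categorical_cols], train_df["label"])
# pipeline["model"].coef_ then gives a first read on feature importance.
```

The same pipeline object can be serialized and served, which is what makes it a good backbone for the first end-to-end system.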

Phase 3: Optimize Simple Models

If you’ve validated that ML works and built a pipeline:

  • Apply feature engineering
  • Add more data
  • Use regularization
  • Run a hyperparameter search.
  • Try ensembling (e.g., averaging predictions of multiple models)

This phase can deliver significant performance improvements with minimal extra complexity.
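As a sketch of two of these steps, a hyperparameter search followed by a soft-voting ensemble (the estimators, parameter grid, and synthetic data are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, random_state=0)  # placeholder data

# Hyperparameter search on a single simple model.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 10, 50]},
    cv=5,
)
grid.fit(X, y)

# Ensembling: average the predicted probabilities of several simple models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", grid.best_estimator_),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```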

Phase 4: Complex Models

When simple models plateau, move to:

  • Deep learning architectures (CNNs, RNNs, Transformers)
  • Pretrained models (e.g., BERT, ResNet)
  • Multi-task learning or multi-modal models

At this stage:

  • Consider cost, latency, and inference environment.
  • Analyze model drift (decay in performance over time)
  • Plan infrastructure for frequent retraining and monitoring
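As an example of the pretrained-model route, a common pattern is to load a pretrained backbone and replace only its classification head; the snippet below uses torchvision’s ResNet-18 with a placeholder number of classes.

```python
import torch.nn as nn
from torchvision.models import ResNet18_Weights, resnet18

num_classes = 5  # placeholder for your task

# Load ImageNet-pretrained weights and swap the final layer for the new task.
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optionally freeze the backbone and fine-tune only the new head at first.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
```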



Model Offline Evaluation

A common but difficult question when helping companies with ML is: “How do I know that our ML models are any good?” For example, one company used ML to detect intrusions across 100 surveillance drones but couldn’t measure how many intrusions the system missed, or determine whether one algorithm was better than another.

Offline evaluation is the process of assessing an ML model’s performance before deploying it to a live production environment. It relies on labeled datasets and simulates real-world conditions as closely as possible using held-out validation or test sets.

The core challenge: How can we trust a model’s performance when it’s never seen real-world (live) data yet?

In some tasks, like recommendations, we can infer labels from user feedback, such as clicks indicating a good recommendation, as discussed in ‘Natural Labels’. However, biases are common. For other tasks, you may not directly evaluate your model’s performance in production and may need to conduct extensive monitoring to detect changes and failures.

In this section, we’ll discuss methods to evaluate your model’s performance before deployment. We’ll start with the baselines for evaluation, then cover standard methods beyond overall accuracy metrics.

Establishing Baselines

The writer noted that someone once told him her new generative model scored 10.3 on the FID metric for ImageNet; he had no idea what this score meant or whether her model would be helpful in his situation. He also once assisted a company in developing a classification model where the positive class comprised 90% of the data. An excited ML engineer claimed their model achieved an F1 score of 0.90. When asked how that compared to random predictions, the engineer didn’t know. Since the positive class made up 90% of the labels, the model could achieve that F1 score just by predicting the positive class at random according to the label distribution; it was essentially making predictions at random.


👀 Side Note:

  • FID (Fréchet Inception Distance) is a standard metric to evaluate generative models (e.g., GANs).
  • A lower FID generally indicates better quality: the generated images are closer to the real data.
  • Problem: Saying “FID = 10.3” means nothing on its own unless you know:
    • What is the typical FID on ImageNet?
    • What FID do other comparable models get?
    • Is 10.3 good or bad for this specific task?
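For reference, FID fits a Gaussian to the Inception features of the real ($r$) and generated ($g$) images and measures the distance between the two:

$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$

where $\mu$ and $\Sigma$ are the feature means and covariances, so the score only becomes meaningful relative to what comparable models achieve on the same dataset.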

As shown in the examples above, before evaluating a model it is essential to define a baseline: a reference point against which to compare. Metrics alone (e.g., “our model gets 90% accuracy”) are meaningless unless you know what you’re comparing them to.


Common Baseline Types

  1. Random Baseline

    • If our model predicts randomly, what is its expected performance?

    • For example, consider a task with two labels: NEGATIVE (90%) and POSITIVE (10%). The table below shows the F1 and accuracy scores of baselines that predict at random (a runnable DummyClassifier sketch follows this list).

      Random Distribution       | Meaning                                                        | F1    | Accuracy
      Uniform Random            | Predict each label with equal probability (50%)                | 0.167 | 0.5
      Task’s Label Distribution | Predict NEGATIVE 90% of the time and POSITIVE 10% of the time  | 0.1   | 0.82
  2. Simple Heuristic

    • If you just make predictions based on simple heuristics, what performance would you expect?
    • Use simple rules (e.g., always recommend the latest item)
    • This is useful in recommendation, ranking, or forecasting.
  3. Zero Rule

    • Always predicts the most frequent class
    • It is useful as a quick sanity check.
  4. Human-level Baseline

    • Performance of a human expert
    • Useful in fields like medicine or hiring
  5. Existing System Baseline

    • Rule-based or older ML system
    • Compare cost, accuracy, and maintenance benefits
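The sketch referenced above: the random and zero-rule baselines can be computed directly with scikit-learn’s DummyClassifier on a hypothetical 90/10 label split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 90/10 NEGATIVE/POSITIVE split (0 = NEGATIVE, 1 = POSITIVE).
rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=100_000, p=[0.9, 0.1])
X = np.zeros((len(y), 1))  # features are irrelevant for dummy baselines

for strategy in ["uniform", "stratified", "most_frequent"]:
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    pred = clf.predict(X)
    print(f"{strategy:>13}: F1={f1_score(y, pred, zero_division=0):.3f}  "
          f"accuracy={accuracy_score(y, pred):.3f}")

# Approximately: uniform F1≈0.167, acc≈0.5; stratified F1≈0.1, acc≈0.82;
# most_frequent (zero rule) F1=0.0, acc≈0.9.
```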

💡 Example: A model that predicts which app a user will open next should beat a “recommend the last used app” heuristic to justify its added complexity.


