Designing Machine Learning Systems: Iterative Process & Framing ML Problems (1)


Iterative Process: Machine Learning Is Not a One-Time Job

When many of us first learn about machine learning, it often feels like a linear journey: collect data, clean it, train a model, evaluate it, and deploy. Done. However, as soon as we begin working on a real-world ML system, especially one that serves production users, we realize the process looks nothing like a clean, straight line. Instead, it’s a loop with constant iteration, feedback, and course correction.

Before deploying my first machine learning system, I assumed the most challenging part would be training a good model. But I quickly discovered that model training is just one part of a much bigger, cyclical workflow, one that loops repeatedly as data shifts, user behavior evolves, and business goals change.


The Cycle in Action

Imagine building a model to decide if an ad should be shown for a search query. First, you define your goal: maximize the number of ads displayed (impressions). After gathering labeled data and creating features, you train your initial model.

During error analysis, you find that some of the model’s mistakes come from incorrect labels. You relabel the data and retrain the model. Then you notice the dataset is imbalanced: 99.99% of the examples indicate that no ad should be shown, so the model almost always predicts “no.” To fix this, you collect more positive examples and adjust the model.
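To see why that imbalance is easy to miss, here is a minimal sketch in plain Python. The counts are hypothetical and chosen only to mirror the 99.99% split; the "model" is a stand-in that always answers "no ad." It scores very high on accuracy while never finding a single positive example.

```python
# Hypothetical labels: 1 = show an ad, 0 = do not show an ad.
n_total = 100_000
n_positive = 10  # 0.01% positives, mirroring the 99.99% "no ad" split

labels = [1] * n_positive + [0] * (n_total - n_positive)

# A degenerate stand-in "model" that always predicts "no ad".
predictions = [0] * n_total

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

accuracy = correct / n_total          # share of all predictions that are right
recall = true_positives / n_positive  # share of real positives that were found

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Accuracy alone would suggest the model is nearly perfect, which is why a metric such as recall on the positive class is worth checking during error analysis.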

Now the model works well on your old test data, but that data is two months old. When tested on recent queries, performance drops because user behavior or ad content has changed. You retrain the model with more current data.
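One way to catch this earlier is to hold out the most recent data as the test set instead of splitting at random. The sketch below assumes each logged example carries the date it was seen; the rows, the `split_by_time` helper, and the cutoff date are all hypothetical.

```python
from datetime import date

# Hypothetical logged examples: (day the query was seen, query text, label).
examples = [
    (date(2024, 1, 10), "cheap flights", 1),
    (date(2024, 2, 2), "weather today", 0),
    (date(2024, 3, 15), "running shoes", 1),
    (date(2024, 3, 28), "python tutorial", 0),
]

def split_by_time(rows, cutoff):
    """Train on rows before `cutoff`; evaluate on rows from `cutoff` onward."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

cutoff = date(2024, 3, 1)
train_rows, test_rows = split_by_time(examples, cutoff)
print(len(train_rows), "training rows;", len(test_rows), "recent test rows")
```

Evaluating on the newest slice gives a rough preview of how the model will behave on the queries it will actually see after deployment.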

Eventually, you deploy the model. It seems fine until marketing reports a drop in ad revenue. After examining the details, you find that the model optimizes for impressions rather than click-through rate (CTR): ads are displayed, but few are clicked. So you start over: redefine the goal, collect new data, and restart the cycle.
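A small arithmetic sketch shows how the two goals can pull apart. CTR is clicks divided by impressions; the before and after numbers below are made up purely for illustration.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions (0.0 when nothing was shown)."""
    return clicks / impressions if impressions else 0.0

# Hypothetical traffic before and after deploying the impression-focused model.
before = {"impressions": 10_000, "clicks": 500}
after = {"impressions": 20_000, "clicks": 300}

ctr_before = click_through_rate(before["clicks"], before["impressions"])
ctr_after = click_through_rate(after["clicks"], after["impressions"])

print(f"before: {before['impressions']} impressions, CTR = {ctr_before:.3f}")
print(f"after:  {after['impressions']} impressions, CTR = {ctr_after:.3f}")
```

In this made-up scenario impressions double while clicks fall, so a model that looks successful against its original goal can still hurt a business that earns money per click.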


A Real ML Development Lifecycle

This kind of workflow isn’t an exception—it’s the norm. In production, ML development is an iterative and never-ending process. The cycle can be broken down into six key stages as follows.

  1. Project Scoping
    • Every machine learning (ML) project should begin with clear scoping. This means defining objectives, understanding business needs, identifying stakeholders, and estimating the required time, data, and compute. A well-scoped project aligns technical decisions with real business goals, such as boosting conversions or reducing churn, not just model accuracy.
  2. Data Engineering
    • Most of the time spent in machine learning projects is not in modeling, but in handling data. Data must be collected, cleaned, joined from multiple sources, and transformed into a format suitable for training. This step often involves generating labels — either manually, programmatically, or via weak supervision — and creating a balanced, representative sample.
  3. Model Development
    • With clean training data, we move into the core of ML work: feature extraction, model selection, training, and evaluation. This is where we apply our machine learning knowledge to develop a performant model. However, this is also where the first round of surprises usually appears – such as imbalanced data, mislabeled examples, or underperforming features.
  4. Deployment
    • At some point, we will need to stop tweaking and put our model into the world. Deployment isn’t just about making a model accessible—it involves setting up APIs, infrastructure, and ensuring your system integrates with downstream applications. It’s a shift from experimentation to responsibility: you’re now serving live users, and mistakes can have real consequences.
  5. Monitoring and Continual Learning
    • After deployment, we must keep watch. Models can degrade over time due to data drift, concept drift, or changes in user behavior. Regular monitoring helps catch these issues early. In many cases, we will need to build pipelines that automatically retrain models as new data arrives, or at least alert teams when performance metrics dip.
  6. Business Analysis
    • Lastly, and often most importantly, our ML system must be evaluated not just on technical metrics, but on business outcomes. Does the new model increase revenue? Does it reduce cost? Does it improve customer satisfaction? These insights inform whether a project should continue, pivot, or end – and they often lead back to the beginning of the cycle.
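For the monitoring stage (stage 5 above), the "alert when performance metrics dip" idea can be sketched in a few lines. This is only an illustration: the `should_alert` helper, the 5% tolerance, and the scores are assumptions, and a real system would also track input drift, not just one metric.

```python
from statistics import mean

def should_alert(baseline, recent_scores, max_relative_drop=0.05):
    """Return True when the recent average falls too far below the baseline.

    `baseline` is the metric measured at deployment time; `max_relative_drop`
    is a hypothetical tolerance (5% by default).
    """
    if not recent_scores:
        return False  # nothing to compare yet
    return mean(recent_scores) < baseline * (1 - max_relative_drop)

baseline_accuracy = 0.90  # hypothetical value recorded at launch

healthy_week = [0.89, 0.88, 0.90]
degraded_week = [0.80, 0.82, 0.78]

print("healthy week alert:", should_alert(baseline_accuracy, healthy_week))
print("degraded week alert:", should_alert(baseline_accuracy, degraded_week))
```

A check like this could page the team or trigger a retraining job, depending on how much automation the pipeline supports.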

It’s worth noting that this iterative process looks different depending on our role. A data scientist or ML engineer is typically focused on data preparation, modeling, and evaluation. An ML platform engineer or MLOps specialist may be more concerned with deployment reliability, version control, and infrastructure scaling.

While the perspectives differ, the underlying truth remains: machine learning is not a one-and-done task—it is an evolving system that requires ongoing attention, evaluation, and care.


Framing Machine Learning Problems

Not every business problem is a machine learning problem. One of the most critical skills for an ML practitioner, especially in applied settings, is the ability to frame a real-world need as a well-posed ML task. This means defining the inputs, outputs, and objective function in a way that enables the system to learn and improve over time.

Let’s say you’re a machine learning tech lead at a bank targeting millennial users. Your manager asks you to explore how to “use ML to make customer support faster,” inspired by a competitor’s success. On the surface, this request sounds promising, but it’s vague. ML systems can only learn when the problem is clearly defined in terms of data and measurable outcomes.


From Business Need to ML Problem

After investigation, you discover the actual bottleneck isn't in answering customer questions, but in routing the request to the correct department — accounting, inventory, HR, or IT. Now, you’ve found a precise ML framing: this is a multiclass classification problem.

  • Input: the customer request text
  • Output: the predicted department (one of four classes)
  • Objective: minimize misclassification (e.g., using cross-entropy loss)

Framing the problem this way allows us to design a pipeline that trains a model to classify incoming requests based on historical labeled data.
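To make the framing concrete, here is a rough sketch of the output and objective for a single request, using only the standard library. The raw scores are hypothetical stand-ins for whatever a trained text classifier would produce; softmax turns them into probabilities over the four departments, and cross-entropy measures how much probability landed on the correct one.

```python
import math

DEPARTMENTS = ["accounting", "inventory", "HR", "IT"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[true_index])

# Hypothetical model scores for one request, e.g. "I can't log in to my laptop".
scores = [0.2, 0.1, 0.3, 2.5]
probs = softmax(scores)
predicted = DEPARTMENTS[probs.index(max(probs))]

loss_if_it = cross_entropy(probs, DEPARTMENTS.index("IT"))
loss_if_hr = cross_entropy(probs, DEPARTMENTS.index("HR"))

print("predicted department:", predicted)
print(f"loss if the true label is IT: {loss_if_it:.3f}")
print(f"loss if the true label is HR: {loss_if_hr:.3f}")
```

The loss is small when the model puts most of its probability on the right department and large when it does not, which is what training tries to minimize across all historical requests.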

This reframing step is critical. Many failed ML projects result not from technical flaws, but from poor problem formulation. Before collecting data or training models, ML teams must ensure that the problem is one for which machine learning is well-suited.


