
Designing Machine Learning Systems: Data Formats (JSON, Parquet & Binary Formats), Data Models (Relational & NoSQL), Data Storage Engines (OLTP & OLAP), and ETL



Data Engineering Fundamentals

Building a stable foundation for machine learning systems

Machine learning (ML) has surged forward in the last decade, and that progress is tightly tied to the rise of big data. But building ML systems in production is not just about tuning models or training with more GPUs. The most challenging part, and often the most time-consuming, is the data engineering that underpins it all.


Understanding Your Data Sources

In production, ML systems ingest data from many sources, each with distinct characteristics and challenges.

  1. User Input Data: This encompasses text, images, uploaded documents, and other multimedia content. It’s prone to errors and inconsistencies: people can enter the wrong format, miss fields, or submit overly long or short content. Since users expect immediate feedback, this data must be processed quickly and accurately, often requiring strict validation logic (a small validation sketch follows this list).
  2. System-Generated Data: This includes logs, event streams, model predictions, and diagnostic signals. These are less error-prone but often large in volume. Logs are crucial for debugging and performance monitoring and are typically processed in batches (e.g., hourly or daily).
  3. Behavioral Data: Systems track user interactions—clicks, scrolls, page views. While technically system-generated, it remains personal and is often subject to privacy regulations.
  4. Internal Databases: Structured datasets managed by enterprise applications (e.g., customer databases, inventory systems). These are useful for personalization, product ranking, and more.
  5. Third-Party Data: Purchased or licensed from vendors, this includes demographic trends, purchase histories, or behavioral data. It’s often processed and anonymized, but as privacy policies tighten (e.g., Apple’s opt-in IDFA), its availability is shrinking.

Each source requires different ingestion strategies, cleaning protocols, and storage solutions.
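As a small illustration of the validation that user input from the first source above typically needs, here is a minimal sketch using pydantic; the schema and its fields are hypothetical:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical schema for a user-submitted product review.
class ReviewSubmission(BaseModel):
    user_id: int
    rating: int = Field(ge=1, le=5)                     # reject out-of-range ratings
    text: str = Field(min_length=1, max_length=5000)    # reject empty or overly long text

try:
    ReviewSubmission(user_id=42, rating=11, text="")
except ValidationError as err:
    # Surface the errors immediately so the user can correct their input.
    print(err)
```

Because the schema rejects bad input at the boundary, downstream pipelines see only records that already conform to the expected shape.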


Data Formats: Text vs. Binary, Row vs. Column

Once collected, data needs to be stored, but choosing the proper format depends on access patterns and performance needs.

  • Text-based formats, such as JSON and CSV, are human-readable and easy to debug. But they’re verbose and inefficient in terms of space and speed.
  • Binary formats like Parquet and Avro are compact and faster for computation, but less intuitive to inspect directly.

There’s also a key distinction between row-major (e.g., CSV) and column-major (e.g., Parquet) formats:

  • Row-major is excellent for writing new records and retrieving complete examples.
  • Column-major is ideal for reading specific features across many examples, which is common in feature selection and statistical analysis.


Image source: Designing Machine Learning Systems (textbook)

Understanding the trade-offs between formats can drastically improve performance and reduce storage costs. For example, converting a 14 MB CSV file to Parquet can shrink it to roughly 6 MB while speeding up queries, which is why AWS recommends Parquet as an input format.
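As a rough sketch of that conversion with pandas (file names and the example dataset are hypothetical, and Parquet support assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

# Load a row-major CSV file (hypothetical path).
df = pd.read_csv("events.csv")

# Write it out as column-major Parquet.
df.to_parquet("events.parquet", index=False)

# Column-major layout shines when you only need a few features:
# here only the 'price' column is read back, not every full row.
prices = pd.read_parquet("events.parquet", columns=["price"])
print(prices.describe())
```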


Data Models: Relational vs. NoSQL

How you model your data determines how you can retrieve and update it—and ultimately what your ML system can do with it.

Relational Model

Proposed by Edgar F. Codd in 1970, the relational model structures data in normalized tables (relations), making it ideal for:

  • Reducing redundancy
  • Ensuring consistency
  • Enabling complex queries via SQL

But normalization can spread data across many tables, making joins computationally expensive, especially at scale.
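To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module: customer details and orders live in separate tables, so reassembling a complete picture requires a join (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: customers in one table, their orders in another.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.5), (12, 2, 40.0);
""")

# Reconstructing per-customer totals requires a join, which gets expensive at scale.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # e.g. [('Ada', 124.5), ('Grace', 40.0)]
```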

NoSQL Models

In response to the rigidity of relational schemas, NoSQL databases emerged to offer more flexible alternatives:

  • Document Stores (e.g., MongoDB): Each data item is a self-contained document, often in JSON or BSON. Useful when the data structure varies or relationships are minimal; a short sketch follows below.

  • Graph Databases (e.g., Neo4j): Ideal for highly interconnected data, such as social networks, recommendation engines, or fraud detection. Relationships are first-class citizens.



Image source: Designing Machine Learning Systems (textbook)

NoSQL models give you flexibility and speed, especially in non-tabular or semi-structured use cases. However, they may sacrifice the power and safety of relational schemas.
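For the document-store case above, each record carries its own structure. A minimal sketch with pymongo, assuming a MongoDB instance is running locally (the database, collection, and field names are made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reviews = client["shop"]["reviews"]

# Each document is self-contained; two documents need not share a schema.
reviews.insert_one({"user": "ada", "rating": 5, "text": "Great!"})
reviews.insert_one({"user": "grace", "rating": 4, "tags": ["fast-shipping"]})

# Query by field directly, without joins.
for doc in reviews.find({"rating": {"$gte": 4}}):
    print(doc)
```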


Structured vs. Unstructured Data

Structured data adheres to a predefined schema. It’s easy to search, query, and analyze, but hard to change once the schema is set.

Unstructured data, such as raw logs, free text, or images, lacks a predefined structure. This makes it:

  • Quick to store, since no upfront schema work is required
  • More flexible across applications
  • Harder to analyze without custom parsing or feature extraction

Structured data typically lives in data warehouses, while unstructured data is stored in data lakes. Modern systems increasingly use lakehouses, which combine the flexibility of data lakes with the governance of data warehouses.


Data Storage Engines: OLTP vs. OLAP

Databases are not one-size-fits-all. Their design depends on the type of processing they need to support.

OLTP (Online Transaction Processing)

OLTP databases are optimized for frequent, atomic operations such as purchases, uploads, or bookings. They prioritize:

  • Low latency
  • High availability
  • Data consistency (ACID compliance)

Common in apps and services where users expect instant feedback.

OLAP (Online Analytical Processing)

OLAP databases are optimized for analytical queries that aggregate data across many records (e.g., monthly averages, churn rates). They often rely on:

  • Columnar storage
  • High-throughput batch queries

While traditionally separate, new architectures blur the line, with tools like Snowflake, BigQuery, and Apache Iceberg supporting hybrid workloads.
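A tiny sketch of the difference, again with Python's sqlite3 and a made-up purchases table: the OLTP-style write touches one row and commits atomically, while the OLAP-style query scans many rows to compute an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL, ts TEXT)")

# OLTP: a single, atomic write with low latency.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", (1, 19.99, "2024-01-05"))

# OLAP: an aggregate over many rows, e.g. monthly revenue.
monthly = conn.execute("""
    SELECT substr(ts, 1, 7) AS month, SUM(amount)
    FROM purchases
    GROUP BY month
""").fetchall()
print(monthly)
```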


ETL and ELT: Processing the Data Pipeline



Image source: Designing Machine Learning Systems (textbook)

Once you have data, it needs to be processed for downstream use. This is where ETL comes in:

  • Extract: Gather data from various sources.
  • Transform: Clean, normalize, enrich, and join datasets.
  • Load: Insert into the target system (e.g., a data warehouse or file system).

Modern systems also follow an ELT pattern, where raw data is initially stored (in a data lake) and transformation occurs later. This allows for faster ingestion and schema flexibility, but can lead to processing delays and inefficient querying if not managed well.
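A toy ETL job built with pandas might look like the sketch below; the file paths and column names are hypothetical, and Parquet output assumes pyarrow or fastparquet is installed:

```python
import pandas as pd

# Extract: pull raw data from a source system (here, a CSV export).
raw = pd.read_csv("raw_orders.csv")

# Transform: clean, normalize, and enrich.
clean = (
    raw.dropna(subset=["order_id", "amount"])            # drop incomplete records
       .assign(amount_usd=lambda d: d["amount"].round(2))  # normalize the amount column
)

# Load: write the result to the target store (here, a Parquet file).
clean.to_parquet("orders_clean.parquet", index=False)
```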


Declarative Data and ML Systems

Inspired by SQL’s declarative nature, declarative ML aims to simplify modeling by allowing users to specify what they want (features, task) without coding how to get it.

  • Ludwig allows users to define models through configuration files that specify settings such as layer counts and activation functions.
  • H2O AutoML automates the entire pipeline, exploring multiple model variants and selecting the most effective one.

While promising, these systems mostly abstract away model development and overlook harder tasks such as feature engineering, monitoring, and handling data drift. They lower the barrier to entry, but they are not a silver bullet for production-grade ML.
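As a rough illustration of the declarative style, the sketch below uses Ludwig's Python API with a configuration dictionary; the feature names and dataset path are made up, so treat it as an approximate sketch rather than a tested recipe:

```python
# Approximate sketch of Ludwig's declarative API; see the Ludwig docs for exact config keys.
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review_text", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(config)
# Training is driven by the declaration above rather than hand-written model code:
# train_stats = model.train(dataset="reviews.csv")
```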


