
Designing Machine Learning Systems: Model Comparison (Low-Rank Factorization, Knowledge Distillation, Pruning, & Quantization)


Model Comparison

When deploying a machine learning model, especially in online prediction systems, latency matters. The quicker a model can produce predictions, the better the user experience (and sometimes even the difference between success and failure, e.g., in fraud detection or autonomous driving).

If your model is too large or too slow, you have three main options:

  • Inference Optimization: makes it compute faster
  • Model Compression: makes the model smaller
  • Hardware Acceleration: uses faster hardware

In this section, we focus on Model Compression: reducing the size and complexity of a model so that it uses less memory, runs faster, and sometimes even consumes less power (which is crucial for edge devices like smartphones and IoT sensors).


Four Major Techniques in Model Compression

1. Low-Rank Factorization

  • Core idea: Replace large, high-dimensional tensors (arrays of numbers used in neural networks) with lower-rank approximations.
  • Example: Replace standard 3×3 convolutions with a combination of smaller convolutions, like 1×1.
    • SqueezeNet: 50× fewer parameters than AlexNet but with similar performance.
    • MobileNet: Replaces a standard $K \times K \times C$ convolution (where $K$ is the kernel size and $C$ is the number of channels) with:
      • Depthwise convolution: applies one $K \times K \times 1$ filter per input channel.
      • Pointwise convolution: a $1 \times 1 \times C$ convolution that combines the outputs.
    • This leads to massive parameter savings and speedup, especially in convolutional models (see the sketch after this list).
  • Limitation: Requires architecture-specific knowledge and is mostly useful for CNNs (Convolutional Neural Networks).
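
To make the MobileNet-style factorization above concrete, here is a minimal sketch of a depthwise separable convolution in PyTorch; the 64→128 channel sizes are arbitrary and chosen only to show the parameter savings.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Depthwise: one K x K filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # Pointwise: a 1 x 1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Compare parameter counts against a standard 3x3 convolution
def count(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
print(count(standard), count(separable))  # 73728 vs. 8768 parameters
```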

2. Knowledge Distillation

  • Core idea: Train a smaller "student" model to imitate the behavior of a large "teacher" model (a loss sketch follows this list).
  • The teacher model is typically large and accurate; the student is small and fast.
    • DistilBERT: 40% smaller than BERT, 60% faster inference, 97% of performance retained.
  • Pros:
    • Student model can have a different architecture from the teacher.
    • Useful for deploying small models when high performance is still needed.
  • Cons:
    • Requires a pretrained teacher model.
    • If a teacher doesn’t exist, you must train one first, which is resource-intensive.
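
A minimal sketch of a distillation loss in PyTorch, assuming you already have logits from a pretrained teacher and from the student being trained; the temperature and weighting below are illustrative hyperparameters, not values from the book.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    # Blend the two; only the student's parameters receive gradients
    return alpha * soft + (1 - alpha) * hard
```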

3. Pruning

  • Core idea: Identify and remove less important parameters in a neural network.
  • Two types of pruning:
    1. Structural pruning: removes entire nodes/layers (changes the architecture).
    2. Unstructured pruning: sets less important weights to zero, making the network sparse (doesn’t change the architecture).
  • Effect:
    • Reduces storage and sometimes computational load.
    • Sparse models (with many zero weights) require less memory and allow faster computation on supported hardware (see the pruning sketch after this list).
  • Caveat: Some researchers argue that pruning is effective not because of the surviving weights, but because the resulting architecture is inherently better and should be retrained from scratch.
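
A minimal sketch of unstructured magnitude pruning using PyTorch's built-in torch.nn.utils.prune utilities; the toy two-layer model is only for illustration.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 50% of weights with the smallest magnitude
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # Fold the pruning mask into the weight tensor permanently
        prune.remove(module, "weight")

zeros = sum(int((m.weight == 0).sum()) for m in model.modules()
            if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules()
            if isinstance(m, nn.Linear))
print(f"Global sparsity: {zeros / total:.0%}")  # roughly 50%
```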

4. Quantization

Quantization is the most general and commonly used model compression method. It is straightforward to apply and generalizes across tasks and architectures.

  • Core idea: Use lower precision numbers (fewer bits) to represent weights and activations.

Example:

  • Standard floating-point (float32) = 32 bits
  • Half-precision float (float16) = 16 bits → Half the memory
  • Integer (int8) = 8 bits → 4× smaller than float32

In the extreme case, 1-bit quantization yields Binary Neural Networks.

  • Pros:
    • Major reduction in memory footprint
    • Faster inference speed (simpler math)
    • Enables larger batch sizes
  • Cons:
    • Reduced precision introduces rounding errors
    • Risk of overflow/underflow
    • Requires careful scaling & calibration
  • Solution: Use frameworks that support quantization-aware training (QAT) or post-training quantization (PTQ).

Quantization in Practice

  • Post-Training Quantization (PTQ):
    • Train the model normally → then quantize it for inference (sketched after this list).
  • Quantization-Aware Training (QAT):
    • Simulate low-precision training from the start → better final performance.
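
As a concrete example of PTQ, here is a minimal sketch of post-training dynamic quantization in PyTorch, which converts the weights of selected layer types to int8 in a single call (assuming a recent PyTorch version; the toy model stands in for a trained network).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Convert Linear weights to int8; activations are quantized on the fly
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10]), now served with int8 weights
```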

Recently, low-precision training has become increasingly popular, with support from most modern training hardware.

  • NVIDIA introduced Tensor Cores, processing units that support mixed-precision training (see the sketch after this list).
  • Google TPUs (tensor processing units) also support training with Bfloat16 (16-bit Brain Floating Point Format), which the company dubbed “the secret to high performance on Cloud TPUs.”
  • Training in fixed-point arithmetic is not yet as popular but has shown a lot of promising results.
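
A minimal sketch of mixed-precision training with PyTorch's automatic mixed precision (AMP); it assumes a CUDA GPU with Tensor Core support, and the model and data are toy placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid float16 underflow

for step in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in float16 where safe
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then updates weights
    scaler.update()
```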


Case Study: Roblox Scaling BERT
  • Problem: Serve 1+ billion daily requests with low latency.
  • Baseline: Full-size BERT model → too slow.

Optimization Steps:

  1. Replaced BERT with DistilBERT
  2. Changed the input to support dynamic shapes (see the export sketch after this list)
  3. Quantized the model to int8
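
As a rough illustration of step 2, an ONNX export can mark the batch and sequence dimensions as dynamic so a single exported model serves variable-length inputs; the tiny embedding model below is a hypothetical stand-in for DistilBERT.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a transformer encoder such as DistilBERT
model = nn.Sequential(nn.Embedding(30522, 128), nn.Linear(128, 2))
dummy_input = torch.randint(0, 30522, (1, 16))  # (batch, sequence_length)

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    # Dynamic axes let the runtime accept any batch size or sequence length
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "logits": {0: "batch", 1: "sequence"}},
)
```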

Results:



Latency improvement by various model compression methods.
Source: Adapted from an image by Le and Kaehler


  • 7× latency reduction
  • 8× throughput boost

The biggest gains came from quantization, which highlights its practical impact in production.


ML on the Cloud and on the Edge

Another decision you’ll want to consider is where your model’s computation will happen: on the cloud or on the edge. On the cloud means a large chunk of the computation is done on the cloud, either public or private. On the edge means a large chunk of the computation is done on consumer devices, such as browsers, phones, and laptops.

Cloud Deployment

  • What it is: Model runs on remote servers (AWS, GCP, Azure).

  • Advantages:

    • Easy to set up (managed services like SageMaker, Vertex AI).
    • Access to powerful compute (GPUs, TPUs).
    • Centralized model management and versioning.
  • Disadvantages:

    • Expensive: ML inference is compute-heavy, and costs grow quickly with usage.

      Example: Pinterest and Intuit spend hundreds of millions per year on cloud compute.

    • Network latency: Sending user data to the cloud, waiting for predictions, and sending results back can take time.

    • Data privacy: Transmitting sensitive data increases risk of interception.

    • Regulatory issues: Cloud-based systems may violate regulations like GDPR if user data crosses geographic borders.


Edge Deployment

  • What it is: Model runs on the end user's device (e.g., phone, smartwatch, camera).
  • Advantages:
    • Lower latency: no network round-trips; predictions are computed directly on the device.
    • Offline support: Works in areas with no or poor internet (e.g., rural or secure zones).
    • Better privacy: Data stays on the device.
    • Lower cloud costs: Less compute usage on centralized infrastructure.
  • Challenges:
    • Device constraints (memory, CPU, battery).
    • Limited model size and compute budget.
    • Need for specialized optimization and compilation.

To move computation to the edge, edge devices must be powerful enough to handle the computation, have enough memory to store ML models and load them into memory, and have enough battery (or be connected to an energy source) to power the application for a reasonable amount of time.

Compiling & Optimizing Models for Edge Devices

Why Compilation is Needed

When you build a model using high-level ML frameworks like PyTorch or TensorFlow, it can’t automatically run on all hardware (like CPUs, GPUs, or TPUs).

Each piece of hardware has a unique architecture—different compute primitives (scalar, vector, tensor), memory layouts (L1/L2/L3 cache), and buffer sizes. For instance:

  • CPUs operate on scalars (single numbers),
  • GPUs on 1D vectors,
  • TPUs on 2D tensors.

To bridge the gap, the model must be compiled into native code the hardware understands. However, supporting every ML framework on every hardware type is costly and slow, requiring hardware-specific kernel libraries.
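
As one framework-level example (a sketch, not the specific compilers discussed here), recent PyTorch versions expose a compiler entry point, torch.compile, which traces the model and generates optimized kernels for the available backend instead of relying on hand-written kernels for every framework-hardware pair.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Lower the model to backend-specific optimized kernels (PyTorch 2.x);
# the first call triggers compilation, later calls reuse the compiled code
compiled_model = torch.compile(model)

x = torch.randn(8, 512)
print(compiled_model(x).shape)  # torch.Size([8, 10])
```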



Different compute primitives and memory layouts for CPU, GPU, and TPU.
Source: Adapted from an image by Chen et al.



