Wonha (Leah) Shin



Machine Learning & MLOps Engineer

I design, build, and scale intelligent systems — from streaming NLP pipelines to real-time LLM applications. Driven by curiosity, I combine data engineering with AI research to turn ideas into production-ready systems.

View My LinkedIn Profile

View My GitHub Profile

🧠 Real-Time Tweet Sentiment Analysis Pipeline

Databricks | PySpark Structured Streaming | Delta Lake | Hugging Face | MLflow


🚀 Overview

Designed and deployed a real-time streaming pipeline that classifies tweet sentiment at scale by running transformer-based NLP models inside Apache Spark Structured Streaming.
The system processes millions of tweets with low latency, leveraging a Delta Lake multi-layer architecture (Bronze → Silver → Gold) and MLflow for tracking, model registry, and deployment.

Goal: Bridge the gap between scalable data engineering and NLP by integrating distributed streaming with real-time model inference.


🧩 Architecture

Data Flow:
Twitter Stream (JSON) → Bronze (Raw) → Silver (Cleaned) → Gold (Predicted)

Components:


⚙️ Pipeline Workflow

  1. Ingestion – Bronze Layer
    • Continuously reads tweet JSON streams from s3a://voc-75-databricks-data/voc_volume/
    • Stores raw semi-structured data in a Bronze Delta table
  2. Transformation – Silver Layer
    • Extracts key fields: full_text, timestamp, lang, user_id
    • Cleans nulls, deduplicates IDs, standardizes timestamps
  3. Model Inference – Gold Layer
    • Applies a transformer-based sentiment classifier via a PySpark UDF
    • Predicts sentiment labels: positive, neutral, negative
    • Writes predictions to a Gold Delta table for downstream analytics
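
The Silver-layer rules above (extract key fields, clean nulls, deduplicate IDs, standardize timestamps) can be illustrated on plain Python dictionaries. This is not the project's PySpark code, only a small sketch of the record-level logic under several assumptions: the function name `to_silver` is hypothetical, the deduplication key is assumed to be an `id` field, "cleans nulls" is read as dropping records with a missing field, and the raw timestamp is assumed to use the classic Twitter `created_at` layout, which the write-up does not specify.

```python
from datetime import datetime, timezone

# Assumed raw timestamp layout (classic Twitter `created_at`); not stated in the source.
RAW_TS_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Fields the Silver layer keeps, as listed in the workflow.
SILVER_FIELDS = ("full_text", "timestamp", "lang", "user_id")


def to_silver(raw_tweets):
    """Return cleaned records: Silver fields only, no nulls, unique IDs, UTC ISO timestamps."""
    seen_ids = set()
    silver = []
    for tweet in raw_tweets:
        tweet_id = tweet.get("id")  # hypothetical ID field used for deduplication
        if tweet_id is None or tweet_id in seen_ids:
            continue  # skip records without an ID and repeated IDs
        row = {name: tweet.get(name) for name in SILVER_FIELDS}
        if any(value is None for value in row.values()):
            continue  # assumption: "cleans nulls" means dropping incomplete records
        # Standardize the timestamp to ISO 8601 in UTC.
        parsed = datetime.strptime(row["timestamp"], RAW_TS_FORMAT)
        row["timestamp"] = parsed.astimezone(timezone.utc).isoformat()
        seen_ids.add(tweet_id)
        silver.append(row)
    return silver
```

In a streaming job the same rules would typically be expressed with DataFrame operations rather than a Python loop, so treat this as a readable specification of the cleaning step, not as a performance-oriented implementation.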

🧠 Model & Tracking


📊 Results & Observations


🧩 Key Challenges & Solutions


📎 Full Code:
👉 GitHub — Starter Streaming Tweet Sentiment (Spring 2024 Final Project)


📘 Keywords:
PySpark Structured Streaming | Delta Lake | Databricks | Transformer Models | MLflow |
Real-Time Inference | Data Engineering | Hugging Face | MLOps | Sentiment Analysis