Wonha (Leah) Shin



AI / ML Data Scientist | MSDS @ University of Rochester
Fueled by a deep-rooted passion for data and its power to transform lives, I am dedicated to excellence in data science. I am particularly interested in Natural Language Processing (NLP) and ML engineering.

View My LinkedIn Profile

View My GitHub Profile

Real-Time Tweet Sentiment Analysis Pipeline


Project description: Designed and deployed a scalable real-time streaming pipeline that classifies tweet sentiment with transformer-based models in Spark Structured Streaming. Used Delta Lake for incremental data ingestion and storage, and MLflow for experiment tracking and the model registry. Built a Bronze–Silver–Gold architecture to support low-latency classification of tweet streams and to visualize evolving sentiment patterns in real time.
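
To make the Bronze–Silver–Gold layering described above concrete, here is a minimal plain-Python sketch of how a batch of records might move through the three layers. It is an illustration only, not the project's Spark code: the function names, the `Tweet` record, the cleaning rules, and the keyword-based `classify` stand-in for the transformer model are all assumptions made for this example. In the actual pipeline, each step would presumably be a Structured Streaming query writing to its own Delta table.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical keyword lists: a stand-in for the transformer model (assumption).
POSITIVE = {"love", "great", "good", "happy"}
NEGATIVE = {"hate", "bad", "awful", "sad"}


@dataclass
class Tweet:
    tweet_id: int
    text: str
    sentiment: str = ""


def to_bronze(raw_records):
    """Bronze: keep records as received, dropping only rows with no text."""
    return [Tweet(r["id"], r["text"]) for r in raw_records if r.get("text")]


def to_silver(bronze):
    """Silver: strip URLs and @mentions, lowercase, collapse whitespace."""
    cleaned = []
    for tweet in bronze:
        text = re.sub(r"https?://\S+|@\w+", " ", tweet.text)
        text = " ".join(text.lower().split())
        if text:
            cleaned.append(Tweet(tweet.tweet_id, text))
    return cleaned


def classify(text):
    """Placeholder classifier (assumption): compares keyword hit counts."""
    words = set(re.findall(r"[a-z']+", text))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def to_gold(silver):
    """Gold: label each tweet and aggregate the label counts."""
    labeled = [Tweet(t.tweet_id, t.text, classify(t.text)) for t in silver]
    return labeled, Counter(t.sentiment for t in labeled)


def run_batch(raw_records):
    """Run one batch of raw records through all three layers."""
    return to_gold(to_silver(to_bronze(raw_records)))
```

Calling `run_batch` on a list of dicts with `id` and `text` keys returns the labeled tweets together with a `Counter` of sentiment labels, which is the kind of aggregate a dashboard would read from the Gold layer.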

1. Motivation

The rapid growth of user-generated content on platforms like Twitter offers an opportunity to understand public sentiment on a global scale. However, ingesting this data in real time and analyzing it at scale poses both engineering and ML challenges. This project addresses those challenges with distributed streaming and model inference pipelines built on Spark.


2. Dataset & Streaming Architecture

3. Tools & Technologies


4. Data Preprocessing & Pipeline Construction


5. Model Development


6. Evaluation & Monitoring


7. Visualization & Insights


8. Challenges


9. Conclusion & Future Work

This project demonstrated a working end-to-end real-time sentiment analysis pipeline built with Spark and Hugging Face transformers. It shows that NLP models can be integrated into streaming systems at scale.


📎 Full Code Access

The full code is available here: 👉 Let’s go to my GitHub!