AI / ML Data Scientist MSDS @ Uni. Rochester
Fueled by a deep-rooted passion for data and its power to transform lives, dedicated to Data Science excellence. Particular interest lies in the realm of Natural Language Processing(NLP) & ML engineering.
View My LinkedIn Profile
View My GitHub Profile
E-Cigarette Perception Capstone Project
Project description:
- Conducted a multilingual NLP project analyzing over 51,000 English and 7,000 Spanish tweets (from an original pool of 1.1M and 34K respectively) to uncover cultural and linguistic differences in public e-cigarette discourse on Twitter/X.
- Led key components including full-text crawling, NLP preprocessing, manual labeling, fine-tuning transformer models (Bertweet, Twitter Twin BERT Large), and multilingual topic modeling using Sentence Transformers, UMAP, and K-means.
- Revealed critical thematic and attitudinal differences between language groups, guiding culturally sensitive public health messaging strategies.
Overview
This project aimed to analyze public perceptions of e-cigarettes across English and Spanish tweets to support inclusive, culturally informed public health campaigns. Motivated by the lack of multilingual research in this domain, the project used advanced NLP tools to uncover thematic trends and attitudinal differences between English- and Spanish-speaking communities.
1. Dataset and Preprocessing
- Data Source: Tweets related to e-cigarettes collected via keyword filtering by a third-party provider
- Initial Dataset Sizes:
- ~1.1 million English tweets
- ~34,000 Spanish tweets
- Final Preprocessed Dataset:
- 51,000 English tweets
- 7,000 Spanish tweets
- Preprocessing Pipeline:
- Deduplication and removal of retweets
- Full-text crawling (sampled 100K English tweets)
- Text normalization: lowercasing, stop word removal, lemmatization, emoji/special character removal
2. Model Training and Fine-Tuning
3. Topic Modeling Framework
- Embedding + Clustering Pipeline:
- Sentence-BERT Embeddings
- UMAP for dimensionality reduction
- K-Means for topic grouping
- Models Compared: LDA, TF-IDF, BERT Topic
- Final Selection: BERT Topic for best balance between coherence, domain relevance, and interpretability
- Topic Naming: Refined using OpenAI-based LLMs with 100+ iterations per theme
4. Major Findings
- Cross-Linguistic Themes
- English Tweets: Focused on regulation, school-related vaping, and public health risks.
- Spanish Tweets: Emphasized social lifestyle, recreational habits, and cultural context.
- Translation Evaluation: Minimal loss in sentiment polarity; however, key cultural idioms and slang were diluted or altered.
- Sentiment vs. Attitude
- Sentiment (VADER): Majority of tweets labeled as neutral.
- Attitude (Fine-tuned Model): Revealed more polarized opinions (24.6% positive, 41.2% negative, 34.2% neutral), uncovering what traditional sentiment tools often miss.
- Translation vs. Original Spanish
- Original Spanish tweets emphasized addiction, cravings, and personal habits.
- Translations introduced themes like gift-giving and product preferences not found in the original.
- Demonstrated how translation may shift thematic interpretations and miss cultural depth.
5. Challenges and Limitations
- Ground Truth Labeling Subjectivity: Creating labeled data for classification tasks (e.g., attitude and commercial relevance) involved manual annotation and group consensus, making the labels inherently subjective. Especially in cross-cultural contexts, definitions of “positive,” “neutral,” and “negative” vary and may affect reproducibility and downstream model interpretation.
- Preserving Cultural Nuance During Translation: Machine translation often failed to retain idiomatic expressions, slang, emoji usage, and cultural references in Spanish tweets, leading to loss of contextual richness and thematic accuracy.
- Disparity in Dataset Sizes: The English dataset significantly outweighed the Spanish dataset in volume, creating imbalances that affected cross-linguistic comparability and interpretability.
- Cross-Language Topic Modeling: Structural and syntactic differences between English and Spanish introduced challenges in achieving consistent and unbiased topic clusters, limiting model performance and coherence across languages.
- Demographic Bias: Social media users, particularly those active on Twitter, are not representative of the general population. Differences in age, socioeconomic status, and language fluency limit the generalizability of findings to broader public health contexts.
6. Future Work
- Enhance multilingual topic modeling frameworks for improved cultural fidelity.
- Construct balanced datasets across languages to ensure unbiased comparison.
- Conduct temporal analysis to study perception shifts after policy changes or public health campaigns.
Report Download
Full report is available here Report Viewer