AI / ML Data Scientist MSDS @ Uni. Rochester
Fueled by a deep-rooted passion for data and its power to transform lives, dedicated to Data Science excellence. Particular interest lies in the realm of Natural Language Processing(NLP) & ML engineering.
View My LinkedIn Profile
View My GitHub Profile
Socio-Awareness Modeling and Predictive Analysis of COVID-19 Spread in Ohio Counties

Project description:
- Spearheaded an analytical project to assess the impact of socioeconomic factors and public awareness on the COVID-19 pandemic across Ohio counties.
- Utilized advanced machine learning techniques, employing a Random Forest model to analyze over 140 variables, achieving an R² score of 0.95 and 89% prediction accuracy.
- Conducted a multifaceted analysis incorporating health, societal, and economic data to explore the nuanced interplay between community behavior and pandemic trends.
Overview
The analysis described in the report centers around the exploration and modeling of COVID-19 case spread within Ohio counties, linked to various awareness levels on different topics among the population. This analysis sought to understand the relationships between the pandemic’s impact and the array of socioeconomic indicators available in the dataset.
1. Key components of the analysis included
- Data Utilization: The study harnessed a rich dataset with numerous variables including COVID-19 case and death counts, social media engagement (topic awareness measured via Jaccard similarity), and socioeconomic factors like employment rates, educational attainment, and median earnings.
- Modeling Technique: The primary analytical tool was a Random Forest algorithm, chosen for its robustness against noise and overfitting. This model was trained to predict COVID-19 cases, leveraging its ensemble learning nature.
- Insightful Findings: The analysis uncovered that despite the pandemic, public interest heavily leaned towards non-health-related topics such as sports and entertainment. The geographical analysis provided insights into the distribution of awareness and its correlation with COVID-19 cases across counties.
2. Method
- Random Forest
- tuning with multiple hyperparameters
3. Random Forest
- Hyperparameter tuning:
- A loop iterates over a range of values for
n_estimators
in the Random Forest Classifier to find the best performing model based on accuracy and R-squared metrics.
- The tuning process involves training the model with different numbers of trees
n_estimators
and observing the impact on performance metrics.
- Results are visualized using matplotlib plots, showing how different values of
n_estimators
affect the model’s accuracy and R-squared score, helping to identify the optimal model configuration.
4. The Analysis of Results:
- The analysis using a Random Forest model demonstrated high effectiveness, with an R² score of 0.955 and 89% accuracy, indicating reliable predictive capability for COVID-19 case distribution across Ohio counties.
- The study found that non-health topics like sports and entertainment dominated public interest, overshadowing health-related information during the pandemic.
- Geographical analysis highlighted variations in COVID-19 impact, with certain counties facing higher cases or deaths per capita.
- Temporal trends in topic awareness were dynamic, fluctuating with external events. Although the model was robust, future improvements could include integrating additional relevant features to enhance accuracy and insight depth.
Report Download
Full report is available here Report Viewer