Course Project MISM 6205: Data Wrangling for Business
Improving NEU’s Coffee Health
Topics and questions
Project Focus: Studying coffee consumption patterns and related dietary preferences among members of the NEU community.
Goal: Providing relevant insights and recommendations for caffeinated beverages by combining survey insights with menu data from cafes.
Consumer Safety: Providing coffee consumers with safe coffee intake suggestions based on their preferences.
Data & Information Quality
Starbucks data: Used Python to clean stray symbols, null values, duplicates, and unrelated data.
Dunkin’ Donuts data: Removed limited-time-offer products and unrelated data, using Excel and Python for cleaning.
Survey data: Simplified the wording of answer options for easier analysis and removed irrelevant symbols, characters, and NaN values.
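The cleaning steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names, the sample rows, the "Limited Time" marker, and the clean_menu helper are all assumptions made for the example, and the values are made up.

```python
import pandas as pd

# Hypothetical sample rows with made-up values; the real menu data
# and its column names are not shown in this report.
raw = pd.DataFrame({
    "Beverage": ["Latte®", "Latte®", "Cold Brew*", None, "Spice Latte (Limited Time)"],
    "Calories": ["190", "190", "5", "120", "390"],
    "Sugar (g)": ["17g", "17g", "0g", "-", "50g"],
})


def clean_menu(df):
    """Strip symbols, drop limited-time items, nulls, and duplicates."""
    out = df.copy()
    # Remove trademark signs and asterisks from drink names.
    out["Beverage"] = out["Beverage"].str.replace(r"[®™*]", "", regex=True).str.strip()
    # Drop limited-time-offer products (assumed to be flagged in the name).
    limited = out["Beverage"].str.contains("Limited Time", case=False, na=False)
    out = out[~limited]
    # Keep only digits and dots, then convert; unparseable cells become NaN.
    for col in ["Calories", "Sugar (g)"]:
        digits = out[col].astype(str).str.replace(r"[^0-9.]", "", regex=True)
        out[col] = pd.to_numeric(digits, errors="coerce")
    # Remove rows with missing values, then exact duplicates.
    return out.dropna().drop_duplicates().reset_index(drop=True)


cleaned = clean_menu(raw)
print(cleaned)
```

Coercing unparseable cells to NaN before calling dropna() means symbol-only entries such as "-" are removed together with the original null values.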
Methods and Tools
Data Cleaning: Python (Jupyter Notebooks), some preliminary steps in Excel.
Data Formatting: Pandas
Data Profiling: pandas_profiling
Data Analysis and Visualization: Matplotlib, Seaborn, RStudio, scipy.stats, NumPy, scikit-learn, pivot tables
Data Validation: Excel
Challenges & Solutions
Different attributes and column order in Starbucks and Dunkin’ datasets; resolved by renaming columns and rearranging them.
Numerous stray symbols in the datasets; cleaned up in Excel and in Python code.
Questionnaire issues: responses lacked homogeneity and contained repetitive data; addressed by improving the survey design.
Data Wrangling Process
Starbucks dataset: Profiled the data, replaced values, dropped unrelated symbols, and completed further steps in Excel.
Dunkin’ Donuts dataset: Started in Excel, then profiled the data and dropped unwanted signs and symbols.
Combining the Starbucks and Dunkin’ Donuts datasets: Listed and kept the relevant columns, renamed attributes, rearranged the column order, and used concat() to combine the data.
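The rename, reorder, and concat() steps might look like the sketch below. The original and target column names, the sample values, and the standardize helper are hypothetical, chosen only to show how two differently shaped menus can be brought onto one schema.

```python
import pandas as pd

# Hypothetical frames with made-up values and assumed column names.
starbucks = pd.DataFrame({
    "Beverage": ["Latte", "Cold Brew"],
    "Calories": [190, 5],
    "Sugars (g)": [17, 0],
    "Protein (g)": [13, 0],
})
dunkin = pd.DataFrame({
    "Total Sugar": [38, 0],
    "Product Name": ["Iced Latte", "Hot Coffee"],
    "Protein": [9, 1],
    "Cal": [200, 5],
    "Serving Note": ["medium", "medium"],  # unrelated column to drop
})

# Shared schema: the columns to keep, in the order to keep them.
COLUMNS = ["brand", "drink", "calories", "sugar_g", "protein_g"]


def standardize(df, mapping, brand):
    """Rename columns to the shared schema, tag the brand, reorder."""
    out = df.rename(columns=mapping)  # returns a new frame
    out["brand"] = brand
    return out[COLUMNS]  # selecting by list both filters and reorders


combined = pd.concat(
    [
        standardize(starbucks, {"Beverage": "drink", "Calories": "calories",
                                "Sugars (g)": "sugar_g", "Protein (g)": "protein_g"},
                    "Starbucks"),
        standardize(dunkin, {"Product Name": "drink", "Cal": "calories",
                             "Total Sugar": "sugar_g", "Protein": "protein_g"},
                    "Dunkin'"),
    ],
    ignore_index=True,
)
print(combined)
```

Adding a brand column before concatenating keeps each row traceable to its source menu once the two tables are stacked.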
Analysis and Results
Analyzed Starbucks and Dunkin’ datasets to find highest & lowest calorie, sugar, and protein drinks.
Correlated attributes in both datasets, identifying relationships between nutrients and calories.
Analyzed survey data to understand preferences of different groups and provided specific drink recommendations based on analysis.
Proposed the development of an app that suggests healthier caffeinated beverage options based on users' coffee habits, incorporating variables like sugar and protein.
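As a rough illustration of the analysis steps above, the sketch below finds the highest and lowest drinks for a nutrient, computes a correlation matrix, and filters drinks by sugar and protein limits. The data values, column names, thresholds, and the extremes and suggest helpers are all assumptions for the example; they are not the project's results or the proposed app's logic.

```python
import pandas as pd

# Made-up values for illustration only; not real nutrition figures.
menu = pd.DataFrame({
    "drink": ["A", "B", "C", "D"],
    "calories": [190, 5, 200, 390],
    "sugar_g": [17, 0, 38, 50],
    "protein_g": [13, 0, 9, 14],
})


def extremes(df, column):
    """Return the drinks with the highest and lowest value in a column."""
    return {
        "highest": df.loc[df[column].idxmax(), "drink"],
        "lowest": df.loc[df[column].idxmin(), "drink"],
    }


def suggest(df, max_sugar_g, min_protein_g):
    """Hypothetical filter: drinks within a sugar cap and protein floor."""
    ok = df[(df["sugar_g"] <= max_sugar_g) & (df["protein_g"] >= min_protein_g)]
    return ok.sort_values("calories")["drink"].tolist()


# Pairwise Pearson correlations between the nutrient columns.
corr = menu[["calories", "sugar_g", "protein_g"]].corr()

print(extremes(menu, "calories"))
print(corr.round(2))
print(suggest(menu, max_sugar_g=20, min_protein_g=10))
```

A simple threshold filter like suggest is only one possible starting point; survey-derived preferences could be added as further conditions.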