Course Project MISM 6205: Data Wrangling for Business

Improving NEU’s Coffee Health

Topics and questions

  • Project Focus: Studying coffee consumption patterns and related dietary preferences among members of the NEU community.

  • Goal: Providing relevant insights and recommendations for caffeinated beverages by combining survey insights with menu data from cafes.

  • Consumer Safety: Offering coffee drinkers suggestions for safe coffee intake, tailored to their preferences.

Data & Information Quality

  • Starbucks data: Addressed issues like symbols, null values, duplicates, and unrelated data using Python for cleaning.

  • Dunkin’ Donuts data: Removed limited-time-offer products and unrelated data, using Excel and Python for cleaning.

  • Survey data: Simplified the wording of answer options for easier analysis; removed irrelevant symbols and characters as well as NaN values.
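
The cleaning steps listed above (stripping symbols, handling nulls, dropping duplicates and unrelated columns) could look roughly like the following pandas sketch. The column names, the sample rows, and the clean_menu helper are all hypothetical and only illustrate the idea; the project's actual notebooks may have differed.

```python
import pandas as pd

# Hypothetical sample rows: the real menu files and column names are not shown in this summary.
raw = pd.DataFrame({
    "Beverage": ["Caffe Latte*", "Caffe Latte*", "Cold Brew#", "Mocha"],
    "Calories": ["190", "190", "5", "120"],
    "Sugars (g)": ["17g", "17g", "0g", "-"],
    "Store ID": [1, 1, 2, 3],  # stands in for an unrelated column
})

def clean_menu(df):
    """Drop unrelated columns, strip symbols, coerce numbers, remove nulls and duplicates."""
    out = df.drop(columns=["Store ID"]).copy()
    # Remove any character that is not a letter, digit, underscore, or whitespace.
    out["Beverage"] = out["Beverage"].str.replace(r"[^\w\s]", "", regex=True).str.strip()
    for col in ["Calories", "Sugars (g)"]:
        digits = out[col].astype(str).str.replace(r"[^0-9.]", "", regex=True)
        # Values with no digits left (such as "-") become NaN.
        out[col] = pd.to_numeric(digits, errors="coerce")
    return out.dropna().drop_duplicates().reset_index(drop=True)

cleaned = clean_menu(raw)
print(cleaned)
```

In this sketch the duplicated latte row collapses to one row and the row with a missing sugar value is dropped. Whether to drop or impute such rows is a design choice that depends on how much data is lost.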

Methods and Tools

  • Data Cleaning: Python (Jupyter Notebooks), some preliminary steps in Excel.

  • Data Formatting: Pandas

  • Data Profiling: pandas_profiling

  • Data Visualization and Analysis: Matplotlib, Seaborn, RStudio, SciPy (scipy.stats), NumPy, scikit-learn, pivot tables

  • Data Validation: Excel

Challenges & Solutions

  • Different attributes and column order in Starbucks and Dunkin’ datasets; resolved by renaming columns and rearranging them.

  • Numerous stray symbols in the datasets; cleaned up in Excel and in Python code.

  • Questionnaire issues (lack of homogeneity and repetitive data); addressed with better survey design.

Data Wrangling Process

  • Data processing for the Starbucks dataset: Profiling, replacing values, dropping unrelated symbols, and further steps in Excel.

  • Data processing for the Dunkin’ Donuts dataset: Initial cleaning in Excel, followed by profiling and dropping unwanted signs and symbols.

  • Combining the Starbucks and Dunkin’ Donuts datasets: Listed and kept the relevant columns, renamed attributes, rearranged the column order, and used concat() to combine the data.
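
The combining step above (keep relevant columns, rename, reorder, then concat()) can be sketched as follows. The original column names, the shared target schema in COLUMNS, the added "brand" column, and the harmonize helper are assumptions made for illustration, and the values are made-up placeholders rather than real menu data.

```python
import pandas as pd

# Hypothetical inputs with different attribute names and column order.
starbucks = pd.DataFrame({
    "Beverage": ["Caffe Latte"],
    "Calories": [190],
    "Sugars (g)": [17.0],
    "Protein (g)": [12.0],
})
dunkin = pd.DataFrame({
    "Total Sugar": [24.0],
    "Item": ["Iced Coffee"],
    "Protein": [1.0],
    "Cal": [120],
    "Sodium": [30],  # not part of the shared schema, so it is dropped
})

# Assumed shared schema: one name and one position per attribute.
COLUMNS = ["brand", "drink", "calories", "sugar_g", "protein_g"]

def harmonize(df, brand, rename_map):
    """Rename attributes, tag the source brand, then keep and reorder the shared columns."""
    out = df.rename(columns=rename_map)
    out["brand"] = brand
    return out[COLUMNS]

combined = pd.concat(
    [
        harmonize(starbucks, "Starbucks", {
            "Beverage": "drink", "Calories": "calories",
            "Sugars (g)": "sugar_g", "Protein (g)": "protein_g",
        }),
        harmonize(dunkin, "Dunkin'", {
            "Item": "drink", "Cal": "calories",
            "Total Sugar": "sugar_g", "Protein": "protein_g",
        }),
    ],
    ignore_index=True,  # rebuild a clean 0..n-1 index for the stacked rows
)
print(combined)
```

Selecting out[COLUMNS] does the filtering and the reordering in one step, so both frames reach pd.concat with identical columns and no NaN-filled extras appear in the result.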

Analysis and Results

  • Analyzed the Starbucks and Dunkin’ datasets to find the drinks with the highest and lowest calories, sugar, and protein.

  • Correlated attributes in both datasets, identifying relationships between nutrients and calories.

  • Analyzed survey data to understand the preferences of different groups and provided specific drink recommendations based on that analysis.

  • Proposed the development of an app that suggests healthier caffeinated beverage options based on users' coffee habits, incorporating variables like sugar and protein.
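
The first two analysis steps above (finding the highest and lowest drinks per nutrient, and correlating nutrients with calories) might be sketched like this. The drink names and numbers are placeholders, not project results, the extremes helper is hypothetical, and the use of Pearson correlation (the pandas default) is an assumption since the summary does not name the method.

```python
import pandas as pd

# Placeholder menu: illustrative values only, not the project's data.
menu = pd.DataFrame({
    "drink": ["A", "B", "C", "D"],
    "calories": [5, 120, 190, 400],
    "sugar_g": [0.0, 24.0, 17.0, 60.0],
    "protein_g": [0.0, 1.0, 12.0, 10.0],
})

NUTRIENTS = ["calories", "sugar_g", "protein_g"]

def extremes(df, column):
    """Return the drink names with the highest and lowest value in one column."""
    return {
        "highest": df.loc[df[column].idxmax(), "drink"],
        "lowest": df.loc[df[column].idxmin(), "drink"],
    }

summary = {col: extremes(menu, col) for col in NUTRIENTS}

# Pairwise Pearson correlation between the numeric attributes.
corr = menu[NUTRIENTS].corr()

print(summary)
print(corr.round(2))
```

Note that idxmax() and idxmin() return only the first matching row, so ties would need separate handling, for example by filtering on the maximum value instead.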