Foreword
This article presents, in a non-technical way, the user problem I wanted to address and the dashboard solution I created. To dive deeper, I invite you to visit my GitHub repositories and read the underlying Python code.
Tech stack : Python | Tableau
GitHub | Samsung Health revamp
Project summary
I have been using the Samsung Health smartphone app since 2016. It has allowed me to track my daily number of steps and the calories I burn when exercising (running, hiking, etc.). Still, the information presented in the app often fails to provide me with what I need to know to monitor my physical activity.
My objective in this project was to obtain the most comprehensive view of my workouts using data from the Samsung Health app (running, hiking) and Google Maps Location History (swimming and dance workouts).
To do so, I imported, cleaned, merged and analyzed all my personal data using several Python scripts. Finally, I built an interactive Tableau dashboard to solve the user problems described below.
The problems at hand
The Samsung Health app has undergone many changes in the past. Still, I do not find the app user-friendly enough, for the following reasons:
- Many activity reports are scattered throughout the app, and their information often overlaps. This is confusing for the user ("Which dashboard should I use primarily?").
- The dashboards are not easily accessible: each one takes several clicks to reach.
- Dashboards are not flexible: you have very few options to choose the periods you want to analyze, compare, etc.
- The Samsung "hearts" iconography is hard to interpret (tiny icons, many colors with no legend displayed) and requires users to make efforts to understand them.
Due to such problems, app users like me struggle to find quick insights about their physical activity levels and may be tempted to use apps / connected devices from competitors (or construct their own analytics dashboards like me!).
Below: Illustrations of some of the dashboards available in the Samsung Health app. They are dispersed in the app and can only be accessed after several clicks. Kind of confusing, isn't it?
User goals
My goal in this project was to construct a dashboard that would improve:
- Information accessibility: the information would be accessible in a few clicks and in a very straightforward way
- Information readability and ease of understanding
- Flexibility so that the user can focus on the periods and activities she is most interested in
More specifically, thanks to the new dashboard, I wanted to answer the following questions: In 6 years, how many steps did I take in total? In which year was I the most active? How did my exercise habits change over time in terms of frequency and intensity of workouts?
As the icing on the cake, I wanted to supplement the Samsung Health walking and running information with my Google Maps Location History data to:
- Get a full perspective on my workout activities (i.e. running, hiking... but also swimming and dancing)
- Link my geographical data to my workouts. The aim was to identify the places where I did long hikes for instance.
Reasons for using a Python script
I could not simply build a quick Tableau dashboard right after downloading the data from the app. Indeed:
- Tableau did not recognize the columns of the csv dataset provided by Samsung as the column names were not in the first row of the file
- The Samsung Health data had duplicates so cleaning was required before analyzing the data
- Google Maps Location History data were provided by Google as dozens of JSON files (I received 72 files in total), so merging the files was necessary before visualizing them in Tableau.
- I also wanted to test whether some Python graphs gave meaningful insights about my physical activity habits (spoiler alert: they do, but Tableau is better!)
Data preparation process
Samsung Health data
Data import
- Data source : To retrieve my pedometer data from the Samsung Health app, I had to download a csv file entitled "com.samsung.shealth.tracker.pedometer_day_summary.202210091102".
- Data structure : The original dataset had 19 variables. The first row of the csv file held the title of the file. The 2nd row contained the names of the different variables. The rows below contained the observation values.
- Information in the dataset : It was mainly about the daily number of steps and calories burned. The information was summarized for each day in the dataset.
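The layout described above (a title row, then the row of variable names, then the observations) can be handled by skipping the first line before parsing. Below is a minimal sketch using only the Python standard library. The sample content and the column names `step_count`, `calorie` and `create_time` are illustrative assumptions, not the exact content of the Samsung export.

```python
import csv
import io

# Illustrative sample: row 1 is the file title, row 2 holds the column names.
# Both the title and the column names are assumptions made for this sketch.
SAMPLE = """com.samsung.shealth.tracker.pedometer_day_summary
step_count,calorie,create_time
8500,310.5,2022-10-01 08:00:00
12040,455.0,2022-10-02 08:00:00
"""


def read_samsung_csv(handle):
    """Skip the title row, then parse the rest using row 2 as the header."""
    next(handle)  # discard the title line so the header row comes first
    return list(csv.DictReader(handle))


rows = read_samsung_csv(io.StringIO(SAMPLE))
print(rows[0]["step_count"])
```

With pandas, `pandas.read_csv(path, skiprows=1)` should give the same result, since it ignores the first line and reads the header from the next one.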
Data quality checks:
- Completion rate: All columns were almost fully populated except one, 'source info' (more than half of its values were null). It was dropped along with other columns that were not useful for the case study.
- Duplicates check: The original dataset contained lots of duplicated values that could have led to wrong results if ignored. After some investigation, the duplicates seemed to result from:
i) Some specific columns: "binning data", "source info" and "deviceuuid". I chose to focus on a specific deviceuuid, the one that covered the whole period, and dropped the others.
ii) The mix between "create date" and "update date" information. This made me create a new variable, "clean date".
- Outliers: No outlier was detected in the dataset / the values made sense.
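The two duplicate sources listed above can be sketched as follows. This is only one possible rule, not necessarily the one used in the project: it keeps a single device, derives "clean date" from the creation timestamp, and keeps the most recently updated row for each day. The field names `deviceuuid`, `create_time`, `update_time` and `step_count`, as well as the sample values, are assumptions.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Illustrative rows; field names and values are assumptions for this sketch.
rows = [
    {"deviceuuid": "A", "create_time": "2022-10-01 08:00:00",
     "update_time": "2022-10-01 12:00:00", "step_count": 4000},
    {"deviceuuid": "A", "create_time": "2022-10-01 08:00:00",
     "update_time": "2022-10-01 23:59:00", "step_count": 9500},
    {"deviceuuid": "B", "create_time": "2022-10-01 08:00:00",
     "update_time": "2022-10-01 23:59:00", "step_count": 9500},
    {"deviceuuid": "A", "create_time": "2022-10-02 08:00:00",
     "update_time": "2022-10-02 23:59:00", "step_count": 7000},
]


def deduplicate(rows, device):
    """Keep one row per day for one device: the most recently updated one."""
    latest = {}
    for row in rows:
        if row["deviceuuid"] != device:
            continue  # ignore the devices that do not cover the whole period
        # Assumed rule: the "clean date" is the calendar day of creation.
        clean_date = datetime.strptime(row["create_time"], FMT).date()
        updated = datetime.strptime(row["update_time"], FMT)
        kept = latest.get(clean_date)
        if kept is None or updated > kept[0]:
            latest[clean_date] = (updated, {**row, "clean_date": clean_date})
    # Return the surviving rows in chronological order.
    return [entry[1] for _, entry in sorted(latest.items())]


cleaned = deduplicate(rows, device="A")
print([(str(r["clean_date"]), r["step_count"]) for r in cleaned])
```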
Illustration of the duplicates problem in the csv raw data (before cleaning)
Google Maps Location History data
Data import
Google Maps data were difficult to import in Python because they were:
- located in different subfolders (one per year)
- stored across multiple files (I received 72 files in total)
- written in deep-nested JSON format (which was a new format to me)
Example below (credits: jsoncrack.com): Representation of a single data entry in one of the many JSON files
- Data structure : Google Maps stores data in "placeVisit" and "activitySegment" objects.
- Information in the dataset :
. The "placeVisit" object provides the location (latitude and longitude) of the place visited and the time spent there. The address generally has an ID and a name. Sometimes, the name is null if there is no Google Maps address (shop, cinema, school, etc.) at this precise location. The object also gives alternative names with confidence levels in case the main address name is wrong.
. In the "activitySegment" object, data are much more detailed. They can give the exact itinerary taken to a certain address (including the names of the subway stations!).
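To make the import step more concrete, here is a hedged sketch of how files spread across yearly subfolders could be merged and their "placeVisit" objects flattened into simple records. Only "placeVisit" and "activitySegment" come from the description above. The other key names (`timelineObjects`, `location`, `duration`, `placeId`, `latitudeE7`, `longitudeE7`, `startTimestamp`, `endTimestamp`) are assumptions about the export layout and should be checked against your own files.

```python
import json
from pathlib import Path


def _degrees(value_e7):
    """Convert an integer coordinate scaled by 1e7 to decimal degrees."""
    return None if value_e7 is None else value_e7 / 1e7


def parse_place_visits(data, source=""):
    """Flatten the placeVisit objects of one parsed JSON file into dicts."""
    visits = []
    # "timelineObjects" and the nested key names below are assumptions.
    for obj in data.get("timelineObjects", []):
        visit = obj.get("placeVisit")
        if visit is None:
            continue  # activitySegment entries are skipped in this sketch
        location = visit.get("location", {})
        duration = visit.get("duration", {})
        visits.append({
            "source_file": source,
            "place_id": location.get("placeId"),
            "name": location.get("name"),  # None when Google has no name
            "latitude": _degrees(location.get("latitudeE7")),
            "longitude": _degrees(location.get("longitudeE7")),
            "start": duration.get("startTimestamp"),
            "end": duration.get("endTimestamp"),
        })
    return visits


def load_place_visits(root):
    """Merge the placeVisit entries of every JSON file found under `root`."""
    merged = []
    # rglob walks all subfolders, so one folder per year is handled as well.
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as handle:
            merged.extend(parse_place_visits(json.load(handle), path.name))
    return merged
```

Calling `load_place_visits` on the root export folder returns one flat list, which can then be written to a single csv file for Tableau.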
Data cleaning and preparation
- To identify my dance and swimming workouts: I searched the Google Maps place names for the word "piscine" (swimming pool in French). For dancing, it was a more manual process: the dance studios I attended sometimes had no name recorded by Google, so I had to manually define several place IDs in the data.
- Geographical information: I tried to extract the zip codes of the places I visited. This worked quite well for French addresses, but not so well for Asian countries like Japan. Indeed, addresses do not follow the same conventions across countries, and Python did not seem to recognize kanji.
- Privacy : I hid personal addresses (home, work) as much as possible in my code and dataframes
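The workout tagging and zip code extraction described above could look like the sketch below. The place IDs are hypothetical placeholders, and the five-digit pattern only targets French-style postcodes, which is consistent with the extraction working less well for Japanese addresses.

```python
import re

# Hypothetical place IDs standing in for dance studios recorded without a name.
DANCE_PLACE_IDS = {"PLACE_ID_STUDIO_1", "PLACE_ID_STUDIO_2"}

# French postcodes are five digits; other countries need other patterns.
FRENCH_POSTCODE = re.compile(r"\b(\d{5})\b")


def classify_workout(name, place_id):
    """Return 'dance', 'swimming' or None for a visited place."""
    if place_id in DANCE_PLACE_IDS:
        return "dance"  # manual list, because the name may be missing
    if name and "piscine" in name.lower():
        return "swimming"  # "piscine" means swimming pool in French
    return None


def extract_french_postcode(address):
    """Return the first five-digit group found in the address, if any."""
    if not address:
        return None
    match = FRENCH_POSTCODE.search(address)
    return match.group(1) if match else None


print(classify_workout("Piscine Municipale", "SOME_OTHER_ID"))
print(extract_french_postcode("1 Rue Exemple, 75005 Paris, France"))
```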
Data analysis in Python
Samsung Health
I created a few graphs in my Python Jupyter Notebook to produce the insights I could not easily get from the Samsung Health app. I focused mostly on 'long-term' analyses (yearly and monthly analyses).
I summarize below some of the insights I could get thanks to the above graphs (disclaimer: 2016 and 2022 are not complete years in the dataset)
- Overview: I walked ca. 12,000 km over 6 years. This is the distance between Paris (France) and Vladivostok (Russia)!
- On average, I walk ca. 1,700 km per year, but there are some variations. For instance, 2019 was the year I was the least active (1,400 km) as I was busy preparing for my final exams and searching for a data analyst job. By contrast, 2021 was my record-breaking year (2,900 km). I was so excited that Covid-19 was (temporarily) over that I went out as much as possible. Every weekend, I was going out, hiking, etc.
- It seems that I have become more active since the beginning of the Covid-19 crisis in France (cf. the positive slope over several consecutive months from March 2020 onwards). In 2022, however, I walked less than in 2021, although the daily average number of steps is still above that of the other years.
- Some seasonality seems to be at play. My busiest periods often take place in summer (July to September).
i) Second half of 2018: This corresponds to the period I was working in Japan. I had to walk a lot in my daily life to go to work and do my grocery shopping (my apartment was located 20 minutes away from the nearest train station). In summer, I also enjoyed the "Obon" holidays by traveling to Hokkaido.
ii) August 2019: This refers to my trip to the Netherlands before starting my first full-time job.
iii) August 2020: Covid-19 was (temporarily) over in France and I joined hiking groups to walk in the forests every weekend.
- Since 2021, there have been more days when I walk a lot (more than 20 km/day). Confidence intervals and boxplots tend to be wider in 2021 and 2022 compared to previous years.
- When analyzing the most recent period (September / October 2022), it appears that I met the "10,000 steps a day" challenge only about one day in three (12 days out of 39 in total). Indeed, I may walk a lot in a given month, but my walking may be concentrated on a few days. For instance, the days I reached my goal were mostly Fridays (hiking activities with my coworkers during lunchtime) and Saturdays (Meetup hikes). By contrast, I tend not to be very active on Tuesdays (remote work). Finally, regarding Week 37 of 2022, I was able to reach the objective every day (except Sunday) because I was on holiday and could hike in Spain the whole week.
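The yearly totals and the "10,000 steps a day" analysis above rest on two simple aggregations over daily records. Here is a minimal sketch of both, using made-up values rather than my actual data; the record layout (day, steps, distance in km) is an assumption.

```python
from collections import defaultdict
from datetime import date

STEP_GOAL = 10_000
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up daily records for illustration: (day, steps, distance in km).
daily = [
    (date(2021, 8, 14), 27000, 21.0),
    (date(2022, 9, 16), 12500, 9.4),
    (date(2022, 9, 17), 15200, 11.8),
    (date(2022, 9, 20), 3100, 2.3),
    (date(2022, 9, 23), 10400, 7.9),
]


def km_per_year(records):
    """Sum the walked distance for each calendar year."""
    totals = defaultdict(float)
    for day, _steps, km in records:
        totals[day.year] += km
    return dict(totals)


def goal_days_by_weekday(records, goal=STEP_GOAL):
    """Count, per weekday, the days on which the step goal was reached."""
    counts = defaultdict(int)
    for day, steps, _km in records:
        if steps >= goal:
            counts[WEEKDAYS[day.weekday()]] += 1
    return dict(counts)


print(km_per_year(daily))
print(goal_days_by_weekday(daily))
```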
Google Maps Location History
Here are some learnings from the above graphs:
- My swimming and dancing activities seem to follow the same pattern from 2017 to 2019, and then start diverging drastically. Indeed, from 2017 to 2019, I was often abroad so I had no time for sports activities. When the Covid-19 crisis broke out in early 2020, I completely stopped taking dance classes and have never gone back since. By contrast, I dramatically increased the number of my swimming workouts (a sixfold increase between 2020 and 2022).
- The duration of my workouts was pretty stable over time. On average, I spent 1 hour and a half at the dance studio. This makes sense since my dance classes usually last 90 minutes. Regarding swimming, the duration of my workouts has slightly increased (52 minutes to 78 minutes on average between 2018 and 2022).
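The two indicators discussed above (number of workouts per year and average time spent on site) can be derived from the start and end times of each visit. The sketch below uses made-up visits; the tuple layout (activity, start, end) is an assumption, and the time spent at the venue is used as a proxy for the workout duration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up visits for illustration: (activity, start, end).
visits = [
    ("dance", datetime(2019, 3, 5, 19, 0), datetime(2019, 3, 5, 20, 30)),
    ("swimming", datetime(2022, 6, 1, 12, 0), datetime(2022, 6, 1, 13, 0)),
    ("swimming", datetime(2022, 6, 8, 12, 0), datetime(2022, 6, 8, 13, 30)),
]


def summarize_workouts(visits):
    """Return {(year, activity): (number of workouts, mean minutes)}."""
    durations = defaultdict(list)
    for activity, start, end in visits:
        minutes = (end - start).total_seconds() / 60
        durations[(start.year, activity)].append(minutes)
    return {key: (len(values), mean(values))
            for key, values in durations.items()}


summary = summarize_workouts(visits)
print(summary)
```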
Final output: an interactive Tableau dashboard
After combining all data sources in Python, I exported them to csv format and created an interactive dashboard on Tableau Public to reach the user goals that I had defined at the beginning of the project.
Link to the dashboard
You may also see below video screenshots of the dashboard (final product).
The above dashboard addresses the user problems defined at the beginning of the case study, namely:
- Information accessibility: Every KPI is available in one single dashboard (4 tabs in total).
- Information readability:
i) The dashboard has a clear title that mentions the specific indicator, statistic and time period used. This is far more explicit than the titles in the Samsung app ("Daily activity", "Steps" or "All exercises").
ii) The graph has no complex iconography and uses labels.
iii) A tooltip is displayed when the mouse hovers over the graph, displaying all KPIs for the period.
- Flexibility: The user can explore her statistics in a very flexible way. She may choose:
i) the level of granularity of her summarized data (daily, weekly, monthly, quarterly, yearly, all period).
ii) the statistic (sum, average, minimum, maximum)
iii) the indicator to track (distance in km, number of steps, number of walk steps, number of run steps, calories burned, number of times she reached her daily step count objective)
iv) the period (the year and the exact dates)
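Conceptually, the granularity and statistic controls above amount to a group-then-aggregate operation. The sketch below illustrates that logic in plain Python. It is not how Tableau implements it, and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

STATISTICS = {"sum": sum, "average": mean, "minimum": min, "maximum": max}


def period_key(day, granularity):
    """Return the label of the period that `day` falls into."""
    if granularity == "daily":
        return day.isoformat()
    if granularity == "weekly":
        iso_year, iso_week = day.isocalendar()[:2]
        return f"{iso_year}-W{iso_week:02d}"
    if granularity == "monthly":
        return f"{day.year}-{day.month:02d}"
    if granularity == "quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if granularity == "yearly":
        return str(day.year)
    if granularity == "all period":
        return "all"
    raise ValueError(f"unknown granularity: {granularity}")


def aggregate(records, granularity, statistic):
    """Group (day, value) pairs by period and apply the chosen statistic."""
    groups = defaultdict(list)
    for day, value in records:
        groups[period_key(day, granularity)].append(value)
    return {period: STATISTICS[statistic](values)
            for period, values in groups.items()}


# Made-up daily step counts, for illustration only.
steps = [
    (date(2022, 9, 30), 8000),
    (date(2022, 10, 1), 12000),
    (date(2022, 10, 2), 4000),
]
print(aggregate(steps, "monthly", "sum"))
```

Filtering on the period (the fourth control) would simply mean keeping the records whose day falls between the chosen dates before calling `aggregate`.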