Analytics portfolio: Samsung Health x Google Maps

Project summary

I have been using the Samsung Health smartphone app since 2016. It has allowed me to track my daily number of steps and the calories I burn when exercising (running, hiking, etc.). Still, the information presented in the app often fails to tell me what I need to know to monitor my physical activity.

My objective in this project was to obtain the most comprehensive view of my workouts using data from the Samsung Health app (running, hiking) and Google Maps Location History (swimming and dance workouts).

To do so, I imported, cleaned, merged and analyzed all my personal data using several Python scripts. Finally, I built an interactive Tableau dashboard to address the user problems described below.

The problems at hand

The Samsung Health app has undergone many changes over the years. Still, I do not find the app user-friendly enough, for the following reasons:

Many activity reports are scattered throughout the app, and their information often overlaps. This is confusing for the user ("Which dashboard should I use primarily?").
Dashboards are not easily accessible: you need several clicks to reach each one.
Dashboards are not flexible: you have very few options to choose the periods you want to analyze, compare, etc.
The Samsung "hearts" iconography is hard to interpret (tiny icons, many colors with no legend displayed) and takes effort to understand.

Because of these problems, app users like me struggle to find quick insights about their physical activity levels and may be tempted to use apps or connected devices from competitors (or build their own analytics dashboards, like me!).

Below: Illustrations of some of the dashboards available in the Samsung Health app. They are dispersed throughout the app and can only be reached after several clicks. Kind of confusing, isn't it?

Samsung Health: one of the many dashboards

Reasons for using a Python script

I could not simply build a quick Tableau dashboard right after downloading the data from the app, for several reasons:

Tableau did not recognize the columns of the CSV dataset provided by Samsung because the column names were not in the first row of the file.
The Samsung Health data had duplicates, so cleaning was required before analyzing the data.
Google Maps Location History was provided by Google as dozens of JSON files (I received 72 files in total), so merging the files was necessary before visualizing them in Tableau.
I also wanted to test whether some Python graphs gave meaningful insights about my physical activity habits (spoiler alert: they do, but Tableau is better!).

Data preparation process

Samsung Health data

Data import

Data source: To retrieve my pedometer data from the Samsung Health app, I had to download a CSV file entitled "com.samsung.shealth.tracker.pedometer_day_summary.202210091102".
Data structure: The original dataset had 19 variables. The first row of the CSV file held the title of the file. The second row contained the names of the variables. The rows below contained the observation values.
Information in the dataset: It was mainly about the daily number of steps and calories burned. The information was summarized for each day in the dataset.
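
A file laid out this way (title on the first row, column names on the second) can be loaded by skipping the title row. The snippet below is a minimal sketch with pandas; the miniature file content and the column names are hypothetical stand-ins, not the real export.

```python
import io

import pandas as pd

# Hypothetical miniature of the export: row 1 is a title line, row 2 holds the column names.
raw = io.StringIO(
    "com.samsung.shealth.tracker.pedometer_day_summary,1,2\n"
    "step_count,calorie,create_time\n"
    "8000,310.5,2022-10-01 08:00:00\n"
    "12000,450.0,2022-10-02 08:00:00\n"
)


def load_pedometer_csv(source):
    # skiprows=1 drops the title line so that the next row is used as the header
    return pd.read_csv(source, skiprows=1)


steps = load_pedometer_csv(raw)
print(steps.columns.tolist())
print(len(steps))
```

The same function accepts a file path instead of the in-memory buffer used here.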

Data quality checks:

Completion rate: All columns were well populated except one, 'source info' (more than half of its values were null). It was dropped along with other columns that were not useful for the case study.
Duplicates check: The original dataset contained many duplicated values that could have led to wrong results if ignored. After some investigation, the duplicates seemed to result from:
i) Some specific columns: "binning data", "source info" and "deviceuuid". I chose to focus on a single deviceuuid, the one that covered the whole period, and dropped the others.
ii) The mix of "create date" and "update date" information. This led me to create a new variable, "clean date".
Outliers: No outliers were detected in the dataset; the values made sense.
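
The checks above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column names and sample rows are hypothetical, and keeping the most recently updated record per day is one plausible way to build a "clean date", not necessarily the exact rule used in the project.

```python
import pandas as pd

# Hypothetical rows: two devices, and one day recorded twice by the main device.
df = pd.DataFrame({
    "deviceuuid": ["A", "A", "A", "B"],
    "create_time": pd.to_datetime([
        "2022-01-01 09:00", "2022-01-01 09:00", "2022-01-02 09:00", "2022-01-01 10:00"]),
    "update_time": pd.to_datetime([
        "2022-01-01 12:00", "2022-01-01 23:00", "2022-01-02 23:00", "2022-01-01 23:00"]),
    "step_count": [4000, 9000, 11000, 500],
    "source_info": [None, None, "x", None],
})

# Completion rate: share of non-null values per column
completion = df.notna().mean()

# Keep the device whose records span the longest period (span measured in seconds)
span = df.groupby("deviceuuid")["create_time"].agg(
    lambda s: (s.max() - s.min()).total_seconds())
main_device = span.idxmax()
clean = df[df["deviceuuid"] == main_device].copy()

# One row per calendar day: derive a date, then keep the most recently updated record
clean["clean_date"] = clean["create_time"].dt.normalize()
clean = (
    clean.sort_values("update_time")
    .drop_duplicates(subset="clean_date", keep="last")
    .drop(columns=["source_info"])
    .reset_index(drop=True)
)
print(completion)
print(clean)
```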

Illustration of the duplicates problem in the csv raw data (before cleaning)

Google Maps Location History data

Data import

Google Maps data were difficult to import into Python because they were:

located in different subfolders (one per year)
stored across multiple files (I received 72 files in total)
written in a deeply nested JSON format (which was new to me)

Example below (credits: jsoncrack.com): Representation of a single data entry in one of the many JSON files

JSON structure of Google Maps Location History data

Data structure: Google Maps stores data in "placeVisit" and "activitySegment" objects.
Information in the dataset:
The "placeVisit" object provides the location (latitude and longitude) of the place visited and the time spent there. The place generally has an ID and a name. Sometimes the name is null, when there is no Google Maps listing (shop, cinema, school, etc.) at this precise location. The object also gives alternative names with confidence levels in case the main name is wrong.
The "activitySegment" object is much more detailed. It can give the exact itinerary taken to a certain address (including the names of the subway stations!).
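
Merging the files can be sketched as walking the yearly subfolders, reading each JSON file, and flattening every "placeVisit" into one table row. Only "placeVisit" and "activitySegment" come from the description above; the other keys ("timelineObjects", "latitudeE7", "placeId", "startTimestamp", and so on) are assumptions about the export layout and should be checked against the actual files. The function name and the sample data are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_place_visits(root):
    """Collect every placeVisit found in the JSON files under root, subfolders included."""
    rows = []
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            content = json.load(f)
        # "timelineObjects" is an assumed top-level key
        for obj in content.get("timelineObjects", []):
            visit = obj.get("placeVisit")
            if visit is None:  # e.g. an activitySegment entry
                continue
            location = visit.get("location", {})
            duration = visit.get("duration", {})
            lat = location.get("latitudeE7")
            lon = location.get("longitudeE7")
            rows.append({
                "source_file": path.name,
                "place_id": location.get("placeId"),
                "name": location.get("name"),  # None when the place has no recorded name
                "latitude": lat / 1e7 if lat is not None else None,
                "longitude": lon / 1e7 if lon is not None else None,
                "start": duration.get("startTimestamp"),
                "end": duration.get("endTimestamp"),
            })
    visits = pd.DataFrame(rows)
    # Assumes ISO 8601 timestamps written in one consistent format
    visits["start"] = pd.to_datetime(visits["start"], utc=True)
    visits["end"] = pd.to_datetime(visits["end"], utc=True)
    visits["minutes"] = (visits["end"] - visits["start"]).dt.total_seconds() / 60
    return visits


# Demo on two made-up files, one per yearly subfolder
sample = {"timelineObjects": [
    {"placeVisit": {
        "location": {"latitudeE7": 488566000, "longitudeE7": 23522000,
                     "placeId": "hypothetical_id", "name": "Piscine Exemple"},
        "duration": {"startTimestamp": "2020-03-01T10:00:00Z",
                     "endTimestamp": "2020-03-01T11:15:00Z"}}},
    {"activitySegment": {"activityType": "WALKING"}},
]}
with tempfile.TemporaryDirectory() as tmp:
    for year in ("2019", "2020"):
        folder = Path(tmp) / year
        folder.mkdir()
        (folder / f"{year}_MARCH.json").write_text(json.dumps(sample), encoding="utf-8")
    visits = load_place_visits(tmp)

print(visits[["source_file", "name", "latitude", "minutes"]])
```

Flattening by hand keeps the handling of missing names and coordinates explicit, which matters here because some places have no recorded name.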

Data cleaning and preparation

Identifying my dance and swimming workouts: I searched the Google Maps place names for the word "piscine" (swimming pool in French). Dancing required a more manual process: the dance studios I attended sometimes had no name recorded by Google, so I had to manually list several place IDs in the data.
Geographical information: I tried to extract the zip codes of the places I visited. This worked quite well for French addresses, but not as well for Asian countries like Japan. Address conventions differ across countries, and my Python code did not seem to handle kanji.
Privacy: I hid personal addresses (home, work) as much as possible in my code and dataframes.
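
The labeling and zip code steps can be illustrated as follows. This is a sketch under assumptions: the place IDs, names and addresses are made up, and the five-digit pattern only matches French-style postal codes, which is consistent with the weaker results on Japanese addresses.

```python
import pandas as pd

# Hypothetical visits; place IDs, names and addresses are made up for illustration.
visits = pd.DataFrame({
    "place_id": ["id_pool", "id_studio", "id_shop", "id_pool2"],
    "name": ["Piscine Exemple", None, "Boulangerie", "piscine municipale"],
    "address": [
        "1 Rue de l'Exemple, 75005 Paris, France",
        "2 Rue du Test, 75011 Paris, France",
        "Tokyo, Japan",
        None,
    ],
})

# Dance studios are listed by hand because some of them have no recorded name
DANCE_PLACE_IDS = {"id_studio"}


def label_workout(row):
    name = row["name"]
    if isinstance(name, str) and "piscine" in name.lower():
        return "swimming"
    if row["place_id"] in DANCE_PLACE_IDS:
        return "dance"
    return None


visits["workout"] = visits.apply(label_workout, axis=1)

# French postal codes have five digits; other countries' formats are not matched
visits["zip_code"] = visits["address"].str.extract(r"\b(\d{5})\b", expand=False)

print(visits[["name", "workout", "zip_code"]])
```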

Data analysis in Python

Samsung Health

I created a few graphs in my Python Jupyter Notebook to produce the insights I could not easily get from the Samsung Health app. I focused mostly on 'long-term' analyses (yearly and monthly analyses).

Bar chart | Total distance walked per year

A line chart displaying the average daily number of steps, aggregated per month

Boxplot comparing physical activity in 2019 vs 2022

Heat map: Days when I reached my 10k steps/day objective

Below, I summarize some of the insights I drew from the above graphs (disclaimer: 2016 and 2022 are not complete years in the dataset):

Overview: I walked ca. 12,000 km over 6 years. This is the distance between Paris (France) and Vladivostok (Russia)!
On average, I tend to walk ca. 1,700 kilometers/year, but there are some variations. For instance, 2019 was the year I was the least active (1,400 km), as I was busy preparing for my final exams and searching for a data analyst job. In contrast, 2021 was my record-breaking year (2,900 km). I was so excited that Covid-19 was (temporarily) over that I went out as much as possible. Every weekend, I was going out, hiking, etc.
It seems that I have become more active since the beginning of the Covid-19 crisis in France (cf. positive slope for several consecutive months from March 2020 onwards). In 2022, however, I walked less than in 2021, but the daily average number of steps is still above that of the other years.
Some seasonality seems to be at play. My busiest periods often take place in summer (July to September).
i) Second half of 2018: This corresponds to the period I was working in Japan. I had to walk a lot in my daily life to go to work and to do grocery shopping (my apartment was located 20 minutes away from the nearest train station). In summer, I also enjoyed the "Obon" holidays by traveling to Hokkaido.
ii) August 2019: This refers to my trip to the Netherlands before starting my first full-time job.
iii) August 2020: Covid was (temporarily) over in France and I joined hiking groups to walk in the forests every weekend.
Since 2021, there have been more days when I walk a lot (more than 20 km/day). Confidence intervals and boxplots tend to be wider in 2021 and 2022 compared to previous years.
When analyzing the most recent period (September / October 2022), it appears that I reached the "10,000 steps a day" objective only about one day in three (12 days out of 39 in total). I may walk a lot in a given month, but my walking may be concentrated on a few days. For instance, the days I reached my goal were mostly Fridays (hiking activities with my coworkers during lunchtime) and Saturdays (Meetup hikes). In contrast, I tend not to be very active on Tuesdays (remote work). Finally, in Week 37 of 2022, I was able to reach the objective every day (except Sunday) because I was on holiday and could hike in Spain the whole week.
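
The "goal reached" analysis behind the heat map can be sketched as a daily flag, summarized overall and by weekday. The daily totals and column names below are hypothetical; only the 10,000-step threshold comes from the text.

```python
import pandas as pd

# Hypothetical daily totals for two weeks starting on a Monday
daily = pd.DataFrame({
    "clean_date": pd.date_range("2022-09-05", periods=14, freq="D"),
    "step_count": [6000, 3000, 8000, 7000, 12000, 15000, 5000,
                   7000, 2500, 9000, 10000, 11000, 13000, 4000],
})

GOAL = 10_000
daily["goal_reached"] = daily["step_count"] >= GOAL

# Share of days on which the objective was met
share_reached = daily["goal_reached"].mean()

# Number of successful days per weekday
by_weekday = daily.groupby(daily["clean_date"].dt.day_name())["goal_reached"].sum()

print(f"Goal reached on {share_reached:.0%} of days")
print(by_weekday)
```

With the figures quoted above, the same share would be 12 / 39, or roughly 31% of days.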

Google Maps Location History

Python lineplot: Number of dance and swimming sessions per year

Python boxplot representing duration of swimming and dance workouts

Here are some takeaways from the above graphs:

My swimming and dancing activities seem to follow the same pattern from 2017 to 2019, and then diverge drastically. From 2017 to 2019, I was often abroad, so I had little time for sports activities. When the Covid-19 crisis broke out in early 2020, I completely stopped taking dance classes and I have not gone back since. In contrast, I dramatically increased the number of my swimming workouts (a sixfold increase between 2020 and 2022).
The duration of my workouts was fairly stable over time. On average, I spent an hour and a half at the dance studio. This makes sense since my dance classes usually last 90 minutes. Regarding swimming, the duration of my workouts has increased (from 52 minutes to 78 minutes on average between 2018 and 2022).

Final output: an interactive Tableau dashboard

After combining all data sources in Python, I exported them to CSV format and created an interactive dashboard on Tableau Public to meet the user goals that I had defined at the beginning of the project.

Link to the dashboard

You can also see video captures of the dashboard (final product) below.

The above dashboard addresses the user problems defined at the beginning of the case study, namely:

Information accessibility: Every KPI is available in one single dashboard (4 tabs in total).
Information readability:
i) The dashboard has a clear title that mentions the specific indicator, statistic and time period used. This is far more explicit than the titles in the Samsung app ("Daily activity", "Steps" or "All exercises").
ii) The graph has no complex iconography and uses labels.
iii) A tooltip is displayed when the mouse hovers over the graph, showing all KPIs for the period.
Flexibility: The user can explore her statistics in a very flexible way. She may choose:
i) the level of granularity of her summarized data (daily, weekly, monthly, quarterly, yearly, all periods)
ii) the statistic (sum, average, minimum, maximum)
iii) the indicator to track (distance in km, number of steps, number of walk steps, number of run steps, calories burned, number of times she reached her daily step count objective)
iv) the period (the year and the exact dates)
