Holiday Ideas Database

Web scraping | API | Dataiku

Q1 2023

Database overview in Dataiku

Foreword

This article presents, in a non-technical way, the user problem I wanted to address as well as the solution I created. To dive deeper, I kindly invite you to visit my Github repository to read the underlying Python code.
Tech stack: Python | Dataiku

Legal disclaimer: I hereby certify that the data I scraped: i) do not contain any personal information and ii) are publicly available on the internet. The web scraping was performed as a one-shot operation (i.e. no recurring cron jobs) and in small batches so as not to overload the web servers. Finally, the scraped data were not used for any commercial purposes.

Github

Project summary

The process of planning one's holidays can sometimes be overwhelming. There are simply so many possibilities and so many different websites to browse for trip ideas.

My objective in this project was to build a database aggregating data from 3 well-known French websites dedicated to hiking and trekking trips: "Terres d'Aventures", "La Balaguère" and "Decathlon Travel". It would become my personal "holiday ideas database".

To do so, I scraped the data from those three sites, cleaned them, and merged them.

Data flow representation

Web scraping: Methodology overview

Three web scraping methods were applied, each tailored to the architecture of the corresponding website.

La Balaguère: Scrapy method

Go to website

The La Balaguère website is a very well-structured HTML page. All results are displayed as list items of an unordered list. Information such as the name, price and duration is well tagged, so it can easily be accessed with CSS selectors (see the illustration below for how to get, for instance, the name of the trip).

La Balaguere website source code
Command prompt - Scrapy

Still, the website was tricky for 2 reasons (see illustration below):

La Balaguere last page result

I coded the 2 spiders in separate Python scripts: one for the "voyage" results, one for the "voyage last" results. The 2 spiders were then launched from the Anaconda command prompt with the command "scrapy crawl labalaguere -O labalaguere.json". This yielded the following JSON output almost instantaneously:

Scrapy JSON output
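
To give a concrete idea of the approach, here is a minimal sketch of such a spider. The start URL and CSS selectors below are hypothetical placeholders, not the exact ones used in my scripts:

    # Hypothetical Scrapy spider for La Balaguère (illustrative selectors only)
    import scrapy

    class LaBalaguereSpider(scrapy.Spider):
        name = "labalaguere"
        start_urls = ["https://www.labalaguere.com/voyages.html"]  # assumed listing page

        def parse(self, response):
            # Each trip is a list item of an unordered list (assumed selectors)
            for trip in response.css("ul.results li"):
                yield {
                    "name": trip.css("h3.title::text").get(),
                    "price": trip.css("span.price::text").get(),
                    "duration": trip.css("span.duration::text").get(),
                }
            # Follow the pagination link if one exists (assumed selector)
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Running "scrapy crawl labalaguere -O labalaguere.json" on such a spider exports every yielded item to a JSON file.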

Terres d'Aventures: Selenium method

Go to website

I could not use the Scrapy method for the Terdav website because the spider was blocked at the 20-result threshold. Indeed, after 20 results, the website shows a "Voir plus de voyages" (see more trips) button that is coded in JavaScript. Scrapy cannot click this button since it has no associated URL (see illustration below).

Terdav source code for the button "see more trips"

Thus, I chose another scraping method: Selenium.
Selenium WebDriver is not a web scraping method per se: it is a remote control interface that enables introspection and control of user agents (browsers). In my case, I wanted to create a little robot that opens a new browser window, scrolls down the page, clicks on the "Voir plus de voyages" button every time it sees it, and then scrapes all the results. In doing so, I faced 2 main difficulties:

Illustrations of the process: Dealing with the cookies popup window

Website cookies popup window
Code used to parse Terdav website
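
Since the full script lives in the Github repository, here is only a hedged sketch of the Selenium logic: dismiss the cookies popup, click "Voir plus de voyages" until it disappears, then collect the results. The URL, button identifiers and CSS selectors are assumptions for illustration:

    # Illustrative Selenium robot (URL, ids and selectors are assumed, not the real ones)
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import (NoSuchElementException,
                                            ElementClickInterceptedException)

    driver = webdriver.Chrome()
    driver.get("https://www.terdav.com/recherche")  # hypothetical search page
    time.sleep(3)

    # Difficulty 1: the cookies popup blocks any click until it is dismissed
    try:
        driver.find_element(By.ID, "cookies-agree-button").click()  # assumed button id
    except NoSuchElementException:
        pass

    # Click "Voir plus de voyages" as long as the button is displayed
    while True:
        try:
            button = driver.find_element(By.CSS_SELECTOR, "button.see-more")  # assumed
            driver.execute_script("arguments[0].scrollIntoView();", button)
            button.click()
            time.sleep(2)  # let the new results load
        except (NoSuchElementException, ElementClickInterceptedException):
            break

    # Scrape all the loaded results (assumed selectors)
    trips = driver.find_elements(By.CSS_SELECTOR, "div.trip-card")
    rows = [{"name": t.find_element(By.CSS_SELECTOR, "h3").text,
             "price": t.find_element(By.CSS_SELECTOR, ".price").text}
            for t in trips]
    driver.quit()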

Thanks to the Selenium parser, I was able to obtain the following results:

Results overview (csv format)

Decathlon Travel: Requests x BeautifulSoup methods

Go to website

The Decathlon Travel website has the same structure as that of Terres d'Aventures: it displays a few search results and then stops with a "See more results" button. To get the results, I could have used Scrapy (the button seems to have an associated URL) or Selenium. But I wanted to take advantage of a feature unique to Decathlon Travel: the possibility to retrieve results via an API.

Decathlon Travel HTTP GET

Indeed, whenever one clicks on the "See more results" button, Decathlon Travel exposes a request URL in the "Network" tab of the browser's developer tools. It can be accessed with the GET method.
I used the software Insomnia to send the request to the API and to generate the associated Python client code. Thanks to this code, I could access the HTML response.
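
As an illustration of what this client code boils down to, here is a minimal sketch; the endpoint and query parameters are placeholders, not the actual Decathlon Travel API:

    # Hypothetical GET request mimicking the exposed request URL
    import requests

    url = "https://www.decathlontravel.com/api/search"   # placeholder endpoint
    params = {"activity": "hiking", "page": 1}           # placeholder parameters

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    html = response.text  # the response body is an HTML fragment with the result cards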

Insomnia - HTTP request
Insomnia - Generate Python code

To interpret and make sense of the HTML response, I used the Python library BeautifulSoup to parse the results. I stored the results in a dictionary that I then converted to a standard dataframe. This yielded the following results:
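
A self-contained sketch of that parsing step could look like the following; the endpoint and CSS classes are again hypothetical:

    # Parse the HTML response and convert the results to a dataframe
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://www.decathlontravel.com/api/search",   # placeholder
                        params={"activity": "hiking", "page": 1},
                        timeout=30).text

    soup = BeautifulSoup(html, "html.parser")
    results = {"name": [], "price": [], "duration": []}
    for card in soup.select("article.trip-card"):        # assumed card selector
        results["name"].append(card.select_one("h3").get_text(strip=True))
        results["price"].append(card.select_one(".price").get_text(strip=True))
        results["duration"].append(card.select_one(".duration").get_text(strip=True))

    df = pd.DataFrame(results)
    df.to_csv("decathlon_travel.csv", index=False)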

Decathlon Travel csv results overview

Data cleaning

The main issue was to make all datasets as standardized as possible so that they could be merged together. To do so, I used the Dataiku DSS software on a virtual machine.
For each input dataset (blue square with an arrow), I used a preparation recipe (yellow broom) to clean and harmonize the data.

Dataiku data flow representation

The cleaning process included the following actions: removing leading/trailing whitespace, extracting numbers from strings (price, duration), converting some columns to dummy values, adding the company name as a column, removing useless columns, etc.
Illustration below: the 29 cleaning actions performed on the La Balaguère "voyage last" dataset.

Data cleaning in Dataiku (prepare recipe)
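
For readers who prefer code, here is a rough pandas equivalent of some of those cleaning actions. The actual cleaning was configured visually in a Dataiku prepare recipe, and the column names below are hypothetical:

    # Rough pandas equivalent of the prepare recipe (hypothetical column names)
    import pandas as pd

    df = pd.read_json("labalaguere.json")

    df["name"] = df["name"].str.strip()                          # remove leading/trailing whitespace
    df["price_eur"] = df["price"].str.extract(r"(\d+)", expand=False).astype(float)       # extract numbers
    df["duration_days"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    df["guided"] = (df["trip_type"] == "accompagné").astype(int)  # convert a column to a dummy value
    df["company"] = "La Balaguère"                                # add the company name as a column
    df = df.drop(columns=["raw_html"])                            # remove useless columns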

After cleaning each dataset, I could stack them with a dedicated Dataiku recipe (see below). An advantage of using Dataiku DSS is that it makes it very clear which columns are present in each dataset. For instance, the variable "next_departure" has only 3 dots associated with it because Decathlon Travel does not provide that information (I could not scrape it from the website). Fortunately, for most variables, the information is available in every dataset.

Dataiku stack recipe (union of data sources)
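
In pandas terms, this stack recipe is roughly a union (concatenation) of the three cleaned datasets; a sketch with hypothetical file names:

    # Rough equivalent of the Dataiku stack recipe
    import pandas as pd

    labalaguere_clean = pd.read_csv("labalaguere_clean.csv")   # hypothetical cleaned exports
    terdav_clean = pd.read_csv("terdav_clean.csv")
    decathlon_clean = pd.read_csv("decathlon_clean.csv")

    # Columns missing from one source (e.g. next_departure for Decathlon Travel)
    # are simply filled with NaN in the stacked result
    holidays = pd.concat([labalaguere_clean, terdav_clean, decathlon_clean],
                         ignore_index=True, sort=False)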

Quick insights

Thanks to my web scraping actions, I could gather information on 1000+ hiking trips. I did some quick analyses using the Dataiku DSS dashboard tool (note: I am not a fan of it because it does not display legends, but it still gets the point across).

Dataiku dashboard

Conclusion

Possible next steps

  • Scrape more websites dedicated to hiking and trekking
  • Classify the destination information to be able to distinguish premium destinations from standard destinations
  • Schedule web scraping operations

What I learnt in this project

  • Create my own datasets without relying on cookie-cutter data sources (e.g. Kaggle)
  • Learn HTML and CSS structures
  • Understand and apply different web scraping methods (Scrapy, Selenium, Requests, BeautifulSoup)
  • Use the command line and execute code in the terminal of VS Code (instead of using Jupyter Notebooks)
  • Request a website API

See other projects

Want to get in touch?

Drop me a line!
