Q1 2023
This article presents, in a non-technical way, the user problem I wanted to address as well as the solution I created. To dive deeper, I kindly invite you to visit my GitHub repository to read the underlying Python code.
Tech stack: Python | Dataiku
Legal disclaimer: I hereby certify that the data I scraped: i) do not contain any personal information and ii) are publicly available on the internet. The web scraping was performed as a one-shot operation (i.e. no recurring cron jobs) and in small batches so as not to overload the web servers. Finally, the scraped data were not used for any commercial purposes.
Planning one's holidays can sometimes be overwhelming. There are simply so many possibilities and so many different websites to browse for trip ideas.
My objective in this project was to build a database aggregating data from 3 well-known French websites dedicated to hiking and trekking trips: "Terres d'Aventures", "La Balaguère" and "Decathlon Travel". It would become my personal "holiday ideas database".
To do so, I scraped the data from those sites, cleaned them and merged them.
Three web scraping methodologies were applied, each tailored to the corresponding website's architecture.
The La Balaguère website is a very well-structured HTML page. All results are displayed as list items of an unordered list. Information such as name, price and duration is cleanly tagged, so it can easily be accessed with CSS selectors (see illustration below, showing for instance how to get the name of a trip).
Still, the website was tricky for 2 reasons (see illustration below):
I coded the 2 spiders in separate Python scripts: one for the "voyage" results, one for the "voyage last" results. The 2 spiders were then launched from the Anaconda command prompt with the command "scrapy crawl labalaguere -O labalaguere.json" (the -O flag overwrites the output file).
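For reference, here is a minimal sketch of what such a spider could look like; the start URL and CSS selectors are hypothetical placeholders, not the actual ones (the real code is in the GitHub repository):

```python
import scrapy


class LaBalaguereSpider(scrapy.Spider):
    """Sketch of a spider yielding one item per trip in the results list."""

    name = "labalaguere"
    # Hypothetical search results page, for illustration only
    start_urls = ["https://www.labalaguere.com/recherche.html"]

    def parse(self, response):
        # Each trip is assumed to be a list item of the results <ul>
        for trip in response.css("ul.results > li.voyage"):
            yield {
                "name": trip.css("h2.voyage-title::text").get(),
                "price": trip.css("span.price::text").get(),
                "duration": trip.css("span.duration::text").get(),
            }
```

Running the two spiders yielded the following JSON output almost instantaneously: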
I could not use the Scrapy method for the Terdav website because the spider was blocked at the 20-results threshold. Indeed, after 20 results, the website displays a "Voir plus de voyages" (see more trips) button that is coded in JavaScript. Scrapy is not able to click on this button since the latter has no accessible URL (see illustration below).
Thus, I chose another scraping method: Selenium.
Selenium WebDriver is not a web scraping method per se: it is a remote control interface that enables introspection and control of user agents (browsers). In my case, I wanted to create a little robot that opens a new browser window, scrolls down the page, clicks on the "Voir plus de voyages" button every time it sees it, and then scrapes all the results. To do so, I faced 2 main difficulties:
Illustrations of the process: Dealing with the cookies popup window
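Putting it together, a rough sketch of the robot could look like this; the element IDs, class names and waiting strategy below are assumptions for illustration, not the actual Terdav page structure:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.terdav.com")  # hypothetical: the actual results page URL is in the repository
time.sleep(2)

# 1) Dismiss the cookies popup window (button id is an assumption)
try:
    driver.find_element(By.ID, "accept-cookies").click()
except NoSuchElementException:
    pass

# 2) Keep clicking "Voir plus de voyages" until the button no longer appears
while True:
    try:
        button = driver.find_element(
            By.XPATH, "//button[contains(., 'Voir plus de voyages')]"
        )
        driver.execute_script("arguments[0].scrollIntoView();", button)
        button.click()
        time.sleep(2)  # let the extra results load
    except NoSuchElementException:
        break

# 3) Scrape all the results now present on the page (class name is an assumption)
trips = driver.find_elements(By.CSS_SELECTOR, "div.trip-card")
results = [trip.text for trip in trips]
driver.quit()
```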
Thanks to this Selenium routine, I was able to obtain the following results:
The Decathlon Travel website has the same structure as that of Terres d'Aventures: it displays a few search results and then stops, with a "See more results" button. To get all the results, I could have used Scrapy (the button seems to have a URL associated with it) or Selenium. But I wanted to take advantage of a feature unique to Decathlon Travel: the possibility to retrieve results via an API.
Indeed, whenever one clicks on the "See more results" button, Decathlon Travel exposes a request URL in the "Network" tab of the browser's Inspect panel. It can be accessed with the GET method.
I used the software Insomnia to query the API and generate the associated Python client code. Thanks to this code, I could access the HTML response.
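Such a client essentially boils down to a plain GET request. A minimal sketch, where the endpoint URL, parameters and headers are hypothetical placeholders standing in for the ones exposed in the "Network" tab:

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
url = "https://www.decathlontravel.example/api/search"
params = {"page": 2}
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
html = response.text  # the endpoint returns an HTML fragment with the extra results
```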
To interpret and make sense of the HTML response, I used the Python library BeautifulSoup to parse the results. I then stored the results in a dictionary, which I converted to a standard dataframe.
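A minimal parsing sketch, following on from the previous snippet and again using hypothetical class names in place of the real ones:

```python
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Collect one record per trip card (class names are assumptions)
records = {"name": [], "price": [], "duration": []}
for card in soup.select("div.trip-card"):
    records["name"].append(card.select_one("h2.title").get_text(strip=True))
    records["price"].append(card.select_one("span.price").get_text(strip=True))
    records["duration"].append(card.select_one("span.duration").get_text(strip=True))

# Convert the dictionary of lists into a standard dataframe
df = pd.DataFrame(records)
```

This yielded the following results: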
The main issue was to make all datasets as standardized as possible so that they could be merged together. To do so, I used the Dataiku DSS software on a virtual machine.
For each input dataset (blue square with an arrow), I used a preparation recipe (yellow broom) to clean and harmonize the data.
The cleaning process included the following actions: removing leading/trailing whitespace, extracting numbers from strings (price, duration), converting some columns to dummy values, adding the name of the company in a dedicated column, removing useless columns, etc.
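For readers who prefer code, roughly equivalent operations could be expressed in pandas; this is only an illustrative sketch of the same steps, not the Dataiku recipe itself, and the column names are assumptions:

```python
import pandas as pd


def clean(df: pd.DataFrame, company: str) -> pd.DataFrame:
    """Sketch of cleaning steps similar to those of the preparation recipe."""
    df = df.copy()
    # Remove leading / trailing whitespace in text columns
    df["name"] = df["name"].str.strip()
    # Extract numbers from strings such as "1 250 €" or "7 jours"
    df["price"] = df["price"].str.replace(r"[^\d]", "", regex=True).astype(float)
    df["duration"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    # Convert a categorical column to dummy values (hypothetical column)
    df = pd.get_dummies(df, columns=["difficulty"])
    # Add the name of the company in a dedicated column
    df["company"] = company
    # Remove useless columns (hypothetical example)
    df = df.drop(columns=["url_thumbnail"], errors="ignore")
    return df
```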
Illustration below: 29 cleaning actions were performed on the "La Balaguère last voyage" dataset.
After cleaning each dataset, I could stack them with a dedicated Dataiku recipe (see below). An advantage of using Dataiku DSS is that it makes it very clear which columns are present in each dataset. For instance, the variable "next_departure" has only 3 dots associated with it, as Decathlon Travel does not provide the information (I could not scrape it from the website). Fortunately, for most variables, the information is available in each dataset.
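In pandas terms, this stacking step is roughly a concatenation, where a column missing from one source simply ends up as missing values for that source's rows; the dataframe names below are hypothetical:

```python
import pandas as pd

# Hypothetical cleaned datasets, one per source
stacked = pd.concat(
    [labalaguere_df, terdav_df, decathlon_df],
    ignore_index=True,  # rebuild a single 0..n row index
    sort=False,         # keep the existing column order
)
# Columns absent from a source (e.g. "next_departure" for Decathlon Travel)
# are filled with NaN for the corresponding rows.
```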
Thanks to my web scraping actions, I could gather information on 1,000+ hiking trips. I did some quick analyses using the Dataiku DSS dashboard tool (note: I am not a fan of it because it does not display legends, but it still gets the point across).
Drop me a line!