Q1 2023
This article presents, in a non-technical way, the user problem I wanted to address as well as the solution I created. To dive deeper, I kindly invite you to visit my GitHub repository to read the underlying Python code.
Tech stack: Python | Dataiku
Legal disclaimer: I hereby certify that the data I scraped: i) do not contain any personal information and ii) are publicly available on the internet. The web scraping was performed as a one-shot operation (i.e. no recurring cron jobs) and in small batches so as not to overload the web servers. Finally, the scraped data were not used for any commercial purposes.
Planning one's holidays can sometimes be overwhelming. There are simply so many possibilities and so many different websites to browse for trip ideas.
My objective in this project was to build a database aggregating data from 3 well-known French websites dedicated to hiking and trekking trips: "Terres d'Aventures", "La Balaguère" and "Decathlon Travel". It would become my personal "holiday ideas database".
To do so, I scraped the data from those sites, cleaned them and merged them.
Three web scraping methodologies were applied, each tailored to the corresponding website's architecture.
The La Balaguère website is a very well-structured HTML page. All results are displayed as list items of an unordered list. Information such as name, price and duration is cleanly tagged, so it can easily be accessed with CSS selectors (see illustration below, showing for instance how to get the name of a trip).
Still, the website was tricky for 2 reasons (see illustration below):
I coded the 2 spiders in separate Python scripts: one for the "voyage" results, one for the "voyage last" results. The 2 spiders were then launched from the Anaconda command prompt with the command "scrapy crawl labalaguere -O labalaguere.json" (the -O flag overwrites the output file).
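For reference, here is a minimal sketch of what such a spider could look like; the start URL and CSS selectors are hypothetical placeholders, not the actual ones (the real code is in the GitHub repository):

```python
import scrapy


class LaBalaguereSpider(scrapy.Spider):
    """Sketch of a spider yielding one item per trip in the results list."""

    name = "labalaguere"
    # Hypothetical search results page, for illustration only
    start_urls = ["https://www.labalaguere.com/recherche.html"]

    def parse(self, response):
        # Each trip is assumed to be a list item of the results <ul>
        for trip in response.css("ul.results > li.voyage"):
            yield {
                "name": trip.css("h2.voyage-title::text").get(),
                "price": trip.css("span.price::text").get(),
                "duration": trip.css("span.duration::text").get(),
            }
```

Running the two spiders yielded the following JSON output almost instantaneously: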
I could not use the Scrapy method for the Terdav website because the spider was blocked at the 20-results threshold. Indeed, after 20 results, the website displays a "Voir plus de voyages" (see more trips) button that is coded in JavaScript. Scrapy is not able to click on this button since the latter has no accessible URL (see illustration below).
Thus, I chose another scraping method: Selenium.
Selenium WebDriver is not a web scraping method per se: it is a remote control interface that enables introspection and control of user agents (browsers). In my case, I wanted to create a little robot that opens a new browser window, scrolls down the page, clicks on the "Voir plus de voyages" button every time it sees it, and then scrapes all the results. To do so, I faced 2 main difficulties:
Illustrations of the process: Dealing with the cookies popup window
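Putting it together, a rough sketch of the robot could look like this; the element IDs, class names and waiting strategy below are assumptions for illustration, not the actual Terdav page structure:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.terdav.com")  # hypothetical: the actual results page URL is in the repository
time.sleep(2)

# 1) Dismiss the cookies popup window (button id is an assumption)
try:
    driver.find_element(By.ID, "accept-cookies").click()
except NoSuchElementException:
    pass

# 2) Keep clicking "Voir plus de voyages" until the button no longer appears
while True:
    try:
        button = driver.find_element(
            By.XPATH, "//button[contains(., 'Voir plus de voyages')]"
        )
        driver.execute_script("arguments[0].scrollIntoView();", button)
        button.click()
        time.sleep(2)  # let the extra results load
    except NoSuchElementException:
        break

# 3) Scrape all the results now present on the page (class name is an assumption)
trips = driver.find_elements(By.CSS_SELECTOR, "div.trip-card")
results = [trip.text for trip in trips]
driver.quit()
```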
Thanks to this Selenium routine, I was able to obtain the following results:
The Decathlon Travel website has the same structure as that of Terres d'Aventures: it displays a few search results and then stops, with a "See more results" button. To get all the results, I could have used Scrapy (the button seems to have a URL associated with it) or Selenium. But I wanted to take advantage of a feature unique to Decathlon Travel: the possibility to retrieve results via an API.
Indeed, whenever one clicks on the "See more results" button, Decathlon Travel exposes a request URL in the "Network" tab of the browser's Inspect panel. It can be accessed with the GET method.
I used the software Insomnia to query the API and generate the associated Python client code. Thanks to this code, I could access the HTML response.
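Such a client essentially boils down to a plain GET request. A minimal sketch, where the endpoint URL, parameters and headers are hypothetical placeholders standing in for the ones exposed in the "Network" tab:

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
url = "https://www.decathlontravel.example/api/search"
params = {"page": 2}
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
html = response.text  # the endpoint returns an HTML fragment with the extra results
```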
To interpret and make sense of the HTML response, I used the Python library BeautifulSoup to parse the results. I then stored the results in a dictionary, which I converted to a standard dataframe.
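A minimal parsing sketch, following on from the previous snippet and again using hypothetical class names in place of the real ones:

```python
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Collect one record per trip card (class names are assumptions)
records = {"name": [], "price": [], "duration": []}
for card in soup.select("div.trip-card"):
    records["name"].append(card.select_one("h2.title").get_text(strip=True))
    records["price"].append(card.select_one("span.price").get_text(strip=True))
    records["duration"].append(card.select_one("span.duration").get_text(strip=True))

# Convert the dictionary of lists into a standard dataframe
df = pd.DataFrame(records)
```

This yielded the following results: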
The main issue was to make all datasets as standardized as possible so that they could be merged together. To do so, I used the Dataiku DSS software on a virtual machine.
For each input dataset (blue square with an arrow), I used a preparation recipe (yellow broom) to clean and harmonize the data.
The cleaning process included the following actions: removing leading/trailing whitespace, extracting numbers from strings (price, duration), converting some columns to dummy values, adding the name of the company in a dedicated column, removing useless columns, etc.
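For readers who prefer code, roughly equivalent operations could be expressed in pandas; this is only an illustrative sketch of the same steps, not the Dataiku recipe itself, and the column names are assumptions:

```python
import pandas as pd


def clean(df: pd.DataFrame, company: str) -> pd.DataFrame:
    """Sketch of cleaning steps similar to those of the preparation recipe."""
    df = df.copy()
    # Remove leading / trailing whitespace in text columns
    df["name"] = df["name"].str.strip()
    # Extract numbers from strings such as "1 250 €" or "7 jours"
    df["price"] = df["price"].str.replace(r"[^\d]", "", regex=True).astype(float)
    df["duration"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    # Convert a categorical column to dummy values (hypothetical column)
    df = pd.get_dummies(df, columns=["difficulty"])
    # Add the name of the company in a dedicated column
    df["company"] = company
    # Remove useless columns (hypothetical example)
    df = df.drop(columns=["url_thumbnail"], errors="ignore")
    return df
```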
Illustration below: 29 cleaning actions were performed on the "La Balaguère last voyage" dataset.
After cleaning each dataset, I could stack them with a dedicated Dataiku recipe (see below). An advantage of using Dataiku DSS is that it makes it very clear which columns are present in each dataset. For instance, the variable "next_departure" has only 3 dots associated with it, as Decathlon Travel does not provide the information (I could not scrape it from the website). Fortunately, for most variables, the information is available in each dataset.
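In pandas terms, this stacking step is roughly a concatenation, where a column missing from one source simply ends up as missing values for that source's rows; the dataframe names below are hypothetical:

```python
import pandas as pd

# Hypothetical cleaned datasets, one per source
stacked = pd.concat(
    [labalaguere_df, terdav_df, decathlon_df],
    ignore_index=True,  # rebuild a single 0..n row index
    sort=False,         # keep the existing column order
)
# Columns absent from a source (e.g. "next_departure" for Decathlon Travel)
# are filled with NaN for the corresponding rows.
```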
Thanks to my web scraping actions, I could gather information on 1,000+ hiking trips. I did some quick analyses using the Dataiku DSS dashboard tool (note: I am not a fan of it because it does not display legends, but it still gets the point across).
Drop me a line!