It is day 3 of the dashboard week. Today, the challenge was to web scrape a website (HTML) that had data on flight crashes. The website was http://www.planecrashinfo.com/database.htm. It had a history of aviation disasters starting from 1920. We were asked to web scrape the data using Alteryx, build a tableau dashboard and publish the blog.
Choosing the right flight..err.data
This was a major exercise in Alteryx to download the data. The challenge involved designing a workflow that factored in timeouts as well as keeping it simple with regex for extraction of correct columns as focused on data cleaning as not all location entries were correct.
I wanted to focus mainly on the crashes and their locations. So for my dashboard, I needed to web scrape the first 2 pages as shown below.
There are 3 parts to the Alteryx workflow:
Firstly, I used the main part of the workflow to download the 2nd page (table). This meant downloading the links from the first page and using them to download individual pages. This was done as below. Since the website had timeouts, I wrote the output to a temporary file with a randomized number in the filename. This was to prevent rewriting the files with good output with bad ones. Once I was sure, I had all the data, I disabled the container.
Once the rows were downloaded, I extracted them using Regular expressions. Since the downloaded page contained HTML tags, these were removed first. The header information was removed. The information was arranged in rows and columns. The same was done in using text to columns too.
The regular expression was used in 2 ways:
First, obtain the main information such as date, location, aircraft type and casualties. Some fields were shown in the same column.
Second, extract the information using Regex tool and separate country or state and similarly other columns.
There was also data cleaning required as for places in the USA, the country column had the state name. Similarly, the names also had some mismatches. This was updated using a Find Replace tool.
I built the dashboard to crash data in the USA. It seemed that the data gradually increased with certain spikes during wars. For the USA, the major outlier was the 9/11 attacks which had the highest casualty. You can see the dashboard here.
Hope this blog helped you understand some aspects of the challenge and the solution.