Dashboard Week: Day 2
In my last blog, I mentioned that DSAU14 is on dashboard week. Today's task was to build a dashboard on UFO sightings. Engaging with the topic and brainstorming ideas for the dashboard was challenging enough, but the task was made harder because we had to web scrape the data from scratch. In this post, I will walk through how I approached the web scraping and prepared the data, mainly using the Regex tool in Alteryx.
Dataset / Preparation
The data comes from the “National UFO Reporting Center” website at: https://nuforc.org/
Now, the first thing I did was check exactly where I needed to scrape the data from. There were two pages I needed data from:
1) The first page had about 1,000 links to monthly reports, without any report details.
2) Each monthly page had all the reports and their details for that month.
This is my Alteryx workflow, and I will go through a few important steps I took.
I brought the link into Alteryx using the text input and download tools.
I then tokenized the HTML script to get the unique part of the web address for each monthly report and made a full address for each report.
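Outside Alteryx, the tokenize-and-build step can be sketched in Python's `re` module. The link pattern and `id` values below are hypothetical stand-ins for the site's actual markup, which I won't reproduce here:

```python
import re

# Hypothetical snippet of the index page's HTML (the real markup differs).
html = """
<a href="/subndx/?id=e202401">January 2024</a>
<a href="/subndx/?id=e202402">February 2024</a>
"""

BASE = "https://nuforc.org"

# Tokenize: pull out the unique part of each monthly-report link.
tokens = re.findall(r'href="(/subndx/\?id=[^"]+)"', html)

# Build a full address for each monthly report.
urls = [BASE + t for t in tokens]
print(urls)
```

In the Alteryx workflow, the equivalent is the Regex tool in Tokenize mode followed by a Formula tool that prepends the base address.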
Once I had all the addresses for those reports,
I parsed each script using the Regex tool.
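The per-report parsing works the same way: a regex with capture groups pulls each field out of a row of the downloaded HTML. This is a minimal sketch; the row markup and the field names (date, city, state, shape) are assumptions for illustration, not the site's real structure:

```python
import re

# Hypothetical report row from a monthly page (the real markup differs).
row_html = "<tr><td>1/12/2024</td><td>Phoenix</td><td>AZ</td><td>Light</td></tr>"

# One capture group per field, mirroring the Regex tool in Parse mode.
pattern = re.compile(r"<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>")
match = pattern.search(row_html)
date, city, state, shape = match.groups()
print(date, city, state, shape)
```

Because every monthly page shares the same layout, the same pattern can be applied to all of them, which is what makes scraping hundreds of addresses consistent.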
After a few more Alteryx tools, I was able to successfully scrape over 130,000 rows of data from about 1,000 different web addresses in a consistent manner.
I then brought the dataset into Tableau to build a dashboard, which did not take much time. This is probably because I had a specific direction and story I wanted to visualize, focusing on a small but interesting portion of the large dataset. In a future post, I hope to cover how I approached creating a story around a small slice of the data.