Day 2 of Dashboard Week has arrived. Today’s challenge was to scrape horse racing statistics from a selection of sports betting websites. I used this website to scrape data for (hopefully) predicting the outcome of the Melbourne Cup. Compared to Day 1, I managed my time better and got a dashboard out right before the race. Alas, comparing historical win rates against betting odds did not help my punt. In this blog post, I will describe how I parsed the data and give an overview of the final dashboard.


Downloading “Hidden” Data

When I first tried to download the data, it became apparent that some fields on the page were “hidden” and did not come through in the downloaded HTML. Fortunately, Alekh from DSAU2 had a Python script for this. I have attached pictures showing the difference the script made to the downloaded web page HTML:

Notice the significant increase in the number of rows of data (from ~600 rows to >10,000 rows).


Next, I had to parse the relevant details on each racehorse (e.g. age, jockey weight, historical win rates under different conditions). Given the repetitive pattern of the HTML data, I used the RegEx Tool with Tokenize as the output method. The data on each horse can then be split into rows and joined to the other fields on record position. A sample of my workflow is provided below:



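For readers without the RegEx Tool, the same tokenize-then-join idea can be sketched in Python. This is only an illustration, not the workflow I actually built: the HTML fragment, the class names and the `parse_runners` helper are all hypothetical, since the real page markup is not reproduced here.

```python
import re

# Hypothetical HTML fragment standing in for the scraped page.
# The real site's tags and class names are assumptions for illustration.
SAMPLE_HTML = """
<div class="runner"><span class="name">Horse A</span><span class="age">5</span><span class="weight">56.5</span></div>
<div class="runner"><span class="name">Horse B</span><span class="age">6</span><span class="weight">54.0</span></div>
"""


def tokenize(pattern, text):
    """Return every match of the pattern's single capture group, one per row."""
    return re.findall(pattern, text)


def parse_runners(html):
    """Tokenize each field separately, then join the fields on record position."""
    names = tokenize(r'<span class="name">([^<]+)</span>', html)
    ages = tokenize(r'<span class="age">(\d+)</span>', html)
    weights = tokenize(r'<span class="weight">([\d.]+)</span>', html)

    # Joining on position is only safe if every field matched the same number of times.
    if not (len(names) == len(ages) == len(weights)):
        raise ValueError("Field counts differ; a positional join would misalign rows")

    return [
        {"name": name, "age": int(age), "weight_kg": float(weight)}
        for name, age, weight in zip(names, ages, weights)
    ]


runners = parse_runners(SAMPLE_HTML)
for runner in runners:
    print(runner)
```

The length check matters because a join on record position silently misaligns rows if one horse is missing a field, which is the main risk of this approach.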

After some challenging RegEx (regular expression) parsing and data cleaning, I started on my Tableau dashboard. For each horse, I focussed on comparing historical win rates, betting odds (from bookmakers 1 hour before the race) and historical winnings (in prize money). Page 1 of the Viz looks like this:


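One way to put win rates and betting odds on the same scale is to convert the odds into an implied probability. The sketch below assumes decimal odds (the format commonly quoted by Australian bookmakers) and uses made-up numbers; the function names are hypothetical and this is not necessarily the calculation behind the dashboard.

```python
def implied_probability(decimal_odds):
    """Convert decimal odds to the win probability they imply (1 / odds)."""
    if decimal_odds <= 1.0:
        raise ValueError("Decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def win_rate_gap(historical_win_rate, decimal_odds):
    """Historical win rate minus the bookmaker's implied probability.

    A positive gap means the horse has won more often in the past
    than the current odds suggest.
    """
    return historical_win_rate - implied_probability(decimal_odds)


# Illustrative values only, not taken from the scraped data.
example_odds = 5.0       # pays $5 for every $1 staked, stake included
example_win_rate = 0.25  # horse won 25% of its past starts

print(f"Implied probability: {implied_probability(example_odds):.2f}")
print(f"Gap vs win rate: {win_rate_gap(example_win_rate, example_odds):+.2f}")
```

Bear in mind that bookmakers build a margin into their odds, so the implied probabilities across all runners in a race add up to more than 1. The gap is a rough comparison rather than a true edge.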

The Viz will need some tweaking, but I was excited to finish it within the allotted time. Now that Day 2 is done, I am looking forward to Day 3 of Dashboard Week.




Image on top of page by Mathew Schwartz on Unsplash

Author: Alex Chan