Data School training has been intense, but the ultimate challenge is dashboard week, where we have to produce a dashboard and a blog post every day for a week. It is a timeboxing exercise that tests all of the skills we have learned so far. The first challenge is to scrape a website of our choice and build a dashboard from the data. If anything, parsing with regular expressions is my forte, so it should technically be smooth sailing!

One of the main challenges in choosing a dataset is finding a website that makes its data publicly available, so the data I chose this time is the ‘Top 50 fantasy movie’ list from IMDb. As the title implies, there are 50 semi-structured ‘rows’ of HTML that I can parse, with information such as title, year of release, duration, and rating. One of the challenges in parsing this HTML is (as expected) inconsistent patterns such as missing rating values, which means some rows need to be parsed differently, with one or two extra steps. This also means that I cannot parse all the information at once and have to parse the fields carefully one by one to make sure no values go missing. Thankfully, Alteryx is very good at this, so the parsing went relatively smoothly.
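
For anyone who wants to try the same idea outside Alteryx, here is a minimal Python sketch of field-by-field parsing. It is not the Alteryx workflow behind this dashboard: the HTML snippet, the class names, and the field names are all made up for illustration and do not reflect IMDb’s real markup. The point is only that each field gets its own pattern, so a row with a missing rating still yields a record instead of breaking the whole parse.

```python
import re

# Hypothetical, simplified markup (NOT IMDb's real HTML), with the same
# kind of inconsistency: the second row has no rating element.
SAMPLE_HTML = """
<div class="item"><h3><a href="/title/a">Movie A</a> <span class="year">(2001)</span></h3>
<span class="runtime">178 min</span><strong class="rating">8.8</strong></div>
<div class="item"><h3><a href="/title/b">Movie B</a> <span class="year">(2020)</span></h3>
<span class="runtime">115 min</span></div>
"""

# One pattern to split the page into rows, then one pattern per field.
ROW = re.compile(r'<div class="item">(.*?)</div>', re.DOTALL)
FIELDS = {
    "title": re.compile(r'<a href="[^"]*">([^<]+)</a>'),
    "year": re.compile(r'<span class="year">\((\d{4})\)</span>'),
    "runtime_min": re.compile(r'<span class="runtime">(\d+) min</span>'),
    "rating": re.compile(r'<strong class="rating">([\d.]+)</strong>'),
}


def parse_rows(html):
    """Return one dict per row; a field that is absent becomes None."""
    rows = []
    for block in ROW.findall(html):
        row = {}
        for name, pattern in FIELDS.items():
            match = pattern.search(block)
            row[name] = match.group(1) if match else None
        rows.append(row)
    return rows


movies = parse_rows(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

Parsing each field separately costs a few more steps than one big pattern, but a single all-in-one regex would silently skip any row where one field is missing.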

Now to create the dashboard… This is when I realised, “Oh, I have no idea what kind of dashboard and insights I want to show”. Yikes! I should have thought about this before I started scraping! Lesson learned! Thankfully, my colleague Ramya inspired me to find additional data on the movies’ revenue, and with just this one additional metric I could start finding meaningful insights. So I put myself in a movie director’s shoes (a fantasy movie director, to be specific) and started to think about what information I would need to create a successful movie. The questions are:

– Is it a good idea overall to make a fantasy movie? (How have fantasy movies fared over time in terms of revenue and rating?)
– Which rating is a better predictor of revenue?
– Which subgenre is the most popular?
– Who are the top 5 movie stars that I can hire?
– What are the top 5 movies that I can learn from?

And so I have produced this dashboard!

Some of the insights that I have found are:

– Avatar has the highest revenue at $2.9B, about 1.5 times that of the second-highest-grossing movie (Spider-Man: No Way Home, $1.9B). However, it does not have the best rating overall (83 Metascore, 8 IMDb).
– The best-rated movie is (unsurprisingly) The Lord of the Rings! With a 9 IMDb rating and a 94 Metascore, pretty impressive, huh? But well deserved. Note, however, that it only netted $1.1B in revenue, just over a third of Avatar’s!

– The worst-performing movies (in both rating and revenue) are The Princess Bride (1987) and Mulan (2020). Perhaps Mulan was impacted by COVID-19?

– Another interesting finding is that, on average, fantasy is not the best-rated genre (even though this is supposed to be a top 50 fantasy movie list). The musical, animation, and adventure genres scored better than the fantasy genre itself!

– IMDb ratings are higher overall than Metascore ratings, and they correlate more strongly with revenue (0.34, vs 0.29 for Metascore).
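
For the curious, the correlation figures in the last insight are the kind of number a Pearson correlation gives. Below is a minimal Python sketch of that calculation, assuming Pearson is the measure in question. The ratings and revenues in it are made-up placeholders, not the scraped data, so it will not reproduce the 0.34 and 0.29 above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations, and sum of squared deviations for each variable.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT the real movie data).
imdb = [7.0, 7.5, 8.0, 8.5, 9.0]
metascore = [60, 72, 65, 90, 94]
revenue_musd = [300, 900, 500, 2900, 1100]

r_imdb = pearson(imdb, revenue_musd)
r_meta = pearson(metascore, revenue_musd)
print(f"IMDb vs revenue:      {r_imdb:.2f}")
print(f"Metascore vs revenue: {r_meta:.2f}")
```

Pearson correlation does not change when a variable is rescaled, so the two ratings can be compared this way even though IMDb is out of 10 and Metascore is out of 100.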

 

Alrighty, that wraps up this blog post about my first day of dashboard week! I definitely did not timebox well enough (it’s 6:40 AM and I am writing this blog), but I will do better! Thanks for reading 😊

 

~ Inspired by Donna Coles, Ramya Thela, and Juliet Ruan

The Data School
Author: The Data School