Today marks the beginning of Dashboard Week. Our inaugural challenge: web scraping followed by dashboard creation. But what exactly is web scraping? Web scraping is like having a robot that pulls information from websites: it visits web pages, extracts data such as text or images, and hands it to you in a usable format. People use web scraping to collect all sorts of information from the internet, such as prices, news, or weather data. It’s handy for research, finding deals, and making apps smarter. Just remember, when you scrape the web, you need to play by the rules and respect each website’s terms and conditions about data use.
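For anyone who wants to see the idea in code, here is a minimal sketch in Python using the requests library. The URL is just a placeholder, and real scrapers usually want a proper HTML parser rather than a quick regex, but the fetch-then-extract pattern is the whole trick:

```python
import re
import requests

# Fetch a page (example.com is a placeholder; swap in a site whose
# terms of use allow scraping).
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# Pull one piece of data (the page title) out of the raw HTML.
match = re.search(r"<title>(.*?)</title>", response.text, re.IGNORECASE | re.DOTALL)
print(match.group(1).strip() if match else "no title found")
```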


Our first task was to identify an intriguing topic. My initial idea was the Melbourne property market, with plans to scrape data from the Real Estate or Domain websites in an attempt to uncover the suburbs with the highest growth potential. However, I soon realised that scraping data from these websites was quite challenging, as both required API access, which is available exclusively for business purposes. Exploring alternative websites also proved futile, as most demanded payment for access to their content. This was a dead end, forcing me to think about a different topic.


After thorough consideration, I shifted my focus to analysing childcare centres in Melbourne, a topic of interest to many parents. I began exploring childcare websites such as “Care for Kids” and “Kindicare”, which appeared to contain a wealth of relevant information without any need for payment. My excitement grew. Yet, after inspecting these websites, I discovered that scraping all the necessary data would be both time-consuming and challenging. Given our limited timeframe of just one day, I had to select only the most vital information for my dashboard. Ultimately, I settled on “Care for Kids” for web scraping, but it took half a day to make this decision. This is what the website looks like:

Link: Care for Kids


With my plan in place, I opened Alteryx and used a Text Input tool to enter the URL. I then employed a Download tool to retrieve all the website content. After that, I used a RegEx tool to parse out the required information, a task that proved quite daunting for me. It had been a while since I’d worked with regex, and parsing a website of this nature presented significant challenges. Fortunately, I sought assistance from our team member, the “Regex Master” Edward, who was more than willing to help, and I learned a great deal from him. Parsing was particularly challenging because the website used three different layouts for its childcare centre listings, each requiring a distinct regex pattern. Moreover, some content shared identical formats, making it hard to differentiate. I dedicated the entire afternoon to crafting regex patterns to extract all the necessary data. I initially selected only one suburb as a trial run. The trial went pretty well, and I decided to apply the workflow to all Melbourne suburbs.
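Outside Alteryx, the same Download-then-RegEx logic is easy to sketch in Python. To be clear, the URL, class names, and patterns below are invented for illustration (the real Care for Kids markup is different); the point is the “one pattern per layout, try each in turn” approach:

```python
import re
import requests

# Placeholder listing URL -- the real suburb page would go here.
url = "https://example.com/child-care/richmond/3121"
html = requests.get(url, timeout=10).text

# One hypothetical pattern per page layout; named groups keep the
# extracted fields consistent across layouts.
patterns = [
    r'<div class="centre-card">.*?<h3>(?P<name>.*?)</h3>.*?\$(?P<fee>[\d.]+)',
    r'<li class="listing">.*?<a[^>]*>(?P<name>.*?)</a>.*?\$(?P<fee>[\d.]+)',
    r'<article class="centre">.*?<span class="title">(?P<name>.*?)</span>.*?\$(?P<fee>[\d.]+)',
]

centres = []
for pattern in patterns:
    for m in re.finditer(pattern, html, re.DOTALL):
        centres.append({"name": m.group("name").strip(), "fee": float(m.group("fee"))})

print(centres)
```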


A complete list of suburb names and postcodes was required to compile a comprehensive list of childcare centres in Melbourne. I scraped Wikipedia for detailed suburb information, formatted it into “/suburb/postcode”, and then appended it to my previous URL. This is what the workflow looks like:
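In code, that URL-building step amounts to string formatting. A small sketch, with the base URL as a placeholder and a handful of real Melbourne suburb/postcode pairs:

```python
# Suburb/postcode pairs as scraped from Wikipedia (a small sample).
suburbs = [("Richmond", "3121"), ("Carlton", "3053"), ("Fitzroy", "3065")]

# Placeholder base URL -- the real listing URL prefix would go here.
base_url = "https://example.com/child-care"

# Format each pair into "/suburb/postcode" and append it to the base URL.
urls = [f"{base_url}/{suburb.lower().replace(' ', '-')}/{postcode}"
        for suburb, postcode in suburbs]

for url in urls:
    print(url)
```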

After getting all the URLs, I started using them to download the childcare service data for every suburb. However, the pages turned out to be very complicated, and it took the whole afternoon to parse out the data I needed. The regex was far more complicated than I expected, which taught me a good lesson about planning my time. This is what the web scraping workflow looks like:
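The batch step itself is just a loop over those URLs, reusing the per-page extraction from earlier. Again a sketch under the same placeholder assumptions: a failed request skips that suburb rather than stopping the run, and a short pause between requests keeps the scraping polite:

```python
import re
import time
import requests

# Suburb page URLs built in the previous sketch (placeholders).
urls = [
    "https://example.com/child-care/richmond/3121",
    "https://example.com/child-care/carlton/3053",
    "https://example.com/child-care/fitzroy/3065",
]

pattern = r'<h3 class="centre-name">(.*?)</h3>'  # illustrative pattern only

all_names = []
for url in urls:
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException as err:
        print(f"skipped {url}: {err}")  # one bad suburb shouldn't stop the run
        continue
    all_names.extend(m.strip() for m in re.findall(pattern, html, re.DOTALL))
    time.sleep(1)  # pause between requests so the site isn't hammered

print(f"collected {len(all_names)} centre names")
```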

Finally, I got to start building my dashboard. My aim was to build an exploratory dashboard that helps parents find the most affordable childcare centres with the highest ratings. I also added some performance measures, such as the NQS rating and the NQS Health & Safety rating, to help parents make decisions. This is what my final dashboard looks like:

Link: Childcare service in Melbourne


I was quite happy with the final product, although I think that with better time management I might have been able to make the dashboard even better. That said, I found Dashboard Week very interesting and challenging, and I’m looking forward to making better progress this week.

Author: The Data School