I’m back for another blog on dashboard week day 3. I’ve been pumping out blogs and vizzes like a factory these days. It’s hard to keep up the quality. Anyway, here is the journey.
The challenge overview
For dashboard week day 3, our mission was to connect to ArcGIS and create a dashboard from one of its datasets. Sounds easy enough, right? Except it wasn’t. The government’s website couldn’t have been any slower, and most of the datasets we tried either returned an error page or didn’t have much in them. The only thing that worked and had a decent amount of data was the land zoning datasets, which contain several tables such as building height, lot area, and heritage sites. By the time we got our hands on the data, it was already noon. I (randomly) picked building height and tried to weave a story out of it. The dataset covers the maximum building height allowed in an area. I immediately thought of housing prices. Would housing prices be higher in areas where the maximum building height is higher?
I quickly searched for a dataset on housing prices and found one from here. The join didn’t work too well, but there were not many records, so I made some adjustments to join them up. Everything worked out fine…except there was no relationship. Not for metro, not for rural areas. Okay, calm down, let’s add some other datasets, like population, household income, distance to Sydney CBD etc. By then, it was already 4pm and I was still prepping the data… Enough rambling, let’s talk about the technical process.
The technical process
Step 1: Data prepping
As I wrote above, my starting point was the building height (which contains the shapefiles) but the final story I ended up with was on housing prices and rent. As a result, the workflow is a bit clunky. I really didn’t need the ArcGIS dataset on building height because there was no correlation and I could have used the LGA shapefile instead of the shapefile from this dataset. However, this was part of the challenge so I tried to incorporate data from this dataset. As a result, I encountered 2 challenges.
(1) Different level of granularity
The datasets on housing prices and rents are at the level of suburbs (LGA). However, the ArcGIS dataset is at a lower level, the block level. As a result, I needed to aggregate the ArcGIS dataset for measures such as building height, but I needed the spatial objects as they are (obviously, you can’t sum or average spatial objects). Because of this problem, I ended up with 2 datasets, one just for the shapefile (at the block level) and one for aggregating the building height. All this effort I went through and there was no correlation!!!
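I did this split in Alteryx, but the idea is easy to sketch in plain Python. The snippet below is only an illustration: the rows, the column layout, the LGA codes and the `split_by_granularity` helper are all made up for the example, and the geometry strings are stand-ins for real spatial objects.

```python
from collections import defaultdict
from statistics import mean

# Made-up block-level rows: (block_id, lga_code, max_height_m, geometry).
# The geometry strings are placeholders for real spatial objects.
blocks = [
    ("b1", "LGA_A", 9.0, "polygon-b1"),
    ("b2", "LGA_A", 12.0, "polygon-b2"),
    ("b3", "LGA_B", 30.0, "polygon-b3"),
]


def split_by_granularity(rows):
    """Return (lga_measures, block_shapes) from block-level rows.

    Measures are averaged up to the LGA level; geometries are kept
    untouched at the block level because they cannot be averaged.
    """
    heights = defaultdict(list)
    shapes = []
    for block_id, lga_code, height, geometry in rows:
        heights[lga_code].append(height)
        shapes.append(
            {"block_id": block_id, "lga_code": lga_code, "geometry": geometry}
        )
    lga_measures = {lga: mean(values) for lga, values in heights.items()}
    return lga_measures, shapes


lga_measures, block_shapes = split_by_granularity(blocks)
print(lga_measures)       # one average height per LGA
print(len(block_shapes))  # still one shape per block
```

The `lga_code` kept on every shape is what lets the two outputs be linked again later, for example when the map needs both the block outlines and the LGA-level numbers.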
(2) Missing LGA codes
Some of the datasets on suburb attributes do not have an LGA code. As a result, I needed to do some fuzzy matching, which takes more time than usual. Furthermore, there were about 5 tables. After some cleaning, I tried to join them in Tableau, but that caused errors when I tried to build the scatterplot, so I had to go back into Alteryx to join them together and only join 2 tables in Tableau. I won’t show the workflow, it’s just lots of joining…
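For anyone curious what fuzzy matching on names looks like outside Alteryx, here is a rough sketch using Python's built-in `difflib`. This is not what my workflow did internally. The LGA codes, the misspelt names, the `normalise` and `match_names` helpers and the 0.8 cutoff are all assumptions picked for the example.

```python
from difflib import get_close_matches


def normalise(name):
    """Lower-case, turn hyphens into spaces and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def match_names(names, reference, cutoff=0.8):
    """Map each messy name to the code of its closest reference name.

    `reference` maps clean names to codes. Names with no candidate
    above `cutoff` map to None so they can be fixed by hand.
    """
    lookup = {normalise(ref): code for ref, code in reference.items()}
    matched = {}
    for name in names:
        hits = get_close_matches(normalise(name), list(lookup), n=1, cutoff=cutoff)
        matched[name] = lookup[hits[0]] if hits else None
    return matched


# Made-up codes; the misspellings are invented for illustration.
reference = {"Blacktown": "L001", "Ku-ring-gai": "L002", "Wollongong": "L003"}
messy = ["Blacktown ", "Kuringgai", "Woolongong", "Atlantis"]

for name, code in match_names(messy, reference).items():
    print(repr(name), "->", code)
```

Anything that comes back as `None` still needs a manual look, which is a big part of why this step takes longer than a normal join.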
Step 2: Investigation & Visualisation
Once I got the data, the first 2 charts I built were a map and a scatterplot. The map shows that rent and housing prices are higher in Sydney metro areas than in rural areas (obviously). I added some parameters to switch the axes in the scatterplot and grouped the suburbs into rural and urban, so I could see how they differ. At this stage, I kind of had an insight and a story. All that was left was pulling them together on a dashboard and adding some user interactivity.
Overall, a city dweller pays twice as much rent as a rural dweller while earning only 1.3 times more, and city dwellers are slightly younger than rural ones. So you can see that the living standard is probably higher in rural areas than in Sydney metro. The two factors influencing housing prices/rents the most are population density and household income. In rural areas, distance to Sydney CBD didn’t matter as much as in Sydney (makes sense because they probably have their own shopping areas; for the next iteration, I should probably calculate the distance to those instead). Building height and the suburb’s median age didn’t matter. I also included a suburb profile so viewers can click around different suburbs and have fun seeing how they differ.
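I judged these relationships by eye from the scatterplot, but the same comparison could be checked numerically with a correlation per group. The sketch below is a minimal version of that idea. The numbers are invented purely so the code runs (they are not my data or my results), and the `pearson` and `correlations_by_group` helpers and the field names are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlations_by_group(rows, target, factors, group_key="region"):
    """Correlate `target` with each factor, separately for each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    return {
        group: {
            factor: pearson([r[factor] for r in members],
                            [r[target] for r in members])
            for factor in factors
        }
        for group, members in groups.items()
    }


# Invented toy rows, NOT the real dataset.
rows = [
    {"region": "metro", "rent": 600, "income": 2000, "height": 9},
    {"region": "metro", "rent": 700, "income": 2400, "height": 30},
    {"region": "metro", "rent": 800, "income": 2800, "height": 9},
    {"region": "rural", "rent": 300, "income": 1500, "height": 9},
    {"region": "rural", "rent": 350, "income": 1700, "height": 12},
    {"region": "rural", "rent": 400, "income": 1900, "height": 9},
]

result = correlations_by_group(rows, "rent", ["income", "height"])
for region, scores in result.items():
    print(region, {factor: round(value, 2) for factor, value in scores.items()})
```

A value near 1 or -1 means the factor moves with rent in that group, and a value near 0 means it doesn't, which is the numeric version of "no relationship" on the scatterplot.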