Weather data can be a great way to supplement some data sources. It allows you to view datasets from another angle and determine if the weather has an effect on your data. Our cohort had a client project where we had to do exactly this, and I’ve decided to write a blog on it.

The Bureau of Meteorology

The Bureau of Meteorology is an executive agency of the Australian Government responsible for providing weather services to Australia. It was established in 1906 and since then has collected a lot of data. Publicly available climate data is limited to rainfall, temperature and solar exposure. For the purposes of this blog, we will look at rainfall, however a similar process can be used for the other metrics.

The Webpages

There are 2 main web pages that we will need to do this. The first is the stations webpage, which we can use to get a list of stations for any location in Australia and for any type of weather data. For the purpose of this blog, I chose monthly rainfall in Victoria. Once we have all the stations, we just have to find a URL that gives us the data and we can swap out the station ID for each station. The page that I used can be found at http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=139&p_display_type=dataFile&p_stn_num=85278. Now all we have to do is change the last 5 digits in the URL with other station ids found in the stations text file.

The Workflow

The station text file can be parsed with a series of text to columns. After parsing the data into a readable table format, the first 3 rows and last 6 rows will not hold stations so we can get rid of those. Now all we need to do is write a formula to create the url:

“http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=139&p_display_type=dataFile&p_stn_num=” + [Station ID]

The dataset should now look like this:

Finally we use the download tool to request the HTML from each of those pages. I chose to use python to parse the code from there. If you don’t know coding it’s possible to do it with regex. I was just on a time crunch and parsing HTML is much more comfortable in python for me. Or you could just use my code (be sure that your column names are the same as mine above though):
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['pandas','numpy','bs4'])
from ayx import Alteryx
from bs4 import BeautifulSoup
import pandas as pd
data = Alteryx.read("#1")
months = []
years = []
readings = []
station = []
station_id=[]
lat = []
lon = []
for d in range(len(data.DownloadData)):
soup = BeautifulSoup(data.DownloadData[d])
table = soup.find("table")
if(not table):
continue
rows = table.findAll("tr")
year = rows[-1].find("th").text
rowIndex = -1
while(int(year) >=2015 or -rowIndex > len(rows)):
tds = rows[rowIndex].findAll('td')
for i in range(len(tds)):
months.append(i)
years.append(year)
readings.append(tds[i].text)
station_id.append(data["Station ID"][d])
station.append(data["Station"][d])
lat.append(data["Lat"][d])
lon.append(data["Lon"][d])
rowIndex-=1
year = rows[rowIndex].find("th").text
print(year)
if year == "Year":
rowIndex-=2
year = rows[rowIndex].find("th").text
print(year)
elif year == "Graph":
year = 0
df = pd.DataFrame({
"Month":months,
"Year":years,
"Reading":readings,
"Station":station,
"Station ID": station_id,
"Lat":lat,
"Lon":lon,
})
Alteryx.write(df,1)

Just copy and paste the above code block into an Alteryx Python tool and you should get the following output.