In this blog post, I’ll be taking you through the steps that I took to parse data from a series of PDF files containing records from my public transport trips in Sydney. A second post that I will publish next week will be showing a Tableau visualisation of that data in Tableau.

Unfortunately, the travel records at the Opal website are only available in PDF format which, it turns out, is quite tricky to parse.
I looked for solutions on the community website since there is no easy way to input a PDF file to alteryx. I ended up finding a lot of interesting workarounds, and an excellent PDF Input tool created and shared by our good pals over at the Information Lab!

At the same time, I remembered hearing about Tableau being able to recognise PDF files as input. So, I decided to give it a go in Tableau Prep. As you can see, there’s other information, like headers, footers and images. Notice also how in the Date and Details columns the information splits into two distinct rows (yep, that is going to be an issue when parsing this data!)

 

Parsing PDFs in Tableau Prep

 

Unexpectedly, Tableau Prep handled the PDF quite well. It was able to quickly recognise all the tables contained in each of the files. With some basic cleaning steps, followed by a Union, I was able to extract a CSV file per month with all the data that I needed.

Cleaning Data in Alteryx

 

Now I could import all the CSV files to Alteryx and get the data in a shape that would allow me to bring it to Tableau and create a viz. Here’s what the output of the Tableau Prep workflow looked like:

 

On alteryx, I still had to resolve the issue of records with information divided into multiple rows.
I started by populating the null Trip IDs to be able to group the information of the same record together, using a Multi-Row formula. I could dispense with the day of the week abbreviation. However, I needed to keep both the day and all the description of the trip (start and end point). After that, I changed the Trip ID variable to an int type. That generated null values on rows where had string info that I wanted to get rid of. After that, I summarised the data by grouping it according to the Trip ID.

Now, I had only one final issue to solve – the missing trip mode. If you look again at the PDF file generated by the Opal website, there’s a Mode column, with the logos for Train, Bus and Ferry, for each trip. It just happens that when pumping the file into Tableau Prep that information simply vanished.

Rather than looking for another solution to parse the files again, I decided to create a new field using the Formula tool. It happens that it’s actually easy to tell each trip’s mode only by looking at the description. Ferry trips have their start and end point denoted by the words ‘Ferry Wharf’.
Light rail trips are easily spotted by the characters ‘LR’ next to the start and end stations.

As for Bus trips, there’s a whole lot of different terminologies for the bus stops around Sydney. However, the majority of them will have either one of the following words ‘Stand’ (or, in some cases, an abbreviation). ‘Before’, ‘after’ or ‘opposite’ followed by the name of the streets where they’re located. With that, I was able to pick up all the Bus trips. All the others were train trips.

Check below my alteryx workflow and a preview of the cleaned data.

Ricardo Santos
Author: Ricardo Santos