We’ve learnt web scraping with Alteryx this month, and I found this skill very useful to get data on a website because data doesn’t always come in table. And recently when I was searching for my next brunch destination, I saw Good Food Guide 2022: Top 50 Sydney cafes. This page is an article with multiple paragraphs. I think it’ll be useful if I can put this information in a table so I can keep track on which ones I’ve visited. Let’s see how this can be done with Alteryx.
The two main tools used for web scraping in Alteryx will be Download Tool and RegEx Tool. But before that, first we use a Text Input Tool to get the URL. Then connect it with a Download Tool to download the data on the page. You can use a Select Tool here to only keep the filed with download data, and remove URL and headers which we don’t need in later steps.
From here is where the fun begins. To use RegEx Tool to parse data, first we need to inspect the HTML of the website by right-click on the page and select “Inspect”.
There’re actually different ways you can parse the data using RegEx. Here I parse cafe name, cafe type, and address/ phone number/ website separately, and then join them together.
Step 1. Parse Cafe Name
For cafe name, we can find out that they are within <h3> and </h3>, so our regular expression can be <h3>(.*?)</h3>. In RegEx configuration, select tokenize output method and split output to rows. This way we can parse out all the cafe names in different rows.
The result looks good, but there’re some cafe names that are embedded in an URL. We need to parse these with another RegEx tool. With regular expression >(.+?)</a> and choosing parse output method, we should be able to get all the cafe names. Though they are in different columns at the moment, we can use a Formula Tool with IF and IsNull functions to combine them together.
Step 2. Parse Cafe Type
Similar logic applies to parse cafe type. Cafe types are between <strong> and </strong>, therefore we can use regular expression <strong>(.+?)</strong>. Tokenize output method and split to rows are also used here.
Step 3. Parse Address/ Phone Number/ Website URL
Next we need to parse address, phone number, and cafe website URL. In HTML these information are all between <em> and </em>, so regular expression <em>(.+?)</em> can be used to parse out all of them first.
We can see that there’re different combinations of information in this field. Some of them got address, phone number, and website URL. Some of them got address and either phone number or website URL. Some of them have address only. I parse them separately according to their combination.
Step 3-1: Use regular expression (.+?),\s(\d+\s\d+\s\d+) to parse those rows with address and phone number.
Step 3-2: Then another RegEx (.+?),\s< to get the address from those rows with address and website URL but without phone number.
Step 3-3: Use the last RegEx >(.+?)</a> to parse out website URL from the output of Step 3.
Now we’ve got all the needed information parsed properly. The final step in to clean up the tables and join all the outputs from previous steps with a Join Multiple Tool. The final result looks like this, which is exactly what I want:
Below is the screenshot of my Alteryx workflow. You might find a different way of parsing, but the main idea is to find out the pattern of data and use RegEx to parse them.
Now I can move onto my cafe hunting journey!