A useful technique to extract data from websites and transform it into a structured format is known as “Web scraping”. With some ReGex (Regular Expression) knowledge, web scraping can be done conveniently in Alteryx. Note, however, that quite often, websites do not allow their data to be scraped, but we can very quickly find that out in Alteryx. Let’s dive into the process of how to web scrape in Alteryx!

Step 1: Fetching the data.

Drag the “Text Input” tool into the canvas, then click on the header and type ‘url’, and enter the webpage address in the row.

Drag the “Download” tool and connect it with the “Text Input”. For safety reason, please tick the “Encode URL text”!

If you get “HTTP/1.1 200 OK” in your “DownloadHeaders” column, your data is successfully downloaded. If not, then most likely the website does not allow you to scrape it.

If you are successful, the next step involves removing the unwanted tabs and linebreaks from the data before it is ready to be parsed. Drag the “Data Cleansing” tool and connect it with the “Download” tool. Select “DownloadData” in the field to cleanse, and on the Remove Unwanted Characters section, select “Leading and Trailing Whitespaces” and “Tabs, Line Breaks, and Duplicate Whitespace”.

 

Step 2: Parsing the data.

Next, connect the “Regex” tool with the “Data cleansing” tool, find the pattern surrounding the table of interest and parse them. You can use Regex101.com to assist you in parsing. To do so, connect a “Browse” tool to the “Download” tool end point, run the workflow, select the “Browse” tool. On the results section, right click on the “Download Data” and copy the selected cells without headers.

You can paste these codes into Regex101.com and find the parsing Regex there to return the table/s of interest. After finding the Regex code, return to the “Regex” tool above, select “DownloadData” as the Column to Parse, enter your Regex code there (for instance, I have “<table.*>.*?</table>” Regex to capture all the tables in this website), and select “Tokenize” split into rows to generate all the tables in that website.

You can select the table of interest by connecting the “Regex” tool with “Select Record” tool, then input the record row of interest. For instance, I input 1 in the “Enter the numeric ranges of records to return.

Now use the “Regex” tool again to split the table body from the headers (if they have any). In this case, select “Parse” in the output method to create 2 columns, one containing the headers, and one containing the table body.

Next, using two “Select” tools, we can split the flow into two. One so we can parse the header, and one to parse the table body. We will combine them together again later.

For the header, connect the “Regex” tool with the “Select” tool that connects with the header output, and parse them.

The Table body needs to first be split into rows of data from one line of html code. Just like above, connect the “regex” tool with the “Select” tool that connects with the body output, and tokenize them. This time, select “Tokenize” in the output method and select “Split to rows”.

Afterwards, connect another “Regex” tool to finally parse the table body.

Ensure that both the “parsed” body and header have the same number of columns and field names (you can use “Select” tool to achieve this), and finally connect them with the “Union” tool to combine them together again. You can select “Auto config by name” in the “Union” tool config as both datasets have the same field names (e.g., RegExOut1, RegExOut2, etc).

The final step is to convert the first row into header using “Dynamic Rename” tool, and select “Take field names from first row of data” in the Rename mode, and tick all field names underneath. And now your data is ready for analysis!

Web scraping tables in Alteryx allows data professionals to extract structured data from websites effortlessly. By following the step-by-step guide in this blog post, you can utilize Alteryx’s web scraping tool and integrate web data into your analytics workflows.

 

The Data School
Author: The Data School