This article shows how the efficiency of a regex can influence the efficiency of an Alteryx workflow.

The following HTML code, extracted from the website (https://www.officialcharts.com/chart-news/the-official-top-40-biggest-songs-of-2021__34857/), will be used for demonstration.

The challenge is to extract the values of the following table from the website and rebuild the table in Alteryx.

After using the Download tool to load all the HTML code into Alteryx, we can analyse the code structure and target the part of the HTML related to the table we want to create, as shown in the following picture.

The HTML code of the table header differs from that of the table values, which makes it difficult to extract the information we need with a single regex.

Solution 1 overview

The first method we can use to tackle this challenge is to split the header row and the value rows using the Filter tool and deal with them independently.

The configuration of the Filter tool

The regex to extract the information in the header.

The regex to extract the information in the value part.

After extracting the values we want, we can use the Union tool to concatenate the header and the values.
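The split, extract, and union steps above can be sketched outside Alteryx in Python. This is only a sketch under assumptions: the sample rows, the column names, the placeholder values, and both patterns are hypothetical, since the real markup and the regexes used in the workflow appear only in the screenshots. It assumes header cells are wrapped in "strong" tags and value cells are not.

```python
import re

# Hypothetical rows, one per record; the real page markup may differ.
rows = [
    "<tr><td><strong>POS</strong></td><td><strong>TITLE</strong></td><td><strong>ARTIST</strong></td></tr>",
    "<tr><td>1</td><td>Song A</td><td>Artist A</td></tr>",
    "<tr><td>2</td><td>Song B</td><td>Artist B</td></tr>",
]

# Assumed patterns: one for header cells, one for value cells.
HEADER_RE = re.compile(r"<td><strong>(.*?)</strong></td>")
VALUE_RE = re.compile(r"<td>(.*?)</td>")


def split_rows(rows):
    """Play the role of the Filter tool: rows containing <strong> are headers."""
    header_rows = [r for r in rows if "<strong>" in r]
    value_rows = [r for r in rows if "<strong>" not in r]
    return header_rows, value_rows


def solution_1(rows):
    """Parse each stream with its own regex, then stack them like the Union tool."""
    header_rows, value_rows = split_rows(rows)
    header = [HEADER_RE.findall(r) for r in header_rows]
    values = [VALUE_RE.findall(r) for r in value_rows]
    return header + values


table = solution_1(rows)
for record in table:
    print(record)
```

Two simple patterns are easy to write, but the approach needs three extra steps (one filter, two parsers, one union) to rebuild a single table.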

Solution 2 overview

The second method uses a single regex to extract the information we need.

The regex and its configuration

However, the strings extracted from the header are not as clean as we want. We therefore need an extra data-cleaning step, such as using the Multi-Field Formula tool to remove the "strong" tags around the header text.
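A rough Python equivalent of this second approach is shown here. The pattern and the sample rows are assumptions for illustration, not the regex from the screenshot. One generic cell pattern captures everything between the cell tags, so header cells keep their "strong" tags and need a separate clean-up pass.

```python
import re

# Hypothetical rows; the real page markup may differ.
rows = [
    "<tr><td><strong>POS</strong></td><td><strong>TITLE</strong></td></tr>",
    "<tr><td>1</td><td>Song A</td></tr>",
]

# Assumed single pattern: capture whatever sits inside each cell.
CELL_RE = re.compile(r"<td>(.*?)</td>")
# Clean-up pattern standing in for the Multi-Field Formula step.
STRONG_TAG_RE = re.compile(r"</?strong>")


def extract_raw(rows):
    """Apply the single regex to every row; header cells stay dirty."""
    return [CELL_RE.findall(r) for r in rows]


def clean(table):
    """Strip <strong> and </strong> from every cell."""
    return [[STRONG_TAG_RE.sub("", cell) for cell in record] for record in table]


raw = extract_raw(rows)
cleaned = clean(raw)
print(raw[0])      # header cells still wrapped in <strong> tags
print(cleaned[0])  # header cells after the clean-up pass
```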

Solution 3 overview

The third method simply uses one efficient regex to precisely extract all the information we need.

The regex and its configuration
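One way such a regex could look is sketched in Python below. The pattern is an assumption, not necessarily the one in the screenshot. It makes the "strong" tags optional with non-capturing groups, so the same capture group returns clean text for both header and value cells. The sample rows are hypothetical.

```python
import re

# Hypothetical rows; the real page markup may differ.
rows = [
    "<tr><td><strong>POS</strong></td><td><strong>TITLE</strong></td></tr>",
    "<tr><td>1</td><td>Song A</td></tr>",
]

# (?:<strong>)? and (?:</strong>)? optionally consume the tags without
# capturing them; the lazy (.*?) keeps only the cell text.
CELL_RE = re.compile(r"<td>(?:<strong>)?(.*?)(?:</strong>)?</td>")


def solution_3(rows):
    """Extract clean header and value cells with one pattern, no clean-up."""
    return [CELL_RE.findall(r) for r in rows]


table = solution_3(rows)
for record in table:
    print(record)
```

Compared with the two earlier sketches, the filter, union, and clean-up steps disappear, at the cost of a pattern that is harder to read and write.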

To summarise, a well-written regex can greatly simplify our workflow, though writing an efficient regex is time-consuming at times.

Feel free to reach out to me on LinkedIn if you have any questions.

 

Author: Tony Tan