Although most of the data processed on a day-to-day basis is numeric or categorical, there are times when the data you need is buried inside a string. This could be a customer review, a sentence or a dialogue. Extracting it from the text is essential before applications can easily process it for analysis.
A regular expression, or RegEx, is a pattern that describes the text you want to match. Alteryx supports regular expressions through its RegEx tool, which lets you parse text for easy analysis.
Below is a small example of how to use RegEx to parse text data.
The aim is to extract the parts of a webpage's content that sit between the <a></a> tags. The contents of interest are highlighted in a box.
The Alteryx workflow looks as below. It comprises a Text Input tool holding the website’s URL, followed by the Download tool and then a RegEx tool.
The expression: <a.*?>.*?</a> is typed in the Configuration window as shown below.
– ‘<a’ is the literal text that tells the RegEx tool where each match begins within the HTML string.
– ‘.*?’ matches zero or more of any character, taking as few as possible. The first ‘.*?’ grabs what follows ‘<a’ up to the first ‘>’. The second ‘.*?’ grabs what comes after that ‘>’ and before the next ‘</a>’.
– ‘</a>’ is the literal closing tag, which is where each match ends.
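To see the pattern at work outside Alteryx, here is a minimal Python sketch. It is only an illustration, not part of the workflow: the HTML snippet is a made-up stand-in for the downloaded page content, and Python's built-in re module is assumed to behave closely enough to the RegEx tool for this simple pattern.

```python
import re

# Hypothetical HTML snippet standing in for the downloaded page content
html = (
    '<p>Intro</p>'
    '<a href="https://example.com/one">First link</a> some text '
    '<a class="nav" href="/two">Second link</a>'
)

# The same expression as in the tutorial.
# Each ".*?" is non-greedy, so it stops at the first ">" or "</a>" it can.
pattern = re.compile(r"<a.*?>.*?</a>")

# findall returns every non-overlapping match as a list of strings
matches = pattern.findall(html)

for m in matches:
    print(m)
```

Each match is one complete tag, from ‘<a’ through ‘</a>’. Note that because the pattern only looks for ‘<a’, it would also start a match at tags such as ‘<abbr’, and in Python the ‘.’ does not cross line breaks unless re.DOTALL is passed.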
The output should look like the image below. The ‘DownloadData’ column has all the ‘a’ tag contents from the URL.
Your RegEx can also be tested HERE to get your output in real time.
So there you have it. A tiny tutorial on why and how to use RegEx.