Regex, or regular expression, is a sequence of characters to identify a specific patterns in string values. This is useful for those who have to extract or replace certain data that isn’t easily accessed through traditional data manipulation. Regex allows the user to parse, replace, tokenise and match values without isolating the text through multiple tools such as the text to columns tool. We will now have a look into how you can use regex and how different patterns can get different responses.

The Syntax

Regex has a few key different syntax that you will be using in your time within the tool. Many of them allow for the capture of specific alphanumeric characters and some have more specialised  purposes within the tool. What you can see below is a cheetsheet that has been made up by a site called RegExr (a great training resource).

Character Classes: This section will denote what is in your regex pattern, from words to pacing to numbers it can all be captured within this group.
as an example of this if I had a string that was 1234GH5678 to get to the GH it would take a few different formulas usually. With regex depending on the end result the expression would look something like: (\d+) would capture the digits in the string (\w+) would capture the letters.

Anchors: Anchors are useful when it comes to structuring strings and quantifying the end and start of them. Though they are useful I find them something that is very use case specific for the problem. Its something that wont always be needed in regular expression unlike the Character Classes.

Escaped Characters: Within a string we can have a boundless amount of different unique combinations. This means that at some point whatever is within the string will eventually clash with the syntax of regex. A good example of this will be rounded brackets () because they both indicate a capture group in regex and also can be used in everyday strings.
Thats where Escaped Characters comes in. It allows the user to indicate to the expression that they are not to be used as intended instead that they are to be considered apart of the string.
Groups and Lookaround: Groups and look arounds are another example of use case specific functions within Regex. They allow for you to search ahead in the string or behind very much like a multirow formula in Alteryx. That means apart for your current capture group you can reference the first capture group later in the expression.

Quantifiers and Alternation: Within this particular area we will be looking at how we can tell the expression how many characters are involved in the most recent capture.
There are many different ways to express quantifiers and they all have different use cases. The most common ones that will be used are * + ? and here are a few short examples of what they do:
* means 0 or more of this particular indication. so \w* means 0 or more letters in this area of expression
+ 1 or more. This is for when there will always be at least on of these characters in the string, so \d+ means at least one number but more can be here as well.
? 0 or 1. This is used where there will only ever be 0 or 1 of the particular character. It is useful for situations where where every now and then the string holds a unique character.

 

Application

Below is a text was taken from the Alteryx community challenge 121, looking into the FIFA world cup from 2018. The objective is to get the game and dates in different columns and format them. This could be done in a few different ways, but Regex allowed for easy extraction of this particular string . So from here what we are trying to take is just the games and the dates while doing it in the lease amount of regex as possible.

This is the string we will be working with: Games: Russia v Saudi Arabia (14 June); Germany v Mexico (17 June); Portugal v Morocco (20 June); Denmark v France (26 June); Round of 16 (1 July); Semi-final (11 July); Final (15 July)

So what I did was I took the string and placed it into a text input tool within Alteryx and joined the regex tool as seen below.

What I then done was configure the Regex tool to be tokenize and set the regular expression to be \W(.*?);
With this expression the results can be seen below in fields 11 through to 15. The only downfall of this is it also took round 16 and finals into their own fields. Though that is easily fixed with a select tool.

 

This shows how powerful regex can be in certain situations when it comes to extracting text. This is just one example of a use case but when you’re exploring data there will no doubt be many more uses then you will know. So what did we do here? lets have a look at the broken down expression and feel free to refer to the cheat sheet above. Firstly this expression can be broken into the different sections and values that give it meaning.

\W  – The \W is looking for items that are not a “letter” or more specifically any alphanumeric characters. This means that it starts at the first character (:) in the text line.

(.*?) – This section is known as a capture group. This is indicated by the open and closed bracket and highlights what to keep in the expression. Some forms of Regex only allow for one capture group like tokenize while parse you can have many.

.*?  –  Inside of the capture group is this formula. The full stop represents any character alphanumeric or not while the asterisk means 0 or more. So it’s saying to capture all of the characters and any amount until it’s presented with the next part of the formula. The question mark tells it that it has to contain 0 or 1 of the previous formula allowing multiple dates to be taken out.

;   – The semicolon has been used because of the way this string was formatted. It does not represent anything within regex but tells out current statement when to finish and start the formula again.

Within the world of Regex, there are many different “flavors” meaning that one statement can interact differently depending on which Regex is used. There are many great tools online that can help with all versions of Regex allowing you to test and try new expressions and see how they interact. Regexr is a tool that I use to help form my expressions that I then bring into Alteryx. If you’re interested in learning more then take this string into Regexr and see if you can parse out the needed sentence.

Mr Blobby    Mrs Blobby        Ms Blobby

Mr. Blobby    Mrs. Blobby        Ms. Blobby

Each has its own challenge with spaces and full stops and different formatting.

Good luck and don’t forget to read more DS16 posts about Tableau and Alteryx!

How to employ Web Scraping – Joe Chan
Two ways to create a chart type switch – Tony Tan

And any questions feel free to reach out on Linkedin!

Stephen Hughes
Author: Stephen Hughes