Regular Expressions, commonly known as Regex, is a powerful tool that enhances data transformation and analysis capabilities in Alteryx. Regex is a versatile technique that allows users to define complex patterns to match, locate, and manipulate specific text or data within a dataset. With Alteryx’s support for Regex, users gain access to a whole new level of control over their data processing tasks. Whether you’re a data analyst, data scientist, or business professional, mastering Regex in Alteryx opens up a world of possibilities for unlocking the full potential of your data.

Here we will explore 3 ways to use Regex to parse out certain parts of the string. So this is the dataset we will be working on here which is a mock dataset generated by Mockaroo. Here we have the full address in the [Address] column, and our goal is to get the postcode in a separate column.  


1. Text to Column
This is how the workflow looks like.

 

In the configuration pane, we first select the column contains the information we need, in this case, [Address]. Second, we need to identify the delimiters, in this case, the comma. Thirdly, we need to decide how many columns this column needs to be split into, and in this case, there are 4 parts separated by the columns: Room info, Street info, State, and Postcode. Then when you hit run, the [Address] Column will be split into 4 columns, with each part in separate column, among which the postcode is the last column. The data cleaning tool is placed to get rid of the leading spaces.

 

This tool is an easy and quick way to parse out certain part of a string if they are separated by the same delimiter. And the outcome looks like the following, and you will need to put a select tool after to rename to columns.

2. Formula
The second way is to use the formula tool, which is personally the method that I use the most. Here is how the workflow looks like.

In the configuration pane, we add a new column called [Post Code], and the function we use is “REGEX_Replace”, and the formula looks like this:



Let us break it down:

‘.*,\s(\d{4})$’  — This is a pattern described using Regular Expression. It’s like a set of rules that tell the formula what kind of text to look for in the [Address] Column.

‘.*’  —  This part means “any characters (including letters, numbers, symbols, etc.)” and the asterisk (*) means “zero or more times.” So, it’s looking for any characters repeated zero or more times.

‘,’  — This part simply looks for a comma (,) in the text.

‘\s’  — This represents a white space character (like a space or a tab).
(\d{4}) — This part looks for exactly four digits (numbers) and captures them within parentheses. For example, it’s trying to find a pattern like “1234.”, in this case is the four-digit postcode

‘$1’ — This means “replace what you found with the first captured group.” In this case, it’s trying to replace the whole text that matched the pattern with just the four-digit number captured earlier.

So, in simple terms, this formula takes an “Address” text and tries to find a pattern where there are some characters followed by a comma, a space, and then exactly four digits (like a postal code). When it finds this pattern, it replaces the whole address with just the four-digit postcode. With the formula tool, you can have multiple formulas with multiple capture groups in one configuration and the Columns can be named directly in the tool. The result looks like the following:


3. Regex Tool
The third way is using the Regex Tool. The workflow looks like this:

The Configuration Pane of the Regex Tool is similar to the Text to Column Tool. The difference lies in: instead of the delimiter, we need to entre the regular expression (as explained in the section above) with the capture group in the Bracket, and use Parse function to put the four digits in the capture group into a new column. In Regex Tool, there is a function that allows you to split the capture group into rows under “Tokenize” Method, where it will keep finding all the strings with the same pattern as the capture group and put each of them into a different row. But you will need one regex tool for each capture group with a different pattern.

So here we explored 3 ways we can use Regular Expression in Alteryx to parse out certain parts of a string based on identified patterns. With the same logic, they have different use cases so it’s useful to understand how they work and keep them all in your tool kit. For the regular expression exercise, here is a useful link to test out your regex formula: https://regex101.com/.

The Data School
Author: The Data School