Hi all! I have just finished my first training week with Alteryx, and I have to say Alteryx is a great platform for pre-processing data. In my opinion, one of the most powerful tools that Alteryx provides you is Regex. Today, in this blog, I will show you guys some basic ideas about Regex.

What is RegEx?

RegEx (Regular Expression) is a sequence of characters that define a pattern to search in a sentence or field. We can use this to replace, tokenize, parse, or match the text that fit your defined pattern.

The cheat sheet above will provide all the characters you need to build your own pattern. What we need to do is basically just set up the pattern that matches your expected outcome.

For example, if you want to find all the numbers in the sentence “Regex 101 in a nutshell part 1” you can build your expression using \d (which stands for a single digit). Notice that the characters which matches the expression is highlighted in blue.

or find all space with \s

Or something more complex, like the first word in a sentence

Application of RegEx

Sounds good so far? It gets even better! A powerful thing about Regex is that you can theoretically match all the patterns in the world. Emails, mobile phone, your code, you name it. As far as I know, you can apply Regex to most of the programming languages like Python, C, C++, Java, etc. and of course, Tableau and Alteryx is not the exception. The key is that the syntax that define a pattern will be consistent with all the tools. Bear with me and let’s see what we can do.

Extracting emails from a text document

In the above example, we have defined a simple pattern that matches only the email that appears in the text. Basically, an email will have 2 main parts: the domain and username. The domain and username are pretty much the same where we have defined: any combination of digits, characters, full stop, or semi-colon are eligible. Those 2 parts sandwich a “@” character in the middle.

Regular Expressions for Web Scraping

Basically, most of the data presented on the Internet can be extracted by accessing to the website’s HTML source code. For a table of data, the row of data will be stored inside the <tr> tag. So, you can extract the row by using a regular expression which matches a row like this (Example using Alteryx). Also, we can see the table that has been extracted in the right bottom of the picture.

Working with datetime features

Dealing with datetime is sometimes exhausting. Imagine the date in the dataset having a “unique” format, that you can’t use basic datetime functions to find it within a text. Fear not, because Regex is here to save the day!

In this example, we can find the exact datetime matches by using a pattern that covers the datetime format we desire.

And that was just 3 of various cases that we can use Regex to easily solve problems in data preprocessing. I bet that there are plenty of other situations where Regex can give you a hand and give you an easier life.

 

The Data School
Author: The Data School