RegEx can be daunting at first glance, but if you get the hang of it, it will make your life a lot easier. As such, I thought I’d share some tips in this blog on how to use it effectively.
What Is RegEx?
RegEx, short for regular expression, is an incredibly powerful tool for parsing data in Alteryx. It uses symbols to match sequences of letters, numbers, or characters in text, which allows us to extract useful information. In simpler terms, it recognises patterns that help us to group data. So what syntax is used to build regular expressions? Well, to get us started, here are the basic ones:
. = any single character
\w = word character
\d = digit
\n = new line
\s = white space character
+ = one or more
() = marked group (i.e. the information you want to extract)
So for example, (\d+)\n means that we want to group one or more digits until there is a new line. That is simple enough to digest, but we can build more complex expressions, as I will show you below.
Why Use RegEx?
This is an example of some xml that I scraped from a property website. I want to extract the suburb, property type and median price from this text, and group them accordingly. Our old friend text-to-columns isn’t going to cut it. This complexity of parsing requires a bit more heavy lifting, and this is where RegEx comes into its own.
How To Use RegEx
Figuring out the right regular expression usually requires some trial and error. As such, I suggest using a RegEx testing website such as RegEx Pal. It allows you to test your RegEx to see whether you are extracting the right information. I find it particularly helpful because you can see whether you are grouping correctly, as well as whether your expression works for all the text you want to parse, and not just a few lines. Codes such as xml and json usually have a similar structure across each line, so RegEx can find the revelant information in each line, it just requires a bit of testing.
Back to our example, here is how I tested my expression:
As you can see, when you hover over a highlighted section of text, a pop-up box appears with all the marked groups. As I mentioned before, this is a great way to check whether you have grouped correctly. Something else I would like to point out is that I started the regular expression with ‘strong’ because it is common to both lines and immediately precedes the information I want to extract. This also plays into another useful hack, which is that you can simply copy and paste common text into your regular expression that you don’t want to group. It can save you a lot of time instead of painstakingly putting together the required symbols for information you don’t want to extract.
Now that we’ve got our regular expression, we can copy and paste it into the RegEx tool in Alteryx, as shown above. Alteryx has created a new field for each group, which I have renamed accordingly in the Output section of the RegEx tool configuration window. Here is the output we get:
The final 3 columns are a lot more digestible than the 1st, which contains the text before it was parsed. This illustrates how useful RegEx can be to parse data. Obviously this is a simple example, but the same principles apply to text with hundreds or thousands of lines. Hopefully you know a little bit more about RegEx after reading this!