I’ve been learning Regex Tool for the previous two days. It is not an easy tool, but once you know how to use it, it can save you much energy in data preparation.
Here’s how I learn Regex.
- Watch all interactive video in Alteryx. (It is very helpful to understand the basic ideas about how it works)
- Regex material from Alteryx Inspire training sessions
- Practice Alteryx weekly challenges about data parsing
For testing regex expressions I can recommend the following free resources:
Make sure you select PCRE (Perl Compatible Regular Expressions) on these websites as there is a number of RegEx flavors with slight differences between them and Perl is the one used by Alteryx engine.
I also found that there is a great Tool Mastery article on Alteryx Community website.
In addition, there are some useful practice materials on Grace Murphy’s blog about how to learn RegEX. Materials include RegEx game from Machine Learning Lab, RegEx introducing video and so on.
During my study, I find that the most challenging part of Regex is to find the patten of the target, especially in HTML source. Here are a few useful tips in data parsing.
Regex Dictionary
In Regex tool, it’s convenient to look for dictionary in ‘Format to Convert’
Date
If I want to select the date in the following photo, the pattern would be
Every word before the date + 2 digits number + ’-’ + words + ‘-’ + 4 digits number
For every word before the date, I used .+ to select everything and ? to stop before any digital number.
For the date 16-APR-2005, I use (\d{2}-\w+-\d{4}) to paste the date only.
- \d represents digit character
- \w represents word character
- can use + or {} behind \d or \w to select how many characters, for example,
-
- \d{2} means select 2 digital characters
- \w+ means select one or more word characters
With or without ‘s’
For the chart below, if I want to select the first three rows, their pattern is
One or more digit+ space + file(s) + space + found?
The regex for this pattern is \d+\sfiles?\sfound\?
It’s easy to figure out
- One or more digits, which is \d+
- Space, which is \s
However, the challenging part is to choose ‘file’ in ‘1 file found?’ and ‘files’ for the rest of the rows. For the file(s) part, I use files?
- ? means match between Zero or One times
- s? in the middle means match no or one ‘s’, therefore, it’s either file or files
Get rid of HTML or XML Tags (<([^>]+)>)
I found a great trick when practicing web scraping. When dealing with tags
- use (<([^>]+)>) to select them out, like the photo below when it’s put into regex101
- replace to empty space in regex tool to remove all coding tags
- after simple cleaning, we have the clean data like the second photo below
Regex is a very complicated topic. If you have more related materials and tips, please leave a comment below.