Most of the data we have is unstructured, but it can be organised to prepare it for analysis. Regular expressions, commonly referred to as regex, are a powerful tool for searching, matching, and manipulating string data with precision. Regex can look daunting and complicated, but there is a shortcut that lets us use it without knowing everything about it! This blog explains, step by step, how to tackle a regex problem in a simple way.

 

  1. Utilise Regex101.com

Use this tool to work through your regex problem. Simply paste your sample text and your pattern into the site and build the expression up step by step.

 

  2. Find the separator patterns that wrap the characters of interest.

Know exactly what separates the data. For instance, take an address row like this:

Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ, Crimson Country

We probably want to separate it into the first and second lines of the street address, the suburb, the postcode, and the country name.

So the goal is to get:

Unit 230
15 Scarlet Street
Vermillion Suburb
300AZ
Crimson Country

That means the separators would be
/
and
, (a comma followed by a space).
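
To check that these two separators really are enough, here is a minimal sketch in Python (my choice of language for illustration; any regex tool would do). It uses the standard-library re.split to cut the example address wherever either separator appears.

```python
import re

address = "Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ, Crimson Country"

# Split on either separator: a forward slash, or a comma followed by a space.
parts = re.split(r"/|, ", address)

for part in parts:
    print(part)
```

If the five printed lines are the five fields we listed as the goal, the separators are right and we can move on to capturing each field.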

 

Another example, this time in HTML:

<noscript><img src="https://image.pr.sbsod.com/5cd74e0a-1cf9-5db2-96c7-7cd31b4f3000?crop=true&amp;width=1280&amp;height=1920&amp;quality=89" alt="13 Assassins"/></noscript></picture></div><div class="jss14970" data-testid="tile-chin"><h3 class="MuiTypography-root jss14971 MuiTypography-body1">13 Assassins</h3><div class="jss14982"><span><span class="jss14984 jss14985 ellipsis">Action</span><span class="jss14984 jss14986">2010</span></span><span title="Unsuitable for persons aged under 15" class="jss14990 jss14987"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 50 16" focusable="false" class="jss14991 jss14999" role="img" aria-label="Classification MA15+"><g data-name="Layer 1">

We want the title, genre and classification:
13 Assassins
Action
MA15+

So the separators in this case are simply short fragments of HTML that surround the target text, are distinctive enough not to match anything else, and repeat for every record. In this case they would be:

body1"> and </h3><div class=
ellipsis"> and </span><span class="
label="Classification and "><g data-

  3. Fetch the characters of interest.

We do need to understand some regex symbols to capture the characters of interest. Below is a short dictionary of the ones I find sufficient to tackle most of my regex problems:

Characters

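. = Any single character (by default, anything except a line break)

\d = digits
\D = NOT digits

[A-Z] = Uppercase letters (\u also works in some flavours, such as Boost, but not in PCRE, Python or JavaScript)
[a-z] = Lowercase letters (\l in those same flavours)
[A-Za-z] = Letters of either case. Avoid [A-z]: it also matches the symbols [ \ ] ^ _ ` that sit between Z and a in ASCII
[^A-Za-z] = NOT letters

Spaces, tabs and most other symbols match themselves literally.
\ = Escape symbol: it cancels the special meaning of the character that follows (e.g., \. to capture a literal .)

[ ] = a list of allowed characters; it matches any one of them. e.g., [ab23 ,#-]

( ) = a capture group: the characters matched inside are returned (parsed) as a separate value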
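
One thing worth checking for yourself is that ranges follow ASCII order, so a range running from uppercase A to lowercase z covers more than letters. The sketch below is a small Python illustration (the sample string is made up for the purpose) that runs a few of the character classes through re.findall so you can compare what each one picks up.

```python
import re

# Made-up sample containing letters, a digit and two symbols.
sample = "Ab_9^z"

digits = re.findall(r"\d", sample)            # digits only
upper = re.findall(r"[A-Z]", sample)          # uppercase letters only
lower = re.findall(r"[a-z]", sample)          # lowercase letters only
letters = re.findall(r"[A-Za-z]", sample)     # letters of either case
too_wide = re.findall(r"[A-z]", sample)       # also picks up _ and ^
literal_dot = re.findall(r"\.", "300.AZ")     # escaped dot matches only "."

print(digits)       # ['9']
print(upper)        # ['A']
print(lower)        # ['b', 'z']
print(letters)      # ['A', 'b', 'z']
print(too_wide)     # ['A', 'b', '_', '^', 'z']
print(literal_dot)  # ['.']
```

The last but one line is the reason I would write [A-Za-z] when I only want letters.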

Quantifiers

* = 0 or more

+ = 1 or more

{ } = put a number inside the curly brackets to repeat the preceding item exactly that many times (e.g., \d{4} fetches exactly four digits)

? = placed after a quantifier such as * or +, it makes the quantifier lazy: match as few characters as possible. On its own, ? means the preceding item is optional (0 or 1 times)

Honestly though, I mostly just use .*? (which means: take any characters, as few as possible, up to the next separator). But some problems require specific characters, so it is worth knowing the specific symbols as well.
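
To see what the ? after .* actually changes, here is a short Python sketch comparing the lazy and the greedy form on the example address. Both patterns stop at a comma followed by a space; the only difference is which one they stop at.

```python
import re

address = "Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ, Crimson Country"

# Lazy: take as few characters as possible, so stop at the FIRST ", ".
lazy = re.match(r"(.*?), ", address).group(1)

# Greedy: take as many characters as possible, so stop at the LAST ", ".
greedy = re.match(r"(.*), ", address).group(1)

print(lazy)    # Unit 230/15 Scarlet Street
print(greedy)  # Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ
```

This is why .*? is so handy for parsing: it stops at the nearest separator instead of running on to the last one in the row.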

For instance,

Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ, Crimson Country

We could capture Unit 230 with (.*?), then add the separator /, then fetch the street address with (.*?), then add the separator , and so on.

So the parsing regex for this problem would be:

(.*?)[\/-](.*?), (.*?), ([A-Za-z0-9]{5}), (.*)

(.*?) will fetch Unit 230

[\/-] is the separator. It matches any one character in the [ ], whether it is a - or a /. Note that the \ before the / acts as an escape symbol for /, which is needed in flavours that use / to mark the start and end of the pattern.

(.*?) will fetch 15 Scarlet Street

, will be the separator (note the white space after the comma)

(.*?) will fetch Vermillion Suburb

, will be another separator

([A-Za-z0-9]{5}) will fetch 300AZ. It means exactly 5 characters, each of which is a letter (either case) or a digit.

, as another separator

(.*) will fetch Crimson Country. There is no ? here because this is the last group, so it should take everything that remains.
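
If you want to try the address pattern outside Regex101, the following is a minimal Python sketch of the same idea. Two choices here are mine rather than part of the walkthrough: the postcode class is written as [A-Za-z0-9] so that only letters and digits are accepted, and the field names in the dictionary are purely illustrative.

```python
import re

address = "Unit 230/15 Scarlet Street, Vermillion Suburb, 300AZ, Crimson Country"

# Five capture groups separated by "/" (or "-") and by ", ".
# Python does not need "/" escaped, but the escape is harmless.
pattern = re.compile(r"(.*?)[\/-](.*?), (.*?), ([A-Za-z0-9]{5}), (.*)")

match = pattern.match(address)
if match is None:
    raise ValueError("address does not follow the expected layout")

# Illustrative field names (not from the original post).
names = ["line_1", "line_2", "suburb", "postcode", "country"]
parsed = dict(zip(names, match.groups()))

for name, value in parsed.items():
    print(f"{name}: {value}")
```

Checking for None before reading the groups matters in practice, because a row that does not follow the layout simply produces no match.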

 

As for the HTML example, add the capture group (.*?) between each pair of separators, and add .*? between the pairs to skip over everything in between. Don’t forget to add the escape character \ before the / in the separators below.

body1"> and </h3><div class=
ellipsis"> and </span><span class="
label="Classification and "><g data-

.*?body1">(.*?)<\/h3><div class=.*?ellipsis">(.*?)<\/span><span class=".*?label="Classification (.*?)"><g data-.*?

The (.*?) will fetch 13 Assassins, Action and MA15+
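
Here is a minimal Python sketch of the same HTML parse, in case you want to run it yourself. I am assuming the page source uses plain straight quotes, and I have shortened the snippet to the part that contains the three targets. I also use search instead of a leading .*? so the pattern can start matching anywhere in the text.

```python
import re

# Shortened copy of the example HTML, written with straight quotes (assumption).
html = (
    '<h3 class="MuiTypography-root jss14971 MuiTypography-body1">13 Assassins</h3>'
    '<div class="jss14982"><span><span class="jss14984 jss14985 ellipsis">Action</span>'
    '<span class="jss14984 jss14986">2010</span></span>'
    '<span title="Unsuitable for persons aged under 15" class="jss14990 jss14987">'
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 50 16" focusable="false" '
    'class="jss14991 jss14999" role="img" aria-label="Classification MA15+">'
    '<g data-name="Layer 1">'
)

# Separator, capture group, separator; .*? skips the text between the pairs.
pattern = re.compile(
    r'body1">(.*?)</h3><div class=.*?'
    r'ellipsis">(.*?)</span><span class=".*?'
    r'label="Classification (.*?)"><g data-'
)

# For HTML that spans several lines, re.DOTALL lets "." cross line breaks.
match = pattern.search(html)
title, genre, classification = match.groups()

print(title)           # 13 Assassins
print(genre)           # Action
print(classification)  # MA15+
```

On a full page with many tiles, pattern.findall(html) would return one tuple of three values per tile, provided every tile uses the same separators.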

Regex is a very powerful tool for preparing your data. Hopefully, after breaking the regex down into chunks with the steps above, you find it easier to understand and less daunting. Happy parsing!

 

Author: The Data School