When dealing with text data such as e-mails, it is common that people misspells or make typos. We do not want to exclude them and it is also difficult to exclude them when there are lots of data. The Soundex function in Alteryx could help us to solve the problem. How Soundex works is that it identifies the sound of a string and provides a code so that the strings with similar sound could have the same Soundex code.

I drafted a simple text input as an example:

We can see that for “gmail”, there are lots of typos such as gamil, gmal and for “hotmail”, there are hotmal, hotmaill, hotmil. We want to correct them. Here is how we can do it.

Step 1. Extract e-mail domains using Regex

Extracting e-mail domain is easy because it is structured to be after the symbol “@”. We can use Regex Parse with .*@(.*) to extract domains.

Step 2. Apply Soundex function

Drag in the Formula tool, create a new column named “Soundex”, apply the Soundex function “Soundex(column_name)”. We can see that the gmail variants have the same soundex “G542” and the hotmail variants have the same soundex “H354”.

Step 3. Correct e-mail domains

For a very small sample like this, we can simply do it in the Formula tool with IF function.

Then we can simply use Regex to replace the domain to correct those misspelled e-mails.

This provides an idea of how to use Soundex. When the data is big, it is very likely that the most frequent e-mail domain should be the correct ones (e.g. gmail.com, hotmail.com, yahoo.com etc.). We can use Summarize tool and Sort tool to find the correct ones and then replace the misspelled with the correct domain.

Limitations

Although Soundex is very powerful, there are many limitations to it. The main limitation is that some misspellings can have different Soundex code and therefore cannot be captured.

Hope this helps you on the journey with Alteryx!

 

 

The Data School
Author: The Data School