Have you ever been given a dataset that wasn’t quite clean? Lots of extra white spaces before or after the data in the cells? Punctuation marks, or numbers or letters/words showing up where they shouldn’t be? Or even null cells?
Misinformation can lead to quite costly consequences. For example, the 2008 Financial Crisis was the result of certain numbers being overstated and is estimated to have cost $19.2 trillion.
While some data cleansing aspects allow you to work with the data, others provide a better performance when querying the data.

There is good news!
You can clean this data in Alteryx and get it ready for wrangling or Analysis, in just a few minutes.
Today we will go through 3 Data cleansing tools, the Data Cleansing tool, the Select tool and the Text to Columns tool.

Data Cleansing Tool

The Data Cleansing tool is quite a powerful tool, because it is a Macro. This means that is combines multiple other tools into one convenient tool for the user. We can identify Macros in Alteryx with the plus symbol on the bottom right of the tool.

You may need to turn on the indicator setting to see this, and you can do this by going to:
Options > User Settings > Edit User Settings > Canvas > Display macro indicators on tools.

The Data Cleansing tool has many functions that it can perform on the dataset depending on how you configure the options.

White Spaces

We often see leading and trailing white spaces, which can cause performance issues especially for large datasets. Imagine you had a leading and trailing white space for every row in a column and there were 5 million rows of data. That would mean an extra 10 million ‘spaces’ that need to be catered for when using system memory.
The Data Cleansing tool can help clean this up by configuring it to remove the Leading and Trailing Whitespace.

Punctuation

Occasionally we see unwanted punctuation marks in the data. The options configuration window will be adjusted slightly to include punctuation.

Numbers and Letters

Sometimes, there will be numbers mixed in with text strings, or text mixed in with numbers. The options configuration window will be adjusted to include the removal of the unwanted characters.

Null Rows or Columns or Cells

If you have completely Null rows or columns, and it makes sense to remove them, the Data Cleansing tool can do this for you also.
However, if only some of the cells in the column are null, you can choose to replace them with blank or 0 depending on their data type.

Upper/Lower/Title Case

It is also possible to change the Case type of the Data to match your requirements.

Select Tool

The Select tool is another really useful tool found in the Preparation palette to help with Data Cleaning and preparation.

The Select tool allows us to change the Data type of a field, for example, string to integers, or numbers to dates. We can also change the size of the data type here also. The configuration window for the Select Tool would look like this.

It is also possible to change the order of the fields with the arrow keys at the top left.

Text to Columns Tool

The Text to Columns tool is found in the Parse palette of Alteryx.

The Text to Columns allows us to split a column into multiple columns using a specific delimiter. Some common delimiters include:

  • comma (,)
  • dot (.)
  • hyphen (-)
  • underscore (_)
  • pipe (|)
  • space (\s)
  • tab (\t)

In the configuration window of the Text to Columns, you will need to select which column to parse, the delimiter, the number of columns to split into, and a few others.

Practice

Now that we have gone through the 3 tools, lets put this into practice. Download the excel spreadsheet example exercise
Now try using everything that we have gone through today to clean up the data.

Solution

Here is one solution:

Text to Columns Tool configuration:

The 3 Data Cleansing Tools configuration:

Select Tool configuration:

Compare your final output to one I prepared earlier.
example exercise solution

WELL DONE!!

I hope you have enjoyed the post, and I will see you next time.

The Data School
Author: The Data School