One of the major pain points most data analysts or business users come across is ‘dirty data’. It is quite rare to come across some data that is perfectly clean, organised, stored and maintained properly, and easily accessible and understandable to be used for analysis. So part of the job of a data analyst is to analyse the data (well, obviously), but also clean and prepare the data to allow for easier analysis. This is where Alteryx comes in. Rather than having to spend hours and hours cleaning and organising the data using R, Alteryx provides a powerful tool with an easy user interface. This makes it highly user-friendly, and doesn’t require a high level of programming competence.

There are a lot of reasons as to why I love Alteryx. This blog will focus solely on Alteryx’s ability to prepare, organise and clean the data. The following blogs will delve into some of Alteryx’s more advanced tools and functions. But for now, let’s get stuck in to a demo workflow* and get familiar with some of the main tools.

*This demo workflow is structured in a way to demonstrate some of the more common cleaning and preparation tools, and hence is not the most efficient workflow.

Sample

 

The Sample tool pretty much does what it says in it’s own name – it takes a sample of the data. For example, in this demo workflow, the data we are interested in begins on the 6th row, meaning we need a way to cut out the first 5 rows. We can configure the sample tool to collect data based upon the first or last N rows, skipping the 1st N rows, 1 of every N rows, 1 in N chance to include each row, or the first N% of rows. More times than not, you will only be using the first 3 configuration options.

 

 

If we need to take a sample of the data from the 6th row onwards (ie cut out rows 1-5), we can use the “Skip 1st N rows” to do so. Or a less efficient way would be to take the “Last N rows”, but obviously that has it’s own very immediate limitations when working with large amounts of data.

 

Dynamic Rename

 

The dynamic rename tool is also very self explanatory in its own name – it renames the field headers. This is quite useful in situations where you want to extract specific data and input it into the header (whether it be from a record, or another data source altogether). In the demo workflow, you can see that the field headers are actually in the first record. So we want to extract the information and place it into the headers. For just renaming headers, the select tool (see below) is the most appropriate method.

 

 

DateTime

 

The DateTime tool is part of the Parse tool collection. It can be used to convert string data types into date or datetime data types. This is useful as it means visualisation software such as Power BI or Tableau will recognise the date format, so the data can then be used in date-based analysis.

 

An important tip when configuring the DateTime tool is to select the correct input format. If you do not see the matching input structure available, then you will need to type in a custom format.

 

 

In the demo workflow, both the “Join Date” and “First Purchase Date” are originally string values, meaning you can’t use any time/trend analysis. So the DateTime tool was applied twice – one for each field.

 

 

Text to Columns

 

If you come across concatenated data that you want to display in separate cells, rows, or columns, then you can use the Text to Columns tool. To configure the tool to split to columns correctly, you must specify the column you want to split out, the delimiter you want to split by, and how many columns you want it to split out into. Make sure that the delimiter you are splitting by isn’t included within any of the values, otherwise you will end up with half the value in one column, and the other half in another! Not only does this tool allow you to split to columns, but also rows. This is more useful when structuring the data using transpose and cross tab tools, but just wanted to mention that there is this option if desired.

 

 

In the example, you can see I split out each individual “product” by the “,” delimiter. That way, I can count the frequency of each individual product to analyse product popularity or use it to understand specific demographical tendencies.

 

 

Select

 

I love the select tool! It is probably one of my most commonly used tools. The select tool allows you to rename field headers, change the data type, or exclude specific fields from your workflow. The select tool is a very simple, yet extremely useful tool in managing the scope of the workflow, and organising/culling/keeping track of all of the fields you have and what they actually are showing.

 

 

Data Cleansing

 

As you may have realised by now, most of the Alteryx tool names are very user-friendly, and pretty much describe exactly what the tool does. The Data Cleansing tool is no exception here! The Data Cleansing tool is actually a macro (I will be covering macros in another future blog post), so it is a workflow made up of tools all nicely packaged up into the one cleansing tool. Anyways, the Data Cleansing tool can be configured to do all the hard manual cleaning for you, such as getting rid of trailing white spaces, unnecessary punctation, double white spaces etc.

 

 

Most commonly, like in this example, I use the Data Cleansing tool to remove leading/trailing white spaces.

 

Formula

 

The formula tool is arguable the most flexible and powerful tool in Alteryx. It can be used for simple cleaning tasks such as replacing specific values, or can be used to for more complex computations when generating new data. For this blog, I will just be focusing on using the formula tool for cleaning/preparation.

 

If you know a bit of SQL, writing formula’s in Alteryx will feel familiar. However, please note that this is not a SQL language so there are some differences! I recommend checking out the blue “fx” button on the left of the expression box – you can have a look through all the functions available and examples of how to write them.

 

 

In this example, I am doing a simple REPLACE expression, where I want to get rid of the “.”. The syntax for this formula is:

 

Replace([Last Name], “.”, “”)

 

Now, the main reason I did this in a formula instead of removing punctuation through the Data Cleansing tool, is because I didn’t want to remove the “-“‘ in “Berners-Lee”. If I used the Data Cleansing tool, I would’ve gotten rid of the full stops, but also would have turned that last name into “BernersLee”. So in that sense, the formula tool can be used to specific situations where other tools are not appropriate.

 

 

Multi-row Formula

 

The multi-row formula is starting to edge into more complex territory. This is a formula that can be applied over multiple rows in specifically selected columns. This will hopefully make more sense when I walk you through my example.

 

As you can see, there are a lot of null values in “Region”. In this data, each row where there is a region means that all following rows are from that same region (until it changes to another region value, which then follows the same pattern). So, what I want to do is take the region value, and replace the null values below with it, but only for the null values above an other region value. And then I want this process to repeat for every region value.

 

 

So the formula to do this would be:

 

if isnull([Region]) then [Row-1:Region] else [Region] endif

 

Let’s break this down a bit. All this formula is saying is if there is a null value in the region, then search an earlier rows for a value until a value is found. Then apply that value for all those specific nulls. However, if a cell already has a region value, then leave it as is. You can think of this as a sort of conditional formula, where if “x”, then “y”, else “b”.

 

That’s it!

 

Well there you have it, a short introduction into the tools most commonly used for cleaning and preparing data in Alteryx. The following blogs coming up will delve into more advance topics such as spatial analysis, apps and macros, so stayed tuned for that!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Ben Devries
Author: Ben Devries

Ben graduated with a Bachelor of Music Performance (Honours) from the Sydney Conservatorium of Music in 2023. For the last few years, Ben spent his time working as a professional jazz saxophonist which led him all around the world performing in cities such as London, San Fransisco, and of course, Sydney. But despite his musical background, Ben’s interest in data analytics came from his passion for problem solving and understanding the little details of how and why things work. From there, Ben went on to discover the Data School Down Under, and throughout the interview process became further inspired not only by the logic and flexibility of data, but also the ability for data to provide valuable insights to help solve complex business problems and present meaningful stories. Ben is excited to join Data School Down Under, and hopes to utilise his creativity, improvisational skills, and ability to draw connections upon diverse areas of information learnt as a musician within his new career in data analytics. In his spare time, Ben still enjoys playing his saxophone, as well as downhill longboarding, and spending time with his family.