Select Page

Recently I had a project where the client wanted to know if there was correlation between many fields in their data set. It was survey data and every row had some NULL values or zeros which meant there was no data rather than the answer to that question was 0. In Alteryx, the Pearson Correlation tool can answer the question. You can select multiple fields and the tool outputs a correlation matrix between each variable.

## The Problem

The Pearson Correlation tool in Alteryx can not handle NULL values in the data though. Consider the data set below. If we wanted to find the correlation coefficient between each variable, we were in trouble.

If we feed this data into the Pearson Correlation tool, this is the result we get:

One thing we can do is to filter out the rows with NULLs. But if we do that all that remains is the two highlighted rows. The other thing we can do is to replace the nulls with averages. But what if for a couple variables half of the rows were NULLs. Working with averages in that case would be too much assumption.

The second problem I had was that there were lots of 0 values which meant there was no data rather than having 0 as value. Alteryx’s Pearson Correlation tool handles zeros as actual values, which in our case would skew the result. So, again you might want to filter out those rows which have the value 0.

## My Solution

I thought it would be great to calculate the correlation between two variables at a time filtering out the rows which hold NULLs for those two variables only. Below you can see how it would work with three iterations for three variable pairs in the same data set.

This way we can calculate the correlation between W and X using the six highlighted rows as shown in the first image above. Then we could calculate the correlation between W and Y in another iteration, between W and Z in the third iteration and so on.

My solution to the second problem is that the new tool should have the option to ignore rows with value 0 the same way I showed with the NULLs in the above example.

## The Macro

I created a macro which calculates correlation coefficient using the above described method. See the result running the macro on the same data:

The user interface looks like this:

The first two interface elements are self explanatory. Tick the “Output Result Stats” checkbox if you want to know what percent of the rows were used to calculate correlation for each variable pair. This will output extra columns named “Based on percent of Rows”

Use the last radio button group to select how you want to output the results. The long format can be used in Tableau straight away to build a correlation heatmap.

I named the macro Pearson Correlation Matrix. You can download it from Alteryx Gallery or directly from here.