While looking at Predictive Analytics in week 3, we were shown a number of different tools in the ‘Predictive’ tab in Alteryx. Some of the powerful tools in there include Linear Regression, Decision Trees and tools for scoring many different models. Before we start on these, though, it’s valuable to understand our data a little more. A great tool for this is the Association Analysis tool, which is under the ‘Data Investigation’ tab in Alteryx.
A key to understanding our data is how the fields relate to each other, i.e. their correlation. For example, in Alteryx Challenge 18, we are asked to predict the total wins that baseball teams will get next year, based on this year’s stats across many fields. When we feed this data into the Association Analysis tool, the output is a heat-chart correlation matrix, along with some specific detail that includes the p-values ordered from smallest to largest:
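A statistically significant correlation is generally indicated by a very small p-value (often < 0.05 or < 0.03). The p-value tells us how surprising the observed association would be if the two fields were actually unrelated, while the correlation coefficient itself tells us how strong the association is.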
This allows us to quickly see which fields are most likely to play a role in the outcome that we’re trying to predict. With this, we can then refine our models down the track to use only the highly correlated fields, to achieve a more accurate model and predictions, hopefully without over-fitting.
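To make the idea of ranking fields by p-value concrete outside Alteryx, here is a minimal Python sketch using only the standard library. It is not how the Association Analysis tool computes its output: the p-value here is estimated with a simple permutation test (shuffling the target and counting how often the shuffled correlation is at least as strong), and the `teams` data, field names and function names are all made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate a two-sided p-value by shuffling ys and recomputing |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


def rank_fields(data, target):
    """Return (field, r, p) tuples for every non-target field, smallest p first."""
    rows = []
    for name, values in data.items():
        if name == target:
            continue
        r = pearson(values, data[target])
        p = permutation_p_value(values, data[target])
        rows.append((name, r, p))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical toy data, NOT the Challenge 18 data set.
teams = {
    "wins": [95, 88, 81, 76, 70, 64, 90, 72],
    "runs_scored": [820, 790, 735, 700, 660, 610, 805, 690],
    "stadium_age": [12, 55, 30, 8, 41, 23, 60, 17],
}

ranking = rank_fields(teams, "wins")
for name, r, p in ranking:
    print(f"{name}: r = {r:.3f}, p = {p:.4f}")
```

In this toy data the made-up `runs_scored` field tracks `wins` closely, so it lands at the top of the list with a small p-value, while `stadium_age` does not clear the 0.05 threshold.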
Some customisations that we can make within the Association Analysis tool include targeting the one field that we want to predict, selecting different fields to analyse, and switching the measure of association between Pearson correlation, Spearman correlation and Hoeffding’s D statistic.
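To see why the choice of measure matters, the following sketch compares Pearson and Spearman correlation on a relationship that always increases but is not a straight line. It is a plain Python illustration under my own assumptions, not Alteryx's implementation: Spearman is computed here as Pearson on the ranks of the values, the data is invented, and Hoeffding’s D is left out because it is considerably more involved.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: measures how well a straight line fits."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example: y always rises with x, but along a curve (y = x cubed).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because Spearman only looks at the ordering of the values, it treats this curved relationship as a perfect association, whereas Pearson comes out lower because the points do not sit on a straight line. That makes Spearman a reasonable option to try when a field is expected to move with the target but not proportionally.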
With this information we can get a good overview of the interrelations between the fields and know the key fields to target as we continue analysing and making predictions with our data sets.