8 min read

 

This two-part blog series aims to demonstrate how to empower Alteryx with Python to generate additional insights and create effective predictive models. In the first part of this series, we will showcase how the Python Tool can enrich exploratory data analysis (EDA) in an Alteryx workflow. In the second part, we will explore and compare the machine learning capabilities of Alteryx and Python, and understand how the Python Tool can help us easily build effective predictive models.

 

Content

  1. Exploratory Data Analysis (this blog)
    • The Case Study 
    • EDA using Alteryx’s Investigation Tools
    • EDA using Alteryx’s Python Tool
  2. Predictive Modelling (Part II)
    • Predictive Modelling using Alteryx’s Machine Learning Tools
    • Predictive Modelling using Alteryx’s Python Tool
    • AutoML using Alteryx’s Python Tool

 

 

Alteryx is a powerful platform to answer business questions efficiently. It allows users to quickly access, wrangle, analyse and output data. But did you know that Alteryx can become enough more powerful and flexible with the Python Tool?  Don’t take my word for it, let’s see what the Python Tool can do through a case study!

 

1. The Case Study

We will use a case study to illustrate how the Python Tool can be used in Alteryx. This case study is based on Kaggle’s Tabular Playground Series – Jan 2022. The problem that we are going to tackle was described as follows:

“There are two (fictitious) independent store chains selling Kaggle merchandise that want to become the official outlet for all things Kaggle. We’ve decided to see if the Kaggle community could help us figure out which of the store chains would have the best sales going forward. So, we’ve collected some data and are asking you to build forecasting models to help us decide.”

Essentially, this is a regression problem, where the goal is to predict the expected sales of the two independent store chains, KaggleMart and KaggleRama. Based on these predictions, Kaggle will determine which of the chains will become the official outlet.

 

 

2. Exploratory Data Analysis using Alteryx’s Investigation Tools

Before we start building machine learning models, it is often useful and necessary to perform Exploratory Data Analysis (EDA). EDA allows us to discover errors in the data, understand patterns in the data that can lead to feature engineering, as well as warning signs that warrant detailed data cleaning.  

Alteryx Designer provides a range of tools that can be used to conduct EDA. These tools can be found on the Data Investigation Palette.

 

Let’s start our analysis by importing our Training and Test Data using the Input Tool.

 

Based on the Metadata, we can see that all the fields are currently of V_String type. We know the date field should be date format, and num_sold should be numerical (integer). We can use a Select Tool here to fix the data types. 

 

Once the data types have been fixed, we can drag the appropriate Data Investigation Tools, such as Distribution Analysis, Field Summary, and Frequency Table onto the Canvas

 

Below is the output for Field Summary. We can immediately see that num_sold is positively skewed, with its Mean > Median. We can also see that there are no missing values in our data. 

 

In summary, the Data Investigation Tools is a useful method for gaining a quick understanding of our data. However, it does have two limitations: 1) The outputs/results can only be viewed using a Browse Tool, and rendering can be time consuming each time we switch between Browse Tools. 2) Our ability to explore the data is limited to the chart types and functionalities that have been pre-built into the Data Investigation Tools. 

 

 

3. Exploratory Data Analysis using Alteryx’s Python Tool

Fortunately, we can use the Python Tool to resolve both of the above challenges. The Python Tool operates like a Jupyter Notebook, which makes it easy to view and compare EDA results without repetitive rendering. The Python Tool also gives us the power and flexibility to develop charts and functionalities that are not available in standard Alteryx.

 

Step 1: Bringing the Python Tool onto the Workflow
  1. From the Developer Palette, drag the Python Tool onto the Canvas.
  2. Connect the input data to the Python Tool, and give the connection a sensible name such as “Training”. 
  3. Note that there’s also an R Tool available in Alteryx for those that prefer to code in R.

 

Step 2: Installing and Importing the Necessary Python Libraries
  1. Inside the Python Tool’s configuration window (which appears as a Jupyter Notebook), import the libraries that we will use.
  2. Note that we can also install additional libraries as needed, for example, I have installed Seaborn for EDA and Pycaret for AutoML.

 

 

Step 3: Importing the Training Data
  1. We can import our training data using Alteryx.read(), which reads the Alteryx input and turns it into a Pandas DataFrame.
  2. Make sure you Run the Alteryx workflow first in order to cache the incoming data to be used in the Python Tool.
  3. We can use Pandas’ .head() and .info() methods to gain some initial understanding of the data.

 

As good practice, we should always split our training data into train and validation sets, and only perform EDA on the train set to prevent data leakage and improve the chances of model validity and generalization. However, the focus of this blog is on the Python Tool and not on developing actual machine learning projects, we shall skip the formal data science processes here.

 

 

Step 4: Performing Data Quality Analysis
  1. We can use Panda’s isnull() method in combination with Seaborn’s heatmap() method to investigate missing values in the data.

 

2. We can also use describe() to get the summary statistics for our features. The summary statistics allow us to check whether values (especially numerical values) are within the expected range and scale.

 

 

Step 5: Performing Univariate Analysis
  1. We can also explore the distribution of our features. In this case the categorical features, such as Country, Store, and Product seem to have balanced class distributions.

 

2. We can also explore distribution of the target variable “num_sold”. Based on the Empirical Cumulative Distribution Function (ECDF) and the histogram of “num_sold”, we can clearly see that the target variable is skewed, rather than normally distributed. Almost 80% of “num_sold” observations are below 500. 

 

 

Step 6: Performing Multivariate Analysis

Multivariate Analysis allow us to explore the relationship between variables, such as the relationship between the features and the target variable.

  1. We can use the boxplot to explore how the distribution of the target variable is associated with different features. For example we can see that Norway tends to buy more goods than Finland and Sweden. We can also see that KaggleRama seem to generate higher sales than KaggleMart. Furthermore, we can see Kaggle Hat is definitely the best selling product by sales number. 

2. The scatter plot shows a cyclical pattern in sales. It seems to suggest that the sales number generally grow over time, and that sales tend to be higher in the early half of each year.

3. The KDE jointplot confirms that different products tend to be associated with different levels of sales volume, suggesting that product could be an important feature.

 

 

As we have seen, the Python Tool can really enrich our Alteryx EDA workflow and empower us with additional insights otherwise unavailable in standard Alteryx.

In the next blog of this series, we will take the Python Tool even further, and understand how the it can help us to create effective predictive models. See you there!

 

 

Martin Ding
Author: Martin Ding