
 

This two-part blog series aims to demonstrate how to empower Alteryx with Python to generate additional insights and create effective predictive models. In the first part of this series, we will showcase how the Python Tool can enrich exploratory data analysis (EDA) in an Alteryx workflow. In the second part, we will explore and compare the machine learning capabilities of Alteryx and Python, and understand how the Python Tool can help us easily build effective predictive models.

 

Content

  1. Exploratory Data Analysis (Part I)
    • The Case Study 
    • EDA using Alteryx’s Investigation Tools
    • EDA using Alteryx’s Python Tool
  2. Predictive Modelling (this blog)
    • Predictive Modelling using Alteryx’s Machine Learning Tools
    • Predictive Modelling using Alteryx’s Python Tool
    • AutoML using Alteryx’s Python Tool

 

In the previous blog, we introduced the case study and performed exploratory data analysis (EDA) with the Python Tool to understand the various characteristics of the data. In this blog, we will continue with our case study and demonstrate how to create predictive machine learning models in Alteryx, both with and without the Python Tool.

 

Before we dive into model building, there are a few additional data preparation steps that we need to take:

Feature Engineering: Creating Additional Date-Related Features

Machine learning algorithms generally don’t operate directly on date-formatted features. We should therefore derive categorical (text) or numerical features that represent the dates before feeding the data into our models.

  1. From the Preparation Palette, drag the Formula Tool onto the Canvas.
  2. Create “day”, “month”, “year”, “day_of_week”, and “season” features as follows.

 

  3. From the Preparation Palette, drag the Select Tool onto the Canvas. Deselect the “row_id” and “date” fields, as they are not useful for predicting the target variable.
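For readers who prefer code, the same date features can also be sketched in pandas. This is only an illustration and not the Formula Tool expressions used in the workflow: the sample rows are made up, and the month-to-season mapping (meteorological seasons, northern hemisphere) is an assumption.

```python
import pandas as pd

# Hypothetical sample rows; only the "date" column name comes from the case study.
df = pd.DataFrame({"date": ["2015-01-01", "2015-04-15", "2015-07-04", "2015-10-31"]})
df["date"] = pd.to_datetime(df["date"])

# Numerical date parts.
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Categorical (text) date parts.
df["day_of_week"] = df["date"].dt.day_name()

# Assumed season definition: meteorological seasons, northern hemisphere.
season_by_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
df["season"] = df["month"].map(season_by_month)

# Drop the raw date once the derived features exist.
features = df.drop(columns=["date"])
print(features)
```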

 

Now that we have prepared our data, we can finally start building our machine learning models!

 

 

1. Predictive Modelling using Alteryx’s Machine Learning Tools

Step 1: Create a Regression Model Using Alteryx’s Assisted Modeling Tool
  1. From the Machine Learning Palette, drag the Assisted Modeling Tool onto the Canvas.
  2. In the configuration window, click on Start Assisted Modeling. From there, we can let Alteryx determine the most suitable configuration for our problem and construct the machine learning model automatically.

 

 

Step 2: Use the Model to Make Predictions on the Test Data
  1. Connect the M (model) anchor of the Train Model Tool to the M anchor of the Predict Tool.
  2. Connect the Test Data (labelled “1” in the screenshot) to the D (data) anchor of the Predict Tool.
  3. Connect the output anchor of the Predict Tool to the L anchor of the Join Tool, and the Sample Submission Format file to the R anchor of the Join Tool. This ensures that the output format is consistent with what Kaggle’s autograder expects.
  4. Output the resulting data as a csv file using the Output Data Tool.

 

Step 3: Submit to Kaggle and Compare the Results

The evaluation metric used in this Kaggle competition is SMAPE, which stands for Symmetric Mean Absolute Percentage Error. As with all error-based metrics, a smaller score means a lower error and therefore a better model.
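To make the metric concrete, here is a minimal NumPy sketch of one common SMAPE definition: the mean of |forecast - actual| divided by the average of |actual| and |forecast|, expressed as a percentage. The function name smape and the choice to count a 0/0 term as zero error are assumptions for illustration; Kaggle's own implementation may differ in detail.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)

    diff = np.abs(forecast - actual)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0

    # Assumption: when actual and forecast are both 0, count the term as 0 error.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * ratio.mean()

# Hypothetical values, only to show the call.
print(smape([100, 200, 300], [110, 190, 330]))
```

Because each term is bounded, the score lies between 0 (perfect predictions) and 200.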

We can see that our Alteryx model has achieved a public score of 7.80749. This score would place us in 653rd place out of 1591 competitors, or in the top 41%. Not bad for an out-of-the-box model without much feature engineering or hyperparameter tuning!

 

 

2. Predictive Modelling using Alteryx’s Python Tool

Building a machine learning model typically requires an iterative approach that involves a lot of feature engineering and hyperparameter tuning, often organised as a pipeline. In this example, our focus will be on demonstrating how to use the Python Tool in Alteryx to build models, rather than on how to formally create a proper machine learning model.

Step 1: Importing the Libraries and Connecting to the Training Data

Please see part I of the blog series for a detailed guide on how to bring in the Python Tool, import the libraries and connect to the training data in Alteryx.

 

 

Step 2: Training the Decision Tree Model Inside the Python Tool
  1. Split the training data into the target and the feature set. (Note that, in general, we should further split the training data into training and validation sets, or use a cross-validation scheme, so that the model is evaluated on data it has not seen. Here we will skip those steps for simplicity.)
  2. Scikit-Learn’s algorithms can’t work with text data directly, so we need to encode our categorical variables into numerical representations. There are many ways to encode data; here we’ve chosen the one-hot encoding method.
  3. We will use the DecisionTreeRegressor algorithm from the Scikit-Learn library and fit the decision tree on the feature set and the target variable.
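The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the workflow: the small training frame and the “country” and “product” column names are hypothetical stand-ins, while “num_sold” is the target named in the case study. One-hot encoding is done here with pandas’ get_dummies(), which is one of several ways to do it.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the training data read into the Python Tool.
train = pd.DataFrame({
    "country": ["Finland", "Norway", "Sweden", "Finland", "Norway", "Sweden"],
    "product": ["Mug", "Hat", "Mug", "Hat", "Mug", "Hat"],
    "month": [1, 1, 2, 2, 3, 3],
    "num_sold": [120, 340, 150, 310, 170, 400],
})

# 1. Split into target and feature set.
y = train["num_sold"]
X = train.drop(columns=["num_sold"])

# 2. One-hot encode the categorical (text) columns.
X_encoded = pd.get_dummies(X, columns=["country", "product"], dtype=int)

# 3. Fit a decision tree with default settings (random_state only fixes tie-breaking).
model = DecisionTreeRegressor(random_state=0)
model.fit(X_encoded, y)

print(X_encoded.columns.tolist())
```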

 

 

Step 3: Make Predictions on the Test Data Inside the Python Tool
  1. We can simply use the model’s predict() method to make predictions on the test data.
  2. We then turn our prediction results back into tabular format using the pandas DataFrame() constructor.
  3. Next, we use Alteryx.write() to return our predictions to the Alteryx workflow.
  4. Finally, we can connect the Python Tool’s designated output anchor (#1 in this case) to the Join Tool in the same way as Step 2 of the previous section, and output our results as a CSV file.
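A rough sketch of these steps is shown below, assuming the test data is encoded the same way as the training data. The tiny frames and the “store” column are hypothetical. The reindex() call is an extra safeguard that is not described above: it aligns the test columns with the training columns in case a category is missing from one of the sets. The ayx package only exists inside the Alteryx Python Tool, so the write is wrapped in a fallback that prints the result elsewhere.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for the training and test data.
train = pd.DataFrame({
    "store": ["A", "B", "A", "B"],
    "month": [1, 1, 2, 2],
    "num_sold": [10, 20, 30, 40],
})
test = pd.DataFrame({"store": ["A", "B"], "month": [2, 1]})

X_train = pd.get_dummies(train.drop(columns=["num_sold"]), columns=["store"], dtype=int)
model = DecisionTreeRegressor(random_state=0).fit(X_train, train["num_sold"])

# Encode the test data and align its columns with the training columns.
X_test = pd.get_dummies(test, columns=["store"], dtype=int)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# 1. Predict, 2. wrap the NumPy array in a DataFrame.
predictions = model.predict(X_test)
output = pd.DataFrame({"num_sold": predictions})

# 3. Send the table to output anchor #1 when running inside Alteryx.
try:
    from ayx import Alteryx  # available only inside the Alteryx Python Tool
    Alteryx.write(output, 1)
except ImportError:
    print(output)
```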

 

 

Step 4: Submit to Kaggle and Compare the Results

We can see that our Decision Tree model built using the Python Tool has achieved a public score of 8.06374. This is worse than our previous score from the Alteryx Assisted Modeling process (remember the higher the score, the greater the model’s error). However, this is not to say that Scikit-learn necessarily performs worse than Alteryx, because we only used the default settings and haven’t performed any hyperparameter tuning for the Decision Tree.

 

 

 

3. AutoML using Alteryx’s Python Tool

Automated Machine Learning, or AutoML, is the process of automating the execution of various machine learning tasks, which can significantly speed up the model building process. There are many AutoML libraries available in Python, such as Auto-Sklearn, Auto-Keras and H2O-3. In this blog, we will use the PyCaret library to perform AutoML.

Step 1: Importing the Libraries and Connecting to the Training Data

Please see part I of the blog series for a detailed guide on how to bring in the Python Tool, import the libraries and connect to the training data in Alteryx.

 

 

Step 2: Performing AutoML using Pycaret
  1. Set up the data for AutoML using the setup() function, which prepares the data before it is fed into the machine learning algorithms. Make sure to set “silent = True”; otherwise, the function will wait for human input before it proceeds.
  2. Use the compare_models() function to train and evaluate multiple regression models automatically. We can set various options here, but for simplicity we will just use the default settings.

 

We can see that the Light Gradient Boosting Machine is the top model here, and this will be the model used for making predictions.
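Conceptually, compare_models() cross-validates a library of candidate models and ranks them by an evaluation metric. The sketch below illustrates that idea with plain scikit-learn instead of PyCaret, so it is not PyCaret’s implementation: the synthetic data, the three candidate models and the choice of mean absolute error as the ranking metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# A small "model library" to compare.
candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score every candidate with 5-fold cross-validation (lower MAE is better).
scores = {}
for name, estimator in candidates.items():
    neg_mae = cross_val_score(estimator, X, y, cv=5, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

# Rank the candidates and refit the best one on all of the data.
leaderboard = sorted(scores.items(), key=lambda item: item[1])
best_name = leaderboard[0][0]
best_model = candidates[best_name].fit(X, y)

for name, mae in leaderboard:
    print(f"{name}: MAE = {mae:.3f}")
```

PyCaret does the same kind of ranking over a much larger set of models, which is why a single function call can surface a strong candidate so quickly.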

 

 

Step 3: Make Predictions on the Test Data Inside the Python Tool
  1. We can use the predict_model() function to make predictions on the test data.
  2. We then drop the unnecessary columns and rename the “Label” column as “num_sold” to be consistent with Kaggle’s submission format.
  3. Next, we use Alteryx.write() to return our predictions to the Alteryx workflow.
  4. Finally, we can connect the Python Tool’s designated output anchor (#1 in this case) to the Join Tool in the same way as Step 2 of the previous section, and output our results as a CSV file.
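The clean-up in steps 2 and 3 can be sketched with pandas alone. The frame below is a hypothetical stand-in for what predict_model() returns (the input features plus a “Label” column holding the predictions), and the “country” and “product” column names are assumptions. The ayx package is only available inside the Alteryx Python Tool, so the sketch falls back to printing elsewhere.

```python
import pandas as pd

# Hypothetical stand-in for the output of predict_model().
predictions = pd.DataFrame({
    "country": ["Finland", "Norway"],
    "product": ["Mug", "Hat"],
    "Label": [251.3, 98.7],
})

# Keep only the prediction column and rename it to match the submission format.
submission = predictions[["Label"]].rename(columns={"Label": "num_sold"})

# Send the table to output anchor #1 when running inside Alteryx.
try:
    from ayx import Alteryx  # available only inside the Alteryx Python Tool
    Alteryx.write(submission, 1)
except ImportError:
    print(submission)
```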

 

 

Step 4: Submit to Kaggle and Compare the Results

We can see that our Light Gradient Boosting Machine built using the Python Tool has achieved a public score of 7.34608. This is a big improvement over both of our previous models. A score of 7.34608 would place us in 549th place out of 1591 competitors, or in the top 35%. This is not a bad result considering how quickly and easily we were able to develop this model in Alteryx with the help of the Python Tool!

 

 

I hope you have found this blog series helpful and interesting! See you in my next blog!

 

 

 

Author: Martin Ding