My previous blog post demonstrated how to develop machine learning models in Python using the scikit-learn library. In this post, I will build machine learning models in Alteryx, which offers machine learning through an easy-to-use graphical user interface for data processing and modeling.

Introduction to Machine Learning Tools and Key Steps

Exploratory Data Analysis

We can easily summarize and aggregate data using the Summarize tool, and we don’t even need to switch to a coding language like Python to produce custom plots. Alteryx provides various plots and reports, such as histograms, scatter plots, violin plots, association rules, and heat maps. All of these tools can help you understand the relationships and patterns in the data set.
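For readers coming from the Python side, the kind of grouping the Summarize tool performs can be sketched in a few lines. The rows, column names, and the summarize helper below are made up for illustration; this is a rough analogy, not what Alteryx runs internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; the column names are assumptions for illustration only.
rows = [
    {"occupation": "Sales", "hours_per_week": 40},
    {"occupation": "Sales", "hours_per_week": 50},
    {"occupation": "Tech-support", "hours_per_week": 38},
]

def summarize(rows, group_field, value_field):
    """Group rows by one field and return the count and average of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {
        key: {"count": len(values), "avg": mean(values)}
        for key, values in groups.items()
    }

summary = summarize(rows, "occupation", "hours_per_week")
```

Each key of summary is one group, which roughly corresponds to one output row of a group-by with a count and an average.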

Data Cleaning and Feature Engineering

To deal with unstructured or structured text data, Alteryx has the Text to Columns tool and the RegEx tool. We can also use a variety of formula tools to create new columns for feature engineering and then use the Join tool to join them back to the data. The Data Cleansing and Select tools help make sure the data is in the right format, and the Unique tool removes duplicate rows. We can also use the Imputation tool to fill in missing values. Alteryx also provides the Python tool, which allows you to integrate Python code into your workflow.
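As a rough Python analogy for two of these steps, the sketch below parses a text field with a regular expression and drops duplicate rows. The record layout, the pattern, and the helper names parse and unique are assumptions chosen for illustration, not part of the workflow in this post.

```python
import re

# Hypothetical raw records; the "City, ST 12345" layout is an assumption.
raw = [
    "Sydney, NSW 2000",
    "Melbourne, VIC 3000",
    "Sydney, NSW 2000",  # exact duplicate of the first record
]

PATTERN = re.compile(r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]+)\s+(?P<postcode>\d+)$")

def parse(record):
    """Split one text record into named columns, or return None if it does not match."""
    match = PATTERN.match(record.strip())
    return match.groupdict() if match else None

def unique(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    result = []
    for record in records:
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result

parsed = [parse(record) for record in unique(raw)]
```

Removing duplicates before parsing keeps the regular expression work to a minimum; the order of the two steps does not change the result here.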

Data Modeling

Alteryx offers a wide range of models, from linear regression, logistic regression, decision tree, random forest, and SVM to boosted models and even neural networks. The best part is that these models can be trained without writing any code.

Model Validation

This is the most important step because we must make sure that the model performs as intended. The Score tool helps you make predictions, and the Model Comparison tool helps you understand how the models differ and how well each one performs.

Use Case Study

We have gone through the various tools used in each key machine learning step. Now I will use a data set (shown below) to demonstrate how to build a machine learning model in Alteryx. My goal is to predict the salary band (the target field) in the test set.

Step 1: Import the data, then use the Field Summary tool to perform EDA. The output of the Field Summary tool also includes a concise summary report of descriptive statistics for the selected data columns.

Step 2: Use the Formula tool to fill in the missing values found during EDA (columns “Country,” “Occupation,” and “Workclass”). Also apply one-hot encoding to convert the categorical variables to binary variables (here I use the Python tool to do the one-hot encoding).
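To make Step 2 concrete, here is a minimal sketch of filling missing categories and one-hot encoding in plain Python. The sample rows, the "Unknown" placeholder, and the helper names fill_missing and one_hot are assumptions for illustration; the post does not show the actual script or fill value used. In a pandas-based script, pandas.get_dummies is the usual way to do the encoding.

```python
# Hypothetical rows; None marks a missing value. The column names follow the
# post, but the values are made up.
rows = [
    {"Workclass": "Private", "Occupation": "Sales"},
    {"Workclass": None, "Occupation": "Tech-support"},
    {"Workclass": "State-gov", "Occupation": None},
]

def fill_missing(rows, columns, placeholder="Unknown"):
    """Replace missing values in the given columns with a placeholder category."""
    return [
        {k: (placeholder if k in columns and v is None else v) for k, v in row.items()}
        for row in rows
    ]

def one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

filled = fill_missing(rows, ["Workclass", "Occupation"])
encoded = one_hot(one_hot(filled, "Workclass"), "Occupation")
```

Filling the gaps first means the missing values become their own category, so every row ends up with exactly one 1 per original column.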

Step 3: When the data preparation is complete, use the Create Samples tool to split the data set into train and test sets. The train set is then connected to the machine learning models. To predict the salary band, I used the Decision Tree, Random Forest, and Boosted Model tools. Before connecting to the Model Comparison tool, the outputs of these models must be combined with the Union tool.
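The split in Step 3 can be pictured with the short sketch below. The 70/30 ratio, the seed, and the function name split_train_test are assumptions for illustration; the post does not state the percentages configured in the workflow. With scikit-learn installed, train_test_split does the same job.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in data: ten row ids instead of real records.
rows = list(range(10))
train, test = split_train_test(rows)
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same test set.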

Step 4: After you’ve finished training the models, use the Model Comparison tool to compare their performance on the test data set. Based on the ROC curve, accuracy, and F1 score in the comparison output, the random forest model has the best predictive result.
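For reference, accuracy and F1 score can be computed by hand as in the sketch below. The labels and the predictions of "model_a" and "model_b" are toy values invented for illustration; they are not the results from this workflow.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predictions, made up for this example.
y_true = [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"]
predictions = {
    "model_a": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
    "model_b": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
}

scores = {
    name: {"accuracy": accuracy(y_true, y_pred), "f1": f1_score(y_true, y_pred, ">50K")}
    for name, y_pred in predictions.items()
}
best = max(scores, key=lambda name: scores[name]["f1"])
```

Here the best model is chosen by F1 score alone; in practice it is worth looking at several metrics together, as the comparison in Step 4 does.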

Summary:

Alteryx is an excellent data analytics tool that provides a machine learning interface for non-programmers. In my opinion, Python and Alteryx can be used in tandem to maximize efficiency in the data science process.

The workflow

 

Author: Gary Li