In this blog, I demonstrate a simple Alteryx workflow for predicting the chance of survival of passengers travelling on the Titanic. Variants of the data set I used can be found on the Kaggle website. However, before explaining the workflow, it is important to first …

 

 

Understand the Data

I have attached a sample of the data below. The table contains records of whether a passenger survived the sinking as well as personal and travel details.

 

Sample of the Titanic data

 

When predicting an outcome, it is important to decide whether the outcome is numeric or categorical (e.g. a binary outcome such as win or loss). Numeric outcomes (for example, predicting the price of a product from a set of variables) typically call for the Linear Regression or Boosted Model tools. To predict whether a passenger survives, however, only two values are possible (a binary outcome), so a Logistic Regression is used in this example. This predicts the probability of survival from the given variables. Note the “Survived” field in the table above, which is useful for …
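Alteryx estimates the model for you, but the idea behind logistic regression can be sketched in a few lines of Python: a weighted sum of the input variables is squashed through the logistic (sigmoid) function to give a probability between 0 and 1. The coefficients below are made up purely for illustration, not taken from the actual Titanic model:

```python
import math

def survival_probability(coeffs, intercept, features):
    """Logistic regression: squash a linear combination into a 0-1 probability."""
    z = intercept + sum(c * x for c, x in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two indicator features: [is_female, is_first_class]
coeffs, intercept = [2.5, 1.0], -1.5

p_female_first = survival_probability(coeffs, intercept, [1, 1])  # woman in first class
p_male_third = survival_probability(coeffs, intercept, [0, 0])    # man in third class
print(round(p_female_first, 2), round(p_male_third, 2))
```

The same sigmoid shape is what guarantees the model's output can be read directly as a probability, which is what the Score tool later reports.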

 

 

Supervised Learning via Logistic Regression

Before trying to predict outcomes, we must first have a model that can accurately predict a known target variable (i.e. the “Survived” field in this case). Whilst preparing the data for the modelling process, I find the Field Summary tool particularly useful.

 

Field Summary Output

 

As shown above, the tool gives an easy visualisation of the contents of each field in the table. For example, there are missing values in the “Embarked” field, and there are 3 unique classes of travel (in the “Pclass” field). Filtering out or replacing the missing “Embarked” values is preferable so that the regression tools can function normally. In addition, remember to enter model names (when configuring predictive tools) without spaces; I find that model names with spaces can cause errors. Once the data is clean, it can be passed to the Logistic Regression tool. I have included a brief overview of the workflow and the main inputs to the tools:

 

Brief LogReg and Stepwise description
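The cleaning step for the missing “Embarked” values can be done either by filtering the affected rows out or by imputing the most common port. A minimal sketch of both options in Python, using a small made-up sample of rows (the real workflow does this with Alteryx's Filter or Imputation tools):

```python
from collections import Counter

# Made-up sample rows; one passenger is missing the "Embarked" port
passengers = [
    {"Name": "A", "Pclass": 1, "Embarked": "S"},
    {"Name": "B", "Pclass": 3, "Embarked": None},  # missing value
    {"Name": "C", "Pclass": 2, "Embarked": "C"},
    {"Name": "D", "Pclass": 1, "Embarked": "S"},
]

# Option 1: drop rows with a missing "Embarked" value (Filter-tool analogue)
cleaned = [p for p in passengers if p["Embarked"] is not None]

# Option 2: impute the most common port instead of dropping rows
most_common = Counter(p["Embarked"] for p in cleaned).most_common(1)[0][0]
imputed = [dict(p, Embarked=p["Embarked"] or most_common) for p in passengers]

print(len(cleaned), imputed[1]["Embarked"])
```

Imputation keeps all rows for training, at the cost of guessing the missing port; dropping rows is simpler when only a handful of records are affected, as here.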

 

 

Comparing Accuracy of Models

The Stepwise tool connects to the Logistic Regression tool to improve on the initial model. The accuracy of both models can be compared by using the Union tool to join the model outputs and then connecting the result to the model input of the Model Comparison tool.

Model Comparison workflow example

Model Comparison results

 

The error measures table (from the Model Comparison tool output) displays each model’s accuracy. A higher F1 score (more info on the F1 score here) usually implies a more accurate model. Another interesting observation is that the regression model from the Stepwise tool is more accurate yet uses only 4 variables, compared with the 8 variables used by the model from the Logistic Regression tool.
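The F1 score the Model Comparison tool reports is simply the harmonic mean of precision and recall. A short sketch of how it is computed from confusion-matrix counts (the counts below are hypothetical, not the actual Titanic results):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion-matrix counts for two competing models
print(round(f1_score(tp=90, fp=10, fn=30), 3))  # stronger model
print(round(f1_score(tp=80, fp=20, fn=40), 3))  # weaker model
```

Because the harmonic mean punishes imbalance, a model must do reasonably well on both precision and recall to achieve a high F1, which makes it a more robust single-number summary than accuracy alone on imbalanced outcomes like survival.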

 

Stepwise and Logistic Regression model reports

Logistic Regression Model Coefficients with Stepwise Tool (Left) and with Logistic Regression Tool (Right)

 

 

Applying Model to Data

Finally, I use the Score tool to apply the preferred model to a new data set (with the same fields). From the predicted outcomes (the “Chance of Survival” field), I find that women have a much better chance of surviving the sinking, especially those travelling in First and Second class. A summary of the results is provided below:

 

Score tool configuration

Final scored output from the logistic regression model
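What the Score tool does amounts to applying the fitted coefficients to each new record and reporting the resulting probability. A minimal Python analogue, with hypothetical coefficients (the real values would come from the fitted model's report):

```python
import math

def score(record, coeffs, intercept):
    """Apply a fitted logistic model to a new record (Score-tool analogue)."""
    z = intercept + sum(coeffs[k] * record[k] for k in coeffs)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; in Alteryx these come from the trained model object
coeffs = {"is_female": 2.5, "first_or_second_class": 1.2}
intercept = -1.8

new_passengers = [
    {"is_female": 1, "first_or_second_class": 1},  # woman, upper class
    {"is_female": 0, "first_or_second_class": 0},  # man, third class
]
chances = [round(score(p, coeffs, intercept), 2) for p in new_passengers]
print(chances)
```

With these illustrative weights, the woman in an upper class scores far higher than the man in third class, mirroring the pattern seen in the scored output above.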

 

I hope that this blog post has been of use to you! Also, check out these interesting blogs from the Data School UK:

 

 

Author: Alex Chan