Context

This is part 3 of the predictive series. Part 1 and Part 2 went through the necessary preliminary steps, and now we can start modelling. Funnily enough, this is the step that often takes the least time out of the whole predictive process. Although you do have some options to fine-tune your models, this blog will focus on giving an overview of how each model works, where to use it, and how to evaluate its performance.

 

Modelling

As I have stated previously, the Titanic data set is a classification problem. So the following are all models often used for classification.

The Steps

The first step in any predictive process is splitting the data into Estimation, Validation, and Holdout streams. The estimation anchor holds the data used for training the model. The validation anchor holds the data the model is tested against. The holdout anchor is often used as a ‘sanity check’ when there have been multiple iterations of the model tested against the validation stream; this is to prevent a model that over-fits the test dataset but struggles when trying to predict new data. The common E:V:H split is 70:20:10, or 80:20 as shown here when the holdout anchor isn’t included.
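The blog's workflow does this split with Alteryx tools, but for readers who prefer code, the same 70:20:10 split can be sketched with scikit-learn (the DataFrame and column names here are invented for illustration, not taken from the workflow):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the Titanic data.
df = pd.DataFrame({"feature": range(100), "Survived": [i % 2 for i in range(100)]})

# First carve off 30% of the rows, then split that 30% into validation and holdout,
# giving a 70:20:10 Estimation/Validation/Holdout split.
estimation, rest = train_test_split(df, test_size=0.3, random_state=42)
validation, holdout = train_test_split(rest, test_size=1 / 3, random_state=42)

print(len(estimation), len(validation), len(holdout))  # 70 20 10
```

Fixing `random_state` makes the split reproducible, which matters when you later compare several model iterations against the same validation stream.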

Next, the E anchor is connected to the selected model so that the model can be trained. A model usually has 2-3 output anchors. The O anchor carries the actual output of the model; it holds no additional information and is just used to connect the model for later scoring on new data. The R anchor is the summary report produced by the R software that powers the predictive tools. The I anchor is an interactive dashboard that gives you some accuracy feedback and plots.

We will now test the model on the validation data we created through the sample. Use the Score tool in the predictive palette and connect the V – Validation anchor to its D – Data anchor and the O – Output anchor to the M – Model anchor. This should produce decimal results, which we can round to 0s and 1s. Join the new results back with the actual validation data results and compare the two. This can be easily done using the formula:

And then obtaining the average, which will be the accuracy percentage.
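The round-compare-average step above can be sketched in plain Python (the scores and actual values below are made up for illustration; in the workflow they come from the Score tool and the validation stream):

```python
scores = [0.91, 0.12, 0.55, 0.40, 0.78]  # decimal outputs from scoring
actual = [1, 0, 1, 1, 1]                  # true Survived values from validation

# Round each score to 0 or 1 (an implicit 0.5 threshold).
predicted = [round(s) for s in scores]

# 1 where prediction matches the actual value, 0 where it doesn't.
matches = [1 if p == a else 0 for p, a in zip(predicted, actual)]

# The average of the 0/1 column is the accuracy percentage.
accuracy = sum(matches) / len(matches)
print(f"{accuracy:.2%}")  # 80.00%
```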

 

Logistic Model

Overview

The logistic model is a classical statistical model that applies a log-odds transformation to a linear regression. Like the linear regression model, it provides an explanatory model as well as a predictive one. A logistic model performs very well in predicting simple binary classifications, although that is the only type of classification it is really used for.
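Outside of the Alteryx tools, a logistic model can be fitted in a few lines of scikit-learn. This is a minimal sketch on made-up data, not the blog's Titanic workflow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Binary target driven by a linear signal, standing in for "Survived".
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Like the Score tool, predict_proba gives decimal scores between 0 and 1.
probs = model.predict_proba(X)[:, 1]
print(model.score(X, y))  # training accuracy
```

The decimal scores are the same kind of output the Score tool produces, and rounding them at 0.5 recovers the 0/1 predictions.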

ACCURACY: 4/5

INTERPRETABILITY: 4/5

ADAPTABILITY: 2/5

Interpreting

For the Report output, you should mainly focus on the parts I have shown in red. The number of * indicates the significance of the variable, so *** here marks the variables with the highest impact on whether the passenger survived or not. The AIC is an indication of how good a fit the model is: the lower the AIC, the better. It is used when comparing logistic models that differ slightly, e.g. including or excluding a variable, or adding a variable you aren’t so sure makes an impact.
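As a rough sketch of how that comparison works, AIC is 2k − 2·log-likelihood, where k is the number of parameters, so adding a variable only pays off if it improves the likelihood enough to beat the complexity penalty. The code below computes an approximate AIC for two logistic models on invented data (scikit-learn regularises by default, so this is not an exact maximum-likelihood AIC as R reports it):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(model, X, y):
    # AIC = 2k - 2 * log-likelihood; lower is better.
    p = model.predict_proba(X)[:, 1]
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1  # slope coefficients plus the intercept
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
# Only the first column drives the outcome; the second is pure noise.
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

full = LogisticRegression().fit(X, y)         # includes the noise variable
reduced = LogisticRegression().fit(X[:, :1], y)  # drops it
print(round(aic(full, X, y), 1), round(aic(reduced, X[:, :1], y), 1))
```

Comparing the two numbers tells you whether the extra variable earns its keep: if the reduced model's AIC is no higher, the dropped variable wasn't adding real signal.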

For the Interactive output, you can just focus on the summary tab. The top four are all standard methods of measuring the accuracy of a model, with blue showing accuracy based on all results and red showing accuracy based on the confusion matrix. I will now sidetrack for a bit and go over the confusion matrix, as I think it’s an important part of predictive modelling.

Confusion Matrix

The predicted column on the left shows the prediction we made with our model: predicted positive means we predicted 1, i.e. the passenger survived, and vice versa, predicted negative means 0, i.e. they didn’t survive. The actual positive and actual negative labels show whether they actually survived or not. So the 223 in actual positive and predicted positive is the number of people that survived whom we correctly predicted. The 59 is the number of people we predicted survived but who in actuality didn’t.
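A confusion matrix is easy to reproduce outside the interactive dashboard. Here is a tiny sketch with invented labels (note that in scikit-learn the rows are actual classes and the columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix in the order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tp, fp, fn, tn)
```

Here `tp` plays the role of the 223 (correctly predicted survivors) and `fp` the role of the 59 (predicted survived, actually didn't) in the matrix described above.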

Why does this matter? In a lot of circumstances, predicting positive when the actual is negative and predicting negative when the actual is positive carry different levels of impact. For example, imagine a model that could predict COVID. Predicting that someone has COVID when they actually don’t might just mean the minor inconvenience of getting tested. However, if you predict that they don’t have COVID when they actually do, they could spread it to many others, and the cost would be much greater. Therefore you might sometimes skew your model to favour one side of the confusion matrix, even if it lowers your overall accuracy.
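The usual way to skew a model like this is to move the decision threshold away from 0.5. A minimal sketch, with invented scores: lowering the threshold catches more true positives (fewer missed COVID cases) at the cost of more false alarms.

```python
scores = [0.9, 0.6, 0.45, 0.3, 0.2, 0.55]  # model's decimal outputs (invented)
actual = [1,   1,   1,    0,   0,   0]

def predict(threshold):
    # Classify as positive when the score reaches the threshold.
    return [1 if s >= threshold else 0 for s in scores]

print(predict(0.5))  # default threshold misses the 0.45 positive
print(predict(0.4))  # lower threshold catches it, risking more false alarms
```

At threshold 0.5 the positive scored 0.45 is missed (a false negative); at 0.4 it is caught, while the negative scored 0.55 is a false positive at both thresholds.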

Hopefully that explanation wasn’t as confusing as the name.

Final Result:

80.33%

A solid result: classification models generally aim for around 80% overall accuracy.

 

Decision Tree

Overview

Without going into too much detail, a decision tree makes a bunch of yes and no style decisions.

It’s often used to help interpret and explain the effect of different variables on the model, and generally has lower accuracy compared to other models. The yes/no decisions can also be used for continuous variables, deciding whether the value is greater or less than a certain number. This means decision trees can be used not only for classification but also for regression.
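Those greater-than/less-than splits are easy to see in code. This sketch trains a one-level tree on invented data and prints its rules, mirroring the text version the R anchor produces:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy continuous variable: low values are class 0, high values class 1.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)

# export_text prints the tree as greater-than/less-than rules.
rules = export_text(tree, feature_names=["age"])
print(rules)
```

The tree picks a cut-off between the two groups (here the midpoint, 6.5) and predicts class 0 below it and class 1 above it, which is exactly the kind of decision described above.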

ACCURACY: 2/5

INTERPRETABILITY: 5/5

ADAPTABILITY: 5/5

Interpreting

For the decision tree, the I anchor gives an actual plot of how the decisions were made.

We see here that the decision tree has done some feature selection. It has omitted the other variables and decided that the three most important are Title, Pclass and Fsize. Title is the most important indicator: if a passenger has Mr or Other as their title, the tree predicts they didn’t survive. The next two decision points were whether the passenger class was 3 and whether the family size was 5 or greater. The R anchor also gives a text version of this visual.

One thing to note here is that the decision tree actually has a more balanced confusion matrix.

Final Result:

83.71%

The model has achieved a high result here because of the cleaning done beforehand and also the limited number of decisions it had to make. This is generally quite a simple dataset, so you can’t expect this sort of result from decision trees consistently.

Author: The Data School