A decision tree is a powerful machine learning model with the added bonus of being easy to understand, because the fitted model can be drawn as a diagram and read visually.
Luckily, last week we had our second taste of Predictive Analytics, where we looked at decision trees. A decision tree model is intuitive in the sense that every possible outcome of a decision branches out from a node, and it can be applied to both classification and prediction problems.
In the model, a node is either pure (every record in it belongs to the same class, so there is no more branching) or impure (the classes are mixed, so it can be split into further branches). To test the purity of a node, we looked at entropy: for a two-class target, an entropy of 1 means the node is as impure as it can be (an even split) and 0 means it is pure. In theory, branching can continue until all the nodes are pure. In practice, and especially for business logic, it is best to limit the level of branching, as we don’t want an over-fitted model.
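To make the entropy idea concrete, here is a minimal Python sketch. It is not part of the Alteryx workflow, and the function name and the "died"/"survived" labels are my own assumptions for illustration. It computes the Shannon entropy in bits of the class labels sitting in a node.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels found in one node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of p * log2(1 / p) over every class present in the node.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Illustrative nodes (made-up labels, not taken from the Titanic files).
pure_node = ["died"] * 8                      # one class only
mixed_node = ["died"] * 4 + ["survived"] * 4  # even two-class split

pure_entropy = entropy(pure_node)    # pure node: entropy of 0
mixed_entropy = entropy(mixed_node)  # even split: entropy of 1
```

A node that is mostly one class lands somewhere between 0 and 1, which is why entropy works as a sliding measure of impurity for a two-class target.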
Let’s step back from the business landscape and apply the decision tree model to something more frivolous.
I started by downloading the Titanic dataset from Kaggle, which is used for an ongoing introductory machine learning competition. The folder you download contains 3 files: one to train your model, another to test your model, and a final one to evaluate it.
Looking at the training data, the “Survived” field is labelled either 1 or 0. This is where some planning is needed. If the target is left as a number, the tool treats it as continuous and builds a regression tree. If you want a classification tree, you need to change the target variable to a string. This is easily done with the Formula tool.
There isn’t much else to do with the data in this scenario, so feed it into the Decision Tree tool and select your target variable and the relevant predictor variables. It is important to decide which predictors you want to use, as PassengerID, name and ticket number would be poor ones. Take PassengerID as an example: if you do not limit the branching, you may end up with 100% accuracy on the training data, because the tree can isolate every passenger and classify each one correctly. When you use that model on another dataset, however, it will fail badly, because it has been over-fitted.
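The PassengerID problem can be sketched in a few lines of Python. This is not a decision tree and not what Alteryx runs. It is a lookup table, which is roughly what an unlimited tree split on a unique ID turns into. The records are made-up toy values, and falling back to the majority class for unseen IDs is my own assumption.

```python
from collections import Counter

# Made-up toy records of (passenger_id, survived); not from the Kaggle files.
train = [(1, "0"), (2, "1"), (3, "1"), (4, "0"), (5, "0")]
unseen = [(101, "1"), (102, "0"), (103, "1")]

# "Training" on the ID alone amounts to memorising one answer per passenger.
memorised = dict(train)
# Assumption: for an ID never seen before, guess the most common training label.
fallback = Counter(label for _, label in train).most_common(1)[0][0]


def predict(passenger_id):
    """Return the stored label for a known ID, otherwise the majority class."""
    return memorised.get(passenger_id, fallback)


def accuracy(rows):
    """Share of rows whose predicted label matches the actual label."""
    correct = sum(predict(pid) == label for pid, label in rows)
    return correct / len(rows)


train_accuracy = accuracy(train)    # perfect, because every ID was memorised
unseen_accuracy = accuracy(unseen)  # no better than guessing the majority class
```

The ID carries no information about a new passenger, so the perfect training score says nothing about how the model behaves on fresh data.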
As an additional step, hit the Customize button at the bottom and change the hyperparameter settings. To prevent over-fitting, I capped the tree depth at 5 levels and required a terminal node to hold more than 1 record as a minimum. In the settings, you also want to go into Plots and make sure you select “Display tree plot”.
After you whack a Browse tool onto the Decision Tree tool, you should be met with a plot as seen below.
To read this, you ask the question displayed at each node and follow the matching branch. Note that the nodes nearest the top are the biggest determining factors in how a record is classified. In this case, whether you were male or female was a big determining factor in whether you would have survived the Titanic. Now, let’s follow the tree. The first node asks, “Is the sex of the passenger male?” If yes, is their age over 6.5 years? If yes, it is very likely that they did not survive, but if they were younger, the tree says they survived. This matches the historical narrative that children were among the first to be placed in the lifeboats. Looking at the females, if your passenger class was less than 2.5 (that is, first or second class), it was likely that you survived the Titanic.
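Reading the tree is really just following nested if/else rules. Here is a rough Python sketch of only the branches described above. The function name, argument names and the "survived"/"died" labels are hypothetical. The branch for females in a class above 2.5 is not spelled out in the text, so the sketch returns None there rather than guessing.

```python
def predict_survival(sex, age, pclass):
    """Follow the branches described in the text; None where the text is silent."""
    if sex == "male":
        # Second question on the male branch: is the passenger over 6.5 years old?
        if age > 6.5:
            return "died"
        return "survived"
    # Female branch: a passenger class below 2.5 means first or second class.
    if pclass < 2.5:
        return "survived"
    # The text does not say what the tree does with females in third class.
    return None


adult_man = predict_survival("male", 30, 3)
young_boy = predict_survival("male", 4, 2)
first_class_woman = predict_survival("female", 28, 1)
```

Each call walks one path from the top of the tree to a terminal node, which is exactly what you do by eye when reading the plot.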
What are the next steps?
Seeing as we have been provided with a test set of data, we will test our model. To do so, a Score tool is used to run the trained model against an outside dataset. The same principle of having a training and a test set can also be achieved with a single dataset by splitting it with the Create Samples tool. The Score tool gives us, for each passenger, the probabilities the model assigns to surviving or not surviving the Titanic, given that passenger’s predictor variables. A Summarize tool was then used to add up all the correct classifications and get a total count, and a final Formula tool was used to calculate the model’s accuracy.
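The accuracy calculation at the end boils down to a count and a division. Here is a minimal Python sketch of that arithmetic, under some assumptions of my own: the scored rows are made-up values, and a passenger is classed as a survivor when the scored probability of survival is at least 0.5. The workflow itself does this with the Summarize and Formula tools.

```python
def model_accuracy(scored_rows, threshold=0.5):
    """Fraction of rows where the thresholded score matches the actual label.

    scored_rows: list of (actual_label, probability_of_survival) pairs,
    with labels stored as the strings "1" (survived) and "0" (did not).
    """
    if not scored_rows:
        raise ValueError("no scored rows to evaluate")
    correct = 0
    for actual, prob_survived in scored_rows:
        # Assumption: classify as survived when the score reaches the threshold.
        predicted = "1" if prob_survived >= threshold else "0"
        if predicted == actual:
            correct += 1
    return correct / len(scored_rows)


# Made-up scored output, not real results from the workflow.
scored = [("1", 0.91), ("0", 0.12), ("1", 0.35), ("0", 0.08)]
accuracy = model_accuracy(scored)  # 3 of the 4 rows are classified correctly
```

The third row is the miss: the passenger survived, but a score of 0.35 falls below the threshold, so it counts against the model.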