Context
This blog covers the feature engineering stage of the model building process. If you want more context on how to explore the data and on the data modelling process, check Part 1. Otherwise, download the Titanic dataset here. The workflow is available here and the macro here.
Feature Engineering
String Variables
Creating a Macro
From the last blog, you may have realised we did a lot of visualisations using this same process.
I have also hinted at creating a macro for this, so I will quickly show how to create one, as it will be commonly used in predictive work. Firstly, create a new workflow and set it to a standard macro.
Create a new dataset with just PassengerId, Survived and one other variable. Output it as a CSV file, bring that data in ahead of the first Summarize tool and convert it to a macro input. Convert the Browse tool to a macro output. Save the macro and insert it into the workflow. There is a way to pass in the variable name dynamically, but for simplicity's sake we will use the simpler method: put a Select tool to rename the field to the variable name and a Browse tool at the end of the macro.
The yellow highlighted tool is the macro (I have added a custom image for it).
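For readers who think in code, here is a rough Python analogue of what a reusable summary step like this might do. It is only a sketch: it assumes the macro counts records and works out the survival rate per category, which this post does not spell out, and the function name `survival_by` and the toy rows are made up for illustration.

```python
from collections import defaultdict

def survival_by(rows, column):
    """Return {category: (record count, survival rate)} for one column."""
    counts = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        key = row[column]
        counts[key] += 1
        survived[key] += row["Survived"]
    return {key: (counts[key], survived[key] / counts[key]) for key in counts}

# Toy rows, invented for illustration (not read from the Titanic file).
rows = [
    {"PassengerId": 1, "Survived": 0, "Sex": "male"},
    {"PassengerId": 2, "Survived": 1, "Sex": "female"},
    {"PassengerId": 3, "Survived": 1, "Sex": "female"},
    {"PassengerId": 4, "Survived": 0, "Sex": "male"},
]
summary = survival_by(rows, "Sex")
```

Passing the column name as an argument plays the same role as renaming the field with the Select tool: the summary logic stays fixed and only the variable changes.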
Name
In most predictive models, names have little to no use. However, in the Titanic data, the title is actually located in the name. It’s in the format of Surname, Title. First Name Middle Name.
This expression allows us to parse out only the title from Name.
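As a rough equivalent outside Alteryx, the same idea can be sketched in Python with a regular expression. This is not the expression used in the workflow; it simply assumes the title is whatever sits between the first comma and the next full stop, and the function name `extract_title` is hypothetical.

```python
import re

def extract_title(name):
    """Return the text between the first comma and the following full stop."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else None

# Names in the dataset's "Surname, Title. First Name" format.
title_a = extract_title("Braund, Mr. Owen Harris")
title_b = extract_title("Heikkinen, Miss. Laina")
```

A pattern like this keeps multi-word titles whole, so it may behave differently from the workflow's expression on unusual names.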
Looking at the distribution, there seem to be four major titles: Mr, Miss, Mrs and Master. Mme, Mlle and Ms are alternative forms of Mrs and Miss, so we can group those together. We notice there was one parsing error that resulted in 'the'. This turns out to be 'The Countess', but it can be left as is because we will group all remaining titles as 'Other'.
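The grouping described above could be sketched in Python like this. Treat it as an illustration under assumptions: mapping Ms to Miss (rather than Mrs) is my own choice, and the names `ALIASES`, `MAJOR` and `group_title` are hypothetical.

```python
# Alternative forms folded into the main titles.
# Assumption: Ms is mapped to Miss; the post only says these are grouped together.
ALIASES = {"Mme": "Mrs", "Mlle": "Miss", "Ms": "Miss"}
MAJOR = {"Mr", "Miss", "Mrs", "Master"}

def group_title(title):
    """Map a parsed title to one of the four major titles or 'Other'."""
    title = ALIASES.get(title, title)
    return title if title in MAJOR else "Other"

grouped = [group_title(t) for t in ["Mr", "Mlle", "Mme", "the", "Dr"]]
```

Anything that is not a major title, including the stray 'the' from the parsing error, falls into 'Other' without needing special handling.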
Cabin
Cabin has a huge number of nulls, so it is questionable whether to keep it, because having a lot of predictors may not actually benefit the model. However, we will later use models that can help decide whether to remove or keep it during testing. I have decided to group by the first letter of the cabin, with 'U' representing unknown for any nulls.
You can see there aren't many records for each known letter. This reduces the reliability of any results (generally you want around 30 records per group). There is a case to be made for grouping cabin simply as known or null. We can create both variables and then test them during the modelling process.
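Both cabin variables are simple to express in code. The sketch below assumes a missing cabin arrives as `None` or an empty string, and the function names `cabin_deck` and `cabin_known` are hypothetical.

```python
def cabin_deck(cabin):
    """First letter of the cabin value, or 'U' (unknown) when it is missing."""
    return cabin[0].upper() if cabin else "U"

def cabin_known(cabin):
    """1 when a cabin is recorded, 0 when it is null."""
    return 1 if cabin else 0

decks = [cabin_deck(c) for c in ["C85", "E46", None]]
flags = [cabin_known(c) for c in ["C85", "E46", None]]
```

Keeping the two as separate columns makes it easy to try each one in the model and drop whichever adds less.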
Ticket Number
After cleaning the numbers off the ticket we got these results:
There is probably a way to group all of them in a meaningful way. Yet what remains of the ticket is mostly nulls. Where you can, avoid spending too long on engineering variables like this, as they are unlikely to change the accuracy by much. Therefore I will group them simply by whether the ticket has a prefix or is null.
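A Python sketch of this step is below. It assumes "cleaning the numbers off" means dropping the trailing numeric part of the ticket and keeping whatever prefix is left; the workflow may clean tickets differently, and the names `ticket_prefix` and `ticket_group` are hypothetical.

```python
def ticket_prefix(ticket):
    """Drop the trailing ticket number and return the prefix, or None if nothing is left."""
    tokens = ticket.split()
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]
    return " ".join(tokens) or None

def ticket_group(ticket):
    """Two-level grouping: the ticket either has a prefix or it does not."""
    return "Prefix" if ticket_prefix(ticket) else "Null"

groups = [ticket_group(t) for t in ["A/5 21171", "PC 17599", "113803"]]
```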
There doesn’t seem to be much of a difference between the two. This might be a variable to consider for removal from the model.
Numeric
Family Size
We can also create predictors from our own understanding of the data. People with larger families on the Titanic would look for their family after the collision occurred. This would cost valuable time, so we would assume their survival chances would decrease. So we create a family size variable using SibSp + Parch + 1 (the 1 counts the passenger themselves). Since there are only a few records above 5, we will just group them together.
We see that our hypothesis was correct. Large families did have a lower survival rate. This type of thinking is crucial in feature engineering. You should always be thinking about combining variables or introducing new ones through either research or joining other datasets.
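The calculation itself is a one-liner. In this sketch every size above 5 is collapsed into a single bucket stored as 6 (read it as "6 or more"); the exact label used in the workflow is not stated, so that choice and the name `family_size` are assumptions.

```python
def family_size(sibsp, parch, cap=6):
    """SibSp + Parch + 1 (the passenger), with every size above 5 grouped into one bucket."""
    return min(sibsp + parch + 1, cap)

alone = family_size(0, 0)      # travelling alone
couple = family_size(1, 0)     # one sibling or spouse aboard
large = family_size(8, 2)      # 11 people, grouped into the top bucket
```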
Age
Cabin and Ticket had far too many nulls, while Embarked only had two. In both situations there was little worth in doing a more complicated imputation. Age, on the other hand, is around 20 percent nulls. This makes it worthwhile to do something that takes a little more time. As I mentioned before, you can do feature engineering through research as well. We could use the title of Master or Miss to see if the person is married or not and then find the average age at which someone got married in the early 1900s. We could then impute the age as some number less than that, with consideration of other predictors.
However, the people on the Titanic may be a different crowd to the average American. In fact, some may not even be American. To solve this we can impute the age using other predictors in a model. This will also act as a nice warm-up before we actually model for survival.
As we need to predict the null values, those records won't be part of the data we feed into the model, the 'training' data. We can split them off using a Filter tool on whether Age is null. We will then train a simple decision tree on the records where Age is known.
From the Pearson correlation in blog 1, we see that age has the highest correlation with Pclass, Parch and SibSp. We will add Title as it helps see if the person is married or not. That in turn will be a good indication of age. We can also use Family which is a combination of Parch and SibSp.
To use the finished model, we will use the Score tool. Connect the decision tree to the 'M' model input anchor and the filtered nulls to the 'D' data input anchor. Select the new predicted values along with the passenger ID. Join on passenger ID with the full dataset and then union the inner and left outputs.
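For comparison, the whole imputation flow (filter, train, score, join back) can be sketched in Python with pandas and scikit-learn. This is an illustration, not the Alteryx workflow: `DecisionTreeRegressor` stands in for the decision tree tool, the `max_depth` setting is an arbitrary assumption, a left merge plus `fillna` replaces the join and union, and the rows are toy values invented for the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy rows, made up for illustration (not read from the Titanic file).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass":      [3, 1, 3, 1, 3, 2, 3, 1],
    "SibSp":       [1, 1, 0, 1, 0, 0, 3, 0],
    "Parch":       [0, 0, 0, 0, 0, 0, 1, 0],
    "Title":       ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Master", "Miss"],
    "Age":         [22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0, None],
})
df["Family"] = df["SibSp"] + df["Parch"] + 1

predictors = ["Pclass", "SibSp", "Parch", "Family", "Title"]
# One-hot encode Title so the tree receives numeric inputs only.
X = pd.get_dummies(df[predictors], columns=["Title"], dtype=float)

# The filter step: rows with a known Age are the training data.
known = df["Age"].notna()
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X[known], df.loc[known, "Age"])

# The score step: predict Age for the null rows, keeping only the id and prediction.
scored = df.loc[~known, ["PassengerId"]].copy()
scored["PredictedAge"] = model.predict(X[~known])

# Join back on PassengerId and fill Age only where it was null.
imputed = df.merge(scored, on="PassengerId", how="left")
imputed["Age"] = imputed["Age"].fillna(imputed["PredictedAge"])
imputed = imputed.drop(columns="PredictedAge")
```

Rows that already had an age keep their original value; only the two null rows pick up a prediction from the tree.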