Getting Started
This will be a step by step guide for doing predictive analytics in Alteryx, using the example of the Titanic dataset. It will also cover the basic formula for you to complete your own predictive models in the future. This blog is adapted from Titanic Data Science Solutions by Manav Sehgal, which I highly recommend to check out if you want to complete this exercise in python.
Download the datasets from Kaggle here. You will receive 3 datasets: train.csv – for modelling and analysis, test.csv – for final submission on Kaggle, gender_submission.csv – a sample submission file.
Systematic Analysis
The workflow or method used to achieve and solve analytical problems.
- Identify the problem
- Research
- Data Exploration
- Feature Engineering
- Modelling and Testing
- Visualise and Report
The steps may be combined or separated depending on the situation. Steps may also be introduced earlier, later or skipped. For example analysis may be done through visualising or a problem might only be identified after some research or data exploration.
Part 1 will go through the first 3 steps. It is more of an introduction to the methodology and understanding before modelling. You can skip to Part 2 if you want to go straight to the feature engineering and Part 3 for modelling and testing.
Identifying the Problem
Because Kaggle is a competition site, the problem has already been predefined. On the website it states:
“Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.”
Some more contextual understanding can be found on the Kaggle competition description page here. Here are the highlights to note:
- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Research
Usually in this step you would research on what other people previously have done in the same field. Researching articles and papers also give you more context on the data and a better sense of what direction to take. We will skip this step in this guide, but if you want you can take sometime to see what predictive models others have used.
Data Exploration
Understanding the Data
From the descriptions of the Kaggle data page and a browse tool, we can identify the variables as follows:
Categorical:
These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.
- Categorical: Survived, Sex, and Embarked.
- Ordinal: Pclass.
Numerical:
These values change from sample to sample. Within numerical features are the values discrete, continuous, or time-series based? Among other things this helps us select the appropriate plots for visualization.
- Continuous: Age, Fare. Discrete: SibSp, Parch.
Mixed:
Numerical, alphanumeric data within same feature. These are candidates for correcting goal.
- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.
Errors and Nulls:
- Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
- Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
- Cabin > Age are also incomplete in the given test dataset.
Feature Analysis
Before we start the feature analysis, we need to change each predictor to the desired data type. A easy way to do this is to use the auto field tool, which will also assign the correct length for strings. Putting a select tool next, we can see that everything seems to be correctly assigned. However, this can reduce the performance of the workflow if run on bigger datasets. Apart from caching, it’s actually better to save the data type configuration. You can do this via the configuration screen of the select tool.
Save the field config. and then delete the auto field. You can then use the Load Field Names and Types option which is last option below Save Field Config. as seen above.
Now that the data types are as desired, we will go through all variables individually and decide on how to approach them. It is also possible to do this mainly visual analysis step in Tableau and probably much easier for smaller datasets.
Numerical Variables
SibSP
We want to see how the number of siblings/spouses a passenger has can affect their survival rate. Summarize the data like following:
We then want to get the survival rate as a percentage of those who survived.
Note that for some values of SibSP, there are no values for survived = 1, so here we will just use survived = 0 as a proxy. This is fine as we for the next step we summarize, grouping by SibSP and taking max of Survival Percentage (%). Use an interactive chart tool afterwards with a browse tool and you should see something like this:
From this we can clearly see that passengers having a lot of siblings/spouses have less chance to survive. Quite an interesting observation that may not have been found using intuition.
Parch
Repeat similar steps as previously (using either macro, or changing the variables after copy and pasting). You should reach this chart:
Small families seems to have more chances to survive. This also seems to have some sort of correlation with SibSP both intuitively (because both relate to family) and also by prelim analysis.
Age
From the chart you can see that generally the youngest group had a much higher survival rate. There seems to be some degree of the distribution being skewed but I think its within an acceptable range.
Fare
The distribution for Fare, is very positively skewed. As such the high fare values will have very little impact in the model. Two approaches to help alleviate this include using a log transformation (which was used) and also turning it into a categorical variable.
Categorical Variables
The variables where all pivoted in a similar manner to that for SibSP and Parch. Also along the way, each variable was converted into a numerical variable to prepare for a Pearson Correlation chart: e.g. Male = 0, Female = 1.
Sex
Females had a much higher survival rate than males.
Pclass
The better the passenger class, the higher the chance of survival became.
Embarked
There are only 2 missing values for embarked. If there was more values missing there would maybe be value in replacing the NULL values by generating C, Q, S based on their percentage of the column each take up. Here it was just replaced by the most common occurring value of S.
Pearson Correlation/Association
For these charts, you will have to have numerical values for all selected variables as well as no nulls. Thus we will fill age using the impute tool on average age and use the prepared numerical version of sex and embarked.
This chart is not only useful for determining the trends and relationship between variable, but also later on to help better impute certain nulls for variables.
Using the association tool will create a more interactive chart with colours.
We see that sex has the highest impact on survived whilst Age, SibSP and Parch have very little correlation. However, through the visual analysis there seemed to be some impact on survived by the variables. The implications of this will be discussed in the next blog. That’s it for this blog, check out the next part to see some more preparation of the variables and the actual building of the models.