
Welcome to this four-part blog series where we introduce a powerful analytical tool called Survival Analysis. In this series, I will provide a beginner-friendly guide to help you understand this popular statistical method.

In the first part, I will introduce the key concepts of survival analysis and show you some use cases where it can be applied. In the second part, we will dive deeper into the key models in survival analysis. In the third part of the series, we will walk you through how you can perform survival analysis using Alteryx. Finally, we will see how we can perform survival analysis using Python.

Whether you are a marketing analyst, medical researcher, engineer, or social scientist, this series will help you understand how to analyse time-to-event data and predict survivability. So, let’s dive in!

#### Content

1. Part I: Introduction to Survival Analysis
• What is survival analysis?
• Why do we use it?
• Key concept: Censorship
• Use cases
2. Part II: Key Models in Survival Analysis
• The Kaplan-Meier Model
• The Cox Proportional Hazard Model
3. Part III: Survival Analysis using Alteryx
• Data Preparation
• The KM Model in Alteryx
• The CPH Model in Alteryx
4. Part IV: Survival Analysis using Python

To perform survival analysis in Alteryx you need to download the relevant survival tools from the Alteryx Analytics Gallery, if you haven’t done so already.

The tools you will need are:

The Survival Analysis Tool: https://help.alteryx.com/20223/designer/survival-analysis-tool

The Survival Score Tool: https://help.alteryx.com/20223/designer/survival-score-tool

#### Data Preparation

In order to conduct survival analysis, your dataset must include at least the following information:

1. A unique ID field (e.g. customer ID), so that you can correctly map the predictions back to your data.
2. A duration field (such as a customer’s tenure), measured up to the end of the observation period or the time of the event, whichever is earlier.
• As an alternative you could have one column that contains the start date, and a second column that contains the end date, but I find that it is usually easier to just work with the duration.
3. A censor label (e.g. whether a customer experienced an event, or is right-censored)
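If your raw data holds start and end dates rather than durations, the three fields above could be derived roughly as follows. This is a minimal Python sketch and not part of the Alteryx workflow; the function name, the sample rows and the observation cut-off date are all hypothetical.

```python
from datetime import date

# Hypothetical end of the observation period (an assumption for illustration).
OBSERVATION_END = date(2022, 12, 31)

def to_survival_row(customer_id, start, end=None, cutoff=OBSERVATION_END):
    """Return (id, duration_in_days, event) for one customer.

    event = 1 if the customer churned on or before the cutoff,
    event = 0 if the customer is right-censored (still active at the cutoff).
    """
    churned = end is not None and end <= cutoff
    last_seen = end if churned else cutoff
    return customer_id, (last_seen - start).days, 1 if churned else 0

rows = [
    to_survival_row("C001", date(2022, 1, 1), date(2022, 3, 2)),  # churned
    to_survival_row("C002", date(2022, 6, 1)),                    # still active
]
```

A customer with no end date, or with an end date after the cut-off, is treated as right-censored, and their duration runs only up to the cut-off.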

For more detailed analysis, your data should also include:

• Other covariate fields, such as age group, gender, income, spending, payment method etc.

I’ve prepared a mock dataset. It contains an ID field, a Duration field, a Censor Label field, and three covariate fields: Gender, ReturnedCustomer label and MonthlyBill(\$).

#### The Kaplan-Meier Model in Alteryx

Alteryx has really made it straightforward to construct the Kaplan-Meier model. We simply need to set up a few configurations, so let me walk you through how it is done.

Within the Configuration window of the Survival Analysis Tool, you will find three tabs: Input Options, Analysis Options, and Graph Properties. The first two tabs will impact the outcomes of our model, while the Graph Properties tab is solely concerned with the dimensions and resolution of the graphs.

Input Options:

• Model Name: Give your model a sensible name. Note the name should not contain special characters other than “.” or “_”, and no spaces are allowed.
• Data contains durations / Data contains start and stop times: Select one of these radio buttons based on whether your data contains durations or actual start and stop times. For our sample dataset, the correct choice is “Data contains durations”.
• Data is left-censored / Data is right-censored: These are optional checkboxes; check them only if your data is censored. Since our data is right-censored, we will check the “Data is right-censored” box.
• When checked, you will then need to select the field that holds the censorship label. Double-check that the censorship is labelled correctly: 0 should be censored and 1 should be the event.
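Both the model name rule and the censorship coding are easy to get wrong, so it can help to check them before the data reaches Alteryx. The sketch below is a hypothetical pre-flight check in Python. It encodes only the rules as described above; the tool itself may impose further constraints that are not covered here.

```python
import re

def is_valid_model_name(name):
    """True if the name uses only letters, digits, '.' and '_' (no spaces)."""
    return re.fullmatch(r"[A-Za-z0-9._]+", name) is not None

def check_censor_labels(labels):
    """Count censored (0) and event (1) rows; raise on any other value."""
    bad = sorted({x for x in labels if x not in (0, 1)}, key=repr)
    if bad:
        raise ValueError(f"Unexpected censor labels: {bad}")
    labels = list(labels)
    return {"censored": labels.count(0), "events": labels.count(1)}

# Hypothetical inputs for illustration.
name_ok = is_valid_model_name("KM_churn.v1")
label_counts = check_censor_labels([1, 0, 1, 1])
```

If the counts look inverted (for example, far more "events" than you expect), the labels may be coded the other way round.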

Analysis Options:

• Make sure the Kaplan-Meier Estimate radio button is selected.
• You can also select the corresponding check box or boxes to suit your analytical needs.

In general, it is useful to include a confidence interval with your statistical estimates. We can see below that, after 200 days, we can expect around 50% of customers to remain with us, or between 45% and 55% according to the 95% confidence interval.
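To make the estimate and its confidence interval less of a black box, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator with a plain Greenwood-based 95% interval. It is an illustration only: the Alteryx tool may compute its interval differently (for example, on a transformed scale), and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def kaplan_meier(durations, events, z=1.96):
    """Return (time, survival, lower, upper) at each distinct event time.

    events[i] is 1 if the event was observed and 0 if right-censored.
    """
    n_at_risk = len(durations)
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # everyone leaving the risk set at time t
    survival = 1.0
    greenwood_sum = 0.0
    rows = []
    for t in sorted(exits):
        d = deaths[t]
        if d > 0:
            survival *= 1 - d / n_at_risk
            if n_at_risk > d:  # skip the term when the risk set is exhausted
                greenwood_sum += d / (n_at_risk * (n_at_risk - d))
            se = survival * math.sqrt(greenwood_sum)
            lower = max(0.0, survival - z * se)
            upper = min(1.0, survival + z * se)
            rows.append((t, survival, lower, upper))
        n_at_risk -= exits[t]
    return rows

# Toy data (hypothetical): durations in days, 1 = churned, 0 = censored.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
km_rows = kaplan_meier(durations, events)
```

Censored customers never reduce the survival estimate directly; they only shrink the risk set used at later event times.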

Often, we may want to group by a field to compare how the survival curve varies across groups, and the Survival Analysis Tool has an option that makes it easy to do just that!

Pro Tip: When using the Select grouping variable option, make sure the field you select is the first field in the dataset, otherwise Alteryx may throw an error.

As we can see, at any given time point, female customers have a noticeably higher survival rate than male customers. This suggests that we should potentially focus on attracting female users to our platform, as they tend to be more loyal than male customers.

The jagged survival curve of the non-binary gender simply reflects the fact that this group of customers has a relatively low number of observations.
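Conceptually, grouping just means fitting a separate Kaplan-Meier curve for each group and comparing them at the same time point. The sketch below shows that idea in plain Python; it is not how Alteryx implements the option, and the function names and sample records are hypothetical.

```python
from collections import defaultdict

def km_survival_at(durations, events, t_query):
    """Kaplan-Meier estimate of the survival probability at t_query."""
    at_risk = len(durations)
    survival = 1.0
    for t in sorted(set(durations)):
        if t > t_query:
            break
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        survival *= 1 - deaths / at_risk
        at_risk -= leaving
    return survival

def survival_by_group(records, t_query):
    """records: iterable of (group, duration, event) tuples."""
    groups = defaultdict(lambda: ([], []))
    for group, duration, event in records:
        groups[group][0].append(duration)
        groups[group][1].append(event)
    return {g: km_survival_at(d, e, t_query) for g, (d, e) in groups.items()}

# Hypothetical records: (gender, duration in days, 1 = churned / 0 = censored).
records = [
    ("Female", 100, 0), ("Female", 250, 1), ("Female", 300, 0),
    ("Male", 50, 1), ("Male", 120, 1), ("Male", 300, 0),
]
by_gender = survival_by_group(records, t_query=200)
```

With only a handful of rows per group, each event moves the estimate by a large step, which is the same effect that makes a small group's curve look jagged.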

#### The Cox Proportional Hazards Model in Alteryx

To construct a CPH model in Alteryx, we will continue using the same Survival Analysis Tool. In fact, we can even leave all the configurations in the Input Options to be the same as before. The only part that we need to configure is the Analysis Options.

Analysis Options:

• Make sure the Cox Proportional Hazards option is selected.
• Select predictor variables: Select the covariate variables that you’d like to incorporate in your survival analysis.
• Method for tie handling: Alteryx provides three methods for dealing with tied times (durations). See the R documentation for more information: https://stat.ethz.ch/R-manual/R-devel/library/survival/html/coxph.html
• You can also optionally select a field that contains case weights.

Here is what the results look like:

• Results of Factor Analysis Tests: This section informs us whether our CPH model is statistically significant. With a p-value much smaller than 0.05, our model is significant here.
• Summary of Cox Proportional Hazards Model: This section reads much like a linear regression summary. It shows:
• The estimated coefficient for each variable (a positive sign means higher churn risk, a negative sign means lower churn risk). As we observed in the KM model, male customers are associated with increased risk, hence the positive sign here.
• The exp(coef) gives us the effect size as a hazard ratio; it is simply the exponential of the coefficient. The exp(coef) for GenderMale is 1.64, meaning that male customers have a 64% higher churn hazard than female customers, holding the other variables constant.
• The se(coef), z and Pr(>|z|) are used to determine the statistical significance of the variables. We can see that in our model the coefficients for GenderNon-binary, ReturnedCustomer label and MonthlyBill are not significantly different from 0, so we can consider dropping these variables.
• Results of non-proportional hazards test: This section tests whether the terms in the model meet the proportional hazards assumption. It seems that our model violates this assumption, since the p-values are small enough to reject the null hypothesis.
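The columns in the model summary are related by simple arithmetic, which the following Python sketch reproduces for a single coefficient. The hazard ratio of 1.64 for GenderMale comes from the results above; the standard error of 0.1 is a made-up value used purely for illustration, and the function name is hypothetical.

```python
import math

def summarize_coefficient(coef, se):
    """Reproduce the exp(coef), z and Pr(>|z|) columns for one coefficient."""
    hazard_ratio = math.exp(coef)
    z = coef / se
    # Two-sided p-value under a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "hazard_ratio": hazard_ratio,
        "pct_change_in_hazard": (hazard_ratio - 1) * 100,
        "z": z,
        "p_value": p_value,
    }

# coef chosen so that exp(coef) = 1.64; se = 0.1 is an assumed value.
gender_male = summarize_coefficient(coef=math.log(1.64), se=0.1)
```

A hazard ratio above 1 corresponds to a positive coefficient and a higher hazard relative to the baseline group, while a ratio below 1 corresponds to a negative coefficient and a lower hazard.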

##### Author: Martin Ding

Martin earned his Honours degree in Economics at the University of Melbourne in 2011. He has more than 7 years of experience in product development, both as an entrepreneur and as a project manager in robotics at an AI unicorn. Martin is expecting to receive his Master’s degree in Data Science from CU Boulder at the end of 2022. Martin is excited about data and its power to transform organizations. He witnessed first-hand how instrumental data-driven decision making (DDDM) was in leading to more team buy-in and insightful decisions. Martin joined the Data School to systematically enhance his knowledge of the tools, methodologies and know-how of Data Analytics and DDDM. When not working, Martin enjoys reading, cooking, traveling and golf. He is also thoroughly interested in the practice of mindfulness and meditation.