Predictive analytics and machine learning evoke a certain kind of buzz, combining science and analytics to output an algorithm. In today’s blog, I will go through the process of predictive analytics in Alteryx, explaining some key terms and the requirements for conducting your own predictive analytics along the way.
Step 1: Decide on a Target Variable
The target variable in predictive analytics is the column that you want to predict. It could be a classification (binary or non-binary) or a numeric value (continuous or time-based). Each type of target has its place in answering business questions. However, just because time is a variable in the problem does not mean that a time-based model (ARIMA or ETS) will be the best way to solve it, and just because a field has a numeric value doesn’t mean that a binary model couldn’t answer the business’s needs better.
Step 2: Understand Your Data
Sample size is the biggest contributor to “good” predictive models. As a rough guide, anything under 5,000 records will likely need to be under-sampled, which is not best practice. Alteryx and Tableau Prep are both great tools for understanding your data by building histograms, scatterplots, and correlation matrices. Before proceeding to Step 3, your data transformation step, you need to know what types of variables are in your data. Just as there are different types of target variables, there are also different types of predictor variables, and each needs to be formatted in a certain way:
- Categorical Data: String fields with no inherent order. In a predictive workflow, each category needs its own column, flagged 1 where the category applies and 0 where it does not. Because every category becomes a separate field, you really do not want too many of them, as this dramatically reduces the sample size behind each category. For instance, 8 different car brands may be a useful predictor of fuel efficiency, but 50 different car models will not be appropriate for analysis, especially if, say, 30 of those models only have 20 records. This is why a histogram in Tableau Prep is especially powerful: it lets you see the number of categories and the spread of records across them very quickly.
- Ordinal Data: String fields with an inherent order. In a predictive workflow, you can substitute numbers that follow that order, for example Likert scale data coded 1 to 5 for strongly agree through strongly disagree. These values can all go into the same column.
- Numeric Data: Data that has a quantifiable value. Numeric data can also be treated as:
- Cyclical Data: Numeric values that wrap around, such as the hour of the day or the month of the year. Ryan from DSAU5 wrote a great blog explaining this.
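To make the variable types above concrete outside of Alteryx, here is a minimal Python sketch of how each one might be formatted. The records, field names, and Likert wording are hypothetical. The sine/cosine treatment of cyclical data is one common approach, and I am assuming it here rather than claiming it is what the DSAU5 blog describes.

```python
import math
from collections import Counter

# Hypothetical records: brand is categorical, answer is ordinal, month is cyclical.
records = [
    {"brand": "Toyota", "answer": "agree", "month": 1},
    {"brand": "Ford", "answer": "strongly disagree", "month": 12},
    {"brand": "Toyota", "answer": "neutral", "month": 6},
]

def category_counts(rows, field):
    """Count records per category, the same view a histogram would give you."""
    return Counter(row[field] for row in rows)

def one_hot(rows, field):
    """Give every category its own column, flagged 1 where it applies and 0 otherwise."""
    categories = sorted({row[field] for row in rows})
    return [{f"{field}_{c}": int(row[field] == c) for c in categories} for row in rows]

# Assumed coding: 1 = strongly agree through 5 = strongly disagree.
LIKERT = {"strongly agree": 1, "agree": 2, "neutral": 3, "disagree": 4, "strongly disagree": 5}

def encode_ordinal(rows, field, order=LIKERT):
    """Replace ordered strings with numbers that follow the same order (one column)."""
    return [order[row[field]] for row in rows]

def encode_cyclical(value, period):
    """Place a wrapping value on a circle so the end of the cycle sits next to the start."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

print(category_counts(records, "brand"))
print(one_hot(records, "brand"))
print(encode_ordinal(records, "answer"))
print(encode_cyclical(12, 12), encode_cyclical(1, 12))
```

With the cyclical encoding, December and January end up close together, whereas the raw month numbers 12 and 1 would look far apart to a model.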
Step 3: Create Calculations / Gather New Data
Gathering as much data as you can, or inferring new fields from the data you already have (for example, adding seasonality), can give you great predictor variables. Be creative in what variables you add. Note, however, that if you infer a field, it is at times unwise to include both it and the original column it was inferred from in the same model, because the model will then naturally give more weight to that information. It is also good to include variables that correlate with the target, but variables that drown out all the others will sometimes need to be excluded.
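As a rough illustration of the point about inferred fields, the sketch below derives a simple seasonality field (calendar quarter) from a date and then flags pairs of fields that carry nearly the same information. The rows, the field names, the use of Pearson correlation as the check, and the 0.9 threshold are all assumptions made for the example.

```python
import math
from datetime import date

def add_quarter(rows):
    """Infer month and calendar quarter from each row's date."""
    return [
        {**row, "month": row["date"].month, "quarter": (row["date"].month - 1) // 3 + 1}
        for row in rows
    ]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(rows, fields, threshold=0.9):
    """List pairs of numeric fields whose absolute correlation exceeds the threshold."""
    pairs = []
    for i, a in enumerate(fields):
        for b in fields[i + 1:]:
            r = pearson([row[a] for row in rows], [row[b] for row in rows])
            if abs(r) > threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical sales rows, one per month of 2023.
rows = add_quarter(
    [{"date": date(2023, m, 1), "sales": 100 + (m * 37) % 50} for m in range(1, 13)]
)
print(redundant_pairs(rows, ["month", "quarter", "sales"]))
```

Here month and quarter are flagged as a pair, which is the cue to keep only one of them in the model.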
Step 4: Decision Tree is Your Friend
Decision trees let you quickly see which of your variables are the most “important” for predicting the target variable. This particular model will not be used for the final prediction, as it will over-fit, but it has another benefit: it clearly shows whether some of your variables are too closely related to the target variable.
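To give a feel for where that “importance” comes from, here is a deliberately simplified Python sketch. It is not the Alteryx Decision Tree tool: it scores each variable by the best single split it can offer (a one-level tree), using Gini impurity, which is an assumption on my part. The churn-style data and column names are hypothetical, and `cancel_flag` is planted as a variable that is too closely related to the target.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_gain(values, labels):
    """Largest impurity reduction achievable with one threshold split on this variable."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best

def rank_variables(columns, labels):
    """Sort variables from most to least useful for a single split."""
    gains = {name: best_split_gain(values, labels) for name, values in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Hypothetical target (1 = churned) and predictor columns.
labels = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "tenure": [24, 30, 18, 3, 5, 14, 12, 6],
    "age": [25, 40, 33, 38, 29, 45, 51, 30],
    "cancel_flag": [0, 0, 0, 1, 1, 1, 0, 1],  # mirrors the target exactly
}

for name, gain in rank_variables(columns, labels):
    note = "  <- suspiciously close to the target" if gain == gini(labels) else ""
    print(f"{name}: {gain:.3f}{note}")
```

A variable whose single split removes all of the impurity, as `cancel_flag` does here, is reproducing the target rather than predicting it and is a candidate for exclusion.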
Step 5: Try Different Models
Data science is complex, and you never really know in advance which model will produce the best results. That is why we are encouraged to try a multitude of models, for example Random Forest, Boosted Model, and Neural Network.
Step 6: Score The Model
Alteryx has a scoring tool that can be used to score models. In this process, you should withhold part of your data so that you can test and score the model on records it was not trained on. Different kinds of models are scored differently. Numeric problems are scored on the range of error between the predicted and actual values, while classification problems are scored on four outcomes: the target was correctly predicted as positive, correctly predicted as negative, wrongly predicted as positive, or wrongly predicted as negative. For some problems, the rate at which you predict accurately will be very high, but what you really want is to predict the anomalies. I would caution that this is a very difficult task, but with some sampling finesse and a fair sample size, it is all a matter of testing and re-configuring.
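The scoring ideas above can be sketched in a few lines of Python. This is not how the Alteryx scoring tool works internally. The 30% withheld share, the choice of RMSE as the numeric error measure, and the example data are assumptions for illustration.

```python
import math
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and withhold a fraction of them for scoring."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def confusion_counts(actual, predicted):
    """Count the four outcomes of a binary prediction (1 = positive, 0 = negative)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Share of all records that were predicted correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

def recall(counts):
    """Share of the actual positives that the model managed to find."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else 0.0

def rmse(actual, predicted):
    """Root mean squared error, one possible score for a numeric target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical withheld set: 2 anomalies in 100 records,
# scored against a "model" that never predicts an anomaly.
actual = [1] * 2 + [0] * 98
predicted = [0] * 100
counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts), recall(counts))
```

In this made-up case the accuracy is 0.98 while the recall is 0.0: the model looks excellent overall and still misses every anomaly, which is why the four outcomes are worth reading separately.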
There you have it: a not-so-simple approach to tackling predictive analytics that sounds kinda simple. Hopefully, in the future, I can write another blog including a workflow and some more details, but for now, this is an approach to get you started on conducting predictive analytics.