Today’s challenge involves the UFC dataset used by the new cohort in their stage 3 interview.

The plan:

  1. Create a predictive model on a small subset of the data
  2. Clean the data so it’s more suitable for visualisation
  3. Use the variable importance in the predictive model to help create explanatory charts in Tableau
  4. Show why upsets happen
  5. Analyse betting strategies
  6. Improve prediction using Tableau


1. Predictive

The data is actually set up nicely for a predictive model already, with a lot of the feature engineering pre-done. I removed any variables that are only determined after the result, as these would leak the outcome into the model.
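As a minimal sketch of the leakage-removal step (the column names here are illustrative, not the real dataset schema):

```python
import pandas as pd

# Hypothetical fight records; column names are illustrative only
fights = pd.DataFrame({
    "r_sig_str_landed_avg": [4.2, 3.1],   # known before the fight
    "r_odds": [1.8, 2.4],                 # known before the fight
    "finish_round": [2, 3],               # only known after the result -> leakage
    "winner": ["Red", "Blue"],            # the target itself
})

# Columns only determined once the fight is over must be dropped, otherwise
# the model "predicts" using information it could never have had in advance
post_result_cols = ["finish_round"]
features = fights.drop(columns=post_result_cols + ["winner"])
target = fights["winner"]
```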


I decided to go with a random forest model, as there seemed to be quite a few linear relations within the data that an XGBoost model would only approximate.

Although the random forest has a lower balanced accuracy, we will be fine-tuning the thresholds later in Tableau, so it’s fine to just look at the overall accuracy.

However, I removed the odds, as they are already a kind of prediction based on the data, so they can distract the model and mask the importance of other variables. This made me change my model selection.

Without the odds, the XGBoost model just performs better overall, and so it was selected.
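A sketch of the kind of model comparison described above, on synthetic stand-in data (scikit-learn’s GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free; all names and numbers here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, odds-free fight features,
# with a mild class imbalance like the red/blue win split
X, y = make_classification(n_samples=600, n_features=12,
                           weights=[0.6, 0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Compare both plain accuracy and balanced accuracy, as in the post
    print(name,
          round(accuracy_score(y_te, pred), 3),
          round(balanced_accuracy_score(y_te, pred), 3))
```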

Feature Importance:
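For reference, a minimal sketch of how importances like these can be pulled from a fitted scikit-learn tree model (GradientBoostingClassifier again stands in for XGBoost; the data and feature names are synthetic):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
names = [f"feat_{i}" for i in range(6)]  # hypothetical feature names

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalised to sum to 1; sort to rank the features
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```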



2. Clean

So currently the data is at a fight grain, where the fighters are labelled as red and blue. This makes it difficult to calculate statistics and performance for individual fighters.

To quickly sum it up, this workflow separates the red and blue fighters into ‘fighter’ and ‘opponent’. Each fight becomes two rows, one from each fighter’s perspective, with the other fighter treated as the opponent. The non-fighter details are split off and then joined back to each fighter row.
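The same reshaping can be sketched in pandas (the original workflow is a visual tool; the column names here are hypothetical):

```python
import pandas as pd

# One row per fight, with red and blue stats side by side (illustrative columns)
fights = pd.DataFrame({
    "fight_id": [1, 2],
    "red_fighter": ["A", "C"],
    "blue_fighter": ["B", "D"],
    "red_strikes": [90, 60],
    "blue_strikes": [75, 80],
    "winner": ["Red", "Blue"],
})

def to_fighter_grain(df):
    """Duplicate each fight: once from the red corner's view, once from blue's."""
    red = df.rename(columns={
        "red_fighter": "fighter", "blue_fighter": "opponent",
        "red_strikes": "fighter_strikes", "blue_strikes": "opponent_strikes",
    })
    red["won"] = df["winner"] == "Red"
    blue = df.rename(columns={
        "blue_fighter": "fighter", "red_fighter": "opponent",
        "blue_strikes": "fighter_strikes", "red_strikes": "opponent_strikes",
    })
    blue["won"] = df["winner"] == "Blue"
    # Stack the two perspectives; the red/blue label is no longer needed
    return pd.concat([red, blue], ignore_index=True).drop(columns=["winner"])

fighter_grain = to_fighter_grain(fights)  # 2 fights -> 4 fighter rows
```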


3. Visualise


I decided to go for a simple viz that combines the most important variables from the predictive model for data exploration. I created several different strategies: four simple ones that only ever bet on one thing, and three more complicated ones based on the predictions from the model. For the ‘Always bet prediction’ strategy, I used the prediction output directly. However, although the predictive model has 60% accuracy, it doesn’t actually make much money overall. This is because the predictive model is built to achieve the highest overall accuracy, and doesn’t take the odds into account.

The adjusted red and blue models take the odds into account and calculate, for each fight, which corner to bet on. They compare the implied probability from the official betting odds with the probability calculated by the model. If the model’s probability is higher than the one implied by the official odds, this is a bet worth considering. I added a parameter that can lower the model’s probability below what it actually suggests. This acts as a sort of ‘safety net’, controlling how sure you want the prediction to be before betting.
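The decision rule above can be sketched as follows (the function names and the 5% cushion are illustrative; in the dashboard the cushion is a Tableau parameter):

```python
def implied_prob(decimal_odds):
    """Bookmaker's implied win probability from decimal odds (ignoring the vig)."""
    return 1.0 / decimal_odds

def should_bet(model_prob, decimal_odds, safety=0.05):
    """Bet only if the model's discounted probability beats the market's.

    `safety` is the 'safety net' parameter: shave the model's probability
    down so you only bet when the edge looks comfortably real.
    """
    return (model_prob - safety) > implied_prob(decimal_odds)

# Market prices the fighter near 55% (decimal odds 1.8):
print(should_bet(0.60, 1.8))  # 60% model confidence fails once the cushion is applied
print(should_bet(0.70, 1.8))  # 70% clears the implied probability with room to spare
```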


4. Conclusion and Future Improvements

It is actually surprising that always betting on the red fighter lets you break even. However, because of the imbalance towards red wins in the training data, the model may favour red too much; this shows up in the model’s red selections actually losing money. In future iterations I would definitely try class weighting, or undersampling some red wins, to get a more even balance. The blue model worked fairly well on its own, and if I create a better red model and combine the two sets of decisions, I think it would make for an even more reliable model.
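The undersampling idea could be sketched like this (the tiny frame and counts are made up for illustration):

```python
import pandas as pd

# Illustrative outcomes: red wins outnumber blue wins 2-to-1
df = pd.DataFrame({"winner": ["Red"] * 6 + ["Blue"] * 3, "x": range(9)})

red = df[df["winner"] == "Red"]
blue = df[df["winner"] == "Blue"]

# Undersample red wins down to the blue count for an even class balance
balanced = pd.concat([red.sample(n=len(blue), random_state=0), blue])
```

Alternatively, scikit-learn classifiers accept `class_weight="balanced"` (e.g. `RandomForestClassifier(class_weight="balanced")`), which re-weights the classes without throwing training rows away.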

In terms of visualisations, it has once again been proven that making a predictive dashboard is quite difficult; I struggled to include interactivity between the charts. This may also be because the predictive model’s outputs are structured differently from the data I set up for the visualisation. In future I would consider this before changing my data structure.

Author: The Data School