Our mission for day 2 of Dashboard Week was to create the application viz for the next cohort, DSAU7, in one day (applicants get 2 weeks). The dataset is on UFC. I found it particularly challenging to do in one day because I didn’t know anything about UFC and the dataset was very large, with many fields. After some serious research on UFC, I decided to look at how male and female fighters differ in their winning traits and strategies. Below I’ve outlined my process of creating this viz (link to viz). It’s broken down into 3 parts: (1) Research, cleaning & transforming the data, (2) Investigation, (3) Visualisation. My actual process was iterative rather than linear, but for the sake of this article I’ve written it linearly.
Research, cleaning & transforming the data
This step was by far the most difficult part. The dataset on Kaggle was intended for predictive modelling, so it has quite a number of features, many of them full of nulls. I spent quite some time reading through the data dictionary to understand what each field means, and even researched the different match outcomes and finishing techniques. In terms of cleaning and transforming the data, I had 2 major workflows:
(1) Basic cleaning & removing fields with lots of nulls
The initial dataset has 85 fields. I chose to remove clearly irrelevant fields and fields that have more than 80% nulls. To do this, I transposed the data into tall form, counted the null and non-null values for each field using the Summarize tool, and finally calculated % Null using the Formula tool. Once I had found which fields have more than 80% nulls, I flagged them with an * using Find & Replace and removed any rows that contain an * in the Value Name.
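For readers who work in Python rather than Alteryx, the same null check can be sketched in a few lines of pandas. This is only a rough equivalent of the steps described above, not the Alteryx workflow itself: the `drop_sparse_fields` helper, the toy data and its column names are all made up for illustration.

```python
import pandas as pd


def drop_sparse_fields(df: pd.DataFrame, max_null_share: float = 0.8) -> pd.DataFrame:
    """Keep only the columns whose share of nulls is at most max_null_share."""
    # isna() marks nulls as True; the mean of a boolean column is its null share.
    null_share = df.isna().mean()
    keep = null_share[null_share <= max_null_share].index
    return df[keep]


# Toy data with hypothetical column names (not the real Kaggle fields).
matches = pd.DataFrame({
    "R_fighter": ["A", "B", "C", "D", "E"],
    "B_fighter": ["F", "G", "H", "I", "J"],
    "partly_null": [1.0, 2.0, None, 4.0, 5.0],       # 20% null, kept
    "all_null": [None, None, None, None, None],      # 100% null, dropped
})

cleaned = drop_sparse_fields(matches)
print(list(cleaned.columns))
```

The threshold is a parameter so that it is easy to try something stricter than 80% if too many sparse fields survive.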
(2) Transform & Add new features
This was the process that took up most of my time. Since I decided to do a fighter analysis, it’s better for each row to be a fighter, whereas the original dataset has each row as a match. Using a hint from our coach, Craig, I split the data into 2 streams, one for blue and one for red, then unioned them back together so that I have 2 rows for each match (one row for each of the 2 fighters). There are also measures for the difference in attributes (e.g. age, height, etc.). The dataset defines these as the difference between Blue and Red (Blue – Red). However, since my analysis is on the winning strategy, I decided to convert them into the difference between winner and loser (winner – loser) instead. I used the Multi-Field Formula tool to convert the existing differences and the Multi-Row Formula tool to create new difference calculations.
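The two transformations above (one row per fighter, and Blue – Red turned into winner – loser) can be illustrated with a small pandas sketch. It is an approximation of the idea, not a copy of the Alteryx workflow. The column names (`match_id`, `R_`/`B_` prefixes, `Winner`, `age_dif`) and both helper functions are assumptions chosen for the example.

```python
import pandas as pd

# Toy match-level data; one row per match, hypothetical column names.
matches = pd.DataFrame({
    "match_id": [1, 2],
    "R_fighter": ["Ann", "Cat"],
    "B_fighter": ["Bea", "Dee"],
    "R_age": [30, 25],
    "B_age": [27, 29],
    "Winner": ["Red", "Blue"],
    "age_dif": [-3, 4],  # stored as Blue - Red
})


def to_fighter_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Split into a red and a blue stream, then union them: 2 rows per match."""
    streams = []
    for corner, prefix in (("Red", "R_"), ("Blue", "B_")):
        # Map e.g. "R_age" -> "age" so both streams share the same column names.
        cols = {c: c[len(prefix):] for c in matches.columns if c.startswith(prefix)}
        stream = matches[["match_id", "Winner", *cols]].rename(columns=cols)
        stream["corner"] = corner
        stream["won"] = stream["Winner"] == corner
        streams.append(stream.drop(columns="Winner"))
    return pd.concat(streams, ignore_index=True)


def winner_minus_loser(matches: pd.DataFrame, diff_cols: list) -> pd.DataFrame:
    """Flip the sign of Blue - Red differences whenever Red won."""
    out = matches.copy()
    # Any other outcome (e.g. a draw) maps to NaN and makes the difference null.
    sign = out["Winner"].map({"Blue": 1, "Red": -1})
    for col in diff_cols:
        out[col] = out[col] * sign
    return out


fighters = to_fighter_rows(matches)
converted = winner_minus_loser(matches, ["age_dif"])
print(fighters)
print(converted[["match_id", "Winner", "age_dif"]])
```

In match 1 Red won, so the stored -3 (27 – 30) becomes +3 (30 – 27); in match 2 Blue won, so the stored 4 is already winner – loser and stays as it is.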
Investigation
Since the focus of my dashboard is on the winning strategy, I used the Pearson Correlation tool to see which fields have a high correlation with winning. I split the data into 2 streams, one for male and one for female, before running Pearson Correlation. I also imputed nulls with the field average, as Pearson correlation doesn’t work with fields that contain nulls. My workflow looks like this.
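Outside Alteryx, the correlation step can be approximated in pandas. The sketch below assumes a fighter-level table with a `gender` column and a 0/1 `won` column; those names, the toy numbers and the `correlation_with_winning` helper are invented for illustration and are not taken from the real dataset. Nulls are filled with the column average within each gender stream to mirror the imputation described above (pandas' own `corr` would otherwise skip nulls pairwise).

```python
import pandas as pd


def correlation_with_winning(fighters: pd.DataFrame,
                             target: str = "won",
                             group_col: str = "gender") -> pd.DataFrame:
    """Pearson correlation of every numeric field with `target`, per group."""
    results = {}
    for group, stream in fighters.groupby(group_col):
        numeric = (stream.drop(columns=group_col)
                         .select_dtypes(include=["number", "bool"])
                         .astype(float))
        # Impute nulls with the column average of this stream.
        numeric = numeric.fillna(numeric.mean())
        corr = numeric.corr(method="pearson")[target].drop(target)
        results[group] = corr
    # One column per group, one row per field.
    return pd.DataFrame(results)


# Toy fighter-level data with hypothetical values.
fighters = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "won": [1, 0, 1, 0, 1, 0, 1, 0],
    "reach": [170, 165, None, 160, 190, 185, 195, 180],
    "age": [28, 31, 27, 33, 30, 29, 31, 30],
})

table = correlation_with_winning(fighters)
print(table)
```

Putting the two genders side by side in one table makes it easy to spot fields whose correlation with winning differs between them.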
The results are different for male and female fighters. From the results, I chose several attributes that correlate with winning for further analysis. Unsurprisingly, odds and ev (profit per 100 credits bet) have a high correlation. However, since these are not exactly a fighter’s traits, I decided to exclude them from my analysis. Some fields measure similar attributes, such as “# wins” and “# of KO wins”, and may have high covariance, so I only included one of them. Unfortunately, I didn’t have time to look at covariance itself. I chose weight, total rounds, age, current_win_streak, takedown accuracy, # strikes per minute, reach and height for visualisation.
Visualisation
Visualisation was quite difficult, as most of the charts would be some form of bar chart. After some consideration, I decided to make dumbbell charts to show the gap between male and female fighters for the measures. For categorical variables (such as finishing technique and age group), I used a butterfly chart.
As I wrote in my dashboard, male and female fighters have different winning techniques and traits. I expected techniques that require strength (such as punches) to have a higher proportion of male fighters, whereas it may be the reverse for locks. The distribution of age groups is similar for both genders. In terms of measures, I need to add some extra features to my dashboard to give a better picture. Currently, it only shows the difference between winners and losers for each gender. As a future improvement, I plan to add a chart inside a tooltip to show the rank (of Pearson correlation) for male and female fighters.