Introduction to Stats
A couple of weeks ago in our training at the Data School, we had some lessons on predictive analysis. The topic of standard deviations and normal distributions came up, and coming from a statistics background I really enjoyed it. However, many of the examples involving Z scores and normal distributions were not done in Tableau, even though it was the perfect opportunity to do so. So in this blog, I will explain the statistical concepts behind prediction and provide a viz with calculations so that others can use it.
A lot of things in our world follow a normal distribution. What is the normal distribution? It is the familiar bell-shaped curve where roughly 68% of all values fall within 1 standard deviation (a measure of how dispersed the data is) of the mean, 95% fall within 2 standard deviations, and 99.7% fall within 3.
The calculation for the standard deviation (SD) is:

SD = √( Σ|x − x̄|² / n )

Where x is a given value and x̄ represents the mean across the group. The sigma sign (Σ) followed by |x − x̄|² means the summation of the squared absolute (always positive) differences between each value and the group mean. This sum is divided by the number (n) of values in the dataset, and the whole thing is then square rooted to get the SD.
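The same calculation can be sanity-checked in a few lines of Python (the sample heights here are hypothetical):

```python
import math

def std_dev(values):
    """Population standard deviation: the square root of the mean squared deviation."""
    mean = sum(values) / len(values)
    variance = sum(abs(x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

heights = [160, 165, 170, 175, 180]  # hypothetical sample
print(std_dev(heights))  # ~7.07
```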
So to recap:
1 SD = 68% of Data
2 SD = 95% of Data
3 SD = 99.7% of Data
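These three percentages can be verified numerically with the error function, which Tableau does not expose but Python's standard library does:

```python
import math

def within_k_sd(k):
    """P(|Z| <= k) for a standard normal variable, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sd(k) * 100, 1))
# 1 -> 68.3, 2 -> 95.4, 3 -> 99.7
```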
Luckily for Tableau users, Tableau has a built-in standard deviation function; we will come back to that later. The square of the SD (the value multiplied by itself) is the variance, another measure of how dispersed the data is around its mean. A dataset with an SD of 10 has a variance of 100; a dataset with an SD of 50 has a variance of 2500. The second dataset is more "spread out": for a given "bin" or group of values, the peak of the distribution will be lower, because the data is dispersed over a wider range.
Now that we have covered the SD, we can move on to the Z score. The Z score expresses how far a single value sits from the mean, measured in standard deviations. To get the Z score, we subtract the mean from the value and divide by the SD:

Z = (x − x̄) / SD
By this point, you may be wondering how, for a given number of standard deviations, we arrive at those percentages. They come from the normal distribution's probability density function:

f(x) = (1 / (SD · √(2π))) · e^( −(x − x̄)² / (2 · SD²) )

(You would then integrate this over the range of interest to get the area, i.e. the probability.)

Now, in Tableau we cannot integrate, so I will use another way to get the area, but we will still need this formula.
To use an example: say you had a group of 10 boys and knew that the average height was 170 centimetres and that the SD was 20 centimetres. What would be the chance that a randomly selected boy was taller than 180 centimetres? Using the Z formula: (180 − 170) / 20 = 0.5. We want a one-tailed answer, that is, the area to the right of that point under the curve. From going online or using a lookup table, you can find that the answer is 30.85%. So roughly 1 in 3 boys will be taller than 180 centimetres. If the SD were instead 10, the chance drops to 15.87%. Why did it decrease? Because the data became less dispersed, so the chance of finding a boy above 180 centimetres is now lower.
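Rather than a lookup table, the same one-tailed probabilities can be computed with Python's standard library (`math.erf`):

```python
import math

def prob_above(x, mean, sd):
    """One-tailed probability P(X > x) for a normal distribution."""
    z = (x - mean) / sd
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

print(round(prob_above(180, 170, 20) * 100, 2))  # 30.85
print(round(prob_above(180, 170, 10) * 100, 2))  # 15.87
```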
The example above can be applied to anything from finance to weather, and even data science.
To get some semi-real values for the viz, we can use this website to generate 5,000 random values with a custom standard deviation and mean. We can copy and paste these values straight into Excel; no Alteryx or data prep required. Before I go through everything, I highly recommend downloading my viz, where I have grouped my calculations into steps so that you can follow along.
After we input that data source into Tableau, we can begin to make some calculated fields and parameters. I’ll try to explain in 7 steps.
Step 1: Parameters
Create two parameters, "Value bin size" and "Z colour", both floats. "Value bin size" can be anything above 0, as it decides how many groups your data will be split into; I set mine to range from 0.01 to 20. "Z colour" can take any max and min; I used a step size of 0.01.
Step 2: Creating the bins.
DO NOT right-click under Values and create bins from the menu. Instead, use the formula:
INT( [Value] / [Value bin Size parameter] ) * [Value bin Size parameter] + IF [Value] <= 0 THEN -[Value bin Size parameter] ELSE 0 END
(Note the INT: without integer truncation, the division and multiplication would cancel out and simply return [Value].)
This way you can select the bin size dynamically.
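Assuming the bin formula truncates with Tableau's INT (some form of integer rounding is what makes the binning happen at all), its logic can be sketched in Python:

```python
def value_bin(value, bin_size):
    """Floor a value to the bottom edge of its bin.

    int() truncates toward zero (like Tableau's INT), so values at or
    below zero land one bin too high and must be shifted down, mirroring
    the IF clause in the Tableau calculation."""
    binned = int(value / bin_size) * bin_size
    if value <= 0:
        binned -= bin_size
    return binned

print(value_bin(7.3, 2))   # 6
print(value_bin(-7.3, 2))  # -8
```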
Step 3: Colour
A boolean that gives the visualisation its meaning by splitting the marks at the selected value:
[Z colour] <= [Value Bin Calc]
Step 4: Set
In this step, I made a set of all values so that later we can add a dashboard action: clicking on a point updates our values (they get added to the IN side of the set).
Step 5: % of Total and Index
“Index”: INDEX()
“% of Total”: RUNNING_SUM(SUM(ABS([Value]))) / WINDOW_SUM(SUM(ABS([Value])))
“Chance that value will be above selected”: 1 - [% of Total]
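RUNNING_SUM over WINDOW_SUM is simply a cumulative share of the total; here is a quick Python sketch with hypothetical bin totals:

```python
from itertools import accumulate

def pct_of_total(values):
    """Cumulative share of the absolute total,
    like RUNNING_SUM(...) / WINDOW_SUM(...) in Tableau."""
    abs_vals = [abs(v) for v in values]
    total = sum(abs_vals)
    return [running / total for running in accumulate(abs_vals)]

counts = [1, 4, 10, 4, 1]  # hypothetical bin totals
print([round(p, 2) for p in pct_of_total(counts)])  # [0.05, 0.25, 0.75, 0.95, 1.0]
```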
Step 6: Calculated fields
“Area”:
IF [Index] = 2 THEN WINDOW_MIN([% of Total])
ELSE WINDOW_MAX([% of Total])
END
“Mean Value”: TOTAL ( AVG( [Value Bin Calc] ) )
“ST DEV”: TOTAL ( STDEV ( [Value Bin Calc] ) )
Step 7: More Calculations
“Curve”: (1/(sqrt(([St Dev]^2)*2*pi()))) * exp(-( (attr([Value Bin Calc])-[Mean Value])^2 / (2*([St Dev]^2)) ))
* [Value bin Size parameter]
* TOTAL(SUM([Number of Records]))
“Z score”: (AVG([Value Bin Calc])-[Mean Value])/[St Dev]
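The “Curve” calculation is the normal probability density scaled back up to counts: PDF × bin width × number of records gives the expected count in each bin. A Python sketch with hypothetical parameters:

```python
import math

def expected_bin_count(x, mean, sd, bin_size, n_records):
    """Normal PDF at x, scaled by bin width and record count,
    mirroring the Tableau "Curve" calculation."""
    pdf = (1 / math.sqrt(2 * math.pi * sd ** 2)) * math.exp(
        -((x - mean) ** 2) / (2 * sd ** 2)
    )
    return pdf * bin_size * n_records

# Hypothetical: mean 0, SD 1, bin width 0.1, 5000 records
print(round(expected_bin_count(0, 0, 1, 0.1, 5000)))  # ~199 at the peak
```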
These are all the calculations you need. All I have on Columns is “Value Bin Calc”; on Rows, “Curve”. On the Marks card, I dragged “Colour” onto Colour, and onto Detail I dragged:
% of Total
Chance that value will be above selected
For “Index”, I set Compute Using to “Colour”; for the rest, I used Table (across).
Here is a picture of the final product. As you can see, for a Z score of one we capture about 69.5% of the data, very close to the 68% we would expect. If you click and interact with the viz, you will see that for 2 SD we get almost 95%, and for 3, 99.7%. I haven't manipulated the data in any way; the dataset I used was deliberately generated to illustrate how to make a normal distribution curve. In the real world you may not get this result, and in some situations you may hope you don't!
I hope you can apply what I've written about here. If you have any questions, don't hesitate to contact me.
Many thanks to David Bartolo for the inspiration for this idea. I may add to it to make it two-tailed and more dynamic, so the user can switch between the two.
UPDATE: I have changed one of the calculations slightly to make the y axis more akin to a probability density function
“Curve” is now:
((1/(sqrt(([St Dev]^2)*2*pi()))) * EXP(-((attr([Value Bin Calc])-[Mean Value])^2) / (2*([St Dev]^2)) ))*10
I have also added some extra calcs relating to the number of records the workbook contains, as well as some general formatting.