Let’s play a game – I predict your home address begins with the number 1, 2 or 3.
If we made this a betting game and I had enough readers, I’d be filthy rich. Not because I’m lucky, but because I have a 60.2% chance of being right. Yes, almost double the more intuitive 33.3% you’d expect from picking three of the nine possible leading digits. This is the result of Benford’s law, which shows that leading digits are often not uniformly distributed. This isn’t just useful for guessing addresses; it can also be used in fraud detection. It’s a simple formula to apply to any questionable data set, and the result can be quickly visualised in Tableau.
How is this possible? Benford.
Benford was an electrical engineer and physicist. When he wasn’t busy publishing mathematical papers about the refractive properties of glass, he documented a surprising pattern in statistics, which is now known as Benford’s Law. I’ve included the formula above. If you’re not someone who can do logarithms in their head, don’t despair, neither can I. That’s why I’ve included the probability table to the right.
The numbers to the right show the probability of each leading digit occurring, a pattern which holds in a large number of data sets. Not just addresses – your credit card bills, stock prices, mortality rates, population figures and even lengths of rivers. This is exciting because you can use it to perform a smell test if you suspect a data set may be fraudulent.
It’s been a persistent fear of mine that I’ll accidentally bring fake river lengths into my visualisations.
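If you’d rather let a computer do the logarithms, the probability table can be reproduced in a few lines. This is a minimal Python sketch, not part of the Tableau workbook; the names `benford_probability` and `benford_table` are made up for illustration.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1 to 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Probability for every possible leading digit.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for digit, probability in benford_table.items():
    print(f"{digit}: {probability:.1%}")

# Chance that the leading digit is 1, 2 or 3 (the opening bet).
chance_1_to_3 = sum(benford_table[d] for d in (1, 2, 3))
print(f"1, 2 or 3: {chance_1_to_3:.1%}")
```

The last line prints 60.2%, which is where the odds in the opening bet come from.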
Let’s bring this to Tableau!
Alright, let’s grab a data set. For this example, I’m using one that is known to contain fraudulent transactions. I got it from Kaggle, which is a great place to learn data skills (if you don’t want an account, I’ve uploaded my Viz at the end of this blog, which contains the data).
Make the Formulas
First, we need to express the formula, where d is the leading digit (1 to 9):
P(d) = log10(1 + 1/d)
Now let’s bring the formulas to Tableau. We’ll create two calculated fields with the following equations:
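The two calculated fields presumably boil down to two small operations: pulling the first digit off each number, and turning that digit into its Benford probability. Here is a rough Python equivalent for anyone who wants to sanity-check the logic outside Tableau. It is a sketch, not the Tableau calculations themselves, and the function names `leading_digit` and `benford_expected` are hypothetical.

```python
import math
from typing import Optional


def leading_digit(value: float) -> Optional[int]:
    """Return the first non-zero digit of a number, or None for zero."""
    # Scan the text form of the number, skipping zeros and the decimal point.
    for character in str(abs(value)):
        if character in "123456789":
            return int(character)
    return None


def benford_expected(digit: int) -> float:
    """Share of values Benford's law expects to start with `digit`."""
    return math.log10(1 + 1 / digit)


# Hypothetical transaction amounts, purely for illustration.
for amount in (1234.50, 0.00042, -87.10):
    digit = leading_digit(amount)
    print(amount, digit, f"{benford_expected(digit):.1%}")
```

Scanning the text form of the number keeps the sketch close in spirit to a string-based calculated field, and it also copes with negative values and values below 1.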
Create the Bar Graph
Let’s go through creating a simple dashboard showing Benford’s Law on the Kaggle data. Feel free to follow along and you’re welcome to download the dashboard at the end of the blog!
Converting to Percent of Total
We now use the two formulas we created, along with Number of Records, to build a simple bar graph whose columns show the frequency of each leading digit. To make this comparable to Benford’s figures, we convert the measure to a Percent of Total.
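The Percent of Total step is just each digit’s count divided by the number of records. A small Python sketch of the same arithmetic follows; the amounts are invented sample values, not the Kaggle data, and `leading_digit_shares` is a hypothetical helper name.

```python
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    for character in str(abs(value)):
        if character in "123456789":
            return int(character)
    return None


def leading_digit_shares(values):
    """Map each digit 1 to 9 to its share of all non-zero values."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Invented sample amounts, for illustration only.
amounts = [120.50, 19.99, 1045.00, 230.00, 31.10,
           7.25, 150.00, 18.40, 2.99, 410.00]

shares = leading_digit_shares(amounts)
for digit, share in shares.items():
    print(f"{digit}: {share:.0%}")
```

Ten values are far too few to say anything about Benford’s law; the point is only to show how the counts become percentages.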
Add the Distribution Bands
Now comes the fun part – adding Benford’s numbers for analysis. We do this by adding a custom distribution band and setting the values to 90%, 100% and 110% of the (minimum) Benford measure we created earlier.
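The bands are simple multiples of the expected Benford share, so the same check can be sketched in Python. This assumes a 10% tolerance to match the 90% and 110% bands; the helper names and the evenly spread example shares are hypothetical.

```python
import math


def benford_band(digit, tolerance=0.10):
    """Return (low, expected, high) shares for one leading digit."""
    expected = math.log10(1 + 1 / digit)
    return expected * (1 - tolerance), expected, expected * (1 + tolerance)


def digits_outside_band(observed_shares, tolerance=0.10):
    """List the digits whose observed share falls outside the band."""
    flagged = []
    for digit in range(1, 10):
        low, _, high = benford_band(digit, tolerance)
        share = observed_shares.get(digit, 0.0)
        if not low <= share <= high:
            flagged.append(digit)
    return flagged


# Hypothetical observed shares: every leading digit equally common.
uniform_shares = {d: 1 / 9 for d in range(1, 10)}
print(digits_outside_band(uniform_shares))
```

A digit landing outside its band is only a prompt to look closer; how wide the band should be is a judgment call rather than a statistical test.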
What we can see from this is that the transactions don’t fit Benford’s distribution well. Although we have prior knowledge that the data has fraudulent transactions, it’s important to emphasise that this is a smell test and a poor fit does not necessarily indicate fraud.
For instance, a legitimate business which only sells ice cream for $5 will not fit this model: every sale is a multiple of $5, so the leading digits cluster on 5 and 1 (think $5, $10 and $15) rather than following Benford’s curve. A financial institution, however, would be expected to conform more closely to Benford’s law due to the large variety and quantity of transactions.
See the finished Viz on my Tableau Public
Hope you’ve enjoyed my first ever blog post! If you have any comments, suggestions or want to chat, feel free to connect with me on LinkedIn!
~ Ryan Edwards