Week 2 of the Data School gave us a closer look at all things Tableau. One thing we covered was Regression and Forecasting, in the analytics tab. One thing I remember from my stats classes at uni was to be careful when using regression, as it is not always a valid practice. This was something I wanted to revise, so hey, why not write a blog about it too!

What is Regression?

Regression is summarizing a set of data into a usable model. Ideally the model either explains an existing relationship or establishes a new relationship between 2 variables. There are statistical measures associated with a regression model that tell us how good or bad it is.

The simplest form of regression is “The Straight-Line” method. This is easily generated by Tableau in the analytics tab; all you need is a suitable chart first. The line produced by Tableau summarizes the overall direction of the data points. It is the line that minimizes the total squared vertical distance from each data point to the line (the “least squares” approach), and the “goodness of fit” tells us how close the points end up sitting to that line. This is great as it doesn’t take much mathematical ability to do (the computer does it for you), but there are some dangers.
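To make the straight-line method a bit more concrete, here is a minimal sketch of a least-squares fit in plain Python. This is not Tableau's own code; the function names and the five data points are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def r_squared(xs, ys, m, b):
    """Share of the variation in y that the line explains (1.0 = perfect fit)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up points that sit close to a straight line
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, m, b)
print(f"y = {m:.2f}x + {b:.2f}, R-squared = {r2:.4f}")
```

R-squared is one common goodness-of-fit measure: the closer it is to 1, the closer the points sit to the line.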

The first reason to use regression (that comes to mind) is extrapolation: the idea that the line generated from the data you have now can predict a value outside the range of that data. Most commonly this means using historical data to predict future outcomes, but it can also be applied to two “time invariant” measures (for example, GDP vs mortality rate, or the cost of a Big Mac vs cost of living).

Things to look out for…

From memory, my statistics lecturers would say that extrapolation is not all that reliable. There are a multitude of factors that can give you an unreliable predictive line; I will focus on three:

1. There is a law/rule that contradicts the prediction, or the data is constrained by some limiting factor.

An example could be data on the number of people carrying a contagious disease within a population. If the predictive line shows an overall increase over time, you may predict a large number of sick people at some point in the future. The constraint in this case is the total population (there can’t be more sick people than there are people). Be careful and check that your prediction sits inside the realm of possibility.
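A quick sketch of that sanity check, using entirely hypothetical numbers (the weekly counts and the population of 1,000 are invented for illustration): fit a line, extrapolate, and test the result against the hard limit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Hypothetical data: people infected at the end of each week
weeks = [1, 2, 3, 4, 5]
infected = [100, 220, 310, 390, 500]
population = 1000  # hypothetical hard upper limit

m, b = fit_line(weeks, infected)


def predict(week):
    """Extrapolate the fitted line to any week."""
    return m * week + b


def is_plausible(value):
    """A prediction must stay between zero and the whole population."""
    return 0 <= value <= population


for week in (8, 12):
    estimate = predict(week)
    print(week, round(estimate), is_plausible(estimate))
```

With these numbers the line happily predicts more sick people than exist by week 12, which is exactly the kind of result the line itself will never warn you about.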

2. There isn’t actually a relationship between your measures, and any pattern you see is completely random.

Below is a chart that plots 2 fields of random points against each other. As you can see, the predictive line is slightly descending, and it’s obvious that there isn’t much of a relationship here. All this shows is that, in predictive analysis, the relationship we are measuring could be completely down to chance. If I generated multiple sets of random points, eventually there would be one that looked a bit better.

A value to pay close attention to is called the “p-value”. This is a statistical value that tells us how likely it is that we would see a relationship at least this strong purely by chance, if there were no real relationship between the two fields. Generally, if the p-value is less than 0.05 (a 5% significance level), the relationship is accepted as “statistically significant”, meaning it is unlikely to be down to random chance alone. A stronger relationship between 2 fields of data will give you a lower p-value. The p-value in the random chart was 0.7575, well above 0.05, so you can see that this value can come in handy.
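One way to get a feel for what a p-value measures is a permutation test: shuffle one field many times and count how often the shuffled data looks at least as related as the real data. This is a swapped-in technique for illustration, not how Tableau calculates its p-value, and the data, seeds and function names below are all my own assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient: -1 to 1, with 0 meaning no linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate how often pure chance gives a correlation at least this strong."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed data itself, so the estimate is never exactly 0
    return (hits + 1) / (n_shuffles + 1)


# Two fields of random points, like the chart above
rng = random.Random(42)
random_x = [rng.random() for _ in range(50)]
random_y = [rng.random() for _ in range(50)]
p_random = permutation_p_value(random_x, random_y)

# A genuine straight-line relationship for comparison
linear_x = list(range(20))
linear_y = [2 * x + 1 for x in linear_x]
p_linear = permutation_p_value(linear_x, linear_y)

print(f"random data p-value: {p_random:.4f}")
print(f"linear data p-value: {p_linear:.4f}")
```

The genuinely linear data should come out far below 0.05, because shuffling almost never reproduces a relationship that strong.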

3. The sample size or the span of your data is not large enough.

It is important to know the context of your data points: which points should not be included in your model (and why), and whether your points represent a complete set. I think the easiest example to use would be weather data, as that is the first thing most people think of when they hear the word “Forecasting”.

This chart shows Austin (Texas) and its alarming temperature rises! Is it fair to say that it’s going to get hotter and hotter as the year progresses? Of course not; temperature is cyclical, and Austin will cool down. This example is pretty obvious, but it may not be with every set of data. If the data is cyclical/periodic, it is important to ensure you have multiple “cycles” of data before doing any regression.
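The effect is easy to reproduce with a synthetic yearly cycle. In this sketch the temperature curve is an invented cosine wave (coldest in month 0, hottest in month 6), not real Austin data, and the names are my own.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def monthly_temp(t):
    """Made-up 12-month cycle: 10 degrees at t=0, peaking at 30 degrees at t=6."""
    return 20 - 10 * math.cos(2 * math.pi * t / 12)


half_year = list(range(7))     # only the warming half of one cycle
three_years = list(range(36))  # three complete cycles

half_slope, _ = fit_line(half_year, [monthly_temp(t) for t in half_year])
full_slope, _ = fit_line(three_years, [monthly_temp(t) for t in three_years])

print(f"slope from half a year:  {half_slope:.2f} degrees per month")
print(f"slope from three years:  {full_slope:.2f} degrees per month")
```

Half a cycle gives a steep upward line, while three full cycles give a slope close to zero, which is the more honest summary of a temperature that just goes up and down each year.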

When do I use Regression?

Something I think regression can be useful for is estimating values that lie in gaps in your data (interpolation). Using the “y = mx + b” formula, you can find a reasonable value of one variable if the other one is fixed/known. In the example below, the temperatures for June 2017 are missing; using the regression line, we can make a pretty good estimate of what the temperature was.
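As a rough sketch of filling a gap with “y = mx + b”, here is the same idea in Python. The monthly temperatures are invented stand-ins (not the values from my chart), with month 6 deliberately left out.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Hypothetical average temperatures; month 6 (June) is missing
months = [1, 2, 3, 4, 5, 7, 8]
temps = [10.2, 12.9, 16.1, 18.8, 22.1, 27.9, 31.0]

m, b = fit_line(months, temps)

# Plug the known x (month 6) into y = mx + b to estimate the missing y
june_estimate = m * 6 + b
print(f"estimated June temperature: {june_estimate:.1f}")
```

Because month 6 sits inside the range of the data, with known points on both sides, the estimate is much safer than predicting a month beyond the last one.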

If you are interested in learning more, click here; this is the site I used for a bit of background information. It also contains extra information on some other types of regression.