This blog post was inspired by my cohort not really getting how trend lines are calculated when using Tableau. So I thought I’d brush up on my statistics knowledge and write about it.
The five different types of trend lines are:
I will try to explain the differences and when to use them.
Most line equations are in the form Y = MX + C with Y as your variable on the y-axis, M as the slope or coefficient of the X variable, which is the values on your y-axis, C is the constant or value when no X value is present. On Tableau, you can download the Superstore sample dataset and put sales on Columns and Profit on the Y-axis to follow along. If we choose the circle under the marks card we get a scatter plot. Here this would try to see the relationship between sales and profit. The linear graph can sometimes be unrealistic, for example, the y-intercept is negative. In this case when there are no sales profit is negative, in this context it may make sense, but it always something to look out for, especially if the y-axis shouldn’t take negative numbers.
Linear trend line.
Don’t let it scare you, polynomial means “made up of several items”. A Polynomial equation is always in the form Y = M1X1 + M2X22+ M3X34+ M4X45 + M5X56 + M6X67+ M7X78 + C . In Tableau, you can set the number of coefficients from 2 to 8, so that more or fewer curves or “‘hills” can be added or taken away for accuracy purposes.
An Exponential equation is of the form Y = M1e(m*X). e here represents the 2.71828, or (1 + 1/n)n. e is very important in mathematics and economics, for example in determining the value of a return of investment with compounding interest like an account that starts at $1 and offers an annual interest rate of R will, after t years, will yield eRt. The name derives from its shape, an exponential or rapid increase.
A Logarithmic equation is always of the form Y = m1*ln(X) + C. A logarithmic equation is similar to the inverse of the exponential function. Here Ln represents the natural logarithm or log base e. As a result ln(e) = 1. To use other numbers so that it does not look so complex, ln(10) = 2.302, so e2.302 =10, you take e as the base (what to the power of e gives the value derived by the natural logarithm). The graph tends to rise very quickly and then trails off slowly rising as the x value increases.
A power curve is of the form Y = m1*X(m2) . The power trend line does not have a constant and is dependant on the x-axis value and a constant determined by tableau. It may look similar to the linear function in the graph, but it is created from a completely different function.
For all trend lines, there is an R squared value in the tooltip. The R squared value is always between 0 and 1 and represents the explained variation divided by the total variation. In essence, the closer to 1 the better as the model created by the trend line accurately describes 100% of all variation around the mean.
Going through the iterations of the different trend lines using the superstore data between profit and sales I found the power trend line to have the highest R squared value. Of course, I would not recommend using the trend line that has the highest R-value all the time, as visualisation is the most important thing to consider. If it looks nice and matches up with a lot of the points and is easy to comprehend then use it. Using a polynomial of degree 8 can produce “accurate” trend lines but one may have zoom out quite a bit. Also, use some intuition. If the graph looks like it slopes upwards in a curve fashion then a linear trend line is not the best to use. The type of trend line can add to you visualisation but may hinder it if you can’t explain why you used it or what value it adds. However, a linear trend line does have the advantage to be able to clearly see whether the trend is upwards or downwards sloping, and what the magnitude is easier than other lines.
There is P or probability value assigned to the trend line. The lower this value the better. Without going into stats too much, Tableau is computing how well your trend line fits the model based on an F statistic with a hypothesis that an intercept only model is accurate. If P is low we reject this and fail to reject (or accept, but we can never truly accept a hypothesis) the alternate option that this trend line accurately describes the data. A low p value may be a bad thing as the line may not be the best model to follow. To put it another way, there is a 1-p chance that the coefficient of the trend line is not 0.
– Alex Taylor-Jackson