The first week (and a bit) of The Data School has passed. For the next couple of weeks the focus is getting up to speed with the software Tableau and Alteryx, and along the way learning about industry best practices in data manipulation and visualisation. Some of these practices are quite different from, or at least have a different emphasis than, the more academic statistical analyses I came across and carried out during my former studies in biostatistics. For my first blog post I would like to write about my first impressions of these differences.


Perhaps the biggest difference is that a lot of the time, a client will not have a specific scientific question for you to answer. Quite often they will present you with data, and want a visualisation and summary of interesting features and “insights”. This was what was asked of us in our second and final interview for The Data School. To be honest, I did not really know what was meant by the term “insight”. To me the word insight means a kind of knowledge that is both profound and broadly important. My dashboard for the interview was exploratory rather than explanatory, meaning that the user can select parameters and filters to find information of interest, with little direction. In my presentation I spent a fair amount of time explaining how this was possible, but also pointed out an interesting data point I had found, even though it was not broadly important to the data. It turned out that the interviewers were more interested in this than the broad summary.


This focus on outliers makes sense in a business setting. A client will probably have a good general idea of how the business is going, but will want to home in on under- or over-performing areas. This is quite different from most biostatistical research, where the interest is usually in an average causal effect of a measure (for example, the risk ratio of a disease between two treatment groups). Although outliers are acknowledged and sometimes questioned, there is a reluctance to give them undue emphasis, which would bias the result.


Visualisation has a much larger role in industry than in academia. In part this is probably due to the conservative nature of academic publication, which is still often in print. But it is also because the results of statistical analyses can often be summarised appropriately in a simple table, for example, confidence intervals for an effect in two or more groups. In fancier papers they may use 


Finally, the tools we have learnt so far in The Data School (Tableau and Alteryx) are far more intuitive than what I used throughout my biostatistics degree (mostly R; thankfully I only used a bare minimum of Stata). The data cleaning, filtering, unions, joins and summaries we learnt today in Alteryx are all possible in R, but would have taken many lines of code and a lot of head-scratching.
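To give a rough feel for what those steps look like when written out by hand, here is a minimal sketch. It is in plain Python rather than R, and the data, column names and store lookup are all made up for illustration; it is not the workflow we built in class.

```python
from collections import defaultdict

# Hypothetical sales records from two regional files (invented for illustration).
north = [
    {"store_id": 1, "sales": 120.0},
    {"store_id": 2, "sales": None},  # a missing value to be cleaned out
    {"store_id": 3, "sales": 80.0},
]
south = [
    {"store_id": 4, "sales": 200.0},
    {"store_id": 5, "sales": 50.0},
]

# Hypothetical lookup table used for the join.
stores = {1: "Urban", 2: "Urban", 3: "Rural", 4: "Urban", 5: "Rural"}

# Union: stack the two inputs on top of each other.
combined = north + south

# Cleaning / filter: drop rows with a missing sales figure.
cleaned = [row for row in combined if row["sales"] is not None]

# Join: attach the store type to each row by store_id.
joined = [{**row, "store_type": stores[row["store_id"]]} for row in cleaned]

# Summary: total sales per store type.
totals = defaultdict(float)
for row in joined:
    totals[row["store_type"]] += row["sales"]

summary = dict(totals)
print(summary)
```

Each commented step corresponds roughly to one tool dragged onto the canvas in Alteryx, which is where the difference in effort shows up for me.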


Having said that, I do hope to use my statistical training on the job. In particular I think the causal inference framework has potential application in a variety of contexts. And it turns out that Alteryx supports R or Python input, so if I really want to, say, complete statistically robust multiple imputation of missing data, I can use the mice package from R.

Author: The Data School