After 4 weeks of intensive training in Tableau and Alteryx, we started our first client project on the 5th Monday. Client projects are great opportunities to work with real data on real business problems, delivering solutions within a short period of time using agile methodology. At The Data School we use the Scrum framework.
On Monday morning we met the client. We received the datasets, gathered the requirements and were briefed about the data. Then we kicked off the sprint with a Sprint Planning session. It was an interesting project, working with DNA sequencing data.
Most of the data we received was quite clean and in good shape, ready to work with. However, my part in the project relied on hundreds of Excel files which were not designed for data analysis. They were created manually and varied widely in structure. Although the client didn't believe we could use these files, they asked us to give it a go, as it would give them great new insights if we could make it work.
If you are curious about how the data looked, here is an example of one of the Excel files. I have changed the actual content as the original data is confidential, but it will give you a sense of what we were working with.
Preparing the Data
In order to analyse the content of these files, we needed to find a way to read them in with an automated process and export the data to a format that we could use in Tableau. Tableau comes with free access to Tableau Prep, which is a data cleaning and preparation tool. Because the client also had access to it, I gave it a go. Although Tableau Prep is a great tool and it is evolving fast, at this time I couldn't manage to reshape the data with it. The other option was to use Alteryx. I have to admit that I chose to work on this part of the project because we knew we would most probably need Alteryx. If you remember my previous blog post, you know that I am in a close romantic relationship with this data cleaning, preparation and analytics tool.
Below is the workflow I created with Alteryx. It reads in the hundreds of Excel files, cleans them, and makes the data more consistent to work with.
The tools in the first container clean the data. This includes eliminating unnecessary columns and empty rows, removing leading and trailing whitespace characters, etc. The second and third containers format the data, creating a column for the information in the header and one for the footer, which is common to all sections within a file. I've hidden some of the other details by closing the later containers, as the main thing I want to explain is that at the end of the workflow the data comes out in a nicely organised, consistent format, ready to be analysed in Tableau.
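To make the reshaping steps concrete, here is a minimal Python sketch of the same logic applied to one sheet. This is not the Alteryx workflow itself, and the layout and field names (a run identifier in the header, a note in the footer, sample/value columns) are hypothetical stand-ins for the confidential data; it just illustrates the pattern of trimming whitespace, dropping empty rows, and promoting the header and footer values into columns shared by every data row.

```python
def tidy_sheet(rows):
    """Turn one raw sheet (a list of rows) into tidy records.

    Mirrors the containers described above: strip whitespace,
    drop empty rows, then promote the header and footer values
    into columns repeated on every data row.
    """
    # Clean: trim whitespace and drop rows that are entirely empty
    cleaned = []
    for row in rows:
        stripped = [str(cell).strip() if cell is not None else "" for cell in row]
        if any(stripped):
            cleaned.append(stripped)

    header = cleaned[0][0]   # hypothetical: a run identifier in the header
    footer = cleaned[-1][0]  # hypothetical: an operator note in the footer
    data = cleaned[1:-1]

    # Reshape: one consistent record per data row
    return [
        {"run": header, "note": footer, "sample": row[0], "value": row[1]}
        for row in data
    ]

# A made-up sheet as it might come out of one Excel file
raw = [
    ["Run 42", ""],
    ["", ""],                 # empty row to discard
    ["  sample-A ", " 0.91"],
    ["sample-B", "0.87"],
    ["Checked by J.S.", ""],
]

for record in tidy_sheet(raw):
    print(record)
```

In practice this function would run inside a loop over all the files (for example via `glob` and an Excel reader such as pandas), with the results appended into one table, which is essentially what the batch of containers in the workflow does.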
The disorganisation and lack of consistency in the original files is no surprise. Many businesses have similar, potentially valuable data hidden in hundreds of Excel sheets or even text documents. At the time the files are created and used, the value of the data collected over many months or years is not obvious. It is only in hindsight, and with tools like Alteryx, that we are beginning to make the most of these types of data.
Working with my fellow Data Schoolers, we then analysed the data to uncover important business intelligence for the client. On Friday afternoon we presented our findings to the client during the Sprint Review. They were particularly impressed that we were able to extract and analyse the data from the hundreds of Excel files. This gave the client new insights into the relationship between the data and the failure rate of DNA sequencing.
It was a jam-packed week and we learnt a lot, not just about how to apply data analysis skills and tools, or about the subject matter itself; our regular training continued in the background too. This week was such a great experience again.