Throughout my Data School training, the blogs of other Data Schoolers have been a hugely important resource. There is a Data School blog that answers almost any Tableau or Alteryx development question, along with a vast number of posts on other data-analytics topics.
If only you could find all this information in one place…
This is what my newest dashboard attempts to provide. It is a network of categorically grouped Data School blogs, connected by association.
It is a graphical representation of all the topics written about. You can use it to see what Data Schoolers have written about, find recommendations for your topic of interest, or even understand how the networking algorithm works.
Here is a demo of how you might use it.
So How Does it Work?
Web-Scraping the Data
I have a small addiction to web-scraping the Data School website. In my first week at the Data School I wrote this blog about how to sneakily download all the Data School’s blog information. That blog only needed the titles and authors, but this time I needed the full text of every blog. To get it, I used Alteryx to download each blog page, used regex to find the individual blog links, and then downloaded those as well.
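For readers who prefer code to Alteryx tools, the link-finding step might look roughly like this in Python. It is a sketch only, not the Alteryx workflow itself: the `/blog/<slug>` link pattern, the `extract_blog_links` helper and the `example.com` base URL are all assumptions made for illustration, and the real site's markup may differ.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: assumes blog links look like href="/blog/<slug>".
BLOG_LINK_RE = re.compile(r'href="(/blog/[^"#?]+)"')

def extract_blog_links(listing_html, base_url):
    """Return the unique absolute blog URLs found in one page, in page order."""
    links = []
    for path in BLOG_LINK_RE.findall(listing_html):
        url = urljoin(base_url, path)
        if url not in links:  # the same blog is often linked more than once
            links.append(url)
    return links

# Made-up HTML standing in for a downloaded page.
sample_html = '''
<a href="/blog/batch-macros-explained">Batch macros</a>
<a href="/blog/lod-basics">LODs</a>
<a href="/blog/batch-macros-explained">Read more</a>
<a href="/about">About</a>
'''
links = extract_blog_links(sample_html, "https://example.com")
```

Each URL in `links` would then be downloaded in turn to get the full blog text.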
After stripping out the HTML tags, I used Alteryx’s Text Pre-processing tool to convert each word in the body text of the articles to its root word and to remove punctuation and stop words (such as “for” and “or”). Then I created the count-vectorized data for each document: one column per word, where each row holds the count of that word’s occurrences in the given blog. This can be achieved in Alteryx with a combination of the Text to Columns, Summarize and Cross-Tab tools.
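To make the count-vectorized layout concrete, here is a minimal Python sketch of the same idea. It is an illustration under simplifying assumptions: the stop-word list is a tiny made-up one, the reduction to root words is skipped entirely, and the names `tokenize` and `count_vectorize` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"for", "or", "the", "a", "to", "and"}

def tokenize(text):
    """Lowercase, drop punctuation and stop words (no root-word step here)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def count_vectorize(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

docs = ["Batch macro for the win", "A macro, or a batch of macro tips"]
vocab, matrix = count_vectorize(docs)
# vocab  -> ['batch', 'macro', 'of', 'tips', 'win']
# matrix -> [[1, 1, 0, 0, 1], [1, 2, 1, 1, 0]]
```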
Plotting the Points
If I want to plot the above data in Tableau, the biggest challenge is deciding where to place each point. These need to be arranged in a sensible way – blogs of the same topic should be plotted close to each other, and blogs of different topics should be plotted far apart.
This challenge can be thought of as a problem of dimensionality reduction. The vectorised word counts of each blog encode the information about association. Blogs that use the words “batch macro”, for example, should be grouped together. However, there are over 8000 unique words across the blogs. This means our vectorised word data is 8000-dimensional! So how do we plot 8000-dimensional data on two dimensions?
I tried three popular dimensionality-reduction algorithms to achieve this – PCA, t-SNE, and UMAP. I will not go into detail on how these algorithms work – perhaps that is a topic for another blog. After some tweaking, the UMAP algorithm generated X and Y coordinates that looked good when I plotted the points in Tableau – those coordinates are the point locations on the dashboard.
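To show what "reducing to two dimensions" means in practice, here is a rough Python sketch of PCA, the simplest of the three algorithms to write without third-party libraries. To be clear, this is a swapped-in illustration: the dashboard coordinates came from UMAP, not from this code. The function names and the toy word-count matrix are made up, and the power-iteration approach is just one of several ways to compute PCA.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the data is centred on the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def leading_direction(rows, iterations=200, seed=0):
    """Power iteration for the direction of greatest variance in centred rows."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iterations):
        # Apply X^T X to v without building the full covariance matrix.
        scores = [sum(x * c for x, c in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project each row onto its first two principal components."""
    x = center(rows)
    axes = []
    for _ in range(2):
        v = leading_direction(x)
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        axes.append(scores)
        # Remove the component just found so the next pass finds a new one.
        x = [[a - s * b for a, b in zip(row, v)] for row, s in zip(x, scores)]
    return list(zip(*axes))

# Toy word-count matrix: 4 blogs, 4 words.
word_counts = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 4, 1], [0, 1, 3, 2]]
coords = pca_2d(word_counts)  # one (x, y) pair per blog
```

The sign of each axis is arbitrary, so a plot may come out mirrored from one method or run to the next; only the relative positions of the points matter.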
Drawing the Lines
Building the lines between points is a relatively simple task. First, I created a correlation matrix of documents based on their vectorised words. This gives a score for the association between each pair of documents, with scores closer to 1 meaning a stronger association. I used only strong associations to draw the lines (“strong” is arbitrarily defined here as a correlation > 0.6). To arrange the data for use in Tableau, I transposed the correlation matrix into a long format, with one row per pair of documents, which is a common way to arrange a correlation matrix for Tableau plotting.
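As a rough illustration of this step, the Python sketch below computes a Pearson correlation for every pair of documents and keeps only the pairs above the 0.6 threshold, returning them in long format. It assumes Pearson correlation is the measure used; the toy vectors and the names `pearson` and `strong_pairs` are made up for the example.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length count vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)

def strong_pairs(vectors, threshold=0.6):
    """Long-format rows (doc_a, doc_b, score) for pairs above the threshold."""
    rows = []
    for a, b in combinations(sorted(vectors), 2):
        score = pearson(vectors[a], vectors[b])
        if score > threshold:
            rows.append((a, b, score))
    return rows

# Toy word counts for three blogs.
vectors = {
    "blog_a": [3, 2, 0, 0],
    "blog_b": [2, 3, 0, 1],
    "blog_c": [0, 0, 4, 1],
}
edges = strong_pairs(vectors)
# Only blog_a and blog_b pass the threshold (score about 0.77);
# both pairs involving blog_c are negatively correlated and are dropped.
```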
Then I needed to create two points for each line, a “from” point and a “to” point, for Tableau line-chart plotting. To do this I unioned the data to itself, with one copy of the data taking the “from” point UMAP coordinates, and the other copy taking the “to” point UMAP coordinates. After some formatting, filters and parameter actions, the finished dashboard was built!
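The self-union can be sketched in Python as follows. This is a simplified illustration with made-up data: the field names (`line_id`, `path_order` and so on) and the `build_line_rows` helper are hypothetical, and the coordinates stand in for the UMAP output.

```python
def build_line_rows(edges, coords):
    """Stack a 'from' copy and a 'to' copy of the edges, two rows per line."""
    from_rows = [
        {"line_id": i, "path_order": 0, "blog": a,
         "x": coords[a][0], "y": coords[a][1]}
        for i, (a, b, _score) in enumerate(edges)
    ]
    to_rows = [
        {"line_id": i, "path_order": 1, "blog": b,
         "x": coords[b][0], "y": coords[b][1]}
        for i, (a, b, _score) in enumerate(edges)
    ]
    return from_rows + to_rows  # the union of the two copies

# Made-up edge list and 2D coordinates.
sample_edges = [("blog_a", "blog_b", 0.77)]
sample_coords = {"blog_a": (0.0, 1.0), "blog_b": (2.0, 3.0)}
line_rows = build_line_rows(sample_edges, sample_coords)
```

In Tableau, a shared `line_id` would group the two rows into one line, and `path_order` would tell the line mark which end comes first.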
I envision this visualisation as a way of navigating Data School blogs. If there is a blog you like, you can search for it and find similar blogs. If there is a topic you are interested in, you can find all the blogs on that topic in one area.
Feel free to play around! I find the associations between each blog interesting. Try searching for “dashboard week” – you will see the centre of the network light up with blogs!