I had been planning to create a text analysis tool for a while when Alteryx introduced its Intelligence Suite in version 2020.2. It includes seven text mining tools, which appear in the new Text Mining category on the tool palette. However, you need to purchase a license to use them. That, together with an internal project that required text analysis and my newly found passion for text mining, gave me the final motivation to start working on the tool.
Text data is everywhere
…and it’s growing fast. There are 1.4 billion posts a day on Facebook and Twitter alone.
More and more businesses want to analyse textual data such as emails, surveys, call center logs and social media streams (blogs, forum posts, tweets, newsfeeds) to better understand what people think and say about them.
There are many open source Python libraries with impressive functionality for Natural Language Processing tasks. I wanted to create a tool which puts the power of these packages into the hands of data analysts who don’t know how to code in Python, and which also makes work easier for those who do.
Used Python Libraries
To create the tool I used the following free, open source Python libraries: NLTK, spaCy, scikit-learn (sklearn), TextBlob and langdetect. I didn’t invent anything new. I just created an easy-to-use interface to some of the functionality of these libraries. There is one exception though. For the Sentiment Analysis task I trained a model on labelled, anonymized review data sets from various websites. This Sentiment Analysis model reaches about 95% accuracy when tested on test data sets.
Download The Tool
The installer can be downloaded from here. In the folder you will find two versions of the installer beside other tools I created. Feel free to download those too if you like. I am planning to add more tools to this folder, so keep an eye on it if you are interested. The next tools to come will be all in the text mining area.
If you are using an Alteryx version older than 2020.2, you will need the Text Analysis v3 – Pre 2020.2.yxi installer. Otherwise download the file named Text Analysis v3.yxi.
You will need admin rights to install the tool.
If you are asked to install for current user or all users, select all users.
After installation the tool will appear on the Text Mining palette if you are using version 2020.2 or newer. In older versions you will find it in the Unknown category.
The icon looks like this:
The User Interface
The tool has one input and three outputs. The main output is labelled “O”. It will always contain every row from the input. The other two outputs are optional. They might be empty or have more than one row for each input row. To make it possible to join the records from the optional outputs to the records in the main output, the tool generates a ‘record_id’ field in every output.
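To illustrate why the ‘record_id’ field matters, here is a minimal Python sketch of joining a main output to an optional output that has zero or several rows per input row. Inside Alteryx you would normally do this with a Join tool. The rows and every field name other than ‘record_id’ are made up for the example, and the helper left_join is hypothetical, not part of the tool.

```python
from collections import defaultdict

# Hypothetical rows; only 'record_id' is a field name taken from the tool.
main_output = [
    {"record_id": 1, "text": "Great product, fast delivery"},
    {"record_id": 2, "text": "ok"},
]
optional_output = [
    {"record_id": 1, "token": "great"},
    {"record_id": 1, "token": "product"},
    {"record_id": 1, "token": "fast"},
    {"record_id": 1, "token": "delivery"},
    # record_id 2 produced no rows in this optional output
]


def left_join(main_rows, optional_rows, key="record_id"):
    """Attach the matching optional rows (possibly none) to each main row."""
    by_key = defaultdict(list)
    for row in optional_rows:
        by_key[row[key]].append(row)
    # Every main row is kept, mirroring the fact that the main output
    # always contains every input row.
    return [{**row, "matches": by_key.get(row[key], [])} for row in main_rows]


joined = left_join(main_output, optional_output)
for row in joined:
    print(row["record_id"], len(row["matches"]))
```

Keeping every main row and attaching a possibly empty list of matches reflects the one-to-many relationship described above: an optional output can be empty for some records and hold several rows for others.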
See the user interface of the main output below:
Select the Source Field which holds the text data you want to analyse. Then tick the metrics and functions you want to include in the output.
Hovering the mouse over the ? icon gives you information about the metric or function.
Be aware that some of these tasks are very time consuming to process, especially those towards the bottom of the screen.
There are two optional outputs, labelled “1” and “2”. Their user interfaces are identical.
The above tasks can be selected for the optional outputs. If you need to perform more than two of these, you have to put another copy of the tool on the canvas. The generated ids will be the same for the same input rows in both tools, so you can rely on them if you want to join the outputs together.
The model I trained to do sentiment analysis performs exceptionally well, and it also works well with longer sentences. It gives an average of 95% accuracy when classifying documents into positive and negative categories.
See some example results below, compared to the VADER algorithm, which in my tests reached about 75% accuracy on the same data sets.
I’ll let you decide which algorithm performs better.
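For readers who want to run a similar comparison on their own labelled data, accuracy for a positive/negative classifier is simply the share of documents whose predicted label matches the true label. The sketch below shows that calculation in plain Python. The labels are toy values made up for illustration, not my test data, and the function name accuracy is just a name chosen for this example.

```python
def accuracy(predicted, actual):
    """Share of documents whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels made up for illustration only; not taken from any real test set.
actual = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

print(accuracy(predicted, actual))  # 3 of 4 labels match -> 0.75
```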
Happy Text Mining!
It was lots of fun creating the tool and I learnt a lot about different Python NLP libraries and training models on text data during the process.
I hope lots of people will enjoy using it as much as I enjoyed working on it.
Here is the link again to download the tool. Don’t hesitate to reach out to me if you have questions about it or suggestions on how to improve it.
If you get errors using the tool after updating Alteryx from 2020.2 to 2020.3, you need to do the following:
- Delete the Text Analysis folder from %APPDATA%\Alteryx\Tools, or from %ALLUSERSPROFILE%\Alteryx\Tools if it’s an Admin install
- Re-install the tool
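If you prefer to script the clean-up step, the following is a minimal sketch, assuming Python is available on the machine and that the folder is named exactly “Text Analysis” as above. It checks both locations rather than only one, which is a choice made for this example. The function name remove_tool_folders is hypothetical. It is probably safest to close Alteryx before running it.

```python
import os
import shutil
from pathlib import Path


def remove_tool_folders(env=os.environ, tool_name="Text Analysis"):
    """Delete <base>/Alteryx/Tools/<tool_name> under APPDATA and ALLUSERSPROFILE.

    Returns the list of folders that were actually removed.
    """
    removed = []
    for var in ("APPDATA", "ALLUSERSPROFILE"):
        base = env.get(var)
        if not base:
            continue  # variable not set, e.g. when not running on Windows
        folder = Path(base) / "Alteryx" / "Tools" / tool_name
        if folder.is_dir():
            shutil.rmtree(folder)
            removed.append(folder)
    return removed


# Usage (commented out so nothing is deleted by accident):
# print(remove_tool_folders())
```

After the folders are gone, re-install the tool with the .yxi installer as described in the steps above.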