The second dashboard of Dashboard Week at the Data School was based on the Kaggle dataset named “YouTube Trending Video Dataset”. I am keen on Data Science and Machine Learning, so I decided to investigate this dataset using sentiment analysis. Sentiment analysis plays an important role nowadays and can be applied to many areas of interest. For instance, share traders can analyse tweets or blogs to decide whether to buy or sell shares, and shops can investigate their customers’ attitudes to their products. Of course, with only a day for this project, you probably should not expect too much from it. However, I gave it a try, so let’s see what I obtained.
Creating the Model in Alteryx
There are different approaches to creating a sentiment analysis model. A good video by Tim Ngwena showing sentiment analysis in Alteryx can be found here. However, it is about two years old, and I believe that the Sentiment Analysis tool was added to Alteryx after that. The tool uses a pre-trained model and has a straightforward setup: you define the field with the text you want to analyse, the language of the text and an algorithm. Only one algorithm, named “VADER”, is currently available, and I believe that more will be added in new versions. This is how the setup of the tool looks now:
In my analysis, I made the following hypothesis: if a YouTube video has more likes than dislikes, it has a positive sentiment; otherwise, the sentiment is negative. The sentiment predicted by my model should match this label. In addition, I split the analysis into three cases, which are compared afterwards. The cases are:
- consider only title;
- consider only description;
- consider concatenated title + description + tags.
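The hypothesis and the three cases above can be sketched in Python as follows. This is only an illustration outside Alteryx, not the workflow itself: the function names `likes_label` and `build_text_cases`, the dictionary keys and the sample values are hypothetical, and joining the fields with a single space is an assumption.

```python
def likes_label(likes, dislikes):
    """Label from the hypothesis: positive only if likes exceed dislikes."""
    return "positive" if likes > dislikes else "negative"


def build_text_cases(title, description, tags):
    """Return the three text variants that are analysed separately.

    Empty fields are skipped when concatenating, so a missing
    description does not leave a double space in the combined text.
    """
    combined = " ".join(part for part in (title, description, tags) if part)
    return {
        "title": title,
        "description": description,
        "title_description_tags": combined,
    }


# Made-up sample record, for illustration only.
cases = build_text_cases("Great day out", "", "vlog travel")
label = likes_label(likes=120, dislikes=30)
```

Each of the three variants would then be passed to the sentiment model separately and compared with the label derived from likes and dislikes.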
As usual, data preprocessing took about 80% of the development time, and the rest went to the sentiment analysis and the Tableau viz. Below is the workflow I developed:
The Sentiment Analysis tool produces the following results:
Finally, the text can be categorised by its compound_sentiment_score: if the score is higher than 0.5, the sentiment is positive; otherwise, it is negative.
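As a rough sketch, the categorisation rule could look like this in Python. The function name `categorise` and its `threshold` parameter are hypothetical; the cut-off of 0.5 is the one used in this post, and a score exactly equal to the threshold is treated as negative, following the "higher than" wording.

```python
def categorise(compound_score, threshold=0.5):
    """Map a compound sentiment score to a category.

    Scores strictly above the threshold are positive;
    everything else, including the threshold itself, is negative.
    """
    return "positive" if compound_score > threshold else "negative"


# Made-up scores, for illustration only.
print(categorise(0.8))
print(categorise(0.5))
print(categorise(-0.3))
```

Keeping the threshold as a parameter makes it easy to test how sensitive the results are to the chosen cut-off.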
Creating the Tableau Viz
Based on the results, the Tableau visualisation was created:
One can see that the plane is split into two halves by a diagonal line. The upper half shows videos (points) with more likes than dislikes, for which we assume a positive sentiment; the lower half contains the videos we assume to have a negative sentiment. Red dots represent videos where the sentiment predicted in Alteryx matches the sentiment based on the number of likes and dislikes; otherwise, the dots are white. The highest accuracy, 53.10%, was obtained for the case where the analysed text combines title, description and tags. This value is close to 50%, which tells us that the model does little better than a random guess, so we can reject our hypothesis.
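The accuracy here is simply the share of videos where the two labels agree. A minimal sketch in Python, with a hypothetical function name `accuracy` and made-up labels that are not taken from the dataset:

```python
def accuracy(predicted, expected):
    """Fraction of positions where predicted and expected labels match."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(predicted)


# Made-up labels, for illustration only.
model_labels = ["positive", "negative", "positive", "negative"]
likes_labels = ["positive", "positive", "positive", "negative"]
print(f"{accuracy(model_labels, likes_labels):.2%}")  # 3 of 4 match: 75.00%
```

With two classes, a value near 50% on a balanced set is what guessing at random would give, which is why the reported accuracy is read as no real agreement.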
A user can choose a button out of three choices on the viz: Title, Description and Title + Description + Tags. Based on the selected button, the results will change on the fly. The developed viz can be found here.
Conclusion
Even though we rejected our hypothesis based on the developed model, with more time we could still try to improve the model by introducing more text, for instance, the comments on each video. We could also train our own model using supervised learning techniques, where the dependent variable would be the category of the video based on its number of likes and dislikes. It would be interesting to try.