Welcome to my second post of Dashboard Week!
In case you’re wondering what Dashboard Week is: it’s a week in which, each day, all the Data Schoolers create a dashboard and post about it! Tuesday’s challenge was to create a dashboard using data on maritime piracy.
Today our challenge involved a dataset about pirates from the Maritime Safety Office. I saw two key challenges with this dataset:
- Spatial data and the opportunity to blend it
- The majority of the ‘data’ being written in natural language
As every data scientist encounters natural language at some point, I thought it would be great to spend time learning this today!
Two fishing boats sailed towards Tambisan for fishing /shrimp activities. Seven armed persons in two pump boats, wearing camouflage uniforms and masks, approached two fishing vessels. The persons boarded the fishing vessels, held the crew at gunpoint and took their personal effects, phones and documentation. Before leaving the armed persons kidnapped three crew members from one fishing vessel and headed to Tawi-Tawi islands.
How many baddies? Seven. This is an example where simply extracting the numbers would be futile. Let’s illustrate why:
- There were two fishing boats
- There were two pump boats
- There were seven armed persons
- There were three kidnapped crew members from one fishing vessel
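A minimal Python sketch of the naive approach (an illustration only, not the method used in this post): pull every number word out of the report with a regular expression. The `NUMBER_WORDS` list and the `find_number_words` helper are names made up for this example.

```python
import re

# The incident report quoted above, verbatim.
REPORT = (
    "Two fishing boats sailed towards Tambisan for fishing /shrimp activities. "
    "Seven armed persons in two pump boats, wearing camouflage uniforms and masks, "
    "approached two fishing vessels. The persons boarded the fishing vessels, "
    "held the crew at gunpoint and took their personal effects, phones and "
    "documentation. Before leaving the armed persons kidnapped three crew members "
    "from one fishing vessel and headed to Tawi-Tawi islands."
)

# Hypothetical, deliberately small vocabulary of number words.
NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

# \b word boundaries stop "one" from matching inside "phones".
PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def find_number_words(text):
    """Return every number word in the text, lower-cased, in reading order."""
    return [match.group(1).lower() for match in PATTERN.finditer(text)]


print(find_number_words(REPORT))
# ['two', 'seven', 'two', 'two', 'three', 'one']
```

That gives six candidates, and nothing in the flat list says which one counts the attackers.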
Despite this, I was able to extract seven from the data! Read on to find out how.
What does NLP do to sentences?
The API I used from Google Cloud was analyzeSyntax, which returns the linguistic structure of an input. As I only taught myself this concept today, the explanation from Google is likely a safer bet than me trying to explain it. But the key takeaway for me was the idea that sentences can be represented as trees.
There’s a lot more going on here than I’ve had time to learn, and it’s something I’ll revisit in the future. In short, what I found is this: the number attached to a ‘baddie’ in the tree is the number of baddies. In this case, the baddie is an ‘armed person’. As you can see above, the number used to describe the armed person (seven) is, when reading the sentence like a normal person, the number of armed persons. So the tree works.
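A minimal Python sketch of that idea, under a few loudly stated assumptions: the token dictionaries below are hand-written to mimic the shape of an analyzeSyntax JSON response (`text.content`, `lemma`, `partOfSpeech.tag`, `dependencyEdge.headTokenIndex`), they are not real API output, and `make_token` and `numbers_attached_to` are hypothetical helper names. It illustrates the tree lookup only; it is not the Alteryx workflow.

```python
def make_token(content, lemma, tag, head_index):
    """Build a dict shaped like one token of an analyzeSyntax response."""
    return {
        "text": {"content": content},
        "lemma": lemma,
        "partOfSpeech": {"tag": tag},
        "dependencyEdge": {"headTokenIndex": head_index},
    }


# Hand-written approximation of the parse of:
# "Seven armed persons approached two fishing vessels."
# Each token points at its parent in the tree via head_index.
TOKENS = [
    make_token("Seven", "Seven", "NUM", 2),          # modifies "persons"
    make_token("armed", "armed", "ADJ", 2),          # modifies "persons"
    make_token("persons", "person", "NOUN", 3),      # subject of "approached"
    make_token("approached", "approach", "VERB", 3), # root points at itself
    make_token("two", "two", "NUM", 6),              # modifies "vessels"
    make_token("fishing", "fishing", "NOUN", 6),     # modifies "vessels"
    make_token("vessels", "vessel", "NOUN", 3),      # object of "approached"
    make_token(".", ".", "PUNCT", 3),
]


def numbers_attached_to(tokens, target_lemmas):
    """Return (number, noun) pairs where a NUM token's parent is a target noun."""
    pairs = []
    for token in tokens:
        if token["partOfSpeech"]["tag"] != "NUM":
            continue
        head = tokens[token["dependencyEdge"]["headTokenIndex"]]
        if head["lemma"].lower() in target_lemmas:
            pairs.append((token["text"]["content"], head["text"]["content"]))
    return pairs


print(numbers_attached_to(TOKENS, {"person"}))
# [('Seven', 'persons')]
```

The ‘two’ in front of ‘fishing vessels’ is ignored because its parent in the tree is ‘vessels’, not ‘persons’.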
Extracting this with Alteryx
I’ve skipped a big step here which involves connecting to the Google Cloud APIs (I’ll write a post on this soon!).
As for my approach with Alteryx, this blog post is TBC… 🙂
As always, thanks for taking the time to read my blog! If you have any comments, suggestions or want to chat, feel free to connect with me on my LinkedIn!
~ Ryan Edwards