I’ve always been interested in where English words originate. Many people know that English is a Germanic language, but through various political changes over the centuries (the Norman Conquest of 1066, for example), English vocabulary has become increasingly Latinate and French over time.
Finding out where words originate from is difficult. There are various resources available:
The Oxford English Dictionary – comprehensive, but you might have to pay $100 for access
Wiktionary – a good website for a word’s origin and how it has changed over time
Etymonline – a good website that returns multiple words related to the search term, spread over multiple pages
A blog by Andreas Simons – good analysis with good visuals
I’ll go through the analysis in Tableau first, then the Alteryx workflow.
I’ll be following the same method as Andreas Simons (see above). First I create a macro that scrapes the Etymonline website; secondly, a workflow converts the data into groups of 50 words; finally, I use each word’s description to work out which language the word comes from. Below I show Andreas Simons’ analysis and mine for comparison. We achieve roughly the same results, with some slight differences. Similarities include the prevalence of Germanic among the first few hundred words, shifting towards Latinate further down the list; I believe we use the same word list, found here. One difference is the other/unknown column: I chose not to include it, as I didn’t want to add too much complexity to the workflow.
The first step of the Alteryx workflow is to feed each word and its rank into the Etymonline website and extract the description. I use the Download tool to fetch the page, then, to isolate a specific section, I replace a piece of text in the downloaded data with a pipe (“|”) and run a Text to Columns tool on it. I regex out the word and its classification – by which I mean noun, verb, adverb, preposition, pronoun, and so on. If the first regex fails, I run a second regex over the description. I remove any null rows and make the word and description lower case, then join the classification and word together into the format “Word – Classification”.
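Outside Alteryx, the same split-and-regex step can be sketched in Python. The HTML snippet, the marker being swapped for a pipe, and the regex pattern below are illustrative assumptions about Etymonline’s markup, not the exact structure the workflow parses:

```python
import re

# Hypothetical fragment of a downloaded Etymonline page (markup is assumed)
html = '<div class="word"><h1>ability (n.)</h1><p>late 14c., from Old French ablete</p></div>'

# Mimic the workflow: replace a marker with a pipe, then split (Text to Columns)
sections = html.replace("<h1>", "|").split("|")

entries = []
for section in sections[1:]:
    # Regex out the word and its classification, e.g. "ability (n.)"
    m = re.match(r"([a-z-]+)\s*\((n|v|adj|adv|prep|pron)\.\)", section)
    if m:
        word, cls = m.group(1), m.group(2)
        # Strip remaining tags and lower-case the description
        desc = re.sub(r"<[^>]+>", " ", section[m.end():]).strip().lower()
        # Join word and classification into one field
        entries.append((f"{word} - {cls}.", desc))
```

Running this on the sample fragment yields one entry, `("ability - n.", "late 14c., from old french ablete")`.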
The next part of the workflow uses the output from the first part (it took about an hour to run 5,000 words). I separate the words into groups of 50 using one calculation, “floor(([rank]-1)/50)”, and another, “Mod([rank]-1, 50)+1”. I then append 13 rows to the data, and a later calculation classifies each word’s language into 5 categories: German, French, Latin, Greek, and Other Germanic.
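The two grouping calculations translate directly into Python; this is a minimal sketch of the same arithmetic:

```python
import math

def group_of_50(rank: int) -> int:
    # Alteryx: floor(([rank]-1)/50) -- zero-based index of the 50-word group
    return math.floor((rank - 1) / 50)

def position_in_group(rank: int) -> int:
    # Alteryx: Mod([rank]-1, 50)+1 -- position 1..50 within the group
    return (rank - 1) % 50 + 1
```

Ranks 1–50 fall into group 0, rank 51 starts group 1, and so on.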
I shorten the description to 200 words, as some languages are mentioned in a description even when they are not actually linked to the word. I then go through several steps to get the correct language for a word – for example, when the word is a compound made up of two parts, like “ability”, derived from “able” and the ending “-ity”. I also parse out a word if the same word appears with a different classification, like “input”, which can be used as both a noun and a verb. I then combine everything, use a Unique tool, and calculate the percentage of all words. Lastly, I transpose the results for each group of 50 words.
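The classification step can be sketched as a keyword match over the truncated description. The keyword lists and the earliest-match rule below are illustrative assumptions, not the exact logic of the Alteryx workflow, and unmatched words fall back to a placeholder category:

```python
# Assumed keyword lists for the five categories used in the workflow
CATEGORY_KEYWORDS = {
    "German": ["old english", "proto-germanic", "german"],
    "French": ["old french", "anglo-french", "french"],
    "Latin": ["latin"],
    "Greek": ["greek"],
    "Other Germanic": ["old norse", "dutch", "frisian"],
}

def classify(description: str, max_words: int = 200) -> str:
    # Truncate so languages mentioned deep in the entry don't win
    text = " ".join(description.lower().split()[:max_words])
    # Pick the category whose keyword appears earliest in the text
    best, best_pos = "Unclassified", len(text) + 1
    for category, keywords in CATEGORY_KEYWORDS.items():
        for kw in keywords:
            pos = text.find(kw)
            if 0 <= pos < best_pos:
                best, best_pos = category, pos
    return best
```

For instance, a description beginning “late 14c., from Old French ablete, from Latin habilitas” would be classed as French, because the French mention comes first.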
A project I may look at in the future is repeating this analysis with another language.