How I used Fuzzy Matching.
When working with the dataset containing all Formula One driver names, all foreign names which contained letters with accents appeared as weird symbols.
ä -> ä
ö ->ö
This issue can be fixed by applying a replace formula on each one of the special symbols but the dataset contained hundreds of unique names. Luckily I had an easier alternative. Using a separate unordered document, I was able to retrieve the correct names and import them into my workflow. Since there were only slight differences between the names, the fuzzy match tool was the perfect solution for solving this problem. Here is what the workflow looked like:
Configuring Fuzzy Match
- Choose mode: Purge mode or merge sort.
- I used a purge mode for my workflow because I needed all the records to be compared.
- Specify match threshold
- A threshold of 80% was used for this data set. Any lower would lead to too many false positives.
- Select match style.
- ‘Name’ is selected in this case.
- Edit match options.
- In the match options, we get the choice between the use of ‘Jaro distance’ or ‘Levenshtein distance’. To select the better option we first need to understand how these two algorithms are calculated. The Jaro similarity looks at the length of the string, matching characters, and the number of transposes. The Levenshtein distance checks for the number of swaps it takes for one string to match the next. For our data, the length of the strings does not matter, the Levenshtein distance is the better option. The Jaro similarity algorithm has more problems with shorter names. Here is some more information about the two different algorithm.
- https://srinivas-kulkarni.medium.com/jaro-winkler-vs-levenshtein-distance-2eab21832fd6
- Specify advanced options.
- We select ‘output match scores’ and ‘output generated keys’. These options will create new columns specifying the quality of the match. They can be sorted and easily used to check if the fuzzy match is doing its job properly.
Here are my configuration settings.
Due to the advanced options introducing duplicates, a unique tool is required to clean up the data into the original format. The fuzzy match had worked flawlessly for 99% of the cases but there were a few cases such as Nelson Piquet and Nelson Piquet Jr that had to be corrected manually. The fuzzy match is a niche technique but can make data cleaning a lot easier in some cases and should always be an option to be considered.