There are maybe different situations when errors and typos are introduced in records. You can think about a doctor making a mistake in a patients’ name in the patient’s history. A clerk in a bank can also incorrectly spell a client’s name in an application. Such discrepancies lead to problems of matching records in tables of databases. Having a golden record of a person in one table and misspelled name in an application, for instance, lead to missing connections between these documents. For example, if someone is looking for me based on my last name Kushnarev in a database, it will not be possible to find my record if my last name is spelled in the French manner Kouchnarev. Also, my wife’s name is Anna, but she often finds her name miswritten as Hana. Finally, My daughter’s name is Nauma; however, her name can be misspelled as Naomi. 

The Fuzzy Match tool is introduced in Alteryx to deal with such inconsistencies. The Fuzzy Match tool identifies rows with similar string values in specified columns. It is easy to use this tool with some parameter adjustments. First of all, you have an option to use either Purge Mode (All records compared) or Merge (only Records from a Different source are compared). In Purge Mode, all rows are compared with each other, and the output contains similar or exact matches. The record ID column is meant to be a unique identifier for each row. The match columns are the ones that will be compared to find similar matches, and the match style determines the rules for the fuzzy matching logic.

On the other hand, in Merge Mode, rows from one source are strictly compared against the rows on another source. This is useful if you have a master list of data, such as a list of suppliers which can be used to standardize a list of dirty transaction data. With large datasets, Merge Mode is much faster than Purge Mode and will significantly reduce the size of the output data set. The match threshold parameter determines how strong a match is required for the tool to output the match. By decreasing the match threshold, more results will be output, but they will be less similar matches. You also have to specify Record ID, and if you are not concerned with the exact row each value came from, you can change the Record ID column to be the same as the match column for easier viewing of the results. However, especially when fuzzy matching across multiple columns, it is important to point the Record ID column to a unique identifier. Finally, Field Name(s) is specified at which the Fuzzy Logic is applied, and you also indicate Match Style. The match style can be the Company name, Name, Name with Nick Name, Address, Phone Number, etc.

Let’s consider the following example. Assume that we have the following dataset:

The workflow is straightforward with the parameter set-up shown below. Note that the Output Match Score and Output Generated Keys are also marked.

The fields we use for the fuzzy match are Full Name and City with match style Name. We also  So, in the output is

fuzzy match

We can see that Boris Kushnarev and Anna Yemasheva were matched. In addition, you can see the MarchScore parameter for each field and the average MatchScore as well as MatchKeys. However, Naomi was not defined as a corresponding match for Nauma. Even though I reduced the Match Threshold, I still was not able to link her records.  Though, if you want to use a fine-tuning, you may explore it by hitting the Edit Button, and you can find the following interface:

You can find that the Jaro Distance was used, and you might want to change it to Levenshtein Distance. After the Run button is hit, Nauma appears in the results even twice because two MatchKeys were found for her match.

fuzzy match

To sum up, if you are dealing with two or more sources of data, you can use the powerful Fuzzy Match tool to find missing records, for instance. You can use preset parameters or go further and do the fine-tuning.

 

Boris Kushnarev
Author: Boris Kushnarev