Joins are an essential part of data preparation. In addition to that they are essential to do correctly otherwise data is lost or duplicated. Here I will outline some issues that can occur when joining data and some tips to fixing them.

Data Loss

This occurs when there isn’t a match in the field(s) you are trying to join on. While this may be intended, it is always worth understanding why some data didn’t pass through and verifying that this is okay. Checking the data on the right and left join anchors is a good verification for if the join was done properly.

In the case there is something you wanted to join that didn’t pass through, here are some things to check:

  • Data cleanse tool – the data cleanse tool can make sure the cases are standardized. It can also remove any white space that could be interfering with the join
  • Abbreviations – a join won’t be able to tell the difference between “Mr.” and “Mister”, so it’s a good check to change make sure they’re the same in both datasets

Data Duplication

Unlike loss, data duplication is almost ALWAYS a bad thing. It can result in turning ten thousand record to ten million records and make your data unreliable. The main tell if data is being duplicated is if there are more records after the join than before. Conversely, you could use the Unique Tool on the fields you joined on to find any duplicate rows. You could also use this tool to remove duplicates but this should be a last resort.

Duplication of data can be mitigated with these two steps:

  • Understand the detail of the data – a good way to do this is to ask yourself “What does each row of data represent?”.
  • Join on the similar fields – if data is being duplicated in the join, most likely you haven’t joined on the right number of fields.

 

If you have trouble joining data, I hope this will prove to be helpful to you.

 

Ethan Tai
Author: Ethan Tai