OVERVIEW

Data surrounds us, and only a part of it has ever been processed, while the amount of data produced grows daily. Data can be structured, semi-structured or unstructured. Structured data is presented in tabular form, for instance, in Excel tables or relational databases. An example of such data is clients’ transactions in a bank. Semi-structured data is usually stored in XML or JSON formats; a website parsed into JSON is one example. Finally, texts in books and emails are examples of unstructured data, which can be processed as well.

After a data analyst or data scientist collects the necessary data, they usually face dirty data that has to be cleaned. Data cleaning is not an easy procedure and requires knowledge and experience. An essential component of data cleaning is data tidying.

Recently, I came across Hadley Wickham’s article “Tidy Data”. So, let’s try to clarify what tidy data is. For simplicity, I will follow the example from the article with slight changes.

TIDY DATA

Consider the following table:

                 treatment a   treatment b
John Smith       -             2
Jane Doe         16            11
Mary Johnson     3             1

Table 1. Typical presentation dataset.

This data structure is commonly used. The patients’ names are stored in the first column. The other two columns show the amount of Medicine A and Medicine B received by each patient. The table can be rewritten as follows:

              John Smith   Jane Doe   Mary Johnson
Medicine A    -            16         3
Medicine B    2            11         1

Table 2. Restructured data.

We can see that the columns and rows are swapped but the conveyed information is the same.
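The swap between Table 1 and Table 2 can be sketched in a few lines of plain Python. This is only an illustration: the nested-dictionary layout, the use of None for the missing value, and the helper name transpose are assumptions made for the example, not something taken from the article.

```python
# Table 1 as a mapping: patient -> {medicine: amount}.
# None marks John Smith's missing amount of Medicine A (an assumed encoding).
table1 = {
    "John Smith": {"a": None, "b": 2},
    "Jane Doe": {"a": 16, "b": 11},
    "Mary Johnson": {"a": 3, "b": 1},
}


def transpose(table):
    """Swap the outer and inner keys of a nested dict (hypothetical helper)."""
    result = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            result.setdefault(col_key, {})[row_key] = value
    return result


# Table 2: medicine -> {patient: amount}.
table2 = transpose(table1)

# Transposing twice gives back the original layout, so nothing is lost.
assert transpose(table2) == table1
```

Both layouts hold exactly the same six cells; only the roles of rows and columns change.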

The data semantics can be described as follows: A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

The tables above can be reorganized in yet another way that makes the values, variables and observations clearer:

Patient’s Name   Medicine   Amount
John Smith       a          -
Jane Doe         a          16
Mary Johnson     a          3
John Smith       b          2
Jane Doe         b          11
Mary Johnson     b          1

Table 3. Data restructured so that variables are in columns and observations are in rows.

Table 3 carries the same information but now the dataset contains 18 values representing three variables and six observations. The variables are:

1. Patient’s Name, with three possible values (John, Mary, and Jane).
2. Medicine, with two possible values (A and B).
3. Amount, with five or six values depending on how you think of the missing value (-, 16, 3, 2, 11, 1).
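The reshaping from Table 1 into the long layout of Table 3 can be sketched in plain Python, which also makes it easy to check the counts given above. The field names patient, medicine and amount and the helper to_tidy are assumptions chosen for this example, and John Smith's amount of Medicine B is taken as 2, as in Table 1. Libraries such as pandas offer a ready-made DataFrame.melt for the same job.

```python
# Table 1 in wide form: patient -> {medicine: amount}; None is the missing value.
table1 = {
    "John Smith": {"a": None, "b": 2},
    "Jane Doe": {"a": 16, "b": 11},
    "Mary Johnson": {"a": 3, "b": 1},
}


def to_tidy(table, medicines=("a", "b")):
    """Return one row per (patient, medicine) observation (hypothetical helper)."""
    rows = []
    for medicine in medicines:
        for patient, amounts in table.items():
            rows.append(
                {"patient": patient, "medicine": medicine, "amount": amounts[medicine]}
            )
    return rows


tidy = to_tidy(table1)

variables = list(tidy[0].keys())           # one column per variable
n_values = sum(len(row) for row in tidy)   # every cell, including the missing one

print(len(tidy), len(variables), n_values)  # 6 3 18
```

Each dictionary in tidy is one observation, and each key is one variable, which is the structure Table 3 shows.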

Table 3 is a representation of tidy data. According to Hadley Wickham, in tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Tidy data makes it easy for an analyst or a computer to extract the needed variables because it provides a standard way of structuring a dataset. One can easily manipulate the data in a tidy dataset: it can be sorted as needed, and additional variables can be added when required. In the table below, date and units are added:

Patient’s Name   Date         Medicine   Amount   Units
John Smith       13/10/2020   a          -        mg
Jane Doe         14/10/2020   a          16       mg
Mary Johnson     12/10/2020   a          3        mg
John Smith       10/10/2020   b          2        ml
Jane Doe         19/10/2020   b          11       ml
Mary Johnson     15/10/2020   b          1        ml

Table 4. Tidy dataset with added variables.
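As a rough sketch of why the tidy layout is convenient, the snippet below adds the date and units variables from Table 4 to the tidy rows, then sorts and filters them. The list-of-dictionaries representation and the field names are assumptions made for illustration; the dates are read as day/month/year, and John Smith's amount of Medicine B is taken as 2, as in Table 1.

```python
from datetime import datetime

# Tidy rows in the same order as Table 3; None is the missing amount.
tidy = [
    {"patient": "John Smith", "medicine": "a", "amount": None},
    {"patient": "Jane Doe", "medicine": "a", "amount": 16},
    {"patient": "Mary Johnson", "medicine": "a", "amount": 3},
    {"patient": "John Smith", "medicine": "b", "amount": 2},
    {"patient": "Jane Doe", "medicine": "b", "amount": 11},
    {"patient": "Mary Johnson", "medicine": "b", "amount": 1},
]

# Dates from Table 4 (day/month/year), listed in the same row order.
dates = ["13/10/2020", "14/10/2020", "12/10/2020",
         "10/10/2020", "19/10/2020", "15/10/2020"]
# Units per medicine, as shown in Table 4.
units_by_medicine = {"a": "mg", "b": "ml"}

# Adding a variable means adding one key to every observation.
for row, date in zip(tidy, dates):
    row["date"] = datetime.strptime(date, "%d/%m/%Y").date()
    row["units"] = units_by_medicine[row["medicine"]]

# Sorting by a variable is a one-liner because each row is one observation.
by_date = sorted(tidy, key=lambda row: row["date"])

# Extracting one variable for a subset of observations is just as direct.
amounts_b = [row["amount"] for row in tidy if row["medicine"] == "b"]
```

Doing the same with the layout of Table 1 or Table 2 would require knowing which cells belong to which medicine and patient before any sorting or filtering could start.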

TO SUM UP

We considered several representations of the same dataset and how to bring them into tidy form. The tidy representation may look obvious, but in general it can be difficult to define precisely what the variables and observations are. To learn more about data tidiness, refer to Hadley Wickham’s article.

Author: Boris Kushnarev