When performing data analysis, the structure of the data determines what kind of analysis you can do with it and how it is consumed, whether by machine or by human. Understanding the shape of your data can help you perform your analysis more effectively, and help your end users understand your work. Data mostly comes in two shapes: “long” format and “wide” format. These describe how your data is organised in terms of rows and columns.

Wide Data

In wide data, each entity occupies its own row, and each of its variables occupies a single column. As such, an easy way to identify wide data is that the values in the first column tend not to repeat.
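As a minimal sketch of this layout, the pandas snippet below builds a small wide table and checks that the identifier column does not repeat. The team names, column names and values are made up for illustration and are not real standings.

```python
import pandas as pd

# Hypothetical, made-up values purely for illustration.
wide = pd.DataFrame(
    {
        "team": ["Team A", "Team B", "Team C"],  # one row per entity
        "wins": [10, 8, 5],                      # one column per variable
        "losses": [2, 4, 7],
    }
)

# In wide data each entity appears once, so the identifier column has no repeats.
ids_are_unique = wide["team"].is_unique

print(wide)
print("identifier column unique:", ids_are_unique)
```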

Wide data is generally considered people-friendly, as this format is easy to read and interpret. All the information about a single entity is available at a glance. As such, you tend to see this format used in descriptive statistics and reporting.

NBA Team Standings

Wide data may also be used in data collection, where different variables or new observations of the same entity are recorded in new columns, allowing for easy side-by-side comparison.

WDI Global Fertility Rate Data (Wide)

While wide data is commonly used for human consumption, it is also used in a number of data analysis tasks. Notably, it is often used for machine learning, where each observation occupies a single row and the features used for prediction or classification are organised into individual columns.
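To make the machine learning case concrete, here is a rough sketch of how a wide table is typically split into a feature matrix and a target. The column names (`height_cm`, `weight_kg`, `label`) and the values are hypothetical, and no particular modelling library is assumed.

```python
import pandas as pd

# Hypothetical observations; column names and values are illustrative only.
observations = pd.DataFrame(
    {
        "height_cm": [170.0, 182.0, 165.0, 177.0],
        "weight_kg": [65.0, 90.0, 58.0, 80.0],
        "label": [0, 1, 0, 1],
    }
)

feature_columns = ["height_cm", "weight_kg"]
X = observations[feature_columns]  # one row per observation, one column per feature
y = observations["label"]          # the value to be predicted or classified

print(X.shape, y.shape)
```

Most modelling libraries expect input in roughly this shape, which is why data held in long format is often reshaped to wide before training.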

Long Data

Unlike wide data, long data allows multiple rows for each entity: each new attribute or observation is recorded as a new row in the dataset rather than as a new column.
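One way to see the difference is to reshape a wide table into a long one. The sketch below uses pandas `melt`; the country names and fertility values are placeholders, not real WDI figures, and the column names `year` and `fertility_rate` are assumptions chosen for the example.

```python
import pandas as pd

# Illustrative placeholder values, not real WDI figures.
wide = pd.DataFrame(
    {
        "country": ["Country A", "Country B"],
        "2000": [2.5, 1.8],
        "2010": [2.3, 1.7],
    }
)

# Each year column becomes a set of rows: one row per (country, year) pair.
long = wide.melt(id_vars="country", var_name="year", value_name="fertility_rate")
long["year"] = long["year"].astype(int)
long = long.sort_values(["country", "year"]).reset_index(drop=True)

print(long)
```

After reshaping, each country appears once per year, so the first column now repeats.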

WDI Global Fertility Rate Data (Long)

This is often considered a machine-friendly data structure, as it is easier to perform operations such as filtering, aggregating and transforming on long data. Adding new data is also much easier with the long format, as you only need to add new rows rather than create additional columns. It also avoids the problem of null values in columns where no data is available for an entity, as you can simply omit the rows for which no data is present.
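The operations mentioned above might look like the following in pandas. This is a sketch on placeholder data: the countries, years and values are invented, and the column names are assumptions for the example.

```python
import pandas as pd

# Illustrative placeholder values in long format.
long = pd.DataFrame(
    {
        "country": ["Country A", "Country A", "Country B", "Country B"],
        "year": [2000, 2010, 2000, 2010],
        "fertility_rate": [2.5, 2.3, 1.8, 1.7],
    }
)

# Filtering: keep only the rows for a single year.
rows_2010 = long[long["year"] == 2010]

# Aggregating: average value per country across all years.
mean_by_country = long.groupby("country")["fertility_rate"].mean()

# Adding data: a new observation is just a new row. Country B has no 2020
# value here, so it simply gets no row and no null cell is created.
new_rows = pd.DataFrame(
    {"country": ["Country A"], "year": [2020], "fertility_rate": [2.1]}
)
extended = pd.concat([long, new_rows], ignore_index=True)

print(rows_2010)
print(mean_by_country)
print(extended)
```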

The long format is the preferred data format for many visualisation tools such as Tableau, and its structure makes it well suited to analyses such as time-series analysis or repeated-measures analysis.
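As a small illustration of why long data suits time-series work, the sketch below computes the change between consecutive observations for each entity. The data is again made up, and the `change` column name is an assumption for the example.

```python
import pandas as pd

# Illustrative placeholder values in long format.
long = pd.DataFrame(
    {
        "country": ["Country A", "Country A", "Country B", "Country B"],
        "year": [2000, 2010, 2000, 2010],
        "fertility_rate": [2.5, 2.3, 1.8, 1.7],
    }
)

# Order each entity's observations in time, then take the difference between
# consecutive rows within each country. The first row per country has no
# previous observation, so its change is NaN.
long = long.sort_values(["country", "year"])
long["change"] = long.groupby("country")["fertility_rate"].diff()

print(long)
```

Because every observation is a row, the same two lines work no matter how many years or countries are added.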

Author: The Data School