Organisations often store hundreds, if not thousands, of terabytes of data. These datasets contain a wealth of hidden insights, waiting to be discovered by analysts such as ourselves. For decades, it has been standard practice to store these datasets in relational databases and use a language like SQL to retrieve relevant information. Yet, as time has gone on, the pitfalls of this approach have become more pronounced and ultimately very costly, especially for companies that store particularly complex data. Let’s dig a little into the history of data storage to understand how we’ve ended up where we are today.

Once Upon a Time

During the early days of database management, the best technology only enabled sequential data access: to find a particular record, it was necessary to move through each piece of data in order until the desired record was found. In the 1950s, the development of direct-access storage devices allowed navigational searches: searches could now follow established links to find particular records far more efficiently. However, it wasn’t until 1970 that IBM’s Edgar F. Codd published his famous paper, “A Relational Model of Data for Large Shared Data Banks”, and set off the relational database revolution.

Relational Databases

Relational databases organise data in tables (relations) of rows (tuples) and columns (attributes). Each table is identified by a name (formally, a relation variable or relvar) and each row within a particular table is uniquely identified by a primary key. For example, there may be a customer table with rows containing various personal details for each customer as well as a unique customer ID. Now consider a transactions table with rows of transaction data, each with a unique transaction ID. To indicate which customer made a transaction, that transaction’s row will contain the corresponding customer ID, which acts as a foreign key pointing back to the customer table. Thus, the information spread across numerous tables can be connected by joining on the relevant primary and foreign keys. Whereas navigational databases are searched by following the successive locations in an address to access each record, relational databases can use these unique identifiers to find a more efficient path. This allows the user to simply write a declarative query, describing what data they are interested in, and let the database management system work out an efficient way to retrieve it.
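To make the customer and transaction example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are all invented for illustration; only the idea of joining a foreign key to a primary key comes from the description above.

```python
import sqlite3

# Hypothetical schema and data, held in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key of this table
        name        TEXT NOT NULL
    );
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
        amount         REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO transactions VALUES (101, 1, 25.0), (102, 2, 40.0), (103, 1, 15.5);
""")


def transactions_with_names(connection):
    """Join each transaction to its customer via the customer_id foreign key."""
    query = """
        SELECT t.transaction_id, c.name, t.amount
        FROM transactions AS t
        JOIN customers AS c ON c.customer_id = t.customer_id
        ORDER BY t.transaction_id
    """
    return connection.execute(query).fetchall()


rows = transactions_with_names(conn)
print(rows)
```

Note that the query only states what is wanted (transactions with customer names); how the rows are actually located is left to the database engine.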

Relational databases have dominated the market for decades due to how efficiently they are able to search tables. However, queries become increasingly computationally expensive as more tables need to be joined. For instance, when the user is interested in finding the names of all the products that a group of consumers have purchased over a particular period of time, relational databases work fantastically. On the other hand, to find the names of products that have been purchased by those who have purchased the products that have been purchased by a particular group of consumers, relational databases begin to struggle, as they need to work through each joined table in succession. The relational model does not store these relationships directly: they are only implied by matching key values, so every extra step in the chain has to be recomputed with another join. This was not such a big deal when datasets were smaller and the relationships between data were less complex, but as it becomes easier to store such data, the need for a more efficient search process, and hence a new database model, has been growing.
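The second query above can be sketched in SQL to show how the joins stack up. This is only an illustration using Python's sqlite3 module: the products and purchases tables, the sample rows and the function name second_hop_products are all assumptions, not part of any real system.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchases (
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL REFERENCES products(product_id)
    );
    INSERT INTO products VALUES (1, 'kettle'), (2, 'teapot'), (3, 'mug'), (4, 'lamp');
    INSERT INTO purchases VALUES
        (10, 1),           -- customer 10 is the starting group
        (20, 1), (20, 2),  -- customer 20 shares the kettle with customer 10
        (30, 2), (30, 3),  -- customer 30 shares nothing with customer 10
        (40, 4);
""")


def second_hop_products(connection, group_ids):
    """Products bought by anyone who bought something the starting group bought."""
    placeholders = ", ".join("?" for _ in group_ids)
    query = f"""
        SELECT DISTINCT p.name
        FROM purchases AS seed                                          -- the group's purchases
        JOIN purchases AS peer  ON peer.product_id = seed.product_id    -- who else bought those
        JOIN purchases AS other ON other.customer_id = peer.customer_id -- what those people bought
        JOIN products  AS p     ON p.product_id = other.product_id      -- look up product names
        WHERE seed.customer_id IN ({placeholders})
        ORDER BY p.name
    """
    return [name for (name,) in connection.execute(query, list(group_ids))]


names = second_hop_products(conn, [10])
print(names)
```

Each extra "purchased by those who purchased" step adds another self-join on the purchases table, which is where the cost tends to grow on large datasets.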

Graph Databases

Graph databases offer an elegant solution by modelling data in terms of entities (nodes) and their relationships to other nodes (edges). Thus, the overall dataset structure is an interconnected network of nodes. In graph databases, relationships are first-class citizens, so they form an integral part of the structure and can have their own properties attributed to them. This is in contrast to relational databases, where relationships merely exist in the form of corresponding keys between tables. Because the relationships are stored explicitly alongside the data, graph databases typically require more storage space. Another significant difference is that graph databases only contain a single node for each distinct record, with a series of relationships connecting it to others. Relational databases will instead contain a number of copies of a record’s unique key as foreign keys in related tables, which makes it more complicated to maintain consistency when updating data.
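As a rough sketch of this model, the Python below stores nodes and edges as separate objects, with each edge carrying its own properties. The class names, the HAS_PURCHASED relationship type and the sample data are all hypothetical; real graph databases use their own storage formats and query languages.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                   # e.g. "Customer" or "Product"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                  # node_id the relationship starts from
    target: str                                  # node_id the relationship points to
    rel_type: str
    properties: dict = field(default_factory=dict)  # relationships carry data too


class Graph:
    """A toy in-memory property graph (illustrative, not a real database)."""

    def __init__(self):
        self.nodes = {}      # node_id -> Node, one node per distinct record
        self.outgoing = {}   # node_id -> list of Edge objects leaving that node

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.outgoing.setdefault(node.node_id, [])

    def add_edge(self, edge):
        self.outgoing[edge.source].append(edge)


graph = Graph()
graph.add_node(Node("c1", "Customer", {"name": "Ada"}))
graph.add_node(Node("p1", "Product", {"name": "kettle"}))
graph.add_edge(Edge("c1", "p1", "HAS_PURCHASED", {"quantity": 2}))

purchase = graph.outgoing["c1"][0]
print(purchase.rel_type, graph.nodes[purchase.target].properties["name"])
```

The customer appears exactly once, as node c1, and the purchase itself is an object that can hold details such as quantity, rather than being implied by a matching key.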

This entirely changes the nature of querying: rather than checking through a large number of joined rows to find the result, graph databases enable the database management system to simply traverse the established relationships and reach the destination in a handful of steps. Following the previous example, the system could start from the nodes of the original group of consumers, immediately follow the ‘have purchased’ relationships to the products they have purchased, then follow the ‘have been purchased by’ relationships from those products to the next set of consumer nodes, and finally follow their ‘have purchased’ relationships to the desired product nodes. For this kind of multi-step query, having individually linked nodes can make searching significantly faster than indexing and sorting through linked tables of records.
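The three-step traversal just described can be sketched with plain Python dictionaries standing in for stored relationships. The customer and product names are invented, and a real graph database would expose this through its own query language rather than code like this.

```python
# Hypothetical 'have purchased' relationships: customer -> products.
purchased = {
    "c10": {"kettle"},
    "c20": {"kettle", "teapot"},
    "c30": {"teapot", "mug"},
    "c40": {"lamp"},
}

# Build the inverse 'have been purchased by' relationships: product -> customers.
purchased_by = {}
for customer, products in purchased.items():
    for product in products:
        purchased_by.setdefault(product, set()).add(customer)


def follow(edges, start_nodes):
    """Follow one relationship type from every start node and collect the targets."""
    reached = set()
    for node in start_nodes:
        reached |= edges.get(node, set())
    return reached


def second_hop_products(group):
    first_products = follow(purchased, group)       # group -> products they bought
    peers = follow(purchased_by, first_products)    # products -> everyone who bought them
    return follow(purchased, peers)                 # those customers -> their products


result = second_hop_products({"c10"})
print(sorted(result))
```

Each step only touches the relationships attached to the nodes already reached, so the work depends on how connected those nodes are rather than on the total size of the dataset.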

It is interesting to see how graph databases are able to store richly associated sets of records like relational databases, yet recapture the efficiencies of older navigational searches by following these first-class relationships. The potential for identifying extremely complex trends and patterns in highly interconnected data is exciting as more software becomes available to manage and visualise these datasets. Who knows what will come next…

Cover image by Gordon Johnson from Pixabay

Author: Hunter Iceton

Hunter Iceton is an enthusiastic and positive individual. He graduated from Sydney Uni in 2017 with a Bachelor of Commerce (Liberal Studies) majoring in Finance, Marketing and Quantitative Business Analytics. For the next few years, Hunter spent his time creating and releasing music, while tutoring primary and high school students in Mathematics and Business Studies. Hunter is now excited to be joining The Data School, looking forward to approaching analytics with a creative perspective. In his spare time, Hunter enjoys continuing to create music, reading philosophy and cooking plant-based dishes. Otherwise, he can usually be found at a restaurant, a bar or an art gallery.