Data warehouses are crucial for storing and managing enterprise-wide data. My previous blog Data Modeling: Building a Database outlined the importance of establishing the right conceptual, logical and physical models and eliminating data redundancy through normalisation. However, while normalisation helps maintain a high level of data quality, a fully normalised structure may not be optimal for the efficient retrieval that reporting requires. Thus, understanding how to use star schemas and snowflake schemas, as well as how to manage slowly changing dimensions, will greatly improve the way you structure your data for reporting purposes.

 

Star Schemas & Snowflake Schemas

Much like transactional and reference data, all data can be classified as either fact or dimensional. Fact tables contain data from events that occur over time, like sales or production data, while dimension tables contain descriptive or categorising details about relevant fact attributes, like product or address data. Generally speaking, data models for reporting purposes are oriented around the organisation of fact and dimension tables. Most common are star schemas and snowflake schemas.

Star schemas model data from a central fact table that joins to various relevant dimension tables. They are called star schemas as the fact table in the middle looks like the centre of a star, with each dimension table forming one of its points. These dimension tables are almost always denormalised so that all of the desirable data relating to one fact table attribute is stored in the same table. Because there is only one join between the fact table and each dimension table, star schemas can be queried relatively quickly.
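To make this concrete, here is a minimal star-schema sketch using SQLite via Python's standard library. The table and column names (fact_sales, dim_product and so on) are illustrative assumptions, not from any particular warehouse; the point is that the dimension is denormalised, so a single join reaches every product attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalised dimension: every product attribute lives in one table,
# including category details that normalisation would split out.
cur.execute("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT,
        category_manager TEXT
    )""")
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        amount REAL
    )""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware', 'Dana')")
cur.execute("INSERT INTO fact_sales VALUES (100, 1, 3, 29.97)")

# One join reaches every product attribute.
row = cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchone()
print(row)  # ('Hardware', 29.97)
```

In a real warehouse there would be several dimensions (product, store, date, customer) all hanging directly off the fact table, each reachable in one join.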

Snowflake schemas also model data from a central fact table that joins to various relevant dimension tables. However, the difference is that snowflake schemas use only normalised data, so rather than having a single dimension table for each fact table attribute, there can be multiple joins from the fact table out to the furthest-removed dimension tables. They are called snowflake schemas as the fact table in the middle spans out via the various dimension tables in a complex and layered manner, resembling a snowflake. Because snowflake schemas involve so many joins, queries can become slow, but the upshot is that less redundancy usually translates into less requisite storage space.
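The same illustrative product example, snowflaked: the category details are normalised out of the product dimension into their own table (names are again my own assumptions), so reaching a category attribute from the fact table now takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised dimensions: category is split into its own table, so the
# category manager is stored once, no matter how many products share it.
cur.execute("""
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY,
        category_name TEXT,
        category_manager TEXT
    )""")
cur.execute("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    )""")
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL
    )""")

cur.execute("INSERT INTO dim_category VALUES (10, 'Hardware', 'Dana')")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 10)")
cur.execute("INSERT INTO fact_sales VALUES (100, 1, 29.97)")

# Two joins are now needed to reach category attributes from the fact.
row = cur.execute("""
    SELECT c.category_name, c.category_manager, f.amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
""").fetchone()
print(row)  # ('Hardware', 'Dana', 29.97)
```

The trade-off described above is visible here: the snowflaked design stores each category's details once, at the cost of an extra join in every query that needs them.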

 

Slowly Changing Dimensions

Some dimensional data may slowly change over time like company names or an employee’s department. There are a few ways that data modelers can handle these changes, the choice of which impacts the reporting process.

  • Type 0 SCDs do not change their original recorded value even if there is a change in the actual value. Hence, if a company changes name, it will still show up in the records under its original name, with no record of its new name.
  • Type 1 SCDs will update their recorded value with a new value, but keep no record of the previous (or original) value. Hence, if a company changes name, it will now show up in the records under its new name, with no record of its history.
  • Type 2 SCDs will keep the original recorded value in one row and insert the new value as a new row. Typically, the dimension table will also include columns to store versioning information such as a version number, start date, end date and current flag. Hence, if a company changes name, it will show up in the records under its new name, and its previous name will still be accessible from the previous version’s row.
  • Type 3 SCDs will overwrite the current value in place, like Type 1 SCDs, but also store the old value in an additional previous-value column in the same row. This makes it extremely simple for reporting purposes, as it is unnecessary to access a separate historical row, though only a limited history (typically one previous value) is retained. Hence, if a company changes name, it will show up in the records under its new name, and its previous name will still be accessible from the previous-name column of the same row.
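A sketch of how a Type 2 change might be applied, with the dimension held as a list of dict rows. The column names (surrogate_key, start_date, end_date, is_current) follow common warehousing conventions and are my own assumptions: the current row is closed off, and the new value arrives as a fresh row, so history is never lost.

```python
from datetime import date

def apply_type2_change(history, company_id, new_name, change_date):
    """Close the current row for this company, then append a new current row."""
    for row in history:
        if row["company_id"] == company_id and row["is_current"]:
            row["end_date"] = change_date
            row["is_current"] = False
    history.append({
        "surrogate_key": len(history) + 1,
        "company_id": company_id,
        "company_name": new_name,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })

# One company, recorded under its original name.
history = [{
    "surrogate_key": 1, "company_id": "C1", "company_name": "Acme Ltd",
    "start_date": date(2020, 1, 1), "end_date": None, "is_current": True,
}]

apply_type2_change(history, "C1", "Acme Holdings", date(2023, 6, 1))

current = [r for r in history if r["is_current"]]
print(current[0]["company_name"])  # Acme Holdings
print(history[0]["company_name"])  # Acme Ltd (history preserved)
```

A Type 1 change, by contrast, would simply overwrite company_name in the existing row, and a Type 3 change would copy the old name into a previous_name column before overwriting.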

 

I have found data modeling to be the most underrated yet important area to become well acquainted with for all levels of data analytics. Understanding data modeling well will reward you with greater data integrity, ease of analysis and computational performance. Strong data modeling is the foundation on which strong data analysts are made. Happy modeling!

 

Image by Gerd Altmann from Pixabay

 

Author: Hunter Iceton

Hunter Iceton is an enthusiastic and positive individual. He graduated from Sydney Uni in 2017 with a Bachelor of Commerce (Liberal Studies) majoring in Finance, Marketing and Quantitative Business Analytics. For the next few years, Hunter spent his time creating and releasing music, while tutoring primary and high school students in Mathematics and Business Studies. Hunter is now excited to be joining The Data School, looking forward to approaching analytics with a creative perspective. In his spare time, Hunter enjoys continuing to create music, reading philosophy and cooking plant-based dishes. Otherwise, he can usually be found at a restaurant, a bar or an art gallery.