Database Normalization and Denormalization

Understanding Database Normalization and Denormalization

Database design is an important first step in building any robust application. It sets the stage for how efficiently the application can store, retrieve, and manage data. Normalization and denormalization are two primary concepts in database design that affect how well your database performs, how well it is protected from corruption, and how well it scales. Any application developer, database administrator, or architect who wants to create a data structure best suited to their application's needs should understand both.

This article explores what normalization and denormalization are, the pros and cons of each, when to use them, and several examples that illustrate the importance of these concepts in real systems.

What is Database Normalization?

Normalization is a method of organizing the structure of a database so that data redundancy is reduced and logically related data is kept together. The goal is typically to take large, cumbersome tables and divide them into smaller, related tables while maintaining the relationships between them. This removes data duplication, reduces update anomalies, and makes it much easier to keep data consistent.

The concept of normalization was first introduced by Edgar F. Codd, the developer of the relational model. Codd also defined a series of normal forms, a set of rules a database's structure should satisfy, which collectively establish the guidelines for what counts as a normalized database.

Normal Forms Defined

Normalization is conducted in steps. Each step is defined by a normal form. Each normal form also builds on the previous normal form, with further restrictions applied.

First Normal Form (1NF):

The most basic level of normalization. For a table to be in first normal form, each column must contain only atomic values, and there must be no repeating groups or arrays. For example, if a person can have multiple phone numbers, those phone numbers must be stored as separate rows in a related table, not as a list in one column (i.e., “555-123-1234, 555-456-4567” in a single column is not acceptable).
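
As a minimal sketch of the 1NF-compliant alternative (using SQLite via Python for illustration; the Person and PersonPhone table names are assumptions, not part of any prescribed schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1NF: each phone number is an atomic value on its own row of a related table,
# rather than a comma-separated list in a single Person column.
conn.executescript("""
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Name     TEXT
    );
    CREATE TABLE PersonPhone (
        PersonID INTEGER REFERENCES Person(PersonID),
        Phone    TEXT,
        PRIMARY KEY (PersonID, Phone)
    );
""")
conn.execute("INSERT INTO Person VALUES (1, 'Alice')")
conn.executemany("INSERT INTO PersonPhone VALUES (?, ?)",
                 [(1, "555-123-1234"), (1, "555-456-4567")])
print(conn.execute("SELECT Phone FROM PersonPhone WHERE PersonID = 1").fetchall())
```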

Second Normal Form (2NF):

A table is in 2NF when it is in 1NF and every non-key column is fully functionally dependent on the entire primary key (i.e., there are no partial dependencies). Second normal form only comes into play when the primary key is a composite key. For example, in a table with a composite primary key of OrderID and ProductID, a non-key column that depends only on OrderID violates 2NF.
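
A hedged sketch of the fix, again in SQLite: the partially dependent column (assumed here to be OrderDate) moves into a table keyed by OrderID alone, while the composite-key table keeps only columns that depend on the whole key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A 2NF violation would be OrderItems(OrderID, ProductID, Quantity, OrderDate)
# with PRIMARY KEY (OrderID, ProductID), because OrderDate depends only on
# OrderID. The split below removes that partial dependency.
conn.executescript("""
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        OrderDate TEXT                  -- depends on OrderID alone
    );
    CREATE TABLE OrderItems (
        OrderID   INTEGER REFERENCES Orders(OrderID),
        ProductID INTEGER,
        Quantity  INTEGER,              -- depends on the full (OrderID, ProductID) key
        PRIMARY KEY (OrderID, ProductID)
    );
""")
```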

Third Normal Form (3NF):

A table is in 3NF if it is in 2NF and every non-key column depends only on the primary key, not on other non-key columns. This eliminates transitive dependencies. For example, if a table has columns CustomerID, CustomerCity, and CustomerState, and CustomerState is determined by CustomerCity, then the table does not satisfy 3NF and needs to be split.
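
A minimal illustration of that split, assuming an illustrative Cities lookup table that owns the city-to-state dependency:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 3NF: CustomerState is determined by CustomerCity (a transitive dependency),
# so the city-to-state fact moves into its own table and Customers keeps only
# columns that depend directly on CustomerID.
conn.executescript("""
    CREATE TABLE Cities (
        City  TEXT PRIMARY KEY,
        State TEXT
    );
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        City         TEXT REFERENCES Cities(City)
    );
""")
```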

Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF that resolves certain anomalies not covered by 3NF.

Fourth Normal Form (4NF) and Fifth Normal Form (5NF) deal with more complicated multi-valued dependencies and join dependencies, and are rarely needed outside of especially complex schemas.

Advantages of Normalization

Normalization offers several important advantages:

  • Elimination of Data Redundancy: Storing each piece of data only once minimizes redundancy and conserves storage space.
  • Data Integrity and Consistency: An address, for example, only needs to be updated in one place, so there is no risk of inconsistent copies across the system.
  • Data Maintenance Efficiency: With data entities structured properly, maintenance tasks become simpler and less error-prone.
  • Flexibility: A well-normalized schema can be extended or changed without significant effort and without risking the loss of existing data.

Disadvantages of Normalization

  • Query Complexity: In a normalized database, queries commonly need multiple JOINs to bring related data back together. Without proper indexing, this can hurt query performance.
  • Slower Reads: Because related data is spread across tables and must be joined, read-heavy queries can be slower than they would be against a denormalized schema.

What is Database Denormalization?

Denormalization is the deliberate introduction of redundancy into a database by merging tables or duplicating data to improve read performance and reduce query complexity. It is essentially the opposite of normalization and is used when read performance is prioritized over maintaining a fully normalized structure.

Why Denormalize?

Normalization is great for data integrity, but it can lead to complex queries with multiple JOINs that slow down read operations. In scenarios such as reporting, analytics, or high-traffic web applications, denormalization can make related data available in a single table, even if that means duplicating some data.

Common Denormalization Techniques

  • Combine Tables: Merge related tables into one wide table to reduce the number of JOINs, for example combining Customers and Orders into a single CustomerOrders table (see the sketch after this list).
  • Add Columns with Redundant Data: Store total sales, last order date, or other aggregated or summary values directly in a table to speed up queries that would otherwise need a separate lookup or calculation.
  • Use Precomputed Aggregates: Maintain materialized views or summary tables that are refreshed at set intervals; this can dramatically speed up analytics queries.
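
As a rough sketch of the first technique, a merged CustomerOrders table lets a common read come from a single table, at the cost of duplicating customer attributes on every order row (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A denormalized "wide" table: customer attributes are repeated on every order
# row so that common reads need no JOIN at all.
conn.execute("""
    CREATE TABLE CustomerOrders (
        OrderID      INTEGER PRIMARY KEY,
        CustomerID   INTEGER,
        CustomerName TEXT,     -- duplicated on every order for this customer
        Email        TEXT,     -- duplicated on every order for this customer
        OrderDate    TEXT
    )
""")
conn.execute("INSERT INTO CustomerOrders VALUES (100, 1, 'Alice', 'alice@example.com', '2024-05-01')")

# A single-table read replaces what would otherwise be a Customers/Orders JOIN.
print(conn.execute(
    "SELECT CustomerName, OrderDate FROM CustomerOrders WHERE CustomerID = 1"
).fetchall())
```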

Denormalization Advantages

  • More Efficient Read Operations: Simplifying query logic and eliminating some JOINs enables quicker data access.
  • Simplified Query Logic: With fewer tables to join, queries are easier to write and understand, both for other developers and for visualization and reporting tools.
  • Better Performance in Read-Heavy Environments: Data warehouses, dashboards, and caching layers typically rely on denormalized data to serve reads efficiently.

Denormalization Trade-offs and Challenges

  • Data Duplication: A denormalized database stores multiple copies of the same data, which requires additional storage and means every copy must be updated accurately to maintain consistency.
  • More Complicated Writes: Inserts, updates, and deletes must touch the same data in several places, which adds complexity because all of the copies have to remain consistent.
  • Maintenance Overhead: Keeping duplicates in sync requires extra application logic, triggers, or other mechanisms, all of which add to the complexity of developing and maintaining the application.

Practical Examples

Normalized vs. Denormalized Designs

Normalized Example: E-commerce Database

When designing an e-commerce platform, an obvious, largely normalized design splits the data into separate tables:

  • Customers: CustomerID, CustomerName, Email, Phone
  • Orders: OrderID, CustomerID, OrderDate
  • Products: ProductID, ProductName, Price
  • OrderDetails: OrderDetailID, OrderID, ProductID, Quantity

With this design, when a customer orders multiple products in an order, we do not repeatedly duplicate customer or product values in the Orders table.
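
A minimal SQLite sketch of this normalized schema, with assumed column types, followed by the kind of multi-JOIN read it requires:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        Email        TEXT,
        Phone        TEXT
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customers(CustomerID),
        OrderDate  TEXT
    );
    CREATE TABLE Products (
        ProductID   INTEGER PRIMARY KEY,
        ProductName TEXT,
        Price       REAL
    );
    CREATE TABLE OrderDetails (
        OrderDetailID INTEGER PRIMARY KEY,
        OrderID       INTEGER REFERENCES Orders(OrderID),
        ProductID     INTEGER REFERENCES Products(ProductID),
        Quantity      INTEGER
    );
""")

# Reassembling an order for display requires JOINs across the related tables.
rows = conn.execute("""
    SELECT c.CustomerName, o.OrderDate, p.ProductName, d.Quantity
    FROM Orders o
    JOIN Customers    c ON c.CustomerID = o.CustomerID
    JOIN OrderDetails d ON d.OrderID    = o.OrderID
    JOIN Products     p ON p.ProductID  = d.ProductID
""").fetchall()
```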

Denormalized Example: Sales Performance Summary Table

For sales reporting that needs totals per customer, we could add a denormalized summary table:

  • CustomerSalesSummary: CustomerID, TotalOrders, TotalSpent, LastOrderDate

This table would be refreshed periodically (for example, once a night as a batch job), so the precomputed totals are always available to read.

The reason to add this denormalized table is that joining many tables and aggregating large numbers of rows in memory every time a report is needed would hamper data retrieval time; reading precomputed totals avoids that cost.
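
One plausible shape for that batch job, continuing the SQLite sketch above (it assumes the normalized Orders, OrderDetails, and Products tables already exist; the summary table name and refresh SQL are illustrative):

```python
def refresh_customer_sales_summary(conn):
    """Rebuild the denormalized summary in one pass; run it e.g. as a nightly job."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS CustomerSalesSummary (
            CustomerID    INTEGER PRIMARY KEY,
            TotalOrders   INTEGER,
            TotalSpent    REAL,
            LastOrderDate TEXT
        );
        DELETE FROM CustomerSalesSummary;
    """)
    conn.execute("""
        INSERT INTO CustomerSalesSummary
        SELECT o.CustomerID,
               COUNT(DISTINCT o.OrderID),
               SUM(d.Quantity * p.Price),
               MAX(o.OrderDate)
        FROM Orders o
        JOIN OrderDetails d ON d.OrderID   = o.OrderID
        JOIN Products     p ON p.ProductID = d.ProductID
        GROUP BY o.CustomerID
    """)
```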

Best Practices for Normalization and Denormalization

Start Normalized:
Begin your initial design process with normalization. This provides a solid, clean, and consistent data store.

Profile Your Workload:
Ask whether your application is read-heavy or write-heavy. Ideally, you only denormalize when measured read performance makes it necessary.

Treat Denormalization as a Trade-off:
Use denormalization only when the performance gains outweigh the added complexity and risks you will need to manage.

If You Denormalize, Synchronize:
Redundant copies of data must be kept consistent, so you need a defined mechanism for keeping them in sync. Most denormalization setups rely on database triggers, updates in the application's logic, or a scheduled (for example, nightly) job that refreshes the redundant data.
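
A minimal sketch of the trigger-based option, in SQLite syntax; the denormalized LastOrderDate column and the trigger name are illustrative assumptions, and the same synchronization could instead live in application code or a scheduled job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID    INTEGER PRIMARY KEY,
        CustomerName  TEXT,
        LastOrderDate TEXT           -- denormalized copy of the latest Orders.OrderDate
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customers(CustomerID),
        OrderDate  TEXT
    );

    -- Keep the redundant column in step with every new order.
    CREATE TRIGGER trg_orders_sync_last_order
    AFTER INSERT ON Orders
    BEGIN
        UPDATE Customers
        SET LastOrderDate = NEW.OrderDate
        WHERE CustomerID = NEW.CustomerID;
    END;
""")

conn.execute("INSERT INTO Customers (CustomerID, CustomerName) VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (100, 1, '2024-05-01')")
print(conn.execute("SELECT LastOrderDate FROM Customers WHERE CustomerID = 1").fetchone())
```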

Consider Alternatives Before Denormalizing:
Sometimes a caching layer improves read performance enough that the schema can stay normalized. Keep in mind that cached values, like denormalized columns, can become stale and need a refresh or invalidation strategy.
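
A rough sketch of that alternative: the application caches an expensive aggregate for a short TTL instead of duplicating it in the schema (it reuses the normalized tables sketched earlier; the TTL, cache shape, and query are assumptions).

```python
import time

_cache = {}            # {cache_key: (value, timestamp)}
TTL_SECONDS = 60       # assumed freshness window

def customer_total_spent(conn, customer_id):
    """Return a customer's total spend, serving a cached value while it is fresh."""
    key = ("total_spent", customer_id)
    hit = _cache.get(key)
    if hit and time.time() - hit[1] < TTL_SECONDS:
        return hit[0]                      # fresh enough: skip the JOIN-heavy query
    total = conn.execute(
        "SELECT COALESCE(SUM(d.Quantity * p.Price), 0) "
        "FROM Orders o "
        "JOIN OrderDetails d ON d.OrderID = o.OrderID "
        "JOIN Products p ON p.ProductID = d.ProductID "
        "WHERE o.CustomerID = ?", (customer_id,)
    ).fetchone()[0]
    _cache[key] = (total, time.time())     # cached data may be slightly stale
    return total
```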

Monitor and Optimize:
Workloads change over time, so analyze your workload, measure performance metrics, and revisit your schema regularly.

Conclusion

A clear understanding of the distinction between normalization and denormalization in relational database design will enable you to architect a solution that is efficient, scalable, and maintainable. Normalizing your data keeps it consistent, minimizes duplication, and eases maintenance, making it an appropriate choice for transactional applications. Denormalization improves read performance at the cost of duplication and increased complexity, and is useful for reporting, analytics, and high-traffic, read-heavy scenarios.

By choosing the right mix of normalization and denormalization for your application's requirements, you can create a database architecture that maintains integrity and scales to the required performance level.