ThinkingCog

Designing a performant data model to accommodate data for an application can be a bit of a challenging task for a new developer. The difficulty arises from the lack of understanding of the nature of the data itself and using the tools relational database design to divide data logically into tables. In this short article, I have tried to shed light on normalization in relational databases.

Normalization in relational databases is the process of minimizing the duplication of data. This is done by spreading data into different tables, all logically connected via relationships. Normalization is a play on divide and rule strategy in the computer science domain.

Before we move further into the article, some of the terminologies bear a small introduction here.

Atomicity: In the context of relational databases, atomicity refers to the condition of data when it cannot be broken any further without loss of its comprehensiveness.

Attribute: Attribute describes characteristics of data. That can be the max length of the first name or the fact that the date of birth of a person in an application meant to store passport details can only go back in the past, say, 125 years.

Field: A field represents a data point which is a fundamental unit that makes up a record. A field stores an atomic value in an optimized form.

Record: A record represents a complete unit of data that is stored in a table. A record is made up of fields. Column: A column is an implementation in relational databases to store a field.

Row: A row is an implementation in relational databases to store a record.

Table: A table is a data structure in a relational database that is made up of rows and columns and facilitates the storage and querying of data.

Key: In the English language the word key as an adjective means something of crucial importance. There are different kinds of keys in a relational database and they play a crucial role in the functioning of a database.

Primary Key: A primary key is a construct in a table in a relational database that can be used to uniquely identify a record. We will discuss various kinds of primary keys at an opportune point in the article.

Now let’s take an example and thoroughly understand normalization.

Example: Consider that we have to store information about books in our database. The information comprises of the following:

Book’s title
Book’s author(s)
ISBN
Publisher’s name
Publisher’s contact
Date of publication
Cost
Category or categories in which the book can be filed
The book should be individually identifiable based on the ISBN
The book should be searchable on the basis of one of the categories that it can be filed in or the publisher’s name

Now, the way a new developer would think would be on the lines of accommodating all this information in just one table. Let’s understand the problems that one would encounter in such a scenario.

ISBN	Title	Authors	Publisher	PublisherContact	DateOfPublication	Price	Category
9781408855652	Harry Potter and the Philosopher's Stone	J.K. Rowling	Pottermore Publishing	Harry Potter	8-12-2015	200	Fantasy, Fiction
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	Brian Christian, Tom Griffiths	William Collins	John Doe	26-04-2017	250	Computer Science
938943243X	Rangbhoomi	Munshi Premchand	Fingerprint Publishing	Parakh Singhal	01-09-2019	672	Hindi Literature

Per the requirements, we can have multiple authors for a book and, a book can be categorized in multiple categories. Thus, we have the potential of storing multiple values for each of the fields of “Authors” and “Categories.”

Problems with this design: Anytime the number of authors changes for a book (by addition or update of a chapter(s) written by some new author, or by an existing author selling his stake in the works) or we want to increase or decrease the categories in which a book can be filed, we will have to read the existing data, modify it and then write it back into the existing row. This is tedious and time-consuming from a programming standpoint. A simple row deletion or addition independent of the existing row(s) would be far simpler and faster.

In order to eliminate the set of problems that arise under the category of organization of data in relational databases, we have the tool of normalization.

Normalization is a process under which we minimize data duplication and straighten relationships among the attributes of data. The various stages in this process are called "normal forms." There are 4 normal forms with a focus on organizing data progressively. In this article, we will focus on the 1st Normal Form, 2nd Normal Form, and the 3rd Normal Form.

1^st Normal Form

The academic definition of the first normal form is that a table is in the first normal form if a relation has all the attributes in the atomic state.

In simple words, the definition merely states not to store multiple values in a field. In our example above, both the fields of “Categories” and “Authors” violate the first normal form, as we are storing multiple values in those fields. When we store only a single coherent value in a field that cannot be broken any further, and hence atomic, we achieve the 1st Normal Form. Therefore, we need to store authors and categories separately in our table, which would then look something like this:

ISBN	Title	Authors	Publisher	PublisherContact	DateOfPublication	Price	Category
9781408855652	Harry Potter and the Philosopher's Stone	J.K. Rowling	Pottermore Publishing	Harry Potter	8-12-2015	200	Fantasy
9781408855652	Harry Potter and the Philosopher's Stone	J.K. Rowling	Pottermore Publishing	Harry Potter	8-12-2015	200	Fiction
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	Brian Christian	William Collins	John Doe	26-04-2017	250	Computer Science
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	Tom Griffiths	William Collins	John Doe	26-04-2017	250	Computer Science
938943243X	Rangbhoomi	Munshi Premchand	Fingerprint Publishing	Parakh Singhal	01-09-2019	672	Hindi Literature

Now we have atomic values in all our fields and the table complies with the first normal form.

But the compliance with the first normal form has given rise to another problem. Now we cannot use the ISBN field as the primary key as it exists in multiple rows.

This is a good place to introduce the three types of primary keys that may exist in your data. They are:

Natural Primary Key
Composite Primary Key
Surrogate Primary Key

Natural Primary Key: A primary key is called a natural primary key if the data that needs to be accommodated in a table has a field that is unique and can be used to identify a record uniquely among other records. In our example, before we implemented the first normal form, the ISBN can be organically used to uniquely identify a record, thus forming a natural primary key.

Composite Primary Key: A natural primary key that is made up of a bunch of fields in the data that needs to be accommodated in a table, is called a composite primary key. Composite primary key comes into the picture when a single field cannot be used to uniquely identify a record from among other records, but a combination of certain fields is unique and can serve the purpose. In our example, after the implementation of the first normal form, a combination of ISBN and author name can be used to uniquely identify a record from among other records, as an author’s name is only going to appear once for an ISBN. Thus, the combination of author name and ISBN can form a composite primary key.

Surrogate Primary Key: A key that is not related to data naturally in any manner whatsoever is called a surrogate primary key. Generally, it is a key that gets generated automatically by the database engine. The developer is just required to select the data type of the surrogate primary key. In our example, we can implement an int type or a GUID type surrogate primary key instead of relying on a composite primary key as described before.

Because we cannot use just the field of ISBN as the primary key, we will have to go for either a composite primary key or a surrogate primary key. In our case, we can go for a composite primary key and remain within the confines of our natural data, without introducing a surrogate primary key.

2^nd Normal Form

In the second normal form, we work with a table that is already in the first normal form. Then we remove the partial dependency from the table if any.

Partial Dependency: Partial dependency is a condition whereby the fields contained within a record depend only upon one portion of the primary key.

In our example, when we achieved the first normal form, we introduced a composite primary key which was made up of the ISBN and the author’s name fields. Hence, the data in the table depended on two fields that made up the primary key. To push our example to comply with the second normal form, we will have to bifurcate the data into two distinct tables – each portion of the composite primary key representing a separate table. In our example we will create two tables, one housing data about authors and the other housing data about the books.

Our tables would look something like this:

Books table with the field “ISBN” as the natural primary key:

ISBN	Title	Publisher	PublisherContact	DateOfPublication	Price	Category
9781408855652	Harry Potter and the Philosopher's Stone	Pottermore Publishing	Harry Potter	8-12-2015	200	Fantasy
9781408855652	Harry Potter and the Philosopher's Stone	Pottermore Publishing	Harry Potter	8-12-2015	200	Fiction
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	William Collins	John Doe	26-04-2017	250	Computer Science
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	William Collins	John Doe	26-04-2017	250	Computer Science
938943243X	Rangbhoomi	Fingerprint Publishing	Parakh Singhal	01-09-2019	672	Hindi Literature

Authors table with the field “AuthorId” as the surrogate primary key:

AuthorId	Authors
1	J.K. Rowling
2	Brian Christian
3	Tom Griffiths
4	Munshi Premchand

Now both the tables do not have any partial dependency and have graduated to the second normal form.

3^rd Normal Form

In the third normal form, we work with a table that is in the second normal form and aim to remove any transitive dependency, if any.

Transitive Dependency: Transitive dependency is a condition whereby a field is not dependent on the primary key in any way.

In our example, the publisher’s contact is dependent on the publisher and does not in any way connect to the book and hence ISBN, which happens to the natural primary key in our Books table. We will push the publisher’s contact into a separate table to achieve the third normal form. Our tables would then look something like this:

Books table with ISBN as the natural primary key and the transitive dependency of publisher’s contact field removed:

ISBN	Title	Publisher	DateOfPublication	Price	Category
9781408855652	Harry Potter and the Philosopher's Stone	Pottermore Publishing	8-12-2015	200	Fantasy
9781408855652	Harry Potter and the Philosopher's Stone	Pottermore Publishing	8-12-2015	200	Fiction
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	William Collins	26-04-2017	250	Computer Science
9780007547999	Algorithms to Live By: The Computer Science of Human Decisions	William Collins	26-04-2017	250	Computer Science
938943243X	Rangbhoomi	Fingerprint Publishing	01-09-2019	672	Hindi Literature

Authors table with the field “AuthorId” as the surrogate primary key:

AuthorId	Authors
1	J.K. Rowling
2	Brian Christian
3	Tom Griffiths
4	Munshi Premchand

Publisher Contact table with the field “Publisher” as the natural primary key:

Publisher	PublisherContact
Pottermore Publishing	Harry Potter
William Collins	John Doe
Fingerprint Publishing	Parakh Singhal

We can further improve upon the final design by adopting surrogate primary keys in all our tables. That way if there’s any update of data in fields acting as natural primary keys, we will not have to update the same in the tables utilizing them as foreign keys. Also, we reduced the duplication of data in the Books table by creating a bridge entity to accommodate the many-to-many relationship between a book and the categories into which it can be classified.

Data model with surrogate primary keys:

Figure 1 Normalized Data Model with Bridge Tables to handle Many-To-Many Relationships

In a nutshell:

1st Normal Form: Aim for atomicity of data by elimination of multiple values in a single field and identification of a field that can act as a primary key.

2nd Normal Form: Make sure the table already exists in the first normal form and eliminate any partial dependencies.

3rd Normal Form: Make sure the table already exists in the second normal form and eliminate any transitive dependencies.

I hope this article helped you in learning the concept of normalization.

Articles written by Parakh Singhal

Normalization In Relational Databases

1^st Normal Form

2^nd Normal Form

3^rd Normal Form