The normalization process has a set of steps to work towards in a database and the tables that are generated. Each table typically will represent a single subject. For example, in looking at our database in PostgreSQL, the customer table only contains the data about the customer, the employee table only contains data about the employee, and so forth. In addition, each row/column intersection should only contain one value and not a group of values. If we looked at the example from the movie rating table, we will see that the actor and genre have multiple values:
We also want to ensure that there is no data item that is unnecessarily stored in more than one table to help minimize redundancy. This way, if data needs to be added, updated, or deleted, it only needs to be done in one place. All of the attributes in a table should be dependent on the primary key so that the data can be uniquely identifiable by the primary key values. Each of the tables should have no insertion, update, or deletion anomalies so that we can ensure that the data is consistent.
To help reach these objectives, we will go through the normalization process one stage at a time. For the first normal form, we will ensure that each column in our tables will be single-valued meaning that they should not contain multiple values. By doing so, we will be able to eliminate any repeating groups of columns. We want to also ensure that we have a unique name for each of the attributes or columns in a table. The first normal form is when we identify the primary keys of each table. A common-sense rule with the first normal form is to ensure that each of the column values should be of the same data type and purpose. This means that if we set a phone number column in the customer table, we should only store the phone number rather than putting in an address in the field.
For the second normal form, the table should be in the first normal form to start with but there should also be no partial dependency. It can help to look at a table like the employee table:
The primary key in this table is the employee_id. This allows us to uniquely identify each row even if we may have two employees with the same name. The employee_id is unique for every single row and using that, we can reference a specific record. If we need to find out the address of an employee, we can use the employee_id to reference it and that would be the case for every column in this table. This means that every column depends on it or their respective data can be fetched using it.
Now that we have an idea of dependency, we can now define what partial dependency is. Partial dependency can occur where an attribute in a table depends on only a part of the primary key but not the whole key. This is even the case if the primary key is uniquely generated as you have to consider candidate keys. If this occurs, we need to remove the attribute that is causing partial dependency and move it to another table where it is fully dependent on the primary key.
For the third normal form, the tables should be in the second normal form and then have no transitive dependencies. A transitive dependency exists when there are functional dependencies when you may have a column X dependent on the primary key of a table but then you have a dependency of column Y being dependent on column X. With this, we have a transitive dependency of column Y being dependent on the primary key but not directly. The solution ends up being quite simple to remove the columns that are creating a transitive dependency into another table. The third normal form is typically where we will end the normalization process.
In other normal forms, we have Boyce-Codd Normal Form (BCNF)which is in the third normal form and if there are any functional dependencies, the referenced item should be a super key. With the fourth normal form, the database design needs to be in BCNF and tables should not have any multi-valued dependency. We will get into that detail later about what these normal forms represent in an upcoming tutorial.