Monday, December 28, 2009

The Relational Model and Data Normalization

The relational model is based on the concept that any predefined path through a data structure is too restrictive, particularly in light of ad hoc requests for information. This model presents data in a two-dimensional table format, similar to a spreadsheet, and provides the ability to relate records as needed irrespective of how the data records were first stored in a database.

Normalization is a data modeling technique developed by Dr. E.F. Codd at IBM in 1972 for producing a set of  relations that possess certain properties. Through his work with relational theory, Codd discovered that unnormalized data was subject to certain anomalies(insert, delete, update) when changes to data are made.

The normalization process is straightforward. You take any relation(data presented logically in column and row format) and choose a primary key. Then through a series of steps(normal forms), the data is reorganized to eliminate these anomalies. Lets consider a sample invoice:

                                                             Sample Company
                                                                   Invoice

Customer No: 12345                                                                         Terms: Net 30
Customer: ABC, Inc.                                                                         Ship Terms: USPS
               2222 Main Street
               Fallbrook, CO 83555                                                         Order Date: 11/30/2009
              (222)222-2222

Product No.          Description             Qty              Unit Price         Extended Amt
-----------            -------------          ------            -----------        -------------
RPS-4490            Springs                    2                   15.00                  $  30.00
STR-3300            Belts                        2                    5.00                      10.00
MLT-1200            Straps                      1                    3.00                       3.00

TOTAL ORDER AMOUNT                                                               $  43.00

This invoice would be considered a user view and displays the data problems identified by Codd.  If one were to make a database table from this view, a new customer could not be inserted into the database without them buying something.  This is because the customer data is embedded in the invoice.  This is the insert anomaly.

Similarly, if this was the last invoice for a customer, deleting it would cause the loss of all data on the customer.  Again, this is because data from two entities are mixed into a single relation. This is the delete anomaly.   Finally, if one wanted to change the customer's address, it would have to be changed on every invoice and the possibility of  an update being being missed is high.  This is the update anomaly.

To begin normalizing the data,  one should first identify a unique identifier for the invoice, e.g., invoice # and then separate the customer and product data into separate relations. Each additional relation gets its own unique identifier, called a primary key(#). The normalization process would result in the following tables:

INVOICE:  #Invoice Number, Customer Number, Terms, Ship Terms, Order Date
INVOICE LINE ITEM:  #Invoice Number, #Product Number, Quantity, Sales Price Unit
PRODUCT:  #Product, Description, List Unit Price
CUSTOMER:  #Customer Number, Customer Name, Customer Address, City, State, Zip Code
                         Phone No.
A relation is in 1st normal form if it contains no multivalued attributes.  That is, every intersection of a row and column must contain at most one data value. With this example, the product line items are placed into a separate relation with a composite key combining the invoice# and product#.

A relation is in 2nd normal form if all non-key attributes are dependent of the entire primary key.  Here, the product information depends only on the product# and not the composite key of invoice# and product#.  As such, product is moved to a separate relation.

A relation is in 3rd normal form if all the non-key attributes depend only on the primary key.  Here the invoice includes attributes of another entity(Customer) included in the invoice relation.  An attribute that depends on another attribute that is not the primary key of the relation is considered to be transitively dependent.  The solution is to simply move any transitory dependent attributes to relations where they depend only on the primary key.

The end result of the data normalization process is every non-key attribute must depend entirely on the primary key. Although there are now 4 relations instead of 1, the problems of insert, update, and delete anomalies are eliminated.  The data is linked through the unique key identifiers and ad hoc requests for information is not dependent on how the records are stored in the database.

No comments:

Post a Comment

Get your own Widget