Normalization is a data modeling technique developed by Dr. E.F. Codd at IBM in 1972 for producing a set of relations that possess certain properties. Through his work with relational theory, Codd discovered that unnormalized data was subject to certain anomalies(insert, delete, update) when changes to data are made.
The normalization process is straightforward. You take any relation(data presented logically in column and row format) and choose a primary key. Then through a series of steps(normal forms), the data is reorganized to eliminate these anomalies. Lets consider a sample invoice:
Sample Company
Invoice
Customer No: 12345 Terms: Net 30
Customer: ABC, Inc. Ship Terms: USPS
2222 Main Street
Fallbrook, CO 83555 Order Date: 11/30/2009
(222)222-2222
Product No. Description Qty Unit Price Extended Amt
----------- ------------- ------ ----------- -------------
RPS-4490 Springs 2 15.00 $ 30.00
STR-3300 Belts 2 5.00 10.00
MLT-1200 Straps 1 3.00 3.00
TOTAL ORDER AMOUNT $ 43.00
This invoice would be considered a user view and displays the data problems identified by Codd. If one were to make a database table from this view, a new customer could not be inserted into the database without them buying something. This is because the customer data is embedded in the invoice. This is the insert anomaly.Invoice
Customer No: 12345 Terms: Net 30
Customer: ABC, Inc. Ship Terms: USPS
2222 Main Street
Fallbrook, CO 83555 Order Date: 11/30/2009
(222)222-2222
Product No. Description Qty Unit Price Extended Amt
----------- ------------- ------ ----------- -------------
RPS-4490 Springs 2 15.00 $ 30.00
STR-3300 Belts 2 5.00 10.00
MLT-1200 Straps 1 3.00 3.00
TOTAL ORDER AMOUNT $ 43.00
Similarly, if this was the last invoice for a customer, deleting it would cause the loss of all data on the customer. Again, this is because data from two entities are mixed into a single relation. This is the delete anomaly. Finally, if one wanted to change the customer's address, it would have to be changed on every invoice and the possibility of an update being being missed is high. This is the update anomaly.
To begin normalizing the data, one should first identify a unique identifier for the invoice, e.g., invoice # and then separate the customer and product data into separate relations. Each additional relation gets its own unique identifier, called a primary key(#). The normalization process would result in the following tables:
INVOICE: #Invoice Number, Customer Number, Terms, Ship Terms, Order Date
INVOICE LINE ITEM: #Invoice Number, #Product Number, Quantity, Sales Price Unit
PRODUCT: #Product, Description, List Unit Price
CUSTOMER: #Customer Number, Customer Name, Customer Address, City, State, Zip Code
Phone No.
A relation is in 1st normal form if it contains no multivalued attributes. That is, every intersection of a row and column must contain at most one data value. With this example, the product line items are placed into a separate relation with a composite key combining the invoice# and product#.INVOICE LINE ITEM: #Invoice Number, #Product Number, Quantity, Sales Price Unit
PRODUCT: #Product, Description, List Unit Price
CUSTOMER: #Customer Number, Customer Name, Customer Address, City, State, Zip Code
Phone No.
A relation is in 2nd normal form if all non-key attributes are dependent of the entire primary key. Here, the product information depends only on the product# and not the composite key of invoice# and product#. As such, product is moved to a separate relation.
A relation is in 3rd normal form if all the non-key attributes depend only on the primary key. Here the invoice includes attributes of another entity(Customer) included in the invoice relation. An attribute that depends on another attribute that is not the primary key of the relation is considered to be transitively dependent. The solution is to simply move any transitory dependent attributes to relations where they depend only on the primary key.
The end result of the data normalization process is every non-key attribute must depend entirely on the primary key. Although there are now 4 relations instead of 1, the problems of insert, update, and delete anomalies are eliminated. The data is linked through the unique key identifiers and ad hoc requests for information is not dependent on how the records are stored in the database.
No comments:
Post a Comment