There’s been a bit of a database religious war on Dare Obasanjo and Jeff Atwood’s blogs, all on the subject of database normalization: when to normalize, when not to, and the performance and data integrity issues that underly the decision.
Here’s the root of the argument. What we’ve all been taught regarding database design is irrelevant if the design can’t deliver the necessary performance results.
The 3rd normal form helps to ensure that the relationships in your DB reflect reality, that you don’t have duplicate data, that the zero to many relationships in your system can accommodate any potential scenario, and that space isn’t wasted and reserved for data that isn’t explicitly being used. The downside is that a single object within the system may span many tables and, as your dataset grows large, the joins and/or multiple selects required to extract entities from the system begins to impact the system’s performance.
By denormalizing, you can compromise and pull some of those relationships back into the parent table. You might decide, for instance, that a user can have only 3 phone numbers, 1 work address, and 1 home address. In doing so, you’ve met the requirements of the common scenario and removed the need to join to separate address or contact number tables. This isn’t an uncommon compromise. Just look at the contacts table in your average cell phone to see it in action.
Both solutions have their pros and cons. So let me put the question to you: which is better — a normalized database, or a denormalized database?
Trick question! The answer is that it doesn’t matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n.
So for large n, what’s the solution? In my personal experience, you can usually have it both ways.
Design your database to 3NF from the beginning to ensure data integrity and to allow room for growth, additional relationships, and the sanity of future querying and indexing. Only when you find there are performance problems do you need to think about optimizing. Usually this can be accomplished through smarter querying. When it cannot, you derive a denormalized data set from the normalized source. This can be as simple as an extra field in the parent table that derives sort information on inserts, or it can be a full-blown object cache table that’s updated from the official source at some regular interval or when an important even occurs.
Read the discussions and share your comments. To me, the big takeaway is that there’s no one solution that will fit every real world problem. Ultimately, your final design has to reflect the unique needs of the problem that is being solved.