When to denormalize


There’s been a bit of a database religious war on Dare Obasanjo and Jeff Atwood’s blogs, all on the subject of database normalization: when to normalize, when not to, and the performance and data integrity issues that underlie the decision.

Here’s the root of the argument. What we’ve all been taught regarding database design is irrelevant if the design can’t deliver the necessary performance results.

The 3rd normal form helps to ensure that the relationships in your DB reflect reality, that you don’t have duplicate data, that the zero-to-many relationships in your system can accommodate any potential scenario, and that space isn’t wasted and reserved for data that isn’t explicitly being used. The downside is that a single object within the system may span many tables and, as your dataset grows large, the joins and/or multiple selects required to extract entities from the system begin to impact the system’s performance.
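To make that concrete, here’s a minimal sketch of a 3NF layout using an in-memory SQLite database. The table and column names (`users`, `phone_numbers`, `addresses`) are my own illustrative choices, not anything from the posts being discussed; the point is just that reading one "user" entity touches three tables.

```python
import sqlite3

# Hypothetical 3NF layout: a user's phone numbers and addresses live in
# separate child tables, so a user can have any number of each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE phone_numbers (
        user_id INTEGER NOT NULL REFERENCES users(id),
        number  TEXT NOT NULL
    );
    CREATE TABLE addresses (
        user_id INTEGER NOT NULL REFERENCES users(id),
        kind    TEXT NOT NULL,   -- 'home', 'work', ...
        street  TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.executemany("INSERT INTO phone_numbers VALUES (?, ?)",
                 [(1, "555-0100"), (1, "555-0101")])
conn.execute("INSERT INTO addresses VALUES (1, 'home', '1 Main St')")

# Assembling the full entity spans three tables via joins.
rows = conn.execute("""
    SELECT u.name, p.number, a.street
    FROM users u
    LEFT JOIN phone_numbers p ON p.user_id = u.id
    LEFT JOIN addresses   a ON a.user_id = u.id
""").fetchall()
```

With two phone numbers and one address, the join already fans out to two result rows; at millions of users, this fan-out and the extra index lookups are exactly where the performance pressure shows up.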

By denormalizing, you can compromise and pull some of those relationships back into the parent table. You might decide, for instance, that a user can have only 3 phone numbers, 1 work address, and 1 home address. In doing so, you’ve met the requirements of the common scenario and removed the need to join to separate address or contact number tables. This isn’t an uncommon compromise. Just look at the contacts table in your average cell phone to see it in action.
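The denormalized compromise described above might look like the following sketch (again with hypothetical names): fixed slots for three phone numbers and the two addresses are folded into the parent table, so the common read needs no joins at all.

```python
import sqlite3

# Hypothetical denormalized version: fixed columns replace the child
# tables. A user can now have at most 3 phones and 2 addresses.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        phone1 TEXT, phone2 TEXT, phone3 TEXT,
        work_address TEXT,
        home_address TEXT
    )
""")
conn.execute(
    "INSERT INTO users VALUES (1, 'Ada', '555-0100', '555-0101', NULL, "
    "'2 Office Rd', '1 Main St')")

# One row, one read -- the whole contact comes back in a single select.
row = conn.execute("SELECT * FROM users WHERE id = 1").fetchone()
```

The cost is visible in the schema itself: the `NULL` in `phone3` is reserved-but-unused space, and a fourth phone number simply has nowhere to go.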

Jeff writes:

Both solutions have their pros and cons. So let me put the question to you: which is better — a normalized database, or a denormalized database?

Trick question! The answer is that it doesn’t matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n.

So for large n, what’s the solution? In my personal experience, you can usually have it both ways.

Design your database to 3NF from the beginning to ensure data integrity and to allow room for growth, additional relationships, and the sanity of future querying and indexing. Only when you find there are performance problems do you need to think about optimizing. Usually this can be accomplished through smarter querying. When it cannot, you derive a denormalized data set from the normalized source. This can be as simple as an extra field in the parent table that derives sort information on inserts, or it can be a full-blown object cache table that’s updated from the official source at some regular interval or when an important event occurs.
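The "derived field maintained on inserts" approach can be sketched in a few lines of SQLite. This is one possible mechanism (a trigger); application code or a periodic batch job would work just as well. The names here (`phone_count`, `bump_phone_count`) are illustrative assumptions, not anything prescribed by the posts.

```python
import sqlite3

# The normalized tables remain the source of truth; a denormalized
# phone_count column on users is kept in sync at write time so the
# common read never has to touch (or count over) the child table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        phone_count INTEGER NOT NULL DEFAULT 0   -- derived field
    );
    CREATE TABLE phone_numbers (
        user_id INTEGER NOT NULL REFERENCES users(id),
        number  TEXT NOT NULL
    );
    -- Maintain the derived value on every insert into the child table.
    CREATE TRIGGER bump_phone_count AFTER INSERT ON phone_numbers
    BEGIN
        UPDATE users SET phone_count = phone_count + 1
        WHERE id = NEW.user_id;
    END;
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.executemany("INSERT INTO phone_numbers VALUES (?, ?)",
                 [(1, "555-0100"), (1, "555-0101")])

count = conn.execute(
    "SELECT phone_count FROM users WHERE id = 1").fetchone()[0]
```

Because the derived value is written alongside the authoritative rows, you keep 3NF’s integrity guarantees while paying a small write-time cost instead of a read-time join or aggregate.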

Read the discussions and share your comments. To me, the big takeaway is that there’s no one solution that will fit every real world problem. Ultimately, your final design has to reflect the unique needs of the problem that is being solved.

When Not to Normalize your SQL Database
Maybe Normalizing Isn’t Normal

7 thoughts on “When to denormalize”

  1. netcrusher88 says:

    Personally I’d just patch and not bother with going to djbdns right now – you’re going to need to switch back to BIND or something when you roll out DNSSEC, since djbdns has been abandoned (although it is public domain) and BIND already has DNSSEC support (though certain features aren’t implemented yet).

  2. Anonymous Joe says:

Holy crap, while I’m a fan of this site, the last time I saw someone get stroked like that was in a porno. Dan Bernstein has an over-inflated ego, and with editorial like this, it’s no wonder.

    Make sure you use plenty of lube.

  3. Matt says:

Since I have to deal with djbdns on a daily basis – as well as qmail – I would suggest that you fully evaluate these systems before you make a switch. It’s a great system if you don’t mind your DNS just stopping for unknown reasons now and then. Also, if you like reading m4, you’ll really enjoy reading zone files and logs.

    To get qmail to function as a “good email citizen” requires some patching. Sure, qmail is open source now. Ring me up in 1998 when it would have mattered.

    I don’t know DJB, but I’ve heard the rumors – but I figured some honest feedback from someone who administers these kinds of systems in a medium-sized environment might be useful. They work fairly well, but they make my life more difficult than it needs to be.

  4. Jason Striegel says:

    I think the point of this post is that here’s a great case study for security by design. Here’s an application created 8 years ago that remains secure against a modern day attack, an attack to which pretty much every other implementation fell victim.

    It’s no news to anyone that DJB is caustic, and that’s probably done more to reduce the adoption of his software over the years than anything else. Thing is, we _should_ be choosing software because it was designed well, not because we want to marry the designer.

    I don’t choose my DNS server because the creator had a slick website or a pleasant attitude. I’m not going to switch to an inferior filesystem just because the creator was a jerk in the community for 10 years and then murdered his wife. Full disclosure: I actually switched to ext3 a while back, but hell, you get my point.

    Bernstein isn’t really the subject here. Rather, his software is the example. What’s important to take away from this is that there is a software development model that works. Software should be crafted with a specific goal in mind: security.

    Matt, you make a valid point regarding ease of use and simplicity. That’s certainly a consideration, but I don’t think it trumps security and solid design on its merits alone. When I started using qmail and djbdns, it wasn’t because I liked administering it (I didn’t), it was because it was highly recommended by a decent sized ISP administrator that I trusted, and I could see that some care had been taken with the design of the software. Folks should use what they are comfortable with. It doesn’t matter who made it—I’m most comfortable with secure software.

  5. Jeff says:

I agree with many of your points, but believe that we should strive for the best of both worlds. Every time I think we have turned the corner and achieved the next best thing with regard to security, ease of use and integration, someone drinking cases of Red Bull and smoking cartons of cigarettes in their basement proves us wrong.

  6. Russell Nelson says:

    I have no idea what Matt is talking about. I’ve never had djbdns stop serving records except when I’ve filled up the file system. Djbdns wouldn’t be the only thing that pauses when the filesystem is full. And when you free up space it continues without any problems.

    Netcrusher88 is blowing smoke. DNSSEC doesn’t work; it’s never worked; and it will never be implemented, because it can’t work. Crypto software doesn’t fall on its face because the crypto gets broken. It fails because key distribution doesn’t work. Anyway, now that djbdns is in the public domain, we can add DNSSEC when (if) it’s been demonstrated to work and have value. In the meantime, implemented but unused features are a SECURITY HOLE waiting to be discovered.

    djb is only caustic to fools. If you’ve seen that side of him, wear your dunce cap proudly.

  7. Anonymous says:

    Bernstein is not infallible and djbdns is not perfect, either:

    http://secunia.com/advisories/33855/
