Daniel Lemire's blog


Over-normalization is bad for you

17 thoughts on “Over-normalization is bad for you”

  1. @Antonio

    I don’t think I am missing the point. I do ask however whether this possibility you evoke—that the names will go out of sync whereas they shouldn’t—is a real threat.

    We used to think that catching bugs at design time was essential. Our experience in the last 20 years has shown that belief to be somewhat overblown. Facebook was built with PHP, Twitter with Ruby.

    Many NoSQL databases work just like my example. There is no trace of normalization. And people seem to get real work done.

  2. @glenn

    “Which is fine if that’s the correct application logic, and not fine if it isn’t.”

    I’d prefer to have strong data independence. That is, I would prefer to be able to change the application logic without having to change the database schema.

    It seems best if the developer rarely needs to change the database schema, because that’s an expensive operation. In practice, I suspect a lot of features are not implemented, or they are badly implemented, because they don’t match the database schema.

    I would rather make the database engine work harder to adapt, rather than have the developer redesign the tables.

    “I see lots of flattened data where the flattening has made it impossible or onerous to answer very reasonable and valuable questions.”

    In my example, there does not need to be any performance difference between the two databases. (I admit that there might be in the real world.) It is a matter of physical layout, but I’m only discussing the logical representation.

  3. @glenn

    “Joins should be an internal issue below the query layer.”

    I think this is almost exactly what I am saying.

  4. Antonio Badia says:

    Daniel,
    I think you are missing the point here. The problem with non-normalized data is not the added storage (which happens, but disk is cheap), but the anomalies that arise when data is stored redundantly: insertion, deletion, and update anomalies. If the Jane Wright in your example is one person, and she is represented by two rows in the non-normalized table, what stops an application from changing one row without changing the other (say, changing the customerid in one and not the other), thus leaving the database in an inconsistent state? This, I think, is the real reason to normalize.
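    Concretely, with made-up table and column names (this is only a sketch, not the schema from the post), the update anomaly looks roughly like this:

    ```sql
    -- A denormalized customer/phone table (names invented for illustration).
    CREATE TABLE customer_phone (
      customerid INTEGER,
      name       TEXT,
      telephone  TEXT
    );

    INSERT INTO customer_phone VALUES (42, 'Jane Wright', '555-1212');
    INSERT INTO customer_phone VALUES (42, 'Jane Wright', '555-8858');

    -- Update anomaly: nothing prevents an application from renaming one row
    -- and not the other, so the same customer ends up with two names.
    UPDATE customer_phone
       SET name = 'Jane Smith'
     WHERE customerid = 42
       AND telephone = '555-1212';
    ```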

  5. Two notes:

    “there is no need to change the schema to add such a feature”

    More accurately, this feature matches the logical structure of this database. So it’s not that it’s easy to “add” such a feature, it’s that this database already has it. Which is fine if that’s the correct application logic, and not fine if it isn’t. And since the main pressure towards denormalization is bad tools for dealing with normalization, there’s no reason to believe that denormalized data structures will reliably coincide with correct features you just hadn’t thought about yet.

    “most of your queries will be shorter and more readable”

    But shorter and more readable is no victory if the results are wrong. And when you flatten a non-flat structure, you lose the ability to get reliable answers to questions along any dimension other than the one you used for flattening. If your application only ever needs to ask questions in one direction, that may be acceptable. But I almost never see non-trivial non-flat datasets where there actually is only one dimension anybody ever really cares about.

    I should turn that around: I see lots of flattened data where the flattening has made it impossible or onerous to answer very reasonable and valuable questions.
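    As a sketch (the flat table and its columns are invented here): once orders are flattened against telephone numbers, the obvious query along any other dimension silently over-counts, and every query has to remember the flattening to compensate.

    ```sql
    -- Invented flat table: one row per (order line, telephone) combination.
    SELECT name, COUNT(*) AS order_lines
    FROM customer_orders_flat
    GROUP BY name;              -- over-counts: each order line repeats per phone

    -- Compensating is possible, but only if every query knows about the flattening:
    SELECT name, COUNT(DISTINCT order_id) AS orders
    FROM customer_orders_flat
    GROUP BY name;
    ```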

  6. @Badia

    “disk is cheap”

    Yes, but you do want to load as much as possible into your L2 and L1 caches. Larger databases are often slower on that account.

  7. “Data independence” = full logical normalization. It’s tempting to wish that wasn’t the case, because we’ve spent decades working with bad tools that make you pay for every bit of normalization with relentless pain. But it’s really true, and shouldn’t be scary. All the principles of relational database construction (I mean the logical principles, not any tactical crap about specific nominally-relational implementations) are precisely about making your database accurately express the logical structure of the data, such that your use of it will be constrained only by what it *is*, rather than how it’s been stored. If you want to be able to change the application, with ebullient impunity, without changing the database schema, then you *have* to care about normalization. Normalization is to data what engineering is to bridges. Data modeling (calling it “normalization” was probably a mistake to begin with) is to data what physics is to things. You can’t be against it. It *is*.

    And yet, you’re absolutely undeniably irrevocably right (I think) about at least two things:

    1. Normalization orthodoxy encourages people to focus on database rules rather than scrutinizing actual logic, which often leads them to make bad modeling decisions. I think it’s possible that the biggest single cause of this is the now-established ass-backwards pattern of teaching “normalization” as a series of pedantically phrased penances you are guilted into performing on a nice, normal spreadsheet. If data modeling were taught by explaining how to construct a logical model the right way, the first time, I think it would seem vastly more comprehensible and a lot less frightening.

    2. In SQL, normalization = pain. It makes perfect sense to hate joins, and want to not have to do them. But you should never have had to do them to begin with. Joins should be an internal issue below the query layer. You should be able to ask for a book’s author without caring whether author is single or multiple or optional, and without knowing or caring how many “tables” are involved in answering your query.
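    In today’s SQL, a view is one partial way to push the join below the query layer; the book/author tables here are only an illustration, not a claim about any particular system:

    ```sql
    -- Hide the joins behind a view; the application asks for the author
    -- without knowing how many tables are involved or whether author is
    -- single, multiple, or optional. (All names invented.)
    CREATE VIEW book_with_authors AS
    SELECT b.book_id, b.title, a.name AS author
    FROM book b
    LEFT JOIN book_author ba ON ba.book_id = b.book_id
    LEFT JOIN author a       ON a.author_id = ba.author_id;

    SELECT title, author FROM book_with_authors WHERE book_id = 7;
    ```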

    Or, for my longer attempts at attacking this topic myself, see

    http://www.furia.com/page.cgi?type=log&id=305
    http://www.furia.com/page.cgi?type=log&id=311

  8. Gary says:

    Taking your example one step further, what if you want to add an email address, and Jane Wright has three email addresses? Do you have three rows (one with a blank telephone number?) or six rows, one for each combination of telephone number and email address? The latter is effectively how you’ve dealt with two telephone numbers.
    What about other attributes? Perhaps Jane Wright has made 10 orders, each with 3 items. Do you have those six email/telephone combinations against all 30 order lines?
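    To put rough numbers on it, with an invented flat table:

    ```sql
    -- Invented flat table with one row per (telephone, email, order line)
    -- combination: 2 phones x 3 emails x 30 order lines = 180 rows, all
    -- repeating the same facts about one person.
    SELECT COUNT(*)
    FROM customer_flat
    WHERE name = 'Jane Wright';   -- 180
    ```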

  9. @Gary

    You might have to add lots of rows, but remember that I am specifically saying that these don’t need to be physical rows. They are merely logical. So you don’t necessarily have more data to store than if the database was normalized. (And I’ll admit that I don’t know of a database engine that can do this. So my point is a bit theoretical.)
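    That said, one way to picture the separation today (table names invented) is an ordinary view: the flat rows below are purely logical, while the stored data stays normalized, although the developer still has to design the underlying tables by hand.

    ```sql
    -- Normalized storage...
    CREATE TABLE customer  (customerid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE telephone (customerid INTEGER, number TEXT);

    -- ...with a flat logical presentation: the "extra" rows are never stored.
    CREATE VIEW customer_flat AS
    SELECT c.customerid, c.name, t.number AS telephone
    FROM customer c
    LEFT JOIN telephone t ON t.customerid = c.customerid;
    ```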

  10. @Gary

    Oh! If you are not concerned about storage, then it is a different question.

    I am not saying your web application should run out of a single table. I am saying however that it shouldn’t use 15 tables if five will do. The sweet spot is not necessarily with the tables in third normal form. The normal forms are a dogma. (Knowledge without evidence.)

  11. Gary says:

    I never suggested they would have to be physical (and I suspect the compression mechanism in Oracle, which works on repeated values, would minimise storage space). That isn’t the point. The point is that the code logic needs to deal with those rows. You need to make deliveries of those orders, so your code needs to work out whether you are delivering one item that happens to relate to two phone numbers or two items that have different phone number attributes.

    That is the problem normalisation addresses. Normalisation is NOTHING to do with physical storage. It is to do with identifying dependencies between data values.
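    The dependency in question is customerid determining name; in a flat table nothing enforces it, and a query like this one (names invented) is needed just to find the rows where it has been broken:

    ```sql
    -- Nothing in a flat table enforces customerid -> name; this query
    -- (invented names) hunts for customers that ended up with two names.
    SELECT customerid
    FROM customer_flat
    GROUP BY customerid
    HAVING COUNT(DISTINCT name) > 1;
    ```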

  12. Alex says:

    Daniel,
    I absolutely agree with your post.
    Maybe there is a trade-off in certain applications, but not for the web – and this is what is driving NoSQL.
    As soon as a web application starts receiving decent traffic – hundreds of users per minute, or per second – properly normalized tables become a very heavy liability, and very quickly they have to be denormalised into caching tables: one common table for the view output, with the denormalised data then pushed into a cache like memcached.
    The pure SQL approach just doesn’t work in the real-world web – you have to add features, you have to change the DB schema as you add them, and users can think of more ways to exploit your website than a fixed database schema anticipates.

    I once worked for a company that had an amazingly, beautifully hyper-normalized DB for storing e-mail on an embedded device. Efficiency / storage was one big concern. But the tradeoff was SQL queries that were *impossible to maintain*, let alone properly understand or debug. Which made the UI to the DB really nasty to work with (and often slow too!)

    So I agree with you – Normal forms are a dogma and ultimately an expensive one too (in situations where they are not really needed).

    I’m using these two blog posts in my next DB class!

  14. @Andre

    Thanks for the testimonial.

  15. tyler says:

    I have a question. I’m loving the idea of ditching normalization, but I’m not quite sure how to solve one problem. I can picture a site with profile pages and forums all being denormalized. But what if you don’t use their real name but a username… and they want to change their username? There would be thousands of records with the username hardcoded into the db. Or let’s say you don’t allow the user to change their username, but there is an avatar_url field, and they change their avatar?

  16. @tyler

    (1) My claim is not that normalization is bad. Certainly master-detail schemas are useful and a good practice. I oppose “over-normalization” which leads to overly complicated schemas.

    (2) Do you actually want the name to be updated in old posts? What about the reported home page and email addresses? The answer is maybe not so obvious. If you send an email today, and change your name tomorrow, email systems won’t go back and update old emails.

    (3) It can become quite wasteful to join the tables each and every time just for the rare event where someone might change his name. From an engineering point of view, it might make a lot more sense to take a big hit when the user does change his name, possibly doing it in an eventually consistent manner (for a time, both names may appear), than paying a price all the time for a feature that most users may never use. Which is better for your application is a matter of trade-off. You should maybe compare the two options experimentally.
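    As a rough sketch (the table and column names are made up), the “big hit” option is a rare bulk update rather than a join on every read:

    ```sql
    -- Posts carry a denormalized username; renames are rare, so we pay once,
    -- at rename time, instead of joining on every read. (Names are made up.)
    UPDATE post
       SET username = 'new_name'
     WHERE user_id = 42;
    -- While this runs, old and new names can coexist: eventual consistency.
    ```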

  17. Ethan says:

    To me, what your comments above (such as: “I do ask however whether this possibility you evoke—that the names will go out of sync whereas they shouldn’t—is a real threat”) suggest is that you’ve never been responsible for anything where you have to provide real data to real people with real results behind them.

    Web apps have an advantage in this regard: they often don’t matter. The company I work for does data processing and manipulation for law firms, and we provide a hosted data review platform for their convenience. It is a web app, but the data from it is provided, in the end, to courts and counsel.

    So it matters greatly that what comes out relates to what goes in in a deterministic fashion. When a client decides to use redundant tags on objects, the result becomes easily apparent at production time, when it becomes a web to untangle to determine what is REALLY relevant/privileged/redacted/whatever.

    Making sure that your information is normalized is ESSENTIAL if you are managing any data of actual consequence. While over-normalization is possible to some degree, it is really an esoteric idea suited only to highly performance-driven situations (and much of the time this can be fixed either by calling functions built into the database engine itself rather than reimplementing them in the application code… or by properly optimizing the application’s database calls and caching the appropriate info… this all assumes that indexes are assigned right in the first place, of course.)

    UNDER-normalization is a far, far more common situation, and one with real consequences.