@Bannister
I think I allude to part of the question you thought I’d be asking. I work in data warehousing, where we don’t even talk about normal forms because the typical usage calls for modest normalization, or none at all. But I decided to be more drastic in my post.
@Justin
(1) My example is a typical textbook example. Assuming that a user has one and only one last name throughout his life is typical. There are many more subtle limitations enforced by database schemas. Anyone who has had to use enterprise systems knows how inflexible they can be, if not unusable.
(2) I’m well aware of the difference between transactions and warehousing. I would even say that I’m acutely aware of the difference. But that’s not so relevant to my main claims. Data warehouses almost all rely on normalization extensively. I’m saying they shouldn’t. I’m saying we could build transaction systems with very few joins.
(3) I am outside the dogma here. If you think I am wrong, I’m interested in reading your arguments. However, I see no need for “an academic should know better”. Academic researchers are wrong all the time, at least those who are any good. Criticizing the dogma is precisely my job, as I see it. The guy who runs the bank database can’t afford to take risks and be wrong… I can afford to be dead wrong… as long as I’m wrong in an interesting way. And this blog is precisely where I choose to test my craziest ideas. If you want more conservative, try my research papers… they are peer reviewed…
From the title, at first I thought you were going to ask a different question.
As to the SQL or no-SQL noise, I just cannot see any reason to get excited. Choose the database to fit the problem. By “database” I mean anything from a flat text file, through small and large SQL databases, or whatever else makes sense.
Follow the patterns in your application. Doing lots of complex queries, with more in the future? SQL should be a good bet. Doing simple queries in huge volume? A more specialized database may be a better fit.
What I thought you might be asking – what falls between – is what to do when you have both a high-volume special-purpose database, and an SQL database. Or when you have two SQL databases with different implementations (and differing performance characteristics). Joins between databases of differing nature are … not well treated?
Since we are visiting the topic of choosing differing sorts of databases, we should also be interested in clearly defined semantics and behavior when those differing databases need to be accessed to satisfy a query.
To your arguments about normalization, I completely agree.
When using specialized databases, at some point you will encounter the “join” problem. The main bulk of the application may be best suited to a specialized sort of database, but there are likely some tasks best suited to an SQL database. Of course, you can write custom code to glue the two together, but this costs a lot of developer time, and can work poorly if the algorithms are poorly chosen. If use of specialized databases is to become common, then common use needs to support direct “joins” between dissimilar databases.
For long term adoption of specialized databases, this problem is the elephant-in-the-room that needs to be addressed – and is not much discussed.
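For concreteness, here is a minimal sketch of the kind of glue code described above, with an in-memory dict standing in for the specialized high-volume store and SQLite standing in for the SQL side; all names are invented for illustration, and a real system would be talking to two separate services.

    import sqlite3

    # Hypothetical high-volume store: a dict of user_id -> click count,
    # standing in for a specialized key-value database.
    clicks = {"u1": 4203, "u2": 17, "u3": 950}

    # SQL side: user profiles live in a relational table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)",
                   [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")])

    # The "join" between the two databases happens in application code:
    # pull the keys from one store and feed them to the other as parameters.
    ids = list(clicks.keys())
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT user_id, name FROM users WHERE user_id IN ({placeholders})", ids)
    for user_id, name in rows:
        print(name, clicks[user_id])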
Justin says:
Surely you can’t apply the design techniques for data warehousing in typical transaction-oriented systems?!
In addition, there is no DBA worth his title who would not de-normalize the schema as needed for the application.
I’m really surprised at the use of the strawman example as well… an academic should know better.
In a relational database, you can handle the multi-author case without normalization by repeating the entries. Thus, there would be several entries with the same “document ID”. It looks wasteful, and would certainly be in most current database systems. Nevertheless, a sufficiently smart database engine could make it efficient, in theory. (Whether it is possible in practice, I do not know.)
This being said, it is trivial in a document-oriented database to avoid joins, and this was the topic of Curt Monash’s post. I would argue, in fact, that for bibliographic data, the relational model is a poor match because the data is too often irregular. (That is, it can be made to work, but it will introduce unnecessary complexity for the programmer.)
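For concreteness, a minimal sketch of the “repeat the entries” approach, with invented table and column names: a two-author paper is simply two rows sharing the same doc_id, and no join is needed.

    import sqlite3

    db = sqlite3.connect(":memory:")
    # One wide table; a two-author paper appears as two rows with the same doc_id.
    db.execute("CREATE TABLE papers (doc_id TEXT, title TEXT, author TEXT)")
    db.executemany("INSERT INTO papers VALUES (?, ?, ?)", [
        ("doc1", "A Paper on Joins", "Jane Smith"),
        ("doc1", "A Paper on Joins", "Tom Smith"),   # same doc_id, second author
        ("doc2", "Another Paper",    "Jane Smith"),
    ])

    # All papers by an author: a plain scan, no join needed.
    for row in db.execute("SELECT doc_id, title FROM papers WHERE author = ?",
                          ("Jane Smith",)):
        print(row)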
Mike Martin says:
Denormalization is just as important as normalization. It’s a spectrum. Normalize a few tables to reduce redundant data and create faster writes. Denormalize a few tables to speed up reporting or database reads (because of fewer joins), although at the expense of having more redundant data. Normalization does improve database integrity. Denormalization means the application needs to manage the integrity.
Many an Indian programmer I’ve run across has only been taught to normalize the heck out of everything, and has never been taught about the benefits of denormalization.
So, if you have a call tracking system where call takers are entering order records very fast, then normalize those tables a bit and you’ll improve the write speed. And if you have a reporting system, or those call takers need to see summary information really fast on multiple accounts, then pull that from one or more denormalized tables.
I find it best to put the order entry tables in their own database in order to avoid abuse by other programmers who might want to denormalize them. As well, put the reporting system in another database in order to avoid abuse by programmers who might want to normalize it. And then use batch replication to move records between the two.
But in many companies, you really need 3 databases for a major system workflow. One is normalized fairly well and is for storing a queue of new records being entered in. But these are unqualified records — raw data. Once these records are cleaned up and merged with existing data (such as lookups that tack on extra columns or do code swaps for longer phrases) — that sort of thing needs to go into a production database that is slightly normalized, but not nearly as normalized as the order entry system. It is the production database that is the definitive database that gets the company stamp of approval for having the final say on a customer’s latest approved data. But then you need a reporting database where heavy reads and number crunching can occur for all kinds of scenarios. And this is where a database needs to be created just for that, with a fair amount of denormalization.
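As a rough illustration of that split (table names are invented, and a real setup would use separate databases plus replication rather than one in-memory SQLite connection): a normalized write side, a flat denormalized summary table for reads, and a batch step that rebuilds the latter from the former.

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Write side: narrow, normalized-ish tables that are cheap to insert into.
    db.execute("CREATE TABLE calls (call_id INTEGER PRIMARY KEY, account_id INTEGER, minutes REAL)")
    db.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT)")

    # Read side: one flat, denormalized summary table for the call takers' screens.
    db.execute("""CREATE TABLE account_summary (
                    account_id INTEGER PRIMARY KEY,
                    name TEXT, call_count INTEGER, total_minutes REAL)""")

    db.execute("INSERT INTO accounts VALUES (1, 'Acme')")
    db.executemany("INSERT INTO calls (account_id, minutes) VALUES (?, ?)",
                   [(1, 3.5), (1, 12.0)])

    # The 'batch replication' step: rebuild the summary from the normalized side.
    db.execute("DELETE FROM account_summary")
    db.execute("""INSERT INTO account_summary
                  SELECT a.account_id, a.name, COUNT(c.call_id), SUM(c.minutes)
                  FROM accounts a JOIN calls c ON c.account_id = a.account_id
                  GROUP BY a.account_id, a.name""")
    print(db.execute("SELECT * FROM account_summary").fetchall())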
Now, if another field could have multiple values (paper awards?) you’d start to face combinatorial duplication and would have to normalize eventually.
Who said that the physical layout needed to match the logical layout? Why can’t the database automatically do the normalization if it needs to?
(Again, I stress that I don’t know whether databases could actually remain efficient and do the equivalent of the normalization on their own.)
you have to develop an aversion for duplicated data
I don’t think that it needs to be generally true. On a Mac, it is common to bundle the software libraries with the application. On Windows and Linux, it is often considered wasteful.
But is it wasteful, really? Yes. Mac applications use more memory and disk space. But they are much less likely to break down. They are much easier to install and uninstall. And testing a Mac application is much easier because you don’t have to fear that the user might have a slightly different version of the software library you rely on.
In the DB example I might fix a typo in just one version of the title.
It is a trade-off between agility and safety.
But remember that prior to the Web, people working on hypertext systems were awfully worried about broken links. People *thought* that high reliability mattered.
In practice, it seems that we often overestimate the problems caused by small failures, and we underestimate the benefits of agility.
To some extent I agree: it’s usually not the case that normalizing for the sake of normalizing is necessary. Personally, I tend to keep everything together until there’s a significant reason not to. The caveat being my DB experience is in the context of data manipulating applications, and thus not necessarily applicable to more common use cases.
However, the example is easy, if crude, to fix: create an alias table that says lampron01 and smith02 are the same person, and use that whenever you’re pulling articles by an author. More joins, longer queries, but it works. The counter example, though, is dealing with the first multi-author paper that arrives. In a slightly more normalized schema you just connect the doc_id to multiple author_id’s. In a denormalized schema you’re stuck with, what, delimiters? Now you’ve got to use wildcard matches to look up an author, and querying by author count requires more work.
Of course, my terminology may be off: does that still constitute normalization, or is a many-to-many relationship just assumed to be inherently multi-tabular? In any case, it’s one more example of how important it is to a) be able to easily change your schema or b) correctly predict the crazy data you’ll eventually have to process.
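A rough sketch of the alias-table fix described above, with invented identifiers; the alias table resolves every known ID for a person to one canonical ID, at the cost of an extra join.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE papers  (doc_id TEXT, author_id TEXT, title TEXT);
    -- The alias table: lampron01 and smith02 are declared to be the same person.
    CREATE TABLE aliases (author_id TEXT, canonical_id TEXT);

    INSERT INTO papers  VALUES ('d1', 'lampron01', 'Early Paper');
    INSERT INTO papers  VALUES ('d2', 'smith02',   'Later Paper');
    INSERT INTO aliases VALUES ('lampron01', 'smith02');
    INSERT INTO aliases VALUES ('smith02',   'smith02');
    """)

    # Pull every paper by the person now known as smith02, via the alias table.
    rows = db.execute("""
        SELECT p.doc_id, p.title
        FROM papers p JOIN aliases a ON p.author_id = a.author_id
        WHERE a.canonical_id = 'smith02'""")
    print(rows.fetchall())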
In a slightly more normalized schema you just connect the doc_id to multiple author_id’s. In a denormalized schema you’re stuck with, what, delimiters?
Hmm, no. In a denormalized schema I just add the second author with the author’s UUID, and I still have the UUID of the doc. Having denormalized data in my schema just means I have redundant copies and groupings that make sense for read and application performance.
Are we talking about normalization, or (…) storing repeated column values once?
Storing repeated values once (or as few times as needed) is precisely what normalization is. We have come to view it strictly in terms of relational algebra, but the idea predates relational algebra. That’s why I wrote that many database compression techniques are types of normalization.
Of course, normalization also does away with entire columns, so it is more than just storing column values once. You have to handle column dependencies as well.
I think that drove your point home for me. My reaction to your suggestion was basically fear. Storing a city name in the row, fine. But two mostly duplicate records??? Thinking it through, it’s really no more dangerous than any other setup; lots of the queries would even be identical. But working with computers, I think you have to develop an aversion for duplicated data, because in most cases they will fall out of sync. This is sort of an exception to that rule, where a healthy fear of ever storing the same constant in two places can be put aside.
Now, if another field could have multiple values (paper awards?) you’d start to face combinatorial duplication and would have to normalize eventually.
But as you say, bibliographic data is often irregular and not a good fit for highly structured schemas. Hmm…semi-structured data…I wonder what we could use instead? 🙂
Achilleas Margaritis says:
What if, instead of tables, only columns were used, and the database automatically eliminated duplicate entries for each column? Multicolumn tables would exist only for keeping relationships between columns of data, using the indexes of entries into the column. This would automate normalization and make joins redundant.
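A toy version of the idea, with each column stored as a deduplicated list of values and the “table” holding only indexes into the columns; this is pure illustration, not how any particular engine works.

    # Each column keeps every distinct value once; rows are tuples of indexes.
    class Column:
        def __init__(self):
            self.values = []          # distinct values, stored once
            self.index = {}           # value -> position in self.values

        def add(self, value):
            if value not in self.index:
                self.index[value] = len(self.values)
                self.values.append(value)
            return self.index[value]

        def get(self, i):
            return self.values[i]

    title, author = Column(), Column()
    rows = []                          # the "table": tuples of column indexes
    for t, a in [("A Paper on Joins", "Jane Smith"),
                 ("A Paper on Joins", "Tom Smith"),
                 ("Another Paper",    "Jane Smith")]:
        rows.append((title.add(t), author.add(a)))

    # "Jane Smith" and the repeated title are each stored exactly once.
    print(len(title.values), len(author.values))          # 2 2
    print([(title.get(t), author.get(a)) for t, a in rows])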
It’s an interesting suggestion. In essence you’d retain normalization when valuable, but handle the joins behind the scenes. Of course, the choice to normalize or not should be based on the workload of the DB, so you’d have to let it do load tracking and switch the schema if, e.g., a certain insert is becoming a bottleneck. Technically complicated, and a master DBA might do a better job, but for the 99% case it could let a developer run a DB well with a very cursory knowledge of the topic. I suppose this is the way software tools are headed.
re: duplicated data, while code is technically just data, I was referring more specifically to the data code operates on (like what you’d store in a DB). When each application has its own address book, I might update a friend’s email address in one place but not another. In the DB example I might fix a typo in just one version of the title. I am also somewhat hesitant to duplicate code, but only within an application boundary: I agree that shipping the libraries you need makes life much easier.
Are we talking about normalization, or individual columns and storing repeated column values once?
(Though you might be able to derive something like normalization … with a few hints from both the compression algorithm and from the designer.)
Come to think of it … one of the algorithms I used for very high performance table extraction might be useful. Somewhat out of scope for this discussion.
(Have to think about this a bit.)
jam says:
The problem here isn’t normalization, it’s a misunderstood requirement. You need to store the author name printed on the book, and that belongs with the book table. That’s a different piece of data than the author’s current name, which in a normalized database you would store in a different table.
This would help make clear what to do when the author’s name changes: update it in the authors table, and leave the book table alone.
Without the authors table, you have to either update the author’s old books, or you can’t store the new name. Or you could query for the name associated with that authorid’s latest book, but that’s so much more of a pain than just looking it up in a table keyed by authorid.
(I would say that normalization is storing each semantically distinct piece of information once. Storing the value “Tom Smith” more than once is fine, as long as they refer to different things.)
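A minimal sketch of that design, with invented table names: the name as printed stays with the book, while the author’s current name lives in its own table and can change without touching old books.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    -- The name as printed on the paper lives with the book and never changes.
    CREATE TABLE books   (book_id TEXT, author_id TEXT, printed_name TEXT, title TEXT);
    -- The author's current name lives in its own table and can be updated freely.
    CREATE TABLE authors (author_id TEXT PRIMARY KEY, current_name TEXT);

    INSERT INTO authors VALUES ('a1', 'Jane Lampron');
    INSERT INTO books   VALUES ('b1', 'a1', 'Jane Lampron', 'Early Work');
    -- The author changes her name: one update, old books untouched.
    UPDATE authors SET current_name = 'Jane Smith' WHERE author_id = 'a1';
    """)

    print(db.execute("""SELECT b.title, b.printed_name, a.current_name
                        FROM books b JOIN authors a USING (author_id)""").fetchall())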
1) I do agree NoSQL could be renamed NoJoin, as I think NoSQL databases are very close to being disguised object databases, as I wrote here:
http://www.jroller.com/dmdevito/entry/thinking_about_nosql_databases_classification
2) You wrote: “Yet, we never retroactively change the names of the authors on a paper.” That’s quite a great observation.
It leads me to think about critical IT systems. For quite a number of such systems I have worked on, data are never updated, for traceability purposes; instead, new values are added “on top of the stack”. So, it sounds like “we never retroactively change past data [like the names of the authors on a paper, as you wrote]”. That’s quite an interesting argument in favor of NoSQL databases, as RDBMS promoters say normalization is better, e.g. for update purposes. So, if there are fewer updates than expected, then there is less need for normalization (I am not saying here that normalization is completely useless).
3) I like the idea of using a blog for testing one’s craziest ideas, as I do the same.
rxin says:
This comment would fit better under your previous post, but you’ve disabled commenting there. So here we go.
First, “NoSQL” is a terrible name. NoSQL covers a wide spectrum, from distributed key-value stores to graph databases. For our discussion here, let’s limit the domain to the NoSQL you were referring to in your NoJoin post.
Now my main point: NoSQL is about loosening consistency (ACID) rather than about NoJoin or No*SQL*.
RDBMSes were developed from the 70s to the 90s, originally for enterprise data management. There are two nice properties of traditional RDBMSes: ACID and data independence. ACID ensures transactions are processed reliably (e.g. you don’t want to mess with financial data). Data independence enables the developer to specify what data they want, rather than how the data are read or written; the optimizer chooses the execution plan. This (independence) is a big deal because the application can remain unchanged even though the underlying software/hardware system undergoes significant updates.
These two (ACID and independence), however, are orthogonal.
Traditional RDBMSes are based on the closed-world assumption, where the number of users and concurrent transactions is limited. Even for the largest enterprise, its database can probably run on a single super powerful node. ACID is easy when the number of nodes is small. Developers don’t get a choice. ACID is strictly enforced.
In the last 10–15 years, databases have increasingly been used for the web (open-world). For many web applications, it is not uncommon to have thousands of nodes. To guarantee availability and balance load, data replication (both within a datacenter and across datacenters) is widely applied. In these cases, it is simply too expensive, if not impossible, to enforce ACID (CAP theorem). It is also OK to loosen the consistency constraint thanks to the nature of the applications (e.g. I don’t mind xxx not being my friend on Facebook for 20 seconds).
Consistency aside, nothing prevents us from implementing a declarative query language and an optimizer for NoSQL databases. The query language and optimizer can certainly support joins. There is actually an active research project at Berkeley on building a SQL-like language and an optimizer on a distributed key-value store with a predictable performance upper bound.
The various NoSQL databases (MongoDB, Riak, etc) are really about loosening the C in CAP, not about joins or SQL itself.
I’m not arguing against relational databases. I am arguing against complex schemas.
Steven Huwig says:
Your strawman data modeler suggesting “normalization” of the example table is not familiar enough with the domain being modeled. It is a database of research papers, not a customer list. Research papers have named authors that they are published under, and that is what will always appear in a record of the research paper.
Managing author name changes is part of the practice called “authority control,” and you wouldn’t just update an author’s name. You would reference another record containing the accepted version of the author’s name. For example, the Library of Congress has an entry for Samuel Clemens at http://goo.gl/hdm8F, but it just points to Mark Twain.
Since computerized bibliographic standards practices predate the relational model of data, the state of the art does not use data normalization. However, there is no reason that the relational model of data could not be used for this purpose, given sufficient expertise and investment in developing an appropriate schema.
Your example is more an argument against domain ignorance than against relational databases.
Steven Huwig says:
If so, it seems to me that you’ve successfully argued against complex schemas developed without the proper domain knowledge. But I am not sure that the argument against developing from ignorance is particularly novel…
In the example you have given, your model of the domain allows for surnames to vary for the same authorID and this suggests that there is no functional dependency between authorID and author name. The problem is that your database expert seems unaware of this and has made an assumption that turns out to be invalid.
The table is currently in at least 3NF, but would lose it the moment another row is added that contains the same combination of authorID and author name. In order to solve this, you’d have to normalise it somehow, but doing that without understanding the domain would certainly require a further adjustment at a later time.
I have been working on ‘enterprise’ systems with ‘enterprise’ SQL databases for 18 years, and the problem I have encountered most often is that SQL databases are unnormalised (as distinct from ‘denormalised’). In addition, ‘keys’ are frequently unstable, relationships between tables are broken, etc. In other words, people don’t know how to manage their data.
This problem would be the case whether they used SQL databases, relational databases, object databases, or other non-relational forms of database. It would also be the case regardless of whether they managed the integrity of their data in the database or in the application.
It is this lack of comprehension that makes the database complex, not the fact of normalisation. Just as a lack of understanding about how to program makes for complex and usually buggy code.
Rommel says:
I agree with the previous comments that point out that this is not a problem with normalization but with misunderstanding the problem domain.
I would like to add, though, that normalization is mischaracterized here (and often in other sources as well) as being about eliminating redundancy. Normalization tends to reduce redundancy, but that is not the point of it; otherwise it’s just data compression, as alluded to here. See http://en.wikipedia.org/wiki/Database_normalization.
The primary goal of normalization is to eliminate update anomalies, and its focus is not on redundancy per se but on functional dependencies.
This reminds me of a newbie mistake of creating a string table (ID, String), a number table (ID, Number), etc. then creating a person table with (PersonID, FirstNameID, LastNameID, AgeID) that links to those “data type” tables arguing “what if I store 2 persons with the same last name? or with the same age? those are redundancies that I need to eliminate in pursuit of normalization”
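To make “update anomaly” concrete, a tiny illustration of the failure mode normalization is meant to rule out; all names are invented.

    import sqlite3

    db = sqlite3.connect(":memory:")
    # Denormalized: the author's name is repeated on every paper row.
    db.execute("CREATE TABLE papers (doc_id TEXT, author_id TEXT, author_name TEXT)")
    db.executemany("INSERT INTO papers VALUES (?, ?, ?)",
                   [("d1", "a1", "Jane Lampron"), ("d2", "a1", "Jane Lampron")])

    # An update aimed at one paper instead of the author leaves the other row behind...
    db.execute("UPDATE papers SET author_name = 'Jane Smith' WHERE doc_id = 'd1'")

    # ...and the same author_id now carries two different names: an update anomaly.
    print(db.execute("SELECT DISTINCT author_id, author_name FROM papers").fetchall())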
I’d like to take issue with your example, although I’ll leave the debate on your thesis to others because, frankly, even with 20+ years of database experience I still don’t feel qualified to know for sure what the right answer is.
But in the case of “author” you are off base. One of the key reasons to normalize author (with an author_id that has no meaning other than to be a key) is to establish *identity* for the author record. Certainly the author may change their name, but that doesn’t change who they are. Without normalization you can too easily lose track of that identity. Have a system where knowing that identity is not important to your business case? Go ahead and denormalize.
Let me give another use case that illustrates this even better: invoices and customers. If a customer moves, we still want our invoices to show the old address that we actually shipped to. So by your argument I would store all my customer information in each of my invoices, in a denormalized manner. But without normalization, when my customer changes their address or their business name, I’ve lost track of their identity.
Instead we should store in the invoice that which we want to record about the customer at a point in time, i.e. their shipping address. But we want to know their *current* billing address even if it has changed, because we don’t want to lose track of who owes us money when they move! If we don’t have a normalized customer record we really can’t.
So while denormalization can have benefits, it’s not a panacea. Or as a wise man once said: “Moderation in all things.”
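A minimal sketch of that invoice/customer split, with invented columns: the invoice keeps a point-in-time snapshot of the shipping address, while identity and the current billing address stay in the normalized customer record.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            name TEXT, billing_address TEXT);
    CREATE TABLE invoices  (invoice_id INTEGER PRIMARY KEY,
                            customer_id INTEGER,      -- identity: survives moves and renames
                            ship_to_address TEXT,     -- point-in-time snapshot
                            amount_due REAL);

    INSERT INTO customers VALUES (1, 'Acme Ltd', '1 Old Street');
    INSERT INTO invoices  VALUES (10, 1, '1 Old Street', 500.0);

    -- The customer moves; the invoice keeps the address we actually shipped to,
    -- but billing still follows the customer record.
    UPDATE customers SET billing_address = '9 New Road' WHERE customer_id = 1;
    """)

    print(db.execute("""SELECT i.invoice_id, i.ship_to_address, c.billing_address
                        FROM invoices i JOIN customers c USING (customer_id)""").fetchall())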
The primary goal of normalization is to eliminate update anomalies
It is the first reason listed in my post for normalization: “To simplify the programming of the updates.”
The problem is that nobody can give me any sort of documentation as to how often databases are corrupted by update anomalies caused by a lack of normalization. These would have to come through programming bugs, because you don’t need normalization to do the updates right. It might just be more difficult. But it is only significantly more difficult if the schema is complicated. And why is the schema complicated in the first place?
Normalization is an attempt to address these problems “by design” such that the system is not reliant on correct coding at all points in time.
Right, but there is a corresponding cost. This cost is inflexibility. The enterprise databases I have seen are such that it would be nearly unthinkable to change the schema. You’d be better off starting from scratch. And it would cost millions and a couple of years. That’s fine if your business is slow-moving.
Here’s a real example. I work for a school (a Ph.D.-granting university) where someone assumed that students came in three times a year. Now, this assumption is seriously challenged. A lot of classes can now start and end at arbitrary times during the year. That’s true when you offer online classes to professionals. This requires a schema change. I’ve spent a lot of time with the people who could make this change possible… and honestly, it is not feasible. So, they are going to have to hack it. They create new tables that map out to the old tables, and they have scripts, and it is all very ugly and costly. And, frankly, it is probably going to be buggy as hell for all time. It would be much better if we could alter the old schemas. But doing so is just too complicated: we would need to rewrite all of the software.
Where does the problem come from? It comes from the fact that the DBAs hard-coded their assumptions in the schemas, and then the developers had to embed these assumptions in their software. The assumptions have been mostly correct for 20 years, and then they break. But when they do break, we have no upgrade path.
The sad thing is that it is not that much data. Much of the database could fit on a USB key. And it is not all that complicated either. Student A takes class B. It is not the end of the world. Yet, it ended up using a couple of hundred neatly normalized tables. The job is well done. But…
My conjecture is that this is unnecessary pain. We could build database engines in 2010 that would make such updates much less painful. And maybe DBAs should learn the value of “hiding” the assumptions they are making. Are you really sure you know the domain now and for all time?
Now joins are not exactly required only by normalization either.
No. See also the comments of Bannister who points out that joins are not even necessarily between tables within the same database.
@Daniel
I have encountered a lot of databases in my career that have been subject to update anomalies and I have seen or been part of more than one “data cleanup” project to address these problems after the fact.
You can argue that the problem there was that the original programmers were lousy, but it’s almost always because the database designer was the one who was lousy.
Normalization is an attempt to address these problems “by design” such that the system is not reliant on correct coding at all points in time. It’s better to get it right the first time, so that it’s solved from that point onwards, than to rely on the business rules being enforced throughout different systems or by different programmers over time.
Now, joins are not exactly required only by normalization either. Anytime you only store a reference to something rather than the entire thing (because the entire thing is maintained by someone else), joins will be involved (storing just the URL and not the entire web page, for example). If you vertically partition a table (to simplify the schema or to improve performance), you will probably be doing some joins to bring the tables back together. These have nothing to do with normalization.
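A small sketch of vertical partitioning producing a join with no normalization involved; table names are invented, and the split here is purely to keep the hot columns narrow.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    -- Vertical partitioning: one logical 'pages' row split across two tables,
    -- not because of normalization but to keep the frequently scanned columns small.
    CREATE TABLE page_meta (url TEXT PRIMARY KEY, fetched_at TEXT, status INTEGER);
    CREATE TABLE page_body (url TEXT PRIMARY KEY, body TEXT);

    INSERT INTO page_meta VALUES ('http://example.com', '2010-12-01', 200);
    INSERT INTO page_body VALUES ('http://example.com', '<html>...</html>');
    """)

    # Reassembling the logical row is a join, yet nothing here was "normalized".
    row = db.execute("""SELECT m.url, m.status, b.body
                        FROM page_meta m JOIN page_body b USING (url)""").fetchone()
    print(row)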
Mike S says:
One of the nice things about databases is that they impose structure on your data. (Assuming that it’s more complicated than can fit into a single row.) Many years ago, when I worked at one of the major energy trading companies (which is no longer with us in that form), we had a trading system that stored data in (XML-like) strings. That aspect of the system was widely regarded as a major failure. Nothing else could read their data, and they ended up supporting 3(!) databases for communicating with external systems. It made the Daily WTF several years ago.
Heh. Along the lines of the last, I tend to like the approach of serializing a bunch of attributes – those that I do not expect to participate as selection criteria in queries – into a single string for storage into the database. Allows the set of attributes to vary without needing changes to the schema, and keeps the schema simple. You could use XML for the bag-of-attributes, but I am more likely to use JSON.
I am sure Codd would disapprove.
(You could possibly use stored procedures to allow serialized attributes to participate in queries, but I’d be wary of that approach.)
If you have ever had to write software that maintained both upwards and downwards compatibility, then I hope the motivation for the above is familiar.
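A minimal sketch of that serialized bag-of-attributes approach, assuming JSON rather than XML for the serialized part; the column names are invented.

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    # Only the columns used as selection criteria get real columns;
    # everything else rides along in one JSON string.
    db.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, kind TEXT, attrs TEXT)")

    attrs = {"browser": "Firefox", "screen": "1280x800", "beta_flag": True}
    db.execute("INSERT INTO events (kind, attrs) VALUES (?, ?)",
               ("login", json.dumps(attrs)))

    # Queries filter on the real columns; the attribute bag is decoded afterwards.
    kind, raw = db.execute("SELECT kind, attrs FROM events WHERE kind = 'login'").fetchone()
    print(kind, json.loads(raw)["browser"])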
Been there, tried it and seen it fail in MUMPS, Lotus Notes and COBOL flat files.
The difficulty in schema-less databases (or any less rigidly structured database) comes five or ten years down the track, when your application still has to deal with documents stored in 2010, 2011, 2012, etc., with half a dozen different internal structures to those documents.
If you can version your documents, and (from a business point of view) purge old versions to reduce long-term support, then the model can work. It’s great for ephemeral stuff where the life of a document is hours, days or weeks.
The problem isn’t in modifying schemas or even modifying applications, it’s building a new business model that can still cope with the residue of long history of other business models.
There’s ultimately no difference between having a data structure fixed at the schema level and a data structure fixed within individual documents.
JKF says:
Objecting to complex schemas makes as much sense as objecting to complex problems. A schema is as complex as the problem that it is addressing.
You can’t buy simplicity by ignoring complexity, and when it comes to data, you cannot recover complex structure from a simplified projection. If that were possible, then your projection wouldn’t actually be simpler! On the other hand, when your rules change in a complex schema, there are correctness-preserving transforms on that data to get what you need.
In my experience a complex schema is far more flexible and safely transformed than complex code created to deal with data that has been poorly modeled.
My job used to be mostly performing enterprise data conversions and upgrades, moving data from one system to another. The hard work was always in making sense of poorly normalized data, but converting the entire database structure to new rules was easier than rewriting that horrible code in those older systems to get new functionality. The advice to move to less normalization as simplification is completely opposite to my experience with large systems. I’ve never regretted a constraint in a database, but believe me I’ve regretted the lack of one!
I’m intrigued by your comment that ‘databases impose structure’, as I think of the relational model as being all about removing structure (‘logic without meaning’ and all that stuff).
I’m not sure whether I’m not interpreting ‘structure’ as you intended though.
It might make sense to enforce discipline through a well defined database schema, with carefully thought-out constraints. If there are many (possibly less capable folk) writing applications against the database, this can be an excellent approach … but only works if you have the right guy responsible for the database. Not all DBAs are so capable. In my past exercises, the database is accessed indirectly via a service, and very few (carefully controlled) applications access the database. For my purpose, there is much less value in constraints or more complicated schemas. Discipline is maintained within the boundary of the service API.
This appears to be a very sensible analysis.
Mike S says:
@Lance
Done correctly, a database schema forces your data to be stored a certain way. The client column will always contain client information, detail rows will always have a good reference to a master record, and so on. Schemas might change, but the changes should almost always be additions, which don’t invalidate existing data. (This was really important in our case, since we had contracts which ran for 10+ years.)
If you store the data as binary blobs/strings/XML/what have you, then you’re relying on the client(s) to always get it right. Unfortunately, with code drift, it’s possible to get into a situation where your system can’t read old data, which is what happened in this case.
I wasn’t directly involved in the project (I worked on several systems which used it as a source), and several managers involved in the project told me that it was their biggest mistake.
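A tiny sketch of the additive kind of schema change described above, assuming SQLite syntax; old rows stay valid, and old readers can simply ignore the new column.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE contracts (contract_id INTEGER PRIMARY KEY, client TEXT)")
    db.execute("INSERT INTO contracts (client) VALUES ('Acme')")

    # An additive change: existing rows remain valid and simply get a NULL here.
    db.execute("ALTER TABLE contracts ADD COLUMN renewal_date TEXT")
    db.execute("UPDATE contracts SET renewal_date = '2020-01-01' WHERE contract_id = 1")

    print(db.execute("SELECT * FROM contracts").fetchall())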
“Schemas might change, but it should almost always be additions, which don’t invalidate existing data.”
This is the key point to maintaining upwards and downwards compatibility, but use the notion of schema in the greater sense, not just limited to what the SQL database knows of as a schema. If you maintain discipline, things work well. If you lose discipline, you end up with a rat’s nest.
In the end, the key is the folk working on the system, and having the right folk in the right roles.
It might make sense to enforce discipline through a well defined database schema, with carefully thought-out constraints. If there are many (possibly less capable folk) writing applications against the database, this can be an excellent approach … but only works if you have the right guy responsible for the database. Not all DBAs are so capable.
In my past exercises, the database is accessed indirectly via a service, and very few (carefully controlled) applications access the database. For my purpose, there is much less value in constraints or more complicated schemas. Discipline is maintained within the boundary of the service API.
As always, the most appropriate solution depends on your application.
Ah, I see what you mean. I agree with the notions as you present them. We are considering different levels of structure.
I’m not disagreeing with you now (and I have had some of the experiences you describe too – not pleasant). I’m just going to wax rhapsodic for a while…
As I see it, the relational model flattens the high level structure out, which looks like a removal of structure to me. But as you kind of note, you achieve this by structuring your data into tuples which collectively make relations. So at that level, it’s a highly structured form.
On the other hand, an XML document is a highly structured form too. The fact that you can put it into a string isn’t really relevant to how structured the data is.
Yes, you can put anything into a string. If we serialized an entire relational schema into a string, would that make it unstructured data? Seems like a confusion of container with content to me…
Your example of a research paper design is just an example of bad analysis and design; it has nothing to do with the pros and cons of relational databases or joins.
“Did the bunch of guys in IT knew about this?”
They should have.
Know your business.
If you don’t, chances are that you’ll do something bad whether it’s a relational database, nosql, wherever, whatever.
I agree with what you are saying, but I must nitpick.
The difference between storing data in a relational database and storing it as an XML document is that the DB makes guarantees about the structure of the data, and XML doesn’t.
Relax NG, XML Schema and DTDs can be used to validate XML documents. In fact, an XML schema can be far richer than any relational schema. Lack of good schema validation is not a weakness of XML.
And, honestly, if you must store documents (such as the content of blog posts) in a database, can you do better than XML?
As an example, Atom stores are beautiful. They will not break (because the Atom format will remain backward compatible).
(Yes, that’s not what you were talking about. That’s why I say I’m nitpicking.)
Mike S says:
@Lance
The difference between storing data in a relational database and storing it as an XML document is that the DB makes guarantees about the structure of the data, and XML doesn’t.
While using a (single) web service can do a lot to validate that data gets saved correctly, you could still have an issue where old data was serialized in a format that new code can’t understand. (That’s what happened in the case I’m thinking of. The middle tier couldn’t read data it had written several years earlier, when it was just going into production.)
I’m not saying the SQL databases are a silver bullet. One of my coworkers did a database migration where the meaning of some columns in the original DB had changed over time. Getting that cleaned up was a chore.
@Bannister
Absolutely. Stonebraker predicted the imminent rise of specialized engines. I think that he is a bit too eager.
2) you wrote: “Yet, we never retroactively change the names of the authors on a paper.” and that’s quite a great observation.
It leads me to think about critical IT systems. For quite a number of such systems I have worked on, data are never updated for tracability purposes; instead new values are added “on top of the stack”. So, it sounds like “we never retroactively change the past data [like the names of the authors on a paper, as you wrote]”. That’s quite an interesting argument in favor of NoSQL databases, as RDBMS promoters say normalization is better, e.g. for update purposes. So, if there are less updates than expected, then there is less need for normalization (I am not writing here normalization is fully useless).
3) I like the idea of using a blog for testing craziest ideas, as I do the same.
This comment is better to go in your previous post but you’ve disabled commenting in there. So here we go.
First, “NoSQL” is a terrible name. NoSQL can contain a wide spectrum, from distributed key value stores to graph databases. For our discussion here, let’s limit the domain to the NoSQL you were referring to in your NoJoin post.
Now my main point: NoSQL is about loosening consistency (ACID) rather than about NoJoin or No*SQL*.
RDBMSes were developed from the 70s to the 90s, originally for enterprise data management. There are two nice properties about traditional RDBMSes: ACID and data independence. ACID ensures transactions are processed reliably (e.g. you don’t want to mess with financial data). Data independence enables the developer to specify what data they want, rather than how the data are read/write, and the optimizer chooses the execution plan. This (independence) is a big deal because the application can remain unchanged even though the underlying software/hardware system undergoes significant updates.
These two (ACID and independence), however, are orthogonal.
Traditional RDBMSes are based on the closed-world assumption, where the number of users and concurrent transactions are limited. Even for the largest enterprise, its database can probably run on a single super powerful node. ACID is easy when the number of nodes are small. Developers don’t get a choice. ACID is strictly enforced.
In the last 10 – 15 years, databases are increasingly being used for the web (open-world). For most web applications, it is not uncommon to have thousands of nodes. To guarantee availability and balance load, data replication (both within datacenter and inter datacenter) is widely applied. In these cases, it is simply too expensive, if not impossible, to enforce ACID (CAP theorem). It is also OK to loosen the consistency constraint thanks to the nature of the applications (e.g. I don’t mind xxx not being my friend on Facebook for 20 seconds).
Consistency aside, nothing prevents us from implementing a declarative query language and an optimizer for the NoSQL databases. The query language and optimizer can for sure support joins. There is actually an active research project at Berkeley on building a SQL-like language and an optimizer on a distributed key-value store that has predictable performance upper-bound.
The various NoSQL databases (MongoDB, Riak, etc) are really about loosening the C in CAP, not about joins or SQL itself.
@Huwing
I’m not arguing against relational databases. I am arguing against complex schemas.
Your strawman data modeler suggesting “normalization” of the example table is not familiar enough with the domain being modeled. It is a database of research papers, not a customer list. Research papers have named authors that they are published under, and that is what will always appear in a record of the research paper.
Managing author name changes is part of the practice called “authority control,” and you wouldn’t just update an author’s name. You would reference another record containing the accepted version of the author’s name. For example, the Library of Congress has an entry for Samuel Clemens at http://goo.gl/hdm8F , but it just points to Mark Twain.
Since computerized bibliographic standards practices predate the relational model of data, the state of the art does not use data normalization. However, there is no reason that the relational model of data could not be used for this purpose, given sufficient expertise and investment in developing an appropriate schema.
Your example is more an argument against domain ignorance than against relational databases.
If so, it seems to me that you’ve successfully argued against complex schemas developed without the proper domain knowledge. But I am not sure that the argument against developing from ignorance is particularly novel…
In the example you have given, your model of the domain allows for surnames to vary for the same authorID and this suggests that there is no functional dependency between authorID and author name. The problem is that your database expert seems unaware of this and has made an assumption that turns out to be invalid.
The table is currently in at least 3NF, but would lose it the moment another row is added that contains the same combination of authorID and author name. In order to solve this, you’d have to normalise it somehow, but doing that without understanding the domain would certainly require a further adjustment at a later time.
I have been working on ‘enterprise’ systems with ‘enterprise’ SQL databases for 18 years and the problem I have encountered most often is that SQL databases are unnormalised (as distinct from ‘denormalised’). In addition, ‘keys’ are frequently unstable, relationships between tables are broken, etc. In other words, people don’t know how to manage with their data.
This problem would be the case whether they used SQL databases, relational databases, object databases, or other non-relational forms of database. It would also be the case regardless of whether they managed the integrity of their data in the database or in the application.
It is this lack of comprehension that makes the database complex, not the fact of normalisation. Just as a lack of understanding about how to program makes for complex and usually buggy code.
This is a nice subject and one that deserves a little survey. I asked a related question on Quora: http://www.quora.com/De-normalizing-views
I’m appalled by the sexist assumptions in this post 🙂
You don’t have to be female to change your name during your lifetime.
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/c/Cheong:Otfried.html
http://academic.research.microsoft.com/Author/154726.aspx
@Cheong
I specifically wrote: “And indeed, people have their names changed all the time.” (Not just ladies.)
Still, I was trying to make a joke about the fact that IT departments are often male-only.
Sorry if my post felt sexist to you. I should have been more careful.
I was joking, of course, hence the smiley.
DBLP is pretty good at this:
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/b/Bereg:Sergey.html
DBLP seems to have had the possibility of doing this right from the beginning, probably because it was created in a country where, until recently, nearly all women changed their family name when they got married.
Microsoft’s Academic Search did not have it (in China, names are constant throughout a lifetime); in fact, the feature was added recently to handle variant spellings of the same name. (And I had to ask them to merge my two author profiles.) Sergey still has two profiles there (which, amusingly, show the same photo).
In any case, our cases are so rare that I’m mentioned in this paper: http://www.springerlink.com/content/c6472216637p57w4/
On the other hand, many female researchers have published under more than one name during their career.
There are good arguments against databases. This is not one of them.
authorID should be a number internally assigned by the DB. To translate into Unix FS semantics, it should be an inode number. Having a DB key based on the data is a dumb idea. For instance, if it’s a number, it’s faster to index and look up than the variable-length strings you have in your example.
I agree with the previous comments that point out that this is not a problem with normalization but with misunderstanding the problem domain.
I would like to add, though, that normalization is mischaracterized here (and often in other sources as well) as being about eliminating redundancy. Normalization tends to reduce redundancy, but that is not the point of it; otherwise it is just data compression, as alluded to here. See http://en.wikipedia.org/wiki/Database_normalization.
The primary goal of normalization is to eliminate update anomalies, and its focus is not on redundancy per se but on functional dependencies.
This reminds me of the newbie mistake of creating a string table (ID, String), a number table (ID, Number), etc., and then creating a person table (PersonID, FirstNameID, LastNameID, AgeID) that links to those “data type” tables, arguing “what if I store two persons with the same last name? Or with the same age? Those are redundancies that I need to eliminate in pursuit of normalization.”
I’d like to take issue with your example, although I’ll leave the debate on your thesis to others because, frankly, even with 20+ years of database experience, I still don’t feel qualified to know the right answer for sure.
But in the case of “author” you are off base. One of the key reasons to normalize author (with an author_id that has no meaning other than to be a key) is to establish *identity* for the author record. Certainly the author may change their name, but that doesn’t change who they are. Without normalization you can too easily lose track of that identity. Have a system where knowing that identity is not important to your business case? Go ahead and denormalize.
Let me give another use-case that illustrates this even better: Invoices and Customers. If a customer moves we still want our invoices to show their old address that we actually shipped to. So by your argument I would store all my customer information in each of my invoices, in a denormalized manner. But without normalization, when my customer changes their address or their business name, now I’ve lost track of their identity.
Instead we should store in the invoice that which we want to record about the customer at a point in time, i.e. their shipping address. But we want to know their *current* billing address, even if it has changed, because we don’t want to lose track of who owes us money when they move! If we don’t have a normalized customer record, we really can’t.
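To make the invoice example concrete, here is a minimal sketch with Python's sqlite3 module; the table and column names are invented for illustration:

import sqlite3

# Hedged sketch: the invoice snapshots the shipping address at the time of
# sale, while the current billing address lives only on the customer record.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (
        customer_id     INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        billing_address TEXT NOT NULL      -- always the *current* address
    );
    CREATE TABLE invoice (
        invoice_id       INTEGER PRIMARY KEY,
        customer_id      INTEGER NOT NULL REFERENCES customer(customer_id),
        shipping_address TEXT NOT NULL,    -- copied at shipping time, never updated
        amount           REAL NOT NULL
    );
""")
con.execute("INSERT INTO customer VALUES (1, 'Acme Corp', '1 Old Road')")
con.execute("INSERT INTO invoice VALUES (100, 1, '1 Old Road', 250.0)")

# The customer moves: only the customer row changes; the invoice keeps history.
con.execute("UPDATE customer SET billing_address = '9 New Street' WHERE customer_id = 1")
for row in con.execute(
        "SELECT invoice_id, shipping_address, billing_address "
        "FROM invoice JOIN customer USING (customer_id)"):
    print(row)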
So while denormalization can have benefits, it’s not a panacea. Or as a wise man once said: “Moderation in all things.”
-Mike
@Schinkel
So by your argument I would store all my customer information in each of my invoices, in a denormalized manner.
No. What I’m saying is that should the developer choose to do so at a later date, he shouldn’t need to change the database schema.
@Rommel
The primary goal of normalization is to eliminate update anomalies
It is the first reason listed in my post for normalization: “To simplify the programming of the updates.”
The problem is that nobody can give me any documentation on how often databases are corrupted by update anomalies arising from a lack of normalization. These would have to come through programming bugs, because you don’t need normalization to do the updates right. It might just be more difficult. But it is only significantly more difficult if the schema is complicated. And why is the schema complicated in the first place?
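To make that claim concrete, here is a minimal sketch (invented table, Python's sqlite3 module): in a denormalized table, the "harder" update is still a single statement, as long as it is written against every affected row rather than just one.

import sqlite3

# Hedged sketch: a name change in a denormalized table touches several rows,
# but it is still one statement; the risk is a programmer forgetting to do so.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE paper (title TEXT, author_id TEXT, author_name TEXT)")
con.executemany("INSERT INTO paper VALUES (?, ?, ?)", [
    ("On Joins", "jdoe", "Jane Doe"),
    ("On Schemas", "jdoe", "Jane Doe"),
])
con.execute("UPDATE paper SET author_name = 'Jane Smith' WHERE author_id = 'jdoe'")
print(con.execute("SELECT * FROM paper").fetchall())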
@Rommel
Normalization is an attempt to address these problems “by design” such that the system is not reliant on correct coding at all points in time.
Right, but there is a corresponding cost. That cost is inflexibility. The enterprise databases I have seen are such that it would be nearly unthinkable to change the schema. You’d be better off starting from scratch. And it would cost millions and a couple of years. That’s fine if your business is slow-moving.
Here’s a real example. I work for a school (a Ph.D.-granting university) where someone assumed that students came in three times a year. Now, this assumption is seriously challenged. A lot of classes can now start and end at arbitrary times during the year; that’s what happens when you offer online classes to professionals. This requires a schema change. I’ve spent a lot of time with the people who could make this change possible… and honestly, it is not feasible. So they are going to have to hack it. They create new tables that map onto the old tables, they have scripts, and it is all very ugly and costly. And, frankly, it is probably going to be buggy as hell for all time. It would be much better if we could alter the old schemas, but doing so is just too complicated: we would need to rewrite all of the software.
Where does the problem come from? It comes from the fact that the DBAs hard-coded their assumptions in the schemas, and then the developers had to embed these assumptions in their software. The assumptions were mostly correct for 20 years, and then they broke. And when they broke, there was no upgrade path.
The sad thing is that it is not that much data. Much of it could fit on a USB key. And it is not all that complicated either: student A takes class B. It is not the end of the world. Yet it ended up using a couple of hundred neatly normalized tables. The job is well done. But…
My conjecture is that this is unnecessary pain. We could build database engines in 2010 that would make such updates much less painful. And maybe DBAs should learn the value of “hiding” the assumptions they are making. Are you really sure you know the domain, now and for all time?
Now joins are not exactly required only by normalization either.
No. See also the comments of Bannister, who points out that joins are not even necessarily between tables within the same database.
@Mike
See my post:
You probably misunderstand XML
http://lemire.me/blog/archives/2010/11/17/you-probably-misunderstand-xml/
@Daniel
I have encountered a lot of databases in my career that have been subject to update anomalies and I have seen or been part of more than one “data cleanup” project to address these problems after the fact.
You can argue that the problem there was that the original programmers were lousy, but it’s almost always because the database designer was the one who was lousy.
Normalization is an attempt to address these problems “by design” such that the system is not reliant on correct coding at all points in time. It’s better to get it right the first time, and then it’s solved from that point onwards, than to rely on the business rules being enforced throughout different systems or by different programmers over time.
Now joins are not exactly required only by normalization either. Anytime you store only a reference to something rather than the entire thing (because the entire thing is maintained by someone else), joins will be involved (storing just the URL and not the entire web page, for example). If you vertically partition a table (to simplify the schema or to improve performance), you will probably be doing some joins to bring the tables back together. These have nothing to do with normalization.
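For instance, here is a minimal sketch of a vertical partition with Python's sqlite3 module (the tables are invented for the example): there is no normalization issue involved, yet reading a full row requires a join.

import sqlite3

# Hedged sketch: frequently used columns in one table, bulky or rarely used
# columns in another, joined on the same primary key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user_core (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE user_extra (
        user_id INTEGER PRIMARY KEY REFERENCES user_core(user_id),
        bio     TEXT,                  -- large and rarely queried
        avatar  BLOB
    );
""")
con.execute("INSERT INTO user_core VALUES (1, 'Alice')")
con.execute("INSERT INTO user_extra VALUES (1, 'A long biography...', NULL)")
print(con.execute(
    "SELECT name, bio FROM user_core JOIN user_extra USING (user_id)").fetchall())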
One of the nice things about databases is that they impose structure on your data (assuming that it’s more complicated than can fit into a single row). Many years ago, when I worked at one of the major energy trading companies (which is no longer with us in that form), we had a trading system that stored data in (XML-like) strings. That aspect of the system was widely regarded as a major failure. Nothing else could read their data, and they ended up supporting 3(!) databases for communicating with external systems. It made the Daily WTF several years ago.
Heh. Along the lines of the last, I tend to like the approach of serializing a bunch of attributes – those that I do not expect to participate as selection criteria in queries – into a single string for storage in the database. This allows the set of attributes to vary without needing changes to the schema, and keeps the schema simple. You could use XML for the bag-of-attributes, but I am more likely to use JSON.
I am sure Coad would disapprove.
(You could possibly use stored procedures to allow serialized attributes to participate in queries, but I’d be wary of that approach.)
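For what it's worth, here is a minimal sketch of that bag-of-attributes approach, using Python's sqlite3 and json modules; the table, columns, and attribute names are invented for this illustration:

import json
import sqlite3

# Hedged sketch: columns that appear in query predicates stay as real columns;
# everything else rides along as a serialized JSON string.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE event (
        event_id INTEGER PRIMARY KEY,
        kind     TEXT NOT NULL,   -- used in WHERE clauses, so a real column
        attrs    TEXT NOT NULL    -- everything else, serialized as JSON
    )
""")
con.execute("INSERT INTO event VALUES (?, ?, ?)",
            (1, "login", json.dumps({"ip": "10.0.0.1", "client": "mobile"})))

# A new attribute later requires no schema change at all.
con.execute("INSERT INTO event VALUES (?, ?, ?)",
            (2, "login", json.dumps({"ip": "10.0.0.2", "mfa": True})))

for event_id, kind, attrs in con.execute("SELECT * FROM event WHERE kind = 'login'"):
    print(event_id, kind, json.loads(attrs))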
If you have ever had to write software that maintained both upwards and downwards compatibility, then I hope the motivation for the above is familiar.
Been there, tried it and seen it fail in MUMPS, Lotus Notes and COBOL flat files.
The difficulty with schema-less (or any less rigidly structured) databases comes five or ten years down the track, when your application still has to deal with documents stored in 2010, 2011, 2012, etc., with half a dozen different internal structures across those documents.
If you can version your documents and (from a business point of view) purge old versions to reduce long-term support, then the model can work. It’s great for ephemeral stuff, where the life of a document is hours, days or weeks.
The problem isn’t in modifying schemas or even modifying applications; it’s building a new business model that can still cope with the residue of a long history of other business models.
There’s ultimately no difference between having a data structure fixed at the schema level and a data structure fixed within individual documents.
Objecting to complex schemas makes as much sense as objecting to complex problems. A schema is as complex as the problem that it is addressing.
You can’t buy simplicity by ignoring complexity, and when it comes to data, you cannot recover complex structure from a simplified projection. If that were possible, then your projection wouldn’t actually be simpler! On the other hand, when your rules change in a complex schema, there are correctness-preserving transforms on that data to get what you need.
In my experience a complex schema is far more flexible and safely transformed than complex code created to deal with data that has been poorly modeled.
My job used to be mostly performing enterprise data conversions and upgrades, moving data from one system to another. The hard work was always in making sense of poorly normalized data, but converting the entire database structure to new rules was easier than rewriting that horrible code in those older systems to get new functionality. The advice to move to less normalization as simplification is completely opposite to my experience with large systems. I’ve never regretted a constraint in a database, but believe me I’ve regretted the lack of one!
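As a toy illustration of such a transform (invented tables, sketched with Python's sqlite3 module): when a rule changes, well-constrained data can be mapped onto the new structure with a couple of checkable statements.

import sqlite3

# Hedged sketch: the rule "a paper has one author" becomes "a paper may have
# several authors"; the old table maps cleanly onto the new structure.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE paper_old (paper_id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER)")
con.executemany("INSERT INTO paper_old VALUES (?, ?, ?)",
                [(1, "On Joins", 7), (2, "On Schemas", 8)])
con.executescript("""
    CREATE TABLE paper (paper_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               PRIMARY KEY (paper_id, author_id));
    -- The transform: every old row maps to exactly one row in each new table.
    INSERT INTO paper        SELECT paper_id, title     FROM paper_old;
    INSERT INTO paper_author SELECT paper_id, author_id FROM paper_old;
""")
print(con.execute("SELECT * FROM paper_author").fetchall())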
@Daniel:
“No. What I’m saying is that should the developer choose to do so at a later date, he shouldn’t need to change the database schema.”
Then I completely didn’t follow your point. And still don’t.
@Mike S
I’m intrigued by your comment that ‘databases impose structure’, as I think of the relational model as being all about removing structure (‘logic without meaning’ and all that stuff).
I’m not sure whether I’m interpreting ‘structure’ as you intended, though.
Regards,
Lance
@Bannister
It might make sense to enforce discipline through a well defined database schema, with carefully thought-out constraints. If there are many (possibly less capable folk) writing applications against the database, this can be an excellent approach … but only works if you have the right guy responsible for the database. Not all DBAs are so capable. In my past exercises, the database is accessed indirectly via a service, and very few (carefully controlled) applications access the database. For my purpose, there is much less value in constraints or more complicated schemas. Discipline is maintained within the boundary of the service API.
This appears to be a very sensible analysis.
@Lance
Done correctly, a database schema forces your data to be stored a certain way. The client column will always contain client information, detail rows will always have a good reference to a master record, and so on. Schemas might change, but it should almost always be additions, which don’t invalidate existing data. (This was really important in our case, since we had contracts which ran for 10+ years.)
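For example (a minimal sketch with Python's sqlite3 module; the table is invented), an additive change leaves every existing row valid:

import sqlite3

# Hedged sketch: adding a column with a default instead of altering existing
# columns keeps ten-year-old rows readable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contract (contract_id INTEGER PRIMARY KEY, client TEXT NOT NULL)")
con.execute("INSERT INTO contract VALUES (1, 'Acme Corp')")

# Years later: add, don't change.
con.execute("ALTER TABLE contract ADD COLUMN currency TEXT NOT NULL DEFAULT 'USD'")
print(con.execute("SELECT * FROM contract").fetchall())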
If you store the data as binary blobs/strings/XML/what have you, then you’re relying on the client(s) to always get it right. Unfortunately, with code drift, it’s possible to get into a situation where your system can’t read old data, which is what happened in this case.
I wasn’t directly involved in the project (I worked on several systems which used it as a source), and several managers involved in the project told me that it was their biggest mistake.
“Schemas might change, but it should almost always be additions, which don’t invalidate existing data.”
This is the key point for maintaining upwards and downwards compatibility, but use the notion of schema in the greater sense, not just limited to what the SQL database knows of as a schema. If you maintain discipline, things work well. If you lose discipline, you end up with a rat’s nest.
In the end, the key is the folk working on the system, and having the right folk in the right roles.
It might make sense to enforce discipline through a well defined database schema, with carefully thought-out constraints. If there are many (possibly less capable folk) writing applications against the database, this can be an excellent approach … but only works if you have the right guy responsible for the database. Not all DBAs are so capable.
In my past exercises, the database is accessed indirectly via a service, and very few (carefully controlled) applications access the database. For my purpose, there is much less value in constraints or more complicated schemas. Discipline is maintained within the boundary of the service API.
As always, the most appropriate solution depends on your application.
@Mike S
Ah, I see what you mean. I agree with the notions as you present them. We are considering different levels of structure.
I’m not disagreeing with you now (and I have had some of the experiences you describe too – not pleasant). I’m just going to wax rhapsodic for a while…
As I see it, the relational model flattens the high level structure out, which looks like a removal of structure to me. But as you kind of note, you achieve this by structuring your data into tuples which collectively make relations. So at that level, it’s a highly structured form.
On the other hand, an XML document is a highly structured form too. The fact that you can put it into a string isn’t really relevant to how structured the data is.
Yes, you can put anything into a string. If we serialized an entire relational schema into a string, would that make it unstructured data? Seems like a confusion of container with content to me…
Your example of a research paper design is just an example of bad analysis and design; it has nothing to do with the pros and cons of relational databases or joins.
“Did the bunch of guys in IT knew about this?”
They should have.
Know your business.
If you don’t, chances are that you’ll do something bad whether it’s a relational database, nosql, wherever, whatever.
@Mike
I agree with what you are saying, but I must nitpick.
The difference between storing data in a relational database and storing it as an XML document is that the DB makes guarantees about the structure of the data, and XML doesn’t.
Relax NG, XML Schema and DTDs can be used to validate XML documents. In fact, an XML schema can be far richer than any relational schema. Lack of good schema validation is not a weakness of XML.
And, honestly, if you must store documents (such as the content of blog posts) in a database, can you do better than XML?
As an example, Atom stores are beautiful. They will not break (because the Atom format will remain backward compatible).
(Yes, that’s not what you were talking about. That’s why I say I’m nitpicking.)
@Lance
The difference between storing data in a relational database and storing it as an XML document is that the DB makes guarantees about the structure of the data, and XML doesn’t.
While using a (single) web service can do a lot to validate that data gets saved correctly, you could still have an issue where old data was serialized in a format that new code can’t understand. (That’s what happened in the case I’m thinking of. The middle tier couldn’t read data it had written several years earlier, when it was just going into production.)
I’m not saying the SQL databases are a silver bullet. One of my coworkers did a database migration where the meaning of some columns in the original DB had changed over time. Getting that cleaned up was a chore.
Otfried Cheong and Otfried Schwarzkopf share one ID on the Microsoft Academic Search web site: http://academic.research.microsoft.com/Author/154726.aspx
I’m not saying the SQL databases are a silver bullet. One of my coworkers did a database migration where the meaning of some columns in the original DB had changed over time. Getting that cleaned up was a chore.
Hey, I agree with “Tech Irons”: in our SQL database, the meaning of some columns had changed over time too. This happened to us.
Are we talking about normalization, or individual columns and storing repeated column values once?