Thanks for the overview on NoSQL. I’m not convinced that imposing a strict structure on data, or even on how we access it, is suitable for all circumstances, even though as an industry and a research field we seem to have convinced ourselves that it is.
@Olivier: I don’t find the spam protection elitist at all. If Daniel wanted elitists, he’d ask questions only researchers could answer.
Itman says:
Yes, some people believed, e.g., that you can use an SQL DB to handle a search engine efficiently. All search engines fall short in terms of performance.
Itman says:
I mean all SQL-based search engines that relied on normalization fall short in terms of performance.
This is one thing I keep failing to understand. In every database-powered application I have ever written, SQL or not, a join has been necessary at some point. If it’s a no-join database, then I have to write the join in my application, by hand.
I would assume that the writers of such databases as PostgreSQL and MySQL are much better at this than I am. Their join should definitely be faster and better than my join. Therefore, what’s the benefit of using a no-join database if I have an application that requires joins? And if every application requires joins, what’s the point of ever using a no-join database?
Now for the hardest question. What if the join I wrote in my application using a no-join database is somehow faster than the ones in MySQL and Postgres? In that case, why do we even need join databases at all? And why don’t the no-join databases add a join command?
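To make the “write the join in my application, by hand” scenario concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the schema and data are invented for illustration. The first version stitches two result sets together in application code, the second hands the same work to the engine as a declarative join.

```python
# Hand-rolled application-side join vs. letting the database join.
# Illustrative only; schema and data are made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 9.99), (11, 1, 5.00), (12, 2, 20.00);
""")

# "No-join" style: two separate queries, stitched together by hand.
users = {uid: name for uid, name in db.execute("SELECT id, name FROM users")}
by_hand = [(users[uid], total)
           for _, uid, total in db.execute("SELECT id, user_id, total FROM orders")]

# The same result as one declarative query the engine plans and executes.
by_db = list(db.execute(
    "SELECT u.name, o.total FROM orders o JOIN users u ON u.id = o.user_id"))

assert sorted(by_hand) == sorted(by_db)
```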
Itman says:
“If it’s a no-join database, then I have to write the join in my application.”
No, that is not necessarily true. First, if you don’t normalize, you may not need a join. Second, a join can be precomputed. Third, a relational DB is a compromise (OLTP vs retrieval) solution. If you don’t need all the fancy transaction stuff, you can do much better.
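As a rough illustration of the “don’t normalize / precompute the join” point, here is a small Python sketch; the dicts simply stand in for records in a key-value or document store, and all names are invented.

```python
# Normalized: two "tables"; answering "what did Ada order?" needs a join
# (or several lookups) at read time. Illustrative data only.
users  = {1: {"name": "Ada"}}
orders = {10: {"user_id": 1, "total": 9.99},
          11: {"user_id": 1, "total": 5.00}}
ada_total = sum(o["total"] for o in orders.values() if o["user_id"] == 1)

# Denormalized: the join is done once, at write time, and stored with the
# user, so a read is a single lookup with no join at query time.
user_doc = {"name": "Ada",
            "orders": [{"id": 10, "total": 9.99}, {"id": 11, "total": 5.00}]}
ada_total_precomputed = sum(o["total"] for o in user_doc["orders"])

assert ada_total == ada_total_precomputed
```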
Nice article. On this topic, I have a question which I have been trying to figure out for some time now. Whenever someone looks into NoSQL as a means to avoid costly join operations, there is usually a suggestion of using column-oriented databases (like HBase or Cassandra) to fix that issue. I’ve been thinking for quite some time about how a column-oriented database should be able to fix expensive join operations and I haven’t figured it out. Anyone care to enlighten me?
@André Silva
Precomputed joins.
Hm, is that it? I know Cassandra also allows you to have more than one value in the same field (sorry if the terminology is not correct), which allows you to avoid many-to-many relationships and therefore join operations.
Itman says:
@André Silva
Yes, it does not require normalization, which can also be seen as a precomputed join.
In addition, I believe that some of the column-oriented DBs work with compacted, semi-static data. At least, BigTable does.
I don’t remember how they handle transactions; I just remember that their data is mostly static. Therefore, it can be compressed and retrieved efficiently.
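A hedged sketch of the multi-valued-field point above, with plain Python structures standing in for a column-family store (names invented): the normalized version needs a junction table and thus a join, while the wide-row version keeps the values inside the record.

```python
# Normalized many-to-many: a junction "table" relating articles to tags,
# so listing one article's tags means joining (or filtering) at read time.
article_tags = [(1, "nosql"), (1, "databases"), (2, "sql")]
tags_of_article_1 = [tag for (article, tag) in article_tags if article == 1]

# Wide-row / collection-column style: several values in one field, so the
# read is a single lookup and the many-to-many table disappears.
articles = {1: {"title": "On joins", "tags": {"nosql", "databases"}},
            2: {"title": "On SQL",   "tags": {"sql"}}}
assert set(tags_of_article_1) == articles[1]["tags"]
```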
Matthew W says:
Interesting article; I found this point a bit puzzling though:
“SQL is based on joins and related low-level issues like foreign keys”.
I think when it comes to models and languages for discussing data, many would agree that the relational model (and SQL, as an approximation of it) is *higher-level* than object-based database models. The model is purer and works at a higher level of abstraction, not being concerned with lower-level optimisations for particular physical storage layouts and data access paths, but instead modelling the data declaratively in a very pure flexible normalised form without redundancies.
Also: SQL is a declarative language, so be careful to distinguish between ‘join’ as a logical expression in a declarative language and the implementation techniques used by some DBMSes to compute those joins. DBMSes are by no means compelled to use any one technique to compute a query. Sometimes they need to be given hints. But it’s worth separating the logical model from the physical storage model and any physical storage optimisations, at least in your mind when reasoning about these concepts.
An interesting discussion on that subject on the ACM blog:
http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext
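To make the logical-versus-physical distinction concrete, here is a minimal sketch with Python’s built-in sqlite3 (schema invented for illustration): the join is stated once, declaratively, and EXPLAIN QUERY PLAN shows whatever physical strategy the engine happens to choose.

```python
# Same declarative join; the engine, not the query, decides the physical plan.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_orders_user ON orders(user_id);
""")

query = "SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id"

# The plan (e.g. scan one table, index-search the other) is a property of the
# engine and the physical layout, not of the logical query text.
for row in db.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```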
NoSQL systems could do joins (in principle), but they can’t guarantee their consistency.
So “they” say that if you want to do a join, you have to do it in the application. Doing a join in the application is generally slower because it means that you have to call the db twice instead of once. There is also a greater chance of inconsistency, because stuff can change between the first call and the second call to the db.
It all comes down to who owns the problem of inconsistency: the database or the application?
Cassandra (R + W > N) only offers consistency when you retrieve one “record”. Oracle offers consistency when you retrieve one, two or a million (aggregation) records.
Cassandra and MongoDB “records” can contain a lot of information, so there is less need to join.
Read here: http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/
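A small sketch of why R + W > N gives that per-record guarantee (values chosen only for illustration): with those quorum sizes, every read quorum overlaps every write quorum in at least one replica, so a read of a single record sees at least one up-to-date copy.

```python
# Check that any write quorum of size W and read quorum of size R overlap
# when R + W > N. Purely illustrative; a real client also resolves versions.
from itertools import combinations

N, W, R = 3, 2, 2          # 3 replicas, write to 2, read from 2
replicas = range(N)

always_overlap = all(
    set(write_set) & set(read_set)
    for write_set in combinations(replicas, W)
    for read_set in combinations(replicas, R)
)
print(always_overlap)      # True: every read sees at least one fresh replica
```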
Joins are necessary and v e r y f a s t on SSD drives (speaking about “the characteristics of modern hardware are increasingly detrimental to the hash join alternative”).
Sure, it also helps if the implementation is a cache-oblivious algorithm.
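For readers who have not seen one, here is a minimal in-memory hash join in Python, purely to make the “hash join” under discussion concrete; real engines additionally worry about memory limits, spilling to disk, and cache behaviour, which is where SSDs and cache-aware layouts matter.

```python
# Build-and-probe hash join over two lists of dicts (illustrative only).
def hash_join(left, right, left_key, right_key):
    # Build phase: hash the (ideally smaller) left input on its join key.
    buckets = {}
    for row in left:
        buckets.setdefault(row[left_key], []).append(row)
    # Probe phase: stream the right input and emit merged matching rows.
    for row in right:
        for match in buckets.get(row[right_key], []):
            yield {**match, **row}

users  = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Grace"}]
orders = [{"order_id": 10, "user_id": 1, "total": 9.99},
          {"order_id": 12, "user_id": 2, "total": 20.00}]
print(list(hash_join(users, orders, "user_id", "user_id")))
```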
Dan, there are more formal arguments for and against MySQL than just its Turing completeness. Come on!
BTW, your spam protection is very elitist. But that’s what you want, right?
@Olivier
Requiring people connected to the web to recognize Latin numerals is not elitist. Presumably, some people may even enjoy an original captcha.