Daniel Lemire's blog

, 2 min read

NoSQL or NoJoin?

Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because it is not centered around the SQL language. While there has always been non SQL-based database systems, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques :

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasing detrimental to the hash join alternative (see C. Kim, et al. Sort vs. Hash revisited. 2009).

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system, may fail you in other instances.