Daniel Lemire's blog

, 17 min read

Do NULL markers in SQL cause any harm?

20 thoughts on “Do NULL markers in SQL cause any harm?”

  1. Tony Hoare called null references his billion dollar mistake (http://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare), similar reasoning applies to 3-valued logic. It’s very easy to get bugs in queries when you forget to handle nulls and such bugs usually aren’t noticed until very late in the development cycle. It also leads to things like automatically conversion of values to NULL and vice-versa (e.g. empty varchars are NULL in Oracle).

  2. Peter Turney says:

    Many-valued logic is not inherently inconsistent (see http://plato.stanford.edu/entries/logic-manyvalued/). Why not allow three values but use a consistent logic?

  3. Christopher Smith says:

    I think people unfairly treat clustering of their bugs around features as indications that those features are a huge problem. NULL is a classic case of this.

    Yes NULL is error prone. Yes NULL is inconsistent. Yes a lot of errors occur because of NULL.

    That’s because NULL is a necessary tool to address error prone, inconsistent aspects of reality. We need it.

  4. @Peter Turney

    Assuming that you do want 3-value logic, there is a computational cost to pay if you want consistency. Yet efficiency is very important for a database system, maybe more so than theoretical rigor.

    Let me quote John Grant from “Null Values in SQL”:

    Using Kleene’s 3-valued logic I showed that a truth-functional (i.e. the connectives are defined by truth-tables) 3-valued logic, where the third truth value stands for ”unknown”, will not give some formulas the correct truth value, and proposed a non-truth-functional 3-valued logic that gives all formulas correct truth values. In the case of null values for a relational database this means that the 3-valued logic truth tables used by Codd (the same as in Kleene’s 3-valued logic) do not always give correct answers to queries. First I wrote to Dr. Codd explaining the problem and after his reply I wrote a short article pointing out the problem. (…) for the correct evaluation of a query in the presence of a null value, all different cases must be considered.

    What is amazing to me is that Codd got away with an inconsistent 3-value system and not only that, but this became the de facto standard for database systems.

  5. 3-valued logic is “flattened” in SQL: In SELECT sentences, if the whole WHERE expression is “unknown” then it is assumed to be “false”. In constrat, in CHECK constraints, if the expression is evaluated to “unknown” (e.g. CHECK X>0) then it is assumed to be “true”.
    Some aggegate operators behave in a contra-intuitive way in presence of NULL marks.
    However, there is, in my opinion, a situation where NULLs make sense: when representing a N:0..1 relationship between two tables and you don’t want to have an extra table to represent such a relationship.

  6. Christopher Smith says:

    I should clarify my previous post: obviously there are ways to get the job done without NULL. However, we need to model and address the bits of reality that give rise to NULL, and while their are approaches that avoid the Date’s theoretical problem with NULL, I don’t think there is an approach which avoids all the bugs that so often occur and are attributed to NULL. Unlike say, C’s buffer overflows or NULL terminated strings, the bugs are caused by the nature of the problem, not the nature of the solution.

    I’d even argue NULL handles those problems better than most other solutions I’ve seen, but perhaps someone has a better alternative.

  7. Dominic Amann says:

    An interesting side note – in ieee floating point representation, there is “NaN” (not a number). If I have

    double a = numberic_limits::quietNaN();

    bool e = (a == a);

    will evaluate to false.

  8. Dominic Amann says:

    @Daniel I agree it follows the standard. I was cnotrasting its behaviour to sql’s null.

  9. Dominic Amann says:

    @Danieal – you may be right – I wasn’t commenting on the rightness of the approach, just the different choice. Sorry I wasn’t clear.

    To look at Codd’s approach – I think there are two problems here:

    one is “this answer is not defined for the arguments given”, and the other is “this data item has not been assigned a value”.

  10. @Dominic Amann

    NaN is supposed to be non-reflexive under IEEE 734. C++ supports IEEE 734.

    PostgreSQL and Oracle do claim to support IEEE 734 but I think that their NaN is reflexive (thus violating the standard).

    IEEE 734 is logically consistent, and relies on simple 2-value logic (false or true). Elements in IEEE 734 do not form equivalence classes, however, but that is not required for logical consistency.

  11. @Dominic Amann

    Well, the designers of IEEE 734 did not see a need to use 3-value logic. I think they were right. I think Codd was wrong.

  12. Onne says:

    Actually “1 = NULL” might not be false, but it is falsy. That is, SELECT * FROM table WHERE field = null; will simply select nothing, because the predicate is always falsy, even if field itself contains null for a record.

    The trouble is when you expect negation of falsy. The negation of something undefined is still undefined. Hence “not null” is still null in sql. And “1 null” is still falsy. (But funny enough, “1 (null OR false)” is true because “null OR false” is false and not null but “null AND true” is null.)

    The conclusion is, read the SQL null as “undefined” or “don’t know”, where the c/c++/java null is more like zero. Too bad javascript has undefined, but “undefined == undefined” is true. So undefined in javascript is just a marker and not a true undefined value, like null is in sql. But then javascript has NaN and “NaN == NaN” is false. So NaN behaves like a real undefined value.

    If anything, sql could benefit from a value that represents undefined and a value that represents nothing. But maybe todays situation is better, otherwise nobody would understand which value to use in what situation. And now everybody gets burned by sql null at least once and then learns 🙂

  13. @Dominic Amann

    Yes. Codd did, in fact, propose to handle different markers to represent just the kind of distinctions you are making, but it was never picked up.

  14. @Onne

    The trouble is when you expect negation of falsy

    I think that the trouble is deeper than that. You do end up with inconsistent results. Period. It is not just that NULL is difficult to understand, it is mathematically improper. Please see the detailed explanation Grant gave: http://www09.sigmod.org/sigmod/record/issues/0809/p23.grant.pdf

    Even Codd admitted as much: to him, the NULL markers were outside of the relational model and not subject to normalization. He knew of the problems, because they were reported to him, but he viewed them as irrelevant.

    In some sense, history proved him right. Or did it?

  15. Onne says:

    @Daniel

    In the example of the paper, P1 has NULL for a city, and the author poses that whatever you fill in for NULL, it must result in the predicate to become true. However, maybe the parts supplier P1 is omnipresent, it is in every city all at once. In that case the predicate should stay false on two accounts: P1 is the same city as S1 and P1 is in Paris.

    NULL is not a value, it is outside the domain of the values that could have been in its place.

    Is that mathematically improper? I am not convinced; it is practical, however. So yes, I think Codd was proven right.

  16. Onne says:

    @Daniel

    Going back to my first statement. Notice how there are no inconsistencies if you don’t use negations (not or unequal). If need be use “select inverse”.

    So how do you deal with this as a programmer? Just don’t use negations unless you think about how you want to deal with null. Basically you need to deal with the fact that the sql engine does not know that the domain of cities is finite and void of quantum values.

    And surely not having null is much more of a burden… it is like saying because division by zero is such a problem that we don’t want zero in math.

  17. @Onne

    Yes, it is mathematically inconsistent. There is no way around it. I’ll update my blog post right away with a more elaborate discussion. Your interpretation falls apart too as you’ll see.

  18. Anonymous says:

    @Daniel

    I see what you mean, but your reasoning is incorrect, see NULL might mean is nowhere, or everywhere, or only paris or only london. Or it might even be a quantum value that is london if you compare it to paris, and paris when you compare it to london. (This is easier to see if you stop using singular values and use sets of cities instead, now what does the NULL set mean, or a set containing Paris and NULL mean?)

    The only sensible answer to a comparison with null is to return null again. The inconsistency is the pragmatic choice to let null be a falsy value when it comes to actually doing work (selecting records).

    The sound thing to do in sql is not to return a list of some records when encountering null, instead to collapse the whole result into null itself. As in, I don’t know the answer to your question. But that is not very pragmatic …

  19. @Anonymous

    The sound thing to do in sql is not to return a list of some records when encountering null, instead to collapse the whole result into null itself. As in, I don’t know the answer to your question.

    But that is not what SQL does. It does return an answer even when nulls are involved. SQL itself does not offer any consistent view.

    Of course, the programmer can check (painfully) to see if any nulls are involved. Or he can forbid nulls. Or he can use other, more complicated schemas, or special values… Anyhow, the burden is on the programmer to do the right thing with nulls… SQL will not help you. Arguably, it makes it hard for you to do the right thing.

  20. @Onne

    Just so we are clear: I am not arguing against NULLs.