Daniel Lemire's blog

Do hash tables work in constant time?

17 thoughts on “Do hash tables work in constant time?”

  1. Thus accessing an entry in a hash table of size n takes time O((log n)^(1+epsilon)), not O(n^(1+epsilon)).

    Duh! Thanks. (I don’t think anyone would be using hash tables, had I been right…) (:shame:)

  2. Rachel Blum says:

    You’re playing semantic games. If you look at algorithms, modulus computation is always considered O(1). And it is in most commercial CPUs. The time for a multiplication is independent of the size of the number, assuming we stay in the natural word size. The complexity you mention has simply been moved from time-complexity to gate-complexity.

    And even if we stipulate that hashing were O(log n) – where do you make the jump to O(n log n)?

  3. @Rachel

    Thank you for your comments. Yes, I like to play games, but please give me some credit.

    Yes, we always assume that multiplications take constant time because we assume that we take in 64-bit integers and produce 64-bit integers using 64-bit processors.

    But is it true?

    No, it is not if you use vectorization. With vectorization, you can multiply four pairs of 16-bit numbers in the time it takes to multiply one pair of 64-bit numbers (see the sketch below). And yes, people use vectorization, right now, in commercial products.

    Further reading:

    http://en.wikipedia.org/wiki/Vectorization_(computer_science)
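
    A minimal sketch of the kind of packed 16-bit multiplication referred to above, using the SSE2 intrinsic _mm_mullo_epi16 (a single instruction produces the low halves of eight 16-bit products); the operand values are made up purely for illustration:

        #include <emmintrin.h>  // SSE2 intrinsics
        #include <cstdint>
        #include <cstdio>
        int main() {
            // Eight pairs of 16-bit operands, packed into 128-bit registers.
            alignas(16) uint16_t a[8] = {3, 5, 7, 11, 13, 17, 19, 23};
            alignas(16) uint16_t b[8] = {2, 4, 6, 8, 10, 12, 14, 16};
            alignas(16) uint16_t r[8];
            __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
            __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
            // One instruction computes the low 16 bits of all eight products.
            __m128i vr = _mm_mullo_epi16(va, vb);
            _mm_store_si128(reinterpret_cast<__m128i*>(r), vr);
            for (int i = 0; i < 8; i++) printf("%d ", r[i]);
            printf("\n");
            return 0;
        }

    Built with SSE2 enabled (the baseline on x86-64), it should print 6 20 42 88 130 204 266 368.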

  4. @Sammy

    When using larger hash tables, you have to compute larger hash values. This takes more time.

    Try it out. Right now. Implement a hash table with 2^256 keys and one with 2^16 keys. You’ll see that multiplying 256-bit integers takes longer (a sketch of why follows this comment). Thus, your 2^256-element hash table will be slower.

    True. Nobody right now has databases with 2^256 elements. But with vectorization, you can do four 16-bit multiplications in the time it takes to do one 64-bit multiplication.

    So, right now, in real systems, I claim that larger hash tables could be slower, even if I discard the memory access time.

    True. Few people are using vectorization *right now*. But some important people are using it for important database applications.
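
    A minimal sketch of why wider hash arithmetic costs more: a portable schoolbook multiply of two 128-bit values stored as 32-bit limbs needs a limb-product count that grows quadratically with the operand width. The limb layout and helper name are mine, purely for illustration:

        #include <cstdint>
        #include <cstdio>
        // A 128-bit unsigned value as four 32-bit limbs, least significant first.
        struct U128 { uint32_t limb[4]; };
        // Schoolbook multiply, keeping only the low 128 bits of the product.
        // The number of 32-bit limb products grows quadratically with operand width.
        U128 mul_lo(const U128& a, const U128& b) {
            uint64_t acc[4] = {0, 0, 0, 0};
            for (int i = 0; i < 4; i++) {
                uint64_t carry = 0;
                for (int j = 0; i + j < 4; j++) {
                    uint64_t p = (uint64_t)a.limb[i] * b.limb[j] + acc[i + j] + carry;
                    acc[i + j] = (uint32_t)p;  // keep the low 32 bits
                    carry = p >> 32;           // propagate the high 32 bits
                }
            }
            U128 r;
            for (int i = 0; i < 4; i++) r.limb[i] = (uint32_t)acc[i];
            return r;
        }
        int main() {
            U128 a = {{0x89abcdefu, 0x01234567u, 0, 0}};  // 0x0123456789abcdef
            U128 b = {{3, 0, 0, 0}};
            U128 r = mul_lo(a, b);
            printf("%08x%08x%08x%08x\n", r.limb[3], r.limb[2], r.limb[1], r.limb[0]);
            return 0;
        }

    Scaling the same scheme to 256-bit operands quadruples the limb-product count again, which is the growth being pointed at here.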

  5. Rachel Blum says:

    That is *still* constant time… 4*O(1)=O(1)

    (I’m well aware of vectorization – given that I work on games, it’s inevitable 😉)

    Only if your key size were unlimited would it be an issue. (Because then the time required for hash computation would indeed change w/ the size of your key.)

    And yes, larger hash tables are slower – but their time complexity is the same. (Don’t you love complexity arithmetic? 😉)

  6. @Rachel

    And yes, larger hash tables are slower – but their time complexity is the same. (Don’t you love complexity arithmetic?)

    When a hash table with 2^(2m) keys runs two times slower than one with 2^m keys, you have O(log n) complexity by definition. And that’s precisely what happens when running multiple hash table queries simultaneously with vectorization.
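
    In symbols (a sketch of the argument, with T(m) standing for the cost of hashing an m-bit key; the notation is mine, not the post’s):

        T(2m) \approx 2\,T(m) \;\Rightarrow\; T(m) = \Theta(m),
        \qquad n = 2^m \;\Rightarrow\; m = \log_2 n \;\Rightarrow\; T = \Theta(\log n)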

  7. Rachel Blum says:

    As I said – it *is* a true statement when you consider variable key size.

    But for most hash table implementations, the key size is fixed. Hence there is a constant upper bound for modulus, hence O(1)

    That shouldn’t change for vectorization, either, as long as the key size has a fixed upper bound.

    What am I missing?

  8. @Rachel

    But for most hash table implementations, the key size is fixed. Hence there is a constant upper bound for modulus, hence O(1)

    That shouldn’t change for vectorization, either, as long as the key size has a fixed upper bound.

    What am I missing?

    If you fix the maximal number of keys, then all data structures run in constant time. Even a linear scan through an array will never use more than a fixed number of operations.

    That’s not very exciting.

  9. @Preston

    Regarding Pearson hashing, I specifically pointed out that multiplication was not required (see Disclaimer 1; a small sketch follows this comment). As for Pearson being quite good… I agree, and more about this another day. Watch this blog for follow-ups. (Man! Do I wish I knew more people who knew about Pearson hashing!)

    As for using vectorization, the idea I had was to implement several hash tables that are queried simultaneously, or a single hash table that is queried in small batches. Why not? People are implementing and selling database engines designed around vectorization. Am I going to try building these vectorized hash tables today? Probably not.

    I was just trying to get people to think about their assumptions this morning and it degenerated…
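
    A minimal sketch of Pearson hashing, which needs only a 256-entry byte table, XORs, and lookups – no multiplication or division. The table filler below is a stand-in for the randomly shuffled permutation of 0..255 normally used, and the key is just an example:

        #include <cstdint>
        #include <cstddef>
        #include <cstdio>
        static uint8_t T[256];
        // Fill T with a permutation of 0..255. (167 is odd, so i*167+13 mod 256
        // visits every byte value; a random shuffle is the usual choice.)
        static void init_table() {
            for (int i = 0; i < 256; i++) T[i] = (uint8_t)(i * 167 + 13);
        }
        // Pearson hash: one table lookup and one XOR per input byte.
        static uint8_t pearson(const uint8_t* data, size_t len) {
            uint8_t h = 0;
            for (size_t i = 0; i < len; i++) h = T[h ^ data[i]];
            return h;
        }
        int main() {
            init_table();
            const char* key = "hash tables";
            printf("%d\n", pearson((const uint8_t*)key, 11));
            return 0;
        }

    A single pass yields only an 8-bit value; the usual extension for wider hashes is to run extra passes with a modified first byte and concatenate the results.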

  10. You lost a logarithm: “Multiplications of numbers in [0,m) can almost be done in time O(m log m)” should read “Multiplications of numbers in [0, 2^m) can almost be done in time O(m log m)”.

    Thus accessing an entry in a hash table of size n takes time O((log n)^(1+epsilon)), not O(n^(1+epsilon)).

    There’s a more fundamental reason for hash tables not being constant-time, however: random-access memory isn’t constant-time. You’ve got at least log n gate delays; and once you start dealing with large amounts of storage, you’ve got a speed-of-light cost of O(n^(1/2)) or O(n^(1/3)), depending on whether your circuits are two- or three-dimensional.
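
    Spelling the substitution out (a sketch in my notation, with epsilon absorbing the log factor): hashing into a table of size n means working with values of about m = log_2 n bits, so

        m = \log_2 n, \qquad O(m \log m) \subseteq O(m^{1+\epsilon}) = O((\log n)^{1+\epsilon})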

  11. Rachel Blum says:

    @Daniel – even w/ an upper bound on the number of elements, an array scan still depends on how full the array is, though.

    A hashtable is still independent of the number of entries, for a fixed *key* size. The number of elements can be variable.

    That’s not saying keysize is not important, just that a hashtable w/ a fixed keysize is O(1) w/ respect to number of entries, and an array scan is O(n)

    So I guess my quibble is that for a given key size, hash tables should be O(log k) – k being the maximum number of keys – not O(log n).

  12. Sammy Larbi says:

    When doing complexity analysis you are supposed to be comparing apples with apples: how fast does the time or space complexity grow /with respect to the size of the input n/.

    You are taking n as the size of the hash table at the beginning, then using n as the size of one element in another part, and Colin is (I’m guessing facetiously) taking n as the number of logic gates in a stick of RAM.

    Neither of these variables changes as n changes, so it doesn’t make sense to include them in the analysis. Even if you do include them, they are still constants, and choosing M and x (following http://en.wikipedia.org/wiki/Big_O_notation) appropriately would show you the correct result.
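
    For reference, the definition being pointed to (standard big-O, as in the linked article): f(n) = O(g(n)) means there exist constants M and x such that

        |f(n)| \le M \, |g(n)| \quad \text{for all } n > x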

  13. @Rachel I am happy with O(log k). But what if you are using strings as keys?

    Perhaps I am not tracking your argument, but as far as I can tell, 32- and 64-bit integer multiply instructions take a fixed number of clocks on recent x86 CPUs (the fixed-time multiply hardware went into either the 386 or 486, if memory serves). Presuming you are interested in hash tables that fit in memory, integer multiply instructions are sufficient.

    Not sure how you would use vectorization with hashing, at least not in the usual cases.

    Then again, for my purpose, simple Pearson hashing (not using multiply or divide) has always won out when measured over actual problem sizes.

  15. D. Eppstein says:

    Yes, multiplication isn’t really constant time and memory access isn’t really constant time, but if you’re going to assess a penalty to those operations when you use them in hashing then you need to assess them consistently throughout whatever other algorithms you’re using.

    Or, you could use the uniform cost model, pretend the penalty is one, and get basically the same comparison between any two algorithms (that don’t have drastically different memory hierarchy behavior) without all the pain.

    Where this falls down, of course, is that the memory hierarchy behavior of hashing is bad. So if you compare it to something that’s designed to have good locality of reference, the uniform cost model may not be the right thing to use.

  16. @sleepnova

    You are correct. My blog post was over-engineered.

  17. sleepnova says:

    @Rachel

    Don’t you think the number of keys is relevant to the number of entries?

    @Daniel

    You don’t even have to discuss the time complexity of integer multiplication in modern CPUs. You only need to show a lower bound on the cost of generating the hash value. For example, say you have n data entries and an ideal hash function which generates hash values between 1 and n without a single collision. You need a data field of at least log n bits to contain such a value. So the cost to process that much data is obviously O(log n). At least you have to read through it!
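
    A minimal illustration of that lower bound (the table sizes are made up): n distinct hash values need at least log2(n) bits each, rounded up, so merely reading or writing one value already touches Ω(log n) bits:

        #include <cstdint>
        #include <cstdio>
        // Smallest number of bits b with 2^b >= n, i.e. ceil(log2(n)):
        // the minimum width of a hash value that can distinguish n entries.
        static unsigned bits_needed(uint64_t n) {
            unsigned bits = 0;
            while (((uint64_t)1 << bits) < n) bits++;
            return bits;
        }
        int main() {
            const uint64_t sizes[] = {1000, 1000000, 1000000000};
            for (uint64_t n : sizes) {
                printf("n = %llu -> at least %u bits per hash value\n",
                       (unsigned long long)n, bits_needed(n));
            }
            return 0;
        }

    Growing a table from 2^16 to 2^64 entries quadruples the minimum hash-value width, which is exactly the log n factor being argued for.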