@Turney
If the boilerplate text is given in advance, you could generate a perfect hash function. Then you wouldn’t need to store the keys.
Your domain space is not limited to boilerplate text.
Can you implement a hash table without storing the keys?
If the boilerplate text is given in advance, you could generate a perfect hash function. Then you wouldn’t need to store the keys.
http://en.wikipedia.org/wiki/Perfect_hash_function
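A minimal sketch of that perfect-hash idea in Python, assuming a small boilerplate set known in advance; the brute-force seed search below is only for illustration (real implementations use constructions such as CHD), and the sample chunks are hypothetical:

import hashlib

def h(text, seed, m):
    # A seeded 64-bit hash reduced to m slots; any decent hash family would do.
    data = (str(seed) + "\x00" + text).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little") % m

def build_perfect_table(boilerplate, load=1.3):
    # Search for a seed that maps every known chunk to a distinct slot.
    m = max(1, int(len(boilerplate) * load))
    for seed in range(1, 1000000):
        if len({h(t, seed, m) for t in boilerplate}) == len(boilerplate):
            table = [False] * m
            for t in boilerplate:
                table[h(t, seed, m)] = True   # one bit per slot, no key stored
            return seed, m, table
    raise RuntimeError("no collision-free seed found; increase the load factor")

seed, m, table = build_perfect_table([
    "All rights reserved.",                   # hypothetical boilerplate chunks
    "Click here to unsubscribe.",
])

def looks_like_boilerplate(text):
    # Caveat: text outside the original set can still land in an occupied slot,
    # which is exactly the objection above about the domain not being limited
    # to boilerplate.
    return table[h(text, seed, m)]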
Can you implement a hash table without storing the keys?
If you have multiple hash functions, you call the first one to see whether the text might be boilerplate. You only need to call the second one if the text is possibly boilerplate. If the text is mostly non-boilerplate and the first hash table is mostly empty, then you’ll rarely need to call the second (third, fourth, …) hash function(s). On average, you’ll only make slightly more than one hash call per chunk of text.
Given hash functions f1, f2, f3, …, fn, use f1 to index into a hash table and store the values of f2 … fn in that slot of the hash table. Most text will likely index into empty slots, so only f1 will need to be calculated. If the slot is not empty, then you calculate f2. If the value matches, then try f3, and so on. It’s unlikely that all of f2 … fn would match for non-boilerplate text. On average, you would only need to calculate f1. And you wouldn’t be storing the keys.
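If I follow, that scheme could look roughly like the Python sketch below; the particular hash family and the number of extra fingerprints are arbitrary choices here:

import hashlib

def f(i, text):
    # f_i: the i-th member of a seeded 32-bit hash family (arbitrary choice).
    data = (str(i) + "\x00" + text).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")

class FingerprintTable:
    # Stores only the values of f2..fn in the slot chosen by f1 -- never the key.

    def __init__(self, num_slots, extra=3):
        self.m = num_slots
        self.extra = extra                      # how many fingerprints beyond f1
        self.slots = [None] * num_slots

    def add(self, text):
        # Assumes f1 rarely collides on the stored set itself;
        # a fuller version would chain entries on such collisions.
        slot = f(1, text) % self.m
        self.slots[slot] = tuple(f(i, text) for i in range(2, 2 + self.extra))

    def probably_contains(self, text):
        slot = f(1, text) % self.m              # usually the only hash computed
        stored = self.slots[slot]
        if stored is None:
            return False                        # empty slot: definitely not stored
        # Only on a hit do we compute f2, f3, ..., stopping at the first mismatch.
        return all(f(i, text) == v for i, v in enumerate(stored, start=2))

With three 32-bit fingerprints per slot, a false positive on an occupied slot requires three independent 32-bit matches, which is vanishingly unlikely for non-boilerplate text under idealized hashing.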
@Daniel
You could still define a perfect hash function if you’re a little loose with your definition of “distinct element”. If you define any two elements of non-boilerplate as identical (since we’re only interested in boilerplate), you could write a hash that maps any non-boilerplate text to zero, and otherwise maps to a hash for that text.
Of course, if you have the boilerplate, you could just write a trie or finite state machine to detect it.
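For the known-boilerplate case, the trie idea can be as simple as the naive multi-pattern scan below (Aho-Corasick would add failure links so the text is scanned only once); the phrases are hypothetical examples:

def build_trie(phrases):
    # A plain dict-of-dicts trie; the key None marks the end of a phrase.
    root = {}
    for p in phrases:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def find_boilerplate(text, root):
    # Yield (start, end) spans of text that exactly match a stored phrase.
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                yield (start, end + 1)

trie = build_trie(["All rights reserved.", "Click here to unsubscribe."])
spans = list(find_boilerplate("Some page text. All rights reserved.", trie))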
@Peter
On a single pass over the data, you’d still need to compute f1…fn for every piece of text. You don’t know during the initial computation how deep you need to hash. If you only compute f1 because you hit an empty slot, and a later piece of text collides, you’d need to go back and retrieve the original text. That basic idea would work well for a two-pass detection, though: you use a hash to see what could possibly be boilerplate (vs. innocent collisions), then you do a second pass over that much smaller set, which is hopefully small enough that the full keys fit in memory.
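A rough sketch of that two-pass approach, assuming the corpus can be streamed twice and that min_count is whatever frequency threshold you consider boilerplate-like:

from collections import Counter
import hashlib

def chunk_hash(chunk):
    # 64-bit hash of a chunk of text; only these 8-byte values are kept in pass 1.
    return hashlib.blake2b(chunk.encode("utf-8"), digest_size=8).digest()

def two_pass_boilerplate(read_chunks, min_count=5):
    # read_chunks: a callable returning a fresh iterator over the corpus chunks,
    # so the data can be streamed twice without holding it all in memory.
    counts = Counter(chunk_hash(c) for c in read_chunks())        # pass 1: hashes only
    frequent = {hv for hv, n in counts.items() if n >= min_count}

    # Pass 2: keep the full text only for the much smaller candidate set,
    # which weeds out innocent hash collisions.
    candidates = Counter(c for c in read_chunks() if chunk_hash(c) in frequent)
    return {c for c, n in candidates.items() if n >= min_count}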
You don’t know during the initial computation how deep you need to hash.
If the boilerplate is given in advance, you do one pass through the boilerplate collection, and then you can handle a constant stream of non-boilerplate text. All you need to know is a rough estimate of the size of the boilerplate collection.
If the boilerplate is not given in advance, you’ll need to make two passes through the data, regardless of what algorithm you use, or you’re certain to mess up on the first few documents you see. Also, high-frequency text is not necessarily boilerplate. It might be a famous quotation, for example.
@Itman
(1) There are definitely applications where hashing is a bottleneck. That is why people have spent so much time designing fast hash functions.
(2) Your processor is likely to pipeline the computation of hash functions and you can use many tricks if you code it in assembly.
(3) Not every machine these days runs an Intel processor. What about your iPhone?
@Itman
(1) I guess it depends whether your Bloom filter fits in CPU cache.
(3) I think more and more serious applications run on tablets and mobile phones.
Daniel,
Sorry, I did not have time to read any of these articles. Yet I wonder whether it’s really the computation that takes a lot of time, and not the time wasted on cache misses (accessing bits in the Bloom-filter table, or the string for which hash values are computed).
BTW, a lot of hash functions can be computed in parallel. What about using Intel’s SIMD extensions?
Daniel
(1) True, but is it really the case for a Bloom-filter?
(3) That is also true, but those gadgets are mostly for fun (with a few exceptions). You are not going to run boilerplate-detection applications on smartphones, are you? If not, you will quickly find out that everything else today is Intel 🙂