Daniel Lemire's blog

Validating UTF-8 strings using as little as 0.7 cycles per byte

3 min read

Most strings found on the Internet are encoded using a particular Unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitutes a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume…
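To give a feel for how arcane those rules are, here is a minimal byte-at-a-time sketch in Python. It follows the well-formed byte ranges of the Unicode standard (no overlong forms, no surrogates, nothing above U+10FFFF). The function name `is_valid_utf8` is made up for illustration, and this is a plain scalar check, not the fast vectorized approach the post is about.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar UTF-8 check; a sketch, not a fast implementation."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                 # ASCII: one byte
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = allowed range
        # for the FIRST continuation byte (this is where the arcane
        # special cases live).
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:               # reject overlong 3-byte forms
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:               # reject surrogates U+D800..U+DFFF
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:               # reject overlong 4-byte forms
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:               # reject code points above U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                         # 0x80..0xC1 and 0xF5..0xFF
            return False
        if i + need >= n:             # sequence is cut short
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):  # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

For example, `is_valid_utf8("héllo".encode("utf-8"))` is true, whereas the overlong sequence `b"\xc0\x80"` is rejected.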

Is research sick?

3 min read

One of the most important database researchers of all time, Michael Stonebraker, recently gave a talk on the state of database research. I believe that many of his points are of general interest: We have lost our consumers… Researchers write for other researchers. They are being insular…

Science and Technology links (May 11th, 2018)

1 min read

It looks like avoiding food most of the day, even if you do not eat less, is enough to partially rejuvenate you. Google researchers use deep learning to emulate how mammals find their way in challenging environments. We know that athletes live longer than the rest of us. It turns out that Chess…

How quickly can you check that a string is valid unicode (UTF-8)?

4 min read

(This blog post is now obsolete; see, for example, Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition).) Though character strings are represented as bytes (values in [0,255]), not all sequences of bytes are valid strings. By far the most popular character encoding today is UTF-8,…

Science and Technology links (May 5th, 2018)

2 min read

Oculus, a subsidiary of Facebook, has released its $200 VR headset (the Oculus Go). You can order it on Amazon. The reviews are good. It is standalone and wireless which is excellent. The higher-quality Oculus Rift and its nifty controllers are down to only $400, with the caveat that it needs a…

How fast can you parse JSON?

3 min read

JSON has become the de facto standard exchange format on the web today. A JSON document is quite simple and is akin to a simplified form of JavaScript: { "Image": { "Width": 800, "Height": 600, "Animated" : false, …

Is software prefetching (__builtin_prefetch) useful for performance?

5 min read

Many software performance problems have to do with data access. Even with the most powerful processor in the world, if the data is not available at the right time, the computation will be delayed. It is intuitive. We used to locate libraries close to universities. In fact, universities were…

Science and Technology links (April 29th, 2018)

5 min read

Our heart regenerates very poorly. That is why many of us will die of a heart condition. Harvard researchers find that mice that exercise generate many more new heart cells. The researchers hint at the fact that you might be able to rejuvenate your heart by exercising. Cable TV is losing…

Why a touch of secrecy can help creative work

3 min read

Though I am a long-time blogger and I spend most of my day talking or writing to other people… I am also quite secretive about the research that I am doing. There are reasons to be secretive that are bogus. The primary one is that you are afraid others might steal your ideas. That’s…

Enough with the intrusive updates!

4 min read

This weekend, I went to my gaming PC in my living room. The PC did not respond when I grabbed the mouse. Puzzled, I pressed the “on” button on the PC. Then I saw that Microsoft saw fit to update my PC while I wasn’t looking. I had configured this particular PC to my liking, and many of my…

Science and Technology links (April 22nd, 2018)

9 min read

You probably can’t write the two forms of the letter g, even if you have seen them thousands and thousands of times. Some neurodegenerative diseases might result from a fungal infection. This would include diseases like Parkinson’s. The theory seems to be that many of us get infected with…

Introducing GapminderVR: Data Visualization in Virtual Reality

3 min read

I am a big fan of sites such as Gapminder and Our World in Data. Such data visualization sites are like intellectual pornography. You want to know which countries are doing better? Which continents drink more alcohol? How is alcohol related to GDP? Have people been getting fatter recently, or is that a…

Iterating in batches over data structures can be much faster…

3 min read

We often need to iterate over the content of data structures. It is surprisingly often a performance bottleneck in big-data applications. Most iteration code works one value at a time… for value in datastructure { do something with value } There is a request to the data structure for a new…
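As a rough illustration of the alternative to one-value-at-a-time iteration, here is a minimal Python sketch in which the consumer asks for a buffer of values at once. The helper name `batches` and the batch size are assumptions chosen for the example. The sketch only shows the shape of the interface and does not by itself demonstrate a speedup.

```python
from itertools import islice

def batches(iterable, batch_size):
    """Yield lists of up to batch_size values from iterable."""
    it = iter(iterable)
    while True:
        # One request to the underlying iterator fills a whole buffer.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Usage: process values one buffer at a time.
total = 0
for batch in batches(range(10), 4):
    total += sum(batch)
```

The last batch may be shorter than `batch_size`, so the consumer should not assume full buffers.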

Science and Technology links (April 13th, 2018)

3 min read

Somewhat depressingly, there is very little evidence that you can improve people’s overall cognitive abilities: Although cognitive ability correlates with domain-specific skills—for example, smarter people are more likely to be stronger chess players and better musicians—there is…

For greater speed, try batching your out-of-cache data accesses

2 min read

In software, we use hash tables to implement sets and maps. A hash table works by first mapping a key to a random-looking address in an array. In a recent series of blog posts (1, 2, 3), I have documented the fact that precomputing the hash values often accelerates hash tables. Some people thought…
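To make the idea concrete, here is a toy Python sketch of a set that maps a key's hash to an array slot, together with a batched lookup that computes all hash values first and probes the table afterwards. The class `TinySet` and its methods are hypothetical names. The table never resizes, so the sketch assumes fewer keys than slots, and it makes no claim about actual performance.

```python
class TinySet:
    """Toy open-addressing set; assumes fewer keys than slots."""

    def __init__(self, capacity=16):
        # capacity is assumed to be a power of two
        self.slots = [None] * capacity

    def _index(self, h):
        # Map a hash value to a slot (the "random-looking address").
        return h & (len(self.slots) - 1)

    def add(self, key):
        i = self._index(hash(key))
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) & (len(self.slots) - 1)  # linear probing
        self.slots[i] = key

    def contains_hashed(self, key, h):
        # Lookup that reuses an already-computed hash value.
        i = self._index(h)
        while self.slots[i] is not None:
            if self.slots[i] == key:
                return True
            i = (i + 1) & (len(self.slots) - 1)
        return False

    def contains_batch(self, keys):
        # Phase 1: compute every hash value up front.
        hashes = [hash(k) for k in keys]
        # Phase 2: probe the table using the precomputed hashes.
        return [self.contains_hashed(k, h) for k, h in zip(keys, hashes)]

s = TinySet()
for k in ["a", "b", 3]:
    s.add(k)
found = s.contains_batch(["a", "z", 3])
```

Here `found` is `[True, False, True]`.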

Science and Technology links (April 7th, 2018)

2 min read

Mammals have a neocortex, some kind of upper layer on top of our ancestral brain. It is believed to be the key evolutionary trick that makes mammals smart. Yet birds have no cortex, but some of them (parrots and crows) are just as smart as monkeys. Thus some researchers conclude that a specific…

Caching hash values for speed (Swift-language edition)

2 min read

In my posts Should you cache hash values even for trivial classes? and When accessing hash tables, how much time is spent computing the hash functions?, I showed that caching hash values could accelerate operations over hash tables, sets and maps… even when the hash tables do not fit in CPU…