Daniel Lemire's blog

Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

4 min read

Most strings online are Unicode strings in the UTF-8 format. Other systems (e.g., Java, Microsoft) might prefer UTF-16. However, Latin 1 is still a common encoding (e.g., within JavaScript runtimes). Its relationship with Unicode is simple: Latin 1 includes the first 256 Unicode characters. It is…
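Because Latin 1 covers exactly the first 256 Unicode code points, a UTF-8 string can be converted only if every character is either ASCII (one byte) or a two-byte sequence whose lead byte is 0xC2 or 0xC3. The following is a minimal scalar sketch of that rule, not the AVX-512 routine from the post; the function name `utf8_to_latin1` is hypothetical.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Scalar sketch: transcode UTF-8 bytes to Latin 1, or raise ValueError."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII byte: identical in UTF-8 and Latin 1.
            out.append(b)
            i += 1
        elif (
            b in (0xC2, 0xC3)
            and i + 1 < len(data)
            and (data[i + 1] & 0xC0) == 0x80
        ):
            # Two-byte sequence encoding a code point in U+0080..U+00FF.
            out.append(((b & 0x03) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            # Invalid UTF-8, or a code point above U+00FF.
            raise ValueError(f"byte at offset {i} is not representable in Latin 1")
    return bytes(out)
```

For instance, `utf8_to_latin1("café".encode("utf-8"))` yields the four bytes `b"caf\xe9"`, while a string containing `€` is rejected.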

Coding of domain names to wire format at gigabytes per second

5 min read

When you enter the domain name lemire.me in your browser, it eventually gets encoded into a so-called wire format. The name lemire.me contains two labels, one of length 6 (lemire) and one of length 2 (me). The wire format starts with 6lemire2me: that is, imagining that the name starts with an…
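The length-prefixed layout described above can be sketched in a few lines. This is a plain scalar illustration, not the fast routine from the post; the function name `to_wire_format` is hypothetical, and the 63-byte label limit and the terminating zero byte come from the DNS specification (RFC 1035) rather than from the excerpt.

```python
def to_wire_format(name: str) -> bytes:
    """Encode a dotted domain name as length-prefixed labels."""
    out = bytearray()
    # A single trailing dot (fully qualified name) is tolerated.
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"invalid label length: {len(raw)}")
        out.append(len(raw))  # one length byte, then the label itself
        out += raw
    out.append(0)  # zero-length root label ends the name (RFC 1035)
    return bytes(out)
```

With this sketch, `to_wire_format("lemire.me")` returns `b"\x06lemire\x02me\x00"`: the length bytes 6 and 2 are stored as raw byte values, not as the ASCII digits.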

Science and Technology links (August 6 2023)

2 min read

In an extensive study, You et al. (2022) found that meat consumption was correlated with higher life expectancies: Meat intake is positively correlated with life expectancies. This relationship remained significant when influences of caloric intake, urbanization, obesity, education and…

Decoding base16 sequences quickly

4 min read

We sometimes represent binary data using the hexadecimal notation. We use a base-16 representation where the first 10 digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and where the following digits are A, B, C, D, E, F (or a, b, c, d, e, f). Thus each character represents 4 bits. A pair of characters can…
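Since each character carries 4 bits, decoding amounts to a table lookup per character and one shift per pair. The following is a minimal scalar sketch under that reading; the names `HEX_VALUES` and `decode_base16` are hypothetical, and the post itself is concerned with much faster approaches.

```python
# Map each hexadecimal digit (either case) to its 4-bit value.
HEX_VALUES = {c: v for v, c in enumerate("0123456789abcdef")}
HEX_VALUES.update({c.upper(): v for c, v in list(HEX_VALUES.items())})


def decode_base16(text: str) -> bytes:
    """Decode a string of hexadecimal digits into bytes."""
    if len(text) % 2 != 0:
        raise ValueError("expected an even number of hexadecimal digits")
    out = bytearray()
    for i in range(0, len(text), 2):
        try:
            high = HEX_VALUES[text[i]]
            low = HEX_VALUES[text[i + 1]]
        except KeyError:
            raise ValueError(f"invalid hexadecimal digit near offset {i}") from None
        # The first character of the pair holds the most significant 4 bits.
        out.append((high << 4) | low)
    return bytes(out)
```

For example, `decode_base16("4A6f")` returns `b"Jo"`.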

Science and Technology links (July 23 2023)

1 min read

People increasingly consume ultra processed foods. They include energy drinks, mass-produced packaged breads, margarines, cereal, energy bars, fruit yogurts, fruit drinks, vegan meat and cheese, infant formulas, pizza, chicken nuggets, and so forth. Ultra processed foods are correlated with poorer…

Fast decoding of base32 strings

3 min read

We often need to encode binary data into ASCII strings (e.g., email). The standards to do so include base16, base32 and base64. There are some research papers on fast base64 encoding and decoding: Base64 encoding and decoding at almost the speed of a memory copy and Faster Base64 Encoding and…

Science and Technology links (July 16 2023)

2 min read

Most people think that they are more intelligent than average. Lack of vitamin C may damage the arteries. Make sure you have enough! A difficult problem in software is caching. Caching is the idea that you keep some values in fast memory. But how do you choose which values to keep? A standard…

Recognizing string prefixes with SIMD instructions

6 min read

Suppose that I give you a long list of string tokens (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether…

Stealth, not secrecy

3 min read

The strategy for winning is simple: do good work and tell the world about it. In that order! This implies some level of stealth as you are doing the good work. If you plan to lose weight, don’t announce it… lose the weight and then do the reveal. Early feedback frames the problem and might…

Packing a string of digits into an integer quickly

3 min read

Suppose that I give you a short string of digits, possibly containing spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the strings matches the ordering of the integers. We…
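The packing in the example gives each decimal digit its own 4-bit nibble and skips everything else. The following is a straightforward scalar sketch of that idea, not the fast version from the post; the function name `pack_digits` is hypothetical, and the order-preserving property assumes that the compared strings contain the same number of digits.

```python
def pack_digits(text: str) -> int:
    """Pack the decimal digits of `text` into an integer, one nibble per digit."""
    value = 0
    for ch in text:
        if "0" <= ch <= "9":
            # Shift earlier digits up by 4 bits and append the new digit.
            value = (value << 4) | (ord(ch) - ord("0"))
        # Spaces and other non-digit characters are ignored.
    return value
```

With the string from the excerpt, `pack_digits("20141103 012910")` equals `0x20141103012910`.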

Having fun with string literal suffixes in C++

2 min read

The C++11 standard introduced user-defined string suffixes. It also added regular expressions to the C++ language as a standard feature. I wanted to have fun and see whether we could combine these features. Regular expressions are useful to check whether a given string matches a pattern. For…

Parsing time stamps faster with SIMD instructions

3 min read

In software, it is common to represent time as a time-stamp string. It is usually specified by a time format string. Some standards use the format %Y%m%d%H%M%S meaning that we print the year, the month, the day, the hours, the minutes and the seconds. The current time as I write this blog post…

Dynamic bit shuffle using AVX-512

2 min read

Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word. This question was raised on Twitter by @experquisite. Formally, you might want to provide, for each of the 64 bit positions, an original bit position you want to copy. Hence, the following code would reverse the bit order in…
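To make the problem statement concrete, here is a scalar reference sketch of such a shuffle: bit `i` of the result is copied from bit `indexes[i]` of the input. It is not the AVX-512 approach from the post; the function name `bit_shuffle` and the `indexes` parameter are hypothetical.

```python
from typing import Sequence


def bit_shuffle(word: int, indexes: Sequence[int]) -> int:
    """Return a 64-bit word whose bit i is bit indexes[i] of `word`."""
    if len(indexes) != 64:
        raise ValueError("expected one source position per output bit")
    result = 0
    for target, source in enumerate(indexes):
        if not 0 <= source < 64:
            raise ValueError(f"source position out of range: {source}")
        bit = (word >> source) & 1  # extract the source bit
        result |= bit << target     # place it at the target position
    return result
```

Under this convention, passing `[63 - i for i in range(64)]` as `indexes` reverses the bit order of the word.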

Science and Technology links (June 25 2023)

2 min read

Women in highly religious relationships report the highest levels of relationship quality. US politics is largely divided into two parties (Republicans and Democrats). People who are affiliated with the Republicans have many more kids. The Antarctic ice shelves gained 661 gigatons of ice over the…

Citogenesis in science and the importance of real problems

6 min read

Scientists publish papers in refereed journals and conferences: they write up their results and we ask anonymous referees to assess them. If the work is published, presumably because the anonymous referees found nothing objectionable, the published paper joins the “literature”. It is not a strict…

Science and Technology links (June 11 2023)

2 min read

Similar species can have vastly different lifespans. Researchers have been looking for the limiting factors that explain these differences. As we age, our genes are expressed differently through methylation. Different species vary their methylation at different speeds. There is some evidence that…