, 4 min read
SWAR explained: parsing eight digits
It is common to want to parse long strings of digits into integer values. Because it is a common task, we want to optimize it as much as possible.
In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integers (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post.
To recap, the long way is a simple loop:
uint32_t parse_eight_digits(const unsigned char *chars) {
uint32_t x = chars[0] - '0';
for (size_t j = 1; j < 8; j++)
x = x * 10 + (chars[j] - '0');
return x;
}
We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step in the loop, we multiple the running value by 10 and add the value of the next digit.
It assumes that all characters are in the valid range (from ‘0’ to ‘9’): other code should check that it is the case.
An optimizing compiler will probably unroll the loop and produce code that might look like this in assembly:
movzx eax, byte ptr [rdi]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 1]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 2]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 3]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 4]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 5]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 6]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 7]
lea eax, [rcx + 2*rax]
add eax, -533333328
Notice how there are many loads, and a whole lot of operations.
We can substantially shorten the resulting code, down to something that looks like the following:
imul rax, qword ptr [rdi], 2561
movabs rcx, -1302123111085379632
add rcx, rax
shr rcx, 8
movabs rax, 71777214294589695
and rax, rcx
imul rax, rax, 6553601
shr rax, 16
movabs rcx, 281470681808895
and rcx, rax
movabs rax, 42949672960001
imul rax, rcx
shr rax, 32
How do we do it? We use a technique called SWAR which stands for SIMD within a register. The intuition behind is that modern computers have 64-bit registers. Processing eight consecutive bytes as eight distinct words, as in the native code above, is inefficient given how wide our registers are.
The first step is to load all eight characters into a 64-bit register. In C, you might do it in this manner:
int64_t val;
memcpy(&val, chars, 8);
It looks maybe expensive, but most compilers will translate the memcpy instruction into a single load, when compiling with optimizations turned on.
Computers store values in little-endian order. This means that the first byte you encounter is going to be used as the least significant byte, and so forth.
Then we want to subtract the character ‘0‘ (or 0x30
in hexadecimal). We can do it with a single operation:
val = val - 0x3030303030303030;
So if you had the string ‘12345678‘, you will now have the value 0x0807060504030201.
Next we are going to do a kind of pyramidal computation. We add pairs of successive bytes, then pairs of successive 16-bit values and then pairs of successive 32-bit bytes.
It goes something like this, suppose that you have the sequence of digit values b1, b2, b3, b4, b5, b6, b7, b8. You want to do…
- add pairs of bytes: 10b1+b2, 10b3+b4, 10b5+b6, 10b7+b8
- combine first and third sum: 1000000*(10b1+b2) + 100(10*b5+b6)
- combine second and fourth sum: 10b7+b8 + 10000(10*b3+b4)
I will only explain the first step (pairs of bytes) as the other two steps are similar. Consider the least significant two bytes, which have value 256b2 + b1. We multiply it by 10, and we add the value shifted by 8 bits, and we get b1+10b2 in the least significant byte. We can compute 4 such operations in one operation…
val = (val * 10) + (val >> 8);
The next two steps are similar:
val1 = (((val & 0x000000FF000000FF) * (100 + (1000000ULL << 32)));
val2 = (((val >> 16) & 0x000000FF000000FF)
* (1 + (10000ULL << 32))) >> 32;
And the overall code looks as follows…
uint32_t parse_eight_digits_unrolled(uint64_t val) {
const uint64_t mask = 0x000000FF000000FF;
const uint64_t mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
const uint64_t mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
return val;
}
Appendix: You can do much the same in C# starting with a byte
pointer (byte* chars):
ulong val = Unsafe.ReadUnaligned<ulong>(chars);
const ulong mask = 0x000000FF000000FF;
const ulong mul1 = 0x000F424000000064;
// 100 + (1000000ULL << 32)
const ulong mul2 = 0x0000271000000001;
// 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;