Daniel Lemire's blog

, 6 min read

Coding of domain names to wire format at gigabytes per second

8 thoughts on “Coding of domain names to wire format at gigabytes per second”

  1. KWillets says:

    Back when we were trying to validate UTF-8, I wracked my brain on how to get correct jump distances like this.

    The trick is to exponentiate a permutation of the following form:

    P[i] = i if position i has a dot

    P[i] = i+1 if not

    In SIMD terms that’s a constant vector plus the comparison mask:

    P=
    1 2 3 4 5 6 7 8 + -1 0 -1 0 0 0 0 -1
    = 0 2 2 4 5 6 7 7

    P^2 = 0 2 2 5 6 7 7 7 7

    P^4 = 0 2 2 7 7 7 7 7

    P has a fixed point at each dot, and the other locations fill from the right. The limit is an eigenvector under P.

    1. Damn it. Let me try.

      1. Ok. I think it is applicable, but it requires more work than I expected. I will need to come back to it.

    2. Noah goldstein says:

      What operation is it for P^2 / P^4. Seems to be neither power nor xor.

      1. KWillets says:

        P^2 means pshufb(P,P); sorry for the rough notation. That’s the shuffle mask equivalent to applying P twice (shuffle masks are composable, subject to certain constraints such as out-of-range entries being replaced by 0).

  2. Jeroen Koekkoek says:

    Only a minor remark, but the wire format for domain names is slightly different. Given lemire.me, the name contains three labels, one of length 6 (lemire), one of length 2 (me) and the root label. The correct wire format is 6lemire2me0 (in URLs the trailing dot is often implicit).

    Additional background info: A resolver starts looking up information by first querying the . (root server), which responds with the name server for .me, which responds with the location for lemire.me..

    1. Thanks. The blog post does not describe the final zero. However, the code does produce a trailing zero. In the case where the input is lemire.me, you get the following (see unit tests):

      input:
      6c 65 6d 69 72 65 2e 6d 65
      name_to_dnswire_simd output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      name_to_dnswire output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      name_to_dnswire_scalar_labels output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      name_to_dnswire_avx output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      name_to_dnswire_idx_avx output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      
      1. Isaac Kabel says:

        The non-simd function works in the test because it has a final *counter = 0; whereas the simplified code in the article excludes it and is wrong as a result.

        I sympathize with excluding the label length validation and is_name_char implementation, both clearly for brevity’s sake, but the null terminator is a matter of correctness.

        I (and presumably Jeroen too) would have found your introduction easier to understand since I think of the DNS name wire format the other way (dots are separators between labels, there’s a trailing empty label, labels are serialized with a length prefix byte). From your description, I figured you’d have to add the null terminator after the loop, so when I didn’t see it in your code, I mentally desk checked it to make sure I wasn’t misunderstanding.