Daniel Lemire's blog

, 12 min read

Parsing JSON quickly: early comparisons in the wild

12 thoughts on “Parsing JSON quickly: early comparisons in the wild”

  1. Some of the python guys are already exposing interesting possibilities of ‘partial’ or lazy materialization of the JSON into the language. What they are doing at this stage is just suppressing the sometimes expensive construction of things in the language (we are still parsing everything under the hood). But it might later be a nice hook to explore lazy materialization in more detail.

    Small files are still a work in progress. I don’t want to break performance properties there, but I’m a bit skeptical that a workload that’s over in a microsecond or so is even properly measured by our infrastructure. That’s a good one for a HELP WANTED tag.

  2. along with the libraries in various languages, one popular json tool is jq (https://github.com/stedolan/jq).

    jq is part of many data processing pipelines, and a boost to jq’s processing speeds via simdjson would be a great thing to see.

    1. I agree, this seems worthwhile. Anyone wants to look at it?

  3. Oren Tirosh says:

    I wonder how it would perform if the 256 bit instructions were emulated by four (or more) non-simd 64 bit instructions.

    While not optimal, performance may be good compared to bytewise implementions. It still processes more bytes in parallel and has low data dependencies.

    1. I agree that it is an interesting question.

      1. In the paper, we provide a “model” (using linear regregression) and from this model it should be possible to reason about such things.

    2. Travis Downs says:

      Note that the width is only part of the story. Some SIMD instructions are just very slow to emulate with scalar instructions because there is no scalar instruction equivalent. Consider for example PSHUFB, which effectively acts as 32 parallel lookups in a 16-entry table, and no doubt features heavily in simdjson (I could be wrong, but let’s be realistic: we’re talking about Geoff “middle name PSHUFB” Langdale and Daniel “always looking for a new place to apply PSHUFB” Lemire here).

      Using 64-bit instructions, assuming a 64-bit PSHUFB existed, you can only process (look up) 8 bytes at a time, which is the 4x reduction due to width, but also now your lookup table is only 8 entries, so if you actually needed 16-entry tables, you may have to do two lookups and a bunch of merging or something like that. So this is a case where a width reduction has a quadratic rather than linear effect (it’s 4 * 2 not 4 * 4 because of the in-lane behavior of PSHUFB in AVX2 but that’s not fundamental).

      Even that scenario is only a fantasy though! There is of course no 64-bit scalar PSHUFB instruction: those kind of shuffle instructions generally only exist in the SIMD extensions in the first place. So how are you going to emulate PSHUFB with basic scalar operations you’d find in any ISA. It’s going to be really slow (unless there is some trick I’m missing).

      This is why SIMD algorithms often look very different to their scalar counterparts: it’s not just width that’s the difference – but rather whole operations are not available on both sides of the fence.

      This difference flows both ways too! There are scalar techniques that don’t scale to SIMD, such as many things involving loads (e.g., larger lookup tables) or loops with divergent control flow per iteration. There are also scalar operations not even available in SIMD on x86 like the carry-less multiply thing, 64×64->128 multiplication, pdep and pext and various other bit-bashing ops (although future AVX-512 extensions should add some of these).

  4. Travis Downs says:

    What’s the most efficient general purpose PSHUFB emulation with scalar instructions is actually an interesting question by itself…

    The baseline could be something like writing out the input bytes to memory, and then iterating over each 4-bit control in the shuffle mask using it to look up the corresponding byte in the output and storing it, (storing either byte-by-byte or in 64-bit chunks with some shift + OR to build the elements to store).

    So for 32-byte PSHUFB you are looking at 32 loads plus a bunch of other stuff, probably around 100 instructions at a minimum for this approach!

    1. It would be amusing to try to emulate PSHUFB with a shift and a conditional move. I think the load solution would be faster, though. You could merge 2 4-bit values together and look up a 256-way table, so only 16 loads. Also you might need to emulate the high-bit behavior.

      This is, however, silly. To substitute the use of PSHUFB in simdjson one would just look up 8-bit values in an 8-bit table with scalar code. This would be way faster than faking up SWAR replacements for SIMD.

      The SWAR approach to searching for particular bytes is reasonable – some silliness with carry, add and multiply and off you go. This was still worth doing in Hyperscan. There might be some arguably clever SWAR’isms for some of the other bits and pieces in simdjson, but I can’t imagine why you would want to do this.

      All that being said a scalar stage 1 would almost certainly perform best if built from scratch. I make a similar argument with ARM NEON.

  5. Travis Downs says:

    This is, however, silly. To substitute the use of PSHUFB in simdjson
    one would just look up 8-bit values in an 8-bit table with scalar
    code. This would be way faster than faking up SWAR replacements for
    SIMD.

    Agreed, although it still supports my original point: which is that it can be misleading to look at say 256-bit wide SIMD and then suppose a 4x drop in efficiency when moving to 64-bit scalar instructions, because some SIMD patterns suffer a much larger drop: in the case of single-byte lookups as you suggest, it’s roughly a 32x gap.

    So yeah, you’d not want to emulate PSHUFB directly in a scalar version of simdjson, but it did make me curious what the scalar emulation would be which is interesting in various scenarios like source-compatible emulation of SIMD intrinsics on platforms that don’t support them, like this.

  6. Twirrim says:

    I bought an cheap laptop to use while traveling, which comes with a low performance AMD A6-9225 CPU, and thought I’d give simdjson a shot, in part because I was curious about SSE performance.

    simdjson : 4.027 cycles per input byte (best) 4.156 cycles per input byte (avg) 0.644 GB/s (error margin: 0.020 GB/s)
    rapid : 9.478 cycles per input byte (best) 9.828 cycles per input byte (avg) 0.274 GB/s (error margin: 0.010 GB/s)
    sasjon : 9.025 cycles per input byte (best) 11.105 cycles per input byte (avg) 0.288 GB/s (error margin: 0.054 GB/s)
    simdjson (just parse) : 4.006 cycles per input byte (best) 4.103 cycles per input byte (avg) 0.648 GB/s (error margin: 0.015 GB/s)
    rapid (just parse) : 8.303 cycles per input byte (best) 10.494 cycles per input byte (avg) 0.312 GB/s (error margin: 0.065 GB/s)
    sasjon (just parse) : 7.871 cycles per input byte (best) 10.036 cycles per input byte (avg) 0.330 GB/s (error margin: 0.071 GB/s)
    simdjson (just dom) : 0.620 cycles per input byte (best) 0.726 cycles per input byte (avg) 4.175 GB/s (error margin: 0.609 GB/s)
    rapid (just dom) : 1.138 cycles per input byte (best) 1.510 cycles per input byte (avg) 2.278 GB/s (error margin: 0.562 GB/s)
    sasjon (just dom) : 0.610 cycles per input byte (best) 0.809 cycles per input byte (avg) 4.249 GB/s (error margin: 1.044 GB/s)

    Not bad at all, even on this slow chip it’s pushing 0.6GB/s, and seeing more than a 2x performance increase over the next fastest.

    1. Thanks for trying it out.