We released simdjson 0.3: the fastest JSON parser in the world is even better!
30 thoughts on “We released simdjson 0.3: the fastest JSON parser in the world is even better!”
Congratulations are de rigueur here.
I guess the next challenge will be an API to speed up YAML parsing. YAML files are an important part of deploying PyTorch on most platforms, so it would be worth seeing whether you can easily adapt this library to that type of parsing.
Are there any SIMD-accelerated XML parsers?
There have been some XML parsers, at least at the prototype level, that have used SIMD instructions deliberately. However, I do not think that there has ever been something like simdjson for XML.
The Parabix XML project.
Yes, we know of parabix, but no, it is nothing like simdjson for XML.
And? How does it compare? It looks innovative. Is simdjson’s approach applicable to XML too?
Please see the paper, where it is discussed in detail. If you have questions after reading the paper, I will be happy to answer them.
Okay, I read it. Very nice work.
Q1: Why do you use 8 bytes to encode things like null, true, false, etc.? Couldn’t you just use one byte, or even a few bits? After all, there’s a null codepoint in ASCII / UTF-8 — it’s the first one, the binary zero byte 0000 0000. Do you need everything to be eight bytes for some reason?
Q2: Why are you focused only on huge files, over 50 KB? For JSON that’s huge. The most common use of JSON is slinging around requests and responses on the web, with small payloads. For example, a REST API for payments might involve JSON payloads that are about 1 to 3 KB each. (GraphQL, like REST, also uses JSON.) See PayPal’s API, or here’s an example of a typical request payload: https://doc.gopay.com/en/?lang=shell#standard-payment
Q3: What do you do if you can’t use tzcnt to count trailing zeros? That instruction is part of BMI, which came out in Haswell. Ivy Bridge and Sandy Bridge won’t have it, and there are still a lot of servers running on those families. How far back do you go on SIMD? Is something like Westmere or Nehalem your floor? They would have SSE4.2, carryless multiplication, and I think AES.
You might be understating simdjson performance with all those number-heavy files, since number parsing should be your slowest.
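To make Q3 concrete, here is a minimal sketch of counting trailing zeros without tzcnt, using plain shifts and masks. The function name is hypothetical and this is not taken from simdjson; it only illustrates that a portable fallback is possible. It returns 64 for a zero input, which is the convention tzcnt uses for a 64-bit operand.

```cpp
#include <cstdint>

// Hypothetical portable fallback (not simdjson code): binary search for the
// lowest set bit. Requires C++14 for the multi-statement constexpr function.
constexpr int trailing_zeros_portable(std::uint64_t x) {
  if (x == 0) return 64;  // same convention as 64-bit tzcnt
  int n = 0;
  if ((x & 0xFFFFFFFFu) == 0) { n += 32; x >>= 32; }
  if ((x & 0xFFFFu) == 0)     { n += 16; x >>= 16; }
  if ((x & 0xFFu) == 0)       { n += 8;  x >>= 8; }
  if ((x & 0xFu) == 0)        { n += 4;  x >>= 4; }
  if ((x & 0x3u) == 0)        { n += 2;  x >>= 2; }
  if ((x & 0x1u) == 0)        { n += 1; }
  return n;
}
```

On x86, the older bsf instruction covers the nonzero case as well, so pre-Haswell processors are not left without a hardware option.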
FYI, I opened an issue on GitHub asking about CPU and memory overhead. That’s an important dimension that the paper and the website don’t address. It’s also important to know if it causes cores to throttle down when you use AVX2 or whatever. I think Skylake and Cascade Lake might be okay on that front, but there might be an issue using AVX or AVX2 on earlier families. If so, using simdjson would slow down all the other applications and workloads on the server. I know that AVX512 throttles cores, but I don’t remember about AVX2.
The tape uses a flat 8-byte per element, with some exceptions (numbers use two 8-byte entries).
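As a rough illustration of a flat 8-byte tape, the sketch below packs a one-byte type tag and a 56-bit payload into each 64-bit word, and stores a number as two words. The exact bit layout, the tag character 'l', and all function names are assumptions made for this example; consult the simdjson documentation for the real format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout for illustration: top 8 bits = type tag, low 56 bits = payload.
constexpr std::uint64_t kPayloadMask = (std::uint64_t{1} << 56) - 1;

inline std::uint64_t make_tape_entry(char type, std::uint64_t payload) {
  assert((payload & ~kPayloadMask) == 0);  // payload must fit in 56 bits
  return (static_cast<std::uint64_t>(static_cast<unsigned char>(type)) << 56) | payload;
}

inline char tape_type(std::uint64_t entry) {
  return static_cast<char>(entry >> 56);
}

inline std::uint64_t tape_payload(std::uint64_t entry) {
  return entry & kPayloadMask;
}

// A number takes two words: a tag word, then the raw 64-bit value.
inline void append_int64(std::vector<std::uint64_t>& tape, std::int64_t value) {
  tape.push_back(make_tape_entry('l', 0));  // 'l' is a hypothetical tag
  tape.push_back(static_cast<std::uint64_t>(value));
}
```

Fixed-width entries mean any element can be reached by index without decoding its neighbours, which is one plausible reason to spend a full word even on null, true, and false.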
That is what the paper benchmarks. But you can find other results on GitHub, including on tiny files.
The simdjson library relies on runtime dispatching. It runs on every x64 processor under a 64-bit system. It is open source.
We do not support AVX-512. No downclocking is expected: we do not use AVX2 instructions requiring it (e.g., FMA). But, in any case, as the user, you are in charge of the kernel that runs, so you can select SSSE3 if you prefer, even when your system supports AVX2. The simdjson library has a no-allocation policy for parsing, so you can parse terabytes of data without allocating memory.
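The general idea of runtime dispatching with a user override can be sketched as follows. This is not simdjson's API: the enum, the function names, and the three-tier ordering are assumptions for illustration. The CPU check uses the GCC/Clang builtin `__builtin_cpu_supports`, and falls back to the generic kernel on other compilers or architectures.

```cpp
// Hypothetical kernel tiers, ordered from least to most demanding.
enum class Kernel : int { Fallback = 0, Ssse3 = 1, Avx2 = 2 };

// Ask the CPU at run time what it supports (GCC/Clang on x86 only).
inline Kernel detect_best_kernel() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return Kernel::Avx2;
  if (__builtin_cpu_supports("ssse3")) return Kernel::Ssse3;
#endif
  return Kernel::Fallback;
}

// The user may request a lower tier than the best one, never a higher one.
constexpr Kernel choose_kernel(Kernel best, Kernel requested) {
  return requested < best ? requested : best;
}
```

A caller would compute `choose_kernel(detect_best_kernel(), Kernel::Ssse3)` once at startup and route all parsing through the selected implementation.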
Okay, so on the issue of CPU overhead you said on GitHub that speed is CPU overhead or something. That’s not quite right. The equation is:
Overhead = (JSON GB / Parsing speed) × CPU usage
Where CPU usage is the percentage of CPU. There could also be a form that uses CPU clock cycles per byte or something. It’s not enough to know how fast some software is – we normally have to know its cost in resources like CPU and memory. It looks like you’re good on memory since you don’t allocate, but I’m surprised that you’re unwilling to report CPU overhead.
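One way to read the formula above is as CPU-seconds consumed. The sketch below works it through with made-up numbers (10 GB of JSON, 2.5 GB/s, one fully busy core, a 3 GHz clock); none of these figures are measurements of simdjson, and the function names are hypothetical.

```cpp
// CPU-seconds = (data size / parsing speed) * fraction of a core in use.
constexpr double cpu_seconds(double json_gb, double speed_gb_per_s, double cpu_fraction) {
  return json_gb / speed_gb_per_s * cpu_fraction;
}

// Alternative form mentioned in the comment: clock cycles spent per input byte.
constexpr double cycles_per_byte(double clock_hz, double bytes_per_s) {
  return clock_hz / bytes_per_s;
}
```

With the illustrative numbers, `cpu_seconds(10.0, 2.5, 1.0)` gives 4 CPU-seconds, and `cycles_per_byte(3.0e9, 2.5e9)` gives about 1.2 cycles per byte. For a single-threaded parser that keeps one core busy, the first figure is simply the elapsed parsing time.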
Another suggestion.
It would be great to know some things about simdjson’s security properties, to have some basic assurances. This is a strange era for computing given how insecure and primitive our programming languages and tools are. C++ is an unsafe language where exploitable memory bugs are inevitable on medium to large projects. One light assurance would be if you follow the C++ Core Guidelines. Much stronger assurance would be to pass something like Coverity Scan. It’s free for open source projects.
A JSON parser might parse untrusted input, which can be malformed or not JSON at all. Ideally a parser would be formally verified, but hardly anyone does that since popular programming languages like C++ aren’t designed to facilitate verification and the tooling sucks. So Coverity is about as good as it gets. Address Sanitizer and Memory Sanitizer in the LLVM project are interesting too. The Software Engineering Institute at Carnegie Mellon has a set of Secure C++ Coding Guidelines too: https://insights.sei.cmu.edu/sei_blog/2017/04/cert-c-secure-coding-guidelines.html
If you had some kind of formal assurances like those it would be pretty distinctive.
I’m surprised that you’re unwilling to report CPU overhead.
If you have interesting performance metrics you would like to propose, we are always inviting new pull requests. The simdjson library is a community-based project. Please write up some code, and we shall be glad to discuss it.
Awesome work! When I looked at integrating the previous version into a higher level parser, it didn’t look like it handled streaming. Was that impression correct and if so, has that changed? Thanks!
We do handle long inputs containing multiple JSON documents (e.g., line separated). We even have a nifty API for it (see “parse_many”).
If you mean streaming as in “reading from a C++ istream”, then, no, we do not support this and won’t. It is too slow. We are faster than getline applied to an in-memory istream.
Don’t care about C++ istream. In order to stream, the parser must be able to deal with partial/incomplete inputs and with resuming such an incomplete parse.
Feeding it is then Somebody Else’s Problem.
One of the things I learned early on in building high-perf components is that having the component itself be fast is (at most) half the bette. The crucial bit is that it must be possible, preferably easy/straightforward, to use it in such a way that the whole ensemble is fast.
A lot of the “fast” XML parsers tended to fall flat in that regard.
battle, of course. I blame autocorrect…
The way simdjson is currently designed is that it won’t let you access a document (at all) unless it has been fully validated. The rationale behind this is that many people do not want to start ingesting documents that are incorrect. And, of course, you can only know that a document is valid if you have seen all of it.
For line-separated JSON documents, it is not an issue because you get to see the whole JSON document before returning it to the user, it is just that you have a long stream of them.
We plan to offer more options in future releases.
The parsing speed is impressive, great work.
I second Marcel’s point. The current interface works well if a processing pipeline starts with a file. However, the parser cannot be used in the middle of a pipeline in a larger system. Without support for streaming input, materializing large intermediate results clogs the flow. Downloading a file from cloud storage and user-defined document transformations are common scenarios here.
The parser would not have to output incorrect or incomplete documents. It would wait for another chunk of input to continue parsing a document that is in-flight.
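For the line-separated case, the "wait for another chunk" behaviour can be approximated outside the parser. The sketch below is a hypothetical helper, not part of simdjson: it accepts chunks of arbitrary size, holds back any incomplete tail, and releases only whole documents, which could then be handed to a validating parser. It assumes each document sits on a single line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: buffers chunks and returns only complete,
// newline-terminated documents (without the trailing '\n').
class LineDocumentBuffer {
public:
  std::vector<std::string> feed(std::string_view chunk) {
    std::vector<std::string> complete;
    pending_.append(chunk);
    std::size_t start = 0;
    for (std::size_t nl; (nl = pending_.find('\n', start)) != std::string::npos;
         start = nl + 1) {
      if (nl > start) complete.emplace_back(pending_, start, nl - start);  // skip empty lines
    }
    pending_.erase(0, start);  // keep only the unfinished tail
    assert(pending_.find('\n') == std::string::npos);
    return complete;
  }

  // Bytes of the in-flight document received so far.
  const std::string& pending() const { return pending_; }

private:
  std::string pending_;
};
```

This only moves the boundary problem in front of the parser. A single very large document would still be buffered whole, which is the limitation the comment above is pointing at.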
You write “materializing large intermediate”, and with that constraint, I agree. But be mindful that large means “out of cache”, and we have megabytes of cache on current processor cores. For small to medium files, querying cache lines through an interface is an anti-design.
Note that we have since released version 0.6 which introduces a new API that we call On Demand API. So this blog post is somewhat obsolete at this point.
Minor inconsistency in the example: the average tire pressure of the cars will not be what one would expect (only half of it)!
Any plans for Rust bindings?
There are Rust bindings, but help is needed to get them updated:
https://github.com/SunDoge/simdjson-rust
Congrats !
What about a MOOC/lecture on parsing with C++?
@Catherine
I have a talk on YouTube, does that count?
https://www.youtube.com/watch?v=wlvKAT7SZIQ
Thanks. Great talk and aha moments.
By the way, it would be neat to have an ultrafast SIMD JSON minifier, something very light in terms of CPU and memory use.
This would presumably be much simpler than a parser, since all it would have to do is strip spaces, tabs, newlines, and carriage returns. Well, it would have to know not to touch the contents of quoted strings.
There’s an enormous amount of waste with all the unminified JSON people are slinging around. You can save 10% most of the time by minifying, but there aren’t any good minifiers out there.
But we do have that!!!! It is part of simdjson.
Oh nice! Does it minify by default, or is it a flag?
It is a function that you may call on a JSON string. It does not parse. It is highly optimized.
It is not currently very well exposed or documented, since it has been updated to be multiplatform only recently.
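The rule discussed in this thread (drop spaces, tabs, newlines, and carriage returns, but never inside quoted strings) can be written as a short scalar loop. This is a byte-at-a-time sketch with a hypothetical function name, not simdjson's SIMD minifier and not its API. Like the function described above, it does not parse or validate its input.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical scalar minifier: copies every byte except whitespace that
// appears outside of string literals. Input is assumed to be valid JSON.
inline std::string minify_scalar(std::string_view json) {
  std::string out;
  out.reserve(json.size());
  bool in_string = false;
  bool escaped = false;  // previous byte was an unescaped backslash
  for (char c : json) {
    if (in_string) {
      out.push_back(c);
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
      out.push_back(c);
    } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
      out.push_back(c);
    }
  }
  assert(out.size() <= json.size());  // minifying never grows the input
  return out;
}
```

The escape tracking is the part that is easy to get wrong: a quote preceded by a backslash stays inside the string, while a quote preceded by an escaped backslash ends it.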
How’s the performance on mobile? E.g., Android and iOS devices.
I’m currently using rapidjson for a library that’s used on mobile devices and wondering if I should move over to simdjson, if it’s faster and easier to use.
We support 64-bit ARM platforms with accelerated kernels. See https://lemire.me/blog/2019/08/01/a-new-release-of-simdjson-runtime-dispatching-64-bit-arm-support-and-more/