20% doesn’t seem like a lot considering how the marketing copy for the new C++17 number conversion routines claimed that “[if] your serialization code is bottlenecked by floating-point printing, this will accelerate your code by roughly 3x to 30x (yes, times, not percent).” https://isocpp.org/blog/2020/08/cppcon-2019-floating-point-charconv-making-your-code-10x-faster-with-cpp17s
Also not obvious at first blush is why the standard (at least as quoted by cppreference) doesn’t guarantee interoperability when both sides use IEEE754 floating point: “The guarantee that std::from_chars can recover every floating-point value formatted by to_chars exactly is only provided if both functions are from the same implementation.” Why did they leave this wiggle room (versus stating a requirement about the std::numeric_limits<> of the types on the two platforms)? How could Linux and Windows, or Windows-on-x86 and Windows-on-x86_64, fail to round trip yet each individually conform to the standard? https://en.cppreference.com/w/cpp/utility/to_chars
“Also not obvious at first blush is why the standard (at least as quoted by cppreference) doesn’t guarantee interoperability when both sides use IEEE754 floating point.”
I take it that it is a pessimistic view.
Konrad Rudolph says:
Printing ≠ parsing!
Printing became an order of magnitude faster because itʼs using brand new algorithms to print floating point numbers.
Parsing “only” omits locale handling, so thereʼs a limit to how much faster it could get.
Ivan Bobev says:
20% performance boost? To be correct, this is a little above 14%, not 20%. And yes, it is a meaningful increase, but let us be correct.
Oh, I’m a complete idiot. It is 16.6%. 😀
I don’t think you are an idiot.
I first just calculated the decrease from the upper value to the lower one, not the increase from the lower to the upper.
Take into account that both values (1.40 and 1.20) were rounded to two significant digits, and add the fact that the measurement has some uncertainty to begin with (~3%)… So I need to round the ratio to one significant digit. Hence 20% is fair.
140 / 120 − 1 ≈ 0.17
Charles Bloom says:
It would be interesting to see the speed vs. the classic good open-source strtod from “dtoa.c”; historically the MSVC CRT strtod has been about 10X slower than “dtoa.c”, so I’d be more pleased if they fixed their 10X slowdown. (The MUSL CRT strtod was 100X slower.)
I think you mean “atof”. I think atof is legacy at this point.
Charles Bloom says:
I don’t understand where I could have meant “atof”?
dtoa.c (.c indicates a source code file) is the classic gold standard for a good strtod that is both accurate and fast. It can be updated with modern instructions (64-bit registers and clz) to be even faster.
http://www.ampl.com/netlib/fp/dtoa.c
Rough order of magnitude, 140 MB/s looks about 10X too slow for strtod.
On most doubles, strtod only has to do *= 10, += digit, so it should be only a few cycles per digit, not hundreds of cycles per digit. Doubles that require more work than that are very rare.
dtoa.c averages around 2 cycles per byte so these times look very slow indeed.
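To make the “few cycles per digit” claim concrete, the common case is essentially a digit-accumulation loop along these lines (a rough sketch, ignoring sign, exponent handling, and overflow; not dtoa.c’s actual code):

#include <cstdint>
#include <cstdio>

// Accumulate decimal digits into an integer, tracking how many digits
// follow the '.', then scale at the end. Not correctly rounded in general.
double naive_parse(const char *p) {
  std::uint64_t mantissa = 0;
  int frac_digits = 0;
  bool seen_dot = false;
  for (; *p; ++p) {
    if (*p == '.' && !seen_dot) { seen_dot = true; continue; }
    if (*p < '0' || *p > '9') break;
    mantissa = mantissa * 10 + (std::uint64_t)(*p - '0');  // the hot loop
    if (seen_dot) ++frac_digits;
  }
  double d = (double)mantissa;
  while (frac_digits-- > 0) d /= 10.0;
  return d;
}

int main() { std::printf("%.17g\n", naive_parse("0.76225030666246883")); }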
Travis Downs says:
I added the netlib strtod that Charles linked and ran it unmodified (except for changing the name), and it provides a modest speedup:
# parsing random integers in the range [0,1)
volume = 2.09808 MB
strtod : 125.83 MB/s (+/- 1.4 %)
netlib : 174.48 MB/s (+/- 4.1 %)
This works out to about 15 cycles per byte on my machine, considerably more than the 2 Charles suggests it should achieve.
Looking at the main loop, it seems like it is not going to get to 2 cycles (there are at least one or two mispredicts every time a 0 appears), but 15 does seem like too much. The compiler does a poor job in various places too.
Thank you, Travis.
If you are interested in it, I can send a PR.
Hi everyone,
The “random” part of the benchmark is why this is so far off from Charles’s 10x guess. (Travis: I assume you didn’t mean “random integers in [0, 1)” since there aren’t a lot of those to go around.)
The standard fast path (which is in David Gay’s code) for correctly rounded decimal to double conversion is when you have 15 significant decimal digits (or 16 significant digits <= 2^53). Then you have an exactly represented integer (from the parsed decimal digits) divided by an exactly represented power of ten. That is a single IEEE-754 operation and hence correctly rounded. (You can also support ‘e’ on the fast path with a branch to either divide or multiply by a power of ten as long as the net power of ten is exactly representable.)
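For concreteness, that fast path can be sketched roughly as follows (a hypothetical power-of-ten table and helper, not David Gay’s actual code; the 16-digit ≤ 2^53 case is omitted):

#include <cstdint>
#include <cstdio>

// Powers of ten up to 10^22 are exactly representable as doubles.
static const double exact_pow10[] = {
  1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10,
  1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
};

// Fast path: 'digits' holds at most 15 parsed decimal digits (so it is an
// exactly representable integer) and |exp10| <= 22. One multiply or divide,
// hence one rounding, hence a correctly rounded result.
bool fast_path(std::uint64_t digits, int exp10, double *out) {
  if (digits >= 1000000000000000ULL) return false;   // more than 15 digits
  if (exp10 < -22 || exp10 > 22) return false;        // power of ten not exact
  double d = (double)digits;                           // exact conversion
  *out = exp10 >= 0 ? d * exact_pow10[exp10] : d / exact_pow10[-exp10];
  return true;
}

int main() {
  double d;
  if (fast_path(123456789012345ULL, -15, &d)) std::printf("%.17g\n", d);
}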
If you’re parsing hand-written double literals in a programming language parser or calculator program then you’re almost never going to need more than 15 digits and hence the fast path will be hit almost all the time. On the other hand, there aren’t really enough hand-written double literals in a typical source file for the performance to be a bottleneck.
Where performance really matters is when you’re trying to round-trip large arrays of doubles, and they will often have noise in the less significant digits (from accumulated round-off unrelated to any serialization issues) which can push you off the fast path most of the time.
A relevant factor for hitting the fast path is whether the doubles you’re reading were printed with the minimum number of required digits. In Daniel’s data generator in the benchmark, his code appears to be printing the minimum (at most 17 digits for doubles). But if we assume a uniform distribution on [0, 1) and a fast path that only works with 15 digits, I think that means only 1% of cases will hit the fast path.
Travis Downs says:
From what I can tell most of the time is spent simply parsing the characters into an integer, not any slow path associated with more than 15 digits.
That is, the loop that does:
result *= 10 + *str - '0'
It does a bit more than that because it handles zeros specially (I suppose to avoid overflow in the case of many trailing zeros).
This parsing would need to happen regardless of 15 or 17 digits, so I’m not seeing the slow path unless it’s related to the trailing zero handling I mentioned.
I did make some tweaks to the netlib algo and got a good speedup to around 600 MB/s, but still pretty far away from Charles’ estimate.
Travis Downs says:
Sorry, it should be result = result * 10 + *str - '0'.
Furthermore, I don’t expect that it is even theoretically possible to go 100x faster than the number I have reported here.
You might be interested in benchmarking the fmt library, as they’ve done great work in speeding up double conversions.
https://github.com/fmtlib/fmt#speed-tests
As far as I can tell, fmtlib prints out floats. It is not the problem considered here. Thanks for the link.
Hasan Alayli says:
Oops, my mind wandered in the other direction as I was reading the article.
Rudi says:
Not that I think that it will increase the performance significantly, but the handling of ‘string_end’ is not correct. It doesn’t have to be initialized and the check for nullptr is not necessary. When strtod() succeeds, string_end will point to the terminating zero of the string. So the error check should read: if ( *string_end != '\0' ) { failed(); }
You are correct, but as you point out, this does not affect performance for our purposes.
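For reference, a minimal sketch of the endptr-based check described above (a hypothetical wrapper, not the benchmark’s actual code):

#include <cstdio>
#include <cstdlib>

// Parse an entire NUL-terminated string as a double; reject empty input
// and trailing garbage. (A real version might also check errno for ERANGE.)
bool parse_double(const char *str, double *out) {
  char *string_end;                                  // no initialization required
  *out = std::strtod(str, &string_end);
  return string_end != str && *string_end == '\0';   // whole string consumed
}

int main() {
  double d;
  std::printf("%d\n", parse_double("3.14", &d));     // 1
  std::printf("%d\n", parse_double("3.14abc", &d));  // 0: trailing garbage
}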
Why is the benchmark including the cost of the static c_locale?
I am not sure what you mean by cost. Both strtod_l and strtod depend on a locale. In this case, I specify the default locale.
As explained in the post, strtod is locale-sensitive and we want a locale-insensitive version for comparison purposes.
Why not time based on a batch rather than on individual parses?
Because I am interested in throughput, as indicated by the fact that I express my results in MB/s. I am not interested in latency in this case.
Does the benchmark actually test and/or include all possible ways a float can be represented? E.g.: https://github.com/ArashPartow/strtk/blob/master/strtk_tokenizer_cmp.cpp#L568
It does not. Have you checked the code?
The exact performance number will depend on how the values were serialized, with the number of digits being the most important factor. However, if you want a full round trip, you need 17 digits in general for this problem. I have fixed the number of digits to a flat 17 throughout.
In practice, if you are parsing a lot of floats, chances are good that they are following the same format.
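As a small illustration of the 17-digit round trip (a sketch, not the benchmark code):

#include <cassert>
#include <cstdio>
#include <cstdlib>

int main() {
  double x = 0.1 + 0.2;                         // low-order digits are "noise"
  char buf[32];
  std::snprintf(buf, sizeof buf, "%.17g", x);   // 17 significant digits
  double y = std::strtod(buf, nullptr);
  assert(x == y);                               // exact recovery
  std::printf("%s\n", buf);                     // 0.30000000000000004
}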
Another flawed benchmark.
Can you point at a better from_chars to strtod benchmark?
You offer a link to your own library, but it does not seem to include such a benchmark?
Travis Downs says:
I tested from_chars in a gcc 11 snapshot (g++ (GCC) 11.0.0 20200913 (experimental)) and currently it is actually considerably slower than strtod.
Of course, it might speed up before release.
# parsing random integers in the range [0,1)
volume = 2.09808 MB
strtod : 126.03 MB/s (+/- 2.5 %)
from_chars : 82.09 MB/s (+/- 2.5 %)
It is not a given that from_chars will be faster than strtod either: the former has a “round trip” requirement that the latter doesn’t.
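For reference, the floating-point std::from_chars call under discussion is used roughly like this (a sketch; requires a standard library that implements the floating-point overload):

#include <charconv>
#include <cstdio>
#include <cstring>
#include <system_error>

int main() {
  const char *s = "0.7622503066624688";
  double d = 0;
  // Locale-independent, non-allocating parse of [s, s + len).
  auto res = std::from_chars(s, s + std::strlen(s), d);
  if (res.ec == std::errc()) std::printf("%.17g\n", d);
}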
Yakov Bachmutsky says:
You may want to have a look at https://github.com/yb303/swar.
It’s only a header really.
I wrote it a few years back when I was bored between jobs, and I have now cleaned it up and uploaded it.
It provides a few variants of atoi / itoa, without using SSE.
SSE may make it a bit faster for strings longer than 8 characters (ints greater than 99999999).
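For readers unfamiliar with the term, the kind of SWAR trick alluded to parses eight ASCII digits at once with plain 64-bit arithmetic, no SSE. A sketch (assumes a little-endian target; not code from that repository):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Parse exactly eight ASCII digits into an integer using SWAR
// ("SIMD within a register"). Little-endian byte order assumed.
std::uint32_t parse8digits(const char *p) {
  std::uint64_t x;
  std::memcpy(&x, p, 8);
  x -= 0x3030303030303030ULL;                           // '0'..'9' -> 0..9 per byte
  x = (x * 10 + (x >> 8)) & 0x00FF00FF00FF00FFULL;      // combine digit pairs
  x = (x * 100 + (x >> 16)) & 0x0000FFFF0000FFFFULL;    // combine into 4-digit groups
  return (std::uint32_t)(x * 10000 + (x >> 32));        // combine the two halves
}

int main() { std::printf("%u\n", parse8digits("12345678")); } // prints 12345678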