Java by now also has ASCII strings stored as bytes, largely to conserve memory.
foobar says:
A well-optimised memcpy is mostly limited by write throughput on modern architectures, and toutf16 writes twice as many bytes as it reads. Although I didn’t really dive deep into the code, it would seem that compilers eagerly use native vectorised unpack/zero-extend instructions on the fastest parts of the compiled function, thus making the conversion itself mostly a triviality. I’m not entirely convinced that I’d understand why the choice of instructions would be optimal (in particular, compilers don’t seem eager to use AVX-512 registers), but the write throughput of those functions matches memcpy – which alone makes me wonder if there’s any practical room for improvement for non-small inputs.
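For illustration, here is a minimal sketch of the kind of vectorised zero extension being described, using SSE4.1 intrinsics (the function name and loop structure are hypothetical, not the post’s code, and a real version would need a scalar tail when the length is not a multiple of 8):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: widen 8 ASCII bytes at a time into UTF-16 code units
       using PMOVZXBW (zero extension). Assumes len is a multiple of 8. */
    void ascii_to_utf16_sse(const uint8_t *src, uint16_t *dst, size_t len) {
      for (size_t i = 0; i < len; i += 8) {
        __m128i bytes = _mm_loadl_epi64((const __m128i *)(src + i)); /* load 8 bytes */
        __m128i words = _mm_cvtepu8_epi16(bytes);                    /* zero-extend to 8 x 16 bits */
        _mm_storeu_si128((__m128i *)(dst + i), words);               /* store 16 bytes */
      }
    }

Compile with a target that includes SSE4.1 (e.g. -O3 -msse4.1, or -march=native on x86-64).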
A well-optimised memcpy is mostly limited by write throughput on modern architectures, and toutf16 writes twice as many bytes as it reads.
That sounds reasonable, yes.
which alone makes me wonder if there’s any practical room for improvement for non-small inputs.
It is an interesting question.
Oren Tirosh says:
The overhead is largely in having to perform an additional copy and in the larger memory footprint. The conversion during the copying is probably less significant.
I agree and I believe that I allude to this fact at the end of the post.
Mischa Sandberg says:
At a guess, the larger the block being translated as a unit, the more the bottleneck is the bus or RAM. I work with such volumes, but broken into many separate units, which are far less likely to share cache lines or memory pages. Would you consider comparing single-block translations with ones that have no streaming advantages?

That Apple M1 chip sure is impressive.
Why did you switch compilers? That potentially invalidates your test, since you’ve got gcc 8 vs. clang 12, which adds another variable differentiating your two results. If clang 12 is the only compiler that supports the M1 CPU, then just use clang 12 for the Zen 2 as well. (Does a compiler need to support the M1 specifically, or is it fine to just specify Armv8.x?)
The exact AMD CPU model and clock speed would also be helpful. Zen 2 is a whole family of CPUs with different performance.
I’m sometimes not clear on the compiler flags you use besides -O3, if any. There are a lot more flags to leverage, many of which aren’t included in -O2 or -O3. CPU target is a big one, and I’m not clear on how you handle it, like -march=tigerlake or whatever. It’s possible that a compiler will use some instructions only when certain CPU minimum targets are specified, and they’re not all SIMD instructions. e.g. the Bit Manipulation Instructions, POPCNT, TSX, or the arbitrary precision arithmetic in Broadwell. None of those would likely matter here, but they might in other contexts.
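As a concrete illustration of that last point (my example, not from the post): with GCC or Clang on x86-64, __builtin_popcountll compiles to a generic bit-counting sequence by default, but becomes a single POPCNT instruction once the target enables it (e.g. -mpopcnt or -march=haswell):

    #include <stdint.h>

    /* Emits a POPCNT instruction only when the compiler may assume the CPU
       has it (-mpopcnt, -march=haswell, -march=native, ...); otherwise a
       slower generic fallback sequence is generated. */
    int bits_set(uint64_t x) {
      return __builtin_popcountll(x);
    }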
Why did you switch compilers? That potentially invalidates your test, since you’ve got gcc 8 vs. clang 12, which adds another variable differentiating your two results.
Apple makes its own compilers for its hardware, which are related to LLVM clang but distinct. My AMD Rome (Zen 2) processor is in a Linux server, where GNU GCC is more commonly used.
If clang 12 is the only compiler that supports the M1 CPU, then just use clang 12 for the Zen 2 as well. (Does a compiler need to support the M1 specifically, or is it fine to just specify Armv8.x?) The exact AMD CPU model and clock speed would also be helpful. Zen 2 is a whole family of CPUs with different performance. I’m sometimes not clear on the compiler flags you use besides -O3, if any. There are a lot more flags to leverage, many of which aren’t included in -O2 or -O3. CPU target is a big one, and I’m not clear on how you handle it, like -march=tigerlake or whatever. It’s possible that a compiler will use some instructions only when certain CPU minimum targets are specified, and they’re not all SIMD instructions. e.g. the Bit Manipulation Instructions, POPCNT, TSX, or the arbitrary precision arithmetic in Broadwell. None of those would likely matter here, but they might in other contexts.

As is nearly always the case with my blog, I provide the source code along with the build setup that I used; just follow the hyperlinks. It would sure be interesting to investigate all of the points you raise. I invite you to do so!
I’m not quite sure about the conclusion. In particular, you “measure the throughput based on the input size”. So the memcpy variant is writing half the number of bytes (and most architectures including M1 and Zen have lower store vs load throughput in L1D).
Wouldn’t it make more sense to base it on the (twice as large) output size if you are trying to make a claim about the cost of computation (ASCII->UTF-16 expansion) versus the raw copy?
E.g., if you had a system that stored strings in ASCII and processed them in UTF-16, and wanted to compare it to a system that both stored and processed them in UTF-16, from these results you might conclude that the latter would be faster since getting the string from storage is “just a memcpy and memcpy is ~2x as fast” – but that’s not true because your memcpy will be 2x the size compared to what you are doing in this benchmark.
Furthermore, in the case where you are going from UTF-16 to ASCII, you’d probably find that the “converting” case is faster than the non-converting UTF-16->UTF-16 case (since now half as much data is written).
So I’d actually draw a somewhat different conclusion: converting between ASCII and UTF-16 is basically free if it is part of a copy, and speed primarily depends on the size of the written data. So it is feasible to use ASCII in parts of your pipeline even if other parts require UTF-16 if you have a copy that can serve dual-purpose to expand/compress the data.
Of course, just using ASCII everywhere is fastest: smallest working set and no conversion, but I don’t think this was in doubt!
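To spell out the normalization issue with a back-of-the-envelope model (my illustration, not a measurement from the post): suppose the routines are store-bound and the machine sustains S bytes of stores per cycle.

    ASCII -> UTF-16 expansion: reads N bytes, writes 2N bytes,
        so roughly 2N/S cycles; per *input* byte that is S/2.
    memcpy of the N input bytes: reads N, writes N,
        roughly N/S cycles; per input byte that is S.
    memcpy of the 2N-byte UTF-16 output: reads 2N, writes 2N,
        roughly 2N/S cycles -- the same as the expansion.

So normalized by input bytes the conversion looks half as fast as a memcpy, while normalized by written bytes the conversion and the equally sized copy come out roughly even.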
Wouldn’t it make more sense to base it on the (twice as large) output size if you are trying to make a claim about the cost of computation (ASCII->UTF-16 expansion) versus the raw copy?
I could have presented it in this manner, yes. And I grant you that it might have made it clearer.
E.g., if you had a system that stored strings in ASCII and processed them in UTF-16, and wanted to compare it to a system that both stored and processed them in UTF-16, from these results you might conclude that the latter would be faster
I am sure that one could think that, but it would not be fair to blame me for it!
Furthermore, in the case where you are going from UTF-16 to ASCII, you’d probably find that the “converting” case is faster than the non-converting UTF-16->UTF-16 case (since now half as much data is written).
Which would be consistent with the model submitted in the blog post: ‘With code that reads and writes a lot of data, it is often sensible to use as a model the number of written bytes.’
(…) speed primarily depends on the size of the written data (…)
Again, the blog post states: ‘With code that reads and writes a lot of data, it is often sensible to use as a model the number of written bytes.’
I am sure that one could think that, but it would not be fair to blame me for it!
Then what is the claim exactly?
Roughly, I understood it as: doing the conversion has a computational cost above and beyond the pure copy.
Now, one can consider two types of “pure copies” here to compare with the ASCII->UTF16 conversion: ASCII->ASCII and UTF16->UTF16. It’s memcpy either way, but the input and output are twice the size in the latter case.
Now, I would interpret your claim as applying to both cases: they should both be faster than the converting case. Otherwise, if it only applies to the ASCII->ASCII case, the result is entirely obvious: even without considering computation, reading N bytes and writing 2N bytes is going to be slower than reading N and writing N.
Furthermore, in this case you can’t really determine whether the lower cost is due to “less copying” or “less computation” because it is less in both respects. So I submit that UTF16->UTF16 copy is the more interesting case, and therefore I am assigning you some fraction of the blame (like an insurance adjuster). 🙂
Which would be consistent with the model submitted in the blog post: ‘With code that reads and writes a lot of data, it is often sensible to use as a model the number of written bytes.’
I think this is what confused me the most. When I originally read it, I took this to mean that you’d measure both the expand and memcpy approaches using the number of written bytes, i.e., an N->2N expansion versus a 2N->2N memcpy, but you actually use input bytes for the measurement (which I picked up on after looking at the code, although you do mention it).
So on what basis do you make the claim “the computational cost alone might be twice that of a memory copy”? It is quite the opposite: the computational cost appears to be basically zero, with the determining factor being the number of bytes written. E.g., if you had a weird memcpy that interleaved full vectors (so the number and type of instructions were the same, but it had an output region 2x the size) I’d expect it to perform similarly to the expand case (for vectors less than a cache line in size).
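One way to read that “weird memcpy” (my sketch, not Travis’s code): per 16 input bytes, issue exactly the load, two unpacks and two stores of a widening loop, but unpack the vector with itself instead of with zero, so nothing is converted yet 2x the input is still written.

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical copy with the same instruction mix as an ASCII -> UTF-16
       widening loop, but it duplicates bytes instead of zero-extending them,
       so the output region is still 2x the input.  Assumes len is a
       multiple of 16. */
    void duplicating_copy(const uint8_t *src, uint8_t *dst, size_t len) {
      for (size_t i = 0; i < len; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i lo = _mm_unpacklo_epi8(v, v); /* a widening loop would pass _mm_setzero_si128() here */
        __m128i hi = _mm_unpackhi_epi8(v, v);
        _mm_storeu_si128((__m128i *)(dst + 2 * i), lo);
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16), hi);
      }
    }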
Travis Downs says:
Oops this comment was intended to be nested under Daniel’s reply.
The computational cost model is stated as “With code that reads and writes a lot of data, it is often sensible to use as a model the number of written bytes.”
Travis Downs says:
Perhaps I misunderstood the thrust of this post. I thought it was discussing and providing some evidence for the cost of converting between ASCII and UTF-16 (hence the conversion test), compared against the alternative of fewer or no conversions (thus, to an “all ASCII” or “all UTF-16” implementation).
Based on your replies, it seems maybe you were rather comparing ASCII vs UTF-16? As in, what is the cost of using UTF-16 rather than ASCII? The conclusion being that UTF-16 is twice as slow under the model that uses the number of bytes written.
Still, I am confused by some parts, e.g.:
I expect that it is entirely possible to greatly accelerate my C function.
However, since your results show that the C function is performing at almost exactly the same speed as a memcpy that writes the same number of bytes (slightly better in the Zen case, slightly worse in the M1 case), wouldn’t you conclude your C function is running at essentially the speed limit already?
On a 10 kB input, this code may well be optimal, but I have no reason to expect that my hastily written loop will compile to optimal code generally. Maybe it does, but my investigation was insufficient to draw this conclusion.
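For context, the “hastily written loop” under discussion is essentially a plain widening copy along the following lines (a sketch with a hypothetical name; the post’s actual function may differ in its details), which modern compilers will typically auto-vectorise at -O3:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a straightforward ASCII -> UTF-16 widening loop: each input
       byte becomes one 16-bit code unit, which for ASCII input is just a
       zero extension.  Compilers readily vectorise this pattern. */
    void toutf16_scalar(const uint8_t *in, size_t len, uint16_t *out) {
      for (size_t i = 0; i < len; i++) {
        out[i] = in[i];
      }
    }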