First, great work! This seems to require SSE3. What is the best that can be done using only SSE2? Thanks!
On Intel processors, you need SSSE3, which came out in 2006. Unfortunately, I do not have access to hardware that old. I think that the oldest x64 processor I have is the Jaguar in my PS4, and it has SSSE3. Even old embedded Intel Atom processors have SSSE3.
My expectation is that a processor so old as to not support SSSE3 would not be supported by Windows 10.
What types of computers are you working on? Pentium 4 systems maybe? From a legacy architecture?
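For concreteness, a minimal sketch (not code from the post) of checking for SSSE3 at runtime before picking a code path might look like this, assuming GCC or Clang, since it relies on the __builtin_cpu_supports builtin (MSVC would need __cpuid from <intrin.h> instead):

#include <cstdio>

// Hedged sketch: a runtime feature check deciding between the SSSE3 (PSHUFB-based)
// path and an SSE2/scalar fallback.
int main() {
    if (__builtin_cpu_supports("ssse3")) {
        std::puts("SSSE3 available: the PSHUFB-based validation path can be used");
    } else {
        std::puts("SSE2 only: a scalar or SSE2 fallback routine is needed");
    }
    return 0;
}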
Windows 10 only requires SSE2, FWIW, and plenty of consumer software still targets SSE2 as its baseline.
The Steam Hardware Survey shows 98.88% support for SSSE3 (versus 100% for SSE2, which I believe is inevitable given that Steam requires SSE2), so it’s not totally universal even today. (And Steam users possibly have quicker hardware than on average?)
Windows 10 only requires SSE2 (…) The Steam Hardware Survey shows 98.88% support for SSSE3
The same survey shows that Windows 10 represents 89% of all OSes with somewhere between 5% to 10% of PCs running older versions of Windows. Out of the 1.2% of folks with PCs that fail to support SSSE3, what fraction of them run on Windows 10? I’d guess a small fraction, certainly less than 1%. My understanding is that Microsoft only commits to supporting Windows 10 on hardware that falls within an OEM support agreement. It is almost certainly the case that a system that does not have SSSE3 is not part of a supported licence agreement with Microsoft. No matter how you look at it, we are talking about the legacy tail end.
(And Steam users possibly have quicker hardware than on average?)
I am sure that’s true. My sons’ school runs on Windows 2000. And a university I will not name runs its infrastructure on Windows NT. It is still legacy. Do not expect big exciting contracts building high-speed software for Windows 2000. These customers using Windows 2000 are not shopping for new software to improve the performance of their machines.
You are absolutely not going to get 100% SSE2 out in the real world. In fact, you have many 16-bit x86 software systems out there. There are people running companies on Apple II computers or the like.
There are many people who will laugh at the idea of using anything so modern as C++, and let us not even get into recent standards. C99 is esoteric. They are coding in C with a vendor-supplied compiler that has not seen any update in 20 years.
There is no end to the tail end… it goes far back and will always do so.
But do not for a minute think that you are going to spin up a 20-year-old system and run Doom in your browser.
Dave says:
Can you provide some evidence? Some of these examples are flirting with a gossipy tone that suggests campfire stories, while others are simply irrelevant and would never be considered within the scope of the “100%” target.
Evidence of what?

That some people run 16-bit software to this day? Come on. Live a little… go into the underbelly of large and old organizations. Just google how one can run 16-bit applications under Windows 10. There are plenty of discussions on the topic.
Pascal code written in the 1980s is still the backbone of large organizations.
But even Pascal is too modern for some. Look at this 2020 article about unemployment systems and ATMs still relying on COBOL code written by people long since retired.

That people use really old computers to run businesses? See this 2016 article about a business relying on a Commodore 64.

Schools that still run Windows 2000 or Windows NT? Here is an answer on Quora from 2018:
they were still running 5 NT 4 servers on machines with P2 processors, 2 IBM Thinkcentres running NT4 workstation and about 10 PCs running XP. Unbelievable. Looking at the dates on the system, I think some of the NT4 servers were installed in 2004.

That you are not going to get 100% SSE2 in the real world on x86 processors? There are people maintaining Firefox forks for CPUs without SSE.

Look at this quote from the COBOL article…
Some COBOL experts bristle at state leaders’ blaming current backlogs of unemployment claims on a lack of coders. (…) The explanation, they say, is that governments have not updated the bulky and sluggish computers that run on COBOL, according to Derek Britton, product director at England-based Micro Focus, which helps clients navigate such updates.
Do you think that the bulky and sluggish computers they refer to have SSE2 support? No, they do not.
If you doubt that there are really old systems still running today, you have simply never worked for an old conservative organization. Period.
The legacy tail end is very long.
minirop says:
One of Paris’ airports had to shut down for several hours in 2015 because a computer crashed. They were running Windows 3.1.

Reference: https://www.zdnet.com/article/a-23-year-old-windows-3-1-system-failure-crashed-paris-airport/

The article reminds us that the US military still relies on floppy drives.
-.- says:
AMD K10 (Phenom, Phenom II and similar) was the last uArch to not support SSSE3. I think they were first released in 2007 and the last ones were released around 2010 (the 6 core Phenom IIs, and probably some APUs).
The CPUs were fairly popular in their day, as they were good value, and their successor, AMD’s Bulldozer line, often performed worse, so many people kept buying K10s even after Bulldozer’s release. As such, there are still quite a number of them around (in fact, I have one right here – mostly for testing SSE2 fallback performance).
Note that while the early Atoms (and AMD Bobcat) supported SSSE3, PSHUFB performance on them was fairly poor (5–6 cycle reciprocal throughput). Still, PSHUFB is difficult to emulate on SSE2, and you may be better off falling back to scalar code there.
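To see why PSHUFB is hard to replace, here is a minimal sketch (the function names are illustrative, not from any library) of what a single PSHUFB provides: sixteen parallel lookups into a 16-entry byte table, next to the scalar loop an SSE2-only build would fall back to.

#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (PSHUFB)
#include <cstdint>

// SSSE3: one PSHUFB performs sixteen 16-entry table lookups at once.
// Each byte of 'nibbles' must hold a value in 0..15.
static inline __m128i lookup16_ssse3(__m128i table, __m128i nibbles) {
    return _mm_shuffle_epi8(table, nibbles);
}

// Scalar fallback doing the same sixteen lookups one byte at a time.
static inline void lookup16_scalar(const uint8_t table[16],
                                   const uint8_t nibbles[16],
                                   uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = table[nibbles[i] & 0x0F];
    }
}

The SSSE3 version has to be compiled with SSSE3 enabled (for example -mssse3 with GCC or Clang); the scalar version has no such requirement, which is what makes it a convenient fallback.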
Thanks for the informative comment.

Still, PSHUFB is difficult to emulate on SSE2, and you may be better off falling back to scalar code there.
Right. If someone follows my advice regarding production use, that is what is going to happen. If you are using 10-year-old processors from AMD, you will get a fallback code routine that should serve you well, but it will be nowhere close to what a Zen 2 can do.
To answer your question more directly: if someone has legacy hardware, I would not bother with software optimization at all. It seems misplaced to try to optimize software for hardware that cannot be purchased anymore.
Software and hardware are in a closed loop of self-improvement. We upgrade the hardware, then upgrade the software to benefit from the hardware, which further motivates new hardware, and so forth.
People are willing to pay a large premium for relatively small gains. A 20% boost in hardware performance often translates into twice the price or more. To these people, who pay extra for small performance gains, the added software performance is valuable.
The counterpart to this observation is that for applications where the extra cost of upgrading your server every decade is too much, we can infer that the value of software performance is low.
If you are in cost-minimization mode, that is, you are willing to trade hardware performance for low cost, then you probably want to do as few software upgrades as you can. My view is that if you have legacy hardware, you are probably well served with older software.
Travis Downs’s tests indicate that simply touching the upper 128 bits of an AVX register will trigger a (latency-inducing) voltage transition. Did you observe that? This kind of fear has kept me from chasing SIMD acceleration of UTF-8 validation that’s in the middle of a bunch of other work.
Let me see if I can get Travis to answer this… Pinging him on Twitter now.
Travis Downs says:
Well, this is a one-time cost, so it is hard to measure in any type of “typical” benchmark, which executes the code under test millions of times. E.g., if your benchmark runs for at least 10 milliseconds (most run much longer), a transition period of 20 µs (at the high end of what I measured) would add only 0.2% to the benchmark and be totally lost in the noise.
Furthermore, any benchmark which does any type of “warmup” iteration and doesn’t include this in the timing would never see the effect at all.
Overall, I don’t expect the transition timing to make much of a difference in most use cases that care about “milliseconds” (rather than nano- or microseconds) or about throughput. The max throughput loss is small and the max latency penalty is small. Only people who might lose a bunch of money over occasional latency spikes on the order of tens of microseconds are likely to care.
This is different from (but related to) frequency downclocking, where the CPU runs at a lower frequency for a significant period of time: this is a more significant effect and more people would care about this, but you aren’t likely to experience it just by using 256-bit instructions.
Finally, you might not care about this at all in the context of the algorithms presented here because you expect the rest of your binary to already be full of 256-bit instructions, as it will be if you compile with the CPU architecture set to something modern like -march=skylake or -march=native (the latter depending on where you compile it).
Thanks, Travis. Certainly a better answer than what I could have come up with.
Antoine says:
I wonder why Intel, who likes creating new instructions all the time, didn’t create a couple of instructions for vectorized UTF-8 parsing and validation. It should be cheap in hardware resources, and UTF-8 is here to stay.
They did create string comparison functions and it did not end well.
In the broader scheme of things, you are right that hardware support for UTF-8 is not at all silly. In fact, I bet it already exists… just not in mainstream Intel processors.
Tom says:
A value in the range 0 to 16 is sometimes called a nibble
Did you mean “0 to 15”? Or is the range meant to be right-open (“[0, 16)”)?

I had [0, 16) in mind, yes.
Great work, as usual!

The benefit of SSE2 is that you can assume it is available. The problem with SSE3 is that you have to write a bit of application prologue code which checks, at runtime, whether or not SSE3 is available. And then you need to provide a non-SSE3 implementation if necessary. This adds an additional run-time check. And it makes designing the library more difficult because you have a global variable (“static bool isSSE3Available;”). This is done with a “magic static”, which adds costs.
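A minimal sketch of the dispatch pattern described above, assuming GCC or Clang and hypothetical function names, could look like this; the one-time CPU check hides behind a function-local static (the “magic static”), whose thread-safe initialization is the cost being referred to:

#include <cstddef>

// Stand-in implementations for illustration only (not any library's actual entry points).
bool validate_utf8_ssse3(const char *buf, std::size_t len) { (void)buf; (void)len; return true; }
bool validate_utf8_sse2(const char *buf, std::size_t len)  { (void)buf; (void)len; return true; }

// Dispatcher: the CPU feature check runs once, on the first call, guarded by the
// thread-safe initialization that C++11 guarantees for local statics.
bool validate_utf8(const char *buf, std::size_t len) {
    static const bool has_ssse3 = __builtin_cpu_supports("ssse3");
    return has_ssse3 ? validate_utf8_ssse3(buf, len)
                     : validate_utf8_sse2(buf, len);
}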
The benefit of SSE2 is that you can assume it is available.
Of course, there are many more ARM systems than there are SSE2 systems.
265 993 303 says:
It really just shows that UTF-16 is a much simpler encoding to deal with. There are only two different encoding lengths, and only mismatched surrogates are errors; not four different encoding lengths and all those pesky combinations of two-, three-, and four-byte mismatches, overlong sequence rejection, etc., a total of 13 different error cases in UTF-8. It is also easier to parse a minus sign by checking for 0x2212 than by checking three consecutive bytes. UTF-8 is the bloatware text encoding.
UTF-16 is definitively simpler.
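For what it is worth, a scalar sketch of UTF-16 validation (mine, not from the article) shows how short that error list is: the only failure modes are an unpaired high surrogate and a lone low surrogate.

#include <cstddef>
#include <cstdint>

// Scalar UTF-16 validation over native-endian 16-bit code units.
bool validate_utf16(const uint16_t *data, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        uint16_t w = data[i];
        if (w >= 0xD800 && w <= 0xDBFF) {            // high surrogate: must be paired
            if (i + 1 == len) return false;          // truncated pair
            uint16_t next = data[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;  // mismatched pair
            i++;                                     // skip the matching low surrogate
        } else if (w >= 0xDC00 && w <= 0xDFFF) {     // lone low surrogate
            return false;
        }
    }
    return true;
}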