Daniel Lemire's blog

, 11 min read

Converting binary integers to ASCII characters: Apple M1 vs AMD Zen2

9 thoughts on “Converting binary integers to ASCII characters: Apple M1 vs AMD Zen2”

  1. Travis Downs says:

    It is not necessary to do a division for each step: a single multiplication by 10 suffices.

    For example, you can interpret a number like 1235 as 0.1235 by setting the decimal point to the left of the highest digit (so-called “fixed point” representation). Depending on the size of the involved integers, this can be a simple reinterpretation, not requiring any code.

    Then, you can extract each digit one at a time by multiplying by 10: 0.1235 * 10 = 1.2345 and extracting the integer part (the part to the left of the fixed point). You can further optimize this by multiplying by 5 rather than 10, since the final missing factor of two can be rolled into the shift. This is a big advantage on Intel x86 since * 5 can be done with a single 1-cycle latency lea instruction.

    The approach is covered quite well here.

    Now the approach you describe only nominally uses a division, but the compiler transforms it into a multiplication: but using the multiplication directly results in fewer instructions and opens up other opportunities for optimization like the lea trick.

    Hence, at least in this instance, your best chance of speeding up this
    function is either by dividing by 10 faster (in latency) or else by
    reducing the number of iterations (by processing the data in large
    chunks). The latter is already found in production code.

    There’s a third way: try for increased ILP within a single conversion. E.g., you could split the number into two and extra one digit from each half per iteration. This does more total work but has two latency chains of half the length (plus some overhead for the splitting and combining). There are several other tricks which decrease the latency at the cost of a bit more work.

    1. You are entirely correct, of course, a third way is to try to break down the problem into two (or more) tasks. Thanks for the great link.

    2. Not important says:

      Sorry to disturb you by I stumbled on your place while searching for Mac m1 info ! I come from windows , I switched to Mac mini m1 when it came out ! I feel apple do things very differently ! Or have other name , not sure !apple often do things insanelly good in some stuff and bad in other it seem , then I read your post about 64 bit wide vs 128 bit wide ! if i know something it’s that when ms went 64 bit from 32 in 2004 coder had a hard time going to 64 oh they had some data , but going from 32 to 64 bit wide was and even today still is a huge challenge parallel does not seem to be the strength we were thinking it was going to be ! I suspect 2 x 64 bit is likely having same issue ! Like shader ! they end up with these millions of layer , I mean apple did mention to make less layer but old habit die hard ! and now it seem that neural engine play a major role in m1 ! it seem to be the chef d’orquestre of the whole cpu gpu h264 optimisation thing ! do you have basic pointer for the gaming side ? I don’t mean the game themselves , but more like , install tip or ajustement. To be done to the m1 for better experience and performance ! many things are hidden under short cut key often and if you don’t know them ! For that person they don’t exist ! No need to reply to me personally ! maybe eventually you ll get to make a post !

  2. Maynard Handley says:

    “The Apple M1 processor can retire up to 8 instructions per cycle.”

    M1 can decode, map, and rename 8 instructions per cycle.

    It can retire (as in clear and free) 16 History File entries per cycle (ie free 16 register per cycle)
    It can (I kid you not) retire up to 56(!) instructions from the ROB per cycle.

    1. Can we verify this with performance counters?

      1. Travis Downs says:

        In any case I think your original statement was correct in a sense: the M1 can sustain retirement of 8 instructions per cycle. Technically, on any given cycle it might retire more, but this is kind of uarch trivia in the sense that sustained retirement will be limited by earlier bottlenecks such as decode.

        1. Well, if it is able to retire 10, 20, 30 instructions per cycle even just some of the time, then it suggests that we should be able to see higher-than-8 retired instructions per cycle in some cases using performance counters… does it not?

          1. Travis Downs says:

            Yes, you can measure it: see here for example.

            I just mean that as far as I can tell when you said “The Apple M1 processor can retire up to 8 instructions per cycle.” you are just using retire as a standin for “process” or said in another way: “The M1 is 8 wide” or “The M1 has a maximum IPC of 8”.

            As Maynard points out, the statement isn’t strictly true when you refer to “retire” specifically, because it happens that this process has a larger retirement bandwidth than its width (unlike many x86 processors where retire was the limiting or tied-for-limiting factor). So the spirit of what you said is correct: it can handle up to 8 instructions per cycle, and even though some parts of the pipeline may momentarily exceed that, they will have be preceded by periods of < 8 throughput so the average never exceeds 8.

            1. Maynard Handley says:

              To add to Travis’ points:

              (a) The big “strategic” point is that one of the design principles for creating a truly fast CPU is that you want to split the machine into as many independent parts as possible, each connected by queues, and things are working well if those queue are as dynamic as possible — ideally they should constantly be alternating between almost full (just absorbed a large block of activity) and almost empty (ready to absorb the next block).
              Or to put it differently, designing for mean behavior (mean rate of integer instructions, mean rate of loads, etc) is table stakes; to do substantially better means designing beyond the mean to the standard deviation. Many, dynamic, queues are part of that design principle.

              (b) So while the M1 can only sustain 8 instructions per cycle, you want queues that are substantially more ambitious than this.

              You want your Fetch queue (sitting between the L1I$ and Decode) to be asynchronous, and able to pull in substantially more than 8 instructions per cycle, to provide a buffer against bad luck (either a jump to the end of a cache line from which you can only pull a few instructions; or a jump that misses L1I completely).

              You provide buffers between Rename and (int/fp/loadstore) Scheduling that can accept 8 instructions, even though int scheduling is only 6-wide, and fp and loadstore scheduling are each 4-wide — so that even in the case of a long burst of int/fp/loadstore you can clear eight instructions out of Decode per cycle and move onto the next eight — your 8-wide front-end is not throttled by the 6 or 4-wide capacity of your execution core, at least not for shortish (~50 or so) burst lengths.

              And you try to clear out the ROB machinery as rapidly as possible after a blocking head -of-ROB clears.
              Naively you might consider this unimportant – – unlike the previous cases, there is apparently no compelling reason for the back-end to have to clear faster than it can be fed by the front end.
              The point that is missed by this naive analysis is heterogeneity of resources, and the fact that resources are cleared in order after a blocking head of ROB clears.
              So, suppose the machine has stalled in Rename because it can no longer allocate a physical int register. When the head of ROB clears, registers will start to be freed — but in-order. And if all the registers cleared in the first cycle are FP registers, that doesn’t help the stall in Decode, which doesn’t make progress until an int register is freed. That’s why we want registers freed at faster than they can be consumed, to deal with this heterogeneity.

              The same holds for the ROB proper. Recall that the M1 ROB consists of ~330 rows, each of which can hold up to seven instructions of which one (and only one) can be a failable instruction (ie an instruction that can misspeculate — a load, store, or branch).
              The throughput constraint is, if you want to be able to process 7 instructions per cycle of exactly the right mix (4 load/store + 3 branches) you have to be able to clear 7 failables from the ROB per cycle.
              In fact Apple clear 8 rows per cycle, of which failables are the important part (you have to test that each one does not in fact result in a flush); the rest are just unimportant detritus at this point.
              Given that the History File (for tracking register dependencies and freeing registers) is a separate structure that operates on its own schedule (of 16 free’s per cycle), it’s in fact fairly easy to free up lots of the ROB in a single cycle!

              (Experts might ask about that three branches claim above — M1 only has two branch units! True — but simple branches [non conditional, don’t use the link register] “execute” at Decode, not in a branch unit…)