Interesting. In theory, writing is fundamentally cheaper than reading, because you typically have to wait for the value you read but not for the completion of your write requests.
However, a cache using an allocate-on-write policy will frequently end up reading a cache line so that it can then write to that cache line (because caches are maintained at line granularity and you can’t just write a word without knowing its “surroundings” or else you’ll eventually overwrite these surroundings with garbage); and then eventually it’ll have to evict the dirty cache line – making writing more work than reading, after all. And if large arrays are involved, high-end CPUs will eliminate the waiting for read results through prefetching.
An interesting thing to test is scattered reads and writes, where prefetching doesn’t kick in so reading gets more expensive (but writing still involves reading, just not necessarily waiting for its completion).
A more “extreme” thing to test is writing to non-cacheable memory regions (so no reading of “surroundings” is involved), and still more “extreme” would be using some sort of DMA engine and local memory instead of cache; thus, the large latency of reads, and the lack thereof for writes, would be very visible. Of course, this type of “close to the metal” work can be irrelevant if your target is commodity CPUs and OSes.
Hi Daniel, unpack()’s “d[x] = compressed[x];” needs a mask above; you’ve already fixed it in the code proper.
Also, Nathan Baum pointed out the code looks wrong elsewhere and I agree. A single index, x, is used in pack() and unpack() and strides on eight at a time. This means the compressed bytes holding eight bools have seven unused bytes between them. Referring to the github code, char comp[N / 8] would be too small given the long stride but fortunately, pack() and unpack() are both wrongly told there’s N / 8 bools in play, rather than N, so comp[]’s big enough but only the first eighth of bool data[] is used. Clearing the source array in the middle of the test loop before unpacking back into it should show the assert failing.
@Ralph
Yes, I have merged the bug fix contributed by Nathan Baum and updated the code in this blog post. It does not change the analysis.
Michiel says:
Reading is two times faster than writing, due to DDR design (the R stands for read, not write), assuming sequential access.
Writing is faster than reading, because the processor does not need to wait, see Yossi.
Totally agree with you.