23rd September 2022, 9 min read

Optimizing compilers deduplicate strings and arrays

14 thoughts on “Optimizing compilers deduplicate strings and arrays”

Jonathan Graehl says:

September 24, 2022 at 2:05 am

It’s suffixes of C-strings that can be shared, not prefixes (recall their length is not encoded except implicitly by the 0 byte terminator).
1. Andrew Dalke says:
  
  September 24, 2022 at 7:06 am
  
  The “dear friend” / “dear friend\0f” example uses a prefix that could be shared.
  1. Amy says:
    
    September 25, 2022 at 2:07 pm
    
    It wouldn’t be possible. The way suffix string optimization works is by having the block and placing pointers in the middle. Because null termination defines where the string is, you can’t just cut off the end for one of them; however, you can cut off the beginning by placing the pointer of the substring in the middle.
    1. Matthew Wozniczka says:
      
      September 28, 2022 at 9:28 pm
      
      In his example, the second string contains an explicit null character, so I think it COULD be shared with the first one?
      
      Of course, it’s probably rare enough that this comes up with string constants that compiler authors didn’t bother to add the optimization
Marius Miku?ionis says:

September 24, 2022 at 9:47 pm

I guess most of optimizations are due to deep inlining, but there is indeed lot of compression: very complex assembly in place of trivial lookup.
Very interesting cat-and-mouse games the compiler plays: when it comes to storage it makes very reasonable (conservative) assumptions, but when it comes to code generation, it just inlines as much as possible, to the point that most of computation is treated as “constexpr”.
Here are my experiments:
https://godbolt.org/z/bPqG9f14M
Pete Bannister says:

September 25, 2022 at 8:19 pm

If you can discard the null termination rule though semantic analysis (e.g if the use of the strings is equivalent to c++’s.std::span) then you ought to be able to optimize it – providing the value is not passed to any functions that expect null termination. Have you seen her Sutter’s recent cpp2 talk? In a cpp2-pure context, that sort of optimization ought to become more possible I think
1. Marius Mikucionis says:
  
  September 26, 2022 at 10:25 am
  
  I have seen suggestions to use std::stri g_view instead, which work like span, so that could be opportunity for compiler.
Brian says:

September 26, 2022 at 9:09 pm

Great
Geoff Hill says:

September 27, 2022 at 10:33 pm

This happens not just within compilation units—but across them too!—given a sufficiently smart linker. Read about the SHF_MERGE flag in ELF object files for more info.
Jonny says:

September 28, 2022 at 10:02 pm

Interesting article, however maybe a mistake?

Comparing pointers.. only compares pointers, not the strings of bytes they may point to. For that you’d need strcmp()

Your article writes:

“The same trick fails with extended strings:
return str1 == str2”
1. Daniel Lemire says:
  
  September 28, 2022 at 11:01 pm
  
  It is not a mistake.
2. Marius Mikucionis says:
  
  September 30, 2022 at 7:26 am
  
  That’s the whole point of the article: the compiler sometimes reuses the same piece of memory and therefore we compare pointers to detect that.
Volker Simonis says:

September 29, 2022 at 8:11 am

Java has a feature called “String deduplication” (see https://openjdk.org/jeps/192). It is not done by the compiler but by the GC and therefore GC specific. It doesn’t recognize common substrings but it works for dynamically allocated strings.
Sawyer Bergeron says:

October 1, 2022 at 8:00 pm

Can’t share the prefix for C strings since they implicitly have (and require) a null terminator, so the first string is not actually a prefix of the second in terms of what it can store in the data section of the bin.

You could potentially have that optimization in languages where length is stored out of line (cpp/rust) but to try to break up and print common prefixes of strings with diverging postfixes without having print basically be an intrinsic with a *lot* of magic is hard (as well as significantly less performant, or bigger if you monomorphise which takes away the space savings)