Daniel Lemire's blog

, 9 min read

Optimizing compilers deduplicate strings and arrays

14 thoughts on “Optimizing compilers deduplicate strings and arrays”

  1. Jonathan Graehl says:

    It’s suffixes of C-strings that can be shared, not prefixes (recall their length is not encoded except implicitly by the 0 byte terminator).

    1. Andrew Dalke says:

      The “dear friend” / “dear friend\0f” example uses a prefix that could be shared.

      1. Amy says:

        It wouldn’t be possible. The way suffix string optimization works is by having the block and placing pointers in the middle. Because null termination defines where the string is, you can’t just cut off the end for one of them; however, you can cut off the beginning by placing the pointer of the substring in the middle.

        1. Matthew Wozniczka says:

          In his example, the second string contains an explicit null character, so I think it COULD be shared with the first one?

          Of course, it’s probably rare enough that this comes up with string constants that compiler authors didn’t bother to add the optimization

  2. Marius Miku?ionis says:

    I guess most of optimizations are due to deep inlining, but there is indeed lot of compression: very complex assembly in place of trivial lookup.
    Very interesting cat-and-mouse games the compiler plays: when it comes to storage it makes very reasonable (conservative) assumptions, but when it comes to code generation, it just inlines as much as possible, to the point that most of computation is treated as “constexpr”.
    Here are my experiments:
    https://godbolt.org/z/bPqG9f14M

  3. Pete Bannister says:

    If you can discard the null termination rule though semantic analysis (e.g if the use of the strings is equivalent to c++’s.std::span) then you ought to be able to optimize it – providing the value is not passed to any functions that expect null termination. Have you seen her Sutter’s recent cpp2 talk? In a cpp2-pure context, that sort of optimization ought to become more possible I think

    1. Marius Mikucionis says:

      I have seen suggestions to use std::stri g_view instead, which work like span, so that could be opportunity for compiler.

  4. Brian says:

    Great

  5. Geoff Hill says:

    This happens not just within compilation units—but across them too!—given a sufficiently smart linker. Read about the SHF_MERGE flag in ELF object files for more info.

  6. Jonny says:

    Interesting article, however maybe a mistake?

    Comparing pointers.. only compares pointers, not the strings of bytes they may point to. For that you’d need strcmp()

    Your article writes:

    “The same trick fails with extended strings:
    return str1 == str2”

    1. It is not a mistake.

    2. Marius Mikucionis says:

      That’s the whole point of the article: the compiler sometimes reuses the same piece of memory and therefore we compare pointers to detect that.

  7. Volker Simonis says:

    Java has a feature called “String deduplication” (see https://openjdk.org/jeps/192). It is not done by the compiler but by the GC and therefore GC specific. It doesn’t recognize common substrings but it works for dynamically allocated strings.

  8. Sawyer Bergeron says:

    Can’t share the prefix for C strings since they implicitly have (and require) a null terminator, so the first string is not actually a prefix of the second in terms of what it can store in the data section of the bin.

    You could potentially have that optimization in languages where length is stored out of line (cpp/rust) but to try to break up and print common prefixes of strings with diverging postfixes without having print basically be an intrinsic with a *lot* of magic is hard (as well as significantly less performant, or bigger if you monomorphise which takes away the space savings)