Daniel Lemire's blog

String representations are not unique: learn to normalize!

6 thoughts on “String representations are not unique: learn to normalize!”

  1. Djamé says:

    I wish this could be done at the OS clipboard level 🙁

  2. mischa sandberg says:

    It’s also a pain dealing with APIs that flip a coin to decide: do I reject an invalid “utf8” sequence with an error, or do I emit an (ASCII!) SUB? And maybe multiple adjacent erroneous sequences become multiple SUBs. Or not. Then some other system does a binary comparison of the equivalent (?!) strings. Gah.

    Likewise: are overlong encodings errors? Subbed? Mapped? Passed through to be someone else’s problem?
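
To make the "flip a coin" point concrete, here is a minimal sketch of how a single decoder, CPython's built-in UTF-8 codec, treats the same bad bytes under three error policies. The sample byte strings and the helper name `decode_policy` are made up for illustration. Note that this codec substitutes U+FFFD (the Unicode replacement character) rather than ASCII SUB (0x1A); other systems may well choose differently, which is exactly the complaint above.

```python
# Sketch: one decoder, three error policies, three different outcomes.
INVALID = b"ab\xffcd"    # the byte 0xFF can never appear in valid UTF-8
OVERLONG = b"\xc0\xaf"   # overlong two-byte encoding of "/" (U+002F)


def decode_policy(data):
    """Decode the same bytes as UTF-8 under strict, replace, and ignore."""
    out = {}
    try:
        out["strict"] = data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        out["strict"] = "error: " + exc.reason
    # "replace" substitutes U+FFFD for undecodable input.
    out["replace"] = data.decode("utf-8", errors="replace")
    # "ignore" silently drops undecodable input.
    out["ignore"] = data.decode("utf-8", errors="ignore")
    return out


for sample in (INVALID, OVERLONG):
    print(sample, decode_policy(sample))
```

Two systems that pick different policies here end up with different strings for the same input, so a later byte-for-byte comparison between them will not match.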

  3. Nick Nolan says:

    It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

    Grapheme cluster boundaries are important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text. Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

    https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
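
The “G” + grave-accent example can be checked directly. The sketch below uses only Python's standard `unicodedata` module; `rough_clusters` is a hypothetical helper that merely attaches combining marks to the preceding code point. It is a crude approximation for illustration, not the UAX #29 grapheme cluster algorithm (it ignores emoji sequences, Hangul jamo, and the other rules in that report).

```python
import unicodedata


def rough_clusters(text):
    """Group each combining mark with the code point before it.

    A crude stand-in for grapheme clusters, NOT the UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


g_grave = "G\u0300"  # "G" followed by U+0300 COMBINING GRAVE ACCENT

# Unicode has no precomposed "G with grave", so NFC cannot merge the pair:
# the user-perceived character stays two code points after normalization.
print(len(g_grave))                                # 2
print(len(unicodedata.normalize("NFC", g_grave)))  # 2
print(len(rough_clusters(g_grave + "a")))          # 2 clusters, 3 code points
```

So normalization alone does not make "one code point" equal "one character"; counting or slicing user-perceived characters needs cluster boundaries.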

  4. Markus Schaber says:

    One specific problem with normalization is that the concatenation of two normalized strings is not necessarily a normalized string, so normalization only during input is not necessarily sufficient.

    1. Interesting. Can you give me an example?

      1. Markus Schaber says:

        There are some in the Unicode TR 15:

        https://unicode.org/reports/tr15/#Concatenation
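
A minimal sketch of the concatenation problem, using Python's standard `unicodedata` module. The specific pair chosen here ("a" followed by a lone combining acute accent) is an illustrative assumption rather than a quotation from TR 15, and the helper name `nfc` is made up for the example.

```python
import unicodedata


def nfc(s):
    """Return the NFC-normalized form of s."""
    return unicodedata.normalize("NFC", s)


left = "a"
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT on its own

# Each piece is already in NFC.
assert nfc(left) == left
assert nfc(right) == right

# Their concatenation is not: NFC composes the pair into U+00E1 ("á").
joined = left + right
print(joined == nfc(joined))    # False
print(nfc(joined) == "\u00e1")  # True
```

In practice this means normalizing again after joining strings, or at the point of comparison, rather than relying only on normalization at input time.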