Daniel Lemire's blog

How do search engines handle special characters? Should you care?

10 thoughts on “How do search engines handle special characters? Should you care?”

  1. @John

    Great point.

    In this case, comparing “Kurt Godel” and “Kurt Gödel”, both Bing and Google fail my “this is the same person” test.

    So, if you want people to find your article on Kurt Gödel more easily, maybe you should include a few “Kurt Godel” typos. 😉

  2. Will Fitzgerald says:

    Michael’s comments are well stated–at least, it’s what I would have written as my comment!

    The original question to Matt Cutts was about ligatures (e.g. is “Duff’s Beer” the same as “Duff’s beer”) and typographic hyphens.

  3. John says:

    Apparently Google gives you somewhat different results when you search on “Godel” as well. I would expect more English speakers would simply leave off diacritical marks than, for example, change ö to oe.
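
As a rough illustration of the two habits John contrasts (dropping the diacritic versus respelling ö as oe), here is a minimal Python sketch. The `GERMAN_TRANSLIT` table and the function names are assumptions made for this example; nothing here reflects how Google or Bing actually fold characters.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table (an assumption).
GERMAN_TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue",
                   "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transliterate_german(text):
    """Replace umlauts with their two-letter spellings (ö -> oe)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in composed)

def query_variants(query):
    """Case-folded spellings that an engine could choose to treat as equivalent."""
    composed = unicodedata.normalize("NFC", query)
    return {
        composed.casefold(),
        strip_diacritics(composed).casefold(),
        transliterate_german(composed).casefold(),
    }

print(sorted(query_variants("Kurt Gödel")))
```

An engine that indexed every variant under one key would treat all three spellings as the same person; one that only strips diacritics would miss "Goedel".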

  4. Michael Brundage says:

    It’s actually even more complicated. The search engine isn’t a single entity, but rather many online and offline processes, each of which can implement different rules.

    Let’s assume for a moment that the web page and server code correctly handle and log the Unicode text representation (so that ö isn’t corrupted somewhere along the way); surprisingly many sites already fail this step.

    You’ve got things like autocomplete, stemming, spell correction, and synonym handling, each of which is distinct software that normalizes it’s inputs differently. Then you’ve got all the offline processes that analyze query logs, index documents, extract terms, etc. Some of these can map oe to ö while others don’t.

    In your example, it’s unclear which of these systems is not treating these queries as equivalent. It might even be ranking; maybe Uncyclopedia is in the results somewhere, but unnormalized query-dependent ranking changes it’s order of appearance. By playing with additional constraints (e.g., site:uncyclopedia.com) you may be able to further probe the implementation.

    Text normalization is still an afterthought in most software, even at Google.
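
To make Michael's point concrete, here is a small sketch of three components that each normalize "their" input differently and therefore disagree about which strings are the same query. The three key functions are hypothetical stand-ins, not a description of any real search engine's pipeline.

```python
import unicodedata

COMPOSED = "G\u00f6del"     # "ö" stored as one code point, U+00F6
DECOMPOSED = "Go\u0308del"  # "o" followed by a combining diaeresis, U+0308

def raw_key(text):
    """Hypothetical component A: lowercases only."""
    return text.lower()

def nfc_key(text):
    """Hypothetical component B: canonical composition, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()

def folded_key(text):
    """Hypothetical component C: additionally drops combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return kept.lower()

def agree(key, a, b):
    """True when a component using `key` sees a and b as the same term."""
    return key(a) == key(b)

for name, key in [("raw", raw_key), ("nfc", nfc_key), ("folded", folded_key)]:
    print(name,
          agree(key, COMPOSED, DECOMPOSED),  # same letter, different bytes
          agree(key, COMPOSED, "Godel"))     # diacritic dropped by the user
```

If autocomplete behaved like component C while the index behaved like component B, a suggestion could lead to a query the index does not match, which is the kind of inconsistency that is hard to diagnose from the outside.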

  5. Michael Brundage says:

    Speaking of autocomplete, I typed that on an iPhone, and it “corrected” its to “it’s” without my noticing. But at least it handles umlauts. 🙂

  6. @Paul

    You are quite right that search engines could be more interactive and that Google’s innovation in this regard is great.

  7. Chris Betti says:

    I liked Michael’s post because it pointed out the complications inherent in the whole stack, from source text to web servers and on through any other storage and presentation tools. I would add to the list the difficulty users have entering diacritical characters, in either Unicode form, in an input text field; this matters for engines that don’t normalize both user input and index.

    I found Daniel’s blog post interesting because he’s measuring the search engines’ ability to help users wade through this dirty data. Despite the complications inherent in the whole stack, what can the engines do to get English-speaking users the right information? As the full stack of tools improves its internationalization support, we’ll get improved source data, but for now, dirty data is a fact of life for the search engines.

    Complicating the picture even more is when two separate input concepts normalize to homographs. For example, Russian pisát “to write” vs písat “to piss” (I couldn’t enter the Russian characters successfully, but this version suffices). There are situations in which a non-native Russian speaker could become pretty embarrassed when presenting to a Russian-speaking audience, all because the search engine decided to normalize the two terms to the same thing.
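
One way to see the homograph risk Chris describes is to fold a vocabulary and report which distinct terms end up under the same key. This is only a sketch: the `fold` function is one possible folding (case-fold plus stripping combining marks), chosen as an assumption for the example.

```python
import unicodedata
from collections import defaultdict

def fold(term):
    """Case-fold and drop combining marks (an assumed folding policy)."""
    decomposed = unicodedata.normalize("NFD", term.casefold())
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def find_collisions(terms):
    """Map each folded key to the distinct surface forms that collapse onto it."""
    groups = defaultdict(set)
    for term in terms:
        groups[fold(term)].add(unicodedata.normalize("NFC", term))
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

print(find_collisions(["pisát", "písat", "Gödel"]))
```

Running a check like this over an index vocabulary before enabling folding would at least show which terms stop being distinguishable.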

    One thing the experts could look into is, how are foreign search providers handling these issues? It’s possible that the answer for English-speaking individuals is to normalize everything, but it may be worth investigating the foreign search providers’ tactics for handling the wealth of dirty data out there on the ‘net.

  8. Paul says:

    NLP is fun in that so often you run into cases like this where there just isn’t a universal right answer. Googling Godel, I see a gallery on the front page. What if that’s the Godel I want? Pulling Gödel in would dilute my results even more. Even capitalization can be significant. ‘Papa’ = Spanish for pope, ‘papa’ = Spanish for potato.

    I like Google’s use of “did you mean” to suggest things I might have meant, but still let me see both sets. For researchers who really care about this, ways of explicitly enabling stemming, character normalization, etc. are useful (but overly complex for a Google). In the long run I think the solution is machine intelligence that can differentiate contexts (did you mean the vegetable or the pontiff? Let me find all the pages it was used in that sense).
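
The "show both sets" behavior Paul likes can be sketched as a lookup that returns exact matches and, separately, index terms that differ only by case or diacritics. The toy `index` layout and the `search` function are hypothetical, and this is not a claim about how Google's "did you mean" works.

```python
import unicodedata

def fold(term):
    """Case-fold and drop combining marks (an assumed folding policy)."""
    decomposed = unicodedata.normalize("NFD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(query, index):
    """Return (exact results, 'did you mean' alternatives) without merging them.

    `index` is a hypothetical dict mapping an exact surface term to document ids.
    """
    exact = index.get(query, [])
    alternatives = sorted(
        term for term in index
        if term != query and fold(term) == fold(query)
    )
    return exact, alternatives

# Toy index built from the examples in this thread.
index = {
    "Godel": ["gallery"],
    "Gödel": ["logician"],
    "papa": ["potato"],
    "Papa": ["pope"],
}

print(search("Godel", index))
print(search("papa", index))
```

Keeping the two lists apart leaves the gallery results undiluted while still pointing at the logician, and it leaves the pope-or-potato decision to the user.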

  9. Michael Brundage says:

    Great points about homographs, capitalization, etc. affecting interpretation of the search. Of course, the minute we get beyond something “simple” like character equivalence classes, we’re into all the challenges that make search such a fun and exciting space to work in.

    Then there are stop phrases (“The Help” is the title of a very popular book), punctuation (C++, R.S.V.P., WALL-E, Math.rand(), etc.), mixed encodings (especially in the Far East region, where you get URL-escaped GBK mixed with CJK in the same search request), etc.
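
The punctuation cases can be illustrated with a tokenizer sketch. A naive word-character tokenizer destroys terms like C++; one workaround is to match a list of protected terms before falling back to plain words. The `PROTECTED` list and both functions are assumptions for this example, and a real engine would need a far larger, maintained list.

```python
import re

# Hypothetical list of terms whose punctuation must survive tokenization.
PROTECTED = ["C++", "R.S.V.P.", "WALL-E", "Math.rand()"]

def naive_tokens(text):
    """Split on anything that is not a word character."""
    return re.findall(r"\w+", text.lower())

def build_tokenizer(protected):
    """Return a tokenizer that tries protected terms first, then plain words."""
    # Longest terms first, so a longer protected term wins over a shorter one.
    ordered = sorted(protected, key=len, reverse=True)
    alternatives = [re.escape(term) for term in ordered] + [r"\w+"]
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)

    def tokenize(text):
        return [match.group(0).lower() for match in pattern.finditer(text)]

    return tokenize

tokenize = build_tokenizer(PROTECTED)
print(naive_tokens("C++ vs WALL-E"))
print(tokenize("C++ vs WALL-E"))
```

The same trick does not help with stop phrases such as “The Help”, which need phrase-level handling rather than character-level handling.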

    And these are all just the cases where we assume perfect queries and corpus. In the real world, spelling errors, encoding errors, and disagreement about canonical form abound. Recent examples include the movie “Kick-Ass” (or is it “Kick ass” or “Kickass”?), the product “iPhone” (or is it “i-Phone” or “eye phone” or oops “iPone” or “iPhome”), etc. Maybe you misplaced the umlaut: Kürt Godel (which seems to “work” in both Google and Bing, with no spell-correction or backout links displayed by either engine).
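
For the “Kick-Ass” and “iPhone” cases, a very rough sketch is to compare queries under a canonical key (case-folded, hyphens and spaces removed) and fall back to fuzzy matching for misspellings. The `canonical` and `resolve` helpers and the 0.8 cutoff are assumptions; the fuzzy step uses Python's standard `difflib.get_close_matches`, which is a generic similarity heuristic and not a search-engine spell corrector.

```python
import difflib
import re

def canonical(text):
    """Case-fold and remove hyphens and whitespace (an assumed canonical form)."""
    return re.sub(r"[\s\-]+", "", text.casefold())

def resolve(query, known_titles, cutoff=0.8):
    """Map a query to a known title, exactly by canonical key or else fuzzily."""
    by_key = {canonical(title): title for title in known_titles}
    key = canonical(query)
    if key in by_key:
        return by_key[key]
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None

titles = ["Kick-Ass", "iPhone"]
for query in ["Kick ass", "Kickass", "i-Phone", "iPone", "iPhome", "Android"]:
    print(query, "->", resolve(query, titles))
```

Note that the cutoff is a judgment call: set it lower and phonetic variants may resolve too, but so will unrelated titles.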

    Fun stuff!

  10. Richard says:

    Great post.