[LINK] Diacritics and Search Engines

Kim Davies kim at cynosure.com.au
Thu Jan 17 06:33:10 AEDT 2008


Quoting Roger Clarke on Wednesday January 16, 2008:
>
> I accept that, thanks, but my point wasn't that the paper couldn't be 
> found, nor that the u-umlaut isn't supported.
>
> The point I'm making is that a search on <uberveillance> doesn't locate 
> documents that contain the string <Xberveillance> where X = u-umlaut / ü / 
> %C3%BC (depending on the character-set and encoding that's used).

Perhaps not that surprising, given that the ü in über is a distinct
German letter, separate from u, that is normally transliterated into
ASCII as "ue". Google does have some language awareness: if you search
for "ueber" it will match "über".
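
As a rough illustration of the difference, here is a minimal Python
sketch. It is not how Google does it; the GERMAN_ASCII table and both
function names are made up for this example. It compares a German-style
transliteration key with naive diacritic stripping.

```python
import unicodedata

# Hypothetical transliteration table (an assumption for illustration only).
GERMAN_ASCII = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def german_key(text: str) -> str:
    """Lower-case the text and spell out German umlauts in ASCII."""
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in text)

def strip_diacritics(text: str) -> str:
    """Naive folding: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(german_key("über") == german_key("ueber"))  # True
print(strip_diacritics("über"))                   # uber
print(german_key("über") == german_key("uber"))   # False
```

With the transliteration key, "ueber" and "über" land on the same
search key, while "uber" stays a different string; only the naive
stripping makes "über" look like "uber".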

> I'm arguing that, for a number of purposes, the set 'u-umlaut' is
> a subset of the set 'u', and that searches need to deal with that
> relationship in some way.

I don't think that is a safe assumption to make. It is, at least, a
rather English-centric view. Ask a Swede and they will tell you that
they were taught a 29-letter alphabet at school (28 until the letter "w"
was recently added, because of its appearance in loanwords like "web").
Simply dropping diacritics works in some languages but not in others.
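
To make that concrete, here is a small Python sketch of folding that
depends on the language. The PRESERVE table and the fold() function are
hypothetical, and the letter sets are an illustrative assumption rather
than a complete list.

```python
import unicodedata

# Letters treated as distinct per language (illustrative assumption,
# not a complete or authoritative list).
PRESERVE = {
    "sv": set("åäö"),  # Swedish: these are letters in their own right
    "en": set(),       # English: fold every diacritic away
}

def fold(text: str, lang: str) -> str:
    """Strip diacritics except on letters the language treats as distinct."""
    keep = PRESERVE.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch in keep:
            out.append(ch)
        else:
            # Decompose the character and drop its combining marks.
            base = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in base if not unicodedata.combining(c)))
    return "".join(out)

print(fold("naïve", "en"))  # naive
print(fold("här", "sv"))    # här
print(fold("här", "en"))    # har
```

Under English rules the Swedish word "här" (here) collapses into "har"
(has), which is a different word; under Swedish rules the two stay
apart.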

kim


