[LINK] Diacritics and Search Engines

Alastair Rankine arsptr at internode.on.net
Wed Jan 16 14:06:25 AEDT 2008


Roger Clarke wrote:
> I accept that, thanks, but my point wasn't that the paper couldn't be 
> found, nor that the u-umlaut isn't supported.
>
> The point I'm making is that a search on <uberveillance> doesn't 
> locate documents that contain the string <Xberveillance> where X = 
> u-umlaut / ü / %C3%BC (depending on the character-set and encoding 
> that's used).

I see what you mean now, and yes it is an issue.

I don't know the answer, but I'd *guess* that the reason has something to 
do with the complexity and newness of the Unicode Collation Algorithm 
(http://www.unicode.org/unicode/reports/tr10/). It provides the sorts 
of equivalences that you're asking for; see, for example, the collation 
charts (http://www.unicode.org/charts/collation/).

In other words, it's a hard job, as with just about any i18n task. 
Trying to do it on a search engine scale with the requisite performance 
is a *very* hard job.
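
To make the kind of equivalence discussed above concrete, here is a rough
sketch in Python. It is NOT the Unicode Collation Algorithm: it uses a much
simpler technique, namely canonical decomposition (NFD) followed by dropping
combining marks and case-folding. The function names fold_diacritics and
matches are made up for illustration, and I have no idea whether any search
engine works this way.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a case-folded copy of text with combining marks removed.

    Simplification, not the UCA: decompose to NFD so that a letter such as
    'ü' becomes 'u' followed by U+0308 (combining diaeresis), then drop
    every combining character.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, document: str) -> bool:
    """Hypothetical helper: substring match after folding both sides."""
    return fold_diacritics(query) in fold_diacritics(document)


if __name__ == "__main__":
    print(matches("uberveillance", "A paper on Überveillance"))
```

A real implementation would more likely compare collation keys at primary
strength, which also covers letters that have no canonical decomposition
(this sketch leaves a character like 'ø' untouched). Doing that over a whole
web index is where the performance problem comes in.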



