Google moving to Unicode 5.1 (was Re: [LINK] Diacritics and Search Engines)
Kim Holburn
kim.holburn at gmail.com
Tue May 6 21:16:48 AEST 2008
Vaguely related to Roger's post some time ago.
http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
> Moving to Unicode 5.1
> 5/05/2008 09:38:00 AM
> Posted by Mark Davis, Senior International Software Architect
>
> Google has just begun supporting Unicode 5.1, less than one month
> after it was released. It's now available in search, so people
> speaking languages such as Malayalam can now search for words
> containing the new characters in Unicode 5.1.
>
> Web pages can use a variety of different character encodings, like
> ASCII, Latin-1, Windows-1252, or Unicode. Most encodings can only
> represent a few languages, but Unicode will handle anything from
> Chinese to French to Arabic. We have long used Unicode as the
> internal format for all the text we search: any other encoding is
> first converted to Unicode for processing. So we regularly update to
> each new version of Unicode (and relevant related standards like
> CLDR and BCP 47) to make sure we are current. Thus Unicode plays a
> key role in our mission.
>
> Uptick in native Unicode webpages
>
> Just last December there was an interesting milestone on the web.
> For the first time, we found that Unicode was the most frequent
> encoding found on web pages, overtaking both ASCII and Western
> European encodings—and by coincidence, within 10 days of one
> another. What's more impressive than simply overtaking them is the
> speed with which this happened; take a look at the blue line in this
> graph. [Graph not reproduced here; see the linked post.]
>
> You can see a long-term decline in pages encoded in ASCII
> (unaccented letters A through Z). More recently, there's been a
> significant drop in the use of encodings covering only Western
> European letters (ASCII and a few accented letters like Ä, Ç, and
> Ø). We're seeing similar declines in other language-specific
> encodings. Unicode, on the other hand, is showing a sharp increase
> in usage.
>
> This is based on our indexing of web pages, and thus may vary
> somewhat from what other search engines find. However, the trends
> are pretty clear, and the continued rise in use of Unicode makes it
> even easier to do the processing for the many languages that we cover.
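
A rough illustration of the "any other encoding is first converted to
Unicode" step described in the quoted post, and of why the narrower
encodings cover only a few languages. This is only a sketch in Python,
not Google's actual pipeline: the sample bytes and the helper names
to_unicode and can_encode are made up for the example.

```python
def to_unicode(raw: bytes, declared: str) -> str:
    """Decode raw page bytes into a Unicode string using the declared encoding."""
    return raw.decode(declared)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical page fragments, each paired with the encoding it declares.
samples = [
    (b"caf\xe9", "latin-1"),                 # e-acute as a single Latin-1 byte
    (b"\x93quoted\x94", "cp1252"),           # Windows-1252 curly quotes
    ("മലയാളം".encode("utf-8"), "utf-8"),     # Malayalam, stored as UTF-8
]

# After decoding, everything is an ordinary Unicode string.
decoded = [to_unicode(raw, enc) for raw, enc in samples]

for text in decoded:
    coverage = {enc: can_encode(text, enc) for enc in ("ascii", "latin-1", "utf-8")}
    print(repr(text), coverage)
```

Only UTF-8 can represent all three decoded strings, which is the practical
reason for normalising everything to Unicode before processing.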
--
Kim Holburn
IT Network & Security Consultant
Ph: +39 06 855 4294 M: +39 3494957443
mailto:kim at holburn.net aim://kimholburn
skype://kholburn - PGP Public Key on request
Democracy imposed from without is the severest form of tyranny.
-- Lloyd Biggle, Jr. Analog, Apr 1961