Google moving to Unicode 5.1 (was Re: [LINK] Diacritics and Search Engines)

Kim Holburn kim.holburn at gmail.com
Tue May 6 21:16:48 AEST 2008


Vaguely related to Roger's post some time ago.

http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

> Moving to Unicode 5.1
> 5/05/2008 09:38:00 AM
> Posted by Mark Davis, Senior International Software Architect
>
> Google has just begun supporting Unicode 5.1, less than one month  
> after it was released. It's now available in search, so people  
> speaking languages such as Malayalam can now search for words  
> containing the new characters in Unicode 5.1.
>
> Web pages can use a variety of different character encodings, like  
> ASCII, Latin-1, or Windows 1252, or Unicode. Most encodings can only  
> represent a few languages, but Unicode will handle anything from  
> Chinese to French to Arabic. We have long used Unicode as the  
> internal format for all the text we search: any other encoding is  
> first converted to Unicode for processing. So we regularly update to  
> each new version of Unicode (and relevant related standards like  
> CLDR and BCP 47) to make sure we are current. Thus Unicode plays a  
> key role in our mission.
>
> Uptick in native Unicode webpages
>
> Just last December there was an interesting milestone on the web.  
> For the first time, we found that Unicode was the most frequent  
> encoding found on web pages, overtaking both ASCII and Western  
> European encodings—and by coincidence, within 10 days of one  
> another. What's more impressive than simply overtaking them is the  
> speed with which this happened; take a look at the blue line in this  
> graph.
>
> You can see a long-term decline in pages encoded in ASCII  
> (unaccented letters A through Z). More recently, there's been a  
> significant drop in the use of encodings covering only Western  
> European letters (ASCII and a few accented letters like Ä, Ç, and  
> Ø). We're seeing similar declines in other language-specific  
> encodings. Unicode, on the other hand, is showing a sharp increase  
> in usage.
>
> This is based on our indexing of web pages, and thus may vary  
> somewhat from what other search engines find. However, the trends  
> are pretty clear, and the continued rise in use of Unicode makes it  
> even easier to do the processing for the many languages that we cover.
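
For anyone wondering what "any other encoding is first converted to
Unicode for processing" looks like in practice, here is a minimal
Python sketch. It is not Google's code: the helper names to_unicode
and is_pure_ascii and the sample byte strings are made up for
illustration, and it assumes the page's encoding is already known
(real crawlers also have to detect or guess it).

```python
# Illustrative sketch only; helper names and sample data are hypothetical.

def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    return raw.decode(declared_encoding)


def is_pure_ascii(raw: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in raw)


if __name__ == "__main__":
    # The same kind of text as it might arrive in different encodings.
    samples = [
        ("ascii", b"Cafe"),
        ("latin-1", b"Caf\xe9"),           # 0xE9 is e-acute in Latin-1
        ("cp1252", b"\x93Caf\xe9\x94"),    # curly quotes exist in Windows-1252
        ("utf-8", "Caf\u00e9".encode("utf-8")),
    ]
    for encoding, raw in samples:
        text = to_unicode(raw, encoding)
        # After decoding, everything is an ordinary Unicode str and can
        # be processed the same way regardless of the source encoding.
        print(encoding, is_pure_ascii(raw), [hex(ord(c)) for c in text])
```

Once decoded, the Latin-1 and UTF-8 samples compare equal as strings,
which is the point of normalising everything to Unicode before
indexing.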

--
Kim Holburn
IT Network & Security Consultant
Ph: +39 06 855 4294  M: +39 3494957443
mailto:kim at holburn.net  aim://kimholburn
skype://kholburn - PGP Public Key on request

Democracy imposed from without is the severest form of tyranny.
                           -- Lloyd Biggle, Jr. Analog, Apr 1961






