[LINK] Dot Asia a good idea?

Kim Davies kim at cynosure.com.au
Tue Oct 9 22:33:49 AEST 2007


Quoting Roger Clarke on Tuesday October 09, 2007:
| 
| Challenge #1:  how many of you can declare that you saw all of the 
| above rendered in the appropriate glyphs??

I can :-)

| Eudora works interestingly.
| I have HTML switched off, of course.  (Please let's *not* switch that 
| particular thread back on!).

Theoretically, it should make no difference. However, a particular mail
reader might hand HTML off to a web rendering engine for display, and
that engine could have a better grasp of how to deal with other
scripts.

| When I do a Reply-To, it reverts them to ASCII equivalents.

It could be making a best effort, given that your outbound encoding
format probably does not support the repertoire of characters you are
writing with. The default for Western European languages is often 'ISO
8859-1', which only encodes some Latin characters.
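
As a rough illustration of that guess, here is a small Python sketch. It
is not what Eudora does internally (I have not checked), and the
variable names are made up for the example:

```python
# One Tamil letter (U+0BAA, PA) followed by some plain ASCII text.
text = "\u0baa test"

# ISO 8859-1 has no slot for U+0BAA, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A "best effort" encoder substitutes something it can represent.
fallback = text.encode("iso-8859-1", errors="replace")

print(strict_ok)  # False
print(fallback)   # b'? test'
```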

| Challenge #2 - explain in terms that educated mortals can understand 
| why they display as they do above, i.e. rendered in 8-bit ASCII.

The wire format is not ASCII but it is 8-bit. There are a number of
transformation formats (encodings) of Unicode. UTF-8 is the 8-bit
encoding and is prevalent on the Internet, but there are others like
UTF-16.
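
To see the difference between the two encodings, a quick Python sketch
(the sample characters are my own choice; "utf-16-be" is used only to
keep the byte order mark out of the output):

```python
samples = {"Latin A": "A", "Tamil PA": "\u0baa"}
sizes = {}

for name, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    sizes[name] = (len(utf8), len(utf16))
    # bytes.hex(" ") needs Python 3.8 or later
    print(name, "|", utf8.hex(" "), "|", utf16.hex(" "))

# Latin A | 41 | 00 41
# Tamil PA | e0 ae aa | 0b aa
```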

| But the alphabetic scripts also surprise me, because they include repeats.

The exact bytes used to represent characters in UTF-8 in the same
language are very likely to repeat, as the characters are all within the
same general area of the code space. In the multi-byte encodings the
leading bytes are used to jump up to the right area of the character
set, and the final byte picks out the character within it.
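
To make the "jump" concrete, here is a minimal Python sketch of the
three-byte form of UTF-8, which covers U+0800 to U+FFFF. The function
name is my own, and it skips the surrogate check a real encoder needs:

```python
def utf8_three_byte(cp: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("not in the three-byte range")
    return bytes([
        0xE0 | (cp >> 12),          # lead byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # continuation byte: middle 6 bits
        0x80 | (cp & 0x3F),         # continuation byte: low 6 bits
    ])

# Every code point in the Tamil block (U+0B80..U+0BFF) begins with one
# of just two two-byte prefixes.
prefixes = {utf8_three_byte(cp)[:2].hex() for cp in range(0x0B80, 0x0C00)}
print(sorted(prefixes))  # ['e0ae', 'e0af']
```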

Taking the Tamil label, it is a representation of seven Unicode code
points:

    U+0BAA  (consonant - PA)
    U+0BB0  (consonant - RA)
    U+0BBF  (vowel - I)
    U+0B9F  (consonant - TTA)
    U+0BCD  (diacritic - virama)
    U+0B9A  (consonant - CA)
    U+0BC8  (vowel - AI)

In UTF-8 it is encoded as 21 bytes, three for each code point. The first
byte of each code point is the same, and the second is nearly always the
same:

    E0 AE AA / E0 AE B0 / E0 AE BF / E0 AE 9F / E0 AF 8D / E0 AE 9A / E0 AF 88

So, if these bytes are shown one character apiece, as if they were
ASCII, the result would look very repetitious.
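
You can reproduce this with a few lines of Python. This is only a
sketch: ISO 8859-1 stands in for whatever 8-bit character set a mail
reader might fall back to, and the variable names are my own:

```python
# The seven code points of the Tamil label listed above.
label = "\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8"

raw = label.encode("utf-8")
# bytes.hex(" ") needs Python 3.8 or later
grouped = " / ".join(ch.encode("utf-8").hex(" ").upper() for ch in label)

print(len(raw))  # 21
print(grouped)   # E0 AE AA / E0 AE B0 / E0 AE BF / E0 AE 9F / E0 AF 8D / E0 AE 9A / E0 AF 88

# Misread the same bytes as one character each.
misread = raw.decode("iso-8859-1")
# 0xE0 is a-grave, 0xAE the registered sign, 0xAF a macron in ISO 8859-1.
print(misread.count("\u00e0"))        # 7
print(misread.count("\u00e0\u00ae"))  # 5
print(misread.count("\u00e0\u00af"))  # 2
```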

IDNA uses a much more efficient algorithm called Punycode, which is
able to compress this string into much less space - only 14 letters and
digits. This is because it uses Bootstring encoding and from code point
to code point only encodes the deltas. This of course would prove very
inefficient if every second character were in a different script, but
works well in most cases:

    HLCJ6AYA9ESC7A
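
If you want to check this yourself, Python ships a "punycode" codec in
its standard library. A minimal sketch (the variable names are mine; the
codec emits lower case, and in a real domain name the label would also
carry the "xn--" prefix):

```python
# The seven code points of the Tamil label listed above.
label = "\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8"

encoded = label.encode("punycode")

print(encoded.decode("ascii").upper())            # HLCJ6AYA9ESC7A
print(len(encoded), len(label.encode("utf-8")))   # 14 21

# Decoding gets the original code points back.
round_trip = encoded.decode("punycode")
print(round_trip == label)                        # True
```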

cheers,

kim


