Re: [EAI] Body parts




At 03:03 08/07/11, John C Klensin wrote:

>So, one answer to your question is that for CJK characters that
>are coded in the Unicode BMP (Plane 0), the storage ratio is
>going to be strictly two octets of Big5 to three octets of
>UTF-8.

Yes. That's the basic ratio.
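
As a rough check of the two-to-three ratio quoted above, here is a minimal sketch. It assumes Python 3 and its built-in `big5` codec; the sample string is made up for illustration and is not taken from the thread.

```python
# Illustrative sample only: two common CJK characters from the BMP.
sample = "中文"

big5_octets = len(sample.encode("big5"))    # 2 octets per character
utf8_octets = len(sample.encode("utf-8"))   # 3 octets per character

print(big5_octets, utf8_octets)
```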

>If one starts mixing in characters from Plane 2, then
>UTF-8 starts occupying four octets.  I don't know whether there
>are any such characters in Big5 but, if there are, the
>efficiency ratio would depend on the percentage of Plane 2
>characters and that, in turn, would depend on the specific
>corpus involved.

I don't think Big5 itself has any Plane 2 characters. Even
the vendor extensions must have made it into Extension A,
which is in the BMP. And even if not, unless the text is a list
of Plane 2 characters or something else Plane 2-specific,
probably only about 1 character in a thousand would come from
Plane 2, so it's mostly irrelevant.
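
To put a number on "mostly irrelevant", the sketch below computes the average UTF-8 octets per character for a synthetic sample. The 1-in-1000 mix is the guess from the paragraph above, not a measurement, and the helper names `plane` and `utf8_octets` are hypothetical ones introduced just for this example.

```python
def plane(ch):
    """Unicode plane number of a single character (0 = BMP)."""
    return ord(ch) >> 16

def utf8_octets(ch):
    """Number of octets one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Synthetic sample: 999 BMP characters plus one Plane 2 character
# (U+20000, the first code point of CJK Extension B).
plane2_char = "\U00020000"
sample = "中" * 999 + plane2_char

average = sum(utf8_octets(c) for c in sample) / len(sample)
print(plane(plane2_char), utf8_octets(plane2_char), average)
```

Under that assumed mix, the average moves from 3 to 3.001 octets per character, so the basic ratio is essentially unchanged.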


>But, coming back to my opening comments, at today's bandwidth,
>processing, and storage costs, it would take a rather large body
>of text for any of this to make any difference.

Yes. These days, I'm less worried by large emails, but more
by their total number :-(.

>And, for an
>email body part containing a large enough body of text for the
>difference to be significant, compressing UTF-8 using one or
>another standard method would almost certainly yield more
>improvement than translating the UTF-8 to Big5.

Generic compression methods need some data to adapt to,
and so only become efficient after a certain amount of input.
But then on small emails, the headers are dominant (and in most
headers, ASCII is dominant).
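
Both points can be seen in a small experiment. This is only a sketch: it assumes Python 3 with the standard `zlib` module, and the long sample is one made-up sentence repeated 200 times, which flatters the compressor far more than real mail would.

```python
import zlib

# Made-up sample text; the repetition is artificial.
sentence = "今天天氣很好，我們去公園散步。"
long_text = sentence * 200
short_text = "你好"

def sizes(text):
    """Octet counts for Big5, UTF-8, and zlib-compressed UTF-8."""
    utf8 = text.encode("utf-8")
    return {
        "big5": len(text.encode("big5")),
        "utf8": len(utf8),
        "utf8_zlib": len(zlib.compress(utf8)),
    }

long_sizes = sizes(long_text)
short_sizes = sizes(short_text)
print(long_sizes)
print(short_sizes)
```

On the long sample, compressed UTF-8 comes out well below the Big5 size, while on the two-character message the zlib framing alone makes the compressed form larger than the raw UTF-8.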


>Note that there may be lots of other reasons to transmit body
>parts in Big5 (or GB or Shift-JIS or...), but I suggest that
>transmission or storage efficiency will rarely be one of them.

No Shift-JIS for email, except occasionally for spam and the
like. Japanese email uses iso-2022-jp traditionally (and lately,
more and more, UTF-8).
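
One property that distinguishes the two encodings on the wire is that iso-2022-jp uses only 7-bit octets, while Shift-JIS does not. The sketch below shows this, assuming Python 3 and its built-in `iso2022_jp` and `shift_jis` codecs; the sample word is illustrative only.

```python
text = "日本語"  # illustrative sample

iso2022 = text.encode("iso2022_jp")
sjis = text.encode("shift_jis")

# iso-2022-jp switches character sets with escape sequences and
# keeps every octet below 0x80; Shift-JIS uses octets with the high bit set.
iso2022_is_7bit = all(b < 0x80 for b in iso2022)
sjis_is_7bit = all(b < 0x80 for b in sjis)

print(iso2022, iso2022_is_7bit)
print(sjis, sjis_is_7bit)
```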

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     

_______________________________________________
IMA mailing list
IMA at ietf.org
https://www.ietf.org/mailman/listinfo/ima


