Re: [EAI] Body parts

--On Thursday, 10 July, 2008 10:29 +0100 Charles Lindsey
<chl at clerew.man.ac.uk> wrote:

> Actually, there may be good reasons for using other charsets
> in body parts. Charsets in the higher reaches of Unicode
> require rather long strings of bytes per character in UTF-8.
> That is not a serious issue for headers, but body parts might
> well be significantly shorter in charsets more suited to the
> language in use.

While that clearly would have been a very big deal over the slow
and expensive Internet connections of, say, 20-odd years ago, I
imagine we could have an endless, and pointless, debate about
this subject in today's environment.

> Can anyone provide data on the efficiency of UTF-8 and BIG5,
> for example?

I'm sure someone will correct me if I get this wrong but, as I
understand it, Big5 uses exactly two bytes (16 bits) for each of
its CJK characters.  UTF-8 goes to three octets above U+07FF and
the first CJK character is U+3000 (that character and the ones
close to it are invalid for IDNA, but that is irrelevant here)
and stays at three octets until one gets out of plane 0.

So, one answer to your question is that for CJK characters that
are coded in the Unicode BMP (Plane 0), the storage ratio is
going to be strictly two octets of Big5 to three octets of
UTF-8.  If one starts mixing in characters from Plane 2, then
UTF-8 starts occupying four octets.  I don't know whether there
are any such characters in Big5 but, if there are, the
efficiency ratio would depend on the percentage of Plane 2
characters and that, in turn, would depend on the specific
corpus involved.
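
For anyone who wants to check the arithmetic, here is a minimal
sketch in Python.  It assumes the standard "big5" codec that ships
with Python; the sample string and the helper name encoded_sizes
are purely illustrative and not taken from any real corpus.

```python
# Hypothetical sample: six BMP characters that Big5 also covers.
SAMPLE = "中文電子郵件"


def encoded_sizes(text):
    """Return (utf8_octets, big5_octets) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("big5"))


utf8_len, big5_len = encoded_sizes(SAMPLE)
print(utf8_len, big5_len)  # 18 12, i.e. three octets vs. two per character

# A Plane 2 code point (U+20000) takes four octets in UTF-8.
print(len("\U00020000".encode("utf-8")))  # 4
```

Text that mixes in ASCII or Plane 2 characters would shift the
ratio, so a real measurement needs a representative corpus.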

But, coming back to my opening comments, at today's bandwidth,
processing, and storage costs, it would take a rather large body
of text for any of this to make any difference.  And, for an
email body part containing a large enough body of text for the
difference to be significant, compressing UTF-8 using one or
another standard method would almost certainly yield more
improvement than translating the UTF-8 to Big5.
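
A rough way to test the compression point is sketched below, again
in Python, using zlib as one example of a standard method.  The
helper name compare and the sample are assumptions for illustration
only; in particular, the sample is a short phrase repeated many
times, which compresses far better than real prose would, so the
numbers it produces say nothing about a real message body.

```python
import zlib

# Hypothetical, highly repetitive sample; real text compresses much less.
SAMPLE = "中文電子郵件" * 500


def compare(text):
    """Return octet counts for raw Big5, raw UTF-8, and zlib-compressed UTF-8."""
    utf8 = text.encode("utf-8")
    return {
        "big5_raw": len(text.encode("big5")),
        "utf8_raw": len(utf8),
        "utf8_zlib": len(zlib.compress(utf8)),
    }


for name, size in compare(SAMPLE).items():
    print(name, size)
```

Running the same comparison on an actual corpus would be the only
way to see how large the gain is in practice.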

Note that there may be lots of other reasons to transmit body
parts in Big5 (or GB or Shift-JIS or...), but I suggest that
transmission or storage efficiency will rarely be one of them.

     john
