UTF-8 vs. Unicode (was: Re: Volunteer needed to serve as IANA charset reviewer)
Date: Wed, 6 Sep 2006 17:35:28 -0700 (PDT)
From: Kenneth Whistler <kenw at sybase.com>
Subject: UTF-8 vs. Unicode (was: Re: Volunteer needed to serve as IANA charset reviewer)
To: moore at cs.utk.edu
Cc: hardie at qualcomm.com, ietf-charsets at iana.org, discuss at apps.ietf.org
Following up on this topic:
> As for utf-8 vs. Unicode, this is a bit tricky. I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.
"UTF-8 vs. Unicode" is an incomplete way of specifying the
distinctions to be made. The real question is which level of
specification is appropriate to your concern.
If your concern is specification of the character semantics,
then you designate the Unicode Standard (or the equivalent
ISO/IEC 10646) and a version level to get the exact
repertoire.
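
A minimal sketch in Python of why the version level matters, assuming
the standard unicodedata module (the helper name is_assigned is only
illustrative). The module exposes the database bundled with the
interpreter and, separately, a frozen 3.2.0 database, so the same code
point can be inside one repertoire and outside the other:

```python
import unicodedata

def is_assigned(ch, ucd=unicodedata):
    """Return True if `ucd` assigns this code point (general category is not Cn)."""
    return ucd.category(ch) != "Cn"

rupee = "\u20b9"  # INDIAN RUPEE SIGN, added in Unicode 6.0

print(unicodedata.unidata_version)            # version bundled with this interpreter
print(unicodedata.ucd_3_2_0.unidata_version)  # the frozen 3.2.0 database
print(is_assigned(rupee))                         # assigned in the bundled version
print(is_assigned(rupee, unicodedata.ucd_3_2_0))  # unassigned in version 3.2.0
```
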
If your concern is memory representation or API support, then
you designate one of the three Character Encoding Forms formally
and normatively defined in the Unicode Standard (and equivalently
in ISO/IEC 10646): UTF-8, UTF-16, or UTF-32.
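
The three encoding forms differ in the width of their code units (8,
16, or 32 bits) and in how many units one character needs. A rough
Python sketch; the function name code_units is hypothetical, and the
byte counts are divided by the unit width because Python's codecs
return bytes rather than code units:

```python
def code_units(text):
    """Count the code units needed for `text` in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# U+0041, U+00E9, U+20AC, and U+1F600 (outside the BMP)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```
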
If your concern is serial byte representation in a char-oriented
protocol or stream, then you designate one of the Character
Encoding Schemes (CESs) formally and normatively defined in the
Unicode Standard: UTF-8, (UTF-16BE, UTF-16LE, UTF-16 with BOM),
(UTF-32BE, UTF-32LE, UTF-32 with BOM).
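
To make the byte-order distinction concrete, here is a small Python
sketch that serializes one character (U+20AC) under each scheme. The
labels and the function name serialize are illustrative only. Note
that Python's "utf-16" and "utf-32" codecs write a BOM followed by the
platform's native byte order, so those two rows can differ between
machines:

```python
import codecs

def serialize(text):
    """Map each CES label to the bytes Python's codec emits for `text`."""
    return {
        "UTF-8": text.encode("utf-8"),
        "UTF-16BE": text.encode("utf-16-be"),
        "UTF-16LE": text.encode("utf-16-le"),
        "UTF-16 with BOM": text.encode("utf-16"),  # BOM + native byte order
        "UTF-32BE": text.encode("utf-32-be"),
        "UTF-32LE": text.encode("utf-32-le"),
        "UTF-32 with BOM": text.encode("utf-32"),  # BOM + native byte order
    }

for label, data in serialize("\u20ac").items():
    print(f"{label:16}", " ".join(f"{b:02x}" for b in data))
```
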
All of these CESs are fully interoperable and compatible with
each other: each can represent every Unicode character, so text
converts losslessly from any one to any other. And only those CESs
normatively defined in the Unicode Standard should be considered
CESs of Unicode.
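
A quick way to check that interoperability claim, sketched in Python
under the assumption that the input is well-formed text (no lone
surrogates). The names CES_CODECS, transcode, and round_trips are
hypothetical helpers, not part of any library:

```python
CES_CODECS = ["utf-8", "utf-16-be", "utf-16-le", "utf-16",
              "utf-32-be", "utf-32-le", "utf-32"]

def transcode(data, src, dst):
    """Convert bytes in scheme `src` to bytes in scheme `dst`."""
    return data.decode(src).encode(dst)

def round_trips(text):
    """True if `text` survives conversion between every pair of schemes."""
    for src in CES_CODECS:
        encoded = text.encode(src)
        for dst in CES_CODECS:
            if transcode(encoded, src, dst).decode(dst) != text:
                return False
    return True

print(round_trips("Unicode: \u03b1\u03b2\u03b3 \u65e5\u672c\u8a9e \U0001F600"))
```
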
> And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages. IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.
Ah, but that is none other than UTF-16, which is in
widespread use for that reason, among others. But it doesn't
make much sense for the web or for most internet protocols,
because of the already existing ubiquity of UTF-8 in those
contexts.
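
The density point can be checked with a few short strings. This is
only an illustrative Python sketch: the sample strings and the
function name byte_sizes are my own choices, and BOM-less codecs are
used so that only the character data is counted. UTF-16 spends two
bytes on every BMP character regardless of script, while UTF-8 ranges
from one to three:

```python
def byte_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`, excluding any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "English": "Hello",
    "Greek": "\u03b1\u03b2\u03b3\u03b4\u03b5",
    "Japanese": "\u65e5\u672c\u8a9e",
}

for language, text in samples.items():
    utf8, utf16 = byte_sizes(text)
    print(f"{language:9} chars={len(text)} utf-8={utf8} utf-16={utf16}")
```
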
> But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.
I agree completely with that assessment.
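
The ASCII-compatibility property is easy to demonstrate. In the Python
sketch below, the "Name: value" line format and the function name
split_header are assumptions made for illustration, not taken from any
particular protocol. Because every byte of a non-ASCII character in
UTF-8 is 0x80 or above, a parser can split on ASCII delimiters before
decoding:

```python
def split_header(line):
    """Split a 'Name: value' line on ASCII delimiters, then decode the value as UTF-8."""
    name, _, value = line.rstrip(b"\r\n").partition(b": ")
    return name.decode("ascii"), value.decode("utf-8")

line = "Subject: \u65e5\u672c\u8a9e and ASCII\r\n".encode("utf-8")
print(split_header(line))

# ASCII-only text is byte-for-byte identical in US-ASCII and UTF-8.
print("Subject".encode("utf-8") == "Subject".encode("ascii"))
# No byte of a non-ASCII character can be mistaken for an ASCII delimiter.
print(all(b >= 0x80 for b in "\u65e5\u672c\u8a9e".encode("utf-8")))
# UTF-16 lacks this property: ASCII letters pick up 0x00 bytes.
print(b"\x00" in "Subject".encode("utf-16-le"))
```
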
--Ken
>
> Keith
>
>