idnits 2.17.1 

draft-yergeau-utf8-rev-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-19) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == There are 4 instances of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  -- The abstract seems to indicate that this document obsoletes RFC2044, but
     the header doesn't have an 'Obsoletes:' line to match this.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (14 October 1997) is 9684 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'ISO-10646' on line 311 looks like a reference

  -- Missing reference section? 'UNICODE' on line 339 looks like a reference

  -- Missing reference section? 'US-ASCII' on line 342 looks like a reference

  -- Missing reference section? 'RFC1642' on line 335 looks like a reference

  -- Missing reference section? 'FSS-UTF' on line 116 looks like a reference

  -- Missing reference section? 'MIME' on line 318 looks like a reference

  -- Missing reference section? 'RFC1641' on line 332 looks like a reference


     Summary: 7 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                       F. Yergeau
3	Internet Draft                                       Alis Technologies
4	<draft-yergeau-utf8-rev-00.txt>                          14 April 1997
5	Expires 14 October 1997

7	[Will obsolete RFC 2044]

9	        UTF-8, a transformation format of Unicode and ISO 10646

11	Status of this Memo

13	   This document is an Internet-Draft.  Internet-Drafts are working doc-
14	   uments of the Internet Engineering Task Force (IETF), its areas, and
15	   its working groups. Note that other groups may also distribute work-
16	   ing documents as Internet-Drafts.

18	   Internet-Drafts are draft documents valid for a maximum of six
19	   months.  Internet-Drafts may be updated, replaced, or obsoleted by
20	   other documents at any time.  It is not appropriate to use Internet-
21	   Drafts as reference material or to cite them other than as a "working
22	   draft" or "work in progress".

24	   To learn the current status of any Internet-Draft, please check the
25	   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
26	   Directories on ds.internic.net (US East Coast), nic.nordu.net
27	   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
28	   Rim).

30	   Distribution of this document is unlimited.

32	Abstract

34	   ISO/IEC 10646-1 and the Unicode Standard jointly define a multi-octet
35	   character set which encompasses most of the world's writing systems.
36	   Multi-octet characters, however, are not compatible with many current
37	   applications and protocols, and this has led to the development of a
38	   few so-called UCS transformation formats (UTF), each with different
39	   characteristics.  UTF-8, the object of this memo, has the character-
40	   istic of preserving the full US-ASCII range, providing compatibility
41	   with file systems, parsers and other software that rely on US-ASCII
42	   values but are transparent to other values. This memo updates and
43	   replaces RFC 2044, in particular addressing the question of versions
44	   of the relevant standards.

46	1.  Introduction

48	   ISO/IEC 10646-1 [ISO-10646] and the Unicode Standard [UNICODE]
49	   jointly define a 16-bit character set, UCS-2, which encompasses most
50	   of the world's writing systems.  ISO 10646 further defines a 31-bit
51	   character set, UCS-4, with currently no assignments outside of the
52	   region corresponding to UCS-2 (the Basic Multilingual Plane, BMP).
53	   The UCS-2 and UCS-4 encodings, however, are hard to use in many cur-
54	   rent applications and protocols that assume 8 or even 7 bit charac-
55	   ters.  Even newer systems able to deal with 16 bit characters cannot
56	   process UCS-4 data. This situation has led to the development of so-
57	   called UCS transformation formats (UTF), each with different charac-
58	   teristics.

60	   UTF-1 has only historical interest, having been removed from ISO
61	   10646.  UTF-7 has the quality of encoding the full Unicode repertoire
62	   using only octets with the high-order bit clear (7 bit US-ASCII val-
63	   ues, [US-ASCII]), and is thus deemed a mail-safe encoding
64	   ([RFC1642]).  UTF-8, the object of this memo, uses all bits of an
65	   octet, but has the quality of preserving the full US-ASCII range: US-
66	   ASCII characters are encoded in one octet having the normal US-ASCII
67	   value, and any octet with such a value can only stand for a US-ASCII
68	   character, and nothing else.

70	   UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
71	   into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
72	   UTF-8 in that UCS-2 values from the reserved range must be treated
73	   specially in the UTF-8 transformation.

75	   UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
76	   octets, where the number of octets, and the value of each, depend on
77	   the integer value assigned to the character in ISO 10646.  This
78	   transformation format has the following characteristics (all values
79	   are in hexadecimal):

81	   -  Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
82	      correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
83	      consequence is that a plain ASCII string is also a valid UTF-8
84	      string.

86	   -  US-ASCII values do not appear otherwise in a UTF-8 encoded charac-
87	      ter stream.  This provides compatibility with file systems or
88	      other software (e.g. the printf() function in C libraries) that
89	      parse based on US-ASCII values but are transparent to other val-
90	      ues.

92	   -  Round-trip conversion is easy between UTF-8 and either of UCS-4,
93	      UCS-2 or Unicode.

95	   -  The first octet of a multi-octet sequence indicates the number of
96	      octets in the sequence.

98	   -  The octet values FE and FF never appear.

100	   -  Character boundaries are easily found from anywhere in an octet
101	      stream.

103	   -  The lexicographic sorting order of UCS-4 strings is preserved.  Of
104	      course this is of limited interest since the sort order is not
105	      culturally valid in either case.

107	   -  The Boyer-Moore fast search algorithm can be used with UTF-8 data.

109	   -  UTF-8 strings can be fairly reliably recognized as such by a sim-
110	      ple algorithm, i.e. the probability that a string of characters in
111	      any other encoding appears as valid UTF-8 is low, diminishing with
112	      increasing string length.

114	   UTF-8 was originally a project of the X/Open Joint Internationaliza-
115	   tion Group XOJIG with the objective to specify a File System Safe UCS
116	   Transformation Format [FSS_UTF] that is compatible with UNIX systems,
117	   supporting multilingual text in a single encoding.  The original
118	   authors were Gary Miller, Greger Leijonhufvud and John Entenmann.
119	   Later, Ken Thompson and Rob Pike did significant work for the formal
120	   UTF-8.

122	   A description can also be found in Unicode Technical Report #4 and in
123	   the Unicode Standard, version 2.0 [UNICODE].  The definitive refer-
124	   ence, including provisions for UTF-16 data within UTF-8, is Annex R
125	   of ISO/IEC 10646-1 [ISO-10646].

127	2.  UTF-8 definition

129	   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
130	   The only octet of a "sequence" of one has the higher-order bit set to
131	   0, the remaining 7 bits being used to encode the character value. In
132	   a sequence of n octets, n>1, the initial octet has the n higher-order
133	   bits set to 1, followed by a bit set to 0.  The remaining bit(s) of
134	   that octet contain bits from the value of the character to be
135	   encoded.  The following octet(s) all have the higher-order bit set to
136	   1 and the following bit set to 0, leaving 6 bits in each to contain
137	   bits from the character to be encoded.

139	   The table below summarizes the format of these different octet types.
140	   The letter x indicates bits available for encoding bits of the UCS-4
141	   character value.

143	   UCS-4 range (hex.)    UTF-8 octet sequence (binary)
144	   0000 0000-0000 007F   0xxxxxxx
145	   0000 0080-0000 07FF   110xxxxx 10xxxxxx
146	   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
148	   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
149	   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
150	   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

152	   Encoding from UCS-4 to UTF-8 proceeds as follows:

154	   1) Determine the number of octets required from the character value
155	      and the first column of the table above.

157	   2) Prepare the high-order bits of the octets as per the second column
158	      of the table.

160	   3) Fill in the bits marked x from the bits of the character value,
161	      starting from the lower-order bits of the character value and
162	      putting them first in the last octet of the sequence, then the
163	      next to last, etc. until all x bits are filled in.
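
The three encoding steps above can be sketched in Python as follows.
This is a minimal illustration that follows the table literally, up to
the six-octet form; the name encode_ucs4 and the layout of the range
table are assumptions made for this example, not part of this memo.

```python
# (upper bound of UCS-4 range, number of octets, high-order bits of first octet)
_RANGES = (
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
)


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 character value as a UTF-8 octet sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("not a UCS-4 character value")
    # Step 1: determine the number of octets from the first column.
    for upper, count, lead_bits in _RANGES:
        if value <= upper:
            break
    if count == 1:
        return bytes([value])
    octets = [0] * count
    # Steps 2 and 3: set the 10 marker on each following octet and fill
    # its six x bits, starting with the last octet and the lower-order
    # bits of the character value.
    for i in range(count - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # The remaining bits go into the x bits of the first octet.
    octets[0] = lead_bits | value
    return bytes(octets)
```

For instance, encode_ucs4(0x2262) yields the octets E2 89 A2, and the
largest value, 7FFF FFFF, yields FD BF BF BF BF BF.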

165	   The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
166	   obtained from the above, in principle, by simply extending each
167	   UCS-2 character with two zero-valued octets.  However, UCS-2 val-
168	   ues between D800 and DFFF, being actually UCS-4 characters trans-
169	   formed through UTF-16, need special treatment: the UTF-16 trans-
170	   formation must be undone, yielding a UCS-4 character that is then
171	   transformed as above.
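
Undoing the UTF-16 transformation can be sketched as below. The split
of the reserved range into high-half values (D800 to DBFF) and low-half
values (DC00 to DFFF), and the arithmetic that combines them, come from
the definition of UTF-16 and are not spelled out in this memo; the
function name ucs2_to_ucs4 is hypothetical.

```python
from typing import List


def ucs2_to_ucs4(units: List[int]) -> List[int]:
    """Turn a sequence of UCS-2 values into UCS-4 character values,
    combining pairs from the reserved range D800-DFFF."""
    values = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high-half value must be followed by a low-half value.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high-half value without a low-half value")
            low = units[i + 1]
            values.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low-half value")
        else:
            # Any other UCS-2 value is extended with zero-valued octets,
            # which leaves the integer value unchanged.
            values.append(unit)
            i += 1
    return values
```

The pair D800 DC00 becomes the UCS-4 value 0001 0000, which then takes
the four-octet form in the table above.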

173	   Decoding from UTF-8 to UCS-4 proceeds as follows:

175	   1) Initialize the 4 octets of the UCS-4 character with all bits set
176	      to 0.

178	   2) Determine which bits encode the character value from the number of
179	      octets in the sequence and the second column of the table above
180	      (the bits marked x).

182	   3) Distribute the bits from the sequence to the UCS-4 character,
183	      first the lower-order bits from the last octet of the sequence and
184	      proceeding to the left until no x bits are left.

186	   If the UTF-8 sequence is no more than three octets long, decoding
187	   can proceed directly to UCS-2 (or equivalently Unicode).

189	   A more detailed algorithm and formulae can be found in [FSS_UTF],
191	   [UNICODE] or Annex R to [ISO-10646].
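
The decoding steps above can be sketched in Python as follows. The
names decode_one and decode_utf8 are hypothetical. The sketch checks
only the structure of each sequence, and it is not a substitute for the
more detailed algorithms in the references just cited.

```python
from typing import List, Tuple


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the sequence starting at pos; return (UCS-4 value, octet count)."""
    lead = data[pos]
    if lead < 0x80:
        return lead, 1
    # Count the higher-order 1 bits of the first octet; that count is
    # the number of octets in the sequence.
    count, mask = 0, 0x80
    while lead & mask:
        count += 1
        mask >>= 1
    if count < 2 or count > 6:
        raise ValueError("octet cannot start a sequence")
    if pos + count > len(data):
        raise ValueError("truncated sequence")
    # Step 1: start with all bits set to 0.
    value, shift = 0, 0
    # Steps 2 and 3: take the x bits, beginning with the last octet and
    # proceeding to the left.
    for octet in reversed(data[pos + 1:pos + count]):
        if (octet & 0xC0) != 0x80:
            raise ValueError("expected an octet of the form 10xxxxxx")
        value |= (octet & 0x3F) << shift
        shift += 6
    # The x bits of the first octet are those below the 0 marker bit.
    value |= (lead & (mask - 1)) << shift
    return value, count


def decode_utf8(data: bytes) -> List[int]:
    """Decode a whole octet string into a list of UCS-4 values."""
    values, pos = [], 0
    while pos < len(data):
        value, count = decode_one(data, pos)
        values.append(value)
        pos += count
    return values
```

Applied to the octets 41 E2 89 A2 CE 91 2E, decode_utf8 returns the
values 0041, 2262, 0391 and 002E.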

193	3.  Versions of the standards

195	   Different versions of the Unicode standard exist: 1.0, 1.1 and 2.0 as
196	   of this writing.  Each new version obsoletes and replaces the previ-
197	   ous one, but implementations, and more significantly data, are not
198	   updated instantly.  Similarly, ISO 10646 is updated from time to time
199	   by published amendments, which up to now have tracked the changes in
200	   the Unicode standard, so that the two have remained in sync.

202	   In general, the changes amount to adding new characters, which does
203	   not pose particular problems with old data.  Amendment 5 to ISO
204	   10646, however, has moved and expanded the Korean Hangul block,
205	   thereby making any previous data containing Hangul characters invalid
206	   under the new version.  Unicode 2.0 has the same difference from Uni-
207	   code 1.1. The official justification for allowing such an incompati-
208	   ble change was that no implementations and no data containing Hangul
209	   existed, a statement that is likely to be true but remains unprov-
210	   able.  The incident has been dubbed the "Korean mess", and the rele-
211	   vant committees have pledged to never, ever again make such an incom-
212	   patible change.

214	   New versions, and in particular any incompatible changes, have conse-
215	   quences regarding MIME character encoding labels, to be discussed in
216	   section 5.

218	4.  Examples

220	   The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
221	   002E) may be encoded as follows:

223	   41 E2 89 A2 CE 91 2E

225	   The UCS-2 sequence representing the Hangul characters for the Korean
226	   word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

228	   ED 95 9C EA B5 AD EC 96 B4

230	   The UCS-2 sequence representing the Han characters for the Japanese
231	   word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

233	   E6 97 A5 E6 9C AC E8 AA 9E
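
The three examples above can be cross-checked against an independent
implementation. The sketch below uses the UTF-8 codec built into
Python 3 for that purpose; utf8_hex is a hypothetical helper name. The
built-in codec follows a later, more restrictive definition of UTF-8,
but it agrees with this memo for the Basic Multilingual Plane values
used here.

```python
from typing import List


def utf8_hex(values: List[int]) -> str:
    """Encode character values with Python's built-in UTF-8 codec and
    format the octets as space-separated hexadecimal."""
    text = "".join(chr(v) for v in values)
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))


# (UCS-2 values, octets given in this section)
EXAMPLES = [
    ([0x0041, 0x2262, 0x0391, 0x002E], "41 E2 89 A2 CE 91 2E"),
    ([0xD55C, 0xAD6D, 0xC5B4], "ED 95 9C EA B5 AD EC 96 B4"),
    ([0x65E5, 0x672C, 0x8A9E], "E6 97 A5 E6 9C AC E8 AA 9E"),
]

results = [utf8_hex(values) == expected for values, expected in EXAMPLES]
```

Each entry of results is True when the built-in codec produces the
octets listed in the corresponding example.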

235	5.  MIME registration

237	   This memo is meant to serve as the basis for registration of a MIME
238	   character set parameter (charset) [MIME].  The proposed charset
239	   parameter value is "UTF-8".  This string would label media types con-
240	   taining text consisting of characters from the repertoire of ISO/IEC
241	   10646 encoded to a sequence of octets using the encoding scheme out-
242	   lined above.  UTF-8 is suitable for use in MIME content types under
243	   the "text" top-level type.
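
As an illustration of the proposed label in use, the sketch below
builds a text/plain message with Python's standard email package and
reads the charset parameter back. The subject line and sample text are
made up for this example. Python reports the label in lower case; MIME
charset names are not case-sensitive, so "utf-8" and "UTF-8" denote the
same charset.

```python
from email.message import EmailMessage

# Sample text: A, NOT IDENTICAL TO, GREEK CAPITAL LETTER ALPHA, full stop.
text = "A\u2262\u0391."

msg = EmailMessage()
msg["Subject"] = "UTF-8 label example"
# Label the body as text/plain with the charset parameter set to UTF-8.
msg.set_content(text, subtype="plain", charset="utf-8")

label = msg.get_content_charset()   # charset parameter, lower-cased
decoded = msg.get_content()         # body decoded according to the label
```

A receiver needs only the label to turn the received octets back into
characters, which is the role section 5 assigns to a charset parameter.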

245	   It is noteworthy that the label "UTF-8" does not contain a version
246	   identification, referring generically to ISO/IEC 10646.  This is
247	   intentional, the rationale being as follows:

249	   A MIME charset label is designed to give just the information needed
250	   to interpret a sequence of bytes received on the wire into a sequence
251	   of characters, nothing more (see RFC 2045, section 2.2, in [MIME]).
252	   As long as a character set standard does not change incompatibly,
253	   version numbers serve no purpose, because one gains nothing by learn-
254	   ing from the tag that newly assigned characters may be received that
255	   one doesn't know about.  The tag doesn't teach anything about the new
256	   characters, and they are going to be received anyway.

258	   Hence, as long as the standards evolve compatibly, the apparent
259	   advantage of having labels that identify the versions is only that,
260	   apparent.  But there is a disadvantage to such version-dependent
261	   labels: when an older application receives data accompanied by a
262	   newer, unknown label, it may fail to recognize the label and be com-
263	   pletely unable to deal with the data, whereas a generic, known label
264	   would have triggered mostly correct processing of the data, which may
265	   well not contain any new characters.

267	   Now the "Korean mess" (ISO 10646 amendment 5) is an incompatible
268	   change, in principle contradicting the appropriateness of a version-
269	   independent MIME charset label as described above.  But the compati-
270	   bility problem can only appear with data containing Korean Hangul
271	   characters encoded according to Unicode 1.1 (or equivalently ISO
272	   10646 before amendment 5), and there is arguably no such data to
273	   worry about, this being the very reason the incompatible change was
274	   deemed acceptable.

276	   In practice, then, a version-independent label is warranted.  Should
277	   the need ever arise to distinguish data containing Hangul encoded
278	   according to Unicode 1.1, then a version-dependent label, for that
279	   version only, should be registered (a suggestion would be "UNI-
280	   CODE-1-1-UTF-8"), in order to retain the advantages of a version-
281	   independent label for 2.0 and later versions.  Such a version-depen-
282	   dent label could even be registered before actual need arises, pre-
283	   emptively, but it is important to strongly recommend against creating
284	   any new Hangul-containing data without taking Amendment 5 of ISO
285	   10646 into account.

287	6.  Security Considerations

289	   Security issues are not discussed in this memo.

291	Acknowledgments

293	   The following have participated in the drafting and discussion of
294	   this memo:

296	   James E. Agenbroad   Andries Brouwer
297	   Martin J. Dürst      David Goldsmith
298	   Edwin F. Hart        Kent Karlsson
299	   Markus Kuhn          Michael Kung
300	   Alain LaBonté        Murray Sargent
301	   Keld Simonsen        Arnold Winkler

303	Bibliography

305	   [FSS_UTF]      X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
306	                  22p. pbk. 172g.  4/95, X/Open Company Ltd., "File Sys-
307	                  tem Safe UCS Transformation Format (FSS_UTF)", X/Open
308	                  Preliminary Specification, Document Number P316.  Also
309	                  published in Unicode Technical Report #4.

311	   [ISO-10646]    ISO/IEC 10646-1:1993. International Standard -- Infor-
312	                  mation technology -- Universal Multiple-Octet Coded
313	                  Character Set (UCS) -- Part 1: Architecture and Basic
314	                  Multilingual Plane.  UTF-8 is described in Annex R,
315	                  published as Amendment 2.  UTF-16 is described in
316	                  Annex Q, published as Amendment 1.

318	   [MIME]         N. Freed, N. Borenstein, "Multipurpose Internet Mail
319	                  Extensions (MIME) Part One:  Format of Internet Mes-
320	                  sage Bodies", RFC 2045.  N. Freed, N. Borenstein,
321	                  "Multipurpose Internet Mail Extensions (MIME) Part
322	                  Two:  Media Types", RFC 2046.  K. Moore, "MIME (Multi-
323	                  purpose Internet Mail Extensions) Part Three: Message
324	                  Header Extensions for Non-ASCII Text", RFC 2047.  N.
325	                  Freed, J. Klensin, J. Postel, "Multipurpose Internet
326	                  Mail Extensions (MIME) Part Four: Registration Proce-
327	                  dures", RFC 2048.  N. Freed, N. Borenstein, "Multipur-
328	                  pose Internet Mail Extensions (MIME) Part Five:  Con-
329	                  formance Criteria and Examples", RFC 2049.  All
330	                  November 1996.

332	   [RFC1641]      D. Goldsmith, M. Davis, "Using Unicode with MIME", RFC
333	                  1641, Taligent inc., July 1994.

335	   [RFC1642]      D. Goldsmith, M. Davis, "UTF-7: A Mail-safe Transfor-
336	                  mation Format of Unicode", RFC 1642, Taligent inc.,
337	                  July 1994.

339	   [UNICODE]      The Unicode Consortium, "The Unicode Standard -- Ver-
340	                  sion 2.0", Addison-Wesley, 1996.

342	   [US-ASCII]     Coded Character Set--7-bit American Standard Code for
343	                  Information Interchange, ANSI X3.4-1986.

345	Author's Address

347	   François Yergeau
348	   Alis Technologies
349	   100, boul. Alexis-Nihon
350	   Suite 600
351	   Montréal  QC  H4M 2P2
352	   Canada

354	   Tel: +1 (514) 747-2547
355	   Fax: +1 (514) 747-2561
356	   EMail: fyergeau@alis.com