idnits 2.17.1 

draft-abela-utf9-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 222 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (23 December 1997) is 9621 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'ISO-10646' on line 189 looks like a reference

  -- Missing reference section? 'UNICODE' on line 200 looks like a reference

  -- Missing reference section? 'US-ASCII' on line 203 looks like a reference

  -- Missing reference section? 'RFC2152' on line 196 looks like a reference


     Summary: 6 errors (**), 0 flaws (~~), 2 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET DRAFT                                                  J. Abela
2	Expires: 23 June 1998                                                HSC
3	<draft-abela-utf9-00.txt>                               23 December 1997

5	                 UTF-9, a transformation format of UCS

7	Status of this Memo

9	   This document is an Internet-Draft.  Internet-Drafts are working
10	   documents of the Internet Engineering Task Force (IETF), its areas,
11	   and its working groups.  Note that other groups may also distribute
12	   working documents as Internet-Drafts.

14	   Internet-Drafts are draft documents valid for a maximum of six months
15	   and may be updated, replaced, or obsoleted by other documents at any
16	   time.  It is inappropriate to use Internet-Drafts as reference
17	   material or to cite them other than as "work in progress".

19	   To learn the current status of any Internet-Draft, please check the
20	   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
21	   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
22	   ftp.isi.edu (US West Coast), munnari.oz.au (Pacific Rim), or
23	   ds.internic.net (US East Coast).

25	   Distribution of this document is unlimited.

27	Abstract

29	   ISO/IEC 10646 defines a multi-octet character set called the
30	   Universal Character Set (UCS) which encompasses most of the world's
31	   writing systems.  Multi-octet characters, however, are not compatible
32	   with many current applications and protocols, and this has led to the
33	   development of a few so-called UCS transformation formats (UTF), each
34	   with different characteristics.  UTF-9, the object of this memo, has
35	   the characteristic of preserving the full ISO-Latin1 range, providing
36	   compatibility with file systems, parsers and other software that rely
37	   on ISO-Latin1 values.

39	   ISO-Latin1 is almost as widespread as ASCII in many countries,
40	   especially in most of western Europe, and is the default character
41	   set for HTML.  A compatible encoding seems desirable, where possible.

43	1. Introduction

45	   ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
46	   called the Universal Character Set (UCS), which encompasses most of
47	   the world's writing systems.  Two multi-octet encodings are defined,
48	   a four-octet per character encoding called UCS-4 and a two-octet per
49	   character encoding called UCS-2, able to address only the first 64K
50	   characters of the UCS (the Basic Multilingual Plane, BMP), outside of
51	   which there are currently no assignments.

53	   It is noteworthy that the same set of characters is defined by the
54	   Unicode standard [UNICODE], which further defines additional
55	   character properties and other application details of great interest
56	   to implementors, but does not have the UCS-4 encoding.  Up to the
57	   present time, changes in Unicode and amendments to ISO/IEC 10646 have
58	   tracked each other, so that the character repertoires and code point
59	   assignments have remained in sync.  The relevant standardization
60	   committees have committed to maintain this very useful synchronism.

62	   The UCS-2 and UCS-4 encodings, however, are hard to use in many
63	   current applications and protocols that assume 8 or even 7 bit
64	   characters.  Even newer systems able to deal with 16 bit characters
65	   cannot process UCS-4 data. This situation has led to the development
66	   of so-called UCS transformation formats (UTF), each with different
67	   characteristics.

69	   UTF-1 has only historical interest, having been removed from ISO/IEC
70	   10646.  UTF-7 has the quality of encoding the full BMP repertoire
71	   using only octets with the high-order bit clear (7 bit US-ASCII
72	   values, [US-ASCII]), and is thus deemed a mail-safe encoding
73	   ([RFC2152]).  UTF-8 uses all bits of an octet, but has the quality of
74	   preserving the full US-ASCII range: US-ASCII characters are encoded
75	   in one octet having the normal US-ASCII value, and any octet with
76	   such a value can only stand for a US-ASCII character, and nothing
77	   else. UTF-9, the object of this memo, has the quality of preserving
78	   the full ISO-Latin1 range: ISO-Latin1 characters are encoded in one
79	   octet having the normal ISO-Latin1 value.

81	   UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
82	   into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
83	   UTF-9 in that UCS-2 values from the reserved range must be treated
84	   specially in the UTF-9 transformation.

86	   UTF-9 encodes UCS-2 or UCS-4 characters as a varying number of
87	   octets, where the number of octets, and the value of each, depend on
88	   the integer value assigned to the character in ISO/IEC 10646.  This
89	   transformation format has the following characteristics (all values
90	   are in hexadecimal):

92	   -  Character values from 0000 0000 to 0000 007F and 0000 00A0 to 0000
93	   00FF (Latin1 repertoire) correspond to octets 00 to 7F and A0 to FF
94	   (8 bit Latin1 values).  A direct consequence is that a plain Latin1
95	   string is also a valid UTF-9 string.  Note that octets A0 to FF can
96	   also occur as trailing octets of multi-octet (non-Latin1) sequences.

98	   -  US-ASCII values do not appear otherwise in a UTF-9 encoded
99	   character stream.  This provides compatibility with file systems or
100	   other software (e.g. the printf() function in C libraries) that parse
101	   based on US-ASCII values but are transparent to other values.
102	   However, note that Latin1 octets in a UTF-9 stream may be non-Latin1
103	   characters when used as part of multi-octet sequences.

105	   -  Round-trip conversion is easy between UTF-9 and either UCS-4 or
106	   UCS-2.

108	   -  The first octet of a multi-octet sequence indicates the number of
109	   octets in the sequence.

111	   -  A UTF-9 encoding is never longer than the UTF-8 encoding.

113	   -  Unlike UTF-8, there is no reliable way to find character
114	   boundaries from an arbitrary point in a UTF-9 octet stream.
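
   As an illustration of the property that the first octet of a
   sequence gives its length, the following Python sketch maps a first
   octet to a sequence length, using the lead-octet bit patterns from
   the table in the "UTF-9 definition" section.  The function name
   utf9_sequence_length is hypothetical and not part of this memo, and
   the result is only meaningful when the octet really is the first
   octet of a sequence.

```python
def utf9_sequence_length(first_octet: int) -> int:
    """Return the length of the UTF-9 sequence that starts with first_octet.

    Hypothetical helper: the caller must already know that first_octet
    sits on a character boundary.
    """
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")
    if first_octet < 0x80 or first_octet >= 0xA0:
        return 1  # 00-7F and A0-FF stand for themselves
    if first_octet <= 0x8F:
        return 2  # 1000xxxx
    if first_octet <= 0x93:
        return 3  # 100100xx
    if first_octet <= 0x97:
        return 4  # 100101xx
    return 5      # 10011xxx


if __name__ == "__main__":
    for octet in (0x41, 0xEB, 0x87, 0x90, 0x94, 0x98):
        print(f"{octet:02X} -> {utf9_sequence_length(octet)}")
```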

116	   UTF-9 is heavily based on the UTF-8 definition.  More information
117	   about UTF, Unicode, and their various versions can be found in RFC 2044.

119	UTF-9 definition

121	   In UTF-9, characters are encoded using sequences of 1 to 5 octets.
122	   The only octet of a "sequence" of one is in the range 00 to 7F or
123	   A0 to FF.  In a sequence of n octets, n>1, the initial octet is in
124	   the range 80 to 9F.  This octet specifies the length of the sequence
125	   and contains the high-order value bits.  Each remaining octet has
126	   its high-order bit set and contains 7 value bits.

128	   The table below summarizes the format of these different octet types.
129	   The letter x indicates bits available for encoding bits of the UCS-4
130	   character value.

132	    UCS-4 range (hex)     UTF-9 octet sequence (binary)
133	    0000 0000-0000 007F   0xxxxxxx
134	    0000 00A0-0000 00BF   101xxxxx
135	    0000 00C0-0000 00FF   11xxxxxx
136	    0000 0100-0000 07FF   1000xxxx 1xxxxxxx
137	    0000 0800-0000 FFFF   100100xx 1xxxxxxx 1xxxxxxx
138	    0001 0000-007F FFFF   100101xx 1xxxxxxx 1xxxxxxx 1xxxxxxx
139	    0080 0000-7FFF FFFF   10011xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
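
   The table above can be read as an encoding procedure.  The Python
   sketch below is one possible reading, not a reference
   implementation.  The names utf9_encode_char and utf9_encode are
   hypothetical.  Two assumptions go beyond the table: values 0080 to
   009F, which the table does not list, are given the two-octet form,
   and values in the UTF-16 reserved range receive no special
   treatment, since this memo does not say what that treatment is.

```python
def utf9_encode_char(cp: int) -> bytes:
    """Encode one UCS-4 value as a UTF-9 octet sequence (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")
    if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
        return bytes([cp])                      # single octet, same value
    if cp <= 0x7FF:                             # ASSUMPTION: includes 0080-009F
        lead, trailing = 0x80 | (cp >> 7), 1    # 1000xxxx
    elif cp <= 0xFFFF:
        lead, trailing = 0x90 | (cp >> 14), 2   # 100100xx
    elif cp <= 0x7FFFFF:
        lead, trailing = 0x94 | (cp >> 21), 3   # 100101xx
    else:
        lead, trailing = 0x98 | (cp >> 28), 4   # 10011xxx
    out = [lead]
    # Emit the remaining bits 7 at a time, most significant group first,
    # each in an octet with the high-order bit set.
    for i in range(trailing - 1, -1, -1):
        out.append(0x80 | ((cp >> (7 * i)) & 0x7F))
    return bytes(out)


def utf9_encode(text: str) -> bytes:
    """Encode a string character by character (no surrogate handling)."""
    return b"".join(utf9_encode_char(ord(ch)) for ch in text)


if __name__ == "__main__":
    print(utf9_encode("A\u2262\u0391.").hex(" "))
```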

141	Examples

143	   The Latin1 sequence "No<e diaeresis>l" should be encoded as follows:

145	    UCS-2: 004E 006F 00EB 006C
146	    UTF-9: 4E   6F   EB   6C
147	    UTF-8: 4E   6F   C3AB 6C

149	   The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." should be encoded as
150	   follows:

152	    UCS-2: 0041  2262      0391   002E
153	    UTF-9: 41    90 C4 E2  87 91  2E
154	    UTF-8: 41    E2 89 A2  CE 91  2E

156	   The UCS-2 sequence representing the Hangul characters for the Korean
157	   word "hangugo" should be encoded as follows:

159	    UCS-2: D55C      AD6D      C5B4
160	    UTF-9: 93 AA DC  92 DA ED  93 8B B4
161	    UTF-8: ED 95 9C  EA B5 AD  EC 96 B4
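
   The examples above can be checked in the decoding direction with a
   short Python sketch.  The function name utf9_decode is hypothetical.
   It returns integer character values rather than a string, and it is
   deliberately permissive: it verifies only that each sequence is
   complete and that trailing octets have the high-order bit set.  It
   makes no check that a sequence is the shortest form for its value.

```python
def utf9_decode(data: bytes) -> list:
    """Decode UTF-9 octets into a list of integer character values (sketch)."""
    values = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80 or first >= 0xA0:
            value, trailing = first, 0           # single octet
        elif first <= 0x8F:
            value, trailing = first & 0x0F, 1    # 1000xxxx
        elif first <= 0x93:
            value, trailing = first & 0x03, 2    # 100100xx
        elif first <= 0x97:
            value, trailing = first & 0x03, 3    # 100101xx
        else:
            value, trailing = first & 0x07, 4    # 10011xxx
        tail = data[i + 1 : i + 1 + trailing]
        if len(tail) != trailing or any(b < 0x80 for b in tail):
            raise ValueError(f"malformed sequence at offset {i}")
        for b in tail:
            value = (value << 7) | (b & 0x7F)    # 7 value bits per octet
        values.append(value)
        i += 1 + trailing
    return values


if __name__ == "__main__":
    hangul = bytes.fromhex("93aadc92daed938bb4")
    print([f"{v:04X}" for v in utf9_decode(hangul)])
```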

163	Security Considerations

165	   Implementors of UTF-9 need to consider the security aspects of how
166	   they handle illegal UTF-9 sequences.  It is conceivable that in some
167	   circumstances an attacker would be able to exploit an incautious
168	   UTF-9 parser by sending it an octet sequence that is not permitted by
169	   the UTF-9 syntax.

171	   A particularly subtle form of this attack could be carried out
172	   against a parser which performs security-critical validity checks
173	   against the UTF-9 encoded form of its input, but interprets certain
174	   illegal octet sequences as characters.  For example, a parser might
175	   prohibit the NUL character when encoded as the single-octet sequence
176	   00, but allow the illegal two-octet sequence 80 80 and interpret it
177	   as a NUL character.  Another example might be a parser which
178	   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
179	   illegal octet sequence 2F 2E 80 AE 2F.
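
   One way a parser might guard against the sequences described above
   is to reject any sequence that is longer than its value requires.
   The Python sketch below checks a single, already delimited sequence.
   The name is_legal_sequence and the two lookup tables are
   hypothetical.  This memo does not enumerate the illegal sequences,
   so the rule used here is an assumption: a sequence is accepted only
   if no shorter UTF-9 form exists for the same value, with 0080 to
   009F assumed to need two octets.

```python
# Smallest value that needs a sequence of the given length.
# ASSUMPTION: length 2 starts at 0x80 (0080-009F take two octets).
_MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x800000}
# Mask selecting the value bits of the first octet for each length.
_LEAD_MASK = {2: 0x0F, 3: 0x03, 4: 0x03, 5: 0x07}


def is_legal_sequence(seq: bytes) -> bool:
    """Return True if seq is one complete, shortest-form UTF-9 sequence."""
    if not seq:
        return False
    first = seq[0]
    if first < 0x80 or first >= 0xA0:
        return len(seq) == 1
    if first <= 0x8F:
        length = 2
    elif first <= 0x93:
        length = 3
    elif first <= 0x97:
        length = 4
    else:
        length = 5
    if len(seq) != length or any(b < 0x80 for b in seq[1:]):
        return False
    value = first & _LEAD_MASK[length]
    for b in seq[1:]:
        value = (value << 7) | (b & 0x7F)
    if value < _MIN_VALUE[length]:
        return False              # overlong, e.g. 80 80 for NUL
    # A0-FF already have a single-octet form, so a two-octet form is overlong.
    return not (length == 2 and 0xA0 <= value <= 0xFF)


if __name__ == "__main__":
    for hexseq in ("8080", "80ae", "8791", "90c4e2"):
        print(hexseq, is_legal_sequence(bytes.fromhex(hexseq)))
```

   Validity checks of this kind are best applied before, or together
   with, any security-critical comparison on the encoded form.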

181	Acknowledgments

183	   Most of the text of this memo comes from the UTF-8 memo from Francois
184	   Yergeau.  The following have participated in the drafting of this
185	   memo: Antoine Leca and Francois Yergeau.

187	Bibliography

189	      [ISO-10646]    ISO/IEC 10646-1:1993. International Standard --
190	                     Information technology -- Universal Multiple-Octet
191	                     Coded Character Set (UCS) -- Part 1: Architecture
192	                     and Basic Multilingual Plane.  Five amendments and
193	                     a technical corrigendum have been published up to
194	                     now.

196	      [RFC2152]      D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
197	                     Transformation Format of Unicode", RFC 2152,
198	                     Taligent inc., May 1997. (Obsoletes RFC1642)

200	      [UNICODE]      The Unicode Consortium, "The Unicode Standard --
201	                     Version 2.0", Addison-Wesley, 1996.

203	      [US-ASCII]     Coded Character Set--7-bit American Standard Code
204	                     for Information Interchange, ANSI X3.4-1986.

206	Author's Address

208	   Jerome Abela
209	   Herve Schauer Consultants
210	   142, rue de Rivoli
211	   75001 Paris
212	   France

214	   Phone: +33 141 409 700
215	   Fax:   +33 141 409 709

217	   EMail: Jerome.Abela@hsc.fr