INTERNET DRAFT                                                  J. Abela
Expires: 23 June 1998                                                HSC
                                                        23 December 1997


                 UTF-9, a transformation format of UCS


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   ftp.isi.edu (US West Coast), munnari.oz.au (Pacific Rim), or
   ds.internic.net (US East Coast).

   Distribution of this document is unlimited.

Abstract

   ISO/IEC 10646 defines a multi-octet character set called the
   Universal Character Set (UCS) which encompasses most of the world's
   writing systems. Multi-octet characters, however, are not compatible
   with many current applications and protocols, and this has led to
   the development of a few so-called UCS transformation formats (UTF),
   each with different characteristics. UTF-9, the object of this memo,
   has the characteristic of preserving the full ISO-Latin1 range,
   providing compatibility with file systems, parsers and other
   software that rely on ISO-Latin1 values.

   ISO-Latin1 is almost as widespread as ASCII in many countries,
   especially in most of western Europe, and is the default character
   set for HTML. A compatible encoding seems desirable, where possible.

1. Introduction

   ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
   called the Universal Character Set (UCS), which encompasses most of
   the world's writing systems.
   Two multi-octet encodings are defined: a four-octet per character
   encoding called UCS-4, and a two-octet per character encoding called
   UCS-2, which is able to address only the first 64K characters of the
   UCS (the Basic Multilingual Plane, BMP), outside of which there are
   currently no assignments.

   It is noteworthy that the same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementors, but does not have the UCS-4 encoding. Up to the
   present time, changes in Unicode and amendments to ISO/IEC 10646
   have tracked each other, so that the character repertoires and code
   point assignments have remained in sync. The relevant
   standardization committees have committed to maintain this very
   useful synchronism.

   The UCS-2 and UCS-4 encodings, however, are hard to use in many
   current applications and protocols that assume 8 or even 7 bit
   characters. Even newer systems able to deal with 16 bit characters
   cannot process UCS-4 data. This situation has led to the development
   of so-called UCS transformation formats (UTF), each with different
   characteristics.

   UTF-1 has only historical interest, having been removed from ISO/IEC
   10646. UTF-7 has the quality of encoding the full BMP repertoire
   using only octets with the high-order bit clear (7 bit US-ASCII
   values, [US-ASCII]), and is thus deemed a mail-safe encoding
   ([RFC2152]). UTF-8 uses all bits of an octet, but has the quality of
   preserving the full US-ASCII range: US-ASCII characters are encoded
   in one octet having the normal US-ASCII value, and any octet with
   such a value can only stand for a US-ASCII character, and nothing
   else. UTF-9, the object of this memo, has the quality of preserving
   the full ISO-Latin1 range: ISO-Latin1 characters are encoded in one
   octet having the normal ISO-Latin1 value.
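
   The difference between preserving US-ASCII and preserving ISO-Latin1
   can be seen on a short Latin1 string. The Python sketch below is
   only an illustration: it uses the standard "latin-1" and "utf-8"
   codecs, and the helper name hex_octets is hypothetical. Python has
   no UTF-9 codec; the Latin1 octets are shown because, according to
   this memo, they are also the UTF-9 form of the string.

```python
def hex_octets(data: bytes) -> str:
    """Render octets as space-separated upper-case hexadecimal."""
    return " ".join(f"{octet:02X}" for octet in data)


# "Noel" with a diaeresis on the e (U+00EB), a Latin1 character above 7F.
text = "No\u00ebl"

latin1_octets = text.encode("latin-1")  # one octet per character
utf8_octets = text.encode("utf-8")      # U+00EB becomes two octets

print("Latin1:", hex_octets(latin1_octets))  # 4E 6F EB 6C
print("UTF-8: ", hex_octets(utf8_octets))    # 4E 6F C3 AB 6C
```

   The US-ASCII characters keep their single-octet value in both forms;
   only the character above 7F is expanded by UTF-8.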

   UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
   into pairs of UCS-2 values from a reserved range. UTF-16 impacts
   UTF-9 in that UCS-2 values from the reserved range must be treated
   specially in the UTF-9 transformation.

   UTF-9 encodes UCS-2 or UCS-4 characters as a varying number of
   octets, where the number of octets, and the value of each, depend on
   the integer value assigned to the character in ISO/IEC 10646. This
   transformation format has the following characteristics (all values
   are in hexadecimal):

   -  Character values from 0000 0000 to 0000 007F and 0000 00A0 to
      0000 00FF (Latin1 repertoire) correspond to octets 00 to 7F and
      A0 to FF (8 bit Latin1 values). A direct consequence is that a
      plain Latin1 string is also a valid UTF-9 string. The converse
      does not hold: octets in the range A0 to FF in a UTF-9 string
      may stand for non-Latin1 characters.

   -  US-ASCII values do not appear otherwise in a UTF-9 encoded
      character stream. This provides compatibility with file systems
      or other software (e.g. the printf() function in C libraries)
      that parse based on US-ASCII values but are transparent to other
      values. However, note that Latin1 octets in a UTF-9 stream may
      be non-Latin1 characters when used as part of multi-octet
      sequences.

   -  Round-trip conversion is easy between UTF-9 and either UCS-4 or
      UCS-2.

   -  The first octet of a multi-octet sequence indicates the number of
      octets in the sequence.

   -  The UTF-9 encoding of a character is never longer than its UTF-8
      encoding.

   -  Unlike UTF-8, there is no reliable way to find character
      boundaries in a UTF-9 octet stream.

   The definition of UTF-9 is heavily based on that of UTF-8. More
   information about UTF, Unicode, and their various versions can be
   found in RFC-2044.

UTF-9 definition

   In UTF-9, characters are encoded using sequences of 1 to 5 octets.
   The only octet of a "sequence" of one is in the range 00 to 7F or
   A0 to FF.
   In a sequence of n octets, n>1, the initial octet is in the range 80
   to 9F. This octet specifies the length of the sequence, and its
   low-order bits hold the most significant bits of the character
   value. Each remaining octet has its high-order bit set, and its
   other seven bits are used to encode the character.

   The table below summarizes the format of these different octet
   types. The letter x indicates bits available for encoding bits of
   the UCS-4 character value.

   UCS-4 range (hex)      UTF-9 octet sequence (binary)
   0000 0000-0000 007F    0xxxxxxx
   0000 00A0-0000 00BF    101xxxxx
   0000 00C0-0000 00FF    11xxxxxx
   0000 0100-0000 07FF    1000xxxx 1xxxxxxx
   0000 0800-0000 FFFF    100100xx 1xxxxxxx 1xxxxxxx
   0001 0000-007F FFFF    100101xx 1xxxxxxx 1xxxxxxx 1xxxxxxx
   0080 0000-7FFF FFFF    10011xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx

Examples

   The Latin1 sequence "Noël" should be encoded as follows:

      UCS-2: 004E 006F 00EB 006C
      UTF-9: 4E 6F EB 6C
      UTF-8: 4E 6F C3 AB 6C

   The UCS-2 sequence "A≢Α." (LATIN CAPITAL LETTER A, NOT IDENTICAL
   TO, GREEK CAPITAL LETTER ALPHA, FULL STOP) should be encoded as
   follows:

      UCS-2: 0041 2262 0391 002E
      UTF-9: 41 90 C4 E2 87 91 2E
      UTF-8: 41 E2 89 A2 CE 91 2E

   The UCS-2 sequence representing the Hangul characters for the Korean
   word "hangugo" should be encoded as follows:

      UCS-2: D55C AD6D C5B4
      UTF-9: 93 AA DC 92 DA ED 93 8B B4
      UTF-8: ED 95 9C EA B5 AD EC 96 B4

Security Considerations

   Implementors of UTF-9 need to consider the security aspects of how
   they handle illegal UTF-9 sequences. It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-9 parser by sending it an octet sequence that is not permitted
   by the UTF-9 syntax.

   A particularly subtle form of this attack could be carried out
   against a parser which performs security-critical validity checks
   against the UTF-9 encoded form of its input, but interprets certain
   illegal octet sequences as characters.
   For example, a parser might prohibit the NUL character when encoded
   as the single-octet sequence 00, but allow the illegal two-octet
   sequence 80 80 and interpret it as a NUL character. Another example
   might be a parser which prohibits the octet sequence 2F 2E 2E 2F
   ("/../"), yet permits the illegal octet sequence 2F 2E 80 AE 2F.

Acknowledgments

   Most of the text of this memo comes from the UTF-8 memo by Francois
   Yergeau. The following have participated in the drafting of this
   memo: Antoine Leca and Francois Yergeau.

Bibliography

   [ISO-10646] ISO/IEC 10646-1:1993. International Standard --
               Information technology -- Universal Multiple-Octet
               Coded Character Set (UCS) -- Part 1: Architecture
               and Basic Multilingual Plane. Five amendments and
               a technical corrigendum have been published up to
               now.

   [RFC2152]   D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
               Transformation Format of Unicode", RFC 2152,
               Taligent inc., May 1997. (Obsoletes RFC1642)

   [UNICODE]   The Unicode Consortium, "The Unicode Standard --
               Version 2.0", Addison-Wesley, 1996.

   [US-ASCII]  Coded Character Set--7-bit American Standard Code
               for Information Interchange, ANSI X3.4-1986.

Author's Address

   Jerome Abela
   Herve Schauer Consultants
   142, rue de Rivoli
   75001 Paris
   France

   Phone: +33 141 409 700
   Fax:   +33 141 409 709

   EMail: Jerome.Abela@hsc.fr
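
   As an illustration only, the octet layout given in the table of the
   UTF-9 definition and the checks discussed under Security
   Considerations can be sketched in Python as follows. This is not a
   normative implementation, and several points are assumptions rather
   than statements of this memo: the names utf9_encode and utf9_decode
   are hypothetical; values 0080 to 009F, which the table does not
   list, are assumed to use the two-octet form; non-shortest forms such
   as 80 80 are rejected; and the special treatment of the UTF-16
   reserved range is not attempted.

```python
def utf9_encode(code_points):
    """Encode an iterable of UCS-4 integer values as UTF-9 octets."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x7FFFFFFF:
            raise ValueError(f"value out of UCS-4 range: {cp!r}")
        if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            out.append(cp)  # single octet, same value as in Latin1
            continue
        # Choose the lead octet pattern and number of trailing octets.
        # Assumption: 0080..009F fall into the two-octet form.
        if cp <= 0x7FF:
            lead, n_trail = 0x80, 1  # 1000xxxx
        elif cp <= 0xFFFF:
            lead, n_trail = 0x90, 2  # 100100xx
        elif cp <= 0x7FFFFF:
            lead, n_trail = 0x94, 3  # 100101xx
        else:
            lead, n_trail = 0x98, 4  # 10011xxx
        out.append(lead | (cp >> (7 * n_trail)))
        for i in range(n_trail - 1, -1, -1):
            out.append(0x80 | ((cp >> (7 * i)) & 0x7F))  # 1xxxxxxx
    return bytes(out)


def utf9_decode(data):
    """Decode UTF-9 octets into a list of UCS-4 integer values.

    Raises ValueError on truncated sequences, on trailing octets
    without the high-order bit, and on non-shortest forms.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F or b >= 0xA0:
            out.append(b)
            i += 1
            continue
        # Lead octet in 80..9F: extract length, value bits and the
        # smallest value that legitimately needs this length.
        if b <= 0x8F:
            n_trail, value, minimum = 1, b & 0x0F, 0x80
        elif b <= 0x93:
            n_trail, value, minimum = 2, b & 0x03, 0x800
        elif b <= 0x97:
            n_trail, value, minimum = 3, b & 0x03, 0x10000
        else:
            n_trail, value, minimum = 4, b & 0x07, 0x800000
        trail = data[i + 1:i + 1 + n_trail]
        if len(trail) != n_trail or any(t < 0x80 for t in trail):
            raise ValueError(f"malformed sequence at offset {i}")
        for t in trail:
            value = (value << 7) | (t & 0x7F)
        if value < minimum or 0xA0 <= value <= 0xFF:
            raise ValueError(f"non-shortest form at offset {i}")
        out.append(value)
        i += 1 + n_trail
    return out


# The second example of this memo: 0041 2262 0391 002E.
encoded = utf9_encode([0x0041, 0x2262, 0x0391, 0x002E])
print(" ".join(f"{octet:02X}" for octet in encoded))  # 41 90 C4 E2 87 91 2E
print([f"{cp:04X}" for cp in utf9_decode(encoded)])
```

   With these assumptions the sketch reproduces the three examples
   given in this memo, and the decoder raises ValueError for both
   illegal sequences mentioned under Security Considerations
   (80 80 and 2F 2E 80 AE 2F) instead of interpreting them.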