idnits 2.17.1

draft-yergeau-utf8-rev-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-19) according to
     https://trustee.ietf.org/license-info :

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:

          This Internet-Draft is submitted in full conformance with the
          provisions of BCP 78 and BCP 79.

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:

          Copyright (c) 2024 IETF Trust and the persons identified as the
          document authors.  All rights reserved.

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:

          This document is subject to BCP 78 and the IETF Trust's Legal
          Provisions Relating to IETF Documents
          (https://trustee.ietf.org/license-info) in effect on the date of
          publication of this document.  Please review these documents
          carefully, as they describe your rights and restrictions with respect
          to this document.  Code Components extracted from this document must
          include Simplified BSD License text as described in Section 4.e of
          the Trust Legal Provisions and are provided without warranty as
          described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents.

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning.  Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories.

  == There are 4 instances of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  -- The abstract seems to indicate that this document obsoletes RFC2044, but
     the header doesn't have an 'Obsoletes:' line to match this.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (14 October 1997) is 9684 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'ISO-10646' on line 311 looks like a reference

  -- Missing reference section? 'UNICODE' on line 339 looks like a reference

  -- Missing reference section? 'US-ASCII' on line 342 looks like a reference

  -- Missing reference section? 'RFC1642' on line 335 looks like a reference

  -- Missing reference section?
     'FSS-UTF' on line 116 looks like a reference

  -- Missing reference section? 'MIME' on line 318 looks like a reference

  -- Missing reference section? 'RFC1641' on line 332 looks like a reference


     Summary: 7 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Network Working Group                                         F. Yergeau
3    Internet Draft                                         Alis Technologies
4                                                               14 April 1997
5                                                     Expires 14 October 1997

7                            [Will obsolete RFC 2044]

9            UTF-8, a transformation format of Unicode and ISO 10646

11   Status of this Memo

13      This document is an Internet-Draft.  Internet-Drafts are working
14      documents of the Internet Engineering Task Force (IETF), its areas, and
15      its working groups.  Note that other groups may also distribute working
16      documents as Internet-Drafts.

18      Internet-Drafts are draft documents valid for a maximum of six
19      months.  Internet-Drafts may be updated, replaced, or obsoleted by
20      other documents at any time.  It is not appropriate to use
21      Internet-Drafts as reference material or to cite them other than as a
22      "working draft" or "work in progress".

24      To learn the current status of any Internet-Draft, please check the
25      1id-abstracts.txt listing contained in the Internet-Drafts Shadow
26      Directories on ds.internic.net (US East Coast), nic.nordu.net
27      (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
28      Rim).

30      Distribution of this document is unlimited.

32   Abstract

34      ISO/IEC 10646-1 and the Unicode Standard jointly define a multi-octet
35      character set which encompasses most of the world's writing systems.
36      Multi-octet characters, however, are not compatible with many current
37      applications and protocols, and this has led to the development of a
38      few so-called UCS transformation formats (UTF), each with different
39      characteristics.
        UTF-8, the object of this memo, has the characteristic
40      of preserving the full US-ASCII range, providing compatibility
41      with file systems, parsers and other software that rely on US-ASCII
42      values but are transparent to other values.  This memo updates and
43      replaces RFC 2044, in particular addressing the question of versions
44      of the relevant standards.

46   1.  Introduction

48      ISO/IEC 10646-1 [ISO-10646] and the Unicode Standard [UNICODE]
49      jointly define a 16-bit character set, UCS-2, which encompasses most
50      of the world's writing systems.  ISO 10646 further defines a 31-bit
51      character set, UCS-4, with currently no assignments outside of the
52      region corresponding to UCS-2 (the Basic Multilingual Plane, BMP).
53      The UCS-2 and UCS-4 encodings, however, are hard to use in many current
54      applications and protocols that assume 8 or even 7 bit characters.
55      Even newer systems able to deal with 16 bit characters cannot
56      process UCS-4 data.  This situation has led to the development of
57      so-called UCS transformation formats (UTF), each with different
58      characteristics.

60      UTF-1 has only historical interest, having been removed from ISO
61      10646.  UTF-7 has the quality of encoding the full Unicode repertoire
62      using only octets with the high-order bit clear (7 bit US-ASCII
63      values, [US-ASCII]), and is thus deemed a mail-safe encoding
64      ([RFC1642]).  UTF-8, the object of this memo, uses all bits of an
65      octet, but has the quality of preserving the full US-ASCII range:
66      US-ASCII characters are encoded in one octet having the normal US-ASCII
67      value, and any octet with such a value can only stand for a US-ASCII
68      character, and nothing else.

70      UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
71      into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
72      UTF-8 in that UCS-2 values from the reserved range must be treated
73      specially in the UTF-8 transformation.
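
        As a rough illustration of that special treatment, the Python
        sketch below combines a pair of UCS-2 values from the reserved
        range back into a single UCS-4 value, which can then go through
        the UTF-8 transformation.  It assumes the UTF-16 pairing defined
        in Annex Q of ISO/IEC 10646-1 (a value in D800-DBFF followed by a
        value in DC00-DFFF).  The function name combine_utf16_pair is
        hypothetical and does not come from this memo.

```python
def combine_utf16_pair(high, low):
    """Undo the UTF-16 transformation for one pair of reserved UCS-2 values.

    Assumption: `high` lies in D800-DBFF and `low` in DC00-DFFF, as in
    Annex Q of ISO/IEC 10646-1. Anything else is rejected.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not in D800-DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not in DC00-DFFF")
    # Each half of the pair carries 10 bits; the result is offset by 0x10000
    # because values below that are representable directly in UCS-2.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(hex(combine_utf16_pair(0xD834, 0xDD1E)))
```

        The resulting UCS-4 value, not the two UCS-2 values separately,
        is what should be transformed to UTF-8.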

75      UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
76      octets, where the number of octets, and the value of each, depend on
77      the integer value assigned to the character in ISO 10646.  This
78      transformation format has the following characteristics (all values
79      are in hexadecimal):

81      - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
82        correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
83        consequence is that a plain ASCII string is also a valid UTF-8
84        string.

86      - US-ASCII values do not appear otherwise in a UTF-8 encoded character
87        stream.  This provides compatibility with file systems or
88        other software (e.g. the printf() function in C libraries) that
89        parse based on US-ASCII values but are transparent to other
90        values.

92      - Round-trip conversion is easy between UTF-8 and either of UCS-4,
93        UCS-2 or Unicode.

95      - The first octet of a multi-octet sequence indicates the number of
96        octets in the sequence.

98      - The octet values FE and FF never appear.

100     - Character boundaries are easily found from anywhere in an octet
101       stream.

103     - The lexicographic sorting order of UCS-4 strings is preserved.  Of
104       course this is of limited interest since the sort order is not
105       culturally valid in either case.

107     - The Boyer-Moore fast search algorithm can be used with UTF-8 data.

109     - UTF-8 strings can be fairly reliably recognized as such by a simple
110       algorithm, i.e. the probability that a string of characters in
111       any other encoding appears as valid UTF-8 is low, diminishing with
112       increasing string length.

114     UTF-8 was originally a project of the X/Open Joint
115     Internationalization Group XOJIG with the objective to specify a File
116     System Safe UCS Transformation Format [FSS-UTF] that is compatible with
117     UNIX systems, supporting multilingual text in a single encoding.  The
118     original authors were Gary Miller, Greger Leijonhufvud and John Entenmann.
119     Later, Ken Thompson and Rob Pike did significant work for the formal
120     UTF-8.

122     A description can also be found in Unicode Technical Report #4 and in
123     the Unicode Standard, version 2.0 [UNICODE].  The definitive
124     reference, including provisions for UTF-16 data within UTF-8, is Annex R
125     of ISO/IEC 10646-1 [ISO-10646].

127  2.  UTF-8 definition

129     In UTF-8, characters are encoded using sequences of 1 to 6 octets.
130     The only octet of a "sequence" of one has the higher-order bit set to
131     0, the remaining 7 bits being used to encode the character value.  In
132     a sequence of n octets, n>1, the initial octet has the n higher-order
133     bits set to 1, followed by a bit set to 0.  The remaining bit(s) of
134     that octet contain bits from the value of the character to be
135     encoded.  The following octet(s) all have the higher-order bit set to
136     1 and the following bit set to 0, leaving 6 bits in each to contain
137     bits from the character to be encoded.

139     The table below summarizes the format of these different octet types.
140     The letter x indicates bits available for encoding bits of the UCS-4
141     character value.

143     UCS-4 range (hex.)           UTF-8 octet sequence (binary)
144     0000 0000-0000 007F   0xxxxxxx
145     0000 0080-0000 07FF   110xxxxx 10xxxxxx
146     0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

148     0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
149     0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
150     0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

152     Encoding from UCS-4 to UTF-8 proceeds as follows:

154     1) Determine the number of octets required from the character value
155        and the first column of the table above.

157     2) Prepare the high-order bits of the octets as per the second column
158        of the table.

160     3) Fill in the bits marked x from the bits of the character value,
161        starting from the lower-order bits of the character value and
162        putting them first in the last octet of the sequence, then the
163        next to last, etc.
           until all x bits are filled in.

165        The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
166        obtained from the above, in principle, by simply extending each
167        UCS-2 character with two zero-valued octets.  However, UCS-2
168        values between D800 and DFFF, being actually UCS-4 characters
169        transformed through UTF-16, need special treatment: the UTF-16
170        transformation must be undone, yielding a UCS-4 character that is
171        then transformed as above.

173     Decoding from UTF-8 to UCS-4 proceeds as follows:

175     1) Initialize the 4 octets of the UCS-4 character with all bits set
176        to 0.

178     2) Determine which bits encode the character value from the number of
179        octets in the sequence and the second column of the table above
180        (the bits marked x).

182     3) Distribute the bits from the sequence to the UCS-4 character,
183        first the lower-order bits from the last octet of the sequence and
184        proceeding to the left until no x bits are left.

186        If the UTF-8 sequence is no more than three octets long, decoding
187        can proceed directly to UCS-2 (or equivalently Unicode).

189     A more detailed algorithm and formulae can be found in [FSS_UTF],
191     [UNICODE] or Annex R to [ISO-10646].

193  3.  Versions of the standards

195     Different versions of the Unicode standard exist: 1.0, 1.1 and 2.0 as
196     of this writing.  Each new version obsoletes and replaces the previous
197     one, but implementations, and more significantly data, are not
198     updated instantly.  Similarly, ISO 10646 is updated from time to time
199     by published amendments, which up to now have tracked the changes in
200     the Unicode standard, so that the two have remained in sync.

202     In general, the changes amount to adding new characters, which does
203     not pose particular problems with old data.  Amendment 5 to ISO
204     10646, however, has moved and expanded the Korean Hangul block,
205     thereby making any previous data containing Hangul characters invalid
206     under the new version.
        Unicode 2.0 has the same difference from Unicode 1.1.
207     The official justification for allowing such an incompatible
208     change was that no implementations and no data containing Hangul
209     existed, a statement that is likely to be true but remains
210     unprovable.  The incident has been dubbed the "Korean mess", and the
211     relevant committees have pledged to never, ever again make such an
212     incompatible change.

214     New versions, and in particular any incompatible changes, have
215     consequences regarding MIME character encoding labels, to be discussed
216     in section 5.

218  4.  Examples

220     The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
221     002E) may be encoded as follows:

223         41 E2 89 A2 CE 91 2E

225     The UCS-2 sequence representing the Hangul characters for the Korean
226     word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

228         ED 95 9C EA B5 AD EC 96 B4

230     The UCS-2 sequence representing the Han characters for the Japanese
231     word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

233         E6 97 A5 E6 9C AC E8 AA 9E

235  5.  MIME registration

237     This memo is meant to serve as the basis for registration of a MIME
238     character set parameter (charset) [MIME].  The proposed charset
239     parameter value is "UTF-8".  This string would label media types
240     containing text consisting of characters from the repertoire of ISO/IEC
241     10646 encoded to a sequence of octets using the encoding scheme
242     outlined above.  UTF-8 is suitable for use in MIME content types under
243     the "text" top-level type.

245     It is noteworthy that the label "UTF-8" does not contain a version
246     identification, referring generically to ISO/IEC 10646.  This is
247     intentional, the rationale being as follows:

249     A MIME charset label is designed to give just the information needed
250     to interpret a sequence of bytes received on the wire into a sequence
251     of characters, nothing more (see RFC 2045, section 2.2, in [MIME]).
252     As long as a character set standard does not change incompatibly,
253     version numbers serve no purpose, because one gains nothing by
254     learning from the tag that newly assigned characters may be received
255     that one doesn't know about.  The tag doesn't teach anything about the
256     new characters, and they are going to be received anyway.

258     Hence, as long as the standards evolve compatibly, the apparent
259     advantage of having labels that identify the versions is only that,
260     apparent.  But there is a disadvantage to such version-dependent
261     labels: when an older application receives data accompanied by a
262     newer, unknown label, it may fail to recognize the label and be
263     completely unable to deal with the data, whereas a generic, known label
264     would have triggered mostly correct processing of the data, which may
265     well not contain any new characters.

267     Now the "Korean mess" (ISO 10646 amendment 5) is an incompatible
268     change, in principle contradicting the appropriateness of a
269     version-independent MIME charset label as described above.  But the
270     compatibility problem can only appear with data containing Korean Hangul
271     characters encoded according to Unicode 1.1 (or equivalently ISO
272     10646 before amendment 5), and there is arguably no such data to
273     worry about, this being the very reason the incompatible change was
274     deemed acceptable.

276     In practice, then, a version-independent label is warranted.  Should
277     the need ever arise to distinguish data containing Hangul encoded
278     according to Unicode 1.1, then a version-dependent label, for that
279     version only, should be registered (a suggestion would be
280     "UNICODE-1-1-UTF-8"), in order to retain the advantages of a
281     version-independent label for 2.0 and later versions.
        Such a version-dependent
282     label could even be registered before actual need arises,
283     preemptively, but it is important to strongly recommend against creating
284     any new Hangul-containing data without taking Amendment 5 of ISO
285     10646 into account.

287  6.  Security Considerations

289     Security issues are not discussed in this memo.

291  Acknowledgments

293     The following have participated in the drafting and discussion of
294     this memo:

296        James E. Agenbroad    Andries Brouwer
297        Martin J. Dürst       David Goldsmith
298        Edwin F. Hart         Kent Karlsson
299        Markus Kuhn           Michael Kung
300        Alain LaBonté         Murray Sargent
301        Keld Simonsen         Arnold Winkler

303  Bibliography

305     [FSS_UTF]      X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
306                    22p. pbk. 172g.  4/95, X/Open Company Ltd., "File
307                    System Safe UCS Transformation Format (FSS_UTF)", X/Open
308                    Preliminary Specification, Document Number P316.  Also
309                    published in Unicode Technical Report #4.

311     [ISO-10646]    ISO/IEC 10646-1:1993. International Standard --
312                    Information technology -- Universal Multiple-Octet Coded
313                    Character Set (UCS) -- Part 1: Architecture and Basic
314                    Multilingual Plane.  UTF-8 is described in Annex R,
315                    published as Amendment 2.  UTF-16 is described in
316                    Annex Q, published as Amendment 1.

318     [MIME]         N. Freed, N. Borenstein, "Multipurpose Internet Mail
319                    Extensions (MIME) Part One:  Format of Internet
320                    Message Bodies", RFC 2045.  N. Freed, N. Borenstein,
321                    "Multipurpose Internet Mail Extensions (MIME) Part
322                    Two:  Media Types", RFC 2046.  K. Moore, "MIME
323                    (Multipurpose Internet Mail Extensions) Part Three:
324                    Message Header Extensions for Non-ASCII Text", RFC 2047.
325                    N. Freed, J. Klensin, J. Postel, "Multipurpose Internet
326                    Mail Extensions (MIME) Part Four: Registration
327                    Procedures", RFC 2048.  N. Freed, N. Borenstein,
328                    "Multipurpose Internet Mail Extensions (MIME) Part Five:
329                    Conformance Criteria and Examples", RFC 2049.  All
330                    November 1996.

332     [RFC1641]      D. Goldsmith, M. Davis, "Using Unicode with MIME", RFC
333                    1641, Taligent inc., July 1994.

335     [RFC1642]      D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
336                    Transformation Format of Unicode", RFC 1642, Taligent inc.,
337                    July 1994.

339     [UNICODE]      The Unicode Consortium, "The Unicode Standard --
340                    Version 2.0", Addison-Wesley, 1996.

342     [US-ASCII]     Coded Character Set--7-bit American Standard Code for
343                    Information Interchange, ANSI X3.4-1986.

345  Author's Address

347        François Yergeau
348        Alis Technologies
349        100, boul. Alexis-Nihon
350        Suite 600
351        Montréal  QC  H4M 2P2
352        Canada

354        Tel: +1 (514) 747-2547
355        Fax: +1 (514) 747-2561
356        EMail: fyergeau@alis.com
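
     Illustrative sketch

        The encoding and decoding steps of section 2 can be sketched in
        Python as follows.  This is a minimal illustration under stated
        assumptions, not a reference implementation.  The names _TABLE,
        encode_ucs4 and decode_utf8_sequence are hypothetical.  The
        decoder accumulates bits from the first octet onward, which gives
        the same result as distributing them from the last octet
        backward.  It does not reject longer-than-necessary sequences,
        and neither function treats values in D800-DFFF specially, so the
        UTF-16 transformation is assumed to have been undone already.

```python
# Rows of the table in section 2:
# (upper bound of the UCS-4 range, number of octets, fixed bits of first octet)
_TABLE = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_ucs4(value):
    """Encode one UCS-4 value (an int) as a bytes object of 1 to 6 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    # Step 1: number of octets, from the first column of the table.
    count, lead = next((n, bits) for upper, n, bits in _TABLE if value <= upper)
    if count == 1:
        return bytes([value])
    octets = bytearray(count)
    # Steps 2 and 3: fill the x bits starting with the last octet.
    for i in range(count - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)
        value >>= 6
    octets[0] = lead | value
    return bytes(octets)


def decode_utf8_sequence(octets):
    """Decode exactly one UTF-8 sequence (bytes) to a UCS-4 value (an int)."""
    if not octets:
        raise ValueError("empty sequence")
    first = octets[0]
    if first < 0x80:
        count, value = 1, first
    else:
        for _, count, lead in _TABLE[1:]:
            # Mask covering the n one-bits and the zero bit that follows.
            mask = (0xFF << (7 - count)) & 0xFF
            if (first & mask) == lead:
                value = first & ~mask & 0xFF
                break
        else:
            raise ValueError("octet cannot start a sequence")
    if len(octets) != count:
        raise ValueError("wrong number of octets for this first octet")
    for octet in octets[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("following octet does not have the form 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value


if __name__ == "__main__":
    # First example of section 4: 0041, 2262, 0391, 002E.
    data = b"".join(encode_ucs4(c) for c in (0x0041, 0x2262, 0x0391, 0x002E))
    print(data.hex(" ").upper())
```

        Run as a script, the sketch prints the octets of the first
        example in section 4.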