idnits 2.17.1 draft-yergeau-rfc2279bis-01.txt: -(514): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == There are 4 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The abstract seems to indicate that this document obsoletes RFC2279, but the header doesn't have an 'Obsoletes:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (September 1, 2002) is 7907 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? 'UNICODE' on line 482 looks like a reference -- Missing reference section? 'US-ASCII' on line 492 looks like a reference -- Missing reference section? 'RFC2119' on line 473 looks like a reference -- Missing reference section? 'RFC2234' on line 476 looks like a reference -- Missing reference section? '1' on line 496 looks like a reference -- Missing reference section? 'RFC2978' on line 479 looks like a reference -- Missing reference section? 'RFC2045' on line 469 looks like a reference Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group F. Yergeau 3 Internet-Draft Alis Technologies 4 Expires: March 2, 2003 September 1, 2002 6 UTF-8, a transformation format of ISO 10646 7 draft-yergeau-rfc2279bis-01 9 Status of this Memo 11 This document is an Internet-Draft and is in full conformance with 12 all provisions of Section 10 of RFC2026. 14 Internet-Drafts are working documents of the Internet Engineering 15 Task Force (IETF), its areas, and its working groups. Note that 16 other groups may also distribute working documents as Internet- 17 Drafts. 19 Internet-Drafts are draft documents valid for a maximum of six months 20 and may be updated, replaced, or obsoleted by other documents at any 21 time. It is inappropriate to use Internet-Drafts as reference 22 material or to cite them other than as "work in progress." 24 The list of current Internet-Drafts can be accessed at http:// 25 www.ietf.org/ietf/1id-abstracts.txt. 
27 The list of Internet-Draft Shadow Directories can be accessed at 28 http://www.ietf.org/shadow.html. 30 This Internet-Draft will expire on March 2, 2003. 32 Copyright Notice 34 Copyright (C) The Internet Society (2002). All Rights Reserved. 36 Abstract 38 <1> 39 ISO/IEC 10646-1 defines a large character set called the Universal 40 Character Set (UCS) which encompasses most of the world's writing 41 systems. The originally proposed encodings of the UCS, however, were 42 not compatible with many current applications and protocols, and this 43 has led to the development of UTF-8, the object of this memo. UTF-8 44 has the characteristic of preserving the full US-ASCII range, 45 providing compatibility with file systems, parsers and other software 46 that rely on US-ASCII values but are transparent to other values. 47 This memo updates and replaces RFC 2279. 48 <2> 49 Discussion of this draft should take place on the ietf- 50 charsets@iana.org mailing list. 52 Table of Contents 54 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 55 2. Notational conventions . . . . . . . . . . . . . . . . . . . . 5 56 3. UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . . 6 57 4. Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . . 8 58 5. Versions of the standards . . . . . . . . . . . . . . . . . . 9 59 6. Byte order mark (BOM) . . . . . . . . . . . . . . . . . . . . 10 60 7. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 61 8. MIME registration . . . . . . . . . . . . . . . . . . . . . . 13 62 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 63 10. Security Considerations . . . . . . . . . . . . . . . . . . . 15 64 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 16 65 Author's Address . . . . . . . . . . . . . . . . . . . . . . . 17 66 A. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 18 67 B. Changes from RFC 2279 . . . . . . . . . . . . . . . . . . . . 
19 68 Full Copyright Statement . . . . . . . . . . . . . . . . . . . 20 70 1. Introduction 71 <3> 72 ISO/IEC 10646 [ISO.10646-1] defines a large character set called the 73 Universal Character Set (UCS), which encompasses most of the world's 74 writing systems. The same set of characters is defined by the 75 Unicode standard [UNICODE], which further defines additional 76 character properties and other application details of great interest 77 to implementors. Up to the present time, changes in Unicode and 78 amendments and additions to ISO/IEC 10646 have tracked each other, so 79 that the character repertoires and code point assignments have 80 remained in sync. The relevant standardization committees have 81 committed to maintain this very useful synchronism. 82 <4> 83 ISO/IEC 10646 and Unicode define several encoding forms of their 84 common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an 85 encoding form, each character is represented as one or more encoding 86 units. All standard UCS encoding forms except UTF-8 have an encoding 87 unit larger than one octet, making them hard to use in many current 88 applications and protocols that assume 8 or even 7 bit characters. 89 <5> 90 UTF-8, the object of this memo, has a one-octet encoding unit. It 91 uses all bits of an octet, but has the quality of preserving the full 92 US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one 93 octet having the normal US-ASCII value, and any octet with such a 94 value can only stand for a US-ASCII character, and nothing else. 95 <6> 96 UTF-8 encodes UCS characters as a varying number of octets, where the 97 number of octets, and the value of each, depend on the integer value 98 assigned to the character in ISO/IEC 10646 (the character number, 99 a.k.a. code point or Unicode scalar value).
This encoding form has 100 the following characteristics (all values are in hexadecimal): 101 <7> 102 o Character numbers from U+0000 to U+007F (US-ASCII repertoire) 103 correspond to octets 00 to 7F (7 bit US-ASCII values). A direct 104 consequence is that a plain ASCII string is also a valid UTF-8 105 string. 106 <8> 107 o US-ASCII octet values do not appear otherwise in a UTF-8 encoded 108 character stream. This provides compatibility with file systems 109 or other software (e.g. the printf() function in C libraries) 110 that parse based on US-ASCII values but are transparent to other 111 values. 112 <9> 113 o Round-trip conversion is easy between UTF-8 and other encoding 114 forms. 115 <10> 116 o The first octet of a multi-octet sequence indicates the number of 117 octets in the sequence. 119 <11> 120 o The octet values C0, C1, FE and FF never appear. If the range of 121 character numbers is restricted to U+0000..U+10FFFF (the UTF-16 122 accessible range), then the octet values F5..FD also never appear. 123 <12> 124 o Character boundaries are easily found from anywhere in an octet 125 stream. 126 <13> 127 o The lexicographic sorting order of UTF-8 strings is the same as if 128 ordered by character numbers. Of course this is of limited 129 interest since a sort order based on character numbers is not 130 culturally valid. 131 <14> 132 o The Boyer-Moore fast search algorithm can be used with UTF-8 data. 133 <15> 134 o UTF-8 strings can be fairly reliably recognized as such by a 135 simple algorithm, i.e. the probability that a string of 136 characters in any other encoding appears as valid UTF-8 is low, 137 diminishing with increasing string length. 139 <16> 140 UTF-8 was originally a project of the X/Open Joint 141 Internationalization Group XOJIG with the objective to specify a File 142 System Safe UCS Transformation Format [FSS_UTF] that is compatible 143 with UNIX systems, supporting multilingual text in a single encoding. 
144 The original authors were Gary Miller, Greger Leijonhufvud and John 145 Entenmann. Later, Ken Thompson and Rob Pike did significant work for 146 the formal definition of UTF-8. 148 2. Notational conventions 149 <17> 150 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 151 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 152 document are to be interpreted as described in [RFC2119]. 153 <18> 154 UCS characters are designated by the U+HHHH notation, where HHHH is a 155 string of 4 to 6 hexadecimal digits representing the character 156 number in ISO/IEC 10646. 158 3. UTF-8 definition 159 <19> 160 UTF-8 is defined by Annex D of ISO/IEC 10646-1 [ISO.10646-1]. 161 Descriptions and formulae can also be found in the Unicode Standard 162 [UNICODE] and in [FSS_UTF]. 163 <20> 164 In UTF-8, characters are encoded using sequences of 1 to 6 octets. 165 If the range of character numbers is restricted to U+0000..U+10FFFF 166 (the UTF-16 accessible range), then only sequences of one to four 167 octets will occur. The only octet of a "sequence" of one has the 168 higher-order bit set to 0, the remaining 7 bits being used to encode 169 the character number. In a sequence of n octets, n>1, the initial 170 octet has the n higher-order bits set to 1, followed by a bit set to 171 0. The remaining bit(s) of that octet contain bits from the number 172 of the character to be encoded. The following octet(s) all have the 173 higher-order bit set to 1 and the following bit set to 0, leaving 6 174 bits in each to contain bits from the character to be encoded. 175 <21> 176 The table below summarizes the format of these different octet types. 177 The letter x indicates bits available for encoding bits of the 178 character number.
180 Char. number range  |        UTF-8 octet sequence
181    (hexadecimal)    |              (binary)
182 --------------------+---------------------------------------------
183 0000 0000-0000 007F | 0xxxxxxx
184 0000 0080-0000 07FF | 110xxxxx 10xxxxxx
185 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
186 0001 0000-001F FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
187 0020 0000-03FF FFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
188 0400 0000-7FFF FFFF | 1111110x 10xxxxxx ... 10xxxxxx
189 <22> 190 Encoding a character to UTF-8 proceeds as follows: 191 <23> 192 1. Determine the number of octets required from the character number 193 and the first column of the table above. It is important to note 194 that the rows of the table are mutually exclusive, i.e. there is 195 only one valid way to encode a given character. 196 <24> 197 2. Prepare the high-order bits of the octets as per the second 198 column of the table. 199 <25> 200 3. Fill in the bits marked x from the bits of the character number, 201 expressed in binary. Start from the lower-order bits of the 202 character number and put them first in the last octet of the 203 sequence, then the next to last, etc. until all x bits are 204 filled in. 206 <26> 207 The definition of UTF-8 prohibits encoding character numbers between 208 U+D800 and U+DFFF, which are reserved for use with the UTF-16 209 encoding form (as surrogate pairs) and do not directly represent 210 characters. When encoding in UTF-8 from UTF-16 data, it is necessary 211 to first decode the UTF-16 data to obtain character numbers, which 212 are then encoded in UTF-8 as described above. 213 <27> 214 Decoding a UTF-8 character proceeds as follows: 215 <28> 216 1. Initialize a binary number with all bits set to 0. Up to 31 bits 217 may be needed (up to 21 if the range of character numbers is 218 known to be restricted to the UTF-16 accessible range). 219 <29> 220 2.
Determine which bits encode the character number from the number 221 of octets in the sequence and the second column of the table 222 above (the bits marked x). 223 <30> 224 3. Distribute the bits from the sequence to the binary number, first 225 the lower-order bits from the last octet of the sequence and 226 proceeding to the left until no x bits are left. The binary 227 number is now equal to the character number. 229 <31> 230 Implementations of the decoding algorithm above MUST protect against 231 decoding invalid sequences. For instance, a naive implementation may 232 decode the overlong UTF-8 sequence C0 80 into the character U+0000, 233 or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding 234 invalid sequences may have security consequences or cause other 235 problems. See Security Considerations (Section 10) below. 237 4. Syntax of UTF-8 Byte Sequences 238 <32> 239 A UTF-8 string is a sequence of bytes representing a sequence of UCS 240 characters. The byte sequence is valid UTF-8 only if it matches the 241 following syntax, which is derived from the rules for encoding UTF-8 242 and is expressed in the ABNF of [RFC2234]. 
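As a non-normative illustration (not part of the original memo), the decoding steps of Section 3 and the protections required there can be sketched in Python as follows. The names decode_utf8, _sequence_length and _MIN_FOR_LENGTH are hypothetical. The sketch assumes the full range of 1 to 6 octets described in this draft, not only the UTF-16 accessible range. The ABNF referred to in the preceding paragraph follows the sketch.

```python
# Non-normative sketch; all names here are illustrative assumptions.

# Smallest character number that legitimately needs n octets (first column
# of the table in Section 3). A smaller decoded value is an overlong form.
_MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800,
                   4: 0x10000, 5: 0x200000, 6: 0x4000000}


def _sequence_length(first):
    """Return the sequence length announced by the first octet."""
    if first < 0x80:
        return 1                      # 0xxxxxxx
    if first < 0xC0:
        raise ValueError("unexpected continuation octet")
    n = 2
    mask = 0x20
    while first & mask:               # 110xxxxx -> 2, 1110xxxx -> 3, ...
        n += 1
        mask >>= 1
    if n > 6:
        raise ValueError("octets FE and FF never appear in UTF-8")
    return n


def decode_utf8(data):
    """Decode bytes into a list of character numbers, rejecting bad input."""
    out = []
    i = 0
    while i < len(data):
        n = _sequence_length(data[i])
        if i + n > len(data):
            raise ValueError("truncated sequence")
        # Keep only the bits marked x in the first octet.
        value = data[i] & (0x7F if n == 1 else 0xFF >> (n + 1))
        for octet in data[i + 1:i + n]:
            if (octet & 0xC0) != 0x80:
                raise ValueError("expected continuation octet 10xxxxxx")
            value = (value << 6) | (octet & 0x3F)
        if value < _MIN_FOR_LENGTH[n]:
            raise ValueError("overlong sequence")
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError("surrogate character number")
        out.append(value)
        i += n
    return out
```

For instance, decode_utf8(bytes.fromhex("41E289A2CE912E")) yields the character numbers U+0041, U+2262, U+0391 and U+002E of the first example in Section 7, while the overlong sequence C0 80 and the encoded surrogate ED A1 8C both raise ValueError.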
244 UTF8-string = *( UTF8-char )
245 UTF8-char   = UTF8-1 /
246               UTF8-2-head 1( UTF8-tail ) /
247               UTF8-3-head 1( UTF8-tail ) /
248               UTF8-4-head 2( UTF8-tail ) /
249               UTF8-5-head 3( UTF8-tail ) /
250               UTF8-6-head 4( UTF8-tail )
251 UTF8-1      = %x00-7F
252 UTF8-2-head = %xC2-DF
253 UTF8-3-head = %xE0 %xA0-BF / %xE1-EC %x80-BF /
254               %xED %x80-9F / %xEE-EF %x80-BF
255 UTF8-4-head = %xF0 %x90-BF / %xF1-F7 %x80-BF
256 UTF8-5-head = %xF8 %x88-BF / %xF9-FB %x80-BF
257 UTF8-6-head = %xFC %x84-BF / %xFD %x80-BF
258 UTF8-tail   = %x80-BF

260 UTF8-string = *( UTF8-char )
261 UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6
262 UTF8-char   = UTF8-1 /
263               UTF8-2 /
264               UTF8-3 /
265               UTF8-4 /
266               UTF8-5 /
267               UTF8-6
268 UTF8-1      = %x00-7F
269 UTF8-2      = %xC2-DF UTF8-tail
270 UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
271               %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
272 UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F7 3( UTF8-tail )
273 UTF8-5      = %xF8 %x88-BF 3( UTF8-tail ) / %xF9-FB 4( UTF8-tail )
274 UTF8-6      = %xFC %x84-BF 4( UTF8-tail ) / %xFD 5( UTF8-tail )
275 UTF8-tail   = %x80-BF

277 5. Versions of the standards 278 <33> 279 ISO/IEC 10646 is updated from time to time by publication of 280 amendments and additional parts; similarly, new versions of the 281 Unicode standard are published over time. Each new version obsoletes 282 and replaces the previous one, but implementations, and more 283 significantly data, are not updated instantly. 284 <34> 285 In general, the changes amount to adding new characters, which does 286 not pose particular problems with old data. In 1996, Amendment 5 to 287 the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded 288 the Korean Hangul block, thereby making any previous data containing 289 Hangul characters invalid under the new version. Unicode 2.0 has the 290 same difference from Unicode 1.1.
The justification for allowing 291 such an incompatible change was that there were no major 292 implementations and no significant amounts of data containing Hangul. 293 The incident has been dubbed the "Korean mess", and the relevant 294 committees have pledged to never, ever again make such an 295 incompatible change (see Unicode Consortium Policies [1]). 296 <35> 297 New versions, and in particular any incompatible changes, have 298 consequences regarding MIME charset labels, to be discussed in MIME 299 registration (Section 8). 301 6. Byte order mark (BOM) 302 <36> 303 The Unicode Standard and ISO 10646 define the character "ZERO WIDTH 304 NO-BREAK SPACE" (U+FEFF), which is also known informally as "BYTE 305 ORDER MARK" (abbreviated "BOM"). The latter name hints at a second 306 possible usage of the character, in addition to its normal use as a 307 genuine "ZERO WIDTH NO-BREAK SPACE" within text. This usage, 308 suggested by Unicode section 2.7 and ISO/IEC 10646 Annex H 309 (informative), is to prepend a U+FEFF character to a stream of UCS 310 characters as a "signature"; a receiver of such a serialized stream 311 may then use the initial character as a hint that the stream consists 312 of UCS characters. The signature can also be used to recognize which 313 UCS encoding is involved and, with encodings having a multi-octet 314 encoding unit, as a way to recognize the serialization order of the 315 octets. UTF-8 having a single-octet encoding unit, this last 316 function is useless and the BOM will always appear as the octet 317 sequence EF BB BF. 318 <37> 319 It is important to understand that the character U+FEFF appearing at 320 any position other than the beginning of a stream MUST be interpreted 321 with the semantics for the zero-width non-breaking space, and MUST 322 NOT be interpreted as a byte-order mark. 
The converse of that 323 statement is not always true: the character U+FEFF in the first 324 position of a stream MAY be interpreted as a zero-width non-breaking 325 space, and is not always a byte-order mark. For example, if a 326 process splits a UCS string into many parts, a part might begin with 327 U+FEFF because there was a zero-width no-break space at the beginning 328 of that substring. 329 <38> 330 The Unicode standard further suggests that an initial U+FEFF 331 character may be stripped before processing the text, the rationale 332 being that such a character in initial position may be an artifact of 333 the encoding (an encoding signature), not a genuine intended "ZERO 334 WIDTH NO-BREAK SPACE". Note that such stripping might affect an 335 external process at a different layer (such as a digital signature or 336 a count of the characters) that is relying on the presence of all 337 characters in the stream. 338 <39> 339 In particular, in UTF-8 plain text it is likely, but not certain, 340 that an initial octet sequence of EF BB BF is a signature. When 341 concatenating two strings, it is important to strip out those 342 signatures, because otherwise the resulting string may contain an 343 unintended "ZERO WIDTH NO-BREAK SPACE" at the connection point. 344 <40> 345 In an attempt at diminishing the uncertainty, Unicode 3.2 adds a new 346 character, U+2060 WORD JOINER, with exactly the same semantics and 347 usage as U+FEFF except for the signature function, and strongly 348 recommends its exclusive use for expressing word-joining semantics. 350 Eventually, following this recommendation will make it all but 351 certain that any initial U+FEFF is a signature, not an intended "ZERO 352 WIDTH NO-BREAK SPACE". 354 7. Examples 355 <41> 356 The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL TO><ALPHA>."
is encoded in UTF-8 as follows: 359 --+--------+-----+-- 360 41 E2 89 A2 CE 91 2E 361 --+--------+-----+-- 362 <42> 363 The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo", 364 meaning "the Korean language") is encoded in UTF-8 as follows: 366 --------+--------+-------- 367 ED 95 9C EA B5 AD EC 96 B4 368 --------+--------+-------- 369 <43> 370 The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo", 371 meaning "the Japanese language") is encoded in UTF-8 as follows: 373 --------+--------+-------- 374 E6 97 A5 E6 9C AC E8 AA 9E 375 --------+--------+-------- 376 <44> 377 The character U+233B4 (a Chinese character meaning 'stump of tree'), 378 prepended with a UTF-8 BOM, is encoded in UTF-8 as follows: 380 --------+----------- 381 EF BB BF F0 A3 8E B4 382 --------+----------- 384 8. MIME registration 385 <45> 386 This memo serves as the basis for registration of the MIME charset 387 parameter for UTF-8, according to [RFC2978]. The charset parameter 388 value is "UTF-8". This string labels media types containing text 389 consisting of characters from the repertoire of ISO/IEC 10646 390 including all amendments at least up to amendment 5 of the 1993 391 edition (Korean block), encoded to a sequence of octets using the 392 encoding scheme outlined above. UTF-8 is suitable for use in MIME 393 content types under the "text" top-level type. 394 <46> 395 It is noteworthy that the label "UTF-8" does not contain a version 396 identification, referring generically to ISO/IEC 10646. This is 397 intentional, the rationale being as follows: 398 <47> 399 A MIME charset label is designed to give just the information needed 400 to interpret a sequence of bytes received on the wire into a sequence 401 of characters, nothing more (see [RFC2045], section 2.2). 
As long as 402 a character set standard does not change incompatibly, version 403 numbers serve no purpose, because one gains nothing by learning from 404 the tag that newly assigned characters may be received that one 405 doesn't know about. The tag itself doesn't teach anything about the 406 new characters, which are going to be received anyway. 407 <48> 408 Hence, as long as the standards evolve compatibly, the apparent 409 advantage of having labels that identify the versions is only that, 410 apparent. But there is a disadvantage to such version-dependent 411 labels: when an older application receives data accompanied by a 412 newer, unknown label, it may fail to recognize the label and be 413 completely unable to deal with the data, whereas a generic, known 414 label would have triggered mostly correct processing of the data, 415 which may well not contain any new characters. 416 <49> 417 Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible 418 change, in principle contradicting the appropriateness of a version 419 independent MIME charset label as described above. But the 420 compatibility problem can only appear with data containing Korean 421 Hangul characters encoded according to Unicode 1.1 (or equivalently 422 ISO/IEC 10646 before amendment 5), and there is arguably no such data 423 to worry about, this being the very reason the incompatible change 424 was deemed acceptable. 425 <50> 426 In practice, then, a version-independent label is warranted, provided 427 the label is understood to refer to all versions after Amendment 5, 428 and provided no incompatible change actually occurs. Should 429 incompatible changes occur in a later version of ISO/IEC 10646, the 430 MIME charset label defined here will stay aligned with the previous 431 version until and unless the IETF specifically decides otherwise. 433 9. IANA Considerations 434 <51> 435 The entry for UTF-8 in the IANA charset registry should be updated to 436 point to this memo. 
438 10. Security Considerations 439 <52> 440 Implementors of UTF-8 need to consider the security aspects of how 441 they handle illegal UTF-8 sequences. It is conceivable that in some 442 circumstances an attacker would be able to exploit an incautious UTF- 443 8 parser by sending it an octet sequence that is not permitted by the 444 UTF-8 syntax. 445 <53> 446 A particularly subtle form of this attack can be carried out against 447 a parser which performs security-critical validity checks against the 448 UTF-8 encoded form of its input, but interprets certain illegal octet 449 sequences as characters. For example, a parser might prohibit the 450 NUL character when encoded as the single-octet sequence 00, but 451 erroneously allow the illegal two-octet sequence C0 80 and interpret 452 it as a NUL character. Another example might be a parser which 453 prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the 454 illegal octet sequence 2F C0 AE 2E 2F. This last exploit has 455 actually been used in a widespread virus attacking Web servers in 456 2001; the security threat is thus very real. 458 Bibliography 460 [FSS_UTF] X/Open Company Ltd., "X/Open CAE Specification C501 -- 461 File System Safe UCS Transformation Format (FSS_UTF)", 462 ISBN 1-85912-082-2, April 1995. 464 [ISO.10646-1] International Organization for Standardization, 465 "Information Technology - Universal Multiple-octet 466 coded Character Set (UCS) - Part 1: Architecture and 467 Basic Multilingual Plane", ISO Standard 10646-1, 2000. 469 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet 470 Mail Extensions (MIME) Part One: Format of Internet 471 Message Bodies", RFC 2045, November 1996. 473 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 474 Requirement Levels", BCP 14, RFC 2119, March 1997. 476 [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 477 Specifications: ABNF", RFC 2234, November 1997. 479 [RFC2978] Freed, N. and J. 
Postel, "IANA Charset Registration 480 Procedures", BCP 19, RFC 2978, October 2000. 482 [UNICODE] The Unicode Consortium, "The Unicode Standard -- 483 Version 3.2", defined by The Unicode Standard, 484 Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 485 0-201-61633-5), as amended by the Unicode Standard 486 Annex #27: Unicode 3.1 (see http://www.unicode.org/ 487 reports/tr27) and by the Unicode Standard Annex #28: 488 Unicode 3.2 (see http://www.unicode.org/reports/tr28), 489 March 2002, . 492 [US-ASCII] American National Standards Institute, "Coded 493 Character Set - 7-bit American Standard Code for 494 Information Interchange", ANSI X3.4, 1986. 496 [1] 498 Author's Address 500 François Yergeau 501 Alis Technologies 502 100, boul. Alexis-Nihon, bureau 600 503 Montréal, QC H4M 2P2 504 Canada 506 Phone: +1 514 747 2547 507 Fax: +1 514 747 2561 508 EMail: fyergeau@alis.com 510 Appendix A. Acknowledgements 511 <62> 512 The following have participated in the drafting and discussion of 513 this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer, 514 Mark Davis, Martin J. Dürst, Patrick Fältström, Ned Freed, David 515 Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood, 516 Kent Karlsson, Markus Kuhn, Michael Kung, Alain LaBonté, John 517 Gardiner Myers, Dan Oscarsson, Murray Sargent, Markus Scherer, Keld 518 Simonsen, Arnold Winkler, Kenneth Whistler and Misha Wolf. 520 Appendix B. Changes from RFC 2279 521 <63> 522 <64> 523 o Significantly shortened Introduction. No more mention of UTF-1 or 524 UTF-7, of Transformation Formats. 525 <65> 526 o Straightened out terminology. UTF-8 now described in terms of an 527 encoding form of the character number. UCS-2 and UCS-4 almost 528 disappeared. 529 <66> 530 o Note warning against decoding of invalid sequences turned into a 531 normative MUST NOT. 532 <67> 533 o New section about the BOM, mostly extracted and slightly adapted 534 from RFC 2781.
535 <68> 536 o Updated a couple of references (10646-1:2000, Unicode 3.2, RFC 537 2978). 538 <69> 539 o Added TOC. 540 <70> 541 o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration. 542 <71> 543 o New "Notational conventions" section about RFC 2119 and U+HHHH 544 notation. 545 <72> 546 o Pointer to Unicode Consortium Policies added in "Versions of the 547 standards" section. 548 <73> 549 o Added a fourth example with a non-BMP character and a BOM. 550 <74> 551 o Added a paragraph about U+2060 WORD JOINER. 552 <75> 553 o Enumerate more byte values impossible in UTF-8, either as a result 554 of forbidding overlong sequences or of restricting to the UTF-16 555 accessible range. 556 <76> 557 o Added "IANA Considerations" section to ask that the UTF-8 entry in 558 the charset registry point to this memo. 560 Full Copyright Statement 562 Copyright (C) The Internet Society (2002). All Rights Reserved. 564 This document and translations of it may be copied and furnished to 565 others, and derivative works that comment on or otherwise explain it 566 or assist in its implementation may be prepared, copied, published 567 and distributed, in whole or in part, without restriction of any 568 kind, provided that the above copyright notice and this paragraph are 569 included on all such copies and derivative works. However, this 570 document itself may not be modified in any way, such as by removing 571 the copyright notice or references to the Internet Society or other 572 Internet organizations, except as needed for the purpose of 573 developing Internet standards in which case the procedures for 574 copyrights defined in the Internet Standards process must be 575 followed, or as required to translate it into languages other than 576 English. 578 The limited permissions granted above are perpetual and will not be 579 revoked by the Internet Society or its successors or assigns. 
581 This document and the information contained herein is provided on an 582 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 583 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 584 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 585 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 586 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 588 Acknowledgement 590 Funding for the RFC Editor function is currently provided by the 591 Internet Society.