idnits 2.17.1

draft-klensin-net-utf8-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 18.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 950.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 961.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 968.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 974.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == The 'Obsoletes: ' line in the draft header should list only the _numbers_
     of the RFCs which will be obsoleted by this document (if approved); it
     should not include the word 'RFC' in the list.

  == The 'Updates: ' line in the draft header should list only the _numbers_
     of the RFCs which will be updated by this document (if approved); it
     should not include the word 'RFC' in the list.

  -- The draft header indicates that this document obsoletes RFC698, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 10, 2008) is 5920 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NFC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32'

  -- Obsolete informational reference (is this intentional?): RFC 542
     (Obsoleted by RFC 765)

  -- Obsolete informational reference (is this intentional?): RFC 698
     (Obsoleted by RFC 5198)

  -- Obsolete informational reference (is this intentional?): RFC 742
     (Obsoleted by RFC 1194, RFC 1196, RFC 1288)

  -- Obsolete informational reference (is this intentional?): RFC 954
     (Obsoleted by RFC 3912)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 2821
     (Obsoleted by RFC 5321)

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)

  -- Obsolete informational reference (is this intentional?): RFC 3491
     (Obsoleted by RFC 5891)

     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 20 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
-------------------------------------------------------------------------------- 2 Network Working Group J. Klensin 3 Internet-Draft M. Padlipsky 4 Obsoletes: RFC 698 February 10, 2008 5 (if approved) 6 Updates: RFC854 (if approved) 7 Intended status: Standards Track 8 Expires: August 13, 2008 10 Unicode Format for Network Interchange 11 draft-klensin-net-utf8-09.txt 13 Status of this Memo 15 By submitting this Internet-Draft, each author represents that any 16 applicable patent or other IPR claims of which he or she is aware 17 have been or will be disclosed, and any of which he or she becomes 18 aware will be disclosed, in accordance with Section 6 of BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt. 33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet-Draft will expire on August 13, 2008. 38 Copyright Notice 40 Copyright (C) The IETF Trust (2008). 42 Abstract 44 The Internet today is in need of a standardized form for the 45 transmission of internationalized "text" information, paralleling the 46 specifications for the use of ASCII that date from the early days of 47 the ARPANET. This document specifies that format, using UTF-8 with 48 normalization and specific line-ending sequences. 50 Table of Contents 52 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 53 1.1. Requirement for a Standardized Text Stream Format . . . . 3 54 1.2. 
Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 55 1.3. Mailing List . . . . . . . . . . . . . . . . . . . . . . . 4 56 2. Net-Unicode Definition . . . . . . . . . . . . . . . . . . . . 4 57 3. Normalization . . . . . . . . . . . . . . . . . . . . . . . . 6 58 4. Versions of Unicode . . . . . . . . . . . . . . . . . . . . . 6 59 5. Applicability and Stability of this Specification . . . . . . 8 60 5.1. Use in IETF Applications Specifications . . . . . . . . . 8 61 5.2. Unicode Versions and Applicability . . . . . . . . . . . . 8 62 6. Security Considerations . . . . . . . . . . . . . . . . . . . 10 63 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 64 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 11 65 Appendix A. History and Context . . . . . . . . . . . . . . . . . 11 66 Appendix B. The ASCII NVT Definition . . . . . . . . . . . . . . 13 67 Appendix C. The Line-Ending Problem . . . . . . . . . . . . . . . 14 68 Appendix D. A Note About Related Future Work . . . . . . . . . . 15 69 Appendix E. Change log . . . . . . . . . . . . . . . . . . . . . 15 70 E.1. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 15 71 E.2. Changes from -01 to -02 . . . . . . . . . . . . . . . . . 15 72 E.3. Changes from -02 to -03 . . . . . . . . . . . . . . . . . 16 73 E.4. Changes from -03 to -04 . . . . . . . . . . . . . . . . . 16 74 E.5. Changes from -04 to -05 . . . . . . . . . . . . . . . . . 16 75 E.6. Changes from -05 to -07 . . . . . . . . . . . . . . . . . 17 76 E.7. Changes in version -08 . . . . . . . . . . . . . . . . . . 17 77 E.8. Changes in version -09 . . . . . . . . . . . . . . . . . . 17 78 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 17 79 9.1. Normative References . . . . . . . . . . . . . . . . . . . 17 80 9.2. Informative References . . . . . . . . . . . . . . . . . . 18 81 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 
20 82 Intellectual Property and Copyright Statements . . . . . . . . . . 22 84 1. Introduction 86 1.1. Requirement for a Standardized Text Stream Format 88 Historically, Internet protocols have been largely ASCII-based and 89 references to "text" in protocols have assumed ASCII text and 90 specifically text in Network Virtual Terminal ("NVT") or "Network 91 ASCII" form (see Appendix A and Appendix B). Protocols and formats 92 that have moved beyond ASCII have included arrangements to 93 specifically identify the character set and often the language being 94 used. 96 In our more internationalized world, "text" clearly no longer equates 97 unambiguously to "network ASCII". Fortunately, however, we are 98 converging on Unicode [Unicode] [ISO10646] as a single international 99 interchange character coding and no longer need to deal with per- 100 script standards for character sets (e.g., one standard for each of 101 Arabic, Cyrillic, Devanagari, etc., or even standards keyed to 102 languages that are usually considered to share a script, such as 103 French, German, or Swedish). Unfortunately, though, while it is 104 certainly time to define a Unicode-based text type for use as a 105 common text interchange format, "use Unicode" involves even more 106 ambiguity than "use ASCII" did decades ago. 108 Unicode identifies each character by an integer, called its "code 109 point", in the range 0-0x10ffff. These integers can be encoded into 110 byte sequences for transmission in at least three standard and 111 generally-recognized encoding forms, all of which are completely 112 defined in The Unicode Standard and the documents cited below: 114 o UTF-8 [RFC3629] defines a variable-length encoding that may be 115 applied uniformly to all code points. 117 o UTF-16 [RFC2781] encodes the range of Unicode characters whose 118 code points are less than 65536 straightforwardly as 16-bit 119 integers, and provides a "surrogate" mechanism for encoding larger 120 code points in 32 bits. 
122 o UTF-32 (also known as UCS-4) simply encodes each code point as a 123 32-bit integer. 125 Older forms and nomenclature, such as the 16 bit UCS-2, are now 126 strongly discouraged. 128 As with ASCII, any of these forms may be used with different line- 129 ending conventions. That flexibility can be an additional source of 130 confusion with, e.g., index (offset) references into documents based 131 on character counts. 133 This document proposes to establish "Net-Unicode" as a new 134 standardized text transmission form for the Internet, to serve as an 135 internationalized alternative for NVT ASCII when specified in new -- 136 and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is 137 chosen for the coding because it has good compatibility properties 138 with ASCII and for other reasons discussed in the existing IETF 139 character set policy [RFC2277]. "Net-Unicode" is specified in 140 Section 2; the subsequent sections of the document provide background 141 and explanation. 143 In circumstances in which there is a choice, use of Unicode and the 144 text encoding specified here is preferred to the double-byte encoding 145 of "extended ASCII" [RFC0698] or the assorted per-language or per- 146 country character coding systems and SHOULD be used. 148 1.2. Terminology 150 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 151 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 152 document are to be interpreted as described in [RFC2119]. 154 1.3. Mailing List 156 [[RFC Editor: Please remove this subsection prior to publication.]] 158 Along with related work on general internationalization issues, this 159 document is being discussed on the discuss@apps.ietf.org mailing 160 list. 162 2. Net-Unicode Definition 164 The Network Unicode format (Net-Unicode) is defined as follows: 166 1. Characters MUST be encoded in UTF-8 as defined in [RFC3629]. 168 2. 
If the protocol has the concept of "lines", line-endings MUST be 169 indicated by the sequence Carriage-Return (CR, U+000D) followed 170 by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD 171 NOT appear except when followed by LF. The only other allowed 172 context in which CR is permitted is in the combination CR NUL, 173 which is not recommended (see the note at the end of this 174 section). 176 3. The control characters in the ASCII range (U+0000 to U+001F and 177 U+007F to U+009F) SHOULD generally be avoided. CR, LF, and Form 178 Feed (FF, U+000C) are exceptions to this principle. However, use 179 of all but the first requires care as discussed elsewhere in this 180 document. The so-called "C1 Controls" (U+0080 through U+009F), 181 which did not appear in ASCII, MUST NOT appear. 183 FF should be used only with caution: it does not have a standard 184 and universal interpretation and, in particular, if its use 185 assumes a page length, such assumptions may not be appropriate in 186 international contexts (e.g., considering 8.5x11 inch paper 187 versus A4). Other control characters are used to affect display 188 format, control devices, or to structure files. None of those 189 uses is appropriate for streams of plain text. 191 4. Before transmission, all character sequences SHOULD be normalized 192 according to Unicode normalization form "NFC" (see Section 3). 194 5. As suggested in Section 6 of RFC 3629, the Byte Order Mark 195 ("BOM") signature MUST NOT appear at the beginning of these text 196 strings. 198 6. Systems conforming to this specification MUST NOT transmit any 199 string containing any code point that is unassigned in the 200 version of Unicode on which they are dependent. The version of 201 NFC and the version of Unicode used by that system MUST be 202 consistent. 204 The use of LF without CR is questionable; see Appendix B for more 205 discussion. 
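The six rules of the Net-Unicode definition above lend themselves to a mechanical check. The following is a minimal sketch, not part of the specification: the function name net_unicode_problems is hypothetical, the check relies on Python's standard unicodedata module (whose tables reflect whichever Unicode version the interpreter ships with), and it reports only the MUST-level C1 prohibition from rule 3, not the softer advice about C0 controls or bare LF.

```python
import unicodedata


def net_unicode_problems(data: bytes) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    # Rule 1: the octets must be valid UTF-8. Python's strict decoder
    # rejects overlong forms, surrogates, and truncated sequences.
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"rule 1: not valid UTF-8 ({exc.reason})"]

    problems = []

    # Rule 5: the BOM signature must not appear at the start of the string.
    if text.startswith("\ufeff"):
        problems.append("rule 5: begins with a byte order mark")

    for i, ch in enumerate(text):
        cp = ord(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Rule 2: CR is allowed only before LF (or, discouraged, before NUL).
        if ch == "\r" and nxt not in ("\n", "\x00"):
            problems.append(f"rule 2: CR at index {i} not followed by LF or NUL")
        # Rule 3: the C1 controls (U+0080 through U+009F) must not appear.
        if 0x80 <= cp <= 0x9F:
            problems.append(f"rule 3: C1 control U+{cp:04X} at index {i}")
        # Rule 6: no code point that is unassigned (category Cn) in the
        # Unicode version known to this library.
        if unicodedata.category(ch) == "Cn":
            problems.append(f"rule 6: unassigned code point U+{cp:04X} at index {i}")

    # Rule 4: the text should already be in NFC.
    if unicodedata.normalize("NFC", text) != text:
        problems.append("rule 4: not in normalization form NFC")

    return problems
```

A receiver following the robustness advice in Section 3 would probably treat a rule 4 finding as a warning rather than a reason to reject the text.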
The newer control characters IND (U+0084) and NEL ("Next 206 Line", U+0085) might have been used to disambiguate the various line- 207 ending situations, but, because their use has not been established on 208 the Internet, because many protocols require CRLF, and because IND 209 and NEL fall within the "C1 Controls" group (see above), they MUST 210 NOT be used. Similar observations apply to the yet newer line and 211 paragraph separators at U+2028 and U+2029 and any future characters 212 that might be defined to serve these functions. For this 213 specification and protocols that depend on it, lines end in CRLF and 214 only in CRLF. Strings that do not end in CRLF are either not lines 215 or are not in conformance with this specification. 217 The NVT specification contained a number of additional provisions, 218 e.g., for the optional use of backspacing and "bare CR" (sent as CR 219 NUL) to generate overstruck character sequences. The much greater 220 number of precomposed characters in Unicode, the availability of 221 combining characters, and the growing use of markup conventions of 222 various types to show, e.g., emphasis (rather than attempting to do 223 that via the use of special characters), should make such sequences 224 largely unnecessary. These sequences SHOULD be avoided if at all 225 possible. However, because they were optional in NVT applications 226 and this specification is an NVT superset, they cannot be prohibited 227 entirely. The most important of these rules is that CR MUST NOT 228 appear unless it is immediately followed by LF (indicating end of 229 line) or NUL. Because NUL (an octet whose value is all zeros, i.e., 230 %x00 in the notation of [RFC5234]) is hostile to programming 231 languages that use that character as a string delimiter, the CR NUL 232 sequence SHOULD be avoided for that reason as well. 234 3. 
Normalization 236 There are cases where strings of Unicode are fundamentally 237 equivalent, essentially representing the same text. These are called 238 "canonical equivalents" in the Unicode Standard. For example, the 239 following pairs of strings are canonically equivalent: 241 U+2126 OHM SIGN 242 U+03A9 GREEK CAPITAL LETTER OMEGA 244 U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT 245 U+00E0 LATIN SMALL LETTER A WITH GRAVE 247 Comparison of strings becomes much easier if any such cases are 248 always represented by a single unique form. The Unicode Consortium 249 specifies a normalization form, known as NFC [NFC], which provides 250 the necessary mappings and mechanisms to convert all canonically 251 equivalent sequences to a single unique form. Typically, this form 252 produces precomposed characters for any sequences that can be 253 represented in that fashion. It also reorders other combining marks 254 so that they have a unique and unambiguous order. 256 Of the various normalization forms defined as part of Unicode, NFC is 257 closest to actual use in practice, minimizes side-effects due to 258 considering characters equivalent that may not be equivalent in all 259 situations, and typically requires the least work when converting 260 from non-Unicode encodings. 262 The section above requires that, except in very unusual 263 circumstances, all Net-Unicode strings be transmitted in normalized 264 form. Recognition of the fact that some applications implementations 265 may rely on operating system libraries over which they have little 266 control and adherence to the robustness principle suggests that 267 receivers of such strings should be prepared to receive unnormalized 268 ones and to not react to that in excessive ways. 270 4. Versions of Unicode 272 Unicode changes and expands over time. Large blocks of space are 273 reserved for future expansion. New versions, which appear at regular 274 intervals, add new scripts and characters. 
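The canonical equivalences listed in Section 3 can be observed with any library that implements NFC. The sketch below assumes Python's standard unicodedata module; the helper names nfc and canonically_equal are hypothetical and chosen only for illustration.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC form of s according to the library's Unicode tables."""
    return unicodedata.normalize("NFC", s)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return nfc(a) == nfc(b)


# The two pairs given in Section 3.
ohm_sign, omega = "\u2126", "\u03a9"
a_plus_grave, a_with_grave = "a\u0300", "\u00e0"

# Code point by code point the strings differ ...
print(ohm_sign == omega, a_plus_grave == a_with_grave)
# ... but they compare equal once both sides are in NFC.
print(canonically_equal(ohm_sign, omega), canonically_equal(a_plus_grave, a_with_grave))

# NFC also puts combining marks into a unique order, so the order in which
# an acute accent (U+0301) and a dot below (U+0323) were typed does not matter.
print(canonically_equal("a\u0301\u0323", "a\u0323\u0301"))
```

Comparing only after normalizing both sides is the usual way to keep string matching independent of how the sender happened to compose its characters.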
Occasionally they also 275 change some property definitions. In retrospect, one of the 276 advantages of ASCII [X3.4-1968] when it was chosen was that the code 277 space was full when the Standard was first published. There was no 278 practical way to add characters or change code point assignments 279 without being obviously incompatible. 281 While there are some security issues if people deliberately try to 282 trick the system (see Section 6), Unicode version changes should not 283 have a significant impact on the text stream specification of this 284 document for the following reasons: 286 o The transformation between Unicode code table positions and the 287 corresponding UTF-8 code is algorithmic; it does not depend on 288 whether a code point has been assigned or not. 290 o The normalization recommended here, NFC (see Section 3), performs 291 a very limited set of mappings, much more limited than those of 292 the more extensive NFKC used in, e.g., Nameprep [RFC3491]. 294 The NFC tables may be updated over time as new characters are added, 295 but the Unicode Consortium has guaranteed the stability of all NFC 296 strings. That is, if a string does not contain any unassigned 297 characters, and it is normalized according to NFC, it will always be 298 normalized according to all future versions of the Unicode Standard. 299 The stability of the Net-Unicode format is thus guaranteed when any 300 implementation that converts text into Net-Unicode format does not 301 permit unassigned characters. 303 Because Unicode code points that are reserved for private use do not 304 have standard definitions or normalization interpretations, they 305 SHOULD be avoided in strings intended for Internet interchange. 307 Were Unicode to be changed in a way that violated these assumptions, 308 i.e., that either invalidated the byte string order specified in RFC 309 3629 or that changed the stability of NFC as stated above, this 310 specification would not apply. 
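Two of the points above can be made concrete: the UTF-8 transformation depends only on the numeric value of a code point, and the NFC stability guarantee holds only for strings without unassigned code points. The following is a rough sketch under stated assumptions: utf8_encode, has_unassigned, and is_stable_nfc are hypothetical names, the bit layout follows RFC 3629, and "unassigned" means general category Cn in the Unicode version bundled with Python's unicodedata module.

```python
import unicodedata


def utf8_encode(cp: int) -> bytes:
    """Encode one code point using only arithmetic on its value (RFC 3629 layout)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def has_unassigned(text: str) -> bool:
    """True if any code point is unassigned in the library's Unicode version."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)


def is_stable_nfc(text: str) -> bool:
    """True if text is in NFC and contains no unassigned code points."""
    return not has_unassigned(text) and unicodedata.normalize("NFC", text) == text


# The version of the Unicode tables this interpreter was built with.
print(unicodedata.unidata_version)
```

Note that utf8_encode never consults a character table, so it gives the same octets for an unassigned code point today as it would after that code point is assigned, whereas has_unassigned can change its answer when the library is upgraded.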
Put differently, this specification 311 applies only to versions of Unicode starting with version 5.0 and 312 extending to, but not including, any version for which changes are 313 made in either the UTF-8 definition or to NFC stability. Such 314 changes would violate established Unicode policies and are hence 315 unlikely, but, should they occur, it would be necessary to evaluate 316 them for compatibility with this specification and other Internet 317 uses of NFC. 319 If the specification of a protocol references this one, strings that 320 are received by that protocol and that appear to be UTF-8 and are not 321 otherwise identified (e.g., by charset labeling) SHOULD be treated as 322 using UTF-8 in conformance with this specification. 324 5. Applicability and Stability of this Specification 326 5.1. Use in IETF Applications Specifications 328 During the development of this specification, there was some 329 confusion about where it would be useful given that, e.g., the 330 individual MIME media types used in email and with HTTP have their 331 own rules about UTF-8 character types and normalization and the 332 application transport protocols impose their own conventions about 333 line endings. There are three answers. The first is that, in 334 retrospect, it would have been better to have those protocols and 335 content types standardized in the way specified here, even though it 336 is certainly too late to change them at this time. The second is 337 that we have several protocols that are dependent on either the 338 original Telnet design or other arrangements requiring a standard, 339 interoperable, string definition without specific content-labels of 340 one sort or another. Whois [RFC3912] is an example member of this 341 group. As consideration is given to upgrading them for non-ASCII 342 use, this specification provides a normative reference that provides 343 the same stability that NVT has provided the ASCII forms. 
This 344 specification is intended for use by other specifications that have 345 not yet defined how to use Unicode. Having a preferred standard 346 Internet definition for Unicode text streams -- rather than just one 347 for transmission codings -- may help improve the specification and 348 interoperability of protocols to be developed in the future. This 349 specification is not intended for use with specifications that 350 already allow the use of UTF-8 and precisely define that use. 352 5.2. Unicode Versions and Applicability 354 The IETF faces a practical dilemma with regard to versions of 355 Unicode. Each new version brings with it new characters and 356 sometimes new combining characters. Version 5.0 introduces the new 357 concept of sequences of characters named as if they were individual 358 characters (see [NamedSequences]). The normalization represented by 359 NFC is stable if all strings are transmitted and stored in normalized 360 form, if corrections are never made to character definitions or 361 normalization tables, and if unassigned code points are never used. 362 The last condition is important because an unassigned code point always 363 normalizes to itself. However, if the same code point is assigned to 364 a character in a future version, it may participate in some other 365 normalization mapping (some specific difficulties in this regard are 366 discussed in [RFC4690]). It is worth noting that transmission in 367 normalized form is not required either by the IETF's UTF-8 Standard 368 [RFC3629] or by standards dependent on the current version of 369 Stringprep [RFC3454]. 371 All would be well with this as described in Section 4 except for one 372 problem: Applications typically do not perform their own conversions 373 to Unicode and may not perform their own normalizations but instead 374 rely on operating system or language library functions -- functions 375 that may be upgraded or otherwise changed without changes to the 376 application code itself. 
Consequently, there may be no plausible way 377 for an application to know which version of Unicode, or which version 378 of the normalization procedures, it is utilizing, nor is there any 379 way by which it can guarantee that the two will be consistent. 381 Because of per-version changes in definitions and tables, Stringprep 382 and documents depending on it are now tied to Unicode Version 3.2 383 [Unicode32] and full interoperability of Internet Standard UTF-8 384 [RFC3629], when used with normalization as specified here, is 385 dependent on normalization definitions and the definition of UTF-8 386 itself not changing after Unicode Version 5.0. These assumptions 387 seem fairly safe, but they are still assumptions. Rather than being 388 linked to the latest available version of Unicode, version 5.0 389 [Unicode] or broader concepts of version independence based on 390 specific assumptions and conditions, this specification could 391 reasonably have been tied, like Stringprep and Nameprep to Unicode 392 3.2 [Unicode32] or some more recent intermediate version, but, in 393 addition to the obvious disadvantages of having different IETF 394 standards tied to different versions of Unicode, the library-based 395 application implementation behavior described above makes these 396 version linkages nearly meaningless in practice. 398 In theory, one can get around this problem in four ways: 400 1. Freeze on a particular version of Unicode and try to insist that 401 applications enforce that version by, e.g., containing lists of 402 unassigned characters and prohibiting their use. Of course, this 403 would prohibit evolution to include newly-added scripts and the 404 tables of unassigned code points would be cumbersome. 406 2. Require that every Unicode "text" string or file start with a 407 version indication, somewhat akin to the "byte order mark" 408 indicator. It is unlikely that this provision would be 409 practical. 
More important, it would require that each 410 application implementation be prepared to either support multiple 411 normalization tables and versions or that it reject text from 412 Unicode Versions with which it was not prepared to deal. 414 3. Devise a different set of normalization rules that would, e.g., 415 guarantee that no character assigned to a previously-unassigned 416 code point in Unicode was ever normalized to anything but itself 417 and use those rules instead of NFC. It is not clear whether or 418 not such a set of rules is possible or whether some other 419 completely stable set of rules could be devised, perhaps in 420 combination with restrictions on the ways in which characters 421 were added in future versions of Unicode. 423 4. Devise a normalization process that is otherwise equivalent to 424 NFC but that rejects code points that are unassigned in the 425 current version of Unicode, rather than mapping those code points 426 to themselves. This would still leave some risk of incompatible 427 corrections in Unicode and possibly a few edge cases, but it is 428 probably stable enough for Internet use in the overwhelming 429 number of cases. This process has been discussed in the Unicode 430 Consortium under the name "Stable NFC". 432 None of these approaches seems ideal: the ideal procedure would be as 433 stable and predictable as ASCII has been. But that level is simply 434 not feasible as long as Unicode continues to evolve by the addition 435 of new code points and scripts. The fourth option listed above 436 appears to be a reasonable compromise. 438 6. Security Considerations 440 This specification provides a standard form for the use of Unicode as 441 "network text". Most of the same security issues that apply to 442 UTF-8, as discussed in [RFC3629], apply to it, although it should be 443 slightly less subject to some risks by virtue of requiring NFC 444 normalization and generally being somewhat more restrictive. 
445 However, shifts in Unicode versions, as discussed in Section 5.2, may 446 introduce other security issues. 448 Programs that receive these streams should use extreme caution about 449 assuming that incoming data are normalized, since it might be 450 possible to use unnormalized forms, as well as invalid UTF-8, as part 451 of an attack. In particular, firewalls and other systems that 452 interpret UTF-8 streams should be developed with the clear knowledge 453 that an attacker may deliberately send unnormalized text, for 454 instance to avoid detection by naive text-matching systems. 456 NVT contains a requirement, of necessity repeated here (see 457 Section 2), that the CR character be immediately followed by either 458 LF or ASCII NUL (an octet with all bits zero). NUL may be 459 problematic for some programming languages that use it as a string 460 terminator, and hence a trap for the unwary, unless caution is used. 461 This may be an additional reason to avoid the use of CR entirely, 462 except in sequence with LF, as suggested above. 464 The discussion about Unicode versions above (see Section 4 and 465 Section 5.2) makes several assumptions about future versions of 466 Unicode, about NFC normalization being applied properly, and about 467 UTF-8 being processed and transmitted exactly as specified in RFC 468 3629. If any of those assumptions are not correct, then there are 469 cases in which strings that would be considered equivalent do not 470 compare equal. Robust code should be prepared for those 471 possibilities. 473 7. IANA Considerations 475 [[RFC Editor: Please remove this useless subsection prior to 476 publication.]] 478 This specification requires no actions of any type from the IANA. 480 8. Acknowledgments 482 Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for 483 suggestions about Unicode normalization that led to the format 484 described here and especially to Mark for providing the paragraphs 485 that describe the role of NFC. 
Thanks also to Mark, Doug Ewell, and 486 Asmus Freytag for corrected text describing Unicode transmission 487 forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin 488 Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern 489 Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George 490 Michaelson, Chris Newman, and Marcos Sanz for a number of helpful 491 comments and clarification requests. 493 Appendix A. History and Context 495 This section contains a review of prior work in the ARPANET and 496 Internet to establish a standard text type, work that establishes the 497 context and motivation for the approach taken in this document. The 498 text is explanatory rather than normative: nothing in this section is 499 intended to change or update any current specification. Those who 500 are uninterested in this review and analysis can safely skip this 501 section. 503 One of the earlier application design decisions made in the 504 development of ARPANET, a decision that was carried forward into the 505 Internet, was the decision to standardize on a single and very 506 specific coding for "text" to be passed across the network [RFC0020]. 507 Hosts on the network were then responsible for translating or mapping 508 from whatever character coding conventions were used locally to that 509 common intermediate representation, with sending hosts mapping to it 510 and receiving ones mapping from it to their local forms as needed. 511 It is interesting to note that at the time the ARPANET was being 512 developed, participating host operating systems used at least three 513 different character coding standards: the antiquated BCD (Binary 514 Coded Decimal), the then-dominant major manufacturer-backed EBCDIC 515 (Extended BCD Interchange Code), and the then-still emerging ASCII 516 (American Standard Code for Information Interchange). 
Since the 517 ARPANET was an "open" project and EBCDIC was intimately linked to a 518 particular hardware vendor, the original Network Working Group agreed 519 that its standard should be ASCII. That ASCII form was precisely 520 "7-bit ASCII in an 8-bit field", which was in effect a compromise 521 between hosts that were natively 7-bit oriented (e.g., with five 522 seven-bit characters in a 36 bit word), those that were 8-bit 523 oriented (using eight-bit characters), and those that placed the 524 seven-bit ASCII characters in 9-bit fields with two leading zero bits 525 (four characters in a 36 bit word). 527 More standardization was suggested in the first preliminary 528 description of the Telnet protocol [RFC0097]. With the iterations of 529 that protocol [RFC0137] [RFC0139] and the drawing together of an 530 essentially formal definition somewhat later [RFC0318], a standard 531 abstraction, the Network Virtual Terminal (NVT), was established. NVT 532 character-coding conventions (initially called "Telnet ASCII" and 533 later called "NVT ASCII", or, more casually, "network ASCII") 534 included the requirement that Carriage Return followed by Line Feed 535 (CRLF) be the common representation for ending lines of text (given 536 that some participating "Host" operating systems used the one 537 natively, some the other, at least one used both, and a few used 538 neither, preferring variable-length lines with counts or special 539 delimiters or markers instead) and specified conventions for some 540 other characters. Also, since NVT ASCII was restricted to seven-bit 541 characters, use of the high-order bit in octets was reserved for the 542 transmission of control signaling information. 544 At a very high level, the concept was that a system could use 545 whatever character coding and line representations were appropriate 546 locally, but text transmitted over the network as text must conform 547 to the single "network virtual terminal" convention. 
Virtually all 548 early Internet protocols that presume transfer of "text" assume this 549 virtual terminal model, although different ones assume or limit it in 550 different ways. Telnet, the command stream and ASCII Type in FTP 551 [RFC0542], the message stream in SMTP transfer [RFC2821], and the 552 strings passed to finger [RFC0742] and whois [RFC0954] are the 553 classic examples. More recently, HTTP [RFC1945] [RFC2616] follows 554 the same general model but permits 8 bit data and leaves the line end 555 sequence unspecified (the latter has been the source of a significant 556 number of problems). 558 Appendix B. The ASCII NVT Definition 560 The main body of this specification is intended as an update to, and 561 internationalized version of, the Net-ASCII definition. The 562 specification is self-contained in that parts of the Net-ASCII 563 definition that are no longer recommended are not included above. 564 Because Net-ASCII evolved somewhat over time and there has been 565 debate about which specification is the "official" Net-ASCII, it is 566 appropriate to review the key elements of that definition here. This 567 review is informal with regard to the contents of Net-ASCII and 568 should not be considered as a normative update or summary of the 569 earlier specifications (Section 2 does specify some normative updates 570 to those specifications and some comments below are consistent with 571 it). 573 The first part of the section titled "THE NVT PRINTER AND KEYBOARD" 574 in RFC 854 [RFC0854] is generally, although not universally, 575 considered to be the normative definition of the (ASCII) Network 576 Virtual Terminal and hence of Net-ASCII. It includes not only the 577 graphic ASCII characters but a number of control characters. The 578 latter are given Internet-specific meanings that are often more 579 specific than the definitions in the ASCII specification. 
In today's 580 usage, and for the present specification, the following 581 clarifications and updates to that list should be noted. Each one is 582 accompanied by a brief explanation of the reason why the original 583 specification is no longer appropriate. 585 1. The "defined but not required" codes -- BEL (U+0007), BS 586 (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the 587 undefined control codes ("C0") SHOULD NOT be used unless required 588 by exceptional circumstances. Either their original "network 589 printer" definitions are no longer in general use, common 590 practice has evolved away from the formats specified there, or 591 their use to simulate characters that are better handled by 592 Unicode is no longer appropriate. While the appearance of some 593 of these characters on the list may seem surprising, BS now has 594 an ambiguous interpretation in practice (erasing in some systems 595 but not in others), the width associated with HT varies with the 596 environment, and VT and FF do not have a uniform effect with 597 regard to either vertical positioning or the associated 598 horizontal position result. Of course, telnet escapes are not 599 considered part of the data stream and hence are unaffected by 600 this provision. 602 2. In Net-ASCII, CR MUST NOT appear except when immediately followed 603 by either NUL or LF, with the latter (CR LF) designating the "new 604 line" function. Today and as specified above, CR should 605 generally appear only when followed by LF. Because page layout 606 is better done in other ways, because NUL has a special 607 interpretation in some programming languages, and to avoid other 608 types of confusion, CR NUL should preferably be avoided as 609 specified above. 611 3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF 612 sequences (e.g., CR LF CR LF). 614 4. The historical NVT documents do not call out either "bare LF" (LF 615 without CR) or HT for special treatment. 
Both have generally 616 been understood to be problematic. In the case of LF, there is a 617 difference in interpretation as to whether its semantics imply 618 "go to the same position on the next line" or "go to the first 619 position on the next line" and interoperability considerations 620 suggest not depending on which interpretation the receiver 621 applies. At the same time, misinterpretation of LF is less 622 harmful than misinterpretation of "bare" CR: in the CR case, text 623 may be erased or made completely unreadable; in the LF one, the 624 worst consequence is a very funny-looking display. Obviously, HT 625 is problematic because there is no standard way to transmit 626 intended tab position or width information in running text. 627 Again, the harm is unlikely to be great if HT is simply 628 interpreted as one or more spaces, but, in general, it cannot be 629 relied upon to format information. 631 It is worth noting that the telnet IAC character (an octet consisting 632 of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that 633 particular octet cannot appear in a valid UTF-8 string. However, 634 while few of them have been used, telnet permits other command- 635 introducer characters whose bit sequences in an octet may be part of 636 valid UTF-8 characters. While it causes no ambiguity in UTF-8, 637 Unicode assigns a graphic character ("Latin Small Letter Y with 638 Diaeresis") to U+00FF (octets C3 BF in UTF-8). Some caution is 639 clearly in order in this area. 641 Appendix C. The Line-Ending Problem 643 The definition of how a line ending should be denoted in plain text 644 strings on the wire for the Internet has been controversial since even 645 before the introduction of NVT. Some have argued that recipients 646 should be required to interpret almost anything that a sender might 647 intend as a line ending as actually a line ending.
Others have 648 pointed out that this would lead to some ambiguities of 649 interpretation and presentation and would violate the principle that 650 we should minimize the number of forms that are permitted on the wire 651 in order to promote interoperability and eliminate the "every 652 recipient needs to understand every sender format" problem. The 653 design of this specification, like that of NVT, takes the latter 654 approach. Its designers believe that there is little point in a 655 standard if it is to specify "anyone can do whatever they like and 656 the receiver just needs to cope". 658 A further discussion of the nature and evolution of the line-ending 659 problem appears in Section 5.8 of the Unicode Standard [Unicode] and 660 is suggested for additional reading. If we were starting with the 661 Internet today, it would probably be sensible to follow the 662 recommendation there and use LS (U+2028) exclusively, in preference 663 to CRLF. However, the installed base of use of CRLF and the 664 importance of forward compatibility with NVT and protocols that 665 assume it make that impossible, so it is necessary to continue using 666 CRLF as the "New Line Function" ("NLF", see the terminology section 667 in that reference) discussed there. 669 Appendix D. A Note About Related Future Work 671 Once this proposal is approved, consideration should be given to a 672 Telnet (or SSH [RFC4251]) option to specify this type of stream and 673 an FTP extension [RFC0959] to permit a new "Unicode text" data TYPE. 675 Appendix E. Change log 677 [[ RFC Editor: Please remove this section before publication. ]] 679 E.1. Changes from -00 to -01 681 o Replaced the section on Normalization with text provided by Mark 682 Davis. 684 o Several small editorial changes and corrections. 686 E.2. Changes from -01 to -02 688 o Added material explaining the relationship to Net-ASCII and the 689 NVT.
691 o Brought the material on transmission forms into line with current 692 practice and terminology. 694 o Made terminology more consistent. 696 o Inserted normalization text provided by Mark Davis. 698 o Rewrote and reorganized Unicode versioning material. 700 o Clarified relationships to existing protocols, stressing that this 701 is not, in itself, a proposal to change any of them. 703 E.3. Changes from -02 to -03 705 o Clarification of several relationships and updating to reflect 706 mailing list comments and other work. 708 o Inserted a discussion and pair of placeholders about prohibited 709 NVT characters. 711 o Several corrections of typographic and editorial errors and 712 additions of relevant references. 714 E.4. Changes from -03 to -04 716 o Reduced requirement for NFC on transmission to a SHOULD, per on- 717 list discussion and the realization that receivers cannot safely 718 assume that normalization was applied. 720 o Rewrote the discussion of Net-ASCII to separate changes for Net- 721 Unicode from the original model, rewrote the description of the 722 latter, and moved most background/historical material to 723 appendices, as suggested by Chris Newman and others. 725 o Several small editorial improvements, including those suggested in 726 a March note from Chris Newman. 728 o Removed remaining editorial/work-in-progress notes. 730 E.5. Changes from -04 to -05 732 o Additions to Security Considerations and elsewhere for 733 unnormalized text. 735 o Discussion of FormFeed (FF) rewritten and rationale provided. 737 o Added preliminary "updates" and "obsoletes" indications. 739 o Significant rewrites, and some text moved to a new appendix, 740 responding to comments from Martin Duerst. 742 o Several small editorial/typographical corrections. 744 E.6.
Changes from -05 to -07 746 Versions -06 and -07 included a number of editorial improvements, plus 747 additional discussion of characters that were included or excluded, 748 especially characters that end lines or set the position of the next 749 character to be displayed. 751 Version -07 became the version announced for IETF Last Call. 753 E.7. Changes in version -08 755 These changes were made subsequent to IETF Last Call and in response 756 to Last Call comments. 758 o Added a useless IANA Considerations section so that the RFC Editor 759 can remove it (at the same time this log is removed). 761 o Clarified that, if the relevant protocol doesn't have a concept of 762 "line", the material in Section 2, bullet 2 is irrelevant. 764 o Rearranged some text, modified section titles, and elaborated on 765 the comment at the end of Appendix C to improve clarity. 767 o Added an additional discussion of Line Ending issues as an 768 appendix (Appendix C with renumbering). 770 o Several editorial corrections and clarifications, including 771 reference updates. 773 E.8. Changes in version -09 775 Some additional editorial changes that weren't picked up for -08. 777 9. References 779 9.1. Normative References 781 [ISO10646] 782 International Organization for Standardization, 783 "Information Technology - Universal Multiple-Octet Coded 784 Character Set (UCS)", ISO/IEC 10646:2003 (with 785 amendments), 2003. 787 [NFC] Davis, M. and M. Duerst, "Unicode Standard Annex #15: 788 Unicode Normalization Forms", October 2006, 789 . 791 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 792 Requirement Levels", BCP 14, RFC 2119, March 1997. 794 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 795 10646", STD 63, RFC 3629, November 2003. 797 [RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 798 Specifications: ABNF", STD 68, RFC 5234, January 2008. 800 [Unicode] The Unicode Consortium, "The Unicode Standard, Version 801 5.0", 2007.
803 Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0 805 [Unicode32] 806 The Unicode Consortium, "The Unicode Standard, Version 807 3.0", 2000. 809 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5). 810 Version 3.2 consists of the definition in that book as 811 amended by the Unicode Standard Annex #27: Unicode 3.1 812 (http://www.unicode.org/reports/tr27/) and by the Unicode 813 Standard Annex #28: Unicode 3.2 814 (http://www.unicode.org/reports/tr28/). 816 9.2. Informative References 818 [ISO.646.1991] 819 International Organization for Standardization, 820 "Information technology - ISO 7-bit coded character set 821 for information interchange", ISO Standard 646, 1991. 823 [ISO.8859.2003] 824 International Organization for Standardization, 825 "Information processing - 8-bit single-byte coded graphic 826 character sets - Part 1: Latin alphabet No. 1 (1998) - 827 Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin 828 alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4 829 (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6: 830 Latin/Arabic alphabet (1999) - Part 7: Latin/Greek 831 alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) - 832 Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin 833 alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet 834 (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14: 835 Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin 836 alphabet No. 9 (1999) - Part 16: Latin alphabet 837 No. 10 (2001)", ISO Standard 8859, 2003. 839 [NamedSequences] 840 The Unicode Consortium, "NamedSequences-4.1.0.txt", 2005, 841 . 844 [RFC0020] Cerf, V., "ASCII format for network interchange", RFC 20, 845 October 1969. 847 [RFC0097] Melvin, J. and R. Watson, "First Cut at a Proposed Telnet 848 Protocol", RFC 97, February 1971. 850 [RFC0137] O'Sullivan, T., "Telnet Protocol - a proposed document", 851 RFC 137, April 1971. 853 [RFC0139] O'Sullivan, T., "Discussion of Telnet Protocol", RFC 139, 854 May 1971.
856 [RFC0318] Postel, J., "Telnet Protocols", RFC 318, April 1972. 858 [RFC0542] Neigus, N., "File Transfer Protocol", RFC 542, 859 August 1973. 861 [RFC0698] Mock, T., "Telnet extended ASCII option", RFC 698, 862 July 1975. 864 [RFC0742] Harrenstien, K., "NAME/FINGER Protocol", RFC 742, 865 December 1977. 867 [RFC0854] Postel, J. and J. Reynolds, "Telnet Protocol 868 Specification", STD 8, RFC 854, May 1983. 870 [RFC0954] Harrenstien, K., Stahl, M., and E. Feinler, "NICNAME/ 871 WHOIS", RFC 954, October 1985. 873 [RFC0959] Postel, J. and J. Reynolds, "File Transfer Protocol", 874 STD 9, RFC 959, October 1985. 876 [RFC1945] Berners-Lee, T., Fielding, R., and H. Nielsen, "Hypertext 877 Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996. 879 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 880 Languages", BCP 18, RFC 2277, January 1998. 882 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 883 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 884 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 886 [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 887 10646", RFC 2781, February 2000. 889 [RFC2821] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821, 890 April 2001. 892 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 893 Internationalized Strings ("stringprep")", RFC 3454, 894 December 2002. 896 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 897 Profile for Internationalized Domain Names (IDN)", 898 RFC 3491, March 2003. 900 [RFC3912] Daigle, L., "WHOIS Protocol Specification", RFC 3912, 901 September 2004. 903 [RFC4251] Ylonen, T. and C. Lonvick, "The Secure Shell (SSH) 904 Protocol Architecture", RFC 4251, January 2006. 906 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and 907 Recommendations for Internationalized Domain Names 908 (IDNs)", RFC 4690, September 2006. 
910 [X3.4-1968] 911 American National Standards Institute (formerly United 912 States of America Standards Institute), "USA Code for 913 Information Interchange", ANSI X3.4-1968, 1968. 915 ANSI X3.4-1968 has been replaced by newer versions with 916 slight modifications, but the 1968 version remains 917 definitive for the Internet. 919 Authors' Addresses 921 John C Klensin 922 1770 Massachusetts Ave, #322 923 Cambridge, MA 02140 924 USA 926 Phone: +1 617 491 5735 927 Email: john-ietf@jck.com 928 Michael A. Padlipsky 929 8011 Stewart Ave. 930 Los Angeles, CA 90045 931 USA 933 Phone: +1 310-670-4288 934 Email: the.map@alum.mit.edu 936 Full Copyright Statement 938 Copyright (C) The IETF Trust (2008). 940 This document is subject to the rights, licenses and restrictions 941 contained in BCP 78, and except as set forth therein, the authors 942 retain all their rights. 944 This document and the information contained herein are provided on an 945 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 946 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 947 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 948 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 949 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 950 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 952 Intellectual Property 954 The IETF takes no position regarding the validity or scope of any 955 Intellectual Property Rights or other rights that might be claimed to 956 pertain to the implementation or use of the technology described in 957 this document or the extent to which any license under such rights 958 might or might not be available; nor does it represent that it has 959 made any independent effort to identify any such rights. Information 960 on the procedures with respect to rights in RFC documents can be 961 found in BCP 78 and BCP 79. 
963 Copies of IPR disclosures made to the IETF Secretariat and any 964 assurances of licenses to be made available, or the result of an 965 attempt made to obtain a general license or permission for the use of 966 such proprietary rights by implementers or users of this 967 specification can be obtained from the IETF on-line IPR repository at 968 http://www.ietf.org/ipr. 970 The IETF invites any interested party to bring to its attention any 971 copyrights, patents or patent applications, or other proprietary 972 rights that may cover technology that may be required to implement 973 this standard. Please address the information to the IETF at 974 ietf-ipr@ietf.org. 976 Acknowledgment 978 Funding for the RFC Editor function is provided by the IETF 979 Administrative Support Activity (IASA).