DISPATCH Working Group                                        C. Bormann
Internet-Draft                                   Universitaet Bremen TZI
Intended status: Standards Track                           July 08, 2019
Expires: January 9, 2020

                         Modern Network Unicode
            draft-bormann-dispatch-modern-network-unicode-02

Abstract

   RFC 5198 both defines common conventions for the use of Unicode in
   network protocols and caters for the specific requirements of the
   legacy protocol Telnet. In applications that do not need Telnet
   compatibility, some of the decisions of RFC 5198 are cumbersome.

   The present specification defines "Modern Network Unicode" (MNU),
   which is a form of RFC 5198 Network Unicode that can be used in
   specifications that require the exchange of plain text over networks
   and where just mandating UTF-8 (RFC 3629) may not be sufficient, but
   there is also no desire to import all of the baggage of RFC 5198.

   In addition to a basic "Clean Modern Network Unicode" (CMNU), this
   specification defines a number of variances that can be used to
   tailor MNU to specific areas of application. In particular, "Modern
   Network Unicode with lines" can be used in applications that require
   line-structured text such as plain text documents or markdown format.

Status

   The present version of this document represents the author's reaction
   to initial exposure on the art@ietf.org mailing list. Some more
   editorial cleanup is probably desirable, but could not be achieved in
   time for the IETF105 Internet-Draft deadline.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 9, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Clean Modern Network Unicode
   3.  Variances
     3.1.  With lines
     3.2.  With CR-tolerant lines
     3.3.  With HT Characters
     3.4.  With CCC Characters
     3.5.  With NFKC
     3.6.  With Unicode Version NNN
   4.  Discussion
     4.1.  Relationship to RFC 5198
     4.2.  Going beyond RFC 5198
   5.  Using ABNF with Unicode
   6.  IANA considerations
   7.  Security considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgements
   Author's Address

1.  Introduction

   (Insert embellished copy of abstract here.)

   Complex specifications that use Unicode often come with detailed
   information on their Unicode usage; this level of detail generally is
   necessary to support some legacy applications. New, simple protocol
   specifications generally do not have such a legacy or need such
   details, but can instead simply use common practice, informed by
   decades of using Unicode. The present specification attempts to
   serve as a convenient reference for such protocol specifications,
   reducing their need for discussing Unicode to just pointing to the
   present specification and making a few simple choices.

   There is no intention that henceforth all new protocols "must" use
   the present specification. It is offered as a standards-track
   specification simply so it can be normatively referenced from other
   standards-track specifications.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Characters in this specification are named with their Unicode name
   notated in the usual form U+NNNN or with their ASCII names (such as
   CR, LF, HT, RS, NUL) [RFC0020].

2.  Clean Modern Network Unicode

   Clean Modern Network Unicode (CMNU) is the form of Modern Network
   Unicode that does not make use of any of the variances defined below.
   It requires conformance to [RFC3629], as well as to the following
   four mandates:

   o  Control characters (U+0000 to U+001F and U+007F to U+009F) MUST
      NOT be used. (Note that this also excludes line endings, so a
      CMNU text string cannot extend beyond a single line. See
      Section 3.1 below if line structure is needed.)

   o  The characters U+2028 and U+2029 MUST NOT be used. (In case
      future Unicode versions add to the Unicode character categories Zl
      or Zp, any characters in these categories MUST NOT be used.)

   o  Modern Network Unicode requires that, except in very unusual
      circumstances, all text is transmitted in normalization form NFC.

   o  As per the Unicode specification, the code points U+FFFE and
      U+FFFF MUST NOT be used. Also, Byte Order Marks (leading U+FEFF
      characters) MUST NOT be used.

3.  Variances

   In addition to CMNU, this specification describes a number of
   variances that can be used in the form "Modern Network Unicode with
   VVV", or "Modern Network Unicode with VVV, WWW, and ZZZ" for multiple
   variances used. Specifications that cannot directly use CMNU may be
   able to use MNU with one or more of these variances added.

3.1.  With lines

   While Clean Modern Network Unicode rules out line endings completely,
   line-structured text is often required. The variance "with lines"
   allows the use of line endings, represented by a single LF character
   (which is then the only control character allowed).

3.2.  With CR-tolerant lines

   The variance "with CR-tolerant lines" allows the sequence CR LF as
   well as a single LF character as a line ending. This may enable
   existing texts to be used as MNU without processing at the sender
   side (substituting that by processing at the receiver side). Note
   that, with this variance, a CR character cannot be used anywhere else
   but immediately preceding an LF character.

3.3.  With HT Characters

   In some cases, the use of HT characters ("TABs") cannot be completely
   excluded. The variance "with HT characters" allows their use,
   without attempting to define their meaning (e.g., equivalence with
   spaces, column definitions, etc.).

3.4.  With CCC Characters

   Some applications of MNU may need to add specific control characters,
   such as RS [RFC7464] or FF characters. This variance is spelled with
   the ASCII name of the control character for CCC, e.g., "with RS
   characters".

3.5.  With NFKC

   Some applications require a stronger form of normalization than NFC.
   The variance "with NFKC" swaps out NFC and uses NFKC instead. This
   is probably best used in conjunction with "with Unicode version NNN".

3.6.  With Unicode Version NNN

   Some applications need to be sure that a certain Unicode version is
   used. The variance "with Unicode version NNN" (where NNN is a
   Unicode version number) defines the Unicode version in use as NNN.
   Also, it requires that only characters assigned in that Unicode
   version are being used.

4.  Discussion

   At the time of writing, RFCs are formatted in "Modern Network Unicode
   with CR-tolerant lines and FF characters".

4.1.  Relationship to RFC 5198

   The third and fourth requirements listed above are also posed by
   [RFC5198], while the first two remove further legacy compatibility
   considerations.

   [RFC5198] contains some discussion and background material that the
   present document does not attempt to repeat; the interested reader
   may therefore want to consult it as an informative reference. See
   also Section 4.2 below.

   Mandates of [RFC5198] that are specific to a version of Unicode are
   not picked up in this specification, e.g., there is no check for
   unassigned code points. Note that this means that a CMNU
   implementation may not be able to handle the normalization of a
   character not yet assigned in the version of Unicode that it uses.
   (See also Section 3.6 above.)

4.2.  Going beyond RFC 5198

   The handling of line endings (not being part of CMNU, providing LF-
   only and LF/CRLF line endings as variances) may be controversial.
   In particular, calling out CR-tolerance as an extra (and often
   undesirable) feature may seem novel to some readers. The handling as
   specified here is much closer to the way line endings are handled on
   the software side than the cumbersome rules of [RFC5198]. More
   generally speaking, one could say that the present specification is
   intended to be used by state-of-the-art protocols going forward,
   maybe less so by existing protocols that have legacy baggage.

   Even in the "with CR-tolerant lines" variance, the CR character is
   only allowed as an embellishment of an immediately following LF
   character. This reflects the fact that overprinting has only seen
   niche usage for quite a number of decades now.

   Unicode Line and Paragraph separators probably seemed like a good
   idea at the time, but have not taken hold. Today, their occurrence
   is more likely to trigger a bug or even serve as an attack.

   HT characters ("TABs") were needed on ASR33 terminals to speed up
   whitespace processing at 110 bit/s line speed. Unless some legacy
   applications require compatibility with this ancient and frequently
   varied convention, HT characters are no longer appropriate in Modern
   Network Unicode. In support of legacy compatibility cases that do
   require tolerating their use, the "with HT characters" variance is
   defined.

   The version-nonspecific nature of CMNU creates some fuzziness that
   may be undesirable but is more realistic in environments where
   applications choose the Unicode version with the Unicode library that
   happens to be available to them.

   With respect to Normalization (NFC), the unusual circumstances
   alluded to above can come from the fact that some implementations
   of applications may rely on operating system libraries over which
   they have little control.
   Adherence to the robustness principle suggests that receivers of
   Modern Network Unicode should be prepared to receive unnormalized
   text and should not react to that in excessive ways; however, there
   also is no expectation for receivers to go out of their way doing so.

   Some background on the prohibition of byte order marks: The 16-bit
   and 32-bit encodings for Unicode are available in multiple byte
   orders. The byte order in use in a specific piece of text can be
   provided by metadata (such as a media type) or by prefixing the text
   with a "Byte Order Mark", U+FEFF. Since code point U+FFFE is never
   used in Unicode, this unambiguously identifies the byte order.

   For UTF-8, there is no ambiguity and thus no need for a byte order
   mark. However, some systems have made regular use of a leading
   U+FEFF character in UTF-8 files, anyway, often in order to mark the
   file as UTF-8 in case other character codings are also in use and
   metadata is not available. This can wreak havoc with the ASCII
   compatibility of UTF-8; it also creates problems when systems then
   start to expect a BOM in UTF-8 input and none is provided. Section 6
   of [RFC3629] also recommends not using Byte Order Marks with UTF-8,
   but does not phrase this as an unambiguous mandate, so we add that
   here.

5.  Using ABNF with Unicode

   Internet STD 68, [RFC5234], defines Augmented BNF for Syntax
   Specifications: ABNF. Since the late 1970s, ABNF has often been used
   to formally describe the pieces of text that are meant to be used in
   an Internet protocol. ABNF was developed at a time when character
   coding grew more and more complicated, and even in its current form,
   discusses encoding of characters only briefly (Section 2.4 of
   [RFC5234]). This discussion offers no information about how this
   should be used today (it actually still refers to 16-bit Unicode!).
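
   As a small illustration of why the encoding question matters for a
   grammar, the following Python sketch contrasts one piece of text
   viewed as a sequence of Unicode code points with the same text viewed
   as UTF-8 octets. It is not part of this specification; the helper
   names code_points and utf8_octets and the sample string are
   assumptions made only for this example.

```python
# Illustrative sketch only; names and sample text are hypothetical.

def code_points(text):
    """Return the Unicode code points of text as a list of integers."""
    return [ord(ch) for ch in text]


def utf8_octets(text):
    """Return the UTF-8 encoding of text as a list of octet values."""
    return list(text.encode("utf-8"))


# "naive" with U+00EF, a space, and U+2713 CHECK MARK
sample = "na\u00efve \u2713"

# A grammar over code points and a grammar over octets would see
# two different sequences for the same text.
print(" ".join(f"U+{cp:04X}" for cp in code_points(sample)))
print(" ".join(f"{b:02X}" for b in utf8_octets(sample)))
```

   The two printed sequences differ in length as soon as a character
   outside ASCII occurs, so a grammar has to state which of the two
   levels it describes.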

   The best current practice of using ABNF for Unicode-based protocols
   is as follows: ABNF is used as a grammar for describing sequences of
   Unicode code points, valued from 0x0 to 0x10FFFF. The actual
   encoding (as UTF-8) is never seen on the ABNF level; see Section 9.4
   of [RFC6020] for a recent example of this. Approaches such as
   representing the rules of UTF-8 encoding in ABNF (see Section 3.5 of
   [RFC5255] as an example) add complexity without benefit and are NOT
   RECOMMENDED.

   ABNF features such as case-insensitivity in literal text strings
   essentially do not work for general Unicode; text string literals
   therefore (and by the definition in Section 2.3 of [RFC5234]) are
   limited to ASCII characters. That is often not actually a problem in
   text-based protocol definitions. Still, characters beyond ASCII need
   to be allowed in many productions. ABNF does not have access to
   Unicode character categories and thus will be limited in its
   expressiveness here. The core rules defined in Appendix B of
   [RFC5234] are limited to ASCII as well; new rules will therefore need
   to be defined in any protocol employing modern Unicode.

   The present specification recommends defining the following rules:

   ; modern unicode character:
   uchar = %x20-7E / %xA0-2027 / %x202A-D7FF
         / %xE000-FFFD / %x10000-10FFFD
   ; modern unicode newline:
   unl = %x0A
   ; alternatively, modern unicode CR-tolerant newline:
   utnl = [%x0D] %x0A
   ; if really needed, HT-tolerant unicode character:
   utchar = %x09 / uchar

6.  IANA considerations

   This specification places no requirements on IANA.

7.  Security considerations

   The security considerations of [RFC5198] apply.

   A variance "with NUL characters" would create specific security
   considerations as discussed in the security considerations of
   [RFC5198] and should therefore only be used in circumstances that
   absolutely do require it.

8.  References

8.1.  Normative References

   [RFC0020]  Cerf, V., "ASCII format for network interchange", STD 80,
              RFC 20, DOI 10.17487/RFC0020, October 1969.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
              Interchange", RFC 5198, DOI 10.17487/RFC5198, March 2008.

   [RFC5255]  Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
              Message Access Protocol Internationalization", RFC 5255,
              DOI 10.17487/RFC5255, June 2008.

   [RFC6020]  Bjorklund, M., Ed., "YANG - A Data Modeling Language for
              the Network Configuration Protocol (NETCONF)", RFC 6020,
              DOI 10.17487/RFC6020, October 2010.

   [RFC7464]  Williams, N., "JavaScript Object Notation (JSON) Text
              Sequences", RFC 7464, DOI 10.17487/RFC7464, February 2015.

Acknowledgements

   Klaus Hartke and Henk Birkholz drove the author out of his mind
   enough to make him finally write this up. James Manger, Tim Bray and
   Martin Thomson provided comments on an early version of this draft.

Author's Address

   Carsten Bormann
   Universitaet Bremen TZI
   Postfach 330440
   Bremen D-28359
   Germany

   Phone: +49-421-218-63921
   Email: cabo@tzi.org