idnits 2.17.1

draft-seantek-unicode-in-abnf-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC5234, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 164 has weird spacing: '...] could be id...'

     (Using the creation date from RFC5234, updated by this document, for
     RFC5378 checks: 2007-04-24)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 31, 2016) is 2706 days in the past. Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Missing Reference: 'ABNFMORE' is mentioned on line 362, but not defined

  == Unused Reference: 'RFC1345' is defined on line 412, but no explicit
     reference was found in the text

     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--).
     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   Network Working Group                                         S. Leonard
3   Internet-Draft                                             Penango, Inc.
4   Updates: 5234 (if approved)                                    C. Newman
5   Intended Status: Experimental                                     Oracle
6   Expires: May 4, 2017                                    October 31, 2016

8                                Unicode in ABNF
9                       draft-seantek-unicode-in-abnf-02

11  Abstract

13     This experimental document adds support for Unicode strings in ABNF
14     (Augmented Backus-Naur Form), and provides certain symbols related to
15     Unicode code point ranges.

17  Status of This Memo

19     This Internet-Draft is submitted in full conformance with the
20     provisions of BCP 78 and BCP 79.

22     Internet-Drafts are working documents of the Internet Engineering
23     Task Force (IETF). Note that other groups may also distribute working
24     documents as Internet-Drafts. The list of current Internet-Drafts is
25     at http://datatracker.ietf.org/drafts/current/.

27     Internet-Drafts are draft documents valid for a maximum of six months
28     and may be updated, replaced, or obsoleted by other documents at any
29     time. It is inappropriate to use Internet-Drafts as reference
30     material or to cite them other than as "work in progress."

32     This Internet-Draft is a fork of
33     draft-seantek-abnf-more-core-rules-05.

35  Copyright Notice

37     Copyright (c) 2016 IETF Trust and the persons identified as the
38     document authors. All rights reserved.

40     This document is subject to BCP 78 and the IETF Trust's Legal
41     Provisions Relating to IETF Documents
42     (http://trustee.ietf.org/license-info) in effect on the date of
43     publication of this document. Please review these documents
44     carefully, as they describe your rights and restrictions with respect
45     to this document. Code Components extracted from this document must
46     include Simplified BSD License text as described in Section 4.e of
47     the Trust Legal Provisions and are provided without warranty as
48     described in the Simplified BSD License.

50  1. Introduction

52     Augmented Backus-Naur Form (ABNF) [RFC5234] is a formal syntax that
53     is popular among many Internet specifications. Many Internet
54     documents employ this syntax along with the Core Rules defined in
55     Appendix B.1 of [RFC5234]. ABNF is defined in terms of ASCII
56     [ASCII86, RFC0020]; however, Unicode [UNICODE] has become
57     increasingly popular--even required--as the Internet has evolved over
58     the last two decades. Unicode (as UTF-8) will be permitted in the RFC
59     series [IABNA], while [RFC5198] established Net-Unicode as the
60     standard form for the use of Unicode as "network text". Protocols
61     that originally were ASCII-based have been, or are being, extended to
62     support Unicode. However, protocols that use Unicode in some way
63     (e.g., permit UTF-8 content in a production) use different ABNF
64     expressions, some of which do not conform to the modern Unicode
65     Standard 9.0.0, and therefore could introduce interoperability or
66     security problems.

68     Many parties have expressed interest in incorporating [UNICODE] into
69     ABNF, yet the questions remain: "How?" and "To what extent?"

71     This document proposes standardized techniques for expressing Unicode
72     code points using ABNF. This document intends to be very conservative
73     in its approach: a conforming implementation only needs to know how
74     to map between the Unicode scalar values and any Unicode encoding
75     form. The Unicode Character Database (UCD, Section 4.1 of [UNICODE])
76     is intentionally not necessary. ABNF text that uses the syntax in
77     this document needs to be in a Unicode encoding form (Conformance
78     Clause D89 of [UNICODE]), but ABNF text that just uses the rules or
79     terminal values can be expressed in ASCII [RFC0020].

81  2. Unicode Code Points in ABNF

83     (Consult Section 2.3 of [RFC5234] in relation to this paragraph.)
84     Unicode has been expressed in several different ways in RFCs to-date.
85     This document establishes that in contexts where Unicode is specified
86     as the coded character set [RFC2130], the terminal values %x00-10FFFF
87     are to be used to represent the Unicode code points. Only the Unicode
88     scalar values are to be used in specifications that follow this
89     document; surrogate code points (%xD800-DFFF) are not to be used
90     [[NB: directly]]. This technique aligns ABNF with W3C EBNF [XMLEBNF]
91     and Unicode EBNF [UNICODE].

93     (Consult Section 2.4 and Appendix B.2 of [RFC5234] in relation to
94     this paragraph.)
95     In contexts where Unicode is specified as the character set, the
96     ABNF-based grammar may have multiple external encodings. This
97     document does not fix the encoding scheme. The obvious external
98     encoding is UTF-8 (see Net-Unicode [RFC5198]), but other encodings
99     are possible. This document neither restricts productions to NFC, nor
100    provides a syntax for normalization to NFC.

102 3. Unicode Core Rule Update

104    Appendix A furnishes Unicode Core Rules that include comprehensive
105    support for certain Unicode ranges and characters. These Unicode Core
106    Rules supplement the Core Rules of [RFC5234] and [ABNFMORE]; they are
107    intended to be available whenever this document is invoked.

109    The rules reflect broad categories of allowable and disallowable
110    characters in protocols for interchange between systems, as the
111    Internet community has evolved, and as of Unicode 9.0.0 in August
112    2016 [UNICODE]. It is a design goal that a general-purpose ABNF
113    grammar should not need to delve into the minutiae of Unicode
114    character properties, which can be tailorable (i.e., language-
115    specific), overridable, and unstable (between Unicode versions). It
116    is a further design goal that a general-purpose ABNF grammar should
117    not need to rely on sizeable external sources, namely the Unicode
118    Character Database (Section 4.1 of [UNICODE]). To constrain this
119    document's scope, character properties are not addressed further.

121    According to a survey of all RFCs published through August 2016, many
122    widely used Internet protocols rely on horizontal whitespace (HT and
123    SP, or occasionally SP alone) and line breaks (usually CRLF,
124    sometimes LF) as delimiters. Therefore, the rules specifically
125    address horizontal whitespace and line breaks.

127    Rules that both include and exclude the private-use characters
128    (Section 23.5 of [UNICODE]) are provided. Private-use characters "are
129    intended for open interchange, subject to interpretation by private
130    agreement" (Section 23.7 of [UNICODE]). Therefore, there is no way
131    within [UNICODE] itself to provide for a common interpretation of
132    these code points. See also Section 4 of [RFC5198]. A protocol
133    designer needs to establish that common interpretation in prose,
134    provide for protocol elements that establish the common
135    interpretation, or (explicitly) accept that a common interpretation
136    is done outside of the designer's protocol.

138 4. Case-Sensitive Unicode String Syntax

140    This document extends ABNF with a new case-sensitive Unicode string
141    literal. The type is denoted using a type prefix similar to the type
142    prefixes used with numeric values and case-sensitive ASCII string
143    literals. No syntax is provided for a case-insensitive Unicode string
144    literal because doing so would require implementing Unicode caseless
145    matching [UNICODE], which is language-dependent, Unicode version-
146    dependent, and very complicated overall. Caseless matching also
147    requires the UCD.

149    Add the contents of Section 4.1 to [RFC5234].

151 4.1. Terminal Values - Literal Text Strings

153    Literal case sensitive text strings in ABNF may be in the Unicode
154    character set [UNICODE]. The following prefix is used:

156          %su = case-sensitive, Unicode

158    To be consistent with prior implementations of ABNF, having no prefix
159    means that the string is case insensitive and in ASCII.

161    [[ALT/DISCUSS: [RFC7405] %s"text" could be extended to support
162    characters beyond ASCII. It is a strict superset of [RFC7405] and
163    thus simpler. This document would leave [%i]"text" undefined for the
164    time being, or, a collation from [RFC4790] could be identified.]]

166    The case-sensitive Unicode string can be comprised of any Graphic,
167    Format, or Reserved code point. Control, Private-Use, Surrogate, and
168    Noncharacter code points are excluded. Newline (line breaking)
169    characters are also omitted. (See Table 2-3 of [UNICODE].)

171    An example:

173          rulename = %su"!100Q$"

175    where the character ! is actually the Unicode code point U+00A5 YEN
176    SIGN, and the character $ is actually the Unicode code point U+1F39F
177    ADMISSION TICKETS, is equivalent to the rule:

179          rulename = %xA5.31.30.30.51.1F39F

181 4.2. ABNF Definition of ABNF - char-val

183    char-val =/ case-sensitive-Unicode-string

185                    ; ALT/DISCUSS: "%s", modify 7405
186    case-sensitive-Unicode-string =
187                    "%su" quoted-Unicode-string

189    quoted-Unicode-string = DQUOTE *(%x20-21 / %x23-7E /
190                            UVCHARBEYONDASCII) DQUOTE
191                    ; quoted string of SP and VCHAR
192                    ; without DQUOTE, and UVCHAR
193                    ; beyond the ASCII range

195 5. Terminal Value Transformation Syntax for UTF-8 and UTF-16

197    While Section 2 establishes terminal values %x00-10FFFF for Unicode,
198    many Internet protocols incorporate Unicode using UTF-8 and define
199    protocol elements using UTF-8 terminal values (i.e., values in the 8-
200    bit range of %x00-FF, or more specifically, %x00-BF and %xC2-F4); see
201    [RFC3629]. A smaller yet notable set of protocols use UTF-16.
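
   As a rough, non-normative illustration of the two notations discussed
   so far, the Python sketch below renders the body of a %su string as
   the equivalent sequence of numeric terminal values (the Section 4.1
   example), and checks that the UTF-8 octets of that string fall within
   %x00-BF and %xC2-F4. The helper names su_to_terminal_values and
   utf8_octets_in_range are hypothetical and are not defined by this
   document; the sketch assumes the input contains only Unicode scalar
   values.

```python
# Non-normative sketch; helper names are hypothetical, not part of the draft.

def su_to_terminal_values(text):
    """Render the body of a %su string as a %x... terminal value sequence.

    Each character becomes its code point in uppercase hexadecimal,
    joined with "." as in the concatenation form of RFC 5234 num-val.
    """
    return "%x" + ".".join(format(ord(ch), "X") for ch in text)


def utf8_octets_in_range(text):
    """Return True if every UTF-8 octet lies in %x00-BF or %xC2-F4."""
    return all(b <= 0xBF or 0xC2 <= b <= 0xF4 for b in text.encode("utf-8"))


# U+00A5 YEN SIGN, "100Q", U+1F39F ADMISSION TICKETS (Section 4.1 example)
example = "\u00A5100Q\U0001F39F"

print(su_to_terminal_values(example))  # %xA5.31.30.30.51.1F39F
print(utf8_octets_in_range(example))   # True
```

   The second helper only checks the octet ranges quoted above; it does
   not by itself establish that a byte sequence is well-formed UTF-8.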

203    Writing out Unicode code points or ranges in UTF-8 or UTF-16 can be
204    cumbersome and error-prone. This document therefore provides a
205    "terminal value transformation syntax", so that the code points %x00-
206    10FFFF can be written out natively, but the resulting ABNF represents
207    8-bit or 16-bit units at the level of ABNF syntax. From there, a
208    protocol can supply a specific mapping (encoding) of those values
209    into a character set or other representation, consistent with Section
210    2.3 of [RFC5234].

212    The syntax is:
213      %t8(...)     for 8-bit UTF-8 (transform to %x00-BF and %xC2-F4)
214      %t16(...)    for 16-bit UTF-16 (transform to %x00-D7FF,
215                   %xD800-DBFF %xDC00-DFFF, and %xE000-FFFF)
216      %t16le(...)  for 8-bit UTF-16LE (transform to %x00.00-%xFF.FF,
217                   little-endian)
218      %t16be(...)  for 8-bit UTF-16BE (transform to %x00.00-%xFF.FF,
219                   big-endian)

221    [[NB: Other possibilities: !t8 ~t8 $t8 #t8 -t8]]

223    A transform is applied by recursively driving it into the elements,
224    transforming terminal values from the original code point to the
225    corresponding Unicode Transformation Format over an 8-bit (or 16-bit)
226    field. The transforms in this document distribute over ABNF
227    operators. "%t16" outputs 16-bit terminal values from %x00-FFFF,
228    meaning that the endianness is not specified: a protocol needs to
229    specify this or furnish a protocol slot for 16-bit code units. In
230    contrast, "%t16be" and "%t16le" output 8-bit terminal values: each
231    terminal value in the input will correspond to two or four terminal
232    values in the output.

234    If a transform is used on a terminal value outside the Unicode scalar
235    value range (see the proposed Core Rule ), the resulting
236    terminal value can be neither satisfied nor produced.

238    A "reverse transformation syntax" to go from 8-bit or 16-bit terminal
239    values to reassembled Unicode code points is not proposed at this
240    time.

242 5.1. Examples
243    Example 1: The following rules are equivalent; see [RFC3629]:

245      UTF8-MB = UTF8-2 / UTF8-3 / UTF8-4 ; from RFC 3629

247                ; %x80-D7FF / %xE000-10FFFF
248      UTF8-MB = %t8( BEYONDASCII )

250    Example 2: The code point U+1F430 RABBIT FACE can be represented as
251    %x1F430. It can also be represented as %xD83D.DC30 or %t16( %x1F430 )
252    when UTF-16 is intended.

254 5.2. Advantages and Features

256    Using transformation syntax offers several advantages:

258    The generic ABNF syntax of a textual protocol can take full advantage
259    of the Unicode character set; the syntax is not dependent on a
260    particular encoding form.

262    Specifying ranges of characters becomes unwieldy when explicitly
263    defined in terms of code units in a Unicode encoding form, e.g., as
264    UTF-8 code units (octets) for characters beyond ASCII, or as UTF-16
265    code units (16-bit words) for supplementary characters. Trying to
266    specify Punycode in ABNF would be, for all intents and purposes,
267    impossible! (Note: it's not actually impossible, but very difficult
268    and not particularly useful.)

270    Protocols that have arbitrary binary slots (e.g., BINARYMIME) are
271    inherently incompatible with Section 2 syntax, but compatibility can
272    be achieved by using transformation syntax.

274    Protocol designers can effectively exploit the "holes" in UTF-8,
275    because octets C0, C1, and F5-FF are never seen in UTF-8. These
276    octets provide natural delimiters for arbitrary runs of UTF-8. An
277    advantage of using such octets as delimiters is that checking for
278    these octets has to be done anyway for security reasons, so a
279    designer can save cycles by incorporating this part of a check for
280    well-formed Unicode into a protocol. Such delimiters can only be
281    expressed outside of "%t8", since a "%t8" transform will never
282    produce those terminal values.

284    (UTF-16 also has such "holes", namely, in unpaired surrogates. But
285    using unpaired surrogates as delimiters may suffer from other
286    security pitfalls; in any event, UTF-16 is far less common in IETF
287    usage.)

289 6. Comment Syntax

291    This document extends ABNF to have Unicode comments. Comments are
292    treated as specification prose, so they may be normative depending on
293    the context. Comment text allows for the same repertoire of
294    characters as RFC text. The RFC Editors can regulate comments to the
295    same extent as specification prose, including disallowing certain
296    characters or code points.

298 6.1. Comment: ; Comment

300    (No changes to the text of Section 3.9 of [RFC5234] are needed.)

302 6.2. ABNF Definition of ABNF - comment

304    ; given:
305    comment = ";" *(WSP / VCHAR) CRLF

307    ; increment (unambiguous grammar):
308    comment =/ ";" *(UWSP / UVCHAR / PUACHAR)
309               (UWSPBEYONDASCII / UVCHARBEYONDASCII / PUACHAR)
310               *(UWSP / UVCHAR / PUACHAR) CRLF

312    ; or redefine:
313    comment = ";" *(UWSP / UVCHAR / PUACHAR) CRLF

315 7. Notational Conventions

317    For readability it is advisable to express a Unicode code point as
318    the character itself, the numeric terminal value, and the name or a
319    name alias. Only one expression is used for the formal ABNF notation:
320    either the character itself (Section 4) or the numeric terminal value
321    (Section 2). The other expressions can be incorporated into an
322    adjacent comment.

324    The suggested notational convention for the adjacent comment follows
325    Appendix A of [UNICODE]. The comment text is comprised of one or more
326    WSP characters, optionally either the character itself or "U+" syntax
327    followed by exactly one SP, and the name or a name alias in ALL-CAPS
328    ASCII. Multiple characters can be notated in sequence on multiple
329    comment lines or on a single comment line. It is neither advisable
330    nor necessary to notate characters in the ASCII range. Examples of
331    the notation include:

333                   ; U+2206 INCREMENT
334                   ; U+2030 PER MILLE SIGN
335    change-in-temp = %su"$" 3DIGIT %su"%"

337          ; # EURO SIGN   ZWJ / VULGAR FRACTION ONE HALF
338    euros = %x20AC 3DIGIT [%x200D.BD]

340    where the characters $, %, #, and / are actually the respective
341    Unicode characters mentioned in the comments.

343 8. Effects on RFC 5234

345    Formally, this document updates [RFC5234] but does not modify it in
346    situ. Authors need to reference this document if they want to include
347    these enhancements; bare references to [RFC5234] do not include this
348    specification (or, for that matter, [RFC7405]). This directive
349    follows a model whereby document authors can choose whether to invoke
350    particular enhancements to ABNF. As time goes on, the IETF can
351    determine how often these enhancements are invoked, and can decide
352    whether to include them as part of a revision to the base [RFC5234].

354    A bare reference to this document invokes the case-sensitive Unicode
355    literal string syntax enhancement, the Unicode comment syntax
356    enhancement, and the Unicode Core Rules of Appendix A (i.e., the Core
357    Rules do not have to be further referenced). Nevertheless, document
358    authors are free to qualify a reference to this document to invoke
359    each feature selectively.

361    Appendix A of this document is meant to supplement Appendix B.1 of
362    [RFC5234] and Appendix A of [ABNFMORE]; therefore, concurrently
363    referencing those documents is a good idea. Document authors who
364    reference this document should use the rules of Appendix A, and
365    should not attempt to redefine or provide incremental alternatives to
366    them (except for backwards compatibility with prior documents).

368 9. IANA Considerations

370    This document implies no IANA considerations.

372 10. Security Considerations

374    While the Unicode Core Rules themselves may not be security-relevant,
375    the use of C1 control characters could very well be security-
376    relevant, because they may trigger special functions on various
377    devices, while being invisible in other contexts. Similarly, case-
378    sensitive Unicode string syntax allows for a broad range of code
379    points, many of which represent characters that are confusable with
380    other characters, or can only be inferred by visible yet subtle
381    changes in the surrounding graphemes (or worse, semantic changes that
382    do not have visual representations).

384    Protocols using Unicode should evaluate the applicability of Unicode
385    security considerations [UTR#36].

387 11. References

389 11.1. Normative References

391    [ASCII86]  American National Standards Institute, "Coded Character
392               Set -- 7-bit American Standard Code for Information
393               Interchange", ANSI X3.4, 1986.

395    [RFC0020]  Cerf, V., "ASCII format for network interchange", RFC 20,
396               October 1969.

398    [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
399               Interchange", RFC 5198, March 2008.

401    [RFC5234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
402               Specifications: ABNF", STD 68, RFC 5234, January 2008.

404    [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
405               9.0.0", The Unicode Consortium, August 2016.

407 11.2. Informative References

409    [IABNA]    Flanagan, H., "The Use of Non-ASCII Characters in RFCs",
410               draft-iab-rfc-nonascii-02 (work in progress), April 2016.

412    [RFC1345]  Simonsen, K., "Character Mnemonics and Character Sets",
413               RFC 1345, June 1992.

415    [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
416               Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
417               the IAB Character Set Workshop held 29 February - 1 March,
418               1996", RFC 2130, April 1997.

420    [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
421               10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
422               2003.

424    [RFC4790]  Newman, C., Duerst, M., and A. Gulbrandsen, "Internet
425               Application Protocol Collation Registry", RFC 4790, March
426               2007.

428    [RFC7405]  Kyzivat, P., "Case-Sensitive String Support in ABNF", RFC
429               7405, December 2014.

431    [UTR#36]   Davis, M. and M. Suignard, "Unicode Security
432               Considerations", Unicode Technical Report #36, September
433               2014, .

435    [XMLEBNF]  Bray, T., Paoli, J., Sperberg-McQueen, M., Maler, E., and
436               F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
437               Edition)", Section 6, W3C Recommendation REC-xml-20081126,
438               November 2008, .

441 Appendix A. Comprehensive Unicode Core Rules

443    Certain basic rules are in uppercase, such as SP, HTAB, CRLF, DIGIT,
444    ALPHA, etc.

446                       ; D76 Unicode scalar value

448    UNICODE            =
449    BEYONDASCII        =
450    BEYONDG0           =

452    C1                 =
453    BEYONDC1           =
454    G1                 = ; 96-set
455    BEYONDG1           =
456    LATIN1             =
457    BEYONDLATIN1       =

459                       ; C2 D14 noncharacter (sentinel)
460                       ; Section 23.7 Noncharacters, see also NUL

462    NONUCHAR           =

472                       ; UCHAR rules are analogous to CHAR

474    UCHARBEYONDBMP     =

483    UCHARBEYONDLATIN1  = / UCHARBEYONDBMP

486    UCHARBEYONDC1      =
487                         / UCHARBEYONDBMP

489    UCHARBEYONDASCII   = C1 / UCHARBEYONDC1

491    UCHAR              = /
492                         UCHARBEYONDBMP

494                       ; D49 private-use
495                       ; Section 23.5 Private-Use Characters

497                       ; Primary Private Use Area (in BMP)
498    PPUACHAR           =
499                       ; Supplementary Private Use Area-A
500    SPUAACHAR          =
501                       ; Supplementary Private Use Area-B
502    SPUABCHAR          =

504                       ; TODO: possible alternates: PUCHAR, PUA
505    PUACHAR            = PPUACHAR / SPUAACHAR / SPUABCHAR

507       ; Unicode-y VCHAR: like VCHAR, attempts to capture
508       ; "all standardized graphic and formatting
509       ; characters/code points for open interchange,
510       ; excluding white space and controls"
511       ; EXCLUDES: Noncharacters (some Cn), Cs, Co, Cc, Z (Zs, Zl, Zp)

513    UVCHARBEYONDBMP    =

521    UVCHARBEYONDLATIN1 = /
526                         UVCHARBEYONDBMP

528    UVCHARBEYONDASCII  = /
533                         UVCHARBEYONDBMP

535    UVCHARBEYONDC1     = UVCHARBEYONDASCII

537    UVCHAR             = VCHAR / UVCHARBEYONDASCII

539       ; horizontal white space only (Zs beyond ASCII),
540       ; NO line breaks (Cc, Zl, Zp)
541       ; cf Section 5.8 Newline Guidelines with RFC 5198
542       ; see also SP
543    UWSPBEYONDASCII    =

546                       ; includes HT
547    UWSP               = WSP / UWSPBEYONDASCII

549                       ; C1 Controls
550    PAD                = ; gov't health warning: figment
551    HOP                = ; gov't health warning: figment
552    BPH                =
553    NBH                =
554    IND                =
555    NEL                =
556       ; NLF CRLF, CR, LF, NEL (not LS or PS)
557       ; --probably unnecessary for Internet usage:
558       ; CRLF is already the standard
559    SSA                =
560    ESA                =
561    HTS                =
562    HTJ                =
563    VTS                =
564    PLD                =
565    PLU                =
566    RI                 =
567    SS2                =
568    SS3                =
569    DCS                =
570    PU1                =
571    PU2                =
572    STS                =
573    CCH                =
574    MW                 =
575    SPA                =
576    EPA                =
577    SOS                =
578    SGCI               = ; or SGC, gov't health warning: figment
579    SCI                =
580    CSI                =
581    ST                 =
582    OSC                =
583    PM                 =
584    APC                =

586                       ; Latin1
587    NBSP               =
588    SHY                =

590                       ; Zl, Zp
591       ; NB: These are excluded from both UVCHAR and UWSP
592    LS                 =
593    PS                 =

595 Authors' Addresses

597    Sean Leonard
598    Penango, Inc.
599    5900 Wilshire Boulevard
600    21st Floor
601    Los Angeles, CA 90036
602    USA

604    EMail: dev+ietf@seantek.com
605    URI: http://www.penango.com/

607    Chris Newman
608    Oracle
609    440 E. Huntington Dr., Suite 400
610    Arcadia, CA 91006
611    USA

613    EMail: chris.newman@oracle.com