Network Working Group                                         C. Bormann
Internet-Draft                                    Universität Bremen TZI
Intended status: Standards Track                                 T. Bray
Expires: 27 October 2022                                      Textuality
                                                           25 April 2022

                I-Regexp: An Interoperable Regexp Format
                   draft-bormann-jsonpath-iregexp-04

Abstract

   This document specifies I-Regexp, a flavor of regular expressions
   that is limited in scope with the goal of interoperation across many
   different regular-expression libraries.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-bormann-jsonpath-iregexp/.

   Discussion of this document takes place on the JSONpath Working Group
   mailing list (mailto:JSONpath@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/JSONpath/.

   Source for this draft and an issue tracker can be found at
   https://github.com/cabo/iregexp.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 27 October 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Requirements
   3.  I-Regexp Syntax
   4.  I-Regexp Semantics
   5.  Mapping I-Regexp to Regexp Dialects
     5.1.  XSD Regexps
     5.2.  ECMAScript Regexps
     5.3.  PCRE, RE2, Ruby Regexps
   6.  Motivation and Background
     6.1.  Implementing I-Regexp
   7.  IANA Considerations
   8.  Security considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Regexps and Similar Constructs in Recent Published RFCs
   Acknowledgements
   Authors' Addresses

1.  Introduction

   This specification describes an interoperable regular expression
   flavor, I-Regexp.

   This document uses the abbreviation "regexp" for what are usually
   called regular expressions in programming. "I-Regexp" is used as a
   noun meaning a character string which conforms to the requirements in
   this specification; the plural is "I-Regexps".

   I-Regexp does not provide advanced regexp features such as capture
   groups, lookahead, or backreferences. It supports only a Boolean
   matching capability, i.e., testing whether a given regexp matches a
   given piece of text.

   I-Regexp supports the entire repertoire of Unicode characters.

   I-Regexp is a subset of XSD regexps [XSD-2].

   This document includes rules for converting I-Regexps for use with
   several well-known regexp libraries.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The grammatical rules in this document are to be interpreted as ABNF,
   as described in [RFC5234] and [RFC7405].

2.  Requirements

   I-Regexps should handle the vast majority of practical cases where a
   matching regexp is needed in a data model specification or a query
   language expression.

   A brief survey of published RFCs yielded the regexp patterns in
   Appendix A (with no attempt at completeness). With certain
   exceptions as discussed there, these should be covered by I-Regexps,
   both syntactically and with their intended semantics.

3.  I-Regexp Syntax

   An I-Regexp MUST conform to the ABNF specification in Figure 1.

   i-regexp = branch *( "|" branch )
   branch = *piece
   piece = atom [ quantifier ]
   quantifier = ( %x2A-2B ; '*'-'+'
    / "?" ) / ( "{" quantity "}" )
   quantity = QuantExact [ "," [ QuantExact ] ]
   QuantExact = 1*%x30-39 ; '0'-'9'

   atom = NormalChar / charClass / ( "(" i-regexp ")" )
   NormalChar = ( %x00-27 / %x2C-2D ; ','-'-'
    / %x2F-3E ; '/'-'>'
    / %x40-5A ; '@'-'Z'
    / %x5E-7A ; '^'-'z'
    / %x7E-10FFFF )
   charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
   SingleCharEsc = "\" ( %x28-2B ; '('-'+'
    / %x2D-2E ; '-'-'.'
    / "?" / %x5B-5E ; '['-'^'
    / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
    )
   charClassEsc = catEsc / complEsc
   charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
   CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
   CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
    / %x5E-10FFFF ) / SingleCharEsc
   catEsc = %s"\p{" charProp "}"
   complEsc = %s"\P{" charProp "}"
   charProp = IsCategory / IsBlock
   IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
    Symbols / Others
   Letters = %s"L" [ ( %x6C-6D ; 'l'-'m'
    / %s"o" / %x74-75 ; 't'-'u'
    ) ]
   Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
   Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
   Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
    / %s"i" / %s"o" / %s"s" ) ]
   Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
   Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
   Others = %s"C" [ ( %s"c" / %s"f" / %x6E-6F ; 'n'-'o'
    ) ]
   IsBlock = %s"Is" 1*( "-" / %x30-39 ; '0'-'9'
    / %x41-5A ; 'A'-'Z'
    / %x61-7A ; 'a'-'z'
    )

                    Figure 1: I-Regexp Syntax in ABNF

   As an additional restriction, charClassExpr is not allowed to match
   [^], which according to this grammar would parse as a positive
   character class containing the single character ^.

   This is essentially XSD regexp without character class subtraction
   and multi-character escapes such as \s, \S, and \w.

   An I-Regexp implementation MUST be a complete implementation of this
   limited subset.
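
   The following Python sketch illustrates how the ABNF in Figure 1
   might be checked by a hand-written recognizer. It is not part of
   this specification. All names (is_iregexp and the helper functions)
   are hypothetical. The sketch checks syntax only: it does not verify
   that range endpoints are ordered, and it accepts any block name that
   fits the IsBlock production without consulting Unicode tables.

```python
import re

SPECIAL = set("()*+.?[\\]{|}")         # code points excluded from NormalChar
ESCAPABLE = set("()*+-.?[\\]^nrt{|}")  # allowed after "\" in SingleCharEsc
CC_EXCLUDED = set("-[\\]")             # excluded from an unescaped CCchar
# charClassEsc: \p{...} or \P{...} naming an IsCategory or IsBlock
CLASS_ESC = re.compile(
    r"\\[pP]\{(?:L[lmotu]?|M[cen]?|N[dlo]?|P[c-fios]?|Z[lps]?|S[ckmo]?|C[cfno]?"
    r"|Is[-0-9A-Za-z]+)\}"
)
QUANTIFIER = re.compile(r"[*+?]|\{[0-9]+(?:,[0-9]*)?\}")


def is_iregexp(text: str) -> bool:
    """Return True if text matches the i-regexp production (syntax only)."""
    try:
        return _regexp(text, 0) == len(text)
    except ValueError:
        return False


def _regexp(s: str, i: int) -> int:
    # i-regexp = branch *( "|" branch ); returns the index after the match
    i = _branch(s, i)
    while s.startswith("|", i):
        i = _branch(s, i + 1)
    return i


def _branch(s: str, i: int) -> int:
    # branch = *piece ; piece = atom [ quantifier ]
    while i < len(s) and s[i] not in "|)":
        i = _atom(s, i)
        m = QUANTIFIER.match(s, i)
        if m:
            i = m.end()
    return i


def _atom(s: str, i: int) -> int:
    c = s[i]
    if c == "(":
        i = _regexp(s, i + 1)
        if not s.startswith(")", i):
            raise ValueError("unclosed group")
        return i + 1
    if c == "[":
        return _class_expr(s, i + 1)
    if c == "\\":
        if i + 1 < len(s) and s[i + 1] in ESCAPABLE:
            return i + 2                      # SingleCharEsc
        m = CLASS_ESC.match(s, i)
        if m:
            return m.end()                    # charClassEsc
        raise ValueError("unsupported escape")
    if c == "." or c not in SPECIAL:
        return i + 1                          # "." or NormalChar
    raise ValueError(f"unexpected {c!r}")


def _cc_char(s: str, i: int) -> int:
    # CCchar; returns the index after it, or -1 if none starts at i
    if i >= len(s):
        return -1
    if s[i] == "\\":
        return i + 2 if i + 1 < len(s) and s[i + 1] in ESCAPABLE else -1
    return -1 if s[i] in CC_EXCLUDED else i + 1


def _cce1(s: str, i: int) -> int:
    # CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc; -1 if none starts at i
    m = CLASS_ESC.match(s, i)
    if m:
        return m.end()
    j = _cc_char(s, i)
    if j >= 0 and s.startswith("-", j):
        k = _cc_char(s, j + 1)
        if k >= 0:
            return k                          # a range such as a-z
    return j


def _class_expr(s: str, i: int) -> int:
    # charClassExpr after "[": [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
    if s.startswith("^", i):
        i += 1                                # negation; "[^]" then fails below
    if s.startswith("-", i):
        i += 1
    else:
        i = _cce1(s, i)
        if i < 0:
            raise ValueError("empty character class")
    while (j := _cce1(s, i)) >= 0:
        i = j
    if s.startswith("-", i):
        i += 1
    if not s.startswith("]", i):
        raise ValueError("unterminated character class")
    return i + 1
```

   With this sketch, is_iregexp(r"[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}")
   is true, while is_iregexp(r"\d{4}"), is_iregexp("[^]"), and
   is_iregexp("[a-z-[aeiou]]") are false, since multi-character escapes,
   the [^] form, and character class subtraction are outside the
   grammar.
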
   In particular, full Unicode support is REQUIRED; the implementation
   MUST NOT limit itself to 7- or 8-bit character sets such as ASCII and
   MUST support the Unicode character property set in character classes.

4.  I-Regexp Semantics

   This syntax is a subset of that of [XSD-2]. Implementations which
   interpret I-Regexps MUST yield Boolean results as specified in
   [XSD-2]. (See also Section 5.1.)

5.  Mapping I-Regexp to Regexp Dialects

   (TBD; these mappings need to be further verified in implementation
   work.)

5.1.  XSD Regexps

   Any I-Regexp also is an XSD Regexp [XSD-2], so the mapping is an
   identity function.

   Note that a few errata for [XSD-2] have been fixed in [XSD11-2],
   which is therefore also included as a normative reference. XSD 1.1
   is less widely implemented than XSD 1.0, and implementations of XSD
   1.0 are likely to include these bugfixes, so for the intents and
   purposes of this specification an implementation of XSD 1.0 regexps
   is equivalent to an implementation of XSD 1.1 regexps.

5.2.  ECMAScript Regexps

   Perform the following steps on an I-Regexp to obtain an ECMAScript
   regexp [ECMA-262]:

   *  For any dots (.) outside character classes (first alternative of
      charClass production): replace dot by [^\n\r].

   *  Envelope the result in ^ and $.

   Note that where a regexp literal is required, the actual regexp needs
   to be enclosed in /.

5.3.  PCRE, RE2, Ruby Regexps

   Perform the same steps as in Section 5.2 to obtain a valid regexp in
   PCRE [PCRE2], the Go programming language [RE2], and the Ruby
   programming language, except that the last step is:

   *  Enclose the regexp in \A and \z.

6.  Motivation and Background

   While regular expressions originally were intended to describe a
   formal language to support a Boolean matching function, they have
   been enhanced with parsing functions that support the extraction and
   replacement of arbitrary portions of the matched text. With this
   accretion of features, parsing regexp libraries have become more
   susceptible to bugs and surprising performance degradations which can
   be exploited in Denial of Service attacks by an attacker who controls
   the regexp submitted for processing. I-Regexp is designed to offer
   interoperability, and to be less vulnerable to such attacks, with the
   trade-off that its only function is to offer a boolean response as to
   whether a character sequence is matched by a regexp.

6.1.  Implementing I-Regexp

   XSD regexps are relatively easy to implement or map to widely
   implemented parsing regexp dialects, with these notable exceptions:

   *  Character class subtraction. This is a very useful feature in
      many specifications, but it is unfortunately mostly absent from
      parsing regexp dialects. Thus, it is omitted from I-Regexp.

   *  Multi-character escapes. \d, \w, \s and their uppercase
      complement classes exhibit a large amount of variation between
      regexp flavors. Thus, they are omitted from I-Regexp.

   *  Not all regexp implementations support accesses to Unicode tables
      that enable executing on constructs such as \p{IsCoptic}, although
      the \p/\P feature in general is now quite widely available. While
      in principle it's possible to translate these into codepoint-range
      matches, this also requires access to those tables. Thus, regexp
      libraries in severely constrained environments may not be able to
      support I-Regexp conformance.

7.  IANA Considerations

   This document makes no requests of IANA.

8.  Security considerations

   As discussed in Section 6, more complex regexp libraries may contain
   exploitable bugs leading to crashes and remote code execution. There
   is also the problem that such libraries often have hard-to-predict
   performance characteristics, leading to attacks that overload an
   implementation by matching against an expensive attacker-controlled
   regexp.

   I-Regexps have been designed to allow implementation in a way that is
   resilient to both threats; this objective needs to be addressed
   throughout the implementation effort.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008.

   [RFC7405]  Kyzivat, P., "Case-Sensitive String Support in ABNF",
              RFC 7405, DOI 10.17487/RFC7405, December 2014.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [XSD-2]    Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes
              Second Edition", World Wide Web Consortium Recommendation
              REC-xmlschema-2-20041028, 28 October 2004.

   [XSD11-2]  Peterson, D., Gao, S., Malhotra, A., Sperberg-McQueen, M.,
              Thompson, H., and P. Biron, "W3C XML Schema Definition
              Language (XSD) 1.1 Part 2: Datatypes", World Wide Web
              Consortium Recommendation REC-xmlschema11-2-20120405, 5
              April 2012.

9.2.  Informative References

   [ECMA-262] Ecma International, "ECMAScript 2020 Language
              Specification", ECMA Standard ECMA-262, 11th Edition, June
              2020.

   [PCRE2]    "Perl-compatible Regular Expressions (revised API:
              PCRE2)", n.d.

334 [RE2] "RE2 is a fast, safe, thread-friendly alternative to 335 backtracking regular expression engines like those used in 336 PCRE, Perl, and Python. It is a C++ library.", n.d., 337 . 339 [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, 340 DOI 10.17487/RFC7493, March 2015, 341 . 343 Appendix A. Regexps and Similar Constructs in Recent Published RFCs 345 This appendix contains a number of regular expressions that have been 346 extracted from some recently published RFCs based on some ad-hoc 347 matching. Multi-line constructions were not included. With the 348 exception of some (often surprisingly dubious) usage of multi- 349 character escapes, all regular expressions validate against the ABNF 350 in Figure 1. 352 rfc6021.txt 459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*)))) 353 rfc6021.txt 513 \d*(\.\d*){1,127} 354 rfc6021.txt 529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)? 355 rfc6021.txt 631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? 356 rfc6021.txt 647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} 357 rfc6021.txt 933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} 358 rfc6021.txt 938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| 359 rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} 360 rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| 361 rfc6020.txt 6647 [0-9a-fA-F]* 362 rfc6095.txt 2544 \S(.*\S)? 363 rfc6110.txt 1583 [aeiouy]* 364 rfc6110.txt 3222 [A-Z][a-z]* 365 rfc6536.txt 1583 \* 366 rfc6536.txt 1632 [^\*].* 367 rfc6643.txt 524 \p{IsBasicLatin}{0,255} 368 rfc6728.txt 3480 \S+ 369 rfc6728.txt 3500 \S(.*\S)? 370 rfc6991.txt 477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*)))) 371 rfc6991.txt 525 \d*(\.\d*){1,127} 372 rfc6991.txt 541 [a-zA-Z_][a-zA-Z0-9\-_.]* 373 rfc6991.txt 542 .|..|[^xX].*|.[^mM].*|..[^lL].* 374 rfc6991.txt 571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)? 375 rfc6991.txt 665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? 376 rfc6991.txt 693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} 377 rfc6991.txt 725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? 
378 rfc6991.txt 743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}- 379 rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} 380 rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| 381 rfc6991.txt 1099 [0-9\.]* 382 rfc6991.txt 1109 [0-9a-fA-F:\.]* 383 rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} 384 rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| 385 rfc7407.txt 933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254} 386 rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31} 387 rfc7758.txt 703 \d{2}:\d{2}:\d{2}(\.\d+)? 388 rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)? 389 rfc7895.txt 349 \d{4}-\d{2}-\d{2} 390 rfc7950.txt 8323 [0-9a-fA-F]* 391 rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]* 392 rfc7950.txt 8356 [xX][mM][lL].* 393 rfc8040.txt 4713 \d{4}-\d{2}-\d{2} 394 rfc8049.txt 6704 [A-Z]{2} 395 rfc8194.txt 629 \* 396 rfc8194.txt 637 [0-9]{8}\.[0-9]{6} 397 rfc8194.txt 905 Z|[\+\-]\d{2}:\d{2} 398 rfc8194.txt 963 (2((2[4-9])|(3[0-9]))\.).* 399 rfc8194.txt 974 (([fF]{2}[0-9a-fA-F]{2}):).* 400 rfc8299.txt 7986 [A-Z]{2} 401 rfc8341.txt 1878 \* 402 rfc8341.txt 1927 [^\*].* 403 rfc8407.txt 1723 [0-9\.]* 404 rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]* 405 rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].* 406 rfc8525.txt 550 \d{4}-\d{2}-\d{2} 407 rfc8776.txt 838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)* 408 rfc8776.txt 874 ([a-zA-Z0-9\-_.]+:)* 409 rfc8819.txt 311 [\S ]+ 410 rfc8944.txt 596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7} 412 Figure 2: Example regular expressions extracted from RFCs 414 The multi-character escapes (MCE) or the character classes built 415 around them used here can be substituted as shown in Table 1. 

                    +===========+==================+
                    | MCE/class | Substitute class |
                    +===========+==================+
                    | \S        | [^ \t\n\r]       |
                    +-----------+------------------+
                    | [\S ]     | [^\t\n\r]        |
                    +-----------+------------------+
                    | \d        | [0-9]            |
                    +-----------+------------------+

     Table 1: Substitutes for multi-character escapes in examples

   Note that the semantics of \d in XSD regular expressions is that of
   \p{Nd}; however, this would include all Unicode characters that are
   digits in various writing systems and certainly is not actually meant
   in the RFCs listed.

Acknowledgements

   This draft has been motivated by the discussion in the IETF JSONPATH
   WG about whether to include a regexp mechanism into the JSONPath
   query expression specification, as well as by previous discussions
   about the YANG pattern and CDDL .regexp features.

   The basic approach for this draft was inspired by The I-JSON Message
   Format [RFC7493].

Authors' Addresses

   Carsten Bormann
   Universität Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany
   Phone: +49-421-218-63921
   Email: cabo@tzi.org

   Tim Bray
   Textuality
   Email: tbray@textuality.com
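
   As an illustration only, the mapping steps of Section 5.2 and
   Section 5.3 and the substitutions of Table 1 might be sketched in
   Python as follows. The function names are hypothetical. The sketch
   assumes that its input is already a valid I-Regexp, and the
   substitution helper uses plain text replacement, which is adequate
   only for patterns shaped like those in Figure 2. Unlike the literal
   steps, it also wraps the pattern in a non-capturing group (?:...)
   before adding the anchors, on the assumption that an alternation
   such as a|b is meant to be anchored as a whole rather than branch by
   branch.

```python
# Table 1 substitutions; the bracketed form comes first so that "[\S ]"
# is handled before the bare "\S" rule can touch it.
MCE_SUBSTITUTES = [
    (r"[\S ]", r"[^\t\n\r]"),
    (r"\S", r"[^ \t\n\r]"),
    (r"\d", "[0-9]"),
]


def substitute_mce(pattern: str) -> str:
    # Textual replacement only; not a general regexp rewriter.
    for old, new in MCE_SUBSTITUTES:
        pattern = pattern.replace(old, new)
    return pattern


def replace_bare_dots(iregexp: str) -> str:
    # Replace each "." outside a character class with [^\n\r]. The input
    # is assumed to be a valid I-Regexp, so an unescaped "]" always
    # closes a character class.
    out = []
    in_class = False
    i = 0
    while i < len(iregexp):
        c = iregexp[i]
        if c == "\\" and i + 1 < len(iregexp):
            out.append(iregexp[i:i + 2])  # keep escapes such as \. or \] as is
            i += 2
            continue
        if in_class:
            in_class = c != "]"
        elif c == "[":
            in_class = True
        elif c == ".":
            c = r"[^\n\r]"
        out.append(c)
        i += 1
    return "".join(out)


def to_ecmascript(iregexp: str) -> str:
    # Section 5.2, with the added non-capturing group (an assumption).
    return "^(?:" + replace_bare_dots(iregexp) + ")$"


def to_pcre(iregexp: str) -> str:
    # Section 5.3: same steps, anchored with \A and \z instead.
    return r"\A(?:" + replace_bare_dots(iregexp) + r")\z"
```

   For example, to_ecmascript("a|b") returns ^(?:a|b)$, and
   substitute_mce applied to \d{4}-\d{2} returns [0-9]{4}-[0-9]{2},
   which can then be passed to either mapping function.
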