Network Working Group                                         S. Leonard
Internet-Draft                                             Penango, Inc.
Intended status: Informational                             J. Hildebrand
Expires: September 14, 2017                                Cisco Systems
                                                               T. Hansen
                                                       AT&T Laboratories
                                                          March 13, 2017

                 Regular Expressions for Internet Mail
                     draft-seantek-mail-regexen-02

Abstract

   Internet Mail identifiers are used ubiquitously throughout computing
   systems as building blocks of online identity. Unfortunately,
   incomplete understandings of the syntaxes of these identifiers have
   led to interoperability problems and poor user experiences. Many
   users use specific characters in their addresses that are not
   properly accepted on various systems. This document prescribes
   normative regular expression (regex) patterns for all Internet-
   connected systems to use when validating or parsing Internet Mail
   identifiers, with special attention to regular expressions that work
   with popular languages and platforms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Normative Effects
     1.2.  Definitions
   2.  History and Formal Models for Internet Mail Identifiers
     2.1.  The Core History
     2.2.  Multipurpose Internet Mail Extensions and Uniform
           Resource Identifiers
     2.3.  Email Address Internationalization
     2.4.  The Data Model
       2.4.1.  Email Address
       2.4.2.  Message-ID
     2.5.  Equivalence and Comparison
       2.5.1.  Email Address
       2.5.2.  Message-ID
   3.  Regular Expressions for Email Addresses
     3.1.  Deliverable Email Address
       3.1.1.  ASCII Building Blocks
       3.1.2.  Deliverable Email Address
       3.1.3.  (Leftover from draft-00) Basic Rules of Derivation
               (Unicode)
       3.1.4.  Complete Expression for Deliverable Email Address
       3.1.5.  Using Character Classes
       3.1.6.  "Flotsam" and "Jetsam" Beyond ASCII
       3.1.7.  Certain Expressions for Restrictions
       3.1.8.  Unquoting Local-Part
       3.1.9.  Quoting Local-Part
     3.2.  Modern Email Address
     3.3.  Legacy Email Address
     3.4.  Algorithms for Detecting Email Addresses
     3.5.  Handling Domain Names
   4.  Regular Expressions for Message-IDs
     4.1.  Modern Message-ID
     4.2.  General Message-ID
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Appendix A.  Test Vectors
   Appendix B.  Change Log
   Authors' Addresses

1.  Introduction

   Internet Mail is everywhere. This fact of modern connected life is
   so self-evident that [RFC5598] says: "In practical terms, an email
   address string has become the common identifier for representing
   online identity." [MTECHDEV] acknowledges that email "has been one
   of the major technical and sociological developments of the past 40
   years." Whether it is joining a social network, participating in a
   forum, blogging, paying taxes, buying products, conducting
   professional correspondence, or communicating with loved ones, one's
   email address forms the cornerstone or backstop (frequently both) to
   these methods of communication. Internet mail is not only
   ubiquitous: it is essentially free for all users connected to the
   Internet.

   Yet it is surprising how fragile or cavalier many systems are with
   their treatment of Internet Mail identifiers, namely with email
   addresses.
   Prominent government agencies, financial institutions,
   and even major mail services reject a variety of forms that are in
   wide use by many user communities. [[NB: do a survey.]] For example,
   in the intervening time between IETF 94 and the submission of this
   Internet-Draft, the author interacted with no fewer than 25 different
   web or other Internet-connected services that rejected or mangled his
   perfectly valid email addresses. The result is a pernicious and
   creeping degradation of mail service and of the usability of the
   Internet Mail infrastructure, resulting in undelivered mail,
   misdelivered mail (which can constitute a security vulnerability),
   and denial of service.

   The Internet Mail standards, like the mail system, have evolved over
   time and have been modified to accommodate volumes and scenarios far
   beyond their original design goals. Furthermore, some identifier
   forms have been restricted over time as certain syntaxes were
   determined to be harmful, arcane, or just plain useless. [[So while
   not "blame", some responsibility or causation lies with these
   standards, which go out of their way to balance backwards-
   compatibility, complexity, completeness, and flexibility at the
   expense of a simple and widely-implementable addressing format.]]

   This document prescribes normative regular expression (regex)
   patterns for all Internet-connected systems to use when validating or
   parsing Internet Mail identifiers. Attention specifically focuses on
   "email address" (the specification for the string commonly associated
   with a single mailbox at a single named entity), and Message-ID,
   which share nearly identical syntax, but have different use cases and
   semantics. First, the history of Internet Mail is traced to build a
   coherent data model for Internet Mail identifiers. Second, relevant
   expression formats are discussed.
   Third, expressions to fit the
   identifiers in a variety of computing contexts are developed and
   presented. The overall goal of this document is to establish cut-
   and-dried algorithms that developers can incorporate directly into
   their mail-using products (including web browsers, form validators,
   and software libraries), replacing current ad-hoc (and oftentimes
   atrociously inconsistent) approaches with standardized behavior.

1.1.  Normative Effects

   This document is proposed with either Informational or Best Current
   Practice status [RFC1818] for all users and systems of Internet Mail
   identifiers, which basically means everyone connected to the
   Internet, other than the mail infrastructure itself. The Internet
   Mail infrastructure has been standardized by (and continues to be
   standardized by) other documents, most notably [RFC5321] and
   [RFC5322]. Therefore, implementers developing mail systems MUST rely
   on those standards when building interoperable mail systems. At the
   same time, the text of this specification has been [[NB: will be]]
   carefully vetted by [[the IETF]] so that implementers SHALL be able
   to rely on it as a normative reference. Whether designing a new
   standard or implementing a new system that uses Internet Mail
   identifiers for some other purpose (e.g., as usernames, security
   principals, or keys in a database), relying parties can "copy-and-
   paste" the expressions in this document to parse, validate, compose,
   and process Internet Mail identifiers, rather than relying on
   homegrown solutions.

   Internet Mail has evolved over forty years, and will undoubtedly
   continue to evolve over time. This document does not constrain that
   development process. Actually, it is expected that expressions in
   this document will be updated to match changes in Internet Mail.

1.2.  Definitions

   The terms "email address" and "address" (without qualification) refer
   to the string commonly associated with a single mailbox at a single
   named entity. In this document, the prose text always qualifies
   these terms with the source document when using a different sense.

   The term "Message-ID" (without qualification) refers to the globally
   unique string composed of a left part, the at-sign (@), and a right
   part, that is used to identify a single message. In this document,
   the prose text always qualifies this term when using a different
   sense. The term "Message-ID field", for example, refers to the
   Message-ID Header Field, which includes the characters "Message-ID:"
   and the surrounding "<" and ">" angle brackets.

   Unquoted all-capital symbols in prose text have the meanings
   specified in [I-D.seantek-abnf-more-core-rules].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   [RFC2119].

   This document provides expressions that can be used directly in
   popular computing platforms. An important subset of email address
   syntaxes, namely deliverable email addresses, can be described in a
   regular language [[CITE]]. A regular language is a language
   recognized by (and computable with) a finite automaton. [[CITE:
   Kleene's Theorem]] However, the full syntax of email addresses
   requires a context-free language (i.e., governed by ABNF) and a
   pushdown automaton.

   The term "regular expression" refers to a sequence of characters
   that defines a search pattern for string matching. Originally the
   term referred to expressions that described regular languages in
   formal language theory.
   Formal regular expressions are limited to:

   o  empty set and empty string
   o  literal character
   o  concatenation
   o  alternation
   o  repetition (Kleene star)

   Modern-day libraries support expressions that far exceed the regular
   languages. Some libraries even support capabilities that exceed the
   context-free languages. However, this document limits itself to
   truly regular grammars where possible, and where not possible, to
   context-free grammars. Implementers can therefore implement (or
   compile) these specifications on computing-constrained devices.

   The regular expressions in this document are intended to conform to
   the following de-jure or de-facto standards. Where expressions are
   given, they are annotated with single characters that refer to the
   standards to which they conform. [[NB: or, are intended to conform
   to, after further development.]]

   +---+---------+-----------------------------------------+-----------+
   | 1 | 4       | Title                                   | Ref       |
   +---+---------+-----------------------------------------+-----------+
   | P | PCRE:   | Perl Compatible Regular Expressions 2   | [PCRE2]   |
   |   |         | (version 10.21, January 12, 2016)       |           |
   |   |         |                                         |           |
   | E | (P)ERE: | POSIX Extended Regular Expressions      | [POS-ERE] |
   |   |         | (POSIX/IEEE Std 1003.1, Issue 7, 2013   |           |
   |   |         | Ed.) [[NB: could be ERE, PERE, or       |           |
   |   |         | P-ERE]]                                 |           |
   |   |         |                                         |           |
   | J | JSRE:   | JavaScript Regular Expressions          | [JSRE6ED] |
   |   |         | (ECMAScript/ECMA-262, 6th Ed., 2015)    |           |
   +---+---------+-----------------------------------------+-----------+

   [[TODO: Need to do something with UTS #18: Unicode Regular
   Expressions.]]

   Implementers should exercise caution when using a library that
   claims to be "Perl-Compatible" without actually being the bona-fide
   PCRE library: it may exhibit different or incomplete behavior.
   Implementers should also note that ERE and JSRE are fully implemented
   as alternative grammars in the std::regex library of C++11 and its
   successor, C++14. [[TODO: cite C++ standards.]] In the absence of a
   "live" regular expression library, the expressions in this document
   are easily compiled into automata (i.e., target language code) using
   well-studied algorithms.

   Surrounding delimiters (i.e., slashes) are omitted unless relevant to
   the proffered usage.

2.  History and Formal Models for Internet Mail Identifiers

2.1.  The Core History

   Internet Mail (also known as electronic mail, e-mail, email, or
   simply "mail") is an asynchronous, store-and-forward method of
   exchanging digital messages from an author to one or more recipients.
   [MTECHDEV] recounts the technical development of email, which is too
   voluminous to be repeated here. This section specifically focuses on
   the history of the identifiers used in email.

   When people think of an email address, is the
   quintessential example. However, addresses did not always look this
   way. Electronic mailing systems come from the earliest days of
   networking, before ARPANET. The specification for identifiers was
   defined by the networking project on a more-or-less ad-hoc basis.
   When ARPANET began, mail and file transfer were seen as important
   founding services. In fact, FTP was a significant transport
   mechanism for mail.

   By 1973, users of ARPANET came together to propose [RFC0524] and
   standardize [RFC0561] parts of the mail system. In these early
   specifications, the at-sign "@" was not used; instead, the term "AT"
   separated the left-side (user production) from the right-side (host
   production). Tokens (the word production, specifically) during that
   time were separated by SP, CR, and LF. The word production was
   defined at that time to be any ASCII character other than CR, LF, and
   SP.
   [RFC0524] only standardized the From header, not any recipient
   headers.

   In 1975, [RFC0680] proposed a more formal format for message
   fields beyond [RFC0561], including "receiver specification fields"
   (TO, CC, and BCC) and "reference specification fields" (MESSAGE-ID,
   IN-REPLY-TO, REFERENCES, and KEYWORDS) for the first time. The
   receiver fields used mailbox productions with user and host parts
   separated by "@" (without spaces), in contrast to the originator
   specification fields (specifically FROM and SENDER), which continued
   to use "AT" (with single spaces on either side). An [RFC0680]
   Message-ID was structured with a "Net Address" (presumably, a host
   name) in square brackets, followed by a line production (any
   characters other than CR or LF).

   The first real email standards that resemble modern usage were
   published in 1977: [RFC0724] and its revision, [RFC0733]. Those
   documents are historic and their formats are incompatible with most
   modern mail systems, but understanding them provides important
   insights into the structure of identifiers starting with [RFC0821]
   and [RFC0822]. The "@" character gained greater prominence, although
   "AT" (then lowercased to "at" in the specifications) was still
   supported. Importantly, [RFC0733] Message-ID was standardized to
   match the [RFC0733] email address format, and a uniqueness guarantee
   was added: "The uniqueness of the message identifier is guaranteed by
   the host which generates it." Essentially, this text implies that
   the right-hand part of the Message-ID is to be a hostname, or
   operatively associated with a hostname (i.e., not just a random
   string, but a unique (possibly random) string assigned by a host). At
   this point, Message-ID and email address specifications converged in
   the host-phrase production.

   [RFC0821] and [RFC0822] are the foundational RFCs for modern mail
   usage, and their revisions over the years retain the same basic
   structure and division between mail transfer (specifically, SMTP) and
   mail format. Jon Postel's invention of SMTP grew out of experience
   and disenchantment with the Mail Transfer Protocol (MTP) [RFC0772].
   The Simple Mail Transfer Protocol is indeed simple: it is structured
   like FTP but only has a limited set of commands: HELO, MAIL FROM,
   RCPT TO, DATA, QUIT, VRFY, and EXPN (and a few others not relevant
   for this discussion). The commands can take at most one argument.
   [RFC0821] describes a forward-path argument rather than an email
   address argument: the distinction is that after the username, a
   series of "@" host specifications designate the hops through which
   the message is supposed to travel before it reaches its destination.

   The main [RFC0821] production for an email address is the <mailbox>,
   which is defined as <local-part> "@" <domain>. The <local-part>
   production can be a <dot-string> or a <quoted-string>. In both
   cases, the full range of ASCII characters is actually permitted,
   although different characters must be backslash-escaped in different
   productions. The <domain> production is extremely broad and can
   "stack" domain-element components with the period "."; a domain-
   element component can be a hostname, "#" and a number, or an IPv4
   address enclosed in square brackets "[" and "]".

   [RFC0822] introduced the "addr-spec" ABNF production, which is that
   series' term for an email address. While a [RFC0822] route-addr
   production can include a source route (aka forward-path with multiple
   hosts), addr-spec is noted to be a global address, with the right
   side being a "domain" production. This definition predated the first
   DNS standards [RFC0882] and [RFC0883], although it was clearly
   designed with DNS in mind.
   The [RFC0822] ABNF permits a domain to include
   multiple domain-literal productions (i.e., bracketed) separated by
   "."; however, the accompanying text basically obviates such
   productions. As [RFC0822] predated the widespread implementation of
   DNS, various systems would spread routing information between the
   local-part and domain productions (see Section 6.2.5 of [RFC0822]).
   [RFC0822] discusses local-part syntax extensively, including examples
   of comment productions that are supposed to be ignored semantically
   (see Section A.1 of [RFC0822]).

   Section 3.4.7 of [RFC0822] describes the constituent components of an
   address as requiring "preservation of case information", which is
   slightly different from saying "case-sensitive" outright (although
   the latter is strongly implied). The main historical point to glean
   is that intermediate mail systems were supposed to transit the local-
   part AS-IS without modification, so that the destination system--and
   only the destination system--would parse it.

   [RFC0822] assigns specific semantics to Message-ID but is light on
   syntax: the msg-id production is just addr-spec enclosed in mandatory
   "<" and ">".

   The widespread deployment of DNS [RFC0973] (later [RFC1034] and
   [RFC1035], with [RFC1123] relaxation) dramatically changed the
   Internet messaging landscape of the late 1980s. [[TODO: complete.]]
   With respect to email addresses and Message-IDs, it became obvious
   that the right-hand side of the "@" was supposed to represent a
   fully-qualified domain name at which Mail eXchanger records are
   located [[TODO: cite]]. Other networks and host-naming formats
   became obsolete by the mid-1990s.

   The 2001 standards [RFC2821] and [RFC2822] reflect the explosive
   growth and diverse deployments of Internet mail. The work was
   undertaken in the DRUMS (Detailed Revision/Update of Message
   Standards) working group between 1995 and 2001.
   [RFC2822] introduced
   the "obs-" prefix in its ABNF, along with Section 4, Obsolete Syntax.
   Essentially, [RFC2822] prescribes a generation format that is much
   stricter than the parsing format, but still demands that conforming
   implementations understand the parsing format. Therefore, the U+0000
   NULL character that is considered obsolete in [RFC2822] must still be
   considered part of the data model of email addresses. The syntax of
   [RFC2821] is tighter than that of [RFC2822], and a careful read makes
   it apparent that the underlying address formats diverged in the
   intervening years. Specifically, [RFC2822] is much more concerned
   with historical forms than [RFC2821], which is more about
   contemporary transmission behavior. At the same time, [RFC2821] does
   not actually prohibit a wide range of C0 control characters, which
   still remain part of [RFC2821]'s data model. [[TODO: complete
   history of RFC 282x.]]

   The most recently completed revisions to the base mail standards,
   [RFC5321] and [RFC5322], were worked on between 2005 and 2008. Email
   addresses saw further character restrictions, namely around the
   entire range of C0 control characters, which [RFC5321] explicitly
   prohibits. [[TODO: complete history of RFC 532x.]]

2.2.  Multipurpose Internet Mail Extensions and Uniform Resource
      Identifiers

   [[Message-ID and Content-ID. Therefore this document applies with
   equal normative force to Content-IDs, and to mid: and cid: URIs that
   use them.]]

2.3.  Email Address Internationalization

   As the Internet became a ubiquitous feature of modern life, and as
   email followed it, users in various countries called for identifiers
   that were usable in their native languages. The IETF [[worked on
   this]] in the email address internationalization (EAI) effort,
   culminating in [RFC6530], [RFC6531], and [RFC6532].
   The key changes
   for identifiers were to expand the character repertoire of email
   addresses, and anything looking like email addresses (e.g., Message-
   ID), to include UTF-8 octets. As UTF-8 can encode any Unicode scalar
   value (but not any Unicode code point), the practical result is that
   addresses and Message-IDs can contain (almost) any Unicode scalar
   value.

   A practical result of EAI combined with MIME and MIME URIs is that
   MIME URIs now are UTF-8 encoded in the pct-encoded production.

2.4.  The Data Model

   Parties that rely on this document SHALL interpret the semantically
   meaningful parts of Internet Mail identifiers as follows.

2.4.1.  Email Address

   An email address is composed of a local-part and a domain, separated
   by "@". The parsed local-part is a sequence of Unicode scalar
   values, and can be represented by any well-formed Unicode encoding.
   [[NB: The following may be CONTROVERSIAL!!]] The parsed domain is a
   sequence of restricted Unicode scalar values that represent some
   identifier for some host on the network. The parsed domain can be:

   1.  a hostname of any kind, including (but not limited to) a DNS
       name;

   2.  an IPv4 address literal;

   3.  an IPv6 address literal;

   4.  an address literal, composed of a standardized tag and a
       sequence of ASCII characters.

   Conveniently, the domain string subtypes can be combined into a
   single well-formed Unicode string, discriminated as follows:

   1.  If the string begins with "IPv6:", it is a type 3 IPv6 address,
       and the remainder had better be a valid IPv6 address in textual
       form.

   2.  If the string begins with Ldh-str and a colon ":", it is a type 4
       address literal, and the remainder had better be dcontent (which
       notably is not supposed to contain characters beyond ASCII).

   3.  If the string has four sets of digits 0-255 separated by dots,
       then it is an IPv4 address.

   4.  Otherwise, it had better be a domain name (i.e., composed of NR-
       LDH labels and U-labels, separated by dots). [[NB: [RFC1912]
       says a label can't be all-numeric, but then it catalogs some
       exceptions.]]

   5.  Finally, it is "some random Unicode string" that is syntactically
       valid under the most expansive rules, but is not useful for
       delivering or reporting on Internet mail.

2.4.2.  Message-ID

   A Message-ID is composed of a left part (id-left) and a right part
   (id-right), separated by "@". The id-left is a sequence of Unicode
   scalar values, and can be represented by any well-formed Unicode
   encoding. [[NB: The following may be CONTROVERSIAL!!]] The id-right
   is a sequence of Unicode scalar values, and can be represented by any
   well-formed Unicode encoding.

   (A Content-ID [RFC2045] has the same composition as a Message-ID.)

2.5.  Equivalence and Comparison

2.5.1.  Email Address

   Two email addresses are equivalent if the parsed local-part and
   parsed domain values are equivalent.

   Two parsed local-parts are equivalent if their Unicode scalar values
   are equal.

   The special values "postmaster", "abuse", and [TODO: fill out] SHALL
   be compared case-insensitively [TODO: full-width characters? Unicode
   case folding? etc.].

   [Case sensitivity] In all other cases, two parsed local-parts may be
   equivalent if the [TODO: receiving MTA] delivers mail addressed to
   them to the same mailbox. There is no algorithmic comparison to
   determine said equivalence.

   Two parsed domains are equivalent if both have the same type, and the
   values are equivalent. Additionally, an IPv6 address literal is
   equal to an IPv4 address literal when the IPv6 address is an
   "IPv4-mapped IPv6 address", and its IPv4 component equals the IPv4
   address of the IPv4 address literal.

   1.  hostnames (domain names): equal [TODO: RFC-REF, IDNA, RFC1034,
       RFC1035, etc.].

   2.  IPv4 addresses: equal [TODO: RFC-REF].

   3.  IPv6 addresses: equal [TODO: RFC-REF].

   4.  address literal: the standardized tag is equal (case-
       insensitive), and the address value is octet-for-octet equal
       (case-sensitive), or is equal per the rules standardized by the
       standardized tag registration.

2.5.1.1.  Case sensitivity of local-part

   Of all equivalence issues, no issue generates more confusion and
   dissent in the email community than the case sensitivity of the
   local-part. Formally, local-part is case-sensitive. A significant
   fraction of installed mail servers treat local-part as case-
   insensitive in the ASCII range. (At the time of this writing, EAI
   has not been implemented widely enough to make statements about case
   insensitivity for characters beyond ASCII.) [[TODO: survey and
   statistics.]] Furthermore, many systems [[TODO: quantify]] outside of
   the Internet Mail infrastructure compare the local-part of email
   addresses case-insensitively.

   Historically, local-parts were case-sensitive because of MULTICS.
   However, as time went on they became case-preserving: receiving
   systems would not register additional mailbox names (i.e., local-
   parts) if the proposed mailbox name differed from an existing mailbox
   name only by case.

   [[TODO: develop recommendation about how wise or unwise it is to go
   one way or another.]] One possible approach is to define an
   additional output, "conditionally equivalent", of the equivalence
   algorithm. Therefore an implementation conforming to this document
   SHALL output one of three states: equivalent, not equivalent, and
   "conditionally equivalent", that is, equivalent if and only if local-
   part is compared case-insensitively in the ASCII range [[TODO: should
   we say in the full Unicode range?]]. Applications implementing this
   algorithm SHOULD NOT treat such a state as "equivalent".
For example, a user-facing application SHOULD treat this state as a
"warning" that requires further intervention.

In modern times, email addresses tend to be emitted [[TODO:
statistics]] in all-lowercase, when case is normalized. Therefore,
applications implementing this algorithm [[TODO: document]] that are
aware that the receiving MTA is case-insensitive, as well as
applications implementing this algorithm that receive input that is
case-ambiguous (such as voice input), SHOULD record the local-part in
all-lowercase unless presented with evidence to the contrary.

The EAI standards (RFCs 6530-6532) make no mention of case
sensitivity issues for characters beyond the ASCII range. Permitting
Unicode scalar values (i.e., UTF-8) opens up a whole range of
comparison issues with potentially far-reaching identity and security
implications. [[TODO: discuss case preservation and sensitivity
issues in characters beyond the ASCII range. The bottom line is, use
what you're given, don't mess with it, hope for the best.]]

2.5.2. Message-ID

Two Message-IDs are equivalent if the parsed id-left and parsed
id-right values are equivalent.

Two parsed id-left values are equivalent if their Unicode scalar
values are equal.

[[NB: controversial!]] Two parsed id-right values are equivalent if
their Unicode scalar values are equal.

[[TODO: The case sensitivity of id-right has not been fully explored
in any standard to date. To the extent that id-right represents a
domain name, there is a strong argument to treat id-right as
case-insensitive in the ASCII range. Standardized tags are probably
case-insensitive based on the ABNF of [RFC5321] relating to "IPv6".
The rest is kind of up for grabs.
The bottom line is, if you intend to match the Message-ID of an
existing message, don't take chances: just copy it verbatim into the
destination.]]

(Two Content-IDs are equivalent under the same rules. However, a
Content-ID and a Message-ID are never equal to each other; if the
same value does appear as both, the generator is in error, because
both Content-ID and Message-ID are supposed to be "world-unique"
[RFC2045] [RFC5322].)

3. Regular Expressions for Email Addresses

[[Valid email address vs. deliverable email address]] A "deliverable
email address" complies with the modern production rules of [RFC5321]
and [RFC6531]. A deliverable email address SHALL have a domain part
that is a domain name (Section 2.3.5 of [RFC5321]); it SHALL NOT have
a domain part that is an address literal (Section 4.1.3 of [RFC5321])
or a host name that does not comply with domain name rules (see
[RFC1034], [RFC1035], [RFC5890] et seq.). [[TODO: justify.
Basically experiments have borne out that many mail systems will not
accept user@[3.3.3.3] as a RCPT TO. Technically it is valid per RFC
5321, but practically any receiving MTA that handles more than one
MX/domain will have difficulty figuring out which domain-specific
mailbox should receive the mail. The author tried a couple of
popular MTAs.]] Systems that use email addresses with an expectation
of SMTP delivery SHALL accept productions that comply with this
document.

Email addresses that do not meet these modern production rules may
nevertheless be valid under the other modern (e.g., [RFC5322]) or
legacy (e.g., [RFC0821] and [RFC0822]) production rules.

Systems that recognize "modern email addresses" in new corpora (e.g.,
in text editors) SHALL accept productions that comply with this
document.
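As a rough illustration of the distinction drawn at the start of this
section, the following Python sketch classifies the domain part of an
address as an address literal or a domain name and treats only the
latter as a candidate for delivery. It is not one of this document's
expressions: the names domain_kind and is_deliverable_candidate are
hypothetical, the label check is a simplified ASCII-only assumption,
the local-part is not validated, and strings that merely look like
IPv4 addresses are not excluded.

```python
import re

# Assumption: simplified ASCII LDH label, 1 to 63 characters,
# with no leading or trailing hyphen.
_LABEL = r"[0-9A-Za-z](?:[-0-9A-Za-z]{0,61}[0-9A-Za-z])?"
_DOMAIN_NAME = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def domain_kind(address: str) -> str:
    """Return 'address-literal', 'domain-name', or 'other' for the domain part."""
    # Split at the last "@", since a quoted local-part may itself contain "@".
    _, sep, domain = address.rpartition("@")
    if not sep:
        return "other"
    if domain.startswith("[") and domain.endswith("]"):
        return "address-literal"
    if _DOMAIN_NAME.fullmatch(domain):
        return "domain-name"
    return "other"


def is_deliverable_candidate(address: str) -> bool:
    """Hypothetical helper: only a domain-name domain part qualifies."""
    return domain_kind(address) == "domain-name"
```

Under these assumptions, user@[3.3.3.3] is classified as an address
literal and is therefore not a delivery candidate, even though it is
syntactically valid per [RFC5321].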
Systems that recognize "legacy email addresses" in existing corpora
(e.g., in email messages or documents predating this document) SHALL
accept productions that comply with this document.

3.1. Deliverable Email Address

A deliverable email address is an email address that can be used to
deliver messages over the modern SMTP infrastructure. This has
important implications for the domain part, because the domain part
MUST represent a domain name that complies with contemporary rules
and regulations, such as [RFC5890].

3.1.1. ASCII Building Blocks

The following rules are amalgamated from the SMTP and Internet
message format standards [RFC5321] [RFC5322]. All expressions are
PCRE2-compatible.

   (?(DEFINE)
    (?#local)
    (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
    (?<dot_string>(?&atext)+(?:\.(?&atext)+)*)
    (?<qtext>[ !#-\[\]-~])
    (?<quoted_pair>\\[ -~])
    (?<qcontent>(?&qtext)|(?&quoted_pair))
    (?<quoted_string>"(?&qcontent)*")
    (?<local_part>(?&dot_string)|(?&quoted_string))
    (?#domain)
    (?<oct>0*(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]?))
    (?<IPv4>(?&oct)(?:\.(?&oct)){0,3})
    (?<sub_domain>[0-9A-Za-z](?:[\-0-9A-Za-z]{0,61}[0-9A-Za-z])?)
    (?<domain>(?!(?&IPv4)[^\-.0-9A-Za-z])(?&sub_domain)
              (?:\.(?&sub_domain))*(?![\-.0-9A-Za-z]))
   )

3.1.2. Deliverable Email Address

A deliverable email address matches the <Mailbox> production of
[RFC5321].

   (?&local_part)@(?&domain)

   (?&local_part)@(?&domain)(?
is not sufficient to ensure a safe production, because the domain
production might end after consuming 63 characters of a putative
sub-domain, leaving the excess characters of that same putative
sub-domain hanging at the end. (This is not an issue when the entire
string is an address, in which case $ definitively terminates the
string.) The way to deal with this is the negative lookahead at the
end of <domain>, which fails to match when additional domain
characters are present.
This negative lookahead also deals with some unpleasant corner cases
when the <domain> is a partial IPv4 address.

The aforementioned patterns also take into account [RFC1912], namely,
"[l]abels may not be all numbers". In practice, software first tries
to parse the putative domain name as an IPv4 address: if that
succeeds, the string is treated as an IPv4 address; otherwise, it is
treated as a DNS name. Therefore, character sequences that are valid
IPv4 addresses need to be excluded. For example, "1.3.4.255" is
invalid, but "1.3.4.256" and "1.3.4.255.1" are valid, because they do
not parse as IPv4 addresses. (411.org is valid for obvious reasons.)

Borderline cases are partial IPv4 addresses, such as "411" (could be
0.0.1.155) and "1.411" (could be 1.0.1.155). The regular expressions
above accept these domain productions, but they may not be safe. The
regular expressions above also accept hex form, such as "0xef" (could
be 0.0.0.239).

3.1.3. (Leftover from draft-00) Basic Rules of Derivation (Unicode)

The following rules are amalgamated from the SMTP standards [RFC5321]
and [RFC6531], the foundational DNS standards [RFC1034], [RFC1035],
and [RFC1123], and the modern IDNA standards [RFC5890], [RFC5891],
and [RFC5892].

[[NB: The syntax below assumes that Perl Compatible Regular
Expressions 2 [PCRE2] is being used, such that \xnn and \x{...} refer
to valid Unicode scalar values, i.e., well-formed UTF-8 sequences.
Specifically, the surrogate range U+D800-U+DFFF is omitted. Although
listed as JSRE and ERE-compatible, these expressions will need to be
massaged somewhat to handle the Unicode-referencing differences.]]

[[TODO: There need to be two forms: a form that matches a complete
string, so the regular expression can start with ^ and end with $.
This will make the execution pretty fast. A second form can match
any string in a block of text.
This will be much more intensive because ^ and $ cannot be used;
instead the boundary could be delimited by any of a HUGE yet
noncontiguous quantity of characters beyond ASCII, such as fullwidth
punctuation, spaces of various kinds, etc.]]

P E J  atext = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]

P E J  dot-string = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
        (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*

P E J  qtext = [ !#-\[\]-~\xA0-\x{10FFFF}]

P E J  quoted-pair = \\[ -~]

P E J  qcontent = [ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~]

P E J  quoted-string = "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

P E J  local-part = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
        (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*|
        "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

; RFC 5890, must contain at least one non-ASCII character
; TODO: express other constraints such as in Protocol document [RFC5891]
; and Tables document [RFC5892]
; WARNING: intentional omission: dot . is included in this production--
; it should not be
       U-label = [[TOO COMPLEX TO DO RIGHT NOW...]]
P E J  U-label = [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*

P E J  sub-domain = [A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
        [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+
        [\x00-\x{10FFFF}]*

P E J  domain = (?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
        [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
        (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
        [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
        [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*

3.1.4. Complete Expression for Deliverable Email Address

The following regular expression matches a deliverable email address:

; Mailbox from RFC 5321, as amended
P E J  DEA = ([A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
        (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*|
        "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*")@
        ((?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
        [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
        (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
        [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
        [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*)

In the regular expression DEA, capturing group 1 is the local-part
production, and capturing group 2 is the domain production.

3.1.5. Using Character Classes

[[TODO: provide expressions that use character classes, and explain
the benefits and tradeoffs.]]

3.1.6. "Flotsam" and "Jetsam" Beyond ASCII

As mail usage is international in scope, modern mail and mail
identifier-using systems MUST support Unicode EAI identifiers.
Unfortunately, rigorously following the EAI specifications [RFC6530],
[RFC6531], and [RFC6532] will lead to (possibly) unforeseen text
parsing problems, where naive (or strictly conforming) parsers will
tend both to overconsume and underconsume non-ASCII text surrounding
an otherwise "obvious" e-mail address. The problem is described in
this section, while the following Section 3.1.7 provides at least
some partial mitigations.

The right-hand side of a deliverable email address is a domain name.
A conforming parser may well overconsume text on the right-hand side,
aka "jetsam", that cannot possibly be in a domain name, such as
non-ASCII punctuation, spaces, control characters, and noncharacter
code points. On the left-hand side (local-part), for better or for
worse, the atext production of local-part has been extended in both
[RFC6531] and [RFC6532] to accept any Unicode character beyond the
ASCII range.
Therefore, a whole slew of "flotsam" can get validly prepended to an
internationalized email address. Apparently the only way to "stop"
this is to quote the local-part.

Characters will get inconsistent treatment depending on the end at
which they appear. For example: the characters U+FF1C FULLWIDTH
LESS-THAN SIGN and U+FF1E FULLWIDTH GREATER-THAN SIGN are classified
as [Sm] aka "Symbol, Math", which are DISALLOWED under [RFC5892].
Therefore the presence of U+FF1E on the domain end will terminate the
address (for a properly implemented regular expression), but the
corresponding presence of U+FF1C on the local-part end will NOT
terminate the address! A conforming parser would actually start much
earlier in a blob of text (possibly as early as the beginning of a
new line) and match all characters up to the @ delimiter, blowing
straight through U+FF1C.

3.1.7. Certain Expressions for Restrictions

Email addresses incorporate internationalized domain names, so the
complex and confusing rules of IDNs apply directly to the right-hand
side of deliverable email addresses.

[[TODO: integrate these expressions into the main expressions of
3.1.2.]] The following expressions apply to an individual sub-domain
production:

[[NB: open issue if length should be restricted. Author believes it
should be length-restricted, because overlong labels in domain names
mean the address can't be looked up, and therefore, the address is
not deliverable.]]

The following lookahead regular expressions apply on a per-label
(i.e., per-sub-domain) basis.
P E J  (?=[^.]{1,63}(?:\.|$))
       restricts to 63 characters

P E J  (?=[0-z]{1,63}(?:\.|$))
       restricts to 63 LDH characters

P E J  (?=[0-z\xA0-\x{10FFFF}]{1,56}(?:\.|$))(?!..--)
       (?=[0-z\xA0-\x{10FFFF}]{0,55}[\xA0-\x{10FFFF}][0-z]{0,55}(?:\.|$))
        or
       (?=[0-z\xA0-\x{10FFFF}]{1,56}(?:\.|$))(?!..--)
       (?![0-z]{1,56}(?:\.|$))
       Enforces that for U-labels (where at least one non-ASCII char
       is present), there cannot be more than 55 chars. The equivalent
       ACE will include xn-- PLUS -2rh, which means 7 extra characters
       (8 chars minus 1 Unicode char). Did a test with U+00DE.
       Also not allowed to have HYPHEN HYPHEN in the third and fourth
       positions (Section 4.2.3.1 of [RFC5891]).

[[NB: Should we permit fullwidth and other dots (not just ASCII
dot)?]]

P E J  (?![0-9]{1,63}(?:\.|$))
       restricts out all-numeric labels [RFC1912]
       [[TODO: Would it be more accurate to say that the all-numeric
       labels 0-255 are prohibited, but 256+ are permissible?]]

[[NB: The end-of-address production \.|$ must also include the
possibility of any number of non-domain name characters, when
searching through an arbitrary block of text.]]

[[TODO: the flotsam on the local-part end is potentially a big
problem, unless we say that deliverable email addresses MUST be
delimited on the left-hand side by the ASCII character < or other
well-established characters that are not in
dot-atom-text/dot-string.]]

3.1.8. Unquoting Local-Part

The following regular expressions can be used to unquote the
local-part production.

P sed  s/^"(.*)"$/$1/
       Applies when the local-part is isolated from @domain.
       Removes surrounding quotations.

P sed  s/\\(.)/$1/g
       Applies when the local-part is isolated from the surrounding
       quotations. (Safe to use with dot-string aka dot-atom-text,
       since backslashes are not present.) Unquotes quoted-pairs.
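As a rough illustration, the two substitutions above can be sketched
in Python as a single helper. The function name unquote_local_part
is hypothetical, and the sketch assumes the local-part has already
been isolated from @domain; it does not check that the input is a
valid quoted-string.

```python
import re


def unquote_local_part(local_part: str) -> str:
    """Strip surrounding quotation marks, then unquote quoted-pairs.

    A local-part that is not wrapped in quotation marks (a dot-string)
    is returned unchanged.
    """
    # Step 1: remove the surrounding quotation marks, if present.
    wrapped = re.fullmatch(r'"(.*)"', local_part, flags=re.DOTALL)
    if wrapped is None:
        return local_part
    # Step 2: replace each backslash-escaped character with the character itself.
    return re.sub(r"\\(.)", r"\1", wrapped.group(1), flags=re.DOTALL)
```

Keeping the two steps separate means the second substitution only
ever runs on the content between the quotation marks.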
P J    /^"|(?=.+"$)\\(.)|"$/g (with "$1" replacement string)
P sed  s/^"|(?=.+"$)\\(.)|"$/$1/g
       Applies when local-part is isolated from @domain.
       Removes surrounding quotations and unquotes quoted-pairs
       in one loop by using lookahead.

[[TODO: add more detailed descriptions of operations.]]

3.1.9. Quoting Local-Part

Given a Unicode string that represents the unquoted local-part, the
following regular expressions can be used to create a quoted
production.

J      /(?=[^"\\]*["\\])(["\\])/\\$1/g
       Quote each and every " and \.
J      /^(?![A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
       (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*$).*$/"$&"/
       If the string does not conform to dot-string (including,
       e.g., the presence of consecutive dots), surround the
       entire string with quotations.

3.2. Modern Email Address

A modern email address is an email address that conforms to
[RFC5322], except for the ABNF productions in that standard that are
marked as "obsolete". For example, control characters are excluded.
Modern email addresses are a superset of deliverable email addresses
[RFC5321]. Since modern email addresses are not necessarily
deliverable by SMTP, the domain production does not need to conform
to DNS rules. This relaxation makes the regular expressions much
simpler. On the other hand, modern email addresses permit embedded
comments and folding whitespace, requiring the use of a pushdown
automaton.

   (?(DEFINE)
    (?#whitespace)
    (?<FWS>(?:[\t ]*\r\n)?[\t ]+)
    (?<CFWS>(?:(?&FWS)?(?&comment))+(?&FWS)?|(?&FWS))
    (?#local)
    (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
    (?<dot_atom_text>(?&atext)+(?:\.(?&atext)+)*)
    (?<ctext>[!-'*-\[\]-~])
    (?<ccontent>(?&ctext)|(?&quoted_pair)|(?&comment))
    (?<comment>\((?:(?&FWS)?(?&ccontent))*(?&FWS)?\))
    (?<qtext>[ !#-\[\]-~])
    (?<quoted_pair>\\[ -~])
    (?<qcontent>(?&qtext)|(?&quoted_pair))
    (?<quoted_string>(?&CFWS)?"(?:(?&FWS)?(?&qcontent))*(?&FWS)?"(?&CFWS)?)
    (?<local_part>(?&CFWS)?(?&dot_atom_text)(?&CFWS)?|(?&quoted_string))
    (?#domain)
    (?<dtext>[!-Z^-~])
    (?<domain>(?&CFWS)?(?:(?&dot_atom_text)|
              \[(?:(?&FWS)?(?&dtext))*(?&FWS)?\])(?&CFWS)?)
   )

A modern email address matches the <addr-spec> production of
[RFC5322], without any obsolete parts.

   (?&local_part)@(?&domain)

   ^(?&local_part)@(?&domain)$

3.3. Legacy Email Address

[[TODO: expand regular expressions. Arguably, characters beyond
ASCII need not be included. Therefore domain should be MUCH simpler:
the name form should be restricted to LDH-strings.]]

3.4. Algorithms for Detecting Email Addresses

As Section 3.1 indicates, compiling and executing a true and correct
regular expression for an email address (deliverable, valid,
historic) will be complicated and time-consuming. More efficient
algorithms are desirable.

o  Scan text buffer for "@".

o  Evaluate characters after "@" for domain production.

o  Evaluate character prior to "@".

o  If prior character is <">, scan backwards for initial (unescaped)
   <">; evaluate characters in between to match quoted-string
   production.

o  Otherwise if prior character is a valid atext, consume characters
   backwards while evaluating for dot-string. (The dot-string
   production is palindromic.)

[[TODO: add other suggested algorithms.]]

[[Splitting valid email address into local-part and
right-hand-side.]]

3.5. Handling Domain Names

[[TODO: discuss the issues with handling domain names.]]

4. Regular Expressions for Message-IDs

Message-ID values form a disjoint set from email address values,
i.e., a Message-ID that also happens to be an email address is just a
coincidence.

The productions that comprise Message-ID are called id-left and
id-right.

4.1. Modern Message-ID

A modern Message-ID is one that complies with the strict generation
rules of [RFC5322].
In particular: id-left is only dot-atom-text (as amended by
[RFC6532]), and id-right is dot-atom-text or no-fold-literal (as
amended by [RFC6532]). Notably, virtually any Unicode scalar value
is permissible in id-right, because [RFC6532] does not import U-label
(unlike [RFC6531]). The resulting regular expressions will therefore
be more expansive, at the cost of accepting characters, such as
fullwidth punctuation, that would otherwise delimit Message-IDs on
both ends in text.

The regular expressions reuse many of the subroutines of Section 3.1.
[[POINTER: obs-id-left and obs-id-right are supersets of their modern
forms, so deliverable email address regular expressions may well be
reused directly.]]

   (?(DEFINE)
    (?#id-left)
    (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
    (?<dot_atom_text>(?&atext)+(?:\.(?&atext)+)*)
    (?<id_left>(?&dot_atom_text))
    (?#id-right)
    (?<dtext>[!-Z^-~])
    (?<id_right>(?&dot_atom_text)|\[(?&dtext)*\])
   )

A modern Message-ID matches the <msg-id> production of [RFC5322],
without any obsolete parts.

   (?&id_left)@(?&id_right)

   ^(?&id_left)@(?&id_right)$

4.2. General Message-ID

A "general Message-ID" is one that complies with any of the mail
rules.

[[TODO: complete.]]

5. Security Considerations

Internet Mail identifiers are important for identifying users and
other principals in security systems. While a user's login token and
an email address are formally separate entities, many common
Internet-connected systems conflate the two. Systems that accept
email addresses as login tokens (in particular, other systems' email
addresses, rather than their own) SHALL accept the full range of
valid email addresses. To do otherwise is to act as a denial of
service against legitimate users with legitimate mailbox names.
When a user forgets his or her password or other login credentials,
the most common recovery method on the Internet is to send a recovery
message to the user's registered [[email address]]. Preventing users
from using their chosen or assigned addresses acts as a denial of
service.

Because a local-part can contain almost any Unicode scalar value,
security issues are essentially pushed from clients to servers and to
registration processes. I.e., it is up to a server implementation to
decide whether to accept an arbitrary Unicode string for registration
or for delivery with SMTP, folding or normalizing input at its
discretion. A robust server implementation needs to handle arbitrary
input gracefully.

In contrast, when a domain part represents a domain name, the string
is severely restricted by the IDNA documents [RFC5890] et seq. A
server is perfectly within its rights to reject input that is not in
NFC or contains disallowed characters. Specifically, since the
domain part is the key to retrieving the MX resource record, the
Internet Mail standards hardly get involved. These restrictions put
more onus on clients to validate strings. However, integrating the
entire static list of [RFC5892] into regular expressions would be
unduly burdensome on many implementations. While an implementation
can consider using character classes, the risks and benefits of using
character classes need to be carefully considered.

Character classes represent a shorthand for certain ranges of
characters based on Unicode properties. Effectively the total
system's state table remains the same, but the complexity is pushed
from one component (the regular expression definition) to another
component (the table representing the character classes).
Ultimately, the regular expression character classes in popular
formulations will derive from the Unicode Standard [UNICODE]
definitions such as UnicodeData.txt. Since a proper IDNA-enabled DNS
library needs to keep track of these character classes anyway,
referencing these ranges by character classes should not add much to
the image size. However, the regular expression (which previously
was self-contained) now has an external dependency on data that may
(and probably will) change frequently, including (for example) the
ranges of unassigned code points. What could have been a valid email
address one month may be invalid the next month, or vice versa,
simply depending on the version of the regular expression or DNS
library that an implementation depends on. Implementers should
therefore evaluate their own needs for security and stability in
picking particular regular expression forms.

6. References

6.1. Normative References

[I-D.seantek-abnf-more-core-rules]
           Leonard, S., "Comprehensive Core Rules and References for
           ABNF", draft-seantek-abnf-more-core-rules-07 (work in
           progress), September 2016.

[JSRE6ED]  Ecma International, "ECMAScript 2015 Language
           Specification", Standard ECMA-262, 6th Edition, June
           2015.

[PCRE2]    Hazel, P., "Perl Compatible Regular Expressions 2, version
           10.21", January 2016.

[POS-ERE]  IEEE Std 1003.1, 2013 Edition (incorporates IEEE Std
           1003.1-2008 and IEEE Std 1003.1-2008/Cor 1-2013),
           "Standard for Information Technology - Portable Operating
           System Interface (POSIX(R)) Base Specifications, Issue 7"
           (incorporating Technical Corrigendum 1), Section 9.4,
           "Extended Regular Expressions", April 2013.

[RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC
           821, DOI 10.17487/RFC0821, August 1982.

[RFC0822]  Crocker, D., "STANDARD FOR THE FORMAT OF ARPA INTERNET
           TEXT MESSAGES", STD 11, RFC 822, DOI 10.17487/RFC0822,
           August 1982.

[RFC0973]  Mockapetris, P., "Domain system changes and observations",
           RFC 973, DOI 10.17487/RFC0973, January 1986.

[RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
           STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987.

[RFC1035]  Mockapetris, P., "Domain names - implementation and
           specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
           November 1987.

[RFC1123]  Braden, R., Ed., "Requirements for Internet Hosts -
           Application and Support", STD 3, RFC 1123, DOI
           10.17487/RFC1123, October 1989.

[RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
           Extensions (MIME) Part One: Format of Internet Message
           Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2821]  Klensin, J., Ed., "Simple Mail Transfer Protocol", RFC
           2821, DOI 10.17487/RFC2821, April 2001.

[RFC2822]  Resnick, P., Ed., "Internet Message Format", RFC 2822, DOI
           10.17487/RFC2822, April 2001.

[RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
           DOI 10.17487/RFC5321, October 2008.

[RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
           October 2008.

[RFC5890]  Klensin, J., "Internationalized Domain Names for
           Applications (IDNA): Definitions and Document Framework",
           RFC 5890, DOI 10.17487/RFC5890, August 2010.

[RFC5891]  Klensin, J., "Internationalized Domain Names in
           Applications (IDNA): Protocol", RFC 5891, DOI
           10.17487/RFC5891, August 2010.

[RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
           Internationalized Domain Names for Applications (IDNA)",
           RFC 5892, DOI 10.17487/RFC5892, August 2010.

[RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
           Internationalized Email", RFC 6530, DOI 10.17487/RFC6530,
           February 2012.

[RFC6531]  Yao, J. and W. Mao, "SMTP Extension for Internationalized
           Email", RFC 6531, DOI 10.17487/RFC6531, February 2012.

[RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
           Email Headers", RFC 6532, February 2012.

[UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
           9.0.0", August 2016.

6.2. Informative References

[MTECHDEV]
           Partridge, C., "The Technical Development of Internet
           Email", IEEE Annals of the History of Computing, Vol. 30,
           No. 2, DOI 10.1109/MAHC.2008.32, June 2008.

[RFC0524]  White, J., "Proposed Mail Protocol", RFC 524, DOI
           10.17487/RFC0524, June 1973.

[RFC0561]  Bhushan, A., Pogran, K., Tomlinson, R., and J. White,
           "Standardizing Network Mail Headers", RFC 561, DOI
           10.17487/RFC0561, September 1973.

[RFC0680]  Myer, T. and D. Henderson, "Message Transmission
           Protocol", RFC 680, DOI 10.17487/RFC0680, April 1975.

[RFC0724]  Crocker, D., Pogran, K., Vittal, J., and D. Henderson,
           "Proposed official standard for the format of ARPA Network
           messages", RFC 724, DOI 10.17487/RFC0724, May 1977.

[RFC0733]  Crocker, D., Vittal, J., Pogran, K., and D. Henderson,
           "Standard for the format of ARPA network text messages",
           RFC 733, DOI 10.17487/RFC0733, November 1977.

[RFC0772]  Sluizer, S. and J. Postel, "Mail Transfer Protocol", RFC
           772, DOI 10.17487/RFC0772, September 1980.

[RFC0882]  Mockapetris, P., "Domain names: Concepts and facilities",
           RFC 882, DOI 10.17487/RFC0882, November 1983.

[RFC0883]  Mockapetris, P., "Domain names: Implementation
           specification", RFC 883, DOI 10.17487/RFC0883, November
           1983.

[RFC1818]  Postel, J., Li, T., and Y. Rekhter, "Best Current
           Practices", RFC 1818, DOI 10.17487/RFC1818, August 1995.

[RFC1912]  Barr, D., "Common DNS Operational and Configuration
           Errors", RFC 1912, DOI 10.17487/RFC1912, February 1996.

[RFC5598]  Crocker, D., "Internet Mail Architecture", RFC 5598, DOI
           10.17487/RFC5598, July 2009.

Appendix A. Test Vectors

[[NB: This appendix will include a large set of test vectors to test
matching and validation patterns.]]

Appendix B. Change Log

Draft-02 just updates the date to keep the document active.

The document status is now marked as Informational instead of Best
Current Practice (although it seems that it could go either way).

The authors decided to focus on "modern" ASCII-only email identifiers
first and to get those right before tackling Unicode email
identifiers and "obsolete" ABNF productions. This draft-01 preserves
the main text of draft-00 rather than remove potentially useful text.
Deliverable and modern email addresses, and modern Message-IDs, have
been addressed. The Unicode work remains unfinished for now;
"obsolete" ABNF productions (which are still useful for archival
applications) will also be addressed in future drafts.

The authors decided to write one set of regular expressions in one
dialect (namely, PCRE/PCRE2) before tackling others (e.g.,
JavaScript). Different dialects will be addressed in future drafts.

Authors' Addresses

Sean Leonard
Penango, Inc.
5900 Wilshire Boulevard
21st Floor
Los Angeles, CA 90036
USA

Email: dev+ietf@seantek.com

Joe Hildebrand
Cisco Systems

Email: jhildebr@cisco.com

Tony Hansen
AT&T Laboratories
200 Laurel Ave South
Middletown, NJ 07748
USA

Email: tony@att.com