idnits 2.17.1 draft-klensin-emailaddr-i18n-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 14. -- Found old boilerplate from RFC 3978, Section 5.5 on line 1460. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1437. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1444. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1450. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 396 has weird spacing: '... copied back ...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). 
== Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: An SMTP Client that receives the I18N extension keyword MAY transmit a mailbox name as an internationalized string in UTF-8 form. It MAY transmit the domain part of that string in either punycode (derived from the IDNA process) or UTF-8 form but, if it sends the domain in UTF-8, it SHOULD first verify that the string is valid for a domain name according to IDNA rules. As required by RFC 2821, it MUST not attempt to parse, evaluate, or transform the local part in any way. If the I18N SMTP extension is not offered by the Server, the SMTP Client MUST not transmit an internationalized address. Instead, it MUST either return the message to the user as undeliverable or replace it, either using the ASCII-only address specified with the ALT-ADDRESS parameter or using some process (such as a directory lookup) outside the scope of this specification, with a local-part that conforms to the syntax rules of RFC 2821. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 18, 2005) is 6849 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'ASCII' is mentioned on line 174, but not defined == Missing Reference: 'Ldh-str' is mentioned on line 859, but not defined == Unused Reference: 'RFC3491' is defined on line 1344, but no explicit reference was found in the text == Unused Reference: 'RFC3492' is defined on line 1348, but no explicit reference was found in the text == Unused Reference: 'RFC2056' is defined on line 1381, but no explicit reference was found in the text ** Obsolete normative reference: RFC 821 (Obsoleted by RFC 2821) ** Obsolete normative reference: RFC 1651 (Obsoleted by RFC 1869) ** Obsolete normative reference: RFC 2821 (Obsoleted by RFC 5321) ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) -- Obsolete informational reference (is this intentional?): RFC 1652 (Obsoleted by RFC 6152) -- Obsolete informational reference (is this intentional?): RFC 2476 (Obsoleted by RFC 4409) -- Obsolete informational reference (is this intentional?): RFC 2554 (Obsoleted by RFC 4954) -- Obsolete informational reference (is this intentional?): RFC 2822 (Obsoleted by RFC 5322) -- Obsolete informational reference (is this intentional?): RFC 3454 (Obsoleted by RFC 7564) Summary: 8 errors (**), 0 flaws (~~), 10 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. 
Klensin 3 Internet-Draft July 18, 2005 4 Expires: January 19, 2006 6 Internationalization of Email Addresses 7 draft-klensin-emailaddr-i18n-03.txt 9 Status of this Memo 11 By submitting this Internet-Draft, each author represents that any 12 applicable patent or other IPR claims of which he or she is aware 13 have been or will be disclosed, and any of which he or she becomes 14 aware will be disclosed, in accordance with Section 6 of BCP 79. 16 Internet-Drafts are working documents of the Internet Engineering 17 Task Force (IETF), its areas, and its working groups. Note that 18 other groups may also distribute working documents as Internet- 19 Drafts. 21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt. 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 This Internet-Draft will expire on January 19, 2006. 34 Copyright Notice 36 Copyright (C) The Internet Society (2005). 38 Abstract 40 Internationalization of electronic mail addresses is, if anything, 41 more important than the already-completed effort for domain names. 42 In most of the contexts in which they are used, domain names can be 43 hidden within or as part of various types of references. Email 44 addresses, by contrast, are crucial: use of names of people or 45 organizations as, or as part of, the email local part is, for obvious 46 reasons, a well-established tradition on the network. Preventing 47 people from spelling their names correctly is, in the long term, 48 inexcusable. 
At the same time, email addresses pose a number of 49 special problems -- they are more difficult than simple domain names 50 in some respects, but actually easier in others. This document 51 discusses the issues with internationalization of email addresses, 52 explains why some obvious approaches are incompatible with the 53 definitions and use of Internet mail, and proposes a solution -- for 54 both addressing and email internationalization more generally -- that 55 is likely to serve users and the network well for the long term.
57 Table of Contents
59 1. Introduction
60 1.1 Terminology
61 2. Three Models for Transition
62 2.1 No Infrastructure Changes
63 2.2 Transport-level Negotiation
64 2.3 Replace SMTP and the Current Internet Mail Environment
65 2.4 Analysis
66 3. History, Context, and Design Constraints
67 3.1 The Presentation Issue
68 3.2 MUAs, MTAs, addresses, and learning from MIME and ESMTP
69 3.3 An Encoded-address, MUA-transparent, Solution may 70 Eliminate an Important Opportunity
71 3.4 An MUA-only-based Solution is Not Necessary
72 3.4.1 Obtaining an Internationalized Email Address
73 3.4.2 Relay environment
74 3.4.3 Internationalizing the Sender
75 3.5 A Solution Based on MUA Changes Alone is Unworkable
76 3.5.1 MX Diversion
77 3.5.2 Embedded commands
78 3.6 Encoding the Whole Address String
79 3.7 Looking back and looking forward
80 3.8 Summary of Design Issues and Tradeoffs
81 4. A Mail Transport-level Protocol
82 4.1 General Principles and Objectives
83 4.2 Framework for the Internationalization Extension
84 4.3 The Address Internationalization Service Extension
85 4.4 Extended Mailbox Address Syntax
86 4.5 The ALT-ADDRESS parameter
87 4.6 Formation of the Alternate Address
88 4.7 Additional ESMTP Changes and Clarifications
89 4.7.1 The Initial SMTP Exchange
90 4.7.2 Trace Fields
91 5. Impact on the MUA and on Message Headers
92 6. Bundling of Extensions and Options
93 7. Protocol Loose Ends
94 7.1 Punycode in Domain Names?
95 7.2 Local Character Codes in Local Parts?
96 7.3 Restrictions on Characters in Local Part?
97 7.4 Requirement for 8BITMIME?
98 7.5 Message Header and Body Issues with MTA Approach?
99 7.6 The Received field 'for' clause
100 8. Internationalization and Full Localization
101 9. Advice to Designers and Operators of Mail-receiving Systems
102 10. Internationalization Considerations
103 11. IANA Considerations
104 12. Security considerations
105 13. Acknowledgements
106 14. An Appeal
107 15. References
108 15.1 Normative References
109 15.2 Informative References
110 Author's Address
111 Intellectual Property and Copyright Statements
113 1. Introduction
115 Internationalization of electronic mail addresses is, if anything, 116 more important than the already-completed effort for domain names. 117 In most of the contexts in which they are used, domain names can be 118 hidden within, or as part of, various types of references or the 119 references themselves may be hidden. It also remains controversial 120 whether internationalization of domain names is actually necessary, 121 no matter how attractive and important it may appear at first glance. 122 Email addresses, by contrast, are crucial: use of names of people or 123 organizations as, or as part of, the email local part is, for obvious 124 reasons, a well-established tradition on the network. While the 125 characters permitted in domain name strings have always been somewhat 126 constrained so that they are not confused with syntax requirements of 127 present and future applications, preventing people from spelling 128 their names correctly is, in the long term, inexcusable. However, 129 while it is tempting to ignore them, email addresses pose a number of 130 special problems. Unlike domain names --and, consequently, the 131 domain part of an email address (after the last "@")-- the local part 132 (or mailbox name) is essentially unconstrained with regard to syntax 133 or the characters used. There are no special delimiters comparable 134 to the period used to separate domain name labels, there is no 135 standardized structure comparable to the domain name system's 136 hierarchy, and it has always been a firm protocol requirement that no 137 host other than the one to which final delivery is made is permitted 138 to parse or interpret the address (see section 2.3.10 of [RFC2821]).
139 In some respects, this makes things much more difficult: it is far 140 more difficult to know what behavior will cause existing systems to 141 cease working properly. In others, it actually makes them easier, 142 since the originating system is not required to understand how the 143 receiving one will interpret an address and indeed must not do so. 145 The balance of this document explores these issues in more detail. 147 1.1 Terminology 149 While much of the description here depends on the abstractions of 150 "Mail Transfer Agent" ("MTA") and "Mail User Agent" ("MUA"), it is 151 important to understand that those terms and the underlying concepts 152 postdate the design of the Internet's email architecture and the 153 "protocols on the wire" principle. That architecture, as it has 154 evolved, and the "wire" principle have prevented any strong and 155 standardized distinctions about how MTAs and MUAs interact on a given 156 origin or destination host (or even whether they are separate). 158 This document assumes a reasonable understanding of the protocols and 159 terminology of the most recent core email standards documented in RFC 160 2821 [RFC2821] and RFC 2822 [RFC2822]. 162 In its present internet-draft form, the document contains a great 163 deal of explanatory material and rationale for the approach chosen. 164 The actual protocol material appears almost entirely in Section 4, 165 especially Section 4.2 through Section 4.4, and in Section 5. If it 166 appears to be a candidate for standards-track publication, the 167 explanatory material, rationale, and most of the other background 168 materials should be removed to a separate document. Those who wish 169 to bypass the reasoning and comparison to other alternatives in this 170 document and examine the protocol proposal should skip to those 171 sections. 
173 In this document, an address is "all-ASCII" if every character in the 174 address is in the ASCII character repertoire [ASCII]; an address is 175 "non-ASCII" if any character is not in the ASCII character 176 repertoire. 178 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", 179 and "MAY" in this document are to be interpreted as described in RFC 180 2119 [RFC2119].
182 2. Three Models for Transition
184 Almost every attempt to extend the Internet mail system to support 185 new formats leads quickly to a controversy about how to implement 186 such changes and fit them into the existing system. The proposals 187 tend to fall into three categories:
189 2.1 No Infrastructure Changes
191 Avoid any more infrastructure changes than are absolutely necessary. 192 Instead, use sometimes-elaborate coding and other "tricks" to embed 193 the new facilities in the old ones, even if those facilities will 194 look very odd to those whose client and user interface systems have 195 not been upgraded. MIME, which has been quite successful, is the 196 most-cited example of this approach, although it is worth remembering 197 that it was initially quite unpopular because it exposed many users 198 to rather opaque codings. It continues to be criticized as 199 institutionalizing incompatibility rather than providing good 200 interoperability. Specifically, on the latter observation, MIME 201 conformance does nothing to prevent delivery to a user of a message 202 that the user has no capability of decoding and reading.
204 2.2 Transport-level Negotiation
206 Use a transport-level negotiation model of some variety to ensure 207 that the recipient machine can and will accept the format and 208 structure of the message and options being sent. Variations on this 209 approach include ensuring that the message can be delivered, but not 210 that it can be read.
This approach has been taken in situations in 211 which, e.g., sending the message without such acceptance could result 212 in actual information loss (as distinct from "mere" information 213 inaccessibility for the MIME case). ESMTP ([RFC1651], [RFC2821]) 214 provides the framework for this approach; options such as 8BITMIME 215 [RFC1652] are clear examples of cases in which the approach is 216 necessary. 218 2.3 Replace SMTP and the Current Internet Mail Environment 220 Start a conversation about discarding more or less everything and 221 moving toward a "next generation" of Internet mail in the hope of a 222 huge gain in elegance, capability, or other functions. 224 2.4 Analysis 226 Of these three, only the second appears to be plausible for 227 internationalization of email local parts. 229 The third --a new mail system, replacing SMTP and possibly the mail 230 header and MIME models-- is the most easily discarded as a 231 possibility. Despite many brief bursts of enthusiasm, development of 232 standards by other organizations, and well-funded proprietary 233 products, proposed SMTP replacements have tended to not go anywhere: 234 standard Internet mail, with or without extensions, is sufficiently 235 well-established (and entrenched) as an interoperable interchange 236 standard that historical proposed "next generation" alternatives have 237 been forced, by the marketplace, to interoperate with and, 238 ultimately, to yield to, it. Those that did not make a serious 239 attempt to interoperate with Internet mail have largely disappeared. 240 There is no reason to believe that a new set of proposals will fare 241 any better. 243 In considering the other two approaches, it is important to note that 244 none of the fundamental transitions have ever been, or will ever be, 245 easy, quick, and without side-effects. 
While it is easy, safe, and 246 fairly painless to add a new media type to MIME, the MIME framework 247 itself provided a very unpleasant user experience until mail user 248 agents (MUAs) and related software were upgraded. Similarly, while 249 most ESMTP extensions can be added at a relatively low level of pain, 250 the infrastructure upgrades needed to accommodate the extension 251 framework itself (and hence any of the extensions) were significant, 252 especially in the case of MTA software that had been operating 253 smoothly, without maintenance or upgrades, for years. 255 For internationalization, the important question is whether the 256 transition properties are "more like MIME" (i.e., can be accomplished 257 at an MUA-MUA level, without the transport system getting involved) 258 or whether the changes required are significant enough to require 259 transport negotiation (or a "next generation mail" environment). It 260 is the position of this document that the complex of changes needed 261 to make Internet email fully internationalized requires, if not "next 262 generation mail", at least the type of position that characterized 263 the original, "framework", MIME and ESMTP deployments: "if you want 264 any of these features, you are going to implement and accept a 265 package of changes; if that is not feasible, you need to stay with 266 ASCII, or ASCII-encoded, addresses and headers until you are ready to 267 upgrade". The alternatives are just too complex, and therefore 268 problem-prone, in terms of combinations of alternate forms, options, 269 transitions, upgrading and downgrading, and so on to preserve a high 270 level of interoperability. 272 The sections that follow discuss the motivation and implications of 273 this conclusion in more detail before moving on to a specific 274 proposal. 276 3. History, Context, and Design Constraints 278 Several key issues in how email works and is handled impose 279 significant constraints on the solution space. 
Email is often used 280 as a transport mechanism for information that will be acted on by 281 computers, not merely read by people. While the approach is not 282 common, some of the systems that use it as a computer-computer 283 communication medium encode routing, processing, or validation 284 information into the envelope address fields. More commonly, 285 recipient systems use special address formats to encode local routing 286 or priority information. In recent years, some of these addressing 287 techniques have become important anti-spam tools for some users and 288 communities. Most of these techniques have a long history. Most or 289 all of them conform to email standards and practices that, in turn, 290 go back to the first uses of email on the ARPANet. Backward- 291 compatibility --not damaging the interoperability of standards- 292 conforming programs that are now deployed and working correctly-- 293 makes it inappropriate to make decisions by conducting user surveys 294 and concluding that "not too many" people will be hurt. Any new 295 system must preserve existing practices and flexibilities unless 296 there are overwhelming reasons -- e.g., an absence of plausible 297 alternatives -- to not do so. 299 Historically, when one of these approaches has required that the 300 email address local part be partitioned into components that are then 301 interpreted differently or in some special sequence, the information 302 has been organized according to some lexical convention, typically 303 either based on one or more delimiters or on some sort of position 304 and length notation (or a mixture of the two for different purposes). 
305 Either may be applied left-to-right or right-to-left and, again, we 306 have a history of each, including the notorious "!a!b!c!d%e%f" local 307 parts, which can be interpreted as
309 o A single mailbox name, "!a!b!c!d%e%f"
310 o A routing instruction to send the address "!b!c!d%e%f" to host "a"
311 o A routing instruction to send the address "!a!b!c!d%e" to host "f"
313 or some combinations of those interpretations.
315 Because the correct interpretation can only be known at the 316 destination host, attempts by intermediate hosts, or even the 317 originating user, to interpret the structure of the address string 318 cause serious problems with mail reliability. Worse, because the 319 organization of the system(s) that make up the destination host 320 cannot, in general, be known to the sender, approaches that assume 321 decoding from some coded form at some particular order in the process 322 of receiving and delivering the message will cause some fraction of 323 systems that are now fully conformant to Internet mail standards to 324 fail to properly handle the address.
326 3.1 The Presentation Issue
328 Before continuing, it is important to note that any 329 internationalization system, regardless of how it is implemented at 330 the protocol level, will require changes at the user interface level 331 if it is to function in a way that end users consider reasonable. 332 Unless addresses are presented to the user in familiar characters and 333 formats, the user's perception will be, not of internationalization 334 and behavior that is user and culturally friendly, but of a 335 relatively hostile environment. One thing we have almost certainly 336 learned from nearly forty years of experience with email is that 337 users strongly prefer email addresses that closely resemble names or 338 initials to those involving, e.g., user ID numbers or complex coding 339 that makes the local part appear as gibberish.
Indeed, that 340 principle --of wanting local parts to appear intelligible-- is 341 arguably the entire reason for wanting to internationalize these 342 addresses. Otherwise, any identifier would suffice whether it had 343 mnemonic value to users or not. If a user sees "xn--fltstrm-5wa1o" 344 (a punycode form) or "F=E4ltstr=F6m" (the MIME quoted-printable 345 form), rather than the correctly-written localized string, the result 346 is almost certain to be unhappiness. 348 3.2 MUAs, MTAs, addresses, and learning from MIME and ESMTP 350 The development and deployment of MIME [RFC2045] provided a number of 351 important lessons for the community about how to design extensions 352 and enhanced features without harm to the installed and conforming 353 email system. Perhaps the most important of these was that it is 354 easier, and often more expedient, to make changes that have impact 355 only on mail user agents. If it is possible to make changes that way 356 --generally changes that involve only message headers and the message 357 body or body parts-- users who need particular features can switch to 358 user agents that support them or press for those features in the user 359 agents they have already selected. Even in the worst case in which 360 support for features the user considers critical is not readily 361 available, it is possible, with proper user agent design, to save the 362 entire message to a file and then use stand-alone software to 363 interpret the information and perform the desired functions. 365 Providing these functions in the message headers and body permits 366 them to be moved opaquely through the mail transport system, thus 367 avoiding any requirement to modify originating or delivery MTAs or 368 intermediate relays. In practice, the user may have little control 369 over those systems. 
Since changes to them typically impact large 370 numbers of users, those who are responsible for them are often 371 reluctant to make changes in response to the needs of a few users.
373 It is hence reasonable to conclude that, if it is feasible to support 374 address internationalization strictly at the MUA level, keeping the 375 internationalized addresses opaque to the transport system, that is a 376 more desirable approach than requiring MTA changes. The MUA-only 377 approach has been carefully examined by others (see, e.g., the 378 obsolete Internet Draft [Hoffman-IMAA]) and proposed more recently as 379 a temporary measure in [JET-IMA].
381 The present document argues that
383 1. Addressing is a fundamental MTA-level function,
384 2. Some of the complexities encountered when trying to encode 385 addresses so as to avoid MTA interactions are symptoms that 386 attempting to "hide" the MTA function so that it can be handled 387 by MUAs is not an architecturally desirable approach,
388 3. The restrictions on email uses and syntax required to provide 389 internationalization at MUA level are unnecessarily risky, and 390 almost certainly damaging, to deployed email infrastructure,
391 4. If internationalization is to be plausible, it is critical that 392 addressing information be represented in essentially the same way 393 in the message envelope (i.e., the SMTP command structure) and 394 the message body (i.e., both message headers and, where feasible, 395 message text). Different encodings in different places, 396 especially ones that are copied back and forth, will make both 397 mail system maintainers and operators and end users very unhappy, 398 and
399 5. MTA-level solutions are feasible, architecturally more elegant, 400 and perhaps not as difficult to deploy in relevant communities as 401 the strongest advocates of the MUA-only approach appear to 402 imagine. See Section 3.8 for additional discussion on this 403 point.
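As an illustration of point 4 (a sketch only, not part of the proposal made in this document), the following Python fragment produces three representations of one name: raw UTF-8, the punycode form with the IDNA "xn--" prefix, and MIME quoted-printable. The function name encoded_forms is hypothetical, and the choice of ISO-8859-1 as the charset underlying the quoted-printable form is an assumption; the name is the one used as an example in Section 3.1.

```python
import quopri


def encoded_forms(name):
    """Return one name in three byte-level representations."""
    return {
        # Raw UTF-8 octets, as an upgraded transport might carry them.
        "utf-8": name.encode("utf-8"),
        # ASCII-compatible form: lowercased, punycode, "xn--" prefix.
        "punycode": b"xn--" + name.lower().encode("punycode"),
        # Quoted-printable over ISO-8859-1 octets (assumed charset).
        "quoted-printable": quopri.encodestring(name.encode("iso-8859-1")),
    }


forms = encoded_forms("Fältström")
for label, octets in forms.items():
    print(f"{label:17} {octets!r}")
```

The three octet strings have nothing in common beyond a few ASCII letters, so software that copies an address between envelope and headers would need to know which form it holds and how to convert it.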
405 The decision as to what to do in message bodies and formats (e.g., 406 [RFC2822] and MIME [RFC2045]) and what to handle in message 407 transport (i.e., in extended SMTP) is critical because, as discussed 408 below, the level at which something is handled is both determined by, 409 and determines, how information is appropriately encoded. This 410 decision ultimately depends on the application of two principles:
412 1. If body content is opaque, anything still visible to transport 413 requires transport negotiation.
414 2. Anything an MTA -- be it origin, relay, MX, gateway, or delivery 415 -- needs to understand or process must be handled as part of mail 416 transport. The discussion below might be titled "why the MTA 417 must get involved".
419 Whether mail addresses meet these criteria, and hence must be 420 comprehensible in transport, depends on how much the sending MUA 421 needs to know to construct, and the delivery MTA needs to know to 422 deliver, a message. Traditionally, we have kept the former knowledge 423 level at zero: if a sender produces "!a!b!c@example.com" in response 424 to information that it is a valid address, it still does not know 425 whether this is a "bang path" or a slightly-perverse name for a 426 single mailbox. Is "xyz%def@example.com" a specification for routing 427 to mailbox "xyz" on host "def" or a mailbox named "xyz%def" on the 428 example.com host? Are "foo+bar@..." or "foo-baz@..." subaddresses 429 "bar" and "baz" for the mailbox "foo", or are they simple addresses? 430 Is "jjoneschem@labs.example.com" a local mailbox on that host or an 431 instruction to route mail to "jjones" in the chemistry department?
433 Under the rules established in [RFC0821] and [RFC1123], as summarized 434 and updated in [RFC2821], all of those decisions are up to 435 "example.com", its MX alternatives, or hosts in that domain, and they 436 may make very local decisions about them.
For example, even within 437 the same domain (on the same apparent host), "xyz%def" might be a 438 mailbox while "xyz%ghi" might contain a route; "foo-baz" might 439 represent an address and subaddress while "foo-blog" might be a 440 mailbox.
442 The sender cannot, in the general case, know.
444 Worse, while non-alphanumeric characters like "+", "-", and "%" have 445 been used in these examples, delimiters for subaddresses, implicit 446 routing, embedded commands, and so on are, again, up to the 447 destination MTA and its interpretations. "X" might be as good a 448 delimiter as "+". It might even be a better one in some 449 applications. And, since local-parts are defined as case-sensitive, 450 "x" might be a normal address character in the same address in which 451 "X" was an important delimiter.
453 Of course, in a completely non-ASCII environment, it would make sense 454 to substitute characters from the local script for "+", "-", "%", 455 and so on. If one wants a string completely in local language (i.e., 456 non-ASCII) characters, then there may be no desire to break that 457 convention in order to use an ASCII delimiter (see Section 8 for 458 additional discussion of this issue).
460 It is not even necessary to use a delimiter to support some forms of 461 subaddressing or local routing. Suppose an organization adopted the 462 convention that externally-visible email address local parts were 463 structured as, e.g., a three-letter department code, followed by a 464 five-letter code representing the individual, optionally followed by 465 a code representing a project. Many organizations use just such 466 systems and there is no way (and no need) for an email sender to 467 understand the system or whether it is actually used for mail routing 468 internally.
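The ambiguity described above can be sketched in a few lines of Python. The function candidate_readings and its default delimiter set are hypothetical, invented for this illustration; no real MTA is claimed to work this way. It lists readings of a local part that some destination host could legitimately apply, which is exactly the information a sender does not have.

```python
def candidate_readings(local_part, delimiters="%+-"):
    """List readings of a local part that some destination might apply."""
    # Every local part can always be one opaque mailbox name.
    readings = [("single mailbox", local_part)]
    # Any character may act as a delimiter at some destination.
    for delim in delimiters:
        head, sep, tail = local_part.partition(delim)
        if sep and head and tail:
            readings.append((f"split on {delim!r}", (head, tail)))
    # Delimiter-free convention from the text: three-letter department,
    # five-letter individual, optional project code.
    if len(local_part) >= 8 and local_part.isalnum():
        readings.append(
            ("dept/person/project",
             (local_part[:3], local_part[3:8], local_part[8:]))
        )
    return readings


for address in ("xyz%def", "foo+bar", "foo-baz", "jjoneschem"):
    for label, value in candidate_readings(address):
        print(f"{address:12} {label:22} {value}")
```

Each reading is syntactically plausible, and passing delimiters="X" makes "fooXbar" split as readily as "foo+bar" does. Only the destination's local policy selects among the readings.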
470 Consequently, the idea of a sender breaking an address up into its 471 component parts and encoding those parts separately, or even just 472 doing an encoding in sections that preserves the positions of the 473 delimiters (as measured from the left) is an impossibility without 474 major, incompatible, and retroactive changes in how mail addressing 475 is defined. Conversely, if the sender encodes the entire address, or 476 the entire local part, without understanding the structure of the 477 address in the same way that the target system does, it is likely 478 that important information will be lost or, possibly, the message 479 will be mis-delivered. 481 3.3 An Encoded-address, MUA-transparent, Solution may Eliminate an 482 Important Opportunity 484 The principle above that addresses should have the same form in 485 headers and in envelopes leads quickly to a reasoning path that 486 argues for representation of most or all mail headers in some form of 487 Unicode, including, but not limited to, those headers that explicitly 488 list addresses. Several proposals have been outlined for doing this; 489 perhaps the best developed (at the time the first version of this 490 document was written) was the "UTF-8-HEADERS" proposal 491 [Hoffman.UTF-8]. Like this proposal, it requires envelope (SMTP 492 extension) negotiation to protect the headers that are encoded in 493 UTF-8. This proposal differs from that one by putting somewhat more 494 reliance on envelope facilities to prevent what its author considers 495 a number of layering and interaction problems, most of them arising 496 from the proposed "Address-map:" headers of that UTF-8-HEADERS 497 proposal. 

3.4 An MUA-only-based Solution is Not Necessary

3.4.1 Obtaining an Internationalized Email Address

One of the classic arguments for an approach based on MUA changes
only (for international addresses or anything else) is that users
will be able to install and use solutions on their own, even if the
administrators of their systems are unenthused about the particular
function or extension and delay, or decline, to install it. That
argument was certainly true for MIME, especially in the presence of
the capability to store messages as files and apply post-MUA tools.
But it does not seem to apply for email addresses. In general, users
cannot create email accounts or aliases controlling delivery of
messages from external systems. Those accounts and aliases must be
created by system administrators responsible for the mail servers.
If those administrators are not sympathetic to internationalized
mailbox names, such names will not exist on the receiving system.
Having apparatus to send those names through the protocols will be
essentially useless: a message that bounces because the relevant
account or mailbox does not exist will bounce equally well whether
the target address is in ASCII or in some other script and whether or
not the receiving MTA is required to explicitly agree to access
internationalized addresses. Conversely, if the administrators of
the mail system host are sympathetic to internationalization, it is
reasonable to expect that appropriate software can and will be
installed at the MTA level.

An apparent important exception to the position taken in the above
paragraph arises for subscription, often free, email services such as
those operated under the "Hotmail", "Juno", "Netscape", and "Yahoo"
names. Some of these systems permit users to select their own names
(local parts) through an automated process.
If the user creates a mailbox using an encoded name, users with MUAs
that support the encoding will be able to send mail using a name in
the user's preferred characters if, of course, the user or MTA
somehow knows the encoding is being used. But the user cannot know
what capabilities the correspondents will have available, and hence
must give out both the name in local characters and the encoded form.
This may turn out to be necessary, but is unlikely to be considered
desirable. Also, if the user has presentation software that
recognizes the coding conventions, then he or she will be able to see
the original-language names in incoming messages and may not know
what names to pass on to those who lack such presentation software.
And, more important, many users of such systems access them through
"web mail" interfaces, using standard (or at least common) browsers.
The issues in getting those browsers upgraded to automatically
recognize and decode special encoding forms may be as difficult as,
or more difficult than, those associated with convincing a system
administrator to install a special address or upgraded MTA.

Consider this practice from a user point of view. First, the domain
names for these systems -- as compared to institutional mail
systems -- will generally continue to be in ASCII, so the goal of an
email address that is entirely or predominantly in the user's
language will be unattainable. If the domain names are non-ASCII
(i.e., are IDNA encodings of non-ASCII strings), it is reasonable to
assume that an operator who would choose such a name would be willing
to internationalize its MTA. Second, such systems are most often
accessed through web-based interfaces where most email header
information appears to the user's browser as running text.
Because an email local part can, today, take on the form of almost
any ASCII string, it is not reasonable to expect that a browser, even
one with some localized functions, will be able to accurately detect
an embedded, specially-coded, mailbox local part and correctly decode
and render it. Heuristics based on detection of an at-sign ("@")
will, of course, work for many cases, but will also produce a certain
number of false positives, perhaps destroying URLs or examples in the
text. After all, the "@" symbol has been around since long before
there was email. It is worth noting that any recognition and
decoding of local parts using a local encoding relies on heuristics
that may fail: all such strings are historically-valid email local
parts, and, unlike the DNS situation, it is impossible to conduct a
reliable survey to determine that no one is using any particular
encoding form, especially if the encoding indicator appears embedded
in the local part string, rather than as a prefix. By contrast, if
the MTA sees a Unicode string, and Unicode strings are placed in
message headers and message bodies as needed, the transition may be
more difficult, but no long-term user confusion or exposure to ugly
encodings will be necessary.

To be very specific about this, if the local part of an address is
encoded, e.g., with some ACE form as suggested in [JET-IMA], there is
no practical way for the receiving system to know to decode it,
rather than treat it as an ordinary mailbox name, unless it is
notified via an SMTP envelope extension.

3.4.2 Relay environment

As in many other areas with email, many of the difficulties with an
MTA-based model for internationalization of addresses arise, not when
the originating MTA communicates directly with the delivery MTA, but
when relay MTAs are involved.
If both the sending and receiving systems support internationalized
addresses, it is still possible that an intermediate relay will not
do so, forcing mail to bounce that could be delivered if there were a
direct connection between sender and receiver. But, as with the
installation of email addresses on a system, relays do not get
inserted in the mail path by accident. If internationalized
addresses are important to the destination host, its administrators
will choose lower-preference MX hosts or other relays that can
support internationalized addresses.

3.4.3 Internationalizing the Sender

If we assume a destination host that can accept, and properly handle,
an internationalized address, and we assume that any MX-designated
intermediaries for that host will be chosen to be similarly capable,
one situation is left in which it would be advantageous to have an
MUA-only-based solution. If an originating/sending system is not
capable of generating or sending an internationalized address, but
the prospective receiving system is, it would be good to enable the
originating user to generate and somehow send to the relevant
address.

This is a real issue, and deserves some serious consideration. But
it seems better to find a good temporary, transitional, mechanism for
it than to permanently burden the email system with an uncomfortable
mechanism just to accommodate this case. One example of a
transitional mechanism might be to use encapsulation, i.e., ESMTP
tunneling over MIME [RFC2442], to route the address and message to a
friendly gateway host that would unpack the message and transmit it
using this specification.
Other examples, less attractive at first glance but still plausible,
would include defining and using small variations on the message
encapsulation mechanisms that are integral to MIME [RFC2046], or the
more complex encapsulation designed for HTML [RFC2557], to accomplish
the same purpose.

The one transitional option that is not plausible is to simply send
an encoded string without envelope modification. "Just send ACE" has
the same unfortunate properties as the "just send eight" system
proposed when MIME was adopted: in both cases, even if the message or
address are not damaged in transit, the recipient will not have
sufficient information available to be able to accurately determine
what it has received and, hence, whether or not to decode it.

So, a user with an MUA that has the capability to handle an
internationalized address, but who does not have access to an
originating MTA with the capabilities defined here, may be given
access to a reasonable transition strategy until the needed
capabilities are available. Note that this does not require an open
relay, since all of the user authentication capabilities of ESMTP
[RFC2554] and SUBMIT [RFC2476] would be available. One can even
imagine a service with a per-message charging system, which would
presumably encourage rapid upgrading.

3.5 A Solution Based on MUA Changes Alone is Unworkable

The difficulties identified in the examples above are, perhaps
obviously, not the only ones. Other issues arise with intermediate
MX relay and gateway hosts, commands embedded in local parts, and
special formats used in gateways to other environments, among other
cases. Some of those additional cases are described briefly below.

3.5.1 MX Diversion

If the domain part of an email address is associated with several MX
records and the mail is delivered to one of them that is not the best
preference host, subsequent mail processing between that intermediate
host and the ultimate destination one is not required to use SMTP.
If, instead, it performs some gateway function, it may need to
inspect or alter the local part to determine how to route and deliver
the message. If the local part were encoded in some fashion that
prevented that inspection process, and the MTA was not aware that it
needed to apply special techniques, mail delivery might well fail.

3.5.2 Embedded commands

In addition to the address forms with special syntax or semantics
described elsewhere, systems have been developed that embed commands
in address local parts. These might, of course, use entirely
different syntax constructions and formats than are typical in
conventional addresses and, in an internationalized environment,
might reasonably use character coding conventions that are neither
ASCII nor Unicode-based.

A number of specialized applications of email do require, or
recommend, specific syntax in the local part. These are identified,
not to indicate that they are the only cases (they are not) but to
reinforce the point that one must be quite cautious in doing anything
that makes global assumptions about local part syntax and significant
characters. These applications include local part explicit routing
with the "percent hack" [RFC1123], gateways to and from X.400
environments [RFC2156], and gateways to fax systems [RFC3192].

3.6 Encoding the Whole Address String

Much of the above demonstrates why selective encoding of parts of the
local-part string is not practical, will exclude many important
cases, or will subject users to permanent use of the cryptic encoded
forms.
Why, then, not encode the entire string and insist that the delivery
MTA recognize the presence of an encoded form and do whatever
decoding is needed before it does other processing? There are three
major reasons not to approach the problem this way:

1. Any change in address syntax interpretation is likely to be a
   major, incompatible, change, since we do not now impose any
   restrictions on how an MTA is organized or even on how, or
   whether, the MTA, MUA, and other delivery-related functions are
   actually divided up on a given host. Converting user agents to
   handle international forms of addresses in a way that does not
   produce user astonishment is likely to be a major undertaking,
   regardless of what is done to the protocols and at what level.
2. Imposing a requirement that MTAs "understand" local-parts so that
   they can be partially decoded as part of mail routing would seem
   to defeat the main goal of encoding internationalized strings
   into a compact ASCII-compatible form, i.e., to keep MTAs from
   needing to understand the extended naming system.
3. We potentially have three different encodings of an
   internationalized string: the one used by the MTA, the one used
   by the MUA, and the one seen by the user through applications
   software or the operating system's display interface. Having all
   three of these identical or closely compatible is desirable from
   the standpoint of user understanding and debugging. Having them
   different can cause many "interesting" problems, e.g., having to
   return an error message that uses different coding, and hence
   might represent an entirely different string, than the string the
   user put into the process.

Instead, it would seem sensible to move from a straightforward
encoding of mail addresses in ASCII to a straightforward encoding in
Unicode via UTF-8 [RFC2277], imposing only those restrictions on the
characters in the local part that are implied by Unicode itself.

3.7 Looking back and looking forward

Another principle is implied by some of the discussion above.
Internationalization measures for the Internet will be with us for as
long as there are multiple languages and scripts in the world, i.e.,
probably forever. If a satisfactory long-term solution can be found,
and a reasonable transition strategy can be defined for it, it is
much better to optimize for the long term. The alternative of making
things more difficult or less functional forever -- for the
transport, the MUA, and/or the user interface system -- in order to
save some small effort in transition, or even to make the transition
a few months faster, represents a very poor tradeoff.

3.8 Summary of Design Issues and Tradeoffs

Each of the above subsections describes a strong case for continuing
to treat addressing as an MTA function, opaque except at the end
systems. The main alternative is to rely on the sending system being
able to understand the addressing system of the target host, and any
relays accessed through MX relays, potentially needing to be able to
remove IDN encoding ("punycode" or otherwise) in order to determine
how to process or route the message. That alternative violates a
long-standing and important design principle of Internet email,
complicates a number of other cases, and does not offer sufficient
transition advantages to be worth any of those difficulties.
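
To make the contrast between a UTF-8 form and an ACE ("punycode") form
concrete, a small sketch follows. It relies on the "idna" and
"punycode" codecs of the Python standard library, the former of which
implements the RFC 3490 conversion for domain names. The sample
strings and function names are assumptions chosen for illustration,
and the application of punycode to a local part is shown only to
indicate what an ACE-style local part might look like; no such
transformation is defined by this document.

```python
def ace_domain(domain: str) -> str:
    """Convert a domain name to its IDNA (RFC 3490) ASCII form."""
    return domain.encode("idna").decode("ascii")


def unicode_domain(ace: str) -> str:
    """Recover the Unicode form of an IDNA-encoded domain name."""
    return ace.encode("ascii").decode("idna")


def ace_style_local_part(local_part: str) -> str:
    """Illustrative only: raw punycode of a local part, without any prefix."""
    return local_part.encode("punycode").decode("ascii")


if __name__ == "__main__":
    typed = "bücher.example"
    print(typed.encode("utf-8"))      # the octets a UTF-8 transport would carry
    print(ace_domain(typed))          # the form needed for a DNS lookup
    encoded = ace_style_local_part("bücher")
    # The encoded local part is plain ASCII, so nothing distinguishes it
    # from an ordinary mailbox name that happens to contain those letters.
    print(encoded, encoded.isascii())
```

The last line illustrates the concern raised in Section 3.4.1: without
an envelope-level signal, a receiving system has no reliable way to
tell an encoded local part from an ordinary ASCII one.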

The protocol proposed here takes a giant step toward true
internationalization of electronic mail, providing a good functional
approximation to what we might have done several decades ago had
Unicode and the necessary understanding been available. It does not
go as far as one could imagine going in providing address forms that
would be compatible with local styles and models all over the world.
The issues in considering, and taking, those extra steps are
discussed in Section 8.

4. A Mail Transport-level Protocol

4.1 General Principles and Objectives

1. Whatever encoding is used should apply to the whole address and
   be directly compatible with software used at the user interface.
2. An SMTP relay must either recognize the format explicitly,
   agreeing to do so via an ESMTP option, select and use an
   ASCII-only address, or bounce the message so that the sender can
   make another plan.
3. In the interest of interoperability, charsets other than UTF-8 or
   punycode are strongly discouraged. If a mail environment chooses
   to use them anyway in the local part, interpretation at the "what
   does this mean" level is the responsibility of the receiving MTA.

4.2 Framework for the Internationalization Extension

The following service extension is defined:

1. The name of the SMTP service extension is "Address
   Internationalization";
2. The EHLO keyword value associated with this extension is "I18N";
3. No parameter values are defined for this EHLO keyword value. In
   order to permit future (although unanticipated) extensions, the
   EHLO response MUST NOT contain any parameters for that keyword.
   If a parameter appears, the SMTP client that is conformant to
   this version of this specification MUST treat the ESMTP response
   as if the I18N keyword did not appear.
4. An optional parameter is added to the SMTP MAIL and RCPT
   commands. This parameter is named ALT-ADDRESS.
   It requires an argument that may be useful in "downgrading" (see
   Section 4.5) as a substitute for the internationalized (UTF-8
   coded) address.

   This all-ASCII address MAY incorporate the IDNA "punycode" form
   if the domain name is internationalized. No algorithmic
   transformation is specified for the local-part; in the general
   case, it may identify a completely separate mailbox from the one
   identified in the primary command argument.
5. No additional SMTP verbs are defined by this extension.

Most of the remainder of this memo specifies how support for the
extension affects the behavior of an SMTP client and server and what
message header changes it implies.

4.3 The Address Internationalization Service Extension

In the absence of this extension, SMTP clients and servers are
constrained to using only those addresses permitted by RFC 2821. The
local parts of those addresses may be made up of any ASCII
characters, although certain of them must be quoted as specified
there. It is notable in an internationalization context that there
is a long history on some systems of using overstruck ASCII
characters (a character, a backspace, and another character) within a
quoted string to approximate non-ASCII characters. This form of
internationalization should be phased out as this extension becomes
widely deployed but backward-compatibility considerations require
that it continue to be supported.

An SMTP Server that announces this extension MUST be prepared to
accept a UTF-8 string [RFC3629] in any position in which RFC 2821
specifies that a "mailbox" may appear. That string must be parsed
only as specified in RFC 2821, i.e., by separating the mailbox into
source route, local part and domain part, using only the characters
colon (U+003A), comma (U+002C), and at-sign (U+0040) as specified
there.
Once isolated by this parsing process, the local part MUST be treated
as opaque unless the SMTP Server is the final delivery MTA. Any
domain names that are to be looked up in the DNS MUST be processed
into punycode form as specified in IDNA [RFC3490] unless they are
already in that form. Any domain names that are to be compared to
local strings SHOULD be checked for validity and then MUST be
compared as specified in IDNA.

An SMTP Client that receives the I18N extension keyword MAY transmit
a mailbox name as an internationalized string in UTF-8 form. It MAY
transmit the domain part of that string in either punycode (derived
from the IDNA process) or UTF-8 form but, if it sends the domain in
UTF-8, it SHOULD first verify that the string is valid for a domain
name according to IDNA rules. As required by RFC 2821, it MUST NOT
attempt to parse, evaluate, or transform the local part in any way.
If the I18N SMTP extension is not offered by the Server, the SMTP
Client MUST NOT transmit an internationalized address. Instead, it
MUST either return the message to the user as undeliverable or
replace it, either using the ASCII-only address specified with the
ALT-ADDRESS parameter or using some process (such as a directory
lookup) outside the scope of this specification, with a local-part
that conforms to the syntax rules of RFC 2821.

4.4 Extended Mailbox Address Syntax

RFC 2821, section 4.1.2, defines the syntax of a mailbox as

   Mailbox = Local-part "@" Domain

   Local-part = Dot-string / Quoted-string
      ; MAY be case-sensitive

   Dot-string = Atom *("." Atom)

   Atom = 1*atext

   Quoted-string = DQUOTE *qcontent DQUOTE

   Domain = (sub-domain 1*("." sub-domain)) / address-literal
   sub-domain = Let-dig [Ldh-str]

(see that document for productions and definitions not provided here
-- their details are not important to understanding this
specification).
The key changes made by this specification are, informally, to

o Change the definition of "sub-domain" to permit either the
  definition above or a UTF-8 (or other, see Section 7.1) string
  representing a DNS label that is conformant with IDNA [RFC3490].
  That sub-domain string MUST NOT contain the characters "@" or ".".
o Change the definition of "Atom" to permit either the definition
  above or a UTF-8 (or other, see Section 7.3) string. That string
  MUST NOT contain any of the ASCII characters (either graphics or
  controls) that are not permitted in "atext"; it is otherwise
  unrestricted.

4.5 The ALT-ADDRESS parameter

If the I18N extension is offered, the syntax of the SMTP MAIL and
RCPT commands is extended to support the optional "ALT-ADDRESS"
parameter, which takes one argument. Its syntax is

   "alt-address=" Mailbox

where "Mailbox" is strictly according to the (unextended) syntax
specified in RFC 2821. If the receiving SMTP server is required to
send the message on to a system that does not support this extension
or the one for UTF-8 headers (see Section 6), it MAY substitute the
specified address for the internationalized one to permit the message
to be delivered to a mailbox specified by the sender. If UTF-8
headers are not supported, it SHOULD also encapsulate the message as
described in Section 3.4.3. An SMTP server that cannot forward an
internationalized message using the addressing and UTF-8 header
extensions MUST either

o Substitute all-ASCII addresses and encapsulate the message, or
o "Bounce" the message, either rejecting it during the SMTP
  transaction or returning it, as specified in RFC 2821.

Under normal circumstances, the final delivery SMTP server should be
configured so that the two mailbox names point to the same physical
store.
However, just as case-matching for traditional ASCII local-parts is
not a requirement of the unmodified SMTP protocol, mapping of these
two mailbox names is not a requirement here: sites for whom other
issues outweigh the potential confusion should configure their
systems as they find appropriate.

While further analysis is required, it will probably be desirable to
extend the "Return-path:" trace header to include this parameter when
it is provided; doing so would increase the odds that an error
message could be properly routed in a number of edge and transition
cases.

4.6 Formation of the Alternate Address

It is tempting to want to form the alternate, all-ASCII, address
algorithmically so that, e.g., the sending MUA could compute it
without any user input other than the native-form address. Such a
computation could, for example, be the ACE transformation
contemplated by [JET-IMA]. In the general case, this is not possible
for the same reason that the use of an ASCII-compatible form without
option negotiation is not feasible: the sending system cannot know
the way in which the receiving one parses and interprets address
local parts.

However, for the range of simple cases in which the local part really
is atomic and represents a simple mailbox without subaddresses or
other internal structure, it would probably be helpful to have a
convention that the destination host could use to derive the
alternate address, such as a standard ACE-style encoding. One can
imagine prohibiting the use of alternate addresses that used a
selected ACE prefix except as an ACE-introducer so that the sending
MUA, upon receiving a reply or bounce on the alternate address, could
at least produce a helpful error message for the user.
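
The client-side rules of Sections 4.2, 4.3, and 4.5 can be sketched
roughly as follows. This is an illustration under stated assumptions,
not a reference implementation: the function names and the exception
class are invented for the example, the EHLO response is assumed to
have already been split into one keyword line per list element, and
the placement of the parameter after the path (separated by a single
space) is an assumption, since this document specifies only the
parameter syntax "alt-address=" Mailbox.

```python
from typing import Iterable, Optional


class Undeliverable(Exception):
    """Hypothetical signal: return the message to the user as undeliverable."""


def i18n_offered(ehlo_lines: Iterable[str]) -> bool:
    """True only if the I18N keyword appears with no parameters (Section 4.2)."""
    for line in ehlo_lines:
        parts = line.split()
        if parts and parts[0].upper() == "I18N":
            # A parameter on the keyword means "treat as if not offered".
            return len(parts) == 1
    return False


def rcpt_command(address: str, alt_address: Optional[str],
                 ehlo_lines: Iterable[str]) -> str:
    """Choose what to send in RCPT for one recipient.

    The local part is never parsed or transformed here; the address is
    passed through as an opaque string.
    """
    if address.isascii():
        return f"RCPT TO:<{address}>"
    if i18n_offered(ehlo_lines):
        command = f"RCPT TO:<{address}>"
        if alt_address is not None:
            command += f" alt-address={alt_address}"  # assumed wire layout
        return command
    # Extension not offered: fall back to the ASCII alternate, or give up.
    if alt_address is not None and alt_address.isascii():
        return f"RCPT TO:<{alt_address}>"
    raise Undeliverable(address)
```

A directory lookup or other out-of-band replacement, which Section 4.3
also permits, would take the place of the final "raise" in a fuller
treatment.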

4.7 Additional ESMTP Changes and Clarifications

The mail transport process involves addresses ("mailboxes") and
domain names in contexts in addition to the MAIL and RCPT commands
and extended alternatives to them. In general, the rule is that,
when RFC 2821 specifies a mailbox, this document expects UTF-8 to be
used for the entire string; when RFC 2821 specifies a domain name,
the name should be in punycode form if its raw form is non-ASCII.

The following subsections list and discuss all of the relevant cases.
[[Note in draft: I hope]]

4.7.1 The Initial SMTP Exchange

When an SMTP or ESMTP connection is opened, the server sends a
"banner" response consisting of the 220 reply code and some
information. The client then sends the EHLO command. Since the
client cannot know whether the server supports internationalized
addresses until after it receives the response from EHLO, any domain
names that appear in this dialogue, or in responses to EHLO, must be
in hostname form, i.e., internationalized ones must be in punycode
form.

4.7.2 Trace Fields

Internationalized domain names in Received fields should be
transmitted in Unicode form. Addresses in "for" clauses need further
examination and might be treated differently depending on whether
8BITMIME is a requirement for internationalized addresses (see
Section 6). The reasoning in the introductory portion of Section 5
strongly suggests that these addresses be in Unicode form, rather
than some specialized encoding, but a counterargument is that users
do not look at Received fields and, if there is a standard encoding
available that is completely interoperable and
information-preserving, it should be used for both domain names and
addresses (perhaps in a comment or other supplemental information).

5. Impact on the MUA and on Message Headers

In addition to the trace fields ("Received" headers), mentioned
above, there are many other places in MUAs or in user presentation in
which email addresses or domain names appear. Each one, whether the
conventional From, To, or Cc header fields, or Message-IDs, or
In-Reply-To fields that may contain addresses or domain names, or in
message bodies or elsewhere, must be examined from an
internationalization perspective. The user will expect to see
mailbox and domain names in local characters, and to see them
consistently: a situation in which an address is coded one way in a
"From" field, another way in a signature line in the body of the
message, and, apparently arbitrarily, in one or the other of those
forms in Return-Path, Received, or reference fields, will create
confusion and frustration. Variations on that problem will exist
with any internationalization method, whether transport or MUA-only
in structure. Perhaps, if we have to live with it for a short time
as a transition activity, that is worthwhile. But the only practical
way to avoid it, in both the medium and the longer term, is to have
the encodings used in transport be as nearly as possible the same as
the encodings used in message headers and message bodies.

There appears to be a very strong case for concluding that the point
at which we internationalize email local parts is the point that we
should simply shift email headers to a full internationalized form,
presumably using UTF-8 rather than ASCII. The transition to that
model might involve support for legacy systems by extending the
encoding models of [RFC2045] and [RFC2231] to cover address, and
address-related, fields within headers but our target should be fully
internationalized headers, as discussed elsewhere in this document.

6. Bundling of Extensions and Options

An ongoing concern about any extensions to SMTP is that they will
combine to create a situation with enough possible combinations to
require profiling and the consequent increased complexity and
negative impact on interoperability. Reducing the number of
different options and combinations of them is therefore a useful
goal. In this particular case, we have

1. This proposal, to permit UTF-8 addresses and alternate
   addresses when it is not possible to reliably transmit or use the
   UTF-8 ones.
2. Several proposals to permit the use of UTF-8 (or other non-ASCII
   encodings) in mail headers. See, e.g., [Hoffman.UTF-8].
3. Several proposals to protect those UTF-8 headers with an SMTP
   extension. See, again, e.g., [Hoffman.UTF-8].
4. The established "8BITMIME" extension [RFC1652] that permits
   message bodies to be transmitted in 8 bit form, rather than
   requiring a content transfer encoding to force them into 7 bit
   form. Given the difficulties that would occur if UTF-8 (or other
   8 bit) headers were significantly encoded, this extension is all
   but required for the UTF-8 header extensions to function
   properly.
5. Possibly also some type of "get envelope components out of the
   headers" proposal, such as [Klensin.envelope].

In the interest of compatibility and interoperability, it is
suggested that all of these extensions be bundled, with only one EHLO
keyword for all of them other than 8BITMIME, and with that new
keyword implying (and requiring support for) all of the listed
functions. This requires MTA implementers who wish to support these
features to do a slightly larger job, but avoids the interoperability
costs of excessive incrementalism. Put differently,
Put differently, internationalization, taken seriously, is a large
issue and a large job, and these appear to be the pieces needed to
get the job done.

7. Protocol Loose Ends

These issues should be resolved, and this section eliminated, before
the document is considered complete.

7.1 Punycode in Domain Names?

It is not clear whether the flexibility of being able to pass domain
names in punycode, as well as UTF-8, form is needed. If it is not,
it should be eliminated as excess complexity.

7.2 Local Character Codes in Local Parts?

There are some reasons for permitting local-parts to be written in
locally-used character codes, i.e., in other than the UTF-8 encoding
of Unicode. This could be done by tagging the local part in a
fashion similar to the technique of [RFC2047] but without encoding
the local part string itself. It clearly increases flexibility, and
the mailbox part can be defined as a simple octet string (as it
essentially is in the sections above). We can reasonably expect that
some systems, operating in local environments, will use local
character codes no matter what we specify. On the other hand, having
an application presented with an octet (or bit) string and not
knowing what charset is involved would wreak havoc on any attempt to
intelligently display local parts: if one cannot know the character
coding being used, then it is not possible to accurately decode the
characters and display appropriate character glyphs.

Use of local coding also implies an encoding for the local part
different from that for the domain part -- any MTA in the path must
be able to resolve the domain part into something that can be looked
up in the DNS and resolved and that, in turn, requires a
globally-known encoding.
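
The display problem with untagged local character codes can be shown
with a short sketch. The sample word, the particular charsets, and the
function name are assumptions chosen for illustration; the codecs used
are those shipped with the Python standard library.

```python
from typing import Dict, Iterable, Optional


def interpretations(octets: bytes,
                    charsets: Iterable[str]) -> Dict[str, Optional[str]]:
    """Decode the same octets under each candidate charset.

    A value of None means the octets are not valid in that charset.
    """
    results: Dict[str, Optional[str]] = {}
    for charset in charsets:
        try:
            results[charset] = octets.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


if __name__ == "__main__":
    # A local part written by a system that uses KOI8-R internally.
    octets = "почта".encode("koi8-r")
    candidates = ["koi8-r", "cp1251", "latin-1", "utf-8"]
    for charset, text in interpretations(octets, candidates).items():
        print(f"{charset:8} -> {text!r}")
```

Three of the four candidates decode without error, yet only one
yields the string the mailbox owner intended, and nothing in the
octets themselves identifies which one that is.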
   On the other hand, if local codings can be avoided entirely, it will
   considerably reduce complexity and "opportunities" for systems to not
   interoperate.

7.3  Restrictions on Characters in Local Part?

   This specification is extremely liberal about what can be included in
   a UTF-8 string that represents a local-part.  In return, it
   effectively prohibits the use of quoted strings, or quoted
   characters, in non-ASCII local parts.  Quoted strings and characters
   in local parts have, in general, been nothing but trouble and there
   appears to be no reason to carry that trouble forward into an
   internationalized world (and the much greater complexity that quoting
   in that environment might imply).  There may also be a strong case
   for applying restrictions, e.g., by use of a stringprep [RFC3454]
   profile that would eliminate particularly problematic characters
   while not forcing, e.g., even an approximation to case-mapping
   (remember that ASCII local-parts are inherently case sensitive, even
   though local systems are encouraged to not take advantage of that
   feature).

7.4  Requirement for 8BITMIME?

   This extension is carefully defined to be independent of "8BITMIME".
   However, given the length of time 8BITMIME has been around, the
   amount of deployment of it that exists, and the rather low likelihood
   that any MTA implementer in his or her right mind will go to the
   trouble of implementing this extension without also implementing
   8BITMIME, it may be sensible to permit this extension only if
   8BITMIME also appears or is combined with it as suggested in
   Section 6.

7.5  Message Header and Body Issues with MTA Approach?

   By viewing i18n addresses as an MTA problem, this document may not
   address all of the interesting 2822/MIME and MUA implementation and
   presentation style issues.
   In particular, if both this extension and 8BITMIME are in use, is it
   sensible to drop the requirement for RFC 2047/2231 encoding of
   personal name fields?  And, whether or not that requirement is
   dropped, is the MUA description of Section 5 adequate?

7.6  The Received field 'for' clause

   Decide what to do about the value of the "for" clause in Received
   fields.  See Section 4.7.2.

8.  Internationalization and Full Localization

   Whenever one considers a new protocol, or revision of an existing
   one, for internationalization or other aspects of support for an
   improved user interface, important tradeoffs arise.  These tradeoffs
   can be described in several ways, e.g.,

   o  Simplicity versus localization capability
   o  User convenience, especially within a particular area or culture,
      versus global interoperability
   o  and so on

   Maximum global interoperability is obtained by confining a protocol
   to a very limited number of characters, ideally ones that are easily
   distinguishable by people.  The historical choice in this regard has
   been the 26 upper-case ASCII letters, plus digits, plus a very small
   number of special characters.  It is probably no coincidence that
   these characters (with different, bit-minimizing, encodings) are the
   normal ones in early telegraphy and subsequent Telex character sets.
   But, as soon as users start looking at these characters, the
   complaints start to appear: text in all-upper-case is ugly, people
   should be able to write their names as they normally do and not in
   some transliterated or variant form, people should be able to
   communicate in their own languages using their own character sets,
   and so on.
   Ultimately, not only are the characters used in writing
   at issue; so are the structures for constructing, e.g., command
   sequences, with different preferences typically reflecting the
   grammatical structures of different languages.  With sufficient
   ingenuity, all of these requirements can be accommodated, but
   typically only at the cost of global interoperability or at the
   expense of convenient use by people outside the locality or cultural
   group.

   Email addresses illustrate this problem at its most difficult.  They
   are seen and used by end users and there has been little success in
   hiding the forms that are actually used in the protocols.  Worldwide,
   most communication is almost certainly among people who share
   languages and cultural assumptions, not in situations in which global
   interoperability is important (and where it is important that global
   interoperability be convenient and very reliable).  On the other
   hand, situations and communications that require global
   interoperability are still common and are commercially and
   intellectually important.

   So the question is how far one should go.  It is clearly important
   and sensible to accommodate local character sets, and to do so in a
   way that creates maximum convenience and attractive user interfaces
   in the long term.  But, as pointed out in passing in Section 4.3,
   RFC 2821 still requires the ASCII at-sign character to divide the
   local part from the domain name.  If even lexical support for the
   long-deprecated source routes is to be provided, comma and colon must
   also be preserved and supported.  This implies that a mailbox name
   that is completely in some character script other than ASCII is
   impossible without further changes to the email protocols.
   In addition, the ordering implied by the "local-part@domain"
   construction, usually read in English as "local part at domain",
   seems quite strange and foreign in some other languages and cultures.
   It is interesting that X.400 avoided this delimiter and ordering
   problem entirely by using Distinguished Names in which the various
   elements of an address were explicitly identified.  But, when
   Distinguished Names appear at the presentation layer or above, they
   appear with the various fields identified by tags which are,
   themselves, keywords that use a very restricted set of ASCII
   (actually ISO 646 or IA5) characters.

   In principle at least, the protocol extensions proposed here could be
   further extended to specify a separator character to distinguish
   local part from domain name and the order in which those names
   occurred.  For example, the MAIL and RCPT commands could be extended
   with parameters like

   SEPARATOR="UTF-8-character" ORDER-RL
      to identify a form consisting of the domain name followed by the
      local part, separated by the designated character.

   But, while this would not impose particularly heavy burdens on SMTP
   processors, it would be a potential nightmare for users, who would
   have no way to accurately identify the components of an email
   address, at least without significant out-of-band information.  In
   addition, going that far would almost certainly touch off the debate,
   again, as to whether domain names should be presented in
   little-endian or big-endian order -- an issue that is, again,
   culturally sensitive as to which one feels most natural.

   It is not clear how far one should go, and the community should
   consider the issue very carefully.

9.  Advice to Designers and Operators of Mail-receiving Systems

   As discussed above, in the historical Internet email context, the
   interpretation and permitted syntax for an email local-part is
   entirely the responsibility of the receiving system.  Systems can get
   themselves into trouble and, more particularly, can seriously
   restrict the number and type of users who can send mail to their
   users, by poor choices of format and syntax.  For example, general
   advice to system designers has long included "treat addresses in a
   case-independent fashion" and "do not use addresses that require
   quoting" in order to increase the odds that remote users will be able
   to properly compose and transmit intended addresses.  In a way, that
   advice is an extreme generalization of the "receiver" side of the
   robustness principle: being generous in what one accepts implies
   accepting as many plausible variations of an address local-part
   string as possible and designing the strict forms of those strings to
   facilitate differentiation when it is appropriate.

   As one moves toward internationalization of local parts, an expanded
   version of these principles is useful and may be even more
   appropriate, even though it is neither necessary nor desirable to
   turn those principles into protocol requirements.  For example, a
   receiving host should normally consider any string that would match
   under nameprep rules -- or perhaps any string that would match under
   a stringprep profile that provides more matching and exclusions than
   nameprep -- as matching for local-part purposes.  An even more
   "liberal" receiving host might use some sort of variant tables for
   its script(s) of interest to further expand the matching rules.

   But, whatever extended matching rules the local host adopts, those
   rules are a property of that host.
   Senders should continue to be
   conservative about what they send, and relays should continue to
   avoid presumptions about their understanding of the content of
   local-parts.  Receiving systems that have reason to adopt more
   restricted syntax rules, or interpretations of matching, should
   continue to be able to do so.

10.  Internationalization Considerations

   This entire specification addresses issues in internationalization
   and especially the boundaries between internationalization and
   localization and between network protocols and client/user
   interface actions.

11.  IANA Considerations

   This specification does not contemplate any IANA registrations or
   other actions.

12.  Security Considerations

   Any expansion of permitted characters and encoding forms in email
   addresses raises the risk, however slight, of misdirected or
   undeliverable mail.  The problem is worsened if address information
   is carried in local character sets and must be converted to some
   standard form.  Any conversion of character sets may also be
   problematic for digitally-signed information.  Modulo those concerns,
   the ideas proposed here do not introduce new security issues.

   Since email addresses are often transcribed from business cards and
   notes on paper, they are subject to problems arising from confusable
   characters.  Those problems are somewhat reduced if the domain
   associated with the mailbox is unambiguous and supports a relatively
   small number of mailboxes whose names follow local system
   conventions; they are increased with very large mail systems in which
   users can freely select their own addresses.

13.  Acknowledgements

   The author acknowledges the contributions and comments of Dave
   Crocker in a personal conversation, and the efforts of a private
   discussion group, led by Paul Hoffman and Adam Costello, to develop
   an MUA-only solution to this problem.  The author had hoped that
   effort would succeed, since the idea of requiring transport changes
   to support internationalization (or any other new function) is
   unattractive and should be avoided when possible.  Difficulties that
   group has encountered in properly defining a number of boundary
   conditions, including appropriate delimiters for permitting internal
   parsing of the local part and problems with right-to-left characters
   and substrings, have led to the conclusion that it is time to get a
   specific, transport-based, approach on the table.  That conclusion
   has been reinforced by increasing interest in the IETF in more
   radical changes to the mail system, starting with extensions to
   permit mail headers to be written in UTF-8.  While the ideas leading
   to the "IMAA" and other drafts have inspired several of the
   properties of this proposal, their authors are, of course, not
   responsible for the result and will probably disagree with it.
   Comments from Adam Costello on the first public draft were
   particularly helpful, and James Seng identified some
   internationalization issues that had not been addressed in the
   previous version.

14.  An Appeal

   The author received a number of favorable comments on the general
   principles and design discussed in early drafts of this
   specification.  He is not, however, able to continue its development
   on a one-person, or even one-person with occasional comments from
   others, basis.
   In particular, he has almost no resources for
   developing MTA, MUA, or presentation code to test and demonstrate the
   concepts and details outlined above; without such resources, this
   approach will, inevitably, fail sooner or later.  So those who
   consider the idea attractive should think about, and develop, ways to
   join with the author in design team and development efforts.

15.  References

15.1  Normative References

   [Hoffman.UTF-8]
              Hoffman, P., "SMTP Service Extensions for Transmission of
              Headers in UTF-8 Encoding", draft-hoffman-utf8headers-00
              (work in progress), December 2003.

   [Klensin.envelope]
              Klensin, J., "A Cleaner SMTP Envelope for Internet Mail",
              draft-klensin-email-envelope-00 (work in progress),
              January 2004.

   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10,
              RFC 821, August 1982.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1651]  Klensin, J., Freed, N., Rose, M., Stefferud, E., and D.
              Crocker, "SMTP Service Extensions", RFC 1651, July 1994.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC2821]  Klensin, J., "Simple Mail Transfer Protocol", RFC 2821,
              April 2001.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", RFC 3629, November 2003.

15.2  Informative References

   [Hoffman-IMAA]
              Hoffman, P. and A. Costello, "Internationalizing Mail
              Addresses in Applications (IMAA)", draft-hoffman-imaa-03
              (work in progress), October 2003.

   [JET-IMA]  Yao, J. and J. Yeh, "Internationalized eMail Address
              (IMA)", draft-lee-jet-ima-00 (work in progress),
              June 2005.

   [RFC1652]  Klensin, J., Freed, N., Rose, M., Stefferud, E., and D.
              Crocker, "SMTP Service Extension for 8bit-MIMEtransport",
              RFC 1652, July 1994.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, November 1996.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2047]  Moore, K., "MIME (Multipurpose Internet Mail Extensions)
              Part Three: Message Header Extensions for Non-ASCII Text",
              RFC 2047, November 1996.

   [RFC2056]  Denenberg, R., Kunze, J., and D. Lynch, "Uniform Resource
              Locators for Z39.50", RFC 2056, November 1996.

   [RFC2156]  Kille, S., "MIXER (Mime Internet X.400 Enhanced Relay):
              Mapping between X.400 and RFC 822/MIME", RFC 2156,
              January 1998.

   [RFC2231]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded
              Word Extensions: Character Sets, Languages, and
              Continuations", RFC 2231, November 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC2442]  Freed, N., Newman, D., and M. Hoy, "The Batch SMTP Media
              Type", RFC 2442, November 1998.

   [RFC2476]  Gellens, R. and J. Klensin, "Message Submission",
              RFC 2476, December 1998.

   [RFC2554]  Myers, J., "SMTP Service Extension for Authentication",
              RFC 2554, March 1999.

   [RFC2557]  Palme, J., Hopmann, A., Shelness, N., and E. Stefferud,
              "MIME Encapsulation of Aggregate Documents, such as HTML
              (MHTML)", RFC 2557, March 1999.
   [RFC2822]  Resnick, P., "Internet Message Format", RFC 2822,
              April 2001.

   [RFC3192]  Allocchio, C., "Minimal FAX address format in Internet
              Mail", RFC 3192, October 2001.

   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
              Internationalized Strings ("stringprep")", RFC 3454,
              December 2002.

Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140
   USA

   Phone: +1 617 491 5735
   Email: john-ietf@jck.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.
Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.