idnits 2.17.1 draft-klensin-name-filters-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 347 has weird spacing: '... of for an ...' == Line 385 has weird spacing: '...m. See for d...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (September 5, 2003) is 7539 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'ASCII' is mentioned on line 128, but not defined == Missing Reference: 'IS3166' is mentioned on line 176, but not defined == Unused Reference: 'RFC1535' is defined on line 613, but no explicit reference was found in the text == Unused Reference: 'RFC1738' is defined on line 617, but no explicit reference was found in the text == Unused Reference: 'RFC2396' is defined on line 629, but no explicit reference was found in the text == Unused Reference: 'RFC2616' is defined on line 633, but no explicit reference was found in the text == Unused Reference: 'RFC3490' is defined on line 643, but no explicit reference was found in the text == Unused Reference: 'RFC3491' is defined on line 647, but no explicit reference was found in the text == Unused Reference: 'RFC3492' is defined on line 651, but no explicit reference was found in the text == Unused Reference: 'JET' is defined on line 670, but no explicit reference was found in the text == Unused Reference: 'RegRestr' is defined on line 678, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 1535 ** Obsolete normative reference: RFC 1738 (Obsoleted by RFC 4248, RFC 4266) ** Obsolete normative reference: RFC 1866 (Obsoleted by RFC 2854) ** Obsolete normative reference: RFC 2368 (Obsoleted by RFC 6068) ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986) ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Obsolete normative reference: RFC 2821 (Obsoleted by RFC 5321) ** Obsolete normative reference: RFC 2822 (Obsoleted by RFC 5322) ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete 
normative reference: RFC 3491 (Obsoleted by RFC 5891) == Outdated reference: A later version (-05) exists of draft-jseng-idn-admin-03 == Outdated reference: A later version (-08) exists of draft-klensin-reg-guidelines-00 Summary: 12 errors (**), 0 flaws (~~), 17 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Klensin 3 Internet-Draft September 5, 2003 4 Expires: March 5, 2004 6 User Interface Evaluation and Filtering of Internet Addresses and 7 Locators -or- Syntaxes for Common Namespaces 8 draft-klensin-name-filters-03.txt 10 Status of this Memo 12 This document is an Internet-Draft and is in full conformance with 13 all provisions of Section 10 of RFC2026. 15 Internet-Drafts are working documents of the Internet Engineering 16 Task Force (IETF), its areas, and its working groups. Note that other 17 groups may also distribute working documents as Internet-Drafts. 19 Internet-Drafts are draft documents valid for a maximum of six months 20 and may be updated, replaced, or obsoleted by other documents at any 21 time. It is inappropriate to use Internet-Drafts as reference 22 material or to cite them other than as "work in progress." 24 The list of current Internet-Drafts can be accessed at http:// 25 www.ietf.org/ietf/1id-abstracts.txt. 27 The list of Internet-Draft Shadow Directories can be accessed at 28 http://www.ietf.org/shadow.html. 30 This Internet-Draft will expire on March 5, 2004. 32 Copyright Notice 34 Copyright (C) The Internet Society (2003). All Rights Reserved. 36 Abstract 38 Many Internet applications have been designed to deduce top-level 39 domains (or other domain name labels) from partial information. The 40 introduction of new top-level domains, especially non-country-code 41 ones, has exposed flaws in some of the methods used by these 42 applications. 
These flaws make it more difficult, or impossible, for 43 users of the applications to access the full Internet. This memo 44 discusses some of the techniques that have been used and gives some 45 guidance for minimizing their negative impact as the domain name 46 environment evolves. This document draws summaries of the applicable 47 rules together in one place and supplies references to the actual 48 standards. 50 Table of Contents 52 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 53 2. Restrictions on domain (DNS) names . . . . . . . . . . . . . . 4 54 3. Restrictions on email addresses . . . . . . . . . . . . . . . 7 55 4. URLs and URIs . . . . . . . . . . . . . . . . . . . . . . . . 9 56 4.1 URI syntax definitions and issues . . . . . . . . . . . . . . 9 57 4.2 The HTTP URL . . . . . . . . . . . . . . . . . . . . . . . . . 10 58 4.3 The MAILTO URL . . . . . . . . . . . . . . . . . . . . . . . . 10 59 4.4 Guessing domain names in web contexts . . . . . . . . . . . . 12 60 5. Implications of internationalization . . . . . . . . . . . . . 14 61 6. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 62 7. Security considerations . . . . . . . . . . . . . . . . . . . 16 63 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 17 64 Normative References . . . . . . . . . . . . . . . . . . . . . 18 65 Non-normative References . . . . . . . . . . . . . . . . . . . 20 66 Author's Address . . . . . . . . . . . . . . . . . . . . . . . 20 67 Intellectual Property and Copyright Statements . . . . . . . . 21 69 1. Introduction 71 Designers of user interfaces to Internet applications have often 72 found it useful to examine user-provided values for validity before 73 passing them to the Internet tools themselves. 
This type of test, 74 most commonly involving syntax checks or application of other rules 75 to domain names, email addresses, or "web addresses" (URLs or, 76 occasionally, extended URI forms (see URLs and URIs)) may enable 77 better-quality diagnostics for the user than might be available from 78 the protocol itself. Local validity tests on values are also thought 79 to improve the efficiency of back-office processing programs and to 80 reduce load on the protocols themselves. Certainly they are 81 consistent with the well-established principle that it is better to 82 detect errors as early as possible. 84 The tests must, however, be made correctly or at least safely. If 85 criteria are applied that do not match the protocols, users will be 86 inconvenienced, addresses and sites will effectively become 87 inaccessible to some groups, and business and communications 88 opportunities will be lost. Experience in recent years indicates 89 that syntax tests are often performed incorrectly and that tests for 90 top-level domain names are applied using obsolete lists and 91 conventions. We assume that most of these incorrect tests are the 92 result of inability to conveniently locate exact definitions for the 93 criteria to be applied. This document draws summaries of the 94 applicable rules together in one place and supplies references to the 95 actual standards. It does not add anything to those standards; it 96 merely draws the information together into a form that may be more 97 accessible. 99 Many experts on Internet protocols believe that tests and rules of 100 these sorts should be avoided in applications and that the tests in 101 the protocols and back-office systems should be relied on instead. 102 Certainly implementations of the protocols cannot assume that the 103 data passed to them will be valid. Unless the standards specify 104 particular behavior, this document takes no position on whether or 105 not the testing is desirable. 
It only identifies the correct tests 106 to be made if tests are to be applied. 108 The sections that follow discuss domain names, email addresses, and 109 URLs. 111 2. Restrictions on domain (DNS) names 113 The authoritative definitions of the format and syntax of domain 114 names appear in RFCs 1035 [RFC1035], 1123 [RFC1123], and 2181 115 [RFC2181]. 117 Any characters, or combination of bits (as octets), are permitted in 118 DNS names. However, there is a preferred form that is required by 119 most applications. This preferred form has been the only one 120 permitted in the names of top-level domains, or TLDs. In general, it 121 is also the only form permitted in most second-level names registered 122 in TLDs although some names that are normally not seen by users obey 123 other rules. It derives from the original ARPANET rules for naming 124 of hosts (i.e., the "hostname" rule) and is perhaps better described 125 as the "LDH rule", after the characters that it permits. The LDH 126 rule, as updated, provides that the labels (words or strings 127 separated by periods) that make up a domain name must consist only of 128 the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen. 129 No other symbols or punctuation characters are permitted, nor is 130 blank space. If the hyphen is used, it is not permitted to appear at 131 either the beginning or end of a label. There is an additional rule 132 that essentially requires that top-level domain names not be 133 all-numeric. 135 When it is necessary to express labels with non-character octets, or 136 to embed periods within labels, there is a mechanism for keying them 137 in that utilizes an escape sequence. RFC 1035 should be consulted if 138 that mechanism is needed (most common applications, including email 139 and the Web, will generally not permit those escaped strings). A 140 special encoding is now available for non-ASCII characters, see the 141 brief discussion in Implications of internationalization. 
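As an illustration only, the LDH rule described above can be expressed as a short check on a single label. The following Python sketch is not taken from any of the cited standards; the name is_ldh_label is a hypothetical helper chosen for this example, and the check covers only the character and hyphen-placement rules, not label length.

```python
import re

# Letters, digits, and hyphen only; a hyphen may not be the first or
# last character of the label.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if `label` (one dot-free component) obeys the LDH rule."""
    return _LDH_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ("example", "a-b", "-abc", "abc-", "a_b", ""):
        print(repr(candidate), is_ldh_label(candidate))
```

A check of this kind deliberately says nothing about whether the label exists in the DNS; it only mirrors the syntax rule.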
143 Most internet applications that reference other hosts or systems 144 assume they will be supplied with "fully-qualified" domain names, 145 i.e., ones that include all of the labels leading to the root, 146 including the TLD name. Those fully-qualified domain names are then 147 passed to either the domain name resolution protocol itself or to the 148 remote systems. Consequently, purported DNS names to be used in 149 applications and to locate resources generally must contain at least 150 one period (".") character. Those that do not are either invalid or 151 require the application to supply additional information. Of course, 152 this principle does not apply when the purpose of the application is 153 to process or query TLD names themselves. The DNS specification also 154 permits a trailing period to be used to denote the root, e.g., 155 "a.b.c" and "a.b.c." are equivalent, but the latter is more explicit 156 and is required to be accepted by applications. This convention is 157 especially important when a TLD name is being referred to directly. 158 For example, while ".COM" has become the popular terminology for 159 referring to that top-level domain, "COM." would be strictly and 160 technically correct in talking about the DNS, since it shows that 161 "COM" is a top-level domain name. 163 There is a long history of applications moving beyond the "one or 164 more periods" test to trying to verify that a valid TLD name is 165 actually present. They have done this either by applying some 166 heuristics to the form of the name or by consulting a local list of 167 valid names. The historical heuristics are no longer effective. If 168 one is to keep a local list, much more effort must be devoted to 169 keeping it up-to-date than was the case several years ago. 171 The heuristics were based on the observation that, since the DNS was 172 first deployed, all top-level domain names were two, three, or four 173 characters in length. 
All two-character names were associated with 174 "country code" domains, with the specific labels (with a few early 175 exceptions) drawn from the ISO list of codes for countries and 176 similar entities [IS3166]. The three-letter names were "generic" 177 TLDs, whose function was not country-specific. And there was exactly 178 one four-letter TLD, the infrastructure domain "ARPA." [RFC1591]. 179 These length-dependent rules were, however, conventions, rather than 180 anything on which the protocols depended. 182 Before the mid-1990s, lists of valid top-level domain names changed 183 infrequently. New country codes were gradually, and then more 184 rapidly, added as the Internet expanded, but the list of generic 185 domains did not change at all between the establishment of the "INT." 186 domain and ICANN's allocation of new generic TLDs in 2000. Some 187 application developers responded by assuming that any two-letter 188 domain name could be valid as a TLD, but that the list of generic 189 TLDs was fixed and could be kept locally and tested. Several of 190 these assumptions changed as ICANN started to allocate new top-level 191 domains: one two-letter domain that does not appear in the ISO 3166-1 192 table was tentatively approved, and new domains were created with 193 three, four, and even six-letter codes. 195 As of the first quarter of 2003, the list of valid, non-country, 196 top-level domains was .AERO, .BIZ, .COM, .COOP, .EDU, .GOV, .INFO, 197 .INT, .MIL, .MUSEUM, .NAME, .NET, .ORG, .PRO, and .ARPA. ICANN is 198 expected to expand that list at regular intervals, so the list that 199 appears here should not be used in testing. Instead, systems that 200 filter by testing top-level domain names should regularly check the 201 current list of TLDs (both "generic" and country-code-related) 202 published by IANA at http://www.iana.org/domain-names.htm. 
It is 203 likely that the better strategy has now become to make the "at least 204 one period" test, to verify LDH conformance (including verification 205 that the apparent TLD name is not all-numeric), and then to use the 206 DNS to determine domain name validity, rather than trying to maintain 207 a local list of valid TLD names. 209 A DNS label may be no more than 63 octets long. This is in the form 210 actually stored; if a non-ASCII label is converted to encoded 211 "punycode" form (see Implications of internationalization), the length 212 of that form may restrict the number of actual characters (in the 213 original character set) that can be accommodated. A complete, 214 fully-qualified, domain name must not exceed 255 octets. 216 Some additional mechanisms for guessing correct domain names when 217 incomplete information is provided have been developed for use with 218 the web and are discussed in Section 4.4. 220 3. Restrictions on email addresses 222 Reference documents: RFC 2821 [RFC2821] and RFC 2822 [RFC2822] 224 Contemporary email addresses consist of a "local part" separated from 225 a "domain part" (a fully-qualified domain name) by an at-sign ("@"). 226 The syntax of the domain part corresponds to that in the previous 227 section. The concerns identified in that section about filtering and 228 lists of names apply to the domain names used in an email context as 229 well. The domain name can also be replaced by an IP address in 230 square brackets, but that form is strongly discouraged except for 231 testing and troubleshooting purposes. 233 The local part may appear using the quoting conventions described 234 below. The quoted forms are rarely used in practice, but are required 235 for some legitimate purposes. Hence, they should not be rejected in 236 filtering routines but, instead, should be passed to the email system 237 for evaluation by the destination host. 
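For the domain part of an address, the local tests suggested in Section 2 (at least one period, LDH labels, a TLD that is not all-numeric, labels of at most 63 octets, and a name of at most 255 octets) might be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name plausible_domain_name is hypothetical, the 255-octet limit is applied to the textual form as this memo states it, and the sketch does not handle the bracketed IP address form. It leaves the local part untouched, in keeping with the advice above that quoted forms be passed on rather than rejected.

```python
import re

# One LDH label: letters, digits, hyphen; no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def plausible_domain_name(name: str) -> bool:
    """Apply the purely local tests; the DNS itself decides real validity."""
    # A single trailing period denotes the root and is accepted.
    if name.endswith("."):
        name = name[:-1]
    # LDH names are ASCII, so one character corresponds to one octet.
    if not name or len(name) > 255:
        return False
    labels = name.split(".")
    # The "at least one period" test.
    if len(labels) < 2:
        return False
    for label in labels:
        if len(label) > 63 or _LDH_LABEL.fullmatch(label) is None:
            return False
    # The apparent TLD must not be all-numeric.
    return not labels[-1].isdigit()


if __name__ == "__main__":
    for candidate in ("example.com", "example.com.", "localhost", "example.123"):
        print(candidate, plausible_domain_name(candidate))
```

A name that passes these tests would then be handed to the DNS, which is the only reliable source for whether the name actually exists.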
239 The exact rule is that any ASCII character, including control 240 characters, may appear quoted, or in a quoted string. When quoting 241 is needed, the backslash character is used to quote the following 242 character. For example 244 Abc\@def@example.com 246 is a valid form of an email address. Blank spaces may also appear, 247 as in 249 Fred\ Bloggs@example.com 251 The backslash character may also be used to quote itself, e.g., 253 Joe.\\Blow@example.com 255 In addition to quoting using the backslash character, conventional 256 double-quote characters may be used to surround strings. For example 258 "Abc@def"@example.com 260 "Fred Bloggs"@example.com 262 are alternate forms of the first two examples above. These quoted 263 forms are rarely recommended, and are uncommon in practice, but, as 264 discussed above, must be supported by applications that are 265 processing email addresses. In particular, the quoted forms often 266 appear in the context of addresses associated with transitions from 267 other systems and contexts; those transitional requirements do still 268 arise and, since a system that accepts a user-provided email address 269 cannot "know" whether that address is associated with a legacy 270 system, the address forms must be accepted and passed into the email 271 environment. 273 Without quotes, local-parts may consist of any combination of 274 alphabetic characters, digits, or any of the special characters 276 ! # $ % & ' * + - / = ? ^ _ ` . { | } ~ 278 period (".") may also appear, but may not be used to start or end the 279 local part, nor may two or more consecutive periods appear. Stated 280 differently, any ASCII graphic (printing) character other than the 281 at-sign ("@"), backslash, double quote, comma, or square brackets may 282 appear without quoting. If any of that list of excluded characters 283 are to appear, they must be quoted. 
Forms such as 285 user+mailbox@example.com 287 customer/department=shipping@example.com 289 $A12345@example.com 291 !def!xyz%abc@example.com 293 _somename@example.com 295 are valid and are seen fairly regularly, but any of the characters 296 listed above is permitted. In the context of local parts, 297 apostrophe ("'") and grave accent ("`") are ordinary characters, not 298 quoting characters. Some of the characters listed above are used in 299 conventions about routing or other types of special handling by some 300 receiving hosts. But, since there is no way to know whether the 301 remote host is using those conventions or just treating these 302 characters as normal text, sending programs (and programs evaluating 303 address validity) must simply accept the strings and pass them on. 305 In addition to restrictions on syntax, there is a length limit on 306 email addresses. That limit is a maximum of 64 characters (octets) 307 in the "local part" (before the "@") and a maximum of 255 characters 308 (octets) in the domain part (after the "@") for a total length of 320 309 characters. Systems that handle email should be prepared to process 310 addresses which are that long, even though they are rarely 311 encountered. 313 4. URLs and URIs 315 4.1 URI syntax definitions and issues 317 The syntax for URLs (Uniform Resource Locators) is specified in [RFC1738]. The 318 syntax for the more general "URI" (Uniform Resource Identifier) is 319 specified in [RFC2396]. The URI syntax is extremely general, with 320 considerable variations permitted according to the type of "scheme" 321 (e.g., "http", "ftp", "mailto") that is being used. While it is 322 possible to use the general syntax rules of RFC 2396 to perform 323 syntax checks, they are general enough -- essentially only specifying 324 the separation of the scheme name and "scheme specific part" with a 325 colon (":") and excluding some characters that must be escaped if 326 used -- to provide little significant filtering or validation power. 
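To illustrate how little the general syntax constrains, the following Python sketch performs only the split at the first colon described above. The function name split_scheme is hypothetical; the pattern for scheme names (a letter followed by letters, digits, "+", "-", or ".") follows the generic syntax of RFC 2396 and is the only validation attempted.

```python
import re

# Scheme names: a letter, then letters, digits, "+", "-", or ".".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def split_scheme(uri: str):
    """Return (scheme, scheme_specific_part), or None if no valid scheme."""
    scheme, colon, rest = uri.partition(":")
    if not colon or _SCHEME.fullmatch(scheme) is None:
        return None
    # Scheme names are compared without regard to case.
    return scheme.lower(), rest


if __name__ == "__main__":
    print(split_scheme("http://host/path"))
    print(split_scheme("MAILTO:joe@example.com"))
    print(split_scheme("example.com"))
```

Everything after the colon is returned untouched; any further checking has to be specific to the scheme that was found.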
328 The following characters are reserved in many URIs -- they must 329 either be used for their URI-intended purpose or be encoded. Some 330 particular schemes may either broaden or relax these restrictions 331 (see the following sections for URLs applicable to "web pages" and 332 electronic mail), or apply them only to particular URI component 333 parts. 335 ; / ? : @ & = + $ , 337 In addition, control characters, the space character, the 338 double-quote (") character, and the following special characters 340 < > # % 342 are generally forbidden and must either be avoided or escaped. 344 When it is necessary to encode these, or other, characters, the 345 method used is to replace each character with a percent-sign ("%") followed by two 346 hexadecimal digits representing its octet value. See section 2.4.1 347 of [RFC2396] for an exact definition. Unless it is used as a delimiter of the 348 URI scheme itself, any character may optionally be encoded this way; 349 systems that are testing URI syntax should be prepared for these 350 encodings to appear in any component of the URI except the scheme 351 name itself. 353 A "generic URI" syntax is specified and is more restrictive, but 354 using it to test URI strings requires that one know whether or not 355 the particular scheme in use obeys that syntax. Consequently, 356 applications that intend to check or validate URIs should normally 357 identify the scheme name and then apply scheme-specific tests. The 358 rules for two of those -- HTTP [RFC1866] and MAILTO [RFC2368] URLs -- 359 are discussed below, but the author of an application which intends 360 to make very precise checks, or to reject particular syntax rather 361 than just warning the user, should consult the relevant 362 scheme-definition documents for precise syntax and relationships. 
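The percent-encoding mechanism described in this section can be illustrated with the quote and unquote functions of Python's standard urllib.parse module. This is a sketch rather than a statement of what any particular scheme requires: the wrapper name percent_encode and its "keep" parameter are hypothetical, and the choice of which characters to leave unencoded is an assumption that a real application would take from the relevant scheme definition. Note that quote encodes non-ASCII text as UTF-8 octets by default, a convention this memo does not itself address.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str, keep: str = "") -> str:
    """Encode everything except letters, digits, "_.-~" and `keep`."""
    return quote(text, safe=keep)


if __name__ == "__main__":
    print(percent_encode("100% sure"))     # 100%25%20sure
    print(percent_encode("a/b?c", "/"))    # a/b%3Fc
    print(unquote("%41%2Fb"))              # A/b
```

Because any character may optionally be encoded, a decoder has to accept sequences such as "%41" even where the plain letter would have been allowed.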
364 4.2 The HTTP URL 366 Absolute HTTP URLs consist of the scheme name, a host name (expressed 367 as a domain name or IP address), and optional port number, and then, 368 optionally, a path, a search part, and a fragment identifier. These 369 are separated, respectively, by a colon and the two slashes that 370 precede the host name, a colon, a slash, a question mark, and a hash 371 mark ("#"). So we have 373 http://host:port/path?search#fragment 375 http://host/path/ 377 http://host/path#fragment 379 http://host/path?search 381 http://host 383 and other variations on that form. There is also a "relative" form, 384 but it almost never appears in text that a user might, e.g., enter 385 into a form. See for details. 387 The characters 389 / ; ? 391 are reserved within the path and search parts and must be encoded; 392 the first of these may be used unencoded, and is often used, within 393 the path to designate hierarchy. 395 4.3 The MAILTO URL 397 MAILTO is a URL type whose content is an email address. It can be 398 used to encode any of the email address formats discussed in 399 Restrictions on email addresses above. It can also support multiple 400 addresses and inclusion of headers (e.g., Subject lines) within the 401 body of the URL. MAILTO is authoritatively defined in RFC 2368 402 [RFC2368]; anyone expecting to accept, and test, multiple addresses 403 or mail header or body formats should consult that document 404 carefully. 406 In accepting text for, or validating, a MAILTO URL, it is important 407 to note that, while it can be used to encode any valid email address, 408 it is not sufficient to copy an email address into a MAILTO URL since 409 email addresses may include a number of characters that are invalid 410 in, or have reserved uses for, URLs. Those characters must be 411 encoded, as outlined in URI syntax definitions and issues above, when 412 the addresses are mapped into the URL form. 
Conversely, addresses in 413 MAILTO URLs cannot be copied directly into email contexts, since few 414 email programs will reverse the encodings (and doing so might be 415 interpreted as a protocol violation). 417 The following characters may appear in MAILTO URLs only with the 418 specific defined meanings given. If they appear in an email address 419 (i.e., for some other purpose), they must be encoded: 421 : The colon in "mailto:" 423 <>#"%{}|\^~` These characters are "unsafe" in any URL, and must 424 always be encoded. 426 The following characters must also be encoded if they appear in a 427 MAILTO URL 429 ?&= Used to delimit headers and their values when these 430 are encoded into URLs. 432 Some examples may be helpful:
434 +----------------------+----------------------+---------------------+
435 | Email address        | MAILTO URL           | Notes               |
436 +----------------------+----------------------+---------------------+
437 | Joe@example.com      | mailto:joe@example.c | 1                   |
438 |                      | om                   |                     |
439 |                      |                      |                     |
440 | user+mailbox@example | mailto:              | 2                   |
441 | .com                 | user%2Bmailbox@examp |                     |
442 |                      | le.com               |                     |
443 |                      |                      |                     |
444 | customer/department= | mailto:customer%2Fde | 3                   |
445 | shipping@example.com | partment=shipping@ex |                     |
446 |                      | ample.com            |                     |
447 |                      |                      |                     |
448 | $A12345@example.com  | mailto:$A12345@examp | 4                   |
449 |                      | le.com               |                     |
450 |                      |                      |                     |
451 | !def!xyz%abc@example | mailto:!def!xyz%25ab | 5                   |
452 | .com                 | c@example.com        |                     |
453 |                      |                      |                     |
454 | _somename@example.co | mailto:_somename@exa | 4                   |
455 | m                    | mple.com             |                     |
456 +----------------------+----------------------+---------------------+
458 Table 1 460 Notes on Table 462 1. No characters appear in the email address that require escaping, 463 so the body of the MAILTO URL is identical to the email address. 465 2. There is actually some uncertainty as to whether or not the "+" 466 character requires escaping in MAILTO URLs (the standards are 467 not precisely clear). 
But, since any character in the address 468 specification may optionally be encoded, it is probably safer to 469 encode it. 471 3. The "/" character is generally reserved in URLs, and must be 472 encoded as %2F. 474 4. Neither the "$" nor the "_" character are given any special 475 interpretation in MAILTO URLs, so need not be encoded. 477 5. While the "!" character has no special interpretation, the "%" 478 character is used to introduce encoded sequences and hence it 479 must always be encoded. 481 4.4 Guessing domain names in web contexts 483 Several web browsers have adopted a practice that permits an 484 incomplete domain name to be used as input instead of a complete URL. 485 This has, for example, permitted users to type "microsoft" and have 486 the browser interpret the input as "http://www.microsoft.com/". 487 Other browser versions have gone even further, trying to build DNS 488 names up through a series of heuristics, testing each variation in 489 turn to see if it appears in the DNS, and accepting the first one 490 found as the intended domain name. If this approach is to be used, 491 it is often critical that the browser recognize the complete list of 492 TLDs. If an incomplete list is used, complete domain names may not 493 be recognized as such and the system may try to turn them into 494 completely different names. For example, "example.aero" is a 495 fully-qualified name, since "AERO." is a TLD name. But, if the 496 system doesn't recognize "AERO" as a TLD name, it is likely to try to 497 look up "example.aero.com" and "www.example.aero.com" (and then fail 498 or find the wrong host), rather than simply looking up the 499 user-supplied name. 501 As discussed in Restrictions on domain (DNS) names above, there are 502 dangers associated with software that attempts to "know" the list of 503 top-level domain names locally and take advantage of that knowledge. 
504 These name-guessing heuristics are another example of that situation: 505 if the lists are up-to-date and used carefully, the systems in which 506 they are embedded may provide an easier, and more attractive, 507 experience for at least some users. But finding the wrong host, or 508 being unable to find a host even when its name is precisely known, 509 constitute bad experiences by any measure. 511 More generally, there have been bad experiences with attempts to 512 "complete" domain names by adding additional information to them. 513 These issues are described in some detail in RFC 1535 [RFC1535]. 515 5. Implications of internationalization 517 The IETF has adopted a series of proposals ([RFC3490] - [RFC3492]) whose purpose is to 518 permit encoding internationalized (i.e., non-ASCII) names in the DNS. 519 The primary standard, and the group generically, are known as "IDNA". 520 The actual strings stored in the DNS are in an encoded form: the 521 labels begin with the characters "xn--" followed by the encoded 522 string. Applications should be prepared to accept and process both 523 the encoded form (those strings are consistent with the "LDH rule" 524 (see Restrictions on domain (DNS) names) so should not raise any 525 separate issues) and the use of local, and potentially other, 526 characters as appropriate to local systems and circumstances. 528 The IDNA specification describes the exact process to be used to 529 validate a name or encoded string. The process is sufficiently 530 complex that shortcuts or heuristics, especially for versions of 531 labels written directly in Unicode or other coded character sets, are 532 likely to fail and cause problems. In particular, the strings cannot 533 be validated with syntax or semantic rules of any of the usual sorts: 534 syntax validity is defined only in terms of the result of executing a 535 particular function. 
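Since validity is defined by the result of executing a function, an application is better served by calling an existing implementation than by writing its own pattern. As a hedged illustration, Python's built-in "idna" codec implements the RFC 3490 conversions; the wrapper name to_ascii_name below is hypothetical, and returning None on failure is a design choice made for this sketch only.

```python
from typing import Optional


def to_ascii_name(name: str) -> Optional[str]:
    """Return the encoded ("xn--") form of `name`, or None if the
    conversion function rejects it."""
    try:
        # The "idna" codec applies the conversion to each label in turn.
        return name.encode("idna").decode("ascii")
    except UnicodeError:
        return None


if __name__ == "__main__":
    # "b\u00fccher" is "bucher" with u-umlaut as its second letter.
    print(to_ascii_name("b\u00fccher.example"))   # xn--bcher-kva.example
    print(to_ascii_name("example.com"))           # example.com
    print(to_ascii_name("a" * 64 + ".example"))   # None
```

The last call fails because the label exceeds 63 octets; the decision comes from the conversion function, not from a locally maintained rule.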
537 In addition to the restrictions imposed by the protocols themselves, 538 many domains are implementing rules about just which non-ASCII names 539 they will permit to be registered (see, e.g., [JET] and [RegRestr]). This work is still 540 relatively new, and the rules and conventions are likely to be 541 different for each domain, or at least each language or script group. 542 Attempting to test for those rules in a client program to see if a 543 user-supplied name might possibly exist in the relevant domain would 544 almost certainly be ill-advised. 546 One quick, local, test, however, may be reasonable: as of the time of 547 this writing, there should be no instances of labels in the DNS that 548 start with two characters, followed by two hyphens, where the two 549 characters are not "xn" (in, of course, either upper or lower case). 550 Such label strings, if they appear, are probably erroneous or 551 obsolete, and it may be reasonable to at least warn the user about 552 them. 554 There is ongoing work in the IETF and elsewhere to define 555 internationalized formats for use in other protocols, including email 556 addresses. Those forms may or may not conform to existing rules for 557 ASCII-only identifiers; anyone designing evaluators or filters should 558 watch that work closely. 560 6. Summary 562 When an application accepts a string from the user and ultimately 563 passes it on to an API for a protocol, the desirability of testing or 564 filtering the text in any way not required by the protocol itself is 565 hotly debated. If it must divide the string into its components, or 566 otherwise interpret it, it obviously must make at least enough tests 567 to validate that process. 
With, e.g., domain names or email 568 addresses that can be passed on untouched, the appropriateness of 569 trying to figure out which ones are valid and which ones are not 570 requires a more complex decision, one that should include 571 considerations of how to make exactly the correct tests and to keep 572 information that changes and evolves up-to-date. Making the test 573 incorrectly, or with obsolete information, can be extremely 574 frustrating for potential correspondents or customers and may harm 575 desired relationships. 577 7. Security considerations 579 Since this document merely summarizes the requirements of existing 580 standards, it does not introduce any new security issues. However, 581 many of the techniques that motivate the document raise important 582 security concerns of their own. Rejecting valid forms of domain 583 names, email addresses, or URIs often denies service to the user of 584 those entities. Worse, guessing at the user's intent when an 585 incomplete address, or other string, is given can result in 586 compromises to privacy or accuracy of reference if the wrong target 587 is found and returned. From a security standpoint, the optimum 588 behavior is probably to never guess, but, instead, to force the user 589 to specify exactly what is wanted. When that position involves a 590 tradeoff with an acceptable user experience, good judgment should be 591 used and the fact that it is a tradeoff recognized. 593 8. Acknowledgements 595 The author would like to express his appreciation for helpful 596 comments from Harald Alvestrand, Eric A. Hall, and the RFC Editor, 597 and for partial support of this work from SITA. Responsibility for 598 any errors remains, of course, with the author. 600 The first Internet-Draft on this subject was posted in February 2003. 601 The document was submitted to the RFC Editor on 20 June 2003, 602 returned for revisions on 19 August, and resubmitted on 5 September 603 2003. 
605 Normative References 607 [RFC1035] Mockapetris, P., "Domain names - implementation and 608 specification", RFC 1035, STD 13, November 1987. 610 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 611 and Support", RFC 1123, STD 3, October 1989. 613 [RFC1535] Gavron, E., "A Security Problem and Proposed Correction 614 With Widely Deployed DNS Software", RFC 1535, October 615 1993. 617 [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform 618 Resource Locators (URL)", RFC 1738, December 1994. 620 [RFC1866] Berners-Lee, T. and D. Connolly, "Hypertext Markup 621 Language - 2.0", RFC 1866, November 1995. 623 [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS 624 Specification", RFC 2181, July 1997. 626 [RFC2368] Hoffman, P., Masinter, L. and J. Zawinski, "The mailto URL 627 scheme", RFC 2368, July 1998. 629 [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 630 Resource Identifiers (URI): Generic Syntax", RFC 2396, 631 August 1998. 633 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., 634 Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext 635 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 637 [RFC2821] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821, 638 April 2001. 640 [RFC2822] Resnick, P., "Internet Message Format", RFC 2822, April 641 2001. 643 [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, 644 "Internationalizing Domain Names in Applications (IDNA)", 645 RFC 3490, March 2003. 647 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 648 Profile for Internationalized Domain Names (IDN)", RFC 649 3491, March 2003. 651 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 652 for Internationalized Domain Names in Applications 653 (IDNA)", RFC 3492, March 2003. 655 [refs.ASCII] 656 American National Standards Institute (formerly United 657 States of America Standards Institute), "USA Code for 658 Information Interchange.
ANSI X3.4-1968 has been replaced 659 by newer versions with slight modifications, but the 1968 660 version remains definitive for the Internet.", ANSI 661 X3.4-1968. 663 Non-normative References 665 [ISO.3166.1988] 666 International Organization for Standardization, "Codes for 667 the representation of names of countries, 3rd edition", 668 ISO Standard 3166, August 1988. 670 [JET] Seng, J., "Internationalized Domain Names Registration and 671 Administration Guideline for Chinese, Japanese and 672 Korean", draft-jseng-idn-admin-03 (work in progress), June 673 2003. 675 [RFC1591] Postel, J., "Domain Name System Structure and Delegation", 676 RFC 1591, March 1994. 678 [RegRestr] 679 Klensin, J., "Registration Restrictions on 680 Internationalized Domain Names -- An Overview", 681 draft-klensin-reg-guidelines-00 (work in progress), June 682 2003. 684 Author's Address 686 John C Klensin 687 1770 Massachusetts Ave, #322 688 Cambridge, MA 02140 689 USA 691 Phone: +1 617 491 5735 692 EMail: john-ietf@jck.com 694 Intellectual Property Statement 696 The IETF takes no position regarding the validity or scope of any 697 intellectual property or other rights that might be claimed to 698 pertain to the implementation or use of the technology described in 699 this document or the extent to which any license under such rights 700 might or might not be available; neither does it represent that it 701 has made any effort to identify any such rights. Information on the 702 IETF's procedures with respect to rights in standards-track and 703 standards-related documentation can be found in BCP-11. Copies of 704 claims of rights made available for publication and any assurances of 705 licenses to be made available, or the result of an attempt made to 706 obtain a general license or permission for the use of such 707 proprietary rights by implementors or users of this specification can 708 be obtained from the IETF Secretariat. 
710 The IETF invites any interested party to bring to its attention any 711 copyrights, patents or patent applications, or other proprietary 712 rights which may cover technology that may be required to practice 713 this standard. Please address the information to the IETF Executive 714 Director. 716 Full Copyright Statement 718 Copyright (C) The Internet Society (2003). All Rights Reserved. 720 This document and translations of it may be copied and furnished to 721 others, and derivative works that comment on or otherwise explain it 722 or assist in its implementation may be prepared, copied, published 723 and distributed, in whole or in part, without restriction of any 724 kind, provided that the above copyright notice and this paragraph are 725 included on all such copies and derivative works. However, this 726 document itself may not be modified in any way, such as by removing 727 the copyright notice or references to the Internet Society or other 728 Internet organizations, except as needed for the purpose of 729 developing Internet standards in which case the procedures for 730 copyrights defined in the Internet Standards process must be 731 followed, or as required to translate it into languages other than 732 English. 734 The limited permissions granted above are perpetual and will not be 735 revoked by the Internet Society or its successors or assignees. 737 This document and the information contained herein is provided on an 738 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 739 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 740 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 741 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 742 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 744 Acknowledgment 746 Funding for the RFC Editor function is currently provided by the 747 Internet Society.