idnits 2.17.1 draft-klensin-idnabis-issues-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 15. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 1865. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1876. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1883. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1889. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. 
(See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 18, 2007) is 6003 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2119' is defined on line 1752, but no explicit reference was found in the text == Unused Reference: 'Unicode32' is defined on line 1785, but no explicit reference was found in the text == Unused Reference: 'Unicode40' is defined on line 1796, but no explicit reference was found in the text == Unused Reference: 'RFC3986' is defined on line 1828, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' -- Possible downref: Non-RFC (?) normative reference: ref. 'IDNA200X-Bidi' == Outdated reference: A later version (-05) exists of draft-faltstrom-idnabis-tables-03 == Outdated reference: A later version (-04) exists of draft-klensin-idnabis-protocol-01 -- Possible downref: Normative reference to a draft: ref. 'IDNA200X-protocol' ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564) ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) ** Downref: Normative reference to an Informational RFC: RFC 3743 ** Downref: Normative reference to an Informational RFC: RFC 4290 -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-UAX15' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode40' -- Possible downref: Non-RFC (?) normative reference: ref. 
'Unicode50' -- Obsolete informational reference (is this intentional?): RFC 810 (Obsoleted by RFC 952) Summary: 6 errors (**), 0 flaws (~~), 9 warnings (==), 15 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Klensin, Ed. 3 Internet-Draft November 18, 2007 4 Expires: May 21, 2008 6 Internationalizing Domain Names for Applications (IDNA): Issues and 7 Rationale 8 draft-klensin-idnabis-issues-05.txt 10 Status of this Memo 12 By submitting this Internet-Draft, each author represents that any 13 applicable patent or other IPR claims of which he or she is aware 14 have been or will be disclosed, and any of which he or she becomes 15 aware will be disclosed, in accordance with Section 6 of BCP 79. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt. 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on May 21, 2008. 35 Copyright Notice 37 Copyright (C) The IETF Trust (2007). 39 Abstract 41 A recent IAB report identified issues that have been raised with 42 Internationalized Domain Names (IDNs). Some of these issues require 43 tuning of the existing protocols and the tables on which they depend. 
44 Based on intensive discussion by an informal design team, this 45 document provides an overview of some of the proposals that are being 46 made, provides explanatory material for them and then further 47 explains some of the issues that have been encountered.
49 Table of Contents
51 1. Introduction
52 1.1. Context and Overview
53 1.2. Discussion Forum
54 1.3. Objectives
55 1.4. Applicability and Function of IDNA
56 1.5. Terminology
57 1.5.1. Documents and Standards
58 1.5.2. Terminology about Characters and Character Sets
59 1.5.3. DNS-related Terminology
60 1.5.4. Terminology Specific to IDNA
61 1.5.5. Punycode is an Algorithm, not a Name
62 1.5.6. Other Terminology Issues
63 2. The Original (2003) IDNA Model
64 2.1. Proposed label
65 2.2. Permitted Character Identification
66 2.3. Character Mappings
67 2.4. Registry Restrictions
68 2.5. Punycode Conversion
69 2.6. Lookup or Insertion in the Zone
70 3. A Revised IDNA Model
71 3.1. Localization: The Role of the Local System and User 72 Interface
73 3.2. IDN Processing in the IDNA200x Model
74 3.2.1. Summary of Effects
75 4. IDNA200x Document List
76 5. Permitted Characters: An Inclusion List
77 5.1. A Tiered Model of Permitted Characters and Labels
78 5.1.1. ALWAYS
79 5.1.2. MAYBE
80 5.1.3. CONTEXTUAL RULE REQUIRED
81 5.1.4. NEVER
82 5.2. Layered Restrictions: Tables, Context, Registration, 83 Applications
84 5.3. A New Character List -- History
85 5.4. Understanding New Issues and Constraints
86 5.5. ALWAYS, MAYBE, and Contextual Rules
87 6. Issues that Any Solution Must Address
88 6.1. Display and Network Order
89 6.2. Entry and Display in Applications
90 6.3. The Ligature and Digraph Problem
91 6.4. Right-to-left Text
92 7. IDNs and the Robustness Principle
93 8. Migration and Version Synchronization
94 8.1. Design Criteria
95 8.2. More Flexibility in User Agents
96 8.3. The Question of Prefix Changes
97 8.3.1. Conditions requiring a prefix change
98 8.3.2. Conditions not requiring a prefix change
99 8.4. Stringprep Changes and Compatibility
100 8.5. The Symbol Question
101 8.6. Other Compatibility Issues
102 9. Acknowledgments
103 10. Contributors
104 11. IANA Considerations
105 11.1. IDNA Permitted Character Registry
106 11.2. IDNA Context Registry
107 11.3. IANA Repository of TLD IDN Practices
108 12. Security Considerations
109 13. Change Log
110 13.1. Version -01
111 13.2. Version -02
112 13.3. Version -03
113 13.4. Version -04
114 13.5. Version -05
115 14. References
116 14.1. Normative References
117 14.2. Informative References
118 Author's Address
119 Intellectual Property and Copyright Statements
121 1. Introduction
123 1.1. Context and Overview
125 A recent IAB report [RFC4690] identified issues that have been raised 126 with Internationalized Domain Names (IDNs) and the associated 127 standards. Those standards are known as Internationalized Domain 128 Names in Applications (IDNA), taken from the name of the highest 129 level standard within that group (see Section 1.5). Based on 130 discussion of those issues and their impact, some of these standards 131 now require tuning of the existing protocols and the tables on which 132 they depend. This document further explains, based on the results of 133 some intensive discussions by an informal design team, on a mailing 134 list, and in broader discussions, some of the issues that have been 135 encountered. It also provides an overview of the proposals that are 136 being made and explanatory material for them. 
Additional explanatory 137 material for other proposals will appear with the associated 138 documents. 140 This document begins with a discussion of the original and new IDNA 141 models and the general differences in strategy between the original 142 version of IDNA and the proposed new version. It continues with a 143 description of specific changes that are needed and issues that the 144 design must address, including some that were not explicitly 145 addressed in RFC 4690. 147 1.2. Discussion Forum 149 [[anchor4: RFC Editor: please remove this section.]] 151 This work is being discussed on the mailing list 152 idna-update@alvestrand.no 154 1.3. Objectives 156 The intent of the IDNA revision effort, and hence of this document 157 and the associated ones, is to increase the usability and 158 effectiveness of internationalized domain names (IDNs) while 159 preserving or strengthening the integrity of references that use 160 them. The original "hostname" (LDH) character definitions (see, 161 e.g., [RFC0810]) struck a balance between the creation of useful 162 mnemonics and the introduction of parsing problems or general 163 confusion in the contexts in which domain names are used. Our 164 objective is to preserve that balance while expanding the character 165 repertoire to include extended versions of Roman-derived scripts and 166 scripts that are not Roman in origin. No work of this sort will be 167 able to completely eliminate sources of visual or textual confusion: 168 such confusion exists even under the original rules. However, one 169 can hope, through the application of different techniques at 170 different points (see Section 5.2), to keep problems to an acceptable 171 minimum. 
One consequence of this general objective is that the 172 desire of some user or marketing community to use a particular string 173 --whether the reason is to try to write sentences of particular 174 languages in the DNS, to express a facsimile of the symbol for a 175 brand, or for some other purpose-- is not a primary goal or even a 176 particularly important one. 178 1.4. Applicability and Function of IDNA 180 The IDNA standard does not require any applications to conform to it, 181 nor does it retroactively change those applications. An application 182 can elect to use IDNA in order to support IDN while maintaining 183 interoperability with existing infrastructure. If an application 184 wants to use non-ASCII characters in domain names, IDNA is the only 185 currently-defined option. Adding IDNA support to an existing 186 application entails changes to the application only, and leaves room 187 for flexibility in the user interface. 189 A great deal of the discussion of IDN solutions has focused on 190 transition issues and how IDN will work in a world where not all of 191 the components have been updated. Proposals that were not chosen by 192 the original IDN Working Group would depend on user applications, 193 resolvers, and DNS servers being updated in order for a user to use 194 an internationalized domain name in any form or coding acceptable 195 under that method. With IDNA, while processing must be performed prior to or 196 after access to the DNS, no changes are needed to the DNS protocol or 197 any DNS servers or the resolvers on users' computers. 199 The IDNA specification solves the problem of extending the repertoire 200 of characters that can be used in domain names to include a large 201 subset of the Unicode repertoire. 203 IDNA does not extend the service offered by DNS to the applications. 204 Instead, the applications (and, by implication, the users) continue 205 to see an exact-match lookup service. 
Either there is a single 206 exactly-matching name or there is no match. This model has served 207 the existing applications well, but it requires, with or without 208 internationalized domain names, that users know the exact spelling of 209 the domain names that are to be typed into applications such as web 210 browsers and mail user agents. The introduction of the larger 211 repertoire of characters potentially makes the set of misspellings 212 larger, especially given that in some cases the same appearance, for 213 example on a business card, might visually match several Unicode code 214 points or several sequences of code points. 216 IDNA allows the graceful introduction of IDNs not only by avoiding 217 upgrades to existing infrastructure (such as DNS servers and mail 218 transport agents), but also by allowing some rudimentary use of IDNs 219 in applications by using the ASCII representation of the non-ASCII 220 name labels. While such names are user-unfriendly to read and type, 221 and hence not optimal for user input, they allow (for instance) 222 replying to email and clicking on URLs even though the domain name 223 displayed is incomprehensible to the user. In order to allow user- 224 friendly input and output of the IDNs, the applications need to be 225 modified to conform to this specification. 227 IDNA uses the Unicode character repertoire, which avoids the 228 significant delays that would be inherent in waiting for a different 229 and specific character set to be defined for IDN purposes, presumably by 230 some other standards developing organization. 232 1.5. Terminology 234 1.5.1. Documents and Standards 236 This document uses the term "IDNA2003" to refer to the set of 237 standards that make up and support the version of IDNA published in 238 2003, i.e., those commonly known as the IDNA base specification 239 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep 240 [RFC3454]. 
In this document, those names are used to refer, 241 conceptually, to the individual documents, with the base IDNA 242 specification called just "IDNA". 244 The term "IDNA200x" is used to refer to a possible new version of 245 IDNA without specifying which particular documents would be affected. 246 While more common IETF usage might refer to the successor document(s) 247 as "IDNAbis", this document uses that term, and similar ones, to 248 refer to successors to the individual documents, e.g., "IDNAbis" is a 249 synonym for the specific successor to RFC3490, or "RFC3490bis". See 250 also Section 4. 252 1.5.2. Terminology about Characters and Character Sets 254 A code point is an integer value associated with a character in a 255 coded character set. 257 Unicode [Unicode50] is a coded character set containing tens of 258 thousands of characters. A single Unicode code point is denoted by 259 "U+" followed by four to six hexadecimal digits, while a range of 260 Unicode code points is denoted by two hexadecimal numbers separated 261 by "..", with no prefixes. 263 ASCII means US-ASCII [ASCII], a coded character set containing 128 264 characters associated with code points in the range 00..7F. Unicode 265 may be thought of as an extension of ASCII: it includes all the ASCII 266 characters and associates them with equivalent code points. 268 1.5.3. DNS-related Terminology 270 When discussing the DNS, this document generally assumes the 271 terminology used in the DNS specifications [RFC1034] [RFC1035]. The 272 terms "lookup" and "resolution" are used interchangeably and the 273 process or application component that performs DNS resolution is 274 called a "resolver". The process of placing an entry into the DNS is 275 referred to as "registration" paralleling common contemporary usage 276 in other contexts. 
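As a minimal sketch of the notation defined in Section 1.5.2 (an illustration only, not part of any IDNA specification), the following Python fragment renders a code point in the "U+" form and tests membership in the ASCII range 00..7F. The function names format_code_point and is_ascii are hypothetical, chosen for this example.

```python
def format_code_point(cp: int) -> str:
    """Render a Unicode code point as "U+" plus four to six hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # %04X pads to at least four digits; larger values use five or six.
    return "U+%04X" % cp


def is_ascii(cp: int) -> bool:
    """True for code points in the US-ASCII range 00..7F."""
    return 0x00 <= cp <= 0x7F
```

For instance, the hyphen-minus is rendered as U+002D and lies inside the ASCII range, while U+00FC does not.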
278 The term "LDH code points" is defined in this document to mean the 279 code points associated with ASCII letters, digits, and the hyphen- 280 minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an 281 abbreviation for "letters, digits, hyphen". 283 The base DNS specifications [RFC1034] [RFC1035] discuss "domain 284 names" and "host names", but many people and sections of these 285 specifications use the terms interchangeably. Further, because those 286 documents were not terribly clear, many people who are sure they know 287 the exact definitions of each of these terms disagree on the 288 definitions. In this document the term "domain name" is used in 289 general. This document explicitly cites those documents whenever 290 referring to the host name syntax restrictions defined therein. The 291 remaining definitions in this subsection are essentially a review. 293 A label is an individual part of a domain name. Labels are usually 294 shown separated by dots; for example, the domain name 295 "www.example.com" is composed of three labels: "www", "example", and 296 "com". (The zero-length root label described in [RFC1123], which can 297 be explicit as in "www.example.com." or implicit as in 298 "www.example.com", is not considered a label in this specification.) 299 IDNA extends the set of usable characters in labels that are text. 300 For the rest of this document, the term "label" is shorthand for 301 "text label", and "every label" means "every text label". 303 1.5.4. Terminology Specific to IDNA 305 Some of the terminology used in describing IDNs in the IDNA2003 306 context has been a source of confusion. This section defines some 307 new terminology to reduce dependence on the problematic terms and 308 definitions that appear in RFC 3490. 310 1.5.4.1. Terms for IDN Label Codings 312 1.5.4.1.1. IDNA-valid strings, A-label, and U-label 314 To improve clarity, this document introduces three new terms. 
A 315 string is "IDNA-valid" if it meets all of the requirements of this 316 specification for an IDNA label. It may be either an "A-label" or a 317 "U-label", and it is expected that specific reference will be made to 318 the form appropriate to any context in which the distinction is 319 important. An "A-label" is the ASCII-Compatible Encoding (ACE) form 320 of an IDNA-valid string. It must be a complete label and valid as 321 the output of ToASCII, regardless of how it is actually produced. 322 This means, by definition, that every A-label will begin with the 323 IDNA ACE prefix, "xn--", followed by a string that is a valid output 324 of the Punycode algorithm and hence a maximum of 59 ASCII characters 325 in length. The prefix and string together must conform to all 326 requirements for a label that can be stored in the DNS including 327 conformance to the LDH rule. A "U-label" is an IDNA-valid string of 328 Unicode-coded characters that is a valid output of performing 329 ToUnicode on an A-label, again regardless of how the label is 330 actually produced. A Unicode string that cannot be generated by 331 decoding a valid A-label is not a valid U-label. [IDNA200X-protocol] 332 specifies the conversions between U-labels and A-labels. 334 Any rules or conventions that apply to DNS labels in general, such as 335 rules about lengths of strings, apply to whichever of the U-label or 336 A-label would be more restrictive. The exception to this, of course, 337 is that the restriction to ASCII characters does not apply to the 338 U-label. 340 A different way to look at these terms, which may be more clear to 341 some readers, is that U-labels, A-labels, and LDH-labels are disjoint 342 categories that, together, make up the forms of legitimate strings 343 for use in domain names that describe hosts. 
Of the three, only 344 A-labels and LDH-labels can actually appear in DNS zone files or 345 queries; U-labels can appear, along with those two, in presentation 346 and user interface forms and in selected protocols other than the DNS 347 ones themselves. Strings that do not conform to the rules for one of 348 these three categories and, in particular, strings that contain "-" 349 in the third or fourth character position but are 351 o not A-labels or 353 o that cannot be processed as U-labels or A-labels as described in 354 these specifications, 356 are invalid as labels in domain names that identify Internet hosts or 357 similar resources. 359 1.5.4.1.2. LDH-label and Internationalized Label 361 In the hope of further clarifying discussions about IDNs, this 362 document uses the term "LDH-label" strictly to refer to an all-ASCII 363 label that obeys the "hostname" (LDH) conventions and that is not an 364 IDN. In other words, the categories "U-label", "A-label", and "LDH- 365 label" are disjoint, with only the first two referring to IDNs. When 366 such a term is needed, an "internationalized label" is one that is a 367 member of the union of those three categories. There are some 368 standardized DNS label formats, such as those for service location 369 (SRV) records [RFC2782] that do not fall into any of the three 370 categories and hence are not internationalized labels. 372 1.5.4.2. Equivalence 374 In IDNA, equivalence of labels is defined in terms of the A-labels. 375 If the A-labels are equal in a case-independent comparison, then the 376 labels are considered equivalent, no matter how they are represented. 377 Traditional LDH labels already have a notion of equivalence: within 378 that list of characters, upper case and lower case are considered 379 equivalent. The IDNA notion of equivalence is an extension of that 380 older notion. 
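The comparison rule of Section 1.5.4.2 can be sketched as follows. This is an illustration under stated assumptions, not a conforming implementation: it relies on Python's built-in "idna" codec, which implements the IDNA2003 ToASCII operation of RFC 3490 rather than anything proposed for IDNA200x, and the helper name labels_equivalent is hypothetical.

```python
def labels_equivalent(label_a: str, label_b: str) -> bool:
    """Compare two single labels by their A-label forms, ignoring ASCII case."""
    # Assumption: Python's "idna" codec (IDNA2003 ToASCII) stands in for the
    # U-label to A-label conversion; all-ASCII labels pass through unchanged.
    ace_a = label_a.encode("idna").decode("ascii")
    ace_b = label_b.encode("idna").decode("ascii")
    # Equivalence is decided on the ASCII forms, case-independently.
    return ace_a.lower() == ace_b.lower()
```

Under these assumptions, a U-label and the upper-case spelling of its A-label compare as equivalent, because both reduce to the same lower-case ASCII string.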
Equivalent labels in IDNA are treated as alternate 381 forms of the same label, just as "foo" and "Foo" are treated as 382 alternate forms of the same label. 384 1.5.4.3. ACE prefix 386 The "ACE prefix" is defined in this document to be a string of ASCII 387 characters "xn--" that appears at the beginning of every A-label. 388 "ACE" stands for "ASCII-Compatible Encoding". 390 1.5.4.4. Domain name slot 392 A "domain name slot" is defined in this document to be a protocol 393 element or a function argument or a return value (and so on) 394 explicitly designated for carrying a domain name. Examples of domain 395 name slots include: the QNAME field of a DNS query; the name argument 396 of the gethostbyname() library function; the part of an email address 397 following the at-sign (@) in the From: field of an email message 398 header; and the host portion of the URI in the src attribute of an 399 HTML <IMG> tag. General text that just happens to contain a domain 400 name is not a domain name slot. For example, a domain name appearing 401 in the plain text body of an email message is not occupying a domain 402 name slot. 404 An "IDN-aware domain name slot" is defined in this document to be a 405 domain name slot explicitly designated for carrying an 406 internationalized domain name as defined in this document. The 407 designation may be static (for example, in the specification of the 408 protocol or interface) or dynamic (for example, as a result of 409 negotiation in an interactive session). 411 An "IDN-unaware domain name slot" is defined in this document to be 412 any domain name slot that is not an IDN-aware domain name slot. 413 Obviously, this includes any domain name slot whose specification 414 predates IDNA. 416 1.5.5. 
Punycode is an Algorithm, not a Name 418 There has been some confusion about whether a "Punycode string" does 419 or does not include the prefix and about whether it is required that 420 such strings could have been the output of ToASCII (see RFC 3490, 421 Section 4 [RFC3490]). This specification discourages the use of the 422 term "Punycode" to describe anything but the encoding method and 423 algorithm of [RFC3492]. The terms defined above are preferred as 424 much more clear than terms such as "Punycode string". 426 1.5.6. Other Terminology Issues 428 The document departs from historical DNS terminology and usage in one 429 important respect. Over the years, the community has talked very 430 casually about "names" in the DNS, beginning with calling it "the 431 domain name system". That terminology is fine in the very precise 432 sense that the identifiers of the DNS do provide names for objects 433 and addresses. But, in the context of IDNs, the term has introduced 434 some confusion, confusion that has increased further as people have 435 begun to speak of DNS labels in terms of the words or phrases of 436 various natural languages. 438 Historically, many, perhaps most, of the "names" in the DNS have just 439 been mnemonics to identify some particular concept, object, or 440 organization. They are typically derived from, or rooted in, some 441 language because most people think in language-based ways. But, 442 because they are mnemonics, they need not obey the orthographic 443 conventions of any language: it is not a requirement that it be 444 possible for them to be "words". 446 This distinction is important because the reasonable goal of an IDN 447 effort is not to be able to write the great Klingon (or language of 448 one's choice) novel in DNS labels but to be able to form a usefully 449 broad range of mnemonics in ways that are as natural as possible in a 450 very broad range of scripts. 
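To make the distinction drawn in Section 1.5.5 concrete, the sketch below separates the raw output of the Punycode algorithm from the A-label that carries it. It assumes Python's built-in "punycode" codec (an implementation of RFC 3492) and an input that is already a valid, lower-case U-label containing at least one non-ASCII character. No IDNA validity checks are performed, and the names punycode_output and to_a_label are hypothetical.

```python
ACE_PREFIX = "xn--"
MAX_LABEL_LENGTH = 63  # DNS limit on the length of a single label


def punycode_output(u_label: str) -> str:
    """Return only what the Punycode algorithm produces (no prefix)."""
    return u_label.encode("punycode").decode("ascii")


def to_a_label(u_label: str) -> str:
    """Prepend the ACE prefix to the Punycode output to form an A-label."""
    a_label = ACE_PREFIX + punycode_output(u_label)
    # The prefix takes 4 of the 63 characters, leaving at most 59.
    if len(a_label) > MAX_LABEL_LENGTH:
        raise ValueError("A-label exceeds the DNS label length limit")
    return a_label
```

For the U-label "bücher", punycode_output yields "bcher-kva", while to_a_label yields "xn--bcher-kva". Only the latter is an A-label in the sense defined above.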
452 An "internationalized domain name" (IDN) is a domain name that may 453 contain one or more A-labels or U-labels, as appropriate, instead of 454 LDH labels. This implies that every conventional domain name is an 455 IDN (which implies that it is possible for a name to be an IDN 456 without it containing any non-ASCII characters). This document does 457 not attempt to define an "internationalized host name". Just as has 458 been the case with ASCII names, some DNS zone administrators may 459 impose restrictions, beyond those imposed by DNS or IDNA, on the 460 characters or strings that may be registered as labels in their 461 zones. Such restrictions have no effect on the syntax or semantics 462 of DNS protocol messages; a query for a name that matches no records 463 will yield the same response regardless of the reason why it is not 464 in the zone. Clients issuing queries or interpreting responses 465 cannot be assumed to have any knowledge of zone-specific restrictions 466 or conventions. 468 2. The Original (2003) IDNA Model 470 IDNA is a client-side protocol, i.e., almost all of the processing is 471 performed by the client. The strings that appear in, and are 472 resolved by, the DNS conform to the traditional rules for the naming 473 of hosts, and consist of ASCII letters, digits, and hyphens. This 474 approach permits IDNA to be deployed without modifications to the DNS 475 itself. That, in turn, avoids both having to upgrade the entire 476 Internet to support IDNs and needing to incur the unknown risks to 477 deployed systems of DNS structural or design changes especially if 478 those changes need to be deployed all at the same time. 480 This section contains a summary of the model underlying IDNA2003. It 481 is approximate and is not a substitute for reading and understanding 482 the actual specification document [RFC3490] and the documents on 483 which it depends. The summary is not intended to be completely 484 balanced. 
It emphasizes some characteristics of IDNA2003 that are 485 particularly important to understanding the nature of the proposed 486 changes. 488 The original IDNA specifications have the logical flow in domain name 489 registration and resolution outlined in the balance of this section. 490 They are not defined this way; instead, the steps are presented here 491 for convenience in comparison to what is being proposed in this 492 document and the associated ones. In particular, IDNA2003 does not 493 make as strong a distinction between procedures for registration and 494 those for resolution as the ones suggested in Section 3 and 495 Section 5.1. 497 The IDNA2003 specification explicitly includes the equivalents of the 498 steps in Section 2.2, Section 2.3, and Section 2.5 below. While the 499 other steps are present --either inside the protocol or presumed to 500 be performed before or after it-- they are not discussed explicitly. 501 That omission has been a source of confusion. Another source has 502 been the definition of IDNA2003 as an algorithm, expressed partially in 503 prose and partially in pseudo code and tables. The steps below 504 follow the more traditional IETF practice: the functions are 505 specified, rather than the algorithms. The breakdown into steps is 506 for clarity of explanation; any implementation that produces the same 507 result with the same inputs is conforming. 509 2.1. Proposed label 511 The registrant submits a request for an IDN or the user attempts to 512 look up an IDN. The registrant or user typically produces the 513 request string by keyboard entry of a character sequence. That 514 sequence is validated only on the basis of its displayed appearance, 515 without knowledge of the character coding used for its internal 516 representation or other local details of the way the operating system 517 processes it. This string is converted to Unicode if necessary. 
518 IDNA2003 assumes that the conversion is straightforward enough not to 519 be considered by the protocol. 521 2.2. Permitted Character Identification 523 The Unicode string is examined to prohibit characters that IDNA does 524 not permit in input. The list of excluded characters is quite 525 limited because IDNA2003 permits almost all Unicode characters to be 526 used as input, with many of them mapped into others. 528 2.3. Character Mappings 530 The label string is processed through the Nameprep [RFC3491] profile 531 of the Stringprep [RFC3454] tables and procedure. Among other 532 things, these procedures apply the Unicode normalization procedure 533 NFKC [Unicode-UAX15] which converts compatibility characters to their 534 base forms and resolves the different ways in which some characters 535 can be represented in Unicode into a canonical form. In IDNA2003, 536 one-way case mapping was also performed, partially simulating the 537 query-time folding operation that the DNS provides for ASCII strings. 539 2.4. Registry Restrictions 541 Registries at all levels of the DNS, not just the top level, are 542 expected to establish policies about the labels that may be 543 registered and for the processes associated with that action (see the 544 discussion of guidelines and statements in [RFC4690]). Such 545 restrictions have always existed in the DNS and have always been 546 applied at registration time, with the most notable example being 547 enforcement of the hostname (LDH) convention itself. For IDNs, the 548 restrictions to be applied are not an IETF matter except insofar as 549 they derive from restrictions imposed by application protocols (e.g., 550 email has always required a more restricted syntax for domain names 551 than the restrictions of the DNS itself). Because these are 552 restrictions on what can be registered, it is not generally necessary 553 that they be global. 
If a name is not found on resolution, it is not 554 relevant whether it could have been registered; only that it was not 555 registered. Registry restrictions might include prohibition of 556 mixed-script labels or restrictions on labels permitted in a zone if 557 certain other labels are already present. The "variant" systems 558 discussed in [RFC3743] and [RFC4290] are examples of fairly 559 sophisticated registry restriction models. The various sets of ICANN 560 IDN Guidelines [ICANN-Guidelines] also suggest restrictions that 561 might sensibly be imposed. 563 The string produced by the above steps is checked and processed as 564 appropriate to local registry restrictions. Application of those 565 registry restrictions may result in the rejection of some labels or 566 the application of special restrictions to others. 568 2.5. Punycode Conversion 570 The resulting label (in Unicode code point character form) is 571 processed with the Punycode algorithm [RFC3492] and converted to a 572 form suitable for storage in the DNS (the "xn--..." form). 574 2.6. Lookup or Insertion in the Zone 576 For registration, the Punycode-encoded label is then placed in the 577 DNS by insertion into a zone. For lookup, that label is processed 578 according to normal DNS query procedures [RFC1035]. 580 3. A Revised IDNA Model 582 One of the major goals of this work is to improve the general 583 understanding of how IDNA works and what characters are permitted and 584 what happens to them. Comprehensibility and predictability to users 585 and registrants are themselves important motivations and design goals 586 for this effort. The effort includes some new terminology and a 587 revised and extended model, both covered in this section, and some 588 more specific protocol, processing, and table modifications. Details 589 of the latter appear in other documents (see Section 4). 591 3.1. 
Localization: The Role of the Local System and User Interface 593 Several issues are inherent in the application of IDNs and, indeed, 594 almost any other system that tries to handle international characters 595 and concepts. They range from the apparently trivial --e.g., one 596 cannot display a character for which one does not have a font 597 available locally-- to the more complex and subtle. Many people have 598 observed that internationalization is just a tool to permit effective 599 localization while permitting some global uniformity. Issues of 600 display, of exactly how various strings and characters are entered, 601 and so on are inherently issues about localization and user interface 602 design. 604 A protocol such as IDNA can only assume that such operations as data 605 entry are possible. It may make some recommendations about how 606 display might work when characters and fonts are not available, but 607 they can only be general recommendations. 609 Operations for converting between local character sets and Unicode 610 are part of this general set of user interface issues. The 611 conversion is obviously not required at all in a Unicode-native 612 system. It may, however, involve 613 some complexity in one that is not, especially if the elements of the 614 local character set do not map exactly and unambiguously into Unicode 615 characters in a way that is completely stable over time. 616 Perhaps more important, if a label being converted to a local 617 character set contains Unicode characters that have no correspondence 618 in that character set, the application may have to apply special, 619 locally-appropriate, methods to avoid or reduce loss of information. 621 Depending on the system involved, the major difficulty may not lie in 622 the mapping but in accurately identifying the incoming character set 623 and then applying the correct conversion routine.
It may be 624 especially difficult when the character coding system in local use is 625 based on conceptually different assumptions than those used by 626 Unicode about, e.g., how different presentation or combining forms 627 are handled. Those differences may not easily yield unambiguous 628 conversions or interpretations even if each coding system is 629 internally consistent and adequate to represent the local language 630 and script. 632 3.2. IDN Processing in the IDNA200x Model 634 [[anchor20: Placeholder ??? Do we need a summary of the two parts 635 here???]] 637 3.2.1. Summary of Effects 639 Separating Domain Name Registration and Resolution in the protocol 640 specification has one substantive impact. With IDNA2003, the tests 641 and steps made in these two parts of the protocol are essentially 642 identical. Separating them reflects current practice in which per- 643 registry restrictions and special processing are applied at 644 registration time but not on resolution. Even more important in the 645 longer term, it allows incremental addition of permitted character 646 groups to avoid freezing on one particular version of Unicode. 648 4. IDNA200x Document List 650 [[anchor22: This section will need to be extensively revised or 651 removed before publication.]] 653 The following documents are being produced as part of the IDNA200x 654 effort. 656 o A revised version of this document, containing an overview, 657 rationale, and conformance conditions. 659 o A separate document, drawn from material in early versions of this 660 one, that explicitly updates and replaces RFC 3490 but which has 661 most rationale material from that document moved to this one 662 [IDNA200X-protocol]. 664 o A document describing the "Bidi problem" with Stringprep and 665 proposing a solution [IDNA200X-Bidi]. 667 o A list of code points allowed in a U-label, based on Unicode 5.0 668 code assignments. See Section 5. 
670 o One or more documents containing guidance and suggestions for 671 registries (in this context, those responsible for establishing 672 policies for any zone file in the DNS, not only those at the top 673 or second level). The documents in this category may not all be 674 IETF products and may be prepared and completed asynchronously 675 with those described above. 677 5. Permitted Characters: An Inclusion List 679 This section describes the model used to establish the algorithm and 680 character lists of [IDNA200X-Tables] and describes the names and 681 applicability of the categories used there. Note that the inclusion 682 of a character in one of the first three categories does not imply 683 that it can be used indiscriminately; some characters are associated 684 with contextual rules that must be applied as well. 686 5.1. A Tiered Model of Permitted Characters and Labels 688 Moving to an inclusion model requires a new list of characters that 689 are permitted in IDNs. In IDNA2003, the role and utility of 690 characters are independent of context and fixed forever. Making 691 those rules globally has proven impractical, partially because 692 handling of particular characters across the languages that use a 693 script, or the use of similar or identical-looking characters in 694 different scripts, are less well understood than many people believed 695 several years ago. Conversely, IDNA2003 prohibited some characters 696 entirely to avoid dealing with some of the issues discussed here -- 697 restrictions that were much too severe for mnemonics based on some 698 languages. 700 Independently of the characters chosen (see next subsection), the 701 theory is to divide the characters that appear in Unicode into four 702 categories: 704 5.1.1. 
ALWAYS 706 Characters identified as "ALWAYS" are permitted for all uses in IDNs, 707 but may be associated with contextual restrictions (for example, any 708 character in this group that has a "right to left" property must be 709 used in context with the "Bidi" rules). The presence of a character 710 in this category implies that it has been examined and determined to 711 be appropriate for IDN use, and that it is well-understood that 712 contextual protocol restrictions in addition to those already 713 specified, such as rules about the use of given characters, are not 714 required. That, in turn, indicates that the script community 715 relevant to that character, reflecting appropriate authorities for 716 all of the known languages that use that script, has agreed that the 717 script and its components are sufficiently well understood. This 718 subsection discusses characters, rather than scripts, because it is 719 explicitly understood that a script community may decide to include 720 some characters of the script and not others. 722 Because of this condition, which requires evaluation by individual 723 script communities of the characters suitable for use in IDNs (not 724 just, e.g., the general stability of the scripts in which those 725 characters are embedded) it is not feasible to define the boundary 726 point between this category and the next one by general properties of 727 the characters, such as the Unicode property lists. 729 Despite its name, the presence of a character on this list does not 730 imply that a given registry need accept registrations containing any 731 of the characters in the category. Registries are still expected to 732 apply judgment about labels they will accept and to maintain rules 733 consistent with those judgments (see [IDNA200X-protocol] and 734 Section 5.2). 
736 Characters that are placed in the "ALWAYS" category are never removed 737 from it unless the code points themselves are removed from Unicode (a 738 condition that may never occur). 740 5.1.2. MAYBE 742 Characters that are used to write the languages of the world and that 743 are thought of broadly as "letters" rather than, e.g., symbols or 744 punctuation, and that have not been placed in the "ALWAYS" or "NEVER" 745 categories (see Section 5.1.4 for the latter) belong to the "MAYBE" 746 category. As implied above, the collection of scripts and characters 747 in "MAYBE" has not yet been reviewed and finally approved by the 748 script community. It is possible that they may be appropriate for 749 general use only when special contextual rules (tests on the entire 750 label or on adjacent characters) are identified and specified. 752 In general and for maximum safety, registries SHOULD confine 753 themselves to characters from the "ALWAYS" category. However, if a 754 registry is permitting registrations only in a small number of 755 scripts with whose usage it is familiar enough to develop rules that 756 are safe in its own environment, it may be entirely appropriate for 757 it to permit registrations that use characters from the "MAYBE" 758 categories as well as the "ALWAYS" one. 760 Applications are expected to not treat "ALWAYS" and "MAYBE" 761 differently with regard to name resolution ("lookup"). They may 762 choose to provide warnings to users when labels or fully-qualified 763 names containing characters in the "MAYBE" categories are to be 764 presented to users. 766 There are actually two subcategories of MAYBE. The assignment of a 767 character to one or the other represents an estimate of whether the 768 character will eventually be treated as "ALWAYS" or "NEVER" (some 769 characters may, however, remain in the "MAYBE" categories 770 indefinitely).
Since the differences between the "MAYBE" 771 subcategories do not affect the protocol, characters may be moved 772 back and forth between them as information and knowledge accumulates. 774 5.1.2.1. Subcategory MAYBE YES 776 These are letter, digit, or letter-like characters that are generally 777 presumed to be appropriate in DNS labels, for which no specific in- 778 depth script or character evaluation has been performed. The risk 779 with characters in the "MAYBE YES" category is that it may later be 780 discovered that contextual rules are required for their safe use with 781 labels that otherwise contain characters from arbitrary scripts or 782 that the characters themselves may be problematic. 784 5.1.2.2. Subcategory MAYBE NO 786 These are characters that are not letter-like, but are not excluded 787 by some other rule. Given the general ban on characters other than 788 letters and digits, it is likely that they will be moved to "NEVER" 789 when their contexts are fully understood by the relevant community. 790 However, since characters once moved to "NEVER" cannot be moved back 791 out, conservatism about making that classification is in order. 793 5.1.3. CONTEXTUAL RULE REQUIRED 795 These characters are unsafe for general use in IDNs, typically 796 because they are invisible in most scripts but affect format or 797 presentation in a few others or because they are combining characters 798 that are safe for use only in conjunction with particular characters 799 or scripts. In order to permit them to be used at all, these 800 characters are assigned to the category "CONTEXTUAL RULE REQUIRED" 801 and, when adequately understood, associated with a rule. Examples of 802 typical rules include "Must follow a character from Script XYZ", "MAY 803 occur only if the entire label is in Script ABC", "MAY occur only if 804 the previous and subsequent characters have the DEF property". 
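The rule model in this subsection can be pictured as a table lookup followed by a test. The following is a minimal sketch in Python, offered only as an illustration and not as part of any IDNA200x specification. The table CONTEXT_RULES, the function names, the choice of U+200D and U+200C as sample entries, and the use of the Unicode general category "letter" as a stand-in for the "DEF property" are all assumptions made for this example. A character whose table entry is a null rule (None) causes the label to be rejected.

```python
import unicodedata

def between_letters(label, i):
    """Hypothetical rule: the previous and subsequent characters must
    both have a general category starting with "L" (a letter)."""
    if i == 0 or i == len(label) - 1:
        return False
    prev_cat = unicodedata.category(label[i - 1])
    next_cat = unicodedata.category(label[i + 1])
    return prev_cat.startswith("L") and next_cat.startswith("L")

# Hypothetical table: code point -> rule function, or None for a null rule.
CONTEXT_RULES = {
    0x200D: between_letters,  # ZERO WIDTH JOINER, sample rule only
    0x200C: None,             # ZERO WIDTH NON-JOINER, null rule in this sketch
}

def passes_contextual_rules(label):
    """Return True if every rule-bearing character in the label passes its rule."""
    for i, ch in enumerate(label):
        cp = ord(ch)
        if cp not in CONTEXT_RULES:
            continue  # not a CONTEXTUAL RULE REQUIRED character in this sketch
        rule = CONTEXT_RULES[cp]
        if rule is None or not rule(label, i):
            return False  # null rule or failed test: reject the label
    return True
```

In this sketch a caller would run passes_contextual_rules() on a putative label before any conversion to an A-label. The real rule sets and the characters they apply to are defined by the tables document, not by this example.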
806 Because it is easier to identify these characters than to know that 807 they are actually needed in IDNs or how to establish exactly the 808 right rules for each one, a character in the CONTEXTUAL RULE REQUIRED 809 category may have a null (missing) rule set in a given version of the 810 tables. Such characters MUST NOT appear in putative labels for 811 either registration or lookup. Of course, a later version of the 812 tables might contain a non-null rule. 814 If there is a rule, it MUST be evaluated and tested on registration 815 and SHOULD be evaluated and tested on lookup. If the test fails, the 816 label should not be processed for registration or lookup in the DNS. 818 5.1.4. NEVER 820 Some characters are sufficiently problematic for use in IDNs that 821 they should be excluded for both registration and lookup (i.e., 822 conforming applications performing name resolution should verify that 823 these characters are absent; if they are present, the label strings 824 should be rejected rather than converted to A-labels and looked up). 826 Of course, this category includes code points that have been removed 827 entirely from Unicode, should such characters ever occur. 829 Characters that are placed in the "NEVER" category are never removed 830 from it or reclassified. If a character is classified as "NEVER" in 831 error and the error is sufficiently problematic, the only recourse is 832 to introduce a new code point into Unicode and classify it as "MAYBE" 833 or "ALWAYS" as appropriate. 835 5.2. Layered Restrictions: Tables, Context, Registration, Applications 837 The essence of the character rules in IDNAbis is that there is no 838 magic bullet for any of the issues associated with a multiscript DNS. 839 Instead, we need to have a variety of approaches that, together, 840 constitute multiple lines of defense.
The actual character tables 841 are the first mechanism, protocol rules about how those characters 842 are applied or restricted in context are the second, and those two in 843 combination constitute the limits of what can be done in a protocol 844 context. Registrars are expected to restrict what they permit to be 845 registered, devising and using rules that are designed to optimize 846 the balance between confusion and risk on the one hand and maximum 847 expressiveness in mnemonics on the other. 849 5.3. A New Character List -- History 851 [[anchor29: RFC Editor: please delete this subsection.]] 853 A preliminary version of a character list that reflects the above 854 categories has been developed by the contributors to this 855 document [IDNA200X-Tables]. An earlier, initial, version was 856 developed by going through Unicode 5.0 one block and one character 857 class at a time and determining which characters, classes, and blocks 858 were clearly acceptable for IDNs, which ones were clearly unacceptable 859 (e.g., all blocks consisting entirely of compatibility characters and 860 non-language symbols were excluded as were a number of character 861 classes), and which blocks and classes were in need of further study 862 or input from the relevant language communities. That effort was 863 successful, but not at the level of producing a directly-useful 864 character table. Additional iterations on the mailing list and with 865 UTC participation largely dropped the use of Unicode blocks and 866 focused on character classes, scripts, and properties together with 867 understandings gained from other Unicode Consortium efforts. Those 868 iterations have been more successful.
The iterative process has led 869 to the conclusion that the best strategy is likely to be a mixed one 870 consisting of (i) classification into "ALWAYS" and "MAYBE YES" versus 871 "MAYBE NO" and "NEVER" based on Unicode properties and a few 872 exceptions and (ii) discrimination between "ALWAYS" and "MAYBE YES" 873 and between "MAYBE NO" and "NEVER" based on script community criteria 874 about IDN appropriateness. An alternative would 875 involve an entirely new property specifically associated with 876 appropriateness for IDN use, but it is not clear that such a property is either 877 necessary or desirable. 879 5.4. Understanding New Issues and Constraints 881 The discussion in [IDNA200X-Bidi] illustrates some areas in which 882 more work and input are needed. Other issues are raised by the 883 Unicode "presentation form" model and, in particular, by the need for 884 zero-width characters in some limited cases to correctly designate 885 those forms and by some other issues with combining characters in 886 different contexts. It is expected that, once expert and materially- 887 concerned parties are identified to supply contextual rules, such 888 problems will be resolved quickly and the questioned collections of 889 characters either added to the list of permitted characters or 890 permanently excluded. 892 5.5. ALWAYS, MAYBE, and Contextual Rules 894 As discussed above, characters will be associated with the "ALWAYS" 895 or "MAYBE YES" properties if they can plausibly be used in an IDN. 896 They are classified as "MAYBE NO" if it appears unlikely that they 897 should be used in IDNs but there is uncertainty on that point. Non- 898 language characters and other character codes that can be identified 899 as globally inappropriate for IDNs, such as conventional spaces and 900 punctuation, will be assigned to "NEVER" (i.e., will never be 901 permitted in IDNs).
A character associated with "CONTEXTUAL RULE 902 REQUIRED" is acceptable in a label if it is associated with the 903 identifier of a contextual rule set and the test implied by the rule 904 set is successful. If no such identifier is present in the version 905 of the tables in use, the character is treated as roughly equivalent 906 to "NEVER", i.e., it MUST NOT be used in either registration or 907 lookup with that version of the tables. Because a rule set 908 identifier may be installed in a later table version, this status is 909 not permanent. This general approach could be 910 implemented in several ways, not just by the exact arrangements 911 suggested above. 913 The property and rule sets are used as follows: 915 o Systems supporting domain name resolution SHOULD attempt to 916 resolve any label consisting entirely of characters that are in 917 the "ALWAYS" or "MAYBE" categories, including those that have not 918 been permanently excluded but that have not been classified with 919 regard to whether additional restrictions are needed, i.e., they 920 are categorized as "MAYBE YES" or "MAYBE NO". They MUST NOT 921 attempt to resolve label strings that contain unassigned character 922 positions or those that contain "NEVER" characters. 924 o Systems providing domain name registration functions MUST NOT 925 register any label that contains characters classified as "NEVER" 926 or code point positions that are unassigned in the version of 927 Unicode they are using. If a character in a label has associated 928 contextual rules, they MUST NOT register the label unless the 929 conditions required by those rules are satisfied. They SHOULD NOT 930 register labels that contain a character assigned to a "MAYBE" 931 category. 933 A procedure for assigning rules to characters with the "MAYBE YES" or 934 "MAYBE NO" property, and for assigning (or not) the property to 935 characters assigned in future versions of Unicode, is outlined under 936 Section 11.
A key part of that procedure will be specifications that 937 make it possible to add new characters and blocks without long delays 938 in implementation. The procedure will result in an update to 939 existing IANA-maintained registries. 941 6. Issues that Any Solution Must Address 943 6.1. Display and Network Order 945 The correct treatment of domain names requires a clear distinction 946 between Network Order (the order in which the code points are sent in 947 protocols) and Display Order (the order in which the code points are 948 displayed on a screen or paper). The order of labels in a domain 949 name is discussed in [IDNA200X-Bidi]. There are, however, also 950 questions about the order in which labels are displayed if left-to- 951 right and right-to-left labels are adjacent to each other, especially 952 if there are also multiple consecutive appearances of one of the 953 types. The decision about the display order is ultimately under the 954 control of user agents --including web browsers, mail clients, and 955 the like-- which may be highly localized. Even when formats are 956 specified by protocols, the full composition of an Internationalized 957 Resource Identifier (IRI) [RFC3987] or Internationalized Email 958 address contains elements other than the domain name. For example, 959 IRIs contain protocol identifiers and field delimiter syntax such as 960 "http://" or "mailto:" while email addresses contain the "@" to 961 separate local parts from domain names. User agents are not required 962 to use those protocol-based forms directly but often do so. While 963 display, parsing, and processing within a label is specified by the 964 IDNA protocol and the associated documents, the relationship between 965 fully-qualified domain names and internationalized labels is 966 unchanged from the base DNS specifications. Comments here about such 967 full domain names are explanatory or examples of what might be done 968 and must not be considered normative. 
970 Questions remain about whether protocol constraints imply that the overall 971 direction of these strings will always be left-to-right (or right-to- 972 left) for an IRI or email address, or whether they should even conform to 973 such rules. These questions also have several possible answers. 975 Should a domain name abc.def, in which both labels are represented in 976 scripts that are written right-to-left, be displayed as fed.cba or 977 cba.fed? An IRI for clear text web access would, in network order, 978 begin with "http://" and the characters will appear as 979 "http://abc.def" -- but what does this suggest about the display 980 order? When entering a URI into many browsers, it may be possible to 981 provide only the domain name and leave the "http://" to be filled in 982 by default, assuming no tail (an approach that does not work for 983 other protocols). The natural display order for the typed domain 984 name on a right-to-left system is fed.cba. Does this change if a 985 protocol identifier, tail, and the corresponding delimiters are 986 specified? 988 While logic, precedent, and reality suggest that these are questions 989 for user interface design, not IETF protocol specifications, 990 experience in the 1980s and 1990s with mixing systems in which domain 991 name labels were read in network order (left-to-right) and those in 992 which those labels were read right-to-left would predict a great deal 993 of confusion, and heuristics that sometimes fail, if each 994 implementation of each application makes its own decisions on these 995 issues. 997 It should be obvious that any revision of IDNA must be clearer 998 about the distinction between network and display order for complete 999 (fully-qualified) domain names, as well as simply for individual 1000 labels, than the original specification was. It is likely that some 1001 strong suggestions should be made about display order as well. 1003 6.2.
Entry and Display in Applications 1005 Applications can accept domain names using any character set or sets 1006 desired by the application developer, and can display domain names in 1007 any charset. That is, the IDNA protocol does not affect the 1008 interface between users and applications. 1010 An IDNA-aware application can accept and display internationalized 1011 domain names in two formats: the internationalized character set(s) 1012 supported by the application (i.e., an appropriate local 1013 representation of a U-label), and as an A-label. Applications MAY 1014 allow the display and user input of A-labels, but are not encouraged 1015 to do so except as an interface for special purposes, possibly for 1016 debugging, or to cope with display limitations. A-labels are opaque 1017 and ugly, and, where possible, should thus only be exposed to users 1018 who absolutely need them. Because IDN labels can be rendered either 1019 as the A-labels or U-labels, the application may reasonably have an 1020 option for the user to select the preferred method of display; if it 1021 does, rendering the U-label should normally be the default. 1023 Domain names are often stored and transported in many places. For 1024 example, they are part of documents such as mail messages and web 1025 pages. They are transported in many parts of many protocols, such as 1026 both the control commands and the RFC 2822 body parts of SMTP, and 1027 the headers and the body content in HTTP. It is important to 1028 remember that domain names appear both in domain name slots and in 1029 the content that is passed over protocols. 1031 In protocols and document formats that define how to handle 1032 specification or negotiation of charsets, labels can be encoded in 1033 any charset allowed by the protocol or document format. If a 1034 protocol or document format only allows one charset, the labels MUST 1035 be given in that charset. Of course, not all charsets can properly 1036 represent all labels. 
If a U-label cannot be displayed in its 1037 entirety, the only choice (without loss of information) may be to 1038 display the A-label. 1040 In any place where a protocol or document format allows transmission 1041 of the characters in internationalized labels, labels SHOULD be 1042 transmitted using whatever character encoding and escape mechanism 1043 the protocol or document format uses at that place. 1045 All protocols that use domain name slots already have the capacity 1046 for handling domain names in the ASCII charset. Thus, A-labels can 1047 inherently be handled by those protocols. 1049 6.3. The Ligature and Digraph Problem 1051 There are a number of languages written with alphabetic scripts in 1052 which single phonemes are written using two characters, termed a 1053 "digraph", for example, the "ph" in "pharmacy" and "telephone". 1054 (Note that characters paired in this manner can also appear 1055 consecutively without forming a digraph, as in "tophat".) Certain 1056 digraphs are normally indicated typographically by setting the two 1057 characters closer together than they would be if used consecutively 1058 to represent different phonemes. Some digraphs are fully joined as 1059 ligatures (strictly designating setting totally without intervening 1060 white space, although the term is sometimes applied to close set 1061 pairs). An example of this may be seen when the word "encyclopaedia" 1062 is set with the "ae" ligature, U+00E6 LATIN SMALL LETTER AE (and some would not 1063 consider that word correctly spelled unless the ligature form were 1064 used or the "a" were dropped entirely).
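As a small illustration of why this ligature cannot be handled by normalization alone, the sketch below uses the unicodedata module from the Python standard library. That choice is made here for illustration only and is not something these specifications prescribe; the helper name nfkc and the sample strings are likewise assumptions of the example. It checks that U+00E6 has no decomposition, so NFKC leaves it unchanged rather than turning it into the two-letter sequence "ae". By contrast, a compatibility ligature such as U+FB01 (LATIN SMALL LIGATURE FI) is mapped to the two letters "fi".

```python
import unicodedata

def nfkc(s):
    """Apply Unicode normalization form NFKC to a string."""
    return unicodedata.normalize("NFKC", s)

AE = "\u00e6"   # the "ae" character discussed in the text
FI = "\ufb01"   # a compatibility ligature, used here only for contrast

word = "encyclop" + AE + "dia"

# U+00E6 carries no decomposition mapping, so NFKC does not split it.
ae_has_no_decomposition = unicodedata.decomposition(AE) == ""
ae_unchanged = nfkc(word) == word

# U+FB01 has a compatibility decomposition, so NFKC expands it.
fi_expanded = nfkc(FI) == "fi"
```

Whether "ae" and U+00E6 should be treated as equivalent therefore remains a question of language and registry policy, not something this kind of normalization decides.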
1066 Difficulties arise from the fact that a given ligature may be a 1067 completely optional typographic convenience for representing a 1068 digraph in one language (as in the above example with some spelling 1069 conventions), while in another language it is a single character that 1070 may not always be correctly representable by a two-letter sequence 1071 (as in the above example with different spelling conventions). This 1072 can be illustrated by many words in the Norwegian language, where the 1073 "ae" ligature is the 27th letter of a 29-letter extended Latin 1074 alphabet. It is equivalent to the 28th letter of the Swedish 1075 alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A 1076 WITH DIAERESIS, for which an "ae" cannot be substituted according to 1077 current orthographic standards. 1079 That character (U+00E4) is also part of the German alphabet where, 1080 unlike in the Nordic languages, the two-character sequence "ae" is 1081 usually treated as a fully acceptable alternate orthography. The 1082 inverse is however not true, and those two characters cannot 1083 necessarily be combined into an "umlauted a". This also applies to 1084 another German character, the "umlauted o" (U+00F6 LATIN SMALL LETTER 1085 O WITH DIAERESIS) which, for example, cannot be used for writing the 1086 name of the author "Goethe". It is also a letter in the Swedish 1087 alphabet where, in parallel to the "umlauted a", it cannot be 1088 correctly represented as "oe" and in the Norwegian alphabet, where it 1089 is represented, not as "umlauted o", but as "slashed o", U+00F8. 1091 Additional cases with alphabets written right-to-left are described 1092 in Section 6.4. This constitutes a problem that cannot be resolved 1093 solely by operating on scripts. It is, however, a key concern in the 1094 IDN context. 
Its satisfactory resolution will require support in 1095 policies set by registries, which therefore need to be particularly 1096 mindful not just of this specific issue, but of all other related 1097 matters that cannot be dealt with on an exclusively algorithmic 1098 basis. 1100 Just as with the examples of different-looking characters that may be 1101 assumed to be the same, it is in general impossible to deal with 1102 these situations in a system such as IDNA -- or with Unicode 1103 normalization generally -- since determining what to do requires 1104 information about the language being used, context, or both. 1105 Consequently, these specifications make no attempt to treat these 1106 combined characters in any special way. However, their existence 1107 provides a prime example of a situation in which a registry that is 1108 aware of the language context in which labels are to be registered, 1109 and where that language sometimes (or always) treats the two- 1110 character sequences as equivalent to the combined form, should give 1111 serious consideration to applying a "variant" model [RFC3743] 1112 [RFC4290] to reduce the opportunities for user confusion and fraud 1113 that would result from the related strings being registered to 1114 different parties. 1116 6.4. Right-to-left Text 1118 In order to be sure that the directionality of right-to-left text is 1119 unambiguous, IDNA2003 required that any label in which right-to-left 1120 characters appear both starts and ends with them, may not include any 1121 characters with strong left-to-right properties (which excludes other 1122 alphabetic characters but permits European digits), and rejects any 1123 other string that contains a right-to-left character. This is one of 1124 the few places where the IDNA algorithms (both old and new) are 1125 required to look at an entire label, not just at individual 1126 characters. 
Unfortunately, the algorithmic model used in IDNA2003 1127 fails when the final character in a right-to-left string requires a 1128 combining mark in order to be correctly represented. The mark will 1129 be the final code point in the string but is not identified with the 1130 right-to-left character attribute and Stringprep therefore rejects 1131 the string. 1133 This problem manifests itself in languages written with consonantal 1134 alphabets to which diacritical vocalic systems are applied, and in 1135 languages with orthographies derived from them where the combining 1136 marks may have different functionality. In both cases the combining 1137 marks can be essential components of the orthography. Examples of 1138 this are Yiddish, written with an extended Hebrew script, and Dhivehi 1139 (the official language of Maldives) which is written in the Thaana 1140 script (which is, in turn, derived from the Arabic script). Other 1141 languages are still being investigated, but the new rules for right 1142 to left scripts are described in [IDNA200X-Bidi]. 1144 7. IDNs and the Robustness Principle 1146 The model of IDNs described in this document can be seen as a 1147 particular instance of the "Robustness Principle" that has been so 1148 important to other aspects of Internet protocol design. This 1149 principle is often stated as "Be conservative about what you send and 1150 liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2 1151 [RFC1123]). For IDNs to work well, registries must have or require 1152 sensible policies about what is registered -- conservative policies 1153 -- and implement and enforce them. Registries, registrars, or other 1154 actors who do not do so, or who get too liberal, too greedy, or too 1155 weird may deserve punishment that will primarily be meted out in the 1156 marketplace or by consumer protection rules and legislation. 
One can 1157 debate whether or not "punishment by browser vendor" is an effective 1158 marketplace tool, but it falls into the general category of 1159 approaches being discussed here. In any event, the Protocol Police 1160 (an important, although mythical, Internet mechanism for enforcing 1161 protocol conformance) are going to be worth about as much here as 1162 they usually are -- i.e., very little -- simply because, unlike the 1163 marketplace and legal and regulatory mechanisms, they have no 1164 enforcement power. 1166 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels 1167 that clearly violate global (protocol) rules (no one has ever 1168 seriously claimed that being liberal in what is accepted requires 1169 being stupid). However, once one gets past such global rules and 1170 deals with anything sensitive to script or locale, it is necessary to 1171 assume that garbage has not been placed into the DNS, i.e., one must 1172 be liberal about what one is willing to look up in the DNS rather 1173 than guessing about whether it should have been permitted to be 1174 registered. 1176 As mentioned above, if a string doesn't resolve, it makes no 1177 difference whether it simply wasn't registered or was prohibited by 1178 some rule. 1180 If resolvers, as a user interface (UI) matter, decide to warn about 1181 some strings that are valid under the global rules but that they 1182 perceive as dangerous, that is their prerogative and we can only hope 1183 that the market (and maybe regulators) will reward the good choices 1184 and punish the bad ones. In this context, a resolver that decides a 1185 string that is valid under the protocol is dangerous and refuses to 1186 look it up is in violation of the protocols (if they are properly 1187 defined); one that is willing to look something up, but warns against 1188 it, is exercising a UI choice. 1190 8. Migration and Version Synchronization 1192 8.1. 
Design Criteria 1194 As mentioned above and in RFC 4690, two key goals of this work are to 1195 enable applications to be agnostic about whether they are being run 1196 in environments supporting any Unicode version from 3.2 onward and to 1197 permit incrementally adding permitted scripts and other character 1198 collections without disruption. The mechanisms that support this are 1199 outlined above, but this section reviews them in a context that may 1200 be more helpful to those who need to understand the approach and make 1201 plans for it. 1203 1. The general criteria for a putative label, and the collection of 1204 characters that make it up, to be considered IDNA-valid are: 1206 * The characters are "letters", numerals, or otherwise used to 1207 write words in some language. Symbols, drawing characters, 1208 and various notational characters are permanently excluded -- 1209 some because they are actively dangerous in URI, IRI, or 1210 similar contexts and others because there is no evidence that 1211 they are important enough to Internet operations or 1212 internationalization to justify large numbers of special cases 1213 and character-specific handling (additional discussion and 1214 rationale for the symbol decision appears in Section 8.5). If 1215 strings are read out loud, rather than seen on paper, there 1216 are opportunities for considerable confusion between the name 1217 of a symbol (and a single symbol may have multiple names) and 1218 the symbol itself. Other than in very exceptional cases, 1219 e.g., where they are needed to write substantially any word of 1220 a given language, punctuation characters are excluded as well. 1221 The fact that a word exists is not proof that it should be 1222 usable in a DNS label and DNS labels are not expected to be 1223 usable for multiple-word phrases (although they are not 1224 prohibited if the conventions and orthography of a particular 1225 language cause that to be possible). 
1227 * Characters that are unassigned in the version of Unicode being 1228 used by the registry or application are not permitted, even on 1229 resolution (lookup). This is because, unlike the conditions 1230 contemplated in IDNA2003 (except for right-to-left text), we 1231 now understand that tests involving the context of characters 1232 (e.g., some characters being permitted only adjacent to other 1233 ones of specific types) and integrity tests on complete labels 1234 will be needed. Unassigned code points cannot be permitted 1235 because one cannot determine the contextual rules that 1236 particular code points will require before characters are 1237 assigned to them and the properties of those characters fully 1238 understood. 1240 * Any character that is mapped to another character by 1241 Nameprep2003 or by a current version of NFKC is prohibited as 1242 input to IDNA (for either registration or resolution). 1243 Implementers of user interfaces to applications are free to 1244 make those conversions when they consider them suitable for 1245 their operating system environments, context, or users. 1247 Tables used to identify the characters that are IDNA-valid are 1248 expected to be driven by the principles above. The principles 1249 are not just an interpretation of the tables. 1251 2. For registration purposes, the collection of IDNA-valid 1252 characters will be a growing list. The conditions for entry to 1253 the list for a set of characters are (i) that they meet the 1254 conditions for IDNA-valid characters discussed immediately above 1255 and (ii) that consensus can be reached about usage and contextual 1256 rules. 
Because it is likely that such consensus cannot be 1257 reached immediately about the correct contextual rules for some 1258 characters -- e.g., the use of invisible ("zero-width") 1259 characters to modify presentation forms -- some sets of 1260 characters may be deferred from the IDNA-valid set even if they 1261 appear in a current version of Unicode. Of course, characters 1262 first assigned code points in later versions of Unicode would 1263 need to be introduced into IDNA only after those code points are 1264 assigned. 1266 3. Anyone entering a label into a DNS zone must properly validate 1267 that label -- i.e., be sure that the criteria for an A-label are 1268 met -- in order for Unicode version-independence to be possible. 1269 In particular: 1271 * Any label that contains hyphens as its third and fourth 1272 characters MUST be IDNA-valid. This implies that, (i) if the 1273 third and fourth characters are hyphens, the first and second 1274 ones MUST be "xn" until and unless this specification is 1275 updated to permit other prefixes and (ii) labels starting in 1276 "xn--" MUST be valid A-labels, as discussed in Section 3 1277 above. 1279 * The Unicode tables (i.e., tables of code points, character 1280 classes, and properties) and IDNA tables (i.e., tables of 1281 contextual rules such as those described above), MUST be 1282 consistent on the systems performing or validating labels to 1283 be registered. Note that this does not require that tables 1284 reflect the latest version of Unicode, only that all tables 1285 used on a given system are consistent with each other. 1287 Systems looking up or resolving DNS labels MUST be able to assume 1288 that those rules were followed. 1290 4. Anyone looking up a label in a DNS zone MUST 1292 * Maintain a consistent set of tables, as discussed above. As 1293 with registration, the tables need not reflect the latest 1294 version of Unicode but they MUST be consistent. 
1296 * Validate labels to be looked up only to the extent of 1297 determining that the U-label does not contain either code 1298 points prohibited by IDNA (categorized as "NEVER") or code 1299 points that are unassigned in its version of Unicode. No 1300 attempt should be made to validate contextual rules about 1301 characters, including mixed-script label prohibitions, 1302 although such rules MAY be used to influence presentation 1303 decisions in the user interface. 1305 By avoiding applying its own interpretation of which labels are 1306 valid as a means of rejecting lookup attempts, the resolver 1307 application becomes less sensitive to version incompatibilities 1308 with the particular zone registry associated with the domain 1309 name. 1311 Under this model, a registry (or entity communicating with a registry 1312 to accomplish name registrations) will need to update its tables -- 1313 both the Unicode-associated tables and the tables of permitted IDN 1314 characters -- to enable a new script or other set of new characters. 1315 It will not be affected by newer versions of Unicode, or newly- 1316 authorized characters, until and unless it wishes to make those 1317 registrations. The registration side is also responsible --under the 1318 protocol and to registrants and users-- for much more careful 1319 checking than is expected of applications systems that look names up, 1320 both checking as required by the protocol and checking required by 1321 whatever policies it develops for minimizing risks due to confusable 1322 characters and sequences and preserving language or script integrity. 1324 An application or client that looks names up in the DNS will be able 1325 to resolve any name that is registered, as long as its version of the 1326 Unicode-associated tables is sufficiently up-to-date to interpret all 1327 of the characters in the label. 
It SHOULD distinguish, in its 1328 messages to users, between "label contains an unallocated code point" 1329 and other types of lookup failures. A failure on the basis of an old 1330 version of Unicode may lead the user to a desire to upgrade to a 1331 newer version, but will have no other ill effects (this is consistent 1332 with behavior in the transition to the DNS when some hosts could not 1333 yet handle some forms of names or record types). 1335 8.2. More Flexibility in User Agents 1337 One key philosophical difference between IDNA2003 and this proposal 1338 is that the former provided mappings for many characters into others. 1339 These mappings were not reversible: the original string could not be 1340 recovered from the form stored in the DNS and, probably as a 1341 consequence, users became confused about what characters were valid 1342 for IDNs and which ones were not. Too many times, the answer to the 1343 question "can this character be used in an IDN" was "it depends on 1344 exactly what you mean by 'used'". 1346 IDNA200x does not perform these mappings but, instead, prohibits the 1347 characters that would be mapped to others. As examples, while 1348 mathematical characters based on Latin ones are accepted as input to 1349 IDNA2003, they are prohibited in IDNA200x. Similarly, double-width 1350 characters and other variations are prohibited as IDNA input. 1352 Since the rules in [IDNA200X-Tables] provide that only strings that 1353 are stable under NFKC are valid, if it is convenient for an 1354 application to perform NFKC normalization before lookup, that 1355 operation is safe since this will never make the application unable 1356 to look up any valid string. 
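The safety argument in the preceding paragraph can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` module as the NFKC implementation; the helper names `nfkc_stable` and `prepare_for_lookup` are hypothetical and are not taken from this specification or from [IDNA200X-Tables]. It checks only NFKC stability, not the other criteria for a valid label.

```python
import unicodedata


def nfkc_stable(label: str) -> bool:
    """Return True if NFKC normalization leaves the label unchanged."""
    return unicodedata.normalize("NFKC", label) == label


def prepare_for_lookup(user_input: str) -> str:
    """Hypothetical pre-lookup convenience step: apply NFKC before IDNA processing."""
    return unicodedata.normalize("NFKC", user_input)


# A label that is already NFKC-stable passes through unchanged, so
# normalizing before lookup cannot make it unreachable.
stable = "b\u00fccher"  # precomposed U+00FC LATIN SMALL LETTER U WITH DIAERESIS
assert nfkc_stable(stable)
assert prepare_for_lookup(stable) == stable

# Fullwidth input is not NFKC-stable, so it could not itself be a valid
# label; normalization maps it to the form that could be.
fullwidth = "\uff41\uff42\uff43"  # fullwidth "abc"
assert not nfkc_stable(fullwidth)
assert prepare_for_lookup(fullwidth) == "abc"

# A decomposed sequence ("u" followed by U+0308) composes to the stable form.
decomposed = "bu\u0308cher"
assert prepare_for_lookup(decomposed) == stable
```

Because NFKC is idempotent, applying `prepare_for_lookup` a second time changes nothing, which is why the step can be inserted freely ahead of the lookup path.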
1358 In many cases these prohibitions should have no effect on what the 1359 user can type at resolution time: it is perfectly reasonable for 1360 systems that support user interfaces at lookup time, to perform some 1361 character mapping that is appropriate to the local environment prior 1362 to actual invocation of IDNA as part of the Unicode conversions of 1363 [IDNA200X-protocol] above. However, those changes will be local ones 1364 only -- local to environments in which users will clearly understand 1365 that the character forms are equivalent. For use in interchange 1366 among systems, it appears to be much more important that U-labels and 1367 A-labels can be mapped back and forth without loss of information. 1369 One specific, and very important, instance of this change in strategy 1370 arises with case-folding. In the ASCII-only DNS, names are looked up 1371 and matched in a case-independent way, but no actual case-folding 1372 occurs. Names can be placed in the DNS in either upper or lower case 1373 form (or any mixture of them) and that form is preserved, returned in 1374 queries, and so on. IDNA2003 attempted to simulate that behavior by 1375 performing case-mapping at registration time (resulting in only 1376 lower-case IDNs in the DNS) and when names were looked up. 1378 As suggested earlier in this section, it appears to be desirable to 1379 do as little character mapping as possible consistent with having 1380 Unicode work correctly (e.g., NFC mapping to resolve different 1381 codings for the same character is still necessary) and to make the 1382 mapping between A-labels and U-labels idempotent. Case-mapping is 1383 not an exception to this principle. If only lower case characters 1384 can be registered in the DNS (i.e., present in a U-label), then 1385 IDNA200x should prohibit upper-case characters as input. Some other 1386 considerations reinforce this conclusion. 
For example, an essential 1387 element of the ASCII case-mapping functions is that 1388 uppercase(character) must be equal to 1389 uppercase(lowercase(character)). That requirement may not be 1390 satisfied with IDNs. The relationship between upper case and lower 1391 case may even be language-dependent, with different languages (or 1392 even the same language in different areas) using different mappings. 1393 Of course, the expectations of users who are accustomed to a case- 1394 insensitive DNS environment will probably be well-served if user 1395 agents perform case mapping prior to IDNA processing, but the IDNA 1396 procedures themselves should neither require such mapping nor expect 1397 it when it isn't natural to the localized environment. 1399 8.3. The Question of Prefix Changes 1401 The conditions that would require a change in the IDNA "prefix" 1402 ("xn--" for the version of IDNA specified in [RFC3490]) have been a 1403 great concern to the community. A prefix change would clearly be 1404 necessary if the algorithms were modified in a manner that would 1405 create serious ambiguities during subsequent transition in 1406 registrations. This section summarizes our conclusions about the 1407 conditions under which changes in prefix would be necessary. 1409 8.3.1. Conditions requiring a prefix change 1411 An IDN prefix change is needed if a given string would resolve or 1412 otherwise be interpreted differently depending on the version of the 1413 protocol or tables being used. Consequently, work to update IDNs 1414 would require a prefix change if, and only if, one of the following 1415 four conditions were met: 1417 1. The conversion of an A-label to Unicode (i.e., a U-label) yields 1418 one string under IDNA2003 (RFC3490) and a different string under 1419 IDNA200x. 1421 2. An input string that is valid under IDNA2003 and also valid under 1422 IDNA200x yields two different A-labels with the different 1423 versions of IDNA. 
This condition is believed to be essentially 1424 equivalent to the one above. 1426 Note, however, that if the input string is valid under one 1427 version and not valid under the other, this condition does not 1428 apply. See the first item in Section 8.3.2, below. 1430 3. A fundamental change is made to the semantics of the string that 1431 is inserted in the DNS, e.g., if a decision were made to try to 1432 include language or specific script information in that string, 1433 rather than having it be just a string of characters. 1435 4. A sufficiently large number of characters is added to Unicode so 1436 that the Punycode mechanism for block offsets no longer has 1437 enough capacity to reference the higher-numbered planes and 1438 blocks. This condition is unlikely even in the long term and 1439 certain not to arise in the next few years. 1441 8.3.2. Conditions not requiring a prefix change 1443 In particular, as a result of the principles described above, none of 1444 the following changes require a new prefix: 1446 1. Prohibition of some characters as input to IDNA. This may make 1447 names that are now registered inaccessible, but does not require 1448 a prefix change. 1450 2. Adjustments in Stringprep tables or IDNA actions, including 1451 normalization definitions, that do not affect characters that 1452 have already been invalid under IDNA2003. 1454 3. Changes in the style of definitions of Stringprep or Nameprep 1455 that do not alter the actions performed by them. 1457 8.4. Stringprep Changes and Compatibility 1459 Concerns have been expressed about problems for non-DNS uses of 1460 Stringprep being caused by changes to the specification intended to 1461 improve the handling of IDNs, most notably as this might affect 1462 identification and authentication protocols. Section 8.3, above, 1463 essentially also applies in this context. 
The proposed new inclusion 1464 tables [IDNA200X-Tables], the reduction in the number of characters 1465 permitted as input for registration or resolution (Section 5), and 1466 even the proposed changes in handling of right-to-left strings 1467 [IDNA200X-Bidi] either give interpretations to strings prohibited 1468 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings 1469 that are valid under both IDNA2003 and IDNA200x, and the 1470 corresponding versions of Stringprep, are not changed in 1471 interpretation. This protocol does not use either Nameprep or 1472 Stringprep as specified in IDNA2003. 1474 It is particularly important to keep IDNA processing separate from 1475 processing for various security protocols because some of the 1476 constraints that are necessary for smooth and comprehensible use of 1477 IDNs may be unwanted or undesirable in other contexts. For example, 1478 the criteria for good passwords or passphrases are very different 1479 from those for desirable IDNs. Similarly, internationalized SCSI 1480 identifiers and other protocol components are likely to have 1481 different requirements than IDNs. 1483 Perhaps even more important in practice, since most other known uses 1484 of Stringprep encode or process characters that are already in 1485 normalized form and expect the use of only those characters that can 1486 be used in writing words of languages, the changes proposed here and 1487 in [IDNA200X-Tables] are unlikely to have any effect at all, 1488 especially not on registries and registrations that follow rules 1489 already in existence when this work started. 1491 8.5. The Symbol Question 1493 [[anchor37: Move this material and integrate with the Symbol 1494 discussion above???]] 1496 One of the major differences between this specification and the 1497 original version of IDNA is that the original version permitted non- 1498 letter symbols of various sorts in the protocol. They were always 1499 discouraged in practice. 
In particular, both the "IESG Statement" 1500 about IDNA and all versions of the ICANN Guidelines specify that only 1501 language characters be used in labels. This specification bans the 1502 symbols entirely. There are several reasons for this, which include: 1504 o As discussed elsewhere, the original IDNA specification assumed 1505 that as many Unicode characters as possible should be permitted, 1506 directly or via mapping to other characters, in IDNs. This 1507 specification operates on an inclusion model, extrapolating from 1508 the LDH rules --which have served the Internet very well-- to a 1509 Unicode base rather than an ASCII base. 1511 o Unicode names for letters are fairly intuitive, recognizable to 1512 users of the relevant script, and unambiguous. Symbol names are 1513 more problematic because there may be no general agreement on 1514 whether a particular glyph matches a symbol, there are no uniform 1515 conventions for naming, variations such as outline, solid, and 1516 shaded forms may or may not exist, and so on. As a result, 1517 symbols are a very poor basis for reliable communications. Of 1518 course, these difficulties with symbols do not arise with actual 1519 pictographic languages and scripts which would be treated like any 1520 other language characters; the two should not be confused. 1522 8.6. Other Compatibility Issues 1524 The existing (2003) IDNA model has several odd artifacts which occur 1525 largely by accident. Many, if not all, of these are potential 1526 avenues for exploits, especially if the registration process permits 1527 "source" names (names that have not been processed through IDNA and 1528 Nameprep) to be registered. As one example, since the character 1529 Eszett, used in German, is mapped by IDNA2003 into the sequence "ss" 1530 rather than being retained as itself or prohibited, a string 1531 containing that character but otherwise in ASCII is not really an IDN 1532 (in the U-label sense defined above) at all.
After Nameprep maps the 1533 Eszett out, the result is an ASCII string and so does not get an xn-- 1534 prefix, but the string that can be displayed to a user appears to be 1535 an IDN. The proposed IDNA200x eliminates this artifact. A character 1536 is either permitted as itself or it is prohibited; special cases that 1537 make sense only in a particular linguistic or cultural context can be 1538 dealt with as localization matters where appropriate. 1540 9. Acknowledgments 1542 The editor and contributors would like to express their thanks to 1543 those who contributed significant early review comments, sometimes 1544 accompanied by text, especially Mark Davis, Paul Hoffman, Simon 1545 Josefsson, and Sam Weiler. In addition, some specific ideas were 1546 incorporated from suggestions, text, or comments about sections that 1547 were unclear supplied by Frank Ellerman, Michael Everson, Asmus 1548 Freytag, Michel Suignard, and Ken Whistler, although, as usual, they 1549 bear little or no responsibility for the conclusions the editor and 1550 contributors reached after receiving their suggestions. Thanks are 1551 also due to Vint Cerf, Debbie Garside, and Jefsey Morphin for 1552 conversations that led to considerable improvements in the content of 1553 this document. 1555 10. Contributors 1557 While the listed editor held the pen, this document represents the 1558 joint work and conclusions of an ad hoc design team consisting of the 1559 editor and, in alphabetic order, Harald Alvestrand, Tina Dam, Patrik 1560 Faltstrom, and Cary Karp. In addition, there were many specific 1561 contributions and helpful comments from those listed in the 1562 Acknowledgments section and others who have contributed to the 1563 development and use of the IDNA protocols. 1565 11. IANA Considerations 1567 11.1. 
IDNA Permitted Character Registry 1569 The distinction between "MAYBE" code points and those classified into 1570 "ALWAYS" and "NEVER" (see Section 5) requires a registry of 1571 characters and scripts and their categories. IANA is requested to 1572 establish that registry, using the "expert reviewer" model. Unlike 1573 usual practice, we recommend that the "expert reviewer" be a 1574 committee that reflects expertise on the relevant scripts, and 1575 encourage IANA, the IESG, and IAB to establish liaisons and work 1576 together with other relevant standards bodies to populate that 1577 committee and develop its procedures over the long term. 1579 11.2. IDNA Context Registry 1581 For characters that are defined in the permitted character registry as 1582 requiring a contextual rule, IANA will create and maintain a list of 1583 approved contextual rules, using the registration methods described 1584 above. IANA should develop a format for that registry, or a copy of 1585 it maintained in parallel, that is convenient for retrieval and 1586 machine processing and publish the location of that version. 1588 11.3. IANA Repository of TLD IDN Practices 1590 This registry is maintained by IANA at the request of ICANN, in 1591 conjunction with ICANN Guidelines for IDN use. It is not an IETF- 1592 managed registry and, while the protocol changes specified here may 1593 call for some revisions to the tables, these specifications have no 1594 effect on that registry and no IANA action is required as a result. 1596 12. Security Considerations 1598 Security on the Internet partly relies on the DNS. Thus, any change 1599 to the characteristics of the DNS can change the security of much of 1600 the Internet. 1602 Domain names are used by users to identify and connect to Internet 1603 servers.
The security of the Internet is compromised if a user 1604 entering a single internationalized name is connected to different 1605 servers based on different interpretations of the internationalized 1606 domain name. 1608 When systems use local character sets other than ASCII and Unicode, 1609 this specification leaves the problem of transcoding between the 1610 local character set and Unicode up to the application or local 1611 system. If different applications (or different versions of one 1612 application) implement different transcoding rules, they could 1613 interpret the same name differently and contact different servers. 1614 This problem is not solved by security protocols like TLS that do not 1615 take local character sets into account. 1617 To help prevent confusion between characters that are visually 1618 similar, it is suggested that implementations provide visual 1619 indications where a domain name contains multiple scripts. Such 1620 mechanisms can also be used to show when a name contains a mixture of 1621 simplified and traditional Chinese characters, or to distinguish zero 1622 and one from O and l. DNS zone administrators may impose restrictions 1623 (subject to the limitations identified elsewhere in this document) 1624 that try to minimize characters that have similar appearance or 1625 similar interpretations. It is worth noting that there are no 1626 comprehensive technical solutions to the problems of confusable 1627 characters. One can reduce the extent of the problems in various 1628 ways, but probably never eliminate them. Some specific suggestions 1629 about identification and handling of confusable characters appear in 1630 a Unicode Consortium publication [???]. 1632 The registration and resolution models described above and in 1634 [IDNA200X-protocol] change the mechanisms available for applications 1635 and resolvers to determine the validity of labels they encounter. In 1636 some respects, the ability to test is strengthened.
For example, 1637 putative labels that contain unassigned code points will now be 1638 rejected, while IDNA2003 permitted them (something that is now 1639 recognized as a considerable source of risk). On the other hand, the 1640 protocol specification no longer assumes that the application that 1641 looks up a name will be able to determine, and apply, information 1642 about the protocol version used in registration. In theory, that may 1643 increase risk since the application will be able to do less pre- 1644 lookup validation. In practice, the protection afforded by that test 1645 has been largely illusory for reasons explained in RFC 4690 and 1646 above. 1648 Any change to Stringprep or, more broadly, the IETF's model of the 1649 use of internationalized character strings in different protocols, 1650 creates some risk of inadvertent changes to those protocols, 1651 invalidating deployed applications or databases, and so on. Our 1652 current hypothesis is that the same considerations that would require 1653 changing the IDN prefix (see Section 8.3.2) are the ones that would, 1654 e.g., invalidate certificates or hashes that depend on Stringprep, 1655 but those cases require careful consideration and evaluation. More 1656 important, it is not necessary to change Stringprep2003 at all in 1657 order to make the IDNA changes contemplated here. It is far 1658 preferable to create a separate document, or separate profile 1659 components, for IDN work, leaving the question of upgrading to other 1660 protocols to experts on them and eliminating any possible 1661 synchronization dependency between IDNA changes and possible upgrades 1662 to security protocols or conventions. 1664 13. Change Log 1666 [[anchor44: RFC Editor: Please remove this section.]] 1668 13.1. Version -01 1670 Version -01 of this document is a considerable rewrite from -00. 
1671 Many sections have been clarified or extended and several new 1672 sections have been added to reflect discussions in a number of 1673 contexts since -00 was issued. 1675 13.2. Version -02 1677 o Corrected several editorial errors including an accidentally- 1678 introduced misstatement about NFKC. 1680 o Extensively revised the document to synchronize its terminology 1681 with version 03 of [IDNA200X-Tables] and to provide a better 1682 conceptual framework for its categories and how they are used. 1683 Added new material to clarify terminology and relationships with 1684 other efforts. More subtle changes in this version lay the 1685 groundwork for separating the document into a conceptual overview 1686 and a protocol specification for version 03. 1688 13.3. Version -03 1690 o Removed protocol materials to a separate document and incorporated 1691 rationale and explanation materials from the original 1692 specification in RFC 3960 into this document. Cleaned up earlier 1693 text to reflect a more mature specification, restructured 1694 several sections, and added additional rationale material. 1696 o Strengthened and clarified the A-label / U-label / LDH-label 1697 definition. 1699 o Retitled the document to reflect its evolving role. 1701 13.4. Version -04 1703 o Moved more text from "protocol" and further reorganized material. 1705 o Provided new material on "Contextual Rule Required". 1707 o Improved consistency of terminology, both internally and with the 1708 "tables" document. 1710 o Improved the IANA Considerations section and discussed the 1711 existing IDNA-related registry. 1713 o More small changes to increase consistency. 1715 13.5. Version -05 1717 Changed "YES" category back to "ALWAYS" to re-synch with the tables 1718 document and provide clearer terminology. 1720 14. References 1722 14.1.
Normative References 1724 [ASCII] American National Standards Institute (formerly United 1725 States of America Standards Institute), "USA Code for 1726 Information Interchange", ANSI X3.4-1968, 1968. 1728 ANSI X3.4-1968 has been replaced by newer versions with 1729 slight modifications, but the 1968 version remains 1730 definitive for the Internet. 1732 [IDNA200X-Bidi] 1733 Alvestrand, H. and C. Karp, "An IDNA problem in right-to- 1734 left scripts", July 2007, . 1737 [IDNA200X-Tables] 1738 Faltstrom, P., "The Unicode Codepoints and IDN", 1739 November 2007, . 1742 A version of this document, is available in HTML format at 1743 http://stupid.domain.name/idnabis/ 1744 draft-faltstrom-idnabis-tables-03.txt 1746 [IDNA200X-protocol] 1747 Klensin, J., "Internationalizing Domain Names in 1748 Applications (IDNA): Protocol", November 2007, . 1752 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1753 Requirement Levels", BCP 14, RFC 2119, March 1997. 1755 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 1756 Internationalized Strings ("stringprep")", RFC 3454, 1757 December 2002. 1759 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1760 "Internationalizing Domain Names in Applications (IDNA)", 1761 RFC 3490, March 2003. 1763 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1764 Profile for Internationalized Domain Names (IDN)", 1765 RFC 3491, March 2003. 1767 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 1768 for Internationalized Domain Names in Applications 1769 (IDNA)", RFC 3492, March 2003. 1771 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint 1772 Engineering Team (JET) Guidelines for Internationalized 1773 Domain Names (IDN) Registration and Administration for 1774 Chinese, Japanese, and Korean", RFC 3743, April 2004. 1776 [RFC4290] Klensin, J., "Suggested Practices for Registration of 1777 Internationalized Domain Names (IDN)", RFC 4290, 1778 December 2005. 

   [Unicode-UAX15]
              The Unicode Consortium, "Unicode Standard Annex #15:
              Unicode Normalization Forms", 2006, .

   [Unicode32]
              The Unicode Consortium, "The Unicode Standard, Version
              3.0", 2000.

              (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5).
              Version 3.2 consists of the definition in that book as
              amended by the Unicode Standard Annex #27: Unicode 3.1
              (http://www.unicode.org/reports/tr27/) and by the Unicode
              Standard Annex #28: Unicode 3.2
              (http://www.unicode.org/reports/tr28/).

   [Unicode40]
              The Unicode Consortium, "The Unicode Standard, Version
              4.0", 2003.

   [Unicode50]
              The Unicode Consortium, "The Unicode Standard, Version
              5.0", 2007.

              Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0

14.2. Informative References

   [ICANN-Guidelines]
              ICANN, "IDN Implementation Guidelines", 2006, .

   [RFC0810]  Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
              Internet host table specification", RFC 810, March 1982.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC2782]  Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
              specifying the location of services (DNS SRV)", RFC 2782,
              February 2000.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

Author's Address

   John C Klensin (editor)
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA 02140
   USA

   Phone: +1 617 245 1457
   Fax:
   Email: john+ietf@jck.com
   URI:

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).