Network Working Group                                         J. Klensin
Internet-Draft                                             July 12, 2008
Intended status: Standards Track
Expires: January 13, 2009

  Internationalized Domain Names for Applications (IDNA): Definitions,
                        Background and Rationale
                  draft-ietf-idnabis-rationale-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 13, 2009.

Abstract

   Several years have passed since the original protocol for
   Internationalized Domain Names (IDNs) was completed and deployed.
   During that time, a number of issues have arisen, including the need
   to update the system to deal with newer versions of Unicode. Some of
   these issues require tuning of the existing protocols and the tables
   on which they depend. This document provides an overview of a
   revised system and provides explanatory material for its components.

Table of Contents

   1.  Introduction
     1.1.  Context and Overview
     1.2.  Discussion Forum
     1.3.  Objectives
     1.4.  Applicability and Function of IDNA
     1.5.  Terminology
       1.5.1.  Documents and Standards
       1.5.2.  Terminology about Characters and Character Sets
       1.5.3.  DNS-related Terminology
       1.5.4.  Terminology Specific to IDNA
       1.5.5.  Punycode is an Algorithm, not a Name
       1.5.6.  Other Terminology Issues
     1.6.  Comprehensibility of IDNA Mechanisms and Processing
   2.  Summary of Major Changes from IDNA2003
   3.  The Revised IDNA Model
   4.  Processing in IDNA2008
   5.  IDNA2008 Document List
   6.  Permitted Characters: An Inclusion List
     6.1.  A Tiered Model of Permitted Characters and Labels
       6.1.1.  PROTOCOL-VALID
       6.1.2.  DISALLOWED
       6.1.3.  UNASSIGNED
     6.2.  Registration Policy
     6.3.  Layered Restrictions: Tables, Context, Registration,
           Applications
   7.  Issues that Constrain Possible Solutions
     7.1.  Display and Network Order
     7.2.  Entry and Display in Applications
     7.3.  Linguistic Expectations: Ligatures, Digraphs, and
           Alternate Character Forms
     7.4.  Case Mapping and Related Issues
     7.5.  Right to Left Text
   8.  IDNs and the Robustness Principle
   9.  Front-end and User Interface Processing
   10. Migration and Version Synchronization
     10.1.  Design Criteria
       10.1.1.  General IDNA Validity Criteria
       10.1.2.  Labels in Registration
       10.1.3.  Labels in Resolution (Lookup)
     10.2.  More Flexibility in User Agents
     10.3.  The Question of Prefix Changes
       10.3.1.  Conditions Requiring a Prefix Change
       10.3.2.  Conditions Not Requiring a Prefix Change
       10.3.3.  Implications of Prefix Changes
     10.4.  Stringprep Changes and Compatibility
     10.5.  The Symbol Question
     10.6.  Migration Between Unicode Versions: Unassigned Code
            Points
     10.7.  Other Compatibility Issues
   11. Acknowledgments
   12. Contributors
   13. IANA Considerations
     13.1.  IDNA Character Registry
     13.2.  IDNA Context Registry
     13.3.  IANA Repository of IDN Practices of TLDs
   14. Security Considerations
   15. Change Log
     15.1.  Version -01 of draft-klensin-idnabis-issues
     15.2.  Version -02 of draft-klensin-idnabis-issues
     15.3.  Version -03 of draft-klensin-idnabis-issues
     15.4.  Version -04 of draft-klensin-idnabis-issues
     15.5.  Version -05 of draft-klensin-idnabis-issues
     15.6.  Version -06 of draft-klensin-idnabis-issues
     15.7.  Version -07 of draft-klensin-idnabis-issues
     15.8.  Version -00 of draft-ietf-idnabis-rationale
     15.9.  Version -01 of draft-ietf-idnabis-rationale
   16. References
     16.1.  Normative References
     16.2.  Informative References
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

1.1.  Context and Overview

   Several years have passed since the original protocol for
   Internationalized Domain Names (IDNs) was completed and deployed.
   During that time, a number of issues have arisen, including a subset
   of those described in a recent IAB report [RFC4690] and the need to
   update the system to deal with newer versions of Unicode. Those
   standards are known as Internationalized Domain Names in Applications
   (IDNA), taken from the name of the highest level standard within that
   group (see Section 1.5). Some tuning of the existing protocols and
   the tables on which they depend is now required. Where it is
   important to an understanding of the revised protocols, this
   document further explains the issues that have been encountered. It
   also provides an overview of the new IDNA model and explanatory
   material for it. Additional explanatory material for the specific
   components of the proposals will appear with the associated
   documents.

1.2.  Discussion Forum

   [[anchor4: RFC Editor: please remove this section.]]

   This work is being discussed in the IETF "idnabis" Working Group and
   on the mailing list idna-update@alvestrand.no

1.3.  Objectives

   The intent of the IDNA revision effort, and hence of this document
   and the associated ones, is to increase the usability and
   effectiveness of internationalized domain names (IDNs) while
   preserving or strengthening the integrity of references that use
   them. The original "hostname" character definitions (see, e.g.,
   [RFC0810]) struck a balance between the creation of useful mnemonics
   and the introduction of parsing problems or general confusion in the
   contexts in which domain names are used. Our objective is to
   preserve that balance while expanding the character repertoire to
   include extended versions of Roman-derived scripts and scripts that
   are not Roman in origin. No work of this sort will be able to
   completely eliminate sources of visual or textual confusion: such
   confusion is possible even under the original rules where only ASCII
   characters were permitted. However, one can hope, through the
   application of different techniques at different points (see
   Section 6.3), to keep problems to an acceptable minimum. One
   consequence of this general objective is that the desire of some user
   or marketing community to use a particular string --whether the
   reason is to try to write sentences of particular languages in the
   DNS, to express a facsimile of the symbol for a brand, or for some
   other purpose-- is not a primary goal within the context of
   applications in the domain name space.

1.4.  Applicability and Function of IDNA

   The IDNA standard does not require any applications to conform to it,
   nor does it retroactively change those applications.
   An application can elect to use IDNA in order to support IDN while
   maintaining interoperability with existing infrastructure. If an
   application wants to use non-ASCII characters in domain names, IDNA
   is the only currently-defined option. Adding IDNA support to an
   existing application entails changes to the application only, and
   leaves room for flexibility in front-end processing and more
   specifically in the user interface (see Section 9).

   A great deal of the discussion of IDN solutions has focused on
   transition issues and how IDNs will work in a world where not all of
   the components have been updated. Proposals that were not chosen by
   the original IDN Working Group would have depended on user
   applications, resolvers, and DNS servers being updated in order for
   a user to apply an internationalized domain name in any form or
   coding acceptable under that method. With IDNA, by contrast,
   processing must be performed prior to or after access to the DNS,
   but no changes are needed to the DNS protocol, to any DNS servers,
   or to the resolvers on users' computers.

   The IDNA specification solves the problem of extending the repertoire
   of characters that can be used in domain names to include a large
   subset of the Unicode repertoire.

   IDNA does not extend the service offered by DNS to the applications.
   Instead, the applications (and, by implication, the users) continue
   to see an exact-match lookup service. Either there is a single
   exactly-matching name or there is no match. This model has served
   the existing applications well, but it requires, with or without
   internationalized domain names, that users know the exact spelling of
   the domain names that are to be typed into applications such as web
   browsers and mail user agents. The introduction of the larger
The introduction of the larger 205 repertoire of characters potentially makes the set of misspellings 206 larger, especially given that in some cases the same appearance, for 207 example on a business card, might visually match several Unicode code 208 points or several sequences of code points. 210 IDNA allows the graceful introduction of IDNs not only by avoiding 211 upgrades to existing infrastructure (such as DNS servers and mail 212 transport agents), but also by allowing some rudimentary use of IDNs 213 in applications by using the ASCII representation of the non-ASCII 214 name labels. While such names are user-unfriendly to read and type, 215 and hence not optimal for user input, they allow (for instance) 216 replying to email and clicking on URLs even though the domain name 217 displayed is incomprehensible to the user. In order to allow user- 218 friendly input and output of the IDNs and acceptance of some 219 characters as equivalent to those to be processed according to the 220 protocol, the applications need to be modified to conform to this 221 specification. 223 IDNA uses the Unicode character repertoire, for continuity with 224 IDNA2003. 226 1.5. Terminology 228 1.5.1. Documents and Standards 230 This document uses the term "IDNA2003" to refer to the set of 231 standards that make up and support the version of IDNA published in 232 2003, i.e., those commonly known as the IDNA base specification 233 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep 234 [RFC3454]. In this document, those names are used to refer, 235 conceptually, to the individual documents, with the base IDNA 236 specification called just "IDNA". 238 The term "IDNA2008" is used to refer to a new version of IDNA as 239 described in this document and in the documents described in 240 Section 5. References to "these specifications" are to the entire 241 set. 243 1.5.2. 
Terminology about Characters and Character Sets 245 A code point is an integer value associated with a character in a 246 coded character set. 248 Unicode [Unicode51] is a coded character set containing almost 249 100,000 characters as of the current version. A single Unicode code 250 point is denoted by "U+" followed by four to six hexadecimal digits, 251 while a range of Unicode code points is denoted by two four to six 252 digit hexadecimal numbers separated by "..", with no prefixes. 254 ASCII means US-ASCII [ASCII], a coded character set containing 128 255 characters associated with code points in the range 0000..007F. 256 Unicode may be thought of as an extension of ASCII; it includes all 257 the ASCII characters and associates them with equivalent code points. 259 "Letters" are, informally, generalizations from the ASCII and common- 260 sense understanding of that term, i.e., characters that are used to 261 write text that are not digits, symbols, or punctuation. Formally, 262 they are characters with a Unicode General Category value starting in 263 "L" (see Section 4.5 of [Unicode51]). 265 1.5.3. DNS-related Terminology 267 When discussing the DNS, this document generally assumes the 268 terminology used in the DNS specifications [RFC1034] [RFC1035]. The 269 terms "lookup" and "resolution" are used interchangeably and the 270 process or application component that performs DNS resolution is 271 called a "resolver". The process of placing an entry into the DNS is 272 referred to as "registration" paralleling common contemporary usage 273 in other contexts. Consequently, any DNS zone administration is 274 described as a "registry", regardless of that actual administrative 275 arrangements or level in the tree. A note about that relationship is 276 included in the text below where it seems particularly significant. 
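
   As a non-normative illustration of the character terminology in
   Section 1.5.2, the following Python sketch formats a code point in
   "U+" notation, tests for the ASCII range 0000..007F, and applies the
   formal definition of a letter. The function names are hypothetical
   and are not defined by these specifications. Python's unicodedata
   module reflects whatever Unicode version ships with the interpreter,
   which is not necessarily the version cited as [Unicode51].

```python
import unicodedata


def u_plus(ch: str) -> str:
    """Format one character's code point as "U+" plus 4 to 6 hex digits."""
    return f"U+{ord(ch):04X}"


def is_ascii(ch: str) -> bool:
    """True if the code point lies in the ASCII range 0000..007F."""
    return ord(ch) <= 0x7F


def is_letter(ch: str) -> bool:
    """True if the Unicode General Category value starts with "L"."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    # Illustrative characters only: an ASCII letter, a non-ASCII letter,
    # a digit, and the hyphen-minus.
    for ch in ("a", "\u00e9", "5", "-"):
        print(u_plus(ch), is_ascii(ch), is_letter(ch))
```

   Under this definition a digit or the hyphen-minus is not a letter,
   even though both are permitted in traditional host names.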

   The term "LDH code points" is defined in this document to mean the
   code points associated with ASCII letters, digits, and the
   hyphen-minus; that is, U+002D, 0030..0039, 0041..005A, and
   0061..007A. "LDH" is an abbreviation for "letters, digits, hyphen".

   The base DNS specifications [RFC1034] [RFC1035] discuss "domain
   names" and "host names", but many people and sections of these
   specifications use the terms interchangeably. Further, because those
   documents were not terribly clear, many people who are sure they know
   the exact definitions of each of these terms disagree on the
   definitions. This document generally uses the term "domain name".
   When it refers to, e.g., host name syntax restrictions, it explicitly
   cites the relevant defining documents. The remaining definitions in
   this subsection are essentially a review.

   A label is an individual component of a domain name. Labels are
   usually shown separated by dots; for example, the domain name
   "www.example.com" is composed of three labels: "www", "example", and
   "com". (The zero-length root label described in [RFC1123], which can
   be explicit as in "www.example.com." or implicit as in
   "www.example.com", is not considered a label in this specification.)
   IDNA extends the set of usable characters in labels that are text.
   For the rest of this document, the term "label" is shorthand for
   "text label", and "every label" means "every text label".

1.5.4.  Terminology Specific to IDNA

   This section defines some terminology to reduce dependence on terms
   and definitions that have been problematic in the past.

1.5.4.1.  Terms for IDN Label Codings

1.5.4.1.1.  IDNA-valid strings, A-label, and U-label

   To improve clarity, this document introduces three new terms in this
   subsection. In the next, it defines a historical one to be slightly
   more precise for IDNA contexts.

   o  A string is "IDNA-valid" if it meets all of the requirements of
      these specifications for an IDNA label. IDNA-valid strings may
      appear in either of two forms, defined immediately below. It is
      expected that specific reference will be made to the form
      appropriate to any context in which the distinction is important.

   o  An "A-label" is the ASCII-Compatible Encoding (ACE, see
      Section 1.5.4.4) form of an IDNA-valid string. It must be a
      complete label: IDNA is defined for labels, not for parts of them
      and not for complete domain names. This means, by definition,
      that every A-label will begin with the IDNA ACE prefix, "xn--",
      followed by a string that is a valid output of the Punycode
      algorithm and hence a maximum of 59 ASCII characters in length.
      The prefix and string together must conform to all requirements
      for a label that can be stored in the DNS including conformance to
      the LDH ("host name") rule described in RFC 1034, RFC 1123 and
      elsewhere.

   o  A "U-label" is an IDNA-valid string of Unicode characters,
      including at least one non-ASCII character, expressed in a
      standard Unicode Encoding Form, normally UTF-8 in an Internet
      transmission context, and subject to the constraint below.
      Conversion between valid U-labels and valid A-labels is performed
      according to the specification in [RFC3492], adding or removing
      the ACE prefix (see Section 1.5.4.4) as needed.

   To be valid, U-labels and A-labels must obey an important symmetry
   constraint. While that constraint may be tested in any of several
   ways, an A-label must be capable of being produced by conversion from
   a U-label and a U-label must be capable of being produced by
   conversion from an A-label. Among other things, this implies that
   both U-labels and A-labels must represent strings in normalized form.
   These strings MUST contain only characters specified elsewhere in
   this document and its companion documents, and only in the contexts
   indicated as appropriate.

   Any rules or conventions that apply to DNS labels in general, such as
   rules about lengths of strings, apply to whichever of the U-label or
   A-label would be more restrictive. For the U-label, constraints
   imposed by existing protocols and their presentation forms make the
   length restriction apply to the length in octets of the UTF-8 form of
   those labels (which will always be greater than or equal to the
   length in code points). The exception to this, of course, is that
   the restriction to ASCII characters does not apply to the U-label.

   A different way to look at these terms, which may be more clear to
   some readers, is that U-labels, A-labels, and LDH-labels (see the
   next subsection) are disjoint categories that, together, make up the
   forms of legitimate strings for use in domain names that describe
   hosts. Of the three, only A-labels and LDH-labels can actually
   appear in DNS zone files or queries; U-labels can appear, along with
   the other two, in presentation and user interface forms and in
   selected protocols other than those of the DNS itself. Strings that
   do not conform to the rules for one of these three categories and, in
   particular, strings that contain "--" in the third and fourth
   character positions but that

   o  are not A-labels or

   o  cannot be processed as U-labels or A-labels as described in these
      specifications,

   are invalid in IDNA-conformant applications as labels in domain names
   that identify Internet hosts or similar resources.
   This restriction on strings containing "--" is required for three
   reasons:

   o  to prevent confusion with pre-IDNA coding forms;

   o  to permit future extensions that would require changing the
      prefix, no matter how unlikely those might be (see Section 10.3);
      and

   o  to reduce the opportunities for attacks via the encoding system.

1.5.4.2.  LDH-label and Internationalized Label

   In the hope of further clarifying discussions about IDNs, these
   specifications use the term "LDH-label" strictly to refer to an
   all-ASCII label that obeys the "hostname" (LDH) conventions and that
   is not an IDN. In other words, only "U-label" and "A-label" refer to
   IDNs; LDH-labels are not IDNs. "Internationalized label" is used
   when a term is needed to refer to any of the three categories. There
   are some standardized DNS label formats, such as those for service
   location (SRV) records [RFC2782], that do not fall into any of the
   three categories and hence are not internationalized labels.

1.5.4.3.  Equivalence

   In IDNA, equivalence of labels is defined in terms of the A-labels.
   If the A-labels are equal in a case-independent comparison, then the
   labels are considered equivalent, no matter how they are represented.
   Traditional LDH labels already have a notion of equivalence: within
   that list of characters, upper case and lower case are considered
   equivalent. The IDNA notion of equivalence is an extension of that
   older notion. Equivalent labels in IDNA are treated as alternate
   forms of the same label, just as "foo" and "Foo" are treated as
   alternate forms of the same label.

1.5.4.4.  ACE Prefix

   The "ACE prefix" is defined in this document to be a string of ASCII
   characters "xn--" that appears at the beginning of every A-label.
   "ACE" stands for "ASCII-Compatible Encoding".

1.5.4.5.  Domain Name Slot

   A "domain name slot" is defined in this document to be a protocol
   element or a function argument or a return value (and so on)
   explicitly designated for carrying a domain name. Examples of domain
   name slots include: the QNAME field of a DNS query; the name argument
   of the gethostbyname() or getaddrinfo() standard C library functions;
   the part of an email address following the at-sign (@) in the
   parameter to the SMTP MAIL or RCPT commands or the "From:" field of
   an email message header; and the host portion of the URI in the src
   attribute of an HTML <IMG> tag. General text that just happens to
   contain a domain name is not a domain name slot. For example, a
   domain name appearing in the plain text body of an email message is
   not occupying a domain name slot.

   An "IDN-aware domain name slot" is defined in this document to be a
   domain name slot explicitly designated for carrying an
   internationalized domain name as defined in this document. The
   designation may be static (for example, in the specification of the
   protocol or interface) or dynamic (for example, as a result of
   negotiation in an interactive session).

   An "IDN-unaware domain name slot" is defined in this document to be
   any domain name slot that is not an IDN-aware domain name slot.
   Obviously, this includes any domain name slot whose specification
   predates IDNA.

1.5.5.  Punycode is an Algorithm, not a Name

   There has been some confusion about whether a "Punycode string" does
   or does not include the prefix and about whether it is required that
   such strings could have been the output of ToASCII (see RFC 3490,
   Section 4 [RFC3490]). This specification discourages the use of the
   term "Punycode" to describe anything but the encoding method and
   algorithm of [RFC3492]. The terms defined above are preferred as
   much more clear than terms such as "Punycode string".

1.5.6.  Other Terminology Issues

   The document departs from historical DNS terminology and usage in one
   important respect. Over the years, the community has talked very
   casually about "names" in the DNS, beginning with calling it "the
   domain name system". That terminology is fine in the very precise
   sense that the identifiers of the DNS do provide names for objects
   and addresses. But, in the context of IDNs, the term has introduced
   some confusion, confusion that has increased further as people have
   begun to speak of DNS labels in terms of the words or phrases of
   various natural languages.

   Historically, many, perhaps most, of the "names" in the DNS have been
   mnemonics to identify some particular concept, object, or
   organization. They are typically derived from, or rooted in, some
   language because most people think in language-based ways. But,
   because they are mnemonics, they need not obey the orthographic
   conventions of any language: it is not a requirement that it be
   possible for them to be "words".

   This distinction is important because the reasonable goal of an IDN
   effort is not to be able to write the great Klingon (or language of
   one's choice) novel in DNS labels but to be able to form a usefully
   broad range of mnemonics in ways that are as natural as possible in a
   very broad range of scripts.

   An "internationalized domain name" (IDN) is a domain name that may
   contain any mixture of LDH-labels, A-labels, or U-labels. This
   implies that every conventional domain name is an IDN (which implies
   that it is possible for a domain name to be an IDN without it
   containing any non-ASCII characters). Just as has been the case with
   ASCII names, some DNS zone administrators may impose restrictions,
   beyond those imposed by DNS or IDNA, on the characters or strings
   that may be registered as labels in their zones.
   Because of the diversity of characters that can be used in a U-label
   and the confusion they might cause, such restrictions are mandatory
   for IDN registries and zones even though the particular restrictions
   are not part of these specifications. Because these restrictions,
   commonly known as "registry restrictions", only affect what can be
   registered and not resolution processing, they have no effect on the
   syntax or semantics of DNS protocol messages; a query for a name that
   matches no records will yield the same response regardless of the
   reason why it is not in the zone. Clients issuing queries or
   interpreting responses cannot be assumed to have any knowledge of
   zone-specific restrictions or conventions. See Section 6.2.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.6.  Comprehensibility of IDNA Mechanisms and Processing

   One of the major goals of this work is to improve the general
   understanding of how IDNA works and what characters are permitted and
   what happens to them. Comprehensibility and predictability to users
   and registrants are themselves important motivations and design goals
   for this effort. The effort includes some new terminology and a
   revised and extended model, both covered in this section, and some
   more specific protocol, processing, and table modifications. Details
   of the latter appear in other documents (see Section 5).

   Several issues are inherent in the application of IDNs and, indeed,
   almost any other system that tries to handle international characters
   and concepts. They range from the apparently trivial --e.g., one
   cannot display a character for which one does not have a font
   available locally-- to the more complex and subtle. Many people have
   observed that internationalization is just a tool to enable effective
   localization while permitting some global uniformity. Issues of
   display, of exactly how various strings and characters are entered,
   and so on are inherently issues about localization and user interface
   design.

   A protocol such as IDNA can only assume that such operations as data
   entry and reconciliation of differences in character forms are
   possible. It may make some recommendations about how display might
   work when characters and fonts are not available, but they can only
   be general recommendations and, because display functions are rarely
   controlled by the types of applications that would call upon IDNA,
   will rarely be very effective.

   However, shifting responsibility for character mapping and other
   adjustments from the protocol (where it was located in IDNA2003) to
   the user interface or processing before invoking IDNA raises issues
   about both what that processing should do and about compatibility for
   references prepared in an IDNA2003 context. Those issues are
   discussed in Section 9.

   Operations for converting between local character sets and normalized
   Unicode are part of this general set of user interface issues. The
   conversion is obviously not required at all in a Unicode-native
   system that maintains all strings in Normalization Form C (NFC). It
   may, however, involve some complexity in a system that is not
   Unicode-native, especially if the elements of the local character set
   do not map exactly and unambiguously into Unicode characters or do so
   in a way that is not completely stable over time. Perhaps more
   important, if a label being converted to a local character set
   contains Unicode characters that have no correspondence in that
   character set, the application may have to apply special,
   locally-appropriate, methods to avoid or reduce loss of information.
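
   As a non-normative illustration of the conversions described above,
   the following Python sketch decodes a string from a local character
   set, normalizes it to NFC, and converts back with strict error
   handling so that characters with no local correspondence are
   reported rather than silently lost. The function names and the use
   of ISO 8859-1 in the example are assumptions made for illustration
   only; these specifications do not define such an interface.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the string in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


def from_local(raw: bytes, charset: str) -> str:
    """Decode octets from a known local character set, then apply NFC.

    The caller must already know which character set is in use.
    """
    return to_nfc(raw.decode(charset))


def to_local(label: str, charset: str) -> bytes:
    """Encode a label for a local character set.

    Strict encoding raises UnicodeEncodeError when a character has no
    correspondence in the target set, leaving the choice of a locally
    appropriate fallback to the application.
    """
    return to_nfc(label).encode(charset)


if __name__ == "__main__":
    # "e" followed by a combining acute accent composes to one code point.
    print(len("e\u0301"), len(to_nfc("e\u0301")))
    print(from_local(b"caf\xe9", "iso-8859-1"))
    try:
        to_local("\u4e2d", "iso-8859-1")
    except UnicodeEncodeError as err:
        print("no local correspondence:", err.reason)
```

   Normalizing before encoding matters here: the decomposed sequence
   "e" plus combining acute accent cannot be encoded in ISO 8859-1,
   while its NFC form can.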
Depending on the system involved, the major difficulty may not lie in the mapping but in accurately identifying the incoming character set and then applying the correct conversion routine. If a local operating system uses one of the ISO 8859 character sets or an extensive national or industrial system such as GB18030 [GB18030] or BIG5 [BIG5], one must correctly identify the character set in use before converting to Unicode even though those character coding systems are substantially or completely Unicode-compatible (i.e., all of the code points in them have an exact and unique mapping to Unicode code points). It may be even more difficult when the character coding system in local use is based on conceptually different assumptions than those used by Unicode about, e.g., font encodings used for publications in some Indic scripts. Those differences may not easily yield unambiguous conversions or interpretations even if each coding system is internally consistent and adequate to represent the local language and script.

2.  Summary of Major Changes from IDNA2003

1.  Update the base character set from Unicode 3.2 to be Unicode version-agnostic.

2.  Separate the definitions for the "registration" and "lookup" activities.

3.  Disallow symbol and punctuation characters except where special exceptions are necessary.

4.  Remove the mapping and normalization steps from the protocol and have them instead done by the applications themselves, possibly in a local fashion, before invoking the protocol.

5.  Change the way that the protocol specifies which characters are allowed in labels from "humans decide what the table of codepoints contains" to "decisions about codepoints are based on Unicode properties plus a small exclusion list created by humans".

6.  Introduce the new concept of characters that can be used only in specific contexts.

7.  Allow typical words and names in languages such as Dhivehi and Yiddish to be expressed.

8.  Make bidirectional domain names (delimited strings of labels, not just labels standing on their own) display in a non-surprising fashion.

9.  Make bidirectional domain names in a paragraph display in a non-surprising fashion. [[anchor17: Is this statement necessary or is it redundant with the previous one?]]

10. Remove the dot separator from the mandatory part of the protocol.

11. Make some currently-valid labels that are not actually IDNA labels invalid.

3.  The Revised IDNA Model

IDNA is a client-side protocol, i.e., almost all of the processing is performed by the client. The strings that appear in, and are resolved by, the DNS conform to the traditional rules for the naming of hosts, and consist of ASCII letters, digits, and hyphens. This approach permits IDNA to be deployed without modifications to the DNS itself. That, in turn, avoids both having to upgrade the entire Internet to support IDNs and needing to incur the unknown risks to deployed systems of DNS structural or design changes, especially if those changes need to be deployed all at the same time.

4.  Processing in IDNA2008

These specifications separate Domain Name Registration and Resolution in the protocol specification. Doing so reflects current practice in which per-registry restrictions and special processing are applied at registration time but not on resolution. Even more important in the longer term, it facilitates incremental addition of permitted character groups to avoid freezing on one particular version of Unicode.

The actual registration and lookup protocols for IDNA2008 are specified in [IDNA2008-Protocol].

5.  IDNA2008 Document List

[[anchor19: This section will need to be extensively revised or removed before publication.]]

The following documents are being produced as part of the IDNA2008 effort.

o  A revised version of this document, containing an overview, rationale, and conformance conditions.

o  A separate document, drawn from material in early versions of this one, that explicitly updates and replaces RFC 3490 but which has most rationale material from that document moved to this one [IDNA2008-Protocol].

o  A document describing the "Bidi problem" with Stringprep and proposing a solution [IDNA2008-Bidi].

o  A specification of the categories and rules that identify the code points allowed in a U-label, based on Unicode 5.0 code assignments. See Section 6 and [IDNA2008-Tables].

o  One or more documents containing guidance and suggestions for registries (in this context, those responsible for establishing policies for any zone file in the DNS, not only those at the top or second level). The documents in this category may not be IETF products and may be prepared and completed asynchronously with those described above.

6.  Permitted Characters: An Inclusion List

This section provides an overview of the model used to establish the algorithm and character lists of [IDNA2008-Tables] and describes the names and applicability of the categories used there. Note that the inclusion of a character in the first category group does not imply that it can be used indiscriminately; some characters are associated with contextual rules that must be applied as well.

The information given in this section is provided to make the rules, tables, and protocol easier to understand. It is not normative. The normative generating rules appear in [IDNA2008-Tables] and the rules that actually determine what labels can be registered or looked up are in [IDNA2008-Protocol].

6.1.  A Tiered Model of Permitted Characters and Labels

Moving to an inclusion model requires respecifying the list of characters that are permitted in IDNs. In IDNA2003, the role and utility of characters are independent of context and fixed forever (or until the standard is replaced). Making completely context-independent rules globally has proven impractical because some characters, especially those that are called "Join_Controls" in Unicode, are needed to make reasonable use of some scripts but have no visible effect(s) in others. Of necessity, IDNA2003 prohibited those types of characters entirely. But the restrictions were much too severe to permit an adequate range of mnemonics for terminology based on some languages. The requirement to support those characters but limit their use to very specific contexts was reinforced by the observation that handling of particular characters across the languages that use a script, or the use of similar or identical-looking characters in different scripts, is less well understood than many people believed it was several years ago.

Independently of the characters chosen (see next subsection), the theory is to divide the characters that appear in Unicode into three categories:

6.1.1.  PROTOCOL-VALID

Characters identified as "PROTOCOL-VALID" (often abbreviated "PVALID") are, in general, permitted by IDNA for all uses in IDNs. Their use may be restricted by rules about the context in which they appear or by other rules that apply to the entire label in which they are to be embedded. For example, any label that contains a character in this group that has a "right to left" property must be used in context with the "Bidi" rules (see [IDNA2008-Bidi]).

The term "PROTOCOL-VALID" is used to stress the fact that the presence of a character in this category does not imply that a given registry need accept registrations containing any of the characters in the category. Registries are still expected to apply judgment about labels they will accept and to maintain rules consistent with those judgments (see [IDNA2008-Protocol] and Section 6.3).

Characters that are placed in the "PROTOCOL-VALID" category are never removed from it unless the code points themselves are removed from Unicode (such removal would be inconsistent with the Unicode stability principles (see [Unicode51], Appendix F) and hence should never occur).

[[anchor21: Placeholder: Does this topic or comment need additional discussion or explanation?]]

6.1.1.1.  Contextual Rules

Some characters may be unsuitable for general use in IDNs but necessary for the plausible support of some scripts. The two most commonly-cited examples are the zero-width non-joiner and joiner characters (ZWNJ, U+200C, and ZWJ, U+200D), but provisions for unambiguous labels may require that other characters be restricted to particular contexts. For example, the ASCII hyphen is not permitted to start or end a label, whether that label contains non-ASCII characters or not.

These characters must not appear in IDNs without additional restrictions, typically because they have no visible consequences in most scripts but affect format or presentation in a few others or because they are combining characters that are safe for use only in conjunction with particular characters or scripts. In order to permit them to be used at all, they are specially identified as "CONTEXTUAL RULE REQUIRED" and, when adequately understood, associated with a rule. In addition, the rule will define whether it is to be applied on lookup as well as registration.
A distinction is made between characters that indicate or prohibit joining (known as "CONTEXT-JOINER" or "CONTEXTJ") and other characters requiring contextual treatment ("CONTEXT-OTHER" or "CONTEXTO"). Only the former are fully tested at lookup time.

6.1.1.2.  Rules and Their Application

The actual rules may be present or absent. If present, they may have values of "True" (character may be used in any position in any label), "False" (character may not be used in any label), or may be an extended regular expression that specifies the context in which the character is permitted.

Examples of descriptions of typical rules, stated informally and in English, include "Must follow a character from Script XYZ", "MUST occur only if the entire label is in Script ABC", and "MUST occur only if the previous and subsequent characters have the DFG property".

Because it is easier to identify these characters than to know that they are actually needed in IDNs or how to establish exactly the right rules for each one, a rule may have a null value in a given version of the tables. Characters associated with null rules MUST NOT appear in putative labels for either registration or lookup. Of course, a later version of the tables might contain a non-null rule.

[[anchor23: Definition of regular expression language to be supplied or replaced with a description of the definitional technique. It may be useful to move more of this material to Tables as part of moving the rules from Protocol to Tables.]]

6.1.2.  DISALLOWED

Some characters are sufficiently problematic for use in IDNs that they should be excluded for both registration and lookup (i.e., conforming applications performing name resolution should verify that these characters are absent; if they are present, the label strings should be rejected rather than converted to A-labels and looked up).
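
The lookup-time check described above can be sketched as follows. This is an illustration only: toy_is_disallowed and check_before_lookup are hypothetical names, and the classifier is a rough stand-in built from the three groups listed later in this section. It is not the normative algorithm of [IDNA2008-Tables], and it ignores exceptions, contextual rules, and unassigned code points.

```python
import unicodedata


def toy_is_disallowed(ch: str) -> bool:
    """Rough, non-normative stand-in for a DISALLOWED classification."""
    # Group 1: compatibility equivalents (NFKC changes the character).
    if unicodedata.normalize("NFKC", ch) != ch:
        return True
    # Group 2: upper-case and other forms changed by case folding.
    if ch.casefold() != ch:
        return True
    # The ASCII hyphen is handled by positional rules, not excluded here.
    if ch == "-":
        return False
    # Group 3: anything that is not a letter, digit, or mark.
    return unicodedata.category(ch)[0] not in ("L", "N", "M")


def check_before_lookup(label: str) -> str:
    """Reject a putative label before any conversion to an A-label."""
    bad = [f"U+{ord(ch):04X}" for ch in label if toy_is_disallowed(ch)]
    if bad:
        raise ValueError("label rejected, disallowed code points: " + ", ".join(bad))
    return label
```

Under these assumptions a label such as "a-b1" passes unchanged, while "Example" (upper-case letter), "a!b" (punctuation), and a label containing U+FB05 (a compatibility character) raise ValueError instead of being converted and looked up.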

Of course, this category would include code points that had been removed entirely from Unicode should such removals ever occur.

Characters that are placed in the "DISALLOWED" category are expected to never be removed from it or reclassified. If a character is classified as "DISALLOWED" in error and the error is sufficiently problematic, the only recourse would be either to introduce a new code point into Unicode and classify it as "PROTOCOL-VALID" or for the IETF to accept the considerable costs of an incompatible change and replace the relevant RFC with one containing appropriate exceptions.

[[anchor24: Note in Draft: the permanence of DISALLOWED was still under discussion in the WG when this draft was posted. The text above reflects the editor's opinion about the emerging consensus but is subject to change as the discussion continues.]]

There is provision for exception cases but, in general, characters are placed into "DISALLOWED" if they fall into one or more of the following groups:

o  The character is a compatibility equivalent for another character. In slightly more precise Unicode terms, application of normalization method NFKC to the character yields some other character.

o  The character is an upper-case form or some other form that is mapped to another character by Unicode casefolding.

o  The character is a symbol or punctuation form or, more generally, something that is not a letter, digit, or a mark that is used to form a letter or digit.

6.1.3.  UNASSIGNED

For convenience in processing and table-building, code points that do not have assigned values in a given version of Unicode are treated as belonging to a special UNASSIGNED category. Such code points MUST NOT appear in labels to be registered or looked up.
The category differs from DISALLOWED in that code points are moved out of it by the simple expedient of being assigned in a later version of Unicode (at which point, they are classified into one of the other categories as appropriate).

6.2.  Registration Policy

While these recommendations cannot and should not define registry policies, registries SHOULD develop and apply additional restrictions to reduce confusion and other problems. For example, it is generally believed that labels containing characters from more than one script are a bad practice although there may be some important exceptions to that principle. Some registries may choose to restrict registrations to characters drawn from a very small number of scripts. For many scripts, the use of variant techniques such as those described in [RFC3743] and [RFC4290], and illustrated for Chinese by the tables described in RFC 4713 [RFC4713], may be helpful in reducing problems that might be perceived by users. It is worth stressing that these principles of policy development and application apply at all levels of the DNS, not only to, e.g., TLD registrations.

6.3.  Layered Restrictions: Tables, Context, Registration, Applications

The essence of the character rules in IDNA2008 is based on the realization that there is no magic bullet for any of the issues associated with a multiscript DNS. Instead, the specifications define a variety of approaches that, together, constitute multiple lines of defense against ambiguity in identifiers and loss of referential integrity. The actual character tables are the first mechanism, protocol rules about how those characters are applied or restricted in context are the second, and those two in combination constitute the limits of what can be done by a protocol alone.
As discussed in the previous section (Section 6.2), registries are expected to restrict what they permit to be registered, devising and using rules that are designed to optimize the balance between confusion and risk on the one hand and maximum expressiveness in mnemonics on the other.

In addition, there is an important role for user agents in warning against label forms that appear unreasonable given their knowledge of local contexts and conventions. Of course, no approach based on naming or identifiers alone can protect against all threats.

[[anchor25: Note in Draft: the last sentence above basically duplicates a comment in Security Considerations. Is it worth having in both places??]]

7.  Issues that Constrain Possible Solutions

7.1.  Display and Network Order

The correct treatment of domain names requires a clear distinction between Network Order (the order in which the code points are sent in protocols) and Display Order (the order in which the code points are displayed on a screen or paper). The order of labels in a domain name that contains characters that are normally written right to left is discussed in [IDNA2008-Bidi]. In particular, there are questions about the order in which labels are displayed if left to right and right to left labels are adjacent to each other, especially if there are also multiple consecutive appearances of one of the types. The decision about the display order is ultimately under the control of user agents --including web browsers, mail clients, and the like-- which may be highly localized. Even when formats are specified by protocols, the full composition of an Internationalized Resource Identifier (IRI) [RFC3987] or Internationalized Email address contains elements other than the domain name.
For example, IRIs contain protocol identifiers and field delimiter syntax such as "http://" or "mailto:" while email addresses contain the "@" to separate local parts from domain names. User agents are not required to use those protocol-based forms directly but often do so. While display, parsing, and processing within a label are specified by the IDNA protocol and the associated documents, the relationship between fully-qualified domain names and internationalized labels is unchanged from the base DNS specifications. Comments here about such full domain names are explanatory or examples of what might be done and must not be considered normative.

Questions remain about protocol constraints implying that the overall direction of these strings will always be left to right (or right to left) for an IRI or email address, or if they even should conform to such rules. These questions also have several possible answers. Should a domain name abc.def, in which both labels are represented in scripts that are written right to left, be displayed as fed.cba or cba.fed? An IRI for clear text web access would, in network order, begin with "http://" and the characters will appear as "http://abc.def" -- but what does this suggest about the display order? When entering a URI into many browsers, it may be possible to provide only the domain name and leave the "http://" to be filled in by default, assuming no tail (an approach that does not work for other protocols). The natural display order for the typed domain name on a right to left system is fed.cba. Does this change if a protocol identifier, tail, and the corresponding delimiters are specified?
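
The two candidate display forms in the question above can be made concrete with a small sketch. It only illustrates the ambiguity, using the Latin placeholders from the text in place of real right to left characters. The function names are hypothetical, and nothing here implements the Unicode bidirectional algorithm or the rules of [IDNA2008-Bidi].

```python
def reverse_whole_name(name: str) -> str:
    """Lay out every character right to left, dots included."""
    return name[::-1]


def reverse_within_labels(name: str) -> str:
    """Lay out each label right to left but keep labels in network order."""
    return ".".join(label[::-1] for label in name.split("."))


network_order = "abc.def"
print(reverse_whole_name(network_order))     # fed.cba
print(reverse_within_labels(network_order))  # cba.fed
```

Both outputs are plausible renderings of the same network-order string, which is why two user agents making independent choices could show the same name differently.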

While logic, precedent, and reality suggest that these are questions for user interface design, not IETF protocol specifications, experience in the 1980s and 1990s with mixing systems in which domain name labels were read in network order (left to right) and those in which those labels were read right to left would predict a great deal of confusion, and heuristics that sometimes fail, if each implementation of each application makes its own decisions on these issues.

It should be obvious that any revision of IDNA, including the current one, must be clear about the network (transmission on the wire) order of characters in labels and for the labels in complete (fully-qualified) domain names. In order to prevent user confusion and, in particular, to reduce the chances for inconsistent transcription of domain names from printed form, it is likely that some strong suggestions should be made about display order as well.

7.2.  Entry and Display in Applications

Applications can accept domain names using any character set or sets desired by the application developer or specified by the operating system, and can display domain names in any charset. That is, the IDNA protocol does not affect the interface between users and applications.

An IDNA-aware application can accept and display internationalized domain names in two formats: the internationalized character set(s) supported by the application (i.e., an appropriate local representation of a U-label), and as an A-label. Applications MAY allow the display and user input of A-labels, but are encouraged to not do so except as an interface for special purposes, possibly for debugging, or to cope with display limitations. A-labels are opaque and ugly, and, where possible, should thus only be exposed to users and in contexts in which they are absolutely needed.
Because IDN labels can be rendered either as A-labels or U-labels, the application may reasonably have an option for the user to select the preferred method of display; if it does, rendering the U-label should normally be the default.

Domain names are often stored and transported in many places. For example, they are part of documents such as mail messages and web pages. They are transported in many parts of many protocols, such as both the control commands and the RFC 2822 body parts of SMTP, and the headers and the body content in HTTP. It is important to remember that domain names appear both in domain name slots and in the content that is passed over protocols.

In protocols and document formats that define how to handle specification or negotiation of charsets, labels can be encoded in any charset allowed by the protocol or document format. If a protocol or document format only allows one charset, the labels MUST be given in that charset. Of course, not all charsets can properly represent all labels. If a U-label cannot be displayed in its entirety, the only choice (without loss of information) may be to display the A-label.

In any place where a protocol or document format allows transmission of the characters in internationalized labels, labels SHOULD be transmitted using whatever character encoding and escape mechanism the protocol or document format uses at that place. This provision is intended to prevent situations in which, e.g., UTF-8 domain names appear embedded in text that is otherwise in some other character coding.

All protocols that use domain name slots already have the capacity for handling domain names in the ASCII charset. Thus, A-labels can inherently be handled by those protocols.

7.3.  Linguistic Expectations: Ligatures, Digraphs, and Alternate Character Forms

Users often have expectations about character matching or equivalence that are based on their languages and the orthography of those languages. These expectations may not be consistent with forms or actions that can be naturally accommodated in a character coding system, especially if multiple languages are written using the same script but using different conventions. A Norwegian user might expect a label with the ae-ligature to be treated as the same label as one using the Swedish spelling with a-umlaut even though applying that mapping to English would be astonishing to users. A German user might expect a label with an o-umlaut and a label that had "oe" substituted, but was otherwise the same, to be treated as equivalent even though that substitution would be a clear error in Swedish. A Chinese user might expect automatic matching of Simplified and Traditional Chinese characters, but applying that matching for Korean or Japanese text would create considerable confusion. For that matter, an English user might expect "theater" and "theatre" to match.

Related issues arise because there are a number of languages written with alphabetic scripts in which single phonemes are written using two characters, termed a "digraph", for example, the "ph" in "pharmacy" and "telephone". (Note that characters paired in this manner can also appear consecutively without forming a digraph, as in "tophat".) Certain digraphs are normally indicated typographically by setting the two characters closer together than they would be if used consecutively to represent different phonemes. Some digraphs are fully joined as ligatures (strictly designating setting totally without intervening white space, although the term is sometimes applied to close set pairs).
An example of this may be seen when the word "encyclopaedia" is set with a U+00E6 LATIN SMALL LETTER AE (and some would not consider that word correctly spelled unless the ligature form was used or the "a" was dropped entirely). When these ligature and digraph forms have the same interpretation across all languages that use a given script, application of Unicode normalization generally resolves the differences and causes them to match. When they have different interpretations, any requirements for matching must utilize other methods or users must be educated to understand that matching will not occur.

Difficulties arise from the fact that a given ligature may be a completely optional typographic convenience for representing a digraph in one language (as in the above example with some spelling conventions), while in another language it is a single character that may not always be correctly representable by a two-letter sequence (as in the above example with different spelling conventions). This can be illustrated by many words in the Norwegian language, where the "ae" ligature is the 27th letter of a 29-letter extended Latin alphabet. It is equivalent to the 28th letter of the Swedish alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, for which an "ae" cannot be substituted according to current orthographic standards.

That character (U+00E4) is also part of the German alphabet where, unlike in the Nordic languages, the two-character sequence "ae" is usually treated as a fully acceptable alternate orthography for the "umlauted a" character. The inverse is however not true, and those two characters cannot necessarily be combined into an "umlauted a".
This also applies to another German character, the "umlauted o" (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) which, for example, cannot be used for writing the name of the author "Goethe". It is also a letter in the Swedish alphabet where, in parallel to the "umlauted a", it cannot be correctly represented as "oe" and in the Norwegian alphabet, where it is represented, not as "umlauted o", but as "slashed o", U+00F8.

Some of the ligatures that have explicit code points in Unicode were given special handling in IDNA2003 and now pose additional problems as people argue that they should have been treated differently to preserve important information. For example, the German character Eszett (Sharp S, U+00DF) is retained as itself by NFKC but case-folded by Stringprep to "ss", whereas the closely related but less frequently seen character "Long S T" (U+FB05) is a compatibility character that is mapped out by NFKC. Unless exceptions are made, both will be treated as DISALLOWED by IDNA2008. But there is significant interest in an exception, especially for Eszett. Depending on what the exception was, making it would either raise some backward compatibility problems with IDNA2003 or create an unusual special case that would highlight differences in preferred orthography between German as written in Germany and German as written in some other countries, notably Switzerland. Additional discussion of issues with Eszett appears in Section 10.7.

Additional cases with alphabets written right to left are described in Section 7.5.

Whether ligatures and digraphs are to be treated as a sequence of characters or as a single standalone one constitutes a problem that cannot be resolved solely by operating on scripts. They are, however, a key concern in the IDN context.
Their satisfactory resolution will require support in policies set by registries, which therefore need to be particularly mindful not just of this specific issue, but of all other related matters that cannot be dealt with on an exclusively algorithmic basis.

Just as with the examples of different-looking characters that may be assumed to be the same, it is in general impossible to deal with these situations in a system such as IDNA -- or with Unicode normalization generally -- since determining what to do requires information about the language being used, context, or both. Consequently, these specifications make no attempt to treat these combined characters in any special way. However, their existence provides a prime example of a situation in which a registry that is aware of the language context in which labels are to be registered, and where that language sometimes (or always) treats the two-character sequences as equivalent to the combined form, should give serious consideration to applying a "variant" model [RFC3743] [RFC4290] to reduce the opportunities for user confusion and fraud that would result from the related strings being registered to different parties.

7.4.  Case Mapping and Related Issues

Traditionally in the DNS, ASCII letters have been stored with their case preserved. Matching during the query process has been case-independent, but none of the information that might be represented by choices of case has been lost. That model has been accidentally helpful because, as people have created DNS labels by catenating words (or parts of words) to form labels, case has often been used to distinguish among components and make the labels more memorable.
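
The traditional behavior described above can be sketched as a store that keeps each label exactly as entered while comparing ASCII letters without regard to case. The class and function names are hypothetical, and the sketch models only the matching behavior for ASCII labels, not any real DNS server.

```python
from typing import Dict, Optional


def ascii_lower(label: str) -> str:
    """Lower-case only the ASCII letters A-Z, leaving other characters alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


class CasePreservingZone:
    """Toy zone: case is preserved in storage and ignored in matching."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def register(self, label: str) -> None:
        # The key is case-folded for matching; the value keeps the original.
        self._records[ascii_lower(label)] = label

    def lookup(self, label: str) -> Optional[str]:
        return self._records.get(ascii_lower(label))


zone = CasePreservingZone()
zone.register("BigCompany")
found = zone.lookup("bigcompany")
```

Here the lookup succeeds for any mix of upper and lower case, and the value returned still carries the capitalization chosen at registration. The matching happens inside the store, which is the part that an IDNA-like, client-side model cannot rely on.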

The solution of keeping the characters separate but doing matching independent of case is not feasible with an IDNA-like model because the matching would then have to be done on the server rather than have characters mapped on the client. That situation was recognized in IDNA2003 and nothing in IDNA2008 fundamentally changes it or could do so. In IDNA2003, all upper-case characters are mapped to lower-case ones and, in general, all code points that represent alternate forms of the same character are mapped to that character (including mapping Greek final form sigma to the medial form). IDNA2008 permits, at the risk of some incompatibility, slightly more flexibility in this area. That additional flexibility still does not solve the problem with final form sigma and other characters that Unicode treats as completely separate characters that match only under casemapping if at all. Many people now believe these should be handled as separate characters so information about them can be preserved in the transformations to A-labels and back. However, making a change to permit that behavior would create a situation in which the same string, valid in both protocols, would be interpreted differently by IDNA2003 and IDNA2008. In principle, that would violate one of the conditions discussed in Section 10.3.1 and hence require a prefix change. Of course, if a prefix change were made (at the costs discussed in Section 10.3.3) there would be several options, including, if desired, assigning the character to the CONTEXTUAL RULE REQUIRED category and requiring that it only be used in carefully-selected contexts.

7.5.  Right to Left Text

In order to be sure that the directionality of right to left text is unambiguous, IDNA2003 required that any label in which right to left characters appear both start and end with them and not include any characters with strong left to right properties (which excludes other alphabetic characters but permits European digits), and it rejected any other string that contains a right to left character. This is one of the few places where the IDNA algorithms (both old and new) are required to look at an entire label, not just at individual characters. The algorithmic model used in IDNA2003 rejects the label when the final character in a right to left string requires a combining mark in order to be correctly represented.

This problem manifests itself in languages written with consonantal alphabets to which diacritical vocalic systems are applied, and in languages with orthographies derived from them where the combining marks may have different functionality. In both cases the combining marks can be essential components of the orthography. Examples of this are Yiddish, written with an extended Hebrew script, and Dhivehi (the official language of Maldives) which is written in the Thaana script (which is, in turn, derived from the Arabic script). The new rules for right to left scripts are described in [IDNA2008-Bidi].

8.  IDNs and the Robustness Principle

The model of IDNs described in this document can be seen as a particular instance of the "Robustness Principle" that has been so important to other aspects of Internet protocol design. This principle is often stated as "Be conservative about what you send and liberal in what you accept" (see, e.g., RFC 1123, Section 1.2.2 [RFC1123]).
For IDNs to work well, not only must the protocol be 1173 carefully designed and implemented, but zone administrators 1174 (registries) must have and require sensible policies about what is 1175 registered -- conservative policies -- and implement and enforce 1176 them. 1178 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels 1179 that clearly violate global (protocol) rules (no one has ever 1180 seriously claimed that being liberal in what is accepted requires 1181 being stupid). However, once one gets past such global rules and 1182 deals with anything sensitive to script or locale, it is necessary to 1183 assume that garbage has not been placed into the DNS, i.e., one must 1184 be liberal about what one is willing to look up in the DNS rather 1185 than guessing about whether it should have been permitted to be 1186 registered. 1188 As mentioned elsewhere, if a string doesn't resolve, it makes no 1189 difference whether it simply wasn't registered or was prohibited by 1190 some rule. 1192 If resolvers, as a user interface (UI) or other local matter, decide 1193 to warn about some strings that are valid under the global rules but 1194 that they perceive as dangerous, that is their prerogative and we can 1195 only hope that the market (and maybe regulators) will reinforce the 1196 good choices and discourage the poor ones. In this context, a 1197 resolver that decides a string that is valid under the protocol is 1198 dangerous and refuses to look it up is in violation of the protocols; 1199 one that is willing to look something up, but warns against it, is 1200 exercising a local choice. 1202 9. Front-end and User Interface Processing 1204 Domain names may be identified and processed in many contexts. They 1205 may be typed in by users either by themselves or as part of URIs or 1206 IRIs. They may occur in running text or be processed by one system 1207 after being provided in another. 
Systems may wish to try to 1208 normalize URLs so as to determine (or guess) whether a reference is 1209 valid or two references point to the same object without actually 1210 looking the objects up and comparing them. Some of these goals may 1211 be more easily and reliably satisfied than others. While there are 1212 strong arguments for any domain name that is placed "on the wire" -- 1213 transmitted between systems -- to be in the minimum-ambiguity forms 1214 of A-labels, U-labels, or LDH-labels, it is inevitable that programs 1215 that process domain names will encounter variant forms. One source 1216 of such forms will be labels created under IDNA2003. Because of the 1217 way that protocol was specified, there are a significant number of 1218 domain names in files on the Internet that use characters that cannot 1219 be represented directly in domain names but for which interpretations 1220 are provided. There are two major categories of such characters, 1221 those that are removed by NFKC normalization and those upper-case 1222 characters that are mapped to lower-case (there are also a few 1223 characters that are given special-case mapping treatment in 1224 Stringprep). [[anchor29: The text above is too obscure, but was 1225 intended to address the mapping differences between IDNA2003 and the 1226 current proposal. Patrik suggests the following, which will need 1227 some tuning before it can be inserted: One source of such forms will 1228 be labels created under IDNA2003, as some allowed labels were 1229 transformed before they were turned into their ASCII (xn--) form so 1230 that ToUnicode(ToASCII(label)) != label. This is why IDNA2008 1231 explicitly defines A-label and U-label as forms of the label that 1232 are stable when converting between A-label and U-label, without 1233 mappings.
A different way of explaining this is that there could be 1234 already today domain names in files on the Internet that use 1235 characters that cannot be represented directly in domain names but 1236 for which interpretations are provided. There are two major 1237 categories of such characters, those that are removed by NFKC 1238 normalization and those upper-case characters that are mapped to 1239 lower-case (there are also a few characters that are given special- 1240 case mapping treatment in Stringprep)."]] 1242 Other issues in domain name identification and processing arise 1243 because IDNA2003 specified that several other characters be treated 1244 as equivalent to the ASCII period (dot, full stop) character used as 1245 a label separator. If a domain name appears in an arbitrary context 1246 (such as running text), it is difficult, even with only ASCII 1247 characters, to know whether a domain name (or a protocol parameter 1248 like a URI) is present and where it starts and ends. When using 1249 Unicode this gets even more difficult if treatment of certain special 1250 characters (like the dot that separates labels in a domain name) 1251 depends on context. That problem occurs if the dot is part of a 1252 domain name or not, which would mean that, contrary to common 1253 practice today, the primary heuristic for identifying a domain name 1254 depends on dots separating strings with no intervening spaces. 1255 [[anchor30: Above text is a substitute for an earlier (pre -01) 1256 version and is hoped to be more clear. Comments and improvements 1257 welcome.]] 1259 As discussed elsewhere in this document, the IDNA2008 model removes 1260 all of these mappings and interpretations, including the equivalence 1261 of different forms of dots, from the protocol, leaving such mappings 1262 to local processing. This should not be taken to imply that local 1263 processing is optional or can be avoided entirely. 
Instead, unless 1264 the program context is such that it is known that any IDNs that 1265 appear will be either U-labels or A-labels, some local processing of 1266 apparent domain name strings will be required, both to maintain 1267 compatibility with IDNA2003 and to prevent user astonishment. Such 1268 local processing, while not specified in this document or the 1269 associated ones, will generally take one of two forms: 1271 o Generic Preprocessing. 1272 When the context in which the program or system that processes 1273 domain names operates is global, a reasonable balance must be 1274 found that is sensitive to the broad range of local needs and 1275 assumptions while, at the same time, not sacrificing the needs of 1276 one language, script, or user population to those of another. 1278 For this case, the best practice will usually be to apply NFKC and 1279 case-mapping (or, perhaps better yet, Stringprep itself), plus 1280 dot-mapping where appropriate, to the domain name string prior to 1281 applying IDNA. That practice will not only yield a reasonable 1282 compromise of user experience with protocol requirements but will 1283 be almost completely compatible with the various forms permitted 1284 by IDNA2003. 1286 o Highly Localized Preprocessing. 1287 Unlike the case above, there will be some situations in which 1288 software will be highly localized for a particular environment and 1289 carefully adapted to the expectations of users in that 1290 environment. The many discussions about using the Internet to 1291 preserve and support local cultures suggest that these cases may 1292 be more common in the future than they have been so far. 1294 In these cases, we should avoid trying to tell implementers what 1295 they should do, if only because they are quite likely (and for 1296 good reason) to ignore us. We would assume that they would map 1297 characters that the intuitions of their users would suggest be 1298 mapped. 
One can imagine switches about whether some sorts of 1299 mappings occur, warnings before applying them, or, in a slightly 1300 more extreme version of the approach taken in Internet Explorer 1301 version 7 (IE7), an outright refusal to handle "strange" characters at 1302 all if they appear in U-label form. None of those local decisions 1303 is a threat to interoperability as long as (i) only U-labels and 1304 A-labels are used in interchange with systems outside the local 1305 environment, (ii) no character that would be valid in a U-label as 1306 itself is mapped to something else, (iii) any local mappings are 1307 applied as a preprocessing step (or, for conversions from U-labels 1308 or A-labels to presentation forms, postprocessing), not as part of 1309 IDNA processing proper, and (iv) appropriate consideration is 1310 given to labels that might have entered the environment in 1311 conformance to IDNA2003. [[anchor31: Placeholder: there have been 1312 suggestions that this text be removed entirely. Comments (or 1313 improved text) welcome.]] 1315 10. Migration and Version Synchronization 1317 10.1. Design Criteria 1319 As mentioned above and in RFC 4690, two key goals of this work are to 1320 enable applications to be agnostic about whether they are being run 1321 in environments supporting any Unicode version from 3.2 onward and to 1322 permit incrementally adding permitted scripts and other character 1323 collections without disruption or, subsequent to this version, 1324 "heavy" processes such as formation of an IETF WG. The mechanisms 1325 that support this are outlined above, but this section reviews them 1326 in a context that may be more helpful to those who need to understand 1327 the approach and make plans for it. 1329 10.1.1.
General IDNA Validity Criteria 1331 The general criteria for a putative label, and the collection of 1332 characters that make it up, to be considered IDNA-valid are: 1334 o The characters are "letters", marks needed to form letters, 1335 numerals, or other code points used to write words in some 1336 language. Symbols, drawing characters, and various notational 1337 characters are permanently excluded -- some because they are 1338 actively dangerous in URI, IRI, or similar contexts and others 1339 because there is no evidence that they are important enough to 1340 Internet operations or internationalization to justify inclusion 1341 and the complexities that would come with it (additional 1342 discussion and rationale for the symbol decision appears in 1343 Section 10.5). 1345 o Other than in very exceptional cases, e.g., where they are needed 1346 to write substantially any word of a given language, punctuation 1347 characters are excluded as well. The fact that a word exists is 1348 not proof that it should be usable in a DNS label and DNS labels 1349 are not expected to be usable for multiple-word phrases (although 1350 they are certainly not prohibited if the conventions and 1351 orthography of a particular language cause that to be possible). 1352 Even for English, very common constructions -- contractions like 1353 "don't" or "it's", names that are written with apostrophes such as 1354 "O'Reilly" or characters for which apostrophes are common 1355 substitutes, and words whose usually-preferred spellings retain 1356 diacritical marks from earlier forms -- cannot be represented in 1357 DNS labels. 1359 o Characters that are unassigned (have no character assignment at 1360 all) in the version of Unicode being used by the registry or 1361 application are not permitted, even on resolution (lookup). There 1362 are at least two reasons for this. 
First, tests involving the context of 1363 characters (e.g., some characters being permitted only adjacent to 1364 ones of specific types but otherwise invisible or very problematic 1365 for other reasons) and integrity tests on complete labels are 1366 needed. Unassigned code points cannot be permitted because one 1367 cannot determine whether particular code points will require 1368 contextual rules (and what those rules should be) before 1369 characters are assigned to them and the properties of those 1370 characters are fully understood. Second, Unicode specifies that an 1371 unassigned code point normalizes and case folds to itself. If the 1372 code point is later assigned to a character, and particularly if 1373 the newly-assigned code point has a combining class that 1374 determines its placement relative to other combining characters, 1375 it could normalize to some other code point or sequence, creating 1376 confusion and/or violating other rules listed here. 1378 o Any character that is mapped to another character by Nameprep2003 1379 or by a current version of NFKC is prohibited as input to IDNA 1380 (for either registration or resolution). Implementers of user 1381 interfaces to applications are free to make those conversions when 1382 they consider them suitable for their operating system 1383 environments, context, or users. 1385 Tables used to identify the characters that are IDNA-valid are 1386 expected to be driven by the principles above (described in more 1387 precise form in [IDNA2008-Tables]). The principles are not just an 1388 interpretation of the tables. 1390 10.1.2. Labels in Registration 1392 Anyone entering a label into a DNS zone must properly validate that 1393 label -- i.e., be sure that the criteria for that label are met -- in 1394 order for applications to work as intended.
This principle is not 1395 new: for example, zone administrators are expected to verify that 1396 names meet "hostname" [RFC0952] or special service location formats 1397 [RFC2782] where necessary for the expected applications. For zones 1398 that will contain IDNs, support for Unicode version-independence 1399 requires restrictions on all strings placed in the zone. In 1400 particular, for such zones: 1402 o Any label that appears to be an A-label, i.e., any label that 1403 starts with "xn--", MUST be IDNA-valid, i.e., it MUST be a 1404 valid A-label, as discussed in Section 3 above. 1406 o The Unicode tables (i.e., tables of code points, character 1407 classes, and properties) and IDNA tables (i.e., tables of 1408 contextual rules such as those described above) MUST be 1409 consistent on the systems that register labels or validate labels to be 1410 registered. Note that this does not require that tables reflect 1411 the latest version of Unicode, only that all tables used on a 1412 given system are consistent with each other. 1414 [[anchor33: Note in draft: the above text was changed significantly 1415 between -00 and -01 to clearly restrict its scope to zones supporting 1416 IDNA and to eliminate comments about labels containing "--" in the 1417 third and fourth positions but with different prefixes. There appears 1418 to be consensus that more extensive rules belong in a "best 1419 practices" document about appropriate DNS labels, but that document 1420 is not in-scope for the IDNABIS WG.]] 1422 Under this model, a registry (or entity communicating with a registry 1423 to accomplish name registrations) will need to update its tables -- 1424 both the Unicode-associated tables and the tables of permitted IDN 1425 characters -- to enable a new script or other set of new characters. 1426 It will not be affected by newer versions of Unicode, or newly- 1427 authorized characters, until and unless it wishes to make those 1428 registrations.
The registration side is also responsible --under the 1429 protocol and to registrants and users-- for much more careful 1430 checking than is expected of applications systems that look names up, 1431 both checking as required by the protocol and checking required by 1432 whatever policies it develops for minimizing risks due to confusable 1433 characters and sequences and preserving language or script integrity. 1435 Systems looking up or resolving DNS labels, especially IDN DNS 1436 labels, MUST be able to assume that applicable registration rules 1437 were followed for names entered into the DNS. 1439 10.1.3. Labels in Resolution (Lookup) 1441 Anyone looking up a label in a DNS zone 1443 o MUST maintain a consistent set of tables, as discussed above. As 1444 with registration, the tables need not reflect the latest version 1445 of Unicode but they MUST be consistent. 1447 o MUST validate the characters in labels to be looked up only to the 1448 extent of determining that the U-label does not contain either 1449 code points prohibited by IDNA (categorized as "DISALLOWED") or 1450 code points that are unassigned in its version of Unicode. 1452 o MUST validate the label itself for conformance with a small number 1453 of whole-label rules, notably verifying that there are no leading 1454 combining marks, that the "bidi" conditions are met if right to 1455 left characters appear, that any required contextual rules are 1456 available and that, if such rules are associated with Joiner 1457 Controls, they are tested. 1459 o MUST NOT validate other contextual rules about characters, 1460 including mixed-script label prohibitions, although such rules MAY 1461 be used to influence presentation decisions in the user interface. 
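As a rough illustration of the lookup-side checks listed above, the following sketch tests a putative U-label for unassigned code points, for prohibited code points, and for a leading combining mark. It is only a sketch under stated assumptions: the function names are hypothetical, stand_in_disallowed is a crude approximation that does not reproduce the DISALLOWED category defined in [IDNA2008-Tables], "unassigned" means unassigned in the Unicode version shipped with the local Python, and the "bidi" and contextual (Joiner Control) rules are omitted entirely.

```python
import unicodedata


def stand_in_disallowed(ch):
    """Hypothetical, rough approximation of DISALLOWED (not the real tables)."""
    if ch == "-":
        return False  # the hyphen is part of the LDH repertoire
    if unicodedata.normalize("NFKC", ch) != ch:
        return True  # compatibility forms that NFKC would map away
    if ch.lower() != ch:
        return True  # upper-case characters
    # Symbols, punctuation, and separators, judged by general category.
    return unicodedata.category(ch)[0] in "SPZ"


def lookup_problems(u_label):
    """Return a list of reasons not to look the label up (empty if none found)."""
    problems = []
    for ch in u_label:
        if unicodedata.category(ch) == "Cn":
            # Unassigned in the local version of the Unicode tables.
            problems.append("unassigned code point U+%04X" % ord(ch))
        elif stand_in_disallowed(ch):
            problems.append("disallowed code point U+%04X" % ord(ch))
    # Whole-label rule: no leading combining mark.
    if u_label and unicodedata.category(u_label[0])[0] == "M":
        problems.append("leading combining mark")
    return problems
```

Reporting unassigned code points separately from other failures lets a caller tell the user that newer tables might help. The sketch deliberately checks nothing beyond these items, such as mixed-script prohibitions, because the rules above direct that such checks not be used to reject lookups.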
1463 By not applying its own interpretation of which labels are valid 1464 as a means of rejecting lookup attempts, the resolver application 1465 becomes less sensitive to version incompatibilities with the 1466 particular zone registry associated with the domain name. 1468 An application or client that looks names up in the DNS will be able 1469 to resolve any name that is validly registered, as long as its 1470 version of the Unicode-associated tables is sufficiently up-to-date 1471 to interpret all of the characters in the label. It SHOULD 1472 distinguish, in its messages to users, between "label contains an 1473 unallocated code point" and other types of lookup failures. A 1474 failure on the basis of an old version of Unicode may lead the user 1475 to a desire to upgrade to a newer version, but will have no other ill 1476 effects (this is consistent with behavior in the transition to the 1477 DNS when some hosts could not yet handle some forms of names or 1478 record types). 1480 10.2. More Flexibility in User Agents 1482 These specifications do not perform mappings between one character or 1483 code point and others for any reason. Instead, they prohibit the 1484 characters that would be mapped to others by normalization, case 1485 folding, or other rules. As examples, while mathematical characters 1486 based on Latin ones are accepted as input to IDNA2003, they are 1487 prohibited in IDNA2008. Similarly, double-width characters and other 1488 variations are prohibited as IDNA input. 1490 Since the rules in [IDNA2008-Tables] provide that only strings that 1491 are stable under NFKC are valid, if it is convenient for an 1492 application to perform NFKC normalization before lookup, that 1493 operation is safe since this will never make the application unable 1494 to look up any valid string. 1496 In many cases these prohibitions should have no effect on what the 1497 user can type at resolution time.
It is perfectly reasonable for 1498 systems that support user interfaces to perform some character 1499 mapping that is appropriate to the local environment. This would 1500 normally be done prior to actual invocation of IDNA. At least 1501 conceptually, the mapping would be part of the Unicode conversions 1502 discussed above and in [IDNA2008-Protocol]. However, those changes 1503 will be local ones only -- local to environments in which users will 1504 clearly understand that the character forms are equivalent. For use 1505 in interchange among systems, it appears to be much more important 1506 that U-labels and A-labels can be mapped back and forth without loss 1507 of information. 1509 One specific, and very important, instance of this strategy arises 1510 with case-folding. In the ASCII-only DNS, names are looked up and 1511 matched in a case-independent way, but no actual case-folding occurs. 1512 Names can be placed in the DNS in either upper or lower case form (or 1513 any mixture of them) and that form is preserved, returned in queries, 1514 and so on. IDNA2003 simulated that behavior by performing case- 1515 mapping at registration time (resulting in only lower-case IDNs in 1516 the DNS) and when names were looked up. 1518 As suggested earlier in this section, it appears to be desirable to 1519 do as little character mapping as possible consistent with having 1520 Unicode work correctly (e.g., NFC mapping to resolve different 1521 codings for the same character is still necessary although the 1522 specifications require that it be performed prior to invoking the 1523 protocol) and to make the mapping between A-labels and U-labels 1524 idempotent. Case-mapping is not an exception to this principle. If 1525 only lower case characters can be registered in the DNS (i.e., be 1526 present in a U-label), then IDNA2008 should prohibit upper-case 1527 characters as input. Some other considerations reinforce this 1528 conclusion. 
For example, an essential element of the ASCII case- 1529 mapping functions is that uppercase(character) must be equal to 1530 uppercase(lowercase(character)). That requirement may not be 1531 satisfied with IDNs. The relationship between upper case and lower 1532 case may even be language-dependent, with different languages (or 1533 even the same language in different areas) expecting different 1534 mappings. Of course, the expectations of users who are accustomed to 1535 a case-insensitive DNS environment will probably be well-served if 1536 user agents perform case mapping prior to IDNA processing, but the 1537 IDNA procedures themselves should neither require such mapping nor 1538 expect them when they are not natural to the localized environment. 1540 10.3. The Question of Prefix Changes 1542 The conditions that would require a change in the IDNA "prefix" 1543 ("xn--" for the version of IDNA specified in [RFC3490]) have been a 1544 great concern to the community. A prefix change would clearly be 1545 necessary if the algorithms were modified in a manner that would 1546 create serious ambiguities during subsequent transition in 1547 registrations. This section summarizes our conclusions about the 1548 conditions under which changes in prefix would be necessary and the 1549 implications of such a change. 1551 10.3.1. Conditions Requiring a Prefix Change 1553 An IDN prefix change is needed if a given string would resolve or 1554 otherwise be interpreted differently depending on the version of the 1555 protocol or tables being used. Consequently, work to update IDNs 1556 would require a prefix change if, and only if, one of the following 1557 four conditions were met: 1559 1. The conversion of an A-label to Unicode (i.e., a U-label) yields 1560 one string under IDNA2003 (RFC3490) and a different string under 1561 IDNA2008. 1563 2. 
An input string that is valid under IDNA2003 and also valid under 1564 IDNA2008 yields two different A-labels with the different 1565 versions of IDNA. This condition is believed to be essentially 1566 equivalent to the one above. 1568 Note, however, that if the input string is valid under one 1569 version and not valid under the other, this condition does not 1570 apply. See the first item in Section 10.3.2, below. 1572 3. A fundamental change is made to the semantics of the string that 1573 is inserted in the DNS, e.g., if a decision were made to try to 1574 include language or specific script information in that string, 1575 rather than having it be just a string of characters. 1577 4. A sufficiently large number of characters is added to Unicode so 1578 that the Punycode mechanism for block offsets no longer has 1579 enough capacity to reference the higher-numbered planes and 1580 blocks. This condition is unlikely even in the long term and 1581 certain not to arise in the next few years. 1583 10.3.2. Conditions Not Requiring a Prefix Change 1585 In particular, as a result of the principles described above, none of 1586 the following changes require a new prefix: 1588 1. Prohibition of some characters as input to IDNA. This may make 1589 names that are now registered inaccessible, but does not require 1590 a prefix change. 1592 2. Adjustments in Stringprep tables or IDNA actions, including 1593 normalization definitions, that affect characters that were 1594 already invalid under IDNA2003. 1596 3. Changes in the style of definitions of Stringprep or Nameprep 1597 that do not alter the actions performed by them. 1599 Of course, because these specifications do not involve changes to 1600 Stringprep or Nameprep, the third condition above and part of the 1601 second are moot. 1603 10.3.3. Implications of Prefix Changes 1605 While it might be possible to make a prefix change, the costs of such 1606 a change are considerable. 
Even if they wanted to do so, all 1607 registries could not convert all IDNA2003 ("xn--") registrations to a 1608 new form at the same time and synchronize that change with 1609 applications supporting lookup. Unless all existing registrations 1610 were simply to be declared invalid, and perhaps even then, systems 1611 that needed to support both labels with old prefixes and labels with 1612 new ones would first process a putative label under the IDNA2008 1613 rules and try to look it up and then, if it were not found, would 1614 process the label under IDNA2003 rules and look it up again. That 1615 process could significantly slow down all processing that involved 1616 IDNs in the DNS especially since, in principle, a fully-qualified 1617 name could contain a mixture of labels that were registered with the 1618 old and new prefixes, a situation that would make the use of DNS 1619 caching very difficult. In addition, looking up the same input 1620 string as two separate A-labels would create some potential for 1621 confusion and attacks, since they could, in principle, resolve to 1622 different targets. 1624 Consequently, a prefix change is to be avoided if at all possible, 1625 even if it means accepting some IDNA2003 decisions about character 1626 distinctions as irreversible. 1628 10.4. Stringprep Changes and Compatibility 1630 Concerns have been expressed about problems for non-DNS uses of 1631 Stringprep being caused by changes to the specification intended to 1632 improve the handling of IDNs, most notably as this might affect 1633 identification and authentication protocols. Section 10.3, above, 1634 essentially also applies in this context. 
The proposed new inclusion 1635 tables [IDNA2008-Tables], the reduction in the number of characters 1636 permitted as input for registration or resolution (Section 6), and 1637 even the proposed changes in handling of right to left strings 1638 [IDNA2008-Bidi] either give interpretations to strings prohibited 1639 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings 1640 that are valid under both IDNA2003 and IDNA2008, and the 1641 corresponding versions of Stringprep, are not changed in 1642 interpretation. This protocol does not use either Nameprep or 1643 Stringprep as specified in IDNA2003. 1645 It is particularly important to keep IDNA processing separate from 1646 processing for various security protocols because some of the 1647 constraints that are necessary for smooth and comprehensible use of 1648 IDNs may be unwanted or undesirable in other contexts. For example, 1649 the criteria for good passwords or passphrases are very different 1650 from those for desirable IDNs. Similarly, internationalized SCSI 1651 identifiers and other protocol components are likely to have 1652 different requirements than IDNs. 1654 Perhaps even more important in practice, since most other known uses 1655 of Stringprep encode or process characters that are already in 1656 normalized form and expect the use of only those characters that can 1657 be used in writing words of languages, the changes proposed here and 1658 in [IDNA2008-Tables] are unlikely to have any effect at all, 1659 especially not on registries and registrations that follow rules 1660 already in existence when this work started. 1662 10.5. The Symbol Question 1664 One of the major differences between this specification and the 1665 original version of IDNA is that the original version permitted non- 1666 letter symbols of various sorts, including punctuation and line- 1667 drawing symbols, in the protocol. They were always discouraged in 1668 practice. 
In particular, both the "IESG Statement" about IDNA and 1669 all versions of the ICANN Guidelines specify that only language 1670 characters be used in labels. This specification disallows symbols 1671 entirely. There are several reasons for this, which include: 1673 o As discussed elsewhere, the original IDNA specification assumed 1674 that as many Unicode characters as possible should be permitted, 1675 directly or via mapping to other characters, in IDNs. This 1676 specification operates on an inclusion model, extrapolating from 1677 the LDH rules --which have served the Internet very well-- to a 1678 Unicode base rather than an ASCII base. 1680 o Unicode names for letters are, in most cases, fairly 1681 intuitive, unambiguous, and recognizable to users of the relevant 1682 script. Symbol names are more problematic because there may be no 1683 general agreement on whether a particular glyph matches a symbol; 1684 there are no uniform conventions for naming; variations such as 1685 outline, solid, and shaded forms may or may not exist; and so on. 1686 As just one example, consider a "heart" symbol as it might appear 1687 in a logo that might be read as "I love...". While the user might 1688 read such a logo as "I love..." or "I heart...", considerable 1689 knowledge of the coding distinctions made in Unicode is needed to 1690 know that there is more than one "heart" character (e.g., U+2665, 1691 U+2661, and U+2765) and how to describe each one. These issues are of 1692 particular importance if strings are expected to be understood or 1693 transcribed by the listener after being read out loud. 1694 [[anchor35: The above paragraph remains controversial as to 1695 whether it is valid. The WG will need to make a decision if this 1696 section is not dropped entirely.]] 1698 o As a simplified example of this, assume one wanted to use a 1699 "heart" or "star" symbol in a label.
This is problematic because 1700 those names are ambiguous in the Unicode system of naming (the 1701 actual Unicode names require far more qualification). A user or 1702 would-be registrant has no way to know --absent careful study of 1703 the code tables-- whether a given name is ambiguous (e.g., where there are 1704 multiple "heart" characters) or not. Conversely, the user seeing 1705 the hypothetical label doesn't know whether to read it --or try to 1706 transmit it to a colleague by voice-- as "heart", as "love", as 1707 "black heart", or as any of the other examples below. 1709 o The actual situation is even worse than this. There is no 1710 possible way for a normal, casual user to tell the difference 1711 between the hearts of U+2665 and U+2765 or between the stars of U+2606 1712 and U+2729 without somehow knowing to look for a 1713 distinction. We have a white heart (U+2661) and a few black hearts, 1714 so describing a label containing a heart symbol is hopelessly 1715 ambiguous. In cities where "Square" is a popular part of a 1716 location name, one might well want to use a square symbol in a 1717 label as well and there are far more squares of various flavors in 1718 Unicode than there are hearts or stars. 1720 o The consequence of these ambiguities of description and 1721 dependencies on distinctions that were, or were not, made in 1722 Unicode codings, is that symbols are a very poor basis for 1723 reliable communication. Of course, these difficulties with 1724 symbols do not arise with actual pictographic languages and 1725 scripts which would be treated like any other language characters; 1726 the two should not be confused. 1728 [[anchor36: Note in Draft: Should the above section be significantly 1729 trimmed or eliminated?]] 1731 10.6.
Migration Between Unicode Versions: Unassigned Code Points

   In IDNA2003, labels containing unassigned code points are resolved on
   the theory that, if they appear in labels and can be resolved, the
   relevant standards must have changed and the registry has properly
   allocated only assigned values.

   In this specification, strings containing unassigned code points MUST
   NOT be either looked up or registered.  There are several reasons for
   this, with the most important ones being:

   o  It cannot be known with sufficient reliability in advance that a
      code point that was not previously assigned will not be assigned
      to a compatibility character.  In IDNA2003, since there is no
      direct dependency on NFKC (Stringprep's tables are based on NFKC,
      but IDNA2003 depends only on Stringprep), allocation of a
      compatibility character might produce some odd situations, but it
      would not be a problem.  In IDNA2008, where compatibility
      characters are generally assigned to DISALLOWED, permitting
      strings containing unassigned characters to be looked up would
      permit violating the principle that characters in DISALLOWED are
      not looked up.

   o  More generally, the status of an unassigned character with regard
      to the DISALLOWED and PROTOCOL-VALID categories, and whether
      contextual rules are required with the latter, cannot be evaluated
      until a character is actually assigned and known.

   It is possible to argue that the issues above are not important and
   that, as a consequence, it is better to retain the principle of
   looking up labels even if they contain unassigned characters because
   all of the important scripts and characters have been coded as of
   Unicode 5.1 and hence unassigned code points will be assigned only to
   obscure characters or archaic scripts.  Unfortunately, that does not
   appear to be a safe assumption for at least two reasons.  First, much
First, much 1766 the same claim of completeness has been made for earlier versions of 1767 Unicode. The reality is that a script that is obscure to much of the 1768 world may still be very important to those who use it. Cultural and 1769 linguistic preservation principles make it inappropriate to declare 1770 the script of no importance in IDNs. Second, we already have 1771 counterexamples in, e.g., the relationships associated with new Han 1772 characters being added (whether in the BMP or in Unicode Plane 2). 1774 10.7. Other Compatibility Issues 1776 The existing (2003) IDNA model includes several odd artifacts of the 1777 context in which it was developed. Many, if not all, of these are 1778 potential avenues for exploits, especially if the registration 1779 process permits "source" names (names that have not been processed 1780 through IDNA and nameprep) to be registered. As one example, since 1781 the character Eszett, used in German, is mapped by IDNA2003 into the 1782 sequence "ss" rather than being retained as itself or prohibited, a 1783 string containing that character but that is otherwise in ASCII is 1784 not really an IDN (in the U-label sense defined above) at all. After 1785 Nameprep maps the Eszett out, the result is an ASCII string and so 1786 does not get an xn-- prefix, but the string that can be displayed to 1787 a user appears to be an IDN. The proposed IDNA2008 eliminates this 1788 artifact. A character is either permitted as itself or it is 1789 prohibited; special cases that make sense only in a particular 1790 linguistic or cultural context can be dealt with as localization 1791 matters where appropriate. 1793 11. Acknowledgments 1795 The editor and contributors would like to express their thanks to 1796 those who contributed significant early (pre-WG) review comments, 1797 sometimes accompanied by text, especially Mark Davis, Paul Hoffman, 1798 Simon Josefsson, and Sam Weiler. 
   In addition, some specific ideas were incorporated from suggestions,
   text, or comments about sections that were unclear supplied by Frank
   Ellerman, Michael Everson, Asmus Freytag, Erik van der Poel, Michel
   Suignard, and Ken Whistler, although, as usual, they bear little or
   no responsibility for the conclusions the editor and contributors
   reached after receiving their suggestions.  Thanks are also due to
   Vint Cerf, Debbie Garside, and Jefsey Morphin for conversations that
   led to considerable improvements in the content of this document.

   A meeting was held on 30 January 2008 to attempt to reconcile
   differences in perspective and terminology about this set of
   specifications between the design team and members of the Unicode
   Technical Consortium.  The discussions at and subsequent to that
   meeting were very helpful in focusing the issues and in refining the
   specifications.  The active participants at that meeting were (in
   alphabetic order as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
   Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
   Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
   Michel Suignard, and Ken Whistler.  We express our thanks to Google
   for support of that meeting and to the participants for their
   contributions.

   Special thanks are due to Paul Hoffman for permission to extract
   material from his Internet-Draft to form the basis for Section 2.

   Useful comments and text on the WG versions of the draft were
   received from many participants in the IETF "IDNABIS" WG and a number
   of document changes resulted from mailing list discussions made by
   that group.

12.
Contributors

   While the listed editor held the pen, the core of this document and
   the initial WG version represent the joint work and conclusions of
   an ad hoc design team consisting of the editor and, in alphabetic
   order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp.
   In addition, there were many specific contributions and helpful
   comments from those listed in the Acknowledgments section and others
   who have contributed to the development and use of the IDNA
   protocols.

13.  IANA Considerations

   This section gives an overview of registries required for IDNA.  The
   actual definition of the first one appears in [IDNA2008-Tables].

13.1.  IDNA Character Registry

   The distinction among the three major categories "UNASSIGNED",
   "DISALLOWED", and "PROTOCOL-VALID" is made by special categories and
   rules that are integral elements of [IDNA2008-Tables].  Convenience
   in programming and validation requires a registry of characters and
   scripts and their categories, updated for each new version of Unicode
   and the characters it contains.  The details of this registry are
   specified in [IDNA2008-Tables].

13.2.  IDNA Context Registry

   For characters that are defined in the IDNA Character Registry list
   as PROTOCOL-VALID but requiring a contextual rule (i.e., the types of
   rule described in Section 6.1.1.1), IANA will create and maintain a
   list of approved contextual rules.  Additions or changes to these
   rules require IETF Review, as described in [RFC5226].
   [[anchor41: Note in Draft: This section was changed between -00 and
   -01 based on list discussion.  Consensus needs to be verified for
   that decision.]]

   A table from which that registry can be initialized, and some further
   discussion, appear in [RulesInit].
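   As an illustration only, the kind of lookup that the registries of
   Sections 13.1 and 13.2 are meant to make convenient might be sketched
   in Python as follows.  The `Entry` type, the `lookup` function, and
   the handful of table rows are hypothetical and hand-picked from
   statements made in this document (LDH characters are valid, hyphen
   has a contextual rule, symbols are disallowed); the authoritative
   values are those defined in [IDNA2008-Tables], not this sketch.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    first: int           # first code point of the range (inclusive)
    last: int            # last code point of the range (inclusive)
    category: str        # "PROTOCOL-VALID", "DISALLOWED", or "UNASSIGNED"
    needs_context: bool  # True if a contextual rule must also be satisfied


# Hypothetical excerpt, sorted by first code point.  A real registry
# would be generated for a specific Unicode version.
REGISTRY = [
    Entry(0x002D, 0x002D, "PROTOCOL-VALID", True),   # hyphen
    Entry(0x0030, 0x0039, "PROTOCOL-VALID", False),  # digits 0-9
    Entry(0x0061, 0x007A, "PROTOCOL-VALID", False),  # letters a-z
    Entry(0x0378, 0x0379, "UNASSIGNED", False),      # assumed unassigned
    Entry(0x2661, 0x2661, "DISALLOWED", False),      # a "heart" symbol
    Entry(0x2665, 0x2665, "DISALLOWED", False),      # another "heart" symbol
]
_STARTS = [e.first for e in REGISTRY]


def lookup(cp: int) -> Optional[Entry]:
    """Return the registry entry covering code point cp, if any."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and REGISTRY[i].first <= cp <= REGISTRY[i].last:
        return REGISTRY[i]
    return None  # not covered by this illustrative excerpt
```

   Keeping the rows as sorted ranges rather than one row per code point
   is simply a design choice for compactness; nothing in this document
   requires it.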
   [[anchor42: This subsection should probably be moved to Tables along
   with the Contextual rules themselves (from Protocol) when the move is
   made.]]

13.3.  IANA Repository of IDN Practices of TLDs

   This registry, historically described as the "IANA Language Character
   Set Registry" or "IANA Script Registry" (both somewhat misleading
   terms) is maintained by IANA at the request of ICANN.  It is used to
   provide a central documentation repository of the IDN policies used
   by top level domain (TLD) registries who volunteer to contribute to
   it and is used in conjunction with ICANN Guidelines for IDN use.

   It is not an IETF-managed registry and, while the protocol changes
   specified here may call for some revisions to the tables, these
   specifications have no direct effect on that registry and no IANA
   action is required as a result.

14.  Security Considerations

   Security on the Internet partly relies on the DNS.  Thus, any change
   to the characteristics of the DNS can change the security of much of
   the Internet.

   Domain names are used by users to identify and connect to Internet
   servers.  The security of the Internet is compromised if a user
   entering a single internationalized name is connected to different
   servers based on different interpretations of the internationalized
   domain name.

   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the problem of transcoding between the
   local character set and Unicode up to the application or local
   system.  If different applications (or different versions of one
   application) implement different transcoding rules, they could
   interpret the same name differently and contact different servers.
   This problem is not solved by security protocols like TLS that do not
   take local character sets into account.
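   The transcoding hazard described above can be illustrated with a
   short Python sketch.  It assumes, purely for illustration, that a
   name arrives as the bytes `caf\xe9` and that three applications guess
   three different local character sets.  The `to_a_label` helper is
   hypothetical: it uses Python's built-in `punycode` codec plus the
   "xn--" prefix and performs none of the validity checks a real IDNA
   implementation would apply.

```python
RAW = b"caf\xe9"  # identical bytes, as entered in some local character set


def to_a_label(u_label: str) -> str:
    """Hypothetical helper: Punycode-encode a label and add the prefix.

    No IDNA validity checks are performed here.
    """
    return "xn--" + u_label.encode("punycode").decode("ascii")


def interpretations(raw: bytes,
                    charsets=("latin-1", "iso8859_7", "cp1251")):
    """Decode the same bytes under each charset and derive a label."""
    result = {}
    for cs in charsets:
        u_label = raw.decode(cs)
        result[cs] = (u_label, to_a_label(u_label))
    return result


if __name__ == "__main__":
    for cs, (u_label, a_label) in interpretations(RAW).items():
        # ascii() keeps the output printable on any terminal
        print(f"{cs:10} {ascii(u_label):16} {a_label}")
```

   Each assumed character set yields a different Unicode string and
   therefore a different encoded label, so the "same" name as typed
   would be looked up as three unrelated names.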
   To help prevent confusion between characters that are visually
   similar, it is suggested that implementations provide visual
   indications where a domain name contains multiple scripts.  Such
   mechanisms can also be used to show when a name contains a mixture of
   simplified and traditional Chinese characters, or to distinguish zero
   and one from O and l.  DNS zone administrators may impose
   restrictions (subject to the limitations identified elsewhere in this
   document) that try to minimize characters that have similar
   appearance or similar interpretations.  It is worth noting that there
   are no comprehensive technical solutions to the problems of
   confusable characters.  One can reduce the extent of the problems in
   various ways, but probably never eliminate it.  Some specific
   suggestions about identification and handling of confusable
   characters appear in a Unicode Consortium publication
   [Unicode-UTR36].

   The registration and resolution models described above and in
   [IDNA2008-Protocol] change the mechanisms available for applications
   and resolvers to determine the validity of labels they encounter.  In
   some respects, the ability to test is strengthened.  For example,
   putative labels that contain unassigned code points will now be
   rejected, while IDNA2003 permitted them (something that is now
   recognized as a considerable source of risk).  On the other hand, the
   protocol specification no longer assumes that the application that
   looks up a name will be able to determine, and apply, information
   about the protocol version used in registration.  In theory, that may
   increase risk since the application will be able to do less
   pre-lookup validation.  In practice, the protection afforded by that
   test has been largely illusory for reasons explained in RFC 4690 and
   above.
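   The rejection of putative labels containing unassigned code points,
   mentioned above, might be sketched as follows.  This is a minimal
   Python illustration and not the procedure defined in
   [IDNA2008-Protocol]: it treats Unicode general category "Cn" as
   "unassigned" and depends on whichever Unicode version the running
   Python's `unicodedata` module ships (see
   `unicodedata.unidata_version`), which is an assumption and will
   generally differ from Unicode 5.1.  Both function names are
   hypothetical.

```python
import unicodedata
from typing import List


def unassigned_code_points(label: str) -> List[str]:
    """List code points in label that the local Unicode data leaves unassigned.

    General category "Cn" covers reserved code points and noncharacters.
    """
    return [f"U+{ord(ch):04X}"
            for ch in label
            if unicodedata.category(ch) == "Cn"]


def may_look_up(label: str) -> bool:
    """True if the label contains no unassigned code points.

    This checks only the unassigned-code-point condition; it says
    nothing about DISALLOWED characters or contextual rules.
    """
    return not unassigned_code_points(label)
```

   Because the answer depends on the Unicode version known to the
   application, two applications with different tables can disagree
   about the same label, which is the migration concern discussed in
   Section 10.6.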
   Any change to Stringprep or, more broadly, the IETF's model of the
   use of internationalized character strings in different protocols,
   creates some risk of inadvertent changes to those protocols,
   invalidating deployed applications or databases, and so on.  Our
   current hypothesis is that the same considerations that would require
   changing the IDN prefix (see Section 10.3.2) are the ones that would,
   e.g., invalidate certificates or hashes that depend on Stringprep,
   but those cases require careful consideration and evaluation.  More
   important, it is not necessary to change Stringprep2003 at all in
   order to make the IDNA changes contemplated here.  It is far
   preferable to create a separate document, or separate profile
   components, for IDN work, leaving the question of upgrading to other
   protocols to experts on them and eliminating any possible
   synchronization dependency between IDNA changes and possible upgrades
   to security protocols or conventions.

   No mechanism involving names or identifiers alone can protect against
   a wide variety of security threats and attacks that are largely
   independent of them, including spoofed pages, DNS query trapping and
   diversion, and so on.

15.  Change Log

   [[anchor45: RFC Editor: Please remove this section.]]

   For version 00 of draft-ietf-idnabis-rationale, this list contains a
   complete trace going back through the earlier, design team, drafts.
   Material earlier than that described in Section 15.9 will be removed
   in WG draft -02.

15.1.  Version -01 of draft-klensin-idnabis-issues

   Version -01 of this document is a considerable rewrite from -00.
   Many sections have been clarified or extended and several new
   sections have been added to reflect discussions in a number of
   contexts since -00 was issued.

15.2.
Version -02 of draft-klensin-idnabis-issues

   o  Corrected several editorial errors including an
      accidentally-introduced misstatement about NFKC.

   o  Extensively revised the document to synchronize its terminology
      with version 03 of [IDNA2008-Tables] and to provide a better
      conceptual framework for its categories and how they are used.
      Added new material to clarify terminology and relationships with
      other efforts.  More subtle changes in this version lay the
      groundwork for separating the document into a conceptual overview
      and a protocol specification for version 03.

15.3.  Version -03 of draft-klensin-idnabis-issues

   o  Removed protocol materials to a separate document and incorporated
      rationale and explanation materials from the original
      specification in RFC 3490 into this document.  Cleaned up earlier
      text to reflect a more mature specification and restructured
      several sections and added additional rationale material.

   o  Strengthened and clarified the A-label / U-label / LDH-label
      definition.

   o  Retitled the document to reflect its evolving role.

15.4.  Version -04 of draft-klensin-idnabis-issues

   o  Moved more text from "protocol" and further reorganized material.

   o  Provided new material on "Contextual Rule Required".

   o  Improved consistency of terminology, both internally and with the
      "tables" document.

   o  Improved the IANA Considerations section and discussed the
      existing IDNA-related registry.

   o  More small changes to increase consistency.

15.5.  Version -05 of draft-klensin-idnabis-issues

   Changed "YES" category back to "ALWAYS" to re-synch with the tables
   document and provide clearer terminology.

15.6.  Version -06 of draft-klensin-idnabis-issues

   o  Clarified the prohibitions on strings that look like A-labels but
      are not and on unassigned code points.

   o  Clarified length restrictions on IDN labels.

   o  Revised the terminology definitions to remove the impression of
      circularity and removed invocations of ToASCII and ToUnicode,
      which do not exist in IDNA2008.

   o  Added a new section on front-end processing.

   o  Added a new section to discuss case-mapping.

   o  Extended the discussion of prefix changes to identify the
      implications of making one.

   o  Several more editorial improvements, corrected references, and
      similar adjustments.

15.7.  Version -07 of draft-klensin-idnabis-issues

   o  Added material that specifically defines the format of contextual
      rules.

   o  Added and altered text after discussions at the 30 January meeting
      (see Section 11) and the follow-up to those discussions.  Among
      the key decisions at that meeting were to eliminate the
      distinction among the valid categories (formerly "ALWAYS", "MAYBE
      YES", and "MAYBE NO"), to adjust the terminology accordingly, and
      to change "CONTEXTUAL RULE REQUIRED" from a separate category in
      this document and the protocol one to a modifier of what is now
      called "PROTOCOL-VALID".  The consequent changes resulted in
      removal of several sections of explanation from this document.

   o  Resynchronized terminology with "protocol" and "tables" documents.

   o  More editorial and typographic corrections.

15.8.  Version -00 of draft-ietf-idnabis-rationale

   o  Rewrote the abstract and introduction, and retuned the title, to
      be more consistent with WG work and activities.  Changed the file
      name to reflect WG naming.

   o  Removed most of the material that explained, or compared this
      approach to, IDNA2003.  Some of this material may appear in the
      non-WG "IDNA-alternatives" draft if it is ever completed.

   o  Changed IDNA200X in terminology and references to IDNA2008.

   o  Added a contextual rule for hyphen to the appendix, adjusted the
      rule syntax slightly, and supplied draft regular expression rules.

   o  Responded to comments produced during the WG charter discussions
      and from several individuals.  In general, comments requesting a
      reorganization of the collection of documents have not been
      responded to pending a WG decision on that topic.

   o  Moved the contextual rule appendix out of here and into
      "Protocol".  It may not belong there either, but definitely does
      not belong here, and was holding up getting this document out.

   o  Many small editorial improvements, including reorganization of
      some material.

   Editorial note: While several sections have been removed from this
   version, the WG should discuss whether further cuts are desirable,
   e.g., whether Section 7.3, Section 7.4, or Section 10.3 provide
   enough value to be worth retaining?  Can Section 10.4 be trimmed
   without loss of useful information and, if so, how?  Section 10.7
   appears critical of IDNA2003 in undesirable ways: should it be
   dropped or do people have suggestions about how to improve it?
   Strong opinions have been expressed that Section 10.5 should be
   trimmed significantly or removed entirely.  The WG will need to
   discuss that too.  Are there other materials that should be trimmed
   out?

15.9.  Version -01 of draft-ietf-idnabis-rationale

   o  Clarified the U-label definition to note that U-labels must
      contain at least one non-ASCII character.  Also clarified the
      relationship among label types.

   o  Rewrote the discussion of Labels in Registration (Section 10.1.2)
      and related text in Section 1.5.4.1.1 to narrow its focus and
      remove more general restrictions.  Added a temporary note in line
      to explain the situation.

   o  Changed the "IDNA uses Unicode" statement to focus on
      compatibility with IDNA2003 and avoid more general or
      controversial assertions.

   o  Added a discussion of examples to Section 10.1.

   o  Made a number of other small editorial changes and corrections
      suggested by Mark Davis.

   o  Added several more discussion anchors and notes and expanded or
      updated some existing ones.

16.  References

16.1.  Normative References

   [ASCII]    American National Standards Institute (formerly United
              States of America Standards Institute), "USA Code for
              Information Interchange", ANSI X3.4-1968, 1968.

              ANSI X3.4-1968 has been replaced by newer versions with
              slight modifications, but the 1968 version remains
              definitive for the Internet.

   [IDNA2008-Bidi]
              Alvestrand, H. and C. Karp, "An updated IDNA criterion for
              right to left scripts", July 2008.

   [IDNA2008-Protocol]
              Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", July 2008.

   [IDNA2008-Tables]
              Faltstrom, P., "The Unicode Code Points and IDNA",
              May 2008.

              A version of this document is available in HTML format at
              http://stupid.domain.name/idnabis/draft-ietf-idnabis-tables-01.html

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
              Internationalized Strings ("stringprep")", RFC 3454,
              December 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC5226]  Narten, T. and H.
              Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
              May 2008.

   [RulesInit]
              Klensin, J., "Internationalizing Domain Names in
              Applications (IDNA): Protocol, Appendix A Contextual Rules
              Table", July 2008.

   [Unicode51]
              The Unicode Consortium, "The Unicode Standard, Version
              5.1.0", 2008.

              defined by: The Unicode Standard, Version 5.0, Boston, MA,
              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
              Unicode 5.1.0
              (http://www.unicode.org/versions/Unicode5.1.0/).

16.2.  Informative References

   [BIG5]     Institute for Information Industry of Taiwan, "Computer
              Chinese Glyph and Character Code Mapping Table, Technical
              Report C-26", 1984.

              There are several forms and variations and a
              closely-related standard, CNS 11643.  See the discussion
              in Chapter 3 of Lunde, K., CJKV Information Processing,
              O'Reilly & Associates, 1999

   [GB18030]  "Chinese National Standard GB 18030-2000: Information
              Technology -- Chinese ideograms coded character set for
              information interchange -- Extension for the basic set.",
              2000.

   [RFC0810]  Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
              Internet host table specification", RFC 810, March 1982.

   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC2782]  Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
              specifying the location of services (DNS SRV)", RFC 2782,
              February 2000.

   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y.
              Ko, "Joint
              Engineering Team (JET) Guidelines for Internationalized
              Domain Names (IDN) Registration and Administration for
              Chinese, Japanese, and Korean", RFC 3743, April 2004.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

   [RFC4290]  Klensin, J., "Suggested Practices for Registration of
              Internationalized Domain Names (IDN)", RFC 4290,
              December 2005.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC4713]  Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
              "Registration and Administration Recommendations for
              Chinese Domain Names", RFC 4713, October 2006.

   [Unicode-UTR36]
              The Unicode Consortium, "Unicode Technical Report #36:
              Unicode Security Considerations", August 2006.

Author's Address

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   USA

   Phone: +1 617 245 1457
   Email: john+ietf@jck.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.