idnits 2.17.1

draft-iab-idn-nextsteps-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 19.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1834.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1811.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1818.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1824.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of
     the newer disclaimer which includes the IETF Trust according to RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.
  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 12, 2006) is 6528 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Unicode-PR29' is defined on line 1740, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32'

  == Outdated reference: A later version (-08) exists of
     draft-iab-dns-choices-02

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)

  -- Obsolete informational reference (is this intentional?): RFC 3536
     (Obsoleted by RFC 6365)

     Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2 Network Working Group                                       J. Klensin
3 Internet-Draft
4 Expires: December 14, 2006                                P. Faltstrom
5                                                          Cisco Systems
6                                                                C. Karp
7                                      Swedish Museum of Natural History
8                                                                    IAB
9                                                          June 12, 2006

11   Review and Recommendations for Internationalized Domain Names (IDN)
12                     draft-iab-idn-nextsteps-06.txt

14 Status of this Memo

16    By submitting this Internet-Draft, each author represents that any
17    applicable patent or other IPR claims of which he or she is aware
18    have been or will be disclosed, and any of which he or she becomes
19    aware will be disclosed, in accordance with Section 6 of BCP 79.

21    Internet-Drafts are working documents of the Internet Engineering
22    Task Force (IETF), its areas, and its working groups.  Note that
23    other groups may also distribute working documents as Internet-
24    Drafts.

26    Internet-Drafts are draft documents valid for a maximum of six months
27    and may be updated, replaced, or obsoleted by other documents at any
28    time.  It is inappropriate to use Internet-Drafts as reference
29    material or to cite them other than as "work in progress."

31    The list of current Internet-Drafts can be accessed at
32    http://www.ietf.org/ietf/1id-abstracts.txt.

34    The list of Internet-Draft Shadow Directories can be accessed at
35    http://www.ietf.org/shadow.html.

37    This Internet-Draft will expire on December 14, 2006.

39 Copyright Notice

41    Copyright (C) The Internet Society (2006).

43 Abstract

45    This note describes issues raised by the deployment and use of
46    Internationalized Domain Names.  It describes problems that arise
47    both at the time of registration and when those names are used in
48    the DNS.  It recommends that the IETF update the IDN-related RFCs,
49    proposes a framework to be followed in doing so, and summarizes and
50    identifies some work that is required outside the IETF.  In
51    particular, it proposes that some changes be investigated for the
52    IDNA standard and its supporting tables, based on experience gained
53    since those standards were completed.

55 Table of Contents

57    1.  Introduction
58      1.1.  The Role of IDNs and this document
59      1.2.  Status of this Document and its Recommendations
60      1.3.  The IDNA Standard
61      1.4.  Unicode Documents
62      1.5.  Definitions
63        1.5.1.  Language
64        1.5.2.  Script
65        1.5.3.  Multilingual
66        1.5.4.  Localization
67        1.5.5.  Internationalization
68      1.6.  Statements and Guidelines
69        1.6.1.  IESG Statement
70        1.6.2.  ICANN statements
71    2.  General Problems and Issues
72      2.1.  User conceptions, local character sets, and input
73            issues
74      2.2.  Examples of Issues
75        2.2.1.  Language specific character matching
76        2.2.2.  Multiple scripts
77        2.2.3.  Normalization and Character Mappings
78        2.2.4.  URLs in Printed Form
79        2.2.5.  Bidirectional text
80        2.2.6.  Confusable Character Issues
81        2.2.7.  The IESG Statement and IDNA issues
82    3.  Migrating to New Versions of Unicode
83      3.1.  Versions of Unicode
84      3.2.  Version changes and normalization issues
85        3.2.1.  Unnormalized Combining Sequences
86        3.2.2.  Combining Characters and Character Components
87        3.2.3.  When does normalization occur?
88    4.  Framework for next steps in IDN development
89      4.1.  Issues within the scope of the IETF
90        4.1.1.  Review of IDNA
91        4.1.2.  Non-DNS and Above-DNS Internationalization
92                Approaches
93        4.1.3.  Security issues, certificates, etc.
94        4.1.4.  Protocol Changes and Policy Implications
95        4.1.5.  Non US-ASCII in local part of email addresses
96        4.1.6.  Use of the Unicode Character Set in the IETF
97      4.2.  Issues that fall within the purview of ICANN
98        4.2.1.  Dispute resolution
99        4.2.2.  Policy at registries
100       4.2.3.  IDN TLDs
101   5.  Specific Recommendations for Next Steps
102     5.1.  Reduction of permitted character list
103       5.1.1.  Elimination of all non-language characters
104       5.1.2.  Elimination of word-separation punctuation
105     5.2.  Updating to new versions of Unicode
106     5.3.  Role and Uses of the DNS
107     5.4.  Databases of Registered Names
108   6.  Security Considerations
109   7.  Acknowledgments
110   8.  Change History
111     8.1.  Changes for version -01
112     8.2.  Changes for version -02
113     8.3.  Changes for Version -03
114     8.4.  Changes for version -04
115     8.5.  Changes for version -05
116     8.6.  Changes for version -06
117   9.  References
118     9.1.  Normative References
119     9.2.  Informative References
120   Authors' Addresses
121   Intellectual Property and Copyright Statements

123 1.  Introduction

125 1.1.  The Role of IDNs and this document

127    While IDNs have been advocated as the solution for a wide range of
128    problems, this document is written from the perspective that they are
129    no more and no less than DNS names, reflecting the same requirements
130    for use, stability, and accuracy as traditional "hostnames", but
131    using a much larger collection of permitted characters.  In
132    particular, while IDNs represent a step toward an Internet that is
133    equally accessible from all languages and scripts, they address, at
134    best, only a small part of that very broad objective.  There has
135    been controversy since IDNs were first suggested about how important
136    they will actually turn out to be; that controversy will probably
137    continue.  Accessibility from all languages is an important
138    objective, hence it is important that our standards and definitions
139    for IDNs be smoothly adaptable to additional scripts as they are
140    added to the Unicode character set.

142    The utility of IDNs must be evaluated in terms of their application
143    by users and in protocols: the ability to simply put a name into the
144    DNS and retrieve it is not, in and of itself, important.  From this
145    point of view, IDNs will be useful and effective if they provide
146    stable and predictable references -- references that are no less
147    stable and predictable, and no less secure, than their ASCII
148    counterparts.

150    This combination of objectives and criteria has proven very difficult
151    to satisfy.  Experience in developing the IDNA standard and during
152    the initial years of its implementation and deployment suggests that
153    it may be impossible to fully satisfy all of them and that
154    engineering compromises are needed to yield a result that is
155    workable, even if not completely satisfactory.  Based on that
156    experience and issues that have been raised, it is now appropriate to
157    review some of the implications of IDNs, the decisions made in
158    defining them, and the foundation on which they rest, and to
159    determine whether changes are needed and, if so, which ones.

161    The design of the DNS itself imposes some additional constraints.  If
162    the DNS is to remain globally interoperable, there are specific
163    characteristics that no implementation of IDNs, or the DNS more
164    generally, can change.  For example, because the DNS is a global
165    hierarchical administrative namespace with only a single name at any
166    given node, there is one and only one owner of each domain name.
167    Also, when strings are looked up in the DNS, positive responses can
168    only reflect exact matches: if there is no exact match, then one gets
169    an error reply, not a list of near matches or other supplemental
170    information.  Searches and approximate matchings are not possible.

172    Finally, because the DNS is a distributed system where any server
173    might cache responses, and later use those cached responses to
174    attempt to satisfy queries before a global lookup is done, every
175    server must use the same matching criteria.

177 1.2.  Status of this Document and its Recommendations

179    This document reviews the IDN landscape from an IETF perspective and
180    presents the recommendations and conclusions of the IAB, based
181    partially on input from an ad hoc committee charged with reviewing
182    IDN issues and the path forward (see Section 7).  Its recommendations
183    are advice to the IETF, or in a few cases to other bodies, for topics
184    to be investigated and actions to be taken if those bodies, after
185    their examinations, consider those actions appropriate.

187    [[anchor4: IMPORTANT: The IAB has not yet reached consensus that this
188    document is ready for final publication.  While considerable input
189    from the members of the ad hoc committee went into the document, no
190    claim is made that it represents the consensus of that group.
191    However, the IAB concluded that it was appropriate to expose these
192    versions, as working drafts, for community comment and feedback.
193    Such comments should be sent to iab@iab.org.]]

195 1.3.  The IDNA Standard

197    During 2002, the IETF completed the following RFCs that, together,
198    define IDNs:

200    RFC 3454  Preparation of Internationalized Strings ("Stringprep")
201       [RFC3454].
202       Stringprep is a generic mechanism for taking a Unicode string and
203       converting it into a canonical format.  Stringprep itself is just
204       a collection of rules, tables, and operations.  Any protocol or
205       algorithm that uses it must define a "Stringprep profile", which
206       specifies which of those rules are applied, how, and with which
207       characteristics.

209    RFC 3490  Internationalizing Domain Names in Applications (IDNA)
210       [RFC3490].
211       IDNA is the base specification in this group.  It specifies that
212       Nameprep is used as the Stringprep profile for domain names, and
213       that Punycode is the relevant encoding mechanism for use in
214       generating an ASCII-compatible ("ACE") form of the name.  It also
215       applies some additional conversions and character filtering that
216       are not part of Nameprep.

218    RFC 3491  Nameprep: A Stringprep Profile for Internationalized Domain
219       Names (IDN) [RFC3491].
220       Nameprep is one such profile.  It is designed to meet the specific
221       needs of IDNs and, in particular, to support case-folding for
222       scripts that support what are traditionally known as upper and
223       lower case forms of the same letters.  The result of the Nameprep
224       algorithm is a string containing a subset of the Unicode Character
225       set, normalized and case folded so that case insensitive
226       comparison can be made.

228    RFC 3492  Punycode: A Bootstring encoding of Unicode for
229       Internationalized Domain Names in Applications (IDNA) [RFC3492].
230       Punycode is a mechanism for encoding a Unicode string in ASCII
231       characters.  The characters used are the same subset of
232       characters that are allowed in the hostname definition of DNS,
233       i.e., the "letter, digit, and hyphen" characters, sometimes known
234       as "LDH".

236 1.4.  Unicode Documents

238    Unicode is used as the base, and defining, character set for IDN.
239    Unicode is standardized by the Unicode Consortium, and synchronized
240    with ISO to create ISO/IEC 10646 [ISO10646].  At the time the RFCs
241    mentioned earlier were created, Unicode was at version 3.2.  For
242    reasons explained later, it was necessary to pick a particular, then-
243    current, version of Unicode when IDNA was adopted.  Consequently, the
244    RFCs are explicitly dependent on Unicode version 3.2 [Unicode32].
245    There is, at present, no established mechanism for modifying the IDNA
246    RFCs to use newer Unicode versions (see Section 3.1).

248    Unicode is a very large and complex character set.  (The term
249    "character set" or "charset" is used in a way that is peculiar to the
250    IETF and may not be the same as the usage in other bodies and
251    contexts.)  The Unicode Standard and related documents are created
252    and maintained by the Unicode Technical Committee (UTC), one of the
253    committees of the Unicode Consortium.

255    The Consortium first published The Unicode Standard [Unicode10] in
256    1991, and continues to develop standards based on that original work.
257    Unicode is developed in conjunction with the International
258    Organization for Standardization, and it shares its character
259    repertoire with ISO/IEC 10646.  Unicode and ISO/IEC 10646 function
260    equivalently as character encodings, but The Unicode Standard
261    contains much more information for implementers, covering -- in depth
262    -- topics such as bitwise encoding, collation, and rendering.  The
263    Unicode Standard enumerates a multitude of character properties,
264    including those needed for supporting bidirectional text.  The
265    Unicode Consortium and ISO standards do use slightly different
266    terminology.

268 1.5.  Definitions

270    The following terms and their meanings are critical to understanding
271    the rest of this document and to discussions of IDNs more generally.
272    These terms are derived from [RFC3536], which contains additional
273    discussion of some of them.

275 1.5.1.  Language

277    A language is a way that humans interact.  The use of language occurs
278    in many forms, including speech, writing, and signing.

280    Some languages have a close relationship between the written and
281    spoken forms, while others have a looser relationship.  RFC 3066
282    [RFC3066] discusses languages in more detail and provides identifiers
283    for languages for use in Internet protocols.  Computer languages are
284    explicitly excluded from this definition.  The most recent IETF work
285    in this area, and on script identification (see below), is documented
286    in [ltru-registry] and [ltru-initial].

288 1.5.2.  Script

290    A script is a set of graphic characters used for the written form of
291    one or more languages.  This definition is the one used in
292    [ISO10646].

294    Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called
295    ideographs used in writing Chinese, Japanese, and Korean), and
296    "Latin".  Arabic, Greek, and Latin are, of course, also names of
297    languages.
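
   As an illustration only (not part of the draft), the following
   Python sketch prints the Unicode character name for one sample
   character from each of the scripts listed above.  The sample
   characters and the helper name_prefix() are assumptions chosen for
   this example.  The first word of a character name is only a rough
   hint: it is not the Unicode Script property, which the Python
   standard library's unicodedata module does not expose, and Han
   characters carry the prefix "CJK" rather than "HAN".

```python
import unicodedata

# Hypothetical samples: one character per script named in the text.
SAMPLES = {
    "Arabic": "\u0627",    # ARABIC LETTER ALEF
    "Cyrillic": "\u0430",  # CYRILLIC SMALL LETTER A
    "Greek": "\u03b1",     # GREEK SMALL LETTER ALPHA
    "Han": "\u4e2d",       # CJK UNIFIED IDEOGRAPH-4E2D
    "Latin": "a",          # LATIN SMALL LETTER A
}


def name_prefix(ch: str) -> str:
    """Return the first word of the character's Unicode name.

    This is a rough hint only, not the Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]


if __name__ == "__main__":
    for script, ch in SAMPLES.items():
        print(f"{script:9} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

   Note that the Latin "a" and the Cyrillic "a" sampled here are
   distinct code points in different scripts even though they are
   usually rendered identically.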

299    Historically, the script that is known as "Latin" in Unicode and most
300    contexts associated with information technology standards is known in
301    the linguistic community as "Roman" or "Roman-derived".  The latter
302    terminology distinguishes between the Latin language and the
303    characters used to write it, especially in Republican times, from the
304    much richer and more decorated script derived and adapted from those
305    characters.  Since IDNA is defined using Unicode and that standard
306    uses the term "LATIN" in its character names and descriptions, that
307    terminology will be used in this document as well except when "Roman-
308    derived" is needed for clarity.  However, readers approaching this
309    document from a cultural or linguistic standpoint should be aware
310    that the use of, or references to, "Latin script" in this document
311    refers to the entire collection of Roman-derived characters, not just
312    the characters used to write the Latin language.  Some other issues
313    with script identification and relationships with other standards are
314    discussed in [ltru-registry].

316 1.5.3.  Multilingual

318    The term "multilingual" has many widely-varying definitions and thus
319    is not recommended for use in standards.  Some of the definitions
320    relate to the ability to handle international characters; other
321    definitions relate to the ability to handle multiple charsets; and
322    still others relate to the ability to handle multiple languages.

324    While this term has been deprecated for IETF-related uses and does
325    not otherwise appear in this document, a discussion here seemed
326    appropriate since the term is still widely used in some discussions
327    of IDNs.

329 1.5.4.  Localization

331    Localization is the process of adapting an internationalized
332    application platform or application to a specific cultural
333    environment.  In localization, the same semantics are preserved while
334    the syntax or presentation forms may be changed.

336    Localization is the act of tailoring an application for a different
337    language or script or culture.  Some internationalized applications
338    can handle a wide variety of languages.  Typical users only
339    understand a small number of languages, so the program must be
340    tailored to interact with users in just the languages they know.

342    Somewhat different definitions for localization and
343    internationalization (see below) are used by groups other than the
344    IETF.  See [W3C-Localization] for one example.

346 1.5.5.  Internationalization

348    In the IETF, the term "internationalization" is used to describe
349    adding or improving the handling of non-ASCII text in a protocol.
350    Other bodies use the term in other ways, often with subtle variation
351    in meaning.  The term "internationalization" is often abbreviated
352    "i18n" (and localization as "l10n").

354    Many protocols that handle text only handle the characters associated
355    with one script (often, a subset of the characters used in writing
356    English text), or leave the question of what character set is used up
357    to local guesswork (which leads, of course, to interoperability
358    problems).  Adding non-ASCII text to such a protocol allows the
359    protocol to handle more scripts, with the intention of being able to
360    include all of the scripts that are useful in the world.  It should
361    be noted that many English words cannot be written in ASCII, various
362    mythologies notwithstanding.

364 1.6.  Statements and Guidelines

366    When the IDN RFCs were published, the IESG and ICANN made statements
367    that were intended to guide deployment and future work.  In recent
368    months, ICANN has updated its statement and others have also made
369    contributions.  It is worth noting that the quality of understanding
370    of internationalization issues as applied to the DNS has evolved
371    considerably over the last few years.  Organizations that took
372    specific positions a year or more ago might not make exactly the same
373    statements today.

375 1.6.1.  IESG Statement

377    The IESG made a statement on IDNA [IESG-IDN]:

379       IDNA, through its requirement of Nameprep [RFC3491], uses
380       equivalence tables that are based only on the characters
381       themselves; no attention is paid to the intended language (if any)
382       for the domain name.  However, for many domain names, the intended
383       language of one or more parts of the domain name actually does
384       matter to the users.

386       Similarly, many names cannot be presented and used without
387       ambiguity unless the scripts to which their characters belong are
388       known.  In both cases, this additional information should be of
389       concern to the registry.

391    The statement is longer than this, but these paragraphs are the
392    important ones.  The rest of the statement consists of explanations
393    and examples.

395 1.6.2.  ICANN statements

397 1.6.2.1.  Initial ICANN Guidelines

399    Soon after the IDNA standard was adopted, ICANN produced an initial
400    version of its "IDN Guidelines" [ICANNv1].  This document was
401    intended to serve two purposes.  The first was to provide a basis for
402    releasing the gTLD registries that had been established by ICANN from
403    a contractual restriction on the registration of labels containing
404    hyphens in the third and fourth positions.  The second was to provide
405    a general framework for the development of registry policies for the
406    implementation of IDN.

408    One of the key components of this framework was prescribing strict
409    compliance with RFCs 3490, 3491, and 3492.  These specifications
410    established the ACE (ASCII-Compatible Encoding) scheme for IDN use,
411    known as "Punycode", and the various rules for its use.  The
412    specifications designated Punycode, supported by those rules, as the
413    sole such encoding to be used with the DNS.
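
   As an illustration only (not part of the draft), the ACE form
   described above can be sketched with Python's built-in "idna"
   codec, which follows RFC 3490 with Nameprep and Punycode.  The
   helper names to_ace(), from_ace(), and is_ace() are hypothetical
   and exist only for this example.  The "xn--" prefix is why ACE
   labels carry hyphens in the third and fourth positions.

```python
# Sketch only: relies on Python's built-in "idna" codec (RFC 3490 rules).
# Helper names are hypothetical and not taken from any specification.

ACE_PREFIX = "xn--"  # hyphens occupy the third and fourth positions


def to_ace(label: str) -> str:
    """Apply Nameprep and Punycode to one label and return its ACE form."""
    return label.encode("idna").decode("ascii")


def from_ace(label: str) -> str:
    """Convert an ACE label back to its Unicode form."""
    return label.encode("ascii").decode("idna")


def is_ace(label: str) -> bool:
    """Report whether a label starts with the ACE prefix."""
    return label[:4].lower() == ACE_PREFIX


if __name__ == "__main__":
    # Case folding and normalization happen before encoding, so these
    # three spellings all map to the same ACE label.
    for spelling in ("b\u00fccher", "B\u00fccher", "bu\u0308cher"):
        print(repr(spelling), "->", to_ace(spelling))
    print(from_ace("xn--bcher-kva"))
```

   Because Nameprep folds case and normalizes before Punycode is
   applied, the precomposed, capitalized, and decomposed spellings in
   the sketch all yield the single label "xn--bcher-kva", which is
   what allows the DNS to keep using exact-match comparison.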

415    Limitations on the characters available for inclusion in IDNs were
416    mandated by two devices.  The first was by requiring an "inclusion-
417    based approach (meaning that code points that are not explicitly
418    permitted by the registry are prohibited) for identifying permissible
419    code points from among the full Unicode repertoire."  The second
420    device required the association of every IDN with a specific
421    language, with additional policies also being language based:

423    "In implementing the IDN standards, top-level domain registries will
424    (a) associate each registered internationalized domain name with one
425    language or set of languages,
426    (b) employ language-specific registration and administration rules
427    that are documented and publicly available, such as the reservation
428    of all domain names with equivalent character variants in the
429    languages associated with the registered domain name, and,
430    (c) where the registry finds that the registration and administration
431    rules for a given language would benefit from a character variants
432    table, allow registrations in that language only when an appropriate
433    table is available. ...  In implementing the IDN standards, top-level
434    domain registries should, at least initially, limit any given domain
435    label (such as a second-level domain name) to the characters
436    associated with one language or set of languages only."

438    It was left to each TLD registry to define the character repertoire
439    it would associate with any given language.  This led to significant
440    variation from registry to registry, with further heterogeneity in
441    the underlying language-based IDN policies.  If the guidelines had
442    made provision for IDN policies also being based on script, a
443    substantial amount of the resulting ambiguity could have been
444    avoided.  However, they did not, and the sequence of events leading
445    to the present review of IDNA was thus triggered.

447 1.6.2.2.  ICANN Version 2 Guidelines

449    One of the responses of the TLD registries to what was widely
450    perceived as a crisis situation was to invoke the mechanism described
451    in the initial guidelines: "As the deployment of IDNs proceeds, ICANN
452    and the IDN registries will review these Guidelines at regular
453    intervals, and revise them as necessary based on experience."

455    The pivotal requirement was the modification of the guidelines to
456    permit script-based IDN policies.  Further concern was expressed
457    about the need for realistically implementable mechanisms for the
458    propagation of TLD registry policies into the lower levels of their
459    name trees.  In addition to the anticipated increase of constraint on
460    the protocol level, one obvious additional approach would be to
461    replace the guidelines with an instrument which itself had clear
462    status in the IETF's normative framework.  A BCP was therefore seen
463    as the appropriate focus for longer-term effort.  The most pressing
464    issues would be dealt with in the interim by incremental modification
465    to the guidelines, but no need was seen for the detailed further
466    development of those guidelines once that incremental modification
     was complete.

468    The outcome of this action was a version 2.0 of the guidelines
469    [ICANNv2] which was endorsed by the ICANN Board on November 8, 2005
470    for a period of nine months.  The Board stated further that it "tasks
471    the IDN working group to continue its important work and return to
472    the board with specific IDN improvement recommendations before the
473    ICANN Meeting in Morocco" and "supports the working group's continued
474    action to reframe the guidelines completely in a manner appropriate
475    for further development as a Best Current Practices (BCP) document,
476    to ensure that the Guideline directions will be used deeper into the
477    DNS hierarchy and within TLD's where ICANN has a lesser policy
478    relationship."
480 Retaining the inclusion-based approach established in version 1.0, 481 the crucial addition to the policy framework is that: 483 "All code points in a single label will be taken from the same script 484 as determined by the Unicode Standard Annex #24: Script Names at 485 http://www.unicode.org/reports/tr24. Exception to this is 486 permissible for languages with established orthographies and 487 conventions that require the commingled use of multiple scripts. In 488 such cases, visually confusable characters from different scripts 489 will not be allowed to co-exist in a single set of permissible 490 codepoints unless a corresponding policy and character table is 491 clearly defined." 493 Additionally: 495 "Permissible code points will not include: (a) line symbol-drawing 496 characters (as those in the Unicode Box Drawing block), (b) symbols 497 and icons that are neither alphanumeric nor ideographic language 498 characters, such as typographic and pictographic dingbats, (c) 499 characters with well-established functions as protocol elements, (d) 500 punctuation marks used solely to indicate the structure of 501 sentences." 503 Attention has been called to several points that are not adequately 504 dealt with (if at all) in the version 2.0 guidelines but which ought 505 to be included in the policy framework without waiting for the 506 production and release of a document based on a "best practices" 507 model. The term "BCP" above does not necessarily refer to an IETF 508 consensus document. The intention in Nov 2005 was for the 509 recommended major revision to be put to the ICANN Board prior to its 510 meeting in Morocco (in late June 2006), but for the changes to be 511 collated incrementally and appear in interim version 2.n releases of 512 the guidelines. 
The IAB's understanding is that, while there has 513 been some progress with this, other issues relating to IDN 514 subsequently diverted much of the energy that was intended to be 515 devoted to the more extensive treatment of the guidelines. 517 2. General Problems and Issues 519 This section interweaves problems and issues of several types. Each 520 subsection outlines something that is perceived to be a problem or 521 issue "with IDNs", therefore needing correction. Some of these 522 issues can be at least partially resolved by making changes to 523 elements of the IDNA protocol or tables. Others will exist as long 524 as people have expectations of IDNs that are inconsistent with the 525 basic DNS architecture. It is important to identify this entire 526 range of problems because users, registrants, and policy makers often 527 do not understand the protocol and other technical issues but only 528 the difference between what they believe happens or should happen and 529 what actually happens. As long as those differences exist, there 530 will be demands for functionality or policy changes for IDN. Of 531 course, some of these demands will be less realistic than others but 532 even the realistic ones should be understood in the same context as 533 the others. 535 Most of the issues that have been raised, and that are discussed in 536 this document, exist whether IDNA remains tied to Unicode 3.2 or 537 whether migration to new Unicode versions is contemplated. A 538 migration path is necessary to accommodate newly-coded scripts and to 539 permit the maximum number of languages and scripts to be represented 540 in domain names. However, the migration issues are largely separate 541 from those involving a single Unicode Version or Version 3.2 in 542 particular, so they have been separated into this section and 543 Section 3 545 2.1. 
User conceptions, local character sets, and input issues 547 The labels of the DNS are just strings of characters that are not 548 inherently tied to a particular language. As mentioned briefly in 549 the Introduction, DNS labels that could not lexically be words in any 550 language are possible and indeed common: there appears to be no 551 reason to impose protocol restrictions on IDNs that would restrict 552 them more than all-ASCII hostname labels have been restricted. For 553 that reason, even describing DNS labels or strings of them as "names" 554 is something of a misnomer, one that has probably added to user 555 confusion about what to expect. 557 Ordinarily, people use "words" when they think of things and wish 558 others to think of them too. For example "orange", "tree", 559 "restaurant" or "Acme Inc". Words are normally in a specific 560 language, such as English or Swedish. The character-string labels 561 supported by the DNS are, as suggested above, not inherently "words". 562 While it is useful, especially for mnemonic value or to identify 563 objects, for actual words to be used as DNS labels, other constraints 564 on the DNS make it impossible to guarantee that it will be possible 565 to represent every word in every language as a DNS label, 566 internationalized or not. 568 When writing or typing the label (or word), a script must be selected 569 and a charset must be picked for use with that script. That choice 570 of charset is typically not under the control of the user on a per 571 word or per document basis, but may depend on local input devices, 572 keyboard or terminal drivers, or other decisions made by operating 573 system or even hardware designers and implementers. 575 If that charset, or the local charset being used by the relevant 576 operating system or application software, is not Unicode, a further 577 conversion must be performed to produce Unicode. 
How often this is 578 an issue depends on estimates of how widely Unicode is deployed as 579 the native character set for hardware, operating systems, and 580 applications. Those estimates differ widely, but it should be noted 581 that, among other difficulties: 583 o ISO 8859 versions [ISO.8859.2003] and even national variations of 584 ISO 646 [ISO.646.1991] are still widely used in parts of Europe; 585 o code-table switching methods, typically based on the techniques of 586 ISO 2022 [ISO.2022.1986], are still in general use in many parts of 587 the world, especially in Japan with Shift-JIS and its variations; 588 o computing, systems, and communications in China tend to use 589 one or more of the national "GB" standards rather than native 590 Unicode. 592 Additionally, not all charsets define their characters in the same 593 way and not all pre-existing coding systems were incorporated into 594 Unicode without changes. Sometimes local distinctions were made that 595 Unicode does not make or vice versa. Consequently, conversion from 596 other systems to Unicode may potentially lose information. 598 The Unicode string that results from this processing --processing 599 that is trivial in a Unicode-native system but that may be 600 significant in others-- is then used as input to IDNA. 602 2.2. Examples of Issues 604 While much of the discussion below is stated in terms of Unicode 605 codings and associated rules, the IAB believes that some of the 606 issues are actually not about the Unicode Character set per se, but 607 about how distributed matching systems operate in reality, and about 608 what implications the distributed delayed search for stored data that 609 characterizes the DNS has for the mapping algorithms. 611 2.2.1. Language specific character matching 613 There are similar words that can be expressed in multiple languages. 614 One example is the name Torbjorn in Norwegian and Swedish.
In Norwegian 615 it is spelled with the character U+00F8 (LATIN SMALL LETTER O WITH 616 STROKE) in the second syllable, while in Swedish it is spelled with 617 U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS). Those characters are 618 not treated as equivalent according to the Unicode Standard and its 619 Annexes while most people speaking Swedish, Danish, or Norwegian 620 probably think they are equivalent. 622 It is neither possible nor desirable to make these characters 623 equivalent on a global basis. To do so would, for this example, 624 rationalize the situation in Sweden while causing considerable 625 confusion in Germany, where the U+00F8 character is never used in the 626 German language. But the "variant" model introduced in [RFC3743] and 627 [RFC4290] can be used by a registry to prevent the worst consequence 628 of the possible confusion, either by ensuring that both names are 629 registered to the same party in a given domain or that one of them is 630 completely prohibited. 632 2.2.2. Multiple scripts 634 There are languages in the world that can be expressed using multiple 635 scripts. For example, some Eastern European and Central Asian 636 languages can be expressed in either Cyrillic or Latin characters (see 637 Section 1.5.2), and some African and Southeast Asian 638 languages can be expressed in either Arabic or Latin characters. A few 639 languages can even be written in three different scripts. In other 640 cases, the language is typically written in a combination of scripts 641 (e.g., Kanji, Kana, and Romaji for Japanese, Hangul and Hanja for 642 Korean). Because of this, the same word, in the same language, can 643 be expressed in different ways. For some languages, only a single 644 script is normally used to write a single word; for others, mixed 645 scripts are required; and, for still others, special circumstances 646 may dictate mixing scripts in labels although that is not normally 647 done for "words".
For IDN purposes, these variations make the 648 definition of "script" extremely sensitive, especially since ICANN is 649 now recommending that it be used as the primary basis for registry 650 policies. However essential it may be to prohibit mixed-script 651 labels, additional policy nuance is required for "languages with 652 established orthographies and conventions that require the commingled 653 use of multiple scripts". 655 2.2.3. Normalization and Character Mappings 657 Unicode contains several different models for representing 658 characters. The Chinese (Han)-derived characters of the "CJK" 659 languages are "unified", i.e., characters with common derivation and 660 similar appearances are assigned to the same code point. European 661 characters derived from a Greek-Latin base are separated into 662 separate code blocks for Latin, Greek and Cyrillic even when 663 individual characters are identical in both form and semantics. 664 Separate code points based on font differences alone are generally 665 prohibited, but a large number of characters for "mathematical" use 666 have been assigned separate code points even though they differ from 667 base ASCII characters only by font attributes such as "script", 668 "bold", or "italic". Some characters that often appear together are 669 treated as typographical digraphs with specific code points assigned 670 to the combination, others require that the two-character sequences 671 be used, and still others are available in both forms. Some Roman- 672 derived letters that were developed as decorated variations on the 673 basic Latin letter collection (e.g., by addition of diacritical 674 marks) are assigned code points as individual characters, others must 675 be built up as two (or more) character sequences using "composing 676 characters". 678 Many of these differences result from the desire to maintain backward 679 compatibility while the standard evolved historically, and are hence 680 understandable. 
However, the DNS requires precise knowledge of which 681 codes and code sequences represent the same character and which ones 682 do not. Limiting the potential difficulties with confusable 683 characters (see Section 2.2.6) requires even more knowledge of which 684 characters might look alike in some fonts but not in others. These 685 variations make it difficult or impossible to apply a single set of 686 rules to all of Unicode and, in doing so, satisfy everyone and their 687 perceived needs. Instead, more or less complex mapping tables, 688 defined on a character by character basis, are required to 689 "normalize" different representations of the same character to a 690 single form so that matching is possible. 692 Unless normalization rules, such as those that underlie Nameprep, are 693 applied, characters that are essentially identical will not match in 694 the DNS, creating many opportunities for problems. The most common 695 of these problems is that, due to the processing applied (and 696 discussed above) before a word is represented as a Unicode string, a 697 single word can end up being expressed as more than one unique 698 Unicode string. Even if normalization rules are applied, some 699 strings that are considered identical by users will not compare 700 equal. That problem is discussed in more detail elsewhere in this 701 document, particularly in Section 3.2.1. 703 IDNA attempts to compensate for these problems by using a 704 normalization algorithm defined by the Unicode Consortium. This 705 algorithm can change a sequence of one or more Unicode characters to 706 another set of characters. One example is that the base character 707 U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING 708 DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN 709 SMALL LETTER A WITH DIAERESIS). 711 This Unicode normalization process accounts only for simple character 712 equivalences, not equivalences that are language or script dependent. 
713 For example, as mentioned above, the characters U+00F8 (LATIN SMALL 714 LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH 715 DIAERESIS) are considered to match in Swedish (and some other 716 languages), but not for all languages that use either of the 717 characters. Having these characters be treated as equivalent in some 718 contexts and not in others requires decisions and mechanisms that, in 719 turn, depend much more on context than either IDNA or the Unicode 720 character-based normalization tables can provide. 722 Additional complications occur if the sequences are more complicated 723 or if an attacker is making a deliberate effort to confuse the 724 normalization process. For example, if the sequence U+0069 U+0308 725 (LATIN SMALL LETTER I followed by COMBINING DIAERESIS) appears, NFKC 726 maps it into U+00EF (LATIN SMALL LETTER I WITH DIAERESIS), which is what one would 727 predict. But consider U+0131 U+0308 (LATIN SMALL LETTER DOTLESS I 728 and COMBINING DIAERESIS): is that the same character? Is U+0131 729 U+0307 U+0307 (dotless i and two combining dot-above characters) 730 equivalent to U+00EF or U+0069, or neither? NFKC does not appear to 731 tell us, nor does the definition of U+0307 appear to tell us what 732 happens when it is combined with other "symbol above" arrangements 733 (unlike some of the "accent above" combining characters, which more 734 or less specify kerning). Similar issues arise when U+00EF is 735 combined with various dot-above combining characters. Each of these 736 questions provides some opportunities for spoofing if different 737 display implementations interpret the rules in different ways. 739 If we leave Latin scripts and examine those based on Chinese 740 characters, we see there is also an absence of specific, lexigraphic, 741 rules for transformations between Traditional and Simplified Chinese.
742 Even if there were such rules, unification of Japanese and Korean 743 characters with Chinese ones would make it impossible to normalize 744 Traditional Chinese into Simplified Chinese ones without causing 745 problems in Japanese and Korean use of the same characters. 747 More generally, while some mappings, such as those between 748 precomposed Latin script characters and the equivalent multiple code 749 point composed character sequences, depend only on the characters 750 themselves, in many or most cases, such as the case with Swedish 751 above, the mapping is language or culturally dependent. There have 752 been discussions as to whether different canonicalization rules (in 753 addition to or instead of Unicode normalization) should be, or could 754 be, applied differently to different languages or scripts. The fact 755 that most scripts included in Unicode have been initially 756 incorporated by copying an existing standard more or less intact has 757 impact on the optimization of these algorithms and on forward 758 compatibility. Even if the language is known and language-specific 759 rules can be defined, dependencies on the language do not disappear. 760 Canonicalization operations are not possible unless they either 761 depend only on short sequences of text or have significant context 762 available that is not obvious from the text itself. DNS lookups and 763 many other operations do not have a way to capture and utilize the 764 language or other information that would be needed to provide that 765 context. 767 These variations in languages and in user perceptions of characters 768 make it difficult or impossible to provide uniform algorithms for 769 matching Unicode strings in a way that no end users are ever 770 surprised by the result. For closely-related scripts or characters, 771 surprises may even be frequent. 
However, because uniform algorithms 772 are required for mappings that are applied when names are looked up 773 in the DNS, the rules that are chosen will always represent an 774 approximation that will be more or less successful in minimizing 775 those user surprises. The current Nameprep and Stringprep algorithms 776 use mapping tables to "normalize" different representations of the 777 same text to a single form so that matching is possible. 779 More details on the creation of the normalization algorithms can be 780 found in the Unicode Specification and the associated Technical 781 Reports [UTR] and Annexes. Technical Report #36 [UTR36] and [UTR39] 782 are specifically related to the IDN discussion. 784 2.2.4. URLs in Printed Form 786 URLs and other identifiers appear, not only in electronic forms from 787 which they can (at least in principle) be accurately copied and 788 "pasted" but in printed forms from which the user must transcribe 789 them into the computer system. This is often known as the "side of 790 the bus problem" because a particularly problematic version of it 791 requires that the user be able to observe and accurately remember a 792 URL that is quickly-glimpsed in a transient form -- a billboard seen 793 while driving, a sign on the side of a passing vehicle, a television 794 advertisement that is not frequently repeated or on-screen for a long 795 time, and so on. 797 The difficulty, in short, is that two Unicode strings that are 798 actually different might look exactly the same, especially when there 799 is no time to study them. This is because, for example, some glyphs 800 in Cyrillic, Greek and Latin do look the same, but have been assigned 801 different codepoints in Unicode. Worse, one needs to be reasonably 802 familiar with a script and how it is used to understand how much 803 characters can reasonably vary as the result of artistic fonts and 804 typography. 
For example, there are a few fonts for Latin characters 805 that are so highly ornamented that an observer might easily 806 confuse some of the characters with characters in Thai script. 807 Upper-case ITC Blackadder (a registered trademark of International 808 Typeface Corporation) and Curlz MT are two fairly obvious examples; 809 these fonts use loops at the end of serifs, creating a resemblance to 810 Thai (in some fonts) for some characters. 812 2.2.5. Bidirectional text 814 Some scripts (and because of that some words in some languages) are 815 written not left to right, but right to left. And, to complicate 816 things, one might have something written in Arabic characters right 817 to left that includes some text in Latin characters, such as 818 European-style digits. The Latin character part is written left to 819 right, which implies some texts might have a mixed left to right AND 820 right to left order (even though in most implementations all texts 821 have a major direction, with the other as an exception). IDNA 822 prohibits these mixed-directional (or bidirectional) strings in IDN 823 labels, but the prohibition causes other problems such as the 824 rejection of some otherwise linguistically and culturally sensible 825 strings. As Unicode and conventions for handling so-called 826 bidirectional ("BIDI") strings evolve, the prohibition in IDNA should 827 be reviewed and reevaluated. 829 2.2.6. Confusable Character Issues 831 Similar-looking characters in identifiers can cause actual problems 832 on the Internet since they can result, deliberately or accidentally, 833 in people being directed to the wrong host or mailbox by believing 834 that they are typing, or clicking on, intended characters which are 835 different from those that actually appear in the domain name or 836 reference. See Section 4.1.3 for further discussion of this issue.
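The confusable-character problem described above can be illustrated with a minimal Python sketch. It is not part of the draft. The label "paypal", the helper name rough_scripts, and the use of the first word of a character's Unicode name as a stand-in for its script are all assumptions made for illustration; a real implementation would use the script property data from Unicode Standard Annex #24 instead.

```python
import unicodedata


def rough_scripts(label):
    """Guess the scripts used in a label from Unicode character names.

    Assumption: the first word of the character name (LATIN, CYRILLIC,
    GREEK, ...) approximates the script. This is a heuristic only.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # name() takes a default so unnamed characters do not raise.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


latin = "paypal"
# U+0430 CYRILLIC SMALL LETTER A replaces the first Latin "a".
spoof = "p\u0430ypal"

# The two labels render alike in most fonts but are different strings,
# and normalization does not make them compare equal.
same_raw = latin == spoof
same_nfkc = (unicodedata.normalize("NFKC", latin)
             == unicodedata.normalize("NFKC", spoof))

print(same_raw, same_nfkc)
print(sorted(rough_scripts(latin)))
print(sorted(rough_scripts(spoof)))
```

Under these assumptions, a check of this kind flags the second label as mixing scripts, although it cannot help when an entire label is written in a single look-alike script.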
838 IDNs complicate these issues, not only by providing many additional 839 characters that look sufficiently alike to be potentially confused, 840 but by raising new policy questions. For example, if a language can 841 be written in two different scripts, is a label constructed from a 842 word written in one script equivalent to a label constructed from the 843 same word written in the other script? Is the answer the same for 844 words in two different languages that translate into each other? 846 It is now generally understood that, in addition to the collision 847 problems of possibly equivalent words and hence labels, it is 848 possible to utilize characters that look alike -- "confusable" 849 characters -- to spoof names in order to mislead or defraud users. 850 That issue, driven by particular attacks such as those known as 851 "phishing", has introduced stronger requirements for registry efforts 852 to prevent problems than were previously generally recognized as 853 important. 855 One commonly-proposed approach is to have a registry establish 856 restrictions on the characters, and combinations of characters, it 857 will permit to be included in a string to be registered as a label. 858 Taking the Swedish top-level domain, .SE, as an example, a rule might 859 be adopted that the registry "only accepts registrations in Swedish, 860 using Latin script, and because of this, Unicode characters Latin-a, 861 -b, -c,...". But, because there is not a 1:1 mapping between country 862 and language, even a ccTLD like .SE might have to accept 863 registrations in other languages. For example, there may be a 864 requirement for Finnish (the second most-used language in Sweden). 865 What rules and codepoints are then defined for Finnish? Does it have 866 special mappings that collide with those that are defined for 867 Swedish? And what does one do in countries that use more than one 868 script? (Finnish and Swedish use the same script.) 
In all cases, 869 the dispute will ultimately be about whether two strings are the same 870 (or confusingly similar) or not. That, in turn, will generate a 871 discussion of how one defines "what is the same" and "what is similar 872 enough to be a problem". 874 Another example arose recently that further illustrates the problem. 875 If one were to use Cyrillic characters to represent the country code 876 for Russia in a localized equivalent to the ccTLD label, the 877 characters themselves would be indistinguishable from the Latin 878 characters "P" and "Y" (in either lower or upper case) in most fonts. 879 We presume this might cause some consternation in Paraguay. 881 These difficulties can never be completely eliminated by algorithmic 882 means. Some of the problem can be addressed by appropriate tuning of 883 the protocols and their tables, other parts by registry actions to 884 reduce confusion and conflicts, and still other parts can be 885 addressed by careful design of user interfaces in application 886 programs. But, ultimately, some responsibility to avoid being 887 tricked or harmfully confused will rest with the user. 889 Another registry technique that has been extensively explored 890 involves looking at confusable characters and confusion between 891 complete labels, restricting the labels that can be registered based 892 on relationships to what is registered already. Registries that 893 adopt this approach might establish special mapping rules such as: 895 1. If you register something with codepoint A, domain names with B 896 instead of A will be blocked from registration by others (where B 897 is a character at a separate codepoint that has a confusingly 898 similar appearance to A). 899 2. If you register something with codepoint A, you also get domain 900 name with B instead of A. 902 These approaches are discussed in more detail for "CJK" characters in 903 RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290]. 905 2.2.7. 
The IESG Statement and IDNA issues 907 The issues above, at least as they were understood at the time, 908 provided the background for the IESG statement included in 909 Section 1.6.1 (which, in turn, was part of the basis for the initial 910 ICANN Guidelines) that a registry should have a policy about the 911 scripts, languages, codepoints and text directions for which 912 registrations will be accepted. While "accept all" might be an 913 acceptable policy, it implies there is also a dispute resolution 914 process that takes the problems listed above into account. This 915 process must be designed for dealing with all types of potential 916 disputes. For example, issues might arise between registrant and 917 registry over a decision by the registry on collisions with already 918 registered domain names and between registrant and trade mark holder 919 (that a domain name infringes on a trademark). In both cases the 920 parties disagreeing have different views on whether two strings are 921 "equivalent" or not. They may believe that a string that is not 922 allowed to be registered is actually different from one that is 923 already registered. Or they might believe that two strings are the 924 same, even though the rules adopted by the registry to prevent 925 confusion define them as two different domain names. 927 3. Migrating to New Versions of Unicode 929 3.1. Versions of Unicode 931 While opinions differ about how important the issues are in practice, 932 the use of Unicode and its supporting tables for IDNA appears to be 933 far more sensitive to subtle changes than it is in typical Unicode 934 applications. This may be, at least in part, because many other 935 applications are internally sensitive only to the appearance of 936 characters and not to their representation. Or those applications 937 may be able to take effective advantage of script, language, or 938 character class identification. 
The working group that developed 939 IDNA concluded that attempting to encode any ancillary character 940 information into the DNS label would be impractical and unwise, and 941 the IAB, based in part on the comments in the ad hoc committee, saw 942 no reason to review that decision. 944 The Unicode Consortium has sometimes used the likelihood of a 945 combination of characters actually appearing in a natural language as 946 a criterion for the safety of a possible change. However, as 947 discussed above, DNS names are often fabrications -- abbreviations, 948 strings deliberately formed to be unusual, members of a series 949 sequenced by numbers or other characters, and so on. Consequently, a 950 criterion that considers a change to be safe if it would not be 951 visible in properly-constructed running text is not helpful for DNS 952 purposes: a change that would be safe under that criterion could 953 still be quite problematic for the DNS. 955 This sensitivity to changes has made it quite difficult to migrate 956 IDNA from one version of Unicode to the next if any changes are made 957 that are not strictly additive. A change in a code point assignment 958 or definition may be extremely disruptive if DNS labels have been 959 defined using the earlier form and any of its previous components has 960 been moved from one table position or normalization rule to another. 961 Unicode normalization tables, tables of scripts or languages and 962 characters that belong to them, and even tables of confusable 963 characters as an adjunct to security recommendations may be very 964 helpful in designing registry restrictions on registrations and 965 applications provisions for avoiding or identifying suspicious names. 966 Ironically, they also extend the sensitivity of IDNA and its 967 implementations to all forms of change between one version of Unicode 968 and the next. Consequently, they make Unicode version migration more 969 difficult. 
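As a rough illustration of the version sensitivity discussed above, the following Python sketch compares how a label is treated under Unicode 3.2 and under whatever Unicode version the local Python runtime ships. It relies on the standard library's unicodedata.ucd_3_2_0 object; the helper names and the sample labels are hypothetical and chosen only for illustration.

```python
import unicodedata

# View of the Unicode Character Database frozen at version 3.2,
# the version on which IDNA (via Stringprep) is based.
old = unicodedata.ucd_3_2_0


def unassigned_in_3_2(label):
    """List code points in the label that Unicode 3.2 leaves unassigned."""
    return ["U+%04X" % ord(ch) for ch in label if old.category(ch) == "Cn"]


def nfkc_changed_since_3_2(label):
    """Report whether NFKC gives different results under the two versions."""
    return (old.normalize("NFKC", label)
            != unicodedata.normalize("NFKC", label))


# U+07CA is NKO LETTER A; the N'Ko script was encoded after Unicode 3.2.
print(unicodedata.unidata_version)
print(unassigned_in_3_2("a\u07ca"))
print(nfkc_changed_since_3_2("a\u0308"))
```

A registry or application could run checks like these to find labels whose handling depends on the Unicode version in use, but the sketch does not address what to do once such a label is found.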
971 An example of the type of change that appears to be just a small 972 correction from one perspective but may be problematic from another 973 was the correction to the normalization definition in 2004 [Unicode- 974 PR29]. Community input suggested that the change would cause 975 problems for Stringprep, but the Unicode Technical Committee decided, 976 on balance, that the change was worthwhile. Because of difficulties 977 with consistency, some deployed implementations have decided to adopt 978 the change and others have not, leading to subtle incompatibilities. 980 This situation leads to a dilemma. On the one hand, it is completely 981 unacceptable to freeze IDNA at a Unicode version level that excludes 982 more recently-defined characters and scripts which are important to 983 those who use them. On the other hand, it is equally unacceptable to 984 migrate from one version of Unicode to the next if such migration 985 might invalidate an existing registered DNS name or some of its 986 registered properties or might make the string or representation of 987 that name ambiguous. If IDNA is to be modified to accommodate new 988 versions of Unicode, the IETF will need to work with the Unicode 989 Consortium and other relevant bodies to find an appropriate balance 990 in this area, but progress will be possible only if all relevant 991 parties are able to fairly consider and discuss possible decisions 992 that may be very difficult and unpalatable. 994 It would also prove useful if during the course of that dialog, the 995 need for Unicode Consortium concern with security issues in 996 applications of the Unicode character set could be clarified. It 997 would be unfortunate from almost every perspective considered here, 998 if such matters slowed the inclusion of as yet unencoded scripts. 1000 3.2. Version changes and normalization issues 1002 3.2.1. 
Unnormalized Combining Sequences 1004 One of the advantages of the Unicode model of combining characters, 1005 as with previous systems that use character overstriking to 1006 accomplish similar purposes, is that it is possible to use sequences 1007 of code points to generate characters that are not explicitly 1008 provided for in the character set. However, unless sequences that 1009 are not explicitly provided for are prohibited by some mechanism 1010 (such as the normalization tables), such combining sequences can 1011 permit two related dangers. 1013 o The first is another risk of character confusion, especially if 1014 the relationship of the combining character with the characters it 1015 combines with is not precisely defined or if unexpected combinations 1016 of combining characters are used. That issue is discussed in more 1017 detail, with an example, in Section 2.2.3. 1018 o The second is that these same issues inherently affect the stability of the 1019 normalization tables. Suppose that, somewhere in the world, there 1020 is a character that looks like a Roman-derived lower-case "i", but 1021 with three (not one or two) dots above it. And suppose that the 1022 users of that character agree to represent it by combining a 1023 traditional "i" (U+0069) with a combining diaeresis (U+0308). So 1024 far, no problem. But, later, a broader need for this character is 1025 discovered and it is coded into Unicode either as a single 1026 precomposed character or, more likely under existing rules, by 1027 introducing a three-dot-above combining character. In either 1028 case, that version of Unicode should include a rule in NFKC that 1029 maps the "i"-plus-diaeresis sequence into the new, approved, one. 1030 If one does not do so, then there is arguably a normalization that 1031 should occur that does not. If one does so, then strings that 1032 were valid and normalized (although unanticipated) under the 1033 previous versions of Unicode become unnormalized under the new 1034 version.
That, in turn, would impact IDNA comparisons because, 1035 effectively, it would introduce a change in the matching rules. 1037 It would be useful to consider rules that would avoid or minimize 1038 these problems with the understanding that, for reasons given 1039 elsewhere, simply minimizing them may not be good enough for IDNA. One 1040 partial solution might be to ban any combination of a base character 1041 and a combining character that does not appear in a hypothetical 1042 "anticipated combinations" table from being used in a domain name 1043 label. The next subsection discusses a more radical, if impractical, 1044 view of the problem and its solutions. 1046 3.2.2. Combining Characters and Character Components 1048 For several reasons, including those discussed above, one thing that 1049 increases IDNA complexity and the need for normalization is that 1050 combining characters are permitted. Without them, complexity might 1051 be reduced enough to permit easier transitions to new versions. 1052 The community should consider the impact of entirely prohibiting 1053 combining characters from IDNs. While it is almost certainly 1054 infeasible to introduce this change into Unicode as it is now defined 1055 and doing so would be extremely disruptive even if it were feasible, 1056 the thought experiment can be helpful in understanding both the 1057 issues and the implications of the paths not taken. For example, one 1058 consequence, of course, is that each new language or script, 1059 and several existing ones, would require that all of its characters 1060 have Unicode assignments to specific, precomposed, code points. 1062 Note that this is not currently permitted within Unicode for Latin 1063 scripts. For non-Latin scripts, some such code points have been 1064 defined. The decisions that govern the assignment of such code 1065 points are managed entirely within the Unicode Consortium.
Were the 1066 IETF to choose to reduce IDNA complexity by excluding combining 1067 characters, no doubt there would be additional input to the Unicode 1068 Consortium from users and proponents of scripts requiring composing 1069 characters. The IAB and the IETF should examine whether it is 1070 appropriate to press the Unicode Consortium to revise these policies 1071 or otherwise to recommend actions that would reduce the need for 1072 normalization and the related complexities. However, we have been 1073 told that the Technical Committee does not believe it is reasonable 1074 or feasible to add all possible precomposed characters to Unicode. 1075 If Unicode cannot be modified to contain the precomposed characters 1076 necessary to support existing languages and scripts, much less new 1077 ones, this option for IDN restrictions will not be feasible. 1079 3.2.3. When does normalization occur? 1081 In many Unicode applications, the preferred solution is to pick a 1082 style of normalization and require that all text that is stored or 1083 transmitted be normalized to that form. (This is the approach taken 1084 in ongoing work in the IETF on a standard Unicode text form [net- 1085 utf8]). IDNA does not impose this requirement. Text is normalized 1086 and case-reduced at registration time, and only the normalized 1087 version is placed in the DNS. However, there is no requirement that 1088 applications show only the native (and lower-case where appropriate) 1089 characters associated with the normalized form in discussions or 1090 references such as URLs. If conventions used for all-ASCII DNS names 1091 are to be extended to internationalized forms, such a requirement 1092 would be unreasonable, since it would prohibit the use of mixed-case 1093 references for clarity or market identification. It might even be 1094 culturally inappropriate. 
However, without that restriction, the 1095 comparison that will ultimately be made in the DNS will be between 1096 strings normalized at different times and under different versions of 1097 Unicode. The assertion that a string in normalized form under one 1098 version of Unicode will still be in normalized form under all future 1099 versions is not sufficient. Normalization at different times also 1100 requires that a given source string always normalizes to the same 1101 target string, regardless of the version under which it is 1102 normalized. That criterion is much more difficult to fulfill. The 1103 discussion above suggests that it may even be impossible. 1105 Ignoring these issues with combining characters entirely, as IDNA 1106 effectively does today, may leave us "stuck" at Unicode 3.2, leading 1107 either to incompatibilities in applications that otherwise 1108 use a modern version of Unicode (while IDN remains at Unicode 3.2) or 1109 to painful transitions to new versions. If decisions are made 1110 quickly, it may still be possible to make a one-time version upgrade 1111 to Version 4.1 or Version 5 of Unicode. However, unless we can 1112 impose sufficient global restrictions to permit smooth transitions, 1113 upgrading to versions beyond that one is likely to be painful (e.g., 1114 potentially requiring changing strings already in the DNS or even a 1115 new Punycode prefix) or impossible. 1117 4. Framework for next steps in IDN development 1119 4.1. Issues within the scope of the IETF 1121 4.1.1. Review of IDNA 1123 The IETF should consider reviewing RFCs 3454, 3490, 3491 and/or 3492, 1124 and updating, replacing, or supplementing them to meet the criteria of this 1125 paragraph (one or more of them may prove impractical after further 1126 study). Any new versions or additional specifications should be 1127 adapted to the version of Unicode that is current when they are 1128 created.
Ideally, they should specify a path for adapting to future 1129 versions of Unicode (some suggestions below may facilitate this). 1130 The IETF should also consider whether there are significant 1131 advantages to mapping some groups of characters, such as code points 1132 assigned to font variations, into others or whether clarity and 1133 comprehensibility for the user would be better served by simply 1134 prohibiting those characters. More generally, it appears that it 1135 would be worthwhile for the IETF to review whether the Unicode 1136 normalization rules now invoked by the Stringprep profile in Nameprep 1137 are optimal for the DNS or whether more restrictive rules, or an even 1138 more restrictive set of permitted character combinations, would 1139 provide better support for DNS internationalization. 1141 The IAB has concluded that there is a consensus within the broader 1142 community that lists of code points should be specified by the use of 1143 an inclusion-based mechanism (i.e., identifying the characters that 1144 are permitted), rather than by excluding a small number of characters 1145 from the total Unicode set as Stringprep and Nameprep do today. That 1146 conclusion should be reviewed by the IETF community and action taken 1147 as appropriate. 1149 We suggest that the individuals doing the review of the code points 1150 should work as a specialized design team. To the extent possible, 1151 that work should be done jointly by people with experience from the 1152 IETF and deep knowledge of the constraints of the DNS and application 1153 design, participants from the Unicode Consortium, and other people 1154 necessary to be able to reach a generally-accepted result. Because 1155 any work along these lines would be modifications and updates to 1156 standards-track documents, final review and approval of any proposals 1157 would necessarily follow normal IETF processes.
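The difference between the inclusion-based and exclusion-based mechanisms discussed above can be illustrated with a minimal sketch. The two character lists and both function names below are hypothetical and chosen only for illustration; they are not taken from Stringprep, Nameprep, or any registry table.

```python
# Hypothetical inclusion list: ASCII lower-case letters, digits, hyphen,
# and three additional Latin letters. Illustrative only.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {
    "\u00e5",  # LATIN SMALL LETTER A WITH RING ABOVE
    "\u00e4",  # LATIN SMALL LETTER A WITH DIAERESIS
    "\u00f6",  # LATIN SMALL LETTER O WITH DIAERESIS
}

# Hypothetical exclusion list: the few characters someone remembered to ban.
EXCLUDED = {
    " ",       # SPACE
    ".",       # FULL STOP
    "\u2502",  # BOX DRAWINGS LIGHT VERTICAL
}


def allowed_by_inclusion(label: str) -> bool:
    """Accept a label only if every code point is explicitly permitted."""
    return len(label) > 0 and all(ch in PERMITTED for ch in label)


def allowed_by_exclusion(label: str) -> bool:
    """Accept a label unless it contains an explicitly banned code point."""
    return len(label) > 0 and not any(ch in EXCLUDED for ch in label)
```

Under these assumed lists, a label containing U+2551 (BOX DRAWINGS DOUBLE VERTICAL) passes the exclusion check, because nobody listed that particular code point, but fails the inclusion check. Characters that have not been reviewed are rejected by default under inclusion and accepted by default under exclusion.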
1159 It is worth noting that sufficiently extreme changes to IDNA would 1160 require a new Punycode prefix, probably with long-term support for 1161 both the old prefix and the new one in both registration arrangements 1162 and applications. An alternative, which is almost certainly 1163 impractical, would be some sort of "flag day", i.e., a date on which 1164 the old rules are simultaneously abandoned by everyone and the new 1165 ones adopted. However, preliminary analysis indicates that few, if 1166 any, of the changes recommended for consideration elsewhere in this 1167 document would require this type of version change. For example, 1168 additional restrictions on what can be registered may require policy 1169 decisions about actions to be taken with regard to labels that 1170 conformed to earlier rules but not to new ones, but not changes in 1171 the protocol or prefix. 1173 4.1.2. Non-DNS and Above-DNS Internationalization Approaches 1175 The IETF should once again examine the extent to which it is 1176 appropriate to try to solve internationalization problems via the DNS 1177 and what place the many varieties of so-called "keyword systems" or 1178 other Internet navigational techniques might have. Those techniques 1179 can be designed to impose fewer constraints, or at least different 1180 constraints, than IDNA and the DNS. As discussed elsewhere in this 1181 document, IDNA cannot support information about scripts, languages, 1182 or Unicode versions on lookup. As a consequence of the nature of DNS 1183 lookups, characters and labels either match or do not match; a near- 1184 match is simply not a possible concept in the DNS. By contrast, 1185 observation of near-matching is common in human communication and in 1186 matching operations performed by people, especially when they have a 1187 particular script or language context in mind.
The DNS is further 1188 constrained by a fairly rigid internal aliasing system (via CNAME and 1189 DNAME resource records), while some applications of international 1190 naming may require more flexibility. Finally, the rigid hierarchy of 1191 the DNS --and the tendency in practice for it to become flat at 1192 levels nearest the root-- and the need for names to be unique are 1193 more suitable for some purposes than others and may not be a good 1194 match for some purposes for which people wish to use IDNs. Each of 1195 these constraints can be relaxed or changed by one or more systems 1196 that would provide alternatives to direct use of the DNS by users. 1197 Some of the issues involved are discussed further in Section 5.3 and 1198 various ideas have been discussed in detail in the IETF or IRTF. 1199 Many of those ideas have even been described in Internet Drafts or 1200 other documents. As experience with IDNs and with expectations for 1201 them accumulates, it will probably become appropriate for the IETF or 1202 IRTF to revisit the underlying questions and possibilities. 1204 4.1.3. Security issues, certificates, etc. 1206 Some characters look like others, often as the result of common 1207 origins. The problem with these "confusable" characters, often 1208 incorrectly called homographs, has always existed when characters are 1209 presented to humans who interpret what is displayed and then make 1210 decisions based on what they see. This is not a problem that 1211 exists only when working with internationalized domain names, but IDNs 1212 make the problem worse. The results of a survey that would explain 1213 what the problems are might be interesting. Many of these issues are 1214 mentioned in Unicode Technical Report #36 [UTR36]. 1216 In this and other issues associated with IDNs, precise use of 1217 terminology is important lest even more confusion result.
The 1218 definition of the term 'homograph' that normally appears in 1219 dictionaries and linguistic texts states that homographs are 1220 different words which are spelled identically (for example, the 1221 adjective 'brief' meaning short, the noun 'brief' meaning a document, 1222 and the verb 'brief' meaning to inform). By definition, letters in 1223 two different alphabets are not the same, regardless of similarities 1224 in appearance. This means that sequences of letters from two 1225 different scripts that appear to be identical on a computer display 1226 cannot be homographs in the accepted sense, even if they are both 1227 words in the dictionary of some language. Assuming that there is a 1228 language written with Cyrillic script in which "cap" is a word, 1229 regardless of what it might mean, it is not a homograph of the Latin- 1230 script English word "cap". 1232 When the security implications of visually confusable characters were 1233 brought to the forefront in 2005, the term homograph was used to 1234 designate any instance of graphic similarity, even when comparing 1235 individual characters. This usage is not only incorrect, but risks 1236 introducing even more confusion and hence should be avoided. The 1237 current preferred terminology is to describe these similar-looking 1238 characters as "confusable characters" or even "confusables". 1240 Many people have suggested that confusable characters are a problem 1241 that must be addressed, at least in part, directly in the user 1242 interfaces of application software. While it should almost certainly 1243 be part of a complete solution, that approach creates its own set of 1244 difficulties. For example, a user switching between systems, or even 1245 between applications on the same system, may be surprised by 1246 different types of behavior and different levels of protection. In 1247 addition, it is unclear how a secure setup for the end user should be 1248 designed.
Today, in the web browser, a padlock is a traditional way 1249 of describing some level of security for the end user. Is this 1250 binary signaling enough? Should there be any connection between a 1251 risk for a displayed string including confusable characters and the 1252 padlock or similar signaling to the user? 1254 Many web browsers have adopted a convention, based on a "whitelist" 1255 or similar technique, of restricting the display of native characters 1256 to subdomains of top-level domains that are deemed to have safe 1257 practices for the registration of potentially confusable labels. 1258 IDNs in other domains are displayed as Punycode. These techniques 1259 may not be sufficiently sensitive to differences in policies among 1260 top-level domains and their subdomains and so, while they are clearly 1261 helpful, they may not be adequate. Are other methods of dealing with 1262 confusable characters possible? Would other methods of identifying 1263 and listing policies about avoiding confusing registrations be 1264 feasible and helpful? 1266 It would be interesting to see a more coordinated effort in 1267 establishing guidelines for user interfaces. If nothing else, the 1268 current whitelists are browser specific and both can, and do, differ 1269 between implementations. 1271 4.1.4. Protocol Changes and Policy Implications 1273 Some potential protocol or table changes raise important policy 1274 issues about what to do with existing, registered, names. Should 1275 such changes be needed, their impact must be carefully evaluated in 1276 the IETF, ICANN, and possibly other forums. In particular, protocol 1277 or policy changes that would not permit existing, registered, names 1278 to be registered under the newer rules should be considered 1279 carefully, balancing their importance against possible disruption and 1280 the issues of invalidating older names against the importance of 1281 consistency as seen by the user. 1283 4.1.5. 
Non US-ASCII in local part of email addresses 1285 Work is going on in the IETF related to the local part of email 1286 addresses. It should be noted that the local part of email addresses 1287 has very different syntax and constraints from a domain name label, 1288 so directly applying IDNA to the local part is not possible. 1290 4.1.6. Use of the Unicode Character Set in the IETF 1292 Unicode, and the closely-related ISO 10646, are the only coded 1293 character sets that aspire to include all of the world's characters. 1294 As such, they permit use of international characters without having 1295 to identify particular character coding standards or tables. The 1296 requirement for a single character set is particularly important for 1297 use with the DNS since there is no place to put character set 1298 identification. The decision to use Unicode as the base for IETF 1299 protocols going forward is discussed in [RFC2277]. The IAB does not 1300 see any reason to revisit the decision to use Unicode in IETF 1301 protocols. 1303 4.2. Issues that fall within the purview of ICANN 1305 4.2.1. Dispute resolution 1307 IDN creates new types of collisions between trademarks and domain 1308 names as well as collisions between domain names. These have impact 1309 on dispute resolution processes used by registries and otherwise. It 1310 is important that deployment of IDN evolve in parallel with review 1311 and updating of ICANN or registry-specific dispute resolution 1312 processes. 1314 4.2.2. Policy at registries 1316 The IAB recommends that registries use an inclusion-based model when 1317 choosing what characters to allow at the time of registration. This 1318 list of characters is in turn to be a subset of what is allowed 1319 according to the updated IDNA standard. The IAB further recommends 1320 that registries develop their inclusion-based models in parallel with 1321 dispute resolution processes at the registry itself.
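The two-level arrangement recommended above, in which a registry table must itself be a subset of what the protocol permits, can be sketched as follows. The protocol-level set, the sample registry table, and the function names are all hypothetical; an updated IDNA standard would define the real protocol-level list.

```python
# Hypothetical protocol-level inclusion set. Illustrative only.
PROTOCOL_PERMITTED = frozenset(
    "abcdefghijklmnopqrstuvwxyz0123456789-"
    "\u00e5\u00e4\u00f6\u00e9\u00fc"
)

# Hypothetical table for a registry that accepts three extra letters.
SAMPLE_REGISTRY_TABLE = frozenset(
    "abcdefghijklmnopqrstuvwxyz0123456789-\u00e5\u00e4\u00f6"
)


def outside_protocol(registry_table):
    """Return the registry-table code points the protocol does not permit."""
    return sorted(set(registry_table) - PROTOCOL_PERMITTED)


def may_register(label, registry_table):
    """Accept a label only if every code point is in both lists."""
    effective = set(registry_table) & PROTOCOL_PERMITTED
    return len(label) > 0 and all(ch in effective for ch in label)
```

With these assumed tables, `outside_protocol(SAMPLE_REGISTRY_TABLE)` is empty, so the table is a valid subset. A label containing U+00E9 is refused by this registry even though the assumed protocol set permits that character, which is the kind of local narrowing the recommendation describes.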
1323 Most established policies for dealing with claimed or apparent 1324 confusion or conflicts of names are based on dispute resolution. 1325 Decisions about legitimate use or registration of one or more names 1326 are resolved at or after the time of registration on a case-by-case 1327 basis and using policies that are specific to the particular DNS zone 1328 or jurisdiction involved. These policies have generally not been 1329 extended below the level of the DNS that is directly controlled by 1330 the top-level registry. 1332 Because of the number of conflicts that can be generated by the 1333 larger number of available and confusable characters in Unicode, we 1334 recommend that registration-restriction and dispute resolution 1335 policies be developed to constrain IDN registrations by registries 1336 and zone administrators at all levels of the DNS tree. Of course, 1337 many of these policies will be less formal than others and there is 1338 no requirement for complete global consistency, but the arguments for 1339 reduction of confusable characters and other issues in TLDs should 1340 apply to all zones below that specific TLD. 1342 Consistency across all zones can obviously only be accomplished by 1343 changes to the protocols. Such changes should be considered by the 1344 IETF if particular restrictions are identified that are important and 1345 consistent enough to be applied globally. 1347 Some potential protocol changes or changes to character-mapping 1348 tables might, if adopted, have profound registry policy 1349 implications. See Section 4.1.4. 1351 4.2.3. IDN TLDs 1353 The IAB has concluded that there is not one IDN TLD issue but at 1354 least three very separate ones: 1356 o If IDN entries are to be made in the root zone, decisions must 1357 first be made about how these TLDs are to be named and delegated. 1358 These decisions fall within the traditional IANA scope and are 1359 ICANN issues today.
1360 o There has been discussion of permitting some or all existing TLDs 1361 to be referenced by multiple labels, with those labels presumably 1362 representing some understanding of the "name" of the TLD in 1363 different languages. If actual aliases of this type are desired 1364 for existing domains, the IETF may need to consider whether the 1365 use of DNAME records in the root is appropriate to meet that need, 1366 what constraints, if any, are needed, whether alternate 1367 approaches, such as those of [RFC4185], are appropriate or whether 1368 further alternatives should be investigated. But, to the extent 1369 to which aliases are considered desirable and feasible, decisions 1370 presumably must be made as to which, if any, root IDN labels 1371 should be associated with DNAME records and which ones should be 1372 handled by normal delegation records or other mechanisms. That 1373 decision is one of DNS root-level namespace policy and hence falls 1374 to ICANN although we would expect ICANN to pay careful attention 1375 to any technical, operational, or security recommendations that 1376 may be produced by other bodies. 1377 o Finally, if IDN labels are to be placed in the root zone, there 1378 are issues associated with how they are to be encoded and 1379 deployed. This area may have implications for work that has been 1380 done, or should be done, in the IETF. 1382 5. Specific Recommendations for Next Steps 1384 Consistent with the framework described above, the IAB offers these 1385 recommendations as steps for further consideration in the identified 1386 groups. 1388 5.1. Reduction of permitted character list 1390 Generalize from the original "hostname" rules to non-ASCII 1391 characters, permitting as few characters as possible to do that job. 1392 This would involve a restrictive model for characters permitted in 1393 IDN labels, thus contrasting with the approach used to develop the 1394 original IDNA/Nameprep tables. 
That approach was to include all 1395 Unicode characters that there was not a clear reason to exclude. 1397 The specific recommendation here is to specify such internationalized 1398 hostnames. Such an activity would fall to the IETF, although the 1399 task of developing the appropriate list of permitted characters will 1400 require effort both in the IETF and elsewhere. The effort should be 1401 as linguistically and culturally sensitive as possible, but smooth 1402 and effective operation of the DNS, including minimizing of 1403 complexity, should be primary goals. The following should be 1404 considered as possible mechanisms for achieving an appropriate 1405 minimum number of characters. 1407 5.1.1. Elimination of all non-language characters 1409 Unicode characters that are not needed to write words or numbers in 1410 any of the world's languages should be eliminated from the list of 1411 characters that are appropriate in DNS labels. In addition to such 1412 characters as those used for box-drawing and sentence punctuation, 1413 this should exclude punctuation for word structure and other 1414 delimiters: while DNS labels may conveniently be used to express 1415 words in many circumstances, the goal is not to express words (or 1416 sentences or phrases), but to permit the creation of unambiguous 1417 labels with good mnemonic value. 1419 5.1.2. Elimination of word-separation punctuation 1421 The inclusion of the hyphen in the original hostname rules is a 1422 historical artifact from an older, flat, name space. The community 1423 should consider whether it is appropriate to treat it as a simple 1424 legacy property of ASCII names and not attempt to generalize it to 1425 other scripts. We might, for example, not permit claimed equivalents 1426 to the hyphen from other scripts to be used in IDNs. We might even 1427 consider banning use of the hyphen itself in non-ASCII strings or, 1428 less restrictively, strings that contained non-Latin characters. 1430 5.2. 
Updating to new versions of Unicode 1432 As new scripts, to support new languages, continue to be added to 1433 Unicode, it is important that IDNA track updates. If it does not do 1434 so, but remains "stuck" at 3.2 or some single later version, it will 1435 not be possible to include labels in the DNS that are derived from 1436 words in languages that require characters that are available only in 1437 later versions. Making those upgrades is difficult, and will 1438 continue to be difficult, as long as new versions require, not just 1439 addition of characters, but changes to canonicalization conventions, 1440 normalization tables, or matching procedures (see Section 3.1). 1441 Anything that can be done to lower complexity and simplify forward 1442 transitions should be seriously considered. 1444 5.3. Role and Uses of the DNS 1446 We wish to remind the community that there are boundaries to the 1447 appropriate uses of the DNS. It was designed and implemented to 1448 serve some specific purposes. There are additional things that it 1449 does well, other things that it does badly, and still other things it 1450 cannot do at all. No amount of protocol work on IDNs will solve 1451 problems with alternate spellings, near-matches, searching for 1452 appropriate names, and so on. Registration restrictions and 1453 carefully-designed user interfaces can be used to reduce the risk and 1454 pain of attempts to do some of these things gone wrong, as well as 1455 reducing the risks of various sorts of deliberate bad behavior, but, 1456 beyond a certain point, use of the DNS simply because it is available 1457 becomes a bad tradeoff. The tradeoff may be particularly unfortunate 1458 when the use of IDNs does not actually solve the proposed problem. 1459 For example, internationalization of DNS names does not eliminate the 1460 ASCII protocol identifiers and structure of URIs [RFC3986] and even 1461 IRIs [RFC3987].
Hence, DNS internationalization itself, at any or 1462 all levels of the DNS tree, is not a sufficient response to the 1463 desire of populations to use the Internet entirely in their own 1464 languages and the characters associated with those languages. 1466 These issues are discussed at more length, and alternatives 1467 presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices]. 1469 5.4. Databases of Registered Names 1471 In addition to their presence in the DNS, IDNs introduce issues in 1472 other contexts in which domain names are used. In particular, the 1473 design and content of databases that bind registered names to 1474 information about the registrant (commonly described as "whois" 1475 databases) will require review and updating. For example, the whois 1476 protocol itself [RFC3912] has no standard capability for handling 1477 non-ASCII text: one cannot search consistently for, or report, either 1478 a DNS name or contact information that is not in ASCII characters. 1479 This may provide some additional impetus for a switch to IRIS 1480 [RFC3981] [RFC3982] but also raises a number of other questions about 1481 what information, and in what languages and scripts, should be 1482 included or permitted in such databases. 1484 6. Security Considerations 1486 This document is simply a discussion of IDNs and IDN issues; it 1487 raises no new security concerns. However, if some of its 1488 recommendations to reduce IDNA complexity, the number of available 1489 characters, and various approaches to constraining the use of 1490 confusable characters, are followed and prove successful, the risks 1491 of name spoofing and other problems may be reduced. 1493 7. Acknowledgments 1495 The contributions to this report from members of the IAB-IDN ad hoc 1496 committee are gratefully acknowledged. Of course, not all of the 1497 members of that group endorse every comment and suggestion of this 1498 report. 
In particular, this report does not claim to reflect the 1499 views of the Unicode Consortium as a whole or those of particular 1500 participants in the work of that Consortium. The members of the ad 1501 hoc committee were: 1503 Rob Austein, Leslie Daigle, Tina Dam, Mark Davis, Patrik Faltstrom, 1504 Scott Hollenbeck, Cary Karp, John Klensin, Gervase Markham, David 1505 Meyer, Thomas Narten, Michael Suignard, Sam Weiler, Bert Wijnen, Kurt 1506 Zeilenga and Lixia Zhang. 1508 Thanks are due to Tina Dam and others associated with the ICANN IDN 1509 Working Group for contributions of considerable specific text, to 1510 Marcos Sanz and Paul Hoffman for careful late-stage reading and 1511 extensive comments, and to Pete Resnick for many contributions and 1512 comments, both in conjunction with his former IAB service and 1513 subsequently. Olaf M. Kolkman took over IAB leadership for this 1514 document after Patrik Faltstrom and Pete Resnick stepped down in 1515 March 2006. 1517 Members of the IAB at the time of approval of this document were: 1518 [[anchor40: To be supplied]] 1520 8. Change History 1522 [[anchor42: RFC Editor: this section is to be removed before 1523 publication]] 1525 8.1. Changes for version -01 1527 1. Added discussion and reference to Unicode PR-29 1528 2. Replaced the discussion of the ICANN Guidelines (with thanks to 1529 Tina Dam and Cary Karp). 1530 3. Revised the Bidi text to make the potential recommendation more 1531 clear. 1532 4. Removed any claims (actual or implied) of endorsement by the 1533 members of the ad hoc committee. 1534 5. Several small editorial changes, etc. 1536 8.2. Changes for version -02 1538 1. Added some additional references, e.g., to W3C 1539 internationalization work and to UTR39. 1540 2. Adjusted some terminology to correct errors and avoid unnecessary 1541 controversy. 1542 3. Extended the discussion of related characters in Swedish and 1543 Norwegian to clarify at least one of the possibilities 1544 4. 
Introduced new Section 5.4 to discuss IDN issues in other than 1545 the DNS itself and point to IRIS. 1546 5. Rewrote the introduction to the "problem" section and its first 1547 subsection. 1548 6. Small changes made to the "definitions" section including 1549 explaining why "multilingual" is there and rewriting the "script" 1550 definition to clarify slightly and put the example script names 1551 into alphabetical order. 1552 7. Section 4.2.3 has been fairly extensively rewritten for clarity, 1553 and a large number of less extensive clarifications have been 1554 made, although no substantive changes have (intentionally) been 1555 made. 1557 8.3. Changes for Version -03 1559 1. Made a number of further tuning changes to better reflect the 1560 role of the document and corrected several references. 1561 2. Removed the reference to Vietnamese. 1562 3. Added a discussion of IDNA versioning and new prefixes. 1564 8.4. Changes for version -04 1566 1. Corrected many small typographical and editorial errors. 1567 2. Clarified that elimination of non-language characters was not 1568 intended to eliminate digits. 1570 8.5. Changes for version -05 1572 1. Revised section 4.3 to further clarify the suggestion. 1573 2. Revised the Acknowledgments section 1575 8.6. Changes for version -06 1577 1. New subsection added to the Introduction to put the document into 1578 better context. 1579 2. New introduction to Section 2.1. 1580 3. Several small changes to the Normalization section to further 1581 clarify that issue. 1582 4. Split out Unicode upgrades from other material, in the process 1583 revising the notorious section 4.3 and giving it additional 1584 context. 1585 5. Acknowledgments updated. 1586 6. Many small editorial and clarification corrections. 1588 9. References 1590 9.1.
Normative References 1592 [ISO10646] 1593 International Organization for Standardization, 1594 "Information Technology - Universal Multiple-Octet Coded 1595 Character Set (UCS) - Part 1: Architecture and Basic 1596 Multilingual Plane", ISO/IEC 10646-1:2000, October 2000. 1598 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 1599 Internationalized Strings ("stringprep")", RFC 3454, 1600 December 2002. 1602 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1603 "Internationalizing Domain Names in Applications (IDNA)", 1604 RFC 3490, March 2003. 1606 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1607 Profile for Internationalized Domain Names (IDN)", 1608 RFC 3491, March 2003. 1610 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 1611 for Internationalized Domain Names in Applications 1612 (IDNA)", RFC 3492, March 2003. 1614 [Unicode32] 1615 The Unicode Consortium, "The Unicode Standard, Version 1616 3.0", 2000. 1618 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5). 1619 Version 3.2 consists of the definition in that book as 1620 amended by the Unicode Standard Annex #27: Unicode 3.1 1621 (http://www.unicode.org/reports/tr27/) and by the Unicode 1622 Standard Annex #28: Unicode 3.2 1623 (http://www.unicode.org/reports/tr28/). 1625 9.2. Informative References 1627 [DNS-Choices] 1628 Faltstrom, P., "Design Choices When Expanding DNS", 1629 draft-iab-dns-choices-02 (work in progress), June 2005. 1631 [ICANNv1] ICANN, "Guidelines for the Implementation of 1632 Internationalized Domain Names, Version 1.0", March 2003, 1633 . 1635 [ICANNv2] ICANN, "Guidelines for the Implementation of 1636 Internationalized Domain Names, Version 2.0", 1637 November 2005, 1638 . 1640 [IESG-IDN] 1641 Internet Engineering Steering Group (IESG), "IESG 1642 Statement on IDN", IESG Statements IDN Statement, 1643 February 2003, 1644 .
1646 [INDNS] National Research Council, "Signposts in Cyberspace: The 1647 Domain Name System and Internet Navigation", National 1648 Academy Press ISBN 0309-09640-5 (Book) 0309-54979-5 (PDF), 1649 2005, 1650 . 1652 [ISO.2022.1986] 1653 International Organization for Standardization, 1654 "Information Processing: ISO 7-bit and 8-bit coded 1655 character sets: Code extension techniques", ISO Standard 1656 2022, 1986. 1658 [ISO.646.1991] 1659 International Organization for Standardization, 1660 "Information technology - ISO 7-bit coded character set 1661 for information interchange", ISO Standard 646, 1991. 1663 [ISO.8859.2003] 1664 International Organization for Standardization, 1665 "Information processing - 8-bit single-byte coded graphic 1666 character sets - Part 1: Latin alphabet No. 1 (1998) - 1667 Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin 1668 alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4 1669 (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6: 1670 Latin/Arabic alphabet (1999) - Part 7: Latin/Greek 1671 alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) - 1672 Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin 1673 alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet 1674 (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14: 1675 Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin 1676 alphabet No. 9 (1999) - Part 16: Latin alphabet 1677 No. 10 (2001)", ISO Standard 8859, 2003. 1679 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1680 Languages", BCP 18, RFC 2277, January 1998. 1682 [RFC2825] IAB and L. Daigle, "A Tangled Web: Issues of I18N, Domain 1683 Names, and the Other Internet protocols", RFC 2825, 1684 May 2000. 1686 [RFC3066] Alvestrand, H., "Tags for the Identification of 1687 Languages", BCP 47, RFC 3066, January 2001. 1689 [RFC3467] Klensin, J., "Role of the Domain Name System (DNS)", 1690 RFC 3467, February 2003.
1692 [RFC3536] Hoffman, P., "Terminology Used in Internationalization in 1693 the IETF", RFC 3536, May 2003. 1695 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint 1696 Engineering Team (JET) Guidelines for Internationalized 1697 Domain Names (IDN) Registration and Administration for 1698 Chinese, Japanese, and Korean", RFC 3743, April 2004. 1700 [RFC3912] Daigle, L., "WHOIS Protocol Specification", RFC 3912, 1701 September 2004. 1703 [RFC3981] Newton, A. and M. Sanz, "IRIS: The Internet Registry 1704 Information Service (IRIS) Core Protocol", RFC 3981, 1705 January 2005. 1707 [RFC3982] Newton, A. and M. Sanz, "IRIS: A Domain Registry (dreg) 1708 Type for the Internet Registry Information Service 1709 (IRIS)", RFC 3982, January 2005. 1711 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1712 Resource Identifier (URI): Generic Syntax", STD 66, 1713 RFC 3986, January 2005. 1715 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource 1716 Identifiers (IRIs)", RFC 3987, January 2005. 1718 [RFC4185] Klensin, J., "National and Local Characters for DNS Top 1719 Level Domain (TLD) Names", RFC 4185, October 2005. 1721 [RFC4290] Klensin, J., "Suggested Practices for Registration of 1722 Internationalized Domain Names (IDN)", RFC 4290, 1723 December 2005. 1725 [UTR] Unicode Consortium, "Unicode Technical Reports", 1726 . 1728 [UTR36] Davis, M. and M. Suignard, "Unicode Technical Report #36: 1729 Unicode Security Considerations", November 2005, 1730 . 1732 Working Draft for Proposed Update 1734 [UTR39] Davis, M. and M. Suignard, "Unicode Technical Standard #39 1735 (proposed): Unicode Security Considerations", July 2005, 1736 . 1738 Working Draft for Proposed Draft 1740 [Unicode-PR29] 1741 The Unicode Consortium, "Public Review Issue #29: 1742 Normalization Issue", Unicode PR 29, February 2004. 1744 [Unicode10] 1745 The Unicode Consortium, "The Unicode Standard, Version 1746 1.0", 1991. 1748 [W3C-Localization] 1749 Ishida, R. and S. 
Miller, "Localization vs.
1750               Internationalization", W3C International/questions/
1751               qa-i18n.txt, December 2005.

1753    [ltru-initial]
1754               Ewell, D., Ed., "Initial Language Subtag Registry",
1755               draft-ietf-ltru-initial-06 (work in progress),
1756               February 2004.

1758               This document is awaiting publication as an Informational
1759               RFC.

1761    [ltru-registry]
1762               Phillips, A., Ed. and M. Davis, Ed., "Tags for Identifying
1763               Languages", draft-ietf-ltru-registry-14 (work in
1764               progress), October 2004.

1766               This document has been approved as a Proposed Standard and
1767               is awaiting publication as an RFC.

1769    [net-utf8]
1770               Klensin, J. and M. Padlipsky, "Unicode Format for Network
1771               Interchange",
1772               InternetDraft draft-klensin-net-utf8-00f.txt, April 2006.

1774 Authors' Addresses

1776    John C Klensin
1777    1770 Massachusetts Ave, #322
1778    Cambridge, MA  02140
1779    USA

1781    Phone: +1 617 491 5735
1782    Email: john-ietf@jck.com

1784    Patrik Faltstrom
1785    Cisco Systems

1787    Email: paf@cisco.com

1789    Cary Karp
1790    Swedish Museum of Natural History
1791    Box 50007
1792    Stockholm  SE-10405
1793    Sweden

1795    Phone: +46 8 5195 4055
1796    Email: ck@nic.museum

1798    IAB

1800    Email: iab@iab.org

1802 Intellectual Property Statement

1804    The IETF takes no position regarding the validity or scope of any
1805    Intellectual Property Rights or other rights that might be claimed to
1806    pertain to the implementation or use of the technology described in
1807    this document or the extent to which any license under such rights
1808    might or might not be available; nor does it represent that it has
1809    made any independent effort to identify any such rights.  Information
1810    on the procedures with respect to rights in RFC documents can be
1811    found in BCP 78 and BCP 79.
1813    Copies of IPR disclosures made to the IETF Secretariat and any
1814    assurances of licenses to be made available, or the result of an
1815    attempt made to obtain a general license or permission for the use of
1816    such proprietary rights by implementers or users of this
1817    specification can be obtained from the IETF on-line IPR repository at
1818    http://www.ietf.org/ipr.

1820    The IETF invites any interested party to bring to its attention any
1821    copyrights, patents or patent applications, or other proprietary
1822    rights that may cover technology that may be required to implement
1823    this standard.  Please address the information to the IETF at
1824    ietf-ipr@ietf.org.

1826 Disclaimer of Validity

1828    This document and the information contained herein are provided on an
1829    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1830    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1831    ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1832    INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1833    INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1834    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1836 Copyright Statement

1838    Copyright (C) The Internet Society (2006).  This document is subject
1839    to the rights, licenses and restrictions contained in BCP 78, and
1840    except as set forth therein, the authors retain all their rights.

1842 Acknowledgment

1844    Funding for the RFC Editor function is currently provided by the
1845    Internet Society.