idnits 2.17.1 

draft-iab-idn-nextsteps-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 19.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1834.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1811.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1818.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1824.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 12, 2006) is 6528 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Unicode-PR29' is defined on line 1740, but no
     explicit reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32'

  == Outdated reference: A later version (-08) exists of
     draft-iab-dns-choices-02

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)

  -- Obsolete informational reference (is this intentional?): RFC 3536
     (Obsoleted by RFC 6365)


     Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft
4	Expires: December 14, 2006                                  P. Faltstrom
5	                                                           Cisco Systems
6	                                                                 C. Karp
7	                                       Swedish Museum of Natural History
8	                                                                     IAB
9	                                                           June 12, 2006

11	  Review and Recommendations for Internationalized Domain Names (IDN)
12	                    draft-iab-idn-nextsteps-06.txt

14	Status of this Memo

16	   By submitting this Internet-Draft, each author represents that any
17	   applicable patent or other IPR claims of which he or she is aware
18	   have been or will be disclosed, and any of which he or she becomes
19	   aware will be disclosed, in accordance with Section 6 of BCP 79.

21	   Internet-Drafts are working documents of the Internet Engineering
22	   Task Force (IETF), its areas, and its working groups.  Note that
23	   other groups may also distribute working documents as Internet-
24	   Drafts.

26	   Internet-Drafts are draft documents valid for a maximum of six months
27	   and may be updated, replaced, or obsoleted by other documents at any
28	   time.  It is inappropriate to use Internet-Drafts as reference
29	   material or to cite them other than as "work in progress."

31	   The list of current Internet-Drafts can be accessed at
32	   http://www.ietf.org/ietf/1id-abstracts.txt.

34	   The list of Internet-Draft Shadow Directories can be accessed at
35	   http://www.ietf.org/shadow.html.

37	   This Internet-Draft will expire on December 14, 2006.

39	Copyright Notice

41	   Copyright (C) The Internet Society (2006).

43	Abstract

45	   This note describes issues raised by the deployment and use of
46	   Internationalized Domain Names.  It describes problems that arise
47	   both when names are registered and when those names are used in the
48	   DNS.  It recommends that the IETF update the IDN-related RFCs,
49	   proposes a framework to be followed in doing so, and summarizes and
50	   identifying some work that is required outside the IETF.  In
51	   particular, it proposes that some changes be investigated for the
52	   IDNA standard and its supporting tables, based on experience gained
53	   since those standards were completed.

55	Table of Contents

57	   1.  Introduction
58	     1.1.  The Role of IDNs and this document
59	     1.2.  Status of this Document and its Recommendations
60	     1.3.  The IDNA Standard
61	     1.4.  Unicode Documents
62	     1.5.  Definitions
63	       1.5.1.  Language
64	       1.5.2.  Script
65	       1.5.3.  Multilingual
66	       1.5.4.  Localization
67	       1.5.5.  Internationalization
68	     1.6.  Statements and Guidelines
69	       1.6.1.  IESG Statement
70	       1.6.2.  ICANN statements
71	   2.  General Problems and Issues
72	     2.1.  User conceptions, local character sets, and input
73	           issues
74	     2.2.  Examples of Issues
75	       2.2.1.  Language specific character matching
76	       2.2.2.  Multiple scripts
77	       2.2.3.  Normalization and Character Mappings
78	       2.2.4.  URLs in Printed Form
79	       2.2.5.  Bidirectional text
80	       2.2.6.  Confusable Character Issues
81	       2.2.7.  The IESG Statement and IDNA issues
82	   3.  Migrating to New Versions of Unicode
83	     3.1.  Versions of Unicode
84	     3.2.  Version changes and normalization issues
85	       3.2.1.  Unnormalized Combining Sequences
86	       3.2.2.  Combining Characters and Character Components
87	       3.2.3.  When does normalization occur?
88	   4.  Framework for next steps in IDN development
89	     4.1.  Issues within the scope of the IETF
90	       4.1.1.  Review of IDNA
91	       4.1.2.  Non-DNS and Above-DNS Internationalization
92	               Approaches
93	       4.1.3.  Security issues, certificates, etc.
94	       4.1.4.  Protocol Changes and Policy Implications
95	       4.1.5.  Non US-ASCII in local part of email addresses
96	       4.1.6.  Use of the Unicode Character Set in the IETF
97	     4.2.  Issues that fall within the purview of ICANN
98	       4.2.1.  Dispute resolution
99	       4.2.2.  Policy at registries
100	       4.2.3.  IDN TLDs
101	   5.  Specific Recommendations for Next Steps
102	     5.1.  Reduction of permitted character list
103	       5.1.1.  Elimination of all non-language characters
104	       5.1.2.  Elimination of word-separation punctuation
105	     5.2.  Updating to new versions of Unicode
106	     5.3.  Role and Uses of the DNS
107	     5.4.  Databases of Registered Names
108	   6.  Security Considerations
109	   7.  Acknowledgments
110	   8.  Change History
111	     8.1.  Changes for version -01
112	     8.2.  Changes for version -02
113	     8.3.  Changes for Version -03
114	     8.4.  Changes for version -04
115	     8.5.  Changes for version -05
116	     8.6.  Changes for version -06
117	   9.  References
118	     9.1.  Normative References
119	     9.2.  Informative References
120	   Authors' Addresses
121	   Intellectual Property and Copyright Statements

123	1.  Introduction

125	1.1.  The Role of IDNs and this document

127	   While IDNs have been advocated as the solution for a wide range of
128	   problems, this document is written from the perspective that they are
129	   no more and no less than DNS names, reflecting the same requirements
130	   for use, stability, and accuracy as traditional "hostnames", but
131	   using a much larger collection of permitted characters.  In
132	   particular, while IDNs represent a step toward an Internet that is
133	   equally accessible from all languages and scripts, they, at best,
134	   address only a small part of that very broad objective.  There has
135	   been controversy since IDNs were first suggested about how important
136	   they will actually turn out to be; that controversy will probably
137	   continue.  Accessibility from all languages is an important
138	   objective, hence it is important that our standards and definitions
139	   for IDNs be smoothly adaptable to additional scripts as they are
140	   added to the Unicode character set.

142	   The utility of IDNs must be evaluated in terms of their application
143	   by users and in protocols: the ability to simply put a name into the
144	   DNS and retrieve it is not, in and of itself, important.  From this
145	   point of view, IDNs will be useful and effective if they provide
146	   stable and predictable references -- references that are no less
147	   stable and predictable, and no less secure, than their ASCII
148	   counterparts.

150	   This combination of objectives and criteria has proven very difficult
151	   to satisfy.  Experience in developing the IDNA standard and during
152	   the initial years of its implementation and deployment suggests that
153	   it may be impossible to fully satisfy all of them and that
154	   engineering compromises are needed to yield a result that is
155	   workable, even if not completely satisfactory.  Based on that
156	   experience and issues that have been raised, it is now appropriate to
157	   review some of the implications of IDNs, the decisions made in
158	   defining them, and the foundation on which they rest and determine
159	   whether changes are needed and, if so, which ones.

161	   The design of the DNS itself imposes some additional constraints.  If
162	   the DNS is to remain globally interoperable, there are specific
163	   characteristics that no implementation of IDNs, or the DNS more
164	   generally, can change.  For example, because the DNS is a global
165	   hierarchical administrative namespace with only a single name at any
166	   given node, there is one and only one owner of each domain name.
167	   Also, when strings are looked up in the DNS, positive responses can
168	   only reflect exact matches: if there is no exact match, then one gets
169	   an error reply, not a list of near matches or other supplemental
170	   information.  Searches and approximate matchings are not possible.

172	   Finally, because the DNS is a distributed system where any server
173	   might cache responses, and later use those cached responses to
174	   attempt to satisfy queries before a global lookup is done, every
175	   server must use the same matching criteria.
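
The exact-match constraint described above can be illustrated with a toy
model.  This is only a sketch under stated assumptions: the zone contents,
the addresses, and the function names are hypothetical, and a real DNS
server is far more involved.  The point is that one shared matching rule
(here, ASCII case-insensitivity) either finds the name or reports an error.

```python
# Toy model of DNS exact-match lookup. ZONE, canonical() and lookup() are
# hypothetical names invented for this illustration.
ZONE = {
    "example.com": "192.0.2.1",
    "xn--bcher-kva.example": "192.0.2.2",
}


def canonical(name):
    """Apply the single matching rule every server is assumed to share:
    drop a trailing dot and fold ASCII upper case to lower case only."""
    name = name.rstrip(".")
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def lookup(name):
    """Return the stored value on an exact match, else None.
    None stands in for an error reply; near matches are never offered."""
    return ZONE.get(canonical(name))


print(lookup("EXAMPLE.com"))   # exact match after ASCII case folding
print(lookup("exampel.com"))   # one transposition away: no answer at all
```

If two servers applied different versions of `canonical`, a cached answer
from one could contradict a fresh answer from the other, which is why the
matching criteria cannot vary between servers.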

177	1.2.  Status of this Document and its Recommendations

179	   This document reviews the IDN landscape from an IETF perspective and
180	   presents the recommendations and conclusions of the IAB, based
181	   partially on input from an ad hoc committee charged with reviewing
182	   IDN issues and the path forward (see Section 7).  Its recommendations
183	   are advice to the IETF, or in a few cases to other bodies, for topics
184	   to be investigated and actions to be taken if those bodies, after
185	   their examinations, consider those actions appropriate.

187	   [[anchor4: IMPORTANT: The IAB has not yet reached consensus that this
188	   document is ready for final publication.  While considerable input
189	   from the members of the ad hoc committee went into the document, no
190	   claim is made that it represents the consensus of that group.
191	   However, the IAB concluded that it was appropriate to expose these
192	   versions, as working drafts, for community comment and feedback.
193	   Such comments should be sent to iab@iab.org.]]

195	1.3.  The IDNA Standard

197	   During 2002, the IETF completed the following RFCs that, together,
198	   define IDNs:

200	   RFC 3454 Preparation of Internationalized Strings ("Stringprep")
201	      [RFC3454].
202	      Stringprep is a generic mechanism for taking a Unicode string and
203	      converting it into a canonical format.  Stringprep itself is just
204	      a collection of rules, tables, and operations.  Any protocol or
205	      algorithm that uses it must define a "Stringprep profile", which
206	      specifies which of those rules are applied, how, and with which
207	      characteristics.

209	   RFC 3490 Internationalizing Domain Names in Applications (IDNA)
210	      [RFC3490].
211	      IDNA is the base specification in this group.  It specifies that
212	      Nameprep is used as the Stringprep profile for domain names, and
213	      that Punycode is the relevant encoding mechanism for use in
214	      generating an ASCII-compatible ("ACE") form of the name.  It also
215	      applies some additional conversions and character filtering that
216	      are not part of Nameprep.

218	   RFC 3491 Nameprep: A Stringprep Profile for Internationalized Domain
219	      Names (IDN) [RFC3491].
220	      Nameprep is one such profile.  It is designed to meet the specific
221	      needs of IDNs and, in particular, to support case-folding for
222	      scripts that support what are traditionally known as upper and
223	      lower case forms of the same letters.  The result of the Nameprep
224	      algorithm is a string containing a subset of the Unicode Character
225	      set, normalized and case folded so that case insensitive
226	      comparison can be made.

228	   RFC 3492 Punycode: A Bootstring encoding of Unicode for
229	      Internationalized Domain Names in Applications (IDNA) [RFC3492].
230	      Punycode is a mechanism for encoding a Unicode string in ASCII
231	      characters.  The characters used are the same subset of
232	      characters that are allowed in the hostname definition of DNS,
233	      i.e., the "letter, digit, and hyphen" characters, sometimes known
234	      as "LDH".
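
How the four specifications fit together can be sketched with the Python
standard library, whose "idna" and "punycode" codecs implement RFC 3490 and
RFC 3492.  The label chosen here is an assumption made for illustration;
the "xn--" prefix comes from RFC 3490, not from the summary above.

```python
import encodings.idna

label = "B\u00dcCHER"  # "BUCHER" with U-umlaut; an illustrative label

# Nameprep (RFC 3491): case-fold and normalize the Unicode label.
prepared = encodings.idna.nameprep(label)

# Punycode (RFC 3492): encode the prepared label using LDH characters only.
ace_body = prepared.encode("punycode")

# IDNA ToASCII (RFC 3490): Nameprep, then Punycode, then the ACE prefix.
ace = encodings.idna.ToASCII(label)

print(prepared)            # the case-folded Unicode form
print(ace_body)            # the bare Punycode output
print(ace)                 # the label as it would appear in the DNS
print(ace.decode("idna"))  # back to the prepared Unicode form
```

Note that the round trip returns the case-folded form, not the original
spelling: the upper-case input is not recoverable from the ACE label.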

236	1.4.  Unicode Documents

238	   Unicode is used as the base, and defining, character set for IDN.
239	   Unicode is standardized by the Unicode Consortium, and synchronized
240	   with ISO to create ISO/IEC 10646 [ISO10646].  At the time the RFCs
241	   mentioned earlier were created, Unicode was at version 3.2.  For
242	   reasons explained later, it was necessary to pick a particular, then-
243	   current, version of Unicode when IDNA was adopted.  Consequently, the
244	   RFCs are explicitly dependent on Unicode version 3.2 [Unicode32].
245	   There is, at present, no established mechanism for modifying the IDNA
246	   RFCs to use newer Unicode versions (see Section 3.1).
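
The dependency on Unicode 3.2 can be observed directly in implementations.
As a sketch, Python ships a frozen Unicode 3.2 database
(`unicodedata.ucd_3_2_0`) for its IDNA support, and its `stringprep` module
exposes the RFC 3454 tables.  The character below is chosen only as an
example of one added to Unicode after version 3.2.

```python
import stringprep
import unicodedata

ucd32 = unicodedata.ucd_3_2_0  # frozen Unicode 3.2 database
ch = "\u1e9e"                  # LATIN CAPITAL LETTER SHARP S, added after 3.2

print(ucd32.unidata_version)       # version of the frozen database
print(ucd32.category(ch))          # general category as seen by Unicode 3.2
print(unicodedata.category(ch))    # general category in the current database
print(stringprep.in_table_a1(ch))  # RFC 3454 table A.1: unassigned in 3.2
```

A code point that current Unicode treats as an ordinary letter is simply
unassigned from the point of view of the tables that IDNA relies on.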

248	   Unicode is a very large and complex character set.  (The term
249	   "character set" or "charset" is used in a way that is peculiar to the
250	   IETF and may not be the same as the usage in other bodies and
251	   contexts.)  The Unicode Standard and related documents are created
252	   and maintained by the Unicode Technical Committee (UTC), one of the
253	   committees of the Unicode Consortium.

255	   The Consortium first published The Unicode Standard [Unicode10] in
256	   1991, and continues to develop standards based on that original work.
257	   Unicode is developed in conjunction with the International
258	   Organization for Standardization, and it shares its character
259	   repertoire with ISO/IEC 10646.  Unicode and ISO/IEC 10646 function
260	   equivalently as character encodings, but The Unicode Standard
261	   contains much more information for implementers, covering -- in depth
262	   -- topics such as bitwise encoding, collation, and rendering.  The
263	   Unicode Standard enumerates a multitude of character properties,
264	   including those needed for supporting bidirectional text.  The
265	   Unicode Consortium and ISO standards do use slightly different
266	   terminology.

268	1.5.  Definitions

270	   The following terms and their meanings are critical to understanding
271	   the rest of this document and to discussions of IDNs more generally.
272	   These terms are derived from [RFC3536], which contains additional
273	   discussion of some of them.

275	1.5.1.  Language

277	   A language is a way that humans interact.  The use of language occurs
278	   in many forms, including speech, writing, and signing.

280	   Some languages have a close relationship between the written and
281	   spoken forms, while others have a looser relationship.  RFC 3066
282	   [RFC3066] discusses languages in more detail and provides identifiers
283	   for languages for use in Internet protocols.  Computer languages are
284	   explicitly excluded from this definition.  The most recent IETF work
285	   in this area, and on script identification (see below), is documented
286	   in [ltru-registry] and [ltru-initial].

288	1.5.2.  Script

290	   A script is a set of graphic characters used for the written form of
291	   one or more languages.  This definition is the one used in
292	   [ISO10646].

294	   Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called
295	   ideographs used in writing Chinese, Japanese, and Korean), and
296	   "Latin".  Arabic, Greek, and Latin are, of course, also names of
297	   languages.

299	   Historically, the script that is known as "Latin" in Unicode and most
300	   contexts associated with information technology standards is known in
301	   the linguistic community as "Roman" or "Roman-derived".  The latter
302	   terminology distinguishes between the Latin language and the
303	   characters used to write it, especially in Republican times, from the
304	   much richer and more decorated script derived and adapted from those
305	   characters.  Since IDNA is defined using Unicode and that standard
306	   uses the term "LATIN" in its character names and descriptions, that
307	   terminology will be used in this document as well except when "Roman-
308	   derived" is needed for clarity.  However, readers approaching this
309	   document from a cultural or linguistic standpoint should be aware
310	   that the use of, or references to, "Latin script" in this document
311	   refers to the entire collection of Roman-derived characters, not just
312	   the characters used to write the Latin language.  Some other issues
313	   with script identification and relationships with other standards are
314	   discussed in [ltru-registry].

316	1.5.3.  Multilingual

318	   The term "multilingual" has many widely-varying definitions and thus
319	   is not recommended for use in standards.  Some of the definitions
320	   relate to the ability to handle international characters; other
321	   definitions relate to the ability to handle multiple charsets; and
322	   still others relate to the ability to handle multiple languages.

324	   While this term has been deprecated for IETF-related uses and does
325	   not otherwise appear in this document, a discussion here seemed
326	   appropriate since the term is still widely used in some discussions
327	   of IDNs.

329	1.5.4.  Localization

331	   Localization is the process of adapting an internationalized
332	   application platform or application to a specific cultural
333	   environment.  In localization, the same semantics are preserved while
334	   the syntax or presentation forms may be changed.

336	   Localization is the act of tailoring an application for a different
337	   language or script or culture.  Some internationalized applications
338	   can handle a wide variety of languages.  Typical users only
339	   understand a small number of languages, so the program must be
340	   tailored to interact with users in just the languages they know.

342	   Somewhat different definitions for localization and
343	   internationalization (see below) are used by groups other than the
344	   IETF.  See [W3C-Localization] for one example.

346	1.5.5.  Internationalization

348	   In the IETF, the term "internationalization" is used to describe
349	   adding or improving the handling of non-ASCII text in a protocol.
350	   Other bodies use the term in other ways, often with subtle variation
351	   in meaning.  The term "internationalization" is often abbreviated
352	   "i18n" (and localization as "l10n").

354	   Many protocols that handle text only handle the characters associated
355	   with one script (often, a subset of the characters used in writing
356	   English text), or leave the question of what character set is used up
357	   to local guesswork (which leads, of course, to interoperability
358	   problems).  Adding non-ASCII text to such a protocol allows the
359	   protocol to handle more scripts, with the intention of being able to
360	   include all of the scripts that are useful in the world.  It should
361	   be noted that many English words cannot be written in ASCII, various
362	   mythologies notwithstanding.

364	1.6.  Statements and Guidelines

366	   When the IDN RFCs were published, IESG and ICANN made statements that
367	   were intended to guide deployment and future work.  In recent months,
368	   ICANN has updated its statement and others have also made
369	   contributions.  It is worth noting that the quality of understanding
370	   of internationalization issues as applied to the DNS has evolved
371	   considerably over the last few years.  Organizations that took
372	   specific positions a year or more ago might not make exactly the same
373	   statements today.

375	1.6.1.  IESG Statement

377	   The IESG made a statement on IDNA [IESG-IDN]:

379	       IDNA, through its requirement of Nameprep [RFC3491], uses
380	       equivalence tables that are based only on the characters
381	       themselves; no attention is paid to the intended language (if any)
382	       for the domain name. However, for many domain names, the intended
383	       language of one or more parts of the domain name actually does
384	       matter to the users.

386	       Similarly, many names cannot be presented and used without
387	       ambiguity unless the scripts to which their characters belong are
388	       known. In both cases, this additional information should be of
389	       concern to the registry.

391	   The statement is longer than this, but these paragraphs are the
392	   important ones.  The rest of the statement consists of explanations
393	   and examples.
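
The IESG's point that Nameprep's tables look only at characters, never at
language, can be seen in a short sketch using Python's RFC 3490 codec.  The
German word is an assumption chosen for illustration; the mapping of U+00DF
to "ss" is part of the Nameprep tables and is applied whatever language the
registrant had in mind.

```python
import encodings.idna

german = "stra\u00dfe"  # "strasse" written with U+00DF (sharp s)

# Nameprep maps U+00DF to "ss" for every label, in every language.
prepared = encodings.idna.nameprep(german)

# The resulting label is pure ASCII, so no ACE encoding is applied.
encoded = (german + ".example").encode("idna")

print(prepared)
print(encoded)
```

Two spellings that a user may regard as distinct therefore resolve to the
same DNS name, and nothing in the protocol records which one was intended.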

395	1.6.2.  ICANN statements

397	1.6.2.1.  Initial ICANN Guidelines

399	   Soon after the IDNA standard was adopted, ICANN produced an initial
400	   version of its "IDN Guidelines" [ICANNv1].  This document was
401	   intended to serve two purposes.  The first was to provide a basis for
402	   releasing the gTLD registries that had been established by ICANN from
403	   a contractual restriction on the registration of labels containing
404	   hyphens in the third and fourth positions.  The second was to provide
405	   a general framework for the development of registry policies for the
406	   implementation of IDN.

408	   One of the key components of this framework was prescribing strict
409	   compliance with RFCs 3490, 3491, and 3492.  These specifications
410	   established the ACE (ASCII-Compatible Encoding) scheme for IDN use,
411	   known as "Punycode", and the various rules for its use.  The
412	   specifications designated Punycode, supported by those rules, as the
413	   sole such encoding to be used with the DNS.
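
The contractual restriction mentioned above concerned labels with hyphens
in the third and fourth positions, which is exactly the shape of an ACE
label.  A minimal sketch of both checks follows; the function names are
hypothetical, and the "xn--" prefix is taken from RFC 3490 rather than from
the text above.

```python
ACE_PREFIX = "xn--"  # ACE prefix assigned by RFC 3490


def has_hyphens_in_positions_3_and_4(label):
    """True if the third and fourth characters are both hyphens."""
    return label[2:4] == "--"


def is_ace_label(label):
    """True if the label carries the IDNA ACE prefix (case-insensitive)."""
    return label.lower().startswith(ACE_PREFIX)


for candidate in ("xn--bcher-kva", "ab--cd", "a-b-c"):
    print(candidate,
          has_hyphens_in_positions_3_and_4(candidate),
          is_ace_label(candidate))
```

Every ACE label matches the first check, but not every label that matches
the first check is an ACE label.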

415	   Limitations on the characters available for inclusion in IDNs were
416	   mandated by two devices.  The first was by requiring an "inclusion-
417	   based approach (meaning that code points that are not explicitly
418	   permitted by the registry are prohibited) for identifying permissible
419	   code points from among the full Unicode repertoire."  The second
420	   device required the association of every IDN with a specific
421	   language, with additional policies also being language based:

423	   "In implementing the IDN standards, top-level domain registries will
424	   (a) associate each registered internationalized domain name with one
425	   language or set of languages,
426	   (b) employ language-specific registration and administration rules
427	   that are documented and publicly available, such as the reservation
428	   of all domain names with equivalent character variants in the
429	   languages associated with the registered domain name, and,
430	   (c) where the registry finds that the registration and administration
431	   rules for a given language would benefit from a character variants
432	   table, allow registrations in that language only when an appropriate
433	   table is available. ...  In implementing the IDN standards, top-level
434	   domain registries should, at least initially, limit any given domain
435	   label (such as a second-level domain name) to the characters
436	   associated with one language or set of languages only."
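
The "inclusion-based approach" quoted above amounts to an allow-list: a
code point is prohibited unless the registry has explicitly permitted it.
The following is a sketch only; the permitted set is a hypothetical
registry table invented for this example and is not taken from any actual
registry policy.

```python
# Hypothetical registry table: the ONLY code points this registry permits.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | set("\u00e5\u00e4\u00f6")


def rejected_code_points(label):
    """List the code points in a label that are not explicitly permitted.
    An empty list means the label passes the inclusion-based check."""
    return ["U+%04X" % ord(c) for c in label if c not in PERMITTED]


print(rejected_code_points("malm\u00f6"))    # uses only permitted code points
print(rejected_code_points("z\u00fcrich"))   # U+00FC is not in this table
```

Because each registry chose its own table for each language, the same label
could pass at one registry and fail at another.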

438	   It was left to each TLD registry to define the character repertoire
439	   it would associate with any given language.  This led to significant
440	   variation from registry to registry, with further heterogeneity in
441	   the underlying language-based IDN policies.  If the guidelines had
442	   made provision for IDN policies also being based on script, a
443	   substantial amount of the resulting ambiguity could have been
444	   avoided.  However, they did not, and the sequence of events leading
445	   to the present review of IDNA was thus triggered.

447	1.6.2.2.  ICANN Version 2 Guidelines

449	   One response of the TLD registries to what was widely perceived
450	   as a crisis situation was to invoke the mechanism described in the
451	   initial guidelines: "As the deployment of IDNs proceeds, ICANN and
452	   the IDN registries will review these Guidelines at regular intervals,
453	   and revise them as necessary based on experience."

455	   The pivotal requirement was the modification of the guidelines to
456	   permit script-based IDN policies.  Further concern was expressed
457	   about the need for realistically implementable mechanisms for the
458	   propagation of TLD registry policies into the lower levels of their
459	   name trees.  In addition to the anticipated increase of constraint on
460	   the protocol level, one obvious additional approach would be to
461	   replace the guidelines by an instrument which itself had clear status
462	   in the IETF's normative framework.  A BCP was therefore seen as the
463	   appropriate focus for longer-term effort.  The most pressing issues
464	   would be dealt with in the interim by incremental modification to the
465	   guidelines, but no need was seen for the detailed further development
466	   of those guidelines once that incremental modification was complete.

468	   The outcome of this action was a version 2.0 of the guidelines
469	   [ICANNv2] which was endorsed by the ICANN Board on November 8, 2005
470	   for a period of nine months.  The Board stated further that it "tasks
471	   the IDN working group to continue its important work and return to
472	   the board with specific IDN improvement recommendations before the
473	   ICANN Meeting in Morocco" and "supports the working group's continued
474	   action to reframe the guidelines completely in a manner appropriate
475	   for further development as a Best Current Practices (BCP) document,
476	   to ensure that the Guideline directions will be used deeper into the
477	   DNS hierarchy and within TLD's where ICANN has a lesser policy
478	   relationship."

480	   Retaining the inclusion-based approach established in version 1.0,
481	   the crucial addition to the policy framework is that:

483	   "All code points in a single label will be taken from the same script
484	   as determined by the Unicode Standard Annex #24: Script Names at
485	   http://www.unicode.org/reports/tr24.  Exception to this is
486	   permissible for languages with established orthographies and
487	   conventions that require the commingled use of multiple scripts.  In
488	   such cases, visually confusable characters from different scripts
489	   will not be allowed to co-exist in a single set of permissible
490	   codepoints unless a corresponding policy and character table is
491	   clearly defined."

493	   Additionally:

495	   "Permissible code points will not include: (a) line symbol-drawing
496	   characters (as those in the Unicode Box Drawing block), (b) symbols
497	   and icons that are neither alphanumeric nor ideographic language
498	   characters, such as typographic and pictographic dingbats, (c)
499	   characters with well-established functions as protocol elements, (d)
500	   punctuation marks used solely to indicate the structure of
501	   sentences."
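
   As an illustration only, the single-script rule quoted above can be
   sketched as follows.  The Python standard library does not expose
   the Script property of Unicode Standard Annex #24, so this sketch
   assumes that the first word of a character's Unicode name is an
   acceptable stand-in; the function names are hypothetical and not
   part of any guideline.

```python
import unicodedata

def script_guess(ch):
    # ASSUMPTION: the first word of the character name approximates the
    # UAX #24 Script property (e.g. "LATIN", "CYRILLIC", "GREEK").
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def scripts_in_label(label):
    # Digits and the hyphen are shared across scripts, so skip them.
    return {script_guess(ch) for ch in label
            if not (ch.isdigit() or ch == "-")}

def is_single_script(label):
    return len(scripts_in_label(label)) <= 1

print(is_single_script("example"))        # all Latin letters
print(is_single_script("ex\u0430mple"))   # contains CYRILLIC SMALL LETTER A
```

   A check this crude would also flag ordinary Japanese labels that mix
   Kanji and Kana, which is exactly the case the quoted exception for
   established orthographies is meant to cover.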

503	   Attention has been called to several points that are not adequately
504	   dealt with (if at all) in the version 2.0 guidelines but which ought
505	   to be included in the policy framework without waiting for the
506	   production and release of a document based on a "best practices"
507	   model.  The term "BCP" above does not necessarily refer to an IETF
508	   consensus document.  The intention in November 2005 was for the
509	   recommended major revision to be put to the ICANN Board prior to its
510	   meeting in Morocco (in late June 2006), but for the changes to be
511	   collated incrementally and appear in interim version 2.n releases of
512	   the guidelines.  The IAB's understanding is that, while there has
513	   been some progress with this, other issues relating to IDN
514	   subsequently diverted much of the energy that was intended to be
515	   devoted to the more extensive treatment of the guidelines.

517	2.  General Problems and Issues

519	   This section interweaves problems and issues of several types.  Each
520	   subsection outlines something that is perceived to be a problem or
521	   issue "with IDNs", therefore needing correction.  Some of these
522	   issues can be at least partially resolved by making changes to
523	   elements of the IDNA protocol or tables.  Others will exist as long
524	   as people have expectations of IDNs that are inconsistent with the
525	   basic DNS architecture.  It is important to identify this entire
526	   range of problems because users, registrants, and policy makers often
527	   do not understand the protocol and other technical issues but only
528	   the difference between what they believe happens or should happen and
529	   what actually happens.  As long as those differences exist, there
530	   will be demands for functionality or policy changes for IDN.  Of
531	   course, some of these demands will be less realistic than others but
532	   even the realistic ones should be understood in the same context as
533	   the others.

535	   Most of the issues that have been raised, and that are discussed in
536	   this document, exist whether IDNA remains tied to Unicode 3.2 or
537	   whether migration to new Unicode versions is contemplated.  A
538	   migration path is necessary to accommodate newly-coded scripts and to
539	   permit the maximum number of languages and scripts to be represented
540	   in domain names.  However, the migration issues are largely separate
541	   from those involving a single Unicode Version or Version 3.2 in
542	   particular, so they have been separated into this section and
543	   Section 3.

545	2.1.  User conceptions, local character sets, and input issues

547	   The labels of the DNS are just strings of characters that are not
548	   inherently tied to a particular language.  As mentioned briefly in
549	   the Introduction, DNS labels that could not lexically be words in any
550	   language are possible and indeed common: there appears to be no
551	   reason to impose protocol restrictions on IDNs that would restrict
552	   them more than all-ASCII hostname labels have been restricted.  For
553	   that reason, even describing DNS labels or strings of them as "names"
554	   is something of a misnomer, one that has probably added to user
555	   confusion about what to expect.

557	   Ordinarily, people use "words" when they think of things and wish
558	   others to think of them too.  For example "orange", "tree",
559	   "restaurant" or "Acme Inc".  Words are normally in a specific
560	   language, such as English or Swedish.  The character-string labels
561	   supported by the DNS are, as suggested above, not inherently "words".
562	   While it is useful, especially for mnemonic value or to identify
563	   objects, for actual words to be used as DNS labels, other constraints
564	   on the DNS make it impossible to guarantee that it will be possible
565	   to represent every word in every language as a DNS label,
566	   internationalized or not.

568	   When writing or typing the label (or word), a script must be selected
569	   and a charset must be picked for use with that script.  That choice
570	   of charset is typically not under the control of the user on a per
571	   word or per document basis, but may depend on local input devices,
572	   keyboard or terminal drivers, or other decisions made by operating
573	   system or even hardware designers and implementers.

575	   If that charset, or the local charset being used by the relevant
576	   operating system or application software, is not Unicode, a further
577	   conversion must be performed to produce Unicode.  How often this is
578	   an issue depends on estimates of how widely Unicode is deployed as
579	   the native character set for hardware, operating systems, and
580	   applications.  Those estimates differ widely but it should be noted
581	   that, among other difficulties:

583	   o  ISO 8859 versions [ISO.8859.2003] and even national variations of
584	      ISO 646 [ISO.646.1991] are still widely used in parts of Europe;
585	   o  code-table switching methods, typically based on the techniques of
586	      ISO 2022 [ISO.2022.1986] are still in general use in many parts of
587	      the world, especially in Japan with Shift-JIS and its variations;
588	   o  computing systems and communications in China tend to use one
589	      or more of the national "GB" standards rather than native
590	      Unicode.

592	   Additionally, not all charsets define their characters in the same
593	   way and not all pre-existing coding systems were incorporated into
594	   Unicode without changes.  Sometimes local distinctions were made that
595	   Unicode does not make or vice versa.  Consequently, conversion from
596	   other systems to Unicode may potentially lose information.
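
   A minimal sketch of why the local charset matters: the same octet
   converts to three unrelated Unicode characters depending on which
   ISO 8859 part is assumed.  The helper name is hypothetical; the
   codec names are those shipped with Python.

```python
def decode_all(raw, charsets):
    # Decode the same bytes under each candidate charset.
    return {name: raw.decode(name) for name in charsets}

results = decode_all(b"\xe4", ["latin_1", "iso8859_5", "iso8859_7"])
for name, text in results.items():
    # Show the resulting code point for each assumed charset.
    print(name, "U+%04X" % ord(text))
```

   A conversion step that guesses the wrong charset therefore hands
   IDNA a different string than the user intended, before any IDNA
   processing has taken place.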

598	   The Unicode string that results from this processing --processing
599	   that is trivial in a Unicode-native system but that may be
600	   significant in others-- is then used as input to IDNA.

602	2.2.  Examples of Issues

604	   While much of the discussion below is stated in terms of Unicode
605	   codings and associated rules, the IAB believes that some of the
606	   issues are actually not about the Unicode Character set per se, but
607	   about how distributed matching systems operate in reality, and about
608	   what implications the distributed delayed search for stored data that
609	   characterizes the DNS has for the mapping algorithms.

611	2.2.1.  Language specific character matching

613	   There are similar words that can be expressed in multiple languages,
614	   for example the name Torbjorn in Norwegian and Swedish.  In Norwegian
615	   it is spelled with the character U+00F8 (LATIN SMALL LETTER O WITH
616	   STROKE) in the second syllable, while in Swedish it is spelled with
617	   U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  Those characters are
618	   not treated as equivalent according to the Unicode Standard and its
619	   Annexes while most people speaking Swedish, Danish, or Norwegian
620	   probably think they are equivalent.

622	   It is neither possible nor desirable to make these characters
623	   equivalent on a global basis.  To do so would, for this example,
624	   rationalize the situation in Sweden while causing considerable
625	   confusion in Germany, where the U+00F8 character is never used in the
626	   German language.  But the "variant" model introduced in [RFC3743] and
627	   [RFC4290] can be used by a registry to prevent the worst consequence
628	   of the possible confusion, either by ensuring that both names are
629	   registered to the same party in a given domain or that one of them is
630	   completely prohibited.

632	2.2.2.  Multiple scripts

634	   There are languages in the world that can be expressed using multiple
635	   scripts.  For example, some Eastern European and Central Asian
636	   languages can be expressed in either Cyrillic or Latin characters
637	   (see Section 1.5.2), and some African and Southeast Asian languages
638	   can be expressed in either Arabic or Latin characters.  A few
639	   languages can even be written in three different scripts.  In other
640	   cases, the language is typically written in a combination of scripts
641	   (e.g., Kanji, Kana, and Romaji for Japanese, Hangul and Hanja for
642	   Korean).  Because of this, the same word, in the same language, can
643	   be expressed in different ways.  For some languages, only a single
644	   script is normally used to write a single word; for others, mixed
645	   scripts are required; and, for still others, special circumstances
646	   may dictate mixing scripts in labels although that is not normally
647	   done for "words".  For IDN purposes, these variations make the
648	   definition of "script" extremely sensitive, especially since ICANN is
649	   now recommending that it be used as the primary basis for registry
650	   policies.  However essential it may be to prohibit mixed-script
651	   labels, additional policy nuance is required for "languages with
652	   established orthographies and conventions that require the commingled
653	   use of multiple scripts".

655	2.2.3.  Normalization and Character Mappings

657	   Unicode contains several different models for representing
658	   characters.  The Chinese (Han)-derived characters of the "CJK"
659	   languages are "unified", i.e., characters with common derivation and
660	   similar appearances are assigned to the same code point.  European
661	   characters derived from a Greek-Latin base are separated into
662	   separate code blocks for Latin, Greek and Cyrillic even when
663	   individual characters are identical in both form and semantics.
664	   Separate code points based on font differences alone are generally
665	   prohibited, but a large number of characters for "mathematical" use
666	   have been assigned separate code points even though they differ from
667	   base ASCII characters only by font attributes such as "script",
668	   "bold", or "italic".  Some characters that often appear together are
669	   treated as typographical digraphs with specific code points assigned
670	   to the combination, others require that the two-character sequences
671	   be used, and still others are available in both forms.  Some Roman-
672	   derived letters that were developed as decorated variations on the
673	   basic Latin letter collection (e.g., by addition of diacritical
674	   marks) are assigned code points as individual characters, others must
675	   be built up as two (or more) character sequences using "composing
676	   characters".

678	   Many of these differences result from the desire to maintain backward
679	   compatibility while the standard evolved historically, and are hence
680	   understandable.  However, the DNS requires precise knowledge of which
681	   codes and code sequences represent the same character and which ones
682	   do not.  Limiting the potential difficulties with confusable
683	   characters (see Section 2.2.6) requires even more knowledge of which
684	   characters might look alike in some fonts but not in others.  These
685	   variations make it difficult or impossible to apply a single set of
686	   rules to all of Unicode and, in doing so, satisfy everyone and their
687	   perceived needs.  Instead, more or less complex mapping tables,
688	   defined on a character by character basis, are required to
689	   "normalize" different representations of the same character to a
690	   single form so that matching is possible.

692	   Unless normalization rules, such as those that underlie Nameprep, are
693	   applied, characters that are essentially identical will not match in
694	   the DNS, creating many opportunities for problems.  The most common
695	   of these problems is that, due to the processing applied (and
696	   discussed above) before a word is represented as a Unicode string, a
697	   single word can end up being expressed as more than one unique
698	   Unicode string.  Even if normalization rules are applied, some
699	   strings that are considered identical by users will not compare
700	   equal.  That problem is discussed in more detail elsewhere in this
701	   document, particularly in Section 3.2.1.

703	   IDNA attempts to compensate for these problems by using a
704	   normalization algorithm defined by the Unicode Consortium.  This
705	   algorithm can change a sequence of one or more Unicode characters to
706	   another set of characters.  One example is that the base character
707	   U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING
708	   DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN
709	   SMALL LETTER A WITH DIAERESIS).
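
   The composition described above can be reproduced with the Unicode
   database bundled with Python.  This is a sketch using a current
   Unicode version, not the Unicode 3.2 tables that IDNA specifies.

```python
import unicodedata

decomposed = "\u0061\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFKC", decomposed)

print(len(decomposed), len(composed))
print(unicodedata.name(composed))
```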

711	   This Unicode normalization process accounts only for simple character
712	   equivalences, not equivalences that are language or script dependent.
713	   For example, as mentioned above, the characters U+00F8 (LATIN SMALL
714	   LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH
715	   DIAERESIS) are considered to match in Swedish (and some other
716	   languages), but not for all languages that use either of the
717	   characters.  Having these characters be treated as equivalent in some
718	   contexts and not in others requires decisions and mechanisms that, in
719	   turn, depend much more on context than either IDNA or the Unicode
720	   character-based normalization tables can provide.
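
   As a small illustration, Unicode normalization unifies the two
   encodings of o-with-diaeresis but leaves o-with-stroke distinct,
   whatever a Swedish or Norwegian reader might expect.  The helper
   name is hypothetical.

```python
import unicodedata

def nfkc_equal(a, b):
    # Compare two strings after NFKC normalization.
    return (unicodedata.normalize("NFKC", a)
            == unicodedata.normalize("NFKC", b))

swedish = "Torbj\u00f6rn"      # U+00F6, o with diaeresis
norwegian = "Torbj\u00f8rn"    # U+00F8, o with stroke
combining = "Torbjo\u0308rn"   # o followed by COMBINING DIAERESIS

print(nfkc_equal(swedish, combining))
print(nfkc_equal(swedish, norwegian))
```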

722	   Additional complications occur if the sequences are more complicated
723	   or if an attacker is making a deliberate effort to confuse the
724	   normalization process.  For example, if the sequence U+0069 U+0308
725	   (LATIN SMALL LETTER I followed by COMBINING DIAERESIS) appears, NFKC
726	   maps it into U+00EF (LATIN SMALL LETTER I WITH DIAERESIS), as
727	   expected.  But consider U+0131 U+0308 (LATIN SMALL LETTER DOTLESS I
728	   and COMBINING DIAERESIS): is that the same character?  Is U+0131
729	   U+0307 U+0307 (dotless i and two combining dot-above characters)
730	   equivalent to U+00EF or U+0069, or neither?  NFKC does not appear to
731	   tell us, nor does the definition of U+0307 appear to tell us what
732	   happens when it is combined with other "symbol above" arrangements
733	   (unlike some of the "accent above" combining characters, which more
734	   or less specify kerning).  Similar issues arise when U+00EF is
735	   combined with various dot-above combining characters.  Each of these
736	   questions provides some opportunities for spoofing if different
737	   display implementations interpret the rules in different ways.
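
   One way to explore such sequences is to print what NFKC actually
   produces for them.  The sketch below uses the Unicode version
   bundled with Python rather than Unicode 3.2, and the labels in the
   table are only descriptive.

```python
import unicodedata

SEQUENCES = {
    "i + diaeresis": "\u0069\u0308",
    "dotless i + diaeresis": "\u0131\u0308",
    "dotless i + two dots above": "\u0131\u0307\u0307",
}

def code_points(text):
    # Render a string as a space-separated list of U+XXXX values.
    return " ".join("U+%04X" % ord(ch) for ch in text)

def nfkc_report(sequences):
    return {label: code_points(unicodedata.normalize("NFKC", seq))
            for label, seq in sequences.items()}

report = nfkc_report(SEQUENCES)
for label, result in report.items():
    print(label, "->", result)
```

   Whatever the normalized code points are, two results that differ
   here will not match in the DNS, even if a given font renders them
   identically.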

739	   If we leave Latin scripts and examine those based on Chinese
740	   characters, we see there is also an absence of specific, lexigraphic,
741	   rules for transformations between Traditional and Simplified Chinese.
742	   Even if there were such rules, unification of Japanese and Korean
743	   characters with Chinese ones would make it impossible to normalize
744	   Traditional Chinese characters into Simplified ones without causing
745	   problems in Japanese and Korean use of the same characters.

747	   More generally, while some mappings, such as those between
748	   precomposed Latin script characters and the equivalent multiple code
749	   point composed character sequences, depend only on the characters
750	   themselves, in many or most cases, such as the case with Swedish
751	   above, the mapping is language or culturally dependent.  There have
752	   been discussions as to whether different canonicalization rules (in
753	   addition to or instead of Unicode normalization) should be, or could
754	   be, applied differently to different languages or scripts.  The fact
755	   that most scripts included in Unicode have been initially
756	   incorporated by copying an existing standard more or less intact has
757	   impact on the optimization of these algorithms and on forward
758	   compatibility.  Even if the language is known and language-specific
759	   rules can be defined, dependencies on the language do not disappear.
760	   Canonicalization operations are not possible unless they either
761	   depend only on short sequences of text or have significant context
762	   available that is not obvious from the text itself.  DNS lookups and
763	   many other operations do not have a way to capture and utilize the
764	   language or other information that would be needed to provide that
765	   context.

767	   These variations in languages and in user perceptions of characters
768	   make it difficult or impossible to provide uniform algorithms for
769	   matching Unicode strings in a way that no end users are ever
770	   surprised by the result.  For closely-related scripts or characters,
771	   surprises may even be frequent.  However, because uniform algorithms
772	   are required for mappings that are applied when names are looked up
773	   in the DNS, the rules that are chosen will always represent an
774	   approximation that will be more or less successful in minimizing
775	   those user surprises.  The current Nameprep and Stringprep algorithms
776	   use mapping tables to "normalize" different representations of the
777	   same text to a single form so that matching is possible.
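
   For experimentation, CPython ships an implementation of the IDNA2003
   Nameprep profile as encodings.idna.nameprep.  The sketch below
   assumes that implementation is a fair reflection of the mapping
   tables; the helper name is hypothetical.

```python
from encodings.idna import nameprep

def prepared_equal(a, b):
    # Two labels match if Nameprep maps them to the same string.
    return nameprep(a) == nameprep(b)

print(nameprep("Stra\u00dfe"))
print(prepared_equal("EXAMPLE", "example"))
print(prepared_equal("a\u0308", "\u00e4"))
```

   The first line shows a mapping that surprises some users: U+00DF is
   folded to "ss", so two spellings that a German reader may regard as
   different become the same label.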

779	   More details on the creation of the normalization algorithms can be
780	   found in the Unicode Specification and the associated Technical
781	   Reports [UTR] and Annexes.  Technical Report #36 [UTR36] and [UTR39]
782	   are specifically related to the IDN discussion.

784	2.2.4.  URLs in Printed Form

786	   URLs and other identifiers appear, not only in electronic forms from
787	   which they can (at least in principle) be accurately copied and
788	   "pasted" but in printed forms from which the user must transcribe
789	   them into the computer system.  This is often known as the "side of
790	   the bus problem" because a particularly problematic version of it
791	   requires that the user be able to observe and accurately remember a
792	   URL that is quickly-glimpsed in a transient form -- a billboard seen
793	   while driving, a sign on the side of a passing vehicle, a television
794	   advertisement that is not frequently repeated or on-screen for a long
795	   time, and so on.

797	   The difficulty, in short, is that two Unicode strings that are
798	   actually different might look exactly the same, especially when there
799	   is no time to study them.  This is because, for example, some glyphs
800	   in Cyrillic, Greek and Latin do look the same, but have been assigned
801	   different codepoints in Unicode.  Worse, one needs to be reasonably
802	   familiar with a script and how it is used to understand how much
803	   characters can reasonably vary as the result of artistic fonts and
804	   typography.  For example, there are a few fonts for Latin characters
805	   that are sufficiently highly ornamented that an observer might easily
806	   confuse some of the characters with characters in Thai script.
807	   Upper-case ITC Blackadder (a registered trademark of International
808	   Typeface Corporation) and Curlz MT are two fairly obvious examples;
809	   these fonts use loops at the end of serifs, creating a resemblance to
810	   Thai (in some fonts) for some characters.

812	2.2.5.  Bidirectional text

814	   Some scripts (and because of that some words in some languages) are
815	   written not left to right, but right to left.  And, to complicate
816	   things, one might have something written in Arabic characters right
817	   to left that includes some Latin characters, such as
818	   European-style digits.  The Latin character part is written left to
819	   right, which implies some texts might have a mixed left to right AND
820	   right to left order (even though in most implementations all texts
821	   have a major direction, with the other as an exception).  IDNA
822	   prohibits these mixed-directional (or bidirectional) strings in IDN
823	   labels, but the prohibition causes other problems such as the
824	   rejection of some otherwise linguistically and culturally sensible
825	   strings.  As Unicode and conventions for handling so-called
826	   bidirectional ("BIDI") strings evolve, the prohibition in IDNA should
827	   be reviewed and reevaluated.
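
   The effect of the prohibition can be sketched with a simplified
   reading of the bidirectional requirements in Stringprep (RFC 3454,
   Section 6).  The sketch uses the bidirectional classes of the
   Unicode version bundled with Python, not the Unicode 3.2 tables, and
   omits the separate list of prohibited control characters.

```python
import unicodedata

RTL = ("R", "AL")

def bidi_ok(label):
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL for c in classes):
        return True            # no right-to-left characters at all
    if "L" in classes:
        return False           # mixes right-to-left and left-to-right
    # A right-to-left label must both start and end with such a character.
    return classes[0] in RTL and classes[-1] in RTL

print(bidi_ok("example"))
print(bidi_ok("\u0627\u0628"))     # two Arabic letters
print(bidi_ok("\u0627\u06281"))    # Arabic letters followed by a digit
```

   The last case is the kind of otherwise sensible string that is
   rejected: a right-to-left label that ends in a European digit.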

829	2.2.6.  Confusable Character Issues

831	   Similar-looking characters in identifiers can cause actual problems
832	   on the Internet since they can result, deliberately or accidentally,
833	   in people being directed to the wrong host or mailbox by believing
834	   that they are typing, or clicking on, intended characters which are
835	   different from those that actually appear in the domain name or
836	   reference.  See Section 4.1.3 for further discussion of this issue.

838	   IDNs complicate these issues, not only by providing many additional
839	   characters that look sufficiently alike to be potentially confused,
840	   but by raising new policy questions.  For example, if a language can
841	   be written in two different scripts, is a label constructed from a
842	   word written in one script equivalent to a label constructed from the
843	   same word written in the other script?  Is the answer the same for
844	   words in two different languages that translate into each other?

846	   It is now generally understood that, in addition to the collision
847	   problems of possibly equivalent words and hence labels, it is
848	   possible to utilize characters that look alike -- "confusable"
849	   characters -- to spoof names in order to mislead or defraud users.
850	   That issue, driven by particular attacks such as those known as
851	   "phishing", has introduced stronger requirements for registry efforts
852	   to prevent problems than were previously generally recognized as
853	   important.

855	   One commonly-proposed approach is to have a registry establish
856	   restrictions on the characters, and combinations of characters, it
857	   will permit to be included in a string to be registered as a label.
858	   Taking the Swedish top-level domain, .SE, as an example, a rule might
859	   be adopted that the registry "only accepts registrations in Swedish,
860	   using Latin script, and because of this, Unicode characters Latin-a,
861	   -b, -c,...".  But, because there is not a 1:1 mapping between country
862	   and language, even a ccTLD like .SE might have to accept
863	   registrations in other languages.  For example, there may be a
864	   requirement for Finnish (the second most-used language in Sweden).
865	   What rules and codepoints are then defined for Finnish?  Does it have
866	   special mappings that collide with those that are defined for
867	   Swedish?  And what does one do in countries that use more than one
868	   script?  (Finnish and Swedish use the same script.)  In all cases,
869	   the dispute will ultimately be about whether two strings are the same
870	   (or confusingly similar) or not.  That, in turn, will generate a
871	   discussion of how one defines "what is the same" and "what is similar
872	   enough to be a problem".

874	   Another example arose recently that further illustrates the problem.
875	   If one were to use Cyrillic characters to represent the country code
876	   for Russia in a localized equivalent to the ccTLD label, the
877	   characters themselves would be indistinguishable from the Latin
878	   characters "P" and "Y" (in either lower or upper case) in most fonts.
879	   We presume this might cause some consternation in Paraguay.
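
   The two labels in this example are entirely different to the DNS
   even though they may be indistinguishable on screen.  A minimal
   sketch using Python's built-in "idna" codec (an IDNA2003
   implementation) makes the difference visible; the helper name is
   hypothetical.

```python
def to_ascii(label):
    # Convert a single label to its ASCII-compatible form.
    return label.encode("idna")

cyrillic = "\u0440\u0443"   # CYRILLIC SMALL LETTER ER, CYRILLIC SMALL LETTER U
latin = "py"

print(to_ascii(cyrillic))
print(to_ascii(latin))
```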

881	   These difficulties can never be completely eliminated by algorithmic
882	   means.  Some of the problem can be addressed by appropriate tuning of
883	   the protocols and their tables, other parts by registry actions to
884	   reduce confusion and conflicts, and still other parts can be
885	   addressed by careful design of user interfaces in application
886	   programs.  But, ultimately, some responsibility to avoid being
887	   tricked or harmfully confused will rest with the user.

889	   Another registry technique that has been extensively explored
890	   involves looking at confusable characters and confusion between
891	   complete labels, restricting the labels that can be registered based
892	   on relationships to what is registered already.  Registries that
893	   adopt this approach might establish special mapping rules such as:

895	   1.  If you register something with codepoint A, domain names with B
896	       instead of A will be blocked from registration by others (where B
897	       is a character at a separate codepoint that has a confusingly
898	       similar appearance to A).
899	   2.  If you register something with codepoint A, you also get the
900	       domain name with B instead of A.

902	   These approaches are discussed in more detail for "CJK" characters in
903	   RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290].
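
   The two mapping rules listed above can be sketched as a toy
   registry.  Everything here is hypothetical: the confusable table is
   a tiny hand-made sample, and the class is not the mechanism of
   RFC 3743 or RFC 4290, only an illustration of blocking (rule 1) and
   bundling (rule 2).

```python
# Hypothetical table: Cyrillic letters mapped to Latin lookalikes.
CONFUSABLE = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
              "\u0440": "p", "\u0441": "c", "\u0443": "y"}

def skeleton(label):
    # Replace each confusable character by its representative.
    return "".join(CONFUSABLE.get(ch, ch) for ch in label)

class VariantRegistry:
    def __init__(self, bundle):
        self.bundle = bundle   # True applies rule 2 as well as rule 1
        self.holders = {}      # skeleton -> registrant
        self.names = {}        # exact label -> registrant

    def register(self, label, registrant):
        holder = self.holders.setdefault(skeleton(label), registrant)
        if holder != registrant:
            return False       # rule 1: variant is blocked for others
        self.names[label] = registrant
        return True

    def owner_of(self, label):
        if label in self.names:
            return self.names[label]
        if self.bundle:        # rule 2: variants belong to the holder
            return self.holders.get(skeleton(label))
        return None

blocking = VariantRegistry(bundle=False)
bundling = VariantRegistry(bundle=True)
for registry in (blocking, bundling):
    registry.register("example", "alice")
```

   With this setup, a second registrant cannot register the label with
   U+0430 in place of "a" in either registry, and in the bundling
   registry that variant resolves to the first registrant.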

905	2.2.7.  The IESG Statement and IDNA issues

907	   The issues above, at least as they were understood at the time,
908	   provided the background for the IESG statement included in
909	   Section 1.6.1 (which, in turn, was part of the basis for the initial
910	   ICANN Guidelines) that a registry should have a policy about the
911	   scripts, languages, codepoints and text directions for which
912	   registrations will be accepted.  While "accept all" might be an
913	   acceptable policy, it implies there is also a dispute resolution
914	   process that takes the problems listed above into account.  This
915	   process must be designed for dealing with all types of potential
916	   disputes.  For example, issues might arise between registrant and
917	   registry over a decision by the registry on collisions with already
918	   registered domain names and between registrant and trademark holder
919	   (that a domain name infringes on a trademark).  In both cases the
920	   parties disagreeing have different views on whether two strings are
921	   "equivalent" or not.  They may believe that a string that is not
922	   allowed to be registered is actually different from one that is
923	   already registered.  Or they might believe that two strings are the
924	   same, even though the rules adopted by the registry to prevent
925	   confusion define them as two different domain names.

927	3.  Migrating to New Versions of Unicode

929	3.1.  Versions of Unicode

931	   While opinions differ about how important the issues are in practice,
932	   the use of Unicode and its supporting tables for IDNA appears to be
933	   far more sensitive to subtle changes than it is in typical Unicode
934	   applications.  This may be, at least in part, because many other
935	   applications are internally sensitive only to the appearance of
936	   characters and not to their representation.  Or those applications
937	   may be able to take effective advantage of script, language, or
938	   character class identification.  The working group that developed
939	   IDNA concluded that attempting to encode any ancillary character
940	   information into the DNS label would be impractical and unwise, and
941	   the IAB, based in part on the comments in the ad hoc committee, saw
942	   no reason to review that decision.

944	   The Unicode Consortium has sometimes used the likelihood of a
945	   combination of characters actually appearing in a natural language as
946	   a criterion for the safety of a possible change.  However, as
947	   discussed above, DNS names are often fabrications -- abbreviations,
948	   strings deliberately formed to be unusual, members of a series
949	   sequenced by numbers or other characters, and so on.  Consequently, a
950	   criterion that considers a change to be safe if it would not be
951	   visible in properly-constructed running text is not helpful for DNS
952	   purposes: a change that would be safe under that criterion could
953	   still be quite problematic for the DNS.

955	   This sensitivity to changes has made it quite difficult to migrate
956	   IDNA from one version of Unicode to the next if any changes are made
957	   that are not strictly additive.  A change in a code point assignment
958	   or definition may be extremely disruptive if DNS labels have been
959	   defined using the earlier form and any of its previous components has
960	   been moved from one table position or normalization rule to another.
961	   Unicode normalization tables, tables of scripts or languages and
962	   characters that belong to them, and even tables of confusable
963	   characters as an adjunct to security recommendations may be very
964	   helpful in designing registry restrictions on registrations and
965	   applications provisions for avoiding or identifying suspicious names.
966	   Ironically, they also extend the sensitivity of IDNA and its
967	   implementations to all forms of change between one version of Unicode
968	   and the next.  Consequently, they make Unicode version migration more
969	   difficult.
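
   One concrete form of the version problem is a label containing code
   points that were unassigned in Unicode 3.2.  Python happens to ship
   a Unicode 3.2 database (unicodedata.ucd_3_2_0) alongside its current
   one, which allows a rough check; the function name is hypothetical.

```python
import unicodedata

def unassigned_in_3_2(label):
    # Category "Cn" means the code point had no assignment in that version.
    old = unicodedata.ucd_3_2_0
    return [ch for ch in label if old.category(ch) == "Cn"]

# U+07CA (NKO LETTER A) belongs to a script encoded after Unicode 3.2.
print(unassigned_in_3_2("abc"))
print(["U+%04X" % ord(ch) for ch in unassigned_in_3_2("\u07ca")])
```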

971	   An example of the type of change that appears to be just a small
972	   correction from one perspective but may be problematic from another
973	   was the correction to the normalization definition in 2004 [Unicode-
974	   PR29].  Community input suggested that the change would cause
975	   problems for Stringprep, but the Unicode Technical Committee decided,
976	   on balance, that the change was worthwhile.  Because of difficulties
977	   with consistency, some deployed implementations have decided to adopt
978	   the change and others have not, leading to subtle incompatibilities.

980	   This situation leads to a dilemma.  On the one hand, it is completely
981	   unacceptable to freeze IDNA at a Unicode version level that excludes
982	   more recently-defined characters and scripts which are important to
983	   those who use them.  On the other hand, it is equally unacceptable to
984	   migrate from one version of Unicode to the next if such migration
985	   might invalidate an existing registered DNS name or some of its
986	   registered properties or might make the string or representation of
987	   that name ambiguous.  If IDNA is to be modified to accommodate new
988	   versions of Unicode, the IETF will need to work with the Unicode
989	   Consortium and other relevant bodies to find an appropriate balance
990	   in this area, but progress will be possible only if all relevant
991	   parties are able to fairly consider and discuss possible decisions
992	   that may be very difficult and unpalatable.

994	   It would also prove useful if during the course of that dialog, the
995	   need for Unicode Consortium concern with security issues in
996	   applications of the Unicode character set could be clarified.  It
997	   would be unfortunate from almost every perspective considered here,
998	   if such matters slowed the inclusion of as yet unencoded scripts.

1000	3.2.  Version changes and normalization issues

1002	3.2.1.  Unnormalized Combining Sequences

1004	   One of the advantages of the Unicode model of combining characters,
1005	   as with previous systems that use character overstriking to
1006	   accomplish similar purposes, is that it is possible to use sequences
1007	   of code points to generate characters that are not explicitly
1008	   provided for in the character set.  However, unless sequences that
1009	   are not explicitly provided for are prohibited by some mechanism
1010	   (such as the normalization tables), such combining sequences can
1011	   permit two related dangers.

1013	   o  The first is another risk of character confusion, especially if
1014	      the relationship of the combining character with characters it
1015	      combines with is not precisely defined or unexpected combinations
1016	      of combining characters are used.  That issue is discussed in more
1017	      detail, with an example, in Section 2.2.3.
1018	   o  These same issues also inherently impact the stability of the
1019	      normalization tables.  Suppose that, somewhere in the world, there
1020	      is a character that looks like a Roman-derived lower-case "i", but
1021	      with three (not one or two) dots above it.  And suppose that the
1022	      users of that character agree to represent it by combining a
1023	      traditional "i" (U+0069) with a combining diaeresis (U+0308).  So
1024	      far, no problem.  But, later, a broader need for this character is
1025	      discovered and it is coded into Unicode either as a single
1026	      precomposed character or, more likely under existing rules, by
1027	      introducing a three-dot-above combining character.  In either
1028	      case, that version of Unicode should include a rule in NFKC that
1029	      maps the "i"-plus-diaeresis sequence into the new, approved, one.
1030	      If one does not do so, then there is arguably a normalization that
1031	      should occur that does not.  If one does so, then strings that
1032	      were valid and normalized (although unanticipated) under the
1033	      previous versions of Unicode become unnormalized under the new
1034	      version.  That, in turn, would impact IDNA comparisons because,
1035	      effectively, it would introduce a change in the matching rules.
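
   As a minimal illustration of the combining-sequence behavior
   described above (a sketch, not part of IDNA), the following Python
   fragment uses the standard unicodedata module.  It relies on
   whichever Unicode version the running interpreter ships, which is
   itself an instance of the version dependence discussed here.  The
   helper name describe() is hypothetical, and the "q" example is
   chosen only because no precomposed form is assigned for it.

```python
import unicodedata

def describe(s: str) -> list[str]:
    """Return the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

# "i" followed by COMBINING DIAERESIS: a precomposed form exists
# (U+00EF), so NFKC composes the pair into a single code point.
composed = unicodedata.normalize("NFKC", "i\u0308")

# "q" followed by COMBINING DIAERESIS: no precomposed form is assigned,
# so the sequence passes through normalization unchanged.  A later
# mapping for such a sequence is the scenario the text warns about:
# strings that were normalized would stop being normalized.
uncomposed = unicodedata.normalize("NFKC", "q\u0308")

print(unicodedata.unidata_version, describe(composed), describe(uncomposed))
```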

1037	   It would be useful to consider rules that would avoid or minimize
1038	   these problems with the understanding that, for reasons given
1039	   elsewhere, simply minimizing them may not be good enough for IDNA.  One
1040	   partial solution might be to ban any combination of a base character
1041	   and a combining character that does not appear in a hypothetical
1042	   "anticipated combinations" table from being used in a domain name
1043	   label.  The next subsection discusses a more radical, if impractical,
1044	   view of the problem and its solutions.
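
   The partial solution mentioned above could be sketched roughly as
   follows.  Everything here is an assumption made for illustration:
   the table contents, the function name, and the simplification that
   at most one combining mark may follow a base character.  Combining
   marks are detected through the Unicode general category (M*).

```python
import unicodedata

# Hypothetical "anticipated combinations" table: (base, combining mark)
# pairs that a registry or protocol has agreed to permit.
ANTICIPATED = {("i", "\u0308"), ("e", "\u0301")}

def uses_only_anticipated(label: str, table=ANTICIPATED) -> bool:
    """True if every combining mark in label directly follows a base
    character with which it forms a pair listed in table."""
    base = None
    for ch in label:
        if unicodedata.category(ch).startswith("M"):   # combining mark
            if base is None or (base, ch) not in table:
                return False
            base = None          # this sketch rejects stacked marks
        else:
            base = ch
    return True
```

   Under these assumptions, a label containing "i" plus U+0308 is
   accepted, while "q" plus U+0308, a leading combining mark, or two
   marks on one base are rejected.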

1046	3.2.2.  Combining Characters and Character Components

1048	   For several reasons, including those discussed above, one thing that
1049	   increases IDNA complexity and the need for normalization is that
1050	   combining characters are permitted.  Without them, complexity might
1051	   be reduced enough to permit easier transitions to new versions.
1052	   The community should consider the impact of entirely prohibiting
1053	   combining characters from IDNs.  While it is almost certainly
1054	   unfeasible to introduce this change into Unicode as it is now defined
1055	   and doing so would be extremely disruptive even if it were feasible,
1056	   the thought experiment can be helpful in understanding both the
1057	   issues and the implications of the paths not taken.  For example, one
1058	   consequence of this, of course, is that each new language or script,
1059	   and several existing ones, would require that all of its characters
1060	   have Unicode assignments to specific, precomposed, code points.

1062	   Note that this is not currently permitted within Unicode for Latin
1063	   scripts.  For non-Latin scripts, some such code points have been
1064	   defined.  The decisions that govern the assignment of such code
1065	   points are managed entirely within the Unicode Consortium.  Were the
1066	   IETF to choose to reduce IDNA complexity by excluding combining
1067	   characters, no doubt there would be additional input to the Unicode
1068	   Consortium from users and proponents of scripts requiring composing
1069	   characters.  The IAB and the IETF should examine whether it is
1070	   appropriate to press the Unicode Consortium to revise these policies
1071	   or otherwise to recommend actions that would reduce the need for
1072	   normalization and the related complexities.  However, we have been
1073	   told that the Technical Committee does not believe it is reasonable
1074	   or feasible to add all possible precomposed characters to Unicode.
1075	   If Unicode cannot be modified to contain the precomposed characters
1076	   necessary to support existing languages and scripts, much less new
1077	   ones, this option for IDN restrictions will not be feasible.

1079	3.2.3.  When does normalization occur?

1081	   In many Unicode applications, the preferred solution is to pick a
1082	   style of normalization and require that all text that is stored or
1083	   transmitted be normalized to that form.  (This is the approach taken
1084	   in ongoing work in the IETF on a standard Unicode text form [net-
1085	   utf8]).  IDNA does not impose this requirement.  Text is normalized
1086	   and case-reduced at registration time, and only the normalized
1087	   version is placed in the DNS.  However, there is no requirement that
1088	   applications show only the native (and lower-case where appropriate)
1089	   characters associated with the normalized form in discussions or
1090	   references such as URLs.  If conventions used for all-ASCII DNS names
1091	   are to be extended to internationalized forms, such a requirement
1092	   would be unreasonable, since it would prohibit the use of mixed-case
1093	   references for clarity or market identification.  It might even be
1094	   culturally inappropriate.  However, without that restriction, the
1095	   comparison that will ultimately be made in the DNS will be between
1096	   strings normalized at different times and under different versions of
1097	   Unicode.  The assertion that a string in normalized form under one
1098	   version of Unicode will still be in normalized form under all future
1099	   versions is not sufficient.  Normalization at different times also
1100	   requires that a given source string always normalizes to the same
1101	   target string, regardless of the version under which it is
1102	   normalized.  That criterion is much more difficult to fulfill.  The
1103	   discussion above suggests that it may even be impossible.
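
   The following sketch illustrates the two points above using only
   the Python standard library.  It assumes Python's built-in "idna"
   codec (an implementation of the original IDNA, including Nameprep)
   and the unicodedata.ucd_3_2_0 object, which exposes the Unicode 3.2
   tables.  The function names are hypothetical.  Agreement on a few
   sample strings shows what the stability criterion asks for; it does
   not show that the criterion holds in general.

```python
import unicodedata

def to_ace(label: str) -> str:
    """Apply Python's built-in IDNA codec (Nameprep, then Punycode)."""
    return label.encode("idna").decode("ascii")

def same_under_both(s: str, form: str = "NFKC") -> bool:
    """Compare normalization under the Unicode 3.2 tables with
    normalization under the tables of the running interpreter."""
    old = unicodedata.ucd_3_2_0.normalize(form, s)
    new = unicodedata.normalize(form, s)
    return old == new

# Three presentations of one name: precomposed, mixed-case, and with a
# decomposed "u" plus COMBINING DIAERESIS.
forms = ["b\u00fccher", "B\u00fcCHER", "bu\u0308cher"]

print({to_ace(f) for f in forms})
print(all(same_under_both(f) for f in forms))
```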

1105	   Ignoring these issues with combining characters entirely, as IDNA
1106	   effectively does today, may leave us "stuck" at Unicode 3.2, leading
1107	   either to incompatibilities in applications that otherwise
1108	   use a modern version of Unicode (while IDN remains at Unicode 3.2) or
1109	   to painful transitions to new versions.  If decisions are made
1110	   quickly, it may still be possible to make a one-time version upgrade
1111	   to Version 4.1 or Version 5 of Unicode.  However, unless we can
1112	   impose sufficient global restrictions to permit smooth transitions,
1113	   upgrades to versions beyond that one are likely to be painful (e.g.,
1114	   potentially requiring changing strings already in the DNS or even a
1115	   new Punycode prefix) or impossible.

1117	4.  Framework for next steps in IDN development

1119	4.1.  Issues within the scope of the IETF

1121	4.1.1.  Review of IDNA

1123	   The IETF should consider reviewing RFCs 3454, 3490, 3491 and/or 3492,
1124	   and update, replace or supplement them to meet the criteria of this
1125	   paragraph (one or more of them may prove impractical after further
1126	   study).  Any new versions or additional specifications should be
1127	   adapted to the version of Unicode that is current when they are
1128	   created.  Ideally, they should specify a path for adapting to future
1129	   versions of Unicode (some suggestions below may facilitate this).
1130	   The IETF should also consider whether there are significant
1131	   advantages to mapping some groups of characters, such as code points
1132	   assigned to font variations, into others or whether clarity and
1133	   comprehensibility for the user would be better served by simply
1134	   prohibiting those characters.  More generally, it appears that it
1135	   would be worthwhile for the IETF to review whether the Unicode
1136	   normalization rules now invoked by the Stringprep profile in Nameprep
1137	   are optimal for the DNS or whether more restrictive rules, or an even
1138	   more restrictive set of permitted character combinations, would
1139	   provide better support for DNS internationalization.

1141	   The IAB has concluded that there is a consensus within the broader
1142	   community that lists of codepoints should be specified by the use of
1143	   an inclusion-based mechanism (i.e., identifying the characters that
1144	   are permitted), rather than by excluding a small number of characters
1145	   from the total Unicode set as Stringprep and Nameprep do today.  That
1146	   conclusion should be reviewed by the IETF community and action taken
1147	   as appropriate.
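
   The difference between the two models can be sketched as follows.
   Both tables are hypothetical and far smaller than anything a real
   review would produce; the function names are likewise assumptions
   of this sketch.

```python
# Hypothetical permitted repertoire: ASCII letters and digits plus a
# few Latin letters with diacritics.
PERMITTED = set("abcdefghijklmnopqrstuvwxyz0123456789") | set("\u00e5\u00e4\u00f6")

def inclusion_check(label: str, permitted=PERMITTED) -> bool:
    """Inclusion model: accept only labels made entirely of code
    points that appear in the permitted table."""
    return bool(label) and all(ch in permitted for ch in label)

# Hypothetical exclusion list, for contrast: anything not listed is
# accepted, including characters nobody has reviewed.
EXCLUDED = set(" \u00a0")

def exclusion_check(label: str, excluded=EXCLUDED) -> bool:
    """Exclusion model: reject only labels containing listed code points."""
    return bool(label) and not any(ch in excluded for ch in label)
```

   With these tables, a label of box-drawing characters (U+2500)
   passes the exclusion check but fails the inclusion check, which is
   the property that motivates the inclusion-based approach.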

1149	   We suggest that the individuals doing the review of the codepoints
1150	   should work as a specialized design team.  To the extent possible,
1151	   that work should be done jointly by people with experience from the
1152	   IETF and deep knowledge of the constraints of the DNS and application
1153	   design, participants from the Unicode Consortium, and other people
1154	   necessary to be able to reach a generally-accepted result.  Because
1155	   any work along these lines would be modifications and updates to
1156	   standards-track documents, final review and approval of any proposals
1157	   would necessarily follow normal IETF processes.

1159	   It is worth noting that sufficiently extreme changes to IDNA would
1160	   require a new Punycode prefix, probably with long-term support for
1161	   both the old prefix and the new one in both registration arrangements
1162	   and applications.  An alternative, which is almost certainly
1163	   impractical, would be some sort of "flag day", i.e., a date on which
1164	   the old rules are simultaneously abandoned by everyone and the new
1165	   ones adopted.  However, preliminary analysis indicates that few, if
1166	   any, of the changes recommended for consideration elsewhere in this
1167	   document would require this type of version change.  For example,
1168	   additional restrictions on what can be registered may require policy
1169	   decisions about actions to be taken with regard to labels that
1170	   conformed to earlier rules but not to new ones, but not changes in
1171	   the protocol or prefix.

1173	4.1.2.  Non-DNS and Above-DNS Internationalization Approaches

1175	   The IETF should once again examine the extent to which it is
1176	   appropriate to try to solve internationalization problems via the DNS
1177	   and what place the many varieties of so-called "keyword systems" or
1178	   other Internet navigational techniques might have.  Those techniques
1179	   can be designed to impose fewer constraints, or at least different
1180	   constraints, than IDNA and the DNS.  As discussed elsewhere in this
1181	   document, IDNA cannot support information about scripts, languages,
1182	   or Unicode versions on lookup.  As a consequence of the nature of DNS
1183	   lookups, characters and labels either match or do not match; a near-
1184	   match is simply not a possible concept in the DNS.  By contrast,
1185	   observation of near-matching is common in human communication and in
1186	   matching operations performed by people, especially when they have a
1187	   particular script or language context in mind.  The DNS is further
1188	   constrained by a fairly rigid internal aliasing system (via CNAME and
1189	   DNAME resource records), while some applications of international
1190	   naming may require more flexibility.  Finally, the rigid hierarchy of
1191	   the DNS --and the tendency in practice for it to become flat at
1192	   levels nearest the root-- and the need for names to be unique are
1193	   more suitable for some purposes than others and may not be a good
1194	   match for some purposes for which people wish to use IDNs.  Each of
1195	   these constraints can be relaxed or changed by one or more systems
1196	   that would provide alternatives to direct use of the DNS by users.
1197	   Some of the issues involved are discussed further in Section 5.3 and
1198	   various ideas have been discussed in detail in the IETF or IRTF.
1199	   Many of those ideas have even been described in Internet Drafts or
1200	   other documents.  As experience with IDNs and with expectations for
1201	   them accumulates, it will probably become appropriate for the IETF or
1202	   IRTF to revisit the underlying questions and possibilities.

1204	4.1.3.  Security issues, certificates, etc.

1206	   Some characters look like others, often as the result of common
1207	   origins.  The problem with these "confusable" characters, often
1208	   incorrectly called homographs, has always existed when characters are
1209	   presented to humans who interpret what is displayed and then make
1210	   decisions based on what they see.  This is not a problem that
1211	   exists only when working with internationalized domain names, but
1212	   IDNs make it worse.  The result of a survey that would explain
1213	   what the problems are might be interesting.  Many of these issues are
1214	   mentioned in Unicode Technical Report #36 [UTR36].

1216	   In this and other issues associated with IDNs, precise use of
1217	   terminology is important lest even more confusion result.  The
1218	   definition of the term 'homograph' that normally appears in
1219	   dictionaries and linguistic texts states that homographs are
1220	   different words which are spelled identically (for example, the
1221	   adjective 'brief' meaning short, the noun 'brief' meaning a document,
1222	   and the verb 'brief' meaning to inform).  By definition, letters in
1223	   two different alphabets are not the same, regardless of similarities
1224	   in appearance.  This means that sequences of letters from two
1225	   different scripts that appear to be identical on a computer display
1226	   cannot be homographs in the accepted sense, even if they are both
1227	   words in the dictionary of some language.  Assuming that there is a
1228	   language written with Cyrillic script in which "cap" is a word,
1229	   regardless of what it might mean, it is not a homograph of the Latin-
1230	   script English word "cap".
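
   The "cap" example can be made concrete with a short Python sketch.
   The script guess below, which takes the first word of each
   character's Unicode name, is a crude heuristic assumed here for
   illustration only; the standard library does not expose the Unicode
   Script property, and scripts_in() is a hypothetical helper.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Crude script guess: first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}

latin_cap = "cap"                      # U+0063 U+0061 U+0070
cyrillic_cap = "\u0441\u0430\u0440"    # CYRILLIC SMALL LETTER ES, A, ER

# The two strings may be indistinguishable on a display, but they
# share no code points and therefore never match in a comparison.
print(latin_cap == cyrillic_cap)
print(scripts_in(latin_cap), scripts_in(cyrillic_cap))
```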

1232	   When the security implications of visually confusable characters were
1233	   brought to the forefront in 2005, the term homograph was used to
1234	   designate any instance of graphic similarity, even when comparing
1235	   individual characters.  This usage is not only incorrect, but risks
1236	   introducing even more confusion and hence should be avoided.  The
1237	   current preferred terminology is to describe these similar-looking
1238	   characters as "confusable characters" or even "confusables".

1240	   Many people have suggested that confusable characters are a problem
1241	   that must be addressed, at least in part, directly in the user
1242	   interfaces of application software.  While it should almost certainly
1243	   be part of a complete solution, that approach creates its own set of
1244	   difficulties.  For example, a user switching between systems, or even
1245	   between applications on the same system, may be surprised by
1246	   different types of behavior and different levels of protection.  In
1247	   addition, it is unclear how a secure setup for the end user should be
1248	   designed.  Today, in the web browser, a padlock is a traditional way
1249	   of describing some level of security for the end user.  Is this
1250	   binary signaling enough?  Should there be any connection between a
1251	   risk for a displayed string including confusable characters and the
1252	   padlock or similar signaling to the user?

1254	   Many web browsers have adopted a convention, based on a "whitelist"
1255	   or similar technique, of restricting the display of native characters
1256	   to subdomains of top-level domains that are deemed to have safe
1257	   practices for the registration of potentially confusable labels.
1258	   IDNs in other domains are displayed as Punycode.  These techniques
1259	   may not be sufficiently sensitive to differences in policies among
1260	   top-level domains and their subdomains and so, while they are clearly
1261	   helpful, they may not be adequate.  Are other methods of dealing with
1262	   confusable characters possible?  Would other methods of identifying
1263	   and listing policies about avoiding confusing registrations be
1264	   feasible and helpful?
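
   A minimal sketch of the whitelist convention described above
   follows.  The whitelist contents and the function name are
   hypothetical, and the conversion relies on Python's built-in "idna"
   codec; actual browsers apply their own, differing, rules.

```python
# Hypothetical whitelist of TLDs whose registration practices a browser
# vendor has decided to trust.
TRUSTED_TLDS = {"example"}

def display_form(domain: str, trusted=TRUSTED_TLDS) -> str:
    """Return the native-character form for whitelisted TLDs and the
    ASCII ("xn--") form for all others."""
    ace = domain.encode("idna").decode("ascii")
    tld = ace.rsplit(".", 1)[-1]
    if tld in trusted:
        return ace.encode("ascii").decode("idna")
    return ace
```

   Because the decision is keyed only on the top-level domain, the
   sketch shares the limitation noted in the text: it cannot reflect
   policy differences among subdomains.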

1266	   It would be interesting to see a more coordinated effort in
1267	   establishing guidelines for user interfaces.  If nothing else, the
1268	   current whitelists are browser-specific and can, and do, differ
1269	   between implementations.

1271	4.1.4.  Protocol Changes and Policy Implications

1273	   Some potential protocol or table changes raise important policy
1274	   issues about what to do with existing, registered, names.  Should
1275	   such changes be needed, their impact must be carefully evaluated in
1276	   the IETF, ICANN, and possibly other forums.  In particular, protocol
1277	   or policy changes that would not permit existing, registered, names
1278	   to be registered under the newer rules should be considered
1279	   carefully, balancing their importance against possible disruption and
1280	   the issues of invalidating older names against the importance of
1281	   consistency as seen by the user.

1283	4.1.5.  Non US-ASCII in local part of email addresses

1285	   Work is going on in the IETF related to the local part of email
1286	   addresses.  It should be noted that the local part of email addresses
1287	   has very different syntax and constraints from a domain name label,
1288	   so directly applying IDNA to the local part is not possible.

1290	4.1.6.  Use of the Unicode Character Set in the IETF

1292	   Unicode, and the closely-related ISO 10646, are the only coded
1293	   character sets that aspire to include all of the world's characters.
1294	   As such, they permit use of international characters without having
1295	   to identify particular character coding standards or tables.  The
1296	   requirement for a single character set is particularly important for
1297	   use with the DNS since there is no place to put character set
1298	   identification.  The decision to use Unicode as the base for IETF
1299	   protocols going forward is discussed in [RFC2277].  The IAB does not
1300	   see any reason to revisit the decision to use Unicode in IETF
1301	   protocols.

1303	4.2.  Issues that fall within the purview of ICANN

1305	4.2.1.  Dispute resolution

1307	   IDN creates new types of collisions between trademarks and domain
1308	   names as well as collisions between domain names.  These have impact
1309	   on dispute resolution processes used by registries and otherwise.  It
1310	   is important that deployment of IDN evolve in parallel with review
1311	   and updating of ICANN or registry-specific dispute resolution
1312	   processes.

1314	4.2.2.  Policy at registries

1316	   The IAB recommends that registries use an inclusion-based model when
1317	   choosing what characters to allow at the time of registration.  This
1318	   list of characters is in turn to be a subset of what is allowed
1319	   according to the updated IDNA standard.  The IAB further recommends
1320	   that registries develop their inclusion-based models in parallel with
1321	   the dispute resolution processes at the registry itself.

1323	   Most established policies for dealing with claimed or apparent
1324	   confusion or conflicts of names are based on dispute resolution.
1325	   Decisions about legitimate use or registration of one or more names
1326	   are resolved at or after the time of registration on a case-by-case
1327	   basis and using policies that are specific to the particular DNS zone
1328	   or jurisdiction involved.  These policies have generally not been
1329	   extended below the level of the DNS that is directly controlled by
1330	   the top-level registry.

1332	   Because of the number of conflicts that can be generated by the
1333	   larger number of available and confusable characters in Unicode, we
1334	   recommend that registration-restriction and dispute resolution
1335	   policies be developed to constrain IDN registrations by registries
1336	   and zone administrators at all levels of the DNS tree.  Of course,
1337	   many of these policies will be less formal than others and there is
1338	   no requirement for complete global consistency, but the arguments for
1339	   reduction of confusable characters and other issues in TLDs should
1340	   apply to all zones below that specific TLD.

1342	   Consistency across all zones can obviously only be accomplished by
1343	   changes to the protocols.  Such changes should be considered by the
1344	   IETF if particular restrictions are identified that are important and
1345	   consistent enough to be applied globally.

1347	   Some potential protocol changes or changes to character-mapping
1348	   tables might, if adopted, have profound registry policy
1349	   implications.  See Section 4.1.4.

1351	4.2.3.  IDN TLDs

1353	   The IAB has concluded that there is not one IDN TLD issue but at
1354	   least three very separate ones:

1356	   o  If IDN entries are to be made in the root zone, decisions must
1357	      first be made about how these TLDs are to be named and delegated.
1358	      These decisions fall within the traditional IANA scope and are
1359	      ICANN issues today.
1360	   o  There has been discussion of permitting some or all existing TLDs
1361	      to be referenced by multiple labels, with those labels presumably
1362	      representing some understanding of the "name" of the TLD in
1363	      different languages.  If actual aliases of this type are desired
1364	      for existing domains, the IETF may need to consider whether the
1365	      use of DNAME records in the root is appropriate to meet that need,
1366	      what constraints, if any, are needed, whether alternate
1367	      approaches, such as those of [RFC4185], are appropriate or whether
1368	      further alternatives should be investigated.  But, to the extent
1369	      to which aliases are considered desirable and feasible, decisions
1370	      presumably must be made as to which, if any, root IDN labels
1371	      should be associated with DNAME records and which ones should be
1372	      handled by normal delegation records or other mechanisms.  That
1373	      decision is one of DNS root-level namespace policy and hence falls
1374	      to ICANN although we would expect ICANN to pay careful attention
1375	      to any technical, operational, or security recommendations that
1376	      may be produced by other bodies.
1377	   o  Finally, if IDN labels are to be placed in the root zone, there
1378	      are issues associated with how they are to be encoded and
1379	      deployed.  This area may have implications for work that has been
1380	      done, or should be done, in the IETF.

1382	5.  Specific Recommendations for Next Steps

1384	   Consistent with the framework described above, the IAB offers these
1385	   recommendations as steps for further consideration in the identified
1386	   groups.

1388	5.1.  Reduction of permitted character list

1390	   Generalize from the original "hostname" rules to non-ASCII
1391	   characters, permitting as few characters as possible to do that job.
1392	   This would involve a restrictive model for characters permitted in
1393	   IDN labels, thus contrasting with the approach used to develop the
1394	   original IDNA/Nameprep tables.  That approach was to include all
1395	   Unicode characters that there was not a clear reason to exclude.

1397	   The specific recommendation here is to specify such internationalized
1398	   hostnames.  Such an activity would fall to the IETF, although the
1399	   task of developing the appropriate list of permitted characters will
1400	   require effort both in the IETF and elsewhere.  The effort should be
1401	   as linguistically and culturally sensitive as possible, but smooth
1402	   and effective operation of the DNS, including minimizing of
1403	   complexity, should be primary goals.  The following should be
1404	   considered as possible mechanisms for achieving an appropriate
1405	   minimum number of characters.

1407	5.1.1.  Elimination of all non-language characters

1409	   Unicode characters that are not needed to write words or numbers in
1410	   any of the world's languages should be eliminated from the list of
1411	   characters that are appropriate in DNS labels.  In addition to such
1412	   characters as those used for box-drawing and sentence punctuation,
1413	   this should exclude punctuation for word structure and other
1414	   delimiters: while DNS labels may conveniently be used to express
1415	   words in many circumstances, the goal is not to express words (or
1416	   sentences or phrases), but to permit the creation of unambiguous
1417	   labels with good mnemonic value.
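
   One rough way to approximate this rule is to filter on the Unicode
   general category, as in the Python sketch below.  Treating letters
   (L*), combining marks (M*), and decimal digits (Nd) as "language
   characters" is an assumption of this sketch, not a rule taken from
   any specification, and the function names are hypothetical.  Note
   that the ASCII hyphen falls in a punctuation category and would
   need separate treatment.

```python
import unicodedata

def is_language_character(ch: str) -> bool:
    """Letters (L*), combining marks (M*) and decimal digits (Nd) pass;
    punctuation, symbols, separators and controls do not."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd"

def rejected_characters(label: str) -> list[str]:
    """List the characters in label that the filter would exclude."""
    return [ch for ch in label if not is_language_character(ch)]
```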

1419	5.1.2.  Elimination of word-separation punctuation

1421	   The inclusion of the hyphen in the original hostname rules is a
1422	   historical artifact from an older, flat, name space.  The community
1423	   should consider whether it is appropriate to treat it as a simple
1424	   legacy property of ASCII names and not attempt to generalize it to
1425	   other scripts.  We might, for example, not permit claimed equivalents
1426	   to the hyphen from other scripts to be used in IDNs.  We might even
1427	   consider banning use of the hyphen itself in non-ASCII strings or,
1428	   less restrictively, strings that contained non-Latin characters.

1430	5.2.  Updating to new versions of Unicode

1432	   As new scripts, to support new languages, continue to be added to
1433	   Unicode, it is important that IDNA track updates.  If it does not do
1434	   so, but remains "stuck" at 3.2 or some single later version, it will
1435	   not be possible to include labels in the DNS that are derived from
1436	   words in languages that require characters that are available only in
1437	   later versions.  Making those upgrades is difficult, and will
1438	   continue to be difficult, as long as new versions require, not just
1439	   addition of characters, but changes to canonicalization conventions,
1440	   normalization tables, or matching procedures (see Section 3.1).
1441	   Anything that can be done to lower complexity and simplify forward
1442	   transitions should be seriously considered.

1444	5.3.  Role and Uses of the DNS

1446	   We wish to remind the community that there are boundaries to the
1447	   appropriate uses of the DNS.  It was designed and implemented to
1448	   serve some specific purposes.  There are additional things that it
1449	   does well, other things that it does badly, and still other things it
1450	   cannot do at all.  No amount of protocol work on IDNs will solve
1451	   problems with alternate spellings, near-matches, searching for
1452	   appropriate names, and so on.  Registration restrictions and
1453	   carefully-designed user interfaces can be used to reduce the risk and
1454	   pain of attempts to do some of these things gone wrong, as well as
1455	   reducing the risks of various sorts of deliberate bad behavior, but,
1456	   beyond a certain point, use of the DNS simply because it is available
1457	   becomes a bad tradeoff.  The tradeoff may be particularly unfortunate
1458	   when the use of IDNs does not actually solve the proposed problem.
1459	   For example, internationalization of DNS names does not eliminate the
1460	   ASCII protocol identifiers and structure of URIs [RFC3986] and even
1461	   IRIs [RFC3987].  Hence, DNS internationalization itself, at any or
1462	   all levels of the DNS tree, is not a sufficient response to the
1463	   desire of populations to use the Internet entirely in their own
1464	   languages and the characters associated with those languages.

1466	   These issues are discussed at more length, and alternatives
1467	   presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices].

1469	5.4.  Databases of Registered Names

1471	   In addition to their presence in the DNS, IDNs introduce issues in
1472	   other contexts in which domain names are used.  In particular, the
1473	   design and content of databases that bind registered names to
1474	   information about the registrant (commonly described as "whois"
1475	   databases) will require review and updating.  For example, the whois
1476	   protocol itself [RFC3912] has no standard capability for handling
1477	   non-ASCII text: one cannot search consistently for, or report, either
1478	   a DNS name or contact information that is not in ASCII characters.
1479	   This may provide some additional impetus for a switch to IRIS
1480	   [RFC3981] [RFC3982] but also raises a number of other questions about
1481	   what information, and in what languages and scripts, should be
1482	   included or permitted in such databases.

1484	6.  Security Considerations

1486	   This document is simply a discussion of IDNs and IDN issues; it
1487	   raises no new security concerns.  However, if some of its
1488	   recommendations to reduce IDNA complexity, the number of available
1489	   characters, and various approaches to constraining the use of
1490	   confusable characters, are followed and prove successful, the risks
1491	   of name spoofing and other problems may be reduced.

1493	7.  Acknowledgments

1495	   The contributions to this report from members of the IAB-IDN ad hoc
1496	   committee are gratefully acknowledged.  Of course, not all of the
1497	   members of that group endorse every comment and suggestion of this
1498	   report.  In particular, this report does not claim to reflect the
1499	   views of the Unicode Consortium as a whole or those of particular
1500	   participants in the work of that Consortium.  The members of the ad
1501	   hoc committee were:

1503	   Rob Austein, Leslie Daigle, Tina Dam, Mark Davis, Patrik Faltstrom,
1504	   Scott Hollenbeck, Cary Karp, John Klensin, Gervase Markham, David
1505	   Meyer, Thomas Narten, Michel Suignard, Sam Weiler, Bert Wijnen, Kurt
1506	   Zeilenga and Lixia Zhang.

1508	   Thanks are due to Tina Dam and others associated with the ICANN IDN
1509	   Working Group for contributions of considerable specific text, to
1510	   Marcos Sanz and Paul Hoffman for careful late-stage reading and
1511	   extensive comments, and to Pete Resnick for many contributions and
1512	   comments, both in conjunction with his former IAB service and
1513	   subsequently.  Olaf M. Kolkman took over IAB leadership for this
1514	   document after Patrik Faltstrom and Pete Resnick stepped down in
1515	   March 2006.

1517	   Members of the IAB at the time of approval of this document were:
1518	   [[anchor40: To be supplied]]

1520	8.  Change History

1522	   [[anchor42: RFC Editor: this section is to be removed before
1523	   publication]]

1525	8.1.  Changes for version -01

1527	   1.  Added discussion and reference to Unicode PR-29.
1528	   2.  Replaced the discussion of the ICANN Guidelines (with thanks to
1529	       Tina Dam and Cary Karp).
1530	   3.  Revised the Bidi text to make the potential recommendation more
1531	       clear.
1532	   4.  Removed any claims (actual or implied) of endorsement by the
1533	       members of the ad hoc committee.
1534	   5.  Several small editorial changes, etc.

1536	8.2.  Changes for version -02

1538	   1.  Added some additional references, e.g., to W3C
1539	       internationalization work and to UTR39.
1540	   2.  Adjusted some terminology to correct errors and avoid unnecessary
1541	       controversy.
1542	   3.  Extended the discussion of related characters in Swedish and
1543	       Norwegian to clarify at least one of the possibilities.
1544	   4.  Introduced new Section 5.4 to discuss IDN issues in contexts
1545	       other than the DNS itself and point to IRIS.
1546	   5.  Rewrote the introduction to the "problem" section and its first
1547	       subsection.
1548	   6.  Small changes made to the "definitions" section including
1549	       explaining why "multilingual" is there and rewriting the "script"
1550	       definition to clarify slightly and put the example script names
1551	       into alphabetical order.
1552	   7.  Section 4.2.3 has been fairly extensively rewritten for clarity,
1553	       and a large number of less extensive clarifications have been
1554	       made, although no substantive changes have (intentionally)
1555	       been made.

1557	8.3.  Changes for version -03

1559	   1.  Made a number of further tuning changes to better reflect the
1560	       role of the document and corrected several references.
1561	   2.  Removed the reference to Vietnamese.
1562	   3.  Added a discussion of IDNA versioning and new prefixes.

1564	8.4.  Changes for version -04

1566	   1.  Corrected many small typographical and editorial errors.
1567	   2.  Clarified that elimination of non-language characters was not
1568	       intended to eliminate digits.

1570	8.5.  Changes for version -05

1572	   1.  Revised section 4.3 to further clarify the suggestion.
1573	   2.  Revised the Acknowledgments section.

1575	8.6.  Changes for version -06

1577	   1.  New subsection added to the Introduction to put the document into
1578	       better context.
1579	   2.  New introduction to Section 2.1.
1580	   3.  Several small changes to the Normalization section to further
1581	       clarify that issue.
1582	   4.  Split out Unicode upgrades from other material, in the process
1583	       revising the notorious section 4.3 and giving it additional
1584	       context.
1585	   5.  Acknowledgments updated.
1586	   6.  Many small editorial and clarification corrections.

1588	9.  References

1590	9.1.  Normative References

1592	   [ISO10646]
1593	              International Organization for Standardization,
1594	              "Information Technology - Universal Multiple-Octet Coded
1595	              Character Set (UCS) - Part 1: Architecture and Basic
1596	              Multilingual Plane", ISO/IEC 10646-1:2000, October 2000.

1598	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
1599	              Internationalized Strings ("stringprep")", RFC 3454,
1600	              December 2002.

1602	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
1603	              "Internationalizing Domain Names in Applications (IDNA)",
1604	              RFC 3490, March 2003.

1606	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1607	              Profile for Internationalized Domain Names (IDN)",
1608	              RFC 3491, March 2003.

1610	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
1611	              for Internationalized Domain Names in Applications
1612	              (IDNA)", RFC 3492, March 2003.

1614	   [Unicode32]
1615	              The Unicode Consortium, "The Unicode Standard, Version
1616	              3.0", 2000.

1618	              (Reading, MA, Addison-Wesley, 2000.  ISBN 0-201-61633-5).
1619	              Version 3.2 consists of the definition in that book as
1620	              amended by the Unicode Standard Annex #27: Unicode 3.1
1621	              (http://www.unicode.org/reports/tr27/) and by the Unicode
1622	              Standard Annex #28: Unicode 3.2
1623	              (http://www.unicode.org/reports/tr28/).

1625	9.2.  Informative References

1627	   [DNS-Choices]
1628	              Faltstrom, P., "Design Choices When Expanding DNS",
1629	              draft-iab-dns-choices-02 (work in progress), June 2005.

1631	   [ICANNv1]  ICANN, "Guidelines for the Implementation of
1632	              Internationalized Domain Names, Version 1.0", March 2003,
1633	              <http://www.icann.org/general/idn-guidelines-20jun03.htm>.

1635	   [ICANNv2]  ICANN, "Guidelines for the Implementation of
1636	              Internationalized Domain Names, Version 2.0",
1637	              November 2005,
1638	              <http://www.icann.org/general/idn-guidelines-20sep05.htm>.

1640	   [IESG-IDN]
1641	              Internet Engineering Steering Group (IESG), "IESG
1642	              Statement on IDN", IESG Statements IDN Statement,
1643	              February 2003,
1644	              <http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt>.

1646	   [INDNS]    National Research Council, "Signposts in Cyberspace: The
1647	              Domain Name System and Internet Navigation", National
1648	              Academy Press ISBN 0309-09640-5 (Book) 0309-54979-5 (PDF),
1649	              2005,
1650	              <http://www7.nationalacademies.org/cstb/pub_dns.html>.

1652	   [ISO.2022.1986]
1653	              International Organization for Standardization,
1654	              "Information Processing: ISO 7-bit and 8-bit coded
1655	              character sets: Code extension techniques", ISO Standard
1656	              2022, 1986.

1658	   [ISO.646.1991]
1659	              International Organization for Standardization,
1660	              "Information technology - ISO 7-bit coded character set
1661	              for information interchange", ISO Standard 646, 1991.

1663	   [ISO.8859.2003]
1664	              International Organization for Standardization,
1665	              "Information processing - 8-bit single-byte coded graphic
1666	              character sets - Part 1: Latin alphabet No. 1 (1998) -
1667	              Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin
1668	              alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4
1669	              (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6:
1670	              Latin/Arabic alphabet (1999) - Part 7: Latin/Greek
1671	              alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) -
1672	              Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin
1673	              alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet
1674	              (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14:
1675	              Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin
1676	              alphabet No. 9 (1999) - Part 16: Latin alphabet
1677	              No. 10 (2001)", ISO Standard 8859, 2003.

1679	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1680	              Languages", BCP 18, RFC 2277, January 1998.

1682	   [RFC2825]  IAB and L. Daigle, "A Tangled Web: Issues of I18N, Domain
1683	              Names, and the Other Internet protocols", RFC 2825,
1684	              May 2000.

1686	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
1687	              Languages", BCP 47, RFC 3066, January 2001.

1689	   [RFC3467]  Klensin, J., "Role of the Domain Name System (DNS)",
1690	              RFC 3467, February 2003.

1692	   [RFC3536]  Hoffman, P., "Terminology Used in Internationalization in
1693	              the IETF", RFC 3536, May 2003.

1695	   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
1696	              Engineering Team (JET) Guidelines for Internationalized
1697	              Domain Names (IDN) Registration and Administration for
1698	              Chinese, Japanese, and Korean", RFC 3743, April 2004.

1700	   [RFC3912]  Daigle, L., "WHOIS Protocol Specification", RFC 3912,
1701	              September 2004.

1703	   [RFC3981]  Newton, A. and M. Sanz, "IRIS: The Internet Registry
1704	              Information Service (IRIS) Core Protocol", RFC 3981,
1705	              January 2005.

1707	   [RFC3982]  Newton, A. and M. Sanz, "IRIS: A Domain Registry (dreg)
1708	              Type for the Internet Registry Information Service
1709	              (IRIS)", RFC 3982, January 2005.

1711	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1712	              Resource Identifier (URI): Generic Syntax", STD 66,
1713	              RFC 3986, January 2005.

1715	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
1716	              Identifiers (IRIs)", RFC 3987, January 2005.

1718	   [RFC4185]  Klensin, J., "National and Local Characters for DNS Top
1719	              Level Domain (TLD) Names", RFC 4185, October 2005.

1721	   [RFC4290]  Klensin, J., "Suggested Practices for Registration of
1722	              Internationalized Domain Names (IDN)", RFC 4290,
1723	              December 2005.

1725	   [UTR]      Unicode Consortium, "Unicode Technical Reports",
1726	              <http://www.unicode.org/reports/>.

1728	   [UTR36]    Davis, M. and M. Suignard, "Unicode Technical Report #36:
1729	              Unicode Security Considerations", November 2005,
1730	              <http://www.unicode.org/draft/reports/tr36/tr36.html>.

1732	              Working Draft for Proposed Update

1734	   [UTR39]    Davis, M. and M. Suignard, "Unicode Technical Standard #39
1735	              (proposed): Unicode Security Considerations", July 2005,
1736	              <http://www.unicode.org/draft/reports/tr39/tr39.html>.

1738	              Working Draft for Proposed Draft

1740	   [Unicode-PR29]
1741	              The Unicode Consortium, "Public Review Issue #29:
1742	              Normalization Issue", Unicode PR 29, February 2004.

1744	   [Unicode10]
1745	              The Unicode Consortium, "The Unicode Standard, Version
1746	              1.0", 1991.

1748	   [W3C-Localization]
1749	              Ishida, R. and S. Miller, "Localization vs.
1750	              Internationalization", W3C International/questions/
1751	              qa-i18n.txt, December 2005.

1753	   [ltru-initial]
1754	              Ewell, D., Ed., "Initial Language Subtag Registry",
1755	              draft-ietf-ltru-initial-06 (work in progress),
1756	              February 2004.

1758	              This document is awaiting publication as an Informational
1759	              RFC.

1761	   [ltru-registry]
1762	              Phillips, A., Ed. and M. Davis, Ed., "Tags for Identifying
1763	              Languages", draft-ietf-ltru-registry-14 (work in
1764	              progress), October 2004.

1766	              This document has been approved as a Proposed Standard and
1767	              is awaiting publication as an RFC.

1769	   [net-utf8]
1770	              Klensin, J. and M. Padlipsky, "Unicode Format for Network
1771	              Interchange",
1772	              InternetDraft draft-klensin-net-utf8-00f.txt, April 2006.

1774	Authors' Addresses

1776	   John C Klensin
1777	   1770 Massachusetts Ave, #322
1778	   Cambridge, MA  02140
1779	   USA

1781	   Phone: +1 617 491 5735
1782	   Email: john-ietf@jck.com

1784	   Patrik Faltstrom
1785	   Cisco Systems

1787	   Email: paf@cisco.com

1789	   Cary Karp
1790	   Swedish Museum of Natural History
1791	   Box 50007
1792	   Stockholm  SE-10405
1793	   Sweden

1795	   Phone: +46 8 5195 4055
1796	   Email: ck@nic.museum

1798	   IAB

1800	   Email: iab@iab.org

1802	Intellectual Property Statement

1804	   The IETF takes no position regarding the validity or scope of any
1805	   Intellectual Property Rights or other rights that might be claimed to
1806	   pertain to the implementation or use of the technology described in
1807	   this document or the extent to which any license under such rights
1808	   might or might not be available; nor does it represent that it has
1809	   made any independent effort to identify any such rights.  Information
1810	   on the procedures with respect to rights in RFC documents can be
1811	   found in BCP 78 and BCP 79.

1813	   Copies of IPR disclosures made to the IETF Secretariat and any
1814	   assurances of licenses to be made available, or the result of an
1815	   attempt made to obtain a general license or permission for the use of
1816	   such proprietary rights by implementers or users of this
1817	   specification can be obtained from the IETF on-line IPR repository at
1818	   http://www.ietf.org/ipr.

1820	   The IETF invites any interested party to bring to its attention any
1821	   copyrights, patents or patent applications, or other proprietary
1822	   rights that may cover technology that may be required to implement
1823	   this standard.  Please address the information to the IETF at
1824	   ietf-ipr@ietf.org.

1826	Disclaimer of Validity

1828	   This document and the information contained herein are provided on an
1829	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1830	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1831	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1832	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1833	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1834	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1836	Copyright Statement

1838	   Copyright (C) The Internet Society (2006).  This document is subject
1839	   to the rights, licenses and restrictions contained in BCP 78, and
1840	   except as set forth therein, the authors retain all their rights.

1842	Acknowledgment

1844	   Funding for the RFC Editor function is currently provided by the
1845	   Internet Society.