idnits 2.17.1 

draft-ietf-idnabis-rationale-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 2276.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2287.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2294.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2300.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 12, 2008) is 5764 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDNA2008-Bidi'

  == Outdated reference: A later version (-18) exists of
     draft-ietf-idnabis-protocol-02

  == Outdated reference: A later version (-09) exists of
     draft-ietf-idnabis-tables-01

  ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  ** Obsolete normative reference: RFC 5226 (Obsoleted by RFC 8126)

  == Outdated reference: A later version (-18) exists of
     draft-ietf-idnabis-protocol-02

  -- Duplicate reference: draft-ietf-idnabis-protocol, mentioned in
     'RulesInit', was also mentioned in 'IDNA2008-Protocol'.

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode51'

  -- Obsolete informational reference (is this intentional?): RFC  810
     (Obsoleted by RFC 952)


     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft                                             July 12, 2008
4	Intended status: Standards Track
5	Expires: January 13, 2009

7	  Internationalized Domain Names for Applications (IDNA): Definitions,
8	                        Background and Rationale
9	                  draft-ietf-idnabis-rationale-01.txt

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on January 13, 2009.

36	Abstract

38	   Several years have passed since the original protocol for
39	   Internationalized Domain Names (IDNs) was completed and deployed.
40	   During that time, a number of issues have arisen, including the need
41	   to update the system to deal with newer versions of Unicode.  Some of
42	   these issues require tuning of the existing protocols and the tables
43	   on which they depend.  This document provides an overview of a
44	   revised system and provides explanatory material for its components.

46	Table of Contents

48	   1.  Introduction
49	     1.1.  Context and Overview
50	     1.2.  Discussion Forum
51	     1.3.  Objectives
52	     1.4.  Applicability and Function of IDNA
53	     1.5.  Terminology
54	       1.5.1.  Documents and Standards
55	       1.5.2.  Terminology about Characters and Character Sets
56	       1.5.3.  DNS-related Terminology
57	       1.5.4.  Terminology Specific to IDNA
58	       1.5.5.  Punycode is an Algorithm, not a Name
59	       1.5.6.  Other Terminology Issues
60	     1.6.  Comprehensibility of IDNA Mechanisms and Processing
61	   2.  Summary of Major Changes from IDNA2003
62	   3.  The Revised IDNA Model
63	   4.  Processing in IDNA2008
64	   5.  IDNA2008 Document List
65	   6.  Permitted Characters: An Inclusion List
66	     6.1.  A Tiered Model of Permitted Characters and Labels
67	       6.1.1.  PROTOCOL-VALID
68	       6.1.2.  DISALLOWED
69	       6.1.3.  UNASSIGNED
70	     6.2.  Registration Policy
71	     6.3.  Layered Restrictions: Tables, Context, Registration,
72	           Applications
73	   7.  Issues that Constrain Possible Solutions
74	     7.1.  Display and Network Order
75	     7.2.  Entry and Display in Applications
76	     7.3.  Linguistic Expectations: Ligatures, Digraphs, and
77	           Alternate Character Forms
78	     7.4.  Case Mapping and Related Issues
79	     7.5.  Right to Left Text
80	   8.  IDNs and the Robustness Principle
81	   9.  Front-end and User Interface Processing
82	   10. Migration and Version Synchronization
83	     10.1. Design Criteria
84	       10.1.1. General IDNA Validity Criteria
85	       10.1.2. Labels in Registration
86	       10.1.3. Labels in Resolution (Lookup)
87	     10.2. More Flexibility in User Agents
88	     10.3. The Question of Prefix Changes
89	       10.3.1. Conditions Requiring a Prefix Change
90	       10.3.2. Conditions Not Requiring a Prefix Change
91	       10.3.3. Implications of Prefix Changes
92	     10.4. Stringprep Changes and Compatibility
93	     10.5. The Symbol Question
94	     10.6. Migration Between Unicode Versions: Unassigned Code
95	           Points
96	     10.7. Other Compatibility Issues
97	   11. Acknowledgments
98	   12. Contributors
99	   13. IANA Considerations
100	     13.1. IDNA Character Registry
101	     13.2. IDNA Context Registry
102	     13.3. IANA Repository of IDN Practices of TLDs
103	   14. Security Considerations
104	   15. Change Log
105	     15.1. Version -01 of draft-klensin-idnabis-issues
106	     15.2. Version -02 of draft-klensin-idnabis-issues
107	     15.3. Version -03 of draft-klensin-idnabis-issues
108	     15.4. Version -04 of draft-klensin-idnabis-issues
109	     15.5. Version -05 of draft-klensin-idnabis-issues
110	     15.6. Version -06 of draft-klensin-idnabis-issues
111	     15.7. Version -07 of draft-klensin-idnabis-issues
112	     15.8. Version -00 of draft-ietf-idnabis-rationale
113	     15.9. Version -01 of draft-ietf-idnabis-rationale
114	   16. References
115	     16.1. Normative References
116	     16.2. Informative References
117	   Author's Address
118	   Intellectual Property and Copyright Statements

120	1.  Introduction

122	1.1.  Context and Overview

124	   Several years have passed since the original protocol for
125	   Internationalized Domain Names (IDNs) was completed and deployed.
126	   During that time, a number of issues have arisen, including a subset
127	   of those described in a recent IAB report [RFC4690] and the need to
128	   update the system to deal with newer versions of Unicode.  Those
129	   standards are known as Internationalized Domain Names in Applications
130	   (IDNA), taken from the name of the highest level standard within that
131	   group (see Section 1.5).  Some tuning of the existing protocols and
132	   the tables on which they depend is now required.  Where it is
133	   important to understanding the revised protocols, this document
134	   further explains the issues that have been encountered.  It also
135	   provides an overview of the new IDNA model and explanatory material
136	   for it.  Additional explanatory material for the specific components
137	   of the proposals will appear with the associated documents.

139	1.2.  Discussion Forum

141	   [[anchor4: RFC Editor: please remove this section.]]

143	   This work is being discussed in the IETF "idnabis" Working Group and
144	   on the mailing list idna-update@alvestrand.no.

146	1.3.  Objectives

148	   The intent of the IDNA revision effort, and hence of this document
149	   and the associated ones, is to increase the usability and
150	   effectiveness of internationalized domain names (IDNs) while
151	   preserving or strengthening the integrity of references that use
152	   them.  The original "hostname" character definitions (see, e.g.,
153	   [RFC0810]) struck a balance between the creation of useful mnemonics
154	   and the introduction of parsing problems or general confusion in the
155	   contexts in which domain names are used.  Our objective is to
156	   preserve that balance while expanding the character repertoire to
157	   include extended versions of Roman-derived scripts and scripts that
158	   are not Roman in origin.  No work of this sort will be able to
159	   completely eliminate sources of visual or textual confusion: such
160	   confusion is possible even under the original rules where only ASCII
161	   characters were permitted.  However, one can hope, through the
162	   application of different techniques at different points (see
163	   Section 6.3), to keep problems to an acceptable minimum.  One
164	   consequence of this general objective is that the desire of some user
165	   or marketing community to use a particular string --whether the
166	   reason is to try to write sentences of particular languages in the
167	   DNS, to express a facsimile of the symbol for a brand, or for some
168	   other purpose-- is not a primary goal within the context of
169	   applications in the domain name space.

171	1.4.  Applicability and Function of IDNA

173	   The IDNA standard does not require any applications to conform to it,
174	   nor does it retroactively change those applications.  An application
175	   can elect to use IDNA in order to support IDN while maintaining
176	   interoperability with existing infrastructure.  If an application
177	   wants to use non-ASCII characters in domain names, IDNA is the only
178	   currently-defined option.  Adding IDNA support to an existing
179	   application entails changes to the application only, and leaves room
180	   for flexibility in front-end processing and more specifically in the
181	   user interface (see Section 9).

183	   A great deal of the discussion of IDN solutions has focused on
184	   transition issues and how IDNs will work in a world where not all of
185	   the components have been updated.  Proposals that were not chosen by
186	   the original IDN Working Group would depend on user applications,
187	   resolvers, and DNS servers being updated in order for a user to apply
188	   an internationalized domain name in any form or coding acceptable
189	   under that method.  With IDNA, processing must be performed prior
190	   to or after access to the DNS, but no changes are needed to the DNS
191	   protocol, any DNS servers, or the resolvers on users' computers.

193	   The IDNA specification solves the problem of extending the repertoire
194	   of characters that can be used in domain names to include a large
195	   subset of the Unicode repertoire.

197	   IDNA does not extend the service offered by DNS to the applications.
198	   Instead, the applications (and, by implication, the users) continue
199	   to see an exact-match lookup service.  Either there is a single
200	   exactly-matching name or there is no match.  This model has served
201	   the existing applications well, but it requires, with or without
202	   internationalized domain names, that users know the exact spelling of
203	   the domain names that are to be typed into applications such as web
204	   browsers and mail user agents.  The introduction of the larger
205	   repertoire of characters potentially makes the set of misspellings
206	   larger, especially given that in some cases the same appearance, for
207	   example on a business card, might visually match several Unicode code
208	   points or several sequences of code points.

210	   IDNA allows the graceful introduction of IDNs not only by avoiding
211	   upgrades to existing infrastructure (such as DNS servers and mail
212	   transport agents), but also by allowing some rudimentary use of IDNs
213	   in applications by using the ASCII representation of the non-ASCII
214	   name labels.  While such names are user-unfriendly to read and type,
215	   and hence not optimal for user input, they allow (for instance)
216	   replying to email and clicking on URLs even though the domain name
217	   displayed is incomprehensible to the user.  In order to allow user-
218	   friendly input and output of the IDNs and acceptance of some
219	   characters as equivalent to those to be processed according to the
220	   protocol, the applications need to be modified to conform to this
221	   specification.

223	   IDNA uses the Unicode character repertoire, for continuity with
224	   IDNA2003.

226	1.5.  Terminology

228	1.5.1.  Documents and Standards

230	   This document uses the term "IDNA2003" to refer to the set of
231	   standards that make up and support the version of IDNA published in
232	   2003, i.e., those commonly known as the IDNA base specification
233	   [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep
234	   [RFC3454].  In this document, those names are used to refer,
235	   conceptually, to the individual documents, with the base IDNA
236	   specification called just "IDNA".

238	   The term "IDNA2008" is used to refer to a new version of IDNA as
239	   described in this document and in the documents described in
240	   Section 5.  References to "these specifications" are to the entire
241	   set.

243	1.5.2.  Terminology about Characters and Character Sets

245	   A code point is an integer value associated with a character in a
246	   coded character set.

248	   Unicode [Unicode51] is a coded character set containing almost
249	   100,000 characters as of the current version.  A single Unicode code
250	   point is denoted by "U+" followed by four to six hexadecimal digits,
251	   while a range of Unicode code points is denoted by two four- to
252	   six-digit hexadecimal numbers separated by "..", with no prefixes.
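
   As an illustration of the notation just described, the following
   minimal Python sketch renders a single code point and a range of
   code points.  The function names are hypothetical and are not part
   of any IDNA specification.

```python
def format_code_point(ch: str) -> str:
    """Render one character as "U+" followed by four to six hex digits."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Zero-pad to at least four digits; code points above U+FFFF
    # naturally take five or six digits.
    return f"U+{ord(ch):04X}"


def format_range(first: int, last: int) -> str:
    """Render a range of code points as two hex numbers joined by ".."."""
    return f"{first:04X}..{last:04X}"


print(format_code_point("A"))        # a four-digit code point
print(format_range(0x0041, 0x005A))  # the ASCII upper-case letters
```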

254	   ASCII means US-ASCII [ASCII], a coded character set containing 128
255	   characters associated with code points in the range 0000..007F.
256	   Unicode may be thought of as an extension of ASCII; it includes all
257	   the ASCII characters and associates them with equivalent code points.

259	   "Letters" are, informally, generalizations from the ASCII and common-
260	   sense understanding of that term, i.e., characters that are used to
261	   write text that are not digits, symbols, or punctuation.  Formally,
262	   they are characters with a Unicode General Category value starting in
263	   "L" (see Section 4.5 of [Unicode51]).
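
   The formal definition of "letter" can be checked mechanically.  The
   sketch below uses Python's standard unicodedata module; the helper
   name is hypothetical, and the module reflects whatever Unicode
   version ships with the interpreter, which is likely to be later
   than the version cited here.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the character's General Category value starts with "L"."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Categories such as Lu, Ll, Lt, Lm, and Lo all begin with "L".
    return unicodedata.category(ch).startswith("L")


for sample in ["a", "\u00df", "7", "-"]:
    print(repr(sample), unicodedata.category(sample), is_letter(sample))
```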

265	1.5.3.  DNS-related Terminology

267	   When discussing the DNS, this document generally assumes the
268	   terminology used in the DNS specifications [RFC1034] [RFC1035].  The
269	   terms "lookup" and "resolution" are used interchangeably and the
270	   process or application component that performs DNS resolution is
271	   called a "resolver".  The process of placing an entry into the DNS is
272	   referred to as "registration" paralleling common contemporary usage
273	   in other contexts.  Consequently, any DNS zone administration is
274	   described as a "registry", regardless of the actual administrative
275	   arrangements or level in the tree.  A note about that relationship is
276	   included in the text below where it seems particularly significant.

278	   The term "LDH code points" is defined in this document to mean the
279	   code points associated with ASCII letters, digits, and the hyphen-
280	   minus; that is, U+002D, 0030..0039, 0041..005A, and 0061..007A. "LDH"
281	   is an abbreviation for "letters, digits, hyphen".
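
   The LDH code point ranges listed above translate directly into a
   membership test.  This is a minimal Python sketch with hypothetical
   names; it checks only the character repertoire and none of the
   other host name rules, such as the position of hyphens.

```python
# The ranges given in the text: hyphen-minus, digits, upper-case
# letters, and lower-case letters.
LDH_RANGES = [
    (0x002D, 0x002D),
    (0x0030, 0x0039),
    (0x0041, 0x005A),
    (0x0061, 0x007A),
]


def is_ldh_code_point(ch: str) -> bool:
    """True if the single character ch is an LDH code point."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in LDH_RANGES)


def uses_only_ldh(label: str) -> bool:
    """True if a non-empty label consists solely of LDH code points."""
    return len(label) > 0 and all(is_ldh_code_point(ch) for ch in label)


print(uses_only_ldh("example-1"))
print(uses_only_ldh("ex_ample"))
```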

283	   The base DNS specifications [RFC1034] [RFC1035] discuss "domain
284	   names" and "host names", but many people and sections of these
285	   specifications use the terms interchangeably.  Further, because those
286	   documents were not terribly clear, many people who are sure they know
287	   the exact definitions of each of these terms disagree on the
288	   definitions.  This document generally uses the term "domain name".
289	   When it refers to, e.g., host name syntax restrictions, it explicitly
290	   cites the relevant defining documents.  The remaining definitions in
291	   this subsection are essentially a review.

293	   A label is an individual component of a domain name.  Labels are
294	   usually shown separated by dots; for example, the domain name
295	   "www.example.com" is composed of three labels: "www", "example", and
296	   "com".  (The zero-length root label described in [RFC1123], which can
297	   be explicit as in "www.example.com." or implicit as in
298	   "www.example.com", is not considered a label in this specification.)
299	   IDNA extends the set of usable characters in labels that are text.
300	   For the rest of this document, the term "label" is shorthand for
301	   "text label", and "every label" means "every text label".
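
   The label decomposition described above can be sketched as follows.
   The function name is hypothetical, and the sketch assumes the
   ordinary dot-separated presentation form; in line with the text, it
   does not count the root label, whether explicit or implicit.

```python
from typing import List


def split_labels(domain_name: str) -> List[str]:
    """Split a dot-separated domain name into its text labels."""
    # An explicit trailing dot stands for the zero-length root label,
    # which is not considered a label here.
    if domain_name.endswith("."):
        domain_name = domain_name[:-1]
    return domain_name.split(".")


print(split_labels("www.example.com"))
print(split_labels("www.example.com."))
```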

303	1.5.4.  Terminology Specific to IDNA

305	   This section defines some terminology to reduce dependence on terms
306	   and definitions that have been problematic in the past.

308	1.5.4.1.  Terms for IDN Label Codings

310	1.5.4.1.1.  IDNA-valid strings, A-label, and U-label

312	   To improve clarity, this document introduces three new terms in this
313	   subsection.  In the next, it defines a historical one to be slightly
314	   more precise for IDNA contexts.

316	   o  A string is "IDNA-valid" if it meets all of the requirements of
317	      these specifications for an IDNA label.  IDNA-valid strings may
318	      appear in either of two forms, defined immediately below.  It is
319	      expected that specific reference will be made to the form
320	      appropriate to any context in which the distinction is important.

322	   o  An "A-label" is the ASCII-Compatible Encoding (ACE, see
323	      Section 1.5.4.4) form of an IDNA-valid string.  It must be a
324	      complete label: IDNA is defined for labels, not for parts of them
325	      and not for complete domain names.  This means, by definition,
326	      that every A-label will begin with the IDNA ACE prefix, "xn--",
327	      followed by a string that is a valid output of the Punycode
328	      algorithm and hence a maximum of 59 ASCII characters in length.
329	      The prefix and string together must conform to all requirements
330	      for a label that can be stored in the DNS including conformance to
331	      the LDH ("host name") rule described in RFC 1034, RFC 1123 and
332	      elsewhere.

334	   o  A "U-label" is an IDNA-valid string of Unicode characters,
335	      including at least one non-ASCII character, expressed in a
336	      standard Unicode Encoding Form, normally UTF-8 in an Internet
337	      transmission context, and subject to the constraint below.
338	      Conversion between valid U-labels and valid A-labels is performed
339	      according to the specification in [RFC3492], adding or removing
340	      the ACE prefix (see Section 1.5.4.4) as needed.

342	   To be valid, U-labels and A-labels must obey an important symmetry
343	   constraint.  While that constraint may be tested in any of several
344	   ways, an A-label must be capable of being produced by conversion from
345	   a U-label and a U-label must be capable of being produced by
346	   conversion from an A-label.  Among other things, this implies that
347	   both U-labels and A-labels must represent strings in normalized form.
348	   These strings MUST contain only characters specified elsewhere in
349	   this document and its companion documents, and only in the contexts
350	   indicated as appropriate.
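
   The conversion and the symmetry constraint can be illustrated with
   Python's built-in "punycode" codec, which implements the algorithm
   of [RFC3492].  The function names below are hypothetical, and the
   use of Unicode Normalization Form C is an assumption made for this
   sketch: the text above requires only "normalized form".  The sketch
   tests the round trip only; it does not check which characters are
   permitted or in which contexts.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Normalize (NFC, an assumption), Punycode-encode, add the prefix."""
    normalized = unicodedata.normalize("NFC", u_label)
    return ACE_PREFIX + normalized.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Remove the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: ACE prefix is missing")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def is_symmetric(u_label: str) -> bool:
    """True if the U-label survives conversion to an A-label and back."""
    try:
        a_label = to_a_label(u_label)
        return to_u_label(a_label) == u_label
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False


print(to_a_label("b\u00fccher"))
print(is_symmetric("b\u00fccher"))    # precomposed u-with-diaeresis
print(is_symmetric("bu\u0308cher"))   # u followed by combining diaeresis
```

   In this sketch the second string fails the round trip because it
   is not in normalized form, which is the kind of string the
   symmetry constraint is meant to rule out.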

352	   Any rules or conventions that apply to DNS labels in general, such as
353	   rules about lengths of strings, apply to whichever of the U-label or
354	   A-label would be more restrictive.  For the U-label, constraints
355	   imposed by existing protocols and their presentation forms make the
356	   length restriction apply to the length in octets of the UTF-8 form of
357	   those labels (which will always be greater than or equal to the
358	   length in code points).  The exception to this, of course, is that
359	   the restriction to ASCII characters does not apply to the U-label.
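
   A short Python sketch may make the length comparison concrete.  It
   assumes the 63-octet DNS label limit of [RFC1035], which is
   consistent with the 59-character figure given above for the part of
   an A-label that follows the four-character prefix.  The names are
   hypothetical, and no IDNA validity checks are performed.

```python
MAX_LABEL_OCTETS = 63  # DNS label limit (RFC 1035)
ACE_PREFIX = "xn--"


def label_lengths(u_label: str) -> dict:
    """Report the lengths relevant to the DNS label length rule."""
    a_label = ACE_PREFIX + u_label.encode("punycode").decode("ascii")
    return {
        "code_points": len(u_label),
        "utf8_octets": len(u_label.encode("utf-8")),
        "a_label_octets": len(a_label),
    }


def fits_dns_limits(u_label: str) -> bool:
    """Apply the limit to both forms, so the stricter one governs."""
    lengths = label_lengths(u_label)
    return (lengths["utf8_octets"] <= MAX_LABEL_OCTETS
            and lengths["a_label_octets"] <= MAX_LABEL_OCTETS)


print(label_lengths("b\u00fccher"))
print(fits_dns_limits("b\u00fccher"))
```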

361	   A different way to look at these terms, which may be clearer to
362	   some readers, is that U-labels, A-labels, and LDH-labels (see the
363	   next subsection) are disjoint categories that, together, make up the
364	   forms of legitimate strings for use in domain names that describe
365	   hosts.  Of the three, only A-labels and LDH-labels can actually
366	   appear in DNS zone files or queries; U-labels can appear, along with
367	   the other two, in presentation and user interface forms and in
368	   selected protocols other than those of the DNS itself.  Strings that
369	   do not conform to the rules for one of these three categories and, in
370	   particular, strings that contain "--" in the third and fourth
371	   character positions but that:

373	   o  are not A-labels or

375	   o  cannot be processed as U-labels or A-labels as described in these
376	      specifications,

378	   are invalid in IDNA-conformant applications as labels in domain names
379	   that identify Internet hosts or similar resources.  This restriction
380	   on strings containing "--" is required for three reasons:

382	   o  to prevent confusion with pre-IDNA coding forms;

384	   o  to permit future extensions that would require changing the
385	      prefix, no matter how unlikely those might be (see Section 10.3);
386	      and

388	   o  to reduce the opportunities for attacks via the encoding system.
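
   The restriction on "--" in the third and fourth positions can be
   approximated as shown below.  This is a rough Python sketch with a
   hypothetical function name: it accepts such a label only if it
   carries the ACE prefix and the remainder decodes under the Punycode
   algorithm to a string with at least one non-ASCII character.  It
   does not check the remaining requirements for an IDNA-valid string.

```python
ACE_PREFIX = "xn--"


def is_acceptable_hyphen_use(label: str) -> bool:
    """Reject labels with "--" in positions 3 and 4 that are not A-labels."""
    if label[2:4] != "--":
        return True  # the restriction does not apply to this label
    if not label.lower().startswith(ACE_PREFIX):
        return False  # reserved pattern with some other prefix
    try:
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except ValueError:  # covers UnicodeError from either conversion step
        return False
    # A U-label includes at least one non-ASCII character.
    return any(ord(ch) > 0x7F for ch in decoded)


print(is_acceptable_hyphen_use("example"))
print(is_acceptable_hyphen_use("xn--bcher-kva"))
print(is_acceptable_hyphen_use("ab--cd"))
```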

390	1.5.4.2.  LDH-label and Internationalized Label

392	   In the hope of further clarifying discussions about IDNs, these
393	   specifications use the term "LDH-label" strictly to refer to an all-
394	   ASCII label that obeys the "hostname" (LDH) conventions and that is
395	   not an IDN.  In other words, only "U-label" and "A-label" refer to
396	   IDNs; LDH-labels are not IDNs.  "Internationalized label" is used
397	   when a term is needed to refer to any of the three categories.  There
398	   are some standardized DNS label formats, such as those for service
399	   location (SRV) records [RFC2782] that do not fall into any of the
400	   three categories and hence are not internationalized labels.

402	1.5.4.3.  Equivalence

404	   In IDNA, equivalence of labels is defined in terms of the A-labels.
405	   If the A-labels are equal in a case-independent comparison, then the
406	   labels are considered equivalent, no matter how they are represented.
407	   Traditional LDH labels already have a notion of equivalence: within
408	   that list of characters, upper case and lower case are considered
409	   equivalent.  The IDNA notion of equivalence is an extension of that
410	   older notion.  Equivalent labels in IDNA are treated as alternate
411	   forms of the same label, just as "foo" and "Foo" are treated as
412	   alternate forms of the same label.
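
   The equivalence rule can be sketched in Python as a comparison of
   A-label forms.  The names are hypothetical; the sketch treats any
   all-ASCII label as already being in its comparison form and
   performs no validity checking on the labels it is given.

```python
ACE_PREFIX = "xn--"


def to_comparison_form(label: str) -> str:
    """Return the A-label form, or the label itself if it is all ASCII."""
    if all(ord(ch) < 0x80 for ch in label):
        return label  # an LDH-label or an A-label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def labels_equivalent(first: str, second: str) -> bool:
    """Compare two labels by their A-label forms, ignoring ASCII case."""
    return (to_comparison_form(first).lower()
            == to_comparison_form(second).lower())


print(labels_equivalent("foo", "Foo"))
print(labels_equivalent("b\u00fccher", "XN--BCHER-KVA"))
```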

414	1.5.4.4.  ACE Prefix

416	   The "ACE prefix" is defined in this document to be a string of ASCII
417	   characters "xn--" that appears at the beginning of every A-label.
418	   "ACE" stands for "ASCII-Compatible Encoding".

420	1.5.4.5.  Domain Name Slot

422	   A "domain name slot" is defined in this document to be a protocol
423	   element or a function argument or a return value (and so on)
424	   explicitly designated for carrying a domain name.  Examples of domain
425	   name slots include: the QNAME field of a DNS query; the name argument
426	   of the gethostbyname() or getaddrinfo() standard C library functions;
427	   the part of an email address following the at-sign (@) in the
428	   parameter to the SMTP MAIL or RCPT commands or the "From:" field of
429	   an email message header; and the host portion of the URI in the src
430	   attribute of an HTML <IMG> tag.  General text that just happens to
431	   contain a domain name is not a domain name slot.  For example, a
432	   domain name appearing in the plain text body of an email message is
433	   not occupying a domain name slot.

435	   An "IDN-aware domain name slot" is defined in this document to be a
436	   domain name slot explicitly designated for carrying an
437	   internationalized domain name as defined in this document.  The
438	   designation may be static (for example, in the specification of the
439	   protocol or interface) or dynamic (for example, as a result of
440	   negotiation in an interactive session).

442	   An "IDN-unaware domain name slot" is defined in this document to be
443	   any domain name slot that is not an IDN-aware domain name slot.
444	   Obviously, this includes any domain name slot whose specification
445	   predates IDNA.

447	1.5.5.  Punycode is an Algorithm, not a Name

449	   There has been some confusion about whether a "Punycode string" does
450	   or does not include the prefix and about whether it is required that
451	   such strings could have been the output of ToASCII (see RFC 3490,
452	   Section 4 [RFC3490]).  This specification discourages the use of the
453	   term "Punycode" to describe anything but the encoding method and
454	   algorithm of [RFC3492].  The terms defined above are preferred as
455	   much clearer than terms such as "Punycode string".

457	1.5.6.  Other Terminology Issues

459	   The document departs from historical DNS terminology and usage in one
460	   important respect.  Over the years, the community has talked very
461	   casually about "names" in the DNS, beginning with calling it "the
462	   domain name system".  That terminology is fine in the very precise
463	   sense that the identifiers of the DNS do provide names for objects
464	   and addresses.  But, in the context of IDNs, the term has introduced
465	   some confusion, confusion that has increased further as people have
466	   begun to speak of DNS labels in terms of the words or phrases of
467	   various natural languages.

469	   Historically, many, perhaps most, of the "names" in the DNS have been
470	   mnemonics to identify some particular concept, object, or
471	   organization.  They are typically derived from, or rooted in, some
472	   language because most people think in language-based ways.  But,
473	   because they are mnemonics, they need not obey the orthographic
474	   conventions of any language: it is not a requirement that it be
475	   possible for them to be "words".

477	   This distinction is important because the reasonable goal of an IDN
478	   effort is not to be able to write the great Klingon (or language of
479	   one's choice) novel in DNS labels but to be able to form a usefully
480	   broad range of mnemonics in ways that are as natural as possible in a
481	   very broad range of scripts.

483	   An "internationalized domain name" (IDN) is a domain name that may
484	   contain any mixture of LDH-labels, A-labels, or U-labels.  This
485	   implies that every conventional domain name is an IDN (which implies
486	   that it is possible for a domain name to be an IDN without it
487	   containing any non-ASCII characters).  Just as has been the case with
488	   ASCII names, some DNS zone administrators may impose restrictions,
489	   beyond those imposed by DNS or IDNA, on the characters or strings
490	   that may be registered as labels in their zones.  Because of the
491	   diversity of characters that can be used in a U-label and the
492	   confusion they might cause, such restrictions are mandatory for IDN
493	   registries and zones even though the particular restrictions are not
494	   part of these specifications.  Because these restrictions, commonly
495	   known as "registry restrictions", only affect what can be registered
496	   and not resolution processing, they have no effect on the syntax or
497	   semantics of DNS protocol messages; a query for a name that matches
498	   no records will yield the same response regardless of the reason why
499	   it is not in the zone.  Clients issuing queries or interpreting
500	   responses cannot be assumed to have any knowledge of zone-specific
501	   restrictions or conventions.  See Section 6.2.
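
	   As a rough illustration of the relationship between the label
	   forms named above, the following Python sketch converts a U-label
	   to an A-label and back using the Punycode codec from the Python
	   standard library.  The helper names are hypothetical, and the
	   sketch performs none of the validity checks that IDNA requires; it
	   shows only the encoding step.

```python
ACE_PREFIX = "xn--"  # prefix that marks an A-label


def to_a_label(label: str) -> str:
    """Return the ASCII form of a label (illustration only, no validation)."""
    if label.isascii():
        return label  # a conventional ASCII label needs no encoding
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Return the Unicode form of a label (illustration only, no validation)."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def domain_to_ascii(name: str) -> str:
    """Apply to_a_label to each dot-separated label of a domain name."""
    return ".".join(to_a_label(part) for part in name.split("."))
```

	   A name made up only of ASCII labels passes through unchanged,
	   which matches the observation that every conventional domain name
	   is also an IDN.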

503	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
504	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
505	   document are to be interpreted as described in RFC 2119 [RFC2119].

507	1.6.  Comprehensibility of IDNA Mechanisms and Processing

509	   One of the major goals of this work is to improve the general
510	   understanding of how IDNA works and what characters are permitted and
511	   what happens to them.  Comprehensibility and predictability to users
512	   and registrants are themselves important motivations and design goals
513	   for this effort.  The effort includes some new terminology and a
514	   revised and extended model, both covered in this section, and some
515	   more specific protocol, processing, and table modifications.  Details
516	   of the latter appear in other documents (see Section 5).

518	   Several issues are inherent in the application of IDNs and, indeed,
519	   almost any other system that tries to handle international characters
520	   and concepts.  They range from the apparently trivial --e.g., one
521	   cannot display a character for which one does not have a font
522	   available locally-- to the more complex and subtle.  Many people have
523	   observed that internationalization is just a tool to enable effective
524	   localization while permitting some global uniformity.  Issues of
525	   display, of exactly how various strings and characters are entered,
526	   and so on are inherently issues about localization and user interface
527	   design.

529	   A protocol such as IDNA can only assume that such operations as data
530	   entry and reconciliation of differences in character forms are
531	   possible.  It may make some recommendations about how display might
532	   work when characters and fonts are not available, but they can only
533	   be general recommendations and, because display functions are rarely
534	   controlled by the types of applications that would call upon IDNA,
535	   will rarely be very effective.

537	   However, shifting responsibility for character mapping and other
538	   adjustments from the protocol (where it was located in IDNA2003) to
539	   the user interface or processing before invoking IDNA raises issues
540	   about both what that processing should do and about compatibility for
541	   references prepared in an IDNA2003 context.  Those issues are
542	   discussed in Section 9.

544	   Operations for converting between local character sets and normalized
545	   Unicode are part of this general set of user interface issues.  The
546	   conversion is obviously not required at all in a Unicode-native
547	   system that maintains all strings in Normalization Form C (NFC).  It
548	   may, however, involve some complexity in a system that is not
549	   Unicode-native, especially if the elements of the local character set
550	   do not map exactly and unambiguously into Unicode characters or do so
551	   in a way that is not completely stable over time.  Perhaps more
552	   important, if a label being converted to a local character set
553	   contains Unicode characters that have no correspondence in that
554	   character set, the application may have to apply special, locally-
555	   appropriate, methods to avoid or reduce loss of information.
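
	   A minimal sketch of the two directions of conversion, assuming
	   the local character set is already known and is supported by the
	   Python codec machinery.  The function names are hypothetical.
	   Encoding is done in strict mode so that a character with no local
	   equivalent raises an error instead of being silently replaced.

```python
import unicodedata


def local_bytes_to_nfc(raw: bytes, charset: str) -> str:
    """Decode bytes from a known local charset and normalize to NFC."""
    return unicodedata.normalize("NFC", raw.decode(charset))


def unicode_to_local(label: str, charset: str) -> bytes:
    """Encode for a local charset; raises UnicodeEncodeError on any loss."""
    return label.encode(charset, errors="strict")
```

	   An application that catches the encoding error can then apply
	   whatever locally-appropriate fallback it prefers.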

557	   Depending on the system involved, the major difficulty may not lie in
558	   the mapping but in accurately identifying the incoming character set
559	   and then applying the correct conversion routine.  If a local
560	   operating system uses one of the ISO 8859 character sets or an
561	   extensive national or industrial system such as GB18030 [GB18030] or
562	   BIG5 [BIG5], one must correctly identify the character set in use
563	   before converting to Unicode even though those character coding
564	   systems are substantially or completely Unicode-compatible (i.e., all
565	   of the code points in them have an exact and unique mapping to
566	   Unicode code points).  It may be even more difficult when the
567	   character coding system in local use is based on conceptually
568	   different assumptions than those used by Unicode about, e.g.,
569	   font encodings used for publications in some Indic scripts.  Those
570	   differences may not easily yield unambiguous conversions or
571	   interpretations even if each coding system is internally consistent
572	   and adequate to represent the local language and script.
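
	   The identification problem can be illustrated with a single
	   octet.  In the sketch below (the function name is hypothetical),
	   the same byte decodes to three unrelated characters depending on
	   which ISO 8859 part is assumed, even though each mapping to
	   Unicode is exact.

```python
def decode_candidates(raw: bytes, charsets):
    """Decode the same bytes under each candidate charset."""
    return {cs: raw.decode(cs) for cs in charsets}


results = decode_candidates(
    b"\xe4", ["iso-8859-1", "iso-8859-5", "iso-8859-7"]
)
# results maps each charset name to a different character:
# U+00E4 (Latin), U+0444 (Cyrillic), and U+03B4 (Greek) respectively.
```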

574	2.  Summary of Major Changes from IDNA2003

576	   1.   Update the base character set from Unicode 3.2 to a definition
577	        that is Unicode version-agnostic.

579	   2.   Separate the definitions for the "registration" and "lookup"
580	        activities.

582	   3.   Disallow symbol and punctuation characters except where special
583	        exceptions are necessary.

585	   4.   Remove the mapping and normalization steps from the protocol and
586	        have them instead done by the applications themselves, possibly
587	        in a local fashion, before invoking the protocol.

589	   5.   Change the way that the protocol specifies which characters are
590	        allowed in labels from "humans decide what the table of
591	        codepoints contains" to "decisions about codepoints are based on
592	        Unicode properties plus a small exclusion list created by
593	        humans".

595	   6.   Introduce the new concept of characters that can be used only in
596	        specific contexts.

598	   7.   Allow typical words and names in languages such as Dhivehi and
599	        Yiddish to be expressed.

601	   8.   Make bidirectional domain names (delimited strings of labels,
602	        not just labels standing on their own) display in a non-
603	        surprising fashion.

605	   9.   Make bidirectional domain names in a paragraph display in a non-
606	        surprising fashion.[[anchor17: Is this statement necessary or is
607	        it redundant with the previous one?]]

609	   10.  Remove the dot separator from the mandatory part of the
610	        protocol.

612	   11.  Make some currently-valid labels that are not actually IDNA
613	        labels invalid.

615	3.  The Revised IDNA Model

617	   IDNA is a client-side protocol, i.e., almost all of the processing is
618	   performed by the client.  The strings that appear in, and are
619	   resolved by, the DNS conform to the traditional rules for the naming
620	   of hosts, and consist of ASCII letters, digits, and hyphens.  This
621	   approach permits IDNA to be deployed without modifications to the DNS
622	   itself.  That, in turn, avoids both having to upgrade the entire
623	   Internet to support IDNs and needing to incur the unknown risks to
624	   deployed systems of DNS structural or design changes, especially if
625	   those changes need to be deployed all at the same time.
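
	   A minimal sketch of the traditional host name label check that
	   the DNS-visible strings satisfy.  The function name is
	   hypothetical.  The pattern assumes the usual limit of 63 octets
	   per label and the rule that a hyphen may not start or end a label.

```python
import re

# Letters, digits, and hyphens; no leading or trailing hyphen; 1 to 63 octets.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label uses only the traditional LDH repertoire."""
    return _LDH_LABEL.fullmatch(label) is not None
```

	   An A-label such as "xn--bcher-kva" passes this check, which is why
	   it can be carried by the DNS without modification.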

627	4.  Processing in IDNA2008

629	   These specifications separate Domain Name Registration and Resolution
630	   in the protocol specification.  Doing so reflects current practice in
631	   which per-registry restrictions and special processing are applied at
632	   registration time but not on resolution.  Even more important in the
633	   longer term, it facilitates incremental addition of permitted
634	   character groups to avoid freezing on one particular version of
635	   Unicode.

637	   The actual registration and lookup protocols for IDNA2008 are
638	   specified in [IDNA2008-Protocol].

640	5.  IDNA2008 Document List

642	   [[anchor19: This section will need to be extensively revised or
643	   removed before publication.]]
644	   The following documents are being produced as part of the IDNA2008
645	   effort.

647	   o  A revised version of this document, containing an overview,
648	      rationale, and conformance conditions.

650	   o  A separate document, drawn from material in early versions of this
651	      one, that explicitly updates and replaces RFC 3490 but which has
652	      most rationale material from that document moved to this one
653	      [IDNA2008-Protocol].

655	   o  A document describing the "Bidi problem" with Stringprep and
656	      proposing a solution [IDNA2008-Bidi].

658	   o  A specification of the categories and rules that identify the code
659	      points allowed in a U-label, based on Unicode 5.0 code
660	      assignments.  See Section 6 and [IDNA2008-Tables].

662	   o  One or more documents containing guidance and suggestions for
663	      registries (in this context, those responsible for establishing
664	      policies for any zone file in the DNS, not only those at the top
665	      or second level).  The documents in this category may not be IETF
666	      products and may be prepared and completed asynchronously with
667	      those described above.

669	6.  Permitted Characters: An Inclusion List

671	   This section provides an overview of the model used to establish the
672	   algorithm and character lists of [IDNA2008-Tables] and describes the
673	   names and applicability of the categories used there.  Note that the
674	   inclusion of a character in the first category group does not imply
675	   that it can be used indiscriminately; some characters are associated
676	   with contextual rules that must be applied as well.

678	   The information given in this section is provided to make the rules,
679	   tables, and protocol easier to understand.  It is not normative.  The
680	   normative generating rules appear in [IDNA2008-Tables] and the rules
681	   that actually determine what labels can be registered or looked up
682	   are in [IDNA2008-Protocol].

684	6.1.  A Tiered Model of Permitted Characters and Labels

686	   Moving to an inclusion model requires respecifying the list of
687	   characters that are permitted in IDNs.  In IDNA2003, the role and
688	   utility of characters are independent of context and fixed forever
689	   (or until the standard is replaced).  Making completely context-
690	   independent rules globally has proven impractical because some
691	   characters, especially those that are called "Join_Controls" in
692	   Unicode, are needed to make reasonable use of some scripts but have
693	   no visible effect(s) in others.  Of necessity, IDNA2003 prohibited
694	   those types of characters entirely.  But the restrictions were much
695	   too severe to permit an adequate range of mnemonics for terminology
696	   based on some languages.  The requirement to support those characters
697	   but limit their use to very specific contexts was reinforced by the
698	   observation that handling of particular characters across the
699	   languages that use a script, or the use of similar or identical-
700	   looking characters in different scripts, is less well understood than
701	   many people believed it was several years ago.

703	   Independently of the characters chosen (see next subsection), the
704	   theory is to divide the characters that appear in Unicode into three
705	   categories:

707	6.1.1.  PROTOCOL-VALID

709	   Characters identified as "PROTOCOL-VALID" (often abbreviated
710	   "PVALID") are, in general, permitted by IDNA for all uses in IDNs.
711	   Their use may be restricted by rules about the context in which they
712	   appear or by other rules that apply to the entire label in which they
713	   are to be embedded.  For example, any label that contains a character
714	   in this group that has a "right to left" property must be used in
715	   context with the "Bidi" rules (see [IDNA2008-Bidi]).
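
	   A sketch of how an application might detect that a label falls
	   under the "Bidi" rules.  The function name is hypothetical, and
	   treating the Unicode bidirectional classes "R" and "AL" as the
	   trigger is an assumption made for illustration; the actual
	   condition is defined in [IDNA2008-Bidi].

```python
import unicodedata

# Assumption: these bidirectional classes mark right-to-left characters.
RTL_BIDI_CLASSES = {"R", "AL"}


def needs_bidi_rules(label: str) -> bool:
    """Return True if any character in the label is right-to-left."""
    return any(
        unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in label
    )
```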

717	   The term "PROTOCOL-VALID" is used to stress the fact that the
718	   presence of a character in this category does not imply that a given
719	   registry need accept registrations containing any of the characters
720	   in the category.  Registries are still expected to apply judgment
721	   about labels they will accept and to maintain rules consistent with
722	   those judgments (see [IDNA2008-Protocol] and Section 6.3).

724	   Characters that are placed in the "PROTOCOL-VALID" category are never
725	   removed from it unless the code points themselves are removed from
726	   Unicode (such removal would be inconsistent with the Unicode
727	   stability principles (see [Unicode51], Appendix F) and hence should
728	   never occur).

730	   [[anchor21: Placeholder: Does this topic or comment need additional
731	   discussion or explanation?]]

733	6.1.1.1.  Contextual Rules

735	   Some characters may be unsuitable for general use in IDNs but
736	   necessary for the plausible support of some scripts.  The two most
737	   commonly-cited examples are the zero-width joiner and non-joiner
738	   characters (ZWNJ, U+200C, and ZWJ, U+200D), but provisions for
739	   unambiguous labels may require that other characters be restricted to
740	   particular contexts.  For example, the ASCII hyphen is not permitted
741	   to start or end a label, whether that label contains non-ASCII
742	   characters or not.

744	   These characters must not appear in IDNs without additional
745	   restrictions, typically because they have no visible consequences in
746	   most scripts but affect format or presentation in a few others or
747	   because they are combining characters that are safe for use only in
748	   conjunction with particular characters or scripts.  In order to
749	   permit them to be used at all, they are specially identified as
750	   "CONTEXTUAL RULE REQUIRED" and, when adequately understood,
751	   associated with a rule.  In addition, the rule will define whether it
752	   is to be applied on lookup as well as registration.  A distinction is
753	   made between characters that indicate or prohibit joining (known as
754	   "CONTEXT-JOINER" or "CONTEXTJ") and other characters requiring
755	   contextual treatment ("CONTEXT-OTHER" or "CONTEXTO").  Only the
756	   former are fully tested at lookup time.

758	6.1.1.2.  Rules and Their Application

760	   The actual rules may be present or absent.  If present, they may have
761	   values of "True" (character may be used in any position in any
762	   label), "False" (character may not be used in any label), or may be
763	   an extended regular expression that specifies the context in which
764	   the character is permitted.

766	   Examples of descriptions of typical rules, stated informally and in
767	   English, include "Must follow a character from Script XYZ", "Must
768	   occur only if the entire label is in Script ABC", and "Must occur only
769	   if the previous and subsequent characters have the DFG property".

771	   Because it is easier to identify these characters than to know that
772	   they are actually needed in IDNs or how to establish exactly the
773	   right rules for each one, a rule may have a null value in a given
774	   version of the tables.  Characters associated with null rules MUST
775	   NOT appear in putative labels for either registration or lookup.  Of
776	   course, a later version of the tables might contain a non-null rule.
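
	   The four possible rule values (True, False, a regular expression,
	   or null) can be sketched as a small lookup table.  Everything in
	   the example is hypothetical: the table entries are invented for
	   illustration and are not taken from [IDNA2008-Tables], and
	   matching the expression against the whole label is an assumption,
	   since the regular expression language has not yet been defined.

```python
import re
from typing import Dict, Union

Rule = Union[bool, re.Pattern, None]

# Hypothetical entries, for illustration only.
EXAMPLE_RULES: Dict[str, Rule] = {
    # Regular expression: hyphen may not start or end the label.
    "-": re.compile(r"[^-](?:.*[^-])?"),
    # Null rule: identified as needing a rule, but none defined yet.
    "\u200c": None,
}


def label_passes_rules(label: str, rules: Dict[str, Rule]) -> bool:
    """Apply the rule, if any, for each distinct character in the label."""
    for ch in set(label):
        if ch not in rules:
            continue  # character has no contextual rule in this table
        rule = rules[ch]
        if rule is None or rule is False:
            return False  # null or False: must not appear at all
        if rule is True:
            continue  # True: permitted in any position
        if rule.fullmatch(label) is None:
            return False  # expression describes the permitted context
    return True
```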

778	   [[anchor23: Definition of regular expression language to be supplied
779	   or replaced with a description of the definitional technique.  It may
780	   be useful to move more of this material to Tables as part of moving
781	   the rules from Protocol to Tables.]]

783	6.1.2.  DISALLOWED

785	   Some characters are sufficiently problematic for use in IDNs that
786	   they should be excluded for both registration and lookup (i.e.,
787	   conforming applications performing name resolution should verify that
788	   these characters are absent; if they are present, the label strings
789	   should be rejected rather than converted to A-labels and looked up).

791	   Of course, this category would include code points that had been
792	   removed entirely from Unicode should such removals ever occur.

794	   Characters that are placed in the "DISALLOWED" category are expected
795	   to never be removed from it or reclassified.  If a character is
796	   classified as "DISALLOWED" in error and the error is sufficiently
797	   problematic, the only recourse would be either to introduce a new
798	   code point into Unicode and classify it as "PROTOCOL-VALID" or for
799	   the IETF to accept the considerable costs of an incompatible change
800	   and replace the relevant RFC with one containing appropriate
801	   exceptions.

803	   [[anchor24: Note in Draft: the permanence of DISALLOWED was still
804	   under discussion in the WG when this draft was posted.  The text
805	   above reflects the editor's opinion about the emerging consensus but
806	   is subject to change as the discussion continues.]]

808	   There is provision for exception cases but, in general, characters
809	   are placed into "DISALLOWED" if they fall into one or more of the
810	   following groups:

812	   o  The character is a compatibility equivalent for another character.
813	      In slightly more precise Unicode terms, application of
814	      normalization method NFKC to the character yields some other
815	      character.

817	   o  The character is an upper-case form or some other form that is
818	      mapped to another character by Unicode casefolding.

820	   o  The character is a symbol or punctuation form or, more generally,
821	      something that is not a letter, digit, or a mark that is used to
822	      form a letter or digit.
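
	   The three groups above can be approximated with the Unicode data
	   shipped in the Python standard library.  This is a rough sketch
	   under stated assumptions, not the algorithm of [IDNA2008-Tables]:
	   the function name is hypothetical, Python's str.casefold() stands
	   in for Unicode casefolding, and "letter, digit, or mark" is
	   approximated by the general categories L*, Nd, and M*.  Exception
	   cases are not modeled.

```python
import unicodedata


def disallowed_reasons(ch: str) -> list:
    """List which of the three groups a single character falls into."""
    reasons = []
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("changed by NFKC")
    if ch.casefold() != ch:
        reasons.append("changed by case folding")
    category = unicodedata.category(ch)
    if not (category[0] in ("L", "M") or category == "Nd"):
        reasons.append("not a letter, digit, or mark")
    return reasons
```

	   An empty result means only that the character is not excluded by
	   these three tests; it may still be subject to contextual rules or
	   to explicit exceptions.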

824	6.1.3.  UNASSIGNED

826	   For convenience in processing and table-building, code points that do
827	   not have assigned values in a given version of Unicode are treated as
828	   belonging to a special UNASSIGNED category.  Such code points MUST
829	   NOT appear in labels to be registered or looked up.  The category
830	   differs from DISALLOWED in that code points are moved out of it by
831	   the simple expedient of being assigned in a later version of Unicode
832	   (at which point, they are classified into one of the other categories
833	   as appropriate).
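
	   A sketch of the UNASSIGNED test, assuming that the general
	   category "Cn" reported by the Python standard library is an
	   adequate stand-in.  The function name is hypothetical.  The result
	   depends on the Unicode version bundled with the interpreter
	   (available as unicodedata.unidata_version), which illustrates why
	   a code point can leave this category when a later version of
	   Unicode assigns it.

```python
import unicodedata


def is_unassigned(ch: str) -> bool:
    """Return True if the code point has general category Cn.

    Note that Cn also covers noncharacters, which never become assigned.
    """
    return unicodedata.category(ch) == "Cn"
```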

835	6.2.  Registration Policy

837	   While these recommendations cannot and should not define registry
838	   policies, registries SHOULD develop and apply additional restrictions
839	   to reduce confusion and other problems.  For example, it is generally
840	   believed that labels containing characters from more than one script
841	   are a bad practice although there may be some important exceptions to
842	   that principle.  Some registries may choose to restrict registrations
843	   to characters drawn from a very small number of scripts.  For many
844	   scripts, the use of variant techniques such as those described in
845	   [RFC3743] and [RFC4290], and illustrated for Chinese by the tables
846	   described in RFC 4713 [RFC4713], may be helpful in reducing problems
847	   that might be perceived by users.  It is worth stressing that these
848	   principles of policy development and application apply at all levels
849	   of the DNS, not only, e.g., TLD registrations.

851	6.3.  Layered Restrictions: Tables, Context, Registration, Applications

853	   The essence of the character rules in IDNA2008 is based on the
854	   realization that there is no magic bullet for any of the issues
855	   associated with a multiscript DNS.  Instead, the specifications
856	   define a variety of approaches that, together, constitute multiple
857	   lines of defense against ambiguity in identifiers and loss of
858	   referential integrity.  The actual character tables are the first
859	   mechanism, protocol rules about how those characters are applied or
860	   restricted in context are the second, and those two in combination
861	   constitute the limits of what can be done by a protocol alone.  As
862	   discussed in the previous section (Section 6.2), registries are
863	   expected to restrict what they permit to be registered, devising and
864	   using rules that are designed to optimize the balance between
865	   confusion and risk on the one hand and maximum expressiveness in
866	   mnemonics on the other.

868	   In addition, there is an important role for user agents in warning
869	   against label forms that appear unreasonable given their knowledge of
870	   local contexts and conventions.  Of course, no approach based on
871	   naming or identifiers alone can protect against all threats.
872	   [[anchor25: Note in Draft: the last sentence above basically
873	   duplicates a comment in Security Considerations.  Is it worth having
874	   in both places??]]

876	7.  Issues that Constrain Possible Solutions

878	7.1.  Display and Network Order

880	   The correct treatment of domain names requires a clear distinction
881	   between Network Order (the order in which the code points are sent in
882	   protocols) and Display Order (the order in which the code points are
883	   displayed on a screen or paper).  The order of labels in a domain
884	   name that contains characters that are normally written right to left
885	   is discussed in [IDNA2008-Bidi].  In particular, there are questions
886	   about the order in which labels are displayed if left to right and
887	   right to left labels are adjacent to each other, especially if there
888	   are also multiple consecutive appearances of one of the types.  The
889	   decision about the display order is ultimately under the control of
890	   user agents --including web browsers, mail clients, and the like--
891	   which may be highly localized.  Even when formats are specified by
892	   protocols, the full composition of an Internationalized Resource
893	   Identifier (IRI) [RFC3987] or Internationalized Email address
894	   contains elements other than the domain name.  For example, IRIs
895	   contain protocol identifiers and field delimiter syntax such as
896	   "http://" or "mailto:" while email addresses contain the "@" to
897	   separate local parts from domain names.  User agents are not required
898	   to use those protocol-based forms directly but often do so.  While
899	   display, parsing, and processing within a label is specified by the
900	   IDNA protocol and the associated documents, the relationship between
901	   fully-qualified domain names and internationalized labels is
902	   unchanged from the base DNS specifications.  Comments here about such
903	   full domain names are explanatory or examples of what might be done
904	   and must not be considered normative.

906	   Questions remain about protocol constraints implying that the overall
907	   direction of these strings will always be left to right (or right to
908	   left) for an IRI or email address, or if they even should conform to
909	   such rules.  These questions also have several possible answers.
910	   Should a domain name abc.def, in which both labels are represented in
911	   scripts that are written right to left, be displayed as fed.cba or
912	   cba.fed?  An IRI for clear text web access would, in network order,
913	   begin with "http://" and the characters will appear as
914	   "http://abc.def" -- but what does this suggest about the display
915	   order?  When entering a URI into many browsers, it may be possible to
916	   provide only the domain name and leave the "http://" to be filled in
917	   by default, assuming no tail (an approach that does not work for
918	   other protocols).  The natural display order for the typed domain
919	   name on a right to left system is fed.cba.  Does this change if a
920	   protocol identifier, tail, and the corresponding delimiters are
921	   specified?

923	   While logic, precedent, and reality suggest that these are questions
924	   for user interface design, not IETF protocol specifications,
925	   experience in the 1980s and 1990s with mixing systems in which domain
926	   name labels were read in network order (left to right) and those in
927	   which those labels were read right to left would predict a great deal
928	   of confusion, and heuristics that sometimes fail, if each
929	   implementation of each application makes its own decisions on these
930	   issues.

932	   It should be obvious that any revision of IDNA, including the current
933	   one, must be clear about the network (transmission on the wire) order
934	   of characters in labels and for the labels in complete (fully-
935	   qualified) domain names.  In order to prevent user confusion and, in
936	   particular, to reduce the chances for inconsistent transcription of
937	   domain names from printed form, it is likely that some strong
938	   suggestions should be made about display order as well.

940	7.2.  Entry and Display in Applications

942	   Applications can accept domain names using any character set or sets
943	   desired by the application developer or specified by the operating
944	   system, and can display domain names in any charset.  That is, the
945	   IDNA protocol does not affect the interface between users and
946	   applications.

948	   An IDNA-aware application can accept and display internationalized
949	   domain names in two formats: the internationalized character set(s)
950	   supported by the application (i.e., an appropriate local
951	   representation of a U-label), and as an A-label.  Applications MAY
952	   allow the display and user input of A-labels, but are encouraged to
953	   not do so except as an interface for special purposes, possibly for
954	   debugging, or to cope with display limitations.  A-labels are opaque
955	   and ugly and, where possible, should thus be exposed only to those
956	   users, and in those contexts, that absolutely need them.  Because IDN
957	   labels can be rendered either as A-labels or as U-labels, the
958	   application may reasonably have an option for the user to select the
959	   preferred method of display; if it does, rendering the U-label should
960	   normally be the default.

962	   Domain names are often stored and transported in many places.  For
963	   example, they are part of documents such as mail messages and web
964	   pages.  They are transported in many parts of many protocols, such as
965	   both the control commands and the RFC 2822 body parts of SMTP, and
966	   the headers and the body content in HTTP.  It is important to
967	   remember that domain names appear both in domain name slots and in
968	   the content that is passed over protocols.

970	   In protocols and document formats that define how to handle
971	   specification or negotiation of charsets, labels can be encoded in
972	   any charset allowed by the protocol or document format.  If a
973	   protocol or document format only allows one charset, the labels MUST
974	   be given in that charset.  Of course, not all charsets can properly
975	   represent all labels.  If a U-label cannot be displayed in its
976	   entirety, the only choice (without loss of information) may be to
977	   display the A-label.
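
	   The fallback described above can be sketched as follows.  The
	   function name is hypothetical, and "can be displayed" is reduced
	   here to "can be encoded in the display charset", which ignores
	   font availability and other rendering concerns.

```python
def display_form(u_label: str, display_charset: str) -> str:
    """Return the U-label if the charset can represent it, else the A-label."""
    try:
        u_label.encode(display_charset)
        return u_label
    except UnicodeEncodeError:
        # No loss-free representation is available: fall back to the A-label.
        return "xn--" + u_label.encode("punycode").decode("ascii")
```

	   Falling back to the A-label for the whole label, instead of
	   substituting a placeholder for individual characters, keeps the
	   displayed string unambiguous.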

979	   In any place where a protocol or document format allows transmission
980	   of the characters in internationalized labels, labels SHOULD be
981	   transmitted using whatever character encoding and escape mechanism
982	   the protocol or document format uses at that place.  This provision
983	   is intended to prevent situations in which, e.g., UTF-8 domain names
984	   appear embedded in text that is otherwise in some other character
985	   coding.

987	   All protocols that use domain name slots already have the capacity
988	   for handling domain names in the ASCII charset.  Thus, A-labels can
989	   inherently be handled by those protocols.

991	7.3.  Linguistic Expectations: Ligatures, Digraphs, and Alternate
992	      Character Forms

994	   Users often have expectations about character matching or equivalence
995	   that are based on their languages and the orthography of those
996	   languages.  These expectations may not be consistent with forms or
997	   actions that can be naturally accommodated in a character coding
998	   system, especially if multiple languages are written using the same
999	   script but using different conventions.  A Norwegian user might
1000	   expect a label with the ae-ligature to be treated as the same label
1001	   as one using the Swedish spelling with a-umlaut even though applying
1002	   that mapping to English would be astonishing to users.  A German user
1003	   might expect a label with an o-umlaut and a label that had "oe"
1004	   substituted, but was otherwise the same, to be treated as equivalent
1005	   even though that substitution would be a clear error in Swedish.  A
1006	   Chinese user might expect automatic matching of Simplified and
1007	   Traditional Chinese characters, but applying that matching for Korean
1008	   or Japanese text would create considerable confusion.  For that
1009	   matter, an English user might expect "theater" and "theatre" to
1010	   match.

1012	   Related issues arise because there are a number of languages written
1013	   with alphabetic scripts in which single phonemes are written using
1014	   two characters, termed a "digraph", for example, the "ph" in
1015	   "pharmacy" and "telephone".  (Note that characters paired in this
1016	   manner can also appear consecutively without forming a digraph, as in
1017	   "tophat".)  Certain digraphs are normally indicated typographically
1018	   by setting the two characters closer together than they would be if
1019	   used consecutively to represent different phonemes.  Some digraphs
1020	   are fully joined as ligatures (strictly designating setting totally
1021	   without intervening white space, although the term is sometimes
1022	   applied to close set pairs).  An example of this may be seen when the
1023	   word "encyclopaedia" is set with a U+00E6 LATIN SMALL LETTER AE
1024	   (and some would not consider that word correctly spelled unless the
1025	   ligature form was used or the "a" was dropped entirely).  When these
1026	   ligature and digraph forms have the same interpretation across all
1027	   languages that use a given script, application of Unicode
1028	   normalization generally resolves the differences and causes them to
1029	   match.  When they have different interpretations, any requirements
1030	   for matching must utilize other methods or users must be educated to
1031	   understand that matching will not occur.
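
	   A short sketch of both outcomes, using NFKC from the Python
	   standard library as the normalization method (the function name is
	   hypothetical).  The "fi" ligature, which has one interpretation
	   everywhere, matches its two-letter spelling after normalization;
	   the "ae" ligature and the "umlauted a", whose interpretation
	   varies by language, do not.

```python
import unicodedata


def match_after_nfkc(a: str, b: str) -> bool:
    """Compare two strings after applying NFKC to each."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```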

1033	   Difficulties arise from the fact that a given ligature may be a
1034	   completely optional typographic convenience for representing a
1035	   digraph in one language (as in the above example with some spelling
1036	   conventions), while in another language it is a single character that
1037	   may not always be correctly representable by a two-letter sequence
1038	   (as in the above example with different spelling conventions).  This
1039	   can be illustrated by many words in the Norwegian language, where the
1040	   "ae" ligature is the 27th letter of a 29-letter extended Latin
1041	   alphabet.  It is equivalent to the 28th letter of the Swedish
1042	   alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A
1043	   WITH DIAERESIS, for which an "ae" cannot be substituted according to
1044	   current orthographic standards.

1046	   That character (U+00E4) is also part of the German alphabet where,
1047	   unlike in the Nordic languages, the two-character sequence "ae" is
1048	   usually treated as a fully acceptable alternate orthography for the
1049	   "umlauted a" character.  The inverse is however not true, and those
1050	   two characters cannot necessarily be combined into an "umlauted a".
1051	   This also applies to another German character, the "umlauted o"
1052	   (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) which, for example,
1053	   cannot be used for writing the name of the author "Goethe".  It is
1054	   also a letter in the Swedish alphabet where, in parallel to the
1055	   "umlauted a", it cannot be correctly represented as "oe" and in the
1056	   Norwegian alphabet, where it is represented, not as "umlauted o", but
1057	   as "slashed o", U+00F8.

1059	   Some of the ligatures that have explicit code points in Unicode were
1060	   given special handling in IDNA2003 and now pose additional problems
1061	   as people argue that they should have been treated differently to
1062	   preserve important information.  For example, the German character
1063	   Eszett (Sharp S, U+00DF) is retained as itself by NFKC but case-
1064	   folded by Stringprep to "ss", but the closely-related, but less
1065	   frequently seen, character "Long S T" (U+FB05) is a compatibility
1066	   character that is mapped out by NFKC.  Unless exceptions are made,
1067	   both will be treated as DISALLOWED by IDNA2008.  But there is
1068	   significant interest in an exception, especially for Eszett.
1069	   Depending on what the exception was, making it would either raise
1070	   some backward compatibility problems with IDNA2003 or create an
1071	   unusual special case that would highlight differences in preferred
1072	   orthography between German as written in Germany and German as
1073	   written in some other countries, notably Switzerland.  Additional
1074	   discussion of issues with Eszett appears in Section 10.7.
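
The different treatment of Eszett and "Long S T" described above can
be observed with a short Python sketch.  It is illustrative only: it
uses the Unicode data bundled with the local Python interpreter (not
the Unicode 3.2 tables on which Stringprep is based), and the helper
names nfkc() and describe() are hypothetical.

```python
import unicodedata

ESZETT = "\u00df"    # LATIN SMALL LETTER SHARP S
LONG_S_T = "\ufb05"  # LATIN SMALL LIGATURE LONG S T
AE = "\u00e6"        # LATIN SMALL LETTER AE


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def describe(ch: str) -> dict:
    """Report how one character fares under NFKC and full case folding."""
    return {
        "char": ch,
        "nfkc": nfkc(ch),
        "casefold": ch.casefold(),
        "stable_under_nfkc": nfkc(ch) == ch,
    }


if __name__ == "__main__":
    for ch in (ESZETT, LONG_S_T, AE):
        print(describe(ch))
```

With current Unicode data, Eszett and the "ae" letter are unchanged by
NFKC, full case folding turns Eszett into "ss", and NFKC alone
replaces the "Long S T" ligature with "st".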

1076	   Additional cases with alphabets written right to left are described
1077	   in Section 7.5.

1079	   Whether ligatures and digraphs are to be treated as a sequence of
1080	   characters or as a single standalone one constitutes a problem that
1081	   cannot be resolved solely by operating on scripts.  It is,
1082	   however, a key concern in the IDN context.  Its satisfactory
1083	   resolution will require support in policies set by registries, which
1084	   therefore need to be particularly mindful not just of this specific
1085	   issue, but of all other related matters that cannot be dealt with on
1086	   an exclusively algorithmic basis.

1088	   Just as with the examples of different-looking characters that may be
1089	   assumed to be the same, it is in general impossible to deal with
1090	   these situations in a system such as IDNA -- or with Unicode
1091	   normalization generally -- since determining what to do requires
1092	   information about the language being used, context, or both.
1093	   Consequently, these specifications make no attempt to treat these
1094	   combined characters in any special way.  However, their existence
1095	   provides a prime example of a situation in which a registry that is
1096	   aware of the language context in which labels are to be registered,
1097	   and where that language sometimes (or always) treats the two-
1098	   character sequences as equivalent to the combined form, should give
1099	   serious consideration to applying a "variant" model [RFC3743]
1100	   [RFC4290] to reduce the opportunities for user confusion and fraud
1101	   that would result from the related strings being registered to
1102	   different parties.

1104	7.4.  Case Mapping and Related Issues

1106	   Traditionally in the DNS, ASCII letters have been stored with their
1107	   case preserved.  Matching during the query process has been case-
1108	   independent, but none of the information that might be represented by
1109	   choices of case has been lost.  That model has been accidentally
1110	   helpful because, as people have created DNS labels by catenating
1111	   words (or parts of words) to form labels, case has often been used to
1112	   distinguish among components and make the labels more memorable.

1114	   The solution of keeping the characters separate but doing matching
1115	   independent of case is not feasible with an IDNA-like model because
1116	   the matching would then have to be done on the server rather than
1117	   have characters mapped on the client.  That situation was recognized
1118	   in IDNA2003 and nothing in IDNA2008 fundamentally changes it or could
1119	   do so.  In IDNA2003, all upper-case characters are mapped to lower-
1120	   case ones and, in general, all code points that represent alternate
1121	   forms of the same character are mapped to that character (including
1122	   mapping Greek final form sigma to the medial form).  IDNA2008
1123	   permits, at the risk of some incompatibility, slightly more
1124	   flexibility in this area.  That additional flexibility still does not
1125	   solve the problem with final form sigma and other characters that
1126	   Unicode treats as completely separate characters that match only
1127	   under casemapping if at all.  Many people now believe these should be
1128	   handled as separate characters so information about them can be
1129	   preserved in the transformations to A-labels and back.  However,
1130	   making a change to permit that behavior would create a situation in
1131	   which the same string, valid in both protocols, would be interpreted
1132	   differently by IDNA2003 and IDNA2008.  In principle, that would
1133	   violate one of the conditions discussed in Section 10.3.1 and hence
1134	   require a prefix change.  Of course, if a prefix change were made (at
1135	   the costs discussed in Section 10.3.3) there would be several
1136	   options, including, if desired, assigning the character to the
1137	   CONTEXTUAL RULE REQUIRED category and requiring that it only be used
1138	   in carefully-selected contexts.
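
The loss of information about final form sigma can be sketched as
follows.  This is an approximation under stated assumptions: case
folding followed by NFKC stands in for the IDNA2003 mapping step (the
real Nameprep profile is defined against Unicode 3.2), Python's
built-in "punycode" codec stands in for the conversion to and from the
encoded form, and the function names are hypothetical.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA


def fold_like_idna2003(label: str) -> str:
    """Approximate the IDNA2003 mapping: full case folding, then NFKC."""
    return unicodedata.normalize("NFKC", label.casefold())


def round_trip(label: str) -> str:
    """Map the label, encode it with Punycode, and decode it again."""
    mapped = fold_like_idna2003(label)
    return mapped.encode("punycode").decode("punycode")


if __name__ == "__main__":
    word = "\u03b2\u03cc\u03bb\u03bf\u03c2"  # a Greek word ending in final sigma
    print(word, "->", round_trip(word))
```

The string that comes back ends in the medial form: once the mapping
has been applied, nothing in the encoded label records that the user
originally typed a final sigma.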

1140	7.5.  Right to Left Text

1142	   In order to be sure that the directionality of right to left text is
1143	   unambiguous, IDNA2003 required that any label in which right to left
1144	   characters appear both starts and ends with them, may not include any
1145	   characters with strong left to right properties (which excludes other
1146	   alphabetic characters but permits European digits), and rejects any
1147	   other string that contains a right to left character.  This is one of
1148	   the few places where the IDNA algorithms (both old and new) are
1149	   required to look at an entire label, not just at individual
1150	   characters.  The algorithmic model used in IDNA2003 rejects the label
1151	   when the final character in a right to left string requires a
1152	   combining mark in order to be correctly represented.

1154	   This problem manifests itself in languages written with consonantal
1155	   alphabets to which diacritical vocalic systems are applied, and in
1156	   languages with orthographies derived from them where the combining
1157	   marks may have different functionality.  In both cases the combining
1158	   marks can be essential components of the orthography.  Examples of
1159	   this are Yiddish, written with an extended Hebrew script, and Dhivehi
1160	   (the official language of Maldives) which is written in the Thaana
1161	   script (which is, in turn, derived from the Arabic script).  The new
1162	   rules for right to left scripts are described in [IDNA2008-Bidi].
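
The IDNA2003 whole-label test described in this section, and the way
it rejects labels that end in a combining mark, can be sketched as
below.  The sketch assumes that the Unicode bidirectional classes "R"
and "AL" reported by Python's unicodedata module are an adequate
stand-in for the Stringprep RandALCat table, and "L" for LCat; the
function name is hypothetical.

```python
import unicodedata

RTL_CLASSES = ("R", "AL")


def idna2003_bidi_ok(label: str) -> bool:
    """Approximate the IDNA2003 right to left check on one label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True   # no right to left characters: rule does not apply
    if "L" in classes:
        return False  # strong left to right characters are not allowed
    # The label must both start and end with a right to left character.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES


if __name__ == "__main__":
    samples = {
        "Hebrew letters only": "\u05d0\u05d1",
        "Hebrew letter + point (Yiddish)": "\u05d0\u05b7",
        "Thaana letter + vowel sign (Dhivehi)": "\u0780\u07a6",
    }
    for name, label in samples.items():
        print(name, idna2003_bidi_ok(label))
```

The second and third samples end in a nonspacing mark, so the test
fails even though each is an ordinary sequence in its language.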

1164	8.  IDNs and the Robustness Principle

1166	   The model of IDNs described in this document can be seen as a
1167	   particular instance of the "Robustness Principle" that has been so
1168	   important to other aspects of Internet protocol design.  This
1169	   principle is often stated as "Be conservative about what you send and
1170	   liberal in what you accept" (see, e.g., RFC 1123, Section 1.2.2
1172	   [RFC1123]).  For IDNs to work well, not only must the protocol be
1173	   carefully designed and implemented, but zone administrators
1174	   (registries) must have and require sensible policies about what is
1175	   registered -- conservative policies -- and implement and enforce
1176	   them.

1178	   Conversely, resolvers can (and SHOULD or maybe MUST) reject labels
1179	   that clearly violate global (protocol) rules (no one has ever
1180	   seriously claimed that being liberal in what is accepted requires
1181	   being stupid).  However, once one gets past such global rules and
1182	   deals with anything sensitive to script or locale, it is necessary to
1183	   assume that garbage has not been placed into the DNS, i.e., one must
1184	   be liberal about what one is willing to look up in the DNS rather
1185	   than guessing about whether it should have been permitted to be
1186	   registered.

1188	   As mentioned elsewhere, if a string doesn't resolve, it makes no
1189	   difference whether it simply wasn't registered or was prohibited by
1190	   some rule.

1192	   If resolvers, as a user interface (UI) or other local matter, decide
1193	   to warn about some strings that are valid under the global rules but
1194	   that they perceive as dangerous, that is their prerogative and we can
1195	   only hope that the market (and maybe regulators) will reinforce the
1196	   good choices and discourage the poor ones.  In this context, a
1197	   resolver that decides a string that is valid under the protocol is
1198	   dangerous and refuses to look it up is in violation of the protocols;
1199	   one that is willing to look something up, but warns against it, is
1200	   exercising a local choice.

1202	9.  Front-end and User Interface Processing

1204	   Domain names may be identified and processed in many contexts.  They
1205	   may be typed in by users either by themselves or as part of URIs or
1206	   IRIs.  They may occur in running text or be processed by one system
1207	   after being provided in another.  Systems may wish to try to
1208	   normalize URLs so as to determine (or guess) whether a reference is
1209	   valid or two references point to the same object without actually
1210	   looking the objects up and comparing them.  Some of these goals may
1211	   be more easily and reliably satisfied than others.  While there are
1212	   strong arguments for any domain name that is placed "on the wire" --
1213	   transmitted between systems -- to be in the minimum-ambiguity forms
1214	   of A-labels, U-labels, or LDH-labels, it is inevitable that programs
1215	   that process domain names will encounter variant forms.  One source
1216	   of such forms will be labels created under IDNA2003.  Because of the
1217	   way that protocol was specified, there are a significant number of
1218	   domain names in files on the Internet that use characters that cannot
1219	   be represented directly in domain names but for which interpretations
1220	   are provided.  There are two major categories of such characters,
1221	   those that are removed by NFKC normalization and those upper-case
1222	   characters that are mapped to lower-case (there are also a few
1223	   characters that are given special-case mapping treatment in
1224	   Stringprep). [[anchor29: The text above is too obscure, but was
1225	   intended to address the mapping differences between IDNA2003 and the
1226	   current proposal.  Patrik suggests the following, which will need
1227	   some tuning before it can be inserted: "One source of such forms
1228	   will be labels created under IDNA2003, as some allowed labels were
1229	   transformed before they were turned into their ASCII (xn--) form, so
1230	   that ToUnicode(ToASCII(label)) != label.  This is why IDNA2008
1231	   explicitly defines A-labels and U-labels as forms of the label that
1232	   are stable when converting between A-label and U-label, without
1233	   mappings.  A different way of explaining this is that there may
1234	   already be domain names in files on the Internet that use
1235	   characters that cannot be represented directly in domain names but
1236	   for which interpretations are provided.  There are two major
1237	   categories of such characters, those that are removed by NFKC
1238	   normalization and those upper-case characters that are mapped to
1239	   lower-case (there are also a few characters that are given special-
1240	   case mapping treatment in Stringprep)."]]
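
The observation that ToUnicode(ToASCII(label)) != label for some
IDNA2003 input can be reproduced with Python's built-in "idna" codec,
which implements IDNA2003 (RFC 3490).  The wrapper names to_ascii(),
to_unicode(), and is_stable() are hypothetical, and the sketch handles
single labels only.

```python
def to_ascii(label: str) -> str:
    """IDNA2003 ToASCII, as implemented by Python's 'idna' codec."""
    return label.encode("idna").decode("ascii")


def to_unicode(label: str) -> str:
    """IDNA2003 ToUnicode, as implemented by Python's 'idna' codec."""
    return label.encode("ascii").decode("idna")


def is_stable(label: str) -> bool:
    """True if the label survives a ToASCII/ToUnicode round trip."""
    return to_unicode(to_ascii(label)) == label


if __name__ == "__main__":
    for label in ("B\u00fccher", "b\u00fccher", "stra\u00dfe"):
        print(label, to_ascii(label), is_stable(label))
```

Only the all-lower-case form without Eszett is stable here; the other
two inputs are accepted but come back changed, which is the behavior
that the A-label and U-label definitions are intended to exclude.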

1242	   Other issues in domain name identification and processing arise
1243	   because IDNA2003 specified that several other characters be treated
1244	   as equivalent to the ASCII period (dot, full stop) character used as
1245	   a label separator.  If a domain name appears in an arbitrary context
1246	   (such as running text), it is difficult, even with only ASCII
1247	   characters, to know whether a domain name (or a protocol parameter
1248	   like a URI) is present and where it starts and ends.  When using
1249	   Unicode this gets even more difficult if treatment of certain special
1250	   characters (like the dot that separates labels in a domain name)
1251	   depends on context.  That problem occurs if the dot is part of a
1252	   domain name or not, which would mean that, contrary to common
1253	   practice today, the primary heuristic for identifying a domain name
1254	   depends on dots separating strings with no intervening spaces.
1255	   [[anchor30: Above text is a substitute for an earlier (pre -01)
1256	   version and is hoped to be more clear.  Comments and improvements
1257	   welcome.]]

1259	   As discussed elsewhere in this document, the IDNA2008 model removes
1260	   all of these mappings and interpretations, including the equivalence
1261	   of different forms of dots, from the protocol, leaving such mappings
1262	   to local processing.  This should not be taken to imply that local
1263	   processing is optional or can be avoided entirely.  Instead, unless
1264	   the program context is such that it is known that any IDNs that
1265	   appear will be either U-labels or A-labels, some local processing of
1266	   apparent domain name strings will be required, both to maintain
1267	   compatibility with IDNA2003 and to prevent user astonishment.  Such
1268	   local processing, while not specified in this document or the
1269	   associated ones, will generally take one of two forms:

1271	   o  Generic Preprocessing.
1272	      When the context in which the program or system that processes
1273	      domain names operates is global, a reasonable balance must be
1274	      found that is sensitive to the broad range of local needs and
1275	      assumptions while, at the same time, not sacrificing the needs of
1276	      one language, script, or user population to those of another.

1278	      For this case, the best practice will usually be to apply NFKC and
1279	      case-mapping (or, perhaps better yet, Stringprep itself), plus
1280	      dot-mapping where appropriate, to the domain name string prior to
1281	      applying IDNA.  That practice will not only yield a reasonable
1282	      compromise of user experience with protocol requirements but will
1283	      be almost completely compatible with the various forms permitted
1284	      by IDNA2003.

1286	   o  Highly Localized Preprocessing.
1287	      Unlike the case above, there will be some situations in which
1288	      software will be highly localized for a particular environment and
1289	      carefully adapted to the expectations of users in that
1290	      environment.  The many discussions about using the Internet to
1291	      preserve and support local cultures suggest that these cases may
1292	      be more common in the future than they have been so far.

1294	      In these cases, we should avoid trying to tell implementers what
1295	      they should do, if only because they are quite likely (and for
1296	      good reason) to ignore us.  We would assume that they would map
1297	      characters that the intuitions of their users would suggest be
1298	      mapped.  One can imagine switches about whether some sorts of
1299	      mappings occur, warnings before applying them or, in a slightly
1300	      more extreme version of the approach taken in Internet Explorer
1301	      version 7 (IE7), an utter refusal to handle "strange" characters at
1302	      all if they appear in U-label form.  None of those local decisions
1303	      are a threat to interoperability as long as (i) only U-labels and
1304	      A-labels are used in interchange with systems outside the local
1305	      environment, (ii) no character that would be valid in a U-label as
1306	      itself is mapped to something else, (iii) any local mappings are
1307	      applied as a preprocessing step (or, for conversions from U-labels
1308	      or A-labels to presentation forms, postprocessing), not as part of
1309	      IDNA processing proper, and (iv) appropriate consideration is
1310	      given to labels that might have entered the environment in
1311	      conformance to IDNA2003. [[anchor31: Placeholder: there have been
1312	      suggestions that this text be removed entirely.  Comments (or
1313	      improved text) welcome.]]
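
A minimal sketch of the "Generic Preprocessing" case above, applied
before any IDNA processing, might look like the following.  It assumes
that full case folding plus NFKC from the local Unicode tables is an
acceptable approximation of Stringprep, and it maps the three non-ASCII
characters that IDNA2003 (RFC 3490, Section 3.1) treats as label
separators.  The function name is hypothetical.

```python
import unicodedata

# Characters that IDNA2003 treats as equivalent to the ASCII full stop.
DOT_MAP = str.maketrans({
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uff0e": ".",  # FULLWIDTH FULL STOP
    "\uff61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
})


def generic_preprocess(domain: str) -> str:
    """Dot-map, case-fold, and NFKC-normalize a domain name string."""
    dotted = domain.translate(DOT_MAP)
    return unicodedata.normalize("NFKC", dotted.casefold())


if __name__ == "__main__":
    print(generic_preprocess("B\u00dcCHER\u3002Example"))
```

The result is intended as input to IDNA, not as a replacement for it:
each label still has to be validated and converted afterwards.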

1315	10.  Migration and Version Synchronization

1317	10.1.  Design Criteria

1319	   As mentioned above and in RFC 4690, two key goals of this work are to
1320	   enable applications to be agnostic about whether they are being run
1321	   in environments supporting any Unicode version from 3.2 onward and to
1322	   permit incrementally adding permitted scripts and other character
1323	   collections without disruption or, subsequent to this version,
1324	   "heavy" processes such as formation of an IETF WG.  The mechanisms
1325	   that support this are outlined above, but this section reviews them
1326	   in a context that may be more helpful to those who need to understand
1327	   the approach and make plans for it.

1329	10.1.1.  General IDNA Validity Criteria

1331	   The general criteria for a putative label, and the collection of
1332	   characters that make it up, to be considered IDNA-valid are:

1334	   o  The characters are "letters", marks needed to form letters,
1335	      numerals, or other code points used to write words in some
1336	      language.  Symbols, drawing characters, and various notational
1337	      characters are permanently excluded -- some because they are
1338	      actively dangerous in URI, IRI, or similar contexts and others
1339	      because there is no evidence that they are important enough to
1340	      Internet operations or internationalization to justify inclusion
1341	      and the complexities that would come with it (additional
1342	      discussion and rationale for the symbol decision appears in
1343	      Section 10.5).

1345	   o  Other than in very exceptional cases, e.g., where they are needed
1346	      to write substantially any word of a given language, punctuation
1347	      characters are excluded as well.  The fact that a word exists is
1348	      not proof that it should be usable in a DNS label and DNS labels
1349	      are not expected to be usable for multiple-word phrases (although
1350	      they are certainly not prohibited if the conventions and
1351	      orthography of a particular language cause that to be possible).
1352	      Even for English, very common constructions -- contractions like
1353	      "don't" or "it's", names that are written with apostrophes such as
1354	      "O'Reilly" or characters for which apostrophes are common
1355	      substitutes, and words whose usually-preferred spellings retain
1356	      diacritical marks from earlier forms -- cannot be represented in
1357	      DNS labels.

1359	   o  Characters that are unassigned (have no character assignment at
1360	      all) in the version of Unicode being used by the registry or
1361	      application are not permitted, even on resolution (lookup).  There
1362	      are at least two reasons for this.  Tests involving the context of
1363	      characters (e.g., some characters being permitted only adjacent to
1364	      ones of specific types but otherwise invisible or very problematic
1365	      for other reasons) and integrity tests on complete labels are
1366	      needed.  Unassigned code points cannot be permitted because one
1367	      cannot determine whether particular code points will require
1368	      contextual rules (and what those rules should be) before
1369	      characters are assigned to them and the properties of those
1370	      characters fully understood.  Second, Unicode specifies that an
1371	      unassigned code point normalizes and case folds to itself.  If the
1372	      code point is later assigned to a character, and particularly if
1373	      the newly-assigned code point has a combining class that
1374	      determines its placement relative to other combining characters,
1375	      it could normalize to some other code point or sequence, creating
1376	      confusion and/or violating other rules listed here.

1378	   o  Any character that is mapped to another character by Nameprep2003
1379	      or by a current version of NFKC is prohibited as input to IDNA
1380	      (for either registration or resolution).  Implementers of user
1381	      interfaces to applications are free to make those conversions when
1382	      they consider them suitable for their operating system
1383	      environments, context, or users.

1385	   Tables used to identify the characters that are IDNA-valid are
1386	   expected to be driven by the principles above (described in more
1387	   precise form in [IDNA2008-Tables]).  The principles are not just an
1388	   interpretation of the tables.
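
The criteria listed above can be approximated, per code point, by the
following sketch.  It is not the algorithm of [IDNA2008-Tables]: it
relies on the general categories in the local Unicode data, ignores
exceptions (for example, the ASCII hyphen) and contextual rules, and
uses hypothetical status strings and a hypothetical function name.

```python
import unicodedata

WORD_CATEGORIES = ("L", "M", "N")  # letters, marks, numbers


def rough_status(ch: str) -> str:
    """Classify one code point against the general validity criteria."""
    category = unicodedata.category(ch)
    if category == "Cn":
        return "UNASSIGNED"
    if unicodedata.normalize("NFKC", ch.casefold()) != ch:
        # Changed by case folding or NFKC: prohibited as IDNA input.
        return "DISALLOWED (mapped)"
    if category[0] not in WORD_CATEGORIES:
        return "DISALLOWED (not used to write words)"
    return "CANDIDATE"


if __name__ == "__main__":
    for ch in ("a", "A", "\u00e4", "\u2460", "\u263a", "\u0378"):
        print(f"U+{ord(ch):04X}", rough_status(ch))
```

A code point reported as "CANDIDATE" has only passed these coarse
tests; the tables remain the place where its actual status is decided.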

1390	10.1.2.  Labels in Registration

1392	   Anyone entering a label into a DNS zone must properly validate that
1393	   label -- i.e., be sure that the criteria for that label are met -- in
1394	   order for applications to work as intended.  This principle is not
1395	   new: for example, zone administrators are expected to verify that
1396	   names meet "hostname" [RFC0952] or special service location formats
1397	   [RFC2782] where necessary for the expected applications.  For zones
1398	   that will contain IDNs, support for Unicode version-independence
1399	   requires restrictions on all strings placed in the zone.  In
1400	   particular, for such zones:

1402	   o  Any label that appears to be an A-label, i.e., any label that
1403	      starts with "xn--", MUST be IDNA-valid, i.e., it MUST be a
1404	      valid A-label, as discussed in Section 3 above.

1406	   o  The Unicode tables (i.e., tables of code points, character
1407	      classes, and properties) and IDNA tables (i.e., tables of
1408	      contextual rules such as those described above), MUST be
1409	      consistent on the systems performing or validating labels to be
1410	      registered.  Note that this does not require that tables reflect
1411	      the latest version of Unicode, only that all tables used on a
1412	      given system are consistent with each other.

1414	   [[anchor33: Note in draft: the above text was changed significantly
1415	   between -00 and -01 to clearly restrict its scope to zones supporting
1416	   IDNA and to eliminate comments about labels containing "--" in the
1417	   third and fourth positions but with different prefixes.  There appears
1418	   to be consensus that more extensive rules belong in a "best
1419	   practices" document about appropriate DNS labels, but that document
1420	   is not in-scope for the IDNABIS WG.]]

1422	   Under this model, a registry (or entity communicating with a registry
1423	   to accomplish name registrations) will need to update its tables --
1424	   both the Unicode-associated tables and the tables of permitted IDN
1425	   characters -- to enable a new script or other set of new characters.
1426	   It will not be affected by newer versions of Unicode, or newly-
1427	   authorized characters, until and unless it wishes to make those
1428	   registrations.  The registration side is also responsible -- under
1429	   the protocol and to registrants and users -- for much more careful
1430	   checking than is expected of application systems that look names up,
1431	   both checking as required by the protocol and checking required by
1432	   whatever policies it develops for minimizing risks due to confusable
1433	   characters and sequences and preserving language or script integrity.

1435	   Systems looking up or resolving DNS labels, especially IDN DNS
1436	   labels, MUST be able to assume that applicable registration rules
1437	   were followed for names entered into the DNS.

1439	10.1.3.  Labels in Resolution (Lookup)

1441	   Anyone looking up a label in a DNS zone

1443	   o  MUST maintain a consistent set of tables, as discussed above.  As
1444	      with registration, the tables need not reflect the latest version
1445	      of Unicode but they MUST be consistent.

1447	   o  MUST validate the characters in labels to be looked up only to the
1448	      extent of determining that the U-label does not contain either
1449	      code points prohibited by IDNA (categorized as "DISALLOWED") or
1450	      code points that are unassigned in its version of Unicode.

1452	   o  MUST validate the label itself for conformance with a small number
1453	      of whole-label rules, notably verifying that there are no leading
1454	      combining marks, that the "bidi" conditions are met if right to
1455	      left characters appear, that any required contextual rules are
1456	      available and that, if such rules are associated with Joiner
1457	      Controls, they are tested.

1459	   o  MUST NOT validate other contextual rules about characters,
1460	      including mixed-script label prohibitions, although such rules MAY
1461	      be used to influence presentation decisions in the user interface.

1463	   By avoiding applying its own interpretation of which labels are valid
1464	   as a means of rejecting lookup attempts, the resolver application
1465	   becomes less sensitive to version incompatibilities with the
1466	   particular zone registry associated with the domain name.

1468	   An application or client that looks names up in the DNS will be able
1469	   to resolve any name that is validly registered, as long as its
1470	   version of the Unicode-associated tables is sufficiently up-to-date
1471	   to interpret all of the characters in the label.  It SHOULD
1472	   distinguish, in its messages to users, between "label contains an
1473	   unallocated code point" and other types of lookup failures.  A
1474	   failure on the basis of an old version of Unicode may lead the user
1475	   to a desire to upgrade to a newer version, but will have no other ill
1476	   effects (this is consistent with behavior in the transition to the
1477	   DNS when some hosts could not yet handle some forms of names or
1478	   record types).
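
One way a lookup application might keep "label contains an unallocated
code point" separate from other failures is sketched below.  Only two
of the checks from this section are shown (unassigned code points and
leading combining marks); the exception classes, function names, and
message strings are hypothetical, and the Unicode version is whatever
the local Python interpreter provides.

```python
import unicodedata


class UnassignedCodePoint(ValueError):
    """The label uses a code point unknown to the local Unicode tables."""


class LabelRuleViolation(ValueError):
    """The label breaks a whole-label rule."""


def check_for_lookup(label: str) -> None:
    """Raise if the label must not be looked up; otherwise return None."""
    for ch in label:
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodePoint(f"U+{ord(ch):04X} is unassigned")
    if label and unicodedata.category(label[0]).startswith("M"):
        raise LabelRuleViolation("label starts with a combining mark")


def explain(label: str) -> str:
    """Turn the outcome into a user-facing message."""
    try:
        check_for_lookup(label)
    except UnassignedCodePoint:
        return "label contains an unallocated code point"
    except LabelRuleViolation:
        return "label is not valid"
    return "ok to look up"


if __name__ == "__main__":
    for label in ("b\u00fccher", "a\u0378", "\u0301a"):
        print(repr(label), "->", explain(label))
```

Keeping the first message distinct lets a user see that newer Unicode
tables might resolve the failure, whereas the second kind will not
change with an upgrade.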

1480	10.2.  More Flexibility in User Agents

1482	   These specifications do not perform mappings between one character or
1483	   code point and others for any reason.  Instead, they prohibit the
1484	   characters that would be mapped to others by normalization, case
1485	   folding, or other rules.  As examples, while mathematical characters
1486	   based on Latin ones are accepted as input to IDNA2003, they are
1487	   prohibited in IDNA2008.  Similarly, double-width characters and other
1488	   variations are prohibited as IDNA input.

1490	   Since the rules in [IDNA2008-Tables] provide that only strings that
1491	   are stable under NFKC are valid, if it is convenient for an
1492	   application to perform NFKC normalization before lookup, that
1493	   operation is safe since this will never make the application unable
1494	   to look up any valid string.

1496	   In many cases these prohibitions should have no effect on what the
1497	   user can type at resolution time.  It is perfectly reasonable for
1498	   systems that support user interfaces to perform some character
1499	   mapping that is appropriate to the local environment.  This would
1500	   normally be done prior to actual invocation of IDNA.  At least
1501	   conceptually, the mapping would be part of the Unicode conversions
1502	   discussed above and in [IDNA2008-Protocol].  However, those changes
1503	   will be local ones only -- local to environments in which users will
1504	   clearly understand that the character forms are equivalent.  For use
1505	   in interchange among systems, it appears to be much more important
1506	   that U-labels and A-labels can be mapped back and forth without loss
1507	   of information.

1509	   One specific, and very important, instance of this strategy arises
1510	   with case-folding.  In the ASCII-only DNS, names are looked up and
1511	   matched in a case-independent way, but no actual case-folding occurs.
1512	   Names can be placed in the DNS in either upper or lower case form (or
1513	   any mixture of them) and that form is preserved, returned in queries,
1514	   and so on.  IDNA2003 simulated that behavior by performing case-
1515	   mapping at registration time (resulting in only lower-case IDNs in
1516	   the DNS) and when names were looked up.

1518	   As suggested earlier in this section, it appears to be desirable to
1519	   do as little character mapping as possible consistent with having
1520	   Unicode work correctly (e.g., NFC mapping to resolve different
1521	   codings for the same character is still necessary although the
1522	   specifications require that it be performed prior to invoking the
1523	   protocol) and to make the mapping between A-labels and U-labels
1524	   idempotent.  Case-mapping is not an exception to this principle.  If
1525	   only lower case characters can be registered in the DNS (i.e., be
1526	   present in a U-label), then IDNA2008 should prohibit upper-case
1527	   characters as input.  Some other considerations reinforce this
1528	   conclusion.  For example, an essential element of the ASCII case-
1529	   mapping functions is that uppercase(character) must be equal to
1530	   uppercase(lowercase(character)).  That requirement may not be
1531	   satisfied with IDNs.  The relationship between upper case and lower
1532	   case may even be language-dependent, with different languages (or
1533	   even the same language in different areas) expecting different
1534	   mappings.  Of course, the expectations of users who are accustomed to
1535	   a case-insensitive DNS environment will probably be well-served if
1536	   user agents perform case mapping prior to IDNA processing, but the
1537	   IDNA procedures themselves should neither require such mappings nor
1538	   expect them when they are not natural to the localized environment.
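
The failure of uppercase(character) to equal
uppercase(lowercase(character)) outside ASCII can be checked with
Python's locale-independent case mappings.  The sketch uses the
Unicode data bundled with the interpreter, cannot show the
language-dependent mappings mentioned above, and uses a hypothetical
function name.

```python
import string


def upper_is_consistent(ch: str) -> bool:
    """Check uppercase(ch) == uppercase(lowercase(ch)) for one character."""
    return ch.upper() == ch.lower().upper()


if __name__ == "__main__":
    dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    print(all(upper_is_consistent(c) for c in string.ascii_letters))
    print(upper_is_consistent(dotted_capital_i))
    print("\u00df".upper().lower())
```

The property holds for every ASCII letter but not for U+0130, whose
lower-case form is a two-code-point sequence that does not map back to
the original; similarly, Eszett becomes "ss" after an upper-case and
lower-case round trip.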

1540	10.3.  The Question of Prefix Changes

1542	   The conditions that would require a change in the IDNA "prefix"
1543	   ("xn--" for the version of IDNA specified in [RFC3490]) have been a
1544	   great concern to the community.  A prefix change would clearly be
1545	   necessary if the algorithms were modified in a manner that would
1546	   create serious ambiguities during subsequent transition in
1547	   registrations.  This section summarizes our conclusions about the
1548	   conditions under which changes in prefix would be necessary and the
1549	   implications of such a change.

1551	10.3.1.  Conditions Requiring a Prefix Change

1553	   An IDN prefix change is needed if a given string would resolve or
1554	   otherwise be interpreted differently depending on the version of the
1555	   protocol or tables being used.  Consequently, work to update IDNs
1556	   would require a prefix change if, and only if, one of the following
1557	   four conditions were met:

1559	   1.  The conversion of an A-label to Unicode (i.e., a U-label) yields
1560	       one string under IDNA2003 (RFC3490) and a different string under
1561	       IDNA2008.

1563	   2.  An input string that is valid under IDNA2003 and also valid under
1564	       IDNA2008 yields two different A-labels with the different
1565	       versions of IDNA.  This condition is believed to be essentially
1566	       equivalent to the one above.

1568	       Note, however, that if the input string is valid under one
1569	       version and not valid under the other, this condition does not
1570	       apply.  See the first item in Section 10.3.2, below.

1572	   3.  A fundamental change is made to the semantics of the string that
1573	       is inserted in the DNS, e.g., if a decision were made to try to
1574	       include language or specific script information in that string,
1575	       rather than having it be just a string of characters.

1577	   4.  A sufficiently large number of characters is added to Unicode so
1578	       that the Punycode mechanism for block offsets no longer has
1579	       enough capacity to reference the higher-numbered planes and
1580	       blocks.  This condition is unlikely even in the long term and
1581	       certain not to arise in the next few years.
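
   Conditions 1 and 2 above amount to a comparison that can be run over
   a set of sample strings.  The following is a rough sketch only: the
   two converter functions are hypothetical stand-ins supplied by the
   caller (each returns an A-label, or None if the input is not valid
   under that version), and the toy tables exist purely to exercise the
   logic.

```python
from typing import Callable, Iterable, List, Optional

# A converter returns the A-label for an input string, or None when the
# string is not valid under that version.  Both converters are
# hypothetical; real ones would implement IDNA2003 and IDNA2008.
Converter = Callable[[str], Optional[str]]


def conflicting_inputs(samples: Iterable[str],
                       old: Converter,
                       new: Converter) -> List[str]:
    """Return inputs valid under both versions that yield different A-labels."""
    conflicts = []
    for s in samples:
        a_old, a_new = old(s), new(s)
        if a_old is None or a_new is None:
            # Valid under only one version: the condition does not apply.
            continue
        if a_old != a_new:
            conflicts.append(s)
    return conflicts


# Toy stand-ins (not real IDNA conversions).
toy_old = {"alpha": "xn--1", "beta": "xn--2", "gamma": "xn--3"}.get
toy_new = {"alpha": "xn--1", "beta": "xn--9"}.get

conflicts = conflicting_inputs(["alpha", "beta", "gamma"], toy_old, toy_new)
prefix_change_needed = bool(conflicts)
print(conflicts, prefix_change_needed)
```

   In the toy data, "gamma" is valid only under the old version and is
   therefore ignored, while "beta" maps to two different A-labels and
   would trigger the prefix-change condition.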

1583	10.3.2.  Conditions Not Requiring a Prefix Change

1585	   In particular, as a result of the principles described above, none of
1586	   the following changes require a new prefix:

1588	   1.  Prohibition of some characters as input to IDNA.  This may make
1589	       names that are now registered inaccessible, but does not require
1590	       a prefix change.

1592	   2.  Adjustments in Stringprep tables or IDNA actions, including
1593	       normalization definitions, that affect characters that were
1594	       already invalid under IDNA2003.

1596	   3.  Changes in the style of definitions of Stringprep or Nameprep
1597	       that do not alter the actions performed by them.

1599	   Of course, because these specifications do not involve changes to
1600	   Stringprep or Nameprep, the third condition above and part of the
1601	   second are moot.

1603	10.3.3.  Implications of Prefix Changes

1605	   While it might be possible to make a prefix change, the costs of such
1606	   a change are considerable.  Even if they wanted to do so, all
1607	   registries could not convert all IDNA2003 ("xn--") registrations to a
1608	   new form at the same time and synchronize that change with
1609	   applications supporting lookup.  Unless all existing registrations
1610	   were simply to be declared invalid, and perhaps even then, systems
1611	   that needed to support both labels with old prefixes and labels with
1612	   new ones would first process a putative label under the IDNA2008
1613	   rules and try to look it up and then, if it were not found, would
1614	   process the label under IDNA2003 rules and look it up again.  That
1615	   process could significantly slow down all processing that involved
1616	   IDNs in the DNS especially since, in principle, a fully-qualified
1617	   name could contain a mixture of labels that were registered with the
1618	   old and new prefixes, a situation that would make the use of DNS
1619	   caching very difficult.  In addition, looking up the same input
1620	   string as two separate A-labels would create some potential for
1621	   confusion and attacks, since they could, in principle, resolve to
1622	   different targets.
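
   The two-step lookup described above can be sketched as follows.
   This is an illustration under stated assumptions, not a proposed
   procedure: the "zq--" prefix, the encoder functions, and the toy
   zone table are all hypothetical, and the encoders do not perform
   real Punycode conversion.

```python
from typing import Callable, Dict, List, Optional, Tuple


def lookup_with_fallback(
    label: str,
    encode_new: Callable[[str], str],
    encode_old: Callable[[str], str],
    resolve: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[str]]:
    """Try the new-prefix form first, then the old one; record every query."""
    queries: List[str] = []
    for encode in (encode_new, encode_old):
        a_label = encode(label)
        queries.append(a_label)
        target = resolve(a_label)
        if target is not None:
            return target, queries
    return None, queries


# Hypothetical stand-ins: a made-up new prefix and a toy zone table.
def encode_new(s: str) -> str:
    return "zq--" + s


def encode_old(s: str) -> str:
    return "xn--" + s


zone: Dict[str, str] = {"xn--example": "192.0.2.1"}

target, queries = lookup_with_fallback("example", encode_new, encode_old, zone.get)
print(target, queries)
```

   A label registered only under the old prefix costs two queries here,
   and a name with several such labels multiplies that cost.  If both
   forms were registered, the two queries could also return different
   targets.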

1624	   Consequently, a prefix change is to be avoided if at all possible,
1625	   even if it means accepting some IDNA2003 decisions about character
1626	   distinctions as irreversible.

1628	10.4.  Stringprep Changes and Compatibility

1630	   Concerns have been expressed about problems for non-DNS uses of
1631	   Stringprep being caused by changes to the specification intended to
1632	   improve the handling of IDNs, most notably as this might affect
1633	   identification and authentication protocols.  Section 10.3, above,
1634	   essentially also applies in this context.  The proposed new inclusion
1635	   tables [IDNA2008-Tables], the reduction in the number of characters
1636	   permitted as input for registration or resolution (Section 6), and
1637	   even the proposed changes in handling of right to left strings
1638	   [IDNA2008-Bidi] either give interpretations to strings prohibited
1639	   under IDNA2003 or prohibit strings that IDNA2003 permitted.  Strings
1640	   that are valid under both IDNA2003 and IDNA2008, and the
1641	   corresponding versions of Stringprep, are not changed in
1642	   interpretation.  This protocol does not use either Nameprep or
1643	   Stringprep as specified in IDNA2003.

1645	   It is particularly important to keep IDNA processing separate from
1646	   processing for various security protocols because some of the
1647	   constraints that are necessary for smooth and comprehensible use of
1648	   IDNs may be unwanted or undesirable in other contexts.  For example,
1649	   the criteria for good passwords or passphrases are very different
1650	   from those for desirable IDNs.  Similarly, internationalized SCSI
1651	   identifiers and other protocol components are likely to have
1652	   different requirements than IDNs.

1654	   Perhaps even more important in practice, since most other known uses
1655	   of Stringprep encode or process characters that are already in
1656	   normalized form and expect the use of only those characters that can
1657	   be used in writing words of languages, the changes proposed here and
1658	   in [IDNA2008-Tables] are unlikely to have any effect at all,
1659	   especially not on registries and registrations that follow rules
1660	   already in existence when this work started.

1662	10.5.  The Symbol Question

1664	   One of the major differences between this specification and the
1665	   original version of IDNA is that the original version permitted non-
1666	   letter symbols of various sorts, including punctuation and line-
1667	   drawing symbols, in the protocol.  They were always discouraged in
1668	   practice.  In particular, both the "IESG Statement" about IDNA and
1669	   all versions of the ICANN Guidelines specify that only language
1670	   characters be used in labels.  This specification disallows symbols
1671	   entirely.  There are several reasons for this, which include:

1673	   o  As discussed elsewhere, the original IDNA specification assumed
1674	      that as many Unicode characters as possible should be permitted,
1675	      directly or via mapping to other characters, in IDNs.  This
1676	      specification operates on an inclusion model, extrapolating from
1677	      the LDH rules --which have served the Internet very well-- to a
1678	      Unicode base rather than an ASCII base.

1680	   o  Unicode names for letters are, in most cases, fairly
1681	      intuitive, unambiguous and recognizable to users of the relevant
1682	      script.  Symbol names are more problematic because there may be no
1683	      general agreement on whether a particular glyph matches a symbol;
1684	      there are no uniform conventions for naming; variations such as
1685	      outline, solid, and shaded forms may or may not exist; and so on.
1686	      As just one example, consider a "heart" symbol as it might appear
1687	      in a logo that might be read as "I love...".  While the user might
1688	      read such a logo as "I love..." or "I heart...", considerable
1689	      knowledge of the coding distinctions made in Unicode is needed to
1690	      know that there is more than one "heart" character (e.g., U+2665,
1691	      U+2661, and U+2765) and how to describe it.  These issues are of
1692	      particular importance if strings are expected to be understood or
1693	      transcribed by the listener after being read out loud.
1694	      [[anchor35: The above paragraph remains controversial as to
1695	      whether it is valid.  The WG will need to make a decision if this
1696	      section is not dropped entirely.]]

1698	   o  As a simplified example of this, assume one wanted to use a
1699	      "heart" or "star" symbol in a label.  This is problematic because
1700	      those names are ambiguous in the Unicode system of naming (the
1701	      actual Unicode names require far more qualification).  A user or
1702	      would-be registrant has no way to know --absent careful study of
1703	      the code tables-- whether it is ambiguous (e.g., where there are
1704	      multiple "heart" characters) or not.  Conversely, the user seeing
1705	      the hypothetical label doesn't know whether to read it --try to
1706	      transmit it to a colleague by voice-- as "heart", as "love", as
1707	      "black heart", or as any of the other examples below.

1709	   o  The actual situation is even worse than this.  There is no
1710	      possible way for a normal, casual, user to tell the difference
1711	      between the hearts of U+2665 and U+2765 and the stars of U+2606
1712	      and U+2729 without somehow knowing to look for a distinction.
1713	      We have a white heart (U+2661) and a few black hearts, and
1714	      describing a label containing a heart symbol is hopelessly
1715	      ambiguous.  In cities where "Square" is a popular part of a
1716	      location name, one might well want to use a square symbol in a
1717	      label as well and there are far more squares of various flavors in
1718	      Unicode than there are hearts or stars.

1720	   o  The consequence of these ambiguities of description and
1721	      dependencies on distinctions that were, or were not, made in
1722	      Unicode codings, is that symbols are a very poor basis for
1723	      reliable communication.  Of course, these difficulties with
1724	      symbols do not arise with actual pictographic languages and
1725	      scripts which would be treated like any other language characters;
1726	      the two should not be confused.
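
   The naming ambiguity described in this section can be seen by
   printing the formal Unicode names of the code points mentioned
   above.  This is a minimal sketch that assumes Python's standard
   unicodedata module, whose character database version depends on the
   Python release in use.

```python
import unicodedata

# Code points cited in the discussion of "heart" and "star" symbols.
CODE_POINTS = [0x2665, 0x2661, 0x2765, 0x2606, 0x2729]

names = {f"U+{cp:04X}": unicodedata.name(chr(cp)) for cp in CODE_POINTS}
categories = {unicodedata.category(chr(cp)) for cp in CODE_POINTS}

for cp_label, name in names.items():
    print(cp_label, name)

# All of these share one general category, "So" (Symbol, other).
print(categories)
```

   None of the formal names is simply "HEART" or "STAR", which is the
   point made above: a user reading a label aloud has no practical way
   to convey which of these characters is meant.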

1728	   [[anchor36: Note in Draft: Should the above section be significantly
1729	   trimmed or eliminated?]]

1731	10.6.  Migration Between Unicode Versions: Unassigned Code Points

1733	   In IDNA2003, labels containing unassigned code points are resolved on
1734	   the theory that, if they appear in labels and can be resolved, the
1735	   relevant standards must have changed and the registry has properly
1736	   allocated only assigned values.

1738	   In this specification, strings containing unassigned code points MUST
1739	   NOT be either looked up or registered.  There are several reasons for
1740	   this, with the most important ones being:

1742	   o  It cannot be known with sufficient reliability in advance that a
1743	      code point that was not previously assigned will not be assigned
1744	      to a compatibility character.  In IDNA2003, since there is no
1745	      direct dependency on NFKC (Stringprep's tables are based on NFKC,
1746	      but IDNA2003 depends only on Stringprep), allocation of a
1747	      compatibility character might produce some odd situations, but it
1748	      would not be a problem.  In IDNA2008, where compatibility
1749	      characters are generally assigned to DISALLOWED, permitting
1750	      strings containing unassigned characters to be looked up would
1751	      permit violating the principle that characters in DISALLOWED are
1752	      not looked up.

1754	   o  More generally, the status of an unassigned character with regard
1755	      to the DISALLOWED and PROTOCOL-VALID categories, and whether
1756	      contextual rules are required with the latter, cannot be evaluated
1757	      until a character is actually assigned and known.

1759	   It is possible to argue that the issues above are not important and
1760	   that, as a consequence, it is better to retain the principle of
1761	   looking up labels even if they contain unassigned characters because
1762	   all of the important scripts and characters have been coded as of
1763	   Unicode 5.1 and hence unassigned code points will be assigned only to
1764	   obscure characters or archaic scripts.  Unfortunately, that does not
1765	   appear to be a safe assumption for at least two reasons.  First, much
1766	   the same claim of completeness has been made for earlier versions of
1767	   Unicode.  The reality is that a script that is obscure to much of the
1768	   world may still be very important to those who use it.  Cultural and
1769	   linguistic preservation principles make it inappropriate to declare
1770	   the script of no importance in IDNs.  Second, we already have
1771	   counterexamples in, e.g., the relationships associated with new Han
1772	   characters being added (whether in the BMP or in Unicode Plane 2).
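
   A lookup application could apply the rule on unassigned code points
   along the following lines.  This is a sketch only: it assumes
   Python's unicodedata module, it treats general category "Cn" as
   "unassigned" (a category that also covers noncharacters), and the
   result depends on the Unicode version bundled with the running
   Python rather than on any version named in this document.  The
   function names are illustrative.

```python
import unicodedata
from typing import List


def unassigned_code_points(label: str) -> List[str]:
    """List code points in the label whose general category is Cn."""
    return [f"U+{ord(ch):04X}" for ch in label
            if unicodedata.category(ch) == "Cn"]


def may_look_up(label: str) -> bool:
    """Return False if the label contains any unassigned code point."""
    return not unassigned_code_points(label)


print("Unicode data version:", unicodedata.unidata_version)
print(may_look_up("example"))
# U+0378 is assumed here to be unassigned in the local Unicode data.
print(unassigned_code_points("ex\u0378ample"))
```

   Because the answer changes as new Unicode versions assign more code
   points, two systems with different character databases can disagree
   about the same label, which is one reason the rule is tied to the
   version the application actually knows.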

1774	10.7.  Other Compatibility Issues

1776	   The existing (2003) IDNA model includes several odd artifacts of the
1777	   context in which it was developed.  Many, if not all, of these are
1778	   potential avenues for exploits, especially if the registration
1779	   process permits "source" names (names that have not been processed
1780	   through IDNA and nameprep) to be registered.  As one example, since
1781	   the character Eszett, used in German, is mapped by IDNA2003 into the
1782	   sequence "ss" rather than being retained as itself or prohibited, a
1783	   string containing that character but that is otherwise in ASCII is
1784	   not really an IDN (in the U-label sense defined above) at all.  After
1785	   Nameprep maps the Eszett out, the result is an ASCII string and so
1786	   does not get an xn-- prefix, but the string that can be displayed to
1787	   a user appears to be an IDN.  The proposed IDNA2008 eliminates this
1788	   artifact.  A character is either permitted as itself or it is
1789	   prohibited; special cases that make sense only in a particular
1790	   linguistic or cultural context can be dealt with as localization
1791	   matters where appropriate.
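
   The Eszett artifact can be observed with Python's built-in "idna"
   codec, which follows the IDNA2003 rules of RFC 3490.  The sketch
   below contrasts that result with the A-label obtained when the
   character is retained as itself; forming that A-label by prepending
   "xn--" to the raw Punycode output is a simplification used for
   illustration and skips the validity checks a real IDNA2008
   implementation would perform.

```python
label = "stra\u00dfe"  # "strasse" written with Eszett (U+00DF)

# IDNA2003 behavior: Nameprep maps the Eszett to "ss", leaving pure ASCII,
# so no "xn--" prefix is added.
idna2003_form = label.encode("idna")

# Retaining the character as itself: Punycode-encode the label unchanged.
retained_form = b"xn--" + label.encode("punycode")

print(idna2003_form)
print(retained_form)
```

   The first result is an ordinary ASCII label even though the string
   shown to the user appears to be an IDN; the second is a distinct
   A-label that converts back to the original string.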

1793	11.  Acknowledgments

1795	   The editor and contributors would like to express their thanks to
1796	   those who contributed significant early (pre-WG) review comments,
1797	   sometimes accompanied by text, especially Mark Davis, Paul Hoffman,
1798	   Simon Josefsson, and Sam Weiler.  In addition, some specific ideas
1799	   were incorporated from suggestions, text, or comments about sections
1800	   that were unclear supplied by Frank Ellerman, Michael Everson, Asmus
1801	   Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler,
1802	   although, as usual, they bear little or no responsibility for the
1803	   conclusions the editor and contributors reached after receiving their
1804	   suggestions.  Thanks are also due to Vint Cerf, Debbie Garside, and
1805	   Jefsey Morphin for conversations that led to considerable
1806	   improvements in the content of this document.

1808	   A meeting was held on 30 January 2008 to attempt to reconcile
1809	   differences in perspective and terminology about this set of
1810	   specifications between the design team and members of the Unicode
1811	   Technical Consortium.  The discussions at and subsequent to that
1812	   meeting were very helpful in focusing the issues and in refining the
1813	   specifications.  The active participants at that meeting were (in
1814	   alphabetic order as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
1815	   Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
1816	   Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
1817	   Michel Suignard, and Ken Whistler.  We express our thanks to Google
1818	   for support of that meeting and to the participants for their
1819	   contributions.

1821	   Special thanks are due to Paul Hoffman for permission to extract
1822	   material from his Internet-Draft to form the basis for Section 2.

1824	   Useful comments and text on the WG versions of the draft were
1825	   received from many participants in the IETF "IDNABIS" WG and a number
1826	   of document changes resulted from mailing list discussions made by
1827	   that group.

1829	12.  Contributors

1831	   While the listed editor held the pen, the core of this document and
1832	   the initial WG version represent the joint work and conclusions of
1833	   an ad hoc design team consisting of the editor and, in alphabetic
1834	   order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp.
1835	   In addition, there were many specific contributions and helpful
1836	   comments from those listed in the Acknowledgments section and others
1837	   who have contributed to the development and use of the IDNA
1838	   protocols.

1840	13.  IANA Considerations

1842	   This section gives an overview of registries required for IDNA.  The
1843	   actual definition of the first one appears in [IDNA2008-Tables].

1845	13.1.  IDNA Character Registry

1847	   The distinction among the three major categories "UNASSIGNED",
1848	   "DISALLOWED", and "PROTOCOL-VALID" is made by special categories and
1849	   rules that are integral elements of [IDNA2008-Tables].  Convenience
1850	   in programming and validation requires a registry of characters and
1851	   scripts and their categories, updated for each new version of Unicode
1852	   and the characters it contains.  The details of this registry are
1853	   specified in [IDNA2008-Tables].

1855	13.2.  IDNA Context Registry

1857	   For characters that are defined in the IDNA Character Registry list
1858	   as PROTOCOL-VALID but requiring a contextual rule (i.e., the types of
1859	   rule described in Section 6.1.1.1), IANA will create and maintain a
1860	   list of approved contextual rules.  Additions or changes to these
1861	   rules require IETF Review, as described in [RFC5226].
1862	   [[anchor41: Note in Draft: This section was changed between -00 and
1863	   -01 based on list discussion.  Consensus needs to be verified for
1864	   that decision.]]

1866	   A table from which that registry can be initialized, and some further
1867	   discussion, appears in [RulesInit].
1868	   [[anchor42: This subsection should probably be moved to Tables along
1869	   with the Contextual rules themselves (from Protocol) when the move is
1870	   made.]]

1872	13.3.  IANA Repository of IDN Practices of TLDs

1874	   This registry, historically described as the "IANA Language Character
1875	   Set Registry" or "IANA Script Registry" (both somewhat misleading
1876	   terms) is maintained by IANA at the request of ICANN.  It is used to
1877	   provide a central documentation repository of the IDN policies used
1878	   by top level domain (TLD) registries who volunteer to contribute to
1879	   it and is used in conjunction with ICANN Guidelines for IDN use.

1881	   It is not an IETF-managed registry and, while the protocol changes
1882	   specified here may call for some revisions to the tables, these
1883	   specifications have no direct effect on that registry and no IANA
1884	   action is required as a result.

1886	14.  Security Considerations

1888	   Security on the Internet partly relies on the DNS.  Thus, any change
1889	   to the characteristics of the DNS can change the security of much of
1890	   the Internet.

1892	   Domain names are used by users to identify and connect to Internet
1893	   servers.  The security of the Internet is compromised if a user
1894	   entering a single internationalized name is connected to different
1895	   servers based on different interpretations of the internationalized
1896	   domain name.

1898	   When systems use local character sets other than ASCII and Unicode,
1899	   this specification leaves the problem of transcoding between the
1900	   local character set and Unicode up to the application or local
1901	   system.  If different applications (or different versions of one
1902	   application) implement different transcoding rules, they could
1903	   interpret the same name differently and contact different servers.
1904	   This problem is not solved by security protocols like TLS that do not
1905	   take local character sets into account.

1907	   To help prevent confusion between characters that are visually
1908	   similar, it is suggested that implementations provide visual
1909	   indications where a domain name contains multiple scripts.  Such
1910	   mechanisms can also be used to show when a name contains a mixture of
1911	   simplified and traditional Chinese characters, or to distinguish zero
1912	   and one from O and l.  DNS zone administrators may impose
1913	   restrictions (subject to the limitations identified elsewhere in this
1914	   document) that try to minimize characters that have similar
1915	   appearance or similar interpretations.  It is worth noting that there
1916	   are no comprehensive technical solutions to the problems of
1917	   confusable characters.  One can reduce the extent of the problems in
1918	   various ways, but probably never eliminate it.  Some specific
1919	   suggestions about identification and handling of confusable
1920	   characters appear in a Unicode Consortium publication
1921	   [Unicode-UTR36].
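
   The suggestion to flag names that mix scripts could be approximated
   as shown below.  This is a rough sketch, not a recommended
   algorithm: Python's standard library does not expose the Unicode
   Script property, so the first word of each letter's character name
   is used as a crude stand-in for its script.  The function names are
   illustrative, and a production implementation would use real script
   data and the guidance in [Unicode-UTR36].

```python
import unicodedata
from typing import Set


def approximate_scripts(label: str) -> Set[str]:
    """Collect a crude script tag for each letter in the label."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # First word of the character name, e.g. "LATIN" or "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """Return True when letters from more than one script tag appear."""
    return len(approximate_scripts(label)) > 1


plain = "paypal"
spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(approximate_scripts(plain), is_mixed_script(plain))
print(approximate_scripts(spoof), is_mixed_script(spoof))
```

   A check of this kind only flags mixtures; as noted above, it cannot
   detect confusable characters within a single script and so reduces
   the problem rather than eliminating it.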

1923	   The registration and resolution models described above and in
1924	   [IDNA2008-Protocol] change the mechanisms available for applications
1925	   and resolvers to determine the validity of labels they encounter.  In
1926	   some respects, the ability to test is strengthened.  For example,
1927	   putative labels that contain unassigned code points will now be
1928	   rejected, while IDNA2003 permitted them (something that is now
1929	   recognized as a considerable source of risk).  On the other hand, the
1930	   protocol specification no longer assumes that the application that
1931	   looks up a name will be able to determine, and apply, information
1932	   about the protocol version used in registration.  In theory, that may
1933	   increase risk since the application will be able to do less pre-
1934	   lookup validation.  In practice, the protection afforded by that test
1935	   has been largely illusory for reasons explained in RFC 4690 and
1936	   above.

1938	   Any change to Stringprep or, more broadly, the IETF's model of the
1939	   use of internationalized character strings in different protocols,
1940	   creates some risk of inadvertent changes to those protocols,
1941	   invalidating deployed applications or databases, and so on.  Our
1942	   current hypothesis is that the same considerations that would require
1943	   changing the IDN prefix (see Section 10.3.1) are the ones that would,
1944	   e.g., invalidate certificates or hashes that depend on Stringprep,
1945	   but those cases require careful consideration and evaluation.  More
1946	   important, it is not necessary to change Stringprep2003 at all in
1947	   order to make the IDNA changes contemplated here.  It is far
1948	   preferable to create a separate document, or separate profile
1949	   components, for IDN work, leaving the question of upgrading to other
1950	   protocols to experts on them and eliminating any possible
1951	   synchronization dependency between IDNA changes and possible upgrades
1952	   to security protocols or conventions.

1954	   No mechanism involving names or identifiers alone can protect against
1955	   a wide variety of security threats and attacks that are largely
1956	   independent of them, including spoofed pages, DNS query trapping and
1957	   diversion, and so on.

1959	15.  Change Log

1961	   [[anchor45: RFC Editor: Please remove this section.]]

1963	   For version 00 of draft-ietf-idnabis-rationale, this list contains a
1964	   complete trace going back through the earlier, design team, drafts.
1965	   Material earlier than that described in Section 15.9 will be removed
1966	   in WG draft -02.

1968	15.1.  Version -01 of draft-klensin-idnabis-issues

1970	   Version -01 of this document is a considerable rewrite from -00.
1971	   Many sections have been clarified or extended and several new
1972	   sections have been added to reflect discussions in a number of
1973	   contexts since -00 was issued.

1975	15.2.  Version -02 of draft-klensin-idnabis-issues

1977	   o  Corrected several editorial errors including an accidentally-
1978	      introduced misstatement about NFKC.

1980	   o  Extensively revised the document to synchronize its terminology
1981	      with version 03 of [IDNA2008-Tables] and to provide a better
1982	      conceptual framework for its categories and how they are used.
1983	      Added new material to clarify terminology and relationships with
1984	      other efforts.  More subtle changes in this version lay the
1985	      groundwork for separating the document into a conceptual overview
1986	      and a protocol specification for version 03.

1988	15.3.  Version -03 of draft-klensin-idnabis-issues

1990	   o  Removed protocol materials to a separate document and incorporated
1991	      rationale and explanation materials from the original
1992	      specification in RFC 3490 into this document.  Cleaned up earlier
1993	      text to reflect a more mature specification and restructured
1994	      several sections and added additional rationale material.

1996	   o  Strengthened and clarified the A-label / U-label / LDH-label
1997	      definition.

1999	   o  Retitled the document to reflect its evolving role.

2001	15.4.  Version -04 of draft-klensin-idnabis-issues

2003	   o  Moved more text from "protocol" and further reorganized material.

2005	   o  Provided new material on "Contextual Rule Required".

2007	   o  Improved consistency of terminology, both internally and with the
2008	      "tables" document.

2010	   o  Improved the IANA Considerations section and discussed the
2011	      existing IDNA-related registry.

2013	   o  More small changes to increase consistency.

2015	15.5.  Version -05 of draft-klensin-idnabis-issues

2017	   Changed "YES" category back to "ALWAYS" to re-synch with the tables
2018	   document and provide clearer terminology.

2020	15.6.  Version -06 of draft-klensin-idnabis-issues

2022	   o  Clarified the prohibitions on strings that look like A-labels but
2023	      are not and on unassigned code points.

2025	   o  Clarified length restrictions on IDN labels.

2027	   o  Revised the terminology definitions to remove the impression of
2028	      circularity and removed invocations of ToASCII and ToUnicode,
2029	      which do not exist in IDNA2008.

2031	   o  Added a new section on front-end processing.

2033	   o  Added a new section to discuss case-mapping.

2035	   o  Extended the discussion of prefix changes to identify the
2036	      implications of making one.

2038	   o  Several more editorial improvements, corrected references, and
2039	      similar adjustments.

2041	15.7.  Version -07 of draft-klensin-idnabis-issues

2043	   o  Added material that specifically defines the format of contextual
2044	      rules.

2046	   o  Added and altered text after discussions at the 30 January meeting
2047	      (see Section 11) and the follow-up to those discussions.  Among
2048	      the key decisions at that meeting were to eliminate the
2049	      distinction among the valid categories (formerly "ALWAYS", "MAYBE
2050	      YES", and "MAYBE NO"), to adjust the terminology accordingly, and
2051	      to change "CONTEXTUAL RULE REQUIRED" from a separate category in
2052	      this document and the protocol one to a modifier of what is now
2053	      called "PROTOCOL-VALID".  The consequent changes resulted in
2054	      removal of several sections of explanation from this document.

2056	   o  Resynchronized terminology with "protocol" and "tables" documents.

2058	   o  More editorial and typographic corrections.

2060	15.8.  Version -00 of draft-ietf-idnabis-rationale

2062	   o  Rewrote the abstract and introduction, and retuned the title, to
2063	      be more consistent with WG work and activities.  Changed the file
2064	      name to reflect WG naming.

2066	   o  Removed most of the material that explained, or compared this
2067	      approach to, IDNA2003.  Some of this material may appear in the
2068	      non-WG "IDNA-alternatives" draft if it is ever completed.

2070	   o  Changed IDNA200X in terminology and references to IDNA2008.

2072	   o  Added a contextual rule for hyphen to the appendix, adjusted the
2073	      rule syntax slightly, and supplied draft regular expression rules.

2075	   o  Responded to comments produced during the WG charter discussions
2076	      and from several individuals.  In general, comments requesting a
2077	      reorganization of the collection of documents have not been
2078	      responded to pending a WG decision on that topic.

2080	   o  Moved the contextual rule appendix out of here and into
2081	      "Protocol".  It may not belong there either, but definitely does
2082	      not belong here, and was holding up getting this document out.

2084	   o  Many small editorial improvements, including reorganization of
2085	      some material.

2087	   Editorial note: While several sections have been removed from this
2088	   version, the WG should discuss whether further cuts are desirable,
2089	   e.g., whether Section 7.3, Section 7.4, or Section 10.3 provide
2090	   enough value to be worth retaining?  Can Section 10.4 be trimmed
2091	   without loss of useful information and, if so, how?  Section 10.7
2092	   appears critical of IDNA2003 in undesirable ways: should it be
2093	   dropped or do people have suggestions about how to improve it?
2094	   Strong opinions have been expressed that Section 10.5 should be
2095	   trimmed significantly or removed entirely.  The WG will need to
2096	   discuss that too.  Are there other materials that should be trimmed
2097	   out?

2099	15.9.  Version -01 of draft-ietf-idnabis-rationale

2101	   o  Clarified the U-label definition to note that U-labels must
2102	      contain at least one non-ASCII character.  Also clarified the
2103	      relationship among label types.

2105	   o  Rewrote the discussion of Labels in Registration (Section 10.1.2)
2106	      and related text in Section 1.5.4.1.1 to narrow its focus and
2107	      remove more general restrictions.  Added a temporary note in line
2108	      to explain the situation.

2110	   o  Changed the "IDNA uses Unicode" statement to focus on
2111	      compatibility with IDNA2003 and avoid more general or
2112	      controversial assertions.

2114	   o  Added a discussion of examples to Section 10.1.

2116	   o  Made a number of other small editorial changes and corrections
2117	      suggested by Mark Davis.

2119	   o  Added several more discussion anchors and notes and expanded or
2120	      updated some existing ones.

2122	16.  References

2124	16.1.  Normative References

2126	   [ASCII]    American National Standards Institute (formerly United
2127	              States of America Standards Institute), "USA Code for
2128	              Information Interchange", ANSI X3.4-1968, 1968.

2130	              ANSI X3.4-1968 has been replaced by newer versions with
2131	              slight modifications, but the 1968 version remains
2132	              definitive for the Internet.

2134	   [IDNA2008-Bidi]
2135	              Alvestrand, H. and C. Karp, "An updated IDNA criterion for
2136	              right to left scripts", July 2008, <http://www.ietf.org/
2137	              internet-drafts/draft-ietf-idnabis-bidi-01.txt>.

2139	   [IDNA2008-Protocol]
2140	              Klensin, J., "Internationalized Domain Names in
2141	              Applications (IDNA): Protocol", July 2008, <http://
2142	              www.ietf.org/internet-drafts/
2143	              draft-ietf-idnabis-protocol-02.txt>.

2145	   [IDNA2008-Tables]
2146	              Faltstrom, P., "The Unicode Code Points and IDNA",
2147	              May 2008, <http://www.ietf.org/internet-drafts/
2148	              draft-ietf-idnabis-tables-01.txt>.

2150	              A version of this document is available in HTML format at
2151	              http://stupid.domain.name/idnabis/
2152	              draft-ietf-idnabis-tables-01.html

2154	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2155	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2157	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
2158	              Internationalized Strings ("stringprep")", RFC 3454,
2159	              December 2002.

2161	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
2162	              "Internationalizing Domain Names in Applications (IDNA)",
2163	              RFC 3490, March 2003.

2165	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2166	              Profile for Internationalized Domain Names (IDN)",
2167	              RFC 3491, March 2003.

2169	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
2170	              for Internationalized Domain Names in Applications
2171	              (IDNA)", RFC 3492, March 2003.

2173	   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
2174	              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
2175	              May 2008.

2177	   [RulesInit]
2178	              Klensin, J., "Internationalizing Domain Names in
2179	              Applications (IDNA): Protocol, Appendix A Contextual Rules
2180	              Table", July 2008, <http://www.ietf.org/internet-drafts/
2181	              draft-ietf-idnabis-protocol-02.txt>.

2183	   [Unicode51]
2184	              The Unicode Consortium, "The Unicode Standard, Version
2185	              5.1.0", 2008.

2187	              Defined by: The Unicode Standard, Version 5.0, Boston, MA,
2188	              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
2189	              Unicode 5.1.0
2190	              (http://www.unicode.org/versions/Unicode5.1.0/).

2192	16.2.  Informative References

2194	   [BIG5]     Institute for Information Industry of Taiwan, "Computer
2195	              Chinese Glyph and Character Code Mapping Table, Technical
2196	              Report C-26", 1984.

2198	              There are several forms and variations and a closely-
2199	              related standard, CNS 11643.  See the discussion in
2200	              Chapter 3 of Lunde, K., CJKV Information Processing,
2201	              O'Reilly & Associates, 1999.

2203	   [GB18030]  "Chinese National Standard GB 18030-2000: Information
2204	              Technology -- Chinese ideograms coded character set for
2205	              information interchange -- Extension for the basic set.",
2206	              2000.

2208	   [RFC0810]  Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
2209	              Internet host table specification", RFC 810, March 1982.

2211	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
2212	              host table specification", RFC 952, October 1985.

2214	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
2215	              STD 13, RFC 1034, November 1987.

2217	   [RFC1035]  Mockapetris, P., "Domain names - implementation and
2218	              specification", STD 13, RFC 1035, November 1987.

2220	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
2221	              and Support", STD 3, RFC 1123, October 1989.

2223	   [RFC2782]  Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
2224	              specifying the location of services (DNS SRV)", RFC 2782,
2225	              February 2000.

2227	   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
2228	              Engineering Team (JET) Guidelines for Internationalized
2229	              Domain Names (IDN) Registration and Administration for
2230	              Chinese, Japanese, and Korean", RFC 3743, April 2004.

2232	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
2233	              Identifiers (IRIs)", RFC 3987, January 2005.

2235	   [RFC4290]  Klensin, J., "Suggested Practices for Registration of
2236	              Internationalized Domain Names (IDN)", RFC 4290,
2237	              December 2005.

2239	   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
2240	              Recommendations for Internationalized Domain Names
2241	              (IDNs)", RFC 4690, September 2006.

2243	   [RFC4713]  Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
2244	              "Registration and Administration Recommendations for
2245	              Chinese Domain Names", RFC 4713, October 2006.

2247	   [Unicode-UTR36]
2248	              The Unicode Consortium, "Unicode Technical Report #36:
2249	              Unicode Security Considerations", August 2006,
2250	              <http://www.unicode.org/reports/tr36/>.

2252	Author's Address

2254	   John C Klensin
2255	   1770 Massachusetts Ave, Ste 322
2256	   Cambridge, MA  02140
2257	   USA

2259	   Phone: +1 617 245 1457
2260	   Email: john+ietf@jck.com

2262	Full Copyright Statement

2264	   Copyright (C) The IETF Trust (2008).

2266	   This document is subject to the rights, licenses and restrictions
2267	   contained in BCP 78, and except as set forth therein, the authors
2268	   retain all their rights.

2270	   This document and the information contained herein are provided on an
2271	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2272	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
2273	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
2274	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2275	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2276	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2278	Intellectual Property

2280	   The IETF takes no position regarding the validity or scope of any
2281	   Intellectual Property Rights or other rights that might be claimed to
2282	   pertain to the implementation or use of the technology described in
2283	   this document or the extent to which any license under such rights
2284	   might or might not be available; nor does it represent that it has
2285	   made any independent effort to identify any such rights.  Information
2286	   on the procedures with respect to rights in RFC documents can be
2287	   found in BCP 78 and BCP 79.

2289	   Copies of IPR disclosures made to the IETF Secretariat and any
2290	   assurances of licenses to be made available, or the result of an
2291	   attempt made to obtain a general license or permission for the use of
2292	   such proprietary rights by implementers or users of this
2293	   specification can be obtained from the IETF on-line IPR repository at
2294	   http://www.ietf.org/ipr.

2296	   The IETF invites any interested party to bring to its attention any
2297	   copyrights, patents or patent applications, or other proprietary
2298	   rights that may cover technology that may be required to implement
2299	   this standard.  Please address the information to the IETF at
2300	   ietf-ipr@ietf.org.