idnits 2.17.1 

draft-hoffman-rfc3490bis-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-27) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing document type: Expected "INTERNET-DRAFT" in the upper left hand
     corner of the first page

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 946 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 2 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2136' is mentioned on line 737, but not defined

  -- Obsolete informational reference (is this intentional?): RFC 2535
     (Obsoleted by RFC 4033, RFC 4034, RFC 4035)


     Summary: 9 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	draft-hoffman-rfc3490bis-02.txt                             P. Faltstrom
2	April 14, 2004                                                     Cisco
3	Expires in six months                                         P. Hoffman
4	                                                              IMC & VPNC
5	                                                             A. Costello
6	                                                             UC Berkeley

8	         Internationalizing Domain Names in Applications (IDNA)

10	Abstract

12	   Until now, there has been no standard method for domain names to use
13	   characters outside the ASCII repertoire.  This document defines
14	   internationalized domain names (IDNs) and a mechanism called
15	   Internationalizing Domain Names in Applications (IDNA) for handling
16	   them in a standard fashion.  IDNs use characters drawn from a large
17	   repertoire (Unicode), but IDNA allows the non-ASCII characters to be
18	   represented using only the ASCII characters already allowed in so-
19	   called host names today.  This backward-compatible representation is
20	   required in existing protocols like DNS, so that IDNs can be
21	   introduced with no changes to the existing infrastructure.  IDNA is
22	   only meant for processing domain names, not free text.

24	1. Introduction

26	   IDNA works by allowing applications to use certain ASCII name labels
27	   (beginning with a special prefix) to represent non-ASCII name labels.
28	   Lower-layer protocols need not be aware of this; therefore IDNA does
29	   not depend on changes to any infrastructure.  In particular, IDNA
30	   does not depend on any changes to DNS servers, resolvers, or protocol
31	   elements, because the ASCII name service provided by the existing DNS
32	   is entirely sufficient for IDNA.

34	   This document does not require any applications to conform to IDNA,
35	   but applications can elect to use IDNA in order to support IDN while
36	   maintaining interoperability with existing infrastructure.  If an
37	   application wants to use non-ASCII characters in domain names, IDNA
38	   is the only currently-defined option.  Adding IDNA support to an
39	   existing application entails changes to the application only, and
40	   leaves room for flexibility in the user interface.

42	   A great deal of the discussion of IDN solutions has focused on
43	   transition issues and how IDN will work in a world where not all of
44	   the components have been updated.  Proposals that were not chosen by
45	   the IDN Working Group would depend on user applications, resolvers,
46	   and DNS servers being updated in order for a user to use an
47	   internationalized domain name.  Rather than rely on widespread
48	   updating of all components, IDNA depends on updates to user
49	   applications only; no changes are needed to the DNS protocol or any
50	   DNS servers or the resolvers on users' computers.

52	   The IESG issued a statement on IDNA [IESG-STATEMENT].

54	1.1 Problem Statement

56	   The IDNA specification solves the problem of extending the repertoire
57	   of characters that can be used in domain names to include the Unicode
58	   repertoire (with some restrictions).

60	   IDNA does not extend the service offered by DNS to the applications.
61	   Instead, the applications (and, by implication, the users) continue
62	   to see an exact-match lookup service.  Either there is a single
63	   exactly-matching name or there is no match.  This model has served
64	   the existing applications well, but it requires, with or without
65	   internationalized domain names, that users know the exact spelling of
66	   the domain names that the users type into applications such as web
67	   browsers and mail user agents.  The introduction of the larger
68	   repertoire of characters potentially makes the set of misspellings
69	   larger, especially given that in some cases the same appearance, for
70	   example on a business card, might visually match several Unicode code
71	   points or several sequences of code points.

73	   IDNA allows the graceful introduction of IDNs not only by avoiding
74	   upgrades to existing infrastructure (such as DNS servers and mail
75	   transport agents), but also by allowing some rudimentary use of IDNs
76	   in applications by using the ASCII representation of the non-ASCII
77	   name labels.  While such names are very user-unfriendly to read and
78	   type, and hence are not suitable for user input, they allow (for
79	   instance) replying to email and clicking on URLs even though the
80	   domain name displayed is incomprehensible to the user.  In order to
81	   allow user-friendly input and output of the IDNs, the applications
82	   need to be modified to conform to this specification.

84	   IDNA uses the Unicode character repertoire, which avoids the
85	   significant delays that would be inherent in waiting for a different
86	   and specific character set to be defined for IDN purposes by some
87	   other standards developing organization.

89	1.2 Limitations of IDNA

91	   The IDNA protocol does not solve all linguistic issues with users
92	   inputting names in different scripts.  Many important language-based
93	   and script-based mappings are not covered in IDNA and need to be
94	   handled outside the protocol.  For example, names that are entered in
95	   a mix of traditional and simplified Chinese characters will not be
96	   mapped to a single canonical name.  Similarly, Scandinavian
97	   names that are entered with U+00F6 (LATIN SMALL LETTER O WITH
98	   DIAERESIS) will not be mapped to U+00F8 (LATIN SMALL LETTER O WITH
99	   STROKE).

101	   An example of an important issue that is not considered in detail in
102	   IDNA is how to provide a high probability that a user who is entering
103	   a domain name based on visual information (such as from a business
104	   card or billboard) or aural information (such as from a telephone or
105	   radio) would correctly enter the IDN.  Similar issues exist for ASCII
106	   domain names, for example the possible visual confusion between the
107	   letter 'O' and the digit zero, but the introduction of the larger
108	   repertoire of characters creates more opportunities for similar-
109	   looking and similar-sounding names.  Note that this is a complex
110	   issue relating to languages, input methods on computers, and so on.
111	   Furthermore, the kind of matching and searching necessary for a high
112	   probability of success would not fit the role of the DNS and its
113	   exact matching function.

115	1.3 Brief overview for application developers

117	   Applications can use IDNA to support internationalized domain names
118	   anywhere that ASCII domain names are already supported, including DNS
119	   master files and resolver interfaces.  (Applications can also define
120	   protocols and interfaces that support IDNs directly using non-ASCII
121	   representations.  IDNA does not prescribe any particular
122	   representation for new protocols, but it still defines which names
123	   are valid and how they are compared.)

125	   The IDNA protocol is contained completely within applications.  It is
126	   not a client-server or peer-to-peer protocol: everything is done
127	   inside the application itself.  When used with a DNS resolver
128	   library, IDNA is inserted as a "shim" between the application and the
129	   resolver library.  When used for writing names into a DNS zone, IDNA
130	   is used just before the name is committed to the zone.

132	   There are two operations described in section 4 of this document:

134	   -  The ToASCII operation is used before sending an IDN to something
135	      that expects ASCII names (such as a resolver) or writing an IDN
136	      into a place that expects ASCII names (such as a DNS master file).

138	   -  The ToUnicode operation is used when displaying names to users,
139	      for example names obtained from a DNS zone.

141	   It is important to note that the ToASCII operation can fail.  If it
142	   fails when processing a domain name, that domain name cannot be used
143	   as an internationalized domain name and the application has to have
144	   some method of dealing with this failure.

146	   IDNA requires that implementations process input strings with
147	   Nameprep [NAMEPREP], which is a profile of Stringprep [STRINGPREP],
148	   and then with Punycode [PUNYCODE].  Implementations of IDNA MUST
149	   fully implement Nameprep and Punycode; neither Nameprep nor Punycode
150	   is optional.
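
   As an informal illustration of the two operations and of the need
   to handle ToASCII failure, the sketch below uses the "idna" codec
   from the Python standard library as a stand-in for an IDNA
   implementation.  The helper name safe_to_ascii and the sample names
   are assumptions made for this example; they are not part of this
   specification.

```python
from typing import Optional

# Illustrative only: Python's built-in "idna" codec is used as a
# stand-in for an IDNA implementation.  safe_to_ascii is a hypothetical
# helper name, not something defined by this document.


def safe_to_ascii(name: str) -> Optional[str]:
    """Return the ASCII form of a domain name, or None if ToASCII fails."""
    try:
        return name.encode("idna").decode("ascii")
    except UnicodeError:
        # ToASCII can fail; how to react is up to the application.
        return None


# ToASCII direction: suitable for a resolver or a DNS master file.
ascii_name = safe_to_ascii("bücher.example")

# ToUnicode direction: suitable for display to a user.
display_name = ascii_name.encode("ascii").decode("idna")

# A 64-character label exceeds the length limit, so conversion fails.
too_long = safe_to_ascii("a" * 64 + ".example")
```

   Returning None on failure is only one possible policy; an
   application could equally well report an error to the user.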

152	2. Terminology

154	   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
155	   and "MAY" in this document are to be interpreted as described in BCP
156	   14, RFC 2119 [RFC2119].

158	   A code point is an integer value associated with a character in a
159	   coded character set.

161	   Unicode [UNICODE] is a coded character set containing tens of
162	   thousands of characters.  A single Unicode code point is denoted by
163	   "U+" followed by four to six hexadecimal digits, while a range of
164	   Unicode code points is denoted by two hexadecimal numbers separated
165	   by "..", with no prefixes.

167	   ASCII means US-ASCII [USASCII], a coded character set containing 128
168	   characters associated with code points in the range 0..7F.  Unicode
169	   is an extension of ASCII: it includes all the ASCII characters and
170	   associates them with the same code points.

172	   The term "LDH code points" is defined in this document to mean the
173	   code points associated with ASCII letters, digits, and the hyphen-
174	   minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an
175	   abbreviation for "letters, digits, hyphen".
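
   The LDH definition above translates directly into a range check.
   The following is a minimal sketch; the function names is_ldh and
   is_ldh_label are hypothetical and chosen only for this example.

```python
def is_ldh(code_point: int) -> bool:
    """True for U+002D, 30..39, 41..5A and 61..7A (hexadecimal)."""
    return (
        code_point == 0x2D                  # hyphen-minus
        or 0x30 <= code_point <= 0x39       # digits 0-9
        or 0x41 <= code_point <= 0x5A       # letters A-Z
        or 0x61 <= code_point <= 0x7A       # letters a-z
    )


def is_ldh_label(label: str) -> bool:
    """True if every code point of the label is an LDH code point."""
    return all(is_ldh(ord(ch)) for ch in label)
```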

177	   [STD13] talks about "domain names" and "host names", but many people
178	   use the terms interchangeably.  Further, because [STD13] was not
179	   terribly clear, many people who are sure they know the exact
180	   definitions of each of these terms disagree on the definitions.  In
181	   this document the term "domain name" is used in general.  This
182	   document explicitly cites [STD3] whenever referring to the host name
183	   syntax restrictions defined therein.

185	   A label is an individual part of a domain name.  Labels are usually
186	   shown separated by dots; for example, the domain name
187	   "www.example.com" is composed of three labels: "www", "example", and
188	   "com".  (The zero-length root label described in [STD13], which can
189	   be explicit as in "www.example.com." or implicit as in
190	   "www.example.com", is not considered a label in this specification.)
191	   IDNA extends the set of usable characters in labels that are text.
192	   For the rest of this document, the term "label" is shorthand for
193	   "text label", and "every label" means "every text label".

195	   An "internationalized label" is a label to which the ToASCII
196	   operation (see section 4) can be applied without failing (with the
197	   UseSTD3ASCIIRules flag unset).  This implies that every ASCII label
198	   that satisfies the [STD13] length restriction is an internationalized
199	   label.  Therefore the term "internationalized label" is a
200	   generalization, embracing both old ASCII labels and new non-ASCII
201	   labels.  Although most Unicode characters can appear in
202	   internationalized labels, ToASCII will fail for some input strings,
203	   and such strings are not valid internationalized labels.

205	   An "internationalized domain name" (IDN) is a domain name in which
206	   every label is an internationalized label.  This implies that every
207	   ASCII domain name is an IDN (which implies that it is possible for a
208	   name to be an IDN without it containing any non-ASCII characters).
209	   This document does not attempt to define an "internationalized host
210	   name".  Just as has been the case with ASCII names, some DNS zone
211	   administrators may impose restrictions, beyond those imposed by DNS
212	   or IDNA, on the characters or strings that may be registered as
213	   labels in their zones.  Such restrictions have no impact on the
214	   syntax or semantics of DNS protocol messages; a query for a name that
215	   matches no records will yield the same response regardless of the
216	   reason why it is not in the zone.  Clients issuing queries or
217	   interpreting responses cannot be assumed to have any knowledge of
218	   zone-specific restrictions or conventions.

220	   In IDNA, equivalence of labels is defined in terms of the ToASCII
221	   operation, which constructs an ASCII form for a given label, whether
222	   or not the label was already an ASCII label.  Labels are defined to
223	   be equivalent if and only if their ASCII forms produced by ToASCII
224	   match using a case-insensitive ASCII comparison.  ASCII labels
225	   already have a notion of equivalence: upper case and lower case are
226	   considered equivalent.  The IDNA notion of equivalence is an
227	   extension of that older notion.  Equivalent labels in IDNA are
228	   treated as alternate forms of the same label, just as "foo" and "Foo"
229	   are treated as alternate forms of the same label.
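
   Label equivalence as defined above can be sketched as follows.  The
   example relies on the ToASCII function of Python's encodings.idna
   module as a stand-in implementation; the name labels_equivalent is
   hypothetical, and treating a ToASCII failure as "not equivalent" is
   an assumption of this sketch rather than a rule of this document.

```python
from encodings.idna import ToASCII  # stand-in ToASCII implementation


def labels_equivalent(a: str, b: str) -> bool:
    """Compare two labels by their ASCII forms, ignoring ASCII case."""
    try:
        # ToASCII returns bytes; bytes.lower() only touches ASCII letters.
        return ToASCII(a).lower() == ToASCII(b).lower()
    except UnicodeError:
        # Assumption: a label for which ToASCII fails matches nothing.
        return False
```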

231	   To allow internationalized labels to be handled by existing
232	   applications, IDNA uses an "ACE label" (ACE stands for ASCII
233	   Compatible Encoding).  An ACE label is an internationalized label
234	   that can be rendered in ASCII and is equivalent to an
235	   internationalized label that cannot be rendered in ASCII.  Given any
236	   internationalized label that cannot be rendered in ASCII, the ToASCII
237	   operation will convert it to an equivalent ACE label (whereas an
238	   ASCII label will be left unaltered by ToASCII).  ACE labels are
239	   unsuitable for display to users.  The ToUnicode operation will
240	   convert any label to an equivalent non-ACE label.  In fact, an ACE
241	   label is formally defined to be any label that the ToUnicode
242	   operation would alter (whereas non-ACE labels are left unaltered by
243	   ToUnicode).  Every ACE label begins with the ACE prefix specified in
244	   section 5.  The ToASCII and ToUnicode operations are specified in
245	   section 4.

247	   The "ACE prefix" is defined in this document to be a string of ASCII
248	   characters that appears at the beginning of every ACE label.  It is
249	   specified in section 5.

251	   A "domain name slot" is defined in this document to be a protocol
252	   element or a function argument or a return value (and so on)
253	   explicitly designated for carrying a domain name.  Examples of domain
254	   name slots include: the QNAME field of a DNS query; the name argument
255	   of the gethostbyname() library function; the part of an email address
256	   following the at-sign (@) in the From: field of an email message
257	   header; and the host portion of the URI in the src attribute of an
258	   HTML <IMG> tag.  General text that just happens to contain a domain
259	   name is not a domain name slot; for example, a domain name appearing
260	   in the plain text body of an email message is not occupying a domain
261	   name slot.

263	   An "IDN-aware domain name slot" is defined in this document to be a
264	   domain name slot explicitly designated for carrying an
265	   internationalized domain name as defined in this document.  The
266	   designation may be static (for example, in the specification of the
267	   protocol or interface) or dynamic (for example, as a result of
268	   negotiation in an interactive session).

270	   An "IDN-unaware domain name slot" is defined in this document to be
271	   any domain name slot that is not an IDN-aware domain name slot.
272	   Obviously, this includes any domain name slot whose specification
273	   predates IDNA.

275	3. Requirements and applicability

277	3.1 Requirements

279	   IDNA conformance means adherence to the following four requirements:

281	   1) Whenever dots are used as label separators, the following
282	      characters MUST be recognized as dots: U+002E (full stop), U+3002
283	      (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61
284	      (halfwidth ideographic full stop).

286	   2) Whenever a domain name is put into an IDN-unaware domain name slot
287	      (see section 2), it MUST contain only ASCII characters.  Given an
288	      internationalized domain name (IDN), an equivalent domain name
289	      satisfying this requirement can be obtained by applying the
290	      ToASCII operation (see section 4) to each label and, if dots are
291	      used as label separators, changing all the label separators to
292	      U+002E.

294	   3) ACE labels obtained from domain name slots SHOULD be hidden from
295	      users when it is known that the environment can handle the non-ACE
296	      form, except when the ACE form is explicitly requested.  When it
297	      is not known whether or not the environment can handle the non-ACE
298	      form, the application MAY use the non-ACE form (which might fail,
299	      such as by not being displayed properly), or it MAY use the ACE
300	      form (which will look unintelligible to the user).  Given an
301	      internationalized domain name, an equivalent domain name
302	      containing no ACE labels can be obtained by applying the ToUnicode
303	      operation (see section 4) to each label.  When requirements 2 and
304	      3 both apply, requirement 2 takes precedence.

306	   4) Whenever two labels are compared, they MUST be considered to match
307	      if and only if they are equivalent, that is, their ASCII forms
308	      (obtained by applying ToASCII) match using a case-insensitive
309	      ASCII comparison.  Whenever two names are compared, they MUST be
310	      considered to match if and only if their corresponding labels
311	      match, regardless of whether the names use the same forms of label
312	      separators.
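
   Requirements 1 and 4 can be illustrated together: a name is split
   on any of the four dot characters, and two names match when their
   labels match pairwise.  This is a sketch only.  The names
   split_labels and names_match are hypothetical, Python's
   encodings.idna.ToASCII is used as a stand-in, and the handling of a
   trailing root dot is deliberately left out.

```python
import re
from encodings.idna import ToASCII  # stand-in ToASCII implementation

# U+002E, U+3002, U+FF0E and U+FF61 are all recognized as dots.
LABEL_SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(name: str) -> list:
    """Split a domain name on any of the four label separators."""
    return LABEL_SEPARATORS.split(name)


def names_match(a: str, b: str) -> bool:
    """True if both names have the same number of labels and every
    pair of corresponding labels has matching ASCII forms."""
    labels_a, labels_b = split_labels(a), split_labels(b)
    if len(labels_a) != len(labels_b):
        return False
    try:
        return all(
            ToASCII(x).lower() == ToASCII(y).lower()
            for x, y in zip(labels_a, labels_b)
        )
    except UnicodeError:
        # Assumption: a name containing a failing label matches nothing.
        return False
```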

314	3.2 Applicability

316	   IDNA is applicable to all domain names in all domain name slots
317	   except where it is explicitly excluded.

319	   This implies that IDNA is applicable to many protocols that predate
320	   IDNA.  Note that IDNs occupying domain name slots in those protocols
321	   MUST be in ASCII form (see section 3.1, requirement 2).

323	3.2.1. DNS resource records

325	   IDNA does not apply to domain names in the NAME and RDATA fields of
326	   DNS resource records whose CLASS is not IN.  This exclusion applies
327	   to every non-IN class, present and future, except where future
328	   standards override this exclusion by explicitly inviting the use of
329	   IDNA.

331	   There are currently no other exclusions on the applicability of IDNA
332	   to DNS resource records; it depends entirely on the CLASS, and not on
333	   the TYPE.  This will remain true, even as new types are defined,
334	   unless there is a compelling reason for a new type to complicate
335	   matters by imposing type-specific rules.

337	3.2.2. Non-domain-name data types stored in domain names

339	   Although IDNA enables the representation of non-ASCII characters in
340	   domain names, that does not imply that IDNA enables the
341	   representation of non-ASCII characters in other data types that are
342	   stored in domain names.  For example, an email address local part is
343	   sometimes stored in a domain label (hostmaster@example.com would be
344	   represented as hostmaster.example.com in the RDATA field of an SOA
345	   record).  IDNA does not update the existing email standards, which
346	   allow only ASCII characters in local parts.  Therefore, unless the
347	   email standards are revised to invite the use of IDNA for local
348	   parts, a domain label that holds the local part of an email address
349	   SHOULD NOT begin with the ACE prefix, and even if it does, it is to
350	   be interpreted literally as a local part that happens to begin with
351	   the ACE prefix.

353	4. Conversion operations

355	   An application converts a domain name before it is put into an
356	   IDN-unaware slot or displayed to a user.  This section specifies the
357	   conversion steps and the ToASCII and ToUnicode operations.

359	   The input to ToASCII or ToUnicode is a single label that is a
360	   sequence of Unicode code points (remember that all ASCII code points
361	   are also Unicode code points).  If a domain name is represented using
362	   a character set other than Unicode or US-ASCII, it will first need to
363	   be transcoded to Unicode.

365	   Starting from a whole domain name, the steps that an application
366	   takes to do the conversions are:

368	   1) Decide whether the domain name is a "stored string" or a "query
369	      string" as described in [STRINGPREP].  If this conversion follows
370	      the "queries" rule from [STRINGPREP], set the flag called
371	      "AllowUnassigned".

373	   2) Split the domain name into individual labels as described in
374	      section 3.1.  The labels do not include the separator.

376	   3) For each label, decide whether or not to enforce the restrictions
377	      on ASCII characters in host names [STD3].  (Applications already
378	      faced this choice before the introduction of IDNA, and can
379	      continue to make the decision the same way they always have; IDNA
380	      makes no new recommendations regarding this choice.)  If the
381	      restrictions are to be enforced, set the flag called
382	      "UseSTD3ASCIIRules" for that label.

384	   4) Process each label with either the ToASCII or the ToUnicode
385	      operation as appropriate.  Typically, you use the ToASCII
386	      operation if you are about to put the name into an IDN-unaware
387	      slot, and you use the ToUnicode operation if you are displaying
388	      the name to a user; section 3.1 gives greater detail on the
389	      applicable requirements.

391	   5) If ToASCII was applied in step 4 and dots are used as label
392	      separators, change all the label separators to U+002E (full stop).

394	   The following two subsections define the ToASCII and ToUnicode
395	   operations that are used in step 4.

397	   This description of the protocol uses specific procedure names, names
398	   of flags, and so on, in order to facilitate the specification of the
399	   protocol.  These names, as well as the actual steps of the
400	   procedures, are not required of an implementation.  In fact, any
401	   implementation which has the same external behavior as specified in
402	   this document conforms to this specification.
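
   The whole-name procedure above might be sketched as follows, under
   several stated assumptions.  The function name convert_name is
   hypothetical.  Python's encodings.idna module supplies the per-label
   operations; it offers no AllowUnassigned or UseSTD3ASCIIRules
   switches, so steps 1 and 3 are not modelled.  Its ToUnicode raises
   an exception where this document says the input is returned
   unaltered, so the sketch catches that case.  The sketch also writes
   U+002E as the separator in both directions, which step 5 requires
   only for ToASCII.

```python
import re
from encodings.idna import ToASCII, ToUnicode  # stand-in operations

DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def convert_name(name: str, to_ascii: bool) -> str:
    """Apply ToASCII or ToUnicode to each label of a domain name."""
    labels = DOTS.split(name)                       # step 2
    trailing_root = len(labels) > 1 and labels[-1] == ""
    if trailing_root:
        labels.pop()        # the root label is not a label in IDNA terms
    converted = []
    for label in labels:                            # step 4
        if to_ascii:
            # May raise UnicodeError: the name is then unusable as an IDN.
            converted.append(ToASCII(label).decode("ascii"))
        else:
            try:
                converted.append(ToUnicode(label))
            except UnicodeError:
                converted.append(label)             # ToUnicode never fails
    result = ".".join(converted)                    # step 5
    return result + "." if trailing_root else result
```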

404	4.1 ToASCII

406	   The ToASCII operation takes a sequence of Unicode code points that
407	   make up one label and transforms it into a sequence of code points in
408	   the ASCII range (0..7F).  If ToASCII succeeds, the original sequence
409	   and the resulting sequence are equivalent labels.

411	   It is important to note that the ToASCII operation can fail.  ToASCII
412	   fails if any step of it fails.  If any step of the ToASCII operation
413	   fails on any label in a domain name, that domain name MUST NOT be
414	   used as an internationalized domain name.  The method for dealing
415	   with this failure is application-specific.

417	   The inputs to ToASCII are a sequence of code points, the
418	   AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
419	   ToASCII is either a sequence of ASCII code points or a failure
420	   condition.

422	   ToASCII never alters a sequence of code points that are all in the
423	   ASCII range to begin with (although it could fail).  Applying the
424	   ToASCII operation multiple times has exactly the same effect as
425	   applying it just once.

427	   ToASCII consists of the following steps:

429	   1. If the sequence contains any code points outside the ASCII range
430	      (0..7F) then proceed to step 2, otherwise skip to step 3.

432	   2. Perform the steps specified in [NAMEPREP] and fail if there is an
433	      error.  The AllowUnassigned flag is used in [NAMEPREP].

435	   3. If the UseSTD3ASCIIRules flag is set, then perform these checks:

437	     (a) Verify the absence of non-LDH ASCII code points; that is, the
438	         absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.

440	     (b) Verify the absence of leading and trailing hyphen-minus; that
441	         is, the absence of U+002D at the beginning and end of the
442	         sequence.

444	   4. If the sequence contains any code points outside the ASCII range
445	      (0..7F) then proceed to step 5, otherwise skip to step 8.

447	   5. Verify that the sequence does NOT begin with the ACE prefix.

449	   6. Encode the sequence using the encoding algorithm in [PUNYCODE] and
450	      fail if there is an error.

452	   7. Prepend the ACE prefix.

454	   8. Verify that the number of code points is in the range 1 to 63
455	      inclusive (0 is excluded).
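
   The eight steps can be followed one by one in code.  The sketch
   below is not a reference implementation.  It borrows the nameprep
   function and the "punycode" codec from the Python standard library;
   according to the Python documentation that nameprep behaves as if
   AllowUnassigned were set, so the flag is not a parameter here.  The
   names to_ascii and use_std3_ascii_rules are hypothetical, and every
   failure is reported as a ValueError (UnicodeError is a subclass).

```python
import string
from encodings.idna import nameprep  # stand-in for [NAMEPREP]

ACE_PREFIX = "xn--"
LDH = set(string.ascii_letters + string.digits + "-")


def _has_non_ascii(label: str) -> bool:
    return any(ord(ch) > 0x7F for ch in label)


def to_ascii(label: str, use_std3_ascii_rules: bool = False) -> str:
    """Sketch of ToASCII; raises ValueError when any step fails."""
    # Steps 1-2: Nameprep is applied only if non-ASCII is present.
    if _has_non_ascii(label):
        label = nameprep(label)
    # Step 3: optional host name checks.
    if use_std3_ascii_rules:
        if any(ord(ch) <= 0x7F and ch not in LDH for ch in label):
            raise ValueError("non-LDH ASCII code point")
        if label.startswith("-") or label.endswith("-"):
            raise ValueError("leading or trailing hyphen-minus")
    # Steps 4-7: encode only if non-ASCII is still present.
    if _has_non_ascii(label):
        if label[:4].lower() == ACE_PREFIX:         # step 5
            raise ValueError("label begins with the ACE prefix")
        encoded = label.encode("punycode")          # step 6
        label = ACE_PREFIX + encoded.decode("ascii")  # step 7
    # Step 8: length check.
    if not 1 <= len(label) <= 63:
        raise ValueError("label length outside 1..63")
    return label
```

   For example, to_ascii("bücher") yields "xn--bcher-kva", while an
   all-ASCII label such as "Example" is returned unaltered.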

457	4.2 ToUnicode

459	   The ToUnicode operation takes a sequence of Unicode code points that
460	   make up one label and returns a sequence of Unicode code points.  If
461	   the input sequence is a label in ACE form, then the result is an
462	   equivalent internationalized label that is not in ACE form, otherwise
463	   the original sequence is returned unaltered.

465	   ToUnicode never fails.  If any step fails, then the original input
466	   sequence is returned immediately in that step.

468	   The Punycode decoder can never output more code points than it
469	   inputs, but Nameprep can, and therefore ToUnicode can.  Note that
470	   the number of octets needed to represent a sequence of code points
471	   depends on the particular character encoding used.

473	   The inputs to ToUnicode are a sequence of code points, the
474	   AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
475	   ToUnicode is always a sequence of Unicode code points.

477	   ToUnicode consists of the following steps:

479	   1. If the sequence contains any code points outside the ASCII range
480	      (0..7F) then proceed to step 2, otherwise skip to step 3.

482	   2. Perform the steps specified in [NAMEPREP] and fail if there is an
483	      error.  (If step 3 of ToASCII is also performed here, it will not
484	      affect the overall behavior of ToUnicode, but it is not
485	      necessary.)  The AllowUnassigned flag is used in [NAMEPREP].

487	   3. Verify that the sequence begins with the ACE prefix, and save a
488	      copy of the sequence.

490	   4. Remove the ACE prefix.

492	   5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
493	      fail if there is an error.  Save a copy of the result of this
494	      step.

496	   6. Apply ToASCII.

498	   7. Verify that the result of step 6 matches the saved copy from step
499	      3, using a case-insensitive ASCII comparison.

501	   8. Return the saved copy from step 5.
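
   A sketch of these steps follows, including the rule that the
   original input is returned whenever a step fails.  It is an
   illustration under assumptions, not a reference implementation: the
   name to_unicode is hypothetical, Nameprep, Punycode decoding and the
   ToASCII call in step 6 are borrowed from the Python standard
   library, and that ToASCII runs with UseSTD3ASCIIRules unset.

```python
from encodings.idna import ToASCII, nameprep  # stand-in building blocks

ACE_PREFIX = "xn--"


def to_unicode(label: str) -> str:
    """Sketch of ToUnicode; returns the input unaltered on any failure."""
    original = label
    try:
        # Steps 1-2: Nameprep is applied only if non-ASCII is present.
        if any(ord(ch) > 0x7F for ch in label):
            label = nameprep(label)
        # Step 3: check the prefix (case-insensitively) and save a copy.
        if label[:4].lower() != ACE_PREFIX:
            return original
        saved_ace = label
        # Steps 4-5: strip the prefix and decode; non-ASCII input or a
        # malformed Punycode string raises a UnicodeError here.
        decoded = label[4:].encode("ascii").decode("punycode")
        # Step 6: apply ToASCII to the decoded sequence.
        round_trip = ToASCII(decoded).decode("ascii")
        # Step 7: case-insensitive comparison with the saved copy.
        if round_trip.lower() != saved_ace.lower():
            return original
        # Step 8: return the decoded sequence.
        return decoded
    except ValueError:      # UnicodeError is a subclass of ValueError
        return original
```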

503	5. ACE prefix

505	   The ACE prefix, used in the conversion operations (section 4), is two
506	   alphanumeric ASCII characters followed by two hyphen-minuses.  It
507	   cannot be any of the prefixes already used in earlier documents,
508	   which includes the following: "bl--", "bq--", "dq--", "lq--", "mq--",
509	   "ra--", "wq--" and "zq--".  The ToASCII and ToUnicode operations MUST
510	   recognize the ACE prefix in a case-insensitive manner.

512	   The ACE prefix for IDNA is "xn--" or any capitalization thereof.

514	   This means that an ACE label might be "xn--de-jg4avhby1noc0d", where
515	   "de-jg4avhby1noc0d" is the part of the ACE label that is generated by
516	   the encoding steps in [PUNYCODE].

518	   While all ACE labels begin with the ACE prefix, not all labels
519	   beginning with the ACE prefix are necessarily ACE labels.  Non-ACE
520	   labels that begin with the ACE prefix will confuse users and SHOULD
521	   NOT be allowed in DNS zones.
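
   A minimal sketch of case-insensitive prefix recognition follows.
   The helper names has_ace_prefix and punycode_part are hypothetical.
   Passing this check does not by itself make a label an ACE label;
   that is established only by the full ToUnicode round trip.

```python
ACE_PREFIX = "xn--"


def has_ace_prefix(label: str) -> bool:
    """Recognize the IDNA ACE prefix in any capitalization."""
    return label[:len(ACE_PREFIX)].lower() == ACE_PREFIX


def punycode_part(label: str) -> str:
    """Return the part of the label that follows the ACE prefix."""
    if not has_ace_prefix(label):
        raise ValueError("label does not begin with the ACE prefix")
    return label[len(ACE_PREFIX):]


print(has_ace_prefix("XN--de-jg4avhby1noc0d"))  # prefix in upper case
print(punycode_part("xn--de-jg4avhby1noc0d"))   # text after the prefix
```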

523	6. Implications for typical applications using DNS

525	   In IDNA, applications perform the processing needed to input
526	   internationalized domain names from users, display internationalized
527	   domain names to users, and process the inputs and outputs from DNS
528	   and other protocols that carry domain names.

530	   The components and interfaces between them can be represented
531	   pictorially as:

533	                    +------+
534	                    | User |
535	                    +------+
536	                       ^
537	                       | Input and display: local interface methods
538	                       | (pen, keyboard, glowing phosphorus, ...)
539	   +-------------------|-------------------------------+
540	   |                   v                               |
541	   |          +-----------------------------+          |
542	   |          |        Application          |          |
543	   |          |   (ToASCII and ToUnicode    |          |
544	   |          |      operations may be      |          |
545	   |          |        called here)         |          |
546	   |          +-----------------------------+          |
547	   |                   ^        ^                      | End system
548	   |                   |        |                      |
549	   | Call to resolver: |        | Application-specific |
550	   |              ACE  |        | protocol:            |
551	   |                   v        | ACE unless the       |
552	   |           +----------+     | protocol is updated  |
553	   |           | Resolver |     | to handle other      |
554	   |           +----------+     | encodings            |
555	   |                 ^          |                      |
556	   +-----------------|----------|----------------------+
557	       DNS protocol: |          |
558	                 ACE |          |
559	                     v          v
560	          +-------------+    +---------------------+
561	          | DNS servers |    | Application servers |
562	          +-------------+    +---------------------+

564	   The box labeled "Application" is where the application splits a
565	   domain name into labels, sets the appropriate flags, and performs the
566	   ToASCII and ToUnicode operations.  This is described in section 4.

568	6.1 Entry and display in applications

570	   Applications can accept domain names using any character set or sets
571	   desired by the application developer, and can display domain names in
572	   any charset.  That is, the IDNA protocol does not affect the
573	   interface between users and applications.

575	   An IDNA-aware application can accept and display internationalized
576	   domain names in two formats: the internationalized character set(s)
577	   supported by the application, and as an ACE label.  ACE labels that
578	   are displayed or input MUST always include the ACE prefix.
579	   Applications MAY allow input and display of ACE labels, but are not
580	   encouraged to do so except as an interface for special purposes,
581	   possibly for debugging, or to cope with display limitations as
582	   described in section 6.4.  ACE encoding is opaque and ugly, and
583	   should thus only be exposed to users who absolutely need it.  Because
584	   name labels encoded as ACE name labels can be rendered either as the
585	   encoded ASCII characters or the proper decoded characters, the
586	   application MAY have an option for the user to select the preferred
587	   method of display; if it does, rendering the ACE SHOULD NOT be the
588	   default.

590	   Domain names are often stored and transported in many places.  For
591	   example, they are part of documents such as mail messages and web
592	   pages.  They are transported in many parts of many protocols, such as
593	   both the control commands and the RFC 2822 body parts of SMTP, and
594	   the headers and the body content in HTTP.  It is important to
595	   remember that domain names appear both in domain name slots and in
596	   the content that is passed over protocols.

598	   In protocols and document formats that define how to handle
599	   specification or negotiation of charsets, labels can be encoded in
600	   any charset allowed by the protocol or document format.  If a
601	   protocol or document format only allows one charset, the labels MUST
602	   be given in that charset.

604	   In any place where a protocol or document format allows transmission
605	   of the characters in internationalized labels, internationalized
606	   labels SHOULD be transmitted using whatever character encoding and
607	   escape mechanism that the protocol or document format uses at that
608	   place.

610	   All protocols that use domain name slots already have the capacity
611	   for handling domain names in the ASCII charset.  Thus, ACE labels
612	   (internationalized labels that have been processed with the ToASCII
613	   operation) can inherently be handled by those protocols.

615	   Displaying internationalized characters can be tricky for
616	   applications regardless of whether the characters appear in free
617	   text, in domain names, or in other protocol elements.  The Unicode
618	   standard encompasses many types of text that can cause display
619	   problems, such as formatting characters, characters that combine with
620	   one or more surrounding characters, characters whose direction of
621	   display can change, strings whose logical order cannot be uniquely
622	   inferred from their display order, and so on.  IDNA requires the use
623	   of Nameprep, which mitigates some of these issues, both in individual
624	   domain labels and to a lesser extent in full domain names, but does
625	   not eliminate all the issues (and does nothing to mitigate them in
626	   text outside of domain names).

628	6.2 Applications and resolver libraries

630	   Applications normally use functions in the operating system when they
631	   resolve DNS queries.  Those functions in the operating system are
632	   often called "the resolver library", and the applications communicate
633	   with the resolver libraries through a programming interface (API).

635	   Because these resolver libraries today expect only domain names in
636	   ASCII, applications MUST prepare labels that are passed to the
637	   resolver library using the ToASCII operation.  Labels received from
638	   the resolver library contain only ASCII characters; internationalized
639	   labels that cannot be represented directly in ASCII use the ACE form.
640	   ACE labels always include the ACE prefix.
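
   One way an application might prepare a name before calling the
   resolver library is sketched below.  The sketch is illustrative
   only: it assumes the ToASCII helper in Python's standard
   encodings.idna module (which does not apply UseSTD3ASCIIRules),
   splits only on U+002E, and uses the hypothetical function name
   to_resolver_form.

```python
from encodings import idna  # stdlib ToASCII helper; an assumption of this sketch


def to_resolver_form(domain: str) -> str:
    """Apply ToASCII to each label so that only ASCII reaches the resolver."""
    labels = domain.split(".")
    converted = []
    for index, label in enumerate(labels):
        if label == "" and index == len(labels) - 1:
            # Keep a trailing dot (the root label) as it is.
            converted.append("")
            continue
        # ToASCII raises UnicodeError for empty, overlong, or invalid labels.
        converted.append(idna.ToASCII(label).decode("ascii"))
    return ".".join(converted)


print(to_resolver_form("b\u00fccher.example"))
```

   The resulting string can be passed to an ASCII-only interface such
   as socket.getaddrinfo.  Labels that are already ASCII pass through
   unchanged.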

642	   An operating system might have a set of libraries for performing the
643	   ToASCII operation.  The input to such a library might be in one or
644	   more charsets that are used in applications (UTF-8 and UTF-16 are
645	   likely candidates for almost any operating system, and script-
646	   specific charsets are likely for localized operating systems).

648	   IDNA-aware applications MUST be able to work with both non-
649	   internationalized labels (those that conform to [STD13] and [STD3])
650	   and internationalized labels.

652	   It is expected that new versions of the resolver libraries in the
653	   future will be able to accept domain names in other charsets than
654	   ASCII, and application developers might one day pass not only domain
655	   names in Unicode, but also in local script to a new API for the
656	   resolver libraries in the operating system.  Thus the ToASCII and
657	   ToUnicode operations might be performed inside these new versions of
658	   the resolver libraries.

660	   Domain names passed to resolvers or put into the question section of
661	   DNS requests follow the rules for "queries" from [STRINGPREP].

663	6.3 DNS servers

665	   Domain names stored in zones follow the rules for "stored strings"
666	   from [STRINGPREP].

668	   For internationalized labels that cannot be represented directly in
669	   ASCII, DNS servers MUST use the ACE form produced by the ToASCII
670	   operation.  All IDNs served by DNS servers MUST contain only ASCII
671	   characters.

673	   If a signaling system which makes negotiation possible between old
674	   and new DNS clients and servers is standardized in the future, the
675	   encoding of the query in the DNS protocol itself can be changed from
676	   ACE to something else, such as UTF-8.  The question whether or not
677	   this should be used is, however, a separate problem and is not
678	   discussed in this memo.

680	6.4 Avoiding exposing users to the raw ACE encoding

682	   Any application that might show the user a domain name obtained from
683	   a domain name slot, such as from gethostbyaddr or part of a mail
684	   header, will need to be updated if it is to prevent users from seeing
685	   the ACE.

687	   If an application decodes an ACE name using ToUnicode but cannot show
688	   all of the characters in the decoded name, such as if the name
689	   contains characters that the output system cannot display, the
690	   application SHOULD show the name in ACE format (which always includes
691	   the ACE prefix) instead of displaying the name with the replacement
692	   character (U+FFFD).  This is to make it easier for the user to
693	   transfer the name correctly to other programs.  Programs that by
694	   default show the ACE form when they cannot show all the characters in
695	   a name label SHOULD also have a mechanism to show the name that is
696	   produced by the ToUnicode operation with as many characters as
697	   possible and replacement characters in the positions where characters
698	   cannot be displayed.
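
   The display rule above might be sketched as follows.  The callback
   can_display is hypothetical and stands for whatever test the output
   system offers for a single character.  The sketch assumes the
   ToUnicode helper in Python's standard encodings.idna module, which
   raises UnicodeError instead of returning its input on failure.

```python
from encodings import idna  # stdlib ToUnicode helper; an assumption of this sketch
from typing import Callable

REPLACEMENT = "\ufffd"


def decode_label(label: str) -> str:
    """Decode with ToUnicode, falling back to the input on any failure."""
    try:
        return idna.ToUnicode(label)
    except UnicodeError:
        return label


def display_label(label: str, can_display: Callable[[str], bool]) -> str:
    """Default view: the decoded form if fully displayable, else the ACE form."""
    decoded = decode_label(label)
    if all(can_display(ch) for ch in decoded):
        return decoded
    return label


def lossy_label(label: str, can_display: Callable[[str], bool]) -> str:
    """Optional view: decoded form with U+FFFD where a character cannot be shown."""
    decoded = decode_label(label)
    return "".join(ch if can_display(ch) else REPLACEMENT for ch in decoded)
```

   With an ASCII-only display, display_label keeps "xn--bcher-kva" in
   ACE form, so that the user can still transfer the name correctly to
   other programs.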

700	   The ToUnicode operation does not alter labels that are not valid ACE
701	   labels, even if they begin with the ACE prefix.  After ToUnicode has
702	   been applied, if a label still begins with the ACE prefix, then it is
703	   not a valid ACE label, and is not equivalent to any of the
704	   intermediate Unicode strings constructed by ToUnicode.

706	6.5 DNSSEC authentication of IDN domain names

708	   DNS Security [RFC2535] is a method for supplying cryptographic
709	   verification information along with DNS messages.  Public Key
710	   Cryptography is used in conjunction with digital signatures to
711	   provide a means for a requester of domain information to authenticate
712	   the source of the data.  This ensures that it can be traced back to a
713	   trusted source, either directly, or via a chain of trust linking the
714	   source of the information to the top of the DNS hierarchy.

716	   IDNA specifies that all internationalized domain names served by DNS
717	   servers that cannot be represented directly in ASCII must use the ACE
718	   form produced by the ToASCII operation.  This operation must be
719	   performed prior to a zone being signed by the private key for that
720	   zone.  Because of this ordering, it is important to recognize that
721	   DNSSEC authenticates the ASCII domain name, not the Unicode form or
722	   the mapping between the Unicode form and the ASCII form.  In the
723	   presence of DNSSEC, this is the name that MUST be signed in the zone
724	   and MUST be validated against.

726	   One consequence of this for sites deploying IDNA in the presence of
727	   DNSSEC is that any special purpose proxies or forwarders used to
728	   transform user input into IDNs must be earlier in the resolution flow
729	   than DNSSEC authenticating nameservers for DNSSEC to work.

731	7. Name server considerations

733	   Existing DNS servers do not know the IDNA rules for handling non-
734	   ASCII forms of IDNs, and therefore need to be shielded from them.
735	   All existing channels through which names can enter a DNS server
736	   database (for example, master files [STD13] and DNS update messages
737	   [RFC2136]) are IDN-unaware because they predate IDNA, and therefore
738	   requirement 2 of section 3.1 of this document provides the needed
739	   shielding, by ensuring that internationalized domain names entering
740	   DNS server databases through such channels have already been
741	   converted to their equivalent ASCII forms.

743	   It is imperative that there be only one ASCII encoding for a
744	   particular domain name.  Because of the design of the ToASCII and
745	   ToUnicode operations, there are no ACE labels that decode to ASCII
746	   labels, and therefore name servers cannot contain multiple ASCII
747	   encodings of the same domain name.

749	   [RFC2181] explicitly allows domain labels to contain octets beyond
750	   the ASCII range (0..7F), and this document does not change that.
751	   Note, however, that there is no defined interpretation of octets
752	   80..FF as characters.  If labels containing these octets are returned
753	   to applications, unpredictable behavior could result.  The ASCII form
754	   defined by ToASCII is the only standard representation for
755	   internationalized labels in the current DNS protocol.

757	8. Root server considerations

759	   IDNs are likely to be somewhat longer than current domain names, so
760	   the bandwidth needed by the root servers is likely to go up by a
761	   small amount.  Also, queries and responses for IDNs will probably be
762	   somewhat longer than typical queries today, so more queries and
763	   responses may be forced to go to TCP instead of UDP.

765	9. References

767	9.1 Normative References

769	   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
770	                Requirement Levels", BCP 14, RFC 2119, March 1997.

772	   [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
773	                Internationalized Strings ("stringprep")",
774	                draft-hoffman-rfc3454bis.

776	   [NAMEPREP]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
777	                Profile for Internationalized Domain Names (IDN)",
778	                draft-hoffman-rfc3491bis.

780	   [PUNYCODE]   Costello, A., "Punycode: A Bootstring encoding of
781	                Unicode for use with Internationalized Domain Names in
782	                Applications (IDNA)", draft-costello-rfc3492bis.

784	   [STD3]       Braden, R., "Requirements for Internet Hosts --
785	                Communication Layers", STD 3, RFC 1122, and
786	                "Requirements for Internet Hosts -- Application and
787	                Support", STD 3, RFC 1123, October 1989.

789	   [STD13]      Mockapetris, P., "Domain names - concepts and
790	                facilities", STD 13, RFC 1034 and "Domain names -
791	                implementation and specification", STD 13, RFC 1035,
792	                November 1987.

794	9.2 Informative References

796	   [IESG-STATEMENT] "IESG Statement on IDN", February 2003,
797	                <http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt>.

799	   [RFC2535]    Eastlake, D., "Domain Name System Security Extensions",
800	                RFC 2535, March 1999.

802	   [RFC2181]    Elz, R. and R. Bush, "Clarifications to the DNS
803	                Specification", RFC 2181, July 1997.

805	   [UNICODE]    The Unicode Consortium. The Unicode Standard, Version
806	                3.2.0 is defined by The Unicode Standard, Version 3.0
807	                (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
808	                as amended by the Unicode Standard Annex #27: Unicode
809	                3.1 (http://www.unicode.org/reports/tr27/) and by the
810	                Unicode Standard Annex #28: Unicode 3.2
811	                (http://www.unicode.org/reports/tr28/).

813	   [USASCII]    Cerf, V., "ASCII format for Network Interchange", RFC
814	                20, October 1969.

816	10. Security Considerations

818	   Security on the Internet partly relies on the DNS.  Thus, any change
819	   to the characteristics of the DNS can change the security of much of
820	   the Internet.

822	   This memo describes an algorithm which encodes characters that are
823	   not valid according to STD3 and STD13 into octet values that are
824	   valid.  No security issues such as string length increases or new
825	   allowed values are introduced by the encoding process or the use of
826	   these encoded values, apart from those introduced by the ACE encoding
827	   itself.

829	   Domain names are used by users to identify and connect to Internet
830	   servers.  The security of the Internet is compromised if a user
831	   entering a single internationalized name is connected to different
832	   servers based on different interpretations of the internationalized
833	   domain name.

835	   When systems use local character sets other than ASCII and Unicode,
836	   this specification leaves the problem of transcoding between the
837	   local character set and Unicode up to the application.  If different
838	   applications (or different versions of one application) implement
839	   different transcoding rules, they could interpret the same name
840	   differently and contact different servers.  This problem is not
841	   solved by security protocols like TLS that do not take local
842	   character sets into account.

844	   Because this document normatively refers to [NAMEPREP], [PUNYCODE],
845	   and [STRINGPREP], it includes the security considerations from those
846	   documents as well.

848	   If or when this specification is updated to use a more recent Unicode
849	   normalization table, the new normalization table will need to be
850	   compared with the old to spot backwards incompatible changes.  If
851	   there are such changes, they will need to be handled somehow, or
852	   there will be security as well as operational implications.  Methods
853	   to handle the conflicts could include keeping the old normalization,
854	   or taking care of the conflicting characters by operational means, or
855	   some other method.

857	   Implementations MUST NOT use more recent normalization tables than
858	   the one referenced from this document, even though more recent tables
859	   may be provided by operating systems.  If an application is unsure of
860	   which version of the normalization tables is in the operating
861	   system, the application needs to include the normalization tables
862	   itself.  Using normalization tables other than the one referenced
863	   from this specification could have security and operational
864	   implications.

866	   To help prevent confusion between characters that are visually
867	   similar, it is suggested that implementations provide visual
868	   indications where a domain name contains multiple scripts.  Such
869	   mechanisms can also be used to show when a name contains a mixture of
870	   simplified and traditional Chinese characters, or to distinguish zero
871	   and one from O and l.  DNS zone administrators may impose restrictions
872	   (subject to the limitations in section 2) that try to minimize
873	   homographs.

875	   Domain names (or portions of them) are sometimes compared against a
876	   set of privileged or anti-privileged domains.  In such situations it
877	   is especially important that the comparisons be done properly, as
878	   specified in section 3.1 requirement 4.  For labels already in ASCII
879	   form, the proper comparison reduces to the same case-insensitive
880	   ASCII comparison that has always been used for ASCII labels.
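
   A comparison of this kind might be sketched as below: two labels
   match if their ToASCII forms are equal under a case-insensitive
   ASCII comparison.  The sketch assumes the ToASCII helper in Python's
   standard encodings.idna module, and the name labels_match is
   hypothetical.

```python
from encodings import idna  # stdlib ToASCII helper; an assumption of this sketch


def labels_match(first: str, second: str) -> bool:
    """Compare two labels by their ASCII forms, ignoring ASCII case."""
    # ToASCII returns bytes and raises UnicodeError for invalid labels.
    return idna.ToASCII(first).lower() == idna.ToASCII(second).lower()


print(labels_match("B\u00fccher", "xn--bcher-kva"))  # same label, two forms
print(labels_match("bucher", "b\u00fccher"))         # different labels
```

   A filter that compares raw strings instead would treat the Unicode
   and ACE forms of the same label as different names.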

882	   The introduction of IDNA means that any existing labels that start
883	   with the ACE prefix and would be altered by ToUnicode will
884	   automatically be ACE labels, and will be considered equivalent to
885	   non-ASCII labels, whether or not that was the intent of the zone
886	   administrator or registrant.

888	11. IANA Considerations

890	   IANA has assigned the ACE prefix "xn--" in consultation with the
891	   IESG.

893	12. Authors' Addresses

895	   Patrik Faltstrom
896	   Cisco Systems
897	   Arstaangsvagen 31 J
898	   S-117 43 Stockholm  Sweden

900	   EMail: paf@cisco.com

902	   Paul Hoffman
903	   Internet Mail Consortium and VPN Consortium
904	   127 Segre Place
905	   Santa Cruz, CA  95060  USA

907	   EMail: phoffman@imc.org

909	   Adam M. Costello
910	   University of California, Berkeley

912	   URL: http://www.nicemice.net/amc/

914	A. Changes from RFC 3490

916	This document is a revision of RFC 3490. None of the changes affect the
917	protocol described in RFC 3490; that is, all implementations of RFC 3490
918	will be identical with implementations of the specification in this
919	document. The items that have changed from RFC 3490 are:

921	- The last line of section 1 has a grammatical fix (user's -> users').

923	- Added a note in section 1 about the IESG statement on IDNA, and
924	  added a reference to it.

926	- In section 3.1 rule 3, fixed spelling of "unintelligle" to
927	  "unintelligible".

929	- In step 8 of section 4.1, added "(0 is excluded)" to clarify.

931	- In section 4.2, the first sentence of the third paragraph was
932	  incorrect. It has been replaced with a sentence that is both
933	  correct and more descriptive.

935	- Added "ToUnicode consists of the following steps:" before the steps
936	  in section 4.2.

938	- Changed wording of step 1 of section 4.2 to match the wording in section
939	  4.1 (the result is identical).

941	- Added the last paragraph in section 6.1 to acknowledge that some Unicode
942	  display issues are tricky, but they are not specific to IDNA.

944	- The sentence in section 11 now says the sequence that was chosen.