idnits 2.17.1 

draft-iab-idn-encoding-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 380: '...      Protocols MUST be able to use th...'
     RFC 2119 keyword, line 383: '... for all text.  Protocols MAY specify,...'
     RFC 2119 keyword, line 393: '...      support MUST be possible....'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (May 14, 2010) is 5096 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '10646' on line 382

  == Missing Reference: 'BCP9' is mentioned on line 387, but not defined

  == Outdated reference: A later version (-15) exists of
     draft-cheshire-dnsext-multicastdns-11

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-punycode-00

  == Outdated reference: A later version (-06) exists of
     draft-skwan-utf8-dns-00

  -- Obsolete informational reference (is this intentional?): RFC  821
     (Obsoleted by RFC 2821)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          D. Thaler
3	Internet-Draft                                                 Microsoft
4	Intended status: Informational                                J. Klensin
5	Expires: November 15, 2010
6	                                                             S. Cheshire
7	                                                                   Apple
8	                                                            May 14, 2010

10	      IAB Thoughts on Encodings for Internationalized Domain Names
11	                     draft-iab-idn-encoding-02.txt

13	Abstract

15	   This document explores issues with Internationalized Domain Names
16	   (IDNs) that result from the use of various encoding schemes such as
17	   UTF-8 and the ASCII-Compatible Encoding produced by the Punycode
18	   algorithm.  It focuses on the importance of agreeing on a canonical
19	   format and how complicated it ends up being as a result of using
20	   different encodings today.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on November 15, 2010.

39	Copyright Notice

41	   Copyright (c) 2010 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	     1.1.  APIs
58	   2.  Use of Non-DNS Protocols
59	   3.  Use of Non-ASCII in DNS
60	     3.1.  Examples
61	   4.  Recommendations
62	   5.  Acknowledgements
63	   6.  Security Considerations
64	   7.  IANA Considerations
65	   8.  IAB Members at the time of publication
66	   9.  References
67	     9.1.  Normative References
68	     9.2.  Informative References
69	   Authors' Addresses

71	1.  Introduction

73	   The goal of this document is to explore what can be learned from some
74	   current difficulties in implementing Internationalized Domain Names
75	   (IDNs).

77	   A domain name consists of a set of labels, conventionally written
78	   separated with dots.  An Internationalized Domain Name (IDN) is a
79	   domain name that contains one or more labels that, in turn, contain
80	   one or more non-ASCII characters.  Just as with plain ASCII domain
81	   names, each IDN label must be encoded using some mechanism before it
82	   can be transmitted in network packets, stored in memory, stored on
83	   disk, etc.  These encodings need to be reversible, but they need not
84	   store domain names the same way humans conventionally write them on
85	   paper.  For example, when transmitted over the network in DNS
86	   packets, domain name labels are *not* separated with dots.
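
   As an illustration of the last point, the following Python sketch
   shows how a dotted name might be turned into the length-prefixed
   label sequence carried in DNS packets.  The function name to_wire()
   and its "encoding" parameter are assumptions made for this example
   only; the sketch performs no checks beyond the label length.

```python
def to_wire(name: str, encoding: str = "ascii") -> bytes:
    """Encode a dotted name as length-prefixed labels (illustrative only)."""
    if name in ("", "."):
        return b"\x00"              # the root is a single zero-length label
    wire = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode(encoding)
        if not 1 <= len(raw) <= 63:
            raise ValueError("label must be 1 to 63 octets long")
        wire.append(len(raw))       # a length octet takes the place of the dot
        wire += raw
    wire.append(0)                  # zero-length label terminates the name
    return bytes(wire)


print(to_wire("www.example.com"))   # b'\x03www\x07example\x03com\x00'
```

   Note that no "." octet appears in the result: the dots exist only
   in the written form of the name.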

88	   IDNA, discussed later in this document, is the standard that defines
89	   the use and coding of internationalized domain names for use on the
90	   public Internet.  It is described as "Internationalizing Domain Names
91	   in Applications (IDNA)" and is defined in several documents.
92	   Definitions for the current version and a roadmap of related
93	   documents appear in [IDNA2008-Defs].  An earlier version of IDNA
94	   [RFC3490] is now being phased out.  Except where noted, the two
95	   versions are approximately the same with regard to the issues
96	   discussed in this document.  However, some explanations appeared in
97	   the earlier documents that did not seem useful when the revision was
98	   created; they are quoted here from the documents in which they
99	   appear.  In addition, the terminology of the two versions differs
100	   somewhat; this document reflects the terminology of the current
101	   version.

103	   Unicode [Unicode] is a list of characters (including non-spacing
104	   marks that are used to form some other characters), where each
105	   character is assigned an integer value, called a code point.  In
106	   simple terms a Unicode string is a string of integer code point
107	   values in the range 0 to 1,114,111 (10FFFF in base 16), which
108	   represent a string of Unicode characters.  These integer code points
109	   must be encoded using some mechanism before they can be transmitted
110	   in network packets, stored in memory, stored on disk, etc.  Some
111	   common ways of encoding these integer code point values in computer
112	   systems include UTF-8, UTF-16, and UTF-32.  In addition to the
113	   material below, those forms and the tradeoffs among them are
114	   discussed in Chapter 2 of The Unicode Standard [Unicode].

116	   UTF-8 [RFC3629] is a mechanism for encoding a Unicode code point in a
117	   variable number of 8-bit octets, where an ASCII code point is
118	   preserved as-is.  Those octets encode a string of integer code point
119	   values, which represent a string of Unicode characters.

121	   UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code
122	   point in one or two 16-bit integers, described in detail in Sections
123	   3.9 and 3.10 of The Unicode Standard [Unicode].  A UTF-16 string
124	   encodes a string of integer code point values that represent a string
125	   of Unicode characters.

127	   UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and
128	   3.10, is a mechanism for encoding a Unicode code point in a single
129	   32-bit integer.  A UTF-32 string is thus a string of 32-bit integer
130	   code point values, which represent a string of Unicode characters.

132	   Note that UTF-16 results in some all-zero octets when code points
133	   occur early in the Unicode sequence, and UTF-32 always has all-zero
134	   octets.
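
   The differences among these three encoding forms can be seen in a
   short Python sketch.  The sample string is an arbitrary choice for
   illustration, and big-endian byte order is assumed for UTF-16 and
   UTF-32 so that the octets are easy to read.

```python
text = "a\u00e9\u20ac"               # 'a', U+00E9, and U+20AC

utf8 = text.encode("utf-8")         # 1, 2, and 3 octets respectively
utf16 = text.encode("utf-16-be")    # one 16-bit unit per code point here
utf32 = text.encode("utf-32-be")    # one 32-bit unit per code point

print(utf8.hex(" "))    # 61 c3 a9 e2 82 ac
print(utf16.hex(" "))   # 00 61 00 e9 20 ac
print(utf32.hex(" "))   # 00 00 00 61 00 00 00 e9 00 00 20 ac

# Only the UTF-16 and UTF-32 forms of this string contain all-zero octets.
print(0 in utf8, 0 in utf16, 0 in utf32)   # False True True
```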

136	   IDNA specifies validity of a label, such as what characters it can
137	   contain, relationships among them, and so on, in Unicode terms.
138	   Valid labels can take either of two forms, with the appropriate one
139	   determined by particular protocols or by context.  One of those
140	   forms, called a U-label, is a direct representation of the Unicode
141	   characters using one of the encoding forms discussed above.  This
142	   document discusses UTF-8 strings in many places.  While all U-labels
143	   can be represented by UTF-8 strings, not all UTF-8 strings are valid
144	   U-labels (see Section 2.3.2 of [IDNA2008-Defs] for a discussion of
145	   these distinctions).  The other, called an A-label, uses a
146	   compressed, ASCII-compatible encoding (an "ACE" in IDNA and other
147	   terminology) produced by an algorithm called Punycode.  U-labels and
148	   A-labels are duals of each other: transformations from one to the
149	   other do not lose information.  The transformation mechanisms are
150	   specified in [IDNA2008-Protocol].

152	   Punycode [RFC3492] is thus a mechanism for encoding a Unicode string
153	   in an ASCII-compatible encoding, i.e., using only letters, digits,
154	   and hyphens from the ASCII character set.  When a Unicode label that
155	   is valid under the IDNA rules (a U-label) is encoded with Punycode
156	   for IDNA purposes, it is prefixed with "xn--"; the result is called
157	   an A-label.  The prefix convention assumes that no other DNS labels
158	   (at least no other DNS labels in IDNA-aware applications) are allowed
159	   to start with these four characters.  Consequently, when A-label
160	   encoding is assumed, any DNS labels beginning with "xn--" now have a
161	   different meaning (the Punycode encoding of a label containing one or
162	   more non-ASCII characters) or no defined meaning at all (in the case
163	   of labels that are not IDNA-compliant, i.e., are not well-formed
164	   A-labels).
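
   A minimal sketch of the prefix convention, using the "punycode"
   codec that ships with Python, might look as follows.  The helper
   names to_ace() and from_ace() are hypothetical, and the sketch
   assumes its input has already been validated under the IDNA rules;
   it performs none of the checks that make a string a real U-label or
   A-label.

```python
ACE_PREFIX = "xn--"


def to_ace(label: str) -> str:
    """Punycode-encode one label and add the ACE prefix.

    Assumption: the label has already been validated and mapped under the
    IDNA rules; this sketch performs none of those checks.
    """
    if label.isascii():
        return label                     # nothing to encode
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label: str) -> str:
    """Reverse to_ace() for labels that carry the ACE prefix."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(to_ace("b\u00fccher"))                             # xn--bcher-kva
print(from_ace("xn--bcher-kva") == "b\u00fccher")        # True
```

   The round trip loses no information, which is the sense in which
   the two forms are duals of each other.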

166	   ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII
167	   and Japanese characters, where an ASCII character is preserved as-is.
168	   ISO-2022-JP is stateful: special sequences are used to switch between
169	   character coding tables.

171	   Comparing Unicode strings is not as easy as comparing, for example,
172	   ASCII strings.  First, there are a multitude of ways of representing
173	   a string of Unicode characters.  Second, in many languages and
174	   scripts, the actual definition of "same" is very context-dependent.
175	   Because of this, comparison of two Unicode strings must take into
176	   account how the Unicode strings are encoded.  Regardless of the
177	   encoding, however, comparison cannot simply be done by comparing the
178	   encoded Unicode strings byte by byte.  The only time that is possible
179	   is when both strings are mapped into some canonical format and
180	   encoded the same way.
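
   The following Python sketch illustrates why byte-by-byte comparison
   fails without a canonical format.  Unicode Normalization Form C
   (NFC) is used here purely as an assumed example of a canonical
   format; this document does not select one, and the helper name
   canonical_octets() is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"       # U+00E9, a single precomposed code point
decomposed = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def canonical_octets(text: str) -> bytes:
    """Map to one canonical form (NFC assumed), then encode one way (UTF-8)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Same characters to a human reader, different octets on the wire.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))      # False

# After mapping and identical encoding, the octets can be compared.
print(canonical_octets(composed) == canonical_octets(decomposed))  # True
```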

182	   This document focuses on the importance of agreeing on a canonical
183	   format and how complicated it ends up being as a result of using
184	   different encodings today.

186	   Different applications, APIs, and protocols use different encoding
187	   schemes today.  Historically, many of them were originally defined to
188	   use only ASCII.  Internationalizing Domain Names in Applications
189	   (IDNA) [IDNA2008-Defs] defined a mechanism that required changes to
190	   applications, but in an attempt not to change APIs or servers,
191	   specified that the A-label format is to be used in many contexts.  In
192	   some ways this could be seen as not changing the existing APIs, in
193	   the sense that the strings being passed to and from the APIs were
194	   still apparently ASCII strings.  In other ways it was a very profound
195	   change to the existing APIs, because while those strings were still
196	   syntactically valid ASCII strings, they no longer meant the same
197	   thing as they used to.  What looked like a plain ASCII string to one
198	   piece of software or library could be seen by another piece of
199	   software or library (with the application of out-of-band information)
200	   to be in fact an encoding of a Unicode string.

202	   Section 1.3 of the original IDNA specification [RFC3490] states:

204	      The IDNA protocol is contained completely within applications.  It
205	      is not a client-server or peer-to-peer protocol: everything is
206	      done inside the application itself.  When used with a DNS resolver
207	      library, IDNA is inserted as a "shim" between the application and
208	      the resolver library.  When used for writing names into a DNS
209	      zone, IDNA is used just before the name is committed to the zone.

211	   Figure 1 depicts a simplistic architecture that a naive reader might
212	   assume from the paragraph quoted above.  (A variant of this same
213	   picture appears in Section 6 of the IDNA specification [RFC3490],
214	   further strengthening this assumption.)
215	    +-----------------------------------------+
216	    |Host                                     |
217	    |             +-------------+             |
218	    |             | Application |             |
219	    |             +------+------+             |
220	    |                    |                    |
221	    |               +----+----+               |
222	    |               |   DNS   |               |
223	    |               | Resolver|               |
224	    |               | Library |               |
225	    |               +----+----+               |
226	    |                    |                    |
227	    +-----------------------------------------+
228	                         |
229	                _________|_________
230	               /                   \
231	              /                     \
232	             /                       \
233	            |         Internet        |
234	             \                       /
235	              \                     /
236	               \___________________/

238	                          Simplistic Architecture

240	                                 Figure 1

242	   There are, however, two problems with this simplistic architecture
243	   that cause it to differ from reality.

245	   First, resolver APIs on Operating Systems (OSs) today (MacOS,
246	   Windows, Linux, etc.) are not DNS-specific.  They typically provide a
247	   layer of indirection so that the application can work independent of
248	   the name resolution mechanism, which could be DNS, mDNS
249	   [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP
250	   [RFC1001][RFC1002], /etc/hosts file [RFC0952], NIS [NIS], or anything
251	   else.  For example, "Basic Socket Interface Extensions for IPv6"
252	   [RFC3493] specifies the getaddrinfo() API and contains many phrases
253	   like "For example, when using the DNS" and "any type of name
254	   resolution service (for example, the DNS)".  Importantly, DNS is
255	   mentioned only as an example, and the application has no knowledge as
256	   to whether DNS or some other protocol will be used.

258	   Second, even with the DNS protocol, private name spaces (sometimes
259	   including private uses of the DNS) do not necessarily use the same
260	   character set encoding scheme as the public Internet name space.

262	   We will discuss each of the above issues in subsequent sections.  For
263	   reference, Figure 2 depicts a more realistic architecture on typical
264	   hosts today (which don't have IDNA inserted as a shim immediately
265	   above the DNS resolver library).  More generally, the host may be
266	   attached to one or more local networks, each of which may or may not
267	   be connected to the public Internet and may or may not have a private
268	   name space.

270	    +-----------------------------------------+
271	    |Host                                     |
272	    |             +-------------+             |
273	    |             | Application |             |
274	    |             +------+------+             |
275	    |                    |                    |
276	    |             +------+------+             |
277	    |             |   Generic   |             |
278	    |             |    Name     |             |
279	    |             |  Resolution |             |
280	    |             |     API     |             |
281	    |             +------+------+             |
282	    |                    |                    |
283	    |   +-----+------+---+--+-------+-----+   |
284	    |   |     |      |      |       |     |   |
285	    | +-+-++--+--++--+-++---+---++--+--++-+-+ |
286	    | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
287	    | +---++-----++----++-------++-----++---+ |
288	    |                                         |
289	    +-----------------------------------------+
290	                         |
291	                   ______|______
292	                  /             \
293	                 /               \
294	                /      local      \
295	                \     network     /
296	                 \               /
297	                  \_____________/
298	                         |
299	                _________|_________
300	               /                   \
301	              /                     \
302	             /                       \
303	            |         Internet        |
304	             \                       /
305	              \                     /
306	               \___________________/

308	                          Realistic Architecture

310	                                 Figure 2

312	1.1.  APIs

314	   Section 6.2 of the original IDNA specification [RFC3490] states
315	   (where ToASCII and ToUnicode below refer to conversions using the
316	   Punycode algorithm):

318	      It is expected that new versions of the resolver libraries in the
319	      future will be able to accept domain names in other charsets than
320	      ASCII, and application developers might one day pass not only
321	      domain names in Unicode, but also in local script to a new API for
322	      the resolver libraries in the operating system.  Thus the ToASCII
323	      and ToUnicode operations might be performed inside these new
324	      versions of the resolver libraries.

326	   Resolver APIs such as getaddrinfo() and its predecessor
327	   gethostbyname() were defined to accept "char *" arguments, meaning
328	   they accept a string of bytes, terminated with a NULL (0) byte.
329	   Because of the use of a NULL octet as a string terminator, this is
330	   sufficient for ASCII strings (including A-labels) and even
331	   ISO-2022-JP and UTF-8 strings (unless an implementation artificially
332	   precludes them), but not UTF-16 or UTF-32 strings because a NULL
333	   octet could appear in the middle of strings using these
334	   encodings.  Several operating systems historically used in Japan will
335	   accept (and expect) ISO-2022-JP strings in such APIs.  Some platforms
336	   used worldwide also have new versions of the APIs (e.g.,
337	   GetAddrInfoW() on Windows) that accept other encoding schemes such as
338	   UTF-16.

340	   It is worth noting that an API using "char *" arguments can
341	   distinguish between conventional ASCII "host name" labels, A-labels,
342	   ISO-2022-JP, and UTF-8 labels in names if the coding is known to be
343	   one of those four.  An example method is as follows:
344	   o  if the label contains an ESC (0x1B) byte the label is ISO-2022-JP;
345	      otherwise,
346	   o  if any byte in the label has the high bit set, the label is UTF-8;
347	      otherwise,
348	   o  if the label starts with "xn--" then it is presumed to be an
349	      A-label; otherwise,
350	   o  the label is ASCII.
351	   Again, this assumes that neither ASCII labels nor UTF-8 strings ever
352	   start with "xn--", and also that UTF-8 strings never contain an ESC
353	   character.  Also, the above is merely an illustration; UTF-8 can be
354	   detected and distinguished from other 8-bit encodings with good
355	   accuracy [MJD].
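
   The example method above can be sketched in Python as follows.  The
   function name guess_label_coding() is hypothetical, and ignoring
   case when testing for the "xn--" prefix is an assumption of this
   sketch.  As stated above, the result is meaningful only when the
   coding is already known to be one of the four listed.

```python
def guess_label_coding(label: bytes) -> str:
    """Apply the four tests in order; valid only for the four codings."""
    if b"\x1b" in label:                       # ESC octet present
        return "ISO-2022-JP"
    if any(octet & 0x80 for octet in label):   # some octet has the high bit set
        return "UTF-8"
    if label[:4].lower() == b"xn--":           # ACE prefix (case ignored here)
        return "A-label"
    return "ASCII"


samples = [
    "\u65e5\u672c".encode("iso2022_jp"),
    "b\u00fccher".encode("utf-8"),
    b"xn--bcher-kva",
    b"example",
]
for sample in samples:
    print(guess_label_coding(sample))
# ISO-2022-JP
# UTF-8
# A-label
# ASCII
```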

357	   It is more difficult or impossible to distinguish the ISO 8859
358	   character sets from each other, because they differ in up to about 90
359	   characters which have exactly the same encodings, and a short string
360	   is very unlikely to contain enough characters to allow a receiver to
361	   deduce the character set.  Similarly, it is not possible in general
362	   to distinguish between ISO-2022-JP and any other encoding based on
363	   ISO 2022 code table switching.

365	   Although it is possible (as in the example above) to distinguish some
366	   encodings when not explicitly specified, it is cleaner to have the
367	   encodings specified explicitly, such as specifying UTF-16 for
368	   GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
369	   strings.

371	2.  Use of Non-DNS Protocols

373	   As noted earlier, typical name resolution libraries are not DNS-
374	   specific.  Furthermore, some protocols are defined to use encoding
375	   forms other than IDNA A-labels.  For example, mDNS
376	   [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used.
377	   Indeed, the IETF policy on character sets and languages [RFC2277]
378	   states:

380	      Protocols MUST be able to use the UTF-8 charset, which consists of
381	      the ISO 10646 coded character set combined with the UTF-8
382	      character encoding scheme, as defined in [10646] Annex R
383	      (published in Amendment 2), for all text.  Protocols MAY specify,
384	      in addition, how to use other charsets or other character encoding
385	      schemes for ISO 10646, such as UTF-16, but lack of an ability to
386	      use UTF-8 is a violation of this policy; such a violation would
387	      need a variance procedure ([BCP9] section 9) with clear and solid
388	      justification in the protocol specification document before being
389	      entered into or advanced upon the standards track.  For existing
390	      protocols or protocols that move data from existing datastores,
391	      support of other charsets, or even using a default other than
392	      UTF-8, may be a requirement.  This is acceptable, but UTF-8
393	      support MUST be possible.

395	   Applications that convert an IDN to A-label form before calling
396	   getaddrinfo() will cause name resolution failures if the Punycode
397	   name is directly used in such protocols.  Having libraries or
398	   protocols convert from A-labels to the encoding scheme defined by
399	   the protocol (e.g., UTF-8) would require changes to APIs and/or
400	   servers, which IDNA was intended to avoid.

402	   As a result, applications that assume that non-ASCII names are
403	   resolved using the public DNS and blindly convert them to A-labels
404	   without knowledge of what protocol will be selected by the name
405	   resolution library have problems.  Furthermore, name resolution
406	   libraries often try multiple protocols until one succeeds, because
407	   they are defined to use a common name space.  For example, the hosts
408	   file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
409	   common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
410	   [RFC1034] section 2.1).  This means that when an application passes a
411	   name to be resolved, resolution may in fact be attempted using
412	   multiple protocols, each with a potentially different encoding
413	   scheme.  For this to work successfully, the name must be converted to
414	   the appropriate encoding scheme only after the choice is made to use
415	   that protocol.  In general, this cannot be done by the application
416	   since the choice of protocol is not made by the application.
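
   The ordering constraint described above, converting the name only
   after a protocol has been chosen, can be sketched in Python as
   follows.  All names here (encode_for(), resolve(), the "dns" and
   "mdns" tags, and the lookup callables) are hypothetical and do not
   correspond to any real resolver API.  Python's built-in "idna"
   codec follows the original IDNA specification [RFC3490] and is used
   only as a convenient stand-in for A-label conversion.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[bytes], Optional[str]]


def encode_for(protocol: str, name: str) -> bytes:
    """Pick the encoded form only once the protocol is known."""
    if protocol == "dns":
        return name.encode("idna")     # A-labels for the public DNS
    if protocol == "mdns":
        return name.encode("utf-8")    # mDNS is specified to use UTF-8
    raise ValueError("no encoding rule for " + protocol)


def resolve(name: str, backends: List[Tuple[str, Lookup]]) -> Optional[str]:
    """Try each protocol in turn, encoding the name separately for each."""
    for protocol, lookup in backends:
        answer = lookup(encode_for(protocol, name))
        if answer is not None:
            return answer
    return None


# Stand-in lookup tables; 192.0.2.1 is a documentation address.
mdns_table = {"b\u00fccher.local".encode("utf-8"): "192.0.2.1"}
backends = [("dns", lambda query: None), ("mdns", mdns_table.get)]
print(resolve("b\u00fccher.local", backends))   # 192.0.2.1
```

   Had the application converted the name to A-label form before
   calling resolve(), the second lookup would have been attempted with
   the wrong octets and would have failed.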

418	3.  Use of Non-ASCII in DNS

420	   A common misconception is that DNS only supports names that can be
421	   expressed using letters, digits, and hyphens.

423	   This misconception originally stemmed from the definition in 1985 of
424	   an "Internet host name" (and net, gateway, and domain name) for use
425	   in the "hosts" file [RFC0952].  An Internet host name was defined
426	   therein as including only letters, digits, and hyphens, where upper
427	   and lower case letters were to be treated as identical.  The DNS
428	   specification [RFC1034] section 3.5 entitled "Preferred name syntax"
429	   then repeated this definition in 1987, saying that this "syntax will
430	   result in fewer problems with many applications that use domain names
431	   (e.g., mail, TELNET)".

433	   Confusion thus remained as to whether the "preferred" name syntax
434	   was a mandatory restriction in DNS, or merely "preferred".

436	   The definition of an Internet host name was updated in 1989
437	   ([RFC1123] section 2.1) to allow names starting with a digit (to
438	   support IPv4 addresses in dotted-decimal form).  Section 6.1 of
439	   "Requirements for Internet Hosts -- Application and Support"
440	   [RFC1123] discusses the use of DNS (and the hosts file) for resolving
441	   host names to IP addresses and vice versa.  This led to confusion as
442	   to whether all names in DNS are "host names", or whether a "host
443	   name" is merely a special case of a DNS name.

445	   By 1997, things had progressed to a state where it was necessary to
446	   clarify these areas of confusion.  "Clarifications to the DNS
447	   Specification" [RFC2181] section 11 states:

449	      The DNS itself places only one restriction on the particular
450	      labels that can be used to identify resource records.  That one
451	      restriction relates to the length of the label and the full name.
452	      The length of any one label is limited to between 1 and 63 octets.
453	      A full domain name is limited to 255 octets (including the
454	      separators).  The zero length full name is defined as representing
455	      the root of the DNS tree, and is typically written and displayed
456	      as ".".  Those restrictions aside, any binary string whatever can
457	      be used as the label of any resource record.  Similarly, any
458	      binary string can serve as the value of any record that includes a
459	      domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
460	      and any others that may be added).  Implementations of the DNS
461	      protocols must not place any restrictions on the labels that can
462	      be used.

464	   Hence, it clarified that the restriction to letters, digits, and
465	   hyphens does not apply to DNS names in general, nor to records that
466	   include "domain names".  Thus the "preferred" name syntax described
467	   in the original DNS specification [RFC1034] is indeed merely
468	   "preferred", not mandatory.
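
   The only restrictions quoted above are on length, which a Python
   sketch can check in a few lines.  The function name check_name() is
   hypothetical.  Counting the 255-octet limit as one length octet per
   label plus the terminating root label is an assumption of this
   sketch about what "including the separators" means.

```python
from typing import Sequence


def check_name(labels: Sequence[bytes]) -> None:
    """Raise ValueError if the quoted length limits are exceeded."""
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("each label must be 1 to 63 octets long")
    # Assumption: the 255-octet limit is counted as in the wire format, with
    # one length octet per label plus the terminating root label.
    if sum(1 + len(label) for label in labels) + 1 > 255:
        raise ValueError("full name exceeds 255 octets")


# Any octets at all are acceptable, including spaces and non-ASCII values.
check_name([b"\x00\xff white space", b"example"])
```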

470	   Since there is no restriction even to ASCII, let alone letter-digit-
471	   hyphen use, DNS is in conformance with the IETF requirement to allow
472	   UTF-8 [RFC2277].

474	   Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
475	   in DNS packets or APIs because existing software already uses ASCII,
476	   and UTF-16 and UTF-32 strings can contain all-zero octets that
477	   existing software may interpret as the end of the string.  To use
478	   UTF-16 or UTF-32 one would need some way of knowing whether the
479	   string was encoded using ASCII, UTF-16, or UTF-32, and indeed for
480	   UTF-16 or UTF-32 whether it was big-endian or little-endian encoding.
481	   In contrast, UTF-8 works well because any 7-bit ASCII string is also
482	   a UTF-8 string representing the same characters.

484	   If a private name space is defined to use UTF-8 (and not other
485	   encodings such as UTF-16 or UTF-32), there's no need for a mechanism
486	   to know whether a string was encoded using ASCII or UTF-8, because
487	   (for any string that can be represented using ASCII) the
488	   representations are exactly the same.  In other words, for any string
489	   that can be represented using ASCII it doesn't matter whether it is
490	   interpreted as ASCII or UTF-8 because both encodings are the same,
491	   and for any string that can't be represented using ASCII, it's
492	   obviously UTF-8.  In addition, unlike UTF-16 and UTF-32, ASCII and
493	   UTF-8 are both byte-oriented encodings so the question of big-endian
494	   or little-endian encoding doesn't apply.
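
   Both properties can be checked directly, as in the following Python
   sketch.  The helper name decode_label() is hypothetical and stands
   for any code that reads labels from a name space defined to use
   UTF-8.

```python
name = "example"

# An ASCII-only string has identical octets in ASCII and in UTF-8.
print(name.encode("ascii") == name.encode("utf-8"))          # True

# UTF-16 octets depend on byte order, so the order must be known as well.
print(name.encode("utf-16-le") == name.encode("utf-16-be"))  # False


def decode_label(raw: bytes) -> str:
    """Decode a label from a name space defined to use UTF-8."""
    return raw.decode("utf-8")    # also correct for pure ASCII input
```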

496	   While implementations of the DNS protocol must not place any
497	   restrictions on the labels that can be used, applications that use
498	   the DNS are free to impose whatever restrictions they like, and many
499	   have.  The above rules permit a domain name label that contains
500	   unusual characters, such as embedded spaces, which many applications
501	   would consider a bad idea.  For example, the SMTP protocol [RFC5321],
502	   going back to the original specification in [RFC0821], constrains
503	   the character set usable in email addresses.  There is now an effort
504	   underway to permit SMTP to support internationalized email addresses
505	   via an extension.

507	   Shortly after the DNS Clarifications [RFC2181] and IETF character
508	   sets and languages policy [RFC2277] were published, the need for
509	   internationalized names within private name spaces (i.e., within
510	   enterprises) arose.  The current (and past, predating IDNA and the
511	   prefixed ACE conventions) practice within enterprises that support
512	   other languages is to put UTF-8 names in their internal DNS servers
513	   in a private name space.  For example, "Using the UTF-8 Character Set
514	   in the Domain Name System" [I-D.skwan-utf8-dns-00] was first written
515	   in 1997, and was then widely deployed in Windows.  The use of UTF-8
516	   names in DNS was similarly implemented and deployed in MacOS, simply
517	   by virtue of the fact that applications blindly passed UTF-8 strings
518	   to the name resolution APIs, and the name resolution APIs blindly
519	   passed those UTF-8 strings to the DNS servers, and the DNS servers
520	   correctly answered those queries, and from the user's point of view
521	   everything worked properly without any special new code being
522	   written, except that ASCII is matched case-insensitively whereas
523	   UTF-8 is not (although some enterprise DNS servers reportedly attempt
524	   to do case-insensitive matching on UTF-8 within private name spaces).
525	   Within a private name space, and especially in light of the IETF
526	   UTF-8 policy [RFC2277], it was reasonable to assume that binary
527	   strings were encoded in UTF-8.

529	   As implied earlier, there are also issues with mapping strings to
530	   some canonical form, independent of the encoding.  Such issues are
531	   not discussed in detail in this document.  They are discussed to some
532	   extent in, for example, Section 3 of [RFC5198], and are left as
533	   opportunities for elaboration in other documents.

535	   Five years after UTF-8 was already in use in private name spaces in
536	   DNS, the strategy of using a reserved prefix and an ASCII-compatible
537	   Encoding (ACE) was developed for IDNA.  That strategy included the
538	   Punycode algorithm, which began to be developed (during the period
539	   from 2002 [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in
540	   the public DNS name space.  One reason the prefixed ACE strategy was
541	   selected for the public DNS name space had to do with concerns about
542	   whether the details of IDNA, including the use of the Punycode
543	   algorithm, were an adequate solution to the problems that were posed.
544	   If either the Punycode algorithm or fundamental aspects of character
545	   handling were wrong, and had to be changed to something incompatible,
546	   it would be possible to switch to a new prefix or adopt another model
547	   entirely.  Only the part of the public DNS namespace that starts a
548	   label with "xn--" would be polluted.

550	   Today the algorithm is seen as being about as good as it can
551	   realistically be, so moving to a different encoding (UTF-8 as
552	   suggested in this document) that can be viewed as "native" would not
553	   be as risky as it would have been in 2002.

555	   In any case, the publication of [RFC3492] and the dependencies on it
556	   in [IDNA2008-Protocol] and the earlier [RFC3490] thus resulted in
557	   having to use different encodings for different name spaces (where
558	   UTF-8 for private name spaces was already deployed).  Hence,
559	   referring back to Figure 2, a different encoding scheme may be in use
560	   on the Internet vs. a local network.

562	   In general a host may be connected to zero or more networks using
563	   private name spaces, plus potentially the public name space.
564	   Applications that convert a U-label form IDN to an A-label before
565	   calling getaddrinfo() will incur name resolution failures if the name
566	   is actually registered in a private name space in some other encoding
567	   (e.g., UTF-8).  Having libraries or protocols convert from A-labels
568	   to the encoding used by a private name space (e.g., UTF-8) would
569	   require changes to APIs and/or servers, which IDNA was intended to
570	   avoid.
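
The mismatch described above can be illustrated with a minimal Python
sketch. It is not taken from any resolver implementation: the helper
names, the dictionary standing in for a private zone, and the address
192.0.2.10 are all hypothetical, and the A-label helper applies only the
Punycode step with the "xn--" prefix, skipping the IDNA mapping and
validity checks a real implementation would need.

```python
# Sketch: the same Unicode label yields different octets as an A-label
# and as UTF-8, so a zone keyed on one form does not match the other.
ACE_PREFIX = b"xn--"


def to_a_label(label):
    """Prefixed ACE form of one label (simplified: Punycode step only)."""
    if label.isascii():
        return label.encode("ascii")
    return ACE_PREFIX + label.encode("punycode")


def to_utf8_label(label):
    """UTF-8 form of one label, as used in many private name spaces."""
    return label.encode("utf-8")


# Hypothetical private zone whose names were registered in UTF-8.
# 192.0.2.10 is a documentation address, not a real host.
u_label = "пример"
private_zone = {to_utf8_label(u_label): "192.0.2.10"}

a_query = to_a_label(u_label)        # sent by an application that converts
utf8_query = to_utf8_label(u_label)  # sent by an application that does not

print(a_query, private_zone.get(a_query))
print(utf8_query, private_zone.get(utf8_query))
```

Only the UTF-8 query finds the record; the A-label query names a
different octet string that was never registered in this zone.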

572	   Also, a fully-qualified domain name (FQDN) to be resolved may be
573	   obtained directly from an application, or it may be composed by the
574	   DNS resolver itself from a single label obtained from an application
575	   by using a configured suffix search list, and the resulting FQDN may
576	   use multiple encodings in different labels.  For more information on
577	   the suffix search list, see section 6 of "Common DNS Implementation
578	   Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
579	   [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
580	   [RFC3646].
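
As a rough illustration of how a suffix search list can produce a name
with mixed encodings, the following Python sketch composes candidate
FQDNs and classifies each label. The function names and the suffix
"испытание.example" are hypothetical, and the policy of searching only
for names that contain no dots is an assumption; real stub resolvers
differ.

```python
def candidate_fqdns(name, search_list):
    """Names a stub resolver might try, in order.

    Assumed policy: the search list is applied only to dotless names.
    """
    if "." in name:
        return [name]
    return [name + "." + suffix for suffix in search_list]


def label_kinds(fqdn):
    """Classify each label by the form it is written in."""
    kinds = []
    for label in fqdn.split("."):
        if label.lower().startswith("xn--"):
            kinds.append("A-label")
        elif label.isascii():
            kinds.append("ASCII")
        else:
            kinds.append("non-ASCII")
    return kinds


# The application already converted its single label to an A-label,
# while the configured suffix of the private name space is not ASCII.
search_list = ["испытание.example"]
for fqdn in candidate_fqdns("xn--e1afmkfd", search_list):
    print(fqdn, label_kinds(fqdn))
```

The resulting name carries an A-label followed by a non-ASCII label, so
no single encoding describes the whole FQDN.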

582	   As noted in [RFC1536] section 6, the community has had bad
583	   experiences with "searching" for domain names by trying multiple
584	   variations or appending different suffixes.  Such searching can yield
585	   inconsistent results depending on the order in which alternatives are
586	   tried.  Nonetheless, the practice is widespread and must be
587	   considered.

589	   The practice of searching for names, whether by the use of a suffix
590	   search list or by searching in different namespaces, can yield
591	   inconsistent results.  For example, even when a suffix search list is
592	   only used when an application provides a name containing no dots, two
593	   clients with different configured suffix search lists can get
594	   different answers, and the same client could get different answers at
595	   different times if it changes its configuration (e.g., when moving to
596	   another network).  A deeper discussion of this topic is outside the
597	   scope of this document.

599	3.1.  Examples

601	   Some examples of cases that can happen in existing implementations
602	   today (where {non-ASCII} below represents some user-entered non-ASCII
603	   string) are:
604	   1.  User types in {non-ASCII}.{non-ASCII}.com, and the application
605	       passes it, in the form of a UTF-8 string, to getaddrinfo or
606	       gethostbyname or equivalent.
607	       *  The DNS resolver passes the (UTF-8) string unmodified to a DNS
608	          server.
609	   2.  User types in {non-ASCII}.{non-ASCII}.com, and the application
610	       passes it to a name resolution API that accepts strings in some
611	       other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
612	       *  The name resolution API decides to pass the string to DNS (and
613	          possibly other protocols).
614	       *  The DNS resolver converts the name from UTF-16 to UTF-8 and
615	          passes the query to a DNS server.
616	   3.  User types in {non-ASCII}.{non-ASCII}.com, but the application
617	       first converts it to A-label form such that the name that is
618	       passed to name resolution APIs is (say)
619	       xn--e1afmkfd.xn--80akhbyknj4f.com.
620	       *  The name resolution API decides to pass the string to DNS (and
621	          possibly other protocols).
622	       *  The DNS resolver passes the string unmodified to a DNS server.
623	       *  If the name is not found in DNS, the name resolution API
624	          decides to try another protocol, say mDNS.
625	       *  The query goes out in mDNS, but since mDNS specifies that
626	          names are to be registered in UTF-8, the name isn't found
627	          since it was encoded as an A-label in the query.
628	   4.  User types in {non-ASCII}, and the application passes it, in the
629	       form of a UTF-8 string, to getaddrinfo or equivalent.
630	       *  The name resolution API decides to pass the string to DNS (and
631	          possibly other protocols).
632	       *  The DNS resolver will append suffixes in the suffix search
633	          list, which may contain UTF-8 characters if the local network
634	          uses a private name space.
635	       *  Each FQDN in turn will then be sent in a query to a DNS
636	          server, until one succeeds.
637	   5.  User types in {non-ASCII}, but the application first converts it
638	       to an A-label, such that the name that is passed to getaddrinfo
639	       or equivalent is (say) xn--e1afmkfd.
640	       *  The name resolution API decides to pass the string to DNS (and
641	          possibly other protocols).
642	       *  The DNS stub resolver will append suffixes in the suffix
643	          search list, which may contain UTF-8 characters if the local
644	          network uses a private name space, resulting in (say)
645	          xn--e1afmkfd.{non-ASCII}.com.
647	       *  Each FQDN in turn will then be sent in a query to a DNS
648	          server, until one succeeds.
649	       *  Since the private name space in this case uses UTF-8, the
650	          above queries fail, since the A-label version of the name was
651	          not registered in that name space.
652	   6.  User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
653	       {non-ASCII3}.com is a public name space using IDNA and A-labels,
654	       but {non-ASCII2}.{non-ASCII3}.com is a private name space using
655	       UTF-8, which is accessible to the user.  The application passes
656	       the name, in the form of a UTF-8 string, to getaddrinfo or
657	       equivalent.
658	       *  The name resolution API decides to pass the string to DNS (and
659	          possibly other protocols).
660	       *  The DNS resolver tries to locate the authoritative server, but
661	          fails the lookup because it cannot find a server for the UTF-8
662	          encoding of {non-ASCII3}.com, even though it would have access
663	          to the private name space.  (To make this work, the private
664	          name space would need to include the UTF-8 encoding of
665	          {non-ASCII3}.com.)

667	   When users use multiple applications, some of which do A-label
668	   conversion prior to passing a name to name resolution APIs, and some
669	   of which do not, odd behavior can result which at best violates the
670	   principle of least surprise, and at worst can result in security
671	   vulnerabilities.

673	   First consider two competing applications, such as web browsers, that
674	   are designed to achieve the same task.  If the user types the same
675	   name into each browser, one may successfully resolve the name (and
676	   hence access the desired content) because the encoding scheme was
677	   correct, while the other may fail name resolution because the
678	   encoding scheme was incorrect.  Hence the issue can push users to
679	   switch to another application (which in some cases means switching to
680	   an IDNA application, and in other cases means switching away from an
681	   IDNA application).

683	   Next consider two separate applications where one is designed to be
684	   launched from the other, for example a web browser launching a media
685	   player application when the link to a media file is clicked.  If both
686	   types of content (web pages and media files in this example) are
687	   hosted at the same IDN in a private name space, but one application
688	   converts to A-labels before calling name resolution APIs and the
689	   other does not, the user may be able to access a web page, click on
690	   the media file causing the media player to launch and attempt to
691	   retrieve the media file, which will then fail because the IDN
692	   encoding scheme was incorrect.  Or even worse, if an attacker were
693	   able to register the same name in the other encoding scheme, the
694	   user may get the content from the attacker's machine.  This is
695	   similar to a normal phishing attack, except that the two names
696	   represent exactly the same Unicode characters.

698	4.  Recommendations

700	   As explained above, using multiple canonical formats and multiple
701	   encodings, in different protocols or even in different places in the
702	   same namespace, creates problems.  Because of this, and the fact that
703	   both IDNA A-labels and UTF-8 are in use as encoding mechanisms for
704	   domain names today, we recommend the following.

706	   It is inappropriate for an application to convert a name to an
707	   A-label when it does not know whether DNS will be used by the name
708	   resolution library, or whether the name exists in a private name
709	   space that uses UTF-8, or in the global DNS that uses IDNA A-labels.

711	   Instead, conversion to A-label form, UTF-8, or any other encoding,
712	   should be done only by an entity that knows which protocol will be
713	   used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
714	   the name to DNS), rather than by general applications that call
715	   protocol-independent name resolution APIs.  (Of course, it is still
716	   necessary for applications to convert to whatever form those APIs
717	   expect.)  Similarly, even when DNS is used, the conversion to
718	   A-labels should be done only by an entity that knows which name space
719	   will be used.

721	   That is, a more intelligent DNS resolver would be more liberal in
722	   what it would accept from an application and be able to query for
723	   both a name in A-label form (e.g., over the Internet) and a UTF-8
724	   name (e.g., over a corporate network with a private name space) in
725	   case the server only recognized one.  However, we might also take
726	   into account that the various resolution behaviors discussed earlier
727	   could also occur with record updates (e.g., with Dynamic Update
728	   [RFC2136]), resulting in some names being registered in a local
729	   network's private name space by applications doing conversion to
730	   A-labels, and other names being registered using UTF-8.  Hence a name
731	   might have to be queried with both encodings to be sure to succeed
732	   without changes to DNS servers.
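
One way to picture a resolver that queries with both encodings is the
Python sketch below. It is only an illustration under stated
assumptions: query_forms, resolve, and the lookup callback are
hypothetical names, the order in which the two forms are tried is
arbitrary, and the A-label conversion applies only the Punycode step
without IDNA mapping or validation.

```python
ACE_PREFIX = "xn--"


def to_unicode(label):
    """Return the Unicode form of a label given as A-label or Unicode."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def to_ace(label):
    """Return the prefixed ACE form of a Unicode label (simplified)."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def query_forms(name):
    """Distinct octet strings to try for one name, whatever form it came in."""
    labels = [to_unicode(label) for label in name.split(".")]
    ace = ".".join(to_ace(label) for label in labels).encode("ascii")
    utf8 = ".".join(labels).encode("utf-8")
    forms = [ace]
    if utf8 != ace:
        forms.append(utf8)
    return forms


def resolve(name, lookup):
    """Try each form until the (hypothetical) lookup callback answers."""
    for form in query_forms(name):
        answer = lookup(form)
        if answer is not None:
            return answer
    return None


# Hypothetical private zone registered in UTF-8; 192.0.2.20 is a
# documentation address.
zone = {"пример.example".encode("utf-8"): "192.0.2.20"}
print(resolve("xn--e1afmkfd.example", zone.get))
print(resolve("пример.example", zone.get))
```

Because the resolver normalizes its input first, the application may
hand it either form and the same record is found. A pure-ASCII name
yields a single form, so it is queried only once.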

734	   Similarly, a more intelligent stub resolver would also be more
735	   liberal in what it would accept from a response as the value of a
736	   record (e.g., PTR) in that it would accept either UTF-8 (U-labels in
737	   the case of IDNA) or A-labels and convert them to whatever encoding
738	   is used by the application APIs to return strings to applications.
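
The liberal handling of response data might look like the following
Python sketch. The function names are hypothetical, the choice of
UTF-16 as the API encoding is only an example (loosely modeled on APIs
such as GetAddrInfoW), and detecting A-labels solely by the "xn--"
prefix is a simplifying assumption.

```python
ACE_PREFIX = b"xn--"


def label_to_text(raw):
    """Decode one label from a response, accepting A-label or UTF-8 octets."""
    if raw[:len(ACE_PREFIX)].lower() == ACE_PREFIX:
        return raw[len(ACE_PREFIX):].decode("punycode")
    return raw.decode("utf-8")


def name_to_api_string(raw_labels, api_encoding="utf-16-le"):
    """Join decoded labels and re-encode them for the calling API."""
    text = ".".join(label_to_text(raw) for raw in raw_labels)
    return text.encode(api_encoding)


# The same name as it might appear in two different PTR responses.
from_public = [b"xn--e1afmkfd", b"example"]
from_private = ["пример".encode("utf-8"), b"example"]
print(name_to_api_string(from_public) == name_to_api_string(from_private))
```

Either response yields the same string for the application, which is
the point of accepting both forms.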

740	   Indeed the choice of conversion within the resolver libraries is
741	   consistent with the quote from section 6.2 of the original IDNA
742	   specification [RFC3490] stating that conversion using the Punycode
743	   algorithm (i.e., to A-labels) "might be performed inside these new
744	   versions of the resolver libraries".

746	   That said, some application-layer protocols may be defined to use
747	   A-labels rather than UTF-8 as recommended by the IETF character sets
748	   and languages policy [RFC2277].  In this case, an application may
749	   receive a string containing A-labels and want to pass it to name
750	   resolution APIs.  Again the recommendation that a resolver library be
751	   more liberal in what it would accept from an application would mean
752	   that such a name would be accepted and re-encoded as needed, rather
753	   than requiring the application to do so.

755	   Finally, the question remains about what, if anything, a DNS server
756	   should do to handle cases where some existing applications or hosts
757	   do IDNA queries using A-labels within the local network using a
758	   private name space, and other existing applications or hosts send
759	   UTF-8 queries.  It is undesirable to store different records for
760	   different encodings of the same name, since this introduces the
761	   possibility for inconsistency between them.  Instead, a new DNS
762	   server serving a private name space using UTF-8 could potentially
763	   treat encoding-conversion in the same way as case-insensitive
764	   comparison, which a DNS server is already required to do, as long
765	   as the DNS server has some way to know what the encoding is.  Two
766	   encodings are, in this sense, two representations of the same name,
767	   just as two case-different strings are.  However, whereas case
768	   comparison of non-ASCII characters is complicated by ambiguities
769	   (as explained in the IAB's Review and Recommendations for
770	   Internationalized Domain Names [RFC4690]), encoding conversion
771	   between A-labels and U-labels is unambiguous.
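
A server-side version of this idea can be sketched in Python as a zone
that stores one record per name and compares labels after converting
them to a common form. The Zone class is hypothetical, it assumes the
server can tell the encoding from the "xn--" prefix alone, and it folds
case for ASCII letters only, leaving non-ASCII case handling out.

```python
class Zone:
    """Toy single-label zone keyed on a canonical Unicode form."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _key(raw_label):
        # Assumption: a label starting with "xn--" is an A-label and
        # anything else is UTF-8.
        if raw_label[:4].lower() == b"xn--":
            text = raw_label[4:].decode("punycode")
        else:
            text = raw_label.decode("utf-8")
        # Fold ASCII letters only, mirroring existing DNS comparison.
        return "".join(c.lower() if c.isascii() else c for c in text)

    def add(self, raw_label, value):
        self._records[self._key(raw_label)] = value

    def lookup(self, raw_label):
        return self._records.get(self._key(raw_label))


zone = Zone()
zone.add("пример".encode("utf-8"), "192.0.2.30")  # registered in UTF-8
zone.add(b"Host", "192.0.2.31")

print(zone.lookup(b"xn--e1afmkfd"))  # queried as an A-label
print(zone.lookup(b"hOST"))          # ASCII case difference
```

Only one record exists per name, so the two encodings cannot drift
apart the way separately stored records could.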

773	5.  Acknowledgements

775	   The authors wish to thank Patrik Faltstrom, Martin Duerst, and JFC
776	   Morfin for their careful review and helpful suggestions.

778	6.  Security Considerations

780	   Having applications convert names to prefixed ACE format (A-labels)
781	   before calling name resolution can result in security
782	   vulnerabilities.  If the name is resolved by protocols or in zones
783	   for which records are registered using other encoding schemes, an
784	   attacker can claim the A-label version of the same name and hence
785	   trick the victim into accessing a different destination.  This can be
786	   done for any non-ASCII name, even when there is no possible confusion
787	   due to case, language, or other issues.  Other types of confusion
788	   beyond those resulting simply from the choice of encoding scheme are
789	   discussed in "Review and Recommendations for IDNs" [RFC4690].

791	   Designers and users of encodings that represent Unicode strings in
792	   terms of ASCII should also consider whether trademark protection is
793	   an issue, e.g., if one name would be encoded in a way that would be
794	   naturally associated with another organization, such as
795	   xn--rfc-editor.

797	7.  IANA Considerations

799	   [RFC Editor: please remove this section prior to publication.]

801	   This document has no IANA Actions.

803	8.  IAB Members at the time of publication

805	   Bernard Aboba
806	   Marcelo Bagnulo
807	   Ross Callon
808	   Spencer Dawkins
809	   Vijay Gill
810	   Russ Housley
811	   John Klensin
812	   Olaf Kolkman
813	   Danny McPherson
814	   Jon Peterson
815	   Andrei Robachevsky
816	   Dave Thaler
817	   Hannes Tschofenig

819	9.  References

821	9.1.  Normative References

823	   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
824	              5.1.0", 2008.

826	              defined by: The Unicode Standard, Version 5.0, Boston, MA,
827	              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
828	              Unicode 5.1.0
829	              (http://www.unicode.org/versions/Unicode5.1.0/).

831	9.2.  Informative References

833	   [I-D.cheshire-dnsext-multicastdns]
834	              Cheshire, S. and M. Krochmal, "Multicast DNS",
835	              draft-cheshire-dnsext-multicastdns-11 (work in progress),
836	              March 2010.

838	   [I-D.ietf-idn-punycode-00]
839	              Costello, A., "Punycode version 0.3.3",
840	              draft-ietf-idn-punycode-00 (work in progress), July 2002.

842	   [I-D.skwan-utf8-dns-00]
843	              Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in
844	              the Domain Name System", draft-skwan-utf8-dns-00 (work in
845	              progress), November 1997.

847	   [IDNA2008-Defs]
848	              Klensin, J., "Internationalized Domain Names for
849	              Applications (IDNA): Definitions and Document Framework",
850	              January 2010, <https://datatracker.ietf.org/drafts/
851	              draft-ietf-idnabis-defs/>.

853	   [IDNA2008-Protocol]
854	              Klensin, J., "Internationalized Domain Names in
855	              Applications (IDNA): Protocol", January 2010, <https://
856	              datatracker.ietf.org/drafts/draft-ietf-idnabis-protocol/>.

858	   [MJD]      Duerst, M., "The Properties and Promizes of UTF-8", 11th
859	              International Unicode Conference, San Jose,
860	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
861	              papers/PDF/IUC11-UTF-8.pdf>.

863	   [NIS]      Sun Microsystems, "System and Network Administration",
864	              March 1990.

866	   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10,
867	              RFC 821, August 1982.

869	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
870	              host table specification", RFC 952, October 1985.

872	   [RFC1001]  NetBIOS Working Group, "Protocol standard for a NetBIOS
873	              service on a TCP/UDP transport: Concepts and methods",
874	              STD 19, RFC 1001, March 1987.

876	   [RFC1002]  NetBIOS Working Group, "Protocol standard for a NetBIOS
877	              service on a TCP/UDP transport: Detailed specifications",
878	              STD 19, RFC 1002, March 1987.

880	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
881	              STD 13, RFC 1034, November 1987.

883	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
884	              and Support", STD 3, RFC 1123, October 1989.

886	   [RFC1468]  Murai, J., Crispin, M., and E. van der Poel, "Japanese
887	              Character Encoding for Internet Messages", RFC 1468,
888	              June 1993.

890	   [RFC1536]  Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
891	              Miller, "Common DNS Implementation Errors and Suggested
892	              Fixes", RFC 1536, October 1993.

894	   [RFC2136]  Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
895	              "Dynamic Updates in the Domain Name System (DNS UPDATE)",
896	              RFC 2136, April 1997.

898	   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
899	              Specification", RFC 2181, July 1997.

901	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
902	              Languages", BCP 18, RFC 2277, January 1998.

904	   [RFC3397]  Aboba, B. and S. Cheshire, "Dynamic Host Configuration
905	              Protocol (DHCP) Domain Search Option", RFC 3397,
906	              November 2002.

908	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
909	              "Internationalizing Domain Names in Applications (IDNA)",
910	              RFC 3490, March 2003.

912	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
913	              for Internationalized Domain Names in Applications
914	              (IDNA)", RFC 3492, March 2003.

916	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
917	              Stevens, "Basic Socket Interface Extensions for IPv6",
918	              RFC 3493, February 2003.

920	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
921	              10646", STD 63, RFC 3629, November 2003.

923	   [RFC3646]  Droms, R., "DNS Configuration options for Dynamic Host
924	              Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
925	              December 2003.

927	   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
928	              Recommendations for Internationalized Domain Names
929	              (IDNs)", RFC 4690, September 2006.

931	   [RFC4795]  Aboba, B., Thaler, D., and L. Esibov, "Link-local
932	              Multicast Name Resolution (LLMNR)", RFC 4795,
933	              January 2007.

935	   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
936	              Interchange", RFC 5198, March 2008.

938	   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
939	              October 2008.

941	Authors' Addresses

943	   Dave Thaler
944	   Microsoft Corporation
945	   One Microsoft Way
946	   Redmond, WA  98052
947	   USA

949	   Phone: +1 425 703 8835
950	   Email: dthaler@microsoft.com

952	   John C Klensin
953	   1770 Massachusetts Ave, Ste 322
954	   Cambridge, MA  02140

956	   Phone: +1 617 245 1457
957	   Email: john+ietf@jck.com

959	   Stuart Cheshire
960	   Apple Inc.
961	   1 Infinite Loop
962	   Cupertino, CA  95014

964	   Phone: +1 408 974 3207
965	   Email: cheshire@apple.com