idnits 2.17.1 

draft-iab-idn-encoding-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 348: '...      Protocols MUST be able to use th...'
     RFC 2119 keyword, line 351: '... for all text.  Protocols MAY specify,...'
     RFC 2119 keyword, line 361: '...      support MUST be possible....'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 10, 2009) is 5281 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '10646' on line 350

  == Missing Reference: 'BCP9' is mentioned on line 355, but not defined

  == Outdated reference: A later version (-15) exists of
     draft-cheshire-dnsext-multicastdns-08

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-punycode-00

  == Outdated reference: A later version (-06) exists of
     draft-skwan-utf8-dns-00

  -- Obsolete informational reference (is this intentional?): RFC  821
     (Obsoleted by RFC 2821)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          D. Thaler
3	Internet-Draft                                                 Microsoft
4	Intended status: Informational                                J. Klensin
5	Expires: May 14, 2010
6	                                                             S. Cheshire
7	                                                                   Apple
8	                                                       November 10, 2009

10	      IAB Thoughts on Encodings for Internationalized Domain Names
11	                     draft-iab-idn-encoding-01.txt

13	Abstract

15	   This document explores issues with Internationalized Domain Names
16	   (IDNs) that result from the use of various encoding schemes such as
17	   Punycode and UTF-8.

19	Status of this Memo

21	   This Internet-Draft is submitted to IETF in full conformance with the
22	   provisions of BCP 78 and BCP 79.

24	   Internet-Drafts are working documents of the Internet Engineering
25	   Task Force (IETF), its areas, and its working groups.  Note that
26	   other groups may also distribute working documents as Internet-
27	   Drafts.

29	   Internet-Drafts are draft documents valid for a maximum of six months
30	   and may be updated, replaced, or obsoleted by other documents at any
31	   time.  It is inappropriate to use Internet-Drafts as reference
32	   material or to cite them other than as "work in progress."

34	   The list of current Internet-Drafts can be accessed at
35	   http://www.ietf.org/ietf/1id-abstracts.txt.

37	   The list of Internet-Draft Shadow Directories can be accessed at
38	   http://www.ietf.org/shadow.html.

40	   This Internet-Draft will expire on May 14, 2010.

42	Copyright Notice

44	   Copyright (c) 2009 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  APIs
61	   2.  Use of Non-DNS Protocols
62	   3.  Use of Non-ASCII in DNS
63	     3.1.  Examples
64	   4.  Recommendations
65	   5.  Security Considerations
66	   6.  IANA Considerations
67	   7.  IAB Members at the time of this writing
68	   8.  References
69	     8.1.  Normative References
70	     8.2.  Informative References
71	   Authors' Addresses

73	1.  Introduction

75	   The goal of this document is to explore what can be learned from some
76	   current difficulties in implementing Internationalized Domain Names
77	   (IDNs).  Although some elements of this exploration may immediately
78	   feed back into current IETF work, it is explicitly not the intention
79	   for this document to influence any current working group charter.

81	   A domain name consists of a set of labels, conventionally written
82	   separated with dots.  An Internationalized Domain Name (IDN) is a
83	   domain name that contains one or more labels that, in turn, contain
84	   one or more non-ASCII characters.  Just as with plain ASCII domain
85	   names, each IDN label must be encoded using some mechanism before it
86	   can be transmitted in network packets, stored in memory, stored on
87	   disk, etc.  These encodings need to be reversible, but they need not
88	   store domain names the same way humans conventionally write them on
89	   paper.  For example, when transmitted over the network in DNS
90	   packets, domain name labels are *not* separated with dots.
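
   As a rough illustration of the last point, the following Python
   sketch builds the wire form of a name, in which each label is
   preceded by a one-octet length and the name ends with the
   zero-length root label.  The function name to_wire is hypothetical
   and the labels are assumed to be already encoded as octets; this is
   not taken from any specification cited here.

```python
def to_wire(labels):
    """Return the DNS wire form of a name given as a list of labels.

    Each label (already encoded as bytes) is preceded by one length
    octet, and the name ends with the zero-length root label.
    """
    out = bytearray()
    for label in labels:
        # The length must fit the single-octet prefix; DNS caps it at 63.
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 octets long")
        out.append(len(label))
        out += label
    out.append(0)  # zero-length root label terminates the name
    return bytes(out)


wire = to_wire([b"www", b"example", b"com"])
print(wire)  # b'\x03www\x07example\x03com\x00'
```

   No dot octet appears anywhere in the result; the dots exist only in
   the conventional written form.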

92	   IDNA, discussed later in this document, is the standard that defines
93	   the use and coding of internationalized domain names for use on the
94	   public Internet.  It is defined in several documents, with the
95	   primary one of those being "Internationalizing Domain Names in
96	   Applications (IDNA)" [RFC3490].  A revision to the IDNA Standard is
97	   undergoing IETF Last Call review as this document is being written.
98	   That revision is reflected in [IDNA2008-Defs] and associated
99	   materials.  Except where noted, the two versions are approximately
100	   the same with regard to the issues discussed in this document.
101	   However, their terminology differs somewhat; this document reflects
102	   the terminology of the earlier version.

104	   Punycode [RFC3492] is a mechanism for encoding a Unicode [Unicode]
105	   string in ASCII characters using only letters, digits, and hyphens.
106	   When a Unicode label is encoded with Punycode, it is prefixed with
107	   "xn--", which assumes that other DNS labels are no longer allowed to
108	   start with these four characters.  Consequently, when Punycode
109	   encoding is assumed, any DNS labels beginning with "xn--" now have a
110	   different meaning (the Punycode encoding of a label containing one or
111	   more non-ASCII characters) or no defined meaning at all (in the case
112	   of labels that are not well-formed Punycode).

114	   The term "ToASCII" refers to the process of encoding a label
115	   containing one or more non-ASCII characters as an ASCII string
116	   beginning with "xn--".  It consists of a combination of a non-
117	   reversible character mapping operation (e.g., converting upper case
118	   characters to lower case characters), plus a reversible encoding
119	   algorithm ('Punycode') that encodes a sequence of Unicode code points
120	   (which may contain code points above 127) as a sequence of ASCII code
121	   points (containing only ASCII code points for letters, digits and
122	   hyphens).  The term "ToUnicode" refers to the process of reversing
123	   the Punycode encoding, but not reversing the (irreversible) character
124	   mapping operation.
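
   The distinction between the reversible and irreversible steps can
   be seen in a short Python sketch.  It assumes Python's built-in
   "punycode" and "idna" codecs, which implement [RFC3492] and the
   original IDNA specification [RFC3490] respectively; the sample
   label is an arbitrary choice.

```python
label = "B\u00fccher"  # "Bücher", with an upper-case "B"

# Punycode by itself is a reversible transformation of code points.
raw = label.encode("punycode")
assert raw.decode("punycode") == label

# ToASCII: irreversible mapping (here, case folding), then Punycode,
# then the "xn--" prefix.
ace = label.encode("idna")
print(ace)  # b'xn--bcher-kva'

# ToUnicode reverses only the Punycode step, so the original
# upper-case "B" is not recovered.
restored = ace.decode("idna")
print(restored == label)          # False
print(restored == label.lower())  # True
```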

126	   ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII
127	   and Japanese characters, where an ASCII character is preserved as-is.

129	   Unicode [Unicode] is a list of characters (including non-spacing
130	   marks that are used to form some other characters), where each
131	   character is assigned an integer value, called a code point.  In
132	   simple terms a Unicode string is a string of integer code point
133	   values in the range 0 to 1,114,111 (10FFFF in base 16), which
134	   represent a string of Unicode characters.  These integer code points
135	   must be encoded using some mechanism before they can be transmitted
136	   in network packets, stored in memory, stored on disk, etc.  Some
137	   common ways of encoding these integer code point values in computer
138	   systems include UTF-8, UTF-16, and UTF-32.  In addition to the
139	   material below, those forms and the tradeoffs among them are
140	   discussed in Chapter 2 of The Unicode Standard [Unicode].

142	   UTF-8 [RFC3629] is a mechanism for encoding a Unicode code point in a
143	   variable number of 8-bit octets, where an ASCII code point is
144	   preserved as-is.  Those octets encode a string of integer code point
145	   values, which represent a string of Unicode characters.

147	   UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code
148	   point in one or two 16-bit integers, described in detail in Sections
149	   3.9 and 3.10 of The Unicode Standard [Unicode].  A UTF-16 string
150	   encodes a string of integer code point values that represent a string
151	   of Unicode characters.

153	   UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and
154	   3.10, is a mechanism for encoding a Unicode code point in a single
155	   32-bit integer.  A UTF-32 string is thus a string of 32-bit integer
156	   code point values, which represent a string of Unicode characters.

158	   Note that UTF-16 and UTF-32 codings result in some all-zero octets
159	   when code points occur early in the Unicode sequence.
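
   The following Python sketch, offered only as an illustration,
   prints the octets produced by the three encoding forms for a few
   sample code points.  The helper name octets and the sample
   characters are arbitrary choices, and big-endian byte order is
   assumed for UTF-16 and UTF-32.

```python
def octets(ch):
    """Return the UTF-8, UTF-16, and UTF-32 (big-endian) forms of ch."""
    return (ch.encode("utf-8"),
            ch.encode("utf-16-be"),
            ch.encode("utf-32-be"))


for ch in ["a", "\u00e9", "\u65e5", "\U00010348"]:
    utf8, utf16, utf32 = octets(ch)
    # Columns: code point | UTF-8 | UTF-16 | UTF-32, octets in hex.
    print(f"U+{ord(ch):06X}",
          utf8.hex(" "), "|", utf16.hex(" "), "|", utf32.hex(" "))
```

   For the ASCII letter "a", the UTF-8 form is the single octet 61,
   while the UTF-16 and UTF-32 forms are padded with zero octets.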

161	   Different applications, APIs, and protocols use different encoding
162	   schemes today.  Historically, many of them were originally defined to
163	   use only ASCII.  Internationalizing Domain Names in Applications
164	   (IDNA) [RFC3490] defined a mechanism that required changes to
165	   applications but, to avoid changing APIs or servers, specified
166	   that Punycode is to be used.  In some ways this could be seen as not
167	   changing the existing APIs, in the sense that the strings being
168	   passed to and from the APIs were still apparently ASCII strings.  In
169	   other ways it was a very profound change to the existing APIs,
170	   because while those strings were still syntactically valid ASCII
171	   strings, they no longer meant the same thing as they used to.  What
172	   looked like a plain ASCII string to one piece of software or library
173	   could be seen by another piece of software or library (with the
174	   application of out-of-band information) to be in fact an encoding of
175	   a Unicode string.

177	   Section 1.3 of the IDNA specification [RFC3490] states:

179	      The IDNA protocol is contained completely within applications.  It
180	      is not a client-server or peer-to-peer protocol: everything is
181	      done inside the application itself.  When used with a DNS resolver
182	      library, IDNA is inserted as a "shim" between the application and
183	      the resolver library.  When used for writing names into a DNS
184	      zone, IDNA is used just before the name is committed to the zone.

186	   Figure 1 depicts a simplistic architecture that a naive reader might
187	   assume from the paragraph quoted above.  (A variant of this same
188	   picture appears in Section 6 of the IDNA specification [RFC3490],
189	   further strengthening this assumption.)

191	    +-----------------------------------------+
192	    |Host                                     |
193	    |             +-------------+             |
194	    |             | Application |             |
195	    |             +------+------+             |
196	    |                    |                    |
197	    |               +----+----+               |
198	    |               |   DNS   |               |
199	    |               | Resolver|               |
200	    |               | Library |               |
201	    |               +----+----+               |
202	    |                    |                    |
203	    +-----------------------------------------+
204	                         |
205	                _________|_________
206	               /                   \
207	              /                     \
208	             /                       \
209	            |         Internet        |
210	             \                       /
211	              \                     /
212	               \___________________/

214	                          Simplistic Architecture

216	                                 Figure 1

218	   There are, however, two problems with this simplistic architecture
219	   that cause it to differ from reality.

221	   First, resolver APIs on Operating Systems (OSs) today (MacOS,
222	   Windows, Linux, etc.) are not DNS-specific.  They typically provide a
223	   layer of indirection so that the application can work independent of
224	   the name resolution mechanism, which could be DNS, mDNS
225	   [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP
226	   [RFC1001][RFC1002], /etc/hosts file [RFC0952], NIS [NIS], or anything
227	   else.  For example, "Basic Socket Interface Extensions for IPv6"
228	   [RFC3493] specifies the getaddrinfo() API and contains many phrases
229	   like "For example, when using the DNS" and "any type of name
230	   resolution service (for example, the DNS)".  Importantly, DNS is
231	   mentioned only as an example, and the application has no knowledge as
232	   to whether DNS or some other protocol will be used.

234	   Second, even with the DNS protocol, private name spaces (sometimes
235	   including private uses of the DNS) do not necessarily use the same
236	   character set encoding scheme as the public Internet name space.

238	   We will discuss each of the above issues in subsequent sections.  For
239	   reference, Figure 2 depicts a more realistic architecture on typical
240	   hosts today (which don't have IDNA inserted as a shim immediately
241	   above the DNS resolver library).  More generally, the host may be
242	   attached to one or more local networks, each of which may or may not
243	   be connected to the public Internet and may or may not have a private
244	   name space.

246	    +-----------------------------------------+
247	    |Host                                     |
248	    |             +-------------+             |
249	    |             | Application |             |
250	    |             +------+------+             |
251	    |                    |                    |
252	    |             +------+------+             |
253	    |             |   Generic   |             |
254	    |             |    Name     |             |
255	    |             |  Resolution |             |
256	    |             |     API     |             |
257	    |             +------+------+             |
258	    |                    |                    |
259	    |   +-----+------+---+--+-------+-----+   |
260	    |   |     |      |      |       |     |   |
261	    | +-+-++--+--++--+-++---+---++--+--++-+-+ |
262	    | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
263	    | +---++-----++----++-------++-----++---+ |
264	    |                                         |
265	    +-----------------------------------------+
266	                         |
267	                   ______|______
268	                  /             \
269	                 /               \
270	                /      local      \
271	                \     network     /
272	                 \               /
273	                  \_____________/
274	                         |
275	                _________|_________
276	               /                   \
277	              /                     \
278	             /                       \
279	            |         Internet        |
280	             \                       /
281	              \                     /
282	               \___________________/

284	                          Realistic Architecture

286	                                 Figure 2

288	1.1.  APIs

290	   Section 6.2 of the IDNA specification [RFC3490] states:

292	      It is expected that new versions of the resolver libraries in the
293	      future will be able to accept domain names in other charsets than
294	      ASCII, and application developers might one day pass not only
295	      domain names in Unicode, but also in local script to a new API for
296	      the resolver libraries in the operating system.  Thus the ToASCII
297	      and ToUnicode operations might be performed inside these new
298	      versions of the resolver libraries.

300	   Resolver APIs such as getaddrinfo() and its predecessor
301	   gethostbyname() were defined to accept "char *" arguments, meaning
302	   they accept a string of bytes, terminated with a NUL (0) byte.
303	   Because of the use of a NUL octet as a string terminator, this is
304	   sufficient for ASCII strings, Punycode strings, and even ISO-2022-JP
305	   and UTF-8 strings (unless an implementation artificially precludes
306	   them), but not UTF-16 or UTF-32 strings.  Several operating systems
307	   historically used in Japan will accept (and expect) ISO-2022-JP
308	   strings in such APIs.  Some platforms used worldwide also have new
309	   versions of the APIs (e.g., GetAddrInfoW() on Windows) that accept
310	   other encoding schemes such as UTF-16.

312	   It is worth noting that an API using "char *" arguments can
313	   distinguish between ASCII, Punycode, ISO-2022-JP, and UTF-8 labels in
314	   names if the coding is known to be one of those four.  An example
315	   method is as follows:
316	   o  if the label contains an ESC (0x1B) byte the label is ISO-2022-JP;
317	      otherwise,
318	   o  if any byte in the label has the high bit set, the label is UTF-8;
319	      otherwise,
320	   o  if the label starts with "xn--" then it contains a string in
321	      Punycode encoding; otherwise,
322	   o  the label is ASCII.
323	   Again this assumes that ASCII labels never start with "xn--", and
324	   also that UTF-8 strings never contain an ESC character.  Also the
325	   above is merely an illustration; UTF-8 can be detected and
326	   distinguished from other 8-bit encodings with high precision [MJD].
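
   The example method above can be sketched in Python as follows.
   The function name classify_label is hypothetical, and the sketch
   inherits the stated assumptions: the coding is one of the four
   listed, ASCII labels never start with "xn--", and UTF-8 labels
   never contain ESC.  Comparing the prefix without regard to case is
   an additional assumption not made in the text above.

```python
ESC = 0x1B


def classify_label(label: bytes) -> str:
    """Guess the encoding of one label using the rules listed above."""
    if ESC in label:
        return "ISO-2022-JP"
    if any(octet & 0x80 for octet in label):
        return "UTF-8"
    # Assumption: the prefix is compared case-insensitively.
    if label[:4].lower() == b"xn--":
        return "Punycode"
    return "ASCII"


print(classify_label("\u65e5\u672c".encode("iso2022_jp")))  # ISO-2022-JP
print(classify_label("caf\u00e9".encode("utf-8")))          # UTF-8
print(classify_label(b"xn--bcher-kva"))                     # Punycode
print(classify_label(b"example"))                           # ASCII
```

   The order of the tests matters: ISO-2022-JP uses only 7-bit octets,
   so it has to be recognized by its escape sequences before the
   high-bit test is applied.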

328	   It is more difficult or impossible to distinguish the ISO 8859
329	   character sets from each other.  Similarly, it is not possible in
330	   general to distinguish between ISO-2022-JP and any other encoding
331	   based on ISO 2022 code table switching.

333	   Although it is possible (as in the example above) to distinguish some
334	   encodings when not explicitly specified, it is cleaner to have the
335	   encodings specified explicitly, such as specifying UTF-16 for
336	   GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
337	   strings.

339	2.  Use of Non-DNS Protocols

341	   As noted earlier, typical name resolution libraries are not DNS-
342	   specific.  Furthermore, some protocols are defined to use encoding
343	   schemes other than Punycode.  For example, mDNS
344	   [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used.
345	   Indeed, the IETF policy on character sets and languages [RFC2277]
346	   states:

348	      Protocols MUST be able to use the UTF-8 charset, which consists of
349	      the ISO 10646 coded character set combined with the UTF-8
350	      character encoding scheme, as defined in [10646] Annex R
351	      (published in Amendment 2), for all text.  Protocols MAY specify,
352	      in addition, how to use other charsets or other character encoding
353	      schemes for ISO 10646, such as UTF-16, but lack of an ability to
354	      use UTF-8 is a violation of this policy; such a violation would
355	      need a variance procedure ([BCP9] section 9) with clear and solid
356	      justification in the protocol specification document before being
357	      entered into or advanced upon the standards track.  For existing
358	      protocols or protocols that move data from existing datastores,
359	      support of other charsets, or even using a default other than
360	      UTF-8, may be a requirement.  This is acceptable, but UTF-8
361	      support MUST be possible.

363	   Applications that convert an IDN to Punycode before calling
364	   getaddrinfo() will encounter name resolution failures if the Punycode
365	   name is used directly in such protocols.  Having libraries or
366	   protocols to convert from Punycode to the encoding scheme defined by
367	   the protocol (e.g., UTF-8) would require changes to APIs and/or
368	   servers, which IDNA was intended to avoid.

370	   As a result, applications have problems if they assume that non-ASCII
371	   names are resolved using the public DNS and blindly convert them to
372	   Punycode without knowledge of what protocol will be selected by the
373	   name resolution library.  Furthermore, name resolution
374	   libraries often try multiple protocols until one succeeds, because
375	   they are defined to use a common name space.  For example, the hosts
376	   file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
377	   common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
378	   [RFC1034] section 2.1).  This means that when an application passes a
379	   name to be resolved, resolution may in fact be attempted using
380	   multiple protocols, each with a potentially different encoding
381	   scheme.  For this to work successfully, the name must be converted to
382	   the appropriate encoding scheme only after the choice is made to use
383	   that protocol.  In general, this cannot be done by the application
384	   since the choice of protocol is not made by the application.
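
   A minimal Python sketch of this ordering follows.  Everything in
   it is hypothetical: the protocol names, the encode_for and resolve
   functions, and the toy lookup tables stand in for a real name
   resolution library.  The point is only that the name stays in
   Unicode form until a protocol has been chosen.

```python
def encode_for(protocol, name):
    """Encode a Unicode name for one protocol (assumed mapping)."""
    if protocol == "public-dns":
        return name.encode("idna")   # ToASCII applied to each label
    if protocol == "mdns":
        return name.encode("utf-8")
    raise ValueError(f"unknown protocol: {protocol}")


def resolve(name, backends):
    """Try each protocol in turn, encoding the name only at that point."""
    for protocol, lookup in backends:
        answer = lookup(encode_for(protocol, name))
        if answer is not None:
            return answer
    return None


# Toy tables: the name is known only to the UTF-8 based protocol.
public_dns = {}
mdns = {"b\u00fccher.local".encode("utf-8"): "192.0.2.10"}
backends = [("public-dns", public_dns.get), ("mdns", mdns.get)]

print(resolve("b\u00fccher.local", backends))         # 192.0.2.10
# An application that converted to Punycode first would miss the entry.
print(mdns.get("b\u00fccher.local".encode("idna")))   # None
```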

386	3.  Use of Non-ASCII in DNS

388	   A common misconception is that DNS only supports names that can be
389	   expressed using letters, digits, and hyphens.

391	   This misconception originally stemmed from the definition in 1985 of
392	   an "Internet host name" (and net, gateway, and domain name) for use
393	   in the "hosts" file [RFC0952].  An Internet host name was defined
394	   therein as including only letters, digits, and hyphens, where upper
395	   and lower case letters were to be treated as identical.  The DNS
396	   specification [RFC1034] section 3.5 entitled "Preferred name syntax"
397	   then repeated this definition in 1987, saying that this "syntax will
398	   result in fewer problems with many applications that use domain names
399	   (e.g., mail, TELNET)".

401	   Confusion thus remained as to whether the "preferred" name syntax
402	   was a mandatory restriction in DNS, or merely "preferred".

404	   The definition of an Internet host name was updated in 1989
405	   ([RFC1123] section 2.1) to allow names starting with a digit (to
406	   support IPv4 addresses in dotted-decimal form).  Section 6.1 of
407	   "Requirements for Internet Hosts -- Application and Support"
408	   [RFC1123] discusses the use of DNS (and the hosts file) for resolving
409	   host names to IP addresses and vice versa.  This led to confusion as
410	   to whether all names in DNS are "host names", or whether a "host
411	   name" is merely a special case of a DNS name.

413	   By 1997, things had progressed to a state where it was necessary to
414	   clarify these areas of confusion.  "Clarifications to the DNS
415	   Specification" [RFC2181] section 11 states:

417	      The DNS itself places only one restriction on the particular
418	      labels that can be used to identify resource records.  That one
419	      restriction relates to the length of the label and the full name.
420	      The length of any one label is limited to between 1 and 63 octets.
421	      A full domain name is limited to 255 octets (including the
422	      separators).  The zero length full name is defined as representing
423	      the root of the DNS tree, and is typically written and displayed
424	      as ".".  Those restrictions aside, any binary string whatever can
425	      be used as the label of any resource record.  Similarly, any
426	      binary string can serve as the value of any record that includes a
427	      domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
428	      and any others that may be added).  Implementations of the DNS
429	      protocols must not place any restrictions on the labels that can
430	      be used.

432	   Hence, it clarified that the restriction to letters, digits, and
433	   hyphens does not apply to DNS names in general, nor to records that
434	   include "domain names".  Hence the "preferred" name syntax described
435	   in the original DNS specification [RFC1034] is indeed merely
436	   "preferred", not mandatory.
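
   To make the consequence concrete, the Python sketch below applies
   only the length limits quoted above and nothing else.  The
   function name dns_accepts is hypothetical, and counting one length
   octet per label plus one octet for the root is an assumed reading
   of "including the separators".

```python
def dns_accepts(labels):
    """Apply only the length limits from RFC 2181, section 11."""
    if any(not 1 <= len(label) <= 63 for label in labels):
        return False
    # One length octet per label, plus the terminating root octet.
    wire_length = sum(1 + len(label) for label in labels) + 1
    return wire_length <= 255


print(dns_accepts([b"my host", b"example"]))                  # True
print(dns_accepts(["caf\u00e9".encode("utf-8"), b"example"])) # True
print(dns_accepts([b"\x00\xff", b"example"]))                 # True
print(dns_accepts([b"a" * 64, b"example"]))                   # False
```

   Labels containing spaces, UTF-8 sequences, or arbitrary binary
   octets all pass; only the over-long label is rejected.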

438	   Since there is no restriction even to ASCII, let alone letter-digit-
439	   hyphen use, DNS is in conformance with the IETF requirement to allow
440	   UTF-8 [RFC2277].

442	   Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
443	   in DNS packets or APIs because existing software already uses ASCII,
444	   and UTF-16 and UTF-32 strings can contain all-zero octets that
445	   existing software may interpret as the end of the string.  To use
446	   UTF-16 or UTF-32 one would need some way of knowing whether the
447	   string was encoded using ASCII, UTF-16, or UTF-32, and indeed for
448	   UTF-16 or UTF-32 whether it was big-endian or little-endian encoding.
449	   In contrast, UTF-8 works well because any 7-bit ASCII string is also
450	   a UTF-8 string representing the same characters.

452	   If a private name space is defined to use UTF-8 (and not other
453	   encodings such as UTF-16 or UTF-32), there's no need for a mechanism
454	   to know whether a string was encoded using ASCII or UTF-8, because
455	   (for any string that can be represented using ASCII) the
456	   representations are exactly the same.  In other words, for any string
457	   that can be represented using ASCII it doesn't matter whether it is
458	   interpreted as ASCII or UTF-8 because both encodings are the same,
459	   and for any string that can't be represented using ASCII, it's
460	   obviously UTF-8.  In addition, unlike UTF-16 and UTF-32, ASCII and
461	   UTF-8 are both byte-oriented encodings so the question of big-endian
462	   or little-endian encoding doesn't apply.
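
   These properties can be checked with a few lines of Python.  The
   helper as_c_string is a hypothetical stand-in for software that
   reads a "char *" argument up to the first zero octet; the sample
   name is arbitrary.

```python
def as_c_string(data):
    """Return the bytes a zero-terminated "char *" reader would see."""
    return data.split(b"\x00", 1)[0]


name = "example"
utf8 = name.encode("utf-8")
be = name.encode("utf-16-be")
le = name.encode("utf-16-le")

print(utf8 == name.encode("ascii"))  # True: identical octets
print(be == le)                      # False: byte order matters
print(as_c_string(utf8))             # b'example'
print(as_c_string(be))               # b''
print(as_c_string(le))               # b'e'
```

   The UTF-8 form survives intact, whereas both UTF-16 forms are cut
   off at or after the first character.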

464	   While implementations of the DNS protocol must not place any
465	   restrictions on the labels that can be used, applications that use
466	   the DNS are free to impose whatever restrictions they like, and many
467	   have.  The above rules permit a domain name label that contains
468	   unusual characters, such as embedded spaces, which many applications
469	   would consider a bad idea.  For example, the SMTP protocol [RFC5321],
470	   going back to the original specification in [RFC0821], constrains
471	   the character set usable in email addresses.  There is now an effort
472	   underway to permit SMTP to support internationalized email addresses
473	   via an extension.

475	   Shortly after the DNS Clarifications [RFC2181] and IETF character
476	   sets and languages policy [RFC2277] were published, the need for
477	   internationalized names within private name spaces (i.e., within
478	   enterprises) arose.  The current (and past, predating Punycode)
479	   practice within enterprises that support other languages is to put
480	   UTF-8 names in their internal DNS servers in a private name space.
481	   For example, "Using the UTF-8 Character Set in the Domain Name
482	   System" [I-D.skwan-utf8-dns-00] was first written in 1997, and was
483	   then widely deployed in Windows.  The use of UTF-8 names in DNS was
484	   similarly implemented and deployed in MacOS, simply by virtue of the
485	   fact that applications blindly passed UTF-8 strings to the name
486	   resolution APIs, the name resolution APIs blindly passed those UTF-8
487	   strings to the DNS servers, and the DNS servers correctly answered
488	   those queries.  From the user's point of view, everything worked
489	   properly without any special new code being written, except that
490	   ASCII is matched case-insensitively whereas UTF-8 is not (although
491	   some enterprise DNS servers reportedly attempt to do case-insensitive
492	   matching on UTF-8 within private name spaces).  Within a private name
493	   space, and especially in light of the IETF UTF-8 policy [RFC2277], it
494	   was reasonable to assume that binary strings were encoded in UTF-8.

497	   [EDITOR'S NOTE: There are also normalization/mapping issues.
498	   Currently we only explore encoding issues.]

500	   Five years after UTF-8 was already in use in private name spaces in
501	   DNS, Punycode began to be developed (during the period from 2002
502	   [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in the public
503	   DNS name space.  This publication thus resulted in having to use
504	   different encodings for different name spaces (where UTF-8 for
505	   private name spaces was already deployed).  Hence, referring back to
506	   Figure 2, a different encoding scheme may be in use on the Internet
507	   vs. a local network.

509	   In general a host may be connected to zero or more networks using
510	   private name spaces, plus potentially the public name space.
511	   Applications that convert an IDN to Punycode before calling
512	   getaddrinfo() will encounter name resolution failures if the name is
513	   actually registered in a private name space in some other encoding
514	   (e.g., UTF-8).  Having libraries or protocols convert from Punycode
515	   to the encoding used by a private name space (e.g., UTF-8) would
516	   require changes to APIs and/or servers, which IDNA was intended to
517	   avoid.

519	   Also, a fully-qualified domain name (FQDN) to be resolved may be
520	   obtained directly from an application, or it may be composed by the
521	   DNS resolver itself from a single label obtained from an application
522	   by using a configured suffix search list, and the resulting FQDN may
523	   use multiple encodings in different labels.  For more information on
524	   the suffix search list, see section 6 of "Common DNS Implementation
525	   Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
526	   [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
527	   [RFC3646].
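
   The following sketch shows how a suffix search list can produce an
   FQDN whose labels use different encodings.  The function name, the
   "no dots" rule, and all names in the search list are assumptions
   made for illustration; real resolvers differ in when and how they
   apply the list.

```python
def candidate_fqdns(name: bytes, search_list: list[bytes]) -> list[bytes]:
    """Compose candidate FQDNs roughly as a stub resolver might.

    Assumption: the search list is applied only to names with no dots.
    """
    if b"." in name:
        return [name]
    return [name + b"." + suffix for suffix in search_list]


# The application has already converted the user's label to Punycode.
label_from_app = "bücher".encode("idna")  # b'xn--bcher-kva'

# Hypothetical search list: one UTF-8 private suffix, one ASCII suffix.
search_list = ["büro.example".encode("utf-8"), b"example.com"]

candidates = candidate_fqdns(label_from_app, search_list)
for fqdn in candidates:
    print(fqdn)
```

   The first candidate combines a Punycode label with a UTF-8 label, so
   it matches neither a zone registered entirely in UTF-8 nor one
   registered entirely in Punycode.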

529	   As noted in [RFC1536] section 6, the community has had bad
530	   experiences with "searching" for domain names by trying multiple
531	   variations or appending different suffixes.  Such searching can yield
532	   inconsistent results depending on the order in which alternatives are
533	   tried.  Nonetheless, the practice is widespread and must be
534	   considered.

536	   The practice of searching for names, whether by the use of a suffix
537	   search list or by searching in different name spaces, can yield
538	   inconsistent results.  For example, even when a suffix search list is
539	   only used when an application provides a name containing no dots, two
540	   clients with different configured suffix search lists can get
541	   different answers, and the same client could get different answers at
542	   different times if it changes its configuration (e.g., when moving to
543	   another network).  A deeper discussion of this topic is outside the
544	   scope of this document.

546	3.1.  Examples

548	   Some examples of cases that can happen in existing implementations
549	   today (where {non-ASCII} below represents some user-entered non-ASCII
550	   string) are:
551	   1.  User types in {non-ASCII}.{non-ASCII}.com, and the application
552	       passes it, in the form of a UTF-8 string, to getaddrinfo or
553	       gethostbyname or equivalent.
554	       *  The DNS resolver passes the (UTF-8) string unmodified to a DNS
555	          server.
556	   2.  User types in {non-ASCII}.{non-ASCII}.com, and the application
557	       passes it to a name resolution API that accepts strings in some
558	       other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
559	       *  The name resolution API decides to pass the string to DNS (and
560	          possibly other protocols).
561	       *  The DNS resolver converts the name from UTF-16 to UTF-8 and
562	          passes the query to a DNS server.
563	   3.  User types in {non-ASCII}.{non-ASCII}.com, but the application
564	       first converts it to Punycode such that the name that is passed
565	       to name resolution APIs is (say)
566	       xn--e1afmkfd.xn--80akhbyknj4f.com.
567	       *  The name resolution API decides to pass the string to DNS (and
568	          possibly other protocols).
569	       *  The DNS resolver passes the string unmodified to a DNS server.
570	       *  If the name is not found in DNS, the name resolution API
571	          decides to try another protocol, say mDNS.
572	       *  The query goes out in mDNS, but since mDNS specified that
573	          names are to be registered in UTF-8, the name isn't found
574	          since it was Punycode encoded in the query.
575	   4.  User types in {non-ASCII}, and the application passes it, in the
576	       form of a UTF-8 string, to getaddrinfo or equivalent.
578	       *  The name resolution API decides to pass the string to DNS (and
579	          possibly other protocols).
580	       *  The DNS resolver will append suffixes in the suffix search
581	          list, which may contain UTF-8 characters if the local network
582	          uses a private name space.
583	       *  Each FQDN in turn will then be sent in a query to a DNS
584	          server, until one succeeds.
585	   5.  User types in {non-ASCII}, but the application first converts it
586	       to Punycode, such that the name that is passed to getaddrinfo or
587	       equivalent is (say) xn--e1afmkfd.
588	       *  The name resolution API decides to pass the string to DNS (and
589	          possibly other protocols).
590	       *  The DNS stub resolver will append suffixes in the suffix
591	          search list, which may contain UTF-8 characters if the local
592	          network uses a private name space, resulting in (say)
593	          xn--e1afmkfd.{non-ASCII}.com.
594	       *  Each FQDN in turn will then be sent in a query to a DNS
595	          server, until one succeeds.
596	       *  Since the private name space in this case uses UTF-8, the
597	          above queries fail, since the Punycode version of the name was
598	          not registered in that name space.
599	   6.  User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
600	       {non-ASCII3}.com is a public name space using Punycode, but {non-
601	       ASCII2}.{non-ASCII3}.com is a private name space using UTF-8,
602	       which is accessible to the user.  The application passes the
603	       name, in the form of a UTF-8 string, to getaddrinfo or
604	       equivalent.
605	       *  The name resolution API decides to pass the string to DNS (and
606	          possibly other protocols).
607	       *  The DNS resolver tries to locate the authoritative server, but
608	          fails the lookup because it cannot find a server for the UTF-8
609	          encoding of {non-ASCII3}.com, even though it would have access
610	          to the private name space.  (To make this work, the private
611	          name space would need to include the UTF-8 encoding of {non-
612	          ASCII3}.com.)
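
   The failure in examples 3 and 5 can be reproduced with a toy model.
   In this sketch a dictionary stands in for a private name space whose
   records were registered in UTF-8.  The name, the address, and the
   lookup function are hypothetical, and real DNS matching (for
   instance ASCII case folding) is deliberately left out.

```python
from typing import Optional

# Toy private name space: keys are the bytes exactly as registered.
private_zone = {"bücher.local".encode("utf-8"): "192.0.2.10"}


def lookup(zone: dict[bytes, str], wire_name: bytes) -> Optional[str]:
    """Byte-for-byte match with no encoding conversion."""
    return zone.get(wire_name)


user_input = "bücher.local"

# Application that passes the name through as UTF-8.
via_utf8_app = lookup(private_zone, user_input.encode("utf-8"))

# Application that converts to Punycode before resolving.
via_punycode_app = lookup(private_zone, user_input.encode("idna"))

print(via_utf8_app)      # 192.0.2.10
print(via_punycode_app)  # None
```

   Both applications received the same user input, but only the one
   that left the encoding decision to the lower layer finds the record.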

614	   When users use multiple applications, some of which do Punycode
615	   conversion prior to passing a name to name resolution APIs, and some
616	   of which do not, odd behavior can result which at best violates the
617	   principle of least surprise, and at worst can result in security
618	   vulnerabilities.

620	   First consider two competing applications, such as web browsers, that
621	   are designed to achieve the same task.  If the user types the same
622	   name into each browser, one may successfully resolve the name (and
623	   hence access the desired content) because the encoding scheme was
624	   correct, while the other may fail name resolution because the
625	   encoding scheme was incorrect.  This gives users an incentive to
626	   switch to another application (which in some cases means switching to
627	   an IDNA application, and in other cases means switching away from an
628	   IDNA application).

630	   Next consider two separate applications where one is designed to be
631	   launched from the other, for example a web browser launching a media
632	   player application when the link to a media file is clicked.  If both
633	   types of content (web pages and media files in this example) are
634	   hosted at the same IDN in a private name space, but one application
635	   converts to Punycode before calling name resolution APIs and the
636	   other does not, the user may be able to access a web page, click on
637	   the media file causing the media player to launch and attempt to
638	   retrieve the media file, which will then fail because the IDN
639	   encoding scheme was incorrect.  Even worse, if an attacker was able
640	   to register the same name in the other encoding scheme, the user may
641	   get content from the attacker's machine.  This is similar to a normal
642	   phishing attack, except that the two names represent exactly the same
643	   Unicode characters.
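
   A sketch of the attack just described, again using a dictionary as a
   stand-in for a name space.  The name and both addresses are
   hypothetical; the point is only that one Unicode name can map to two
   distinct records when each encoding is registered separately.

```python
name = "bücher.example"  # hypothetical IDN in a private name space

zone = {
    # Legitimate record, registered in UTF-8.
    name.encode("utf-8"): "192.0.2.10",
    # Record registered by an attacker under the Punycode form.
    name.encode("idna"): "203.0.113.66",
}

# The browser passes UTF-8; the media player converts to Punycode first.
browser_answer = zone[name.encode("utf-8")]
media_player_answer = zone[name.encode("idna")]

# Same characters on screen, different destinations.
assert browser_answer != media_player_answer
```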

645	4.  Recommendations

647	   Taking into account the issues above, it would seem inappropriate for
648	   an application to convert a name to Punycode when it does not know
649	   whether DNS will be used by the name resolution library, or whether
650	   the name exists in a private name space that uses UTF-8, or in the
651	   global DNS that uses Punycode.

653	   Instead, conversion to Punycode, UTF-8, or whatever other encoding,
654	   should be done only by an entity that knows which protocol will be
655	   used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
656	   the name to DNS), rather than by general applications that call
657	   protocol-independent name resolution APIs.  (Of course, it is still
658	   necessary for applications to convert to whatever form those APIs
659	   expect.)  Similarly, even when DNS is used, the conversion to
660	   Punycode should be done only by an entity that knows which name space
661	   will be used.

663	   That is, a more intelligent DNS resolver would be more liberal in
664	   what it would accept from an application and be able to query for
665	   both a Punycode name (e.g., over the Internet) and a UTF-8 name
666	   (e.g., over a corporate network with a private name space) in case
667	   the server only recognized one.  However, we might also take into
668	   account that the various resolution behaviors discussed earlier could
669	   also occur with record updates (e.g., with Dynamic Update [RFC2136]),
670	   resulting in some names being registered in a local network's private
671	   name space by applications doing Punycode conversion, and other names
672	   being registered using UTF-8.  Hence a name might have to be queried
673	   with both encodings to be sure to succeed without changes to DNS
674	   servers.
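
   One way to picture such a resolver is sketched below.  It derives
   both wire forms from whatever the application supplied and queries
   each in turn.  The function names, the query callback, the order of
   the attempts, and the toy zones are all assumptions; this document
   does not specify an algorithm, and a real resolver would also have
   to choose which name space to query.

```python
from typing import Callable, Optional


def wire_forms(name: str) -> list[bytes]:
    """Return the UTF-8 and Punycode forms of a name, without duplicates."""
    ace = name.encode("idna")  # leaves an all-ASCII name unchanged
    utf8 = ace.decode("idna").encode("utf-8")
    return list(dict.fromkeys([utf8, ace]))


def resolve_liberally(
    name: str, query: Callable[[bytes], Optional[str]]
) -> Optional[str]:
    """Try each wire form until one of them yields an answer."""
    for wire_name in wire_forms(name):
        answer = query(wire_name)
        if answer is not None:
            return answer
    return None


# Toy name spaces with hypothetical names and addresses.
private_zone = {"bücher.corp".encode("utf-8"): "192.0.2.10"}
public_zone = {"bücher.example".encode("idna"): "198.51.100.7"}

# An application that already converted to Punycode still succeeds
# against the UTF-8 private name space.
print(resolve_liberally("xn--bcher-kva.corp", private_zone.get))

# An application that passed Unicode succeeds against the public form.
print(resolve_liberally("bücher.example", public_zone.get))
```

   Trying more than one form is itself a kind of searching, so the
   cautions about inconsistent results noted in Section 3 apply to the
   order in which the forms are tried.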

676	   Similarly, a more intelligent stub resolver would also be more
677	   liberal in what it would accept from a response as the value of a
678	   record (e.g., PTR) in that it would accept either UTF-8 or Punycode
679	   and convert them to whatever encoding is used by the application APIs
680	   to return strings to applications.
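
   A minimal sketch of that receiving side follows.  It assumes that
   any label beginning with "xn--" is Punycode and that everything else
   is UTF-8, and it handles each label separately so that names with
   mixed encodings are also accepted.  The function name is
   hypothetical, and malformed input simply raises an error here.

```python
def name_from_response(wire_name: bytes) -> str:
    """Return a Unicode name whether the labels are UTF-8 or Punycode."""
    text = wire_name.decode("utf-8")  # ASCII Punycode labels are valid UTF-8
    labels = []
    for label in text.split("."):
        lowered = label.lower()
        if label.isascii() and lowered.startswith("xn--"):
            # Convert the Punycode label back to Unicode.
            label = lowered.encode("ascii").decode("idna")
        labels.append(label)
    return ".".join(labels)


print(name_from_response(b"xn--bcher-kva.example"))
print(name_from_response("bücher.example".encode("utf-8")))
```

   Both calls return the same Unicode string, which the API can then
   convert to whatever encoding it hands back to the application.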

682	   Indeed the choice of conversion within the resolver libraries is
683	   consistent with the quote from section 6.2 of the IDNA specification
684	   [RFC3490] stating that Punycode conversion "might be performed inside
685	   these new versions of the resolver libraries".

687	   That said, some application-layer protocols may be defined to use
688	   Punycode rather than the UTF-8 recommended by the IETF character sets
689	   and languages policy [RFC2277].  In this case, an application may
690	   receive a Punycode name and want to pass it to name resolution APIs.
691	   Again the recommendation that a resolver library be more liberal in
692	   what it would accept from an application would mean that such a name
693	   would be accepted and re-encoded as needed, rather than requiring the
694	   application to do so.

696	   Finally, the question remains about what, if anything, a DNS server
697	   should do to handle cases where some existing applications or hosts
698	   do Punycode queries within the local network using a private name
699	   space, and other existing applications or hosts send UTF-8 queries.
700	   It is undesirable to store different records for different encodings
701	   of the same name, since this introduces the possibility for
702	   inconsistency between them.  Instead, a new DNS server serving a
703	   private name space using UTF-8 could potentially treat encoding
704	   conversion in the same way as case-insensitive comparison, which a
705	   DNS server is already required to do, as long as the server has a way
706	   to know what the encoding is.  Two encodings are, in this sense, two
707	   representations of the same name, just as two case-different strings
708	   are.  However, whereas case comparison of non-ASCII characters is
709	   complicated by ambiguities (as explained in the IAB's Review and
710	   Recommendations for Internationalized Domain Names [RFC4690]),
711	   encoding conversion between Punycode and UTF-8 is unambiguous.
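
   The idea of treating encoding conversion like case folding can be
   sketched as a single canonical lookup key.  The sketch assumes that
   the server knows non-ASCII bytes in a query are UTF-8, folds case
   only for ASCII characters, and performs no normalization or mapping.
   The function name, the record, and the address are hypothetical.

```python
def canonical_key(wire_name: bytes) -> bytes:
    """Map the UTF-8 and Punycode forms of a name to one lookup key."""
    text = wire_name.decode("utf-8")  # assumption: non-ASCII bytes are UTF-8
    labels = []
    for label in text.split("."):
        if label.isascii():
            # Existing behavior: ASCII labels compare case-insensitively.
            labels.append(label.lower().encode("ascii"))
        else:
            # Convert the UTF-8 label to its "xn--" Punycode form.
            labels.append((b"xn--" + label.encode("punycode")).lower())
    return b".".join(labels)


# One stored record, whichever encoding was used to register it.
records = {canonical_key("bücher.corp".encode("utf-8")): "192.0.2.10"}

utf8_query = "bücher.corp".encode("utf-8")
punycode_query = b"XN--bcher-kva.CORP"

print(records.get(canonical_key(utf8_query)))
print(records.get(canonical_key(punycode_query)))
```

   Both queries reach the same record, so the zone does not need a
   separate copy of the data for each encoding.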

713	   [EDITOR'S NOTE: There are also normalization/mapping issues.
714	   Currently we only explore encoding issues.]

716	5.  Security Considerations

718	   Having applications convert names to Punycode before calling name
719	   resolution can result in security vulnerabilities.  If the name is
720	   resolved by protocols or in zones for which records are registered
721	   using other encoding schemes, an attacker can claim the Punycode
722	   version of the same name and hence trick the victim into accessing a
723	   different destination.  This can be done for any non-ASCII name, even
724	   when there is no possible confusion due to case, language, or other
725	   issues.  Other types of confusion beyond those resulting simply from
726	   the choice of encoding scheme are discussed in "Review and
727	   Recommendations for IDNs" [RFC4690].

729	   Designers and users of encodings that represent Unicode strings in
730	   terms of ASCII should also consider whether trademark protection is
731	   an issue, e.g., if one name would be encoded in a way that would be
732	   naturally associated with another organization, such as
733	   xn--rfc-editor.

735	6.  IANA Considerations

737	   [RFC Editor: please remove this section prior to publication.]

739	   This document has no IANA Actions.

741	7.  IAB Members at the time of this writing

743	   Marcelo Bagnulo
744	   Gonzalo Camarillo
745	   Stuart Cheshire
746	   Vijay Gill
747	   Russ Housley
748	   John Klensin
749	   Olaf Kolkman
750	   Gregory Lebovitz
751	   Andrew Malis
752	   Danny McPherson
753	   David Oran
754	   Jon Peterson
755	   Dave Thaler

757	8.  References

759	8.1.  Normative References

761	   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
762	              5.1.0", 2008.

764	              defined by: The Unicode Standard, Version 5.0, Boston, MA,
765	              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
766	              Unicode 5.1.0
767	              (http://www.unicode.org/versions/Unicode5.1.0/).

769	8.2.  Informative References

771	   [I-D.cheshire-dnsext-multicastdns]
772	              Cheshire, S. and M. Krochmal, "Multicast DNS",
773	              draft-cheshire-dnsext-multicastdns-08 (work in progress),
774	              September 2009.

776	   [I-D.ietf-idn-punycode-00]
777	              Costello, A., "Punycode version 0.3.3",
778	              draft-ietf-idn-punycode-00 (work in progress), July 2002.

780	   [I-D.skwan-utf8-dns-00]
781	              Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in
782	              the Domain Name System", draft-skwan-utf8-dns-00 (work in
783	              progress), November 1997.

785	   [IDNA2008-Defs]
786	              Klensin, J., "Internationalized Domain Names for
787	              Applications (IDNA): Definitions and Document Framework",
788	              August 2009, <https://datatracker.ietf.org/drafts/
789	              draft-ietf-idnabis-defs/>.

791	   [MJD]      Duerst, M., "The Properties and Promises of UTF-8", 11th
792	              International Unicode Conference, San Jose,
793	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
794	              papers/PDF/IUC11-UTF-8.pdf>.

796	   [NIS]      Sun Microsystems, "System and Network Administration",
797	              March 1990.

799	   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10,
800	              RFC 821, August 1982.

802	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
803	              host table specification", RFC 952, October 1985.

805	   [RFC1001]  NetBIOS Working Group, "Protocol standard for a NetBIOS
806	              service on a TCP/UDP transport: Concepts and methods",
807	              STD 19, RFC 1001, March 1987.

809	   [RFC1002]  NetBIOS Working Group, "Protocol standard for a NetBIOS
810	              service on a TCP/UDP transport: Detailed specifications",
811	              STD 19, RFC 1002, March 1987.

813	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
814	              STD 13, RFC 1034, November 1987.

816	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
817	              and Support", STD 3, RFC 1123, October 1989.

819	   [RFC1468]  Murai, J., Crispin, M., and E. van der Poel, "Japanese
820	              Character Encoding for Internet Messages", RFC 1468,
821	              June 1993.

823	   [RFC1536]  Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
824	              Miller, "Common DNS Implementation Errors and Suggested
825	              Fixes", RFC 1536, October 1993.

827	   [RFC2136]  Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
828	              "Dynamic Updates in the Domain Name System (DNS UPDATE)",
829	              RFC 2136, April 1997.

831	   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
832	              Specification", RFC 2181, July 1997.

834	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
835	              Languages", BCP 18, RFC 2277, January 1998.

837	   [RFC3397]  Aboba, B. and S. Cheshire, "Dynamic Host Configuration
838	              Protocol (DHCP) Domain Search Option", RFC 3397,
839	              November 2002.

841	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
842	              "Internationalizing Domain Names in Applications (IDNA)",
843	              RFC 3490, March 2003.

845	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
846	              for Internationalized Domain Names in Applications
847	              (IDNA)", RFC 3492, March 2003.

849	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
850	              Stevens, "Basic Socket Interface Extensions for IPv6",
851	              RFC 3493, February 2003.

853	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
854	              10646", STD 63, RFC 3629, November 2003.

856	   [RFC3646]  Droms, R., "DNS Configuration options for Dynamic Host
857	              Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
858	              December 2003.

860	   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
861	              Recommendations for Internationalized Domain Names
862	              (IDNs)", RFC 4690, September 2006.

864	   [RFC4795]  Aboba, B., Thaler, D., and L. Esibov, "Link-local
865	              Multicast Name Resolution (LLMNR)", RFC 4795,
866	              January 2007.

868	   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
869	              October 2008.

871	Authors' Addresses

873	   Dave Thaler
874	   Microsoft Corporation
875	   One Microsoft Way
876	   Redmond, WA  98052
877	   USA

879	   Phone: +1 425 703 8835
880	   Email: dthaler@microsoft.com

882	   John C Klensin
883	   1770 Massachusetts Ave, Ste 322
884	   Cambridge, MA  02140

886	   Phone: +1 617 245 1457
887	   Email: john+ietf@jck.com

889	   Stuart Cheshire
890	   Apple Inc.
891	   1 Infinite Loop
892	   Cupertino, CA  95014

894	   Phone: +1 408 974 3207
895	   Email: cheshire@apple.com