idnits 2.17.1
draft-iab-idn-encoding-02.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
** The document seems to lack a both a reference to RFC 2119 and the
recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
keywords.
RFC 2119 keyword, line 380: '... Protocols MUST be able to use th...'
RFC 2119 keyword, line 383: '... for all text. Protocols MAY specify,...'
RFC 2119 keyword, line 393: '... support MUST be possible....'
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust and authors Copyright Line does not
match the current year
-- The document date (May 14, 2010) is 5096 days in the past. Is this
intentional?
Checking references for intended status: Informational
----------------------------------------------------------------------------
-- Looks like a reference, but probably isn't: '10646' on line 382
== Missing Reference: 'BCP9' is mentioned on line 387, but not defined
== Outdated reference: A later version (-15) exists of
draft-cheshire-dnsext-multicastdns-11
== Outdated reference: A later version (-02) exists of
draft-ietf-idn-punycode-00
== Outdated reference: A later version (-06) exists of
draft-skwan-utf8-dns-00
-- Obsolete informational reference (is this intentional?): RFC 821
(Obsoleted by RFC 2821)
-- Obsolete informational reference (is this intentional?): RFC 3490
(Obsoleted by RFC 5890, RFC 5891)
Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 4 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group D. Thaler
3 Internet-Draft Microsoft
4 Intended status: Informational J. Klensin
5 Expires: November 15, 2010
6 S. Cheshire
7 Apple
8 May 14, 2010
10 IAB Thoughts on Encodings for Internationalized Domain Names
11 draft-iab-idn-encoding-02.txt
13 Abstract
15 This document explores issues with Internationalized Domain Names
16 (IDNs) that result from the use of various encoding schemes such as
17 UTF-8 and the ASCII-Compatible Encoding produced by the Punycode
18 algorithm. It focuses on the importance of agreeing on a canonical
19 format and how complicated it ends up being as a result of using
20 different encodings today.
22 Status of this Memo
24 This Internet-Draft is submitted in full conformance with the
25 provisions of BCP 78 and BCP 79.
27 Internet-Drafts are working documents of the Internet Engineering
28 Task Force (IETF). Note that other groups may also distribute
29 working documents as Internet-Drafts. The list of current Internet-
30 Drafts is at http://datatracker.ietf.org/drafts/current/.
32 Internet-Drafts are draft documents valid for a maximum of six months
33 and may be updated, replaced, or obsoleted by other documents at any
34 time. It is inappropriate to use Internet-Drafts as reference
35 material or to cite them other than as "work in progress."
37 This Internet-Draft will expire on November 15, 2010.
39 Copyright Notice
41 Copyright (c) 2010 IETF Trust and the persons identified as the
42 document authors. All rights reserved.
44 This document is subject to BCP 78 and the IETF Trust's Legal
45 Provisions Relating to IETF Documents
46 (http://trustee.ietf.org/license-info) in effect on the date of
47 publication of this document. Please review these documents
48 carefully, as they describe your rights and restrictions with respect
49 to this document. Code Components extracted from this document must
50 include Simplified BSD License text as described in Section 4.e of
51 the Trust Legal Provisions and are provided without warranty as
52 described in the Simplified BSD License.
54 Table of Contents
56 1. Introduction
57 1.1. APIs
58 2. Use of Non-DNS Protocols
59 3. Use of Non-ASCII in DNS
60 3.1. Examples
61 4. Recommendations
62 5. Acknowledgements
63 6. Security Considerations
64 7. IANA Considerations
65 8. IAB Members at the time of publication
66 9. References
67 9.1. Normative References
68 9.2. Informative References
69 Authors' Addresses
71 1. Introduction
73 The goal of this document is to explore what can be learned from some
74 current difficulties in implementing Internationalized Domain Names
75 (IDNs).
77 A domain name consists of a set of labels, conventionally written
78 separated with dots. An Internationalized Domain Name (IDN) is a
79 domain name that contains one or more labels that, in turn, contain
80 one or more non-ASCII characters. Just as with plain ASCII domain
81 names, each IDN label must be encoded using some mechanism before it
82 can be transmitted in network packets, stored in memory, stored on
83 disk, etc. These encodings need to be reversible, but they need not
84 store domain names the same way humans conventionally write them on
85 paper. For example, when transmitted over the network in DNS
86 packets, domain name labels are *not* separated with dots.
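
As a minimal sketch of the point above, the following Python fragment shows how a dotted ASCII name could be turned into the length-prefixed label sequence carried in DNS packets. The function name to_wire is hypothetical, and the sketch deliberately omits length checks and any handling of non-ASCII labels.

```python
def to_wire(name: str) -> bytes:
    """Encode a dotted ASCII domain name as length-prefixed labels."""
    stripped = name.rstrip(".")
    if not stripped:
        return b"\x00"          # the root name is a single zero octet
    out = bytearray()
    for label in stripped.split("."):
        raw = label.encode("ascii")
        out.append(len(raw))    # one length octet precedes each label
        out += raw              # the label octets follow, with no dot
    out.append(0)               # zero-length root label ends the name
    return bytes(out)


wire = to_wire("www.example.com")
print(wire)
```

The dots a human writes never appear in the encoded form; each one is replaced by the length octet of the label that follows it.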
88 IDNA, discussed later in this document, is the standard that defines
89 the use and coding of internationalized domain names for use on the
90 public Internet. It is described as "Internationalizing Domain Names
91 in Applications (IDNA)" and is defined in several documents.
92 Definitions for the current version and a roadmap of related
93 documents appear in [IDNA2008-Defs]. An earlier version of IDNA
94 [RFC3490] is now being phased out. Except where noted, the two
95 versions are approximately the same with regard to the issues
96 discussed in this document. However, some explanations appeared in
97 the earlier documents that did not seem useful when the revision was
98 created; they are quoted here from the documents in which they
99 appear. In addition, the terminology of the two versions differs
100 somewhat; this document reflects the terminology of the current
101 version.
103 Unicode [Unicode] is a list of characters (including non-spacing
104 marks that are used to form some other characters), where each
105 character is assigned an integer value, called a code point. In
106 simple terms a Unicode string is a string of integer code point
107 values in the range 0 to 1,114,111 (10FFFF in base 16), which
108 represent a string of Unicode characters. These integer code points
109 must be encoded using some mechanism before they can be transmitted
110 in network packets, stored in memory, stored on disk, etc. Some
111 common ways of encoding these integer code point values in computer
112 systems include UTF-8, UTF-16, and UTF-32. In addition to the
113 material below, those forms and the tradeoffs among them are
114 discussed in Chapter 2 of The Unicode Standard [Unicode].
116 UTF-8 [RFC3629] is a mechanism for encoding a Unicode code point in a
117 variable number of 8-bit octets, where an ASCII code point is
118 preserved as-is. Those octets encode a string of integer code point
119 values, which represent a string of Unicode characters.
121 UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code
122 point in one or two 16-bit integers, described in detail in Sections
123 3.9 and 3.10 of The Unicode Standard [Unicode]. A UTF-16 string
124 encodes a string of integer code point values that represent a string
125 of Unicode characters.
127 UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and
128 3.10, is a mechanism for encoding a Unicode code point in a single
129 32-bit integer. A UTF-32 string is thus a string of 32-bit integer
130 code point values, which represent a string of Unicode characters.
132 Note that UTF-16 results in some all-zero octets when code points
133 occur early in the Unicode sequence, and UTF-32 always has all-zero
134 octets.
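
The difference among the three encoding forms can be illustrated with a short Python sketch. The sample string is an arbitrary choice, and big-endian byte order is selected only to make the output deterministic.

```python
text = "a\u00e9"   # 'a' followed by U+00E9 (e with acute accent)

utf8 = text.encode("utf-8")        # 61 c3 a9
utf16 = text.encode("utf-16-be")   # 00 61 00 e9
utf32 = text.encode("utf-32-be")   # 00 00 00 61 00 00 00 e9

for form, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    # Count the all-zero octets that each encoding form produces.
    print(form, data.hex(" "), "zero octets:", data.count(0))
```

Only the UTF-8 form is free of all-zero octets for this string, which matters later when strings pass through interfaces that treat a zero octet as a terminator.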
136 IDNA specifies validity of a label, such as what characters it can
137 contain, relationships among them, and so on, in Unicode terms.
138 Valid labels can take either of two forms, with the appropriate one
139 determined by particular protocols or by context. One of those
140 forms, called a U-label, is a direct representation of the Unicode
141 characters using one of the encoding forms discussed above. This
142 document discusses UTF-8 strings in many places. While all U-labels
143 can be represented by UTF-8 strings, not all UTF-8 strings are valid
144 U-labels (see Section 2.3.2 of [IDNA2008-Defs] for a discussion of
145 these distinctions). The other, called an A-label, uses a
146 compressed, ASCII-compatible encoding (an "ACE" in IDNA and other
147 terminology) produced by an algorithm called Punycode. U-labels and
148 A-labels are duals of each other: transformations from one to the
149 other do not lose information. The transformation mechanisms are
150 specified in [IDNA2008-Protocol].
152 Punycode [RFC3492] is thus a mechanism for encoding a Unicode string
153 in an ASCII-compatible encoding, i.e., using only letters, digits,
154 and hyphens from the ASCII character set. When a Unicode label that
155 is valid under the IDNA rules (a U-label) is encoded with Punycode
156 for IDNA purposes, it is prefixed with "xn--"; the result is called
157 an A-label. The prefix convention assumes that no other DNS labels
158 (at least no other DNS labels in IDNA-aware applications) are allowed
159 to start with these four characters. Consequently, when A-label
160 encoding is assumed, any DNS labels beginning with "xn--" now have a
161 different meaning (the Punycode encoding of a label containing one or
162 more non-ASCII characters) or no defined meaning at all (in the case
163 of labels that are not IDNA-compliant, i.e., are not well-formed
164 A-labels).
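
The relationship between a U-label and its A-label can be sketched with Python's built-in "punycode" codec, which implements the algorithm of [RFC3492]. The helper names to_a_label and from_a_label are hypothetical, and the sketch performs none of the IDNA validity checks or mappings; it only applies the Punycode algorithm and the "xn--" prefix.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a label and add the ACE prefix (no IDNA checks)."""
    if u_label.isascii():
        raise ValueError("label has no non-ASCII characters")
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def from_a_label(a_label: str) -> str:
    """Strip the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("label does not start with the ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


label = "b\u00fccher"              # "bucher" with u-umlaut
encoded = to_a_label(label)
print(encoded, from_a_label(encoded) == label)
```

The round trip recovers the original label, which is the sense in which the two forms are duals of each other.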
166 ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII
167 and Japanese characters, where an ASCII character is preserved as-is.
168 ISO-2022-JP is stateful: special sequences are used to switch between
169 character coding tables.
171 Comparison of Unicode strings is not as easy as comparing, for example,
172 ASCII strings. First, there are a multitude of ways of representing
173 a string of Unicode characters. Second, in many languages and
174 scripts, the actual definition of "same" is very context-dependent.
175 Because of this, comparison of two Unicode strings must take into
176 account how the Unicode strings are encoded. Regardless of the
177 encoding, however, comparison cannot simply be done by comparing the
178 encoded Unicode strings byte by byte. The only time that is possible
179 is when the strings both are mapped into some canonical format and
180 encoded the same way.
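
A small Python sketch of why byte-by-byte comparison fails follows. It assumes Unicode Normalization Form C as the canonical format purely for illustration; this document does not prescribe a particular form, and the helper name same_after_nfc is hypothetical.

```python
import unicodedata

composed = "\u00e9"       # e with acute accent as a single code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent

# Both render identically, yet their encoded bytes differ.
print(composed.encode("utf-8"), decomposed.encode("utf-8"))


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after mapping both to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed, same_after_nfc(composed, decomposed))
```

Once both strings are mapped to the same form and encoded the same way, a simple comparison gives the expected answer.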
182 This document focuses on the importance of agreeing on a canonical
183 format and how complicated it ends up being as a result of using
184 different encodings today.
186 Different applications, APIs, and protocols use different encoding
187 schemes today. Historically, many of them were originally defined to
188 use only ASCII. Internationalizing Domain Names in Applications
189 (IDNA) [IDNA2008-Defs] defined a mechanism that required changes to
190 applications but, in an attempt not to change APIs or servers, specified
191 that the A-label format is to be used in many contexts. In some ways
192 this could be seen as not changing the existing APIs, in the sense
193 that the strings being passed to and from the APIs were still
194 apparently ASCII strings. In other ways it was a very profound
195 change to the existing APIs, because while those strings were still
196 syntactically valid ASCII strings, they no longer meant the same
197 thing as they used to. What looked like a plain ASCII string to one
198 piece of software or library could be seen by another piece of
199 software or library (with the application of out-of-band information)
200 to be in fact an encoding of a Unicode string.
202 Section 1.3 of the original IDNA specification [RFC3490] states:
204 The IDNA protocol is contained completely within applications. It
205 is not a client-server or peer-to-peer protocol: everything is
206 done inside the application itself. When used with a DNS resolver
207 library, IDNA is inserted as a "shim" between the application and
208 the resolver library. When used for writing names into a DNS
209 zone, IDNA is used just before the name is committed to the zone.
211 Figure 1 depicts a simplistic architecture that a naive reader might
212 assume from the paragraph quoted above. (A variant of this same
213 picture appears in Section 6 of the IDNA specification [RFC3490],
214 further strengthening this assumption.)
215    +-----------------------------------------+
216    |Host                                     |
217    |             +-------------+             |
218    |             | Application |             |
219    |             +------+------+             |
220    |                    |                    |
221    |               +----+----+               |
222    |               |   DNS   |               |
223    |               | Resolver|               |
224    |               | Library |               |
225    |               +----+----+               |
226    |                    |                    |
227    +-----------------------------------------+
228                         |
229                _________|_________
230               /                   \
231              /                     \
232             /                       \
233            |         Internet        |
234             \                       /
235              \                     /
236               \___________________/
238 Simplistic Architecture
240 Figure 1
242 There are, however, two problems with this simplistic architecture
243 that cause it to differ from reality.
245 First, resolver APIs on Operating Systems (OSs) today (MacOS,
246 Windows, Linux, etc.) are not DNS-specific. They typically provide a
247 layer of indirection so that the application can work independent of
248 the name resolution mechanism, which could be DNS, mDNS
249 [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP
250 [RFC1001][RFC1002], etc/hosts file [RFC0952], NIS [NIS], or anything
251 else. For example, "Basic Socket Interface Extensions for IPv6"
252 [RFC3493] specifies the getaddrinfo() API and contains many phrases
253 like "For example, when using the DNS" and "any type of name
254 resolution service (for example, the DNS)". Importantly, DNS is
255 mentioned only as an example, and the application has no knowledge as
256 to whether DNS or some other protocol will be used.
258 Second, even with the DNS protocol, private name spaces (sometimes
259 including private uses of the DNS) do not necessarily use the same
260 character set encoding scheme as the public Internet name space.
262 We will discuss each of the above issues in subsequent sections. For
263 reference, Figure 2 depicts a more realistic architecture on typical
264 hosts today (which don't have IDNA inserted as a shim immediately
265 above the DNS resolver library). More generally, the host may be
266 attached to one or more local networks, each of which may or may not
267 be connected to the public Internet and may or may not have a private
268 name space.
270    +-----------------------------------------+
271    |Host                                     |
272    |             +-------------+             |
273    |             | Application |             |
274    |             +------+------+             |
275    |                    |                    |
276    |             +------+------+             |
277    |             |   Generic   |             |
278    |             |    Name     |             |
279    |             | Resolution  |             |
280    |             |     API     |             |
281    |             +------+------+             |
282    |                    |                    |
283    |   +-----+------+---+--+-------+-----+   |
284    |   |     |      |      |       |     |   |
285    | +-+-++--+--++--+-++---+---++--+--++-+-+ |
286    | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
287    | +---++-----++----++-------++-----++---+ |
288    |                                         |
289    +-----------------------------------------+
290                         |
291                   ______|______
292                  /             \
293                 /               \
294                /      local      \
295                \     network     /
296                 \               /
297                  \_____________/
298                         |
299                _________|_________
300               /                   \
301              /                     \
302             /                       \
303            |         Internet        |
304             \                       /
305              \                     /
306               \___________________/
308 Realistic Architecture
310 Figure 2
312 1.1. APIs
314 Section 6.2 of the original IDNA specification [RFC3490] states
315 (where ToASCII and ToUnicode below refer to conversions using the
316 Punycode algorithm):
318 It is expected that new versions of the resolver libraries in the
319 future will be able to accept domain names in other charsets than
320 ASCII, and application developers might one day pass not only
321 domain names in Unicode, but also in local script to a new API for
322 the resolver libraries in the operating system. Thus the ToASCII
323 and ToUnicode operations might be performed inside these new
324 versions of the resolver libraries.
326 Resolver APIs such as getaddrinfo() and its predecessor
327 gethostbyname() were defined to accept "char *" arguments, meaning
328 they accept a string of bytes, terminated with a NULL (0) byte.
329 Because of the use of a NULL octet as a string terminator, this is
330 sufficient for ASCII strings (including A-labels) and even
331 ISO-2022-JP and UTF-8 strings (unless an implementation artificially
332 precludes them), but not UTF-16 or UTF-32 strings because a NULL
333 octet could appear in the middle of strings using these
334 encodings. Several operating systems historically used in Japan will
335 accept (and expect) ISO-2022-JP strings in such APIs. Some platforms
336 used worldwide also have new versions of the APIs (e.g.,
337 GetAddrInfoW() on Windows) that accept other encoding schemes such as
338 UTF-16.
340 It is worth noting that an API using "char *" arguments can
341 distinguish between conventional ASCII "host name" labels, A-labels,
342 ISO-2022-JP, and UTF-8 labels in names if the coding is known to be
343 one of those four. An example method is as follows:
344 o if the label contains an ESC (0x1B) byte the label is ISO-2022-JP;
345 otherwise,
346 o if any byte in the label has the high bit set, the label is UTF-8;
347 otherwise,
348 o if the label starts with "xn--" then it is presumed to be an
349 A-label; otherwise,
350 o the label is ASCII.
351 Again this assumes that neither ASCII labels nor UTF-8 strings ever
352 start with "xn--", and also that UTF-8 strings never contain an ESC
353 character. Also the above is merely an illustration; UTF-8 can be
354 detected and distinguished from other 8-bit encodings with good
355 accuracy [MJD].
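
The example method above can be sketched in Python as follows. The function name classify_label is hypothetical, and matching the "xn--" prefix case-insensitively is an added assumption rather than something stated in the list; like the method itself, the sketch is only valid when the coding is known to be one of the four.

```python
def classify_label(label: bytes) -> str:
    """Apply the four-step heuristic to one label given as raw bytes."""
    if b"\x1b" in label:                    # an ESC byte is present
        return "ISO-2022-JP"
    if any(octet & 0x80 for octet in label):  # some byte has the high bit set
        return "UTF-8"
    if label[:4].lower() == b"xn--":        # ACE prefix (case-insensitive here)
        return "A-label"
    return "ASCII"


samples = [
    "\u65e5\u672c".encode("iso2022_jp"),    # Japanese text in ISO-2022-JP
    "b\u00fccher".encode("utf-8"),
    b"xn--bcher-kva",
    b"example",
]
for sample in samples:
    print(sample, classify_label(sample))
```

The order of the tests matters: ISO-2022-JP uses only 7-bit bytes, so the ESC check has to come before the prefix and plain-ASCII cases.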
357 It is more difficult or impossible to distinguish the ISO 8859
358 character sets from each other, because they differ in up to about 90
359 characters which have exactly the same encodings, and a short string
360 is very unlikely to contain enough characters to allow a receiver to
361 deduce the character set. Similarly, it is not possible in general
362 to distinguish between ISO-2022-JP and any other encoding based on
363 ISO 2022 code table switching.
365 Although it is possible (as in the example above) to distinguish some
366 encodings when not explicitly specified, it is cleaner to have the
367 encodings specified explicitly, such as specifying UTF-16 for
368 GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
369 strings.
371 2. Use of Non-DNS Protocols
373 As noted earlier, typical name resolution libraries are not DNS-
374 specific. Furthermore, some protocols are defined to use encoding
375 forms other than IDNA A-labels. For example, mDNS
376 [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used.
377 Indeed, the IETF policy on character sets and languages [RFC2277]
378 states:
380 Protocols MUST be able to use the UTF-8 charset, which consists of
381 the ISO 10646 coded character set combined with the UTF-8
382 character encoding scheme, as defined in [10646] Annex R
383 (published in Amendment 2), for all text. Protocols MAY specify,
384 in addition, how to use other charsets or other character encoding
385 schemes for ISO 10646, such as UTF-16, but lack of an ability to
386 use UTF-8 is a violation of this policy; such a violation would
387 need a variance procedure ([BCP9] section 9) with clear and solid
388 justification in the protocol specification document before being
389 entered into or advanced upon the standards track. For existing
390 protocols or protocols that move data from existing datastores,
391 support of other charsets, or even using a default other than
392 UTF-8, may be a requirement. This is acceptable, but UTF-8
393 support MUST be possible.
395 Applications that convert an IDN to A-label form before calling
396 getaddrinfo() will incur name resolution failures if the Punycode-encoded
397 name is used directly in such protocols. Having libraries or
398 protocols to convert from A-labels to the encoding scheme defined by
399 the protocol (e.g., UTF-8) would require changes to APIs and/or
400 servers, which IDNA was intended to avoid.
402 As a result, applications have problems when they assume that non-ASCII
403 names are resolved using the public DNS and blindly convert them to
404 A-labels without knowing what protocol the name resolution library will
405 select. Furthermore, name resolution
406 libraries often try multiple protocols until one succeeds, because
407 they are defined to use a common name space. For example, the hosts
408 file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
409 common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
410 [RFC1034] section 2.1). This means that when an application passes a
411 name to be resolved, resolution may in fact be attempted using
412 multiple protocols, each with a potentially different encoding
413 scheme. For this to work successfully, the name must be converted to
414 the appropriate encoding scheme only after the choice is made to use
415 that protocol. In general, this cannot be done by the application
416 since the choice of protocol is not made by the application.
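
The idea of converting only after a protocol has been chosen can be sketched as below. The protocol identifiers "public-dns" and "mdns" and the function encode_for_protocol are hypothetical, invented for this illustration. Python's built-in "idna" codec implements the earlier IDNA version [RFC3490], which is assumed to be adequate for showing the shape of the logic.

```python
def encode_for_protocol(name: str, protocol: str) -> bytes:
    """Pick the on-the-wire encoding once the protocol is known."""
    if protocol == "public-dns":
        return name.encode("idna")      # A-labels for the public DNS
    if protocol == "mdns":
        return name.encode("utf-8")     # mDNS is specified to use UTF-8
    raise ValueError("unknown protocol: " + protocol)


name = "b\u00fccher.example"
# A resolution library may try several protocols in turn; each attempt
# encodes the same Unicode name differently.
for protocol in ("mdns", "public-dns"):
    print(protocol, encode_for_protocol(name, protocol))
```

In this arrangement the conversion lives below the generic name resolution API, where the protocol choice is actually made, rather than in the application.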
418 3. Use of Non-ASCII in DNS
420 A common misconception is that DNS only supports names that can be
421 expressed using letters, digits, and hyphens.
423 This misconception originally stemmed from the definition in 1985 of
424 an "Internet host name" (and net, gateway, and domain name) for use
425 in the "hosts" file [RFC0952]. An Internet host name was defined
426 therein as including only letters, digits, and hyphens, where upper
427 and lower case letters were to be treated as identical. The DNS
428 specification [RFC1034] section 3.5 entitled "Preferred name syntax"
429 then repeated this definition in 1987, saying that this "syntax will
430 result in fewer problems with many applications that use domain names
431 (e.g., mail, TELNET)".
433 The confusion was thus left as to whether the "preferred" name syntax
434 was a mandatory restriction in DNS, or merely "preferred".
436 The definition of an Internet host name was updated in 1989
437 ([RFC1123] section 2.1) to allow names starting with a digit (to
438 support IPv4 addresses in dotted-decimal form). Section 6.1 of
439 "Requirements for Internet Hosts -- Application and Support"
440 [RFC1123] discusses the use of DNS (and the hosts file) for resolving
441 host names to IP addresses and vice versa. This led to confusion as
442 to whether all names in DNS are "host names", or whether a "host
443 name" is merely a special case of a DNS name.
445 By 1997, things had progressed to a state where it was necessary to
446 clarify these areas of confusion. "Clarifications to the DNS
447 Specification" [RFC2181] section 11 states:
449 The DNS itself places only one restriction on the particular
450 labels that can be used to identify resource records. That one
451 restriction relates to the length of the label and the full name.
452 The length of any one label is limited to between 1 and 63 octets.
453 A full domain name is limited to 255 octets (including the
454 separators). The zero length full name is defined as representing
455 the root of the DNS tree, and is typically written and displayed
456 as ".". Those restrictions aside, any binary string whatever can
457 be used as the label of any resource record. Similarly, any
458 binary string can serve as the value of any record that includes a
459 domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
460 and any others that may be added). Implementations of the DNS
461 protocols must not place any restrictions on the labels that can
462 be used.
464 Hence, it clarified that the restriction to letters, digits, and
465 hyphens does not apply to DNS names in general, nor to records that
466 include "domain names". Hence the "preferred" name syntax described
467 in the original DNS specification [RFC1034] is indeed merely
468 "preferred", not mandatory.
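
The length limits quoted from [RFC2181] above are the only restrictions at the DNS protocol level, which a short Python sketch can make concrete. The function name check_name is hypothetical, and counting the 255-octet limit over the length-prefixed form (one length octet per label plus the terminating root octet) is an assumed reading of "including the separators".

```python
from typing import Sequence


def check_name(labels: Sequence[bytes]) -> None:
    """Raise ValueError if the labels violate the DNS length limits."""
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("label length must be 1 to 63 octets")
    # One length octet per label, plus the zero octet for the root.
    total = sum(len(label) + 1 for label in labels) + 1
    if total > 255:
        raise ValueError("full name exceeds 255 octets")


# Any binary content is acceptable, including spaces and 8-bit octets.
check_name([b"www", b"a b", b"\xff\x00", b"com"])
print("accepted")
```

Note that the check never inspects the octet values themselves; restrictions such as letters, digits, and hyphens would have to come from the application, not from DNS.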
470 Since there is no restriction even to ASCII, let alone letter-digit-
471 hyphen use, DNS is in conformance with the IETF requirement to allow
472 UTF-8 [RFC2277].
474 Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
475 in DNS packets or APIs because existing software already uses ASCII,
476 and UTF-16 and UTF-32 strings can contain all-zero octets that
477 existing software may interpret as the end of the string. To use
478 UTF-16 or UTF-32 one would need some way of knowing whether the
479 string was encoded using ASCII, UTF-16, or UTF-32, and indeed for
480 UTF-16 or UTF-32 whether it was big-endian or little-endian encoding.
481 In contrast, UTF-8 works well because any 7-bit ASCII string is also
482 a UTF-8 string representing the same characters.
484 If a private name space is defined to use UTF-8 (and not other
485 encodings such as UTF-16 or UTF-32), there's no need for a mechanism
486 to know whether a string was encoded using ASCII or UTF-8, because
487 (for any string that can be represented using ASCII) the
488 representations are exactly the same. In other words, for any string
489 that can be represented using ASCII it doesn't matter whether it is
490 interpreted as ASCII or UTF-8 because both encodings are the same,
491 and for any string that can't be represented using ASCII, it's
492 obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and
493 UTF-8 are both byte-oriented encodings so the question of big-endian
494 or little-endian encoding doesn't apply.
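
Both properties can be checked directly in a few lines of Python. The sample names are arbitrary, and the helper decode_private_name is a hypothetical illustration of a name space that is defined to use UTF-8.

```python
ascii_name = "example"

# For any string representable in ASCII, the two encodings are identical.
print(ascii_name.encode("ascii") == ascii_name.encode("utf-8"))

# UTF-16 yields different octet sequences depending on byte order.
print(ascii_name.encode("utf-16-be") == ascii_name.encode("utf-16-le"))


def decode_private_name(raw: bytes) -> str:
    """Decode a name from a name space defined to use UTF-8."""
    return raw.decode("utf-8")   # also correct for pure-ASCII input


print(decode_private_name(b"example"),
      decode_private_name("b\u00fccher".encode("utf-8")))
```

A single UTF-8 decoder therefore handles both legacy ASCII names and non-ASCII names without any out-of-band indication of the encoding.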
496 While implementations of the DNS protocol must not place any
497 restrictions on the labels that can be used, applications that use
498 the DNS are free to impose whatever restrictions they like, and many
499 have. The above rules permit a domain name label that contains
500 unusual characters, such as embedded spaces, which many applications
501 would consider a bad idea. For example, SMTP [RFC5321], going back
502 to the original specification in [RFC0821], constrains
503 the character set usable in email addresses. There is now an effort
504 underway to permit SMTP to support internationalized email addresses
505 via an extension.
507 Shortly after the DNS Clarifications [RFC2181] and IETF character
508 sets and languages policy [RFC2277] were published, the need for
509 internationalized names within private name spaces (i.e., within
510 enterprises) arose. The current (and past, predating IDNA and the
511 prefixed ACE conventions) practice within enterprises that support
512 other languages is to put UTF-8 names in their internal DNS servers
513 in a private name space. For example, "Using the UTF-8 Character Set
514 in the Domain Name System" [I-D.skwan-utf8-dns-00] was first written
515 in 1997, and was then widely deployed in Windows. The use of UTF-8
516 names in DNS was similarly implemented and deployed in MacOS, simply
517 because applications blindly passed UTF-8 strings
518 to the name resolution APIs, the name resolution APIs blindly
519 passed those UTF-8 strings to the DNS servers, and the DNS servers
520 correctly answered those queries. From the user's point of view,
521 everything worked properly without any special new code being
522 written, except that ASCII is matched case-insensitively whereas
523 UTF-8 is not (although some enterprise DNS servers reportedly attempt
524 to do case-insensitive matching on UTF-8 within private name spaces).
525 Especially in light of the IETF
526 UTF-8 policy [RFC2277], it was reasonable to assume within a private
527 name space that binary strings were encoded in UTF-8.
529 As implied earlier, there are also issues with mapping strings to
530 some canonical form, independent of the encoding. Such issues are
531 not discussed in detail in this document. They are discussed to some
532 extent in, for example, Section 3 of [RFC5198], and are left as
533 opportunities for elaboration in other documents.
535 Five years after UTF-8 was already in use in private name spaces in
536 DNS, the strategy of using a reserved prefix and an ASCII-compatible
537 Encoding (ACE) was developed for IDNA. That strategy included the
538 Punycode algorithm, which began to be developed (during the period
539 from 2002 [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in
540 the public DNS name space. One reason the prefixed ACE strategy was
541 selected for the public DNS name space had to do with concerns about
542 whether the details of IDNA, including the use of the Punycode
543 algorithm, were an adequate solution to the problems that were posed.
544 If either the Punycode algorithm or fundamental aspects of character
545 handling were wrong, and had to be changed to something incompatible,
546 it would be possible to switch to a new prefix or adopt another model
547 entirely. Only the part of the public DNS namespace that starts a
548 label with "xn--" would be polluted.
550 Today the algorithm is seen as being about as good as it can
551 realistically be, so moving to a different encoding (UTF-8 as
552 suggested in this document) that can be viewed as "native" would not
553 be as risky as it would have been in 2002.
555 In any case, the publication of [RFC3492] and the dependencies on it
556 in [IDNA2008-Protocol] and the earlier [RFC3490] thus resulted in
557 having to use different encodings for different name spaces (where
558 UTF-8 for private name spaces was already deployed). Hence,
559 referring back to Figure 2, a different encoding scheme may be in use
560 on the Internet vs. a local network.
562 In general a host may be connected to zero or more networks using
563 private name spaces, plus potentially the public name space.
564 Applications that convert a U-label form IDN to an A-label before
565 calling getaddrinfo() will incur name resolution failures if the name
566 is actually registered in a private name space in some other encoding
567 (e.g., UTF-8). Having libraries or protocols convert from A-labels
568 to the encoding used by a private name space (e.g., UTF-8) would
569 require changes to APIs and/or servers, which IDNA was intended to
570 avoid.
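
The failure mode described above can be simulated with a toy lookup table. Everything in this sketch is hypothetical: the zone contents, the name, the documentation address 192.0.2.10, and the lookup function, which simply stands in for a private DNS server that stores names as UTF-8 octets. The "idna" codec used for the A-label form implements the earlier IDNA version [RFC3490].

```python
# A private name space whose names are stored as UTF-8 octets.
private_zone = {"b\u00fccher.corp".encode("utf-8"): "192.0.2.10"}


def lookup(query: bytes):
    """Return the address registered for the exact octets queried."""
    return private_zone.get(query)


name = "b\u00fccher.corp"
utf8_query = name.encode("utf-8")
a_label_query = name.encode("idna")    # what an IDNA-converting app sends

print(lookup(utf8_query))      # the registered address
print(lookup(a_label_query))   # None: the octets do not match
```

The name exists, but the application that converted it to A-label form before calling the resolution API never finds it.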
572 Also, a fully-qualified domain name (FQDN) to be resolved may be
573 obtained directly from an application, or it may be composed by the
574 DNS resolver itself from a single label obtained from an application
575 by using a configured suffix search list, and the resulting FQDN may
576 use multiple encodings in different labels. For more information on
577 the suffix search list, see section 6 of "Common DNS Implementation
578 Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
579 [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
580 [RFC3646].
582 As noted in [RFC1536] section 6, the community has had bad
583 experiences with "searching" for domain names by trying multiple
584 variations or appending different suffixes. Such searching can yield
585 inconsistent results depending on the order in which alternatives are
586 tried. Nonetheless, the practice is widespread and must be
587 considered.
589 The practice of searching for names, whether by the use of a suffix
590 search list or by searching in different namespaces, can yield
591 inconsistent results. For example, even when a suffix search list is
592 only used when an application provides a name containing no dots, two
593 clients with different configured suffix search lists can get
594 different answers, and the same client could get different answers at
595 different times if it changes its configuration (e.g., when moving to
596 another network). A deeper discussion of this topic is outside the
597 scope of this document.
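
The effect of a suffix search list can be sketched as follows. This is a simplified model, not the behavior of any particular resolver: it assumes that suffixes are appended only to names containing no dots, and the function name and the suffixes shown are hypothetical.

```python
def candidate_fqdns(name, search_list):
    """Return the FQDNs a stub resolver might try, in order.

    Simplifying assumption: the search list is applied only when the
    name supplied by the application contains no dots.
    """
    if "." in name:
        return [name]
    return [f"{name}.{suffix}" for suffix in search_list]


# Two clients with different (hypothetical) configured search lists.
office_list = ["corp.example", "example.com"]
home_list = ["home.example"]

print(candidate_fqdns("printer", office_list))
print(candidate_fqdns("printer", home_list))

# An application that pre-converts to an A-label, combined with a UTF-8
# suffix, produces an FQDN that mixes encodings across labels.
mixed = candidate_fqdns("xn--e1afmkfd", ["пример.example"])
print(mixed)
```

The same single label thus expands to different queries on different networks, and a pre-converted label can end up next to a UTF-8 suffix in one name.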
599 3.1. Examples
601 Some examples of cases that can happen in existing implementations
602 today (where {non-ASCII} below represents some user-entered non-ASCII
603 string) are:
604 1. User types in {non-ASCII}.{non-ASCII}.com, and the application
605 passes it, in the form of a UTF-8 string, to getaddrinfo or
606 gethostbyname or equivalent.
607 * The DNS resolver passes the (UTF-8) string unmodified to a DNS
608 server.
609 2. User types in {non-ASCII}.{non-ASCII}.com, and the application
610 passes it to a name resolution API that accepts strings in some
611 other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
612 * The name resolution API decides to pass the string to DNS (and
613 possibly other protocols).
614 * The DNS resolver converts the name from UTF-16 to UTF-8 and
615 passes the query to a DNS server.
616 3. User types in {non-ASCII}.{non-ASCII}.com, but the application
617 first converts it to A-label form such that the name that is
618 passed to name resolution APIs is (say)
619 xn--e1afmkfd.xn--80akhbyknj4f.com.
620 * The name resolution API decides to pass the string to DNS (and
621 possibly other protocols).
622 * The DNS resolver passes the string unmodified to a DNS server.
623 * If the name is not found in DNS, the name resolution API
624 decides to try another protocol, say mDNS.
625 * The query goes out in mDNS, but since mDNS specifies that
626 names are to be registered in UTF-8, the name isn't found
627 since it was encoded as an A-label in the query.
628 4. User types in {non-ASCII}, and the application passes it, in the
629 form of a UTF-8 string, to getaddrinfo or equivalent.
630 * The name resolution API decides to pass the string to DNS (and
631 possibly other protocols).
632 * The DNS resolver will append suffixes in the suffix search
633 list, which may contain UTF-8 characters if the local network
634 uses a private name space.
635 * Each FQDN in turn will then be sent in a query to a DNS
636 server, until one succeeds.
637 5. User types in {non-ASCII}, but the application first converts it
638 to an A-label, such that the name that is passed to getaddrinfo
639 or equivalent is (say) xn--e1afmkfd.
640 * The name resolution API decides to pass the string to DNS (and
641 possibly other protocols).
642 * The DNS stub resolver will append suffixes in the suffix
643 search list, which may contain UTF-8 characters if the local
644 network uses a private name space, resulting in (say)
645 xn--e1afmkfd.{non-ASCII}.com.
647 * Each FQDN in turn will then be sent in a query to a DNS
648 server, until one succeeds.
649 * Since the private name space in this case uses UTF-8, the
650 above queries fail, since the A-label version of the name was
651 not registered in that name space.
652 6. User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
653 {non-ASCII3}.com is a public name space using IDNA and A-labels,
654 but {non-ASCII2}.{non-ASCII3}.com is a private name space using
655 UTF-8, which is accessible to the user. The application passes
656 the name, in the form of a UTF-8 string, to getaddrinfo or
657 equivalent.
658 * The name resolution API decides to pass the string to DNS (and
659 possibly other protocols).
660 * The DNS resolver tries to locate the authoritative server, but
661 fails the lookup because it cannot find a server for the UTF-8
662 encoding of {non-ASCII3}.com, even though it would have access
663 to the private name space. (To make this work, the private
664 name space would need to include the UTF-8 encoding of
665 {non-ASCII3}.com.)
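
The fallback failure in example 3 can be modeled with a toy sketch. The tables, names, and addresses below are hypothetical, and the "protocols" are plain dictionaries keyed by the octets a server would compare against; no real DNS or mDNS traffic is involved. Python's "idna" codec (IDNA2003) stands in for A-label conversion.

```python
# Hypothetical name tables, keyed by the octets each protocol has registered.
PUBLIC_DNS = {b"xn--e1afmkfd.example": "192.0.2.1"}    # A-label form
MDNS = {"пример.local".encode("utf-8"): "192.0.2.7"}   # UTF-8 form


def resolve(wire_name, protocols):
    """Try each protocol in order with the same, unmodified octets."""
    for proto, table in protocols:
        address = table.get(wire_name)
        if address is not None:
            return proto, address
    return None


protocols = [("dns", PUBLIC_DNS), ("mdns", MDNS)]

# Application 1 converts the name to A-label form before calling the API.
pre_converted = "пример.local".encode("idna")
# Application 2 hands over the UTF-8 string unchanged.
passed_through = "пример.local".encode("utf-8")

result_a = resolve(pre_converted, protocols)
result_u = resolve(passed_through, protocols)
print(result_a)
print(result_u)
```

In this model the pre-converted name misses in both tables, while the UTF-8 name is found by the second protocol, which mirrors the mDNS case described in the list.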
667 When users use multiple applications, some of which do A-label
668 conversion prior to passing a name to name resolution APIs, and some
669 of which do not, odd behavior can result which at best violates the
670 principle of least surprise, and at worst can result in security
671 vulnerabilities.
673 First consider two competing applications, such as web browsers, that
674 are designed to achieve the same task. If the user types the same
675 name into each browser, one may successfully resolve the name (and
676 hence access the desired content) because the encoding scheme was
677 correct, while the other may fail name resolution because the
678 encoding scheme was incorrect. Hence the issue can push users to
679 switch to another application (which in some cases means switching to
680 an IDNA application, and in other cases means switching away from an
681 IDNA application).
683 Next consider two separate applications where one is designed to be
684 launched from the other, for example a web browser launching a media
685 player application when the link to a media file is clicked. If both
686 types of content (web pages and media files in this example) are
687 hosted at the same IDN in a private name space, but one application
688 converts to A-labels before calling name resolution APIs and the
689 other does not, the user may be able to access a web page, click on
690 the media file causing the media player to launch and attempt to
691 retrieve the media file, which will then fail because the IDN
692 encoding scheme was incorrect. Or even worse, if an attacker was
693 able to register the same name in the other encoding scheme, the
694 user may get the content from the attacker's machine. This is
695 similar to a normal phishing attack, except that the two names
696 represent exactly the same Unicode characters.
698 4. Recommendations
700 As explained above, using multiple canonical formats, and multiple
701 encodings in different protocols or even in different places in the
702 same namespace creates problems. Because of this, and the fact that
703 both IDNA A-labels and UTF-8 are in use as encoding mechanisms for
704 domain names today, we recommend the following.
706 It is inappropriate for an application to convert a name to an
707 A-label when it does not know whether DNS will be used by the name
708 resolution library, or whether the name exists in a private name
709 space that uses UTF-8, or in the global DNS that uses IDNA A-labels.
711 Instead, conversion to A-label form, UTF-8, or any other encoding,
712 should be done only by an entity that knows which protocol will be
713 used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
714 the name to DNS), rather than by general applications that call
715 protocol-independent name resolution APIs. (Of course, it is still
716 necessary for applications to convert to whatever form those APIs
717 expect.) Similarly, even when DNS is used, the conversion to
718 A-labels should be done only by an entity that knows which name space
719 will be used.
721 That is, a more intelligent DNS resolver would be more liberal in
722 what it would accept from an application and be able to query for
723 both a name in A-label form (e.g., over the Internet) and a UTF-8
724 name (e.g., over a corporate network with a private name space) in
725 case the server only recognized one. However, we might also take
726 into account that the various resolution behaviors discussed earlier
727 could also occur with record updates (e.g., with Dynamic Update
728 [RFC2136]), resulting in some names being registered in a local
729 network's private name space by applications doing conversion to
730 A-labels, and other names being registered using UTF-8. Hence a name
731 might have to be queried with both encodings to be sure to succeed
732 without changes to DNS servers.
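
One way a more liberal resolver might derive both query forms is sketched below. The function name is hypothetical, and Python's "idna" codec (IDNA2003, per [RFC3490]) is used as a stand-in for the A-label conversion that an IDNA2008 implementation would perform.

```python
def query_forms(name: str) -> list[bytes]:
    """Return the distinct encodings worth querying for one name.

    Accepts a name in U-label or A-label form. Assumes Python's "idna"
    codec (IDNA2003) is an acceptable stand-in for A-label conversion.
    """
    a_form = name.encode("idna")                    # A-label form
    u_form = a_form.decode("idna").encode("utf-8")  # UTF-8 form
    forms = [a_form]
    if u_form != a_form:
        forms.append(u_form)
    return forms


print(query_forms("пример.com"))
print(query_forms("xn--e1afmkfd.com"))
print(query_forms("example.com"))
```

Whichever form the application supplies, the same pair of candidate queries results, and an all-ASCII name produces a single query.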
734 Similarly, a more intelligent stub resolver would also be more
735 liberal in what it would accept from a response as the value of a
736 record (e.g., PTR) in that it would accept either UTF-8 (U-labels in
737 the case of IDNA) or A-labels and convert them to whatever encoding
738 is used by the application APIs to return strings to applications.
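
A sketch of such liberal handling of names in responses follows. It works label by label, since a single name may mix encodings, and it assumes the application API returns Unicode strings. The function name is hypothetical, and Python's "idna" codec (IDNA2003) is assumed for decoding A-labels.

```python
def response_name_to_str(raw: bytes) -> str:
    """Convert a name from a response into the string form an API returns.

    Each label is treated independently: labels carrying the ACE prefix
    are decoded as A-labels, all others are decoded as UTF-8.
    """
    labels = []
    for label in raw.split(b"."):
        if label[:4].lower() == b"xn--":
            labels.append(label.lower().decode("idna"))
        else:
            labels.append(label.decode("utf-8"))
    return ".".join(labels)


from_a_label = response_name_to_str(b"xn--e1afmkfd.example")
from_utf8 = response_name_to_str("пример.example".encode("utf-8"))
print(from_a_label == from_utf8)
```

An invalid A-label raises an exception in this sketch; a production resolver would need a policy for that case, which is outside what this document specifies.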
740 Indeed the choice of conversion within the resolver libraries is
741 consistent with the quote from section 6.2 of the original IDNA
742 specification [RFC3490] stating that conversion using the Punycode
743 algorithm (i.e., to A-labels) "might be performed inside these new
744 versions of the resolver libraries".
746 That said, some application-layer protocols may be defined to use
747 A-labels rather than the UTF-8 recommended by the IETF character
748 sets and languages policy [RFC2277]. In this case, an application may
749 receive a string containing A-labels and want to pass it to name
750 resolution APIs. Again the recommendation that a resolver library be
751 more liberal in what it would accept from an application would mean
752 that such a name would be accepted and re-encoded as needed, rather
753 than requiring the application to do so.
755 Finally, the question remains about what, if anything, a DNS server
756 should do to handle cases where some existing applications or hosts
757 do IDNA queries using A-labels within the local network using a
758 private name space, and other existing applications or hosts send
759 UTF-8 queries. It is undesirable to store different records for
760 different encodings of the same name, since this introduces the
761 possibility for inconsistency between them. Instead, a new DNS
762 server serving a private name space using UTF-8 could potentially
763 treat encoding-conversion in the same way as case-insensitive
764 comparison which a DNS server is already required to do, as long as the
765 DNS server has some way to know what the encoding is. Two encodings
766 are, in this sense, two representations of the same name, just as two
767 case-different strings are. However, whereas case comparison of non-
768 ASCII characters is complicated by ambiguities (as explained in the
769 IAB's Review and Recommendations for Internationalized Domain Names
770 [RFC4690]), encoding conversion between A-labels and U-labels is
771 unambiguous.
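
The idea of treating encoding conversion like case-insensitive comparison can be sketched as a lookup key that both encodings map to. The class and function names below are hypothetical, the record store is a plain dictionary, and Python's "idna" codec (IDNA2003) is assumed for decoding A-labels. Only ASCII letters are case-folded, in line with the ambiguity noted above for non-ASCII case comparison.

```python
def canonical_key(raw: bytes) -> str:
    """Map either encoding of a name to a single comparison key."""
    labels = []
    for label in raw.lower().split(b"."):  # bytes.lower() folds ASCII only
        if label.startswith(b"xn--"):
            labels.append(label.decode("idna"))   # A-label to Unicode
        else:
            labels.append(label.decode("utf-8"))  # UTF-8 label
    return ".".join(labels)


class Zone:
    """Toy record store holding one record per name, whatever its encoding."""

    def __init__(self):
        self._records = {}

    def add(self, raw_name: bytes, address: str) -> None:
        self._records[canonical_key(raw_name)] = address

    def lookup(self, raw_name: bytes):
        return self._records.get(canonical_key(raw_name))


zone = Zone()
zone.add("пример.corp.example".encode("utf-8"), "192.0.2.10")

by_utf8 = zone.lookup("пример.CORP.example".encode("utf-8"))
by_a_label = zone.lookup(b"XN--E1AFMKFD.corp.example")
print(by_utf8, by_a_label)
```

Because only one record is stored, the two encodings cannot drift out of sync, which is the inconsistency that storing separate records would risk.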
773 5. Acknowledgements
775 The authors wish to thank Patrik Faltstrom, Martin Duerst, and JFC
776 Morfin for their careful review and helpful suggestions.
778 6. Security Considerations
780 Having applications convert names to prefixed ACE format (A-labels)
781 before calling name resolution can result in security
782 vulnerabilities. If the name is resolved by protocols or in zones
783 for which records are registered using other encoding schemes, an
784 attacker can claim the A-label version of the same name and hence
785 trick the victim into accessing a different destination. This can be
786 done for any non-ASCII name, even when there is no possible confusion
787 due to case, language, or other issues. Other types of confusion
788 beyond those resulting simply from the choice of encoding scheme are
789 discussed in "Review and Recommendations for IDNs" [RFC4690].
791 Designers and users of encodings that represent Unicode strings in
792 terms of ASCII should also consider whether trademark protection is
793 an issue, e.g., if one name would be encoded in a way that would be
794 naturally associated with another organization, such as
795 xn--rfc-editor.
797 7. IANA Considerations
799 [RFC Editor: please remove this section prior to publication.]
801 This document has no IANA Actions.
803 8. IAB Members at the time of publication
805 Bernard Aboba
806 Marcelo Bagnulo
807 Ross Callon
808 Spencer Dawkins
809 Vijay Gill
810 Russ Housley
811 John Klensin
812 Olaf Kolkman
813 Danny McPherson
814 Jon Peterson
815 Andrei Robachevsky
816 Dave Thaler
817 Hannes Tschofenig
819 9. References
821 9.1. Normative References
823 [Unicode] The Unicode Consortium, "The Unicode Standard, Version
824 5.1.0", 2008.
826 defined by: The Unicode Standard, Version 5.0, Boston, MA,
827 Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
828 Unicode 5.1.0
829 (http://www.unicode.org/versions/Unicode5.1.0/).
831 9.2. Informative References
833 [I-D.cheshire-dnsext-multicastdns]
834 Cheshire, S. and M. Krochmal, "Multicast DNS",
835 draft-cheshire-dnsext-multicastdns-11 (work in progress),
836 March 2010.
838 [I-D.ietf-idn-punycode-00]
839 Costello, A., "Punycode version 0.3.3",
840 draft-ietf-idn-punycode-00 (work in progress), July 2002.
842 [I-D.skwan-utf8-dns-00]
843 Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in
844 the Domain Name System", draft-skwan-utf8-dns-00 (work in
845 progress), November 1997.
847 [IDNA2008-Defs]
848 Klensin, J., "Internationalized Domain Names for
849 Applications (IDNA): Definitions and Document Framework",
850 January 2010.
853 [IDNA2008-Protocol]
854 Klensin, J., "Internationalized Domain Names in
855 Applications (IDNA): Protocol", January 2010.
858 [MJD] Duerst, M., "The Properties and Promizes of UTF-8", 11th
859 International Unicode Conference, San Jose,
860 September 1997.
863 [NIS] Sun Microsystems, "System and Network Administration",
864 March 1990.
866 [RFC0821] Postel, J., "Simple Mail Transfer Protocol", STD 10,
867 RFC 821, August 1982.
869 [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
870 host table specification", RFC 952, October 1985.
872 [RFC1001] NetBIOS Working Group, "Protocol standard for a NetBIOS
873 service on a TCP/UDP transport: Concepts and methods",
874 STD 19, RFC 1001, March 1987.
876 [RFC1002] NetBIOS Working Group, "Protocol standard for a NetBIOS
877 service on a TCP/UDP transport: Detailed specifications",
878 STD 19, RFC 1002, March 1987.
880 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
881 STD 13, RFC 1034, November 1987.
883 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
884 and Support", STD 3, RFC 1123, October 1989.
886 [RFC1468] Murai, J., Crispin, M., and E. van der Poel, "Japanese
887 Character Encoding for Internet Messages", RFC 1468,
888 June 1993.
890 [RFC1536] Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
891 Miller, "Common DNS Implementation Errors and Suggested
892 Fixes", RFC 1536, October 1993.
894 [RFC2136] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
895 "Dynamic Updates in the Domain Name System (DNS UPDATE)",
896 RFC 2136, April 1997.
898 [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
899 Specification", RFC 2181, July 1997.
901 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
902 Languages", BCP 18, RFC 2277, January 1998.
904 [RFC3397] Aboba, B. and S. Cheshire, "Dynamic Host Configuration
905 Protocol (DHCP) Domain Search Option", RFC 3397,
906 November 2002.
908 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
909 "Internationalizing Domain Names in Applications (IDNA)",
910 RFC 3490, March 2003.
912 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
913 for Internationalized Domain Names in Applications
914 (IDNA)", RFC 3492, March 2003.
916 [RFC3493] Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
917 Stevens, "Basic Socket Interface Extensions for IPv6",
918 RFC 3493, February 2003.
920 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
921 10646", STD 63, RFC 3629, November 2003.
923 [RFC3646] Droms, R., "DNS Configuration options for Dynamic Host
924 Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
925 December 2003.
927 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
928 Recommendations for Internationalized Domain Names
929 (IDNs)", RFC 4690, September 2006.
931 [RFC4795] Aboba, B., Thaler, D., and L. Esibov, "Link-local
932 Multicast Name Resolution (LLMNR)", RFC 4795,
933 January 2007.
935 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network
936 Interchange", RFC 5198, March 2008.
938 [RFC5321] Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
939 October 2008.
941 Authors' Addresses
943 Dave Thaler
944 Microsoft Corporation
945 One Microsoft Way
946 Redmond, WA 98052
947 USA
949 Phone: +1 425 703 8835
950 Email: dthaler@microsoft.com
952 John C Klensin
953 1770 Massachusetts Ave, Ste 322
954 Cambridge, MA 02140
956 Phone: +1 617 245 1457
957 Email: john+ietf@jck.com
959 Stuart Cheshire
960 Apple Inc.
961 1 Infinite Loop
962 Cupertino, CA 95014
964 Phone: +1 408 974 3207
965 Email: cheshire@apple.com