2 Network Working Group D. Thaler
3 Internet-Draft Microsoft
4 Intended status: Informational J. Klensin
5 Expires: January 13, 2011
6 S. Cheshire
7 Apple
8 July 12, 2010
10 IAB Thoughts on Encodings for Internationalized Domain Names
11 draft-iab-idn-encoding-03.txt
13 Abstract
15 This document explores issues with Internationalized Domain Names
16 (IDNs) that result from the use of various encoding schemes such as
17 UTF-8 and the ASCII-Compatible Encoding produced by the Punycode
algorithm. It focuses on the importance of agreeing on a canonical
format and on the complications that result from the use of different
encodings today.
22 Status of this Memo
24 This Internet-Draft is submitted in full conformance with the
25 provisions of BCP 78 and BCP 79.
27 Internet-Drafts are working documents of the Internet Engineering
28 Task Force (IETF). Note that other groups may also distribute
29 working documents as Internet-Drafts. The list of current Internet-
30 Drafts is at http://datatracker.ietf.org/drafts/current/.
32 Internet-Drafts are draft documents valid for a maximum of six months
33 and may be updated, replaced, or obsoleted by other documents at any
34 time. It is inappropriate to use Internet-Drafts as reference
35 material or to cite them other than as "work in progress."
37 This Internet-Draft will expire on January 13, 2011.
39 Copyright Notice
41 Copyright (c) 2010 IETF Trust and the persons identified as the
42 document authors. All rights reserved.
44 This document is subject to BCP 78 and the IETF Trust's Legal
45 Provisions Relating to IETF Documents
46 (http://trustee.ietf.org/license-info) in effect on the date of
47 publication of this document. Please review these documents
48 carefully, as they describe your rights and restrictions with respect
49 to this document. Code Components extracted from this document must
50 include Simplified BSD License text as described in Section 4.e of
51 the Trust Legal Provisions and are provided without warranty as
52 described in the Simplified BSD License.
54 Table of Contents
1. Introduction
  1.1. APIs
2. Use of Non-DNS Protocols
3. Use of Non-ASCII in DNS
  3.1. Examples
4. Recommendations
5. Security Considerations
6. IANA Considerations
7. Acknowledgements
8. IAB Members at the time of publication
9. References
  9.1. Normative References
  9.2. Informative References
Authors' Addresses
71 1. Introduction
73 The goal of this document is to explore what can be learned from some
74 current difficulties in implementing Internationalized Domain Names
75 (IDNs).
77 A domain name consists of a set of labels, conventionally written
78 separated with dots. An Internationalized Domain Name (IDN) is a
79 domain name that contains one or more labels that, in turn, contain
80 one or more non-ASCII characters. Just as with plain ASCII domain
81 names, each IDN label must be encoded using some mechanism before it
82 can be transmitted in network packets, stored in memory, stored on
83 disk, etc. These encodings need to be reversible, but they need not
84 store domain names the same way humans conventionally write them on
85 paper. For example, when transmitted over the network in DNS
86 packets, domain name labels are *not* separated with dots.
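
As an illustration of the last point, the following minimal Python sketch shows one way a dotted name might be turned into the length-prefixed label sequence carried in DNS packets. The function name to_wire is hypothetical, and the sketch does not enforce DNS length limits or handle escaped dots.

```python
def to_wire(name: str, encoding: str = "ascii") -> bytes:
    """Encode a dotted domain name as a sequence of length-prefixed labels."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode(encoding)
        out.append(len(raw))  # one length octet precedes each label
        out += raw            # the label octets themselves; no dot is stored
    out.append(0)             # the zero-length root label ends the name
    return bytes(out)


print(to_wire("www.example.com"))
# b'\x03www\x07example\x03com\x00'
```

The dots that humans write between labels do not appear anywhere in the encoded form; each label is instead preceded by its length.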
88 IDNA, discussed later in this document, is the standard that defines
89 the use and coding of internationalized domain names for use on the
90 public Internet. It is described as "Internationalizing Domain Names
91 in Applications (IDNA)" and is defined in several documents.
Definitions for the current version and a roadmap of related
documents appear in [IDNA2008-Defs]. An earlier version of IDNA
94 [RFC3490] is now being phased out. Except where noted, the two
95 versions are approximately the same with regard to the issues
96 discussed in this document. However, some explanations appeared in
97 the earlier documents that did not seem useful when the revision was
98 created; they are quoted here from the documents in which they
appear. In addition, the terminology of the two versions differs
100 somewhat; this document reflects the terminology of the current
101 version.
103 Unicode [Unicode] is a list of characters (including non-spacing
104 marks that are used to form some other characters), where each
105 character is assigned an integer value, called a code point. In
106 simple terms a Unicode string is a string of integer code point
107 values in the range 0 to 1,114,111 (10FFFF in base 16), which
108 represent a string of Unicode characters. These integer code points
109 must be encoded using some mechanism before they can be transmitted
110 in network packets, stored in memory, stored on disk, etc. Some
111 common ways of encoding these integer code point values in computer
112 systems include UTF-8, UTF-16, and UTF-32. In addition to the
113 material below, those forms and the tradeoffs among them are
114 discussed in Chapter 2 of The Unicode Standard [Unicode].
116 UTF-8 is a mechanism for encoding a Unicode code point in a variable
117 number of 8-bit octets, where an ASCII code point is preserved as-is.
118 Those octets encode a string of integer code point values, which
119 represent a string of Unicode characters. The authoritative
120 definition of UTF-8 is in Sections 3.9 and 3.10 of The Unicode
121 Standard [Unicode], but UTF-8 is also discussed in [RFC3629].
122 Descriptions and formulae can also be found in Annex D of ISO/IEC
123 10646-1 [10646].
125 UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code
126 point in one or two 16-bit integers, described in detail in Sections
127 3.9 and 3.10 of The Unicode Standard [Unicode]. A UTF-16 string
128 encodes a string of integer code point values that represent a string
129 of Unicode characters.
131 UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and
132 3.10, is a mechanism for encoding a Unicode code point in a single
133 32-bit integer. A UTF-32 string is thus a string of 32-bit integer
134 code point values, which represent a string of Unicode characters.
136 Note that UTF-16 results in some all-zero octets when code points
137 occur early in the Unicode sequence, and UTF-32 always has all-zero
138 octets.
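
The differences among the three encoding forms can be seen with a short Python sketch. The sample string and the choice of big-endian byte order are assumptions made only for illustration.

```python
# 'a' (U+0061), 'e with acute' (U+00E9), and a CJK character (U+4E2D)
text = "a\u00e9\u4e2d"

for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    octets = text.encode(codec)
    # Show the octets in hex and count how many of them are all-zero.
    print(f"{codec:10} {octets.hex(' ')}  zero octets: {octets.count(0)}")

# utf-8      61 c3 a9 e4 b8 ad  zero octets: 0
# utf-16-be  00 61 00 e9 4e 2d  zero octets: 2
# utf-32-be  00 00 00 61 00 00 00 e9 00 00 4e 2d  zero octets: 8
```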
140 IDNA specifies validity of a label, such as what characters it can
141 contain, relationships among them, and so on, in Unicode terms.
142 Valid labels can take either of two forms, with the appropriate one
143 determined by particular protocols or by context. One of those
144 forms, called a U-label, is a direct representation of the Unicode
145 characters using one of the encoding forms discussed above. This
146 document discusses UTF-8 strings in many places. While all U-labels
147 can be represented by UTF-8 strings, not all UTF-8 strings are valid
148 U-labels (see Section 2.3.2 of [IDNA2008-Defs] for a discussion of
149 these distinctions). The other, called an A-label, uses a
150 compressed, ASCII-compatible encoding (an "ACE" in IDNA and other
151 terminology) produced by an algorithm called Punycode. U-labels and
152 A-labels are duals of each other: transformations from one to the
153 other do not lose information. The transformation mechanisms are
154 specified in [IDNA2008-Protocol].
156 Punycode [RFC3492] is thus a mechanism for encoding a Unicode string
157 in an ASCII-compatible encoding, i.e., using only letters, digits,
158 and hyphens from the ASCII character set. When a Unicode label that
159 is valid under the IDNA rules (a U-label) is encoded with Punycode
160 for IDNA purposes, it is prefixed with "xn--"; the result is called
161 an A-label. The prefix convention assumes that no other DNS labels
162 (at least no other DNS labels in IDNA-aware applications) are allowed
163 to start with these four characters. Consequently, when A-label
164 encoding is assumed, any DNS labels beginning with "xn--" now have a
165 different meaning (the Punycode encoding of a label containing one or
166 more non-ASCII characters) or no defined meaning at all (in the case
167 of labels that are not IDNA-compliant, i.e., are not well-formed
168 A-labels).
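
The relationship between a Unicode label and its prefixed Punycode form can be sketched with Python's built-in "punycode" codec. The helper names to_a_label and to_u_label are hypothetical, and the sketch deliberately performs none of the IDNA validity checks, so its output is an A-label only if the input was already a valid U-label.

```python
ACE_PREFIX = "xn--"


def to_a_label(label: str) -> str:
    """Punycode-encode a non-ASCII label and add the "xn--" prefix."""
    if label.isascii():
        return label  # plain ASCII labels are left untouched
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Reverse to_a_label for labels that carry the "xn--" prefix."""
    if not label.startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(to_a_label("b\u00fccher"))    # xn--bcher-kva
print(to_u_label("xn--bcher-kva") == "b\u00fccher")  # True
```

Because the transformation is reversible, converting in one direction and then the other returns the original label.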
170 ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII
171 and Japanese characters, where an ASCII character is preserved as-is.
172 ISO-2022-JP is stateful: special sequences are used to switch between
173 character coding tables. As a result, if there are lost or mangled
174 characters in a character stream, it is extremely difficult to
175 recover the original stream after such a lost character encoding
176 shift.
178 Comparison of Unicode strings is not as easy as comparing, for
179 example, ASCII strings. First, there are a multitude of ways of
180 representing a string of Unicode characters. Second, in many
181 languages and scripts, the actual definition of "same" is very
182 context-dependent. Because of this, comparison of two Unicode
183 strings must take into account how the Unicode strings are encoded.
184 Regardless of the encoding, however, comparison cannot simply be done
185 by comparing the encoded Unicode strings byte by byte. The only time
186 that is possible is when the strings both are mapped into some
187 canonical format and encoded the same way.
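
A small Python sketch illustrates why byte-by-byte comparison fails without a canonical format. Unicode Normalization Form C is used here only as an assumed example of a canonical format, and the helper name same_after_nfc is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"     # the accented letter as one code point, U+00E9
decomposed = "cafe\u0301"  # 'e' followed by combining acute accent, U+0301


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after mapping both to NFC and encoding as UTF-8."""
    na = unicodedata.normalize("NFC", a).encode("utf-8")
    nb = unicodedata.normalize("NFC", b).encode("utf-8")
    return na == nb


# The raw UTF-8 octets differ even though both strings display identically.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(same_after_nfc(composed, decomposed))                    # True
```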
189 RFC 2130 [RFC2130] reports on an IAB-sponsored workshop on character
sets and encodings. This document adds to that discussion and
focuses on the importance of agreeing on a canonical format and on the
complications that result from the use of different encodings today.
195 Different applications, APIs, and protocols use different encoding
196 schemes today. Historically, many of them were originally defined to
197 use only ASCII. Internationalizing Domain Names in Applications
198 (IDNA) [IDNA2008-Defs] defined a mechanism that required changes to
applications but, in an attempt not to change APIs or servers, specified
200 that the A-label format is to be used in many contexts. In some ways
201 this could be seen as not changing the existing APIs, in the sense
202 that the strings being passed to and from the APIs were still
203 apparently ASCII strings. In other ways it was a very profound
204 change to the existing APIs, because while those strings were still
205 syntactically valid ASCII strings, they no longer meant the same
206 thing as they used to. What looked like a plain ASCII string to one
207 piece of software or library could be seen by another piece of
208 software or library (with the application of out-of-band information)
209 to be in fact an encoding of a Unicode string.
211 Section 1.3 of the original IDNA specification [RFC3490] states:
213 The IDNA protocol is contained completely within applications. It
214 is not a client-server or peer-to-peer protocol: everything is
215 done inside the application itself. When used with a DNS resolver
216 library, IDNA is inserted as a "shim" between the application and
217 the resolver library. When used for writing names into a DNS
218 zone, IDNA is used just before the name is committed to the zone.
220 Figure 1 depicts a simplistic architecture that a naive reader might
221 assume from the paragraph quoted above. (A variant of this same
picture appears in Section 6 of the IDNA specification [RFC3490],
223 further strengthening this assumption.)
+-----------------------------------------+
|Host                                     |
|             +-------------+             |
|             | Application |             |
|             +------+------+             |
|                    |                    |
|               +----+----+               |
|               |   DNS   |               |
|               | Resolver|               |
|               | Library |               |
|               +----+----+               |
|                    |                    |
+-----------------------------------------+
                     |
            _________|_________
           /                   \
          /                     \
         /                       \
        |        Internet         |
         \                       /
          \                     /
           \___________________/

          Simplistic Architecture

                 Figure 1
252 There are, however, two problems with this simplistic architecture
253 that cause it to differ from reality.
255 First, resolver APIs on Operating Systems (OSs) today (MacOS,
256 Windows, Linux, etc.) are not DNS-specific. They typically provide a
257 layer of indirection so that the application can work independent of
258 the name resolution mechanism, which could be DNS, mDNS
259 [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP
[RFC1001][RFC1002], the hosts file [RFC0952], NIS [NIS], or anything
261 else. For example, "Basic Socket Interface Extensions for IPv6"
262 [RFC3493] specifies the getaddrinfo() API and contains many phrases
263 like "For example, when using the DNS" and "any type of name
264 resolution service (for example, the DNS)". Importantly, DNS is
265 mentioned only as an example, and the application has no knowledge as
266 to whether DNS or some other protocol will be used.
Second, even with the DNS protocol, private name spaces (sometimes
including private uses of the DNS) do not necessarily use the same
270 character set encoding scheme as the public Internet name space.
272 We will discuss each of the above issues in subsequent sections. For
273 reference, Figure 2 depicts a more realistic architecture on typical
274 hosts today (which don't have IDNA inserted as a shim immediately
275 above the DNS resolver library). More generally, the host may be
276 attached to one or more local networks, each of which may or may not
277 be connected to the public Internet and may or may not have a private
278 name space.
+-----------------------------------------+
|Host                                     |
|             +-------------+             |
|             | Application |             |
|             +------+------+             |
|                    |                    |
|             +------+------+             |
|             |   Generic   |             |
|             |    Name     |             |
|             | Resolution  |             |
|             |     API     |             |
|             +------+------+             |
|                    |                    |
|   +-----+------+---+--+-------+-----+   |
|   |     |      |      |       |     |   |
| +-+-++--+--++--+-++---+---++--+--++-+-+ |
| |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
| +---++-----++----++-------++-----++---+ |
|                                         |
+-----------------------------------------+
                     |
               ______|______
              /             \
             /               \
            /      local      \
            \     network     /
             \               /
              \_____________/
                     |
            _________|_________
           /                   \
          /                     \
         /                       \
        |        Internet         |
         \                       /
          \                     /
           \___________________/

          Realistic Architecture

                 Figure 2
322 1.1. APIs
324 Section 6.2 of the original IDNA specification [RFC3490] states
325 (where ToASCII and ToUnicode below refer to conversions using the
326 Punycode algorithm):
328 It is expected that new versions of the resolver libraries in the
329 future will be able to accept domain names in other charsets than
330 ASCII, and application developers might one day pass not only
331 domain names in Unicode, but also in local script to a new API for
332 the resolver libraries in the operating system. Thus the ToASCII
333 and ToUnicode operations might be performed inside these new
334 versions of the resolver libraries.
336 Resolver APIs such as getaddrinfo() and its predecessor
337 gethostbyname() were defined to accept "char *" arguments, meaning
they accept a string of bytes, terminated with a NUL (0) byte.
Because of the use of a NUL octet as a string terminator, this is
sufficient for ASCII strings (including A-labels) and even
ISO-2022-JP and UTF-8 strings (unless an implementation artificially
precludes them), but not UTF-16 or UTF-32 strings because a NUL
octet could appear in the middle of strings using these
344 encodings. Several operating systems historically used in Japan will
345 accept (and expect) ISO-2022-JP strings in such APIs. Some platforms
346 used worldwide also have new versions of the APIs (e.g.,
347 GetAddrInfoW() on Windows) that accept other encoding schemes such as
348 UTF-16.
350 It is worth noting that an API using "char *" arguments can
351 distinguish between conventional ASCII "host name" labels, A-labels,
352 ISO-2022-JP, and UTF-8 labels in names if the coding is known to be
353 one of those four, and the label is intact (no lost or mangled
354 characters). An example method is as follows:
   o  if the label contains an ESC (0x1B) byte, the label is
      ISO-2022-JP; otherwise,
   o  if any byte in the label has the high bit set, the label is
      UTF-8; otherwise,
   o  if the label starts with "xn--", then it is presumed to be an
      A-label; otherwise,
   o  the label is ASCII.
362 Again this assumes that neither ASCII labels nor UTF-8 strings ever
363 start with "xn--", and also that UTF-8 strings never contain an ESC
364 character. Also the above is merely an illustration; UTF-8 can be
365 detected and distinguished from other 8-bit encodings with good
366 accuracy [MJD].
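
The example method above can be sketched in Python as follows. The function name classify_label and the returned strings are hypothetical, and the sketch inherits every assumption stated above: the label is intact and is known to be in one of the four codings.

```python
def classify_label(label: bytes) -> str:
    """Guess the coding of an intact label using the four tests, in order."""
    if b"\x1b" in label:              # an ESC byte marks ISO-2022-JP
        return "ISO-2022-JP"
    if any(b & 0x80 for b in label):  # any byte with the high bit set
        return "UTF-8"
    if label.startswith(b"xn--"):     # IDNA prefix, presumed A-label
        return "A-label"
    return "ASCII"


print(classify_label("\u65e5\u672c".encode("iso2022_jp")))  # ISO-2022-JP
print(classify_label("b\u00fccher".encode("utf-8")))        # UTF-8
print(classify_label(b"xn--bcher-kva"))                     # A-label
print(classify_label(b"example"))                           # ASCII
```

The order of the tests matters: ISO-2022-JP text is 7-bit, so it must be recognized by its escape sequences before the high-bit test is applied.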
It is more difficult, or even impossible, to distinguish the ISO 8859
character sets from each other, because they differ in up to about 90
characters that share exactly the same octet values, and a short string
is very unlikely to contain enough characters to allow a receiver to
deduce the character set. Similarly, it is not possible in general
373 to distinguish between ISO-2022-JP and any other encoding based on
374 ISO 2022 code table switching.
376 Although it is possible (as in the example above) to distinguish some
377 encodings when not explicitly specified, it is cleaner to have the
378 encodings specified explicitly, such as specifying UTF-16 for
379 GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
380 strings.
382 2. Use of Non-DNS Protocols
384 As noted earlier, typical name resolution libraries are not DNS-
385 specific. Furthermore, some protocols are defined to use encoding
386 forms other than IDNA A-labels. For example, mDNS
387 [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used.
388 Indeed, the IETF policy on character sets and languages [RFC2277]
389 (which followed the IAB-sponsored workshop [RFC2130]) states:
391 Protocols MUST be able to use the UTF-8 charset, which consists of
392 the ISO 10646 coded character set combined with the UTF-8
393 character encoding scheme, as defined in [10646] Annex R
394 (published in Amendment 2), for all text. Protocols MAY specify,
395 in addition, how to use other charsets or other character encoding
396 schemes for ISO 10646, such as UTF-16, but lack of an ability to
397 use UTF-8 is a violation of this policy; such a violation would
398 need a variance procedure ([BCP9] section 9) with clear and solid
399 justification in the protocol specification document before being
400 entered into or advanced upon the standards track. For existing
401 protocols or protocols that move data from existing datastores,
402 support of other charsets, or even using a default other than
403 UTF-8, may be a requirement. This is acceptable, but UTF-8
404 support MUST be possible.
Applications that convert an IDN to A-label form before calling
getaddrinfo() will encounter name resolution failures if the
Punycode-encoded name is used directly in such protocols. Having libraries or
409 protocols to convert from A-labels to the encoding scheme defined by
410 the protocol (e.g., UTF-8) would require changes to APIs and/or
411 servers, which IDNA was intended to avoid.
As a result, applications have problems when they assume that
non-ASCII names are resolved using the public DNS and blindly convert
them to A-labels without knowing what protocol the name resolution
library will select. Furthermore, name resolution
417 libraries often try multiple protocols until one succeeds, because
418 they are defined to use a common name space. For example, the hosts
419 file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
421 [RFC1034] section 2.1). This means that when an application passes a
422 name to be resolved, resolution may in fact be attempted using
423 multiple protocols, each with a potentially different encoding
424 scheme. For this to work successfully, the name must be converted to
425 the appropriate encoding scheme only after the choice is made to use
426 that protocol. In general, this cannot be done by the application
427 since the choice of protocol is not made by the application.
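
The following Python sketch illustrates the point under stated assumptions. The two lookup tables, the names in them, and the functions encode_for_dns, encode_for_mdns, and resolve are all hypothetical stand-ins for real protocol implementations; the only idea taken from the text is that each protocol applies its own encoding after it has been selected.

```python
from typing import Dict, Optional, Tuple

# Hypothetical name tables standing in for two resolution protocols.
PUBLIC_DNS: Dict[bytes, str] = {b"xn--bcher-kva.example": "192.0.2.1"}
MDNS: Dict[bytes, str] = {"b\u00fccher.local".encode("utf-8"): "192.0.2.2"}


def encode_for_dns(name: str) -> bytes:
    """Public DNS: non-ASCII labels become prefixed Punycode (no IDNA checks)."""
    labels = []
    for label in name.split("."):
        if label.isascii():
            labels.append(label.encode("ascii"))
        else:
            labels.append(b"xn--" + label.encode("punycode"))
    return b".".join(labels)


def encode_for_mdns(name: str) -> bytes:
    """mDNS: names are carried as UTF-8."""
    return name.encode("utf-8")


PROTOCOLS = [("dns", encode_for_dns, PUBLIC_DNS),
             ("mdns", encode_for_mdns, MDNS)]


def resolve(name: str) -> Optional[Tuple[str, str]]:
    """Try each protocol in turn, encoding the name only once one is chosen."""
    for protocol, encode, table in PROTOCOLS:
        address = table.get(encode(name))
        if address is not None:
            return protocol, address
    return None


print(resolve("b\u00fccher.example"))  # ('dns', '192.0.2.1')
print(resolve("b\u00fccher.local"))    # ('mdns', '192.0.2.2')
print(resolve("xn--bcher-kva.local"))  # None: converted too early
```

In the last call the application has already converted the name to A-label form, so the UTF-8 name registered for the second protocol can no longer be found.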
429 3. Use of Non-ASCII in DNS
431 A common misconception is that DNS only supports names that can be
432 expressed using letters, digits, and hyphens.
434 This misconception originally stemmed from the definition in 1985 of
435 an "Internet host name" (and net, gateway, and domain name) for use
436 in the "hosts" file [RFC0952]. An Internet host name was defined
437 therein as including only letters, digits, and hyphens, where upper
438 and lower case letters were to be treated as identical. The DNS
439 specification [RFC1034] section 3.5 entitled "Preferred name syntax"
440 then repeated this definition in 1987, saying that this "syntax will
441 result in fewer problems with many applications that use domain names
442 (e.g., mail, TELNET)".
Confusion thus remained as to whether the "preferred" name syntax
was a mandatory restriction in DNS or merely "preferred".
447 The definition of an Internet host name was updated in 1989
448 ([RFC1123] section 2.1) to allow names starting with a digit (to
449 support IPv4 addresses in dotted-decimal form). Section 6.1 of
450 "Requirements for Internet Hosts -- Application and Support"
451 [RFC1123] discusses the use of DNS (and the hosts file) for resolving
452 host names to IP addresses and vice versa. This led to confusion as
453 to whether all names in DNS are "host names", or whether a "host
454 name" is merely a special case of a DNS name.
456 By 1997, things had progressed to a state where it was necessary to
457 clarify these areas of confusion. "Clarifications to the DNS
458 Specification" [RFC2181] section 11 states:
460 The DNS itself places only one restriction on the particular
461 labels that can be used to identify resource records. That one
462 restriction relates to the length of the label and the full name.
463 The length of any one label is limited to between 1 and 63 octets.
464 A full domain name is limited to 255 octets (including the
465 separators). The zero length full name is defined as representing
466 the root of the DNS tree, and is typically written and displayed
467 as ".". Those restrictions aside, any binary string whatever can
468 be used as the label of any resource record. Similarly, any
469 binary string can serve as the value of any record that includes a
470 domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
471 and any others that may be added). Implementations of the DNS
472 protocols must not place any restrictions on the labels that can
473 be used.
It thus clarified that the restriction to letters, digits, and
hyphens does not apply to DNS names in general, nor to records that
include "domain names". Hence the "preferred" name syntax described
478 in the original DNS specification [RFC1034] is indeed merely
479 "preferred", not mandatory.
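
The only restrictions that remain are the length limits, which can be sketched as a simple check in Python. The function name check_lengths is hypothetical, and the sketch assumes that the 255-octet limit is counted over the encoded form, in which every label costs one extra length octet and the root label costs one more.

```python
from typing import List


def check_lengths(labels: List[bytes]) -> bool:
    """Apply only the length limits; label contents are never inspected."""
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
    # One length octet per label, plus one octet for the root label.
    total = sum(len(label) + 1 for label in labels) + 1
    return total <= 255


print(check_lengths([b"www", b"example", b"com"]))  # True
print(check_lengths([b"has space", b"\xff\x00"]))   # True: any binary string
print(check_lengths([b"a" * 64, b"com"]))           # False: label too long
```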
481 Since there is no restriction even to ASCII, let alone letter-digit-
482 hyphen use, DNS is in conformance with the IETF requirement to allow
483 UTF-8 [RFC2277].
485 Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
486 in DNS packets or "char *" APIs because existing software already
487 uses ASCII, and UTF-16 and UTF-32 strings can contain all-zero octets
488 that existing software will interpret as the end of the string. To
489 use UTF-16 or UTF-32 one would need some way of knowing whether the
490 string was encoded using ASCII, UTF-16, or UTF-32, and indeed for
491 UTF-16 or UTF-32 whether it was big-endian or little-endian encoding.
492 In contrast, UTF-8 works well because any 7-bit ASCII string is also
493 a UTF-8 string representing the same characters.
495 If a private name space is defined to use UTF-8 (and not other
496 encodings such as UTF-16 or UTF-32), there's no need for a mechanism
497 to know whether a string was encoded using ASCII or UTF-8, because
498 (for any string that can be represented using ASCII) the
499 representations are exactly the same. In other words, for any string
500 that can be represented using ASCII it doesn't matter whether it is
501 interpreted as ASCII or UTF-8 because both encodings are the same,
502 and for any string that can't be represented using ASCII, it's
503 obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and
504 UTF-8 are both byte-oriented encodings so the question of big-endian
505 or little-endian encoding doesn't apply.
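
Both properties are easy to observe in a short Python sketch. The sample names are arbitrary and chosen only for illustration.

```python
ascii_name = "host1"

# The ASCII and UTF-8 octets are identical; the two UTF-16 byte orders differ.
for codec in ("ascii", "utf-8", "utf-16-le", "utf-16-be"):
    print(f"{codec:10} {ascii_name.encode(codec).hex(' ')}")

# ascii      68 6f 73 74 31
# utf-8      68 6f 73 74 31
# utf-16-le  68 00 6f 00 73 00 74 00 31 00
# utf-16-be  00 68 00 6f 00 73 00 74 00 31

# A name that cannot be represented in ASCII always has high-bit octets
# in UTF-8, so it cannot be mistaken for an ASCII string.
non_ascii = "h\u00f6st".encode("utf-8")
has_high_bit = any(b & 0x80 for b in non_ascii)
print(has_high_bit)  # True
```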
507 While implementations of the DNS protocol must not place any
508 restrictions on the labels that can be used, applications that use
509 the DNS are free to impose whatever restrictions they like, and many
have. The above rules permit a domain name label that contains
unusual characters, such as embedded spaces, which many applications
would consider a bad idea. For example, the original specification
513 in [RFC0821] of the SMTP protocol [RFC5321] constrains the character
514 set usable in email addresses. There is now an effort underway to
515 permit SMTP to support internationalized email addresses via an
516 extension.
518 Shortly after the DNS Clarifications [RFC2181] and IETF character
519 sets and languages policy [RFC2277] were published, the need for
520 internationalized names within private name spaces (i.e., within
521 enterprises) arose. The current (and past, predating IDNA and the
522 prefixed ACE conventions) practice within enterprises that support
523 other languages is to put UTF-8 names in their internal DNS servers
524 in a private name space. For example, "Using the UTF-8 Character Set
525 in the Domain Name System" [I-D.skwan-utf8-dns-00] was first written
in 1997, and was then widely deployed in Windows. The use of UTF-8
names in DNS was similarly implemented and deployed in MacOS, simply
because applications blindly passed UTF-8 strings to the name
resolution APIs, the name resolution APIs blindly passed those UTF-8
strings to the DNS servers, and the DNS servers correctly answered
those queries. From the user's point of view, everything worked
properly without any special new code being written, except that ASCII
is matched case-insensitively whereas UTF-8 is not (although some
enterprise DNS servers reportedly attempt to do case-insensitive
matching on UTF-8 within private name spaces). Within a private name
space, and especially in light of the IETF UTF-8 policy [RFC2277], it
was reasonable to assume that binary strings were encoded in UTF-8.
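
The case-matching difference mentioned above can be sketched in Python. The functions dns_fold and dns_equal are hypothetical; they model a server that folds only the ASCII letters A-Z and compares all other octets exactly.

```python
def dns_fold(label: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z; leave every other octet alone."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def dns_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels the way an ASCII-only case-insensitive match would."""
    return dns_fold(a) == dns_fold(b)


print(dns_equal(b"WWW", b"www"))  # True
# The ASCII letters fold, and the non-ASCII letter is identical in both.
print(dns_equal("B\u00fcCHER".encode("utf-8"),
                "b\u00fccher".encode("utf-8")))  # True
# Upper- and lower-case forms of the non-ASCII letter do not match.
print(dns_equal("\u00dcber".encode("utf-8"),
                "\u00fcber".encode("utf-8")))    # False
```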
540 As implied earlier, there are also issues with mapping strings to
541 some canonical form, independent of the encoding. Such issues are
542 not discussed in detail in this document. They are discussed to some
543 extent in, for example, Section 3 of [RFC5198], and are left as
544 opportunities for elaboration in other documents.
546 Five years after UTF-8 was already in use in private name spaces in
DNS, the strategy of using a reserved prefix and an ASCII-Compatible
Encoding (ACE) was developed for IDNA. That strategy included the
Punycode algorithm, which was developed (during the period
550 from 2002 [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in
551 the public DNS name space. One reason the prefixed ACE strategy was
552 selected for the public DNS name space had to do with concerns about
553 whether the details of IDNA, including the use of the Punycode
554 algorithm, were an adequate solution to the problems that were posed.
555 If either the Punycode algorithm or fundamental aspects of character
556 handling were wrong, and had to be changed to something incompatible,
557 it would be possible to switch to a new prefix or adopt another model
558 entirely. Only the part of the public DNS namespace that starts a
559 label with "xn--" would be polluted.
561 Today the algorithm is seen as being about as good as it can
562 realistically be, so moving to a different encoding (UTF-8 as
563 suggested in this document) that can be viewed as "native" would not
564 be as risky as it would have been in 2002.
566 In any case, the publication of [RFC3492] and the dependencies on it
567 in [IDNA2008-Protocol] and the earlier [RFC3490] thus resulted in
568 having to use different encodings for different name spaces (where
569 UTF-8 for private name spaces was already deployed). Hence,
570 referring back to Figure 2, a different encoding scheme may be in use
571 on the Internet vs. a local network.
573 In general a host may be connected to zero or more networks using
574 private name spaces, plus potentially the public name space.
575 Applications that convert a U-label form IDN to an A-label before
576 calling getaddrinfo() will incur name resolution failures if the name
577 is actually registered in a private name space in some other encoding
578 (e.g., UTF-8). Having libraries or protocols convert from A-labels
579 to the encoding used by a private name space (e.g., UTF-8) would
580 require changes to APIs and/or servers, which IDNA was intended to
581 avoid.
583 Also, a fully-qualified domain name (FQDN) to be resolved may be
584 obtained directly from an application, or it may be composed by the
585 DNS resolver itself from a single label obtained from an application
586 by using a configured suffix search list, and the resulting FQDN may
587 use multiple encodings in different labels. For more information on
588 the suffix search list, see section 6 of "Common DNS Implementation
589 Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
590 [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
591 [RFC3646].
593 As noted in [RFC1536] section 6, the community has had bad
594 experiences (e.g., [RFC1535]) with "searching" for domain names by
595 trying multiple variations or appending different suffixes. Such
596 searching can yield inconsistent results depending on the order in
597 which alternatives are tried. Nonetheless, the practice is
598 widespread and must be considered.
600 The practice of searching for names, whether by the use of a suffix
601 search list or by searching in different namespaces, can yield
602 inconsistent results. For example, even when a suffix search list is
603 only used when an application provides a name containing no dots, two
604 clients with different configured suffix search lists can get
605 different answers, and the same client could get different answers at
606 different times if it changes its configuration (e.g., when moving to
607 another network). A deeper discussion of this topic is outside the
608 scope of this document.
610 3.1. Examples
612 Some examples of cases that can happen in existing implementations
613 today (where {non-ASCII} below represents some user-entered non-ASCII
614 string) are:
615 1. User types in {non-ASCII}.{non-ASCII}.com, and the application
616 passes it, in the form of a UTF-8 string, to getaddrinfo or
617 gethostbyname or equivalent.
618 * The DNS resolver passes the (UTF-8) string unmodified to a DNS
619 server.
620 2. User types in {non-ASCII}.{non-ASCII}.com, and the application
621 passes it to a name resolution API that accepts strings in some
622 other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
623 * The name resolution API decides to pass the string to DNS (and
624 possibly other protocols).
625 * The DNS resolver converts the name from UTF-16 to UTF-8 and
626 passes the query to a DNS server.
627 3. User types in {non-ASCII}.{non-ASCII}.com, but the application
628 first converts it to A-label form such that the name that is
629 passed to name resolution APIs is (say)
630 xn--e1afmkfd.xn--80akhbyknj4f.com.
631 * The name resolution API decides to pass the string to DNS (and
632 possibly other protocols).
633 * The DNS resolver passes the string unmodified to a DNS server.
634 * If the name is not found in DNS, the name resolution API
635 decides to try another protocol, say mDNS.
636 * The query goes out in mDNS, but because mDNS specifies that
637 names are to be registered in UTF-8, the name isn't found,
638 since it was encoded as an A-label in the query.
639 4. User types in {non-ASCII}, and the application passes it, in the
640 form of a UTF-8 string, to getaddrinfo or equivalent.
641 * The name resolution API decides to pass the string to DNS (and
642 possibly other protocols).
643 * The DNS resolver will append suffixes in the suffix search
644 list, which may contain UTF-8 characters if the local network
645 uses a private name space.
646 * Each FQDN in turn will then be sent in a query to a DNS
647 server, until one succeeds.
648 5. User types in {non-ASCII}, but the application first converts it
649 to an A-label, such that the name that is passed to getaddrinfo
650 or equivalent is (say) xn--e1afmkfd.
651 * The name resolution API decides to pass the string to DNS (and
652 possibly other protocols).
653 * The DNS stub resolver will append suffixes in the suffix
654 search list, which may contain UTF-8 characters if the local
655 network uses a private name space, resulting in (say)
656 xn--e1afmkfd.{non-ASCII}.com.
657 * Each FQDN in turn will then be sent in a query to a DNS
658 server, until one succeeds.
659 * Since the private name space in this case uses UTF-8, the
660 above queries fail, since the A-label version of the name was
661 not registered in that name space.
663 6. User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
664 {non-ASCII3}.com is a public name space using IDNA and A-labels,
665 but {non-ASCII2}.{non-ASCII3}.com is a private name space using
666 UTF-8, which is accessible to the user. The application passes
667 the name, in the form of a UTF-8 string, to getaddrinfo or
668 equivalent.
669 * The name resolution API decides to pass the string to DNS (and
670 possibly other protocols).
671 * The DNS resolver tries to locate the authoritative server, but
672 fails the lookup because it cannot find a server for the UTF-8
673 encoding of {non-ASCII3}.com, even though it would have access
674 to the private name space. (To make this work, the private
675 name space would need to include the UTF-8 encoding of
676 {non-ASCII3}.com.)
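Examples 4 and 5 above can be sketched as follows. This is a simplified model, not the behavior of any particular resolver: the expand() helper, the search list contents, and the typed label are all hypothetical, and the A-label conversion assumes Python's built-in "idna" codec (which follows [RFC3490]).

```python
def expand(label, search_list):
    """Compose candidate FQDNs from a single label and a suffix search list."""
    return [label + "." + suffix for suffix in search_list]

search_list = ["отдел.example"]   # hypothetical suffix in a UTF-8 private name space
typed = "пример"                  # hypothetical user-entered label

# Example 4: the application passes the label through unchanged.
unconverted = expand(typed, search_list)[0]

# Example 5: the application converts the label to an A-label first.
a_label = typed.encode("idna").decode("ascii")
preconverted = expand(a_label, search_list)[0]

# The pre-converted candidate mixes an A-label with a non-ASCII suffix,
# so its UTF-8 form differs from the name registered in UTF-8.
assert preconverted.startswith("xn--")
assert preconverted.encode("utf-8") != unconverted.encode("utf-8")
```

Only the unconverted candidate matches a record registered in UTF-8, which is why the queries in example 5 fail.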
678 When users use multiple applications, some of which do A-label
679 conversion prior to passing a name to name resolution APIs, and some
680 of which do not, odd behavior can result which at best violates the
681 principle of least surprise, and at worst can result in security
682 vulnerabilities.
684 First consider two competing applications, such as web browsers, that
685 are designed to achieve the same task. If the user types the same
686 name into each browser, one may successfully resolve the name (and
687 hence access the desired content) because the encoding scheme was
688 correct, while the other may fail name resolution because the
689 encoding scheme was incorrect. Hence the issue can push users to
690 switch to another application (which in some cases means switching to
691 an IDNA application, and in other cases means switching away from an
692 IDNA application).
694 Next consider two separate applications where one is designed to be
695 launched from the other, for example a web browser launching a media
696 player application when the link to a media file is clicked. If both
697 types of content (web pages and media files in this example) are
698 hosted at the same IDN in a private name space, but one application
699 converts to A-labels before calling name resolution APIs and the
700 other does not, the user may be able to access a web page and click
701 on the media file, causing the media player to launch and attempt to
702 retrieve the media file, which will then fail because the IDN
703 encoding scheme was incorrect. Even worse, if an attacker was able
704 to register the same name in the other encoding scheme, the user may
705 get the content from the attacker's machine. This is similar to a normal
706 phishing attack, except that the two names represent exactly the same
707 Unicode characters.
709 4. Recommendations
711 On many platforms, the name resolution library will automatically use
712 a variety of protocols to search a variety of name spaces which might
713 be using UTF-8 or other encodings. In addition, even when only the
714 DNS protocol is used, in many operational environments, a private DNS
715 name space using UTF-8 is also deployed and is automatically searched
716 by the name resolution library.
718 As explained earlier, using multiple canonical formats, and multiple
719 encodings in different protocols or even in different places in the
720 same namespace, creates problems. Because of this, and the fact that
721 both IDNA A-labels and UTF-8 are in use as encoding mechanisms for
722 domain names today, we recommend the following.
724 It is inappropriate for an application to convert a name to an
725 A-label unless the application is absolutely certain that, in all
726 environments where the application might be used, only the global DNS
727 that uses IDNA A-labels actually will be used to resolve the name.
729 Instead, conversion to A-label form, UTF-8, or any other encoding,
730 should be done only by an entity that knows which protocol will be
731 used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
732 the name to DNS), rather than by general applications that call
733 protocol-independent name resolution APIs. (Of course, it is still
734 necessary for applications to convert to whatever form those APIs
735 expect.) Similarly, even when DNS is used, the conversion to
736 A-labels should be done only by an entity that knows which name space
737 will be used.
739 That is, a more intelligent DNS resolver would be more liberal in
740 what it would accept from an application and be able to query for
741 both a name in A-label form (e.g., over the Internet) and a UTF-8
742 name (e.g., over a corporate network with a private name space) in
743 case the server only recognized one. However, we might also take
744 into account that the various resolution behaviors discussed earlier
745 could also occur with record updates (e.g., with Dynamic Update
746 [RFC2136]), resulting in some names being registered in a local
747 network's private name space by applications doing conversion to
748 A-labels, and other names being registered using UTF-8. Hence a name
749 might have to be queried with both encodings to be sure to succeed
750 without changes to DNS servers.
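A resolver that is liberal in what it accepts, and that queries with both encodings, might look roughly like the sketch below. The names query_forms(), resolve(), and the lookup callable are hypothetical, the order in which the two forms are tried is an arbitrary assumption, and the conversion relies on Python's built-in "idna" codec, which implements [RFC3490] rather than [IDNA2008-Protocol].

```python
def query_forms(name):
    """Return (A-label form, UTF-8 form) of a name given in either form.

    The labels of `name` may be any mix of U-labels and A-labels.
    """
    a_form = name.encode("idna")     # every label as ASCII (A-labels)
    u_form = a_form.decode("idna")   # every label as Unicode (U-labels)
    return a_form, u_form.encode("utf-8")

def resolve(name, lookup):
    """Try each encoding in turn.

    `lookup` is a hypothetical callable that takes the query name as
    bytes and returns an answer, or None if the name is not found.
    """
    for form in query_forms(name):
        answer = lookup(form)
        if answer is not None:
            return answer
    return None

# Hypothetical private name space that registered the name in UTF-8.
zone = {"пример.example".encode("utf-8"): "192.0.2.1"}

# The application may pass either form; both resolve.
as_a_label = "пример.example".encode("idna").decode("ascii")
print(resolve(as_a_label, zone.get))
print(resolve("пример.example", zone.get))
```

Trying several forms is itself a kind of searching, so the caveats about inconsistent results noted earlier apply to this approach as well.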
752 Similarly, a more intelligent stub resolver would also be more
753 liberal in what it would accept from a response as the value of a
754 record (e.g., PTR) in that it would accept either UTF-8 (U-labels in
755 the case of IDNA) or A-labels and convert them to whatever encoding
756 is used by the application APIs to return strings to applications.
758 Indeed the choice of conversion within the resolver libraries is
759 consistent with the quote from section 6.2 of the original IDNA
760 specification [RFC3490] stating that conversion using the Punycode
761 algorithm (i.e., to A-labels) "might be performed inside these new
762 versions of the resolver libraries".
764 That said, some application-layer protocols (e.g., [RFC4282]) are
765 defined to use A-labels rather than the UTF-8 recommended by the IETF
766 character sets and languages policy [RFC2277]. In this case, an
767 application may receive a string containing A-labels and want to pass
768 it to name resolution APIs. Again the recommendation that a resolver
769 library be more liberal in what it would accept from an application
770 would mean that such a name would be accepted and re-encoded as
771 needed, rather than requiring the application to do so.
773 It is important that any APIs used by applications to pass names
774 specify what encoding(s) the API uses. For example, GetAddrInfoW()
775 on Windows specifies that it accepts UTF-16. In contrast, the
776 original specification of getaddrinfo() [RFC3493] did not, and hence
777 platforms vary in what they use (e.g., MacOS uses UTF-8 whereas
778 Windows uses Windows code pages).
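The following minimal sketch shows why the API's encoding has to be specified. The same name has different byte representations in UTF-16 and UTF-8, and a layer that knows the API encoding can convert to the encoding used on the wire. The helper name is hypothetical, and the little-endian UTF-16 byte order is an assumption made for illustration.

```python
def utf16_api_to_utf8_query(raw):
    """Convert a name received by a UTF-16 API into UTF-8 query bytes."""
    return raw.decode("utf-16-le").encode("utf-8")

name = "пример.example"               # hypothetical name

as_utf16 = name.encode("utf-16-le")   # bytes seen by a UTF-16 API
as_utf8 = name.encode("utf-8")        # bytes seen by a UTF-8 API

# Without knowing which encoding the API uses, the bytes are ambiguous.
assert as_utf16 != as_utf8

# With that knowledge, conversion between the two is mechanical.
assert utf16_api_to_utf8_query(as_utf16) == as_utf8
```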
780 Finally, the question remains about what, if anything, a DNS server
781 should do to handle cases where some existing applications or hosts
782 do IDNA queries using A-labels within the local network using a
783 private name space, and other existing applications or hosts send
784 UTF-8 queries. It is undesirable to store different records for
785 different encodings of the same name, since this introduces the
786 possibility for inconsistency between them. Instead, a new DNS
787 server serving a private name space using UTF-8 could potentially
788 treat encoding-conversion in the same way as case-insensitive
789 comparison which a DNS server is already required to do, as long as the
790 DNS server has some way to know what the encoding is. Two encodings
791 are, in this sense, two representations of the same name, just as two
792 case-different strings are. However, whereas case comparison of non-
793 ASCII characters is complicated by ambiguities (as explained in the
794 IAB's Review and Recommendations for Internationalized Domain Names
795 [RFC4690]), encoding conversion between A-labels and U-labels is
796 unambiguous.
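One way to picture a server that treats encoding conversion like case-insensitive comparison is to reduce every name to a single lookup key before storing or matching it. The sketch below is illustrative only: the Zone class and canonical_key() are hypothetical, incoming names are assumed to be either UTF-8 or A-label form, and the conversion uses Python's built-in "idna" codec, which follows [RFC3490] and its character mappings rather than [IDNA2008-Protocol].

```python
def canonical_key(query_name):
    """Map a name (bytes, in UTF-8 or A-label form) to one lookup key."""
    # A-labels are pure ASCII, so they are also valid UTF-8.
    text = query_name.decode("utf-8")
    # Convert every label to A-label form, then fold ASCII case.
    return text.encode("idna").lower()

class Zone:
    """Hypothetical record store keyed by the canonical form of a name."""

    def __init__(self):
        self._records = {}

    def add(self, name, value):
        self._records[canonical_key(name)] = value

    def find(self, name):
        return self._records.get(canonical_key(name))

zone = Zone()
zone.add("пример.example".encode("utf-8"), "192.0.2.1")

# A query in A-label form finds the record registered in UTF-8.
print(zone.find("пример.example".encode("idna")))
```

Because only one record is stored per name, the two encodings cannot drift out of step with each other, and an A-label registration cannot coexist with a different UTF-8 registration of the same name.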
798 5. Security Considerations
800 Having applications convert names to prefixed ACE format (A-labels)
801 before calling name resolution can result in security
802 vulnerabilities. If the name is resolved by protocols or in zones
803 for which records are registered using other encoding schemes, an
804 attacker can claim the A-label version of the same name and hence
805 trick the victim into accessing a different destination. This can be
806 done for any non-ASCII name, even when there is no possible confusion
807 due to case, language, or other issues. Other types of confusion
808 beyond those resulting simply from the choice of encoding scheme are
809 discussed in "Review and Recommendations for IDNs" [RFC4690].
811 Designers and users of encodings that represent Unicode strings in
812 terms of ASCII should also consider whether phishing is an issue,
813 e.g., if one name would be encoded in a way that would be naturally
814 associated with another organization or product.
816 6. IANA Considerations
818 [RFC Editor: please remove this section prior to publication.]
820 This document has no IANA Actions.
822 7. Acknowledgements
824 The authors wish to thank Patrik Faltstrom, Martin Duerst, JFC
825 Morfin, Ran Atkinson, and S. Moonesamy for their careful review and
826 helpful suggestions. It is also interesting to note that none of the
827 first three individuals' names above can be spelled out and written
828 correctly in ASCII text. Furthermore, one of the IAB members' names
829 below (Andrei Robachevsky) cannot be written in the script as it
830 appears on his birth certificate.
832 8. IAB Members at the time of publication
834 Bernard Aboba
835 Marcelo Bagnulo
836 Ross Callon
837 Spencer Dawkins
838 Vijay Gill
839 Russ Housley
840 John Klensin
841 Olaf Kolkman
842 Danny McPherson
843 Jon Peterson
844 Andrei Robachevsky
845 Dave Thaler
846 Hannes Tschofenig
848 9. References
849 9.1. Normative References
851 [10646] International Organization for Standardization,
852 "Information Technology - Universal Multiple-octet coded
853 Character Set (UCS)".
855 ISO/IEC Standard 10646, comprised of ISO/IEC 10646-1:2000,
856 "Information technology -- Universal Multiple-Octet Coded
857 Character Set (UCS) -- Part 1: Architecture and Basic
858 Multilingual Plane", ISO/IEC 10646-2:2001, "Information
859 technology -- Universal Multiple-Octet Coded Character Set
860 (UCS) -- Part 2: Supplementary Planes" and ISO/IEC 10646-
861 1:2000/Amd 1:2002, "Mathematical symbols and other
862 characters".
864 [Unicode] The Unicode Consortium, "The Unicode Standard, Version
865 5.1.0", 2008.
867 defined by: The Unicode Standard, Version 5.0, Boston, MA,
868 Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
869 Unicode 5.1.0
870 (http://www.unicode.org/versions/Unicode5.1.0/).
872 9.2. Informative References
874 [I-D.cheshire-dnsext-multicastdns]
875 Cheshire, S. and M. Krochmal, "Multicast DNS",
876 draft-cheshire-dnsext-multicastdns-11 (work in progress),
877 March 2010.
879 [I-D.ietf-idn-punycode-00]
880 Costello, A., "Punycode version 0.3.3",
881 draft-ietf-idn-punycode-00 (work in progress), July 2002.
883 [I-D.skwan-utf8-dns-00]
884 Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in
885 the Domain Name System", draft-skwan-utf8-dns-00 (work in
886 progress), November 1997.
888 [IDNA2008-Defs]
889 Klensin, J., "Internationalized Domain Names for
890 Applications (IDNA): Definitions and Document Framework",
891 January 2010.
894 [IDNA2008-Protocol]
895 Klensin, J., "Internationalized Domain Names in
896 Applications (IDNA): Protocol", January 2010.
899 [MJD] Duerst, M., "The Properties and Promizes of UTF-8", 11th
900 International Unicode Conference, San Jose,
901 September 1997.
904 [NIS] Sun Microsystems, "System and Network Administration",
905 March 1990.
907 [RFC0821] Postel, J., "Simple Mail Transfer Protocol", STD 10,
908 RFC 821, August 1982.
910 [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
911 host table specification", RFC 952, October 1985.
913 [RFC1001] NetBIOS Working Group, "Protocol standard for a NetBIOS
914 service on a TCP/UDP transport: Concepts and methods",
915 STD 19, RFC 1001, March 1987.
917 [RFC1002] NetBIOS Working Group, "Protocol standard for a NetBIOS
918 service on a TCP/UDP transport: Detailed specifications",
919 STD 19, RFC 1002, March 1987.
921 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
922 STD 13, RFC 1034, November 1987.
924 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
925 and Support", STD 3, RFC 1123, October 1989.
927 [RFC1468] Murai, J., Crispin, M., and E. van der Poel, "Japanese
928 Character Encoding for Internet Messages", RFC 1468,
929 June 1993.
931 [RFC1535] Gavron, E., "A Security Problem and Proposed Correction
932 With Widely Deployed DNS Software", RFC 1535,
933 October 1993.
935 [RFC1536] Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
936 Miller, "Common DNS Implementation Errors and Suggested
937 Fixes", RFC 1536, October 1993.
939 [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
940 Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
941 the IAB Character Set Workshop held 29 February - 1 March,
942 1996", RFC 2130, April 1997.
944 [RFC2136] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
945 "Dynamic Updates in the Domain Name System (DNS UPDATE)",
946 RFC 2136, April 1997.
948 [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
949 Specification", RFC 2181, July 1997.
951 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
952 Languages", BCP 18, RFC 2277, January 1998.
954 [RFC3397] Aboba, B. and S. Cheshire, "Dynamic Host Configuration
955 Protocol (DHCP) Domain Search Option", RFC 3397,
956 November 2002.
958 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
959 "Internationalizing Domain Names in Applications (IDNA)",
960 RFC 3490, March 2003.
962 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
963 for Internationalized Domain Names in Applications
964 (IDNA)", RFC 3492, March 2003.
966 [RFC3493] Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
967 Stevens, "Basic Socket Interface Extensions for IPv6",
968 RFC 3493, February 2003.
970 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
971 10646", STD 63, RFC 3629, November 2003.
973 [RFC3646] Droms, R., "DNS Configuration options for Dynamic Host
974 Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
975 December 2003.
977 [RFC4282] Aboba, B., Beadles, M., Arkko, J., and P. Eronen, "The
978 Network Access Identifier", RFC 4282, December 2005.
980 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
981 Recommendations for Internationalized Domain Names
982 (IDNs)", RFC 4690, September 2006.
984 [RFC4795] Aboba, B., Thaler, D., and L. Esibov, "Link-local
985 Multicast Name Resolution (LLMNR)", RFC 4795,
986 January 2007.
988 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network
989 Interchange", RFC 5198, March 2008.
991 [RFC5321] Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
992 October 2008.
994 Authors' Addresses
996 Dave Thaler
997 Microsoft Corporation
998 One Microsoft Way
999 Redmond, WA 98052
1000 USA
1002 Phone: +1 425 703 8835
1003 Email: dthaler@microsoft.com
1005 John C Klensin
1006 1770 Massachusetts Ave, Ste 322
1007 Cambridge, MA 02140
1009 Phone: +1 617 245 1457
1010 Email: john+ietf@jck.com
1012 Stuart Cheshire
1013 Apple Inc.
1014 1 Infinite Loop
1015 Cupertino, CA 95014
1017 Phone: +1 408 974 3207
1018 Email: cheshire@apple.com