Network Working Group                                          D. Thaler
Internet-Draft                                                 Microsoft
Intended status: Informational                                J. Klensin
Expires: May 14, 2010
                                                             S. Cheshire
                                                                   Apple
                                                       November 10, 2009


      IAB Thoughts on Encodings for Internationalized Domain Names
                     draft-iab-idn-encoding-01.txt

Abstract

   This document explores issues with Internationalized Domain Names
   (IDNs) that result from the use of various encoding schemes such as
   Punycode and UTF-8.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 14, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

Table of Contents

   1. Introduction
      1.1. APIs
   2. Use of Non-DNS Protocols
   3. Use of Non-ASCII in DNS
      3.1. Examples
   4. Recommendations
   5. Security Considerations
   6. IANA Considerations
   7. IAB Members at the time of this writing
   8. References
      8.1. Normative References
      8.2. Informative References
   Authors' Addresses

1. Introduction

   The goal of this document is to explore what can be learned from some
   current difficulties in implementing Internationalized Domain Names
   (IDNs). Although some elements of this exploration may immediately
   feed back into current IETF work, it is explicitly not the intention
   for this document to influence any current working group charter.

   A domain name consists of a set of labels, conventionally written
   separated with dots. An Internationalized Domain Name (IDN) is a
   domain name that contains one or more labels that, in turn, contain
   one or more non-ASCII characters. Just as with plain ASCII domain
   names, each IDN label must be encoded using some mechanism before it
   can be transmitted in network packets, stored in memory, stored on
   disk, etc. These encodings need to be reversible, but they need not
   store domain names the same way humans conventionally write them on
   paper. For example, when transmitted over the network in DNS
   packets, domain name labels are *not* separated with dots.

   IDNA, discussed later in this document, is the standard that defines
   the use and coding of internationalized domain names for use on the
   public Internet. It is defined in several documents, with the
   primary one of those being "Internationalizing Domain Names in
   Applications (IDNA)" [RFC3490]. A revision to the IDNA Standard is
   undergoing IETF Last Call review as this document is being written.
   That revision is reflected in [IDNA2008-Defs] and associated
   materials. Except where noted, the two versions are approximately
   the same with regard to the issues discussed in this document.
   However, their terminology differs somewhat; this document reflects
   the terminology of the earlier version.

   Punycode [RFC3492] is a mechanism for encoding a Unicode [Unicode]
   string in ASCII characters using only letters, digits, and hyphens.
   When a Unicode label is encoded with Punycode, it is prefixed with
   "xn--", which assumes that other DNS labels are no longer allowed to
   start with these four characters. Consequently, when Punycode
   encoding is assumed, any DNS labels beginning with "xn--" now have a
   different meaning (the Punycode encoding of a label containing one or
   more non-ASCII characters) or no defined meaning at all (in the case
   of labels that are not well-formed Punycode).

   The term "ToASCII" refers to the process of encoding a label
   containing one or more non-ASCII characters as an ASCII string
   beginning with "xn--". It consists of a combination of a
   non-reversible character mapping operation (e.g., converting upper
   case characters to lower case characters), plus a reversible encoding
   algorithm ('Punycode') that encodes a sequence of Unicode code points
   (which may contain code points above 127) as a sequence of ASCII code
   points (containing only ASCII code points for letters, digits and
   hyphens). The term "ToUnicode" refers to the process of reversing
   the Punycode encoding, but not reversing the (irreversible) character
   mapping operation.

   ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII
   and Japanese characters, where an ASCII character is preserved as-is.

   Unicode [Unicode] is a list of characters (including non-spacing
   marks that are used to form some other characters), where each
   character is assigned an integer value, called a code point. In
   simple terms a Unicode string is a string of integer code point
   values in the range 0 to 1,114,111 (10FFFF in base 16), which
   represent a string of Unicode characters. These integer code points
   must be encoded using some mechanism before they can be transmitted
   in network packets, stored in memory, stored on disk, etc.
   Some common ways of encoding these integer code point values in
   computer systems include UTF-8, UTF-16, and UTF-32. In addition to
   the material below, those forms and the tradeoffs among them are
   discussed in Chapter 2 of The Unicode Standard [Unicode].

   UTF-8 [RFC3629] is a mechanism for encoding a Unicode code point in a
   variable number of 8-bit octets, where an ASCII code point is
   preserved as-is. Those octets encode a string of integer code point
   values, which represent a string of Unicode characters.

   UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code
   point in one or two 16-bit integers, described in detail in Sections
   3.9 and 3.10 of The Unicode Standard [Unicode]. A UTF-16 string
   encodes a string of integer code point values that represent a string
   of Unicode characters.

   UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and
   3.10, is a mechanism for encoding a Unicode code point in a single
   32-bit integer. A UTF-32 string is thus a string of 32-bit integer
   code point values, which represent a string of Unicode characters.

   Note that UTF-16 and UTF-32 codings result in some all-zero octets
   when code points occur early in the Unicode sequence.

   Different applications, APIs, and protocols use different encoding
   schemes today. Historically, many of them were originally defined to
   use only ASCII. Internationalizing Domain Names in Applications
   (IDNA) [RFC3490] defined a mechanism that required changes to
   applications but, in an attempt not to change APIs or servers,
   specified that Punycode is to be used. In some ways this could be
   seen as not changing the existing APIs, in the sense that the strings
   being passed to and from the APIs were still apparently ASCII strings.
   In other ways it was a very profound change to the existing APIs,
   because while those strings were still syntactically valid ASCII
   strings, they no longer meant the same thing as they used to. What
   looked like a plain ASCII string to one piece of software or library
   could be seen by another piece of software or library (with the
   application of out-of-band information) to be in fact an encoding of
   a Unicode string.

   Section 1.3 of the IDNA specification [RFC3490] states:

      The IDNA protocol is contained completely within applications. It
      is not a client-server or peer-to-peer protocol: everything is
      done inside the application itself. When used with a DNS resolver
      library, IDNA is inserted as a "shim" between the application and
      the resolver library. When used for writing names into a DNS
      zone, IDNA is used just before the name is committed to the zone.

   Figure 1 depicts a simplistic architecture that a naive reader might
   assume from the paragraph quoted above. (A variant of this same
   picture appears in Section 6 of the IDNA specification [RFC3490]
   further strengthening this assumption.)

   +-----------------------------------------+
   |Host                                     |
   |             +-------------+             |
   |             | Application |             |
   |             +------+------+             |
   |                    |                    |
   |               +----+----+               |
   |               |   DNS   |               |
   |               | Resolver|               |
   |               | Library |               |
   |               +----+----+               |
   |                    |                    |
   +-----------------------------------------+
                        |
               _________|_________
              /                   \
             /                     \
            /                       \
           |         Internet        |
            \                       /
             \                     /
              \___________________/

                Simplistic Architecture

                        Figure 1

   There are, however, two problems with this simplistic architecture
   that cause it to differ from reality.

   First, resolver APIs on Operating Systems (OSs) today (MacOS,
   Windows, Linux, etc.) are not DNS-specific.
   They typically provide a layer of indirection so that the application
   can work independent of the name resolution mechanism, which could be
   DNS, mDNS [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795],
   NetBIOS-over-TCP [RFC1001][RFC1002], /etc/hosts file [RFC0952], NIS
   [NIS], or anything else. For example, "Basic Socket Interface
   Extensions for IPv6" [RFC3493] specifies the getaddrinfo() API and
   contains many phrases like "For example, when using the DNS" and "any
   type of name resolution service (for example, the DNS)". Importantly,
   DNS is mentioned only as an example, and the application has no
   knowledge as to whether DNS or some other protocol will be used.

   Second, even with the DNS protocol, private name spaces (sometimes
   including private uses of the DNS) do not necessarily use the same
   character set encoding scheme as the public Internet name space.

   We will discuss each of the above issues in subsequent sections. For
   reference, Figure 2 depicts a more realistic architecture on typical
   hosts today (which don't have IDNA inserted as a shim immediately
   above the DNS resolver library). More generally, the host may be
   attached to one or more local networks, each of which may or may not
   be connected to the public Internet and may or may not have a private
   name space.

   +-----------------------------------------+
   |Host                                     |
   |             +-------------+             |
   |             | Application |             |
   |             +------+------+             |
   |                    |                    |
   |             +------+------+             |
   |             |   Generic   |             |
   |             |    Name     |             |
   |             | Resolution  |             |
   |             |     API     |             |
   |             +------+------+             |
   |                    |                    |
   |   +-----+------+---+--+-------+-----+   |
   |   |     |      |      |       |     |   |
   | +-+-++--+--++--+-++---+---++--+--++-+-+ |
   | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
   | +---++-----++----++-------++-----++---+ |
   |                                         |
   +-----------------------------------------+
                        |
                  ______|______
                 /             \
                /               \
               /      local      \
               \     network     /
                \               /
                 \_____________/
                        |
               _________|_________
              /                   \
             /                     \
            /                       \
           |         Internet        |
            \                       /
             \                     /
              \___________________/

                 Realistic Architecture

                        Figure 2

1.1. APIs

   Section 6.2 of the IDNA specification [RFC3490] states:

      It is expected that new versions of the resolver libraries in the
      future will be able to accept domain names in other charsets than
      ASCII, and application developers might one day pass not only
      domain names in Unicode, but also in local script to a new API for
      the resolver libraries in the operating system. Thus the ToASCII
      and ToUnicode operations might be performed inside these new
      versions of the resolver libraries.

   Resolver APIs such as getaddrinfo() and its predecessor
   gethostbyname() were defined to accept "char *" arguments, meaning
   they accept a string of bytes, terminated with a NUL (0) byte.
   Because of the use of a NUL octet as a string terminator, this is
   sufficient for ASCII strings, Punycode strings, and even ISO-2022-JP
   and UTF-8 strings (unless an implementation artificially precludes
   them), but not UTF-16 or UTF-32 strings. Several operating systems
   historically used in Japan will accept (and expect) ISO-2022-JP
   strings in such APIs.
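
   As a rough illustration of the terminator issue described above, the
   following Python sketch models what a reader of a NUL-terminated
   "char *" argument would see for the same label in several encodings.
   The helper name c_string_view is hypothetical and only mimics C
   string semantics; it is not part of any resolver API.

```python
def c_string_view(buf: bytes) -> bytes:
    """Octets a reader of a NUL-terminated "char *" would see."""
    # Everything from the first zero octet onward is invisible to the reader.
    return buf.split(b"\x00", 1)[0]


label = "example"
encodings = ["ascii", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

# Encode the same label each way and record what survives NUL termination.
views = {enc: c_string_view(label.encode(enc)) for enc in encodings}

for enc, seen in views.items():
    print(f"{enc:10} -> {seen!r}")
```

   Under these assumptions the ASCII and UTF-8 forms pass through
   intact, while the UTF-16 and UTF-32 forms are cut off at or before
   the first character, and the point of truncation depends on byte
   order.
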
   Some platforms used worldwide also have new versions of the APIs
   (e.g., GetAddrInfoW() on Windows) that accept other encoding schemes
   such as UTF-16.

   It is worth noting that an API using "char *" arguments can
   distinguish between ASCII, Punycode, ISO-2022-JP, and UTF-8 labels in
   names if the coding is known to be one of those four. An example
   method is as follows:

   o  if the label contains an ESC (0x1B) byte the label is ISO-2022-JP;
      otherwise,
   o  if any byte in the label has the high bit set, the label is UTF-8;
      otherwise,
   o  if the label starts with "xn--" then it contains a string in
      Punycode encoding; otherwise,
   o  the label is ASCII.

   Again this assumes that ASCII labels never start with "xn--", and
   also that UTF-8 strings never contain an ESC character. Also the
   above is merely an illustration; UTF-8 can be detected and
   distinguished from other 8-bit encodings with high precision [MJD].

   It is more difficult or impossible to distinguish the ISO 8859
   character sets from each other. Similarly, it is not possible in
   general to distinguish between ISO-2022-JP and any other encoding
   based on ISO 2022 code table switching.

   Although it is possible (as in the example above) to distinguish some
   encodings when not explicitly specified, it is cleaner to have the
   encodings specified explicitly, such as specifying UTF-16 for
   GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8
   strings.

2. Use of Non-DNS Protocols

   As noted earlier, typical name resolution libraries are not
   DNS-specific. Furthermore, some protocols are defined to use encoding
   schemes other than Punycode. For example, mDNS
   [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used.
   Indeed, the IETF policy on character sets and languages [RFC2277]
   states:

      Protocols MUST be able to use the UTF-8 charset, which consists of
      the ISO 10646 coded character set combined with the UTF-8
      character encoding scheme, as defined in [10646] Annex R
      (published in Amendment 2), for all text. Protocols MAY specify,
      in addition, how to use other charsets or other character encoding
      schemes for ISO 10646, such as UTF-16, but lack of an ability to
      use UTF-8 is a violation of this policy; such a violation would
      need a variance procedure ([BCP9] section 9) with clear and solid
      justification in the protocol specification document before being
      entered into or advanced upon the standards track. For existing
      protocols or protocols that move data from existing datastores,
      support of other charsets, or even using a default other than
      UTF-8, may be a requirement. This is acceptable, but UTF-8
      support MUST be possible.

   Applications that convert an IDN to Punycode before calling
   getaddrinfo() will experience name resolution failures if the
   Punycode name is directly used in such protocols. Having libraries or
   protocols convert from Punycode to the encoding scheme defined by
   the protocol (e.g., UTF-8) would require changes to APIs and/or
   servers, which IDNA was intended to avoid.

   As a result, applications that assume that non-ASCII names are
   resolved using the public DNS, and blindly convert them to Punycode
   without knowledge of what protocol will be selected by the name
   resolution library, have problems. Furthermore, name resolution
   libraries often try multiple protocols until one succeeds, because
   they are defined to use a common name space. For example, the hosts
   file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
   common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
   [RFC1034] section 2.1).
   This means that when an application passes a name to be resolved,
   resolution may in fact be attempted using multiple protocols, each
   with a potentially different encoding scheme. For this to work
   successfully, the name must be converted to the appropriate encoding
   scheme only after the choice is made to use that protocol. In
   general, this cannot be done by the application since the choice of
   protocol is not made by the application.

3. Use of Non-ASCII in DNS

   A common misconception is that DNS only supports names that can be
   expressed using letters, digits, and hyphens.

   This misconception originally stemmed from the definition in 1985 of
   an "Internet host name" (and net, gateway, and domain name) for use
   in the "hosts" file [RFC0952]. An Internet host name was defined
   therein as including only letters, digits, and hyphens, where upper
   and lower case letters were to be treated as identical. The DNS
   specification [RFC1034] section 3.5 entitled "Preferred name syntax"
   then repeated this definition in 1987, saying that this "syntax will
   result in fewer problems with many applications that use domain names
   (e.g., mail, TELNET)".

   The confusion was thus left as to whether the "preferred" name syntax
   was a mandatory restriction in DNS, or merely "preferred".

   The definition of an Internet host name was updated in 1989
   ([RFC1123] section 2.1) to allow names starting with a digit (to
   support IPv4 addresses in dotted-decimal form). Section 6.1 of
   "Requirements for Internet Hosts -- Application and Support"
   [RFC1123] discusses the use of DNS (and the hosts file) for resolving
   host names to IP addresses and vice versa. This led to confusion as
   to whether all names in DNS are "host names", or whether a "host
   name" is merely a special case of a DNS name.

   By 1997, things had progressed to a state where it was necessary to
   clarify these areas of confusion.
   "Clarifications to the DNS Specification" [RFC2181] section 11
   states:

      The DNS itself places only one restriction on the particular
      labels that can be used to identify resource records. That one
      restriction relates to the length of the label and the full name.
      The length of any one label is limited to between 1 and 63 octets.
      A full domain name is limited to 255 octets (including the
      separators). The zero length full name is defined as representing
      the root of the DNS tree, and is typically written and displayed
      as ".". Those restrictions aside, any binary string whatever can
      be used as the label of any resource record. Similarly, any
      binary string can serve as the value of any record that includes a
      domain name as some or all of its value (SOA, NS, MX, PTR, CNAME,
      and any others that may be added). Implementations of the DNS
      protocols must not place any restrictions on the labels that can
      be used.

   Hence, it clarified that the restriction to letters, digits, and
   hyphens does not apply to DNS names in general, nor to records that
   include "domain names". Hence the "preferred" name syntax described
   in the original DNS specification [RFC1034] is indeed merely
   "preferred", not mandatory.

   Since there is no restriction even to ASCII, let alone
   letter-digit-hyphen use, DNS is in conformance with the IETF
   requirement to allow UTF-8 [RFC2277].

   Using UTF-16 or UTF-32 encoding, however, would not be ideal for use
   in DNS packets or APIs because existing software already uses ASCII,
   and UTF-16 and UTF-32 strings can contain all-zero octets that
   existing software may interpret as the end of the string. To use
   UTF-16 or UTF-32 one would need some way of knowing whether the
   string was encoded using ASCII, UTF-16, or UTF-32, and indeed for
   UTF-16 or UTF-32 whether it was big-endian or little-endian encoding.
   In contrast, UTF-8 works well because any 7-bit ASCII string is also
   a UTF-8 string representing the same characters.

   If a private name space is defined to use UTF-8 (and not other
   encodings such as UTF-16 or UTF-32), there's no need for a mechanism
   to know whether a string was encoded using ASCII or UTF-8, because
   (for any string that can be represented using ASCII) the
   representations are exactly the same. In other words, for any string
   that can be represented using ASCII it doesn't matter whether it is
   interpreted as ASCII or UTF-8 because both encodings are the same,
   and for any string that can't be represented using ASCII, it's
   obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and
   UTF-8 are both byte-oriented encodings so the question of big-endian
   or little-endian encoding doesn't apply.

   While implementations of the DNS protocol must not place any
   restrictions on the labels that can be used, applications that use
   the DNS are free to impose whatever restrictions they like, and many
   have. The above rules permit a domain name label that contains
   unusual characters, such as embedded spaces, which many applications
   would consider a bad idea. For example, SMTP [RFC5321], going back
   to the original specification in [RFC0821], constrains the character
   set usable in email addresses. There is now an effort underway to
   permit SMTP to support internationalized email addresses via an
   extension.

   Shortly after the DNS Clarifications [RFC2181] and IETF character
   sets and languages policy [RFC2277] were published, the need for
   internationalized names within private name spaces (i.e., within
   enterprises) arose. The current (and past, predating Punycode)
   practice within enterprises that support other languages is to put
   UTF-8 names in their internal DNS servers in a private name space.
   For example, "Using the UTF-8 Character Set in the Domain Name
   System" [I-D.skwan-utf8-dns-00] was first written in 1997, and was
   then widely deployed in Windows. The use of UTF-8 names in DNS was
   similarly implemented and deployed in MacOS, simply by virtue of the
   fact that applications blindly passed UTF-8 strings to the name
   resolution APIs, and the name resolution APIs blindly passed those
   UTF-8 strings to the DNS servers, and the DNS servers correctly
   answered those queries, and from the user's point of view everything
   worked properly without any special new code being written, except
   that ASCII is matched case-insensitively whereas UTF-8 is not
   (although some enterprise DNS servers reportedly attempt to do
   case-insensitive matching on UTF-8 within private name spaces).
   Within a private name space, and especially in light of the IETF
   UTF-8 policy [RFC2277], it was reasonable to assume that binary
   strings were encoded in UTF-8.

   [EDITOR'S NOTE: There are also normalization/mapping issues.
   Currently we only explore encoding issues.]

   Five years after UTF-8 was already in use in private name spaces in
   DNS, Punycode began to be developed (during the period from 2002
   [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in the public
   DNS name space. This publication thus resulted in having to use
   different encodings for different name spaces (where UTF-8 for
   private name spaces was already deployed). Hence, referring back to
   Figure 2, a different encoding scheme may be in use on the Internet
   vs. a local network.

   In general a host may be connected to zero or more networks using
   private name spaces, plus potentially the public name space.
   Applications that convert an IDN to Punycode before calling
   getaddrinfo() will experience name resolution failures if the name is
   actually registered in a private name space in some other encoding
   (e.g., UTF-8). Having libraries or protocols convert from Punycode
   to the encoding used by a private name space (e.g., UTF-8) would
   require changes to APIs and/or servers, which IDNA was intended to
   avoid.

   Also, a fully-qualified domain name (FQDN) to be resolved may be
   obtained directly from an application, or it may be composed by the
   DNS resolver itself from a single label obtained from an application
   by using a configured suffix search list, and the resulting FQDN may
   use multiple encodings in different labels. For more information on
   the suffix search list, see section 6 of "Common DNS Implementation
   Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
   [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
   [RFC3646].

   As noted in [RFC1536] section 6, the community has had bad
   experiences with "searching" for domain names by trying multiple
   variations or appending different suffixes. Such searching can yield
   inconsistent results depending on the order in which alternatives are
   tried. Nonetheless, the practice is widespread and must be
   considered.

   The practice of searching for names, whether by the use of a suffix
   search list or by searching in different namespaces, can yield
   inconsistent results. For example, even when a suffix search list is
   only used when an application provides a name containing no dots, two
   clients with different configured suffix search lists can get
   different answers, and the same client could get different answers at
   different times if it changes its configuration (e.g., when moving to
   another network). A deeper discussion of this topic is outside the
   scope of this document.

3.1. Examples

   Some examples of cases that can happen in existing implementations
   today (where {non-ASCII} below represents some user-entered non-ASCII
   string) are:

   1.  User types in {non-ASCII}.{non-ASCII}.com, and the application
       passes it, in the form of a UTF-8 string, to getaddrinfo or
       gethostbyname or equivalent.
       *  The DNS resolver passes the (UTF-8) string unmodified to a DNS
          server.
   2.  User types in {non-ASCII}.{non-ASCII}.com, and the application
       passes it to a name resolution API that accepts strings in some
       other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver converts the name from UTF-16 to UTF-8 and
          passes the query to a DNS server.
   3.  User types in {non-ASCII}.{non-ASCII}.com, but the application
       first converts it to Punycode such that the name that is passed
       to name resolution APIs is (say)
       xn--e1afmkfd.xn--80akhbyknj4f.com.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver passes the string unmodified to a DNS server.
       *  If the name is not found in DNS, the name resolution API
          decides to try another protocol, say mDNS.
       *  The query goes out in mDNS, but since mDNS specified that
          names are to be registered in UTF-8, the name isn't found
          since it was Punycode encoded in the query.
   4.  User types in {non-ASCII}, and the application passes it, in the
       form of a UTF-8 string, to getaddrinfo or equivalent.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver will append suffixes in the suffix search
          list, which may contain UTF-8 characters if the local network
          uses a private name space.
       *  Each FQDN in turn will then be sent in a query to a DNS
          server, until one succeeds.
   5.  User types in {non-ASCII}, but the application first converts it
       to Punycode, such that the name that is passed to getaddrinfo or
       equivalent is (say) xn--e1afmkfd.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS stub resolver will append suffixes in the suffix
          search list, which may contain UTF-8 characters if the local
          network uses a private name space, resulting in (say)
          xn--e1afmkfd.{non-ASCII}.com.
       *  Each FQDN in turn will then be sent in a query to a DNS
          server, until one succeeds.
       *  Since the private name space in this case uses UTF-8, the
          above queries fail, since the Punycode version of the name was
          not registered in that name space.
   6.  User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
       {non-ASCII3}.com is a public name space using Punycode, but
       {non-ASCII2}.{non-ASCII3}.com is a private name space using
       UTF-8, which is accessible to the user. The application passes
       the name, in the form of a UTF-8 string, to getaddrinfo or
       equivalent.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver tries to locate the authoritative server, but
          fails the lookup because it cannot find a server for the UTF-8
          encoding of {non-ASCII3}.com, even though it would have access
          to the private name space. (To make this work, the private
          name space would need to include the UTF-8 encoding of
          {non-ASCII3}.com.)

   When users use multiple applications, some of which do Punycode
   conversion prior to passing a name to name resolution APIs, and some
   of which do not, odd behavior can result which at best violates the
   principle of least surprise, and at worst can result in security
   vulnerabilities.

   First consider two competing applications, such as web browsers, that
   are designed to achieve the same task.
   If the user types the same name into each browser, one may
   successfully resolve the name (and hence access the desired content)
   because the encoding scheme was correct, while the other may fail
   name resolution because the encoding scheme was incorrect. Hence the
   issue can give users an incentive to switch to another application
   (which in some cases means switching to an IDNA application, and in
   other cases means switching away from an IDNA application).

   Next consider two separate applications where one is designed to be
   launched from the other, for example a web browser launching a media
   player application when the link to a media file is clicked. If both
   types of content (web pages and media files in this example) are
   hosted at the same IDN in a private name space, but one application
   converts to Punycode before calling name resolution APIs and the
   other does not, the user may be able to access a web page and click
   on the media file, causing the media player to launch and attempt to
   retrieve the media file, which will then fail because the IDN
   encoding scheme was incorrect. Or even worse, if an attacker was
   able to register the same name in the other encoding scheme, the
   user may get the content from the attacker's machine. This is
   similar to a normal phishing attack, except that the two names
   represent exactly the same Unicode characters.

4.  Recommendations

   Taking into account the issues above, it would seem inappropriate for
   an application to convert a name to Punycode when it does not know
   whether DNS will be used by the name resolution library, or whether
   the name exists in a private name space that uses UTF-8, or in the
   global DNS that uses Punycode.
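
   As an illustration of why the choice of encoding matters, the
   following minimal sketch uses Python's built-in "idna" codec (which
   implements the ToASCII operation of [RFC3490]) to show that one
   Unicode name produces two different octet strings. The name
   "bücher.example", the helper names, and the dictionary standing in
   for a zone are assumptions made for this sketch only.

```python
# Illustrative only: the name, the helper functions, and the dict that
# stands in for a zone are assumptions made for this sketch.

def to_utf8(name: str) -> bytes:
    """Encode a Unicode domain name as UTF-8 octets."""
    return name.encode("utf-8")


def to_punycode(name: str) -> bytes:
    """Encode each label with IDNA2003 ToASCII ("xn--" prefixed labels)."""
    return name.encode("idna")


name = "b\u00fccher.example"  # "bücher.example"

utf8_form = to_utf8(name)
ace_form = to_punycode(name)

print(utf8_form)  # b'b\xc3\xbccher.example'
print(ace_form)   # b'xn--bcher-kva.example'

# A name space populated with one form has no entry under the other
# form, so an octet-wise lookup using the wrong encoding fails.
private_zone = {utf8_form: "192.0.2.1"}
print(private_zone.get(ace_form))  # None
```

   An application that always converts to Punycode would therefore miss
   the record in this hypothetical UTF-8 name space, even though both
   octet strings represent the same Unicode characters.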

   Instead, conversion to Punycode, UTF-8, or whatever other encoding,
   should be done only by an entity that knows which protocol will be
   used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
   the name to DNS), rather than by general applications that call
   protocol-independent name resolution APIs. (Of course, it is still
   necessary for applications to convert to whatever form those APIs
   expect.) Similarly, even when DNS is used, the conversion to
   Punycode should be done only by an entity that knows which name space
   will be used.

   That is, a more intelligent DNS resolver would be more liberal in
   what it would accept from an application and be able to query for
   both a Punycode name (e.g., over the Internet) and a UTF-8 name
   (e.g., over a corporate network with a private name space) in case
   the server only recognized one. However, we might also take into
   account that the various resolution behaviors discussed earlier could
   also occur with record updates (e.g., with Dynamic Update [RFC2136]),
   resulting in some names being registered in a local network's private
   name space by applications doing Punycode conversion, and other names
   being registered using UTF-8. Hence a name might have to be queried
   with both encodings to be sure to succeed without changes to DNS
   servers.

   Similarly, a more intelligent stub resolver would also be more
   liberal in what it would accept from a response as the value of a
   record (e.g., PTR) in that it would accept either UTF-8 or Punycode
   and convert them to whatever encoding is used by the application APIs
   to return strings to applications.

   Indeed the choice of conversion within the resolver libraries is
   consistent with the quote from section 6.2 of the IDNA specification
   [RFC3490] stating that Punycode conversion "might be performed inside
   these new versions of the resolver libraries".
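
   The "liberal" resolver behavior described above might be sketched as
   follows. This is only an illustration under stated assumptions: the
   function names are hypothetical and do not correspond to any real
   resolver API, and the conversion relies on Python's built-in "idna"
   codec (IDNA2003, [RFC3490]). The sketch accepts a name in either
   Unicode or Punycode form and returns both octet strings that a
   resolver could try.

```python
# Illustrative sketch: to_unicode() and candidate_queries() are
# hypothetical helpers, not part of any real resolver library.

def to_unicode(name: str) -> str:
    """Accept a name in Unicode or Punycode form and return Unicode."""
    # ToASCII leaves existing "xn--" labels as they are and converts
    # the remaining labels; decoding then maps every label to Unicode.
    return name.encode("idna").decode("idna")


def candidate_queries(name: str) -> list:
    """Return the Punycode and UTF-8 forms that a resolver could try."""
    unicode_name = to_unicode(name)
    ace = unicode_name.encode("idna")
    utf8 = unicode_name.encode("utf-8")
    # For an all-ASCII name both forms are identical, so list it once.
    return [ace] if ace == utf8 else [ace, utf8]


for query in candidate_queries("xn--bcher-kva.example"):
    print(query)
# b'xn--bcher-kva.example'
# b'b\xc3\xbccher.example'
```

   Because the caller's input is first brought to a single Unicode form,
   the result is the same whether the application passed the name as
   Unicode or had already converted it to Punycode.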

   That said, some application-layer protocols may be defined to use
   Punycode rather than the UTF-8 recommended by the IETF character sets
   and languages policy [RFC2277]. In this case, an application may
   receive a Punycode name and want to pass it to name resolution APIs.
   Again the recommendation that a resolver library be more liberal in
   what it would accept from an application would mean that such a name
   would be accepted and re-encoded as needed, rather than requiring the
   application to do so.

   Finally, the question remains about what, if anything, a DNS server
   should do to handle cases where some existing applications or hosts
   do Punycode queries within the local network using a private name
   space, and other existing applications or hosts send UTF-8 queries.
   It is undesirable to store different records for different encodings
   of the same name, since this introduces the possibility for
   inconsistency between them. Instead, a new DNS server serving a
   private name space using UTF-8 could potentially treat encoding
   conversion in the same way as case-insensitive comparison, which a
   DNS server is already required to do, as long as the DNS server has
   some way to know what the encoding is. Two encodings are, in this
   sense, two representations of the same name, just as two
   case-different strings are. However, whereas case comparison of
   non-ASCII characters is complicated by ambiguities (as explained in
   the IAB's Review and Recommendations for Internationalized Domain
   Names [RFC4690]), encoding conversion between Punycode and UTF-8 is
   unambiguous.

   [EDITOR'S NOTE: There are also normalization/mapping issues.
   Currently we only explore encoding issues.]

5.  Security Considerations

   Having applications convert names to Punycode before calling name
   resolution can result in security vulnerabilities.
   If the name is resolved by protocols or in zones for which records
   are registered using other encoding schemes, an attacker can claim
   the Punycode version of the same name and hence trick the victim into
   accessing a different destination. This can be done for any
   non-ASCII name, even when there is no possible confusion due to case,
   language, or other issues. Other types of confusion beyond those
   resulting simply from the choice of encoding scheme are discussed in
   "Review and Recommendations for IDNs" [RFC4690].

   Designers and users of encodings that represent Unicode strings in
   terms of ASCII should also consider whether trademark protection is
   an issue, e.g., if one name would be encoded in a way that would be
   naturally associated with another organization, such as
   xn--rfc-editor.

6.  IANA Considerations

   [RFC Editor: please remove this section prior to publication.]

   This document has no IANA Actions.

7.  IAB Members at the time of this writing

      Marcelo Bagnulo
      Gonzalo Camarillo
      Stuart Cheshire
      Vijay Gill
      Russ Housley
      John Klensin
      Olaf Kolkman
      Gregory Lebovitz
      Andrew Malis
      Danny McPherson
      David Oran
      Jon Peterson
      Dave Thaler

8.  References

8.1.  Normative References

   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
              5.1.0", 2008.

              defined by: The Unicode Standard, Version 5.0, Boston, MA,
              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
              Unicode 5.1.0
              (http://www.unicode.org/versions/Unicode5.1.0/).

8.2.  Informative References

   [I-D.cheshire-dnsext-multicastdns]
              Cheshire, S. and M. Krochmal, "Multicast DNS",
              draft-cheshire-dnsext-multicastdns-08 (work in progress),
              September 2009.

   [I-D.ietf-idn-punycode-00]
              Costello, A., "Punycode version 0.3.3",
              draft-ietf-idn-punycode-00 (work in progress), July 2002.

   [I-D.skwan-utf8-dns-00]
              Kwan, S. and J.
              Gilroy, "Using the UTF-8 Character Set in
              the Domain Name System", draft-skwan-utf8-dns-00 (work in
              progress), November 1997.

   [IDNA2008-Defs]
              Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              August 2009, .

   [MJD]      Duerst, M., "The Properties and Promizes of UTF-8", 11th
              International Unicode Conference, San Jose,
              September 1997, .

   [NIS]      Sun Microsystems, "System and Network Administration",
              March 1990.

   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10,
              RFC 821, August 1982.

   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1001]  NetBIOS Working Group, "Protocol standard for a NetBIOS
              service on a TCP/UDP transport: Concepts and methods",
              STD 19, RFC 1001, March 1987.

   [RFC1002]  NetBIOS Working Group, "Protocol standard for a NetBIOS
              service on a TCP/UDP transport: Detailed specifications",
              STD 19, RFC 1002, March 1987.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1468]  Murai, J., Crispin, M., and E. van der Poel, "Japanese
              Character Encoding for Internet Messages", RFC 1468,
              June 1993.

   [RFC1536]  Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
              Miller, "Common DNS Implementation Errors and Suggested
              Fixes", RFC 1536, October 1993.

   [RFC2136]  Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
              "Dynamic Updates in the Domain Name System (DNS UPDATE)",
              RFC 2136, April 1997.

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, July 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC3397]  Aboba, B. and S.
              Cheshire, "Dynamic Host Configuration
              Protocol (DHCP) Domain Search Option", RFC 3397,
              November 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
              Stevens, "Basic Socket Interface Extensions for IPv6",
              RFC 3493, February 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3646]  Droms, R., "DNS Configuration options for Dynamic Host
              Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
              December 2003.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC4795]  Aboba, B., Thaler, D., and L. Esibov, "Link-local
              Multicast Name Resolution (LLMNR)", RFC 4795,
              January 2007.

   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
              October 2008.

Authors' Addresses

   Dave Thaler
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA 02140

   Phone: +1 617 245 1457
   Email: john+ietf@jck.com

   Stuart Cheshire
   Apple Inc.
   1 Infinite Loop
   Cupertino, CA 95014

   Phone: +1 408 974 3207
   Email: cheshire@apple.com