idnits 2.17.1 draft-iab-idn-encoding-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 380: '... Protocols MUST be able to use th...' RFC 2119 keyword, line 383: '... for all text. Protocols MAY specify,...' RFC 2119 keyword, line 393: '... support MUST be possible....' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (May 14, 2010) is 5096 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '10646' on line 382 == Missing Reference: 'BCP9' is mentioned on line 387, but not defined == Outdated reference: A later version (-15) exists of draft-cheshire-dnsext-multicastdns-11 == Outdated reference: A later version (-02) exists of draft-ietf-idn-punycode-00 == Outdated reference: A later version (-06) exists of draft-skwan-utf8-dns-00 -- Obsolete informational reference (is this intentional?): RFC 821 (Obsoleted by RFC 2821) -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 4 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group D. Thaler 3 Internet-Draft Microsoft 4 Intended status: Informational J. Klensin 5 Expires: November 15, 2010 6 S. Cheshire 7 Apple 8 May 14, 2010 10 IAB Thoughts on Encodings for Internationalized Domain Names 11 draft-iab-idn-encoding-02.txt 13 Abstract 15 This document explores issues with Internationalized Domain Names 16 (IDNs) that result from the use of various encoding schemes such as 17 UTF-8 and the ASCII-Compatible Encoding produced by the Punycode 18 algorithm. It focuses on the importance of agreeing on a canonical 19 format and how complicated it ends up being as a result of using 20 different encodings today. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on November 15, 2010. 39 Copyright Notice 41 Copyright (c) 2010 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. 
Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 58 2. Use of Non-DNS Protocols . . . . . . . . . . . . . . . . . . . 9 59 3. Use of Non-ASCII in DNS . . . . . . . . . . . . . . . . . . . 10 60 3.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . . 14 61 4. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 16 62 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 17 63 6. Security Considerations . . . . . . . . . . . . . . . . . . . 17 64 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 65 8. IAB Members at the time of publication . . . . . . . . . . . . 18 66 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 67 9.1. Normative References . . . . . . . . . . . . . . . . . . . 18 68 9.2. Informative References . . . . . . . . . . . . . . . . . . 19 69 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 21 71 1. Introduction 73 The goal of this document is to explore what can be learned from some 74 current difficulties in implementing Internationalized Domain Names 75 (IDNs). 77 A domain name consists of a set of labels, conventionally written 78 separated with dots. An Internationalized Domain Name (IDN) is a 79 domain name that contains one or more labels that, in turn, contain 80 one or more non-ASCII characters. Just as with plain ASCII domain 81 names, each IDN label must be encoded using some mechanism before it 82 can be transmitted in network packets, stored in memory, stored on 83 disk, etc. 
These encodings need to be reversible, but they need not
84 store domain names the same way humans conventionally write them on
85 paper. For example, when transmitted over the network in DNS
86 packets, domain name labels are *not* separated with dots.

88 IDNA, discussed later in this document, is the standard that defines
89 the use and coding of internationalized domain names for use on the
90 public Internet. It is described as "Internationalizing Domain Names
91 in Applications (IDNA)" and is defined in several documents.
92 Definitions for the current version and a roadmap of related
93 documents appear in [IDNA2008-Defs]. An earlier version of IDNA
94 [RFC3490] is now being phased out. Except where noted, the two
95 versions are approximately the same with regard to the issues
96 discussed in this document. However, some explanations appeared in
97 the earlier documents that did not seem useful when the revision was
98 created; they are quoted here from the documents in which they
99 appear. In addition, the terminology of the two versions differs
100 somewhat; this document reflects the terminology of the current
101 version.

103 Unicode [Unicode] is a list of characters (including non-spacing
104 marks that are used to form some other characters), where each
105 character is assigned an integer value, called a code point. In
106 simple terms a Unicode string is a string of integer code point
107 values in the range 0 to 1,114,111 (10FFFF in base 16), which
108 represent a string of Unicode characters. These integer code points
109 must be encoded using some mechanism before they can be transmitted
110 in network packets, stored in memory, stored on disk, etc. Some
111 common ways of encoding these integer code point values in computer
112 systems include UTF-8, UTF-16, and UTF-32. In addition to the
113 material below, those forms and the tradeoffs among them are
114 discussed in Chapter 2 of The Unicode Standard [Unicode].
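
As an illustration of the paragraph above (this sketch is not part of the draft), the following Python fragment lists the code points of a sample label and encodes the same code points in three encoding forms. The label "bücher" is an arbitrary example chosen here, and the big-endian codec variants are an assumption made only to avoid emitting a byte order mark.

```python
# Illustrative sketch: one string of code points, three encoding forms.
label = "b\u00fccher"  # "bücher"; arbitrary sample label, not taken from the draft

# Each character maps to one integer code point.
code_points = [ord(ch) for ch in label]

# The same code points serialized with three different encodings.
utf8 = label.encode("utf-8")
utf16 = label.encode("utf-16-be")  # big-endian chosen to avoid a BOM
utf32 = label.encode("utf-32-be")

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, len(data), "octets:", data.hex())
```

The octet counts differ even though the code point sequence is identical, which is why a receiver has to know which encoding was used before it can interpret the bytes.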
116 UTF-8 [RFC3629] is a mechanism for encoding a Unicode code point in a 117 variable number of 8-bit octets, where an ASCII code point is 118 preserved as-is. Those octets encode a string of integer code point 119 values, which represent a string of Unicode characters. 121 UTF-16 (formerly UCS-2) is a mechanism for encoding a Unicode code 122 point in one or two 16-bit integers, described in detail in Sections 123 3.9 and 3.10 of The Unicode Standard [Unicode]. A UTF-16 string 124 encodes a string of integer code point values that represent a string 125 of Unicode characters. 127 UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and 128 3.10, is a mechanism for encoding a Unicode code point in a single 129 32-bit integer. A UTF-32 string is thus a string of 32-bit integer 130 code point values, which represent a string of Unicode characters. 132 Note that UTF-16 results in some all-zero octets when code points 133 occur early in the Unicode sequence, and UTF-32 always has all-zero 134 octets. 136 IDNA specifies validity of a label, such as what characters it can 137 contain, relationships among them, and so on, in Unicode terms. 138 Valid labels can take either of two forms, with the appropriate one 139 determined by particular protocols or by context. One of those 140 forms, called a U-label, is a direct representation of the Unicode 141 characters using one of the encoding forms discussed above. This 142 document discusses UTF-8 strings in many places. While all U-labels 143 can be represented by UTF-8 strings, not all UTF-8 strings are valid 144 U-labels (see Section 2.3.2 of [IDNA2008-Defs] for a discussion of 145 these distinctions). The other, called an A-label, uses a 146 compressed, ASCII-compatible encoding (an "ACE" in IDNA and other 147 terminology) produced by an algorithm called Punycode. U-labels and 148 A-labels are duals of each other: transformations from one to the 149 other do not lose information. 
The transformation mechanisms are 150 specified in [IDNA2008-Protocol]. 152 Punycode [RFC3492] is thus a mechanism for encoding a Unicode string 153 in an ASCII-compatible encoding, i.e., using only letters, digits, 154 and hyphens from the ASCII character set. When a Unicode label that 155 is valid under the IDNA rules (a U-label) is encoded with Punycode 156 for IDNA purposes, it is prefixed with "xn--"; the result is called 157 an A-label. The prefix convention assumes that no other DNS labels 158 (at least no other DNS labels in IDNA-aware applications) are allowed 159 to start with these four characters. Consequently, when A-label 160 encoding is assumed, any DNS labels beginning with "xn--" now have a 161 different meaning (the Punycode encoding of a label containing one or 162 more non-ASCII characters) or no defined meaning at all (in the case 163 of labels that are not IDNA-compliant, i.e., are not well-formed 164 A-labels). 166 ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII 167 and Japanese characters, where an ASCII character is preserved as-is. 168 ISO-2022-JP is stateful: special sequences are used to switch between 169 character coding tables. 171 Comparison of Unicode strings is not as easy as comparing for example 172 ASCII strings. First, there are a multitude of ways of representing 173 a string of Unicode characters. Second, in many languages and 174 scripts, the actual definition of "same" is very context-dependent. 175 Because of this, comparison of two Unicode strings must take into 176 account how the Unicode strings are encoded. Regardless of the 177 encoding, however, comparison cannot simply be done by comparing the 178 encoded Unicode strings byte by byte. The only time that is possible 179 is when the strings both are mapped into some canonical format and 180 encoded the same way. 
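
The two ideas above, the U-label/A-label duality and the need for a canonical form before comparison, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the IDNA procedure: the helper names (to_a_label, to_u_label, same_after_canonicalization) are hypothetical, the conversion uses Python's built-in "punycode" codec and performs none of the IDNA validity checks or mappings, and NFC is assumed as the canonical form purely for the example.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Punycode-encode a label and add the ACE prefix (no IDNA validation)."""
    if u_label.isascii():
        return u_label  # plain ASCII labels are passed through unchanged
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Reverse of to_a_label for labels that carry the ACE prefix."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

def same_after_canonicalization(a: str, b: str) -> bool:
    """Compare byte by byte only after mapping both strings to one
    canonical form (NFC, an assumption) and one encoding (UTF-8)."""
    canon_a = unicodedata.normalize("NFC", a).encode("utf-8")
    canon_b = unicodedata.normalize("NFC", b).encode("utf-8")
    return canon_a == canon_b

composed = "caf\u00e9"      # e with acute as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(to_a_label("b\u00fccher"))
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
print(same_after_canonicalization(composed, decomposed))
```

The round trip loses no information, while the naive byte comparison of the two spellings of the same text fails until both are brought into the same canonical form and encoding.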
182 This document focuses on the importance of agreeing on a canonical
183 format and how complicated it ends up being as a result of using
184 different encodings today.

186 Different applications, APIs, and protocols use different encoding
187 schemes today. Historically, many of them were originally defined to
188 use only ASCII. Internationalizing Domain Names in Applications
189 (IDNA) [IDNA2008-Defs] defined a mechanism that required changes to
190 applications, but in an attempt not to change APIs or servers, specified
191 that the A-label format is to be used in many contexts. In some ways
192 this could be seen as not changing the existing APIs, in the sense
193 that the strings being passed to and from the APIs were still
194 apparently ASCII strings. In other ways it was a very profound
195 change to the existing APIs, because while those strings were still
196 syntactically valid ASCII strings, they no longer meant the same
197 thing as they used to. What looked like a plain ASCII string to one
198 piece of software or library could be seen by another piece of
199 software or library (with the application of out-of-band information)
200 to be in fact an encoding of a Unicode string.

202 Section 1.3 of the original IDNA specification [RFC3490] states:

204    The IDNA protocol is contained completely within applications. It
205    is not a client-server or peer-to-peer protocol: everything is
206    done inside the application itself. When used with a DNS resolver
207    library, IDNA is inserted as a "shim" between the application and
208    the resolver library. When used for writing names into a DNS
209    zone, IDNA is used just before the name is committed to the zone.

211 Figure 1 depicts a simplistic architecture that a naive reader might
212 assume from the paragraph quoted above. (A variant of this same
213 picture appears in Section 6 of the IDNA specification [RFC3490]
214 further strengthening this assumption.)
215 +-----------------------------------------+
216 |Host                                     |
217 |             +-------------+             |
218 |             | Application |             |
219 |             +------+------+             |
220 |                    |                    |
221 |               +----+----+               |
222 |               |   DNS   |               |
223 |               | Resolver|               |
224 |               | Library |               |
225 |               +----+----+               |
226 |                    |                    |
227 +-----------------------------------------+
228                      |
229             _________|_________
230            /                   \
231           /                     \
232          /                       \
233         |        Internet         |
234          \                       /
235           \                     /
236            \___________________/

238           Simplistic Architecture

240                  Figure 1

242 There are, however, two problems with this simplistic architecture
243 that cause it to differ from reality.

245 First, resolver APIs on Operating Systems (OSs) today (MacOS,
246 Windows, Linux, etc.) are not DNS-specific. They typically provide a
247 layer of indirection so that the application can work independent of
248 the name resolution mechanism, which could be DNS, mDNS
249 [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP
250 [RFC1001][RFC1002], etc/hosts file [RFC0952], NIS [NIS], or anything
251 else. For example, "Basic Socket Interface Extensions for IPv6"
252 [RFC3493] specifies the getaddrinfo() API and contains many phrases
253 like "For example, when using the DNS" and "any type of name
254 resolution service (for example, the DNS)". Importantly, DNS is
255 mentioned only as an example, and the application has no knowledge as
256 to whether DNS or some other protocol will be used.

258 Second, even with the DNS protocol, private name spaces (sometimes
259 including private uses of the DNS), do not necessarily use the same
260 character set encoding scheme as the public Internet name space.

262 We will discuss each of the above issues in subsequent sections. For
263 reference, Figure 2 depicts a more realistic architecture on typical
264 hosts today (which don't have IDNA inserted as a shim immediately
265 above the DNS resolver library).
More generally, the host may be
266 attached to one or more local networks, each of which may or may not
267 be connected to the public Internet and may or may not have a private
268 name space.

270 +-----------------------------------------+
271 |Host                                     |
272 |             +-------------+             |
273 |             | Application |             |
274 |             +------+------+             |
275 |                    |                    |
276 |             +------+------+             |
277 |             |   Generic   |             |
278 |             |    Name     |             |
279 |             | Resolution  |             |
280 |             |     API     |             |
281 |             +------+------+             |
282 |                    |                    |
283 |   +-----+------+---+--+-------+-----+   |
284 |   |     |      |      |       |     |   |
285 | +-+-++--+--++--+-++---+---++--+--++-+-+ |
286 | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
287 | +---++-----++----++-------++-----++---+ |
288 |                                         |
289 +-----------------------------------------+
290                      |
291                ______|______
292               /             \
293              /               \
294             /      local      \
295             \     network     /
296              \               /
297               \_____________/
298                      |
299             _________|_________
300            /                   \
301           /                     \
302          /                       \
303         |        Internet         |
304          \                       /
305           \                     /
306            \___________________/

308           Realistic Architecture

310                  Figure 2

312 1.1. APIs

314 Section 6.2 of the original IDNA specification [RFC3490] states
315 (where ToASCII and ToUnicode below refer to conversions using the
316 Punycode algorithm):

318    It is expected that new versions of the resolver libraries in the
319    future will be able to accept domain names in other charsets than
320    ASCII, and application developers might one day pass not only
321    domain names in Unicode, but also in local script to a new API for
322    the resolver libraries in the operating system. Thus the ToASCII
323    and ToUnicode operations might be performed inside these new
324    versions of the resolver libraries.

326 Resolver APIs such as getaddrinfo() and its predecessor
327 gethostbyname() were defined to accept "char *" arguments, meaning
328 they accept a string of bytes, terminated with a NULL (0) byte.
329 Because of the use of a NULL octet as a string terminator, this is
330 sufficient for ASCII strings (including A-labels) and even
331 ISO-2022-JP and UTF-8 strings (unless an implementation artificially
332 precludes them), but not UTF-16 or UTF-32 strings because a NULL
333 octet could appear in the middle of strings using these
334 encodings. Several operating systems historically used in Japan will
335 accept (and expect) ISO-2022-JP strings in such APIs. Some platforms
336 used worldwide also have new versions of the APIs (e.g.,
337 GetAddrInfoW() on Windows) that accept other encoding schemes such as
338 UTF-16.

340 It is worth noting that an API using "char *" arguments can
341 distinguish between conventional ASCII "host name" labels, A-labels,
342 ISO-2022-JP, and UTF-8 labels in names if the coding is known to be
343 one of those four. An example method is as follows:
344 o if the label contains an ESC (0x1B) byte the label is ISO-2022-JP;
345   otherwise,
346 o if any byte in the label has the high bit set, the label is UTF-8;
347   otherwise,
348 o if the label starts with "xn--" then it is presumed to be an
349   A-label; otherwise,
350 o the label is ASCII.
351 Again this assumes that neither ASCII labels nor UTF-8 strings ever
352 start with "xn--", and also that UTF-8 strings never contain an ESC
353 character. Also the above is merely an illustration; UTF-8 can be
354 detected and distinguished from other 8-bit encodings with good
355 accuracy [MJD].

357 It is more difficult or impossible to distinguish the ISO 8859
358 character sets from each other, because they differ in up to about 90
359 characters which have exactly the same encodings, and a short string
360 is very unlikely to contain enough characters to allow a receiver to
361 deduce the character set. Similarly, it is not possible in general
362 to distinguish between ISO-2022-JP and any other encoding based on
363 ISO 2022 code table switching.
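
The four-way example method listed above can be sketched in Python as follows. The function name classify_label is hypothetical, and the sketch inherits every caveat stated in the text: it is only meaningful when the label is already known to use one of the four codings. Matching the "xn--" prefix case-insensitively is an added assumption, not something the list above specifies.

```python
def classify_label(label: bytes) -> str:
    """Guess the coding of one label, assuming it is one of the four
    codings discussed above (ISO-2022-JP, UTF-8, A-label, plain ASCII)."""
    if b"\x1b" in label:                    # ESC byte: ISO 2022 table switch
        return "ISO-2022-JP"
    if any(byte & 0x80 for byte in label):  # any byte with the high bit set
        return "UTF-8"
    if label[:4].lower() == b"xn--":        # ACE prefix (case-insensitive: assumption)
        return "A-label"
    return "ASCII"

samples = [
    "\u65e5\u672c".encode("iso2022_jp"),  # two Japanese characters
    "b\u00fccher".encode("utf-8"),
    b"xn--bcher-kva",
    b"example",
]
for sample in samples:
    print(sample, "->", classify_label(sample))
```

The order of the tests matters: ISO-2022-JP output is 7-bit, so the ESC check has to come before the ASCII fallback, and the high-bit check has to come before the prefix check.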
365 Although it is possible (as in the example above) to distinguish some 366 encodings when not explicitly specified, it is cleaner to have the 367 encodings specified explicitly, such as specifying UTF-16 for 368 GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8 369 strings. 371 2. Use of Non-DNS Protocols 373 As noted earlier, typical name resolution libraries are not DNS- 374 specific. Furthermore, some protocols are defined to use encoding 375 forms other than IDNA A-labels. For example, mDNS 376 [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used. 377 Indeed, the IETF policy on character sets and languages [RFC2277] 378 states: 380 Protocols MUST be able to use the UTF-8 charset, which consists of 381 the ISO 10646 coded character set combined with the UTF-8 382 character encoding scheme, as defined in [10646] Annex R 383 (published in Amendment 2), for all text. Protocols MAY specify, 384 in addition, how to use other charsets or other character encoding 385 schemes for ISO 10646, such as UTF-16, but lack of an ability to 386 use UTF-8 is a violation of this policy; such a violation would 387 need a variance procedure ([BCP9] section 9) with clear and solid 388 justification in the protocol specification document before being 389 entered into or advanced upon the standards track. For existing 390 protocols or protocols that move data from existing datastores, 391 support of other charsets, or even using a default other than 392 UTF-8, may be a requirement. This is acceptable, but UTF-8 393 support MUST be possible. 395 Applications that convert an IDN to A-label form before calling 396 getaddrinfo() will result in name resolution failures if the Punycode 397 name is directly used in such protocols. Having libraries or 398 protocols to convert from A-labels to the encoding scheme defined by 399 the protocol (e.g., UTF-8) would require changes to APIs and/or 400 servers, which IDNA was intended to avoid. 
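
The failure described above can be sketched with a toy name table in Python. Everything here is hypothetical and for illustration only: the table stands in for a protocol, such as mDNS, whose names are registered in UTF-8; the address comes from a documentation range; and the a_label helper applies Python's "punycode" codec without any IDNA validation.

```python
def a_label(u_label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix (no IDNA checks)."""
    return "xn--" + u_label.encode("punycode").decode("ascii")

# Hypothetical table of names registered as UTF-8 octets.
utf8_registry = {"b\u00fccher".encode("utf-8"): "192.0.2.10"}

def resolve(query: bytes):
    """Exact byte match against the table; returns None when not found."""
    return utf8_registry.get(query)

name = "b\u00fccher"
found = resolve(name.encode("utf-8"))             # query sent as UTF-8
missed = resolve(a_label(name).encode("ascii"))   # query pre-converted by the application

print(found, missed)
```

The name is the same in both queries; only the encoding chosen by the application differs, and that alone is enough for the second lookup to fail.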
402 As a result, applications that assume that non-ASCII names are
403 resolved using the public DNS and blindly convert them to A-labels
404 without knowledge of what protocol will be selected by the name
405 resolution library have problems. Furthermore, name resolution
406 libraries often try multiple protocols until one succeeds, because
407 they are defined to use a common name space. For example, the hosts
408 file, DNS, and NetBIOS-over-TCP are all defined to be able to share a
409 common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and
410 [RFC1034] section 2.1). This means that when an application passes a
411 name to be resolved, resolution may in fact be attempted using
412 multiple protocols, each with a potentially different encoding
413 scheme. For this to work successfully, the name must be converted to
414 the appropriate encoding scheme only after the choice is made to use
415 that protocol. In general, this cannot be done by the application
416 since the choice of protocol is not made by the application.

418 3. Use of Non-ASCII in DNS

420 A common misconception is that DNS only supports names that can be
421 expressed using letters, digits, and hyphens.

423 This misconception originally stemmed from the definition in 1985 of
424 an "Internet host name" (and net, gateway, and domain name) for use
425 in the "hosts" file [RFC0952]. An Internet host name was defined
426 therein as including only letters, digits, and hyphens, where upper
427 and lower case letters were to be treated as identical. The DNS
428 specification [RFC1034] section 3.5 entitled "Preferred name syntax"
429 then repeated this definition in 1987, saying that this "syntax will
430 result in fewer problems with many applications that use domain names
431 (e.g., mail, TELNET)".

433 This left confusion as to whether the "preferred" name syntax
434 was a mandatory restriction in DNS, or merely "preferred".
436 The definition of an Internet host name was updated in 1989 437 ([RFC1123] section 2.1) to allow names starting with a digit (to 438 support IPv4 addresses in dotted-decimal form). Section 6.1 of 439 "Requirements for Internet Hosts -- Application and Support" 440 [RFC1123] discusses the use of DNS (and the hosts file) for resolving 441 host names to IP addresses and vice versa. This led to confusion as 442 to whether all names in DNS are "host names", or whether a "host 443 name" is merely a special case of a DNS name. 445 By 1997, things had progressed to a state where it was necessary to 446 clarify these areas of confusion. "Clarifications to the DNS 447 Specification" [RFC2181] section 11 states: 449 The DNS itself places only one restriction on the particular 450 labels that can be used to identify resource records. That one 451 restriction relates to the length of the label and the full name. 452 The length of any one label is limited to between 1 and 63 octets. 453 A full domain name is limited to 255 octets (including the 454 separators). The zero length full name is defined as representing 455 the root of the DNS tree, and is typically written and displayed 456 as ".". Those restrictions aside, any binary string whatever can 457 be used as the label of any resource record. Similarly, any 458 binary string can serve as the value of any record that includes a 459 domain name as some or all of its value (SOA, NS, MX, PTR, CNAME, 460 and any others that may be added). Implementations of the DNS 461 protocols must not place any restrictions on the labels that can 462 be used. 464 Hence, it clarified that the restriction to letters, digits, and 465 hyphens does not apply to DNS names in general, nor to records that 466 include "domain names". Hence the "preferred" name syntax described 467 in the original DNS specification [RFC1034] is indeed merely 468 "preferred", not mandatory. 
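
The length limits quoted above from [RFC2181] can be sketched as a small Python encoder for the length-prefixed form in which names travel in DNS packets. The function name encode_wire_name is hypothetical, and the sketch deliberately checks lengths only: in keeping with the quoted text, it accepts any octets in a label.

```python
from typing import Iterable

MAX_LABEL_OCTETS = 63
MAX_NAME_OCTETS = 255

def encode_wire_name(labels: Iterable[bytes]) -> bytes:
    """Encode labels as length-prefixed octet strings ending in the root label.

    Only the length limits are enforced; label contents are unrestricted."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= MAX_LABEL_OCTETS:
            raise ValueError("label must be 1 to 63 octets long")
        out.append(len(label))  # length octet takes the place of the dot
        out += label
    out.append(0)               # zero-length root label terminates the name
    if len(out) > MAX_NAME_OCTETS:
        raise ValueError("full name exceeds 255 octets")
    return bytes(out)

print(encode_wire_name([b"www", b"example", b"com"]))
print(encode_wire_name(["b\u00fccher".encode("utf-8"), b"example"]))
```

A label containing UTF-8 octets, spaces, or any other byte values passes this encoder unchanged, which reflects the point that the letter-digit-hyphen syntax is a preference of applications rather than a restriction of the DNS itself.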
470 Since there is no restriction even to ASCII, let alone letter-digit- 471 hyphen use, DNS is in conformance with the IETF requirement to allow 472 UTF-8 [RFC2277]. 474 Using UTF-16 or UTF-32 encoding, however, would not be ideal for use 475 in DNS packets or APIs because existing software already uses ASCII, 476 and UTF-16 and UTF-32 strings can contain all-zero octets that 477 existing software may interpret as the end of the string. To use 478 UTF-16 or UTF-32 one would need some way of knowing whether the 479 string was encoded using ASCII, UTF-16, or UTF-32, and indeed for 480 UTF-16 or UTF-32 whether it was big-endian or little-endian encoding. 481 In contrast, UTF-8 works well because any 7-bit ASCII string is also 482 a UTF-8 string representing the same characters. 484 If a private name space is defined to use UTF-8 (and not other 485 encodings such as UTF-16 or UTF-32), there's no need for a mechanism 486 to know whether a string was encoded using ASCII or UTF-8, because 487 (for any string that can be represented using ASCII) the 488 representations are exactly the same. In other words, for any string 489 that can be represented using ASCII it doesn't matter whether it is 490 interpreted as ASCII or UTF-8 because both encodings are the same, 491 and for any string that can't be represented using ASCII, it's 492 obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and 493 UTF-8 are both byte-oriented encodings so the question of big-endian 494 or little-endian encoding doesn't apply. 496 While implementations of the DNS protocol must not place any 497 restrictions on the labels that can be used, applications that use 498 the DNS are free to impose whatever restrictions they like, and many 499 have. The above rules permit a domain name label that contains 500 unusual characters, such as embedded spaces which many applications 501 would consider a bad idea. 
For example, the SMTP protocol [RFC5321],
502 going back to the original specification in [RFC0821], constrains
503 the character set usable in email addresses. There is now an effort
504 underway to permit SMTP to support internationalized email addresses
505 via an extension.

507 Shortly after the DNS Clarifications [RFC2181] and IETF character
508 sets and languages policy [RFC2277] were published, the need for
509 internationalized names within private name spaces (i.e., within
510 enterprises) arose. The current (and past, predating IDNA and the
511 prefixed ACE conventions) practice within enterprises that support
512 other languages is to put UTF-8 names in their internal DNS servers
513 in a private name space. For example, "Using the UTF-8 Character Set
514 in the Domain Name System" [I-D.skwan-utf8-dns-00] was first written
515 in 1997, and was then widely deployed in Windows. The use of UTF-8
516 names in DNS was similarly implemented and deployed in MacOS, simply
517 by virtue of the fact that applications blindly passed UTF-8 strings
518 to the name resolution APIs, and the name resolution APIs blindly
519 passed those UTF-8 strings to the DNS servers, and the DNS servers
520 correctly answered those queries, and from the user's point of view
521 everything worked properly without any special new code being
522 written, except that ASCII is matched case-insensitively whereas
523 UTF-8 is not (although some enterprise DNS servers reportedly attempt
524 to do case-insensitive matching on UTF-8 within private name spaces).
525 Within a private name space, and especially in light of the IETF
526 UTF-8 policy [RFC2277], it was reasonable to assume
527 that binary strings were encoded in UTF-8.

529 As implied earlier, there are also issues with mapping strings to
530 some canonical form, independent of the encoding. Such issues are
531 not discussed in detail in this document.
They are discussed to some 532 extent in, for example, Section 3 of [RFC5198], and are left as 533 opportunities for elaboration in other documents. 535 Five years after UTF-8 was already in use in private name spaces in 536 DNS, the strategy of using a reserved prefix and an ASCII-compatible 537 Encoding (ACE) was developed for IDNA. That strategy included the 538 Punycode algorithm, which began to be developed (during the period 539 from 2002 [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in 540 the public DNS name space. One reason the prefixed ACE strategy was 541 selected for the public DNS name space had to do with concerns about 542 whether the details of IDNA, including the use of the Punycode 543 algorithm, were an adequate solution to the problems that were posed. 544 If either the Punycode algorithm or fundamental aspects of character 545 handling were wrong, and had to be changed to something incompatible, 546 it would be possible to switch to a new prefix or adopt another model 547 entirely. Only the part of the public DNS namespace that starts a 548 label with "xn--" would be polluted. 550 Today the algorithm is seen as being about as good as it can 551 realistically be, so moving to a different encoding (UTF-8 as 552 suggested in this document) that can be viewed as "native" would not 553 be as risky as it would have been in 2002. 555 In any case, the publication of [RFC3492] and the dependencies on it 556 in [IDNA2008-Protocol] and the earlier [RFC3490] thus resulted in 557 having to use different encodings for different name spaces (where 558 UTF-8 for private name spaces was already deployed). Hence, 559 referring back to Figure 2, a different encoding scheme may be in use 560 on the Internet vs. a local network. 562 In general a host may be connected to zero or more networks using 563 private name spaces, plus potentially the public name space. 
564 Applications that convert a U-label form IDN to an A-label before 565 calling getaddrinfo() will incur name resolution failures if the name 566 is actually registered in a private name space in some other encoding 567 (e.g., UTF-8). Having libraries or protocols convert from A-labels 568 to the encoding used by a private name space (e.g., UTF-8) would 569 require changes to APIs and/or servers, which IDNA was intended to 570 avoid. 572 Also, a fully-qualified domain name (FQDN) to be resolved may be 573 obtained directly from an application, or it may be composed by the 574 DNS resolver itself from a single label obtained from an application 575 by using a configured suffix search list, and the resulting FQDN may 576 use multiple encodings in different labels. For more information on 577 the suffix search list, see section 6 of "Common DNS Implementation 578 Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option 579 [RFC3397], and section 4 of "DNS Configuration options for DHCPv6" 580 [RFC3646]. 582 As noted in [RFC1536] section 6, the community has had bad 583 experiences with "searching" for domain names by trying multiple 584 variations or appending different suffixes. Such searching can yield 585 inconsistent results depending on the order in which alternatives are 586 tried. Nonetheless, the practice is widespread and must be 587 considered. 589 The practice of searching for names, whether by the use of a suffix 590 search list or by searching in different namespaces can yield 591 inconsistent results. For example, even when a suffix search list is 592 only used when an application provides a name containing no dots, two 593 clients with different configured suffix search lists can get 594 different answers, and the same client could get different answers at 595 different times if it changes its configuration (e.g., when moving to 596 another network). A deeper discussion of this topic is outside the 597 scope of this document. 599 3.1. 
Examples 601 Some examples of cases that can happen in existing implementations 602 today (where {non-ASCII} below represents some user-entered non-ASCII 603 string) are: 604 1. User types in {non-ASCII}.{non-ASCII}.com, and the application 605 passes it, in the form of a UTF-8 string, to getaddrinfo or 606 gethostbyname or equivalent. 607 * The DNS resolver passes the (UTF-8) string unmodified to a DNS 608 server. 609 2. User types in {non-ASCII}.{non-ASCII}.com, and the application 610 passes it to a name resolution API that accepts strings in some 611 other encoding such as UTF-16, e.g., GetAddrInfoW on Windows. 612 * The name resolution API decides to pass the string to DNS (and 613 possibly other protocols). 614 * The DNS resolver converts the name from UTF-16 to UTF-8 and 615 passes the query to a DNS server. 616 3. User types in {non-ASCII}.{non-ASCII}.com, but the application 617 first converts it to A-label form such that the name that is 618 passed to name resolution APIs is (say) xn--e1afmkfd.xn-- 619 80akhbyknj4f.com. 620 * The name resolution API decides to pass the string to DNS (and 621 possibly other protocols). 622 * The DNS resolver passes the string unmodified to a DNS server. 623 * If the name is not found in DNS, the name resolution API 624 decides to try another protocol, say mDNS. 625 * The query goes out in mDNS, but since mDNS specified that 626 names are to be registered in UTF-8, the name isn't found 627 since it was encoded as an A-label in the query. 628 4. User types in {non-ASCII}, and the application passes it, in the 629 form of a UTF-8 string, to getaddrinfo or equivalent. 630 * The name resolution API decides to pass the string to DNS (and 631 possibly other protocols). 632 * The DNS resolver will append suffixes in the suffix search 633 list, which may contain UTF-8 characters if the local network 634 uses a private name space. 635 * Each FQDN in turn will then be sent in a query to a DNS 636 server, until one succeeds. 637 5. 
User types in {non-ASCII}, but the application first converts it 638 to an A-label, such that the name that is passed to getaddrinfo 639 or equivalent is (say) xn--e1afmkfd. 640 * The name resolution API decides to pass the string to DNS (and 641 possibly other protocols). 642 * The DNS stub resolver will append suffixes in the suffix 643 search list, which may contain UTF-8 characters if the local 644 network uses a private name space, resulting in (say) xn-- 645 e1afmkfd.{non-ASCII}.com 647 * Each FQDN in turn will then be sent in a query to a DNS 648 server, until one succeeds. 649 * Since the private name space in this case uses UTF-8, the 650 above queries fail, since the A-label version of the name was 651 not registered in that name space. 652 6. User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where 653 {non-ASCII3}.com is a public name space using IDNA and A-labels, 654 but {non-ASCII2}.{non-ASCII3}.com is a private name space using 655 UTF-8, which is accessible to the user. The application passes 656 the name, in the form of a UTF-8 string, to getaddrinfo or 657 equivalent. 658 * The name resolution API decides to pass the string to DNS (and 659 possibly other protocols). 660 * The DNS resolver tries to locate the authoritative server, but 661 fails the lookup because it cannot find a server for the UTF-8 662 encoding of {non-ASCII3}.com, even though it would have access 663 to the private name space. (To make this work, the private 664 name space would need to include the UTF-8 encoding of {non- 665 ASCII3}.com.) 667 When users use multiple applications, some of which do A-label 668 conversion prior to passing a name to name resolution APIs, and some 669 of which do not, odd behavior can result which at best violates the 670 principle of least surprise, and at worst can result in security 671 vulnerabilities. 673 First consider two competing applications, such as web browsers, that 674 are designed to achieve the same task. 
If the user types the same 675 name into each browser, one may successfully resolve the name (and 676 hence access the desired content) because the encoding scheme was 677 correct, while the other may fail name resolution because the 678 encoding scheme was incorrect. Hence the issue can push users to 679 switch to another application (which in some cases means switching to 680 an IDNA application, and in other cases means switching away from an 681 IDNA application). 683 Next consider two separate applications where one is designed to be 684 launched from the other, for example a web browser launching a media 685 player application when the link to a media file is clicked. If both 686 types of content (web pages and media files in this example) are 687 hosted at the same IDN in a private name space, but one application 688 converts to A-labels before calling name resolution APIs and the 689 other does not, the user may be able to access a web page, click on 690 the media file causing the media player to launch and attempt to 691 retrieve the media file, which will then fail because the IDN 692 encoding scheme was incorrect. Or even worse, if an attacker was 693 able to register the same name in the other encoding scheme, the user may get 694 the content from the attacker's machine. This is similar to a normal 695 phishing attack, except that the two names represent exactly the same 696 Unicode characters. 698 4. Recommendations 700 As explained above, using multiple canonical formats, and multiple 701 encodings in different protocols or even in different places in the 702 same namespace creates problems. Because of this, and the fact that 703 both IDNA A-labels and UTF-8 are in use as encoding mechanisms for 704 domain names today, we recommend the following.
706 It is inappropriate for an application to convert a name to an 707 A-label when it does not know whether DNS will be used by the name 708 resolution library, or whether the name exists in a private name 709 space that uses UTF-8, or in the global DNS that uses IDNA A-labels. 711 Instead, conversion to A-label form, UTF-8, or any other encoding, 712 should be done only by an entity that knows which protocol will be 713 used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass 714 the name to DNS), rather than by general applications that call 715 protocol-independent name resolution APIs. (Of course, it is still 716 necessary for applications to convert to whatever form those APIs 717 expect.) Similarly, even when DNS is used, the conversion to 718 A-labels should be done only by an entity that knows which name space 719 will be used. 721 That is, a more intelligent DNS resolver would be more liberal in 722 what it would accept from an application and be able to query for 723 both a name in A-label form (e.g., over the Internet) and a UTF-8 724 name (e.g., over a corporate network with a private name space) in 725 case the server only recognized one. However, we might also take 726 into account that the various resolution behaviors discussed earlier 727 could also occur with record updates (e.g., with Dynamic Update 728 [RFC2136]), resulting in some names being registered in a local 729 network's private name space by applications doing conversion to 730 A-labels, and other names being registered using UTF-8. Hence a name 731 might have to be queried with both encodings to be sure to succeed 732 without changes to DNS servers. 
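The dual-query behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of any specification: the function name candidate_query_names is hypothetical, and the sketch relies on Python's built-in "idna" codec, which follows the IDNA2003 rules of [RFC3490] rather than IDNA2008. It handles each label separately, so a name that mixes A-labels and Unicode labels (as in example 5 of Section 3.1) still yields one consistent A-label form and one consistent UTF-8 form.

```python
def candidate_query_names(name: str) -> list:
    """Return the wire encodings a liberal resolver might try for `name`.

    Hypothetical helper for illustration only. It uses Python's built-in
    "idna" codec, which follows IDNA2003 (RFC 3490), not IDNA2008.
    """
    a_labels = []  # every label in A-label (prefixed ACE) form
    u_labels = []  # every label in Unicode form, to be sent as UTF-8
    for label in name.rstrip(".").split("."):
        if label.isascii() and label.lower().startswith("xn--"):
            # Already an A-label: keep it, and decode it for the UTF-8 form.
            a_labels.append(label)
            u_labels.append(label.encode("ascii").decode("idna"))
        else:
            # Unicode (or plain ASCII) label: derive the A-label form.
            a_labels.append(label.encode("idna").decode("ascii"))
            u_labels.append(label)

    candidates = [
        ".".join(a_labels).encode("ascii"),  # form used in the public DNS
        ".".join(u_labels).encode("utf-8"),  # form used in a UTF-8 private name space
    ]
    # A pure-ASCII name encodes identically both ways, so it is queried once.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    # "b\u00fccher" is the label "bücher", written with an escape for portability.
    for wire_name in candidate_query_names("b\u00fccher.example"):
        print(wire_name)
```

Which candidate a resolver tries first, and whether it tries both, would depend on what it knows about the name space in use; the sketch only shows that both forms can be derived from whatever the application supplied.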
734 Similarly, a more intelligent stub resolver would also be more 735 liberal in what it would accept from a response as the value of a 736 record (e.g., PTR) in that it would accept either UTF-8 (U-labels in 737 the case of IDNA) or A-labels and convert them to whatever encoding 738 is used by the application APIs to return strings to applications. 740 Indeed the choice of conversion within the resolver libraries is 741 consistent with the quote from section 6.2 of the original IDNA 742 specification [RFC3490] stating that conversion using the Punycode 743 algorithm (i.e., to A-labels) "might be performed inside these new 744 versions of the resolver libraries". 746 That said, some application-layer protocols may be defined to use 747 A-labels rather than UTF-8 as recommended by the IETF character sets 748 and languages policy [RFC2277]. In this case, an application may 749 receive a string containing A-labels and want to pass it to name 750 resolution APIs. Again the recommendation that a resolver library be 751 more liberal in what it would accept from an application would mean 752 that such a name would be accepted and re-encoded as needed, rather 753 than requiring the application to do so. 755 Finally, the question remains about what, if anything, a DNS server 756 should do to handle cases where some existing applications or hosts 757 do IDNA queries using A-labels within the local network using a 758 private name space, and other existing applications or hosts send 759 UTF-8 queries. It is undesirable to store different records for 760 different encodings of the same name, since this introduces the 761 possibility for inconsistency between them. Instead, a new DNS 762 server serving a private name space using UTF-8 could potentially 763 treat encoding-conversion in the same way as case-insensitive 764 comparison, which a DNS server is already required to do, as long as the 765 DNS server has some way to know what the encoding is.
Two encodings 766 are, in this sense, two representations of the same name, just as two 767 case-different strings are. However, whereas case comparison of non- 768 ASCII characters is complicated by ambiguities (as explained in the 769 IAB's Review and Recommendations for Internationalized Domain Names 770 [RFC4690]), encoding conversion between A-labels and U-labels is 771 unambiguous. 773 5. Acknowledgements 775 The authors wish to thank Patrik Faltstrom, Martin Duerst, and JFC 776 Morfin for their careful review and helpful suggestions. 778 6. Security Considerations 780 Having applications convert names to prefixed ACE format (A-labels) 781 before calling name resolution can result in security 782 vulnerabilities. If the name is resolved by protocols or in zones 783 for which records are registered using other encoding schemes, an 784 attacker can claim the A-label version of the same name and hence 785 trick the victim into accessing a different destination. This can be 786 done for any non-ASCII name, even when there is no possible confusion 787 due to case, language, or other issues. Other types of confusion 788 beyond those resulting simply from the choice of encoding scheme are 789 discussed in "Review and Recommendations for IDNs" [RFC4690]. 791 Designers and users of encodings that represent Unicode strings in 792 terms of ASCII should also consider whether trademark protection is 793 an issue, e.g., if one name would be encoded in a way that would be 794 naturally associated with another organization, such as xn--rfc- 795 editor. 797 7. IANA Considerations 799 [RFC Editor: please remove this section prior to publication.] 801 This document has no IANA Actions. 803 8. IAB Members at the time of publication 805 Bernard Aboba 806 Marcelo Bagnulo 807 Ross Callon 808 Spencer Dawkins 809 Vijay Gill 810 Russ Housley 811 John Klensin 812 Olaf Kolkman 813 Danny McPherson 814 Jon Peterson 815 Andrei Robachevsky 816 Dave Thaler 817 Hannes Tschofenig 819 9.
References 821 9.1. Normative References 823 [Unicode] The Unicode Consortium, "The Unicode Standard, Version 824 5.1.0", 2008. 826 defined by: The Unicode Standard, Version 5.0, Boston, MA, 827 Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by 828 Unicode 5.1.0 829 (http://www.unicode.org/versions/Unicode5.1.0/). 831 9.2. Informative References 833 [I-D.cheshire-dnsext-multicastdns] 834 Cheshire, S. and M. Krochmal, "Multicast DNS", 835 draft-cheshire-dnsext-multicastdns-11 (work in progress), 836 March 2010. 838 [I-D.ietf-idn-punycode-00] 839 Costello, A., "Punycode version 0.3.3", 840 draft-ietf-idn-punycode-00 (work in progress), July 2002. 842 [I-D.skwan-utf8-dns-00] 843 Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in 844 the Domain Name System", draft-skwan-utf8-dns-00 (work in 845 progress), November 1997. 847 [IDNA2008-Defs] 848 Klensin, J., "Internationalized Domain Names for 849 Applications (IDNA): Definitions and Document Framework", 850 January 2010. 853 [IDNA2008-Protocol] 854 Klensin, J., "Internationalized Domain Names in 855 Applications (IDNA): Protocol", January 2010. 858 [MJD] Duerst, M., "The Properties and Promizes of UTF-8", 11th 859 International Unicode Conference, San Jose, 860 September 1997. 863 [NIS] Sun Microsystems, "System and Network Administration", 864 March 1990. 866 [RFC0821] Postel, J., "Simple Mail Transfer Protocol", STD 10, 867 RFC 821, August 1982. 869 [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet 870 host table specification", RFC 952, October 1985. 872 [RFC1001] NetBIOS Working Group, "Protocol standard for a NetBIOS 873 service on a TCP/UDP transport: Concepts and methods", 874 STD 19, RFC 1001, March 1987. 876 [RFC1002] NetBIOS Working Group, "Protocol standard for a NetBIOS 877 service on a TCP/UDP transport: Detailed specifications", 878 STD 19, RFC 1002, March 1987.
880 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 881 STD 13, RFC 1034, November 1987. 883 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 884 and Support", STD 3, RFC 1123, October 1989. 886 [RFC1468] Murai, J., Crispin, M., and E. van der Poel, "Japanese 887 Character Encoding for Internet Messages", RFC 1468, 888 June 1993. 890 [RFC1536] Kumar, A., Postel, J., Neuman, C., Danzig, P., and S. 891 Miller, "Common DNS Implementation Errors and Suggested 892 Fixes", RFC 1536, October 1993. 894 [RFC2136] Vixie, P., Thomson, S., Rekhter, Y., and J. Bound, 895 "Dynamic Updates in the Domain Name System (DNS UPDATE)", 896 RFC 2136, April 1997. 898 [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS 899 Specification", RFC 2181, July 1997. 901 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 902 Languages", BCP 18, RFC 2277, January 1998. 904 [RFC3397] Aboba, B. and S. Cheshire, "Dynamic Host Configuration 905 Protocol (DHCP) Domain Search Option", RFC 3397, 906 November 2002. 908 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 909 "Internationalizing Domain Names in Applications (IDNA)", 910 RFC 3490, March 2003. 912 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 913 for Internationalized Domain Names in Applications 914 (IDNA)", RFC 3492, March 2003. 916 [RFC3493] Gilligan, R., Thomson, S., Bound, J., McCann, J., and W. 917 Stevens, "Basic Socket Interface Extensions for IPv6", 918 RFC 3493, February 2003. 920 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 921 10646", STD 63, RFC 3629, November 2003. 923 [RFC3646] Droms, R., "DNS Configuration options for Dynamic Host 924 Configuration Protocol for IPv6 (DHCPv6)", RFC 3646, 925 December 2003. 927 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and 928 Recommendations for Internationalized Domain Names 929 (IDNs)", RFC 4690, September 2006. 931 [RFC4795] Aboba, B., Thaler, D., and L. 
Esibov, "Link-local 932 Multicast Name Resolution (LLMNR)", RFC 4795, 933 January 2007. 935 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network 936 Interchange", RFC 5198, March 2008. 938 [RFC5321] Klensin, J., "Simple Mail Transfer Protocol", RFC 5321, 939 October 2008. 941 Authors' Addresses 943 Dave Thaler 944 Microsoft Corporation 945 One Microsoft Way 946 Redmond, WA 98052 947 USA 949 Phone: +1 425 703 8835 950 Email: dthaler@microsoft.com 952 John C Klensin 953 1770 Massachusetts Ave, Ste 322 954 Cambridge, MA 02140 956 Phone: +1 617 245 1457 957 Email: john+ietf@jck.com 959 Stuart Cheshire 960 Apple Inc. 961 1 Infinite Loop 962 Cupertino, CA 95014 964 Phone: +1 408 974 3207 965 Email: cheshire@apple.com