idnits 2.17.1 draft-iab-idn-encoding-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 396: '... Protocols MUST be able to use th...' RFC 2119 keyword, line 399: '... for all text. Protocols MAY specify,...' RFC 2119 keyword, line 409: '... support MUST be possible....' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 26, 2010) is 5016 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Missing Reference: 'BCP9' is mentioned on line 403, but not defined == Outdated reference: A later version (-15) exists of draft-cheshire-dnsext-multicastdns-11 == Outdated reference: A later version (-02) exists of draft-ietf-idn-punycode-00 == Outdated reference: A later version (-06) exists of draft-skwan-utf8-dns-00 -- Obsolete informational reference (is this intentional?): RFC 821 (Obsoleted by RFC 2821) -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 3 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group D. Thaler 3 Internet-Draft Microsoft 4 Intended status: Informational J. Klensin 5 Expires: January 27, 2011 6 S. Cheshire 7 Apple 8 July 26, 2010 10 IAB Thoughts on Encodings for Internationalized Domain Names 11 draft-iab-idn-encoding-04.txt 13 Abstract 15 This document explores issues with Internationalized Domain Names 16 (IDNs) that result from the use of various encoding schemes such as 17 UTF-8 and the ASCII-Compatible Encoding produced by the Punycode 18 algorithm. It focuses on the importance of agreeing on a canonical 19 format and how complicated it ends up being as a result of using 20 different encodings today. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on January 27, 2011. 39 Copyright Notice 41 Copyright (c) 2010 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. 
Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 58 2. Use of Non-DNS Protocols . . . . . . . . . . . . . . . . . . . 10 59 3. Use of Non-ASCII in DNS . . . . . . . . . . . . . . . . . . . 11 60 3.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . . 15 61 4. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 17 62 5. Security Considerations . . . . . . . . . . . . . . . . . . . 19 63 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 64 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19 65 8. IAB Members at the time of publication . . . . . . . . . . . . 20 66 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 20 67 9.1. Normative References . . . . . . . . . . . . . . . . . . . 20 68 9.2. Informative References . . . . . . . . . . . . . . . . . . 20 69 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 23 71 1. Introduction 73 The goal of this document is to explore what can be learned from some 74 current difficulties in implementing Internationalized Domain Names 75 (IDNs). 77 A domain name consists of a set of labels, conventionally written 78 separated with dots. An Internationalized Domain Name (IDN) is a 79 domain name that contains one or more labels that, in turn, contain 80 one or more non-ASCII characters. Just as with plain ASCII domain 81 names, each IDN label must be encoded using some mechanism before it 82 can be transmitted in network packets, stored in memory, stored on 83 disk, etc. 
These encodings need to be reversible, but they need not 84 store domain names the same way humans conventionally write them on 85 paper. For example, when transmitted over the network in DNS 86 packets, domain name labels are *not* separated with dots. 88 IDNA, discussed later in this document, is the standard that defines 89 the use and coding of internationalized domain names for use on the 90 public Internet. It is described as "Internationalizing Domain Names 91 in Applications (IDNA)" and is defined in several documents. 92 Definitions for the current version and a roadmap of related 93 documents appear in [IDNA2008-Defs]. An earlier version of IDNA 94 [RFC3490] is now being phased out. Except where noted, the two 95 versions are approximately the same with regard to the issues 96 discussed in this document. However, some explanations appeared in 97 the earlier documents that did not seem useful when the revision was 98 created; they are quoted here from the documents in which they 99 appear. In addition, the terminology of the two versions differs 100 somewhat; this document reflects the terminology of the current 101 version. 103 Unicode [Unicode] is a list of characters (including non-spacing 104 marks that are used to form some other characters), where each 105 character is assigned an integer value, called a code point. In 106 simple terms a Unicode string is a string of integer code point 107 values in the range 0 to 1,114,111 (10FFFF in base 16), which 108 represent a string of Unicode characters. These integer code points 109 must be encoded using some mechanism before they can be transmitted 110 in network packets, stored in memory, stored on disk, etc. Some 111 common ways of encoding these integer code point values in computer 112 systems include UTF-8, UTF-16, and UTF-32. In addition to the 113 material below, those forms and the tradeoffs among them are 114 discussed in Chapter 2 of The Unicode Standard [Unicode].
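As a minimal sketch of the relationship between code points and encoded octets, the following Python fragment encodes one short string three ways. The sample string "bücher" and the choice of big-endian byte order are assumptions made only for illustration; nothing here is specific to IDNA.

```python
# Illustrative sample label (an assumption for this sketch): "bücher",
# with U+00FC written as an escape so the source file stays ASCII.
text = "b\u00fccher"

# A Unicode string is, in simple terms, a sequence of integer code points.
code_points = [ord(ch) for ch in text]

# The same code points encoded three different ways.
utf8 = text.encode("utf-8")        # 1 to 4 octets per code point
utf16 = text.encode("utf-16-be")   # one or two 16-bit units per code point
utf32 = text.encode("utf-32-be")   # one 32-bit unit per code point

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, len(data), "octets:", data.hex())

# Every encoding is reversible: decoding yields the original string.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-be") == text
assert utf32.decode("utf-32-be") == text
```

The three octet strings differ in length and content even though they represent the same six code points, which is why a receiver has to know which encoding was used before it can interpret or compare names.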
116 UTF-8 is a mechanism for encoding a Unicode code point in a variable 117 number of 8-bit octets, where an ASCII code point is preserved as-is. 118 Those octets encode a string of integer code point values, which 119 represent a string of Unicode characters. The authoritative 120 definition of UTF-8 is in Sections 3.9 and 3.10 of The Unicode 121 Standard [Unicode], but UTF-8 is also discussed in [RFC3629]. 122 Descriptions and formulae can also be found in Annex D of ISO/IEC 123 10646-1 [10646]. 125 UTF-16 is a mechanism for encoding a Unicode code point in one or two 126 16-bit integers, described in detail in Sections 3.9 and 3.10 of The 127 Unicode Standard [Unicode]. A UTF-16 string encodes a string of 128 integer code point values that represent a string of Unicode 129 characters. 131 UTF-32 (formerly UCS-4), also described in [Unicode] Sections 3.9 and 132 3.10, is a mechanism for encoding a Unicode code point in a single 133 32-bit integer. A UTF-32 string is thus a string of 32-bit integer 134 code point values, which represent a string of Unicode characters. 136 Note that UTF-16 results in some all-zero octets when code points 137 occur early in the Unicode sequence, and UTF-32 always has all-zero 138 octets. 140 IDNA specifies validity of a label, such as what characters it can 141 contain, relationships among them, and so on, in Unicode terms. 142 Valid labels can take either of two forms, with the appropriate one 143 determined by particular protocols or by context. One of those 144 forms, called a U-label, is a direct representation of the Unicode 145 characters using one of the encoding forms discussed above. This 146 document discusses UTF-8 strings in many places. While all U-labels 147 can be represented by UTF-8 strings, not all UTF-8 strings are valid 148 U-labels (see Section 2.3.2 of [IDNA2008-Defs] for a discussion of 149 these distinctions). 
The other, called an A-label, uses a 150 compressed, ASCII-compatible encoding (an "ACE" in IDNA and other 151 terminology) produced by an algorithm called Punycode. U-labels and 152 A-labels are duals of each other: transformations from one to the 153 other do not lose information. The transformation mechanisms are 154 specified in [IDNA2008-Protocol]. 156 Punycode [RFC3492] is thus a mechanism for encoding a Unicode string 157 in an ASCII-compatible encoding, i.e., using only letters, digits, 158 and hyphens from the ASCII character set. When a Unicode label that 159 is valid under the IDNA rules (a U-label) is encoded with Punycode 160 for IDNA purposes, it is prefixed with "xn--"; the result is called 161 an A-label. The prefix convention assumes that no other DNS labels 162 (at least no other DNS labels in IDNA-aware applications) are allowed 163 to start with these four characters. Consequently, when A-label 164 encoding is assumed, any DNS labels beginning with "xn--" now have a 165 different meaning (the Punycode encoding of a label containing one or 166 more non-ASCII characters) or no defined meaning at all (in the case 167 of labels that are not IDNA-compliant, i.e., are not well-formed 168 A-labels). 170 ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII 171 and Japanese characters, where an ASCII character is preserved as-is. 172 ISO-2022-JP is stateful: special sequences are used to switch between 173 character coding tables. As a result, if there are lost or mangled 174 characters in a character stream, it is extremely difficult to 175 recover the original stream after such a lost character encoding 176 shift. 178 Comparison of Unicode strings is not as easy as comparing, for 179 example, ASCII strings. First, there are a multitude of ways of 180 representing a string of Unicode characters. Second, in many 181 languages and scripts, the actual definition of "same" is very 182 context-dependent. 
Because of this, comparison of two Unicode 183 strings must take into account how the Unicode strings are encoded. 184 Regardless of the encoding, however, comparison cannot simply be done 185 by comparing the encoded Unicode strings byte by byte. The only time 186 that is possible is when both strings are mapped into some 187 canonical format and encoded the same way. 189 RFC 2130 [RFC2130] reports on an IAB-sponsored workshop on character 190 sets and encodings. This document adds to that discussion and 191 focuses on the importance of agreeing on a canonical format and how 192 complicated it ends up being as a result of using different encodings 193 today. 195 Different applications, APIs, and protocols use different encoding 196 schemes today. Historically, many of them were originally defined to 197 use only ASCII. Internationalizing Domain Names in Applications 198 (IDNA) [IDNA2008-Defs] defined a mechanism that required changes to 199 applications, but in an attempt not to change APIs or servers, specified 200 that the A-label format is to be used in many contexts. In some ways 201 this could be seen as not changing the existing APIs, in the sense 202 that the strings being passed to and from the APIs were still 203 apparently ASCII strings. In other ways it was a very profound 204 change to the existing APIs, because while those strings were still 205 syntactically valid ASCII strings, they no longer meant the same 206 thing as they used to. What looked like a plain ASCII string to one 207 piece of software or library could be seen by another piece of 208 software or library (with the application of out-of-band information) 209 to be in fact an encoding of a Unicode string. 211 Section 1.3 of the original IDNA specification [RFC3490] states: 213 The IDNA protocol is contained completely within applications. It 214 is not a client-server or peer-to-peer protocol: everything is 215 done inside the application itself.
When used with a DNS resolver 216 library, IDNA is inserted as a "shim" between the application and 217 the resolver library. When used for writing names into a DNS 218 zone, IDNA is used just before the name is committed to the zone. 220 Figure 1 depicts a simplistic architecture that a naive reader might 221 assume from the paragraph quoted above. (A variant of this same 222 picture appears in Section 6 of the IDNA specification [RFC3490] 223 further strengthening this assumption.) 225 +-----------------------------------------+ 226 |Host | 227 | +-------------+ | 228 | | Application | | 229 | +------+------+ | 230 | | | 231 | +----+----+ | 232 | | DNS | | 233 | | Resolver| | 234 | | Library | | 235 | +----+----+ | 236 | | | 237 +-----------------------------------------+ 238 | 239 _________|_________ 240 / \ 241 / \ 242 / \ 243 | Internet | 244 \ / 245 \ / 246 \___________________/ 248 Simplistic Architecture 250 Figure 1 252 There are, however, two problems with this simplistic architecture 253 that cause it to differ from reality. 255 First, resolver APIs on Operating Systems (OSs) today (MacOS, 256 Windows, Linux, etc.) are not DNS-specific. They typically provide a 257 layer of indirection so that the application can work independent of 258 the name resolution mechanism, which could be DNS, mDNS 259 [I-D.cheshire-dnsext-multicastdns], LLMNR [RFC4795], NetBIOS-over-TCP 260 [RFC1001][RFC1002], etc/hosts file [RFC0952], NIS [NIS], or anything 261 else. For example, "Basic Socket Interface Extensions for IPv6" 262 [RFC3493] specifies the getaddrinfo() API and contains many phrases 263 like "For example, when using the DNS" and "any type of name 264 resolution service (for example, the DNS)". Importantly, DNS is 265 mentioned only as an example, and the application has no knowledge as 266 to whether DNS or some other protocol will be used. 
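The layer of indirection described above can be sketched as a generic resolution function that hides the mechanism from its caller. This is a toy model, not the behavior of any real operating system: the names Backend, make_table_backend, and resolve, and the table contents, are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: a backend maps a name to an address, or None if it
# has no answer. Real mechanisms (DNS, mDNS, hosts file, ...) are far richer.
Backend = Callable[[str], Optional[str]]

def make_table_backend(table: Dict[str, str]) -> Backend:
    """Build a toy backend from an in-memory table (stand-in for a real mechanism)."""
    def lookup(name: str) -> Optional[str]:
        return table.get(name)
    return lookup

def resolve(name: str, backends: List[Backend]) -> Optional[str]:
    """Generic API: the caller passes a name and never learns which backend answered."""
    for backend in backends:
        address = backend(name)
        if address is not None:
            return address
    return None

# Illustrative configuration; addresses come from the documentation range.
hosts_file = make_table_backend({"printer": "192.0.2.10"})
dns = make_table_backend({"www.example.com": "192.0.2.80"})
backends = [hosts_file, dns]

print(resolve("printer", backends))          # answered by the hosts table
print(resolve("www.example.com", backends))  # answered by the DNS stand-in
```

Because the application only ever calls resolve(), it is in no position to pick an encoding that is specific to one of the underlying mechanisms.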
268 Second, even with the DNS protocol, private name spaces (sometimes 269 including private uses of the DNS), do not necessarily use the same 270 character set encoding scheme as the public Internet name space. 272 We will discuss each of the above issues in subsequent sections. For 273 reference, Figure 2 depicts a more realistic architecture on typical 274 hosts today (which don't have IDNA inserted as a shim immediately 275 above the DNS resolver library). More generally, the host may be 276 attached to one or more local networks, each of which may or may not 277 be connected to the public Internet and may or may not have a private 278 name space. 280 +-----------------------------------------+ 281 |Host | 282 | +-------------+ | 283 | | Application | | 284 | +------+------+ | 285 | | | 286 | +------+------+ | 287 | | Generic | | 288 | | Name | | 289 | | Resolution | | 290 | | API | | 291 | +------+------+ | 292 | | | 293 | +-----+------+---+--+-------+-----+ | 294 | | | | | | | | 295 | +-+-++--+--++--+-++---+---++--+--++-+-+ | 296 | |DNS||LLMNR||mDNS||NetBIOS||hosts||...| | 297 | +---++-----++----++-------++-----++---+ | 298 | | 299 +-----------------------------------------+ 300 | 301 ______|______ 302 / \ 303 / \ 304 / local \ 305 \ network / 306 \ / 307 \_____________/ 308 | 309 _________|_________ 310 / \ 311 / \ 312 / \ 313 | Internet | 314 \ / 315 \ / 316 \___________________/ 318 Realistic Architecture 320 Figure 2 322 1.1. APIs 324 Section 6.2 of the original IDNA specification [RFC3490] states 325 (where ToASCII and ToUnicode below refer to conversions using the 326 Punycode algorithm): 328 It is expected that new versions of the resolver libraries in the 329 future will be able to accept domain names in other charsets than 330 ASCII, and application developers might one day pass not only 331 domain names in Unicode, but also in local script to a new API for 332 the resolver libraries in the operating system. 
Thus the ToASCII 333 and ToUnicode operations might be performed inside these new 334 versions of the resolver libraries. 336 Resolver APIs such as getaddrinfo() and its predecessor 337 gethostbyname() were defined to accept "char *" arguments, meaning 338 they accept a string of bytes, terminated with a NULL (0) byte. 339 Because of the use of a NULL octet as a string terminator, this is 340 sufficient for ASCII strings (including A-labels) and even 341 ISO-2022-JP and UTF-8 strings (unless an implementation artificially 342 precludes them), but not UTF-16 or UTF-32 strings because a NULL 343 octet could appear in the middle of strings using these 344 encodings. Several operating systems historically used in Japan will 345 accept (and expect) ISO-2022-JP strings in such APIs. Some platforms 346 used worldwide also have new versions of the APIs (e.g., 347 GetAddrInfoW() on Windows) that accept other encoding schemes such as 348 UTF-16. 350 It is worth noting that an API using "char *" arguments can 351 distinguish between conventional ASCII "host name" labels, A-labels, 352 ISO-2022-JP, and UTF-8 labels in names if the coding is known to be 353 one of those four, and the label is intact (no lost or mangled 354 characters). If a stateful encoding like ISO-2022-JP is used, 355 applications extracting labels from text must take special 356 precautions to be sure that the appropriate state-setting characters 357 are included in the string passed to the API. 359 An example method for distinguishing among such codings is as 360 follows: 361 o if the label contains an ESC (0x1B) byte the label is ISO-2022-JP; 362 otherwise, 363 o if any byte in the label has the high bit set, the label is UTF-8; 364 otherwise, 365 o if the label starts with "xn--" then it is presumed to be an 366 A-label; otherwise, 367 o the label is ASCII. 368 Again this assumes that ASCII labels never start with "xn--", and 369 also that UTF-8 strings never contain an ESC character.
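The example method can be written down directly. The sketch below is one possible rendering in Python, under the same assumptions as the list above (the coding is one of the four, and the label is intact). The function name classify_label is hypothetical, and matching the "xn--" prefix without regard to case is an added assumption of this sketch.

```python
def classify_label(label: bytes) -> str:
    """Guess the coding of one label, following the four-step example method."""
    if b"\x1b" in label:                 # ESC byte: ISO 2022 state switch
        return "ISO-2022-JP"
    if any(octet & 0x80 for octet in label):   # any octet with the high bit set
        return "UTF-8"
    if label[:4].lower() == b"xn--":     # ACE prefix (case-insensitive here)
        return "A-label"
    return "ASCII"

# Illustrative inputs (assumptions for this sketch).
samples = [
    b"example",
    b"xn--bcher-kva",
    "b\u00fccher".encode("utf-8"),
    "\u65e5\u672c".encode("iso2022_jp"),
]
for sample in samples:
    print(sample, "->", classify_label(sample))
```

The order of the tests matters: ISO-2022-JP text is made of 7-bit octets, so the ESC check has to come before the high-bit check, and both have to come before the prefix check.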
Also the 370 above is merely an illustration; UTF-8 can be detected and 371 distinguished from other 8-bit encodings with good accuracy [MJD]. 373 It is more difficult or impossible to distinguish the ISO 8859 374 character sets from each other, because they differ in up to about 90 375 characters which have exactly the same encodings, and a short string 376 is very unlikely to contain enough characters to allow a receiver to 377 deduce the character set. Similarly, it is not possible in general 378 to distinguish between ISO-2022-JP and any other encoding based on 379 ISO 2022 code table switching. 381 Although it is possible (as in the example above) to distinguish some 382 encodings when not explicitly specified, it is cleaner to have the 383 encodings specified explicitly, such as specifying UTF-16 for 384 GetAddrInfoW(), or specifying explicitly which APIs expect UTF-8 385 strings. 387 2. Use of Non-DNS Protocols 389 As noted earlier, typical name resolution libraries are not DNS- 390 specific. Furthermore, some protocols are defined to use encoding 391 forms other than IDNA A-labels. For example, mDNS 392 [I-D.cheshire-dnsext-multicastdns] specifies that UTF-8 be used. 393 Indeed, the IETF policy on character sets and languages [RFC2277] 394 (which followed the IAB-sponsored workshop [RFC2130]) states: 396 Protocols MUST be able to use the UTF-8 charset, which consists of 397 the ISO 10646 coded character set combined with the UTF-8 398 character encoding scheme, as defined in [10646] Annex R 399 (published in Amendment 2), for all text. Protocols MAY specify, 400 in addition, how to use other charsets or other character encoding 401 schemes for ISO 10646, such as UTF-16, but lack of an ability to 402 use UTF-8 is a violation of this policy; such a violation would 403 need a variance procedure ([BCP9] section 9) with clear and solid 404 justification in the protocol specification document before being 405 entered into or advanced upon the standards track. 
For existing 406 protocols or protocols that move data from existing datastores, 407 support of other charsets, or even using a default other than 408 UTF-8, may be a requirement. This is acceptable, but UTF-8 409 support MUST be possible. 411 Applications that convert an IDN to A-label form before calling 412 getaddrinfo() will incur name resolution failures if the Punycode 413 name is directly used in such protocols. Having libraries or 414 protocols convert from A-labels to the encoding scheme defined by 415 the protocol (e.g., UTF-8) would require changes to APIs and/or 416 servers, which IDNA was intended to avoid. 418 As a result, applications that assume that non-ASCII names are 419 resolved using the public DNS and blindly convert them to A-labels 420 without knowledge of what protocol will be selected by the name 421 resolution library have problems. Furthermore, name resolution 422 libraries often try multiple protocols until one succeeds, because 423 they are defined to use a common name space. For example, the hosts 424 file, DNS, and NetBIOS-over-TCP are all defined to be able to share a 425 common syntax (e.g., see [RFC0952], [RFC1001] section 11.1.1, and 426 [RFC1034] section 2.1). This means that when an application passes a 427 name to be resolved, resolution may in fact be attempted using 428 multiple protocols, each with a potentially different encoding 429 scheme. For this to work successfully, the name must be converted to 430 the appropriate encoding scheme only after the choice is made to use 431 that protocol. In general, this cannot be done by the application 432 since the choice of protocol is not made by the application. 434 3. Use of Non-ASCII in DNS 436 A common misconception is that DNS only supports names that can be 437 expressed using letters, digits, and hyphens.
439 This misconception originally stemmed from the definition in 1985 of 440 an "Internet host name" (and net, gateway, and domain name) for use 441 in the "hosts" file [RFC0952]. An Internet host name was defined 442 therein as including only letters, digits, and hyphens, where upper 443 and lower case letters were to be treated as identical. The DNS 444 specification [RFC1034] section 3.5 entitled "Preferred name syntax" 445 then repeated this definition in 1987, saying that this "syntax will 446 result in fewer problems with many applications that use domain names 447 (e.g., mail, TELNET)". 449 The confusion was thus left as to whether the "preferred" name syntax 450 was a mandatory restriction in DNS, or merely "preferred". 452 The definition of an Internet host name was updated in 1989 453 ([RFC1123] section 2.1) to allow names starting with a digit (to 454 support IPv4 addresses in dotted-decimal form). Section 6.1 of 455 "Requirements for Internet Hosts -- Application and Support" 456 [RFC1123] discusses the use of DNS (and the hosts file) for resolving 457 host names to IP addresses and vice versa. This led to confusion as 458 to whether all names in DNS are "host names", or whether a "host 459 name" is merely a special case of a DNS name. 461 By 1997, things had progressed to a state where it was necessary to 462 clarify these areas of confusion. "Clarifications to the DNS 463 Specification" [RFC2181] section 11 states: 465 The DNS itself places only one restriction on the particular 466 labels that can be used to identify resource records. That one 467 restriction relates to the length of the label and the full name. 468 The length of any one label is limited to between 1 and 63 octets. 469 A full domain name is limited to 255 octets (including the 470 separators). The zero length full name is defined as representing 471 the root of the DNS tree, and is typically written and displayed 472 as ".". 
Those restrictions aside, any binary string whatever can 473 be used as the label of any resource record. Similarly, any 474 binary string can serve as the value of any record that includes a 475 domain name as some or all of its value (SOA, NS, MX, PTR, CNAME, 476 and any others that may be added). Implementations of the DNS 477 protocols must not place any restrictions on the labels that can 478 be used. 480 Hence, it clarified that the restriction to letters, digits, and 481 hyphens does not apply to DNS names in general, nor to records that 482 include "domain names". Hence the "preferred" name syntax described 483 in the original DNS specification [RFC1034] is indeed merely 484 "preferred", not mandatory. 486 Since there is no restriction even to ASCII, let alone letter-digit- 487 hyphen use, DNS is in conformance with the IETF requirement to allow 488 UTF-8 [RFC2277]. 490 Using UTF-16 or UTF-32 encoding, however, would not be ideal for use 491 in DNS packets or "char *" APIs because existing software already 492 uses ASCII, and UTF-16 and UTF-32 strings can contain all-zero octets 493 that existing software will interpret as the end of the string. To 494 use UTF-16 or UTF-32 one would need some way of knowing whether the 495 string was encoded using ASCII, UTF-16, or UTF-32, and indeed for 496 UTF-16 or UTF-32 whether it was big-endian or little-endian encoding. 497 In contrast, UTF-8 works well because any 7-bit ASCII string is also 498 a UTF-8 string representing the same characters. 500 If a private name space is defined to use UTF-8 (and not other 501 encodings such as UTF-16 or UTF-32), there's no need for a mechanism 502 to know whether a string was encoded using ASCII or UTF-8, because 503 (for any string that can be represented using ASCII) the 504 representations are exactly the same. 
In other words, for any string 505 that can be represented using ASCII it doesn't matter whether it is 506 interpreted as ASCII or UTF-8 because both encodings are the same, 507 and for any string that can't be represented using ASCII, it's 508 obviously UTF-8. In addition, unlike UTF-16 and UTF-32, ASCII and 509 UTF-8 are both byte-oriented encodings so the question of big-endian 510 or little-endian encoding doesn't apply. 512 While implementations of the DNS protocol must not place any 513 restrictions on the labels that can be used, applications that use 514 the DNS are free to impose whatever restrictions they like, and many 515 have. The above rules permit a domain name label that contains 516 unusual characters, such as embedded spaces which many applications 517 would consider a bad idea. For example, the original specification 518 in [RFC0821] of the SMTP protocol [RFC5321] constrains the character 519 set usable in email addresses. There is now an effort underway to 520 permit SMTP to support internationalized email addresses via an 521 extension. 523 Shortly after the DNS Clarifications [RFC2181] and IETF character 524 sets and languages policy [RFC2277] were published, the need for 525 internationalized names within private name spaces (i.e., within 526 enterprises) arose. The current (and past, predating IDNA and the 527 prefixed ACE conventions) practice within enterprises that support 528 other languages is to put UTF-8 names in their internal DNS servers 529 in a private name space. For example, "Using the UTF-8 Character Set 530 in the Domain Name System" [I-D.skwan-utf8-dns-00] was first written 531 in 1997, and was then widely deployed in Windows. 
The use of UTF-8 532 names in DNS was similarly implemented and deployed in MacOS, simply 533 by virtue of the fact that applications blindly passed UTF-8 strings 534 to the name resolution APIs, and the name resolution APIs blindly 535 passed those UTF-8 strings to the DNS servers, and the DNS servers 536 correctly answered those queries, and from the user's point of view 537 everything worked properly without any special new code being 538 written, except that ASCII is matched case-insensitively whereas 539 UTF-8 is not (although some enterprise DNS servers reportedly attempt 540 to do case-insensitive matching on UTF-8 within private name spaces). 541 Within a private name space, and especially in light of the IETF 542 UTF-8 policy [RFC2277], it was reasonable to assume within a private 543 name space that binary strings were encoded in UTF-8. 545 As implied earlier, there are also issues with mapping strings to 546 some canonical form, independent of the encoding. Such issues are 547 not discussed in detail in this document. They are discussed to some 548 extent in, for example, Section 3 of [RFC5198], and are left as 549 opportunities for elaboration in other documents. 551 A few years after UTF-8 was already in use in private name spaces in 552 DNS, the strategy of using a reserved prefix and an ASCII-compatible 553 Encoding (ACE) was developed for IDNA. That strategy included the 554 Punycode algorithm, which began to be developed (during the period 555 from 2002 [I-D.ietf-idn-punycode-00] to 2003 [RFC3492]) for use in 556 the public DNS name space. There were a number of reasons for this. 557 One such reason the prefixed ACE strategy was selected for the public 558 DNS name space had to do with the fact that other encodings such as 559 ISO 8859-1 were also in use in DNS and the various encodings were not 560 necessarily distinguishable from each other. 
Another reason had to 561 do with concerns about whether the details of IDNA, including the use 562 of the Punycode algorithm, were an adequate solution to the problems 563 that were posed. If either the Punycode algorithm or fundamental 564 aspects of character handling were wrong, and had to be changed to 565 something incompatible, it would be possible to switch to a new 566 prefix or adopt another model entirely. Only the part of the public 567 DNS namespace that starts a label with "xn--" would be polluted. 569 Today the algorithm is seen as being about as good as it can 570 realistically be, so moving to a different encoding (UTF-8 as 571 suggested in this document) that can be viewed as "native" would not 572 be as risky as it would have been in 2002. 574 In any case, the publication of [RFC3492] and the dependencies on it 575 in [IDNA2008-Protocol] and the earlier [RFC3490] thus resulted in 576 having to use different encodings for different name spaces (where 577 UTF-8 for private name spaces was already deployed). Hence, 578 referring back to Figure 2, a different encoding scheme may be in use 579 on the Internet vs. a local network. 581 In general a host may be connected to zero or more networks using 582 private name spaces, plus potentially the public name space. 583 Applications that convert a U-label form IDN to an A-label before 584 calling getaddrinfo() will incur name resolution failures if the name 585 is actually registered in a private name space in some other encoding 586 (e.g., UTF-8). Having libraries or protocols convert from A-labels 587 to the encoding used by a private name space (e.g., UTF-8) would 588 require changes to APIs and/or servers, which IDNA was intended to 589 avoid. 
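The failure mode described above can be illustrated with a small sketch. It assumes a private name space modeled as a Python dictionary keyed by UTF-8 octets. The helper to_a_label is hypothetical and applies only raw Punycode plus the "xn--" prefix, skipping the validity checks and mappings that IDNA actually requires.

```python
def to_a_label(u_label: str) -> str:
    """Raw Punycode plus the ACE prefix (no IDNA validation; illustration only)."""
    return "xn--" + u_label.encode("punycode").decode("ascii")

u_label = "b\u00fccher"            # illustrative label: "bücher"
a_label = to_a_label(u_label)      # ASCII-compatible form of the same label
utf8_label = u_label.encode("utf-8")

# Hypothetical private name space that registered the name in UTF-8.
private_zone = {utf8_label: "192.0.2.1"}

# An application that converts to an A-label before lookup finds nothing,
# because the octets sent do not match the octets registered.
print(private_zone.get(a_label.encode("ascii")))

# Passing the UTF-8 form through unchanged finds the record.
print(private_zone.get(utf8_label))

# The two forms are duals: the A-label decodes back to the same U-label.
assert a_label[4:].encode("ascii").decode("punycode") == u_label
```

Both forms carry the same information, yet a lookup succeeds only when the form sent matches the form the name space uses, and the application generally cannot know which one that is.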

   Also, a fully-qualified domain name (FQDN) to be resolved may be
   obtained directly from an application, or it may be composed by the
   DNS resolver itself from a single label obtained from an application
   by using a configured suffix search list, and the resulting FQDN may
   use multiple encodings in different labels.  For more information on
   the suffix search list, see section 6 of "Common DNS Implementation
   Errors and Suggested Fixes" [RFC1536], the DHCP Domain Search Option
   [RFC3397], and section 4 of "DNS Configuration options for DHCPv6"
   [RFC3646].

   As noted in [RFC1536] section 6, the community has had bad
   experiences (e.g., [RFC1535]) with "searching" for domain names by
   trying multiple variations or appending different suffixes.  Such
   searching can yield inconsistent results depending on the order in
   which alternatives are tried.  Nonetheless, the practice is
   widespread and must be considered.

   The practice of searching for names, whether by the use of a suffix
   search list or by searching in different namespaces, can yield
   inconsistent results.  For example, even when a suffix search list is
   only used when an application provides a name containing no dots, two
   clients with different configured suffix search lists can get
   different answers, and the same client could get different answers at
   different times if it changes its configuration (e.g., when moving to
   another network).  A deeper discussion of this topic is outside the
   scope of this document.

3.1.  Examples

   Some examples of cases that can happen in existing implementations
   today (where {non-ASCII} below represents some user-entered non-ASCII
   string) are:

   1.  User types in {non-ASCII}.{non-ASCII}.com, and the application
       passes it, in the form of a UTF-8 string, to getaddrinfo or
       gethostbyname or equivalent.
       *  The DNS resolver passes the (UTF-8) string unmodified to a DNS
          server.

   2.  User types in {non-ASCII}.{non-ASCII}.com, and the application
       passes it to a name resolution API that accepts strings in some
       other encoding such as UTF-16, e.g., GetAddrInfoW on Windows.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver converts the name from UTF-16 to UTF-8 and
          passes the query to a DNS server.

   3.  User types in {non-ASCII}.{non-ASCII}.com, but the application
       first converts it to A-label form such that the name that is
       passed to name resolution APIs is (say)
       xn--e1afmkfd.xn--80akhbyknj4f.com.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver passes the string unmodified to a DNS server.
       *  If the name is not found in DNS, the name resolution API
          decides to try another protocol, say mDNS.
       *  The query goes out in mDNS, but since mDNS specified that
          names are to be registered in UTF-8, the name isn't found
          since it was encoded as an A-label in the query.

   4.  User types in {non-ASCII}, and the application passes it, in the
       form of a UTF-8 string, to getaddrinfo or equivalent.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver will append suffixes in the suffix search
          list, which may contain UTF-8 characters if the local network
          uses a private name space.
       *  Each FQDN in turn will then be sent in a query to a DNS
          server, until one succeeds.

   5.  User types in {non-ASCII}, but the application first converts it
       to an A-label, such that the name that is passed to getaddrinfo
       or equivalent is (say) xn--e1afmkfd.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS stub resolver will append suffixes in the suffix
          search list, which may contain UTF-8 characters if the local
          network uses a private name space, resulting in (say)
          xn--e1afmkfd.{non-ASCII}.com
       *  Each FQDN in turn will then be sent in a query to a DNS
          server, until one succeeds.
       *  Since the private name space in this case uses UTF-8, the
          above queries fail, since the A-label version of the name was
          not registered in that name space.

   6.  User types in {non-ASCII1}.{non-ASCII2}.{non-ASCII3}.com, where
       {non-ASCII3}.com is a public name space using IDNA and A-labels,
       but {non-ASCII2}.{non-ASCII3}.com is a private name space using
       UTF-8, which is accessible to the user.  The application passes
       the name, in the form of a UTF-8 string, to getaddrinfo or
       equivalent.
       *  The name resolution API decides to pass the string to DNS (and
          possibly other protocols).
       *  The DNS resolver tries to locate the authoritative server, but
          fails the lookup because it cannot find a server for the UTF-8
          encoding of {non-ASCII3}.com, even though it would have access
          to the private name space.  (To make this work, the private
          name space would need to include the UTF-8 encoding of
          {non-ASCII3}.com.)

   When users use multiple applications, some of which do A-label
   conversion prior to passing a name to name resolution APIs, and some
   of which do not, odd behavior can result which at best violates the
   principle of least surprise, and at worst can result in security
   vulnerabilities.

   First consider two competing applications, such as web browsers, that
   are designed to achieve the same task.  If the user types the same
   name into each browser, one may successfully resolve the name (and
   hence access the desired content) because the encoding scheme was
   correct, while the other may fail name resolution because the
   encoding scheme was incorrect.  Hence the issue can give users an
   incentive to switch to another application (which in some cases means
   switching to an IDNA application, and in other cases means switching
   away from an IDNA application).

   Next consider two separate applications where one is designed to be
   launched from the other, for example a web browser launching a media
   player application when the link to a media file is clicked.  If both
   types of content (web pages and media files in this example) are
   hosted at the same IDN in a private name space, but one application
   converts to A-labels before calling name resolution APIs and the
   other does not, the user may be able to access a web page and click
   on the media file, causing the media player to launch and attempt to
   retrieve the media file, which will then fail because the IDN
   encoding scheme was incorrect.  Or even worse, if an attacker was
   able to register the same name in the other encoding scheme, the user
   may get the content from the attacker's machine.  This is similar to
   a normal phishing attack, except that the two names represent exactly
   the same Unicode characters.

4.  Recommendations

   On many platforms, the name resolution library will automatically use
   a variety of protocols to search a variety of name spaces which might
   be using UTF-8 or other encodings.  In addition, even when only the
   DNS protocol is used, in many operational environments, a private DNS
   name space using UTF-8 is also deployed and is automatically searched
   by the name resolution library.

   As explained earlier, using multiple canonical formats, and multiple
   encodings in different protocols or even in different places in the
   same namespace, creates problems.  Because of this, and the fact that
   both IDNA A-labels and UTF-8 are in use as encoding mechanisms for
   domain names today, we recommend the following.

   It is inappropriate for an application that calls a general-purpose
   name resolution library to convert a name to an A-label unless the
   application is absolutely certain that, in all environments where the
   application might be used, only the global DNS that uses IDNA
   A-labels actually will be used to resolve the name.

   Instead, conversion to A-label form, UTF-8, or any other encoding,
   should be done only by an entity that knows which protocol will be
   used (e.g., the DNS resolver, or getaddrinfo upon deciding to pass
   the name to DNS), rather than by general applications that call
   protocol-independent name resolution APIs.  (Of course, it is still
   necessary for applications to convert to whatever form those APIs
   expect.)  Similarly, even when DNS is used, the conversion to
   A-labels should be done only by an entity that knows which name space
   will be used.

   That is, a more intelligent DNS resolver would be more liberal in
   what it would accept from an application and be able to query for
   both a name in A-label form (e.g., over the Internet) and a UTF-8
   name (e.g., over a corporate network with a private name space) in
   case the server only recognized one.  However, we might also take
   into account that the various resolution behaviors discussed earlier
   could also occur with record updates (e.g., with Dynamic Update
   [RFC2136]), resulting in some names being registered in a local
   network's private name space by applications doing conversion to
   A-labels, and other names being registered using UTF-8.  Hence a name
   might have to be queried with both encodings to be sure to succeed
   without changes to DNS servers.
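
   The idea of querying with both encodings can be sketched as follows.
   This is a non-normative illustration, not a resolver implementation:
   it performs no network queries, it assumes Python's built-in "idna"
   codec (which follows [RFC3490] rather than IDNA2008), and the
   function name candidate_encodings is hypothetical.

```python
def candidate_encodings(name: str) -> list[bytes]:
    """Return the byte strings a liberal resolver could try for `name`.

    `name` may hold U-labels, A-labels, or a mix of both. The function
    name and return shape are hypothetical, not part of any real API.
    """
    # ToASCII on every label: U-labels become A-labels, ASCII labels
    # (including labels that already are A-labels) pass through.
    a_form = name.encode("idna")
    # ToUnicode on every label, then serialize the result as UTF-8.
    utf8_form = a_form.decode("idna").encode("utf-8")
    # A plain ASCII name has only one form, so drop the duplicate.
    return [a_form] if a_form == utf8_form else [a_form, utf8_form]


# The same candidates result whichever form the application passed in.
a_name = "xn--e1afmkfd.xn--80akhbyknj4f.com"   # from Section 3.1
u_name = a_name.encode("ascii").decode("idna")  # U-label spelling
from_a = candidate_encodings(a_name)
from_u = candidate_encodings(u_name)

for wire_name in from_a:
    print(wire_name)
```

   Because the candidates are derived inside the library, the
   application does not need to know which name space will answer.  A
   name whose labels are registered in different encodings, as in the
   suffix search list examples, would need per-label handling that this
   sketch omits.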

   Similarly, a more intelligent stub resolver would also be more
   liberal in what it would accept from a response as the value of a
   record (e.g., PTR) in that it would accept either UTF-8 (U-labels in
   the case of IDNA) or A-labels and convert them to whatever encoding
   is used by the application APIs to return strings to applications.

   Indeed the choice of conversion within the resolver libraries is
   consistent with the quote from section 6.2 of the original IDNA
   specification [RFC3490] stating that conversion using the Punycode
   algorithm (i.e., to A-labels) "might be performed inside these new
   versions of the resolver libraries".

   That said, some application-layer protocols (e.g., [RFC5731]) are
   defined to use A-labels rather than simply using UTF-8 as recommended
   by the IETF character sets and languages policy [RFC2277].  In this
   case, an application may receive a string containing A-labels and
   want to pass it to name resolution APIs.  Again the recommendation
   that a resolver library be more liberal in what it would accept from
   an application would mean that such a name would be accepted and
   re-encoded as needed, rather than requiring the application to do so.

   It is important that any APIs used by applications to pass names
   specify what encoding(s) the API uses.  For example, GetAddrInfoW()
   on Windows specifies that it accepts UTF-16.  In contrast, the
   original specification of getaddrinfo() [RFC3493] did not, and hence
   platforms vary in what they use (e.g., MacOS uses UTF-8 whereas
   Windows uses Windows code pages).

   Finally, the question remains about what, if anything, a DNS server
   should do to handle cases where some existing applications or hosts
   do IDNA queries using A-labels within the local network using a
   private name space, and other existing applications or hosts send
   UTF-8 queries.  It is undesirable to store different records for
   different encodings of the same name, since this introduces the
   possibility for inconsistency between them.  Instead, a new DNS
   server serving a private name space using UTF-8 could potentially
   treat encoding-conversion in the same way as case-insensitive
   comparison, which a DNS server is already required to do, as long as
   the DNS server has some way to know what the encoding is.  Two
   encodings are, in this sense, two representations of the same name,
   just as two case-different strings are.  However, whereas case
   comparison of non-ASCII characters is complicated by ambiguities (as
   explained in the IAB's Review and Recommendations for
   Internationalized Domain Names [RFC4690]), encoding conversion
   between A-labels and U-labels is unambiguous.

5.  Security Considerations

   Having applications convert names to prefixed ACE format (A-labels)
   before calling name resolution can result in security
   vulnerabilities.  If the name is resolved by protocols or in zones
   for which records are registered using other encoding schemes, an
   attacker can claim the A-label version of the same name and hence
   trick the victim into accessing a different destination.  This can be
   done for any non-ASCII name, even when there is no possible confusion
   due to case, language, or other issues.  Other types of confusion
   beyond those resulting simply from the choice of encoding scheme are
   discussed in "Review and Recommendations for IDNs" [RFC4690].

   Designers and users of encodings that represent Unicode strings in
   terms of ASCII should also consider whether trademark protection or
   phishing are issues, e.g., if one name would be encoded in a way that
   would be naturally associated with another organization or product.

6.  IANA Considerations

   [RFC Editor: please remove this section prior to publication.]

   This document has no IANA Actions.

7.  Acknowledgements

   The authors wish to thank Patrik Faltstrom, Martin Duerst, JFC
   Morfin, Ran Atkinson, S. Moonesamy, Paul Hoffman, and Stephane
   Bortzmeyer for their careful review and helpful suggestions.  It is
   also interesting to note that none of the first three individuals'
   names above can be spelled out and written correctly in ASCII text.
   Furthermore, one of the IAB members' names below (Andrei Robachevsky)
   cannot be written in the script as it appears on his birth
   certificate.

8.  IAB Members at the time of publication

   Bernard Aboba
   Marcelo Bagnulo
   Ross Callon
   Spencer Dawkins
   Vijay Gill
   Russ Housley
   John Klensin
   Olaf Kolkman
   Danny McPherson
   Jon Peterson
   Andrei Robachevsky
   Dave Thaler
   Hannes Tschofenig

9.  References

9.1.  Normative References

   [10646]    International Organization for Standardization,
              "Information Technology - Universal Multiple-octet coded
              Character Set (UCS)".

              ISO/IEC Standard 10646, comprised of ISO/IEC 10646-1:2000,
              "Information technology -- Universal Multiple-Octet Coded
              Character Set (UCS) -- Part 1: Architecture and Basic
              Multilingual Plane", ISO/IEC 10646-2:2001, "Information
              technology -- Universal Multiple-Octet Coded Character Set
              (UCS) -- Part 2: Supplementary Planes" and ISO/IEC
              10646-1:2000/Amd 1:2002, "Mathematical symbols and other
              characters".

   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
              5.1.0", 2008.

              defined by: The Unicode Standard, Version 5.0, Boston, MA,
              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
              Unicode 5.1.0
              (http://www.unicode.org/versions/Unicode5.1.0/).

9.2.  Informative References

   [I-D.cheshire-dnsext-multicastdns]
              Cheshire, S. and M. Krochmal, "Multicast DNS",
              draft-cheshire-dnsext-multicastdns-11 (work in progress),
              March 2010.

   [I-D.ietf-idn-punycode-00]
              Costello, A., "Punycode version 0.3.3",
              draft-ietf-idn-punycode-00 (work in progress), July 2002.

   [I-D.skwan-utf8-dns-00]
              Kwan, S. and J. Gilroy, "Using the UTF-8 Character Set in
              the Domain Name System", draft-skwan-utf8-dns-00 (work in
              progress), November 1997.

   [IDNA2008-Defs]
              Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              January 2010.

   [IDNA2008-Protocol]
              Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", January 2010.

   [MJD]      Duerst, M., "The Properties and Promizes of UTF-8", 11th
              International Unicode Conference, San Jose,
              September 1997.

   [NIS]      Sun Microsystems, "System and Network Administration",
              March 1990.

   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10,
              RFC 821, August 1982.

   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1001]  NetBIOS Working Group, "Protocol standard for a NetBIOS
              service on a TCP/UDP transport: Concepts and methods",
              STD 19, RFC 1001, March 1987.

   [RFC1002]  NetBIOS Working Group, "Protocol standard for a NetBIOS
              service on a TCP/UDP transport: Detailed specifications",
              STD 19, RFC 1002, March 1987.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1468]  Murai, J., Crispin, M., and E. van der Poel, "Japanese
              Character Encoding for Internet Messages", RFC 1468,
              June 1993.

   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
              With Widely Deployed DNS Software", RFC 1535,
              October 1993.

   [RFC1536]  Kumar, A., Postel, J., Neuman, C., Danzig, P., and S.
              Miller, "Common DNS Implementation Errors and Suggested
              Fixes", RFC 1536, October 1993.

   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
              the IAB Character Set Workshop held 29 February - 1 March,
              1996", RFC 2130, April 1997.

   [RFC2136]  Vixie, P., Thomson, S., Rekhter, Y., and J. Bound,
              "Dynamic Updates in the Domain Name System (DNS UPDATE)",
              RFC 2136, April 1997.

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, July 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC3397]  Aboba, B. and S. Cheshire, "Dynamic Host Configuration
              Protocol (DHCP) Domain Search Option", RFC 3397,
              November 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
              Stevens, "Basic Socket Interface Extensions for IPv6",
              RFC 3493, February 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3646]  Droms, R., "DNS Configuration options for Dynamic Host
              Configuration Protocol for IPv6 (DHCPv6)", RFC 3646,
              December 2003.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC4795]  Aboba, B., Thaler, D., and L. Esibov, "Link-local
              Multicast Name Resolution (LLMNR)", RFC 4795,
              January 2007.

   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
              Interchange", RFC 5198, March 2008.

   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
              October 2008.

   [RFC5731]  Hollenbeck, S., "Extensible Provisioning Protocol (EPP)
              Domain Name Mapping", STD 69, RFC 5731, August 2009.

Authors' Addresses

   Dave Thaler
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA  98052
   USA

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140

   Phone: +1 617 245 1457
   Email: john+ietf@jck.com

   Stuart Cheshire
   Apple Inc.
   1 Infinite Loop
   Cupertino, CA  95014

   Phone: +1 408 974 3207
   Email: cheshire@apple.com