Internet Draft                                          Patrik Faltstrom
draft-ietf-idn-idna-04.txt                                         Cisco
November 8, 2001                                            Paul Hoffman
Expires in six months                                         IMC & VPNC
                                                        Adam M. Costello
                                                             UC Berkeley

          Internationalizing Host Names in Applications (IDNA)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

Until now, there has been no standard way for host names to use
characters outside the ASCII repertoire. This document describes a
mechanism called IDNA that enables internationalized host names,
that is, host names that use characters drawn from a much larger
repertoire. (The "D" in the name originally stood for "domain",
but the work is actually focused on host names, so the word
"host" is used throughout this document.)

1. Introduction

IDNA works by allowing applications to use certain ASCII name labels
(beginning with a special prefix) to represent non-ASCII name labels.
Lower-layer protocols need not be aware of this; therefore IDNA does not
require changes to any infrastructure. In particular, IDNA does not
require any changes to DNS servers, resolvers, or protocol elements,
because the ASCII name service provided by the existing DNS is entirely
sufficient.

This document does not require any applications to conform to IDNA,
but applications can elect to use IDNA in order to support IDN while
maintaining interoperability with existing infrastructure. Adding IDNA
support to an existing application entails changes to the application
only, and leaves room for flexibility in the user interface.

A great deal of the discussion of IDN solutions has focused on
transition issues and how IDN will work in a world where not all of the
components have been updated. Other proposals would require that user
applications, resolvers, and DNS servers be updated in order for a user
to use an internationalized host name. Rather than require widespread
updating of all components, IDNA requires only user applications to be
updated; no changes are needed to the DNS protocol or any DNS servers or
the resolvers on users' computers.

This document is being discussed on the ietf-idna@mail.apps.ietf.org
mailing list. To subscribe, send a message to
ietf-idna-request@mail.apps.ietf.org with the single word "subscribe" in
the body of the message.

2. Terminology

[[ Editor's note: the authors are considering changing "host name" to
"domain name" throughout the document after discussing this further
with the DNS experts. ]]

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

A code point is an integral value associated with a character in a coded
character set.

Unicode [UNICODE] is a coded character set containing tens of thousands
of characters. A single Unicode code point is denoted by "U+" followed
by four to six hexadecimal digits, while a range of Unicode code points
is denoted by two hexadecimal numbers separated by "..", with no
prefixes.
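
As an informal illustration of the code point notation just described,
the following Python sketch formats single code points and ranges. The
helper names format_code_point and format_range are hypothetical; they
are used here for illustration only and are not part of IDNA.

```python
def format_code_point(cp: int) -> str:
    """Render one code point as "U+" followed by four to six hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # :04X pads to at least four digits; larger values use five or six.
    return f"U+{cp:04X}"


def format_range(first: int, last: int) -> str:
    """Render a range as two hex numbers separated by "..", no prefixes."""
    return f"{first:X}..{last:X}"


print(format_code_point(ord("-")))  # U+002D
print(format_range(0x00, 0x7F))     # 0..7F
```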

ASCII means US-ASCII, a coded character set containing 128 characters
associated with code points in the range 0..7F. Unicode is an extension
of ASCII: it includes all the ASCII characters and associates them with
the same code points.

The term "LDH code points" is defined in this document to mean the code
points associated with ASCII letters, digits, and the hyphen-minus; that
is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an abbreviation for
"letters, digits, hyphen".

A host label is an individual part of a host name. Host labels are
usually shown separated by dots; for example, the host name
"www.example.com" is composed of three host labels: "www", "example",
and "com". In IDNA, not all text strings can be host labels. A string
can be a host label if and only if the ToASCII operation (see section 4)
does not fail when applied to it. (The zero-length root label that is
implied in host names, as described in [STD13], is not considered a
label in this specification.)

An "ACE label" is defined in this document to be a host label that
contains only ASCII characters but represents a label containing
non-ASCII characters (ACE stands for "ASCII-compatible encoding").
Internationalized host labels generally contain non-ASCII characters,
but for every host label that cannot be directly represented in ASCII
there is an equivalent ACE label. The conversion of host labels to and
from the ACE form is specified in section 4.

The "ACE prefix" is defined in this document to be a string of ASCII
characters that appears at the beginning of every ACE label. It is
specified in section 5.

A "host name slot" is defined in this document to be a protocol element
or a function argument or a return value (and so on) explicitly
designated for carrying a host name.
Examples of host name slots
include: the QNAME field of a DNS query; the name argument of the
gethostbyname() library function; the part of an email address following
the at-sign (@) in the From: field of an email message header; and the host
portion of the URI in the src attribute of an HTML tag.
General text that just happens to contain a host name is not a host name
slot; for example, a host name appearing in the plain text body of an
email message is not occupying a host name slot.

An "internationalized host name slot" is defined in this document to be
a host name slot explicitly designated for carrying an internationalized
host name as described in this document. The designation may be static
(for example, in the specification of the protocol or interface) or
dynamic (for example, as a result of negotiation in an interactive
session).

A "generic host name slot" is defined in this document to be any host
name slot that is not an internationalized host name slot. Obviously,
this includes any host name slot whose specification predates IDNA.

3. Requirements

IDNA conformance means adherence to the following three rules:

1) Whenever a host name is put into a generic host name slot, every
label MUST contain only ASCII characters. Given any host name, an
equivalent host name satisfying this requirement can be obtained by
applying the ToASCII operation (see section 4) to each label.

2) ACE labels SHOULD be hidden from users whenever possible. Therefore,
before a host name is displayed to a user or is output into a context
likely to be viewed by users, the ToUnicode operation (see section 4)
SHOULD be applied to each label. When requirements 1 and 2 both apply,
requirement 1 takes precedence.

3) Whenever two host labels are compared, they MUST be considered to
match if and only if their ASCII forms (obtained by applying ToASCII)
match using a case-insensitive ASCII comparison.

4. Conversion operations

This section specifies the ToASCII and ToUnicode operations. Each one
operates on a sequence of Unicode code points (but remember that all
ASCII code points are also Unicode code points). When host names are
represented using character sets other than Unicode and ASCII, they will
need to first be transcoded to Unicode before these operations can be
applied, and might need to be transcoded back afterwards.

4.1 ToASCII

The ToASCII operation takes a sequence of Unicode code points and
transforms it into a sequence of code points in the ASCII range (0..7F).
The original sequence and the resulting sequence are equivalent host
labels.

ToASCII fails if any step of it fails. Failure means that the original
sequence cannot be used as a host label.

ToASCII never alters a sequence of code points that are all in the ASCII
range to begin with (although it may fail).

ToASCII consists of the following steps:

1. If all code points in the sequence are in the ASCII range (0..7F)
   then skip to step 3.

2. Perform the steps specified in [NAMEPREP].

3. Host-specific restrictions:
   Host names have additional restrictions:

   * Verify the absence of non-LDH ASCII code points; that is, the
     absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.

   * Verify the absence of leading and trailing hyphen-minus; that
     is, the absence of U+002D at the beginning and end of the
     sequence.

4. If all code points in the sequence are in the ASCII range (0..7F),
   then skip to step 8.

5. Verify that the sequence does NOT begin with the ACE prefix.

6. Encode the sequence using the encoding algorithm in [AMC-ACE-Z].

7. Prepend the ACE prefix.

8. Verify that the number of code points is in the range 1 to 63
   inclusive.

4.2 ToUnicode

The ToUnicode operation takes a sequence of Unicode code points and
returns a sequence of Unicode code points. If the input sequence is a
host label in ACE form, then the result is an equivalent host label
that is not in ACE form; otherwise the original sequence is returned
unaltered.

ToUnicode never fails. If any step fails, then the original input
sequence is returned immediately in that step.

1. If all code points in the sequence are in the ASCII range (0..7F)
   then skip to step 3.

2. Perform the steps specified in [NAMEPREP]. (If step 3
   of ToASCII is also performed here, it will not affect the
   overall behavior of ToUnicode, but it is not necessary.)

3. Verify that the sequence begins with the ACE prefix, and save a
   copy of the sequence.

4. Remove the ACE prefix.

5. Decode the sequence using the decoding algorithm in [AMC-ACE-Z].
   Save a copy of the result of this step.

6. Apply ToASCII.

7. Verify that the sequence matches the saved copy from step 3, using
   a case-insensitive ASCII comparison.

8. Return the saved copy from step 5.

5. ACE prefix

The ACE prefix, used in the conversion operations (section 4), will
be specified in a future revision of this document. It will be two
alphanumeric ASCII characters followed by two hyphen-minuses. It MUST
be recognized in a case-insensitive manner.

For example, the eventual ACE prefix might be the string "jk--". In this
case, an ACE label might be "jk--r3c2a-qc902xs", where "r3c2a-qc902xs"
is the part of the ACE label that is generated by the encoding steps in
[AMC-ACE-Z].

6. Implications for typical applications using DNS

In IDNA, applications perform the processing needed to input
internationalized host names from users, display internationalized
host names to users, and process the inputs and outputs from DNS and
other protocols that carry host names.

The components and interfaces between them can be represented
pictorially as:

                     +------+
                     | User |
                     +------+
                        ^
                        | Input and display: local interface methods
                        | (pen, keyboard, glowing phosphorus, ...)
    +-------------------|-------------------------------+
    |                   v                               |
    |          +-----------------------------+          |
    |          |         Application         |          |
    |          |  (conversion between local  |          |
    |          |  character set and Unicode  |          |
    |          |        is done here)        |          |
    |          +-----------------------------+          |
    |                 ^          ^                      | End system
    |                 |          |                      |
    |Call to resolver:|          | Application-specific |
    |       ACE       |          | protocol:            |
    |                 v          | predefined by the    |
    |            +----------+    | protocol or defaults |
    |            | Resolver |    | to ACE               |
    |            +----------+    |                      |
    |                 ^          |                      |
    +-----------------|----------|----------------------+
        DNS protocol: |          |
                  ACE |          |
                      v          v
            +-------------+   +---------------------+
            | DNS servers |   | Application servers |
            +-------------+   +---------------------+

6.1 Entry and display in applications

Applications can accept host names using any character set or sets
desired by the application developer, and can display host names in any
charset. That is, the IDNA protocol does not affect the interface
between users and applications.

An IDNA-aware application can accept and display internationalized host
names in two formats: the internationalized character set(s) supported
by the application, and as an ACE label. Applications MAY allow input
and display of ACE labels, but are not encouraged to do so except as an
interface for special purposes, possibly for debugging.
ACE encoding is
opaque and ugly, and should thus only be exposed to users who absolutely
need it. The optional use, especially during a transition period, of ACE
encodings in the user interface is described in section 6.4. Because
name labels encoded as ACE name labels can be rendered either as the
encoded ASCII characters or the proper decoded characters, the
application MAY have an option for the user to select the preferred
method of display; if it does, rendering the ACE SHOULD NOT be the
default.

Host names are often stored and transported in many places. For example,
they are part of documents such as mail messages and web pages. They are
transported in the many parts of many protocols, such as both the
control commands and the RFC 2822 body parts of SMTP, and the headers
and the body content in HTTP. It is important to remember that host
names appear both in host name slots and in the content that is passed
over protocols.

In protocols and document formats that define how to handle
specification or negotiation of charsets, IDN host name labels can be
encoded in any charset allowed by the protocol or document format. If a
protocol or document format only allows one charset, IDN host name
labels MUST be given in that charset. In any place where a protocol or
document format allows transmission of the characters in IDN host name
labels, IDN host name labels SHOULD be transmitted using whatever
character encoding and escape mechanism that the protocol or document
format uses at that place.

All protocols that have generic host name slots already have the
capacity for handling host names in the ASCII charset. Thus, IDN host
name labels that have been processed with the ToASCII operation can
inherently be handled by those protocols.
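
As a rough, non-normative illustration, the ToASCII and ToUnicode steps
of section 4 might be sketched in Python as follows. Several things here
are assumptions rather than part of this specification: the prefix
"jk--" is only the example from section 5; Python's built-in "punycode"
codec is swapped in for the [AMC-ACE-Z] algorithm; and nameprep_stub
(lowercasing plus NFKC normalization) is a crude stand-in for
[NAMEPREP]. The function names are hypothetical.

```python
import unicodedata

ACE_PREFIX = "jk--"  # assumption: example prefix from section 5, not final


def nameprep_stub(label: str) -> str:
    # Assumption: crude stand-in for [NAMEPREP] (lowercase + NFKC only).
    return unicodedata.normalize("NFKC", label.lower())


def is_ascii(label: str) -> bool:
    return all(ord(c) <= 0x7F for c in label)


def to_ascii(label: str) -> str:
    """Sketch of ToASCII (section 4.1); raises ValueError on failure."""
    if not is_ascii(label):                      # steps 1-2
        label = nameprep_stub(label)
    for c in label:                              # step 3: LDH check
        if ord(c) <= 0x7F and not (c.isalnum() or c == "-"):
            raise ValueError("non-LDH ASCII code point")
    if label.startswith("-") or label.endswith("-"):
        raise ValueError("leading or trailing hyphen-minus")
    if not is_ascii(label):                      # step 4
        if label.lower().startswith(ACE_PREFIX):  # step 5
            raise ValueError("label begins with the ACE prefix")
        # Steps 6-7; "punycode" is a stand-in for [AMC-ACE-Z].
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if not 1 <= len(label) <= 63:                # step 8
        raise ValueError("label length out of range")
    return label


def to_unicode(label: str) -> str:
    """Sketch of ToUnicode (section 4.2); returns the input on any failure."""
    original = label
    try:
        if not is_ascii(label):                  # steps 1-2
            label = nameprep_stub(label)
        if not label.lower().startswith(ACE_PREFIX):  # step 3
            return original
        saved_ace = label
        encoded = label[len(ACE_PREFIX):]        # step 4
        decoded = encoded.encode("ascii").decode("punycode")  # step 5
        if to_ascii(decoded).lower() != saved_ace.lower():    # steps 6-7
            return original
        return decoded                           # step 8
    except ValueError:  # UnicodeError is a subclass of ValueError
        return original


print(to_ascii("b\u00fccher"))                         # jk--bcher-kva
print(to_unicode("jk--bcher-kva") == "b\u00fccher")    # True
print(to_unicode("example"))                           # example
```

Note that to_unicode wraps every step in one try block, mirroring the
rule that ToUnicode never fails and returns the original input instead.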

6.2 Applications and resolvers

Applications communicate with resolver libraries through a programming
interface (API). Typically, the IETF does not standardize APIs, although
there are non-standard APIs specified for IPv6. This protocol does not
specify a specific API, but instead specifies the operations that must
be used for input to and output from the resolver library.

An application MUST prepare name parts that are sent in the DNS
protocol using the ToASCII operation. Internationalized labels received
from the resolver will always be in ACE form. IDNA-aware applications
MUST be able to work with both non-internationalized host name labels
(those that conform to [STD13] and [STD3]) and internationalized host
name labels.

6.3 Resolvers and DNS servers

An operating system might have a set of libraries for performing the
ToASCII operation. The input to such a library might be in one or more
charsets that are used in applications (UTF-8 and UTF-16 are likely
candidates for almost any operating system, and script-specific charsets
are likely for localized operating systems).

DNS servers MUST use the ACE format for internationalized host labels.
All internationalized names stored in DNS servers must be valid names
that have been processed with the ToASCII operation.

If a signalling system which makes negotiation possible between old and
new DNS clients and servers is standardized in the future, the encoding
of the query in the DNS protocol itself can be changed from ACE to
something else, such as UTF-8. The question whether or not this should
be used is, however, a separate problem and is not discussed in this
memo.
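
To illustrate the division of labor described in sections 6.2 and 6.3,
the following Python sketch converts a host name label by label before
it is placed in a generic host name slot, and converts it back for
display. It is an assumption-laden sketch: it relies on Python's
built-in "idna" codec, which implements the later standardized form of
these operations (RFC 3490, with the "xn--" prefix and Punycode), not
the unspecified prefix and [AMC-ACE-Z] encoding of this draft. The
function names are hypothetical.

```python
def name_for_resolver(host_name: str) -> str:
    """Apply a ToASCII-like conversion to each label of a host name."""
    labels = host_name.split(".")
    # Assumption: Python's "idna" codec (RFC 3490) stands in for ToASCII.
    return ".".join(label.encode("idna").decode("ascii") for label in labels)


def label_for_display(label: str) -> str:
    """Apply a ToUnicode-like conversion; keep the input on failure."""
    try:
        return label.encode("ascii").decode("idna")
    except UnicodeError:
        # Mirrors the rule that ToUnicode returns its input unaltered.
        return label


def name_for_display(host_name: str) -> str:
    return ".".join(label_for_display(l) for l in host_name.split("."))


ace_name = name_for_resolver("www.b\u00fccher.example")
print(ace_name)                                   # www.xn--bcher-kva.example
print(name_for_display(ace_name) == "www.b\u00fccher.example")  # True
```

The string returned by name_for_resolver contains only ASCII and is what
an application would pass to a resolver library call; the resolver and
the DNS servers never see the non-ASCII form.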

6.4 Avoiding exposing users to the raw ACE encoding

All applications that might show the user a host name that was received
from a gethostbyaddr or other such lookup SHOULD update as soon as
possible in order to prevent users from seeing the ACE. However, this is
not considered a big problem because so few applications show this type
of resolution to users.

If an application decodes an ACE name using ToUnicode but cannot show
all of the characters in the decoded name, such as if the name contains
characters that the output system cannot display, the application SHOULD
show the name in ACE format instead of displaying the name with the
replacement character (U+FFFD). This is to make it easier for the user
to transfer the name correctly to other programs. Programs that by
default show the ACE form when they cannot show all the characters in a
name label SHOULD also have a mechanism to show the name that is
produced by the ToUnicode operation with as many characters as possible
and replacement characters in the positions where characters cannot be
displayed. In many cases, the application doesn't know exactly what the
underlying rendering engine can or cannot display.

In addition to the condition above, if an application receives an ACE
host name after performing the ToUnicode operation, meaning that the
name was not properly prepared with ToASCII (for example, if it has
illegal characters in it), the application MUST show the name in ACE
format because the ToUnicode operation never fails, but returns the
original input if errors are detected at any step.

6.5 Bidirectional text in host names

The display of host names that contain bidirectional text is not covered
in this document. It may be covered in a future version of this
document, or may be covered in a different document.

For developers interested in displaying host names that have
bidirectional text, the Unicode standard has an extensive discussion of
how to reorder glyphs for display when dealing with
bidirectional text such as Arabic or Hebrew. See [UAX9] for more
information. In particular, all Unicode text is stored in logical order.

7. Name Server Considerations

Internationalized host name data in zone files (as specified by section
5 of RFC 1035) MUST be processed with ToASCII before it is entered in
the zone files.

It is imperative that there be only one ASCII encoding for a particular
host name. ACE is an encoding for host name labels that use non-ASCII
characters. Thus, a primary master name server MUST NOT contain an
ACE-encoded label that decodes to an ASCII label. The ToASCII operation
assures that no such names are ever output from the operation.

Name servers MUST NOT have any records with host names that contain
internationalized name labels unless those name labels have been prepared
with the ToASCII operation. If names that are not processed by ToASCII
are passed to an application, it will result in unpredictable behavior.
Note that [NAMEPREP] describes how to handle versioning of unallocated
codepoints.

8. Root Server Considerations

Because there are no changes to the DNS protocols, adopting this
protocol has no effect on the DNS root servers.

9. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.

This memo describes an algorithm which encodes characters that are not
valid according to STD3 and STD13 into octet values that are valid.
No
security issues such as string length increases or new allowed values
are introduced by the encoding process or the use of these encoded
values, apart from those introduced by the ACE encoding itself.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.

Because this document normatively refers to [NAMEPREP], it includes the
security considerations from that document as well.

A. References

[AMC-ACE-Z] Adam Costello, "AMC-ACE-Z version 0.3.1",
draft-ietf-idn-amc-ace-z.

[NAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[STD3] Bob Braden, "Requirements for Internet Hosts -- Communication
Layers" (RFC 1122) and "Requirements for Internet Hosts -- Application
and Support" (RFC 1123), STD 3, October 1989.

[STD13] Paul Mockapetris, "Domain names - concepts and facilities" (RFC
1034) and "Domain names - implementation and specification" (RFC 1035),
STD 13, November 1987.

[UAX9] Unicode Standard Annex #9, The Bidirectional Algorithm.
http://www.unicode.org/unicode/reports/tr9/

[UNICODE] The Unicode Standard, Version 3.1.0: The Unicode Consortium.
The Unicode Standard, Version 3.0. Reading, MA, Addison-Wesley
Developers Press, 2000. ISBN 0-201-61633-5, as amended by: Unicode
Standard Annex #27: Unicode 3.1.

B. Design philosophy

Many proposals for IDN protocols have required that DNS servers be
updated to handle internationalized host names.
Because of this, a
person who wanted to use an internationalized host name would have to be
sure that their request went to a DNS server that had been updated for
IDN. Further, that server could send queries only to other servers that
had been updated for IDN, because the queries contain new protocol
elements to differentiate IDN name labels from current host labels. In
addition, these proposals require that resolvers be updated to use the
new protocols, and in most cases the applications would need to be
updated as well.

These proposals would require changes to the application protocols that
use host names as protocol elements, because of the assumptions and
requirements made in those protocols about the characters that have
always been used for host names, and the encoding of those characters.
Other proposals for IDN protocols do not require changes to DNS servers
but still require changes to most application protocols to handle the
new names.

Updating all (or even a significant percentage) of the existing servers
in the world will be difficult, to say the least. Updating applications,
application gateways, and clients to handle changes to the application
protocols is also daunting. Because of this, we have designed a protocol
that requires no updating of any name servers. IDNA still requires the
updating of applications, but only for input and display of names, not
for changes to the protocols. Once users have updated the applications,
they can immediately start using internationalized host names. The cost
of implementing IDN may thus be much lower, and the speed of
implementation could be much higher.

C. Authors' Addresses

Patrik Faltstrom
Cisco Systems
Arstaangsvagen 31 J
S-117 43 Stockholm Sweden
paf@cisco.com

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org

Adam M. Costello
University of California, Berkeley
idna-spec.amc @ nicemice.net