draft-hoffman-rfc3490bis-02.txt                             P. Faltstrom
April 14, 2004                                                     Cisco
Expires in six months                                         P. Hoffman
                                                              IMC & VPNC
                                                             A. Costello
                                                             UC Berkeley

         Internationalizing Domain Names in Applications (IDNA)

Abstract

   Until now, there has been no standard method for domain names to use
   characters outside the ASCII repertoire. This document defines
   internationalized domain names (IDNs) and a mechanism called
   Internationalizing Domain Names in Applications (IDNA) for handling
   them in a standard fashion. IDNs use characters drawn from a large
   repertoire (Unicode), but IDNA allows the non-ASCII characters to be
   represented using only the ASCII characters already allowed in
   so-called host names today. This backward-compatible representation
   is required in existing protocols like DNS, so that IDNs can be
   introduced with no changes to the existing infrastructure. IDNA is
   only meant for processing domain names, not free text.

1. Introduction

   IDNA works by allowing applications to use certain ASCII name labels
   (beginning with a special prefix) to represent non-ASCII name labels.
   Lower-layer protocols need not be aware of this; therefore IDNA does
   not depend on changes to any infrastructure.
   In particular, IDNA does not depend on any changes to DNS servers,
   resolvers, or protocol elements, because the ASCII name service
   provided by the existing DNS is entirely sufficient for IDNA.

   This document does not require any applications to conform to IDNA,
   but applications can elect to use IDNA in order to support IDN while
   maintaining interoperability with existing infrastructure. If an
   application wants to use non-ASCII characters in domain names, IDNA
   is the only currently-defined option. Adding IDNA support to an
   existing application entails changes to the application only, and
   leaves room for flexibility in the user interface.

   A great deal of the discussion of IDN solutions has focused on
   transition issues and how IDN will work in a world where not all of
   the components have been updated. Proposals that were not chosen by
   the IDN Working Group would depend on user applications, resolvers,
   and DNS servers being updated in order for a user to use an
   internationalized domain name. Rather than rely on widespread
   updating of all components, IDNA depends on updates to user
   applications only; no changes are needed to the DNS protocol or any
   DNS servers or the resolvers on users' computers.

   The IESG issued a statement on IDNA [IESG-STATEMENT].

1.1 Problem Statement

   The IDNA specification solves the problem of extending the repertoire
   of characters that can be used in domain names to include the Unicode
   repertoire (with some restrictions).

   IDNA does not extend the service offered by DNS to the applications.
   Instead, the applications (and, by implication, the users) continue
   to see an exact-match lookup service. Either there is a single
   exactly-matching name or there is no match.
   This model has served the existing applications well, but it
   requires, with or without internationalized domain names, that users
   know the exact spelling of the domain names that the users type into
   applications such as web browsers and mail user agents. The
   introduction of the larger repertoire of characters potentially makes
   the set of misspellings larger, especially given that in some cases
   the same appearance, for example on a business card, might visually
   match several Unicode code points or several sequences of code
   points.

   IDNA allows the graceful introduction of IDNs not only by avoiding
   upgrades to existing infrastructure (such as DNS servers and mail
   transport agents), but also by allowing some rudimentary use of IDNs
   in applications by using the ASCII representation of the non-ASCII
   name labels. While such names are very user-unfriendly to read and
   type, and hence are not suitable for user input, they allow (for
   instance) replying to email and clicking on URLs even though the
   domain name displayed is incomprehensible to the user. In order to
   allow user-friendly input and output of the IDNs, the applications
   need to be modified to conform to this specification.

   IDNA uses the Unicode character repertoire, which avoids the
   significant delays that would be inherent in waiting for a different
   and specific character set to be defined for IDN purposes by some
   other standards developing organization.

1.2 Limitations of IDNA

   The IDNA protocol does not solve all linguistic issues with users
   inputting names in different scripts. Many important language-based
   and script-based mappings are not covered in IDNA and need to be
   handled outside the protocol. For example, names that are entered in
   a mix of traditional and simplified Chinese characters will not be
   mapped to a single canonical name.
   Another example is that Scandinavian names entered with U+00F6
   (LATIN SMALL LETTER O WITH DIAERESIS) will not be mapped to U+00F8
   (LATIN SMALL LETTER O WITH STROKE).

   An example of an important issue that is not considered in detail in
   IDNA is how to provide a high probability that a user who is entering
   a domain name based on visual information (such as from a business
   card or billboard) or aural information (such as from a telephone or
   radio) would correctly enter the IDN. Similar issues exist for ASCII
   domain names, for example the possible visual confusion between the
   letter 'O' and the digit zero, but the introduction of the larger
   repertoire of characters creates more opportunities for
   similar-looking and similar-sounding names. Note that this is a
   complex issue relating to languages, input methods on computers, and
   so on. Furthermore, the kind of matching and searching necessary for
   a high probability of success would not fit the role of the DNS and
   its exact matching function.

1.3 Brief overview for application developers

   Applications can use IDNA to support internationalized domain names
   anywhere that ASCII domain names are already supported, including DNS
   master files and resolver interfaces. (Applications can also define
   protocols and interfaces that support IDNs directly using non-ASCII
   representations. IDNA does not prescribe any particular
   representation for new protocols, but it still defines which names
   are valid and how they are compared.)

   The IDNA protocol is contained completely within applications. It is
   not a client-server or peer-to-peer protocol: everything is done
   inside the application itself. When used with a DNS resolver
   library, IDNA is inserted as a "shim" between the application and the
   resolver library. When used for writing names into a DNS zone, IDNA
   is used just before the name is committed to the zone.
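
   The "shim" arrangement can be sketched in Python. This is only an
   illustration under stated assumptions: it uses the ToASCII function
   from Python's standard encodings.idna module, which implements
   RFC 3490 rather than this draft and assumes AllowUnassigned is set
   and UseSTD3ASCIIRules is unset. The names domain_to_ascii and
   resolve_idn, and the injected resolver callable, are hypothetical.

```python
import re
from encodings.idna import ToASCII

# The four characters that section 3.1 requires to be recognized as dots.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def domain_to_ascii(name: str) -> str:
    """Convert a whole domain name to its ASCII form, label by label."""
    labels = DOTS.split(name)
    # A trailing separator denotes the root label, which is not a text label.
    trailing_dot = len(labels) > 1 and labels[-1] == ""
    if trailing_dot:
        labels = labels[:-1]
    # ToASCII raises UnicodeError if any label cannot be converted.
    ascii_labels = [ToASCII(label).decode("ascii") for label in labels]
    return ".".join(ascii_labels) + ("." if trailing_dot else "")


def resolve_idn(name, resolver):
    """Hypothetical shim: hand only the ASCII form to the resolver."""
    return resolver(domain_to_ascii(name))
```

   In a real application the resolver argument might wrap a call such
   as socket.getaddrinfo; it is passed in here so that the sketch does
   not depend on network access.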

   There are two operations described in section 4 of this document:

   - The ToASCII operation is used before sending an IDN to something
     that expects ASCII names (such as a resolver) or writing an IDN
     into a place that expects ASCII names (such as a DNS master file).

   - The ToUnicode operation is used when displaying names to users,
     for example names obtained from a DNS zone.

   It is important to note that the ToASCII operation can fail. If it
   fails when processing a domain name, that domain name cannot be used
   as an internationalized domain name and the application has to have
   some method of dealing with this failure.

   IDNA requires that implementations process input strings with
   Nameprep [NAMEPREP], which is a profile of Stringprep [STRINGPREP],
   and then with Punycode [PUNYCODE]. Implementations of IDNA MUST
   fully implement Nameprep and Punycode; neither Nameprep nor Punycode
   is optional.

2. Terminology

   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
   and "MAY" in this document are to be interpreted as described in BCP
   14, RFC 2119 [RFC2119].

   A code point is an integer value associated with a character in a
   coded character set.

   Unicode [UNICODE] is a coded character set containing tens of
   thousands of characters. A single Unicode code point is denoted by
   "U+" followed by four to six hexadecimal digits, while a range of
   Unicode code points is denoted by two hexadecimal numbers separated
   by "..", with no prefixes.

   ASCII means US-ASCII [USASCII], a coded character set containing 128
   characters associated with code points in the range 0..7F. Unicode
   is an extension of ASCII: it includes all the ASCII characters and
   associates them with the same code points.

   The term "LDH code points" is defined in this document to mean the
   code points associated with ASCII letters, digits, and the
   hyphen-minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is
   an abbreviation for "letters, digits, hyphen".

   [STD13] talks about "domain names" and "host names", but many people
   use the terms interchangeably. Further, because [STD13] was not
   terribly clear, many people who are sure they know the exact
   definitions of each of these terms disagree on the definitions. In
   this document the term "domain name" is used in general. This
   document explicitly cites [STD3] whenever referring to the host name
   syntax restrictions defined therein.

   A label is an individual part of a domain name. Labels are usually
   shown separated by dots; for example, the domain name
   "www.example.com" is composed of three labels: "www", "example", and
   "com". (The zero-length root label described in [STD13], which can
   be explicit as in "www.example.com." or implicit as in
   "www.example.com", is not considered a label in this specification.)
   IDNA extends the set of usable characters in labels that are text.
   For the rest of this document, the term "label" is shorthand for
   "text label", and "every label" means "every text label".

   An "internationalized label" is a label to which the ToASCII
   operation (see section 4) can be applied without failing (with the
   UseSTD3ASCIIRules flag unset). This implies that every ASCII label
   that satisfies the [STD13] length restriction is an internationalized
   label. Therefore the term "internationalized label" is a
   generalization, embracing both old ASCII labels and new non-ASCII
   labels. Although most Unicode characters can appear in
   internationalized labels, ToASCII will fail for some input strings,
   and such strings are not valid internationalized labels.
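
   The definition of an internationalized label can be read as a simple
   predicate. The sketch below is illustrative only: it assumes the
   ToASCII function from Python's standard encodings.idna module (an
   RFC 3490 implementation in which UseSTD3ASCIIRules is unset and
   AllowUnassigned is set), and the function name
   is_internationalized_label is hypothetical.

```python
from encodings.idna import ToASCII


def is_internationalized_label(label: str) -> bool:
    """Return True if ToASCII can be applied to the label without failing."""
    try:
        ToASCII(label)
    except UnicodeError:
        # Covers Nameprep errors, Punycode errors, and the 1..63 length check.
        return False
    return True
```

   Under these assumptions an ASCII label of 1 to 63 characters is
   accepted, while an empty label or a 64-character ASCII label is
   rejected by the length check.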

   An "internationalized domain name" (IDN) is a domain name in which
   every label is an internationalized label. This implies that every
   ASCII domain name is an IDN (which implies that it is possible for a
   name to be an IDN without it containing any non-ASCII characters).
   This document does not attempt to define an "internationalized host
   name". Just as has been the case with ASCII names, some DNS zone
   administrators may impose restrictions, beyond those imposed by DNS
   or IDNA, on the characters or strings that may be registered as
   labels in their zones. Such restrictions have no impact on the
   syntax or semantics of DNS protocol messages; a query for a name that
   matches no records will yield the same response regardless of the
   reason why it is not in the zone. Clients issuing queries or
   interpreting responses cannot be assumed to have any knowledge of
   zone-specific restrictions or conventions.

   In IDNA, equivalence of labels is defined in terms of the ToASCII
   operation, which constructs an ASCII form for a given label, whether
   or not the label was already an ASCII label. Labels are defined to
   be equivalent if and only if their ASCII forms produced by ToASCII
   match using a case-insensitive ASCII comparison. ASCII labels
   already have a notion of equivalence: upper case and lower case are
   considered equivalent. The IDNA notion of equivalence is an
   extension of that older notion. Equivalent labels in IDNA are
   treated as alternate forms of the same label, just as "foo" and "Foo"
   are treated as alternate forms of the same label.

   To allow internationalized labels to be handled by existing
   applications, IDNA uses an "ACE label" (ACE stands for ASCII
   Compatible Encoding). An ACE label is an internationalized label
   that can be rendered in ASCII and is equivalent to an
   internationalized label that cannot be rendered in ASCII.
   Given any internationalized label that cannot be rendered in ASCII,
   the ToASCII operation will convert it to an equivalent ACE label
   (whereas an ASCII label will be left unaltered by ToASCII). ACE
   labels are unsuitable for display to users. The ToUnicode operation
   will convert any label to an equivalent non-ACE label. In fact, an
   ACE label is formally defined to be any label that the ToUnicode
   operation would alter (whereas non-ACE labels are left unaltered by
   ToUnicode). Every ACE label begins with the ACE prefix specified in
   section 5. The ToASCII and ToUnicode operations are specified in
   section 4.

   The "ACE prefix" is defined in this document to be a string of ASCII
   characters that appears at the beginning of every ACE label. It is
   specified in section 5.

   A "domain name slot" is defined in this document to be a protocol
   element or a function argument or a return value (and so on)
   explicitly designated for carrying a domain name. Examples of domain
   name slots include: the QNAME field of a DNS query; the name argument
   of the gethostbyname() library function; the part of an email address
   following the at-sign (@) in the From: field of an email message
   header; and the host portion of the URI in the src attribute of an
   HTML <IMG> tag. General text that just happens to contain a domain
   name is not a domain name slot; for example, a domain name appearing
   in the plain text body of an email message is not occupying a domain
   name slot.

   An "IDN-aware domain name slot" is defined in this document to be a
   domain name slot explicitly designated for carrying an
   internationalized domain name as defined in this document. The
   designation may be static (for example, in the specification of the
   protocol or interface) or dynamic (for example, as a result of
   negotiation in an interactive session).
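
   Returning to the formal definition of an ACE label given earlier in
   this section (any label that the ToUnicode operation would alter), a
   minimal sketch follows. It assumes the ToUnicode function from
   Python's standard encodings.idna module, an RFC 3490 implementation
   that raises an exception in cases where the specification says the
   input is returned unaltered; the wrapper compensates for that. The
   names to_unicode_never_failing and is_ace_label are hypothetical.

```python
from encodings.idna import ToUnicode


def to_unicode_never_failing(label: str) -> str:
    """ToUnicode as specified: on any failure, return the input unaltered."""
    try:
        return ToUnicode(label)
    except UnicodeError:
        # Python's implementation raises where the specification says the
        # original sequence is returned.
        return label


def is_ace_label(label: str) -> bool:
    """An ACE label is any label that ToUnicode would alter."""
    return to_unicode_never_failing(label) != label
```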

   An "IDN-unaware domain name slot" is defined in this document to be
   any domain name slot that is not an IDN-aware domain name slot.
   Obviously, this includes any domain name slot whose specification
   predates IDNA.

3. Requirements and applicability

3.1 Requirements

   IDNA conformance means adherence to the following four requirements:

   1) Whenever dots are used as label separators, the following
      characters MUST be recognized as dots: U+002E (full stop), U+3002
      (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61
      (halfwidth ideographic full stop).

   2) Whenever a domain name is put into an IDN-unaware domain name slot
      (see section 2), it MUST contain only ASCII characters. Given an
      internationalized domain name (IDN), an equivalent domain name
      satisfying this requirement can be obtained by applying the
      ToASCII operation (see section 4) to each label and, if dots are
      used as label separators, changing all the label separators to
      U+002E.

   3) ACE labels obtained from domain name slots SHOULD be hidden from
      users when it is known that the environment can handle the non-ACE
      form, except when the ACE form is explicitly requested. When it
      is not known whether or not the environment can handle the non-ACE
      form, the application MAY use the non-ACE form (which might fail,
      such as by not being displayed properly), or it MAY use the ACE
      form (which will look unintelligible to the user). Given an
      internationalized domain name, an equivalent domain name
      containing no ACE labels can be obtained by applying the ToUnicode
      operation (see section 4) to each label. When requirements 2 and
      3 both apply, requirement 2 takes precedence.

   4) Whenever two labels are compared, they MUST be considered to match
      if and only if they are equivalent, that is, their ASCII forms
      (obtained by applying ToASCII) match using a case-insensitive
      ASCII comparison.
      Whenever two names are compared, they MUST be considered to match
      if and only if their corresponding labels match, regardless of
      whether the names use the same forms of label separators.

3.2 Applicability

   IDNA is applicable to all domain names in all domain name slots
   except where it is explicitly excluded.

   This implies that IDNA is applicable to many protocols that predate
   IDNA. Note that IDNs occupying domain name slots in those protocols
   MUST be in ASCII form (see section 3.1, requirement 2).

3.2.1. DNS resource records

   IDNA does not apply to domain names in the NAME and RDATA fields of
   DNS resource records whose CLASS is not IN. This exclusion applies
   to every non-IN class, present and future, except where future
   standards override this exclusion by explicitly inviting the use of
   IDNA.

   There are currently no other exclusions on the applicability of IDNA
   to DNS resource records; it depends entirely on the CLASS, and not on
   the TYPE. This will remain true, even as new types are defined,
   unless there is a compelling reason for a new type to complicate
   matters by imposing type-specific rules.

3.2.2. Non-domain-name data types stored in domain names

   Although IDNA enables the representation of non-ASCII characters in
   domain names, that does not imply that IDNA enables the
   representation of non-ASCII characters in other data types that are
   stored in domain names. For example, an email address local part is
   sometimes stored in a domain label (hostmaster@example.com would be
   represented as hostmaster.example.com in the RDATA field of an SOA
   record). IDNA does not update the existing email standards, which
   allow only ASCII characters in local parts.
   Therefore, unless the email standards are revised to invite the use
   of IDNA for local parts, a domain label that holds the local part of
   an email address SHOULD NOT begin with the ACE prefix, and even if it
   does, it is to be interpreted literally as a local part that happens
   to begin with the ACE prefix.

4. Conversion operations

   An application converts a domain name put into an IDN-unaware slot or
   displayed to a user. This section specifies the steps to perform in
   the conversion, and the ToASCII and ToUnicode operations.

   The input to ToASCII or ToUnicode is a single label that is a
   sequence of Unicode code points (remember that all ASCII code points
   are also Unicode code points). If a domain name is represented using
   a character set other than Unicode or US-ASCII, it will first need to
   be transcoded to Unicode.

   Starting from a whole domain name, the steps that an application
   takes to do the conversions are:

   1) Decide whether the domain name is a "stored string" or a "query
      string" as described in [STRINGPREP]. If this conversion follows
      the "queries" rule from [STRINGPREP], set the flag called
      "AllowUnassigned".

   2) Split the domain name into individual labels as described in
      section 3.1. The labels do not include the separator.

   3) For each label, decide whether or not to enforce the restrictions
      on ASCII characters in host names [STD3]. (Applications already
      faced this choice before the introduction of IDNA, and can
      continue to make the decision the same way they always have; IDNA
      makes no new recommendations regarding this choice.) If the
      restrictions are to be enforced, set the flag called
      "UseSTD3ASCIIRules" for that label.

   4) Process each label with either the ToASCII or the ToUnicode
      operation as appropriate.
      Typically, you use the ToASCII operation if you are about to put
      the name into an IDN-unaware slot, and you use the ToUnicode
      operation if you are displaying the name to a user; section 3.1
      gives greater detail on the applicable requirements.

   5) If ToASCII was applied in step 4 and dots are used as label
      separators, change all the label separators to U+002E (full stop).

   The following two subsections define the ToASCII and ToUnicode
   operations that are used in step 4.

   This description of the protocol uses specific procedure names, names
   of flags, and so on, in order to facilitate the specification of the
   protocol. These names, as well as the actual steps of the
   procedures, are not required of an implementation. In fact, any
   implementation which has the same external behavior as specified in
   this document conforms to this specification.

4.1 ToASCII

   The ToASCII operation takes a sequence of Unicode code points that
   make up one label and transforms it into a sequence of code points in
   the ASCII range (0..7F). If ToASCII succeeds, the original sequence
   and the resulting sequence are equivalent labels.

   It is important to note that the ToASCII operation can fail. ToASCII
   fails if any step of it fails. If any step of the ToASCII operation
   fails on any label in a domain name, that domain name MUST NOT be
   used as an internationalized domain name. The method for dealing
   with this failure is application-specific.

   The inputs to ToASCII are a sequence of code points, the
   AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output of
   ToASCII is either a sequence of ASCII code points or a failure
   condition.

   ToASCII never alters a sequence of code points that are all in the
   ASCII range to begin with (although it could fail). Applying the
   ToASCII operation multiple times has exactly the same effect as
   applying it just once.
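
   These properties of ToASCII can be observed with an existing
   implementation. The sketch below is illustrative only and assumes
   the ToASCII function from Python's standard encodings.idna module,
   which implements RFC 3490 with UseSTD3ASCIIRules unset and signals
   the failure condition by raising UnicodeError. The helper name
   to_ascii is hypothetical.

```python
from encodings.idna import ToASCII


def to_ascii(label: str) -> str:
    """Apply ToASCII and return the result as a string of ASCII code points."""
    return ToASCII(label).decode("ascii")


once = to_ascii("b\u00FCcher")   # U+00FC is LATIN SMALL LETTER U WITH DIAERESIS
twice = to_ascii(once)           # applying ToASCII again changes nothing
unchanged = to_ascii("Example")  # all-ASCII input is returned unaltered

try:
    to_ascii("")                 # an empty label fails the length check
    empty_label_failed = False
except UnicodeError:
    empty_label_failed = True
```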

   ToASCII consists of the following steps:

   1. If the sequence contains any code points outside the ASCII range
      (0..7F) then proceed to step 2, otherwise skip to step 3.

   2. Perform the steps specified in [NAMEPREP] and fail if there is an
      error. The AllowUnassigned flag is used in [NAMEPREP].

   3. If the UseSTD3ASCIIRules flag is set, then perform these checks:

      (a) Verify the absence of non-LDH ASCII code points; that is, the
          absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.

      (b) Verify the absence of leading and trailing hyphen-minus; that
          is, the absence of U+002D at the beginning and end of the
          sequence.

   4. If the sequence contains any code points outside the ASCII range
      (0..7F) then proceed to step 5, otherwise skip to step 8.

   5. Verify that the sequence does NOT begin with the ACE prefix.

   6. Encode the sequence using the encoding algorithm in [PUNYCODE] and
      fail if there is an error.

   7. Prepend the ACE prefix.

   8. Verify that the number of code points is in the range 1 to 63
      inclusive (0 is excluded).

4.2 ToUnicode

   The ToUnicode operation takes a sequence of Unicode code points that
   make up one label and returns a sequence of Unicode code points. If
   the input sequence is a label in ACE form, then the result is an
   equivalent internationalized label that is not in ACE form, otherwise
   the original sequence is returned unaltered.

   ToUnicode never fails. If any step fails, then the original input
   sequence is returned immediately in that step.

   The Punycode decoder can never output more code points than it
   inputs, but Nameprep can, and therefore ToUnicode can.
   Note that the number of octets needed to represent a sequence of code
   points depends on the particular character encoding used.

   The inputs to ToUnicode are a sequence of code points, the
   AllowUnassigned flag, and the UseSTD3ASCIIRules flag.
   The output of ToUnicode is always a sequence of Unicode code points.

   ToUnicode consists of the following steps:

   1. If the sequence contains any code points outside the ASCII range
      (0..7F) then proceed to step 2, otherwise skip to step 3.

   2. Perform the steps specified in [NAMEPREP] and fail if there is an
      error. (If step 3 of ToASCII is also performed here, it will not
      affect the overall behavior of ToUnicode, but it is not
      necessary.) The AllowUnassigned flag is used in [NAMEPREP].

   3. Verify that the sequence begins with the ACE prefix, and save a
      copy of the sequence.

   4. Remove the ACE prefix.

   5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
      fail if there is an error. Save a copy of the result of this
      step.

   6. Apply ToASCII.

   7. Verify that the result of step 6 matches the saved copy from step
      3, using a case-insensitive ASCII comparison.

   8. Return the saved copy from step 5.

5. ACE prefix

   The ACE prefix, used in the conversion operations (section 4), is two
   alphanumeric ASCII characters followed by two hyphen-minuses. It
   cannot be any of the prefixes already used in earlier documents,
   which includes the following: "bl--", "bq--", "dq--", "lq--", "mq--",
   "ra--", "wq--" and "zq--". The ToASCII and ToUnicode operations MUST
   recognize the ACE prefix in a case-insensitive manner.

   The ACE prefix for IDNA is "xn--" or any capitalization thereof.

   This means that an ACE label might be "xn--de-jg4avhby1noc0d", where
   "de-jg4avhby1noc0d" is the part of the ACE label that is generated by
   the encoding steps in [PUNYCODE].

   While all ACE labels begin with the ACE prefix, not all labels
   beginning with the ACE prefix are necessarily ACE labels. Non-ACE
   labels that begin with the ACE prefix will confuse users and SHOULD
   NOT be allowed in DNS zones.

6. Implications for typical applications using DNS

   In IDNA, applications perform the processing needed to input
   internationalized domain names from users, display internationalized
   domain names to users, and process the inputs and outputs from DNS
   and other protocols that carry domain names.

   The components and interfaces between them can be represented
   pictorially as:

                    +------+
                    | User |
                    +------+
                       ^
                       | Input and display: local interface methods
                       | (pen, keyboard, glowing phosphorus, ...)
   +-------------------|-------------------------------+
   |                   v                               |
   |          +-----------------------------+          |
   |          |         Application         |          |
   |          |   (ToASCII and ToUnicode    |          |
   |          |      operations may be      |          |
   |          |        called here)         |          |
   |          +-----------------------------+          |
   |                   ^        ^                      | End system
   |                   |        |                      |
   | Call to resolver: |        | Application-specific |
   |              ACE  |        | protocol:            |
   |                   v        | ACE unless the       |
   |           +----------+     | protocol is updated  |
   |           | Resolver |     | to handle other      |
   |           +----------+     | encodings            |
   |                 ^          |                      |
   +-----------------|----------|----------------------+
       DNS protocol: |          |
                 ACE |          |
                     v          v
          +-------------+    +---------------------+
          | DNS servers |    | Application servers |
          +-------------+    +---------------------+

   The box labeled "Application" is where the application splits a
   domain name into labels, sets the appropriate flags, and performs the
   ToASCII and ToUnicode operations. This is described in section 4.

6.1 Entry and display in applications

   Applications can accept domain names using any character set or sets
   desired by the application developer, and can display domain names in
   any charset. That is, the IDNA protocol does not affect the
   interface between users and applications.
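
   As an illustration of entry and display, the sketch below accepts a
   name in an arbitrary charset, transcodes it to Unicode, and converts
   it label by label; the reverse direction prepares a name for
   display. It is a sketch under stated assumptions, not a definitive
   implementation: it uses ToASCII and ToUnicode from Python's standard
   encodings.idna module (an RFC 3490 implementation), and the names
   accept_domain, display_label, and display_domain are hypothetical.

```python
import re
from encodings.idna import ToASCII, ToUnicode

# The four label separators listed in section 3.1.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def accept_domain(raw: bytes, charset: str) -> str:
    """Transcode user input to Unicode, then apply ToASCII to each label."""
    name = raw.decode(charset)
    return ".".join(ToASCII(label).decode("ascii") for label in DOTS.split(name))


def display_label(label: str) -> str:
    """Apply ToUnicode; if it cannot convert, show the label unaltered."""
    try:
        return ToUnicode(label)
    except UnicodeError:
        return label


def display_domain(ascii_name: str) -> str:
    """Convert each label of an ASCII-form name for display to the user."""
    return ".".join(display_label(label) for label in ascii_name.split("."))
```

   For example, accept_domain(b"b\xfccher.example", "iso-8859-1") gives
   "xn--bcher-kva.example", and display_domain applied to that result
   recovers the name with U+00FC in its first label.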

   An IDNA-aware application can accept and display internationalized
   domain names in two formats: the internationalized character set(s)
   supported by the application, and as an ACE label. ACE labels that
   are displayed or input MUST always include the ACE prefix.
   Applications MAY allow input and display of ACE labels, but are not
   encouraged to do so except as an interface for special purposes,
   possibly for debugging, or to cope with display limitations as
   described in section 6.4. ACE encoding is opaque and ugly, and
   should thus only be exposed to users who absolutely need it. Because
   name labels encoded as ACE name labels can be rendered either as the
   encoded ASCII characters or the proper decoded characters, the
   application MAY have an option for the user to select the preferred
   method of display; if it does, rendering the ACE SHOULD NOT be the
   default.

   Domain names are often stored and transported in many places. For
   example, they are part of documents such as mail messages and web
   pages. They are transported in many parts of many protocols, such as
   both the control commands and the RFC 2822 body parts of SMTP, and
   the headers and the body content in HTTP. It is important to
   remember that domain names appear both in domain name slots and in
   the content that is passed over protocols.

   In protocols and document formats that define how to handle
   specification or negotiation of charsets, labels can be encoded in
   any charset allowed by the protocol or document format. If a
   protocol or document format only allows one charset, the labels MUST
   be given in that charset.

   In any place where a protocol or document format allows transmission
   of the characters in internationalized labels, internationalized
   labels SHOULD be transmitted using whatever character encoding and
   escape mechanism that the protocol or document format uses at that
   place.
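
   One way an application might offer the display option described in
   this section is sketched below in Python. The helper name
   display_label and its prefer_ace option are hypothetical. ToUnicode
   is taken from the standard-library module encodings.idna, which
   raises an error for a label that is not a valid ACE label; the
   sketch shows such a label as received.

```python
from encodings import idna  # standard-library ToASCII / ToUnicode


def display_label(label: str, prefer_ace: bool = False) -> str:
    """Hypothetical helper: choose the form of a label shown to the user."""
    if prefer_ace:
        # The user explicitly selected the ACE form; it is not the default.
        return label
    try:
        return idna.ToUnicode(label)
    except UnicodeError:
        # Not a valid ACE label: show it exactly as it was received.
        return label


print(display_label("xn--bcher-kva"))                   # bücher
print(display_label("xn--bcher-kva", prefer_ace=True))  # xn--bcher-kva
```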

   All protocols that use domain name slots already have the capacity
   for handling domain names in the ASCII charset. Thus, ACE labels
   (internationalized labels that have been processed with the ToASCII
   operation) can inherently be handled by those protocols.

   Displaying internationalized characters can be tricky for
   applications regardless of whether the characters appear in free
   text, in domain names, or in other protocol elements. The Unicode
   standard encompasses many types of text that can cause display
   problems, such as formatting characters, characters that combine with
   one or more surrounding characters, characters whose direction of
   display can change, strings whose logical order cannot be uniquely
   inferred from their display order, and so on. IDNA requires the use
   of Nameprep, which mitigates some of these issues, both in individual
   domain labels and to a lesser extent in full domain names, but does
   not eliminate all the issues (and does nothing to mitigate them in
   text outside of domain names).

6.2 Applications and resolver libraries

   Applications normally use functions in the operating system when they
   resolve DNS queries. Those functions in the operating system are
   often called "the resolver library", and the applications communicate
   with the resolver libraries through a programming interface (API).

   Because these resolver libraries today expect only domain names in
   ASCII, applications MUST prepare labels that are passed to the
   resolver library using the ToASCII operation. Labels received from
   the resolver library contain only ASCII characters; internationalized
   labels that cannot be represented directly in ASCII use the ACE form.
   ACE labels always include the ACE prefix.

   An operating system might have a set of libraries for performing the
   ToASCII operation.
   The input to such a library might be in one or more charsets that
   are used in applications (UTF-8 and UTF-16 are likely candidates for
   almost any operating system, and script-specific charsets are likely
   for localized operating systems).

   IDNA-aware applications MUST be able to work with both
   non-internationalized labels (those that conform to [STD13] and
   [STD3]) and internationalized labels.

   It is expected that new versions of the resolver libraries in the
   future will be able to accept domain names in charsets other than
   ASCII, and application developers might one day pass domain names
   not only in Unicode but also in local script to a new API for the
   resolver libraries in the operating system. Thus the ToASCII and
   ToUnicode operations might be performed inside these new versions of
   the resolver libraries.

   Domain names passed to resolvers or put into the question section of
   DNS requests follow the rules for "queries" from [STRINGPREP].

6.3 DNS servers

   Domain names stored in zones follow the rules for "stored strings"
   from [STRINGPREP].

   For internationalized labels that cannot be represented directly in
   ASCII, DNS servers MUST use the ACE form produced by the ToASCII
   operation. All IDNs served by DNS servers MUST contain only ASCII
   characters.

   If a signaling system that makes negotiation possible between old
   and new DNS clients and servers is standardized in the future, the
   encoding of the query in the DNS protocol itself can be changed from
   ACE to something else, such as UTF-8. The question of whether this
   should be used is, however, a separate problem and is not discussed
   in this memo.
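
   A zone-loading tool could check labels against the rules in this
   section and in section 5 along the following lines. This Python
   sketch is illustrative only: the names is_valid_ace_label and
   zone_label_problem are hypothetical, Punycode decoding uses Python's
   built-in "punycode" codec, and ToASCII comes from the
   standard-library module encodings.idna. A label that begins with
   the ACE prefix is accepted only if decoding it and re-applying
   ToASCII reproduces the label under a case-insensitive ASCII
   comparison, mirroring steps 3 through 7 of ToUnicode.

```python
from encodings import idna  # standard-library ToASCII / ToUnicode

ACE_PREFIX = "xn--"


def is_valid_ace_label(label: str) -> bool:
    """Hypothetical check: does an ACE-prefixed label round-trip?"""
    # The ACE prefix is recognized in a case-insensitive manner.
    if not label.isascii() or not label.lower().startswith(ACE_PREFIX):
        return False
    try:
        # Remove the prefix and decode the remainder with Punycode.
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        # Re-apply ToASCII and compare with the original, ignoring case.
        return idna.ToASCII(decoded).decode("ascii").lower() == label.lower()
    except UnicodeError:
        return False


def zone_label_problem(label: str):
    """Return a short description of the problem, or None if acceptable."""
    if not label.isascii():
        return "non-ASCII label; store the ToASCII form instead"
    if label.lower().startswith(ACE_PREFIX) and not is_valid_ace_label(label):
        return "begins with the ACE prefix but is not a valid ACE label"
    return None


print(zone_label_problem("xn--bcher-kva"))  # None
print(zone_label_problem("xn--abc-"))       # reported as not a valid ACE label
```

   The second label decodes to the plain ASCII string "abc", whose
   ToASCII form carries no prefix, so the round trip fails and the
   label is reported.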

6.4 Avoiding exposing users to the raw ACE encoding

   Any application that might show the user a domain name obtained from
   a domain name slot, such as from gethostbyaddr or part of a mail
   header, will need to be updated if it is to prevent users from seeing
   the ACE.

   If an application decodes an ACE name using ToUnicode but cannot show
   all of the characters in the decoded name, such as if the name
   contains characters that the output system cannot display, the
   application SHOULD show the name in ACE format (which always includes
   the ACE prefix) instead of displaying the name with the replacement
   character (U+FFFD). This is to make it easier for the user to
   transfer the name correctly to other programs. Programs that by
   default show the ACE form when they cannot show all the characters in
   a name label SHOULD also have a mechanism to show the name that is
   produced by the ToUnicode operation with as many characters as
   possible and replacement characters in the positions where characters
   cannot be displayed.

   The ToUnicode operation does not alter labels that are not valid ACE
   labels, even if they begin with the ACE prefix. After ToUnicode has
   been applied, if a label still begins with the ACE prefix, then it is
   not a valid ACE label, and is not equivalent to any of the
   intermediate Unicode strings constructed by ToUnicode.

6.5 DNSSEC authentication of IDN domain names

   DNS Security [RFC2535] is a method for supplying cryptographic
   verification information along with DNS messages. Public Key
   Cryptography is used in conjunction with digital signatures to
   provide a means for a requester of domain information to authenticate
   the source of the data. This ensures that it can be traced back to a
   trusted source, either directly, or via a chain of trust linking the
   source of the information to the top of the DNS hierarchy.

   IDNA specifies that all internationalized domain names served by DNS
   servers that cannot be represented directly in ASCII must use the ACE
   form produced by the ToASCII operation. This operation must be
   performed prior to a zone being signed by the private key for that
   zone. Because of this ordering, it is important to recognize that
   DNSSEC authenticates the ASCII domain name, not the Unicode form or
   the mapping between the Unicode form and the ASCII form. In the
   presence of DNSSEC, this is the name that MUST be signed in the zone
   and MUST be validated against.

   One consequence of this for sites deploying IDNA in the presence of
   DNSSEC is that any special purpose proxies or forwarders used to
   transform user input into IDNs must be earlier in the resolution flow
   than DNSSEC authenticating nameservers for DNSSEC to work.

7. Name server considerations

   Existing DNS servers do not know the IDNA rules for handling
   non-ASCII forms of IDNs, and therefore need to be shielded from them.
   All existing channels through which names can enter a DNS server
   database (for example, master files [STD13] and DNS update messages
   [RFC2136]) are IDN-unaware because they predate IDNA, and therefore
   requirement 2 of section 3.1 of this document provides the needed
   shielding, by ensuring that internationalized domain names entering
   DNS server databases through such channels have already been
   converted to their equivalent ASCII forms.

   It is imperative that there be only one ASCII encoding for a
   particular domain name. Because of the design of the ToASCII and
   ToUnicode operations, there are no ACE labels that decode to ASCII
   labels, and therefore name servers cannot contain multiple ASCII
   encodings of the same domain name.

   [RFC2181] explicitly allows domain labels to contain octets beyond
   the ASCII range (0..7F), and this document does not change that.
   Note, however, that there is no defined interpretation of octets
   80..FF as characters. If labels containing these octets are returned
   to applications, unpredictable behavior could result. The ASCII form
   defined by ToASCII is the only standard representation for
   internationalized labels in the current DNS protocol.

8. Root server considerations

   IDNs are likely to be somewhat longer than current domain names, so
   the bandwidth needed by the root servers is likely to go up by a
   small amount. Also, queries and responses for IDNs will probably be
   somewhat longer than typical queries today, so more queries and
   responses may be forced to go to TCP instead of UDP.

9. References

9.1 Normative References

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

   [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
                Internationalized Strings ("stringprep")",
                draft-hoffman-rfc3454bis.

   [NAMEPREP]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
                Profile for Internationalized Domain Names (IDN)",
                draft-hoffman-rfc3491bis.

   [PUNYCODE]   Costello, A., "Punycode: A Bootstring encoding of
                Unicode for use with Internationalized Domain Names in
                Applications (IDNA)", draft-costello-rfc3492bis.

   [STD3]       Braden, R., "Requirements for Internet Hosts --
                Communication Layers", STD 3, RFC 1122, and
                "Requirements for Internet Hosts -- Application and
                Support", STD 3, RFC 1123, October 1989.

   [STD13]      Mockapetris, P., "Domain names - concepts and
                facilities", STD 13, RFC 1034 and "Domain names -
                implementation and specification", STD 13, RFC 1035,
                November 1987.

9.2 Informative References

   [IESG-STATEMENT] "IESG Statement on IDN", February 2003, .

   [RFC2535]    Eastlake, D., "Domain Name System Security Extensions",
                RFC 2535, March 1999.

   [RFC2181]    Elz, R. and R.
                Bush, "Clarifications to the DNS Specification",
                RFC 2181, July 1997.

   [UNICODE]    The Unicode Consortium. The Unicode Standard, Version
                3.2.0 is defined by The Unicode Standard, Version 3.0
                (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
                as amended by the Unicode Standard Annex #27: Unicode
                3.1 (http://www.unicode.org/reports/tr27/) and by the
                Unicode Standard Annex #28: Unicode 3.2
                (http://www.unicode.org/reports/tr28/).

   [USASCII]    Cerf, V., "ASCII format for Network Interchange", RFC
                20, October 1969.

10. Security Considerations

   Security on the Internet partly relies on the DNS. Thus, any change
   to the characteristics of the DNS can change the security of much of
   the Internet.

   This memo describes an algorithm which encodes characters that are
   not valid according to STD3 and STD13 into octet values that are
   valid. No security issues such as string length increases or new
   allowed values are introduced by the encoding process or the use of
   these encoded values, apart from those introduced by the ACE encoding
   itself.

   Domain names are used by users to identify and connect to Internet
   servers. The security of the Internet is compromised if a user
   entering a single internationalized name is connected to different
   servers based on different interpretations of the internationalized
   domain name.

   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the problem of transcoding between the
   local character set and Unicode up to the application. If different
   applications (or different versions of one application) implement
   different transcoding rules, they could interpret the same name
   differently and contact different servers. This problem is not
   solved by security protocols like TLS that do not take local
   character sets into account.
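
   The transcoding risk can be made concrete with a small Python
   sketch. It assumes, purely for illustration, that two applications
   receive the same octets from a local interface but disagree on the
   local character set (ISO 8859-1 in one case, ISO 8859-5 in the
   other). The helper name ace_from_local is hypothetical, and ToASCII
   is taken from the standard-library module encodings.idna.

```python
from encodings import idna  # standard-library ToASCII / ToUnicode


def ace_from_local(octets: bytes, charset: str) -> str:
    """Hypothetical helper: transcode local octets to Unicode, then ToASCII."""
    label = octets.decode(charset)  # the transcoding step left to applications
    return idna.ToASCII(label).decode("ascii")


octets = b"b\xfccher"  # identical input handed to both applications

first = ace_from_local(octets, "iso8859_1")   # 0xFC read as U+00FC
second = ace_from_local(octets, "iso8859_5")  # 0xFC read as U+045C

print(first)            # xn--bcher-kva
print(first == second)  # False
```

   The two applications would query different ACE labels for what the
   user regards as a single name.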

   Because this document normatively refers to [NAMEPREP], [PUNYCODE],
   and [STRINGPREP], it includes the security considerations from those
   documents as well.

   If or when this specification is updated to use a more recent Unicode
   normalization table, the new normalization table will need to be
   compared with the old to spot backwards incompatible changes. If
   there are such changes, they will need to be handled somehow, or
   there will be security as well as operational implications. Methods
   to handle the conflicts could include keeping the old normalization,
   or taking care of the conflicting characters by operational means, or
   some other method.

   Implementations MUST NOT use more recent normalization tables than
   the one referenced from this document, even though more recent tables
   may be provided by operating systems. If an application is unsure of
   which version of the normalization tables is in the operating
   system, the application needs to include the normalization tables
   itself. Using normalization tables other than the one referenced
   from this specification could have security and operational
   implications.

   To help prevent confusion between characters that are visually
   similar, it is suggested that implementations provide visual
   indications where a domain name contains multiple scripts. Such
   mechanisms can also be used to show when a name contains a mixture of
   simplified and traditional Chinese characters, or to distinguish zero
   and one from O and l. DNS zone administrators may impose restrictions
   (subject to the limitations in section 2) that try to minimize
   homographs.

   Domain names (or portions of them) are sometimes compared against a
   set of privileged or anti-privileged domains. In such situations it
   is especially important that the comparisons be done properly, as
   specified in section 3.1 requirement 4.
   For labels already in ASCII form, the proper comparison reduces to
   the same case-insensitive ASCII comparison that has always been used
   for ASCII labels.

   The introduction of IDNA means that any existing labels that start
   with the ACE prefix and would be altered by ToUnicode will
   automatically be ACE labels, and will be considered equivalent to
   non-ASCII labels, whether or not that was the intent of the zone
   administrator or registrant.

11. IANA Considerations

   IANA has assigned the ACE prefix "xn--" in consultation with the
   IESG.

12. Authors' Addresses

   Patrik Faltstrom
   Cisco Systems
   Arstaangsvagen 31 J
   S-117 43 Stockholm Sweden

   EMail: paf@cisco.com

   Paul Hoffman
   Internet Mail Consortium and VPN Consortium
   127 Segre Place
   Santa Cruz, CA 95060 USA

   EMail: phoffman@imc.org

   Adam M. Costello
   University of California, Berkeley

   URL: http://www.nicemice.net/amc/

A. Changes from RFC 3490

   This document is a revision of RFC 3490. None of the changes affect
   the protocol described in RFC 3490; that is, all implementations of
   RFC 3490 will be identical to implementations of the specification
   in this document. The items that have changed from the RFC 3490
   document are:

   - The last line of section 1 has a grammatical fix (user's ->
     users').

   - Added a note in section 1 about the IESG statement on IDNA, and
     added a reference to it.

   - In section 3.1 rule 3, fixed spelling of "unintelligle" to
     "unintelligible".

   - In step 8 of section 4.1, added "(0 is excluded)" to clarify.

   - In section 4.2, the first sentence of the third paragraph was
     incorrect. It has been replaced with a sentence that is both
     correct and more descriptive.

   - Added "ToUnicode consists of the following steps:" before the steps
     in section 4.2.

   - Changed wording of step 1 of section 4.2 to match the wording in
     section 4.1 (the result is identical).

   - Added the last paragraph in section 6.1 to acknowledge that some
     Unicode display issues are tricky, but they are not specific to
     IDNA.

   - The sentence in section 11 now says the sequence that was chosen.