idnits 2.17.1

draft-ietf-idn-dunce-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 436 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (April 16, 2001) is 8401 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'ACE' is mentioned on line 6, but not defined

  == Missing Reference: 'EDNS' is mentioned on line 60, but not defined

  == Missing Reference: 'MIME' is mentioned on line 257, but not defined

  == Unused Reference: 'BASE64' is defined on line 360, but no explicit
     reference was found in the text

  == Unused Reference: 'ENDS' is defined on line 369, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  ** Downref: Normative reference to an Informational draft:
     draft-klensin-dns-role (ref. 'DNSROLE')

  -- Possible downref: Non-RFC (?) normative reference: ref. 'EBCDIC'

  ** Obsolete normative reference: RFC 2671 (ref. 'ENDS') (Obsoleted by RFC
     6891)

  -- Possible downref: Normative reference to a draft: ref. 'IDNComp'

  -- Possible downref: Normative reference to a draft: ref. 'IDNReq'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Normative reference to a draft: ref. 'RACE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode3'


     Summary: 6 errors (**), 0 flaws (~~), 8 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    Internet Draft                                        John C Klensin
2    draft-ietf-idn-dunce-00.txt                                AT&T Labs
3    April 16, 2001
4    Expires in six months (October 2001)
5    DUNCE: A proposal for a Definitely Unencumbered
6    New Compatible [ACE] Encoding

8    Status of this memo

10   This document is an Internet-Draft and is in full conformance with
11   all provisions of Section 10 of RFC2026.

13   Internet-Drafts are working documents of the Internet Engineering
14   Task Force (IETF), its areas, and its working groups. Note that
15   other groups may also distribute working documents as
16   Internet-Drafts.

18   Internet-Drafts are draft documents valid for a maximum of six
19   months and may be updated, replaced, or obsoleted by other documents
20   at any time. It is inappropriate to use Internet-Drafts as reference
21   material or to cite them other than as "work in progress."

23   The list of current Internet-Drafts can be accessed at
24   http://www.ietf.org/ietf/1id-abstracts.txt

26   The list of Internet-Draft Shadow Directories can be accessed
27   at http://www.ietf.org/shadow.html.

29   Abstract

31   This document describes a transformation method for representing
32   non-ASCII characters in host name parts in a fashion that is
33   completely compatible with the current DNS. It is a potential
34   candidate for an ASCII-Compatible Encoding (ACE) for
35   internationalized host names, as described in the comparison
36   document from the IETF IDN Working Group. This method is based
37   exclusively on long-established mechanisms for denoting the
38   positions of characters in tables, but includes variations for
39   compressing that information, also based on long-established
40   mechanisms.

42   1. Introduction

44   1.1 Context

46   There is a strong world-wide desire to use characters other than
47   plain ASCII in host names. Host names have become the equivalent of
48   business or product names for many services on the Internet, so
49   there is a need to make them usable by people whose native scripts
50   are not representable by ASCII. The requirements for
51   internationalizing host names are described in the IDN WG's
52   requirements document, [IDNReq].

54   The IDN WG's comparison document [IDNComp] describes three potential
55   main architectures for IDN: arch-1 (just send binary), arch-2 (send
56   binary or ACE), and arch-3 (just send ACE).
     DUNCE is an ACE that can
57   be used with protocols that match arch-2 or arch-3. It is known as
58   "dumb ACE" because it does not attempt any particular optimization
59   of string patterns, relying instead on either names extended to
60   longer length using DNS extension mechanisms [EDNS] or compression
61   if length optimization is desired (without optimization, the maximum
62   effective length of a DUNCE-encoded name would be about 14
63   characters). DUNCE specifies an ACE format as specified in ace-1 in
64   [IDNComp]. Further, it specifies an identifying mechanism for ace-2
65   in [IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to
66   the beginning of the name part).

68   In formal terms, DUNCE describes a mechanism for specifying
69   character positions in the ISO/IEC 10646 [ISO10646] coded character
70   set (whose assignment of characters is synchronized with Unicode
71   [Unicode3]) and the rules for using that scheme in the DNS. Since it
72   is a simple method of designating those characters, it probably does
73   not meet the definition of a "charset" as defined in [IDNReq].

75   The DUNCE protocol has the following features:

77   - There is exactly one way to convert internationalized host parts
78   to and from DUNCE parts. Host name part uniqueness is preserved.

80   - Host parts that have no international characters are not changed.

82   - Names using DUNCE have lengths exactly proportionate to the number
83   of characters (from IS 10646) in the names themselves plus the
84   introducer tag. I.e., DUNCE is not dependent on the code positions
85   in the tables, the relationships of characters in the name, or other
86   coding factors.

88   - This specification utilizes the well-known Base64 encoding [MIME]
89   or the obvious Base32 variation [RACE] as a means of shortening the
90   coded strings to permit longer names.

92   It is important to note that the following sections contain many
93   normative statements with "MUST" and "MUST NOT".
     Any implementation
94   that does not follow these statements exactly is likely to cause
95   damage to the Internet by creating non-unique representations of
96   host names.

98   1.2 Author's Disclaimer

100  This document was written for the convenience of the IDN WG, in case
101  (or for the next time) someone suggests that there are no plausible
102  mechanisms for encoding internationalized names into the DNS which
103  are unencumbered by any intellectual property rights claims, at
104  least any plausible ones.

106  The author continues to believe that no DNS-based approach is going
107  to solve the "IDN" problem as it is perceived by users and company/
108  enterprise domain name registrants and continues to hold the strong
109  hypothesis that, if non-DNS solutions are needed, it is probably not
110  desirable to further complicate the DNS and risk unknown problems
111  and incompatibilities [DNSROLE].

113  1.3 Terminology

115  The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
116  and "MAY" in this document are to be interpreted as described in RFC
117  2119 [RFC2119].

119  Hexadecimal values are shown preceded by "0x". For example,
120  "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values
121  are shown preceded by "0b". For example, a nine-bit value might
122  be shown as "0b101101111".

124  Examples in this document use the notation from the Unicode Standard
125  [Unicode3] as well as the ISO 10646 names. For example, the letter
126  "a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".

128  DUNCE converts strings with internationalized characters into
129  strings of US-ASCII that are acceptable as host name parts in
130  current DNS host naming usage. The former are called "pre-converted"
131  and the latter are called "post-converted".

133  The protocol actually contains three variations (three dunces?):

135  DUNCE1 Direct encoding, with the result that the maximum length of
136  names will be about 14 characters.

138  DUNCE2 Encoding using Base64 (or Base32, see section 3), with a
139  longer maximum name length.
140  DUNCE3 Compression using the <> method, with a maximum name
141  length that will typically be longer than DUNCE2.

143  DUNCE1 will, in practice, probably be usable only in conjunction with
144  extended-length DNS labels.

146  1.4 IDN summary

148  Using the terminology in [IDNComp], DUNCE specifies an ACE format as
149  specified in ace-1. Further, it specifies an identifying mechanism
150  for ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the
151  beginning of the name part).

153  The length characteristics of DUNCEn are discussed above. Except
154  where compression is used, the number of characters in a name that
155  can be encoded in a DNS label will be invariant with the positions
156  or scripts from which those characters are derived.

158  2. Host Part Transformation

160  According to [STD13], host parts must be case-insensitive, start and
161  end with a letter or digit, and contain only letters, digits, and
162  the hyphen character ("-"). This, of course, excludes any
163  internationalized characters, as well as many other characters in
164  the ASCII character repertoire. Further, domain name parts must be
165  63 octets or shorter in length.

167  2.1 Name tagging

169  All post-converted name parts that contain internationalized
170  characters begin with the string "bl--". (Of course, because host
171  name parts are case-insensitive, this might also be represented as
172  "Bl--" or "bL--" or "BL--".) The string "bl--" was chosen because it
173  represents the first two characters of the English expletive
174  "bleech", which is an editorial observation on the context in which
175  this specification is being written. The string "bl--" will change
176  to other strings with more appropriate properties in future versions
177  of this draft.
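
A minimal sketch of the case-insensitive tag comparison and of the
label budget that the tag leaves behind, written in Python. The names
TAG, has_dunce_tag, and max_chars are hypothetical and do not appear
in the draft. The capacity arithmetic assumes 16 bits per character
and an encoding with no padding symbols, which the draft does not
spell out; with four bits per symbol (plain hexadecimal) it reproduces
the "about 14 characters" figure quoted for direct encoding.

```python
TAG = "bl--"            # tag string from section 2.1
MAX_LABEL_OCTETS = 63   # label limit cited from [STD13] in section 2


def has_dunce_tag(label: str) -> bool:
    """Return True if the label starts with the tag, ignoring case."""
    return label[:len(TAG)].lower() == TAG


def max_chars(bits_per_symbol: int) -> int:
    """Estimate how many 16-bit characters fit in one tagged label.

    bits_per_symbol is 4 for hexadecimal digits, 5 for Base32 and
    6 for Base64 (assumption: no padding symbols are emitted).
    """
    payload_symbols = MAX_LABEL_OCTETS - len(TAG)
    return (payload_symbols * bits_per_symbol) // 16
```

For example, has_dunce_tag("BL--040D") is true, and max_chars(4)
evaluates to 14.
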

179  Note that a zone administrator might still choose to use "bl--" at
180  the beginning of a host name part even if that part does not contain
181  internationalized characters. Zone administrators SHOULD NOT create
182  host part names that begin with "bl--" unless those names are
183  post-converted names. Creating host part names that begin with
184  "bl--" but that are not post-converted names may cause two distinct
185  problems. Some display systems, after converting the post-converted
186  name part back to an internationalized name part, might display the
187  name parts in a possibly-confusing fashion to users. More seriously,
188  some resolvers, after converting the post-converted name part back
189  to an internationalized name part, might reject the host name if it
190  contains illegal characters.

192  2.2 Converting an internationalized name to an ACE name part

194  To convert a string of internationalized characters into an ACE name
195  part, the following steps MUST be performed in the exact order of
196  the subsections given here.

198  If a name part consists exclusively of characters that conform to
199  the host name requirements in [STD13], the name MUST NOT be
200  converted to DUNCE. That is, a name part that can be represented
201  without DUNCE MUST NOT be encoded using DUNCE. This absolute
202  requirement prevents there from being two different encodings for a
203  single DNS host name.

205  If any checking for prohibited name parts (such as prohibited
206  characters, case-folding, or canonicalization) is to be
207  done, it MUST be done before doing the conversion to an ACE name part.

209  Characters outside the first plane of characters (those with
210  codepoints above U+FFFF) MUST be represented using surrogates, as
211  described in the UTF-16 description in ISO 10646.

213  The input name string consists of characters from the ISO 10646
214  character set in big-endian UTF-16 encoding. This is the
215  pre-converted string.
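
A minimal sketch, in Python, of how the pre-converted string might be
produced under the rules above. The function names is_std13_label and
pre_converted are hypothetical, and the regular expression is only a
paraphrase of the [STD13] host name rule as quoted in section 2
(letters, digits, and hyphens, starting and ending with a letter or
digit, at most 63 octets). Python's built-in "utf-16-be" codec emits
big-endian UTF-16 without a byte order mark and represents characters
above U+FFFF as surrogate pairs.

```python
import re

# Assumption: this pattern paraphrases the host name rule from section 2.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_std13_label(label: str) -> bool:
    """Return True if the label already satisfies the host name rule."""
    return len(label) <= 63 and _LDH.fullmatch(label) is not None


def pre_converted(label: str) -> bytes:
    """Return the big-endian UTF-16 form of an internationalized label.

    Raises ValueError for labels that need no conversion, because such
    labels MUST NOT be encoded using DUNCE.
    """
    if is_std13_label(label):
        raise ValueError("label is already a valid host name part")
    return label.encode("utf-16-be")
```

For example, pre_converted("\u00e9") yields the two octets 0x00 0xe9,
while pre_converted("example") raises ValueError.
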

217  2.2.1 Check the input string for disallowed names

219  If the input string consists only of characters that conform to the
220  host name requirements in [STD13], the conversion MUST stop with an
221  error.

223  2.2.2 Represent each character by its column and row position.

225  2.2.2.1 For DUNCE1...

227  Mechanisms for describing code point positions by a printable (and
228  ASCII) column and row position date to very early code point tables
229  and are believed to have been used for BCD and Baudot. The
230  earliest references readily available to the author are those for
231  [EBCDIC] and [ASCII], but those cited are not even the original
232  references for those codings and techniques. Note that these
233  techniques are all in the public literature and have been widely
234  practiced. In these notations, column and row positions are
235  typically separated by slashes or commas, but, as long as a number
236  system is used that permits representation in a fixed number of
237  digits, simple catenation is well-known as well. For example, in
238  ASCII, the coding position for the character "M" is variously
239  represented as 4/13, 4,13 or 0413 (decimal notation) or 4/D, 4,D, or
240  4D (hexadecimal notation).

242  For DUNCE1, each code point is represented by its column and row
243  position, each expressed as two hexadecimal digits. E.g., Latin
244  character upper case M would become 040D.

246  Catenate all such four-digit strings in the same order that the
247  characters appeared in the original label.

249  2.2.2.2 For DUNCE2...

251  Code each character into the 16-bit representation of that character
252  in IS 10646 BMP (plane 0), taking the column positions before the
253  row ones. Catenate the strings thus formed in the same order that
254  the characters appeared in the original label.

256  When the complete string is formed, convert it to Base64 (or Base32,
257  see section 3) encoding, as specified in [MIME].

259  2.2.2.3 For DUNCE3...

261  Code each character into a 16-bit representation and then catenate
262  the strings, as for DUNCE2. Then compress the resulting bit string,
263  using <>. Some compression mechanisms produce, or can easily
264  be altered to produce, case-insensitive ASCII encodings. The
265  results of such compressions can be used directly. Others produce a
266  binary result which will then need to be converted using Base64 (or
267  Base32, see section 3).

269  2.2.3 Prepend "bl--" to the encoded string and finish

271  Prepend the characters "bl--" to the encoded string. This is the
272  host name part that can be used in DNS resolution.

274  2.3 Converting a host name part to an internationalized name

276  The input string for conversion is a valid host name part. Note that
277  if any checking for prohibited name parts (such as prohibited
278  characters, case-folding, or canonicalization) is to be done, it MUST
279  be done after doing the conversion from an ACE name part.

281  If a decoded name part consists exclusively of characters that
282  conform to the host name requirements in [STD13], the conversion
283  from DUNCE MUST fail. Because a name part that can be represented
284  without DUNCE MUST NOT be encoded using DUNCE, the decoding process
285  MUST check for name parts that consist exclusively of characters
286  that conform to the host name requirements in [STD13] and, if such a
287  name part is found, it MUST be considered an error (and possibly a
288  security violation).

290  2.3.1 Strip the "bl--"

292  The input string MUST begin with the characters "bl--". If it does
293  not, the conversion MUST stop with an error. Otherwise, remove the
294  characters "bl--" from the input string. The result of this step is
295  the stripped string.

297  2.3.2 Decode the stripped string

299  2.3.2.1 For DUNCE1...

301  Divide the stripped string into chunks of four hexadecimal digits
302  each.
     If the string is not an exact multiple of four characters in
303  length, or if any character is outside the range 0...9...F, report
304  an error. Use the hex-encoded row and column positions to
305  reconstruct the original characters, then catenate them to form the
306  resulting string.

308  2.3.2.2 For DUNCE2...

310  Apply a Base64 decoding to reconstruct the original binary string
311  and use that string to restore the original character codes.

313  2.3.2.3 For DUNCE3...

315  Apply a Base64 decoding if needed, uncompress the string to restore
316  the original binary, then use that string as above.

318  2.3.3 Check the internationalized string for disallowed names

320  If the internationalized string consists only of characters that
321  conform to the host name requirements in [STD13], the conversion
322  MUST stop with an error.

324  3. Using Base64 (or Base32)

326  The RACE [RACE] specification and its variations use a Base32
327  encoding to avoid difficulties with case-insensitivity of the coded
328  names. Since DNS implementations are required to preserve the case
329  of names that are deposited, the author naively believes that it
330  ought to be possible to use the more efficient Base64 encoding for
331  DUNCE. If he is wrong, which is probable, DUNCE2 and, if needed,
332  DUNCE3 can be easily altered to use Base32. Note that DUNCE1 and
333  compression mechanisms that automatically produce case-insensitive
334  ASCII encodings do not depend on the use of Base64 (or Base32)
335  encodings.

337  4. Security Considerations

339  Much of the security of the Internet relies on the DNS. Thus, any
340  change to the characteristics of the DNS can change the security of
341  much of the Internet. Thus, DUNCE makes no changes to the DNS itself.

343  Host names are used by users to connect to Internet servers.
     The
344  security of the Internet would be compromised if a user entering a
345  single internationalized name could be connected to different
346  servers based on different interpretations of the internationalized
347  host name.

349  DUNCE is designed so that every internationalized host name part can
350  be represented as one and only one DNS-compatible string. If there
351  is any way to follow the steps in this document and get two or more
352  different results, it is a severe and fatal error in the protocol.

354  5. References

356  [ASCII] American National Standards Institute (formerly United
357  States of America Standards Institute), X3.4, 1968, "USA Code for
358  Information Interchange". (ANSI X3.4-1968)

360  [BASE64] N. Freed & N. Borenstein, "Multipurpose Internet Mail
361  Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
362  2045. November 1996.

364  [DNSROLE] J Klensin, "Role of the Domain Name System", Work in
365  progress, draft-klensin-dns-role. (Current version is -00, November
366  2000.)

368  [EBCDIC] TBS - original S/360 _Principles of Operation_ manual.
369  [ENDS] Paul Vixie, "Extension Mechanisms for DNS (EDNS0)", RFC 2671.
370  August 1999.

372  [IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
373  Proposals", draft-ietf-idn-compare.

375  [IDNReq] Zita Wenzel and James Seng, "Requirements of
376  Internationalized Domain Names", draft-ietf-idn-requirements.
377  (Current version is -04, October 2000.)

379  [ISO10646] ISO/IEC 10646-1:1993. International Standard --
380  Information technology -- Universal Multiple-Octet Coded Character
381  Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane.
382  Five amendments and a technical corrigendum have been published up
383  to now. UTF-16 is described in Annex Q, published as Amendment 1. 17
384  other amendments are currently at various stages of standardization.
385  [[[ THIS REFERENCE NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE
386  WORDING ]]]

388  [RACE] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding for
389  IDN", Work in Progress, November 2000, draft-ietf-idn-race-03.txt.

391  [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
392  Requirement Levels", March 1997, RFC 2119.

394  [STD13] Paul Mockapetris, "Domain names - implementation and
395  specification", November 1987, STD 13 (RFC 1035).

397  [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
398  3.0", ISBN 0-201-61633-5. Described at
399  .

401  6. Acknowledgements

403  This document is shamelessly copied and extracted from Paul
404  Hoffman's RACE encoding document [RACE], but is intended to provide
405  a reference point for a completely unencumbered and unencumberable
406  encoding. The acknowledgements in the RACE document apply here as
407  well and will be incorporated if the document is published as an
408  RFC. Harald Alvestrand suggested a name for the protocol after the
409  author made a rude suggestion about another name. However, neither
410  Paul Hoffman nor anyone else besides the author bears the blame for
411  the stupid techniques described herein.

413  7. IANA Considerations

415  This document does not require IANA action, registration, or
416  considerations.

418  8. Author Contact Information

420  John C Klensin
421  AT&T Labs
422  99 Bedford St, 4th floor
423  Boston, MA 02111
424  +1 617 574 3076
425  klensin@att.com

427  Expires October 2001