idnits 2.17.1

draft-ietf-idn-dude-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 597 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 10 instances of too long lines in the document, the longest
     one being 3 characters in excess of 72.

  == There are 4 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you
     can ignore this comment.  If not, you may need to add the pre-RFC5378
     disclaimer.  (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 16, 2001) is 8352 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '0123456789abcdef' is mentioned on line 453, but not
     defined

  -- Looks like a reference, but probably isn't: '0123456789' on line 277

  == Missing Reference: 'GHIJKLMNOPQRSTUV' is mentioned on line 389, but not
     defined

  == Missing Reference: '0-9' is mentioned on line 458, but not defined

  == Unused Reference: 'IDNCOMP' is defined on line 522, but no explicit
     reference was found in the text

  == Unused Reference: 'IDNNAMEPREP' is defined on line 531, but no explicit
     reference was found in the text

  -- Possible downref: Normative reference to a draft: ref. 'IDNCOMP'

  -- Possible downref: Normative reference to a draft: ref. 'IDNRACE'

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNREQ'

  -- Possible downref: Normative reference to a draft: ref. 'IDNDUERST'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE3'

     Summary: 4 errors (**), 0 flaws (~~), 8 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1   Internet Engineering Task Force (IETF)                       Mark Welter
2   INTERNET-DRAFT                                        Brian W. Spolarich
3   draft-ietf-idn-dude-00.txt                                   WALID, Inc.
4   November 16, 2000                                   Expires May 16, 2001

6              DUDE: Differential Unicode Domain Encoding

8   Status of this memo

10  This document is an Internet-Draft and is in full conformance with all
11  provisions of Section 10 of RFC2026.

13  Internet-Drafts are working documents of the Internet Engineering Task
14  Force (IETF), its areas, and its working groups. Note that other
15  groups may also distribute working documents as Internet-Drafts.

17  Internet-Drafts are draft documents valid for a maximum of six months
18  and may be updated, replaced, or obsoleted by other documents at any
19  time. It is inappropriate to use Internet-Drafts as reference
20  material or to cite them other than as "work in progress."

22  The list of current Internet-Drafts can be accessed at
23  http://www.ietf.org/ietf/1id-abstracts.txt

25  The list of Internet-Draft Shadow Directories can be accessed at
26  http://www.ietf.org/shadow.html.

28  The distribution of this document is unlimited.

30  Copyright (c) The Internet Society (2000). All Rights Reserved.

32  Abstract

34  This document describes a transformation method for representing
35  Unicode character codepoints in host name parts in a fashion that is
36  completely compatible with the current Domain Name System. It provides
37  for very efficient representation of typical Unicode sequences as
38  host name parts, while preserving simplicity. It is proposed as a
39  potential candidate for an ASCII-Compatible Encoding (ACE) for supporting
40  the deployment of an internationalized Domain Name System.

42  Table of Contents

44  1. Introduction
45    1.1 Terminology
46  2. Hostname Part Transformation
47    2.1 Post-Converted Name Prefix
48    2.2 Radix Selection
49    2.3 Hostname Preparation
50    2.4 Definitions
51    2.5 DUDE Encoding
52      2.5.1 Extended Variable Length Hex Encoding
53      2.5.2 DUDE Compression Algorithm
54      2.5.3 Forward Transformation Algorithm
55    2.6 DUDE Decoding
56      2.6.1 Extended Variable Length Hex Decoding
57      2.6.2 DUDE Decompression Algorithm
58      2.6.3 Reverse Transformation Algorithm
59  3. Examples
60    3.1 'www.walid.com' (in Arabic)
61  4. DUDE Extensions
62    4.1 Extended DUDE Encoding
63      4.1.1 Modified Extended Variable Length Hex Encoding
64      4.1.2 Extended Compression Algorithm
65      4.1.3 Extended Forward Transformation Algorithm
66    4.2 Extended DUDE Decoding
67      4.2.1 Modified Extended Variable Length Hex Decoding
68      4.2.2 Extended Decompression Algorithm
69      4.2.3 Extended Reverse Transformation Algorithm
70  5. Security Considerations
71  6. References

73  1. Introduction

75  DUDE describes an encoding scheme for the ISO/IEC 10646 [ISO10646]
76  character set (whose character code assignments are synchronized
77  with Unicode [UNICODE3]), and the procedures for using this scheme
78  to transform host name parts containing Unicode character sequences
79  into sequences that are compatible with the current DNS protocol
80  [STD13]. As such, it satisfies the definition of a 'charset' as
81  defined in [IDNREQ].

83  1.1 Terminology

85  The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
86  "MAY" in this document are to be interpreted as described in RFC 2119
87  [RFC2119].

89  Hexadecimal values are shown preceded by "0x". For example,
90  "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
91  shown preceded by "0b". For example, a nine-bit value might be
92  shown as "0b101101111".

94  Examples in this document use the notation from the Unicode Standard
95  [UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
96  may be represented as either "U+0061" or "LATIN SMALL LETTER A".

98  DUDE converts strings with internationalized characters into
99  strings of US-ASCII that are acceptable as host name parts in current
100 DNS host naming usage. The former are called "pre-converted" and the
101 latter are called "post-converted". This specification defines both
102 a forward and a reverse transformation algorithm.

104 2. Hostname Part Transformation

106 According to [STD13], hostname parts must start and end with a letter
107 or digit, and contain only letters, digits, and the hyphen character
108 ("-"). This, of course, excludes most characters used by non-English
109 speakers, as well as many other characters in the ASCII
110 character repertoire. Further, domain name parts must be 63 octets or
111 shorter in length.

113 2.1 Post-Converted Name Prefix

115 This document defines the string 'dq--' as a prefix to identify
116 DUDE-encoded sequences. For the purposes of comparison in the IDN
117 Working Group activities, the 'dq--' prefix should be used solely to
118 identify DUDE sequences. However, should this document proceed beyond
119 draft status, the prefix should be changed to whatever prefix, if any,
120 is the final consensus of the IDN working group.

122 Note that the prepending of a fixed identifier sequence is only one
123 mechanism for differentiating ASCII character encoded international
124 domain names from 'ordinary' domain names. One method, as proposed in
125 [IDNRACE], is to include a character prefix or suffix that does not
126 appear in any name in any zone file. A second method is to insert a
127 domain component which pushes off any international names one or more
128 levels deeper into the DNS hierarchy. There are trade-offs between
129 these two methods which are independent of the Unicode to ASCII
130 transcoding method finally chosen. We do not address the international
131 vs. 'ordinary' name differentiation issue in this paper.

133 2.2 Radix Selection

135 There are many proposed methods for representing Unicode characters
136 within the allowed target character set, which can be split into groups
137 on the basis of the underlying radix. We have chosen a method with
138 radix 16 because both UTF-16 and ASCII are represented by even multiples
139 of four bits. This allows a Unicode character to be encoded as a
140 whole number of ASCII characters, and permits easier manipulation of
141 the resulting encoded data by humans.

143 2.3 Hostname Preparation

145 The hostname part is assumed to have at least one character disallowed
146 by [STD13], and to have been processed for logically equivalent
147 character mapping, filtering of disallowed characters (if any), and
148 compatibility composition/decomposition before presentation to the DUDE
149 conversion algorithm.

151 While it is possible to invent a transcoding mechanism that relies
152 on certain Unicode characters being deemed illegal within domain names
153 and hence available to the transcoding mechanism for improving encoding
154 efficiency, we feel that such a proposal would complicate matters
155 excessively. We also believe that Unicode name preprocessing for
156 both name resolution and name registration should be considered as
157 separate, independent issues, which we will address in a separate
158 document.

160 2.4 Definitions

162 For clarity:

164   'integer' is an unsigned binary quantity;
165   'byte' is an 8-bit integer quantity;
166   'nibble' is a 4-bit integer quantity.

168 2.5 DUDE Encoding

170 The idea behind this scheme is to provide compression by encoding the
171 contiguous least significant nibbles of a character that differ from the
172 preceding character. The scheme uses a variant of the variable length
173 hex encoding described in [IDNDUERST] and elsewhere; because this variant
174 encodes leading zero nibbles, the differential length can be recovered.
175 The encoding is, with some practice, easy to perform manually.

177 There are two extensions to this basic idea: one enables encoding the
178 preferred case for each character (for reverse DNS resolution) and
179 another improves the worst case behaviour related to surrogates. The
180 basic algorithms will be formally described first and then the extended
181 algorithms will be described.
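
As a rough illustration only (it is not part of the specification), the
differential idea of the basic scheme can be sketched in Python. The names
dude_compress and dude_decompress are hypothetical, the input is assumed to
be a list of 16-bit code units, and the 'dq--' prefix and any name
preparation are left out.

```python
# Illustrative sketch of basic DUDE compression and decompression.
# Function and constant names are hypothetical, not taken from the draft.
FIRST = "ghijklmnopqrstuv"   # alternate digits for the leading nibble
REST = "0123456789abcdef"    # ordinary hex digits for the other nibbles


def dude_compress(units):
    """Encode a list of 16-bit code units (ints) as a DUDE string."""
    out = []
    prev = 0
    for c in units:
        if c == ord("-"):
            out.append("-")              # hyphen passes through, PREV unchanged
            continue
        d = prev ^ c
        n = 1
        while d >> (4 * n):              # least N with no difference above nibble N
            n += 1
        v = c & ((1 << (4 * n)) - 1)     # keep only the N low nibbles of C
        nibbles = [(v >> (4 * i)) & 0xF for i in range(n - 1, -1, -1)]
        out.append(FIRST[nibbles[0]] + "".join(REST[j] for j in nibbles[1:]))
        prev = c
    return "".join(out)


def dude_decompress(text):
    """Decode a DUDE string (without prefix) into a list of code units."""
    text = text.lower()
    units = []
    prev = 0
    i = 0
    while i < len(text):
        if text[i] == "-":
            units.append(ord("-"))
            i += 1
            continue
        if text[i] not in FIRST:
            raise ValueError("token must start with one of g..v")
        r = FIRST.index(text[i])
        n = 1
        i += 1
        while i < len(text) and text[i] in REST:
            r = r * 16 + REST.index(text[i])
            n += 1
            i += 1
        if r > 0xFFFF:
            raise ValueError("value exceeds 16 bits")
        mask = ~((1 << (4 * n)) - 1)     # the token length gives the mask
        prev = (prev & mask) + r
        units.append(prev)
    return units
```

Under these assumptions, the first label of the Arabic example in section
3.1 (U+0645 U+0648 U+0642 U+0639) compresses to 'm45oij9', and decompressing
that string yields the original code units. The token length N is what lets
the decoder know how many low nibbles of the previous character to replace.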

183 2.5.1 Extended Variable Length Hex Encoding

185 The variable length hex encoding algorithm was introduced by Duerst in
186 [IDNDUERST]. It encodes an integer value in a slight modification of
187 traditional hexadecimal notation, the difference being that the most
188 significant digit is represented with an alternate set of "digits"
189 -- 'g' through 'v' are used to represent 0 through 15. The result is a
190 variable length encoding which can efficiently represent integers of
191 arbitrary length.

193 This specification extends the variable length hex encoding algorithm
194 to support the compression scheme defined below by potentially not
195 suppressing leading zero nibbles.

197 The extended variable length nibble encoding of an integer, C,
198 to length N, is defined as follows:

200   1. Start with I, the Nth nibble of C counting from the least
201      significant nibble of C;

203   2. Emit the Ith character of the sequence [ghijklmnopqrstuv];

205   3. Continue from the most to least significant, encoding each
206      remaining nibble J by emitting the Jth character of the
207      sequence [0123456789abcdef].

209 2.5.2 DUDE Compression Algorithm

211   1. Let PREV = 0;

213   2. If there are no more characters in the input, terminate successfully;

215   4. Let C be the next character in the input;

217   5. If C != '-', then go to step 7;

219   6. Consume the input character, emit '-', and go to step 2;

221   7. Let D be the result of PREV exclusive ORed with C;

223   8. Find the least positive value N such that
224      D bitwise ANDed with M is zero
225      where M = the bitwise complement of (16**N) - 1;

227   9. Let V be C ANDed with the bitwise complement of M;

229   10. Variable length hex encode V to length N and emit the result;

231   11. Let PREV = C and go to step 2.

233 2.5.3 Forward Transformation Algorithm

235 The DUDE transformation algorithm accepts a string in UTF-16
236 [ISO10646] format as input. The encoding algorithm is as follows:

238   1. Break the hostname string into dot-separated hostname parts.
239      For each hostname part which contains one or more characters
240      disallowed by [STD13], perform steps 2 and 3 below;

242   2. Compress the hostname part using the method described in section
243      2.5.2 above, and encode using the encoding described in section
244      2.5.1;

246   3. Prepend the post-converted name prefix 'dq--' (see section 2.1
247      above) to the resulting string.

249 2.6 DUDE Decoding

251 2.6.1 Extended Variable Length Hex Decoding

253 Decoding extended variable length hex encoded strings is identical
254 to the standard variable length hex decoding, and is defined as
255 follows:

257   1. Let CL be the lower case of the first input character;

259      If CL is not in set [ghijklmnopqrstuv],
260        return error,
261      else
262        consume the input character;

264   2. Let R = CL - 'g',
265      Let N = 1;

267   3. If no more input characters exist, go to step 9;

269   4. Let CL be the lower case of the next input character;

271   5. If CL is not in the set [0123456789abcdef], go to step 9;

273   6. Consume the next input character,
274      Let N = N + 1;
275      Let R = R * 16;

277   7. If CL is in set [0123456789],
278        then let R = R + (CL - '0')
279        else let R = R + (CL - 'a') + 10;

281   8. Go to step 3;

283   9. Let MASK be the bitwise complement of (16**N) - 1;

285   10. Return decoded result R as well as MASK.

287 2.6.2 DUDE Decompression Algorithm

289   1. Let PREV = 0;

291   2. If there are no more input characters then terminate successfully;

293   3. Let C be the next input character;

295   4. If C == '-', append '-' to the result string, consume the character,
296      and go to step 2;

298   5. Let VPART, MASK be the next variable length hex decoded
299      value and mask;

301   6. If VPART > 0xFFFF then return error status;

303   7. Let CU = (PREV bitwise-AND MASK) + VPART,
304      Let PREV = CU;

306   8. Append the UTF-16 character CU to the result string;

308   9. Go to step 2.

310 2.6.3 Reverse Transformation Algorithm

312   1. Break the string into dot-separated components and apply steps
313      2 through 4 to each component;

315   2. Remove the post-converted name prefix 'dq--' (see section 2.1);

317   3. Decompress the component using the decompression algorithm
318      described above;

320   4. Concatenate the decoded segments with dot separators and return.

322 3. Examples

324 The examples below illustrate the encoding algorithm and provide
325 comparisons to alternate encoding schemes. UTF-5 sequences are
326 prefixed with '----', as no ACE prefix was defined for that encoding.

328 3.1 'www.walid.com' (in Arabic):

330   UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
331           U+0634 U+0631 U+0643 U+0629

333   DUDE:   dq--m45oij9.dq--m48kqif.dq--m34hk3i9

335   UTF-6:  wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9

337   UTF-5:  ----m45m48m42m39.----m48m44m4am2f.----m34m31m43m29

339   RACE:   bq--azcuqqrz.bq--azeeisrp.bq--ay2dcqzj

341   LACE:   bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe

343 (more examples to come)

345 4. DUDE Extensions

347 The first extension to the DUDE concept recognizes that the first
348 character emitted by the variable length hex encoding algorithm is
349 always alphabetic. We encode the case (if any) of the original Unicode
350 character in the case of the initial "hex" character. Because the DNS
351 performs case-insensitive comparisons, mixed case international domain
352 names behave in exactly the same way as traditional domain names.
353 In particular, this enables reverse lookups to return names in the
354 preferred case.

356 The second extension regards the treatment of Unicode surrogate
357 characters. If surrogates are not expanded, two 16-bit surrogates are
358 needed to represent a single codepoint in the range of 0x10000
359 through 0x10FFFF. This cuts the worst case limits in half for most
360 proposals. We will assume that our input and output Unicode are in
361 UTF-32 format -- that is, any surrogates are expanded to their UCS-4
362 equivalents. If the input codes all fall under 0x10000, then the
363 extended method will emit the same length string as the basic method.
364 One final modification takes note of the fact that the only
365 codepoints forcing the use of six hex digits are those with a "10"
366 as the fifth and sixth digits. We will encode the fifth digit using
367 a seventeenth digit as a special case to avoid this extra expansion.

369 4.1 Extended DUDE Encoding

371 4.1.1 Modified Extended Variable Length Hex Encoding

373 The modified extended variable length hex encoding of an integer C to
374 length N with case U is performed as follows:

376   1. If C > 0x10FFFF return error status;

378   2. If N < 6 go to step 5; (this is true for characters from
379      the first 16 Planes)

381   3. If U is 'Uppercase' then emit 'W'
382      else emit 'w'; (special case for the 17th Plane)

384   4. Go to step 7;

386   5. Let I be the Nth nibble from the right of C;

388   6. If U is 'Uppercase'
389        then emit the Ith character of sequence [GHIJKLMNOPQRSTUV],
390        else emit the Ith character of sequence [ghijklmnopqrstuv];

392   7. Let N = N - 1;

394   8. Continue from N to 1, encoding each remaining nibble, J, by
395      emitting the Jth character of sequence [0123456789abcdef].

397 4.1.2 Extended Compression Algorithm

399   1. Let PREV = 0;

401   2. If there are no more characters in the input, terminate successfully;

403   4. Let U be the case of the next character in the input;
404      Let C be the lowercase value of the next input character;

406   5. If C != '-', then go to step 7;

408   6. Consume the input character, emit '-', and go to step 2;

410   7. Let D be the result of PREV exclusive ORed with C;

412   8. Find the least positive value N such that
413      D bitwise ANDed with M is zero
414      where M = the bitwise complement of (16**N) - 1;

416   9. Let V = C ANDed with the bitwise complement of M;

418   10. Emit the modified variable length hex encoding of V to length
419       N with case U;

421   11. Let PREV = C and go to step 2.
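
The case-preserving part of the extended compression steps above might be
sketched in Python as follows. This is an assumption-laden illustration, not
the draft's reference code: the name dude_compress_ext is hypothetical, the
input is a Python string of code points, a character counts as upper case
only when lowering it yields a single different code point, and the 'w'
special case for N = 6 is deliberately left unimplemented.

```python
# Sketch of the extended (case-preserving) compression for N < 6.
# All names are hypothetical; the 17th-plane 'w' case is not covered.
LOWER_FIRST = "ghijklmnopqrstuv"
UPPER_FIRST = "GHIJKLMNOPQRSTUV"
HEX = "0123456789abcdef"


def dude_compress_ext(label):
    """Encode a string, recording case in the first digit of each token."""
    out = []
    prev = 0
    for ch in label:
        if ch == "-":
            out.append("-")              # hyphen passes through unchanged
            continue
        low = ch.lower()
        # Assumption of this sketch: only one-to-one lowercasing counts.
        upper = len(low) == 1 and low != ch
        c = ord(low) if upper else ord(ch)
        d = prev ^ c
        n = 1
        while d >> (4 * n):              # number of low nibbles that differ
            n += 1
        if n >= 6:
            raise NotImplementedError("'w' special case not covered here")
        v = c & ((1 << (4 * n)) - 1)
        first = (v >> (4 * (n - 1))) & 0xF
        token = (UPPER_FIRST if upper else LOWER_FIRST)[first]
        token += "".join(HEX[(v >> (4 * i)) & 0xF]
                         for i in range(n - 2, -1, -1))
        out.append(token)
        prev = c                         # PREV tracks the lowercase value
    return "".join(out)
```

With these assumptions, "ab" and "Ab" differ only in the case of the first
token character, so a case-insensitive comparison of the two outputs treats
them as equal.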

423 4.1.3 Extended Forward Transformation Algorithm

425 The overall extended encoding algorithm is as follows:

427   1. Break the hostname string into dot-separated hostname parts.
428      For each hostname part, perform steps 2 and 3 below;

430   2. Compress the component using the method described in section
431      4.1.2 above, and encode using the encoding described in section
432      4.1.1;

434   3. Prepend the post-converted name prefix 'dq--' (see section 2.1
435      above) to the resulting string.

437 4.2 Extended DUDE Decoding

439 4.2.1 Modified Extended Variable Length Hex Decoding

441   1. Let U be the case of the next input character,
442      Let C0 be the lower case of the next input character;

444   2. If C0 is not in set [ghijklmnopqrstuvw] then return error status,
445      else, consume the input character;

447   3. Let R = C0 - 'g',
448      Let N = 1;

450   4. If no more input characters exist then go to step 8;

452   5. Let CL be the lower case of the next input character,
453      If CL is not in set [0123456789abcdef] then go to step 8;

455   6. Consume the next input character,
456      Let N = N + 1,
457      Let R = R * 16,
458      If CL is in set [0-9]
459        then let R = R + (CL - '0')
460        else let R = R + (CL - 'a') + 10;

462   7. Go to step 4;

464   8. If R < 0x100000 then go to step 10;

466   9. Let N = N + 1,
467      If (N > 6) or (C0 != 'w')
468        then return error status;

470   10. Let MASK be the bitwise complement of (16**N) - 1. Return
471       result R, MASK, and U.

473 4.2.2 Extended Decompression Algorithm

475   1. Let PREV = 0;

477   2. If there are no more input characters then terminate successfully;

479   3. Let C be the next input character;

481   4. If C == '-', append '-' to the result
482      string, consume the character, and go to step 2;

484   5. Let VPART, MASK, and U be the result of the modified extended
485      variable length hex decoding;

487   6. Let CU = (PREV bitwise-AND MASK) + VPART,
488      Let PREV = CU;

490   7. If U == 'Uppercase' then let CU = the corresponding upper case value
491      of CU;

493   8. Append CU to the result string and go to step 2.

495 4.2.3 Extended Reverse Transformation Algorithm

497   1. Break the string into dot-separated components and apply steps
498      2 through 4 to each component;

500   2. Remove the post-converted name prefix 'dq--' (see section 2.1);

502   3. Decompress the component using the extended decompression
503      algorithm described in section 4.2.2 above;

505   4. Concatenate the decoded segments with dot separators and return.

507 Note that DUDE decoding will return an error for input strings which do
508 not comply with RFC1035.

510 5. Security Considerations

512 Much of the security of the Internet relies on the DNS and any
513 change to the characteristics of the DNS may change the security of
514 much of the Internet. Therefore DUDE makes no changes to the DNS itself.

516 DUDE is designed so that distinct Unicode sequences map to distinct
517 domain name sequences (modulo the Unicode and DNS equivalence rules).
518 Therefore use of DUDE with DNS will not negatively affect security.

520 6. References

522 [IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
523 Proposals", draft-ietf-idn-compare;

525 [IDNRACE] Paul Hoffman, "RACE: Row-Based ASCII Compatible Encoding for
526 IDN", draft-ietf-idn-race;

528 [IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
529 draft-ietf-idn-requirement;

531 [IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
532 Internationalized Host Names", draft-ietf-idn-nameprep;

534 [IDNDUERST] M. Duerst, "Internationalization of Domain Names",
535 draft-duerst-dns-i18n;

537 [ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
538 technology -- Universal Multiple-Octet Coded Character Set (UCS) --
539 Part 1: Architecture and Basic Multilingual Plane. Five amendments and
540 a technical corrigendum have been published up to now. UTF-16 is
541 described in Annex Q, published as Amendment 1. 17 other amendments are
542 currently at various stages of standardization;

544 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
545 Requirement Levels", March 1997, RFC 2119;

547 [STD13] Paul Mockapetris, "Domain names - implementation and
548 specification", November 1987, STD 13 (RFC 1035);

550 [UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
551 3.0", ISBN 0-201-61633-5. Described at
552 .

554 A. Acknowledgements

556 The structure (and some of the structural text) of this document is
557 intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace-00)
558 by Mark Davis and Paul Hoffman.

560 B. IANA Considerations

562 There are no IANA considerations in this document.

564 C. Author Contact Information

566 Mark Welter
567 Brian W. Spolarich
568 WALID, Inc.
569 State Technology Park
570 2245 S. State St.
571 Ann Arbor, MI 48104
572 +1-734-822-2020

574 mwelter@walid.com
575 briansp@walid.com
576 -----BEGIN PGP SIGNATURE-----
577 Version: GnuPG v1.0.1 (GNU/Linux)
578 Comment: For info see http://www.gnupg.org

580 iD8DBQE6FZ/D/DkPcNgtD/0RAoswAKCUGBTSFJv96+Z+YnA8m47qrnheAgCeLQ6C
581 1+knyHluauC+66esCtPVoKU=
582 =hbT+
583 -----END PGP SIGNATURE-----