idnits 2.17.1

draft-ietf-idn-utf6-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 452 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 14 instances of too long lines in the document, the longest
     one being 4 characters in excess of 72.

  == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment.  If not, you may need to add the pre-RFC5378
     disclaimer.  (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 16, 2001) is 8380 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'IDNRACE' is mentioned on line 119, but not defined

  == Missing Reference: '0123456789abcdef' is mentioned on line 255, but not
     defined

  -- Looks like a reference, but probably isn't: '0123456789' on line 262

  == Unused Reference: 'IDNCOMP' is defined on line 380, but no explicit
     reference was found in the text

  == Unused Reference: 'IDNNAMEPREP' is defined on line 386, but no explicit
     reference was found in the text

  -- Possible downref: Normative reference to a draft: ref. 'IDNCOMP'

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNREQ'

  -- Possible downref: Normative reference to a draft: ref. 'IDNDUERST'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE3'


     Summary: 4 errors (**), 0 flaws (~~), 7 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    Internet Engineering Task Force (IETF)                       Mark Welter
2    INTERNET-DRAFT                                        Brian W. Spolarich
3    draft-ietf-idn-utf6-00                                       WALID, Inc.
4    November 16, 2000                                   Expires May 16, 2001

6             UTF-6 - Yet Another ASCII-Compatible Encoding for IDN

8    Status of this memo

10   This document is an Internet-Draft and is in full conformance with all
11   provisions of Section 10 of RFC2026.

13   Internet-Drafts are working documents of the Internet Engineering Task
14   Force (IETF), its areas, and its working groups. Note that other
15   groups may also distribute working documents as Internet-Drafts.

17   Internet-Drafts are draft documents valid for a maximum of six months
18   and may be updated, replaced, or obsoleted by other documents at any
19   time. It is inappropriate to use Internet-Drafts as reference
20   material or to cite them other than as "work in progress."

22   The list of current Internet-Drafts can be accessed at
23   http://www.ietf.org/ietf/1id-abstracts.txt

25   The list of Internet-Draft Shadow Directories can be accessed at
26   http://www.ietf.org/shadow.html.

28   The distribution of this document is unlimited.

30   Copyright (c) The Internet Society (2000). All Rights Reserved.

32   Abstract

34   This document describes a transformation method for representing
35   Unicode character codepoints in host name parts in a fashion that is
36   completely compatible with the current Domain Name System. It is
37   proposed as a potential candidate for an ASCII-Compatible Encoding (ACE)
38   for supporting the deployment of an internationalized Domain Name System.
39   The transformation method, an extension of the UTF-5 encoding proposed by
40   Duerst, provides a more efficient representation of typical Unicode
41   sequences while preserving simplicity and readability. This transformation
42   method is deployed as part of the current WALID multilingual domain name
43   system implementation, although that status should not necessarily influence
44   the evaluation of its merits as a candidate encoding method.

46   Table of Contents

48   1. Introduction
49      1.1 Terminology
50   2. Hostname Part Transformation
51      2.1 Post-Converted Name Prefix
52      2.2 Hostname Preparation
53      2.3 Definitions
54      2.4 UTF-6 Encoding
55          2.4.1 Variable Length Hex Encoding
56          2.4.2 UTF-6 Compression Algorithm
57          2.4.3 Forward Transformation Algorithm
58      2.5 UTF-6 Decoding
59          2.5.1 Variable Length Hex Decoding
60          2.5.2 UTF-6 Decompression Algorithm
61          2.5.3 Reverse Transformation Algorithm
62   3. Examples
63      3.1 'www.walid.com' (in Arabic)
64   4. Security Considerations
65   5. References

67   1. Introduction

69   UTF-6 describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
70   character set (whose character code assignments are synchronized
71   with Unicode [UNICODE3]), and the procedures for using this scheme
72   to transform host name parts containing Unicode character sequences
73   into sequences that are compatible with the current DNS protocol
74   [STD13]. As such, it satisfies the definition of a 'charset' as
75   defined in [IDNREQ].

77   1.1 Terminology

79   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
80   "MAY" in this document are to be interpreted as described in RFC 2119
81   [RFC2119].

83   Hexadecimal values are shown preceded with an "0x". For example,
84   "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
85   shown preceded with an "0b". For example, a nine-bit value might be
86   shown as "0b101101111".

88   Examples in this document use the notation from the Unicode Standard
89   [UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
90   may be represented as either "U+0061" or "LATIN SMALL LETTER A".

92   UTF-6 converts strings with internationalized characters into
93   strings of US-ASCII that are acceptable as host name parts in current
94   DNS host naming usage. The former are called "pre-converted" and the
95   latter are called "post-converted". This specification defines both
96   a forward and reverse transformation algorithm.

98   2. Hostname Part Transformation

100  According to [STD13], hostname parts must be case-insensitive, start and
101  end with a letter or digit, and contain only letters, digits, and the
102  hyphen character ("-"). This, of course, excludes most characters used
103  by non-English speakers, as well as many other characters in
104  the ASCII character repertoire. Further, domain name parts must be
105  63 octets or shorter in length.

107  2.1 Post-Converted Name Prefix

109  This document defines the string 'wq--' as a prefix to identify
110  UTF-6-encoded sequences. For the purposes of comparison in the IDN
111  Working Group activities, the 'wq--' prefix should be used solely to
112  identify UTF-6 sequences. However, should this document proceed beyond
113  draft status the prefix should be changed to whatever prefix, if any,
114  is the final consensus of the IDN working group.

116  Note that the prepending of a fixed identifier sequence is only one
117  mechanism for differentiating ASCII character encoded international
118  domain names from 'ordinary' domain names. One method, as proposed in
119  [IDNRACE], is to include a character prefix or suffix that does not
120  appear in any name in any zone file. A second method is to insert a
121  domain component which pushes off any international names one or more
122  levels deeper into the DNS hierarchy. There are trade-offs between
123  these two methods which are independent of the Unicode to ASCII
124  transcoding method finally chosen. We do not address the international
125  vs. 'ordinary' name differentiation issue in this paper.

127  2.2 Hostname Preparation

129  The hostname part is assumed to have at least one character disallowed
130  by [STD13], and to have been processed for logically equivalent
131  character mapping, filtering of disallowed characters (if any), and
132  compatibility composition/decomposition before presentation to the UTF-6
133  conversion algorithm.

135  While it is possible to invent a transcoding mechanism that relies
136  on certain Unicode characters being deemed illegal within domain names
137  and hence available to the transcoding mechanism for improving encoding
138  efficiency, we feel that such a proposal would complicate matters
139  excessively. We also believe that Unicode name preprocessing for
140  both name resolution and name registration should be considered as
141  separate, independent issues, which we will attempt to address in a
142  separate document.

144  2.3 Definitions

146  For clarity:

148     'integer' is an unsigned binary quantity;
149     'byte' is an 8-bit integer quantity;
150     'nibble' is a 4-bit integer quantity.

152  2.4 UTF-6 Encoding

154  The idea behind this scheme was to improve on the UTF-5 transformation
155  algorithm described in [IDNDUERST] by providing a straightforward
156  compression mechanism. UTF-6 defines a compression mechanism by
157  identifying identical leading byte or nibble values in the pre-converted
158  string, and using the length of this leading value to select a mask which
159  can be applied to the pre-converted string. The resulting post-converted
160  string preserves the simplicity and readability of UTF-5 while
161  enabling longer sequences to be encoded into a single host name part.

163  2.4.1 Variable Length Hex Encoding

165  The variable length hex encoding algorithm was introduced by Duerst in
166  [IDNDUERST]. It encodes an integer value in a slight modification of
167  traditional hexadecimal notation, the difference being that the most
168  significant digit is represented with an alternate set of "digits"
169  -- 'g' through 'v' are used to represent 0 through 15. The result is a
170  variable length encoding which can efficiently represent integers of
171  arbitrary length.

173  The variable length nibble encoding of an integer, C, is defined
174  as follows:

176     1. Skip over leading non-significant zero nibbles to find I,
177        the first significant nibble of C;

179     2. Emit the Ith character of the set [ghijklmnopqrstuv];

181     3. Continue from most to least significant, encoding each remaining
182        nibble J by emitting the Jth character of the set [0123456789abcdef].

184  Examples:

186     0x1f4c  is encoded as "hf4c"
187     0x0624  is encoded as "m24"
188     0x0000  is encoded as "g"
189     'n'     a single character in single quotes stands for the
190             Unicode code point for that character.
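The three steps of section 2.4.1 can be sketched in a few lines of Python. This is an illustrative sketch, not part of the draft: the function name `vlhex_encode` and the constant `LEAD` are hypothetical, and the lead alphabet is assumed to be the full sixteen-character set 'g' through 'v' described in the prose of this section.

```python
# Illustrative sketch of section 2.4.1; all names here are hypothetical.
LEAD = "ghijklmnopqrstuv"  # alternate "digits" for the most significant nibble


def vlhex_encode(value: int) -> str:
    """Return the variable length hex encoding of a non-negative integer."""
    if value < 0:
        raise ValueError("value must be an unsigned integer")
    # format(..., "x") already drops non-significant leading zero nibbles
    # (step 1) and yields "0" for zero, so the first digit is nibble I.
    digits = format(value, "x")
    # Step 2: map the first nibble into the 'g'..'v' set.
    # Step 3: the remaining nibbles stay ordinary lowercase hex digits.
    return LEAD[int(digits[0], 16)] + digits[1:]


if __name__ == "__main__":
    for value in (0x1F4C, 0x0624, 0x0000, ord("n")):
        print(hex(value), "->", vlhex_encode(value))
```

The first three values printed are the ones listed under "Examples" above, so the sketch can be compared against the draft's own figures.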

192  2.4.2 UTF-6 Compression Algorithm

194  UTF-6 improves on the UTF-5 encoding by providing compression, which
195  enables encoding of a larger number of characters in each hostname
196  part. The compression algorithm is defined as follows:

198     1. Set the mask to 0xFFFF;

200     2. If the number of non '-' characters is less than 2, proceed to
201        step 5;

203     3. If the most significant byte of every non '-' character is the
204        same value:

206        3a. Set HB to this value;
207        3b. Emit 'Y';
208        3c. Emit the variable length hex encoding of HB;
209        3d. Set the mask to 0x00FF;
210        3e. Proceed to step 5.

212     4. If the most significant nibble of every non '-' character is the
213        same value:

215        4a. Set HN to this value;
216        4b. Emit 'Z';
217        4c. Emit the variable length hex encoding of HN;
218        4d. Set the mask to 0x0FFF.

220     5. For each input character:

222        5a. Set HN to the result of the bitwise AND of the input
223            character and the mask;
224        5b. Emit the variable length nibble encoding of HN.

226  2.4.3 Forward Transformation Algorithm

228  The UTF-6 transformation algorithm accepts a string in UTF-16
229  [ISO10646] format as input. The encoding algorithm is as follows:

231     1. Break the hostname string into dot-separated hostname parts.
232        For each hostname part, perform steps 2 and 3 below;

234     2. Compress the component using the method described in section
235        2.4.2 above, and encode using the encoding described in section 2.4.1;

237     3. Prepend the post-converted name prefix 'wq--' (see section 2.1
238        above) to the resulting string.

240  2.5 UTF-6 Decoding

242  2.5.1 Variable Length Hex Decoding

244     1. Let N be the lower case of the first input character;

246        If N is not in set [ghijklmnopqrstuv] return error,
247        else consume the input character;

249     2. Let R = N - 'g';

251     3. If another input character exists,
252        then let N be the lower case of the next input character,
253        else go to Step 9;

255     4. If N is not in the set [0123456789abcdef], go to Step 9;

257     5. Let N = the lower case of the next input character and consume
258        the input character;

260     6. Let R = R * 16;

262     7. If N is in set [0123456789],
263        then let R = R + (N - '0'),
264        else let R = R + (N - 'a') + 10;

266     8. Go to step 3;

268     9. Return decoded result R.

270  2.5.2 UTF-6 Decompression Algorithm

272     1. Let N be the lower case of the first input character;

274     2. If N != 'y' and N != 'z',

276        2a. Let CPART be 0;
277        2b. Let VMAX be 0xFFFF;

279        This is the no-compression case;

281     3. If N == 'y',

283        3a. Let M be the variable length hex decoding of the next
284            character;
285        3b. Let CPART be the result of M * 0x0100;
286        3c. Let VMAX be 0x00FF;
287        3d. Continue to Step 5;

289     4. If N == 'z',

291        4a. Let M be the variable length hex decoding of the next
292            character;
293        4b. Let CPART be the result of M * 0x1000;
294        4c. Let VMAX be 0x0FFF;
295        4d. Continue to Step 5;

297     5. While another input character exists, let N be the lower case of
298        the next input character, and do the following:

300        5a. If N == '-' consume the character and
301            then append '-' to the result string,
302            else let VPART be the next variable hex decoded value;
303        5b. If VPART > VMAX, return error,
304            else append CPART + VPART to the result string;

306     6. Return the result string.

308  2.5.3 Reverse Transformation Algorithm

310     1. Break the string into dot-separated components and apply Steps
311        2 through 4 to each component:

313     2. Check for legality (in terms of RFC1035 permitted characters) and
314        return error status if illegal,

316     3. Remove the post-converted name prefix 'wq--' (see Section 2.1),

318     4. Decompress the component using the decompression algorithm
319        described above.

321     5. Concatenate the decoded segments with dot separators and return.

323  3. Examples

325  The examples below illustrate the encoding algorithm and provide
326  comparisons to alternate encoding schemes. UTF-5 sequences are
327  prefixed with '----', as no ACE prefix was defined for that encoding.
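The forward and reverse transformations of sections 2.4 and 2.5 can be sketched end to end and compared with the UTF-6 line of example 3.1. This is a sketch under stated assumptions, not a reference implementation. All function names (`encode_part`, `decode_part`, `to_utf6`, `from_utf6`) are hypothetical. The compression markers are emitted in lower case to match the example output, although section 2.4.2 writes 'Y' and 'Z'. A '-' in the input is assumed to pass through unchanged, as decompression step 5a suggests. Only 16-bit code units are handled, and the hostname preparation of section 2.2 and the RFC1035 legality check of section 2.5.3 are left out.

```python
# Sketch of sections 2.4.2-2.4.3 and 2.5.1-2.5.3; names are hypothetical.
LEAD = "ghijklmnopqrstuv"
HEX = "0123456789abcdef"
PREFIX = "wq--"


def vlhex(value: int) -> str:
    digits = format(value, "x")
    return LEAD[int(digits[0], 16)] + digits[1:]


def encode_part(part: str) -> str:
    units = [ord(ch) for ch in part]
    if any(u > 0xFFFF for u in units):
        raise ValueError("this sketch handles 16-bit code units only")
    significant = [u for u in units if u != ord("-")]
    out, mask = [], 0xFFFF
    if len(significant) >= 2:
        if len({u >> 8 for u in significant}) == 1:      # shared high byte
            out.append("y" + vlhex(significant[0] >> 8))
            mask = 0x00FF
        elif len({u >> 12 for u in significant}) == 1:   # shared high nibble
            out.append("z" + vlhex(significant[0] >> 12))
            mask = 0x0FFF
    for u in units:
        # Assumption: '-' is copied literally rather than hex encoded.
        out.append("-" if u == ord("-") else vlhex(u & mask))
    return PREFIX + "".join(out)


def decode_part(label: str) -> str:
    text = label.lower()
    if not text.startswith(PREFIX):
        raise ValueError("missing 'wq--' prefix")
    text = text[len(PREFIX):]
    pos, base, vmax = 0, 0, 0xFFFF

    def read_vlhex() -> int:
        nonlocal pos
        if pos >= len(text) or text[pos] not in LEAD:
            raise ValueError("malformed variable length hex value")
        value = LEAD.index(text[pos])
        pos += 1
        while pos < len(text) and text[pos] in HEX:
            value = value * 16 + HEX.index(text[pos])
            pos += 1
        return value

    if text[:1] == "y":
        pos = 1
        base, vmax = read_vlhex() << 8, 0x00FF
    elif text[:1] == "z":
        pos = 1
        base, vmax = read_vlhex() << 12, 0x0FFF
    chars = []
    while pos < len(text):
        if text[pos] == "-":
            chars.append("-")
            pos += 1
            continue
        value = read_vlhex()
        if value > vmax:
            raise ValueError("value exceeds the range allowed by the mask")
        chars.append(chr(base + value))
    return "".join(chars)


def to_utf6(hostname: str) -> str:
    return ".".join(encode_part(p) for p in hostname.split("."))


def from_utf6(hostname: str) -> str:
    return ".".join(decode_part(p) for p in hostname.split("."))


if __name__ == "__main__":
    name = "\u0645\u0648\u0642\u0639.\u0648\u0644\u064a\u062f.\u0634\u0631\u0643\u0629"
    encoded = to_utf6(name)
    print(encoded)
    print(from_utf6(encoded) == name)
```

Because every encoded value starts with a character from 'g' through 'v' and continues only with hex digits, the decoder can find value boundaries without separators. The sketch relies on this when reading the shared byte or nibble that follows 'y' or 'z'.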

329  3.1 'www.walid.com' (in Arabic):

331     UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
332             U+0634 U+0631 U+0643 U+0629

334     UTF-6:  wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9

336     UTF-5:  ----m45m48m42m39.----m48m44m4am2f.----m34m31m43m29

338     RACE:   bq--azcuqqrz.bq--azeeisrp.bq--ay2dcqzj

340     LACE:   bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe

342  3.2 Mixed Katakana and Hiragana (SOREZORENOBASHO)

344     UTF-16: U+305D U+308C U+305E U+308C U+306E U+5834 U+6240

346     UTF-6:

348     UTF-5:

350     RACE:   bq--4ayf3memgbpdbdbqnzmdiysa

352     LACE:   bq--auyf4dc7rrxacwbuafrea

354  3.3 Currently Disallowed ASCII Characters ($OneBillionDollars!):

356     UTF-16: U+0024 U+004F U+006E U+0065 U+0042 U+0069 U+006C U+006C
357             U+0069 U+006F U+006E U+0044 U+006F U+006C U+006C U+0061
358             U+0072 U+0073 U+0021

360     UTF-6:

362     UTF-5:

364     RACE:   bq--aase74tfijuwy4djn6xei44mnrqxe5zb

366     LACE:   bq--cmacit4omvbgs4dmnfxw5rdpnrwgc5ttee

368  4. Security Considerations

370  Much of the security of the Internet relies on the DNS and any
371  change to the characteristics of the DNS may change the security of
372  much of the Internet. Therefore UTF-6 makes no changes to the DNS itself.

374  UTF-6 is designed so that distinct Unicode sequences map to distinct
375  domain name sequences (modulo the Unicode and DNS equivalence rules).
376  Therefore use of UTF-6 with DNS will not negatively affect security.

378  5. References

380  [IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
381  Proposals", draft-ietf-idn-compare.

383  [IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
384  draft-ietf-idn-requirement.

386  [IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
387  Internationalized Host Names", draft-ietf-idn-nameprep

389  [IDNDUERST] M. Duerst, "Internationalization of Domain Names",
390  draft-duerst-dns-i18n.

392  [ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
393  technology -- Universal Multiple-Octet Coded Character Set (UCS) --
394  Part 1: Architecture and Basic Multilingual Plane. Five amendments and
395  a technical corrigendum have been published up to now. UTF-16 is
396  described in Annex Q, published as Amendment 1. 17 other amendments are
397  currently at various stages of standardization.

399  [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
400  Requirement Levels", March 1997, RFC 2119.

402  [STD13] Paul Mockapetris, "Domain names - implementation and
403  specification", November 1987, STD 13 (RFC 1035).

405  [UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
406  3.0", ISBN 0-201-61633-5. Described at
407  .

409  A. Acknowledgements

411  The structure (and some of the structural text) of this document is
412  intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace)
413  by Mark Davis and Paul Hoffman.

415  The 'SOREZORENOBASHO' example was taken from the draft-ietf-idn-brace
416  draft by Adam Costello.

418  B. IANA Considerations

420  There are no IANA considerations in this document.

422  C. Author Contact Information

424  Mark Welter
425  Brian W. Spolarich
426  WALID, Inc.
427  State Technology Park
428  2245 S. State St.
429  Ann Arbor, MI 48104
430  +1-734-822-2020

432  mwelter@walid.com
433  briansp@walid.com
434  -----BEGIN PGP SIGNATURE-----
435  Version: GnuPG v1.0.1 (GNU/Linux)
436  Comment: For info see http://www.gnupg.org

438  iD8DBQE6FaCt/DkPcNgtD/0RAtRmAJwISVeJGY6qmll71mL+Axc51o8iIwCgmNt/
439  86RcQh1JQYWTux+8FS+XvMU=
440  =bxiv
441  -----END PGP SIGNATURE-----