idnits 2.17.1

draft-ietf-idn-race-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date. The document expiration date should appear on
     the first and last page.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 532 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 2 instances of too long lines in the document, the longest one
     being 8 characters in excess of 72.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 364 has weird spacing: '... bits char...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. (The document does seem to have the
     reference to RFC 2119 which the ID-Checklist requires).

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 16, 2000) is 8593 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Normative reference to a draft: ref. 'IDNComp'

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNReq'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode3'

     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    Internet Draft                                            Paul Hoffman
2    draft-ietf-idn-race-02.txt                                  IMC & VPNC
3    October 16, 2000
4    Expires in six months

6    RACE: Row-based ASCII Compatible Encoding for IDN

8    Status of this memo

10   This document is an Internet-Draft and is in full conformance with all
11   provisions of Section 10 of RFC2026.

13   Internet-Drafts are working documents of the Internet Engineering Task
14   Force (IETF), its areas, and its working groups. Note that other
15   groups may also distribute working documents as Internet-Drafts.

17   Internet-Drafts are draft documents valid for a maximum of six months
18   and may be updated, replaced, or obsoleted by other documents at any
19   time. It is inappropriate to use Internet-Drafts as reference
20   material or to cite them other than as "work in progress."

22   The list of current Internet-Drafts can be accessed at
23   http://www.ietf.org/ietf/1id-abstracts.txt

25   The list of Internet-Draft Shadow Directories can be accessed at
26   http://www.ietf.org/shadow.html.

28   Abstract

30   This document describes a transformation method for representing
31   non-ASCII characters in host name parts in a fashion that is completely
32   compatible with the current DNS. It is a potential candidate for an
33   ASCII-Compatible Encoding (ACE) for internationalized host names, as
34   described in the comparison document from the IETF IDN Working Group.
35   This method is based on the observation that many internationalized
36   host name parts will have all their characters in one row of the ISO
37   10646 repertoire.

39   1. Introduction

41   There is a strong world-wide desire to use characters other than plain
42   ASCII in host names. Host names have become the equivalent of business
43   or product names for many services on the Internet, so there is a need
44   to make them usable by people whose native scripts are not representable
45   by ASCII. The requirements for internationalizing host names are
46   described in the IDN WG's requirements document, [IDNReq].

48   The IDN WG's comparison document [IDNComp] describes three potential
49   main architectures for IDN: arch-1 (just send binary), arch-2 (send
50   binary or ACE), and arch-3 (just send ACE). RACE is an ACE, called
51   Row-based ACE or RACE, that can be used with protocols that match arch-2
52   or arch-3. RACE specifies an ACE format as specified in ace-1 in
53   [IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
54   [IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
55   beginning of the name part).

57   Author's note: although earlier drafts of this document supported the
58   ideas in arch-3, I no longer support that idea and instead only support
59   arch-2. Of course, someone else might write an IDN proposal that matches
60   arch-3 and use RACE as the protocol.

62   In formal terms, RACE describes a character encoding scheme of the ISO
63   10646 [ISO10646] coded character set and the rules for using that scheme
64   in the DNS. As such, it could also be called a "charset" as defined in
65   [IDNReq].

67   The RACE protocol has the following features:

69   - There is exactly one way to convert internationalized host parts to
70   and from RACE parts. Host name part uniqueness is preserved.

72   - Host parts that have no international characters are not changed.

74   - Names using RACE can include more internationalized characters than
75   with other ACE protocols that have been suggested to date. Names in the
76   Han, Yi, Hangul syllables, or Ethiopic scripts can have up to 17
77   characters, and names in most other scripts can have up to 35
78   characters. Further, a name that consists of characters from one
79   non-Latin script but also contains some Latin characters such as digits
80   or hyphens can have close to 33 characters.

82   It is important to note that the following sections contain many
83   normative statements with "MUST" and "MUST NOT". Any implementation that
84   does not follow these statements exactly is likely to cause damage to
85   the Internet by creating non-unique representations of host names.

87   1.1 Terminology

89   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
90   "MAY" in this document are to be interpreted as described in RFC 2119
91   [RFC2119].

93   Hexadecimal values are shown preceded with an "0x". For example,
94   "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
95   shown preceded with an "0b". For example, a nine-bit value might be
96   shown as "0b101101111".

98   Examples in this document use the notation from the Unicode Standard
99   [Unicode3] as well as the ISO 10646 names. For example, the letter "a"
100  may be represented as either "U+0061" or "LATIN SMALL LETTER A".

102  RACE converts strings with internationalized characters into
103  strings of US-ASCII that are acceptable as host name parts in current
104  DNS host naming usage. The former are called "pre-converted" and the
105  latter are called "post-converted".

107  1.2 IDN summary

109  Using the terminology in [IDNComp], RACE specifies an ACE format as
110  specified in ace-1. Further, it specifies an identifying mechanism for
111  ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
112  of the name part).

114  RACE has the following length characteristics. In this list, "row" means
115  a row from ISO 10646.

117  - If the characters in the input all come from the same row, up to 35
118  characters per name part are allowed.

120  - If the characters in the input come from two or more rows, neither of
121  which is row 0, up to 17 characters per name part are allowed.

123  - If the characters in the input come from two rows, one of which is row
124  0, between 17 and 33 characters per name part are allowed.

126  2. Host Part Transformation

128  According to [STD13], host parts must be case-insensitive, start and
129  end with a letter or digit, and contain only letters, digits, and the
130  hyphen character ("-"). This, of course, excludes any internationalized
131  characters, as well as many other characters in the ASCII character
132  repertoire. Further, domain name parts must be 63 octets or shorter in
133  length.

135  2.1 Name tagging

137  All post-converted name parts that contain internationalized characters
138  begin with the string "bq--". (Of course, because host name parts are
139  case-insensitive, this might also be represented as "Bq--" or "bQ--" or
140  "BQ--".) The string "bq--" was chosen because it is extremely unlikely
141  to exist in host parts before this specification was produced. As a
142  historical note, in late August 2000, none of the second-level host name
143  parts in any of the .com, .edu, .net, and .org top-level domains began
144  with "bq--"; there are many tens of thousands of other strings of three
145  characters followed by a hyphen that have this property and could be
146  used instead. The string "bq--" will change to other strings with the
147  same properties in future versions of this draft.

149  Note that a zone administrator might still choose to use "bq--" at the
150  beginning of a host name part even if that part does not contain
151  internationalized characters. Zone administrators SHOULD NOT create host
152  part names that begin with "bq--" unless those names are post-converted
153  names. Creating host part names that begin with "bq--" but that are not
154  post-converted names may cause two distinct problems. Some display
155  systems, after converting the post-converted name part back to an
156  internationalized name part, might display the name parts in a
157  possibly-confusing fashion to users. More seriously, some resolvers,
158  after converting the post-converted name part back to an
159  internationalized name part, might reject the host name if it contains
160  illegal characters.

162  2.2 Converting an internationalized name to an ACE name part

164  To convert a string of internationalized characters into an ACE name
165  part, the following steps MUST be performed in the exact order of the
166  subsections given here.

168  If a name part consists exclusively of characters that conform to the
169  host name requirements in [STD13], the name MUST NOT be converted to
170  RACE. That is, a name part that can be represented without RACE MUST NOT
171  be encoded using RACE. This absolute requirement prevents there from
172  being two different encodings for a single DNS host name.
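
The prohibition on converting a name part that already conforms to
[STD13] amounts to a check made before any conversion starts. The
following is a minimal Python sketch of such a check, assuming only the
rules summarized in section 2 (letters, digits, and hyphen; a letter or
digit at each end; 63 octets or shorter). The function name
is_std13_name_part is hypothetical and does not come from this draft.

```python
import string

# Hypothetical helper, not defined by the draft. It applies the host name
# rules summarized in section 2 to a single name part.
LETTERS_DIGITS = set(string.ascii_letters + string.digits)


def is_std13_name_part(part: str) -> bool:
    """Return True if the part is already a legal host name part.

    When this returns True, section 2.2 says the part MUST NOT be
    converted to RACE.
    """
    if not 1 <= len(part) <= 63:
        return False
    # The first and last characters must be a letter or a digit.
    if part[0] not in LETTERS_DIGITS or part[-1] not in LETTERS_DIGITS:
        return False
    # Interior characters may also be the hyphen.
    return all(ch in LETTERS_DIGITS or ch == "-" for ch in part)
```

A caller would run this check first and start the steps in 2.2.1 only
when it returns False.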

174  If any checking for prohibited name parts, prohibited characters,
175  case-folding, or canonicalization is to be done, it MUST be done
176  before doing the conversion to an ACE name part.

178  The input name string consists of characters from the ISO 10646
179  character set in big-endian UTF-16 encoding. This is the pre-converted
180  string.

182  Characters outside the first plane of characters (that is, characters
183  above U+FFFF) MUST be represented using surrogates, as
184  described in the UTF-16 description in ISO 10646.

186  2.2.1 Compress the pre-converted string

188  The entire pre-converted string MUST be compressed using the compression
189  algorithm specified in section 2.4. The result of this step is the
190  compressed string.

192  2.2.2 Check the length of the compressed string

194  The compressed string MUST be 36 octets or shorter. If the compressed
195  string is 37 octets or longer, the conversion MUST stop with an error.

197  2.2.3 Encode the compressed string with Base32

199  The compressed string MUST be converted using the Base32 encoding
200  described in section 2.5. The result of this step is the encoded string.

202  2.2.4 Prepend "bq--" to the encoded string and finish

204  Prepend the characters "bq--" to the encoded string. This is the host
205  name part that can be used in DNS resolution.

207  2.3 Converting a host name part to an internationalized name

209  The input string for conversion is a valid host name part. Note that if
210  any checking for prohibited name parts (such as ones that are already
211  legal DNS name parts), prohibited characters, case-folding, or
212  canonicalization is to be done, it MUST be done after doing the
213  conversion from an ACE name part. (Previous versions of this draft
214  specified these steps.)

216  2.3.1 Strip the "bq--"

218  The input string MUST begin with the characters "bq--". If it does not,
219  the conversion MUST stop with an error. Otherwise, remove the characters
220  "bq--" from the input string. The result of this step is the stripped
221  string.

223  2.3.2 Decode the stripped string with Base32

225  The entire stripped string MUST be checked to see if it is valid Base32
226  output. The entire stripped string MUST be changed to all lower-case
227  letters and digits. If any resulting characters are not in Table 1, the
228  conversion MUST stop with an error; the input string is the
229  post-converted string. Otherwise, the entire resulting string MUST be
230  converted to a binary format using the Base32 decoding described in
231  section 2.5. The result of this step is the decoded string.

233  2.3.3 Decompress the decoded string

235  The entire decoded string MUST be converted to ISO 10646 characters
236  using the decompression algorithm described in section 2.4. The result
237  of this is the internationalized string.

239  2.4 Compression algorithm

241  The basic method for compression is to reduce a full string that
242  consists of characters all from a single row of the ISO 10646
243  repertoire, or all from a single row plus from row 0, to as few octets
244  as possible. Any full string that has characters that come from two
245  rows, neither of which is row 0, or three or more rows, has all the
246  octets of the input string in the output string.

248  If the string comes from only one row, compression is to one octet per
249  character in the string. If the string comes from only one row other
250  than row 0, but also has characters only from row 0, compression is to
251  one octet for the characters from the non-0 row and two octets for the
252  characters from row 0. Otherwise, there is no compression and the output
253  is a string that has two octets per input character.

255  The compressed string always has a one-octet header. If the string comes
256  from only one row, the header octet is the upper octet of the
257  characters. If the string comes from only one row other than row 0, but
258  also has characters only from row 0, the header octet is the upper octet
259  of the characters from the non-0 row. Otherwise, the header octet is
260  0xD8, which is the upper octet of a surrogate pair. Design note: It is
261  impossible to have a legal stream of UTF-16 characters that has all the
262  upper octets being 0xD8 because a character whose upper octet is 0xD8
263  must be followed by one whose upper octet is in the range 0xDC through
264  0xDF.

266  Although the two-octet mode limits the number of characters in a RACE
267  name part to 17, this is still generally enough for almost all names in
268  almost all scripts. Also, this limit is close to the limits set by other
269  encoding proposals.

271  Note that the compression and decompression rules MUST be followed
272  exactly. This requirement prevents a single host name part from having
273  two encodings. Thus, for any input to the algorithm, there is only one
274  possible output. An implementation cannot choose to use one-octet mode or
275  two-octet mode using anything other than the logic given in this
276  section.

278  2.4.1 Compressing a string

280  The input string is in UTF-16 encoding with no byte order mark.

282  Design note: No checking is done on the input to this algorithm. It is
283  assumed that all checking for valid ISO 10646 characters has already
284  been done by a previous step in the conversion process.

286  Design note: In step 5, 0xFF was chosen as the escape character because
287  it appears in the fewest number of scripts in ISO 10646, and therefore
288  the "escaped escape" will be needed the least. 0x99 was chosen as the
289  second octet for the "escaped escape" because the character U+0099 has
290  no value, and is not even used as a control character in the C1 controls
291  or in ISO 6429.

293  1) Read each pair of octets in the input stream, comparing the upper
294  octet of each. If all of the upper octets (called U1) are the same, go
295  to step 4.

297  2) Read each pair of octets in the input stream, comparing the upper
298  octet of each. If all of the upper octets are either 0 or one single
299  other value (called U1), go to step 4.

301  3) Output 0xD8, followed by the entire input stream. Finish.

303  4) Output U1.

305  5) If you are at the end of the input string, finish. Otherwise, read
306  the next octet, called U2, and the octet after that, called N1.

308  6) If U2 is equal to U1, and N1 is not equal to 0xFF, output N1, and go
309  to step 5.

311  7) If U2 is equal to U1, and N1 is equal to 0xFF, output 0xFF followed
312  by 0x99, and go to step 5.

314  8) Output 0xFF followed by N1. Go to step 5.

316  2.4.2 Decompressing a string

318  1) Read the first octet of the input string. Call the value of the first
319  octet U1. If U1 is 0xD8, go to step 7.

321  2) If you are at the end of the input string, finish. Otherwise, read
322  the next octet in the input string, called N1. If N1 is 0xFF, go to step
323  4.

325  3) Output U1 followed by N1. Go to step 2.

327  4) If you are at the end of the input string, stop with an error.

329  5) Read the next octet of the input string, called N1. If N1 is 0x99,
330  output U1 followed by 0xFF, and go to step 2.

332  6) Output 0x00 followed by N1. Go to step 2.

334  7) Read the rest of the input stream and put it in the output stream.
335  Finish.

337  2.4.3 Compression examples

339  For the input string of U+012E U+0110 U+014A, all characters are in
340  the same row, 0x01. Thus, the output is 0x012E104A.

342  For the input string of U+012E U+00D0 U+014A, the characters are all
343  in row 0x01 or row 0x00. Thus, the output is 0x012EFFD04A.

345  For the input string of U+1290 U+12FF U+120C, the characters are all
346  in row 0x12. Thus, the output is 0x1290FF990C.

348  For the input string of U+012E U+00D0 U+24C3, the characters are
349  from two rows other than 0x00. Thus, the output is 0xD8012E00D024C3.
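
The numbered steps in sections 2.4.1 and 2.4.2 can be sketched in Python
as shown here. This is an illustration under stated assumptions, not a
reference implementation: the function names compress and decompress are
hypothetical, the input to compress is assumed to be non-empty big-endian
UTF-16 with no byte order mark, and no validity checking is done beyond
what the steps themselves require.

```python
def compress(utf16be: bytes) -> bytes:
    """Follow steps 1-8 of section 2.4.1 (assumes non-empty UTF-16BE input)."""
    if not utf16be or len(utf16be) % 2:
        raise ValueError("expected a non-empty, even number of octets")
    uppers = set(utf16be[0::2])          # upper octet of every character
    non_zero = uppers - {0}
    if len(uppers) == 1:                 # step 1: a single row
        u1 = next(iter(uppers))
    elif len(non_zero) == 1:             # step 2: row 0 plus one other row
        u1 = next(iter(non_zero))
    else:                                # step 3: no compression
        return b"\xd8" + utf16be
    out = bytearray([u1])                # step 4: header octet
    for u2, n1 in zip(utf16be[0::2], utf16be[1::2]):   # step 5
        if u2 == u1 and n1 != 0xFF:
            out.append(n1)               # step 6
        elif u2 == u1:
            out += b"\xff\x99"           # step 7: the "escaped escape"
        else:
            out += bytes([0xFF, n1])     # step 8: a character from row 0
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Follow steps 1-7 of section 2.4.2 and return UTF-16BE octets."""
    if not data:
        raise ValueError("expected at least the header octet")
    u1, rest = data[0], data[1:]
    if u1 == 0xD8:                       # steps 1 and 7: uncompressed
        return rest
    out = bytearray()
    i = 0
    while i < len(rest):                 # step 2
        n1 = rest[i]
        i += 1
        if n1 != 0xFF:
            out += bytes([u1, n1])       # step 3
            continue
        if i == len(rest):               # step 4
            raise ValueError("escape octet at end of input")
        n1 = rest[i]                     # step 5
        i += 1
        if n1 == 0x99:
            out += bytes([u1, 0xFF])
        else:
            out += bytes([0x00, n1])     # step 6
    return bytes(out)
```

Applying decompress to the four outputs listed in section 2.4.3 yields
the UTF-16 code units 012E 0110 014A, 012E 00D0 014A, 1290 12FF 120C, and
012E 00D0 24C3, and compress maps each of those sequences back to the
listed output.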

351  2.5 Base32

353  In order to encode non-ASCII characters in DNS-compatible host name parts,
354  they must be converted into legal characters. This is done with Base32
355  encoding, described here.

357  Table 1 shows the mapping between input bits and output characters in
358  Base32. Design note: the digits used in Base32 are "2" through "7"
359  instead of "0" through "5" in order to avoid digits "0" and "1". This
360  helps reduce errors for users who are entering a Base32 stream and may
361  misinterpret a "0" for an "O" or a "1" for an "l".

363  Table 1: Base32 conversion
364  bits   char  hex     bits   char  hex
365  00000   a    0x61    10000   q    0x71
366  00001   b    0x62    10001   r    0x72
367  00010   c    0x63    10010   s    0x73
368  00011   d    0x64    10011   t    0x74
369  00100   e    0x65    10100   u    0x75
370  00101   f    0x66    10101   v    0x76
371  00110   g    0x67    10110   w    0x77
372  00111   h    0x68    10111   x    0x78
373  01000   i    0x69    11000   y    0x79
374  01001   j    0x6a    11001   z    0x7a
375  01010   k    0x6b    11010   2    0x32
376  01011   l    0x6c    11011   3    0x33
377  01100   m    0x6d    11100   4    0x34
378  01101   n    0x6e    11101   5    0x35
379  01110   o    0x6f    11110   6    0x36
380  01111   p    0x70    11111   7    0x37

382  2.5.1 Encoding octets as Base32

384  The input is a stream of octets. However, the octets are then treated
385  as a stream of bits.

387  Design note: The assumption that the input is a stream of octets
388  (instead of a stream of bits) was made so that no padding was needed.
389  If you are reusing this algorithm for a stream of bits, you must add a
390  padding mechanism in order to differentiate different lengths of input.

392  1) Set the read pointer to the beginning of the input bit stream.

394  2) Look at the five bits after the read pointer. If there are not five
395  bits, go to step 5.

397  3) Look up the value of the set of five bits in the bits column of
398  Table 1, and output the character from the char column (whose hex value
399  is in the hex column).

401  4) Move the read pointer five bits forward. If the read pointer is at
402  the end of the input bit stream (that is, there are no more bits in the
403  input), stop. Otherwise, go to step 2.

405  5) Pad the bits seen until there are five bits.

407  6) Look up the value of the set of five bits in the bits column of
408  Table 1, and output the character from the char column (whose hex value
409  is in the hex column).

411  2.5.2 Decoding Base32 as octets

413  The input is octets in network byte order. The input octets MUST be
414  values from the second column in Table 1.

416  1) Set the read pointer to the beginning of the input octet stream.

418  2) Look up the character value of the octet in the char column (or hex
419  value in hex column) of Table 1, and output the five bits from the bits
420  column.

422  3) Move the read pointer one octet forward. If the read pointer is at
423  the end of the input octet stream (that is, there are no more octets in
424  the input), stop. Otherwise, go to step 2.

426  2.5.3 Base32 example

428  Assume you want to encode the value 0x3a270f93. The bit string is:

430  3   a    2   7    0   f    9   3
431  00111010 00100111 00001111 10010011

433  Broken into chunks of five bits, this is:

435  00111 01000 10011 10000 11111 00100 11

437  Padding is added to make the last chunk five bits:

439  00111 01000 10011 10000 11111 00100 11000

441  The output of encoding is:

443  00111 01000 10011 10000 11111 00100 11000
444    h     i     t     q     7     e     y
445  or "hitq7ey".

447  3. Security Considerations

449  Much of the security of the Internet relies on the DNS. Thus, any
450  change to the characteristics of the DNS can change the security of
451  much of the Internet. Thus, RACE makes no changes to the DNS
452  itself.

454  Host names are used by users to connect to Internet servers. The
455  security of the Internet would be compromised if a user entering a
456  single internationalized name could be connected to different servers
457  based on different interpretations of the internationalized host
458  name.

460  RACE is designed so that every internationalized host name part
461  can be represented as one and only one DNS-compatible string. If there
462  is any way to follow the steps in this document and get two or more
463  different results, it is a severe and fatal error in the protocol.

465  4. References

467  [IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name Proposals",
468  draft-ietf-idn-compare.

470  [IDNReq] James Seng, "Requirements of Internationalized Domain Names",
471  draft-ietf-idn-requirement.

473  [ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
474  technology -- Universal Multiple-Octet Coded Character Set (UCS) --
475  Part 1: Architecture and Basic Multilingual Plane. Five amendments and
476  a technical corrigendum have been published up to now. UTF-16 is
477  described in Annex Q, published as Amendment 1. 17 other amendments are
478  currently at various stages of standardization. [[[ THIS REFERENCE
479  NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

481  [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
482  Requirement Levels", March 1997, RFC 2119.

484  [STD13] Paul Mockapetris, "Domain names - implementation and
485  specification", November 1987, STD 13 (RFC 1035).

487  [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
488  3.0", ISBN 0-201-61633-5. Described at
489  .

491  A. Acknowledgements

493  Mark Davis contributed many ideas to the initial draft of this document.
494  Graham Klyne and Martin Duerst offered technical comments on the
495  algorithms used. GIM Gyeongseog and Pongtorn Jentaweepornkul helped fix
496  technical errors in early drafts.

498  Base32 is quite obviously inspired by the tried-and-true Base64
499  Content-Transfer-Encoding from MIME.

501  B. Changes from Versions -01 to -02 of this Draft

503  Removed section 1.3 (open issues) because no one said anything
504  in support of either proposal.

506  Added the prohibition on encoding a string that is already in
507  STD13 format in section 2.2.

509  C. IANA Considerations

511  There are no IANA considerations in this document.

513  D. Author Contact Information

515  Paul Hoffman
516  Internet Mail Consortium and VPN Consortium
517  127 Segre Place
518  Santa Cruz, CA 95060 USA
519  paul.hoffman@imc.org and paul.hoffman@vpnc.org
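
As a closing illustration, the Base32 procedures in sections 2.5.1 and
2.5.2 can be sketched in Python as shown here. The names ALPHABET,
base32_encode, and base32_decode are hypothetical. Two points are
assumptions of this sketch rather than statements of the draft: the pad
bits added in step 5 of 2.5.1 are zero bits (as the example in 2.5.3
suggests), and any decoded bits left over after the last full octet are
treated as padding and dropped.

```python
# Table 1 in bit-value order: 00000 is "a" and 11111 is "7".
ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def base32_encode(octets: bytes) -> str:
    """Treat the octets as a bit stream and emit one character per 5 bits."""
    bits = "".join(f"{octet:08b}" for octet in octets)
    if len(bits) % 5:
        # Assumption: pad the final chunk with zero bits, as in 2.5.3.
        bits += "0" * (5 - len(bits) % 5)
    return "".join(ALPHABET[int(bits[i:i + 5], 2)]
                   for i in range(0, len(bits), 5))


def base32_decode(text: str) -> bytes:
    """Map each character back to 5 bits and regroup the bits into octets."""
    text = text.lower()                  # lower-casing as in section 2.3.2
    if any(ch not in ALPHABET for ch in text):
        raise ValueError("character not in Table 1")
    bits = "".join(f"{ALPHABET.index(ch):05b}" for ch in text)
    # Assumption: bits beyond the last full octet are padding.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
```

With these definitions, encoding the four octets 0x3a270f93 from section
2.5.3 gives "hitq7ey", and decoding "hitq7ey" gives the same four octets
back.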