Internet Draft                                               Mark Davis
draft-ietf-idn-lace-00.txt                                          IBM
November 6, 2000                                           Paul Hoffman
Expires May 6, 2001                                          IMC & VPNC

         LACE: Length-based ASCII Compatible Encoding for IDN

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a transformation method for representing
non-ASCII characters in host name parts in a fashion that is completely
compatible with the current DNS. It is a potential candidate for an
ASCII-Compatible Encoding (ACE) for internationalized host names, as
described in the comparison document from the IETF IDN Working Group.
This method is based on the observation that many internationalized host
name parts will have a few substrings from a small number of rows of the
ISO 10646 repertoire. Run-length encoding for these types of
host names will be fairly compact, and is fairly easy to describe.

1. Introduction

There is a strong world-wide desire to use characters other than plain
ASCII in host names. Host names have become the equivalent of business
or product names for many services on the Internet, so there is a need
to make them usable by people whose native scripts are not representable
by ASCII. The requirements for internationalizing host names are
described in the IDN WG's requirements document, [IDNReq].

The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ACE), and arch-3 (just send ACE). LACE (Length-based ACE) is
an ACE that can be used with protocols that match arch-2 or arch-3.
LACE specifies an ACE format as specified in ace-1 in [IDNComp].
Further, it specifies an identifying mechanism for ace-2 in [IDNComp],
namely ace-2.1.1 (add hopefully-unique legal tag to the beginning of
the name part).

In formal terms, LACE describes a character encoding scheme of the
ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
characters is synchronized with Unicode [Unicode3]) and the rules for
using that scheme in the DNS. As such, it could also be called a
"charset" as defined in [IDNReq].

The LACE protocol has the following features:

- There is exactly one way to convert internationalized host parts to
and from LACE parts. Host name part uniqueness is preserved.

- Host parts that have no international characters are not changed.

- Names using LACE can include more internationalized characters than
with other ACE protocols that have been suggested to date. LACE-encoded
names are variable length, depending on the number of transitions
between rows in the ISO 10646 repertoire that appear in the name part.
Name parts that cannot be compressed using run-length encoding can have
up to 17 characters, and names that can be compressed can have up to 35
characters. Further, a name that has just a few row transitions
typically can have over 30 characters.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation that
does not follow these statements exactly is likely to cause damage to
the Internet by creating non-unique representations of host names.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".

LACE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted".

1.2 IDN summary

Using the terminology in [IDNComp], LACE specifies an ACE format as
specified in ace-1. Further, it specifies an identifying mechanism for
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
of the name part).

LACE has the following length characteristics. In this list, "row" means
a row from ISO 10646.

- LACE-encoded names are variable length, depending on the number of
transitions between rows that appear in the name part.

- Name parts that cannot be compressed using run-length encoding can
have up to 17 characters.

- Names that can be compressed can have up to 35 characters.

- A name that has just a few row transitions typically can have over 30
characters.

2. Host Part Transformation

According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-"). This, of course, excludes any internationalized
characters, as well as many other characters in the ASCII character
repertoire. Further, domain name parts must be 63 octets or shorter in
length.

2.1 Name tagging

All post-converted name parts that contain internationalized characters
begin with the string "bq--". (Of course, because host name parts are
case-insensitive, this might also be represented as "Bq--" or "bQ--" or
"BQ--".) The string "bq--" was chosen because it is extremely unlikely
to exist in host parts before this specification was produced. As a
historical note, in late August 2000, none of the second-level host name
parts in any of the .com, .edu, .net, and .org top-level domains began
with "bq--"; there are many tens of thousands of other strings of three
characters followed by a hyphen that have this property and could be
used instead. The string "bq--" will change to other strings with the
same properties in future versions of this draft.

Note that a zone administrator might still choose to use "bq--" at the
beginning of a host name part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create host
part names that begin with "bq--" unless those names are post-converted
names. Creating host part names that begin with "bq--" but that are not
post-converted names may cause two distinct problems. Some display
systems, after converting the post-converted name part back to an
internationalized name part, might display the name parts in a
possibly-confusing fashion to users. More seriously, some resolvers,
after converting the post-converted name part back to an
internationalized name part, might reject the host name if it contains
illegal characters.

2.2 Converting an internationalized name to an ACE name part

To convert a string of internationalized characters into an ACE name
part, the following steps MUST be performed in the exact order of the
subsections given here.

If a name part consists exclusively of characters that conform to the
host name requirements in [STD13], the name MUST NOT be converted to
LACE. That is, a name part that can be represented without LACE MUST NOT
be encoded using LACE. This absolute requirement prevents a single DNS
host name from having two different encodings.

If any checking for prohibited name parts (such as checking for
prohibited characters, case-folding, or canonicalization) is to be done,
it MUST be done before doing the conversion to an ACE name part.

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted
string.

Characters outside the first plane of characters
(those with codepoints above U+FFFF) MUST be represented using
surrogates, as described in the UTF-16 description in ISO 10646.
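
As an illustration of the input format described above, the following
Python sketch produces the big-endian UTF-16 octets of a name part. The
function name to_utf16_be is hypothetical and is not part of this
specification; the sketch relies on Python's standard "utf-16-be"
codec, which writes no byte order mark and represents codepoints above
U+FFFF as surrogate pairs.

```python
def to_utf16_be(name_part: str) -> bytes:
    """Return the big-endian UTF-16 octets of a name part (no byte order mark).

    Hypothetical helper for illustration only.
    """
    # The "utf-16-be" codec emits no BOM and encodes codepoints above
    # U+FFFF as a high surrogate followed by a low surrogate.
    return name_part.encode("utf-16-be")


# Two characters from the first plane: two octets each.
print(to_utf16_be("\u30e6\u30cb").hex(" "))

# One character above U+FFFF: a surrogate pair, four octets in total.
print(to_utf16_be("\U00010400").hex(" "))
```

The second call yields four octets because U+10400 lies outside the
first plane and is therefore carried as two surrogate code units.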

2.2.1 Compress the pre-converted string

The entire pre-converted string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the
compressed string.

2.2.2 Check the length of the compressed string

The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error.

2.2.3 Encode the compressed string with Base32

The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string.

2.2.4 Prepend "bq--" to the encoded string and finish

Prepend the characters "bq--" to the encoded string. This is the host
name part that can be used in DNS resolution.

2.3 Converting a host name part to an internationalized name

The input string for conversion is a valid host name part. Note that if
any checking for prohibited name parts (such as prohibited characters,
case-folding, or canonicalization) is to be done, it MUST be done after
doing the conversion from an ACE name part.

If a decoded name part consists exclusively of characters that conform
to the host name requirements in [STD13], the conversion from LACE MUST
fail. Because a name part that can be represented without LACE MUST NOT
be encoded using LACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name
requirements in [STD13]. If such a name part is found, it MUST be
considered an error (and possibly a security violation).

2.3.1 Strip the "bq--"

The input string MUST begin with the characters "bq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters
"bq--" from the input string. The result of this step is the stripped
string.
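
The tag check in section 2.3.1 can be sketched in Python as follows.
This is a minimal sketch, not a definitive implementation: the names
TAG and strip_tag are hypothetical, and accepting "Bq--", "bQ--", and
"BQ--" is an assumption based on the case-insensitivity note in
section 2.1.

```python
TAG = "bq--"  # tag string from section 2.1 (may change in later drafts)


def strip_tag(name_part: str) -> str:
    """Return the name part without its leading tag.

    Raises ValueError when the tag is absent, which corresponds to
    "the conversion MUST stop with an error".
    """
    # Assumption: host name parts are case-insensitive, so the tag is
    # compared after lower-casing the first four characters.
    if name_part[:len(TAG)].lower() != TAG:
        raise ValueError("name part does not begin with the LACE tag")
    return name_part[len(TAG):]


print(strip_tag("bq--hitq7ey"))
```

A caller would pass the returned stripped string on to the Base32
check in section 2.3.2.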

2.3.2 Decode the stripped string with Base32

The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case
letters and digits. If any resulting characters are not in Table 1, the
conversion MUST stop with an error; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string.

2.3.3 Decompress the decoded string

The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4. The result
of this is the internationalized string.

2.4 Compression algorithm

The basic method for compression is to reduce a substring that consists
of characters all from a single row of the ISO 10646 repertoire to a
count octet followed by the row header followed by the lower octets of
the characters. If this ends up being longer than the input, the string
is not compressed, but instead has a unique one-octet header attached.

Although the uncompressed mode limits the number of characters in a LACE
name part to 17, this is still generally enough for almost all names in
almost all scripts. Also, this limit is close to the limits set by other
encoding proposals.

Note that the compression and decompression rules MUST be followed
exactly. This requirement prevents a single host name part from having
two encodings. Thus, for any input to the algorithm, there is only one
possible output. An implementation cannot choose between the compressed
and the uncompressed form using anything other than the logic given in
this section.

2.4.1 Compressing a string

The input string is in big-endian UTF-16 encoding with no byte order
mark.

Design note: No checking is done on the input to this algorithm.
It is assumed that all checking for valid ISO/IEC 10646 characters has
already been done by a previous step in the conversion process.

1) If the length of the input is not even, or is less than 2, stop with
an error.

2) Set the input pointer, called IP, to the first octet of the input
string.

3) Set the variable called HIGH to the octet at IP.

4) Determine the number of consecutive pairs starting at IP that have
HIGH as the first octet; call this COUNT.

5) Put into an output buffer the single octet for COUNT followed by the
single octet for HIGH, followed by the low octets of those COUNT pairs.
Move IP to the end of those pairs; that is, set IP to IP+(2*COUNT).

6) If IP is not at the end of the input string, go to step 3.

7) If the length of the output buffer is less than or equal to the
length of the input buffer (in octets, not in characters), output the
buffer. Otherwise, output the octet 0xFF followed by the input buffer.
Note that there can only be one possible representation for a name part,
so that outputting the wrong name part is a serious security error.
Decompression schemes MUST accept only the valid form and MUST NOT
accept invalid forms.

2.4.2 Decompressing a string

1. Set the input pointer, called IP, to the first octet of the input
string. If there is no first octet, stop with an error.

2. If the octet at IP is 0xFF, go to step 10.

3. Get the octet at IP, call it COUNT. Set IP to IP+1. If IP is now at
the end of the input string, stop with an error.

4. Get the octet at IP, call it HIGH. Set IP to IP+1. If IP is now at
the end of the input string, stop with an error.

5. Get the octet at IP, call it LOW. Set IP to IP+1.

6. Output HIGH, then LOW, to the output buffer.

7. Decrement COUNT. If COUNT is greater than 0, go to step 5.

8. If IP is not at the end of the input buffer, go to step 3.

9.
Compare the length of the input string with the length of the output
buffer. If the length of the input string is longer than the length of
the output buffer, stop with an error because the wrong compression form
was used. Otherwise, send out the output buffer and stop.

10. Set IP to IP+1. Copy the rest of the input buffer to the output
buffer. Compress the output buffer into a separate comparison buffer
following the steps for compression above. If the length of the
comparison buffer is less than or equal to the length of the output
buffer, stop with an error because the wrong compression form was used.
Otherwise, send out the output buffer and stop.

2.4.3 Compression examples

The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are
represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30
FC 30 C9>. All the code units are in the same row (30). The output
buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than
the input string. Thus the output is <05 30 E6 CB B3 FC C9>.

The four input characters <U+012E U+0110 U+014A U+00C5> are represented
in big-endian UTF-16 as the eight octets <01 2E 01 10 01 4A 00 C5>. The
output buffer has eight octets <03 01 2E 10 4A 01 00 C5>, which is the
same length as the input string. Thus, the output is <03 01 2E 10 4A 01
00 C5>.

The three input characters <U+012E U+00D0 U+014A> are represented in
big-endian UTF-16 as the six octets <01 2E 00 D0 01 4A>. The output
buffer is nine octets <01 01 2E 01 00 D0 01 01 4A>, which is longer than
the input buffer. Thus, the output is <FF 01 2E 00 D0 01 4A>.

2.5 Base32

In order to encode non-ASCII characters in DNS-compatible host name
parts, they must be converted into legal characters. This is done with
Base32 encoding, described here.

Table 1 shows the mapping between input bits and output characters in
Base32. Design note: the digits used in Base32 are "2" through "7"
instead of "0" through "5" in order to avoid digits "0" and "1".
This helps reduce errors for users who are entering a Base32 stream and
may mistake a "0" for an "O" or a "1" for an "l".

                Table 1: Base32 conversion
        bits   char  hex         bits   char  hex
        00000   a    0x61        10000   q    0x71
        00001   b    0x62        10001   r    0x72
        00010   c    0x63        10010   s    0x73
        00011   d    0x64        10011   t    0x74
        00100   e    0x65        10100   u    0x75
        00101   f    0x66        10101   v    0x76
        00110   g    0x67        10110   w    0x77
        00111   h    0x68        10111   x    0x78
        01000   i    0x69        11000   y    0x79
        01001   j    0x6a        11001   z    0x7a
        01010   k    0x6b        11010   2    0x32
        01011   l    0x6c        11011   3    0x33
        01100   m    0x6d        11100   4    0x34
        01101   n    0x6e        11101   5    0x35
        01110   o    0x6f        11110   6    0x36
        01111   p    0x70        11111   7    0x37

2.5.1 Encoding octets as Base32

The input is a stream of octets. However, the octets are then treated
as a stream of bits.

Design note: The assumption that the input is a stream of octets
(instead of a stream of bits) was made so that no padding was needed.
If you are reusing this algorithm for a stream of bits, you must add a
padding mechanism in order to differentiate different lengths of input.

1) Set the read pointer to the beginning of the input bit stream.

2) Look at the five bits after the read pointer. If there are not five
bits, go to step 5.

3) Look up the value of the set of five bits in the bits column of
Table 1, and output the character from the char column (whose hex value
is in the hex column).

4) Move the read pointer five bits forward. If the read pointer is at
the end of the input bit stream (that is, there are no more bits in the
input), stop. Otherwise, go to step 2.

5) Pad the bits seen with trailing zero bits until there are five bits.

6) Look up the value of the set of five bits in the bits column of
Table 1, and output the character from the char column (whose hex value
is in the hex column).

2.5.2 Decoding Base32 as octets

The input is octets in network byte order.
The input octets MUST be values from the second column in Table 1.

1) Set the read pointer to the beginning of the input octet stream.

2) Look up the character value of the octet in the char column (or hex
value in hex column) of Table 1, and output the five bits from the bits
column.

3) Move the read pointer one octet forward. If the read pointer is at
the end of the input octet stream (that is, there are no more octets in
the input), stop. Otherwise, go to step 2.

2.5.3 Base32 example

Assume you want to encode the value 0x3a270f93. The bit string is:

  3   a    2   7    0   f    9   3
00111010 00100111 00001111 10010011

Broken into chunks of five bits, this is:

00111 01000 10011 10000 11111 00100 11

Padding is added to make the last chunk five bits:

00111 01000 10011 10000 11111 00100 11000

The output of encoding is:

00111 01000 10011 10000 11111 00100 11000
  h     i     t     q     7     e     y

or "hitq7ey".

3. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. For that reason, LACE makes no changes to the DNS
itself.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host
name.

LACE is designed so that every internationalized host name part
can be represented as one and only one DNS-compatible string. If there
is any way to follow the steps in this document and get two or more
different results, it is a severe and fatal error in the protocol.

4. References

[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.

[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
a technical corrigendum have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. 17 other amendments are
currently at various stages of standardization. [[[ THIS REFERENCE
NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
.

A. Acknowledgements

Base32 is quite obviously inspired by the tried-and-true Base64
Content-Transfer-Encoding from MIME.

B. IANA Considerations

There are no IANA considerations in this document.

C. Author Contact Information

Mark Davis
IBM
10275 N. De Anza Blvd
Cupertino, CA 95014
mark.davis@us.ibm.com and mark.davis@macchiato.com

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org
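
For readers who want to experiment with the encoding direction, the
compression steps of section 2.4.1 and the Base32 steps of section
2.5.1 can be sketched in Python as shown here. This is an unofficial
sketch under stated assumptions, not a reference implementation: all
function names are hypothetical, a run is taken to mean consecutive
pairs that share a high octet (as in the examples of section 2.4.3),
the final Base32 chunk is padded with zero bits (as in section 2.5.3),
and the checks for prohibited characters and for names that already
satisfy [STD13] are omitted.

```python
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, values 0..31


def run_length_form(utf16_be: bytes) -> bytes:
    """Steps 1-6 of section 2.4.1: one (COUNT, HIGH, lows...) group per run."""
    if len(utf16_be) < 2 or len(utf16_be) % 2 != 0:
        raise ValueError("input must be a non-empty sequence of octet pairs")
    out = bytearray()
    ip = 0
    while ip < len(utf16_be):
        high = utf16_be[ip]
        count = 0
        # Count consecutive pairs whose first octet equals HIGH.
        while ip + 2 * count < len(utf16_be) and utf16_be[ip + 2 * count] == high:
            count += 1
        out.append(count)  # raises ValueError for runs longer than 255 pairs
        out.append(high)
        out.extend(utf16_be[ip + 1 : ip + 2 * count : 2])  # low octets of the run
        ip += 2 * count
    return bytes(out)


def compress(utf16_be: bytes) -> bytes:
    """Step 7: keep the run-length form unless it is longer than the input."""
    rle = run_length_form(utf16_be)
    return rle if len(rle) <= len(utf16_be) else b"\xff" + utf16_be


def base32_encode(octets: bytes) -> str:
    """Section 2.5.1: one Table 1 character per five bits of input."""
    bits = "".join(f"{octet:08b}" for octet in octets)
    bits += "0" * (-len(bits) % 5)  # assumption: pad the last chunk with zeros
    return "".join(
        BASE32_ALPHABET[int(bits[i : i + 5], 2)] for i in range(0, len(bits), 5)
    )


def to_lace(name_part: str) -> str:
    """Sections 2.2.1 through 2.2.4 for a name part with non-ASCII characters."""
    compressed = compress(name_part.encode("utf-16-be"))
    if len(compressed) > 36:
        raise ValueError("compressed string is longer than 36 octets")
    return "bq--" + base32_encode(compressed)


print(compress(bytes.fromhex("30E630CB30B330FC30C9")).hex(" "))
print(base32_encode(bytes.fromhex("3a270f93")))
```

With the input of section 2.5.3, base32_encode returns "hitq7ey", and
the three inputs of section 2.4.3 compress to the outputs listed
there, which gives a quick consistency check against the text.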