Internet Draft                                              Paul Hoffman
draft-hoffman-idn-cidnuc-03.txt                               IMC & VPNC
March 10, 2000
Expires in six months

       Compatible Internationalized Domain Names Using Compression

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This protocol describes a transformation method for representing
non-ASCII characters in domain names in a fashion that is completely
compatible with the current DNS. It meets the many requirements for
internationalization of domain names.

Note: this protocol is quite experimental and should not be deployed in
the Internet until it reaches standards track in the IETF.

1. Introduction

There is a strong world-wide desire to use characters other than plain
ASCII in domain names. Domain names have become the equivalent of
business or product names for many services on the Internet, so there
is a need to make them usable by people whose native scripts are not
representable by ASCII. The requirements for internationalizing domain
names are described in [IDNReq].

The protocol in this document describes how to take almost any
character used in human writing and use it in a domain name in a way
that is completely compatible with the current DNS. The protocol
requires absolutely no changes to the DNS [STD13].

The protocol works for both entry and display of internationalized
characters. For domain name entry, a user enters the international
(that is, non-ASCII) characters of a domain name into a converter, and
that converter transforms the name entered into a DNS-compatible
format.
Each domain part of internationalized domain names is tagged,
and some parts may be internationalized while others use today's plain
ASCII format. For domain name display, the display utility converts
each tagged domain part from its DNS-compatible format into the
internationalized characters and displays them inline with any
non-internationalized domain part. Users never have to see the
converted versions of the internationalized name parts.

In formal terms, this protocol describes a character encoding scheme of
the ISO 10646 [ISO10646] coded character set and the rules for using
that scheme in the DNS. As such, it could also be called a "charset" as
defined in [RFC2278].

The protocol has the following features:

- There are no changes to the DNS protocols or the way that domain
names are interpreted. There are also no changes to the DNS root
servers, nor to zone files. The protocol can start to be used in the
DNS today with only changes in the non-protocol portions.

- There is exactly one way to convert internationalized domain parts to
and from DNS-compatible parts. Domain part uniqueness is preserved.
Domain parts that have no international characters are not changed.

- Essentially all characters written in all common (and many uncommon)
human scripts can be put in any name part.

- Transformed names can be entered and displayed without ASCII case
sensitivity.

- Names using this protocol can include more internationalized
characters than with other ASCII-converted protocols that have been
suggested to date.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b".
For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".

This protocol converts strings with internationalized characters into
strings of US-ASCII that are acceptable as domain name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted".

2. Domain Part Transformation

Any domain part that contains one or more non-ASCII characters is
transformed into a DNS-compatible name before passing it to a DNS
resolver or other program that uses traditional domain names. This step
is usually done at the time a user enters a domain name into an
application. When a domain name is displayed to a user, the display
program can convert any domain part that is tagged as holding
internationalized characters into a displayable representation that
includes the internationalized characters.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation
that does not follow these statements exactly is likely to cause damage
to the Internet by creating non-unique representations of domain names.

According to [STD13], domain parts must be case-insensitive, start with
a letter, and contain only letters, digits, and the hyphen character
("-"). This, of course, excludes any internationalized characters, as
well as many other characters in the ASCII character repertoire.
Further, domain name parts must be 63 octets or shorter in length.

2.1 Name tagging

Internationalized domain parts are converted to and from a display
representation that includes non-ASCII characters.
Thus, a program that
converts from DNS-compatible name parts to viewable name parts must be
able to recognize name parts that need to be converted.

All post-converted name parts that contain internationalized characters
begin with the string "aq8". (Of course, because domain name parts are
case-insensitive, this might also be represented as "Aq8" or "aQ8" or
"AQ8".) The string "aq8" was chosen because it is extremely unlikely to
exist in domain parts before this specification was produced. As a
historical note, in early March 2000, none of the second-level domain
parts in any of the .com, .edu, .net, and .org top-level domains began
with "aq8"; there are about 9,500 other strings of three legal
characters that have this property and could be used instead.

Note that a zone administrator can still choose to use "aq8" at the
beginning of a domain part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create
domain part names that begin with "aq8" unless those names are
post-converted names. Creating domain part names that begin with "aq8"
but that are not post-converted names may cause display systems that
conform to this document to display the name parts in a
possibly-confusing fashion to users. However, creating such names will
not cause any DNS resolution problems; it will only cause display
problems (and possibly entry problems) for some users.

2.2 Converting an internationalized name to a domain name part

To convert a string of internationalized characters into a
DNS-compatible domain name part, the following steps MUST be performed
in the exact order of the subsections given here. Note that these steps
MUST be done by zone administrators who are creating internationalized
domain name parts in their zones and MUST be done by clients who are
resolving domain names.
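
As an illustration of the tag test described in section 2.1, the
following Python sketch checks for the "aq8" tag without regard to
ASCII case. It is not part of the protocol; the names TAG, has_idn_tag,
and strip_tag are hypothetical and chosen only for this example.

```python
TAG = "aq8"  # tag string from section 2.1


def has_idn_tag(part: str) -> bool:
    """Return True if the name part begins with the tag, ignoring case."""
    return part[:len(TAG)].lower() == TAG


def strip_tag(part: str):
    """Return the part with the tag removed, or None if it is untagged."""
    return part[len(TAG):] if has_idn_tag(part) else None


if __name__ == "__main__":
    for name in ("aq8hitq7ey", "AQ8hitq7ey", "example"):
        print(name, has_idn_tag(name), strip_tag(name))
```

A tagged part is only a candidate for conversion; a part that carries
the tag can still fail the later decoding and verification steps, in
which case it is displayed unchanged.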

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted
string.

Characters outside the first plane of characters (that is, outside the
first 0xFFFF characters) MUST be represented using surrogates, as
described in the UTF-16 description in ISO 10646.

The characters in Table 1 MUST NOT appear in pre-converted domain name
parts. The characters in this list have been chosen for many reasons,
mostly to avoid problems with displayed characters. The reasons
include:

- The character is a period

- The character is a separator (space, line, or paragraph)

- The character is a control character

- The character is a formatting character

- The character is a private-use character

Table 1: Characters illegal in domain names
U+002E (FULL STOP)
All characters in the Unicode Character Database [UnicodeData] whose
General Category is any of:
   Zs
   Zl
   Zp
   Cc
   Cf
   Co

Design note: The above list will probably change and will probably be
taken to a separate document so there can be more focused discussion on
it. For example, there appears to be a desire to not allow both
uppercase and lowercase, and some discussion of not allowing characters
that do not "normally" appear in "names". The above list could include
all of the characters of the not-chosen case and of type "punctuation"
and "symbol", minus those that "normally" appear in "names".

Design note: There is no reason to assume that this database must be
run by the Unicode Consortium. It is quite believable that, given the
importance of the database, it could be maintained by IANA for the
IETF, quite probably with the help of the Unicode Consortium.

2.2.1 Check for a name that cannot be transformed

An untransformed input string that is already a legitimate domain name
part MUST NOT be converted.
Each character in the input string MUST be
compared to the following list of characters:

U+002D (HYPHEN-MINUS)
U+0030 through U+0039 (DIGIT ZERO, ...)
U+0041 through U+005A (LATIN CAPITAL LETTER A, ...)
U+0061 through U+007A (LATIN SMALL LETTER A, ...)

If all the characters in the input string are in the above set of
characters, the conversion MUST stop with an error. The input string
itself MUST be used as the domain name part.

2.2.2 Check for illegal characters in the input string

Each character in the input string MUST be checked against Table 1. If
any character in the input string matches a character listed in Table
1, the conversion MUST stop with an error. The characters in Table 1
MUST NOT appear in any internationalized domain name part.

Further, each character in the input string MUST be checked to see if
it is part of a malformed surrogate pair. If any character is part of a
malformed surrogate pair, the conversion MUST stop with an error.
Malformed surrogate pairs MUST NOT appear in any internationalized
domain name part.

2.2.3 Normalize the input string

The entire input string MUST be normalized using Normalization Form C
as described in [Norm]. The normalization MUST be applied to the entire
input string, not to substrings. The result of this step is the
normalized string.

2.2.4 Compress the normalized string

The entire normalized string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the
compressed string.

2.2.5 Check the length of the compressed string

The compressed string MUST be 37 octets or shorter. If the compressed
string is 38 octets or longer, the conversion MUST stop with an error.

2.2.6 Encode the compressed string with Base32

The compressed string MUST be converted to a DNS-compatible encoding
using the Base32 encoding described in section 2.5.
The result of this
step is the encoded string.

2.2.7 Prepend "aq8" to the encoded string and finish

Prepend the characters "aq8" to the encoded string. This is the domain
name part that can be used in DNS resolution.

2.3 Converting a domain name part to an internationalized name

The input string for conversion is a valid domain name part.

2.3.1 Strip the "aq8"

The input string MUST begin with the characters "aq8". If it does not,
the conversion MUST stop and the displaying program MUST NOT treat the
domain name part as internationalized characters; the input string
is the post-converted string. Otherwise, remove the characters "aq8"
from the input string. The result of this step is the stripped string.

2.3.2 Decode the stripped string with Base32

The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case
letters. If any resulting characters are not in Table 2, the conversion
MUST stop and the displaying program MUST NOT treat the domain name
part as internationalized characters; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string.

2.3.3 Decompress the decoded string

The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4. The result
of this is the internationalized string.

2.3.4 Verify the internationalized string and finish

Each character in the internationalized string MUST be verified before
the string can be used. If the string only consists of the characters
listed in section 2.2.1, the conversion MUST stop and the input string
is the post-converted string.
If any of the characters in the string
are in Table 1 (see section 2.2.2), the conversion MUST stop and the
input string is the post-converted string.

The internationalized string MUST be checked for invalid surrogate
pairs, as described in ISO 10646. If an invalid surrogate pair is
found, the conversion MUST stop and the input string is the
post-converted string.

If no errors are found, the verified string is the post-converted
string.

2.4 Compression algorithm

The basic method for compression is to reduce sequences of characters
that all have the same upper octet to single octets. If any character
in the string has an upper octet that differs from that of the other
characters, all the octets of the input string are copied to the
output string.

The compressed string always has a one-octet header. For one-octet
mode, the header octet is the upper octet of the stream. For two-octet
mode, the header octet is 0xD8, which is the upper octet of a surrogate
pair. Design note: It is impossible to have a legal stream of UTF-16
characters that has all the upper octets being 0xD8 because a character
whose upper octet is 0xD8 must be followed by one whose upper octet is
in the range 0xDC through 0xDF.

Although the two-octet mode limits the number of characters to 17, this
is still generally enough for almost all names in almost all scripts.
Also, this limit is close to the limits set by other encoding
proposals.

Note that all name parts whose characters have the same upper octet
MUST be expressed in the one-octet mode. This requirement prevents a
single domain name part from having two encodings.

2.4.1 Compressing a string

Design note: No checking is done on the input to this algorithm. It is
assumed that all checking for valid ISO 10646 characters has already
been done by a previous step in the conversion process.
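
The compression and decompression steps of sections 2.4.1 and 2.4.2
can be sketched in Python as follows. This is a minimal illustration
under stated assumptions, not a reference implementation: the function
names compress and decompress are hypothetical, and the input is
assumed to be a non-empty bytes object holding big-endian UTF-16 that
has already passed the checks of section 2.2.

```python
def compress(utf16be: bytes) -> bytes:
    """Compress big-endian UTF-16 octets as outlined in section 2.4.1."""
    if not utf16be or len(utf16be) % 2:
        raise ValueError("expected a non-empty, even number of octets")
    uppers = utf16be[0::2]  # upper octet of each 16-bit unit
    lowers = utf16be[1::2]  # lower octet of each 16-bit unit
    if all(u == uppers[0] for u in uppers):
        # One-octet mode: shared upper octet, then the lower octets.
        return bytes([uppers[0]]) + lowers
    # Two-octet mode: 0xD8 header, then the input unchanged.
    return b"\xd8" + utf16be


def decompress(data: bytes) -> bytes:
    """Reverse compress(), as outlined in section 2.4.2."""
    if not data:
        raise ValueError("empty input")
    if data[0] == 0xD8:
        return data[1:]  # two-octet mode: copy the rest through
    upper = data[0]
    out = bytearray()
    for lower in data[1:]:
        out += bytes([upper, lower])  # re-insert the shared upper octet
    return bytes(out)


if __name__ == "__main__":
    same_row = "\u00e4\u00f6".encode("utf-16-be")   # both in row 0x00
    mixed = "a\u4e2d".encode("utf-16-be")           # rows 0x00 and 0x4E
    print(compress(same_row).hex())  # 00e4f6
    print(compress(mixed).hex())     # d800614e2d
```

In this sketch the one-octet mode is chosen whenever it is possible,
which matches the requirement that a name part have only one encoding.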

1) Read each character in the input stream, comparing the upper octet
of each. If all of the upper octets match, go to step 3.

2) Output 0xD8, followed by the entire input stream. Finish.

3) Output the upper octet of the first character. Output the lower
octet of each character in the input. Finish.

2.4.2 Decompressing a string

1) Read the first octet of the input string. If it is 0xD8, go to step
3.

2) Call the value of this first octet "upper". For each other octet in
the input, output "upper", then output the octet from the input.
Finish.

3) Read the rest of the input stream and put it in the output stream.
Finish.

2.5 Base32

In order to encode non-ASCII characters in DNS-compatible domain parts,
they must be converted into legal characters. This is done with Base32
encoding, described here.

Table 2 shows the mapping between input bits and output characters in
Base32. Design note: the digits used in Base32 are "2" through "7"
instead of "0" through "6" in order to avoid digits "0" and "1". This
helps reduce errors for users who are entering a Base32 stream and may
misinterpret a "0" for an "O" or a "1" for an "l".

Table 2: Base32 conversion
bits   char  hex     bits   char  hex
00000   a    0x61    10000   q    0x71
00001   b    0x62    10001   r    0x72
00010   c    0x63    10010   s    0x73
00011   d    0x64    10011   t    0x74
00100   e    0x65    10100   u    0x75
00101   f    0x66    10101   v    0x76
00110   g    0x67    10110   w    0x77
00111   h    0x68    10111   x    0x78
01000   i    0x69    11000   y    0x79
01001   j    0x6a    11001   z    0x7a
01010   k    0x6b    11010   2    0x32
01011   l    0x6c    11011   3    0x33
01100   m    0x6d    11100   4    0x34
01101   n    0x6e    11101   5    0x35
01110   o    0x6f    11110   6    0x36
01111   p    0x70    11111   7    0x37

2.5.1 Encoding octets as Base32

The input is a stream of octets. However, the octets are then treated
as a stream of bits.
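
The Base32 procedures of sections 2.5.1 and 2.5.2 can be sketched in
Python as follows. This is an illustration only. The names ALPHABET,
base32_encode, and base32_decode are hypothetical. Padding the final
group on the right with zero bits follows the example in section 2.5.3;
discarding the leftover bits when decoding is an assumption of this
sketch, since the decoding steps do not say what to do with them.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # char column of Table 2


def base32_encode(data: bytes) -> str:
    """Encode octets five bits at a time using Table 2."""
    bits = "".join(f"{octet:08b}" for octet in data)
    # Assumption: pad the last group on the right with zero bits.
    bits += "0" * (-len(bits) % 5)
    return "".join(ALPHABET[int(bits[i:i + 5], 2)]
                   for i in range(0, len(bits), 5))


def base32_decode(text: str) -> bytes:
    """Decode Table 2 characters to octets; raises ValueError if invalid."""
    bits = "".join(f"{ALPHABET.index(ch):05b}" for ch in text.lower())
    # Assumption: bits beyond the last whole octet are padding; drop them.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


if __name__ == "__main__":
    encoded = base32_encode(bytes.fromhex("3a270f93"))
    print(encoded)                       # hitq7ey
    print(base32_decode(encoded).hex())  # 3a270f93
```

The input value 0x3a270f93 and the output "hitq7ey" are the ones used
in the example of section 2.5.3.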

Design note: The assumption that the input is a stream of octets
(instead of a stream of bits) was made so that no padding was needed.
If you are reusing this encoding for a stream of bits, you must add a
padding mechanism in order to differentiate different lengths of input.

1) Set the read pointer to the beginning of the input bit stream.

2) Look at the five bits after the read pointer. If there are not five
bits, go to step 5.

3) Look up the value of the set of five bits in the bits column of
Table 2, and output the character from the char column (whose hex value
is in the hex column).

4) Move the read pointer five bits forward. If the read pointer is at
the end of the input bit stream (that is, there are no more bits in the
input), stop. Otherwise, go to step 2.

5) Pad the bits seen on the right with zero bits until there are five
bits.

6) Look up the value of the set of five bits in the bits column of
Table 2, and output the character from the char column (whose hex value
is in the hex column).

2.5.2 Decoding Base32 as octets

The input is octets in network byte order. The input octets MUST be
values from the second column in Table 2.

1) Set the read pointer to the beginning of the input octet stream.

2) Look up the character value of the octet in the char column (or hex
value in hex column) of Table 2, and output the five bits from the bits
column.

3) Move the read pointer one octet forward. If the read pointer is at
the end of the input octet stream (that is, there are no more octets in
the input), stop. Otherwise, go to step 2.

2.5.3 Base32 example

Assume you want to encode the value 0x3a270f93. The bit string is:

   3   a    2   7    0   f    9   3
00111010 00100111 00001111 10010011

Broken into chunks of five bits, this is:

00111 01000 10011 10000 11111 00100 11

The output of encoding is:

00111 01000 10011 10000 11111 00100 11
  h     i     t     q     7     e    y
or "hitq7ey".

3. Implementing User Interfaces

This section gives guidelines to creators of programs that allow entry
or display of domain names.

The use of internationalized domain name parts in user applications
should be as transparent to the user as possible. A user should be able
to enter and see internationalized domain names as the pre-converted
names if at all possible.

For instance, if the user is able to enter Chinese characters anywhere
in a program, he or she should also be able to enter Chinese characters
into any interface component that would take in a domain name, such as
a dialog box asking for a URL. Similarly, if any part of a program can
display Arabic characters, any domain name that has Arabic characters
in it should be able to be displayed with Arabic characters, not as the
ASCII transformation of those characters.

3.1 Name entry

In non-internationalized systems, the user enters a domain name and
that name is usually sent unchecked to a domain name resolver, which
returns an IPv4 address. With internationalized names, the user
application MUST convert the pre-converted name into a post-converted
name so that it is acceptable to resolvers.

Some users might have access to the post-converted format of an
internationalized name. Because of this, users SHOULD be able to enter
post-converted names directly into an interface component for domain
names. This capability should already be in the interface because the
post-converted names are already legal. It is important that interfaces
not prohibit the entry of long domain names. (Of course, they should
not be prohibiting them anyway.)

There are a wide variety of user input methods. Keyboard input methods
vary widely from script to script, and even within a single script,
there is often more than one method.
Humorously, people who don't use
a particular script often cannot comprehend how someone who uses that
script can input it with a keyboard and will often declare such input
as impossible.

Regardless of the input method, any system that allows input of
non-ASCII characters SHOULD allow input of pre-converted domain names
in the same fashion.

3.2 Name display

As a user enters internationalized characters, they are often displayed
to the user at the same time. For instance, in a typical entry box for
a URL, the characters are displayed as they are entered. Such display
should, of course, also happen for internationalized characters in
domain names.

Choosing what to do with domain names in free text is more difficult
because not all scripts are easily displayable. For instance, assume
that you are reading a sentence on a page that says "You can reach the
company at" followed by a URL. If the domain name portion of the URL is
internationalized, each domain name part SHOULD be shown as a
pre-converted string if possible. If it is not possible (such as if no
font for the script is available), the domain name part SHOULD be shown
as post-converted characters.

A display program has two choices when displaying an internationalized
name part for which there are one or more characters that the program
cannot display. The first choice is to display (but not replace with) a
"replacement character" that does not look like any other character in
the display. The second choice is to display the post-converted name;
this is admittedly ugly and does not give the user any useful
information other than "the text could not be displayed". In general,
the first option is the better choice. However, it is very important
that the user still be able to copy a domain name part even if it has
characters that cannot be displayed.
Thus, if a display system chooses
to display a "replacement character", the underlying character MUST
still be the undisplayable character.

Note that some domain name parts that start with "aq8" are not
post-converted parts. Such names may contain characters that are not in
the Base32 character set. In such cases, a display program SHOULD
display the name part without attempting to convert it to pre-converted
characters.

3.3 Zone files and registration

Historically, zone files have been maintained as US-ASCII text. For
stability reasons, this practice MUST continue. Thus, zone files MUST
only contain post-converted name parts. Zone administrators can use
tools to enter and view the internationalized parts of zone files.

Registrars for public name spaces such as .com have the same
requirements as any zone administrator. They MUST be sure not to
register names that are illegal. In the case of this protocol, the
registration can continue to be done with US-ASCII characters, but the
registrar MUST then check that the conversion to an internationalized
name does not result in an error.

3.4 Users and post-converted name parts

The Internet has a long history of trying to hide technical detail from
users only to have that detail exposed, often in a confusing fashion.
This protocol attempts to minimize the impact of such exposure.

Clearly, no user will be able to understand a post-converted name part.
However, they are unlikely to have any significant problems with them.
It is likely to become common lore that domain names that have
internationalized parts also have an all-text version that looks like
gibberish. These long names can be copied (by program or even by hand)
just like current domain names are.

4. Security Considerations

Much of the security of the Internet relies on the DNS.
Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. For that reason, this protocol makes no changes
to the DNS itself.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized domain
name.

This protocol is designed so that every internationalized domain part
can be represented as one and only one DNS-compatible string. If there
is any way to follow the steps in this document and get two or more
different results, it is a severe and fatal error in the protocol.

5. References

[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirment.

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
a technical corrigendum have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. 17 other amendments are
currently at various stages of standardization. [[[ THIS REFERENCE
NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

[Norm] Mark Davis and Martin Duerst, "Character Normalization in IETF
Protocols", draft-duerst-i18n-norm.

[RFC2045] Ned Freed and Nathaniel Borenstein, "Multipurpose Internet
Mail Extensions (MIME) Part One: Format of Internet Message Bodies",
November 1996, RFC 2045.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[RFC2278] Ned Freed and Jon Postel, "IANA Charset Registration
Procedures", January 1998, RFC 2278.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
.

[UnicodeData] The Unicode Character Database,
. The database
is described in
.

A. Acknowledgements

Mark Davis contributed many ideas to the initial draft of this
document. Graham Klyne and Martin Duerst offered technical comments on
the algorithms used.

Base32 is quite obviously inspired by the tried-and-true Base64
Content-Transfer-Encoding described in [RFC2045].

B. Changes from Previous Versions of this Draft

B.1 Changes from -02 to -03

Throughout: changed "wg4" to "aq8".

2.2: Updated the first design note to indicate that the table
will probably be moved to its own draft.

2.2.3: Changed reference for normalization from [UTR15] to [Norm].

5: Updated the reference for [IDNReq]. Removed [UTR15] and replaced
it with [Norm].

B.2 Changes from -01 to -02

Throughout: Changed "ph6" to "wg4".

2.1: Updated count of unused three-letter prefixes.

2.3: Removed all the error states and clarified that any error in
conversion means that the input string is the post-converted
string.

2.4: Radically changed the compression scheme; the previous one
was far too cumbersome.

2.5: Renumbered Table 3 to Table 2.

2.5.1: Changed the second paragraph (should have been done in
the change to -01 to remove padding).

3.2: Clarified the paragraph emphasizing the need for users to be able
to copy names even if they are not displayable.

5: Removed reference to [UTR6].

A: Added Martin Duerst. Removed reference to the compression
algorithm because it has changed.

B.3 Changes from -00 to -01

Throughout: Changed references to the character set from Unicode
to ISO 10646, even though they are equivalent.
Also changed
references to the rules for surrogate pairs to ISO 10646.

1.1: Clarified last paragraph.

2.2: Reworded the first design note to make excluding case stuff
more likely.

2.5: Removed the "8" padding in the Base32 algorithm because
it was superfluous.

2.5.1: Removed "in network byte order" from the first sentence
because it was redundant.

3.3: Made the first paragraph stronger.

5: Added reference to ISO 10646. This still needs work.

A: Added Graham Klyne.

C. IANA Considerations

There are no IANA considerations in the current draft. However, if
it is decided to have IANA maintain the character database, this
section will become much longer.

D. Author Contact Information

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org