Network Working Group                                      D. Goldsmith
Internet Draft                                     Apple Computer, Inc.
Expires: 11 September 1997                                     M. Davis
Will obsolete: RFC 1642                                  Taligent, Inc.
                                                          11 March 1997

                                 UTF-7

              A Mail-Safe Transformation Format of Unicode

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months.

   Internet-Drafts may be updated, replaced, or obsoleted by other
   documents at any time. It is not appropriate to use Internet-Drafts
   as reference material or to cite them other than as a "working draft"
   or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited. Please send comments to
   the authors (see Authors' Addresses). This document is intended to
   become an experimental RFC.

Abstract

   The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
   amended) jointly define a character set (hereafter referred to as
   Unicode) which encompasses most of the world's writing systems.
   However, Internet mail (STD 11, RFC 822) currently supports only
   7-bit US ASCII as a character set. MIME (RFC 2045 through 2049)
   extends Internet mail to support different media types and character
   sets, and thus could support Unicode in mail messages. MIME neither
   defines Unicode as a permitted character set nor specifies how it
   would be encoded, although it does provide for the registration of
   additional character sets over time.

   This document describes a transformation format of Unicode that
   contains only 7-bit ASCII octets and is intended to be readable by
   humans in the limiting case that the document consists of characters
   from the US-ASCII repertoire. It also specifies how this
   transformation format is used in the context of MIME and RFC 1641,
   "Using Unicode with MIME".

Motivation

   Although other transformation formats of Unicode exist and could
   conceivably be used in this context (most notably UTF-8, also known
   as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
   octets in the range decimal 128 through 255 to encode Unicode
   characters outside the US-ASCII range. Thus, in the context of mail,
   those octets must themselves be encoded. This requires putting text
   through two successive encoding processes, and leads to a significant
   expansion of characters outside the US-ASCII range, putting
   non-English speakers at a disadvantage. For example, using UTF-8
   together with the Quoted-Printable content transfer encoding of MIME
   represents US-ASCII characters in one octet, but other characters may
   require up to nine octets.

Overview

   UTF-7 encodes Unicode characters as US-ASCII octets, together with
   shift sequences to encode characters outside that range. For this
   purpose, one of the characters in the US-ASCII repertoire is reserved
   for use as a shift character.

   Many mail gateways and systems cannot handle the entire US-ASCII
   character set (those based on EBCDIC, for example), and so UTF-7
   contains provisions for encoding characters within US-ASCII in a way
   that all mail systems can accommodate.

   UTF-7 should normally be used only in the context of 7 bit
   transports, such as mail. In other contexts, straight Unicode or
   UTF-8 is preferred.

   See RFC 1641, "Using Unicode with MIME" for the overall specification
   on usage of Unicode transformation formats with MIME.

Definitions

   First, the definition of Unicode:

      The 16 bit character set Unicode is defined by "The Unicode
      Standard, Version 2.0".
      This character set is identical with the
      character repertoire and coding of the international standard
      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
      Subset=300; Implementation Level=3, including the first 7
      amendments to 10646 plus editorial corrections.

      Note. Unicode 2.0 further specifies the use and interaction of
      these character codes beyond the ISO standard. However, any valid
      10646 sequence is a valid Unicode sequence, and vice versa;
      Unicode supplies interpretations of sequences on which the ISO
      standard is silent as to interpretation.

   Next, some handy definitions of US-ASCII character subsets:

      Set D (directly encoded characters) consists of the following
      characters (derived from RFC 1521, Appendix B, which no longer
      appears in RFC 2045): the upper and lower case letters A through Z
      and a through z, the 10 digits 0-9, and the following nine special
      characters (note that "+" and "=" are omitted):

               Character   ASCII & Unicode Value (decimal)
                  '           39
                  (           40
                  )           41
                  ,           44
                  -           45
                  .           46
                  /           47
                  :           58
                  ?           63

      Set O (optional direct characters) consists of the following
      characters (note that "\" and "~" are omitted):

               Character   ASCII & Unicode Value (decimal)
                  !           33
                  "           34
                  #           35
                  $           36
                  %           37
                  &           38
                  *           42
                  ;           59
                  <           60
                  =           61
                  >           62
                  @           64
                  [           91
                  ]           93
                  ^           94
                  _           95
                  `           96
                  {           123
                  |           124
                  }           125

      Rationale. The characters "\" and "~" are omitted because they are
      often redefined in variants of ASCII.

      Set B (Modified Base64) is the set of characters in the Base64
      alphabet defined in RFC 2045, excluding the pad character "="
      (decimal value 61).

      Rationale. The pad character = is excluded because UTF-7 is designed
      for use within header fields as set forth in RFC 2047.
      Since the only
      readable encoding in RFC 2047 is "Q" (based on RFC 2045's
      Quoted-Printable), the "=" character is not available for use
      (without a lot of escape sequences). This was very unfortunate but
      unavoidable. The "=" character could otherwise have been used as
      the UTF-7 escape character as well (rather than using "+").

   Note that all characters in US-ASCII have the same value in Unicode
   when zero-extended to 16 bits.

UTF-7 Definition

   A UTF-7 stream represents 16-bit Unicode characters using 7-bit
   US-ASCII octets as follows:

      Rule 1: (direct encoding) Unicode characters in set D above may be
      encoded directly as their ASCII equivalents. Unicode characters in
      Set O may optionally be encoded directly as their ASCII
      equivalents, bearing in mind that many of these characters are
      illegal in header fields, or may not pass correctly through some
      mail gateways.

      Rule 2: (Unicode shifted encoding) Any Unicode character sequence
      may be encoded using a sequence of characters in set B, when
      preceded by the shift character "+" (US-ASCII character value
      decimal 43). The "+" signals that subsequent octets are to be
      interpreted as elements of the Modified Base64 alphabet until a
      character not in that alphabet is encountered. Such characters
      include control characters such as carriage returns and line
      feeds; thus, a Unicode shifted sequence always terminates at the
      end of a line. As a special case, if the sequence terminates with
      the character "-" (US-ASCII decimal 45) then that character is
      absorbed; other terminating characters are not absorbed and are
      processed normally.

      Note that if the first character after the shifted sequence is "-"
      then an extra "-" must be present to terminate the shifted
      sequence so that the actual "-" is not itself absorbed.

      Rationale.
      A terminating character is necessary for cases where
      the next character after the Modified Base64 sequence is part of
      character set B or is itself the terminating character. It can
      also enhance readability by delimiting encoded sequences.

      Also as a special case, the sequence "+-" may be used to encode
      the character "+". A "+" character followed immediately by any
      character other than members of set B or "-" is an ill-formed
      sequence.

      Unicode is encoded using Modified Base64 by first converting
      Unicode 16-bit quantities to an octet stream (with the most
      significant octet first). Surrogate pairs (UTF-16) are converted
      by treating each half of the pair as a separate 16 bit quantity
      (i.e., no special treatment). Text with an odd number of octets is
      ill-formed. ISO 10646 characters outside the range addressable via
      surrogate pairs cannot be encoded.

      Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
      in the UCS-2 form are serialized as octets, that the most
      significant octet appear first. This is also in keeping with
      common network practice of choosing a canonical format for
      transmission.

      Rationale. The policy for code point allocation within ISO 10646
      and Unicode is that the repertoires be kept synchronized. No code
      points will be allocated in ISO 10646 outside the range
      addressable by surrogate pairs.

      Next, the octet stream is encoded by applying the Base64 content
      transfer encoding algorithm as defined in RFC 2045, modified to
      omit the "=" pad character. Instead, when encoding, zero bits are
      added to pad to a Base64 character boundary. When decoding, any
      bits at the end of the Modified Base64 sequence that do not
      constitute a complete 16-bit Unicode character are discarded. If
      such discarded bits are non-zero the sequence is ill-formed.

      Rationale.
      The pad character "=" is not used when encoding
      Modified Base64 because of the conflict with its use as an escape
      character for the Q content transfer encoding in RFC 2047 header
      fields, as mentioned above.

      Rule 3: The space (decimal 32), tab (decimal 9), carriage return
      (decimal 13), and line feed (decimal 10) characters may be
      directly represented by their ASCII equivalents. However, note
      that MIME content transfer encodings have rules concerning the use
      of such characters. Usage that does not conform to the
      restrictions of RFC 822, for example, would have to be encoded
      using MIME content transfer encodings other than 7bit or 8bit,
      such as quoted-printable, binary, or base64.

   Given this set of rules, Unicode characters which may be encoded via
   rules 1 or 3 take one octet per character, and other Unicode
   characters are encoded on average with 2 2/3 octets per character
   plus one octet to switch into Modified Base64 and an optional octet
   to switch out.

      Example. The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>."
      (hexadecimal 0041,2262,0391,002E) may be encoded as follows:

            A+ImIDkQ.

      Example. The Unicode sequence "Hi Mom -<WHITE SMILING FACE>-!"
      (hexadecimal 0048, 0069, 0020, 004D, 006F, 006D, 0020, 002D, 263A,
      002D, 0021) may be encoded as follows:

            Hi Mom -+Jjo--!

      Example. The Unicode sequence representing the Han characters for
      the Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) may be
      encoded as follows:

            +ZeVnLIqe-

Use of Character Set UTF-7 Within MIME

   Character set UTF-7 is safe for mail transmission and therefore may
   be used with any content transfer encoding in MIME (except where line
   length and line break restrictions are violated). Specifically, the 7
   bit encoding for bodies and the Q encoding for headers are both
   acceptable. The MIME character set tag is UTF-7. This signifies any
   version of Unicode equal to or greater than 2.0.

      Example.
      Here is a text portion of a MIME message containing the
      Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (hexadecimal 0048,
      0069, 0020, 004D, 006F, 006D, 0020, 263A, 0021).

            Content-Type: text/plain; charset=UTF-7

            Hi Mom +Jjo-!

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence representing the Han characters for the Japanese
      word "nihongo" (hexadecimal 65E5,672C,8A9E).

            Content-Type: text/plain; charset=UTF-7

            +ZeVnLIqe-

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (hexadecimal
      0041,2262,0391,002E).

            Content-Type: text/plain; charset=utf-7

            A+ImIDkQ.

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence "Item 3 is <POUND SIGN>1." (hexadecimal 0049,
      0074, 0065, 006D, 0020, 0033, 0020, 0069, 0073, 0020, 00A3, 0031,
      002E).

            Content-Type: text/plain; charset=UTF-7

            Item 3 is +AKM-1.

   Note that to achieve the best interoperability with systems that may
   not support Unicode or MIME, when preparing text for mail
   transmission line breaks should follow Internet conventions. This
   means that lines should be short and terminated with the proper SMTP
   CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and
   PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP
   line breaks. Ideally, this would be handled transparently by a
   Unicode-aware user agent.

   This preparation is not absolutely necessary, since UTF-7 and the
   appropriate MIME content transfer encoding can handle text that does
   not follow Internet conventions, but readability by systems without
   Unicode or MIME will be impaired. See RFC 2045 for a discussion of
   mail interoperability issues.

   Lines should never be broken in the middle of a UTF-7 shifted
   sequence, since such sequences may not cross line breaks. Therefore,
   UTF-7 encoding should take place after line breaking.
   If a line
   containing a shifted sequence is too long after encoding, a MIME
   content transfer encoding such as Quoted Printable can be used to
   encode the text. Another possibility is to perform line breaking and
   UTF-7 encoding at the same time, so that lines containing shifted
   sequences already conform to length restrictions.

Discussion

   In this section we will motivate the introduction of UTF-7 as opposed
   to the alternative of using the existing transformation formats of
   Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before
   discussing this, it will be useful to list some assumptions about
   character frequency within typical natural language text strings that
   we use to estimate typical storage requirements:

   1. Most Western European languages use roughly 7/8 of their letters
      from US-ASCII and 1/8 from Latin 1 (ISO-8859-1).

   2. Most non-Roman alphabet-based languages (e.g., Greek) use about
      1/6 of their letters from ASCII (since white space is in the 7-bit
      area) and the rest from their alphabets.

   3. East Asian ideographic-based languages (including Japanese) use
      essentially all of their characters from the Han or CJK syllabary
      area.

   4. Non-directly encoded punctuation characters do not occur
      frequently enough to affect the results.

   Notice that current 8 bit standards, such as ISO-8859-x, require use
   of a content transfer encoding. For comparison with the subsequent
   discussion, the costs break down as follows (note that many of these
   figures are approximate since they depend on the exact composition of
   the text):

      8859-x in Base64

         Text type                   Average octets/character
         All                         1.33

      8859-x in Quoted Printable

         Text type                   Average octets/character
         US-ASCII                    1
         Western European            1.25
         Other                       2.67

   Note also that Unicode encoded in Base64 takes a constant 2.67 octets
   per character.
   For purposes of comparison, we will look at UTF-8 in
   Base64 and Quoted Printable, and UTF-7. Also note that fixed overhead
   for long strings is relative to 1/n, where n is the encoded string
   length in octets.

      UTF-8 in Base64

         Text type                   Average octets/character
         US-ASCII                    1.33
         Western European            1.5
         Some Alphabetics            2.44
         All others                  4

      UTF-8 in Quoted Printable

         Text type                   Average octets/character
         US-ASCII                    1
         Western European            1.63
         Some Alphabetics            5.17
         All others                  7-9

      UTF-7

         Text type                   Average octets/character
         Most US-ASCII               1
         Western European            1.5
         All others                  2.67+2/n

   We feel that the UTF-8 in Quoted Printable option is not viable due
   to the very large expansion of all text except Western European. This
   would only be viable in texts consisting of large expanses of
   US-ASCII or Latin characters with occasional other characters
   interspersed. We would prefer to introduce one encoding that works
   reasonably well for all users.

   We also feel that UTF-8 in Base64 has high expansion for
   non-Western-European users, and is less desirable because it cannot
   be read directly, even when the content is largely US-ASCII. The base
   encoding of UTF-7 gives competitive results and is readable for ASCII
   text.

   UTF-7 gives results competitive with ISO-8859-x, with access to all
   of the Unicode character set. We believe this justifies the
   introduction of a new transformation format of Unicode.

   As an alternative to use of UTF-7, it might be possible to intermix
   Unicode characters with other character sets using an existing MIME
   mechanism, the multipart/mixed content type, ignoring for the moment
   the issues with line breaks (thanks to Nathaniel Borenstein for
   suggesting this).
   For instance (repeating an earlier example):

      Content-type: multipart/mixed; boundary=foo
      Content-Disposition: inline

      --foo
      Content-type: text/plain; charset=us-ascii

      Hi Mom
      --foo
      Content-type: text/plain; charset=UNICODE-2-0
      Content-transfer-encoding: base64

      Jjo=
      --foo
      Content-type: text/plain; charset=us-ascii

      !
      --foo--

   Theoretically, this removes the need for UTF-7 in message bodies
   (multipart may not be used in header fields). However, we feel that
   as use of the Unicode character set becomes more widespread,
   intermittent use of specialized Unicode characters (such as dingbats
   and mathematical symbols) will occur, and that text will also
   typically include small snippets from other scripts, such as
   Cyrillic, Greek, or East Asian languages (anything in the Roman
   script is already handled adequately by existing MIME character
   sets). Although the multipart technique works well for large chunks
   of text in alternating character sets, we feel it does not adequately
   support the kinds of uses just discussed, and so we still believe the
   introduction of UTF-7 is justified.

Summary

   The UTF-7 encoding allows Unicode characters to be encoded within the
   US-ASCII 7 bit character set. It is most effective for Unicode
   sequences which contain relatively long strings of US-ASCII
   characters interspersed with either single Unicode characters or
   strings of Unicode characters, as it allows the US-ASCII portions to
   be read on systems without direct Unicode support.

   UTF-7 should only be used with 7 bit transports such as mail. In
   other contexts, use of straight Unicode or UTF-8 is preferred.

Acknowledgements

   Many thanks to the following people for their contributions,
   comments, and suggestions. If we have omitted anyone it was through
   oversight and not intentionally.

      Glenn Adams
      Harald T. Alvestrand
      Nathaniel Borenstein
      Lee Collins
      Jim Conklin
      Dave Crocker
      Steve Dorner
      Dana S. Emery
      Ned Freed
      Kari E. Hurtta
      John H. Jenkins
      John C. Klensin
      Valdis Kletnieks
      Keith Moore
      Masataka Ohta
      Einar Stefferud
      Erik M. van der Poel

Appendix A -- Examples

   Here is a longer example, taken from a document originally in Big5
   code. It has been condensed for brevity. There are two versions: the
   first uses optional characters from set O (and so may not pass
   through some mail gateways), and the second does not.

      Content-type: text/plain; charset=utf-7

      Below is the full Chinese text of the Analects (+itaKng-).

      The sources for the text are:

      "The sayings of Confucius," James R. Ware, trans. +U/BTFw-:
      +ZYeB9FH6ckh5Pg-, 1980. (Chinese text with English translation)

      +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-: +Ti1XC2b4Xpc-, 1990.

      "The Chinese Classics with a Translation, Critical and
      Exegetical Notes, Prolegomena, and Copius Indexes," James
      Legge, trans., Taipei: Southern Materials Center Publishing,
      Inc., 1991. (Chinese text with English translation)

      Big Five and GB versions of the text are being made available
      separately.

      Neither the Big Five nor GB contain all the characters used in
      this text. Missing characters have been indicated using their
      Unicode/ISO 10646 code points. "U+-" followed by four
      hexadecimal digits indicates a Unicode/10646 code (e.g.,
      U+-9F08). There is no good solution to the problem of the small
      size of the Big Five/GB character sets; this represents the
      solution I find personally most satisfactory.

      (omitted...)

      I have tried to minimize this problem by using variant
      characters where they were available and the character
      actually in the text was not. Only variants listed as such in
      the +XrdxmVtXUXg- were used.

      (omitted...)

      John H. Jenkins
      +TpVPXGBG-
      jenkins@apple.com
      5 January 1993
      (omitted...)

      Content-type: text/plain; charset=utf-7

      Below is the full Chinese text of the Analects (+itaKng-).

      The sources for the text are:

      +ACI-The sayings of Confucius,+ACI- James R. Ware, trans. +U/BTFw-:
      +ZYeB9FH6ckh5Pg-, 1980. (Chinese text with English translation)

      +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-: +Ti1XC2b4Xpc-, 1990.

      +ACI-The Chinese Classics with a Translation, Critical and
      Exegetical Notes, Prolegomena, and Copius Indexes,+ACI- James
      Legge, trans., Taipei: Southern Materials Center Publishing,
      Inc., 1991. (Chinese text with English translation)

      Big Five and GB versions of the text are being made available
      separately.

      Neither the Big Five nor GB contain all the characters used in
      this text. Missing characters have been indicated using their
      Unicode/ISO 10646 code points. +ACI-U+-+ACI- followed by four
      hexadecimal digits indicates a Unicode/10646 code (e.g.,
      U+-9F08). There is no good solution to the problem of the small
      size of the Big Five/GB character sets+ADs- this represents the
      solution I find personally most satisfactory.

      (omitted...)

      I have tried to minimize this problem by using variant
      characters where they were available and the character
      actually in the text was not. Only variants listed as such in
      the +XrdxmVtXUXg- were used.

      (omitted...)

      John H. Jenkins
      +TpVPXGBG-
      jenkins+AEA-apple.com
      5 January 1993
      (omitted...)

Security Considerations

   Security issues are not discussed in this memo.

References

   [UNICODE 2.0] "The Unicode Standard, Version 2.0", The Unicode
                 Consortium, Addison-Wesley, 1996. ISBN 0-201-48345-9.

   [ISO 10646]   ISO/IEC 10646-1:1993(E) Information Technology--Universal
                 Multiple-octet Coded Character Set (UCS). See also
                 amendments 1 through 7, plus editorial corrections.

   [RFC-1641]    Goldsmith, D., and M.
                 Davis, "Using Unicode with MIME",
                 RFC 1641, Taligent, Inc., July 1994.

   [US-ASCII]    Coded Character Set--7-bit American Standard Code for
                 Information Interchange, ANSI X3.4-1986.

   [ISO-8859]    Information Processing -- 8-bit Single-Byte Coded Graphic
                 Character Sets -- Part 1: Latin Alphabet No. 1, ISO
                 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2,
                 1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988.
                 Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5:
                 Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6:
                 Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7:
                 Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
                 Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin
                 alphabet No. 5, ISO 8859-9, 1990.

   [RFC822]      Crocker, D., "Standard for the Format of ARPA Internet
                 Text Messages", STD 11, RFC 822, UDEL, August 1982.

   [MIME]        Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
                 Postel, "MIME (Multipurpose Internet Mail Extensions)
                 Parts One through Five", RFC 2045, 2046, 2047, 2048, and
                 2049, November 1996.

Authors' Addresses

   David Goldsmith
   Apple Computer, Inc.
   2 Infinite Loop, MS: 302-2IS
   Cupertino, CA 95014

   Phone: 408-974-1957
   Fax: 408-862-4566
   EMail: goldsmith@apple.com

   Mark Davis
   Taligent, Inc.
   10201 N. DeAnza Blvd.
   Cupertino, CA 95014-2233

   Phone: 408-777-5116
   Fax: 408-777-5081
   EMail: mark_davis@taligent.com
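
Illustrative Sketch of Rules 1 through 3

   The rules in the UTF-7 Definition section can be exercised with a
   short program. What follows is a minimal sketch in Python, offered
   as one possible reading of those rules rather than as a reference
   implementation. The names decode_utf7, encode_utf7, B64 and DIRECT
   are hypothetical and chosen only for illustration. The encoder
   assumes the most conservative policy the rules allow: characters in
   Set O are always shifted, and every shifted sequence is closed with
   an explicit "-", even where the rules make the terminator optional.

```python
import base64

# Set B: the Base64 alphabet of RFC 2045 without the "=" pad character.
B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
       "abcdefghijklmnopqrstuvwxyz"
       "0123456789+/")
# Set D plus the Rule 3 characters (space, tab, CR, LF). Set O is left out
# on purpose: this sketch assumes the conservative choice of shifting it.
DIRECT = set(B64[:62]) | set("'(),-./:?") | set(" \t\r\n")


def decode_utf7(data: str) -> str:
    """Decode a string of US-ASCII characters holding UTF-7 text."""
    out = []
    i = 0
    while i < len(data):
        if data[i] != "+":
            out.append(data[i])            # Rules 1 and 3: direct characters
            i += 1
            continue
        i += 1                             # skip the shift character "+"
        if data[i:i + 1] == "-":           # special case: "+-" encodes "+"
            out.append("+")
            i += 1
            continue
        start = i
        while i < len(data) and data[i] in B64:
            i += 1
        if i == start:
            raise ValueError("ill-formed: '+' not followed by set B or '-'")
        bits = "".join(format(B64.index(c), "06b") for c in data[start:i])
        whole = len(bits) - len(bits) % 16  # bits in complete 16-bit units
        if "1" in bits[whole:]:
            raise ValueError("ill-formed: non-zero padding bits")
        octets = bytes(int(bits[k:k + 8], 2) for k in range(0, whole, 8))
        out.append(octets.decode("utf-16-be"))  # most significant octet first
        if data[i:i + 1] == "-":           # a terminating "-" is absorbed
            i += 1
    return "".join(out)


def encode_utf7(text: str) -> str:
    """Encode text, shifting everything outside DIRECT ("+" becomes "+-")."""
    out = []
    run = []                               # characters awaiting shifted encoding

    def flush() -> None:
        if run:
            octets = "".join(run).encode("utf-16-be")
            # Standard Base64 pads with zero bits and then "="; dropping the
            # "=" leaves the Modified Base64 form described in Rule 2.
            b64 = base64.b64encode(octets).decode("ascii").rstrip("=")
            out.append("+" + b64 + "-")
            run.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")
        elif ch in DIRECT:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


print(encode_utf7("Item 3 is \u00a31."))                 # Item 3 is +AKM-1.
print(decode_utf7("+ZeVnLIqe-") == "\u65e5\u672c\u8a9e")  # True
```

   Because the terminating "-" and the direct use of Set O are both
   optional, an encoder written to a different policy can legitimately
   produce different output for the same text (for example "A+ImIDkQ."
   instead of "A+ImIDkQ-."); a decoder has to accept both forms.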