idnits 2.17.1 

draft-goldsmith-utf7-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-25) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 781 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (11 March 1997) is 9907 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'ISO 10646' is defined on line 596, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC-1641' is defined on line 600, but no explicit
     reference was found in the text

  == Unused Reference: 'US-ASCII' is defined on line 603, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO-8859' is defined on line 606, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC822' is defined on line 617, but no explicit
     reference was found in the text

  == Unused Reference: 'MIME' is defined on line 620, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO 10646'

  ** Downref: Normative reference to an Experimental RFC: RFC 1641

  -- Possible downref: Non-RFC (?) normative reference: ref. 'US-ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO-8859'

  ** Obsolete normative reference: RFC  822 (Obsoleted by RFC 2822)


     Summary: 10 errors (**), 0 flaws (~~), 8 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                       D. Goldsmith
2	Internet Draft <draft-goldsmith-utf7-02.txt>        Apple Computer, Inc.
3	Expires: 11 September 1997                                      M. Davis
4	Will obsolete: RFC 1642                                   Taligent, Inc.
5	                                                           11 March 1997

7	                                 UTF-7

9	              A Mail-Safe Transformation Format of Unicode

11	Status of this Memo

13	   This document is an Internet-Draft. Internet-Drafts are working
14	   documents of the Internet Engineering Task Force (IETF), its areas,
15	   and its working groups. Note that other groups may also distribute
16	   working documents as Internet-Drafts.  Internet-Drafts are draft
17	   documents valid for a maximum of six months.

19	   Internet-Drafts may be updated, replaced, or obsoleted by other
20	   documents at any time. It is not appropriate to use Internet-Drafts
21	   as reference material or to cite them other than as a "working draft"
22	   or "work in progress".

24	   To learn the current status of any Internet-Draft, please check the
25	   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
26	   Directories on ds.internic.net (US East Coast), nic.nordu.net
27	   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
28	   Rim).

30	   Distribution of this document is unlimited. Please send comments to
31	   the author at <goldsmith@apple.com>. This document is intended to
32	   become an experimental RFC.

34	Abstract

36	   The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
37	   amended) jointly define a character set (hereafter referred to as
38	   Unicode) which encompasses most of the world's writing systems.
39	   However, Internet mail (STD 11, RFC 822) currently supports only 7-
40	   bit US ASCII as a character set. MIME (RFC 2045 through 2049) extends
41	   Internet mail to support different media types and character sets,
42	   and thus could support Unicode in mail messages. MIME neither defines
43	   Unicode as a permitted character set nor specifies how it would be
44	   encoded, although it does provide for the registration of additional
45	   character sets over time.

47	   This document describes a transformation format of Unicode that
48	   contains only 7-bit ASCII octets and is intended to be readable by
49	   humans in the limiting case that the document consists of characters
50	   from the US-ASCII repertoire. It also specifies how this
51	   transformation format is used in the context of MIME and RFC 1641,
52	   "Using Unicode with MIME".

54	Motivation

56	   Although other transformation formats of Unicode exist and could
57	   conceivably be used in this context (most notably UTF-8, also known
58	   as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
59	   octets in the range decimal 128 through 255 to encode Unicode
60	   characters outside the US-ASCII range. Thus, in the context of mail,
61	   those octets must themselves be encoded. This requires putting text
62	   through two successive encoding processes, and leads to a significant
63	   expansion of characters outside the US-ASCII range, putting non-
64	   English speakers at a disadvantage. For example, using UTF-8 together
65	   with the Quoted-Printable content transfer encoding of MIME
66	   represents US-ASCII characters in one octet, but other characters may
67	   require up to nine octets.

69	Overview

71	   UTF-7 encodes Unicode characters as US-ASCII octets, together with
72	   shift sequences to encode characters outside that range. For this
73	   purpose, one of the characters in the US-ASCII repertoire is reserved
74	   for use as a shift character.

76	   Many mail gateways and systems cannot handle the entire US-ASCII
77	   character set (those based on EBCDIC, for example), and so UTF-7
78	   contains provisions for encoding characters within US-ASCII in a way
79	   that all mail systems can accommodate.

81	   UTF-7 should normally be used only in the context of 7 bit
82	   transports, such as mail. In other contexts, straight Unicode or
83	   UTF-8 is preferred.

85	   See RFC 1641, "Using Unicode with MIME" for the overall specification
86	   on usage of Unicode transformation formats with MIME.

88	Definitions

90	   First, the definition of Unicode:

92	      The 16 bit character set Unicode is defined by "The Unicode
93	      Standard, Version 2.0". This character set is identical with the
94	      character repertoire and coding of the international standard
95	      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
96	      Subset=300; Implementation Level=3, including the first 7
97	      amendments to 10646 plus editorial corrections.

99	      Note. Unicode 2.0 further specifies the use and interaction of
100	      these character codes beyond the ISO standard. However, any valid
101	      10646 sequence is a valid Unicode sequence, and vice versa;
102	      Unicode supplies interpretations of sequences on which the ISO
103	      standard is silent as to interpretation.

105	   Next, some handy definitions of US-ASCII character subsets:

107	      Set D (directly encoded characters) consists of the following
108	      characters (derived from RFC 1521, Appendix B, which no longer
109	      appears in RFC 2045): the upper and lower case letters A through Z
110	      and a through z, the 10 digits 0-9, and the following nine special
111	      characters (note that "+" and "=" are omitted):

113	               Character   ASCII & Unicode Value (decimal)
114	                  '           39
115	                  (           40
116	                  )           41
117	                  ,           44
118	                  -           45
119	                  .           46
120	                  /           47
121	                  :           58
122	                  ?           63

124	      Set O (optional direct characters) consists of the following
125	      characters (note that "\" and "~" are omitted):

127	               Character   ASCII & Unicode Value (decimal)
128	                  !           33
129	                  "           34
130	                  #           35
131	                  $           36
132	                  %           37
133	                  &           38
134	                  *           42
135	                  ;           59
136	                  <           60
137	                  =           61
138	                  >           62
139	                  @           64
140	                  [           91
141	                  ]           93
142	                  ^           94
143	                  _           95
144	                  `           96
145	                  {           123
146	                  |           124
147	                  }           125

149	   Rationale. The characters "\" and "~" are omitted because they are
150	   often redefined in variants of ASCII.

152	   Set B (Modified Base 64) is the set of characters in the Base64
153	   alphabet defined in RFC 2045, excluding the pad character "="
154	   (decimal value 61).

156	   Rationale. The pad character = is excluded because UTF-7 is designed
157	   for use within header fields as set forth in RFC 2047. Since the only
158	   readable encoding in RFC 2047 is "Q" (based on RFC 2045's Quoted-
159	   Printable), the "=" character is not available for use (without a lot
160	   of escape sequences). This was very unfortunate but unavoidable. The
161	   "=" character could otherwise have been used as the UTF-7 escape
162	   character as well (rather than using "+").

164	   Note that all characters in US-ASCII have the same value in Unicode
165	   when zero-extended to 16 bits.
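
   As an illustration only, the three character sets defined above can
   be written down as Python constants. The names SET_D, SET_O, SET_B
   and PRINTABLE are assumptions made for this sketch and do not appear
   in the specification. Set O is built from the decimal values in the
   table rather than from the glyphs.

```python
import string

# Set D: letters, digits, and the nine special characters (by decimal value).
SET_D = frozenset(string.ascii_letters + string.digits) | frozenset(
    chr(n) for n in (39, 40, 41, 44, 45, 46, 47, 58, 63))

# Set O: the optional direct characters, by decimal value from the table.
SET_O = frozenset(chr(n) for n in (
    33, 34, 35, 36, 37, 38, 42, 59, 60, 61, 62, 64,
    91, 93, 94, 95, 96, 123, 124, 125))

# Set B: the Base64 alphabet of RFC 2045 without the "=" pad character.
SET_B = frozenset(string.ascii_letters + string.digits + "+/")

# Printable US-ASCII, decimal 33 through 126.
PRINTABLE = frozenset(chr(n) for n in range(33, 127))

# Sets D and O do not overlap, and together they cover every printable
# US-ASCII character except "+", "\" and "~".
assert not (SET_D & SET_O)
assert PRINTABLE - SET_D - SET_O == {"+", "\\", "~"}
```

   The final two assertions restate the notes in the set definitions:
   "+" is reserved as the shift character, and "\" and "~" are left
   out of both sets.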

167	UTF-7 Definition

169	   A UTF-7 stream represents 16-bit Unicode characters using 7-bit US-
170	   ASCII octets as follows:

172	      Rule 1: (direct encoding) Unicode characters in set D above may be
173	      encoded directly as their ASCII equivalents. Unicode characters in
174	      Set O may optionally be encoded directly as their ASCII
175	      equivalents, bearing in mind that many of these characters are
176	      illegal in header fields, or may not pass correctly through some
177	      mail gateways.

179	      Rule 2: (Unicode shifted encoding) Any Unicode character sequence
180	      may be encoded using a sequence of characters in set B, when
181	      preceded by the shift character "+" (US-ASCII character value
182	      decimal 43). The "+" signals that subsequent octets are to be
183	      interpreted as elements of the Modified Base64 alphabet until a
184	      character not in that alphabet is encountered. Such characters
185	      include control characters such as carriage returns and line
186	      feeds; thus, a Unicode shifted sequence always terminates at the
187	      end of a line. As a special case, if the sequence terminates with
188	      the character "-" (US-ASCII decimal 45) then that character is
189	      absorbed; other terminating characters are not absorbed and are
190	      processed normally.

192	      Note that if the first character after the shifted sequence is "-"
193	      then an extra "-" must be present to terminate the shifted
194	      sequence so that the actual "-" is not itself absorbed.

196	      Rationale. A terminating character is necessary for cases where
197	      the next character after the Modified Base64 sequence is part of
198	      character set B or is itself the terminating character. It can
199	      also enhance readability by delimiting encoded sequences.

201	      Also as a special case, the sequence "+-" may be used to encode
202	      the character "+". A "+" character followed immediately by any
203	      character other than members of set B or "-" is an ill-formed
204	      sequence.

206	      Unicode is encoded using Modified Base64 by first converting
207	      Unicode 16-bit quantities to an octet stream (with the most
208	      significant octet first). Surrogate pairs (UTF-16) are converted
209	      by treating each half of the pair as a separate 16 bit quantity
210	      (i.e., no special treatment). Text with an odd number of octets is
211	      ill-formed. ISO 10646 characters outside the range addressable via
212	      surrogate pairs cannot be encoded.

214	      Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
215	      in the UCS-2 form are serialized as octets, the most
216	      significant octet appears first.  This is also in keeping with
217	      common network practice of choosing a canonical format for
218	      transmission.

220	      Rationale. The policy for code point allocation within ISO 10646
221	      and Unicode is that the repertoires be kept synchronized. No code
222	      points will be allocated in ISO 10646 outside the range
223	      addressable by surrogate pairs.

225	      Next, the octet stream is encoded by applying the Base64 content
226	      transfer encoding algorithm as defined in RFC 2045, modified to
227	      omit the "=" pad character. Instead, when encoding, zero bits are
228	      added to pad to a Base64 character boundary. When decoding, any
229	      bits at the end of the Modified Base64 sequence that do not
230	      constitute a complete 16-bit Unicode character are discarded. If
231	      such discarded bits are non-zero the sequence is ill-formed.
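
      The Modified Base64 step described above might be sketched in
      Python as follows. This is a minimal illustration, not a
      reference implementation; the names encode_run, decode_run and
      B64_ALPHABET are assumptions of this sketch. It handles only the
      characters between the shift character and the terminator.

```python
import base64
import string

B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")

def encode_run(text: str) -> str:
    """Encode characters as Modified Base64 (no '+' shift, no '-')."""
    # 16-bit quantities, most significant octet first; characters
    # outside the BMP become surrogate pairs.
    octets = text.encode("utf-16-be")
    # Standard Base64 already pads the last group with zero bits, so
    # dropping the "=" pad characters yields Modified Base64.
    return base64.b64encode(octets).decode("ascii").rstrip("=")

def decode_run(encoded: str) -> str:
    """Decode a Modified Base64 run back into characters."""
    bits = 0          # bits collected but not yet emitted
    nbits = 0         # how many bits are currently held in `bits`
    octets = bytearray()
    for ch in encoded:
        bits = (bits << 6) | B64_ALPHABET.index(ch)
        nbits += 6
        if nbits >= 16:
            nbits -= 16
            octets += ((bits >> nbits) & 0xFFFF).to_bytes(2, "big")
            bits &= (1 << nbits) - 1
    # Leftover bits are discarded, but they must all be zero.
    if bits:
        raise ValueError("ill-formed: non-zero trailing bits")
    return octets.decode("utf-16-be")
```

      For instance, encode_run applied to the two characters
      hexadecimal 2262 and 0391 gives "ImIDkQ", and decode_run
      reverses it.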

233	      Rationale. The pad character "=" is not used when encoding
234	      Modified Base64 because of the conflict with its use as an escape
235	      character for the Q content transfer encoding in RFC 2047 header
236	      fields, as mentioned above.

238	      Rule 3: The space (decimal 32), tab (decimal 9), carriage return
239	      (decimal 13), and line feed (decimal 10) characters may be
240	      directly represented by their ASCII equivalents. However, note
241	      that MIME content transfer encodings have rules concerning the use
242	      of such characters. Usage that does not conform to the
243	      restrictions of RFC 822, for example, would have to be encoded
244	      using MIME content transfer encodings other than 7bit or 8bit,
245	      such as quoted-printable, binary, or base64.

247	   Given this set of rules, Unicode characters which may be encoded via
248	   rules 1 or 3 take one octet per character, and other Unicode
249	   characters are encoded on average with 2 2/3 octets per character
250	   plus one octet to switch into Modified Base64 and an optional octet
251	   to switch out.

253	      Example. The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>."
254	      (hexadecimal 0041,2262,0391,002E) may be encoded as follows:

256	            A+ImIDkQ.

258	      Example. The Unicode sequence "Hi Mom -<WHITE SMILING FACE>-!"
259	      (hexadecimal 0048, 0069, 0020, 004D, 006F, 006D, 0020, 002D, 263A,
260	      002D, 0021) may be encoded as follows:

262	            Hi Mom -+Jjo--!

264	      Example. The Unicode sequence representing the Han characters for
265	      the Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) may be
266	      encoded as follows:

268	            +ZeVnLIqe-
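
      The three examples above can be reproduced with a small encoder
      that follows Rules 1 through 3. The sketch below is one possible
      reading of the rules, not a definitive implementation; the
      function name utf7_encode and the use_set_o parameter are
      assumptions made here. It emits the terminating "-" only when
      the next character is in set B, is itself "-", or when the text
      ends.

```python
import base64
import string

SET_D = frozenset(string.ascii_letters + string.digits + "'(),-./:?")
SET_O = frozenset(chr(n) for n in (
    33, 34, 35, 36, 37, 38, 42, 59, 60, 61, 62, 64,
    91, 93, 94, 95, 96, 123, 124, 125))
SET_B = frozenset(string.ascii_letters + string.digits + "+/")
RULE_3 = frozenset(" \t\r\n")   # space, tab, CR, LF

def utf7_encode(text: str, use_set_o: bool = True) -> str:
    direct = SET_D | RULE_3 | (SET_O if use_set_o else frozenset())
    out = []
    run = []    # characters waiting to go into one shifted sequence

    def flush(next_ch):
        if not run:
            return
        octets = "".join(run).encode("utf-16-be")
        b64 = base64.b64encode(octets).decode("ascii").rstrip("=")
        out.append("+" + b64)
        # Terminate explicitly when the following character would
        # otherwise be taken as Base64 data or be absorbed.
        if next_ch is None or next_ch in SET_B or next_ch == "-":
            out.append("-")
        run.clear()

    for ch in text:
        if ch == "+":
            flush(ch)
            out.append("+-")        # special case for a literal "+"
        elif ch in direct:
            flush(ch)
            out.append(ch)          # Rules 1 and 3
        else:
            run.append(ch)          # Rule 2
    flush(None)
    return "".join(out)

print(utf7_encode("A\u2262\u0391."))
print(utf7_encode("Hi Mom -\u263a-!"))
print(utf7_encode("\u65e5\u672c\u8a9e"))
```

      Because the terminating "-" is optional before characters
      outside set B, other encoders may legitimately produce
      different, equally valid output for the same text.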

270	Use of Character Set UTF-7 Within MIME

272	   Character set UTF-7 is safe for mail transmission and therefore may
273	   be used with any content transfer encoding in MIME (except where line
274	   length and line break restrictions are violated). Specifically, the 7
275	   bit encoding for bodies and the Q encoding for headers are both
276	   acceptable. The MIME character set tag is UTF-7. This signifies any
277	   version of Unicode equal to or greater than 2.0.

279	      Example. Here is a text portion of a MIME message containing the
280	      Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (hexadecimal 0048,
281	      0069, 0020, 004D, 006F, 006D, 0020, 263A, 0021).

283	      Content-Type: text/plain; charset=UTF-7

285	      Hi Mom +Jjo-!

287	      Example. Here is a text portion of a MIME message containing the
288	      Unicode sequence representing the Han characters for the Japanese
289	      word "nihongo" (hexadecimal 65E5,672C,8A9E).

291	      Content-Type: text/plain; charset=UTF-7

293	      +ZeVnLIqe-

295	      Example. Here is a text portion of a MIME message containing the
296	      Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (hexadecimal
297	      0041,2262,0391,002E).

299	      Content-Type: text/plain; charset=utf-7

301	      A+ImIDkQ.

303	      Example. Here is a text portion of a MIME message containing the
304	      Unicode sequence "Item 3 is <POUND SIGN>1."  (hexadecimal 0049,
305	      0074, 0065, 006D, 0020, 0033, 0020, 0069, 0073, 0020, 00A3, 0031,
306	      002E).

308	      Content-Type: text/plain; charset=UTF-7

310	      Item 3 is +AKM-1.
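
   The message bodies in the examples above can be decoded with a
   sketch like the following. It is an illustration under stated
   assumptions rather than a complete decoder: the name utf7_decode is
   hypothetical, the input is taken to be a string of US-ASCII
   characters, and no check is made that directly encoded characters
   belong to sets D or O.

```python
import string

B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")

def utf7_decode(data: str) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        ch = data[i]
        if ch != "+":
            out.append(ch)              # directly encoded character
            i += 1
            continue
        i += 1                          # skip the shift character
        if i < n and data[i] == "-":
            out.append("+")             # "+-" stands for a literal "+"
            i += 1
            continue
        start = i
        bits = nbits = 0
        octets = bytearray()
        while i < n and data[i] in B64_ALPHABET:
            bits = (bits << 6) | B64_ALPHABET.index(data[i])
            nbits += 6
            if nbits >= 16:
                nbits -= 16
                octets += ((bits >> nbits) & 0xFFFF).to_bytes(2, "big")
                bits &= (1 << nbits) - 1
            i += 1
        # "+" followed by neither Base64 data nor "-" is ill-formed, as
        # are non-zero bits left over at the end of the sequence.
        if i == start or bits:
            raise ValueError("ill-formed shifted sequence")
        out.append(octets.decode("utf-16-be"))
        if i < n and data[i] == "-":
            i += 1                      # a terminating "-" is absorbed
    return "".join(out)

print(utf7_decode("Hi Mom +Jjo-!"))
print(utf7_decode("Item 3 is +AKM-1."))
```

   Any terminating character other than "-" is left in place and
   handled by the next pass of the outer loop, which matches the
   absorption rule in Rule 2.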

312	   Note that to achieve the best interoperability with systems that may
313	   not support Unicode or MIME, when preparing text for mail
314	   transmission line breaks should follow Internet conventions. This
315	   means that lines should be short and terminated with the proper SMTP
316	   CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and
317	   PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP
318	   line breaks. Ideally, this would be handled transparently by a
319	   Unicode-aware user agent.

321	   This preparation is not absolutely necessary, since UTF-7 and the
322	   appropriate MIME content transfer encoding can handle text that does
323	   not follow Internet conventions, but readability by systems without
324	   Unicode or MIME will be impaired. See RFC 2045 for a discussion of
325	   mail interoperability issues.

327	   Lines should never be broken in the middle of a UTF-7 shifted
328	   sequence, since such sequences may not cross line breaks. Therefore,
329	   UTF-7 encoding should take place after line breaking. If a line
330	   containing a shifted sequence is too long after encoding, a MIME
331	   content transfer encoding such as Quoted Printable can be used to
332	   encode the text. Another possibility is to perform line breaking and
333	   UTF-7 encoding at the same time, so that lines containing shifted
334	   sequences already conform to length restrictions.
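
   The ordering recommended above, line breaking first and UTF-7
   encoding second, might look like this in Python. The sketch relies
   on the "utf-7" codec that ships with Python and on textwrap for
   line breaking; the function name wrap_then_encode and the default
   width are assumptions. The width applies to the text before
   encoding, so encoded lines can still come out longer than intended.

```python
import textwrap

def wrap_then_encode(text: str, width: int = 60) -> bytes:
    # Break the text into lines first, so that no shifted sequence
    # can end up spanning a line break.
    lines = textwrap.wrap(text, width=width)
    # Encode each line on its own and join with the SMTP CRLF sequence.
    return b"\r\n".join(line.encode("utf-7") for line in lines)

body = wrap_then_encode("Hi Mom \u263a! " * 8, width=20)
for line in body.split(b"\r\n"):
    # Every line decodes independently of its neighbours.
    print(line.decode("utf-7"))
```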

336	Discussion

338	   In this section we will motivate the introduction of UTF-7 as opposed
339	   to the alternative of using the existing transformation formats of
340	   Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before
341	   discussing this, it will be useful to list some assumptions about
342	   character frequency within typical natural language text strings that
343	   we use to estimate typical storage requirements:

345	   1. Most Western European languages use roughly 7/8 of their letters
346	      from US-ASCII and 1/8 from Latin 1 (ISO-8859-1).

348	   2. Most non-Roman alphabet-based languages (e.g., Greek) use about
349	      1/6 of their letters from ASCII (since white space is in the 7-bit
350	      area) and the rest from their alphabets.

352	   3. East Asian ideographic-based languages (including Japanese) use
353	      essentially all of their characters from the Han or CJK syllabary
354	      area.

356	   4. Non-directly encoded punctuation characters do not occur
357	      frequently enough to affect the results.

359	   Notice that current 8 bit standards, such as ISO-8859-x, require use
360	   of a content transfer encoding. For comparison with the subsequent
361	   discussion, the costs break down as follows (note that many of these
362	   figures are approximate since they depend on the exact composition of
363	   the text):

365	   8859-x in Base64

367	      Text type          Average octets/character
368	      All                      1.33

370	   8859-x in Quoted Printable

372	      Text type          Average octets/character
373	      US-ASCII                 1
374	      Western European         1.25
375	      Other                    2.67

377	   Note also that Unicode encoded in Base64 takes a constant 2.67 octets
378	   per character. For purposes of comparison, we will look at UTF-8 in
379	   Base64 and Quoted Printable, and UTF-7. Also note that fixed overhead
380	   for long strings is proportional to 1/n, where n is the encoded
381	   string length in octets.

383	   UTF-8 in Base64

385	      Text type          Average octets/character
386	      US-ASCII                 1.33
387	      Western European         1.5
388	      Some Alphabetics         2.44
389	      All others               4

391	   UTF-8 in Quoted Printable

393	      Text type          Average octets/character
394	      US-ASCII                 1
395	      Western European         1.63
396	      Some Alphabetics         5.17
397	      All others               7-9

399	   UTF-7

401	      Text type          Average octets/character
402	      Most US-ASCII            1
403	      Western European         1.5
404	      All others               2.67+2/n
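
   Most of the figures in the tables above follow from assumptions 1
   and 2 by simple arithmetic, as the sketch below shows. It assumes
   that each non-ASCII letter in these languages occupies one octet in
   8859-x and two octets in UTF-8, that Base64 expands by 4/3, and
   that Quoted Printable writes each non-ASCII octet as three octets.
   The names used are illustrative only.

```python
from fractions import Fraction as F

B64 = F(4, 3)   # Base64: 4 output octets per 3 input octets
QP = 3          # Quoted Printable: "=XX" per non-ASCII octet

# (share of US-ASCII letters, share of other letters)
WESTERN = (F(7, 8), F(1, 8))      # assumption 1
ALPHABETIC = (F(1, 6), F(5, 6))   # assumption 2

def mix(shares, ascii_cost, other_cost):
    """Average octets per character for a given letter mix."""
    ascii_share, other_share = shares
    return ascii_share * ascii_cost + other_share * other_cost

table = {
    "8859-x QP, Western European": mix(WESTERN, 1, QP),
    "8859-x QP, Other": mix(ALPHABETIC, 1, QP),
    "Unicode (16-bit) in Base64": 2 * B64,
    "UTF-8 Base64, Western European": mix(WESTERN, 1, 2) * B64,
    "UTF-8 Base64, Some Alphabetics": mix(ALPHABETIC, 1, 2) * B64,
    "UTF-8 QP, Western European": mix(WESTERN, 1, 2 * QP),
    "UTF-8 QP, Some Alphabetics": mix(ALPHABETIC, 1, 2 * QP),
}

for label, value in table.items():
    print(f"{label:32} {value} = {float(value):.3f}")
```

   The exact fractions are 5/4, 8/3, 8/3, 3/2, 22/9, 13/8 and 31/6,
   which round to the values 1.25, 2.67, 2.67, 1.5, 2.44, 1.63 and
   5.17 given in the tables.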

406	   We feel that the UTF-8 in Quoted Printable option is not viable due
407	   to the very large expansion of all text except Western European. This
408	   would only be viable in texts consisting of large expanses of US-
409	   ASCII or Latin characters with occasional other characters
410	   interspersed. We would prefer to introduce one encoding that works
411	   reasonably well for all users.

413	   We also feel that UTF-8 in Base64 has high expansion for non-
414	   Western-European users, and is less desirable because it cannot be
415	   read directly, even when the content is largely US-ASCII. The base
416	   encoding of UTF-7 gives competitive results and is readable for ASCII
417	   text.

419	   UTF-7 gives results competitive with ISO-8859-x, with access to all
420	   of the Unicode character set. We believe this justifies the
421	   introduction of a new transformation format of Unicode.

423	   As an alternative to use of UTF-7, it might be possible to intermix
424	   Unicode characters with other character sets using an existing MIME
425	   mechanism, the multipart/mixed content type, ignoring for the moment
426	   the issues with line breaks (thanks to Nathaniel Borenstein for
427	   suggesting this). For instance (repeating an earlier example):

429	      Content-type: multipart/mixed; boundary=foo
430	      Content-Disposition: inline

432	      --foo
433	      Content-type: text/plain; charset=us-ascii

435	      Hi Mom
436	      --foo
437	      Content-type: text/plain; charset=UNICODE-2-0
438	      Content-transfer-encoding: base64

440	      Jjo=
441	      --foo
442	      Content-type: text/plain; charset=us-ascii

444	      !
445	      --foo--

447	   Theoretically, this removes the need for UTF-7 in message bodies
448	   (multipart may not be used in header fields). However, we feel that
449	   as use of the Unicode character set becomes more widespread,
450	   intermittent use of specialized Unicode characters (such as dingbats
451	   and mathematical symbols) will occur, and that text will also
452	   typically include small snippets from other scripts, such as
453	   Cyrillic, Greek, or East Asian languages (anything in the Roman
454	   script is already handled adequately by existing MIME character
455	   sets). Although the multipart technique works well for large chunks
456	   of text in alternating character sets, we feel it does not adequately
457	   support the kinds of uses just discussed, and so we still believe the
458	   introduction of UTF-7 is justified.

460	Summary

462	   The UTF-7 encoding allows Unicode characters to be encoded within the
463	   US-ASCII 7 bit character set. It is most effective for Unicode
464	   sequences which contain relatively long strings of US-ASCII
465	   characters interspersed with either single Unicode characters or
466	   strings of Unicode characters, as it allows the US-ASCII portions to
467	   be read on systems without direct Unicode support.

469	   UTF-7 should only be used with 7 bit transports such as mail. In
470	   other contexts, use of straight Unicode or UTF-8 is preferred.

472	Acknowledgements

474	   Many thanks to the following people for their contributions,
475	   comments, and suggestions. If we have omitted anyone it was through
476	   oversight and not intentionally.

478	         Glenn Adams
479	         Harald T. Alvestrand
480	         Nathaniel Borenstein
481	         Lee Collins
482	         Jim Conklin
483	         Dave Crocker
484	         Steve Dorner
485	         Dana S. Emery
486	         Ned Freed
487	         Kari E. Hurtta
488	         John H. Jenkins
489	         John C. Klensin
490	         Valdis Kletnieks
491	         Keith Moore
492	         Masataka Ohta
493	         Einar Stefferud
494	         Erik M. van der Poel

496	Appendix A -- Examples

498	   Here is a longer example, taken from a document originally in Big5
499	   code. It has been condensed for brevity. There are two versions: the
500	   first uses optional characters from set O (and so may not pass
501	   through some mail gateways), and the second does not.

503	   Content-type: text/plain; charset=utf-7

505	   Below is the full Chinese text of the Analects (+itaKng-).

507	   The sources for the text are:

509	   "The sayings of Confucius," James R. Ware, trans.  +U/BTFw-:
510	   +ZYeB9FH6ckh5Pg-, 1980.  (Chinese text with English translation)

512	   +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-:  +Ti1XC2b4Xpc-, 1990.

514	   "The Chinese Classics with a Translation, Critical and
515	   Exegetical Notes, Prolegomena, and Copius Indexes," James
516	   Legge, trans., Taipei:  Southern Materials Center Publishing,
517	   Inc., 1991.  (Chinese text with English translation)

519	   Big Five and GB versions of the text are being made available
520	   separately.

522	   Neither the Big Five nor GB contain all the characters used in
523	   this text.  Missing characters have been indicated using their
524	   Unicode/ISO 10646 code points.  "U+-" followed by four
525	   hexadecimal digits indicates a Unicode/10646 code (e.g.,
526	   U+-9F08).  There is no good solution to the problem of the small
527	   size of the Big Five/GB character sets; this represents the
528	   solution I find personally most satisfactory.

530	   (omitted...)

532	   I have tried to minimize this problem by using variant
533	   characters where they were available and the character
534	   actually in the text was not.  Only variants listed as such in
535	   the +XrdxmVtXUXg- were used.

537	   (omitted...)

539	   John H. Jenkins
540	   +TpVPXGBG-
541	   jenkins@apple.com
542	   5 January 1993
543	   (omitted...)

545	   Content-type: text/plain; charset=utf-7

547	   Below is the full Chinese text of the Analects (+itaKng-).

549	   The sources for the text are:

551	   +ACI-The sayings of Confucius,+ACI- James R. Ware, trans.  +U/BTFw-:
552	   +ZYeB9FH6ckh5Pg-, 1980.  (Chinese text with English translation)

554	   +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-:  +Ti1XC2b4Xpc-, 1990.

556	   +ACI-The Chinese Classics with a Translation, Critical and
557	   Exegetical Notes, Prolegomena, and Copius Indexes,+ACI- James
558	   Legge, trans., Taipei:  Southern Materials Center Publishing,
559	   Inc., 1991.  (Chinese text with English translation)

561	   Big Five and GB versions of the text are being made available
562	   separately.

564	   Neither the Big Five nor GB contain all the characters used in
565	   this text.  Missing characters have been indicated using their
566	   Unicode/ISO 10646 code points.  +ACI-U+-+ACI- followed by four
567	   hexadecimal digits indicates a Unicode/10646 code (e.g.,
568	   U+-9F08).  There is no good solution to the problem of the small
569	   size of the Big Five/GB character sets+ADs- this represents the
570	   solution I find personally most satisfactory.

572	   (omitted...)

574	   I have tried to minimize this problem by using variant
575	   characters where they were available and the character
576	   actually in the text was not.  Only variants listed as such in
577	   the +XrdxmVtXUXg- were used.

579	   (omitted...)

581	   John H. Jenkins
582	   +TpVPXGBG-
583	   jenkins+AEA-apple.com
584	   5 January 1993
585	   (omitted...)
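
   As a quick cross-check, lines taken from the two versions above
   decode to the same text. The sketch uses the "utf-7" codec that
   ships with Python; the variable names are assumptions, and only a
   few lines of the appendix are compared.

```python
# One line from the first version (set O used directly) and the
# corresponding line from the second version (set O shifted).
WITH_SET_O = (b'"The sayings of Confucius," James R. Ware, '
              b'trans.  +U/BTFw-:')
WITHOUT_SET_O = (b'+ACI-The sayings of Confucius,+ACI- James R. Ware, '
                 b'trans.  +U/BTFw-:')

first = WITH_SET_O.decode("utf-7")
second = WITHOUT_SET_O.decode("utf-7")
assert first == second

# The shifted "@" and the "+-" escape behave the same way.
assert b"jenkins+AEA-apple.com".decode("utf-7") == "jenkins@apple.com"
assert b"U+-9F08".decode("utf-7") == "U+9F08"
```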

587	Security Considerations

589	   Security issues are not discussed in this memo.

591	References

593	[UNICODE 2.0]  "The Unicode Standard, Version 2.0", The Unicode
594	               Consortium, Addison-Wesley, 1996. ISBN 0-201-48345-9.

596	[ISO 10646]    ISO/IEC 10646-1:1993(E) Information Technology--Universal
597	               Multiple-octet Coded Character Set (UCS). See also
598	               amendments 1 through 7, plus editorial corrections.

600	[RFC-1641]     Goldsmith, D., and M. Davis, "Using Unicode with MIME",
601	               RFC 1641, Taligent, Inc., July 1994.

603	[US-ASCII]     Coded Character Set--7-bit American Standard Code for
604	               Information Interchange, ANSI X3.4-1986.

606	[ISO-8859]     Information Processing -- 8-bit Single-Byte Coded Graphic
607	               Character Sets -- Part 1: Latin Alphabet No. 1, ISO
608	               8859-1:1987.  Part 2: Latin alphabet No.  2, ISO 8859-2,
609	               1987.  Part 3: Latin alphabet No. 3, ISO 8859-3, 1988.
610	               Part 4: Latin alphabet No.  4, ISO 8859-4, 1988.  Part 5:
611	               Latin/Cyrillic alphabet, ISO 8859-5, 1988.  Part 6:
612	               Latin/Arabic alphabet, ISO 8859-6, 1987.  Part 7:
613	               Latin/Greek alphabet, ISO 8859-7, 1987.  Part 8:
614	               Latin/Hebrew alphabet, ISO 8859-8, 1988.  Part 9: Latin
615	               alphabet No. 5, ISO 8859-9, 1990.

617	[RFC822]       Crocker, D., "Standard for the Format of ARPA Internet
618	               Text Messages", STD 11, RFC 822, UDEL, August 1982.

620	[MIME]         Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
621	               Postel, "MIME (Multipurpose Internet Mail Extensions)
622	               Parts One through Five", RFC 2045, 2046, 2047, 2048, and
623	               2049, November 1996.

625	Authors' Addresses

627	   David Goldsmith
628	   Apple Computer, Inc.
629	   2 Infinite Loop, MS: 302-2IS
630	   Cupertino, CA 95014

632	   Phone: 408-974-1957
633	   Fax: 408-862-4566
634	   EMail: goldsmith@apple.com

636	   Mark Davis
637	   Taligent, Inc.
638	   10201 N. DeAnza Blvd.
639	   Cupertino, CA 95014-2233

641	   Phone: 408-777-5116
642	   Fax: 408-777-5081
643	   EMail: mark_davis@taligent.com