< draft-goldsmith-utf7-01.txt   draft-goldsmith-utf7-02.txt >
Network Working Group D. Goldsmith Network Working Group D. Goldsmith
Internet Draft <draft-goldsmith-utf7-01.txt> Apple Computer, Inc. Internet Draft <draft-goldsmith-utf7-02.txt> Apple Computer, Inc.
Expires: 3 August 1997 M. Davis Expires: 11 September 1997 M. Davis
Will obsolete: RFC 1642 Taligent, Inc. Will obsolete: RFC 1642 Taligent, Inc.
3 February 1997 11 March 1997
UTF-7 UTF-7
A Mail-Safe Transformation Format of Unicode A Mail-Safe Transformation Format of Unicode
Status of this Memo Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute and its working groups. Note that other groups may also distribute
skipping to change at line 41 skipping to change at line 40
Distribution of this document is unlimited. Please send comments to Distribution of this document is unlimited. Please send comments to
the author at <goldsmith@apple.com>. This document is intended to the author at <goldsmith@apple.com>. This document is intended to
become an experimental RFC. become an experimental RFC.
Abstract Abstract
The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
amended) jointly define a character set (hereafter referred to as amended) jointly define a character set (hereafter referred to as
Unicode) which encompasses most of the world's writing systems. Unicode) which encompasses most of the world's writing systems.
However, Internet mail (STD 11, RFC 822) currently supports only 7- However, Internet mail (STD 11, RFC 822) currently supports only 7-
bit US ASCII as a character set. MIME (RFC 1521 and RFC 1522) extends bit US ASCII as a character set. MIME (RFC 2045 through 2049) extends
Internet mail to support different media types and character sets, Internet mail to support different media types and character sets,
and thus could support Unicode in mail messages. MIME neither defines and thus could support Unicode in mail messages. MIME neither defines
Unicode as a permitted character set nor specifies how it would be Unicode as a permitted character set nor specifies how it would be
encoded, although it does provide for the registration of additional encoded, although it does provide for the registration of additional
character sets over time. character sets over time.
This document describes a transformation format of Unicode that This document describes a transformation format of Unicode that
contains only 7-bit ASCII characters and is intended to be readable contains only 7-bit ASCII octets and is intended to be readable by
by humans in the limiting case that the document consists of humans in the limiting case that the document consists of characters
characters from the US-ASCII repertoire. It also specifies how this from the US-ASCII repertoire. It also specifies how this
transformation format is used in the context of MIME and RFC 1641, transformation format is used in the context of MIME and RFC 1641,
"Using Unicode with MIME". "Using Unicode with MIME".
Motivation Motivation
Although other transformation formats of Unicode exist and could Although other transformation formats of Unicode exist and could
conceivably be used in this context (most notably UTF-8, also known conceivably be used in this context (most notably UTF-8, also known
as UTF-2 or UTF-FSS), they suffer the disadvantage that they use as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
octets in the range decimal 128 through 255 to encode Unicode octets in the range decimal 128 through 255 to encode Unicode
characters outside the US-ASCII range. Thus, in the context of mail, characters outside the US-ASCII range. Thus, in the context of mail,
those octets must themselves be encoded. This requires putting text those octets must themselves be encoded. This requires putting text
through two successive encoding processes, and leads to a significant through two successive encoding processes, and leads to a significant
expansion of characters outside the US-ASCII range, putting non- expansion of characters outside the US-ASCII range, putting non-
English speakers at a disadvantage. For example, using UTF-8 together English speakers at a disadvantage. For example, using UTF-8 together
with the Quoted-Printable content transfer encoding of MIME with the Quoted-Printable content transfer encoding of MIME
represents US-ASCII characters in one octet, but other characters may represents US-ASCII characters in one octet, but other characters may
require up to nine octets. require up to nine octets.
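The nine-octet figure can be checked with a short sketch. This is illustrative only and not part of either draft: it uses Python's standard quopri module, and the helper name utf8_qp_length is hypothetical. A BMP character that needs three UTF-8 octets has each octet escaped as "=XX" by Quoted-Printable, giving 3 x 3 = 9 octets.

```python
import quopri


def utf8_qp_length(ch: str) -> int:
    """Octets needed to carry one character as UTF-8 wrapped in Quoted-Printable."""
    utf8_octets = ch.encode("utf-8")
    return len(quopri.encodestring(utf8_octets))


# A US-ASCII letter stays at one octet.
ascii_len = utf8_qp_length("A")
# U+263A WHITE SMILING FACE is three UTF-8 octets, each escaped as "=XX".
smiley_len = utf8_qp_length("\u263a")
```

Under these assumptions, U+263A is transmitted as "=E2=98=BA", which is nine octets for a single character.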
Overview Overview
UTF-7 encodes Unicode characters as US-ASCII, together with shift UTF-7 encodes Unicode characters as US-ASCII octets, together with
sequences to encode characters outside that range. For this purpose, shift sequences to encode characters outside that range. For this
one of the characters in the US-ASCII repertoire is reserved for use purpose, one of the characters in the US-ASCII repertoire is reserved
as a shift character. for use as a shift character.
Many mail gateways and systems cannot handle the entire US-ASCII Many mail gateways and systems cannot handle the entire US-ASCII
character set (those based on EBCDIC, for example), and so UTF-7 character set (those based on EBCDIC, for example), and so UTF-7
contains provisions for encoding characters within US-ASCII in a way contains provisions for encoding characters within US-ASCII in a way
that all mail systems can accommodate. that all mail systems can accommodate.
UTF-7 should normally be used only in the context of 7 bit UTF-7 should normally be used only in the context of 7 bit
transports, such as mail and news. In other contexts, straight transports, such as mail. In other contexts, straight Unicode or
Unicode or UTF-8 is preferred. UTF-8 is preferred.
See RFC 1641, "Using Unicode with MIME" for the overall specification See RFC 1641, "Using Unicode with MIME" for the overall specification
on usage of Unicode transformation formats with MIME. on usage of Unicode transformation formats with MIME.
Definitions Definitions
First, the definition of Unicode: First, the definition of Unicode:
The 16 bit character set Unicode is defined by "The Unicode The 16 bit character set Unicode is defined by "The Unicode
Standard, Version 2.0". This character set is identical with the Standard, Version 2.0". This character set is identical with the
skipping to change at line 109 skipping to change at line 108
Note. Unicode 2.0 further specifies the use and interaction of Note. Unicode 2.0 further specifies the use and interaction of
these character codes beyond the ISO standard. However, any valid these character codes beyond the ISO standard. However, any valid
10646 sequence is a valid Unicode sequence, and vice versa; 10646 sequence is a valid Unicode sequence, and vice versa;
Unicode supplies interpretations of sequences on which the ISO Unicode supplies interpretations of sequences on which the ISO
standard is silent as to interpretation. standard is silent as to interpretation.
Next, some handy definitions of US-ASCII character subsets: Next, some handy definitions of US-ASCII character subsets:
Set D (directly encoded characters) consists of the following Set D (directly encoded characters) consists of the following
characters (derived from RFC 1521, Appendix B): the upper and characters (derived from RFC 1521, Appendix B, which no longer
lower case letters A through Z and a through z, the 10 digits 0-9, appears in RFC 2045): the upper and lower case letters A through Z
and the following nine special characters (note that "+" and "=" and a through z, the 10 digits 0-9, and the following nine special
are omitted): characters (note that "+" and "=" are omitted):
Character ASCII & Unicode Value (decimal) Character ASCII & Unicode Value (decimal)
' 39 ' 39
( 40 ( 40
) 41 ) 41
, 44 , 44
- 45 - 45
. 46 . 46
/ 47 / 47
: 58 : 58
skipping to change at line 154 skipping to change at line 153
_ 95 _ 95
` 96 ` 96
{ 123 { 123
| 124 | 124
} 125 } 125
Rationale. The characters "\" and "~" are omitted because they are Rationale. The characters "\" and "~" are omitted because they are
often redefined in variants of ASCII. often redefined in variants of ASCII.
Set B (Modified Base 64) is the set of characters in the Base64 Set B (Modified Base 64) is the set of characters in the Base64
alphabet defined in RFC 1521, excluding the pad character "=" alphabet defined in RFC 2045, excluding the pad character "="
(decimal value 61). (decimal value 61).
Rationale. The pad character = is excluded because UTF-7 is designed Rationale. The pad character = is excluded because UTF-7 is designed
for use within header fields as set forth in RFC 1522. Since the only for use within header fields as set forth in RFC 2047. Since the only
readable encoding in RFC 1522 is "Q" (based on RFC 1521's Quoted- readable encoding in RFC 2047 is "Q" (based on RFC 2045's Quoted-
Printable), the "=" character is not available for use (without a lot Printable), the "=" character is not available for use (without a lot
of escape sequences). This was very unfortunate but unavoidable. The of escape sequences). This was very unfortunate but unavoidable. The
"=" character could otherwise have been used as the UTF-7 escape "=" character could otherwise have been used as the UTF-7 escape
character as well (rather than using "+"). character as well (rather than using "+").
Note that all characters in US-ASCII have the same value in Unicode Note that all characters in US-ASCII have the same value in Unicode
when zero-extended to 16 bits. when zero-extended to 16 bits.
UTF-7 Definition UTF-7 Definition
A UTF-7 stream represents 16-bit Unicode characters in 7-bit US-ASCII A UTF-7 stream represents 16-bit Unicode characters using 7-bit US-
as follows: ASCII octets as follows:
Rule 1: (direct encoding) Unicode characters in set D above may be Rule 1: (direct encoding) Unicode characters in set D above may be
encoded directly as their ASCII equivalents. Unicode characters in encoded directly as their ASCII equivalents. Unicode characters in
Set O may optionally be encoded directly as their ASCII Set O may optionally be encoded directly as their ASCII
equivalents, bearing in mind that many of these characters are equivalents, bearing in mind that many of these characters are
illegal in header fields, or may not pass correctly through some illegal in header fields, or may not pass correctly through some
mail gateways. mail gateways.
Rule 2: (Unicode shifted encoding) Any Unicode character sequence Rule 2: (Unicode shifted encoding) Any Unicode character sequence
may be encoded using a sequence of characters in set B, when may be encoded using a sequence of characters in set B, when
skipping to change at line 212 skipping to change at line 211
Also as a special case, the sequence "+-" may be used to encode Also as a special case, the sequence "+-" may be used to encode
the character "+". A "+" character followed immediately by any the character "+". A "+" character followed immediately by any
character other than members of set B or "-" is an ill-formed character other than members of set B or "-" is an ill-formed
sequence. sequence.
Unicode is encoded using Modified Base64 by first converting Unicode is encoded using Modified Base64 by first converting
Unicode 16-bit quantities to an octet stream (with the most Unicode 16-bit quantities to an octet stream (with the most
significant octet first). Surrogate pairs (UTF-16) are converted significant octet first). Surrogate pairs (UTF-16) are converted
by treating each half of the pair as a separate 16 bit quantity by treating each half of the pair as a separate 16 bit quantity
(i.e., no special treatment). Text with an odd number of octets is (i.e., no special treatment). Text with an odd number of octets is
ill-formed. ill-formed. ISO 10646 characters outside the range addressable via
surrogate pairs cannot be encoded.
Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
in the UCS-2 form are serialized as octets, the most in the UCS-2 form are serialized as octets, the most
significant octet appears first. This is also in keeping with significant octet appears first. This is also in keeping with
common network practice of choosing a canonical format for common network practice of choosing a canonical format for
transmission. transmission.
Rationale. The policy for code point allocation within ISO 10646
and Unicode is that the repertoires be kept synchronized. No code
points will be allocated in ISO 10646 outside the range
addressable by surrogate pairs.
Next, the octet stream is encoded by applying the Base64 content Next, the octet stream is encoded by applying the Base64 content
transfer encoding algorithm as defined in RFC 1521, modified to transfer encoding algorithm as defined in RFC 2045, modified to
omit the "=" pad character. Instead, when encoding, zero bits are omit the "=" pad character. Instead, when encoding, zero bits are
added to pad to a Base64 character boundary. When decoding, any added to pad to a Base64 character boundary. When decoding, any
bits at the end of the Modified Base64 sequence that do not bits at the end of the Modified Base64 sequence that do not
constitute a complete 16-bit Unicode character are discarded. If constitute a complete 16-bit Unicode character are discarded. If
such discarded bits are non-zero the sequence is ill-formed. such discarded bits are non-zero the sequence is ill-formed.
Rationale. The pad character "=" is not used when encoding Rationale. The pad character "=" is not used when encoding
Modified Base64 because of the conflict with its use as an escape Modified Base64 because of the conflict with its use as an escape
character for the Q content transfer encoding in RFC 1522 header character for the Q content transfer encoding in RFC 2047 header
fields, as mentioned above. fields, as mentioned above.
Rule 3: The space (decimal 32), tab (decimal 9), carriage return Rule 3: The space (decimal 32), tab (decimal 9), carriage return
(decimal 13), and line feed (decimal 10) characters may be (decimal 13), and line feed (decimal 10) characters may be
directly represented by their ASCII equivalents. However, note directly represented by their ASCII equivalents. However, note
that MIME content transfer encodings have rules concerning the use that MIME content transfer encodings have rules concerning the use
of such characters. Usage that does not conform to the of such characters. Usage that does not conform to the
restrictions of RFC 822, for example, would have to be encoded restrictions of RFC 822, for example, would have to be encoded
using MIME content transfer encodings other than 7bit or 8bit, using MIME content transfer encodings other than 7bit or 8bit,
such as quoted-printable, binary, or base64. such as quoted-printable, binary, or base64.
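Taken together, Rules 1 through 3 suggest an encoder along the following lines. This is a conservative sketch under stated assumptions, not the authors' implementation: the name utf7_encode is hypothetical; Set O characters are never encoded directly (Rule 1 makes that optional); and every shifted run is closed with an explicit "-", which is assumed here to be a permitted terminator because the full text of Rule 2 is not reproduced in this diff.

```python
import base64
import string

# Set D: letters, digits, and the nine special characters listed above.
SET_D = set(string.ascii_letters + string.digits + "'(),-./:?")
# Rule 3: space, tab, carriage return, and line feed may appear directly.
DIRECT = SET_D | set(" \t\r\n")


def utf7_encode(text: str) -> str:
    """Encode text as UTF-7, shifting everything outside Set D and Rule 3."""
    out = []
    run = []  # characters waiting to be emitted as one shifted sequence

    def flush() -> None:
        if run:
            octets = "".join(run).encode("utf-16-be")
            b64 = base64.b64encode(octets).decode("ascii").rstrip("=")
            # Assumption: always close the run with an explicit "-".
            out.append("+" + b64 + "-")
            run.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")  # special case: "+-" encodes "+"
        elif ch in DIRECT:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


encoded = utf7_encode("Hi Mom -\u263a-!")
```

With these choices, "Hi Mom -" followed by U+263A and "-!" encodes as "Hi Mom -+Jjo--+ACE-". The "!" is shifted because it belongs to Set O; an encoder that chose to send Set O characters directly would emit a shorter result.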
skipping to change at line 319 skipping to change at line 324
transmission line breaks should follow Internet conventions. This transmission line breaks should follow Internet conventions. This
means that lines should be short and terminated with the proper SMTP means that lines should be short and terminated with the proper SMTP
CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and
PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP
line breaks. Ideally, this would be handled transparently by a line breaks. Ideally, this would be handled transparently by a
Unicode-aware user agent. Unicode-aware user agent.
This preparation is not absolutely necessary, since UTF-7 and the This preparation is not absolutely necessary, since UTF-7 and the
appropriate MIME content transfer encoding can handle text that does appropriate MIME content transfer encoding can handle text that does
not follow Internet conventions, but readability by systems without not follow Internet conventions, but readability by systems without
Unicode or MIME will be impaired. See RFC 1521 for an in-depth Unicode or MIME will be impaired. See RFC 2045 for a discussion of
discussion of mail interoperability issues. mail interoperability issues.
Lines should never be broken in the middle of a UTF-7 shifted Lines should never be broken in the middle of a UTF-7 shifted
sequence, since such sequences may not cross line breaks. Therefore, sequence, since such sequences may not cross line breaks. Therefore,
UTF-7 encoding should take place after line breaking. If a line UTF-7 encoding should take place after line breaking. If a line
containing a shifted sequence is too long after encoding, a MIME containing a shifted sequence is too long after encoding, a MIME
content transfer encoding such as Quoted Printable can be used to content transfer encoding such as Quoted Printable can be used to
encode the text. Another possibility is to perform line breaking and encode the text. Another possibility is to perform line breaking and
UTF-7 encoding at the same time, so that lines containing shifted UTF-7 encoding at the same time, so that lines containing shifted
sequences already conform to length restrictions. sequences already conform to length restrictions.
skipping to change at line 343 skipping to change at line 348
In this section we will motivate the introduction of UTF-7 as opposed In this section we will motivate the introduction of UTF-7 as opposed
to the alternative of using the existing transformation formats of to the alternative of using the existing transformation formats of
Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before
discussing this, it will be useful to list some assumptions about discussing this, it will be useful to list some assumptions about
character frequency within typical natural language text strings that character frequency within typical natural language text strings that
we use to estimate typical storage requirements: we use to estimate typical storage requirements:
1. Most Western European languages use roughly 7/8 of their letters 1. Most Western European languages use roughly 7/8 of their letters
from US-ASCII and 1/8 from Latin 1 (ISO-8859-1). from US-ASCII and 1/8 from Latin 1 (ISO-8859-1).
2. Most non-European alphabet-based languages (e.g., Greek) use about 2. Most non-Roman alphabet-based languages (e.g., Greek) use about
1/6 of their letters from ASCII (since white space is in the 7-bit 1/6 of their letters from ASCII (since white space is in the 7-bit
area) and the rest from their alphabets. area) and the rest from their alphabets.
3. East Asian ideographic-based languages (including Japanese) use 3. East Asian ideographic-based languages (including Japanese) use
essentially all of their characters from the Han or CJK syllabary essentially all of their characters from the Han or CJK syllabary
area. area.
4. Non-directly encoded punctuation characters do not occur 4. Non-directly encoded punctuation characters do not occur
frequently enough to affect the results. frequently enough to affect the results.
skipping to change at line 418 skipping to change at line 423
We also feel that UTF-8 in Base64 has high expansion for non- We also feel that UTF-8 in Base64 has high expansion for non-
Western-European users, and is less desirable because it cannot be Western-European users, and is less desirable because it cannot be
read directly, even when the content is largely US-ASCII. The base read directly, even when the content is largely US-ASCII. The base
encoding of UTF-7 gives competitive results and is readable for ASCII encoding of UTF-7 gives competitive results and is readable for ASCII
text. text.
UTF-7 gives results competitive with ISO-8859-x, with access to all UTF-7 gives results competitive with ISO-8859-x, with access to all
of the Unicode character set. We believe this justifies the of the Unicode character set. We believe this justifies the
introduction of a new transformation format of Unicode. introduction of a new transformation format of Unicode.
As an alternative to use of UTF-7, it is possible to intermix Unicode As an alternative to use of UTF-7, it might be possible to intermix
characters with other character sets using an existing MIME Unicode characters with other character sets using an existing MIME
mechanism, the multipart/mixed content type (thanks to Nathaniel mechanism, the multipart/mixed content type, ignoring for the moment
Borenstein for pointing this out). For instance (repeating an earlier the issues with line breaks (thanks to Nathaniel Borenstein for
example): suggesting this). For instance (repeating an earlier example):
Content-type: multipart/mixed; boundary=foo Content-type: multipart/mixed; boundary=foo
Content-Disposition: inline
--foo --foo
Content-type: text/plain; charset=us-ascii Content-type: text/plain; charset=us-ascii
Hi Mom Hi Mom
--foo --foo
Content-type: text/plain; charset=UNICODE-2-0 Content-type: text/plain; charset=UNICODE-2-0
Content-transfer-encoding: base64 Content-transfer-encoding: base64
Jjo= Jjo=
skipping to change at line 463 skipping to change at line 469
Summary Summary
The UTF-7 encoding allows Unicode characters to be encoded within the The UTF-7 encoding allows Unicode characters to be encoded within the
US-ASCII 7 bit character set. It is most effective for Unicode US-ASCII 7 bit character set. It is most effective for Unicode
sequences which contain relatively long strings of US-ASCII sequences which contain relatively long strings of US-ASCII
characters interspersed with either single Unicode characters or characters interspersed with either single Unicode characters or
strings of Unicode characters, as it allows the US-ASCII portions to strings of Unicode characters, as it allows the US-ASCII portions to
be read on systems without direct Unicode support. be read on systems without direct Unicode support.
UTF-7 should only be used with 7 bit transports such as mail and UTF-7 should only be used with 7 bit transports such as mail. In
news. In other contexts, use of straight Unicode or UTF-8 is other contexts, use of straight Unicode or UTF-8 is preferred.
preferred.
Acknowledgements Acknowledgements
Many thanks to the following people for their contributions, Many thanks to the following people for their contributions,
comments, and suggestions. If we have omitted anyone it was through comments, and suggestions. If we have omitted anyone it was through
oversight and not intentionally. oversight and not intentionally.
Glenn Adams Glenn Adams
Harald T. Alvestrand Harald T. Alvestrand
Nathaniel Borenstein Nathaniel Borenstein
skipping to change at line 615 skipping to change at line 620
Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5: Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5:
Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6: Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6:
Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7: Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7:
Latin/Greek alphabet, ISO 8859-7, 1987. Part 8: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin
alphabet No. 5, ISO 8859-9, 1990. alphabet No. 5, ISO 8859-9, 1990.
[RFC822] Crocker, D., "Standard for the Format of ARPA Internet [RFC822] Crocker, D., "Standard for the Format of ARPA Internet
Text Messages", STD 11, RFC 822, UDEL, August 1982. Text Messages", STD 11, RFC 822, UDEL, August 1982.
[RFC-1521] Borenstein N., and N. Freed, "MIME (Multipurpose Internet [MIME] Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
Mail Extensions) Part One: Mechanisms for Specifying and Postel, "MIME (Multipurpose Internet Mail Extensions)
Describing the Format of Internet Message Bodies", RFC Parts One through Five", RFC 2045, 2046, 2047, 2048, and
1521, Bellcore, Innosoft, September 1993. 2049, November 1996.
[RFC-1522] Moore, K., "Representation of Non-Ascii Text in Internet
Message Headers" RFC 1522, University of Tennessee,
September 1993.
Authors' Addresses Authors' Addresses
David Goldsmith David Goldsmith
Apple Computer, Inc. Apple Computer, Inc.
2 Infinite Loop, MS: 302-2IS 2 Infinite Loop, MS: 302-2IS
Cupertino, CA 95014 Cupertino, CA 95014
Phone: 408-974-1957 Phone: 408-974-1957
Fax: 408-862-4566 Fax: 408-862-4566
 End of changes. 21 change blocks. 
45 lines changed or deleted 46 lines changed or added