< draft-hoffman-utf16-03.txt   draft-hoffman-utf16-04.txt >
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
<draft-hoffman-utf16-03.txt> Internet Mail Consortium <draft-hoffman-utf16-04.txt> Internet Mail Consortium
April 19, 1999 Francois Yergeau June 1, 1999 Francois Yergeau
Alis Technologies Alis Technologies
UTF-16, an encoding of ISO 10646 UTF-16, an encoding of ISO 10646
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
skipping to change at line 42 skipping to change at line 41
This document describes the UTF-16 encoding of Unicode/ISO-10646, This document describes the UTF-16 encoding of Unicode/ISO-10646,
addresses the issues of serializing UTF-16 as an octet stream for addresses the issues of serializing UTF-16 as an octet stream for
transmission over the Internet, defines MIME charset naming as transmission over the Internet, defines MIME charset naming as
described in [CHARSET-REG], and contains the registration for three described in [CHARSET-REG], and contains the registration for three
MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
(little-endian), and UTF-16. (little-endian), and UTF-16.
1.1 Background and motivation 1.1 Background and motivation
The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode, define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [WORKSHOP]. which encompasses most of the world's writing systems [WORKSHOP].
UTF-16, the object of this specification, is one of the standard ways UTF-16, the object of this specification, is one of the standard ways
of encoding Unicode character data; it has the characteristics of of encoding Unicode character data; it has the characteristics of
encoding all currently defined characters (in plane 0, the BMP) in encoding all currently defined characters (in plane 0, the BMP) in
exactly two octets and of being able to encode all other characters exactly two octets and of being able to encode all other characters
likely to be defined (the next 16 planes) in exactly four octets. likely to be defined (the next 16 planes) in exactly four octets.
The Unicode Standard further defines additional character properties The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up to and other application details of great interest to implementors. Up to
skipping to change at line 104 skipping to change at line 103
- Characters with values between 0x10000 and 0x10FFFF are represented - Characters with values between 0x10000 and 0x10FFFF are represented
by a 16-bit integer with a value between 0xD800 and 0xDBFF (within by a 16-bit integer with a value between 0xD800 and 0xDBFF (within
the so-called high-half zone or high surrogate area) followed by a the so-called high-half zone or high surrogate area) followed by a
16-bit integer with a value between 0xDC00 and 0xDFFF (within the 16-bit integer with a value between 0xDC00 and 0xDFFF (within the
so-called low-half zone or low surrogate area). so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in - Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16. UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically reserved for
use with UTF-16, and don't have any characters assigned to them.
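The representation rules above can be sketched in C roughly as follows. This is an illustrative sketch, not text from either draft: the function name utf16_encode and its return convention are hypothetical, and refusing to encode values in the reserved range 0xD800 to 0xDFFF is an assumption of this sketch, not a requirement stated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the draft): encodes one character
 * value into out[0..1]. Returns the number of 16-bit integers written
 * (1 or 2), or 0 if the value cannot be represented in UTF-16. */
size_t utf16_encode(uint32_t u, uint16_t out[2])
{
    if (u > 0x10FFFF)
        return 0;                  /* cannot be encoded in UTF-16 */
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0;                  /* assumption: reject reserved values */
    if (u < 0x10000) {
        out[0] = (uint16_t)u;      /* a single 16-bit integer */
        return 1;
    }
    u -= 0x10000;                  /* now at most 0xFFFFF, i.e. 20 bits */
    out[0] = (uint16_t)(0xD800 | (u >> 10));   /* high-half zone: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (u & 0x3FF)); /* low-half zone: low 10 bits */
    return 2;
}
```

With this sketch, the value 0x12345 yields the pair 0xD808, 0xDF45.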
2.1 Encoding UTF-16 2.1 Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character number, no greater UTF-16 proceeds as follows. Let U be the character number, no greater
than 0x10FFFF. than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
U' must be less than or equal to 0xFFFFF. That is, U' can be U' must be less than or equal to 0xFFFFF. That is, U' can be
skipping to change at line 161 skipping to change at line 163
Note that steps 2 and 3 indicate errors. Error recovery is not Note that steps 2 and 3 indicate errors. Error recovery is not
specified by this document. When terminating with an error in steps 2 specified by this document. When terminating with an error in steps 2
and 3, it may be wise to set U to the value of W1 to help the caller and 3, it may be wise to set U to the value of W1 to help the caller
diagnose the error and not lose information. Also note that a string diagnose the error and not lose information. Also note that a string
decoding algorithm, as opposed to the single-character decoding decoding algorithm, as opposed to the single-character decoding
described above, need not terminate upon detection of an error, if described above, need not terminate upon detection of an error, if
proper error reporting and/or recovery is provided. proper error reporting and/or recovery is provided.
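A single-character decoder that follows the advice in the note above might look like the sketch below. The numbered decoding steps are not shown in this diff, so the checks are reconstructed from the value ranges given in section 2; the names utf16_decode, UTF16_OK and UTF16_ERR are hypothetical. On error the sketch leaves W1 in *u so the caller can diagnose the problem.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch. */
enum { UTF16_OK = 0, UTF16_ERR = -1 };

/* Decodes one character from w[0..n-1]. On success, stores the character
 * value in *u and the number of 16-bit integers consumed in *used.
 * On error, stores W1 in *u and reports one integer consumed. */
int utf16_decode(const uint16_t *w, size_t n, uint32_t *u, size_t *used)
{
    if (n == 0) {
        *u = 0;
        *used = 0;
        return UTF16_ERR;          /* nothing to decode */
    }
    uint16_t w1 = w[0];
    *u = w1;                       /* default result, also kept on error */
    *used = 1;
    if (w1 < 0xD800 || w1 > 0xDFFF)
        return UTF16_OK;           /* not a surrogate: value is W1 itself */
    if (w1 > 0xDBFF)
        return UTF16_ERR;          /* W1 lies in the low-half zone */
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return UTF16_ERR;          /* W2 missing or outside low-half zone */
    *u = 0x10000 + (((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w[1] & 0x3FF));
    *used = 2;
    return UTF16_OK;
}
```

A string decoder built on this could, for example, record the error and continue with the next 16-bit integer instead of terminating.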
3. Labelling UTF-16 text 3. Labelling UTF-16 text
This specification contains registration for three MIME charsets: Appendix A of this specification contains registrations for three MIME
"UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent
combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and the combination of a CCS (a coded character set) and a CES (a character
the CES is the same in all three cases, except for the serialization encoding scheme). Here the CCS is Unicode/ISO 10646 and the CES is the
order of the octets in each character, and the external determination same in all three cases, except for the serialization order of the
of which serialization is used. octets in each character, and the external determination of which
serialization is used.
This section describes which of the three labels to apply to a stream This section describes which of the three labels to apply to a stream
of text. Section 4 describes how to interpret the labels on a stream of of text. Section 4 describes how to interpret the labels on a stream of
text. text.
3.1 Definition of big-endian and little-endian 3.1 Definition of big-endian and little-endian
Historically, computer hardware has processed two-octet entities such Historically, computer hardware has processed two-octet entities such
as 16-bit integers in one of two ways. So-called "big-endian" hardware as 16-bit integers in one of two ways. So-called "big-endian" hardware
handles two-octet entities with the higher-order octet first, that is handles two-octet entities with the higher-order octet first, that is
skipping to change at line 200 skipping to change at line 203
void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */ void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
{ {
putc(u >> 8, f); /* output high-order byte */ putc(u >> 8, f); /* output high-order byte */
putc(u & 0xFF, f); /* then low-order */ putc(u & 0xFF, f); /* then low-order */
} }
The term "network byte order" has been used in many RFCs to indicate The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has yet to be formally big-endian serialization, although that term has yet to be formally
defined in a standards-track document. Although ISO 10646 prefers defined in a standards-track document. Although ISO 10646 prefers
big-endian serialization (section 6.3 of [ISO-10646]), it is likely big-endian serialization (section 6.3 of [ISO-10646]), little-endian
that little-endian order will also be used on the Internet. order is also sometimes used on the Internet.
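For comparison with the write_be sample earlier in this section, a little-endian writer could be sketched as follows. The name write_le is hypothetical and does not appear in either draft; the stream is passed as FILE * because that is what putc expects.

```c
#include <stdio.h>

/* Hypothetical little-endian counterpart of write_be. */
void write_le(unsigned short u, FILE *f)   /* assume short is 16 bits */
{
    putc(u & 0xFF, f);          /* output low-order byte */
    putc((u >> 8) & 0xFF, f);   /* then high-order */
}
```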
3.2 Byte order mark (BOM) 3.2 Byte order mark (BOM)
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
ORDER MARK" (abbreviated "BOM"). The latter name hints at a second ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
possible usage of the character, in addition to its normal use as a possible usage of the character, in addition to its normal use as a
genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage, genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage,
suggested by Unicode section 2.4 and ISO 10646 Annex F (informative), suggested by Unicode section 2.4 and ISO 10646 Annex F (informative),
is to prepend a 0xFEFF character to a stream of Unicode characters as a is to prepend a 0xFEFF character to a stream of Unicode characters as a
skipping to change at line 271 skipping to change at line 274
Text in the "UTF-16LE" charset MUST be serialized with the octets which Text in the "UTF-16LE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in little-endian order. Systems make up a single 16-bit UTF-16 value in little-endian order. Systems
labelling UTF-16LE text MUST NOT prepend a BOM to the text. labelling UTF-16LE text MUST NOT prepend a BOM to the text.
Any labelling application that uses UTF-16 character encoding, and puts Any labelling application that uses UTF-16 character encoding, and puts
an explicit charset label on the text, and does not know the an explicit charset label on the text, and does not know the
serialization order of the characters in text, MUST label the text as serialization order of the characters in text, MUST label the text as
"UTF-16", and SHOULD make sure the text starts with 0xFEFF. "UTF-16", and SHOULD make sure the text starts with 0xFEFF.
An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
"UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, would occur with document formats that mandate a BOM in UTF-16 text,
thereby requiring the use of the "UTF-16" tag only. thereby requiring the use of the "UTF-16" tag only.
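As a rough illustration of the labelling rules in this section, the choice of label could be expressed as below. This is only a sketch under stated assumptions: the enum byte_order and the function choose_label are hypothetical names, and the rule for "UTF-16BE" is assumed to mirror the "UTF-16LE" rule shown above, since that part of the text is not included in this diff.

```c
/* Hypothetical description of what the labelling application knows. */
enum byte_order { ORDER_UNKNOWN, ORDER_BE, ORDER_LE };

/* Returns the charset label for text whose serialization order is
 * `order`; starts_with_bom is nonzero when the text begins with 0xFEFF
 * (for instance because the document format mandates a BOM). */
const char *choose_label(enum byte_order order, int starts_with_bom)
{
    if (order == ORDER_UNKNOWN)
        return "UTF-16";        /* order not known: MUST label as UTF-16 */
    if (starts_with_bom)
        return "UTF-16";        /* BOM present: BE/LE labels forbid a BOM */
    return order == ORDER_BE ? "UTF-16BE" : "UTF-16LE";
}
```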
4. Interpreting text labels 4. Interpreting text labels
When a program sees text labelled as "UTF-16BE", "UTF-16LE", or When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
"UTF-16", it can make some assumptions, based on the labelling rules "UTF-16", it can make some assumptions, based on the labelling rules
given in the previous section. These assumptions allow the program to given in the previous section. These assumptions allow the program to
then process the text. then process the text.
4.1 Interpreting text labelled as UTF-16BE 4.1 Interpreting text labelled as UTF-16BE
skipping to change at line 322 skipping to change at line 325
label MUST NOT assume the serialization without first checking the label MUST NOT assume the serialization without first checking the
first two octets to see if they are a big-endian BOM, a little-endian first two octets to see if they are a big-endian BOM, a little-endian
BOM, or not a BOM. All applications that process text with the "UTF-16" BOM, or not a BOM. All applications that process text with the "UTF-16"
charset label MUST be able to interpret both big-endian and charset label MUST be able to interpret both big-endian and
little-endian text. little-endian text.
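The check of the first two octets described above might be sketched like this. The enum bom_kind and the function detect_bom are hypothetical names used only for illustration; what to do when no BOM is found is left to the caller, because that rule is not part of the text shown in this diff.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind { BOM_NONE, BOM_BE, BOM_LE };

/* Inspects the first two octets of text labelled "UTF-16". */
enum bom_kind detect_bom(const unsigned char *p, size_t len)
{
    if (len < 2)
        return BOM_NONE;        /* too short to hold a BOM */
    if (p[0] == 0xFE && p[1] == 0xFF)
        return BOM_BE;          /* 0xFEFF serialized big-endian */
    if (p[0] == 0xFF && p[1] == 0xFE)
        return BOM_LE;          /* 0xFEFF serialized little-endian */
    return BOM_NONE;            /* not a BOM */
}
```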
5. Examples 5. Examples
For the sake of example, let's suppose that there is a hieroglyphic For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value character representing the Egyptian god Ra with character value
0x00012345 (this character does not exist at present in Unicode). 0x12345 (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase: The examples here all evaluate to the phrase:
*=Ra *=Ra
where the "*" represents the Ra hieroglyph (0x00012345). where the "*" represents the Ra hieroglyph (0x12345).
Text labelled with UTF-16BE, without a BOM: Text labelled with UTF-16BE, without a BOM:
D8 08 DF 45 00 3D 00 52 00 61 D8 08 DF 45 00 3D 00 52 00 61
Text labelled with UTF-16LE, without a BOM: Text labelled with UTF-16LE, without a BOM:
08 D8 45 DF 3D 00 52 00 61 00 08 D8 45 DF 3D 00 52 00 61 00
Big-endian text labelled with UTF-16, with a BOM: Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 08 DF 45 00 3D 00 52 00 61 FE FF D8 08 DF 45 00 3D 00 52 00 61
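The octet sequences above can be reproduced with a small serializer such as the sketch below. The function serialize_utf16 and its parameters are hypothetical; the input is assumed to be the five 16-bit integers 0xD808 0xDF45 0x003D 0x0052 0x0061, which correspond to the phrase "*=Ra".

```c
#include <stddef.h>

/* Hypothetical helper: writes n 16-bit integers to out as octets,
 * optionally preceded by a BOM, in big-endian order when big_endian is
 * nonzero and little-endian order otherwise. Returns the number of
 * octets written; out must have room for 2 * (n + 1) octets. */
size_t serialize_utf16(const unsigned short *units, size_t n,
                       int big_endian, int with_bom, unsigned char *out)
{
    size_t k = 0;
    if (with_bom) {                         /* 0xFEFF in the chosen order */
        out[k++] = big_endian ? 0xFE : 0xFF;
        out[k++] = big_endian ? 0xFF : 0xFE;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)((units[i] >> 8) & 0xFF);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        out[k++] = big_endian ? hi : lo;    /* first octet of the pair */
        out[k++] = big_endian ? lo : hi;    /* second octet of the pair */
    }
    return k;
}
```

Calling it with big_endian set and with_bom set gives the 12 octets of the last example; clearing with_bom gives the 10 octets of the UTF-16BE example.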
skipping to change at line 418 skipping to change at line 421
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Twelve amendments Part 1: Architecture and Basic Multilingual Plane. Twelve amendments
and two technical corrigenda have been published up to now. UTF-16 is and two technical corrigenda have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. Many other amendments described in Annex Q, published as Amendment 1. Many other amendments
are currently at various stages of standardization. are currently at various stages of standardization.
[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version
2.1", Unicode Technical Report #8. 2.0", ISBN 0-201-48345-9; with Unicode Technical Report #8, "The
Unicode Standard, Version 2.1",
http://www.unicode.org/unicode/reports/tr8.html.
[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998. 2279, January 1998.
[WORKSHOP] Weider, C., et al., "Report of the IAB Character Set [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set
Workshop", RFC 2130, April 1997. Workshop", RFC 2130, April 1997.
9. Acknowledgments 9. Acknowledgments
Deborah Goldsmith wrote a great deal of the initial wording for this Deborah Goldsmith wrote a great deal of the initial wording for this
skipping to change at line 449 skipping to change at line 454
Murata Makoto Murata Makoto
Larry Masinter Larry Masinter
Markus Scherer Markus Scherer
Ken Whistler Ken Whistler
Some of the text in this specification was copied from [UTF-8], and Some of the text in this specification was copied from [UTF-8], and
that document was worked on by many people. Please see the that document was worked on by many people. Please see the
acknowledgments section in that document for more people who may have acknowledgments section in that document for more people who may have
contributed indirectly to this document. contributed indirectly to this document.
10. Changes between draft -02 and -03 10. Changes between draft -03 and -04
1: Reorganized the sections. Added information about two octets being
enough for all current characters and the committees saying they will
not go beyond what can be defined in UTF-16.
2.1: Reworded step 2 with words to make it easier to read.
2.2: Added "U" to step 1. Also added note to the end of the last
paragraph about string decoding and errors.
3: Added a reference to section 4 about interpreting labels. 2: Added note at the end of the section about 0xD800-0xDFFF being
reserved for UTF-16.
3.1: Reworded last sentence in last paragraph. 3: Spelled out CCS and CES in the first paragraph. Also put a reference
to Appendix A in the first paragraph. In the last paragraph, changed
the last sentence to indicate that little-endian is already sometimes
used on the Internet.
4.3: Added requirement that apps that can read UTF-16 must be able to 3.3: Changed the last paragraph to explain which kind of rules it
interpret both big-endian and little-endian. applies to.
5: Corrected the examples due to wrong encoding. 5: Changed "0x00012345" to "0x12345".
11: Moved author's addresses to Appendix B. 8: Changed the reference to [UNICODE].
A. Charset registrations A. Charset registrations
This memo is meant to serve as the basis for registration of three MIME This memo is meant to serve as the basis for registration of three MIME
charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE",
"UTF-16LE", and "UTF-16". These strings label objects containing text "UTF-16LE", and "UTF-16". These strings label objects containing text
consisting of characters from the repertoire of ISO/IEC 10646 including consisting of characters from the repertoire of ISO/IEC 10646 including
all amendments at least up to amendment 5 (Korean block), encoded to a all amendments at least up to amendment 5 (Korean block), encoded to a
sequence of octets using the encoding and serialization schemes sequence of octets using the encoding and serialization schemes
outlined above. outlined above.
 End of changes. 16 change blocks. 
33 lines changed or deleted 33 lines changed or added
