< draft-hoffman-utf16-02.txt   draft-hoffman-utf16-03.txt >
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
<draft-hoffman-utf16-02.txt> Internet Mail Consortium <draft-hoffman-utf16-03.txt> Internet Mail Consortium
February 10, 1999 Francois Yergeau April 19, 1999 Francois Yergeau
Alis Technologies Alis Technologies
UTF-16, an encoding of ISO 10646 UTF-16, an encoding of ISO 10646
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts. may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and Internet-Drafts are draft documents valid for a maximum of six months
may be updated, replaced, or obsoleted by other documents at any time. It and may be updated, replaced, or obsoleted by other documents at any
is inappropriate to use Internet- Drafts as reference material or to cite time. It is inappropriate to use Internet-Drafts as reference material
them other than as "work in progress." or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright (C) The Internet Society (1999). All Rights Reserved. Copyright (C) The Internet Society (1999). All Rights Reserved.
1. Introduction 1. Introduction
This document describes the UTF-16 encoding of Unicode/ISO-10646 and This document describes the UTF-16 encoding of Unicode/ISO-10646,
contains the registration for three MIME charset parameter values: addresses the issues of serializing UTF-16 as an octet stream for
UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16. transmission over the Internet, defines MIME charset naming as
described in [CHARSET-REG], and contains the registration for three
MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
(little-endian), and UTF-16.
1.1 Background 1.1 Background and motivation
The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode, which define a coded character set (CCS), hereafter referred to as Unicode,
encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the which encompasses most of the world's writing systems [WORKSHOP].
object of this specification, is a way to encode Unicode characters that UTF-16, the object of this specification, is one of the standard ways
has the characteristics of encoding the vast majority of currently-defined of encoding Unicode character data; it has the characteristics of
characters in exactly two octets and of being able to encode all other encoding all currently defined characters (in plane 0, the BMP) in
characters that will be defined in exactly four octets. exactly two octets and of being able to encode all other characters
likely to be defined (the next 16 planes) in exactly four octets.
The Unicode Standard further defines additional character properties and
other application details of great interest to implementors. Up to the
present time, changes in Unicode and amendments to ISO/IEC 10646 have
tracked each other, so that the character repertoires and code point
assignments have remained in sync. The relevant standardization committees
have committed to maintain this very useful synchronism.
1.2 Motivation
The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF The Unicode Standard further defines additional character properties
policy on character sets and languages, [CHARPOLICY], says that IETF and other application details of great interest to implementors. Up to
protocols MUST be able to use the UTF-8 charset. However, relative to the present time, changes in Unicode and amendments to ISO/IEC 10646
UTF-16, UTF-8 imposes a space penalty for characters whose values are have tracked each other, so that the character repertoires and code
between 0x0800 and 0xFFFF. Also, characters represented in UTF-8 have varying point assignments have remained in sync. The relevant standardization
sizes. Using UTF-16 provides a way to transmit character data that is committees have committed to maintain this very useful synchronism, as
mostly uniform in size. Some products and network standards already specify well as not to assign characters outside of the 17 planes accessible to
UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in UTF-16.
many protocols, such as the direct encoding of US-ASCII characters and
re-synchronization after loss of octets.)
UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as The IETF policy on character sets and languages [CHARPOLICY] says that
a sequence of 16-bit quantities. This document addresses the issues of IETF protocols MUST be able to use the UTF-8 charset [UTF-8]. Although
serializing UTF-16 as an octet stream for transmission over the Internet UTF-8 has many beneficial properties, such as the direct encoding of
and of MIME charset naming as described in [CHARSET-REG]. US-ASCII characters, re-synchronization after loss of octets and
immunity to the byte-order issue (see 3.1 below), it is a
variable-width encoding and is less dense than UTF-16 for characters
whose values are between 0x0800 and 0xFFFF. Some products and network
standards already specify UTF-16, making it an important encoding for
the Internet.
1.3 Terminology 1.2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. document are to be interpreted as described in RFC 2119 [MUSTSHOULD].
Throughout this document, character values are shown in hexadecimal Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" denotes the character that is notation. For example, "0x013C" denotes the character that is
assigned the integer value 316 (decimal) in the CCS. assigned the integer value 316 (decimal) in the CCS.
2. UTF-16 definition 2. UTF-16 definition
In ISO 10646, each character is assigned a number, which Unicode calls the In ISO 10646, each character is assigned a number, which Unicode calls
Unicode scalar value. This number is the same as the UCS-4 value of the the Unicode scalar value. This number is the same as the UCS-4 value of
character, and this document will refer to it as the "character value" for the character, and this document will refer to it as the "character
brevity. In the UTF-16 encoding, characters are represented using either value" for brevity. In the UTF-16 encoding, characters are represented
one or two unsigned 16-bit integers, depending on the character value. using either one or two unsigned 16-bit integers, depending on the
Serialization of these integers for transmission as a byte stream is character value. Serialization of these integers for transmission as a
discussed in Section 3. byte stream is discussed in Section 3.
The rules for how characters are encoded in UTF-16 are: The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a single - Characters with values less than 0x10000 are represented as a single
16-bit integer with a value equal to that of the character number. 16-bit integer with a value equal to that of the character number.
- Characters with values between 0x10000 and 0x10FFFF are represented by a - Characters with values between 0x10000 and 0x10FFFF are represented
16-bit integer with a value between 0xD800 and 0xDBFF (within the by a 16-bit integer with a value between 0xD800 and 0xDBFF (within
so-called high-half zone or high surrogate area) followed by a 16-bit the so-called high-half zone or high surrogate area) followed by a
integer with a value between 0xDC00 and 0xDFFF (within the so-called 16-bit integer with a value between 0xDC00 and 0xDFFF (within the
low-half zone or low surrogate area). so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in - Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16. UTF-16.
2.1 Encoding UTF-16 2.1 Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to UTF-16 Encoding of a single character from an ISO 10646 character value to
proceeds as follows. Let U be the character number, no greater than UTF-16 proceeds as follows. Let U be the character number, no greater
0x10FFFF. than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF, 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
that is, U' can be represented in 20 bits. U' must be less than or equal to 0xFFFFF. That is, U' can be
represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
0xDC00, respectively. These integers each have 10 bits free to encode the 0xDC00, respectively. These integers each have 10 bits free to encode
character value, for a total of 20 bits. the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits 4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of
Terminate. W2. Terminate.
Graphically, steps 2 through 4 look like: Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx W2 = 110111xxxxxxxxxx
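The encoding steps above can be sketched in C as follows. This is a minimal illustration, not part of either draft: the function name utf16_encode and its return convention are assumptions made for this example. Like step 1 of the procedure, it does not reject input values that fall in the surrogate range 0xD800 to 0xDFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Encodes the character value u into out[0..1].
 * Returns the number of 16-bit integers written (1 or 2),
 * or 0 if u is greater than 0x10FFFF and cannot be encoded. */
size_t utf16_encode(uint32_t u, uint16_t out[2])
{
    if (u > 0x10FFFF)
        return 0;                       /* not encodable in UTF-16 */

    if (u < 0x10000) {                  /* step 1: single 16-bit integer */
        out[0] = (uint16_t)u;
        return 1;
    }

    uint32_t u_prime = u - 0x10000;     /* step 2: U' fits in 20 bits */

    /* steps 3 and 4: start from 0xD800 / 0xDC00 and fill in 10 bits each */
    out[0] = (uint16_t)(0xD800 | (u_prime >> 10));    /* W1: high 10 bits */
    out[1] = (uint16_t)(0xDC00 | (u_prime & 0x3FF));  /* W2: low 10 bits  */
    return 2;
}
```

The caller supplies a two-element array and uses the return value to learn how many elements were filled in.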
2.2 Decoding UTF-16 2.2 Decoding UTF-16
Decoding of a single character from UTF-16 to an ISO 10646 character value Decoding of a single character from UTF-16 to an ISO 10646 character
proceeds as follows. Let W1 be the next 16-bit integer in the sequence of value proceeds as follows. Let W1 be the next 16-bit integer in the
integers representing the text. Let W2 be the (eventual) next integer sequence of integers representing the text. Let W2 be the (eventual)
following W1. next integer following W1.
1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value of
Terminate. W1. Terminate.
2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
error and no valid character can be obtained using W1. Terminate. is in error and no valid character can be obtained using W1. Terminate.
3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is
between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. not between 0xDC00 and 0xDFFF, the sequence is in error. Terminate.
4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits
W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 of W1 as its 10 high-order bits and the 10 low-order bits of W2 as its
low-order bits. 10 low-order bits.
5) Add 0x10000 to U' to obtain the character value U. Terminate. 5) Add 0x10000 to U' to obtain the character value U. Terminate.
Note that steps 2 and 3 indicate errors. Error recovery is not specified by Note that steps 2 and 3 indicate errors. Error recovery is not
this document. When terminating with an error in steps 2 and 3, it may be specified by this document. When terminating with an error in steps 2
wise to set U to the value of W1 to help the caller diagnose the error and and 3, it may be wise to set U to the value of W1 to help the caller
not lose information. diagnose the error and not lose information. Also note that a string
decoding algorithm, as opposed to the single-character decoding
described above, need not terminate upon detection of an error, if
proper error reporting and/or recovery is provided.
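The decoding steps above can be sketched in C as follows. The function name utf16_decode and its return convention are assumptions made for this example, not something either draft defines. On error it follows the suggestion above and leaves the value of W1 in the output so that the caller can diagnose the problem.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Decodes one character from the 16-bit integers w[0..n-1].
 * On success, stores the character value in *u and returns the number
 * of 16-bit integers consumed (1 or 2).
 * On error, returns 0 and sets *u to the value of W1.
 * If n is 0 there is no W1; the function returns 0 and leaves *u alone. */
size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    if (n == 0)
        return 0;

    uint16_t w1 = w[0];
    *u = w1;

    if (w1 < 0xD800 || w1 > 0xDFFF)     /* step 1: not a surrogate */
        return 1;

    if (w1 > 0xDBFF)                    /* step 2: W1 is a low surrogate */
        return 0;

    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;                       /* step 3: W2 missing or invalid */

    /* step 4: low 10 bits of W1 become the high 10 bits of U' */
    uint32_t u_prime = ((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w[1] & 0x3FF);

    *u = u_prime + 0x10000;             /* step 5 */
    return 2;
}
```

A string decoder built on this sketch could advance by the returned count on success and choose its own recovery policy when 0 is returned.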
3. Labelling UTF-16 text 3. Labelling UTF-16 text
This specification contains registration for three MIME charsets: This specification contains registration for three MIME charsets:
"UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the
combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and the combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and
CES is the same in all three cases, except for the serialization order of the CES is the same in all three cases, except for the serialization
the octets in each character, and the external determination of which order of the octets in each character, and the external determination
serialization is used. of which serialization is used.
This section describes which of the three labels to apply to a stream of text. This section describes which of the three labels to apply to a stream
of text. Section 4 describes how to interpret the labels on a stream of
text.
3.1 Definition of big-endian and little-endian 3.1 Definition of big-endian and little-endian
Historically, computer hardware has processed two-octet entities such as Historically, computer hardware has processed two-octet entities such
16-bit integers in one of two ways. So-called "big-endian" hardware handles as 16-bit integers in one of two ways. So-called "big-endian" hardware
two-octet entities with the higher-order octet first, that is at the lower handles two-octet entities with the higher-order octet first, that is
address in memory; when written out to disk or to a network interface at the lower address in memory; when written out to disk or to a
(serializing), the high-order octet thus appears first in the data stream. network interface (serializing), the high-order octet thus appears
On the other hand, "little-endian" hardware handles two-octet entities with first in the data stream. On the other hand, "little-endian" hardware
the lower-order octet first. Hardware of both kinds is common today. handles two-octet entities with the lower-order octet first. Hardware
of both kinds is common today.
For example, the unsigned 16-bit integer that represents the decimal number For example, the unsigned 16-bit integer that represents the decimal
258 is 0x0102. The big-endian serialization of that number is the octet number 258 is 0x0102. The big-endian serialization of that number is
0x01 followed by the octet 0x02. The little-endian serialization of that the octet 0x01 followed by the octet 0x02. The little-endian
number is the octet 0x02 followed by the octet 0x01. The following C code serialization of that number is the octet 0x02 followed by the octet
fragment demonstrates a way to write 16-bit quantities to a file in 0x01. The following C code fragment demonstrates a way to write 16-bit
big-endian order, irrespective of the hardware's native byte order. quantities to a file in big-endian order, irrespective of the
hardware's native byte order.
void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */ void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
{ {
putc(u >> 8, f); /* output high-order byte */ putc(u >> 8, f); /* output high-order byte */
putc(u & 0xFF, f); /* then low-order */ putc(u & 0xFF, f); /* then low-order */
} }
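For comparison, the same idea can be applied to an in-memory buffer in both serialization orders. The following is only a sketch: the names put_be16 and put_le16 are invented for this example, and it uses the fixed-width types from <stdint.h> instead of assuming that short is 16 bits.

```c
#include <stdint.h>

/* Hypothetical helpers (names are illustrative only).
 * Each writes the two octets of u into out[0] and out[1],
 * irrespective of the hardware's native byte order. */

void put_be16(uint16_t u, uint8_t out[2])
{
    out[0] = (uint8_t)(u >> 8);     /* high-order octet first */
    out[1] = (uint8_t)(u & 0xFF);   /* then low-order */
}

void put_le16(uint16_t u, uint8_t out[2])
{
    out[0] = (uint8_t)(u & 0xFF);   /* low-order octet first */
    out[1] = (uint8_t)(u >> 8);     /* then high-order */
}
```

With u equal to 0x0102, put_be16 yields the octets 0x01 0x02 and put_le16 yields 0x02 0x01, matching the serializations described above.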
The term "network byte order" has been used in many RFCs to indicate The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has yet to be formally big-endian serialization, although that term has yet to be formally
defined in a standards-track document. ISO 10646 prefers big-endian defined in a standards-track document. Although ISO 10646 prefers
serialization (section 6.3 of [ISO-10646]), but it is nonetheless big-endian serialization (section 6.3 of [ISO-10646]), it is likely
considered likely that little-endian order will also be used on the that little-endian order will also be used on the Internet.
Internet.
3.2 Byte order mark (BOM) 3.2 Byte order mark (BOM)
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
MARK" (abbreviated "BOM"). The latter name hints at a second possible usage ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
of the character, in addition to its normal use as a genuine "ZERO WIDTH possible usage of the character, in addition to its normal use as a
NON-BREAKING SPACE" within text. This usage, suggested by Unicode section genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage,
2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character suggested by Unicode section 2.4 and ISO 10646 Annex F (informative),
to a stream of Unicode characters as a "signature"; a receiver of such a is to prepend a 0xFEFF character to a stream of Unicode characters as a
serialized stream may then use the initial character both as a hint that "signature"; a receiver of such a serialized stream may then use the
the stream consists of Unicode characters and as a way to recognize the initial character both as a hint that the stream consists of Unicode
serialization order. In serialized UTF-16 prepended with such a signature, characters and as a way to recognize the serialization order. In
the order is big-endian if the first two octets are 0xFE followed by 0xFF; serialized UTF-16 prepended with such a signature, the order is
if they are 0xFF followed by 0xFE, the order is little-endian. Note that big-endian if the first two octets are 0xFE followed by 0xFF; if they
0xFFFE is not a Unicode character, precisely to preserve the usefulness of are 0xFF followed by 0xFE, the order is little-endian. Note that 0xFFFE
is not a Unicode character, precisely to preserve the usefulness of
0xFEFF as a byte-order mark. 0xFEFF as a byte-order mark.
It is important to understand that the character 0xFEFF appearing at any It is important to understand that the character 0xFEFF appearing at
position other than the beginning of a stream MUST be interpreted with the any position other than the beginning of a stream MUST be interpreted
semantics for the zero-width non-breaking space, and MUST NOT be with the semantics for the zero-width non-breaking space, and MUST NOT
interpreted as a byte-order mark. The contrapositive of that statement is be interpreted as a byte-order mark. The contrapositive of that
not always true: the character 0xFEFF in the first position of a stream MAY statement is not always true: the character 0xFEFF in the first
be interpreted as a zero-width non-breaking space, and is not always a position of a stream MAY be interpreted as a zero-width non-breaking
byte-order mark. For example, if a process splits a UTF-16 string into space, and is not always a byte-order mark. For example, if a process
many parts, a part might begin with 0xFEFF because there was a splits a UTF-16 string into many parts, a part might begin with 0xFEFF
zero-width non-breaking space at the beginning of that substring. because there was a zero-width non-breaking space at the beginning of
that substring.
The Unicode standard further suggests that an initial 0xFEFF character may The Unicode standard further suggests that an initial 0xFEFF character
be stripped before processing the text, the rationale being that such a may be stripped before processing the text, the rationale being that
character in initial position may be an artifact of the encoding (an such a character in initial position may be an artifact of the encoding
encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING (an encoding signature), not a genuine intended "ZERO WIDTH
SPACE". Note that such stripping might affect an external process at a NON-BREAKING SPACE". Note that such stripping might affect an external
different layer (such as a digital signature or a count of the characters) process at a different layer (such as a digital signature or a count of
that is relying on the presence of all characters in the stream. the characters) that is relying on the presence of all characters in
the stream.
In particular, in UTF-16 plain text it is likely, but not certain, that an In particular, in UTF-16 plain text it is likely, but not certain, that
initial 0xFEFF is a signature; when concatenating two strings, it is an initial 0xFEFF is a signature. When concatenating two strings, it is
important to strip out those signatures, for otherwise the resulting string important to strip out those signatures, because otherwise the
may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the connection resulting string may contain an unintended "ZERO WIDTH NON-BREAKING
point. Also, some specifications mandate an initial 0xFEFF character in SPACE" at the connection point. Also, some specifications mandate an
objects encoded in UTF-16 and specify that this signature is not part of initial 0xFEFF character in objects encoded in UTF-16 and specify that
the object. this signature is not part of the object.
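The concatenation case can be sketched as follows. The function name utf16_concat is invented for this example, and the sketch assumes the caller has already decided that an initial 0xFEFF in the second string is a signature and not a genuine "ZERO WIDTH NON-BREAKING SPACE"; as noted above, that is likely but not certain. It operates on 16-bit integers before serialization, and leaves any initial 0xFEFF of the first string in place.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Copies a followed by b into dst, dropping one leading 0xFEFF from b.
 * dst must have room for a_len + b_len elements, and a, b and dst must
 * be valid pointers even when a length is 0.
 * Returns the number of 16-bit integers written to dst. */
size_t utf16_concat(uint16_t *dst,
                    const uint16_t *a, size_t a_len,
                    const uint16_t *b, size_t b_len)
{
    /* Assumption: an initial 0xFEFF in b is a signature. */
    if (b_len > 0 && b[0] == 0xFEFF) {
        b++;
        b_len--;
    }

    memcpy(dst, a, a_len * sizeof *a);
    memcpy(dst + a_len, b, b_len * sizeof *b);
    return a_len + b_len;
}
```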
3.3 Choosing a label for UTF-16 text 3.3 Choosing a label for UTF-16 text
Any labelling application that uses UTF-16 character encoding, and puts an Any labelling application that uses UTF-16 character encoding, and
explicit charset label on the text, and knows the serialization order of explicitly labels the text, and knows the serialization order of the
the characters in text, SHOULD label the text as either "UTF-16BE" or characters in text, SHOULD label the text as either "UTF-16BE" or
"UTF-16LE", whichever is appropriate based on the endianness of the text. "UTF-16LE", whichever is appropriate based on the endianness of the
This allows applications processing the text, but unable to look inside the text. This allows applications processing the text, but unable to look
text, to know the serialization definitively. inside the text, to know the serialization definitively.
Text in the "UTF-16BE" charset MUST be serialized with the octets which Text in the "UTF-16BE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in big-endian order. Systems labelling make up a single 16-bit UTF-16 value in big-endian order. Systems
UTF-16BE text MUST NOT prepend a BOM to the text. labelling UTF-16BE text MUST NOT prepend a BOM to the text.
Text in the "UTF-16LE" charset MUST be serialized with the octets which Text in the "UTF-16LE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in little-endian order. Systems make up a single 16-bit UTF-16 value in little-endian order. Systems
labelling UTF-16LE text MUST NOT prepend a BOM to the text. labelling UTF-16LE text MUST NOT prepend a BOM to the text.
Any labelling application that uses UTF-16 character encoding, and puts an Any labelling application that uses UTF-16 character encoding, and puts
explicit charset label on the text, and does not know the serialization an explicit charset label on the text, and does not know the
order of the characters in text, MUST label the text as "UTF-16", and serialization order of the characters in text, MUST label the text as
SHOULD make sure the text starts with 0xFEFF. "UTF-16", and SHOULD make sure the text starts with 0xFEFF.
An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or
"UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text,
thereby requiring the use of the "UTF-16" tag only. thereby requiring the use of the "UTF-16" tag only.
4. Interpreting text labels 4. Interpreting text labels
When a program sees text labelled as "UTF-16BE", "UTF-16LE", or "UTF-16", When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
it can make some assumptions, based on the labelling rules given in the "UTF-16", it can make some assumptions, based on the labelling rules
previous section. These assumptions allow the program to then process the given in the previous section. These assumptions allow the program to
text. then process the text.
4.1 Interpreting text labelled as UTF-16BE 4.1 Interpreting text labelled as UTF-16BE
Text labelled "UTF-16BE" can always be interpreted as always being Text labelled "UTF-16BE" can always be interpreted as being big-endian.
big-endian. The detection of an initial BOM does not affect The detection of an initial BOM does not affect de-serialization of
de-serialization of text labelled as UTF-16BE. Finding 0xFF followed by text labelled as UTF-16BE. Finding 0xFF followed by 0xFE is an error
0xFE is an error since there is no Unicode character 0xFFFE. since there is no Unicode character 0xFFFE.
4.2 Interpreting text labelled as UTF-16LE 4.2 Interpreting text labelled as UTF-16LE
Text labelled "UTF-16LE" can always be interpreted as always being Text labelled "UTF-16LE" can always be interpreted as being
little-endian. The detection of an initial BOM does not affect little-endian. The detection of an initial BOM does not affect
de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by
0xFF is an error since there is no Unicode character 0xFFFE, which would be 0xFF is an error since there is no Unicode character 0xFFFE, which
the interpretation of those octets under little-endian order. would be the interpretation of those octets under little-endian order.
4.3 Interpreting text labelled as UTF-16 4.3 Interpreting text labelled as UTF-16
Text labelled with the "UTF-16" charset might be serialized in either Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the text are big-endian or little-endian order. If the first two octets of the text
0xFE followed by 0xFF, then the text can be interpreted as being are 0xFE followed by 0xFF, then the text can be interpreted as being
big-endian. If the first two octets of the text are 0xFF followed by 0xFE, big-endian. If the first two octets of the text are 0xFF followed by
then the text can be interpreted as being little-endian. If the first two 0xFE, then the text can be interpreted as being little-endian. If the
octets of the text are not 0xFE followed by 0xFF, and are not 0xFF followed first two octets of the text are not 0xFE followed by 0xFF, and are not
by 0xFE, then the text SHOULD be interpreted as being big-endian. 0xFF followed by 0xFE, then the text SHOULD be interpreted as being
big-endian.
All applications that process text with the "UTF-16" charset label MUST be All applications that process text with the "UTF-16" charset label MUST
able to read at least the first two octets of the text and be able to be able to read at least the first two octets of the text and be able
process those octets in order to determine the serialization order of the to process those octets in order to determine the serialization order
text. Applications that process text with the "UTF-16" charset label MUST of the text. Applications that process text with the "UTF-16" charset
NOT assume the serialization without first checking the first two octets to label MUST NOT assume the serialization without first checking the
see if they are a big-endian BOM, a little-endian BOM, or not a BOM. first two octets to see if they are a big-endian BOM, a little-endian
BOM, or not a BOM. All applications that process text with the "UTF-16"
charset label MUST be able to interpret both big-endian and
little-endian text.
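The check on the first two octets of text labelled "UTF-16" can be sketched as follows. The enum and function names are invented for this example. The sketch reports whether a BOM was found through bom_len and leaves it to the caller to decide whether to skip those octets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, used only for this illustration. */
enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Inspects the first two octets of text labelled "UTF-16".
 * Sets *bom_len to 2 if a BOM was found, or to 0 otherwise,
 * and returns the serialization order to use. */
enum utf16_order utf16_detect_order(const uint8_t *octets, size_t len,
                                    size_t *bom_len)
{
    *bom_len = 0;

    if (len >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BIG_ENDIAN;        /* big-endian BOM */
        }
        if (octets[0] == 0xFF && octets[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LITTLE_ENDIAN;     /* little-endian BOM */
        }
    }

    /* No BOM: the text SHOULD be interpreted as big-endian. */
    return UTF16_BIG_ENDIAN;
}
```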
5. Examples 5. Examples
For the sake of example, let's suppose that there is a hieroglyphic For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value 0x00012345 character representing the Egyptian god Ra with character value
(this character does not exist at present in Unicode). 0x00012345 (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase: The examples here all evaluate to the phrase:
*=Ra *=Ra
where the "*" represents the Ra hieroglyph (0x00012345). where the "*" represents the Ra hieroglyph (0x00012345).
Text labelled with UTF-16BE, without a BOM: Text labelled with UTF-16BE, without a BOM:
D8 48 DF 45 00 3D 00 52 00 61 D8 08 DF 45 00 3D 00 52 00 61
Text labelled with UTF-16LE, without a BOM: Text labelled with UTF-16LE, without a BOM:
48 D8 45 DF 3D 00 52 00 61 00 08 D8 45 DF 3D 00 52 00 61 00
Big-endian text labelled with UTF-16, with a BOM: Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 48 DF 45 00 3D 00 52 00 61 FE FF D8 08 DF 45 00 3D 00 52 00 61
Little-endian text labelled with UTF-16, with a BOM: Little-endian text labelled with UTF-16, with a BOM:
FF FE 48 D8 45 DF 3D 00 52 00 61 00 FF FE 08 D8 45 DF 3D 00 52 00 61 00
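The octet sequences in these examples can be checked with a short sketch such as the one below. The function name utf16_serialize, the helper put16, and the flag parameters are assumptions made for this illustration. For the character value 0x00012345, U' is 0x2345, whose 10 high-order bits are 0x008 and whose 10 low-order bits are 0x345, so W1 is 0xD808 and W2 is 0xDF45. This agrees with the draft-03 column (D8 08 DF 45) and not with the draft-02 column (D8 48 DF 45).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes one 16-bit integer as two octets in the requested order. */
static size_t put16(uint16_t w, int little_endian, uint8_t *out)
{
    out[0] = (uint8_t)(little_endian ? (w & 0xFF) : (w >> 8));
    out[1] = (uint8_t)(little_endian ? (w >> 8) : (w & 0xFF));
    return 2;
}

/* Hypothetical helper (name and signature are illustrative only).
 * Serializes n character values as UTF-16 octets into out, which must
 * have room for 4 * n + 2 octets. Returns the number of octets written,
 * or 0 if a value greater than 0x10FFFF is encountered. */
size_t utf16_serialize(const uint32_t *chars, size_t n,
                       int little_endian, int with_bom, uint8_t *out)
{
    size_t len = 0;

    if (with_bom)
        len += put16(0xFEFF, little_endian, out + len);

    for (size_t i = 0; i < n; i++) {
        uint32_t u = chars[i];

        if (u > 0x10FFFF)
            return 0;                   /* cannot be encoded in UTF-16 */

        if (u < 0x10000) {
            len += put16((uint16_t)u, little_endian, out + len);
        } else {
            u -= 0x10000;               /* U' */
            len += put16((uint16_t)(0xD800 | (u >> 10)),
                         little_endian, out + len);
            len += put16((uint16_t)(0xDC00 | (u & 0x3FF)),
                         little_endian, out + len);
        }
    }
    return len;
}
```

Calling it with the four character values 0x12345, 0x3D, 0x52 and 0x61 and the four combinations of order and BOM gives the four sequences listed above.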
6. Versions of the standards 6. Versions of the standards
ISO/IEC 10646 is updated from time to time by published amendments; ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, similarly, different versions of the Unicode standard exist: 1.0, 1.1,
and 2.1 as of this writing. Each new version replaces the previous one, 2.0, and 2.1 as of this writing. Each new version replaces the
but implementations, and more significantly data, are not updated previous one, but implementations, and more significantly data, are not
instantly. updated instantly.
In general, the changes amount to adding new characters, which does not In general, the changes amount to adding new characters, which does not
pose particular problems with old data. Amendment 5 to ISO/IEC 10646, pose particular problems with old data. Amendment 5 to ISO/IEC 10646,
however, has moved and expanded the Korean Hangul block, thereby making any however, has moved and expanded the Korean Hangul block, thereby making
previous data containing Hangul characters invalid under the new version. any previous data containing Hangul characters invalid under the new
Unicode 2.0 has the same difference from Unicode 1.1. The official version. Unicode 2.0 has the same difference from Unicode 1.1. The
justification for allowing such an incompatible change was that no official justification for allowing such an incompatible change was
significant implementations and data containing Hangul existed, a statement that no significant implementations and data containing Hangul existed,
that is likely to be true but remains unprovable. The incident has been a statement that is likely to be true but remains unprovable. The
dubbed the "Korean mess", and the relevant committees have pledged to incident has been dubbed the "Korean mess", and the relevant committees
never, ever again make such an incompatible change. have pledged to never, ever again make such an incompatible change.
New versions, and in particular any incompatible changes, have consequences New versions, and in particular any incompatible changes, have
regarding MIME character encoding labels, to be discussed in Appendix A. consequences regarding MIME character encoding labels, to be discussed
in Appendix A.
7. Security considerations 7. Security considerations
UTF-16 is based on the ISO 10646 character set, which is frequently being UTF-16 is based on the ISO 10646 character set, which is frequently
added to, as described in Section 6 and Appendix A of this document. being added to, as described in Section 6 and Appendix A of this
Processors must be able to handle characters that are not defined at the document. Processors must be able to handle characters that are not
time that the processor was created in such a way as to not allow an defined at the time that the processor was created in such a way as to
attacker to harm a recipient by including unknown characters. not allow an attacker to harm a recipient by including unknown
characters.
Processors that handle any type of text, including text encoded as UTF-16, Processors that handle any type of text, including text encoded as
must be vigilant in checking for control characters that might reprogram a UTF-16, must be vigilant in checking for control characters that might
display terminal or keyboard. Similarly, processors that interpret text reprogram a display terminal or keyboard. Similarly, processors that
entities (such as looking for embedded programming code), must be careful interpret text entities (such as looking for embedded programming
not to execute the code without first alerting the recipient. code), must be careful not to execute the code without first alerting
the recipient.
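One possible way to carry out the control-character check described above is sketched here. The function name find_control_characters and the default set of permitted characters (tab, line feed, carriage return) are assumptions made for illustration; a real processor would choose its own policy.

```python
import unicodedata


def find_control_characters(text, allowed="\t\n\r"):
    """Return (index, code point label) pairs for control characters in text.

    A character counts as a control character here when its Unicode general
    category is "Cc" (the C0 and C1 controls). The `allowed` default is an
    assumption of this sketch, not something the draft specifies.
    """
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in allowed
    ]


# An ESC character (U+001B) that begins a terminal escape sequence is flagged.
print(find_control_characters("ok\x1b[2Jline\n"))
```

The sketch only reports positions. Whether to strip, escape, or reject the flagged characters is left to the processor.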
Text in UTF-16 may contain special characters, such as the OBJECT Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a external processing may have side-effects that allow the sender of a
message to attack the receiving system. message to attack the receiving system.
Implementors of UTF-16 need to consider the security aspects of how they Implementors of UTF-16 need to consider the security aspects of how
handle illegal UTF-16 sequences (that is, sequences involving surrogate they handle illegal UTF-16 sequences (that is, sequences involving
pairs that have illegal values or unpaired surrogates). It is conceivable surrogate pairs that have illegal values or unpaired surrogates). It is
that in some circumstances an attacker would be able to exploit an conceivable that in some circumstances an attacker would be able to
incautious UTF-16 parser by sending it an octet sequence that is not exploit an incautious UTF-16 parser by sending it an octet sequence
permitted by the UTF-16 syntax, causing it to behave in some anomalous that is not permitted by the UTF-16 syntax, causing it to behave in
fashion. some anomalous fashion.
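A cautious parser might locate unpaired surrogates before interpreting the text at all. The following is a minimal sketch under that assumption. It takes a sequence of 16-bit values that have already been read in the correct byte order, and the name find_unpaired_surrogates is hypothetical.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of 16-bit units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        w = units[i]
        if 0xD800 <= w <= 0xDBFF:
            # A high surrogate is well formed only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= w <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


print(find_unpaired_surrogates([0xD808, 0xDF45, 0x003D]))  # well-formed input
print(find_unpaired_surrogates([0xD808, 0x003D]))          # high surrogate alone
```

Reporting the offending positions, rather than silently skipping them, lets the caller decide whether to reject the input or substitute a replacement character.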
8. References 8. References
[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and
BCP 18, RFC 2277, January 1998. Languages", BCP 18, RFC 2277, January 1998.
[CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2278, January 1998. Procedures", BCP 19, RFC 2278, January 1998.
[HTTP-1.1] Fielding, R., et al., "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2068, January 1997.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Architecture and Basic Multilingual Plane. Twelve amendments and two Part 1: Architecture and Basic Multilingual Plane. Twelve amendments
technical corrigenda have been published up to now. UTF-16 is described in and two technical corrigenda have been published up to now. UTF-16 is
Annex Q, published as Amendment 1. Many other amendments are currently at described in Annex Q, published as Amendment 1. Many other amendments
various stages of standardization. are currently at various stages of standardization.
[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version
Unicode Technical Report #8. 2.1", Unicode Technical Report #8.
[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998. 2279, January 1998.
[WORKSHOP] Weider, C., et al., "Report of the IAB Character Set Workshop", [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set
RFC 2130, April 1997. Workshop", RFC 2130, April 1997.
9. Acknowledgments 9. Acknowledgments
Deborah Goldsmith wrote a great deal of the initial wording for this Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Martin Duerst gave numerous significant changes. Other specification. Martin Duerst proposed numerous significant changes.
significant contributors include: Other significant contributors include:
Mati Allouche Mati Allouche
Walt Daniels Walt Daniels
Mark Davis Mark Davis
Ned Freed Ned Freed
Asmus Freytag Asmus Freytag
Lloyd Honomichl Lloyd Honomichl
Dan Kegel Dan Kegel
Murata Makoto Murata Makoto
Larry Masinter Larry Masinter
Markus Scherer
Ken Whistler Ken Whistler
Some of the text in this specification was copied from [UTF-8], and that Some of the text in this specification was copied from [UTF-8], and
document was worked on by many people. Please see the acknowledgments that document was worked on by many people. Please see the
section in that document for more people who may have contributed acknowledgments section in that document for more people who may have
indirectly to this document. contributed indirectly to this document.
10. Authors' address
Paul Hoffman
Internet Mail Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org
Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada
fyergeau@alis.com
11. Changes between draft -01 and -02
Fixed some spelling mistakes throughout.
Updated the status boilerplate.
Clarified the parameter values in 1.
Added [WORKSHOP] reference in 1.1 and 8. Also fuzzified the description of
what UTF-16 is (instead of getting into hair-splitting on CESs, CCSs, and
so on).
Corrected 1.2 on the characters for which UTF-8 incurs a space penalty.
Added "from ISO 10646 to UTF-16" to the beginning of 2.1.
Added "from UTF-16 to ISO 10646" to the beginning of 2.2.
Added text to the end of the note at the end of 2.2 about possibly emitting
the ill-formed characters when decoding.
Rearranged much of sections 3 and 4. This makes the following changes
hard to follow; the references refer to the *old* section numbers,
not necessarily the ones as they exist in this draft. Sorry about that...
Changed the end of the first paragraph of 3.1 to get out of the
which-endian-has-most debate.
Clarified the fourth paragraph of 3.1 (the one that begins
"This specification thus...") about the use of "UTF-16" as both a
sequencing mechanism and a charset label.
Added Martin Duerst's C code fragment for big-endian order. 10. Changes between draft -02 and -03
Added the sentence to the end of the sixth paragraph of 3.1 (the one 1: Reorganized the sections. Added information about two octets being
that begins "It is important...") with the example of substrings and enough for all current characters and the committees saying they will
ZWNBSs. not go beyond what can be defined in UTF-16.
Added text about SHOULD NOT put an initial BOM in both 3.2 and 3.3. 2.1: Reworded step 2 with words to make it easier to read.
Clarified the last clause in section 3.3. 2.2: Added "U" to step 1. Also added note to the end of the last
paragraph about string decoding and errors.
Removed the last paragraph of 4 (the paragraph that used to start 3: Added a reference to section 4 about interpreting labels.
"Because creating text labelled...") because it related to text-creating
programs instead of text-labelling programs.
Rearranged and relabelled some of the examples in 5. 3.1: Reworded last sentence in last paragraph.
Removed "obsoletes" from the first paragraph of 6. Slightly fuzzified 4.3: Added requirement that apps that can read UTF-16 must be able to
the "no implementations" sentence in the second paragraph. interpret both big-endian and little-endian.
Alphabetized the references in 8. 5: Corrected the examples due to wrong encoding.
Added Larry Masinter to section 9. Gave Martin Duerst more credit. 11: Moved authors' addresses to Appendix B.
A. Charset registrations A. Charset registrations
This memo is meant to serve as the basis for registration of three MIME This memo is meant to serve as the basis for registration of three MIME
charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE",
and "UTF-16". These strings label objects containing text consisting of "UTF-16LE", and "UTF-16". These strings label objects containing text
characters from the repertoire of ISO/IEC 10646 including all amendments at consisting of characters from the repertoire of ISO/IEC 10646 including
least up to amendment 5 (Korean block), encoded to a sequence of octets all amendments at least up to amendment 5 (Korean block), encoded to a
using the encoding and serialization schemes outlined above. sequence of octets using the encoding and serialization schemes
outlined above.
Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use
media types under the "text" top-level type, because they do not encode in media types under the "text" top-level type, because they do not
line endings in the way required for MIME "text" media types. encode line endings in the way required for MIME "text" media types. An
exception to this is HTTP, which uses a MIME-like mechanism, but is
exempt from the restrictions on the text top-level type (see section
19.4.1 of HTTP 1.1 [HTTP-1.1]).
It is noteworthy that the labels described here do not contain a version It is noteworthy that the labels described here do not contain a
identification, referring generically to ISO/IEC 10646. This is version identification, referring generically to ISO/IEC 10646. This is
intentional, the rationale being as follows: intentional, the rationale being as follows:
A MIME charset is designed to give just the information needed to interpret A MIME charset is designed to give just the information needed to
a sequence of bytes received on the wire into a sequence of characters, interpret a sequence of bytes received on the wire into a sequence of
nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As
set standard does not change incompatibly, version numbers serve no long as a character set standard does not change incompatibly, version
purpose, because one gains nothing by learning from the tag that newly numbers serve no purpose, because one gains nothing by learning from
assigned characters may be received that one doesn't know about. The tag the tag that newly assigned characters may be received that one doesn't
itself doesn't teach anything about the new characters, which are going to know about. The tag itself doesn't teach anything about the new
be received anyway. characters, which are going to be received anyway.
Hence, as long as the standards evolve compatibly, the apparent advantage Hence, as long as the standards evolve compatibly, the apparent
of having labels that identify the versions is only that, apparent. But advantage of having labels that identify the versions is only that,
there is a disadvantage to such version-dependent labels: when an older apparent. But there is a disadvantage to such version-dependent
application receives data accompanied by a newer, unknown label, it may labels: when an older application receives data accompanied by a newer,
fail to recognize the label and be completely unable to deal with the data, unknown label, it may fail to recognize the label and be completely
whereas a generic, known label would have triggered mostly correct unable to deal with the data, whereas a generic, known label would have
processing of the data, which may well not contain any new characters. triggered mostly correct processing of the data, which may well not
contain any new characters.
The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
principle contradicting the appropriateness of a version independent MIME change, in principle contradicting the appropriateness of a version
charset as described above. But the compatibility problem can only appear independent MIME charset as described above. But the compatibility
with data containing Korean Hangul characters encoded according to Unicode problem can only appear with data containing Korean Hangul characters
1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646 before
arguably no such data to worry about, this being the very reason the amendment 5), and there is arguably no such data to worry about, this
incompatible change was deemed acceptable. being the very reason the incompatible change was deemed acceptable.
In practice, then, a version-independent label is warranted, provided the In practice, then, a version-independent label is warranted, provided
label is understood to refer to all versions after Amendment 5, and the label is understood to refer to all versions after Amendment 5, and
provided no incompatible change actually occurs. Should incompatible provided no incompatible change actually occurs. Should incompatible
changes occur in a later version of ISO/IEC 10646, the MIME charsets changes occur in a later version of ISO/IEC 10646, the MIME charsets
defined here will stay aligned with the previous version until and unless defined here will stay aligned with the previous version until and
the IETF specifically decides otherwise. unless the IETF specifically decides otherwise.
A.1 Registration for UTF-16BE A.1 Registration for UTF-16BE
To: ietf-charsets@iana.org To: ietf-charsets@iana.org
Subject: Registration of new charset Subject: Registration of new charset
Charset name(s): UTF-16BE Charset name(s): UTF-16BE
Published specification(s): This specification Published specification(s): This specification
skipping to change at line 594 skipping to change at line 572
Charset name(s): UTF-16 Charset name(s): UTF-16
Published specification(s): This specification Published specification(s): This specification
Suitable for use in MIME content types under the Suitable for use in MIME content types under the
"text" top-level type: No "text" top-level type: No
Person & email address to contact for further information: Person & email address to contact for further information:
Paul Hoffman <phoffman@imc.org> Paul Hoffman <phoffman@imc.org>
Francois Yergeau <fyergeau@alis.com> Francois Yergeau <fyergeau@alis.com>
B. Authors' address
Paul Hoffman
Internet Mail Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org
Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada
fyergeau@alis.com
 End of changes. 79 change blocks. 
325 lines changed or deleted 303 lines changed or added
