< draft-hoffman-utf16-01.txt   draft-hoffman-utf16-02.txt >
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
<draft-hoffman-utf16-01.txt> Internet Mail Consortium <draft-hoffman-utf16-02.txt> Internet Mail Consortium
December 13, 1998 Francois Yergeau February 10, 1999 Francois Yergeau
Alis Technologies Alis Technologies
UTF-16, an encoding of ISO 10646 UTF-16, an encoding of ISO 10646
Status of this Memo Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working documents This document is an Internet-Draft and is in full conformance with all
of the Internet Engineering Task Force (IETF), its areas, and its working provisions of Section 10 of RFC2026.
groups. Note that other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months. Internet-Drafts are working documents of the Internet Engineering Task
Internet-Drafts may be updated, replaced, or obsoleted by other documents Force (IETF), its areas, and its working groups. Note that other groups
at any time. It is not appropriate to use Internet-Drafts as reference may also distribute working documents as Internet-Drafts.
material or to cite them other than as a "working draft" or "work in
progress".
To view the entire list of current Internet-Drafts, please check the Internet-Drafts are draft documents valid for a maximum of six months and
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow may be updated, replaced, or obsoleted by other documents at any time. It
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe), is inappropriate to use Internet-Drafts as reference material or to cite
ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim), them other than as "work in progress."
ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).
Copyright (C) The Internet Society (1998). All Rights Reserved. The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright (C) The Internet Society (1999). All Rights Reserved.
1. Introduction 1. Introduction
This document specifies the UTF-16 encoding of Unicode/ISO-10646 and This document describes the UTF-16 encoding of Unicode/ISO-10646 and
contains the registration for three MIME charset parameter values: contains the registration for three MIME charset parameter values:
UTF-16BE, UTF-16LE, and UTF-16. UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.
1.1 Background 1.1 Background
The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode, which define a coded character set (CCS), hereafter referred to as Unicode, which
encompasses most of the world's writing systems. UTF-16, the object of this encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the
specification, is a character encoding scheme (CES) of Unicode that has the object of this specification, is a way to encode Unicode characters that
characteristics of encoding the vast majority of currently-defined has the characteristics of encoding the vast majority of currently-defined
characters in exactly two octets and of being able to encode all other characters in exactly two octets and of being able to encode all other
characters that will be defined in exactly four octets. characters that will be defined in exactly four octets.
The Unicode Standard further defines additional character properties and The Unicode Standard further defines additional character properties and
other application details of great interest to implementors. Up to the other application details of great interest to implementors. Up to the
present time, changes in Unicode and amendments to ISO/IEC 10646 have present time, changes in Unicode and amendments to ISO/IEC 10646 have
tracked each other, so that the character repertoires and code point tracked each other, so that the character repertoires and code point
assignments have remained in sync. The relevant standardization committees assignments have remained in sync. The relevant standardization committees
have committed to maintain this very useful synchronism. have committed to maintain this very useful synchronism.
1.2 Motivation 1.2 Motivation
The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF
policy on character sets and languages, [CHARPOLICY], says that IETF policy on character sets and languages, [CHARPOLICY], says that IETF
protocols MUST be able to use the UTF-8 charset. However, relative to protocols MUST be able to use the UTF-8 charset. However, relative to
UTF-16, UTF-8 imposes a space penalty for characters whose values are UTF-16, UTF-8 imposes a space penalty for characters whose values are
greater than 0x0800. Also, characters represented in UTF-8 have varying between 0x0800 and 0xFFFF. Also, characters represented in UTF-8 have varying
sizes. Using UTF-16 provides a way to transmit character data that is sizes. Using UTF-16 provides a way to transmit character data that is
mostly uniform in size. Some products and network standards already specify mostly uniform in size. Some products and network standards already specify
UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in
many protocols, such as the direct encoding of US-ASCII characters and many protocols, such as the direct encoding of US-ASCII characters and
re-synchronization after loss of octets.) re-synchronization after loss of octets.)
UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as
a sequence of 16-bit quantities. This document addresses the issues of a sequence of 16-bit quantities. This document addresses the issues of
serializing UTF-16 as an octet stream for transmission over the Internet serializing UTF-16 as an octet stream for transmission over the Internet
and of MIME charset naming as described in [CHARSET-REG]. and of MIME charset naming as described in [CHARSET-REG].
skipping to change at line 107 skipping to change at line 107
16-bit integer with a value between 0xD800 and 0xDBFF (within the 16-bit integer with a value between 0xD800 and 0xDBFF (within the
so-called high-half zone or high surrogate area) followed by a 16-bit so-called high-half zone or high surrogate area) followed by a 16-bit
integer with a value between 0xDC00 and 0xDFFF (within the so-called integer with a value between 0xDC00 and 0xDFFF (within the so-called
low-half zone or low surrogate area). low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in - Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16. UTF-16.
2.1 Encoding UTF-16 2.1 Encoding UTF-16
Encoding of a single character proceeds as follows. Let U be the character Encoding of a single character from an ISO 10646 character value to UTF-16
number, no greater than 0x10FFFF. proceeds as follows. Let U be the character number, no greater than
0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF, 2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF,
that is, U' can be represented in 20 bits. that is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
0xDC00, respectively. These integers each have 10 bits free to encode the 0xDC00, respectively. These integers each have 10 bits free to encode the
character value, for a total of 20 bits. character value, for a total of 20 bits.
skipping to change at line 130 skipping to change at line 131
of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2.
Terminate. Terminate.
Graphically, steps 2 through 4 look like: Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx W2 = 110111xxxxxxxxxx
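The encoding steps above can be sketched in C as follows. This is a minimal illustration, not normative text: the function name utf16_encode and its return convention are hypothetical, and the rejection of values in the surrogate range 0xD800 to 0xDFFF is an assumption added here because those values cannot stand for a character on their own.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one character value u as UTF-16, following steps 1 through 4.
 * Writes one or two 16-bit units to w[] and returns how many were written.
 * Returns 0 if u cannot be encoded (u > 0x10FFFF).
 * Assumption (not stated in the steps above): values in the surrogate
 * range 0xD800..0xDFFF are also rejected.
 */
size_t utf16_encode(uint32_t u, uint16_t w[2])
{
    if (u > 0x10FFFF)
        return 0;
    if (u < 0x10000) {                         /* step 1 */
        if (u >= 0xD800 && u <= 0xDFFF)
            return 0;
        w[0] = (uint16_t)u;
        return 1;
    }
    uint32_t up = u - 0x10000;                 /* step 2: U' fits in 20 bits */
    w[0] = (uint16_t)(0xD800 | (up >> 10));    /* W1: 10 high-order bits of U' */
    w[1] = (uint16_t)(0xDC00 | (up & 0x3FF));  /* W2: 10 low-order bits of U' */
    return 2;
}
```

For instance, the character value 0x10000 yields W1 = 0xD800 and W2 = 0xDC00, and 0x10FFFF yields W1 = 0xDBFF and W2 = 0xDFFF.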
2.2 Decoding UTF-16 2.2 Decoding UTF-16
Decoding of a single character proceeds as follows. Let W1 be the next Decoding of a single character from UTF-16 to an ISO 10646 character value
16-bit integer in the sequence of integers representing the text. Let W2 be proceeds as follows. Let W1 be the next 16-bit integer in the sequence of
the (eventual) next integer following W1. integers representing the text. Let W2 be the (eventual) next integer
following W1.
1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1.
Terminate. Terminate.
2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in
error and no valid character can be obtained using W1. Terminate. error and no valid character can be obtained using W1. Terminate.
3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not
between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. between 0xDC00 and 0xDFFF, the sequence is in error. Terminate.
4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of
W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10
low-order bits. low-order bits.
5) Add 0x10000 to U' to obtain the character value U. Terminate. 5) Add 0x10000 to U' to obtain the character value U. Terminate.
Note that steps 2 and 3 indicate errors. Error recovery is not specified by Note that steps 2 and 3 indicate errors. Error recovery is not specified by
this document. this document. When terminating with an error in steps 2 and 3, it may be
wise to set U to the value of W1 to help the caller diagnose the error and
not lose information.
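The decoding steps can be sketched in C along the same lines. The function name utf16_decode and the convention of returning the number of 16-bit integers consumed are hypothetical choices made for this illustration; on error the sketch stores W1 in the output, following the suggestion above, and leaves recovery to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one character from the n 16-bit integers at w.
 * On success, stores the character value in *u and returns the number of
 * integers consumed (1 or 2). On error returns 0; if at least one integer
 * was available, *u holds the value of W1 to help diagnose the error.
 */
size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    if (n == 0)
        return 0;
    uint16_t w1 = w[0];
    *u = w1;
    if (w1 < 0xD800 || w1 > 0xDFFF)
        return 1;                               /* step 1 */
    if (w1 > 0xDBFF)
        return 0;                               /* step 2: W1 is not a high surrogate */
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;                               /* step 3: missing or invalid W2 */
    uint32_t up = ((uint32_t)(w1 & 0x3FF) << 10)
                | (uint32_t)(w[1] & 0x3FF);     /* step 4 */
    *u = up + 0x10000;                          /* step 5 */
    return 2;
}
```

A caller that receives 0 might skip one 16-bit integer and continue, or abort; that policy is outside the scope of this sketch.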
3. Serialization of characters 3. Labelling UTF-16 text
This specification contains registration for three MIME charsets:
"UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the
combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and the
CES is the same in all three cases, except for the serialization order of
the octets in each character, and the external determination of which
serialization is used.
This section describes which of the three labels to apply to a stream of text.
3.1 Definition of big-endian and little-endian 3.1 Definition of big-endian and little-endian
Historically, computer hardware has processed two-octet entities such as Historically, computer hardware has processed two-octet entities such as
16-bit integers in one of two ways. So-called "big-endian" hardware handles 16-bit integers in one of two ways. So-called "big-endian" hardware handles
two-octet entities with the higher-order octet first, that is at the lower two-octet entities with the higher-order octet first, that is at the lower
address in memory; when written out to disk or to a network interface address in memory; when written out to disk or to a network interface
(serializing), the high-order octet thus appears first in the data stream. (serializing), the high-order octet thus appears first in the data stream.
"Little-endian" hardware handles two-octet entities with the lower-order On the other hand, "little-endian" hardware handles two-octet entities with
octet first. Most modern hardware is little-endian, but there are many the lower-order octet first. Hardware of both kinds is common today.
current examples of big-endian hardware.
For example, the unsigned 16-bit integer that represents the decimal number For example, the unsigned 16-bit integer that represents the decimal number
258 is 0x0102. The big-endian serialization of that number is the octet 258 is 0x0102. The big-endian serialization of that number is the octet
0x01 followed by the octet 0x02. The little-endian serialization of that 0x01 followed by the octet 0x02. The little-endian serialization of that
number is the octet 0x02 followed by the octet 0x01. number is the octet 0x02 followed by the octet 0x01. The following C code
fragment demonstrates a way to write 16-bit quantities to a file in
big-endian order, irrespective of the hardware's native byte order.
#include <stdio.h>

void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
{
    putc(u >> 8, f);   /* output high-order byte */
    putc(u & 0xFF, f); /* then low-order */
}
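The same idea applies to an in-memory octet buffer, and in the reverse direction when reading. The following sketch is illustrative only; the names put16_be, put16_le, get16_be, and get16_le are hypothetical. Because the octets are assembled with shifts rather than by reinterpreting memory, the result does not depend on the hardware's native byte order.

```c
#include <stdint.h>

/* Store u at p[0..1] with the high-order octet first (big-endian). */
void put16_be(uint8_t *p, uint16_t u)
{
    p[0] = (uint8_t)(u >> 8);
    p[1] = (uint8_t)(u & 0xFF);
}

/* Store u at p[0..1] with the low-order octet first (little-endian). */
void put16_le(uint8_t *p, uint16_t u)
{
    p[0] = (uint8_t)(u & 0xFF);
    p[1] = (uint8_t)(u >> 8);
}

/* Read a 16-bit quantity serialized in big-endian order. */
uint16_t get16_be(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a 16-bit quantity serialized in little-endian order. */
uint16_t get16_le(const uint8_t *p)
{
    return (uint16_t)((p[1] << 8) | p[0]);
}
```

With the value 0x0102 from the example above, put16_be stores the octets 0x01, 0x02 and put16_le stores 0x02, 0x01.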
The term "network byte order" has been used in many RFCs to indicate The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has never been formally big-endian serialization, although that term has yet to be formally
defined in a standards-track document. ISO 10646 prefers big-endian defined in a standards-track document. ISO 10646 prefers big-endian
serialization (section 6.3 of [ISO-10646]), but it is nonetheless serialization (section 6.3 of [ISO-10646]), but it is nonetheless
considered likely that little-endian order will also be used on the considered likely that little-endian order will also be used on the
Internet. Internet.
This specification thus contains registration for three charsets: 3.2 Byte order mark (BOM)
"UTF-16BE", "UTF-16LE", and "UTF-16". The character encoding schemes these
charsets use are identical except for the serialization order of the octets
in each character, and the external determination of which serialization is
used.
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER
MARK" (abbreviated "BOM"). The latter name hints at a second possible usage MARK" (abbreviated "BOM"). The latter name hints at a second possible usage
of the character, in addition to its normal use as a genuine "ZERO WIDTH of the character, in addition to its normal use as a genuine "ZERO WIDTH
NON-BREAKING SPACE" within text. This usage, suggested by Unicode section NON-BREAKING SPACE" within text. This usage, suggested by Unicode section
2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character
to a stream of Unicode characters as a "signature"; a receiver of such a to a stream of Unicode characters as a "signature"; a receiver of such a
serialized stream may then use the initial character both as a hint that serialized stream may then use the initial character both as a hint that
the stream consists of Unicode characters and as a way to recognize the the stream consists of Unicode characters and as a way to recognize the
skipping to change at line 204 skipping to change at line 220
if they are 0xFF followed by 0xFE, the order is little-endian. Note that if they are 0xFF followed by 0xFE, the order is little-endian. Note that
0xFFFE is not a Unicode character, precisely to preserve the usefulness of 0xFFFE is not a Unicode character, precisely to preserve the usefulness of
0xFEFF as a byte-order mark. 0xFEFF as a byte-order mark.
It is important to understand that the character 0xFEFF appearing at any It is important to understand that the character 0xFEFF appearing at any
position other than the beginning of a stream MUST be interpreted with the position other than the beginning of a stream MUST be interpreted with the
semantics for the zero-width non-breaking space, and MUST NOT be semantics for the zero-width non-breaking space, and MUST NOT be
interpreted as a byte-order mark. The contrapositive of that statement is interpreted as a byte-order mark. The contrapositive of that statement is
not always true: the character 0xFEFF in the first position of a stream MAY not always true: the character 0xFEFF in the first position of a stream MAY
be interpreted as a zero-width non-breaking space, and is not always a be interpreted as a zero-width non-breaking space, and is not always a
byte-order mark. byte-order mark. For example, if a process splits a UTF-16 string into
many parts, a part might begin with 0xFEFF because there was a
zero-width non-breaking space at the beginning of that substring.
The Unicode standard further suggests that an initial 0xFEFF character may The Unicode standard further suggests that an initial 0xFEFF character may
be stripped before processing the text, the rationale being that such a be stripped before processing the text, the rationale being that such a
character in initial position may be an artifact of the encoding (an character in initial position may be an artifact of the encoding (an
encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING
SPACE". Nevertheless, such stripping MUST NOT take place before any SPACE". Note that such stripping might affect an external process at a
MIME-related operations (such as hash algorithms, digest, or byte-count different layer (such as a digital signature or a count of the characters)
computations) have been completed. Such operations depend on the exact that is relying on the presence of all characters in the stream.
bytes of the data, which therefore may not be modified in any way. After
all MIME-related operations have been completed (for instance after a MIME
processor has handed an entity to a specific media type processor), an
initial 0xFEFF MAY be removed if appropriate, although this will prevent
later comparison with the original MIME object. In particular, in UTF-16
plain text it is likely that an initial 0xFEFF is a signature; when
concatenating two strings, it is important to strip out those signatures,
for otherwise the resulting string may contain an unintended "ZERO WIDTH
NON-BREAKING SPACE" at the connection point. Also, some specifications
mandate an initial 0xFEFF character in objects encoded in UTF-16 and
specify that this signature is not part of the object.
3.2 Serialization in UTF-16BE In particular, in UTF-16 plain text it is likely, but not certain, that an
initial 0xFEFF is a signature; when concatenating two strings, it is
important to strip out those signatures, for otherwise the resulting string
may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the connection
point. Also, some specifications mandate an initial 0xFEFF character in
objects encoded in UTF-16 and specify that this signature is not part of
the object.
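As an illustration of the concatenation case, the following C sketch appends one string of 16-bit values to another and drops an initial 0xFEFF from the appended string. The function name utf16_concat is hypothetical, and the sketch rests on two assumptions: both strings are already held as native 16-bit integers, and the caller has decided that an initial 0xFEFF in the second string is a signature rather than text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append src (n_src 16-bit values) to dst, which currently holds n_dst
 * values and has room for at least n_dst + n_src.
 * Assumption: an initial 0xFEFF in src is a signature and is dropped.
 * A 0xFEFF anywhere else in src is text and is kept.
 * Returns the new length of dst in 16-bit values.
 */
size_t utf16_concat(uint16_t *dst, size_t n_dst,
                    const uint16_t *src, size_t n_src)
{
    if (n_src > 0 && src[0] == 0xFEFF) {
        src++;          /* skip the signature */
        n_src--;
    }
    memcpy(dst + n_dst, src, n_src * sizeof *src);
    return n_dst + n_src;
}
```

An application that cannot tell whether the initial 0xFEFF is a signature would need a different policy; the sketch does not attempt to guess.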
Text in the "UTF-16BE" charset MUST be serialized with the octets which 3.3 Choosing a label for UTF-16 text
make up a single 16-bit UTF-16 value in big-endian order. The detection of
an initial BOM does not affect de-serialization of text labelled as
UTF-16BE. Finding 0xFF followed by 0xFE is an error since there is no
Unicode character 0xFFFE.
3.3 Serialization in UTF-16LE Any labelling application that uses UTF-16 character encoding, and puts an
explicit charset label on the text, and knows the serialization order of
the characters in text, SHOULD label the text as either "UTF-16BE" or
"UTF-16LE", whichever is appropriate based on the endianness of the text.
This allows applications processing the text, but unable to look inside the
text, to know the serialization definitively.
Text in the "UTF-16BE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in big-endian order. Systems labelling
UTF-16BE text MUST NOT prepend a BOM to the text.
Text in the "UTF-16LE" charset MUST be serialized with the octets which Text in the "UTF-16LE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in little-endian order. The detection make up a single 16-bit UTF-16 value in little-endian order. Systems
of an initial BOM does not affect de-serialization of text labelled as labelling UTF-16LE text MUST NOT prepend a BOM to the text.
UTF-16LE. Finding 0xFE followed by 0xFF is an error since there is no Unicode
character 0xFFFE, which is the interpretation of the 0xFEFF character under
little-endian order.
3.4 Serialization in UTF-16 Any labelling application that uses UTF-16 character encoding, and puts an
explicit charset label on the text, and does not know the serialization
order of the characters in text, MUST label the text as "UTF-16", and
SHOULD make sure the text starts with 0xFEFF.
Text in the "UTF-16" charset MAY be serialized in either big-endian or An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or
little-endian order. If the first two octets of the text are 0xFE followed "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text,
by 0xFF, then the text MUST be big-endian. If the first two octets of the thereby requiring the use of the "UTF-16" tag only.
text are 0xFF followed by 0xFE, then the text MUST be little-endian. If the
first two octets of the text are not 0xFE followed by 0xFF and are not 0xFF
followed by 0xFE, then the text MUST be big-endian. Big-endian text in the
"UTF-16" charset MAY start with the 0xFEFF character, but the 0xFEFF
character is not required.
All applications that process text in the "UTF-16" charset MUST be able to 4. Interpreting text labels
read at least the first two octets of the text and be able to process those
octets in order to determine the serialization of the text. Applications
that use the "UTF-16" charset parameter value MUST NOT assume the
serialization without first checking the first two octets to see if they
are a big-endian BOM or a little-endian BOM or not a BOM.
4. Choosing a charset When a program sees text labelled as "UTF-16BE", "UTF-16LE", or "UTF-16",
it can make some assumptions, based on the labelling rules given in the
previous section. These assumptions allow the program to then process the
text.
Any labelling application that uses UTF-16 character encoding, and puts an 4.1 Interpreting text labelled as UTF-16BE
explicit charset label on the text, and knows the serialization of the
characters in text, MUST label the text as either "UTF-16BE" or "UTF-16LE",
whichever is appropriate. This allows applications that are processing the
text that are not able to look inside the text to know the serialization
definitively.
Any labelling application that uses UTF-16 character encoding, and puts an Text labelled "UTF-16BE" can always be interpreted as being
explicit charset label on the text, and does not know the serialization of big-endian. The detection of an initial BOM does not affect
the characters in text, MUST label the text as "UTF-16", and SHOULD be sure de-serialization of text labelled as UTF-16BE. Finding 0xFF followed by
the text starts with 0xFEFF. An application processing text that is 0xFE is an error since there is no Unicode character 0xFFFE.
labelled with the "UTF-16" charset parameter value knows that the
serialization cannot be determined without looking inside the text itself.
Fortunately, the processing application needs to only look at the first
character (the first two octets) of the text to determine the
serialization.
Because creating text labelled as being in the "UTF-16" charset forces the 4.2 Interpreting text labelled as UTF-16LE
recipient to read and understand the first character of the text object, a
text-creating program SHOULD create text labelled as "UTF-16BE" or Text labelled "UTF-16LE" can always be interpreted as being
"UTF-16LE" if possible. Text-creating programs that create text using little-endian. The detection of an initial BOM does not affect
UTF-16 encoding SHOULD emit big-endian text if possible. de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by
0xFF is an error since there is no Unicode character 0xFFFE, which would be
the interpretation of those octets under little-endian order.
4.3 Interpreting text labelled as UTF-16
Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the text are
0xFE followed by 0xFF, then the text can be interpreted as being
big-endian. If the first two octets of the text are 0xFF followed by 0xFE,
then the text can be interpreted as being little-endian. If the first two
octets of the text are not 0xFE followed by 0xFF, and are not 0xFF followed
by 0xFE, then the text SHOULD be interpreted as being big-endian.
All applications that process text with the "UTF-16" charset label MUST be
able to read at least the first two octets of the text and be able to
process those octets in order to determine the serialization order of the
text. Applications that process text with the "UTF-16" charset label MUST
NOT assume the serialization without first checking the first two octets to
see if they are a big-endian BOM, a little-endian BOM, or not a BOM.
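The check of the first two octets can be sketched in C as follows. This is one possible shape, not a required interface: the enum, the function name utf16_detect_order, and the bom_len output parameter are hypothetical. Text without a BOM, including text shorter than two octets, falls through to big-endian as described above.

```c
#include <stddef.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/*
 * Determine the serialization order of text labelled "UTF-16" from its
 * first two octets. *bom_len is set to 2 when a BOM was found and to 0
 * otherwise, so the caller can decide whether to skip it.
 */
enum utf16_order utf16_detect_order(const unsigned char *p, size_t n,
                                    size_t *bom_len)
{
    *bom_len = 0;
    if (n >= 2) {
        if (p[0] == 0xFE && p[1] == 0xFF) {   /* big-endian BOM */
            *bom_len = 2;
            return UTF16_BIG_ENDIAN;
        }
        if (p[0] == 0xFF && p[1] == 0xFE) {   /* little-endian BOM */
            *bom_len = 2;
            return UTF16_LITTLE_ENDIAN;
        }
    }
    return UTF16_BIG_ENDIAN;                  /* not a BOM: assume big-endian */
}
```

Text labelled "UTF-16BE" or "UTF-16LE" would not go through such a function at all, since the label already fixes the order.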
5. Examples 5. Examples
For the sake of example, let's suppose that there is a hieroglyphic For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value 0x00012345 character representing the Egyptian god Ra with character value 0x00012345
(this character does not exist at present in Unicode). (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase: The examples here all evaluate to the phrase:
*=Ra *=Ra
where the "*" represents the Ra hieroglyph (0x00012345). where the "*" represents the Ra hieroglyph (0x00012345).
Text that is labelled with UTF-16BE, with no BOM: Text labelled with UTF-16BE, without a BOM:
D8 08 DF 45 00 3D 00 52 00 61 D8 08 DF 45 00 3D 00 52 00 61
Text that is labelled with UTF-16BE, with a BOM: Text labelled with UTF-16LE, without a BOM:
FE FF D8 08 DF 45 00 3D 00 52 00 61
Text that is labelled with UTF-16LE, with no BOM:
08 D8 45 DF 3D 00 52 00 61 00 08 D8 45 DF 3D 00 52 00 61 00
Little-endian text that is labelled with UTF-16: Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 08 DF 45 00 3D 00 52 00 61
Little-endian text labelled with UTF-16, with a BOM:
FF FE 08 D8 45 DF 3D 00 52 00 61 00 FF FE 08 D8 45 DF 3D 00 52 00 61 00
6. Versions of the standards 6. Versions of the standards
ISO/IEC 10646 is updated from time to time by published amendments; ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0,
and 2.1 as of this writing. Each new version obsoletes and replaces the and 2.1 as of this writing. Each new version replaces the previous one,
previous one, but implementations, and more significantly data, are not but implementations, and more significantly data, are not updated
updated instantly. instantly.
In general, the changes amount to adding new characters, which does not In general, the changes amount to adding new characters, which does not
pose particular problems with old data. Amendment 5 to ISO/IEC 10646, pose particular problems with old data. Amendment 5 to ISO/IEC 10646,
however, has moved and expanded the Korean Hangul block, thereby making any however, has moved and expanded the Korean Hangul block, thereby making any
previous data containing Hangul characters invalid under the new version. previous data containing Hangul characters invalid under the new version.
Unicode 2.0 has the same difference from Unicode 1.1. The official Unicode 2.0 has the same difference from Unicode 1.1. The official
justification for allowing such an incompatible change was that no justification for allowing such an incompatible change was that no
implementations and no data containing Hangul existed, a statement that is significant implementations and data containing Hangul existed, a statement
likely to be true but remains unprovable. The incident has been dubbed the that is likely to be true but remains unprovable. The incident has been
"Korean mess", and the relevant committees have pledged to never, ever dubbed the "Korean mess", and the relevant committees have pledged to
again make such an incompatible change. never, ever again make such an incompatible change.
New versions, and in particular any incompatible changes, have consequences New versions, and in particular any incompatible changes, have consequences
regarding MIME character encoding labels, to be discussed in Appendix A. regarding MIME character encoding labels, to be discussed in Appendix A.
7. Security considerations 7. Security considerations
UTF-16 is based on the ISO 10646 character set, which is frequently being UTF-16 is based on the ISO 10646 character set, which is frequently being
added to, as described in Section 6 and Appendix A of this document. added to, as described in Section 6 and Appendix A of this document.
Processors must be able to handle characters that are not defined at the Processors must be able to handle characters that are not defined at the
time that the processor was created in such a way as to not allow an time that the processor was created in such a way as to not allow an
skipping to change at line 354 skipping to change at line 374
Text in UTF-16 may contain special characters, such as the OBJECT Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a external processing may have side-effects that allow the sender of a
message to attack the receiving system. message to attack the receiving system.
Implementors of UTF-16 need to consider the security aspects of how they Implementors of UTF-16 need to consider the security aspects of how they
handle illegal UTF-16 sequences (that is, sequences involving surrogate handle illegal UTF-16 sequences (that is, sequences involving surrogate
pairs that have illegal values). It is conceivable that in some pairs that have illegal values or unpaired surrogates). It is conceivable
circumstances an attacker would be able to exploit an incautious UTF-16 that in some circumstances an attacker would be able to exploit an
parser by sending it an octet sequence that is not permitted by the UTF-16 incautious UTF-16 parser by sending it an octet sequence that is not
syntax, causing it to behave in some anomalous fashion. permitted by the UTF-16 syntax, causing it to behave in some anomalous
fashion.
8. References 8. References
[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages",
BCP 18, RFC 2277, January 1998.
[CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2278, January 1998. Procedures", BCP 19, RFC 2278, January 1998.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane. Twelve amendments and two Architecture and Basic Multilingual Plane. Twelve amendments and two
technical corrigenda have been published up to now. UTF-16 is described in technical corrigenda have been published up to now. UTF-16 is described in
Annex Q, published as Amendment 1. Many other amendments are currently at Annex Q, published as Amendment 1. Many other amendments are currently at
various stages of standardization. various stages of standardization.
[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1",
BCP 18, RFC 2277, January 1998. Unicode Technical Report #8.
[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998. 2279, January 1998.
[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set Workshop",
Unicode Technical Report #8. RFC 2130, April 1997.
9. Acknowledgments 9. Acknowledgments
Deborah Goldsmith wrote a great deal of the initial wording for this Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Other significant contributors include: specification. Martin Duerst gave numerous significant changes. Other
significant contributors include:
Mati Allouche Mati Allouche
Walt Daniels Walt Daniels
Mark Davis Mark Davis
Martin Duerst
Ned Freed Ned Freed
Asmus Freytag Asmus Freytag
Lloyd Honomichl Lloyd Honomichl
Dan Kegel Dan Kegel
Murata Makoto Murata Makoto
Larry Masinter
Ken Whistler Ken Whistler
Some of the text in this specification was copied from [UTF-8], and that Some of the text in this specification was copied from [UTF-8], and that
document was worked on by many people. Please see the acknowledgements document was worked on by many people. Please see the acknowledgments
section in that document for more people who may have contributed section in that document for more people who may have contributed
indirectly to this document. indirectly to this document.
10. Authors' address 10. Authors' address
Paul Hoffman Paul Hoffman
Internet Mail Consortium Internet Mail Consortium
127 Segre Place 127 Segre Place
Santa Cruz, CA 95060 USA Santa Cruz, CA 95060 USA
phoffman@imc.org phoffman@imc.org
Francois Yergeau Francois Yergeau
Alis Technologies Alis Technologies
100, boul. Alexis-Nihon, Suite 600 100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada Montreal QC H4M 2P2 Canada
fyergeau@alis.com fyergeau@alis.com
11. Changes between draft -01 and -02
Fixed some spelling mistakes throughout.
Updated the status boilerplate.
Clarified the parameter values in 1.
Added [WORKSHOP] reference in 1.1 and 8. Also fuzzified the description of
what UTF-16 is (instead of getting into hair-splitting on CESs, CCSs, and
so on).
Corrected 1.2 on the characters for which UTF-8 incurs a space penalty.
Added "from ISO 10646 to UTF-16" to the beginning of 2.1.
Added "from UTF-16 to ISO 10646" to the beginning of 2.2.
Added text to the end of the note at the end of 2.2 about possibly emitting
the ill-formed characters when decoding.
Rearranged much of sections 3 and 4. This makes the following changes
hard to follow; the references refer to the *old* section numbers,
not necessarily the ones as they exist in this draft. Sorry about that...
Changed the end of the first paragraph of 3.1 to get out of the
which-endian-has-most debate.
Clarified the fourth paragraph of 3.1 (the one that begins
"This specification thus...") about the use of "UTF-16" as both a
sequencing mechanism and a charset label.
Added Martin Duerst's C code fragment for big-endian order.
Added the sentence to the end of the sixth paragraph of 3.1 (the one
that begins "It is important...") with the example of substrings and
ZWNBSs.
Added text about SHOULD NOT put an initial BOM in both 3.2 and 3.3.
Clarified the last clause in section 3.3.
Removed the last paragraph of 4 (the paragraph that used to start
"Because creating text labelled...") because it related to text-creating
programs instead of text-labelling programs.
Rearranged and relabelled some of the examples in 5.
Removed "obsoletes" from the first paragraph of 6. Slightly fuzzified
the "no implementations" sentence in the second paragraph.
Alphabetized the references in 8.
Added Larry Masinter to section 9. Gave Martin Duerst more credit.
A. Charset registrations A. Charset registrations
This memo is meant to serve as the basis for registration of three MIME This memo is meant to serve as the basis for registration of three MIME
charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE",
and "UTF-16". These strings label objects containing text consisting of and "UTF-16". These strings label objects containing text consisting of
characters from the repertoire of ISO/IEC 10646 including all amendments at characters from the repertoire of ISO/IEC 10646 including all amendments at
least up to amendment 5 (Korean block), encoded to a sequence of octets least up to amendment 5 (Korean block), encoded to a sequence of octets
using the encoding and serialization schemes outlined above. using the encoding and serialization schemes outlined above.
Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in