< draft-hoffman-utf16-00.txt   draft-hoffman-utf16-01.txt >
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
<draft-hoffman-utf16-00.txt> Internet Mail Consortium <draft-hoffman-utf16-01.txt> Internet Mail Consortium
November 12, 1998 Francois Yergeau December 13, 1998 Francois Yergeau
Alis Technologies Alis Technologies
UTF-16, an encoding of ISO 10646 UTF-16, an encoding of ISO 10646
Status of this Memo Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working documents This document is an Internet-Draft. Internet-Drafts are working documents
of the Internet Engineering Task Force (IETF), its areas, and its working of the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as groups. Note that other groups may also distribute working documents as
Internet- Drafts. Internet- Drafts.
skipping to change at line 39 skipping to change at line 39
1. Introduction 1. Introduction
This document specifies the UTF-16 encoding of Unicode/ISO-10646 and This document specifies the UTF-16 encoding of Unicode/ISO-10646 and
contains the registration for three MIME charset parameter values: contains the registration for three MIME charset parameter values:
UTF-16BE, UTF-16LE, and UTF-16. UTF-16BE, UTF-16LE, and UTF-16.
1.1 Background 1.1 Background
The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly
define a character set (hereafter referred to as Unicode) which encompasses define a coded character set (CCS), hereafter referred to as Unicode, which
most of the world's writing systems. UTF-16, the object of this encompasses most of the world's writing systems. UTF-16, the object of this
specification, is an encoding scheme of this character set that has the specification, is a character encoding scheme (CES) of Unicode that has the
characteristics of encoding the vast majority of currently-defined characteristics of encoding the vast majority of currently-defined
characters in exactly two octets and of being able to encode all other characters in exactly two octets and of being able to encode all other
characters that will be defined in exactly four octets. characters that will be defined in exactly four octets.
The Unicode Standard further defines additional character properties and The Unicode Standard further defines additional character properties and
other application details of great interest to implementors. Up to the other application details of great interest to implementors. Up to the
present time, changes in Unicode and amendments to ISO/IEC 10646 have present time, changes in Unicode and amendments to ISO/IEC 10646 have
tracked each other, so that the character repertoires and code point tracked each other, so that the character repertoires and code point
assignments have remained in sync. The relevant standardization committees assignments have remained in sync. The relevant standardization committees
have committed to maintain this very useful synchronism. have committed to maintain this very useful synchronism.
1.2 Motivation 1.2 Motivation
The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF
policy on character sets, [CHARPOLICY], says that IETF protocols MUST be policy on character sets and languages, [CHARPOLICY], says that IETF
able to use the UTF-8 charset. However, relative to UTF-16, UTF-8 imposes a protocols MUST be able to use the UTF-8 charset. However, relative to
space penalty for characters whose values are greater than 0x0800. Also, UTF-16, UTF-8 imposes a space penalty for characters whose values are
characters represented in UTF-8 have varying sizes. Using UTF-16 provides a greater than 0x0800. Also, characters represented in UTF-8 have varying
way to transmit character data that is mostly uniform in size. Some sizes. Using UTF-16 provides a way to transmit character data that is
products and network standards already specify UTF-16. (Note, however, that mostly uniform in size. Some products and network standards already specify
UTF-8 has many other advantages over UTF-16 in many protocols, such as the UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in
direct encoding of US-ASCII characters.) many protocols, such as the direct encoding of US-ASCII characters and
re-synchronization after loss of octets.)
UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as
a sequence of 16-bit quantities. This document addresses the issues of a sequence of 16-bit quantities. This document addresses the issues of
serializing UTF-16 as an octet stream for transmission over the Internet serializing UTF-16 as an octet stream for transmission over the Internet
and of MIME charset naming as described in [CHARSET-REG]. and of MIME charset naming as described in [CHARSET-REG].
1.3 Terminology 1.3 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. document are to be interpreted as described in RFC 2119 [MUSTSHOULD].
Throughout this document, character values are shown in hexadecimal Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" is the character whose value is at the notation. For example, "0x013C" is the character whose value is the
codepoint that is 316 (decimal) positions from the base of the character character assigned the integer value 316 (decimal) in the CCS.
set.
2. UTF-16 definition 2. UTF-16 definition
In ISO 10646, each character is assigned a number, which Unicode calls the In ISO 10646, each character is assigned a number, which Unicode calls the
Unicode scalar value. This number is the same as the UCS-4 value of the Unicode scalar value. This number is the same as the UCS-4 value of the
character, and this document will refer to it as the "character value" for character, and this document will refer to it as the "character value" for
brevity. In the UTF-16 encoding, characters are represented using either brevity. In the UTF-16 encoding, characters are represented using either
one or two unsigned 16-bit integers, depending on the character value. one or two unsigned 16-bit integers, depending on the character value.
Serialization of these integers for transmission as a byte stream is Serialization of these integers for transmission as a byte stream is
discussed in Section 3. discussed in Section 3.
The rules for how characters are encoded in UTF-16 are: The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a single - Characters with values less than 0x10000 are represented as a single
integer with a value equal to that of the character number. 16-bit integer with a value equal to that of the character number.
- Characters with values between 0x10000 and 0x10FFFF are represented by - Characters with values between 0x10000 and 0x10FFFF are represented by a
an integer with a value between 0xD800 and 0xDBFF (within the so-called 16-bit integer with a value between 0xD800 and 0xDBFF (within the
high-half zone or high surrogate area) followed by an integer with a so-called high-half zone or high surrogate area) followed by a 16-bit
value between 0xDC00 and 0xDFFF (within the so-called low-half zone or integer with a value between 0xDC00 and 0xDFFF (within the so-called
low surrogate area). low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in - Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16. UTF-16.
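The three rules above can be sketched in Python. This is an illustrative sketch rather than text from either draft: the function name `utf16_units` is hypothetical, and the arithmetic for the surrogate pair follows the standard UTF-16 definition (subtract 0x10000, then split the 20-bit result into two 10-bit halves), which belongs to the part of section 2.1 that this diff does not show.

```python
def utf16_units(u):
    """Return the list of 16-bit integers representing character value u."""
    if u < 0 or u > 0x10FFFF:
        # Third rule: values above 0x10FFFF cannot be encoded in UTF-16.
        raise ValueError("character value cannot be encoded in UTF-16")
    if u < 0x10000:
        # First rule: a single 16-bit integer equal to the character value.
        return [u]
    # Second rule: a high-half value followed by a low-half value.
    v = u - 0x10000              # 20-bit quantity
    high = 0xD800 | (v >> 10)    # top 10 bits, lands in 0xD800..0xDBFF
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits, lands in 0xDC00..0xDFFF
    return [high, low]
```

For the hypothetical character 0x00012345 used in Section 5, `utf16_units(0x12345)` gives 0xD808 followed by 0xDF45.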
2.1 Encoding UTF-16 2.1 Encoding UTF-16
Encoding of a single character proceeds as follows. Let U be the character Encoding of a single character proceeds as follows. Let U be the character
number, no greater than 0x10FFFF. number, no greater than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
skipping to change at line 177 skipping to change at line 177
0x01 followed by the octet 0x02. The little-endian serialization of that 0x01 followed by the octet 0x02. The little-endian serialization of that
number is the octet 0x02 followed by the octet 0x01. number is the octet 0x02 followed by the octet 0x01.
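As a small illustration of the two serializations, the following Python sketch uses the built-in `int.to_bytes`. The value 0x0102 is an assumption inferred from the octets named above, since the start of the paragraph is not shown in this diff.

```python
value = 0x0102  # assumed 16-bit number; its octets are 0x01 and 0x02

big = value.to_bytes(2, "big")        # octet 0x01 followed by octet 0x02
little = value.to_bytes(2, "little")  # octet 0x02 followed by octet 0x01

# Reading the octets back with the matching order recovers the number.
restored = int.from_bytes(little, "little")
```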
The term "network byte order" has been used in many RFCs to indicate The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has never been formally big-endian serialization, although that term has never been formally
defined in a standards-track document. ISO 10646 prefers big-endian defined in a standards-track document. ISO 10646 prefers big-endian
serialization (section 6.3 of [ISO-10646]), but it is nonetheless serialization (section 6.3 of [ISO-10646]), but it is nonetheless
considered likely that little-endian order will also be used on the considered likely that little-endian order will also be used on the
Internet. Internet.
This specification thus contains registration for three charset parameter This specification thus contains registration for three charsets:
values: "UTF-16BE", "UTF-16LE", and "UTF-16". The three character encodings "UTF-16BE", "UTF-16LE", and "UTF-16". The character encoding schemes these
are identical except for the serialization order of the octets in each charsets use are identical except for the serialization order of the octets
character, and the external determination of which serialization is used. in each character, and the external determination of which serialization is
used.
The Unicode Standard defines the character "ZERO WIDTH NON-BREAKING SPACE" The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
(0xFEFF) which is also known as the "BYTE ORDER MARK", abbreviated "BOM". NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER
All BOM characters MUST be considered to be characters of the text object MARK" (abbreviated "BOM"). The latter name hints at a second possible usage
that is labelled with the "UTF-16BE", "UTF-16LE", or "UTF-16" charset of the character, in addition to its normal use as a genuine "ZERO WIDTH
parameter values. The BOM characters MUST be included when performing NON-BREAKING SPACE" within text. This usage, suggested by Unicode section
MIME-related operations over the entire text, such as in hash algorithms 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character
and length calculations. After the text has been processed, the BOM MAY be to a stream of Unicode characters as a "signature"; a receiver of such a
removed, although this will prevent later comparison with the original MIME serialized stream may then use the initial character both as a hint that
object. the stream consists of Unicode characters and as a way to recognize the
serialization order. In serialized UTF-16 prepended with such a signature,
the order is big-endian if the first two octets are 0xFE followed by 0xFF;
if they are 0xFF followed by 0xFE, the order is little-endian. Note that
0xFFFE is not a Unicode character, precisely to preserve the usefulness of
0xFEFF as a byte-order mark.
It is important to understand that the character 0xFEFF appearing at any
position other than the beginning of a stream MUST be interpreted with the
semantics for the zero-width non-breaking space, and MUST NOT be
interpreted as a byte-order mark. The inverse of that statement is
not always true: the character 0xFEFF in the first position of a stream MAY
be interpreted as a zero-width non-breaking space, and is not always a
byte-order mark.
The Unicode standard further suggests that an initial 0xFEFF character may
be stripped before processing the text, the rationale being that such a
character in initial position may be an artifact of the encoding (an
encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING
SPACE". Nevertheless, such stripping MUST NOT take place before any
MIME-related operations (such as hash algorithms, digest, or byte-count
computations) have been completed. Such operations depend on the exact
bytes of the data, which therefore may not be modified in any way. After
all MIME-related operations have been completed (for instance after a MIME
processor has handed an entity to a specific media type processor), an
initial 0xFEFF MAY be removed if appropriate, although this will prevent
later comparison with the original MIME object. In particular, in UTF-16
plain text it is likely that an initial 0xFEFF is a signature; when
concatenating two strings, it is important to strip out those signatures,
for otherwise the resulting string may contain an unintended "ZERO WIDTH
NON-BREAKING SPACE" at the connection point. Also, some specifications
mandate an initial 0xFEFF character in objects encoded in UTF-16 and
specify that this signature is not part of the object.
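The ordering requirement in the -01 text (MIME-related operations first, optional stripping afterwards) might look like the sketch below. The function name `digest_then_strip` is hypothetical, and SHA-256 is only a stand-in for whatever hash or digest a given protocol actually uses; neither draft names a specific algorithm.

```python
import hashlib

SIGNATURE_BE = b"\xFE\xFF"  # 0xFEFF serialized big-endian
SIGNATURE_LE = b"\xFF\xFE"  # 0xFEFF serialized little-endian

def digest_then_strip(octets):
    """Hash the exact octets first, then drop an initial 0xFEFF if present."""
    # The digest depends on the exact bytes, so it is computed
    # before anything is removed.
    digest = hashlib.sha256(octets).hexdigest()
    if octets[:2] in (SIGNATURE_BE, SIGNATURE_LE):
        body = octets[2:]   # stripping is a MAY, done only afterwards
    else:
        body = octets
    return digest, body
```

Whether stripping is appropriate still depends on the application; the sketch strips unconditionally only to keep the example short.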
3.2 Serialization in UTF-16BE 3.2 Serialization in UTF-16BE
Text labelled with the "UTF-16BE" charset parameter value MUST be Text in the "UTF-16BE" charset MUST be serialized with the octets which
serialized with the octets which make up a single 16-bit UTF-16 value in make up a single 16-bit UTF-16 value in big-endian order. The detection of
big-endian order. The detection of an initial BOM or a reversed BOM does an initial BOM does not affect de-serialization of text labelled as
not affect de-serialization of text labelled as UTF-16BE. Finding a UTF-16BE. Finding 0xFF followed by 0xFE is an error since there is no
reversed BOM (that is, the octet 0xFF followed by the octet 0xFE) is an Unicode character 0xFFFE.
error since there is no Unicode character 0xFFFE.
3.3 Serialization in UTF-16LE 3.3 Serialization in UTF-16LE
Text labelled with the "UTF-16LE" charset parameter value MUST be Text in the "UTF-16LE" charset MUST be serialized with the octets which
serialized with the octets which make up a single 16-bit UTF-16 value in make up a single 16-bit UTF-16 value in little-endian order. The detection
little-endian order. The detection of an initial BOM or a reversed BOM does of an initial BOM does not affect de-serialization of text labelled as
not affect de-serialization of text labelled as UTF-16BE. Finding a UTF-16LE. Finding 0xFE followed by 0xFF is an error since there is no Unicode
non-reversed BOM (that is, the octet 0xFE followed by the octet 0xFF) is an character 0xFFFE, which is the interpretation of the 0xFEFF character under
error since there is no Unicode character 0xFFFE, which is the little-endian order.
interpretation of the non-reversed BOM under little-endian order.
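Sections 3.2 and 3.3 can be read together as one de-serialization routine parameterized by byte order. The sketch below is one possible reading, not a normative implementation: the name `deserialize_units` is hypothetical, and treating any occurrence of 0xFFFE as an error (rather than only an initial one) is an interpretation of the -01 wording.

```python
def deserialize_units(octets, order):
    """Turn octets into 16-bit UTF-16 values.

    order is "big" for text labelled UTF-16BE, "little" for UTF-16LE.
    """
    if order not in ("big", "little"):
        raise ValueError("order must be 'big' or 'little'")
    if len(octets) % 2 != 0:
        raise ValueError("odd number of octets")
    units = []
    for i in range(0, len(octets), 2):
        unit = int.from_bytes(octets[i:i + 2], order)
        if unit == 0xFFFE:
            # There is no Unicode character 0xFFFE, so this is an error.
            raise ValueError("0xFFFE is not a Unicode character")
        units.append(unit)
    # An initial 0xFEFF is kept as an ordinary value: its presence does
    # not change how explicitly labelled text is de-serialized.
    return units
```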
3.4 Serialization in UTF-16 3.4 Serialization in UTF-16
Text labelled with the "UTF-16" charset parameter value MAY be serialized Text in the "UTF-16" charset MAY be serialized in either big-endian or
in either big-endian or little-endian order. Text labelled as UTF-16 MUST little-endian order. If the first two octets of the text are 0xFE followed
be big-endian unless the first two octets of the text are the sequence of octets by 0xFF, then the text MUST be big-endian. If the first two octets of the
0xFF 0xFE, in which case the serialization MUST be little-endian. text are 0xFF followed by 0xFE, then the text MUST be little-endian. If the
first two octets of the text are not 0xFE followed by 0xFF and are not 0xFF
Big-endian text labelled with the "UTF-16" charset parameter value MAY followed by 0xFE, then the text MUST be big-endian. Big-endian text in the
start with the big-endian BOM (the character 0xFEFF), but the BOM is not "UTF-16" charset MAY start with the 0xFEFF character, but the 0xFEFF
required. BOM characters other than the first character of a body part are character is not required.
not interpreted as BOMs.
All applications that process text that uses the "UTF-16" charset parameter All applications that process text in the "UTF-16" charset MUST be able to
value MUST be able to read at least the first two octets of the text and be read at least the first two octets of the text and be able to process those
able to process those octets in order to determine the serialization of the octets in order to determine the serialization of the text. Applications
text. Applications that use the "UTF-16" charset parameter value MUST NOT that use the "UTF-16" charset parameter value MUST NOT assume the
assume the serialization without first checking the first two octets to see serialization without first checking the first two octets to see if they
if they are a big-endian BOM or a little-endian BOM or not a BOM. are a big-endian BOM or a little-endian BOM or not a BOM.
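The rule for text labelled "UTF-16" reduces to a check of the first two octets with big-endian as the default. A minimal Python sketch follows; the function name `utf16_byte_order` and the shape of its return value are assumptions made for illustration.

```python
def utf16_byte_order(octets):
    """Return (order, signature_present) for text labelled "UTF-16"."""
    first_two = octets[:2]
    if first_two == b"\xFE\xFF":
        return "big", True       # 0xFE followed by 0xFF
    if first_two == b"\xFF\xFE":
        return "little", True    # 0xFF followed by 0xFE
    # Neither sequence is present: the text MUST be big-endian.
    return "big", False
```

The second element of the result lets a caller decide later whether an initial 0xFEFF is a signature that may be stripped.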
4. Choosing a charset 4. Choosing a charset
Any labelling application that uses UTF-16 character encoding, and puts an Any labelling application that uses UTF-16 character encoding, and puts an
explicit charset label on the text, and knows the serialization of the explicit charset label on the text, and knows the serialization of the
characters in text, MUST label the text with the "UTF-16BE" or the characters in text, MUST label the text as either "UTF-16BE" or "UTF-16LE",
"UTF-16LE" charset parameter values. This allows applications that are whichever is appropriate. This allows applications that are processing the
processing the text that are not able to look inside the text to know the text that are not able to look inside the text to know the serialization
serialization definitively. definitively.
Any labelling application that uses UTF-16 character encoding, and puts an Any labelling application that uses UTF-16 character encoding, and puts an
explicit charset label on the text, and does not know the serialization of explicit charset label on the text, and does not know the serialization of
the characters in text, MUST label the text with the "UTF-16" charset the characters in text, MUST label the text as "UTF-16", and SHOULD be sure
parameter value, and SHOULD be sure the text starts with a BOM. An the text starts with 0xFEFF. An application processing text that is
application processing text that is labelled with the "UTF-16" charset labelled with the "UTF-16" charset parameter value knows that the
parameter value knows that the serialization cannot be determined without serialization cannot be determined without looking inside the text itself.
looking inside the text itself. Fortunately, the processing application Fortunately, the processing application needs to only look at the first
needs only look at the first character (the first two octets) of the text character (the first two octets) of the text to determine the
to determine the serialization. serialization.
Because creating text that uses the "UTF-16" charset parameter value forces Because creating text labelled as being in the "UTF-16" charset forces the
the recipient to read and understand the first character of the text recipient to read and understand the first character of the text object, a
object, a text-creating program SHOULD create text labelled with the text-creating program SHOULD create text labelled as "UTF-16BE" or
"UTF-16BE" or the "UTF-16LE" charset parameter values if possible. "UTF-16LE" if possible. Text-creating programs that create text using
Text-creating programs that create text using UTF-16 encoding SHOULD emit UTF-16 encoding SHOULD emit big-endian text if possible.
big-endian text if possible.
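The labelling rules in this section amount to a small decision function. The sketch below is illustrative only; `choose_charset_label` and its `known_order` parameter are hypothetical names, with `None` standing for "the serialization is not known".

```python
def choose_charset_label(known_order=None):
    """Pick a charset label from what the labelling application knows."""
    if known_order == "big":
        return "UTF-16BE"
    if known_order == "little":
        return "UTF-16LE"
    if known_order is None:
        # Serialization unknown: the text SHOULD then start with 0xFEFF.
        return "UTF-16"
    raise ValueError("known_order must be 'big', 'little', or None")
```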
5. Examples 5. Examples
For the sake of example, let's suppose that there is a hieroglyphic For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value 0x00012345 character representing the Egyptian god Ra with character value 0x00012345
(this character does not exist at present in Unicode). (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase: The examples here all evaluate to the phrase:
*=Ra *=Ra
skipping to change at line 311 skipping to change at line 340
7. Security considerations 7. Security considerations
UTF-16 is based on the ISO 10646 character set, which is frequently being UTF-16 is based on the ISO 10646 character set, which is frequently being
added to, as described in Section 6 and Appendix A of this document. added to, as described in Section 6 and Appendix A of this document.
Processors must be able to handle characters that are not defined at the Processors must be able to handle characters that are not defined at the
time that the processor was created in such a way as to not allow an time that the processor was created in such a way as to not allow an
attacker to harm a recipient by including unknown characters. attacker to harm a recipient by including unknown characters.
Processors that handle any type of text, including text encoded as UTF-16, Processors that handle any type of text, including text encoded as UTF-16,
must be vigilant for control characters that might reprogram a display must be vigilant in checking for control characters that might reprogram a
terminal or keyboard. Similarly, processors that interpret text entities display terminal or keyboard. Similarly, processors that interpret text
(such as looking for embedded programming code), must be careful not to entities (such as looking for embedded programming code), must be careful
execute the code without first alerting the recipient. not to execute the code without first alerting the recipient.
Text in UTF-16 may contain special characters, such as the OBJECT Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a external processing may have side-effects that allow the sender of a
message to attack the receiving system. message to attack the receiving system.
Implementors of UTF-16 need to consider the security aspects of how they Implementors of UTF-16 need to consider the security aspects of how they
handle illegal UTF-16 sequences (that is, sequences involving surrogate handle illegal UTF-16 sequences (that is, sequences involving surrogate
pairs that have illegal values). It is conceivable that in some pairs that have illegal values). It is conceivable that in some
circumstances an attacker would be able to exploit an incautious UTF-16 circumstances an attacker would be able to exploit an incautious UTF-16
parser by sending it an octet sequence that is not permitted by the UTF-16 parser by sending it an octet sequence that is not permitted by the UTF-16
syntax. syntax, causing it to behave in some anomalous fashion.
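One cautious approach is to check the pairing of surrogate values before interpreting the text at all. The Python sketch below assumes the input has already been de-serialized into 16-bit integers; the name `find_illegal_unit` is hypothetical, and what to do once an illegal value is found (reject, replace, log) is left open because the drafts do not prescribe it.

```python
def find_illegal_unit(units):
    """Return the index of the first unpaired surrogate value, or None."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high-half value must be followed by a low-half value.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # A low-half value without a preceding high-half value.
            return i
        i += 1
    return None
```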
8. References 8. References
[CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2278, January 1998. Procedures", BCP 19, RFC 2278, January 1998.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane. Twelve amendments and two Architecture and Basic Multilingual Plane. Twelve amendments and two
technical corrigenda have been published up to now. UTF-16 is described in technical corrigenda have been published up to now. UTF-16 is described in
skipping to change at line 356 skipping to change at line 385
BCP 18, RFC 2277, January 1998. BCP 18, RFC 2277, January 1998.
[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998. 2279, January 1998.
[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1",
Unicode Technical Report #8. Unicode Technical Report #8.
9. Acknowledgments 9. Acknowledgments
David Goldsmith wrote a great deal of the initial wording for this Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Other significant contributors include: specification. Other significant contributors include:
Mati Allouche Mati Allouche
Walt Daniels Walt Daniels
Mark Davis Mark Davis
Martin Duerst Martin Duerst
Ned Freed
Asmus Freytag Asmus Freytag
Lloyd Honomichl Lloyd Honomichl
Dan Kegel
Murata Makoto Murata Makoto
Ken Whistler Ken Whistler
Some of the text in this specification was copied from [UTF-8], and that Some of the text in this specification was copied from [UTF-8], and that
document was worked on by many people. Please see the acknowledgements document was worked on by many people. Please see the acknowledgements
section in that document for more people who may have contributed section in that document for more people who may have contributed
indirectly to this document. indirectly to this document.
10. Authors' address 10. Authors' address
skipping to change at line 390 skipping to change at line 421
Francois Yergeau Francois Yergeau
Alis Technologies Alis Technologies
100, boul. Alexis-Nihon, Suite 600 100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada Montreal QC H4M 2P2 Canada
fyergeau@alis.com fyergeau@alis.com
A. Charset registrations A. Charset registrations
This memo is meant to serve as the basis for registration of three MIME This memo is meant to serve as the basis for registration of three MIME
character set parameters (charsets) [CHARSET-REG]. The proposed charset charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE",
parameter values are "UTF-16BE", "UTF-16LE", and "UTF-16". These strings and "UTF-16". These strings label objects containing text consisting of
label media types containing text consisting of characters from the characters from the repertoire of ISO/IEC 10646 including all amendments at
repertoire of ISO/IEC 10646 including all amendments at least up to least up to amendment 5 (Korean block), encoded to a sequence of octets
amendment 5 (Korean block), encoded to a sequence of octets using the using the encoding and serialization schemes outlined above.
encoding and serialization schemes outlined above.
Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in
MIME content types under the "text" top-level type, because they do not media types under the "text" top-level type, because they do not encode
encode line endings in the way required for MIME "text" media types. line endings in the way required for MIME "text" media types.
It is noteworthy that the labels described here do not contain a version It is noteworthy that the labels described here do not contain a version
identification, referring generically to ISO/IEC 10646. This is identification, referring generically to ISO/IEC 10646. This is
intentional, the rationale being as follows: intentional, the rationale being as follows:
A MIME charset label is designed to give just the information needed to A MIME charset is designed to give just the information needed to interpret
interpret a sequence of bytes received on the wire into a sequence of a sequence of bytes received on the wire into a sequence of characters,
characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character
a character set standard does not change incompatibly, version numbers set standard does not change incompatibly, version numbers serve no
serve no purpose, because one gains nothing by learning from the tag that purpose, because one gains nothing by learning from the tag that newly
newly assigned characters may be received that one doesn't know about. The assigned characters may be received that one doesn't know about. The tag
tag itself doesn't teach anything about the new characters, which are going itself doesn't teach anything about the new characters, which are going to
to be received anyway. be received anyway.
Hence, as long as the standards evolve compatibly, the apparent advantage Hence, as long as the standards evolve compatibly, the apparent advantage
of having labels that identify the versions is only that, apparent. But of having labels that identify the versions is only that, apparent. But
there is a disadvantage to such version-dependent labels: when an older there is a disadvantage to such version-dependent labels: when an older
application receives data accompanied by a newer, unknown label, it may application receives data accompanied by a newer, unknown label, it may
fail to recognize the label and be completely unable to deal with the data, fail to recognize the label and be completely unable to deal with the data,
whereas a generic, known label would have triggered mostly correct whereas a generic, known label would have triggered mostly correct
processing of the data, which may well not contain any new characters. processing of the data, which may well not contain any new characters.
The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in
principle contradicting the appropriateness of a version independent MIME principle contradicting the appropriateness of a version independent MIME
charset label as described above. But the compatibility problem can only charset as described above. But the compatibility problem can only appear
appear with data containing Korean Hangul characters encoded according to with data containing Korean Hangul characters encoded according to Unicode
Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is
is arguably no such data to worry about, this being the very reason the arguably no such data to worry about, this being the very reason the
incompatible change was deemed acceptable. incompatible change was deemed acceptable.
In practice, then, a version-independent label is warranted, provided the In practice, then, a version-independent label is warranted, provided the
label is understood to refer to all versions after Amendment 5, and label is understood to refer to all versions after Amendment 5, and
provided no incompatible change actually occurs. Should incompatible provided no incompatible change actually occurs. Should incompatible
changes occur in a later version of ISO/IEC 10646, the MIME charset labels changes occur in a later version of ISO/IEC 10646, the MIME charsets
defined here will stay aligned with the previous version until and unless defined here will stay aligned with the previous version until and unless
the IETF specifically decides otherwise. the IETF specifically decides otherwise.
A.1 Registration for UTF-16BE A.1 Registration for UTF-16BE
To: ietf-charsets@iana.org To: ietf-charsets@iana.org
Subject: Registration of new charset Subject: Registration of new charset
Charset name(s): UTF-16BE Charset name(s): UTF-16BE
 End of changes. 25 change blocks. 
107 lines changed or deleted 137 lines changed or added
