| < draft-hoffman-utf16-02.txt | draft-hoffman-utf16-03.txt > | |||
|---|---|---|---|---|
| Internet Draft Paul Hoffman | Internet Draft Paul Hoffman | |||
| <draft-hoffman-utf16-02.txt> Internet Mail Consortium | <draft-hoffman-utf16-03.txt> Internet Mail Consortium | |||
| February 10, 1999 Francois Yergeau | April 19, 1999 Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| UTF-16, an encoding of ISO 10646 | UTF-16, an encoding of ISO 10646 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with all | This document is an Internet-Draft and is in full conformance with all | |||
| provisions of Section 10 of RFC2026. | provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering Task | Internet-Drafts are working documents of the Internet Engineering Task | |||
| Force (IETF), its areas, and its working groups. Note that other groups | Force (IETF), its areas, and its working groups. Note that other groups | |||
| may also distribute working documents as Internet-Drafts. | may also distribute working documents as Internet-Drafts. | |||
| Internet-Drafts are draft documents valid for a maximum of six months and | Internet-Drafts are draft documents valid for a maximum of six months | |||
| may be updated, replaced, or obsoleted by other documents at any time. It | and may be updated, replaced, or obsoleted by other documents at any | |||
| is inappropriate to use Internet- Drafts as reference material or to cite | time. It is inappropriate to use Internet-Drafts as reference material | |||
| them other than as "work in progress." | or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society (1999). All Rights Reserved. | |||
| 1. Introduction | 1. Introduction | |||
| This document describes the UTF-16 encoding of Unicode/ISO-10646 and | This document describes the UTF-16 encoding of Unicode/ISO-10646, | |||
| contains the registration for three MIME charset parameter values: | addresses the issues of serializing UTF-16 as an octet stream for | |||
| UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16. | transmission over the Internet, defines MIME charset naming as | |||
| | described in [CHARSET-REG], and contains the registration for three | |||
| | MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE | |||
| | (little-endian), and UTF-16. | |||
| 1.1 Background | 1.1 Background and motivation | |||
| The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | |||
| define a coded character set (CCS), hereafter referred to as Unicode, which | define a coded character set (CCS), hereafter referred to as Unicode, | |||
| encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the | which encompasses most of the world's writing systems [WORKSHOP]. | |||
| object of this specification, is a way to encode Unicode characters that | UTF-16, the object of this specification, is one of the standard ways | |||
| has the characteristics of encoding the vast majority of currently-defined | of encoding Unicode character data; it has the characteristics of | |||
| characters in exactly two octets and of being able to encode all other | encoding all currently defined characters (in plane 0, the BMP) in | |||
| characters that will be defined in exactly four octets. | exactly two octets and of being able to encode all other characters | |||
| | likely to be defined (the next 16 planes) in exactly four octets. | |||
| The Unicode Standard further defines additional character properties and | ||||
| other application details of great interest to implementors. Up to the | ||||
| present time, changes in Unicode and amendments to ISO/IEC 10646 have | ||||
| tracked each other, so that the character repertoires and code point | ||||
| assignments have remained in sync. The relevant standardization committees | ||||
| have committed to maintain this very useful synchronism. | ||||
| 1.2 Motivation | ||||
| The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF | The Unicode Standard further defines additional character properties | |||
| policy on character sets and languages, [CHARPOLICY], says that IETF | and other application details of great interest to implementors. Up to | |||
| protocols MUST be able to use the UTF-8 charset. However, relative to | the present time, changes in Unicode and amendments to ISO/IEC 10646 | |||
| UTF-16, UTF-8 imposes a space penalty for characters whose values are | have tracked each other, so that the character repertoires and code | |||
| between 0x0800 and 0xFFFF. Also, characters represented in UTF-8 have varying | point assignments have remained in sync. The relevant standardization | |||
| sizes. Using UTF-16 provides a way to transmit character data that is | committees have committed to maintain this very useful synchronism, as | |||
| mostly uniform in size. Some products and network standards already specify | well as not to assign characters outside of the 17 planes accessible to | |||
| UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in | UTF-16. | |||
| many protocols, such as the direct encoding of US-ASCII characters and | ||||
| re-synchronization after loss of octets.) | ||||
| UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as | The IETF policy on character sets and languages [CHARPOLICY] says that | |||
| a sequence of 16-bit quantities. This document addresses the issues of | IETF protocols MUST be able to use the UTF-8 charset [UTF-8]. Although | |||
| serializing UTF-16 as an octet stream for transmission over the Internet | UTF-8 has many beneficial properties, such as the direct encoding of | |||
| and of MIME charset naming as described in [CHARSET-REG]. | US-ASCII characters, re-synchronization after loss of octets and | |||
| | immunity to the byte-order issue (see 3.1 below), it is a | |||
| | variable-width encoding and is less dense than UTF-16 for characters | |||
| | whose values are between 0x0800 and 0xFFFF. Some products and network | |||
| | standards already specify UTF-16, making it an important encoding for | |||
| | the Internet. | |||
| 1.3 Terminology | 1.2 Terminology | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. | document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. | |||
| Throughout this document, character values are shown in hexadecimal | Throughout this document, character values are shown in hexadecimal | |||
| notation. For example, "0x013C" is the character assigned the integer | notation. For example, "0x013C" is the character assigned the integer | |||
| value 316 (decimal) in the CCS. | value 316 (decimal) in the CCS. | |||
| 2. UTF-16 definition | 2. UTF-16 definition | |||
| In ISO 10646, each character is assigned a number, which Unicode calls the | In ISO 10646, each character is assigned a number, which Unicode calls | |||
| Unicode scalar value. This number is the same as the UCS-4 value of the | the Unicode scalar value. This number is the same as the UCS-4 value of | |||
| character, and this document will refer to it as the "character value" for | the character, and this document will refer to it as the "character | |||
| brevity. In the UTF-16 encoding, characters are represented using either | value" for brevity. In the UTF-16 encoding, characters are represented | |||
| one or two unsigned 16-bit integers, depending on the character value. | using either one or two unsigned 16-bit integers, depending on the | |||
| Serialization of these integers for transmission as a byte stream is | character value. Serialization of these integers for transmission as a | |||
| discussed in Section 3. | byte stream is discussed in Section 3. | |||
| The rules for how characters are encoded in UTF-16 are: | The rules for how characters are encoded in UTF-16 are: | |||
| - Characters with values less than 0x10000 are represented as a single | - Characters with values less than 0x10000 are represented as a single | |||
| 16-bit integer with a value equal to that of the character number. | 16-bit integer with a value equal to that of the character number. | |||
| - Characters with values between 0x10000 and 0x10FFFF are represented by a | - Characters with values between 0x10000 and 0x10FFFF are represented | |||
| 16-bit integer with a value between 0xD800 and 0xDBFF (within the | by a 16-bit integer with a value between 0xD800 and 0xDBFF (within | |||
| so-called high-half zone or high surrogate area) followed by a 16-bit | the so-called high-half zone or high surrogate area) followed by a | |||
| integer with a value between 0xDC00 and 0xDFFF (within the so-called | 16-bit integer with a value between 0xDC00 and 0xDFFF (within the | |||
| low-half zone or low surrogate area). | so-called low-half zone or low surrogate area). | |||
| - Characters with values greater than 0x10FFFF cannot be encoded in | - Characters with values greater than 0x10FFFF cannot be encoded in | |||
| UTF-16. | UTF-16. | |||
| 2.1 Encoding UTF-16 | 2.1 Encoding UTF-16 | |||
| Encoding of a single character from an ISO 10646 character value to UTF-16 | Encoding of a single character from an ISO 10646 character value to | |||
| proceeds as follows. Let U be the character number, no greater than | UTF-16 proceeds as follows. Let U be the character number, no greater | |||
| 0x10FFFF. | than 0x10FFFF. | |||
| 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | |||
| 2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF, | 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, | |||
| that is, U' can be represented in 20 bits. | U' must be less than or equal to 0xFFFFF. That is, U' can be | |||
| | represented in 20 bits. | |||
| 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and | 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and | |||
| 0xDC00, respectively. These integers each have 10 bits free to encode the | 0xDC00, respectively. These integers each have 10 bits free to encode | |||
| character value, for a total of 20 bits. | the character value, for a total of 20 bits. | |||
| 4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits | 4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order | |||
| of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. | bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of | |||
| Terminate. | W2. Terminate. | |||
| Graphically, steps 2 through 4 look like: | Graphically, steps 2 through 4 look like: | |||
| U' = yyyyyyyyyyxxxxxxxxxx | U' = yyyyyyyyyyxxxxxxxxxx | |||
| W1 = 110110yyyyyyyyyy | W1 = 110110yyyyyyyyyy | |||
| W2 = 110111xxxxxxxxxx | W2 = 110111xxxxxxxxxx | |||
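
As an illustration only, steps 1 through 4 above can be sketched in C as follows. The function name utf16_encode, its signature, and its return convention are assumptions made for this sketch and do not come from either draft. Like the algorithm above, the sketch does not check whether U itself lies in the surrogate range 0xD800 to 0xDFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the draft): encode one character
 * value u as UTF-16. Writes one or two 16-bit integers to w and returns
 * how many were written, or 0 if u is greater than 0x10FFFF. */
size_t utf16_encode(uint32_t u, uint16_t w[2])
{
    if (u > 0x10FFFF)
        return 0;                     /* cannot be encoded in UTF-16 */
    if (u < 0x10000) {                /* step 1: single 16-bit integer */
        w[0] = (uint16_t)u;
        return 1;
    }
    uint32_t up = u - 0x10000;        /* step 2: U' fits in 20 bits */
    /* steps 3 and 4: 10 high-order bits into W1, 10 low-order into W2 */
    w[0] = (uint16_t)(0xD800 | (up >> 10));
    w[1] = (uint16_t)(0xDC00 | (up & 0x3FF));
    return 2;
}
```

Under this sketch, the character value 0x12345 gives U' = 0x02345, so W1 = 0xD808 and W2 = 0xDF45.

| | | |||
|---|---|---|---|---|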
| 2.2 Decoding UTF-16 | 2.2 Decoding UTF-16 | |||
| Decoding of a single character from UTF-16 to an ISO 10646 character value | Decoding of a single character from UTF-16 to an ISO 10646 character | |||
| proceeds as follows. Let W1 be the next 16-bit integer in the sequence of | value proceeds as follows. Let W1 be the next 16-bit integer in the | |||
| integers representing the text. Let W2 be the (eventual) next integer | sequence of integers representing the text. Let W2 be the (eventual) | |||
| following W1. | next integer following W1. | |||
| 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. | 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value of | |||
| Terminate. | W1. Terminate. | |||
| 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in | 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence | |||
| error and no valid character can be obtained using W1. Terminate. | is in error and no valid character can be obtained using W1. Terminate. | |||
| 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not | 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is | |||
| between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. | not between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. | |||
| 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of | 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits | |||
| W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 | of W1 as its 10 high-order bits and the 10 low-order bits of W2 as its | |||
| low-order bits. | 10 low-order bits. | |||
| 5) Add 0x10000 to U' to obtain the character value U. Terminate. | 5) Add 0x10000 to U' to obtain the character value U. Terminate. | |||
| Note that steps 2 and 3 indicate errors. Error recovery is not specified by | Note that steps 2 and 3 indicate errors. Error recovery is not | |||
| this document. When terminating with an error in steps 2 and 3, it may be | specified by this document. When terminating with an error in steps 2 | |||
| wise to set U to the value of W1 to help the caller diagnose the error and | and 3, it may be wise to set U to the value of W1 to help the caller | |||
| not lose information. | diagnose the error and not lose information. Also note that a string | |||
| | decoding algorithm, as opposed to the single-character decoding | |||
| | described above, need not terminate upon detection of an error, if | |||
| | proper error reporting and/or recovery is provided. | |||
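
As an illustration only, the single-character decoding steps of section 2.2 can be sketched in C as follows. The function name utf16_decode, the length parameter n, and the convention of returning 0 on error are assumptions made for this sketch. On error it stores W1 in the output, which is one way to follow the suggestion above about helping the caller diagnose the problem.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the draft): decode one character from
 * the n 16-bit integers at w (n must be at least 1). Returns the number
 * of integers consumed (1 or 2) and stores the character value in *u.
 * On error, returns 0 and stores W1 in *u. */
size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    uint16_t w1 = w[0];
    *u = w1;
    if (w1 < 0xD800 || w1 > 0xDFFF)   /* step 1: not a surrogate */
        return 1;
    if (w1 > 0xDBFF)                  /* step 2: W1 is not a high surrogate */
        return 0;
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;                     /* step 3: W2 missing or out of range */
    /* step 4: low 10 bits of W1 become the high 10 bits of U' */
    uint32_t up = ((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w[1] & 0x3FF);
    *u = up + 0x10000;                /* step 5 */
    return 2;
}
```

A string decoder built on such a helper could, for example, skip one 16-bit integer after an error and continue, provided it reports the error to its caller.

| | | |||
|---|---|---|---|---|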
| 3. Labelling UTF-16 text | 3. Labelling UTF-16 text | |||
| This specification contains registration for three MIME charsets: | This specification contains registration for three MIME charsets: | |||
| "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the | "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the | |||
| combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and the | combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and | |||
| CES is the same in all three cases, except for the serialization order of | the CES is the same in all three cases, except for the serialization | |||
| the octets in each character, and the external determination of which | order of the octets in each character, and the external determination | |||
| serialization is used. | of which serialization is used. | |||
| This section describes which of the three labels to apply to a stream of text. | This section describes which of the three labels to apply to a stream | |||
| | of text. Section 4 describes how to interpret the labels on a stream of | |||
| | text. | |||
| 3.1 Definition of big-endian and little-endian | 3.1 Definition of big-endian and little-endian | |||
| Historically, computer hardware has processed two-octet entities such as | Historically, computer hardware has processed two-octet entities such | |||
| 16-bit integers in one of two ways. So-called "big-endian" hardware handles | as 16-bit integers in one of two ways. So-called "big-endian" hardware | |||
| two-octet entities with the higher-order octet first, that is at the lower | handles two-octet entities with the higher-order octet first, that is | |||
| address in memory; when written out to disk or to a network interface | at the lower address in memory; when written out to disk or to a | |||
| (serializing), the high-order octet thus appears first in the data stream. | network interface (serializing), the high-order octet thus appears | |||
| On the other hand, "little-endian" hardware handles two-octet entities with | first in the data stream. On the other hand, "little-endian" hardware | |||
| the lower-order octet first. Hardware of both kinds is common today. | handles two-octet entities with the lower-order octet first. Hardware | |||
| | of both kinds is common today. | |||
| For example, the unsigned 16-bit integer that represents the decimal number | For example, the unsigned 16-bit integer that represents the decimal | |||
| 258 is 0x0102. The big-endian serialization of that number is the octet | number 258 is 0x0102. The big-endian serialization of that number is | |||
| 0x01 followed by the octet 0x02. The little-endian serialization of that | the octet 0x01 followed by the octet 0x02. The little-endian | |||
| number is the octet 0x02 followed by the octet 0x01. The following C code | serialization of that number is the octet 0x02 followed by the octet | |||
| fragment demonstrates a way to write 16-bit quantities to a file in | 0x01. The following C code fragment demonstrates a way to write 16-bit | |||
| big-endian order, irrespective of the hardware's native byte order. | quantities to a file in big-endian order, irrespective of the | |||
| | hardware's native byte order. | |||
| void write_be(unsigned short u, FILE* f) /* assume short is 16 bits */ | void write_be(unsigned short u, FILE* f) /* assume short is 16 bits */ | |||
| { | { | |||
| putc(u >> 8, f); /* output high-order byte */ | putc(u >> 8, f); /* output high-order byte */ | |||
| putc(u & 0xFF, f); /* then low-order */ | putc(u & 0xFF, f); /* then low-order */ | |||
| } | } | |||
| The term "network byte order" has been used in many RFCs to indicate | The term "network byte order" has been used in many RFCs to indicate | |||
| big-endian serialization, although that term has yet to be formally | big-endian serialization, although that term has yet to be formally | |||
| defined in a standards-track document. ISO 10646 prefers big-endian | defined in a standards-track document. Although ISO 10646 prefers | |||
| serialization (section 6.3 of [ISO-10646]), but it is nonetheless | big-endian serialization (section 6.3 of [ISO-10646]), it is likely | |||
| considered likely that little-endian order will also be used on the | that little-endian order will also be used on the Internet. | |||
| Internet. | ||||
| 3.2 Byte order mark (BOM) | 3.2 Byte order mark (BOM) | |||
| The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | |||
| NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER | NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE | |||
| MARK" (abbreviated "BOM"). The latter name hints at a second possible usage | ORDER MARK" (abbreviated "BOM"). The latter name hints at a second | |||
| of the character, in addition to its normal use as a genuine "ZERO WIDTH | possible usage of the character, in addition to its normal use as a | |||
| NON-BREAKING SPACE" within text. This usage, suggested by Unicode section | genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage, | |||
| 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character | suggested by Unicode section 2.4 and ISO 10646 Annex F (informative), | |||
| to a stream of Unicode characters as a "signature"; a receiver of such a | is to prepend a 0xFEFF character to a stream of Unicode characters as a | |||
| serialized stream may then use the initial character both as a hint that | "signature"; a receiver of such a serialized stream may then use the | |||
| the stream consists of Unicode characters and as a way to recognize the | initial character both as a hint that the stream consists of Unicode | |||
| serialization order. In serialized UTF-16 prepended with such a signature, | characters and as a way to recognize the serialization order. In | |||
| the order is big-endian if the first two octets are 0xFE followed by 0xFF; | serialized UTF-16 prepended with such a signature, the order is | |||
| if they are 0xFF followed by 0xFE, the order is little-endian. Note that | big-endian if the first two octets are 0xFE followed by 0xFF; if they | |||
| 0xFFFE is not a Unicode character, precisely to preserve the usefulness of | are 0xFF followed by 0xFE, the order is little-endian. Note that 0xFFFE | |||
| | is not a Unicode character, precisely to preserve the usefulness of | |||
| 0xFEFF as a byte-order mark. | 0xFEFF as a byte-order mark. | |||
| It is important to understand that the character 0xFEFF appearing at any | It is important to understand that the character 0xFEFF appearing at | |||
| position other than the beginning of a stream MUST be interpreted with the | any position other than the beginning of a stream MUST be interpreted | |||
| semantics for the zero-width non-breaking space, and MUST NOT be | with the semantics for the zero-width non-breaking space, and MUST NOT | |||
| interpreted as a byte-order mark. The converse of that statement is | be interpreted as a byte-order mark. The converse of that | |||
| not always true: the character 0xFEFF in the first position of a stream MAY | statement is not always true: the character 0xFEFF in the first | |||
| be interpreted as a zero-width non-breaking space, and is not always a | position of a stream MAY be interpreted as a zero-width non-breaking | |||
| byte-order mark. For example, if a process splits a UTF-16 string into | space, and is not always a byte-order mark. For example, if a process | |||
| many parts, a part might begin with 0xFEFF because there was a | splits a UTF-16 string into many parts, a part might begin with 0xFEFF | |||
| zero-width non-breaking space at the beginning of that substring. | because there was a zero-width non-breaking space at the beginning of | |||
| | that substring. | |||
| The Unicode standard further suggests that an initial 0xFEFF character may | The Unicode standard further suggests that an initial 0xFEFF character | |||
| be stripped before processing the text, the rationale being that such a | may be stripped before processing the text, the rationale being that | |||
| character in initial position may be an artifact of the encoding (an | such a character in initial position may be an artifact of the encoding | |||
| encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING | (an encoding signature), not a genuine intended "ZERO WIDTH | |||
| SPACE". Note that such stripping might affect an external process at a | NON-BREAKING SPACE". Note that such stripping might affect an external | |||
| different layer (such as a digital signature or a count of the characters) | process at a different layer (such as a digital signature or a count of | |||
| that is relying on the presence of all characters in the stream. | the characters) that is relying on the presence of all characters in | |||
| | the stream. | |||
| In particular, in UTF-16 plain text it is likely, but not certain, that an | In particular, in UTF-16 plain text it is likely, but not certain, that | |||
| initial 0xFEFF is a signature; when concatenating two strings, it is | an initial 0xFEFF is a signature. When concatenating two strings, it is | |||
| important to strip out those signatures, for otherwise the resulting string | important to strip out those signatures, because otherwise the | |||
| may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the connection | resulting string may contain an unintended "ZERO WIDTH NON-BREAKING | |||
| point. Also, some specifications mandate an initial 0xFEFF character in | SPACE" at the connection point. Also, some specifications mandate an | |||
| objects encoded in UTF-16 and specify that this signature is not part of | initial 0xFEFF character in objects encoded in UTF-16 and specify that | |||
| the object. | this signature is not part of the object. | |||
| 3.3 Choosing a label for UTF-16 text | 3.3 Choosing a label for UTF-16 text | |||
| Any labelling application that uses UTF-16 character encoding, and puts an | Any labelling application that uses UTF-16 character encoding, and | |||
| explicit charset label on the text, and knows the serialization order of | explicitly labels the text, and knows the serialization order of the | |||
| the characters in text, SHOULD label the text as either "UTF-16BE" or | characters in text, SHOULD label the text as either "UTF-16BE" or | |||
| "UTF-16LE", whichever is appropriate based on the endianness of the text. | "UTF-16LE", whichever is appropriate based on the endianness of the | |||
| This allows applications processing the text, but unable to look inside the | text. This allows applications processing the text, but unable to look | |||
| text, to know the serialization definitively. | inside the text, to know the serialization definitively. | |||
| Text in the "UTF-16BE" charset MUST be serialized with the octets which | Text in the "UTF-16BE" charset MUST be serialized with the octets which | |||
| make up a single 16-bit UTF-16 value in big-endian order. Systems labelling | make up a single 16-bit UTF-16 value in big-endian order. Systems | |||
| UTF-16BE text MUST NOT prepend a BOM to the text. | labelling UTF-16BE text MUST NOT prepend a BOM to the text. | |||
| Text in the "UTF-16LE" charset MUST be serialized with the octets which | Text in the "UTF-16LE" charset MUST be serialized with the octets which | |||
| make up a single 16-bit UTF-16 value in little-endian order. Systems | make up a single 16-bit UTF-16 value in little-endian order. Systems | |||
| labelling UTF-16LE text MUST NOT prepend a BOM to the text. | labelling UTF-16LE text MUST NOT prepend a BOM to the text. | |||
| Any labelling application that uses UTF-16 character encoding, and puts an | Any labelling application that uses UTF-16 character encoding, and puts | |||
| explicit charset label on the text, and does not know the serialization | an explicit charset label on the text, and does not know the | |||
| order of the characters in text, MUST label the text as "UTF-16", and | serialization order of the characters in text, MUST label the text as | |||
| SHOULD make sure the text starts with 0xFEFF. | "UTF-16", and SHOULD make sure the text starts with 0xFEFF. | |||
| An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or | An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or | |||
| "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, | "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, | |||
| thereby requiring the use of the "UTF-16" tag only. | thereby requiring the use of the "UTF-16" tag only. | |||
| 4. Interpreting text labels | 4. Interpreting text labels | |||
| When a program sees text labelled as "UTF-16BE", "UTF-16LE", or "UTF-16", | When a program sees text labelled as "UTF-16BE", "UTF-16LE", or | |||
| it can make some assumptions, based on the labelling rules given in the | "UTF-16", it can make some assumptions, based on the labelling rules | |||
| previous section. These assumptions allow the program to then process the | given in the previous section. These assumptions allow the program to | |||
| text. | then process the text. | |||
| 4.1 Interpreting text labelled as UTF-16BE | 4.1 Interpreting text labelled as UTF-16BE | |||
| Text labelled "UTF-16BE" can always be interpreted as always being | Text labelled "UTF-16BE" can always be interpreted as being big-endian. | |||
| big-endian. The detection of an initial BOM does not affect | The detection of an initial BOM does not affect de-serialization of | |||
| de-serialization of text labelled as UTF-16BE. Finding 0xFF followed by | text labelled as UTF-16BE. Finding 0xFF followed by 0xFE is an error | |||
| 0xFE is an error since there is no Unicode character 0xFFFE. | since there is no Unicode character 0xFFFE. | |||
| 4.2 Interpreting text labelled as UTF-16LE | 4.2 Interpreting text labelled as UTF-16LE | |||
| Text labelled "UTF-16LE" can always be interpreted as always being | Text labelled "UTF-16LE" can always be interpreted as being | |||
| little-endian. The detection of an initial BOM does not affect | little-endian. The detection of an initial BOM does not affect | |||
| de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by | de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by | |||
| 0xFF is an error since there is no Unicode character 0xFFFE, which would be | 0xFF is an error since there is no Unicode character 0xFFFE, which | |||
| the interpretation of those octets under little-endian order. | would be the interpretation of those octets under little-endian order. | |||
| 4.3 Interpreting text labelled as UTF-16 | 4.3 Interpreting text labelled as UTF-16 | |||
| Text labelled with the "UTF-16" charset might be serialized in either | Text labelled with the "UTF-16" charset might be serialized in either | |||
| big-endian or little-endian order. If the first two octets of the text are | big-endian or little-endian order. If the first two octets of the text | |||
| 0xFE followed by 0xFF, then the text can be interpreted as being | are 0xFE followed by 0xFF, then the text can be interpreted as being | |||
| big-endian. If the first two octets of the text are 0xFF followed by 0xFE, | big-endian. If the first two octets of the text are 0xFF followed by | |||
| then the text can be interpreted as being little-endian. If the first two | 0xFE, then the text can be interpreted as being little-endian. If the | |||
| octets of the text are not 0xFE followed by 0xFF, and are not 0xFF followed | first two octets of the text are not 0xFE followed by 0xFF, and are not | |||
| by 0xFE, then the text SHOULD be interpreted as being big-endian. | 0xFF followed by 0xFE, then the text SHOULD be interpreted as being | |||
| | big-endian. | |||
| All applications that process text with the "UTF-16" charset label MUST be | All applications that process text with the "UTF-16" charset label MUST | |||
| able to read at least the first two octets of the text and be able to | be able to read at least the first two octets of the text and be able | |||
| process those octets in order to determine the serialization order of the | to process those octets in order to determine the serialization order | |||
| text. Applications that process text with the "UTF-16" charset label MUST | of the text. Applications that process text with the "UTF-16" charset | |||
| NOT assume the serialization without first checking the first two octets to | label MUST NOT assume the serialization without first checking the | |||
| see if they are a big-endian BOM, a little-endian BOM, or not a BOM. | first two octets to see if they are a big-endian BOM, a little-endian | |||
| | BOM, or not a BOM. All applications that process text with the "UTF-16" | |||
| | charset label MUST be able to interpret both big-endian and | |||
| | little-endian text. | |||
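
As an illustration only, the rules of section 4.3 for text labelled "UTF-16" can be sketched in C as follows. The enum, the function name utf16_detect_order, and the bom_len output parameter are assumptions made for this sketch. Whether the caller then skips the two BOM octets is left to the caller, since section 3.2 notes that stripping can affect processes at other layers.

```c
#include <stddef.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Hypothetical helper (not part of the draft): determine the
 * serialization order of text labelled "UTF-16" from its first two
 * octets. *bom_len is set to 2 if a BOM was found, and to 0 otherwise. */
enum utf16_order utf16_detect_order(const unsigned char *p, size_t n,
                                    size_t *bom_len)
{
    *bom_len = 0;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;                 /* big-endian BOM */
        return UTF16_BIG_ENDIAN;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;                 /* little-endian BOM */
        return UTF16_LITTLE_ENDIAN;
    }
    return UTF16_BIG_ENDIAN;          /* no BOM: SHOULD be read as big-endian */
}
```

This helper applies only to the "UTF-16" label. For text labelled "UTF-16BE" or "UTF-16LE", sections 4.1 and 4.2 say that an initial BOM does not affect de-serialization.

| | | |||
|---|---|---|---|---|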
| 5. Examples | 5. Examples | |||
| For the sake of example, let's suppose that there is a hieroglyphic | For the sake of example, let's suppose that there is a hieroglyphic | |||
| character representing the Egyptian god Ra with character value 0x00012345 | character representing the Egyptian god Ra with character value | |||
| (this character does not exist at present in Unicode). | 0x00012345 (this character does not exist at present in Unicode). | |||
| The examples here all evaluate to the phrase: | The examples here all evaluate to the phrase: | |||
| *=Ra | *=Ra | |||
| where the "*" represents the Ra hieroglyph (0x00012345). | where the "*" represents the Ra hieroglyph (0x00012345). | |||
| Text labelled with UTF-16BE, without a BOM: | Text labelled with UTF-16BE, without a BOM: | |||
| D8 48 DF 45 00 3D 00 52 00 61 | D8 08 DF 45 00 3D 00 52 00 61 | |||
| Text labelled with UTF-16LE, without a BOM: | Text labelled with UTF-16LE, without a BOM: | |||
| 48 D8 45 DF 3D 00 52 00 61 00 | 08 D8 45 DF 3D 00 52 00 61 00 | |||
| Big-endian text labelled with UTF-16, with a BOM: | Big-endian text labelled with UTF-16, with a BOM: | |||
| FE FF D8 48 DF 45 00 3D 00 52 00 61 | FE FF D8 08 DF 45 00 3D 00 52 00 61 | |||
| Little-endian text labelled with UTF-16, with a BOM: | Little-endian text labelled with UTF-16, with a BOM: | |||
| FF FE 48 D8 45 DF 3D 00 52 00 61 00 | FF FE 08 D8 45 DF 3D 00 52 00 61 00 | |||
| 6. Versions of the standards | 6. Versions of the standards | |||
| ISO/IEC 10646 is updated from time to time by published amendments; | ISO/IEC 10646 is updated from time to time by published amendments; | |||
| similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, | similarly, different versions of the Unicode standard exist: 1.0, 1.1, | |||
| and 2.1 as of this writing. Each new version replaces the previous one, | 2.0, and 2.1 as of this writing. Each new version replaces the | |||
| but implementations, and more significantly data, are not updated | previous one, but implementations, and more significantly data, are not | |||
| instantly. | updated instantly. | |||
| In general, the changes amount to adding new characters, which does not | In general, the changes amount to adding new characters, which does not | |||
| pose particular problems with old data. Amendment 5 to ISO/IEC 10646, | pose particular problems with old data. Amendment 5 to ISO/IEC 10646, | |||
| however, has moved and expanded the Korean Hangul block, thereby making any | however, has moved and expanded the Korean Hangul block, thereby making | |||
| previous data containing Hangul characters invalid under the new version. | any previous data containing Hangul characters invalid under the new | |||
| Unicode 2.0 has the same difference from Unicode 1.1. The official | version. Unicode 2.0 has the same difference from Unicode 1.1. The | |||
| justification for allowing such an incompatible change was that no | official justification for allowing such an incompatible change was | |||
| significant implementations and data containing Hangul existed, a statement | that no significant implementations and data containing Hangul existed, | |||
| that is likely to be true but remains unprovable. The incident has been | a statement that is likely to be true but remains unprovable. The | |||
| dubbed the "Korean mess", and the relevant committees have pledged to | incident has been dubbed the "Korean mess", and the relevant committees | |||
| never, ever again make such an incompatible change. | have pledged to never, ever again make such an incompatible change. | |||
| New versions, and in particular any incompatible changes, have consequences | New versions, and in particular any incompatible changes, have | |||
| regarding MIME character encoding labels, to be discussed in Appendix A. | consequences regarding MIME character encoding labels, to be discussed | |||
| in Appendix A. | ||||
| 7. Security considerations | 7. Security considerations | |||
| UTF-16 is based on the ISO 10646 character set, which is frequently being | UTF-16 is based on the ISO 10646 character set, which is frequently | |||
| added to, as described in Section 6 and Appendix A of this document. | being added to, as described in Section 6 and Appendix A of this | |||
| Processors must be able to handle characters that are not defined at the | document. Processors must be able to handle characters that are not | |||
| time that the processor was created in such a way as to not allow an | defined at the time that the processor was created in such a way as to | |||
| attacker to harm a recipient by including unknown characters. | not allow an attacker to harm a recipient by including unknown | |||
| characters. | ||||
| Processors that handle any type of text, including text encoded as UTF-16, | Processors that handle any type of text, including text encoded as | |||
| must be vigilant in checking for control characters that might reprogram a | UTF-16, must be vigilant in checking for control characters that might | |||
| display terminal or keyboard. Similarly, processors that interpret text | reprogram a display terminal or keyboard. Similarly, processors that | |||
| entities (such as looking for embedded programming code), must be careful | interpret text entities (such as looking for embedded programming | |||
| not to execute the code without first alerting the recipient. | code), must be careful not to execute the code without first alerting | |||
| the recipient. | ||||
| Text in UTF-16 may contain special characters, such as the OBJECT | Text in UTF-16 may contain special characters, such as the OBJECT | |||
| REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | |||
| depending on the interpretation of the processing program and the | depending on the interpretation of the processing program and the | |||
| availability of an external data stream that would be executed. This | availability of an external data stream that would be executed. This | |||
| external processing may have side-effects that allow the sender of a | external processing may have side-effects that allow the sender of a | |||
| message to attack the receiving system. | message to attack the receiving system. | |||
| Implementors of UTF-16 need to consider the security aspects of how they | Implementors of UTF-16 need to consider the security aspects of how | |||
| handle illegal UTF-16 sequences (that is, sequences involving surrogate | they handle illegal UTF-16 sequences (that is, sequences involving | |||
| pairs that have illegal values or unpaired surrogates). It is conceivable | surrogate pairs that have illegal values or unpaired surrogates). It is | |||
| that in some circumstances an attacker would be able to exploit an | conceivable that in some circumstances an attacker would be able to | |||
| incautious UTF-16 parser by sending it an octet sequence that is not | exploit an incautious UTF-16 parser by sending it an octet sequence | |||
| permitted by the UTF-16 syntax, causing it to behave in some anomalous | that is not permitted by the UTF-16 syntax, causing it to behave in | |||
| fashion. | some anomalous fashion. | |||
| 8. References | 8. References | |||
| [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", | [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and | |||
| BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | |||
| Procedures", BCP 19, RFC 2278, January 1998. | Procedures", BCP 19, RFC 2278, January 1998. | |||
| [HTTP-1.1] Fielding, R., et al., "Hypertext Transfer Protocol -- | ||||
| HTTP/1.1", RFC 2068, January 1997. | ||||
| [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | |||
| technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: | technology -- Universal Multiple-Octet Coded Character Set (UCS) -- | |||
| Architecture and Basic Multilingual Plane. Twelve amendments and two | Part 1: Architecture and Basic Multilingual Plane. Twelve amendments | |||
| technical corrigenda have been published up to now. UTF-16 is described in | and two technical corrigenda have been published up to now. UTF-16 is | |||
| Annex Q, published as Amendment 1. Many other amendments are currently at | described in Annex Q, published as Amendment 1. Many other amendments | |||
| various stages of standardization. | are currently at various stages of standardization. | |||
| [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", | [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version | |||
| Unicode Technical Report #8. | 2.1", Unicode Technical Report #8. | |||
| [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | |||
| 2279, January 1998. | 2279, January 1998. | |||
| [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set Workshop", | [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set | |||
| RFC 2130, April 1997. | Workshop", RFC 2130, April 1997. | |||
| 9. Acknowledgments | 9. Acknowledgments | |||
| Deborah Goldsmith wrote a great deal of the initial wording for this | Deborah Goldsmith wrote a great deal of the initial wording for this | |||
| specification. Martin Duerst gave numerous significant changes. Other | specification. Martin Duerst proposed numerous significant changes. | |||
| significant contributors include: | Other significant contributors include: | |||
| Mati Allouche | Mati Allouche | |||
| Walt Daniels | Walt Daniels | |||
| Mark Davis | Mark Davis | |||
| Ned Freed | Ned Freed | |||
| Asmus Freytag | Asmus Freytag | |||
| Lloyd Honomichl | Lloyd Honomichl | |||
| Dan Kegel | Dan Kegel | |||
| Murata Makoto | Murata Makoto | |||
| Larry Masinter | Larry Masinter | |||
| Markus Scherer | ||||
| Ken Whistler | Ken Whistler | |||
| Some of the text in this specification was copied from [UTF-8], and that | Some of the text in this specification was copied from [UTF-8], and | |||
| document was worked on by many people. Please see the acknowledgments | that document was worked on by many people. Please see the | |||
| section in that document for more people who may have contributed | acknowledgments section in that document for more people who may have | |||
| indirectly to this document. | contributed indirectly to this document. | |||
| 10. Authors' address | ||||
| Paul Hoffman | ||||
| Internet Mail Consortium | ||||
| 127 Segre Place | ||||
| Santa Cruz, CA 95060 USA | ||||
| phoffman@imc.org | ||||
| Francois Yergeau | ||||
| Alis Technologies | ||||
| 100, boul. Alexis-Nihon, Suite 600 | ||||
| Montreal QC H4M 2P2 Canada | ||||
| fyergeau@alis.com | ||||
| 11. Changes between draft -01 and -02 | ||||
| Fixed some spelling mistakes throughout. | ||||
| Updated the status boilerplate. | ||||
| Clarified the parameter values in 1. | ||||
| Added [WORKSHOP] reference in 1.1 and 8. Also fuzzified the description of | ||||
| what UTF-16 is (instead of getting into hair-splitting on CESs, CCSs, and | ||||
| so on). | ||||
| Corrected 1.2 on the characters for which UTF-8 incurs a space penalty. | ||||
| Added "from ISO 10646 to UTF-16" to the beginning of 2.1. | ||||
| Added "from UTF-16 to ISO 10646" to the beginning of 2.2. | ||||
| Added text to the end of the note at the end of 2.2 about possibly emitting | ||||
| the ill-formed characters when decoding. | ||||
| Rearranged much of sections 3 and 4. This makes the following changes | ||||
| hard to follow; the references refer to the *old* section numbers, | ||||
| not necessarily the ones as they exist in this draft. Sorry about that... | ||||
| Changed the end of the first paragraph of 3.1 to get out of the | ||||
| which-endian-has-most debate. | ||||
| Clarified the fourth paragraph of 3.1 (the one that begins | ||||
| "This specification thus...") about the use of "UTF-16" as both a | ||||
| sequencing mechanism and a charset label. | ||||
| Added Martin Duerst's C code fragment for big-endian order. | 10. Changes between draft -02 and -03 | |||
| Added the sentence to the end of the sixth paragraph of 3.1 (the one | 1: Reorganized the sections. Added information about two octets being | |||
| that begins "It is important...") with the example of substrings and | enough for all current characters and the committees saying they will | |||
| ZWNBSs. | not go beyond what can be defined in UTF-16. | |||
| Added text about SHOULD NOT put an initial BOM in both 3.2 and 3.3. | 2.1: Reworded step 2 with words to make it easier to read. | |||
| Clarified the last clause in section 3.3. | 2.2: Added "U" to step 1. Also added note to the end of the last | |||
| paragraph about string decoding and errors. | ||||
| Removed the last paragraph of 4 (the paragraph that used to start | 3: Added a reference to section 4 about interpreting labels. | |||
| "Because creating text labelled...") because it related to text-creating | ||||
| programs instead of text-labelling programs. | ||||
| Rearranged and relabelled some of the examples in 5. | 3.1: Reworded last sentence in last paragraph. | |||
| Removed "obsoletes" from the first paragraph of 6. Slightly fuzzified | 4.3: Added requirement that apps that can read UTF-16 must be able to | |||
| the "no implementations" sentence in the second paragraph. | interpret both big-endian and little-endian. | |||
| Alphabetized the references in 8. | 5: Corrected the examples due to wrong encoding. | |||
| Added Larry Masinter to section 9. Gave Martin Duerst more credit. | 11: Moved author's addresses to Appendix B. | |||
| A. Charset registrations | A. Charset registrations | |||
| This memo is meant to serve as the basis for registration of three MIME | This memo is meant to serve as the basis for registration of three MIME | |||
| charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", | charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", | |||
| and "UTF-16". These strings label objects containing text consisting of | "UTF-16LE", and "UTF-16". These strings label objects containing text | |||
| characters from the repertoire of ISO/IEC 10646 including all amendments at | consisting of characters from the repertoire of ISO/IEC 10646 including | |||
| least up to amendment 5 (Korean block), encoded to a sequence of octets | all amendments at least up to amendment 5 (Korean block), encoded to a | |||
| using the encoding and serialization schemes outlined above. | sequence of octets using the encoding and serialization schemes | |||
| outlined above. | ||||
| Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in | Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use | |||
| media types under the "text" top-level type, because they do not encode | in media types under the "text" top-level type, because they do not | |||
| line endings in the way required for MIME "text" media types. | encode line endings in the way required for MIME "text" media types. An | |||
| exception to this is HTTP, which uses a MIME-like mechanism, but is | ||||
| exempt from the restrictions on the text top-level type (see section | ||||
| 19.4.1 of HTTP 1.1 [HTTP-1.1]). | ||||
| It is noteworthy that the labels described here do not contain a version | It is noteworthy that the labels described here do not contain a | |||
| identification, referring generically to ISO/IEC 10646. This is | version identification, referring generically to ISO/IEC 10646. This is | |||
| intentional, the rationale being as follows: | intentional, the rationale being as follows: | |||
| A MIME charset is designed to give just the information needed to interpret | A MIME charset is designed to give just the information needed to | |||
| a sequence of bytes received on the wire into a sequence of characters, | interpret a sequence of bytes received on the wire into a sequence of | |||
| nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character | characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As | |||
| set standard does not change incompatibly, version numbers serve no | long as a character set standard does not change incompatibly, version | |||
| purpose, because one gains nothing by learning from the tag that newly | numbers serve no purpose, because one gains nothing by learning from | |||
| assigned characters may be received that one doesn't know about. The tag | the tag that newly assigned characters may be received that one doesn't | |||
| itself doesn't teach anything about the new characters, which are going to | know about. The tag itself doesn't teach anything about the new | |||
| be received anyway. | characters, which are going to be received anyway. | |||
| Hence, as long as the standards evolve compatibly, the apparent advantage | Hence, as long as the standards evolve compatibly, the apparent | |||
| of having labels that identify the versions is only that, apparent. But | advantage of having labels that identify the versions is only that, | |||
| there is a disadvantage to such version-dependent labels: when an older | apparent. But there is a disadvantage to such version-dependent | |||
| application receives data accompanied by a newer, unknown label, it may | labels: when an older application receives data accompanied by a newer, | |||
| fail to recognize the label and be completely unable to deal with the data, | unknown label, it may fail to recognize the label and be completely | |||
| whereas a generic, known label would have triggered mostly correct | unable to deal with the data, whereas a generic, known label would have | |||
| processing of the data, which may well not contain any new characters. | triggered mostly correct processing of the data, which may well not | |||
| contain any new characters. | ||||
| The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in | The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible | |||
| principle contradicting the appropriateness of a version independent MIME | change, in principle contradicting the appropriateness of a version | |||
| charset as described above. But the compatibility problem can only appear | independent MIME charset as described above. But the compatibility | |||
| with data containing Korean Hangul characters encoded according to Unicode | problem can only appear with data containing Korean Hangul characters | |||
| 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is | encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646 before | |||
| arguably no such data to worry about, this being the very reason the | amendment 5), and there is arguably no such data to worry about, this | |||
| incompatible change was deemed acceptable. | being the very reason the incompatible change was deemed acceptable. | |||
| In practice, then, a version-independent label is warranted, provided the | In practice, then, a version-independent label is warranted, provided | |||
| label is understood to refer to all versions after Amendment 5, and | the label is understood to refer to all versions after Amendment 5, and | |||
| provided no incompatible change actually occurs. Should incompatible | provided no incompatible change actually occurs. Should incompatible | |||
| changes occur in a later version of ISO/IEC 10646, the MIME charsets | changes occur in a later version of ISO/IEC 10646, the MIME charsets | |||
| defined here will stay aligned with the previous version until and unless | defined here will stay aligned with the previous version until and | |||
| the IETF specifically decides otherwise. | unless the IETF specifically decides otherwise. | |||
| A.1 Registration for UTF-16BE | A.1 Registration for UTF-16BE | |||
| To: ietf-charsets@iana.org | To: ietf-charsets@iana.org | |||
| Subject: Registration of new charset | Subject: Registration of new charset | |||
| Charset name(s): UTF-16BE | Charset name(s): UTF-16BE | |||
| Published specification(s): This specification | Published specification(s): This specification | |||
| skipping to change at line 594 | skipping to change at line 572 | |||
| Charset name(s): UTF-16 | Charset name(s): UTF-16 | |||
| Published specification(s): This specification | Published specification(s): This specification | |||
| Suitable for use in MIME content types under the | Suitable for use in MIME content types under the | |||
| "text" top-level type: No | "text" top-level type: No | |||
| Person & email address to contact for further information: | Person & email address to contact for further information: | |||
| Paul Hoffman <phoffman@imc.org> | Paul Hoffman <phoffman@imc.org> | |||
| Francois Yergeau <fyergeau@alis.com> | Francois Yergeau <fyergeau@alis.com> | |||
| B. Authors' address | ||||
| Paul Hoffman | ||||
| Internet Mail Consortium | ||||
| 127 Segre Place | ||||
| Santa Cruz, CA 95060 USA | ||||
| phoffman@imc.org | ||||
| Francois Yergeau | ||||
| Alis Technologies | ||||
| 100, boul. Alexis-Nihon, Suite 600 | ||||
| Montreal QC H4M 2P2 Canada | ||||
| fyergeau@alis.com | ||||
| End of changes. 79 change blocks. | ||||
| 325 lines changed or deleted | 303 lines changed or added | |||
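
The reading rules for text labelled "UTF-16" (first two octets checked for a big-endian BOM, a little-endian BOM, or no BOM, with big-endian as the fallback) can be sketched in Python and checked against the -03 examples in section 5. This is only an illustration under stated assumptions: the names `detect_byte_order` and `decode_utf16` are hypothetical and do not appear in the draft, and raising `ValueError` on unpaired surrogates is one possible policy, not something the draft mandates.

```python
def detect_byte_order(octets):
    """Return (byte_order, bom_length) for text labelled "UTF-16".

    byte_order is "big" or "little"; bom_length is the number of
    leading octets to skip (2 if a BOM was found, otherwise 0).
    """
    if octets[:2] == b"\xfe\xff":
        return "big", 2       # big-endian BOM
    if octets[:2] == b"\xff\xfe":
        return "little", 2    # little-endian BOM
    return "big", 0           # no BOM: SHOULD be read as big-endian


def decode_utf16(octets):
    """Decode text labelled "UTF-16" into a list of character values."""
    order, skip = detect_byte_order(octets)
    body = octets[skip:]
    if len(body) % 2:
        raise ValueError("odd number of octets")
    # Assemble 16-bit code units in the detected serialization order.
    units = [int.from_bytes(body[i:i + 2], order)
             for i in range(0, len(body), 2)]
    chars = []
    i = 0
    while i < len(units):
        w1 = units[i]
        if 0xD800 <= w1 <= 0xDBFF:
            # High surrogate: a low surrogate must follow.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")  # assumed policy
            w2 = units[i + 1]
            chars.append(0x10000 + ((w1 & 0x3FF) << 10) + (w2 & 0x3FF))
            i += 2
        elif 0xDC00 <= w1 <= 0xDFFF:
            raise ValueError("unpaired low surrogate")       # assumed policy
        else:
            chars.append(w1)
            i += 1
    return chars


# The little-endian example with a BOM from section 5 (-03 values).
sample = bytes.fromhex("FF FE 08 D8 45 DF 3D 00 52 00 61 00")
print([hex(c) for c in decode_utf16(sample)])
```

With the -03 octets, all three "UTF-16"-readable examples (big-endian with BOM, little-endian with BOM, and big-endian without BOM) decode to the same four values: 0x12345, 0x3D, 0x52, 0x61, that is, "*=Ra".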
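
The correction to the section 5 examples (D8 48 in -02, D8 08 in -03) can be reproduced with a short encoder. The sketch below is in Python; the names `encode_code_units` and `serialize` are hypothetical, and the remark about the -02 value is an observation from the arithmetic, not an explanation given in the draft.

```python
def encode_code_units(u):
    """Encode one ISO 10646 character value as UTF-16 code units."""
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if u < 0x10000:
        return [u]
    if u > 0x10FFFF:
        raise ValueError("value cannot be represented in UTF-16")
    u2 = u - 0x10000                  # 20-bit value split across the pair
    return [0xD800 | (u2 >> 10),      # high surrogate: top 10 bits
            0xDC00 | (u2 & 0x3FF)]    # low surrogate: bottom 10 bits


def serialize(chars, order="big", bom=False):
    """Serialize character values as octets, optionally led by a BOM."""
    units = [0xFEFF] if bom else []
    for u in chars:
        units.extend(encode_code_units(u))
    return b"".join(unit.to_bytes(2, order) for unit in units)


text = [0x12345, 0x3D, 0x52, 0x61]    # "*=Ra" from section 5
print(serialize(text, "big").hex(" ").upper())
print(serialize(text, "little", bom=True).hex(" ").upper())
```

For 0x12345, subtracting 0x10000 leaves 0x2345, whose top ten bits are 0x008, giving the high surrogate 0xD808 used in -03. Shifting 0x12345 itself by ten bits yields 0x048 and hence 0xD848, which matches the -02 octets, so a skipped subtraction is one plausible source of the earlier values.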