| < draft-hoffman-utf16-01.txt | draft-hoffman-utf16-02.txt > | |||
|---|---|---|---|---|
| Internet Draft Paul Hoffman | Internet Draft Paul Hoffman | |||
| <draft-hoffman-utf16-01.txt> Internet Mail Consortium | <draft-hoffman-utf16-02.txt> Internet Mail Consortium | |||
| December 13, 1998 Francois Yergeau | February 10, 1999 Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| UTF-16, an encoding of ISO 10646 | UTF-16, an encoding of ISO 10646 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft. Internet-Drafts are working documents | This document is an Internet-Draft and is in full conformance with all | |||
| of the Internet Engineering Task Force (IETF), its areas, and its working | provisions of Section 10 of RFC2026. | |||
| groups. Note that other groups may also distribute working documents as | ||||
| Internet-Drafts. | ||||
| Internet-Drafts are draft documents valid for a maximum of six months. | Internet-Drafts are working documents of the Internet Engineering Task | |||
| Internet-Drafts may be updated, replaced, or obsoleted by other documents | Force (IETF), its areas, and its working groups. Note that other groups | |||
| at any time. It is not appropriate to use Internet-Drafts as reference | may also distribute working documents as Internet-Drafts. | |||
| material or to cite them other than as a "working draft" or "work in | ||||
| progress". | ||||
| To view the entire list of current Internet-Drafts, please check the | Internet-Drafts are draft documents valid for a maximum of six months and | |||
| "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow | may be updated, replaced, or obsoleted by other documents at any time. It | |||
| Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe), | is inappropriate to use Internet-Drafts as reference material or to cite | |||
| ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim), | them other than as "work in progress." | |||
| ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). | ||||
| Copyright (C) The Internet Society (1998). All Rights Reserved. | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt | ||||
| The list of Internet-Draft Shadow Directories can be accessed at | ||||
| http://www.ietf.org/shadow.html. | ||||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | ||||
| 1. Introduction | 1. Introduction | |||
| This document specifies the UTF-16 encoding of Unicode/ISO-10646 and | This document describes the UTF-16 encoding of Unicode/ISO-10646 and | |||
| contains the registration for three MIME charset parameter values: | contains the registration for three MIME charset parameter values: | |||
| UTF-16BE, UTF-16LE, and UTF-16. | UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16. | |||
| 1.1 Background | 1.1 Background | |||
| The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | |||
| define a coded character set (CCS), hereafter referred to as Unicode, which | define a coded character set (CCS), hereafter referred to as Unicode, which | |||
| encompasses most of the world's writing systems. UTF-16, the object of this | encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the | |||
| specification, is a character encoding scheme (CES) of Unicode that has the | object of this specification, is a way to encode Unicode characters that | |||
| characteristics of encoding the vast majority of currently-defined | has the characteristics of encoding the vast majority of currently-defined | |||
| characters in exactly two octets and of being able to encode all other | characters in exactly two octets and of being able to encode all other | |||
| characters that will be defined in exactly four octets. | characters that will be defined in exactly four octets. | |||
| The Unicode Standard further defines additional character properties and | The Unicode Standard further defines additional character properties and | |||
| other application details of great interest to implementors. Up to the | other application details of great interest to implementors. Up to the | |||
| present time, changes in Unicode and amendments to ISO/IEC 10646 have | present time, changes in Unicode and amendments to ISO/IEC 10646 have | |||
| tracked each other, so that the character repertoires and code point | tracked each other, so that the character repertoires and code point | |||
| assignments have remained in sync. The relevant standardization committees | assignments have remained in sync. The relevant standardization committees | |||
| have committed to maintain this very useful synchronism. | have committed to maintain this very useful synchronism. | |||
| 1.2 Motivation | 1.2 Motivation | |||
| The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF | The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF | |||
| policy on character sets and languages, [CHARPOLICY], says that IETF | policy on character sets and languages, [CHARPOLICY], says that IETF | |||
| protocols MUST be able to use the UTF-8 charset. However, relative to | protocols MUST be able to use the UTF-8 charset. However, relative to | |||
| UTF-16, UTF-8 imposes a space penalty for characters whose values are | UTF-16, UTF-8 imposes a space penalty for characters whose values are | |||
| greater than 0x0800. Also, characters represented in UTF-8 have varying | between 0x0800 and 0xFFFF. Also, characters represented in UTF-8 have varying | |||
| sizes. Using UTF-16 provides a way to transmit character data that is | sizes. Using UTF-16 provides a way to transmit character data that is | |||
| mostly uniform in size. Some products and network standards already specify | mostly uniform in size. Some products and network standards already specify | |||
| UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in | UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in | |||
| many protocols, such as the direct encoding of US-ASCII characters and | many protocols, such as the direct encoding of US-ASCII characters and | |||
| re-synchronization after loss of octets.) | re-synchronization after loss of octets.) | |||
| UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as | UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as | |||
| a sequence of 16-bit quantities. This document addresses the issues of | a sequence of 16-bit quantities. This document addresses the issues of | |||
| serializing UTF-16 as an octet stream for transmission over the Internet | serializing UTF-16 as an octet stream for transmission over the Internet | |||
| and of MIME charset naming as described in [CHARSET-REG]. | and of MIME charset naming as described in [CHARSET-REG]. | |||
| skipping to change at line 107 ¶ | skipping to change at line 107 ¶ | |||
| 16-bit integer with a value between 0xD800 and 0xDBFF (within the | 16-bit integer with a value between 0xD800 and 0xDBFF (within the | |||
| so-called high-half zone or high surrogate area) followed by a 16-bit | so-called high-half zone or high surrogate area) followed by a 16-bit | |||
| integer with a value between 0xDC00 and 0xDFFF (within the so-called | integer with a value between 0xDC00 and 0xDFFF (within the so-called | |||
| low-half zone or low surrogate area). | low-half zone or low surrogate area). | |||
| - Characters with values greater than 0x10FFFF cannot be encoded in | - Characters with values greater than 0x10FFFF cannot be encoded in | |||
| UTF-16. | UTF-16. | |||
| 2.1 Encoding UTF-16 | 2.1 Encoding UTF-16 | |||
| Encoding of a single character proceeds as follows. Let U be the character | Encoding of a single character from an ISO 10646 character value to UTF-16 | |||
| number, no greater than 0x10FFFF. | proceeds as follows. Let U be the character number, no greater than | |||
| 0x10FFFF. | ||||
| 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | |||
| 2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF, | 2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF, | |||
| that is, U' can be represented in 20 bits. | that is, U' can be represented in 20 bits. | |||
| 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and | 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and | |||
| 0xDC00, respectively. These integers each have 10 bits free to encode the | 0xDC00, respectively. These integers each have 10 bits free to encode the | |||
| character value, for a total of 20 bits. | character value, for a total of 20 bits. | |||
| skipping to change at line 130 ¶ | skipping to change at line 131 ¶ | |||
| of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. | of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. | |||
| Terminate. | Terminate. | |||
| Graphically, steps 2 through 4 look like: | Graphically, steps 2 through 4 look like: | |||
| U' = yyyyyyyyyyxxxxxxxxxx | U' = yyyyyyyyyyxxxxxxxxxx | |||
| W1 = 110110yyyyyyyyyy | W1 = 110110yyyyyyyyyy | |||
| W2 = 110111xxxxxxxxxx | W2 = 110111xxxxxxxxxx | |||
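
The encoding steps of section 2.1 can be sketched in C as follows. This is a minimal illustration, not text from either draft: the function name utf16_encode and its return convention are assumptions made for the example, and the sketch does not reject input values in the surrogate range 0xD800 to 0xDFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the draft): encode character value u
 * into out[0..1] following steps 1 through 4 of section 2.1.
 * Returns the number of 16-bit integers written (1 or 2), or 0 when u
 * is greater than 0x10FFFF and so cannot be encoded in UTF-16. */
static size_t utf16_encode(uint32_t u, uint16_t out[2])
{
    assert(out != NULL);

    if (u > 0x10FFFF)
        return 0;                       /* not encodable in UTF-16 */

    if (u < 0x10000) {                  /* step 1: single 16-bit integer */
        out[0] = (uint16_t)u;
        return 1;
    }

    uint32_t u_prime = u - 0x10000;     /* step 2: U' fits in 20 bits */

    /* steps 3 and 4: start from 0xD800 and 0xDC00, then fill in the
     * 10 high-order bits of U' (W1) and the 10 low-order bits (W2) */
    out[0] = (uint16_t)(0xD800 | (u_prime >> 10));
    out[1] = (uint16_t)(0xDC00 | (u_prime & 0x3FF));
    return 2;
}
```

Under this sketch, the character value 0x12345 gives U' = 0x02345, so W1 = 0xD808 and W2 = 0xDF45.
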
| 2.2 Decoding UTF-16 | 2.2 Decoding UTF-16 | |||
| Decoding of a single character proceeds as follows. Let W1 be the next | Decoding of a single character from UTF-16 to an ISO 10646 character value | |||
| 16-bit integer in the sequence of integers representing the text. Let W2 be | proceeds as follows. Let W1 be the next 16-bit integer in the sequence of | |||
| the (eventual) next integer following W1. | integers representing the text. Let W2 be the (eventual) next integer | |||
| following W1. | ||||
| 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. | 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. | |||
| Terminate. | Terminate. | |||
| 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in | 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in | |||
| error and no valid character can be obtained using W1. Terminate. | error and no valid character can be obtained using W1. Terminate. | |||
| 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not | 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not | |||
| between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. | between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. | |||
| 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of | 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of | |||
| W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 | W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 | |||
| low-order bits. | low-order bits. | |||
| 5) Add 0x10000 to U' to obtain the character value U. Terminate. | 5) Add 0x10000 to U' to obtain the character value U. Terminate. | |||
| Note that steps 2 and 3 indicate errors. Error recovery is not specified by | Note that steps 2 and 3 indicate errors. Error recovery is not specified by | |||
| this document. | this document. When terminating with an error in steps 2 and 3, it may be | |||
| wise to set U to the value of W1 to help the caller diagnose the error and | ||||
| not lose information. | ||||
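
The decoding steps of section 2.2 can be sketched in C in the same way. This is an illustrative sketch under stated assumptions: the name utf16_decode, the length parameter n, and the "return 0 on error" convention are not from the draft. On error it leaves W1 in the output value, as the -02 text suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the draft): decode one character from
 * the 16-bit integers w[0..n-1], following steps 1 through 5 of
 * section 2.2. On success, stores the character value in *u and
 * returns the number of integers consumed (1 or 2). On error, returns
 * 0 and leaves the value of W1 in *u to help the caller diagnose it. */
static size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    assert(w != NULL && u != NULL && n > 0);

    uint16_t w1 = w[0];
    *u = w1;

    if (w1 < 0xD800 || w1 > 0xDFFF)     /* step 1: not a surrogate */
        return 1;

    if (w1 > 0xDBFF)                    /* step 2: W1 is a low surrogate */
        return 0;

    /* step 3: W2 is missing or is not in 0xDC00..0xDFFF */
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;

    /* step 4: low 10 bits of W1 become the high 10 bits of U' */
    uint32_t u_prime = ((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w[1] & 0x3FF);

    *u = u_prime + 0x10000;             /* step 5 */
    return 2;
}
```

A caller would advance through the text by the returned count; how to resume after a return of 0 is a recovery policy that this document does not specify.
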
| 3. Serialization of characters | 3. Labelling UTF-16 text | |||
| This specification contains registration for three MIME charsets: | ||||
| "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the | ||||
| combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and the | ||||
| CES is the same in all three cases, except for the serialization order of | ||||
| the octets in each character, and the external determination of which | ||||
| serialization is used. | ||||
| This section describes which of the three labels to apply to a stream of text. | ||||
| 3.1 Definition of big-endian and little-endian | 3.1 Definition of big-endian and little-endian | |||
| Historically, computer hardware has processed two-octet entities such as | Historically, computer hardware has processed two-octet entities such as | |||
| 16-bit integers in one of two ways. So-called "big-endian" hardware handles | 16-bit integers in one of two ways. So-called "big-endian" hardware handles | |||
| two-octet entities with the higher-order octet first, that is at the lower | two-octet entities with the higher-order octet first, that is at the lower | |||
| address in memory; when written out to disk or to a network interface | address in memory; when written out to disk or to a network interface | |||
| (serializing), the high-order octet thus appears first in the data stream. | (serializing), the high-order octet thus appears first in the data stream. | |||
| "Little-endian" hardware handles two-octet entities with the lower-order | On the other hand, "Little-endian" hardware handles two-octet entities with | |||
| octet first. Most modern hardware is little-endian, but there are many | the lower-order octet first. Hardware of both kinds is common today. | |||
| current examples of big-endian hardware. | ||||
| For example, the unsigned 16-bit integer that represents the decimal number | For example, the unsigned 16-bit integer that represents the decimal number | |||
| 258 is 0x0102. The big-endian serialization of that number is the octet | 258 is 0x0102. The big-endian serialization of that number is the octet | |||
| 0x01 followed by the octet 0x02. The little-endian serialization of that | 0x01 followed by the octet 0x02. The little-endian serialization of that | |||
| number is the octet 0x02 followed by the octet 0x01. | number is the octet 0x02 followed by the octet 0x01. The following C code | |||
| fragment demonstrates a way to write 16-bit quantities to a file in | ||||
| big-endian order, irrespective of the hardware's native byte order. | ||||
| void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */ | ||||
| { | ||||
| putc(u >> 8, f); /* output high-order byte */ | ||||
| putc(u & 0xFF, f); /* then low-order */ | ||||
| } | ||||
| The term "network byte order" has been used in many RFCs to indicate | The term "network byte order" has been used in many RFCs to indicate | |||
| big-endian serialization, although that term has never been formally | big-endian serialization, although that term has yet to be formally | |||
| defined in a standards-track document. ISO 10646 prefers big-endian | defined in a standards-track document. ISO 10646 prefers big-endian | |||
| serialization (section 6.3 of [ISO-10646]), but it is nonetheless | serialization (section 6.3 of [ISO-10646]), but it is nonetheless | |||
| considered likely that little-endian order will also be used on the | considered likely that little-endian order will also be used on the | |||
| Internet. | Internet. | |||
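
As a companion to the definitions in section 3.1, the following sketch serializes one 16-bit integer into a two-octet buffer in either order, independent of the byte order of the machine running it. The names put_be16 and put_le16 are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: big-endian serialization, high-order octet first. */
static void put_be16(uint16_t u, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(u >> 8);
    out[1] = (unsigned char)(u & 0xFF);
}

/* Hypothetical helper: little-endian serialization, low-order octet first. */
static void put_le16(uint16_t u, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(u & 0xFF);
    out[1] = (unsigned char)(u >> 8);
}
```

For the value 0x0102 (decimal 258), put_be16 produces the octets 0x01 0x02 and put_le16 produces 0x02 0x01, matching the example in the text.
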
| This specification thus contains registration for three charsets: | 3.2 Byte order mark (BOM) | |||
| "UTF-16BE", "UTF-16LE", and "UTF-16". The character encoding schemes these | ||||
| charsets use are identical except for the serialization order of the octets | ||||
| in each character, and the external determination of which serialization is | ||||
| used. | ||||
| The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | |||
| NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER | NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER | |||
| MARK" (abbreviated "BOM"). The latter name hints at a second possible usage | MARK" (abbreviated "BOM"). The latter name hints at a second possible usage | |||
| of the character, in addition to its normal use as a genuine "ZERO WIDTH | of the character, in addition to its normal use as a genuine "ZERO WIDTH | |||
| NON-BREAKING SPACE" within text. This usage, suggested by Unicode section | NON-BREAKING SPACE" within text. This usage, suggested by Unicode section | |||
| 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character | 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character | |||
| to a stream of Unicode characters as a "signature"; a receiver of such a | to a stream of Unicode characters as a "signature"; a receiver of such a | |||
| serialized stream may then use the initial character both as a hint that | serialized stream may then use the initial character both as a hint that | |||
| the stream consists of Unicode characters and as a way to recognize the | the stream consists of Unicode characters and as a way to recognize the | |||
| skipping to change at line 204 ¶ | skipping to change at line 220 ¶ | |||
| if they are 0xFF followed by 0xFE, the order is little-endian. Note that | if they are 0xFF followed by 0xFE, the order is little-endian. Note that | |||
| 0xFFFE is not a Unicode character, precisely to preserve the usefulness of | 0xFFFE is not a Unicode character, precisely to preserve the usefulness of | |||
| 0xFEFF as a byte-order mark. | 0xFEFF as a byte-order mark. | |||
| It is important to understand that the character 0xFEFF appearing at any | It is important to understand that the character 0xFEFF appearing at any | |||
| position other than the beginning of a stream MUST be interpreted with the | position other than the beginning of a stream MUST be interpreted with the | |||
| semantics for the zero-width non-breaking space, and MUST NOT be | semantics for the zero-width non-breaking space, and MUST NOT be | |||
| interpreted as a byte-order mark. The contrapositive of that statement is | interpreted as a byte-order mark. The contrapositive of that statement is | |||
| not always true: the character 0xFEFF in the first position of a stream MAY | not always true: the character 0xFEFF in the first position of a stream MAY | |||
| be interpreted as a zero-width non-breaking space, and is not always a | be interpreted as a zero-width non-breaking space, and is not always a | |||
| byte-order mark. | byte-order mark. For example, if a process splits a UTF-16 string into | |||
| many parts, a part might begin with 0xFEFF because there was a | ||||
| zero-width non-breaking space at the beginning of that substring. | ||||
| The Unicode standard further suggests that an initial 0xFEFF character may | The Unicode standard further suggests that an initial 0xFEFF character may | |||
| be stripped before processing the text, the rationale being that such a | be stripped before processing the text, the rationale being that such a | |||
| character in initial position may be an artifact of the encoding (an | character in initial position may be an artifact of the encoding (an | |||
| encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING | encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING | |||
| SPACE". Nevertheless, such stripping MUST NOT take place before any | SPACE". Note that such stripping might affect an external process at a | |||
| MIME-related operations (such as hash algorithms, digest, or byte-count | different layer (such as a digital signature or a count of the characters) | |||
| computations) have been completed. Such operations depend on the exact | that is relying on the presence of all characters in the stream. | |||
| bytes of the data, which therefore may not be modified in any way. After | ||||
| all MIME-related operations have been completed (for instance after a MIME | ||||
| processor has handed an entity to a specific media type processor), an | ||||
| initial 0xFEFF MAY be removed if appropriate, although this will prevent | ||||
| later comparison with the original MIME object. In particular, in UTF-16 | ||||
| plain text it is likely that an initial 0xFEFF is a signature; when | ||||
| concatenating two strings, it is important to strip out those signatures, | ||||
| for otherwise the resulting string may contain an unintended "ZERO WIDTH | ||||
| NON-BREAKING SPACE" at the connection point. Also, some specifications | ||||
| mandate an initial 0xFEFF character in objects encoded in UTF-16 and | ||||
| specify that this signature is not part of the object. | ||||
| 3.2 Serialization in UTF-16BE | In particular, in UTF-16 plain text it is likely, but not certain, that an | |||
| initial 0xFEFF is a signature; when concatenating two strings, it is | ||||
| important to strip out those signatures, for otherwise the resulting string | ||||
| may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the connection | ||||
| point. Also, some specifications mandate an initial 0xFEFF character in | ||||
| objects encoded in UTF-16 and specify that this signature is not part of | ||||
| the object. | ||||
| Text in the "UTF-16BE" charset MUST be serialized with the octets which | 3.3 Choosing a label for UTF-16 text | |||
| make up a single 16-bit UTF-16 value in big-endian order. The detection of | ||||
| an initial BOM does not affect de-serialization of text labelled as | ||||
| UTF-16BE. Finding 0xFF followed by 0xFE is an error since there is no | ||||
| Unicode character 0xFFFE. | ||||
| 3.3 Serialization in UTF-16LE | Any labelling application that uses UTF-16 character encoding, and puts an | |||
| explicit charset label on the text, and knows the serialization order of | ||||
| the characters in text, SHOULD label the text as either "UTF-16BE" or | ||||
| "UTF-16LE", whichever is appropriate based on the endianness of the text. | ||||
| This allows applications processing the text, but unable to look inside the | ||||
| text, to know the serialization definitively. | ||||
| Text in the "UTF-16BE" charset MUST be serialized with the octets which | ||||
| make up a single 16-bit UTF-16 value in big-endian order. Systems labelling | ||||
| UTF-16BE text MUST NOT prepend a BOM to the text. | ||||
| Text in the "UTF-16LE" charset MUST be serialized with the octets which | Text in the "UTF-16LE" charset MUST be serialized with the octets which | |||
| make up a single 16-bit UTF-16 value in little-endian order. The detection | make up a single 16-bit UTF-16 value in little-endian order. Systems | |||
| of an initial BOM does not affect de-serialization of text labelled as | labelling UTF-16LE text MUST NOT prepend a BOM to the text. | |||
| UTF-16LE. Finding 0xFE followed by 0xFF is an error since there is no Unicode | ||||
| character 0xFFFE, which is the interpretation of the 0xFEFF character under | ||||
| little-endian order. | ||||
| 3.4 Serialization in UTF-16 | Any labelling application that uses UTF-16 character encoding, and puts an | |||
| explicit charset label on the text, and does not know the serialization | ||||
| order of the characters in text, MUST label the text as "UTF-16", and | ||||
| SHOULD make sure the text starts with 0xFEFF. | ||||
| Text in the "UTF-16" charset MAY be serialized in either big-endian or | An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or | |||
| little-endian order. If the first two octets of the text are 0xFE followed | "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, | |||
| by 0xFF, then the text MUST be big-endian. If the first two octets of the | thereby requiring the use of the "UTF-16" tag only. | |||
| text are 0xFF followed by 0xFE, then the text MUST be little-endian. If the | ||||
| first two octets of the text are not 0xFE followed by 0xFF and are not 0xFF | ||||
| followed by 0xFE, then the text MUST be big-endian. Big-endian text in the | ||||
| "UTF-16" charset MAY start with the 0xFEFF character, but the 0xFEFF | ||||
| character is not required. | ||||
| All applications that process text in the "UTF-16" charset MUST be able to | 4. Interpreting text labels | |||
| read at least the first two octets of the text and be able to process those | ||||
| octets in order to determine the serialization of the text. Applications | ||||
| that use the "UTF-16" charset parameter value MUST NOT assume the | ||||
| serialization without first checking the first two octets to see if they | ||||
| are a big-endian BOM or a little-endian BOM or not a BOM. | ||||
| 4. Choosing a charset | When a program sees text labelled as "UTF-16BE", "UTF-16LE", or "UTF-16", | |||
| it can make some assumptions, based on the labelling rules given in the | ||||
| previous section. These assumptions allow the program to then process the | ||||
| text. | ||||
| Any labelling application that uses UTF-16 character encoding, and puts an | 4.1 Interpreting text labelled as UTF-16BE | |||
| explicit charset label on the text, and knows the serialization of the | ||||
| characters in text, MUST label the text as either "UTF-16BE" or "UTF-16LE", | ||||
| whichever is appropriate. This allows applications that are processing the | ||||
| text that are not able to look inside the text to know the serialization | ||||
| definitively. | ||||
| Any labelling application that uses UTF-16 character encoding, and puts an | Text labelled "UTF-16BE" can always be interpreted as being | |||
| explicit charset label on the text, and does not know the serialization of | big-endian. The detection of an initial BOM does not affect | |||
| the characters in text, MUST label the text as "UTF-16", and SHOULD be sure | de-serialization of text labelled as UTF-16BE. Finding 0xFF followed by | |||
| the text starts with 0xFEFF. An application processing text that is | 0xFE is an error since there is no Unicode character 0xFFFE. | |||
| labelled with the "UTF-16" charset parameter value knows that the | ||||
| serialization cannot be determined without looking inside the text itself. | ||||
| Fortunately, the processing application needs to only look at the first | ||||
| character (the first two octets) of the text to determine the | ||||
| serialization. | ||||
| Because creating text labelled as being in the "UTF-16" charset forces the | 4.2 Interpreting text labelled as UTF-16LE | |||
| recipient to read and understand the first character of the text object, a | ||||
| text-creating program SHOULD create text labelled as "UTF-16BE" or | Text labelled "UTF-16LE" can always be interpreted as being | |||
| "UTF-16LE" if possible. Text-creating programs that create text using | little-endian. The detection of an initial BOM does not affect | |||
| UTF-16 encoding SHOULD emit big-endian text if possible. | de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by | |||
| 0xFF is an error since there is no Unicode character 0xFFFE, which would be | ||||
| the interpretation of those octets under little-endian order. | ||||
| 4.3 Interpreting text labelled as UTF-16 | ||||
| Text labelled with the "UTF-16" charset might be serialized in either | ||||
| big-endian or little-endian order. If the first two octets of the text are | ||||
| 0xFE followed by 0xFF, then the text can be interpreted as being | ||||
| big-endian. If the first two octets of the text are 0xFF followed by 0xFE, | ||||
| then the text can be interpreted as being little-endian. If the first two | ||||
| octets of the text are not 0xFE followed by 0xFF, and are not 0xFF followed | ||||
| by 0xFE, then the text SHOULD be interpreted as being big-endian. | ||||
| All applications that process text with the "UTF-16" charset label MUST be | ||||
| able to read at least the first two octets of the text and be able to | ||||
| process those octets in order to determine the serialization order of the | ||||
| text. Applications that process text with the "UTF-16" charset label MUST | ||||
| NOT assume the serialization without first checking the first two octets to | ||||
| see if they are a big-endian BOM, a little-endian BOM, or not a BOM. | ||||
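
The check on the first two octets described in section 4.3 can be sketched as follows. The enum, the function name utf16_detect_order, and the bom_len output parameter are assumptions made for this example. The sketch only reports whether a BOM was recognized; whether to strip it is left to the caller, since an initial 0xFEFF is likely but not certain to be a signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Hypothetical helper (not from the draft): determine the serialization
 * order of text labelled "UTF-16" from its first two octets.
 * Sets *bom_len to 2 when a BOM was recognized and to 0 otherwise. */
static enum utf16_order utf16_detect_order(const unsigned char *octets,
                                           size_t len, size_t *bom_len)
{
    assert(octets != NULL || len == 0);
    assert(bom_len != NULL);

    *bom_len = 0;
    if (len >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BIG_ENDIAN;    /* big-endian BOM */
        }
        if (octets[0] == 0xFF && octets[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LITTLE_ENDIAN; /* little-endian BOM */
        }
    }
    /* not a BOM: the text SHOULD be interpreted as big-endian */
    return UTF16_BIG_ENDIAN;
}
```

Text labelled "UTF-16BE" or "UTF-16LE" would not go through this check, because for those labels the detection of an initial BOM does not affect de-serialization.
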
| 5. Examples | 5. Examples | |||
| For the sake of example, let's suppose that there is a hieroglyphic | For the sake of example, let's suppose that there is a hieroglyphic | |||
| character representing the Egyptian god Ra with character value 0x00012345 | character representing the Egyptian god Ra with character value 0x00012345 | |||
| (this character does not exist at present in Unicode). | (this character does not exist at present in Unicode). | |||
| The examples here all evaluate to the phrase: | The examples here all evaluate to the phrase: | |||
| *=Ra | *=Ra | |||
| where the "*" represents the Ra hieroglyph (0x00012345). | where the "*" represents the Ra hieroglyph (0x00012345). | |||
| Text that is labelled with UTF-16BE, with no BOM: | Text labelled with UTF-16BE, without a BOM: | |||
| D8 08 DF 45 00 3D 00 52 00 61 | D8 08 DF 45 00 3D 00 52 00 61 | |||
| Text that is labelled with UTF-16BE, with a BOM: | Text labelled with UTF-16LE, without a BOM: | |||
| FE FF D8 08 DF 45 00 3D 00 52 00 61 | ||||
| Text that is labelled with UTF-16LE, with no BOM: | ||||
| 08 D8 45 DF 3D 00 52 00 61 00 | 08 D8 45 DF 3D 00 52 00 61 00 | |||
| Little-endian text that is labelled with UTF-16: | Big-endian text labelled with UTF-16, with a BOM: | |||
| FE FF D8 08 DF 45 00 3D 00 52 00 61 | ||||
| Little-endian text labelled with UTF-16, with a BOM: | ||||
| FF FE 08 D8 45 DF 3D 00 52 00 61 00 | FF FE 08 D8 45 DF 3D 00 52 00 61 00 | |||
| 6. Versions of the standards | 6. Versions of the standards | |||
| ISO/IEC 10646 is updated from time to time by published amendments; | ISO/IEC 10646 is updated from time to time by published amendments; | |||
| similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, | similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, | |||
| and 2.1 as of this writing. Each new version obsoletes and replaces the | and 2.1 as of this writing. Each new version replaces the previous one, | |||
| previous one, but implementations, and more significantly data, are not | but implementations, and more significantly data, are not updated | |||
| updated instantly. | instantly. | |||
| In general, the changes amount to adding new characters, which does not | In general, the changes amount to adding new characters, which does not | |||
| pose particular problems with old data. Amendment 5 to ISO/IEC 10646, | pose particular problems with old data. Amendment 5 to ISO/IEC 10646, | |||
| however, has moved and expanded the Korean Hangul block, thereby making any | however, has moved and expanded the Korean Hangul block, thereby making any | |||
| previous data containing Hangul characters invalid under the new version. | previous data containing Hangul characters invalid under the new version. | |||
| Unicode 2.0 has the same difference from Unicode 1.1. The official | Unicode 2.0 has the same difference from Unicode 1.1. The official | |||
| justification for allowing such an incompatible change was that no | justification for allowing such an incompatible change was that no | |||
| implementations and no data containing Hangul existed, a statement that is | significant implementations and data containing Hangul existed, a statement | |||
| likely to be true but remains unprovable. The incident has been dubbed the | that is likely to be true but remains unprovable. The incident has been | |||
| "Korean mess", and the relevant committees have pledged to never, ever | dubbed the "Korean mess", and the relevant committees have pledged to | |||
| again make such an incompatible change. | never, ever again make such an incompatible change. | |||
| New versions, and in particular any incompatible changes, have consequences | New versions, and in particular any incompatible changes, have consequences | |||
| regarding MIME character encoding labels, to be discussed in Appendix A. | regarding MIME character encoding labels, to be discussed in Appendix A. | |||
| 7. Security considerations | 7. Security considerations | |||
| UTF-16 is based on the ISO 10646 character set, which is frequently being | UTF-16 is based on the ISO 10646 character set, which is frequently being | |||
| added to, as described in Section 6 and Appendix A of this document. | added to, as described in Section 6 and Appendix A of this document. | |||
| Processors must be able to handle characters that are not defined at the | Processors must be able to handle characters that are not defined at the | |||
| time that the processor was created in such a way as to not allow an | time that the processor was created in such a way as to not allow an | |||
| skipping to change at line 354 ¶ | skipping to change at line 374 ¶ | |||
| Text in UTF-16 may contain special characters, such as the OBJECT | Text in UTF-16 may contain special characters, such as the OBJECT | |||
| REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | |||
| depending on the interpretation of the processing program and the | depending on the interpretation of the processing program and the | |||
| availability of an external data stream that would be executed. This | availability of an external data stream that would be executed. This | |||
| external processing may have side-effects that allow the sender of a | external processing may have side-effects that allow the sender of a | |||
| message to attack the receiving system. | message to attack the receiving system. | |||
| Implementors of UTF-16 need to consider the security aspects of how they | Implementors of UTF-16 need to consider the security aspects of how they | |||
| handle illegal UTF-16 sequences (that is, sequences involving surrogate | handle illegal UTF-16 sequences (that is, sequences involving surrogate | |||
| pairs that have illegal values). It is conceivable that in some | pairs that have illegal values or unpaired surrogates). It is conceivable | |||
| circumstances an attacker would be able to exploit an incautious UTF-16 | that in some circumstances an attacker would be able to exploit an | |||
| parser by sending it an octet sequence that is not permitted by the UTF-16 | incautious UTF-16 parser by sending it an octet sequence that is not | |||
| syntax, causing it to behave in some anomalous fashion. | permitted by the UTF-16 syntax, causing it to behave in some anomalous | |||
| fashion. | ||||
| 8. References | 8. References | |||
| [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", | ||||
| BCP 18, RFC 2277, January 1998. | ||||
| [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | |||
| Procedures", BCP 19, RFC 2278, January 1998. | Procedures", BCP 19, RFC 2278, January 1998. | |||
| [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | |||
| technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: | technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: | |||
| Architecture and Basic Multilingual Plane. Twelve amendments and two | Architecture and Basic Multilingual Plane. Twelve amendments and two | |||
| technical corrigenda have been published up to now. UTF-16 is described in | technical corrigenda have been published up to now. UTF-16 is described in | |||
| Annex Q, published as Amendment 1. Many other amendments are currently at | Annex Q, published as Amendment 1. Many other amendments are currently at | |||
| various stages of standardization. | various stages of standardization. | |||
| [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", | [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", | |||
| BCP 18, RFC 2277, January 1998. | Unicode Technical Report #8. | |||
| [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | |||
| 2279, January 1998. | 2279, January 1998. | |||
| [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", | [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set Workshop", | |||
| Unicode Technical Report #8. | RFC 2130, April 1997. | |||
| 9. Acknowledgments | 9. Acknowledgments | |||
| Deborah Goldsmith wrote a great deal of the initial wording for this | Deborah Goldsmith wrote a great deal of the initial wording for this | |||
| specification. Other significant contributors include: | specification. Martin Duerst contributed numerous significant changes. Other | |||
| significant contributors include: | ||||
| Mati Allouche | Mati Allouche | |||
| Walt Daniels | Walt Daniels | |||
| Mark Davis | Mark Davis | |||
| Martin Duerst | ||||
| Ned Freed | Ned Freed | |||
| Asmus Freytag | Asmus Freytag | |||
| Lloyd Honomichl | Lloyd Honomichl | |||
| Dan Kegel | Dan Kegel | |||
| Murata Makoto | Murata Makoto | |||
| Larry Masinter | ||||
| Ken Whistler | Ken Whistler | |||
| Some of the text in this specification was copied from [UTF-8], and that | Some of the text in this specification was copied from [UTF-8], and that | |||
| document was worked on by many people. Please see the acknowledgements | document was worked on by many people. Please see the acknowledgments | |||
| section in that document for more people who may have contributed | section in that document for more people who may have contributed | |||
| indirectly to this document. | indirectly to this document. | |||
| 10. Authors' address | 10. Authors' address | |||
| Paul Hoffman | Paul Hoffman | |||
| Internet Mail Consortium | Internet Mail Consortium | |||
| 127 Segre Place | 127 Segre Place | |||
| Santa Cruz, CA 95060 USA | Santa Cruz, CA 95060 USA | |||
| phoffman@imc.org | phoffman@imc.org | |||
| Francois Yergeau | Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| 100, boul. Alexis-Nihon, Suite 600 | 100, boul. Alexis-Nihon, Suite 600 | |||
| Montreal QC H4M 2P2 Canada | Montreal QC H4M 2P2 Canada | |||
| fyergeau@alis.com | fyergeau@alis.com | |||
| 11. Changes between draft -01 and -02 | ||||
| Fixed some spelling mistakes throughout. | ||||
| Updated the status boilerplate. | ||||
| Clarified the parameter values in 1. | ||||
| Added [WORKSHOP] reference in 1.1 and 8. Also fuzzified the description of | ||||
| what UTF-16 is (instead of getting into hair-splitting on CESs, CCSs, and | ||||
| so on). | ||||
| Corrected 1.2 on the characters for which UTF-8 incurs a space penalty. | ||||
| Added "from ISO 10646 to UTF-16" to the beginning of 2.1. | ||||
| Added "from UTF-16 to ISO 10646" to the beginning of 2.2. | ||||
| Added text to the end of the note at the end of 2.2 about possibly emitting | ||||
| the ill-formed characters when decoding. | ||||
| Rearranged much of sections 3 and 4. This makes the following changes | ||||
| hard to follow; the references refer to the *old* section numbers, | ||||
| not necessarily the ones as they exist in this draft. Sorry about that... | ||||
| Changed the end of the first paragraph of 3.1 to get out of the | ||||
| which-endian-has-most debate. | ||||
| Clarified the fourth paragraph of 3.1 (the one that begins | ||||
| "This specification thus...") about the use of "UTF-16" as both a | ||||
| sequencing mechanism and a charset label. | ||||
| Added Martin Duerst's C code fragment for big-endian order. | ||||
| Added the sentence to the end of the sixth paragraph of 3.1 (the one | ||||
| that begins "It is important...") with the example of substrings and | ||||
| ZWNBSs. | ||||
| Added text about SHOULD NOT put an initial BOM in both 3.2 and 3.3. | ||||
| Clarified the last clause in section 3.3. | ||||
| Removed the last paragraph of 4 (the paragraph that used to start | ||||
| "Because creating text labelled...") because it related to text-creating | ||||
| programs instead of text-labelling programs. | ||||
| Rearranged and relabelled some of the examples in 5. | ||||
| Removed "obsoletes" from the first paragraph of 6. Slightly fuzzified | ||||
| the "no implementations" sentence in the second paragraph. | ||||
| Alphabetized the references in 8. | ||||
| Added Larry Masinter to section 9. Gave Martin Duerst more credit. | ||||
| A. Charset registrations | A. Charset registrations | |||
| This memo is meant to serve as the basis for registration of three MIME | This memo is meant to serve as the basis for registration of three MIME | |||
| charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", | charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", | |||
| and "UTF-16". These strings label objects containing text consisting of | and "UTF-16". These strings label objects containing text consisting of | |||
| characters from the repertoire of ISO/IEC 10646 including all amendments at | characters from the repertoire of ISO/IEC 10646 including all amendments at | |||
| least up to amendment 5 (Korean block), encoded to a sequence of octets | least up to amendment 5 (Korean block), encoded to a sequence of octets | |||
| using the encoding and serialization schemes outlined above. | using the encoding and serialization schemes outlined above. | |||
| Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in | Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in | |||
| End of changes. 46 change blocks. | ||||
| 129 lines changed or deleted | 209 lines changed or added | |||
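
Section 7 above asks implementors to consider how they handle illegal UTF-16 sequences, including unpaired surrogates. The following is a minimal sketch of one cautious check, not text from either draft. It assumes the octets have already been assembled into 16-bit code units in the correct byte order, and the function name utf16_first_error is hypothetical. What a parser should do after finding an error (reject, replace, or report) is left to the implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scan 'len' 16-bit code units and return the index
 * of the first ill-formed unit, or 'len' if the whole sequence is well
 * formed. A high surrogate (0xD800..0xDBFF) must be followed immediately
 * by a low surrogate (0xDC00..0xDFFF); a low surrogate on its own is
 * always an error.
 */
size_t utf16_first_error(const uint16_t *units, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint16_t u = units[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            if (i + 1 >= len ||
                units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF) {
                return i;   /* unpaired or truncated high surrogate */
            }
            i += 2;         /* valid surrogate pair */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return i;       /* low surrogate with no preceding high */
        } else {
            i += 1;         /* ordinary 16-bit character value */
        }
    }
    return len;             /* no ill-formed sequence found */
}
```

A caller could compare the return value with the length and refuse to process the text further when they differ, rather than guessing at the sender's intent.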