| < draft-hoffman-utf16-00.txt | draft-hoffman-utf16-01.txt > | |||
|---|---|---|---|---|
| Internet Draft Paul Hoffman | Internet Draft Paul Hoffman | |||
| <draft-hoffman-utf16-00.txt> Internet Mail Consortium | <draft-hoffman-utf16-01.txt> Internet Mail Consortium | |||
| November 12, 1998 Francois Yergeau | December 13, 1998 Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| UTF-16, an encoding of ISO 10646 | UTF-16, an encoding of ISO 10646 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft. Internet-Drafts are working documents | This document is an Internet-Draft. Internet-Drafts are working documents | |||
| of the Internet Engineering Task Force (IETF), its areas, and its working | of the Internet Engineering Task Force (IETF), its areas, and its working | |||
| groups. Note that other groups may also distribute working documents as | groups. Note that other groups may also distribute working documents as | |||
| Internet-Drafts. | Internet-Drafts. | |||
| skipping to change at line 39 ¶ | skipping to change at line 39 ¶ | |||
| 1. Introduction | 1. Introduction | |||
| This document specifies the UTF-16 encoding of Unicode/ISO-10646 and | This document specifies the UTF-16 encoding of Unicode/ISO-10646 and | |||
| contains the registration for three MIME charset parameter values: | contains the registration for three MIME charset parameter values: | |||
| UTF-16BE, UTF-16LE, and UTF-16. | UTF-16BE, UTF-16LE, and UTF-16. | |||
| 1.1 Background | 1.1 Background | |||
| The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | |||
| define a character set (hereafter referred to as Unicode) which encompasses | define a coded character set (CCS), hereafter referred to as Unicode, which | |||
| most of the world's writing systems. UTF-16, the object of this | encompasses most of the world's writing systems. UTF-16, the object of this | |||
| specification, is an encoding scheme of this character set that has the | specification, is a character encoding scheme (CES) of Unicode that has the | |||
| characteristics of encoding the vast majority of currently-defined | characteristics of encoding the vast majority of currently-defined | |||
| characters in exactly two octets and of being able to encode all other | characters in exactly two octets and of being able to encode all other | |||
| characters that will be defined in exactly four octets. | characters that will be defined in exactly four octets. | |||
| The Unicode Standard further defines additional character properties and | The Unicode Standard further defines additional character properties and | |||
| other application details of great interest to implementors. Up to the | other application details of great interest to implementors. Up to the | |||
| present time, changes in Unicode and amendments to ISO/IEC 10646 have | present time, changes in Unicode and amendments to ISO/IEC 10646 have | |||
| tracked each other, so that the character repertoires and code point | tracked each other, so that the character repertoires and code point | |||
| assignments have remained in sync. The relevant standardization committees | assignments have remained in sync. The relevant standardization committees | |||
| have committed to maintain this very useful synchronism. | have committed to maintain this very useful synchronism. | |||
| 1.2 Motivation | 1.2 Motivation | |||
| The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF | The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF | |||
| policy on character sets, [CHARPOLICY], says that IETF protocols MUST be | policy on character sets and languages, [CHARPOLICY], says that IETF | |||
| able to use the UTF-8 charset. However, relative to UTF-16, UTF-8 imposes a | protocols MUST be able to use the UTF-8 charset. However, relative to | |||
| space penalty for characters whose values are greater than 0x0800. Also, | UTF-16, UTF-8 imposes a space penalty for characters whose values are | |||
| characters represented in UTF-8 have varying sizes. Using UTF-16 provides a | greater than 0x0800. Also, characters represented in UTF-8 have varying | |||
| way to transmit character data that is mostly uniform in size. Some | sizes. Using UTF-16 provides a way to transmit character data that is | |||
| products and network standards already specify UTF-16. (Note, however, that | mostly uniform in size. Some products and network standards already specify | |||
| UTF-8 has many other advantages over UTF-16 in many protocols, such as the | UTF-16. (Note, however, that UTF-8 has many other advantages over UTF-16 in | |||
| direct encoding of US-ASCII characters.) | many protocols, such as the direct encoding of US-ASCII characters and | |||
| re-synchronization after loss of octets.) | ||||
| UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as | UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as | |||
| a sequence of 16-bit quantities. This document addresses the issues of | a sequence of 16-bit quantities. This document addresses the issues of | |||
| serializing UTF-16 as an octet stream for transmission over the Internet | serializing UTF-16 as an octet stream for transmission over the Internet | |||
| and of MIME charset naming as described in [CHARSET-REG]. | and of MIME charset naming as described in [CHARSET-REG]. | |||
| 1.3 Terminology | 1.3 Terminology | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. | document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. | |||
| Throughout this document, character values are shown in hexadecimal | Throughout this document, character values are shown in hexadecimal | |||
| notation. For example, "0x013C" is the character whose value is at the | notation. For example, "0x013C" is the character whose value is the | |||
| codepoint that is 316 (decimal) positions from the base of the character | character assigned the integer value 316 (decimal) in the CCS. | |||
| set. | ||||
| 2. UTF-16 definition | 2. UTF-16 definition | |||
| In ISO 10646, each character is assigned a number, which Unicode calls the | In ISO 10646, each character is assigned a number, which Unicode calls the | |||
| Unicode scalar value. This number is the same as the UCS-4 value of the | Unicode scalar value. This number is the same as the UCS-4 value of the | |||
| character, and this document will refer to it as the "character value" for | character, and this document will refer to it as the "character value" for | |||
| brevity. In the UTF-16 encoding, characters are represented using either | brevity. In the UTF-16 encoding, characters are represented using either | |||
| one or two unsigned 16-bit integers, depending on the character value. | one or two unsigned 16-bit integers, depending on the character value. | |||
| Serialization of these integers for transmission as a byte stream is | Serialization of these integers for transmission as a byte stream is | |||
| discussed in Section 3. | discussed in Section 3. | |||
| The rules for how characters are encoded in UTF-16 are: | The rules for how characters are encoded in UTF-16 are: | |||
| - Characters with values less than 0x10000 are represented as a single | - Characters with values less than 0x10000 are represented as a single | |||
| integer with a value equal to that of the character number. | 16-bit integer with a value equal to that of the character number. | |||
| - Characters with values between 0x10000 and 0x10FFFF are represented by | - Characters with values between 0x10000 and 0x10FFFF are represented by a | |||
| an integer with a value between 0xD800 and 0xDBFF (within the so-called | 16-bit integer with a value between 0xD800 and 0xDBFF (within the | |||
| high-half zone or high surrogate area) followed by an integer with a | so-called high-half zone or high surrogate area) followed by a 16-bit | |||
| value between 0xDC00 and 0xDFFF (within the so-called low-half zone or | integer with a value between 0xDC00 and 0xDFFF (within the so-called | |||
| low surrogate area). | low-half zone or low surrogate area). | |||
| - Characters with values greater than 0x10FFFF cannot be encoded in | - Characters with values greater than 0x10FFFF cannot be encoded in | |||
| UTF-16. | UTF-16. | |||
| 2.1 Encoding UTF-16 | 2.1 Encoding UTF-16 | |||
| Encoding of a single character proceeds as follows. Let U be the character | Encoding of a single character proceeds as follows. Let U be the character | |||
| number, no greater than 0x10FFFF. | number, no greater than 0x10FFFF. | |||
| 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | |||
| skipping to change at line 177 ¶ | skipping to change at line 177 ¶ | |||
| 0x01 followed by the octet 0x02. The little-endian serialization of that | 0x01 followed by the octet 0x02. The little-endian serialization of that | |||
| number is the octet 0x02 followed by the octet 0x01. | number is the octet 0x02 followed by the octet 0x01. | |||
| The term "network byte order" has been used in many RFCs to indicate | The term "network byte order" has been used in many RFCs to indicate | |||
| big-endian serialization, although that term has never been formally | big-endian serialization, although that term has never been formally | |||
| defined in a standards-track document. ISO 10646 prefers big-endian | defined in a standards-track document. ISO 10646 prefers big-endian | |||
| serialization (section 6.3 of [ISO-10646]), but it is nonetheless | serialization (section 6.3 of [ISO-10646]), but it is nonetheless | |||
| considered likely that little-endian order will also be used on the | considered likely that little-endian order will also be used on the | |||
| Internet. | Internet. | |||
| This specification thus contains registration for three charset parameter | This specification thus contains registration for three charsets: | |||
| values: "UTF-16BE", "UTF-16LE", and "UTF-16". The three character encodings | "UTF-16BE", "UTF-16LE", and "UTF-16". The character encoding schemes these | |||
| are identical except for the serialization order of the octets in each | charsets use are identical except for the serialization order of the octets | |||
| character, and the external determination of which serialization is used. | in each character, and the external determination of which serialization is | |||
| used. | ||||
| The Unicode Standard defines the character "ZERO WIDTH NON-BREAKING SPACE" | The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | |||
| (0xFEFF) which is also known as the "BYTE ORDER MARK", abbreviated "BOM". | NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER | |||
| All BOM characters MUST be considered to be characters of the text object | MARK" (abbreviated "BOM"). The latter name hints at a second possible usage | |||
| that is labelled with the "UTF-16BE", "UTF-16LE", or "UTF-16" charset | of the character, in addition to its normal use as a genuine "ZERO WIDTH | |||
| parameter values. The BOM characters MUST be included when performing | NON-BREAKING SPACE" within text. This usage, suggested by Unicode section | |||
| MIME-related operations over the entire text, such as in hash algorithms | 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character | |||
| and length calculations. After the text has been processed, the BOM MAY be | to a stream of Unicode characters as a "signature"; a receiver of such a | |||
| removed, although this will prevent later comparison with the original MIME | serialized stream may then use the initial character both as a hint that | |||
| object. | the stream consists of Unicode characters and as a way to recognize the | |||
| serialization order. In serialized UTF-16 prepended with such a signature, | ||||
| the order is big-endian if the first two octets are 0xFE followed by 0xFF; | ||||
| if they are 0xFF followed by 0xFE, the order is little-endian. Note that | ||||
| 0xFFFE is not a Unicode character, precisely to preserve the usefulness of | ||||
| 0xFEFF as a byte-order mark. | ||||
| It is important to understand that the character 0xFEFF appearing at any | ||||
| position other than the beginning of a stream MUST be interpreted with the | ||||
| semantics for the zero-width non-breaking space, and MUST NOT be | ||||
| interpreted as a byte-order mark. The contrapositive of that statement is | ||||
| not always true: the character 0xFEFF in the first position of a stream MAY | ||||
| be interpreted as a zero-width non-breaking space, and is not always a | ||||
| byte-order mark. | ||||
| The Unicode standard further suggests that an initial 0xFEFF character may | ||||
| be stripped before processing the text, the rationale being that such a | ||||
| character in initial position may be an artifact of the encoding (an | ||||
| encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING | ||||
| SPACE". Nevertheless, such stripping MUST NOT take place before any | ||||
| MIME-related operations (such as hash algorithms, digest, or byte-count | ||||
| computations) have been completed. Such operations depend on the exact | ||||
| bytes of the data, which therefore may not be modified in any way. After | ||||
| all MIME-related operations have been completed (for instance after a MIME | ||||
| processor has handed an entity to a specific media type processor), an | ||||
| initial 0xFEFF MAY be removed if appropriate, although this will prevent | ||||
| later comparison with the original MIME object. In particular, in UTF-16 | ||||
| plain text it is likely that an initial 0xFEFF is a signature; when | ||||
| concatenating two strings, it is important to strip out those signatures, | ||||
| for otherwise the resulting string may contain an unintended "ZERO WIDTH | ||||
| NON-BREAKING SPACE" at the connection point. Also, some specifications | ||||
| mandate an initial 0xFEFF character in objects encoded in UTF-16 and | ||||
| specify that this signature is not part of the object. | ||||
| 3.2 Serialization in UTF-16BE | 3.2 Serialization in UTF-16BE | |||
| Text labelled with the "UTF-16BE" charset parameter value MUST be | Text in the "UTF-16BE" charset MUST be serialized with the octets which | |||
| serialized with the octets which make up a single 16-bit UTF-16 value in | make up a single 16-bit UTF-16 value in big-endian order. The detection of | |||
| big-endian order. The detection of an initial BOM or a reversed BOM does | an initial BOM does not affect de-serialization of text labelled as | |||
| not affect de-serialization of text labelled as UTF-16BE. Finding a | UTF-16BE. Finding 0xFF followed by 0xFE is an error since there is no | |||
| reversed BOM (that is, the octet 0xFF followed by the octet 0xFE) is an | Unicode character 0xFFFE. | |||
| error since there is no Unicode character 0xFFFE. | ||||
| 3.3 Serialization in UTF-16LE | 3.3 Serialization in UTF-16LE | |||
| Text labelled with the "UTF-16LE" charset parameter value MUST be | Text in the "UTF-16LE" charset MUST be serialized with the octets which | |||
| serialized with the octets which make up a single 16-bit UTF-16 value in | make up a single 16-bit UTF-16 value in little-endian order. The detection | |||
| little-endian order. The detection of an initial BOM or a reversed BOM does | of an initial BOM does not affect de-serialization of text labelled as | |||
| not affect de-serialization of text labelled as UTF-16BE. Finding a | UTF-16LE. Finding 0xFE followed by 0xFF is an error since there is no Unicode | |||
| non-reversed BOM (that is, the octet 0xFE followed by the octet 0xFF) is an | character 0xFFFE, which is the interpretation of the 0xFEFF character under | |||
| error since there is no Unicode character 0xFFFE, which is the | little-endian order. | |||
| interpretation of the non-reversed BOM under little-endian order. | ||||
| 3.4 Serialization in UTF-16 | 3.4 Serialization in UTF-16 | |||
| Text labelled with the "UTF-16" charset parameter value MAY be serialized | Text in the "UTF-16" charset MAY be serialized in either big-endian or | |||
| in either big-endian or little-endian order. Text labelled as UTF-16 MUST | little-endian order. If the first two octets of the text is 0xFE followed | |||
| be big-endian unless the first two octets of the text is sequence of octets | by 0xFF, then the text MUST be big-endian. If the first two octets of the | |||
| 0xFF 0xFE, in which case the serialization MUST be little-endian. | text is 0xFF followed by 0xFE, then the text MUST be little-endian. If the | |||
| first two octets of the text is not 0xFE followed by 0xFF and is not 0xFF | ||||
| Big-endian text labelled with the "UTF-16" charset parameter value MAY | followed by 0xFE, then the text MUST be big-endian. Big-endian text in the | |||
| start with the big-endian BOM (the character 0xFEFF), but the BOM is not | "UTF-16" charset MAY start with the 0xFEFF character, but the 0xFEFF | |||
| required. BOM characters other than the first character of a body part are | character is not required. | |||
| not interpreted as BOMs. | ||||
| All applications that process text that uses the "UTF-16" charset parameter | All applications that process text in the "UTF-16" charset MUST be able to | |||
| value MUST be able to read at least the first two octets of the text and be | read at least the first two octets of the text and be able to process those | |||
| able to process those octets in order to determine the serialization of the | octets in order to determine the serialization of the text. Applications | |||
| text. Applications that use the "UTF-16" charset parameter value MUST NOT | that use the "UTF-16" charset parameter value MUST NOT assume the | |||
| assume the serialization without first checking the first two octets to see | serialization without first checking the first two octets to see if they | |||
| if they are a big-endian BOM or a little-endian BOM or not a BOM. | are a big-endian BOM or a little-endian BOM or not a BOM. | |||
| 4. Choosing a charset | 4. Choosing a charset | |||
| Any labelling application that uses UTF-16 character encoding, and puts an | Any labelling application that uses UTF-16 character encoding, and puts an | |||
| explicit charset label on the text, and knows the serialization of the | explicit charset label on the text, and knows the serialization of the | |||
| characters in text, MUST label the text with the "UTF-16BE" or the | characters in text, MUST label the text as either "UTF-16BE" or "UTF-16LE", | |||
| "UTF-16LE" charset parameter values. This allows applications that are | whichever is appropriate. This allows applications that are processing the | |||
| processing the text that are not able to look inside the text to know the | text that are not able to look inside the text to know the serialization | |||
| serialization definitively. | definitively. | |||
| Any labelling application that uses UTF-16 character encoding, and puts an | Any labelling application that uses UTF-16 character encoding, and puts an | |||
| explicit charset label on the text, and does not know the serialization of | explicit charset label on the text, and does not know the serialization of | |||
| the characters in text, MUST label the text with the "UTF-16" charset | the characters in text, MUST label the text as "UTF-16", and SHOULD be sure | |||
| parameter value, and SHOULD be sure the text starts with a BOM. An | the text starts with 0xFEFF. An application processing text that is | |||
| application processing text that is labelled with the "UTF-16" charset | labelled with the "UTF-16" charset parameter value knows that the | |||
| parameter value knows that the serialization cannot be determined without | serialization cannot be determined without looking inside the text itself. | |||
| looking inside the text itself. Fortunately, the processing application | Fortunately, the processing application needs to only look at the first | |||
| needs only look at the first character (the first two octets) of the text | character (the first two octets) of the text to determine the | |||
| to determine the serialization. | serialization. | |||
| Because creating text that uses the "UTF-16" charset parameter value forces | Because creating text labelled as being in the "UTF-16" charset forces the | |||
| the recipient to read and understand the first character of the text | recipient to read and understand the first character of the text object, a | |||
| object, a text-creating program SHOULD create text labelled with the | text-creating program SHOULD create text labelled as "UTF-16BE" or | |||
| "UTF-16BE" or the "UTF-16LE" charset parameter values if possible. | "UTF-16LE" if possible. Text-creating programs that create text using | |||
| Text-creating programs that create text using UTF-16 encoding SHOULD emit | UTF-16 encoding SHOULD emit big-endian text if possible. | |||
| big-endian text if possible. | ||||
| 5. Examples | 5. Examples | |||
| For the sake of example, let's suppose that there is a hieroglyphic | For the sake of example, let's suppose that there is a hieroglyphic | |||
| character representing the Egyptian god Ra with character value 0x00012345 | character representing the Egyptian god Ra with character value 0x00012345 | |||
| (this character does not exist at present in Unicode). | (this character does not exist at present in Unicode). | |||
| The examples here all evaluate to the phrase: | The examples here all evaluate to the phrase: | |||
| *=Ra | *=Ra | |||
| skipping to change at line 311 ¶ | skipping to change at line 340 ¶ | |||
| 7. Security considerations | 7. Security considerations | |||
| UTF-16 is based on the ISO 10646 character set, which is frequently being | UTF-16 is based on the ISO 10646 character set, which is frequently being | |||
| added to, as described in Section 6 and Appendix A of this document. | added to, as described in Section 6 and Appendix A of this document. | |||
| Processors must be able to handle characters that are not defined at the | Processors must be able to handle characters that are not defined at the | |||
| time that the processor was created in such a way as to not allow an | time that the processor was created in such a way as to not allow an | |||
| attacker to harm a recipient by including unknown characters. | attacker to harm a recipient by including unknown characters. | |||
| Processors that handle any type of text, including text encoded as UTF-16, | Processors that handle any type of text, including text encoded as UTF-16, | |||
| must be vigilant for control characters that might reprogram a display | must be vigilant in checking for control characters that might reprogram a | |||
| terminal or keyboard. Similarly, processors that interpret text entities | display terminal or keyboard. Similarly, processors that interpret text | |||
| (such as looking for embedded programming code), must be careful not to | entities (such as looking for embedded programming code), must be careful | |||
| execute the code without first alerting the recipient. | not to execute the code without first alerting the recipient. | |||
| Text in UTF-16 may contain special characters, such as the OBJECT | Text in UTF-16 may contain special characters, such as the OBJECT | |||
| REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, | |||
| depending on the interpretation of the processing program and the | depending on the interpretation of the processing program and the | |||
| availability of an external data stream that would be executed. This | availability of an external data stream that would be executed. This | |||
| external processing may have side-effects that allow the sender of a | external processing may have side-effects that allow the sender of a | |||
| message to attack the receiving system. | message to attack the receiving system. | |||
| Implementors of UTF-16 need to consider the security aspects of how they | Implementors of UTF-16 need to consider the security aspects of how they | |||
| handle illegal UTF-16 sequences (that is, sequences involving surrogate | handle illegal UTF-16 sequences (that is, sequences involving surrogate | |||
| pairs that have illegal values). It is conceivable that in some | pairs that have illegal values). It is conceivable that in some | |||
| circumstances an attacker would be able to exploit an incautious UTF-16 | circumstances an attacker would be able to exploit an incautious UTF-16 | |||
| parser by sending it an octet sequence that is not permitted by the UTF-16 | parser by sending it an octet sequence that is not permitted by the UTF-16 | |||
| syntax. | syntax, causing it to behave in some anomalous fashion. | |||
| 8. References | 8. References | |||
| [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration | |||
| Procedures", BCP 19, RFC 2278, January 1998. | Procedures", BCP 19, RFC 2278, January 1998. | |||
| [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information | |||
| technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: | technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: | |||
| Architecture and Basic Multilingual Plane. Twelve amendments and two | Architecture and Basic Multilingual Plane. Twelve amendments and two | |||
| technical corrigenda have been published up to now. UTF-16 is described in | technical corrigenda have been published up to now. UTF-16 is described in | |||
| skipping to change at line 356 ¶ | skipping to change at line 385 ¶ | |||
| BCP 18, RFC 2277, January 1998. | BCP 18, RFC 2277, January 1998. | |||
| [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | |||
| 2279, January 1998. | 2279, January 1998. | |||
| [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", | [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", | |||
| Unicode Technical Report #8. | Unicode Technical Report #8. | |||
| 9. Acknowledgments | 9. Acknowledgments | |||
| David Goldsmith wrote a great deal of the initial wording for this | Deborah Goldsmith wrote a great deal of the initial wording for this | |||
| specification. Other significant contributors include: | specification. Other significant contributors include: | |||
| Mati Allouche | Mati Allouche | |||
| Walt Daniels | Walt Daniels | |||
| Mark Davis | Mark Davis | |||
| Martin Duerst | Martin Duerst | |||
| Ned Freed | ||||
| Asmus Freytag | Asmus Freytag | |||
| Lloyd Honomichl | Lloyd Honomichl | |||
| Dan Kegel | ||||
| Murata Makoto | Murata Makoto | |||
| Ken Whistler | Ken Whistler | |||
| Some of the text in this specification was copied from [UTF-8], and that | Some of the text in this specification was copied from [UTF-8], and that | |||
| document was worked on by many people. Please see the acknowledgements | document was worked on by many people. Please see the acknowledgements | |||
| section in that document for more people who may have contributed | section in that document for more people who may have contributed | |||
| indirectly to this document. | indirectly to this document. | |||
| 10. Authors' address | 10. Authors' address | |||
| skipping to change at line 390 ¶ | skipping to change at line 421 ¶ | |||
| Francois Yergeau | Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| 100, boul. Alexis-Nihon, Suite 600 | 100, boul. Alexis-Nihon, Suite 600 | |||
| Montreal QC H4M 2P2 Canada | Montreal QC H4M 2P2 Canada | |||
| fyergeau@alis.com | fyergeau@alis.com | |||
| A. Charset registrations | A. Charset registrations | |||
| This memo is meant to serve as the basis for registration of three MIME | This memo is meant to serve as the basis for registration of three MIME | |||
| character set parameters (charsets) [CHARSET-REG]. The proposed charset | charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", | |||
| parameter values are "UTF-16BE", "UTF-16LE", and "UTF-16". These strings | and "UTF-16". These strings label objects containing text consisting of | |||
| label media types containing text consisting of characters from the | characters from the repertoire of ISO/IEC 10646 including all amendments at | |||
| repertoire of ISO/IEC 10646 including all amendments at least up to | least up to amendment 5 (Korean block), encoded to a sequence of octets | |||
| amendment 5 (Korean block), encoded to a sequence of octets using the | using the encoding and serialization schemes outlined above. | |||
| encoding and serialization schemes outlined above. | ||||
| Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in | Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in | |||
| MIME content types under the "text" top-level type, because they do not | media types under the "text" top-level type, because they do not encode | |||
| encode line endings in the way required for MIME "text" media types. | line endings in the way required for MIME "text" media types. | |||
| It is noteworthy that the labels described here do not contain a version | It is noteworthy that the labels described here do not contain a version | |||
| identification, referring generically to ISO/IEC 10646. This is | identification, referring generically to ISO/IEC 10646. This is | |||
| intentional, the rationale being as follows: | intentional, the rationale being as follows: | |||
| A MIME charset label is designed to give just the information needed to | A MIME charset is designed to give just the information needed to interpret | |||
| interpret a sequence of bytes received on the wire into a sequence of | a sequence of bytes received on the wire into a sequence of characters, | |||
| characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as | nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character | |||
| a character set standard does not change incompatibly, version numbers | set standard does not change incompatibly, version numbers serve no | |||
| serve no purpose, because one gains nothing by learning from the tag that | purpose, because one gains nothing by learning from the tag that newly | |||
| newly assigned characters may be received that one doesn't know about. The | assigned characters may be received that one doesn't know about. The tag | |||
| tag itself doesn't teach anything about the new characters, which are going | itself doesn't teach anything about the new characters, which are going to | |||
| to be received anyway. | be received anyway. | |||
| Hence, as long as the standards evolve compatibly, the apparent advantage | Hence, as long as the standards evolve compatibly, the apparent advantage | |||
| of having labels that identify the versions is only that, apparent. But | of having labels that identify the versions is only that, apparent. But | |||
| there is a disadvantage to such version-dependent labels: when an older | there is a disadvantage to such version-dependent labels: when an older | |||
| application receives data accompanied by a newer, unknown label, it may | application receives data accompanied by a newer, unknown label, it may | |||
| fail to recognize the label and be completely unable to deal with the data, | fail to recognize the label and be completely unable to deal with the data, | |||
| whereas a generic, known label would have triggered mostly correct | whereas a generic, known label would have triggered mostly correct | |||
| processing of the data, which may well not contain any new characters. | processing of the data, which may well not contain any new characters. | |||
| The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in | The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in | |||
| principle contradicting the appropriateness of a version independent MIME | principle contradicting the appropriateness of a version independent MIME | |||
| charset label as described above. But the compatibility problem can only | charset as described above. But the compatibility problem can only appear | |||
| appear with data containing Korean Hangul characters encoded according to | with data containing Korean Hangul characters encoded according to Unicode | |||
| Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there | 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is | |||
| is arguably no such data to worry about, this being the very reason the | arguably no such data to worry about, this being the very reason the | |||
| incompatible change was deemed acceptable. | incompatible change was deemed acceptable. | |||
| In practice, then, a version-independent label is warranted, provided the | In practice, then, a version-independent label is warranted, provided the | |||
| label is understood to refer to all versions after Amendment 5, and | label is understood to refer to all versions after Amendment 5, and | |||
| provided no incompatible change actually occurs. Should incompatible | provided no incompatible change actually occurs. Should incompatible | |||
| changes occur in a later version of ISO/IEC 10646, the MIME charset labels | changes occur in a later version of ISO/IEC 10646, the MIME charsets | |||
| defined here will stay aligned with the previous version until and unless | defined here will stay aligned with the previous version until and unless | |||
| the IETF specifically decides otherwise. | the IETF specifically decides otherwise. | |||
| A.1 Registration for UTF-16BE | A.1 Registration for UTF-16BE | |||
| To: ietf-charsets@iana.org | To: ietf-charsets@iana.org | |||
| Subject: Registration of new charset | Subject: Registration of new charset | |||
| Charset name(s): UTF-16BE | Charset name(s): UTF-16BE | |||
| End of changes. 25 change blocks. | ||||
| 107 lines changed or deleted | 137 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
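
The byte-order rules that the -01 revision spells out in Section 3.4 for text labelled "UTF-16" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of either draft: the function names `utf16_byte_order` and `decode_utf16_labelled` are hypothetical, and the sketch leans on Python's built-in "utf-16-be" and "utf-16-le" codecs, which keep an initial 0xFEFF as an ordinary character rather than stripping it.

```python
def utf16_byte_order(octets: bytes) -> str:
    """Return 'big' or 'little' for text labelled with the "UTF-16" charset.

    Hypothetical helper following Section 3.4 of the -01 draft:
    0xFF 0xFE at the start means little-endian; 0xFE 0xFF, or any
    other first two octets, means big-endian.
    """
    if octets[:2] == b"\xff\xfe":
        return "little"
    return "big"


def decode_utf16_labelled(octets: bytes) -> str:
    """Decode text labelled "UTF-16" without stripping an initial 0xFEFF.

    The signature is kept so that the result still corresponds to the
    exact octets that MIME-related operations (hashes, byte counts)
    were computed over; a caller may remove it afterwards if appropriate.
    """
    codec = "utf-16-be" if utf16_byte_order(octets) == "big" else "utf-16-le"
    return octets.decode(codec)


if __name__ == "__main__":
    # 'A' serialized little-endian with a signature, then big-endian without one.
    print(utf16_byte_order(b"\xff\xfeA\x00"))
    print(utf16_byte_order(b"\x00A"))
    print(repr(decode_utf16_labelled(b"\xff\xfeA\x00")))
```

Keeping the signature in the decoded string is a deliberate choice here: the draft only permits removing an initial 0xFEFF after all MIME-related operations have completed, so the sketch leaves that decision to the caller.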