| < draft-hoffman-utf16-03.txt | draft-hoffman-utf16-04.txt > | |||
|---|---|---|---|---|
| Internet Draft Paul Hoffman | Internet Draft Paul Hoffman | |||
| <draft-hoffman-utf16-03.txt> Internet Mail Consortium | <draft-hoffman-utf16-04.txt> Internet Mail Consortium | |||
| April 19, 1999 Francois Yergeau | June 1, 1999 Francois Yergeau | |||
| Alis Technologies | Alis Technologies | |||
| UTF-16, an encoding of ISO 10646 | UTF-16, an encoding of ISO 10646 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with all | This document is an Internet-Draft and is in full conformance with all | |||
| provisions of Section 10 of RFC2026. | provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering Task | Internet-Drafts are working documents of the Internet Engineering Task | |||
| skipping to change at line 42 | skipping to change at line 41 | |||
| This document describes the UTF-16 encoding of Unicode/ISO-10646, | This document describes the UTF-16 encoding of Unicode/ISO-10646, | |||
| addresses the issues of serializing UTF-16 as an octet stream for | addresses the issues of serializing UTF-16 as an octet stream for | |||
| transmission over the Internet, defines MIME charset naming as | transmission over the Internet, defines MIME charset naming as | |||
| described in [CHARSET-REG], and contains the registration for three | described in [CHARSET-REG], and contains the registration for three | |||
| MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE | MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE | |||
| (little-endian), and UTF-16. | (little-endian), and UTF-16. | |||
| 1.1 Background and motivation | 1.1 Background and motivation | |||
| The Unicode Standard [UNICODE], and ISO/IEC 10646 [ISO-10646] jointly | The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly | |||
| define a coded character set (CCS), hereafter referred to as Unicode, | define a coded character set (CCS), hereafter referred to as Unicode, | |||
| which encompasses most of the world's writing systems [WORKSHOP]. | which encompasses most of the world's writing systems [WORKSHOP]. | |||
| UTF-16, the object of this specification, is one of the standard ways | UTF-16, the object of this specification, is one of the standard ways | |||
| of encoding Unicode character data; it has the characteristics of | of encoding Unicode character data; it has the characteristics of | |||
| encoding all currently defined characters (in plane 0, the BMP) in | encoding all currently defined characters (in plane 0, the BMP) in | |||
| exactly two octets and of being able to encode all other characters | exactly two octets and of being able to encode all other characters | |||
| likely to be defined (the next 16 planes) in exactly four octets. | likely to be defined (the next 16 planes) in exactly four octets. | |||
| The Unicode Standard further defines additional character properties | The Unicode Standard further defines additional character properties | |||
| and other application details of great interest to implementors. Up to | and other application details of great interest to implementors. Up to | |||
| skipping to change at line 104 | skipping to change at line 103 | |||
| - Characters with values between 0x10000 and 0x10FFFF are represented | - Characters with values between 0x10000 and 0x10FFFF are represented | |||
| by a 16-bit integer with a value between 0xD800 and 0xDBFF (within | by a 16-bit integer with a value between 0xD800 and 0xDBFF (within | |||
| the so-called high-half zone or high surrogate area) followed by a | the so-called high-half zone or high surrogate area) followed by a | |||
| 16-bit integer with a value between 0xDC00 and 0xDFFF (within the | 16-bit integer with a value between 0xDC00 and 0xDFFF (within the | |||
| so-called low-half zone or low surrogate area). | so-called low-half zone or low surrogate area). | |||
| - Characters with values greater than 0x10FFFF cannot be encoded in | - Characters with values greater than 0x10FFFF cannot be encoded in | |||
| UTF-16. | UTF-16. | |||
| Note: Values between 0xD800 and 0xDFFF are specifically reserved for | ||||
| use with UTF-16, and don't have any characters assigned to them. | ||||
| 2.1 Encoding UTF-16 | 2.1 Encoding UTF-16 | |||
| Encoding of a single character from an ISO 10646 character value to | Encoding of a single character from an ISO 10646 character value to | |||
| UTF-16 proceeds as follows. Let U be the character number, no greater | UTF-16 proceeds as follows. Let U be the character number, no greater | |||
| than 0x10FFFF. | than 0x10FFFF. | |||
| 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. | |||
| 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, | 2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, | |||
| U' must be less than or equal to 0xFFFFF. That is, U' can be | U' must be less than or equal to 0xFFFFF. That is, U' can be | |||
| skipping to change at line 161 | skipping to change at line 163 | |||
| Note that steps 2 and 3 indicate errors. Error recovery is not | Note that steps 2 and 3 indicate errors. Error recovery is not | |||
| specified by this document. When terminating with an error in steps 2 | specified by this document. When terminating with an error in steps 2 | |||
| and 3, it may be wise to set U to the value of W1 to help the caller | and 3, it may be wise to set U to the value of W1 to help the caller | |||
| diagnose the error and not lose information. Also note that a string | diagnose the error and not lose information. Also note that a string | |||
| decoding algorithm, as opposed to the single-character decoding | decoding algorithm, as opposed to the single-character decoding | |||
| described above, need not terminate upon detection of an error, if | described above, need not terminate upon detection of an error, if | |||
| proper error reporting and/or recovery is provided. | proper error reporting and/or recovery is provided. | |||
| 3. Labelling UTF-16 text | 3. Labelling UTF-16 text | |||
| This specification contains registration for three MIME charsets: | Appendix A of this specification contains registrations for three MIME | |||
| "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the | charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent | |||
| combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and | the combination of a CCS (a coded character set) and a CES (a character | |||
| the CES is the same in all three cases, except for the serialization | encoding scheme). Here the CCS is Unicode/ISO 10646 and the CES is the | |||
| order of the octets in each character, and the external determination | same in all three cases, except for the serialization order of the | |||
| of which serialization is used. | octets in each character, and the external determination of which | |||
| serialization is used. | ||||
| This section describes which of the three labels to apply to a stream | This section describes which of the three labels to apply to a stream | |||
| of text. Section 4 describes how to interpret the labels on a stream of | of text. Section 4 describes how to interpret the labels on a stream of | |||
| text. | text. | |||
| 3.1 Definition of big-endian and little-endian | 3.1 Definition of big-endian and little-endian | |||
| Historically, computer hardware has processed two-octet entities such | Historically, computer hardware has processed two-octet entities such | |||
| as 16-bit integers in one of two ways. So-called "big-endian" hardware | as 16-bit integers in one of two ways. So-called "big-endian" hardware | |||
| handles two-octet entities with the higher-order octet first, that is | handles two-octet entities with the higher-order octet first, that is | |||
| skipping to change at line 200 | skipping to change at line 203 | |||
| void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */ | void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */ | |||
| { | { | |||
| putc(u >> 8, f); /* output high-order byte */ | putc(u >> 8, f); /* output high-order byte */ | |||
| putc(u & 0xFF, f); /* then low-order */ | putc(u & 0xFF, f); /* then low-order */ | |||
| } | } | |||
| The term "network byte order" has been used in many RFCs to indicate | The term "network byte order" has been used in many RFCs to indicate | |||
| big-endian serialization, although that term has yet to be formally | big-endian serialization, although that term has yet to be formally | |||
| defined in a standards-track document. Although ISO 10646 prefers | defined in a standards-track document. Although ISO 10646 prefers | |||
| big-endian serialization (section 6.3 of [ISO-10646]), it is likely | big-endian serialization (section 6.3 of [ISO-10646]), little-endian | |||
| that little-endian order will also be used on the Internet. | order is also sometimes used on the Internet. | |||
| 3.2 Byte order mark (BOM) | 3.2 Byte order mark (BOM) | |||
| The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | |||
| NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE | NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE | |||
| ORDER MARK" (abbreviated "BOM"). The latter name hints at a second | ORDER MARK" (abbreviated "BOM"). The latter name hints at a second | |||
| possible usage of the character, in addition to its normal use as a | possible usage of the character, in addition to its normal use as a | |||
| genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage, | genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage, | |||
| suggested by Unicode section 2.4 and ISO 10646 Annex F (informative), | suggested by Unicode section 2.4 and ISO 10646 Annex F (informative), | |||
| is to prepend a 0xFEFF character to a stream of Unicode characters as a | is to prepend a 0xFEFF character to a stream of Unicode characters as a | |||
| skipping to change at line 271 | skipping to change at line 274 | |||
| Text in the "UTF-16LE" charset MUST be serialized with the octets which | Text in the "UTF-16LE" charset MUST be serialized with the octets which | |||
| make up a single 16-bit UTF-16 value in little-endian order. Systems | make up a single 16-bit UTF-16 value in little-endian order. Systems | |||
| labelling UTF-16LE text MUST NOT prepend a BOM to the text. | labelling UTF-16LE text MUST NOT prepend a BOM to the text. | |||
| Any labelling application that uses UTF-16 character encoding, and puts | Any labelling application that uses UTF-16 character encoding, and puts | |||
| an explicit charset label on the text, and does not know the | an explicit charset label on the text, and does not know the | |||
| serialization order of the characters in text, MUST label the text as | serialization order of the characters in text, MUST label the text as | |||
| "UTF-16", and SHOULD make sure the text starts with 0xFEFF. | "UTF-16", and SHOULD make sure the text starts with 0xFEFF. | |||
| An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or | An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE" | |||
| "UTF-16LE" is that some document formats mandate a BOM in UTF-16 text, | would occur with document formats that mandate a BOM in UTF-16 text, | |||
| thereby requiring the use of the "UTF-16" tag only. | thereby requiring the use of the "UTF-16" tag only. | |||
| 4. Interpreting text labels | 4. Interpreting text labels | |||
| When a program sees text labelled as "UTF-16BE", "UTF-16LE", or | When a program sees text labelled as "UTF-16BE", "UTF-16LE", or | |||
| "UTF-16", it can make some assumptions, based on the labelling rules | "UTF-16", it can make some assumptions, based on the labelling rules | |||
| given in the previous section. These assumptions allow the program to | given in the previous section. These assumptions allow the program to | |||
| then process the text. | then process the text. | |||
| 4.1 Interpreting text labelled as UTF-16BE | 4.1 Interpreting text labelled as UTF-16BE | |||
| skipping to change at line 322 | skipping to change at line 325 | |||
| label MUST NOT assume the serialization without first checking the | label MUST NOT assume the serialization without first checking the | |||
| first two octets to see if they are a big-endian BOM, a little-endian | first two octets to see if they are a big-endian BOM, a little-endian | |||
| BOM, or not a BOM. All applications that process text with the "UTF-16" | BOM, or not a BOM. All applications that process text with the "UTF-16" | |||
| charset label MUST be able to interpret both big-endian and | charset label MUST be able to interpret both big-endian and | |||
| little-endian text. | little-endian text. | |||
| 5. Examples | 5. Examples | |||
| For the sake of example, let's suppose that there is a hieroglyphic | For the sake of example, let's suppose that there is a hieroglyphic | |||
| character representing the Egyptian god Ra with character value | character representing the Egyptian god Ra with character value | |||
| 0x00012345 (this character does not exist at present in Unicode). | 0x12345 (this character does not exist at present in Unicode). | |||
| The examples here all evaluate to the phrase: | The examples here all evaluate to the phrase: | |||
| *=Ra | *=Ra | |||
| where the "*" represents the Ra hieroglyph (0x00012345). | where the "*" represents the Ra hieroglyph (0x12345). | |||
| Text labelled with UTF-16BE, without a BOM: | Text labelled with UTF-16BE, without a BOM: | |||
| D8 08 DF 45 00 3D 00 52 00 61 | D8 08 DF 45 00 3D 00 52 00 61 | |||
| Text labelled with UTF-16LE, without a BOM: | Text labelled with UTF-16LE, without a BOM: | |||
| 08 D8 45 DF 3D 00 52 00 61 00 | 08 D8 45 DF 3D 00 52 00 61 00 | |||
| Big-endian text labelled with UTF-16, with a BOM: | Big-endian text labelled with UTF-16, with a BOM: | |||
| FE FF D8 08 DF 45 00 3D 00 52 00 61 | FE FF D8 08 DF 45 00 3D 00 52 00 61 | |||
| skipping to change at line 418 | skipping to change at line 421 | |||
| technology -- Universal Multiple-Octet Coded Character Set (UCS) -- | technology -- Universal Multiple-Octet Coded Character Set (UCS) -- | |||
| Part 1: Architecture and Basic Multilingual Plane. Twelve amendments | Part 1: Architecture and Basic Multilingual Plane. Twelve amendments | |||
| and two technical corrigenda have been published up to now. UTF-16 is | and two technical corrigenda have been published up to now. UTF-16 is | |||
| described in Annex Q, published as Amendment 1. Many other amendments | described in Annex Q, published as Amendment 1. Many other amendments | |||
| are currently at various stages of standardization. | are currently at various stages of standardization. | |||
| [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version | [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version | |||
| 2.1", Unicode Technical Report #8. | 2.0", ISBN 0-201-48345-9; with Unicode Technical Report #8, "The | |||
| Unicode Standard, Version 2.1", | ||||
| http://www.unicode.org/unicode/reports/tr8.html. | ||||
| [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC | |||
| 2279, January 1998. | 2279, January 1998. | |||
| [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set | [WORKSHOP] Weider, C., et al., "Report of the IAB Character Set | |||
| Workshop", RFC 2130, April 1997. | Workshop", RFC 2130, April 1997. | |||
| 9. Acknowledgments | 9. Acknowledgments | |||
| Deborah Goldsmith wrote a great deal of the initial wording for this | Deborah Goldsmith wrote a great deal of the initial wording for this | |||
| skipping to change at line 449 | skipping to change at line 454 | |||
| Murata Makoto | Murata Makoto | |||
| Larry Masinter | Larry Masinter | |||
| Markus Scherer | Markus Scherer | |||
| Ken Whistler | Ken Whistler | |||
| Some of the text in this specification was copied from [UTF-8], and | Some of the text in this specification was copied from [UTF-8], and | |||
| that document was worked on by many people. Please see the | that document was worked on by many people. Please see the | |||
| acknowledgments section in that document for more people who may have | acknowledgments section in that document for more people who may have | |||
| contributed indirectly to this document. | contributed indirectly to this document. | |||
| 10. Changes between draft -02 and -03 | 10. Changes between draft -03 and -04 | |||
| 1: Reorganized the sections. Added information about two octets being | ||||
| enough for all current characters and the committees saying they will | ||||
| not go beyond what can be defined in UTF-16. | ||||
| 2.1: Reworded step 2 with words to make it easier to read. | ||||
| 2.2: Added "U" to step 1. Also added note to the end of the last | ||||
| paragraph about string decoding and errors. | ||||
| 3: Added a reference to section 4 about interpreting labels. | 2: Added note at the end of the section about 0xD800-0xDFFF being | |||
| reserved for UTF-16. | ||||
| 3.1: Reworded last sentence in last paragraph. | 3: Spelled out CCS and CES in the first paragraph. Also put a reference | |||
| to Appendix A in the first paragraph. In the last paragraph, changed | ||||
| the last sentence to indicate that little-ending is already sometimes | ||||
| used on the Internet. | ||||
| 4.3: Added requirement that apps that can read UTF-16 must be able to | 3.3: Changed the last paragraph to explain which kind of rules it | |||
| interpret both big-endian and little-endian. | applies to. | |||
| 5: Corrected the examples due to wrong encoding. | 5: Changed "0x00012345" to "0x12345". | |||
| 11: Moved author's addresses to Appendix B. | 8: Changed the reference to [UNICODE]. | |||
| A. Charset registrations | A. Charset registrations | |||
| This memo is meant to serve as the basis for registration of three MIME | This memo is meant to serve as the basis for registration of three MIME | |||
| charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", | charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", | |||
| "UTF-16LE", and "UTF-16". These strings label objects containing text | "UTF-16LE", and "UTF-16". These strings label objects containing text | |||
| consisting of characters from the repertoire of ISO/IEC 10646 including | consisting of characters from the repertoire of ISO/IEC 10646 including | |||
| all amendments at least up to amendment 5 (Korean block), encoded to a | all amendments at least up to amendment 5 (Korean block), encoded to a | |||
| sequence of octets using the encoding and serialization schemes | sequence of octets using the encoding and serialization schemes | |||
| outlined above. | outlined above. | |||
| End of changes. 16 change blocks. | ||||
| 33 lines changed or deleted | 33 lines changed or added | |||
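
The encoding steps of section 2.1 and the octet serialization of section 3.1 can be sketched in C as follows. This is only an illustration under stated assumptions, not text from either draft: the names `utf16_encode` and `utf16_serialize` are hypothetical, C99 fixed-width integer types are assumed, and the sketch writes into a caller-supplied buffer rather than a `FILE` stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one character value u following the
 * steps of section 2.1. Stores the result in w[0] (and w[1] if a
 * surrogate pair is needed). Returns the number of 16-bit integers
 * produced (1 or 2), or 0 if u is greater than 0x10FFFF and so
 * cannot be encoded in UTF-16. */
size_t utf16_encode(uint32_t u, uint16_t w[2])
{
    if (u > 0x10FFFF)
        return 0;                          /* not encodable */
    if (u < 0x10000) {                     /* step 1: single 16-bit integer */
        w[0] = (uint16_t)u;
        return 1;
    }
    u -= 0x10000;                          /* step 2: U' fits in 20 bits */
    w[0] = (uint16_t)(0xD800 | (u >> 10)); /* high-order 10 bits of U' */
    w[1] = (uint16_t)(0xDC00 | (u & 0x3FF)); /* low-order 10 bits of U' */
    return 2;
}

/* Hypothetical helper: serialize n 16-bit integers into out, which
 * must have room for 2*n octets. A nonzero big_endian puts the
 * higher-order octet first; zero puts the lower-order octet first.
 * No BOM is added. Returns the number of octets written. */
size_t utf16_serialize(const uint16_t *w, size_t n, int big_endian,
                       unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)(w[i] >> 8);
        unsigned char lo = (unsigned char)(w[i] & 0xFF);
        out[2 * i]     = big_endian ? hi : lo;
        out[2 * i + 1] = big_endian ? lo : hi;
    }
    return 2 * n;
}
```

Applied to the section 5 example value 0x12345, the sketch yields the 16-bit integers 0xD808 and 0xDF45, which serialize to D8 08 DF 45 in big-endian order and 08 D8 45 DF in little-endian order, matching the octets shown in the examples.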