| < draft-yergeau-rfc2279bis-01.txt | draft-yergeau-rfc2279bis-02.txt > | |||
|---|---|---|---|---|
| Network Working Group F. Yergeau | Network Working Group F. Yergeau | |||
| Internet-Draft Alis Technologies | Internet-Draft Alis Technologies | |||
| Expires: March 2, 2003 September 1, 2002 | Expires: April 9, 2003 October 9, 2002 | |||
| UTF-8, a transformation format of ISO 10646 | UTF-8, a transformation format of ISO 10646 | |||
| draft-yergeau-rfc2279bis-01 | draft-yergeau-rfc2279bis-02 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 31 ¶ | skipping to change at page 1, line 31 ¶ | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on March 2, 2003. | This Internet-Draft will expire on April 9, 2003. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2002). All Rights Reserved. | Copyright (C) The Internet Society (2002). All Rights Reserved. | |||
| Abstract | Abstract | |||
| <1> | <1> | |||
| ISO/IEC 10646-1 defines a large character set called the Universal | ISO/IEC 10646-1 defines a large character set called the Universal | |||
| Character Set (UCS) which encompasses most of the world's writing | Character Set (UCS) which encompasses most of the world's writing | |||
| skipping to change at page 6, line 48 ¶ | skipping to change at page 6, line 48 ¶ | |||
| <23> | <23> | |||
| 1. Determine the number of octets required from the character number | 1. Determine the number of octets required from the character number | |||
| and the first column of the table above. It is important to note | and the first column of the table above. It is important to note | |||
| that the rows of the table are mutually exclusive, i.e. there is | that the rows of the table are mutually exclusive, i.e. there is | |||
| only one valid way to encode a given character. | only one valid way to encode a given character. | |||
| <24> | <24> | |||
| 2. Prepare the high-order bits of the octets as per the second | 2. Prepare the high-order bits of the octets as per the second | |||
| column of the table. | column of the table. | |||
| <25> | <25> | |||
| 3. Fill in the bits marked x from the bits of the character number, | 3. Fill in the bits marked x from the bits of the character number, | |||
| expressed in binary. Start from the lower-order bits of the | expressed in binary. Start by putting the lowest-order bit of | |||
| character number and put them first in the last octet of the | the character number in the lowest-order position of the last | |||
| sequence, then the next to last, etc. until all x bits are | octet of the sequence, then put the next higher-order bit of the | |||
| filled in. | character number in the next higher-order position of that octet, | |||
| etc. When the x bits of the last octet are filled in, move on to | ||||
| the next to last octet, then to the preceding one, etc. until | ||||
| all x bits are filled in. | ||||
| <26> | <26> | |||
| The definition of UTF-8 prohibits encoding character numbers between | The definition of UTF-8 prohibits encoding character numbers between | |||
| U+D800 and U+DFFF, which are reserved for use with the UTF-16 | U+D800 and U+DFFF, which are reserved for use with the UTF-16 | |||
| encoding form (as surrogate pairs) and do not directly represent | encoding form (as surrogate pairs) and do not directly represent | |||
| characters. When encoding in UTF-8 from UTF-16 data, it is necessary | characters. When encoding in UTF-8 from UTF-16 data, it is necessary | |||
| to first decode the UTF-16 data to obtain character numbers, which | to first decode the UTF-16 data to obtain character numbers, which | |||
| are then encoded in UTF-8 as described above. | are then encoded in UTF-8 as described above. This contrasts with | |||
| CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for | ||||
| use on the Internet. CESU-8 operates similarly to UTF-8 but encodes | ||||
| the UTF-16 code values (16-bit quantities) instead of the character | ||||
| number (code point). This leads to different results for character | ||||
| numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT | ||||
| valid UTF-8. | ||||
| <27> | <27> | |||
| Decoding a UTF-8 character proceeds as follows: | Decoding a UTF-8 character proceeds as follows: | |||
| <28> | <28> | |||
| 1. Initialize a binary number with all bits set to 0. Up to 31 bits | 1. Initialize a binary number with all bits set to 0. Up to 31 bits | |||
| may be needed (up to 21 if the range of character numbers is | may be needed (up to 21 if the range of character numbers is | |||
| known to be restricted to the UTF-16 accessible range). | known to be restricted to the UTF-16 accessible range). | |||
| <29> | <29> | |||
| 2. Determine which bits encode the character number from the number | 2. Determine which bits encode the character number from the number | |||
| of octets in the sequence and the second column of the table | of octets in the sequence and the second column of the table | |||
| above (the bits marked x). | above (the bits marked x). | |||
| skipping to change at page 8, line 7 ¶ | skipping to change at page 8, line 7 ¶ | |||
| <31> | <31> | |||
| Implementations of the decoding algorithm above MUST protect against | Implementations of the decoding algorithm above MUST protect against | |||
| decoding invalid sequences. For instance, a naive implementation may | decoding invalid sequences. For instance, a naive implementation may | |||
| decode the overlong UTF-8 sequence C0 80 into the character U+0000, | decode the overlong UTF-8 sequence C0 80 into the character U+0000, | |||
| or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding | or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding | |||
| invalid sequences may have security consequences or cause other | invalid sequences may have security consequences or cause other | |||
| problems. See Security Considerations (Section 10) below. | problems. See Security Considerations (Section 10) below. | |||
| 4. Syntax of UTF-8 Byte Sequences | 4. Syntax of UTF-8 Byte Sequences | |||
| <32> | <32> | |||
| A UTF-8 string is a sequence of bytes representing a sequence of UCS | A UTF-8 string is a sequence of octets representing a sequence of UCS | |||
| characters. The byte sequence is valid UTF-8 only if it matches the | characters. An octet sequence is valid UTF-8 only if it matches the | |||
| following syntax, which is derived from the rules for encoding UTF-8 | following syntax, which is derived from the rules for encoding UTF-8 | |||
| and is expressed in the ABNF of [RFC2234]. | and is expressed in the ABNF of [RFC2234]. | |||
| UTF8-string = *( UTF8-char ) | UTF8-octets = *( UTF8-char ) | |||
| UTF8-char = UTF8-1 / | ||||
| UTF8-2-head 1( UTF8-tail ) / | ||||
| UTF8-3-head 1( UTF8-tail ) / | ||||
| UTF8-4-head 2( UTF8-tail ) / | ||||
| UTF8-5-head 3( UTF8-tail ) / | ||||
| UTF8-6-head 4( UTF8-tail ) | ||||
| UTF8-1 = %x00-7F | ||||
| UTF8-2-head = %xC2-DF | ||||
| UTF8-3-head = %xE0 %xA0-BF / %xE1-EC %x80-BF / | ||||
| %xED %x80-9F / %xEE-EF %x80-BF | ||||
| UTF8-4-head = %xF0 %x90-BF / %xF1-F7 %x80-BF | ||||
| UTF8-5-head = %xF8 %x88-BF / %xF9-FB %x80-BF | ||||
| UTF8-6-head = %xFC %x84-BF / %xFD %x80-BF | ||||
| UTF8-tail = %x80-BF | ||||
| UTF8-string = *( UTF8-char ) | ||||
| UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6 | UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6 | |||
| UTF8-char = UTF8-1 / | ||||
| UTF8-2 / | ||||
| UTF8-3 / | ||||
| UTF8-4 / | ||||
| UTF8-5 / | ||||
| UTF8-6 | ||||
| UTF8-1 = %x00-7F | UTF8-1 = %x00-7F | |||
| UTF8-2 = %xC2-DF UTF8-tail | UTF8-2 = %xC2-DF UTF8-tail | |||
| UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) / | UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) / | |||
| %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail ) | %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail ) | |||
| UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F7 3( UTF8-tail ) | UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F7 3( UTF8-tail ) | |||
| UTF8-5 = %xF8 %x88-BF 3( UTF8-tail ) / %xF9-FB 4( UTF8-tail ) | UTF8-5 = %xF8 %x88-BF 3( UTF8-tail ) / %xF9-FB 4( UTF8-tail ) | |||
| UTF8-6 = %xFC %x84-BF 4( UTF8-tail ) / %xFD 5( UTF8-tail ) | UTF8-6 = %xFC %x84-BF 4( UTF8-tail ) / %xFD 5( UTF8-tail ) | |||
| UTF8-tail = %x80-BF | UTF8-tail = %x80-BF | |||
| 5. Versions of the standards | 5. Versions of the standards | |||
| skipping to change at page 10, line 7 ¶ | skipping to change at page 10, line 7 ¶ | |||
| The incident has been dubbed the "Korean mess", and the relevant | The incident has been dubbed the "Korean mess", and the relevant | |||
| committees have pledged to never, ever again make such an | committees have pledged to never, ever again make such an | |||
| incompatible change (see Unicode Consortium Policies [1]). | incompatible change (see Unicode Consortium Policies [1]). | |||
| <35> | <35> | |||
| New versions, and in particular any incompatible changes, have | New versions, and in particular any incompatible changes, have | |||
| consequences regarding MIME charset labels, to be discussed in MIME | consequences regarding MIME charset labels, to be discussed in MIME | |||
| registration (Section 8). | registration (Section 8). | |||
| 6. Byte order mark (BOM) | 6. Byte order mark (BOM) | |||
| <36> | <36> | |||
| The Unicode Standard and ISO 10646 define the character "ZERO WIDTH | The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known | |||
| NO-BREAK SPACE" (U+FEFF), which is also known informally as "BYTE | informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character | |||
| ORDER MARK" (abbreviated "BOM"). The latter name hints at a second | can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but | |||
| possible usage of the character, in addition to its normal use as a | the BOM name hints at a second possible usage of the character: to | |||
| genuine "ZERO WIDTH NO-BREAK SPACE" within text. This usage, | prepend a U+FEFF character to a stream of UCS characters as a | |||
| suggested by Unicode section 2.7 and ISO/IEC 10646 Annex H | "signature". A receiver of such a serialized stream may then use the | |||
| (informative), is to prepend a U+FEFF character to a stream of UCS | initial character as a hint that the stream consists of UCS | |||
| characters as a "signature"; a receiver of such a serialized stream | characters and also to recognize which UCS encoding is involved and, | |||
| may then use the initial character as a hint that the stream consists | with encodings having a multi-octet encoding unit, as a way to | |||
| of UCS characters. The signature can also be used to recognize which | recognize the serialization order of the octets. UTF-8 having a | |||
| UCS encoding is involved and, with encodings having a multi-octet | single-octet encoding unit, this last function is useless and the BOM | |||
| encoding unit, as a way to recognize the serialization order of the | will always appear as the octet sequence EF BB BF. | |||
| octets. UTF-8 having a single-octet encoding unit, this last | ||||
| function is useless and the BOM will always appear as the octet | ||||
| sequence EF BB BF. | ||||
| <37> | <37> | |||
| It is important to understand that the character U+FEFF appearing at | It is important to understand that the character U+FEFF appearing at | |||
| any position other than the beginning of a stream MUST be interpreted | any position other than the beginning of a stream MUST be interpreted | |||
| with the semantics for the zero-width non-breaking space, and MUST | with the semantics for the zero-width non-breaking space, and MUST | |||
| NOT be interpreted as a byte-order mark. The contrapositive of that | NOT be interpreted as a signature. When interpreted as a signature, | |||
| statement is not always true: the character U+FEFF in the first | the Unicode standard suggests that an initial U+FEFF character may be | |||
| position of a stream MAY be interpreted as a zero-width non-breaking | stripped before processing the text. Such stripping is necessary in | |||
| space, and is not always a byte-order mark. For example, if a | some cases (e.g. when concatenating two strings, because otherwise | |||
| process splits a UCS string into many parts, a part might begin with | the resulting string may contain an unintended "ZERO WIDTH NO-BREAK | |||
| U+FEFF because there was a zero-width no-break space at the beginning | SPACE" at the connection point), but might affect an external process | |||
| of that substring. | at a different layer (such as a digital signature or a count of the | |||
| characters) that is relying on the presence of all characters in the | ||||
| stream. It is therefore RECOMMENDED to avoid stripping an initial | ||||
| U+FEFF interpreted as a signature without a good reason, to ignore it | ||||
| instead of stripping it when appropriate (such as for display) and to | ||||
| strip it only when really necessary. | ||||
| <38> | <38> | |||
| The Unicode standard further suggests that an initial U+FEFF | U+FEFF in the first position of a stream MAY be interpreted as a | |||
| character may be stripped before processing the text, the rationale | zero-width non-breaking space, and is not always a signature. In an | |||
| being that such a character in initial position may be an artifact of | attempt at diminishing this uncertainty, Unicode 3.2 adds a new | |||
| the encoding (an encoding signature), not a genuine intended "ZERO | character, U+2060 "WORD JOINER", with exactly the same semantics and | |||
| WIDTH NO-BREAK SPACE". Note that such stripping might affect an | ||||
| external process at a different layer (such as a digital signature or | ||||
| a count of the characters) that is relying on the presence of all | ||||
| characters in the stream. | ||||
| <39> | ||||
| In particular, in UTF-8 plain text it is likely, but not certain, | ||||
| that an initial octet sequence of EF BB BF is a signature. When | ||||
| concatenating two strings, it is important to strip out those | ||||
| signatures, because otherwise the resulting string may contain an | ||||
| unintended "ZERO WIDTH NO-BREAK SPACE" at the connection point. | ||||
| <40> | ||||
| In an attempt at diminishing the uncertainty, Unicode 3.2 adds a new | ||||
| character, U+2060 WORD JOINER, with exactly the same semantics and | ||||
| usage as U+FEFF except for the signature function, and strongly | usage as U+FEFF except for the signature function, and strongly | |||
| recommends its exclusive use for expressing word-joining semantics. | recommends its exclusive use for expressing word-joining semantics. | |||
| Eventually, following this recommendation will make it all but | Eventually, following this recommendation will make it all but | |||
| certain that any initial U+FEFF is a signature, not an intended "ZERO | certain that any initial U+FEFF is a signature, not an intended "ZERO | |||
| WIDTH NO-BREAK SPACE". | WIDTH NO-BREAK SPACE". | |||
| <39> | ||||
| In the meantime, the uncertainty unfortunately remains and may affect | ||||
| Internet protocols. Protocol specifications MAY restrict usage of | ||||
| U+FEFF as a signature in order to reduce or eliminate the potential | ||||
| ill effects of this uncertainty. In the interest of striking a | ||||
| balance between the advantages (reduction of uncertainty) and | ||||
| drawbacks (loss of the signature function) of such restrictions, it | ||||
| is useful to distinguish a few cases: | ||||
| 7. Examples | <40> | |||
| o A protocol SHOULD forbid use of U+FEFF as a signature for those | ||||
| textual protocol elements that the protocol mandates to be always | ||||
| UTF-8, the signature function being totally useless in those | ||||
| cases. | ||||
| <41> | <41> | |||
| o A protocol SHOULD also forbid use of U+FEFF as a signature for | ||||
| those textual protocol elements for which the protocol provides | ||||
| character encoding identification mechanisms, when it is expected | ||||
| that implementations of the protocol will be in a position to | ||||
| always use the mechanisms properly. This will be the case when | ||||
| the protocol elements are maintained tightly under the control of | ||||
| the implementation from the time of their creation to the time of | ||||
| their (properly labelled) transmission. | ||||
| <42> | ||||
| o A protocol SHOULD NOT forbid use of U+FEFF as a signature for | ||||
| those textual protocol elements for which the protocol does not | ||||
| provide character encoding identification mechanisms, when a ban | ||||
| would be unenforceable, or when it is expected that | ||||
| implementations of the protocol will not be in a position to | ||||
| always use the mechanisms properly. The latter two cases are | ||||
| likely to occur with larger protocol elements such as MIME | ||||
| entities, especially when implementations of the protocol will | ||||
| obtain such entities from file systems, from protocols that do not | ||||
| have encoding identification mechanisms for payloads (such as FTP) | ||||
| or from other protocols that do not guarantee proper | ||||
| identification of character encoding (such as HTTP). | ||||
| <43> | ||||
| When a protocol forbids use of U+FEFF as a signature for a certain | ||||
| protocol element, then any initial U+FEFF in that protocol element | ||||
| MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE". When a | ||||
| protocol does NOT forbid use of U+FEFF as a signature for a certain | ||||
| protocol element, then implementations SHOULD be prepared to handle a | ||||
| signature in that element and react appropriately: using the | ||||
| signature to identify the character encoding as necessary and | ||||
| stripping or ignoring the signature as appropriate. | ||||
| 7. Examples | ||||
| <44> | ||||
| The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL | The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL | |||
| TO><ALPHA>." is encoded in UTF-8 as follows: | TO><ALPHA>." is encoded in UTF-8 as follows: | |||
| --+--------+-----+-- | --+--------+-----+-- | |||
| 41 E2 89 A2 CE 91 2E | 41 E2 89 A2 CE 91 2E | |||
| --+--------+-----+-- | --+--------+-----+-- | |||
| <42> | <45> | |||
| The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo", | The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo", | |||
| meaning "the Korean language") is encoded in UTF-8 as follows: | meaning "the Korean language") is encoded in UTF-8 as follows: | |||
| --------+--------+-------- | --------+--------+-------- | |||
| ED 95 9C EA B5 AD EC 96 B4 | ED 95 9C EA B5 AD EC 96 B4 | |||
| --------+--------+-------- | --------+--------+-------- | |||
| <43> | <46> | |||
| The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo", | The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo", | |||
| meaning "the Japanese language") is encoded in UTF-8 as follows: | meaning "the Japanese language") is encoded in UTF-8 as follows: | |||
| --------+--------+-------- | --------+--------+-------- | |||
| E6 97 A5 E6 9C AC E8 AA 9E | E6 97 A5 E6 9C AC E8 AA 9E | |||
| --------+--------+-------- | --------+--------+-------- | |||
| <44> | <47> | |||
| The character U+233B4 (a Chinese character meaning 'stump of tree'), | The character U+233B4 (a Chinese character meaning 'stump of tree'), | |||
| prepended with a UTF-8 BOM, is encoded in UTF-8 as follows: | prepended with a UTF-8 BOM, is encoded in UTF-8 as follows: | |||
| --------+----------- | --------+----------- | |||
| EF BB BF F0 A3 8E B4 | EF BB BF F0 A3 8E B4 | |||
| --------+----------- | --------+----------- | |||
| 8. MIME registration | 8. MIME registration | |||
| <45> | <48> | |||
| This memo serves as the basis for registration of the MIME charset | This memo serves as the basis for registration of the MIME charset | |||
| parameter for UTF-8, according to [RFC2978]. The charset parameter | parameter for UTF-8, according to [RFC2978]. The charset parameter | |||
| value is "UTF-8". This string labels media types containing text | value is "UTF-8". This string labels media types containing text | |||
| consisting of characters from the repertoire of ISO/IEC 10646 | consisting of characters from the repertoire of ISO/IEC 10646 | |||
| including all amendments at least up to amendment 5 of the 1993 | including all amendments at least up to amendment 5 of the 1993 | |||
| edition (Korean block), encoded to a sequence of octets using the | edition (Korean block), encoded to a sequence of octets using the | |||
| encoding scheme outlined above. UTF-8 is suitable for use in MIME | encoding scheme outlined above. UTF-8 is suitable for use in MIME | |||
| content types under the "text" top-level type. | content types under the "text" top-level type. | |||
| <46> | <49> | |||
| It is noteworthy that the label "UTF-8" does not contain a version | It is noteworthy that the label "UTF-8" does not contain a version | |||
| identification, referring generically to ISO/IEC 10646. This is | identification, referring generically to ISO/IEC 10646. This is | |||
| intentional, the rationale being as follows: | intentional, the rationale being as follows: | |||
| <47> | <50> | |||
| A MIME charset label is designed to give just the information needed | A MIME charset label is designed to give just the information needed | |||
| to interpret a sequence of bytes received on the wire into a sequence | to interpret a sequence of bytes received on the wire into a sequence | |||
| of characters, nothing more (see [RFC2045], section 2.2). As long as | of characters, nothing more (see [RFC2045], section 2.2). As long as | |||
| a character set standard does not change incompatibly, version | a character set standard does not change incompatibly, version | |||
| numbers serve no purpose, because one gains nothing by learning from | numbers serve no purpose, because one gains nothing by learning from | |||
| the tag that newly assigned characters may be received that one | the tag that newly assigned characters may be received that one | |||
| doesn't know about. The tag itself doesn't teach anything about the | doesn't know about. The tag itself doesn't teach anything about the | |||
| new characters, which are going to be received anyway. | new characters, which are going to be received anyway. | |||
| <48> | <51> | |||
| Hence, as long as the standards evolve compatibly, the apparent | Hence, as long as the standards evolve compatibly, the apparent | |||
| advantage of having labels that identify the versions is only that, | advantage of having labels that identify the versions is only that, | |||
| apparent. But there is a disadvantage to such version-dependent | apparent. But there is a disadvantage to such version-dependent | |||
| labels: when an older application receives data accompanied by a | labels: when an older application receives data accompanied by a | |||
| newer, unknown label, it may fail to recognize the label and be | newer, unknown label, it may fail to recognize the label and be | |||
| completely unable to deal with the data, whereas a generic, known | completely unable to deal with the data, whereas a generic, known | |||
| label would have triggered mostly correct processing of the data, | label would have triggered mostly correct processing of the data, | |||
| which may well not contain any new characters. | which may well not contain any new characters. | |||
| <49> | <52> | |||
| Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible | Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible | |||
| change, in principle contradicting the appropriateness of a version | change, in principle contradicting the appropriateness of a version | |||
| independent MIME charset label as described above. But the | independent MIME charset label as described above. But the | |||
| compatibility problem can only appear with data containing Korean | compatibility problem can only appear with data containing Korean | |||
| Hangul characters encoded according to Unicode 1.1 (or equivalently | Hangul characters encoded according to Unicode 1.1 (or equivalently | |||
| ISO/IEC 10646 before amendment 5), and there is arguably no such data | ISO/IEC 10646 before amendment 5), and there is arguably no such data | |||
| to worry about, this being the very reason the incompatible change | to worry about, this being the very reason the incompatible change | |||
| was deemed acceptable. | was deemed acceptable. | |||
| <50> | <53> | |||
| In practice, then, a version-independent label is warranted, provided | In practice, then, a version-independent label is warranted, provided | |||
| the label is understood to refer to all versions after Amendment 5, | the label is understood to refer to all versions after Amendment 5, | |||
| and provided no incompatible change actually occurs. Should | and provided no incompatible change actually occurs. Should | |||
| incompatible changes occur in a later version of ISO/IEC 10646, the | incompatible changes occur in a later version of ISO/IEC 10646, the | |||
| MIME charset label defined here will stay aligned with the previous | MIME charset label defined here will stay aligned with the previous | |||
| version until and unless the IETF specifically decides otherwise. | version until and unless the IETF specifically decides otherwise. | |||
| 9. IANA Considerations | 9. IANA Considerations | |||
| <51> | <54> | |||
| The entry for UTF-8 in the IANA charset registry should be updated to | The entry for UTF-8 in the IANA charset registry should be updated to | |||
| point to this memo. | point to this memo. | |||
| 10. Security Considerations | 10. Security Considerations | |||
| <52> | <55> | |||
| Implementors of UTF-8 need to consider the security aspects of how | Implementors of UTF-8 need to consider the security aspects of how | |||
| they handle illegal UTF-8 sequences. It is conceivable that in some | they handle illegal UTF-8 sequences. It is conceivable that in some | |||
| circumstances an attacker would be able to exploit an incautious UTF- | circumstances an attacker would be able to exploit an incautious UTF- | |||
| 8 parser by sending it an octet sequence that is not permitted by the | 8 parser by sending it an octet sequence that is not permitted by the | |||
| UTF-8 syntax. | UTF-8 syntax. | |||
| <53> | <56> | |||
| A particularly subtle form of this attack can be carried out against | A particularly subtle form of this attack can be carried out against | |||
| a parser which performs security-critical validity checks against the | a parser which performs security-critical validity checks against the | |||
| UTF-8 encoded form of its input, but interprets certain illegal octet | UTF-8 encoded form of its input, but interprets certain illegal octet | |||
| sequences as characters. For example, a parser might prohibit the | sequences as characters. For example, a parser might prohibit the | |||
| NUL character when encoded as the single-octet sequence 00, but | NUL character when encoded as the single-octet sequence 00, but | |||
| erroneously allow the illegal two-octet sequence C0 80 and interpret | erroneously allow the illegal two-octet sequence C0 80 and interpret | |||
| it as a NUL character. Another example might be a parser which | it as a NUL character. Another example might be a parser which | |||
| prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the | prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the | |||
| illegal octet sequence 2F C0 AE 2E 2F. This last exploit has | illegal octet sequence 2F C0 AE 2E 2F. This last exploit has | |||
| actually been used in a widespread virus attacking Web servers in | actually been used in a widespread virus attacking Web servers in | |||
| 2001; the security threat is thus very real. | 2001; the security threat is thus very real. | |||
| Bibliography | Bibliography | |||
| [CESU-8] Phipps, T., "Compatibility Encoding Scheme for UTF-16: | ||||
| 8-Bit (CESU-8)", UTR 26, April 2002, <http:// | ||||
| www.unicode.org/unicode/reports/tr26/>. | ||||
| [FSS_UTF] X/Open Company Ltd., "X/Open CAE Specification C501 -- | [FSS_UTF] X/Open Company Ltd., "X/Open CAE Specification C501 -- | |||
| File System Safe UCS Transformation Format (FSS_UTF)", | File System Safe UCS Transformation Format (FSS_UTF)", | |||
| ISBN 1-85912-082-2, April 1995. | ISBN 1-85912-082-2, April 1995. | |||
| [ISO.10646-1] International Organization for Standardization, | [ISO.10646-1] International Organization for Standardization, | |||
| "Information Technology - Universal Multiple-octet | "Information Technology - Universal Multiple-octet | |||
| coded Character Set (UCS) - Part 1: Architecture and | coded Character Set (UCS) - Part 1: Architecture and | |||
| Basic Multilingual Plane", ISO Standard 10646-1, 2000. | Basic Multilingual Plane", ISO Standard 10646-1, 2000. | |||
| [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet | [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet | |||
| skipping to change at page 18, line 6 ¶ | skipping to change at page 18, line 6 ¶ | |||
| Alis Technologies | Alis Technologies | |||
| 100, boul. Alexis-Nihon, bureau 600 | 100, boul. Alexis-Nihon, bureau 600 | |||
| Montréal, QC H4M 2P2 | Montréal, QC H4M 2P2 | |||
| Canada | Canada | |||
| Phone: +1 514 747 2547 | Phone: +1 514 747 2547 | |||
| Fax: +1 514 747 2561 | Fax: +1 514 747 2561 | |||
| EMail: fyergeau@alis.com | EMail: fyergeau@alis.com | |||
| Appendix A. Acknowledgements | Appendix A. Acknowledgements | |||
| <62> | <65> | |||
| The following have participated in the drafting and discussion of | The following have participated in the drafting and discussion of | |||
| this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer, | this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer, | |||
| Mark Davis, Martin J. Dürst, Patrick Fältström, Ned Freed, David | Mark Davis, Martin J. Dürst, Patrick Fältström, Ned Freed, David | |||
| Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood, | Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood, | |||
| Kent Karlsson, Markus Kuhn, Michael Kung, Alain LaBonté, John | Simon Josefsson, Kent Karlsson, Markus Kuhn, Michael Kung, Alain | |||
| Gardiner Myers, Dan Oscarsson, Murray Sargent, Markus Scherer, Keld | LaBonté, Ira McDonald, Alexey Melnikov, John Gardiner Myers, Dan | |||
| Simonsen, Arnold Winkler, Kenneth Whistler and Misha Wolf. | Oscarsson, Murray Sargent, Markus Scherer, Keld Simonsen, Arnold | |||
| Winkler, Kenneth Whistler and Misha Wolf. | ||||
| Appendix B. Changes from RFC 2279 | Appendix B. Changes from RFC 2279 | |||
| <63> | <66> | |||
| <64> | <67> | |||
| o Significantly shortened Introduction. No more mention of UTF-1 or | o Significantly shortened Introduction. No more mention of UTF-1 or | |||
| UTF-7, of Transformation Formats. | UTF-7, of Transformation Formats. | |||
| <65> | <68> | |||
| o Straightened out terminology. UTF-8 now described in terms of an | o Straightened out terminology. UTF-8 now described in terms of an | |||
| encoding form of the character number. UCS-2 and UCS-4 almost | encoding form of the character number. UCS-2 and UCS-4 almost | |||
| disappeared. | disappeared. | |||
| <66> | <69> | |||
| o Note warning against decoding of invalid sequences turned into a | o Note warning against decoding of invalid sequences turned into a | |||
| normative MUST NOT. | normative MUST NOT. | |||
| <67> | <70> | |||
| o New section about the BOM, mostly extracted and slightly adapted | o New section about the UTF-8 BOM, with advice for protocols. | |||
| from RFC 2781. | <71> | |||
| <68> | ||||
| o Updated a couple of references (10646-1:2000, Unicode 3.2, RFC | o Updated a couple of references (10646-1:2000, Unicode 3.2, RFC | |||
| 2978). | 2978). | |||
| <69> | <72> | |||
| o Added TOC. | o Added TOC. | |||
| <70> | <73> | |||
| o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration. | o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration. | |||
| <71> | <74> | |||
| o New "Notational conventions" section about RFC 2119 and U+HHHH | o New "Notational conventions" section about RFC 2119 and U+HHHH | |||
| notation. | notation. | |||
| <72> | <75> | |||
| o Pointer to Unicode Consortium Policies added in "Versions of the | o Pointer to Unicode Consortium Policies added in "Versions of the | |||
| standards" section. | standards" section. | |||
| <73> | <76> | |||
| o Added a fourth example with a non-BMP character and a BOM. | o Added a fourth example with a non-BMP character and a BOM. | |||
| <74> | <77> | |||
| o Added a paragraph about U+2060 WORD JOINER. | o Added a paragraph about U+2060 WORD JOINER. | |||
| <75> | <78> | |||
| o Enumerate more byte values impossible in UTF-8, either as a result | o Enumerate more byte values impossible in UTF-8, either as a result | |||
| of forbidding overlong sequences or of restricting to the UTF-16 | of forbidding overlong sequences or of restricting to the UTF-16 | |||
| accessible range. | accessible range. | |||
| <76> | <79> | |||
| o Added "IANA Considerations" section to ask that the UTF-8 entry in | o Added "IANA Considerations" section to ask that the UTF-8 entry in | |||
| the charset registry point to this memo. | the charset registry point to this memo. | |||
| <80> | ||||
| o Added an ABNF syntax for valid UTF-8 octet sequences | ||||
| <81> | ||||
| o Added some warning language about CESU-8 | ||||
| Full Copyright Statement | Full Copyright Statement | |||
| Copyright (C) The Internet Society (2002). All Rights Reserved. | Copyright (C) The Internet Society (2002). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| End of changes. 43 change blocks. | ||||
| 106 lines changed or deleted | 135 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||