draft-yergeau-rfc2279bis-04.txt   draft-yergeau-rfc2279bis-05.txt
Network Working Group F. Yergeau Network Working Group F. Yergeau
Internet-Draft Alis Technologies Internet-Draft Alis Technologies
Expires: August 18, 2003 February 17, 2003 Expires: December 8, 2003 June 9, 2003
UTF-8, a transformation format of ISO 10646 UTF-8, a transformation format of ISO 10646
draft-yergeau-rfc2279bis-04 draft-yergeau-rfc2279bis-05
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts. groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at http://
http://www.ietf.org/ietf/1id-abstracts.txt. www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 18, 2003. This Internet-Draft will expire on December 8, 2003.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract Abstract
ISO/IEC 10646-1 defines a large character set called the Universal ISO/IEC 10646-1 defines a large character set called the Universal
Character Set (UCS) which encompasses most of the world's writing Character Set (UCS) which encompasses most of the world's writing
systems. The originally proposed encodings of the UCS, however, were systems. The originally proposed encodings of the UCS, however, were
skipping to change at page 2, line 16 skipping to change at page 2, line 16
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Notational conventions . . . . . . . . . . . . . . . . . . . . 4 2. Notational conventions . . . . . . . . . . . . . . . . . . . . 4
3. UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . . 4 3. UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . . 4
4. Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . . 6 4. Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . . 6
5. Versions of the standards . . . . . . . . . . . . . . . . . . 6 5. Versions of the standards . . . . . . . . . . . . . . . . . . 6
6. Byte order mark (BOM) . . . . . . . . . . . . . . . . . . . . 7 6. Byte order mark (BOM) . . . . . . . . . . . . . . . . . . . . 7
7. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
8. MIME registration . . . . . . . . . . . . . . . . . . . . . . 9 8. MIME registration . . . . . . . . . . . . . . . . . . . . . . 9
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
10. Security Considerations . . . . . . . . . . . . . . . . . . . 10 10. Security Considerations . . . . . . . . . . . . . . . . . . . 11
11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
12. Changes from RFC 2279 . . . . . . . . . . . . . . . . . . . . 11 12. Changes from RFC 2279 . . . . . . . . . . . . . . . . . . . . 12
Normative references . . . . . . . . . . . . . . . . . . . . . 12 Normative references . . . . . . . . . . . . . . . . . . . . . 12
Informative references . . . . . . . . . . . . . . . . . . . . 12 Informative references . . . . . . . . . . . . . . . . . . . . 13
Author's Address . . . . . . . . . . . . . . . . . . . . . . . 13 Author's Address . . . . . . . . . . . . . . . . . . . . . . . 14
Intellectual Property and Copyright Statements . . . . . . . . 14 Intellectual Property and Copyright Statements . . . . . . . . 15
1. Introduction 1. Introduction
ISO/IEC 10646 [ISO.10646] defines a large character set called the ISO/IEC 10646 [ISO.10646] defines a large character set called the
Universal Character Set (UCS), which encompasses most of the world's Universal Character Set (UCS), which encompasses most of the world's
writing systems. The same set of characters is defined by the Unicode writing systems. The same set of characters is defined by the Unicode
standard [UNICODE], which further defines additional character standard [UNICODE], which further defines additional character
properties and other application details of great interest to properties and other application details of great interest to
implementers. Up to the present time, changes in Unicode and implementers. Up to the present time, changes in Unicode and
amendments and additions to ISO/IEC 10646 have tracked each other, so amendments and additions to ISO/IEC 10646 have tracked each other, so
skipping to change at page 3, line 34 skipping to change at page 3, line 34
UTF-8, the object of this memo, has a one-octet encoding unit. It UTF-8, the object of this memo, has a one-octet encoding unit. It
uses all bits of an octet, but has the quality of preserving the full uses all bits of an octet, but has the quality of preserving the full
US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
octet having the normal US-ASCII value, and any octet with such a octet having the normal US-ASCII value, and any octet with such a
value can only stand for a US-ASCII character, and nothing else. value can only stand for a US-ASCII character, and nothing else.
UTF-8 encodes UCS characters as a varying number of octets, where the UTF-8 encodes UCS characters as a varying number of octets, where the
number of octets, and the value of each, depend on the integer value number of octets, and the value of each, depend on the integer value
assigned to the character in ISO/IEC 10646 (the character number, assigned to the character in ISO/IEC 10646 (the character number,
a.k.a. code point or Unicode scalar value). This encoding form has a.k.a. code position, code point or Unicode scalar value). This
the following characteristics (all values are in hexadecimal): encoding form has the following characteristics (all values are in
hexadecimal):
o Character numbers from U+0000 to U+007F (US-ASCII repertoire) o Character numbers from U+0000 to U+007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values). A direct correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
consequence is that a plain ASCII string is also a valid UTF-8 consequence is that a plain ASCII string is also a valid UTF-8
string. string.
o US-ASCII octet values do not appear otherwise in a UTF-8 encoded o US-ASCII octet values do not appear otherwise in a UTF-8 encoded
character stream. This provides compatibility with file systems character stream. This provides compatibility with file systems
or other software (e.g. the printf() function in C libraries) that or other software (e.g. the printf() function in C libraries) that
parse based on US-ASCII values but are transparent to other parse based on US-ASCII values but are transparent to other
values. values.
o Round-trip conversion is easy between UTF-8 and other encoding o Round-trip conversion is easy between UTF-8 and other encoding
forms. forms.
o The first octet of a multi-octet sequence indicates the number of o The first octet of a multi-octet sequence indicates the number of
octets in the sequence. octets in the sequence.
o The octet values C0, C1, FE and FF never appear. If the range of o The octet values C0, C1, F5 to FF never appear.
character numbers is restricted to U+0000..U+10FFFF (the UTF-16
accessible range), then the octet values F5..FD also never appear.
o Character boundaries are easily found from anywhere in an octet o Character boundaries are easily found from anywhere in an octet
stream. stream.
o The lexicographic sorting order of UTF-8 strings is the same as if o The byte-value lexicographic sorting order of UTF-8 strings is the
ordered by character numbers. Of course this is of limited same as if ordered by character numbers. Of course this is of
interest since a sort order based on character numbers is not limited interest since a sort order based on character numbers is
culturally valid. not culturally valid.
o The Boyer-Moore fast search algorithm can be used with UTF-8 data. o The Boyer-Moore fast search algorithm can be used with UTF-8 data.
o UTF-8 strings can be fairly reliably recognized as such by a o UTF-8 strings can be fairly reliably recognized as such by a
simple algorithm, i.e. the probability that a string of characters simple algorithm, i.e. the probability that a string of characters
in any other encoding appears as valid UTF-8 is low, diminishing in any other encoding appears as valid UTF-8 is low, diminishing
with increasing string length. with increasing string length.
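Several of the characteristics listed above can be checked mechanically. The following is a minimal sketch in Python, not part of either draft; the names FORBIDDEN_OCTETS, is_continuation, char_start and uses_forbidden_octet are illustrative assumptions. It shows that continuation octets always have the form 10xxxxxx, so the start of a character can be found by backing up from any position, and that the octets C0, C1 and F5 to FF do not occur in the encoded sample.

```python
# Octet values that never appear in UTF-8 (C0, C1, F5..FF).
FORBIDDEN_OCTETS = {0xC0, 0xC1} | set(range(0xF5, 0x100))


def is_continuation(octet: int) -> bool:
    """True for octets of the form 10xxxxxx (80..BF)."""
    return (octet & 0xC0) == 0x80


def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the first octet of the character containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def uses_forbidden_octet(data: bytes) -> bool:
    """True if any octet of data is one that UTF-8 never produces."""
    return any(octet in FORBIDDEN_OCTETS for octet in data)


# U+0041, U+00E9, U+20AC and U+233B4 take 1, 2, 3 and 4 octets respectively.
sample = "A\u00e9\u20ac\U000233b4".encode("utf-8")

# A plain ASCII string is unchanged by UTF-8 encoding.
ascii_text = "plain ASCII"
ascii_is_preserved = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

With this sample, char_start(sample, 5) returns 3 (the first octet of U+20AC) and char_start(sample, 9) returns 6 (the first octet of U+233B4).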
UTF-8 was originally a project of the X/Open Joint UTF-8 was originally a project of the X/Open Joint
Internationalization Group XOJIG with the objective to specify a File Internationalization Group XOJIG with the objective to specify a File
skipping to change at page 6, line 28 skipping to change at page 6, line 28
Implementations of the decoding algorithm above MUST protect against Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000, decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other invalid sequences may have security consequences or cause other
problems. See Security Considerations (Section 10) below. problems. See Security Considerations (Section 10) below.
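One way to meet the requirement above is to check each decoded value against the smallest character number its sequence length may legitimately carry, and to reject the surrogate range. The sketch below is an illustration under those assumptions, not the decoding algorithm of this memo verbatim; decode_one is a hypothetical name.

```python
def decode_one(data: bytes, i: int = 0) -> tuple:
    """Decode one character starting at data[i].

    Returns (character_number, index_of_next_character) and raises
    ValueError for any invalid sequence.
    """
    first = data[i]
    if first < 0x80:
        return first, i + 1
    # Pick the sequence length, the smallest value that length may encode,
    # and the payload bits carried by the first octet.
    if 0xC2 <= first <= 0xDF:
        length, minimum, cp = 2, 0x80, first & 0x1F
    elif 0xE0 <= first <= 0xEF:
        length, minimum, cp = 3, 0x800, first & 0x0F
    elif 0xF0 <= first <= 0xF4:
        length, minimum, cp = 4, 0x10000, first & 0x07
    else:
        # 80..BF (stray continuation), C0, C1 and F5..FF cannot start a sequence.
        raise ValueError(f"invalid first octet {first:02X}")
    tail = data[i + 1:i + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for octet in tail:
        if (octet & 0xC0) != 0x80:
            raise ValueError(f"invalid continuation octet {octet:02X}")
        cp = (cp << 6) | (octet & 0x3F)
    if cp < minimum:
        raise ValueError("overlong sequence")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate encoded in UTF-8")
    if cp > 0x10FFFF:
        raise ValueError("character number beyond U+10FFFF")
    return cp, i + length
```

With these checks, C0 80 is rejected at its first octet and ED A1 8C is rejected as a surrogate, while the valid sequence F0 A3 8E B4 decodes to U+233B4.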
4. Syntax of UTF-8 Byte Sequences 4. Syntax of UTF-8 Byte Sequences
For the convenience of implementors using ABNF, a definition of UTF-8
in ABNF syntax is given here.
A UTF-8 string is a sequence of octets representing a sequence of UCS A UTF-8 string is a sequence of octets representing a sequence of UCS
characters. An octet sequence is valid UTF-8 only if it matches the characters. An octet sequence is valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8 following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF of [RFC2234]. and is expressed in the ABNF of [RFC2234].
UTF8-octets = *( UTF8-char ) UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) / UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail ) %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) / UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail ) %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF UTF8-tail = %x80-BF
5. Versions of the standards NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This
grammar is believed to describe the same thing as what Unicode
describes, but does not claim to be authoritative. Implementors are
urged to rely on the authoritative source, rather than on this ABNF.
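For experimentation, the grammar above can be transcribed rule by rule into a regular expression over octets. This is a minimal sketch using Python's standard re module; the variable names mirror the ABNF rule names, and is_valid_utf8 is a hypothetical helper. Like the ABNF itself, it does not claim to be authoritative.

```python
import re

# Each pattern mirrors the ABNF rule of the same name.
UTF8_TAIL = rb"[\x80-\xBF]"
UTF8_1 = rb"[\x00-\x7F]"
UTF8_2 = rb"[\xC2-\xDF]" + UTF8_TAIL
UTF8_3 = b"|".join([
    rb"\xE0[\xA0-\xBF]" + UTF8_TAIL,
    rb"[\xE1-\xEC]" + UTF8_TAIL + rb"{2}",
    rb"\xED[\x80-\x9F]" + UTF8_TAIL,
    rb"[\xEE-\xEF]" + UTF8_TAIL + rb"{2}",
])
UTF8_4 = b"|".join([
    rb"\xF0[\x90-\xBF]" + UTF8_TAIL + rb"{2}",
    rb"[\xF1-\xF3]" + UTF8_TAIL + rb"{3}",
    rb"\xF4[\x80-\x8F]" + UTF8_TAIL + rb"{2}",
])

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(
    b"(?:" + b"|".join([UTF8_1, UTF8_2, UTF8_3, UTF8_4]) + b")*"
)


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole octet string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None
```

The pattern accepts the empty string and any valid sequence such as F0 A3 8E B4, and rejects C0 80, the surrogate pair ED A1 8C ED BE B4, and F4 90 80 80 (which would lie beyond U+10FFFF).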
5. Versions of the standards
ISO/IEC 10646 is updated from time to time by publication of ISO/IEC 10646 is updated from time to time by publication of
amendments and additional parts; similarly, new versions of the amendments and additional parts; similarly, new versions of the
Unicode standard are published over time. Each new version obsoletes Unicode standard are published over time. Each new version obsoletes
and replaces the previous one, but implementations, and more and replaces the previous one, but implementations, and more
significantly data, are not updated instantly. significantly data, are not updated instantly.
In general, the changes amount to adding new characters, which does In general, the changes amount to adding new characters, which does
not pose particular problems with old data. In 1996, Amendment 5 to not pose particular problems with old data. In 1996, Amendment 5 to
the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
the Korean Hangul block, thereby making any previous data containing the Korean Hangul block, thereby making any previous data containing
skipping to change at page 11, line 22 skipping to change at page 11, line 32
been used in a widespread virus attacking Web servers in 2001; the been used in a widespread virus attacking Web servers in 2001; the
security threat is thus very real. security threat is thus very real.
Another security issue occurs when encoding to UTF-8: the ISO/IEC Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to 10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn't take into explicitly limited to U+10FFFF or if buffer sizing doesn't take into
account the possibility of 5- and 6-byte sequences. account the possibility of 5- and 6-byte sequences.
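An encoder that explicitly limits the range of character numbers never emits more than four octets per character, which gives a fixed bound for buffer sizing. The following sketch assumes that limit; encode_char, MAX_CHARACTER_NUMBER and MAX_OCTETS_PER_CHAR are illustrative names, and the rejection of surrogates follows the treatment of surrogate pairs as invalid elsewhere in this memo.

```python
MAX_CHARACTER_NUMBER = 0x10FFFF
MAX_OCTETS_PER_CHAR = 4  # upper bound once the range is limited to U+10FFFF


def encode_char(cp: int) -> bytes:
    """Encode one character number as UTF-8, refusing out-of-range input."""
    if cp < 0 or cp > MAX_CHARACTER_NUMBER:
        raise ValueError(f"character number {cp:#x} is outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate {cp:#x} cannot be encoded in UTF-8")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

A caller that sizes its output buffer as MAX_OCTETS_PER_CHAR times the number of characters cannot be overrun by this encoder, because inputs such as 0x110000 or 0x7FFFFFFF raise an error instead of producing 5- or 6-octet sequences.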
Security may also be impacted by a characteristic of several
character encodings, including UTF-8: the "same thing" (as far as a
user can tell) can be represented by several distinct character
sequences. For instance, an e with acute accent can be represented by
the precomposed U+00E9 E ACUTE character or by the canonically
equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
UTF-8 provides a single byte sequence for each character sequence,
the existence of multiple character sequences for "the same thing"
may have security consequences whenever string matching, indexing,
searching, sorting, regular expression matching and selection are
involved. An example would be string matching of an identifier
appearing in a credential and in access control list entries. This
issue is amenable to solutions based on Unicode Normalization Forms,
see [UAX15].
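The e-acute example above can be reproduced with Python's standard unicodedata module. This is a sketch only: the choice of Normalization Form C and the helper name same_identifier are assumptions made for illustration, and a real protocol would need to specify which form, if any, applies.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301

# The two character sequences differ, and so do their UTF-8 encodings,
# even though a user sees the same thing.
precomposed_octets = precomposed.encode("utf-8")  # C3 A9
decomposed_octets = decomposed.encode("utf-8")    # 65 CC 81


def same_identifier(a: str, b: str) -> bool:
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A naive comparison of the two strings, or of their octets, reports a mismatch, whereas same_identifier(precomposed, decomposed) reports a match. An access control check that compares raw strings would therefore treat the two spellings as different identifiers.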
11. Acknowledgements 11. Acknowledgements
The following have participated in the drafting and discussion of The following have participated in the drafting and discussion of
this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer, this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
Mark Davis, Martin J. Duerst, Patrick Faltstrom, Ned Freed, David Mark Davis, Martin J. Duerst, Patrick Faltstrom, Ned Freed, David
Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood, Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung, Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung,
Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John
Gardiner Myers, Dan Oscarsson, Roozbeh Pournader, Murray Sargent, Gardiner Myers, Chris Newman, Dan Oscarsson, Roozbeh Pournader,
Markus Scherer, Keld Simonsen, Arnold Winkler, Kenneth Whistler and Murray Sargent, Markus Scherer, Keld Simonsen, Arnold Winkler,
Misha Wolf. Kenneth Whistler and Misha Wolf.
12. Changes from RFC 2279 12. Changes from RFC 2279
o Restricted the range of characters to 0000-10FFFF (the UTF-16 o Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range). accessible range).
o Made Unicode the source of the normative definition of UTF-8, o Made Unicode the source of the normative definition of UTF-8,
keeping ISO/IEC 10646 as the reference for characters. keeping ISO/IEC 10646 as the reference for characters.
o Straightened out terminology. UTF-8 now described in terms of an o Straightened out terminology. UTF-8 now described in terms of an
skipping to change at page 12, line 9 skipping to change at page 12, line 32
o Turned the note warning against decoding of invalid sequences into o Turned the note warning against decoding of invalid sequences into
a normative MUST NOT. a normative MUST NOT.
o Added a new section about the UTF-8 BOM, with advice for o Added a new section about the UTF-8 BOM, with advice for
protocols. protocols.
o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration. o Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.
o Added an ABNF syntax for valid UTF-8 octet sequences o Added an ABNF syntax for valid UTF-8 octet sequences
o Expanded Security Considerations section, in particular impact of
Unicode normalization
Normative references Normative references
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[ISO.10646] [ISO.10646]
International Organization for Standardization, International Organization for Standardization,
"Information Technology - Universal Multiple-octet coded "Information Technology - Universal Multiple-octet coded
Character Set (UCS)", ISO/IEC Standard 10646, comprised Character Set (UCS)", ISO/IEC Standard 10646, comprised
of ISO/IEC 10646-1:2000, "Information technology -- of ISO/IEC 10646-1:2000, "Information technology --
Universal Multiple-Octet Coded Character Set (UCS) -- Part Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane", ISO/IEC 1: Architecture and Basic Multilingual Plane", ISO/IEC
10646-2:2001, "Information technology -- Universal 10646-2:2001, "Information technology -- Universal
Multiple-Octet Coded Character Set (UCS) -- Part 2: Multiple-Octet Coded Character Set (UCS) -- Part 2:
Supplementary Planes" and ISO/IEC 10646-1:2000/Amd 1:2002, Supplementary Planes" and ISO/IEC 10646-1:2000/Amd 1:2002,
"Mathematical symbols and other characters". "Mathematical symbols and other characters".
[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version
3.2", defined by The Unicode Standard, Version 3.0 4.0", defined by The Unicode Standard, Version 4.0
(Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1),
as amended by the Unicode Standard Annex #27: Unicode 3.1 April 2003, <http://www.unicode.org/unicode/standard/
(see http://www.unicode.org/reports/tr27) and by the versions/enumeratedversions.html#Unicode_4_0_0>.
Unicode Standard Annex #28: Unicode 3.2 (see
http://www.unicode.org/reports/tr28), March 2002,
<http://www.unicode.org/unicode/standard/versions/
enumeratedversions.html#Unicode_3_2_0>.
Informative references Informative references
[CESU-8] Phipps, T., "Compatibility Encoding Scheme for UTF-16: [CESU-8] Phipps, T., "Unicode Technical Report #26: Compatibility
8-Bit (CESU-8)", UTR 26, April 2002, Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26, April
<http://www.unicode.org/unicode/reports/tr26/>. 2002, <http://www.unicode.org/unicode/reports/tr26/>.
[FSS_UTF] X/Open Company Ltd., "X/Open CAE Specification C501 -- [FSS_UTF] X/Open Company Ltd., "X/Open CAE Specification C501 --
File System Safe UCS Transformation Format (FSS_UTF)", File System Safe UCS Transformation Format (FSS_UTF)",
ISBN 1-85912-082-2, April 1995. ISBN 1-85912-082-2, April 1995.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996. Bodies", RFC 2045, November 1996.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[RFC2978] Freed, N. and J. Postel, "IANA Charset Registration [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2978, October 2000. Procedures", BCP 19, RFC 2978, October 2000.
[UAX15] Davis, M. and M. Duerst, "Unicode Standard Annex #15:
Unicode Normalization Forms", An integral part of The
Unicode Standard, Version 4.0.0, April 2003, <http://
www.unicode.org/unicode/reports/tr15>.
[US-ASCII] [US-ASCII]
American National Standards Institute, "Coded Character American National Standards Institute, "Coded Character
Set - 7-bit American Standard Code for Information Set - 7-bit American Standard Code for Information
Interchange", ANSI X3.4, 1986. Interchange", ANSI X3.4, 1986.
URIs URIs
[1] <http://www.unicode.org/unicode/standard/policies.html> [1] <http://www.unicode.org/unicode/standard/policies.html>
Author's Address Author's Address
 End of changes. 21 change blocks. 
37 lines changed or deleted 62 lines changed or added
