< draft-gulbrandsen-collation-basic-00.txt   draft-gulbrandsen-collation-basic-01.txt >
Network Working Group Chris Newman Network Working Group Chris Newman
Request for Comments: DRAFT Sun Microsystems Internet-Draft Sun Microsystems
Martin Duerst Intended Status: Proposed Standard Martin Duerst
Aoyama Gakuin University Aoyama Gakuin University
Arnt Gulbrandsen Arnt Gulbrandsen
Oryx Mail Systems GmbH Oryx Mail Systems GmbH
November 2006 March 2007
i;basic - the Unicode Collation Algorithm i;basic - Registration of Unicode Collation Algorithm
draft-gulbrandsen-collation-basic-00.txt draft-gulbrandsen-collation-basic-01.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
Internet-Drafts are draft documents valid for a maximum of six Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress". reference material or to cite them other than as "work in progress".
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet- http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-
Draft Shadow Directories can be accessed at Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This draft expires in September 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society 2006. Copyright (C) The IETF Trust (2007).
Abstract Abstract
The Unicode Collation Algorithm is a widely usable collation The Unicode Collation Algorithm is a widely usable collation
covering all of Unicode. It produces tolerable results for many covering all of Unicode. It produces tolerable results for many
locales as-is, and can be further improved using locale-specific locales as-is, and can be further improved using locale-specific
tables. This document registers the UCA in the IETF's collation tables. This document registers the UCA in the IETF's collation
registry. registry.
Internet-draft November 2006
Table of Contents Table of Contents
1. Conventions Used in This Document . . . . . . . . . . . . . . 2 1. Conventions Used in This Document . . . . . . . . . . . . . . 2
2. i;basic: The Unicode Collation Algorithm . . . . . . . . . . . 2 2. i;basic: The Unicode Collation Algorithm . . . . . . . . . . . 2
3. Registration . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Registration . . . . . . . . . . . . . . . . . . . . . . . . . 4
4. Security Considerations . . . . . . . . . . . . . . . . . . . 5 4. Security Considerations . . . . . . . . . . . . . . . . . . . 5
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 5 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 5
6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 5 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 5
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 5 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 5
7.1. Normative References . . . . . . . . . . . . . . . . . . . 5 7.1. Normative References . . . . . . . . . . . . . . . . . . . 5
8. Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 6 8. Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 6
1. Conventions Used in This Document 1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [KEYWORDS]. document are to be interpreted as described in [RFC2119].
2. i;basic: The Unicode Collation Algorithm 2. i;basic: The Unicode Collation Algorithm
The basic collation is intended to provide tolerable results for a The basic collation is intended to provide tolerable results for a
number of languages for all three operations (equality, substring number of languages for all three operations (equality, substring
and ordering) so it is suitable as a mandatory-to-implement and ordering) so it is suitable as a mandatory-to-implement
collation for protocols which include ordering support. The collation for protocols which include ordering support. The
ordering operation of the basic collation is the Unicode Collation ordering operation of the basic collation is the Unicode Collation
Algorithm version 14 [UCAv14]. Algorithm [UCAv14].
The equality and substring operations are created as described in The equality and substring operations are created as described in
UCAv14 section 8. While that section is informative to UCAv14, it UCAv14 section 8. While that section is informative to UCAv14, it
is normative to this collation specification. is normative to this collation specification.
This collation is based on Unicode version 3.2, with the following This collation is based on Unicode version 3.2, with the following
tables relevant: tables relevant:
1. For the normalization step, 1. For the normalization step,
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt is http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt is
used. Column 5 is used to determine the canonical decomposition, used. Column 5 is used to determine the canonical decomposition,
while column 3 contains the canonical combining classes necessary while column 3 contains the canonical combining classes necessary
to attain canonical order. to attain canonical order.
2. The table of characters which require a logical order exception 2. The table of characters which require a logical order exception
is a subset of the table in is a subset of the table in
http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt and http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt and
is included here: is included here:
0E40..0E44 ; Logical_Order_Exception 0E40..0E44 ; Logical_Order_Exception
# Lo [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI # Lo [5] THAI CHARACTER SARA E..
# THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4 ; Logical_Order_Exception 0EC0..0EC4 ; Logical_Order_Exception
# Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI # Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI
# Total code points: 10 # Total code points: 10
3. The table used to translate normalized code points to a sort key 3. The table used to translate normalized code points to a sort key
is http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt. is http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt.
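The table lookups listed above can be sketched in Python. This is a minimal illustration, not part of either draft: the function names are hypothetical, the embedded table is the ten-element excerpt quoted above, and the field positions assume the semicolon-separated UnicodeData layout counted from zero (field 3 is the canonical combining class, field 5 is the decomposition).

```python
def parse_property_ranges(text, prop):
    """Collect the code points listed for `prop` in PropList-style lines."""
    points = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, name = (part.strip() for part in line.split(";", 1))
        if name != prop:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point or a range
        points.update(range(lo, hi + 1))
    return points


# Excerpt quoted in the draft text above.
TABLE = """\
0E40..0E44 ; Logical_Order_Exception
# Lo [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4 ; Logical_Order_Exception
# Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI
"""

LOGICAL_ORDER_EXCEPTIONS = parse_property_ranges(TABLE, "Logical_Order_Exception")


def unicode_data_fields(line):
    """Return (code point, combining class, decomposition) from one UnicodeData line."""
    fields = line.split(";")
    return int(fields[0], 16), int(fields[3]), fields[5]


# Sample line in the UnicodeData format, used only to show the field positions.
SAMPLE = "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;"
```

A decomposition field that begins with a tag in angle brackets is a compatibility mapping rather than a canonical one, so a real normalization step would skip those entries.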
UCAv14 includes a number of configurable parameters and steps UCAv14 includes a number of configurable parameters and steps
labelled as potentially optional. The following list summarizes the labelled as potentially optional. The following list summarizes the
skipping to change at page 3, line 37 skipping to change at page 3, line 38
- The second level in the sort key is evaluated forwards by - The second level in the sort key is evaluated forwards by
default. This can be changed using the "direction2" variable. default. This can be changed using the "direction2" variable.
- The variable weighting uses the "non-ignorable" option by - The variable weighting uses the "non-ignorable" option by
default. default.
- The semi-stable option is not used by default. - The semi-stable option is not used by default.
- Support for one level of collation is the default behavior, i.e. - Support for one level of collation is the default behavior, i.e.
the collation is case-insensitive and ignores accents. This can be the collation is case-insensitive and ignores accents. This can be
changed using the "matchlevel" variable. changed using the "strength" variable.
- If the collation is adjusted to be case-sensitive, the
"casefirst" variable can be used to determine whether upper case
sorts before or after lower case.
- No preprocessing step is used by the basic collation prior to - No preprocessing step is used by the basic collation prior to
applying the UCAv14 algorithm. Note that an application protocol applying the UCAv14 algorithm. Note that an application protocol
specification MAY require pre-processing prior to the use of any specification MAY require pre-processing prior to the use of any
collations.</t> collations.
- The equality and substring algorithms use the "Whole Characters - The equality and substring algorithms use the "Whole Characters
Only" feature described in UCAv14 section 8 by default. Only" feature described in UCAv14 section 8 by default.
The "uv" variable specifies the version of the UnicodeData file
used. The legal values are the Unicode version names starting with
the default, e.g. "4.0" and "4.1", but not "2.0".
The "version" variable specifies the version of the Unicode The "version" variable specifies the version of the Unicode
Collation Algorithm to use. The default is 14, and legal values are Collation Algorithm to use. The default is 4.1.0. UCA versions older
1 through the latest version. than 4.1.0 (14) are not legal. UCA versions newer than 4.1.0 are
legal, although not defined at the time of writing.
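The version rule in the -01 text (4.1.0 is the default and the oldest legal value, newer versions are allowed) could be checked roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and treating "4.1" as equal to "4.1.0" is an editorial choice that the draft does not state.

```python
def parse_version(text):
    """Turn a dotted version string such as "4.1.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


MINIMUM_VERSION = parse_version("4.1.0")


def is_legal_version(text):
    """Return True if `text` names a UCA version no older than 4.1.0."""
    try:
        version = parse_version(text)
    except ValueError:
        return False  # not a dotted sequence of integers
    # Assumption: pad short forms with zeros so "4.1" compares like "4.1.0".
    version += (0,) * (3 - len(version))
    return version >= MINIMUM_VERSION
```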
The exact collation identifier with these defaults is "i;basic". The exact collation identifier with these defaults is "i;basic".
When a specification states that the basic collation is mandatory- When a specification states that the basic collation is mandatory-
to-implement, only this specific identifier is mandatory-to- to-implement, only this specific identifier is mandatory-to-
implement. implement.
The default weighting option is "non-ignorable". The "semi-stable" The default weighting option is "non-ignorable". The "semi-stable"
sort key option is not used by default. sort key option is not used by default.
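To illustrate how the "strength" and "direction2" variables shape a sort key, here is a toy sketch. The weights are hypothetical and are not taken from the allkeys table, the precomposed character stands in for its normalized form, and variable weighting is left out entirely; it only shows level-by-level key formation with a zero separator between levels.

```python
# Hypothetical (primary, secondary, tertiary) weights, for illustration only.
TOY_ELEMENTS = {
    "a": [(0x10, 0x20, 0x02)],
    "A": [(0x10, 0x20, 0x08)],
    "\u00e1": [(0x10, 0x20, 0x02), (0x00, 0x32, 0x02)],  # a followed by acute accent
    "b": [(0x11, 0x20, 0x02)],
}


def sort_key(text, strength=1, direction2="forwards"):
    """Build a toy sort key using the first `strength` levels (1 to 3)."""
    elements = [ce for ch in text for ce in TOY_ELEMENTS[ch]]
    key = []
    for level in range(strength):
        if level > 0:
            key.append(0)  # level separator
        # Zero weights are ignorable at this level and are skipped.
        weights = [ce[level] for ce in elements if ce[level] != 0]
        if level == 1 and direction2 == "backwards":
            weights.reverse()  # second level evaluated from the end of the string
        key.extend(weights)
    return tuple(key)
```

With these toy weights, "a", "A" and "\u00e1" share one key at strength 1, the accent becomes visible at strength 2, and case becomes visible at strength 3.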
Sort keys are generated as described in section 4.3 of the UCA Sort keys are generated as described in section 4.3 of the UCA
skipping to change at page 4, line 28 skipping to change at page 5, line 15
3. Registration 3. Registration
<?xml version='1.0'?> <?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'> <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="XXXX" scope="i18n" intendedUse="common"> <collation rfc="XXXX" scope="i18n" intendedUse="common">
<identifier>i;basic</identifier> <identifier>i;basic</identifier>
<title>Basic</title> <title>Basic</title>
<operations>equality order substring</operations> <operations>equality order substring</operations>
<specification>RFC XXXX</specification> <specification>RFC XXXX</specification>
<owner>IETF</owner> <owner>IETF</owner>
<submitter>chris.newman@sun.com<submitter> <submitter>chris.newman@sun.com</submitter>
<variable>
<name>uv</name>
<default>3.2</default>
</variable>
<variable> <variable>
<name>version</name> <name>version</name>
<default>14</default> <default>4.1.0</default>
</variable> </variable>
<variable> <variable>
<name>direction2</name> <name>direction2</name>
<default>forwards</default> <default>forwards</default>
<value>forwards</value> <value>forwards</value>
<value>backwards</value> <value>backwards</value>
</variable> </variable>
<variable> <variable>
<name>matchlevel</name> <name>strength</name>
<default>3</default> <default>1</default>
<value>1</value> <value>1</value>
<value>2</value> <value>2</value>
<value>3</value> <value>3</value>
<value>4</value>
<value>5</value>
</variable>
<variable>
<name>casefirst</name>
<default>off</default>
<value>off</value>
<value>upper</value>
<value>lower</value>
</variable> </variable>
</collation> </collation>
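A consumer of the registry might read a registration of this shape as sketched below. The embedded XML is a trimmed copy of the -01 registration written with matching closing tags and without the DOCTYPE line, which is an editorial assumption made so that the sample parses; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the -01 registration, written with matching closing tags.
REGISTRATION = """\
<collation rfc="XXXX" scope="i18n" intendedUse="common">
  <identifier>i;basic</identifier>
  <title>Basic</title>
  <operations>equality order substring</operations>
  <variable>
    <name>version</name>
    <default>4.1.0</default>
  </variable>
  <variable>
    <name>direction2</name>
    <default>forwards</default>
    <value>forwards</value>
    <value>backwards</value>
  </variable>
  <variable>
    <name>strength</name>
    <default>1</default>
    <value>1</value>
    <value>2</value>
    <value>3</value>
    <value>4</value>
    <value>5</value>
  </variable>
  <variable>
    <name>casefirst</name>
    <default>off</default>
    <value>off</value>
    <value>upper</value>
    <value>lower</value>
  </variable>
</collation>
"""


def read_registration(xml_text):
    """Extract the identifier, operations and variable defaults from a registration."""
    root = ET.fromstring(xml_text)
    variables = {}
    for var in root.findall("variable"):
        variables[var.findtext("name")] = {
            "default": var.findtext("default"),
            "values": [value.text for value in var.findall("value")],
        }
    return {
        "identifier": root.findtext("identifier"),
        "operations": root.findtext("operations").split(),
        "variables": variables,
    }
```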
4. Security Considerations 4. Security Considerations
This document raises no security issues that are not already This document raises no security issues that are not already
described in [COLLATION]. described in [RFC4790].
5. IANA Considerations 5. IANA Considerations
The IANA is requested to add the above i;basic registration to the The IANA is requested to add the above i;basic registration to the
collation registry. collation registry, http://www.iana.org/assignments/collation/.
6. Acknowledgements 6. Acknowledgements
This document was split off from [COLLATION] during its time as a This document was split off from [RFC4790] during its time as a
draft. Many of the people acknowledged in that RFC helped with this: draft. Many of the people acknowledged in that RFC helped with this:
Brian Carpenter, John Cowan, Dave Cridland, Mark Davis, Spencer Brian Carpenter, John Cowan, Dave Cridland, Mark Davis, Spencer
Dawkins, Lisa Dusseault, Lars Eggert, Frank Ellermann, Philip Dawkins, Lisa Dusseault, Lars Eggert, Frank Ellermann, Philip
Guenther, Tony Hansen, Ted Hardie, Sam Hartman, Kjetil Torgrim Guenther, Tony Hansen, Ted Hardie, Sam Hartman, Kjetil Torgrim
Homme, Michael Kay, John Klensin, Alexey Melnikov, Jim Melton and Homme, Michael Kay, John Klensin, Alexey Melnikov, Jim Melton and
Abhijit Menon-Sen. Abhijit Menon-Sen.
7. References 7. References
7.1. Normative References 7.1. Normative References
[COLLATION] Newman, Duerst, Gulbrandsen, "Internet Application [RFC2119] Bradner, "Key words for use in RFCs to Indicate
Protocol Collation Registry", RFC YYYY, October 2006.
[KEYWORDS] Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, Harvard University, March Requirement Levels", RFC 2119, Harvard University, March
1997. 1997.
[RFC4790] Newman, Duerst, Gulbrandsen, "Internet Application
Protocol Collation Registry", RFC 4790, February 2007.
[UCAv14] Davis, Whistler, "Unicode Collation Algorithm version [UCAv14] Davis, Whistler, "Unicode Collation Algorithm version
14", May 2005, 14", May 2005,
<http://www.unicode.org/reports/tr10/tr10-14.html>. <http://www.unicode.org/reports/tr10/tr10-14.html>.
8. Authors' Addresses 8. Authors' Addresses
Chris Newman Chris Newman
Sun Microsystems Sun Microsystems
3401 Centrelake Dr., Suite 410 3401 Centrelake Dr., Suite 410
Ontario, CA 91761 Ontario, CA 91761
US US
Email: chris.newman@sun.com Email: chris.newman@sun.com
Martin Duerst Martin Duerst
Aoyama Gakuin University Aoyama Gakuin University
5-10-1 Fuchinobe 5-10-1 Fuchinobe
Sagamihara Sagamihara
Kanagawa Kanagawa
229-8558 229-8558
Japan Japan
Phone: +81 42 759 6329 Phone: +81 42 759 6329
Fax: +81 42 759 6495 Fax: +81 42 759 6495
Email: duerst@it.aoyama.ac.jp Email: duerst@it.aoyama.ac.jp
Web: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ Web: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
(Note: Please write "Duerst" with u-umlaut wherever possible, for Note: Please write "Duerst" with u-umlaut wherever possible, for
example as "D&#252;rst" in XML and HTML.) example as "D&#252;rst" in XML and HTML.
Arnt Gulbrandsen Arnt Gulbrandsen
Oryx Mail Systems GmbH Oryx Mail Systems GmbH
Schweppermannstr. 8 Schweppermannstr. 8
D-81671 Muenchen D-81671 Muenchen
Germany Germany
Fax: +49 89 4502 9758 Fax: +49 89 4502 9758
Email: arnt@oryx.com Email: arnt@oryx.com
Open Issues
This -00 draft is published in order to establish version history.
Several necessary changes have NOT been made.
The Unicode version choice needs consideration. 3.2 seems old? And
can the ten-element table be dropped - why is it there?
The variable names should be aligned with what
http://unicode.org/reports/tr35/#Collation_Elements describes. IMO
the best thing to do is to copy the CLDR names.
The variable defaults need to be considered when doing the above
rename.
Change Log
Changes in -00:
No substantive changes from draft-newman-i18n-comparator.
Intellectual Property Statement Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights. it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79. documents can be found in BCP 78 and BCP 79.
skipping to change at page 8, line 5 skipping to change at page 8, line 8
of such proprietary rights by implementers or users of this of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr. at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf- this standard. Please address the information to the IETF at ietf-
ipr@ietf.org. ipr@ietf.org.
Copyright Statement Copyright Statement
Copyright (C) The Internet Society (2006). Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors contained in BCP 78, and except as set forth therein, the authors
retain all their rights. retain all their rights.
Disclaimer of Validity Disclaimer of Validity
This document and the information contained herein are provided on This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Acknowledgment Acknowledgment
Funding for the RFC Editor function is currently provided by the Funding for the RFC Editor function is currently provided by the
Internet Society. Internet Society.
(RFC Editor: Please delete everything after this point)
Open Issues
This -00 draft is published in order to establish version history.
Several necessary changes have NOT been made.
The Unicode version choice needs consideration. 3.2 seems old? And
can the ten-element table be dropped - why is it there?
The variable names should be aligned with what
http://unicode.org/reports/tr35/#Collation_Elements describes. IMO
the best thing to do is to copy the CLDR names.
The variable defaults need to be considered when doing the above
rename.
Changes in -01:
- Better title (suggested by Martin Duerst).
- Struck the "uv" variable, merged "uv" and "version", and aligned
the result with the UCA version variable (as explained in
http://unicode.org/reports/tr35/#Collation_Elements). Starting
version changed to 4.0, since that's the oldest version for which
the two can be merged.
- Changed the default strength to 1, and called strength strength
instead of matchLevel since that's what the UCA calls it and it
seems sensible.
- Added the casefirst variable from the UCA. (Several other
variables were not added, as I'm uncertain of the right names and
default variables.)
Changes in -00:
- No substantive changes from draft-newman-i18n-comparator.
 End of changes. 33 change blocks. 
77 lines changed or deleted 50 lines changed or added
