Network Working Group                                         M. Crispin
Internet-Draft                                  University of Washington
Intended status: Proposed Standard                           May 2, 2007
Expires: November 2, 2007
Document: internet-drafts/draft-crispin-collation-unicasemap-04.txt

         i;unicode-casemap - Simple Unicode Collation Algorithm

Status of this Memo

   By submitting this Internet-Draft, each author represents that
   any applicable patent or other IPR claims of which he or she is
   aware have been or will be disclosed, and any of which he or she
   becomes aware will be disclosed, in accordance with Section 6 of
   BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.
   Note that other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   A revised version of this draft document will be submitted to the RFC
   editor as a Proposed Standard for the Internet Community.  Discussion
   and suggestions for improvement are requested, and should be sent to
   ietf-imapext@IMC.ORG.

   Distribution of this memo is unlimited.

Abstract

   This document describes "i;unicode-casemap", a simple
   case-insensitive collation for Unicode strings.  It provides
   equality, substring and ordering operations.

Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics.  It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   It is possible, with a modest extension, to provide a more
   sophisticated collation with greater multilingual applicability than
   "i;ascii-casemap".

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap".  It does not replace the
   "i;basic" collation described in [BASIC].

1. Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which
   operates on [UNICODE] strings and is case-insensitive in its
   treatment of characters.  It provides equality, substring and
   ordering operations.  All input is valid.

   The algorithm that describes the behavior of this collation is
   specified for Unicode input encoded in [UTF-8].  This is for ease of
   description only.  An implementation is free to use another internal
   storage format for Unicode strings, as long as it produces the same
   result as produced by the algorithm specified in this document for
   any set of Unicode strings.

   As this collation algorithm is specified for UTF-8 strings, strings
   in other character sets and/or encodings cannot be used with this
   collation unless they are first converted to UTF-8.

   Any input that is already in UTF-8 must be checked for invalid UTF-8
   sequences, such as overlong sequences.  A UTF-8 string that is
   generated from a sequence of Unicode characters according to the
   rules in [UTF-8] will not contain such invalid sequences.

   For the equality and ordering operations, each input UTF-8 string is
   prepared by converting it to "titlecased canonicalized UTF-8", using
   UnicodeData.txt distributed by [UNICODE], as follows on a
   per-character basis:

   (1) If the codepoint has a titlecase property in UnicodeData.txt
       (this is normally the same as the uppercase property) the
       codepoint is converted to the titlecased codepoint.
   (2) If the codepoint has a decomposition property of any type in
       UnicodeData.txt the codepoint is converted to the decomposed
       codepoints (effectively Normalization Form KD).
   (3) The resulting codepoint(s) is/are appended to the titlecased
       canonicalized UTF-8 string.

   The resulting two titlecased canonicalized UTF-8 strings are then
   treated as in i;octet for equality and ordering.
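
   The preparation steps above can be sketched in Python.  This is an
   illustrative sketch under stated assumptions, not a reference
   implementation.  The function names (simple_titlecase, decompose,
   casemap_key, casemap_equal, casemap_compare) are hypothetical.  The
   sketch relies on Python's unicodedata module, whose Unicode version
   depends on the interpreter and will generally differ from any
   particular UnicodeData.txt revision.  It approximates the titlecase
   property with str.title() on a single character, and it assumes that
   decomposition is applied repeatedly, which is one reading of
   "effectively Normalization Form KD".

```python
import unicodedata


def simple_titlecase(ch: str) -> str:
    """Approximate the single-codepoint titlecase mapping of step (1).

    Assumption: str.title() applies full case mappings, so a result that
    is not exactly one codepoint (e.g. for U+00DF) is treated here as
    "no titlecase property" and the character is left unchanged.
    """
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def decompose(ch: str) -> str:
    """Expand the decomposition property of step (2), of any type.

    Assumption: the expansion is repeated until nothing decomposes,
    which is how "effectively Normalization Form KD" is read here.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    # Drop a leading compatibility tag such as "<compat>" or "<noBreak>".
    codepoints = [f for f in mapping.split() if not f.startswith("<")]
    return "".join(decompose(chr(int(f, 16))) for f in codepoints)


def casemap_key(text: str) -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of text."""
    # Step (3): append the prepared codepoints, then encode as UTF-8.
    prepared = "".join(decompose(simple_titlecase(ch)) for ch in text)
    return prepared.encode("utf-8")


def casemap_equal(a: str, b: str) -> bool:
    """Equality: octet-wise comparison of the prepared strings."""
    return casemap_key(a) == casemap_key(b)


def casemap_compare(a: str, b: str) -> int:
    """Ordering: negative, zero, or positive, as in an octet compare."""
    ka, kb = casemap_key(a), casemap_key(b)
    return (ka > kb) - (ka < kb)
```

   Under this sketch, casemap_equal("hello", "HELLO") is true, and a
   precomposed letter such as U+00E4 compares equal to U+00C4 because
   both prepare to "A" followed by U+0308.  Input here is a Python str,
   so the UTF-8 validity check required above is assumed to have
   happened when the bytes were decoded.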

   Care should be taken when using OS-supplied functions to implement
   this collation as it is not locale sensitive.  Functions such as
   strcasecmp and toupper are sometimes locale sensitive and may
   inconsistently casemap letters.

   The i;unicode-casemap collation is well suited to use with many
   Internet protocols and computer languages.  Use with natural language
   is often inappropriate; even though the collation apparently supports
   languages such as Swahili and English, in real-world use it tends to
   mis-sort a number of types of string:

   o  people and place names containing scripts that are not collated
      according to "alphabetical order".
   o  words with characters that have diacriticals.  However,
      i;unicode-casemap generally does a better job than i;ascii-casemap
      for most (but not all) languages.  For example, German umlaut
      letters will sort correctly, but some Scandinavian letters will
      not.
   o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
      in English).
   o  strings containing other non-letter symbols; e.g., euro and pound
      sterling symbols, quotation marks other than '"', dashes/hyphens,
      etc.

2. Unicode Casemap Collation Registration

   Identifier:     i;unicode-casemap
   Title:          Unicode Casemap
   Operations:     equality order substring
   Specification:  RFC XXXX
   Owner:          IETF
   Submitter:      mrc@cac.washington.edu

3. Security Considerations

   Collations will normally be used with UTF-8 strings.  Thus the
   security considerations for [UTF-8], [STRINGPREP] and
   [UNICODE-SECURITY] also apply and are normative to this
   specification.

4. IANA Considerations

   The i;unicode-casemap collation defined in section 2 should be added
   to the registry of collations defined in [COMPARATOR].

5. Normative References

   The following documents are normative to this document:

   [COMPARATOR]        Newman, C., "Internet Application Protocol
                       Collation Registry", RFC 4790, February 2007.

   [STRINGPREP]        Hoffman, P. and M. Blanchet, "Preparation of
                       Internationalized Strings ("stringprep")",
                       RFC 3454, December 2002.

   [UTF-8]             Yergeau, F., "UTF-8, a transformation format
                       of ISO 10646", STD 63, RFC 3629, November 2003.

   [UNICODE]           UnicodeData.txt

                       Although the UnicodeData.txt file referenced
                       here is part of the Unicode standard, it is
                       subject to change as new characters are added
                       to Unicode and errors are corrected in Unicode
                       revisions.  As a result, it may be less stable
                       than might otherwise be implied by the
                       standards status of this specification.

   [UNICODE-SECURITY]  Davis, M. and M. Suignard, "Unicode Security
                       Considerations", February 2006.

6. Informative References

   [BASIC]             Newman, C., Duerst, M., and Gulbrandsen, A.,
                       "i;basic - the Unicode Collation Algorithm",
                       draft-gulbrandsen-collation-basic, Work in
                       Progress.

   [IMAP-SORT]         Crispin, M., "Internet Message Access Protocol -
                       SORT and THREAD Extensions",
                       draft-ietf-imapext-sort, Work in Progress (in
                       RFC Editor queue).

Appendices

Author's Address

   Mark R. Crispin
   Networks and Distributed Computing
   University of Washington
   4545 15th Avenue NE
   Seattle, WA  98105-4527

   Phone: +1 (206) 543-5762

   EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.