Network Working Group                                         M. Crispin
Internet-Draft                                  University of Washington
Intended status: Proposed Standard                       August 30, 2007
Expires: February 30, 2008
Document: internet-drafts/draft-crispin-collation-unicasemap-07.txt

         i;unicode-casemap - Simple Unicode Collation Algorithm

Status of this Memo

   By submitting this Internet-Draft, each author represents that
   any applicable patent or other IPR claims of which he or she is
   aware have been or will be disclosed, and any of which he or she
   becomes aware will be disclosed, in accordance with Section 6 of
   BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   A revised version of this document will be submitted to the RFC
   editor as an Informational Document for the Internet Community.

   A revised version of this draft document will be submitted to the RFC
   editor as a Proposed Standard for the Internet Community. Discussion
   and suggestions for improvement are requested, and should be sent to
   ietf-imapext@IMC.ORG.

   Distribution of this memo is unlimited.

Abstract

   This document describes "i;unicode-casemap", a simple
   case-insensitive collation for Unicode strings. It provides
   equality, substring and ordering operations.

Introduction

   The "i;ascii-casemap" collation described in [COMPARATOR] is quite
   simple to implement and provides case-independent comparisons for the
   26 Latin alphabetics. It is specified as the default and/or baseline
   comparator in some application protocols, e.g., [IMAP-SORT].

   However, the "i;ascii-casemap" collation does not produce
   satisfactory results with non-ASCII characters. It is possible, with
   a modest extension, to provide a more sophisticated collation with
   greater multilingual applicability than "i;ascii-casemap". This
   extension provides case-independent comparisons for a much greater
   number of characters. It also collates characters with diacriticals
   with the non-diacritical character forms.

   This collation, "i;unicode-casemap", is intended to be an alternative
   to, and preferred over, "i;ascii-casemap". It does not replace the
   "i;basic" collation described in [BASIC].

1. Unicode Casemap Collation Description

   The "i;unicode-casemap" collation is a simple collation which is
   case-insensitive in its treatment of characters. It provides
   equality, substring and ordering operations. The validity test
   operation returns "valid" for any input.

   This collation allows strings in arbitrary (and mixed) character
   sets, as long as the character set for each string is identified and
   it is possible to convert the string to Unicode. Strings which have
   an unidentified character set and/or cannot be converted to Unicode
   are not rejected, but are treated as binary.

   Each input string is prepared by converting it to a "titlecased
   canonicalized UTF-8" string according to the following steps, using
   UnicodeData.txt ([UNICODE-DATA]):

      (1) A Unicode codepoint is obtained from the input string.

          (a) If the input string is in a known charset that can be
              converted to Unicode, a sequence in the string's charset
              is read and checked for validity according to the rules of
              that charset.
              If the sequence is valid, it is converted to a Unicode
              codepoint. Note that for input strings in UTF-8, the
              UTF-8 sequence must be valid according to the rules of
              [UTF-8]; e.g., overlong UTF-8 sequences are invalid.

          (b) If the input string is in an unknown charset, or an
              invalid sequence occurs in step (1)(a), conversion ceases.
              No further preparation is performed, and any partial
              preparation results are discarded. The original string is
              used unchanged with the i;octet comparator.

      (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
          are performed on the resulting codepoint from step (1)(a).

          (a) If the codepoint has a titlecase property in
              UnicodeData.txt (this is normally the same as the
              uppercase property), the codepoint is converted to the
              codepoints in the titlecase property.

          (b) If the resulting codepoint from (2)(a) has a decomposition
              property of any type in UnicodeData.txt, the codepoint is
              converted to the codepoints in the decomposition property.
              This step is recursively applied to each of the resulting
              codepoints until no more decomposition is possible
              (effectively Normalization Form KD).

          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
          (COMBINING CARON). None of U+0044, U+007A, or U+030C has a
          decomposition property. Therefore, U+01C4 is converted
          to U+0044 U+007A U+030C by this step.

      (3) The resulting codepoint(s) from step (2) is/are appended, in
          UTF-8 format, to the "titlecased canonicalized UTF-8" string.

      (4) Repeat from step (1) until there is no more data in the input
          string.

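   As a non-normative illustration, steps (1) through (4) might be
   sketched in Python as follows. The names prepare, compare,
   _titlecase and _decompose are hypothetical and are not defined by
   this document. The sketch assumes that Python's unicodedata module
   reflects the UnicodeData.txt decomposition field. Because the
   standard library does not expose the simple titlecase field
   directly, it approximates that field with str.title() on a single
   codepoint and leaves the codepoint unchanged when that mapping
   expands to more than one codepoint.

```python
import unicodedata
from typing import Optional


def _titlecase(ch: str) -> str:
    # Step (2)(a). Approximation (assumption): str.title() applies the full
    # titlecase mapping, so a multi-codepoint result is discarded and the
    # original codepoint is kept instead.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def _decompose(ch: str) -> str:
    # Step (2)(b): recursively apply the decomposition mapping of any type.
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    # Drop a leading compatibility tag such as "<compat>" or "<font>".
    fields = [f for f in mapping.split() if not f.startswith("<")]
    return "".join(_decompose(chr(int(f, 16))) for f in fields)


def prepare(raw: bytes, charset: Optional[str] = "utf-8") -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of raw."""
    try:
        if charset is None:
            raise LookupError("unidentified charset")
        text = raw.decode(charset, errors="strict")  # step (1)(a)
    except (LookupError, UnicodeError):
        return raw  # step (1)(b): original octets, used with i;octet
    prepared = "".join(_decompose(_titlecase(ch)) for ch in text)  # step (2)
    return prepared.encode("utf-8")  # step (3)


def compare(a: bytes, b: bytes, charset: Optional[str] = "utf-8") -> int:
    """Ordering operation: negative, zero or positive, as for i;octet."""
    ka, kb = prepare(a, charset), prepare(b, charset)
    return (ka > kb) - (ka < kb)


# The U+01C4 example from step (2): prepared as U+0044 U+007A U+030C.
print(prepare("\u01c4".encode("utf-8")))  # b'Dz\xcc\x8c'
```

   Python compares bytes objects octet by octet as unsigned values,
   which is why the prepared strings can be compared directly. The
   equality and substring operations would use == and the "in" operator
   on the same prepared values.
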
   Following the above preparation process on each string, the equality,
   ordering and substring operations are as for i;octet.

   It is permitted to use an alternative implementation of the above
   preparation process if it produces the same results. For example, it
   may be more convenient for an implementation to convert all input
   strings to a sequence of UTF-16 or UTF-32 values prior to performing
   any of the step (2) actions. Similarly, if all input strings are (or
   are convertible to) Unicode, it may be possible to use UTF-32 as an
   alternative to UTF-8 in step (3).

      Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
      because UTF-16 surrogates will cause i;octet to collate codepoints
      U+E000 through U+FFFF after non-BMP codepoints.

   This collation is not locale sensitive. Consequently, care should be
   taken when using OS-supplied functions to implement this collation.
   Functions such as strcasecmp and toupper are sometimes locale
   sensitive and may inconsistently casemap letters.

   The i;unicode-casemap collation is well suited to use with many
   Internet protocols and computer languages. Use with natural language
   is often inappropriate; even though the collation apparently supports
   languages such as Swahili and English, in real-world use it tends to
   mis-sort a number of types of string:

   o  people and place names containing scripts that are not collated
      according to "alphabetical order".
   o  words with characters that have diacriticals. However,
      i;unicode-casemap generally does a better job than i;ascii-casemap
      for most (but not all) languages. For example, German umlaut
      letters will sort correctly, but some Scandinavian letters will
      not.
   o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
      in English).
   o  strings containing other non-letter symbols; e.g., euro and pound
      sterling symbols, quotation marks other than '"', dashes/hyphens,
      etc.

2. Unicode Casemap Collation Registration

      i;unicode-casemap
      Unicode Casemap
      equality order substring
      RFC XXXX
      IETF
      mrc@cac.washington.edu

3. Security Considerations

   The security considerations for [UTF-8], [STRINGPREP] and
   [UNICODE-SECURITY] apply and are normative to this specification.

   The results from this comparator will vary depending upon the
   implementation for several reasons. Implementations MUST consider
   whether these possibilities are a problem for their use case:

   1) New characters added in Unicode may have decomposition or
      titlecase properties that will not be known to an implementation
      based upon an older revision of Unicode. This impacts step (2).

   2) Step (2)(b) defines a subset of Normalization Form KD that does
      not require normalization of out-of-order diacriticals. However,
      an implementation MAY use an NFKD library routine that does such
      normalization. This impacts step (2)(b) and possibly also step
      (1)(a), and is an issue only with ill-formed UTF-8 input.

   3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
      (and, by extension, US-ASCII) are the only mandatory-to-implement
      charsets. This impacts step (1)(a).

      Implementations SHOULD, as far as feasible, support all the
      charsets they are likely to encounter in the input data, in order
      to avoid poor collation caused by the fall through to the (1)(b)
      rule.

   4) Other charsets may have revisions which add new characters that
      are not known to an implementation based upon an older revision.
      This impacts step (1)(a) and possibly also step (1)(b).

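   The difference described in item 2 can be observed with a short,
   non-normative Python sketch. The function name recursive_decompose
   is hypothetical, and the sketch assumes that Python's unicodedata
   module reflects the UnicodeData.txt decomposition field. It compares
   the recursive decomposition of step (2)(b) with a library NFKD
   routine on input whose combining marks are not already in canonical
   order.

```python
import unicodedata


def recursive_decompose(text: str) -> str:
    """Apply step (2)(b) to each codepoint, with no reordering of marks."""
    def one(ch: str) -> str:
        mapping = unicodedata.decomposition(ch)
        if not mapping:
            return ch
        # Skip a compatibility tag such as "<compat>" if one is present.
        fields = [f for f in mapping.split() if not f.startswith("<")]
        return "".join(one(chr(int(f, 16))) for f in fields)
    return "".join(one(ch) for ch in text)


def codepoints(text: str) -> list:
    """Render a string as a list of U+XXXX labels for display."""
    return ["U+%04X" % ord(ch) for ch in text]


# "a", COMBINING ACUTE ACCENT (class 230), COMBINING DOT BELOW (class 220).
sample = "a\u0301\u0323"

print(codepoints(recursive_decompose(sample)))            # order is kept
print(codepoints(unicodedata.normalize("NFKD", sample)))  # marks reordered
```

   The two results differ only in the order of the combining marks, but
   that is enough for the prepared strings to compare as unequal under
   i;octet.
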
   An attacker may create input that is ill-formed or in an unknown
   charset, with the intention of impacting the results of this
   comparator or exploiting other parts of the system which process this
   input in different ways. Note, however, that even well-formed data
   in a known charset can impact the result of this comparator in
   unexpected ways. For example, an attacker can substitute U+0041
   (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
   U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
   non-match of strings which visually appear the same and/or causing
   the string to appear elsewhere in a sort.

4. IANA Considerations

   The i;unicode-casemap collation defined in section 2 should be added
   to the registry of collations defined in [COMPARATOR].

5. Normative References

   The following documents are normative to this document:

   [COMPARATOR]       Newman, C., "Internet Application Protocol
                      Collation Registry", RFC 4790, February 2007.

   [STRINGPREP]       Hoffman, P. and M. Blanchet, "Preparation of
                      Internationalized Strings ("stringprep")",
                      RFC 3454, December 2002.

   [UTF-8]            Yergeau, F., "UTF-8, a transformation format
                      of ISO 10646", STD 63, RFC 3629, November 2003.

   [UNICODE-DATA]

                      Although the UnicodeData.txt file referenced
                      here is part of the Unicode standard, it is
                      subject to change as new characters are added
                      to Unicode and errors are corrected in Unicode
                      revisions. As a result, it may be less stable
                      than might otherwise be implied by the
                      standards status of this specification.

   [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
                      Considerations", February 2006.

6. Informative References

   [BASIC]            Newman, C., Duerst, M., and Gulbrandsen, A.,
                      "i;basic - the Unicode Collation Algorithm",
                      draft-gulbrandsen-collation-basic, Work in
                      Progress.

   [IMAP-SORT]        Crispin, M., "Internet Message Access Protocol -
                      SORT and THREAD Extensions",
                      draft-ietf-imapext-sort, Work in Progress (in
                      RFC Editor queue).

Appendices

Author's Address

   Mark R. Crispin
   Networks and Distributed Computing
   University of Washington
   4545 15th Avenue NE
   Seattle, WA 98105-4527

   Phone: +1 (206) 543-5762

   EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.