idnits 2.17.1 draft-newman-i18n-comparator-11.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 17. -- Found old boilerplate from RFC 3978, Section 5.5 on line 1566. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1543. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1550. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1556. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There is 1 instance of too long lines in the document, the longest one being 52 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 821 has weird spacing: '...=accent e, o,...' == Line 822 has weird spacing: '...ch=case e, ...' 
== The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 23, 2006) is 6545 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 'RFC XXXX' ** Obsolete normative reference: RFC 4234 (ref. '2') (Obsoleted by RFC 5234) ** Obsolete normative reference: RFC 3066 (ref. '5') (Obsoleted by RFC 4646, RFC 4647) ** Obsolete normative reference: RFC 3454 (ref. '6') (Obsoleted by RFC 7564) ** Obsolete normative reference: RFC 3491 (ref. '7') (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. '8' -- Possible downref: Non-RFC (?) normative reference: ref. '9' -- Obsolete informational reference (is this intentional?): RFC 2222 (ref. '11') (Obsoleted by RFC 4422, RFC 4752) -- Obsolete informational reference (is this intentional?): RFC 2822 (ref. 
'13') (Obsoleted by RFC 5322) -- Obsolete informational reference (is this intentional?): RFC 3028 (ref. '15') (Obsoleted by RFC 5228, RFC 5429) -- Obsolete informational reference (is this intentional?): RFC 3501 (ref. '16') (Obsoleted by RFC 9051) == Outdated reference: A later version (-20) exists of draft-ietf-imapext-sort-17 == Outdated reference: A later version (-15) exists of draft-ietf-imapext-i18n-06 Summary: 8 errors (**), 0 flaws (~~), 7 warnings (==), 15 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group C. Newman 2 Internet-Draft Sun Microsystems 3 Expires: November 24, 2006 M. Duerst 4 AGU 5 A. Gulbrandsen 6 Oryx 7 May 23, 2006 9 Internet Application Protocol Collation Registry 10 draft-newman-i18n-comparator-11.txt 12 Status of this Memo 14 By submitting this Internet-Draft, each author represents that any 15 applicable patent or other IPR claims of which he or she is aware 16 have been or will be disclosed, and any of which he or she becomes 17 aware will be disclosed, in accordance with Section 6 of BCP 79. 19 Internet-Drafts are working documents of the Internet Engineering 20 Task Force (IETF), its areas, and its working groups. Note that 21 other groups may also distribute working documents as Internet- 22 Drafts. 24 Internet-Drafts are draft documents valid for a maximum of six months 25 and may be updated, replaced, or obsoleted by other documents at any 26 time. It is inappropriate to use Internet-Drafts as reference 27 material or to cite them other than as "work in progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt. 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 This Internet-Draft will expire on November 24, 2006. 
37    Copyright Notice

39       Copyright (C) The Internet Society (2006).

41    Abstract

43       Many Internet application protocols include string-based lookup,
44       searching, or sorting operations. However the problem space for
45       searching and sorting international strings is large, not fully
46       explored, and is outside the area of expertise for the Internet
47       Engineering Task Force (IETF). Rather than attempt to solve such a
48       large problem, this specification creates an abstraction framework so
49       that application protocols can precisely identify a comparison
50       function and the repertoire of comparison functions can be extended
51       in the future.

53    Table of Contents

55       1. Introduction
56         1.1. Conventions Used in this Document
57       2. Collation Definition and Purpose
58         2.1. Definition
59         2.2. Purpose
60         2.3. Some Other Terms Used in this Document
61         2.4. Sort Keys
62       3. Collation Name Syntax
63         3.1. Basic Syntax
64         3.2. Wildcards
65         3.3. Ordering Direction
66         3.4. URIs
67         3.5. Naming Guidelines
68       4. Collation Specification Requirements
69         4.1. Collation/Server Interface
70         4.2. Operations Supported
71           4.2.1. Validity
72           4.2.2. Equality
73           4.2.3. Substring
74           4.2.4. Ordering
75         4.3. Sort Keys
76         4.4. Use of Lookup Tables
77       5. Application Protocol Requirements
78         5.1. Character Encoding
79         5.2. Operations
80         5.3. Wildcards
81         5.4. Canonicalization Function
82         5.5. Disconnected Clients
83         5.6. Error Codes
84         5.7. Octet Collation
85       6. Use by Existing Protocols
86       7. Collation Registration
87         7.1. Collation Registration Procedure
88         7.2. Collation Registration Format
89           7.2.1. Registration Template
90           7.2.2. The collation Element
91           7.2.3. The name Element
92           7.2.4. The title Element
93           7.2.5. The operations Element
94           7.2.6. The specification Element
95           7.2.7. The submitter Element
96           7.2.8. The owner Element
97           7.2.9. The version Element
98           7.2.10. The UnicodeVersion Element
99           7.2.11. The UCAVersion Element
100          7.2.12. The UCAMatchLevel Element
101        7.3. DTD for Collation Registration
102        7.4. Structure of Collation Registry
103        7.5. Example Initial Registry Summary
104      8. Guidelines for Expert Reviewer
105      9. Initial Collations
106        9.1. ASCII Numeric Collation
107          9.1.1. ASCII Numeric Collation Description
108          9.1.2. ASCII Numeric Collation Registration
109        9.2. ASCII Casemap Collation
110          9.2.1. ASCII Casemap Collation Description
111          9.2.2. ASCII Casemap Collation Registration
112        9.3. Nameprep Collation
113          9.3.1. Nameprep Collation Description
114          9.3.2. Nameprep Collation Registration
115        9.4. Basic Collation
116          9.4.1. Basic Collation Description
117          9.4.2. Basic Collation Registration
118          9.4.3. Basic Accent Sensitive Match Collation Registration
119          9.4.4. Basic Case Sensitive Match Collation Registration
120        9.5. Octet Collation
121          9.5.1. Octet Collation Description
122          9.5.2. Octet Collation Registration
123      10. IANA Considerations
124      11. Security Considerations
125      12. Acknowledgements
126      13. Open Issues
127      14. Change Log
128        14.1. Changes From -10
129        14.2. Changes From -09
130        14.3. Changes From -08
131        14.4. Changes From -06
132        14.5. Changes From -05
133        14.6. Changes From -04
134        14.7. Changes From -03
135        14.8. Changes From -02
136        14.9. Changes From -01
137        14.10. Changes From -00
138      15. References
139        15.1. Normative References
140        15.2. Informative References
141      Authors' Addresses
142      Intellectual Property and Copyright Statements

144   1. Introduction

146      The ACAP [12] specification introduced the concept of a comparator
147      (which we call collation in this document), but failed to create an
148      IANA registry. With the introduction of stringprep [6] and the
149      Unicode Collation Algorithm [8], it is now time to create that
150      registry and populate it with some initial values appropriate for an
151      international community. This specification replaces and generalizes
152      the definition of a comparator in ACAP and creates a collation
153      registry.

155   1.1. Conventions Used in this Document

157      The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
158      in this document are to be interpreted as defined in "Key words for
159      use in RFCs to Indicate Requirement Levels" [1].

161      The attribute syntax specifications use the Augmented Backus-Naur
162      Form (ABNF) [2] notation including the core rules defined in Appendix
163      A. This also inherits ABNF rules from Language Tags [5].

165   2. Collation Definition and Purpose

167   2.1.
Definition

169      A collation is a named function which takes two arbitrary length
170      strings as input and can be used to perform one or more of three
171      basic comparison operations: equality test, substring match, and
172      ordering test.

174   2.2. Purpose

176      Collations are an abstraction layer for comparison functions so that these
177      comparison functions can be used in multiple protocols. The details
178      of a particular comparison operation can be specified by someone with
179      appropriate expertise independent of the application protocols that
180      use that collation. This is similar to the way a charset [14]
181      separates the details of octet to character mapping from a protocol
182      specification such as MIME [10] or the way SASL [11] separates the
183      details of an authentication mechanism from a protocol specification
184      such as ACAP [12].

186      Here is a small diagram to help illustrate the value of this
187      abstraction layer:

189      +-------------------+                         +-----------------+
190      | IMAP i18n SEARCH  |--+                      | Basic           |
191      +-------------------+  |                   +--| Collation Spec  |
192                             |                   |  +-----------------+
193      +-------------------+  |  +-------------+  |  +-----------------+
194      | ACAP i18n SEARCH  |--+--| Collation   |--+--| A stringprep    |
195      +-------------------+  |  | Registry    |  |  | Collation Spec  |
196                             |  +-------------+  |  +-----------------+
197      +-------------------+  |                   |  +-----------------+
198      | ...other protocol |--+                   |  | locale-specific |
199      +-------------------+                      +--| Collation Spec  |
200                                                    +-----------------+

202      Thus IMAP, ACAP and future application protocols with international
203      search capability simply specify how to interface to the collation
204      registry instead of each protocol specification having to specify all
205      the collations it supports.

207   2.3. Some Other Terms Used in this Document

209      The terms client, server and protocol are used in somewhat unusual
210      senses.

212      Client means a user, or a program acting directly on behalf of a
213      user.
This may be a mail reader acting as an IMAP client, or it may
214      be an interactive shell where the user can type protocol commands directly, or
215      it may be a script or program written by the user.

217      Server means a program that performs services requested by the
218      client. This may be a traditional server such as an HTTP server, or
219      it may be a Sieve [15] interpreter running a Sieve script written by
220      a user. A server needs to use the operations provided by collations
221      in order to fulfill the client's requests.

223      The protocol describes how the client tells the server what it wants
224      done, and (if applicable) how the server tells the client about the
225      results. IMAP is a protocol by this definition, and so is the Sieve
226      language.

228   2.4. Sort Keys

230      One component of a collation is a transformation which turns a string
231      into a sort key, which is then used while sorting.

233      The transformation can range from an identity mapping (e.g., the
234      i;octet collation, Section 9.5) to a mapping which makes the string
235      unreadable to a human (e.g., the basic collation, Section 9.4).

237      This is an implementation detail of collations or servers. A
238      protocol SHOULD NOT expose it, since some collations leave the sort
239      key's format up to the implementation, and current conformant
240      implementations are known to use different formats.

242   3. Collation Name Syntax

244   3.1. Basic Syntax

246      The collation name itself is a single US-ASCII string beginning with
247      a letter and made up of letters, digits, and any of the following four
248      symbols: "-", ";", "=" and ".". The name MUST NOT be longer than 254
249      characters.

251         collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "."

253         collation-name = ALPHA *253collation-char

255      The name "default" is reserved. For protocols which have a default
256      collation, "default" refers to that collation.
For other protocols,
257      the name "default" matches no collations, and servers SHOULD treat it
258      in the same way as they treat names of nonexistent collations.

260   3.2. Wildcards

262      The string a client uses to select a collation MAY contain one or
263      more wildcard ("*") characters, each of which matches zero or more
264      collation-chars. Wildcard characters MUST NOT be adjacent. If the wildcard
265      string matches multiple collations, the server SHOULD select the
266      collation with the broadest scope (preferably international scope),
267      the most recent table versions and the greatest number of supported
268      operations.

270         collation-wild = ("*" / (ALPHA ["*"])) *(collation-char ["*"])
271                          ; MUST NOT exceed 254 characters total

273   3.3. Ordering Direction

275      When used as a protocol element for ordering, the collation name MAY
276      be prefixed by either "+" or "-" to explicitly specify an ordering
277      direction. "+" has no effect on the ordering operation, while "-"
278      inverts the result of the ordering operation. In general,
279      collation-order is used when a client requests a collation, and
280      collation-selected is used when the server informs the client of the selected
281      collation.

283         collation-selected = ["+" / "-"] collation-name

285         collation-order = ["+" / "-"] collation-wild

287   3.4. URIs

289      Some protocols are designed to use URIs [4] to refer to collations
290      rather than simple tokens. A special section of the IANA web page is
291      reserved for such usage. The "collation-uri" form is used to refer
292      to a specific IANA registry entry for a specific named collation (the
293      collation registration may not actually be present if it is
294      experimental). The "collation-auri" form is an abstract name for an
295      ordering, a collation pattern or a vendor private collator.
297         collation-uri = "http://www.iana.org/assignments/collation/"
298                         collation-name ".xml"

300         collation-auri = ( "http://www.iana.org/assignments/collation/"
301                          collation-order ".xml" ) / other-uri

303         other-uri =
304                         ; excluding the IANA collation namespace.

306   3.5. Naming Guidelines

308      While this specification makes no absolute requirements on the
309      structure of collation names, naming consistency is important, so the
310      following initial guidelines are provided.

312      Collation names with an international audience typically begin with
313      "i;". Collation names intended for a particular language or locale
314      typically begin with a language tag [5] followed by a ";". After the
315      first ";" is normally the name of the general collation algorithm,
316      followed by a series of algorithm modifications separated by the ";"
317      delimiter. Parameterized modifications will use "=" to delimit the
318      parameter from the value. The version numbers of any lookup tables
319      used by the algorithm SHOULD be present as parameterized
320      modifications.

322      Collation names of the form *;vnd-domain.com;* are reserved for
323      vendor-specific collations created by the owner of the domain name
324      following the "vnd-" prefix (e.g. vnd-example.com for the vendor
325      example.com). Registration of such collations (or the name space as
326      a whole) with intended use of "Vendor" is encouraged when a public
327      specification or open-source implementation is available, but is not
328      required.

330   4. Collation Specification Requirements

332   4.1. Collation/Server Interface

334      The collation itself defines what it operates on. Most collations
335      are expected to operate on character strings. The i;octet
336      (Section 9.5) collation operates on octet strings. The
337      i;ascii-numeric (Section 9.1) collation operates on numbers.

339      This specification defines the collation interface in terms of octet
340      strings. However, implementations may choose to use character
341      strings instead.
Such implementations may not be able to implement
342      e.g. i;octet. Since i;octet is not currently mandatory to implement
343      for any protocol, this should not be a problem.

345   4.2. Operations Supported

347      A collation specification MUST state which of the three basic
348      operations are supported (equality, substring, ordering) and how to
349      perform each of the supported operations on any two input character
350      strings including empty strings. Collations must be deterministic,
351      i.e. given a collation with a specific name, and any two fixed input
352      strings, the result MUST be the same for the same operation.

354      In general, collation operations should behave as their names
355      suggest. While a collation may be new, the operations are not, so
356      the new collation's operations should be similar to those of older
357      collations. For example, a date/time collation should not provide a
358      "substring" operation that would morph IMAP substring SEARCH into
359      e.g. a date-range search.

361      A nonobvious consequence of the rules for each collation operation is
362      that for any single collation, either none or all of the operations
363      can return "undefined". For example, it is not possible to have an
364      equality operation that never returns "undefined" and a substring
365      operation that occasionally does.

367   4.2.1. Validity

369      The validity test takes one string as its argument and returns "valid" if its
370      input string is valid input to the collation's other operations, and
371      "invalid" if not. (In other words, a string is valid if it is equal to
372      itself according to the collation's equality operation.)

374      The validity test is provided by all collations. It MUST NOT be
375      listed separately in the collation registration.

377   4.2.2. Equality

379      The equality test always returns "match" or "no-match" when supplied
380      valid input, and MAY return "undefined" if one or both input strings
381      are not valid.

383      The equality test MUST be reflexive and symmetric.
For valid input,
384      it MUST be transitive.

386      If a collation provides either a substring or an ordering test, it
387      MUST also provide an equality test. The substring and/or ordering
388      tests MUST be consistent with the equality test.

390      In this specification, the return values of the equality test are
391      called "match", "no-match" and "undefined". This is not a
392      specification, merely a choice of phrasing.

394   4.2.3. Substring

396      The substring matching operation determines if the first string is a
397      substring of the second string, i.e., if at least one substring of the
398      second string is equal to the first, as defined by the collation's
399      equality operation.

401      A collation which supports substring matching will automatically
402      support two special cases of substring matching: prefix and suffix
403      matching, if those special cases are supported by the application
404      protocol. It returns "match" or "no-match" when supplied valid input
405      and returns "undefined" when supplied invalid input.

407      Application protocols MAY return position information for substring
408      matches. If this is done, the position information SHOULD include
409      both the starting offset and the ending offset for each match. This
410      is important because more sophisticated collations can match strings
411      of unequal length (for example, a pre-composed accented character can
412      match a decomposed accented character). All matching substrings
413      should be reported, even overlapping matches (as when "ana" occurs
414      twice within "banana").

416      A string is a substring of itself. The empty string is a substring
417      of all strings.

419      Note that the substring operation of some collations can match
420      strings of unequal length. For example, a pre-composed accented
421      character can match a decomposed accented character. Unicode
422      Collation Algorithm [8] discusses this in more detail.
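The substring rules above can be illustrated with a minimal Python sketch. It is not part of this specification. The function names (casemap, is_valid, substring_matches) are hypothetical, and the toy collation used here (US-ASCII input only, with a-z folded to A-Z before comparison) is an assumption chosen for brevity. Because that toy mapping preserves length, the ending offset is always the starting offset plus the length of the search string; a collation that can match strings of unequal length would not have that property, which is why both offsets are reported.

```python
from typing import List, Optional, Tuple


def casemap(s: str) -> str:
    """Hypothetical canonical form: fold ASCII a-z to A-Z, leave the rest."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def is_valid(s: str) -> bool:
    """Hypothetical validity test: this toy collation accepts US-ASCII only."""
    return all(ord(c) < 128 for c in s)


def substring_matches(needle: str, haystack: str) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) offsets of every match, end exclusive.

    None stands for "undefined" (invalid input); an empty list stands for
    "no-match"; a non-empty list stands for "match".
    """
    if not (is_valid(needle) and is_valid(haystack)):
        return None
    n, h = casemap(needle), casemap(haystack)
    # Scan every start position so that overlapping matches are reported.
    return [
        (i, i + len(n))
        for i in range(len(h) - len(n) + 1)
        if h[i:i + len(n)] == n
    ]


print(substring_matches("ana", "banana"))
```

With this sketch, searching for "ana" in "banana" reports two overlapping matches, a string matches itself once, and the empty string matches at every position of any valid string.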
424      In this specification, the return values of the substring operation
425      are called "match", "no-match" and "undefined". This is not a
426      specification, merely a choice of phrasing.

428   4.2.4. Ordering

430      The ordering operation determines how two strings are ordered. It
431      MUST be trichotomous and reflexive. For valid input, it MUST be
432      transitive.

434      Ordering returns "less" if the first string is listed before the
435      second string according to the collation, "greater" if the second
436      string is listed before the first string, and "equal" if the two
437      strings are equal as defined by the collation's equality operation.
438      If one or both strings are invalid, the result of ordering is
439      "undefined".

441      When the collation is used with a "+" prefix, the behavior is the
442      same as when used with no prefix. When the collation is used with a
443      "-" prefix, the result of the ordering operation of the collation
444      MUST be reversed.

446      In this specification, the return values of the ordering operation
447      are called "less", "equal", "greater" and "undefined". This is not a
448      specification, merely a choice of phrasing.

450   4.3. Sort Keys

452      A collation specification SHOULD describe the internal transformation
453      algorithm to generate sort keys. This algorithm can be applied to
454      individual strings and the result can be stored to potentially
455      optimize future comparison operations. A collation MAY specify that
456      the sort key is generated by the identity function. The sort key may
457      have no meaning to a human. The sort key may not be valid input to
458      the collation.

460   4.4. Use of Lookup Tables

462      Some collations use customizable lookup tables, e.g. because the
463      tables depend on locale and may be modified after shipping the
464      software. Collations which use more than one customizable lookup
465      table in a documented format MUST assign numbers to the tables they
466      use.
This permits an application protocol command to access the
467      tables used by a server collation, so that clients and servers use
468      the same tables.

470   5. Application Protocol Requirements

471      This section describes the requirements and issues that an
472      application protocol needs to consider if it offers searching,
473      substring matching and/or sorting, and permits the use of characters
474      outside the US-ASCII charset.

476   5.1. Character Encoding

478      The protocol specification has to make sure that it is clear on which
479      characters (rather than just octets) the collations are used. This
480      can be done by specifying the protocol itself in terms of characters
481      (e.g. in the case of a query language), by specifying a single
482      character encoding for the protocol (e.g. UTF-8 [3]), or by
483      carefully describing the relevant issues of character encoding
484      labeling and conversion. In the latter case, details to consider
485      include how to handle unknown charsets, any charsets which are
486      mandatory-to-implement, any issues with byte-order that might apply,
487      and any transfer encodings which need to be supported.

489   5.2. Operations

491      The protocol must specify which of the operations defined in this
492      specification (equality matching, substring matching and ordering)
493      can be invoked in the protocol, and how they are invoked. There may
494      be more than one way to invoke an operation.

496      The protocol MUST provide a mechanism for the client to select the
497      collation to use with equality matching, substring matching and
498      ordering.

500      If a protocol needs a total ordering and the collation chosen does
501      not provide it because the ordering operation returns "undefined" at
502      least once, the recommended fallback is to sort all invalid strings
503      after the valid ones, and use i;octet to order the invalid strings.

505      Although the collation's substring function provides a list of
506      matches, a protocol need not provide all that to the client.
It may
507      provide only the first matching substring, or even just the
508      information that the substring search matched.

510      If the protocol provides positional information for the results of a
511      substring match, that positional information SHOULD fully specify the
512      substring(s) in the result that match, independent of the length of
513      the search string. For example, returning both the starting and
514      ending offset of the match would suffice, as would the starting
515      offset and a length. Returning just the starting offset is not
516      acceptable. This rule is necessary because advanced collations can
517      treat strings of different lengths as equal (for example,
518      pre-composed and decomposed accented characters).

520   5.3. Wildcards

522      The protocol MUST specify whether it allows the use of wildcards in
523      collation names or not. If the protocol allows wildcards, then:
524         The protocol MUST specify how comparisons behave in the absence of
525         explicit collation negotiation or when a collation of "*" is
526         requested. The protocol MAY specify that the default collation
527         used in such circumstances is sensitive to server configuration.
528         The protocol SHOULD provide a way to list available collations
529         matching a given wildcard pattern or patterns.

531   5.4. Canonicalization Function

533      If the protocol uses a canonicalization function for strings, then
534      use of collations MAY be appropriate for that function. As an
535      example, many protocols use case independent strings. In most cases,
536      a simple ASCII mapping to upper/lower case works well, as
537      i;ascii-casemap offers. However, in some cases another collation may be
538      better, e.g. to handle Turkish dotted/dotless i. Protocol designers
539      should consider in each case whether to use a specifiable collation.

541   5.5.
Disconnected Clients

543      If the protocol supports disconnected clients, then a mechanism for
544      the client to precisely replicate the server's collation algorithm is
545      likely desirable. Thus the protocol MAY wish to provide a command to
546      fetch lookup tables used by charset conversions and collations.

548   5.6. Error Codes

550      The protocol specification should consider assigning protocol error
551      codes for the following circumstances:
552      o  The client requests the use of a collation by name or pattern, but
553         no implemented collation matches that pattern.
554      o  The client attempts to use a collation for an operation that is
555         not supported by that collation. For example, attempting to use
556         the "i;ascii-numeric" collation for substring matching.
557      o  The client uses an equality or substring matching collation and
558         the result is an error. It may be appropriate to distinguish
559         between the two input strings, particularly when one is supplied
560         by the client and one is stored by the server. It might also be
561         appropriate to distinguish the specific case of an invalid UTF-8
562         string.

564   5.7. Octet Collation

566      The i;octet (Section 9.5) collation is only usable with protocols
567      based on octet-strings. Clients and servers MUST NOT use i;octet
568      with other protocols.

570      If the protocol permits the use of collations with data structures
571      other than strings, the protocol MUST describe the default behavior
572      for a collation with those data structures.

574   6. Use by Existing Protocols

576      Both ACAP [12] and Sieve [15] are standards track specifications
577      which used collations prior to the creation of this specification and
578      registry. Those standards do not meet all the application protocol
579      requirements described in Section 5.

581      These protocols allow the use of the i;octet (Section 9.5) collation
582      working directly on UTF-8 data as used in these protocols.

584      In Sieve, all matches are either true or false.
Accordingly, Sieve
servers must treat "undefined" and "no-match" results of the equality
and substring operations as false, and only "match" as true.

IMAP [16] also uses collation, although the use is explicit only when
the COMPARATOR [18] extension is used. The built-in IMAP substring
operation and the ordering provided by the SORT [17] extension may
not meet the requirements made in this document.

Other protocols may be in a similar position.

In IMAP, the default collation is i;ascii-casemap, because its
operations most closely resemble IMAP's built-in operations.

7. Collation Registration

7.1. Collation Registration Procedure

The IETF will create a mailing list, collation@ietf.org, which can be
used for public discussion of collation proposals prior to
registration. Use of the mailing list is strongly encouraged. The
IESG will appoint a designated expert who will monitor the
collation@ietf.org mailing list and review registrations.

The registration procedure begins when a completed registration
template is sent to iana@iana.org and collation@ietf.org. The
designated expert is expected to tell IANA and the submitter of the
registration within two weeks whether the registration is approved,
approved with minor changes, or rejected with cause. When a
registration is rejected with cause, it can be re-submitted if the
concerns listed in the cause are addressed. Decisions made by the
designated expert can be appealed to the IESG Applications Area
Director, then to the IESG. Appeals follow the normal appeals
procedure for IESG decisions.

Collation registrations in a standards track, BCP or IESG-approved
experimental RFC are owned by the IETF, and changes to the
registration follow normal procedures for updating such documents.
Collation registrations in other RFCs are owned by the RFC author(s).
Other collation registrations are owned by the individual(s) listed
in the contact field of the registration, and IANA will preserve this
information. Changes to a registration MUST be approved by the
owner. In the event the owner cannot be contacted for a period of
one month and a change is deemed necessary, the IESG MAY re-assign
ownership to an appropriate party.

7.2. Collation Registration Format

Registration of a collation is done by sending a well-formed XML
document that validates with collationreg.dtd (Section 7.3).

7.2.1. Registration Template

Here is a template for the registration:

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="YYYY" scope="i18n" intendedUse="common">
     <name>collation name</name>
     <title>technical title for collation</title>
     <operations>equality order substring</operations>
     <specification>specification reference</specification>
     <owner>email address of owner or IETF</owner>
     <submitter>email address of submitter</submitter>
     <version>1</version>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
   </collation>

7.2.2. The collation Element

The root of the registration document MUST be a <collation> element.
The collation element contains the other elements in the
registration, which are described in the following sub-subsections,
in the order given here.

The <collation> element MAY include an "rfc=" attribute if the
specification is in an RFC. The "rfc=" attribute gives only the
number of the RFC, without any prefix, such as "RFC", or suffix, such
as ".txt".

The <collation> element MUST include a "scope=" attribute, which MUST
have one of the values "i18n", "local" or "other".

The <collation> element MUST include an "intendedUse=" attribute,
which MUST have one of the values "common", "limited", "vendor", or
"deprecated". Collation specifications intended for "common" use are
expected to reference standards from standards bodies with
significant experience dealing with the details of international
character sets.

Be aware that future revisions of this specification may add
additional function types, as well as additional XML attributes,
values and elements.
Any system which automatically parses these XML
documents MUST take this into account to preserve future
compatibility. A DTD for the current definition of the collation
registration template is given in Section 7.3.

7.2.3. The name Element

The <name> element gives the precise name of the collation. The
<name> element is mandatory.

7.2.4. The title Element

The <title> element gives the title of the collation. The <title>
element is mandatory.

7.2.5. The operations Element

The <operations> element lists which of the three operations
("equality", "order" or "substring") the collation provides,
separated by single spaces. The <operations> element is mandatory.

7.2.6. The specification Element

The <specification> element describes where to find the
specification. The <specification> element is mandatory. It MAY
have a URI attribute. There may be more than one <specification>
element, in which case they together form the specification.

If it is discovered that parts of a collation specification conflict,
a new revision of the collation is necessary, and the
collation@ietf.org mailing list should be notified.

7.2.7. The submitter Element

The <submitter> element provides an RFC 2822 [13] email address for
the person who submitted the registration. It is optional if the
<owner> element contains an email address.

There may be more than one <submitter> element.

7.2.8. The owner Element

The <owner> element contains either the four letters "IETF" or an
email address of the owner of the registration. The <owner> element
is mandatory. There may be more than one <owner> element. If so,
all owners are equal. Each owner can speak for all.

7.2.9. The version Element

The <version> element is included when the registration is likely to
be revised or has been revised in such a way that the results change
for certain input strings. The <version> element is optional.
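
As an illustration of the elements described so far, the following is a minimal sketch of how a tool might read a registration document. It is not part of the registration procedure. The function name read_registration and the shape of the returned dictionary are assumptions made for this example, and it checks only the elements this section calls mandatory. Unknown operation names are kept rather than rejected, since future revisions may add operation types.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

# Elements that Sections 7.2.3 to 7.2.8 describe as mandatory.
MANDATORY = ("name", "title", "operations", "specification", "owner")


def read_registration(xml_text: str) -> Dict[str, Union[Optional[str], List[str]]]:
    """Parse a collation registration and return its main fields.

    Hypothetical helper: raises ValueError if the root element is not
    <collation> or if a mandatory child element is missing.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "collation":
        raise ValueError("root element must be <collation>")
    for tag in MANDATORY:
        if root.find(tag) is None:
            raise ValueError("missing mandatory <%s> element" % tag)

    def texts(tag: str) -> List[str]:
        # Collect the text of every child element with this tag.
        return [(e.text or "").strip() for e in root.findall(tag)]

    version = root.findtext("version")
    return {
        "name": texts("name")[0],
        "title": texts("title")[0],
        # Operations are separated by single spaces.
        "operations": texts("operations")[0].split(" "),
        "specifications": texts("specification"),
        "owners": texts("owner"),
        "submitters": texts("submitter"),
        "version": version.strip() if version is not None else None,
    }
```

A caller could, for example, look at the "operations" entry to decide whether a collation can be offered for substring searches.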
7.2.10. The UnicodeVersion Element

The <UnicodeVersion> element indicates the version number of the
UnicodeData file on which the collation is based. The
<UnicodeVersion> element is optional.

7.2.11. The UCAVersion Element

The <UCAVersion> element specifies the version of the Unicode
Collation Algorithm on which the collation is based. The
<UCAVersion> element is optional.

7.2.12. The UCAMatchLevel Element

The <UCAMatchLevel> element specifies the number of Unicode Collation
Algorithm sort key levels used for the equality and substring
operations. The <UCAMatchLevel> element is optional.

7.3. DTD for Collation Registration

   <!--
     DTD for Collation Registration Document

     Data types:

       entity        description
       ======        ===========
       NUMBER        [0-9]+
       URI           As defined in RFC 3986
       CTEXT         printable ASCII text (no line-terminators)
       TEXT          character data
   -->
   <!ENTITY % NUMBER "CDATA">
   <!ENTITY % URI    "CDATA">
   <!ENTITY % CTEXT  "#PCDATA">
   <!ENTITY % TEXT   "#PCDATA">
   <!ELEMENT collation (name,title,operations,specification+,
                        owner+,submitter*,version?,
                        UnicodeVersion?,UCAVersion?,
                        UCAMatchLevel?)>
   <!ATTLIST collation
             rfc         %NUMBER;                           "0"
             scope       (i18n|local|other)                 #IMPLIED
             intendedUse (common|limited|vendor|deprecated) #IMPLIED>
   <!ELEMENT name           (%CTEXT;)>
   <!ELEMENT title          (%CTEXT;)>
   <!ELEMENT operations     (%CTEXT;)>
   <!ELEMENT specification  (%TEXT;)>
   <!ATTLIST specification
             uri %URI; "">
   <!ELEMENT owner          (%CTEXT;)>
   <!ELEMENT submitter      (%CTEXT;)>
   <!ELEMENT version        (%CTEXT;)>
   <!ELEMENT UnicodeVersion (%CTEXT;)>
   <!ELEMENT UCAVersion     (%CTEXT;)>
   <!ELEMENT UCAMatchLevel  (%CTEXT;)>

7.4.
Structure of Collation Registry

Once the registration is approved, IANA will store each XML
registration document in a URL of the form
http://www.iana.org/assignments/collation/collation-name.xml where
collation-name is the contents of the name element in the
registration. Both the submitter and the designated expert are
responsible for verifying that the XML is well-formed and complies
with the DTD.

IANA will also maintain a text summary of the registry under the name
http://www.iana.org/assignments/collation/summary.txt. This summary
is divided into four sections. The first section is for collations
intended for common use. This section is intended for collation
registrations published in IESG-approved RFCs or for locally scoped
collations from the primary standards body for that locale. The
designated expert is encouraged to reject collation registrations
with an intended use of "common" if the expert believes it should be
"limited", as it is desirable to keep the number of "common"
registrations small and high quality. The second section is reserved
for limited use collations. The third section is reserved for
registered vendor-specific collations. The final section is reserved
for deprecated collations.

7.5.
Example Initial Registry Summary

The following is an example of how IANA might structure the initial
registry summary.txt file:

   Collation                              Functions Scope Reference
   ---------                              --------- ----- ---------
   Common Use Collations:
   i;nameprep;v=1;uv=3.2                  e, o, s   i18n  [RFC XXXX]
   i;basic;uca=3.1.1;uv=3.2               e, o, s   i18n  [RFC XXXX]
   i;basic;uca=3.1.1;uv=3.2;match=accent  e, o, s   i18n  [RFC XXXX]
   i;basic;uca=3.1.1;uv=3.2;match=case    e, o, s   i18n  [RFC XXXX]
   i;ascii-casemap                        e, o, s   Local [RFC XXXX]

   Limited Use Collations:
   i;octet                                e, o, s   Other [RFC XXXX]
   i;ascii-numeric                        e, o      Other [RFC XXXX]

   Vendor Collations:

   Deprecated Collations:
   i;ascii-casemap                        e, o, s   Local [RFC XXXX]

   References
   ----------
   [RFC XXXX]  Newman, C., "Internet Application Protocol Collation
               Registry", RFC XXXX, Sun Microsystems, October 2003.

8. Guidelines for Expert Reviewer

The expert reviewer appointed by the IESG has fairly broad latitude
for this registry. While a number of collations are expected
(particularly customizations of the basic collation for localized
use), an explosion of collations (particularly common use collations)
is not desirable for widespread interoperability. However, it is
important for the expert reviewer to provide cause when rejecting a
registration, and when possible to describe corrective action to
permit the registration to proceed. The following list includes
some example reasons to reject a registration with cause:

o  The registration is not a well-formed XML document that follows
   the DTD.
o  The registration has an intended use of "common", but there is no
   evidence the collation will be widely deployed, so it should be
   listed as "limited".
o  The registration has an intended use of "common", but it is
   redundant with the functionality of a previously registered
   "common" collation.
o  The registration has an intended use of "common", but the
   specification is not detailed enough to allow interoperable
   implementations by others.
o  The collation name fails to precisely identify the version numbers
   of relevant tables to use.
o  The registration fails to meet one of the "MUST" requirements in
   Section 4.
o  The collation name fails to meet the syntax in Section 3.
o  The collation specification referenced in the registration is
   vague or has optional features without a clear behavior specified.
o  The referenced specification does not adequately address security
   considerations specific to that collation.
o  The registration's operations are needlessly different from those
   of traditional operations.

9. Initial Collations

This section describes an initial set of collations for the collation
registry.

9.1. ASCII Numeric Collation

9.1.1. ASCII Numeric Collation Description

The "i;ascii-numeric" collation is a simple collation intended for
use with arbitrary sized unsigned decimal integer numbers stored as
octet strings. US-ASCII digits (0x30 to 0x39) represent digits of
the numbers. Before converting from string to integer, the input
string is truncated at the first non-digit character (so for example,
"4294967298", "04294967298" and "4294967298b" all represent the same
number, 4294967298).

The collation supports equality and ordering, but does not support
the substring operation.

The equality operation returns "match" if the two strings represent
the same number (i.e., leading zeroes are disregarded), "no-match" if
the two strings represent different numbers, and "undefined" if
either string is empty or does not start with a digit.
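
The equality operation just described can be sketched as follows. This is an illustrative sketch only, not a normative implementation. The function names leading_digits and ascii_numeric_equality are assumptions made for this example. To cope with arbitrarily large numbers, the sketch compares digit strings with leading zeroes removed rather than converting them to a fixed-size integer.

```python
from typing import Optional


def leading_digits(data: bytes) -> Optional[bytes]:
    """Return the leading US-ASCII digits of data without leading zeroes.

    Returns None if data is empty or does not start with a digit.
    The number zero is represented by the empty byte string.
    """
    end = 0
    # Truncate at the first octet outside 0x30..0x39.
    while end < len(data) and 0x30 <= data[end] <= 0x39:
        end += 1
    if end == 0:
        return None
    return data[:end].lstrip(b"0")


def ascii_numeric_equality(first: bytes, second: bytes) -> str:
    """Return "match", "no-match" or "undefined" for two octet strings."""
    a = leading_digits(first)
    b = leading_digits(second)
    if a is None or b is None:
        return "undefined"
    return "match" if a == b else "no-match"
```

With these definitions, the three example strings given above compare as "match" with each other, and an empty string compared with anything yields "undefined".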
The ordering operation returns "less" if the first string represents
a smaller number than the second, "equal" if they represent the same
number, and "greater" if the first string represents a larger number
than the second. If either string is empty or starts with a
non-digit, the ordering operation returns "undefined".

9.1.2. ASCII Numeric Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="other" intendedUse="limited">
     <name>i;ascii-numeric</name>
     <title>ASCII Numeric</title>
     <operations>equality order</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
   </collation>

9.2. ASCII Casemap Collation

9.2.1. ASCII Casemap Collation Description

The "i;ascii-casemap" collation is a simple collation which operates
on octet strings and treats US-ASCII letters case-insensitively. It
provides equality, substring and ordering operations. All input is
valid.

Its equality, ordering and substring operations are as for i;octet,
except that first, the lower-case letters (octet values 97-122) in
each input string are changed to upper case (octet values 65-90).

Care should be taken when using OS-supplied functions to implement
this collation, as it is not locale sensitive. Functions such as
strcasecmp and toupper are sometimes locale sensitive and may
inappropriately map lower-case letters other than a-z to upper case.

The i;ascii-casemap collation is well suited to use with many
internet protocols and computer languages. Use with natural language
is often inappropriate: even though the collation apparently supports
languages such as Italian and English, in real-world use it tends to
stumble over words such as "naive", names such as "Llwyd", people and
place names containing non-ASCII, euro and pound sterling symbols,
quotation marks, dashes/hyphens, etc.

9.2.2.
ASCII Casemap Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="local" intendedUse="common">
     <name>i;ascii-casemap</name>
     <title>ASCII Casemap</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
   </collation>

9.3. Nameprep Collation

9.3.1. Nameprep Collation Description

The "i;nameprep;v=1;uv=3.2" collation is an implementation of the
nameprep [7] specification based on normalization tables from Unicode
version 3.2. This collation applies the nameprep canonicalization
function to both input strings and then returns the result of the
i;octet collation on the canonicalized strings. While this collation
offers all three operations, the ordering operation it provides is
inadequate for use by the majority of the world.

Version number 1 is applied to nameprep as specified in RFC 3491. If
the nameprep specification is revised without any changes that would
produce different results when given the same pair of input octet
strings, then the version number will remain unchanged.

The table numbers for tables used by nameprep are as follows:

   +--------------+-----------------------+
   | Table Number | Table Name            |
   +--------------+-----------------------+
   | 1            | UnicodeData-3.2.0.txt |
   | 2            | Table B.1             |
   | 3            | Table B.2             |
   | 4            | Table C.1.2           |
   | 5            | Table C.2.2           |
   | 6            | Table C.3             |
   | 7            | Table C.4             |
   | 8            | Table C.5             |
   | 9            | Table C.6             |
   | 10           | Table C.7             |
   | 11           | Table C.8             |
   | 12           | Table C.9             |
   +--------------+-----------------------+

9.3.2. Nameprep Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="i18n" intendedUse="common">
     <name>i;nameprep;v=1;uv=3.2</name>
     <title>Nameprep</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
     <version>1</version>
     <UnicodeVersion>3.2</UnicodeVersion>
   </collation>

9.4. Basic Collation

9.4.1.
Basic Collation Description

The basic collation is intended to provide tolerable results for a
number of languages for all three operations (equality, substring and
ordering) so it is suitable as a mandatory-to-implement collation for
protocols which include ordering support. The ordering operation of
the basic collation is the Unicode Collation Algorithm [8] version 14
(UCAv14).

The equality and substring operations are created as described in
UCAv14 section 8. While that section is informative to UCAv14, it is
normative to this collation specification.

This collation is based on Unicode version 3.2, with the following
tables relevant:

1. For the normalization step,

   is used. Column 5 is used to determine the canonical
   decomposition, while column 3 contains the canonical combining
   classes necessary to attain canonical order.
2. The table of characters which require a logical order exception
   is a subset of the table in
    and
   is included here:

   0E40..0E44    ; Logical_Order_Exception
   # Lo   [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI
   0EC0..0EC4    ; Logical_Order_Exception
   # Lo   [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI

   # Total code points: 10

3. The table used to translate normalized code points to a sort key
   is .

UCAv14 includes a number of configurable parameters and steps
labelled as potentially optional. The following list summarizes the
defaults used by this collation:

o  The logical order exception step is mandatory by default to
   support the largest number of languages.
o  Steps 2.1.1 to 2.1.3 are mandatory as the repertoire of the basic
   collation is intended to be large.
o  The second level in the sort key is evaluated forwards by default.
o  The variable weighting uses the "non-ignorable" option by default.
o  The semi-stable option is not used by default.
o  Support for exactly three levels of collation is the default
   behavior.
o  No preprocessing step is used by the basic collation prior to
   applying the UCAv14 algorithm. Note that an application protocol
   specification MAY require pre-processing prior to the use of any
   collations.
o  The equality and substring algorithms exclude differences at level
   2 and 3 by default (thus the collation is case-insensitive and
   ignores accentual distinctions).
o  The equality and substring algorithms use the "Whole Characters
   Only" feature described in UCAv14 section 8 by default.

The exact collation name with these defaults is
"i;basic;uca=3.1.1;uv=3.2". When a specification states that the
basic collation is mandatory-to-implement, only this specific name is
mandatory-to-implement.

In order to allow modification of the optional behaviors, the
following ABNF is used for variations of the basic collation:

   basic-collation = ("i" / Language-Tag) ";basic;uca=3.1.1;uv=3.2"
                     [";match=accent" / ";match=case"]
                     [";tailor=" 1*collation-char ]

If multiple modifiers appear, they MUST appear in the order described
above. The modifiers have the following meanings:

match=accent  Both the first and second levels of the sort keys are
              considered relevant to the equality and substring
              operations (rather than the default of first level
              only). This makes the matching functions sensitive to
              accentual distinctions.
match=case    The first three levels of sort keys are considered
              relevant to the equality and substring operations.
              This makes the matching functions sensitive to both
              case and accentual distinctions.

The default weighting option is "non-ignorable". The "semi-stable"
sort key option is not used by default.

Sort keys are generated as described in section 4.3 of the UCA
specification. (Note that the result is not a string of characters.)
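
The relationship between the basic collation name and the number of sort key levels used for equality and substring matching, as given by the ABNF and modifier descriptions above, can be sketched as follows. This is only an illustration. The function name basic_match_level is an assumption made for this example, and because Language-Tag and collation-char are defined elsewhere, the sketch approximates both as "one or more characters other than a semicolon".

```python
import re
from typing import Optional

# Approximation of the basic-collation ABNF. Assumption: Language-Tag
# and collation-char are simplified to "any character except ';'".
_BASIC_NAME = re.compile(
    r"(?P<lang>[^;]+);basic;uca=3\.1\.1;uv=3\.2"
    r"(?:;match=(?P<match>accent|case))?"
    r"(?:;tailor=(?P<tailor>[^;]+))?"
)

# Number of sort key levels considered by equality and substring.
_MATCH_LEVELS = {None: 1, "accent": 2, "case": 3}


def basic_match_level(name: str) -> Optional[int]:
    """Return 1, 2 or 3 for a basic collation name, else None.

    None is returned for names that do not follow the ABNF, including
    names whose modifiers appear in the wrong order.
    """
    m = _BASIC_NAME.fullmatch(name)
    if m is None:
        return None
    return _MATCH_LEVELS[m.group("match")]
```

For instance, the default name maps to level 1, while a name carrying ";match=case" maps to level 3. A name with ";tailor=" placed before ";match=" is rejected because the modifiers are out of order.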
Finally, the UCAv14 algorithm permits the "allkeys" table to be
tailored to a language. People who make quality tailorings are
encouraged to register those tailorings using the collation registry.
Tailoring names beginning with "x" are reserved for experimental use,
are treated as "Limited use" and MUST NOT match wildcards if any
registered collation is available that does match.

9.4.2. Basic Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2</name>
     <title>Basic</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>1</UCAMatchLevel>
   </collation>

9.4.3. Basic Accent Sensitive Match Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2;match=accent</name>
     <title>Basic Accent Sensitive Match</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>2</UCAMatchLevel>
   </collation>

9.4.4. Basic Case Sensitive Match Collation Registration

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2;match=case</name>
     <title>Basic Case Sensitive Match</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>3</UCAMatchLevel>
   </collation>

9.5. Octet Collation

9.5.1. Octet Collation Description

The "i;octet" collation is a simple and fast collation intended for
use on binary octet strings rather than on character data. Protocols
that want to make this collation available have to do so by
explicitly allowing it. If not explicitly allowed, it MUST NOT be
used. It never returns an "undefined" result. It provides equality,
substring and ordering operations.

The ordering algorithm is as follows:

1. If both strings are the empty string, return the result "equal".
2. If the first string is empty and the second is not, return the
   result "less".
3. If the second string is empty and the first is not, return the
   result "greater".
4.
   If both strings begin with the same octet value, remove the first
   octet from both strings and repeat this algorithm from step 1.
5. If the unsigned value (0 to 255) of the first octet of the first
   string is less than the unsigned value of the first octet of the
   second string, then return "less".
6. If this step is reached, return "greater".

This algorithm is roughly equivalent to the C library function memcmp
with appropriate length checks added.

The equality operation returns "match" if the ordering algorithm would
return "equal". Otherwise the equality operation returns "no-match".

The substring operation returns "match" if the first string is the
empty string, or if there exists a substring of the second string of
length equal to the length of the first string which would result in
a "match" result from the equality operation. Otherwise the substring
operation returns "no-match".

9.5.2. Octet Collation Registration

This collation is defined with intendedUse="limited" because it can
only be used by protocols that explicitly allow it.

   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="XXXX" scope="other" intendedUse="limited">
     <name>i;octet</name>
     <title>Octet</title>
     <operations>equality order substring</operations>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submitter>chris.newman@sun.com</submitter>
   </collation>

10. IANA Considerations

Section 7 defines how to register collations with IANA. Section 9
defines a list of predefined collations, which should be registered
when this document is approved and published as an RFC.

IANA publishes the DTD itself at the URL
http://www.iana.org/assignments/collation/collationreg.dtd.

11. Security Considerations

Collations will normally be used with UTF-8 strings. Thus the
security considerations for UTF-8 [3], stringprep [6] and Unicode
TR-36 [9] also apply and are normative to this specification.

12.
Acknowledgements

The authors want to thank all who have contributed to this document,
including at least John Cowan, Dave Cridland, Mark Davis, Lisa
Dusseault, Frank Ellermann, Philip Guenther, Tony Hansen, Kjetil
Torgrim Homme, Michael Kay, Alexey Melnikov, Jim Melton and Abhijit
Menon-Sen.

13. Open Issues

When converting this to an RFC, several things must be done: Martin
Duerst's name request, checking for unfortunate page breaks, adding a
note to the RFC editor to possibly replace the 3066 reference.

Mark Davis writes:

   The sample registry would suffer a combinatorial explosion if
   parameters are not handled differently. For example, with CLDR
   collations, there can be hundreds of locales, six different strength
   settings; four different case-first settings; three different
   alternate settings, backwards settings, normalization settings, case
   level settings, hiragana settings, and numeric settings; plus a
   variable-top setting which has a string as an operand. Registering
   the combinations that people are allowed to use would be untenable.

Maybe the new DTD from Martin Duerst fixes this.

Dave Cridland suggests that collations never return error. Have
asked for rationale.

Is it appropriate to use just the level 1 for equality checking in
the basic collation? Level 1 is case insensitive and disregards
accents.

Can "undefined" cover all the cases "error" covered, or do we need an
"error" case in addition to "undefined"?

Martin Duerst suggests, sensibly:

   Can we give "the string a client uses to select a collation" a name?
   E.g. "collection request string" or some such?

Should collation names be called collation identifiers? Having both
name and title in the DTD is a bit confusing. Try it in -12?
New DTD should permit comments, at least enough to say "The
specification is a republication of the earlier document (URL HERE)"
and things like that.

14. Change Log

14.1. Changes From -10

1. Updated contact details for Martin Duerst.
2. Various textual improvements.
3. The registration's file name now has a mandatory .xml extension.
4. Removed binding MUST for Sieve; it's more appropriate to put that
   in 3028bis.
5. Syntax fix in registration example.
6. When there are multiple specifications, they now act in concert,
   so it's possible to have e.g. a main specification and multiple
   locale-specific supplements. It is not possible to name multiple
   locations for the same specification any more. That'll return as
   a comment feature.
7. Hopefully clearer exposition of i;ascii-casemap.
8. The ban on registering octet-based collations is lifted. One
   hopes that the collation mailing list will present a suitable
   threshold - not too high, not too low.
9. The DTD is published where IE can see it while looking at the
   registrations.

14.2. Changes From -09

1. Rename "error" to "undefined", as suggested by Mark Davis. The
   new name makes for nicer prose IMO.
2. 7b=7 according to i;ascii-numeric. ACAP/Sieve need it.
3. Clarified that even though the collation specification returns a
   list of substrings, the protocol/server need not use all of that
   information. (As indeed IMAP SEARCH does not.)
4. Registrations go directly to the collation list _and_ to the
   IANA, not to the IANA and from there forwarded to designated
   expert.
5. Added an acknowledgements list and populated it with a quick grep
   from my mailbox and memory. Surely incomplete.
6. Noted that in sieve, "no-match" and "undefined" must be treated
   in the same way by the engine.
7. Finish the rename from canonical to sort key.
8.
   Don't fall back to i;octet from any other collation. Return
   undefined instead. Note that protocols may fall back to i;octet
   to provide total ordering, if necessary.
9. Call the things operations everywhere, not operators/operations.

14.3. Changes From -08

1.  i;ascii-casemap instead of en;ascii-casemap.
2.  UCA v 14. Changing to "latest version of UCA" was suggested,
    but rejected since IETF standards reference stable
    specifications, and "latest" is a moving target.
3.  Removed all text on multi-valued attributes. Can be added once
    there is a concrete need for it, either in an update to this
    document or in the protocol that needs it.
4.  "Collations MUST specify the canonicalization". Well, the UCA
    doesn't, so I changed that to a MAY.
5.  Add some text explaining why one might want to download tables.
6.  Changed the remaining instances of "canonicalization" to talk
    about sort keys. Added a note that a collation's sort key need
    not be valid input to the same collation.
7.  Reserve the word "default" and use it to name a protocol's
    default collation, provided that protocol has a default
    collation. In earlier versions of the draft, "*" was used to
    name the default collation, but "*" also was implicitly defined
    as the most general collation available.
8.  Reinstate the different-length example of substring match.
    Explain what an overlapping match is, by the canonical example.
9.  Avoid the word "contain" when talking about substring matches.
    Fewer terms is better.
10. Until -07, both a collation and equality/substring/sort were
    called functions. In -07, the trio was renamed as operations.
    Now, the DTD is updated to match.
11. Appeals go to the Apps AD before the general AD, as suggested by
    Spencer Dawkins.

14.4. Changes From -06

1.
   Clarified equality and identity: equality is as defined by a
   collation, identity is stronger.
2. Added reference to
   http://www.unicode.org/reports/tr10/#Searching.
3. Don't describe sort keys as a canonical representation of the
   string.
4. Permit disconnected clients to use wildcards. (A disconnected
   client has to resolve the wildcard itself, in the same way that a
   server would.)
5. Change collation-wild to have the same length limit as collation.
6. Change to use "less" instead of "-1", etc., and specify that it's
   just phrasing, not specification.
7. Don't describe the equality, substring and ordering operations as
   functions. The definition of collation uses the word function
   about the collation itself. A function that has three functions?
   Something has to give.
8. Strike a requirement that selecting '*' is the same as not
   selecting any collation. It restricted the protocol's default
   too much. Existing code wasn't listening.
9. Left out the canonicalization/sort keys.

14.5. Changes From -05

1. Added definitions of client, server and protocol, and prose to
   specify that while the IANA registrations of collations are
   written in terms of octet strings, implementations may do it
   differently.
2. Changed the wording for ascii-numeric to treat the numbers as
   numbers, etc.
3. Added explicit property requirements for the three functions,
   e.g. that equality be symmetric. Added requirements that the
   three functions be consistent, and that if any operations are
   present, equality must be (needed for consistency).
4. Random editing, e.g. changing 'numbers' for ascii-numeric to
   'integer numbers'.
5. Gave IMAP/SORT/COMPARATOR the same grandfather treatment as ACAP
   and SIEVE.

14.6. Changes From -04

Grammar and clarity changes only. One (weak) example added. No
substantive changes.

14.7.
Changes From -03

(This does not include all changes made.)

1. Checked and resolved most issues marked 'check whether this is
   true' or similar.
2. Resolved nameprep issue: No.
3. Removed NULL for compatibility with existing collations (IMAP
   SORT, Sieve).
4. There can be multiple owners and submitters. Say how.
5. Added a requirement that common collations must now be
   interoperable. Insufficiently detailed specs cannot be "common".
6. Added a guideline that the operations provided by new collations
   should be reminiscent of similar operations on existing
   collations.

14.8. Changes From -02

1. Changed from data being octet sequences (in UTF-8) to data being
   character sequences (with octet collation as an exception).
2. Made XML format description much more structured.
3. Changed to , because this spelling is much
   more common.
4. Defined 'protocol' to include query languages.
5. Reorganized document, in particular IANA considerations section
   (which newly is just a list of pointers).
6. Added subsections, and a 'Structure of this Document' section.
7. Updated references.
8. Created a 'Change Log' chapter, with sections for each draft.
9. Reduced 'Open issues' section, open issues are now maintained at
   http://www.w3.org/2004/08/ietf-collation.

14.9. Changes From -01

Add IANA comment to open issues. Otherwise this is just a re-publish
to keep the document alive.

14.10. Changes From -00

1. Replaced the term comparator with collation. While comparator is
   somewhat more precise because these abstract functions are used
   for matching as well as ordering, collation is the term used by
   other parts of the industry. Thus I have changed the name to
   collation for consistency.
2. Remove all modifiers to the basic collation except for the
   customization and the match rules.
      The other behavior modifications can be specified in a
      customization of the collation.
   3. Use ";" instead of "-" as delimiter between parameters to make
      names more URL-ish.
   4. Add URL form for comparator reference.
   5. Switched registration template to use XML document.
   6. Added a number of useful registration template elements related
      to the Unicode Collation Algorithm.
   7. Switched language from "custom" to "tailor" to match UCA
      language for tailoring of the collation algorithm.

15. References

15.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [2]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
        Specifications: ABNF", RFC 4234, October 2005.

   [3]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
        STD 63, RFC 3629, November 2003.

   [4]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
        Resource Identifier (URI): Generic Syntax", RFC 3986,
        January 2005.

   [5]  Alvestrand, H., "Tags for the Identification of Languages",
        BCP 47, RFC 3066, January 2001.

   [6]  Hoffman, P. and M. Blanchet, "Preparation of
        Internationalized Strings ("stringprep")", RFC 3454,
        December 2002.

   [7]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile
        for Internationalized Domain Names (IDN)", RFC 3491,
        March 2003.

   [8]  Davis, M. and K. Whistler, "Unicode Collation Algorithm
        version 14", May 2005.

   [9]  Davis, M. and M. Suignard, "Unicode Security Considerations",
        February 2006.

15.2. Informative References

   [10] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
        Extensions (MIME) Part One: Format of Internet Message
        Bodies", RFC 2045, November 1996.

   [11] Myers, J., "Simple Authentication and Security Layer (SASL)",
        RFC 2222, October 1997.

   [12] Newman, C. and J.
        Myers, "ACAP -- Application Configuration Access Protocol",
        RFC 2244, November 1997.

   [13] Resnick, P., "Internet Message Format", RFC 2822, April 2001.

   [14] Freed, N. and J. Postel, "IANA Charset Registration
        Procedures", BCP 19, RFC 2978, October 2000.

   [15] Showalter, T., "Sieve: A Mail Filtering Language", RFC 3028,
        January 2001.

   [16] Crispin, M., "Internet Message Access Protocol - Version
        4rev1", RFC 3501, March 2003.

   [17] Crispin, M. and K. Murchison, "Internet Message Access
        Protocol - Sort and Thread Extensions",
        draft-ietf-imapext-sort-17.txt (work in progress), May 2004.

   [18] Newman, C. and A. Gulbrandsen, "Internet Message Access
        Protocol Internationalization",
        draft-ietf-imapext-i18n-06.txt (work in progress),
        January 2006.

Authors' Addresses

   Chris Newman
   Sun Microsystems
   1050 Lakes Drive
   West Covina, CA 91790
   US

   Email: chris.newman@sun.com

   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
   possible, for example as "Dürst" in XML and HTML.)
   Aoyama Gakuin University
   5-10-1 Fuchinobe
   Sagamihara, Kanagawa 229-8558
   Japan

   Phone: +81 42 759 6329
   Fax: +81 42 759 6495
   Email: mailto:duerst@it.aoyama.ac.jp
   URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

   Arnt Gulbrandsen
   Oryx Mail Systems GmbH
   Schweppermannstr. 8
   Munich 81671
   Germany

   Phone: +49 89 4502 9757
   Fax: +49 89 4502 9758
   Email: mailto:arnt@oryx.com
   URI: http://www.oryx.com/arnt/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology
   described in this document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights. Information on the procedures with respect to rights
   in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention
   any copyrights, patents or patent applications, or other
   proprietary rights that may cover technology that may be required
   to implement this standard. Please address the information to the
   IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is
   subject to the rights, licenses and restrictions contained in
   BCP 78, and except as set forth therein, the authors retain all
   their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.