idnits 2.17.1 

draft-crispin-collation-unicasemap-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 303.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 314.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 321.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 327.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 347 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 38 instances of too long lines in the document, the longest
     one being 1 character in excess of 72.

  ** The abstract seems to contain references ([BASIC], [COMPARATOR],
     [IMAP-SORT]), which it shouldn't.  Please replace those with straight
     textual mentions of the documents in question.

  ** The document seems to lack both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 192: '...easons.  Implementations MUST consider...'
     RFC 2119 keyword, line 201: '...n implementation MAY use an NFKD libra...'
     RFC 2119 keyword, line 209: '... Implementations SHOULD, as far as fea...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 30, 2007) is 6077 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'IMAP-SORT' is mentioned on line 270, but not defined

  == Missing Reference: 'BASIC' is mentioned on line 265, but not defined

  ** Obsolete normative reference: RFC 3454 (ref. 'STRINGPREP') (Obsoleted by
     RFC 7564)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE-DATA'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UNICODE-SECURITY'


     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         M. Crispin
3	Internet-Draft                                  University of Washington
4	Intended status: Proposed Standard                       August 30, 2007
5	Expires: February 30, 2008
6	Document: internet-drafts/draft-crispin-collation-unicasemap-07.txt

8	             i;unicode-casemap - Simple Unicode Collation Algorithm

10	Status of this Memo

12	    By submitting this Internet-Draft, each author represents that
13	    any applicable patent or other IPR claims of which he or she is
14	    aware have been or will be disclosed, and any of which he or she
15	    becomes aware will be disclosed, in accordance with Section 6 of
16	    BCP 79.

18	    Internet-Drafts are working documents of the Internet Engineering
19	    Task Force (IETF), its areas, and its working groups.  Note that
20	    other groups may also distribute working documents as
21	    Internet-Drafts.

23	    Internet-Drafts are draft documents valid for a maximum of six months
24	    and may be updated, replaced, or obsoleted by other documents at any
25	    time.  It is inappropriate to use Internet-Drafts as reference
26	    material or to cite them other than as "work in progress."

28	    The list of current Internet-Drafts can be accessed at
29	    http://www.ietf.org/ietf/1id-abstracts.txt

31	    The list of Internet-Draft Shadow Directories can be accessed at
32	    http://www.ietf.org/shadow.html.

37	    A revised version of this draft document will be submitted to the RFC
38	    editor as a Proposed Standard for the Internet Community.  Discussion
39	    and suggestions for improvement are requested, and should be sent to
40	    ietf-imapext@IMC.ORG.

42	    Distribution of this memo is unlimited.

44	Abstract

46	    This document describes "i;unicode-casemap", a simple
47	    case-insensitive collation for Unicode strings.  It provides
48	    equality, substring and ordering operations.

50	Introduction

52	    The "i;ascii-casemap" collation described in [COMPARATOR] is quite
53	    simple to implement and provides case-independent comparisons for the
54	    26 ASCII letters.  It is specified as the default and/or baseline
55	    comparator in some application protocols, e.g., [IMAP-SORT].

57	    However, the "i;ascii-casemap" collation does not produce
58	    satisfactory results with non-ASCII characters.  It is possible, with
59	    a modest extension, to provide a more sophisticated collation with
60	    greater multilingual applicability than "i;ascii-casemap".  This
61	    extension provides case-independent comparisons for a much greater
62	    number of characters.  It also collates characters that carry
63	    diacritical marks alongside their unmarked base forms.

65	    This collation, "i;unicode-casemap", is intended to be an alternative
66	    to, and preferred over, "i;ascii-casemap".  It does not replace the
67	    "i;basic" collation described in [BASIC].

69	1. Unicode Casemap Collation Description

71	    The "i;unicode-casemap" collation is a simple collation which is
72	    case-insensitive in its treatment of characters.  It provides
73	    equality, substring and ordering operations.  The validity test
74	    operation returns "valid" for any input.

76	    This collation allows strings in arbitrary (and mixed) character
77	    sets, as long as the character set for each string is identified and
78	    it is possible to convert the string to Unicode.  Strings which have
79	    an unidentified character set and/or cannot be converted to Unicode
80	    are not rejected, but are treated as binary.

82	    Each input string is prepared by converting it to a "titlecased
83	    canonicalized UTF-8" string according to the following steps, using
84	    UnicodeData.txt ([UNICODE-DATA]):

86	       (1) A Unicode codepoint is obtained from the input string.

88	           (a) If the input string is in a known charset that can be
89	               converted to Unicode, a sequence in the string's charset
90	               is read and checked for validity according to the rules of
91	               that charset.  If the sequence is valid, it is converted
92	               to a Unicode codepoint.  Note that for input strings in
93	               UTF-8, the UTF-8 sequence must be valid according to the
94	               rules of [UTF-8]; e.g., overlong UTF-8 sequences are
95	               invalid.

97	           (b) If the input string is in an unknown charset, or an
98	               invalid sequence occurs in step (1)(a), conversion ceases.
99	               No further preparation is performed, and any partial
100	               preparation results are discarded.  The original string is
101	               used unchanged with the i;octet comparator.

103	       (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
104	           are performed on the resulting codepoint from step (1)(a).

106	           (a) If the codepoint has a titlecase property in
107	               UnicodeData.txt (this is normally the same as the
108	               uppercase property), the codepoint is converted to the
109	               codepoints in the titlecase property.

111	           (b) If the resulting codepoint from (2)(a) has a decomposition
112	               property of any type in UnicodeData.txt, the codepoint is
113	               converted to the codepoints in the decomposition property.
114	               This step is recursively applied to each of the resulting
115	               codepoints until no more decomposition is possible
116	               (effectively Normalization Form KD).

118	           Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
119	           has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
120	           WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
121	           decomposition property of U+0044 (LATIN CAPITAL LETTER D)
122	           U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
123	           decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
124	           (COMBINING CARON).  None of U+0044, U+007A, and U+030C has a
125	           decomposition property.  Therefore, U+01C4 is converted
126	           to U+0044 U+007A U+030C by this step.

128	       (3) The resulting codepoint(s) from step (2) is/are appended, in
129	           UTF-8 format, to the "titlecased canonicalized UTF-8" string.

131	       (4) Repeat from step (1) until there is no more data in the input
132	           string.

134	    Following the above preparation process on each string, the equality,
135	    ordering and substring operations are as for i;octet.
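
    The preparation steps and the i;octet-based operations above can be
    sketched in Python as follows.  This is a minimal, non-authoritative
    sketch under stated assumptions: the names prepare, equal, order and
    is_substring are hypothetical; Python's unicodedata module stands in
    for UnicodeData.txt, so results depend on the Unicode version bundled
    with the interpreter; and the titlecase property is approximated with
    str.title(), keeping only single-codepoint results.

```python
import unicodedata

def _titlecase(ch):
    # Step (2)(a).  Assumption: str.title() on one character gives the
    # UnicodeData.txt titlecase mapping when the result is one codepoint;
    # longer results (from SpecialCasing.txt) are ignored.
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch

def _decompose(ch):
    # Step (2)(b): recursively expand the decomposition property,
    # skipping compatibility tags such as "<compat>".
    field = unicodedata.decomposition(ch)
    if not field:
        return ch
    return "".join(_decompose(chr(int(cp, 16)))
                   for cp in field.split() if not cp.startswith("<"))

def prepare(data, charset="utf-8"):
    # Step (1): strict decoding.  An unknown charset or an invalid
    # sequence falls through to step (1)(b): the octets are used as-is.
    if charset is None:
        return data
    try:
        text = data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return data
    # Steps (2) to (4): map every codepoint, then emit UTF-8.
    return "".join(_decompose(_titlecase(ch)) for ch in text).encode("utf-8")

def equal(a, b):
    return prepare(a) == prepare(b)

def order(a, b):
    # Octet-wise comparison of the prepared strings: -1, 0 or 1.
    pa, pb = prepare(a), prepare(b)
    return (pa > pb) - (pa < pb)

def is_substring(needle, haystack):
    return prepare(needle) in prepare(haystack)
```

    With this sketch, prepare() applied to the UTF-8 form of U+01C4 yields
    the UTF-8 form of U+0044 U+007A U+030C, as in the example for step (2).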

137	    It is permitted to use an alternative implementation of the above
138	    preparation process if it produces the same results.  For example, it
139	    may be more convenient for an implementation to convert all input
140	    strings to a sequence of UTF-16 or UTF-32 values prior to performing
141	    any of the step (2) actions.  Similarly, if all input strings are (or
142	    are convertible to) Unicode, it may be possible to use UTF-32 as an
143	    alternative to UTF-8 in step (3).

145	       Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
146	       because UTF-16 surrogates will cause i;octet to collate codepoints
147	       U+E000 through U+FFFF after non-BMP codepoints.
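
    The ordering problem described in this note can be observed directly.
    The following Python sketch compares one BMP codepoint above the
    surrogate range (U+FFFD) with the first non-BMP codepoint (U+10000)
    under three encodings.  The helper name octet_order is hypothetical,
    and big-endian byte order is assumed so that octet order follows
    code-unit order.

```python
bmp_high = "\uFFFD"      # BMP codepoint above the surrogate range
non_bmp = "\U00010000"   # first codepoint outside the BMP

def octet_order(a, b, encoding):
    # Compare the encoded octet strings the way i;octet would.
    ea, eb = a.encode(encoding), b.encode(encoding)
    return (ea > eb) - (ea < eb)

utf8_result = octet_order(bmp_high, non_bmp, "utf-8")       # codepoint order
utf32_result = octet_order(bmp_high, non_bmp, "utf-32-be")  # codepoint order
utf16_result = octet_order(bmp_high, non_bmp, "utf-16-be")  # reversed

print(utf8_result, utf32_result, utf16_result)
```

    UTF-8 and UTF-32 place U+FFFD before U+10000, whereas UTF-16 encodes
    U+10000 as the surrogate pair D800 DC00, which sorts before FFFD.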

149	    This collation is not locale-sensitive.  Consequently, care should be
150	    taken when using OS-supplied functions to implement this collation.
151	    Functions such as strcasecmp and toupper are sometimes
152	    locale-sensitive and may casemap letters inconsistently.

154	    The i;unicode-casemap collation is well suited to use with many
155	    Internet protocols and computer languages.  Use with natural language
156	    is often inappropriate; even though the collation apparently supports
157	    languages such as Swahili and English, in real-world use it tends to
158	    mis-sort a number of types of string:

160	    o  people and place names containing scripts that are not collated
161	       according to "alphabetical order".
162	    o  words with characters that have diacriticals.  However,
163	       i;unicode-casemap generally does a better job than i;ascii-casemap
164	       for most (but not all) languages.  For example, German umlaut
165	       letters will sort correctly, but some Scandinavian letters will
166	       not.
167	    o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
168	       in English).
169	    o  strings containing other non-letter symbols; e.g., euro and pound
170	       sterling symbols, quotation marks other than '"', dashes/hyphens,
171	       etc.

173	2. Unicode Casemap Collation Registration

175	    <?xml version='1.0'?>
176	    <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
177	    <collation rfc="XXXX" scope="global" intendedUse="common">
178	      <identifier>i;unicode-casemap</identifier>
179	      <title>Unicode Casemap</title>
180	      <operations>equality order substring</operations>
181	      <specification>RFC XXXX</specification>
182	      <owner>IETF</owner>
183	      <submitter>mrc@cac.washington.edu</submitter>
184	    </collation>

186	3. Security Considerations

188	    The security considerations for [UTF-8], [STRINGPREP] and
189	    [UNICODE-SECURITY] apply and are normative to this specification.

191	    The results from this comparator will vary depending upon the
192	    implementation for several reasons.  Implementations MUST consider
193	    whether these possibilities are a problem for their use case:

195	     1) New characters added in Unicode may have decomposition or
196	        titlecase properties that will not be known to an implementation
197	        based upon an older revision of Unicode.  This impacts step (2).

199	     2) Step (2)(b) defines a subset of Normalization Form KD that does
200	        not require normalization of out-of-order diacriticals.  However,
201	        an implementation MAY use an NFKD library routine that does such
202	        normalization.  This impacts step (2)(b) and possibly also step
203	        (1)(a), and is an issue only with ill-formed UTF-8 input.

205	     3) The set of charsets handled in step (1)(a) is open-ended.  UTF-8
206	        (and, by extension, US-ASCII) are the only mandatory-to-implement
207	        charsets.  This impacts step (1)(a).

209	        Implementations SHOULD, as far as feasible, support all the
210	        charsets they are likely to encounter in the input data, in order
211	        to avoid poor collation caused by the fall through to the (1)(b)
212	        rule.

214	     4) Other charsets may have revisions which add new characters that
215	        are not known to an implementation based upon an older revision.
216	        This impacts step (1)(a) and possibly also step (1)(b).
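
    Items 1) and 2) above can be illustrated with a short Python sketch.
    It is a non-authoritative illustration: decompose_only is a
    hypothetical helper that performs only the recursive decomposition of
    step (2)(b), and Python's unicodedata module is assumed as the source
    of Unicode properties.  The sample string carries two combining marks
    that are not in canonical order.

```python
import unicodedata

def decompose_only(text):
    # Recursive decomposition as in step (2)(b); no reordering of marks.
    out = []
    for ch in text:
        field = unicodedata.decomposition(ch)
        if field:
            parts = [chr(int(cp, 16)) for cp in field.split()
                     if not cp.startswith("<")]
            out.append(decompose_only("".join(parts)))
        else:
            out.append(ch)
    return "".join(out)

# "a" + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220)
sample = "a\u0301\u0323"

step_2b = decompose_only(sample)               # mark order is preserved
nfkd = unicodedata.normalize("NFKD", sample)   # marks are reordered

# Item 1): the Unicode revision known to this implementation.
print(unicodedata.unidata_version)
print(step_2b == nfkd)
```

    Here the two results differ, so an implementation built on a full
    NFKD routine and one that follows step (2)(b) literally would collate
    this input differently.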

218	    An attacker may create input that is ill-formed or in an unknown
219	    charset, with the intention of impacting the results of this
220	    comparator or exploiting other parts of the system which process this
221	    input in different ways.  Note, however, that even well-formed data
222	    in a known charset can impact the result of this comparator in
223	    unexpected ways.  For example, an attacker can replace U+0041
224	    (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
225	    U+0410 (CYRILLIC CAPITAL LETTER A), intending that strings which
226	    look identical fail to match and/or that the string appears
227	    elsewhere in a sort.
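
    The look-alike example can be checked with a few lines of Python.
    This sketch assumes Python's unicodedata module as the property
    source; the variable names are illustrative only.  It shows that none
    of the three codepoints is changed by titlecasing or decomposition, so
    their prepared forms remain three distinct octet strings.

```python
import unicodedata

lookalikes = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "GREEK CAPITAL LETTER ALPHA": "\u0391",
    "CYRILLIC CAPITAL LETTER A": "\u0410",
}

# True only if no codepoint is altered by titlecasing or decomposition.
unchanged = all(
    ch.title() == ch and unicodedata.decomposition(ch) == ""
    for ch in lookalikes.values()
)

# The prepared form of each is therefore just its UTF-8 encoding.
prepared = sorted(ch.encode("utf-8") for ch in lookalikes.values())
print(unchanged, prepared)
```

    The three prepared strings compare unequal under i;octet and sort in
    the order Latin, Greek, Cyrillic, regardless of how similar the glyphs
    look.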

229	4. IANA Considerations

231	    The i;unicode-casemap collation defined in section 2 should be added
232	    to the registry of collations defined in [COMPARATOR].

234	5. Normative References

236	    The following documents are normative to this document:

238	    [COMPARATOR]          Newman, C., "Internet Application Protocol
239	                          Collation Registry", RFC 4790, February 2007.

241	    [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
242	                          Internationalized Strings ("stringprep")",
243	                          RFC 3454, December 2002.

245	    [UTF-8]               Yergeau, F., "UTF-8, a transformation format
246	                          of ISO 10646", STD 63, RFC 3629, November 2003.

248	    [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
249	                          UnicodeData.txt>

251	                          Although the UnicodeData.txt file referenced
252	                          here is part of the Unicode standard, it is
253	                          subject to change as new characters are added
254	                          to Unicode and errors are corrected in Unicode
255	                          revisions.  As a result, it may be less stable
256	                          than might otherwise be implied by the
257	                          standards status of this specification.

259	    [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
260	                          Considerations", February 2006,
261	                          <http://www.unicode.org/reports/tr36/>.

263	6. Informative References

265	    [BASIC]               Newman, C., Duerst, M., and A. Gulbrandsen,
266	                          "i;basic - the Unicode Collation Algorithm",
267	                          draft-gulbrandsen-collation-basic, Work in
268	                          Progress.

270	    [IMAP-SORT]           Crispin, M., "Internet Message Access Protocol -
271	                          SORT and THREAD Extensions",
272	                          draft-ietf-imapext-sort, Work in Progress (in
273	                          RFC Editor queue).

275	Appendices

277	Author's Address

279	    Mark R. Crispin
280	    Networks and Distributed Computing
281	    University of Washington
282	    4545 15th Avenue NE
283	    Seattle, WA  98105-4527

285	    Phone: +1 (206) 543-5762

287	    EMail: MRC@CAC.Washington.EDU

289	Full Copyright Statement

291	    Copyright (C) The IETF Trust (2007).

293	    This document is subject to the rights, licenses and restrictions
294	    contained in BCP 78, and except as set forth therein, the authors
295	    retain all their rights.

297	    This document and the information contained herein are provided on an
298	    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
299	    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
300	    THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
301	    OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
302	    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
303	    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

305	Intellectual Property

307	    The IETF takes no position regarding the validity or scope of any
308	    Intellectual Property Rights or other rights that might be claimed to
309	    pertain to the implementation or use of the technology described in
310	    this document or the extent to which any license under such rights
311	    might or might not be available; nor does it represent that it has
312	    made any independent effort to identify any such rights.  Information
313	    on the procedures with respect to rights in RFC documents can be
314	    found in BCP 78 and BCP 79.

316	    Copies of IPR disclosures made to the IETF Secretariat and any
317	    assurances of licenses to be made available, or the result of an
318	    attempt made to obtain a general license or permission for the use of
319	    such proprietary rights by implementers or users of this
320	    specification can be obtained from the IETF on-line IPR repository at
321	    http://www.ietf.org/ipr.

323	    The IETF invites any interested party to bring to its attention any
324	    copyrights, patents or patent applications, or other proprietary
325	    rights that may cover technology that may be required to implement
326	    this standard.  Please address the information to the IETF at
327	    ietf-ipr@ietf.org.

329	Acknowledgement

331	    Funding for the RFC Editor function is currently provided by the
332	    Internet Society.