idnits 2.17.1 

draft-ietf-iri-comparison-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  -- The draft header indicates that this document updates RFC3986, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
     (Using the creation date from RFC3986, updated by this document, for
     RFC5378 checks: 2002-11-01)

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008.  The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust.  If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 23, 2012) is 4196 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC2119' is defined on line 505, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3490' is defined on line 508, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3491' is defined on line 512, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3629' is defined on line 516, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Duplicate reference: RFC3987, mentioned in 'RFC3987', was also mentioned
     in 'RFC3987bis'.


     Summary: 3 errors (**), 0 flaws (~~), 6 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource Identifiers                       L. Masinter
3	(iri)                                                              Adobe
4	Internet-Draft                                                 M. Duerst
5	Updates: 3986 (if approved)                     Aoyama Gakuin University
6	Intended status: Standards Track                        October 23, 2012
7	Expires: April 26, 2013

9	   Comparison, Equivalence and Canonicalization of Internationalized
10	                          Resource Identifiers
11	                      draft-ietf-iri-comparison-02

13	Abstract

15	   Internationalized Resource Identifiers (IRIs) are Unicode strings
16	   used to identify resources on the Internet.  Applications that use
17	   IRIs often define a means of comparing IRIs to determine when two
18	   IRIs are equivalent for the purpose of that application.  Some
19	   applications also define a method for canonicalizing an IRI --
20	   translating one IRI into another which is equivalent under the
21	   comparison method used.

23	   This document gives guidelines and best practices for defining and
24	   using IRI comparison and canonicalization methods.

26	   Comparison methods are used to determine equivalence.  As URIs are a
27	   subset of IRIs, the guidelines apply to URI comparison as well.

29	Status of this Memo

31	   This Internet-Draft is submitted in full conformance with the
32	   provisions of BCP 78 and BCP 79.

34	   Internet-Drafts are working documents of the Internet Engineering
35	   Task Force (IETF).  Note that other groups may also distribute
36	   working documents as Internet-Drafts.  The list of current Internet-
37	   Drafts is at http://datatracker.ietf.org/drafts/current/.

39	   Internet-Drafts are draft documents valid for a maximum of six months
40	   and may be updated, replaced, or obsoleted by other documents at any
41	   time.  It is inappropriate to use Internet-Drafts as reference
42	   material or to cite them other than as "work in progress."

44	   This Internet-Draft will expire on April 26, 2013.

46	Copyright Notice

48	   Copyright (c) 2012 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (http://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the Simplified BSD License.

61	   This document may contain material from IETF Documents or IETF
62	   Contributions published or made publicly available before November
63	   10, 2008.  The person(s) controlling the copyright in some of this
64	   material may not have granted the IETF Trust the right to allow
65	   modifications of such material outside the IETF Standards Process.
66	   Without obtaining an adequate license from the person(s) controlling
67	   the copyright in such materials, this document may not be modified
68	   outside the IETF Standards Process, and derivative works of it may
69	   not be created outside the IETF Standards Process, except to format
70	   it for publication as an RFC or to translate it into languages other
71	   than English.

73	Table of Contents

75	   1.  Introduction
76	   2.  General guidelines
77	   3.  Preparation for Comparison
78	   4.  Comparison Hierarchy
79	     4.1.  Simple String Comparison
80	     4.2.  Syntax-Based Equivalence
81	       4.2.1.  Case Equivalence
82	       4.2.2.  Unicode Character Normalization
83	       4.2.3.  Percent-Encoding Equivalence
84	       4.2.4.  Path Segment Equivalence
85	     4.3.  Scheme-Based Comparison
86	     4.4.  Protocol-Based Comparison
87	   5.  Security Considerations
88	   6.  Acknowledgements
89	   7.  References
90	     7.1.  Normative References
91	     7.2.  Informative References
92	   Authors' Addresses

94	1.  Introduction

96	   Internationalized Resource Identifiers (IRIs) are Unicode strings
97	   used to identify resources on the Internet.  Applications that use
98	   IRIs often define a means of comparing IRIs to determine when two
99	   IRIs are equivalent for the purpose of that application.  Some
100	   applications also define a method for canonicalizing an IRI --
101	   translating one IRI into another which is equivalent under the
102	   comparison method used.

104	   This document gives guidelines and best practices for defining and
105	   using IRI comparison and canonicalization methods.

107	   As every URI is also an IRI, the comparison and canonicalization
108	   methods also apply to URIs.

110	   IRI comparison is expected to determine whether two IRIs are
111	   equivalent without using the IRIs to access their respective
112	   resource(s).  For example, comparisons are performed whenever a
113	   response cache is accessed, a browser checks its history to color a
114	   link, or an XML parser processes tags within a namespace.

116	   Comparison for equivalence is often accomplished by canonicalization
117	   (sometimes called normalization): a process for converting data that
118	   has more than one possible representation into a "standard",
119	   "normal", or "canonical" form.  Extensive canonicalization prior to
120	   comparison of IRIs may be used by spiders and indexing engines to
121	   prune a search space or reduce duplication of request actions and
122	   response storage.

124	   IRI comparison is performed for some particular purpose.  Protocols
125	   or implementations that compare IRIs for different purposes will
126	   often be subject to differing design trade-offs in regards to how
127	   much effort should be spent in reducing aliased identifiers.  This
128	   document describes various methods that may be used to compare IRIs,
129	   the trade-offs between them, and the types of applications that might
130	   use them.

132	2.  General guidelines

134	   Because IRIs exist to identify resources, one might expect two IRIs
135	   to be considered equivalent when they identify the same resource.
136	   However, this definition of equivalence is not of much practical use,
137	   as there is in general no way for an implementation to compare two
138	   resources to determine if they are "the same" unless it has full
139	   knowledge or control of them.  Comparison methods for IRIs are
140	   generally based strictly on examining the characters that make up the
141	   IRI, without performing any network access.

143	   We use the terms "different" and "equivalent" to describe the
144	   possible outcomes of such comparisons, but there are many
145	   application-dependent versions of equivalence.

147	   Even when it is possible to determine that two IRIs are equivalent,
148	   IRI comparison is not sufficient to determine whether two IRIs
149	   identify different resources.  For example, an owner of two different
150	   domain names could decide to serve the same resource from both,
151	   resulting in two different IRIs.  For this reason, false negatives
152	   (e.g., returning "different" even when the resources are "the same")
153	   cannot be completely avoided.  Comparison methods often try to
154	   minimize false negatives while strictly avoiding false positives.
155	   However, in some cases (such as cache invalidation), false negatives
156	   are more harmful than false positives.

158	   A comparison method for determining equivalence might have multiple
159	   values, for example, returning "equivalent", "different", or
160	   "equivalence cannot be determined".

162	   Multiple canonicalization (normalization) methods might be defined,
163	   where sequential application of each results in greater sets of
164	   equivalent values.

166	   In testing for equivalence, applications should not directly compare
167	   relative references; the references should be converted to their
168	   respective target IRIs before comparison (see [RFC3987bis]).
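
   As a minimal sketch of this step, the following Python fragment
   resolves each reference against its base before comparing.  It
   assumes that the standard-library function urllib.parse.urljoin is an
   acceptable stand-in for reference resolution; the helper name
   targets_equal and the example.org IRIs are hypothetical.

```python
from urllib.parse import urljoin

def targets_equal(base_a, ref_a, base_b, ref_b):
    """Resolve each reference against its own base, then compare targets."""
    return urljoin(base_a, ref_a) == urljoin(base_b, ref_b)

# The relative references "../c" and "c" differ as strings, but their
# target IRIs are identical once each is resolved against its base.
same = targets_equal("http://example.org/a/b", "../c",
                     "http://example.org/", "c")
```

   Comparing "../c" with "c" directly would report a difference that
   disappears once both are resolved.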

170	   Some IRIs contain fragment identifiers.  In general, the equivalence
171	   of two IRIs is determined first by comparing the IRIs without any
172	   fragment identifiers, and then (if appropriate) the fragment
173	   components (if any) are compared.
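
   The two-stage comparison can be sketched as follows.  The function
   names are hypothetical, and simple string comparison is used only as
   a placeholder for whichever comparison method an application selects.

```python
def split_fragment(iri):
    """Return (iri_without_fragment, fragment); fragment is None if absent."""
    base, sep, fragment = iri.partition("#")
    return base, (fragment if sep else None)

def equivalent(a, b, compare_fragments=True):
    """Compare the fragment-less IRIs first, then (optionally) the fragments."""
    base_a, frag_a = split_fragment(a)
    base_b, frag_b = split_fragment(b)
    if base_a != base_b:  # placeholder: simple string comparison
        return False
    if not compare_fragments:
        return True
    # None (no fragment) and "" (empty fragment) are kept distinct.
    return frag_a == frag_b
```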

175	   Some applications (such as XML namespaces) use IRIs as identity
176	   tokens without any relationship to accessing the resources.  Those
177	   applications use the Simple String Comparison (see Section 4.1).

179	3.  Preparation for Comparison

181	   Any kind of IRI comparison REQUIRES that any additional contextual
182	   processing is first performed, including undoing higher-level
183	   escapings or encodings in the protocol or format that carries an IRI.
184	   This preprocessing is usually done when the protocol or format is
185	   parsed.

187	   NOTE: This document has not yet been updated to use in-line Unicode
188	   examples.

190	   Examples of such escapings or encodings are entities and numeric
191	   character references in [HTML4] and [XML1].  As an example,
192	   "http://example.org/ros&eacute;" (in HTML),
193	   "http://example.org/ros&#233;" (in HTML or XML), and
194	   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
195	   what is denoted in this document (see 'Notation' section of
196	   [RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here
197	   standing for the actual e-acute character, to compensate for the fact
198	   that this document cannot contain non-ASCII characters).
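
   For the HTML case, this undoing of character references can be
   sketched with Python's standard html.unescape function.  This is an
   illustration only; in practice the step is normally performed by the
   HTML or XML parser itself, and XML does not predefine "&eacute;".

```python
import html

# The three spellings from the example above, as they might appear in
# the source text of an HTML document.
carried = [
    "http://example.org/ros&eacute;",
    "http://example.org/ros&#233;",
    "http://example.org/ros&#xE9;",
]

# After the character references are resolved, one IRI remains.
resolved = {html.unescape(text) for text in carried}
```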

200	   An IRI is a sequence of Unicode characters.  IRIs are sometimes
201	   represented in documents as sequences of bytes in a charset, either
202	   Unicode-based (UTF-8) or using some other character encoding (e.g.,
203	   ISO-8859-1).  Before comparing two such sequences, they must both be
204	   converted into sequences of Unicode characters.

206	   Similarly, encodings such as Transfer Codings in HTTP (see [RFC2616])
207	   and Content Transfer Encodings in MIME ([RFC2045]) must be unencoded.
208	   In these cases, the encoding is based not on characters but on
209	   octets, and additional care is required to make sure that characters,
210	   and not just arbitrary octets, are compared (see Section 4.1).

212	4.  Comparison Hierarchy

214	   In practice, a variety of methods are used to test IRI equivalence.
215	   These methods generally fall into a range distinguished by the amount
216	   of processing required and the degree to which the probability of
217	   false negatives is reduced.  As noted above, false negatives cannot
218	   be eliminated.  In practice, their probability can be reduced, but
219	   this reduction requires more processing and is not cost-effective for
220	   all applications.

222	   The following discussion starts with comparison methods that are
223	   cheap but have a relatively higher chance of producing false
224	   negatives, and proceeds to those that have higher computational
225	   cost and lower risk of false negatives.

227	4.1.  Simple String Comparison

229	   If two IRIs (when considered as strings of Unicode characters) are
230	   identical, then it is safe to conclude that they are equivalent.
231	   This type of equivalence test has very low computational cost and is
232	   in wide use in a variety of applications, particularly in the domain
233	   of parsing.  It is also used when a definitive answer to the question
234	   of IRI equivalence is needed that is independent of the scheme used
235	   and that can be calculated quickly and without accessing a network.
236	   An example of such a case is XML Namespaces ([XMLNamespace]).

238	   Testing strings for equivalence requires some basic precautions.
239	   This procedure is often referred to as "bit-for-bit" or "byte-for-
240	   byte" comparison, which is potentially misleading.  Testing strings
241	   for equality is normally based on pair comparison of the characters
242	   that make up the strings, starting from the first and proceeding
243	   until both strings are exhausted and all characters are found to be
244	   equal, until a pair of characters compares unequal, or until one of
245	   the strings is exhausted before the other.

247	   This character comparison requires that each pair of characters be
248	   put in comparable encoding form.  For example, should one IRI be
249	   stored in a byte array in UTF-8 encoding form and the second in a
250	   UTF-16 encoding form, bit-for-bit comparisons applied naively will
251	   produce errors.  It is better to speak of equality on a character-
252	   for-character rather than on a byte-for-byte or bit-for-bit basis.
253	   In practical terms, character-by-character comparisons should be done
254	   codepoint by codepoint after conversion to a common character
255	   encoding form.  When comparing character by character, the comparison
256	   function MUST NOT map IRIs to URIs, because such a mapping would
257	   create additional spurious equivalences.  It follows that an IRI
258	   SHOULD NOT be modified when being transported if there is any chance
259	   that this IRI might be used in a context that uses Simple String
260	   Comparison.
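
   The difference between byte-for-byte and character-for-character
   comparison can be illustrated with a short Python sketch.  The helper
   name same_characters is hypothetical; Python string equality compares
   code point by code point.

```python
def same_characters(bytes_a, encoding_a, bytes_b, encoding_b):
    """Decode both byte sequences to Unicode strings, then compare them."""
    return bytes_a.decode(encoding_a) == bytes_b.decode(encoding_b)

iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# The stored bytes differ, although both encode the same characters.
bytes_differ = as_utf8 != as_utf16
chars_match = same_characters(as_utf8, "utf-8", as_utf16, "utf-16-le")
```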

262	   False negatives are caused by the production and use of IRI aliases.
263	   Unnecessary aliases can be reduced, regardless of the comparison
264	   method, by consistently providing IRI references in a canonical form
265	   (after canonicalization is applied).

267	   Protocols and data formats might limit some IRI comparisons to simple
268	   string comparison, based on the theory that people and
269	   implementations will, in their own best interest, be consistent in
270	   providing IRI references, or at least be consistent enough to negate
271	   any efficiency that might be obtained from further canonicalization.

273	4.2.  Syntax-Based Equivalence

275	   Implementations may use logic based on the definitions provided by
276	   this specification to reduce the probability of false negatives.
277	   This processing is moderately higher in cost than character-for-
278	   character string comparison.  For example, an application using this
279	   approach could reasonably consider the following two IRIs equivalent:

281	             example://a/b/c/%7Bfoo%7D/ros&#xE9;
282	             eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

284	   Web user agents, such as browsers, typically apply this type of IRI
285	   equivalence when determining whether a cached response is available.
286	   Syntax-based equivalence includes such techniques as case
287	   equivalence, Unicode character normalization, percent-encoding
288	   equivalence, and removal of dot-segments.

290	4.2.1.  Case Equivalence

292	   For all IRIs, the hexadecimal digits within a percent-encoding
293	   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
294	   should be considered equivalent to forms which use uppercase letters
295	   for the digits A-F.

297	   When an IRI uses components of the generic syntax, the component
298	   syntax equivalence rules always apply; namely, that the scheme and
299	   US-ASCII only host are case insensitive and therefore should be
300	   treated as equivalent to lowercase.  For example, the URI
301	   "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
302	   Case equivalence for non-ASCII characters in IRI components that are
303	   IDNs is discussed in Section 4.3.  The other generic syntax
304	   components are assumed to be case sensitive unless specifically
305	   defined otherwise by the scheme.
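
   A minimal sketch of these case rules follows.  It splits the IRI with
   the regular expression from Appendix B of [RFC3986], lowercases the
   scheme and an ASCII-only host, and uppercases the hexadecimal digits
   of percent-encoding triplets.  The function name normalize_case is
   hypothetical, and userinfo is deliberately left untouched.

```python
import re

# Component-splitting regular expression from Appendix B of RFC 3986.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def normalize_case(iri):
    match = URI_RE.match(iri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    result = ""
    if scheme is not None:
        result += scheme.lower() + ":"
    if authority is not None:
        # Userinfo may be case sensitive, so only host[:port] is touched.
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():
            hostport = hostport.lower()
        result += "//" + userinfo + at + hostport
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    # Percent-encoding triplets are case-insensitive; use uppercase digits.
    return re.sub(r"%[0-9A-Fa-f]{2}",
                  lambda triplet: triplet.group(0).upper(), result)
```

   Because absent components are tracked as None rather than as empty
   strings, an empty query or fragment is preserved as written.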

307	   Creating schemes that allow case-insensitive syntax components
308	   containing non-ASCII characters should be avoided.  Case equivalence
309	   of non-ASCII characters can be culturally dependent and is always a
310	   complex operation.  The only exception concerns non-ASCII host names
311	   for which the character normalization includes a mapping step derived
312	   from case folding.

314	4.2.2.  Unicode Character Normalization

316	   The Unicode Standard [UNIV6] defines various equivalences between
317	   sequences of characters for various purposes.  Unicode Standard Annex
318	   #15 [UTR15] defines various Normalization Forms for these
319	   equivalences, in particular Normalization Form C (NFC, Canonical
320	   Decomposition, followed by Canonical Composition) and Normalization
321	   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
322	   Composition).

324	   IRIs already in Unicode MUST NOT be normalized before parsing or
325	   interpreting.  In many non-Unicode character encodings, some text
326	   cannot be represented directly.  For example, the word "Vietnam" is
327	   natively written "Vi&#x1EC7;t Nam" (containing a LATIN SMALL LETTER E
328	   WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct transcoding from
329	   the windows-1258 character encoding leads to "Vi&#xEA;&#x323;t Nam"
330	   (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX followed by a
331	   COMBINING DOT BELOW).  Direct transcoding of other 8-bit encodings of
332	   Vietnamese may lead to other representations.

334	   Equivalence of IRIs MUST rely on the assumption that IRIs are
335	   appropriately pre-character-normalized rather than apply character
336	   normalization when comparing two IRIs.  The exceptions are conversion
337	   from a non-digital form, and conversion from a non-UCS-based
338	   character encoding to a UCS-based character encoding.  In these
339	   cases, NFC or a normalizing transcoder using NFC MUST be used for
340	   interoperability.  To avoid false negatives and problems with
341	   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
342	   avoid even more problems; for example, by choosing half-width Latin
343	   letters instead of full-width ones, and full-width instead of half-
344	   width Katakana.

346	   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
347	   Notation) is in NFC.  On the other hand,
348	   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

350	   The former uses precombined e-acute characters, and the latter uses
351	   "e" characters followed by combining acute accents.  Both usages are
352	   defined as canonically equivalent in [UNIV6].
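
   The difference between the two forms can be checked with Python's
   standard unicodedata module, as in the sketch below.  It is meant for
   inspecting IRIs at creation time; consistent with the text above, it
   is not a suggestion to normalize during comparison.  The helper name
   is_nfc is hypothetical.

```python
import unicodedata

def is_nfc(text):
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# The two strings differ under simple string comparison ...
differ_as_strings = precomposed != decomposed
# ... but NFC maps the decomposed form onto the precomposed one.
canonically_equivalent = (
    unicodedata.normalize("NFC", decomposed) == precomposed)
```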

354	   Note:  Because it is unknown how a particular sequence of characters
355	      is being treated with respect to character normalization, it would
356	      be inappropriate to allow third parties to normalize an IRI
357	      arbitrarily.  This does not contradict the recommendation that
358	      when a resource is created, its IRI should be as character
359	      normalized as possible (i.e., NFC or even NFKC).  This is similar
360	      to the uppercase/lowercase problems.  Some parts of a URI are case
361	      insensitive (for example, the domain name).  For others, it is
362	      unclear whether they are case sensitive, case insensitive, or
363	      something in between (e.g., case sensitive, but with a multiple
364	      choice selection if the wrong case is used, instead of a direct
365	      negative result).  The best recipe is that the creator use a
366	      reasonable capitalization and, when transferring the URI,
367	      capitalization never be changed.

369	   Various IRI schemes may allow the usage of Internationalized Domain
370	   Names (IDN) [RFC5890] either in the ireg-name part or elsewhere.
371	   Character Normalization also applies to IDNs, as discussed in
372	   Section 4.3.

374	4.2.3.  Percent-Encoding Equivalence

376	   The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a
377	   frequent source of variance among otherwise identical IRIs.  In
378	   addition to the case equivalence issue noted above, some IRI
379	   producers percent-encode octets that do not require percent-encoding,
380	   resulting in IRIs that are equivalent to their nonencoded
381	   counterparts.  These IRIs should be compared by first decoding any
382	   percent-encoded octet sequence that corresponds to an unreserved
383	   character, as described in section 2.3 of [RFC3986].

385	   For actual resolution, differences in percent-encoding (except for
386	   the percent-encoding of reserved characters) SHOULD always result in
387	   the same resource.  For example, "http://example.org/~user",
388	   "http://example.org/%7euser", and "http://example.org/%7Euser",
389	   SHOULD resolve to the same resource.

391	   If this kind of equivalence is to be tested, the percent-encoding of
392	   both IRIs to be compared first needs to be aligned; for example, by
393	   converting both IRIs to URIs, eliminating escape differences in the
394	   resulting URIs, and making sure that the case of the hexadecimal
395	   characters in the percent-encoding is always the same (preferably
396	   upper case).  If the IRI is to be passed to another application or
397	   used further in some other way, its original form MUST be preserved.
398	   The conversion described here should be performed only for local
399	   comparison.
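
   The decoding of percent-encoded unreserved characters can be sketched
   as follows.  The function name is hypothetical, and the result is
   intended only as a local comparison key; the original IRI is left
   unchanged.  Triplets that do not correspond to unreserved characters
   are kept, with their hexadecimal digits uppercased.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, 2.3)
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")

def decode_unreserved(iri):
    def replace(triplet):
        char = chr(int(triplet.group(1), 16))
        if char in UNRESERVED:
            return char                    # e.g. "%7e" becomes "~"
        return triplet.group(0).upper()    # e.g. "%2f" becomes "%2F"
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, iri)

keys = {decode_unreserved(iri) for iri in (
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
)}
```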

401	4.2.4.  Path Segment Equivalence

403	   The complete path segments "." and ".." are intended only for use
404	   within relative references (Section 4.1 of [RFC3986]) and are removed
405	   as part of the reference resolution process (Section 5.2 of
406	   [RFC3986]).  However, some implementations may incorrectly assume
407	   that reference resolution is not necessary when the reference is
408	   already an IRI, and thus fail to remove dot-segments when they occur
409	   in non-relative paths.  IRI comparison SHOULD remove dot-segments by
410	   applying the remove_dot_segments algorithm to the path, as described
411	   in Section 5.2.4 of [RFC3986].
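
   The following Python sketch follows the steps of the
   remove_dot_segments algorithm in Section 5.2.4 of [RFC3986].  It
   operates on the path component only and is offered as an
   illustration, not as a reference implementation.

```python
def remove_dot_segments(path):
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # replace "/./" with "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # replace "/../" with "/"
            if output:
                output.pop()             # drop the last output segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to
            # the output, up to but not including the next "/".
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)
```

   Applied to the path "/./b/../b/%63" from the example in Section 4.2,
   the function returns "/b/%63".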

413	4.3.  Scheme-Based Comparison

415	   The syntax and semantics of IRIs vary from scheme to scheme, as
416	   described by the defining specification for each scheme.
417	   Implementations may use scheme-specific rules, at further processing
418	   cost, to reduce the probability of false negatives.  For example,
419	   because the "http" scheme makes use of an authority component, has a
420	   default port of "80", and defines an empty path to be equivalent to
421	   "/", the following four IRIs are equivalent:

423	             http://example.com
424	             http://example.com/
425	             http://example.com:/
426	             http://example.com:80/

428	   In general, an IRI that uses the generic syntax for authority with an
429	   empty path should be equivalent to a path of "/".  Likewise, an
430	   explicit ":port", for which the port is empty or the default for the
431	   scheme, is equivalent to one where the port and its ":" delimiter are
432	   elided.
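
   These two rules can be sketched as below.  The table DEFAULT_PORTS
   and the function name are assumptions made for illustration; a real
   implementation would take default ports from each scheme's
   specification.  Components are split with the regular expression from
   Appendix B of [RFC3986].

```python
import re

URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Illustrative only; not a complete registry of default ports.
DEFAULT_PORTS = {"http": "80"}

def normalize_scheme_based(iri):
    match = URI_RE.match(iri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    if scheme is None or authority is None:
        return iri                       # rules apply to authority-based IRIs
    default_port = DEFAULT_PORTS.get(scheme.lower())
    host, colon, port = authority.rpartition(":")
    if colon and (port == "" or port == default_port):
        authority = host                 # elide ":" and empty/default port
    if path == "":
        path = "/"                       # empty path is equivalent to "/"
    result = scheme + "://" + authority + path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result

forms = {normalize_scheme_based(iri) for iri in (
    "http://example.com",
    "http://example.com/",
    "http://example.com:/",
    "http://example.com:80/",
)}
```

   All four example IRIs collapse to the single form
   "http://example.com/".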

434	   Another case where equivalence varies by scheme is in the handling of
435	   an empty authority component or empty host subcomponent.  For many
436	   scheme specifications, an empty authority or host is considered an
437	   error; for others, it is considered equivalent to "localhost" or the
438	   end-user's host.

440	   A component that is missing from an IRI and one that is present but
441	   empty SHOULD NOT be treated as equivalent unless
442	   explicitly defined as such by the scheme definition.  For example,
443	   the IRI "http://example.com/?" cannot be assumed to be equivalent to
444	   any of the examples above; an empty query component is NOT equivalent
445	   to a missing one.  Likewise, the presence or absence of delimiters
446	   within a userinfo subcomponent is usually significant to its
447	   interpretation.  The fragment component is not subject to any scheme-
448	   based equivalence; thus, two IRIs that differ only by the suffix "#"
449	   are considered different regardless of the scheme.

451	   Some IRI schemes allow the usage of Internationalized Domain Names
452	   (IDN) [RFC5890] either in their ireg-name part or elsewhere.  When in
453	   use in IRIs, those names SHOULD conform to the definition of U-Label
454	   in [RFC5890].  An IRI containing an invalid IDN cannot successfully
455	   be resolved.  For legibility purposes, they SHOULD NOT be converted
456	   into ASCII Compatible Encoding (ACE).

458	   Scheme-based comparison may also consider IDN components and their
459	   conversions to punycode as equivalent.  As an example,
460	   "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
461	   "http://xn--rsum-bpad.example.org".
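
   As an illustration, Python's built-in "idna" codec can produce the
   ASCII Compatible Encoding of a host name.  Note the assumption: that
   codec implements the older IDNA2003 rules of [RFC3490], not the
   definitions in [RFC5890], so results may differ for some names.  The
   helper name host_to_ace is hypothetical.

```python
def host_to_ace(host):
    """Convert each label of a host name to its ACE ("xn--") form."""
    return host.encode("idna").decode("ascii")

unicode_host = "r\u00e9sum\u00e9.example.org"
ace_host = host_to_ace(unicode_host)

# A scheme-based comparison may treat the two host spellings as equal.
hosts_match = ace_host == "xn--rsum-bpad.example.org"
```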

463	   Other scheme-specific equivalence rules are possible.

465	4.4.  Protocol-Based Comparison

467	   Substantial effort to reduce the incidence of false negatives is
468	   often cost-effective for web spiders.  Consequently, they implement
469	   even more aggressive techniques in IRI comparison.  For example, if
470	   they observe that an IRI such as

472	           http://example.com/data

474	   redirects to an IRI differing only in the trailing slash
475	           http://example.com/data/

477	   they will likely regard the two as equivalent in the future.  This
478	   kind of technique is only appropriate when equivalence is clearly
479	   indicated by both the result of accessing the resources and the
480	   common conventions of their scheme's dereference algorithm (in this
481	   case, use of redirection by HTTP origin servers to avoid problems
482	   with relative references).

484	5.  Security Considerations

486	   The primary security difficulty comes from applications choosing the
487	   wrong equivalence relationship, or two different parties disagreeing
488	   on equivalence.  This is especially a problem when IRIs are used in
489	   security protocols.

491	   Besides the large character repertoire of Unicode, reasons for
492	   confusion include different forms of normalization and different
493	   normalization expectations, use of percent-encoding with various
494	   legacy encodings, and bidirectionality issues.  See also [UTR36].

496	6.  Acknowledgements

498	   This document was originally derived from [RFC3986] and [RFC3987],
499	   based on text contributed by Tim Bray.

501	7.  References

503	7.1.  Normative References

505	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
506	              Requirement Levels", BCP 14, RFC 2119, March 1997.

508	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
509	              "Internationalizing Domain Names in Applications (IDNA)",
510	              RFC 3490, March 2003.

512	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
513	              Profile for Internationalized Domain Names (IDN)",
514	              RFC 3491, March 2003.

516	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
517	              10646", STD 63, RFC 3629, November 2003.

519	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
520	              Resource Identifier (URI): Generic Syntax", STD 66,
521	              RFC 3986, January 2005.

523	   [RFC3987bis]
524	              Duerst, M., Masinter, L., and M. Suignard,
525	              "Internationalized Resource Identifiers (IRIs)", 2012,
526	              <http://tools.ietf.org/id/draft-ietf-iri-3987bis>.

528	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
529	              Applications (IDNA): Definitions and Document Framework",
530	              RFC 5890, August 2010.

532	   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
533	              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
534	              ISBN 978-1-936213-01-6)", October 2010.

536	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
537	              Unicode Standard Annex #15, March 2008,
538	              <http://www.unicode.org/unicode/reports/tr15/
539	              tr15-23.html>.

541	7.2.  Informative References

543	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
544	              Specification", World Wide Web Consortium Recommendation,
545	              December 1999,
546	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

548	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
549	              Extensions (MIME) Part One: Format of Internet Message
550	              Bodies", RFC 2045, November 1996.

552	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
553	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
554	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

556	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
557	              Identifiers (IRIs)", RFC 3987, January 2005.

559	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
560	              Considerations", Unicode Technical Report #36,
561	              August 2010, <http://unicode.org/reports/tr36/>.

563	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
564	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
565	              Edition)", World Wide Web Consortium Recommendation,
566	              August 2006, <http://www.w3.org/TR/REC-xml>.

568	   [XMLNamespace]
569	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
570	              "Namespaces in XML (Second Edition)", World Wide Web
571	              Consortium Recommendation, August 2006,
572	              <http://www.w3.org/TR/REC-xml-names>.

574	Authors' Addresses

576	   Larry Masinter
577	   Adobe
578	   345 Park Ave
579	   San Jose, CA  95110
580	   U.S.A.

582	   Phone: +1-408-536-3024
583	   Email: masinter@adobe.com
584	   URI:   http://larry.masinter.net

586	   Martin Duerst
587	   Aoyama Gakuin University
588	   5-10-1 Fuchinobe
589	   Sagamihara, Kanagawa  229-8558
590	   Japan

592	   Phone: +81 42 759 6329
593	   Fax:   +81 42 759 6495
594	   Email: duerst@it.aoyama.ac.jp
595	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/