idnits 2.17.1 

draft-ietf-iri-3987bis-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 17, 2010) is 4933 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC5891' is defined on line 2353, but no explicit
     reference was found in the text

  == Unused Reference: 'Gettys' is defined on line 2392, but no explicit
     reference was found in the text

  == Unused Reference: 'UTR36' is defined on line 2457, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 4395
     (Obsoleted by RFC 7595)


     Summary: 3 errors (**), 0 flaws (~~), 6 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource                                     M. Duerst
3	Identifiers (iri)                               Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: 3987 (if approved)                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: April 20, 2011                                            Adobe
8	                                                        October 17, 2010

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-02

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   In addition, this document provides named additional rule sets for
22	   processing otherwise invalid IRIs, in a way that supports other
23	   specifications that wish to mandate common behavior for 'error'
24	   handling.  In particular, rules used in some XML languages (LEIRI)
25	   and web applications are given.

27	   Defining IRI as a new protocol element (rather than updating or
28	   extending the definition of URI) allows independent orderly
29	   transitions: other protocols and languages that use URIs must
30	   explicitly choose to allow IRIs.

32	   Guidelines are provided for the use and deployment of IRIs and
33	   related protocol elements when revising protocols, formats, and
34	   software components that currently deal only with URIs.

36	RFC Editor: Please remove the next paragraph before publication.

38	   This document is intended to update RFC 3987 and move towards IETF
39	   Draft Standard.  For discussion and comments on this draft, please
40	   join the IETF IRI WG by subscribing to the mailing list
41	   public-iri@w3.org.  For a list of open issues, please see the issue
42	   tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1.

44	Status of this Memo

46	   This Internet-Draft is submitted to IETF in full conformance with the
47	   provisions of BCP 78 and BCP 79.

49	   Internet-Drafts are working documents of the Internet Engineering
50	   Task Force (IETF), its areas, and its working groups.  Note that
51	   other groups may also distribute working documents as Internet-
52	   Drafts.

54	   Internet-Drafts are draft documents valid for a maximum of six months
55	   and may be updated, replaced, or obsoleted by other documents at any
56	   time.  It is inappropriate to use Internet-Drafts as reference
57	   material or to cite them other than as "work in progress."

59	   The list of current Internet-Drafts can be accessed at
60	   http://www.ietf.org/ietf/1id-abstracts.txt.

62	   The list of Internet-Draft Shadow Directories can be accessed at
63	   http://www.ietf.org/shadow.html.

65	   This Internet-Draft will expire on April 20, 2011.

67	Copyright Notice

69	   Copyright (c) 2010 IETF Trust and the persons identified as the
70	   document authors.  All rights reserved.

72	   This document is subject to BCP 78 and the IETF Trust's Legal
73	   Provisions Relating to IETF Documents
74	   (http://trustee.ietf.org/license-info) in effect on the date of
75	   publication of this document.  Please review these documents
76	   carefully, as they describe your rights and restrictions with respect
77	   to this document.  Code Components extracted from this document must
78	   include Simplified BSD License text as described in Section 4.e of
79	   the Trust Legal Provisions and are provided without warranty as
80	   described in the BSD License.

82	   This document may contain material from IETF Documents or IETF
83	   Contributions published or made publicly available before November
84	   10, 2008.  The person(s) controlling the copyright in some of this
85	   material may not have granted the IETF Trust the right to allow
86	   modifications of such material outside the IETF Standards Process.
87	   Without obtaining an adequate license from the person(s) controlling
88	   the copyright in such materials, this document may not be modified
89	   outside the IETF Standards Process, and derivative works of it may
90	   not be created outside the IETF Standards Process, except to format
91	   it for publication as an RFC or to translate it into languages other
92	   than English.

94	Table of Contents

96	   1.  Introduction
97	     1.1.   Overview and Motivation
98	     1.2.   Applicability
99	     1.3.   Definitions
100	     1.4.   Notation
101	   2.  IRI Syntax
102	     2.1.   Summary of IRI Syntax
103	     2.2.   ABNF for IRI References and IRIs
104	   3.  Processing IRIs and related protocol elements
105	     3.1.   Converting to UCS
106	     3.2.   Parse the IRI into IRI components
107	     3.3.   General percent-encoding of IRI components
108	     3.4.   Mapping ireg-name
109	     3.5.   Mapping query components
110	     3.6.   Mapping IRIs to URIs
111	     3.7.   Converting URIs to IRIs
112	       3.7.1.  Examples
113	   4.  Bidirectional IRIs for Right-to-Left Languages
114	     4.1.   Logical Storage and Visual Presentation
115	     4.2.   Bidi IRI Structure
116	     4.3.   Input of Bidi IRIs
117	     4.4.   Examples
118	   5.  Normalization and Comparison
119	     5.1.   Equivalence
120	     5.2.   Preparation for Comparison
121	     5.3.   Comparison Ladder
122	       5.3.1.  Simple String Comparison
123	       5.3.2.  Syntax-Based Normalization
124	       5.3.3.  Scheme-Based Normalization
125	       5.3.4.  Protocol-Based Normalization
126	   6.  Use of IRIs
127	     6.1.   Limitations on UCS Characters Allowed in IRIs
128	     6.2.   Software Interfaces and Protocols
129	     6.3.   Format of URIs and IRIs in Documents and Protocols
130	     6.4.   Use of UTF-8 for Encoding Original Characters
131	     6.5.   Relative IRI References
132	   7.  Liberal handling of otherwise invalid IRIs
133	     7.1.   LEIRI processing
134	     7.2.   Web Address processing
135	     7.3.   Characters not allowed in IRIs
136	   8.  URI/IRI Processing Guidelines (Informative)
137	     8.1.   URI/IRI Software Interfaces
138	     8.2.   URI/IRI Entry
139	     8.3.   URI/IRI Transfer between Applications
140	     8.4.   URI/IRI Generation
141	     8.5.   URI/IRI Selection
142	     8.6.   Display of URIs/IRIs
143	     8.7.   Interpretation of URIs and IRIs
144	     8.8.   Upgrading Strategy
145	   9.  IANA Considerations
146	   10. Security Considerations
147	   11. Acknowledgements
148	   12. Change Log
149	     12.1.  Changes from draft-duerst-iri-bis-07 to
150	            draft-ietf-iri-3987bis-00
151	     12.2.  Changes from -06 to -07 of draft-duerst-iri-bis
152	       12.2.1. OLD WAY
153	       12.2.2. NEW WAY
154	     12.3.  Changes from -00 to -01
155	     12.4.  Changes from -05 to -06 of draft-duerst-iri-bis-00
156	     12.5.  Changes from -04 to -05 of draft-duerst-iri-bis
157	     12.6.  Changes from -03 to -04 of draft-duerst-iri-bis
158	     12.7.  Changes from -02 to -03 of draft-duerst-iri-bis
159	     12.8.  Changes from -01 to -02 of draft-duerst-iri-bis
160	     12.9.  Changes from -00 to -01 of draft-duerst-iri-bis
161	     12.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
162	   13. References
163	     13.1.  Normative References
164	     13.2.  Informative References
165	   Appendix A.  Design Alternatives
166	     A.1.   New Scheme(s)
167	     A.2.   Character Encodings Other Than UTF-8
168	     A.3.   New Encoding Convention
169	     A.4.   Indicating Character Encodings in the URI/IRI
170	   Authors' Addresses

172	1.  Introduction

174	1.1.  Overview and Motivation

176	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
177	   sequence of characters chosen from a limited subset of the repertoire
178	   of US-ASCII [ASCII] characters.

180	   The characters in URIs are frequently used for representing words of
181	   natural languages.  This usage has many advantages: Such URIs are
182	   easier to memorize, easier to interpret, easier to transcribe, easier
183	   to create, and easier to guess.  For most languages other than
184	   English, however, the natural script uses characters other than A -
185	   Z.  For many people, handling Latin characters is as difficult as
186	   handling the characters of other scripts is for those who use only
187	   the Latin alphabet.  Many languages with non-Latin scripts are
188	   transcribed with Latin letters.  These transcriptions are now often
189	   used in URIs, but they introduce additional difficulties.

191	   The infrastructure for the appropriate handling of characters from
192	   additional scripts is now widely deployed in operating system and
193	   application software.  Software that can handle a wide variety of
194	   scripts and languages at the same time is increasingly common.  Also,
195	   an increasing number of protocols and formats can carry a wide range
196	   of characters.

198	   URIs are used both as a protocol element (for transmission and
199	   processing by software) and also a presentation element (for display
200	   and handling by people who read, interpret, coin, or guess them).
201	   The transition between these roles is more difficult and complex when
202	   dealing with the larger set of characters than allowed for URIs in
203	   [RFC3986].

205	   This document defines the protocol element called Internationalized
206	   Resource Identifier (IRI), which allows applications of URIs to be
207	   extended to use resource identifiers that have a much wider
208	   repertoire of characters.  It also provides corresponding
209	   "internationalized" versions of other constructs from [RFC3986], such
210	   as URI references.  The syntax of IRIs is defined in Section 2.

212	   Using characters outside of A - Z in IRIs adds a number of
213	   difficulties.  Section 4 discusses the special case of bidirectional
214	   IRIs using characters from scripts written right-to-left.  Section 5
215	   discusses various forms of equivalence between IRIs.  Section 6
216	   discusses the use of IRIs in different situations.  Section 8 gives
217	   additional informative guidelines.  Section 10 discusses IRI-specific
218	   security considerations.

220	1.2.  Applicability

222	   IRIs are designed to allow protocols and software that deal with URIs
223	   to be updated to handle IRIs.  A "URI scheme" (as defined by
224	   [RFC3986] and registered through the IANA process defined in
225	   [RFC4395]) also serves as an "IRI scheme".  Processing of IRIs is
226	   accomplished by extending the URI syntax while retaining (and not
227	   expanding) the set of "reserved" characters, such that the syntax for
228	   any URI scheme may be uniformly extended to allow non-ASCII
229	   characters.  In addition, following parsing of an IRI, it is possible
230	   to construct a corresponding URI by first encoding characters outside
231	   of the allowed URI range and then reassembling the components.
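
   As a rough illustration of the encoding step mentioned above, the
   following Python sketch percent-encodes every non-ASCII character
   of an already-parsed component as its UTF-8 octets.  The function
   name is hypothetical, and the sketch assumes that no
   component-specific handling (for example, of host names) is
   required; it is not the normative mapping.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Illustrative helper (hypothetical name).  ASCII characters,
    including reserved characters and existing "%XX" triplets, are
    passed through unchanged.
    """
    pieces = []
    for ch in component:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            # One "%XX" triplet per octet of the UTF-8 encoding.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


print(percent_encode_non_ascii("/r\u00e9sum\u00e9"))  # /r%C3%A9sum%C3%A9
```

   Because reserved characters are left untouched, the encoded
   components can be reassembled with the same delimiters that were
   used to parse them.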

233	   Practical use of IRI forms in place of URI forms depends on the
234	   following conditions being met:

236	   a. A protocol or format element MUST be explicitly designated to be
237	      able to carry IRIs.  The intent is to avoid introducing IRIs into
238	      contexts that are not defined to accept them.  For example, XML
239	      schema [XMLSchema] has an explicit type "anyURI" that includes
240	      IRIs and IRI references.  Therefore, IRIs and IRI references can
241	      be in attributes and elements of type "anyURI".  On the other
242	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
243	      defined as a URI, which means that direct use of IRIs is not
244	      allowed in HTTP requests.

246	   b. The protocol or format carrying the IRIs MUST have a mechanism to
247	      represent the wide range of characters used in IRIs, either
248	      natively or by some protocol- or format-specific escaping
249	      mechanism (for example, numeric character references in [XML1]).

251	   c. The URI scheme definition, if it explicitly allows a percent sign
252	      ("%") in any syntactic component, SHOULD define the interpretation
253	      of sequences of percent-encoded octets (using "%XX" hex octets) as
254	      octets from sequences of UTF-8 encoded strings; this is recommended
255	      in the guidelines for registering new schemes, [RFC4395].  For
256	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
257	      [RFC2384] and the URN syntax [RFC2141].  Note that use of
258	      percent-encoding may also be restricted in some situations, for
259	      example, URI schemes that disallow percent-encoding might still be
260	      used with a fragment identifier which is percent-encoded (e.g.,
261	      [XPointer]).  See Section 6.4 for further discussion.
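
   Condition c. can be illustrated with a short Python sketch that
   interprets percent-encoded octets as UTF-8 and reports sequences
   that are not valid UTF-8.  It relies on the standard library
   function urllib.parse.unquote; the wrapper name and the choice to
   return None on failure are assumptions made for this example.

```python
from urllib.parse import unquote


def decode_pct_utf8(text):
    """Decode "%XX" triplets as UTF-8; return None if they are not UTF-8.

    Illustrative wrapper (hypothetical name) around urllib.parse.unquote.
    """
    try:
        return unquote(text, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # The octets do not form a valid UTF-8 sequence.
        return None


print(decode_pct_utf8("r%C3%A9sum%C3%A9") == "r\u00e9sum\u00e9")  # True
print(decode_pct_utf8("%E9"))  # None: a lone <e9> octet is not UTF-8
```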

263	1.3.  Definitions

265	   The following definitions are used in this document; they follow the
266	   terms in [RFC2130], [RFC2277], and [ISO10646].

268	   character:  A member of a set of elements used for the organization,
269	      control, or representation of data.  For example, "LATIN CAPITAL
270	      LETTER A" names a character.

272	   octet:  An ordered sequence of eight bits considered as a unit.

274	   character repertoire:  A set of characters (set in the mathematical
275	      sense).

277	   sequence of characters:  A sequence of characters (one after
278	      another).

280	   sequence of octets:  A sequence of octets (one after another).

282	   character encoding:  A method of representing a sequence of
283	      characters as a sequence of octets (maybe with variants).  Also, a
284	      method of (unambiguously) converting a sequence of octets into a
285	      sequence of characters.

287	   charset:  The name of a parameter or attribute used to identify a
288	      character encoding.

290	   UCS:  Universal Character Set. The coded character set defined by
291	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

293	   IRI reference:  Denotes the common usage of an Internationalized
294	      Resource Identifier.  An IRI reference may be absolute or
295	      relative.  However, the "IRI" that results from such a reference
296	      only includes absolute IRIs; any relative IRI references are
297	      resolved to their absolute form.  Note that in [RFC2396] URIs did
298	      not include fragment identifiers, but in [RFC3986] fragment
299	      identifiers are part of URIs.

301	   URL:  The term "URL" was originally used [RFC1738] for roughly what
302	      is now called a "URI".  Books, software and documentation often
303	      refer to URIs and IRIs using the "URL" term.  Some usages
304	      restrict "URL" to those URIs which are not URNs.  Because of the
305	      ambiguity of the term, using the term "URL" is NOT RECOMMENDED in
306	      formal documents.

308	   LEIRI (Legacy Extended IRI) processing:  This term was used in
309	      various XML specifications to refer to strings that, although not
310	      valid IRIs, were acceptable input to the processing rules in
311	      Section 7.1.

313	   (Web Address, Hypertext Reference, HREF):  These terms have been
314	      added in this document for convenience, to allow other
315	      specifications to refer to those strings that, although not valid
316	      IRIs, are acceptable input to the processing rules in Section 7.2.
317	      This usage corresponds to the parsing rules of some popular web
318	      browsing applications.  ISSUE: Need to find a good name/
319	      abbreviation for these.

321	   running text:  Human text (paragraphs, sentences, phrases) with
322	      syntax according to orthographic conventions of a natural
323	      language, as opposed to syntax defined for ease of processing by
324	      machines (e.g., markup, programming languages).

326	   protocol element:  Any portion of a message that affects processing
327	      of that message by the protocol in question.

329	   presentation element:  A presentation form corresponding to a
330	      protocol element; for example, using a wider range of characters.

332	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
333	      used for the initial creation.  This may be the initial creation
334	      of a resource with a certain identifier, or the initial exposition
335	      of a resource under a particular identifier.

337	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
338	      used when the identifier is generated by derivation from other
339	      information.

341	   parsed URI component:  When a URI processor parses a URI (following
342	      the generic syntax or a scheme-specific syntax), the result is a
343	      set of parsed URI components, each of which has a type
344	      (corresponding to the syntactic definition) and a sequence of URI
345	      characters.

347	   parsed IRI component:  When an IRI processor parses an IRI directly,
348	      following the general syntax or a scheme-specific syntax, the
349	      result is a set of parsed IRI components, each of which has a type
350	      (corresponding to the syntactic definition) and a sequence of IRI
351	      characters.  (This definition is analogous to "parsed URI
352	      component".)

354	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
355	      scheme's syntax has been extended to allow non-US-ASCII characters
356	      according to the rules in this document.

358	1.4.  Notation

360	   RFCs and Internet Drafts currently do not allow any characters
361	   outside the US-ASCII repertoire.  Therefore, this document uses
362	   various special notations to denote such characters in examples.

364	   In text, characters outside US-ASCII are sometimes referenced by
365	   using a prefix of 'U+', followed by four to six hexadecimal digits.

367	   To represent characters outside US-ASCII in examples, this document
368	   uses two notations: 'XML Notation' and 'Bidi Notation'.

370	   XML Notation uses a leading '&#x', a trailing ';', and the
371	   hexadecimal number of the character in the UCS in between.  For
372	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
373	   notation, an actual '&' is denoted by '&amp;'.
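
   For readers who want to turn the examples in this document into
   actual character strings, the following Python sketch expands XML
   Notation as described above.  The function name is hypothetical,
   and the sketch handles only the two forms used here ('&#x...;' and
   '&amp;'), not general XML character or entity references.

```python
import re

# Matches a hexadecimal character reference or the escaped ampersand.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text):
    """Replace '&#xHHHH;' by the UCS character and '&amp;' by '&'."""

    def expand(match):
        if match.group(1) is None:
            return "&"  # the '&amp;' alternative matched
        return chr(int(match.group(1), 16))

    return _XML_NOTATION.sub(expand, text)


print(from_xml_notation("&#x44F;") == "\u044f")  # True
print(from_xml_notation("a&amp;b"))  # a&b
```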

375	   Bidi Notation is used for bidirectional examples: Lower case letters
376	   stand for Latin letters or other letters that are written left to
377	   right, whereas upper case letters represent Arabic or Hebrew letters
378	   that are written right to left.

380	   To denote actual octets in examples (as opposed to percent-encoded
381	   octets), the two hex digits denoting the octet are enclosed in "<"
382	   and ">".  For example, the octet often denoted as 0xc9 is denoted
383	   here as <c9>.

385	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
386	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
387	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

389	2.  IRI Syntax

391	   This section defines the syntax of Internationalized Resource
392	   Identifiers (IRIs).

394	   As with URIs, an IRI is defined as a sequence of characters, not as a
395	   sequence of octets.  This definition accommodates the fact that IRIs
396	   may be written on paper or read over the radio as well as stored or
397	   transmitted digitally.  The same IRI might be represented as
398	   different sequences of octets in different protocols or documents if
399	   these protocols or documents use different character encodings
400	   (and/or transfer encodings).  Using the same character encoding as
401	   the containing protocol or document ensures that the characters in
402	   the IRI can be handled (e.g., searched, converted, displayed) in the
403	   same way as the rest of the protocol or document.

405	2.1.  Summary of IRI Syntax

407	   IRIs are defined by extending the URI syntax in [RFC3986], but
408	   extending the class of unreserved characters by adding the characters
409	   of the UCS (Universal Character Set, [ISO10646]) beyond U+007F,
410	   subject to the limitations given in the syntax rules below and in
411	   Section 6.1.

413	   The syntax and use of components and reserved characters is the same
414	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
415	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
416	   scheme are extended to allow parsing of IRIs using the same
417	   parsing rules.

419	   All the operations defined in [RFC3986], such as the resolution of
420	   relative references, can be applied to IRIs by IRI-processing
421	   software in exactly the same way as they are for URIs by URI-
422	   processing software.
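
   As an informal illustration, reference resolution can be tried on
   strings containing non-ASCII characters with an existing URI
   resolver.  The sketch below uses urljoin from the Python standard
   library, which operates on character strings; the wrapper name and
   the example.org identifiers are assumptions made for this example,
   and urljoin is not claimed to be a conforming IRI implementation.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base, reference):
    """Resolve a relative reference against a base (hypothetical wrapper).

    Delegates to urllib.parse.urljoin; non-ASCII characters are carried
    through unchanged because resolution only looks at the delimiters.
    """
    return urljoin(base, reference)


base = "http://example.org/r\u00e9sum\u00e9/index"
print(resolve_iri_reference(base, "../\u00fcber"))
print(resolve_iri_reference(base, "?q=\u00e9"))
```

   Only the reserved delimiters "/", "?", and "#" steer the
   resolution, which is why the wider character repertoire does not
   change the algorithm.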

424	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
425	   therefore MUST NOT be used for syntactical purposes, such as to
426	   delimit components in newly defined schemes.  For example, U+00A2,
427	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
428	   the 'iunreserved' category.  This is similar to the fact that it is
429	   not possible to use '-' as a delimiter in URIs, because it is in the
430	   'unreserved' category.

432	2.2.  ABNF for IRI References and IRIs

434	   An ABNF definition for IRI references (which are the most general
435	   concept and the start of the grammar) and IRIs is given here.  The
436	   syntax of this ABNF is described in [STD68].  Character numbers are
437	   taken from the UCS, without implying any actual binary encoding.
438	   Terminals in the ABNF are characters, not octets.

440	   The following grammar closely follows the URI grammar in [RFC3986],
441	   except that the range of unreserved characters is expanded to include
442	   UCS characters, with the restriction that private UCS characters can
443	   occur only in query parts.  The grammar is split into two parts:
444	   Rules that differ from [RFC3986] because of the above-mentioned
445	   expansion, and rules that are the same as those in [RFC3986].  For
446	   rules that are different from those in [RFC3986], the names of the
447	   non-terminals have been changed as follows.  If the non-terminal
448	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
449	   has been prefixed.

451	   The following rules are different from those in [RFC3986]:

453	   IRI            = scheme ":" ihier-part [ "?" iquery ]
454	                    [ "#" ifragment ]

456	   ihier-part     = "//" iauthority ipath-abempty
457	                  / ipath-absolute
458	                  / ipath-rootless
459	                  / ipath-empty

461	   IRI-reference  = IRI / irelative-ref

463	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

465	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

467	   irelative-part = "//" iauthority ipath-abempty
468	                  / ipath-absolute
469	                  / ipath-noscheme
470	                  / ipath-empty

472	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
473	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
474	   ihost          = IP-literal / IPv4address / ireg-name

476	   pct-form       = pct-encoded

478	   ireg-name      = *( iunreserved / sub-delims )

480	   ipath          = ipath-abempty   ; begins with "/" or is empty
481	                  / ipath-absolute  ; begins with "/" but not "//"
482	                  / ipath-noscheme  ; begins with a non-colon segment
483	                  / ipath-rootless  ; begins with a segment
484	                  / ipath-empty     ; zero characters

486	   ipath-abempty  = *( path-sep isegment )
487	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
488	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
489	   ipath-rootless = isegment-nz *( path-sep isegment )
490	   ipath-empty    = 0<ipchar>
491	   path-sep       = "/"

493	   isegment       = *ipchar
494	   isegment-nz    = 1*ipchar
495	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
496	                        / "@" )
497	                  ; non-zero-length segment without any colon ":"

499	   ipchar         = iunreserved / pct-form / sub-delims / ":"
500	                  / "@"

502	   iquery         = *( ipchar / iprivate / "/" / "?" )

504	   ifragment      = *( ipchar / "/" / "?" / "#" )

506	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

508	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
509	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
510	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
511	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
512	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
513	                  / %xD0000-DFFFD / %xE1000-EFFFD

515	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
516	                  / %x100000-10FFFD

518	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
519	   "greedy") algorithm applies.  For details, see [RFC3986].

521	   The following rules are the same as those in [RFC3986]:

523	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

525	   port           = *DIGIT

527	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

529	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

531	   IPv6address    =                            6( h16 ":" ) ls32
532	                  /                       "::" 5( h16 ":" ) ls32
533	                  / [               h16 ] "::" 4( h16 ":" ) ls32
534	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
535	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
536	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
537	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
538	                  / [ *5( h16 ":" ) h16 ] "::"              h16
539	                  / [ *6( h16 ":" ) h16 ] "::"

541	   h16            = 1*4HEXDIG
542	   ls32           = ( h16 ":" h16 ) / IPv4address

544	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

546	   dec-octet      = DIGIT                 ; 0-9
547	                  / %x31-39 DIGIT         ; 10-99
548	                  / "1" 2DIGIT            ; 100-199
549	                  / "2" %x30-34 DIGIT     ; 200-249
550	                  / "25" %x30-35          ; 250-255

552	   pct-encoded    = "%" HEXDIG HEXDIG

554	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
555	   reserved       = gen-delims / sub-delims
556	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
557	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
558	                  / "*" / "+" / "," / ";" / "="

560	   This syntax does not support IPv6 scoped addressing zone identifiers.
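
The ucschar and iprivate ranges above lend themselves to a simple membership test. The following Python fragment is a minimal sketch, not a normative definition; the helper names (is_ucschar, is_iprivate, is_iunreserved) are illustrative and do not appear in this document.

```python
import string

# Ranges copied from the ucschar and iprivate productions above.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF),
    (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
# ALPHA / DIGIT / "-" / "." / "_" / "~"
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_ucschar(ch):
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch):
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch):
    return ch in ASCII_UNRESERVED or is_ucschar(ch)
```

Note that a private-use character such as U+E000 matches iprivate but not ucschar, so under this grammar it is accepted only where iprivate is listed (that is, in iquery).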

562	3.  Processing IRIs and related protocol elements

564	   IRIs are meant to replace URIs in identifying resources within new
565	   versions of protocols, formats, and software components that use a
566	   UCS-based character repertoire.  Protocols and components may use and
567	   process IRIs directly.  However, there are still numerous systems and
568	   protocols which only accept URIs or components of parsed URIs; that
569	   is, they only accept sequences of characters within the subset of US-
570	   ASCII characters allowed in URIs.

572	   This section defines specific processing steps for IRI consumers
573	   which establish the relationship between the string given and the
574	   interpreted derivatives.  These processing steps apply to both IRIs
575	   and IRI references (i.e., absolute or relative forms); for IRIs, some
576	   steps are scheme specific.

578	3.1.  Converting to UCS

580	   Input that is already in a Unicode form (i.e., a sequence of Unicode
581	   characters or an octet-stream representing a Unicode-based character
582	   encoding such as UTF-8 or UTF-16) should be left as is and not
583	   normalized (see Section 5.3.2.2).

585	   If the IRI or IRI reference is an octet stream in some known non-
586	   Unicode character encoding, convert the IRI to a sequence of
587	   characters from the UCS; this sequence SHOULD also be normalized
588	   according to Unicode Normalization Form C (NFC, [UTR15]).  In this
589	   case, retain the original character encoding as the "document
590	   character encoding".  (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
591	   DO, CHANGE?)

593	   In other cases (written on paper, read aloud, or otherwise
594	   represented independent of any character encoding) represent the IRI
595	   as a sequence of characters from the UCS normalized according to
596	   Unicode Normalization Form C (NFC, [UTR15]).
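
The three cases of this section can be sketched in Python as follows. This is an illustration under stated assumptions: the function names to_ucs and from_transcription are hypothetical, and "Unicode-based" is approximated by checking whether Python's canonical codec name starts with "utf-".

```python
import codecs
import unicodedata


def to_ucs(data, encoding=None):
    """Return (characters, document_character_encoding or None)."""
    if isinstance(data, str):
        # Already a sequence of Unicode characters: leave as is.
        return data, None
    if codecs.lookup(encoding).name.startswith("utf-"):
        # Octet stream in a Unicode-based encoding: decode, do not normalize.
        return data.decode(encoding), None
    # Known non-Unicode encoding: convert, normalize to NFC (a SHOULD),
    # and retain the original encoding as the document character encoding.
    text = unicodedata.normalize("NFC", data.decode(encoding))
    return text, encoding


def from_transcription(text):
    """IRI taken from paper or speech: represent as NFC-normalized UCS."""
    return unicodedata.normalize("NFC", text)
```

For example, to_ucs(b"ros\xe9", "iso-8859-1") yields the characters "ros" followed by U+00E9, together with "iso-8859-1" as the retained document character encoding.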

598	3.2.  Parse the IRI into IRI components

600	   Parse the IRI, either as a relative reference (no scheme) or using
601	   scheme specific processing (according to the scheme given); the
602	   result is a set of parsed IRI components.  (NOTE: FIX
603	   BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES THAT USE GENERIC
604	   SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN ONLY USE AUTHORITY FOR NAMES
605	   THAT FOLLOW PUNICODE.)

607	   NOTE: The result of parsing into components will be a
608	   correspondence of substrings of the IRI according to the part
609	   matched.  For example, in [HTML5], the protocol components of
610	   interest are SCHEME (scheme), HOST (ireg-name), PORT (port), the PATH
611	   (ipath after the initial "/"), QUERY (iquery), FRAGMENT (ifragment),
612	   and AUTHORITY (iauthority).

614	   Subsequent processing rules are sometimes used to define other
615	   syntactic components.  For example, [HTML5] defines APIs for IRI
616	   processing; in these APIs:

618	   HOSTSPECIFIC  the substring that follows the substring matched by the
619	      iauthority production, or the whole string if the iauthority
620	      production wasn't matched.

622	   HOSTPORT  if there is a scheme component and a port component and the
623	      port given by the port component is different than the default
624	      port defined for the protocol given by the scheme component, then
625	      HOSTPORT is the substring that starts with the substring matched
626	      by the host production and ends with the substring matched by the
627	      port production, and includes the colon in between the two.
628	      Otherwise, it is the same as the host component.

630	3.3.  General percent-encoding of IRI components

632	   For most IRI components, it is possible to map the IRI component to
633	   an equivalent URI component by percent-encoding those characters not
634	   allowed in URIs.  Previous processing steps will have removed some
635	   characters, and the interpretation of reserved characters will have
636	   already been done (with the syntactic reserved characters outside of
637	   the IRI component).  This mapping is defined for all sequences of
638	   Unicode characters, whether or not they are valid for the component
639	   in question.

641	   For each character which is not allowed in a valid URI (NOTE: WHAT IS
642	   THE RIGHT REFERENCE HERE), apply the following steps.

644	   Convert to UTF-8  Convert the character to a sequence of one or more
645	      octets using UTF-8 [RFC3629].

647	   Percent encode  Convert each octet of this sequence to %HH, where HH
648	      is the hexadecimal notation of the octet value.  The hexadecimal
649	      notation SHOULD use uppercase letters.  (This is the general URI
650	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

652	   Note that the mapping is an identity transformation for parsed URI
653	   components of valid URIs, and is idempotent: applying the mapping a
654	   second time will not change anything.
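
The two steps above can be sketched in Python. Because the text leaves open which reference defines "not allowed in a valid URI", this sketch assumes the set of unreserved characters, reserved characters, and "%" from [RFC3986]; the name pct_encode_component is illustrative only.

```python
import string

# Assumption: characters allowed in URIs are the RFC 3986 unreserved and
# reserved characters, plus "%" so existing percent-encodings are kept.
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~"  # unreserved
    + ":/?#[]@"                                    # gen-delims
    + "!$&'()*+,;="                                # sub-delims
    + "%"
)


def pct_encode_component(component):
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(out)
```

Because "%" is left untouched, applying the function a second time does not change the result, which matches the idempotence noted above.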

656	3.4.  Mapping ireg-name

658	   Schemes that allow non-ASCII characters in the reg-name (ireg-name)
659	   position MUST convert the ireg-name component of an IRI as
660	   follows:

662	   Replace the ireg-name part of the IRI by the part converted using the
663	   ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
664	   separated label, and by using U+002E (FULL STOP) as a label
665	   separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
666	   flag AllowUnassigned set to FALSE.  The ToASCII operation may fail,
667	   which means that the IRI cannot be resolved.  If the domain name
668	   conversion fails, then the entire IRI conversion fails.
669	   Processors that have no mechanism for signalling a failure
670	   MAY instead substitute an otherwise invalid host name, although such
671	   processing SHOULD be avoided.

673	   For example, the IRI
674	   "http://r&#xE9;sum&#xE9;.example.org"
675	   MAY be converted to
676	   "http://xn--rsum-bpad.example.org"
677	   ; conversion to percent-encoded form, e.g.,
678	   "http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed.
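
As an illustration, Python's built-in "idna" codec implements the [RFC3490] operations and can stand in for ToASCII in a sketch. Whether its flag handling matches the UseSTD3ASCIIRules and AllowUnassigned settings required above is an assumption of this example, and the name map_ireg_name is hypothetical.

```python
def map_ireg_name(ireg_name):
    """Map an ireg-name to its ASCII form, label by label.

    Raises ValueError if the conversion fails, in which case the
    entire IRI conversion is to be treated as failed.
    """
    try:
        return ireg_name.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("ireg-name cannot be converted: %r" % ireg_name) from exc
```

With this codec, the host name "r&#xE9;sum&#xE9;.example.org" (in XML notation) maps to "xn--rsum-bpad.example.org", and an all-ASCII host name is returned unchanged.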

680	   Note:  Domain Names may appear in parts of an IRI other than the
681	      ireg-name part.  It is the responsibility of scheme-specific
682	      implementations (if the Internationalized Domain Name is part of
683	      the scheme syntax) or of server-side implementations (if the
684	      Internationalized Domain Name is part of 'iquery') to apply the
685	      necessary conversions at the appropriate point.  Example: Trying
686	      to validate the Web page at
687	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
688	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
689	      example.org, which would convert to a URI of
690	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
691	      example.org.  The server-side implementation is responsible for
692	      making the necessary conversions to be able to retrieve the Web
693	      page.

695	   Note:  In this process, characters allowed in URI references and
696	      existing percent-encoded sequences are not encoded further.  (This
697	      mapping is similar to, but different from, the encoding applied
698	      when arbitrary content is included in some part of a URI.)  For
699	      example, an IRI of
700	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
701	      converted to
702	      "http://www.example.org/red%09ros%C3%A9#red", not to something
703	      like
704	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
705	      ((DESIGN QUESTION: What about e.g.
706	      http://r%C3%A9sum%C3%A9.example.org in an IRI?  Will that get
707	      converted to punycode, or not?))

709	3.5.  Mapping query components

711	   ((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed
712	   HTTP infrastructure, the following special case applies for schemes
713	   "http" and "https" and IRIs whose origin has a document charset other
714	   than one which is UCS-based (e.g., UTF-8 or UTF-16).  In such a case,
715	   the "query" component of an IRI is mapped into a URI by using the
716	   document charset rather than UTF-8 as the binary representation
717	   before pct-encoding.  This mapping is not applied for any other
718	   scheme or component.
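
The special case for query components might be sketched as follows. This is a hedged illustration: the function name map_query is hypothetical, the set of URI characters is assumed as in [RFC3986], and the behavior for characters that the document charset cannot represent is not defined by the text above, so the sketch simply lets the encoding error propagate.

```python
import codecs
import string

# Assumption: RFC 3986 unreserved and reserved characters, plus "%".
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~" + ":/?#[]@" + "!$&'()*+,;=" + "%"
)


def map_query(iquery, scheme, document_charset):
    """Map an iquery component to a URI query component."""
    is_unicode = codecs.lookup(document_charset).name.startswith("utf-")
    if scheme.lower() in ("http", "https") and not is_unicode:
        encoding = document_charset  # legacy special case
    else:
        encoding = "utf-8"           # general rule of Section 3.3
    out = []
    for ch in iquery:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError if ch is not in the charset.
            out.append("".join("%{:02X}".format(b) for b in ch.encode(encoding)))
    return "".join(out)
```

For an "http" IRI in an iso-8859-1 document, a query of "q=ros&#xE9;" (in XML notation) thus maps to "q=ros%E9", whereas the same query under any other scheme maps to "q=ros%C3%A9".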

720	3.6.  Mapping IRIs to URIs

722	   The canonical mapping from an IRI to a URI is defined by applying the
723	   mapping above (from IRI to URI components) and then reassembling a
724	   URI from the parsed URI components using the original punctuation
725	   that delimited the IRI components.

727	3.7.  Converting URIs to IRIs

729	   In some situations, for presentation and further processing, it is
730	   desirable to convert a URI into an equivalent IRI in which natural
731	   characters are represented directly rather than percent encoded.  Of
732	   course, every URI is already an IRI in its own right without any
733	   conversion.  This section gives one procedure for such a
734	   conversion.

736	   The conversion described in this section, if given a valid URI, will
737	   result in an IRI that maps back to the URI used as an input for the
738	   conversion (except for potential case differences in percent-encoding
739	   and for potential percent-encoded unreserved characters).  However,
740	   the IRI resulting from this conversion may differ from the original
741	   IRI (if there ever was one).

743	   URI-to-IRI conversion removes percent-encodings, but not all percent-
744	   encodings can be eliminated.  There are several reasons for this:

746	   1. Some percent-encodings are necessary to distinguish percent-
747	      encoded and unencoded uses of reserved characters.

749	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
750	      octets.

752	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
753	      there is a very high probability, but no guarantee, that percent-
754	      encodings that can be interpreted as sequences of UTF-8 octets
755	      actually originated from UTF-8.  For a detailed discussion, see
756	      [Duerst97].)

758	   3. The conversion may result in a character that is not appropriate
759	      in an IRI.  See Section 2.2, Section 4.1, and Section 6.1 for
760	      further details.

762	   4. IRI to URI conversion has different rules for dealing with domain
763	      names and query parameters.

765	   Conversion from a URI to an IRI MAY be done by using the following
766	   steps:

768	   1. Represent the URI as a sequence of octets in US-ASCII.

770	   2. Convert all percent-encodings ("%" followed by two hexadecimal
771	      digits) to the corresponding octets, except those corresponding to
772	      "%", characters in "reserved", and characters in US-ASCII not
773	      allowed in URIs.

775	   3. Re-percent-encode any octet produced in step 2 that is not part of
776	      a strictly legal UTF-8 octet sequence.

778	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
779	      represent characters that are not appropriate according to
780	      Section 2.2, Section 4.1, and Section 6.1.

782	   5. Interpret the resulting octet sequence as a sequence of characters
783	      encoded in UTF-8.

785	   6. URIs known to contain domain names in the reg-name component
786	      SHOULD convert punycode-encoded domain name labels to the
787	      corresponding characters using the ToUnicode procedure.

789	   This procedure will convert as many percent-encoded characters as
790	   possible to characters in an IRI.  Because there are some choices
791	   when step 4 is applied (see Section 6.1), results may vary.

793	   Conversions from URIs to IRIs MUST NOT use any character encoding
794	   other than UTF-8 in steps 3 and 4, even if it might be possible to
795	   guess from the context that another character encoding than UTF-8 was
796	   used in the URI.  For example, the URI
797	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
798	   interpreted to contain two e-acute characters encoded as iso-8859-1.
799	   It must not be converted to an IRI containing these e-acute
800	   characters.  Otherwise, in the future the IRI will be mapped to
801	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
802	   URI from "http://www.example.org/r%E9sum%E9.html".
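
The six conversion steps of this section can be sketched in Python. This is one possible reading, not a reference implementation: the name uri_to_iri is hypothetical, the step 4 check is reduced to the bidirectional formatting characters of Section 4.1 plus code points below U+00A0 (a partial stand-in for Section 2.2, Section 4.1, and Section 6.1), and Python's "idna" codec is assumed to be an acceptable ToUnicode.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (forbidden by Section 4.1).
BIDI_FORMATTING = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def _pct(octets):
    return "".join("%{:02X}".format(b) for b in octets)


def _convert_run(match):
    """Steps 2 to 5 for one run of consecutive percent-encodings."""
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        ch = None
        for n in (1, 2, 3, 4):  # try the shortest strictly legal sequence
            chunk = octets[i:i + n]
            try:
                ch = chunk.decode("utf-8")
                break
            except UnicodeDecodeError:
                pass
        if ch is None:
            chunk = octets[i:i + 1]  # step 3: not legal UTF-8
        keep = ch is not None and (
            ch in UNRESERVED
            or (ord(ch) >= 0xA0 and ch not in BIDI_FORMATTING)  # step 4
        )
        out.append(ch if keep else _pct(chunk))
        i += len(chunk)
    return "".join(out)


def uri_to_iri(uri):
    uri.encode("ascii")  # step 1: a URI is a sequence of US-ASCII octets
    iri = re.sub(r"(?:%[0-9A-Fa-f]{2})+", _convert_run, uri)
    m = re.match(r"([A-Za-z][A-Za-z0-9+.\-]*://)([^/?#]*)(.*)$", iri, re.S)
    if m:  # step 6: ToUnicode on the host, skipping IP literals
        prefix, authority, rest = m.groups()
        userinfo, at, hostport = authority.rpartition("@")
        if not hostport.startswith("["):
            host, colon, port = hostport.partition(":")
            try:
                host = host.encode("ascii").decode("idna")
            except UnicodeError:
                pass  # leave the host unchanged if ToUnicode fails
            hostport = host + colon + port
        iri = prefix + userinfo + at + hostport + rest
    return iri
```

Percent-encodings of "%", of reserved characters, and of other ASCII characters outside "unreserved" are re-emitted in uppercase hexadecimal rather than decoded, so the result still maps back to the input URI up to the case of the hexadecimal digits.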

804	3.7.1.  Examples

806	   This section shows various examples of converting URIs to IRIs.  Each
807	   example shows the result after each of the steps 1 through 6 is
808	   applied.  XML Notation is used for the final result.  Octets are
809	   denoted by "<" followed by two hexadecimal digits followed by ">".

811	   The following example contains the sequence "%C3%BC", which is a
812	   strictly legal UTF-8 sequence, and which is converted into the actual
813	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
814	   u-umlaut).

816	   1. http://www.example.org/D%C3%BCrst

818	   2. http://www.example.org/D<c3><bc>rst

820	   3. http://www.example.org/D<c3><bc>rst

822	   4. http://www.example.org/D<c3><bc>rst

824	   5. http://www.example.org/D&#xFC;rst

826	   6. http://www.example.org/D&#xFC;rst

828	   The following example contains the sequence "%FC", which might
829	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
830	   iso-8859-1 character encoding.  (It might represent other characters
831	   in other character encodings.  For example, the octet <fc> in iso-
832	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
833	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
834	   encoded in step 3.

836	   1. http://www.example.org/D%FCrst

838	   2. http://www.example.org/D<fc>rst

840	   3. http://www.example.org/D%FCrst

842	   4. http://www.example.org/D%FCrst

844	   5. http://www.example.org/D%FCrst

846	   6. http://www.example.org/D%FCrst

848	   The following example contains "%e2%80%ae", which is the percent-
849	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
850	   Section 4.1 forbids the direct use of this character in an IRI.
851	   Therefore, the corresponding octets are re-percent-encoded in step 4.
854	   This example shows that the case (upper- or lowercase) of letters
855	   used in percent-encodings may not be preserved.  The example also
856	   contains a punycode-encoded domain name label (xn--99zt52a), which is
857	   converted to the corresponding characters in step 6.

859	   1. http://xn--99zt52a.example.org/%e2%80%ae

861	   2. http://xn--99zt52a.example.org/<e2><80><ae>

863	   3. http://xn--99zt52a.example.org/<e2><80><ae>

865	   4. http://xn--99zt52a.example.org/%E2%80%AE

867	   5. http://xn--99zt52a.example.org/%E2%80%AE

869	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

871	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
872	   (Japanese Natto).  ((EDITOR NOTE: There is some inconsistency in this
873	   note.))

875	4.  Bidirectional IRIs for Right-to-Left Languages

877	   Some UCS characters, such as those used in the Arabic and Hebrew
878	   scripts, have an inherent right-to-left (rtl) writing direction.
879	   IRIs containing these characters (called bidirectional IRIs or Bidi
880	   IRIs) require additional attention because of the non-trivial
881	   relation between logical representation (used for digital
882	   representation and for reading/spelling) and visual representation
883	   (used for display/printing).

885	   Because of the complex interaction between the logical
886	   representation, the visual representation, and the syntax of a Bidi
887	   IRI, a balance is needed between various requirements.  The main
888	   requirements are

890	   1. user-predictable conversion between visual and logical
891	      representation;

893	   2. the ability to include a wide range of characters in various parts
894	      of the IRI; and

896	   3. minor or no changes or restrictions for implementations.

898	4.1.  Logical Storage and Visual Presentation

900	   When stored or transmitted in digital representation, bidirectional
901	   IRIs MUST be in full logical order and MUST conform to the IRI syntax
902	   rules (which includes the rules relevant to their scheme).  This
903	   ensures that bidirectional IRIs can be processed in the same way as
904	   other IRIs.

906	   Bidirectional IRIs MUST be rendered by using the Unicode
907	   Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
908	   rendered in the same way as they would be if they were in a left-to-
909	   right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
910	   RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
911	   FORMATTING (PDF).  Setting the embedding direction can also be done
912	   in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).

914	   There is no requirement to use the above embedding if the display is
915	   still the same without the embedding.  For example, a bidirectional
916	   IRI in a text with left-to-right base directionality (such as used
917	   for English or Cyrillic) that is preceded and followed by whitespace
918	   and strong left-to-right characters does not need an embedding.
919	   Also, a bidirectional relative IRI reference that only contains
920	   strong right-to-left characters and weak characters and that starts
921	   and ends with a strong right-to-left character and appears in a text
922	   with right-to-left base directionality (such as used for Arabic or
923	   Hebrew) and is preceded and followed by whitespace and strong
924	   characters does not need an embedding.

926	   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
927	   sufficient to force the correct display behavior.  However, the
928	   details of the Unicode Bidirectional algorithm are not always easy to
929	   understand.  Implementers are strongly advised to err on the side of
930	   caution and to use embedding in all cases where they are not
931	   completely sure that the display behavior is unaffected without the
932	   embedding.

934	   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
935	   higher-level protocols to influence bidirectional rendering.  Such
936	   changes by higher-level protocols MUST NOT be used if they change the
937	   rendering of IRIs.

939	   The bidirectional formatting characters that may be used before or
940	   after the IRI to ensure correct display are not themselves part of
941	   the IRI.  IRIs MUST NOT contain bidirectional formatting characters
942	   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  These characters affect the
943	   visual rendering of the IRI but are not themselves visible.  It would
944	   therefore not be possible to input an IRI with such characters
945	   correctly.

947	4.2.  Bidi IRI Structure

949	   The Unicode Bidirectional Algorithm is designed mainly for running
950	   text.  To make sure that it does not affect the rendering of
951	   bidirectional IRIs too much, some restrictions on bidirectional IRIs
952	   are necessary.  These restrictions are given in terms of delimiters
953	   (structural characters, mostly punctuation such as "@", ".", ":", and
954	   "/") and components (usually consisting mostly of letters and
955	   digits).

957	   The following syntax rules from Section 2.2 correspond to components
958	   for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
959	   isegment-nz, isegment-nz-nc, iquery, and ifragment.

961	   Specifications that define the syntax of any of the above components
962	   MAY divide them further and define smaller parts to be components
963	   according to this document.  As an example, the restrictions of
964	   [RFC3490] on bidirectional domain names correspond to treating each
965	   label of a domain name as a component for schemes with ireg-name as a
966	   domain name.  Even where the components are not defined formally, it
967	   may be helpful to think about some syntax in terms of components and
968	   to apply the relevant restrictions.  For example, for the usual name/
969	   value syntax in query parts, it is convenient to treat each name and
970	   each value as a component.  As another example, the extensions in a
971	   resource name can be treated as separate components.

973	   For each component, the following restrictions apply:

975	   1. A component SHOULD NOT use both right-to-left and left-to-right
976	      characters.

978	   2. A component using right-to-left characters SHOULD start and end
979	      with right-to-left characters.
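
The two restrictions above can be checked mechanically for a given component. The sketch below assumes that "right-to-left characters" means characters of Unicode bidirectional class R or AL, and that "left-to-right characters" means class L; the function name check_bidi_component is illustrative.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # assumption: strong right-to-left classes
LTR_CLASSES = {"L"}        # assumption: strong left-to-right class


def check_bidi_component(component):
    """Return the numbers of the restrictions that the component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)
    violated = []
    if has_rtl and has_ltr:
        violated.append(1)  # mixes right-to-left and left-to-right
    if has_rtl and not (classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES):
        violated.append(2)  # does not start and end with right-to-left
    return violated
```

A component made of two Hebrew letters followed by a digit violates only the second restriction, because digits are neither left-to-right nor right-to-left in this classification.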

981	   The above restrictions are given as "SHOULD"s, rather than as
982	   "MUST"s.  For IRIs that are never presented visually, they are not
983	   relevant.  However, for IRIs in general, they are very important to
984	   ensure consistent conversion between visual presentation and logical
985	   representation, in both directions.

987	   Note:  In some components, the above restrictions may actually be
988	      strictly enforced.  For example, [RFC3490] requires that these
989	      restrictions apply to the labels of a host name for those schemes
990	      where ireg-name is a host name.  In some other components (for
991	      example, path components) following these restrictions may not be
992	      too difficult.  For other components, such as parts of the query
993	      part, it may be very difficult to enforce the restrictions because
994	      the values of query parameters may be arbitrary character
995	      sequences.

997	   If the above restrictions cannot be satisfied otherwise, the affected
998	   component can always be mapped to URI notation as described in
999	   Section 3.3.  Please note that the whole component has to be mapped
1000	   (see also Example 9 below).

1002	4.3.  Input of Bidi IRIs

1004	   Bidi input methods MUST generate Bidi IRIs in logical order while
1005	   rendering them according to Section 4.1.  During input, rendering
1006	   SHOULD be updated after every new character is input to avoid end-
1007	   user confusion.

1009	4.4.  Examples

1011	   This section gives examples of bidirectional IRIs, in Bidi Notation.
1012	   It shows legal IRIs with the relationship between logical and visual
1013	   representation and explains how certain phenomena in this
1014	   relationship may look strange to somebody not familiar with
1015	   bidirectional behavior, but familiar to users of Arabic and Hebrew.
1016	   It also shows what happens if the restrictions given in Section 4.2
1017	   are not followed.  The examples below can be seen at [BidiEx], in
1018	   Arabic, Hebrew, and Bidi Notation variants.

1020	   To read the bidi text in the examples, read the visual representation
1021	   from left to right until you encounter a block of rtl text.  Read the
1022	   rtl block (including slashes and other special characters) from right
1023	   to left, then continue at the next unread ltr character.

1025	   Example 1: A single component with rtl characters is inverted:
1026	   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
1027	   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
1028	   Components can be read one by one, and each component can be read in
1029	   its natural direction.

1031	   Example 2: More than one consecutive component with rtl characters is
1032	   inverted as a whole:
1033	   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
1034	   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
1035	   A sequence of rtl components is read rtl, in the same way as a
1036	   sequence of rtl words is read rtl in a bidi text.

1038	   Example 3: All components of an IRI (except for the scheme) are rtl.
1039	   All rtl components are inverted overall:
1040	   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
1041	   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
1042	   The whole IRI (except the scheme) is read rtl.  Delimiters between
1043	   rtl components stay between the respective components; delimiters
1044	   between ltr and rtl components don't move.

1046	   Example 4: Each of several sequences of rtl components is inverted on
1047	   its own:
1048	   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
1049	   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
1050	   Each sequence of rtl components is read rtl, in the same way as each
1051	   sequence of rtl words in an ltr text is read rtl.

1053	   Example 5: Example 2, applied to components of different kinds:
1054	   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1055	   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
1056	   The inversion of the domain name label and the path component may be
1057	   unexpected, but it is consistent with other bidi behavior.  For
1058	   reassurance that the domain component really is "ab.cd.EF", it may be
1059	   helpful to read aloud the visual representation following the bidi
1060	   algorithm.  After "http://ab.cd." one reads the RTL block
1061	   "E-F-slash-G-H", which corresponds to the logical representation.

1063	   Example 6: Same as Example 5, with more rtl components:
1064	   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
1065	   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
1066	   The inversion of the domain name labels and the path components may
1067	   be easier to identify because the delimiters also move.

1069	   Example 7: A single rtl component includes digits:
1070	   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
1071	   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
1072	   Numbers are written ltr in all cases but are treated as an additional
1073	   embedding inside a run of rtl characters.  This is completely
1074	   consistent with usual bidirectional text.

1076	   Example 8 (not allowed): Numbers are at the start or end of an rtl
1077	   component:
1078	   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
1079	   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
1080	   The sequence "1/2" is interpreted by the bidi algorithm as a
1081	   fraction, fragmenting the components and leading to confusion.  There
1082	   are other characters that are interpreted in a special way close to
1083	   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

1085	   Example 9 (not allowed): The numbers in the previous example are
1086	   percent-encoded:
1087	   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
1088	   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

1090	   Example 10 (allowed but not recommended):
1092	   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
1093	   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
1094	   Components consisting of only numbers are allowed (it would be rather
1095	   difficult to prohibit them), but these may interact with adjacent RTL
1096	   components in ways that are not easy to predict.

1098	   Example 11 (allowed but not recommended):
1099	   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
1100	   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
1101	   Components consisting of numbers and left-to-right characters are
1102	   allowed, but these may interact with adjacent RTL components in ways
1103	   that are not easy to predict.

1105	5.  Normalization and Comparison

1107	   Note:  The structure and much of the material for this section is
1108	      taken from section 6 of [RFC3986]; the differences are due to the
1109	      specifics of IRIs.

1111	   One of the most common operations on IRIs is simple comparison:
1112	   Determining whether two IRIs are equivalent, without using the IRIs
1113	   to access their respective resource(s).  A comparison is performed
1114	   whenever a response cache is accessed, a browser checks its history
1115	   to color a link, or an XML parser processes tags within a namespace.
1116	   Extensive normalization prior to comparison of IRIs may be used by
1117	   spiders and indexing engines to prune a search space or reduce
1118	   duplication of request actions and response storage.

1120	   IRI comparison is performed for some particular purpose.  Protocols
1121	   or implementations that compare IRIs for different purposes will
1122	   often be subject to differing design trade-offs with regard to how
1123	   much effort should be spent in reducing aliased identifiers.  This
1124	   section describes various methods that may be used to compare IRIs,
1125	   the trade-offs between them, and the types of applications that might
1126	   use them.

1128	5.1.  Equivalence

1130	   Because IRIs exist to identify resources, presumably they should be
1131	   considered equivalent when they identify the same resource.  However,
1132	   this definition of equivalence is not of much practical use, as there
1133	   is no way for an implementation to compare two resources to determine
1134	   if they are "the same" unless it has full knowledge or control of
1135	   them.  For this reason, determination of equivalence or difference of
1136	   IRIs is based on string comparison, perhaps augmented by reference to
1137	   additional rules provided by URI scheme definitions.  We use the
1138	   terms "different" and "equivalent" to describe the possible outcomes
1139	   of such comparisons, but there are many application-dependent
1140	   versions of equivalence.

1142	   Even when it is possible to determine that two IRIs are equivalent,
1143	   IRI comparison is not sufficient to determine whether two IRIs
1144	   identify different resources.  For example, an owner of two different
1145	   domain names could decide to serve the same resource from both,
1146	   resulting in two different IRIs.  Therefore, comparison methods are
1147	   designed to minimize false negatives while strictly avoiding false
1148	   positives.

1150	   In testing for equivalence, applications should not directly compare
1151	   relative references; the references should be converted to their
1152	   respective target IRIs before comparison.  When IRIs are compared to
1153	   select (or avoid) a network action, such as retrieval of a
1154	   representation, fragment components (if any) should be excluded from
1155	   the comparison.
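
   As a rough illustration of the two precautions above, the following
   Python sketch resolves a reference against its base and drops any
   fragment before comparison.  The helper name comparison_key is
   hypothetical, and Python's urllib.parse is only one possible
   implementation of reference resolution; it is assumed here to follow
   [RFC3986] closely enough for this example.

```python
from urllib.parse import urljoin, urldefrag


def comparison_key(reference: str, base: str) -> str:
    """Return the target IRI of `reference`, without its fragment.

    Hypothetical helper: resolves the reference against the base first,
    then removes the fragment, as suggested for comparisons that select
    (or avoid) a network action.
    """
    target = urljoin(base, reference)   # relative reference -> target IRI
    return urldefrag(target).url        # drop "#fragment", if any


base = "http://example.org/a/b"
same = comparison_key("c#top", base) == comparison_key("../a/c", base)
```

   Comparing the raw strings "c#top" and "../a/c" directly would report
   a difference, even though both select the same retrieval action.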

1157	   Applications using IRIs as identity tokens with no relationship to a
1158	   protocol MUST use the Simple String Comparison (see Section 5.3.1).
1159	   All other applications MUST select one of the comparison practices
1160	   from the Comparison Ladder (see Section 5.3).

1162	5.2.  Preparation for Comparison

1164	   Any kind of IRI comparison REQUIRES that any additional contextual
1165	   processing is first performed, including undoing higher-level
1166	   escapings or encodings in the protocol or format that carries an IRI.
1167	   This preprocessing is usually done when the protocol or format is
1168	   parsed.

1170	   Examples of contextual preprocessing steps are described in
1171	   Section 7.

1173	   Examples of such escapings or encodings are entities and numeric
1174	   character references in [HTML4] and [XML1].  As an example,
1175	   "http://example.org/ros&eacute;" (in HTML),
1176	   "http://example.org/ros&#233;" (in HTML or XML), and
1177	   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
1178	   what is denoted in this document (see Section 1.4) as
1179	   "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
1180	   actual e-acute character, to compensate for the fact that this
1181	   document cannot contain non-ASCII characters).
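
   The HTML case above can be sketched in Python as follows.  This is
   only an illustration: it assumes that the standard html.unescape
   function is an acceptable stand-in for the character reference
   processing that an HTML parser would perform, and it writes the
   e-acute character as the escape "\u00e9" to keep the example in
   US-ASCII.

```python
import html

# The three escaped forms from the text above.
escaped_forms = [
    "http://example.org/ros&eacute;",   # HTML named entity
    "http://example.org/ros&#233;",     # decimal character reference
    "http://example.org/ros&#xE9;",     # hexadecimal character reference
]

# Undo the format-level escaping before any IRI comparison takes place.
resolved = {html.unescape(form) for form in escaped_forms}
```

   After this step, all three forms collapse to a single string ending
   in the actual e-acute character.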

1183	   Similar considerations apply to encodings such as Transfer Codings in
1184	   HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
1185	   ([RFC2045]), although in these cases, the encoding is based not on
1186	   characters but on octets, and additional care is required to make
1187	   sure that characters, and not just arbitrary octets, are compared
1188	   (see Section 5.3.1).

1190	5.3.  Comparison Ladder

1192	   In practice, a variety of methods are used to test IRI equivalence.
1193	   These methods fall into a range distinguished by the amount of
1194	   processing required and the degree to which the probability of false
1195	   negatives is reduced.  As noted above, false negatives cannot be
1196	   eliminated.  In practice, their probability can be reduced, but this
1197	   reduction requires more processing and is not cost-effective for all
1198	   applications.

1200	   If this range of comparison practices is considered as a ladder, the
1201	   following discussion will climb the ladder, starting with practices
1202	   that are cheap but have a relatively higher chance of producing false
1203	   negatives, and proceeding to those that have higher computational
1204	   cost and lower risk of false negatives.

1206	5.3.1.  Simple String Comparison

1208	   If two IRIs, when considered as character strings, are identical,
1209	   then it is safe to conclude that they are equivalent.  This type of
1210	   equivalence test has very low computational cost and is in wide use
1211	   in a variety of applications, particularly in the domain of parsing.
1212	   It is also used when a definitive answer to the question of IRI
1213	   equivalence is needed that is independent of the scheme used and that
1214	   can be calculated quickly and without accessing a network.  An
1215	   example of such a case is XML Namespaces ([XMLNamespace]).

1217	   Testing strings for equivalence requires some basic precautions.
1218	   This procedure is often referred to as "bit-for-bit" or "byte-for-
1219	   byte" comparison, which is potentially misleading.  Testing strings
1220	   for equality is normally based on pair comparison of the characters
1221	   that make up the strings, starting from the first and proceeding
1222	   until both strings are exhausted and all characters are found to be
1223	   equal, until a pair of characters compares unequal, or until one of
1224	   the strings is exhausted before the other.

1226	   This character comparison requires that each pair of characters be
1227	   put in comparable encoding form.  For example, should one IRI be
1228	   stored in a byte array in UTF-8 encoding form and the second in a
1229	   UTF-16 encoding form, bit-for-bit comparisons applied naively will
1230	   produce errors.  It is better to speak of equality on a character-
1231	   for-character rather than on a byte-for-byte or bit-for-bit basis.
1232	   In practical terms, character-by-character comparisons should be done
1233	   codepoint by codepoint after conversion to a common character
1234	   encoding form.  When comparing character by character, the comparison
1235	   function MUST NOT map IRIs to URIs, because such a mapping would
1236	   create additional spurious equivalences.  It follows that an IRI
1237	   SHOULD NOT be modified when being transported if there is any chance
1238	   that this IRI might be used in a context that uses Simple String
1239	   Comparison.
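
   A minimal Python sketch of this precaution is shown below.  The
   function name and its parameters are hypothetical; the point is only
   that both IRIs are decoded to sequences of code points before they
   are compared, and that no IRI-to-URI mapping is applied.

```python
def simple_string_equal(a: bytes, a_encoding: str,
                        b: bytes, b_encoding: str) -> bool:
    """Compare two stored IRIs code point by code point.

    Both byte sequences are first converted to a common form (Python
    strings, which compare by code point).  No percent-encoding or
    other mapping is performed.
    """
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

bytes_differ = as_utf8 != as_utf16
characters_match = simple_string_equal(as_utf8, "utf-8",
                                       as_utf16, "utf-16-le")
```

   The byte sequences differ although the IRIs are identical character
   for character, which is why a naive byte-for-byte test produces
   errors.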

1241	   False negatives are caused by the production and use of IRI aliases.
1242	   Unnecessary aliases can be reduced, regardless of the comparison
1243	   method, by consistently providing IRI references in an already
1244	   normalized form (i.e., a form identical to what would be produced
1245	   after normalization is applied, as described below).  Protocols and
1246	   data formats often limit some IRI comparisons to simple string
1247	   comparison, based on the theory that people and implementations will,
1248	   in their own best interest, be consistent in providing IRI
1249	   references, or at least be consistent enough to negate any efficiency
1250	   that might be obtained from further normalization.

1252	5.3.2.  Syntax-Based Normalization

1254	   Implementations may use logic based on the definitions provided by
1255	   this specification to reduce the probability of false negatives.
1256	   This processing is moderately higher in cost than character-for-
1257	   character string comparison.  For example, an application using this
1258	   approach could reasonably consider the following two IRIs equivalent:

1260	      example://a/b/c/%7Bfoo%7D/ros&#xE9;
1261	      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

1263	   Web user agents, such as browsers, typically apply this type of IRI
1264	   normalization when determining whether a cached response is
1265	   available.  Syntax-based normalization includes such techniques as
1266	   case normalization, character normalization, percent-encoding
1267	   normalization, and removal of dot-segments.

1269	5.3.2.1.  Case Normalization

1271	   For all IRIs, the hexadecimal digits within a percent-encoding
1272	   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1273	   should be normalized to use uppercase letters for the digits A-F.

1275	   When an IRI uses components of the generic syntax, the component
1276	   syntax equivalence rules always apply; namely, that the scheme and
1277	   US-ASCII only host are case insensitive and therefore should be
1278	   normalized to lowercase.  For example, the URI
1279	   "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
1280	   Case equivalence for non-ASCII characters in IRI components that are
1281	   IDNs is discussed in Section 5.3.3.  The other generic syntax
1282	   components are assumed to be case sensitive unless specifically
1283	   defined otherwise by the scheme.

1285	   Creating schemes that allow case-insensitive syntax components
1286	   containing non-ASCII characters should be avoided.  Case
1287	   normalization of non-ASCII characters can be culturally dependent and
1288	   is always a complex operation.  The only exception concerns non-ASCII
1289	   host names for which the character normalization includes a mapping
1290	   step derived from case folding.
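
   The two case rules of this section might be sketched in Python as
   follows.  The function name case_normalize and the regular
   expression used to locate the scheme and authority are assumptions
   made for this example; hosts containing non-ASCII characters are
   deliberately left untouched.

```python
import re

# Scheme, optionally followed by "//" and an authority component.
_PREFIX = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*:)(//[^/?#]*)?")


def case_normalize(iri: str) -> str:
    """Lowercase the scheme and a US-ASCII-only host, and uppercase
    the hexadecimal digits of percent-encoding triplets."""

    def lower_prefix(match):
        scheme = match.group(1).lower()
        authority = match.group(2) or ""
        # Keep userinfo as is; it is not defined to be case insensitive.
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():
            hostport = hostport.lower()
        return scheme + userinfo + at + hostport

    iri = _PREFIX.sub(lower_prefix, iri, count=1)
    # Done last so that triplets inside the host are uppercased too.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), iri)
```

   With this sketch, "HTTP://www.EXAMPLE.com/" and
   "http://www.example.com/" produce the same result, while the path,
   query, and fragment keep their original case.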

1292	5.3.2.2.  Character Normalization

1294	   The Unicode Standard [UNIV4] defines various equivalences between
1295	   sequences of characters for various purposes.  Unicode Standard Annex
1296	   #15 [UTR15] defines various Normalization Forms for these
1297	   equivalences, in particular Normalization Form C (NFC, Canonical
1298	   Decomposition, followed by Canonical Composition) and Normalization
1299	   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
1300	   Composition).

1302	   IRIs already in Unicode MUST NOT be normalized before parsing or
1303	   interpreting.  In many non-Unicode character encodings, some text
1304	   cannot be represented directly.  For example, the word "Vietnam" is
1305	   natively written "Vi&#x1EC7;t Nam" (containing a LATIN SMALL LETTER E
1306	   WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct transcoding from
1307	   the windows-1258 character encoding leads to "Vi&#xEA;&#x323;t Nam"
1308	   (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX followed by a
1309	   COMBINING DOT BELOW).  Direct transcoding of other 8-bit encodings of
1310	   Vietnamese may lead to other representations.

1312	   Equivalence of IRIs MUST rely on the assumption that IRIs are
1313	   appropriately pre-character-normalized rather than apply character
1314	   normalization when comparing two IRIs.  The exceptions are conversion
1315	   from a non-digital form, and conversion from a non-UCS-based
1316	   character encoding to a UCS-based character encoding.  In these
1317	   cases, NFC or a normalizing transcoder using NFC MUST be used for
1318	   interoperability.  To avoid false negatives and problems with
1319	   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
1320	   avoid even more problems; for example, by choosing half-width Latin
1321	   letters instead of full-width ones, and full-width instead of half-
1322	   width Katakana.

1324	   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
1325	   Notation) is in NFC.  On the other hand,
1326	   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

1328	   The former uses precombined e-acute characters, and the latter uses
1329	   "e" characters followed by combining acute accents.  Both usages are
1330	   defined as canonically equivalent in [UNIV4].
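
   The difference between the two forms can be checked with Python's
   standard unicodedata module, as in the sketch below.  Note that this
   illustrates what an IRI creator might do; as stated above,
   comparison itself is not expected to apply character normalization.

```python
import unicodedata

# Precombined e-acute (U+00E9) versus "e" plus combining acute (U+0301).
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# A string is in NFC if normalizing it to NFC leaves it unchanged.
precomposed_is_nfc = unicodedata.normalize("NFC", precomposed) == precomposed
decomposed_is_nfc = unicodedata.normalize("NFC", decomposed) == decomposed

# Canonically equivalent, yet different under Simple String Comparison.
equal_as_strings = precomposed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```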

1332	   Note:  Because it is unknown how a particular sequence of characters
1333	      is being treated with respect to character normalization, it would
1334	      be inappropriate to allow third parties to normalize an IRI
1335	      arbitrarily.  This does not contradict the recommendation that
1336	      when a resource is created, its IRI should be as character
1337	      normalized as possible (i.e., NFC or even NFKC).  This is similar
1338	      to the uppercase/lowercase problems.  Some parts of a URI are case
1339	      insensitive (for example, the domain name).  For others, it is
1340	      unclear whether they are case sensitive, case insensitive, or
1341	      something in between (e.g., case sensitive, but with a multiple
1342	      choice selection if the wrong case is used, instead of a direct
1343	      negative result).  The best recipe is that the creator use a
1344	      reasonable capitalization and, when transferring the URI,
1345	      capitalization never be changed.

1347	   Various IRI schemes may allow the usage of Internationalized Domain
1348	   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
1349	   Character Normalization also applies to IDNs, as discussed in
1350	   Section 5.3.3.

1352	5.3.2.3.  Percent-Encoding Normalization

1354	   The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a
1355	   frequent source of variance among otherwise identical IRIs.  In
1356	   addition to the case normalization issue noted above, some IRI
1357	   producers percent-encode octets that do not require percent-encoding,
1358	   resulting in IRIs that are equivalent to their nonencoded
1359	   counterparts.  These IRIs should be normalized by decoding any
1360	   percent-encoded octet sequence that corresponds to an unreserved
1361	   character, as described in section 2.3 of [RFC3986].

1363	   For actual resolution, differences in percent-encoding (except for
1364	   the percent-encoding of reserved characters) MUST always result in
1365	   the same resource.  For example, "http://example.org/~user",
1366	   "http://example.org/%7euser", and "http://example.org/%7Euser" must
1367	   resolve to the same resource.

1369	   If this kind of equivalence is to be tested, the percent-encoding of
1370	   both IRIs to be compared has to be aligned; for example, by
1371	   converting both IRIs to URIs (see Section 3.1), eliminating escape
1372	   differences in the resulting URIs, and making sure that the case of
1373	   the hexadecimal characters in the percent-encoding is always the same
1374	   (preferably upper case).  If the IRI is to be passed to another
1375	   application or used further in some other way, its original form MUST
1376	   be preserved.  The conversion described here should be performed only
1377	   for local comparison.
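
   One possible way to align percent-encoding for local comparison is
   sketched below in Python.  The function name is hypothetical.  The
   sketch decodes only triplets that correspond to unreserved
   characters (Section 2.3 of [RFC3986]) and uppercases the hexadecimal
   digits of all other triplets; it works on a copy and leaves the
   original IRI untouched.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_percent_encoding(iri: str) -> str:
    """Return a copy of `iri` suitable for local comparison only."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                  # "%7E" and "%7e" become "~"
        return match.group(0).upper()    # keep encoded, uppercase hex

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, iri)
```

   Under this sketch the three "~user" examples above yield the same
   string, while percent-encoded reserved characters such as "%2f"
   stay encoded.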

1379	5.3.2.4.  Path Segment Normalization

1381	   The complete path segments "." and ".." are intended only for use
1382	   within relative references (Section 4.1 of [RFC3986]) and are removed
1383	   as part of the reference resolution process (Section 5.2 of
1384	   [RFC3986]).  However, some implementations may incorrectly assume
1385	   that reference resolution is not necessary when the reference is
1386	   already an IRI, and thus fail to remove dot-segments when they occur
1387	   in non-relative paths.  IRI normalizers should remove dot-segments by
1388	   applying the remove_dot_segments algorithm to the path, as described
1389	   in Section 5.2.4 of [RFC3986].
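
   For convenience, the remove_dot_segments algorithm can be sketched
   in Python as shown below.  This is an illustrative transcription of
   the steps in Section 5.2.4 of [RFC3986], not normative text; the
   input is assumed to be the path component only.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (RFC 3986, 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./" is replaced by "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../" is replaced by "/"
            if output:
                output.pop()             # drop the last output segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any)
            # from the input to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

   Applied to the path of the second example IRI in Section 5.3.2,
   "/a/./b/../b/%63/%7bfoo%7d/ros%C3%A9", this removes the "." and
   ".." segments and leaves the remaining segments in order.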

1391	5.3.3.  Scheme-Based Normalization

1393	   The syntax and semantics of IRIs vary from scheme to scheme, as
1394	   described by the defining specification for each scheme.
1395	   Implementations may use scheme-specific rules, at further processing
1396	   cost, to reduce the probability of false negatives.  For example,
1397	   because the "http" scheme makes use of an authority component, has a
1398	   default port of "80", and defines an empty path to be equivalent to
1399	   "/", the following four IRIs are equivalent:

1401	      http://example.com
1402	      http://example.com/
1403	      http://example.com:/
1404	      http://example.com:80/

1406	   In general, an IRI that uses the generic syntax for authority with an
1407	   empty path should be normalized to a path of "/".  Likewise, an
1408	   explicit ":port", for which the port is empty or the default for the
1409	   scheme, is equivalent to one where the port and its ":" delimiter are
1410	   elided and thus should be removed by scheme-based normalization.  For
1411	   example, the second IRI above is the normal form for the "http"
1412	   scheme.
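
   For the "http" scheme, the two rules above might be sketched in
   Python as follows.  The function name normalize_http is
   hypothetical, and the sketch assumes that case normalization of the
   scheme has already been applied; IRIs in other schemes are returned
   unchanged.

```python
import re

# "http://" authority, then path, then the rest (query and fragment).
_HTTP = re.compile(r"^http://([^/?#]*)([^?#]*)(.*)$", re.DOTALL)


def normalize_http(iri: str) -> str:
    """Drop an empty or default port and normalize an empty path."""
    match = _HTTP.match(iri)
    if match is None:
        return iri                       # not an "http" IRI
    authority, path, rest = match.groups()
    # Remove a trailing ":" or ":80"; other ports are significant.
    authority = re.sub(r":(80)?$", "", authority)
    # An empty path is equivalent to "/"; delimiters in `rest`, such
    # as a bare "?", are left in place.
    return "http://" + authority + (path or "/") + rest
```

   All four IRIs listed above map to "http://example.com/" under this
   sketch, whereas "http://example.com/?" keeps its "?" delimiter.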

1414	   Another case where normalization varies by scheme is in the handling
1415	   of an empty authority component or empty host subcomponent.  For many
1416	   scheme specifications, an empty authority or host is considered an
1417	   error; for others, it is considered equivalent to "localhost" or the
1418	   end-user's host.  When a scheme defines a default for authority and
1419	   an IRI reference to that default is desired, the reference should be
1420	   normalized to an empty authority for the sake of uniformity, brevity,
1421	   and internationalization.  If, however, either the userinfo or port
1422	   subcomponents are non-empty, then the host should be given explicitly
1423	   even if it matches the default.

1425	   Normalization should not remove delimiters when their associated
1426	   component is empty unless it is licensed to do so by the scheme
1427	   specification.  For example, the IRI "http://example.com/?" cannot be
1428	   assumed to be equivalent to any of the examples above.  Likewise, the
1429	   presence or absence of delimiters within a userinfo subcomponent is
1430	   usually significant to its interpretation.  The fragment component is
1431	   not subject to any scheme-based normalization; thus, two IRIs that
1432	   differ only by the suffix "#" are considered different regardless of
1433	   the scheme.

1435	   Some IRI schemes allow the usage of Internationalized Domain Names
1436	   (IDN) [RFC5890] either in their ireg-name part or elsewhere.  When in
1437	   use in IRIs, those names SHOULD conform to the definition of U-Label
1438	   in [RFC5890].  An IRI containing an invalid IDN cannot successfully
1439	   be resolved.  For legibility purposes, they SHOULD NOT be converted
1440	   into ASCII Compatible Encoding (ACE).

1442	   Scheme-based normalization may also consider IDN components and their
1443	   conversions to punycode as equivalent.  As an example,
1444	   "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
1445	   "http://xn--rsum-bpad.example.org".
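
   The equivalence in this example can be reproduced with Python's
   built-in "idna" codec, as sketched below.  Be aware that this codec
   is understood to implement the older IDNA2003 rules of [RFC3490]
   rather than [RFC5890]; it is used here only as an assumption-laden
   illustration, and the ACE form is computed for local comparison, not
   for display.

```python
# Host written with actual e-acute characters (escaped to stay ASCII).
u_label_host = "r\u00e9sum\u00e9.example.org"

# Convert each label to its ASCII Compatible Encoding (ACE) form.
ace_host = u_label_host.encode("idna").decode("ascii")

# A scheme-based comparison may treat both spellings as equivalent.
hosts_equivalent = ace_host == "xn--rsum-bpad.example.org"
```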

1447	   Other scheme-specific normalizations are possible.

1449	5.3.4.  Protocol-Based Normalization

1451	   Substantial effort to reduce the incidence of false negatives is
1452	   often cost-effective for web spiders.  Consequently, they implement
1453	   even more aggressive techniques in IRI comparison.  For example, if
1454	   they observe that an IRI such as

1456	      http://example.com/data

1458	   redirects to an IRI differing only in the trailing slash

1460	      http://example.com/data/

1462	   they will likely regard the two as equivalent in the future.  This
1463	   kind of technique is only appropriate when equivalence is clearly
1464	   indicated by both the result of accessing the resources and the
1465	   common conventions of their scheme's dereference algorithm (in this
1466	   case, use of redirection by HTTP origin servers to avoid problems
1467	   with relative references).

1469	6.  Use of IRIs
1470	6.1.  Limitations on UCS Characters Allowed in IRIs

1472	   This section discusses limitations on characters and character
1473	   sequences usable for IRIs beyond those given in Section 2.2 and
1474	   Section 4.1.  The considerations in this section are relevant when
1475	   IRIs are created and when URIs are converted to IRIs.

1477	   a. The repertoire of characters allowed in each IRI component is
1478	      limited by the definition of that component.  For example, the
1479	      definition of the scheme component does not allow characters
1480	      beyond US-ASCII.

1482	      (Note: In accordance with URI practice, generic IRI software
1483	      cannot and should not check for such limitations.)

1485	   b. The UCS contains many areas of characters for which there are
1486	      strong visual look-alikes.  Because of the likelihood of
1487	      transcription errors, these also should be avoided.  This includes
1488	      the full-width equivalents of Latin characters, half-width
1489	      Katakana characters for Japanese, and many others.  It also
1490	      includes many look-alikes of "space", "delims", and "unwise",
1491	      characters excluded in [RFC3491].

1493	   Additional information is available from [UNIXML].  [UNIXML] is
1494	   written in the context of running text rather than in that of
1495	   identifiers.  Nevertheless, it discusses many of the categories of
1496	   characters not appropriate for IRIs.

1498	6.2.  Software Interfaces and Protocols

1500	   Although an IRI is defined as a sequence of characters, software
1501	   interfaces for URIs typically function on sequences of octets or
1502	   other kinds of code units.  Thus, software interfaces and protocols
1503	   MUST define which character encoding is used.

1505	   Intermediate software interfaces between IRI-capable components and
1506	   URI-only components MUST map the IRIs per Section 3.6, when
1507	   transferring from IRI-capable to URI-only components.  This mapping
1508	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
1509	   between components that are known to be able to handle IRIs.

1511	6.3.  Format of URIs and IRIs in Documents and Protocols

1513	   Document formats that transport URIs may have to be upgraded to allow
1514	   the transport of IRIs.  In cases where the document as a whole has a
1515	   native character encoding, IRIs MUST also be encoded in this
1516	   character encoding and converted accordingly by a parser or
1517	   interpreter.  IRI characters not expressible in the native character
1518	   encoding SHOULD be escaped by using the escaping conventions of the
1519	   document format if such conventions are available.  Alternatively,
1520	   they MAY be percent-encoded according to Section 3.6.  For example,
1521	   in HTML or XML, numeric character references SHOULD be used.  If a
1522	   document as a whole has a native character encoding and that
1523	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
1524	   the document in the UTF-8 character encoding.

1526	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1527	   although they use different terminology.  HTML 4.0 [HTML4] defines
1528	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
1529	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
1530	   based upon them allow IRIs.  Also, it is expected that all relevant
1531	   new W3C formats and protocols will be required to handle IRIs
1532	   [CharMod].

1534	6.4.  Use of UTF-8 for Encoding Original Characters

1536	   This section discusses details and gives examples for point c) in
1537	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
1538	   IRI in question has to encode original characters into octets by
1539	   using UTF-8.  This can be specified for all URIs of a URI scheme or
1540	   can apply to individual URIs for schemes that do not specify how to
1541	   encode original characters.  It can apply to the whole URI, or only
1542	   to some part.  For background information on encoding characters into
1543	   URIs, see also Section 2.5 of [RFC3986].

1545	   For new URI schemes, using UTF-8 is recommended in [RFC4395].
1546	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
1547	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
1548	   because the HTTP URI scheme does not specify how to encode original
1549	   characters, only some HTTP URLs can have corresponding but different
1550	   IRIs.

1552	   For example, for a document with a URI of
1553	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
1554	   construct a corresponding IRI (in XML notation, see Section 1.4):
1555	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
1556	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
1557	   encoded representation of that character).  On the other hand, for a
1558	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
1559	   percent-encoding octets cannot be converted to actual characters in
1560	   an IRI, as the percent-encoding is not based on UTF-8.

1562	   For most URI schemes, there is no need to upgrade their scheme
1563	   definition in order for them to work with IRIs.  The main case where
1564	   upgrading makes sense is when a scheme definition, or a particular
1565	   component of a scheme, is strictly limited to the use of US-ASCII
1566	   characters with no provision to include non-ASCII characters/octets
1567	   via percent-encoding, or if a scheme definition currently uses highly
1568	   scheme-specific provisions for the encoding of non-ASCII characters.
1569	   An example of this is the mailto: scheme [RFC2368].

1571	   This specification updates the IANA registry of URI schemes to note
1572	   their applicability to IRIs, see Section 9.  All IRIs use URI
1573	   schemes, and all URIs with URI schemes can be used as IRIs, even
1574	   though in some cases only by using URIs directly as IRIs, without any
1575	   conversion.

1577	   Scheme definitions can impose restrictions on the syntax of scheme-
1578	   specific URIs; i.e., URIs that are admissible under the generic URI
1579	   syntax [RFC3986] may not be admissible due to narrower syntactic
1580	   constraints imposed by a URI scheme specification.  URI scheme
1581	   definitions cannot broaden the syntactic restrictions of the generic
1582	   URI syntax; otherwise, it would be possible to generate URIs that
1583	   satisfied the scheme-specific syntactic constraints without
1584	   satisfying the syntactic constraints of the generic URI syntax.
1585	   However, additional syntactic constraints imposed by URI scheme
1586	   specifications are applicable to IRI, as the corresponding URI
1587	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
1588	   under the syntactic restrictions of generic URI syntax and any
1589	   narrower restrictions imposed by the corresponding URI scheme
1590	   specification.

1592	   The requirement for the use of UTF-8 generally applies to all parts
1593	   of a URI.  However, it is possible that the capability of IRIs to
1594	   represent a wide range of characters directly is used just in some
1595	   parts of the IRI (or IRI reference).  The other parts of the IRI may
1596	   only contain US-ASCII characters, or they may not be based on UTF-8.
1597	   They may be based on another character encoding, or they may directly
1598	   encode raw binary data (see also [RFC2397]).

1600	   For example, it is possible to have a URI reference of
1601	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
1602	   document name is encoded in iso-8859-1 based on server settings, but
1603	   where the fragment identifier is encoded in UTF-8 according to
1604	   [XPointer].  The IRI corresponding to the above URI would be (in XML
1605	   notation)
1606	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

1608	   Similar considerations apply to query parts.  The functionality of
1609	   IRIs (namely, to be able to include non-ASCII characters) can only be
1610	   used if the query part is encoded in UTF-8.

1612	6.5.  Relative IRI References

1614	   Processing of relative IRI references against a base is handled
1615	   straightforwardly; the algorithms of [RFC3986] can be applied
1616	   directly, treating the characters additionally allowed in IRI
1617	   references in the same way that unreserved characters are in URI
1618	   references.

1620	7.  Liberal handling of otherwise invalid IRIs

1622	   (EDITOR NOTE: This Section may move to an appendix.)  Some technical
1623	   specifications and widely-deployed software have allowed additional
1624	   variations and extensions of IRIs to be used in syntactic components.
1625	   This section describes two widely-used preprocessing agreements.
1626	   Other technical specifications may wish to reference a syntactic
1627	   component which is "a valid IRI or a string that will map to a valid
1628	   IRI after this preprocessing algorithm".  These two variants are
1629	   known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address
1630	   [HTML5].

1632	   Future technical specifications SHOULD NOT allow conforming producers
1633	   to produce, or conforming content to contain, such forms, as they are
1634	   not interoperable with other IRI consuming software.

1636	7.1.  LEIRI processing

1638	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1639	   Legacy Extended IRIs is the same as that for IRIs, except that the
1640	   ucschar production is replaced by the leiri-ucschar production:

1642	     leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1643	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1644	                      / %xE000-FFFD / %x10000-10FFFF

1646	   Among other extensions, processors based on this specification also
1647	   did not enforce the restriction on bidirectional formatting
1648	   characters in Section 4.1, and the iprivate production becomes
1649	   redundant.

1651	   To convert a string allowed as a LEIRI to an IRI, each character
1652	   allowed in leiri-ucschar but not in ucschar must be percent-encoded
1653	   using Section 3.3.

1655	7.2.  Web Address processing

1657	   Many popular web browsers have taken the approach of being quite
1658	   liberal in what is accepted as a "URL" or its relative forms.  This
1659	   section describes their behavior in terms of a preprocessor which
1660	   maps strings into the IRI space for subsequent parsing and
1661	   interpretation as an IRI.

1663	   In some situations, it might be appropriate to describe the syntax
1664	   that a liberal consumer implementation might accept as a "Web
1665	   Address" or "Hypertext Reference" or "HREF".  However, technical
1666	   specifications SHOULD restrict the syntactic form allowed by
1667	   compliant producers to the IRI or IRI reference syntax defined in
1668	   this document even if they want to mandate this processing.

1670	   Summary:

1672	   o  Leading and trailing whitespace is removed.

1674	   o  Some additional characters are removed.

1676	   o  Some additional characters are allowed and escaped (as with
1677	      LEIRI).

1679	   o  If interpreting an IRI as a URI, the pct-encoding of the query
1680	      component of the parsed URI component depends on operational
1681	      context.

1683	   Each string provided may have an associated charset (called the HREF-
1684	   charset here); this defaults to UTF-8.  For web browsers interpreting
1685	   HTML, the HREF-charset of a string is determined as follows:

1687	   If the string came from a script (e.g. as an argument to a method)
1688	      The HREF-charset is the script's charset.

1690	   If the string came from a DOM node (e.g. from an element)  The node
1691	      has a Document, and the HREF-charset is the Document's character
1692	      encoding.

1694	   If the string had an HREF-charset defined when the string was created
1695	   or defined  The HREF-charset is as defined.

1697	   If the resulting HREF-charset is a Unicode-based character encoding
1698	   (e.g., UTF-16), then use UTF-8 instead.

1700	   The syntax for Web Addresses is obtained by replacing the 'ucschar',
1701	   pct-form, and path-sep rules with the href-ucschar, href-pct-form,
1702	   and href-path-sep rules below.  In addition, some characters are
1703	   stripped.

1705	     href-ucschar  = " " / "<" / ">" / DQUOTE / "{" / "}" / "|"
1706	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1707	                      / %xE000-FFFD / %x10000-10FFFF
1708	     href-pct-form = pct-encoded / "%"
1709	     href-path-sep = "/" / "\"
1710	     href-strip    = <to be done>

1712	   (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT
1713	   SENTENCE) browsers did not enforce the restriction on bidirectional
1714	   formatting characters in Section 4.1, and the iprivate production
1715	   becomes redundant.

1717	   'Web Address processing' requires the following additional
1718	   preprocessing steps:

1720	   1.  Leading and trailing instances of space (U+0020), CR (U+000D), LF
1721	       (U+000A), and TAB (U+0009) characters are removed.

1723	   2.  Strip all characters in href-strip.

1725	   3.  Percent-encode all characters in href-ucschar not in ucschar.

1727	   4.  Replace occurrences of "%" not followed by two hexadecimal digits
1728	       by "%25".

1730	   5.  Convert backslashes ('\') matching href-path-sep to forward
1731	       slashes ('/').
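
   A partial Python sketch of these steps is given below.  It covers
   only steps 1 and 4.  Step 2 is omitted because href-strip is not yet
   defined, and steps 3 and 5 are omitted because they depend on the
   ucschar production and on where path separators occur.  The function
   name is hypothetical.

```python
import re


def preprocess_web_address(text: str) -> str:
    """Apply steps 1 and 4 of Web Address preprocessing (sketch only)."""
    # Step 1: remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Step 4: a "%" not followed by two hexadecimal digits becomes "%25".
    text = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)
    return text
```

   Valid percent-encoding triplets pass through unchanged; only stray
   "%" characters are rewritten.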

1733	7.3.  Characters not allowed in IRIs

1735	   This section provides a list of the groups of characters and code
1736	   points that are allowed by LEIRI or HREF but are not allowed in IRIs
1737	   or are allowed in IRIs only in the query part.  For each group of
1738	   characters, advice on the usage of these characters is also given,
1739	   concentrating on the reasons for why they are excluded from IRI use.

1741	      Space (U+0020): Some formats and applications use space as a
1742	      delimiter, e.g. for items in a list.  Appendix C of [RFC3986] also
1743	      mentions that white space may have to be added when displaying or
1744	      printing long URIs; the same applies to long IRIs.  This means
1745	      that spaces can disappear, or can cause what is intended as a
1746	      single IRI or IRI reference to be treated as two or more separate
1747	      IRIs.

1749	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1750	      C of [RFC3986] suggests the use of double-quotes
1751	      ("http://example.com/") and angle brackets (<http://example.com/>)
1752	      as delimiters for URIs in plain text.  These conventions are often
1753	      used, and also apply to IRIs.  Using these characters in strings
1754	      intended to be IRIs would result in the IRIs being cut off at the
1755	      wrong place.

1757	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1758	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters
1759	      were originally excluded from URIs because the respective
1760	      codepoints are assigned to different graphic characters in some
1761	      7-bit or 8-bit encoding.  Despite the move to Unicode, some of
1762	      these characters are still occasionally displayed differently on
1763	      some systems, e.g.  U+005C may appear as a Japanese Yen symbol on
1764	      some systems.  Also, the fact that these characters are not used
1765	      in URIs or IRIs has encouraged their use outside URIs or IRIs in
1766	      contexts that may include URIs or IRIs.  If a string with such a
1767	      character were used as an IRI in such a context, it would likely
1768	      be interpreted piecemeal.

1770	      The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
1771	      #x9F): There is generally no way to transmit these characters
1772	      reliably as text outside of a charset encoding.  Even when in
1773	      encoded form, many software components silently filter out some of
1774	      these characters, or may stop processing altogether when
1775	      encountering some of them.  These characters may affect text
1776	      display in subtle, unnoticeable ways or in drastic, global, and
1777	      irreversible ways depending on the hardware and software involved.
1778	      The use of some of these characters would allow malicious users to
1779	      manipulate the display of an IRI and its context in many
1780	      situations.

1782	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1783	      characters affect the display ordering of characters.  If IRIs
1784	      were allowed to contain these characters and the resulting visual
1785	      display transcribed, they could not be converted back to
1786	      electronic form (logical order) unambiguously.  These characters,
1787	      if allowed in IRIs, might allow malicious users to manipulate the
1788	      display of an IRI and its context.

1790	      Specials (U+FFF0-FFFD): These code points provide functionality
1791	      beyond that useful in an IRI, for example byte order
1792	      identification, annotation, and replacements for unknown
1793	      characters and objects.  Their use and interpretation in an IRI
1794	      would serve no purpose and might lead to confusing display
1795	      variations.

1797	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1798	      10FFFD): Display and interpretation of these code points is by
1799	      definition undefined without private agreement.  Therefore, these
1800	      code points are not suited for use on the Internet.  They are not
1801	      interoperable and may have unpredictable effects.

1803	      Tags (U+E0000-E0FFF): These characters provide a way to language
1804	      tag in Unicode plain text.  They are not appropriate for IRIs
1805	      because language information in identifiers cannot reliably be
1806	      input, transmitted (e.g. on a visual medium such as paper), or
1807	      recognized.

1809	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1810	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1811	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1812	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1813	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1814	      non-characters.  Applications may use some of them internally, but
1815	      are not prepared to interchange them.

1817	   LEIRI preprocessing disallowed some code points and code units:

1819	      Surrogate code units (D800-DFFF): These do not represent Unicode
1820	      codepoints.

1822	8.  URI/IRI Processing Guidelines (Informative)

1824	   This informative section provides guidelines for supporting IRIs in
1825	   the same software components and operations that currently process
1826	   URIs: Software interfaces that handle URIs, software that allows
1827	   users to enter URIs, software that creates or generates URIs,
1828	   software that displays URIs, formats and protocols that transport
1829	   URIs, and software that interprets URIs.  These may all require
1830	   modification before functioning properly with IRIs.  The
1831	   considerations in this section also apply to URI references and IRI
1832	   references.

1834	8.1.  URI/IRI Software Interfaces

1836	   Software interfaces that handle URIs, such as URI-handling APIs and
1837	   protocols transferring URIs, need interfaces and protocol elements
1838	   that are designed to carry IRIs.

1840	   In case the current handling in an API or protocol is based on US-
1841	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1842	   is compatible with US-ASCII, is in accordance with the
1843	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1844	   any case, the API or protocol definition must clearly define the
1845	   character encoding to be used.

1847	   The transfer from URI-only to IRI-capable components requires no
1848	   mapping, although the conversion described in Section 3.7 above may
1849	   be performed.  It is preferable not to perform this inverse
1850	   conversion unless it is certain this can be done correctly.

1852	8.2.  URI/IRI Entry

1854	   Some components allow users to enter URIs into the system by typing
1855	   or dictation, for example.  This software must be updated to allow
1856	   for IRI entry.

1858	   A person viewing a visual representation of an IRI (as a sequence of
1859	   glyphs, in some order, in some visual display) or hearing an IRI will
1860	   use an entry method for characters in the user's language to input
1861	   the IRI.  Depending on the script and the input method used, this may
1862	   be a more or less complicated process.

1864	   The process of IRI entry must ensure, as much as possible, that the
1865	   restrictions defined in Section 2.2 are met.  This may be done by
1866	   choosing appropriate input methods or variants/settings thereof, by
1867	   appropriately converting the characters being input, by eliminating
1868	   characters that cannot be converted, and/or by issuing a warning or
1869	   error message to the user.

1871	   As an example of variant settings, input method editors for East
1872	   Asian Languages usually allow the input of Latin letters and related
1873	   characters in full-width or half-width versions.  For IRI input, the
1874	   input method editor should be set so that it produces half-width
1875	   Latin letters and punctuation and full-width Katakana.

1877	   An input field primarily or solely used for the input of URIs/IRIs
1878	   might allow the user to view an IRI as it is mapped to a URI.  Places
1879	   where the input of IRIs is frequent may provide the possibility for
1880	   viewing an IRI as mapped to a URI.  This will help users when some of
1881	   the software they use does not yet accept IRIs.

1883	   An IRI input component interfacing to components that handle URIs,
1884	   but not IRIs, must map the IRI to a URI before passing it to these
1885	   components.

1887	   For the input of IRIs with right-to-left characters, please see
1888	   Section 4.3.

1890	8.3.  URI/IRI Transfer between Applications

1892	   Many applications (for example, mail user agents) try to detect URIs
1893	   appearing in plain text.  For this, they use some heuristics based on
1894	   URI syntax.  They then allow the user to click on such URIs and
1895	   retrieve the corresponding resource in an appropriate (usually
1896	   scheme-dependent) application.

1898	   Such applications would need to be upgraded, in order to use the IRI
1899	   syntax as a base for heuristics.  In particular, a non-ASCII
1900	   character should not be taken as the indication of the end of an IRI.
1901	   Such applications also would need to make sure that they correctly
1902	   convert the detected IRI from the character encoding of the document
1903	   or application where the IRI appears, to the character encoding used
1904	   by the system-wide IRI invocation mechanism, or to a URI (according
1905	   to Section 3.6) if the system-wide invocation mechanism only accepts
1906	   URIs.
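
	   A minimal sketch of such a heuristic in Python follows.  It is
	   an assumption-laden illustration, not a complete detector: it
	   only looks for the "http" and "https" schemes, and it stops a
	   candidate at white space or at the plain-text delimiters "<",
	   ">", and '"'.  The point it shows is that non-ASCII characters
	   do not end the match.  The function name is hypothetical.

```python
import re

# Stop at whitespace and at the delimiters "<", ">", and '"'.
# Every other character, including non-ASCII, stays in the candidate.
_IRI_CANDIDATE = re.compile(r'https?://[^\s<>"]+')


def find_iri_candidates(text: str) -> list:
    """Return substrings of text that look like http(s) IRIs."""
    return _IRI_CANDIDATE.findall(text)
```

	   A real detector would also need to handle trailing punctuation
	   and other schemes, which this sketch does not attempt.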

1908	   The clipboard is another frequently used way to transfer URIs and
1909	   IRIs from one application to another.  On most platforms, the
1910	   clipboard is able to store and transfer text in many languages and
1911	   scripts.  Correctly used, the clipboard transfers characters, not
1912	   octets, which will do the right thing with IRIs.

1914	8.4.  URI/IRI Generation

1916	   Systems that offer resources through the Internet, where those
1917	   resources have logical names, sometimes automatically generate URIs
1918	   for the resources they offer.  For example, some HTTP servers can
1919	   generate a directory listing for a file directory and then respond to
1920	   the generated URIs with the files.

1922	   Many legacy character encodings are in use in various file systems.
1923	   Many currently deployed systems do not transform the local character
1924	   representation of the underlying system before generating URIs.

1926	   For maximum interoperability, systems that generate resource
1927	   identifiers should make the appropriate transformations.  For
1928	   example, if a file system contains a file named
1929	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1930	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1931	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1932	   kept in a character encoding other than UTF-8.
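
	   The transformation can be sketched in Python as follows,
	   assuming the server knows the character encoding of the local
	   file system (passed here as a parameter).  The function name is
	   hypothetical; urllib.parse.quote percent-encodes using UTF-8 by
	   default.

```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Decode a local file name, then percent-encode it as UTF-8."""
    name = raw_name.decode(local_encoding)
    # safe="" so that "/" inside a name would also be escaped.
    return quote(name, safe="")
```

	   For instance, the ISO-8859-1 bytes b"r\xe9sum\xe9.html" passed
	   with local_encoding "iso-8859-1" yield the path segment quoted
	   in the paragraph above.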

1934	   This recommendation particularly applies to HTTP servers.  For FTP
1935	   servers, similar considerations apply; see [RFC2640].

1937	8.5.  URI/IRI Selection

1939	   In some cases, resource owners and publishers have control over the
1940	   IRIs used to identify their resources.  This control is mostly
1941	   executed by controlling the resource names, such as file names,
1942	   directly.

1944	   In these cases, it is recommended to avoid choosing IRIs that are
1945	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1946	   is easily confused with the digit one ("1"), and the upper-case oh
1947	   ("O") is easily confused with the digit zero ("0").  Publishers
1948	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1950	   Outside the US-ASCII repertoire, there are many more opportunities
1951	   for confusion; a complete set of guidelines is too lengthy to include
1952	   here.  As long as names are limited to characters from a single
1953	   script, native writers of a given script or language will know best
1954	   when ambiguities can appear, and how they can be avoided.  What may
1955	   look ambiguous to a stranger may be completely obvious to the average
1956	   native user.  On the other hand, in some cases, the UCS contains
1957	   variants for compatibility reasons; for example, for typographic
1958	   purposes.  These should be avoided wherever possible.  Although there
1959	   may be exceptions, newly created resource names should generally be
1960	   in NFKC [UTR15] (which means that they are also in NFC).

1962	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1963	   compatibility reasons.  Wherever possible, IRIs should use the two
1964	   letters "f" and "i" rather than the "fi" ligature.  An example where
1965	   the latter may be used is in the query part of an IRI for an explicit
1966	   search for a word written containing the "fi" ligature.
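
	   A publisher's tooling could check candidate resource names for
	   NFKC before they are created.  The Python sketch below uses the
	   standard unicodedata module; the helper name is hypothetical.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """True if NFKC normalization leaves the name unchanged."""
    return unicodedata.normalize("NFKC", name) == name
```

	   A name containing U+FB01 fails this check, because NFKC maps
	   the ligature to the two letters "f" and "i".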

1968	   In certain cases, there is a chance that characters from different
1969	   scripts look the same.  The best known example is the similarity of
1970	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1971	   such cases, IRIs should only be created where all the characters in a
1972	   single component are used together in a given language.  This usually
1973	   means that all of these characters will be from the same script, but
1974	   there are languages that mix characters from different scripts (such
1975	   as Japanese).  This is similar to the heuristics used to distinguish
1976	   between letters and numbers in the examples above.  Also, for Latin,
1977	   Greek, and Cyrillic, using lowercase letters results in fewer
1978	   ambiguities than using uppercase letters would.
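
	   One crude way to flag components that mix look-alike scripts is
	   sketched below in Python.  It is only a heuristic and rests on
	   an assumption: it takes the first word of each letter's Unicode
	   character name (for example "LATIN" or "CYRILLIC") as a hint of
	   the script, since the standard library does not expose the
	   Unicode Script property.  The function name is hypothetical.

```python
import unicodedata


def script_hints(component: str) -> set:
    """Collect the first word of the Unicode name of each letter."""
    hints = set()
    for ch in component:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints
```

	   More than one hint is a reason for a closer look, not proof of a
	   problem; as noted above, some languages legitimately mix
	   scripts.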

1980	8.6.  Display of URIs/IRIs

1982	   In situations where the rendering software is not expected to display
1983	   non-ASCII parts of the IRI correctly using the available layout and
1984	   font resources, these parts should be percent-encoded before being
1985	   displayed.

1987	   For display of Bidi IRIs, please see Section 4.1.

1989	8.7.  Interpretation of URIs and IRIs

1991	   Software that interprets IRIs as the names of local resources should
1992	   accept IRIs in multiple forms and convert and match them with the
1993	   appropriate local resource names.

1995	   First, multiple representations include both IRIs in the native
1996	   character encoding of the protocol and also their URI counterparts.

1998	   Second, it may include URIs constructed based on character encodings
1999	   other than UTF-8.  These URIs may be produced by user agents that do
2000	   not conform to this specification and that use legacy character
2001	   encodings to convert non-ASCII characters to URIs.  Whether this is
2002	   necessary, and what character encodings to cover, depends on a number
2003	   of factors, such as the legacy character encodings used locally and
2004	   the distribution of various versions of user agents.  For example,
2005	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
2006	   addition to UTF-8.
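
	   A server-side sketch of this fallback in Python might look as
	   follows.  The function name and the default list of encodings
	   are assumptions chosen to match the Japanese example; the order
	   is a policy decision, since some byte sequences are valid in
	   more than one encoding.

```python
from urllib.parse import unquote_to_bytes


def decode_received_path(uri_path, encodings=("utf-8", "shift_jis", "euc_jp")):
    """Undo percent-encoding, then try each accepted encoding in order.

    Returns a (text, encoding) pair for the first encoding that
    decodes the bytes without error.
    """
    raw = unquote_to_bytes(uri_path)
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("path is not valid in any accepted encoding")
```

	   Trying UTF-8 first keeps conforming URIs on the fast path, and
	   legacy encodings are consulted only when strict UTF-8 decoding
	   fails.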

2008	   Third, it may include additional mappings to be more user-friendly
2009	   and robust against transmission errors.  These would be similar to
2010	   how some servers currently treat URIs as case insensitive or perform
2011	   additional matching to account for spelling errors.  For characters
2012	   beyond the US-ASCII repertoire, this may, for example, include
2013	   ignoring the accents on received IRIs or resource names.  Please note
2014	   that such mappings, including case mappings, are language dependent.

2016	   It can be difficult to identify a resource unambiguously if too many
2017	   mappings are taken into consideration.  However, percent-encoded and
2018	   not percent-encoded parts of IRIs can always be clearly
2019	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
2020	   the potential for collisions lower than it may seem at first.

2022	8.8.  Upgrading Strategy

2024	   Where this recommendation places further constraints on software for
2025	   which many instances are already deployed, it is important to
2026	   introduce upgrades carefully and to be aware of the various
2027	   interdependencies.

2029	   If IRIs cannot be interpreted correctly, they should not be created,
2030	   generated, or transported.  This suggests that upgrading URI
2031	   interpreting software to accept IRIs should have highest priority.

2033	   On the other hand, a single IRI is interpreted only by a single or
2034	   very few interpreters that are known in advance, although it may be
2035	   entered and transported very widely.

2037	   Therefore, IRIs benefit most from a broad upgrade of software to be
2038	   able to enter and transport IRIs.  However, before an individual IRI
2039	   is published, care should be taken to upgrade the corresponding
2040	   interpreting software in order to cover the forms expected to be
2041	   received by various versions of entry and transport software.

2043	   The upgrade of generating software to generate IRIs instead of using
2044	   a local character encoding should happen only after the service is
2045	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
2046	   when the service accepts IRIs and the intervening infrastructure and
2047	   protocol are known to transport them safely.

2049	   Software converting from URIs to IRIs for display should be upgraded
2050	   only after upgraded entry software has been widely deployed to the
2051	   population that will see the displayed result.

2053	   Where there is a free choice of character encodings, it is often
2054	   possible to reduce the effort and dependencies for upgrading to IRIs
2055	   by using UTF-8 rather than another encoding.  For example, when a new
2056	   file-based Web server is set up, using UTF-8 as the character
2057	   encoding for file names will make the transition to IRIs easier.
2058	   Likewise, when a new Web form is set up using UTF-8 as the character
2059	   encoding of the form page, the returned query URIs will use UTF-8 as
2060	   the character encoding (unless the user, for whatever reason, changes
2061	   the character encoding) and will therefore be compatible with IRIs.

2063	   These recommendations, when taken together, will allow for the
2064	   extension from URIs to IRIs in order to handle characters other than
2065	   US-ASCII while minimizing interoperability problems.  For
2066	   considerations regarding the upgrade of URI scheme definitions, see
2067	   Section 6.4.

2069	9.  IANA Considerations

2071	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
2072	   this document when it issues as an RFC.

2074	   IANA maintains a registry of "URI schemes".  A "URI scheme" also
2075	   serves as an "IRI scheme".

2077	   To clarify that the URI scheme registration process also applies to
2078	   IRIs, change the description of the "URI schemes" registry header to
2079	   say "[RFC4395] defines an IANA-maintained registry of URI Schemes.

2081	   These registries include the Permanent and Provisional URI Schemes.
2082	   RFC XXXX updates this registry to designate that schemes may also
2083	   indicate their usability as IRI schemes."

2085	   Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".

2087	10.  Security Considerations

2089	   The security considerations discussed in [RFC3986] also apply to
2090	   IRIs.  In addition, the following issues require particular care for
2091	   IRIs.

2093	   Incorrect encoding or decoding can lead to security problems.  In
2094	   particular, some UTF-8 decoders do not check against overlong byte
2095	   sequences.  As an example, a "/" is encoded with the byte 0x2F both
2096	   in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2097	   interpret the sequence 0xC0 0xAF as a "/".  A sequence such as
2098	   "%C0%AF.." may pass some security tests and then be interpreted as
2099	   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2100	   and checking are not done in the right order, and/or if reserved
2101	   characters and unreserved characters are not clearly distinguished.
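
	   The ordering issue can be illustrated with a short Python
	   sketch, assuming a strict UTF-8 decoder (Python's built-in
	   decoder rejects overlong sequences).  The function name is
	   hypothetical.  The sketch decodes first and checks afterwards,
	   so "%C0%AF.." is rejected at the decoding step instead of
	   slipping past the path check.

```python
from urllib.parse import unquote_to_bytes


def decode_then_check(path: str) -> str:
    """Strictly decode a percent-encoded path, then reject ".." segments."""
    # Raises UnicodeDecodeError on invalid UTF-8 such as 0xC0 0xAF.
    decoded = unquote_to_bytes(path).decode("utf-8", errors="strict")
    if ".." in decoded.split("/"):
        raise ValueError("path traversal segment found")
    return decoded
```

	   The sketch does not address the separate question of keeping
	   reserved and unreserved characters apart after decoding.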

2103	   There are various ways in which "spoofing" can occur with IRIs.
2104	   "Spoofing" means that somebody may add a resource name that looks the
2105	   same or similar to the user, but that points to a different resource.
2106	   The added resource may pretend to be the real resource by looking
2107	   very similar but may contain all kinds of changes that may be
2108	   difficult to spot and that can cause all kinds of problems.  Most
2109	   spoofing possibilities for IRIs are extensions of those for URIs.

2111	   Spoofing can occur for various reasons.  First, a user's
2112	   normalization expectations or actual normalization when entering an
2113	   IRI or transcoding an IRI from a legacy character encoding do not
2114	   match the normalization used on the server side.  Conceptually, this
2115	   is no different from the problems surrounding the use of case-
2116	   insensitive web servers.  For example, a popular web page with a
2117	   mixed-case name ("http://big.example.com/PopularPage.html") might be
2118	   "spoofed" by someone who is able to create
2119	   "http://big.example.com/popularpage.html".  However, the use of
2120	   unnormalized character sequences, and of additional mappings for user
2121	   convenience, may increase the chance for spoofing.  Protocols and
2122	   servers that allow the creation of resources with names that are not
2123	   normalized are particularly vulnerable to such attacks.  This is an
2124	   inherent security problem of the relevant protocol, server, or
2125	   resource and is not specific to IRIs, but it is mentioned here for
2126	   completeness.
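
	   The normalization mismatch can be made concrete with a small
	   Python sketch.  Two strings that render identically may differ
	   code point by code point, and only compare equal once both are
	   normalized.  The helper name is hypothetical, and NFC is used
	   here purely as an example of a server-side normalization.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "r\u00e9sum\u00e9"      # precomposed e-acute
decomposed = "re\u0301sume\u0301"  # "e" followed by combining acute accent
```

	   Here composed and decomposed are different strings, yet
	   same_after_nfc(composed, decomposed) is true, which is exactly
	   the gap a spoofed resource name can exploit on a server that
	   stores names without normalizing them.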

2128	   Spoofing can occur in various IRI components, such as the domain name
2129	   part or a path part.  For considerations specific to the domain name
2130	   part, see [RFC3491].  For the path part, administrators of sites that
2131	   allow independent users to create resources in the same sub area may
2132	   have to be careful to check for spoofing.

2134	   Spoofing can occur because in the UCS many characters look very
2135	   similar.  Details are discussed in Section 8.5.  Again, this is very
2136	   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
2137	   or "1ame" URIs.

2139	   Spoofing can occur when URIs with percent-encodings based on various
2140	   character encodings are accepted to deal with older user agents.  In
2141	   some cases, particularly for Latin-based resource names, this is
2142	   usually easy to detect because UTF-8-encoded names, when interpreted
2143	   and viewed as legacy character encodings, produce mostly garbage.

2145	   When concurrently used character encodings have a similar structure
2146	   but there are no characters that have exactly the same encoding,
2147	   detection is more difficult.

2149	   Spoofing can occur with bidirectional IRIs, if the restrictions in
2150	   Section 4.2 are not followed.  The same visual representation may be
2151	   interpreted as different logical representations, and vice versa.  It
2152	   is also very important that a correct Unicode bidirectional
2153	   implementation be used.

2155	   The use of Legacy Extended IRIs introduces additional security
2156	   issues.

2158	11.  Acknowledgements

2160	   For contributions to this update, we would like to thank Ian Hickson,
2161	   Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin,
2162	   Henry S. Thompson, and the XML Core Working Group of the W3C.

2164	   The discussion on the issue addressed here started a long time ago.
2165	   There was a thread in the HTML working group in August 1995 (under
2166	   the topic of "Globalizing URIs") and in the www-international mailing
2167	   list in July 1996 (under the topic of "Internationalization and
2168	   URLs"), and there were ad-hoc meetings at the Unicode conferences in
2169	   September 1995 and September 1997.

2171	   For contributions to the previous version of this document, RFC 3987,
2172	   many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
2173	   Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
2174	   Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
2175	   Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
2176	   Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
2177	   Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
2178	   Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
2179	   Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
2180	   Haynes, Walter Underwood, and many others.

2182	   A definition of HyperText Reference was initially produced by Ian
2183	   Hickson, and further edited by Dan Connolly and C. M. Sperberg-
2184	   McQueen.

2186	   Thanks to the Internationalization Working Group (I18N WG) of the
2187	   World Wide Web Consortium (W3C), and the members of the W3C I18N
2188	   Working Group and Interest Group for their contributions and their
2189	   work on [CharMod].  Thanks also go to the members of many other W3C
2190	   Working Groups for adopting IRIs, and to the members of the Montreal
2191	   IAB Workshop on Internationalization and Localization for their
2192	   review.

2194	12.  Change Log

2196	   Note to RFC Editor: Please completely remove this section before
2197	   publication.

2199	12.1.  Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00

2201	   Changed draft name, date, last paragraph of abstract, and titles in
2202	   change log, and added this section in moving from
2203	   draft-duerst-iri-bis-07 (personal submission) to
2204	   draft-ietf-iri-3987bis-00 (WG document).

2206	12.2.  Changes from -06 to -07 of draft-duerst-iri-bis

2208	   Major restructuring of IRI processing model to make scheme-specific
2209	   translation necessary to handle IDNA requirements and for consistency
2210	   with web implementations.

2212	   Starting with IRI, you want one of:

2214	   a  IRI components (IRI parsed into UTF8 pieces)

2216	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

2218	   c  whole URI (for passing on to some other system that wants whole
2219	      URIs)

2221	12.2.1.  OLD WAY

2223	   1.  Pct-encoding on the whole thing to a URI. (c1) If you want a
2224	       (maybe broken) whole URI, you might stop here.

2226	   2.  Parsing the URI into URI components. (b1) If you want (maybe
2227	       broken) URI components, stop here.

2229	   3.  Decode the components (undoing the pct-encoding). (a) if you want
2230	       IRI components, stop here.

2232	   4.  Reencode: either using a different encoding for some components (for
2233	       domain names, and query components in web pages, which depends on
2234	       the component, scheme and context), and otherwise using pct-
2235	       encoding. (b2) if you want (good) URI components, stop here.

2237	   5.  reassemble the reencoded components. (c2) if you want a (*good*)
2238	       whole URI stop here.

2240	12.2.2.  NEW WAY

2242	   1.  Parse the IRI into IRI components using the generic syntax. (a)
2243	       if you want IRI components, stop here.

2245	   2.  Encode each component, using pct-encoding, IDN encoding, or
2246	       special query part encoding depending on the component scheme or
2247	       context. (b) If you want URI components, stop here.

2249	   3.  Reassemble a whole URI from URI components. (c) if you want a
2250	       whole URI stop here.
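
	   The three steps of the new way can be sketched in Python as
	   follows.  This is an illustration under stated assumptions, not
	   the algorithm of this document: the function name is
	   hypothetical, the host is converted with Python's built-in
	   "idna" codec, all other components are percent-encoded as UTF-8
	   (no special query encoding), and userinfo is dropped.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in path, query, and fragment, including
# "%" so that existing percent-encoded triplets are not double-encoded.
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri: str) -> str:
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component according to its kind.
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii") if host else ""
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

	   Because parsing happens before any encoding, each component can
	   be given its own treatment, which is the point of the
	   restructuring described above.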

2252	12.3.  Changes from -00 to -01

2254	   o  Removed 'mailto:' before mail addresses of authors.

2256	   o  Added "<to be done>" as right side of 'href-strip' rule.  Fixed
2257	      '|' to '/' for alternatives.

2259	12.4.  Changes from -05 to -06 of draft-duerst-iri-bis-00

2261	   o  Add HyperText Reference, change abstract, acks and references for
2262	      it

2264	   o  Add Masinter back as another editor.

2266	   o  Masinter integrates HRef material from HTML5 spec.

2268	   o  Rewrite introduction sections to modernize.

2270	12.5.  Changes from -04 to -05 of draft-duerst-iri-bis

2272	   o  Updated references.

2274	   o  Changed IPR text to pre5378Trust200902.

2276	12.6.  Changes from -03 to -04 of draft-duerst-iri-bis

2278	   o  Added explicit abbreviation for LEIRIs.

2280	   o  Mentioned LEIRI references.

2282	   o  Completed text in LEIRI section about tag characters and about
2283	      specials.

2285	12.7.  Changes from -02 to -03 of draft-duerst-iri-bis

2287	   o  Updated some references.

2289	   o  Updated Michel Suignard's coordinates.

2291	12.8.  Changes from -01 to -02 of draft-duerst-iri-bis

2293	   o  Added tag range to iprivate (issue private-include-tags-115).

2295	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

2297	12.9.  Changes from -00 to -01 of draft-duerst-iri-bis

2299	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
2300	      based on input from the W3C XML Core WG.  Moved the relevant
2301	      subsections to the back and promoted them to a section.

2303	   o  Added some text re.  Legacy Extended IRIs to the security section.

2305	   o  Added an IANA Considerations Section.

2307	   o  Added this Change Log Section.

2309	   o  Added a section about "IRIs with Spaces/Controls" (converting from
2310	      a Note in RFC 3987).

2312	12.10.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

2314	      Fixed errata (see
2315	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

2317	13.  References

2319	13.1.  Normative References

2321	   [ASCII]    American National Standards Institute, "Coded Character
2322	              Set -- 7-bit American Standard Code for Information
2323	              Interchange", ANSI X3.4, 1986.

2325	   [ISO10646]
2326	              International Organization for Standardization, "ISO/IEC
2327	              10646:2003: Information Technology - Universal Multiple-
2328	              Octet Coded Character Set (UCS)", ISO Standard 10646,
2329	              December 2003.

2331	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2332	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2334	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
2335	              "Internationalizing Domain Names in Applications (IDNA)",
2336	              RFC 3490, March 2003.

2338	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2339	              Profile for Internationalized Domain Names (IDN)",
2340	              RFC 3491, March 2003.

2342	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
2343	              10646", STD 63, RFC 3629, November 2003.

2345	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2346	              Resource Identifier (URI): Generic Syntax", STD 66,
2347	              RFC 3986, January 2005.

2349	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
2350	              Applications (IDNA): Definitions and Document Framework",
2351	              RFC 5890, August 2010.

2353	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
2354	              Applications (IDNA): Protocol", RFC 5891, August 2010.

2356	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
2357	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

2359	   [UNI9]     Davis, M., "The Bidirectional Algorithm", Unicode Standard
2360	              Annex #9, March 2004,
2361	              <http://www.unicode.org/reports/tr9/tr9-13.html>.

2363	   [UNIV4]    The Unicode Consortium, "The Unicode Standard, Version
2364	              5.1.0, defined by: The Unicode Standard, Version 5.0
2365	              (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as
2366	              amended by Unicode 5.1.0
2367	              (http://www.unicode.org/versions/Unicode5.1.0/)",
2368	              April 2008.

2370	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
2371	              Unicode Standard Annex #15, March 2008,
2372	              <http://www.unicode.org/unicode/reports/tr15/
2373	              tr15-23.html>.

2375	13.2.  Informative References

2377	   [BidiEx]   "Examples of bidirectional IRIs",
2378	              <http://www.w3.org/International/iri-edit/BidiExamples>.

2380	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
2381	              Texin, "Character Model for the World Wide Web: Resource
2382	              Identifiers", World Wide Web Consortium Candidate
2383	              Recommendation, November 2004,
2384	              <http://www.w3.org/TR/charmod-resid>.

2386	   [Duerst97]
2387	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
2388	              11th International Unicode Conference, San Jose,
2389	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
2390	              papers/PDF/IUC11-UTF-8.pdf>.

2392	   [Gettys]   Gettys, J., "URI Model Consequences",
2393	              <http://www.w3.org/DesignIssues/ModelConsequences>.

2395	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
2396	              Specification", World Wide Web Consortium Recommendation,
2397	              December 1999,
2398	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

2400	   [HTML5]    Hickson, I. and D. Hyatt, "A vocabulary and associated
2401	              APIs for HTML and XHTML", World Wide Web
2402	              Consortium Working Draft, April 2009,
2403	              <http://www.w3.org/TR/2009/WD-html5-20090423/>.

2405	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
2406	              IRIs for XML resource identification", World Wide Web
2407	              Consortium Note, November 2008,
2408	              <http://www.w3.org/TR/leiri/>.

2410	   [RFC1738]  Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
2411	              Resource Locators (URL)", RFC 1738, December 1994.

2413	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
2414	              Extensions (MIME) Part One: Format of Internet Message
2415	              Bodies", RFC 2045, November 1996.

2417	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
2418	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
2419	              the IAB Character Set Workshop held 29 February - 1 March,
2420	              1996", RFC 2130, April 1997.

2422	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

2424	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

2426	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
2427	              Languages", BCP 18, RFC 2277, January 1998.

2429	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
2430	              URL scheme", RFC 2368, July 1998.

2432	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

2434	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2435	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
2436	              August 1998.

2438	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
2439	              August 1998.

2441	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
2442	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
2443	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

2445	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
2446	              Protocol", RFC 2640, July 1999.

2448	   [RFC4395]  Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
2449	              Registration Procedures for New URI Schemes", BCP 35,
2450	              RFC 4395, February 2006.

2452	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
2453	              Markup Languages", Unicode Technical Report #20, World
2454	              Wide Web Consortium Note, June 2003,
2455	              <http://www.w3.org/TR/unicode-xml/>.

2457	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
2458	              Considerations", Unicode Technical Report #36,
2459	              August 2010, <http://unicode.org/reports/tr36/>.

2461	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
2462	              Language (XLink) Version 1.0", World Wide Web
2463	              Consortium Recommendation, June 2001,
2464	              <http://www.w3.org/TR/xlink/#link-locators>.

2466	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
2467	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
2468	              Edition)", World Wide Web Consortium Recommendation,
2469	              August 2006, <http://www.w3.org/TR/REC-xml>.

2471	   [XMLNamespace]
2472	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
2473	              "Namespaces in XML (Second Edition)", World Wide Web
2474	              Consortium Recommendation, August 2006,
2475	              <http://www.w3.org/TR/REC-xml-names>.

2477	   [XMLSchema]
2478	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
2479	              World Wide Web Consortium Recommendation, May 2001,
2480	              <http://www.w3.org/TR/xmlschema-2/#anyURI>.

2482	   [XPointer]
2483	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
2484	              Framework", World Wide Web Consortium Recommendation,
2485	              March 2003,
2486	              <http://www.w3.org/TR/xptr-framework/#escaping>.

2488	Appendix A.  Design Alternatives

2490	   This section briefly summarizes some design alternatives considered
2491	   earlier and the reasons why they were not chosen.

2493	A.1.  New Scheme(s)

2495	   Introducing new schemes (for example, httpi:, ftpi:,...) or a new
2496	   metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
2497	   i:ftp:,...) was proposed to make IRI-to-URI conversion
2498	   scheme-dependent or to distinguish between percent-encodings
2499	   resulting from IRI-to-URI conversion and percent-encodings from
2500	   legacy character encodings.

2502	   New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2503	   IRIs that contain non-ASCII characters).  The benefit of being able
2504	   to detect the origin of percent-encodings is marginal, as UTF-8 can
2505	   be detected with very high reliability.  Deploying new schemes is
2506	   extremely hard, so not requiring new schemes for IRIs makes
2507	   deployment of IRIs vastly easier.  Making conversion
2508	   scheme-dependent is highly inadvisable, yet separate schemes for
2509	   IRIs would encourage it.  Using a uniform convention for conversion
2510	   from IRIs to URIs makes IRI implementation orthogonal to the
2511	   introduction of actual new schemes.
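
   The claim that UTF-8 can be detected reliably can be illustrated
   with a minimal sketch.  It assumes Python and its standard
   urllib.parse module; the function name decodes_as_utf8 is
   hypothetical and is not part of this specification.  The sketch
   undoes the percent-encoding and checks whether the resulting octets
   form well-formed UTF-8.  Octets from a legacy encoding such as
   iso-8859-1 usually fail this check, because a lone non-ASCII octet
   is not a valid UTF-8 sequence.

```python
from urllib.parse import unquote_to_bytes


def decodes_as_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded octets are well-formed UTF-8.

    Hypothetical helper for illustration only. Pure US-ASCII input also
    returns True, because US-ASCII is a subset of UTF-8.
    """
    octets = unquote_to_bytes(percent_encoded)
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "%C3%BC" is u-umlaut in UTF-8; "%FC" is u-umlaut in iso-8859-1.
utf8_form = decodes_as_utf8("D%C3%BCrst")
legacy_form = decodes_as_utf8("D%FCrst")
```

   A positive result is only strong evidence, not proof: a short
   legacy-encoded octet sequence can happen to be well-formed UTF-8.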

2513	A.2.  Character Encodings Other Than UTF-8

2515	   At an early stage, UTF-7 was considered as an alternative to UTF-8
2516	   when IRIs are converted to URIs.  UTF-7 would not have needed
2517	   percent-encoding and in most cases would have been shorter than
2518	   percent-encoded UTF-8.

2520	   Using UTF-8 avoids a double layering and overloading of the use of
2521	   the "+" character.  UTF-8 is fully compatible with US-ASCII and has
2522	   therefore been recommended by the IETF, and is being used widely.

2524	   UTF-7 has never been used much and is now clearly being discouraged.
2525	   Requiring implementations to convert from UTF-8 to UTF-7 and back
2526	   would be an additional implementation burden.
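
   The trade-off described above can be made concrete with a small
   sketch, assuming Python and its built-in "utf-7" codec.  The sample
   string and the variable names are illustrative assumptions.  The
   sketch encodes the same name both ways and shows that UTF-7 gives
   the "+" character a second role as an escape introducer, so that a
   literal "+" has to be escaped itself.

```python
from urllib.parse import quote

# Illustrative sample: "Duerst" written with u-umlaut (U+00FC).
name = "D\u00fcrst"

# Percent-encoded UTF-8: each non-ASCII octet is written as %HH.
percent_utf8 = quote(name)

# UTF-7: non-ASCII characters go into a base64 run introduced by "+".
utf7 = name.encode("utf-7").decode("ascii")

# Because "+" introduces a base64 run, a literal "+" is written "+-".
plus_in_utf7 = "a+b".encode("utf-7").decode("ascii")

if __name__ == "__main__":
    print(percent_utf8, len(percent_utf8))
    print(utf7, len(utf7))
    print(plus_in_utf7)
```

   For this sample the UTF-7 form is shorter, but it is no longer
   plain UTF-8 with an escape layer on top, and "+" already carries
   other meanings in some URI components.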

2528	A.3.  New Encoding Convention

2530	   Instead of using the existing percent-encoding convention of URIs,
2531	   which is based on octets, the idea was to create a new encoding
2532	   convention; for example, to use "%u" to introduce UCS code points.

2534	   Using the existing octet-based percent-encoding mechanism does not
2535	   need an upgrade of the URI syntax and does not need corresponding
2536	   server upgrades.
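
   The following sketch, assuming Python and its standard urllib.parse
   module, illustrates why the octet-based mechanism needs no upgrade.
   The function percent_encode_non_ascii is a hypothetical, simplified
   helper (it leaves all US-ASCII characters untouched), and
   "%u00E9" stands in for the proposed code point syntax.  An existing
   percent-decoder recovers the UTF-8 octets unchanged, whereas it does
   not recognize the "%u" form and passes it through as literal text.

```python
from urllib.parse import unquote_to_bytes


def percent_encode_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    Simplified sketch: US-ASCII characters are copied as they are.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# U+00E9 (e-acute) becomes the two UTF-8 octets C3 A9.
octet_based = percent_encode_non_ascii("ros\u00e9")

# An existing decoder turns %C3%A9 back into the same two octets ...
decoded_octets = unquote_to_bytes(octet_based)

# ... but leaves the hypothetical "%u00E9" form undecoded.
decoded_percent_u = unquote_to_bytes("ros%u00E9")
```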

2538	A.4.  Indicating Character Encodings in the URI/IRI

2540	   Some proposals suggested indicating the character encodings used in
2541	   a URI or IRI with some new syntactic convention in the URI itself,
2542	   similar to the "charset" parameter for e-mails and Web pages.  As an
2543	   example, the label in square brackets in
2544	   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
2545	   following "&#xE9;" had to be interpreted as iso-8859-1.

2547	   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
2548	   needed.  Exclusive use of UTF-8 avoids labels, potentially several
2549	   per URI, that would have to be copied correctly in all cases, even
2550	   on the side of a bus or on a napkin; such labels would cause
2551	   usability problems (and be prohibitively annoying).  Exclusively
2552	   using UTF-8 also reduces transcoding errors and confusion.

2554	Authors' Addresses

2556	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
2557	      possible, for example as "D&#252;rst" in XML and HTML)
2558	   Aoyama Gakuin University
2559	   5-10-1 Fuchinobe
2560	   Sagamihara, Kanagawa  229-8558
2561	   Japan

2563	   Phone: +81 42 759 6329
2564	   Fax:   +81 42 759 6495
2565	   Email: duerst@it.aoyama.ac.jp
2566	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
2567	          (Note: This is the percent-encoded form of an IRI)

2569	   Michel Suignard
2570	   Unicode Consortium
2571	   P.O. Box 391476
2572	   Mountain View, CA  94039-1476
2573	   U.S.A.

2575	   Phone: +1-650-693-3921
2576	   Email: michel@unicode.org
2577	   URI:   http://www.suignard.com

2579	   Larry Masinter
2580	   Adobe
2581	   345 Park Ave
2582	   San Jose, CA  95110
2583	   U.S.A.

2585	   Phone: +1-408-536-3024
2586	   Email: masinter@adobe.com
2587	   URI:   http://larry.masinter.net