idnits 2.17.1 

draft-duerst-iri-bis-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == The 'Obsoletes: ' line in the draft header should list only the
     _numbers_ of the RFCs which will be obsoleted by this document (if
     approved); it should not include the word 'RFC' in the list.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008.  The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust.  If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 26, 2009) is 5294 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 4395
     (Obsoleted by RFC 7595)


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 16 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          M. Duerst
3	Internet-Draft                                  Aoyama Gakuin University
4	Obsoletes: RFC 3987                                          M. Suignard
5	(if approved)                                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: April 29, 2010                                            Adobe
8	                                                        October 26, 2009

10	             Internationalized Resource Identifiers (IRIs)
11	                        draft-duerst-iri-bis-07

13	Status of this Memo

15	   This Internet-Draft is submitted to IETF in full conformance with the
16	   provisions of BCP 78 and BCP 79.  This document may contain material
17	   from IETF Documents or IETF Contributions published or made publicly
18	   available before November 10, 2008.  The person(s) controlling the
19	   copyright in some of this material may not have granted the IETF
20	   Trust the right to allow modifications of such material outside the
21	   IETF Standards Process.  Without obtaining an adequate license from
22	   the person(s) controlling the copyright in such materials, this
23	   document may not be modified outside the IETF Standards Process, and
24	   derivative works of it may not be created outside the IETF Standards
25	   Process, except to format it for publication as an RFC or to
26	   translate it into languages other than English.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF), its areas, and its working groups.  Note that
30	   other groups may also distribute working documents as Internet-
31	   Drafts.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   The list of current Internet-Drafts can be accessed at
39	   http://www.ietf.org/ietf/1id-abstracts.txt.

41	   The list of Internet-Draft Shadow Directories can be accessed at
42	   http://www.ietf.org/shadow.html.

44	   This Internet-Draft will expire on April 29, 2010.

46	Copyright Notice

48	   Copyright (c) 2009 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents in effect on the date of
53	   publication of this document (http://trustee.ietf.org/license-info).
54	   Please review these documents carefully, as they describe your rights
55	   and restrictions with respect to this document.

57	Abstract

59	   This document defines the Internationalized Resource Identifier (IRI)
60	   protocol element, as an extension of the Uniform Resource Identifier
61	   (URI).  An IRI is a sequence of characters from the Universal
62	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
63	   given for IRIs and related syntactic forms.

65	   In addition, this document provides named additional rule sets for
66	   processing otherwise invalid IRIs, in a way that supports other
67	   specifications that wish to mandate common behavior for 'error'
68	   handling.  In particular, rules used in some XML languages (LEIRI)
69	   and web applications are given.

71	   Defining IRI as a new protocol element (rather than updating or
72	   extending the definition of URI) allows independent orderly
73	   transitions: other protocols and languages that use URIs must
74	   explicitly choose to allow IRIs.

76	   Guidelines are provided for the use and deployment of IRIs and
77	   related protocol elements when revising protocols, formats, and
78	   software components that currently deal only with URIs.

80	   [RFC Editor: Please remove this paragraph before publication.]  This
81	   document is intended to update RFC 3987 and move towards IETF Draft
82	   Standard.  This is an interim version in preparation for the IRI BOF
83	   at IETF 76 in Hiroshima.  For discussion and comments on this draft,
84	   please use the public-iri@w3.org mailing list.

86	Table of Contents

88	   1.  Introduction
89	     1.1.  Overview and Motivation
90	     1.2.  Applicability
91	     1.3.  Definitions
92	     1.4.  Notation
93	   2.  IRI Syntax
94	     2.1.  Summary of IRI Syntax
95	     2.2.  ABNF for IRI References and IRIs
96	   3.  Processing IRIs and related protocol elements
97	     3.1.  Converting to UCS
98	     3.2.  Parse the IRI into IRI components
99	     3.3.  General percent-encoding of IRI components
100	     3.4.  Mapping ireg-name
101	     3.5.  Mapping query components
102	     3.6.  Mapping IRIs to URIs
103	     3.7.  Converting URIs to IRIs
104	       3.7.1.  Examples
105	   4.  Bidirectional IRIs for Right-to-Left Languages
106	     4.1.  Logical Storage and Visual Presentation
107	     4.2.  Bidi IRI Structure
108	     4.3.  Input of Bidi IRIs
109	     4.4.  Examples
110	   5.  Normalization and Comparison
111	     5.1.  Equivalence
112	     5.2.  Preparation for Comparison
113	     5.3.  Comparison Ladder
114	       5.3.1.  Simple String Comparison
115	       5.3.2.  Syntax-Based Normalization
116	       5.3.3.  Scheme-Based Normalization
117	       5.3.4.  Protocol-Based Normalization
118	   6.  Use of IRIs
119	     6.1.  Limitations on UCS Characters Allowed in IRIs
120	     6.2.  Software Interfaces and Protocols
121	     6.3.  Format of URIs and IRIs in Documents and Protocols
122	     6.4.  Use of UTF-8 for Encoding Original Characters
123	     6.5.  Relative IRI References
124	   7.  Liberal handling of otherwise invalid IRIs
125	     7.1.  LEIRI processing
126	     7.2.  Web Address processing
127	     7.3.  Characters not allowed in IRIs
128	   8.  URI/IRI Processing Guidelines (Informative)
129	     8.1.  URI/IRI Software Interfaces
130	     8.2.  URI/IRI Entry
131	     8.3.  URI/IRI Transfer between Applications
132	     8.4.  URI/IRI Generation
133	     8.5.  URI/IRI Selection
134	     8.6.  Display of URIs/IRIs
135	     8.7.  Interpretation of URIs and IRIs
136	     8.8.  Upgrading Strategy
137	   9.  IANA Considerations
138	   10. Security Considerations
139	   11. Acknowledgements
140	   12. Open Issues
141	   13. Change Log
142	     13.1. Changes from -06 to this document
143	       13.1.1. OLD WAY
144	       13.1.2. NEW WAY
145	     13.2. Changes from -05 to -06
146	     13.3. Changes from -04 to -05
147	     13.4. Changes from -03 to -04
148	     13.5. Changes from -02 to -03
149	     13.6. Changes from -01 to -02
150	     13.7. Changes from -00 to -01
151	     13.8. Changes from RFC 3987 to -00
152	   14. References
153	     14.1. Normative References
154	     14.2. Informative References
155	   Appendix A.  Design Alternatives
156	     A.1.  New Scheme(s)
157	     A.2.  Character Encodings Other Than UTF-8
158	     A.3.  New Encoding Convention
159	     A.4.  Indicating Character Encodings in the URI/IRI
160	   Authors' Addresses

162	1.  Introduction

164	1.1.  Overview and Motivation

166	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
167	   sequence of characters chosen from a limited subset of the repertoire
168	   of US-ASCII [ASCII] characters.

170	   The characters in URIs are frequently used for representing words of
171	   natural languages.  This usage has many advantages: Such URIs are
172	   easier to memorize, easier to interpret, easier to transcribe, easier
173	   to create, and easier to guess.  For most languages other than
174	   English, however, the natural script uses characters other than A -
175	   Z. For many people, handling Latin characters is as difficult as
176	   handling the characters of other scripts is for those who use only
177	   the Latin alphabet.  Many languages with non-Latin scripts are
178	   transcribed with Latin letters.  These transcriptions are now often
179	   used in URIs, but they introduce additional difficulties.

181	   The infrastructure for the appropriate handling of characters from
182	   additional scripts is now widely deployed in operating system and
183	   application software.  Software that can handle a wide variety of
184	   scripts and languages at the same time is increasingly common.  Also,
185	   an increasing number of protocols and formats can carry a wide range
186	   of characters.

188	   URIs are used both as a protocol element (for transmission and
189	   processing by software) and also a presentation element (for display
190	   and handling by people who read, interpret, coin, or guess them).
191	   The transition between these roles is more difficult and complex when
192	   dealing with a larger set of characters than allowed for URIs in
193	   [RFC3986].

195	   This document defines the protocol element called Internationalized
196	   Resource Identifier (IRI), which allows applications of URIs to be
197	   extended to use resource identifiers that have a much wider
198	   repertoire of characters.  It also provides corresponding
199	   "internationalized" versions of other constructs from [RFC3986], such
200	   as URI references.  The syntax of IRIs is defined in Section 2.

202	   Using characters outside of A - Z in IRIs adds a number of
203	   difficulties.  Section 4 discusses the special case of bidirectional
204	   IRIs using characters from scripts written right-to-left.  Section 5
205	   discusses various forms of equivalence between IRIs.  Section 6
206	   discusses the use of IRIs in different situations.  Section 8 gives
207	   additional informative guidelines.  Section 10 discusses IRI-specific
208	   security considerations.

210	1.2.  Applicability

212	   IRIs are designed to allow protocols and software that deal with URIs
213	   to be updated to handle IRIs.  A "URI scheme" (as defined by
214	   [RFC3986] and registered through the IANA process defined in
215	   [RFC4395]) also serves as an "IRI scheme".  Processing of IRIs is
216	   accomplished by extending the URI syntax while retaining (and not
217	   expanding) the set of "reserved" characters, such that the syntax for
218	   any URI scheme may be uniformly extended to allow non-ASCII
219	   characters.  In addition, following parsing of an IRI, it is possible
220	   to construct a corresponding URI by first encoding characters outside
221	   of the allowed URI range and then reassembling the components.
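
   As a rough illustration of the encoding step described above, the
   following Python sketch percent-encodes every character beyond
   U+007F in an already-parsed component as its UTF-8 octets.  The
   helper name percent_encode_non_ascii is hypothetical and not defined
   by this document, and the sketch deliberately ignores the separate
   treatment of ireg-name and query components covered in Section 3.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode each character above U+007F as its UTF-8 octets.

    Hypothetical helper for illustration only; ASCII characters,
    including "%" and all reserved characters, pass through unchanged.
    """
    pieces = []
    for ch in component:
        if ord(ch) > 0x7F:
            # One "%XX" triplet per UTF-8 octet of the character.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        else:
            pieces.append(ch)
    return "".join(pieces)


# A path component containing U+00E9 (written with an escape so the
# source stays within US-ASCII).
print(percent_encode_non_ascii("/r\u00E9sum\u00E9"))
```

   Because only characters outside US-ASCII are touched, applying the
   helper to a component that is already a valid URI component leaves
   it unchanged.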

223	   Practical use of IRI forms in place of URI forms depends on the
224	   following conditions being met:

226	   a. A protocol or format element MUST be explicitly designated to be
227	      able to carry IRIs.  The intent is to avoid introducing IRIs into
228	      contexts that are not defined to accept them.  For example, XML
229	      schema [XMLSchema] has an explicit type "anyURI" that includes
230	      IRIs and IRI references.  Therefore, IRIs and IRI references can
231	      be in attributes and elements of type "anyURI".  On the other
232	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
233	      defined as a URI, which means that direct use of IRIs is not
234	      allowed in HTTP requests.

236	   b. The protocol or format carrying the IRIs MUST have a mechanism to
237	      represent the wide range of characters used in IRIs, either
238	      natively or by some protocol- or format-specific escaping
239	      mechanism (for example, numeric character references in [XML1]).

241	   c. The URI scheme definition, if it explicitly allows a percent sign
242	      ("%") in any syntactic component, SHOULD define the interpretation
243	      of sequences of percent-encoded octets (using "%XX" hex octets) as
244	      octets from sequences of UTF-8 encoded strings; this is recommended
245	      in the guidelines for registering new schemes, [RFC4395].  For
246	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
247	      [RFC2384], and the URN syntax [RFC2141].  Note that use of
248	      percent-encoding may also be restricted in some situations, for
249	      example, URI schemes that disallow percent-encoding might still be
250	      used with a fragment identifier which is percent-encoded (e.g.,
251	      [XPointer]).  See Section 6.4 for further discussion.
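
   Condition c. can be illustrated with a short Python sketch that
   interprets the percent-encoded octets of a component as UTF-8 and
   reports when they do not form a valid UTF-8 sequence.  The function
   name decode_pct_utf8 is an assumption made for this example; only
   urllib.parse.unquote_to_bytes from the Python standard library is
   relied upon.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_pct_utf8(component: str) -> Optional[str]:
    """Return the component with "%XX" octets decoded as UTF-8.

    Hypothetical helper: returns None when the resulting octets are
    not valid UTF-8, which suggests the scheme or producer used some
    other character encoding.
    """
    octets = unquote_to_bytes(component)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_pct_utf8("r%C3%A9sum%C3%A9") is not None)  # valid UTF-8
print(decode_pct_utf8("%FC") is None)                   # lone octet <fc>
```

   A caller that receives None would typically keep the percent-encoded
   form rather than guess at a legacy encoding.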

253	1.3.  Definitions

255	   The following definitions are used in this document; they follow the
256	   terms in [RFC2130], [RFC2277], and [ISO10646].

258	   character:  A member of a set of elements used for the organization,
259	      control, or representation of data.  For example, "LATIN CAPITAL
260	      LETTER A" names a character.

262	   octet:  An ordered sequence of eight bits considered as a unit.

264	   character repertoire:  A set of characters (set in the mathematical
265	      sense).

267	   sequence of characters:  A sequence of characters (one after
268	      another).

270	   sequence of octets:  A sequence of octets (one after another).

272	   character encoding:  A method of representing a sequence of
273	      characters as a sequence of octets (maybe with variants).  Also, a
274	      method of (unambiguously) converting a sequence of octets into a
275	      sequence of characters.

277	   charset:  The name of a parameter or attribute used to identify a
278	      character encoding.

280	   UCS:  Universal Character Set. The coded character set defined by
281	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

283	   IRI reference:  Denotes the common usage of an Internationalized
284	      Resource Identifier.  An IRI reference may be absolute or
285	      relative.  However, the "IRI" that results from such a reference
286	      only includes absolute IRIs; any relative IRI references are
287	      resolved to their absolute form.  Note that in [RFC2396] URIs did
288	      not include fragment identifiers, but in [RFC3986] fragment
289	      identifiers are part of URIs.

291	   URL:  The term "URL" was originally used [RFC1738] for roughly what
292	      is now called a "URI".  Books, software, and documentation often
293	      refer to URIs and IRIs using the "URL" term.  Some usages
294	      restrict "URL" to those URIs which are not URNs.  Because of the
295	      ambiguity of the term, using the term "URL" is NOT RECOMMENDED in
296	      formal documents.

298	   LEIRI (Legacy Extended IRI) processing:  This term was used in
299	      various XML specifications to refer to strings that, although not
300	      valid IRIs, were acceptable input to the processing rules in
301	      Section 7.1.

303	   (Web Address, Hypertext Reference, HREF):  These terms have been
304	      added in this document for convenience, to allow other
305	      specifications to refer to those strings that, although not valid
306	      IRIs, are acceptable input to the processing rules in Section 7.2.
307	      This usage corresponds to the parsing rules of some popular web
308	      browsing applications.  ISSUE: Need to find a good name/
309	      abbreviation for these.

311	   running text:  Human text (paragraphs, sentences, phrases) with
312	      syntax according to orthographic conventions of a natural
313	      language, as opposed to syntax defined for ease of processing by
314	      machines (e.g., markup, programming languages).

316	   protocol element:  Any portion of a message that affects processing
317	      of that message by the protocol in question.

319	   presentation element:  A presentation form corresponding to a
320	      protocol element; for example, using a wider range of characters.

322	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
323	      used for the initial creation.  This may be the initial creation
324	      of a resource with a certain identifier, or the initial exposition
325	      of a resource under a particular identifier.

327	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
328	      used when the identifier is generated by derivation from other
329	      information.

331	   parsed URI component:  When a URI processor parses a URI (following
332	      the generic syntax or a scheme-specific syntax), the result is a
333	      set of parsed URI components, each of which has a type
334	      (corresponding to the syntactic definition) and a sequence of URI
335	      characters.

337	   parsed IRI component:  When an IRI processor parses an IRI directly,
338	      following the general syntax or a scheme-specific syntax, the
339	      result is a set of parsed IRI components, each of which has a type
340	      (corresponding to the syntactic definition) and a sequence of IRI
341	      characters.  (This definition is analogous to "parsed URI
342	      component".)

344	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
345	      scheme's syntax has been extended to allow non-US-ASCII characters
346	      according to the rules in this document.

348	1.4.  Notation

350	   RFCs and Internet Drafts currently do not allow any characters
351	   outside the US-ASCII repertoire.  Therefore, this document uses
352	   various special notations to denote such characters in examples.

354	   In text, characters outside US-ASCII are sometimes referenced by
355	   using a prefix of 'U+', followed by four to six hexadecimal digits.

357	   To represent characters outside US-ASCII in examples, this document
358	   uses two notations: 'XML Notation' and 'Bidi Notation'.

360	   XML Notation uses a leading '&#x', a trailing ';', and the
361	   hexadecimal number of the character in the UCS in between.  For
362	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
363	   notation, an actual '&' is denoted by '&amp;'.
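
   For readers who want to turn examples written in XML Notation back
   into actual characters, a minimal Python sketch follows.  The name
   from_xml_notation is hypothetical, and the sketch handles only the
   two forms described above (hexadecimal references and '&amp;'), not
   XML character references in general.

```python
import re

# Either a hexadecimal reference such as "&#xE9;" or the escaped "&amp;".
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text: str) -> str:
    """Expand the XML Notation used in this document's examples."""
    def expand(match: re.Match) -> str:
        if match.group(1) is None:
            return "&"  # the "&amp;" alternative matched
        return chr(int(match.group(1), 16))

    # A single left-to-right pass, so "&amp;#x41;" yields the literal
    # text "&#x41;" and is not expanded a second time.
    return _XML_NOTATION.sub(expand, text)


print(from_xml_notation("a&amp;b"))
```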

365	   Bidi Notation is used for bidirectional examples: Lower case letters
366	   stand for Latin letters or other letters that are written left to
367	   right, whereas upper case letters represent Arabic or Hebrew letters
368	   that are written right to left.

370	   To denote actual octets in examples (as opposed to percent-encoded
371	   octets), the two hex digits denoting the octet are enclosed in "<"
372	   and ">".  For example, the octet often denoted as 0xc9 is denoted
373	   here as <c9>.

375	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
376	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
377	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

379	2.  IRI Syntax

381	   This section defines the syntax of Internationalized Resource
382	   Identifiers (IRIs).

384	   As with URIs, an IRI is defined as a sequence of characters, not as a
385	   sequence of octets.  This definition accommodates the fact that IRIs
386	   may be written on paper or read over the radio as well as stored or
387	   transmitted digitally.  The same IRI might be represented as
388	   different sequences of octets in different protocols or documents if
389	   these protocols or documents use different character encodings
390	   (and/or transfer encodings).  Using the same character encoding as
391	   the containing protocol or document ensures that the characters in
392	   the IRI can be handled (e.g., searched, converted, displayed) in the
393	   same way as the rest of the protocol or document.
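
   The point that one IRI can correspond to several octet sequences
   can be checked with a few lines of Python.  This is only a sketch:
   the example IRI and the choice of UTF-8 and ISO-8859-1 as the two
   character encodings are assumptions made for illustration.

```python
# One IRI, as a sequence of characters (U+00E9 written as an escape).
iri = "http://example.org/r\u00E9sum\u00E9"

# The same characters serialized in two different character encodings.
as_utf8 = iri.encode("utf-8")
as_latin1 = iri.encode("iso-8859-1")

# The octet sequences differ ...
print(as_utf8 != as_latin1)

# ... but decoding each with its own encoding recovers the same IRI.
print(as_utf8.decode("utf-8") == as_latin1.decode("iso-8859-1") == iri)
```

   Comparing IRIs therefore has to happen at the character level, after
   each octet sequence has been decoded with the encoding of its
   containing protocol or document.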

395	2.1.  Summary of IRI Syntax

397	   IRIs are defined by extending the URI syntax in [RFC3986], but
398	   extending the class of unreserved characters by adding the characters
399	   of the UCS (Universal Character Set, [ISO10646]) beyond U+007F,
400	   subject to the limitations given in the syntax rules below and in
401	   Section 6.1.

403	   The syntax and use of components and reserved characters is the same
404	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
405	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
406	   scheme are extended to allow parsing of IRIs using the same
407	   parsing rules.

409	   All the operations defined in [RFC3986], such as the resolution of
410	   relative references, can be applied to IRIs by IRI-processing
411	   software in exactly the same way as they are for URIs by URI-
412	   processing software.
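
   As one example of reusing a URI operation unchanged, the regular
   expression given in Appendix B of [RFC3986] for splitting a URI
   reference into components also splits IRI references, because the
   delimiters are the same reserved US-ASCII characters.  The Python
   sketch below assumes a hypothetical helper name,
   split_iri_reference, and performs no validation against the ABNF.

```python
import re

# Component-splitting expression from RFC 3986, Appendix B.
_COMPONENTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_iri_reference(reference: str) -> dict:
    """Split an IRI reference into its five generic components.

    Components that are absent are reported as None; the path is
    always present but may be the empty string.
    """
    match = _COMPONENTS.match(reference)  # every group is optional
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


parts = split_iri_reference("http://example.org/p\u00E4th?q=\u00FC#fr\u00E4g")
print(parts["scheme"], parts["authority"])
```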

414	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
415	   therefore MUST NOT be used for syntactical purposes, such as to
416	   delimit components in newly defined schemes.  For example, U+00A2,
417	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
418	   the 'iunreserved' category.  This is similar to the fact that it is
419	   not possible to use '-' as a delimiter in URIs, because it is in the
420	   'unreserved' category.

422	2.2.  ABNF for IRI References and IRIs

424	   An ABNF definition for IRI references (which are the most general
425	   concept and the start of the grammar) and IRIs is given here.  The
426	   syntax of this ABNF is described in [STD68].  Character numbers are
427	   taken from the UCS, without implying any actual binary encoding.
428	   Terminals in the ABNF are characters, not octets.

430	   The following grammar closely follows the URI grammar in [RFC3986],
431	   except that the range of unreserved characters is expanded to include
432	   UCS characters, with the restriction that private UCS characters can
433	   occur only in query parts.  The grammar is split into two parts:
434	   Rules that differ from [RFC3986] because of the above-mentioned
435	   expansion, and rules that are the same as those in [RFC3986].  For
436	   rules that are different from those in [RFC3986], the names of the
437	   non-terminals have been changed as follows.  If the non-terminal
438	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
439	   has been prefixed.

441	   The following rules are different from those in [RFC3986]:

443	   IRI            = scheme ":" ihier-part [ "?" iquery ]
444	                    [ "#" ifragment ]

446	   ihier-part     = "//" iauthority ipath-abempty
447	                  / ipath-absolute
448	                  / ipath-rootless
449	                  / ipath-empty

451	   IRI-reference  = IRI / irelative-ref

453	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

455	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

457	   irelative-part = "//" iauthority ipath-abempty
458	                  / ipath-absolute
459	                  / ipath-noscheme
460	                  / ipath-empty

462	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
463	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
464	   ihost          = IP-literal / IPv4address / ireg-name

466	   pct-form       = pct-encoded

468	   ireg-name      = *( iunreserved / sub-delims )

470	   ipath          = ipath-abempty   ; begins with "/" or is empty
471	                  / ipath-absolute  ; begins with "/" but not "//"
472	                  / ipath-noscheme  ; begins with a non-colon segment
473	                  / ipath-rootless  ; begins with a segment
474	                  / ipath-empty     ; zero characters

476	   ipath-abempty  = *( path-sep isegment )
477	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
478	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
479	   ipath-rootless = isegment-nz *( path-sep isegment )
480	   ipath-empty    = 0<ipchar>
481	   path-sep       = "/"

483	   isegment       = *ipchar
484	   isegment-nz    = 1*ipchar
485	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
486	                        / "@" )
487	                  ; non-zero-length segment without any colon ":"

489	   ipchar         = iunreserved / pct-form / sub-delims / ":"
490	                  / "@"

492	   iquery         = *( ipchar / iprivate / "/" / "?" )

494	   ifragment      = *( ipchar / "/" / "?" / "#" )

496	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

498	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
499	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
500	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
501	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
502	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
503	                  / %xD0000-DFFFD / %xE1000-EFFFD

505	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
506	                  / %x100000-10FFFD

508	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
509	   "greedy") algorithm applies.  For details, see [RFC3986].
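
   The 'ucschar' and 'iprivate' rules are plain code point ranges, so
   membership can be tested directly.  The following Python sketch
   transcribes the ranges from the ABNF above; the names is_ucschar and
   is_iprivate are hypothetical helpers chosen for this example.

```python
# Ranges transcribed from the ucschar rule above.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13 each contribute %xN0000-NFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
# Plane 14 contributes only %xE1000-EFFFD.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Ranges transcribed from the iprivate rule above.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF),
    (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def _in_ranges(code_point: int, ranges) -> bool:
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    """True if the single character ch matches the ucschar rule."""
    return _in_ranges(ord(ch), UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    """True if the single character ch matches the iprivate rule."""
    return _in_ranges(ord(ch), IPRIVATE_RANGES)


print(is_ucschar("\u00E9"), is_iprivate("\uE000"))
```

   A check of this kind only covers the grammar; the additional
   limitations referenced in Section 6.1 are not modeled here.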

511	   The following rules are the same as those in [RFC3986]:

513	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

515	   port           = *DIGIT

517	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

519	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

521	   IPv6address    =                            6( h16 ":" ) ls32
522	                  /                       "::" 5( h16 ":" ) ls32
523	                  / [               h16 ] "::" 4( h16 ":" ) ls32
524	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
525	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
526	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
527	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
528	                  / [ *5( h16 ":" ) h16 ] "::"              h16
529	                  / [ *6( h16 ":" ) h16 ] "::"

531	   h16            = 1*4HEXDIG
532	   ls32           = ( h16 ":" h16 ) / IPv4address

534	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

536	   dec-octet      = DIGIT                 ; 0-9
537	                  / %x31-39 DIGIT         ; 10-99
538	                  / "1" 2DIGIT            ; 100-199
539	                  / "2" %x30-34 DIGIT     ; 200-249
540	                  / "25" %x30-35          ; 250-255

542	   pct-encoded    = "%" HEXDIG HEXDIG

544	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
545	   reserved       = gen-delims / sub-delims
546	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
547	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
548	                  / "*" / "+" / "," / ";" / "="

550	   This syntax does not support IPv6 scoped addressing zone identifiers.
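
	   As an illustration of the IPv4address and dec-octet rules above,
	   the following Python sketch transcribes them into a regular
	   expression.  The names DEC_OCTET, IPV4ADDRESS, and is_ipv4address
	   are assumptions of this sketch, not part of the grammar.

```python
import re

# dec-octet, written with the longest alternatives first:
#   250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"{0}\.{0}\.{0}\.{0}".format(DEC_OCTET))


def is_ipv4address(text):
    """True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```

	   Note that leading zeros (as in "01.2.3.4") are rejected, because
	   dec-octet only allows a leading "0" for the single-digit form.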

552	3.  Processing IRIs and related protocol elements

554	   IRIs are meant to replace URIs in identifying resources within new
555	   versions of protocols, formats, and software components that use a
556	   UCS-based character repertoire.  Protocols and components may use and
557	   process IRIs directly.  However, there are still numerous systems and
558	   protocols which only accept URIs or components of parsed URIs; that
559	   is, they only accept sequences of characters within the subset of US-
560	   ASCII characters allowed in URIs.

562	   This section defines specific processing steps for IRI consumers
563	   which establish the relationship between the string given and the
564	   interpreted derivatives.  These processing steps apply to both IRIs
565	   and IRI references (i.e., absolute or relative forms); for IRIs, some
566	   steps are scheme specific.

568	3.1.  Converting to UCS

570	   Input that is already in a Unicode form (i.e., a sequence of Unicode
571	   characters or an octet-stream representing a Unicode-based character
572	   encoding such as UTF-8 or UTF-16) should be left as is and not
573	   normalized (see Section 5.3.2.2).

575	   If the IRI or IRI reference is an octet stream in some known non-
576	   Unicode character encoding, convert the IRI to a sequence of
577	   characters from the UCS; this sequence SHOULD also be normalized
578	   according to Unicode Normalization Form C (NFC, [UTR15]).  In this
579	   case, retain the original character encoding as the "document
580	   character encoding".  (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
581	   DO, CHANGE? )

583	   In other cases (written on paper, read aloud, or otherwise
584	   represented independent of any character encoding) represent the IRI
585	   as a sequence of characters from the UCS normalized according to
586	   Unicode Normalization Form C (NFC, [UTR15]).
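
	   The first two cases above can be sketched in Python as follows.
	   The function name to_ucs and its return shape (text plus the
	   retained document character encoding, or None for Unicode input)
	   are choices made for this sketch only.  The test for a
	   "Unicode-based" encoding by codec name is likewise an assumption.

```python
import codecs
import unicodedata


def to_ucs(octets, encoding):
    """Convert an octet stream in a known encoding to UCS characters.

    Returns (text, document_encoding).  Input in a Unicode-based
    encoding is left as is; input in any other encoding is normalized
    to NFC and its encoding is retained.
    """
    codec_name = codecs.lookup(encoding).name  # canonical codec name
    text = octets.decode(encoding)
    if codec_name.startswith("utf-"):
        # Already Unicode: do not normalize.
        return text, None
    # Legacy encoding: normalize and remember the original encoding.
    return unicodedata.normalize("NFC", text), codec_name
```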

588	3.2.  Parse the IRI into IRI components

590	   Parse the IRI, either as a relative reference (no scheme) or using
591	   scheme specific processing (according to the scheme given); the
592	   result is a set of parsed IRI components.  (NOTE: FIX
593	   BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES THAT USE GENERIC
594	   SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN ONLY USE AUTHORITY FOR NAMES
595	   THAT FOLLOW PUNICODE.)

597	   NOTE: The result of parsing into components will be a
598	   correspondence of substrings of the IRI according to the part
599	   matched.  For example, in [HTML5], the protocol components of
600	   interest are SCHEME (scheme), HOST (ireg-name), PORT (port), the PATH
601	   (ipath after the initial "/"), QUERY (iquery), FRAGMENT (ifragment),
602	   and AUTHORITY (iauthority).

604	   Subsequent processing rules are sometimes used to define other
605	   syntactic components.  For example, [HTML5] defines APIs for IRI
606	   processing; in these APIs:

608	   HOSTSPECIFIC  the substring that follows the substring matched by the
609	      iauthority production, or the whole string if the iauthority
610	      production wasn't matched.

612	   HOSTPORT  if there is a scheme component and a port component and the
613	      port given by the port component is different from the default
614	      port defined for the protocol given by the scheme component, then
615	      HOSTPORT is the substring that starts with the substring matched
616	      by the host production and ends with the substring matched by the
617	      port production, and includes the colon in between the two.
618	      Otherwise, it is the same as the host component.
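
	   The HOSTPORT definition above can be sketched in Python as
	   follows.  The DEFAULT_PORTS table and the function name hostport
	   are illustrative assumptions.  The components are taken to be
	   already parsed, with None standing for an absent component.

```python
# Illustrative table of default ports; a real implementation would
# take these from the scheme definitions.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme, host, port):
    """Return HOSTPORT for already-parsed scheme, host, and port strings."""
    if scheme is not None and port:
        # Include the port only if it differs from the scheme default.
        if DEFAULT_PORTS.get(scheme.lower()) != int(port):
            return host + ":" + port
    # Otherwise HOSTPORT is the same as the host component.
    return host
```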

620	3.3.  General percent-encoding of IRI components

622	   For most IRI components, it is possible to map the IRI component to
623	   an equivalent URI component by percent-encoding those characters not
624	   allowed in URIs.  Previous processing steps will have removed some
625	   characters, and the interpretation of reserved characters will have
626	   already been done (with the syntactic reserved characters outside of
627	   the IRI component).  This mapping is defined for all sequences of
628	   Unicode characters, whether or not they are valid for the component
629	   in question.

631	   For each character which is not allowed in a valid URI (NOTE: WHAT IS
632	   THE RIGHT REFERENCE HERE), apply the following steps.

634	   Convert to UTF-8  Convert the character to a sequence of one or more
635	      octets using UTF-8 [RFC3629].

637	   Percent encode  Convert each octet of this sequence to %HH, where HH
638	      is the hexadecimal notation of the octet value.  The hexadecimal
639	      notation SHOULD use uppercase letters.  (This is the general URI
640	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

642	   Note that the mapping is an identity transformation for parsed URI
643	   components of valid URIs, and is idempotent: applying the mapping a
644	   second time will not change anything.
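
	   The two steps above can be sketched in Python as follows.  Since
	   the set of characters "not allowed in a valid URI" is still an
	   open reference in the text, this sketch assumes it means any
	   character that is not in unreserved or reserved and is not "%".
	   The name percent_encode_component is illustrative.

```python
import string

# Characters passed through unchanged (an assumption of this sketch):
URI_CHARS = set(
    string.ascii_letters + string.digits
    + "-._~"           # rest of unreserved
    + ":/?#[]@"        # gen-delims
    + "!$&'()*+,;="    # sub-delims
    + "%"              # introduces an existing percent-encoding
)


def percent_encode_component(component):
    """Percent-encode every character of component not in URI_CHARS."""
    out = []
    for char in component:
        if char in URI_CHARS:
            out.append(char)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.extend("%{:02X}".format(octet)
                       for octet in char.encode("utf-8"))
    return "".join(out)
```

	   Because "%" and all characters produced by the encoding are in
	   URI_CHARS, applying the function a second time changes nothing.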

646	3.4.  Mapping ireg-name

648	   Schemes that allow non-ASCII characters in the reg-name (ireg-name)
649	   position MUST convert the ireg-name component of an IRI as
650	   follows:

652	   Replace the ireg-name part of the IRI by the part converted using the
653	   ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
654	   separated label, and by using U+002E (FULL STOP) as a label
655	   separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
656	   flag AllowUnassigned set to FALSE.  The ToASCII operation may fail,
657	   in which case the IRI cannot be resolved.  If the domain name
658	   conversion fails, then the entire IRI conversion
659	   fails.  Processors that have no mechanism for signalling a failure
660	   MAY instead substitute an otherwise invalid host name, although such
661	   processing SHOULD be avoided.

663	   For example, the IRI
664	   "http://r&#xE9;sum&#xE9;.example.org"
665	   MAY be converted to
666	   "http://xn--rsum-bad.example.org";
667	   conversion to percent-encoded form, e.g.,
668	   "http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed.

670	   Note:  Domain Names may appear in parts of an IRI other than the
671	      ireg-name part.  It is the responsibility of scheme-specific
672	      implementations (if the Internationalized Domain Name is part of
673	      the scheme syntax) or of server-side implementations (if the
674	      Internationalized Domain Name is part of 'iquery') to apply the
675	      necessary conversions at the appropriate point.  Example: Trying
676	      to validate the Web page at
677	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
678	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
679	      example.org, which would convert to a URI of
680	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
681	      example.org.  The server-side implementation is responsible for
682	      making the necessary conversions to be able to retrieve the Web
683	      page.

685	   Note:  In this process, characters allowed in URI references and
686	      existing percent-encoded sequences are not encoded further.  (This
687	      mapping is similar to, but different from, the encoding applied
688	      when arbitrary content is included in some part of a URI.)  For
689	      example, an IRI of
690	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
691	      converted to
692	      "http://www.example.org/red%09ros%C3%A9#red", not to something
693	      like
694	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
695	      ((DESIGN QUESTION: What about e.g.
696	      http://r%C3%A9sum%C3%A9.example.org in an IRI?  Will that get
697	      converted to punycode, or not?))

699	3.5.  Mapping query components

701	   ((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed
702	   HTTP infrastructure, the following special case applies for schemes
703	   "http" and "https" and IRIs whose origin has a document charset other
704	   than one which is UCS-based (e.g., UTF-8 or UTF-16).  In such a case,
705	   the "query" component of an IRI is mapped into a URI by using the
706	   document charset rather than UTF-8 as the binary representation
707	   before pct-encoding.  This mapping is not applied for any other
708	   scheme or component.
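
	   The special case above can be sketched in Python as follows.  The
	   name map_query, the codec-name test for "UCS-based", and the set
	   of characters left unencoded are assumptions of this sketch.  The
	   text does not say what to do with characters that the document
	   charset cannot represent; here the encode call simply raises.

```python
import codecs
import string

# Characters left unencoded (an assumption): unreserved, reserved, "%".
URI_CHARS = set(string.ascii_letters + string.digits
                + "-._~:/?#[]@!$&'()*+,;=%")


def map_query(iquery, scheme, document_charset):
    """Map an iquery component to a URI query component."""
    charset = "utf-8"
    if scheme.lower() in ("http", "https"):
        name = codecs.lookup(document_charset).name
        if not name.startswith("utf-"):
            # Non-UCS document charset: use it instead of UTF-8.
            charset = document_charset
    out = []
    for char in iquery:
        if char in URI_CHARS:
            out.append(char)
        else:
            out.extend("%{:02X}".format(octet)
                       for octet in char.encode(charset))
    return "".join(out)
```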

710	3.6.  Mapping IRIs to URIs

712	   The canonical mapping from an IRI to a URI is defined by applying the
713	   mapping above (from IRI to URI components) and then reassembling a
714	   URI from the parsed URI components using the original punctuation
715	   that delimited the IRI components.

717	3.7.  Converting URIs to IRIs

719	   In some situations, for presentation and further processing, it is
720	   desirable to convert a URI into an equivalent IRI in which natural
721	   characters are represented directly rather than percent encoded.  Of
722	   course, every URI is already an IRI in its own right without any
723	   conversion.  This section gives one procedure for this
724	   conversion.

726	   The conversion described in this section, if given a valid URI, will
727	   result in an IRI that maps back to the URI used as an input for the
728	   conversion (except for potential case differences in percent-encoding
729	   and for potential percent-encoded unreserved characters).  However,
730	   the IRI resulting from this conversion may differ from the original
731	   IRI (if there ever was one).

733	   URI-to-IRI conversion removes percent-encodings, but not all percent-
734	   encodings can be eliminated.  There are several reasons for this:

736	   1. Some percent-encodings are necessary to distinguish percent-
737	      encoded and unencoded uses of reserved characters.

739	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
740	      octets.

742	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
743	      there is a very high probability, but no guarantee, that percent-
744	      encodings that can be interpreted as sequences of UTF-8 octets
745	      actually originated from UTF-8.  For a detailed discussion, see
746	      [Duerst97].)

748	   3. The conversion may result in a character that is not appropriate
749	      in an IRI.  See Section 2.2, Section 4.1, and Section 6.1 for
750	      further details.

752	   4. IRI to URI conversion has different rules for dealing with domain
753	      names and query parameters.

755	   Conversion from a URI to an IRI MAY be done by using the following
756	   steps:

758	   1. Represent the URI as a sequence of octets in US-ASCII.

760	   2. Convert all percent-encodings ("%" followed by two hexadecimal
761	      digits) to the corresponding octets, except those corresponding to
762	      "%", characters in "reserved", and characters in US-ASCII not
763	      allowed in URIs.

765	   3. Re-percent-encode any octet produced in step 2 that is not part of
766	      a strictly legal UTF-8 octet sequence.

768	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
769	      represent characters that are not appropriate according to
770	      Section 2.2, Section 4.1, and Section 6.1.

772	   5. Interpret the resulting octet sequence as a sequence of characters
773	      encoded in UTF-8.

775	   6. URIs known to contain domain names in the reg-name component
776	      SHOULD convert punycode-encoded domain name labels to the
777	      corresponding characters using the ToUnicode procedure.

779	   This procedure will convert as many percent-encoded characters as
780	   possible to characters in an IRI.  Because there are some choices
781	   when step 4 is applied (see Section 6.1), results may vary.

783	   Conversions from URIs to IRIs MUST NOT use any character encoding
784	   other than UTF-8 in steps 3 and 4, even if it might be possible to
785	   guess from the context that a different character encoding was
786	   used in the URI.  For example, the URI
787	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
788	   interpreted to contain two e-acute characters encoded as iso-8859-1.
789	   It must not be converted to an IRI containing these e-acute
790	   characters.  Otherwise, in the future the IRI will be mapped to
791	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
792	   URI from "http://www.example.org/r%E9sum%E9.html".
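
	   Steps 1 through 5 above can be sketched in Python as follows.
	   The name uri_to_iri is illustrative.  The BIDI_FORMATTING set only
	   covers the characters forbidden by Section 4.1 and stands in for
	   the full checks of Section 2.2 and Section 6.1, which this sketch
	   does not implement.  Step 6 (ToUnicode) is omitted.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (Section 4.1); a partial stand-in.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _convert_run(match):
    """Convert one run of consecutive percent-encodings."""
    text = match.group(0)
    triplets = [text[i:i + 3] for i in range(0, len(text), 3)]
    octets = bytes(int(t[1:], 16) for t in triplets)
    out = []
    i = 0
    while i < len(octets):
        char, width = None, 1
        for n in (1, 2, 3, 4):  # find one strictly legal UTF-8 sequence
            try:
                char = octets[i:i + n].decode("utf-8")
                width = n
                break
            except UnicodeDecodeError:
                continue
        if char is None:
            # Step 3: octet is not part of a legal UTF-8 sequence.
            out.append("%{:02X}".format(octets[i]))
        elif width == 1:
            # Step 2: "%", reserved, and other ASCII stay as written.
            out.append(char if char in UNRESERVED else triplets[i])
        elif ord(char) in BIDI_FORMATTING:
            # Step 4: character is not appropriate in an IRI.
            out.extend("%{:02X}".format(b) for b in octets[i:i + width])
        else:
            # Step 5: keep the character itself.
            out.append(char)
        i += width
    return "".join(out)


def uri_to_iri(uri):
    """Apply steps 1 through 5 of the URI-to-IRI conversion."""
    uri.encode("ascii")  # step 1: raises if the URI is not US-ASCII
    return PCT_RUN.sub(_convert_run, uri)
```

	   Only runs of percent-encodings need to be examined, because any
	   literal character of a URI is US-ASCII and so cannot be part of a
	   multi-octet UTF-8 sequence.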

794	3.7.1.  Examples

796	   This section shows various examples of converting URIs to IRIs.  Each
797	   example shows the result after each of the steps 1 through 6 is
798	   applied.  XML Notation is used for the final result.  Octets are
799	   denoted by "<" followed by two hexadecimal digits followed by ">".

801	   The following example contains the sequence "%C3%BC", which is a
802	   strictly legal UTF-8 sequence, and which is converted into the actual
803	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
804	   u-umlaut).

806	   1. http://www.example.org/D%C3%BCrst

808	   2. http://www.example.org/D<c3><bc>rst

810	   3. http://www.example.org/D<c3><bc>rst

812	   4. http://www.example.org/D<c3><bc>rst

814	   5. http://www.example.org/D&#xFC;rst

816	   6. http://www.example.org/D&#xFC;rst

818	   The following example contains the sequence "%FC", which might
819	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
820	   iso-8859-1 character encoding.  (It might represent other characters
821	   in other character encodings.  For example, the octet <fc> in iso-
822	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
823	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
824	   encoded in step 3.

826	   1. http://www.example.org/D%FCrst

828	   2. http://www.example.org/D<fc>rst

830	   3. http://www.example.org/D%FCrst

832	   4. http://www.example.org/D%FCrst

834	   5. http://www.example.org/D%FCrst

836	   6. http://www.example.org/D%FCrst

838	   The following example contains "%e2%80%ae", which is the percent-
839	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
840	   Section 4.1 forbids the direct use of this character in an IRI.
841	   Therefore, the corresponding octets are re-percent-encoded in step 4.
844	   This example shows that the case (upper- or lowercase) of letters
845	   used in percent-encodings may not be preserved.  The example also
846	   contains a punycode-encoded domain name label (xn--99zt52a), which is
847	   converted by the ToUnicode procedure in step 6.

849	   1. http://xn--99zt52a.example.org/%e2%80%ae

851	   2. http://xn--99zt52a.example.org/<e2><80><ae>

853	   3. http://xn--99zt52a.example.org/<e2><80><ae>

855	   4. http://xn--99zt52a.example.org/%E2%80%AE

857	   5. http://xn--99zt52a.example.org/%E2%80%AE

859	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

861	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
862	   (Japanese Natto).  ((EDITOR NOTE: There is some inconsistency in this
863	   note.))

865	4.  Bidirectional IRIs for Right-to-Left Languages

867	   Some UCS characters, such as those used in the Arabic and Hebrew
868	   scripts, have an inherent right-to-left (rtl) writing direction.
869	   IRIs containing these characters (called bidirectional IRIs or Bidi
870	   IRIs) require additional attention because of the non-trivial
871	   relation between logical representation (used for digital
872	   representation and for reading/spelling) and visual representation
873	   (used for display/printing).

875	   Because of the complex interaction between the logical
876	   representation, the visual representation, and the syntax of a Bidi
877	   IRI, a balance is needed between various requirements.  The main
878	   requirements are

880	   1. user-predictable conversion between visual and logical
881	      representation;

883	   2. the ability to include a wide range of characters in various parts
884	      of the IRI; and

886	   3. minor or no changes or restrictions for implementations.

888	4.1.  Logical Storage and Visual Presentation

890	   When stored or transmitted in digital representation, bidirectional
891	   IRIs MUST be in full logical order and MUST conform to the IRI syntax
892	   rules (which includes the rules relevant to their scheme).  This
893	   ensures that bidirectional IRIs can be processed in the same way as
894	   other IRIs.

896	   Bidirectional IRIs MUST be rendered by using the Unicode
897	   Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
898	   rendered in the same way as they would be if they were in a left-to-
899	   right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
900	   RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
901	   FORMATTING (PDF).  Setting the embedding direction can also be done
902	   in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).

904	   There is no requirement to use the above embedding if the display is
905	   still the same without the embedding.  For example, a bidirectional
906	   IRI in a text with left-to-right base directionality (such as used
907	   for English or Cyrillic) that is preceded and followed by whitespace
908	   and strong left-to-right characters does not need an embedding.
909	   Also, a bidirectional relative IRI reference that only contains
910	   strong right-to-left characters and weak characters and that starts
911	   and ends with a strong right-to-left character and appears in a text
912	   with right-to-left base directionality (such as used for Arabic or
913	   Hebrew) and is preceded and followed by whitespace and strong
914	   characters does not need an embedding.

916	   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
917	   sufficient to force the correct display behavior.  However, the
918	   details of the Unicode Bidirectional algorithm are not always easy to
919	   understand.  Implementers are strongly advised to err on the side of
920	   caution and to use embedding in all cases where they are not
921	   completely sure that the display behavior is unaffected without the
922	   embedding.

924	   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
925	   higher-level protocols to influence bidirectional rendering.  Such
926	   changes by higher-level protocols MUST NOT be used if they change the
927	   rendering of IRIs.

929	   The bidirectional formatting characters that may be used before or
930	   after the IRI to ensure correct display are not themselves part of
931	   the IRI.  IRIs MUST NOT contain bidirectional formatting characters
932	   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  They affect the visual
933	   rendering of the IRI but do not appear themselves.  It would
934	   therefore not be possible to input an IRI with such characters
935	   correctly.
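
	   The embedding rule and the prohibition on formatting characters
	   above can be sketched in Python as follows.  The function names
	   are illustrative.  The sketch always adds the embedding, which is
	   the cautious choice the text recommends when in doubt.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_FORMATTING = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"


def contains_bidi_formatting(iri):
    """True if the IRI contains a forbidden formatting character."""
    return any(char in BIDI_FORMATTING for char in iri)


def embed_for_display(iri):
    """Wrap an IRI in LRE ... PDF for display in running text.

    The added characters surround the IRI and are not part of it.
    """
    if contains_bidi_formatting(iri):
        raise ValueError("IRI contains bidirectional formatting characters")
    return LRE + iri + PDF
```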

937	4.2.  Bidi IRI Structure

939	   The Unicode Bidirectional Algorithm is designed mainly for running
940	   text.  To make sure that it does not affect the rendering of
941	   bidirectional IRIs too much, some restrictions on bidirectional IRIs
942	   are necessary.  These restrictions are given in terms of delimiters
943	   (structural characters, mostly punctuation such as "@", ".", ":", and
944	   "/") and components (usually consisting mostly of letters and
945	   digits).

947	   The following syntax rules from Section 2.2 correspond to components
948	   for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
949	   isegment-nz, isegment-nz-nc, iquery, and ifragment.

951	   Specifications that define the syntax of any of the above components
952	   MAY divide them further and define smaller parts to be components
953	   according to this document.  As an example, the restrictions of
954	   [RFC3490] on bidirectional domain names correspond to treating each
955	   label of a domain name as a component for schemes with ireg-name as a
956	   domain name.  Even where the components are not defined formally, it
957	   may be helpful to think about some syntax in terms of components and
958	   to apply the relevant restrictions.  For example, for the usual name/
959	   value syntax in query parts, it is convenient to treat each name and
960	   each value as a component.  As another example, the extensions in a
961	   resource name can be treated as separate components.

963	   For each component, the following restrictions apply:

965	   1. A component SHOULD NOT use both right-to-left and left-to-right
966	      characters.

968	   2. A component using right-to-left characters SHOULD start and end
969	      with right-to-left characters.

971	   The above restrictions are given as "SHOULD"s, rather than as
972	   "MUST"s.  For IRIs that are never presented visually, they are not
973	   relevant.  However, for IRIs in general, they are very important to
974	   ensure consistent conversion between visual presentation and logical
975	   representation, in both directions.
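
	   The two restrictions above can be checked per component with a
	   short Python sketch.  It assumes that "right-to-left characters"
	   means Unicode bidirectional classes R and AL and that
	   "left-to-right characters" means class L.  The function name
	   bidi_component_problems is illustrative.

```python
import unicodedata

RTL_CLASSES = ("R", "AL")


def bidi_component_problems(component):
    """Return a list of the restrictions that the component violates."""
    classes = [unicodedata.bidirectional(char) for char in component]
    has_rtl = any(cls in RTL_CLASSES for cls in classes)
    has_ltr = "L" in classes
    problems = []
    if has_rtl and has_ltr:
        problems.append("uses both right-to-left and left-to-right "
                        "characters")
    if has_rtl and not (classes[0] in RTL_CLASSES
                        and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with right-to-left "
                        "characters")
    return problems
```

	   A component such as U+05D0 U+05D1 followed by the digit "1" is
	   reported under the second restriction only, which corresponds to
	   the situation of Example 8 in Section 4.4.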

977	   Note:  In some components, the above restrictions may actually be
978	      strictly enforced.  For example, [RFC3490] requires that these
979	      restrictions apply to the labels of a host name for those schemes
980	      where ireg-name is a host name.  In some other components (for
981	      example, path components) following these restrictions may not be
982	      too difficult.  For other components, such as parts of the query
983	      part, it may be very difficult to enforce the restrictions because
984	      the values of query parameters may be arbitrary character
985	      sequences.

987	   If the above restrictions cannot be satisfied otherwise, the affected
988	   component can always be mapped to URI notation as described in
989	   Section 3.3.  Please note that the whole component has to be mapped
990	   (see also Example 9 below).

992	4.3.  Input of Bidi IRIs

994	   Bidi input methods MUST generate Bidi IRIs in logical order while
995	   rendering them according to Section 4.1.  During input, rendering
996	   SHOULD be updated after every new character is input to avoid end-
997	   user confusion.

999	4.4.  Examples

1001	   This section gives examples of bidirectional IRIs, in Bidi Notation.
1002	   It shows legal IRIs with the relationship between logical and visual
1003	   representation and explains how certain phenomena in this
1004	   relationship may look strange to somebody not familiar with
1005	   bidirectional behavior, but familiar to users of Arabic and Hebrew.
1006	   It also shows what happens if the restrictions given in Section 4.2
1007	   are not followed.  The examples below can be seen at [BidiEx], in
1008	   Arabic, Hebrew, and Bidi Notation variants.

1010	   To read the bidi text in the examples, read the visual representation
1011	   from left to right until you encounter a block of rtl text.  Read the
1012	   rtl block (including slashes and other special characters) from right
1013	   to left, then continue at the next unread ltr character.

1015	   Example 1: A single component with rtl characters is inverted:
1016	   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
1017	   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
1018	   Components can be read one by one, and each component can be read in
1019	   its natural direction.

1021	   Example 2: More than one consecutive component with rtl characters is
1022	   inverted as a whole:
1023	   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
1024	   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
1025	   A sequence of rtl components is read rtl, in the same way as a
1026	   sequence of rtl words is read rtl in a bidi text.

1028	   Example 3: All components of an IRI (except for the scheme) are rtl.
1029	   All rtl components are inverted overall:
1030	   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
1031	   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
1032	   The whole IRI (except the scheme) is read rtl.  Delimiters between
1033	   rtl components stay between the respective components; delimiters
1034	   between ltr and rtl components don't move.

1036	   Example 4: Each of several sequences of rtl components is inverted on
1037	   its own:
1038	   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
1039	   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
1040	   Each sequence of rtl components is read rtl, in the same way as each
1041	   sequence of rtl words in an ltr text is read rtl.

1043	   Example 5: Example 2, applied to components of different kinds:
1044	   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1045	   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
1046	   The inversion of the domain name label and the path component may be
1047	   unexpected, but it is consistent with other bidi behavior.  For
1048	   reassurance that the domain component really is "ab.cd.EF", it may be
1049	   helpful to read aloud the visual representation following the bidi
1050	   algorithm.  After "http://ab.cd." one reads the RTL block
1051	   "E-F-slash-G-H", which corresponds to the logical representation.

1053	   Example 6: Same as Example 5, with more rtl components:
1054	   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
1055	   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
1056	   The inversion of the domain name labels and the path components may
1057	   be easier to identify because the delimiters also move.

1059	   Example 7: A single rtl component includes digits:
1060	   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
1061	   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
1062	   Numbers are written ltr in all cases but are treated as an additional
1063	   embedding inside a run of rtl characters.  This is completely
1064	   consistent with usual bidirectional text.

1066	   Example 8 (not allowed): Numbers are at the start or end of an rtl
1067	   component:
1068	   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
1069	   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
1070	   The sequence "1/2" is interpreted by the bidi algorithm as a
1071	   fraction, fragmenting the components and leading to confusion.  There
1072	   are other characters that are interpreted in a special way close to
1073	   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

1075	   Example 9 (not allowed): The numbers in the previous example are
1076	   percent-encoded:
1077	   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
1078	   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

1080	   Example 10 (allowed but not recommended):
1082	   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
1083	   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
1084	   Components consisting of only numbers are allowed (it would be rather
1085	   difficult to prohibit them), but these may interact with adjacent RTL
1086	   components in ways that are not easy to predict.

1088	   Example 11 (allowed but not recommended):
1089	   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
1090	   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
1091	   Components consisting of numbers and left-to-right characters are
1092	   allowed, but these may interact with adjacent RTL components in ways
1093	   that are not easy to predict.

1095	5.  Normalization and Comparison

1097	   Note:  The structure and much of the material for this section is
1098	      taken from section 6 of [RFC3986]; the differences are due to the
1099	      specifics of IRIs.

1101	   One of the most common operations on IRIs is simple comparison:
1102	   Determining whether two IRIs are equivalent, without using the IRIs
1103	   to access their respective resource(s).  A comparison is performed
1104	   whenever a response cache is accessed, a browser checks its history
1105	   to color a link, or an XML parser processes tags within a namespace.
1106	   Extensive normalization prior to comparison of IRIs may be used by
1107	   spiders and indexing engines to prune a search space or reduce
1108	   duplication of request actions and response storage.

1110	   IRI comparison is performed for some particular purpose.  Protocols
1111	   or implementations that compare IRIs for different purposes will
1112	   often be subject to differing design trade-offs with regard to how
1113	   much effort should be spent in reducing aliased identifiers.  This
1114	   section describes various methods that may be used to compare IRIs,
1115	   the trade-offs between them, and the types of applications that might
1116	   use them.

1118	5.1.  Equivalence

1120	   Because IRIs exist to identify resources, presumably they should be
1121	   considered equivalent when they identify the same resource.  However,
1122	   this definition of equivalence is not of much practical use, as there
1123	   is no way for an implementation to compare two resources to determine
1124	   if they are "the same" unless it has full knowledge or control of
1125	   them.  For this reason, determination of equivalence or difference of
1126	   IRIs is based on string comparison, perhaps augmented by reference to
1127	   additional rules provided by URI scheme definitions.  We use the
1128	   terms "different" and "equivalent" to describe the possible outcomes
1129	   of such comparisons, but there are many application-dependent
1130	   versions of equivalence.

1132	   Even when it is possible to determine that two IRIs are equivalent,
1133	   IRI comparison is not sufficient to determine whether two IRIs
1134	   identify different resources.  For example, an owner of two different
1135	   domain names could decide to serve the same resource from both,
1136	   resulting in two different IRIs.  Therefore, comparison methods are
1137	   designed to minimize false negatives while strictly avoiding false
1138	   positives.

1140	   In testing for equivalence, applications should not directly compare
1141	   relative references; the references should be converted to their
1142	   respective target IRIs before comparison.  When IRIs are compared to
1143	   select (or avoid) a network action, such as retrieval of a
1144	   representation, fragment components (if any) should be excluded from
1145	   the comparison.
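
   The two precautions above can be sketched as follows.  This is a
   minimal illustration using Python's standard urllib.parse module;
   the helper name comparison_key and the example addresses are
   assumptions made for this sketch, not part of this specification.

```python
from urllib.parse import urljoin, urldefrag


def comparison_key(base: str, reference: str) -> str:
    """Resolve a (possibly relative) reference, then drop any fragment."""
    target = urljoin(base, reference)  # relative reference -> target IRI
    return urldefrag(target).url       # exclude the fragment component


base = "http://example.org/docs/index.html"
# A relative reference with a fragment and the absolute form of the
# same target yield the same key for network-action purposes.
assert comparison_key(base, "guide.html#intro") == comparison_key(
    base, "http://example.org/docs/guide.html"
)
```

   Dropping the fragment is only appropriate when the comparison is
   used to select or avoid a network action; an application comparing
   identity tokens would keep it.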

1147	   Applications using IRIs as identity tokens with no relationship to a
1148	   protocol MUST use the Simple String Comparison (see Section 5.3.1).
1149	   All other applications MUST select one of the comparison practices
1150	   from the Comparison Ladder (see Section 5.3).

1152	5.2.  Preparation for Comparison

1154	   Any kind of IRI comparison REQUIRES that any additional contextual
1155	   processing is first performed, including undoing higher-level
1156	   escapings or encodings in the protocol or format that carries an IRI.
1157	   This preprocessing is usually done when the protocol or format is
1158	   parsed.

1160	   Examples of contextual preprocessing steps are described in
1161	   Section 7.

1163	   Examples of such escapings or encodings are entities and numeric
1164	   character references in [HTML4] and [XML1].  As an example,
1165	   "http://example.org/ros&eacute;" (in HTML),
1166	   "http://example.org/ros&#233;" (in HTML or XML), and
1167	   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
1168	   what is denoted in this document (see Section 1.4) as
1169	   "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
1170	   actual e-acute character, to compensate for the fact that this
1171	   document cannot contain non-ASCII characters).
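
   As a rough illustration of this step, the sketch below undoes the
   character references with Python's html.unescape, which follows the
   HTML5 rules and is therefore only an approximation of what an
   [HTML4] or [XML1] parser would do.  The e-acute is written as the
   escape "\u00e9" so that the example stays in ASCII.

```python
import html

carried = [
    "http://example.org/ros&eacute;",  # named entity (HTML)
    "http://example.org/ros&#233;",    # decimal character reference
    "http://example.org/ros&#xE9;",    # hexadecimal character reference
]

# Undo the format-level escaping before any IRI comparison.
resolved = {html.unescape(text) for text in carried}

# All three spellings denote one and the same IRI.
assert resolved == {"http://example.org/ros\u00e9"}
```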

1173	   Similar considerations apply to encodings such as Transfer Codings in
1174	   HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
1175	   ([RFC2045]), although in these cases, the encoding is based not on
1176	   characters but on octets, and additional care is required to make
1177	   sure that characters, and not just arbitrary octets, are compared
1178	   (see Section 5.3.1).

1180	5.3.  Comparison Ladder

1182	   In practice, a variety of methods are used to test IRI equivalence.
1183	   These methods fall into a range distinguished by the amount of
1184	   processing required and the degree to which the probability of false
1185	   negatives is reduced.  As noted above, false negatives cannot be
1186	   eliminated.  In practice, their probability can be reduced, but this
1187	   reduction requires more processing and is not cost-effective for all
1188	   applications.

1190	   If this range of comparison practices is considered as a ladder, the
1191	   following discussion will climb the ladder, starting with practices
1192	   that are cheap but have a relatively higher chance of producing false
1193	   negatives, and proceeding to those that have higher computational
1194	   cost and lower risk of false negatives.

1196	5.3.1.  Simple String Comparison

1198	   If two IRIs, when considered as character strings, are identical,
1199	   then it is safe to conclude that they are equivalent.  This type of
1200	   equivalence test has very low computational cost and is in wide use
1201	   in a variety of applications, particularly in the domain of parsing.
1202	   It is also used when a definitive answer to the question of IRI
1203	   equivalence is needed that is independent of the scheme used and that
1204	   can be calculated quickly and without accessing a network.  An
1205	   example of such a case is XML Namespaces ([XMLNamespace]).

1207	   Testing strings for equivalence requires some basic precautions.
1208	   This procedure is often referred to as "bit-for-bit" or "byte-for-
1209	   byte" comparison, which is potentially misleading.  Testing strings
1210	   for equality is normally based on pair comparison of the characters
1211	   that make up the strings, starting from the first and proceeding
1212	   until both strings are exhausted and all characters are found to be
1213	   equal, until a pair of characters compares unequal, or until one of
1214	   the strings is exhausted before the other.

1216	   This character comparison requires that each pair of characters be
1217	   put in comparable encoding form.  For example, should one IRI be
1218	   stored in a byte array in UTF-8 encoding form and the second in a
1219	   UTF-16 encoding form, bit-for-bit comparisons applied naively will
1220	   produce errors.  It is better to speak of equality on a character-
1221	   for-character rather than on a byte-for-byte or bit-for-bit basis.
1222	   In practical terms, character-by-character comparisons should be done
1223	   codepoint by codepoint after conversion to a common character
1224	   encoding form.  When comparing character by character, the comparison
1225	   function MUST NOT map IRIs to URIs, because such a mapping would
1226	   create additional spurious equivalences.  It follows that an IRI
1227	   SHOULD NOT be modified when being transported if there is any chance
1228	   that this IRI might be used in a context that uses Simple String
1229	   Comparison.
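
   The difference between octet comparison and character comparison
   can be shown with a short sketch.  The helper name same_characters
   is an assumption of this example; Python compares str values code
   point by code point, which is the behavior wanted here.

```python
def same_characters(a: bytes, a_encoding: str,
                    b: bytes, b_encoding: str) -> bool:
    """Compare two stored IRIs code point by code point."""
    # Convert both octet sequences to a common form (Python str)
    # before comparing; no IRI-to-URI mapping is applied.
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
stored_utf8 = iri.encode("utf-8")
stored_utf16 = iri.encode("utf-16-be")

assert stored_utf8 != stored_utf16  # naive byte-for-byte test fails
assert same_characters(stored_utf8, "utf-8", stored_utf16, "utf-16-be")
```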

1231	   False negatives are caused by the production and use of IRI aliases.
1232	   Unnecessary aliases can be reduced, regardless of the comparison
1233	   method, by consistently providing IRI references in an already
1234	   normalized form (i.e., a form identical to what would be produced
1235	   after normalization is applied, as described below).  Protocols and
1236	   data formats often limit some IRI comparisons to simple string
1237	   comparison, based on the theory that people and implementations will,
1238	   in their own best interest, be consistent in providing IRI
1239	   references, or at least be consistent enough to negate any efficiency
1240	   that might be obtained from further normalization.

1242	5.3.2.  Syntax-Based Normalization

1244	   Implementations may use logic based on the definitions provided by
1245	   this specification to reduce the probability of false negatives.
1246	   This processing is moderately higher in cost than character-for-
1247	   character string comparison.  For example, an application using this
1248	   approach could reasonably consider the following two IRIs equivalent:

1250	      example://a/b/c/%7Bfoo%7D/ros&#xE9;
1251	      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

1253	   Web user agents, such as browsers, typically apply this type of IRI
1254	   normalization when determining whether a cached response is
1255	   available.  Syntax-based normalization includes such techniques as
1256	   case normalization, character normalization, percent-encoding
1257	   normalization, and removal of dot-segments.

1259	5.3.2.1.  Case Normalization

1261	   For all IRIs, the hexadecimal digits within a percent-encoding
1262	   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1263	   should be normalized to use uppercase letters for the digits A-F.

1265	   When an IRI uses components of the generic syntax, the component
1266	   syntax equivalence rules always apply; namely, that the scheme and
1267	   US-ASCII only host are case insensitive and therefore should be
1268	   normalized to lowercase.  For example, the URI
1269	   "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
1270	   Case equivalence for non-ASCII characters in IRI components that are
1271	   IDNs is discussed in Section 5.3.3.  The other generic syntax
1272	   components are assumed to be case sensitive unless specifically
1273	   defined otherwise by the scheme.
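
   A minimal sketch of these two rules follows.  The function name
   normalize_case and the simplified splitting expression are
   assumptions of this example; only the scheme and the US-ASCII
   letters of the host are lowercased, and userinfo, path, query, and
   fragment are left untouched apart from the hexadecimal digits of
   percent-encoding triplets.

```python
import re

# Simplified split: optional scheme, optional authority, and the rest.
_GENERIC = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.S)


def normalize_case(iri: str) -> str:
    scheme, authority, rest = _GENERIC.match(iri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        # Lowercase US-ASCII letters only; non-ASCII is not touched.
        hostport = re.sub(r"[A-Z]+", lambda m: m.group(0).lower(), hostport)
        out += "//" + userinfo + at + hostport
    out += rest
    # Uppercase the hex digits of every percent-encoding triplet.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), out)


assert normalize_case("HTTP://www.EXAMPLE.com/") == "http://www.example.com/"
```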

1275	   Creating schemes that allow case-insensitive syntax components
1276	   containing non-ASCII characters should be avoided.  Case
1277	   normalization of non-ASCII characters can be culturally dependent and
1278	   is always a complex operation.  The only exception concerns non-ASCII
1279	   host names for which the character normalization includes a mapping
1280	   step derived from case folding.

1282	5.3.2.2.  Character Normalization

1284	   The Unicode Standard [UNIV4] defines various equivalences between
1285	   sequences of characters for various purposes.  Unicode Standard Annex
1286	   #15 [UTR15] defines various Normalization Forms for these
1287	   equivalences, in particular Normalization Form C (NFC, Canonical
1288	   Decomposition, followed by Canonical Composition) and Normalization
1289	   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
1290	   Composition).

1292	   IRIs already in Unicode MUST NOT be normalized before parsing or
1293	   interpreting.  In many non-Unicode character encodings, some text
1294	   cannot be represented directly.  For example, the word "Vietnam" is
1295	   natively written "Vi&#x1EC7;t Nam" (containing a LATIN SMALL LETTER E
1296	   WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct transcoding from
1297	   the windows-1258 character encoding leads to "Vi&#xEA;&#x323;t Nam"
1298	   (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX followed by a
1299	   COMBINING DOT BELOW).  Direct transcoding of other 8-bit encodings of
1300	   Vietnamese may lead to other representations.

1302	   Equivalence of IRIs MUST rely on the assumption that IRIs are
1303	   appropriately pre-character-normalized rather than apply character
1304	   normalization when comparing two IRIs.  The exceptions are conversion
1305	   from a non-digital form, and conversion from a non-UCS-based
1306	   character encoding to a UCS-based character encoding.  In these
1307	   cases, NFC or a normalizing transcoder using NFC MUST be used for
1308	   interoperability.  To avoid false negatives and problems with
1309	   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
1310	   avoid even more problems; for example, by choosing half-width Latin
1311	   letters instead of full-width ones, and full-width instead of half-
1312	   width Katakana.

1314	   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
1315	   Notation) is in NFC.  On the other hand,
1316	   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

1318	   The former uses precombined e-acute characters, and the latter uses
1319	   "e" characters followed by combining acute accents.  Both usages are
1320	   defined as canonically equivalent in [UNIV4].
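
   The two forms can be checked with Python's unicodedata module, as
   in the sketch below.  The helper name is_nfc is an assumption of
   this example.  Note that the check is something a creator of IRIs
   would run; it is not a step to apply during comparison.

```python
import unicodedata

precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text


assert is_nfc(precomposed)
assert not is_nfc(decomposed)
# The two strings differ under Simple String Comparison ...
assert precomposed != decomposed
# ... although they are canonically equivalent.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```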

1322	   Note:  Because it is unknown how a particular sequence of characters
1323	      is being treated with respect to character normalization, it would
1324	      be inappropriate to allow third parties to normalize an IRI
1325	      arbitrarily.  This does not contradict the recommendation that
1326	      when a resource is created, its IRI should be as character
1327	      normalized as possible (i.e., NFC or even NFKC).  This is similar
1328	      to the uppercase/lowercase problems.  Some parts of a URI are case
1329	      insensitive (for example, the domain name).  For others, it is
1330	      unclear whether they are case sensitive, case insensitive, or
1331	      something in between (e.g., case sensitive, but with a multiple
1332	      choice selection if the wrong case is used, instead of a direct
1333	      negative result).  The best recipe is that the creator use a
1334	      reasonable capitalization and, when transferring the URI,
1335	      capitalization never be changed.

1337	   Various IRI schemes may allow the usage of Internationalized Domain
1338	   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
1339	   Character Normalization also applies to IDNs, as discussed in
1340	   Section 5.3.3.

1342	5.3.2.3.  Percent-Encoding Normalization

1344	   The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a
1345	   frequent source of variance among otherwise identical IRIs.  In
1346	   addition to the case normalization issue noted above, some IRI
1347	   producers percent-encode octets that do not require percent-encoding,
1348	   resulting in IRIs that are equivalent to their nonencoded
1349	   counterparts.  These IRIs should be normalized by decoding any
1350	   percent-encoded octet sequence that corresponds to an unreserved
1351	   character, as described in section 2.3 of [RFC3986].

1353	   For actual resolution, differences in percent-encoding (except for
1354	   the percent-encoding of reserved characters) MUST always result in
1355	   the same resource.  For example, "http://example.org/~user",
1356	   "http://example.org/%7euser", and "http://example.org/%7Euser" must
1357	   resolve to the same resource.
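
   One way to sketch the decoding of percent-encoded unreserved
   characters is shown below.  The function name decode_unreserved is
   an assumption of this example; the unreserved set (letters, digits,
   "-", ".", "_", "~") is the one given in Section 2.3 of [RFC3986].
   Triplets that do not encode an unreserved character are kept, with
   their hexadecimal digits in upper case.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(iri: str) -> str:
    def replace(match: "re.Match") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                # safe to decode
        return match.group(0).upper()  # keep, with uppercase hex digits

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, iri)


forms = ["http://example.org/~user",
         "http://example.org/%7euser",
         "http://example.org/%7Euser"]
assert {decode_unreserved(f) for f in forms} == {"http://example.org/~user"}
```

   As with the other normalizations, the result is meant for local
   comparison only; the original form of the IRI is what gets passed
   on.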

1359	   If this kind of equivalence is to be tested, the percent-encoding of
1360	   both IRIs to be compared has to be aligned; for example, by
1361	   converting both IRIs to URIs (see Section 3.1), eliminating escape
1362	   differences in the resulting URIs, and making sure that the case of
1363	   the hexadecimal characters in the percent-encoding is always the same
1364	   (preferably upper case).  If the IRI is to be passed to another
1365	   application or used further in some other way, its original form MUST
1366	   be preserved.  The conversion described here should be performed only
1367	   for local comparison.

1369	5.3.2.4.  Path Segment Normalization

1371	   The complete path segments "." and ".." are intended only for use
1372	   within relative references (Section 4.1 of [RFC3986]) and are removed
1373	   as part of the reference resolution process (Section 5.2 of
1374	   [RFC3986]).  However, some implementations may incorrectly assume
1375	   that reference resolution is not necessary when the reference is
1376	   already an IRI, and thus fail to remove dot-segments when they occur
1377	   in non-relative paths.  IRI normalizers should remove dot-segments by
1378	   applying the remove_dot_segments algorithm to the path, as described
1379	   in Section 5.2.4 of [RFC3986].
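
   The sketch below is one possible transcription of the
   remove_dot_segments steps of Section 5.2.4 of [RFC3986] into
   Python.  It is an illustration, not a reference implementation;
   the branch order follows the steps of that algorithm.

```python
def remove_dot_segments(path: str) -> str:
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../" becomes "/"
            if output:
                output.pop()           # drop the last output segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


assert remove_dot_segments("/a/b/c/./../../g") == "/a/g"
```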

1381	5.3.3.  Scheme-Based Normalization

1383	   The syntax and semantics of IRIs vary from scheme to scheme, as
1384	   described by the defining specification for each scheme.
1385	   Implementations may use scheme-specific rules, at further processing
1386	   cost, to reduce the probability of false negatives.  For example,
1387	   because the "http" scheme makes use of an authority component, has a
1388	   default port of "80", and defines an empty path to be equivalent to
1389	   "/", the following four IRIs are equivalent:

1391	      http://example.com
1392	      http://example.com/
1393	      http://example.com:/
1394	      http://example.com:80/

1396	   In general, an IRI that uses the generic syntax for authority with an
1397	   empty path should be normalized to a path of "/".  Likewise, an
1398	   explicit ":port", for which the port is empty or the default for the
1399	   scheme, is equivalent to one where the port and its ":" delimiter are
1400	   elided and thus should be removed by scheme-based normalization.  For
1401	   example, the second IRI above is the normal form for the "http"
1402	   scheme.
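
   For the "http" scheme, these two rules could be sketched as
   follows.  The function name normalize_http is an assumption of this
   example, and the sketch assumes that case normalization of the
   scheme has already been applied.  It does not remove any other
   delimiter.

```python
import re

# authority, path, and everything from "?" or "#" onwards
_HTTP = re.compile(r"^http://([^/?#]*)([^?#]*)(.*)$", re.S)


def normalize_http(iri: str) -> str:
    match = _HTTP.match(iri)
    if match is None:
        return iri                       # not an "http" IRI; leave alone
    authority, path, rest = match.groups()
    # Remove an empty port or the default port together with its ":".
    authority = re.sub(r":(80)?$", "", authority)
    if path == "":
        path = "/"                       # empty path is equivalent to "/"
    return "http://" + authority + path + rest


forms = ["http://example.com", "http://example.com/",
         "http://example.com:/", "http://example.com:80/"]
assert {normalize_http(f) for f in forms} == {"http://example.com/"}
```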

1404	   Another case where normalization varies by scheme is in the handling
1405	   of an empty authority component or empty host subcomponent.  For many
1406	   scheme specifications, an empty authority or host is considered an
1407	   error; for others, it is considered equivalent to "localhost" or the
1408	   end-user's host.  When a scheme defines a default for authority and
1409	   an IRI reference to that default is desired, the reference should be
1410	   normalized to an empty authority for the sake of uniformity, brevity,
1411	   and internationalization.  If, however, either the userinfo or port
1412	   subcomponents are non-empty, then the host should be given explicitly
1413	   even if it matches the default.

1415	   Normalization should not remove delimiters when their associated
1416	   component is empty unless it is licensed to do so by the scheme
1417	   specification.  For example, the IRI "http://example.com/?" cannot be
1418	   assumed to be equivalent to any of the examples above.  Likewise, the
1419	   presence or absence of delimiters within a userinfo subcomponent is
1420	   usually significant to its interpretation.  The fragment component is
1421	   not subject to any scheme-based normalization; thus, two IRIs that
1422	   differ only by the suffix "#" are considered different regardless of
1423	   the scheme.

1425	   ((NOTE: THIS NEEDS TO BE UPDATED TO DEAL WITH IDNA8)) Some IRI
1426	   schemes may allow the usage of Internationalized Domain Names (IDN)
1427	   [RFC3490] either in their ireg-name part or elsewhere.  When in use
1428	   in IRIs, those names SHOULD be validated by using the ToASCII
1429	   operation defined in [RFC3490], with the flags "UseSTD3ASCIIRules"
1430	   and "AllowUnassigned".  An IRI containing an invalid IDN cannot
1431	   successfully be resolved.  Validated IDN components of IRIs SHOULD be
1432	   character normalized by using the Nameprep process [RFC3491];
1433	   however, for legibility purposes, they SHOULD NOT be converted into
1434	   ASCII Compatible Encoding (ACE).

1436	   Scheme-based normalization may also consider IDN components and their
1437	   conversions to punycode as equivalent.  As an example,
1438	   "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
1439	   "http://xn--rsum-bpad.example.org".
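
   This equivalence can be tested with the "idna" codec in Python's
   standard library, which implements the ToASCII operation of
   [RFC3490].  The helper name to_ace is an assumption of this sketch.
   The conversion serves comparison only; as stated above, the
   readable form is the one to keep.

```python
def to_ace(host: str) -> str:
    """Convert a host name to its ASCII Compatible Encoding."""
    return host.encode("idna").decode("ascii")


readable = "r\u00e9sum\u00e9.example.org"
ace = "xn--rsum-bpad.example.org"

# Both spellings map to the same ACE form and may be treated as
# equivalent by scheme-based normalization.
assert to_ace(readable) == to_ace(ace) == ace
```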

1441	   Other scheme-specific normalizations are possible.

1443	5.3.4.  Protocol-Based Normalization

1445	   Substantial effort to reduce the incidence of false negatives is
1446	   often cost-effective for web spiders.  Consequently, they implement
1447	   even more aggressive techniques in IRI comparison.  For example, if
1448	   they observe that an IRI such as

1450	      http://example.com/data

1452	   redirects to an IRI differing only in the trailing slash

1454	      http://example.com/data/

1456	   they will likely regard the two as equivalent in the future.  This
1457	   kind of technique is only appropriate when equivalence is clearly
1458	   indicated by both the result of accessing the resources and the
1459	   common conventions of their scheme's dereference algorithm (in this
1460	   case, use of redirection by HTTP origin servers to avoid problems
1461	   with relative references).

1463	6.  Use of IRIs

1465	6.1.  Limitations on UCS Characters Allowed in IRIs

1467	   This section discusses limitations on characters and character
1468	   sequences usable for IRIs beyond those given in Section 2.2 and
1469	   Section 4.1.  The considerations in this section are relevant when
1470	   IRIs are created and when URIs are converted to IRIs.

1472	   a. The repertoire of characters allowed in each IRI component is
1473	      limited by the definition of that component.  For example, the
1474	      definition of the scheme component does not allow characters
1475	      beyond US-ASCII.

1477	      (Note: In accordance with URI practice, generic IRI software
1478	      cannot and should not check for such limitations.)

1480	   b. The UCS contains many areas of characters for which there are
1481	      strong visual look-alikes.  Because of the likelihood of
1482	      transcription errors, these also should be avoided.  This includes
1483	      the full-width equivalents of Latin characters, half-width
1484	      Katakana characters for Japanese, and many others.  It also
1485	      includes many look-alikes of "space", "delims", and "unwise",
1486	      characters excluded in [RFC3491].

1488	   Additional information is available from [UNIXML].  [UNIXML] is
1489	   written in the context of running text rather than in that of
1490	   identifiers.  Nevertheless, it discusses many of the categories of
1491	   characters not appropriate for IRIs.

1493	6.2.  Software Interfaces and Protocols

1495	   Although an IRI is defined as a sequence of characters, software
1496	   interfaces for URIs typically function on sequences of octets or
1497	   other kinds of code units.  Thus, software interfaces and protocols
1498	   MUST define which character encoding is used.

1500	   Intermediate software interfaces between IRI-capable components and
1501	   URI-only components MUST map the IRIs per Section 3.6, when
1502	   transferring from IRI-capable to URI-only components.  This mapping
1503	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
1504	   between components that are known to be able to handle IRIs.

1506	6.3.  Format of URIs and IRIs in Documents and Protocols

1508	   Document formats that transport URIs may have to be upgraded to allow
1509	   the transport of IRIs.  In cases where the document as a whole has a
1510	   native character encoding, IRIs MUST also be encoded in this
1511	   character encoding and converted accordingly by a parser or
1512	   interpreter.  IRI characters not expressible in the native character
1513	   encoding SHOULD be escaped by using the escaping conventions of the
1514	   document format if such conventions are available.  Alternatively,
1515	   they MAY be percent-encoded according to Section 3.6.  For example,
1516	   in HTML or XML, numeric character references SHOULD be used.  If a
1517	   document as a whole has a native character encoding and that
1518	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
1519	   the document in the UTF-8 character encoding.

1521	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1522	   although they use different terminology.  HTML 4.0 [HTML4] defines
1523	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
1524	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
1525	   based upon them allow IRIs.  Also, it is expected that all relevant
1526	   new W3C formats and protocols will be required to handle IRIs
1527	   [CharMod].

1529	6.4.  Use of UTF-8 for Encoding Original Characters

1531	   This section discusses details and gives examples for point c) in
1532	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
1533	   IRI in question has to encode original characters into octets by
1534	   using UTF-8.  This can be specified for all URIs of a URI scheme or
1535	   can apply to individual URIs for schemes that do not specify how to
1536	   encode original characters.  It can apply to the whole URI, or only
1537	   to some part.  For background information on encoding characters into
1538	   URIs, see also Section 2.5 of [RFC3986].

1540	   For new URI schemes, using UTF-8 is recommended in [RFC4395].
1541	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
1542	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
1543	   because the HTTP URI scheme does not specify how to encode original
1544	   characters, only some HTTP URLs can have corresponding but different
1545	   IRIs.

1547	   For example, for a document with a URI of
1548	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
1549	   construct a corresponding IRI (in XML notation, see Section 1.4):
1550	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
1551	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
1552	   encoded representation of that character).  On the other hand, for a
1553	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
1554	   percent-encoding octets cannot be converted to actual characters in
1555	   an IRI, as the percent-encoding is not based on UTF-8.
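
   The distinction can be sketched in code.  The function below is a
   simplified illustration and not the conversion procedure of this
   specification: it treats each run of percent-encoded octets as a
   unit, leaves the run untouched unless it is valid UTF-8, and keeps
   US-ASCII and control characters percent-encoded.  The name
   uri_to_iri_sketch is an assumption of this example.

```python
import re

_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(char: str) -> str:
    return "".join("%{:02X}".format(b) for b in char.encode("utf-8"))


def uri_to_iri_sketch(uri: str) -> str:
    def replace(match: "re.Match") -> str:
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not UTF-8 (e.g. iso-8859-1): leave as is
        # Only characters from U+00A0 upwards become literal.
        return "".join(c if ord(c) >= 0xA0 else _pct(c) for c in text)

    return _RUN.sub(replace, uri)


assert (uri_to_iri_sketch("http://www.example.org/r%C3%A9sum%C3%A9.html")
        == "http://www.example.org/r\u00e9sum\u00e9.html")
assert (uri_to_iri_sketch("http://www.example.org/r%E9sum%E9.html")
        == "http://www.example.org/r%E9sum%E9.html")
```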

1557	   For most URI schemes, there is no need to upgrade their scheme
1558	   definition in order for them to work with IRIs.  The main case where
1559	   upgrading makes sense is when a scheme definition, or a particular
1560	   component of a scheme, is strictly limited to the use of US-ASCII
1561	   characters with no provision to include non-ASCII characters/octets
1562	   via percent-encoding, or if a scheme definition currently uses highly
1563	   scheme-specific provisions for the encoding of non-ASCII characters.
1564	   An example of this is the mailto: scheme [RFC2368].

1566	   This specification updates the IANA registry of URI schemes to note
1567	   their applicability to IRIs, see Section 9.  All IRIs use URI
1568	   schemes, and all URIs with URI schemes can be used as IRIs, even
1569	   though in some cases only by using URIs directly as IRIs, without any
1570	   conversion.

1572	   Scheme definitions can impose restrictions on the syntax of scheme-
1573	   specific URIs; i.e., URIs that are admissible under the generic URI
1574	   syntax [RFC3986] may not be admissible due to narrower syntactic
1575	   constraints imposed by a URI scheme specification.  URI scheme
1576	   definitions cannot broaden the syntactic restrictions of the generic
1577	   URI syntax; otherwise, it would be possible to generate URIs that
1578	   satisfied the scheme-specific syntactic constraints without
1579	   satisfying the syntactic constraints of the generic URI syntax.
1580	   However, additional syntactic constraints imposed by URI scheme
1581	   specifications are applicable to IRIs, as the corresponding URI
1582	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
1583	   under the syntactic restrictions of generic URI syntax and any
1584	   narrower restrictions imposed by the corresponding URI scheme
1585	   specification.

1587	   The requirement for the use of UTF-8 generally applies to all parts
1588	   of a URI.  However, it is possible that the capability of IRIs to
1589	   represent a wide range of characters directly is used just in some
1590	   parts of the IRI (or IRI reference).  The other parts of the IRI may
1591	   only contain US-ASCII characters, or they may not be based on UTF-8.
1592	   They may be based on another character encoding, or they may directly
1593	   encode raw binary data (see also [RFC2397]).

1595	   For example, it is possible to have a URI reference of
1596	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
1597	   document name is encoded in iso-8859-1 based on server settings, but
1598	   where the fragment identifier is encoded in UTF-8 according to
1599	   [XPointer].  The IRI corresponding to the above URI would be (in XML
1600	   notation)
1601	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

1603	   Similar considerations apply to query parts.  The functionality of
1604	   IRIs (namely, to be able to include non-ASCII characters) can only be
1605	   used if the query part is encoded in UTF-8.

1607	6.5.  Relative IRI References

1609	   Processing of relative IRI references against a base is handled
1610	   straightforwardly; the algorithms of [RFC3986] can be applied
1611	   directly, treating the characters additionally allowed in IRI
1612	   references in the same way that unreserved characters are in URI
1613	   references.
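
   As an illustration, Python's urllib.parse.urljoin, which follows
   the resolution algorithm of [RFC3986], passes non-ASCII characters
   through unchanged.  The base and references below are made up for
   this example.

```python
from urllib.parse import urljoin

base = "http://example.org/caf\u00e9/menu"

# Non-ASCII characters in the base and in the reference are carried
# along like unreserved characters; only dot-segments are processed.
assert (urljoin(base, "r\u00e9sum\u00e9.html")
        == "http://example.org/caf\u00e9/r\u00e9sum\u00e9.html")
assert urljoin(base, "../ros\u00e9") == "http://example.org/ros\u00e9"
```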

1615	7.  Liberal handling of otherwise invalid IRIs

1617	   (EDITOR NOTE: This Section may move to an appendix.)  Some technical
1618	   specifications and widely-deployed software have allowed additional
1619	   variations and extensions of IRIs to be used in syntactic components.
1620	   This section describes two widely-used preprocessing agreements.
1621	   Other technical specifications may wish to reference a syntactic
1622	   component which is "a valid IRI or a string that will map to a valid
1623	   IRI after this preprocessing algorithm".  These two variants are
1624	   known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address
1625	   [HTML5].

1627	   Future technical specifications SHOULD NOT allow conforming producers
1628	   to produce, or conforming content to contain, such forms, as they are
1629	   not interoperable with other IRI consuming software.

1631	7.1.  LEIRI processing

1633	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1634	   Legacy Extended IRIs is the same as that for IRIs, except that the
1635	   ucschar production is replaced by the leiri-ucschar production:

1637	     leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1638	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1639	                      / %xE000-FFFD / %x10000-10FFFF

1641	   Among other extensions, processors based on this specification also
1642	   did not enforce the restriction on bidirectional formatting
1643	   characters in Section 4.1, and the iprivate production becomes
1644	   redundant.

1646	   To convert a string allowed as a LEIRI to an IRI, each character
1647	   allowed in leiri-ucschar but not in ucschar must be percent-encoded
1648	   using Section 3.3.
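
   A partial sketch of this conversion follows.  It covers only the
   US-ASCII characters listed in leiri-ucschar and the C0, DEL, and C1
   controls; the remaining code points that are in leiri-ucschar but
   not in ucschar would be handled the same way and are omitted here.
   The function name and the use of UTF-8 for the percent-encoding are
   assumptions of this example.

```python
# US-ASCII characters added by leiri-ucschar
_LEIRI_ASCII = set(' <>"{}|\\^`')


def leiri_to_iri_sketch(leiri: str) -> str:
    out = []
    for char in leiri:
        code = ord(char)
        is_control = code <= 0x1F or 0x7F <= code <= 0x9F
        if char in _LEIRI_ASCII or is_control:
            # Percent-encode each octet of the UTF-8 form.
            out.append("".join("%{:02X}".format(b)
                               for b in char.encode("utf-8")))
        else:
            out.append(char)
    return "".join(out)


assert (leiri_to_iri_sketch("http://example.org/a b|c")
        == "http://example.org/a%20b%7Cc")
```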

1650	7.2.  Web Address processing

1652	   Many popular web browsers have taken the approach of being quite
1653	   liberal in what is accepted as a "URL" or its relative forms.  This
1654	   section describes their behavior in terms of a preprocessor which
1655	   maps strings into the IRI space for subsequent parsing and
1656	   interpretation as an IRI.

1658	   In some situations, it might be appropriate to describe the syntax
1659	   that a liberal consumer implementation might accept as a "Web
1660	   Address" or "Hypertext Reference" or "HREF".  However, technical
1661	   specifications SHOULD restrict the syntactic form allowed by
1662	   compliant producers to the IRI or IRI reference syntax defined in
1663	   this document even if they want to mandate this processing.

1665	   Summary:

1667	   o  Leading and trailing whitespace is removed.

1669	   o  Some additional characters are removed.

1671	   o  Some additional characters are allowed and escaped (as with
1672	      LEIRI).

1674	   o  If interpreting an IRI as a URI, the pct-encoding of the query
1675	      component of the parsed URI component depends on operational
1676	      context.

1678	   Each string provided may have an associated charset (called the HRef-
1679	   charset here); this defaults to UTF-8.  For web browsers interpreting
1680	   HTML, the HRef-charset of a string is determined as follows:

1682	   If the string came from a script (e.g. as an argument to a method)
1683	      The HRef-charset is the script's charset.

1685	   If the string came from a DOM node (e.g. from an element)  The node
1686	      has a Document, and the HRef-charset is the Document's character
1687	      encoding.

1689	   If the string had a HRef-charset defined when the string was created
1690	   or defined  The HRef-charset is as defined.

1692	   If the resulting HRef-charset is a Unicode-based character encoding
1693	   (e.g., UTF-16), then use UTF-8 instead.

1695	   The syntax for Web Addresses is obtained by replacing the 'ucschar',
1696	   pct-form, and path-sep rules with the href-ucschar, href-pct-form,
1697	   and href-path-sep rules below.  In addition, some characters are
1698	   stripped.

1700	     href-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1701	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1702	                      / %xE000-FFFD / %x10000-10FFFF
1703	     href-pct-form = pct-encoded / "%"
1704	     href-path-sep = "/" / "\"
1705	     href-strip    =

1707	   (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT
1708	   SENTENCE) browsers did not enforce the restriction on bidirectional
1709	   formatting characters in Section 4.1, and the iprivate production
1710	   becomes redundant.

1712	   'Web Address processing' requires the following additional
1713	   preprocessing steps:

1715	   1.  Leading and trailing instances of space (U+0020), CR (U+000D), LF
1716	       (U+000A), and TAB (U+0009) characters are removed.

1718	   2.  Strip all characters in href-strip.

1720	   3.  Percent-encode all characters in href-ucschar not in ucschar.

1722	   4.  Replace occurrences of "%" not followed by two hexadecimal digits
1723	       by "%25".

1725	   5.  Convert backslashes ('\') matching href-path-sep to forward
1726	       slashes ('/').
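
   Steps 1 and 4 above are self-contained and can be sketched as shown
   below.  Steps 2, 3, and 5 are left out of this sketch because they
   depend on the href-strip set, which is not yet specified, and on
   how the string is parsed into components.  The function name is an
   assumption of this example.

```python
import re


def web_address_steps_1_and_4(text: str) -> str:
    # Step 1: remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Step 4: a "%" not followed by two hex digits becomes "%25".
    return re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)


assert (web_address_steps_1_and_4("  http://example.org/100%\n")
        == "http://example.org/100%25")
assert (web_address_steps_1_and_4("http://example.org/%7euser")
        == "http://example.org/%7euser")
```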

1728	7.3.  Characters not allowed in IRIs

1730	   This section provides a list of the groups of characters and code
1731	   points that are allowed by LEIRI or HREF but are not allowed in IRIs
1732	   or are allowed in IRIs only in the query part.  For each group of
1733	   characters, advice on the usage of these characters is also given,
1734	   concentrating on the reasons for why they are excluded from IRI use.

1736	      Space (U+0020): Some formats and applications use space as a
1737	      delimiter, e.g. for items in a list.  Appendix C of [RFC3986] also
1738	      mentions that white space may have to be added when displaying or
1739	      printing long URIs; the same applies to long IRIs.  This means
1740	      that spaces can disappear, or can cause what is intended as a
1741	      single IRI or IRI reference to be treated as two or more separate
1742	      IRIs.

1744	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1745	      C of [RFC3986] suggests the use of double-quotes
1746	      ("http://example.com/") and angle brackets (<http://example.com/>)
1747	      as delimiters for URIs in plain text.  These conventions are often
1748	      used, and also apply to IRIs.  Using these characters in strings
1749	      intended to be IRIs would result in the IRIs being cut off at the
1750	      wrong place.

1752	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1753	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters were
1754	      originally excluded from URIs because the respective
1755	      codepoints are assigned to different graphic characters in some
1756	      7-bit or 8-bit encoding.  Despite the move to Unicode, some of
1757	      these characters are still occasionally displayed differently on
1758	      some systems, e.g.  U+005C may appear as a Japanese Yen symbol on
1759	      some systems.  Also, the fact that these characters are not used
1760	      in URIs or IRIs has encouraged their use outside URIs or IRIs in
1761	      contexts that may include URIs or IRIs.  If a string with such a
1762	      character were used as an IRI in such a context, it would likely
1763	      be interpreted piecemeal.

1765	      The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
1766	      #x9F): There is generally no way to transmit these characters
1767	      reliably as text outside of a charset encoding.  Even when in
1768	      encoded form, many software components silently filter out some of
1769	      these characters, or may stop processing altogether when
1770	      encountering some of them.  These characters may affect text
1771	      display in subtle, unnoticeable ways or in drastic, global, and
1772	      irreversible ways depending on the hardware and software involved.
1773	      The use of some of these characters would allow malicious users to
1774	      manipulate the display of an IRI and its context in many
1775	      situations.

1777	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1778	      characters affect the display ordering of characters.  If IRIs
1779	      were allowed to contain these characters and the resulting visual
1780	      display transcribed, they could not be converted back to
1781	      electronic form (logical order) unambiguously.  These characters,
1782	      if allowed in IRIs, might allow malicious users to manipulate the
1783	      display of an IRI and its context.

1785	      Specials (U+FFF0-FFFD): These code points provide functionality
1786	      beyond that useful in an IRI, for example byte order
1787	      identification, annotation, and replacements for unknown
1788	      characters and objects.  Their use and interpretation in an IRI
1789	      would serve no purpose and might lead to confusing display
1790	      variations.

1792	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1793	      10FFFD): Display and interpretation of these code points is by
1794	      definition undefined without private agreement.  Therefore, these
1795	      code points are not suited for use on the Internet.  They are not
1796	      interoperable and may have unpredictable effects.

1798	      Tags (U+E0000-E0FFF): These characters provide a way to language
1799	      tag in Unicode plain text.  They are not appropriate for IRIs
1800	      because language information in identifiers cannot reliably be
1801	      input, transmitted (e.g. on a visual medium such as paper), or
1802	      recognized.

1804	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1805	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1806	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1807	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1808	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1809	      non-characters.  Applications may use some of them internally, but
1810	      are not prepared to interchange them.

1812	   LEIRI preprocessing disallowed some code points and code units:

1814	      Surrogate code units (D800-DFFF): These do not represent Unicode
1815	      codepoints.
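
   The groups listed in this section can be expressed as a simple
   lookup.  The following Python sketch is illustrative only; the
   function name and the group labels are hypothetical, and the ranges
   are copied from the list above rather than taken from any normative
   table.  It assumes a non-negative code point value.

```python
from typing import Optional


def classify_excluded(cp: int) -> Optional[str]:
    """Return the group name for code point cp, or None if cp is in
    none of the groups listed in this section."""
    if cp == 0x20:
        return "space"
    if cp in (0x3C, 0x3E, 0x22):
        return "delimiter"
    if cp in (0x5C, 0x5E, 0x60, 0x7B, 0x7C, 0x7D):
        return "unwise"
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    if cp in (0x200E, 0x200F) or 0x202A <= cp <= 0x202E:
        return "bidi formatting"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # U+FDD0-FDEF plus the last two code points of planes 1 to 16.
    if 0xFDD0 <= cp <= 0xFDEF or (cp >= 0x1FFFE and (cp & 0xFFFF) >= 0xFFFE):
        return "non-character"
    if 0xFFF0 <= cp <= 0xFFFD:
        return "special"
    if (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    if 0xE0000 <= cp <= 0xE0FFF:
        return "tag"
    return None
```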

1817	8.  URI/IRI Processing Guidelines (Informative)

1819	   This informative section provides guidelines for supporting IRIs in
1820	   the same software components and operations that currently process
1821	   URIs: Software interfaces that handle URIs, software that allows
1822	   users to enter URIs, software that creates or generates URIs,
1823	   software that displays URIs, formats and protocols that transport
1824	   URIs, and software that interprets URIs.  These may all require
1825	   modification before functioning properly with IRIs.  The
1826	   considerations in this section also apply to URI references and IRI
1827	   references.

1829	8.1.  URI/IRI Software Interfaces

1831	   Software interfaces that handle URIs, such as URI-handling APIs and
1832	   protocols transferring URIs, need interfaces and protocol elements
1833	   that are designed to carry IRIs.

1835	   In case the current handling in an API or protocol is based on US-
1836	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1837	   is compatible with US-ASCII, is in accordance with the
1838	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1839	   any case, the API or protocol definition must clearly define the
1840	   character encoding to be used.

1842	   The transfer from URI-only to IRI-capable components requires no
1843	   mapping, although the conversion described in Section 3.7 above may
1844	   be performed.  It is preferable not to perform this inverse
1845	   conversion unless it is certain this can be done correctly.

1847	8.2.  URI/IRI Entry

1849	   Some components allow users to enter URIs into the system by typing
1850	   or dictation, for example.  This software must be updated to allow
1851	   for IRI entry.

1853	   A person viewing a visual representation of an IRI (as a sequence of
1854	   glyphs, in some order, in some visual display) or hearing an IRI will
1855	   use an entry method for characters in the user's language to input
1856	   the IRI.  Depending on the script and the input method used, this may
1857	   be a more or less complicated process.

1859	   The process of IRI entry must ensure, as much as possible, that the
1860	   restrictions defined in Section 2.2 are met.  This may be done by
1861	   choosing appropriate input methods or variants/settings thereof, by
1862	   appropriately converting the characters being input, by eliminating
1863	   characters that cannot be converted, and/or by issuing a warning or
1864	   error message to the user.

1866	   As an example of variant settings, input method editors for East
1867	   Asian Languages usually allow the input of Latin letters and related
1868	   characters in full-width or half-width versions.  For IRI input, the
1869	   input method editor should be set so that it produces half-width
1870	   Latin letters and punctuation and full-width Katakana.

1872	   An input field primarily or solely used for the input of URIs/IRIs
1873	   might allow the user to view an IRI as it is mapped to a URI.  Places
1874	   where the input of IRIs is frequent may provide the possibility for
1875	   viewing an IRI as mapped to a URI.  This will help users when some of
1876	   the software they use does not yet accept IRIs.

1878	   An IRI input component interfacing to components that handle URIs,
1879	   but not IRIs, must map the IRI to a URI before passing it to these
1880	   components.

1882	   For the input of IRIs with right-to-left characters, please see
1883	   Section 4.3.

1885	8.3.  URI/IRI Transfer between Applications

1887	   Many applications (for example, mail user agents) try to detect URIs
1888	   appearing in plain text.  For this, they use some heuristics based on
1889	   URI syntax.  They then allow the user to click on such URIs and
1890	   retrieve the corresponding resource in an appropriate (usually
1891	   scheme-dependent) application.

1893	   Such applications would need to be upgraded, in order to use the IRI
1894	   syntax as a base for heuristics.  In particular, a non-ASCII
1895	   character should not be taken as the indication of the end of an IRI.
1896	   Such applications also would need to make sure that they correctly
1897	   convert the detected IRI from the character encoding of the document
1898	   or application where the IRI appears, to the character encoding used
1899	   by the system-wide IRI invocation mechanism, or to a URI (according
1900	   to Section 3.6) if the system-wide invocation mechanism only accepts
1901	   URIs.
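
   A detection heuristic of this kind might look like the following
   Python sketch.  The pattern and the function name are hypothetical
   and limited to the "http" and "https" schemes for brevity.  The
   point is only that the match ends at white space or at the
   plain-text delimiters "<", ">", and '"', and not at the first
   non-ASCII character.

```python
import re

# Hypothetical heuristic: an http(s) IRI runs until white space or one
# of the delimiters "<", ">", and '"'.  Non-ASCII characters are
# deliberately NOT treated as terminators.
_IRI_CANDIDATE = re.compile(r"https?://[^\s<>\"]+")


def find_iri_candidates(text: str) -> list:
    """Return the substrings of text that look like http(s) IRIs."""
    return _IRI_CANDIDATE.findall(text)
```

   A real implementation would also need to deal with trailing
   punctuation and with the conversion of the detected string into the
   character encoding expected by the invocation mechanism.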

1903	   The clipboard is another frequently used way to transfer URIs and
1904	   IRIs from one application to another.  On most platforms, the
1905	   clipboard is able to store and transfer text in many languages and
1906	   scripts.  Correctly used, the clipboard transfers characters, not
1907	   octets, which will do the right thing with IRIs.

1909	8.4.  URI/IRI Generation

1911	   Systems that offer resources through the Internet, where those
1912	   resources have logical names, sometimes automatically generate URIs
1913	   for the resources they offer.  For example, some HTTP servers can
1914	   generate a directory listing for a file directory and then respond to
1915	   the generated URIs with the files.

1917	   Many legacy character encodings are in use in various file systems.
1918	   Many currently deployed systems do not transform the local character
1919	   representation of the underlying system before generating URIs.

1921	   For maximum interoperability, systems that generate resource
1922	   identifiers should make the appropriate transformations.  For
1923	   example, if a file system contains a file named
1924	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1925	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1926	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1927	   kept in a character encoding other than UTF-8.
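
   The transformation in this example can be sketched in Python as
   follows.  The function name is hypothetical, and the sketch assumes
   that the server knows the character encoding of the file system
   (ISO-8859-1 is used in the usage note purely for illustration).

```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name: bytes, fs_encoding: str) -> str:
    """Decode a file name from its local character encoding, then
    percent-encode the UTF-8 form of the resulting characters."""
    name = raw_name.decode(fs_encoding)  # local encoding -> characters
    return quote(name, safe="")          # characters -> UTF-8 -> %HH
```

   For a file name stored as the ISO-8859-1 octets of
   "r&#xE9;sum&#xE9;.html", this yields "r%C3%A9sum%C3%A9.html", that
   is, the same result as for a file system that uses UTF-8 natively.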

1929	   This recommendation particularly applies to HTTP servers.  For FTP
1930	   servers, similar considerations apply; see [RFC2640].

1932	8.5.  URI/IRI Selection

1934	   In some cases, resource owners and publishers have control over the
1935	   IRIs used to identify their resources.  This control is mostly
1936	   executed by controlling the resource names, such as file names,
1937	   directly.

1939	   In these cases, it is recommended to avoid choosing IRIs that are
1940	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1941	   is easily confused with the digit one ("1"), and the upper-case oh
1942	   ("O") is easily confused with the digit zero ("0").  Publishers
1943	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1945	   Outside the US-ASCII repertoire, there are many more opportunities
1946	   for confusion; a complete set of guidelines is too lengthy to include
1947	   here.  As long as names are limited to characters from a single
1948	   script, native writers of a given script or language will know best
1949	   when ambiguities can appear, and how they can be avoided.  What may
1950	   look ambiguous to a stranger may be completely obvious to the average
1951	   native user.  On the other hand, in some cases, the UCS contains
1952	   variants for compatibility reasons; for example, for typographic
1953	   purposes.  These should be avoided wherever possible.  Although there
1954	   may be exceptions, newly created resource names should generally be
1955	   in NFKC [UTR15] (which means that they are also in NFC).

1957	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1958	   compatibility reasons.  Wherever possible, IRIs should use the two
1959	   letters "f" and "i" rather than the "fi" ligature.  An example where
1960	   the latter may be used is in the query part of an IRI for an explicit
1961	   search for a word written containing the "fi" ligature.
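
   Whether a candidate resource name is in NFKC can be checked with a
   standard normalization library.  The following Python sketch uses
   the "unicodedata" module; the helper names are hypothetical.

```python
import unicodedata


def to_nfkc(name: str) -> str:
    """Return the NFKC form of a resource name."""
    return unicodedata.normalize("NFKC", name)


def is_nfkc(name: str) -> bool:
    """True if the name is unchanged by NFKC normalization."""
    return to_nfkc(name) == name
```

   Applied to a name that starts with the U+FB01 ligature, to_nfkc()
   replaces the ligature by the two letters "f" and "i".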

1963	   In certain cases, there is a chance that characters from different
1964	   scripts look the same.  The best known example is the similarity of
1965	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1966	   such cases, IRIs should only be created where all the characters in a
1967	   single component are used together in a given language.  This usually
1968	   means that all of these characters will be from the same script, but
1969	   there are languages that mix characters from different scripts (such
1970	   as Japanese).  This is similar to the heuristics used to distinguish
1971	   between letters and numbers in the examples above.  Also, for Latin,
1972	   Greek, and Cyrillic, using lowercase letters results in fewer
1973	   ambiguities than using uppercase letters would.
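
   A very rough way to flag components that mix look-alike scripts is
   sketched below in Python.  This is an approximation under stated
   assumptions: the Python standard library does not expose the Unicode
   Script property, so the sketch uses the first word of each letter's
   character name as a stand-in, and the function name is hypothetical.
   It cannot decide whether a mixture is legitimate for a given
   language (such as Japanese).

```python
import unicodedata


def script_hints(component: str) -> set:
    """Return the set of first words of the Unicode names of all
    letters in component (e.g. "LATIN", "GREEK", "CYRILLIC")."""
    hints = set()
    for ch in component:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split(" ")[0])
    return hints
```

   A component for which more than one hint is returned may deserve a
   closer look before it is published.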

1975	8.6.  Display of URIs/IRIs

1977	   In situations where the rendering software is not expected to display
1978	   non-ASCII parts of the IRI correctly using the available layout and
1979	   font resources, these parts should be percent-encoded before being
1980	   displayed.
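
   As an illustration, the following Python sketch percent-encodes
   only those non-ASCII characters that the display layer reports it
   cannot render.  The function name and the "can_render" callback are
   hypothetical; how a system determines font coverage is outside the
   scope of this sketch.

```python
from typing import Callable
from urllib.parse import quote


def for_display(iri: str, can_render: Callable[[str], bool]) -> str:
    """Percent-encode (via UTF-8) each non-ASCII character for which
    can_render() returns False; leave everything else unchanged."""
    return "".join(
        ch if ord(ch) < 0x80 or can_render(ch) else quote(ch, safe="")
        for ch in iri
    )
```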

1982	   For display of Bidi IRIs, please see Section 4.1.

1984	8.7.  Interpretation of URIs and IRIs

1986	   Software that interprets IRIs as the names of local resources should
1987	   accept IRIs in multiple forms and convert and match them with the
1988	   appropriate local resource names.

1990	   First, multiple representations include both IRIs in the native
1991	   character encoding of the protocol and also their URI counterparts.

1993	   Second, it may include URIs constructed based on character encodings
1994	   other than UTF-8.  These URIs may be produced by user agents that do
1995	   not conform to this specification and that use legacy character
1996	   encodings to convert non-ASCII characters to URIs.  Whether this is
1997	   necessary, and what character encodings to cover, depends on a number
1998	   of factors, such as the legacy character encodings used locally and
1999	   the distribution of various versions of user agents.  For example,
2000	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
2001	   addition to UTF-8.
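
   One possible way to accept such URIs is to try UTF-8 first and fall
   back to the locally relevant legacy encodings only when strict
   UTF-8 decoding fails.  The Python sketch below assumes this
   strategy; the function name and the default list of encodings are
   hypothetical choices for a Japanese deployment.

```python
from typing import Optional, Sequence, Tuple
from urllib.parse import unquote_to_bytes


def decode_segment(
    segment: str,
    encodings: Sequence[str] = ("utf-8", "shift_jis", "euc_jp"),
) -> Optional[Tuple[str, str]]:
    """Percent-decode segment to octets, then return the first
    (text, encoding) pair for which strict decoding succeeds."""
    raw = unquote_to_bytes(segment)
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None
```

   The order of the encodings matters, because an octet sequence can
   be valid in more than one legacy encoding.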

2003	   Third, it may include additional mappings to be more user-friendly
2004	   and robust against transmission errors.  These would be similar to
2005	   how some servers currently treat URIs as case insensitive or perform
2006	   additional matching to account for spelling errors.  For characters
2007	   beyond the US-ASCII repertoire, this may, for example, include
2008	   ignoring the accents on received IRIs or resource names.  Please note
2009	   that such mappings, including case mappings, are language dependent.

2011	   It can be difficult to identify a resource unambiguously if too many
2012	   mappings are taken into consideration.  However, percent-encoded and
2013	   not percent-encoded parts of IRIs can always be clearly
2014	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
2015	   the potential for collisions lower than it may seem at first.

2017	8.8.  Upgrading Strategy

2019	   Where this recommendation places further constraints on software for
2020	   which many instances are already deployed, it is important to
2021	   introduce upgrades carefully and to be aware of the various
2022	   interdependencies.

2024	   If IRIs cannot be interpreted correctly, they should not be created,
2025	   generated, or transported.  This suggests that upgrading URI
2026	   interpreting software to accept IRIs should have highest priority.

2028	   On the other hand, a single IRI is interpreted only by a single or
2029	   very few interpreters that are known in advance, although it may be
2030	   entered and transported very widely.

2032	   Therefore, IRIs benefit most from a broad upgrade of software to be
2033	   able to enter and transport IRIs.  However, before an individual IRI
2034	   is published, care should be taken to upgrade the corresponding
2035	   interpreting software in order to cover the forms expected to be
2036	   received by various versions of entry and transport software.

2038	   The upgrade of generating software to generate IRIs instead of using
2039	   a local character encoding should happen only after the service is
2040	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
2041	   when the service accepts IRIs and the intervening infrastructure and
2042	   protocol is known to transport them safely.

2044	   Software converting from URIs to IRIs for display should be upgraded
2045	   only after upgraded entry software has been widely deployed to the
2046	   population that will see the displayed result.

2048	   Where there is a free choice of character encodings, it is often
2049	   possible to reduce the effort and dependencies for upgrading to IRIs
2050	   by using UTF-8 rather than another encoding.  For example, when a new
2051	   file-based Web server is set up, using UTF-8 as the character
2052	   encoding for file names will make the transition to IRIs easier.
2053	   Likewise, when a new Web form is set up using UTF-8 as the character
2054	   encoding of the form page, the returned query URIs will use UTF-8 as
2055	   the character encoding (unless the user, for whatever reason, changes
2056	   the character encoding) and will therefore be compatible with IRIs.

2058	   These recommendations, when taken together, will allow for the
2059	   extension from URIs to IRIs in order to handle characters other than
2060	   US-ASCII while minimizing interoperability problems.  For
2061	   considerations regarding the upgrade of URI scheme definitions, see
2062	   Section 6.4.

2064	9.  IANA Considerations

2066	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
2067	   this document when it issues as an RFC.

2069	   IANA maintains a registry of "URI schemes".  A "URI scheme" also
2070	   serves as an "IRI scheme".

2072	   To clarify that the URI scheme registration process also applies to
2073	   IRIs, change the description of the "URI schemes" registry header to
2074	   say "[RFC4395] defines an IANA-maintained registry of URI Schemes.

2076	   These registries include the Permanent and Provisional URI Schemes.
2077	   RFC XXXX updates this registry to designate that schemes may also
2078	   indicate their usability as IRI schemes."

2080	   Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".

2082	10.  Security Considerations

2084	   The security considerations discussed in [RFC3986] also apply to
2085	   IRIs.  In addition, the following issues require particular care for
2086	   IRIs.

2088	   Incorrect encoding or decoding can lead to security problems.  In
2089	   particular, some UTF-8 decoders do not check against overlong byte
2090	   sequences.  As an example, a "/" is encoded with the byte 0x2F both
2091	   in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2092	   interpret the sequence 0xC0 0xAF as a "/".  A sequence such as
2093	   "%C0%AF.." may pass some security tests and then be interpreted as
2094	   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2095	   and checking are not done in the right order, and/or if reserved
2096	   characters and unreserved characters are not clearly distinguished.
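
   The ordering issue described above can be illustrated with the
   following Python sketch, which decodes first, using a strict UTF-8
   decoder, and only then checks for ".." segments.  The function name
   is hypothetical.  Treating a decoded "%2F" as a path separator is a
   deliberately conservative choice for this check and is not a general
   rule for URI processing.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strictly(path: str) -> str:
    """Percent-decode path, decode the octets as strict UTF-8, and
    reject the result if it contains a ".." segment."""
    raw = unquote_to_bytes(path)
    try:
        # Strict decoding rejects overlong forms such as 0xC0 0xAF.
        decoded = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("path is not valid UTF-8") from exc
    if ".." in decoded.split("/"):
        raise ValueError("path contains a '..' segment")
    return decoded
```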

2098	   There are various ways in which "spoofing" can occur with IRIs.
2099	   "Spoofing" means that somebody may add a resource name that looks the
2100	   same or similar to the user, but that points to a different resource.
2101	   The added resource may pretend to be the real resource by looking
2102	   very similar but may contain all kinds of changes that may be
2103	   difficult to spot and that can cause all kinds of problems.  Most
2104	   spoofing possibilities for IRIs are extensions of those for URIs.

2106	   Spoofing can occur for various reasons.  First, a user's
2107	   normalization expectations or actual normalization when entering an
2108	   IRI or transcoding an IRI from a legacy character encoding do not
2109	   match the normalization used on the server side.  Conceptually, this
2110	   is no different from the problems surrounding the use of case-
2111	   insensitive web servers.  For example, a popular web page with a
2112	   mixed-case name ("http://big.example.com/PopularPage.html") might be
2113	   "spoofed" by someone who is able to create
2114	   "http://big.example.com/popularpage.html".  However, the use of
2115	   unnormalized character sequences, and of additional mappings for user
2116	   convenience, may increase the chance for spoofing.  Protocols and
2117	   servers that allow the creation of resources with names that are not
2118	   normalized are particularly vulnerable to such attacks.  This is an
2119	   inherent security problem of the relevant protocol, server, or
2120	   resource and is not specific to IRIs, but it is mentioned here for
2121	   completeness.

2123	   Spoofing can occur in various IRI components, such as the domain name
2124	   part or a path part.  For considerations specific to the domain name
2125	   part, see [RFC3491].  For the path part, administrators of sites that
2126	   allow independent users to create resources in the same sub area may
2127	   have to be careful to check for spoofing.

2129	   Spoofing can occur because in the UCS many characters look very
2130	   similar.  Details are discussed in Section 8.5.  Again, this is very
2131	   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
2132	   or "1ame" URIs.

2134	   Spoofing can occur when URIs with percent-encodings based on various
2135	   character encodings are accepted to deal with older user agents.  In
2136	   some cases, particularly for Latin-based resource names, this is
2137	   usually easy to detect because UTF-8-encoded names, when interpreted
2138	   and viewed as legacy character encodings, produce mostly garbage.

2140	   When concurrently used character encodings have a similar structure
2141	   but there are no characters that have exactly the same encoding,
2142	   detection is more difficult.

2144	   Spoofing can occur with bidirectional IRIs, if the restrictions in
2145	   Section 4.2 are not followed.  The same visual representation may be
2146	   interpreted as different logical representations, and vice versa.  It
2147	   is also very important that a correct Unicode bidirectional
2148	   implementation be used.

2150	   The use of Legacy Extended IRIs introduces additional security
2151	   issues.

2153	11.  Acknowledgements

2155	   For contributions to this update, we would like to thank Ian Hickson,
2156	   Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin,
2157	   Henry S. Thompson, and the XML Core Working Group of the W3C.

2159	   The discussion on the issue addressed here started a long time ago.
2160	   There was a thread in the HTML working group in August 1995 (under
2161	   the topic of "Globalizing URIs") and in the www-international mailing
2162	   list in July 1996 (under the topic of "Internationalization and
2163	   URLs"), and there were ad-hoc meetings at the Unicode conferences in
2164	   September 1995 and September 1997.

2166	   For contributions to the previous version of this document, RFC 3987,
2167	   many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
2168	   Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
2169	   Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
2170	   Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
2171	   Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
2172	   Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
2173	   Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
2174	   Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
2175	   Haynes, Walter Underwood, and many others.

2177	   A definition of HyperText Reference was initially produced by Ian
2178	   Hickson, and further edited by Dan Connolly and C. M. Sperberg-
2179	   McQueen.

2181	   Thanks to the Internationalization Working Group (I18N WG) of the
2182	   World Wide Web Consortium (W3C), and the members of the W3C I18N
2183	   Working Group and Interest Group for their contributions and their
2184	   work on [CharMod].  Thanks also go to the members of many other W3C
2185	   Working Groups for adopting IRIs, and to the members of the Montreal
2186	   IAB Workshop on Internationalization and Localization for their
2187	   review.

2189	12.  Open Issues

2191	   NOTE: The issues noted in this section should be addressed before the
2192	   document is submitted as an RFC.  These issues are not in any
2193	   particular order.

2195	   length limits on domain name  See, for example,
2196	      http://lists.w3.org/Archives/Public/public-iri/2009Sep/0064.html
2197	      discussion on public-iri@w3.org (that discussion is mostly
2198	      irrelevant now as the "63 octets in UTF-8 per label" restriction
2199	      was dropped)

2201	   Allow generic scheme-independent IRI to URI translation  Previous
2202	      drafts of this specification proposed a generic IRI to URI
2203	      transformation using pct-encoding, and allowed domain name
2204	      translation to be optionally handled by retranslating host names
2205	      from pct-encoding back into Unicode and then into punycode.  This
2206	      draft does not allow that behavior, but this should be fixed to be
2207	      in line with RFC 3986 syntax and to lead implementations towards
2208	      a uniform and long-term URI<->IRI correspondence.  See also
2209	      [Gettys]

2211	   update URI scheme registry?  This document starts the process of
2212	      making minor changes to the URI scheme registry.  This should be
2213	      handled as an update to RFC 4395.

2215	   utf8 in HTTP  Not really an IRI issue, but some HTTP implementations
2216	      send UTF-8 paths directly; review.

2218	   handling of \\  Some web applications convert \ to / and others
2219	      don't.  Make this mandatory or disallowed (but not optional), for
2220	      Web Addresses.

2222	   dealing with disallowed IRI characters

2224	   misplaced text  Find a place to note that some older software
2225	      transcoding to UTF-8 may produce illegal output for some input, in
2226	      particular for characters outside the BMP (Basic Multilingual
2227	      Plane).  As an example, for the IRI with non-BMP characters (in
2228	      XML Notation):
2229	      "http://example.com/&#x10300;&#x10301;&#x10302;"
2230	      which contains the first three letters of the Old Italic alphabet,
2231	      the correct conversion to a URI is
2232	      "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

2234	   Special Query Handling needed?  The percent-encoding handling of
2235	      query components in the HTTP scheme is really unfortunate.  There
2236	      is no good normative advice to give if the percent-encoding is
2237	      delayed until the query-IRI is interpreted.  Could HTML ask
2238	      browsers to percent-encode the form data using the document
2239	      character set BEFORE the query IRI is constructed, and only in the
2240	      case where the document character set isn't Unicode-based and the
2241	      query is being added to http: or https: URIs?  This would give
2242	      more consistent results.  Browsers might have to change their
2243	      behavior in constructing the IRI-with-query-added, but the results
2244	      would be more consistent with fewer bugs, and it wouldn't affect
2245	      interpretation of any existing web pages.  It would remove the
2246	      need to have a normative special case for queries in HTML
2247	      documents, just for http, in a way in which things like
2248	      transcoding etc. wouldn't work well.  You could tell the
2249	      difference between a query URI in the address bar and one created
2250	      via a form because the address bar would always be UTF-8.  The
2251	      browsers might have to change the algorithm for showing the
2252	      address in the address bar to know how to undo the encoding.

2254	   handling illegal characters  Section 3.3 used to apply only to
2255	      characters in either 'ucschar' or 'iprivate', but then later said
2256	      that systems accepting IRIs MAY also deal with the printable
2257	      characters in US-ASCII that are not allowed in URIs, namely "<",
2258	      ">", '"', space, "{", "}", "|", "\", "^", and "`".  Larry felt
2259	      that a MAY would result in non-uniform behavior, because some
2260	      systems would produce valid URI components and others wouldn't.
2261	      Non-printable US-ASCII characters should be stripped by most
2262	      software, so if they are passed on somewhere as IRI
2263	      characters, encoding them makes sense.  The section also used to
2264	      say "If these characters are found but are not converted, then the
2265	      conversion SHOULD fail." but there is no notion of conversion
2266	      failing -- every string is converted.  Please note that the number
2267	      sign ("#"), the percent sign ("%"), and the square bracket
2268	      characters ("[", "]") are not part of the above list and MUST NOT
2269	      be converted.

2271	   adding single % and hash  Changed the BNF to not match the URI
2272	      document in allowing single % in path but not everywhere, and
2273	      allowing a # in the fragment part.
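
   For the non-BMP example given under "misplaced text" above, the
   correct conversion can be reproduced with the following Python
   sketch.  The function name is hypothetical; the sketch simply
   encodes each character as UTF-8 and percent-encodes the resulting
   octets.

```python
from urllib.parse import quote


def pct_encode_utf8(text: str) -> str:
    """Percent-encode the UTF-8 octets of every character in text."""
    return quote(text, safe="")


# The first three letters of the Old Italic alphabet.
OLD_ITALIC = "\U00010300\U00010301\U00010302"
```

   Each of the three letters becomes a single four-octet UTF-8
   sequence, so the path segment consists of twelve percent-encoded
   octets.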

2275	13.  Change Log

2277	   Note to RFC Editor: Please completely remove this section before
2278	   publication.

2280	13.1.  Changes from -06 to this document

2282	   Major restructuring of IRI processing model to make scheme-specific
2283	   translation necessary to handle IDNA requirements and for consistency
2284	   with web implementations.

2286	   Starting with IRI, you want one of:

2288	   a  IRI components (IRI parsed into UTF8 pieces)

2290	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

2292	   c  whole URI (for passing on to some other system that wants whole
2293	      URIs)

2295	13.1.1.  OLD WAY

2297	   1.  Pct-encoding on the whole thing to a URI. (c1) If you want a
2298	       (maybe broken) whole URI, you might stop here.

2300	   2.  Parsing the URI into URI components. (b1) If you want (maybe
2301	       broken) URI components, stop here.

2303	   3.  Decode the components (undoing the pct-encoding). (a) if you want
2304	       IRI components, stop here.

2306	   4.  reencode: use a different encoding for some components (for
2307	       domain names, and query components in web pages, which depends on
2308	       the component, scheme and context), and otherwise use pct-
2309	       encoding. (b2) if you want (good) URI components, stop here.

2311	   5.  reassemble the reencoded components. (c2) if you want a (*good*)
2312	       whole URI stop here.

2314	13.1.2.  NEW WAY

2316	   1.  Parse the IRI into IRI components using the generic syntax. (a)
2317	       if you want IRI components, stop here.

2319	   2.  Encode each component, using pct-encoding, IDN encoding, or
2320	       special query part encoding depending on the component scheme or
2321	       context. (b) If you want URI components, stop here.

2323	   3.  reassemble the a whole URI from URI components. (c) if you want a
2324	       whole URI stop here.
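
   The three steps of the new processing model can be sketched roughly
   as follows.  This is an illustrative sketch only, not normative text:
   the function names (parse_iri, encode_components, reassemble,
   iri_to_uri) and the query_encoding parameter are assumptions made
   for this example, the splitting expression is adapted from RFC 3986,
   Appendix B, and Python's built-in "idna" codec (which implements
   RFC 3490) stands in for the IDN encoding step.

```python
import re
from urllib.parse import quote

# Generic-syntax splitter adapted from RFC 3986, Appendix B.
_GENERIC = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.DOTALL,
)

# Characters left untouched by pct-encoding. "%" is kept so that
# existing pct-encoded triplets are not encoded a second time.
_SAFE_USERINFO = "!$&'()*+,;=:%"
_SAFE_PATH = "/:@!$&'()*+,;=%"
_SAFE_QUERY = "/?:@!$&'()*+,;=%"


def parse_iri(iri):
    """Step 1: split an IRI into IRI components (a)."""
    keys = ("scheme", "authority", "path", "query", "fragment")
    return dict(zip(keys, _GENERIC.match(iri).groups()))


def _encode_authority(authority):
    """Encode userinfo with pct-encoding and the host with IDNA."""
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):
        host_ascii = hostport  # IP literal: already ASCII, keep as-is
    else:
        host, colon, port = hostport.partition(":")
        host_ascii = host.encode("idna").decode("ascii") + colon + port
    return quote(userinfo, safe=_SAFE_USERINFO) + at + host_ascii


def encode_components(parts, query_encoding="utf-8"):
    """Step 2: encode each IRI component into a URI component (b).

    query_encoding is a hypothetical knob standing in for the
    context-dependent query encoding (e.g. the encoding of a web page).
    """
    out = dict(parts)
    if parts["authority"] is not None:
        out["authority"] = _encode_authority(parts["authority"])
    out["path"] = quote(parts["path"], safe=_SAFE_PATH)
    if parts["query"] is not None:
        out["query"] = quote(parts["query"], safe=_SAFE_QUERY,
                             encoding=query_encoding)
    if parts["fragment"] is not None:
        out["fragment"] = quote(parts["fragment"], safe=_SAFE_QUERY)
    return out


def reassemble(parts):
    """Step 3: recompose a whole URI from URI components (c)."""
    uri = ""
    if parts["scheme"] is not None:
        uri += parts["scheme"] + ":"
    if parts["authority"] is not None:
        uri += "//" + parts["authority"]
    uri += parts["path"]
    if parts["query"] is not None:
        uri += "?" + parts["query"]
    if parts["fragment"] is not None:
        uri += "#" + parts["fragment"]
    return uri


def iri_to_uri(iri, query_encoding="utf-8"):
    """Run steps 1 to 3 in order."""
    return reassemble(encode_components(parse_iri(iri), query_encoding))


print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe?q=\u00e9#\u00fc"))
```

   A caller that needs only IRI components can stop after parse_iri,
   and one that needs only URI components can stop after
   encode_components, mirroring stopping points (a) and (b) above.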

2326	13.2.  Changes from -05 to -06

2328	   o  Add HyperText Reference, change abstract, acks and references for
2329	      it

2331	   o  Add Masinter back as another editor.

2333	   o  Masinter integrates HRef material from HTML5 spec.

2335	   o  Rewrite introduction sections to modernize.

2337	13.3.  Changes from -04 to -05

2339	   o  Updated references.

2341	   o  Changed IPR text to pre5378Trust200902.

2343	13.4.  Changes from -03 to -04

2345	   o  Added explicit abbreviation for LEIRIs.

2347	   o  Mentioned LEIRI references.

2349	   o  Completed text in LEIRI section about tag characters and about
2350	      specials.

2352	13.5.  Changes from -02 to -03

2354	   o  Updated some references.

2356	   o  Updated Michel Suignard's coordinates.

2358	13.6.  Changes from -01 to -02

2360	   o  Added tag range to iprivate (issue private-include-tags-115).

2362	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

2364	13.7.  Changes from -00 to -01

2366	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
2367	      based on input from the W3C XML Core WG.  Moved the relevant
2368	      subsections to the back and promoted them to a section.

2370	   o  Added some text about Legacy Extended IRIs to the security section.

2372	   o  Added an IANA Considerations Section.

2374	   o  Added this Change Log Section.

2376	   o  Added a section about "IRIs with Spaces/Controls" (converting from
2377	      a Note in RFC 3987).

2379	13.8.  Changes from RFC 3987 to -00

2381	      Fixed errata (see
2382	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

2384	14.  References

2386	14.1.  Normative References

2388	   [ASCII]    American National Standards Institute, "Coded Character
2389	              Set -- 7-bit American Standard Code for Information
2390	              Interchange", ANSI X3.4, 1986.

2392	   [ISO10646]
2393	              International Organization for Standardization, "ISO/IEC
2394	              10646:2003: Information Technology - Universal Multiple-
2395	              Octet Coded Character Set (UCS)", ISO Standard 10646,
2396	              December 2003.

2398	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2399	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2401	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
2402	              "Internationalizing Domain Names in Applications (IDNA)",
2403	              RFC 3490, March 2003.

2405	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2406	              Profile for Internationalized Domain Names (IDN)",
2407	              RFC 3491, March 2003.

2409	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
2410	              10646", STD 63, RFC 3629, November 2003.

2412	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2413	              Resource Identifier (URI): Generic Syntax", STD 66,
2414	              RFC 3986, January 2005.

2416	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
2417	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

2419	   [UNI9]     Davis, M., "The Bidirectional Algorithm", Unicode Standard
2420	              Annex #9, March 2004,
2421	              <http://www.unicode.org/reports/tr9/tr9-13.html>.

2423	   [UNIV4]    The Unicode Consortium, "The Unicode Standard, Version
2424	              5.1.0, defined by: The Unicode Standard, Version 5.0
2425	              (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as
2426	              amended by Unicode 5.1.0
2427	              (http://www.unicode.org/versions/Unicode5.1.0/)",
2428	              April 2008.

2430	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
2431	              Unicode Standard Annex #15, March 2008,
2432	              <http://www.unicode.org/unicode/reports/tr15/
2433	              tr15-23.html>.

2435	14.2.  Informative References

2437	   [BidiEx]   "Examples of bidirectional IRIs",
2438	              <http://www.w3.org/International/iri-edit/BidiExamples>.

2440	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
2441	              Texin, "Character Model for the World Wide Web: Resource
2442	              Identifiers", World Wide Web Consortium Candidate
2443	              Recommendation, November 2004,
2444	              <http://www.w3.org/TR/charmod-resid>.

2446	   [Duerst97]
2447	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
2448	              11th International Unicode Conference, San Jose,
2449	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
2450	              papers/PDF/IUC11-UTF-8.pdf>.

2452	   [Gettys]   Gettys, J., "URI Model Consequences",
2453	              <http://www.w3.org/DesignIssues/ModelConsequences>.

2455	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
2456	              Specification", World Wide Web Consortium Recommendation,
2457	              December 1999,
2458	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

2460	   [HTML5]    Hickson, I. and D. Hyatt, "A vocabulary and associated
2461	              APIs for HTML and XHTML", World Wide Web
2462	              Consortium Working Draft, April 2009,
2463	              <http://www.w3.org/TR/2009/WD-html5-20090423/>.

2465	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
2466	              IRIs for XML resource identification", World Wide Web
2467	              Consortium Note, November 2008,
2468	              <http://www.w3.org/TR/leiri/>.

2470	   [RFC1738]  Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
2471	              Resource Locators (URL)", RFC 1738, December 1994.

2473	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
2474	              Extensions (MIME) Part One: Format of Internet Message
2475	              Bodies", RFC 2045, November 1996.

2477	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
2478	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
2479	              the IAB Character Set Workshop held 29 February - 1 March,
2480	              1996", RFC 2130, April 1997.

2482	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

2484	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

2486	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
2487	              Languages", BCP 18, RFC 2277, January 1998.

2489	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
2490	              URL scheme", RFC 2368, July 1998.

2492	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

2494	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2495	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
2496	              August 1998.

2498	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
2499	              August 1998.

2501	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
2502	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
2503	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

2505	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
2506	              Protocol", RFC 2640, July 1999.

2508	   [RFC4395]  Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
2509	              Registration Procedures for New URI Schemes", BCP 35,
2510	              RFC 4395, February 2006.

2512	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
2513	              Markup Languages", Unicode Technical Report #20, World
2514	              Wide Web Consortium Note, June 2003,
2515	              <http://www.w3.org/TR/unicode-xml/>.

2517	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
2518	              Language (XLink) Version 1.0", World Wide Web
2519	              Consortium Recommendation, June 2001,
2520	              <http://www.w3.org/TR/xlink/#link-locators>.

2522	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
2523	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
2524	              Edition)", World Wide Web Consortium Recommendation,
2525	              August 2006, <http://www.w3.org/TR/REC-xml>.

2527	   [XMLNamespace]
2528	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
2529	              "Namespaces in XML (Second Edition)", World Wide Web
2530	              Consortium Recommendation, August 2006,
2531	              <http://www.w3.org/TR/REC-xml-names>.

2533	   [XMLSchema]
2534	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
2535	              World Wide Web Consortium Recommendation, May 2001,
2536	              <http://www.w3.org/TR/xmlschema-2/#anyURI>.

2538	   [XPointer]
2539	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
2540	              Framework", World Wide Web Consortium Recommendation,
2541	              March 2003,
2542	              <http://www.w3.org/TR/xptr-framework/#escaping>.

2544	Appendix A.  Design Alternatives

2546	   This section briefly summarizes some design alternatives considered
2547	   earlier and the reasons why they were not chosen.

2549	A.1.  New Scheme(s)

2551	   Introducing new schemes (for example, httpi:, ftpi:,...) or a new
2552	   metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
2553	   i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
2554	   dependent or to distinguish between percent-encodings resulting from
2555	   IRI-to-URI conversion and percent-encodings from legacy character
2556	   encodings.

2558	   New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2559	   IRIs that contain non-ASCII characters).  The benefit of being able
2560	   to detect the origin of percent-encodings is marginal, as UTF-8 can
2561	   be detected with very high reliability.  Deploying new schemes is
2562	   extremely hard, so not requiring new schemes for IRIs makes
2563	   deployment of IRIs vastly easier.  Making conversion scheme-dependent
2564	   is highly inadvisable, yet separate schemes for IRIs would encourage
2565	   it.  Using a uniform convention for conversion from IRIs to URIs
2566	   makes IRI implementation orthogonal to the introduction of actual new
2567	   schemes.
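
   The claim that UTF-8 can be detected reliably rests on the fact that
   octet sequences from legacy encodings rarely form valid UTF-8.  A
   minimal sketch of such a check follows; the function name
   percent_octets_are_utf8 is an assumption made for this example, and a
   URI containing only ASCII octets passes trivially.

```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(uri):
    """Return True if the URI's decoded octets form valid UTF-8."""
    octets = unquote_to_bytes(uri)  # undo the percent-encoding
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "%C3%BC" is u-umlaut in UTF-8; "%FC" is u-umlaut in iso-8859-1.
print(percent_octets_are_utf8("http://www.example.org/D%C3%BCrst"))
print(percent_octets_are_utf8("http://www.example.org/D%FCrst"))
```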

2569	A.2.  Character Encodings Other Than UTF-8

2571	   At an early stage, UTF-7 was considered as an alternative to UTF-8
2572	   when IRIs are converted to URIs.  UTF-7 would not have needed
2573	   percent-encoding and in most cases would have been shorter than
2574	   percent-encoded UTF-8.

2576	   Using UTF-8 avoids a double layering and overloading of the use of
2577	   the "+" character.  UTF-8 is fully compatible with US-ASCII and has
2578	   therefore been recommended by the IETF, and is being used widely.

2580	   UTF-7 has never been used much and is now clearly being discouraged.
2581	   Requiring implementations to convert from UTF-8 to UTF-7 and back
2582	   would be an additional implementation burden.
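
   The trade-off can be made concrete with a small comparison.  The
   sketch below is illustrative only (the function name
   compare_encodings is an assumption made for this example): it
   produces the UTF-7 form and the percent-encoded UTF-8 form of the
   same text.  UTF-7 uses "+" as its shift character, so a literal "+"
   has to be escaped within UTF-7 itself, which is the overloading
   mentioned above.

```python
from urllib.parse import quote


def compare_encodings(text):
    """Return (UTF-7 form, percent-encoded UTF-8 form) of text."""
    as_utf7 = text.encode("utf-7").decode("ascii")
    as_pct_utf8 = quote(text, safe="")
    return as_utf7, as_pct_utf8


# Non-ASCII text: the UTF-7 form is the shorter of the two here.
utf7, pct = compare_encodings("D\u00fcrst")
print(utf7, len(utf7))
print(pct, len(pct))

# A literal "+" cannot stay as-is in UTF-7.
plus_utf7, plus_pct = compare_encodings("a+b")
print(plus_utf7, plus_pct)
```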

2584	A.3.  New Encoding Convention

2586	   Instead of using the existing percent-encoding convention of URIs,
2587	   which is based on octets, the idea was to create a new encoding
2588	   convention; for example, to use "%u" to introduce UCS code points.

2590	   Using the existing octet-based percent-encoding mechanism does not
2591	   need an upgrade of the URI syntax and does not need corresponding
2592	   server upgrades.
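
   The difference between the two conventions can be sketched as
   follows.  The "%u" form shown here is purely hypothetical (the
   function code_point_based exists only for this example), whereas the
   octet-based form is what an existing decoder such as Python's
   urllib.parse.unquote already understands.

```python
from urllib.parse import quote, unquote


def octet_based(char):
    """Existing convention: percent-encode the UTF-8 octets."""
    return quote(char, safe="")


def code_point_based(char):
    """Hypothetical "%u" convention: encode the UCS code point."""
    return "%u{:04X}".format(ord(char))


e_acute = "\u00e9"
print(octet_based(e_acute))       # %C3%A9
print(code_point_based(e_acute))  # %u00E9

# An existing decoder handles the octet-based form, but leaves the
# "%u" form untouched because "u0" is not a pair of hex digits.
print(unquote(octet_based(e_acute)) == e_acute)
print(unquote(code_point_based(e_acute)))
```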

2594	A.4.  Indicating Character Encodings in the URI/IRI

2596	   Some proposals suggested indicating the character encodings used in
2597	   a URI or IRI with some new syntactic convention in the URI itself,
2598	   similar to the "charset" parameter for e-mails and Web pages.  As an
2599	   example, the label in square brackets in
2600	   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
2601	   following "&#xE9;" had to be interpreted as iso-8859-1.

2603	   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
2604	   needed.  It avoids potentially multiple labels that have to be copied
2605	   correctly in all cases, even on the side of a bus or on a napkin,
2606	   leading to usability problems (and being prohibitively annoying).
2607	   Exclusively using UTF-8 also reduces transcoding errors and
2608	   confusion.

2610	Authors' Addresses

2612	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
2613	             possible, for example as "D&#252;rst" in XML and HTML.)
2614	   Aoyama Gakuin University
2615	   5-10-1 Fuchinobe
2616	   Sagamihara, Kanagawa  229-8558
2617	   Japan

2619	   Phone: +81 42 759 6329
2620	   Fax:   +81 42 759 6495
2621	   Email: mailto:duerst@it.aoyama.ac.jp
2622	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
2623	          (Note: This is the percent-encoded form of an IRI.)

2625	   Michel Suignard
2626	   Unicode Consortium
2627	   P.O. Box 391476
2628	   Mountain View, CA  94039-1476
2629	   U.S.A.

2631	   Phone: +1-650-693-3921
2632	   Email: mailto:michel@unicode.org
2633	   URI:   http://www.suignard.com

2634	   Larry Masinter
2635	   Adobe
2636	   345 Park Ave
2637	   San Jose, CA  95110
2638	   U.S.A.

2640	   Phone: +1-408-536-3024
2641	   Email: mailto:masinter@adobe.com
2642	   URI:   http://larry.masinter.net