idnits 2.17.1 

draft-ietf-iri-3987bis-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == The 'Obsoletes: ' line in the draft header should list only the
     _numbers_ of the RFCs which will be obsoleted by this document (if
     approved); it should not include the word 'RFC' in the list.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (January 29, 2010) is 5198 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 4395
     (Obsoleted by RFC 7595)


     Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource                                     M. Duerst
3	Identifiers (iri)                               Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: RFC 3987                                   Unicode Consortium
6	(if approved)                                                L. Masinter
7	Intended status: Standards Track                                   Adobe
8	Expires: August 2, 2010                                 January 29, 2010

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-00

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   In addition, this document provides named additional rule sets for
22	   processing otherwise invalid IRIs, in a way that supports other
23	   specifications that wish to mandate common behavior for 'error'
24	   handling.  In particular, rules used in some XML languages (LEIRI)
25	   and web applications are given.

27	   Defining IRI as a new protocol element (rather than updating or
28	   extending the definition of URI) allows independent orderly
29	   transitions: other protocols and languages that use URIs must
30	   explicitly choose to allow IRIs.

32	   Guidelines are provided for the use and deployment of IRIs and
33	   related protocol elements when revising protocols, formats, and
34	   software components that currently deal only with URIs.

36	   [RFC Editor: Please remove this paragraph before publication.]  This
37	   document is intended to update RFC 3987 and move towards IETF Draft
38	   Standard.  This version is essentially identical to
39	   draft-duerst-iri-bis-07.txt, and is submitted as an initial draft to
40	   start WG discussions.  For discussion and comments on this draft,
41	   please join the IETF IRI WG by subscribing to the mailing list
42	   public-iri@w3.org.

44	Status of this Memo

46	   This Internet-Draft is submitted to IETF in full conformance with the
47	   provisions of BCP 78 and BCP 79.

49	   Internet-Drafts are working documents of the Internet Engineering
50	   Task Force (IETF), its areas, and its working groups.  Note that
51	   other groups may also distribute working documents as Internet-
52	   Drafts.

54	   Internet-Drafts are draft documents valid for a maximum of six months
55	   and may be updated, replaced, or obsoleted by other documents at any
56	   time.  It is inappropriate to use Internet-Drafts as reference
57	   material or to cite them other than as "work in progress."

59	   The list of current Internet-Drafts can be accessed at
60	   http://www.ietf.org/ietf/1id-abstracts.txt.

62	   The list of Internet-Draft Shadow Directories can be accessed at
63	   http://www.ietf.org/shadow.html.

65	   This Internet-Draft will expire on August 2, 2010.

67	Copyright Notice

69	   Copyright (c) 2010 IETF Trust and the persons identified as the
70	   document authors.  All rights reserved.

72	   This document is subject to BCP 78 and the IETF Trust's Legal
73	   Provisions Relating to IETF Documents
74	   (http://trustee.ietf.org/license-info) in effect on the date of
75	   publication of this document.  Please review these documents
76	   carefully, as they describe your rights and restrictions with respect
77	   to this document.  Code Components extracted from this document must
78	   include Simplified BSD License text as described in Section 4.e of
79	   the Trust Legal Provisions and are provided without warranty as
80	   described in the BSD License.

82	   This document may contain material from IETF Documents or IETF
83	   Contributions published or made publicly available before November
84	   10, 2008.  The person(s) controlling the copyright in some of this
85	   material may not have granted the IETF Trust the right to allow
86	   modifications of such material outside the IETF Standards Process.
87	   Without obtaining an adequate license from the person(s) controlling
88	   the copyright in such materials, this document may not be modified
89	   outside the IETF Standards Process, and derivative works of it may
90	   not be created outside the IETF Standards Process, except to format
91	   it for publication as an RFC or to translate it into languages other
92	   than English.

94	Table of Contents

96	   1.  Introduction
97	     1.1.  Overview and Motivation
98	     1.2.  Applicability
99	     1.3.  Definitions
100	     1.4.  Notation
101	   2.  IRI Syntax
102	     2.1.  Summary of IRI Syntax
103	     2.2.  ABNF for IRI References and IRIs
104	   3.  Processing IRIs and related protocol elements
105	     3.1.  Converting to UCS
106	     3.2.  Parse the IRI into IRI components
107	     3.3.  General percent-encoding of IRI components
108	     3.4.  Mapping ireg-name
109	     3.5.  Mapping query components
110	     3.6.  Mapping IRIs to URIs
111	     3.7.  Converting URIs to IRIs
112	       3.7.1.  Examples
113	   4.  Bidirectional IRIs for Right-to-Left Languages
114	     4.1.  Logical Storage and Visual Presentation
115	     4.2.  Bidi IRI Structure
116	     4.3.  Input of Bidi IRIs
117	     4.4.  Examples
118	   5.  Normalization and Comparison
119	     5.1.  Equivalence
120	     5.2.  Preparation for Comparison
121	     5.3.  Comparison Ladder
122	       5.3.1.  Simple String Comparison
123	       5.3.2.  Syntax-Based Normalization
124	       5.3.3.  Scheme-Based Normalization
125	       5.3.4.  Protocol-Based Normalization
126	   6.  Use of IRIs
127	     6.1.  Limitations on UCS Characters Allowed in IRIs
128	     6.2.  Software Interfaces and Protocols
129	     6.3.  Format of URIs and IRIs in Documents and Protocols
130	     6.4.  Use of UTF-8 for Encoding Original Characters
131	     6.5.  Relative IRI References
132	   7.  Liberal handling of otherwise invalid IRIs
133	     7.1.  LEIRI processing
134	     7.2.  Web Address processing
135	     7.3.  Characters not allowed in IRIs
136	   8.  URI/IRI Processing Guidelines (Informative)
137	     8.1.  URI/IRI Software Interfaces
138	     8.2.  URI/IRI Entry
139	     8.3.  URI/IRI Transfer between Applications
140	     8.4.  URI/IRI Generation
141	     8.5.  URI/IRI Selection
142	     8.6.  Display of URIs/IRIs
143	     8.7.  Interpretation of URIs and IRIs
144	     8.8.  Upgrading Strategy
145	   9.  IANA Considerations
146	   10. Security Considerations
147	   11. Acknowledgements
148	   12. Open Issues
149	   13. Change Log
150	     13.1. Changes from draft-duerst-iri-bis-07 to
151	           draft-ietf-iri-3987bis-00
152	     13.2. Changes from -06 to -07 of draft-duerst-iri-bis
153	       13.2.1. OLD WAY
154	       13.2.2. NEW WAY
155	     13.3. Changes from -05 to -06 of draft-duerst-iri-bis
156	     13.4. Changes from -04 to -05 of draft-duerst-iri-bis
157	     13.5. Changes from -03 to -04 of draft-duerst-iri-bis
158	     13.6. Changes from -02 to -03 of draft-duerst-iri-bis
159	     13.7. Changes from -01 to -02 of draft-duerst-iri-bis
160	     13.8. Changes from -00 to -01 of draft-duerst-iri-bis
161	     13.9. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
162	   14. References
163	     14.1. Normative References
164	     14.2. Informative References
165	   Appendix A.  Design Alternatives
166	     A.1.  New Scheme(s)
167	     A.2.  Character Encodings Other Than UTF-8
168	     A.3.  New Encoding Convention
169	     A.4.  Indicating Character Encodings in the URI/IRI
170	   Authors' Addresses

172	1.  Introduction

174	1.1.  Overview and Motivation

176	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
177	   sequence of characters chosen from a limited subset of the repertoire
178	   of US-ASCII [ASCII] characters.

180	   The characters in URIs are frequently used for representing words of
181	   natural languages.  This usage has many advantages: Such URIs are
182	   easier to memorize, easier to interpret, easier to transcribe, easier
183	   to create, and easier to guess.  For most languages other than
184	   English, however, the natural script uses characters other than A -
185	   Z. For many people, handling Latin characters is as difficult as
186	   handling the characters of other scripts is for those who use only
187	   the Latin alphabet.  Many languages with non-Latin scripts are
188	   transcribed with Latin letters.  These transcriptions are now often
189	   used in URIs, but they introduce additional difficulties.

191	   The infrastructure for the appropriate handling of characters from
192	   additional scripts is now widely deployed in operating system and
193	   application software.  Software that can handle a wide variety of
194	   scripts and languages at the same time is increasingly common.  Also,
195	   an increasing number of protocols and formats can carry a wide range
196	   of characters.

198	   URIs are used both as a protocol element (for transmission and
199	   processing by software) and as a presentation element (for display
200	   and handling by people who read, interpret, coin, or guess them).
201	   The transition between these roles is more difficult and complex when
202	   dealing with a larger set of characters than is allowed for URIs in
203	   [RFC3986].

205	   This document defines the protocol element called Internationalized
206	   Resource Identifier (IRI), which allows applications of URIs to be
207	   extended to use resource identifiers that have a much wider
208	   repertoire of characters.  It also provides corresponding
209	   "internationalized" versions of other constructs from [RFC3986], such
210	   as URI references.  The syntax of IRIs is defined in Section 2.

212	   Using characters outside of A - Z in IRIs adds a number of
213	   difficulties.  Section 4 discusses the special case of bidirectional
214	   IRIs using characters from scripts written right-to-left.  Section 5
215	   discusses various forms of equivalence between IRIs.  Section 6
216	   discusses the use of IRIs in different situations.  Section 8 gives
217	   additional informative guidelines.  Section 10 discusses IRI-specific
218	   security considerations.

220	1.2.  Applicability

222	   IRIs are designed to allow protocols and software that deal with URIs
223	   to be updated to handle IRIs.  A "URI scheme" (as defined by
224	   [RFC3986] and registered through the IANA process defined in
225	   [RFC4395]) also serves as an "IRI scheme".  Processing of IRIs is
226	   accomplished by extending the URI syntax while retaining (and not
227	   expanding) the set of "reserved" characters, such that the syntax for
228	   any URI scheme may be uniformly extended to allow non-ASCII
229	   characters.  In addition, following parsing of an IRI, it is possible
230	   to construct a corresponding URI by first encoding characters outside
231	   of the allowed URI range and then reassembling the components.
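
   As a rough illustration of the encoding step described above, the
   following Python sketch percent-encodes every non-ASCII character
   as its UTF-8 octets.  The function name is hypothetical, and the
   sketch deliberately skips the parsing and per-component handling
   (for example, of ireg-name) that Section 3 defines, so it should
   not be read as the complete mapping.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    ASCII characters, including reserved delimiters and any existing
    percent-encoded triplets, are copied through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # Already within the US-ASCII repertoire: keep as is.
            out.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)
```

   For instance, the input "http://example.org/r&#x E9;sum&#x E9;"
   (written here without the spaces inside the character references)
   would come out as "http://example.org/r%C3%A9sum%C3%A9".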

233	   Practical use of IRI forms in place of URI forms depends on the
234	   following conditions being met:

236	   a. A protocol or format element MUST be explicitly designated to be
237	      able to carry IRIs.  The intent is to avoid introducing IRIs into
238	      contexts that are not defined to accept them.  For example, XML
239	      schema [XMLSchema] has an explicit type "anyURI" that includes
240	      IRIs and IRI references.  Therefore, IRIs and IRI references can
241	      be in attributes and elements of type "anyURI".  On the other
242	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
243	      defined as a URI, which means that direct use of IRIs is not
244	      allowed in HTTP requests.

246	   b. The protocol or format carrying the IRIs MUST have a mechanism to
247	      represent the wide range of characters used in IRIs, either
248	      natively or by some protocol- or format-specific escaping
249	      mechanism (for example, numeric character references in [XML1]).

251	   c. The URI scheme definition, if it explicitly allows a percent sign
252	      ("%") in any syntactic component, SHOULD define the interpretation
253	      of sequences of percent-encoded octets (using "%XX" hex octets) as
254	      octets from sequences of UTF-8 encoded strings; this is recommended
255	      in the guidelines for registering new schemes, [RFC4395].  For
256	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
257	      [RFC2384] and the URN syntax [RFC2141].  Note that use of
258	      percent-encoding may also be restricted in some situations, for
259	      example, URI schemes that disallow percent-encoding might still be
260	      used with a fragment identifier which is percent-encoded (e.g.,
261	      [XPointer]).  See Section 6.4 for further discussion.
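
   Condition c. above can be pictured with a short Python sketch that
   interprets percent-encoded octets as UTF-8.  It relies on
   urllib.parse.unquote from the Python standard library; the wrapper
   name is hypothetical, and strict error handling is an assumption
   chosen here so that octet sequences that are not valid UTF-8 are
   reported rather than silently replaced.

```python
from urllib.parse import unquote


def decode_utf8_percent(component: str) -> str:
    """Decode "%XX" triplets in a component, reading the octets as UTF-8.

    Raises UnicodeDecodeError if the octets are not valid UTF-8.
    """
    return unquote(component, encoding="utf-8", errors="strict")
```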

263	1.3.  Definitions

265	   The following definitions are used in this document; they follow the
266	   terms in [RFC2130], [RFC2277], and [ISO10646].

268	   character:  A member of a set of elements used for the organization,
269	      control, or representation of data.  For example, "LATIN CAPITAL
270	      LETTER A" names a character.

272	   octet:  An ordered sequence of eight bits considered as a unit.

274	   character repertoire:  A set of characters (set in the mathematical
275	      sense).

277	   sequence of characters:  A sequence of characters (one after
278	      another).

280	   sequence of octets:  A sequence of octets (one after another).

282	   character encoding:  A method of representing a sequence of
283	      characters as a sequence of octets (maybe with variants).  Also, a
284	      method of (unambiguously) converting a sequence of octets into a
285	      sequence of characters.

287	   charset:  The name of a parameter or attribute used to identify a
288	      character encoding.

290	   UCS:  Universal Character Set. The coded character set defined by
291	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

293	   IRI reference:  Denotes the common usage of an Internationalized
294	      Resource Identifier.  An IRI reference may be absolute or
295	      relative.  However, the "IRI" that results from such a reference
296	      only includes absolute IRIs; any relative IRI references are
297	      resolved to their absolute form.  Note that in [RFC2396] URIs did
298	      not include fragment identifiers, but in [RFC3986] fragment
299	      identifiers are part of URIs.

301	   URL:  The term "URL" was originally used [RFC1738] for roughly what
302	      is now called a "URI".  Books, software and documentation often
303	      refer to URIs and IRIs using the "URL" term.  Some usages
304	      restrict "URL" to those URIs which are not URNs.  Because of the
305	      ambiguity of the term, using the term "URL" is NOT RECOMMENDED in
306	      formal documents.

308	   LEIRI (Legacy Extended IRI) processing:  This term was used in
309	      various XML specifications to refer to strings that, although not
310	      valid IRIs, were acceptable input to the processing rules in
311	      Section 7.1.

313	   (Web Address, Hypertext Reference, HREF):  These terms have been
314	      added in this document for convenience, to allow other
315	      specifications to refer to those strings that, although not valid
316	      IRIs, are acceptable input to the processing rules in Section 7.2.
317	      This usage corresponds to the parsing rules of some popular web
318	      browsing applications.  ISSUE: Need to find a good name/
319	      abbreviation for these.

321	   running text:  Human text (paragraphs, sentences, phrases) with
322	      syntax according to orthographic conventions of a natural
323	      language, as opposed to syntax defined for ease of processing by
324	      machines (e.g., markup, programming languages).

326	   protocol element:  Any portion of a message that affects processing
327	      of that message by the protocol in question.

329	   presentation element:  A presentation form corresponding to a
330	      protocol element; for example, using a wider range of characters.

332	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
333	      used for the initial creation.  This may be the initial creation
334	      of a resource with a certain identifier, or the initial exposition
335	      of a resource under a particular identifier.

337	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
338	      used when the identifier is generated by derivation from other
339	      information.

341	   parsed URI component:  When a URI processor parses a URI (following
342	      the generic syntax or a scheme-specific syntax), the result is a
343	      set of parsed URI components, each of which has a type
344	      (corresponding to the syntactic definition) and a sequence of URI
345	      characters.

347	   parsed IRI component:  When an IRI processor parses an IRI directly,
348	      following the general syntax or a scheme-specific syntax, the
349	      result is a set of parsed IRI components, each of which has a type
350	      (corresponding to the syntactic definition) and a sequence of IRI
351	      characters.  (This definition is analogous to "parsed URI
352	      component".)

354	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
355	      scheme's syntax has been extended to allow non-US-ASCII characters
356	      according to the rules in this document.

358	1.4.  Notation

360	   RFCs and Internet Drafts currently do not allow any characters
361	   outside the US-ASCII repertoire.  Therefore, this document uses
362	   various special notations to denote such characters in examples.

364	   In text, characters outside US-ASCII are sometimes referenced by
365	   using a prefix of 'U+', followed by four to six hexadecimal digits.

367	   To represent characters outside US-ASCII in examples, this document
368	   uses two notations: 'XML Notation' and 'Bidi Notation'.

370	   XML Notation uses a leading '&#x', a trailing ';', and the
371	   hexadecimal number of the character in the UCS in between.  For
372	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
373	   notation, an actual '&' is denoted by '&amp;'.
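
   Readers who want to turn examples written in XML Notation back into
   actual characters could use something like the Python sketch below.
   The function name is hypothetical, and the sketch handles only the
   two forms described above (hexadecimal character references and
   '&amp;'); it is not a general XML entity decoder.

```python
import re

# Either a hexadecimal character reference or the escaped ampersand.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text: str) -> str:
    """Replace '&#xHHHH;' with the UCS character and '&amp;' with '&'."""

    def replace(match: "re.Match[str]") -> str:
        hex_digits = match.group(1)
        if hex_digits is None:
            # The '&amp;' alternative matched.
            return "&"
        return chr(int(hex_digits, 16))

    return _XML_NOTATION.sub(replace, text)
```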

375	   Bidi Notation is used for bidirectional examples: Lower case letters
376	   stand for Latin letters or other letters that are written left to
377	   right, whereas upper case letters represent Arabic or Hebrew letters
378	   that are written right to left.

380	   To denote actual octets in examples (as opposed to percent-encoded
381	   octets), the two hex digits denoting the octet are enclosed in "<"
382	   and ">".  For example, the octet often denoted as 0xc9 is denoted
383	   here as <c9>.

385	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
386	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
387	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

389	2.  IRI Syntax

391	   This section defines the syntax of Internationalized Resource
392	   Identifiers (IRIs).

394	   As with URIs, an IRI is defined as a sequence of characters, not as a
395	   sequence of octets.  This definition accommodates the fact that IRIs
396	   may be written on paper or read over the radio as well as stored or
397	   transmitted digitally.  The same IRI might be represented as
398	   different sequences of octets in different protocols or documents if
399	   these protocols or documents use different character encodings
400	   (and/or transfer encodings).  Using the same character encoding as
401	   the containing protocol or document ensures that the characters in
402	   the IRI can be handled (e.g., searched, converted, displayed) in the
403	   same way as the rest of the protocol or document.
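
   The point that one IRI may correspond to several octet sequences
   can be checked with a small Python sketch.  The helper name and the
   choice of UTF-8 and ISO-8859-1 as the two character encodings are
   illustrative assumptions only.

```python
def octets_in(iri: str, charset: str) -> bytes:
    """Return the octets that represent an IRI in a given character encoding."""
    return iri.encode(charset)


# The same character sequence, two different octet sequences.
iri = "http://example.org/caf\u00e9"
as_utf8 = octets_in(iri, "utf-8")
as_latin1 = octets_in(iri, "iso-8859-1")
```

   Here U+00E9 becomes the two octets <c3> <a9> in UTF-8 and the
   single octet <e9> in ISO-8859-1, yet decoding either sequence with
   its own character encoding gives back the same IRI.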

405	2.1.  Summary of IRI Syntax

407	   IRIs are defined by extending the URI syntax in [RFC3986], but
408	   extending the class of unreserved characters by adding the characters
409	   of the UCS (Universal Character Set, [ISO10646]) beyond U+007F,
410	   subject to the limitations given in the syntax rules below and in
411	   Section 6.1.

413	   The syntax and use of components and reserved characters is the same
414	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
415	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
416	   scheme are extended to allow parsing of IRIs using the same
417	   parsing rules.

419	   All the operations defined in [RFC3986], such as the resolution of
420	   relative references, can be applied to IRIs by IRI-processing
421	   software in exactly the same way as they are for URIs by URI-
422	   processing software.
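
   As one hedged illustration of applying an [RFC3986] operation to
   IRIs unchanged, the Python sketch below resolves a relative IRI
   reference with urllib.parse.urljoin from the standard library.
   That module is a URI library that happens to operate on Unicode
   strings; treating it as an IRI processor is an assumption made for
   this example, and the wrapper name is hypothetical.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base: str, reference: str) -> str:
    """Resolve a relative IRI reference against a base IRI.

    Non-ASCII characters are ordinary path characters here, so the
    dot-segment handling works as it does for URIs.
    """
    return urljoin(base, reference)
```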

424	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
425	   therefore MUST NOT be used for syntactical purposes, such as to
426	   delimit components in newly defined schemes.  For example, U+00A2,
427	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
428	   the 'iunreserved' category.  This is similar to the fact that it is
429	   not possible to use '-' as a delimiter in URIs, because it is in the
430	   'unreserved' category.

432	2.2.  ABNF for IRI References and IRIs

434	   An ABNF definition for IRI references (which are the most general
435	   concept and the start of the grammar) and IRIs is given here.  The
436	   syntax of this ABNF is described in [STD68].  Character numbers are
437	   taken from the UCS, without implying any actual binary encoding.
438	   Terminals in the ABNF are characters, not octets.

440	   The following grammar closely follows the URI grammar in [RFC3986],
441	   except that the range of unreserved characters is expanded to include
442	   UCS characters, with the restriction that private UCS characters can
443	   occur only in query parts.  The grammar is split into two parts:
444	   Rules that differ from [RFC3986] because of the above-mentioned
445	   expansion, and rules that are the same as those in [RFC3986].  For
446	   rules that are different than those in [RFC3986], the names of the
447	   non-terminals have been changed as follows.  If the non-terminal
448	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
449	   has been prefixed.

451	   The following rules are different from those in [RFC3986]:

453	   IRI            = scheme ":" ihier-part [ "?" iquery ]
454	                    [ "#" ifragment ]

456	   ihier-part     = "//" iauthority ipath-abempty
457	                  / ipath-absolute
458	                  / ipath-rootless
459	                  / ipath-empty

461	   IRI-reference  = IRI / irelative-ref

463	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

465	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

467	   irelative-part = "//" iauthority ipath-abempty
468	                  / ipath-absolute
469	                  / ipath-noscheme
470	                  / ipath-empty

472	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
473	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
474	   ihost          = IP-literal / IPv4address / ireg-name

476	   pct-form       = pct-encoded

478	   ireg-name      = *( iunreserved / sub-delims )

480	   ipath          = ipath-abempty   ; begins with "/" or is empty
481	                  / ipath-absolute  ; begins with "/" but not "//"
482	                  / ipath-noscheme  ; begins with a non-colon segment
483	                  / ipath-rootless  ; begins with a segment
484	                  / ipath-empty     ; zero characters

486	   ipath-abempty  = *( path-sep isegment )
487	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
488	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
489	   ipath-rootless = isegment-nz *( path-sep isegment )
490	   ipath-empty    = 0<ipchar>
491	   path-sep       = "/"

493	   isegment       = *ipchar
494	   isegment-nz    = 1*ipchar
495	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
496	                        / "@" )
497	                  ; non-zero-length segment without any colon ":"

499	   ipchar         = iunreserved / pct-form / sub-delims / ":"
500	                  / "@"

502	   iquery         = *( ipchar / iprivate / "/" / "?" )

504	   ifragment      = *( ipchar / "/" / "?" / "#" )

506	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

508	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
509	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
510	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
511	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
512	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
513	                  / %xD0000-DFFFD / %xE1000-EFFFD

515	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
516	                  / %x100000-10FFFD

518	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
519	   "greedy") algorithm applies.  For details, see [RFC3986].

521	   The following rules are the same as those in [RFC3986]:

523	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

525	   port           = *DIGIT

527	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

529	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

531	   IPv6address    =                            6( h16 ":" ) ls32
532	                  /                       "::" 5( h16 ":" ) ls32
533	                  / [               h16 ] "::" 4( h16 ":" ) ls32
534	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
535	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
536	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
537	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
538	                  / [ *5( h16 ":" ) h16 ] "::"              h16
539	                  / [ *6( h16 ":" ) h16 ] "::"

541	   h16            = 1*4HEXDIG
542	   ls32           = ( h16 ":" h16 ) / IPv4address

544	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

546	   dec-octet      = DIGIT                 ; 0-9
547	                  / %x31-39 DIGIT         ; 10-99
548	                  / "1" 2DIGIT            ; 100-199
549	                  / "2" %x30-34 DIGIT     ; 200-249
550	                  / "25" %x30-35          ; 250-255

552	   pct-encoded    = "%" HEXDIG HEXDIG

554	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
555	   reserved       = gen-delims / sub-delims
556	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
557	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
558	                  / "*" / "+" / "," / ";" / "="

560	   This syntax does not support IPv6 scoped addressing zone identifiers.
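
   As an illustration only, the ucschar and iprivate productions above
   can be checked with a simple range table.  The following is a
   minimal sketch; the names UCSCHAR_RANGES, IPRIVATE_RANGES,
   is_ucschar, and is_iprivate are hypothetical and are not defined by
   this document.

```python
# Code point ranges copied from the ucschar and iprivate productions.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF), (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def _in_ranges(ch: str, ranges) -> bool:
    """Return True if the single character ch falls in any (low, high) range."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    """True if ch matches the ucschar production."""
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    """True if ch matches the iprivate production (allowed only in iquery)."""
    return _in_ranges(ch, IPRIVATE_RANGES)
```

   Under these assumptions, U+00E9 is a ucschar, while U+E000 is an
   iprivate character and therefore matches only within iquery.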

562	3.  Processing IRIs and related protocol elements

564	   IRIs are meant to replace URIs in identifying resources within new
565	   versions of protocols, formats, and software components that use a
566	   UCS-based character repertoire.  Protocols and components may use and
567	   process IRIs directly.  However, there are still numerous systems and
568	   protocols which only accept URIs or components of parsed URIs; that
569	   is, they only accept sequences of characters within the subset of US-
570	   ASCII characters allowed in URIs.

572	   This section defines specific processing steps for IRI consumers
573	   which establish the relationship between the string given and the
574	   interpreted derivatives.  These processing steps apply to both IRIs
575	   and IRI references (i.e., absolute or relative forms); for IRIs, some
576	   steps are scheme specific.

578	3.1.  Converting to UCS

580	   Input that is already in a Unicode form (i.e., a sequence of Unicode
581	   characters or an octet-stream representing a Unicode-based character
582	   encoding such as UTF-8 or UTF-16) should be left as is and not
583	   normalized (see Section 5.3.2.2).

585	   If the IRI or IRI reference is an octet stream in some known non-
586	   Unicode character encoding, convert the IRI to a sequence of
587	   characters from the UCS; this sequence SHOULD also be normalized
588	   according to Unicode Normalization Form C (NFC, [UTR15]).  In this
589	   case, retain the original character encoding as the "document
590	   character encoding".  (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
591	   DO, CHANGE? )

593	   In other cases (written on paper, read aloud, or otherwise
594	   represented independent of any character encoding) represent the IRI
595	   as a sequence of characters from the UCS normalized according to
596	   Unicode Normalization Form C (NFC, [UTR15]).
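
   The two conversion cases above might be sketched as follows.  This
   is a non-normative illustration; the function names octets_to_ucs
   and transcribed_to_ucs are hypothetical, and the sketch assumes that
   the character encoding is known and supported by the runtime.

```python
import unicodedata


def octets_to_ucs(data: bytes, encoding: str):
    """Decode octets in a known non-Unicode encoding and normalize to NFC.

    Returns the character sequence together with the original encoding,
    which is retained as the "document character encoding".
    """
    text = data.decode(encoding)
    return unicodedata.normalize("NFC", text), encoding


def transcribed_to_ucs(text: str) -> str:
    """Normalize an IRI transcribed from paper or speech to NFC."""
    return unicodedata.normalize("NFC", text)
```

   Input that is already in a Unicode form is deliberately not passed
   through either function, because it is to be left as is.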

598	3.2.  Parse the IRI into IRI components

600	   Parse the IRI, either as a relative reference (no scheme) or using
601	   scheme specific processing (according to the scheme given); the
602	   result is a set of parsed IRI components.  (NOTE: FIX
603	   BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES THAT USE GENERIC
604	   SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN ONLY USE AUTHORITY FOR NAMES
605	   THAT FOLLOW PUNICODE.)

607	   NOTE: Parsing into components establishes a correspondence between
608	   substrings of the IRI and the productions they match.  For example,
609	   in [HTML5], the protocol components of
610	   interest are SCHEME (scheme), HOST (ireg-name), PORT (port), the PATH
611	   (ipath after the initial "/"), QUERY (iquery), FRAGMENT (ifragment),
612	   and AUTHORITY (iauthority).

614	   Subsequent processing rules are sometimes used to define other
615	   syntactic components.  For example, [HTML5] defines APIs for IRI
616	   processing; in these APIs:

618	   HOSTSPECIFIC  the substring that follows the substring matched by the
619	      iauthority production, or the whole string if the iauthority
620	      production wasn't matched.

622	   HOSTPORT  if there is a scheme component and a port component and the
623	      port given by the port component is different than the default
624	      port defined for the protocol given by the scheme component, then
625	      HOSTPORT is the substring that starts with the substring matched
626	      by the host production and ends with the substring matched by the
627	      port production, and includes the colon in between the two.
628	      Otherwise, it is the same as the host component.
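
   The HOSTPORT definition above can be read as the following sketch.
   The DEFAULT_PORTS table and the hostport function are assumptions
   made for illustration; this document does not define a registry of
   default ports, and the sketch treats an empty port as absent.

```python
# Assumed default ports, for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme, host, port):
    """Return HOSTPORT given already-parsed scheme, host, and port strings.

    scheme and port may be None (or empty) when the component is absent.
    """
    if scheme and port and int(port) != DEFAULT_PORTS.get(scheme.lower()):
        # Non-default port: host, the colon, and the port.
        return host + ":" + port
    # No scheme, no port, or the default port: same as the host component.
    return host
```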

630	3.3.  General percent-encoding of IRI components

632	   For most IRI components, it is possible to map the IRI component to
633	   an equivalent URI component by percent-encoding those characters not
634	   allowed in URIs.  Previous processing steps will have removed some
635	   characters, and the interpretation of reserved characters will have
636	   already been done (with the syntactic reserved characters outside of
637	   the IRI component).  This mapping is defined for all sequences of
638	   Unicode characters, whether or not they are valid for the component
639	   in question.

641	   For each character which is not allowed in a valid URI (NOTE: WHAT IS
642	   THE RIGHT REFERENCE HERE), apply the following steps.

644	   Convert to UTF-8  Convert the character to a sequence of one or more
645	      octets using UTF-8 [RFC3629].

647	   Percent encode  Convert each octet of this sequence to %HH, where HH
648	      is the hexadecimal notation of the octet value.  The hexadecimal
649	      notation SHOULD use uppercase letters.  (This is the general URI
650	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

652	   Note that the mapping is an identity transformation for parsed URI
653	   components of valid URIs, and is idempotent: applying the mapping a
654	   second time will not change anything.
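
   A minimal sketch of this mapping is shown below.  Because the right
   reference for "not allowed in a valid URI" is still open, the sketch
   assumes that the allowed set is the RFC 3986 unreserved and reserved
   characters plus "%"; the names URI_CHARS and
   percent_encode_component are hypothetical.

```python
# Assumption: characters allowed in a URI are unreserved, reserved, and "%".
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~"          # remaining unreserved characters
    ":/?#[]@"       # gen-delims
    "!$&'()*+,;="   # sub-delims
    "%"             # introduces an existing percent-encoding
)


def percent_encode_component(component: str) -> str:
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)
```

   Since "%" is left untouched, existing percent-encodings are not
   encoded again, which is what makes the mapping idempotent.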

656	3.4.  Mapping ireg-name

658	   Schemes that allow non-ASCII based characters in the reg-name (ireg-
659	   name) position MUST convert the ireg-name component of an IRI as
660	   follows:

662	   Replace the ireg-name part of the IRI by the part converted using the
663	   ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
664	   separated label, and by using U+002E (FULL STOP) as a label
665	   separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
666	   flag AllowUnassigned set to FALSE.  The ToASCII operation may fail,
667	   which means that the IRI cannot be resolved.  If the domain name
668	   conversion fails, then the entire IRI conversion fails.
669	   Processors that have no mechanism for signalling a failure
670	   MAY instead substitute an otherwise invalid host name, although such
671	   processing SHOULD be avoided.

673	   For example, the IRI
674	   "http://r&#xE9;sum&#xE9;.example.org"
675	   MAY be converted to
676	   "http://xn--rsum-bpad.example.org"
677	   ; conversion to percent-encoded form, e.g.,
678	   "http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed.
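
   As a rough illustration, Python's built-in "idna" codec applies the
   [RFC3490] ToASCII operation to each dot-separated label.  The sketch
   below assumes that the codec's handling of the UseSTD3ASCIIRules and
   AllowUnassigned flags is close enough for demonstration purposes;
   this has not been checked against the requirements above.  The
   function name map_ireg_name is hypothetical.

```python
def map_ireg_name(ireg_name: str) -> str:
    """Map an ireg-name to its ASCII form, label by label.

    Raises ValueError if ToASCII fails, in which case the entire IRI
    conversion fails.
    """
    try:
        return ireg_name.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("IRI conversion failed for %r" % ireg_name) from exc
```

   With this sketch, the ireg-name of the example above maps to
   "xn--rsum-bpad.example.org"; it is never percent-encoded.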

680	   Note:  Domain Names may appear in parts of an IRI other than the
681	      ireg-name part.  It is the responsibility of scheme-specific
682	      implementations (if the Internationalized Domain Name is part of
683	      the scheme syntax) or of server-side implementations (if the
684	      Internationalized Domain Name is part of 'iquery') to apply the
685	      necessary conversions at the appropriate point.  Example: Trying
686	      to validate the Web page at
687	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
688	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
689	      example.org, which would convert to a URI of
690	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
691	      example.org.  The server-side implementation is responsible for
692	      making the necessary conversions to be able to retrieve the Web
693	      page.

695	   Note:  In this process, characters allowed in URI references and
696	      existing percent-encoded sequences are not encoded further.  (This
697	      mapping is similar to, but different from, the encoding applied
698	      when arbitrary content is included in some part of a URI.)  For
699	      example, an IRI of
700	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
701	      converted to
702	      "http://www.example.org/red%09ros%C3%A9#red", not to something
703	      like
704	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
705	      ((DESIGN QUESTION: What about e.g.
706	      http://r%C3%A9sum%C3%A9.example.org in an IRI?  Will that get
707	      converted to punycode, or not?))

709	3.5.  Mapping query components

711	   ((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed
712	   HTTP infrastructure, the following special case applies for schemes
713	   "http" and "https" and IRIs whose origin has a document charset other
714	   than one which is UCS-based (e.g., UTF-8 or UTF-16).  In such a case,
715	   the "query" component of an IRI is mapped into a URI by using the
716	   document charset rather than UTF-8 as the binary representation
717	   before pct-encoding.  This mapping is not applied for any other
718	   scheme or component.
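
   The special case above might look as follows.  This is a sketch
   under stated assumptions: only non-ASCII characters are encoded,
   every character is assumed to be representable in the document
   charset, and the function name map_query is hypothetical.

```python
def map_query(iquery: str, document_charset: str = "utf-8") -> str:
    """Percent-encode non-ASCII query characters using the document charset."""
    out = []
    for ch in iquery:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # The binary representation comes from the document charset,
            # not necessarily from UTF-8.
            octets = ch.encode(document_charset)
            out.extend("%%%02X" % octet for octet in octets)
    return "".join(out)
```

   For a document in iso-8859-1, the query "q=r&#xE9;sum&#xE9;" would
   then map to "q=r%E9sum%E9" rather than "q=r%C3%A9sum%C3%A9".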

720	3.6.  Mapping IRIs to URIs

722	   The canonical mapping from an IRI to a URI is defined by applying the
723	   mapping above (from IRI to URI components) and then reassembling a
724	   URI from the parsed URI components using the original punctuation
725	   that delimited the IRI components.

727	3.7.  Converting URIs to IRIs

729	   In some situations, for presentation and further processing, it is
730	   desirable to convert a URI into an equivalent IRI in which natural
731	   characters are represented directly rather than percent encoded.  Of
732	   course, every URI is already an IRI in its own right without any
733	   conversion.  This section gives one procedure for such a
734	   conversion.

736	   The conversion described in this section, if given a valid URI, will
737	   result in an IRI that maps back to the URI used as an input for the
738	   conversion (except for potential case differences in percent-encoding
739	   and for potential percent-encoded unreserved characters).  However,
740	   the IRI resulting from this conversion may differ from the original
741	   IRI (if there ever was one).

743	   URI-to-IRI conversion removes percent-encodings, but not all percent-
744	   encodings can be eliminated.  There are several reasons for this:

746	   1. Some percent-encodings are necessary to distinguish percent-
747	      encoded and unencoded uses of reserved characters.

749	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
750	      octets.

752	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
753	      there is a very high probability, but no guarantee, that percent-
754	      encodings that can be interpreted as sequences of UTF-8 octets
755	      actually originated from UTF-8.  For a detailed discussion, see
756	      [Duerst97].)

758	   3. The conversion may result in a character that is not appropriate
759	      in an IRI.  See Section 2.2, Section 4.1, and Section 6.1 for
760	      further details.

762	   4. IRI to URI conversion has different rules for dealing with domain
763	      names and query parameters.

765	   Conversion from a URI to an IRI MAY be done by using the following
766	   steps:

768	   1. Represent the URI as a sequence of octets in US-ASCII.

770	   2. Convert all percent-encodings ("%" followed by two hexadecimal
771	      digits) to the corresponding octets, except those corresponding to
772	      "%", characters in "reserved", and characters in US-ASCII not
773	      allowed in URIs.

775	   3. Re-percent-encode any octet produced in step 2 that is not part of
776	      a strictly legal UTF-8 octet sequence.

778	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
779	      represent characters that are not appropriate according to
780	      Section 2.2, Section 4.1, and Section 6.1.

782	   5. Interpret the resulting octet sequence as a sequence of characters
783	      encoded in UTF-8.

785	   6. URIs known to contain domain names in the reg-name component
786	      SHOULD convert punycode-encoded domain name labels to the
787	      corresponding characters using the ToUnicode procedure.

789	   This procedure will convert as many percent-encoded characters as
790	   possible to characters in an IRI.  Because there are some choices
791	   when step 4 is applied (see Section 6.1), results may vary.

793	   Conversions from URIs to IRIs MUST NOT use any character encoding
794	   other than UTF-8 in steps 3 and 4, even if it might be possible to
795	   guess from the context that another character encoding than UTF-8 was
796	   used in the URI.  For example, the URI
797	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
798	   interpreted to contain two e-acute characters encoded as iso-8859-1.
799	   It must not be converted to an IRI containing these e-acute
800	   characters.  Otherwise, in the future the IRI will be mapped to
801	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
802	   URI from "http://www.example.org/r%E9sum%E9.html".
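
   Steps 1 through 5 above can be sketched as follows.  This is an
   illustration, not a definitive implementation: step 6 is omitted,
   and step 4 is assumed to cover only the bidirectional formatting
   characters forbidden by Section 4.1.  All names (UNRESERVED,
   NOT_APPROPRIATE, uri_to_iri, and the helpers) are hypothetical.

```python
import re

# Octet values of the RFC 3986 "unreserved" characters.
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: step 4 only rejects LRM, RLM, LRE, RLE, PDF, LRO, and RLO.
NOT_APPROPRIATE = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}

PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    """Percent-encode octets using uppercase hexadecimal digits."""
    return "".join("%%%02X" % octet for octet in octets)


def _convert_run(match) -> str:
    """Convert one run of consecutive percent-encodings (steps 2 to 5)."""
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    i = 0
    while i < len(octets):
        width, char = 1, None
        if octets[i] < 0x80:
            # Step 2: "%", reserved, and disallowed ASCII stay encoded.
            if octets[i] in UNRESERVED:
                char = chr(octets[i])
        else:
            # Step 3: accept only strictly legal UTF-8 sequences.
            for n in (2, 3, 4):
                try:
                    candidate = octets[i:i + n].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                width = n
                # Step 4: keep inappropriate characters percent-encoded.
                if ord(candidate) not in NOT_APPROPRIATE:
                    char = candidate
                break
        out.append(char if char is not None else _pct(octets[i:i + width]))
        i += width
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Convert a URI to an IRI, leaving the reg-name untouched."""
    uri.encode("ascii")  # step 1: raises if the input is not US-ASCII
    return PCT_RUN.sub(_convert_run, uri)
```

   Note that the sketch never consults any character encoding other
   than UTF-8, so "%E9" in "http://www.example.org/r%E9sum%E9.html"
   remains percent-encoded.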

804	3.7.1.  Examples

806	   This section shows various examples of converting URIs to IRIs.  Each
807	   example shows the result after each of the steps 1 through 6 is
808	   applied.  XML Notation is used for the final result.  Octets are
809	   denoted by "<" followed by two hexadecimal digits followed by ">".

811	   The following example contains the sequence "%C3%BC", which is a
812	   strictly legal UTF-8 sequence, and which is converted into the actual
813	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
814	   u-umlaut).

816	   1. http://www.example.org/D%C3%BCrst

818	   2. http://www.example.org/D<c3><bc>rst

820	   3. http://www.example.org/D<c3><bc>rst

822	   4. http://www.example.org/D<c3><bc>rst

824	   5. http://www.example.org/D&#xFC;rst

826	   6. http://www.example.org/D&#xFC;rst

828	   The following example contains the sequence "%FC", which might
829	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
830	   iso-8859-1 character encoding.  (It might represent other characters
831	   in other character encodings.  For example, the octet <fc> in iso-
832	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
833	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
834	   encoded in step 3.

836	   1. http://www.example.org/D%FCrst

838	   2. http://www.example.org/D<fc>rst

840	   3. http://www.example.org/D%FCrst

842	   4. http://www.example.org/D%FCrst

844	   5. http://www.example.org/D%FCrst

846	   6. http://www.example.org/D%FCrst

848	   The following example contains "%e2%80%ae", which is the percent-
849	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
850	   Section 4.1 forbids the direct use of this character in an IRI.
851	   Therefore, the corresponding octets are re-percent-encoded in step 4.
854	   This example shows that the case (upper- or lowercase) of letters
855	   used in percent-encodings may not be preserved.  The example also
856	   contains a punycode-encoded domain name label (xn--99zt52a), which is
857	   converted to the corresponding characters in step 6.

859	   1. http://xn--99zt52a.example.org/%e2%80%ae

861	   2. http://xn--99zt52a.example.org/<e2><80><ae>

863	   3. http://xn--99zt52a.example.org/<e2><80><ae>

865	   4. http://xn--99zt52a.example.org/%E2%80%AE

867	   5. http://xn--99zt52a.example.org/%E2%80%AE

869	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

871	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
872	   (Japanese Natto).  ((EDITOR NOTE: There is some inconsistency in this
873	   note.))

875	4.  Bidirectional IRIs for Right-to-Left Languages

877	   Some UCS characters, such as those used in the Arabic and Hebrew
878	   scripts, have an inherent right-to-left (rtl) writing direction.
879	   IRIs containing these characters (called bidirectional IRIs or Bidi
880	   IRIs) require additional attention because of the non-trivial
881	   relation between logical representation (used for digital
882	   representation and for reading/spelling) and visual representation
883	   (used for display/printing).

885	   Because of the complex interaction between the logical
886	   representation, the visual representation, and the syntax of a Bidi
887	   IRI, a balance is needed between various requirements.  The main
888	   requirements are

890	   1. user-predictable conversion between visual and logical
891	      representation;

893	   2. the ability to include a wide range of characters in various parts
894	      of the IRI; and

896	   3. minor or no changes or restrictions for implementations.

898	4.1.  Logical Storage and Visual Presentation

900	   When stored or transmitted in digital representation, bidirectional
901	   IRIs MUST be in full logical order and MUST conform to the IRI syntax
902	   rules (which include the rules relevant to their scheme).  This
903	   ensures that bidirectional IRIs can be processed in the same way as
904	   other IRIs.

906	   Bidirectional IRIs MUST be rendered by using the Unicode
907	   Bidirectional Algorithm [UNIV4], [UNI9].  Bidirectional IRIs MUST be
908	   rendered in the same way as they would be if they were in a left-to-
909	   right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
910	   RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
911	   FORMATTING (PDF).  Setting the embedding direction can also be done
912	   in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).

914	   There is no requirement to use the above embedding if the display is
915	   still the same without the embedding.  For example, a bidirectional
916	   IRI in a text with left-to-right base directionality (such as used
917	   for English or Cyrillic) that is preceded and followed by whitespace
918	   and strong left-to-right characters does not need an embedding.
919	   Also, a bidirectional relative IRI reference that only contains
920	   strong right-to-left characters and weak characters and that starts
921	   and ends with a strong right-to-left character and appears in a text
922	   with right-to-left base directionality (such as used for Arabic or
923	   Hebrew) and is preceded and followed by whitespace and strong
924	   characters does not need an embedding.

926	   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
927	   sufficient to force the correct display behavior.  However, the
928	   details of the Unicode Bidirectional algorithm are not always easy to
929	   understand.  Implementers are strongly advised to err on the side of
930	   caution and to use embedding in all cases where they are not
931	   completely sure that the display behavior is unaffected without the
932	   embedding.

934	   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
935	   higher-level protocols to influence bidirectional rendering.  Such
936	   changes by higher-level protocols MUST NOT be used if they change the
937	   rendering of IRIs.

939	   The bidirectional formatting characters that may be used before or
940	   after the IRI to ensure correct display are not themselves part of
941	   the IRI.  IRIs MUST NOT contain bidirectional formatting characters
942	   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  They affect the visual
943	   rendering of the IRI but do not appear themselves.  It would
944	   therefore not be possible to input an IRI with such characters
945	   correctly.
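
   The two rules above (no bidirectional formatting characters inside
   the IRI, and an LRE/PDF embedding around it for display) might be
   sketched as follows.  The function names has_bidi_formatting and
   embed_for_display are hypothetical, and the sketch always embeds,
   following the advice to err on the side of caution.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

BIDI_FORMATTING = {
    "\u200E",  # LRM
    "\u200F",  # RLM
    "\u202A",  # LRE
    "\u202B",  # RLE
    "\u202C",  # PDF
    "\u202D",  # LRO
    "\u202E",  # RLO
}


def has_bidi_formatting(iri: str) -> bool:
    """True if the string contains a character that an IRI MUST NOT contain."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def embed_for_display(iri: str) -> str:
    """Wrap an IRI in a left-to-right embedding for rendering only."""
    if has_bidi_formatting(iri):
        raise ValueError("IRI contains bidirectional formatting characters")
    # The surrounding LRE and PDF are not part of the IRI itself.
    return LRE + iri + PDF
```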

947	4.2.  Bidi IRI Structure

949	   The Unicode Bidirectional Algorithm is designed mainly for running
950	   text.  To make sure that it does not affect the rendering of
951	   bidirectional IRIs too much, some restrictions on bidirectional IRIs
952	   are necessary.  These restrictions are given in terms of delimiters
953	   (structural characters, mostly punctuation such as "@", ".", ":", and
954	   "/") and components (usually consisting mostly of letters and
955	   digits).

957	   The following syntax rules from Section 2.2 correspond to components
958	   for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
959	   isegment-nz, isegment-nz-nc, iquery, and ifragment.

961	   Specifications that define the syntax of any of the above components
962	   MAY divide them further and define smaller parts to be components
963	   according to this document.  As an example, the restrictions of
964	   [RFC3490] on bidirectional domain names correspond to treating each
965	   label of a domain name as a component for schemes with ireg-name as a
966	   domain name.  Even where the components are not defined formally, it
967	   may be helpful to think about some syntax in terms of components and
968	   to apply the relevant restrictions.  For example, for the usual name/
969	   value syntax in query parts, it is convenient to treat each name and
970	   each value as a component.  As another example, the extensions in a
971	   resource name can be treated as separate components.

973	   For each component, the following restrictions apply:

975	   1. A component SHOULD NOT use both right-to-left and left-to-right
976	      characters.

978	   2. A component using right-to-left characters SHOULD start and end
979	      with right-to-left characters.

981	   The above restrictions are given as "SHOULD"s, rather than as
982	   "MUST"s.  For IRIs that are never presented visually, they are not
983	   relevant.  However, for IRIs in general, they are very important to
984	   ensure consistent conversion between visual presentation and logical
985	   representation, in both directions.

987	   Note:  In some components, the above restrictions may actually be
988	      strictly enforced.  For example, [RFC3490] requires that these
989	      restrictions apply to the labels of a host name for those schemes
990	      where ireg-name is a host name.  In some other components (for
991	      example, path components) following these restrictions may not be
992	      too difficult.  For other components, such as parts of the query
993	      part, it may be very difficult to enforce the restrictions because
994	      the values of query parameters may be arbitrary character
995	      sequences.

997	   If the above restrictions cannot be satisfied otherwise, the affected
998	   component can always be mapped to URI notation as described in
999	   Section 3.3.  Please note that the whole component has to be mapped
1000	   (see also Example 9 below).
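
   The two per-component restrictions above can be checked roughly as
   shown below.  The sketch assumes that "right-to-left characters" are
   those with Unicode bidirectional class R or AL and that
   "left-to-right characters" are those with class L; the function name
   check_bidi_component is hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_bidi_component(component: str):
    """Return a list of the restrictions that the component does not meet."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(cls in RTL_CLASSES for cls in classes)
    has_ltr = any(cls == "L" for cls in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("uses both right-to-left and left-to-right characters")
    if has_rtl and not (classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with right-to-left characters")
    return problems
```

   An empty result means that both restrictions are met.  Because the
   restrictions are given as "SHOULD"s, a caller might report the
   problems as warnings rather than reject the IRI.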

1002	4.3.  Input of Bidi IRIs

1004	   Bidi input methods MUST generate Bidi IRIs in logical order while
1005	   rendering them according to Section 4.1.  During input, rendering
1006	   SHOULD be updated after every new character is input to avoid end-
1007	   user confusion.

1009	4.4.  Examples

1011	   This section gives examples of bidirectional IRIs, in Bidi Notation.
1012	   It shows legal IRIs with the relationship between logical and visual
1013	   representation and explains how certain phenomena in this
1014	   relationship may look strange to somebody not familiar with
1015	   bidirectional behavior, but familiar to users of Arabic and Hebrew.
1016	   It also shows what happens if the restrictions given in Section 4.2
1017	   are not followed.  The examples below can be seen at [BidiEx], in
1018	   Arabic, Hebrew, and Bidi Notation variants.

1020	   To read the bidi text in the examples, read the visual representation
1021	   from left to right until you encounter a block of rtl text.  Read the
1022	   rtl block (including slashes and other special characters) from right
1023	   to left, then continue at the next unread ltr character.

1025	   Example 1: A single component with rtl characters is inverted:
1026	   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
1027	   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
1028	   Components can be read one by one, and each component can be read in
1029	   its natural direction.

1031	   Example 2: More than one consecutive component with rtl characters is
1032	   inverted as a whole:
1033	   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
1034	   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
1035	   A sequence of rtl components is read rtl, in the same way as a
1036	   sequence of rtl words is read rtl in a bidi text.

1038	   Example 3: All components of an IRI (except for the scheme) are rtl.
1039	   All rtl components are inverted overall:
1040	   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
1041	   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
1042	   The whole IRI (except the scheme) is read rtl.  Delimiters between
1043	   rtl components stay between the respective components; delimiters
1044	   between ltr and rtl components don't move.

1046	   Example 4: Each of several sequences of rtl components is inverted on
1047	   its own:
1048	   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
1049	   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
1050	   Each sequence of rtl components is read rtl, in the same way as each
1051	   sequence of rtl words in an ltr text is read rtl.

1053	   Example 5: Example 2, applied to components of different kinds:
1054	   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1055	   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
1056	   The inversion of the domain name label and the path component may be
1057	   unexpected, but it is consistent with other bidi behavior.  For
1058	   reassurance that the domain component really is "ab.cd.EF", it may be
1059	   helpful to read aloud the visual representation following the bidi
1060	   algorithm.  After "http://ab.cd." one reads the RTL block
1061	   "E-F-slash-G-H", which corresponds to the logical representation.

1063	   Example 6: Same as Example 5, with more rtl components:
1064	   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
1065	   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
1066	   The inversion of the domain name labels and the path components may
1067	   be easier to identify because the delimiters also move.

1069	   Example 7: A single rtl component includes digits:
1070	   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
1071	   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
1072	   Numbers are written ltr in all cases but are treated as an additional
1073	   embedding inside a run of rtl characters.  This is completely
1074	   consistent with usual bidirectional text.

1076	   Example 8 (not allowed): Numbers are at the start or end of an rtl
1077	   component:
1078	   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
1079	   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
1080	   The sequence "1/2" is interpreted by the bidi algorithm as a
1081	   fraction, fragmenting the components and leading to confusion.  There
1082	   are other characters that are interpreted in a special way close to
1083	   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

1085	   Example 9 (not allowed): The numbers in the previous example are
1086	   percent-encoded:
1087	   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
1088	   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

1090	   Example 10 (allowed but not recommended):
1092	   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
1093	   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
1094	   Components consisting of only numbers are allowed (it would be rather
1095	   difficult to prohibit them), but these may interact with adjacent RTL
1096	   components in ways that are not easy to predict.

1098	   Example 11 (allowed but not recommended):
1099	   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
1100	   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
1101	   Components consisting of numbers and left-to-right characters are
1102	   allowed, but these may interact with adjacent RTL components in ways
1103	   that are not easy to predict.
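
   The visual representations in Examples 1 through 6 above can be
   reproduced with a much simplified model of the bidirectional
   algorithm for Bidi Notation strings.  The sketch assumes that
   uppercase ASCII letters stand for rtl characters and lowercase
   letters for ltr characters, and it does not handle digits, so it
   does not apply to Examples 7 through 11.  It is not an
   implementation of the Unicode Bidirectional Algorithm, and the
   function name visual_from_logical is hypothetical.

```python
import re

# A run starts and ends with an rtl (uppercase) character and may contain
# delimiters in between, but no ltr (lowercase) characters and no digits.
RTL_RUN = re.compile(r"[A-Z](?:[^a-z0-9]*[A-Z])?")


def visual_from_logical(logical: str) -> str:
    """Reverse each rtl run of a Bidi Notation string (no digit support)."""
    return RTL_RUN.sub(lambda match: match.group(0)[::-1], logical)
```

   Delimiters between two rtl components fall inside a run and move
   with it, while delimiters between ltr and rtl components stay in
   place, as described in Example 3.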

1105	5.  Normalization and Comparison

1107	   Note:  The structure and much of the material for this section is
1108	      taken from section 6 of [RFC3986]; the differences are due to the
1109	      specifics of IRIs.

1111	   One of the most common operations on IRIs is simple comparison:
1112	   Determining whether two IRIs are equivalent, without using the IRIs
1113	   to access their respective resource(s).  A comparison is performed
1114	   whenever a response cache is accessed, a browser checks its history
1115	   to color a link, or an XML parser processes tags within a namespace.
1116	   Extensive normalization prior to comparison of IRIs may be used by
1117	   spiders and indexing engines to prune a search space or reduce
1118	   duplication of request actions and response storage.

1120	   IRI comparison is performed for some particular purpose.  Protocols
1121	   or implementations that compare IRIs for different purposes will
1122	   often be subject to differing design trade-offs in regards to how
1123	   much effort should be spent in reducing aliased identifiers.  This
1124	   section describes various methods that may be used to compare IRIs,
1125	   the trade-offs between them, and the types of applications that might
1126	   use them.

1128	5.1.  Equivalence

1130	   Because IRIs exist to identify resources, presumably they should be
1131	   considered equivalent when they identify the same resource.  However,
1132	   this definition of equivalence is not of much practical use, as there
1133	   is no way for an implementation to compare two resources to determine
1134	   if they are "the same" unless it has full knowledge or control of
1135	   them.  For this reason, determination of equivalence or difference of
1136	   IRIs is based on string comparison, perhaps augmented by reference to
1137	   additional rules provided by URI scheme definitions.  We use the
1138	   terms "different" and "equivalent" to describe the possible outcomes
1139	   of such comparisons, but there are many application-dependent
1140	   versions of equivalence.

1142	   Even when it is possible to determine that two IRIs are equivalent,
1143	   IRI comparison is not sufficient to determine whether two IRIs
1144	   identify different resources.  For example, an owner of two different
1145	   domain names could decide to serve the same resource from both,
1146	   resulting in two different IRIs.  Therefore, comparison methods are
1147	   designed to minimize false negatives while strictly avoiding false
1148	   positives.

1150	   In testing for equivalence, applications should not directly compare
1151	   relative references; the references should be converted to their
1152	   respective target IRIs before comparison.  When IRIs are compared to
1153	   select (or avoid) a network action, such as retrieval of a
1154	   representation, fragment components (if any) should be excluded from
1155	   the comparison.
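
As a minimal sketch of the two precautions above, the following Python
fragment resolves a reference against a base and drops the fragment
before comparing.  The helper name retrieval_key and the example IRIs
are assumptions made for illustration; urljoin and urldefrag are the
standard-library resolver and fragment splitter, used here as a
stand-in for the resolution algorithm of [RFC3986].

```python
from urllib.parse import urljoin, urldefrag

def retrieval_key(reference: str, base: str) -> str:
    """Return the target IRI of `reference`, without its fragment."""
    target = urljoin(base, reference)   # relative reference -> target IRI
    return urldefrag(target).url        # fragments do not affect retrieval

# Hypothetical base and references, chosen only for this example.
base = "http://example.org/a/b/d"
same_action = retrieval_key("../c#top", base) == retrieval_key("/a/c#end", base)
```

Comparing the two raw references directly would report a difference,
although both select the same retrieval action.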

1157	   Applications using IRIs as identity tokens with no relationship to a
1158	   protocol MUST use the Simple String Comparison (see Section 5.3.1).
1159	   All other applications MUST select one of the comparison practices
1160	   from the Comparison Ladder (see Section 5.3).

1162	5.2.  Preparation for Comparison

1164	   Any kind of IRI comparison REQUIRES that any additional contextual
1165	   processing is first performed, including undoing higher-level
1166	   escapings or encodings in the protocol or format that carries an IRI.
1167	   This preprocessing is usually done when the protocol or format is
1168	   parsed.

1170	   Examples of contextual preprocessing steps are described in
1171	   Section 7.

1173	   Examples of such escapings or encodings are entities and numeric
1174	   character references in [HTML4] and [XML1].  As an example,
1175	   "http://example.org/ros&eacute;" (in HTML),
1176	   "http://example.org/ros&#233;" (in HTML or XML), and
1177	   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into
1178	   what is denoted in this document (see Section 1.4) as
1179	   "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
1180	   actual e-acute character, to compensate for the fact that this
1181	   document cannot contain non-ASCII characters).
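
The following sketch shows the three HTML spellings above collapsing
to one character string once the character references are undone.  It
uses Python's html.unescape as a stand-in for a real HTML or XML
parser, which is an assumption of this example; in practice the
carrying format's own parser performs this step.

```python
from html import unescape

# The three spellings from the text above, as they might appear in HTML.
raw = [
    "http://example.org/ros&eacute;",
    "http://example.org/ros&#233;",
    "http://example.org/ros&#xE9;",
]

# After undoing the character references, all three contain U+00E9.
decoded = [unescape(s) for s in raw]
all_equal = len(set(decoded)) == 1
```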

1183	   Similar considerations apply to encodings such as Transfer Codings in
1184	   HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
1185	   ([RFC2045]), although in these cases, the encoding is based not on
1186	   characters but on octets, and additional care is required to make
1187	   sure that characters, and not just arbitrary octets, are compared
1188	   (see Section 5.3.1).

1190	5.3.  Comparison Ladder

1192	   In practice, a variety of methods are used to test IRI equivalence.
1193	   These methods fall into a range distinguished by the amount of
1194	   processing required and the degree to which the probability of false
1195	   negatives is reduced.  As noted above, false negatives cannot be
1196	   eliminated.  In practice, their probability can be reduced, but this
1197	   reduction requires more processing and is not cost-effective for all
1198	   applications.

1200	   If this range of comparison practices is considered as a ladder, the
1201	   following discussion will climb the ladder, starting with practices
1202	   that are cheap but have a relatively higher chance of producing false
1203	   negatives, and proceeding to those that have higher computational
1204	   cost and lower risk of false negatives.

1206	5.3.1.  Simple String Comparison

1208	   If two IRIs, when considered as character strings, are identical,
1209	   then it is safe to conclude that they are equivalent.  This type of
1210	   equivalence test has very low computational cost and is in wide use
1211	   in a variety of applications, particularly in the domain of parsing.
1212	   It is also used when a definitive answer to the question of IRI
1213	   equivalence is needed that is independent of the scheme used and that
1214	   can be calculated quickly and without accessing a network.  An
1215	   example of such a case is XML Namespaces ([XMLNamespace]).

1217	   Testing strings for equivalence requires some basic precautions.
1218	   This procedure is often referred to as "bit-for-bit" or "byte-for-
1219	   byte" comparison, which is potentially misleading.  Testing strings
1220	   for equality is normally based on pair comparison of the characters
1221	   that make up the strings, starting from the first and proceeding
1222	   until both strings are exhausted and all characters are found to be
1223	   equal, until a pair of characters compares unequal, or until one of
1224	   the strings is exhausted before the other.

1226	   This character comparison requires that each pair of characters be
1227	   put in comparable encoding form.  For example, should one IRI be
1228	   stored in a byte array in UTF-8 encoding form and the second in a
1229	   UTF-16 encoding form, bit-for-bit comparisons applied naively will
1230	   produce errors.  It is better to speak of equality on a character-
1231	   for-character rather than on a byte-for-byte or bit-for-bit basis.
1232	   In practical terms, character-by-character comparisons should be done
1233	   codepoint by codepoint after conversion to a common character
1234	   encoding form.  When comparing character by character, the comparison
1235	   function MUST NOT map IRIs to URIs, because such a mapping would
1236	   create additional spurious equivalences.  It follows that an IRI
1237	   SHOULD NOT be modified when being transported if there is any chance
1238	   that this IRI might be used in a context that uses Simple String
1239	   Comparison.
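
A minimal sketch of why byte-for-byte comparison can mislead: the same
IRI is stored in two encoding forms, and only the decoded character
strings compare equal.  The function name simple_string_equal is an
assumption of this example.  Python compares str values code point by
code point, which matches the procedure described above.

```python
iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

def simple_string_equal(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Decode both inputs to characters, then compare code point by code point."""
    return a.decode(a_enc) == b.decode(b_enc)

bytes_differ = as_utf8 != as_utf16
chars_equal = simple_string_equal(as_utf8, "utf-8", as_utf16, "utf-16-le")

# No IRI-to-URI mapping is applied, so the percent-encoded spelling
# is reported as different at this rung of the ladder.
pct_equal = simple_string_equal(
    b"http://example.org/ros%C3%A9", "utf-8", as_utf8, "utf-8"
)
```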

1241	   False negatives are caused by the production and use of IRI aliases.
1242	   Unnecessary aliases can be reduced, regardless of the comparison
1243	   method, by consistently providing IRI references in an already
1244	   normalized form (i.e., a form identical to what would be produced
1245	   after normalization is applied, as described below).  Protocols and
1246	   data formats often limit some IRI comparisons to simple string
1247	   comparison, based on the theory that people and implementations will,
1248	   in their own best interest, be consistent in providing IRI
1249	   references, or at least be consistent enough to negate any efficiency
1250	   that might be obtained from further normalization.

1252	5.3.2.  Syntax-Based Normalization

1254	   Implementations may use logic based on the definitions provided by
1255	   this specification to reduce the probability of false negatives.
1256	   This processing is moderately higher in cost than character-for-
1257	   character string comparison.  For example, an application using this
1258	   approach could reasonably consider the following two IRIs equivalent:

1260	      example://a/b/c/%7Bfoo%7D/ros&#xE9;
1261	      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

1263	   Web user agents, such as browsers, typically apply this type of IRI
1264	   normalization when determining whether a cached response is
1265	   available.  Syntax-based normalization includes such techniques as
1266	   case normalization, character normalization, percent-encoding
1267	   normalization, and removal of dot-segments.

1269	5.3.2.1.  Case Normalization

1271	   For all IRIs, the hexadecimal digits within a percent-encoding
1272	   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1273	   should be normalized to use uppercase letters for the digits A-F.

1275	   When an IRI uses components of the generic syntax, the component
1276	   syntax equivalence rules always apply; namely, that the scheme and
1277	   US-ASCII only host are case insensitive and therefore should be
1278	   normalized to lowercase.  For example, the URI
1279	   "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
1280	   Case equivalence for non-ASCII characters in IRI components that are
1281	   IDNs is discussed in Section 5.3.3.  The other generic syntax
1282	   components are assumed to be case sensitive unless specifically
1283	   defined otherwise by the scheme.

1285	   Creating schemes that allow case-insensitive syntax components
1286	   containing non-ASCII characters should be avoided.  Case
1287	   normalization of non-ASCII characters can be culturally dependent and
1288	   is always a complex operation.  The only exception concerns non-ASCII
1289	   host names for which the character normalization includes a mapping
1290	   step derived from case folding.
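
The rules of this section can be sketched as follows.  The function
name case_normalize and its regular expressions are assumptions of
this example, not part of this specification.  Only ASCII letters in
the host are lowercased; non-ASCII host characters are left alone, in
line with the caution above.

```python
import re

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
# scheme ":" followed by an optional "//" authority (generic syntax)
_PREFIX = re.compile(r"^([^:/?#]+):(?://([^/?#]*))?")

def _lower_ascii(text: str) -> str:
    """Lowercase ASCII letters only; leave non-ASCII characters untouched."""
    return "".join(c.lower() if c.isascii() else c for c in text)

def case_normalize(iri: str) -> str:
    m = _PREFIX.match(iri)
    if m:
        scheme, authority = _lower_ascii(m.group(1)), m.group(2)
        rest = iri[m.end():]              # path, query, fragment: unchanged
        if authority is None:
            iri = f"{scheme}:{rest}"
        else:
            userinfo, at, hostport = authority.rpartition("@")
            iri = f"{scheme}://{userinfo}{at}{_lower_ascii(hostport)}{rest}"
    # Uppercase the hex digits of every percent-encoding triplet last,
    # so that lowercasing the host cannot undo it.
    return _PCT.sub(lambda t: t.group(0).upper(), iri)
```

The path, query, and fragment keep their case apart from the
percent-encoding triplets, since those components are assumed to be
case sensitive unless the scheme says otherwise.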

1292	5.3.2.2.  Character Normalization

1294	   The Unicode Standard [UNIV4] defines various equivalences between
1295	   sequences of characters for various purposes.  Unicode Standard Annex
1296	   #15 [UTR15] defines various Normalization Forms for these
1297	   equivalences, in particular Normalization Form C (NFC, Canonical
1298	   Decomposition, followed by Canonical Composition) and Normalization
1299	   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
1300	   Composition).

1302	   IRIs already in Unicode MUST NOT be normalized before parsing or
1303	   interpreting.  In many non-Unicode character encodings, some text
1304	   cannot be represented directly.  For example, the word "Vietnam" is
1305	   natively written "Vi&#x1EC7;t Nam" (containing a LATIN SMALL LETTER E
1306	   WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct transcoding from
1307	   the windows-1258 character encoding leads to "Vi&#xEA;&#x323;t Nam"
1308	   (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX followed by a
1309	   COMBINING DOT BELOW).  Direct transcoding of other 8-bit encodings of
1310	   Vietnamese may lead to other representations.

1312	   Equivalence of IRIs MUST rely on the assumption that IRIs are
1313	   appropriately pre-character-normalized rather than apply character
1314	   normalization when comparing two IRIs.  The exceptions are conversion
1315	   from a non-digital form, and conversion from a non-UCS-based
1316	   character encoding to a UCS-based character encoding.  In these
1317	   cases, NFC or a normalizing transcoder using NFC MUST be used for
1318	   interoperability.  To avoid false negatives and problems with
1319	   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
1320	   avoid even more problems; for example, by choosing half-width Latin
1321	   letters instead of full-width ones, and full-width instead of half-
1322	   width Katakana.

1324	   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
1325	   Notation) is in NFC.  On the other hand,
1326	   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

1328	   The former uses precombined e-acute characters, and the latter uses
1329	   "e" characters followed by combining acute accents.  Both usages are
1330	   defined as canonically equivalent in [UNIV4].
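
As an illustration, the fragment below checks both spellings with
Python's unicodedata module.  The helper name is_nfc is an assumption
of this example.  Such a check is meant for the point where an IRI is
created or transcoded from a legacy encoding; it is not a step to
apply while comparing two IRIs.

```python
import unicodedata

precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

def is_nfc(iri: str) -> bool:
    """True if the string is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", iri) == iri

# A normalizing transcoder would produce the NFC form of the
# windows-1258 example discussed earlier in this section.
vietnam_nfc = unicodedata.normalize("NFC", "Vi\u00ea\u0323t Nam")
```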

1332	   Note:  Because it is unknown how a particular sequence of characters
1333	      is being treated with respect to character normalization, it would
1334	      be inappropriate to allow third parties to normalize an IRI
1335	      arbitrarily.  This does not contradict the recommendation that
1336	      when a resource is created, its IRI should be as character
1337	      normalized as possible (i.e., NFC or even NFKC).  This is similar
1338	      to the uppercase/lowercase problems.  Some parts of a URI are case
1339	      insensitive (for example, the domain name).  For others, it is
1340	      unclear whether they are case sensitive, case insensitive, or
1341	      something in between (e.g., case sensitive, but with a multiple
1342	      choice selection if the wrong case is used, instead of a direct
1343	      negative result).  The best recipe is that the creator use a
1344	      reasonable capitalization and, when transferring the URI,
1345	      capitalization never be changed.

1347	   Various IRI schemes may allow the usage of Internationalized Domain
1348	   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
1349	   Character Normalization also applies to IDNs, as discussed in
1350	   Section 5.3.3.

1352	5.3.2.3.  Percent-Encoding Normalization

1354	   The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a
1355	   frequent source of variance among otherwise identical IRIs.  In
1356	   addition to the case normalization issue noted above, some IRI
1357	   producers percent-encode octets that do not require percent-encoding,
1358	   resulting in IRIs that are equivalent to their nonencoded
1359	   counterparts.  These IRIs should be normalized by decoding any
1360	   percent-encoded octet sequence that corresponds to an unreserved
1361	   character, as described in section 2.3 of [RFC3986].

1363	   For actual resolution, differences in percent-encoding (except for
1364	   the percent-encoding of reserved characters) MUST always result in
1365	   the same resource.  For example, "http://example.org/~user",
1366	   "http://example.org/%7euser", and "http://example.org/%7Euser" must
1367	   resolve to the same resource.

1369	   If this kind of equivalence is to be tested, the percent-encoding of
1370	   both IRIs to be compared has to be aligned; for example, by
1371	   converting both IRIs to URIs (see Section 3.1), eliminating escape
1372	   differences in the resulting URIs, and making sure that the case of
1373	   the hexadecimal characters in the percent-encoding is always the same
1374	   (preferably upper case).  If the IRI is to be passed to another
1375	   application or used further in some other way, its original form MUST
1376	   be preserved.  The conversion described here should be performed only
1377	   for local comparison.
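
A minimal sketch of this normalization, for local comparison only:
triplets that encode an unreserved character (Section 2.3 of
[RFC3986]) are decoded, and all other triplets are uppercased.  The
function name normalize_percent_encoding is an assumption of this
example.  The sketch does not perform the IRI-to-URI conversion
mentioned above, and it returns a new string, so the original form
stays available.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_percent_encoding(iri: str) -> str:
    def fix(m: re.Match) -> str:
        ch = chr(int(m.group(1), 16))
        # Decode unreserved characters; otherwise only fix the hex case.
        return ch if ch in UNRESERVED else m.group(0).upper()
    return _PCT.sub(fix, iri)

forms = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
normalized = {normalize_percent_encoding(f) for f in forms}
```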

1379	5.3.2.4.  Path Segment Normalization

1381	   The complete path segments "." and ".." are intended only for use
1382	   within relative references (Section 4.1 of [RFC3986]) and are removed
1383	   as part of the reference resolution process (Section 5.2 of
1384	   [RFC3986]).  However, some implementations may incorrectly assume
1385	   that reference resolution is not necessary when the reference is
1386	   already an IRI, and thus fail to remove dot-segments when they occur
1387	   in non-relative paths.  IRI normalizers should remove dot-segments by
1388	   applying the remove_dot_segments algorithm to the path, as described
1389	   in Section 5.2.4 of [RFC3986].
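
For reference, the remove_dot_segments algorithm can be sketched in
Python as below.  This is one reading of the steps in Section 5.2.4 of
[RFC3986], written for illustration; the normative text is the one in
that document.

```python
def remove_dot_segments(path: str) -> str:
    output = []                       # completed segments, each with its "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]           # "/./x" -> "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]           # "/../x" -> "/x", dropping one segment
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with any leading "/") to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)
```

Applied to the path of the second IRI in Section 5.3.2,
"/a/./b/../b/%63/%7bfoo%7d/ros%C3%A9", the sketch yields
"/a/b/%63/%7bfoo%7d/ros%C3%A9".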

1391	5.3.3.  Scheme-Based Normalization

1393	   The syntax and semantics of IRIs vary from scheme to scheme, as
1394	   described by the defining specification for each scheme.
1395	   Implementations may use scheme-specific rules, at further processing
1396	   cost, to reduce the probability of false negatives.  For example,
1397	   because the "http" scheme makes use of an authority component, has a
1398	   default port of "80", and defines an empty path to be equivalent to
1399	   "/", the following four IRIs are equivalent:

1401	      http://example.com
1402	      http://example.com/
1403	      http://example.com:/
1404	      http://example.com:80/

1406	   In general, an IRI that uses the generic syntax for authority with an
1407	   empty path should be normalized to a path of "/".  Likewise, an
1408	   explicit ":port", for which the port is empty or the default for the
1409	   scheme, is equivalent to one where the port and its ":" delimiter are
1410	   elided and thus should be removed by scheme-based normalization.  For
1411	   example, the second IRI above is the normal form for the "http"
1412	   scheme.
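
The "http" rules above can be sketched as follows.  The function name
scheme_normalize and the DEFAULT_PORTS table are assumptions of this
example; the table would have to be filled in from each scheme's
specification.  As a conservative choice, IRIs whose scheme is not in
the table are returned unchanged.  The IRI is split with a regular
expression modeled on the one in Appendix B of [RFC3986], so that
empty "?" and "#" delimiters are kept.

```python
import re

_IRI = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?$", re.DOTALL
)
DEFAULT_PORTS = {"http": "80"}        # illustrative table, not exhaustive

def scheme_normalize(iri: str) -> str:
    scheme, authority, path, query, fragment = _IRI.match(iri).groups()
    if scheme is None or authority is None:
        return iri
    default = DEFAULT_PORTS.get(scheme.lower())
    if default is None:
        return iri                    # unknown scheme: leave untouched
    userinfo, at, hostport = authority.rpartition("@")
    m = re.fullmatch(r"(.*):(\d*)", hostport, re.DOTALL)
    if m and m.group(2) in ("", default):
        hostport = m.group(1)         # drop an empty or default ":port"
    if path == "":
        path = "/"                    # empty path is equivalent to "/"
    return f"{scheme}://{userinfo}{at}{hostport}{path}{query or ''}{fragment or ''}"

forms = [
    "http://example.com",
    "http://example.com/",
    "http://example.com:/",
    "http://example.com:80/",
]
normal_forms = {scheme_normalize(f) for f in forms}
```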

1414	   Another case where normalization varies by scheme is in the handling
1415	   of an empty authority component or empty host subcomponent.  For many
1416	   scheme specifications, an empty authority or host is considered an
1417	   error; for others, it is considered equivalent to "localhost" or the
1418	   end-user's host.  When a scheme defines a default for authority and
1419	   an IRI reference to that default is desired, the reference should be
1420	   normalized to an empty authority for the sake of uniformity, brevity,
1421	   and internationalization.  If, however, either the userinfo or port
1422	   subcomponents are non-empty, then the host should be given explicitly
1423	   even if it matches the default.

1425	   Normalization should not remove delimiters when their associated
1426	   component is empty unless it is licensed to do so by the scheme
1427	   specification.  For example, the IRI "http://example.com/?" cannot be
1428	   assumed to be equivalent to any of the examples above.  Likewise, the
1429	   presence or absence of delimiters within a userinfo subcomponent is
1430	   usually significant to its interpretation.  The fragment component is
1431	   not subject to any scheme-based normalization; thus, two IRIs that
1432	   differ only by the suffix "#" are considered different regardless of
1433	   the scheme.

1435	   ((NOTE: THIS NEEDS TO BE UPDATED TO DEAL WITH IDNA8)) Some IRI
1436	   schemes may allow the usage of Internationalized Domain Names (IDN)
1437	   [RFC3490] either in their ireg-name part or elsewhere.  When in use
1438	   in IRIs, those names SHOULD be validated by using the ToASCII
1439	   operation defined in [RFC3490], with the flags "UseSTD3ASCIIRules"
1440	   and "AllowUnassigned".  An IRI containing an invalid IDN cannot
1441	   successfully be resolved.  Validated IDN components of IRIs SHOULD be
1442	   character normalized by using the Nameprep process [RFC3491];
1443	   however, for legibility purposes, they SHOULD NOT be converted into
1444	   ASCII Compatible Encoding (ACE).

1446	   Scheme-based normalization may also consider IDN components and their
1447	   conversions to punycode as equivalent.  As an example,
1448	   "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to
1449	   "http://xn--rsum-bpad.example.org".
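
As a rough illustration, Python's built-in "idna" codec, which
implements the ToASCII operation of [RFC3490], can be used to compare
a host name with its ACE form.  The helper name same_host is an
assumption of this example, and the codec does not cover any later
revision of IDNA.

```python
def same_host(a: str, b: str) -> bool:
    """Compare two host names after conversion to their ACE form."""
    return a.encode("idna") == b.encode("idna")

ace = "r\u00e9sum\u00e9.example.org".encode("idna")
equivalent = same_host("r\u00e9sum\u00e9.example.org", "xn--rsum-bpad.example.org")
```

The ACE form is used here only for local comparison; for display, the
text above recommends keeping the non-ASCII form.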

1451	   Other scheme-specific normalizations are possible.

1453	5.3.4.  Protocol-Based Normalization

1455	   Substantial effort to reduce the incidence of false negatives is
1456	   often cost-effective for web spiders.  Consequently, they implement
1457	   even more aggressive techniques in IRI comparison.  For example, if
1458	   they observe that an IRI such as

1460	      http://example.com/data

1462	   redirects to an IRI differing only in the trailing slash

1464	      http://example.com/data/

1466	   they will likely regard the two as equivalent in the future.  This
1467	   kind of technique is only appropriate when equivalence is clearly
1468	   indicated by both the result of accessing the resources and the
1469	   common conventions of their scheme's dereference algorithm (in this
1470	   case, use of redirection by HTTP origin servers to avoid problems
1471	   with relative references).

1473	6.  Use of IRIs

1475	6.1.  Limitations on UCS Characters Allowed in IRIs

1477	   This section discusses limitations on characters and character
1478	   sequences usable for IRIs beyond those given in Section 2.2 and
1479	   Section 4.1.  The considerations in this section are relevant when
1480	   IRIs are created and when URIs are converted to IRIs.

1482	   a. The repertoire of characters allowed in each IRI component is
1483	      limited by the definition of that component.  For example, the
1484	      definition of the scheme component does not allow characters
1485	      beyond US-ASCII.

1487	      (Note: In accordance with URI practice, generic IRI software
1488	      cannot and should not check for such limitations.)

1490	   b. The UCS contains many areas of characters for which there are
1491	      strong visual look-alikes.  Because of the likelihood of
1492	      transcription errors, these also should be avoided.  This includes
1493	      the full-width equivalents of Latin characters, half-width
1494	      Katakana characters for Japanese, and many others.  It also
1495	      includes many look-alikes of "space", "delims", and "unwise",
1496	      characters excluded in [RFC3491].

1498	   Additional information is available from [UNIXML].  [UNIXML] is
1499	   written in the context of running text rather than in that of
1500	   identifiers.  Nevertheless, it discusses many of the categories of
1501	   characters not appropriate for IRIs.

1503	6.2.  Software Interfaces and Protocols

1505	   Although an IRI is defined as a sequence of characters, software
1506	   interfaces for URIs typically function on sequences of octets or
1507	   other kinds of code units.  Thus, software interfaces and protocols
1508	   MUST define which character encoding is used.

1510	   Intermediate software interfaces between IRI-capable components and
1511	   URI-only components MUST map the IRIs per Section 3.6, when
1512	   transferring from IRI-capable to URI-only components.  This mapping
1513	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
1514	   between components that are known to be able to handle IRIs.

1516	6.3.  Format of URIs and IRIs in Documents and Protocols

1518	   Document formats that transport URIs may have to be upgraded to allow
1519	   the transport of IRIs.  In cases where the document as a whole has a
1520	   native character encoding, IRIs MUST also be encoded in this
1521	   character encoding and converted accordingly by a parser or
1522	   interpreter.  IRI characters not expressible in the native character
1523	   encoding SHOULD be escaped by using the escaping conventions of the
1524	   document format if such conventions are available.  Alternatively,
1525	   they MAY be percent-encoded according to Section 3.6.  For example,
1526	   in HTML or XML, numeric character references SHOULD be used.  If a
1527	   document as a whole has a native character encoding and that
1528	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
1529	   the document in the UTF-8 character encoding.
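
A small sketch of the escaping rule for HTML or XML: when an IRI is
written into a document whose native encoding cannot express one of
its characters, that character is emitted as a numeric character
reference.  Python's "xmlcharrefreplace" error handler does this
during encoding.  The choice of iso-8859-1 and the example IRI are
assumptions of this example.

```python
iri = "http://example.org/Vi\u1ec7t/ros\u00e9"

# U+00E9 exists in iso-8859-1 and is encoded natively as one octet;
# U+1EC7 does not, so it becomes a decimal character reference.
in_document = iri.encode("iso-8859-1", errors="xmlcharrefreplace")
```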

1531	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1532	   although they use different terminology.  HTML 4.0 [HTML4] defines
1533	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
1534	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
1535	   based upon them allow IRIs.  Also, it is expected that all relevant
1536	   new W3C formats and protocols will be required to handle IRIs
1537	   [CharMod].

1539	6.4.  Use of UTF-8 for Encoding Original Characters

1541	   This section discusses details and gives examples for point c) in
1542	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
1543	   IRI in question has to encode original characters into octets by
1544	   using UTF-8.  This can be specified for all URIs of a URI scheme or
1545	   can apply to individual URIs for schemes that do not specify how to
1546	   encode original characters.  It can apply to the whole URI, or only
1547	   to some part.  For background information on encoding characters into
1548	   URIs, see also Section 2.5 of [RFC3986].

1550	   For new URI schemes, using UTF-8 is recommended in [RFC4395].
1551	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
1552	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
1553	   because the HTTP URI scheme does not specify how to encode original
1554	   characters, only some HTTP URLs can have corresponding but different
1555	   IRIs.

1557	   For example, for a document with a URI of
1558	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
1559	   construct a corresponding IRI (in XML notation, see Section 1.4):
1560	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
1561	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
1562	   encoded representation of that character).  On the other hand, for a
1563	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
1564	   percent-encoding octets cannot be converted to actual characters in
1565	   an IRI, as the percent-encoding is not based on UTF-8.
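
The distinction can be sketched as follows: runs of percent-encoded
octets above 0x7F are turned into characters only when they form
valid UTF-8, and are left untouched otherwise.  The function name
pct_to_chars is an assumption of this example, and the sketch omits
the further restrictions that a full URI-to-IRI conversion applies to
the resulting characters.

```python
import re

# One or more consecutive triplets whose octet value is 0x80 or above.
_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def pct_to_chars(uri: str) -> str:
    def decode_run(m: re.Match) -> str:
        octets = bytes.fromhex(m.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")     # strict: rejects non-UTF-8
        except UnicodeDecodeError:
            return m.group(0)                 # keep the percent-encoding
    return _RUN.sub(decode_run, uri)

utf8_based = pct_to_chars("http://www.example.org/r%C3%A9sum%C3%A9.html")
not_utf8 = pct_to_chars("http://www.example.org/r%E9sum%E9.html")
```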

1567	   For most URI schemes, there is no need to upgrade their scheme
1568	   definition in order for them to work with IRIs.  The main case where
1569	   upgrading makes sense is when a scheme definition, or a particular
1570	   component of a scheme, is strictly limited to the use of US-ASCII
1571	   characters with no provision to include non-ASCII characters/octets
1572	   via percent-encoding, or if a scheme definition currently uses highly
1573	   scheme-specific provisions for the encoding of non-ASCII characters.
1574	   An example of this is the mailto: scheme [RFC2368].

1576	   This specification updates the IANA registry of URI schemes to note
1577	   their applicability to IRIs, see Section 9.  All IRIs use URI
1578	   schemes, and all URIs with URI schemes can be used as IRIs, even
1579	   though in some cases only by using URIs directly as IRIs, without any
1580	   conversion.

1582	   Scheme definitions can impose restrictions on the syntax of scheme-
1583	   specific URIs; i.e., URIs that are admissible under the generic URI
1584	   syntax [RFC3986] may not be admissible due to narrower syntactic
1585	   constraints imposed by a URI scheme specification.  URI scheme
1586	   definitions cannot broaden the syntactic restrictions of the generic
1587	   URI syntax; otherwise, it would be possible to generate URIs that
1588	   satisfied the scheme-specific syntactic constraints without
1589	   satisfying the syntactic constraints of the generic URI syntax.
1590	   However, additional syntactic constraints imposed by URI scheme
1591	   specifications are applicable to IRIs, as the corresponding URI
1592	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
1593	   under the syntactic restrictions of generic URI syntax and any
1594	   narrower restrictions imposed by the corresponding URI scheme
1595	   specification.

1597	   The requirement for the use of UTF-8 generally applies to all parts
1598	   of a URI.  However, it is possible that the capability of IRIs to
1599	   represent a wide range of characters directly is used just in some
1600	   parts of the IRI (or IRI reference).  The other parts of the IRI may
1601	   only contain US-ASCII characters, or they may not be based on UTF-8.
1602	   They may be based on another character encoding, or they may directly
1603	   encode raw binary data (see also [RFC2397]).

1605	   For example, it is possible to have a URI reference of
1606	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
1607	   document name is encoded in iso-8859-1 based on server settings, but
1608	   where the fragment identifier is encoded in UTF-8 according to
1609	   [XPointer].  The IRI corresponding to the above URI would be (in XML
1610	   notation)
1611	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

1613	   Similar considerations apply to query parts.  The functionality of
1614	   IRIs (namely, to be able to include non-ASCII characters) can only be
1615	   used if the query part is encoded in UTF-8.

1617	6.5.  Relative IRI References

1619	   Processing of relative IRI references against a base is handled
1620	   straightforwardly; the algorithms of [RFC3986] can be applied
1621	   directly, treating the characters additionally allowed in IRI
1622	   references in the same way that unreserved characters are in URI
1623	   references.

1625	7.  Liberal handling of otherwise invalid IRIs

1627	   (EDITOR NOTE: This Section may move to an appendix.)  Some technical
1628	   specifications and widely-deployed software have allowed additional
1629	   variations and extensions of IRIs to be used in syntactic components.
1630	   This section describes two widely-used preprocessing agreements.
1631	   Other technical specifications may wish to reference a syntactic
1632	   component which is "a valid IRI or a string that will map to a valid
1633	   IRI after this preprocessing algorithm".  These two variants are
1634	   known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address
1635	   [HTML5].

1637	   Future technical specifications SHOULD NOT allow conforming producers
1638	   to produce, or conforming content to contain, such forms, as they are
1639	   not interoperable with other IRI consuming software.

1641	7.1.  LEIRI processing

1643	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1644	   Legacy Extended IRIs is the same as that for IRIs, except that the
1645	   ucschar production is replaced by the leiri-ucschar production:

1647	     leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1648	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1649	                      / %xE000-FFFD / %x10000-10FFFF

1651	   Among other extensions, processors based on this specification also
1652	   did not enforce the restriction on bidirectional formatting
1653	   characters in Section 4.1, and the iprivate production becomes
1654	   redundant.

1656	   To convert a string allowed as a LEIRI to an IRI, each character
1657	   allowed in leiri-ucschar but not in ucschar must be percent-encoded
1658	   using Section 3.3.
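
A partial sketch of this conversion is given below.  It handles only
the US-ASCII characters listed in leiri-ucschar (the named
punctuation, %x0-1F, and %x7F), none of which can be in ucschar.  The
non-ASCII part of the difference depends on the ucschar production,
which is defined elsewhere in this document, and is deliberately left
out.  The function name leiri_to_iri_ascii is an assumption of this
example.

```python
# ASCII characters allowed by leiri-ucschar: space, < > " { } | \ ^ `,
# the C0 controls, and DEL.
_LEIRI_ASCII = set(' <>"{}|\\^`') | {chr(c) for c in range(0x20)} | {"\x7f"}

def leiri_to_iri_ascii(leiri: str) -> str:
    """Percent-encode the ASCII-only part of the LEIRI extensions."""
    return "".join(
        f"%{ord(c):02X}" if c in _LEIRI_ASCII else c for c in leiri
    )
```

For these characters the UTF-8 encoding is a single octet, so each one
maps to exactly one percent-encoding triplet.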

1660	7.2.  Web Address processing

1662	   Many popular web browsers have taken the approach of being quite
1663	   liberal in what is accepted as a "URL" or its relative forms.  This
1664	   section describes their behavior in terms of a preprocessor which
1665	   maps strings into the IRI space for subsequent parsing and
1666	   interpretation as an IRI.

1668	   In some situations, it might be appropriate to describe the syntax
1669	   that a liberal consumer implementation might accept as a "Web
1670	   Address" or "Hypertext Reference" or "HREF".  However, technical
1671	   specifications SHOULD restrict the syntactic form allowed by
1672	   compliant producers to the IRI or IRI reference syntax defined in
1673	   this document even if they want to mandate this processing.

1675	   Summary:

1677	   o  Leading and trailing whitespace is removed.

1679	   o  Some additional characters are removed.

1681	   o  Some additional characters are allowed and escaped (as with
1682	      LEIRI).

1684	   o  If interpreting an IRI as a URI, the percent-encoding of the query
1685	      component of the parsed URI depends on operational
1686	      context.

1688	   Each string provided may have an associated charset (called the HREF-
1689	   charset here); this defaults to UTF-8.  For web browsers interpreting
1690	   HTML, the HREF-charset of a string is determined as follows:

1692	   If the string came from a script (e.g., as an argument to a method)
1693	      The HREF-charset is the script's charset.

1695	   If the string came from a DOM node (e.g., from an element)  The node
1696	      has a Document, and the HREF-charset is the Document's character
1697	      encoding.

1699	   If the string had an HREF-charset defined when the string was created
1700	   or defined  The HREF-charset is as defined.

1702	   If the resulting HREF-charset is a Unicode-based character encoding
1703	   (e.g., UTF-16), then use UTF-8 instead.

1705	   The syntax for Web Addresses is obtained by replacing the 'ucschar',
1706	   pct-form, and path-sep rules with the href-ucschar, href-pct-form,
1707	   and href-path-sep rules below.  In addition, some characters are
1708	   stripped.

1710	     href-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1711	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1712	                      / %xE000-FFFD / %x10000-10FFFF
1713	     href-pct-form = pct-encoded / "%"
1714	     href-path-sep = "/" / "\"
1715	     href-strip    =

1717	   (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT
1718	   SENTENCE) browsers did not enforce the restriction on bidirectional
1719	   formatting characters in Section 4.1, and the iprivate production
1720	   becomes redundant.

1722	   'Web Address processing' requires the following additional
1723	   preprocessing steps:

1725	   1.  Leading and trailing instances of space (U+0020), CR (U+000D), LF
1726	       (U+000A), and TAB (U+0009) characters are removed.

1728	   2.  Strip all characters in href-strip.

1730	   3.  Percent-encode all characters in href-ucschar not in ucschar.

1732	   4.  Replace occurrences of "%" not followed by two hexadecimal digits
1733	       by "%25".

1735	   5.  Convert backslashes ('\') matching href-path-sep to forward
1736	       slashes ('/').
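
Steps 1 and 4 can be sketched on their own, as below.  The function
name web_address_steps_1_and_4 is an assumption of this example.
Step 2 is omitted because href-strip is not yet filled in above, step
3 mirrors the LEIRI conversion of Section 7.1, and step 5 is omitted
because deciding which backslashes match href-path-sep requires
parsing the string.

```python
import re

# A "%" that is not followed by two hexadecimal digits.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def web_address_steps_1_and_4(text: str) -> str:
    # Step 1: remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Step 4: escape stray percent signs as "%25".
    return _BARE_PERCENT.sub("%25", text)

cleaned = web_address_steps_1_and_4("\t http://example.com/100%/a%2Fb%zz \r\n")
```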

1738	7.3.  Characters not allowed in IRIs

1740	   This section provides a list of the groups of characters and code
1741	   points that are allowed by LEIRI or HREF but are not allowed in IRIs
1742	   or are allowed in IRIs only in the query part.  For each group of
1743	   characters, advice on the usage of these characters is also given,
1744	   concentrating on the reasons for why they are excluded from IRI use.

1746	      Space (U+0020): Some formats and applications use space as a
1747	      delimiter, e.g. for items in a list.  Appendix C of [RFC3986] also
1748	      mentions that white space may have to be added when displaying or
1749	      printing long URIs; the same applies to long IRIs.  This means
1750	      that spaces can disappear, or can cause what is intended as a
1751	      single IRI or IRI reference to be treated as two or more separate
1752	      IRIs.

1754	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1755	      C of [RFC3986] suggests the use of double-quotes
1756	      ("http://example.com/") and angle brackets (<http://example.com/>)
1757	      as delimiters for URIs in plain text.  These conventions are often
1758	      used, and also apply to IRIs.  Using these characters in strings
1759	      intended to be IRIs would result in the IRIs being cut off at the
1760	      wrong place.

1762	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1763	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters were
1764	      originally excluded from URIs because the respective codepoints
1765	      are assigned to different graphic characters in some 7-bit or
1766	      8-bit encodings.  Despite the move to Unicode, some of these
1767	      characters are still occasionally displayed differently, e.g.,
1768	      U+005C may appear as a Japanese Yen symbol on some systems.
1769	      Also, the fact that these characters are not used
1770	      in URIs or IRIs has encouraged their use outside URIs or IRIs in
1771	      contexts that may include URIs or IRIs.  If a string with such a
1772	      character were used as an IRI in such a context, it would likely
1773	      be interpreted piecemeal.

1775	      The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F and
1776	      #x7F - #x9F): There is generally no way to transmit these
1777	      characters reliably as text outside of a charset encoding.  Even
1778	      when in encoded form, many software components silently filter
1779	      out some of these characters, or may stop processing altogether
1780	      when encountering some of them.  These characters may affect text
1781	      display in subtle, unnoticeable ways or in drastic, global, and
1782	      irreversible ways depending on the hardware and software involved.
1783	      The use of some of these characters would allow malicious users to
1784	      manipulate the display of an IRI and its context in many
1785	      situations.

1787	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1788	      characters affect the display ordering of characters.  If IRIs
1789	      were allowed to contain these characters and the resulting visual
1790	      display were transcribed, they could not be converted back to
1791	      electronic form (logical order) unambiguously.  These characters,
1792	      if allowed in IRIs, might allow malicious users to manipulate the
1793	      display of an IRI and its context.

1795	      Specials (U+FFF0-FFFD): These code points provide functionality
1796	      beyond that useful in an IRI, for example byte order
1797	      identification, annotation, and replacements for unknown
1798	      characters and objects.  Their use and interpretation in an IRI
1799	      would serve no purpose and might lead to confusing display
1800	      variations.

1802	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1803	      10FFFD): Display and interpretation of these code points is by
1804	      definition undefined without private agreement.  Therefore, these
1805	      code points are not suited for use on the Internet.  They are not
1806	      interoperable and may have unpredictable effects.

1808	      Tags (U+E0000-E0FFF): These characters provide a way to language
1809	      tag in Unicode plain text.  They are not appropriate for IRIs
1810	      because language information in identifiers cannot reliably be
1811	      input, transmitted (e.g. on a visual medium such as paper), or
1812	      recognized.

1814	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1815	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1816	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1817	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1818	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1819	      non-characters.  Applications may use some of them internally, but
1820	      are not prepared to interchange them.

1822	   LEIRI preprocessing disallowed some code points and code units:

1824	      Surrogate code units (D800-DFFF): These do not represent Unicode
1825	      codepoints.
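
   The groups listed in this section can be summarized as a small
   lookup.  The following Python sketch is only an illustration; the
   function name classify_excluded and the group labels are
   hypothetical, and the ranges are copied from the list above.

```python
def classify_excluded(cp):
    """Return the Section 7.3 group for code point cp, or None."""
    if cp == 0x20:
        return "space"
    if cp in (0x3C, 0x3E, 0x22):
        return "delimiter"
    if cp in (0x5C, 0x5E, 0x60, 0x7B, 0x7C, 0x7D):
        return "unwise"
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    if cp in (0x200E, 0x200F) or 0x202A <= cp <= 0x202E:
        return "bidi formatting"
    if 0xFFF0 <= cp <= 0xFFFD:
        return "special"
    if (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    if 0xE0000 <= cp <= 0xE0FFF:
        return "tag"
    # U+FDD0-FDEF plus the last two code points of planes 1 through 16.
    if 0xFDD0 <= cp <= 0xFDEF or (cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE):
        return "non-character"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    return None
```

   For instance, classify_excluded(ord("\\")) yields "unwise", while an
   ordinary letter such as "a" yields None.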

1827	8.  URI/IRI Processing Guidelines (Informative)

1829	   This informative section provides guidelines for supporting IRIs in
1830	   the same software components and operations that currently process
1831	   URIs: Software interfaces that handle URIs, software that allows
1832	   users to enter URIs, software that creates or generates URIs,
1833	   software that displays URIs, formats and protocols that transport
1834	   URIs, and software that interprets URIs.  These may all require
1835	   modification before functioning properly with IRIs.  The
1836	   considerations in this section also apply to URI references and IRI
1837	   references.

1839	8.1.  URI/IRI Software Interfaces

1841	   Software interfaces that handle URIs, such as URI-handling APIs and
1842	   protocols transferring URIs, need interfaces and protocol elements
1843	   that are designed to carry IRIs.

1845	   In case the current handling in an API or protocol is based on US-
1846	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1847	   is compatible with US-ASCII, is in accordance with the
1848	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1849	   any case, the API or protocol definition must clearly define the
1850	   character encoding to be used.

1852	   The transfer from URI-only to IRI-capable components requires no
1853	   mapping, although the conversion described in Section 3.7 above may
1854	   be performed.  It is preferable not to perform this inverse
1855	   conversion unless it is certain this can be done correctly.

1857	8.2.  URI/IRI Entry

1859	   Some components allow users to enter URIs into the system by typing
1860	   or dictation, for example.  This software must be updated to allow
1861	   for IRI entry.

1863	   A person viewing a visual representation of an IRI (as a sequence of
1864	   glyphs, in some order, in some visual display) or hearing an IRI will
1865	   use an entry method for characters in the user's language to input
1866	   the IRI.  Depending on the script and the input method used, this may
1867	   be a more or less complicated process.

1869	   The process of IRI entry must ensure, as much as possible, that the
1870	   restrictions defined in Section 2.2 are met.  This may be done by
1871	   choosing appropriate input methods or variants/settings thereof, by
1872	   appropriately converting the characters being input, by eliminating
1873	   characters that cannot be converted, and/or by issuing a warning or
1874	   error message to the user.

1876	   As an example of variant settings, input method editors for East
1877	   Asian Languages usually allow the input of Latin letters and related
1878	   characters in full-width or half-width versions.  For IRI input, the
1879	   input method editor should be set so that it produces half-width
1880	   Latin letters and punctuation and full-width Katakana.

1882	   An input field primarily or solely used for the input of URIs/IRIs
1883	   might allow the user to view an IRI as it is mapped to a URI.  Places
1884	   where the input of IRIs is frequent may provide the possibility for
1885	   viewing an IRI as mapped to a URI.  This will help users when some of
1886	   the software they use does not yet accept IRIs.

1888	   An IRI input component interfacing to components that handle URIs,
1889	   but not IRIs, must map the IRI to a URI before passing it to these
1890	   components.

1892	   For the input of IRIs with right-to-left characters, please see
1893	   Section 4.3.

1895	8.3.  URI/IRI Transfer between Applications

1897	   Many applications (for example, mail user agents) try to detect URIs
1898	   appearing in plain text.  For this, they use some heuristics based on
1899	   URI syntax.  They then allow the user to click on such URIs and
1900	   retrieve the corresponding resource in an appropriate (usually
1901	   scheme-dependent) application.

1903	   Such applications would need to be upgraded, in order to use the IRI
1904	   syntax as a base for heuristics.  In particular, a non-ASCII
1905	   character should not be taken as the indication of the end of an IRI.
1906	   Such applications also would need to make sure that they correctly
1907	   convert the detected IRI from the character encoding of the document
1908	   or application where the IRI appears, to the character encoding used
1909	   by the system-wide IRI invocation mechanism, or to a URI (according
1910	   to Section 3.6) if the system-wide invocation mechanism only accepts
1911	   URIs.
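
   As a rough illustration of such a heuristic, the following Python
   sketch scans plain text for candidate IRIs without stopping at
   non-ASCII characters.  The pattern, the restriction to "http" and
   "https", and the name find_iri_candidates are assumptions made for
   this example only; real detectors use considerably more elaborate
   rules.

```python
import re

# A candidate ends at whitespace or at one of the delimiter and "unwise"
# characters that are not allowed in IRIs.  Non-ASCII characters are
# deliberately NOT treated as the end of the candidate.
IRI_CANDIDATE = re.compile(r'https?://[^\s<>"{}|\\^`]+')


def find_iri_candidates(text):
    """Return all substrings of text that look like http(s) IRIs."""
    return IRI_CANDIDATE.findall(text)
```

   Trailing punctuation that directly follows an undelimited IRI is not
   handled by this sketch and would be included in the match.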

1913	   The clipboard is another frequently used way to transfer URIs and
1914	   IRIs from one application to another.  On most platforms, the
1915	   clipboard is able to store and transfer text in many languages and
1916	   scripts.  Correctly used, the clipboard transfers characters, not
1917	   octets, which will do the right thing with IRIs.

1919	8.4.  URI/IRI Generation

1921	   Systems that offer resources through the Internet, where those
1922	   resources have logical names, sometimes automatically generate URIs
1923	   for the resources they offer.  For example, some HTTP servers can
1924	   generate a directory listing for a file directory and then respond to
1925	   the generated URIs with the files.

1927	   Many legacy character encodings are in use in various file systems.
1928	   Many currently deployed systems do not transform the local character
1929	   representation of the underlying system before generating URIs.

1931	   For maximum interoperability, systems that generate resource
1932	   identifiers should make the appropriate transformations.  For
1933	   example, if a file system contains a file named
1934	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1935	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1936	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1937	   kept in a character encoding other than UTF-8.

1939	   This recommendation particularly applies to HTTP servers.  For FTP
1940	   servers, similar considerations apply; see [RFC2640].
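
   The transformation recommended above can be sketched in Python as
   follows.  This is a minimal sketch, assuming the server knows the
   character encoding of its file names; the function name
   filename_to_uri_segment and its parameters are hypothetical.

```python
from urllib.parse import quote


def filename_to_uri_segment(raw_name, fs_encoding):
    """Map a file name stored as bytes in a local encoding to a
    percent-encoded, UTF-8-based URI path segment."""
    # Recover the characters from the local (possibly legacy) encoding.
    name = raw_name.decode(fs_encoding)
    # quote() percent-encodes the UTF-8 octets of the characters;
    # safe="" ensures that "/" inside a name is encoded as well.
    return quote(name, safe="")
```

   With a file name stored in ISO-8859-1 as the octets
   b"r\xe9sum\xe9.html", the sketch yields "r%C3%A9sum%C3%A9.html".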

1942	8.5.  URI/IRI Selection

1944	   In some cases, resource owners and publishers have control over the
1945	   IRIs used to identify their resources.  This control is mostly
1946	   exercised by controlling the resource names, such as file names,
1947	   directly.

1949	   In these cases, it is recommended to avoid choosing IRIs that are
1950	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1951	   is easily confused with the digit one ("1"), and the upper-case oh
1952	   ("O") is easily confused with the digit zero ("0").  Publishers
1953	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1955	   Outside the US-ASCII repertoire, there are many more opportunities
1956	   for confusion; a complete set of guidelines is too lengthy to include
1957	   here.  As long as names are limited to characters from a single
1958	   script, native writers of a given script or language will know best
1959	   when ambiguities can appear, and how they can be avoided.  What may
1960	   look ambiguous to a stranger may be completely obvious to the average
1961	   native user.  On the other hand, in some cases, the UCS contains
1962	   variants for compatibility reasons; for example, for typographic
1963	   purposes.  These should be avoided wherever possible.  Although there
1964	   may be exceptions, newly created resource names should generally be
1965	   in NFKC [UTR15] (which means that they are also in NFC).

1967	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1968	   compatibility reasons.  Wherever possible, IRIs should use the two
1969	   letters "f" and "i" rather than the "fi" ligature.  An example where
1970	   the latter may be used is in the query part of an IRI for an explicit
1971	   search for a word written containing the "fi" ligature.
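
   A check for the NFKC recommendation can be sketched with Python's
   standard unicodedata module.  The helper names to_nfkc and is_nfkc
   are hypothetical and serve only as an illustration.

```python
import unicodedata


def to_nfkc(name):
    """Return name normalized to NFKC."""
    return unicodedata.normalize("NFKC", name)


def is_nfkc(name):
    """True if name is already in NFKC (and therefore also in NFC)."""
    return to_nfkc(name) == name
```

   Applied to a name that begins with the U+FB01 ligature, to_nfkc
   replaces the ligature with the two letters "f" and "i".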

1973	   In certain cases, there is a chance that characters from different
1974	   scripts look the same.  The best known example is the similarity of
1975	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1976	   such cases, IRIs should only be created where all the characters in a
1977	   single component are used together in a given language.  This usually
1978	   means that all of these characters will be from the same script, but
1979	   there are languages that mix characters from different scripts (such
1980	   as Japanese).  This is similar to the heuristics used to distinguish
1981	   between letters and numbers in the examples above.  Also, for Latin,
1982	   Greek, and Cyrillic, using lowercase letters results in fewer
1983	   ambiguities than using uppercase letters would.
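
   A crude way to flag components that mix look-alike scripts is shown
   in the following Python sketch.  It is a heuristic only: the Python
   standard library does not expose the Unicode Script property, so the
   first word of each character name is used as a stand-in, and the
   names scripts_in and looks_mixed are hypothetical.

```python
import unicodedata


def scripts_in(component):
    """Approximate the set of scripts used by the letters in component,
    using the first word of each Unicode character name."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def looks_mixed(component):
    """True if the letters in component appear to come from more than
    one script."""
    return len(scripts_in(component)) > 1
```

   Because languages such as Japanese legitimately mix scripts, a
   result of True is a prompt for closer review, not proof of
   spoofing.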

1985	8.6.  Display of URIs/IRIs

1987	   In situations where the rendering software is not expected to display
1988	   non-ASCII parts of the IRI correctly using the available layout and
1989	   font resources, these parts should be percent-encoded before being
1990	   displayed.

1992	   For display of Bidi IRIs, please see Section 4.1.

1994	8.7.  Interpretation of URIs and IRIs

1996	   Software that interprets IRIs as the names of local resources should
1997	   accept IRIs in multiple forms and convert and match them with the
1998	   appropriate local resource names.

2000	   First, multiple representations include both IRIs in the native
2001	   character encoding of the protocol and also their URI counterparts.

2003	   Second, it may include URIs constructed based on character encodings
2004	   other than UTF-8.  These URIs may be produced by user agents that do
2005	   not conform to this specification and that use legacy character
2006	   encodings to convert non-ASCII characters to URIs.  Whether this is
2007	   necessary, and what character encodings to cover, depends on a number
2008	   of factors, such as the legacy character encodings used locally and
2009	   the distribution of various versions of user agents.  For example,
2010	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
2011	   addition to UTF-8.

2013	   Third, it may include additional mappings to be more user-friendly
2014	   and robust against transmission errors.  These would be similar to
2015	   how some servers currently treat URIs as case insensitive or perform
2016	   additional matching to account for spelling errors.  For characters
2017	   beyond the US-ASCII repertoire, this may, for example, include
2018	   ignoring the accents on received IRIs or resource names.  Please note
2019	   that such mappings, including case mappings, are language dependent.

2021	   It can be difficult to identify a resource unambiguously if too many
2022	   mappings are taken into consideration.  However, percent-encoded and
2023	   not percent-encoded parts of IRIs can always be clearly
2024	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
2025	   the potential for collisions lower than it may seem at first.
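
   The second case above, accepting URIs built from legacy character
   encodings, can be sketched in Python as follows.  This is a sketch
   under assumptions: UTF-8 is tried first, the fallback list is a
   local policy choice, and the name decode_path_segment is
   hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_path_segment(segment, fallbacks=("shift_jis", "euc_jp")):
    """Undo percent-encoding and decode the octets, preferring UTF-8.

    Returns (text, encoding), or (None, None) if no encoding fits."""
    raw = unquote_to_bytes(segment)
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None
```

   Trying UTF-8 first relies on the regularity of UTF-8 noted above:
   octet sequences produced by legacy encodings are rarely valid UTF-8,
   so a successful UTF-8 decode is a strong hint.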

2027	8.8.  Upgrading Strategy

2029	   Where this recommendation places further constraints on software for
2030	   which many instances are already deployed, it is important to
2031	   introduce upgrades carefully and to be aware of the various
2032	   interdependencies.

2034	   If IRIs cannot be interpreted correctly, they should not be created,
2035	   generated, or transported.  This suggests that upgrading URI
2036	   interpreting software to accept IRIs should have highest priority.

2038	   On the other hand, a single IRI is interpreted only by one or a
2039	   few interpreters that are known in advance, although it may be
2040	   entered and transported very widely.

2042	   Therefore, IRIs benefit most from a broad upgrade of software to be
2043	   able to enter and transport IRIs.  However, before an individual IRI
2044	   is published, care should be taken to upgrade the corresponding
2045	   interpreting software in order to cover the forms expected to be
2046	   received by various versions of entry and transport software.

2048	   The upgrade of generating software to generate IRIs instead of using
2049	   a local character encoding should happen only after the service is
2050	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
2051	   when the service accepts IRIs and the intervening infrastructure and
2052	   protocol are known to transport them safely.

2054	   Software converting from URIs to IRIs for display should be upgraded
2055	   only after upgraded entry software has been widely deployed to the
2056	   population that will see the displayed result.

2058	   Where there is a free choice of character encodings, it is often
2059	   possible to reduce the effort and dependencies for upgrading to IRIs
2060	   by using UTF-8 rather than another encoding.  For example, when a new
2061	   file-based Web server is set up, using UTF-8 as the character
2062	   encoding for file names will make the transition to IRIs easier.
2063	   Likewise, when a new Web form is set up using UTF-8 as the character
2064	   encoding of the form page, the returned query URIs will use UTF-8 as
2065	   the character encoding (unless the user, for whatever reason, changes
2066	   the character encoding) and will therefore be compatible with IRIs.

2068	   These recommendations, when taken together, will allow for the
2069	   extension from URIs to IRIs in order to handle characters other than
2070	   US-ASCII while minimizing interoperability problems.  For
2071	   considerations regarding the upgrade of URI scheme definitions, see
2072	   Section 6.4.

2074	9.  IANA Considerations

2076	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
2077	   this document when it is issued as an RFC.

2079	   IANA maintains a registry of "URI schemes".  A "URI scheme" also
2080	   serves as an "IRI scheme".

2082	   To clarify that the URI scheme registration process also applies to
2083	   IRIs, change the description of the "URI schemes" registry header to
2084	   say "[RFC4395] defines an IANA-maintained registry of URI Schemes.

2086	   These registries include the Permanent and Provisional URI Schemes.
2087	   RFC XXXX updates this registry to designate that schemes may also
2088	   indicate their usability as IRI schemes."

2090	   Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".

2092	10.  Security Considerations

2094	   The security considerations discussed in [RFC3986] also apply to
2095	   IRIs.  In addition, the following issues require particular care for
2096	   IRIs.

2098	   Incorrect encoding or decoding can lead to security problems.  In
2099	   particular, some UTF-8 decoders do not check against overlong byte
2100	   sequences.  As an example, a "/" is encoded with the byte 0x2F both
2101	   in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2102	   interpret the sequence 0xC0 0xAF as a "/".  A sequence such as
2103	   "%C0%AF.." may pass some security tests and then be interpreted as
2104	   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2105	   and checking are not done in the right order, and/or if reserved
2106	   characters and unreserved characters are not clearly distinguished.
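
   The ordering issue described above can be illustrated with a short
   Python sketch: percent-decoding is done first, the octets are then
   decoded with a strict UTF-8 decoder, and only the decoded result is
   checked.  The names decode_strict and is_safe_path are hypothetical,
   and the ".." test stands in for whatever checks an application
   really needs.

```python
from urllib.parse import unquote_to_bytes


def decode_strict(pct_encoded):
    """Undo percent-encoding, then decode as UTF-8, rejecting invalid
    sequences (including overlong forms such as 0xC0 0xAF)."""
    raw = unquote_to_bytes(pct_encoded)
    return raw.decode("utf-8", errors="strict")


def is_safe_path(pct_encoded):
    """Check a path for ".." segments after strict decoding."""
    try:
        decoded = decode_strict(pct_encoded)
    except UnicodeDecodeError:
        # Malformed UTF-8 is rejected outright, never repaired.
        return False
    return ".." not in decoded.split("/")
```

   With a strict decoder, "%C0%AF.." is rejected as malformed instead
   of being silently turned into "/..".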

2108	   There are various ways in which "spoofing" can occur with IRIs.
2109	   "Spoofing" means that somebody may add a resource name that looks the
2110	   same or similar to the user, but that points to a different resource.
2111	   The added resource may pretend to be the real resource by looking
2112	   very similar but may contain all kinds of changes that may be
2113	   difficult to spot and that can cause all kinds of problems.  Most
2114	   spoofing possibilities for IRIs are extensions of those for URIs.

2116	   Spoofing can occur for various reasons.  First, a user's
2117	   normalization expectations or actual normalization when entering an
2118	   IRI or transcoding an IRI from a legacy character encoding do not
2119	   match the normalization used on the server side.  Conceptually, this
2120	   is no different from the problems surrounding the use of case-
2121	   insensitive web servers.  For example, a popular web page with a
2122	   mixed-case name ("http://big.example.com/PopularPage.html") might be
2123	   "spoofed" by someone who is able to create
2124	   "http://big.example.com/popularpage.html".  However, the use of
2125	   unnormalized character sequences, and of additional mappings for user
2126	   convenience, may increase the chance for spoofing.  Protocols and
2127	   servers that allow the creation of resources with names that are not
2128	   normalized are particularly vulnerable to such attacks.  This is an
2129	   inherent security problem of the relevant protocol, server, or
2130	   resource and is not specific to IRIs, but it is mentioned here for
2131	   completeness.

2133	   Spoofing can occur in various IRI components, such as the domain name
2134	   part or a path part.  For considerations specific to the domain name
2135	   part, see [RFC3491].  For the path part, administrators of sites that
2136	   allow independent users to create resources in the same sub area may
2137	   have to be careful to check for spoofing.

2139	   Spoofing can occur because in the UCS many characters look very
2140	   similar.  Details are discussed in Section 8.5.  Again, this is very
2141	   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
2142	   or "1ame" URIs.

2144	   Spoofing can occur when URIs with percent-encodings based on various
2145	   character encodings are accepted to deal with older user agents.  In
2146	   some cases, particularly for Latin-based resource names, this is
2147	   usually easy to detect because UTF-8-encoded names, when interpreted
2148	   and viewed as legacy character encodings, produce mostly garbage.

2150	   When concurrently used character encodings have a similar structure
2151	   but there are no characters that have exactly the same encoding,
2152	   detection is more difficult.

2154	   Spoofing can occur with bidirectional IRIs, if the restrictions in
2155	   Section 4.2 are not followed.  The same visual representation may be
2156	   interpreted as different logical representations, and vice versa.  It
2157	   is also very important that a correct Unicode bidirectional
2158	   implementation be used.

2160	   The use of Legacy Extended IRIs introduces additional security
2161	   issues.

2163	11.  Acknowledgements

2165	   For contributions to this update, we would like to thank Ian Hickson,
2166	   Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin,
2167	   Henry S. Thompson, and the XML Core Working Group of the W3C.

2169	   The discussion on the issue addressed here started a long time ago.
2170	   There was a thread in the HTML working group in August 1995 (under
2171	   the topic of "Globalizing URIs") and in the www-international mailing
2172	   list in July 1996 (under the topic of "Internationalization and
2173	   URLs"), and there were ad-hoc meetings at the Unicode conferences in
2174	   September 1995 and September 1997.

2176	   For contributions to the previous version of this document, RFC 3987,
2177	   many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
2178	   Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
2179	   Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
2180	   Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
2181	   Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
2182	   Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
2183	   Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
2184	   Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
2185	   Haynes, Walter Underwood, and many others.

2187	   A definition of HyperText Reference was initially produced by Ian
2188	   Hickson, and further edited by Dan Connolly and C. M. Sperberg-
2189	   McQueen.

2191	   Thanks to the Internationalization Working Group (I18N WG) of the
2192	   World Wide Web Consortium (W3C), and the members of the W3C I18N
2193	   Working Group and Interest Group for their contributions and their
2194	   work on [CharMod].  Thanks also go to the members of many other W3C
2195	   Working Groups for adopting IRIs, and to the members of the Montreal
2196	   IAB Workshop on Internationalization and Localization for their
2197	   review.

2199	12.  Open Issues

2201	   NOTE: The issues noted in this section should be addressed before the
2202	   document is submitted as an RFC.  These issues are not in any
2203	   particular order, and do not necessarily form a complete list of all
2204	   known issues.

2206	   length limits on domain name  See, for example,
2207	      http://lists.w3.org/Archives/Public/public-iri/2009Sep/0064.html
2208	      discussion on public-iri@w3.org (that discussion is mostly
2209	      irrelevant now as the "63 octets in UTF-8 per label" restriction
2210	      was dropped)

2212	   Allow generic scheme-independent IRI to URI translation  Previous
2213	      drafts of this specification proposed a generic IRI to URI
2214	      transformation using pct-encoding, and allowed domain name
2215	      translation to be optionally handled by retranslating host names
2216	      from pct-encoding back into Unicode and then into punycode.  This
2217	      draft does not allow that behavior, but this should be fixed to be
2218	      in line with RFC 3986 syntax and to lead implementations towards
2219	      a uniform and long-term URI<->IRI correspondence.  See also
2220	      [Gettys].

2222	   update URI scheme registry?  This document starts the process of
2223	      making minor changes to the URI scheme registry.  This should be
2224	      handled as an update to RFC 4395.

2226	   utf8 in HTTP  Not really an IRI issue, but some HTTP implementations
2227	      send UTF-8 paths directly; review.

2229	   handling of \\  Some web applications convert \ to / and others
2230	      don't.  Make this mandatory or disallowed (but not optional), for
2231	      Web Addresses.

2233	   dealing with disallowed IRI characters

2235	   misplaced text  Find a place to note that some older software
2236	      transcoding to UTF-8 may produce illegal output for some input, in
2237	      particular for characters outside the BMP (Basic Multilingual
2238	      Plane).  As an example, for the IRI with non-BMP characters (in
2239	      XML Notation):
2240	      "http://example.com/&#x10300;&#x10301;&#x10302;"
2241	      which contains the first three letters of the Old Italic alphabet,
2242	      the correct conversion to a URI is
2243	      "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

2245	   Special Query Handling needed?  The percent-encoding handling of
2246	      query components in the HTTP scheme is really unfortunate.  There
2247	      is no good normative advice to give if the percent-encoding is
2248	      delayed until the query-IRI is interpreted.  Could HTML ask
2249	      browsers to percent-encode the form data using the document
2250	      character set BEFORE the query IRI is constructed, and only in the
2251	      case where the document character set isn't Unicode-based and the
2252	      query is being added to http: or https: URIs?  This would give
2253	      more consistent results.  Browsers might have to change their
2254	      behavior in constructing the IRI-with-query-added, but the results
2255	      would be more consistent with fewer bugs, and it wouldn't affect
2256	      interpretation of any existing web pages.  It would remove the
2257	      need to have a normative special case for queries in HTML
2258	      documents, just for http, in a way in which things like
2259	      transcoding etc. wouldn't work well.  You could tell the
2260	      difference between a query URI in the address bar and one created
2261	      via a form because the address bar would always be UTF-8.  The
2262	      browsers might have to change the algorithm for showing the
2263	      address in the address bar to know how to undo the encoding.

2265	   handling illegal characters  Section 3.3 used to apply only to
2266	      characters in either 'ucschar' or 'iprivate', but then later said
2267	      that systems accepting IRIs MAY also deal with the printable
2268	      characters in US-ASCII that are not allowed in URIs, namely "<",
2269	      ">", '"', space, "{", "}", "|", "\", "^", and "`".  Larry felt
2270	      that a MAY would result in non-uniform behavior, because some
2271	      systems would produce valid URI components and others wouldn't.
2272	      Non-printable US-ASCII characters should be stripped by most
2273	      software, so if they are passed on somewhere as IRI
2274	      characters, encoding them makes sense.  The section also used to
2275	      say "If these characters are found but are not converted, then the
2276	      conversion SHOULD fail." but there is no notion of conversion
2277	      failing -- every string is converted.  Please note that the number
2278	      sign ("#"), the percent sign ("%"), and the square bracket
2279	      characters ("[", "]") are not part of the above list and MUST NOT
2280	      be converted.

2282	   adding single % and hash  Changed the BNF to not match the URI
2283	      document in allowing single % in path but not everywhere, and
2284	      allowing a # in the fragment part.

2286	13.  Change Log

2288	   Note to RFC Editor: Please completely remove this section before
2289	   publication.

2291	13.1.  Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00

2293	   Changed draft name, date, last paragraph of abstract, and titles in
2294	   change log, and added this section in moving from
2295	   draft-duerst-iri-bis-07 (personal submission) to
2296	   draft-ietf-iri-3987bis-00 (WG document).

2298	13.2.  Changes from -06 to -07 of draft-duerst-iri-bis

2300	   Major restructuring of IRI processing model to make scheme-specific
2301	   translation necessary to handle IDNA requirements and for consistency
2302	   with web implementations.

2304	   Starting with IRI, you want one of:

2306	   a  IRI components (IRI parsed into UTF8 pieces)

2308	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

2310	   c  whole URI (for passing on to some other system that wants whole
2311	      URIs)

2313	13.2.1.  OLD WAY

2315	   1.  Pct-encoding on the whole thing to a URI. (c1) If you want a
2316	       (maybe broken) whole URI, you might stop here.

2318	   2.  Parsing the URI into URI components. (b1) If you want (maybe
2319	       broken) URI components, stop here.

2321	   3.  Decode the components (undoing the pct-encoding). (a) If you want
2322	       IRI components, stop here.

2324	   4.  Reencode: either using a different encoding for some components
2325	       (for domain names, and query components in web pages, which
2326	       depends on the component, scheme and context), and otherwise
2327	       using pct-encoding. (b2) If you want (good) URI components, stop
2328	       here.
2329	   5.  Reassemble the reencoded components. (c2) If you want a (*good*)
2330	       whole URI, stop here.

2332	13.2.2.  NEW WAY

2334	   1.  Parse the IRI into IRI components using the generic syntax. (a)
2335	       If you want IRI components, stop here.

2337	   2.  Encode each component, using pct-encoding, IDN encoding, or
2338	       special query part encoding depending on the component, scheme,
2339	       or context. (b) If you want URI components, stop here.

2341	   3.  Reassemble a whole URI from the URI components. (c) If you want
2342	       a whole URI, stop here.
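
   As an illustration only, the three steps of the new processing model
   can be sketched in Python.  The function name iri_to_uri and the set
   of characters left unescaped are assumptions of this sketch, not part
   of this document.  The sketch ignores userinfo and IP literals, uses
   UTF-8 for the query part (the processing model allows a
   context-dependent query encoding), and relies on Python's built-in
   "idna" codec, which implements IDNA as defined in [RFC3490].

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in path, query, and fragment (an assumption
# of this sketch): RFC 3986 sub-delims plus ":", "@", "/", "?", and "%".
# Keeping "%" safe means existing percent-encodings are not encoded twice.
_SAFE = "!$&'()*+,;=:@/?%"


def iri_to_uri(iri: str) -> str:
    # Step 1: parse the IRI into IRI components (generic syntax).
    parts = urlsplit(iri)

    # Step 2: encode each component.
    netloc = ""
    if parts.hostname:
        # Host: IDN encoding via Python's "idna" codec (RFC 3490).
        # Userinfo and IP literals are not handled in this sketch.
        netloc = parts.hostname.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc += f":{parts.port}"
    # Other components: UTF-8 followed by pct-encoding.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

   For example, iri_to_uri("http://r\u00e9sum\u00e9.example.org/D\u00fcrst/")
   would return "http://xn--rsum-bpad.example.org/D%C3%BCrst/".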

2344	13.3.  Changes from -05 to -06 of draft-duerst-iri-bis

2346	   o  Add HyperText Reference, change abstract, acks and references for
2347	      it

2349	   o  Add Masinter back as another editor.

2351	   o  Masinter integrates HRef material from HTML5 spec.

2353	   o  Rewrite introduction sections to modernize.

2355	13.4.  Changes from -04 to -05 of draft-duerst-iri-bis

2357	   o  Updated references.

2359	   o  Changed IPR text to pre5378Trust200902.

2361	13.5.  Changes from -03 to -04 of draft-duerst-iri-bis

2363	   o  Added explicit abbreviation for LEIRIs.

2365	   o  Mentioned LEIRI references.

2367	   o  Completed text in LEIRI section about tag characters and about
2368	      specials.

2370	13.6.  Changes from -02 to -03 of draft-duerst-iri-bis

2372	   o  Updated some references.

2374	   o  Updated Michel Suignard's coordinates.

2376	13.7.  Changes from -01 to -02 of draft-duerst-iri-bis

2378	   o  Added tag range to iprivate (issue private-include-tags-115).

2380	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

2382	13.8.  Changes from -00 to -01 of draft-duerst-iri-bis

2384	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
2385	      based on input from the W3C XML Core WG.  Moved the relevant
2386	      subsections to the back and promoted them to a section.

2388	   o  Added some text re.  Legacy Extended IRIs to the security section.

2390	   o  Added an IANA Considerations Section.

2392	   o  Added this Change Log Section.

2394	   o  Added a section about "IRIs with Spaces/Controls" (converting from
2395	      a Note in RFC 3987).

2397	13.9.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

2399	      Fixed errata (see
2400	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

2402	14.  References

2404	14.1.  Normative References

2406	   [ASCII]    American National Standards Institute, "Coded Character
2407	              Set -- 7-bit American Standard Code for Information
2408	              Interchange", ANSI X3.4, 1986.

2410	   [ISO10646]
2411	              International Organization for Standardization, "ISO/IEC
2412	              10646:2003: Information Technology - Universal Multiple-
2413	              Octet Coded Character Set (UCS)", ISO Standard 10646,
2414	              December 2003.

2416	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2417	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2419	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
2420	              "Internationalizing Domain Names in Applications (IDNA)",
2421	              RFC 3490, March 2003.

2423	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2424	              Profile for Internationalized Domain Names (IDN)",
2425	              RFC 3491, March 2003.

2427	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
2428	              10646", STD 63, RFC 3629, November 2003.

2430	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2431	              Resource Identifier (URI): Generic Syntax", STD 66,
2432	              RFC 3986, January 2005.

2434	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
2435	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

2437	   [UNI9]     Davis, M., "The Bidirectional Algorithm", Unicode Standard
2438	              Annex #9, March 2004,
2439	              <http://www.unicode.org/reports/tr9/tr9-13.html>.

2441	   [UNIV4]    The Unicode Consortium, "The Unicode Standard, Version
2442	              5.1.0, defined by: The Unicode Standard, Version 5.0
2443	              (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as
2444	              amended by Unicode 5.1.0
2445	              (http://www.unicode.org/versions/Unicode5.1.0/)",
2446	              April 2008.

2448	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
2449	              Unicode Standard Annex #15, March 2008,
2450	              <http://www.unicode.org/unicode/reports/tr15/
2451	              tr15-23.html>.

2453	14.2.  Informative References

2455	   [BidiEx]   "Examples of bidirectional IRIs",
2456	              <http://www.w3.org/International/iri-edit/BidiExamples>.

2458	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
2459	              Texin, "Character Model for the World Wide Web: Resource
2460	              Identifiers", World Wide Web Consortium Candidate
2461	              Recommendation, November 2004,
2462	              <http://www.w3.org/TR/charmod-resid>.

2464	   [Duerst97]
2465	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
2466	              11th International Unicode Conference, San Jose,
2467	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
2468	              papers/PDF/IUC11-UTF-8.pdf>.

2470	   [Gettys]   Gettys, J., "URI Model Consequences",
2471	              <http://www.w3.org/DesignIssues/ModelConsequences>.

2473	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
2474	              Specification", World Wide Web Consortium Recommendation,
2475	              December 1999,
2476	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

2478	   [HTML5]    Hickson, I. and D. Hyatt, "A vocabulary and associated
2479	              APIs for HTML and XHTML", World Wide Web
2480	              Consortium Working Draft, April 2009,
2481	              <http://www.w3.org/TR/2009/WD-html5-20090423/>.

2483	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
2484	              IRIs for XML resource identification", World Wide Web
2485	              Consortium Note, November 2008,
2486	              <http://www.w3.org/TR/leiri/>.

2488	   [RFC1738]  Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
2489	              Resource Locators (URL)", RFC 1738, December 1994.

2491	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
2492	              Extensions (MIME) Part One: Format of Internet Message
2493	              Bodies", RFC 2045, November 1996.

2495	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
2496	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
2497	              the IAB Character Set Workshop held 29 February - 1 March,
2498	              1996", RFC 2130, April 1997.

2500	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

2502	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

2504	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
2505	              Languages", BCP 18, RFC 2277, January 1998.

2507	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
2508	              URL scheme", RFC 2368, July 1998.

2510	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

2512	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2513	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
2514	              August 1998.

2516	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
2517	              August 1998.

2519	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
2520	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
2521	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

2523	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
2524	              Protocol", RFC 2640, July 1999.

2526	   [RFC4395]  Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
2527	              Registration Procedures for New URI Schemes", BCP 35,
2528	              RFC 4395, February 2006.

2530	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
2531	              Markup Languages", Unicode Technical Report #20, World
2532	              Wide Web Consortium Note, June 2003,
2533	              <http://www.w3.org/TR/unicode-xml/>.

2535	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
2536	              Language (XLink) Version 1.0", World Wide Web
2537	              Consortium Recommendation, June 2001,
2538	              <http://www.w3.org/TR/xlink/#link-locators>.

2540	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
2541	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
2542	              Edition)", World Wide Web Consortium Recommendation,
2543	              August 2006, <http://www.w3.org/TR/REC-xml>.

2545	   [XMLNamespace]
2546	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
2547	              "Namespaces in XML (Second Edition)", World Wide Web
2548	              Consortium Recommendation, August 2006,
2549	              <http://www.w3.org/TR/REC-xml-names>.

2551	   [XMLSchema]
2552	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
2553	              World Wide Web Consortium Recommendation, May 2001,
2554	              <http://www.w3.org/TR/xmlschema-2/#anyURI>.

2556	   [XPointer]
2557	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
2558	              Framework", World Wide Web Consortium Recommendation,
2559	              March 2003,
2560	              <http://www.w3.org/TR/xptr-framework/#escaping>.

2562	Appendix A.  Design Alternatives

2564	   This section briefly summarizes some design alternatives considered
2565	   earlier and the reasons why they were not chosen.

2567	A.1.  New Scheme(s)

2569	   Introducing new schemes (for example, httpi:, ftpi:,...) or a new
2570	   metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
2571	   i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
2572	   dependent or to distinguish between percent-encodings resulting from
2573	   IRI-to-URI conversion and percent-encodings from legacy character
2574	   encodings.

2576	   New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2577	   IRIs that contain non-ASCII characters).  The benefit of being able
2578	   to detect the origin of percent-encodings is marginal, as UTF-8 can
2579	   be detected with very high reliability.  Deploying new schemes is
2580	   extremely hard, so not requiring new schemes for IRIs makes
2581	   deployment of IRIs vastly easier.  Making conversion scheme dependent
2582	   is highly inadvisable and would be encouraged by separate schemes for
2583	   IRIs.  Using a uniform convention for conversion from IRIs to URIs
2584	   makes IRI implementation orthogonal to the introduction of actual new
2585	   schemes.
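
   The claim that UTF-8 can be detected reliably can be illustrated
   with a short Python sketch.  The function name
   has_utf8_percent_encoding is hypothetical.  The check simply undoes
   the percent-encoding and tests whether the resulting octets are
   well-formed UTF-8, on the assumption that octet sequences from
   legacy character encodings rarely are.

```python
from urllib.parse import unquote_to_bytes


def has_utf8_percent_encoding(uri_component: str) -> bool:
    """Return True if the percent-decoded octets contain non-ASCII
    data that is well-formed UTF-8."""
    octets = unquote_to_bytes(uri_component)
    if all(octet < 0x80 for octet in octets):
        # Pure ASCII carries no evidence about the encoding used.
        return False
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so probably a legacy encoding.
        return False
    return True
```

   With this sketch, "D%C3%BCrst" (UTF-8) is accepted, whereas
   "D%FCrst" (the same name percent-encoded from iso-8859-1) is
   rejected.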

2587	A.2.  Character Encodings Other Than UTF-8

2589	   At an early stage, UTF-7 was considered as an alternative to UTF-8
2590	   when IRIs are converted to URIs.  UTF-7 would not have needed
2591	   percent-encoding and in most cases would have been shorter than
2592	   percent-encoded UTF-8.

2594	   Using UTF-8 avoids a double layering and overloading of the use of
2595	   the "+" character.  UTF-8 is fully compatible with US-ASCII and has
2596	   therefore been recommended by the IETF, and is being used widely.

2598	   UTF-7 has never been used much and is now clearly being discouraged.
2599	   Requiring implementations to convert from UTF-8 to UTF-7 and back
2600	   would be an additional implementation burden.
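
   The trade-off can be seen in a small Python sketch that encodes the
   same string both ways.  The sample string is arbitrary, and the
   sketch relies on Python's built-in "utf-7" codec; it is meant only
   to show the overloading of "+", not to recommend any use of UTF-7.

```python
from urllib.parse import quote

text = "ros\u00e9"

# UTF-7: non-ASCII characters are carried in a base64 run opened by "+".
utf7 = text.encode("utf-7")

# UTF-8 followed by octet-based percent-encoding, as used for IRIs.
pct = quote(text)

# A literal "+" must itself be escaped in UTF-7, because "+" already
# has the job of opening a base64 run.
plus_in_utf7 = "+".encode("utf-7")

print(utf7, pct, plus_in_utf7)
```

   The UTF-7 form is no longer than the percent-encoded UTF-8 form, but
   every "+" in it needs a second level of interpretation.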

2602	A.3.  New Encoding Convention

2604	   Instead of using the existing percent-encoding convention of URIs,
2605	   which is based on octets, the idea was to create a new encoding
2606	   convention; for example, to use "%u" to introduce UCS code points.

2608	   Using the existing octet-based percent-encoding mechanism does not
2609	   need an upgrade of the URI syntax and does not need corresponding
2610	   server upgrades.
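
   The difference between the two conventions can be sketched in
   Python.  The function percent_u below is hypothetical and models
   the rejected "%u" idea only for characters in the Basic Multilingual
   Plane; it is not an API of any library.  The sketch suggests why
   the octet-based form needs no upgrade: an existing percent-decoder
   already handles it, while it leaves the "%u" form undecoded.

```python
from urllib.parse import quote, unquote


def percent_u(text: str) -> str:
    # Hypothetical code-point-based convention: "%u" plus four hex digits.
    return "".join(
        ch if ord(ch) < 0x80 else f"%u{ord(ch):04X}" for ch in text
    )


text = "caf\u00e9"

# Existing octet-based convention: UTF-8 octets, each written as %XX.
octet_form = quote(text)

# Rejected alternative: one escape per UCS code point.
codepoint_form = percent_u(text)

# An unmodified decoder recovers the text from the octet-based form only.
recovered_from_octets = unquote(octet_form)
recovered_from_codepoints = unquote(codepoint_form)
```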

2612	A.4.  Indicating Character Encodings in the URI/IRI

2614	   Some proposals suggested indicating the character encodings used in
2615	   an URI or IRI with some new syntactic convention in the URI itself,
2616	   similar to the "charset" parameter for e-mails and Web pages.  As an
2617	   example, the label in square brackets in
2618	   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
2619	   following "&#xE9;" had to be interpreted as iso-8859-1.

2621	   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
2622	   needed.  This avoids potentially multiple labels that have to be
2623	   copied correctly in all cases, even on the side of a bus or on a
2624	   napkin, which would lead to usability problems (and be prohibitively
2625	   annoying).  Exclusively using UTF-8 also reduces transcoding errors
2626	   and confusion.

2628	Authors' Addresses

2630	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
2631	             possible, for example as "D&#252;rst" in XML and HTML.)
2632	   Aoyama Gakuin University
2633	   5-10-1 Fuchinobe
2634	   Sagamihara, Kanagawa  229-8558
2635	   Japan

2637	   Phone: +81 42 759 6329
2638	   Fax:   +81 42 759 6495
2639	   Email: mailto:duerst@it.aoyama.ac.jp
2640	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
2641	          (Note: This is the percent-encoded form of an IRI.)

2643	   Michel Suignard
2644	   Unicode Consortium
2645	   P.O. Box 391476
2646	   Mountain View, CA  94039-1476
2647	   U.S.A.

2649	   Phone: +1-650-693-3921
2650	   Email: mailto:michel@unicode.org
2651	   URI:   http://www.suignard.com

2652	   Larry Masinter
2653	   Adobe
2654	   345 Park Ave
2655	   San Jose, CA  95110
2656	   U.S.A.

2658	   Phone: +1-408-536-3024
2659	   Email: mailto:masinter@adobe.com
2660	   URI:   http://larry.masinter.net