idnits 2.17.1 

draft-ietf-iri-3987bis-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 14, 2011) is 4631 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC3490' is defined on line 1616, but no explicit
     reference was found in the text

  == Unused Reference: 'LEIRI' is defined on line 1683, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2045' is defined on line 1688, but no explicit
     reference was found in the text

  == Unused Reference: 'XMLNamespace' is defined on line 1755, but no
     explicit reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  == Outdated reference: A later version (-03) exists of
     draft-ietf-iri-bidi-guidelines-00

  == Outdated reference: A later version (-02) exists of
     draft-ietf-iri-comparison-00

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  == Outdated reference: A later version (-04) exists of
     draft-ietf-iri-4395bis-irireg-03


     Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource Identifiers                         M. Duerst
3	(iri)                                           Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: 3987 (if approved)                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: February 15, 2012                                         Adobe
8	                                                         August 14, 2011

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-07

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   In addition, this document provides named additional rule sets for
22	   processing otherwise invalid IRIs, in a way that supports other
23	   specifications that wish to mandate common behavior for 'error'
24	   handling.  In particular, rules used in some XML languages (LEIRI)
25	   and web applications are given.

27	   Defining IRI as a new protocol element (rather than updating or
28	   extending the definition of URI) allows independent orderly
29	   transitions: other protocols and languages that use URIs must
30	   explicitly choose to allow IRIs.

32	   Guidelines are provided for the use and deployment of IRIs and
33	   related protocol elements when revising protocols, formats, and
34	   software components that currently deal only with URIs.

36	RFC Editor: Please remove the next paragraph before publication.

38	   This (and several companion documents) are intended to obsolete RFC
39	   3987, and also move towards IETF Draft Standard.  For discussion and
40	   comments on these drafts, please join the IETF IRI WG by subscribing
41	   to the mailing list public-iri@w3.org, archives at
42	   http://lists.w3.org/archives/public/public-iri/.  For a list of open
43	   issues, please see the issue tracker of the WG at
44	   http://trac.tools.ietf.org/wg/iri/trac/report/1.  For a list of
45	   individual edits, please see the change history at
46	   http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.

48	Status of this Memo
49	   This Internet-Draft is submitted in full conformance with the
50	   provisions of BCP 78 and BCP 79.

52	   Internet-Drafts are working documents of the Internet Engineering
53	   Task Force (IETF).  Note that other groups may also distribute
54	   working documents as Internet-Drafts.  The list of current Internet-
55	   Drafts is at http://datatracker.ietf.org/drafts/current/.

57	   Internet-Drafts are draft documents valid for a maximum of six months
58	   and may be updated, replaced, or obsoleted by other documents at any
59	   time.  It is inappropriate to use Internet-Drafts as reference
60	   material or to cite them other than as "work in progress."

62	   This Internet-Draft will expire on February 15, 2012.

64	Copyright Notice

66	   Copyright (c) 2011 IETF Trust and the persons identified as the
67	   document authors.  All rights reserved.

69	   This document is subject to BCP 78 and the IETF Trust's Legal
70	   Provisions Relating to IETF Documents
71	   (http://trustee.ietf.org/license-info) in effect on the date of
72	   publication of this document.  Please review these documents
73	   carefully, as they describe your rights and restrictions with respect
74	   to this document.  Code Components extracted from this document must
75	   include Simplified BSD License text as described in Section 4.e of
76	   the Trust Legal Provisions and are provided without warranty as
77	   described in the Simplified BSD License.

79	   This document may contain material from IETF Documents or IETF
80	   Contributions published or made publicly available before November
81	   10, 2008.  The person(s) controlling the copyright in some of this
82	   material may not have granted the IETF Trust the right to allow
83	   modifications of such material outside the IETF Standards Process.
84	   Without obtaining an adequate license from the person(s) controlling
85	   the copyright in such materials, this document may not be modified
86	   outside the IETF Standards Process, and derivative works of it may
87	   not be created outside the IETF Standards Process, except to format
88	   it for publication as an RFC or to translate it into languages other
89	   than English.

91	Table of Contents

93	   1.  Introduction
94	     1.1.   Overview and Motivation
95	     1.2.   Applicability
96	     1.3.   Definitions
97	     1.4.   Notation
98	   2.  IRI Syntax
99	     2.1.   Summary of IRI Syntax
100	     2.2.   ABNF for IRI References and IRIs
101	   3.  Processing IRIs and related protocol elements
102	     3.1.   Converting to UCS
103	     3.2.   Parse the IRI into IRI components
104	     3.3.   General percent-encoding of IRI components
105	     3.4.   Mapping ireg-name
106	       3.4.1.  Mapping using Percent-Encoding
107	       3.4.2.  Mapping using Punycode
108	       3.4.3.  Additional Considerations
109	     3.5.   Mapping query components
110	     3.6.   Mapping IRIs to URIs
111	     3.7.   Converting URIs to IRIs
112	       3.7.1.  Examples
113	   4.  Use of IRIs
114	     4.1.   Limitations on UCS Characters Allowed in IRIs
115	     4.2.   Software Interfaces and Protocols
116	     4.3.   Format of URIs and IRIs in Documents and Protocols
117	     4.4.   Use of UTF-8 for Encoding Original Characters
118	     4.5.   Relative IRI References
119	   5.  Liberal Handling of Otherwise Invalid IRIs
120	     5.1.   LEIRI Processing
121	   6.  Characters Not Allowed in IRIs
122	   7.  URI/IRI Processing Guidelines (Informative)
123	     7.1.   URI/IRI Software Interfaces
124	     7.2.   URI/IRI Entry
125	     7.3.   URI/IRI Transfer between Applications
126	     7.4.   URI/IRI Generation
127	     7.5.   URI/IRI Selection
128	     7.6.   Display of URIs/IRIs
129	     7.7.   Interpretation of URIs and IRIs
130	     7.8.   Upgrading Strategy
131	   8.  IANA Considerations
132	   9.  Security Considerations
133	   10. Acknowledgements
134	   11. Main Changes Since RFC 3987
135	     11.1.  Split out Bidi, processing guidelines, comparison
136	            sections
137	     11.2.  Major restructuring of IRI processing model
138	       11.2.1. OLD WAY
139	       11.2.2. NEW WAY
140	       11.2.3. Extension of Syntax
141	       11.2.4. More to be added
142	     11.3.  Change Log
143	       11.3.1. Changes after draft-ietf-iri-3987bis-01
144	       11.3.2. Changes from draft-duerst-iri-bis-07 to
145	               draft-ietf-iri-3987bis-00
146	       11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis
147	     11.4.  Changes from -00 to -01
148	     11.5.  Changes from -05 to -06 of draft-duerst-iri-bis-00
149	     11.6.  Changes from -04 to -05 of draft-duerst-iri-bis
150	     11.7.  Changes from -03 to -04 of draft-duerst-iri-bis
151	     11.8.  Changes from -02 to -03 of draft-duerst-iri-bis
152	     11.9.  Changes from -01 to -02 of draft-duerst-iri-bis
153	     11.10. Changes from -00 to -01 of draft-duerst-iri-bis
154	     11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
155	   12. References
156	     12.1.  Normative References
157	     12.2.  Informative References
158	   Authors' Addresses

160	1.  Introduction

162	1.1.  Overview and Motivation

164	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
165	   sequence of characters chosen from a limited subset of the repertoire
166	   of US-ASCII [ASCII] characters.

168	   The characters in URIs are frequently used for representing words of
169	   natural languages.  This usage has many advantages: Such URIs are
170	   easier to memorize, easier to interpret, easier to transcribe, easier
171	   to create, and easier to guess.  For most languages other than
172	   English, however, the natural script uses characters other than A -
173	   Z. For many people, handling Latin characters is as difficult as
174	   handling the characters of other scripts is for those who use only
175	   the Latin alphabet.  Many languages with non-Latin scripts are
176	   transcribed with Latin letters.  These transcriptions are now often
177	   used in URIs, but they introduce additional difficulties.

179	   The infrastructure for the appropriate handling of characters from
180	   additional scripts is now widely deployed in operating system and
181	   application software.  Software that can handle a wide variety of
182	   scripts and languages at the same time is increasingly common.  Also,
183	   an increasing number of protocols and formats can carry a wide range
184	   of characters.

186	   URIs are used both as a protocol element (for transmission and
187	   processing by software) and also a presentation element (for display
188	   and handling by people who read, interpret, coin, or guess them).
189	   The transition between these roles is more difficult and complex when
190	   dealing with a larger set of characters than is allowed for URIs in
191	   [RFC3986].

193	   This document defines the protocol element called Internationalized
194	   Resource Identifier (IRI), which allows applications of URIs to be
195	   extended to use resource identifiers that have a much wider
196	   repertoire of characters.  It also provides corresponding
197	   "internationalized" versions of other constructs from [RFC3986], such
198	   as URI references.  The syntax of IRIs is defined in Section 2.

200	   Using characters outside of A - Z in IRIs adds a number of
201	   difficulties.  Section 4 discusses the use of IRIs in different
202	   situations.  Section 7 gives additional informative guidelines.
203	   Section 9 discusses IRI-specific security considerations.

205	   [Bidi] discusses the special case of bidirectional IRIs using
206	   characters from scripts written right-to-left.  [Equivalence] gives
207	   guidelines for applications wishing to determine if two IRIs are
208	   equivalent, as well as defining some equivalence methods.
209	   [RFC4395bis] updates the URI scheme registration guidelines and
210	   procedures to note that every URI scheme is also automatically an
211	   IRI scheme and to allow scheme definitions to be directly described
212	   in terms of Unicode characters.

214	   When originally defining IRIs, several design alternatives were
215	   considered.  Historically interested readers can find an overview in
216	   Appendix A of [RFC3987].  For some additional background on the
217	   design of URIs and IRIs, please also see [Gettys].

219	1.2.  Applicability

221	   IRIs are designed to allow protocols and software that deal with URIs
222	   to be updated to handle IRIs.  Processing of IRIs is accomplished by
223	   extending the URI syntax while retaining (and not expanding) the set
224	   of "reserved" characters, such that the syntax for any URI scheme may
225	   be extended to allow non-ASCII characters.  In addition, following
226	   parsing of an IRI, it is possible to construct a corresponding URI by
227	   first encoding characters outside of the allowed URI range and then
228	   reassembling the components.

230	   Practical use of IRI forms in place of URI forms depends on the
231	   following conditions being met:

233	   a. A protocol or format element MUST be explicitly designated to be
234	      able to carry IRIs.  The intent is to avoid introducing IRIs into
235	      contexts that are not defined to accept them.  For example, XML
236	      schema [XMLSchema] has an explicit type "anyURI" that includes
237	      IRIs and IRI references.  Therefore, IRIs and IRI references can
238	      be in attributes and elements of type "anyURI".  On the other
239	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
240	      defined as a URI, which means that direct use of IRIs is not
241	      allowed in HTTP requests.

243	   b. The protocol or format carrying the IRIs MUST have a mechanism to
244	      represent the wide range of characters used in IRIs, either
245	      natively or by some protocol- or format-specific escaping
246	      mechanism (for example, numeric character references in [XML1]).

248	   c. The URI scheme definition, if it explicitly allows a percent sign
249	      ("%") in any syntactic component, SHOULD define the interpretation
250	      of sequences of percent-encoded octets (using "%XX" hex octets) as
251	      octets from sequences of UTF-8 encoded strings; this is recommended
252	      in the guidelines for registering new schemes, [RFC4395bis].  For
253	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
254	      [RFC2384], and the URN syntax [RFC2141].  Note that use of
255	      percent-encoding may also be restricted in some situations; for
256	      example, URI schemes that disallow percent-encoding might still be
257	      used with a fragment identifier which is percent-encoded (e.g.,
258	      [XPointer]).  See Section 4.4 for further discussion.
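
As a hedged illustration of condition (c), the following Python sketch
percent-encodes each non-ASCII character as the "%XX" form of its UTF-8
octets and then decodes the result again.  The function name
percent_encode_non_ascii and the sample string are assumptions made for
this example only; this is a simplification, not the complete mapping
procedure of Section 3.

```python
from urllib.parse import unquote


def percent_encode_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with the %XX form of its UTF-8 octets.

    Hypothetical helper, not defined by this document.  US-ASCII characters,
    including any existing "%XX" sequences, are passed through unchanged.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


encoded = percent_encode_non_ascii("r\u00e9sum\u00e9")
# unquote() interprets percent-encoded octets as UTF-8 by default.
decoded = unquote(encoded)
```

Here U+00E9 becomes the two octets C3 A9, so the round trip recovers the
original string only because both sides agree on UTF-8.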

260	1.3.  Definitions

262	   The following definitions are used in this document; they follow the
263	   terms in [RFC2130], [RFC2277], and [ISO10646].

265	   character:  A member of a set of elements used for the organization,
266	      control, or representation of data.  For example, "LATIN CAPITAL
267	      LETTER A" names a character.

269	   octet:  An ordered sequence of eight bits considered as a unit.

271	   character repertoire:  A set of characters (set in the mathematical
272	      sense).

274	   sequence of characters:  A sequence of characters (one after
275	      another).

277	   sequence of octets:  A sequence of octets (one after another).

279	   character encoding:  A method of representing a sequence of
280	      characters as a sequence of octets (maybe with variants).  Also, a
281	      method of (unambiguously) converting a sequence of octets into a
282	      sequence of characters.

284	   charset:  The name of a parameter or attribute used to identify a
285	      character encoding.

287	   UCS:  Universal Character Set. The coded character set defined by
288	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].

290	   IRI reference:  Denotes the common usage of an Internationalized
291	      Resource Identifier.  An IRI reference may be absolute or
292	      relative.  However, the "IRI" that results from such a reference
293	      only includes absolute IRIs; any relative IRI references are
294	      resolved to their absolute form.  Note that in [RFC2396] URIs did
295	      not include fragment identifiers, but in [RFC3986] fragment
296	      identifiers are part of URIs.

298	   LEIRI (Legacy Extended IRI) processing:  This term was used in
299	      various XML specifications to refer to strings that, although not
300	      valid IRIs, were acceptable input to the processing rules in
301	      Section 5.1.

303	   running text:  Human text (paragraphs, sentences, phrases) with
304	      syntax according to orthographic conventions of a natural
305	      language, as opposed to syntax defined for ease of processing by
306	      machines (e.g., markup, programming languages).

308	   protocol element:  Any portion of a message that affects processing
309	      of that message by the protocol in question.

311	   presentation element:  A presentation form corresponding to a
312	      protocol element; for example, using a wider range of characters.

314	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
315	      used for the initial creation.  This may be the initial creation
316	      of a resource with a certain identifier, or the initial exposition
317	      of a resource under a particular identifier.

319	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
320	      used when the identifier is generated by derivation from other
321	      information.

323	   parsed URI component:  When a URI processor parses a URI (following
324	      the generic syntax or a scheme-specific syntax), the result is a
325	      set of parsed URI components, each of which has a type
326	      (corresponding to the syntactic definition) and a sequence of URI
327	      characters.

329	   parsed IRI component:  When an IRI processor parses an IRI directly,
330	      following the general syntax or a scheme-specific syntax, the
331	      result is a set of parsed IRI components, each of which has a type
332	      (corresponding to the syntactic definition) and a sequence of IRI
333	      characters.  (This definition is analogous to "parsed URI
334	      component".)

336	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
337	      scheme's syntax has been extended to allow non-US-ASCII characters
338	      according to the rules in this document.

340	1.4.  Notation

342	   RFCs and Internet Drafts currently do not allow any characters
343	   outside the US-ASCII repertoire.  Therefore, this document uses
344	   various special notations to denote such characters in examples.

346	   In text, characters outside US-ASCII are sometimes referenced by
347	   using a prefix of 'U+', followed by four to six hexadecimal digits.

349	   To represent characters outside US-ASCII in examples, this document
350	   uses 'XML Notation'.

352	   XML Notation uses a leading '&#x', a trailing ';', and the
353	   hexadecimal number of the character in the UCS in between.  For
354	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
355	   notation, an actual '&' is denoted by '&amp;'.
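
The two notations above can be related with a short Python sketch.  The
helper names from_xml_notation and to_u_plus are hypothetical and chosen
only for this illustration; the sketch handles exactly the '&#x...;' and
'&amp;' forms described here and nothing else.

```python
import re

# Matches either a hexadecimal reference (&#x...;) or the escaped ampersand.
_XML_NOTATION = re.compile(r"&(?:#x([0-9A-Fa-f]+)|amp);")


def from_xml_notation(text: str) -> str:
    """Convert '&#xHHHH;' and '&amp;' in an example string to characters."""
    def _replace(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return "&"  # the match was '&amp;'
        # chr() raises ValueError for numbers above 0x10FFFF.
        return chr(int(hex_digits, 16))

    return _XML_NOTATION.sub(_replace, text)


def to_u_plus(ch: str) -> str:
    """Render one character in 'U+' notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


ya = from_xml_notation("&#x44F;")
label = to_u_plus(ya)
```

Because the substitution is done in a single pass, '&amp;#x44F;' yields
the literal text '&#x44F;' rather than a decoded character.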

357	   To denote actual octets in examples (as opposed to percent-encoded
358	   octets), the two hex digits denoting the octet are enclosed in "<"
359	   and ">".  For example, the octet often denoted as 0xc9 is denoted
360	   here as <c9>.

362	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
363	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
364	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

366	2.  IRI Syntax

368	   This section defines the syntax of Internationalized Resource
369	   Identifiers (IRIs).

371	   As with URIs, an IRI is defined as a sequence of characters, not as a
372	   sequence of octets.  This definition accommodates the fact that IRIs
373	   may be written on paper or read over the radio as well as stored or
374	   transmitted digitally.  The same IRI might be represented as
375	   different sequences of octets in different protocols or documents if
376	   these protocols or documents use different character encodings
377	   (and/or transfer encodings).  Using the same character encoding as
378	   the containing protocol or document ensures that the characters in
379	   the IRI can be handled (e.g., searched, converted, displayed) in the
380	   same way as the rest of the protocol or document.
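
To make the distinction between characters and octets concrete, the
following Python sketch encodes one IRI in two character encodings and
prints the resulting octets using the "<xx>" notation of Section 1.4.
The helper octet_notation and the sample IRI are assumptions introduced
for this example.

```python
def octet_notation(data: bytes) -> str:
    """Show octets >= 0x80 as <xx>; keep US-ASCII octets as characters."""
    return "".join(chr(b) if b < 0x80 else f"<{b:02x}>" for b in data)


# One IRI, i.e. one sequence of characters (U+00E9 appears twice).
iri = "http://example.org/r\u00e9sum\u00e9"

# The same characters give different octet sequences per encoding.
as_utf8 = octet_notation(iri.encode("utf-8"))
as_latin1 = octet_notation(iri.encode("iso-8859-1"))
```

The IRI is the same in both cases; only its representation as octets
differs.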

382	2.1.  Summary of IRI Syntax

384	   The IRI syntax extends the URI syntax in [RFC3986] by extending the
385	   class of unreserved characters, primarily by adding the characters of
386	   the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
387	   to the limitations given in the syntax rules below and in
388	   Section 4.1.

390	   The syntax and use of components and reserved characters is the same
391	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
392	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
393	   scheme are extended to allow parsing of IRIs using the same
394	   parsing rules.

396	   All the operations defined in [RFC3986], such as the resolution of
397	   relative references, can be applied to IRIs by IRI-processing
398	   software in exactly the same way as they are for URIs by URI-
399	   processing software.
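
Because the reserved characters are unchanged, the non-validating
regular expression given in Appendix B of [RFC3986] for splitting a URI
reference into components can be applied to an IRI reference as it
stands.  The Python sketch below assumes this; the function name
split_iri_reference and the sample IRIs are illustrative, and the code
does not check the input against the ABNF in Section 2.2.

```python
import re

# Regular expression from Appendix B of RFC 3986 (non-validating).
_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_iri_reference(reference: str) -> dict:
    """Split an IRI reference into its five generic components.

    Components that are absent are returned as None; the path is always
    a string (possibly empty).
    """
    match = _SPLIT.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


absolute = split_iri_reference("http://example.org/r\u00e9sum\u00e9?q=1#top")
relative = split_iri_reference("../r\u00e9sum\u00e9")
```

Non-ASCII characters simply end up inside the component that contains
them, since none of them acts as a delimiter.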

401	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
402	   therefore MUST NOT be used for syntactical purposes, such as to
403	   delimit components in newly defined schemes.  For example, U+00A2,
404	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
405	   the 'iunreserved' category.  This is similar to the fact that it is
406	   not possible to use '-' as a delimiter in URIs, because it is in the
407	   'unreserved' category.

409	2.2.  ABNF for IRI References and IRIs

411	   An ABNF definition for IRI references (which are the most general
412	   concept and the start of the grammar) and IRIs is given here.  The
413	   syntax of this ABNF is described in [STD68].  Character numbers are
414	   taken from the UCS, without implying any actual binary encoding.
415	   Terminals in the ABNF are characters, not octets.

417	   The following grammar closely follows the URI grammar in [RFC3986],
418	   except that the range of unreserved characters is expanded to include
419	   UCS characters, with the restriction that private UCS characters can
420	   occur only in query parts.  The grammar is split into two parts:
421	   Rules that differ from [RFC3986] because of the above-mentioned
422	   expansion, and rules that are the same as those in [RFC3986].  For
423	   rules that are different from those in [RFC3986], the names of the
424	   non-terminals have been changed as follows.  If the non-terminal
425	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
426	   has been prefixed.  The rule <pct-form> has been introduced in order
427	   to be able to reference it from other parts of the document.

429	   The following rules are different from those in [RFC3986]:

431	   IRI            = scheme ":" ihier-part [ "?" iquery ]
432	                    [ "#" ifragment ]

434	   ihier-part     = "//" iauthority ipath-abempty
435	                  / ipath-absolute
436	                  / ipath-rootless
437	                  / ipath-empty

439	   IRI-reference  = IRI / irelative-ref

441	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

443	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

445	   irelative-part = "//" iauthority ipath-abempty
446	                  / ipath-absolute
447	                  / ipath-noscheme
448	                  / ipath-empty

450	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
451	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
452	   ihost          = IP-literal / IPv4address / ireg-name

454	   pct-form       = pct-encoded

456	   ireg-name      = *( iunreserved / sub-delims )

458	   ipath          = ipath-abempty   ; begins with "/" or is empty
459	                  / ipath-absolute  ; begins with "/" but not "//"
460	                  / ipath-noscheme  ; begins with a non-colon segment
461	                  / ipath-rootless  ; begins with a segment
462	                  / ipath-empty     ; zero characters

464	   ipath-abempty  = *( path-sep isegment )
465	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
466	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
467	   ipath-rootless = isegment-nz *( path-sep isegment )
468	   ipath-empty    = 0<ipchar>
469	   path-sep       = "/"

471	   isegment       = *ipchar
472	   isegment-nz    = 1*ipchar
473	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
474	                        / "@" )
475	                  ; non-zero-length segment without any colon ":"

477	   ipchar         = iunreserved / pct-form / sub-delims / ":"
478	                  / "@"

480	   iquery         = *( ipchar / iprivate / "/" / "?" )

482	   ifragment      = *( ipchar / "/" / "?" )

484	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

486	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
487	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
488	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
489	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
490	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
491	                  / %xD0000-DFFFD / %xE1000-EFFFD

493	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
494	                  / %x100000-10FFFD

496	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
497	   "greedy") algorithm applies.  For details, see [RFC3986].
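
The <ucschar> and <iprivate> rules above can be sketched as range tables
in Python.  The names UCSCHAR_RANGES, IPRIVATE_RANGES, is_ucschar, and
is_iprivate are assumptions made for this illustration; the ranges
themselves are copied from the grammar above.

```python
# Ranges from the <ucschar> rule; planes 1 through 13 (0x1-0xD) all
# follow the pattern %xN0000-NFFFD.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Ranges from the <iprivate> rule.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF),
    (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def _in_ranges(code_point: int, ranges) -> bool:
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    """True if the single character ch matches the <ucschar> rule."""
    return _in_ranges(ord(ch), UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    """True if the single character ch matches the <iprivate> rule."""
    return _in_ranges(ord(ch), IPRIVATE_RANGES)
```

For example, U+00A2 (CENT SIGN) matches <ucschar>, whereas U+E000
matches only <iprivate> and is therefore permitted only in <iquery>.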

499	   The following rules are the same as those in [RFC3986]:

501	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

503	   port           = *DIGIT

505	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

507	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

509	   IPv6address    =                            6( h16 ":" ) ls32
510	                  /                       "::" 5( h16 ":" ) ls32
511	                  / [               h16 ] "::" 4( h16 ":" ) ls32
512	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
513	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
514	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
515	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
516	                  / [ *5( h16 ":" ) h16 ] "::"              h16
517	                  / [ *6( h16 ":" ) h16 ] "::"

519	   h16            = 1*4HEXDIG
520	   ls32           = ( h16 ":" h16 ) / IPv4address

522	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

524	   dec-octet      = DIGIT                 ; 0-9
525	                  / %x31-39 DIGIT         ; 10-99
526	                  / "1" 2DIGIT            ; 100-199
527	                  / "2" %x30-34 DIGIT     ; 200-249
528	                  / "25" %x30-35          ; 250-255

530	   pct-encoded    = "%" HEXDIG HEXDIG

532	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
533	   reserved       = gen-delims / sub-delims
534	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
535	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
536	                  / "*" / "+" / "," / ";" / "="

538	   This syntax does not support IPv6 scoped addressing zone identifiers.
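
   As an illustration only, the dec-octet and IPv4address rules above
   can be transcribed into a regular expression.  The sketch below is
   not part of this specification; the names DEC_OCTET, IPV4_ADDRESS,
   and is_ipv4address are hypothetical.

```python
import re

# dec-octet from the ABNF above: 0-9, 10-99, 100-199, 200-249, 250-255.
# Leading zeros (e.g. "01") are not matched, just as in the ABNF.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4address(text):
    """Return True if the whole string matches the IPv4address rule."""
    return IPV4_ADDRESS.fullmatch(text) is not None
```

   Under this sketch, "192.0.2.1" is accepted, while "256.0.2.1" and
   "192.0.2.01" are rejected.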

540	3.  Processing IRIs and related protocol elements

542	   IRIs are meant to replace URIs in identifying resources within new
543	   versions of protocols, formats, and software components that use a
544	   UCS-based character repertoire.  Protocols and components may use and
545	   process IRIs directly.  However, there are still numerous systems and
546	   protocols which only accept URIs or components of parsed URIs; that
547	   is, they only accept sequences of characters within the subset of US-
548	   ASCII characters allowed in URIs.

550	   This section defines specific processing steps for IRI consumers
551	   which establish the relationship between the string given and the
552	   interpreted derivatives.  These processing steps apply to both IRIs
553	   and IRI references (i.e., absolute or relative forms); for IRIs, some
554	   steps are scheme specific.

556	3.1.  Converting to UCS

558	   Input that is already in a Unicode form (i.e., a sequence of Unicode
559	   characters or an octet-stream representing a Unicode-based character
560	   encoding such as UTF-8 or UTF-16) should be left as is and not
561	   normalized or changed.

563	   An IRI or IRI reference is a sequence of characters from the UCS.
564	   For resource identifiers that are not already in a Unicode form (as
565	   when written on paper, read aloud, or represented in a text stream
566	   using a legacy character encoding), convert the IRI to Unicode.  Note
567	   that some character encodings or transcriptions can be converted to
568	   or represented by more than one sequence of Unicode characters.
569	   Ideally the resulting IRI would use a normalized form, such as
570	   Unicode Normalization Form C [UTR15], since that ensures a stable,
571	   consistent representation that is most likely to produce the intended
572	   results.  Implementers and users are cautioned that, while
573	   denormalized character sequences are valid, they might be difficult
574	   for other users or processes to reproduce and might lead to
575	   unexpected results.

577	   In other cases (written on paper, read aloud, or otherwise
578	   represented independent of any character encoding) represent the IRI
579	   as a sequence of characters from the UCS normalized according to
580	   Unicode Normalization Form C (NFC, [UTR15]).
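
   A minimal sketch of the conversion from a legacy character encoding,
   assuming the octets and the name of their encoding are known.  The
   function name legacy_to_ucs is hypothetical.  Input that is already
   in a Unicode form would not be passed through this function, since
   such input is left as is.

```python
import unicodedata


def legacy_to_ucs(octets, legacy_encoding):
    """Decode octets from a legacy encoding and normalize to NFC."""
    text = octets.decode(legacy_encoding)
    # NFC composes base letters and combining marks where possible,
    # e.g. "e" + U+0301 becomes U+00E9.
    return unicodedata.normalize("NFC", text)
```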

582	3.2.  Parse the IRI into IRI components

584	   Parse the IRI, either as a relative reference (no scheme) or using
585	   scheme specific processing (according to the scheme given); the
586	   result is a set of parsed IRI components.

588	3.3.  General percent-encoding of IRI components

590	   Except as noted in the following subsections, IRI components are
591	   mapped to the equivalent URI components by percent-encoding those
592	   characters not allowed in URIs.  Previous processing steps will have
593	   removed some characters, and the interpretation of reserved
594	   characters will have already been done (with the syntactic reserved
595	   characters outside of the IRI component).  This mapping is defined
596	   for all sequences of Unicode characters, whether or not they are
597	   valid for the component in question.

599	   For each character which is not allowed anywhere in a valid URI apply
600	   the following steps.

602	   Convert to UTF-8  Convert the character to a sequence of one or more
603	      octets using UTF-8 [RFC3629].

605	   Percent encode  Convert each octet of this sequence to %HH, where HH
606	      is the hexadecimal notation of the octet value.  The hexadecimal
607	      notation SHOULD use uppercase letters.  (This is the general URI
608	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

610	   Note that the mapping is an identity transformation for parsed URI
611	   components of valid URIs, and is idempotent: applying the mapping a
612	   second time will not change anything.
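
   The two steps above can be sketched as follows.  This is an
   illustration under stated assumptions, not a normative algorithm:
   the set URI_CHARS and the function percent_encode_component are
   hypothetical names, and "%" is kept as is so that existing
   percent-encodings are not encoded again.

```python
import string

# Characters that may appear somewhere in a valid URI.
URI_CHARS = set(
    string.ascii_letters + string.digits
    + "-._~"           # rest of unreserved
    + ":/?#[]@"        # gen-delims
    + "!$&'()*+,;="    # sub-delims
    + "%"              # introduces an existing percent-encoding
)


def percent_encode_component(component):
    """Percent-encode every character not allowed anywhere in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)
```

   Because "%" and all other URI characters are left untouched, applying
   the function to its own output returns that output unchanged.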

614	3.4.  Mapping ireg-name

616	3.4.1.  Mapping using Percent-Encoding

618	   The ireg-name component SHOULD be converted according to the general
619	   procedure for percent-encoding of IRI components described in
620	   Section 3.3.

622	   For example, the IRI
623	   "http://r&#xE9;sum&#xE9;.example.org"
624	   will be converted to
625	   "http://r%C3%A9sum%C3%A9.example.org".

627	   This conversion for ireg-name is in line with Section 3.2.2 of
628	   [RFC3986], which does not mandate a particular registered name lookup
629	   technology.  For further background, see [RFC6055] and [Gettys].

631	3.4.2.  Mapping using Punycode

633	   The ireg-name component MAY also be converted as follows:

635	   Replace the ireg-name part of the IRI by the part converted using the
636	   Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
637	   on each dot-separated label, and by using U+002E (FULL STOP) as a
638	   label separator.  This procedure may fail, which means that the IRI
639	   cannot be resolved.  If the domain name conversion fails, then the
640	   entire IRI conversion fails.  Processors
641	   that have no mechanism for signalling a failure MAY instead
642	   substitute an otherwise invalid host name, although such processing
643	   SHOULD be avoided.

645	   For example, the IRI
646	   "http://r&#xE9;sum&#xE9;.example.org"
647	   MAY be converted to
648	   "http://xn--rsum-bpad.example.org".

651	   This conversion for ireg-name will be better able to deal with legacy
652	   infrastructure that cannot handle percent-encoding in domain names.
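
   A rough sketch of the label-by-label conversion, using the "punycode"
   codec from the Python standard library.  This covers only the raw
   Punycode encoding and the "xn--" prefix; it omits the validation and
   mapping steps of the Domain Name Lookup procedure, so it is an
   approximation.  The function name ireg_name_to_punycode is
   hypothetical.

```python
def ireg_name_to_punycode(ireg_name):
    """Encode each non-ASCII dot-separated label as an "xn--" label."""
    labels = []
    for label in ireg_name.split("."):
        if label.isascii():
            # ASCII-only labels are passed through unchanged.
            labels.append(label)
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    # U+002E (FULL STOP) is used as the label separator.
    return ".".join(labels)
```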

654	3.4.3.  Additional Considerations

656	   Note:  Domain Names may appear in parts of an IRI other than the
657	      ireg-name part.  It is the responsibility of scheme-specific
658	      implementations (if the Internationalized Domain Name is part of
659	      the scheme syntax) or of server-side implementations (if the
660	      Internationalized Domain Name is part of 'iquery') to apply the
661	      necessary conversions at the appropriate point.  Example: Trying
662	      to validate the Web page at
663	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
664	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
665	      example.org, which would convert to a URI of
666	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
667	      example.org.  The server-side implementation is responsible for
668	      making the necessary conversions to be able to retrieve the Web
669	      page.

671	   Note:  In this process, characters allowed in URI references and
672	      existing percent-encoded sequences are not encoded further.  (This
673	      mapping is similar to, but different from, the encoding applied
674	      when arbitrary content is included in some part of a URI.)  For
675	      example, an IRI of
676	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
677	      converted to
678	      "http://www.example.org/red%09ros%C3%A9#red", not to something
679	      like
680	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

682	3.5.  Mapping query components

684	   For compatibility with existing deployed HTTP infrastructure, the
685	   following special case applies for schemes "http" and "https" and
686	   IRIs whose origin has a document charset other than one which is UCS-
687	   based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
688	   of an IRI is mapped into a URI by using the document charset rather
689	   than UTF-8 as the binary representation before pct-encoding.  This
690	   mapping is not applied for any other scheme or component.
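
   The special case can be sketched as below, assuming the caller
   already knows the document charset.  The names URI_CHARS and
   map_query are hypothetical, and the behavior for characters that the
   document charset cannot represent is not covered by this sketch (the
   encode call would raise an error).

```python
import string

# Characters that may appear somewhere in a valid URI.
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)


def map_query(iquery, document_charset="utf-8"):
    """Percent-encode a query using the document charset for octets."""
    out = []
    for ch in iquery:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # The document charset, not UTF-8, supplies the octets.
            octets = ch.encode(document_charset)
            out.extend("%%%02X" % octet for octet in octets)
    return "".join(out)
```

   With a document charset of iso-8859-1, each e-acute in a query is
   mapped to "%E9"; with UTF-8, it is mapped to "%C3%A9".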

692	3.6.  Mapping IRIs to URIs

694	   The mapping from an IRI to URI is accomplished by applying the
695	   mapping above (from IRI to URI components) and then reassembling a
696	   URI from the parsed URI components using the original punctuation
697	   that delimited the IRI components.

699	3.7.  Converting URIs to IRIs

701	   In some situations, for presentation and further processing, it is
702	   desirable to convert a URI into an equivalent IRI in which natural
703	   characters are represented directly rather than percent-encoded.  Of
704	   course, every URI is already an IRI in its own right without any
705	   conversion.  This section gives one possible procedure for this
706	   conversion.

708	   The conversion described in this section, if given a valid URI, will
709	   result in an IRI that maps back to the URI used as an input for the
710	   conversion (except for potential case differences in percent-encoding
711	   and for potential percent-encoded unreserved characters).  However,
712	   the IRI resulting from this conversion may differ from the original
713	   IRI (if there ever was one).

715	   URI-to-IRI conversion removes percent-encodings, but not all percent-
716	   encodings can be eliminated.  There are several reasons for this:

718	   1. Some percent-encodings are necessary to distinguish percent-
719	      encoded and unencoded uses of reserved characters.

721	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
722	      octets.

724	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
725	      there is a very high probability, but no guarantee, that percent-
726	      encodings that can be interpreted as sequences of UTF-8 octets
727	      actually originated from UTF-8.  For a detailed discussion, see
728	      [Duerst97].)

730	   3. The conversion may result in a character that is not appropriate
731	      in an IRI.  See Section 2.2, and Section 4.1 for further details.

733	   4. IRI to URI conversion has different rules for dealing with domain
734	      names and query parameters.

736	   Conversion from a URI to an IRI MAY be done by using the following
737	   steps:

739	   1. Represent the URI as a sequence of octets in US-ASCII.

741	   2. Convert all percent-encodings ("%" followed by two hexadecimal
742	      digits) to the corresponding octets, except those corresponding to
743	      "%", characters in "reserved", and characters in US-ASCII not
744	      allowed in URIs.

746	   3. Re-percent-encode any octet produced in step 2 that is not part of
747	      a strictly legal UTF-8 octet sequence.

749	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
750	      represent characters that are not appropriate according to
751	      Section 2.2 and Section 4.1.

753	   5. Interpret the resulting octet sequence as a sequence of characters
754	      encoded in UTF-8.

756	   6. URIs known to contain domain names in the reg-name component
757	      SHOULD convert punycode-encoded domain name labels to the
758	      corresponding characters using the ToUnicode procedure.

760	   This procedure will convert as many percent-encoded characters as
761	   possible to characters in an IRI.  Because there are some choices
762	   when step 4 is applied (see Section 4.1), results may vary.

764	   Conversions from URIs to IRIs MUST NOT use any character encoding
765	   other than UTF-8 in steps 3 and 4, even if it might be possible to
766	   guess from the context that another character encoding than UTF-8 was
767	   used in the URI.  For example, the URI
768	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
769	   interpreted to contain two e-acute characters encoded as iso-8859-1.
770	   It must not be converted to an IRI containing these e-acute
771	   characters.  Otherwise, in the future the IRI will be mapped to
772	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
773	   URI from "http://www.example.org/r%E9sum%E9.html".
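
   Steps 1 through 5 above can be sketched as follows.  This is an
   illustration under stated assumptions, not a normative algorithm.
   Step 6 (punycode labels) is omitted.  The test in is_appropriate is a
   simplification: it accepts characters in the ucschar ranges of
   Section 2.2 and rejects the bidi formatting characters; an actual
   implementation may make other choices in step 4.  All names are
   hypothetical.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# ucschar ranges copied from the ABNF in Section 2.2
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def is_appropriate(ch):
    """Simplified step 4 test: in ucschar and not bidi formatting."""
    cp = ord(ch)
    if cp in BIDI_FORMATTING:
        return False
    return any(lo <= cp <= hi for lo, hi in UCSCHAR)


def _sequence_length(lead):
    """Length of a UTF-8 sequence from its lead octet, 0 if not a lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _convert_run(match):
    tokens = re.findall(r"%[0-9A-Fa-f]{2}", match.group(0))
    octets = [int(token[1:], 16) for token in tokens]
    out, i = [], 0
    while i < len(octets):
        b = octets[i]
        if b < 0x80:
            # Step 2: only unreserved US-ASCII characters are decoded.
            out.append(chr(b) if chr(b) in UNRESERVED else tokens[i])
            i += 1
            continue
        n = _sequence_length(b)
        chunk = bytes(octets[i:i + n])
        try:
            ch = chunk.decode("utf-8") if n else None
        except UnicodeDecodeError:
            ch = None
        if ch is None:
            # Step 3: not part of a strictly legal UTF-8 sequence.
            out.append("%%%02X" % b)
            i += 1
        elif is_appropriate(ch):
            # Step 5: keep the decoded character.
            out.append(ch)
            i += n
        else:
            # Step 4: legal UTF-8, but not appropriate in an IRI.
            out.extend("%%%02X" % octet for octet in chunk)
            i += n
    return "".join(out)


def uri_to_iri(uri):
    """Convert runs of percent-encodings in a URI where possible."""
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", _convert_run, uri)
```

   Applied to the URIs used in Section 3.7.1, this sketch is intended to
   reproduce the results shown there for step 5.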

775	3.7.1.  Examples

777	   This section shows various examples of converting URIs to IRIs.  Each
778	   example shows the result after each of the steps 1 through 6 is
779	   applied.  XML Notation is used for the final result.  Octets are
780	   denoted by "<" followed by two hexadecimal digits followed by ">".

782	   The following example contains the sequence "%C3%BC", which is a
783	   strictly legal UTF-8 sequence, and which is converted into the actual
784	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
785	   u-umlaut).

787	   1. http://www.example.org/D%C3%BCrst

789	   2. http://www.example.org/D<c3><bc>rst

791	   3. http://www.example.org/D<c3><bc>rst

793	   4. http://www.example.org/D<c3><bc>rst

795	   5. http://www.example.org/D&#xFC;rst

797	   6. http://www.example.org/D&#xFC;rst

799	   The following example contains the sequence "%FC", which might
800	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
801	   iso-8859-1 character encoding.  (It might represent other characters
802	   in other character encodings.  For example, the octet <fc> in iso-
803	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
804	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
805	   encoded in step 3.

807	   1. http://www.example.org/D%FCrst

809	   2. http://www.example.org/D<fc>rst

811	   3. http://www.example.org/D%FCrst

813	   4. http://www.example.org/D%FCrst

815	   5. http://www.example.org/D%FCrst

817	   6. http://www.example.org/D%FCrst

819	   The following example contains "%e2%80%ae", which is the percent-
820	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
821	   The direct use of this character is forbidden in an IRI.  Therefore,
822	   the corresponding octets are re-percent-encoded in step 4.  This
823	   example shows that the case (upper- or lowercase) of letters used in
824	   percent-encodings may not be preserved.  The example also contains a
825	   punycode-encoded domain name label (xn--99zt52a), which is converted
826	   to the corresponding characters in step 6.

829	   1. http://xn--99zt52a.example.org/%e2%80%ae

831	   2. http://xn--99zt52a.example.org/<e2><80><ae>

833	   3. http://xn--99zt52a.example.org/<e2><80><ae>

835	   4. http://xn--99zt52a.example.org/%E2%80%AE

837	   5. http://xn--99zt52a.example.org/%E2%80%AE

839	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

841	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
842	   (Japanese Natto).

845	4.  Use of IRIs

847	4.1.  Limitations on UCS Characters Allowed in IRIs

849	   This section discusses limitations on characters and character
850	   sequences usable for IRIs beyond those given in Section 2.2.  The
851	   considerations in this section are relevant when IRIs are created and
852	   when URIs are converted to IRIs.

854	   a. The repertoire of characters allowed in each IRI component is
855	      limited by the definition of that component.  For example, the
856	      definition of the scheme component does not allow characters
857	      beyond US-ASCII.

859	      (Note: In accordance with URI practice, generic IRI software
860	      cannot and should not check for such limitations.)

862	   b. The UCS contains many areas of characters for which there are
863	      strong visual look-alikes.  Because of the likelihood of
864	      transcription errors, these also should be avoided.  This includes
865	      the full-width equivalents of Latin characters, half-width
866	      Katakana characters for Japanese, and many others.  It also
867	      includes many look-alikes of "space", "delims", and "unwise",
868	      characters excluded in [RFC3491].

870	   Additional information is available from [UNIXML].  [UNIXML] is
871	   written in the context of running text rather than in that of
872	   identifiers.  Nevertheless, it discusses many of the categories of
873	   characters not appropriate for IRIs.

875	4.2.  Software Interfaces and Protocols

877	   Although an IRI is defined as a sequence of characters, software
878	   interfaces for URIs typically function on sequences of octets or
879	   other kinds of code units.  Thus, software interfaces and protocols
880	   MUST define which character encoding is used.

882	   Intermediate software interfaces between IRI-capable components and
883	   URI-only components MUST map the IRIs per Section 3.6, when
884	   transferring from IRI-capable to URI-only components.  This mapping
885	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
886	   between components that are known to be able to handle IRIs.

888	4.3.  Format of URIs and IRIs in Documents and Protocols

890	   Document formats that transport URIs may have to be upgraded to allow
891	   the transport of IRIs.  In cases where the document as a whole has a
892	   native character encoding, IRIs MUST also be encoded in this
893	   character encoding and converted accordingly by a parser or
894	   interpreter.  IRI characters not expressible in the native character
895	   encoding SHOULD be escaped by using the escaping conventions of the
896	   document format if such conventions are available.  Alternatively,
897	   they MAY be percent-encoded according to Section 3.6.  For example,
898	   in HTML or XML, numeric character references SHOULD be used.  If a
899	   document as a whole has a native character encoding and that
900	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
901	   the document in the UTF-8 character encoding.

903	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
904	   although they use different terminology.  HTML 4.0 [HTML4] defines
905	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
906	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
907	   based upon them allow IRIs.  Also, it is expected that all relevant
908	   new W3C formats and protocols will be required to handle IRIs
909	   [CharMod].

911	4.4.  Use of UTF-8 for Encoding Original Characters

913	   This section discusses details and gives examples for point c) in
914	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
915	   IRI in question has to encode original characters into octets by
916	   using UTF-8.  This can be specified for all URIs of a URI scheme or
917	   can apply to individual URIs for schemes that do not specify how to
918	   encode original characters.  It can apply to the whole URI, or only
919	   to some part.  For background information on encoding characters into
920	   URIs, see also Section 2.5 of [RFC3986].

922	   For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
923	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
924	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
925	   because the HTTP URI scheme does not specify how to encode original
926	   characters, only some HTTP URLs can have corresponding but different
927	   IRIs.

929	   For example, for a document with a URI of
930	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
931	   construct a corresponding IRI (in XML notation, see Section 1.4):
932	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
933	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
934	   encoded representation of that character).  On the other hand, for a
935	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
936	   percent-encoding octets cannot be converted to actual characters in
937	   an IRI, as the percent-encoding is not based on UTF-8.

939	   For most URI schemes, there is no need to upgrade their scheme
940	   definition in order for them to work with IRIs.  The main case where
941	   upgrading makes sense is when a scheme definition, or a particular
942	   component of a scheme, is strictly limited to the use of US-ASCII
943	   characters with no provision to include non-ASCII characters/octets
944	   via percent-encoding, or if a scheme definition currently uses highly
945	   scheme-specific provisions for the encoding of non-ASCII characters.
946	   An example of this is the mailto: scheme [RFC2368].

948	   This specification updates the IANA registry of URI schemes to note
949	   their applicability to IRIs, see Section 8.  All IRIs use URI
950	   schemes, and all URIs with URI schemes can be used as IRIs, even
951	   though in some cases only by using URIs directly as IRIs, without any
952	   conversion.

954	   Scheme definitions can impose restrictions on the syntax of scheme-
955	   specific URIs; i.e., URIs that are admissible under the generic URI
956	   syntax [RFC3986] may not be admissible due to narrower syntactic
957	   constraints imposed by a URI scheme specification.  URI scheme
958	   definitions cannot broaden the syntactic restrictions of the generic
959	   URI syntax; otherwise, it would be possible to generate URIs that
960	   satisfied the scheme-specific syntactic constraints without
961	   satisfying the syntactic constraints of the generic URI syntax.
962	   However, additional syntactic constraints imposed by URI scheme
963	   specifications are applicable to IRIs, as the corresponding URI
964	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
965	   under the syntactic restrictions of generic URI syntax and any
966	   narrower restrictions imposed by the corresponding URI scheme
967	   specification.

969	   The requirement for the use of UTF-8 generally applies to all parts
970	   of a URI.  However, it is possible that the capability of IRIs to
971	   represent a wide range of characters directly is used just in some
972	   parts of the IRI (or IRI reference).  The other parts of the IRI may
973	   only contain US-ASCII characters, or they may not be based on UTF-8.
974	   They may be based on another character encoding, or they may directly
975	   encode raw binary data (see also [RFC2397]).

977	   For example, it is possible to have a URI reference of
978	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
979	   document name is encoded in iso-8859-1 based on server settings, but
980	   where the fragment identifier is encoded in UTF-8 according to
981	   [XPointer].  The IRI corresponding to the above URI would be (in XML
982	   notation)
983	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

985	   Similar considerations apply to query parts.  The functionality of
986	   IRIs (namely, to be able to include non-ASCII characters) can only be
987	   used if the query part is encoded in UTF-8.

989	4.5.  Relative IRI References

991	   Processing of relative IRI references against a base is handled
992	   straightforwardly; the algorithms of [RFC3986] can be applied
993	   directly, treating the characters additionally allowed in IRI
994	   references in the same way that unreserved characters are in URI
995	   references.
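
   As a brief illustration, the urljoin function of the Python standard
   library, which resolves references following the algorithm of
   [RFC3986], can be applied to strings that contain non-ASCII
   characters.  The base and reference values below are made up for
   this example.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative IRI reference.
base = "http://www.example.org/docs/index.html"
reference = "../r\u00e9sum\u00e9.html"

# The non-ASCII characters pass through resolution untouched.
resolved = urljoin(base, reference)
```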

997	5.  Liberal Handling of Otherwise Invalid IRIs

999	   Some technical specifications and widely-deployed software have
1000	   allowed additional variations and extensions of IRIs to be used in
1001	   syntactic components.

1003	   Future technical specifications SHOULD NOT allow conforming producers
1004	   to produce, or conforming content to contain, such forms, as they are
1005	   not interoperable with other IRI consuming software.

1007	5.1.  LEIRI Processing

1009	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1010	   Legacy Extended IRIs is the same as that for <IRI-reference>, except
1011	   that the ucschar production is replaced by the leiri-ucschar
1012	   production:

1014	     leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1015	                      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1016	                      / %xE000-FFFD / %x10000-10FFFF

1018	   Among other extensions, processors based on this specification also
1019	   did not enforce the restriction on bidirectional formatting
1020	   characters in [Bidi], and the iprivate production becomes redundant.

1022	   To convert a string allowed as a LEIRI to an IRI, each character
1023	   allowed in leiri-ucschar but not in ucschar must be percent-encoded
1024	   using Section 3.3.
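
   A minimal sketch of this conversion, following the sentence above
   literally: every character in leiri-ucschar that is not in ucschar is
   percent-encoded via UTF-8.  The names used are hypothetical, and the
   sketch makes no exception for iprivate characters in query
   components.

```python
# ucschar ranges copied from the ABNF in Section 2.2
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)
# leiri-ucschar: the listed ASCII characters plus these ranges
LEIRI_ASCII = set(' <>"{}|\\^`')
LEIRI_RANGES = [(0x0, 0x1F), (0x7F, 0xD7FF), (0xE000, 0xFFFD),
                (0x10000, 0x10FFFF)]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def leiri_to_iri(leiri):
    """Percent-encode characters in leiri-ucschar but not in ucschar."""
    out = []
    for ch in leiri:
        cp = ord(ch)
        in_leiri = ch in LEIRI_ASCII or _in_ranges(cp, LEIRI_RANGES)
        if in_leiri and not _in_ranges(cp, UCSCHAR):
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```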

1026	6.  Characters Not Allowed in IRIs

1028	   This section provides a list of the groups of characters and code
1029	   points that are allowed in some contexts but are not allowed in IRIs
1030	   or are allowed in IRIs only in the query part.  For each group of
1031	   characters, advice on the usage of these characters is also given,
1032	   concentrating on the reasons for why they are excluded from IRI use.

1034	      Space (U+0020): Some formats and applications use space as a
1035	      delimiter, e.g. for items in a list.  Appendix C of [RFC3986] also
1036	      mentions that white space may have to be added when displaying or
1037	      printing long URIs; the same applies to long IRIs.  This means
1038	      that spaces can disappear, or can cause what is intended as a
1039	      single IRI or IRI reference to be treated as two or more separate
1040	      IRIs.

1042	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1043	      C of [RFC3986] suggests the use of double-quotes
1044	      ("http://example.com/") and angle brackets (<http://example.com/>)
1045	      as delimiters for URIs in plain text.  These conventions are often
1046	      used, and also apply to IRIs.  Using these characters in strings
1047	      intended to be IRIs would result in the IRIs being cut off at the
1048	      wrong place.

1050	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1051	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters were
1052	      originally excluded from URIs because the respective
1053	      codepoints are assigned to different graphic characters in some
1054	      7-bit or 8-bit encodings.  Despite the move to Unicode, some of
1055	      these characters are still occasionally displayed differently on
1056	      some systems, e.g.  U+005C may appear as a Japanese Yen symbol on
1057	      some systems.  Also, the fact that these characters are not used
1058	      in URIs or IRIs has encouraged their use outside URIs or IRIs in
1059	      contexts that may include URIs or IRIs.  If a string with such a
1060	      character were used as an IRI in such a context, it would likely
1061	      be interpreted piecemeal.

1063	      The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F, #x7F -
1064	      #x9F): There is generally no way to transmit these characters
1065	      reliably as text outside of a charset encoding.  Even when in
1066	      encoded form, many software components silently filter out some of
1067	      these characters, or may stop processing altogether when
1068	      encountering some of them.  These characters may affect text
1069	      display in subtle, unnoticeable ways or in drastic, global, and
1070	      irreversible ways depending on the hardware and software involved.
1071	      The use of some of these characters would allow malicious users to
1072	      manipulate the display of an IRI and its context in many
1073	      situations.

1075	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1076	      characters affect the display ordering of characters.  If IRIs
1077	      were allowed to contain these characters and the resulting visual
1078	      display were transcribed, they could not be converted back to
1079	      electronic form (logical order) unambiguously.  These characters,
1080	      if allowed in IRIs, might allow malicious users to manipulate the
1081	      display of an IRI and its context.

1083	      Specials (U+FFF0-FFFD): These code points provide functionality
1084	      beyond that useful in an IRI, for example byte order
1085	      identification, annotation, and replacements for unknown
1086	      characters and objects.  Their use and interpretation in an IRI
1087	      would serve no purpose and might lead to confusing display
1088	      variations.

1090	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1091	      10FFFD): Display and interpretation of these code points is by
1092	      definition undefined without private agreement.  Therefore, these
1093	      code points are not suited for use on the Internet.  They are not
1094	      interoperable and may have unpredictable effects.

1096	      Tags (U+E0000-E0FFF): These characters provide a way to language
1097	      tag in Unicode plain text.  They are not appropriate for IRIs
1098	      because language information in identifiers cannot reliably be
1099	      input, transmitted (e.g. on a visual medium such as paper), or
1100	      recognized.

1102	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1103	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1104	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1105	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1106	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1107	      non-characters.  Applications may use some of them internally, but
1108	      are not prepared to interchange them.

1110	   LEIRI preprocessing disallowed some code points and code units:

1112	      Surrogate code units (D800-DFFF): These do not represent Unicode
1113	      codepoints.

1115	7.  URI/IRI Processing Guidelines (Informative)

1117	   This informative section provides guidelines for supporting IRIs in
1118	   the same software components and operations that currently process
1119	   URIs: Software interfaces that handle URIs, software that allows
1120	   users to enter URIs, software that creates or generates URIs,
1121	   software that displays URIs, formats and protocols that transport
1122	   URIs, and software that interprets URIs.  These may all require
1123	   modification before functioning properly with IRIs.  The
1124	   considerations in this section also apply to URI references and IRI
1125	   references.

1127	7.1.  URI/IRI Software Interfaces

1129	   Software interfaces that handle URIs, such as URI-handling APIs and
1130	   protocols transferring URIs, need interfaces and protocol elements
1131	   that are designed to carry IRIs.

1133	   In case the current handling in an API or protocol is based on US-
1134	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1135	   is compatible with US-ASCII, is in accordance with the
1136	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1137	   any case, the API or protocol definition must clearly define the
1138	   character encoding to be used.

1140	   The transfer from URI-only to IRI-capable components requires no
1141	   mapping, although the conversion described in Section 3.7 above may
1142	   be performed.  It is preferable not to perform this inverse
1143	   conversion unless it is certain this can be done correctly.

1145	7.2.  URI/IRI Entry

1147	   Some components allow users to enter URIs into the system by typing
1148	   or dictation, for example.  This software must be updated to allow
1149	   for IRI entry.

1151	   A person viewing a visual representation of an IRI (as a sequence of
1152	   glyphs, in some order, in some visual display) or hearing an IRI will
1153	   use an entry method for characters in the user's language to input
1154	   the IRI.  Depending on the script and the input method used, this may
1155	   be a more or less complicated process.

1157	   The process of IRI entry must ensure, as much as possible, that the
1158	   restrictions defined in Section 2.2 are met.  This may be done by
1159	   choosing appropriate input methods or variants/settings thereof, by
1160	   appropriately converting the characters being input, by eliminating
1161	   characters that cannot be converted, and/or by issuing a warning or
1162	   error message to the user.

1164	   As an example of variant settings, input method editors for East
1165	   Asian Languages usually allow the input of Latin letters and related
1166	   characters in full-width or half-width versions.  For IRI input, the
1167	   input method editor should be set so that it produces half-width
1168	   Latin letters and punctuation and full-width Katakana.
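
   As a minimal sketch of "appropriately converting the characters
   being input", the following Python fragment folds width variants
   using NFKC normalization.  The function name normalize_width is
   hypothetical, and NFKC also folds other compatibility characters, so
   a real entry component might prefer a narrower mapping.

```python
import unicodedata


def normalize_width(text):
    """Fold width variants with NFKC (an approximation).

    Full-width Latin becomes ASCII and half-width Katakana becomes
    full-width Katakana.  NFKC also changes other compatibility
    characters, such as ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# Full-width "example" (U+FF45 ...) and half-width Katakana KA (U+FF76).
print(normalize_width("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"))
print(normalize_width("\uff76") == "\u30ab")
```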

1170	   An input field primarily or solely used for the input of URIs/IRIs,
1171	   or any other place where IRIs are entered frequently, might allow
1172	   the user to view an IRI as it is mapped to a URI.  This will help
1173	   users when some of the software they use does not yet accept IRIs.

1176	   An IRI input component interfacing to components that handle URIs,
1177	   but not IRIs, must map the IRI to a URI before passing it to these
1178	   components.

1180	   For the input of IRIs with right-to-left characters, please see
1181	   [Bidi].

1183	7.3.  URI/IRI Transfer between Applications

1185	   Many applications (for example, mail user agents) try to detect URIs
1186	   appearing in plain text.  For this, they use some heuristics based on
1187	   URI syntax.  They then allow the user to click on such URIs and
1188	   retrieve the corresponding resource in an appropriate (usually
1189	   scheme-dependent) application.

1191	   Such applications would need to be upgraded, in order to use the IRI
1192	   syntax as a base for heuristics.  In particular, a non-ASCII
1193	   character should not be taken as the indication of the end of an IRI.
1194	   Such applications also would need to make sure that they correctly
1195	   convert the detected IRI from the character encoding of the document
1196	   or application where the IRI appears, to the character encoding used
1197	   by the system-wide IRI invocation mechanism, or to a URI (according
1198	   to Section 3.6) if the system-wide invocation mechanism only accepts
1199	   URIs.
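
   The detection heuristic might be sketched as follows.  This is an
   illustration only: the pattern, the restriction to "http" and
   "https", and the name find_iris are assumptions, not part of this
   specification.  The point is that a match ends at whitespace or at
   a small set of ASCII delimiters, never at a non-ASCII character.

```python
import re

# Assumption: only http/https are detected, and a candidate ends at
# whitespace or at one of a few ASCII delimiters.
IRI_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Trailing punctuation that usually belongs to the sentence.
TRAILING = ".,;:!?)"


def find_iris(text):
    """Return candidate IRIs found in plain text."""
    return [m.rstrip(TRAILING) for m in IRI_PATTERN.findall(text)]


print(find_iris("See http://example.org/r\u00e9sum\u00e9.html, thanks."))
```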

1201	   The clipboard is another frequently used way to transfer URIs and
1202	   IRIs from one application to another.  On most platforms, the
1203	   clipboard is able to store and transfer text in many languages and
1204	   scripts.  Correctly used, the clipboard transfers characters, not
1205	   octets, which will do the right thing with IRIs.

1207	7.4.  URI/IRI Generation

1209	   Systems that offer resources through the Internet, where those
1210	   resources have logical names, sometimes automatically generate URIs
1211	   for the resources they offer.  For example, some HTTP servers can
1212	   generate a directory listing for a file directory and then respond to
1213	   the generated URIs with the files.

1215	   Many legacy character encodings are in use in various file systems.
1216	   Many currently deployed systems do not transform the local character
1217	   representation of the underlying system before generating URIs.

1219	   For maximum interoperability, systems that generate resource
1220	   identifiers should make the appropriate transformations.  For
1221	   example, if a file system contains a file named
1222	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1223	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1224	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1225	   kept in a character encoding other than UTF-8.
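
   A minimal sketch of this transformation, assuming the server knows
   the local character encoding of its file names (here ISO-8859-1,
   chosen only for illustration; the function name is hypothetical):

```python
from urllib.parse import quote


def exposed_uri_segment(raw_name, local_encoding):
    """Map a locally encoded file name to a UTF-8 based URI segment."""
    # Decode from the local encoding to characters first ...
    text = raw_name.decode(local_encoding)
    # ... then percent-encode the UTF-8 octets of those characters.
    return quote(text, safe="")


# The file name is stored as ISO-8859-1 octets on disk.
print(exposed_uri_segment(b"r\xe9sum\xe9.html", "iso-8859-1"))
```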

1227	   This recommendation particularly applies to HTTP servers.  For FTP
1228	   servers, similar considerations apply; see [RFC2640].

1230	7.5.  URI/IRI Selection

1232	   In some cases, resource owners and publishers have control over the
1233	   IRIs used to identify their resources.  This control is mostly
1234	   executed by controlling the resource names, such as file names,
1235	   directly.

1237	   In these cases, it is recommended to avoid choosing IRIs that are
1238	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1239	   is easily confused with the digit one ("1"), and the upper-case oh
1240	   ("O") is easily confused with the digit zero ("0").  Publishers
1241	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1243	   Outside the US-ASCII repertoire, there are many more opportunities
1244	   for confusion; a complete set of guidelines is too lengthy to include
1245	   here.  As long as names are limited to characters from a single
1246	   script, native writers of a given script or language will know best
1247	   when ambiguities can appear, and how they can be avoided.  What may
1248	   look ambiguous to a stranger may be completely obvious to the average
1249	   native user.  On the other hand, in some cases, the UCS contains
1250	   variants for compatibility reasons; for example, for typographic
1251	   purposes.  These should be avoided wherever possible.  Although there
1252	   may be exceptions, newly created resource names should generally be
1253	   in NFKC [UTR15] (which means that they are also in NFC).

1255	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1256	   compatibility reasons.  Wherever possible, IRIs should use the two
1257	   letters "f" and "i" rather than the "fi" ligature.  An example where
1258	   the latter may be used is in the query part of an IRI for an explicit
1259	   search for a word written containing the "fi" ligature.

1261	   In certain cases, there is a chance that characters from different
1262	   scripts look the same.  The best known example is the similarity of
1263	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1264	   such cases, IRIs should only be created where all the characters in a
1265	   single component are used together in a given language.  This usually
1266	   means that all of these characters will be from the same script, but
1267	   there are languages that mix characters from different scripts (such
1268	   as Japanese).  This is similar to the heuristics used to distinguish
1269	   between letters and numbers in the examples above.  Also, for Latin,
1270	   Greek, and Cyrillic, using lowercase letters results in fewer
1271	   ambiguities than using uppercase letters would.
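
   One rough way to spot components that mix scripts is sketched below.
   It is only a heuristic: Python's standard library does not expose the
   Unicode Script property, so the first word of each character name is
   used as a stand-in, and the function name scripts_used is
   hypothetical.  Because some languages legitimately mix scripts, a
   result with more than one entry is a reason to look closer, not to
   reject the name.

```python
import unicodedata


def scripts_used(component):
    """Approximate the set of scripts used by the letters in a string."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split(" ", 1)[0])
    return scripts


# The second letter is U+0430 (Cyrillic), not the Latin "a".
print(sorted(scripts_used("p\u0430ypal")))
```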

1273	7.6.  Display of URIs/IRIs

1275	   In situations where the rendering software is not expected to display
1276	   non-ASCII parts of the IRI correctly using the available layout and
1277	   font resources, these parts should be percent-encoded before being
1278	   displayed.
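
   As an illustration, a display routine might percent-encode only the
   characters it cannot render.  The callback can_render is a
   hypothetical stand-in for whatever font and layout check the
   rendering software provides.

```python
from urllib.parse import quote


def display_form(iri, can_render):
    """Percent-encode non-ASCII characters that cannot be rendered."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80 or can_render(ch):
            out.append(ch)
        else:
            # Encode the character's UTF-8 octets.
            out.append(quote(ch, safe=""))
    return "".join(out)


# Assume a display that cannot show any non-ASCII character.
print(display_form("http://example.org/r\u00e9sum\u00e9", lambda ch: False))
```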

1280	   For display of Bidi IRIs, please see [Bidi].

1282	7.7.  Interpretation of URIs and IRIs

1284	   Software that interprets IRIs as the names of local resources should
1285	   accept IRIs in multiple forms and convert and match them with the
1286	   appropriate local resource names.

1288	   First, multiple representations include both IRIs in the native
1289	   character encoding of the protocol and also their URI counterparts.

1291	   Second, it may include URIs constructed based on character encodings
1292	   other than UTF-8.  These URIs may be produced by user agents that do
1293	   not conform to this specification and that use legacy character
1294	   encodings to convert non-ASCII characters to URIs.  Whether this is
1295	   necessary, and what character encodings to cover, depends on a number
1296	   of factors, such as the legacy character encodings used locally and
1297	   the distribution of various versions of user agents.  For example,
1298	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
1299	   addition to UTF-8.
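
   A server following this advice might try UTF-8 first and fall back
   to locally relevant legacy encodings.  The sketch below is one
   possible approach, not a prescribed algorithm; the candidate list and
   its order are assumptions, and a short octet sequence can be valid in
   more than one legacy encoding, so the fallback result is a guess.

```python
from urllib.parse import unquote_to_bytes

# Assumption: encodings relevant to a Japanese deployment, UTF-8 first.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def decode_path(percent_encoded, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes."""
    raw = unquote_to_bytes(percent_encoded)
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


print(decode_path("%E6%97%A5%E6%9C%AC"))   # UTF-8 octets
print(decode_path("%93%FA%96%7B"))         # Shift_JIS octets
```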

1301	   Third, it may include additional mappings to be more user-friendly
1302	   and robust against transmission errors.  These would be similar to
1303	   how some servers currently treat URIs as case insensitive or perform
1304	   additional matching to account for spelling errors.  For characters
1305	   beyond the US-ASCII repertoire, this may, for example, include
1306	   ignoring the accents on received IRIs or resource names.  Please note
1307	   that such mappings, including case mappings, are language dependent.

1309	   It can be difficult to identify a resource unambiguously if too many
1310	   mappings are taken into consideration.  However, percent-encoded and
1311	   not percent-encoded parts of IRIs can always be clearly
1312	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
1313	   the potential for collisions lower than it may seem at first.

1315	7.8.  Upgrading Strategy

1317	   Where this recommendation places further constraints on software for
1318	   which many instances are already deployed, it is important to
1319	   introduce upgrades carefully and to be aware of the various
1320	   interdependencies.

1322	   If IRIs cannot be interpreted correctly, they should not be created,
1323	   generated, or transported.  This suggests that upgrading URI
1324	   interpreting software to accept IRIs should have highest priority.

1326	   On the other hand, a single IRI is interpreted only by a single or
1327	   very few interpreters that are known in advance, although it may be
1328	   entered and transported very widely.

1330	   Therefore, IRIs benefit most from a broad upgrade of software to be
1331	   able to enter and transport IRIs.  However, before an individual IRI
1332	   is published, care should be taken to upgrade the corresponding
1333	   interpreting software in order to cover the forms expected to be
1334	   received by various versions of entry and transport software.

1336	   The upgrade of generating software to generate IRIs instead of using
1337	   a local character encoding should happen only after the service is
1338	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
1339	   when the service accepts IRIs and the intervening infrastructure and
1340	   protocol are known to transport them safely.

1342	   Software converting from URIs to IRIs for display should be upgraded
1343	   only after upgraded entry software has been widely deployed to the
1344	   population that will see the displayed result.

1346	   Where there is a free choice of character encodings, it is often
1347	   possible to reduce the effort and dependencies for upgrading to IRIs
1348	   by using UTF-8 rather than another encoding.  For example, when a new
1349	   file-based Web server is set up, using UTF-8 as the character
1350	   encoding for file names will make the transition to IRIs easier.
1351	   Likewise, when a new Web form is set up using UTF-8 as the character
1352	   encoding of the form page, the returned query URIs will use UTF-8 as
1353	   the character encoding (unless the user, for whatever reason, changes
1354	   the character encoding) and will therefore be compatible with IRIs.
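
   The effect of the form page's character encoding on the returned
   query can be seen in a small sketch.  The helper form_query is
   hypothetical and merely imitates what a user agent does when it
   submits a form.

```python
from urllib.parse import urlencode


def form_query(fields, page_encoding):
    """Build a query string as a form in the given encoding would."""
    return urlencode(fields, encoding=page_encoding)


fields = {"q": "r\u00e9sum\u00e9"}
print(form_query(fields, "utf-8"))        # compatible with IRIs
print(form_query(fields, "iso-8859-1"))   # legacy octets, not UTF-8
```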

1356	   These recommendations, when taken together, will allow for the
1357	   extension from URIs to IRIs in order to handle characters other than
1358	   US-ASCII while minimizing interoperability problems.  For
1359	   considerations regarding the upgrade of URI scheme definitions, see
1360	   Section 4.4.

1362	8.  IANA Considerations

1364	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
1365	   this document when it issues as an RFC.

1367	   IANA maintains a registry of "URI schemes".  A "URI scheme" also
1368	   serves as an "IRI scheme".

1370	   To clarify that the URI scheme registration process also applies to
1371	   IRIs, change the description of the "URI schemes" registry header to
1372	   say "[RFC4395] defines an IANA-maintained registry of URI Schemes.
1373	   These registries include the Permanent and Provisional URI Schemes.
1374	   RFC XXXX updates this registry to designate that schemes may also
1375	   indicate their usability as IRI schemes."

1377	   Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".

1379	9.  Security Considerations

1381	   The security considerations discussed in [RFC3986] also apply to
1382	   IRIs.  In addition, the following issues require particular care for
1383	   IRIs.

1385	   Incorrect encoding or decoding can lead to security problems.  For
1386	   example, some UTF-8 decoders do not check against overlong byte
1387	   sequences.  See [UTR36] Section 3 for details.
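
   As a sketch of a decoder that does perform this check, the fragment
   below relies on Python's strict UTF-8 decoder, which rejects overlong
   sequences.  The function name decode_or_none is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_or_none(component):
    """Percent-decode as strict UTF-8; return None if ill-formed."""
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_or_none("%C3%A9"))   # well-formed two-octet sequence
print(decode_or_none("%C0%AF"))   # overlong encoding of "/"
```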

1389	   There are serious difficulties with relying on a human to verify that
1390	   an IRI (whether presented visually or aurally) is the same as
1391	   another IRI or is the one intended.  These problems exist with ASCII-
1392	   only URIs (bl00mberg.com vs. bloomberg.com) but are strongly
1393	   exacerbated when using the much larger character repertoire of
1394	   Unicode.  For details, see Section 2 of [UTR36].  Using
1395	   administrative and technical means to reduce the availability of such
1396	   exploits is possible, but they are difficult to eliminate altogether.
1397	   User agents SHOULD NOT rely on visual or perceptual comparison or
1398	   verification of IRIs as a means of validating or assuring safety,
1399	   correctness or appropriateness of an IRI.  Other means of presenting
1400	   users with the validity, safety, or appropriateness of visited sites
1401	   are being developed in the browser community as an alternative means
1402	   of avoiding these difficulties.

1404	   Besides the large character repertoire of Unicode, reasons for
1405	   confusion include different forms of normalization and different
1406	   normalization expectations, use of percent-encoding with various
1407	   legacy encodings, and bidirectionality issues.  See also [Bidi].

1409	   Confusion can occur in various IRI components, such as the domain
1410	   name part or the path part, or between IRI components.  For
1411	   considerations specific to the domain name part, see [RFC5890].  For
1412	   considerations specific to particular protocols or schemes, see the
1413	   security sections of the relevant specifications and registration
1414	   templates.  Administrators of sites that allow independent users to
1415	   create resources in the same sub area have to be careful.  Details
1416	   are discussed in Section 7.5.

1418	   The characters additionally allowed in Legacy Extended IRIs introduce
1419	   additional security issues.  For details, see Section 6.

1421	10.  Acknowledgements

1423	   This document was derived from [RFC3987]; the acknowledgments from
1424	   that specification still apply.

1426	   In addition, this document was influenced by contributions from (in
1427	   no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
1428	   John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
1429	   Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
1430	   Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
1431	   Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
1432	   Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
1433	   Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
1434	   Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
1435	   Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
1436	   Tobias, Marko Martin, Maciej Stachowiak, Wil Tan, Yui Naruse,
1437	   Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
1438	   Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex
1439	   Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland,
1440	   Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas
1441	   Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev.

1443	11.  Main Changes Since RFC 3987

1445	   This section describes the main changes since [RFC3987].

1447	11.1.  Split out Bidi, processing guidelines, comparison sections

1449	   Move some components (comparison, bidi, processing) into separate
1450	   documents.

1452	11.2.  Major restructuring of IRI processing model

1454	   Major restructuring of IRI processing model to make scheme-specific
1455	   translation necessary to handle IDNA requirements and for consistency
1456	   with web implementations.

1458	   Starting with an IRI, you want one of:

1460	   a  IRI components (IRI parsed into UTF-8 pieces)

1462	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

1464	   c  whole URI (for passing on to some other system that wants whole
1465	      URIs)

1467	11.2.1.  OLD WAY

1469	   1.  Pct-encoding on the whole thing to a URI. (c1) If you want a
1470	       (maybe broken) whole URI, you might stop here.

1472	   2.  Parsing the URI into URI components. (b1) If you want (maybe
1473	       broken) URI components, stop here.

1475	   3.  Decode the components (undoing the pct-encoding). (a) If you want
1476	       IRI components, stop here.

1478	   4.  Reencode: use a different encoding for some components (domain
1479	       names, and query components in web pages, depending on the
1480	       component, scheme, and context), and pct-encoding otherwise.
1481	       (b2) If you want (good) URI components, stop here.

1483	   5.  Reassemble the reencoded components. (c2) If you want a (*good*)
1484	       whole URI, stop here.

1486	11.2.2.  NEW WAY

1488	   1.  Parse the IRI into IRI components using the generic syntax. (a)
1489	       If you want IRI components, stop here.

1491	   2.  Encode each component, using pct-encoding, IDN encoding, or
1492	       special query part encoding depending on the component, scheme,
1493	       or context. (b) If you want URI components, stop here.

1495	   3.  Reassemble a whole URI from the URI components. (c) If you want
1496	       a whole URI, stop here.
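
   The three steps above can be sketched in Python as follows.  This is
   a simplified illustration under several assumptions: the function
   name iri_to_uri is hypothetical, Python's built-in "idna" codec
   implements IDNA 2003 [RFC3490] rather than [RFC5891], the query is
   always encoded as UTF-8 (no document-dependent query encoding), and
   userinfo and IPv6 literals are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" keeps existing percent-escapes intact.
SAFE = "/:@!$&'()*+,;=-._~%"


def iri_to_uri(iri):
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component according to its kind.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")   # IDN encoding
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path, safe=SAFE)              # pct-encoding
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/r\u00e9sum\u00e9.html"))
```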

1498	11.2.3.  Extension of Syntax

1500	   Added the tag range (U+E0000-E0FFF) to the iprivate production.  Some
1501	   IRIs generated with the new syntax may fail to pass very strict
1502	   checks relying on the old syntax.  But characters in this range
1503	   should be extremely infrequent anyway.

1505	11.2.4.  More to be added

1507	   TODO: There are more main changes that need to be documented in this
1508	   section.

1510	11.3.  Change Log

1512	   Note to RFC Editor: Please completely remove this section before
1513	   publication.

1515	11.3.1.  Changes after draft-ietf-iri-3987bis-01

1517	   Changes from draft-ietf-iri-3987bis-01 onwards are available as
1518	   changesets in the IETF tools subversion repository at http://
1519	   trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
1520	   draft-ietf-iri-3987bis.xml.

1522	11.3.2.  Changes from draft-duerst-iri-bis-07 to
1523	         draft-ietf-iri-3987bis-00

1525	   Changed draft name, date, last paragraph of abstract, and titles in
1526	   change log, and added this section in moving from
1527	   draft-duerst-iri-bis-07 (personal submission) to
1528	   draft-ietf-iri-3987bis-00 (WG document).

1530	11.3.3.  Changes from -06 to -07 of draft-duerst-iri-bis

1532	   Major restructuring of the processing model, see Section 11.2.

1534	11.4.  Changes from -00 to -01

1536	   o  Removed 'mailto:' before mail addresses of authors.

1538	   o  Added "<to be done>" as right side of 'href-strip' rule.  Fixed
1539	      '|' to '/' for alternatives.

1541	11.5.  Changes from -05 to -06 of draft-duerst-iri-bis

1543	   o  Add HyperText Reference, change abstract, acks and references for
1544	      it

1546	   o  Add Masinter back as another editor.

1548	   o  Masinter integrates HRef material from HTML5 spec.

1550	   o  Rewrite introduction sections to modernize.

1552	11.6.  Changes from -04 to -05 of draft-duerst-iri-bis

1554	   o  Updated references.

1556	   o  Changed IPR text to pre5378Trust200902.

1558	11.7.  Changes from -03 to -04 of draft-duerst-iri-bis

1560	   o  Added explicit abbreviation for LEIRIs.

1562	   o  Mentioned LEIRI references.

1564	   o  Completed text in LEIRI section about tag characters and about
1565	      specials.

1567	11.8.  Changes from -02 to -03 of draft-duerst-iri-bis

1569	   o  Updated some references.

1571	   o  Updated Michel Suignard's coordinates.

1573	11.9.  Changes from -01 to -02 of draft-duerst-iri-bis

1575	   o  Added tag range to iprivate (issue private-include-tags-115).

1577	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

1579	11.10.  Changes from -00 to -01 of draft-duerst-iri-bis

1581	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
1582	      based on input from the W3C XML Core WG.  Moved the relevant
1583	      subsections to the back and promoted them to a section.

1585	   o  Added some text on Legacy Extended IRIs to the security section.

1587	   o  Added an IANA Considerations section.

1589	   o  Added this Change Log Section.

1591	   o  Added a section about "IRIs with Spaces/Controls" (converting from
1592	      a Note in RFC 3987).

1594	11.11.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

1596	      Fixed errata (see
1597	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

1599	12.  References

1601	12.1.  Normative References

1603	   [ASCII]    American National Standards Institute, "Coded Character
1604	              Set -- 7-bit American Standard Code for Information
1605	              Interchange", ANSI X3.4, 1986.

1607	   [ISO10646]
1608	              International Organization for Standardization, "ISO/IEC
1609	              10646:2003: Information Technology - Universal Multiple-
1610	              Octet Coded Character Set (UCS)", ISO Standard 10646,
1611	              December 2003.

1613	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1614	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1616	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
1617	              "Internationalizing Domain Names in Applications (IDNA)",
1618	              RFC 3490, March 2003.

1620	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1621	              Profile for Internationalized Domain Names (IDN)",
1622	              RFC 3491, March 2003.

1624	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1625	              10646", STD 63, RFC 3629, November 2003.

1627	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1628	              Resource Identifier (URI): Generic Syntax", STD 66,
1629	              RFC 3986, January 2005.

1631	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1632	              Applications (IDNA): Definitions and Document Framework",
1633	              RFC 5890, August 2010.

1635	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
1636	              Applications (IDNA): Protocol", RFC 5891, August 2010.

1638	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
1639	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

1641	   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
1642	              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
1643	              ISBN 978-1-936213-01-6)", October 2010.

1645	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
1646	              Unicode Standard Annex #15, March 2008,
1647	              <http://www.unicode.org/unicode/reports/tr15/
1648	              tr15-23.html>.

1650	12.2.  Informative References

1652	   [Bidi]     Duerst, M. and L. Masinter, "Guidelines for
1653	              Internationalized Resource Identifiers with Bi-directional
1654	              Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00
1655	              (work in progress), August 2011.

1657	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
1658	              Texin, "Character Model for the World Wide Web: Resource
1659	              Identifiers", World Wide Web Consortium Candidate
1660	              Recommendation, November 2004,
1661	              <http://www.w3.org/TR/charmod-resid>.

1663	   [Duerst97]
1664	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
1665	              11th International Unicode Conference, San Jose,
1666	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
1667	              papers/PDF/IUC11-UTF-8.pdf>.

1669	   [Equivalence]
1670	              Masinter, L. and M. Duerst, "Equivalence and
1671	              Canonicalization of Internationalized Resource Identifiers
1672	              (IRIs)", draft-ietf-iri-comparison-00 (work in progress),
1673	              August 2011.

1675	   [Gettys]   Gettys, J., "URI Model Consequences",
1676	              <http://www.w3.org/DesignIssues/ModelConsequences>.

1678	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
1679	              Specification", World Wide Web Consortium Recommendation,
1680	              December 1999,
1681	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

1683	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
1684	              IRIs for XML resource identification", World Wide Web
1685	              Consortium Note, November 2008,
1686	              <http://www.w3.org/TR/leiri/>.

1688	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
1689	              Extensions (MIME) Part One: Format of Internet Message
1690	              Bodies", RFC 2045, November 1996.

1692	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
1693	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
1694	              the IAB Character Set Workshop held 29 February - 1 March,
1695	              1996", RFC 2130, April 1997.

1697	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

1699	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

1701	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1702	              Languages", BCP 18, RFC 2277, January 1998.

1704	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
1705	              URL scheme", RFC 2368, July 1998.

1707	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

1709	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1710	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
1711	              August 1998.

1713	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
1714	              August 1998.

1716	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1717	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1718	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

1720	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
1721	              Protocol", RFC 2640, July 1999.

1723	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
1724	              Identifiers (IRIs)", RFC 3987, January 2005.

1726	   [RFC4395bis]
1727	              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
1728	              Registration Procedures for New URI/IRI Schemes",
1729	              draft-ietf-iri-4395bis-irireg-03 (work in progress),
1730	              July 2011.

1732	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1733	              Encodings for Internationalized Domain Names", RFC 6055,
1734	              February 2011.

1736	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
1737	              Markup Languages", Unicode Technical Report #20, World
1738	              Wide Web Consortium Note, June 2003,
1739	              <http://www.w3.org/TR/unicode-xml/>.

1741	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
1742	              Considerations", Unicode Technical Report #36,
1743	              August 2010, <http://unicode.org/reports/tr36/>.

1745	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
1746	              Language (XLink) Version 1.0", World Wide Web
1747	              Consortium Recommendation, June 2001,
1748	              <http://www.w3.org/TR/xlink/#link-locators>.

1750	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
1751	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
1752	              Edition)", World Wide Web Consortium Recommendation,
1753	              August 2006, <http://www.w3.org/TR/REC-xml>.

1755	   [XMLNamespace]
1756	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
1757	              "Namespaces in XML (Second Edition)", World Wide Web
1758	              Consortium Recommendation, August 2006,
1759	              <http://www.w3.org/TR/REC-xml-names>.

1761	   [XMLSchema]
1762	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
1763	              World Wide Web Consortium Recommendation, May 2001,
1764	              <http://www.w3.org/TR/xmlschema-2/#anyURI>.

1766	   [XPointer]
1767	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
1768	              Framework", World Wide Web Consortium Recommendation,
1769	              March 2003,
1770	              <http://www.w3.org/TR/xptr-framework/#escaping>.

1772	Authors' Addresses

1774	   Martin Duerst
1775	   Aoyama Gakuin University
1776	   5-10-1 Fuchinobe
1777	   Sagamihara, Kanagawa  229-8558
1778	   Japan

1780	   Phone: +81 42 759 6329
1781	   Fax:   +81 42 759 6495
1782	   Email: duerst@it.aoyama.ac.jp
1783	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

1785	   Michel Suignard
1786	   Unicode Consortium
1787	   P.O. Box 391476
1788	   Mountain View, CA  94039-1476
1789	   U.S.A.

1791	   Phone: +1-650-693-3921
1792	   Email: michel@unicode.org
1793	   URI:   http://www.suignard.com

1795	   Larry Masinter
1796	   Adobe
1797	   345 Park Ave
1798	   San Jose, CA  95110
1799	   U.S.A.

1801	   Phone: +1-408-536-3024
1802	   Email: masinter@adobe.com
1803	   URI:   http://larry.masinter.net