idnits 2.17.1 

draft-ietf-iri-3987bis-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There is 1 instance of lines with control characters in the document.

  == There is 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (January 9, 2012) is 4489 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'LEIRI' is defined on line 1698, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2045' is defined on line 1703, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6082' is defined on line 1751, but no explicit
     reference was found in the text

  == Unused Reference: 'XMLNamespace' is defined on line 1774, but no
     explicit reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  == Outdated reference: A later version (-03) exists of
     draft-ietf-iri-bidi-guidelines-00

  == Outdated reference: A later version (-02) exists of
     draft-ietf-iri-comparison-00

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  == Outdated reference: A later version (-04) exists of
     draft-ietf-iri-4395bis-irireg-03


     Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource Identifiers                         M. Duerst
3	(iri)                                           Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: 3987 (if approved)                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: July 12, 2012                                             Adobe
8	                                                         January 9, 2012

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-09

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   Defining IRI as a new protocol element (rather than updating or
22	   extending the definition of URI) allows independent orderly
23	   transitions: other protocols and languages that use URIs must
24	   explicitly choose to allow IRIs.

26	   Guidelines are provided for the use and deployment of IRIs and
27	   related protocol elements when revising protocols, formats, and
28	   software components that currently deal only with URIs.

30	   This document is part of a set of documents intended to replace RFC
31	   3987.

33	RFC Editor: Please remove the next paragraph before publication.

35	   This (and several companion documents) are intended to obsolete RFC
36	   3987, and also move towards IETF Draft Standard.  For discussion and
37	   comments on these drafts, please join the IETF IRI WG by subscribing
38	   to the mailing list public-iri@w3.org, archives at
39	   http://lists.w3.org/archives/public/public-iri/.  For a list of open
40	   issues, please see the issue tracker of the WG at
41	   http://trac.tools.ietf.org/wg/iri/trac/report/1.  For a list of
42	   individual edits, please see the change history at
43	   http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.

45	Status of this Memo

47	   This Internet-Draft is submitted in full conformance with the
48	   provisions of BCP 78 and BCP 79.

50	   Internet-Drafts are working documents of the Internet Engineering
51	   Task Force (IETF).  Note that other groups may also distribute
52	   working documents as Internet-Drafts.  The list of current Internet-
53	   Drafts is at http://datatracker.ietf.org/drafts/current/.

55	   Internet-Drafts are draft documents valid for a maximum of six months
56	   and may be updated, replaced, or obsoleted by other documents at any
57	   time.  It is inappropriate to use Internet-Drafts as reference
58	   material or to cite them other than as "work in progress."

60	   This Internet-Draft will expire on July 12, 2012.

62	Copyright Notice

64	   Copyright (c) 2012 IETF Trust and the persons identified as the
65	   document authors.  All rights reserved.

67	   This document is subject to BCP 78 and the IETF Trust's Legal
68	   Provisions Relating to IETF Documents
69	   (http://trustee.ietf.org/license-info) in effect on the date of
70	   publication of this document.  Please review these documents
71	   carefully, as they describe your rights and restrictions with respect
72	   to this document.  Code Components extracted from this document must
73	   include Simplified BSD License text as described in Section 4.e of
74	   the Trust Legal Provisions and are provided without warranty as
75	   described in the Simplified BSD License.

77	   This document may contain material from IETF Documents or IETF
78	   Contributions published or made publicly available before November
79	   10, 2008.  The person(s) controlling the copyright in some of this
80	   material may not have granted the IETF Trust the right to allow
81	   modifications of such material outside the IETF Standards Process.
82	   Without obtaining an adequate license from the person(s) controlling
83	   the copyright in such materials, this document may not be modified
84	   outside the IETF Standards Process, and derivative works of it may
85	   not be created outside the IETF Standards Process, except to format
86	   it for publication as an RFC or to translate it into languages other
87	   than English.

89	Table of Contents

91	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
92	     1.1.   Overview and Motivation . . . . . . . . . . . . . . . . .  5
93	     1.2.   Applicability . . . . . . . . . . . . . . . . . . . . . .  6
94	     1.3.   Definitions . . . . . . . . . . . . . . . . . . . . . . .  7
95	     1.4.   Notation  . . . . . . . . . . . . . . . . . . . . . . . .  8
96	   2.  IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  9
97	     2.1.   Summary of IRI Syntax . . . . . . . . . . . . . . . . . .  9
98	     2.2.   ABNF for IRI References and IRIs  . . . . . . . . . . . . 10
99	   3.  Processing IRIs and related protocol elements  . . . . . . . . 13
100	     3.1.   Converting to UCS . . . . . . . . . . . . . . . . . . . . 13
101	     3.2.   Parse the IRI into IRI components . . . . . . . . . . . . 13
102	     3.3.   General percent-encoding of IRI components  . . . . . . . 14
103	     3.4.   Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14
104	       3.4.1.  Mapping using Percent-Encoding . . . . . . . . . . . . 14
105	       3.4.2.  Mapping using Punycode . . . . . . . . . . . . . . . . 14
106	       3.4.3.  Additional Considerations  . . . . . . . . . . . . . . 15
107	     3.5.   Mapping query components  . . . . . . . . . . . . . . . . 16
108	     3.6.   Mapping IRIs to URIs  . . . . . . . . . . . . . . . . . . 16
109	   4.  Converting URIs to IRIs  . . . . . . . . . . . . . . . . . . . 16
110	     4.1.   Examples  . . . . . . . . . . . . . . . . . . . . . . . . 18
111	   5.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 19
112	     5.1.   Limitations on UCS Characters Allowed in IRIs . . . . . . 19
113	     5.2.   Software Interfaces and Protocols . . . . . . . . . . . . 20
114	     5.3.   Format of URIs and IRIs in Documents and Protocols  . . . 20
115	     5.4.   Use of UTF-8 for Encoding Original Characters . . . . . . 20
116	     5.5.   Relative IRI References . . . . . . . . . . . . . . . . . 22
117	   6.  Legacy Extended IRIs (LEIRIs)  . . . . . . . . . . . . . . . . 22
118	     6.1.   Legacy Extended IRI Syntax  . . . . . . . . . . . . . . . 23
119	     6.2.   Conversion of Legacy Extended IRIs to IRIs  . . . . . . . 23
120	     6.3.   Characters Allowed in Legacy Extended IRIs but not in
121	            IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . . 23
122	   7.  URI/IRI Processing Guidelines (Informative)  . . . . . . . . . 25
123	     7.1.   URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25
124	     7.2.   URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26
125	     7.3.   URI/IRI Transfer between Applications . . . . . . . . . . 26
126	     7.4.   URI/IRI Generation  . . . . . . . . . . . . . . . . . . . 27
127	     7.5.   URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27
128	     7.6.   Display of URIs/IRIs  . . . . . . . . . . . . . . . . . . 28
129	     7.7.   Interpretation of URIs and IRIs . . . . . . . . . . . . . 28
130	     7.8.   Upgrading Strategy  . . . . . . . . . . . . . . . . . . . 29
131	   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 30
132	   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 30
133	   10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31
134	   11. Main Changes Since RFC 3987  . . . . . . . . . . . . . . . . . 32
135	     11.1.  Split out Bidi, processing guidelines, comparison
136	            sections  . . . . . . . . . . . . . . . . . . . . . . . . 32

138	     11.2.  Major restructuring of IRI processing model . . . . . . . 32
139	       11.2.1. OLD WAY  . . . . . . . . . . . . . . . . . . . . . . . 32
140	       11.2.2. NEW WAY  . . . . . . . . . . . . . . . . . . . . . . . 33
141	       11.2.3. Extension of Syntax  . . . . . . . . . . . . . . . . . 33
142	       11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33
143	     11.3.  Change Log  . . . . . . . . . . . . . . . . . . . . . . . 33
144	       11.3.1. Changes after draft-ietf-iri-3987bis-01  . . . . . . . 33
145	       11.3.2. Changes from draft-duerst-iri-bis-07 to
146	               draft-ietf-iri-3987bis-00  . . . . . . . . . . . . . . 34
147	       11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis  . . . 34
148	     11.4.  Changes from -00 to -01 . . . . . . . . . . . . . . . . . 34
149	     11.5.  Changes from -05 to -06 of draft-duerst-iri-bis-00  . . . 34
150	     11.6.  Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34
151	     11.7.  Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34
152	     11.8.  Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35
153	     11.9.  Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35
154	     11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 35
155	     11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis  . . 35
156	   12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35
157	     12.1.  Normative References  . . . . . . . . . . . . . . . . . . 35
158	     12.2.  Informative References  . . . . . . . . . . . . . . . . . 36
159	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39

161	1.  Introduction

163	1.1.  Overview and Motivation

165	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
166	   sequence of characters chosen from a limited subset of the repertoire
167	   of US-ASCII [ASCII] characters.

169	   The characters in URIs are frequently used for representing words of
170	   natural languages.  This usage has many advantages: Such URIs are
171	   easier to memorize, easier to interpret, easier to transcribe, easier
172	   to create, and easier to guess.  For most languages other than
173	   English, however, the natural script uses characters other than A -
174	   Z. For many people, handling Latin characters is as difficult as
175	   handling the characters of other scripts is for those who use only
176	   the Latin alphabet.  Many languages with non-Latin scripts are
177	   transcribed with Latin letters.  These transcriptions are now often
178	   used in URIs, but they introduce additional difficulties.

180	   The infrastructure for the appropriate handling of characters from
181	   additional scripts is now widely deployed in operating system and
182	   application software.  Software that can handle a wide variety of
183	   scripts and languages at the same time is increasingly common.  Also,
184	   an increasing number of protocols and formats can carry a wide range
185	   of characters.

187	   URIs are composed out of a very limited repertoire of characters;
188	   this design choice was made to support global transcription
189	   ([RFC3986], Section 1.2.1).  Reliable transition between a URI (as an
190	   abstract protocol element composed of a sequence of characters) and a
191	   presentation of that URI (written on a napkin, read out loud) and
192	   back is relatively straightforward, because of the limited repertoire
193	   of characters used.  IRIs are designed to satisfy a different set of
194	   use requirements; in particular, to allow IRIs to be written in ways
195	   that are more meaningful to their users, even at the expense of
196	   global transcribability.  However, ensuring reliability of the
197	   transition between an IRI and its presentation and back is more
198	   difficult and complex when dealing with the larger set of Unicode
199	   characters.  For example, Unicode supports multiple ways of encoding
200	   complex combinations of characters and accents, with multiple
201	   character sequences that can result in the same presentation.

203	   This document defines the protocol element called Internationalized
204	   Resource Identifier (IRI), which allows applications of URIs to be
205	   extended to use resource identifiers that have a much wider
206	   repertoire of characters.  It also provides corresponding
207	   "internationalized" versions of other constructs from [RFC3986], such
208	   as URI references.  The syntax of IRIs is defined in Section 2.

210	   Within this document, Section 5 discusses the use of IRIs in
211	   different situations.  Section 7 gives additional informative
212	   guidelines.  Section 9 discusses IRI-specific security
213	   considerations.

215	   This specification is part of a collection of specifications intended
216	   to replace [RFC3987].  [Bidi] discusses the special case of
217	   bidirectional IRIs using characters from scripts written right-to-
218	   left.  [Equivalence] gives guidelines for applications wishing to
219	   determine if two IRIs are equivalent, as well as defining some
220	   equivalence methods.  [RFC4395bis] updates the URI scheme
221	   registration guidelines and procedures to note that every URI scheme
222	   is also automatically an IRI scheme and to allow scheme definitions
223	   to be directly described in terms of Unicode characters.

225	1.2.  Applicability

227	   IRIs are designed to allow protocols and software that deal with URIs
228	   to be updated to handle IRIs.  Processing of IRIs is accomplished by
229	   extending the URI syntax while retaining (and not expanding) the set
230	   of "reserved" characters, such that the syntax for any URI scheme may
231	   be extended to allow non-ASCII characters.  In addition, following
232	   parsing of an IRI, it is possible to construct a corresponding URI by
233	   first encoding characters outside of the allowed URI range and then
234	   reassembling the components.
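
   The encoding step mentioned above can be sketched in Python.  This is
   a minimal illustration, not the normative mapping: it assumes the IRI
   has already been parsed, works on one component at a time, and
   percent-encodes only characters outside US-ASCII as UTF-8 octets.
   The function name is hypothetical, and the host component
   (ireg-name), which may be mapped differently, is not handled.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode every non-ASCII character of one parsed component.

    Hypothetical helper for illustration only: ASCII characters
    (including reserved ones and existing %XX sequences) pass through
    unchanged.
    """
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # Encode the character as UTF-8 and write each octet as %XX.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


# Usage: a path component containing U+00E9 twice.
print(percent_encode_non_ascii("/r\u00e9sum\u00e9"))
```

   Reassembling the encoded components in their original order then
   yields a string that uses only characters allowed in URIs.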

236	   Practical use of IRI forms in place of URI forms depends on the
237	   following conditions being met:

239	   a. A protocol or format element MUST be explicitly designated to be
240	      able to carry IRIs.  The intent is to avoid introducing IRIs into
241	      contexts that are not defined to accept them.  For example, XML
242	      schema [XMLSchema] has an explicit type "anyURI" that includes
243	      IRIs and IRI references.  Therefore, IRIs and IRI references can
244	      be in attributes and elements of type "anyURI".  On the other
245	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
246	      defined as a URI, which means that direct use of IRIs is not
247	      allowed in HTTP requests.

249	   b. The protocol or format carrying the IRIs MUST have a mechanism to
250	      represent the wide range of characters used in IRIs, either
251	      natively or by some protocol- or format-specific escaping
252	      mechanism (for example, numeric character references in [XML1]).

254	   c. The URI scheme definition, if it explicitly allows a percent sign
255	      ("%") in any syntactic component, SHOULD define the interpretation
256	      of sequences of percent-encoded octets (using "%XX" hex octets) as
257	      octets from sequences of UTF-8 encoded strings; this is recommended
258	      in the guidelines for registering new schemes, [RFC4395bis].  For
259	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
260	      [RFC2384] and the URN syntax [RFC2141].  Note that use of
261	      percent-encoding may also be restricted in some situations, for
262	      example, URI schemes that disallow percent-encoding might still be
263	      used with a fragment identifier which is percent-encoded (e.g.,
264	      [XPointer]).  See Section 5.4 for further discussion.
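
   Condition c. can be illustrated with a short Python sketch.  It
   assumes a scheme that follows the recommendation, so that runs of
   "%XX" octets are interpreted as UTF-8; the function name is
   hypothetical.  Input that does not decode as UTF-8 is rejected
   rather than guessed at.

```python
from urllib.parse import unquote_to_bytes


def decode_pct_utf8(text: str) -> str:
    """Interpret percent-encoded octets in `text` as UTF-8 (illustrative)."""
    # unquote_to_bytes turns each %XX into one octet and leaves other
    # characters as their UTF-8 bytes.
    octets = unquote_to_bytes(text)
    # Strict decoding raises UnicodeDecodeError for non-UTF-8 octets.
    return octets.decode("utf-8")


# Usage: six percent-encoded octets that form two UTF-8 characters.
decoded = decode_pct_utf8("%E6%97%A5%E6%9C%AC")
print(len(decoded))
```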

266	1.3.  Definitions

268	   The following definitions are used in this document; they follow the
269	   terms in [RFC2130], [RFC2277], and [ISO10646].

271	   character:  A member of a set of elements used for the organization,
272	      control, or representation of data.  For example, "LATIN CAPITAL
273	      LETTER A" names a character.

275	   octet:  An ordered sequence of eight bits considered as a unit.

277	   character repertoire:  A set of characters (set in the mathematical
278	      sense).

280	   sequence of characters:  A sequence of characters (one after
281	      another).

283	   sequence of octets:  A sequence of octets (one after another).

285	   character encoding:  A method of representing a sequence of
286	      characters as a sequence of octets (maybe with variants).  Also, a
287	      method of (unambiguously) converting a sequence of octets into a
288	      sequence of characters.

290	   charset:  The name of a parameter or attribute used to identify a
291	      character encoding.

293	   UCS:  Universal Character Set. The coded character set defined by
294	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].

296	   IRI reference:  Denotes the common usage of an Internationalized
297	      Resource Identifier.  An IRI reference may be absolute or
298	      relative.  However, the "IRI" that results from such a reference
299	      only includes absolute IRIs; any relative IRI references are
300	      resolved to their absolute form.  Note that in [RFC2396] URIs did
301	      not include fragment identifiers, but in [RFC3986] fragment
302	      identifiers are part of URIs.

304	   LEIRI (Legacy Extended IRI) processing:  This term was used in
305	      various XML specifications to refer to strings that, although not
306	      valid IRIs, were acceptable input to the processing rules in
307	      Section 6.2.

309	   running text:  Human text (paragraphs, sentences, phrases) with
310	      syntax according to orthographic conventions of a natural
311	      language, as opposed to syntax defined for ease of processing by
312	      machines (e.g., markup, programming languages).

314	   protocol element:  Any portion of a message that affects processing
315	      of that message by the protocol in question.

317	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
318	      used for the initial creation.  This may be the initial creation
319	      of a resource with a certain identifier, or the initial exposition
320	      of a resource under a particular identifier.

322	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
323	      used when the identifier is generated by derivation from other
324	      information.

326	   parsed URI component:  When a URI processor parses a URI (following
327	      the generic syntax or a scheme-specific syntax), the result is a
328	      set of parsed URI components, each of which has a type
329	      (corresponding to the syntactic definition) and a sequence of URI
330	      characters.

332	   parsed IRI component:  When an IRI processor parses an IRI directly,
333	      following the general syntax or a scheme-specific syntax, the
334	      result is a set of parsed IRI components, each of which has a type
335	      (corresponding to the syntactic definition) and a sequence of IRI
336	      characters.  (This definition is analogous to "parsed URI
337	      component".)

339	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
340	      scheme's syntax has been extended to allow non-US-ASCII characters
341	      according to the rules in this document.

343	1.4.  Notation

345	   RFCs and Internet Drafts currently do not allow any characters
346	   outside the US-ASCII repertoire.  Therefore, this document uses
347	   various special notations to denote such characters in examples.

349	   In text, characters outside US-ASCII are sometimes referenced by
350	   using a prefix of 'U+', followed by four to six hexadecimal digits.

352	   To represent characters outside US-ASCII in examples, this document
353	   uses 'XML Notation'.

355	   XML Notation uses a leading '&#x', a trailing ';', and the
356	   hexadecimal number of the character in the UCS in between.  For
357	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
358	   notation, an actual '&' is denoted by '&amp;'.

360	   To denote actual octets in examples (as opposed to percent-encoded
361	   octets), the two hex digits denoting the octet are enclosed in "<"
362	   and ">".  For example, the octet often denoted as 0xc9 is denoted
363	   here as <c9>.
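
   The two notations can be sketched as small Python helpers.  The
   function names are hypothetical and the code is only an illustration
   of the conventions described above, not part of this specification.

```python
import re

# Matches either a hexadecimal character reference or an escaped '&'.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text: str) -> str:
    """Replace '&#xHHHH;' by the UCS character and '&amp;' by '&'."""
    def repl(match):
        if match.group(1) is None:
            return "&"
        return chr(int(match.group(1), 16))
    return _XML_NOTATION.sub(repl, text)


def to_xml_notation(text: str) -> str:
    """Write non-ASCII characters as '&#xHHHH;' and '&' as '&amp;'."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("&#x{:X};".format(ord(ch)))
    return "".join(out)


def octet_notation(data: bytes) -> str:
    """Write each octet as two lowercase hex digits between '<' and '>'."""
    return "".join("<{:02x}>".format(octet) for octet in data)


# Usage: the octet 0xc9 in the notation used by this document.
print(octet_notation(b"\xc9"))
```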

365	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
366	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
367	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

369	2.  IRI Syntax

371	   This section defines the syntax of Internationalized Resource
372	   Identifiers (IRIs).

374	   As with URIs, an IRI is defined as a sequence of characters, not as a
375	   sequence of octets.  This definition accommodates the fact that IRIs
376	   may be written on paper or read over the radio as well as stored or
377	   transmitted digitally.  The same IRI might be represented as
378	   different sequences of octets in different protocols or documents if
379	   these protocols or documents use different character encodings
380	   (and/or transfer encodings).  Using the same character encoding as
381	   the containing protocol or document ensures that the characters in
382	   the IRI can be handled (e.g., searched, converted, displayed) in the
383	   same way as the rest of the protocol or document.
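
   The distinction between characters and octets can be shown with a
   short Python sketch.  The example IRI and the choice of the three
   character encodings are assumptions made for illustration.

```python
# One IRI, as a sequence of characters (U+00E9 appears twice).
iri = "http://example.org/r\u00e9sum\u00e9"

# The same character sequence serialized with three character encodings.
as_utf8 = iri.encode("utf-8")
as_latin1 = iri.encode("iso-8859-1")
as_utf16 = iri.encode("utf-16-be")

# The octet sequences differ in content and length ...
print(len(as_utf8), len(as_latin1), len(as_utf16))

# ... but each one decodes back to the identical character sequence.
print(as_utf8.decode("utf-8") == as_latin1.decode("iso-8859-1") == iri)
```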

385	2.1.  Summary of IRI Syntax

387	   The IRI syntax extends the URI syntax in [RFC3986] by extending the
388	   class of unreserved characters, primarily by adding the characters of
389	   the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
390	   to the limitations given in the syntax rules below and in
391	   Section 5.1.

393	   The syntax and use of components and reserved characters is the same
394	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
395	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
396	   scheme are extended to allow parsing of IRIs using the same
397	   parsing rules.

399	   All the operations defined in [RFC3986], such as the resolution of
400	   relative references, can be applied to IRIs by IRI-processing
401	   software in exactly the same way as they are for URIs by URI-
402	   processing software.

404	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
405	   therefore MUST NOT be used for syntactical purposes, such as to
406	   delimit components in newly defined schemes.  For example, U+00A2,
407	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
408	   the 'iunreserved' category.  This is similar to the fact that it is
409	   not possible to use '-' as a delimiter in URIs, because it is in the
410	   'unreserved' category.

412	2.2.  ABNF for IRI References and IRIs

414	   An ABNF definition for IRI references (which are the most general
415	   concept and the start of the grammar) and IRIs is given here.  The
416	   syntax of this ABNF is described in [STD68].  Character numbers are
417	   taken from the UCS, without implying any actual binary encoding.
418	   Terminals in the ABNF are characters, not octets.

420	   The following grammar closely follows the URI grammar in [RFC3986],
421	   except that the range of unreserved characters is expanded to include
422	   UCS characters, with the restriction that private UCS characters can
423	   occur only in query parts.  The grammar is split into two parts:
424	   Rules that differ from [RFC3986] because of the above-mentioned
425	   expansion, and rules that are the same as those in [RFC3986].  For
426	   rules that are different from those in [RFC3986], the names of the
427	   non-terminals have been changed as follows.  If the non-terminal
428	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
429	   has been prefixed.  The rule <pct-form> has been introduced in order
430	   to be able to reference it from other parts of the document.

432	   The following rules are different from those in [RFC3986]:

434	   IRI            = scheme ":" ihier-part [ "?" iquery ]
435	                    [ "#" ifragment ]

437	   ihier-part     = "//" iauthority ipath-abempty
438	                  / ipath-absolute
439	                  / ipath-rootless
440	                  / ipath-empty

442	   IRI-reference  = IRI / irelative-ref

444	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

446	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
447	   irelative-part = "//" iauthority ipath-abempty
448	                  / ipath-absolute
449	                  / ipath-noscheme
450	                  / ipath-empty

452	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
453	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
454	   ihost          = IP-literal / IPv4address / ireg-name

456	   pct-form       = pct-encoded

458	   ireg-name      = *( iunreserved / sub-delims )

460	   ipath          = ipath-abempty   ; begins with "/" or is empty
461	                  / ipath-absolute  ; begins with "/" but not "//"
462	                  / ipath-noscheme  ; begins with a non-colon segment
463	                  / ipath-rootless  ; begins with a segment
464	                  / ipath-empty     ; zero characters

466	   ipath-abempty  = *( path-sep isegment )
467	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
468	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
469	   ipath-rootless = isegment-nz *( path-sep isegment )
470	   ipath-empty    = 0<ipchar>
471	   path-sep       = "/"

473	   isegment       = *ipchar
474	   isegment-nz    = 1*ipchar
475	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
476	                        / "@" )
477	                  ; non-zero-length segment without any colon ":"

479	   ipchar         = iunreserved / pct-form / sub-delims / ":"
480	                  / "@"

482	   iquery         = *( ipchar / iprivate / "/" / "?" )

484	   ifragment      = *( ipchar / "/" / "?" )

486	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

488	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
489	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
490	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
491	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
492	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
493	                  / %xD0000-DFFFD / %xE1000-EFFFD

495	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
496	                  / %x100000-10FFFD

498	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
499	   "greedy") algorithm applies.  For details, see [RFC3986].
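
   The <ucschar>, <iprivate> and <iunreserved> rules above can be
   checked with a few lines of Python.  The ranges are transcribed from
   the ABNF in this section; the function and constant names are
   hypothetical, and each function expects a single character.

```python
# Code point ranges transcribed from the ucschar rule.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Code point ranges transcribed from the iprivate rule.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),
]

# ALPHA / DIGIT / "-" / "." / "_" / "~"
_ASCII_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _in_ranges(ch, ranges):
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch: str) -> bool:
    return ch in _ASCII_UNRESERVED or is_ucschar(ch)


# Usage: U+00A2 (CENT SIGN) is in ucschar; U+E000 is only in iprivate.
print(is_ucschar("\u00a2"), is_ucschar("\ue000"), is_iprivate("\ue000"))
```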

501	   The following rules are the same as those in [RFC3986]:

503	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

505	   port           = *DIGIT

507	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

509	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

511	   IPv6address    =                            6( h16 ":" ) ls32
512	                  /                       "::" 5( h16 ":" ) ls32
513	                  / [               h16 ] "::" 4( h16 ":" ) ls32
514	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
515	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
516	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
517	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
518	                  / [ *5( h16 ":" ) h16 ] "::"              h16
519	                  / [ *6( h16 ":" ) h16 ] "::"

521	   h16            = 1*4HEXDIG
522	   ls32           = ( h16 ":" h16 ) / IPv4address

524	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

526	   dec-octet      = DIGIT                 ; 0-9
527	                  / %x31-39 DIGIT         ; 10-99
528	                  / "1" 2DIGIT            ; 100-199
529	                  / "2" %x30-34 DIGIT     ; 200-249
530	                  / "25" %x30-35          ; 250-255

532	   pct-encoded    = "%" HEXDIG HEXDIG

534	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
535	   reserved       = gen-delims / sub-delims
536	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
537	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
538	                  / "*" / "+" / "," / ";" / "="

540	   This syntax does not support IPv6 scoped addressing zone identifiers.
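
   As an illustration of the IPv4address and dec-octet rules above, the
   following Python sketch expresses them as a regular expression.  The
   names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are illustrative
   only.  The alternatives are listed longest first, and a full match is
   required, so that an input such as "256.1.1.1" or "01.2.3.4" is
   rejected.

```python
import re

# dec-octet: 250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(DEC_OCTET + r"(?:\." + DEC_OCTET + r"){3}")


def is_ipv4address(text):
    """True if the whole string matches the IPv4address production."""
    return IPV4ADDRESS.fullmatch(text) is not None
```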

542	3.  Processing IRIs and related protocol elements

544	   IRIs are meant to replace URIs in identifying resources within new
545	   versions of protocols, formats, and software components that use a
546	   UCS-based character repertoire.  Protocols and components may use and
547	   process IRIs directly.  However, there are still numerous systems and
548	   protocols which only accept URIs or components of parsed URIs; that
549	   is, they only accept sequences of characters within the subset of US-
550	   ASCII characters allowed in URIs.

552	   This section defines specific processing steps for IRI consumers
553	   which establish the relationship between the string given and the
554	   interpreted derivatives.  These processing steps apply to both IRIs
555	   and IRI references (i.e., absolute or relative forms); for IRIs, some
556	   steps are scheme specific.

558	3.1.  Converting to UCS

560	   Input that is already in a Unicode form (i.e., a sequence of Unicode
561	   characters or an octet-stream representing a Unicode-based character
562	   encoding such as UTF-8 or UTF-16) should be left as is and not
563	   normalized or changed.

565	   An IRI or IRI reference is a sequence of characters from the UCS.
566	   For input from presentations (written on paper, read aloud) or
567	   translation from other representations (a text stream using a legacy
568	   character encoding), convert the input to Unicode.  Note that some
569	   character encodings or transcriptions can be converted to or
570	   represented by more than one sequence of Unicode characters.  Ideally
571	   the resulting IRI would use a normalized form, such as Unicode
572	   Normalization Form C [UTR15], since that ensures a stable, consistent
573	   representation that is most likely to produce the intended results.
574	   Previous versions of this specification required normalization at
575	   this step.  However, attempts to require normalization in other
576	   protocols have met with strong enough resistance that requiring
577	   normalization here was considered impractical.  Implementers and
578	   users are cautioned that, while denormalized character sequences are
579	   valid, they might be difficult for other users or processes to
580	   reproduce and might lead to unexpected results.

582	3.2.  Parse the IRI into IRI components

584	   Parse the IRI, either as a relative reference (no scheme) or using
585	   scheme specific processing (according to the scheme given); the
586	   result is a set of parsed IRI components.

588	3.3.  General percent-encoding of IRI components

590	   Except as noted in the following subsections, IRI components are
591	   mapped to the equivalent URI components by percent-encoding those
592	   characters not allowed in URIs.  Previous processing steps will have
593	   removed some characters, and the interpretation of reserved
594	   characters will have already been done (with the syntactic reserved
595	   characters outside of the IRI component).  This mapping is defined
596	   for all sequences of Unicode characters, whether or not they are
597	   valid for the component in question.

599	   For each character which is not allowed anywhere in a valid URI,
600	   apply the following steps.

602	   Convert to UTF-8  Convert the character to a sequence of one or more
603	      octets using UTF-8 [RFC3629].

605	   Percent encode  Convert each octet of this sequence to %HH, where HH
606	      is the hexadecimal notation of the octet value.  The hexadecimal
607	      notation SHOULD use uppercase letters.  (This is the general URI
608	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

610	   Note that the mapping is an identity transformation for parsed URI
611	   components of valid URIs, and is idempotent: applying the mapping a
612	   second time will not change anything.
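
   The general mapping described in this section can be sketched in
   Python as follows.  This is a minimal illustration, not a normative
   implementation; the names URI_CHARS and map_component are
   assumptions introduced here.  The sketch treats "%" together with
   the unreserved and reserved characters of [RFC3986] as the set of
   characters allowed in a URI.

```python
# Characters that may appear somewhere in a valid URI:
# unreserved, gen-delims, sub-delims, and "%".
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~"          # remaining unreserved characters
    ":/?#[]@"       # gen-delims
    "!$&'()*+,;="   # sub-delims
    "%"             # introduces a percent-encoding
)


def map_component(component):
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)
```

   Because "%" is left untouched, existing percent-encodings pass
   through unchanged, which is what makes the mapping idempotent.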

614	3.4.  Mapping ireg-name

616	3.4.1.  Mapping using Percent-Encoding

618	   The ireg-name component SHOULD be converted according to the general
619	   procedure for percent-encoding of IRI components described in
620	   Section 3.3.

622	   For example, the IRI
623	   "http://r&#xE9;sum&#xE9;.example.org"
624	   will be converted to
625	   "http://r%C3%A9sum%C3%A9.example.org".

627	   This conversion for ireg-name is in line with Section 3.2.2 of
628	   [RFC3986], which does not mandate a particular registered name lookup
629	   technology.  For further background, see [RFC6055] and [Gettys].

631	3.4.2.  Mapping using Punycode

633	   The ireg-name component MAY also be converted as follows:

635	   If there are any sequences of <pct-encoded>, and their corresponding
636	   octets all represent valid UTF-8 octet sequences, then convert these
637	   back to Unicode character sequences.  (If any <pct-encoded> sequences
638	   are not valid UTF-8 octet sequences, then leave the entire field as
639	   is without any change, since punycode encoding would not succeed.)

641	   Replace the ireg-name part of the IRI by the part converted using the
642	   Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
643	   on each dot-separated label, and by using U+002E (FULL STOP) as a
644	   label separator.  This procedure may fail, which means that the IRI
645	   cannot be resolved.  If the domain name conversion fails, then the
646	   entire IRI conversion fails.  Processors that have no mechanism for
647	   signalling a failure MAY instead substitute an otherwise invalid host
648	   name, although such processing SHOULD be avoided.

651	   For example, the IRI
652	   "http://r&#xE9;sum&#xE9;.example.org"
653	   MAY be converted to
654	   "http://xn--rsum-bpad.example.org".

657	   This conversion for ireg-name will be better able to deal with legacy
658	   infrastructure that cannot handle percent-encoding in domain names.
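
   The label-by-label conversion can be approximated in Python with the
   standard "punycode" codec, as sketched below.  This is an assumption-
   laden simplification: it only applies the Punycode encoding and the
   "xn--" prefix, and it performs none of the validation required by
   the lookup procedure of [RFC5891].  The function names are
   illustrative.

```python
def label_to_ascii(label):
    """Encode one label with Punycode if it contains non-ASCII characters."""
    if label.isascii():
        return label
    # No IDNA validity checks are performed here (simplification).
    return "xn--" + label.encode("punycode").decode("ascii")


def host_to_ascii(host):
    """Convert each dot-separated label, using U+002E as the separator."""
    return ".".join(label_to_ascii(label) for label in host.split("."))
```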

660	3.4.3.  Additional Considerations

662	   Note:  Domain Names may appear in parts of an IRI other than the
663	      ireg-name part.  It is the responsibility of scheme-specific
664	      implementations (if the Internationalized Domain Name is part of
665	      the scheme syntax) or of server-side implementations (if the
666	      Internationalized Domain Name is part of 'iquery') to apply the
667	      necessary conversions at the appropriate point.  Example: Trying
668	      to validate the Web page at
669	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
670	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
671	      example.org, which would convert to a URI of
672	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
673	      example.org.  The server-side implementation is responsible for
674	      making the necessary conversions to be able to retrieve the Web
675	      page.

677	   Note:  In this process, characters allowed in URI references and
678	      existing percent-encoded sequences are not encoded further.  (This
679	      mapping is similar to, but different from, the encoding applied
680	      when arbitrary content is included in some part of a URI.)  For
681	      example, an IRI of
682	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
683	      converted to
684	      "http://www.example.org/red%09ros%C3%A9#red", not to something
685	      like
686	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
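
   The difference between the two kinds of encoding in the note above
   can be reproduced with Python's urllib.parse.quote.  In this sketch,
   the constant URI_SAFE (the reserved characters plus "%") is an
   assumption chosen to imitate the IRI-to-URI mapping; passing an
   empty "safe" set imitates the encoding of arbitrary content.

```python
from urllib.parse import quote

# Reserved characters plus "%": left untouched by the IRI-to-URI mapping.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"

iri = "http://www.example.org/red%09ros\u00e9#red"

# IRI-to-URI mapping: only characters not allowed in URIs are encoded.
as_uri = quote(iri, safe=URI_SAFE)

# Encoding as arbitrary content: delimiters and "%" are encoded as well.
as_content = quote(iri, safe="")
```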

688	3.5.  Mapping query components

690	   For compatibility with existing deployed HTTP infrastructure, the
691	   following special case applies for schemes "http" and "https" and
692	   IRIs whose origin has a document charset other than one which is UCS-
693	   based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
694	   of an IRI is mapped into a URI by using the document charset rather
695	   than UTF-8 as the binary representation before pct-encoding.  This
696	   mapping is not applied for any other scheme or component.
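
   A minimal Python sketch of this special case follows.  The function
   name map_query, its parameters, and the UCS_BASED set are
   assumptions made for illustration.  Characters that cannot be
   represented in the document charset raise an error in this sketch;
   this specification does not say how such characters are handled.

```python
from urllib.parse import quote

URI_SAFE = ":/?#[]@!$&'()*+,;=%"
# Assumption: charset labels treated as UCS-based for this sketch.
UCS_BASED = {"utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32"}


def map_query(query, scheme, document_charset):
    """Map an iquery to a URI query, honoring the http/https special case."""
    legacy = (
        scheme.lower() in ("http", "https")
        and document_charset.lower() not in UCS_BASED
    )
    encoding = document_charset if legacy else "utf-8"
    return quote(query, safe=URI_SAFE, encoding=encoding)
```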

698	3.6.  Mapping IRIs to URIs

700	   The mapping from an IRI to URI is accomplished by applying the
701	   mapping above (from IRI to URI components) and then reassembling a
702	   URI from the parsed URI components using the original punctuation
703	   that delimited the IRI components.
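
   The split, map, and reassemble sequence can be sketched with
   Python's generic URI parser, as shown below.  This is an
   illustration under stated assumptions: urlsplit is not scheme
   specific, urlunsplit may not preserve an empty query or fragment
   delimiter, and every component (including ireg-name) is mapped with
   the general percent-encoding of Section 3.3.  The name iri_to_uri is
   illustrative.

```python
from urllib.parse import quote, urlsplit, urlunsplit

URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Parse an IRI, percent-encode each component, and reassemble."""
    parts = urlsplit(iri)  # scheme, authority, path, query, fragment
    mapped = [quote(part, safe=URI_SAFE) for part in parts]
    return urlunsplit(mapped)
```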

705	4.  Converting URIs to IRIs

707	   In some situations, for presentation and further processing, it is
708	   desirable to convert a URI into an equivalent IRI without unnecessary
709	   percent encoding.  Of course, every URI is already an IRI in its own
710	   right without any conversion.  This section gives one possible
711	   procedure for URI to IRI mapping.

713	   The conversion described in this section, if given a valid URI, will
714	   result in an IRI that maps back to the URI used as an input for the
715	   conversion (except for potential case differences in percent-encoding
716	   and for potential percent-encoded unreserved characters).  However,
717	   the IRI resulting from this conversion may differ from the original
718	   IRI (if there ever was one).

720	   URI-to-IRI conversion removes percent-encodings, but not all percent-
721	   encodings can be eliminated.  There are several reasons for this:

723	   1. Some percent-encodings are necessary to distinguish percent-
724	      encoded and unencoded uses of reserved characters.

726	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
727	      octets.

729	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
730	      there is a very high probability, but no guarantee, that percent-
731	      encodings that can be interpreted as sequences of UTF-8 octets
732	      actually originated from UTF-8.  For a detailed discussion, see
733	      [Duerst97].)

735	   3. The conversion may result in a character that is not appropriate
736	      in an IRI.  See Section 2.2 and Section 5.1 for further details.

738	   4. IRI to URI conversion has different rules for dealing with domain
739	      names and query parameters.

741	   Conversion from a URI to an IRI MAY be done by using the following
742	   steps:

744	   1. Represent the URI as a sequence of octets in US-ASCII.

746	   2. Convert all percent-encodings ("%" followed by two hexadecimal
747	      digits) to the corresponding octets, except those corresponding to
748	      "%", characters in "reserved", and characters in US-ASCII not
749	      allowed in URIs.

751	   3. Re-percent-encode any octet produced in step 2 that is not part of
752	      a strictly legal UTF-8 octet sequence.

754	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
755	      represent characters that are not appropriate according to
756	      Section 2.2 and Section 5.1.

758	   5. Interpret the resulting octet sequence as a sequence of characters
759	      encoded in UTF-8.

761	   6. URIs known to contain domain names in the reg-name component
762	      SHOULD convert punycode-encoded domain name labels to the
763	      corresponding characters using the ToUnicode procedure.

765	   This procedure will convert as many percent-encoded characters as
766	   possible to characters in an IRI.  Because there are some choices
767	   when step 4 is applied (see Section 5.1), results may vary.
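
   Steps 1 through 5 above can be sketched in Python as follows.  This
   is a non-normative illustration with several assumptions: step 6
   (punycode labels) is omitted, "appropriate" in step 4 is approximated
   as "matches ucschar and is not a bidi formatting character", and all
   names are introduced here for the example only.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
# Assumption: these stand in for the Section 2.2 / 5.1 checks of step 4.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
TOKEN = re.compile(r"%[0-9A-Fa-f]{2}|.", re.DOTALL)


def _appropriate(cp):
    in_ucschar = any(low <= cp <= high for low, high in UCSCHAR_RANGES)
    return in_ucschar and cp not in BIDI_FORMATTING


def _pct(octets):
    return "".join("%{:02X}".format(b) for b in octets)


def _steps_3_to_5(octets):
    """Decode a run of non-ASCII octets, re-encoding what must stay encoded."""
    out, i = [], 0
    while i < len(octets):
        for n in (2, 3, 4):  # non-ASCII UTF-8 sequences are 2 to 4 octets
            chunk = octets[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            # Step 4: keep inappropriate characters percent-encoded.
            out.append(ch if _appropriate(ord(ch)) else _pct(chunk))
            i += len(chunk)
            break
        else:
            out.append(_pct(octets[i:i + 1]))  # step 3: not legal UTF-8
            i += 1
    return "".join(out)


def uri_to_iri(uri):
    uri.encode("ascii")  # step 1: raises if the input is not US-ASCII
    out, pending = [], bytearray()
    for token in TOKEN.findall(uri):
        value = int(token[1:], 16) if len(token) == 3 else None
        if value is not None and value >= 0x80:
            pending.append(value)  # step 2: decoded non-ASCII octet
            continue
        out.append(_steps_3_to_5(bytes(pending)))
        pending.clear()
        if value is not None and chr(value) in UNRESERVED:
            out.append(chr(value))  # step 2: encoded unreserved character
        else:
            out.append(token)  # literal, or an encoding that must be kept
    out.append(_steps_3_to_5(bytes(pending)))
    return "".join(out)
```

   Re-encoding in steps 3 and 4 always writes uppercase hexadecimal
   digits, so the case of the original percent-encoding is not
   preserved for those octets.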

769	   Conversions from URIs to IRIs MUST NOT use any character encoding
770	   other than UTF-8 in steps 3 and 4, even if it might be possible to
771	   guess from the context that a character encoding other than UTF-8
772	   was used in the URI.  For example, the URI
773	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
774	   interpreted to contain two e-acute characters encoded as iso-8859-1.
775	   It must not be converted to an IRI containing these e-acute
776	   characters.  Otherwise, in the future the IRI will be mapped to
777	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
778	   URI from "http://www.example.org/r%E9sum%E9.html".

780	4.1.  Examples

782	   This section shows various examples of converting URIs to IRIs.  Each
783	   example shows the result after each of the steps 1 through 6 is
784	   applied.  XML Notation is used for the final result.  Octets are
785	   denoted by "<" followed by two hexadecimal digits followed by ">".

787	   The following example contains the sequence "%C3%BC", which is a
788	   strictly legal UTF-8 sequence, and which is converted into the actual
789	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
790	   u-umlaut).

792	   1. http://www.example.org/D%C3%BCrst

794	   2. http://www.example.org/D<c3><bc>rst

796	   3. http://www.example.org/D<c3><bc>rst

798	   4. http://www.example.org/D<c3><bc>rst

800	   5. http://www.example.org/D&#xFC;rst

802	   6. http://www.example.org/D&#xFC;rst

804	   The following example contains the sequence "%FC", which might
805	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
806	   iso-8859-1 character encoding.  (It might represent other characters
807	   in other character encodings.  For example, the octet <fc> in iso-
808	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
809	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
810	   encoded in step 3.

812	   1. http://www.example.org/D%FCrst

814	   2. http://www.example.org/D<fc>rst

816	   3. http://www.example.org/D%FCrst

818	   4. http://www.example.org/D%FCrst

820	   5. http://www.example.org/D%FCrst

822	   6. http://www.example.org/D%FCrst

824	   The following example contains "%e2%80%ae", which is the percent-
825	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
826	   The direct use of this character is forbidden in an IRI.  Therefore,
827	   the corresponding octets are re-percent-encoded in step 4.  This
828	   example shows that the case (upper- or lowercase) of letters used in
829	   percent-encodings may not be preserved.  The example also contains a
830	   punycode-encoded domain name label (xn--99zt52a), which is converted
831	   to the corresponding characters in step 6.

834	   1. http://xn--99zt52a.example.org/%e2%80%ae

836	   2. http://xn--99zt52a.example.org/<e2><80><ae>

838	   3. http://xn--99zt52a.example.org/<e2><80><ae>

840	   4. http://xn--99zt52a.example.org/%E2%80%AE

842	   5. http://xn--99zt52a.example.org/%E2%80%AE

844	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

846	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
847	   (Japanese Natto).

850	5.  Use of IRIs

852	5.1.  Limitations on UCS Characters Allowed in IRIs

854	   This section discusses limitations on characters and character
855	   sequences usable for IRIs beyond those given in Section 2.2.  The
856	   considerations in this section are relevant when IRIs are created and
857	   when URIs are converted to IRIs.

859	   a. The repertoire of characters allowed in each IRI component is
860	      limited by the definition of that component.  For example, the
861	      definition of the scheme component does not allow characters
862	      beyond US-ASCII.

864	      (Note: In accordance with URI practice, generic IRI software
865	      cannot and should not check for such limitations.)

867	   b. The UCS contains many areas of characters for which there are
868	      strong visual look-alikes.  Because of the likelihood of
869	      transcription errors, these should be avoided.  This includes
870	      the full-width equivalents of Latin characters, half-width
871	      Katakana characters for Japanese, and many others.  It also
872	      includes many look-alikes of "space", "delims", and "unwise",
873	      characters excluded in [RFC3491].

875	   Additional information is available from [UNIXML].  [UNIXML] is
876	   written in the context of running text rather than in that of
877	   identifiers.  Nevertheless, it discusses many of the categories of
878	   characters not appropriate for IRIs.

880	5.2.  Software Interfaces and Protocols

882	   Although an IRI is defined as a sequence of characters, software
883	   interfaces for URIs typically function on sequences of octets or
884	   other kinds of code units.  Thus, software interfaces and protocols
885	   MUST define which character encoding is used.

887	   Intermediate software interfaces between IRI-capable components and
888	   URI-only components MUST map the IRIs per Section 3.6, when
889	   transferring from IRI-capable to URI-only components.  This mapping
890	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
891	   between components that are known to be able to handle IRIs.

893	5.3.  Format of URIs and IRIs in Documents and Protocols

895	   Document formats that transport URIs may have to be upgraded to allow
896	   the transport of IRIs.  In cases where the document as a whole has a
897	   native character encoding, IRIs MUST also be encoded in this
898	   character encoding and converted accordingly by a parser or
899	   interpreter.  IRI characters not expressible in the native character
900	   encoding SHOULD be escaped by using the escaping conventions of the
901	   document format if such conventions are available.  Alternatively,
902	   they MAY be percent-encoded according to Section 3.6.  For example,
903	   in HTML or XML, numeric character references SHOULD be used.  If a
904	   document as a whole has a native character encoding and that
905	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
906	   the document in the UTF-8 character encoding.
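
   For an HTML or XML document in a legacy encoding, the rule above can
   be illustrated with Python's "xmlcharrefreplace" error handler, which
   writes numeric character references for characters the target
   encoding cannot express.  The function name embed_iri is an
   assumption for this sketch, and escaping of markup-significant
   characters such as "&" is out of scope here.

```python
def embed_iri(iri, document_encoding):
    """Encode an IRI in the document's native character encoding.

    Characters that the encoding cannot express are written as numeric
    character references instead of being percent-encoded.
    """
    return iri.encode(document_encoding, errors="xmlcharrefreplace")
```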

908	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
909	   although they use different terminology.  HTML 4.0 [HTML4] defines
910	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
911	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
912	   based upon them allow IRIs.  Also, it is expected that all relevant
913	   new W3C formats and protocols will be required to handle IRIs
914	   [CharMod].

916	5.4.  Use of UTF-8 for Encoding Original Characters

918	   This section discusses details and gives examples for point c) in
919	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
920	   IRI in question has to encode original characters into octets by
921	   using UTF-8.  This can be specified for all URIs of a URI scheme or
922	   can apply to individual URIs for schemes that do not specify how to
923	   encode original characters.  It can apply to the whole URI, or only
924	   to some part.  For background information on encoding characters into
925	   URIs, see also Section 2.5 of [RFC3986].

927	   For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
928	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
929	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
930	   because the HTTP URI scheme does not specify how to encode original
931	   characters, only some HTTP URLs can have corresponding but different
932	   IRIs.

934	   For example, for a document with a URI of
935	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
936	   construct a corresponding IRI (in XML notation, see Section 1.4):
937	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
938	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
939	   encoded representation of that character).  On the other hand, for a
940	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
941	   percent-encoding octets cannot be converted to actual characters in
942	   an IRI, as the percent-encoding is not based on UTF-8.

944	   For most URI schemes, there is no need to upgrade their scheme
945	   definition in order for them to work with IRIs.  The main case where
946	   upgrading makes sense is when a scheme definition, or a particular
947	   component of a scheme, is strictly limited to the use of US-ASCII
948	   characters with no provision to include non-ASCII characters/octets
949	   via percent-encoding, or if a scheme definition currently uses highly
950	   scheme-specific provisions for the encoding of non-ASCII characters.
951	   An example of this is the mailto: scheme [RFC2368].

953	   This specification updates the IANA registry of URI schemes to note
954	   their applicability to IRIs, see Section 8.  All IRIs use URI
955	   schemes, and all URIs with URI schemes can be used as IRIs, even
956	   though in some cases only by using URIs directly as IRIs, without any
957	   conversion.

959	   Scheme definitions can impose restrictions on the syntax of scheme-
960	   specific URIs; i.e., URIs that are admissible under the generic URI
961	   syntax [RFC3986] may not be admissible due to narrower syntactic
962	   constraints imposed by a URI scheme specification.  URI scheme
963	   definitions cannot broaden the syntactic restrictions of the generic
964	   URI syntax; otherwise, it would be possible to generate URIs that
965	   satisfied the scheme-specific syntactic constraints without
966	   satisfying the syntactic constraints of the generic URI syntax.
967	   However, additional syntactic constraints imposed by URI scheme
968	   specifications are applicable to IRIs, as the corresponding URI
969	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
970	   under the syntactic restrictions of generic URI syntax and any
971	   narrower restrictions imposed by the corresponding URI scheme
972	   specification.

974	   The requirement for the use of UTF-8 generally applies to all parts
975	   of a URI.  However, it is possible that the capability of IRIs to
976	   represent a wide range of characters directly is used just in some
977	   parts of the IRI (or IRI reference).  The other parts of the IRI may
978	   only contain US-ASCII characters, or they may not be based on UTF-8.
979	   They may be based on another character encoding, or they may directly
980	   encode raw binary data (see also [RFC2397]).

982	   For example, it is possible to have a URI reference of
983	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
984	   document name is encoded in iso-8859-1 based on server settings, but
985	   where the fragment identifier is encoded in UTF-8 according to
986	   [XPointer].  The IRI corresponding to the above URI would be (in XML
987	   notation)
988	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

990	   Similar considerations apply to query parts.  The functionality of
991	   IRIs (namely, to be able to include non-ASCII characters) can only be
992	   used if the query part is encoded in UTF-8.

994	5.5.  Relative IRI References

996	   Processing of relative IRI references against a base is handled
997	   straightforwardly; the algorithms of [RFC3986] can be applied
998	   directly, treating the characters additionally allowed in IRI
999	   references in the same way that unreserved characters are in URI
1000	   references.
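
   As an informal illustration, Python's urllib.parse.urljoin, which
   follows the reference resolution algorithm of [RFC3986], can be
   applied to IRI references unchanged.  The base and references below
   are made-up examples.

```python
from urllib.parse import urljoin

base = "http://example.org/a/b"

# Non-ASCII characters pass through resolution like unreserved ones.
resolved_path = urljoin(base, "../r\u00e9sum\u00e9?q=\u00fc")
resolved_fragment = urljoin(base, "#\u7d0d\u8c46")
```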

1002	6.  Legacy Extended IRIs (LEIRIs)

1004	   For historic reasons, some formats have allowed variants of IRIs that
1005	   are somewhat less restricted in syntax.  This section provides a
1006	   definition and a name (Legacy Extended IRI or LEIRI) for these
1007	   variants for easier reference.  These variants have to be used with
1008	   care; they require further processing before being fully
1009	   interchangeable as IRIs.  New protocols and formats SHOULD NOT use
1010	   Legacy Extended IRIs.  Even where Legacy Extended IRIs are allowed,
1011	   only IRIs fully conforming to the syntax definition in Section 2.2
1012	   SHOULD be created, generated, and used.  The provisions in this
1013	   section also apply to Legacy Extended IRI references.

1015	6.1.  Legacy Extended IRI Syntax

1017	   The syntax of Legacy Extended IRIs is the same as that for IRIs,
1018	   except that ucschar is redefined as follows:

1020	         ucschar        = " " / "<" / ">" / '"' / "{" / "}" / "|"
1021	         / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1022	         / %xE000-FFFD / %x10000-10FFFF

1024	   The restriction on bidirectional formatting characters in [Bidi] is
1025	   lifted.  The iprivate production becomes redundant.

1027	   Likewise, the syntax for Legacy Extended IRI references (LEIRI
1028	   references) is the same as that for IRI references with the above
1029	   redefinition of ucschar applied.

1031	   Formats that use Legacy Extended IRIs or Legacy Extended IRI
1032	   references MAY further restrict the characters allowed therein,
1033	   either implicitly by the fact that the format as such does not allow
1034	   some characters, or explicitly.  An example of a character not
1035	   allowed implicitly may be the NUL character (U+0000).  However, all
1036	   the characters allowed in IRIs MUST still be allowed.

1038	6.2.  Conversion of Legacy Extended IRIs to IRIs

1040	   To convert a Legacy Extended IRI (reference) to an IRI (reference),
1041	   each character allowed in a Legacy Extended IRI (reference) but not
1042	   allowed in an IRI (reference) (see Section 6.3) MUST be percent-
1043	   encoded by applying the UTF-8 conversion and percent-encoding steps
1044	   of Section 3.3.
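
   A minimal Python sketch of this conversion follows.  It rests on
   simplifying assumptions: a character is kept only if it is a URI
   character or matches ucschar and is not a bidi formatting character,
   so private use code points are percent-encoded even in the query
   part, where IRIs would allow them.  The names used are illustrative.

```python
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def _allowed_in_iri(ch):
    """Approximate test for 'allowed in an IRI' (see assumptions above)."""
    if ch in URI_CHARS:
        return True
    cp = ord(ch)
    if cp in BIDI_FORMATTING:
        return False
    return any(low <= cp <= high for low, high in UCSCHAR_RANGES)


def leiri_to_iri(leiri):
    """Percent-encode characters allowed in LEIRIs but not in IRIs."""
    out = []
    for ch in leiri:
        if _allowed_in_iri(ch):
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```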

1045	6.3.  Characters Allowed in Legacy Extended IRIs but not in IRIs

1047	   This section provides a list of the groups of characters and code
1048	   points that are allowed in Legacy Extended IRIs, but are not allowed
1049	   in IRIs or are allowed in IRIs only in the query part.  For each
1050	   group of characters, advice on the usage of these characters is also
1051	   given, concentrating on the reasons not to use them.

1053	      Space (U+0020): Some formats and applications use space as a
1054	      delimiter, e.g. for items in a list.  Appendix C of [RFC3986] also
1055	      mentions that white space may have to be added when displaying or
1056	      printing long URIs; the same applies to long IRIs.  This means
1057	      that spaces can disappear, or can cause the Legacy Extended IRI to
1058	      be interpreted as two or more separate IRIs.

1060	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1061	      C of [RFC3986] suggests the use of double-quotes
1062	      ("http://example.com/") and angle brackets (<http://example.com/>)
1063	      as delimiters for URIs in plain text.  These conventions are often
1064	      used, and also apply to IRIs.  Legacy Extended IRIs using these
1065	      characters will be cut off at the wrong place.

1067	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1068	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters were
1069	      originally excluded from URIs because the respective codepoints
1070	      are assigned to different graphic characters in some 7-bit or
1071	      8-bit encodings.  Despite the move to Unicode, some of these
1072	      characters are still occasionally displayed differently on some
1073	      systems, e.g., U+005C as a Japanese Yen symbol.  Also, the fact
1074	      that these characters are not used in URIs or IRIs has encouraged
1075	      their use outside URIs or IRIs in contexts that may include URIs
1076	      or IRIs.  If a Legacy Extended IRI with such a character is used
1077	      in such a context, the Legacy Extended IRI will be interpreted
1078	      piecemeal.

1080	      The controls (C0 controls, DEL, and C1 controls, U+0000-001F and
1081	      U+007F-009F): There is no way to transmit these characters
1082	      reliably except potentially in electronic form.  Even when in
1083	      electronic form, some software components might silently filter
1084	      out some of these characters, or may stop processing altogether
1085	      when encountering some of them.  These characters may affect text
1086	      display in subtle, unnoticeable ways or in drastic, global, and
1087	      irreversible ways depending on the hardware and software involved.
1088	      The use of some of these characters may allow malicious users to
1089	      manipulate the display of a Legacy Extended IRI and its context.

1091	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1092	      characters affect the display ordering of characters.  Displayed
1093	      Legacy Extended IRIs containing these characters cannot be
1094	      converted back to electronic form (logical order) unambiguously.
1095	      These characters may allow malicious users to manipulate the
1096	      display of a Legacy Extended IRI and its context.

1098	      Specials (U+FFF0-FFFD): These code points provide functionality
1099	      beyond that useful in a Legacy Extended IRI, for example byte
1100	      order identification, annotation, and replacements for unknown
1101	      characters and objects.  Their use and interpretation in a Legacy
1102	      Extended IRI serves no purpose and may lead to confusing display
1103	      variations.

1105	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1106	      10FFFD): Display and interpretation of these code points is by
1107	      definition undefined without private agreement.  Therefore, these
1108	      code points are not suited for use on the Internet.  They are not
1109	      interoperable and may have unpredictable effects.

1111	      Tags (U+E0000-E0FFF): These characters provide a way to tag
1112	      language in Unicode plain text.  They are not appropriate for
1113	      Legacy Extended IRIs because language information in identifiers
1114	      cannot reliably be input, transmitted (e.g., on a visual medium
1115	      such as paper), or recognized.

1117	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1118	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1119	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1120	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1121	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1122	      non-characters.  Applications may use some of them internally, but
1123	      are not prepared to interchange them.

1125	   For reference, the code points and code units that are not allowed
1126	   even in Legacy Extended IRIs are also listed here:

1128	      Surrogate code units (D800-DFFF): These do not represent Unicode
1129	      codepoints.

1131	7.  URI/IRI Processing Guidelines (Informative)

1133	   This informative section provides guidelines for supporting IRIs in
1134	   the same software components and operations that currently process
1135	   URIs: Software interfaces that handle URIs, software that allows
1136	   users to enter URIs, software that creates or generates URIs,
1137	   software that displays URIs, formats and protocols that transport
1138	   URIs, and software that interprets URIs.  These may all require
1139	   modification before functioning properly with IRIs.  The
1140	   considerations in this section also apply to URI references and IRI
1141	   references.

1143	7.1.  URI/IRI Software Interfaces

1145	   Software interfaces that handle URIs, such as URI-handling APIs and
1146	   protocols transferring URIs, need interfaces and protocol elements
1147	   that are designed to carry IRIs.

1149	   In case the current handling in an API or protocol is based on US-
1150	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1151	   is compatible with US-ASCII, is in accordance with the
1152	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1153	   any case, the API or protocol definition must clearly define the
1154	   character encoding to be used.

1156	   The transfer from URI-only to IRI-capable components requires no
1157	   mapping, although the conversion described in Section 4 above may be
1158	   performed.  It is preferable not to perform this inverse conversion
1159	   unless it is certain this can be done correctly.

1161	7.2.  URI/IRI Entry

1163	   Some components allow users to enter URIs into the system by typing
1164	   or dictation, for example.  This software must be updated to allow
1165	   for IRI entry.

1167	   A person viewing a visual presentation of an IRI (as a sequence of
1168	   glyphs, in some order, in some visual display) will use an entry
1169	   method for characters in the user's language to input the IRI.
1170	   Depending on the script and the input method used, this may be a more
1171	   or less complicated process.

1173	   The process of IRI entry must ensure, as much as possible, that the
1174	   restrictions defined in Section 2.2 are met.  This may be done by
1175	   choosing appropriate input methods or variants/settings thereof, by
1176	   appropriately converting the characters being input, by eliminating
1177	   characters that cannot be converted, and/or by issuing a warning or
1178	   error message to the user.

1180	   As an example of variant settings, input method editors for East
1181	   Asian Languages usually allow the input of Latin letters and related
1182	   characters in full-width or half-width versions.  For IRI input, the
1183	   input method editor should be set so that it produces half-width
1184	   Latin letters and punctuation and full-width Katakana.

1186	   An input field primarily or solely used for the input of URIs/IRIs
1187	   might allow the user to view an IRI as it is mapped to a URI.  Other
1188	   places where the input of IRIs is frequent may offer the same view.
1189	   This will help users when some of the software they use does not yet
1190	   accept IRIs.

1192	   An IRI input component interfacing to components that handle URIs,
1193	   but not IRIs, must map the IRI to a URI before passing it to these
1194	   components.

1196	   For the input of IRIs with right-to-left characters, please see
1197	   [Bidi].

1199	7.3.  URI/IRI Transfer between Applications

1201	   Many applications (for example, mail user agents) try to detect URIs
1202	   appearing in plain text.  For this, they use some heuristics based on
1203	   URI syntax.  They then allow the user to click on such URIs and
1204	   retrieve the corresponding resource in an appropriate (usually
1205	   scheme-dependent) application.

1207	   Such applications would need to be upgraded, in order to use the IRI
1208	   syntax as a base for heuristics.  In particular, a non-ASCII
1209	   character should not be taken as the indication of the end of an IRI.
1210	   Such applications also would need to make sure that they correctly
1211	   convert the detected IRI from the character encoding of the document
1212	   or application where the IRI appears, to the character encoding used
1213	   by the system-wide IRI invocation mechanism, or to a URI (according
1214	   to Section 3.6) if the system-wide invocation mechanism only accepts
1215	   URIs.
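
   The difference between an ASCII-only heuristic and one that does not
   treat a non-ASCII character as the end of an IRI can be sketched as
   follows.  This is only an illustration in Python: both regular
   expressions are assumptions made for this example (real detectors
   use more elaborate rules, for instance for trailing punctuation), and
   the name find_iris is hypothetical.

```python
import re

# Assumed heuristic: an http(s) IRI runs until whitespace or one of < > ".
# Non-ASCII characters do not terminate the match.
IRI_PATTERN = re.compile(r'https?://[^\s<>"]+')

# ASCII-only heuristic of the kind that needs upgrading: the match stops
# at the first character outside the printable US-ASCII range.
ASCII_ONLY_PATTERN = re.compile(r'https?://[\x21-\x7e]+')


def find_iris(text):
    """Return all substrings of text that look like http(s) IRIs."""
    return IRI_PATTERN.findall(text)


text = "See http://example.org/r\u00e9sum\u00e9.html for details."
print(find_iris(text))                   # whole IRI is detected
print(ASCII_ONLY_PATTERN.findall(text))  # truncated at the first e-acute
```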

1217	   The clipboard is another frequently used way to transfer URIs and
1218	   IRIs from one application to another.  On most platforms, the
1219	   clipboard is able to store and transfer text in many languages and
1220	   scripts.  Correctly used, the clipboard transfers characters, not
1221	   octets, which will do the right thing with IRIs.

1223	7.4.  URI/IRI Generation

1225	   Systems that offer resources through the Internet, where those
1226	   resources have logical names, sometimes automatically generate URIs
1227	   for the resources they offer.  For example, some HTTP servers can
1228	   generate a directory listing for a file directory and then respond to
1229	   the generated URIs with the files.

1231	   Many legacy character encodings are in use in various file systems.
1232	   Many currently deployed systems do not transform the local character
1233	   representation of the underlying system before generating URIs.

1235	   For maximum interoperability, systems that generate resource
1236	   identifiers should make the appropriate transformations.  For
1237	   example, if a file system contains a file named
1238	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1239	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1240	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1241	   kept in a character encoding other than UTF-8.
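
   The transformation in this example can be sketched in Python as
   follows.  The function name filename_to_uri_segment and the choice of
   ISO-8859-1 as the local character encoding are assumptions made for
   illustration; a real server would need to know the actual encoding of
   its file system.

```python
from urllib.parse import quote


def filename_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Convert a locally encoded file name into a URI path segment."""
    # Interpret the stored bytes using the (assumed known) local encoding.
    name = raw_name.decode(local_encoding)
    # quote() encodes the string as UTF-8 and percent-encodes the result;
    # safe="" means "/" is escaped too, as this is a single path segment.
    return quote(name, safe="")


# A file name stored in ISO-8859-1 (assumed for this example).
stored = "r\u00e9sum\u00e9.html".encode("iso-8859-1")
print(filename_to_uri_segment(stored, "iso-8859-1"))
```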

1243	   This recommendation particularly applies to HTTP servers.  For FTP
1244	   servers, similar considerations apply; see [RFC2640].

1246	7.5.  URI/IRI Selection

1248	   In some cases, resource owners and publishers have control over the
1249	   IRIs used to identify their resources.  This control is mostly
1250	   executed by controlling the resource names, such as file names,
1251	   directly.

1253	   In these cases, it is recommended to avoid choosing IRIs that are
1254	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1255	   is easily confused with the digit one ("1"), and the upper-case oh
1256	   ("O") is easily confused with the digit zero ("0").  Publishers
1257	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1259	   Outside the US-ASCII repertoire, there are many more opportunities
1260	   for confusion; a complete set of guidelines is too lengthy to include
1261	   here.  As long as names are limited to characters from a single
1262	   script, native writers of a given script or language will know best
1263	   when ambiguities can appear, and how they can be avoided.  What may
1264	   look ambiguous to a stranger may be completely obvious to the average
1265	   native user.  On the other hand, in some cases, the UCS contains
1266	   variants for compatibility reasons; for example, for typographic
1267	   purposes.  These should be avoided wherever possible.  Although there
1268	   may be exceptions, newly created resource names should generally be
1269	   in NFKC [UTR15] (which means that they are also in NFC).

1271	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1272	   compatibility reasons.  Wherever possible, IRIs should use the two
1273	   letters "f" and "i" rather than the "fi" ligature.  An example where
1274	   the latter may be used is in the query part of an IRI for an explicit
1275	   search for a word written containing the "fi" ligature.
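
   Whether a candidate resource name is already in NFKC can be checked
   with a Unicode normalization library.  The sketch below uses the
   Python standard library; the function name is_nfkc is illustrative.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """Return True if name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


ligature_name = "\ufb01le.html"  # starts with the "fi" ligature, U+FB01
print(is_nfkc(ligature_name))                         # False
print(unicodedata.normalize("NFKC", ligature_name))   # file.html
print(is_nfkc("file.html"))                           # True
```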

1277	   In certain cases, there is a chance that characters from different
1278	   scripts look the same.  The best known example is the similarity of
1279	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1280	   such cases, IRIs should only be created where all the characters in a
1281	   single component are used together in a given language.  This usually
1282	   means that all of these characters will be from the same script, but
1283	   there are languages that mix characters from different scripts (such
1284	   as Japanese).  This is similar to the heuristics used to distinguish
1285	   between letters and numbers in the examples above.  Also, for Latin,
1286	   Greek, and Cyrillic, using lowercase letters results in fewer
1287	   ambiguities than using uppercase letters would.
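
   A rough check for components that mix scripts can be sketched as
   follows.  The Python standard library does not expose the Unicode
   Script property, so this sketch uses the first word of each
   character name as a crude stand-in; that shortcut, and the function
   name scripts_in, are assumptions made for illustration.  As noted
   above, a mix is not always an error, since some languages (such as
   Japanese) legitimately combine scripts.

```python
import unicodedata


def scripts_in(component: str) -> set:
    """Return rough script labels for the letters in an IRI component."""
    labels = set()
    for ch in component:
        if ch.isalpha():
            # First word of the character name, e.g. "LATIN" or "CYRILLIC".
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


print(scripts_in("example"))        # Latin letters only
print(scripts_in("ex\u0430mple"))   # contains Cyrillic small letter a
```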

1289	7.6.  Display of URIs/IRIs

1291	   In situations where the rendering software is not expected to display
1292	   non-ASCII parts of the IRI correctly using the available layout and
1293	   font resources, these parts should be percent-encoded before being
1294	   displayed.

1296	   For display of Bidi IRIs, please see [Bidi].

1298	7.7.  Interpretation of URIs and IRIs

1300	   Software that interprets IRIs as the names of local resources should
1301	   accept IRIs in multiple forms and convert and match them with the
1302	   appropriate local resource names.

1304	   First, these forms include both IRIs in the native character
1305	   encoding of the protocol and their URI counterparts.

1307	   Second, they may include URIs constructed using character encodings
1308	   other than UTF-8.  These URIs may be produced by user agents that do
1309	   not conform to this specification and that use legacy character
1310	   encodings to convert non-ASCII characters to URIs.  Whether this is
1311	   necessary, and what character encodings to cover, depends on a number
1312	   of factors, such as the legacy character encodings used locally and
1313	   the distribution of various versions of user agents.  For example,
1314	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
1315	   addition to UTF-8.
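
   Accepting URIs in more than one character encoding can be sketched
   as a sequence of decoding attempts.  The following Python sketch is
   an illustration only: the function name decode_path and the order of
   the fallback encodings are assumptions, and because some octet
   sequences are valid in more than one encoding, the order chosen
   affects the result.

```python
from urllib.parse import quote, unquote_to_bytes


def decode_path(pct_encoded: str, fallbacks=("shift_jis", "euc_jp")):
    """Undo percent-encoding, trying UTF-8 before legacy encodings.

    Returns (text, encoding), or (None, None) if nothing decodes.
    """
    raw = unquote_to_bytes(pct_encoded)
    for encoding in ("utf-8", *fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


word = "\u65e5\u672c"  # two kanji, used here as a sample resource name
print(decode_path(quote(word.encode("utf-8"))))      # decoded as UTF-8
print(decode_path(quote(word.encode("shift_jis"))))  # falls back to Shift_JIS
```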

1317	   Third, they may include additional mappings to be more user-friendly
1318	   and robust against transmission errors.  These would be similar to
1319	   how some servers currently treat URIs as case insensitive or perform
1320	   additional matching to account for spelling errors.  For characters
1321	   beyond the US-ASCII repertoire, this may, for example, include
1322	   ignoring the accents on received IRIs or resource names.  Please note
1323	   that such mappings, including case mappings, are language dependent.

1325	   It can be difficult to identify a resource unambiguously if too many
1326	   mappings are taken into consideration.  However, percent-encoded and
1327	   not percent-encoded parts of IRIs can always be clearly
1328	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
1329	   the potential for collisions lower than it may seem at first.

1331	7.8.  Upgrading Strategy

1333	   Where this recommendation places further constraints on software for
1334	   which many instances are already deployed, it is important to
1335	   introduce upgrades carefully and to be aware of the various
1336	   interdependencies.

1338	   If IRIs cannot be interpreted correctly, they should not be created,
1339	   generated, or transported.  This suggests that upgrading URI
1340	   interpreting software to accept IRIs should have highest priority.

1342	   On the other hand, a single IRI is interpreted only by a single or
1343	   very few interpreters that are known in advance, although it may be
1344	   entered and transported very widely.

1346	   Therefore, IRIs benefit most from a broad upgrade of software to be
1347	   able to enter and transport IRIs.  However, before an individual IRI
1348	   is published, care should be taken to upgrade the corresponding
1349	   interpreting software in order to cover the forms expected to be
1350	   received by various versions of entry and transport software.

1352	   The upgrade of generating software to generate IRIs instead of using
1353	   a local character encoding should happen only after the service is
1354	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
1355	   when the service accepts IRIs and the intervening infrastructure and
1356	   protocol are known to transport them safely.

1358	   Software converting from URIs to IRIs for display should be upgraded
1359	   only after upgraded entry software has been widely deployed to the
1360	   population that will see the displayed result.

1362	   Where there is a free choice of character encodings, it is often
1363	   possible to reduce the effort and dependencies for upgrading to IRIs
1364	   by using UTF-8 rather than another encoding.  For example, when a new
1365	   file-based Web server is set up, using UTF-8 as the character
1366	   encoding for file names will make the transition to IRIs easier.
1367	   Likewise, when a new Web form is set up using UTF-8 as the character
1368	   encoding of the form page, the returned query URIs will use UTF-8 as
1369	   the character encoding (unless the user, for whatever reason, changes
1370	   the character encoding) and will therefore be compatible with IRIs.

1372	   These recommendations, when taken together, will allow for the
1373	   extension from URIs to IRIs in order to handle characters other than
1374	   US-ASCII while minimizing interoperability problems.  For
1375	   considerations regarding the upgrade of URI scheme definitions, see
1376	   Section 5.4.

1378	8.  IANA Considerations

1380	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
1381	   this document when it is issued as an RFC.

1383	   IANA maintains a registry of "URI schemes".  A "URI scheme" also
1384	   serves as an "IRI scheme".

1386	   To clarify that the URI scheme registration process also applies to
1387	   IRIs, change the description of the "URI schemes" registry header to
1388	   say "[RFC4395] defines an IANA-maintained registry of URI Schemes.
1389	   These registries include the Permanent and Provisional URI Schemes.
1390	   RFC XXXX updates this registry to designate that schemes may also
1391	   indicate their usability as IRI schemes."

1393	   Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".

1395	9.  Security Considerations

1397	   The security considerations discussed in [RFC3986] also apply to
1398	   IRIs.  In addition, the following issues require particular care for
1399	   IRIs.

1401	   Incorrect encoding or decoding can lead to security problems.  For
1402	   example, some UTF-8 decoders do not check against overlong byte
1403	   sequences.  See [UTR36] Section 3 for details.
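
   As an illustration of the check mentioned above, a strict UTF-8
   decoder rejects overlong sequences instead of silently mapping them
   to a shorter character.  The sketch below relies on the strict
   behavior of Python's built-in UTF-8 decoder; the function name
   strict_utf8 is illustrative.

```python
def strict_utf8(raw: bytes):
    """Decode raw as UTF-8, returning None for any ill-formed input."""
    try:
        # The default error handler is "strict", which raises on
        # overlong sequences and other ill-formed input.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(strict_utf8(b"/"))             # the one-octet form of "/"
print(strict_utf8(b"\xc0\xaf"))      # overlong two-octet form: rejected
print(strict_utf8(b"\xe0\x80\xaf"))  # overlong three-octet form: rejected
```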

1405	   There are serious difficulties with relying on a human to verify that
1406	   an IRI (whether presented visually or aurally) is the same as
1407	   another IRI or is the one intended.  These problems exist with ASCII-
1408	   only URIs (bl00mberg.com vs. bloomberg.com) but are strongly
1409	   exacerbated when using the much larger character repertoire of
1410	   Unicode.  For details, see Section 2 of [UTR36].  Using
1411	   administrative and technical means to reduce the availability of such
1412	   exploits is possible, but they are difficult to eliminate altogether.
1413	   User agents SHOULD NOT rely on visual or perceptual comparison or
1414	   verification of IRIs as a means of validating or assuring safety,
1415	   correctness or appropriateness of an IRI.  Other means of presenting
1416	   users with the validity, safety, or appropriateness of visited sites
1417	   are being developed in the browser community as an alternative means
1418	   of avoiding these difficulties.

1420	   Besides the large character repertoire of Unicode, reasons for
1421	   confusion include different forms of normalization and different
1422	   normalization expectations, use of percent-encoding with various
1423	   legacy encodings, and bidirectionality issues.  See also [Bidi].

1425	   Confusion can occur in various IRI components, such as the domain
1426	   name part or the path part, or between IRI components.  For
1427	   considerations specific to the domain name part, see [RFC5890].  For
1428	   considerations specific to particular protocols or schemes, see the
1429	   security sections of the relevant specifications and registration
1430	   templates.  Administrators of sites that allow independent users to
1431	   create resources in the same sub area have to be careful.  Details
1432	   are discussed in Section 7.5.

1434	   The characters additionally allowed in Legacy Extended IRIs introduce
1435	   additional security issues.  For details, see Section 6.3.

1437	10.  Acknowledgements

1439	   This document was derived from [RFC3987]; the acknowledgments from
1440	   that specification still apply.

1442	   In addition, this document was influenced by contributions from (in
1443	   no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
1444	   John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
1445	   Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
1446	   Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
1447	   Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
1448	   Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
1449	   Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
1450	   Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
1451	   Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
1452	   Tobias, Marko Martin, Maciej Stachowiak, Wil Tan, Yui Naruse,
1453	   Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
1454	   Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex
1455	   Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland,
1456	   Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas
1457	   Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev.

1459	11.  Main Changes Since RFC 3987

1461	   This section describes the main changes since [RFC3987].

1463	11.1.  Split out Bidi, processing guidelines, comparison sections

1465	   Move some components (comparison, bidi, processing) into separate
1466	   documents.

1468	11.2.  Major restructuring of IRI processing model

1470	   Major restructuring of IRI processing model to make scheme-specific
1471	   translation necessary to handle IDNA requirements and for consistency
1472	   with web implementations.

1474	   Starting with IRI, you want one of:

1476	   a  IRI components (IRI parsed into UTF-8 pieces)

1478	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

1480	   c  whole URI (for passing on to some other system that wants whole
1481	      URIs)

1483	11.2.1.  OLD WAY

1485	   1.  Pct-encode the whole thing to a URI. (c1) If you want a
1486	       (maybe broken) whole URI, you might stop here.

1488	   2.  Parse the URI into URI components. (b1) If you want (maybe
1489	       broken) URI components, stop here.

1491	   3.  Decode the components (undoing the pct-encoding). (a) If you want
1492	       IRI components, stop here.

1494	   4.  Reencode: use a different encoding for some components (for
1495	       domain names, and query components in web pages, which depends on
1496	       the component, scheme, and context), and otherwise use pct-
1497	       encoding. (b2) If you want (good) URI components, stop here.

1499	   5.  Reassemble the reencoded components. (c2) If you want a (*good*)
1500	       whole URI, stop here.

1502	11.2.2.  NEW WAY

1504	   1.  Parse the IRI into IRI components using the generic syntax. (a)
1505	       If you want IRI components, stop here.

1507	   2.  Encode each component, using pct-encoding, IDN encoding, or
1508	       special query part encoding, depending on the component, scheme,
1509	       or context. (b) If you want URI components, stop here.

1511	   3.  Reassemble a whole URI from the URI components. (c) If you want a
1512	       whole URI, stop here.
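
   The three steps above can be sketched in Python as follows.  This is
   a simplified illustration, not the normative mapping: the function
   name iri_to_uri and the _KEEP constant are assumptions; userinfo and
   IP literals are not handled; the query is encoded with UTF-8 pct-
   encoding rather than any context-specific query encoding; and
   Python's built-in "idna" codec is used as a stand-in for IDN
   encoding, although it implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in path, query, and fragment.  "%" is kept so
# that existing percent-escapes are not encoded a second time.
_KEEP = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI component by component (simplified sketch)."""
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component in the way that suits it.
    host = parts.hostname or ""
    if host and not host.isascii():
        host = host.encode("idna").decode("ascii")    # IDN encoding
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path, safe=_KEEP)              # UTF-8 pct-encoding
    query = quote(parts.query, safe=_KEEP)
    fragment = quote(parts.fragment, safe=_KEEP)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://r\u00e9sum\u00e9.example.org/r\u00e9sum\u00e9.html"))
```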

1514	11.2.3.  Extension of Syntax

1516	   Added the tag range (U+E0000-E0FFF) to the iprivate production.  Some
1517	   IRIs generated with the new syntax may fail to pass very strict
1518	   checks relying on the old syntax.  But characters in this range
1519	   should be extremely infrequent anyway.

1521	11.2.4.  More to be added

1523	   TODO: There are more main changes that need to be documented in this
1524	   section.

1526	11.3.  Change Log

1528	   Note to RFC Editor: Please completely remove this section before
1529	   publication.

1531	11.3.1.  Changes after draft-ietf-iri-3987bis-01

1533	   Changes from draft-ietf-iri-3987bis-01 onwards are available as
1534	   changesets in the IETF tools subversion repository at http://
1535	   trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
1536	   draft-ietf-iri-3987bis.xml.

1538	11.3.2.  Changes from draft-duerst-iri-bis-07 to
1539	         draft-ietf-iri-3987bis-00

1541	   Changed draft name, date, last paragraph of abstract, and titles in
1542	   change log, and added this section in moving from
1543	   draft-duerst-iri-bis-07 (personal submission) to
1544	   draft-ietf-iri-3987bis-00 (WG document).

1546	11.3.3.  Changes from -06 to -07 of draft-duerst-iri-bis

1548	   Major restructuring of the processing model, see Section 11.2.

1550	11.4.  Changes from -00 to -01

1552	   o  Removed 'mailto:' before mail addresses of authors.

1554	   o  Added "<to be done>" as right side of 'href-strip' rule.  Fixed
1555	      '|' to '/' for alternatives.

1557	11.5.  Changes from -05 to -06 of draft-duerst-iri-bis-00

1559	   o  Add HyperText Reference, change abstract, acks and references for
1560	      it

1562	   o  Add Masinter back as another editor.

1564	   o  Masinter integrates HRef material from HTML5 spec.

1566	   o  Rewrite introduction sections to modernize.

1568	11.6.  Changes from -04 to -05 of draft-duerst-iri-bis

1570	   o  Updated references.

1572	   o  Changed IPR text to pre5378Trust200902.

1574	11.7.  Changes from -03 to -04 of draft-duerst-iri-bis

1576	   o  Added explicit abbreviation for LEIRIs.

1578	   o  Mentioned LEIRI references.

1580	   o  Completed text in LEIRI section about tag characters and about
1581	      specials.

1583	11.8.  Changes from -02 to -03 of draft-duerst-iri-bis

1585	   o  Updated some references.

1587	   o  Updated Michel Suignard's coordinates.

1589	11.9.  Changes from -01 to -02 of draft-duerst-iri-bis

1591	   o  Added tag range to iprivate (issue private-include-tags-115).

1593	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

1595	11.10.  Changes from -00 to -01 of draft-duerst-iri-bis

1597	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
1598	      based on input from the W3C XML Core WG.  Moved the relevant
1599	      subsections to the back and promoted them to a section.

1601	   o  Added some text re.  Legacy Extended IRIs to the security section.

1603	   o  Added an IANA Considerations Section.

1605	   o  Added this Change Log Section.

1607	   o  Added a section about "IRIs with Spaces/Controls" (converting from
1608	      a Note in RFC 3987).

1610	11.11.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

1612	      Fixed errata (see
1613	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

1615	12.  References

1617	12.1.  Normative References

1619	   [ASCII]    American National Standards Institute, "Coded Character
1620	              Set -- 7-bit American Standard Code for Information
1621	              Interchange", ANSI X3.4, 1986.

1623	   [ISO10646]
1624	              International Organization for Standardization, "ISO/IEC
1625	              10646:2011: Information Technology - Universal Multiple-
1626	              Octet Coded Character Set (UCS)", ISO Standard 10646,
1627	              March 2011, <http://standards.iso.org/ittf/
1628	              PubliclyAvailableStandards/
1629	              c051273_ISO_IEC_10646_2011(E).zip>.

1631	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1632	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1634	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1635	              Profile for Internationalized Domain Names (IDN)",
1636	              RFC 3491, March 2003.

1638	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1639	              10646", STD 63, RFC 3629, November 2003.

1641	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1642	              Resource Identifier (URI): Generic Syntax", STD 66,
1643	              RFC 3986, January 2005.

1645	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1646	              Applications (IDNA): Definitions and Document Framework",
1647	              RFC 5890, August 2010.

1649	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
1650	              Applications (IDNA): Protocol", RFC 5891, August 2010.

1652	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
1653	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

1655	   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
1656	              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
1657	              ISBN 978-1-936213-01-6)", October 2010.

1659	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
1660	              Unicode Standard Annex #15, March 2008,
1661	              <http://www.unicode.org/unicode/reports/tr15/
1662	              tr15-23.html>.

1664	12.2.  Informative References

1666	   [Bidi]     Duerst, M. and L. Masinter, "Guidelines for
1667	              Internationalized Resource Identifiers with Bi-directional
1668	              Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00
1669	              (work in progress), August 2011.

1671	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
1672	              Texin, "Character Model for the World Wide Web: Resource
1673	              Identifiers", World Wide Web Consortium Candidate
1674	              Recommendation, November 2004,
1675	              <http://www.w3.org/TR/charmod-resid>.

1677	   [Duerst97]
1678	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
1680	              11th International Unicode Conference, San Jose,
1681	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
1682	              papers/PDF/IUC11-UTF-8.pdf>.

1684	   [Equivalence]
1685	              Masinter, L. and M. Duerst, "Equivalence and
1686	              Canonicalization of Internationalized Resource Identifiers
1687	              (IRIs)", draft-ietf-iri-comparison-00 (work in progress),
1688	              August 2011.

1690	   [Gettys]   Gettys, J., "URI Model Consequences",
1691	              <http://www.w3.org/DesignIssues/ModelConsequences>.

1693	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
1694	              Specification", World Wide Web Consortium Recommendation,
1695	              December 1999,
1696	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

1698	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
1699	              IRIs for XML resource identification", World Wide Web
1700	              Consortium Note, November 2008,
1701	              <http://www.w3.org/TR/leiri/>.

1703	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
1704	              Extensions (MIME) Part One: Format of Internet Message
1705	              Bodies", RFC 2045, November 1996.

1707	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
1708	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
1709	              the IAB Character Set Workshop held 29 February - 1 March,
1710	              1996", RFC 2130, April 1997.

1712	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

1714	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

1716	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1717	              Languages", BCP 18, RFC 2277, January 1998.

1719	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
1720	              URL scheme", RFC 2368, July 1998.

1722	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

1724	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1725	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
1726	              August 1998.

1728	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
1729	              August 1998.

1731	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1732	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1733	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

1735	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
1736	              Protocol", RFC 2640, July 1999.

1738	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
1739	              Identifiers (IRIs)", RFC 3987, January 2005.

1741	   [RFC4395bis]
1742	              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
1743	              Registration Procedures for New URI/IRI Schemes",
1744	              draft-ietf-iri-4395bis-irireg-03 (work in progress),
1745	              July 2011.

1747	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1748	              Encodings for Internationalized Domain Names", RFC 6055,
1749	              February 2011.

1751	   [RFC6082]  Whistler, K., Adams, G., Duerst, M., Presuhn, R., and J.
1752	              Klensin, "Deprecating Unicode Language Tag Characters: RFC
1753	              2482 is Historic", RFC 6082, November 2010.

1755	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
1756	              Markup Languages", Unicode Technical Report #20, World
1757	              Wide Web Consortium Note, June 2003,
1758	              <http://www.w3.org/TR/unicode-xml/>.

1760	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
1761	              Considerations", Unicode Technical Report #36,
1762	              August 2010, <http://unicode.org/reports/tr36/>.

1764	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
1765	              Language (XLink) Version 1.0", World Wide Web
1766	              Consortium REC-xlink-20010627, June 2001,
1767	              <http://www.w3.org/TR/xlink/#link-locators>.

1769	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
1770	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
1771	              Edition)", World Wide Web Consortium REC-xml-20081126,
1772	              August 2006, <http://www.w3.org/TR/REC-xml>.

1774	   [XMLNamespace]
1775	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
1776	              "Namespaces in XML (Second Edition)", World Wide Web
1777	              Consortium REC-xml-names-20091208, August 2006,
1778	              <http://www.w3.org/TR/REC-xml-names>.

1780	   [XMLSchema]
1781	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
1782	              World Wide Web Consortium REC-xmlschema-2-20041028,
1783	              May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.

1785	   [XPointer]
1786	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
1787	              Framework", World Wide Web Consortium REC-xptr-framework-
1788	              20030325, March 2003,
1789	              <http://www.w3.org/TR/xptr-framework/#escaping>.

1791	Authors' Addresses

1793	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
1794		               possible, for example as "D&#252;rst" in XML and HTML.)
1795	   Aoyama Gakuin University
1796	   5-10-1 Fuchinobe
1797	   Sagamihara, Kanagawa  229-8558
1798	   Japan

1800	   Phone: +81 42 759 6329
1801	   Fax:   +81 42 759 6495
1802	   Email: duerst@it.aoyama.ac.jp
1803	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
1804	          (Note: This is the percent-encoded form of an IRI.)

1806	   Michel Suignard
1807	   Unicode Consortium
1808	   P.O. Box 391476
1809	   Mountain View, CA  94039-1476
1810	   U.S.A.

1812	   Phone: +1-650-693-3921
1813	   Email: michel@unicode.org
1814	   URI:   http://www.suignard.com

1815	   Larry Masinter
1816	   Adobe
1817	   345 Park Ave
1818	   San Jose, CA  95110
1819	   U.S.A.

1821	   Phone: +1-408-536-3024
1822	   Email: masinter@adobe.com
1823	   URI:   http://larry.masinter.net