idnits 2.17.1 

draft-ietf-iri-3987bis-13.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There is 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 20, 2012) is 4206 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  == Outdated reference: A later version (-03) exists of
     draft-ietf-iri-bidi-guidelines-02

  == Outdated reference: A later version (-02) exists of
     draft-ietf-iri-comparison-01

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)


     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource Identifiers                         M. Duerst
3	(iri)                                           Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: 3987 (if approved)                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: April 23, 2013                                            Adobe
8	                                                        October 20, 2012

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-13

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   Defining IRI as a new protocol element (rather than updating or
22	   extending the definition of URI) allows independent orderly
23	   transitions: protocols and languages that use URIs must explicitly
24	   choose to allow IRIs.

26	   Guidelines are provided for the use and deployment of IRIs and
27	   related protocol elements when revising protocols, formats, and
28	   software components that currently deal only with URIs.

30	   This document is part of a set of documents intended to replace RFC
31	   3987.

33	RFC Editor: Please remove the next paragraph before publication.

35	   This document, and several companion documents, are intended to
36	   obsolete RFC 3987.  For discussion and comments on these drafts,
37	   please join the IETF IRI WG by subscribing to the mailing list
38	   public-iri@w3.org, archives at
39	   http://lists.w3.org/archives/public/public-iri/.  For a list of open
40	   issues, please see the issue tracker of the WG at
41	   http://trac.tools.ietf.org/wg/iri/trac/report/1.  For a list of
42	   individual edits, please see the change history at
43	   http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.

45	   This document is available in (line-printer ready) plaintext ASCII
46	   and PDF.  It is also available in HTML from
47	   http://www.sw.it.aoyama.ac.jp/2012/pub/
48	   draft-ietf-iri-3987bis-13.html, and in UTF-8 plaintext from http://
49	   www.sw.it.aoyama.ac.jp/2012/pub/draft-ietf-iri-3987bis-13.utf8.txt.
50	   While all these versions are identical in their technical content,
51	   the HTML, PDF, and UTF-8 plaintext versions show non-ASCII
52	   characters directly.  This often makes it easier to understand
53	   examples, and readers are therefore advised to consult these versions
54	   in preference or as a supplement to the ASCII version.

56	Status of this Memo

58	   This Internet-Draft is submitted in full conformance with the
59	   provisions of BCP 78 and BCP 79.

61	   Internet-Drafts are working documents of the Internet Engineering
62	   Task Force (IETF).  Note that other groups may also distribute
63	   working documents as Internet-Drafts.  The list of current Internet-
64	   Drafts is at http://datatracker.ietf.org/drafts/current/.

66	   Internet-Drafts are draft documents valid for a maximum of six months
67	   and may be updated, replaced, or obsoleted by other documents at any
68	   time.  It is inappropriate to use Internet-Drafts as reference
69	   material or to cite them other than as "work in progress."

71	   This Internet-Draft will expire on April 23, 2013.

73	Copyright Notice

75	   Copyright (c) 2012 IETF Trust and the persons identified as the
76	   document authors.  All rights reserved.

78	   This document is subject to BCP 78 and the IETF Trust's Legal
79	   Provisions Relating to IETF Documents
80	   (http://trustee.ietf.org/license-info) in effect on the date of
81	   publication of this document.  Please review these documents
82	   carefully, as they describe your rights and restrictions with respect
83	   to this document.  Code Components extracted from this document must
84	   include Simplified BSD License text as described in Section 4.e of
85	   the Trust Legal Provisions and are provided without warranty as
86	   described in the Simplified BSD License.

88	   This document may contain material from IETF Documents or IETF
89	   Contributions published or made publicly available before November
90	   10, 2008.  The person(s) controlling the copyright in some of this
91	   material may not have granted the IETF Trust the right to allow
92	   modifications of such material outside the IETF Standards Process.
93	   Without obtaining an adequate license from the person(s) controlling
94	   the copyright in such materials, this document may not be modified
95	   outside the IETF Standards Process, and derivative works of it may
96	   not be created outside the IETF Standards Process, except to format
97	   it for publication as an RFC or to translate it into languages other
98	   than English.

100	Table of Contents

102	   1.  Introduction
103	     1.1.   Overview and Motivation
104	     1.2.   Applicability
105	     1.3.   Definitions
106	     1.4.   Notation
107	   2.  IRI Syntax
108	     2.1.   Summary of IRI Syntax
109	     2.2.   ABNF for IRI References and IRIs
110	   3.  Processing IRIs and related protocol elements
111	     3.1.   Converting to UCS
112	     3.2.   Parse the IRI into IRI components
113	     3.3.   General percent-encoding of IRI components
114	     3.4.   Mapping ireg-name
115	       3.4.1.  Mapping using Percent-Encoding
116	       3.4.2.  Mapping using Punycode
117	       3.4.3.  Additional Considerations
118	     3.5.   Mapping query components
119	     3.6.   Mapping IRIs to URIs
120	   4.  Converting URIs to IRIs
121	     4.1.   Limitations
122	     4.2.   Conversion
123	     4.3.   Examples
124	   5.  Use of IRIs
125	     5.1.   Limitations on UCS Characters Allowed in IRIs
126	     5.2.   Software Interfaces and Protocols
127	     5.3.   Format of URIs and IRIs in Documents and Protocols
128	     5.4.   Use of UTF-8 for Encoding Original Characters
129	     5.5.   Relative IRI References
130	   6.  Legacy Extended IRIs (LEIRIs)
131	     6.1.   Legacy Extended IRI Syntax
132	     6.2.   Conversion of Legacy Extended IRIs to IRIs
133	     6.3.   Characters Allowed in Legacy Extended IRIs but not in
134	            IRIs
135	   7.  Processing of URIs/IRIs/URLs by Web Browsers
136	   8.  URI/IRI Processing Guidelines (Informative)
137	     8.1.   URI/IRI Software Interfaces
138	     8.2.   URI/IRI Entry
139	     8.3.   URI/IRI Transfer between Applications
140	     8.4.   URI/IRI Generation
141	     8.5.   URI/IRI Selection
142	     8.6.   Display of URIs/IRIs
143	     8.7.   Interpretation of URIs and IRIs
144	     8.8.   Upgrading Strategy
145	   9.  IANA Considerations
146	   10. Security Considerations
147	   11. Acknowledgements
148	   12. Main Changes Since RFC 3987
149	     12.1.  Split out Bidi, processing guidelines, comparison
150	            sections
151	     12.2.  Major restructuring of IRI processing model
152	       12.2.1. OLD WAY
153	       12.2.2. NEW WAY
154	       12.2.3. Extension of Syntax
155	       12.2.4. More to be added
156	     12.3.  Change Log
157	       12.3.1. Changes after draft-ietf-iri-3987bis-01
158	       12.3.2. Changes from draft-duerst-iri-bis-07 to
159	               draft-ietf-iri-3987bis-00
160	       12.3.3. Changes from -06 to -07 of draft-duerst-iri-bis
161	     12.4.  Changes from -00 to -01
162	     12.5.  Changes from -05 to -06 of draft-duerst-iri-bis-00
163	     12.6.  Changes from -04 to -05 of draft-duerst-iri-bis
164	     12.7.  Changes from -03 to -04 of draft-duerst-iri-bis
165	     12.8.  Changes from -02 to -03 of draft-duerst-iri-bis
166	     12.9.  Changes from -01 to -02 of draft-duerst-iri-bis
167	     12.10. Changes from -00 to -01 of draft-duerst-iri-bis
168	     12.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
169	   13. References
170	     13.1.  Normative References
171	     13.2.  Informative References
172	   Authors' Addresses

174	1.  Introduction

176	1.1.  Overview and Motivation

178	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
179	   sequence of characters chosen from a limited subset of the repertoire
180	   of US-ASCII [ASCII] characters.

182	   The characters in URIs are frequently used for representing words of
183	   natural languages.  This usage has many advantages: Such URIs are
184	   easier to memorize, easier to interpret, easier to transcribe, easier
185	   to create, and easier to guess.  For most languages other than
186	   English, however, the natural script uses characters other than A-Z.
187	   For many people, handling Latin characters is as difficult as
188	   handling the characters of other scripts is for those who use only
189	   the Latin script.  Many languages with non-Latin scripts are
190	   transcribed with Latin letters.  These transcriptions are now often
191	   used in URIs, but they introduce additional difficulties.

193	   The infrastructure for the appropriate handling of characters from
194	   additional scripts is now widely deployed in operating system and
195	   application software.  Software that can handle a wide variety of
196	   scripts and languages at the same time is increasingly common.  Also,
197	   an increasing number of protocols and formats can carry a wide range
198	   of characters.

200	   URIs are composed out of a very limited repertoire of characters;
201	   this design choice was made to support global transcription (see
202	   [RFC3986] section 1.2.1.).  Reliable transition between a URI (as an
203	   abstract protocol element composed of a sequence of characters) and a
204	   presentation of that URI (written on a napkin, read out loud) and
205	   back is relatively straightforward, because of the limited repertoire
206	   of characters used.  IRIs are designed to satisfy a different set of
207	   use requirements; in particular, to allow IRIs to be written in ways
208	   that are more meaningful to their users, even at the expense of
209	   global transcribability.  However, ensuring reliability of the
210	   transition between an IRI and its presentation and back is more
211	   difficult and complex when dealing with the larger set of Unicode
212	   characters.  For example, Unicode supports multiple ways of encoding
213	   complex combinations of characters and accents, with multiple
214	   character sequences that can result in the same presentation.
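
   As an illustration of the last point, the following Python sketch
   shows two different character sequences that typically render
   identically.  The choice of the letter "e" with an acute accent and
   the use of the standard unicodedata module are assumptions made for
   this example only; they are not taken from this document.

```python
import unicodedata

# Two different character sequences that usually look the same when rendered.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Compared character by character, the two sequences are different ...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ... but they become equal once both are brought to the same
# normalization form (NFC composes, NFD decomposes).
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

   An IRI copied from a rendered presentation may therefore arrive as
   either sequence, which is one reason the transition back from a
   presentation is harder than for URIs.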

216	   This document defines the protocol element called Internationalized
217	   Resource Identifier (IRI), which allows applications of URIs to be
218	   extended to use resource identifiers that have a much wider
219	   repertoire of characters.  It also provides corresponding
220	   "internationalized" versions of other constructs from [RFC3986], such
221	   as URI references.  The syntax of IRIs is defined in Section 2.

223	   Within this document, Section 5 discusses the use of IRIs in
224	   different situations.  Section 8 gives additional informative
225	   guidelines.  Section 10 discusses IRI-specific security
226	   considerations.

228	   This specification is part of a collection of specifications intended
229	   to replace [RFC3987].  [Bidi] discusses the special case of
230	   bidirectional IRIs, IRIs using characters from scripts written right-
231	   to-left.  [Equivalence] gives guidelines for applications wishing to
232	   determine if two IRIs are equivalent, as well as defining some
233	   equivalence methods.  [RFC4395bis] updates the URI scheme
234	   registration guidelines and procedures to note that every URI scheme
235	   is also automatically an IRI scheme and to allow scheme definitions
236	   to be directly described in terms of Unicode characters.

238	1.2.  Applicability

240	   IRIs are designed to allow protocols and software that deal with URIs
241	   to be updated to handle IRIs.  Processing of IRIs is accomplished by
242	   extending the URI syntax while retaining (and not expanding) the set
243	   of "reserved" characters, such that the syntax for any URI scheme may
244	   be extended to allow non-ASCII characters.  In addition, following
245	   parsing of an IRI, it is possible to construct a corresponding URI by
246	   first encoding characters outside of the allowed URI range and then
247	   reassembling the components.
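
   The "parse, encode, reassemble" idea in the paragraph above can be
   sketched roughly as follows in Python.  This is a simplified
   illustration, not the normative mapping: the function names are
   hypothetical, UTF-8 is assumed for every component, and the
   authority is passed through unchanged because the mappings for
   ireg-name and for query components are defined separately in
   Section 3 and are not reproduced here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode the UTF-8 octets of each non-ASCII character.

    ASCII characters, including reserved ones, are left untouched.
    """
    return "".join(
        ch if ord(ch) < 0x80 else quote(ch, safe="")
        for ch in component
    )


def iri_to_uri_sketch(iri: str) -> str:
    """Hypothetical helper: split, encode path/query/fragment, reassemble."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumption: authority is already ASCII-only
        percent_encode_non_ascii(parts.path),
        percent_encode_non_ascii(parts.query),
        percent_encode_non_ascii(parts.fragment),
    ))


uri = iri_to_uri_sketch("http://example.org/ros\u00e9?q=\u00e9#\u00e9")
assert uri == "http://example.org/ros%C3%A9?q=%C3%A9#%C3%A9"
```

   Because only non-ASCII characters are touched, the delimiters that
   separate the components are the same before and after the
   conversion.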

249	   Practical use of IRI forms in place of URI forms depends on the
250	   following conditions being met:

252	   a. A protocol or format element MUST be explicitly designated to be
253	      able to carry IRIs.  The intent is to avoid introducing IRIs into
254	      contexts that are not defined to accept them.  For example, XML
255	      schema [XMLSchema] has an explicit type "anyURI" that includes
256	      IRIs and IRI references.  Therefore, IRIs and IRI references can
257	      be used in attributes and elements of type "anyURI".  On the other
258	      hand, in HTTP/1.1 ([RFC2616]), the Request URI is defined as a
259	      URI, which means that direct use of IRIs is not allowed in HTTP
260	      requests.

262	   b. The protocol or format carrying the IRIs MUST have a mechanism to
263	      represent the wide range of characters used in IRIs, either
264	      natively or by some protocol- or format-specific escaping
265	      mechanism (for example, numeric character references in [XML1]).

267	   c. The URI scheme definition, if it explicitly allows a percent sign
268	      ("%") in any syntactic component, SHOULD define the interpretation
269	      of sequences of percent-encoded octets (using "%XX" hex octets) as
270	      octets from sequences of UTF-8 encoded characters; this is
271	      recommended in the guidelines for registering new schemes,
272	      [RFC4395bis].  For example, this is the practice for IMAP URLs
273	      [RFC2192], POP URLs [RFC2384], and the URN syntax [RFC2141].  Note
274	      that use of percent-encoding may also be restricted in some
275	      situations.  For example, URI schemes that disallow percent-
276	      encoding might still be used with a fragment identifier that is
277	      percent-encoded (e.g., [XPointer]).  See Section 5.4 for further
278	      discussion.
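
   Condition (c) can be illustrated with a short Python sketch that
   interprets percent-encoded octets as UTF-8.  The helper name
   decode_pct_utf8 is hypothetical, and returning None for octet
   sequences that are not valid UTF-8 is a choice made for this
   example rather than a behavior required by this document.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_pct_utf8(component: str) -> Optional[str]:
    """Decode %XX sequences and read the resulting octets as UTF-8.

    Returns None when the octets do not form valid UTF-8, in which
    case they cannot be interpreted as UCS characters this way.
    """
    octets = unquote_to_bytes(component)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


assert decode_pct_utf8("r%C3%A9sum%C3%A9") == "r\u00e9sum\u00e9"
assert decode_pct_utf8("%C9") is None  # a lone <c9> is not valid UTF-8
```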

280	1.3.  Definitions

282	   Various terms used in this document are defined in [RFC6365] and
283	   [RFC3986].  In addition, we define the following terms for use in
284	   this document.

286	   octet:  An ordered sequence of eight bits considered as a unit.

288	   sequence of characters:  A sequence of characters (one after
289	      another).

291	   sequence of octets:  A sequence of octets (one after another).

293	   character encoding:  A method of representing a sequence of
294	      characters as a sequence of octets (maybe with variants).  Also, a
295	      method of (unambiguously) converting a sequence of octets into a
296	      sequence of characters.

298	   charset:  The name of a parameter or attribute used to identify a
299	      character encoding.

301	   UCS:  Universal Character Set. The coded character set defined by
302	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].

304	   IRI reference:  Denotes the common usage of an Internationalized
305	      Resource Identifier.  An IRI reference may be absolute or
306	      relative.  However, the "IRI" that results from such a reference
307	      only includes absolute IRIs; any relative IRI references are
308	      resolved to their absolute form.  Note that in [RFC2396] URIs did
309	      not include fragment identifiers, but in [RFC3986] fragment
310	      identifiers are part of URIs.

312	   LEIRI (Legacy Extended IRI):  This term is used in various XML
313	      specifications to refer to strings that, although not valid IRIs,
314	      are acceptable input to the processing rules in Section 6.2.

316	   protocol element:  Any portion of a message that affects processing
317	      of that message by the protocol in question.

319	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
320	      used for the initial creation.  This may be the initial creation
321	      of a resource with a certain identifier, or the initial exposition
322	      of a resource under a particular identifier.

324	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
325	      used when the identifier is generated by derivation from other
326	      information.

328	   parsed URI component:  When a URI processor parses a URI (following
329	      the generic syntax or a scheme-specific syntax), the result is a
330	      set of parsed URI components, each of which has a type
331	      (corresponding to the syntactic definition) and a sequence of URI
332	      characters.

334	   parsed IRI component:  When an IRI processor parses an IRI directly,
335	      following the general syntax or a scheme-specific syntax, the
336	      result is a set of parsed IRI components, each of which has a type
337	      (corresponding to the syntactic definition) and a sequence of IRI
338	      characters.  (This definition is analogous to "parsed URI
339	      component".)

341	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
342	      scheme's syntax has been extended to allow non-US-ASCII characters
343	      according to the rules in this document.

345	1.4.  Notation

347	   RFCs and Internet Drafts currently do not allow any characters
348	   outside the US-ASCII repertoire.  Therefore, this document uses
349	   various special notations for such characters in examples.

351	   In text, characters outside US-ASCII are sometimes referenced by
352	   using a prefix of 'U+', followed by four to six hexadecimal digits.

354	   To represent characters outside US-ASCII in a document format that is
355	   limited to US-ASCII, this document uses 'XML Notation'.  XML Notation
356	   uses a leading '&#x', a trailing ';', and the hexadecimal number of
357	   the character in the UCS in between.  For example, &#x42F; stands for
358	   CYRILLIC CAPITAL LETTER YA.  In this notation, an actual '&' is
359	   denoted by '&amp;'.  This notation is only used in the ASCII
360	   version(s) of this document, because in the other versions, non-ASCII
361	   characters are used directly.

363	   To denote actual octets in examples (as opposed to percent-encoded
364	   octets), the two hex digits denoting the octet are enclosed in "<"
365	   and ">".  For example, the octet often denoted as 0xc9 is denoted
366	   here as <c9>.
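
   For readers who want to convert between these notations and actual
   characters or octets, the following Python sketch may help.  The
   function names are hypothetical and exist only for this example;
   the code handles just the '&#x...;' and '&amp;' forms described
   above and nothing else from XML.

```python
import re


def from_xml_notation(text: str) -> str:
    """Turn '&#xHHHH;' into the character and '&amp;' into '&'."""
    def replace(match):
        if match.group(0) == "&amp;":
            return "&"
        return chr(int(match.group(1), 16))
    return re.sub(r"&#x([0-9A-Fa-f]+);|&amp;", replace, text)


def to_xml_notation(text: str) -> str:
    """Write non-ASCII characters as '&#xHHHH;' and '&' as '&amp;'."""
    pieces = []
    for ch in text:
        if ch == "&":
            pieces.append("&amp;")
        elif ord(ch) < 0x80:
            pieces.append(ch)
        else:
            pieces.append("&#x%X;" % ord(ch))
    return "".join(pieces)


def octet_notation(data: bytes) -> str:
    """Write each octet as two lowercase hex digits inside '<' and '>'."""
    return "".join("<%02x>" % octet for octet in data)


assert from_xml_notation("&#x42F;") == "\u042f"  # CYRILLIC CAPITAL LETTER YA
assert to_xml_notation("\u042f & co") == "&#x42F; &amp; co"
assert octet_notation(b"\xc9") == "<c9>"
```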

368	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
369	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
370	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

372	2.  IRI Syntax

374	   This section defines the syntax of Internationalized Resource
375	   Identifiers (IRIs).

377	   As with URIs, an IRI is defined as a sequence of characters, not as a
378	   sequence of octets.  This definition accommodates the fact that IRIs
379	   may be written on paper or read over the radio as well as stored or
380	   transmitted digitally.  The same IRI might be represented as
381	   different sequences of octets in different protocols or documents if
382	   these protocols or documents use different character encodings
383	   (and/or transfer encodings).  Using the same character encoding as
384	   the containing protocol or document ensures that the characters in
385	   the IRI can be handled (e.g., searched, converted, displayed) in the
386	   same way as the rest of the protocol or document.
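
   A small Python sketch makes the distinction between characters and
   octets concrete.  The example IRI and the two character encodings
   chosen (UTF-8 and ISO-8859-1) are assumptions for illustration
   only.

```python
# One IRI, as a sequence of characters.
iri = "http://example.org/ros\u00e9"

# The same IRI carried in two different character encodings.
as_utf8 = iri.encode("utf-8")
as_latin1 = iri.encode("iso-8859-1")

# The octet sequences differ ...
assert as_utf8 != as_latin1
assert as_utf8.endswith(b"\xc3\xa9")   # <c3><a9>
assert as_latin1.endswith(b"\xe9")     # <e9>

# ... yet both decode back to the identical character sequence.
assert as_utf8.decode("utf-8") == iri
assert as_latin1.decode("iso-8859-1") == iri
```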

388	2.1.  Summary of IRI Syntax

390	   The IRI syntax extends the URI syntax in [RFC3986] by extending the
391	   class of unreserved characters, primarily by adding the characters of
392	   the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
393	   to the limitations given in the syntax rules below and in
394	   Section 5.1.

396	   The syntax and use of components and reserved characters is the same
397	   as that in [RFC3986].  Each URI scheme thus also functions as an IRI
398	   scheme, in that scheme-specific parsing rules for URIs of a scheme
399	   are extended to allow parsing of IRIs using the same parsing rules.

401	   All the operations defined in [RFC3986], such as the resolution of
402	   relative references, can be applied to IRIs by IRI-processing
403	   software in exactly the same way as they are for URIs by URI-
404	   processing software.
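
   As a rough illustration, relative reference resolution can be tried
   on strings containing non-ASCII characters using Python's
   urllib.parse.urljoin.  That function is a URI-oriented library
   routine that happens to accept such strings; it is used here only
   as a convenient stand-in and is not claimed to be a conforming IRI
   processor.  The base and reference values are made up for the
   example.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative IRI reference.
base = "http://example.org/ros\u00e9/index"
reference = "../caf\u00e9#\u00e9t\u00e9"

# Resolution removes the dot segment exactly as it would for a URI;
# the non-ASCII characters are carried through as ordinary characters.
resolved = urljoin(base, reference)
assert resolved == "http://example.org/caf\u00e9#\u00e9t\u00e9"
```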

406	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
407	   therefore MUST NOT be used for syntactical purposes, such as to
408	   delimit components in newly defined schemes.  For example, U+00A2,
409	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
410	   the 'iunreserved' category.  This is similar to the fact that it is
411	   not possible to use '-' as a delimiter in URIs, because it is in the
412	   'unreserved' category.

414	2.2.  ABNF for IRI References and IRIs

416	   An ABNF definition for IRI references (which are the most general
417	   concept and the start of the grammar) and IRIs is given here.  The
418	   syntax of this ABNF is described in [STD68].  Character numbers are
419	   taken from the UCS, without implying any actual binary encoding.
420	   Terminals in the ABNF are characters, not octets.

422	   The following grammar closely follows the URI grammar in [RFC3986],
423	   except that the range of unreserved characters is expanded to include
424	   UCS characters, with the restriction that private UCS characters can
425	   occur only in query parts.  The grammar is split into two parts:
426	   Rules that differ from [RFC3986] because of the above-mentioned
427	   expansion, and rules that are the same as those in [RFC3986].  For
428	   rules that differ from those in [RFC3986], the names of the
429	   non-terminals have been changed as follows.  If the non-terminal
430	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
431	   has been prefixed.  The rule <pct-form> has been introduced in order
432	   to be able to reference it from other parts of the document.

434	   The following rules are different from those in [RFC3986]:

436	   IRI            = scheme ":" ihier-part [ "?" iquery ]
437	                    [ "#" ifragment ]

439	   ihier-part     = "//" iauthority ipath-abempty
440	                  / ipath-absolute
441	                  / ipath-rootless
442	                  / ipath-empty

444	   IRI-reference  = IRI / irelative-ref

446	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

448	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

450	   irelative-part = "//" iauthority ipath-abempty
451	                  / ipath-absolute
452	                  / ipath-noscheme
453	                  / ipath-empty

455	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
456	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
457	   ihost          = IP-literal / IPv4address / ireg-name
458	   pct-form       = pct-encoded

460	   ireg-name      = *( iunreserved / sub-delims )

462	   ipath          = ipath-abempty   ; begins with "/" or is empty
463	                  / ipath-absolute  ; begins with "/" but not "//"
464	                  / ipath-noscheme  ; begins with a non-colon segment
465	                  / ipath-rootless  ; begins with a segment
466	                  / ipath-empty     ; zero characters

468	   ipath-abempty  = *( path-sep isegment )
469	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
470	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
471	   ipath-rootless = isegment-nz *( path-sep isegment )
472	   ipath-empty    = ""
473	   path-sep       = "/"

475	   isegment       = *ipchar
476	   isegment-nz    = 1*ipchar
477	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
478	                        / "@" )
479	                  ; non-zero-length segment without any colon ":"

481	   ipchar         = iunreserved / pct-form / sub-delims / ":"
482	                  / "@"

484	   iquery         = *( ipchar / iprivate / "/" / "?" )

486	   ifragment      = *( ipchar / "/" / "?" )

488	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

490	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
491	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
492	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
493	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
494	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
495	                  / %xD0000-DFFFD / %xE1000-EFFFD

497	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
498	                  / %x100000-10FFFD

500	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
501	   "greedy") algorithm applies.  For details, see [RFC3986].
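
   The character classes introduced above can be transcribed into a
   simple membership test.  The following Python sketch copies the
   ranges from the <ucschar> and <iprivate> rules; the function and
   constant names are hypothetical, and the code checks single
   characters only, not whole IRIs.

```python
import string

# Ranges copied from the ucschar rule (inclusive code point bounds).
UCSCHAR_RANGES = (
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
)

# Ranges copied from the iprivate rule.
IPRIVATE_RANGES = (
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF), (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

# ALPHA / DIGIT / "-" / "." / "_" / "~"
ASCII_UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch: str) -> bool:
    return ch in ASCII_UNRESERVED or is_ucschar(ch)


assert is_iunreserved("\u00a2")       # CENT SIGN is in 'iunreserved'
assert not is_iunreserved("/")        # a gen-delim, not unreserved
assert is_iprivate("\ue000") and not is_ucschar("\ue000")
```

   Note that <iprivate> appears only in the <iquery> rule, so a
   complete validator would accept those characters in query
   components and nowhere else.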

503	   The following rules are the same as those in [RFC3986]:

505	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

507	   port           = *DIGIT

509	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

511	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

513	   IPv6address    =                            6( h16 ":" ) ls32
514	                  /                       "::" 5( h16 ":" ) ls32
515	                  / [               h16 ] "::" 4( h16 ":" ) ls32
516	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
517	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
518	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
519	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
520	                  / [ *5( h16 ":" ) h16 ] "::"              h16
521	                  / [ *6( h16 ":" ) h16 ] "::"

523	   h16            = 1*4HEXDIG
524	   ls32           = ( h16 ":" h16 ) / IPv4address

526	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

528	   dec-octet      = DIGIT                 ; 0-9
529	                  / %x31-39 DIGIT         ; 10-99
530	                  / "1" 2DIGIT            ; 100-199
531	                  / "2" %x30-34 DIGIT     ; 200-249
532	                  / "25" %x30-35          ; 250-255

534	   pct-encoded    = "%" HEXDIG HEXDIG

536	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
537	   reserved       = gen-delims / sub-delims
538	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
539	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
540	                  / "*" / "+" / "," / ";" / "="

542	   This syntax does not support IPv6 scoped addressing zone identifiers.

544	3.  Processing IRIs and related protocol elements

546	   IRIs are meant to replace URIs in identifying resources within new
547	   versions of protocols, formats, and software components that use a
548	   UCS-based character repertoire.  Protocols and components may use and
549	   process IRIs directly.  However, there are still numerous systems and
550	   protocols which only accept URIs or components of parsed URIs; that
551	   is, they only accept sequences of characters within the subset of US-
552	   ASCII characters allowed in URIs.

554	   This section defines specific processing steps for IRI consumers
555	   which establish the relationship between the string given and the
556	   interpreted derivatives.  These processing steps apply to both IRIs
557	   and IRI references (i.e., absolute or relative forms); for IRIs, some
558	   steps are scheme specific.

560	3.1.  Converting to UCS

562	   Input that is already in a Unicode form (i.e., a sequence of Unicode
563	   characters or an octet-stream representing a Unicode-based character
564	   encoding such as UTF-8 or UTF-16) should be left as is and not
565	   normalized or changed.

567	   An IRI or IRI reference is a sequence of characters from the UCS.
568	   For input from presentations (written on paper, read aloud) or
569	   translation from other representations (a text stream using a legacy
570	   character encoding), convert the input to Unicode.  Note that some
571	   character encodings or transcriptions can be converted to or
572	   represented by more than one sequence of Unicode characters.  Ideally
573	   the resulting IRI would use a normalized form, such as Unicode
574	   Normalization Form C [UTR15], since that ensures a stable, consistent
575	   representation that is most likely to produce the intended results.
576	   Previous versions of this specification required normalization at
577	   this step.  However, attempts to require normalization in other
578	   protocols have met with strong enough resistance that requiring
579	   normalization here was considered impractical.  Implementers and
580	   users are cautioned that, while denormalized character sequences are
581	   valid, they might be difficult for other users or processes to
582	   reproduce and might lead to unexpected results.

584	3.2.  Parse the IRI into IRI components

586	   Parse the IRI, either as a relative reference (no scheme) or using
587	   scheme-specific processing (according to the scheme given); the
588	   result is a set of parsed IRI components.

590	3.3.  General percent-encoding of IRI components

592	   Except as noted in the following subsections, IRI components are
593	   mapped to the equivalent URI components by percent-encoding those
594	   characters not allowed in URIs.  Previous processing steps will have
595	   removed some characters, and the interpretation of reserved
596	   characters will have already been done (with the syntactic reserved
597	   characters outside of the IRI component).  This mapping is defined
598	   for all sequences of Unicode characters, whether or not they are
599	   valid for the component in question.

601	   For each character which is not allowed anywhere in a valid URI,
602	   apply the following steps.

604	   Convert to UTF-8:  Convert the character to a sequence of one or more
605	      octets using UTF-8 [STD63].

607	   Percent encode:  Convert each octet of this sequence to %HH, where HH
608	      is the hexadecimal notation of the octet value.  The hexadecimal
609	      notation SHOULD use uppercase letters.  (This is the general URI
610	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

612	   Note that the mapping is an identity transformation for parsed URI
613	   components of valid URIs, and is idempotent: applying the mapping a
614	   second time will not change anything.
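
   The two steps above can be sketched in Python as follows.  This is
   a minimal illustration, not a normative implementation; the names
   URI_CHARS and percent_encode_component are hypothetical.  The sketch
   assumes the input is a Python str without lone surrogates, and it
   treats "%" as allowed so that existing percent-encodings are left
   alone.

```python
# Characters allowed somewhere in a URI per RFC 3986:
# unreserved, gen-delims, sub-delims, plus "%" (start of pct-encoded).
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"          # rest of unreserved
    ":/?#[]@"       # gen-delims
    "!$&'()*+,;="   # sub-delims
    "%"
)


def percent_encode_component(component):
    """Percent-encode every character that is not allowed in a URI.

    Each such character is converted to UTF-8 and every resulting
    octet is written as %HH with uppercase hexadecimal digits.
    """
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


once = percent_encode_component("r\u00e9sum\u00e9")
twice = percent_encode_component(once)
print(once, once == twice)
```

   Because "%" and all ASCII URI characters pass through unchanged, a
   second application leaves the result as is, which matches the
   idempotence noted above.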

616	3.4.  Mapping ireg-name

618	   The mapping from <ireg-name> to a <reg-name> requires a choice
619	   between one of the two methods described below.

621	3.4.1.  Mapping using Percent-Encoding

623	   The ireg-name component SHOULD be converted according to the general
624	   procedure for percent-encoding of IRI components described in
625	   Section 3.3.

627	   For example, the IRI
628	   "http://r&#xE9;sum&#xE9;.example.org"
629	   will be converted to
630	   "http://r%C3%A9sum%C3%A9.example.org".

632	   This conversion for ireg-name is in line with Section 3.2.2 of
633	   [RFC3986], which does not mandate a particular registered name lookup
634	   technology.  For further background, see [RFC6055] and [Gettys].

636	3.4.2.  Mapping using Punycode

638	   In situations where it is certain that <ireg-name> is intended to be
639	   used as a domain name to be processed by Domain Name Lookup (as per
640	   [RFC5891]), an alternative method MAY be used, converting <ireg-name>
641	   as follows:

643	   If there is any percent-encoding, and the corresponding octets all
644	   represent valid UTF-8 octet sequences, then convert these back to
645	   Unicode character sequences.  (If any percent-encodings are not valid
646	   UTF-8 octet sequences, then leave the entire field as is without any
647	   change, since punycode encoding would not succeed.)

649	   Replace the ireg-name part of the IRI by the part converted using the
650	   Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
651	   on each dot-separated label, and by using U+002E (FULL STOP) as a
652	   label separator.  This procedure may fail, in which case the IRI
653	   cannot be resolved: if the domain name conversion fails, then the
654	   entire IRI conversion fails.  Processors
655	   that have no mechanism for signalling a failure MAY instead
656	   substitute an otherwise invalid host name, although such processing
657	   SHOULD be avoided.

659	   For example, the IRI
660	   "http://r&#xE9;sum&#xE9;.example.org"
661	   is converted to
662	   "http://xn--rsum-bpad.example.org".

664	   This conversion for ireg-name will be better able to deal with legacy
665	   infrastructure that cannot handle percent-encoding in domain names.
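
   The procedure in this subsection can be sketched in Python.  Note
   the assumption: Python's built-in "idna" codec implements the older
   IDNA2003 rules (RFC 3490), not the lookup procedure of [RFC5891], so
   it is only a stand-in here.  The function name ireg_name_to_ascii is
   hypothetical.

```python
from urllib.parse import unquote


def ireg_name_to_ascii(ireg_name):
    """Map an ireg-name to an ASCII host name (illustrative sketch).

    Assumption: the "idna" codec (IDNA2003) stands in for the
    Domain Name Lookup procedure of RFC 5891.
    """
    try:
        # Undo percent-encoding only if it is valid UTF-8.
        decoded = unquote(ireg_name, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the entire field unchanged.
        return ireg_name
    try:
        # The codec converts each dot-separated label separately.
        return decoded.encode("idna").decode("ascii")
    except UnicodeError as exc:
        # A failed domain name conversion fails the whole IRI conversion.
        raise ValueError(f"cannot convert host {ireg_name!r}") from exc


print(ireg_name_to_ascii("r\u00e9sum\u00e9.example.org"))
```

   Raising an exception corresponds to signalling a failure; a
   processor following the MAY clause above would substitute an
   invalid host name instead.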

667	3.4.3.  Additional Considerations

669	   Domain Names can appear in parts of an IRI other than the ireg-name
670	   part.  It is the responsibility of scheme-specific implementations
671	   (if the Internationalized Domain Name is part of the scheme syntax)
672	   or of server-side implementations (if the Internationalized Domain
673	   Name is part of 'iquery') to apply the necessary conversions at the
674	   appropriate point.  Example: Trying to validate the Web page at
675	   http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
676	   http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;
677	   .example.org, which would convert to a URI of
678	   http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
679	   example.org.  The server-side implementation is responsible for
680	   making the necessary conversions to be able to retrieve the Web page.

682	   In this process, characters allowed in URI references and existing
683	   percent-encoded sequences are not encoded further.  (This mapping is
684	   similar to, but different from, the encoding applied when arbitrary
685	   content is included in some part of a URI.)  For example, an IRI of
686	   "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
687	   converted to
688	   "http://www.example.org/red%09ros%C3%A9#red", not to something like
689	   "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

691	3.5.  Mapping query components

693	   For compatibility with existing deployed HTTP infrastructure, the
694	   following special case applies for the schemes "http" and "https"
695	   when an IRI is found in a document whose charset is not based on UCS
696	   (e.g., not UTF-8 or UTF-16).  In such a case, the "query" component
697	   of an IRI is mapped into a URI by using the document charset rather
698	   than UTF-8 as the binary representation before percent-encoding.
699	   This mapping is not applied for any other schemes or components.
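
   A minimal Python sketch of this special case follows.  The names
   URI_CHARS and map_query are hypothetical, and the behavior for
   characters that the document charset cannot represent is not
   described in this section; the sketch simply lets Python raise
   UnicodeEncodeError in that case.

```python
# Characters allowed somewhere in a URI per RFC 3986, plus "%".
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def map_query(iquery, document_charset="utf-8"):
    """Percent-encode a query component using the document charset.

    With the default charset this is the general mapping; for "http"
    and "https" IRIs found in a non-UCS document, pass that document's
    charset (for example "iso-8859-1").
    """
    out = []
    for ch in iquery:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Unrepresentable characters raise UnicodeEncodeError here.
            octets = ch.encode(document_charset)
            out.extend(f"%{octet:02X}" for octet in octets)
    return "".join(out)


print(map_query("q=r\u00e9sum\u00e9"))                # UTF-8 octets
print(map_query("q=r\u00e9sum\u00e9", "iso-8859-1"))  # document charset
```

   The same IRI query therefore maps to different URIs depending on
   the charset of the containing document.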

701	3.6.  Mapping IRIs to URIs

703	   The mapping from an IRI to URI is accomplished by applying the
704	   mapping above (from IRI to URI components) and then reassembling a
705	   URI from the parsed URI components using the original punctuation
706	   that delimited the IRI components.

708	4.  Converting URIs to IRIs

710	   In some situations, for presentation and further processing, it is
711	   desirable to convert a URI into an equivalent IRI without unnecessary
712	   percent encoding.  Of course, every URI is already an IRI in its own
713	   right without any conversion.  This section gives one possible
714	   procedure for converting a URI to an IRI.

716	4.1.  Limitations

718	   The conversion described in this section, if given a valid URI, will
719	   result in an IRI that maps back to the URI used as an input for the
720	   conversion (except for potential case differences in percent-encoding
721	   and for potential percent-encoded unreserved characters).  However,
722	   the IRI resulting from this conversion may differ from the original
723	   IRI (if there ever was one).

725	   URI-to-IRI conversion removes percent-encodings, but not all percent-
726	   encodings can be eliminated.  There are several reasons for this:

728	   1. Some percent-encodings are necessary to distinguish percent-
729	      encoded and unencoded uses of reserved characters.

731	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
732	      octets.

734	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
735	      there is a very high probability, but no guarantee, that percent-
736	      encodings that can be interpreted as sequences of UTF-8 octets
737	      actually originated from UTF-8.  For a detailed discussion, see
739	      [Duerst97].)

741	   3. The conversion may result in a character that is not appropriate
742	      in an IRI.  See Section 2.2, and Section 5.1 for further details.

744	   4. As described in Section 3.5, IRI to URI conversion may work
745	      somewhat differently for query components.

747	4.2.  Conversion

749	   Conversion from a URI to an IRI MAY be done by using the following
750	   steps:

752	   1. Represent the URI as a sequence of octets in US-ASCII.

754	   2. Convert all percent-encodings ("%" followed by two hexadecimal
755	      digits) to the corresponding octets, except those corresponding to
756	      "%", characters in "reserved", and characters in US-ASCII not
757	      allowed in URIs.

759	   3. Re-percent-encode any octet produced in step 2 that is not part of
760	      a strictly legal UTF-8 octet sequence.

762	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
763	      represent characters that are not appropriate according to
764	      Section 2.2 and Section 5.1.

766	   5. Optionally, re-percent-encode octets in the query component if the
767	      scheme is one of those mentioned in Section 3.5.

769	   6. Interpret the resulting octet sequence as a sequence of characters
770	      encoded in UTF-8.

772	   7. URIs known to contain domain names in the reg-name component
773	      SHOULD convert punycode-encoded domain name labels to the
774	      corresponding characters using the ToUnicode procedure.

776	   This procedure will convert as many percent-encoded characters as
777	   possible to characters in an IRI.  Because there are some choices in
778	   steps 4 (see also Section 5.1) and 5, results may vary.

780	   Conversions from URIs to IRIs MUST NOT use any character encoding
781	   other than UTF-8 in steps 3 and 4, even if it might be possible to
782	   guess from the context that another character encoding than UTF-8 was
783	   used in the URI.  For example, the URI
784	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
785	   interpreted to contain two e-acute characters encoded as iso-8859-1.
786	   It must not be converted to an IRI containing these e-acute
787	   characters.  Otherwise, in the future the IRI will be mapped to
788	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
789	   URI from "http://www.example.org/r%E9sum%E9.html".
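
   Steps 1 to 4 and step 6 can be sketched in Python as shown below.
   This is an illustration under stated assumptions, not a complete
   implementation: steps 5 and 7 are omitted, and "not appropriate" in
   step 4 is approximated as "outside ucschar, or a bidirectional
   formatting character".  All names are hypothetical.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
# Assumption: these are the characters step 4 must keep encoded.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _may_appear_unencoded(cp):
    if cp < 0x80:
        # Step 2: "%", reserved, and disallowed ASCII stay encoded.
        return chr(cp) in UNRESERVED
    if cp in BIDI_FORMATTING:
        return False
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def _convert_run(match):
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    # surrogateescape maps each octet that is not part of legal UTF-8
    # to U+DC80..U+DCFF, so such octets can be re-encoded (step 3).
    for ch in octets.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(f"%{cp - 0xDC00:02X}")
        elif _may_appear_unencoded(cp):
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def uri_to_iri(uri):
    uri.encode("ascii")  # step 1: a URI consists of US-ASCII only
    return PCT_RUN.sub(_convert_run, uri)


print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```

   The octet <e9> alone is not legal UTF-8, so the URI in the last line
   is returned unchanged, as required above.  Percent-encodings that
   are kept are rewritten with uppercase hexadecimal digits.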

791	4.3.  Examples

793	   This section shows various examples of converting URIs to IRIs.  Each
794	   example shows the result after each of the steps 1 through 6 is
795	   applied.  XML Notation is used for the final result.  Octets are
796	   denoted by "<" followed by two hexadecimal digits followed by ">".

798	   The following example contains the sequence "%C3%BC", which is a
799	   strictly legal UTF-8 sequence, and which is converted into the actual
800	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
801	   u-umlaut).

803	   1. http://www.example.org/D%C3%BCrst

805	   2. http://www.example.org/D<c3><bc>rst

807	   3. http://www.example.org/D<c3><bc>rst

809	   4. http://www.example.org/D<c3><bc>rst

811	   5. http://www.example.org/D&#xFC;rst

813	   6. http://www.example.org/D&#xFC;rst

815	   The following example contains the sequence "%FC", which might
816	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
817	   iso-8859-1 character encoding.  (It might represent other characters
818	   in other character encodings.  For example, the octet <fc> in iso-
819	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
820	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
821	   encoded in step 3.

823	   1. http://www.example.org/D%FCrst

825	   2. http://www.example.org/D<fc>rst

827	   3. http://www.example.org/D%FCrst

829	   4. http://www.example.org/D%FCrst

831	   5. http://www.example.org/D%FCrst

832	   6. http://www.example.org/D%FCrst

834	   The following example contains "%e2%80%ae", which is the percent-
835	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
836	   The direct use of this character is forbidden in an IRI.  Therefore,
837	   the corresponding octets are re-percent-encoded in step 4.  This
838	   example shows that the case (upper- or lowercase) of letters used in
839	   percent-encodings may not be preserved.  The example also contains a
840	   punycode-encoded domain name label (xn--99zt52a), which is not
841	   converted.

844	   1. http://xn--99zt52a.example.org/%e2%80%ae

846	   2. http://xn--99zt52a.example.org/<e2><80><ae>

848	   3. http://xn--99zt52a.example.org/<e2><80><ae>

850	   4. http://xn--99zt52a.example.org/%E2%80%AE

852	   5. http://xn--99zt52a.example.org/%E2%80%AE

854	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

856	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
857	   (Japanese Natto).  ((EDITOR NOTE: There is some inconsistency in this
858	   note.))

860	5.  Use of IRIs

862	5.1.  Limitations on UCS Characters Allowed in IRIs

864	   This section discusses limitations on characters and character
865	   sequences usable for IRIs beyond those given in Section 2.2.  The
866	   considerations in this section are relevant when IRIs are created and
867	   when URIs are converted to IRIs.

869	   a. The repertoire of characters allowed in each IRI component is
870	      limited by the definition of that component.  For example, the
871	      definition of the scheme component does not allow characters
872	      beyond US-ASCII.

874	      (Note: In accordance with URI practice, generic IRI software
875	      cannot and should not check for such limitations.)

877	   b. The UCS contains many areas of characters for which there are
878	      strong visual look-alikes.  Because of the likelihood of
879	      transcription errors, these also should be avoided.  This includes
880	      the full-width equivalents of Latin characters, half-width
881	      Katakana characters for Japanese, and many others.  It also
882	      includes many look-alikes of "space", "delims", and "unwise",
883	      characters excluded in [RFC3491].

885	   c. At the start of a component, the use of combining marks is
886	      strongly discouraged.  As an example, a COMBINING TILDE OVERLAY
887	      (U+0334) would be very confusing at the start of a <isegment>.
888	      Combined with the preceding '/', it might look like a solidus
889	      with combining tilde overlay, but IRI processing software will
890	      parse and process the '/' separately.

892	   d. The ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D)
893	      are invisible in most contexts, but are crucial in some very
894	      limited contexts.  Appendix A of [RFC5892] contains contextual
895	      restrictions for these and some other characters.  The use of
896	      these characters is strongly discouraged except in the relevant
897	      contexts.

899	   Additional information is available from [UNIXML].  [UNIXML] is
900	   written in the context of general purpose text rather than in that of
901	   identifiers.  Nevertheless, it discusses many of the categories of
902	   characters not appropriate for IRIs.

904	5.2.  Software Interfaces and Protocols

906	   Although an IRI is defined as a sequence of characters, software
907	   interfaces for URIs typically function on sequences of octets or
908	   other kinds of code units.  Thus, software interfaces and protocols
909	   MUST define which character encoding is used.

911	   Intermediate software interfaces between IRI-capable components and
912	   URI-only components MUST map the IRIs per Section 3.6, when
913	   transferring from IRI-capable to URI-only components.  This mapping
914	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
915	   between components that are known to be able to handle IRIs.

917	5.3.  Format of URIs and IRIs in Documents and Protocols

919	   Document formats that transport URIs may have to be upgraded to allow
920	   the transport of IRIs.  In cases where the document as a whole has a
921	   native character encoding, IRIs MUST also be encoded in this
922	   character encoding and converted accordingly by a parser or
923	   interpreter.  IRI characters not expressible in the native character
924	   encoding SHOULD be escaped by using the escaping conventions of the
925	   document format if such conventions are available.  Alternatively,
926	   they MAY be percent-encoded according to Section 3.6.  For example,
927	   in HTML or XML, numeric character references SHOULD be used.  If a
928	   document as a whole has a native character encoding and that
929	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
930	   the document in the UTF-8 character encoding.

932	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
933	   although they use different terminology.  HTML 4.0 [HTML4] defines
934	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
935	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
936	   based upon them allow IRIs.  Also, it is expected that all relevant
937	   new W3C formats and protocols will be required to handle IRIs
938	   [CharMod].

940	5.4.  Use of UTF-8 for Encoding Original Characters

942	   This section discusses details and gives examples for point c) in
943	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
944	   IRI in question has to encode original characters into octets by
945	   using UTF-8.  This can be specified for all URIs of a URI scheme or
946	   can apply to individual URIs for schemes that do not specify how to
947	   encode original characters.  It can apply to the whole URI, or only
948	   to some part.  For background information on encoding characters into
949	   URIs, see also Section 2.5 of [RFC3986].

951	   For new URI/IRI schemes, using UTF-8 is recommended in [RFC4395bis].
952	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
953	   IMAP URLs [RFC2192], POP URLs [RFC2384], XMPP URLs [RFC5122], and the
954	   'mailto:' scheme [RFC6068].  On the other hand, because the HTTP URI
955	   scheme does not specify how to encode original characters, only some
956	   HTTP URLs can have corresponding but different IRIs.

958	   For example, for a document with a URI of
959	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
960	   construct a corresponding IRI (in XML notation, see Section 1.4):
961	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
962	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
963	   encoded representation of that character).  On the other hand, for a
964	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
965	   percent-encoded octets cannot be converted to actual characters in an
966	   IRI, as the percent-encoding is not based on UTF-8.

968	   For most URI schemes, there is no need to upgrade their scheme
969	   definition in order for them to work with IRIs.  The main case where
970	   upgrading makes sense is when a scheme definition, or a particular
971	   component of a scheme, is strictly limited to the use of US-ASCII
972	   characters with no provision to include non-ASCII characters/octets
973	   via percent-encoding, or if a scheme definition currently uses highly
974	   scheme-specific provisions for the encoding of non-ASCII characters.

976	   Scheme definitions can impose restrictions on the syntax of scheme-
977	   specific URIs; i.e., URIs that are admissible under the generic URI
978	   syntax [RFC3986] may not be admissible due to narrower syntactic
979	   constraints imposed by a URI scheme specification.  URI scheme
980	   definitions cannot broaden the syntactic restrictions of the generic
981	   URI syntax; otherwise, it would be possible to generate URIs that
982	   satisfied the scheme-specific syntactic constraints without
983	   satisfying the syntactic constraints of the generic URI syntax.
984	   However, additional syntactic constraints imposed by URI scheme
985	   specifications are applicable to IRIs, as the corresponding URI
986	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
987	   under the syntactic restrictions of generic URI syntax and any
988	   narrower restrictions imposed by the corresponding URI scheme
989	   specification.

991	   The requirement for the use of UTF-8 generally applies to all parts
992	   of a URI.  However, it is possible that the capability of IRIs to
993	   represent a wide range of characters directly is used just in some
994	   parts of the IRI (or IRI reference).  The other parts of the IRI may
995	   only contain US-ASCII characters, or they may not be based on UTF-8.
996	   They may be based on another character encoding, or they may directly
997	   encode raw binary data (see also [RFC2397]).

999	   For example, it is possible to have a URI reference of
1000	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
1001	   document name is encoded in iso-8859-1 based on server settings, but
1002	   where the fragment identifier is encoded in UTF-8 according to
1003	   [XPointer].  The IRI corresponding to the above URI would be (in XML
1004	   notation)
1005	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

1007	   Similar considerations apply to query parts.  The functionality of
1008	   IRIs (namely, to be able to include non-ASCII characters) can only be
1009	   used if the query part is encoded in UTF-8.

1011	5.5.  Relative IRI References

1013	   Processing of relative IRI references against a base is handled
1014	   straightforwardly; the algorithms of [RFC3986] can be applied
1015	   directly, treating the characters additionally allowed in IRI
1016	   references in the same way that unreserved characters are treated in
1017	   URI references.
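
   As a small illustration, the reference resolution function in the
   Python standard library (urllib.parse.urljoin) can be applied to
   IRI references unchanged.  That it passes non-ASCII characters
   through like unreserved characters is an observation about this one
   implementation, not a requirement of this document.

```python
from urllib.parse import urljoin

base = "http://example.org/a/b"

# The relative reference contains the non-ASCII character U+00E9.
resolved = urljoin(base, "../r\u00e9sum\u00e9")
print(resolved)
```

   The dot-segment is removed and the non-ASCII segment is appended
   without any percent-encoding being introduced.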

1019	6.  Legacy Extended IRIs (LEIRIs)

1021	   Some formats have used a protocol element that is a variant of the
1022	   IRI definition; these variants have usually been somewhat less
1023	   restricted in syntax.  This section
1024	   provides a definition and a name (Legacy Extended IRI or LEIRI) for
1025	   one of these variants used widely in XML-based protocols.  This
1026	   variant has to be used with care; it requires further processing
1027	   before being fully interchangeable as IRIs.  New protocols and
1028	   formats SHOULD NOT use Legacy Extended IRIs.  Even where Legacy
1029	   Extended IRIs are allowed, only IRIs fully conforming to the syntax
1030	   definition in Section 2.2 SHOULD be created, generated, and used.
1031	   The provisions in this section also apply to Legacy Extended IRI
1032	   references.

1034	6.1.  Legacy Extended IRI Syntax

1036	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1037	   Legacy Extended IRIs is the same as that for <IRI-reference>, except
1038	   that the ucschar production is replaced by the leiri-ucschar
1039	   production:

1041	   leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1042	                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1043	                  / %xE000-FFFD / %x10000-10FFFF

1045	   The restriction on bidirectional formatting characters in [Bidi] is
1046	   lifted.  The iprivate production becomes redundant.

1048	   Likewise, the syntax for Legacy Extended IRI references (LEIRI
1049	   references) is the same as that for IRI references with the above
1050	   replacement of ucschar with leiri-ucschar.

1052	6.2.  Conversion of Legacy Extended IRIs to IRIs

1054	   To convert a Legacy Extended IRI (reference) to an IRI (reference),
1055	   each character allowed in a Legacy Extended IRI (reference) but not
1056	   allowed in an IRI (reference) (see Section 6.3) MUST be percent-
1057	   encoded by applying the steps in Section 3.3.
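
   The conversion can be sketched in Python as follows.  This is an
   illustration with hypothetical names.  It assumes that private use
   and tag code points (iprivate) stay unencoded only in the query
   part, that bidirectional formatting characters are always encoded,
   and that a simple split on "#" and "?" is enough to find the query.

```python
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF), (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
# ASCII characters that leiri-ucschar adds to the IRI syntax.
LEIRI_ONLY_ASCII = set(' <>"{}|\\^`')


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def _needs_encoding(ch, in_query):
    cp = ord(ch)
    if cp < 0x80:
        # C0 controls, DEL, space, delimiters, and "unwise" characters.
        return cp < 0x20 or cp == 0x7F or ch in LEIRI_ONLY_ASCII
    if cp in BIDI_FORMATTING:
        return True
    if _in_ranges(cp, UCSCHAR_RANGES):
        return False
    # Everything else is encoded, except iprivate inside the query.
    return not (in_query and _in_ranges(cp, IPRIVATE_RANGES))


def _encode(text, in_query):
    out = []
    for ch in text:
        if _needs_encoding(ch, in_query):
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def leiri_to_iri(leiri):
    # Split off the fragment first, then the query.
    rest, hash_mark, fragment = leiri.partition("#")
    before_query, question_mark, query = rest.partition("?")
    return (_encode(before_query, False) + question_mark
            + _encode(query, True) + hash_mark + _encode(fragment, False))


print(leiri_to_iri("http://example.org/a b<c>"))
```

   Characters that are already valid in an IRI, including existing
   percent-encodings, pass through unchanged.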

1059	6.3.  Characters Allowed in Legacy Extended IRIs but not in IRIs

1061	   This section provides a list of the groups of characters and code
1062	   points that are allowed in Legacy Extended IRIs, but are not allowed
1063	   in IRIs or are allowed in IRIs only in the query part.  For each
1064	   group of characters, advice on the usage of these characters is also
1065	   given, concentrating on the reasons not to use them.

1067	      Space (U+0020): Some formats and applications use space as a
1068	      delimiter, e.g., for items in a list.  Appendix C of [RFC3986]
1069	      also mentions that white space may have to be added when
1070	      displaying or printing long URIs; the same applies to long IRIs.
1071	      Spaces might disappear, or a single Legacy Extended IRI might
1072	      incorrectly be interpreted as two or more separate ones.

1074	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1075	      C of [RFC3986] suggests the use of double-quotes
1076	      ("http://example.com/") and angle brackets (<http://example.com/>)
1077	      as delimiters for URIs in plain text.  These conventions are often
1078	      used, and also apply to IRIs.  Legacy Extended IRIs using these
1079	      characters might be cut off at the wrong place.

1081	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1082	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters
1083	      originally were excluded from URIs because the respective
1084	      codepoints are assigned to different graphic characters in some
1085	      7-bit or 8-bit encoding.  Despite the move to Unicode, some of
1086	      these characters are still occasionally displayed differently on
1087	      some systems, e.g., U+005C as a Japanese Yen symbol.  Also, the
1088	      fact that these characters are not used in URIs or IRIs has
1089	      encouraged their use outside URIs or IRIs in contexts that may
1090	      include URIs or IRIs.  In case a Legacy Extended IRI with such a
1091	      character is used in such a context, the Legacy Extended IRI will
1092	      be interpreted piecemeal.

1094	      The controls (C0 controls, DEL, and C1 controls, U+0000-001F and
1095	      U+007F-009F): There is no way to transmit these characters
1096	      reliably except potentially in electronic form.  Even when in
1097	      electronic form, some software components might silently filter
1098	      out some of these characters, or may stop processing altogether
1099	      when encountering some of them.  These characters may affect text
1100	      display in subtle, unnoticeable ways or in drastic, global, and
1101	      irreversible ways depending on the hardware and software involved.
1102	      The use of some of these characters may allow malicious users to
1103	      manipulate the display of a Legacy Extended IRI and its context.

1105	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1106	      characters affect the display ordering of characters.  Displayed
1107	      Legacy Extended IRIs containing these characters cannot be
1108	      converted back to electronic form (logical order) unambiguously.
1109	      These characters may allow malicious users to manipulate the
1110	      display of a Legacy Extended IRI and its context.

1112	      Specials (U+FFF0-FFFD): These code points provide functionality
1113	      beyond that useful in a Legacy Extended IRI, for example byte
1114	      order identification, annotation, and replacements for unknown
1115	      characters and objects.  Their use and interpretation in a Legacy
1116	      Extended IRI serves no purpose and may lead to confusing display
1117	      variations.

1119	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1120	      10FFFD): Display and interpretation of these code points is by
1121	      definition undefined without private agreement.  Therefore, these
1122	      code points are not suited for use on the Internet.  They are not
1123	      interoperable and may have unpredictable effects.

1125	      Tags (U+E0000-E0FFF): These characters provide a way to add
1126	      language tags to Unicode plain text.  They are not appropriate for
1127	      Legacy Extended IRIs because language information in identifiers
1128	      cannot reliably be input, transmitted (e.g., on a visual medium
1129	      such as paper), or recognized.

1131	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1132	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1133	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1134	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1135	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1136	      non-characters.  Applications may use some of them internally, but
1137	      are not prepared to interchange them.

1139	   For reference, the following code points and code units are not
1140	   allowed even in Legacy Extended IRIs:

1142	      Surrogate code units (D800-DFFF): These do not represent Unicode
1143	      codepoints.

1145	      Non-characters (U+FFFE-FFFF): These are allowed in neither XML nor
1146	      LEIRIs.
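
   As a non-normative illustration, the ranges listed above can be
   checked mechanically.  The following Python sketch covers only the
   categories named in this part of the section; the function name
   classify_code_point and the returned labels are assumptions made for
   this example and are not defined by this specification.

```python
from typing import Optional


def classify_code_point(cp: int) -> Optional[str]:
    """Return the category from the lists above, or None if cp is in none of them."""
    # Code units and code points not allowed even in Legacy Extended IRIs.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate (not allowed)"
    if cp in (0xFFFE, 0xFFFF):
        return "non-character (not allowed)"
    # Categories allowed in Legacy Extended IRIs but discouraged.
    if 0xFFF0 <= cp <= 0xFFFD:
        return "specials"
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    if 0xE0000 <= cp <= 0xE0FFF:
        return "tags"
    # U+FDD0-FDEF plus the last two code points of planes 1 to 16.
    if 0xFDD0 <= cp <= 0xFDEF or (cp > 0xFFFF and (cp & 0xFFFE) == 0xFFFE):
        return "non-character"
    return None
```

   A caller might use the result to warn about, or reject, a Legacy
   Extended IRI that contains such code points.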

1148	7.  Processing of URIs/IRIs/URLs by Web Browsers

1150	   For legacy reasons, many web browsers exhibit some irregularities
1151	   when processing URIs, IRIs, and URLs.  This is being documented in
1152	   [HTMLURL], in the hope that it will lead to more uniform
1153	   implementations of these irregularities across web browsers.

1155	   As far as currently known, creators of content for web browsers (such
1156	   as HTML) can use all URIs without problems.  They can also use all
1157	   IRIs without problems, except that query parts of HTTP/HTTPS IRIs
1158	   should be percent-encoded.

1160	8.  URI/IRI Processing Guidelines (Informative)

1162	   This informative section provides guidelines for supporting IRIs in
1163	   the same software components and operations that currently process
1164	   URIs: Software interfaces that handle URIs, software that allows
1165	   users to enter URIs, software that creates or generates URIs,
1166	   software that displays URIs, formats and protocols that transport
1167	   URIs, and software that interprets URIs.  These may all require
1168	   modification before functioning properly with IRIs.  The
1169	   considerations in this section also apply to URI references and IRI
1170	   references.

1172	8.1.  URI/IRI Software Interfaces

1174	   Software interfaces that handle URIs, such as URI-handling APIs and
1175	   protocols transferring URIs, need interfaces and protocol elements
1176	   that are designed to carry IRIs.

1178	   In case the current handling in an API or protocol is based on US-
1179	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1180	   is compatible with US-ASCII, is in accordance with the
1181	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1182	   any case, the API or protocol definition must clearly define the
1183	   character encoding to be used.

1185	   The transfer from URI-only to IRI-capable components requires no
1186	   mapping, although the conversion described in Section 4 above may be
1187	   performed.  It is preferable not to perform this inverse conversion
1188	   unless it is certain this can be done correctly.

1190	8.2.  URI/IRI Entry

1192	   Some components allow users to enter URIs into the system by typing
1193	   or dictation, for example.  This software must be updated to allow
1194	   for IRI entry.

1196	   A person viewing a visual presentation of an IRI (as a sequence of
1197	   glyphs, in some order, in some visual display) will use an entry
1198	   method for characters in the user's language to input the IRI.
1199	   Depending on the script and the input method used, this may be a more
1200	   or less complicated process.

1202	   The process of IRI entry must ensure, as much as possible, that the
1203	   restrictions defined in Section 2.2 are met.  This may be done by
1204	   choosing appropriate input methods or variants/settings thereof, by
1205	   appropriately converting the characters being input, by eliminating
1206	   characters that cannot be converted, and/or by issuing a warning or
1207	   error message to the user.

1209	   As an example of variant settings, input method editors for East
1210	   Asian Languages usually allow the input of Latin letters and related
1211	   characters in full-width or half-width versions.  For IRI input, the
1212	   input method editor should be set so that it produces half-width
1213	   Latin letters and punctuation and full-width Katakana.
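
   Where the input method cannot be configured, software accepting IRI
   input may be able to convert the characters after entry.  The
   following Python sketch is one possible approach, not a requirement:
   it uses NFKC normalization from the standard unicodedata module,
   which maps full-width Latin letters and punctuation to their
   half-width (ASCII) forms and half-width Katakana to full-width
   Katakana.  Note that NFKC also changes other compatibility
   characters, so it does more than width conversion.  The function
   name normalize_width is an assumption made for this example.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold width variants (and other compatibility forms) using NFKC."""
    return unicodedata.normalize("NFKC", text)
```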

1215	   An input field primarily or solely used for the input of URIs/IRIs
1216	   might allow the user to view an IRI as it is mapped to a URI.  Places
1217	   where the input of IRIs is frequent may provide the possibility for
1218	   viewing an IRI as mapped to a URI.  This will help users when some of
1219	   the software they use does not yet accept IRIs.

1221	   An IRI input component interfacing to components that handle URIs,
1222	   but not IRIs, must map the IRI to a URI before passing it to these
1223	   components.

1225	   For the input of IRIs with right-to-left characters, please see
1226	   [Bidi].

1228	8.3.  URI/IRI Transfer between Applications

1230	   Many applications (for example, mail user agents) try to detect URIs
1231	   appearing in plain text.  For this, they use some heuristics based on
1232	   URI syntax.  They then allow the user to click on such URIs and
1233	   retrieve the corresponding resource in an appropriate (usually
1234	   scheme-dependent) application.

1236	   Such applications would need to be upgraded, in order to use the IRI
1237	   syntax as a base for heuristics.  In particular, a non-ASCII
1238	   character should not be taken as the indication of the end of an IRI.
1239	   Such applications also would need to make sure that they correctly
1240	   convert the detected IRI from the character encoding of the document
1241	   or application where the IRI appears, to the character encoding used
1242	   by the system-wide IRI invocation mechanism, or to a URI (according
1243	   to Section 3.6) if the system-wide invocation mechanism only accepts
1244	   URIs.
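
   The difference between a URI-based and an IRI-based detection
   heuristic can be sketched in Python as follows.  Both patterns are
   deliberately simplistic assumptions made for this example (they only
   look for "http" and "https" and do not implement the full IRI
   syntax); the point is only that the second pattern does not stop at
   the first non-ASCII character.

```python
import re

# Stops at the first character outside printable US-ASCII.
ASCII_ONLY = re.compile(r"https?://[\x21-\x7E]+")

# Continues through non-ASCII characters; stops at whitespace and a few
# delimiters that commonly surround identifiers in plain text.
IRI_AWARE = re.compile(r"https?://[^\s<>\"]+")


def find_iris(text: str) -> list:
    """Return candidate IRIs found in plain text."""
    return IRI_AWARE.findall(text)
```

   The detected string would still need to be converted to the
   character encoding, or to the URI form, expected by the invocation
   mechanism.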

1246	   The clipboard is another frequently used way to transfer URIs and
1247	   IRIs from one application to another.  On most platforms, the
1248	   clipboard is able to store and transfer text in many languages and
1249	   scripts.  Correctly used, the clipboard transfers characters, not
1250	   octets, which will do the right thing with IRIs.

1252	8.4.  URI/IRI Generation

1254	   Systems that offer resources through the Internet, where those
1255	   resources have logical names, sometimes automatically generate URIs
1256	   for the resources they offer.  For example, some HTTP servers can
1257	   generate a directory listing for a file directory and then respond to
1258	   the generated URIs with the files.

1260	   Many legacy character encodings are in use in various file systems.
1261	   Many currently deployed systems do not transform the local character
1262	   representation of the underlying system before generating URIs.

1264	   For maximum interoperability, systems that generate resource
1265	   identifiers should make the appropriate transformations.  For
1266	   example, if a file system contains a file named
1267	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1268	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1269	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1270	   kept in a character encoding other than UTF-8.
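
   The transformation in this example can be sketched in Python using
   the standard urllib.parse module.  The sketch assumes that the
   server knows the character encoding of its file system
   (ISO-8859-1 is used in the usage note only as an illustration); the
   function name is hypothetical.

```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Convert a file name in a local encoding to a URI path segment."""
    # Step 1: interpret the stored bytes as characters.
    name = raw_name.decode(local_encoding)
    # Step 2: encode the characters as UTF-8 and percent-encode them.
    # quote() uses UTF-8 by default; safe="" also encodes "/".
    return quote(name, safe="")
```

   For a file name stored as ISO-8859-1 bytes, calling
   file_name_to_uri_segment(raw_name, "iso-8859-1") yields the UTF-8
   based form shown above rather than "r%E9sum%E9.html".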

1272	   This recommendation particularly applies to HTTP servers.  For FTP
1273	   servers, similar considerations apply; see [RFC2640].

1275	8.5.  URI/IRI Selection

1277	   In some cases, resource owners and publishers have control over the
1278	   IRIs used to identify their resources.  This control is mostly
1279	   executed by controlling the resource names, such as file names,
1280	   directly.

1282	   In these cases, it is recommended to avoid choosing IRIs that are
1283	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1284	   is easily confused with the digit one ("1"), and the upper-case oh
1285	   ("O") is easily confused with the digit zero ("0").  Publishers
1286	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1288	   Outside the US-ASCII repertoire, there are many more opportunities
1289	   for confusion; a complete set of guidelines is too lengthy to include
1290	   here.  As long as names are limited to characters from a single
1291	   script, native writers of a given script or language will know best
1292	   when ambiguities can appear, and how they can be avoided.  What may
1293	   look ambiguous to a stranger may be completely obvious to the average
1294	   native user.  On the other hand, in some cases, the UCS contains
1295	   variants for compatibility reasons; for example, for typographic
1296	   purposes.  These should be avoided wherever possible.  Although there
1297	   may be exceptions, newly created resource names should generally be
1298	   in NFKC [UTR15] (which means that they are also in NFC).

1300	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1301	   compatibility reasons.  Wherever possible, IRIs should use the two
1302	   letters "f" and "i" rather than the "fi" ligature.  An example where
1303	   the latter may be used is in the query part of an IRI for an explicit
1304	   search for a word written with the "fi" ligature.
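
   Whether a candidate resource name is already in NFKC can be checked
   with a few lines of Python.  This is a sketch using the standard
   unicodedata module; the function name is_nfkc is an assumption made
   for this example.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """Return True if name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name
```

   A name containing the ligature at U+FB01 is reported as not being in
   NFKC, and normalizing it yields the two letters "f" and "i".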

1306	   In certain cases, there is a chance that characters from different
1307	   scripts look the same.  The best known example is the similarity of
1308	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1309	   such cases, IRIs should only be created where all the characters in a
1310	   single component are used together in a given language.  This usually
1311	   means that all of these characters will be from the same script, but
1312	   there are languages that mix characters from different scripts (such
1313	   as Japanese).  This is similar to the heuristics used to distinguish
1314	   between letters and numbers in the examples above.  Also, for Latin,
1315	   Greek, and Cyrillic, using lowercase letters results in fewer
1316	   ambiguities than using uppercase letters would.
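
   A rough check for components that mix scripts can be sketched in
   Python as follows.  The standard library does not expose the Unicode
   Script property, so this example approximates it by taking the first
   word of each letter's character name; that approximation, and the
   function name scripts_used, are assumptions made for illustration
   only.  As noted above, a mixed result is not necessarily an error
   (for example, in Japanese).

```python
import unicodedata


def scripts_used(component: str) -> set:
    """Approximate the set of scripts used by the letters of a component."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts
```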

1318	8.6.  Display of URIs/IRIs

1320	   In situations where the rendering software is not expected to display
1321	   non-ASCII parts of the IRI correctly using the available layout and
1322	   font resources, these parts should be percent-encoded before being
1323	   displayed.
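
   A minimal Python sketch of such a display fallback follows.  It
   percent-encodes every non-ASCII character using UTF-8 and leaves
   ASCII characters, including existing percent-encodings, untouched.
   It treats the IRI as an opaque string and does not apply any
   special handling to the domain name part; the function name is
   hypothetical.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Percent-encode (as UTF-8) all characters outside US-ASCII."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```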

1325	   For display of Bidi IRIs, please see [Bidi].

1327	8.7.  Interpretation of URIs and IRIs

1329	   Software that interprets IRIs as the names of local resources should
1330	   accept IRIs in multiple forms and convert and match them with the
1331	   appropriate local resource names.

1333	   First, these forms include both IRIs in the native character
1334	   encoding of the protocol and their URI counterparts.

1336	   Second, they may include URIs constructed using character encodings
1337	   other than UTF-8.  These URIs may be produced by user agents that do
1338	   not conform to this specification and that use legacy character
1339	   encodings to convert non-ASCII characters to URIs.  Whether this is
1340	   necessary, and what character encodings to cover, depends on a number
1341	   of factors, such as the legacy character encodings used locally and
1342	   the distribution of various versions of user agents.  For example,
1343	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
1344	   addition to UTF-8.
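
   One way to accept such URIs is to try a short list of candidate
   encodings in order, as in the following Python sketch.  The order of
   the candidates and the function name decode_segment are assumptions
   made for this example.  Trial decoding is a heuristic: a byte
   sequence may be valid in more than one encoding, so the first match
   is not guaranteed to be the encoding the sender intended.

```python
from urllib.parse import unquote_to_bytes


def decode_segment(segment: str,
                   encodings=("utf-8", "shift_jis", "euc_jp")):
    """Undo percent-encoding and decode using the first encoding that fits.

    Returns a (text, encoding) pair; raises ValueError if none fits.
    """
    raw = unquote_to_bytes(segment)
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("segment is not valid in any candidate encoding")
```

   Trying UTF-8 first is useful because text in legacy encodings is
   rarely also well-formed UTF-8.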

1346	   Third, they may include additional mappings to be more user-friendly
1347	   and robust against transmission errors.  These would be similar to
1348	   how some servers currently treat URIs as case insensitive or perform
1349	   additional matching to account for spelling errors.  For characters
1350	   beyond the US-ASCII repertoire, this may, for example, include
1351	   ignoring the accents on received IRIs or resource names.  Please note
1352	   that such mappings, including case mappings, are language dependent.
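
   As an illustration of such an additional mapping, the following
   Python sketch derives a loose matching key by removing combining
   marks and folding case.  It is language independent and therefore
   only an approximation, for the reason given above; the function name
   loose_key is an assumption made for this example.

```python
import unicodedata


def loose_key(name: str) -> str:
    """Return a key that ignores accents and case differences."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop nonspacing combining marks (general category Mn).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()
```

   A server could compare loose_key() of a received name with
   loose_key() of each local resource name, after exact matching has
   failed.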

1354	   It can be difficult to identify a resource unambiguously if too many
1355	   mappings are taken into consideration.  However, percent-encoded and
1356	   not percent-encoded parts of IRIs can always be clearly
1357	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
1358	   the potential for collisions lower than it may seem at first.

1360	8.8.  Upgrading Strategy

1362	   Where this recommendation places further constraints on software for
1363	   which many instances are already deployed, it is important to
1364	   introduce upgrades carefully and to be aware of the various
1365	   interdependencies.

1367	   If IRIs cannot be interpreted correctly, they should not be created,
1368	   generated, or transported.  This suggests that upgrading URI
1369	   interpreting software to accept IRIs should have highest priority.

1371	   On the other hand, a single IRI is interpreted only by a single or
1372	   very few interpreters that are known in advance, although it may be
1373	   entered and transported very widely.

1375	   Therefore, IRIs benefit most from a broad upgrade of software to be
1376	   able to enter and transport IRIs.  However, before an individual IRI
1377	   is published, care should be taken to upgrade the corresponding
1378	   interpreting software in order to cover the forms expected to be
1379	   received by various versions of entry and transport software.

1381	   The upgrade of generating software to generate IRIs instead of using
1382	   a local character encoding should happen only after the service is
1383	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
1384	   when the service accepts IRIs and the intervening infrastructure and
1385	   protocol are known to transport them safely.

1387	   Software converting from URIs to IRIs for display should be upgraded
1388	   only after upgraded entry software has been widely deployed to the
1389	   population that will see the displayed result.

1391	   Where there is a free choice of character encodings, it is often
1392	   possible to reduce the effort and dependencies for upgrading to IRIs
1393	   by using UTF-8 rather than another encoding.  For example, when a new
1394	   file-based Web server is set up, using UTF-8 as the character
1395	   encoding for file names will make the transition to IRIs easier.
1396	   Likewise, when a new Web form is set up using UTF-8 as the character
1397	   encoding of the form page, the returned query URIs will use UTF-8 as
1398	   the character encoding (unless the user, for whatever reason, changes
1399	   the character encoding) and will therefore be compatible with IRIs.
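
   The effect of the form page's character encoding on the returned
   query can be sketched in Python with the standard urllib.parse
   module.  The field name "q" and the function name form_query are
   assumptions made for this example; the function only imitates what a
   user agent does when submitting a form.

```python
from urllib.parse import urlencode


def form_query(fields: dict, page_encoding: str = "utf-8") -> str:
    """Build a query string as a form in the given encoding would."""
    return urlencode(fields, encoding=page_encoding)
```

   With UTF-8, the value "r&#xE9;sum&#xE9;" is returned as
   "q=r%C3%A9sum%C3%A9", which corresponds directly to the IRI form.
   With ISO-8859-1 it is returned as "q=r%E9sum%E9", which does not.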

1401	   These recommendations, when taken together, will allow for the
1402	   extension from URIs to IRIs in order to handle characters other than
1403	   US-ASCII while minimizing interoperability problems.  For
1404	   considerations regarding the upgrade of URI scheme definitions, see
1405	   Section 5.4.

1407	9.  IANA Considerations

1409	   This specification does not affect IANA.  For details on how to
1410	   define a URI/IRI scheme and register it with IANA, see [RFC4395bis].

1412	10.  Security Considerations

1414	   The security considerations discussed in [RFC3986] also apply to
1415	   IRIs.  In addition, the following issues require particular care for
1416	   IRIs.

1418	   Incorrect encoding or decoding can lead to security problems.  For
1419	   example, some UTF-8 decoders do not check against overlong byte
1420	   sequences.  See [UTR36] Section 3 for details.
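
   A decoder that checks for such sequences can be sketched in Python
   as follows.  Python's built-in UTF-8 decoder rejects overlong
   sequences in strict mode, so the sketch only has to avoid lenient
   error handling; the function name strict_decode is an assumption
   made for this example.

```python
from urllib.parse import unquote_to_bytes


def strict_decode(percent_encoded: str) -> str:
    """Undo percent-encoding and decode as UTF-8, rejecting invalid input.

    Raises UnicodeDecodeError for ill-formed UTF-8, including overlong
    sequences such as C0 AF (an overlong encoding of "/").
    """
    return unquote_to_bytes(percent_encoded).decode("utf-8", errors="strict")
```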

1422	   There are serious difficulties with relying on a human to verify that
1423	   an IRI (whether presented visually or aurally) is the same as
1424	   another IRI or is the one intended.  These problems exist with ASCII-
1425	   only URIs (bl00mberg.com vs. bloomberg.com) but are strongly
1426	   exacerbated when using the much larger character repertoire of
1427	   Unicode.  For details, see Section 2 of [UTR36].  Using
1428	   administrative and technical means to reduce the availability of such
1429	   exploits is possible, but they are difficult to eliminate altogether.
1430	   User agents SHOULD NOT rely on visual or perceptual comparison or
1431	   verification of IRIs as a means of validating or assuring safety,
1432	   correctness or appropriateness of an IRI.  Other means of presenting
1433	   users with the validity, safety, or appropriateness of visited sites
1434	   are being developed in the browser community as an alternative means
1435	   of avoiding these difficulties.

1437	   Besides the large character repertoire of Unicode, reasons for
1438	   confusion include different forms of normalization and different
1439	   normalization expectations, use of percent-encoding with various
1440	   legacy encodings, and bidirectionality issues.  See also [Bidi].

1442	   Confusion can occur in various IRI components, such as the domain
1443	   name part or the path part, or between IRI components.  For
1444	   considerations specific to the domain name part, see [RFC5890].  For
1445	   considerations specific to particular protocols or schemes, see the
1446	   security sections of the relevant specifications and registration
1447	   templates.  Administrators of sites that allow independent users to
1448	   create resources in the same sub area have to be careful.  Details
1449	   are discussed in Section 8.5.

1451	   The characters additionally allowed in Legacy Extended IRIs introduce
1452	   additional security issues.  For details, see Section 6.3.

1454	11.  Acknowledgements

1456	   This document was derived from [RFC3987]; the acknowledgments from
1457	   that specification still apply.

1459	   In addition, this document was influenced by contributions from (in
1460	   no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
1461	   John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
1462	   Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
1463	   Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
1464	   Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
1465	   Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
1466	   Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
1467	   Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
1468	   Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
1469	   Tobias, Marko Martin, Maciej Stachowiak, Wil Tan, Yui Naruse,
1470	   Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
1471	   Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex
1472	   Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland,
1473	   Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas
1474	   Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev.

1476	   Anne van Kesteren is also gratefully acknowledged for his ongoing
1477	   work documenting browser behavior with respect to URIs/IRIs/URLs (see
1478	   [HTMLURL]).

1480	12.  Main Changes Since RFC 3987

1482	   This section describes the main changes since [RFC3987].

1484	12.1.  Split out Bidi, processing guidelines, comparison sections

1486	   Moved some components (comparison, bidi, processing) into separate
1487	   documents.

1489	12.2.  Major restructuring of IRI processing model

1491	   Major restructuring of IRI processing model to make scheme-specific
1492	   translation necessary to handle IDNA requirements and for consistency
1493	   with web implementations.

1495	   Starting with an IRI, you want one of:

1497	   a  IRI components (IRI parsed into UTF-8 pieces)

1499	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

1501	   c  whole URI (for passing on to some other system that wants whole
1502	      URIs)

1504	12.2.1.  OLD WAY

1506	   1.  Percent-encode the whole IRI to obtain a URI. (c1) If you want a
1507	       (maybe broken) whole URI, you might stop here.

1509	   2.  Parse the URI into URI components. (b1) If you want (maybe
1510	       broken) URI components, stop here.

1512	   3.  Decode the components (undoing the percent-encoding). (a) If you
1513	       want IRI components, stop here.

1515	   4.  Reencode the components, using a different encoding for some of
1516	       them (domain names, and query components in web pages, depending
1517	       on the component, scheme, and context) and percent-encoding for
1518	       the rest. (b2) If you want (good) URI components, stop here.

1520	   5.  Reassemble the reencoded components. (c2) If you want a (*good*)
1521	       whole URI, stop here.

1523	12.2.2.  NEW WAY

1525	   1.  Parse the IRI into IRI components using the generic syntax. (a)
1526	       If you want IRI components, stop here.

1528	   2.  Encode each component, using percent-encoding, IDN encoding, or
1529	       special query part encoding depending on the component, scheme,
1530	       or context. (b) If you want URI components, stop here.

1532	   3.  Reassemble a whole URI from the URI components. (c) If you want a
1533	       whole URI, stop here.
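
   The three steps above can be sketched in Python using the standard
   urllib.parse module.  This is an illustration under several
   assumptions, not a conforming implementation: the authority is
   assumed to be a host name with an optional port (no userinfo and no
   IPv6 literal), the query part is always percent-encoded as UTF-8
   (no special query part encoding), and Python's built-in "idna" codec
   implements IDNA 2003 rather than the IDNA specifications referenced
   by this document.  The names iri_to_uri, PATH_SAFE, and QUERY_SAFE
   are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# ASCII characters left untouched; "%" is kept so that existing
# percent-encodings are not encoded a second time.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri: str) -> str:
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component in the way appropriate for it.
    host, sep, port = parts.netloc.partition(":")
    ascii_host = host.encode("idna").decode("ascii") if host else ""
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit(
        (parts.scheme, ascii_host + sep + port, path, query, fragment)
    )
```

   Because parsing happens before encoding, the host name can be given
   an "xn--" form while the path and query are percent-encoded, which
   is the component-specific treatment that the old way could only
   reach by decoding and reencoding.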

1535	12.2.3.  Extension of Syntax

1537	   Added the tag range (U+E0000-E0FFF) to the iprivate production.  Some
1538	   IRIs generated with the new syntax may fail to pass very strict
1539	   checks relying on the old syntax.  But characters in this range
1540	   should be extremely infrequent anyway.

1542	12.2.4.  More to be added

1544	   TODO: There are more main changes that need to be documented in this
1545	   section.

1547	12.3.  Change Log

1549	   Note to RFC Editor: Please completely remove this section before
1550	   publication.

1552	12.3.1.  Changes after draft-ietf-iri-3987bis-01

1554	   Changes from draft-ietf-iri-3987bis-01 onwards are available as
1555	   changesets in the IETF tools subversion repository at http://
1556	   trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
1557	   draft-ietf-iri-3987bis.xml.

1559	12.3.2.  Changes from draft-duerst-iri-bis-07 to
1560	         draft-ietf-iri-3987bis-00

1562	   Changed draft name, date, last paragraph of abstract, and titles in
1563	   change log, and added this section in moving from
1564	   draft-duerst-iri-bis-07 (personal submission) to
1565	   draft-ietf-iri-3987bis-00 (WG document).

1567	12.3.3.  Changes from -06 to -07 of draft-duerst-iri-bis

1569	   Major restructuring of the processing model, see Section 12.2.

1571	12.4.  Changes from -00 to -01

1573	   o  Removed 'mailto:' before mail addresses of authors.

1575	   o  Added "<to be done>" as right side of 'href-strip' rule.  Fixed
1576	      '|' to '/' for alternatives.

1578	12.5.  Changes from -05 to -06 of draft-duerst-iri-bis

1580	   o  Add HyperText Reference, change abstract, acks and references for
1581	      it

1583	   o  Add Masinter back as another editor.

1585	   o  Masinter integrates HRef material from HTML5 spec.

1587	   o  Rewrite introduction sections to modernize.

1589	12.6.  Changes from -04 to -05 of draft-duerst-iri-bis

1591	   o  Updated references.

1593	   o  Changed IPR text to pre5378Trust200902.

1595	12.7.  Changes from -03 to -04 of draft-duerst-iri-bis

1597	   o  Added explicit abbreviation for LEIRIs.

1599	   o  Mentioned LEIRI references.

1601	   o  Completed text in LEIRI section about tag characters and about
1602	      specials.

1604	12.8.  Changes from -02 to -03 of draft-duerst-iri-bis

1606	   o  Updated some references.

1608	   o  Updated Michel Suignard's coordinates.

1610	12.9.  Changes from -01 to -02 of draft-duerst-iri-bis

1612	   o  Added tag range to iprivate (issue private-include-tags-115).

1614	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

1616	12.10.  Changes from -00 to -01 of draft-duerst-iri-bis

1618	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
1619	      based on input from the W3C XML Core WG.  Moved the relevant
1620	      subsections to the back and promoted them to a section.

1622	   o  Added some text re.  Legacy Extended IRIs to the security section.

1624	   o  Added an IANA Considerations Section.

1626	   o  Added this Change Log Section.

1628	   o  Added a section about "IRIs with Spaces/Controls" (converting from
1629	      a Note in RFC 3987).

1631	12.11.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

1633	      Fixed errata (see
1634	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

1636	13.  References

1638	13.1.  Normative References

1640	   [ASCII]    American National Standards Institute, "Coded Character
1641	              Set -- 7-bit American Standard Code for Information
1642	              Interchange", ANSI X3.4, 1986.

1644	   [ISO10646]
1645	              International Organization for Standardization, "ISO/IEC
1646	              10646:2011: Information Technology - Universal Multiple-
1647	              Octet Coded Character Set (UCS)", ISO Standard 10646,
1648	              March 2011, <http://standards.iso.org/ittf/
1649	              PubliclyAvailableStandards/
1650	              c051273_ISO_IEC_10646_2011(E).zip>.

1652	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1653	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1655	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1656	              Profile for Internationalized Domain Names (IDN)",
1657	              RFC 3491, March 2003.

1659	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1660	              Resource Identifier (URI): Generic Syntax", STD 66,
1661	              RFC 3986, January 2005.

1663	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1664	              Applications (IDNA): Definitions and Document Framework",
1665	              RFC 5890, August 2010.

1667	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
1668	              Applications (IDNA): Protocol", RFC 5891, August 2010.

1670	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
1671	              Internationalized Domain Names for Applications (IDNA)",
1672	              RFC 5892, August 2010.

1674	   [STD63]    Yergeau, F., "UTF-8, a transformation format of ISO
1675	              10646", STD 63, RFC 3629, November 2003.

1677	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
1678	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

1680	   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
1681	              6.2.0 (Mountain View, CA, The Unicode Consortium, 2012,
1682	              ISBN 978-1-936213-07-8)", October 2012.

1684	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
1685	              Unicode Standard Annex #15, March 2008,
1686	              <http://www.unicode.org/unicode/reports/tr15/
1687	              tr15-23.html>.

1689	13.2.  Informative References

1691	   [Bidi]     Duerst, M., Masinter, L., and A. Allawi, "Guidelines for
1692	              Internationalized Resource Identifiers with Bi-directional
1693	              Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-02
1694	              (work in progress), March 2012.

1696	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
1697	              Texin, "Character Model for the World Wide Web 1.0:
1698	              Resource Identifiers", W3C Candidate Recommendation CR-
1699	              charmod-resid-20041122, November 2004,
1700	              <http://www.w3.org/TR/2004/CR-charmod-resid/>.

1702	   [Duerst97]
1703	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
1704	              11th International Unicode Conference, San Jose,
1705	              September 1997,
1706	              <http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf>.

1708	   [Equivalence]
1709	              Masinter, L. and M. Duerst, "Equivalence and
1710	              Canonicalization of Internationalized Resource Identifiers
1711	              (IRIs)", draft-ietf-iri-comparison-01 (work in progress),
1712	              March 2012.

1714	   [Gettys]   Gettys, J., "URI Model Consequences",
1715	              <http://www.w3.org/DesignIssues/ModelConsequences>.

1717	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
1718	              Specification", W3C Recommendation REC-html401-19991224,
1719	              December 1999, <http://www.w3.org/TR/1999/REC-html401>.

1721	   [HTMLURL]  van Kesteren, A., "URL", October 2012,
1722	              <http://url.spec.whatwg.org/>.

1724	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

1726	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

1728	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1729	              Languages", BCP 18, RFC 2277, January 1998.

1731	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

1733	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1734	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
1735	              August 1998.

1737	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
1738	              August 1998.

1740	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1741	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1742	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

1744	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
1745	              Protocol", RFC 2640, July 1999.

1747	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
1748	              Identifiers (IRIs)", RFC 3987, January 2005.

1750	   [RFC4395bis]
1751	              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
1752	              Registration Procedures for New URI/IRI Schemes",
1753	              draft-ietf-iri-4395bis-irireg-04 (work in progress),
1754	              December 2011.

1756	   [RFC5122]  Saint-Andre, P., "Internationalized Resource Identifiers
1757	              (IRIs) and Uniform Resource Identifiers (URIs) for the
1758	              Extensible Messaging and Presence Protocol (XMPP)",
1759	              RFC 5122, February 2008.

1761	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1762	              Encodings for Internationalized Domain Names", RFC 6055,
1763	              February 2011.

1765	   [RFC6068]  Duerst, M., Masinter, L., and J. Zawinski, "The 'mailto'
1766	              URI Scheme", RFC 6068, October 2010.

1768	   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
1769	              Internationalization in the IETF", BCP 166, RFC 6365,
1770	              September 2011.

1772	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
1773	              Markup Languages", Unicode Technical Report #20, World
1774	              Wide Web Consortium Note, June 2003,
1775	              <http://www.w3.org/TR/unicode-xml/>.

1777	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
1778	              Considerations", Unicode Technical Report #36,
1779	              August 2010, <http://unicode.org/reports/tr36/>.

1781	   [XLink]    DeRose, S., Maler, E., Orchard, D., and N. Walsh, "XML
1782	              Linking Language (XLink) Version 1.1", W3C
1783	              Recommendation REC-xlink11-20100506, May 2010,
1784	              <http://www.w3.org/TR/xlink11/#link-locators>.

1786	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
1787	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
1788	              Edition)", W3C Recommendation REC-xml-20081126,
1789	              November 2008, <http://www.w3.org/TR/2008/REC-xml/>.

1791	   [XMLSchema]
1792	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes
1793	              Second Edition", W3C Recommendation REC-xmlschema-2-
1794	              20041028, October 2004,
1795	              <http://www.w3.org/TR/xmlschema-2/#anyURI>.

1797	   [XPointer]
1798	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
1799	              Framework", W3C Recommendation REC-xptr-framework-
1800	              20030325, March 2003,
1801	              <http://www.w3.org/TR/xptr-framework/#escaping>.

1803	Authors' Addresses

1805	   Martin J. Duerst (Note: Please write "Duerst" with u-umlaut wherever
1806	                 possible, for example as "D&#252;rst" in XML and HTML.)
1807	   Aoyama Gakuin University
1808	   5-10-1 Fuchinobe
1809	   Chuo-ku
1810	   Sagamihara, Kanagawa  252-5258
1811	   Japan

1813	   Phone: +81 42 759 6329
1814	   Fax:   +81 42 759 6495
1815	   Email: duerst@it.aoyama.ac.jp
1816	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
1817	          (Note: This is the percent-encoded form of an IRI)

1819	   Michel Suignard
1820	   Unicode Consortium
1821	   P.O. Box 391476
1822	   Mountain View, CA  94039-1476
1823	   U.S.A.

1825	   Phone: +1-650-693-3921
1826	   Email: michel@unicode.org
1827	   URI:   http://www.suignard.com

1828	   Larry Masinter
1829	   Adobe
1830	   345 Park Ave
1831	   San Jose, CA  95110
1832	   U.S.A.

1834	   Phone: +1-408-536-3024
1835	   Email: masinter@adobe.com
1836	   URI:   http://larry.masinter.net