idnits 2.17.1 

draft-ietf-iri-3987bis-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this.  It does mention RFC3987
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer. 
     Otherwise, the disclaimer is needed and you can ignore this comment. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 2, 2012) is 4438 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'LEIRI' is defined on line 1730, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2045' is defined on line 1735, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6082' is defined on line 1783, but no explicit
     reference was found in the text

  == Unused Reference: 'XMLNamespace' is defined on line 1806, but no
     explicit reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  == Outdated reference: A later version (-03) exists of
     draft-ietf-iri-bidi-guidelines-00

  == Outdated reference: A later version (-02) exists of
     draft-ietf-iri-comparison-00

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  == Outdated reference: A later version (-04) exists of
     draft-ietf-iri-4395bis-irireg-03


     Summary: 1 error (**), 0 flaws (~~), 11 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internationalized Resource Identifiers                         M. Duerst
3	(iri)                                           Aoyama Gakuin University
4	Internet-Draft                                               M. Suignard
5	Obsoletes: 3987 (if approved)                         Unicode Consortium
6	Intended status: Standards Track                             L. Masinter
7	Expires: September 3, 2012                                         Adobe
8	                                                           March 2, 2012

10	             Internationalized Resource Identifiers (IRIs)
11	                       draft-ietf-iri-3987bis-10

13	Abstract

15	   This document defines the Internationalized Resource Identifier (IRI)
16	   protocol element, as an extension of the Uniform Resource Identifier
17	   (URI).  An IRI is a sequence of characters from the Universal
18	   Character Set (Unicode/ISO 10646).  Grammar and processing rules are
19	   given for IRIs and related syntactic forms.

21	   Defining IRI as a new protocol element (rather than updating or
22	   extending the definition of URI) allows independent orderly
23	   transitions: other protocols and languages that use URIs must
24	   explicitly choose to allow IRIs.

26	   Guidelines are provided for the use and deployment of IRIs and
27	   related protocol elements when revising protocols, formats, and
28	   software components that currently deal only with URIs.

30	   This document is part of a set of documents intended to replace RFC
31	   3987.

33	RFC Editor: Please remove the next paragraph before publication.

35	   This (and several companion documents) are intended to obsolete RFC
36	   3987, and also move towards IETF Draft Standard.  For discussion and
37	   comments on these drafts, please join the IETF IRI WG by subscribing
38	   to the mailing list public-iri@w3.org, archives at
39	   http://lists.w3.org/archives/public/public-iri/.  For a list of open
40	   issues, please see the issue tracker of the WG at
41	   http://trac.tools.ietf.org/wg/iri/trac/report/1.  For a list of
42	   individual edits, please see the change history at
43	   http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.

45	Status of this Memo

47	   This Internet-Draft is submitted in full conformance with the
48	   provisions of BCP 78 and BCP 79.

50	   Internet-Drafts are working documents of the Internet Engineering
51	   Task Force (IETF).  Note that other groups may also distribute
52	   working documents as Internet-Drafts.  The list of current Internet-
53	   Drafts is at http://datatracker.ietf.org/drafts/current/.

55	   Internet-Drafts are draft documents valid for a maximum of six months
56	   and may be updated, replaced, or obsoleted by other documents at any
57	   time.  It is inappropriate to use Internet-Drafts as reference
58	   material or to cite them other than as "work in progress."

60	   This Internet-Draft will expire on September 3, 2012.

62	Copyright Notice

64	   Copyright (c) 2012 IETF Trust and the persons identified as the
65	   document authors.  All rights reserved.

67	   This document is subject to BCP 78 and the IETF Trust's Legal
68	   Provisions Relating to IETF Documents
69	   (http://trustee.ietf.org/license-info) in effect on the date of
70	   publication of this document.  Please review these documents
71	   carefully, as they describe your rights and restrictions with respect
72	   to this document.  Code Components extracted from this document must
73	   include Simplified BSD License text as described in Section 4.e of
74	   the Trust Legal Provisions and are provided without warranty as
75	   described in the Simplified BSD License.

77	   This document may contain material from IETF Documents or IETF
78	   Contributions published or made publicly available before November
79	   10, 2008.  The person(s) controlling the copyright in some of this
80	   material may not have granted the IETF Trust the right to allow
81	   modifications of such material outside the IETF Standards Process.
82	   Without obtaining an adequate license from the person(s) controlling
83	   the copyright in such materials, this document may not be modified
84	   outside the IETF Standards Process, and derivative works of it may
85	   not be created outside the IETF Standards Process, except to format
86	   it for publication as an RFC or to translate it into languages other
87	   than English.

89	Table of Contents

91	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
92	     1.1.   Overview and Motivation . . . . . . . . . . . . . . . . .  5
93	     1.2.   Applicability . . . . . . . . . . . . . . . . . . . . . .  6
94	     1.3.   Definitions . . . . . . . . . . . . . . . . . . . . . . .  7
95	     1.4.   Notation  . . . . . . . . . . . . . . . . . . . . . . . .  8
96	   2.  IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  9
97	     2.1.   Summary of IRI Syntax . . . . . . . . . . . . . . . . . .  9
98	     2.2.   ABNF for IRI References and IRIs  . . . . . . . . . . . . 10
99	   3.  Processing IRIs and related protocol elements  . . . . . . . . 13
100	     3.1.   Converting to UCS . . . . . . . . . . . . . . . . . . . . 13
101	     3.2.   Parse the IRI into IRI components . . . . . . . . . . . . 13
102	     3.3.   General percent-encoding of IRI components  . . . . . . . 14
103	     3.4.   Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14
104	       3.4.1.  Mapping using Percent-Encoding . . . . . . . . . . . . 14
105	       3.4.2.  Mapping using Punycode . . . . . . . . . . . . . . . . 15
106	       3.4.3.  Additional Considerations  . . . . . . . . . . . . . . 15
107	     3.5.   Mapping query components  . . . . . . . . . . . . . . . . 16
108	     3.6.   Mapping IRIs to URIs  . . . . . . . . . . . . . . . . . . 16
109	   4.  Converting URIs to IRIs  . . . . . . . . . . . . . . . . . . . 16
110	     4.1.   Examples  . . . . . . . . . . . . . . . . . . . . . . . . 18
111	   5.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 19
112	     5.1.   Limitations on UCS Characters Allowed in IRIs . . . . . . 19
113	     5.2.   Software Interfaces and Protocols . . . . . . . . . . . . 20
114	     5.3.   Format of URIs and IRIs in Documents and Protocols  . . . 21
115	     5.4.   Use of UTF-8 for Encoding Original Characters . . . . . . 21
116	     5.5.   Relative IRI References . . . . . . . . . . . . . . . . . 23
117	   6.  Legacy Extended IRIs (LEIRIs)  . . . . . . . . . . . . . . . . 23
118	     6.1.   Legacy Extended IRI Syntax  . . . . . . . . . . . . . . . 23
119	     6.2.   Conversion of Legacy Extended IRIs to IRIs  . . . . . . . 24
120	     6.3.   Characters Allowed in Legacy Extended IRIs but not in
121	            IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . . 24
122	   7.  URI/IRI Processing Guidelines (Informative)  . . . . . . . . . 26
123	     7.1.   URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26
124	     7.2.   URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26
125	     7.3.   URI/IRI Transfer between Applications . . . . . . . . . . 27
126	     7.4.   URI/IRI Generation  . . . . . . . . . . . . . . . . . . . 27
127	     7.5.   URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 28
128	     7.6.   Display of URIs/IRIs  . . . . . . . . . . . . . . . . . . 29
129	     7.7.   Interpretation of URIs and IRIs . . . . . . . . . . . . . 29
130	     7.8.   Upgrading Strategy  . . . . . . . . . . . . . . . . . . . 30
131	   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 31
132	   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 31
133	   10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32
134	   11. Main Changes Since RFC 3987  . . . . . . . . . . . . . . . . . 33
135	     11.1.  Split out Bidi, processing guidelines, comparison
136	            sections  . . . . . . . . . . . . . . . . . . . . . . . . 33

138	     11.2.  Major restructuring of IRI processing model . . . . . . . 33
139	       11.2.1. OLD WAY  . . . . . . . . . . . . . . . . . . . . . . . 33
140	       11.2.2. NEW WAY  . . . . . . . . . . . . . . . . . . . . . . . 34
141	       11.2.3. Extension of Syntax  . . . . . . . . . . . . . . . . . 34
142	       11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 34
143	     11.3.  Change Log  . . . . . . . . . . . . . . . . . . . . . . . 34
144	       11.3.1. Changes after draft-ietf-iri-3987bis-01  . . . . . . . 34
145	       11.3.2. Changes from draft-duerst-iri-bis-07 to
146	               draft-ietf-iri-3987bis-00  . . . . . . . . . . . . . . 34
147	       11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis  . . . 34
148	     11.4.  Changes from -00 to -01 . . . . . . . . . . . . . . . . . 35
149	     11.5.  Changes from -05 to -06 of draft-duerst-iri-bis-00  . . . 35
150	     11.6.  Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35
151	     11.7.  Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35
152	     11.8.  Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35
153	     11.9.  Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35
154	     11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36
155	     11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis  . . 36
156	   12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36
157	     12.1.  Normative References  . . . . . . . . . . . . . . . . . . 36
158	     12.2.  Informative References  . . . . . . . . . . . . . . . . . 37
159	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40

161	1.  Introduction

163	1.1.  Overview and Motivation

165	   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
166	   sequence of characters chosen from a limited subset of the repertoire
167	   of US-ASCII [ASCII] characters.

169	   The characters in URIs are frequently used for representing words of
170	   natural languages.  This usage has many advantages: Such URIs are
171	   easier to memorize, easier to interpret, easier to transcribe, easier
172	   to create, and easier to guess.  For most languages other than
173	   English, however, the natural script uses characters other than A -
174	   Z. For many people, handling Latin characters is as difficult as
175	   handling the characters of other scripts is for those who use only
176	   the Latin alphabet.  Many languages with non-Latin scripts are
177	   transcribed with Latin letters.  These transcriptions are now often
178	   used in URIs, but they introduce additional difficulties.

180	   The infrastructure for the appropriate handling of characters from
181	   additional scripts is now widely deployed in operating system and
182	   application software.  Software that can handle a wide variety of
183	   scripts and languages at the same time is increasingly common.  Also,
184	   an increasing number of protocols and formats can carry a wide range
185	   of characters.

187	   URIs are composed out of a very limited repertoire of characters;
188	   this design choice was made to support global transcription ([RFC3986]
189	   section 1.2.1.).  Reliable transition between a URI (as an abstract
190	   protocol element composed of a sequence of characters) and a
191	   presentation of that URI (written on a napkin, read out loud) and
192	   back is relatively straightforward, because of the limited repertoire
193	   of characters used.  IRIs are designed to satisfy a different set of
194	   use requirements; in particular, to allow IRIs to be written in ways
195	   that are more meaningful to their users, even at the expense of
196	   global transcribability.  However, ensuring reliability of the
197	   transition between an IRI and its presentation and back is more
198	   difficult and complex when dealing with the larger set of Unicode
199	   characters.  For example, Unicode supports multiple ways of encoding
200	   complex combinations of characters and accents, with multiple
201	   character sequences that can result in the same presentation.
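
The point about multiple character sequences can be illustrated with a minimal Python sketch. It assumes Unicode Normalization Form C as the comparison method; the paragraph above does not prescribe any particular method, and the helper name same_after_nfc is hypothetical.

```python
import unicodedata

# Two character sequences that usually render identically.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after applying Normalization Form C.

    Using NFC here is an assumption made for illustration only.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                # False
print(same_after_nfc(precomposed, decomposed))  # True
```

A code-point-by-code-point comparison treats the two strings as different even though a reader would probably see the same text.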

203	   This document defines the protocol element called Internationalized
204	   Resource Identifier (IRI), which allows applications of URIs to be
205	   extended to use resource identifiers that have a much wider
206	   repertoire of characters.  It also provides corresponding
207	   "internationalized" versions of other constructs from [RFC3986], such
208	   as URI references.  The syntax of IRIs is defined in Section 2.

210	   Within this document, Section 5 discusses the use of IRIs in
211	   different situations.  Section 7 gives additional informative
212	   guidelines.  Section 9 discusses IRI-specific security
213	   considerations.

215	   This specification is part of a collection of specifications intended
216	   to replace [RFC3987].  [Bidi] discusses the special case of
217	   bidirectional IRIs using characters from scripts written right-to-
218	   left.  [Equivalence] gives guidelines for applications wishing to
219	   determine if two IRIs are equivalent, as well as defining some
220	   equivalence methods.  [RFC4395bis] updates the URI scheme
221	   registration guidelines and procedures to note that every URI scheme
222	   is also automatically an IRI scheme and to allow scheme definitions
223	   to be directly described in terms of Unicode characters.

225	1.2.  Applicability

227	   IRIs are designed to allow protocols and software that deal with URIs
228	   to be updated to handle IRIs.  Processing of IRIs is accomplished by
229	   extending the URI syntax while retaining (and not expanding) the set
230	   of "reserved" characters, such that the syntax for any URI scheme may
231	   be extended to allow non-ASCII characters.  In addition, following
232	   parsing of an IRI, it is possible to construct a corresponding URI by
233	   first encoding characters outside of the allowed URI range and then
234	   reassembling the components.
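
The encoding step mentioned above can be sketched in Python as follows. This is a simplified illustration, not the full mapping of Section 3: it assumes the component has already been parsed out of the IRI, it assumes UTF-8 as the encoding, and it ignores component-specific handling such as the ireg-name and query mappings. The function name percent_encode_non_ascii is hypothetical.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode every non-US-ASCII character of a parsed component.

    Assumption: each such character is converted to its UTF-8 octets,
    and each octet is written as "%XX" with uppercase hex digits.
    US-ASCII characters, including reserved ones, are left untouched.
    """
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode_non_ascii("/r\u00E9sum\u00E9"))  # /r%C3%A9sum%C3%A9
```

Because reserved characters are never touched, the components can be reassembled afterwards with the same delimiters that were used for parsing.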

236	   Practical use of IRI forms in place of URI forms depends on the
237	   following conditions being met:

239	   a. A protocol or format element MUST be explicitly designated to be
240	      able to carry IRIs.  The intent is to avoid introducing IRIs into
241	      contexts that are not defined to accept them.  For example, XML
242	      schema [XMLSchema] has an explicit type "anyURI" that includes
243	      IRIs and IRI references.  Therefore, IRIs and IRI references can
244	      be in attributes and elements of type "anyURI".  On the other
245	      hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
246	      defined as a URI, which means that direct use of IRIs is not
247	      allowed in HTTP requests.

249	   b. The protocol or format carrying the IRIs MUST have a mechanism to
250	      represent the wide range of characters used in IRIs, either
251	      natively or by some protocol- or format-specific escaping
252	      mechanism (for example, numeric character references in [XML1]).

254	   c. The URI scheme definition, if it explicitly allows a percent sign
255	      ("%") in any syntactic component, SHOULD define the interpretation
256	      of sequences of percent-encoded octets (using "%XX" hex octets) as
257	      octets from sequences of UTF-8 encoded strings; this is recommended
258	      in the guidelines for registering new schemes, [RFC4395bis].  For
259	      example, this is the practice for IMAP URLs [RFC2192], POP URLs
260	      [RFC2384] and the URN syntax [RFC2141].  Note that use of
261	      percent-encoding may also be restricted in some situations, for
262	      example, URI schemes that disallow percent-encoding might still be
263	      used with a fragment identifier which is percent-encoded (e.g.,
264	      [XPointer]).  See Section 5.4 for further discussion.
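
Condition c can be illustrated with a short Python sketch that interprets percent-encoded octets as UTF-8. It uses unquote_to_bytes from the standard urllib.parse module; the wrapper name decode_pct_utf8 and the choice to return None for octets that are not valid UTF-8 are assumptions made for this example. The sketch decodes every "%XX" sequence, including those that stand for reserved characters, so it is not a URI-to-IRI conversion procedure.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_pct_utf8(component: str) -> Optional[str]:
    """Interpret the percent-encoded octets of a component as UTF-8.

    Returns the decoded string, or None when the octets do not form
    a valid UTF-8 sequence (for example, a lone %E9).
    """
    octets = unquote_to_bytes(component)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_pct_utf8("r%C3%A9sum%C3%A9"))  # résumé
print(decode_pct_utf8("r%E9sum%E9"))        # None
```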

266	1.3.  Definitions

268	   The following definitions are used in this document; they follow the
269	   terms in [RFC2130], [RFC2277], and [ISO10646].

271	   character:  A member of a set of elements used for the organization,
272	      control, or representation of data.  For example, "LATIN CAPITAL
273	      LETTER A" names a character.

275	   octet:  An ordered sequence of eight bits considered as a unit.

277	   character repertoire:  A set of characters (set in the mathematical
278	      sense).

280	   sequence of characters:  A sequence of characters (one after
281	      another).

283	   sequence of octets:  A sequence of octets (one after another).

285	   character encoding:  A method of representing a sequence of
286	      characters as a sequence of octets (maybe with variants).  Also, a
287	      method of (unambiguously) converting a sequence of octets into a
288	      sequence of characters.

290	   charset:  The name of a parameter or attribute used to identify a
291	      character encoding.

293	   UCS:  Universal Character Set. The coded character set defined by
294	      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].

296	   IRI reference:  Denotes the common usage of an Internationalized
297	      Resource Identifier.  An IRI reference may be absolute or
298	      relative.  However, the "IRI" that results from such a reference
299	      only includes absolute IRIs; any relative IRI references are
300	      resolved to their absolute form.  Note that in [RFC2396] URIs did
301	      not include fragment identifiers, but in [RFC3986] fragment
302	      identifiers are part of URIs.

304	   LEIRI (Legacy Extended IRI) processing:  This term was used in
305	      various XML specifications to refer to strings that, although not
306	      valid IRIs, were acceptable input to the processing rules in
307	      Section 6.2.

309	   running text:  Human text (paragraphs, sentences, phrases) with
310	      syntax according to orthographic conventions of a natural
311	      language, as opposed to syntax defined for ease of processing by
312	      machines (e.g., markup, programming languages).

314	   protocol element:  Any portion of a message that affects processing
315	      of that message by the protocol in question.

317	   create (a URI or IRI):  With respect to URIs and IRIs, the term is
318	      used for the initial creation.  This may be the initial creation
319	      of a resource with a certain identifier, or the initial exposition
320	      of a resource under a particular identifier.

322	   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
323	      used when the identifier is generated by derivation from other
324	      information.

326	   parsed URI component:  When a URI processor parses a URI (following
327	      the generic syntax or a scheme-specific syntax), the result is a
328	      set of parsed URI components, each of which has a type
329	      (corresponding to the syntactic definition) and a sequence of URI
330	      characters.

332	   parsed IRI component:  When an IRI processor parses an IRI directly,
333	      following the general syntax or a scheme-specific syntax, the
334	      result is a set of parsed IRI components, each of which has a type
335	      (corresponding to the syntactic definition) and a sequence of IRI
336	      characters.  (This definition is analogous to "parsed URI
337	      component".)

339	   IRI scheme:  A URI scheme may also be known as an "IRI scheme" if the
340	      scheme's syntax has been extended to allow non-US-ASCII characters
341	      according to the rules in this document.

343	1.4.  Notation

345	   RFCs and Internet Drafts currently do not allow any characters
346	   outside the US-ASCII repertoire.  Therefore, this document uses
347	   various special notations to denote such characters in examples.

349	   In text, characters outside US-ASCII are sometimes referenced by
350	   using a prefix of 'U+', followed by four to six hexadecimal digits.

352	   To represent characters outside US-ASCII in examples, this document
353	   uses 'XML Notation'.

355	   XML Notation uses a leading '&#x', a trailing ';', and the
356	   hexadecimal number of the character in the UCS in between.  For
357	   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
358	   notation, an actual '&' is denoted by '&amp;'.

360	   To denote actual octets in examples (as opposed to percent-encoded
361	   octets), the two hex digits denoting the octet are enclosed in "<"
362	   and ">".  For example, the octet often denoted as 0xc9 is denoted
363	   here as <c9>.
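
For readers who want to turn the examples in this document into actual characters, the two notations can be handled with a small Python sketch. The function names from_xml_notation and octet_notation are hypothetical, and the sketch covers only the constructs described above ('&#x...;', '&amp;', and '<xx>').

```python
import re

# Matches either a hexadecimal character reference or the escaped ampersand.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text: str) -> str:
    """Replace '&#xHHHH;' by the UCS character and '&amp;' by '&'."""
    def repl(match):
        if match.group(1) is None:       # the '&amp;' alternative matched
            return "&"
        return chr(int(match.group(1), 16))
    return _XML_NOTATION.sub(repl, text)


def octet_notation(data: bytes) -> str:
    """Write octets in the '<xx>' form used in this document."""
    return "".join("<{:02x}>".format(octet) for octet in data)


print(from_xml_notation("a&amp;b"))  # a&b
print(octet_notation(b"\xc9"))       # <c9>
```

The substitution runs in a single pass, so an escaped ampersand followed by '#x44F;' stays a literal string and is not decoded a second time.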

365	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
366	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
367	   and "OPTIONAL" are to be interpreted as described in [RFC2119].

369	2.  IRI Syntax

371	   This section defines the syntax of Internationalized Resource
372	   Identifiers (IRIs).

374	   As with URIs, an IRI is defined as a sequence of characters, not as a
375	   sequence of octets.  This definition accommodates the fact that IRIs
376	   may be written on paper or read over the radio as well as stored or
377	   transmitted digitally.  The same IRI might be represented as
378	   different sequences of octets in different protocols or documents if
379	   these protocols or documents use different character encodings
380	   (and/or transfer encodings).  Using the same character encoding as
381	   the containing protocol or document ensures that the characters in
382	   the IRI can be handled (e.g., searched, converted, displayed) in the
383	   same way as the rest of the protocol or document.
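
As an illustration of one character sequence yielding different octet sequences, the following Python sketch encodes a single character, U+00E9, in three character encodings and prints the octets in the '<xx>' notation of Section 1.4. The encodings chosen and the helper name octets_in are assumptions for the example only.

```python
def octets_in(text: str, encoding: str) -> str:
    """Encode text and show the resulting octets in '<xx>' notation."""
    return "".join("<{:02x}>".format(octet) for octet in text.encode(encoding))


e_acute = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

print(octets_in(e_acute, "iso-8859-1"))  # <e9>
print(octets_in(e_acute, "utf-8"))       # <c3><a9>
print(octets_in(e_acute, "utf-16-be"))   # <00><e9>
```

The character, and therefore the IRI containing it, is the same in all three cases; only the octets carried by the protocol or document differ.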

385	2.1.  Summary of IRI Syntax

387	   The IRI syntax extends the URI syntax in [RFC3986] by extending the
388	   class of unreserved characters, primarily by adding the characters of
389	   the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
390	   to the limitations given in the syntax rules below and in
391	   Section 5.1.

393	   The syntax and use of components and reserved characters is the same
394	   as that in [RFC3986].  Each "URI scheme" thus also functions as an
395	   "IRI scheme", in that scheme-specific parsing rules for URIs of a
396	   scheme are extended to allow parsing of IRIs using the same
397	   parsing rules.

399	   All the operations defined in [RFC3986], such as the resolution of
400	   relative references, can be applied to IRIs by IRI-processing
401	   software in exactly the same way as they are for URIs by URI-
402	   processing software.
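
A brief Python sketch of relative reference resolution applied to an IRI reference follows. It relies on urljoin from the standard urllib.parse module, which is written for URIs but accepts strings containing non-ASCII characters; treating it as IRI-processing software is an assumption of this example, and the base and reference values are made up.

```python
from urllib.parse import urljoin

base = "http://example.org/a/b"
reference = "../r\u00E9sum\u00E9"   # a relative IRI reference

# Resolution works on the "/" and ".." syntax only; the non-ASCII
# characters pass through unchanged, as they would in a URI segment.
resolved = urljoin(base, reference)
print(resolved)  # http://example.org/résumé
```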

404	   Characters outside the US-ASCII repertoire MUST NOT be reserved and
405	   therefore MUST NOT be used for syntactical purposes, such as to
406	   delimit components in newly defined schemes.  For example, U+00A2,
407	   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
408	   the 'iunreserved' category.  This is similar to the fact that it is
409	   not possible to use '-' as a delimiter in URIs, because it is in the
410	   'unreserved' category.

412	2.2.  ABNF for IRI References and IRIs

414	   An ABNF definition for IRI references (which are the most general
415	   concept and the start of the grammar) and IRIs is given here.  The
416	   syntax of this ABNF is described in [STD68].  Character numbers are
417	   taken from the UCS, without implying any actual binary encoding.
418	   Terminals in the ABNF are characters, not octets.

420	   The following grammar closely follows the URI grammar in [RFC3986],
421	   except that the range of unreserved characters is expanded to include
422	   UCS characters, with the restriction that private UCS characters can
423	   occur only in query parts.  The grammar is split into two parts:
424	   Rules that differ from [RFC3986] because of the above-mentioned
425	   expansion, and rules that are the same as those in [RFC3986].  For
426	   rules that are different than those in [RFC3986], the names of the
427	   non-terminals have been changed as follows.  If the non-terminal
428	   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
429	   has been prefixed.  The rule <pct-form> has been introduced in order
430	   to be able to reference it from other parts of the document.

432	   The following rules are different from those in [RFC3986]:

434	   IRI            = scheme ":" ihier-part [ "?" iquery ]
435	                    [ "#" ifragment ]

437	   ihier-part     = "//" iauthority ipath-abempty
438	                  / ipath-absolute
439	                  / ipath-rootless
440	                  / ipath-empty

442	   IRI-reference  = IRI / irelative-ref

444	   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

446	   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
447	   irelative-part = "//" iauthority ipath-abempty
448	                  / ipath-absolute
449	                  / ipath-noscheme
450	                  / ipath-empty

452	   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
453	   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
454	   ihost          = IP-literal / IPv4address / ireg-name

456	   pct-form       = pct-encoded

458	   ireg-name      = *( iunreserved / sub-delims )

460	   ipath          = ipath-abempty   ; begins with "/" or is empty
461	                  / ipath-absolute  ; begins with "/" but not "//"
462	                  / ipath-noscheme  ; begins with a non-colon segment
463	                  / ipath-rootless  ; begins with a segment
464	                  / ipath-empty     ; zero characters

466	   ipath-abempty  = *( path-sep isegment )
467	   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
468	   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
469	   ipath-rootless = isegment-nz *( path-sep isegment )
470	   ipath-empty    = ""
471	   path-sep       = "/"

473	   isegment       = *ipchar
474	   isegment-nz    = 1*ipchar
475	   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
476	                        / "@" )
477	                  ; non-zero-length segment without any colon ":"

479	   ipchar         = iunreserved / pct-form / sub-delims / ":"
480	                  / "@"

482	   iquery         = *( ipchar / iprivate / "/" / "?" )

484	   ifragment      = *( ipchar / "/" / "?" )

486	   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

488	   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
489	                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
490	                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
491	                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
492	                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
493	                  / %xD0000-DFFFD / %xE1000-EFFFD

495	   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
496	                  / %x100000-10FFFD

498	   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
499	   "greedy") algorithm applies.  For details, see [RFC3986].
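
The <ucschar> and <iprivate> rules above can be transcribed into a simple membership test. The Python sketch below copies the ranges from the grammar; the names UCSCHAR_RANGES, IPRIVATE_RANGES, is_ucschar, and is_iprivate are hypothetical, and the sketch does not apply the additional limitations of Section 5.1.

```python
# Ranges transcribed from the <ucschar> rule.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Ranges transcribed from the <iprivate> rule.
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xE0000, 0xE0FFF),
                   (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    """True if the single character ch matches <ucschar>."""
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    """True if the single character ch matches <iprivate>."""
    return _in_ranges(ch, IPRIVATE_RANGES)


print(is_ucschar("\u00A2"))   # True  (CENT SIGN)
print(is_ucschar("\uE000"))   # False (private use, only in <iprivate>)
print(is_iprivate("\uE000"))  # True
```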

501	   The following rules are the same as those in [RFC3986]:

503	   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

505	   port           = *DIGIT

507	   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

509	   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

511	   IPv6address    =                            6( h16 ":" ) ls32
512	                  /                       "::" 5( h16 ":" ) ls32
513	                  / [               h16 ] "::" 4( h16 ":" ) ls32
514	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
515	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
516	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
517	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
518	                  / [ *5( h16 ":" ) h16 ] "::"              h16
519	                  / [ *6( h16 ":" ) h16 ] "::"

521	   h16            = 1*4HEXDIG
522	   ls32           = ( h16 ":" h16 ) / IPv4address

524	   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

526	   dec-octet      = DIGIT                 ; 0-9
527	                  / %x31-39 DIGIT         ; 10-99
528	                  / "1" 2DIGIT            ; 100-199
529	                  / "2" %x30-34 DIGIT     ; 200-249
530	                  / "25" %x30-35          ; 250-255

532	   pct-encoded    = "%" HEXDIG HEXDIG

534	   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
535	   reserved       = gen-delims / sub-delims
536	   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
537	   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
538	                  / "*" / "+" / "," / ";" / "="

540	   This syntax does not support IPv6 scoped addressing zone identifiers.

542	3.  Processing IRIs and related protocol elements

544	   IRIs are meant to replace URIs in identifying resources within new
545	   versions of protocols, formats, and software components that use a
546	   UCS-based character repertoire.  Protocols and components may use and
547	   process IRIs directly.  However, there are still numerous systems and
548	   protocols which only accept URIs or components of parsed URIs; that
549	   is, they only accept sequences of characters within the subset of US-
550	   ASCII characters allowed in URIs.

552	   This section defines specific processing steps for IRI consumers
553	   which establish the relationship between the string given and the
554	   interpreted derivatives.  These processing steps apply to both IRIs
555	   and IRI references (i.e., absolute or relative forms); for IRIs, some
556	   steps are scheme specific.

558	3.1.  Converting to UCS

560	   Input that is already in a Unicode form (i.e., a sequence of Unicode
561	   characters or an octet-stream representing a Unicode-based character
562	   encoding such as UTF-8 or UTF-16) should be left as is and not
563	   normalized or changed.

565	   An IRI or IRI reference is a sequence of characters from the UCS.
566	   For input from presentations (written on paper, read aloud) or
567	   translation from other representations (a text stream using a legacy
568	   character encoding), convert the input to Unicode.  Note that some
569	   character encodings or transcriptions can be converted to or
570	   represented by more than one sequence of Unicode characters.  Ideally
571	   the resulting IRI would use a normalized form, such as Unicode
572	   Normalization Form C [UTR15], since that ensures a stable, consistent
573	   representation that is most likely to produce the intended results.
574	   Previous versions of this specification required normalization at
575	   this step.  However, attempts to require normalization in other
576	   protocols have met with strong enough resistance that requiring
577	   normalization here was considered impractical.  Implementers and
578	   users are cautioned that, while denormalized character sequences are
579	   valid, they might be difficult for other users or processes to
580	   reproduce and might lead to unexpected results.

582	3.2.  Parse the IRI into IRI components

584	   Parse the IRI, either as a relative reference (no scheme) or using
585	   scheme specific processing (according to the scheme given); the
586	   result is a set of parsed IRI components.

588	3.3.  General percent-encoding of IRI components

590	   Except as noted in the following subsections, IRI components are
591	   mapped to the equivalent URI components by percent-encoding those
592	   characters not allowed in URIs.  Previous processing steps will have
593	   removed some characters, and the interpretation of reserved
594	   characters will have already been done (with the syntactic reserved
595	   characters outside of the IRI component).  This mapping is defined
596	   for all sequences of Unicode characters, whether or not they are
597	   valid for the component in question.

599	   For each character which is not allowed anywhere in a valid URI apply
600	   the following steps.

602	   Convert to UTF-8  Convert the character to a sequence of one or more
603	      octets using UTF-8 [RFC3629].

605	   Percent encode  Convert each octet of this sequence to %HH, where HH
606	      is the hexadecimal notation of the octet value.  The hexadecimal
607	      notation SHOULD use uppercase letters.  (This is the general URI
608	      percent-encoding mechanism in Section 2.1 of [RFC3986].)

610	   Note that the mapping is an identity transformation for parsed URI
611	   components of valid URIs, and is idempotent: applying the mapping a
612	   second time will not change anything.
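
   As an illustration only (not part of this specification), the two
   steps above might be sketched in Python as follows.  The names
   URI_CHARS and percent_encode_component are hypothetical, and the set
   of characters "allowed in URIs" is assumed here to be the unreserved
   and reserved characters of [RFC3986] plus "%" itself.

```python
import string

# Assumption: a character is "allowed in a URI" if it is unreserved,
# a gen-delim, a sub-delim, or "%" (RFC 3986).
URI_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~"  # unreserved
    + ":/?#[]@"                                    # gen-delims
    + "!$&'()*+,;="                                # sub-delims
    + "%"                                          # existing percent-encodings
)


def percent_encode_component(component: str) -> str:
    """Map an IRI component to a URI component (general case only)."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)  # allowed in URIs: copied unchanged
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase).
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

   Because "%" and all other URI characters pass through unchanged,
   applying this sketch to its own output returns the same string.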

614	3.4.  Mapping ireg-name

616	   The mapping from <ireg-name> to a <reg-name> requires a choice
617	   between one of the two methods described below.

619	3.4.1.  Mapping using Percent-Encoding

621	   The ireg-name component SHOULD be converted according to the general
622	   procedure for percent-encoding of IRI components described in
623	   Section 3.3.

625	   For example, the IRI
626	   "http://r&#xE9;sum&#xE9;.example.org"
627	   will be converted to
628	   "http://r%C3%A9sum%C3%A9.example.org".

630	   This conversion for ireg-name is in line with Section 3.2.2 of
631	   [RFC3986], which does not mandate a particular registered name lookup
632	   technology.  For further background, see [RFC6055] and [Gettys].

634	3.4.2.  Mapping using Punycode

636	   In situations where it is certain that <ireg-name> is intended to be
637	   used as a domain name to be processed by Domain Name Lookup (as per
638	   [RFC5891]), an alternative method MAY be used, converting <ireg-name>
639	   as follows:

641	   If there are any sequences of <pct-encoded>, and their corresponding
642	   octets all represent valid UTF-8 octet sequences, then convert these
643	   back to Unicode character sequences.  (If any <pct-encoded> sequences
644	   are not valid UTF-8 octet sequences, then leave the entire field as
645	   is without any change, since punycode encoding would not succeed.)

647	   Replace the ireg-name part of the IRI by the part converted using the
648	   Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
649	   on each dot-separated label, and by using U+002E (FULL STOP) as a
650	   label separator.  This procedure may fail, which would mean that
651	   the IRI cannot be resolved.  If the domain name conversion fails,
652	   then the entire IRI conversion fails.  Processors that have no
653	   mechanism for signalling a failure MAY instead substitute an
654	   otherwise invalid host name, although such processing SHOULD be
655	   avoided.

657	   For example, the IRI
658	   "http://r&#xE9;sum&#xE9;.example.org"
659	   MAY be converted to
660	   "http://xn--rsum-bpad.example.org".

663	   This conversion for ireg-name will be better able to deal with legacy
664	   infrastructure that cannot handle percent-encoding in domain names.
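
   The following Python sketch illustrates the shape of this method
   under simplifying assumptions.  It uses Python's raw "punycode" codec
   on each non-ASCII label and adds the "xn--" prefix itself; it does
   NOT perform the validation, mapping, or contextual checks that the
   Domain Name Lookup procedure of [RFC5891] requires, so it is not a
   conforming implementation.  The function name ireg_name_to_ascii is
   hypothetical.

```python
from urllib.parse import unquote_to_bytes


def ireg_name_to_ascii(ireg_name: str) -> str:
    """Simplified ireg-name to reg-name mapping (no RFC 5891 checks)."""
    # Undo percent-encoding, but only if the octets form valid UTF-8.
    try:
        name = unquote_to_bytes(ireg_name).decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the entire field unchanged.
        return ireg_name

    labels = []
    for label in name.split("."):  # U+002E is the label separator
        if label.isascii():
            labels.append(label)
        else:
            # Assumption: raw Punycode plus the ACE prefix stands in for
            # the full per-label lookup processing.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)
```

   A production implementation would replace the per-label branch with
   a library that implements the lookup protocol in full and would
   report conversion failures to the caller.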

666	3.4.3.  Additional Considerations

668	   Note:  Domain Names may appear in parts of an IRI other than the
669	      ireg-name part.  It is the responsibility of scheme-specific
670	      implementations (if the Internationalized Domain Name is part of
671	      the scheme syntax) or of server-side implementations (if the
672	      Internationalized Domain Name is part of 'iquery') to apply the
673	      necessary conversions at the appropriate point.  Example: Trying
674	      to validate the Web page at
675	      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
676	      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
677	      example.org, which would convert to a URI of
678	      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
679	      example.org.  The server-side implementation is responsible for
680	      making the necessary conversions to be able to retrieve the Web
681	      page.

683	   Note:  In this process, characters allowed in URI references and
684	      existing percent-encoded sequences are not encoded further.  (This
685	      mapping is similar to, but different from, the encoding applied
686	      when arbitrary content is included in some part of a URI.)  For
687	      example, an IRI of
688	      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
689	      converted to
690	      "http://www.example.org/red%09ros%C3%A9#red", not to something
691	      like
692	      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

694	3.5.  Mapping query components

696	   For compatibility with existing deployed HTTP infrastructure, the
697	   following special case applies for schemes "http" and "https" and
698	   IRIs whose origin has a document charset other than one which is UCS-
699	   based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
700	   of an IRI is mapped into a URI by using the document charset rather
701	   than UTF-8 as the binary representation before pct-encoding.  This
702	   mapping is not applied for any other scheme or component.
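
   A minimal Python sketch of this special case follows.  The function
   name map_query and its parameters are hypothetical, and the test for
   a "UCS-based" charset (codec name starting with "utf-") is an
   assumption made for illustration.  Characters that cannot be
   represented in the document charset raise an error here; this
   specification does not define that case.

```python
import codecs
import string

# Assumption: characters allowed in URIs (RFC 3986), including "%".
URI_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)


def map_query(iquery: str, scheme: str, document_charset: str = "utf-8") -> str:
    """Map an iquery component to a URI query component."""
    codec_name = codecs.lookup(document_charset).name
    # The document charset is used only for http/https and only when it
    # is not UCS-based; every other case uses UTF-8.
    legacy = (scheme.lower() in ("http", "https")
              and not codec_name.startswith("utf-"))
    encoding = codec_name if legacy else "utf-8"

    out = []
    for ch in iquery:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError if ch is not in the charset.
            out.extend(f"%{octet:02X}" for octet in ch.encode(encoding))
    return "".join(out)
```

   With an iso-8859-1 document and the "http" scheme, an e-acute in the
   query becomes "%E9"; with any other scheme it becomes "%C3%A9".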

704	3.6.  Mapping IRIs to URIs

706	   The mapping from an IRI to URI is accomplished by applying the
707	   mapping above (from IRI to URI components) and then reassembling a
708	   URI from the parsed URI components using the original punctuation
709	   that delimited the IRI components.

711	4.  Converting URIs to IRIs

713	   In some situations, for presentation and further processing, it is
714	   desirable to convert a URI into an equivalent IRI without unnecessary
715	   percent encoding.  Of course, every URI is already an IRI in its own
716	   right without any conversion.  This section gives one possible
717	   procedure for URI to IRI mapping.

719	   The conversion described in this section, if given a valid URI, will
720	   result in an IRI that maps back to the URI used as an input for the
721	   conversion (except for potential case differences in percent-encoding
722	   and for potential percent-encoded unreserved characters).  However,
723	   the IRI resulting from this conversion may differ from the original
724	   IRI (if there ever was one).

726	   URI-to-IRI conversion removes percent-encodings, but not all percent-
727	   encodings can be eliminated.  There are several reasons for this:

729	   1. Some percent-encodings are necessary to distinguish percent-
730	      encoded and unencoded uses of reserved characters.

732	   2. Some percent-encodings cannot be interpreted as sequences of UTF-8
733	      octets.

735	      (Note: The octet patterns of UTF-8 are highly regular.  Therefore,
736	      there is a very high probability, but no guarantee, that percent-
737	      encodings that can be interpreted as sequences of UTF-8 octets
738	      actually originated from UTF-8.  For a detailed discussion, see
739	      [Duerst97].)

741	   3. The conversion may result in a character that is not appropriate
742	      in an IRI.  See Section 2.2 and Section 5.1 for further details.

744	   4. IRI to URI conversion has different rules for dealing with domain
745	      names and query parameters.

747	   Conversion from a URI to an IRI MAY be done by using the following
748	   steps:

750	   1. Represent the URI as a sequence of octets in US-ASCII.

752	   2. Convert all percent-encodings ("%" followed by two hexadecimal
753	      digits) to the corresponding octets, except those corresponding to
754	      "%", characters in "reserved", and characters in US-ASCII not
755	      allowed in URIs.

757	   3. Re-percent-encode any octet produced in step 2 that is not part of
758	      a strictly legal UTF-8 octet sequence.

760	   4. Re-percent-encode all octets produced in step 3 that in UTF-8
761	      represent characters that are not appropriate according to
762	      Section 2.2 and Section 5.1.

764	   5. Interpret the resulting octet sequence as a sequence of characters
765	      encoded in UTF-8.

767	   6. URIs known to contain domain names in the reg-name component
768	      SHOULD convert punycode-encoded domain name labels to the
769	      corresponding characters using the ToUnicode procedure.

771	   This procedure will convert as many percent-encoded characters as
772	   possible to characters in an IRI.  Because there are some choices
773	   when step 4 is applied (see Section 5.1), results may vary.

775	   Conversions from URIs to IRIs MUST NOT use any character encoding
776	   other than UTF-8 in steps 3 and 4, even if it might be possible to
777	   guess from the context that another character encoding than UTF-8 was
778	   used in the URI.  For example, the URI
779	   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
780	   interpreted to contain two e-acute characters encoded as iso-8859-1.
781	   It must not be converted to an IRI containing these e-acute
782	   characters.  Otherwise, in the future the IRI will be mapped to
783	   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
784	   URI from "http://www.example.org/r%E9sum%E9.html".
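
   Steps 1 through 5 above can be sketched in Python as shown below.
   This is an illustration under stated assumptions, not a normative
   algorithm: the names uri_to_iri and is_appropriate are hypothetical,
   "appropriate" is approximated as "matches ucschar and is not one of
   the bidi formatting characters U+200E, U+200F, U+202A-202E", and
   step 6 (domain name labels) is omitted.

```python
import re
import string

UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
# ucschar ranges from the syntax in Section 2.2.
UCSCHAR = ([(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
           + [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 0xE)]
           + [(0xE1000, 0xEFFFD)])
BIDI_FORMATTING = {0x200E, 0x200F, *range(0x202A, 0x202F)}


def is_appropriate(ch: str) -> bool:
    """Assumed stand-in for the checks of Section 2.2 and Section 5.1."""
    cp = ord(ch)
    return (cp not in BIDI_FORMATTING
            and any(lo <= cp <= hi for lo, hi in UCSCHAR))


def _utf8_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead octet announces, or 0."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _convert_run(match: re.Match) -> str:
    # Step 2: turn a run of percent-encodings into octets.
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        octet = octets[i]
        if octet < 0x80 and chr(octet) in UNRESERVED:
            out.append(chr(octet))  # e.g. %7E becomes "~"
            i += 1
            continue
        n = _utf8_length(octet)
        if n:
            try:
                # Strict decoding rejects overlong forms and surrogates.
                ch = octets[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None  # step 3: not strictly legal UTF-8
            if ch is not None and is_appropriate(ch):
                out.append(ch)  # step 5: keep the decoded character
                i += n
                continue
        # Steps 2 to 4: "%", reserved, disallowed ASCII, illegal UTF-8,
        # and inappropriate characters stay percent-encoded.
        out.append(f"%{octet:02X}")
        i += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Convert a URI (step 1: a US-ASCII string) to an IRI."""
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", _convert_run, uri)
```

   Only UTF-8 is ever tried, so "http://www.example.org/r%E9sum%E9.html"
   is returned unchanged rather than being guessed as iso-8859-1.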

786	4.1.  Examples

788	   This section shows various examples of converting URIs to IRIs.  Each
789	   example shows the result after each of the steps 1 through 6 is
790	   applied.  XML Notation is used for the final result.  Octets are
791	   denoted by "<" followed by two hexadecimal digits followed by ">".

793	   The following example contains the sequence "%C3%BC", which is a
794	   strictly legal UTF-8 sequence, and which is converted into the actual
795	   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
796	   u-umlaut).

798	   1. http://www.example.org/D%C3%BCrst

800	   2. http://www.example.org/D<c3><bc>rst

802	   3. http://www.example.org/D<c3><bc>rst

804	   4. http://www.example.org/D<c3><bc>rst

806	   5. http://www.example.org/D&#xFC;rst

808	   6. http://www.example.org/D&#xFC;rst

810	   The following example contains the sequence "%FC", which might
811	   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
812	   iso-8859-1 character encoding.  (It might represent other characters
813	   in other character encodings.  For example, the octet <fc> in iso-
814	   8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)  Because <fc>
815	   is not part of a strictly legal UTF-8 sequence, it is re-percent-
816	   encoded in step 3.

818	   1. http://www.example.org/D%FCrst

820	   2. http://www.example.org/D<fc>rst
821	   3. http://www.example.org/D%FCrst

823	   4. http://www.example.org/D%FCrst

825	   5. http://www.example.org/D%FCrst

827	   6. http://www.example.org/D%FCrst

829	   The following example contains "%e2%80%ae", which is the percent-
830	   encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
831	   The direct use of this character is forbidden in an IRI.  Therefore,
832	   the corresponding octets are re-percent-encoded in step 4.  This
833	   example shows that the case (upper- or lowercase) of letters used in
834	   percent-encodings may not be preserved.  The example also contains a
836	   punycode-encoded domain name label (xn--99zt52a), which is converted
837	   only in step 6.

839	   1. http://xn--99zt52a.example.org/%e2%80%ae

841	   2. http://xn--99zt52a.example.org/<e2><80><ae>

843	   3. http://xn--99zt52a.example.org/<e2><80><ae>

845	   4. http://xn--99zt52a.example.org/%E2%80%AE

847	   5. http://xn--99zt52a.example.org/%E2%80%AE

849	   6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

851	   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
852	   (Japanese Natto).  ((EDITOR NOTE: There is some inconsistency in this
853	   note.))

855	5.  Use of IRIs

857	5.1.  Limitations on UCS Characters Allowed in IRIs

859	   This section discusses limitations on characters and character
860	   sequences usable for IRIs beyond those given in Section 2.2.  The
861	   considerations in this section are relevant when IRIs are created and
862	   when URIs are converted to IRIs.

864	   a. The repertoire of characters allowed in each IRI component is
865	      limited by the definition of that component.  For example, the
866	      definition of the scheme component does not allow characters
867	      beyond US-ASCII.

869	      (Note: In accordance with URI practice, generic IRI software
870	      cannot and should not check for such limitations.)

872	   b. The UCS contains many areas of characters for which there are
873	      strong visual look-alikes.  Because of the likelihood of
874	      transcription errors, these also should be avoided.  This includes
875	      the full-width equivalents of Latin characters, half-width
876	      Katakana characters for Japanese, and many others.  It also
877	      includes many look-alikes of "space", "delims", and "unwise",
878	      characters excluded in [RFC3491].

880	   c. At the start of a component, the use of combining marks is
881	      strongly discouraged.  As an example, a COMBINING TILDE OVERLAY
882	      (U+0334) would be very confusing at the start of a <isegment>.
883	      Combined with the preceding '/', it might look like a solidus
884	      with combining tilde overlay, but IRI processing software will
885	      parse and process the '/' separately.

887	   d. The ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D)
888	      are invisible in most contexts, but are crucial in some very
889	      limited contexts.  Appendix A of [RFC5892] contains contextual
890	      restrictions for these and some other characters.  The use of
891	      these characters is strongly discouraged except in the relevant
892	      contexts.

894	   Additional information is available from [UNIXML].  [UNIXML] is
895	   written in the context of running text rather than in that of
896	   identifiers.  Nevertheless, it discusses many of the categories of
897	   characters not appropriate for IRIs.

899	5.2.  Software Interfaces and Protocols

901	   Although an IRI is defined as a sequence of characters, software
902	   interfaces for URIs typically function on sequences of octets or
903	   other kinds of code units.  Thus, software interfaces and protocols
904	   MUST define which character encoding is used.

906	   Intermediate software interfaces between IRI-capable components and
907	   URI-only components MUST map the IRIs per Section 3.6, when
908	   transferring from IRI-capable to URI-only components.  This mapping
909	   SHOULD be applied as late as possible.  It SHOULD NOT be applied
910	   between components that are known to be able to handle IRIs.

912	5.3.  Format of URIs and IRIs in Documents and Protocols

914	   Document formats that transport URIs may have to be upgraded to allow
915	   the transport of IRIs.  In cases where the document as a whole has a
916	   native character encoding, IRIs MUST also be encoded in this
917	   character encoding and converted accordingly by a parser or
918	   interpreter.  IRI characters not expressible in the native character
919	   encoding SHOULD be escaped by using the escaping conventions of the
920	   document format if such conventions are available.  Alternatively,
921	   they MAY be percent-encoded according to Section 3.6.  For example,
922	   in HTML or XML, numeric character references SHOULD be used.  If a
923	   document as a whole has a native character encoding and that
924	   character encoding is not UTF-8, then IRIs MUST NOT be placed into
925	   the document in the UTF-8 character encoding.

927	   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
928	   although they use different terminology.  HTML 4.0 [HTML4] defines
929	   the conversion from IRIs to URIs as error-avoiding behavior.  XML 1.0
930	   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
931	   based upon them allow IRIs.  Also, it is expected that all relevant
932	   new W3C formats and protocols will be required to handle IRIs
933	   [CharMod].

935	5.4.  Use of UTF-8 for Encoding Original Characters

937	   This section discusses details and gives examples for point c) in
938	   Section 1.2.  To be able to use IRIs, the URI corresponding to the
939	   IRI in question has to encode original characters into octets by
940	   using UTF-8.  This can be specified for all URIs of a URI scheme or
941	   can apply to individual URIs for schemes that do not specify how to
942	   encode original characters.  It can apply to the whole URI, or only
943	   to some part.  For background information on encoding characters into
944	   URIs, see also Section 2.5 of [RFC3986].

946	   For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
947	   Examples where UTF-8 is already used are the URN syntax [RFC2141],
948	   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
949	   because the HTTP URI scheme does not specify how to encode original
950	   characters, only some HTTP URLs can have corresponding but different
951	   IRIs.

953	   For example, for a document with a URI of
954	   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
955	   construct a corresponding IRI (in XML notation, see Section 1.4):
956	   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
957	   the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
958	   encoded representation of that character).  On the other hand, for a
959	   document with a URI of "http://www.example.org/r%E9sum%E9.html", the
960	   percent-encoding octets cannot be converted to actual characters in
961	   an IRI, as the percent-encoding is not based on UTF-8.

963	   For most URI schemes, there is no need to upgrade their scheme
964	   definition in order for them to work with IRIs.  The main case where
965	   upgrading makes sense is when a scheme definition, or a particular
966	   component of a scheme, is strictly limited to the use of US-ASCII
967	   characters with no provision to include non-ASCII characters/octets
968	   via percent-encoding, or if a scheme definition currently uses highly
969	   scheme-specific provisions for the encoding of non-ASCII characters.
970	   An example of this is the mailto: scheme [RFC2368].

972	   This specification updates the IANA registry of URI schemes to note
973	   their applicability to IRIs, see Section 8.  All IRIs use URI
974	   schemes, and all URIs with URI schemes can be used as IRIs, even
975	   though in some cases only by using URIs directly as IRIs, without any
976	   conversion.

978	   Scheme definitions can impose restrictions on the syntax of scheme-
979	   specific URIs; i.e., URIs that are admissible under the generic URI
980	   syntax [RFC3986] may not be admissible due to narrower syntactic
981	   constraints imposed by a URI scheme specification.  URI scheme
982	   definitions cannot broaden the syntactic restrictions of the generic
983	   URI syntax; otherwise, it would be possible to generate URIs that
984	   satisfied the scheme-specific syntactic constraints without
985	   satisfying the syntactic constraints of the generic URI syntax.
986	   However, additional syntactic constraints imposed by URI scheme
987	   specifications are applicable to IRIs, as the corresponding URI
988	   resulting from the mapping defined in Section 3.6 MUST be a valid URI
989	   under the syntactic restrictions of generic URI syntax and any
990	   narrower restrictions imposed by the corresponding URI scheme
991	   specification.

993	   The requirement for the use of UTF-8 generally applies to all parts
994	   of a URI.  However, it is possible that the capability of IRIs to
995	   represent a wide range of characters directly is used just in some
996	   parts of the IRI (or IRI reference).  The other parts of the IRI may
997	   only contain US-ASCII characters, or they may not be based on UTF-8.
998	   They may be based on another character encoding, or they may directly
999	   encode raw binary data (see also [RFC2397]).

1001	   For example, it is possible to have a URI reference of
1002	   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
1003	   document name is encoded in iso-8859-1 based on server settings, but
1004	   where the fragment identifier is encoded in UTF-8 according to
1005	   [XPointer].  The IRI corresponding to the above URI would be (in XML
1006	   notation)
1007	   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

1009	   Similar considerations apply to query parts.  The functionality of
1010	   IRIs (namely, to be able to include non-ASCII characters) can only be
1011	   used if the query part is encoded in UTF-8.

1013	5.5.  Relative IRI References

1015	   Processing of relative IRI references against a base is handled
1016	   straightforwardly; the algorithms of [RFC3986] can be applied
1017	   directly, treating the characters additionally allowed in IRI
1018	   references in the same way that unreserved characters are in URI
1019	   references.

1021	6.  Legacy Extended IRIs (LEIRIs)

1023	   In some cases, there have been formats which have used a protocol
1024	   element which is a variant of the IRI definition; these variants have
1025	   usually been somewhat less restricted in syntax.  This section
1026	   provides a definition and a name (Legacy Extended IRI or LEIRI) for
1027	   one of these variants used widely in XML-based protocols.  This
1028	   variant has to be used with care; it requires further processing
1029	   before being fully interchangeable as IRIs.  New protocols and
1030	   formats SHOULD NOT use Legacy Extended IRIs.  Even where Legacy
1031	   Extended IRIs are allowed, only IRIs fully conforming to the syntax
1032	   definition in Section 2.2 SHOULD be created, generated, and used.
1033	   The provisions in this section also apply to Legacy Extended IRI
1034	   references.

1036	6.1.  Legacy Extended IRI Syntax

1038	   This section defines Legacy Extended IRIs (LEIRIs).  The syntax of
1039	   Legacy Extended IRIs is the same as that for <IRI-reference>, except
1040	   that the ucschar production is replaced by the leiri-ucschar
1041	   production:

1043	   leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
1044	                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1045	                  / %xE000-FFFD / %x10000-10FFFF

1047	   The restriction on bidirectional formatting characters in [Bidi] is
1048	   lifted.  The iprivate production becomes redundant.

1050	   Likewise, the syntax for Legacy Extended IRI references (LEIRI
1051	   references) is the same as that for IRI references with the above
1052	   replacement of ucschar with leiri-ucschar.

1054	6.2.  Conversion of Legacy Extended IRIs to IRIs

1056	   To convert a Legacy Extended IRI (reference) to an IRI (reference),
1057	   each character allowed in a Legacy Extended IRI (reference) but not
1058	   allowed in an IRI (reference) (see Section 6.3) MUST be percent-
1059	   encoded by applying the steps in Section 3.3.
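
   As a hedged sketch of this conversion, the Python below percent-
   encodes every character that is neither a URI character nor matched
   by ucschar, and also the bidi formatting characters listed in
   Section 6.3.  The name leiri_to_iri is hypothetical.  The sketch is
   deliberately conservative: it also encodes private use code points,
   which an IRI would allow unencoded in the query part.

```python
import string

# Assumption: characters allowed in URIs (RFC 3986), including "%".
URI_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)
# ucschar ranges from the syntax in Section 2.2.
UCSCHAR = ([(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
           + [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 0xE)]
           + [(0xE1000, 0xEFFFD)])
BIDI_FORMATTING = {0x200E, 0x200F, *range(0x202A, 0x202F)}


def allowed_in_iri(ch: str) -> bool:
    """True if ch may stay unencoded (component-independent check)."""
    if ch in URI_CHARS:
        return True
    cp = ord(ch)
    return (cp not in BIDI_FORMATTING
            and any(lo <= cp <= hi for lo, hi in UCSCHAR))


def leiri_to_iri(leiri: str) -> str:
    """Percent-encode the characters of a LEIRI not allowed in IRIs."""
    out = []
    for ch in leiri:
        if allowed_in_iri(ch):
            out.append(ch)
        else:
            # Same two steps as Section 3.3: UTF-8, then %HH.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

   For instance, a space becomes "%20" and "<" becomes "%3C", while an
   e-acute is left as is because it is already allowed in IRIs.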

1061	6.3.  Characters Allowed in Legacy Extended IRIs but not in IRIs

1063	   This section provides a list of the groups of characters and code
1064	   points that are allowed in Legacy Extended IRIs, but are not allowed
1065	   in IRIs or are allowed in IRIs only in the query part.  For each
1066	   group of characters, advice on the usage of these characters is also
1067	   given, concentrating on the reasons not to use them.

1069	      Space (U+0020): Some formats and applications use space as a
1070	      delimiter, e.g., for items in a list.  Appendix C of [RFC3986]
1071	      also mentions that white space may have to be added when
1072	      displaying or printing long URIs; the same applies to long IRIs.
1073	      Spaces might disappear, or a single Legacy Extended IRI might
1074	      incorrectly be interpreted as two or more separate ones.

1076	      Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
1077	      C of [RFC3986] suggests the use of double-quotes
1078	      ("http://example.com/") and angle brackets (<http://example.com/>)
1079	      as delimiters for URIs in plain text.  These conventions are often
1080	      used, and also apply to IRIs.  Legacy Extended IRIs using these
1081	      characters might be cut off at the wrong place.

1083	      Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
1084	      (U+007B), "|" (U+007C), and "}" (U+007D): These characters
1085	      originally were excluded from URIs because the respective
1086	      codepoints are assigned to different graphic characters in some
1087	      7-bit or 8-bit encoding.  Despite the move to Unicode, some of
1088	      these characters are still occasionally displayed differently on
1089	      some systems, e.g., U+005C as a Japanese Yen symbol.  Also, the
1090	      fact that these characters are not used in URIs or IRIs has
1091	      encouraged their use outside URIs or IRIs in contexts that may
1092	      include URIs or IRIs.  In case a Legacy Extended IRI with such a
1093	      character is used in such a context, the Legacy Extended IRI will
1094	      be interpreted piecemeal.

1096	      The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
1097	      #x9F): There is no way to transmit these characters reliably
1098	      except potentially in electronic form.  Even when in electronic
1099	      form, some software components might silently filter out some of
1100	      these characters, or may stop processing altogether when
1101	      encountering some of them.  These characters may affect text
1102	      display in subtle, unnoticeable ways or in drastic, global, and
1103	      irreversible ways depending on the hardware and software involved.
1104	      The use of some of these characters may allow malicious users to
1105	      manipulate the display of a Legacy Extended IRI and its context.

1107	      Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1108	      characters affect the display ordering of characters.  Displayed
1109	      Legacy Extended IRIs containing these characters cannot be
1110	      converted back to electronic form (logical order) unambiguously.
1111	      These characters may allow malicious users to manipulate the
1112	      display of a Legacy Extended IRI and its context.

1114	      Specials (U+FFF0-FFFD): These code points provide functionality
1115	      beyond that useful in a Legacy Extended IRI, for example byte
1116	      order identification, annotation, and replacements for unknown
1117	      characters and objects.  Their use and interpretation in a Legacy
1118	      Extended IRI serves no purpose and may lead to confusing display
1119	      variations.

1121	      Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
1122	      10FFFD): Display and interpretation of these code points is by
1123	      definition undefined without private agreement.  Therefore, these
1124	      code points are not suited for use on the Internet.  They are not
1125	      interoperable and may have unpredictable effects.

1127	      Tags (U+E0000-E0FFF): These characters provide a way to language
1128	      tag in Unicode plain text.  They are not appropriate for Legacy
1129	      Extended IRIs because language information in identifiers cannot
1130	      reliably be input, transmitted (e.g., on a visual medium such as
1131	      paper), or recognized.

1133	      Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1134	      U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1135	      U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1136	      U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1137	      U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1138	      non-characters.  Applications may use some of them internally, but
1139	      are not prepared to interchange them.

1141	   For reference, the code points and code units that are not allowed
1142	   even in Legacy Extended IRIs are listed here:

1144	      Surrogate code units (D800-DFFF): These do not represent Unicode
1145	      codepoints.

1147	      Non-characters (U+FFFE-FFFF): These are allowed in neither XML nor
1148	      LEIRIs.
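
   As an illustration only, the ranges quoted in the lists above can be
   written down as a small lookup.  The following Python sketch is not
   part of this specification; the function name leiri_category and the
   returned labels are hypothetical, and only the ranges shown in this
   part of the section are covered.

```python
def leiri_category(cp: int):
    """Return a label for code point cp, or None if no listed range matches."""
    # Code units and code points that are not allowed even in LEIRIs.
    if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
        return "not allowed"
    if cp in (0x200E, 0x200F) or 0x202A <= cp <= 0x202E:
        return "bidi formatting"
    if 0xFFF0 <= cp <= 0xFFFD:
        return "specials"
    if (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    if 0xE0000 <= cp <= 0xE0FFF:
        return "tags"
    # U+FDD0-FDEF, plus the last two code points of planes 1 to 16.
    if 0xFDD0 <= cp <= 0xFDEF or (cp >= 0x1FFFE and cp & 0xFFFE == 0xFFFE):
        return "non-character"
    return None
```

   A caller might use such a lookup to warn when a Legacy Extended IRI
   contains one of these code points; what to do with the warning is a
   local policy decision.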

1150	7.  URI/IRI Processing Guidelines (Informative)

1152	   This informative section provides guidelines for supporting IRIs in
1153	   the same software components and operations that currently process
1154	   URIs: Software interfaces that handle URIs, software that allows
1155	   users to enter URIs, software that creates or generates URIs,
1156	   software that displays URIs, formats and protocols that transport
1157	   URIs, and software that interprets URIs.  These may all require
1158	   modification before functioning properly with IRIs.  The
1159	   considerations in this section also apply to URI references and IRI
1160	   references.

1162	7.1.  URI/IRI Software Interfaces

1164	   Software interfaces that handle URIs, such as URI-handling APIs and
1165	   protocols transferring URIs, need interfaces and protocol elements
1166	   that are designed to carry IRIs.

1168	   In case the current handling in an API or protocol is based on US-
1169	   ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
1170	   is compatible with US-ASCII, is in accordance with the
1171	   recommendations of [RFC2277], and makes converting to URIs easy.  In
1172	   any case, the API or protocol definition must clearly define the
1173	   character encoding to be used.
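
   The following Python sketch illustrates why UTF-8 makes converting
   to URIs easy: each non-ASCII character is replaced by the percent-
   encoded octets of its UTF-8 form, and ASCII characters that are
   legal in URIs pass through unchanged.  The function name
   iri_to_uri_simple and the constant URI_SAFE are hypothetical, and
   the sketch deliberately ignores the special treatment of the host
   component (IDNA).

```python
from urllib.parse import quote

# ASCII characters left untouched: the RFC 3986 reserved and unreserved
# punctuation, plus "%" so that existing percent-escapes are preserved.
URI_SAFE = "!#$%&'()*+,/:;=?@[]-._~"

def iri_to_uri_simple(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")
```

   Because US-ASCII is a subset of UTF-8, an API that already carries
   ASCII URIs in this way needs no change for the ASCII case.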

1175	   The transfer from URI-only to IRI-capable components requires no
1176	   mapping, although the conversion described in Section 4 above may be
1177	   performed.  It is preferable not to perform this inverse conversion
1178	   unless it is certain this can be done correctly.

1180	7.2.  URI/IRI Entry

1182	   Some components allow users to enter URIs into the system by typing
1183	   or dictation, for example.  This software must be updated to allow
1184	   for IRI entry.

1186	   A person viewing a visual presentation of an IRI (as a sequence of
1187	   glyphs, in some order, in some visual display) will use an entry
1188	   method for characters in the user's language to input the IRI.
1189	   Depending on the script and the input method used, this may be a more
1190	   or less complicated process.

1192	   The process of IRI entry must ensure, as much as possible, that the
1193	   restrictions defined in Section 2.2 are met.  This may be done by
1194	   choosing appropriate input methods or variants/settings thereof, by
1195	   appropriately converting the characters being input, by eliminating
1196	   characters that cannot be converted, and/or by issuing a warning or
1197	   error message to the user.

1199	   As an example of variant settings, input method editors for East
1200	   Asian Languages usually allow the input of Latin letters and related
1201	   characters in full-width or half-width versions.  For IRI input, the
1202	   input method editor should be set so that it produces half-width
1203	   Latin letters and punctuation and full-width Katakana.
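
   Where the input method cannot be configured, a comparable effect
   might be obtained by converting the characters after input.  The
   Python sketch below assumes that NFKC normalization is acceptable
   for this purpose; NFKC folds full-width Latin to half-width and
   half-width Katakana to full-width, but it also folds other
   compatibility characters.  The function name fold_width is
   hypothetical.

```python
import unicodedata

def fold_width(text: str) -> str:
    """Fold width variants (and other compatibility forms) using NFKC."""
    return unicodedata.normalize("NFKC", text)
```

   For example, the full-width letters U+FF48 U+FF54 U+FF54 U+FF50 are
   folded to "http", and half-width Katakana KA (U+FF76) is folded to
   U+30AB.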

1205	   An input field primarily or solely used for the input of URIs/IRIs,
1206	   or any other place where the input of IRIs is frequent, might allow
1207	   the user to view an IRI as it is mapped to a URI.  This will help
1208	   users when some of the software they use does not yet accept
1209	   IRIs.

1211	   An IRI input component interfacing to components that handle URIs,
1212	   but not IRIs, must map the IRI to a URI before passing it to these
1213	   components.

1215	   For the input of IRIs with right-to-left characters, please see
1216	   [Bidi].

1218	7.3.  URI/IRI Transfer between Applications

1220	   Many applications (for example, mail user agents) try to detect URIs
1221	   appearing in plain text.  For this, they use some heuristics based on
1222	   URI syntax.  They then allow the user to click on such URIs and
1223	   retrieve the corresponding resource in an appropriate (usually
1224	   scheme-dependent) application.

1226	   Such applications would need to be upgraded, in order to use the IRI
1227	   syntax as a base for heuristics.  In particular, a non-ASCII
1228	   character should not be taken as the indication of the end of an IRI.
1229	   Such applications also would need to make sure that they correctly
1230	   convert the detected IRI from the character encoding of the document
1231	   or application where the IRI appears, to the character encoding used
1232	   by the system-wide IRI invocation mechanism, or to a URI (according
1233	   to Section 3.6) if the system-wide invocation mechanism only accepts
1234	   URIs.
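
   A detection heuristic of this kind might look like the following
   Python sketch.  It is only an illustration: the pattern IRI_PATTERN,
   the restriction to the "http" and "https" schemes, and the trimming
   of trailing punctuation are all assumptions, not requirements of
   this document.  The point is that the character class excludes
   whitespace and ASCII delimiters but not non-ASCII characters.

```python
import re

# Scheme, "://", then a run of characters that are not whitespace,
# controls, or ASCII delimiters.  Non-ASCII characters are accepted.
IRI_PATTERN = re.compile(r"\bhttps?://[^\s<>\"{}|\\^`\x00-\x1f\x7f]+")

def find_iris(text: str) -> list:
    """Return candidate IRIs found in plain text."""
    # Trailing punctuation more likely belongs to the sentence.
    return [m.group(0).rstrip(".,;:!?)") for m in IRI_PATTERN.finditer(text)]
```

   The detected string is a sequence of characters; it still has to be
   converted to the character encoding of the invocation mechanism, or
   to a URI, as described above.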

1236	   The clipboard is another frequently used way to transfer URIs and
1237	   IRIs from one application to another.  On most platforms, the
1238	   clipboard is able to store and transfer text in many languages and
1239	   scripts.  Correctly used, the clipboard transfers characters, not
1240	   octets, which will do the right thing with IRIs.

1242	7.4.  URI/IRI Generation

1244	   Systems that offer resources through the Internet, where those
1245	   resources have logical names, sometimes automatically generate URIs
1246	   for the resources they offer.  For example, some HTTP servers can
1247	   generate a directory listing for a file directory and then respond to
1248	   the generated URIs with the files.

1250	   Many legacy character encodings are in use in various file systems.
1251	   Many currently deployed systems do not transform the local character
1252	   representation of the underlying system before generating URIs.

1254	   For maximum interoperability, systems that generate resource
1255	   identifiers should make the appropriate transformations.  For
1256	   example, if a file system contains a file named
1257	   "r&#xE9;sum&#xE9;.html", a server should expose this as
1258	   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
1259	   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
1260	   kept in a character encoding other than UTF-8.

1262	   This recommendation particularly applies to HTTP servers.  For FTP
1263	   servers, similar considerations apply; see [RFC2640].
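
   The transformation in the file name example above can be sketched
   in Python as follows.  The function name file_name_to_uri_segment
   is hypothetical, and the sketch assumes that the server knows the
   local character encoding of its file names; the NFC step is an
   additional assumption, not something this section requires.

```python
import unicodedata
from urllib.parse import quote

def file_name_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Map a locally encoded file name to a UTF-8 based URI path segment."""
    # Recover the characters from the local (legacy) encoding.
    name = raw_name.decode(local_encoding)
    # Assumption: normalize to NFC before generating the identifier.
    name = unicodedata.normalize("NFC", name)
    # Percent-encode the UTF-8 octets; "/" is escaped too (one segment).
    return quote(name, safe="", encoding="utf-8")
```

   With a file name stored in ISO-8859-1 as the octets 72 E9 73 75 6D
   E9 followed by ".html", this yields "r%C3%A9sum%C3%A9.html".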

1265	7.5.  URI/IRI Selection

1267	   In some cases, resource owners and publishers have control over the
1268	   IRIs used to identify their resources.  This control is mostly
1269	   executed by controlling the resource names, such as file names,
1270	   directly.

1272	   In these cases, it is recommended to avoid choosing IRIs that are
1273	   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
1274	   is easily confused with the digit one ("1"), and the upper-case oh
1275	   ("O") is easily confused with the digit zero ("0").  Publishers
1276	   should avoid confusing users with "br0ken" or "1ame" identifiers.

1278	   Outside the US-ASCII repertoire, there are many more opportunities
1279	   for confusion; a complete set of guidelines is too lengthy to include
1280	   here.  As long as names are limited to characters from a single
1281	   script, native writers of a given script or language will know best
1282	   when ambiguities can appear, and how they can be avoided.  What may
1283	   look ambiguous to a stranger may be completely obvious to the average
1284	   native user.  On the other hand, in some cases, the UCS contains
1285	   variants for compatibility reasons; for example, for typographic
1286	   purposes.  These should be avoided wherever possible.  Although there
1287	   may be exceptions, newly created resource names should generally be
1288	   in NFKC [UTR15] (which means that they are also in NFC).

1290	   As an example, the UCS contains the "fi" ligature at U+FB01 for
1291	   compatibility reasons.  Wherever possible, IRIs should use the two
1292	   letters "f" and "i" rather than the "fi" ligature.  An example where
1293	   the latter may be used is in the query part of an IRI for an explicit
1294	   search for a word written containing the "fi" ligature.
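
   A publisher could check candidate resource names against this
   recommendation with a test such as the following Python sketch.
   The function name is_nfkc is hypothetical.

```python
import unicodedata

def is_nfkc(name: str) -> bool:
    """Report whether a resource name is already in NFKC."""
    return unicodedata.normalize("NFKC", name) == name
```

   A name containing the ligature U+FB01 fails this check, because
   NFKC replaces the ligature with the two letters "f" and "i".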

1296	   In certain cases, there is a chance that characters from different
1297	   scripts look the same.  The best known example is the similarity of
1298	   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
1299	   such cases, IRIs should only be created where all the characters in a
1300	   single component are used together in a given language.  This usually
1301	   means that all of these characters will be from the same script, but
1302	   there are languages that mix characters from different scripts (such
1303	   as Japanese).  This is similar to the heuristics used to distinguish
1304	   between letters and numbers in the examples above.  Also, for Latin,
1305	   Greek, and Cyrillic, using lowercase letters results in fewer
1306	   ambiguities than using uppercase letters would.
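
   As a rough aid for spotting mixed-script components, the following
   Python sketch approximates the script of each letter by the first
   word of its Unicode character name.  This is a heuristic chosen for
   illustration because the Python standard library does not expose
   the Unicode Script property; the function name scripts_used is
   hypothetical.  A result with more than one entry is only a hint,
   since some languages legitimately mix scripts.

```python
import unicodedata

def scripts_used(component: str) -> set:
    """Approximate the set of scripts used by the letters in a string."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            # First word of the name, e.g. "LATIN", "GREEK", "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts
```

   For instance, a component in which the Cyrillic letter U+0430
   replaces the Latin "a" yields both "LATIN" and "CYRILLIC".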

1308	7.6.  Display of URIs/IRIs

1310	   In situations where the rendering software is not expected to display
1311	   non-ASCII parts of the IRI correctly using the available layout and
1312	   font resources, these parts should be percent-encoded before being
1313	   displayed.

1315	   For display of Bidi IRIs, please see [Bidi].

1317	7.7.  Interpretation of URIs and IRIs

1319	   Software that interprets IRIs as the names of local resources should
1320	   accept IRIs in multiple forms and convert and match them with the
1321	   appropriate local resource names.

1323	   First, multiple representations include both IRIs in the native
1324	   character encoding of the protocol and also their URI counterparts.

1326	   Second, these may include URIs constructed using character encodings
1327	   other than UTF-8.  These URIs may be produced by user agents that do
1328	   not conform to this specification and that use legacy character
1329	   encodings to convert non-ASCII characters to URIs.  Whether this is
1330	   necessary, and what character encodings to cover, depends on a number
1331	   of factors, such as the legacy character encodings used locally and
1332	   the distribution of various versions of user agents.  For example,
1333	   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
1334	   addition to UTF-8.
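
   One possible way to accept such URIs is sketched below in Python.
   The function name decode_path_segment and the order in which the
   encodings are tried are assumptions; UTF-8 is tried first because
   octet sequences in legacy encodings are rarely valid UTF-8, whereas
   the reverse does not hold.  A wrong guess among the legacy
   encodings remains possible.

```python
from urllib.parse import unquote_to_bytes

def decode_path_segment(segment: str, fallbacks=("shift_jis", "euc_jp")) -> str:
    """Undo percent-encoding, trying UTF-8 before legacy encodings."""
    raw = unquote_to_bytes(segment)
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("octets are not valid in any accepted encoding")
```

   The decoded characters can then be matched against the local
   resource names.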

1336	   Third, these may include additional mappings to be more user-friendly
1337	   and robust against transmission errors.  These would be similar to
1338	   how some servers currently treat URIs as case insensitive or perform
1339	   additional matching to account for spelling errors.  For characters
1340	   beyond the US-ASCII repertoire, this may, for example, include
1341	   ignoring the accents on received IRIs or resource names.  Please note
1342	   that such mappings, including case mappings, are language dependent.
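
   A simple form of such a mapping is sketched below in Python.  It
   builds a comparison key by decomposing characters, dropping
   combining marks, and case folding.  The function name loose_key is
   hypothetical, and the mapping is deliberately language-neutral, so
   it will be wrong for languages in which an accented letter is a
   distinct letter.

```python
import unicodedata

def loose_key(name: str) -> str:
    """Build an accent-insensitive, case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop combining marks (characters with a non-zero combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

   Two names are then treated as matching when their keys are equal.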

1344	   It can be difficult to identify a resource unambiguously if too many
1345	   mappings are taken into consideration.  However, percent-encoded and
1346	   not percent-encoded parts of IRIs can always be clearly
1347	   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
1348	   the potential for collisions lower than it may seem at first.

1350	7.8.  Upgrading Strategy

1352	   Where this recommendation places further constraints on software for
1353	   which many instances are already deployed, it is important to
1354	   introduce upgrades carefully and to be aware of the various
1355	   interdependencies.

1357	   If IRIs cannot be interpreted correctly, they should not be created,
1358	   generated, or transported.  This suggests that upgrading URI
1359	   interpreting software to accept IRIs should have highest priority.

1361	   On the other hand, a single IRI is interpreted only by a single or
1362	   very few interpreters that are known in advance, although it may be
1363	   entered and transported very widely.

1365	   Therefore, IRIs benefit most from a broad upgrade of software to be
1366	   able to enter and transport IRIs.  However, before an individual IRI
1367	   is published, care should be taken to upgrade the corresponding
1368	   interpreting software in order to cover the forms expected to be
1369	   received by various versions of entry and transport software.

1371	   The upgrade of generating software to generate IRIs instead of using
1372	   a local character encoding should happen only after the service is
1373	   upgraded to accept IRIs.  Similarly, IRIs should only be generated
1374	   when the service accepts IRIs and the intervening infrastructure and
1375	   protocol are known to transport them safely.

1377	   Software converting from URIs to IRIs for display should be upgraded
1378	   only after upgraded entry software has been widely deployed to the
1379	   population that will see the displayed result.

1381	   Where there is a free choice of character encodings, it is often
1382	   possible to reduce the effort and dependencies for upgrading to IRIs
1383	   by using UTF-8 rather than another encoding.  For example, when a new
1384	   file-based Web server is set up, using UTF-8 as the character
1385	   encoding for file names will make the transition to IRIs easier.
1386	   Likewise, when a new Web form is set up using UTF-8 as the character
1387	   encoding of the form page, the returned query URIs will use UTF-8 as
1388	   the character encoding (unless the user, for whatever reason, changes
1389	   the character encoding) and will therefore be compatible with IRIs.

1391	   These recommendations, when taken together, will allow for the
1392	   extension from URIs to IRIs in order to handle characters other than
1393	   US-ASCII while minimizing interoperability problems.  For
1394	   considerations regarding the upgrade of URI scheme definitions, see
1395	   Section 5.4.

1397	8.  IANA Considerations

1399	   NOTE: THIS SECTION NEEDS REVIEW AGAINST HAPPIANA WORK.

1401	   RFC Editor and IANA note: Please replace RFC XXXX with the number of
1402	   this document when it issues as an RFC, and RFC YYYY with the number
1403	   of the RFC issued for draft-ietf-iri-rfc3987bis.

1405	   IANA maintains a registry of "URI schemes".  This document attempts
1406	   to make it clear from the registry that a "URI scheme" also serves as
1407	   an "IRI scheme", and makes several changes to the registry.

1409	   The description of the registry should be changed: "RFC 4395 defined
1410	   an IANA-maintained registry of URI Schemes.  RFC XXXX updates this
1411	   registry to make it clear that the registered values also serve as
1412	   IRI schemes, as defined in RFC YYYY."

1414	   The registry includes schemes marked as Permanent or Provisional.
1415	   Previously, this was accomplished by having two sections, "Permanent"
1416	   and "Provisional".  However, in order to allow other statuses
1417	   ("Historical", and possibly a Proposed status for proposals which
1418	   have been received but not accepted), the registry should be changed
1419	   so that the status is indicated in a separate "Status" column, whose
1420	   values may be "Permanent", "Provisional" or "Historical".  Changes in
1421	   status as well as updates to the entire registration may be
1422	   accomplished by requests and expert review.

1424	9.  Security Considerations

1426	   The security considerations discussed in [RFC3986] also apply to
1427	   IRIs.  In addition, the following issues require particular care for
1428	   IRIs.

1430	   Incorrect encoding or decoding can lead to security problems.  For
1431	   example, some UTF-8 decoders do not check against overlong byte
1432	   sequences.  See [UTR36] Section 3 for details.
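
   As an illustration of the check that a decoder is expected to make,
   the Python sketch below relies on the strict UTF-8 decoder of the
   Python standard library, which rejects overlong sequences.  The
   function name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Report whether octets form well-formed UTF-8 (no overlong forms)."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

   The octets C0 AF, an overlong encoding of "/", are rejected by this
   check, as is the three-octet form E0 80 AF.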

1434	   There are serious difficulties with relying on a human to verify that
1435	   an IRI (whether presented visually or aurally) is the same as
1436	   another IRI or is the one intended.  These problems exist with ASCII-
1437	   only URIs (bl00mberg.com vs. bloomberg.com) but are strongly
1438	   exacerbated when using the much larger character repertoire of
1439	   Unicode.  For details, see Section 2 of [UTR36].  Using
1440	   administrative and technical means to reduce the availability of such
1441	   exploits is possible, but they are difficult to eliminate altogether.
1442	   User agents SHOULD NOT rely on visual or perceptual comparison or
1443	   verification of IRIs as a means of validating or assuring safety,
1444	   correctness or appropriateness of an IRI.  Other means of presenting
1445	   users with the validity, safety, or appropriateness of visited sites
1446	   are being developed in the browser community as an alternative means
1447	   of avoiding these difficulties.

1449	   Besides the large character repertoire of Unicode, reasons for
1450	   confusion include different forms of normalization and different
1451	   normalization expectations, use of percent-encoding with various
1452	   legacy encodings, and bidirectionality issues.  See also [Bidi].

1454	   Confusion can occur in various IRI components, such as the domain
1455	   name part or the path part, or between IRI components.  For
1456	   considerations specific to the domain name part, see [RFC5890].  For
1457	   considerations specific to particular protocols or schemes, see the
1458	   security sections of the relevant specifications and registration
1459	   templates.  Administrators of sites that allow independent users to
1460	   create resources in the same sub area have to be careful.  Details
1461	   are discussed in Section 7.5.

1463	   The characters additionally allowed in Legacy Extended IRIs introduce
1464	   additional security issues.  For details, see Section 6.3.

1466	10.  Acknowledgements

1468	   This document was derived from [RFC3987]; the acknowledgments from
1469	   that specification still apply.

1471	   In addition, this document was influenced by contributions from (in
1472	   no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
1473	   John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
1474	   Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
1475	   Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
1476	   Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
1477	   Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
1478	   Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
1479	   Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
1480	   Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
1481	   Tobias, Marko Martin, Maciej Stachowiak, Wil Tan, Yui Naruse,
1482	   Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
1483	   Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex
1484	   Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland,
1485	   Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas
1486	   Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev.

1488	11.  Main Changes Since RFC 3987

1490	   This section describes the main changes since [RFC3987].

1492	11.1.  Split out Bidi, processing guidelines, comparison sections

1494	   Move some components (comparison, bidi, processing) into separate
1495	   documents.

1497	11.2.  Major restructuring of IRI processing model

1499	   Major restructuring of IRI processing model to make scheme-specific
1500	   translation necessary to handle IDNA requirements and for consistency
1501	   with web implementations.

1503	   Starting with IRI, you want one of:

1505	   a  IRI components (IRI parsed into UTF-8 pieces)

1507	   b  URI components (URI parsed into ASCII pieces, encoded correctly)

1509	   c  whole URI (for passing on to some other system that wants whole
1510	      URIs)

1512	11.2.1.  OLD WAY

1514	   1.  Pct-encode the whole thing to a URI. (c1) If you want a
1515	       (maybe broken) whole URI, you might stop here.

1517	   2.  Parse the URI into URI components. (b1) If you want (maybe
1518	       broken) URI components, stop here.

1520	   3.  Decode the components (undoing the pct-encoding). (a) If you want
1521	       IRI components, stop here.

1523	   4.  Reencode: use a different encoding for some components (for
1524	       domain names, and for query components in web pages, depending on
1525	       the component, scheme, and context), and otherwise use
1526	       pct-encoding. (b2) If you want (good) URI components, stop here.

1528	   5.  Reassemble the reencoded components. (c2) If you want a (*good*)
1529	       whole URI, stop here.

1531	11.2.2.  NEW WAY

1533	   1.  Parse the IRI into IRI components using the generic syntax. (a)
1534	       If you want IRI components, stop here.

1536	   2.  Encode each component, using pct-encoding, IDN encoding, or
1537	       special query part encoding depending on the component, scheme,
1538	       or context. (b) If you want URI components, stop here.

1540	   3.  Reassemble a whole URI from the URI components. (c) If you want a
1541	       whole URI stop here.
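
   The three steps above can be sketched in Python as follows.  This
   is a simplified illustration, not the normative algorithm: the
   function name iri_to_uri is hypothetical, userinfo and IPv6
   literals are not handled, the query is assumed to use UTF-8 rather
   than a context-dependent encoding, and the "idna" codec of the
   Python standard library implements the older IDNA 2003 rules
   rather than [RFC5890].

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI component by component."""
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component by its own rule.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")  # IDN encoding
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path, safe="/:@!$&'()*+,;=%-._~")
    query = quote(parts.query, safe="/?:@!$&'()*+,;=%-._~")
    fragment = quote(parts.fragment, safe="/?:@!$&'()*+,;=%-._~")

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

   Stopping after the first or second step gives the IRI components or
   the URI components, respectively, as described in the list above.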

1543	11.2.3.  Extension of Syntax

1545	   Added the tag range (U+E0000-E0FFF) to the iprivate production.  Some
1546	   IRIs generated with the new syntax may fail to pass very strict
1547	   checks relying on the old syntax.  But characters in this range
1548	   should be extremely infrequent anyway.

1550	11.2.4.  More to be added

1552	   TODO: There are more main changes that need to be documented in this
1553	   section.

1555	11.3.  Change Log

1557	   Note to RFC Editor: Please completely remove this section before
1558	   publication.

1560	11.3.1.  Changes after draft-ietf-iri-3987bis-01

1562	   Changes from draft-ietf-iri-3987bis-01 onwards are available as
1563	   changesets in the IETF tools subversion repository at http://
1564	   trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
1565	   draft-ietf-iri-3987bis.xml.

1567	11.3.2.  Changes from draft-duerst-iri-bis-07 to
1568	         draft-ietf-iri-3987bis-00

1570	   Changed draft name, date, last paragraph of abstract, and titles in
1571	   change log, and added this section in moving from
1572	   draft-duerst-iri-bis-07 (personal submission) to
1573	   draft-ietf-iri-3987bis-00 (WG document).

1575	11.3.3.  Changes from -06 to -07 of draft-duerst-iri-bis

1577	   Major restructuring of the processing model, see Section 11.2.

1579	11.4.  Changes from -00 to -01

1581	   o  Removed 'mailto:' before mail addresses of authors.

1583	   o  Added "<to be done>" as right side of 'href-strip' rule.  Fixed
1584	      '|' to '/' for alternatives.

1586	11.5.  Changes from -05 to -06 of draft-duerst-iri-bis

1588	   o  Add HyperText Reference, change abstract, acks and references for
1589	      it

1591	   o  Add Masinter back as another editor.

1593	   o  Masinter integrates HRef material from HTML5 spec.

1595	   o  Rewrite introduction sections to modernize.

1597	11.6.  Changes from -04 to -05 of draft-duerst-iri-bis

1599	   o  Updated references.

1601	   o  Changed IPR text to pre5378Trust200902.

1603	11.7.  Changes from -03 to -04 of draft-duerst-iri-bis

1605	   o  Added explicit abbreviation for LEIRIs.

1607	   o  Mentioned LEIRI references.

1609	   o  Completed text in LEIRI section about tag characters and about
1610	      specials.

1612	11.8.  Changes from -02 to -03 of draft-duerst-iri-bis

1614	   o  Updated some references.

1616	   o  Updated Michel Suignard's coordinates.

1618	11.9.  Changes from -01 to -02 of draft-duerst-iri-bis

1620	   o  Added tag range to iprivate (issue private-include-tags-115).

1622	   o  Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

1624	11.10.  Changes from -00 to -01 of draft-duerst-iri-bis

1626	   o  Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
1627	      based on input from the W3C XML Core WG.  Moved the relevant
1628	      subsections to the back and promoted them to a section.

1630	   o  Added some text on Legacy Extended IRIs to the security section.

1632	   o  Added an IANA Considerations section.

1634	   o  Added this Change Log Section.

1636	   o  Added a section about "IRIs with Spaces/Controls" (converting from
1637	      a Note in RFC 3987).

1639	11.11.  Changes from RFC 3987 to -00 of draft-duerst-iri-bis

1641	      Fixed errata (see
1642	      http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

1644	12.  References

1646	12.1.  Normative References

1648	   [ASCII]    American National Standards Institute, "Coded Character
1649	              Set -- 7-bit American Standard Code for Information
1650	              Interchange", ANSI X3.4, 1986.

1652	   [ISO10646]
1653	              International Organization for Standardization, "ISO/IEC
1654	              10646:2011: Information Technology - Universal Multiple-
1655	              Octet Coded Character Set (UCS)", ISO Standard 10646,
1656	              March 2011, <http://standards.iso.org/ittf/
1657	              PubliclyAvailableStandards/
1658	              c051273_ISO_IEC_10646_2011(E).zip>.

1660	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1661	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1663	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1664	              Profile for Internationalized Domain Names (IDN)",
1665	              RFC 3491, March 2003.

1667	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1668	              10646", STD 63, RFC 3629, November 2003.

1670	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1671	              Resource Identifier (URI): Generic Syntax", STD 66,
1672	              RFC 3986, January 2005.

1674	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1675	              Applications (IDNA): Definitions and Document Framework",
1676	              RFC 5890, August 2010.

1678	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
1679	              Applications (IDNA): Protocol", RFC 5891, August 2010.

1681	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
1682	              Internationalized Domain Names for Applications (IDNA)",
1683	              RFC 5892, August 2010.

1685	   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
1686	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

1688	   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
1689	              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
1690	              ISBN 978-1-936213-01-6)", October 2010.

1692	   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
1693	              Unicode Standard Annex #15, March 2008,
1694	              <http://www.unicode.org/unicode/reports/tr15/
1695	              tr15-23.html>.

1697	12.2.  Informative References

1699	   [Bidi]     Duerst, M. and L. Masinter, "Guidelines for
1700	              Internationalized Resource Identifiers with Bi-directional
1701	              Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00
1702	              (work in progress), August 2011.

1704	   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
1705	              Texin, "Character Model for the World Wide Web: Resource
1706	              Identifiers", World Wide Web Consortium Candidate
1707	              Recommendation, November 2004,
1708	              <http://www.w3.org/TR/charmod-resid>.

1710	   [Duerst97]
1711	              Duerst, M., "The Properties and Promises of UTF-8", Proc.
1712	              11th International Unicode Conference, San Jose,
1713	              September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
1714	              papers/PDF/IUC11-UTF-8.pdf>.

1716	   [Equivalence]
1717	              Masinter, L. and M. Duerst, "Equivalence and
1718	              Canonicalization of Internationalized Resource Identifiers
1719	              (IRIs)", draft-ietf-iri-comparison-00 (work in progress),
1720	              August 2011.

1722	   [Gettys]   Gettys, J., "URI Model Consequences",
1723	              <http://www.w3.org/DesignIssues/ModelConsequences>.

1725	   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
1726	              Specification", World Wide Web Consortium Recommendation,
1727	              December 1999,
1728	              <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

1730	   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
1731	              IRIs for XML resource identification", World Wide Web
1732	              Consortium Note, November 2008,
1733	              <http://www.w3.org/TR/leiri/>.

1735	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
1736	              Extensions (MIME) Part One: Format of Internet Message
1737	              Bodies", RFC 2045, November 1996.

1739	   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
1740	              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
1741	              the IAB Character Set Workshop held 29 February - 1 March,
1742	              1996", RFC 2130, April 1997.

1744	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

1746	   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

1748	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1749	              Languages", BCP 18, RFC 2277, January 1998.

1751	   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
1752	              URL scheme", RFC 2368, July 1998.

1754	   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

1756	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1757	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
1758	              August 1998.

1760	   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
1761	              August 1998.

1763	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1764	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1765	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

1767	   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
1768	              Protocol", RFC 2640, July 1999.

1770	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
1771	              Identifiers (IRIs)", RFC 3987, January 2005.

1773	   [RFC4395bis]
1774	              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
1775	              Registration Procedures for New URI/IRI Schemes",
1776	              draft-ietf-iri-4395bis-irireg-03 (work in progress),
1777	              July 2011.

1779	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1780	              Encodings for Internationalized Domain Names", RFC 6055,
1781	              February 2011.

1783	   [RFC6082]  Whistler, K., Adams, G., Duerst, M., Presuhn, R., and J.
1784	              Klensin, "Deprecating Unicode Language Tag Characters: RFC
1785	              2482 is Historic", RFC 6082, November 2010.

1787	   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
1788	              Markup Languages", Unicode Technical Report #20, World
1789	              Wide Web Consortium Note, June 2003,
1790	              <http://www.w3.org/TR/unicode-xml/>.

1792	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
1793	              Considerations", Unicode Technical Report #36,
1794	              August 2010, <http://unicode.org/reports/tr36/>.

1796	   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
1797	              Language (XLink) Version 1.0", World Wide Web
1798	              Consortium REC-xlink-20010627, June 2001,
1799	              <http://www.w3.org/TR/xlink/#link-locators>.

1801	   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
1802	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
1803	              Edition)", World Wide Web Consortium REC-xml-20081126,
1804	              August 2006, <http://www.w3.org/TR/REC-xml>.

1806	   [XMLNamespace]
1807	              Bray, T., Hollander, D., Layman, A., and R. Tobin,
1808	              "Namespaces in XML (Second Edition)", World Wide Web
1809	              Consortium REC-xml-names-20091208, August 2006,
1810	              <http://www.w3.org/TR/REC-xml-names>.

1812	   [XMLSchema]
1813	              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
1814	              World Wide Web Consortium REC-xmlschema-2-20041028,
1815	              May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.

1817	   [XPointer]
1818	              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
1819	              Framework", World Wide Web Consortium REC-xptr-framework-
1820	              20030325, March 2003,
1821	              <http://www.w3.org/TR/xptr-framework/#escaping>.

1823	Authors' Addresses

1825	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
1826	                 possible, for example as "D&#252;rst" in XML and HTML.)
1827	   Aoyama Gakuin University
1828	   5-10-1 Fuchinobe
1829	   Sagamihara, Kanagawa  229-8558
1830	   Japan

1832	   Phone: +81 42 759 6329
1833	   Fax:   +81 42 759 6495
1834	   Email: duerst@it.aoyama.ac.jp
1835	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
1836	          (Note: This is the percent-encoded form of an IRI.)

1838	   Michel Suignard
1839	   Unicode Consortium
1840	   P.O. Box 391476
1841	   Mountain View, CA  94039-1476
1842	   U.S.A.

1844	   Phone: +1-650-693-3921
1845	   Email: michel@unicode.org
1846	   URI:   http://www.suignard.com

1848	   Larry Masinter
1849	   Adobe
1850	   345 Park Ave
1851	   San Jose, CA  95110
1852	   U.S.A.

1854	   Phone: +1-408-536-3024
1855	   Email: masinter@adobe.com
1856	   URI:   http://larry.masinter.net