idnits 2.17.1 draft-ietf-iri-3987bis-09.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There is 1 instance of lines with control characters in the document. == There is 1 instance of lines with non-RFC2606-compliant FQDNs in the document. -- The draft header indicates that this document obsoletes RFC3987, but the abstract doesn't seem to directly say this. It does mention RFC3987 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (January 9, 2012) is 4489 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'LEIRI' is defined on line 1698, but no explicit reference was found in the text == Unused Reference: 'RFC2045' is defined on line 1703, but no explicit reference was found in the text == Unused Reference: 'RFC6082' is defined on line 1751, but no explicit reference was found in the text == Unused Reference: 'XMLNamespace' is defined on line 1774, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646' ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6' -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15' == Outdated reference: A later version (-03) exists of draft-ietf-iri-bidi-guidelines-00 == Outdated reference: A later version (-02) exists of draft-ietf-iri-comparison-00 -- Obsolete informational reference (is this intentional?): RFC 2141 (Obsoleted by RFC 8141) -- Obsolete informational reference (is this intentional?): RFC 2192 (Obsoleted by RFC 5092) -- Obsolete informational reference (is this intentional?): RFC 2368 (Obsoleted by RFC 6068) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) == Outdated reference: A later version (-04) exists of draft-ietf-iri-4395bis-irireg-03 Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 12 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internationalized Resource Identifiers M. Duerst 3 (iri) Aoyama Gakuin University 4 Internet-Draft M. Suignard 5 Obsoletes: 3987 (if approved) Unicode Consortium 6 Intended status: Standards Track L. Masinter 7 Expires: July 12, 2012 Adobe 8 January 9, 2012 10 Internationalized Resource Identifiers (IRIs) 11 draft-ietf-iri-3987bis-09 13 Abstract 15 This document defines the Internationalized Resource Identifier (IRI) 16 protocol element, as an extension of the Uniform Resource Identifier 17 (URI). An IRI is a sequence of characters from the Universal 18 Character Set (Unicode/ISO 10646). Grammar and processing rules are 19 given for IRIs and related syntactic forms. 21 Defining IRI as a new protocol element (rather than updating or 22 extending the definition of URI) allows independent orderly 23 transitions: other protocols and languages that use URIs must 24 explicitly choose to allow IRIs. 26 Guidelines are provided for the use and deployment of IRIs and 27 related protocol elements when revising protocols, formats, and 28 software components that currently deal only with URIs. 30 This document is part of a set of documents intended to replace RFC 31 3987. 33 RFC Editor: Please remove the next paragraph before publication. 35 This document (and several companion documents) is intended to obsolete RFC 36 3987, and also move towards IETF Draft Standard. For discussion and 37 comments on these drafts, please join the IETF IRI WG by subscribing 38 to the mailing list public-iri@w3.org, archives at 39 http://lists.w3.org/archives/public/public-iri/. For a list of open 40 issues, please see the issue tracker of the WG at 41 http://trac.tools.ietf.org/wg/iri/trac/report/1. For a list of 42 individual edits, please see the change history at 43 http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. 
45 Status of this Memo 47 This Internet-Draft is submitted in full conformance with the 48 provisions of BCP 78 and BCP 79. 50 Internet-Drafts are working documents of the Internet Engineering 51 Task Force (IETF). Note that other groups may also distribute 52 working documents as Internet-Drafts. The list of current Internet- 53 Drafts is at http://datatracker.ietf.org/drafts/current/. 55 Internet-Drafts are draft documents valid for a maximum of six months 56 and may be updated, replaced, or obsoleted by other documents at any 57 time. It is inappropriate to use Internet-Drafts as reference 58 material or to cite them other than as "work in progress." 60 This Internet-Draft will expire on July 12, 2012. 62 Copyright Notice 64 Copyright (c) 2012 IETF Trust and the persons identified as the 65 document authors. All rights reserved. 67 This document is subject to BCP 78 and the IETF Trust's Legal 68 Provisions Relating to IETF Documents 69 (http://trustee.ietf.org/license-info) in effect on the date of 70 publication of this document. Please review these documents 71 carefully, as they describe your rights and restrictions with respect 72 to this document. Code Components extracted from this document must 73 include Simplified BSD License text as described in Section 4.e of 74 the Trust Legal Provisions and are provided without warranty as 75 described in the Simplified BSD License. 77 This document may contain material from IETF Documents or IETF 78 Contributions published or made publicly available before November 79 10, 2008. The person(s) controlling the copyright in some of this 80 material may not have granted the IETF Trust the right to allow 81 modifications of such material outside the IETF Standards Process. 
82 Without obtaining an adequate license from the person(s) controlling 83 the copyright in such materials, this document may not be modified 84 outside the IETF Standards Process, and derivative works of it may 85 not be created outside the IETF Standards Process, except to format 86 it for publication as an RFC or to translate it into languages other 87 than English.
89 Table of Contents
91 1. Introduction
92 1.1. Overview and Motivation
93 1.2. Applicability
94 1.3. Definitions
95 1.4. Notation
96 2. IRI Syntax
97 2.1. Summary of IRI Syntax
98 2.2. ABNF for IRI References and IRIs
99 3. Processing IRIs and related protocol elements
100 3.1. Converting to UCS
101 3.2. Parse the IRI into IRI components
102 3.3. General percent-encoding of IRI components
103 3.4. Mapping ireg-name
104 3.4.1. Mapping using Percent-Encoding
105 3.4.2. Mapping using Punycode
106 3.4.3. Additional Considerations
107 3.5. Mapping query components
108 3.6. Mapping IRIs to URIs
109 4. Converting URIs to IRIs
110 4.1. Examples
111 5. Use of IRIs
112 5.1. Limitations on UCS Characters Allowed in IRIs
113 5.2. Software Interfaces and Protocols
114 5.3. Format of URIs and IRIs in Documents and Protocols
115 5.4. Use of UTF-8 for Encoding Original Characters
116 5.5. Relative IRI References
117 6. Legacy Extended IRIs (LEIRIs)
118 6.1. Legacy Extended IRI Syntax
119 6.2. Conversion of Legacy Extended IRIs to IRIs
120 6.3. Characters Allowed in Legacy Extended IRIs but not in 121 IRIs
122 7. URI/IRI Processing Guidelines (Informative)
123 7.1. URI/IRI Software Interfaces
124 7.2. URI/IRI Entry
125 7.3. URI/IRI Transfer between Applications
126 7.4. URI/IRI Generation
127 7.5. URI/IRI Selection
128 7.6. Display of URIs/IRIs
129 7.7. Interpretation of URIs and IRIs
130 7.8. Upgrading Strategy
131 8. IANA Considerations
132 9. Security Considerations
133 10. Acknowledgements
134 11. Main Changes Since RFC 3987
135 11.1. Split out Bidi, processing guidelines, comparison 136 sections
138 11.2. Major restructuring of IRI processing model
139 11.2.1. OLD WAY
140 11.2.2. NEW WAY
141 11.2.3. Extension of Syntax
142 11.2.4. More to be added
143 11.3. Change Log
144 11.3.1. Changes after draft-ietf-iri-3987bis-01
145 11.3.2. Changes from draft-duerst-iri-bis-07 to 146 draft-ietf-iri-3987bis-00
147 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis
148 11.4. Changes from -00 to -01
149 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00
150 11.6. Changes from -04 to -05 of draft-duerst-iri-bis
151 11.7. Changes from -03 to -04 of draft-duerst-iri-bis
152 11.8. Changes from -02 to -03 of draft-duerst-iri-bis
153 11.9. Changes from -01 to -02 of draft-duerst-iri-bis
154 11.10. Changes from -00 to -01 of draft-duerst-iri-bis
155 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
156 12. References
157 12.1. Normative References
158 12.2. Informative References
159 Authors' Addresses
161 1. Introduction 163 1.1. Overview and Motivation 165 A Uniform Resource Identifier (URI) is defined in [RFC3986] as a 166 sequence of characters chosen from a limited subset of the repertoire 167 of US-ASCII [ASCII] characters. 169 The characters in URIs are frequently used for representing words of 170 natural languages. This usage has many advantages: Such URIs are 171 easier to memorize, easier to interpret, easier to transcribe, easier 172 to create, and easier to guess. For most languages other than 173 English, however, the natural script uses characters other than A - 174 Z. For many people, handling Latin characters is as difficult as 175 handling the characters of other scripts is for those who use only 176 the Latin alphabet. Many languages with non-Latin scripts are 177 transcribed with Latin letters. 
These transcriptions are now often 178 used in URIs, but they introduce additional difficulties. 180 The infrastructure for the appropriate handling of characters from 181 additional scripts is now widely deployed in operating system and 182 application software. Software that can handle a wide variety of 183 scripts and languages at the same time is increasingly common. Also, 184 an increasing number of protocols and formats can carry a wide range 185 of characters. 187 URIs are composed out of a very limited repertoire of characters; 188 this design choice was made to support global transcription ([RFC3986], 189 Section 1.2.1). Reliable transition between a URI (as an abstract 190 protocol element composed of a sequence of characters) and a 191 presentation of that URI (written on a napkin, read out loud) and 192 back is relatively straightforward, because of the limited repertoire 193 of characters used. IRIs are designed to satisfy a different set of 194 use requirements; in particular, to allow IRIs to be written in ways 195 that are more meaningful to their users, even at the expense of 196 global transcribability. However, ensuring reliability of the 197 transition between an IRI and its presentation and back is more 198 difficult and complex when dealing with the larger set of Unicode 199 characters. For example, Unicode supports multiple ways of encoding 200 complex combinations of characters and accents, with multiple 201 character sequences that can result in the same presentation. 203 This document defines the protocol element called Internationalized 204 Resource Identifier (IRI), which allows applications of URIs to be 205 extended to use resource identifiers that have a much wider 206 repertoire of characters. It also provides corresponding 207 "internationalized" versions of other constructs from [RFC3986], such 208 as URI references. The syntax of IRIs is defined in Section 2. 
210 Within this document, Section 5 discusses the use of IRIs in 211 different situations. Section 7 gives additional informative 212 guidelines. Section 9 discusses IRI-specific security 213 considerations. 215 This specification is part of a collection of specifications intended 216 to replace [RFC3987]. [Bidi] discusses the special case of 217 bidirectional IRIs using characters from scripts written right-to- 218 left. [Equivalence] gives guidelines for applications wishing to 219 determine if two IRIs are equivalent, as well as defining some 220 equivalence methods. [RFC4395bis] updates the URI scheme 221 registration guidelines and procedures to note that every URI scheme 222 is also automatically an IRI scheme and to allow scheme definitions 223 to be directly described in terms of Unicode characters. 225 1.2. Applicability 227 IRIs are designed to allow protocols and software that deal with URIs 228 to be updated to handle IRIs. Processing of IRIs is accomplished by 229 extending the URI syntax while retaining (and not expanding) the set 230 of "reserved" characters, such that the syntax for any URI scheme may 231 be extended to allow non-ASCII characters. In addition, following 232 parsing of an IRI, it is possible to construct a corresponding URI by 233 first encoding characters outside of the allowed URI range and then 234 reassembling the components. 236 Practical use of IRI forms in place of URI forms depends on the 237 following conditions being met: 239 a. A protocol or format element MUST be explicitly designated to be 240 able to carry IRIs. The intent is to avoid introducing IRIs into 241 contexts that are not defined to accept them. For example, XML 242 Schema [XMLSchema] has an explicit type "anyURI" that includes 243 IRIs and IRI references. Therefore, IRIs and IRI references can 244 appear in attributes and elements of type "anyURI". 
On the other 245 hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is 246 defined as a URI, which means that direct use of IRIs is not 247 allowed in HTTP requests. 249 b. The protocol or format carrying the IRIs MUST have a mechanism to 250 represent the wide range of characters used in IRIs, either 251 natively or by some protocol- or format-specific escaping 252 mechanism (for example, numeric character references in [XML1]). 254 c. The URI scheme definition, if it explicitly allows a percent sign 255 ("%") in any syntactic component, SHOULD define the interpretation 256 of sequences of percent-encoded octets (using "%XX" hex octets) as 257 octets from sequences of UTF-8 encoded strings; this is recommended 258 in the guidelines for registering new schemes, [RFC4395bis]. For 259 example, this is the practice for IMAP URLs [RFC2192], POP URLs 260 [RFC2384], and the URN syntax [RFC2141]. Note that use of 261 percent-encoding may also be restricted in some situations; for 262 example, URI schemes that disallow percent-encoding might still be 263 used with a fragment identifier which is percent-encoded (e.g., 264 [XPointer]). See Section 5.4 for further discussion. 266 1.3. Definitions 268 The following definitions are used in this document; they follow the 269 terms in [RFC2130], [RFC2277], and [ISO10646]. 271 character: A member of a set of elements used for the organization, 272 control, or representation of data. For example, "LATIN CAPITAL 273 LETTER A" names a character. 275 octet: An ordered sequence of eight bits considered as a unit. 277 character repertoire: A set of characters (set in the mathematical 278 sense). 280 sequence of characters: A sequence of characters (one after 281 another). 283 sequence of octets: A sequence of octets (one after another). 285 character encoding: A method of representing a sequence of 286 characters as a sequence of octets (maybe with variants). 
Also, a 287 method of (unambiguously) converting a sequence of octets into a 288 sequence of characters. 290 charset: The name of a parameter or attribute used to identify a 291 character encoding. 293 UCS: Universal Character Set. The coded character set defined by 294 ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6]. 296 IRI reference: Denotes the common usage of an Internationalized 297 Resource Identifier. An IRI reference may be absolute or 298 relative. However, the "IRI" that results from such a reference 299 only includes absolute IRIs; any relative IRI references are 300 resolved to their absolute form. Note that in [RFC2396] URIs did 301 not include fragment identifiers, but in [RFC3986] fragment 302 identifiers are part of URIs. 304 LEIRI (Legacy Extended IRI) processing: This term was used in 305 various XML specifications to refer to strings that, although not 306 valid IRIs, were acceptable input to the processing rules in 307 Section 6.2. 309 running text: Human text (paragraphs, sentences, phrases) with 310 syntax according to orthographic conventions of a natural 311 language, as opposed to syntax defined for ease of processing by 312 machines (e.g., markup, programming languages). 314 protocol element: Any portion of a message that affects processing 315 of that message by the protocol in question. 317 create (a URI or IRI): With respect to URIs and IRIs, the term is 318 used for the initial creation. This may be the initial creation 319 of a resource with a certain identifier, or the initial exposition 320 of a resource under a particular identifier. 322 generate (a URI or IRI): With respect to URIs and IRIs, the term is 323 used when the identifier is generated by derivation from other 324 information. 
326 parsed URI component: When a URI processor parses a URI (following 327 the generic syntax or a scheme-specific syntax), the result is a 328 set of parsed URI components, each of which has a type 329 (corresponding to the syntactic definition) and a sequence of URI 330 characters. 332 parsed IRI component: When an IRI processor parses an IRI directly, 333 following the general syntax or a scheme-specific syntax, the 334 result is a set of parsed IRI components, each of which has a type 335 (corresponding to the syntactic definition) and a sequence of IRI 336 characters. (This definition is analogous to "parsed URI 337 component".) 339 IRI scheme: A URI scheme may also be known as an "IRI scheme" if the 340 scheme's syntax has been extended to allow non-US-ASCII characters 341 according to the rules in this document. 343 1.4. Notation 345 RFCs and Internet Drafts currently do not allow any characters 346 outside the US-ASCII repertoire. Therefore, this document uses 347 various special notations to denote such characters in examples. 349 In text, characters outside US-ASCII are sometimes referenced by 350 using a prefix of 'U+', followed by four to six hexadecimal digits. 352 To represent characters outside US-ASCII in examples, this document 353 uses 'XML Notation'. 355 XML Notation uses a leading '&#x', a trailing ';', and the 356 hexadecimal number of the character in the UCS in between. For 357 example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this 358 notation, an actual '&' is denoted by '&amp;'. 360 To denote actual octets in examples (as opposed to percent-encoded 361 octets), the two hex digits denoting the octet are enclosed in "<" 362 and ">". For example, the octet often denoted as 0xc9 is denoted 363 here as <c9>. 365 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 366 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 367 and "OPTIONAL" are to be interpreted as described in [RFC2119]. 369 2. 
IRI Syntax 371 This section defines the syntax of Internationalized Resource 372 Identifiers (IRIs). 374 As with URIs, an IRI is defined as a sequence of characters, not as a 375 sequence of octets. This definition accommodates the fact that IRIs 376 may be written on paper or read over the radio as well as stored or 377 transmitted digitally. The same IRI might be represented as 378 different sequences of octets in different protocols or documents if 379 these protocols or documents use different character encodings 380 (and/or transfer encodings). Using the same character encoding as 381 the containing protocol or document ensures that the characters in 382 the IRI can be handled (e.g., searched, converted, displayed) in the 383 same way as the rest of the protocol or document. 385 2.1. Summary of IRI Syntax 387 The IRI syntax extends the URI syntax in [RFC3986] by extending the 388 class of unreserved characters, primarily by adding the characters of 389 the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject 390 to the limitations given in the syntax rules below and in 391 Section 5.1. 393 The syntax and use of components and reserved characters are the same 394 as those in [RFC3986]. Each "URI scheme" thus also functions as an 395 "IRI scheme", in that scheme-specific parsing rules for URIs of a 396 scheme are extended to allow parsing of IRIs using the same 397 parsing rules. 399 All the operations defined in [RFC3986], such as the resolution of 400 relative references, can be applied to IRIs by IRI-processing 401 software in exactly the same way as they are for URIs by URI- 402 processing software. 404 Characters outside the US-ASCII repertoire MUST NOT be reserved and 405 therefore MUST NOT be used for syntactical purposes, such as to 406 delimit components in newly defined schemes. For example, U+00A2, 407 CENT SIGN, is not allowed as a delimiter in IRIs, because it is in 408 the 'iunreserved' category. 
This is similar to the fact that it is 409 not possible to use '-' as a delimiter in URIs, because it is in the 410 'unreserved' category. 412 2.2. ABNF for IRI References and IRIs 414 An ABNF definition for IRI references (which are the most general 415 concept and the start of the grammar) and IRIs is given here. The 416 syntax of this ABNF is described in [STD68]. Character numbers are 417 taken from the UCS, without implying any actual binary encoding. 418 Terminals in the ABNF are characters, not octets. 420 The following grammar closely follows the URI grammar in [RFC3986], 421 except that the range of unreserved characters is expanded to include 422 UCS characters, with the restriction that private UCS characters can 423 occur only in query parts. The grammar is split into two parts: 424 Rules that differ from [RFC3986] because of the above-mentioned 425 expansion, and rules that are the same as those in [RFC3986]. For 426 rules that are different than those in [RFC3986], the names of the 427 non-terminals have been changed as follows. If the non-terminal 428 contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' 429 has been prefixed. The rule <pct-form> has been introduced in order 430 to be able to reference it from other parts of the document. 432 The following rules are different from those in [RFC3986]:
434 IRI             = scheme ":" ihier-part [ "?" iquery ]
435                   [ "#" ifragment ]
437 ihier-part      = "//" iauthority ipath-abempty
438                 / ipath-absolute
439                 / ipath-rootless
440                 / ipath-empty
442 IRI-reference   = IRI / irelative-ref
444 absolute-IRI    = scheme ":" ihier-part [ "?" iquery ]
446 irelative-ref   = irelative-part [ "?" iquery ] [ "#" ifragment ]
447 irelative-part  = "//" iauthority ipath-abempty
448                 / ipath-absolute
449                 / ipath-noscheme
450                 / ipath-empty
452 iauthority      = [ iuserinfo "@" ] ihost [ ":" port ]
453 iuserinfo       = *( iunreserved / pct-form / sub-delims / ":" )
454 ihost           = IP-literal / IPv4address / ireg-name
456 pct-form        = pct-encoded
458 ireg-name       = *( iunreserved / sub-delims )
460 ipath           = ipath-abempty ; begins with "/" or is empty
461                 / ipath-absolute ; begins with "/" but not "//"
462                 / ipath-noscheme ; begins with a non-colon segment
463                 / ipath-rootless ; begins with a segment
464                 / ipath-empty ; zero characters
466 ipath-abempty   = *( path-sep isegment )
467 ipath-absolute  = path-sep [ isegment-nz *( path-sep isegment ) ]
468 ipath-noscheme  = isegment-nz-nc *( path-sep isegment )
469 ipath-rootless  = isegment-nz *( path-sep isegment )
470 ipath-empty     = 0<ipchar>
471 path-sep        = "/"
473 isegment        = *ipchar
474 isegment-nz     = 1*ipchar
475 isegment-nz-nc  = 1*( iunreserved / pct-form / sub-delims
476                 / "@" )
477                 ; non-zero-length segment without any colon ":"
479 ipchar          = iunreserved / pct-form / sub-delims / ":"
480                 / "@"
482 iquery          = *( ipchar / iprivate / "/" / "?" )
484 ifragment       = *( ipchar / "/" / "?" )
486 iunreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
488 ucschar         = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
489                 / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
490                 / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
491                 / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
492                 / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
493                 / %xD0000-DFFFD / %xE1000-EFFFD
495 iprivate        = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
496                 / %x100000-10FFFD
498 Some productions are ambiguous. The "first-match-wins" (a.k.a. 499 "greedy") algorithm applies. For details, see [RFC3986].
501 The following rules are the same as those in [RFC3986]:
503 scheme          = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
505 port            = *DIGIT
507 IP-literal      = "[" ( IPv6address / IPvFuture ) "]"
509 IPvFuture       = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
511 IPv6address     = 6( h16 ":" ) ls32
512                 / "::" 5( h16 ":" ) ls32
513                 / [ h16 ] "::" 4( h16 ":" ) ls32
514                 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
515                 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
516                 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
517                 / [ *4( h16 ":" ) h16 ] "::" ls32
518                 / [ *5( h16 ":" ) h16 ] "::" h16
519                 / [ *6( h16 ":" ) h16 ] "::"
521 h16             = 1*4HEXDIG
522 ls32            = ( h16 ":" h16 ) / IPv4address
524 IPv4address     = dec-octet "." dec-octet "." dec-octet "." dec-octet
526 dec-octet       = DIGIT ; 0-9
527                 / %x31-39 DIGIT ; 10-99
528                 / "1" 2DIGIT ; 100-199
529                 / "2" %x30-34 DIGIT ; 200-249
530                 / "25" %x30-35 ; 250-255
532 pct-encoded     = "%" HEXDIG HEXDIG
534 unreserved      = ALPHA / DIGIT / "-" / "." / "_" / "~"
535 reserved        = gen-delims / sub-delims
536 gen-delims      = ":" / "/" / "?" / "#" / "[" / "]" / "@"
537 sub-delims      = "!" / "$" / "&" / "'" / "(" / ")"
538                 / "*" / "+" / "," / ";" / "="
540 This syntax does not support IPv6 scoped addressing zone identifiers. 542 3. Processing IRIs and related protocol elements 544 IRIs are meant to replace URIs in identifying resources within new 545 versions of protocols, formats, and software components that use a 546 UCS-based character repertoire. Protocols and components may use and 547 process IRIs directly. However, there are still numerous systems and 548 protocols which only accept URIs or components of parsed URIs; that 549 is, they only accept sequences of characters within the subset of US- 550 ASCII characters allowed in URIs. 552 This section defines specific processing steps for IRI consumers 553 which establish the relationship between the string given and the 554 interpreted derivatives. These processing steps apply to both IRIs 555 and IRI references (i.e., absolute or relative forms); for IRIs, some 556 steps are scheme specific. 558 3.1. 
Converting to UCS 560 Input that is already in a Unicode form (i.e., a sequence of Unicode 561 characters or an octet-stream representing a Unicode-based character 562 encoding such as UTF-8 or UTF-16) should be left as is and not 563 normalized or changed. 565 An IRI or IRI reference is a sequence of characters from the UCS. 566 For input from presentations (written on paper, read aloud) or 567 translation from other representations (a text stream using a legacy 568 character encoding), convert the input to Unicode. Note that some 569 character encodings or transcriptions can be converted to or 570 represented by more than one sequence of Unicode characters. Ideally 571 the resulting IRI would use a normalized form, such as Unicode 572 Normalization Form C [UTR15], since that ensures a stable, consistent 573 representation that is most likely to produce the intended results. 574 Previous versions of this specification required normalization at 575 this step. However, attempts to require normalization in other 576 protocols have met with strong enough resistance that requiring 577 normalization here was considered impractical. Implementers and 578 users are cautioned that, while denormalized character sequences are 579 valid, they might be difficult for other users or processes to 580 reproduce and might lead to unexpected results. 582 3.2. Parse the IRI into IRI components 584 Parse the IRI, either as a relative reference (no scheme) or using 585 scheme specific processing (according to the scheme given); the 586 result is a set of parsed IRI components. 588 3.3. General percent-encoding of IRI components 590 Except as noted in the following subsections, IRI components are 591 mapped to the equivalent URI components by percent-encoding those 592 characters not allowed in URIs. 
Previous processing steps will have removed some characters, and the interpretation of reserved characters will have already been done (with the syntactic reserved characters outside of the IRI component). This mapping is defined for all sequences of Unicode characters, whether or not they are valid for the component in question.

For each character that is not allowed anywhere in a valid URI, apply the following steps.

Convert to UTF-8: Convert the character to a sequence of one or more octets using UTF-8 [RFC3629].

Percent encode: Convert each octet of this sequence to %HH, where HH is the hexadecimal notation of the octet value. The hexadecimal notation SHOULD use uppercase letters. (This is the general URI percent-encoding mechanism in Section 2.1 of [RFC3986].)

Note that the mapping is an identity transformation for parsed URI components of valid URIs, and is idempotent: applying the mapping a second time will not change anything.

3.4. Mapping ireg-name

3.4.1. Mapping using Percent-Encoding

The ireg-name component SHOULD be converted according to the general procedure for percent-encoding of IRI components described in Section 3.3.

For example, the IRI
"http://résumé.example.org"
will be converted to
"http://r%C3%A9sum%C3%A9.example.org".

This conversion for ireg-name is in line with Section 3.2.2 of [RFC3986], which does not mandate a particular registered name lookup technology. For further background, see [RFC6055] and [Gettys].

3.4.2. Mapping using Punycode

The ireg-name component MAY also be converted as follows:

If there are any sequences of <pct-encoded>, and their corresponding octets all represent valid UTF-8 octet sequences, then convert these back to Unicode character sequences.
(If any <pct-encoded> sequences are not valid UTF-8 octet sequences, then leave the entire field as is without any change, since punycode encoding would not succeed.)

Replace the ireg-name part of the IRI by the part converted using the Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891] on each dot-separated label, and by using U+002E (FULL STOP) as a label separator. This procedure may fail, but this would mean that the IRI cannot be resolved. If the domain name conversion fails, then the entire IRI conversion fails. Processors that have no mechanism for signalling a failure MAY instead substitute an otherwise invalid host name, although such processing SHOULD be avoided.

For example, the IRI
"http://résumé.example.org"
MAY be converted to
"http://xn--rsum-bpad.example.org".

This conversion for ireg-name will be better able to deal with legacy infrastructure that cannot handle percent-encoding in domain names.

3.4.3. Additional Considerations

Note: Domain Names may appear in parts of an IRI other than the ireg-name part. It is the responsibility of scheme-specific implementations (if the Internationalized Domain Name is part of the scheme syntax) or of server-side implementations (if the Internationalized Domain Name is part of 'iquery') to apply the necessary conversions at the appropriate point. Example: Trying to validate the Web page at http://résumé.example.org would lead to an IRI of http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org, which would convert to a URI of http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The server-side implementation is responsible for making the necessary conversions to be able to retrieve the Web page.

Note: In this process, characters allowed in URI references and existing percent-encoded sequences are not encoded further.
(This mapping is similar to, but different from, the encoding applied when arbitrary content is included in some part of a URI.) For example, an IRI of
"http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is converted to
"http://www.example.org/red%09ros%C3%A9#red", not to something like
"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

3.5. Mapping query components

For compatibility with existing deployed HTTP infrastructure, the following special case applies for schemes "http" and "https" and IRIs whose origin has a document charset other than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query" component of an IRI is mapped into a URI by using the document charset rather than UTF-8 as the binary representation before pct-encoding. This mapping is not applied for any other scheme or component.

3.6. Mapping IRIs to URIs

The mapping from an IRI to URI is accomplished by applying the mapping above (from IRI to URI components) and then reassembling a URI from the parsed URI components using the original punctuation that delimited the IRI components.

4. Converting URIs to IRIs

In some situations, for presentation and further processing, it is desirable to convert a URI into an equivalent IRI without unnecessary percent encoding. Of course, every URI is already an IRI in its own right without any conversion. This section gives one possible procedure for URI to IRI mapping.

The conversion described in this section, if given a valid URI, will result in an IRI that maps back to the URI used as an input for the conversion (except for potential case differences in percent-encoding and for potential percent-encoded unreserved characters). However, the IRI resulting from this conversion may differ from the original IRI (if there ever was one).
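
The IRI-to-URI direction that this round trip relies on (the general percent-encoding of Section 3.3, reassembled as in Section 3.6) can be illustrated with a minimal Python sketch. It is not a normative implementation: the names URI_CHARS and iri_to_uri are hypothetical, the sketch works on the whole string rather than on parsed components, and it deliberately ignores the Punycode alternative of Section 3.4.2 and the query special case of Section 3.5.

```python
# Characters that may appear in a URI: "unreserved" and "reserved" from the
# ABNF above, plus "%" so that existing percent-encodings pass through.
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"           # remaining unreserved characters
    ":/?#[]@"        # gen-delims
    "!$&'()*+,;="    # sub-delims
    "%"
)


def iri_to_uri(iri: str) -> str:
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in iri:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH (uppercase hex).
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)
```

Because "%" and all characters already legal in a URI are copied unchanged, applying iri_to_uri to its own output returns the same string, which is the idempotence property stated in Section 3.3.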

URI-to-IRI conversion removes percent-encodings, but not all percent-encodings can be eliminated. There are several reasons for this:

1. Some percent-encodings are necessary to distinguish percent-encoded and unencoded uses of reserved characters.

2. Some percent-encodings cannot be interpreted as sequences of UTF-8 octets.

   (Note: The octet patterns of UTF-8 are highly regular. Therefore, there is a very high probability, but no guarantee, that percent-encodings that can be interpreted as sequences of UTF-8 octets actually originated from UTF-8. For a detailed discussion, see [Duerst97].)

3. The conversion may result in a character that is not appropriate in an IRI. See Section 2.2 and Section 5.1 for further details.

4. IRI to URI conversion has different rules for dealing with domain names and query parameters.

Conversion from a URI to an IRI MAY be done by using the following steps:

1. Represent the URI as a sequence of octets in US-ASCII.

2. Convert all percent-encodings ("%" followed by two hexadecimal digits) to the corresponding octets, except those corresponding to "%", characters in "reserved", and characters in US-ASCII not allowed in URIs.

3. Re-percent-encode any octet produced in step 2 that is not part of a strictly legal UTF-8 octet sequence.

4. Re-percent-encode all octets produced in step 3 that in UTF-8 represent characters that are not appropriate according to Section 2.2 and Section 5.1.

5. Interpret the resulting octet sequence as a sequence of characters encoded in UTF-8.

6. URIs known to contain domain names in the reg-name component SHOULD convert punycode-encoded domain name labels to the corresponding characters using the ToUnicode procedure.

This procedure will convert as many percent-encoded characters as possible to characters in an IRI.
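
Steps 1 through 5 above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure's reference implementation: the function names are hypothetical, the appropriateness check of step 4 is replaced by a small stand-in that rejects only the bidi formatting characters listed in Section 6.3 (a real implementation would consult Sections 2.2 and 5.1 in full), and step 6 (ToUnicode on domain labels) is left out.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Stand-in for Sections 2.2 and 5.1 (assumption): only bidi formatting
# characters are treated as inappropriate in this sketch.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}

_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    """Percent-encode octets using uppercase hexadecimal digits."""
    return "".join("%{:02X}".format(o) for o in octets)


def _convert_run(match: "re.Match") -> str:
    octets = bytes.fromhex(match.group(0).replace("%", ""))  # step 2
    out, i = [], 0
    while i < len(octets):
        ch = None
        for n in (1, 2, 3, 4):  # a UTF-8 character is 1 to 4 octets long
            try:
                ch = octets[i:i + n].decode("utf-8")  # strict decoding
                break
            except UnicodeDecodeError:
                continue
        if ch is None:
            out.append(_pct(octets[i:i + 1]))  # step 3: not legal UTF-8
            i += 1
            continue
        size = len(ch.encode("utf-8"))
        if ord(ch) < 0x80:
            # Step 2 exceptions: "%", reserved characters, and US-ASCII
            # characters not allowed in URIs stay percent-encoded.
            keep = ch in UNRESERVED
        else:
            keep = ord(ch) not in BIDI_FORMATTING  # step 4 (stand-in check)
        out.append(ch if keep else _pct(octets[i:i + size]))
        i += size
    return "".join(out)  # step 5: result is a sequence of characters


def uri_to_iri(uri: str) -> str:
    """Apply steps 1-5 to every run of consecutive percent-encodings."""
    return _PCT_RUN.sub(_convert_run, uri)
```

Under these assumptions, uri_to_iri("http://www.example.org/D%C3%BCrst") yields the IRI containing U+00FC, while "http://www.example.org/D%FCrst" is returned unchanged because the octet FC is not part of a legal UTF-8 sequence.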
Because there are some choices when step 4 is applied (see Section 5.1), results may vary.

Conversions from URIs to IRIs MUST NOT use any character encoding other than UTF-8 in steps 3 and 4, even if it might be possible to guess from the context that another character encoding than UTF-8 was used in the URI. For example, the URI "http://www.example.org/r%E9sum%E9.html" might with some guessing be interpreted to contain two e-acute characters encoded as iso-8859-1. It must not be converted to an IRI containing these e-acute characters. Otherwise, in the future the IRI will be mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different URI from "http://www.example.org/r%E9sum%E9.html".

4.1. Examples

This section shows various examples of converting URIs to IRIs. Each example shows the result after each of the steps 1 through 6 is applied. XML Notation is used for the final result. Octets are denoted by "<" followed by two hexadecimal digits followed by ">".

The following example contains the sequence "%C3%BC", which is a strictly legal UTF-8 sequence, and which is converted into the actual character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as u-umlaut).

1. http://www.example.org/D%C3%BCrst

2. http://www.example.org/D<c3><bc>rst

3. http://www.example.org/D<c3><bc>rst

4. http://www.example.org/D<c3><bc>rst

5. http://www.example.org/D&#xFC;rst

6. http://www.example.org/D&#xFC;rst

The following example contains the sequence "%FC", which might represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the iso-8859-1 character encoding. (It might represent other characters in other character encodings. For example, the octet <fc> in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a strictly legal UTF-8 sequence, it is re-percent-encoded in step 3.

1. http://www.example.org/D%FCrst

2. http://www.example.org/D<fc>rst

3. http://www.example.org/D%FCrst

4. http://www.example.org/D%FCrst

5. http://www.example.org/D%FCrst

6. http://www.example.org/D%FCrst

The following example contains "%e2%80%ae", which is the percent-encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. The direct use of this character is forbidden in an IRI. Therefore, the corresponding octets are re-percent-encoded in step 4. This example shows that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved. The example also contains a punycode-encoded domain name label (xn--99zt52a), which is converted in step 6.

1. http://xn--99zt52a.example.org/%e2%80%ae

2. http://xn--99zt52a.example.org/<e2><80><ae>

3. http://xn--99zt52a.example.org/<e2><80><ae>

4. http://xn--99zt52a.example.org/%E2%80%AE

5. http://xn--99zt52a.example.org/%E2%80%AE

6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46 (Japanese Natto).

5. Use of IRIs

5.1. Limitations on UCS Characters Allowed in IRIs

This section discusses limitations on characters and character sequences usable for IRIs beyond those given in Section 2.2. The considerations in this section are relevant when IRIs are created and when URIs are converted to IRIs.

a. The repertoire of characters allowed in each IRI component is limited by the definition of that component. For example, the definition of the scheme component does not allow characters beyond US-ASCII.

   (Note: In accordance with URI practice, generic IRI software cannot and should not check for such limitations.)

b. The UCS contains many areas of characters for which there are strong visual look-alikes. Because of the likelihood of transcription errors, these also should be avoided.
This includes the full-width equivalents of Latin characters, half-width Katakana characters for Japanese, and many others. It also includes many look-alikes of "space", "delims", and "unwise", characters excluded in [RFC3491].

Additional information is available from [UNIXML]. [UNIXML] is written in the context of running text rather than in that of identifiers. Nevertheless, it discusses many of the categories of characters not appropriate for IRIs.

5.2. Software Interfaces and Protocols

Although an IRI is defined as a sequence of characters, software interfaces for URIs typically function on sequences of octets or other kinds of code units. Thus, software interfaces and protocols MUST define which character encoding is used.

Intermediate software interfaces between IRI-capable components and URI-only components MUST map the IRIs per Section 3.6, when transferring from IRI-capable to URI-only components. This mapping SHOULD be applied as late as possible. It SHOULD NOT be applied between components that are known to be able to handle IRIs.

5.3. Format of URIs and IRIs in Documents and Protocols

Document formats that transport URIs may have to be upgraded to allow the transport of IRIs. In cases where the document as a whole has a native character encoding, IRIs MUST also be encoded in this character encoding and converted accordingly by a parser or interpreter. IRI characters not expressible in the native character encoding SHOULD be escaped by using the escaping conventions of the document format if such conventions are available. Alternatively, they MAY be percent-encoded according to Section 3.6. For example, in HTML or XML, numeric character references SHOULD be used.
If a document as a whole has a native character encoding and that character encoding is not UTF-8, then IRIs MUST NOT be placed into the document in the UTF-8 character encoding.

((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs, although they use different terminology. HTML 4.0 [HTML4] defines the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications based upon them allow IRIs. Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs [CharMod].

5.4. Use of UTF-8 for Encoding Original Characters

This section discusses details and gives examples for point c) in Section 1.2. To be able to use IRIs, the URI corresponding to the IRI in question has to encode original characters into octets by using UTF-8. This can be specified for all URIs of a URI scheme or can apply to individual URIs for schemes that do not specify how to encode original characters. It can apply to the whole URI, or only to some part. For background information on encoding characters into URIs, see also Section 2.5 of [RFC3986].

For new URI schemes, using UTF-8 is recommended in [RFC4395bis]. Examples where UTF-8 is already used are the URN syntax [RFC2141], IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, because the HTTP URI scheme does not specify how to encode original characters, only some HTTP URLs can have corresponding but different IRIs.

For example, for a document with a URI of "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to construct a corresponding IRI (in XML notation, see Section 1.4): "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-encoded representation of that character).
On the other hand, for a document with a URI of "http://www.example.org/r%E9sum%E9.html", the percent-encoded octets cannot be converted to actual characters in an IRI, as the percent-encoding is not based on UTF-8.

For most URI schemes, there is no need to upgrade their scheme definition in order for them to work with IRIs. The main case where upgrading makes sense is when a scheme definition, or a particular component of a scheme, is strictly limited to the use of US-ASCII characters with no provision to include non-ASCII characters/octets via percent-encoding, or if a scheme definition currently uses highly scheme-specific provisions for the encoding of non-ASCII characters. An example of this is the mailto: scheme [RFC2368].

This specification updates the IANA registry of URI schemes to note their applicability to IRIs; see Section 8. All IRIs use URI schemes, and all URIs with URI schemes can be used as IRIs, even though in some cases only by using URIs directly as IRIs, without any conversion.

Scheme definitions can impose restrictions on the syntax of scheme-specific URIs; i.e., URIs that are admissible under the generic URI syntax [RFC3986] may not be admissible due to narrower syntactic constraints imposed by a URI scheme specification. URI scheme definitions cannot broaden the syntactic restrictions of the generic URI syntax; otherwise, it would be possible to generate URIs that satisfied the scheme-specific syntactic constraints without satisfying the syntactic constraints of the generic URI syntax. However, additional syntactic constraints imposed by URI scheme specifications are applicable to IRIs, as the corresponding URI resulting from the mapping defined in Section 3.6 MUST be a valid URI under the syntactic restrictions of generic URI syntax and any narrower restrictions imposed by the corresponding URI scheme specification.

The requirement for the use of UTF-8 generally applies to all parts of a URI. However, it is possible that the capability of IRIs to represent a wide range of characters directly is used just in some parts of the IRI (or IRI reference). The other parts of the IRI may only contain US-ASCII characters, or they may not be based on UTF-8. They may be based on another character encoding, or they may directly encode raw binary data (see also [RFC2397]).

For example, it is possible to have a URI reference of "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the document name is encoded in iso-8859-1 based on server settings, but where the fragment identifier is encoded in UTF-8 according to [XPointer]. The IRI corresponding to the above URI would be (in XML notation)
"http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

Similar considerations apply to query parts. The functionality of IRIs (namely, to be able to include non-ASCII characters) can only be used if the query part is encoded in UTF-8.

5.5. Relative IRI References

Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references.

6. Legacy Extended IRIs (LEIRIs)

For historic reasons, some formats have allowed variants of IRIs that are somewhat less restricted in syntax. This section provides a definition and a name (Legacy Extended IRI or LEIRI) for these variants for easier reference. These variants have to be used with care; they require further processing before being fully interchangeable as IRIs. New protocols and formats SHOULD NOT use Legacy Extended IRIs.
Even where Legacy Extended IRIs are allowed, only IRIs fully conforming to the syntax definition in Section 2.2 SHOULD be created, generated, and used. The provisions in this section also apply to Legacy Extended IRI references.

6.1. Legacy Extended IRI Syntax

The syntax of Legacy Extended IRIs is the same as that for IRIs, except that ucschar is redefined as follows:

ucschar        = " " / "<" / ">" / '"' / "{" / "}" / "|"
               / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
               / %xE000-FFFD / %x10000-10FFFF

The restriction on bidirectional formatting characters in [Bidi] is lifted. The iprivate production becomes redundant.

Likewise, the syntax for Legacy Extended IRI references (LEIRI references) is the same as that for IRI references with the above redefinition of ucschar applied.

Formats that use Legacy Extended IRIs or Legacy Extended IRI references MAY further restrict the characters allowed therein, either implicitly by the fact that the format as such does not allow some characters, or explicitly. An example of a character not allowed implicitly may be the NUL character (U+0000). However, all the characters allowed in IRIs MUST still be allowed.

6.2. Conversion of Legacy Extended IRIs to IRIs

To convert a Legacy Extended IRI (reference) to an IRI (reference), each character allowed in a Legacy Extended IRI (reference) but not allowed in an IRI (reference) (see Section 6.3) MUST be percent-encoded by applying the percent-encoding steps of Section 3.3.

6.3. Characters Allowed in Legacy Extended IRIs but not in IRIs

This section provides a list of the groups of characters and code points that are allowed in Legacy Extended IRIs, but are not allowed in IRIs or are allowed in IRIs only in the query part. For each group of characters, advice on the usage of these characters is also given, concentrating on the reasons not to use them.

Space (U+0020): Some formats and applications use space as a delimiter, e.g. for items in a list. Appendix C of [RFC3986] also mentions that white space may have to be added when displaying or printing long URIs; the same applies to long IRIs. This means that spaces can disappear, or can cause the Legacy Extended IRI to be interpreted as two or more separate IRIs.

Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix C of [RFC3986] suggests the use of double-quotes ("http://example.com/") and angle brackets (<http://example.com/>) as delimiters for URIs in plain text. These conventions are often used, and also apply to IRIs. Legacy Extended IRIs using these characters will be cut off at the wrong place.

Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These characters were originally excluded from URIs because the respective codepoints are assigned to different graphic characters in some 7-bit or 8-bit encoding. Despite the move to Unicode, some of these characters are still occasionally displayed differently on some systems, e.g. U+005C as a Japanese Yen symbol. Also, the fact that these characters are not used in URIs or IRIs has encouraged their use outside URIs or IRIs in contexts that may include URIs or IRIs. In case a Legacy Extended IRI with such a character is used in such a context, the Legacy Extended IRI will be interpreted piecemeal.

The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - #x9F): There is no way to transmit these characters reliably except potentially in electronic form. Even when in electronic form, some software components might silently filter out some of these characters, or may stop processing altogether when encountering some of them.
These characters may affect text display in subtle, unnoticeable ways or in drastic, global, and irreversible ways depending on the hardware and software involved. The use of some of these characters may allow malicious users to manipulate the display of a Legacy Extended IRI and its context.

Bidi formatting characters (U+200E, U+200F, U+202A-202E): These characters affect the display ordering of characters. Displayed Legacy Extended IRIs containing these characters cannot be converted back to electronic form (logical order) unambiguously. These characters may allow malicious users to manipulate the display of a Legacy Extended IRI and its context.

Specials (U+FFF0-FFFD): These code points provide functionality beyond that useful in a Legacy Extended IRI, for example byte order identification, annotation, and replacements for unknown characters and objects. Their use and interpretation in a Legacy Extended IRI serves no purpose and may lead to confusing display variations.

Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-10FFFD): Display and interpretation of these code points is by definition undefined without private agreement. Therefore, these code points are not suited for use on the Internet. They are not interoperable and may have unpredictable effects.

Tags (U+E0000-E0FFF): These characters provide a way to language tag in Unicode plain text. They are not appropriate for Legacy Extended IRIs because language information in identifiers cannot reliably be input, transmitted (e.g. on a visual medium such as paper), or recognized.

Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as non-characters. Applications may use some of them internally, but are not prepared to interchange them.

For reference, we here also list the code points and code units not even allowed in Legacy Extended IRIs:

Surrogate code units (D800-DFFF): These do not represent Unicode codepoints.

7. URI/IRI Processing Guidelines (Informative)

This informative section provides guidelines for supporting IRIs in the same software components and operations that currently process URIs: Software interfaces that handle URIs, software that allows users to enter URIs, software that creates or generates URIs, software that displays URIs, formats and protocols that transport URIs, and software that interprets URIs. These may all require modification before functioning properly with IRIs. The considerations in this section also apply to URI references and IRI references.

7.1. URI/IRI Software Interfaces

Software interfaces that handle URIs, such as URI-handling APIs and protocols transferring URIs, need interfaces and protocol elements that are designed to carry IRIs.

In case the current handling in an API or protocol is based on US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as it is compatible with US-ASCII, is in accordance with the recommendations of [RFC2277], and makes converting to URIs easy. In any case, the API or protocol definition must clearly define the character encoding to be used.

The transfer from URI-only to IRI-capable components requires no mapping, although the conversion described in Section 4 above may be performed. It is preferable not to perform this inverse conversion unless it is certain this can be done correctly.

7.2. URI/IRI Entry

Some components allow users to enter URIs into the system by typing or dictation, for example. This software must be updated to allow for IRI entry.

A person viewing a visual presentation of an IRI (as a sequence of glyphs, in some order, in some visual display) will use an entry method for characters in the user's language to input the IRI. Depending on the script and the input method used, this may be a more or less complicated process.

The process of IRI entry must ensure, as much as possible, that the restrictions defined in Section 2.2 are met. This may be done by choosing appropriate input methods or variants/settings thereof, by appropriately converting the characters being input, by eliminating characters that cannot be converted, and/or by issuing a warning or error message to the user.

As an example of variant settings, input method editors for East Asian Languages usually allow the input of Latin letters and related characters in full-width or half-width versions. For IRI input, the input method editor should be set so that it produces half-width Latin letters and punctuation and full-width Katakana.

An input field primarily or solely used for the input of URIs/IRIs might allow the user to view an IRI as it is mapped to a URI. Places where the input of IRIs is frequent may provide the possibility for viewing an IRI as mapped to a URI. This will help users when some of the software they use does not yet accept IRIs.

An IRI input component interfacing to components that handle URIs, but not IRIs, must map the IRI to a URI before passing it to these components.

For the input of IRIs with right-to-left characters, please see [Bidi].

7.3. URI/IRI Transfer between Applications

Many applications (for example, mail user agents) try to detect URIs appearing in plain text. For this, they use some heuristics based on URI syntax. They then allow the user to click on such URIs and retrieve the corresponding resource in an appropriate (usually scheme-dependent) application.

Such applications would need to be upgraded, in order to use the IRI syntax as a base for heuristics. In particular, a non-ASCII character should not be taken as the indication of the end of an IRI. Such applications also would need to make sure that they correctly convert the detected IRI from the character encoding of the document or application where the IRI appears, to the character encoding used by the system-wide IRI invocation mechanism, or to a URI (according to Section 3.6) if the system-wide invocation mechanism only accepts URIs.

The clipboard is another frequently used way to transfer URIs and IRIs from one application to another. On most platforms, the clipboard is able to store and transfer text in many languages and scripts. Correctly used, the clipboard transfers characters, not octets, which will do the right thing with IRIs.

7.4. URI/IRI Generation

Systems that offer resources through the Internet, where those resources have logical names, sometimes automatically generate URIs for the resources they offer. For example, some HTTP servers can generate a directory listing for a file directory and then respond to the generated URIs with the files.

Many legacy character encodings are in use in various file systems.
Many currently deployed systems do not transform the local character representation of the underlying system before generating URIs.

For maximum interoperability, systems that generate resource identifiers should make the appropriate transformations. For example, if a file system contains a file named "résumé.html", a server should expose this as "r%C3%A9sum%C3%A9.html" in a URI, which allows use of "résumé.html" in an IRI, even if locally the file name is kept in a character encoding other than UTF-8.

This recommendation particularly applies to HTTP servers. For FTP servers, similar considerations apply; see [RFC2640].

7.5. URI/IRI Selection

In some cases, resource owners and publishers have control over the IRIs used to identify their resources. This control is mostly exercised by controlling the resource names, such as file names, directly.

In these cases, it is recommended to avoid choosing IRIs that are easily confused. For example, for US-ASCII, the lower-case ell ("l") is easily confused with the digit one ("1"), and the upper-case oh ("O") is easily confused with the digit zero ("0"). Publishers should avoid confusing users with "br0ken" or "1ame" identifiers.

Outside the US-ASCII repertoire, there are many more opportunities for confusion; a complete set of guidelines is too lengthy to include here. As long as names are limited to characters from a single script, native writers of a given script or language will know best when ambiguities can appear, and how they can be avoided. What may look ambiguous to a stranger may be completely obvious to the average native user. On the other hand, in some cases, the UCS contains variants for compatibility reasons; for example, for typographic purposes. These should be avoided wherever possible.
Although there 1268 may be exceptions, newly created resource names should generally be 1269 in NFKC [UTR15] (which means that they are also in NFC). 1271 As an example, the UCS contains the "fi" ligature at U+FB01 for 1272 compatibility reasons. Wherever possible, IRIs should use the two 1273 letters "f" and "i" rather than the "fi" ligature. An example where 1274 the latter may be used is in the query part of an IRI for an explicit 1275 search for a word written with the "fi" ligature. 1277 In certain cases, there is a chance that characters from different 1278 scripts look the same. The best known example is the similarity of 1279 the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid 1280 such cases, IRIs should only be created where all the characters in a 1281 single component are used together in a given language. This usually 1282 means that all of these characters will be from the same script, but 1283 there are languages that mix characters from different scripts (such 1284 as Japanese). This is similar to the heuristics used to distinguish 1285 between letters and numbers in the examples above. Also, for Latin, 1286 Greek, and Cyrillic, using lowercase letters results in fewer 1287 ambiguities than using uppercase letters would. 1289 7.6. Display of URIs/IRIs 1291 In situations where the rendering software is not expected to display 1292 non-ASCII parts of the IRI correctly using the available layout and 1293 font resources, these parts should be percent-encoded before being 1294 displayed. 1296 For display of Bidi IRIs, please see [Bidi]. 1298 7.7. Interpretation of URIs and IRIs 1300 Software that interprets IRIs as the names of local resources should 1301 accept IRIs in multiple forms and convert and match them with the 1302 appropriate local resource names. 1304 First, these multiple forms include both IRIs in the native 1305 character encoding of the protocol and also their URI counterparts.
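As a minimal sketch of this first case (illustrative only, and not part of this specification), the following Python fragment derives the URI counterpart of an IRI path by percent-encoding the UTF-8 octets of each non-ASCII character, and maps either form back to a single local name. The function names iri_path_to_uri_path and to_local_name are hypothetical, and the sketch deliberately ignores normalization, the handling of domain names, and the remaining details of the mapping in Section 3.6.

```python
from urllib.parse import unquote


def iri_path_to_uri_path(iri_path: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    Hypothetical helper: ASCII characters, including existing
    percent-encoded triplets, are passed through unchanged.
    """
    out = []
    for ch in iri_path:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def to_local_name(received_path: str) -> str:
    """Map an IRI path or its URI counterpart to one Unicode name.

    errors="strict" makes percent-encoded octets that are not valid
    UTF-8 raise UnicodeDecodeError instead of being silently replaced.
    """
    return unquote(received_path, encoding="utf-8", errors="strict")
```

Under these assumptions, the IRI path "résumé.html" and the URI path "r%C3%A9sum%C3%A9.html" both map to the same local name, so a lookup against local resource names needs only one comparison form.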
1307 Second, it may include URIs constructed based on character encodings 1308 other than UTF-8. These URIs may be produced by user agents that do 1309 not conform to this specification and that use legacy character 1310 encodings to convert non-ASCII characters to URIs. Whether this is 1311 necessary, and what character encodings to cover, depends on a number 1312 of factors, such as the legacy character encodings used locally and 1313 the distribution of various versions of user agents. For example, 1314 software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in 1315 addition to UTF-8. 1317 Third, it may include additional mappings to be more user-friendly 1318 and robust against transmission errors. These would be similar to 1319 how some servers currently treat URIs as case insensitive or perform 1320 additional matching to account for spelling errors. For characters 1321 beyond the US-ASCII repertoire, this may, for example, include 1322 ignoring the accents on received IRIs or resource names. Please note 1323 that such mappings, including case mappings, are language dependent. 1325 It can be difficult to identify a resource unambiguously if too many 1326 mappings are taken into consideration. However, percent-encoded and 1327 not percent-encoded parts of IRIs can always be clearly 1328 distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes 1329 the potential for collisions lower than it may seem at first. 1331 7.8. Upgrading Strategy 1333 Where this recommendation places further constraints on software for 1334 which many instances are already deployed, it is important to 1335 introduce upgrades carefully and to be aware of the various 1336 interdependencies. 1338 If IRIs cannot be interpreted correctly, they should not be created, 1339 generated, or transported. This suggests that upgrading URI 1340 interpreting software to accept IRIs should have highest priority. 
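One hedged illustration of what upgrading interpreting software can involve: the Python sketch below resolves a received path against a set of local resource names, trying UTF-8 first and then legacy character encodings, in the spirit of Section 7.7. The function name resolve_local_name and the default list of encodings are assumptions made for this example only; an actual deployment would choose the encodings to cover based on local conditions and would probably also apply normalization.

```python
from typing import Iterable, Optional
from urllib.parse import unquote_to_bytes


def resolve_local_name(
    received_path: str,
    local_names: Iterable[str],
    encodings: tuple = ("utf-8", "shift_jis", "euc_jp"),
) -> Optional[str]:
    """Return the local name matching received_path, or None.

    Hypothetical helper: accepts the IRI form, its UTF-8 URI
    counterpart, and URIs percent-encoded from a legacy encoding.
    """
    names = set(local_names)
    # Reduce the IRI form and any URI form to the same raw octets.
    octets = unquote_to_bytes(received_path.encode("utf-8"))
    for encoding in encodings:
        try:
            candidate = octets.decode(encoding)
        except UnicodeDecodeError:
            continue  # octets are not valid in this encoding
        # Only accept a decoding that names an existing resource.
        if candidate in names:
            return candidate
    return None
```

Because the same octets can decode without error under more than one character encoding, the sketch accepts a candidate only when it matches an existing local name, rather than trusting the first successful decoding.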
1342 On the other hand, a single IRI is interpreted only by one or 1343 very few interpreters that are known in advance, although it may be 1344 entered and transported very widely. 1346 Therefore, IRIs benefit most from a broad upgrade of software to be 1347 able to enter and transport IRIs. However, before an individual IRI 1348 is published, care should be taken to upgrade the corresponding 1349 interpreting software in order to cover the forms expected to be 1350 received by various versions of entry and transport software. 1352 The upgrade of generating software to generate IRIs instead of using 1353 a local character encoding should happen only after the service is 1354 upgraded to accept IRIs. Similarly, IRIs should only be generated 1355 when the service accepts IRIs and the intervening infrastructure and 1356 protocol are known to transport them safely. 1358 Software converting from URIs to IRIs for display should be upgraded 1359 only after upgraded entry software has been widely deployed to the 1360 population that will see the displayed result. 1362 Where there is a free choice of character encodings, it is often 1363 possible to reduce the effort and dependencies for upgrading to IRIs 1364 by using UTF-8 rather than another encoding. For example, when a new 1365 file-based Web server is set up, using UTF-8 as the character 1366 encoding for file names will make the transition to IRIs easier. 1367 Likewise, when a new Web form is set up using UTF-8 as the character 1368 encoding of the form page, the returned query URIs will use UTF-8 as 1369 the character encoding (unless the user, for whatever reason, changes 1370 the character encoding) and will therefore be compatible with IRIs. 1372 These recommendations, when taken together, will allow for the 1373 extension from URIs to IRIs in order to handle characters other than 1374 US-ASCII while minimizing interoperability problems.
For 1375 considerations regarding the upgrade of URI scheme definitions, see 1376 Section 5.4. 1378 8. IANA Considerations 1380 RFC Editor and IANA note: Please replace RFC XXXX with the number of 1381 this document when it issues as an RFC. 1383 IANA maintains a registry of "URI schemes". A "URI scheme" also 1384 serves as an "IRI scheme". 1386 To clarify that the URI scheme registration process also applies to 1387 IRIs, change the description of the "URI schemes" registry header to 1388 say "[RFC4395] defines an IANA-maintained registry of URI Schemes. 1389 These registries include the Permanent and Provisional URI Schemes. 1390 RFC XXXX updates this registry to designate that schemes may also 1391 indicate their usability as IRI schemes." 1393 Update "per RFC 4395" to "per RFC 4395 and RFC XXXX". 1395 9. Security Considerations 1397 The security considerations discussed in [RFC3986] also apply to 1398 IRIs. In addition, the following issues require particular care for 1399 IRIs. 1401 Incorrect encoding or decoding can lead to security problems. For 1402 example, some UTF-8 decoders do not check against overlong byte 1403 sequences. See [UTR36] Section 3 for details. 1405 There are serious difficulties with relying on a human to verify that 1406 an IRI (whether presented visually or aurally) is the same as 1407 another IRI or is the one intended. These problems exist with ASCII- 1408 only URIs (bl00mberg.com vs. bloomberg.com) but are strongly 1409 exacerbated when using the much larger character repertoire of 1410 Unicode. For details, see Section 2 of [UTR36]. Using 1411 administrative and technical means to reduce the availability of such 1412 exploits is possible, but they are difficult to eliminate altogether. 1413 User agents SHOULD NOT rely on visual or perceptual comparison or 1414 verification of IRIs as a means of validating or assuring safety, 1415 correctness or appropriateness of an IRI.
Other means of presenting 1416 users with the validity, safety, or appropriateness of visited sites 1417 are being developed in the browser community as an alternative means 1418 of avoiding these difficulties. 1420 Besides the large character repertoire of Unicode, reasons for 1421 confusion include different forms of normalization and different 1422 normalization expectations, use of percent-encoding with various 1423 legacy encodings, and bidirectionality issues. See also [Bidi]. 1425 Confusion can occur in various IRI components, such as the domain 1426 name part or the path part, or between IRI components. For 1427 considerations specific to the domain name part, see [RFC5890]. For 1428 considerations specific to particular protocols or schemes, see the 1429 security sections of the relevant specifications and registration 1430 templates. Administrators of sites that allow independent users to 1431 create resources in the same sub area have to be careful. Details 1432 are discussed in Section 7.5. 1434 The characters additionally allowed in Legacy Extended IRIs introduce 1435 additional security issues. For details, see Section 6.3. 1437 10. Acknowledgements 1439 This document was derived from [RFC3987]; the acknowledgments from 1440 that specification still apply. 1442 In addition, this document was influenced by contributions from (in 1443 no particular order) Norman Walsh, Richard Tobin, Henry S.
Thompson, 1444 John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris 1445 Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank 1446 Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard 1447 Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie 1448 Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas 1449 Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van 1450 Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos 1451 Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R. 1452 Tobias, Marko Martin, Maciej Stanchowiak, Wil Tan, Yui Naruse, 1453 Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn 1454 Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex 1455 Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland, 1456 Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas 1457 Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev. 1459 11. Main Changes Since RFC 3987 1461 This section describes the main changes since [RFC3987]. 1463 11.1. Split out Bidi, processing guidelines, comparison sections 1465 Moved some components (comparison, bidi, processing) into separate 1466 documents. 1468 11.2. Major restructuring of IRI processing model 1470 Major restructuring of IRI processing model to make scheme-specific 1471 translation necessary to handle IDNA requirements and for consistency 1472 with web implementations. 1474 Starting with an IRI, you want one of: 1476 a IRI components (IRI parsed into UTF-8 pieces) 1478 b URI components (URI parsed into ASCII pieces, encoded correctly) 1480 c whole URI (for passing on to some other system that wants whole 1481 URIs) 1483 11.2.1. OLD WAY 1485 1. Pct-encode the whole thing to a URI. (c1) If you want a 1486 (maybe broken) whole URI, you might stop here. 1488 2. Parse the URI into URI components. (b1) If you want (maybe 1489 broken) URI components, stop here. 1491 3.
Decode the components (undoing the pct-encoding). (a) If you want 1492 IRI components, stop here. 1494 4. Re-encode: use a different encoding for some components (for 1495 domain names, and for query components in web pages; which one depends on 1496 the component, scheme, and context), and otherwise use pct- 1497 encoding. (b2) If you want (good) URI components, stop here. 1499 5. Reassemble the re-encoded components. (c2) If you want a (*good*) 1500 whole URI, stop here. 1502 11.2.2. NEW WAY 1504 1. Parse the IRI into IRI components using the generic syntax. (a) 1505 If you want IRI components, stop here. 1507 2. Encode each component, using pct-encoding, IDN encoding, or 1508 special query part encoding, depending on the component, scheme, or 1509 context. (b) If you want URI components, stop here. 1511 3. Reassemble a whole URI from the URI components. (c) If you want a 1512 whole URI, stop here. 1514 11.2.3. Extension of Syntax 1516 Added the tag range (U+E0000-E0FFF) to the iprivate production. Some 1517 IRIs generated with the new syntax may fail to pass very strict 1518 checks relying on the old syntax. But characters in this range 1519 should be extremely infrequent anyway. 1521 11.2.4. More to be added 1523 TODO: There are more main changes that need to be documented in this 1524 section. 1526 11.3. Change Log 1528 Note to RFC Editor: Please completely remove this section before 1529 publication. 1531 11.3.1. Changes after draft-ietf-iri-3987bis-01 1533 Changes from draft-ietf-iri-3987bis-01 onwards are available as 1534 changesets in the IETF tools subversion repository at http:// 1535 trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/ 1536 draft-ietf-iri-3987bis.xml. 1538 11.3.2.
Changes from draft-duerst-iri-bis-07 to 1539 draft-ietf-iri-3987bis-00 1541 Changed draft name, date, last paragraph of abstract, and titles in 1542 change log, and added this section in moving from 1543 draft-duerst-iri-bis-07 (personal submission) to 1544 draft-ietf-iri-3987bis-00 (WG document). 1546 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis 1548 Major restructuring of the processing model; see Section 11.2. 1550 11.4. Changes from -00 to -01 1552 o Removed 'mailto:' before mail addresses of authors. 1554 o Added "" as right side of 'href-strip' rule. Fixed 1555 '|' to '/' for alternatives. 1557 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 1559 o Add HyperText Reference, change abstract, acks and references for 1560 it. 1562 o Add Masinter back as another editor. 1564 o Masinter integrates HRef material from HTML5 spec. 1566 o Rewrite introduction sections to modernize. 1568 11.6. Changes from -04 to -05 of draft-duerst-iri-bis 1570 o Updated references. 1572 o Changed IPR text to pre5378Trust200902. 1574 11.7. Changes from -03 to -04 of draft-duerst-iri-bis 1576 o Added explicit abbreviation for LEIRIs. 1578 o Mentioned LEIRI references. 1580 o Completed text in LEIRI section about tag characters and about 1581 specials. 1583 11.8. Changes from -02 to -03 of draft-duerst-iri-bis 1585 o Updated some references. 1587 o Updated Michel Suignard's coordinates. 1589 11.9. Changes from -01 to -02 of draft-duerst-iri-bis 1591 o Added tag range to iprivate (issue private-include-tags-115). 1593 o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs. 1595 11.10. Changes from -00 to -01 of draft-duerst-iri-bis 1597 o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" 1598 based on input from the W3C XML Core WG. Moved the relevant 1599 subsections to the back and promoted them to a section. 1601 o Added some text re. Legacy Extended IRIs to the security section. 1603 o Added an IANA Considerations Section.
1605 o Added this Change Log Section. 1607 o Added a section about "IRIs with Spaces/Controls" (converting from 1608 a Note in RFC 3987). 1610 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis 1612 Fixed errata (see 1613 http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987). 1615 12. References 1617 12.1. Normative References 1619 [ASCII] American National Standards Institute, "Coded Character 1620 Set -- 7-bit American Standard Code for Information 1621 Interchange", ANSI X3.4, 1986. 1623 [ISO10646] 1624 International Organization for Standardization, "ISO/IEC 1625 10646:2011: Information Technology - Universal Multiple- 1626 Octet Coded Character Set (UCS)", ISO Standard 10646, 1627 March 2011, . 1631 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1632 Requirement Levels", BCP 14, RFC 2119, March 1997. 1634 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1635 Profile for Internationalized Domain Names (IDN)", 1636 RFC 3491, March 2003. 1638 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1639 10646", STD 63, RFC 3629, November 2003. 1641 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1642 Resource Identifier (URI): Generic Syntax", STD 66, 1643 RFC 3986, January 2005. 1645 [RFC5890] Klensin, J., "Internationalized Domain Names for 1646 Applications (IDNA): Definitions and Document Framework", 1647 RFC 5890, August 2010. 1649 [RFC5891] Klensin, J., "Internationalized Domain Names in 1650 Applications (IDNA): Protocol", RFC 5891, August 2010. 1652 [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax 1653 Specifications: ABNF", STD 68, RFC 5234, January 2008. 1655 [UNIV6] The Unicode Consortium, "The Unicode Standard, Version 1656 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 1657 ISBN 978-1-936213-01-6)", October 2010. 1659 [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", 1660 Unicode Standard Annex #15, March 2008, 1661 . 1664 12.2.
Informative References 1666 [Bidi] Duerst, M. and L. Masinter, "Guidelines for 1667 Internationalized Resource Identifiers with Bi-directional 1668 Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00 1669 (work in progress), August 2011. 1671 [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. 1672 Texin, "Character Model for the World Wide Web: Resource 1673 Identifiers", World Wide Web Consortium Candidate 1674 Recommendation, November 2004, 1675 . 1677 [Duerst97] 1678 Duerst, M., "The Properties and Promises of UTF-8", Proc. 1680 11th International Unicode Conference, San Jose , 1681 September 1997, . 1684 [Equivalence] 1685 Masinter, L. and M. Duerst, "Equivalence and 1686 Canonicalization of Internationalized Resource Identifiers 1687 (IRIs)", draft-ietf-iri-comparison-00 (work in progress), 1688 August 2011. 1690 [Gettys] Gettys, J., "URI Model Consequences", 1691 . 1693 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 1694 Specification", World Wide Web Consortium Recommendation, 1695 December 1999, 1696 . 1698 [LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended 1699 IRIs for XML resource identification", World Wide Web 1700 Consortium Note, November 2008, 1701 . 1703 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 1704 Extensions (MIME) Part One: Format of Internet Message 1705 Bodies", RFC 2045, November 1996. 1707 [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., 1708 Atkinson, R., Crispin, M., and P. Svanberg, "The Report of 1709 the IAB Character Set Workshop held 29 February - 1 March, 1710 1996", RFC 2130, April 1997. 1712 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 1714 [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. 1716 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1717 Languages", BCP 18, RFC 2277, January 1998. 1719 [RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The mailto 1720 URL scheme", RFC 2368, July 1998. 
1722 [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. 1724 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1725 Resource Identifiers (URI): Generic Syntax", RFC 2396, 1726 August 1998. 1728 [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, 1729 August 1998. 1731 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 1732 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 1733 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 1735 [RFC2640] Curtin, B., "Internationalization of the File Transfer 1736 Protocol", RFC 2640, July 1999. 1738 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource 1739 Identifiers (IRIs)", RFC 3987, January 2005. 1741 [RFC4395bis] 1742 Hansen, T., Hardie, T., and L. Masinter, "Guidelines and 1743 Registration Procedures for New URI/IRI Schemes", 1744 draft-ietf-iri-4395bis-irireg-03 (work in progress), 1745 July 2011. 1747 [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on 1748 Encodings for Internationalized Domain Names", RFC 6055, 1749 February 2011. 1751 [RFC6082] Whistler, K., Adams, G., Duerst, M., Presuhn, R., and J. 1752 Klensin, "Deprecating Unicode Language Tag Characters: RFC 1753 2482 is Historic", RFC 6082, November 2010. 1755 [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other 1756 Markup Languages", Unicode Technical Report #20, World 1757 Wide Web Consortium Note, June 2003, 1758 . 1760 [UTR36] Davis, M. and M. Suignard, "Unicode Security 1761 Considerations", Unicode Technical Report #36, 1762 August 2010, . 1764 [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking 1765 Language (XLink) Version 1.0", World Wide Web 1766 Consortium REC-xlink-20010627, June 2001, 1767 . 1769 [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and 1770 F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth 1771 Edition)", World Wide Web Consortium REC-xml-20081126, 1772 August 2006, .
1774 [XMLNamespace] 1775 Bray, T., Hollander, D., Layman, A., and R. Tobin, 1776 "Namespaces in XML (Second Edition)", World Wide Web 1777 Consortium REC-xml-names-20091208, August 2006, 1778 . 1780 [XMLSchema] 1781 Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", 1782 World Wide Web Consortium REC-xmlschema-2-20041028, 1783 May 2001, . 1785 [XPointer] 1786 Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer 1787 Framework", World Wide Web Consortium REC-xptr-framework- 1788 20030325, March 2003, 1789 . 1791 Authors' Addresses 1793 Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever 1794 possible, for example as "Dürst" in XML and HTML.) 1795 Aoyama Gakuin University 1796 5-10-1 Fuchinobe 1797 Sagamihara, Kanagawa 229-8558 1798 Japan 1800 Phone: +81 42 759 6329 1801 Fax: +81 42 759 6495 1802 Email: duerst@it.aoyama.ac.jp 1803 URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ 1804 (Note: This is the percent-encoded form of an IRI.) 1806 Michel Suignard 1807 Unicode Consortium 1808 P.O. Box 391476 1809 Mountain View, CA 94039-1476 1810 U.S.A. 1812 Phone: +1-650-693-3921 1813 Email: michel@unicode.org 1814 URI: http://www.suignard.com 1815 Larry Masinter 1816 Adobe 1817 345 Park Ave 1818 San Jose, CA 95110 1819 U.S.A. 1821 Phone: +1-408-536-3024 1822 Email: masinter@adobe.com 1823 URI: http://larry.masinter.net