idnits 2.17.1 draft-ietf-iri-3987bis-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the document. -- The draft header indicates that this document obsoletes RFC3987, but the abstract doesn't seem to directly say this. It does mention RFC3987 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (March 2, 2012) is 4438 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'LEIRI' is defined on line 1730, but no explicit reference was found in the text == Unused Reference: 'RFC2045' is defined on line 1735, but no explicit reference was found in the text == Unused Reference: 'RFC6082' is defined on line 1783, but no explicit reference was found in the text == Unused Reference: 'XMLNamespace' is defined on line 1806, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646' ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6' -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15' == Outdated reference: A later version (-03) exists of draft-ietf-iri-bidi-guidelines-00 == Outdated reference: A later version (-02) exists of draft-ietf-iri-comparison-00 -- Obsolete informational reference (is this intentional?): RFC 2141 (Obsoleted by RFC 8141) -- Obsolete informational reference (is this intentional?): RFC 2192 (Obsoleted by RFC 5092) -- Obsolete informational reference (is this intentional?): RFC 2368 (Obsoleted by RFC 6068) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) == Outdated reference: A later version (-04) exists of draft-ietf-iri-4395bis-irireg-03 Summary: 1 error (**), 0 flaws (~~), 11 warnings (==), 12 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internationalized Resource Identifiers M. Duerst 3 (iri) Aoyama Gakuin University 4 Internet-Draft M. Suignard 5 Obsoletes: 3987 (if approved) Unicode Consortium 6 Intended status: Standards Track L. Masinter 7 Expires: September 3, 2012 Adobe 8 March 2, 2012 10 Internationalized Resource Identifiers (IRIs) 11 draft-ietf-iri-3987bis-10 13 Abstract 15 This document defines the Internationalized Resource Identifier (IRI) 16 protocol element, as an extension of the Uniform Resource Identifier 17 (URI). An IRI is a sequence of characters from the Universal 18 Character Set (Unicode/ISO 10646). Grammar and processing rules are 19 given for IRIs and related syntactic forms. 21 Defining IRI as a new protocol element (rather than updating or 22 extending the definition of URI) allows independent orderly 23 transitions: other protocols and languages that use URIs must 24 explicitly choose to allow IRIs. 26 Guidelines are provided for the use and deployment of IRIs and 27 related protocol elements when revising protocols, formats, and 28 software components that currently deal only with URIs. 30 This document is part of a set of documents intended to replace RFC 31 3987. 33 RFC Editor: Please remove the next paragraph before publication. 35 This document (and several companion documents) is intended to obsolete RFC 36 3987, and also move towards IETF Draft Standard. For discussion and 37 comments on these drafts, please join the IETF IRI WG by subscribing 38 to the mailing list public-iri@w3.org, archives at 39 http://lists.w3.org/archives/public/public-iri/. For a list of open 40 issues, please see the issue tracker of the WG at 41 http://trac.tools.ietf.org/wg/iri/trac/report/1. 
For a list of 42 individual edits, please see the change history at 43 http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. 45 Status of this Memo 47 This Internet-Draft is submitted in full conformance with the 48 provisions of BCP 78 and BCP 79. 50 Internet-Drafts are working documents of the Internet Engineering 51 Task Force (IETF). Note that other groups may also distribute 52 working documents as Internet-Drafts. The list of current Internet- 53 Drafts is at http://datatracker.ietf.org/drafts/current/. 55 Internet-Drafts are draft documents valid for a maximum of six months 56 and may be updated, replaced, or obsoleted by other documents at any 57 time. It is inappropriate to use Internet-Drafts as reference 58 material or to cite them other than as "work in progress." 60 This Internet-Draft will expire on September 3, 2012. 62 Copyright Notice 64 Copyright (c) 2012 IETF Trust and the persons identified as the 65 document authors. All rights reserved. 67 This document is subject to BCP 78 and the IETF Trust's Legal 68 Provisions Relating to IETF Documents 69 (http://trustee.ietf.org/license-info) in effect on the date of 70 publication of this document. Please review these documents 71 carefully, as they describe your rights and restrictions with respect 72 to this document. Code Components extracted from this document must 73 include Simplified BSD License text as described in Section 4.e of 74 the Trust Legal Provisions and are provided without warranty as 75 described in the Simplified BSD License. 77 This document may contain material from IETF Documents or IETF 78 Contributions published or made publicly available before November 79 10, 2008. The person(s) controlling the copyright in some of this 80 material may not have granted the IETF Trust the right to allow 81 modifications of such material outside the IETF Standards Process. 
82 Without obtaining an adequate license from the person(s) controlling 83 the copyright in such materials, this document may not be modified 84 outside the IETF Standards Process, and derivative works of it may 85 not be created outside the IETF Standards Process, except to format 86 it for publication as an RFC or to translate it into languages other 87 than English. 89 Table of Contents 91 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 92 1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5 93 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 94 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7 95 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8 96 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 97 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9 98 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 99 3. Processing IRIs and related protocol elements . . . . . . . . 13 100 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13 101 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13 102 3.3. General percent-encoding of IRI components . . . . . . . 14 103 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14 104 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14 105 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 15 106 3.4.3. Additional Considerations . . . . . . . . . . . . . . 15 107 3.5. Mapping query components . . . . . . . . . . . . . . . . 16 108 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16 109 4. Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 16 110 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 18 111 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 112 5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19 113 5.2. Software Interfaces and Protocols . . . . . . . . . . . . 20 114 5.3. 
Format of URIs and IRIs in Documents and Protocols . . . 21 115 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 21 116 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 23 117 6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 23 118 6.1. Legacy Extended IRI Syntax . . . . . . . . . . . . . . . 23 119 6.2. Conversion of Legacy Extended IRIs to IRIs . . . . . . . 24 120 6.3. Characters Allowed in Legacy Extended IRIs but not in 121 IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 24 122 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 26 123 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26 124 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26 125 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 27 126 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27 127 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 28 128 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 29 129 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 29 130 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 30 131 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 132 9. Security Considerations . . . . . . . . . . . . . . . . . . . 31 133 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 134 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 33 135 11.1. Split out Bidi, processing guidelines, comparison 136 sections . . . . . . . . . . . . . . . . . . . . . . . . 33 138 11.2. Major restructuring of IRI processing model . . . . . . . 33 139 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 33 140 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 34 141 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 34 142 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 34 143 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 
34 144 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 34 145 11.3.2. Changes from draft-duerst-iri-bis-07 to 146 draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 34 147 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 34 148 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 35 149 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 35 150 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35 151 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35 152 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35 153 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35 154 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36 155 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 36 156 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36 157 12.1. Normative References . . . . . . . . . . . . . . . . . . 36 158 12.2. Informative References . . . . . . . . . . . . . . . . . 37 159 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40 161 1. Introduction 163 1.1. Overview and Motivation 165 A Uniform Resource Identifier (URI) is defined in [RFC3986] as a 166 sequence of characters chosen from a limited subset of the repertoire 167 of US-ASCII [ASCII] characters. 169 The characters in URIs are frequently used for representing words of 170 natural languages. This usage has many advantages: Such URIs are 171 easier to memorize, easier to interpret, easier to transcribe, easier 172 to create, and easier to guess. For most languages other than 173 English, however, the natural script uses characters other than A - 174 Z. For many people, handling Latin characters is as difficult as 175 handling the characters of other scripts is for those who use only 176 the Latin alphabet. Many languages with non-Latin scripts are 177 transcribed with Latin letters. 
These transcriptions are now often 178 used in URIs, but they introduce additional difficulties. 180 The infrastructure for the appropriate handling of characters from 181 additional scripts is now widely deployed in operating system and 182 application software. Software that can handle a wide variety of 183 scripts and languages at the same time is increasingly common. Also, 184 an increasing number of protocols and formats can carry a wide range 185 of characters. 187 URIs are composed out of a very limited repertoire of characters; 188 this design choice was made to support global transcription ([RFC3986], 189 Section 1.2.1). Reliable transition between a URI (as an abstract 190 protocol element composed of a sequence of characters) and a 191 presentation of that URI (written on a napkin, read out loud) and 192 back is relatively straightforward, because of the limited repertoire 193 of characters used. IRIs are designed to satisfy a different set of 194 use requirements; in particular, to allow IRIs to be written in ways 195 that are more meaningful to their users, even at the expense of 196 global transcribability. However, ensuring reliability of the 197 transition between an IRI and its presentation and back is more 198 difficult and complex when dealing with the larger set of Unicode 199 characters. For example, Unicode supports multiple ways of encoding 200 complex combinations of characters and accents, with multiple 201 character sequences that can result in the same presentation. 203 This document defines the protocol element called Internationalized 204 Resource Identifier (IRI), which allows applications of URIs to be 205 extended to use resource identifiers that have a much wider 206 repertoire of characters. It also provides corresponding 207 "internationalized" versions of other constructs from [RFC3986], such 208 as URI references. The syntax of IRIs is defined in Section 2. 
210 Within this document, Section 5 discusses the use of IRIs in 211 different situations. Section 7 gives additional informative 212 guidelines. Section 9 discusses IRI-specific security 213 considerations. 215 This specification is part of a collection of specifications intended 216 to replace [RFC3987]. [Bidi] discusses the special case of 217 bidirectional IRIs using characters from scripts written right-to- 218 left. [Equivalence] gives guidelines for applications wishing to 219 determine if two IRIs are equivalent, as well as defining some 220 equivalence methods. [RFC4395bis] updates the URI scheme 221 registration guidelines and procedures to note that every URI scheme 222 is also automatically an IRI scheme and to allow scheme definitions 223 to be directly described in terms of Unicode characters. 225 1.2. Applicability 227 IRIs are designed to allow protocols and software that deal with URIs 228 to be updated to handle IRIs. Processing of IRIs is accomplished by 229 extending the URI syntax while retaining (and not expanding) the set 230 of "reserved" characters, such that the syntax for any URI scheme may 231 be extended to allow non-ASCII characters. In addition, following 232 parsing of an IRI, it is possible to construct a corresponding URI by 233 first encoding characters outside of the allowed URI range and then 234 reassembling the components. 236 Practical use of IRI forms in place of URI forms depends on the 237 following conditions being met: 239 a. A protocol or format element MUST be explicitly designated to be 240 able to carry IRIs. The intent is to avoid introducing IRIs into 241 contexts that are not defined to accept them. For example, XML 242 schema [XMLSchema] has an explicit type "anyURI" that includes 243 IRIs and IRI references. Therefore, IRIs and IRI references can 244 be in attributes and elements of type "anyURI". 
On the other 245 hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is 246 defined as a URI, which means that direct use of IRIs is not 247 allowed in HTTP requests. 249 b. The protocol or format carrying the IRIs MUST have a mechanism to 250 represent the wide range of characters used in IRIs, either 251 natively or by some protocol- or format-specific escaping 252 mechanism (for example, numeric character references in [XML1]). 254 c. The URI scheme definition, if it explicitly allows a percent sign 255 ("%") in any syntactic component, SHOULD define the interpretation 256 of sequences of percent-encoded octets (using "%XX" hex octets) as 257 octets from sequences of UTF-8 encoded strings; this is recommended 258 in the guidelines for registering new schemes, [RFC4395bis]. For 259 example, this is the practice for IMAP URLs [RFC2192], POP URLs 260 [RFC2384] and the URN syntax [RFC2141]. Note that use of 261 percent-encoding may also be restricted in some situations; for 262 example, URI schemes that disallow percent-encoding might still be 263 used with a fragment identifier which is percent-encoded (e.g., 264 [XPointer]). See Section 5.4 for further discussion. 266 1.3. Definitions 268 The following definitions are used in this document; they follow the 269 terms in [RFC2130], [RFC2277], and [ISO10646]. 271 character: A member of a set of elements used for the organization, 272 control, or representation of data. For example, "LATIN CAPITAL 273 LETTER A" names a character. 275 octet: An ordered sequence of eight bits considered as a unit. 277 character repertoire: A set of characters (set in the mathematical 278 sense). 280 sequence of characters: A sequence of characters (one after 281 another). 283 sequence of octets: A sequence of octets (one after another). 285 character encoding: A method of representing a sequence of 286 characters as a sequence of octets (maybe with variants). 
Also, a 287 method of (unambiguously) converting a sequence of octets into a 288 sequence of characters. 290 charset: The name of a parameter or attribute used to identify a 291 character encoding. 293 UCS: Universal Character Set. The coded character set defined by 294 ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6]. 296 IRI reference: Denotes the common usage of an Internationalized 297 Resource Identifier. An IRI reference may be absolute or 298 relative. However, the "IRI" that results from such a reference 299 only includes absolute IRIs; any relative IRI references are 300 resolved to their absolute form. Note that in [RFC2396] URIs did 301 not include fragment identifiers, but in [RFC3986] fragment 302 identifiers are part of URIs. 304 LEIRI (Legacy Extended IRI) processing: This term was used in 305 various XML specifications to refer to strings that, although not 306 valid IRIs, were acceptable input to the processing rules in 307 Section 6.2. 309 running text: Human text (paragraphs, sentences, phrases) with 310 syntax according to orthographic conventions of a natural 311 language, as opposed to syntax defined for ease of processing by 312 machines (e.g., markup, programming languages). 314 protocol element: Any portion of a message that affects processing 315 of that message by the protocol in question. 317 create (a URI or IRI): With respect to URIs and IRIs, the term is 318 used for the initial creation. This may be the initial creation 319 of a resource with a certain identifier, or the initial exposition 320 of a resource under a particular identifier. 322 generate (a URI or IRI): With respect to URIs and IRIs, the term is 323 used when the identifier is generated by derivation from other 324 information. 
326 parsed URI component: When a URI processor parses a URI (following 327 the generic syntax or a scheme-specific syntax), the result is a 328 set of parsed URI components, each of which has a type 329 (corresponding to the syntactic definition) and a sequence of URI 330 characters. 332 parsed IRI component: When an IRI processor parses an IRI directly, 333 following the general syntax or a scheme-specific syntax, the 334 result is a set of parsed IRI components, each of which has a type 335 (corresponding to the syntactic definition) and a sequence of IRI 336 characters. (This definition is analogous to "parsed URI 337 component".) 339 IRI scheme: A URI scheme may also be known as an "IRI scheme" if the 340 scheme's syntax has been extended to allow non-US-ASCII characters 341 according to the rules in this document. 343 1.4. Notation 345 RFCs and Internet Drafts currently do not allow any characters 346 outside the US-ASCII repertoire. Therefore, this document uses 347 various special notations to denote such characters in examples. 349 In text, characters outside US-ASCII are sometimes referenced by 350 using a prefix of 'U+', followed by four to six hexadecimal digits. 352 To represent characters outside US-ASCII in examples, this document 353 uses 'XML Notation'. 355 XML Notation uses a leading '&#x', a trailing ';', and the 356 hexadecimal number of the character in the UCS in between. For 357 example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this 358 notation, an actual '&' is denoted by '&amp;'. 360 To denote actual octets in examples (as opposed to percent-encoded 361 octets), the two hex digits denoting the octet are enclosed in "<" 362 and ">". For example, the octet often denoted as 0xc9 is denoted 363 here as <c9>. 365 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 366 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 367 and "OPTIONAL" are to be interpreted as described in [RFC2119]. 369 2. 
IRI Syntax 371 This section defines the syntax of Internationalized Resource 372 Identifiers (IRIs). 374 As with URIs, an IRI is defined as a sequence of characters, not as a 375 sequence of octets. This definition accommodates the fact that IRIs 376 may be written on paper or read over the radio as well as stored or 377 transmitted digitally. The same IRI might be represented as 378 different sequences of octets in different protocols or documents if 379 these protocols or documents use different character encodings 380 (and/or transfer encodings). Using the same character encoding as 381 the containing protocol or document ensures that the characters in 382 the IRI can be handled (e.g., searched, converted, displayed) in the 383 same way as the rest of the protocol or document. 385 2.1. Summary of IRI Syntax 387 The IRI syntax extends the URI syntax in [RFC3986] by extending the 388 class of unreserved characters, primarily by adding the characters of 389 the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject 390 to the limitations given in the syntax rules below and in 391 Section 5.1. 393 The syntax and use of components and reserved characters is the same 394 as that in [RFC3986]. Each "URI scheme" thus also functions as an 395 "IRI scheme", in that scheme-specific parsing rules for URIs of a 396 scheme are extended to allow parsing of IRIs using the same 397 parsing rules. 399 All the operations defined in [RFC3986], such as the resolution of 400 relative references, can be applied to IRIs by IRI-processing 401 software in exactly the same way as they are for URIs by URI- 402 processing software. 404 Characters outside the US-ASCII repertoire MUST NOT be reserved and 405 therefore MUST NOT be used for syntactical purposes, such as to 406 delimit components in newly defined schemes. For example, U+00A2, 407 CENT SIGN, is not allowed as a delimiter in IRIs, because it is in 408 the 'iunreserved' category. 
This is similar to the fact that it is 409 not possible to use '-' as a delimiter in URIs, because it is in the 410 'unreserved' category. 412 2.2. ABNF for IRI References and IRIs 414 An ABNF definition for IRI references (which are the most general 415 concept and the start of the grammar) and IRIs is given here. The 416 syntax of this ABNF is described in [STD68]. Character numbers are 417 taken from the UCS, without implying any actual binary encoding. 418 Terminals in the ABNF are characters, not octets. 420 The following grammar closely follows the URI grammar in [RFC3986], 421 except that the range of unreserved characters is expanded to include 422 UCS characters, with the restriction that private UCS characters can 423 occur only in query parts. The grammar is split into two parts: 424 rules that differ from [RFC3986] because of the above-mentioned 425 expansion, and rules that are the same as those in [RFC3986]. For 426 rules that differ from those in [RFC3986], the names of the 427 non-terminals have been changed as follows. If the non-terminal 428 contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' 429 has been prefixed. The rule <pct-form> has been introduced in order 430 to be able to reference it from other parts of the document. 432 The following rules are different from those in [RFC3986]: 434 IRI = scheme ":" ihier-part [ "?" iquery ] 435 [ "#" ifragment ] 437 ihier-part = "//" iauthority ipath-abempty 438 / ipath-absolute 439 / ipath-rootless 440 / ipath-empty 442 IRI-reference = IRI / irelative-ref 444 absolute-IRI = scheme ":" ihier-part [ "?" iquery ] 446 irelative-ref = irelative-part [ "?" 
iquery ] [ "#" ifragment ] 447 irelative-part = "//" iauthority ipath-abempty 448 / ipath-absolute 449 / ipath-noscheme 450 / ipath-empty 452 iauthority = [ iuserinfo "@" ] ihost [ ":" port ] 453 iuserinfo = *( iunreserved / pct-form / sub-delims / ":" ) 454 ihost = IP-literal / IPv4address / ireg-name 456 pct-form = pct-encoded 458 ireg-name = *( iunreserved / sub-delims ) 460 ipath = ipath-abempty ; begins with "/" or is empty 461 / ipath-absolute ; begins with "/" but not "//" 462 / ipath-noscheme ; begins with a non-colon segment 463 / ipath-rootless ; begins with a segment 464 / ipath-empty ; zero characters 466 ipath-abempty = *( path-sep isegment ) 467 ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ] 468 ipath-noscheme = isegment-nz-nc *( path-sep isegment ) 469 ipath-rootless = isegment-nz *( path-sep isegment ) 470 ipath-empty = "" 471 path-sep = "/" 473 isegment = *ipchar 474 isegment-nz = 1*ipchar 475 isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims 476 / "@" ) 477 ; non-zero-length segment without any colon ":" 479 ipchar = iunreserved / pct-form / sub-delims / ":" 480 / "@" 482 iquery = *( ipchar / iprivate / "/" / "?" ) 484 ifragment = *( ipchar / "/" / "?" ) 486 iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar 488 ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF 489 / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD 490 / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD 491 / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD 492 / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD 493 / %xD0000-DFFFD / %xE1000-EFFFD 495 iprivate = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD 496 / %x100000-10FFFD 498 Some productions are ambiguous. The "first-match-wins" (a.k.a. 499 "greedy") algorithm applies. For details, see [RFC3986]. 501 The following rules are the same as those in [RFC3986]: 503 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." 
) 505 port = *DIGIT 507 IP-literal = "[" ( IPv6address / IPvFuture ) "]" 509 IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) 511 IPv6address = 6( h16 ":" ) ls32 512 / "::" 5( h16 ":" ) ls32 513 / [ h16 ] "::" 4( h16 ":" ) ls32 514 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 515 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 516 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 517 / [ *4( h16 ":" ) h16 ] "::" ls32 518 / [ *5( h16 ":" ) h16 ] "::" h16 519 / [ *6( h16 ":" ) h16 ] "::" 521 h16 = 1*4HEXDIG 522 ls32 = ( h16 ":" h16 ) / IPv4address 524 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 526 dec-octet = DIGIT ; 0-9 527 / %x31-39 DIGIT ; 10-99 528 / "1" 2DIGIT ; 100-199 529 / "2" %x30-34 DIGIT ; 200-249 530 / "25" %x30-35 ; 250-255 532 pct-encoded = "%" HEXDIG HEXDIG 534 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" 535 reserved = gen-delims / sub-delims 536 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" 537 sub-delims = "!" / "$" / "&" / "'" / "(" / ")" 538 / "*" / "+" / "," / ";" / "=" 540 This syntax does not support IPv6 scoped addressing zone identifiers. 542 3. Processing IRIs and related protocol elements 544 IRIs are meant to replace URIs in identifying resources within new 545 versions of protocols, formats, and software components that use a 546 UCS-based character repertoire. Protocols and components may use and 547 process IRIs directly. However, there are still numerous systems and 548 protocols which only accept URIs or components of parsed URIs; that 549 is, they only accept sequences of characters within the subset of US- 550 ASCII characters allowed in URIs. 552 This section defines specific processing steps for IRI consumers 553 which establish the relationship between the string given and the 554 interpreted derivatives. These processing steps apply to both IRIs 555 and IRI references (i.e., absolute or relative forms); for IRIs, some 556 steps are scheme specific. 558 3.1. 
Converting to UCS 560 Input that is already in a Unicode form (i.e., a sequence of Unicode 561 characters or an octet-stream representing a Unicode-based character 562 encoding such as UTF-8 or UTF-16) should be left as is and not 563 normalized or changed. 565 An IRI or IRI reference is a sequence of characters from the UCS. 566 For input from presentations (written on paper, read aloud) or 567 translation from other representations (a text stream using a legacy 568 character encoding), convert the input to Unicode. Note that some 569 character encodings or transcriptions can be converted to or 570 represented by more than one sequence of Unicode characters. Ideally 571 the resulting IRI would use a normalized form, such as Unicode 572 Normalization Form C [UTR15], since that ensures a stable, consistent 573 representation that is most likely to produce the intended results. 574 Previous versions of this specification required normalization at 575 this step. However, attempts to require normalization in other 576 protocols have met with strong enough resistance that requiring 577 normalization here was considered impractical. Implementers and 578 users are cautioned that, while denormalized character sequences are 579 valid, they might be difficult for other users or processes to 580 reproduce and might lead to unexpected results. 582 3.2. Parse the IRI into IRI components 584 Parse the IRI, either as a relative reference (no scheme) or using 585 scheme specific processing (according to the scheme given); the 586 result is a set of parsed IRI components. 588 3.3. General percent-encoding of IRI components 590 Except as noted in the following subsections, IRI components are 591 mapped to the equivalent URI components by percent-encoding those 592 characters not allowed in URIs. 
   Previous processing steps will have
   removed some characters, and the interpretation of reserved
   characters will have already been done (with the syntactic reserved
   characters outside of the IRI component). This mapping is defined
   for all sequences of Unicode characters, whether or not they are
   valid for the component in question.

   For each character which is not allowed anywhere in a valid URI,
   apply the following steps.

   Convert to UTF-8:  Convert the character to a sequence of one or more
      octets using UTF-8 [RFC3629].

   Percent encode:  Convert each octet of this sequence to %HH, where HH
      is the hexadecimal notation of the octet value. The hexadecimal
      notation SHOULD use uppercase letters. (This is the general URI
      percent-encoding mechanism in Section 2.1 of [RFC3986].)

   Note that the mapping is an identity transformation for parsed URI
   components of valid URIs, and is idempotent: applying the mapping a
   second time will not change anything.

3.4.  Mapping ireg-name

   The mapping from <ireg-name> to a <reg-name> requires a choice
   between one of the two methods described below.

3.4.1.  Mapping using Percent-Encoding

   The ireg-name component SHOULD be converted according to the general
   procedure for percent-encoding of IRI components described in
   Section 3.3.

   For example, the IRI
   "http://résumé.example.org"
   will be converted to
   "http://r%C3%A9sum%C3%A9.example.org".

   This conversion for ireg-name is in line with Section 3.2.2 of
   [RFC3986], which does not mandate a particular registered name lookup
   technology. For further background, see [RFC6055] and [Gettys].

3.4.2.  Mapping using Punycode

   In situations where it is certain that <ireg-name> is intended to be
   used as a domain name to be processed by Domain Name Lookup (as per
   [RFC5891]), an alternative method MAY be used, converting <ireg-name>
   as follows:

   If there are any sequences of <pct-encoded>, and their corresponding
   octets all represent valid UTF-8 octet sequences, then convert these
   back to Unicode character sequences. (If any <pct-encoded> sequences
   are not valid UTF-8 octet sequences, then leave the entire field as
   is without any change, since punycode encoding would not succeed.)

   Replace the ireg-name part of the IRI by the part converted using the
   Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
   on each dot-separated label, using U+002E (FULL STOP) as the
   label separator. This procedure may fail, which means that
   the IRI cannot be resolved. If the domain name
   conversion fails, then the entire IRI conversion fails. Processors
   that have no mechanism for signalling a failure MAY instead
   substitute an otherwise invalid host name, although such processing
   SHOULD be avoided.

   For example, the IRI
   "http://résumé.example.org"
   MAY be converted to
   "http://xn--rsum-bpad.example.org".

   This conversion for ireg-name will be better able to deal with legacy
   infrastructure that cannot handle percent-encoding in domain names.

3.4.3.  Additional Considerations

      Note: Domain Names may appear in parts of an IRI other than the
      ireg-name part. It is the responsibility of scheme-specific
      implementations (if the Internationalized Domain Name is part of
      the scheme syntax) or of server-side implementations (if the
      Internationalized Domain Name is part of 'iquery') to apply the
      necessary conversions at the appropriate point.
      Example: Trying
      to validate the Web page at
      http://résumé.example.org would lead to an IRI of
      http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org,
      which would convert to a URI of
      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org.
      The server-side implementation is responsible for
      making the necessary conversions to be able to retrieve the Web
      page.

      Note: In this process, characters allowed in URI references and
      existing percent-encoded sequences are not encoded further. (This
      mapping is similar to, but different from, the encoding applied
      when arbitrary content is included in some part of a URI.) For
      example, an IRI of
      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
      converted to
      "http://www.example.org/red%09ros%C3%A9#red", not to something
      like
      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

3.5.  Mapping query components

   For compatibility with existing deployed HTTP infrastructure, the
   following special case applies for schemes "http" and "https" and
   IRIs whose origin has a document charset other than one which is
   UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
   component of an IRI is mapped into a URI by using the document
   charset rather than UTF-8 as the binary representation before
   percent-encoding. This mapping is not applied for any other scheme
   or component.

3.6.  Mapping IRIs to URIs

   The mapping from an IRI to a URI is accomplished by applying the
   mapping above (from IRI to URI components) and then reassembling a
   URI from the parsed URI components using the original punctuation
   that delimited the IRI components.

4.  Converting URIs to IRIs

   In some situations, for presentation and further processing, it is
   desirable to convert a URI into an equivalent IRI without unnecessary
   percent encoding.
   Of course, every URI is already an IRI in its own
   right without any conversion. This section gives one possible
   procedure for URI to IRI mapping.

   The conversion described in this section, if given a valid URI, will
   result in an IRI that maps back to the URI used as an input for the
   conversion (except for potential case differences in percent-encoding
   and for potential percent-encoded unreserved characters). However,
   the IRI resulting from this conversion may differ from the original
   IRI (if there ever was one).

   URI-to-IRI conversion removes percent-encodings, but not all
   percent-encodings can be eliminated. There are several reasons for
   this:

   1.  Some percent-encodings are necessary to distinguish
       percent-encoded and unencoded uses of reserved characters.

   2.  Some percent-encodings cannot be interpreted as sequences of
       UTF-8 octets.

       (Note: The octet patterns of UTF-8 are highly regular.
       Therefore, there is a very high probability, but no guarantee,
       that percent-encodings that can be interpreted as sequences of
       UTF-8 octets actually originated from UTF-8. For a detailed
       discussion, see [Duerst97].)

   3.  The conversion may result in a character that is not appropriate
       in an IRI. See Section 2.2 and Section 5.1 for further details.

   4.  IRI to URI conversion has different rules for dealing with domain
       names and query parameters.

   Conversion from a URI to an IRI MAY be done by using the following
   steps:

   1.  Represent the URI as a sequence of octets in US-ASCII.

   2.  Convert all percent-encodings ("%" followed by two hexadecimal
       digits) to the corresponding octets, except those corresponding
       to "%", characters in "reserved", and characters in US-ASCII not
       allowed in URIs.

   3.  Re-percent-encode any octet produced in step 2 that is not part
       of a strictly legal UTF-8 octet sequence.

   4.  Re-percent-encode all octets produced in step 3 that in UTF-8
       represent characters that are not appropriate according to
       Section 2.2 and Section 5.1.

   5.  Interpret the resulting octet sequence as a sequence of
       characters encoded in UTF-8.

   6.  URIs known to contain domain names in the reg-name component
       SHOULD convert punycode-encoded domain name labels to the
       corresponding characters using the ToUnicode procedure.

   This procedure will convert as many percent-encoded characters as
   possible to characters in an IRI. Because there are some choices
   when step 4 is applied (see Section 5.1), results may vary.

   Conversions from URIs to IRIs MUST NOT use any character encoding
   other than UTF-8 in steps 3 and 4, even if it might be possible to
   guess from the context that a character encoding other than UTF-8
   was used in the URI. For example, the URI
   "http://www.example.org/r%E9sum%E9.html" might with some guessing be
   interpreted to contain two e-acute characters encoded as iso-8859-1.
   It must not be converted to an IRI containing these e-acute
   characters. Otherwise, in the future the IRI will be mapped to
   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
   URI from "http://www.example.org/r%E9sum%E9.html".

4.1.  Examples

   This section shows various examples of converting URIs to IRIs. Each
   example shows the result after each of the steps 1 through 6 is
   applied. XML Notation is used for the final result. Octets are
   denoted by "<" followed by two hexadecimal digits followed by ">".

   The following example contains the sequence "%C3%BC", which is a
   strictly legal UTF-8 sequence, and which is converted into the actual
   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
   u-umlaut).

   1.  http://www.example.org/D%C3%BCrst

   2.  http://www.example.org/D<C3><BC>rst

   3.  http://www.example.org/D<C3><BC>rst

   4.  http://www.example.org/D<C3><BC>rst

   5.  http://www.example.org/D&#xFC;rst

   6.  http://www.example.org/D&#xFC;rst

   The following example contains the sequence "%FC", which might
   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
   iso-8859-1 character encoding. (It might represent other characters
   in other character encodings. For example, the octet <FC> in
   iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because
   <FC> is not part of a strictly legal UTF-8 sequence, it is
   re-percent-encoded in step 3.

   1.  http://www.example.org/D%FCrst

   2.  http://www.example.org/D<FC>rst

   3.  http://www.example.org/D%FCrst

   4.  http://www.example.org/D%FCrst

   5.  http://www.example.org/D%FCrst

   6.  http://www.example.org/D%FCrst

   The following example contains "%e2%80%ae", which is the
   percent-encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT
   OVERRIDE. The direct use of this character is forbidden in an IRI.
   Therefore, the corresponding octets are re-percent-encoded in step 4.
   This example shows that the case (upper- or lowercase) of letters
   used in percent-encodings may not be preserved. The example also
   contains a punycode-encoded domain name label (xn--99zt52a), which is
   not converted.

   1.  http://xn--99zt52a.example.org/%e2%80%ae

   2.  http://xn--99zt52a.example.org/<E2><80><AE>

   3.  http://xn--99zt52a.example.org/<E2><80><AE>

   4.  http://xn--99zt52a.example.org/%E2%80%AE

   5.  http://xn--99zt52a.example.org/%E2%80%AE

   6.  http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE

   Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
   (Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this
   note.))

5.  Use of IRIs

5.1.  Limitations on UCS Characters Allowed in IRIs

   This section discusses limitations on characters and character
   sequences usable for IRIs beyond those given in Section 2.2. The
   considerations in this section are relevant when IRIs are created and
   when URIs are converted to IRIs.

   a.  The repertoire of characters allowed in each IRI component is
       limited by the definition of that component. For example, the
       definition of the scheme component does not allow characters
       beyond US-ASCII.

       (Note: In accordance with URI practice, generic IRI software
       cannot and should not check for such limitations.)

   b.  The UCS contains many areas of characters for which there are
       strong visual look-alikes. Because of the likelihood of
       transcription errors, these also should be avoided. This
       includes the full-width equivalents of Latin characters,
       half-width Katakana characters for Japanese, and many others.
       It also includes many look-alikes of "space", "delims", and
       "unwise", characters excluded in [RFC3491].

   c.  At the start of a component, the use of combining marks is
       strongly discouraged. As an example, a COMBINING TILDE OVERLAY
       (U+0334) would be very confusing at the start of an <isegment>.
       Combined with the preceding '/', it might look like a solidus
       with combining tilde overlay, but IRI processing software will
       parse and process the '/' separately.

   d.  The ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D)
       are invisible in most contexts, but are crucial in some very
       limited contexts. Appendix A of [RFC5892] contains contextual
       restrictions for these and some other characters. The use of
       these characters is strongly discouraged except in the relevant
       contexts.

   Additional information is available from [UNIXML]. [UNIXML] is
   written in the context of running text rather than in that of
   identifiers. Nevertheless, it discusses many of the categories of
   characters not appropriate for IRIs.

5.2.  Software Interfaces and Protocols

   Although an IRI is defined as a sequence of characters, software
   interfaces for URIs typically function on sequences of octets or
   other kinds of code units.
   Thus, software interfaces and protocols
   MUST define which character encoding is used.

   Intermediate software interfaces between IRI-capable components and
   URI-only components MUST map the IRIs per Section 3.6, when
   transferring from IRI-capable to URI-only components. This mapping
   SHOULD be applied as late as possible. It SHOULD NOT be applied
   between components that are known to be able to handle IRIs.

5.3.  Format of URIs and IRIs in Documents and Protocols

   Document formats that transport URIs may have to be upgraded to allow
   the transport of IRIs. In cases where the document as a whole has a
   native character encoding, IRIs MUST also be encoded in this
   character encoding and converted accordingly by a parser or
   interpreter. IRI characters not expressible in the native character
   encoding SHOULD be escaped by using the escaping conventions of the
   document format if such conventions are available. Alternatively,
   they MAY be percent-encoded according to Section 3.6. For example,
   in HTML or XML, numeric character references SHOULD be used. If a
   document as a whole has a native character encoding and that
   character encoding is not UTF-8, then IRIs MUST NOT be placed into
   the document in the UTF-8 character encoding.

   ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
   although they use different terminology. HTML 4.0 [HTML4] defines
   the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0
   [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
   based upon them allow IRIs. Also, it is expected that all relevant
   new W3C formats and protocols will be required to handle IRIs
   [CharMod].

5.4.  Use of UTF-8 for Encoding Original Characters

   This section discusses details and gives examples for point c) in
   Section 1.2.
   To be able to use IRIs, the URI corresponding to the
   IRI in question has to encode original characters into octets by
   using UTF-8. This can be specified for all URIs of a URI scheme or
   can apply to individual URIs for schemes that do not specify how to
   encode original characters. It can apply to the whole URI, or only
   to some part. For background information on encoding characters into
   URIs, see also Section 2.5 of [RFC3986].

   For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
   Examples where UTF-8 is already used are the URN syntax [RFC2141],
   IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand,
   because the HTTP URI scheme does not specify how to encode original
   characters, only some HTTP URLs can have corresponding but different
   IRIs.

   For example, for a document with a URI of
   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
   construct a corresponding IRI (in XML notation, see Section 1.4):
   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
   the e-acute character, and "%C3%A9" is the UTF-8 encoded and
   percent-encoded representation of that character). On the other
   hand, for a document with a URI of
   "http://www.example.org/r%E9sum%E9.html", the percent-encoded octets
   cannot be converted to actual characters in an IRI, as the
   percent-encoding is not based on UTF-8.

   For most URI schemes, there is no need to upgrade their scheme
   definition in order for them to work with IRIs. The main case where
   upgrading makes sense is when a scheme definition, or a particular
   component of a scheme, is strictly limited to the use of US-ASCII
   characters with no provision to include non-ASCII characters/octets
   via percent-encoding, or if a scheme definition currently uses highly
   scheme-specific provisions for the encoding of non-ASCII characters.
   An example of this is the mailto: scheme [RFC2368].
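
   The two resume URIs discussed earlier in this section can be checked
   with a short sketch. It is illustrative only and is not part of this
   specification: the function names iri_to_uri and uri_to_iri and the
   set _KEEP_ENCODED are hypothetical, the code works on whole strings
   rather than on parsed components, it percent-encodes only non-ASCII
   characters, and it leaves every percent-encoded US-ASCII octet
   encoded (a conservative simplification of step 2 in Section 4).

```python
import re

# Runs of one or more percent-encoded octets, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

# Illustrative, incomplete stand-in for the characters that Section 4,
# step 4 keeps percent-encoded (here: only bidi formatting characters).
_KEEP_ENCODED = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character via its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


def _decode_run(match: re.Match) -> str:
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        # Try to read one multi-octet UTF-8 character (4, 3, or 2 octets).
        for n in (4, 3, 2):
            chunk = octets[i:i + n]
            if len(chunk) < n:
                continue
            try:
                ch = chunk.decode("utf-8")  # strict: rejects illegal sequences
            except UnicodeDecodeError:
                continue
            if len(ch) == 1 and ch not in _KEEP_ENCODED:
                out.append(ch)
                i += n
                break
        else:
            # US-ASCII, not legal UTF-8, or not appropriate: keep it encoded.
            out.append(f"%{octets[i]:02X}")
            i += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded runs that are legal non-ASCII UTF-8."""
    return _PCT_RUN.sub(_decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
# http://www.example.org/résumé.html
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
# http://www.example.org/r%E9sum%E9.html  (unchanged: %E9 is not UTF-8)
```

   The second call returns its input unchanged because the octet E9 is
   not part of a legal UTF-8 sequence, so no guess about iso-8859-1 is
   made.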

   This specification updates the IANA registry of URI schemes to note
   their applicability to IRIs; see Section 8. All IRIs use URI
   schemes, and all URIs with URI schemes can be used as IRIs, even
   though in some cases only by using URIs directly as IRIs, without any
   conversion.

   Scheme definitions can impose restrictions on the syntax of
   scheme-specific URIs; i.e., URIs that are admissible under the
   generic URI syntax [RFC3986] may not be admissible due to narrower
   syntactic constraints imposed by a URI scheme specification. URI
   scheme definitions cannot broaden the syntactic restrictions of the
   generic URI syntax; otherwise, it would be possible to generate URIs
   that satisfied the scheme-specific syntactic constraints without
   satisfying the syntactic constraints of the generic URI syntax.
   However, additional syntactic constraints imposed by URI scheme
   specifications are applicable to IRIs, as the corresponding URI
   resulting from the mapping defined in Section 3.6 MUST be a valid URI
   under the syntactic restrictions of generic URI syntax and any
   narrower restrictions imposed by the corresponding URI scheme
   specification.

   The requirement for the use of UTF-8 generally applies to all parts
   of a URI. However, it is possible that the capability of IRIs to
   represent a wide range of characters directly is used just in some
   parts of the IRI (or IRI reference). The other parts of the IRI may
   only contain US-ASCII characters, or they may not be based on UTF-8.
   They may be based on another character encoding, or they may directly
   encode raw binary data (see also [RFC2397]).

   For example, it is possible to have a URI reference of
   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
   document name is encoded in iso-8859-1 based on server settings, but
   where the fragment identifier is encoded in UTF-8 according to
   [XPointer].
   The IRI corresponding to the above URI would be (in XML
   notation)
   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

   Similar considerations apply to query parts. The functionality of
   IRIs (namely, to be able to include non-ASCII characters) can only be
   used if the query part is encoded in UTF-8.

5.5.  Relative IRI References

   Processing of relative IRI references against a base is handled
   straightforwardly; the algorithms of [RFC3986] can be applied
   directly, treating the characters additionally allowed in IRI
   references in the same way that unreserved characters are in URI
   references.

6.  Legacy Extended IRIs (LEIRIs)

   In some cases, there have been formats which have used a protocol
   element which is a variant of the IRI definition; these variants have
   usually been somewhat less restricted in syntax. This section
   provides a definition and a name (Legacy Extended IRI or LEIRI) for
   one of these variants used widely in XML-based protocols. This
   variant has to be used with care; it requires further processing
   before being fully interchangeable as IRIs. New protocols and
   formats SHOULD NOT use Legacy Extended IRIs. Even where Legacy
   Extended IRIs are allowed, only IRIs fully conforming to the syntax
   definition in Section 2.2 SHOULD be created, generated, and used.
   The provisions in this section also apply to Legacy Extended IRI
   references.

6.1.  Legacy Extended IRI Syntax

   This section defines Legacy Extended IRIs (LEIRIs). The syntax of
   Legacy Extended IRIs is the same as that for IRIs, except
   that the ucschar production is replaced by the leiri-ucschar
   production:

   leiri-ucschar  = " " / "<" / ">" / '"' / "{" / "}" / "|"
                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                  / %xE000-FFFD / %x10000-10FFFF

   The restriction on bidirectional formatting characters in [Bidi] is
   lifted. The iprivate production becomes redundant.
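
   The leiri-ucschar production above can be read as ten single
   characters plus four code point ranges. The following sketch tests
   membership in that set. It is illustrative only: the names
   LEIRI_UCSCHAR_SINGLES, LEIRI_UCSCHAR_RANGES, and is_leiri_ucschar
   are hypothetical and do not appear in this specification.

```python
# The ten single characters listed in leiri-ucschar.
LEIRI_UCSCHAR_SINGLES = frozenset(' <>"{}|\\^`')

# The four ranges: %x0-1F, %x7F-D7FF, %xE000-FFFD, %x10000-10FFFF.
LEIRI_UCSCHAR_RANGES = (
    (0x0000, 0x001F),
    (0x007F, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_leiri_ucschar(ch: str) -> bool:
    """Return True if the single character ch matches leiri-ucschar."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(ch)
    return ch in LEIRI_UCSCHAR_SINGLES or any(
        low <= code_point <= high for low, high in LEIRI_UCSCHAR_RANGES
    )


print(is_leiri_ucschar(" "))       # True: space is listed explicitly
print(is_leiri_ucschar("a"))       # False: plain US-ASCII letter
print(is_leiri_ucschar("\ufffe"))  # False: U+FFFE lies outside every range
```

   Surrogate code units (D800-DFFF) and U+FFFE-FFFF fall between the
   ranges, so the test rejects them.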

   Likewise, the syntax for Legacy Extended IRI references (LEIRI
   references) is the same as that for IRI references with the above
   replacement of ucschar with leiri-ucschar.

6.2.  Conversion of Legacy Extended IRIs to IRIs

   To convert a Legacy Extended IRI (reference) to an IRI (reference),
   each character allowed in a Legacy Extended IRI (reference) but not
   allowed in an IRI (reference) (see Section 6.3) MUST be
   percent-encoded by applying the steps in Section 3.3.

6.3.  Characters Allowed in Legacy Extended IRIs but not in IRIs

   This section provides a list of the groups of characters and code
   points that are allowed in Legacy Extended IRIs, but are not allowed
   in IRIs or are allowed in IRIs only in the query part. For each
   group of characters, advice on the usage of these characters is also
   given, concentrating on the reasons not to use them.

   Space (U+0020):  Some formats and applications use space as a
      delimiter, e.g., for items in a list. Appendix C of [RFC3986]
      also mentions that white space may have to be added when
      displaying or printing long URIs; the same applies to long IRIs.
      Spaces might disappear, or a single Legacy Extended IRI might
      incorrectly be interpreted as two or more separate ones.

   Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022):  Appendix
      C of [RFC3986] suggests the use of double-quotes
      ("http://example.com/") and angle brackets (<http://example.com/>)
      as delimiters for URIs in plain text. These conventions are often
      used, and also apply to IRIs. Legacy Extended IRIs using these
      characters might be cut off at the wrong place.

   Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
   (U+007B), "|" (U+007C), and "}" (U+007D):  These characters
      originally were excluded from URIs because the respective
      codepoints are assigned to different graphic characters in some
      7-bit or 8-bit encoding.
      Despite the move to Unicode, some of
      these characters are still occasionally displayed differently on
      some systems, e.g., U+005C as a Japanese Yen symbol. Also, the
      fact that these characters are not used in URIs or IRIs has
      encouraged their use outside URIs or IRIs in contexts that may
      include URIs or IRIs. In case a Legacy Extended IRI with such a
      character is used in such a context, the Legacy Extended IRI will
      be interpreted piecemeal.

   The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F, #x7F -
   #x9F):  There is no way to transmit these characters reliably
      except potentially in electronic form. Even when in electronic
      form, some software components might silently filter out some of
      these characters, or may stop processing altogether when
      encountering some of them. These characters may affect text
      display in subtle, unnoticeable ways or in drastic, global, and
      irreversible ways depending on the hardware and software involved.
      The use of some of these characters may allow malicious users to
      manipulate the display of a Legacy Extended IRI and its context.

   Bidi formatting characters (U+200E, U+200F, U+202A-202E):  These
      characters affect the display ordering of characters. Displayed
      Legacy Extended IRIs containing these characters cannot be
      converted back to electronic form (logical order) unambiguously.
      These characters may allow malicious users to manipulate the
      display of a Legacy Extended IRI and its context.

   Specials (U+FFF0-FFFD):  These code points provide functionality
      beyond that useful in a Legacy Extended IRI, for example byte
      order identification, annotation, and replacements for unknown
      characters and objects. Their use and interpretation in a Legacy
      Extended IRI serves no purpose and may lead to confusing display
      variations.

   Private use code points (U+E000-F8FF, U+F0000-FFFFD,
   U+100000-10FFFD):  Display and interpretation of these code points
      is by definition undefined without private agreement. Therefore,
      these code points are not suited for use on the Internet. They
      are not interoperable and may have unpredictable effects.

   Tags (U+E0000-E0FFF):  These characters provide a way to language
      tag in Unicode plain text. They are not appropriate for Legacy
      Extended IRIs because language information in identifiers cannot
      reliably be input, transmitted (e.g., on a visual medium such as
      paper), or recognized.

   Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
   U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
   U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
   U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
   U+FFFFE-FFFFF, U+10FFFE-10FFFF):  These code points are defined as
      non-characters. Applications may use some of them internally, but
      are not prepared to interchange them.

   For reference, we here also list the code points and code units not
   even allowed in Legacy Extended IRIs:

   Surrogate code units (D800-DFFF):  These do not represent Unicode
      codepoints.

   Non-characters (U+FFFE-FFFF):  These are allowed neither in XML nor
      in LEIRIs.

7.  URI/IRI Processing Guidelines (Informative)

   This informative section provides guidelines for supporting IRIs in
   the same software components and operations that currently process
   URIs: Software interfaces that handle URIs, software that allows
   users to enter URIs, software that creates or generates URIs,
   software that displays URIs, formats and protocols that transport
   URIs, and software that interprets URIs. These may all require
   modification before functioning properly with IRIs. The
   considerations in this section also apply to URI references and IRI
   references.

7.1.  URI/IRI Software Interfaces

   Software interfaces that handle URIs, such as URI-handling APIs and
   protocols transferring URIs, need interfaces and protocol elements
   that are designed to carry IRIs.

   In case the current handling in an API or protocol is based on
   US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
   it is compatible with US-ASCII, is in accordance with the
   recommendations of [RFC2277], and makes converting to URIs easy. In
   any case, the API or protocol definition must clearly define the
   character encoding to be used.

   The transfer from URI-only to IRI-capable components requires no
   mapping, although the conversion described in Section 4 above may be
   performed. It is preferable not to perform this inverse conversion
   unless it is certain this can be done correctly.

7.2.  URI/IRI Entry

   Some components allow users to enter URIs into the system by typing
   or dictation, for example. This software must be updated to allow
   for IRI entry.

   A person viewing a visual presentation of an IRI (as a sequence of
   glyphs, in some order, in some visual display) will use an entry
   method for characters in the user's language to input the IRI.
   Depending on the script and the input method used, this may be a more
   or less complicated process.

   The process of IRI entry must ensure, as much as possible, that the
   restrictions defined in Section 2.2 are met. This may be done by
   choosing appropriate input methods or variants/settings thereof, by
   appropriately converting the characters being input, by eliminating
   characters that cannot be converted, and/or by issuing a warning or
   error message to the user.

   As an example of variant settings, input method editors for East
   Asian Languages usually allow the input of Latin letters and related
   characters in full-width or half-width versions.
   For IRI input, the
   input method editor should be set so that it produces half-width
   Latin letters and punctuation and full-width Katakana.

   An input field primarily or solely used for the input of URIs/IRIs
   might allow the user to view an IRI as it is mapped to a URI. Places
   where the input of IRIs is frequent may provide the possibility for
   viewing an IRI as mapped to a URI. This will help users when some of
   the software they use does not yet accept IRIs.

   An IRI input component interfacing to components that handle URIs,
   but not IRIs, must map the IRI to a URI before passing it to these
   components.

   For the input of IRIs with right-to-left characters, please see
   [Bidi].

7.3.  URI/IRI Transfer between Applications

   Many applications (for example, mail user agents) try to detect URIs
   appearing in plain text. For this, they use some heuristics based on
   URI syntax. They then allow the user to click on such URIs and
   retrieve the corresponding resource in an appropriate (usually
   scheme-dependent) application.

   Such applications would need to be upgraded, in order to use the IRI
   syntax as a base for heuristics. In particular, a non-ASCII
   character should not be taken as the indication of the end of an IRI.
   Such applications also would need to make sure that they correctly
   convert the detected IRI from the character encoding of the document
   or application where the IRI appears, to the character encoding used
   by the system-wide IRI invocation mechanism, or to a URI (according
   to Section 3.6) if the system-wide invocation mechanism only accepts
   URIs.

   The clipboard is another frequently used way to transfer URIs and
   IRIs from one application to another. On most platforms, the
   clipboard is able to store and transfer text in many languages and
   scripts.
   Correctly used, the clipboard transfers characters, not
   octets, which will do the right thing with IRIs.

7.4.  URI/IRI Generation

   Systems that offer resources through the Internet, where those
   resources have logical names, sometimes automatically generate URIs
   for the resources they offer. For example, some HTTP servers can
   generate a directory listing for a file directory and then respond to
   the generated URIs with the files.

   Many legacy character encodings are in use in various file systems.
   Many currently deployed systems do not transform the local character
   representation of the underlying system before generating URIs.

   For maximum interoperability, systems that generate resource
   identifiers should make the appropriate transformations. For
   example, if a file system contains a file named "résumé.html",
   a server should expose this as "r%C3%A9sum%C3%A9.html" in
   a URI, which allows use of "résumé.html" in an IRI, even if
   locally the file name is kept in a character encoding other than
   UTF-8.

   This recommendation particularly applies to HTTP servers. For FTP
   servers, similar considerations apply; see [RFC2640].

7.5.  URI/IRI Selection

   In some cases, resource owners and publishers have control over the
   IRIs used to identify their resources. This control is mostly
   exercised by controlling the resource names, such as file names,
   directly.

   In these cases, it is recommended to avoid choosing IRIs that are
   easily confused. For example, for US-ASCII, the lower-case ell ("l")
   is easily confused with the digit one ("1"), and the upper-case oh
   ("O") is easily confused with the digit zero ("0"). Publishers
   should avoid confusing users with "br0ken" or "1ame" identifiers.

   Outside the US-ASCII repertoire, there are many more opportunities
   for confusion; a complete set of guidelines is too lengthy to include
   here.
   As long as names are limited to characters from a single script,
   native writers of a given script or language will know best when
   ambiguities can appear, and how they can be avoided. What may look
   ambiguous to a stranger may be completely obvious to the average
   native user. On the other hand, in some cases, the UCS contains
   variants for compatibility reasons; for example, for typographic
   purposes. These should be avoided wherever possible. Although there
   may be exceptions, newly created resource names should generally be
   in NFKC [UTR15] (which means that they are also in NFC).

   As an example, the UCS contains the "fi" ligature at U+FB01 for
   compatibility reasons. Wherever possible, IRIs should use the two
   letters "f" and "i" rather than the "fi" ligature. An example where
   the latter may be used is in the query part of an IRI for an explicit
   search for a word written with the "fi" ligature.

   In certain cases, there is a chance that characters from different
   scripts look the same. The best known example is the similarity of
   the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid
   such cases, IRIs should only be created where all the characters in a
   single component are used together in a given language. This usually
   means that all of these characters will be from the same script, but
   there are languages that mix characters from different scripts (such
   as Japanese). This is similar to the heuristics used to distinguish
   between letters and numbers in the examples above. Also, for Latin,
   Greek, and Cyrillic, using lowercase letters results in fewer
   ambiguities than using uppercase letters would.

7.6. Display of URIs/IRIs

   In situations where the rendering software is not expected to display
   non-ASCII parts of the IRI correctly using the available layout and
   font resources, these parts should be percent-encoded before being
   displayed.

   For display of Bidi IRIs, please see [Bidi].

7.7. Interpretation of URIs and IRIs

   Software that interprets IRIs as the names of local resources should
   accept IRIs in multiple forms and convert and match them with the
   appropriate local resource names.

   First, multiple representations include both IRIs in the native
   character encoding of the protocol and also their URI counterparts.

   Second, these forms may include URIs constructed based on character
   encodings other than UTF-8. These URIs may be produced by user
   agents that do not conform to this specification and that use legacy
   character encodings to convert non-ASCII characters to URIs. Whether
   this is necessary, and what character encodings to cover, depends on
   a number of factors, such as the legacy character encodings used
   locally and the distribution of various versions of user agents. For
   example, software for Japanese may accept URIs in Shift_JIS and/or
   EUC-JP in addition to UTF-8.

   Third, these forms may include additional mappings to be more
   user-friendly and robust against transmission errors. These would be
   similar to how some servers currently treat URIs as case insensitive
   or perform additional matching to account for spelling errors. For
   characters beyond the US-ASCII repertoire, this may, for example,
   include ignoring the accents on received IRIs or resource names.
   Please note that such mappings, including case mappings, are language
   dependent.

   It can be difficult to identify a resource unambiguously if too many
   mappings are taken into consideration.
   However, percent-encoded and not percent-encoded parts of IRIs can
   always be clearly distinguished. Also, the regularity of UTF-8 (see
   [Duerst97]) makes the potential for collisions lower than it may seem
   at first.

7.8. Upgrading Strategy

   Where this recommendation places further constraints on software for
   which many instances are already deployed, it is important to
   introduce upgrades carefully and to be aware of the various
   interdependencies.

   If IRIs cannot be interpreted correctly, they should not be created,
   generated, or transported. This suggests that upgrading URI
   interpreting software to accept IRIs should have highest priority.

   On the other hand, a single IRI is interpreted only by a single
   interpreter or by very few interpreters that are known in advance,
   although it may be entered and transported very widely.

   Therefore, IRIs benefit most from a broad upgrade of software to be
   able to enter and transport IRIs. However, before an individual IRI
   is published, care should be taken to upgrade the corresponding
   interpreting software in order to cover the forms expected to be
   received by various versions of entry and transport software.

   The upgrade of generating software to generate IRIs instead of using
   a local character encoding should happen only after the service is
   upgraded to accept IRIs. Similarly, IRIs should only be generated
   when the service accepts IRIs and the intervening infrastructure and
   protocols are known to transport them safely.

   Software converting from URIs to IRIs for display should be upgraded
   only after upgraded entry software has been widely deployed to the
   population that will see the displayed result.

   Where there is a free choice of character encodings, it is often
   possible to reduce the effort and dependencies for upgrading to IRIs
   by using UTF-8 rather than another encoding.
   For example, when a new file-based Web server is set up, using UTF-8
   as the character encoding for file names will make the transition to
   IRIs easier. Likewise, when a new Web form is set up using UTF-8 as
   the character encoding of the form page, the returned query URIs will
   use UTF-8 as the character encoding (unless the user, for whatever
   reason, changes the character encoding) and will therefore be
   compatible with IRIs.

   These recommendations, when taken together, will allow for the
   extension from URIs to IRIs in order to handle characters other than
   US-ASCII while minimizing interoperability problems. For
   considerations regarding the upgrade of URI scheme definitions, see
   Section 5.4.

8. IANA Considerations

   NOTE: THIS SECTION NEEDS REVIEW AGAINST HAPPIANA WORK.

   RFC Editor and IANA note: Please replace RFC XXXX with the number of
   this document when it issues as an RFC, and RFC YYYY with the number
   of the RFC issued for draft-ietf-iri-rfc3987bis.

   IANA maintains a registry of "URI schemes". This document attempts
   to make it clear from the registry that a "URI scheme" also serves as
   an "IRI scheme", and makes several changes to the registry.

   The description of the registry should be changed: "RFC 4395 defined
   an IANA-maintained registry of URI Schemes. RFC XXXX updates this
   registry to make it clear that the registered values also serve as
   IRI schemes, as defined in RFC YYYY."

   The registry includes schemes marked as Permanent or Provisional.
   Previously, this was accomplished by having two sections, "Permanent"
   and "Provisional".
   However, in order to allow other statuses ("Historical", and possibly
   a Proposed status for proposals which have been received but not
   accepted), the registry should be changed so that the status is
   indicated in a separate "Status" column, whose values may be
   "Permanent", "Provisional" or "Historical". Changes in status as
   well as updates to the entire registration may be accomplished by
   requests and expert review.

9. Security Considerations

   The security considerations discussed in [RFC3986] also apply to
   IRIs. In addition, the following issues require particular care for
   IRIs.

   Incorrect encoding or decoding can lead to security problems. For
   example, some UTF-8 decoders do not check against overlong byte
   sequences. See [UTR36] Section 3 for details.

   There are serious difficulties with relying on a human to verify that
   an IRI (whether presented visually or aurally) is the same as another
   IRI or is the one intended. These problems exist with ASCII-only
   URIs (bl00mberg.com vs. bloomberg.com) but are strongly exacerbated
   when using the much larger character repertoire of Unicode. For
   details, see Section 2 of [UTR36]. Using administrative and
   technical means to reduce the availability of such exploits is
   possible, but they are difficult to eliminate altogether. User
   agents SHOULD NOT rely on visual or perceptual comparison or
   verification of IRIs as a means of validating or assuring safety,
   correctness or appropriateness of an IRI. Other means of presenting
   users with the validity, safety, or appropriateness of visited sites
   are being developed in the browser community as an alternative means
   of avoiding these difficulties.
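
   The two issues above can be illustrated with a short, non-normative
   Python sketch. It assumes only the Python standard library; the
   helper names "decode_pct_strict" and "is_rejected" are hypothetical
   and are not defined by this specification. The sketch shows that a
   strict UTF-8 decoder rejects an overlong sequence obtained from
   percent-decoding, and that the Latin and Cyrillic capital "A" are
   distinct code points even though they may be rendered identically.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def decode_pct_strict(component: str) -> str:
    """Percent-decode a component and decode the octets as strict UTF-8.

    Hypothetical helper for illustration; raises UnicodeDecodeError for
    octet sequences that are not well-formed UTF-8.
    """
    raw = unquote_to_bytes(component)
    return raw.decode("utf-8", errors="strict")


def is_rejected(component: str) -> bool:
    """Return True if strict UTF-8 decoding of the component fails."""
    try:
        decode_pct_strict(component)
    except UnicodeDecodeError:
        return True
    return False


# "%C0%AF" is an overlong two-octet encoding of "/" (U+002F).
overlong_rejected = is_rejected("%C0%AF")

# "%C3%A9" is the shortest-form UTF-8 encoding of U+00E9.
valid_decoded = decode_pct_strict("r%C3%A9sum%C3%A9.html")

# Two characters that many fonts render identically.
latin = "A"          # U+0041
cyrillic = "\u0410"  # U+0410
names = [unicodedata.name(ch) for ch in (latin, cyrillic)]
same_code_points = latin == cyrillic
```

   A code point comparison of this kind establishes only that two IRIs
   differ; it does not tell a user which of them was intended.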

   Besides the large character repertoire of Unicode, reasons for
   confusion include different forms of normalization and different
   normalization expectations, use of percent-encoding with various
   legacy encodings, and bidirectionality issues. See also [Bidi].

   Confusion can occur in various IRI components, such as the domain
   name part or the path part, or between IRI components. For
   considerations specific to the domain name part, see [RFC5890]. For
   considerations specific to particular protocols or schemes, see the
   security sections of the relevant specifications and registration
   templates. Administrators of sites that allow independent users to
   create resources in the same sub area have to be careful. Details
   are discussed in Section 7.5.

   The characters additionally allowed in Legacy Extended IRIs introduce
   additional security issues. For details, see Section 6.3.

10. Acknowledgements

   This document was derived from [RFC3987]; the acknowledgments from
   that specification still apply.

   In addition, this document was influenced by contributions from (in
   no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
   John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
   Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
   Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
   Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
   Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
   Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
   Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
   Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
   Tobias, Marko Martin, Maciej Stachowiak, Wil Tan, Yui Naruse,
   Michael A.
   Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn Steele, Peter
   Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex Melnikov, Slim
   Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland, Sam Ruby, Adam
   Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas Milo, Murray
   Sargent, Marc Blanchet, and Mykyta Yevstifeyev.

11. Main Changes Since RFC 3987

   This section describes the main changes since [RFC3987].

11.1. Split out Bidi, processing guidelines, comparison sections

   Move some components (comparison, bidi, processing) into separate
   documents.

11.2. Major restructuring of IRI processing model

   Major restructuring of IRI processing model to make scheme-specific
   translation necessary to handle IDNA requirements and for consistency
   with web implementations.

   Starting with IRI, you want one of:

   a. IRI components (IRI parsed into UTF8 pieces)

   b. URI components (URI parsed into ASCII pieces, encoded correctly)

   c. whole URI (for passing on to some other system that wants whole
      URIs)

11.2.1. OLD WAY

   1. Pct-encoding on the whole thing to a URI. (c1) If you want a
      (maybe broken) whole URI, you might stop here.

   2. Parsing the URI into URI components. (b1) If you want (maybe
      broken) URI components, stop here.

   3. Decode the components (undoing the pct-encoding). (a) If you want
      IRI components, stop here.

   4. Reencode: either using a different encoding for some components
      (for domain names, and query components in web pages, which
      depends on the component, scheme and context), and otherwise
      using pct-encoding. (b2) If you want (good) URI components, stop
      here.

   5. Reassemble the reencoded components. (c2) If you want a (*good*)
      whole URI, stop here.

11.2.2. NEW WAY

   1. Parse the IRI into IRI components using the generic syntax. (a)
      If you want IRI components, stop here.

   2. Encode each component, using pct-encoding, IDN encoding, or
      special query part encoding depending on the component, scheme,
      or context. (b) If you want URI components, stop here.

   3. Reassemble a whole URI from URI components. (c) If you want a
      whole URI, stop here.

11.2.3. Extension of Syntax

   Added the tag range (U+E0000-E0FFF) to the iprivate production. Some
   IRIs generated with the new syntax may fail to pass very strict
   checks relying on the old syntax. But characters in this range
   should be extremely infrequent anyway.

11.2.4. More to be added

   TODO: There are more main changes that need to be documented in this
   section.

11.3. Change Log

   Note to RFC Editor: Please completely remove this section before
   publication.

11.3.1. Changes after draft-ietf-iri-3987bis-01

   Changes from draft-ietf-iri-3987bis-01 onwards are available as
   changesets in the IETF tools subversion repository at
   http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml.

11.3.2. Changes from draft-duerst-iri-bis-07 to
        draft-ietf-iri-3987bis-00

   Changed draft name, date, last paragraph of abstract, and titles in
   change log, and added this section in moving from
   draft-duerst-iri-bis-07 (personal submission) to
   draft-ietf-iri-3987bis-00 (WG document).

11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis

   Major restructuring of the processing model, see Section 11.2.

11.4. Changes from -00 to -01

   o Removed 'mailto:' before mail addresses of authors.

   o Added "" as right side of 'href-strip' rule. Fixed '|' to '/' for
     alternatives.

11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00

   o Add HyperText Reference, change abstract, acks and references for
     it

   o Add Masinter back as another editor.

   o Masinter integrates HRef material from HTML5 spec.

   o Rewrite introduction sections to modernize.

11.6. Changes from -04 to -05 of draft-duerst-iri-bis

   o Updated references.

   o Changed IPR text to pre5378Trust200902.

11.7. Changes from -03 to -04 of draft-duerst-iri-bis

   o Added explicit abbreviation for LEIRIs.

   o Mentioned LEIRI references.

   o Completed text in LEIRI section about tag characters and about
     specials.

11.8. Changes from -02 to -03 of draft-duerst-iri-bis

   o Updated some references.

   o Updated Michel Suignard's coordinates.

11.9. Changes from -01 to -02 of draft-duerst-iri-bis

   o Added tag range to iprivate (issue private-include-tags-115).

   o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.

11.10. Changes from -00 to -01 of draft-duerst-iri-bis

   o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
     based on input from the W3C XML Core WG. Moved the relevant
     subsections to the back and promoted them to a section.

   o Added some text re. Legacy Extended IRIs to the security section.

   o Added an IANA Considerations Section.

   o Added this Change Log Section.

   o Added a section about "IRIs with Spaces/Controls" (converting from
     a Note in RFC 3987).

11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis

   Fixed errata (see
   http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).

12. References

12.1. Normative References

   [ASCII]    American National Standards Institute, "Coded Character
              Set -- 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

   [ISO10646]
              International Organization for Standardization, "ISO/IEC
              10646:2011: Information Technology - Universal
              Multiple-Octet Coded Character Set (UCS)", ISO Standard
              10646, March 2011.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3491]  Hoffman, P. and M.
              Blanchet, "Nameprep: A Stringprep Profile for
              Internationalized Domain Names (IDN)", RFC 3491,
              March 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [STD68]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234, January 2008.

   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
              ISBN 978-1-936213-01-6)", October 2010.

   [UTR15]    Davis, M. and M. Duerst, "Unicode Normalization Forms",
              Unicode Standard Annex #15, March 2008.

12.2. Informative References

   [Bidi]     Duerst, M. and L. Masinter, "Guidelines for
              Internationalized Resource Identifiers with Bi-directional
              Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00
              (work in progress), August 2011.

   [CharMod]  Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
              Texin, "Character Model for the World Wide Web: Resource
              Identifiers", World Wide Web Consortium Candidate
              Recommendation, November 2004.

   [Duerst97]
              Duerst, M., "The Properties and Promises of UTF-8", Proc.
              11th International Unicode Conference, San Jose,
              September 1997.

   [Equivalence]
              Masinter, L. and M.
              Duerst, "Equivalence and Canonicalization of
              Internationalized Resource Identifiers (IRIs)",
              draft-ietf-iri-comparison-00 (work in progress),
              August 2011.

   [Gettys]   Gettys, J., "URI Model Consequences".

   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
              Specification", World Wide Web Consortium Recommendation,
              December 1999.

   [LEIRI]    Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
              IRIs for XML resource identification", World Wide Web
              Consortium Note, November 2008.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, November 1996.

   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
              Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
              the IAB Character Set Workshop held 29 February - 1 March,
              1996", RFC 2130, April 1997.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2192]  Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC2368]  Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
              URL scheme", RFC 2368, July 1998.

   [RFC2384]  Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
              August 1998.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC2640]  Curtin, B., "Internationalization of the File Transfer
              Protocol", RFC 2640, July 1999.

   [RFC3987]  Duerst, M. and M.
              Suignard, "Internationalized Resource Identifiers (IRIs)",
              RFC 3987, January 2005.

   [RFC4395bis]
              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
              Registration Procedures for New URI/IRI Schemes",
              draft-ietf-iri-4395bis-irireg-03 (work in progress),
              July 2011.

   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
              Encodings for Internationalized Domain Names", RFC 6055,
              February 2011.

   [RFC6082]  Whistler, K., Adams, G., Duerst, M., Presuhn, R., and J.
              Klensin, "Deprecating Unicode Language Tag Characters: RFC
              2482 is Historic", RFC 6082, November 2010.

   [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
              Markup Languages", Unicode Technical Report #20, World
              Wide Web Consortium Note, June 2003.

   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
              Considerations", Unicode Technical Report #36,
              August 2010.

   [XLink]    DeRose, S., Maler, E., and D. Orchard, "XML Linking
              Language (XLink) Version 1.0", World Wide Web
              Consortium REC-xlink-20010627, June 2001.

   [XML1]     Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
              Edition)", World Wide Web Consortium REC-xml-20081126,
              August 2006.

   [XMLNamespace]
              Bray, T., Hollander, D., Layman, A., and R. Tobin,
              "Namespaces in XML (Second Edition)", World Wide Web
              Consortium REC-xml-names-20091208, August 2006.

   [XMLSchema]
              Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
              World Wide Web Consortium REC-xmlschema-2-20041028,
              May 2001.

   [XPointer]
              Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
              Framework", World Wide Web Consortium
              REC-xptr-framework-20030325, March 2003.

Authors' Addresses

   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
   possible, for example as "Dürst" in XML and HTML.)
   Aoyama Gakuin University
   5-10-1 Fuchinobe
   Sagamihara, Kanagawa 229-8558
   Japan

   Phone: +81 42 759 6329
   Fax:   +81 42 759 6495
   Email: duerst@it.aoyama.ac.jp
   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
   (Note: This is the percent-encoded form of an IRI.)

   Michel Suignard
   Unicode Consortium
   P.O. Box 391476
   Mountain View, CA 94039-1476
   U.S.A.

   Phone: +1-650-693-3921
   Email: michel@unicode.org
   URI:   http://www.suignard.com

   Larry Masinter
   Adobe
   345 Park Ave
   San Jose, CA 95110
   U.S.A.

   Phone: +1-408-536-3024
   Email: masinter@adobe.com
   URI:   http://larry.masinter.net