idnits 2.17.1

draft-duerst-iri-bis-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices. See
     https://trustee.ietf.org/license-info/.)

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  == The 'Obsoletes: ' line in the draft header should list only the
     _numbers_ of the RFCs which will be obsoleted by this document (if
     approved); it should not include the word 'RFC' in the list.

  -- The draft header indicates that this document obsoletes RFC3987, but the
     abstract doesn't seem to directly say this. It does mention RFC3987
     though, so this could be OK.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008. The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust. If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 12, 2009) is 5401 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2192
     (Obsoleted by RFC 5092)

  -- Obsolete informational reference (is this intentional?): RFC 2368
     (Obsoleted by RFC 6068)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234,
     RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 2718
     (Obsoleted by RFC 4395)

     Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                          M. Duerst
Internet-Draft                                  Aoyama Gakuin University
Obsoletes: RFC 3987                                          M. Suignard
(if approved)                                         Unicode Consortium
Intended status: Standards Track                             L. Masinter
Expires: January 13, 2010                                          Adobe
                                                           July 12, 2009

             Internationalized Resource Identifiers (IRIs)
                        draft-duerst-iri-bis-06

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79. This document may contain material
   from IETF Documents or IETF Contributions published or made publicly
   available before November 10, 2008. The person(s) controlling the
   copyright in some of this material may not have granted the IETF
   Trust the right to allow modifications of such material outside the
   IETF Standards Process.
   Without obtaining an adequate license from
   the person(s) controlling the copyright in such materials, this
   document may not be modified outside the IETF Standards Process, and
   derivative works of it may not be created outside the IETF Standards
   Process, except to format it for publication as an RFC or to
   translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 13, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   This document defines a new protocol element, the Internationalized
   Resource Identifier (IRI), as an extension of the Uniform Resource
   Identifier (URI). An IRI is a sequence of characters from the
   Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
   URIs is defined, which provides a means for IRIs to be used instead
   of URIs, where appropriate, to identify resources.

   To accommodate widespread current practice, additional derivative
   protocol elements are defined, and current practice for resolving
   IRI-based hypertext references in HTML is outlined.

   The approach of defining new protocol elements, rather than updating
   or extending the definition of URI, was chosen to allow independent
   orderly transitions as appropriate: other protocols and languages
   that use URIs and their processing may explicitly choose to allow
   IRIs or derivative forms.

   Guidelines are provided for the use and deployment of IRIs and
   related protocol elements when revising protocols, formats, and
   software components that currently deal only with URIs.

   [RFC Editor: Please remove this paragraph before publication.] This
   is a draft to update RFC 3987 and move towards IETF Draft Standard.
   For an issues list/change log and additional information (including
   mailing list information), please see
   http://www.w3.org/International/iri-edit. For discussion and
   comments on this draft, please use the public-iri@w3.org mailing
   list.

Table of Contents

   1.  Introduction
     1.1.  Overview and Motivation
     1.2.  Applicability
     1.3.  Definitions
     1.4.  Notation
   2.  IRI Syntax
     2.1.  Summary of IRI Syntax
     2.2.  ABNF for IRI References and IRIs
   3.  Relationship between IRIs and URIs
     3.1.  Mapping of IRIs to URIs
     3.2.  Converting URIs to IRIs
       3.2.1.  Examples
   4.  Bidirectional IRIs for Right-to-Left Languages
     4.1.  Logical Storage and Visual Presentation
     4.2.  Bidi IRI Structure
     4.3.  Input of Bidi IRIs
     4.4.  Examples
   5.  Normalization and Comparison
     5.1.  Equivalence
     5.2.  Preparation for Comparison
     5.3.  Comparison Ladder
       5.3.1.  Simple String Comparison
       5.3.2.  Syntax-Based Normalization
       5.3.3.  Scheme-Based Normalization
       5.3.4.  Protocol-Based Normalization
   6.  Use of IRIs
     6.1.  Limitations on UCS Characters Allowed in IRIs
     6.2.  Software Interfaces and Protocols
     6.3.  Format of URIs and IRIs in Documents and Protocols
     6.4.  Use of UTF-8 for Encoding Original Characters
     6.5.  Relative IRI References
   7.  Legacy Extended IRIs (LEIRIs) and Hypertext References
     7.1.  Legacy Extended IRI Syntax
     7.2.  Conversion of Legacy Extended IRIs to IRIs
     7.3.  Characters Allowed in Legacy Extended IRIs but not in IRIs
     7.4.  HyperText References
   8.  URI/IRI Processing Guidelines (Informative)
     8.1.  URI/IRI Software Interfaces
     8.2.  URI/IRI Entry
     8.3.  URI/IRI Transfer between Applications
     8.4.  URI/IRI Generation
     8.5.  URI/IRI Selection
     8.6.  Display of URIs/IRIs
     8.7.  Interpretation of URIs and IRIs
     8.8.  Upgrading Strategy
   9.  IANA Considerations
   10. Security Considerations
   11. Acknowledgements
   12. Change Log
     12.1.  Changes from -05 to -06
     12.2.  Changes from -04 to -05
     12.3.  Changes from -03 to -04
     12.4.  Changes from -02 to -03
     12.5.  Changes from -01 to -02
     12.6.  Changes from -00 to -01
     12.7.  Changes from RFC 3987 to -00
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  Design Alternatives
     A.1.  New Scheme(s)
     A.2.  Character Encodings Other Than UTF-8
     A.3.  New Encoding Convention
     A.4.  Indicating Character Encodings in the URI/IRI
   Authors' Addresses

1.  Introduction

1.1.  Overview and Motivation

   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
   sequence of characters chosen from a limited subset of the repertoire
   of US-ASCII [ASCII] characters.

   The characters in URIs are frequently used for representing words of
   natural languages. This usage has many advantages: Such URIs are
   easier to memorize, easier to interpret, easier to transcribe, easier
   to create, and easier to guess. For most languages other than
   English, however, the natural script uses characters other than
   A - Z. For many people, handling Latin characters is as difficult as
   handling the characters of other scripts is for those who use only
   the Latin alphabet. Many languages with non-Latin scripts are
   transcribed with Latin letters. These transcriptions are now often
   used in URIs, but they introduce additional difficulties.

   The infrastructure for the appropriate handling of characters from
   additional scripts is now widely deployed in operating system and
   application software. Software that can handle a wide variety of
   scripts and languages at the same time is increasingly common. Also,
   increasing numbers of protocols and formats can carry a wide range of
   characters.

   URIs are used both as a protocol element (for transmission and
   processing by software) and as a presentation element (for display
   and handling by people who read, interpret, coin, or guess them). The
   transition between these roles is more difficult and complex when
   dealing with a larger set of characters than is allowed in [RFC3986].

   This document defines a new protocol element called Internationalized
   Resource Identifier (IRI), extending the syntax of URIs to a much
   wider repertoire of characters. It also defines corresponding
   "internationalized" versions of other constructs from [RFC3986], such
   as URI references.
   The syntax of IRIs is defined in Section 2, and
   the relationship between IRIs and URIs in Section 3.

   Using characters outside of A - Z in IRIs brings a number of
   difficulties. Section 4 discusses the special case of bidirectional
   IRIs using characters from scripts written right-to-left. Section 5
   discusses various forms of equivalence between IRIs. Section 6
   discusses the use of IRIs in different situations. Section 7
   describes extensions to the IRI syntax used in some XML languages
   [LEIRI] and the handling of IRIs in commonly deployed web browsers
   [HTML5]. Section 8 gives additional informative guidelines.
   Section 10 discusses security considerations.

1.2.  Applicability

   IRIs are designed to be compatible with recommendations for new URI
   schemes [RFC2718]. The compatibility is provided by specifying a
   well-defined and deterministic mapping from the IRI character
   sequence to the functionally equivalent URI character sequence.
   Practical use of IRIs (or IRI references) in place of URIs (or URI
   references) depends on the following conditions being met:

   a.  A protocol or format element should be explicitly designated to be
       able to carry IRIs. The intent is not to introduce IRIs into
       contexts that are not defined to accept them. For example, XML
       schema [XMLSchema] has an explicit type "anyURI" that includes
       IRIs and IRI references. Therefore, IRIs and IRI references can
       be in attributes and elements of type "anyURI". On the other
       hand, in the HTTP protocol [RFC2616], the Request URI is defined
       as a URI, which means that direct use of IRIs is not allowed in
       HTTP requests.

   b.  The protocol or format carrying the IRIs should have a mechanism
       to represent the wide range of characters used in IRIs, either
       natively or by some protocol- or format-specific escaping
       mechanism (for example, numeric character references in [XML1]).

   c.  The URI corresponding to the IRI in question has to encode
       original characters into octets using UTF-8. For new URI schemes,
       this is recommended in [RFC2718]. It can apply to a whole scheme
       (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN
       syntax [RFC2141]). It can apply to a specific part of a URI, such
       as the fragment identifier (e.g., [XPointer]). It can apply to a
       specific URI or part(s) thereof. For details, please see
       Section 6.4.

1.3.  Definitions

   The following definitions are used in this document; they follow the
   terms in [RFC2130], [RFC2277], and [ISO10646].

   character:  A member of a set of elements used for the organization,
      control, or representation of data. For example, "LATIN CAPITAL
      LETTER A" names a character.

   octet:  An ordered sequence of eight bits considered as a unit.

   character repertoire:  A set of characters (in the mathematical
      sense).

   sequence of characters:  A sequence of characters (one after
      another).

   sequence of octets:  A sequence of octets (one after another).

   character encoding:  A method of representing a sequence of
      characters as a sequence of octets (maybe with variants). Also, a
      method of (unambiguously) converting a sequence of octets into a
      sequence of characters.

   charset:  The name of a parameter or attribute used to identify a
      character encoding.

   UCS:  Universal Character Set. The coded character set defined by
      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

   IRI reference:  Denotes the common usage of an Internationalized
      Resource Identifier. An IRI reference may be absolute or
      relative. However, the "IRI" that results from such a reference
      only includes absolute IRIs; any relative IRI references are
      resolved to their absolute form. Note that in [RFC2396] URIs did
      not include fragment identifiers, but in [RFC3986] fragment
      identifiers are part of URIs.

   running text:  Human text (paragraphs, sentences, phrases) with
      syntax according to orthographic conventions of a natural
      language, as opposed to syntax defined for ease of processing by
      machines (e.g., markup, programming languages).

   protocol element:  Any portion of a message that affects processing
      of that message by the protocol in question.

   presentation element:  A presentation form corresponding to a
      protocol element; for example, using a wider range of characters.

   create (a URI or IRI):  With respect to URIs and IRIs, the term is
      used for the initial creation. This may be the initial creation
      of a resource with a certain identifier, or the initial exposition
      of a resource under a particular identifier.

   generate (a URI or IRI):  With respect to URIs and IRIs, the term is
      used when the IRI is generated by derivation from other
      information.

1.4.  Notation

   RFCs and Internet Drafts currently do not allow any characters
   outside the US-ASCII repertoire. Therefore, this document uses
   various special notations to denote such characters in examples.

   In text, characters outside US-ASCII are sometimes referenced by
   using a prefix of 'U+', followed by four to six hexadecimal digits.

   To represent characters outside US-ASCII in examples, this document
   uses two notations: 'XML Notation' and 'Bidi Notation'.

   XML Notation uses a leading '&#x', a trailing ';', and the
   hexadecimal number of the character in the UCS in between. For
   example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
   notation, an actual '&' is denoted by '&amp;'.

   Bidi Notation is used for bidirectional examples: Lower case letters
   stand for Latin letters or other letters that are written left to
   right, whereas upper case letters represent Arabic or Hebrew letters
   that are written right to left.
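
   As an aid to reading the examples, the following minimal Python
   sketch turns a string written in XML Notation into the characters
   it denotes. The function name from_xml_notation is a hypothetical
   helper introduced here for illustration; it is not defined by this
   document.

```python
import re

def from_xml_notation(text):
    """Decode XML Notation as described in Section 1.4 (illustrative helper)."""
    # Replace each '&#x' HEX ';' run with the UCS character it names.
    decoded = re.sub(
        r"&#x([0-9A-Fa-f]+);",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # An actual '&' is written '&amp;' in this notation; restore it last so
    # that '&amp;#x44F;' yields the literal text '&#x44F;'.
    return decoded.replace("&amp;", "&")

print(from_xml_notation("http://r&#xE9;sum&#xE9;.example.org"))
```

   Decoding numeric references before '&amp;' keeps an escaped
   ampersand from being reinterpreted as the start of a reference.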

   To denote actual octets in examples (as opposed to percent-encoded
   octets), the two hex digits denoting the octet are enclosed in "<"
   and ">". For example, the octet often denoted as 0xc9 is denoted
   here as <c9>.

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
   and "OPTIONAL" are to be interpreted as described in [RFC2119].

2.  IRI Syntax

   This section defines the syntax of Internationalized Resource
   Identifiers (IRIs).

   As with URIs, an IRI is defined as a sequence of characters, not as a
   sequence of octets. This definition accommodates the fact that IRIs
   may be written on paper or read over the radio as well as stored or
   transmitted digitally. The same IRI might be represented as
   different sequences of octets in different protocols or documents if
   these protocols or documents use different character encodings
   (and/or transfer encodings). Using the same character encoding as
   the containing protocol or document ensures that the characters in
   the IRI can be handled (e.g., searched, converted, displayed) in the
   same way as the rest of the protocol or document.

2.1.  Summary of IRI Syntax

   IRIs are defined similarly to URIs in [RFC3986], but the class of
   unreserved characters is extended by adding the characters of the UCS
   (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
   limitations given in the syntax rules below and in Section 6.1.

   Otherwise, the syntax and use of components and reserved characters
   is the same as that in [RFC3986]. All the operations defined in
   [RFC3986], such as the resolution of relative references, can be
   applied to IRIs by IRI-processing software in exactly the same way as
   they are for URIs by URI-processing software.
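
   The point about reusing [RFC3986] operations can be illustrated
   with relative reference resolution. The sketch below assumes
   Python's standard urllib.parse.urljoin as the URI-processing
   routine; the base and relative references are made-up examples,
   and "\u00E9" is the Python escape for U+00E9.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative IRI reference, used only for illustration.
base = "http://example.org/a/b"
relative = "../r\u00E9sum\u00E9"

# The same merge and dot-segment removal used for URIs is applied unchanged;
# the non-ASCII characters are carried through like any unreserved character.
resolved = urljoin(base, relative)
print(resolved)
```

   No IRI-specific logic is needed here, because the non-ASCII
   characters play no syntactic role in the resolution algorithm.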

   Characters outside the US-ASCII repertoire are not reserved and
   therefore MUST NOT be used for syntactical purposes, such as to
   delimit components in newly defined schemes. For example, U+00A2,
   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
   the 'iunreserved' category. This is similar to the fact that it is
   not possible to use '-' as a delimiter in URIs, because it is in the
   'unreserved' category.

2.2.  ABNF for IRI References and IRIs

   Although it might be possible to define IRI references and IRIs
   merely by their transformation to URI references and URIs, they can
   also be accepted and processed directly. Therefore, an ABNF
   definition for IRI references (which are the most general concept and
   the start of the grammar) and IRIs is given here. The syntax of this
   ABNF is described in [STD68]. Character numbers are taken from the
   UCS, without implying any actual binary encoding. Terminals in the
   ABNF are characters, not bytes.

   The following grammar closely follows the URI grammar in [RFC3986],
   except that the range of unreserved characters is expanded to include
   UCS characters, with the restriction that private UCS characters can
   occur only in query parts. The grammar is split into two parts:
   Rules that differ from [RFC3986] because of the above-mentioned
   expansion, and rules that are the same as those in [RFC3986]. For
   rules that are different from those in [RFC3986], the names of the
   non-terminals have been changed as follows. If the non-terminal
   contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i'
   has been prefixed.

   The following rules are different from those in [RFC3986]:

   IRI            = scheme ":" ihier-part [ "?" iquery ]
                    [ "#" ifragment ]

   ihier-part     = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-rootless
                  / ipath-empty

   IRI-reference  = IRI / irelative-ref

   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty

   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name

   ireg-name      = *( iunreserved / pct-encoded / sub-delims )

   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters

   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>

   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   iquery         = *( ipchar / iprivate / "/" / "?" )

   ifragment      = *( ipchar / "/" / "?" )

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
                  / %x100000-10FFFD

   Some productions are ambiguous. The "first-match-wins" (a.k.a.
   "greedy") algorithm applies.
   For details, see [RFC3986].

   The following rules are the same as those in [RFC3986]:

   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   port           = *DIGIT

   IP-literal     = "[" ( IPv6address / IPvFuture ) "]"

   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address

   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   pct-encoded    = "%" HEXDIG HEXDIG

   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   This syntax does not support IPv6 scoped addressing zone identifiers.

3.  Relationship between IRIs and URIs

   IRIs are meant to replace URIs in identifying resources within new
   versions of protocols, formats, and software components that use a
   UCS-based character repertoire. These protocols and components may
   never need to use URIs directly, especially when the resource
   identifier is used simply for identification purposes. However, when
   the resource identifier is used for resource retrieval, it is in many
   cases necessary to determine the associated URI, because retrieval
   mechanisms are only defined for URIs. When used to access a
   resource, the meaning of an IRI SHOULD be the same as the meaning of
   the equivalent URI.
   This relationship ensures that the resources
   identified continue to be available to URI-based software as well.

   This mapping has two purposes:

   Syntactical.  Many URI schemes and components define additional
      syntactical restrictions not captured in Section 2.2. Scheme-
      specific restrictions are applied to IRIs by converting IRIs to
      URIs and checking the URIs against the scheme-specific
      restrictions.

   Interpretational.  URIs are used to identify resources in various
      ways. IRIs also identify resources; an IRI identifies the same
      resource as does the URI that it maps to. In some contexts, it
      may actually not be necessary to map the IRI to a URI to determine
      the resource it identifies (see Section 5). However, when an IRI
      is used for resource retrieval, the resource that the IRI locates
      is the same as the one located by the URI obtained after
      converting the IRI according to the procedure defined below. For
      this reason, there is no separate definition of resolution for
      IRIs.

3.1.  Mapping of IRIs to URIs

   This section defines how to map IRI-related protocol elements to
   strings in the URI character set. This mapping is intended for
   mapping IRIs to URIs, IRI references to URI references, as well as
   components thereof (for example, fragment identifiers).

   Note that Section 7 describes variants of this algorithm used in some
   applications for mapping related protocol elements.

   The mapping is defined through an algorithm:

   Step 1.  Generate a UCS character sequence from the original IRI
      format. This step has the following three variants, depending on
      the form of the input:

      a.  If the IRI is written on paper, read aloud, or otherwise
          represented as a sequence of characters independent of any
          character encoding, represent the IRI as a sequence of
          characters from the UCS normalized according to Normalization
          Form C (NFC, [UTR15]).

      b.  If the IRI is in some digital representation (e.g., an octet
          stream) in some known non-Unicode character encoding, convert
          the IRI to a sequence of characters from the UCS. The
          resulting sequence of characters SHOULD be normalized using
          NFC.

      c.  If the IRI is in a Unicode-based character encoding (for
          example, UTF-8 or UTF-16), do not normalize (see
          Section 5.3.2.2 for details). Apply the next steps directly to
          the encoded Unicode character sequence.

   Step 2.  For each character which is in either 'ucschar' or
      'iprivate', apply steps 2.1 through 2.3 below.

      2.1.  Convert the character to a sequence of one or more octets
            using UTF-8 [RFC3629].

      2.2.  Convert each octet to %HH, where HH is the hexadecimal
            notation of the octet value. Note that this is identical to
            the percent-encoding mechanism in Section 2.1 of [RFC3986].
            To reduce variability, the hexadecimal notation SHOULD use
            uppercase letters.

      2.3.  Replace the original character with the resulting character
            sequence (i.e., a sequence of %HH triplets).

   The above mapping, when applied to a valid IRI, produces a URI fully
   conforming to [RFC3986]. The mapping is also an identity
   transformation for URIs and is idempotent; applying the mapping a
   second time will not change anything. Every URI is by definition an
   IRI.

   Systems accepting IRIs MAY convert the ireg-name component of an IRI
   as follows (before step 2 above) for schemes known to use domain
   names in ireg-name, if the scheme definition does not allow percent-
   encoding for ireg-name: Replace the ireg-name part of the IRI by the
   part converted using the ToASCII operation specified in Section 4.1
   of [RFC3490] on each dot-separated label, and by using U+002E (FULL
   STOP) as a label separator, with the flag UseSTD3ASCIIRules set to
   TRUE, and with the flag AllowUnassigned set to FALSE for creating
   IRIs and set to TRUE otherwise.
   The ToASCII operation may fail, but this would mean that the IRI
   cannot be resolved. This conversion SHOULD be used when the goal
   is to maximize interoperability with legacy URI resolvers. For
   example, the IRI
      "http://résumé.example.org"
   may be converted to
      "http://xn--rsum-bpad.example.org"
   instead of
      "http://r%C3%A9sum%C3%A9.example.org".

   An IRI with a scheme that is known to use domain names in
   ireg-name, but where the scheme definition does not allow
   percent-encoding for ireg-name, meets scheme-specific restrictions
   if either the straightforward conversion or the conversion using
   the ToASCII operation on ireg-name results in a URI that meets the
   scheme-specific restrictions.

   Such an IRI resolves to the URI obtained by converting the IRI
   with the ToASCII operation applied to ireg-name. Implementations
   do not have to do this conversion as long as they produce the same
   result.

   Note: The difference between variants b and c in step 1 (using
      normalization with NFC, versus not using any normalization)
      accounts for the fact that in many non-Unicode character
      encodings, some text cannot be represented directly. For
      example, the word "Vietnam" is natively written
      "Vi&#x1EC7;t Nam" (in XML Notation; containing a LATIN SMALL
      LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
      transcoding from the windows-1258 character encoding leads to
      "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
      CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct
      transcoding of other 8-bit encodings of Vietnamese may lead to
      other representations.

   Note: The uniform treatment of the whole IRI in step 2 is
      important to make processing independent of URI scheme. See
      [Gettys] for an in-depth discussion.
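   Step 2 of the mapping and the optional ToASCII conversion of
   ireg-name can be sketched as follows. This is a minimal
   illustration, not a conforming implementation: it assumes that
   step 1 has already produced a Python string, it treats every code
   point above U+007F as 'ucschar' or 'iprivate' instead of checking
   the ranges of Section 2.2, and it relies on Python's built-in
   "idna" codec for ToASCII, through which the UseSTD3ASCIIRules and
   AllowUnassigned flags are not configurable. The function names are
   hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Apply step 2 of the mapping to an IRI held as a Python string.

    Simplifying assumption: any code point above U+007F is encoded;
    the exact 'ucschar' and 'iprivate' ranges are not checked.
    """
    pieces = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Step 2.1: UTF-8 octets. Step 2.2: %HH with uppercase hex.
            octets = ch.encode("utf-8")
            pieces.append("".join(f"%{octet:02X}" for octet in octets))
        else:
            # ASCII characters, including "%", "#", "[" and "]",
            # are copied unchanged, so existing %HH triplets survive.
            pieces.append(ch)
    # Step 2.3: the triplets replace the original characters in place.
    return "".join(pieces)


def ireg_name_to_ascii(ireg_name: str) -> str:
    """Convert a domain name label by label with RFC 3490 ToASCII.

    Assumption: Python's "idna" codec stands in for ToASCII; it raises
    UnicodeError when the operation fails.
    """
    return ireg_name.encode("idna").decode("ascii")
```

   Applying iri_to_uri() to its own output returns the same string,
   which reflects the idempotence noted above.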

   Note: In practice, whether the general mapping (steps 1 and 2) or
      the ToASCII operation of [RFC3490] is used for ireg-name will
      not be noticed if mapping from IRI to URI and resolution are
      tightly integrated (e.g., carried out in the same user agent).
      But conversion using [RFC3490] may be able to better deal with
      backwards compatibility issues in case mapping and resolution
      are separated, as in the case of using an HTTP proxy.

   Note: Internationalized Domain Names may be contained in parts of
      an IRI other than the ireg-name part. It is the responsibility
      of scheme-specific implementations (if the Internationalized
      Domain Name is part of the scheme syntax) or of server-side
      implementations (if the Internationalized Domain Name is part
      of 'iquery') to apply the necessary conversions at the
      appropriate point. Example: Trying to validate the Web page at
      http://résumé.example.org would lead to an IRI of
      http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org,
      which would convert to a URI of
      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org.
      The server-side implementation would be responsible for making
      the necessary conversions to be able to retrieve the Web page.

   Systems accepting IRIs MAY also deal with the printable characters
   in US-ASCII that are not allowed in URIs, namely "<", ">", '"',
   space, "{", "}", "|", "\", "^", and "`", in step 2 above. If these
   characters are found but are not converted, then the conversion
   SHOULD fail. Protocols and formats that have used earlier
   definitions of IRIs including these characters MAY require
   percent-encoding of these characters as a preprocessing step to
   extract the actual IRI from a given field. This preprocessing MAY
   also be used by applications allowing the user to enter an IRI.
   Please note that the number sign ("#"), the percent sign ("%"),
   and the square bracket characters ("[", "]") are not part of the
   above list and MUST NOT be converted.

   Note: In this process (in step 2.3), characters allowed in URI
      references and existing percent-encoded sequences are not
      encoded further. (This mapping is similar to, but different
      from, the encoding applied when arbitrary content is included
      in some part of a URI.) For example, an IRI of
      "http://www.example.org/red%09ros&#xE9;#red" (in XML notation)
      is converted to
      "http://www.example.org/red%09ros%C3%A9#red", not to something
      like
      "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

   Note: Some older software transcoding to UTF-8 may produce illegal
      output for some input, in particular for characters outside the
      BMP (Basic Multilingual Plane). As an example, for the IRI with
      non-BMP characters (in XML Notation):
      "http://example.com/&#x10300;&#x10301;&#x10302;"
      which contains the first three letters of the Old Italic
      alphabet, the correct conversion to a URI is
      "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

3.2. Converting URIs to IRIs

   In some situations, converting a URI into an equivalent IRI may be
   desirable. This section gives a procedure for this conversion.
   The conversion described in this section will always result in an
   IRI that maps back to the URI used as an input for the conversion
   (except for potential case differences in percent-encoding and for
   potential percent-encoded unreserved characters). However, the
   IRI resulting from this conversion may not be exactly the same as
   the original IRI (if there ever was one).

   URI-to-IRI conversion removes percent-encodings, but not all
   percent-encodings can be eliminated. There are several reasons
   for this:

   1. Some percent-encodings are necessary to distinguish
      percent-encoded and unencoded uses of reserved characters.

   2.
      Some percent-encodings cannot be interpreted as sequences of
      UTF-8 octets.

      (Note: The octet patterns of UTF-8 are highly regular.
      Therefore, there is a very high probability, but no guarantee,
      that percent-encodings that can be interpreted as sequences of
      UTF-8 octets actually originated from UTF-8. For a detailed
      discussion, see [Duerst97].)

   3. The conversion may result in a character that is not
      appropriate in an IRI. See Section 2.2, Section 4.1, and
      Section 6.1 for further details.

   Conversion from a URI to an IRI is done by using the following
   steps (or any other algorithm that produces the same result):

   1. Represent the URI as a sequence of octets in US-ASCII.

   2. Convert all percent-encodings ("%" followed by two hexadecimal
      digits) to the corresponding octets, except those corresponding
      to "%", characters in "reserved", and characters in US-ASCII
      not allowed in URIs.

   3. Re-percent-encode any octet produced in step 2 that is not part
      of a strictly legal UTF-8 octet sequence.

   4. Re-percent-encode all octets produced in step 3 that in UTF-8
      represent characters that are not appropriate according to
      Section 2.2, Section 4.1, and Section 6.1.

   5. Interpret the resulting octet sequence as a sequence of
      characters encoded in UTF-8.

   This procedure will convert as many percent-encoded characters as
   possible to characters in an IRI. Because there are some choices
   when step 4 is applied (see Section 6.1), results may vary.

   Conversions from URIs to IRIs MUST NOT use any character encoding
   other than UTF-8 in steps 3 and 4, even if it might be possible to
   guess from the context that a different character encoding was
   used in the URI. For example, the URI
   "http://www.example.org/r%E9sum%E9.html" might with some guessing
   be interpreted to contain two e-acute characters encoded as
   iso-8859-1.
   It must not be converted to an IRI containing these e-acute
   characters. Otherwise, in the future the IRI will be mapped to
   "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a
   different URI from "http://www.example.org/r%E9sum%E9.html".

3.2.1. Examples

   This section shows various examples of converting URIs to IRIs.
   Each example shows the result after each of the steps 1 through 5
   is applied. XML Notation is used for the final result. Octets
   are denoted by "<" followed by two hexadecimal digits followed by
   ">".

   The following example contains the sequence "%C3%BC", which is a
   strictly legal UTF-8 sequence, and which is converted into the
   actual character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also
   known as u-umlaut).

   1. http://www.example.org/D%C3%BCrst

   2. http://www.example.org/D<c3><bc>rst

   3. http://www.example.org/D<c3><bc>rst

   4. http://www.example.org/D<c3><bc>rst

   5. http://www.example.org/D&#xFC;rst

   The following example contains the sequence "%FC", which might
   represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
   iso-8859-1 character encoding. (It might represent other
   characters in other character encodings. For example, the octet
   <fc> in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.)
   Because <fc> is not part of a strictly legal UTF-8 sequence, it is
   re-percent-encoded in step 3.

   1. http://www.example.org/D%FCrst

   2. http://www.example.org/D<fc>rst

   3. http://www.example.org/D%FCrst

   4. http://www.example.org/D%FCrst

   5. http://www.example.org/D%FCrst

   The following example contains "%e2%80%ae", which is the
   percent-encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT
   OVERRIDE. Section 4.1 forbids the direct use of this character in
   an IRI. Therefore, the corresponding octets are re-percent-encoded
   in step 4. This example shows that the case (upper- or lowercase)
   of letters used in percent-encodings may not be preserved.
   The example also contains a punycode-encoded domain name label
   (xn--99zt52a), which is not converted.

   1. http://xn--99zt52a.example.org/%e2%80%ae

   2. http://xn--99zt52a.example.org/<e2><80><ae>

   3. http://xn--99zt52a.example.org/<e2><80><ae>

   4. http://xn--99zt52a.example.org/%E2%80%AE

   5. http://xn--99zt52a.example.org/%E2%80%AE

   Implementations with scheme-specific knowledge MAY convert
   punycode-encoded domain name labels to the corresponding
   characters using the ToUnicode procedure. Thus, for the example
   above, the label "xn--99zt52a" may be converted to U+7D0D U+8C46
   (Japanese Natto), leading to the overall IRI of
   "http://納豆.example.org/%E2%80%AE".

4. Bidirectional IRIs for Right-to-Left Languages

   Some UCS characters, such as those used in the Arabic and Hebrew
   scripts, have an inherent right-to-left (rtl) writing direction.
   IRIs containing these characters (called bidirectional IRIs or
   Bidi IRIs) require additional attention because of the non-trivial
   relation between logical representation (used for digital
   representation and for reading/spelling) and visual representation
   (used for display/printing).

   Because of the complex interaction between the logical
   representation, the visual representation, and the syntax of a
   Bidi IRI, a balance is needed between various requirements. The
   main requirements are

   1. user-predictable conversion between visual and logical
      representation;

   2. the ability to include a wide range of characters in various
      parts of the IRI; and

   3. minor or no changes or restrictions for implementations.

4.1. Logical Storage and Visual Presentation

   When stored or transmitted in digital representation,
   bidirectional IRIs MUST be in full logical order and MUST conform
   to the IRI syntax rules (which includes the rules relevant to
   their scheme).
   This ensures that bidirectional IRIs can be processed in the same
   way as other IRIs.

   Bidirectional IRIs MUST be rendered by using the Unicode
   Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST
   be rendered in the same way as they would be if they were in a
   left-to-right embedding; i.e., as if they were preceded by U+202A,
   LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
   DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can
   also be done in a higher-level protocol (e.g., the dir='ltr'
   attribute in HTML).

   There is no requirement to use the above embedding if the display
   is still the same without the embedding. For example, a
   bidirectional IRI in a text with left-to-right base directionality
   (such as used for English or Cyrillic) that is preceded and
   followed by whitespace and strong left-to-right characters does
   not need an embedding. Also, a bidirectional relative IRI
   reference that only contains strong right-to-left characters and
   weak characters and that starts and ends with a strong
   right-to-left character and appears in a text with right-to-left
   base directionality (such as used for Arabic or Hebrew) and is
   preceded and followed by whitespace and strong characters does not
   need an embedding.

   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may
   be sufficient to force the correct display behavior. However, the
   details of the Unicode Bidirectional algorithm are not always easy
   to understand. Implementers are strongly advised to err on the
   side of caution and to use embedding in all cases where they are
   not completely sure that the display behavior is unaffected
   without the embedding.

   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
   higher-level protocols to influence bidirectional rendering.
   Such changes by higher-level protocols MUST NOT be used if they
   change the rendering of IRIs.

   The bidirectional formatting characters that may be used before or
   after the IRI to ensure correct display are not themselves part of
   the IRI. IRIs MUST NOT contain bidirectional formatting
   characters (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect
   the visual rendering of the IRI but do not appear themselves. It
   would therefore not be possible to input an IRI with such
   characters correctly.

4.2. Bidi IRI Structure

   The Unicode Bidirectional Algorithm is designed mainly for running
   text. To make sure that it does not affect the rendering of
   bidirectional IRIs too much, some restrictions on bidirectional
   IRIs are necessary. These restrictions are given in terms of
   delimiters (structural characters, mostly punctuation such as "@",
   ".", ":", and "/") and components (usually consisting mostly of
   letters and digits).

   The following syntax rules from Section 2.2 correspond to
   components for the purpose of Bidi behavior: iuserinfo, ireg-name,
   isegment, isegment-nz, isegment-nz-nc, iquery, and ifragment.

   Specifications that define the syntax of any of the above
   components MAY divide them further and define smaller parts to be
   components according to this document. As an example, the
   restrictions of [RFC3490] on bidirectional domain names correspond
   to treating each label of a domain name as a component for schemes
   with ireg-name as a domain name. Even where the components are
   not defined formally, it may be helpful to think about some syntax
   in terms of components and to apply the relevant restrictions.
   For example, for the usual name/value syntax in query parts, it is
   convenient to treat each name and each value as a component. As
   another example, the extensions in a resource name can be treated
   as separate components.
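
   As an illustration of treating each name and each value of a query
   part as a component, the following sketch splits a query and
   reports the directionality of each resulting component. It is an
   assumption-laden example: the delimiter set ("&", ";", "=") and
   the function names are hypothetical, and only the strong bidi
   classes reported by Python's unicodedata module are considered.

```python
import re
import unicodedata


def query_components(query: str) -> list[str]:
    """Split a query part into name and value components.

    Assumption: "&", ";" and "=" are the only delimiters in use.
    """
    return [part for part in re.split(r"[&;=]", query) if part]


def direction(component: str) -> str:
    """Classify a component as 'rtl', 'ltr', 'mixed', or 'neutral'."""
    classes = {unicodedata.bidirectional(ch) for ch in component}
    has_rtl = bool(classes & {"R", "AL"})  # strong right-to-left
    has_ltr = "L" in classes               # strong left-to-right
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    return "ltr" if has_ltr else "neutral"
```

   A component reported as 'mixed' is one that combines right-to-left
   and left-to-right characters; a component made only of digits is
   reported as 'neutral' because digits are not strong characters.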

   For each component, the following restrictions apply:

   1. A component SHOULD NOT use both right-to-left and left-to-right
      characters.

   2. A component using right-to-left characters SHOULD start and end
      with right-to-left characters.

   The above restrictions are given as shoulds, rather than as musts.
   For IRIs that are never presented visually, they are not relevant.
   However, for IRIs in general, they are very important to ensure
   consistent conversion between visual presentation and logical
   representation, in both directions.

   Note: In some components, the above restrictions may actually be
      strictly enforced. For example, [RFC3490] requires that these
      restrictions apply to the labels of a host name for those
      schemes where ireg-name is a host name. In some other
      components (for example, path components) following these
      restrictions may not be too difficult. For other components,
      such as parts of the query part, it may be very difficult to
      enforce the restrictions because the values of query parameters
      may be arbitrary character sequences.

   If the above restrictions cannot be satisfied otherwise, the
   affected component can always be mapped to URI notation as
   described in Section 3.1. Please note that the whole component
   has to be mapped (see also Example 9 below).

4.3. Input of Bidi IRIs

   Bidi input methods MUST generate Bidi IRIs in logical order while
   rendering them according to Section 4.1. During input, rendering
   SHOULD be updated after every new character is input to avoid
   end-user confusion.

4.4. Examples

   This section gives examples of bidirectional IRIs, in Bidi
   Notation. It shows legal IRIs with the relationship between
   logical and visual representation and explains how certain
   phenomena in this relationship may look strange to somebody not
   familiar with bidirectional behavior, but familiar to users of
   Arabic and Hebrew.
   It also shows what happens if the restrictions given in
   Section 4.2 are not followed. The examples below can be seen at
   [BidiEx], in Arabic, Hebrew, and Bidi Notation variants.

   To read the bidi text in the examples, read the visual
   representation from left to right until you encounter a block of
   rtl text. Read the rtl block (including slashes and other special
   characters) from right to left, then continue at the next unread
   ltr character.

   Example 1: A single component with rtl characters is inverted:
   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
   Components can be read one by one, and each component can be read
   in its natural direction.

   Example 2: More than one consecutive component with rtl characters
   is inverted as a whole:
   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
   A sequence of rtl components is read rtl, in the same way as a
   sequence of rtl words is read rtl in a bidi text.

   Example 3: All components of an IRI (except for the scheme) are
   rtl. All rtl components are inverted overall:
   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
   The whole IRI (except the scheme) is read rtl. Delimiters between
   rtl components stay between the respective components; delimiters
   between ltr and rtl components don't move.

   Example 4: Each of several sequences of rtl components is inverted
   on its own:
   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
   Each sequence of rtl components is read rtl, in the same way as
   each sequence of rtl words in an ltr text is read rtl.

   Example 5: Example 2, applied to components of different kinds:
   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
   The inversion of the domain name label and the path component may
   be unexpected, but it is consistent with other bidi behavior. For
   reassurance that the domain component really is "ab.cd.EF", it may
   be helpful to read aloud the visual representation following the
   bidi algorithm. After "http://ab.cd." one reads the RTL block
   "E-F-slash-G-H", which corresponds to the logical representation.

   Example 6: Same as Example 5, with more rtl components:
   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
   The inversion of the domain name labels and the path components
   may be easier to identify because the delimiters also move.

   Example 7: A single rtl component includes digits:
   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
   Numbers are written ltr in all cases but are treated as an
   additional embedding inside a run of rtl characters. This is
   completely consistent with usual bidirectional text.

   Example 8 (not allowed): Numbers are at the start or end of an rtl
   component:
   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
   The sequence "1/2" is interpreted by the bidi algorithm as a
   fraction, fragmenting the components and leading to confusion.
   There are other characters that are interpreted in a special way
   close to numbers; in particular, "+", "-", "#", "$", "%", ",",
   ".", and ":".

   Example 9 (not allowed): The numbers in the previous example are
   percent-encoded:
   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

   Example 10 (allowed but not recommended):
   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
   Components consisting of only numbers are allowed (it would be
   rather difficult to prohibit them), but these may interact with
   adjacent RTL components in ways that are not easy to predict.

   Example 11 (allowed but not recommended):
   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
   Components consisting of numbers and left-to-right characters are
   allowed, but these may interact with adjacent RTL components in
   ways that are not easy to predict.

5. Normalization and Comparison

   Note: The structure and much of the material for this section is
      taken from section 6 of [RFC3986]; the differences are due to
      the specifics of IRIs.

   One of the most common operations on IRIs is simple comparison:
   Determining whether two IRIs are equivalent without using the IRIs
   or the mapped URIs to access their respective resource(s). A
   comparison is performed whenever a response cache is accessed, a
   browser checks its history to color a link, or an XML parser
   processes tags within a namespace. Extensive normalization prior
   to comparison of IRIs may be used by spiders and indexing engines
   to prune a search space or reduce duplication of request actions
   and response storage.

   IRI comparison is performed for some particular purpose.
   Protocols or implementations that compare IRIs for different
   purposes will often be subject to differing design trade-offs with
   regard to how much effort should be spent in reducing aliased
   identifiers. This section describes various methods that may be
   used to compare IRIs, the trade-offs between them, and the types
   of applications that might use them.

5.1. Equivalence

   Because IRIs exist to identify resources, presumably they should
   be considered equivalent when they identify the same resource.
   However, this definition of equivalence is not of much practical
   use, as there is no way for an implementation to compare two
   resources unless it has full knowledge or control of them. For
   this reason, determination of equivalence or difference of IRIs is
   based on string comparison, perhaps augmented by reference to
   additional rules provided by URI scheme definitions. We use the
   terms "different" and "equivalent" to describe the possible
   outcomes of such comparisons, but there are many
   application-dependent versions of equivalence.

   Even though it is possible to determine that two IRIs are
   equivalent, IRI comparison is not sufficient to determine whether
   two IRIs identify different resources. For example, an owner of
   two different domain names could decide to serve the same resource
   from both, resulting in two different IRIs. Therefore, comparison
   methods are designed to minimize false negatives while strictly
   avoiding false positives.

   In testing for equivalence, applications should not directly
   compare relative references; the references should be converted to
   their respective target IRIs before comparison. When IRIs are
   compared to select (or avoid) a network action, such as retrieval
   of a representation, fragment components (if any) should be
   excluded from the comparison.
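
   The two preparatory steps just described, resolving relative
   references and excluding fragments, can be sketched as follows.
   This is only an illustration under stated assumptions: it uses
   Python's urllib.parse, which is written for URIs and does not
   validate IRI syntax, and the function names are hypothetical. The
   final comparison is a plain string comparison.

```python
from urllib.parse import urldefrag, urljoin


def comparison_form(reference: str, base: str) -> str:
    """Resolve a reference against a base and drop any fragment."""
    target = urljoin(base, reference)  # relative reference -> target
    return urldefrag(target).url       # exclude the fragment


def same_for_retrieval(ref_a: str, ref_b: str, base: str) -> bool:
    """Compare two references for the purpose of a network action."""
    return comparison_form(ref_a, base) == comparison_form(ref_b, base)
```

   With a base of "http://example.org/a/c/d", the references
   "../b#top" and "http://example.org/a/b" both yield the comparison
   form "http://example.org/a/b".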

   Applications using IRIs as identity tokens with no relationship to
   a protocol MUST use the Simple String Comparison (see
   Section 5.3.1). All other applications MUST select one of the
   comparison practices from the Comparison Ladder (see Section 5.3)
   or, after IRI-to-URI conversion, select one of the comparison
   practices from the URI comparison ladder in [RFC3986],
   section 6.2.

5.2. Preparation for Comparison

   Any kind of IRI comparison REQUIRES that all escapings or
   encodings in the protocol or format that carries an IRI are
   resolved. This is usually done when the protocol or format is
   parsed. Examples of such escapings or encodings are entities and
   numeric character references in [HTML4] and [XML1]. As an
   example, "http://example.org/ros&eacute;" (in HTML),
   "http://example.org/ros&#233;" (in HTML or XML), and
   "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved
   into what is denoted in this document (see Section 1.4) as
   "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the
   actual e-acute character, to compensate for the fact that this
   document cannot contain non-ASCII characters).

   Similar considerations apply to encodings such as Transfer Codings
   in HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
   ([RFC2045]), although in these cases, the encoding is based not on
   characters but on octets, and additional care is required to make
   sure that characters, and not just arbitrary octets, are compared
   (see Section 5.3.1).

5.3. Comparison Ladder

   In practice, a variety of methods are used to test IRI
   equivalence. These methods fall into a range distinguished by the
   amount of processing required and the degree to which the
   probability of false negatives is reduced. As noted above, false
   negatives cannot be eliminated.
   In practice, their probability can be reduced, but this reduction
   requires more processing and is not cost-effective for all
   applications.

   If this range of comparison practices is considered as a ladder,
   the following discussion will climb the ladder, starting with
   practices that are cheap but have a relatively higher chance of
   producing false negatives, and proceeding to those that have
   higher computational cost and lower risk of false negatives.

5.3.1. Simple String Comparison

   If two IRIs, when considered as character strings, are identical,
   then it is safe to conclude that they are equivalent. This type
   of equivalence test has very low computational cost and is in wide
   use in a variety of applications, particularly in the domain of
   parsing. It is also used when a definitive answer to the question
   of IRI equivalence is needed that is independent of the scheme
   used and that can be calculated quickly and without accessing a
   network. An example of such a case is XML Namespaces
   ([XMLNamespace]).

   Testing strings for equivalence requires some basic precautions.
   This procedure is often referred to as "bit-for-bit" or
   "byte-for-byte" comparison, which is potentially misleading.
   Testing strings for equality is normally based on pair comparison
   of the characters that make up the strings, starting from the
   first and proceeding until both strings are exhausted and all
   characters are found to be equal, until a pair of characters
   compares unequal, or until one of the strings is exhausted before
   the other.

   This character comparison requires that each pair of characters be
   put in comparable encoding form. For example, should one IRI be
   stored in a byte array in UTF-8 encoding form and the second in a
   UTF-16 encoding form, bit-for-bit comparisons applied naively will
   produce errors.
   It is better to speak of equality on a character-for-character
   rather than on a byte-for-byte or bit-for-bit basis. In practical
   terms, character-by-character comparisons should be done codepoint
   by codepoint after conversion to a common character encoding form.
   When comparing character by character, the comparison function
   MUST NOT map IRIs to URIs, because such a mapping would create
   additional spurious equivalences. It follows that an IRI SHOULD
   NOT be modified when being transported if there is any chance that
   this IRI might be used as an identifier.

   False negatives are caused by the production and use of IRI
   aliases. Unnecessary aliases can be reduced, regardless of the
   comparison method, by consistently providing IRI references in an
   already normalized form (i.e., a form identical to what would be
   produced after normalization is applied, as described below).
   Protocols and data formats often limit some IRI comparisons to
   simple string comparison, based on the theory that people and
   implementations will, in their own best interest, be consistent in
   providing IRI references, or at least be consistent enough to
   negate any efficiency that might be obtained from further
   normalization.

5.3.2. Syntax-Based Normalization

   Implementations may use logic based on the definitions provided by
   this specification to reduce the probability of false negatives.
   This processing is moderately higher in cost than
   character-for-character string comparison. For example, an
   application using this approach could reasonably consider the
   following two IRIs equivalent:

      example://a/b/c/%7Bfoo%7D/rosé
      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9

   Web user agents, such as browsers, typically apply this type of
   IRI normalization when determining whether a cached response is
   available.
   Syntax-based normalization includes such techniques as case
   normalization, character normalization, percent-encoding
   normalization, and removal of dot-segments.

5.3.2.1. Case Normalization

   For all IRIs, the hexadecimal digits within a percent-encoding
   triplet (e.g., "%3a" versus "%3A") are case-insensitive and
   therefore should be normalized to use uppercase letters for the
   digits A-F.

   When an IRI uses components of the generic syntax, the component
   syntax equivalence rules always apply; namely, that the scheme and
   US-ASCII only host are case insensitive and therefore should be
   normalized to lowercase. For example, the URI
   "HTTP://www.EXAMPLE.com/" is equivalent to
   "http://www.example.com/". Case equivalence for non-ASCII
   characters in IRI components that are IDNs is discussed in
   Section 5.3.3. The other generic syntax components are assumed to
   be case sensitive unless specifically defined otherwise by the
   scheme.

   Creating schemes that allow case-insensitive syntax components
   containing non-ASCII characters should be avoided. Case
   normalization of non-ASCII characters can be culturally dependent
   and is always a complex operation. The only exception concerns
   non-ASCII host names for which the character normalization
   includes a mapping step derived from case folding.

5.3.2.2. Character Normalization

   The Unicode Standard [UNIV4] defines various equivalences between
   sequences of characters for various purposes. Unicode Standard
   Annex #15 [UTR15] defines various Normalization Forms for these
   equivalences, in particular Normalization Form C (NFC, Canonical
   Decomposition, followed by Canonical Composition) and
   Normalization Form KC (NFKC, Compatibility Decomposition, followed
   by Canonical Composition).
Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML Notation) is in NFC. On the other hand, "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

The former uses precombined e-acute characters, and the latter uses "e" characters followed by combining acute accents. Both usages are defined as canonically equivalent in [UNIV4].

Note: Because it is unknown how a particular sequence of characters is being treated with respect to character normalization, it would be inappropriate to allow third parties to normalize an IRI arbitrarily. This does not contradict the recommendation that when a resource is created, its IRI should be as character normalized as possible (i.e., NFC or even NFKC). This is similar to the uppercase/lowercase problems. Some parts of a URI are case insensitive (for example, the domain name). For others, it is unclear whether they are case sensitive, case insensitive, or something in between (e.g., case sensitive, but with a multiple choice selection if the wrong case is used, instead of a direct negative result). The best recipe is that the creator use a reasonable capitalization and that, when transferring the URI, capitalization never be changed.
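
The two example IRIs above can be used to show why creation-time normalization matters. The sketch below is a minimal illustration in Python; the helper name is_nfc is hypothetical and is not defined by this specification. It checks whether a string is already in NFC and shows that a plain character-by-character comparison, which by the rule above does not normalize, treats the two IRIs as different.

```python
import unicodedata

def is_nfc(iri: str) -> bool:
    """Return True if the string is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", iri) == iri

# Precomposed e-acute (U+00E9).
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

print(is_nfc(precomposed))        # True
print(is_nfc(decomposed))         # False
print(precomposed == decomposed)  # False: comparison does not normalize
```

A resource creator who applies NFC once, when the IRI is minted, avoids this false negative without requiring any party that later transports or compares the IRI to modify it.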
Various IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. Character Normalization also applies to IDNs, as discussed in Section 5.3.3.

5.3.2.3. Percent-Encoding Normalization

The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a frequent source of variance among otherwise identical IRIs. In addition to the case normalization issue noted above, some IRI producers percent-encode octets that do not require percent-encoding, resulting in IRIs that are equivalent to their nonencoded counterparts. These IRIs should be normalized by decoding any percent-encoded octet sequence that corresponds to an unreserved character, as described in Section 2.3 of [RFC3986].

For actual resolution, differences in percent-encoding (except for the percent-encoding of reserved characters) MUST always result in the same resource. For example, "http://example.org/~user", "http://example.org/%7euser", and "http://example.org/%7Euser" must resolve to the same resource.

If this kind of equivalence is to be tested, the percent-encoding of both IRIs to be compared has to be aligned; for example, by converting both IRIs to URIs (see Section 3.1), eliminating escape differences in the resulting URIs, and making sure that the case of the hexadecimal characters in the percent-encoding is always the same (preferably upper case). If the IRI is to be passed to another application or used further in some other way, its original form MUST be preserved. The conversion described here should be performed only for local comparison.

5.3.2.4. Path Segment Normalization

The complete path segments "." and ".."
are intended only for use within relative references (Section 4.1 of [RFC3986]) and are removed as part of the reference resolution process (Section 5.2 of [RFC3986]). However, some implementations may incorrectly assume that reference resolution is not necessary when the reference is already an IRI, and thus fail to remove dot-segments when they occur in non-relative paths. IRI normalizers should remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in Section 5.2.4 of [RFC3986].

5.3.3. Scheme-Based Normalization

The syntax and semantics of IRIs vary from scheme to scheme, as described by the defining specification for each scheme.

Implementations may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. For example, because the "http" scheme makes use of an authority component, has a default port of "80", and defines an empty path to be equivalent to "/", the following four IRIs are equivalent:

   http://example.com
   http://example.com/
   http://example.com:/
   http://example.com:80/

In general, an IRI that uses the generic syntax for authority with an empty path should be normalized to a path of "/". Likewise, an explicit ":port", for which the port is empty or the default for the scheme, is equivalent to one where the port and its ":" delimiter are elided and thus should be removed by scheme-based normalization. For example, the second IRI above is the normal form for the "http" scheme.

Another case where normalization varies by scheme is in the handling of an empty authority component or empty host subcomponent. For many scheme specifications, an empty authority or host is considered an error; for others, it is considered equivalent to "localhost" or the end-user's host.
When a scheme defines a default for authority and an IRI reference to that default is desired, the reference should be normalized to an empty authority for the sake of uniformity, brevity, and internationalization. If, however, either the userinfo or port subcomponents are non-empty, then the host should be given explicitly even if it matches the default.

Normalization should not remove delimiters when their associated component is empty unless it is licensed to do so by the scheme specification. For example, the IRI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two IRIs that differ only by the suffix "#" are considered different regardless of the scheme.

Some IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC3490] either in their ireg-name part or elsewhere. When in use in IRIs, those names SHOULD be validated by using the ToASCII operation defined in [RFC3490], with the flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an invalid IDN cannot successfully be resolved. Validated IDN components of IRIs SHOULD be character normalized by using the Nameprep process [RFC3491]; however, for legibility purposes, they SHOULD NOT be converted into ASCII Compatible Encoding (ACE).

Scheme-based normalization may also consider IDN components and their conversions to punycode as equivalent. As an example, "http://résumé.example.org" may be considered equivalent to "http://xn--rsum-bpad.example.org".

Other scheme-specific normalizations are possible.

5.3.4. Protocol-Based Normalization

Substantial effort to reduce the incidence of false negatives is often cost-effective for web spiders. Consequently, they implement even more aggressive techniques in IRI comparison. For example, if they observe that an IRI such as

   http://example.com/data

redirects to an IRI differing only in the trailing slash

   http://example.com/data/

they will likely regard the two as equivalent in the future. This kind of technique is only appropriate when equivalence is clearly indicated by both the result of accessing the resources and the common conventions of their scheme's dereference algorithm (in this case, use of redirection by HTTP origin servers to avoid problems with relative references).

6. Use of IRIs

6.1. Limitations on UCS Characters Allowed in IRIs

This section discusses limitations on characters and character sequences usable for IRIs beyond those given in Section 2.2 and Section 4.1. The considerations in this section are relevant when IRIs are created and when URIs are converted to IRIs.

a. The repertoire of characters allowed in each IRI component is limited by the definition of that component. For example, the definition of the scheme component does not allow characters beyond US-ASCII.

   (Note: In accordance with URI practice, generic IRI software cannot and should not check for such limitations.)

b. The UCS contains many areas of characters for which there are strong visual look-alikes. Because of the likelihood of transcription errors, these also should be avoided. This includes the full-width equivalents of Latin characters, half-width Katakana characters for Japanese, and many others. It also includes many look-alikes of "space", "delims", and "unwise", characters excluded in [RFC3491].

Additional information is available from [UNIXML].
[UNIXML] is written in the context of running text rather than in that of identifiers. Nevertheless, it discusses many of the categories of characters not appropriate for IRIs.

6.2. Software Interfaces and Protocols

Although an IRI is defined as a sequence of characters, software interfaces for URIs typically function on sequences of octets or other kinds of code units. Thus, software interfaces and protocols MUST define which character encoding is used.

Intermediate software interfaces between IRI-capable components and URI-only components MUST map the IRIs per Section 3.1, when transferring from IRI-capable to URI-only components. This mapping SHOULD be applied as late as possible. It SHOULD NOT be applied between components that are known to be able to handle IRIs.

6.3. Format of URIs and IRIs in Documents and Protocols

Document formats that transport URIs may have to be upgraded to allow the transport of IRIs. In cases where the document as a whole has a native character encoding, IRIs MUST also be encoded in this character encoding and converted accordingly by a parser or interpreter. IRI characters not expressible in the native character encoding SHOULD be escaped by using the escaping conventions of the document format if such conventions are available. Alternatively, they MAY be percent-encoded according to Section 3.1. For example, in HTML or XML, numeric character references SHOULD be used. If a document as a whole has a native character encoding and that character encoding is not UTF-8, then IRIs MUST NOT be placed into the document in the UTF-8 character encoding.

Note: Some formats already accommodate IRIs, although they use different terminology. HTML 4.0 [HTML4] defines the conversion from IRIs to URIs as error-avoiding behavior.
XML 1.0 [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications based upon them allow IRIs. Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs [CharMod].

6.4. Use of UTF-8 for Encoding Original Characters

This section discusses details and gives examples for point c) in Section 1.2. To be able to use IRIs, the URI corresponding to the IRI in question has to encode original characters into octets by using UTF-8. This can be specified for all URIs of a URI scheme or can apply to individual URIs for schemes that do not specify how to encode original characters. It can apply to the whole URI, or only to some part. For background information on encoding characters into URIs, see also Section 2.5 of [RFC3986].

For new URI schemes, using UTF-8 is recommended in [RFC2718]. Examples where UTF-8 is already used are the URN syntax [RFC2141], IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, because the HTTP URL scheme does not specify how to encode original characters, only some HTTP URLs can have corresponding but different IRIs.

For example, for a document with a URI of "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to construct a corresponding IRI (in XML notation, see Section 1.4): "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-encoded representation of that character). On the other hand, for a document with a URI of "http://www.example.org/r%E9sum%E9.html", the percent-encoded octets cannot be converted to actual characters in an IRI, as the percent-encoding is not based on UTF-8.

This means that for most URI schemes, there is no need to upgrade their scheme definition in order for them to work with IRIs.
The main case where upgrading makes sense is when a scheme definition, or a particular component of a scheme, is strictly limited to the use of US-ASCII characters with no provision to include non-ASCII characters/octets via percent-encoding, or if a scheme definition currently uses highly scheme-specific provisions for the encoding of non-ASCII characters. An example of this is the mailto: scheme [RFC2368].

This specification does not upgrade any scheme specifications in any way; this has to be done separately. Also, note that there is no such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI schemes can be used with IRIs, even though in some cases only by using URIs directly as IRIs, without any conversion.

URI schemes can impose restrictions on the syntax of scheme-specific URIs; i.e., URIs that are admissible under the generic URI syntax [RFC3986] may not be admissible due to narrower syntactic constraints imposed by a URI scheme specification. URI scheme definitions cannot broaden the syntactic restrictions of the generic URI syntax; otherwise, it would be possible to generate URIs that satisfied the scheme-specific syntactic constraints without satisfying the syntactic constraints of the generic URI syntax. However, additional syntactic constraints imposed by URI scheme specifications are applicable to IRIs, as the corresponding URI resulting from the mapping defined in Section 3.1 MUST be a valid URI under the syntactic restrictions of generic URI syntax and any narrower restrictions imposed by the corresponding URI scheme specification.

The requirement for the use of UTF-8 applies to all parts of a URI (with the potential exception of the ireg-name part; see Section 3.1).
However, it is possible that the capability of IRIs to represent a wide range of characters directly is used just in some parts of the IRI (or IRI reference). The other parts of the IRI may only contain US-ASCII characters, or they may not be based on UTF-8. They may be based on another character encoding, or they may directly encode raw binary data (see also [RFC2397]).

For example, it is possible to have a URI reference of "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the document name is encoded in iso-8859-1 based on server settings, but where the fragment identifier is encoded in UTF-8 according to [XPointer]. The IRI corresponding to the above URI would be (in XML notation) "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

Similar considerations apply to query parts. The functionality of IRIs (namely, to be able to include non-ASCII characters) can only be used if the query part is encoded in UTF-8.

6.5. Relative IRI References

Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references.

7. Legacy Extended IRIs (LEIRIs) and Hypertext References

For historic reasons, some formats have allowed variants of IRIs that are somewhat less restricted in syntax. This section provides definitions and names (Legacy Extended IRI or LEIRI [LEIRI], and Hypertext Reference or HREF [HTML5]) for these variants, for easier reference. These variants have to be used with care; they require further processing before being fully interchangeable as IRIs. New protocols and formats SHOULD NOT use Legacy Extended IRIs or Hypertext References.
Even where they are allowed, only IRIs fully conforming to the syntax definition in Section 2.2 SHOULD be created, generated, and used. The provisions in this section also apply to Legacy Extended IRI references and other related forms.

7.1. Legacy Extended IRI Syntax

The syntax of Legacy Extended IRIs is the same as that for IRIs, except that ucschar is redefined as follows:

   ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
           / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
           / %xE000-FFFD / %x10000-10FFFF

The restriction on bidirectional formatting characters in Section 4.1 is lifted. The iprivate production becomes redundant.

Likewise, the syntax for Legacy Extended IRI references (LEIRI references) is the same as that for IRI references with the above redefinition of ucschar applied.

Formats that use Legacy Extended IRIs or Legacy Extended IRI references MAY further restrict the characters allowed therein, either implicitly by the fact that the format as such does not allow some characters, or explicitly. An example of a character not allowed implicitly may be the NUL character (U+0000). However, all the characters allowed in IRIs MUST still be allowed.

7.2. Conversion of Legacy Extended IRIs to IRIs

To convert a Legacy Extended IRI (reference) to an IRI (reference), each character allowed in a Legacy Extended IRI (reference) but not allowed in an IRI (reference) (see Section 7.3) MUST be percent-encoded by applying steps 2.1 to 2.3 of Section 3.1.

7.3. Characters Allowed in Legacy Extended IRIs but not in IRIs

This section provides a list of the groups of characters and code points that are allowed in Legacy Extended IRIs, but are not allowed in IRIs or are allowed in IRIs only in the query part.
For each group of characters, advice on the usage of these characters is also given, concentrating on the reasons not to use them.

Space (U+0020): Some formats and applications use space as a delimiter, e.g. for items in a list. Appendix C of [RFC3986] also mentions that white space may have to be added when displaying or printing long URIs; the same applies to long IRIs. This means that spaces can disappear, or can cause the Legacy Extended IRI to be interpreted as two or more separate IRIs.

Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix C of [RFC3986] suggests the use of double-quotes ("http://example.com/") and angle brackets (<http://example.com/>) as delimiters for URIs in plain text. These conventions are often used, and also apply to IRIs. Legacy Extended IRIs using these characters will be cut off at the wrong place.

Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These characters were originally excluded from URIs because the respective codepoints are assigned to different graphic characters in some 7-bit or 8-bit encodings. Despite the move to Unicode, some of these characters are still occasionally displayed differently on some systems, e.g. U+005C as a Japanese Yen symbol. Also, the fact that these characters are not used in URIs or IRIs has encouraged their use outside URIs or IRIs in contexts that may include URIs or IRIs. If a Legacy Extended IRI with such a character is used in such a context, the Legacy Extended IRI will be interpreted piecemeal.

The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - #x9F): There is no way to transmit these characters reliably except potentially in electronic form.
Even when in electronic form, some software components might silently filter out some of these characters, or may stop processing altogether when encountering some of them. These characters may affect text display in subtle, unnoticeable ways or in drastic, global, and irreversible ways depending on the hardware and software involved. The use of some of these characters may allow malicious users to manipulate the display of a Legacy Extended IRI and its context.

Bidi formatting characters (U+200E, U+200F, U+202A-202E): These characters affect the display ordering of characters. Displayed Legacy Extended IRIs containing these characters cannot be converted back to electronic form (logical order) unambiguously. These characters may allow malicious users to manipulate the display of a Legacy Extended IRI and its context.

Specials (U+FFF0-FFFD): These code points provide functionality beyond that useful in a Legacy Extended IRI, for example byte order identification, annotation, and replacements for unknown characters and objects. Their use and interpretation in a Legacy Extended IRI serves no purpose and may lead to confusing display variations.

Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-10FFFD): Display and interpretation of these code points is by definition undefined without private agreement. Therefore, these code points are not suited for use on the Internet. They are not interoperable and may have unpredictable effects.

Tags (U+E0000-E0FFF): These characters provide a way to language tag in Unicode plain text. They are not appropriate for Legacy Extended IRIs because language information in identifiers cannot reliably be input, transmitted (e.g. on a visual medium such as paper), or recognized.
Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as non-characters. Applications may use some of them internally, but are not prepared to interchange them.

For reference, we here also list the code points and code units not even allowed in Legacy Extended IRIs:

Surrogate code units (D800-DFFF): These do not represent Unicode codepoints.

7.4. HyperText References

((NOTE: This section is intended to integrate the specification of browser behavior originally written in the public working draft of the HTML5 specification http://www.w3.org/TR/html5/infrastructure.html#parsing-urls into this document. This definition is an initial draft.))

A construct named Hypertext Reference (Section 7.4) (HRef, sometimes called a "web address") describes the extension of IRIs as actually deployed in popular web browsers, for use in both HTML and in a JavaScript scripting interface. The interpretation of a HRef is given as a modification of the mapping from a string of octets into a useful . The mapping is an extension of the mapping from IRI to URI. The differences are:

o  There is an additional parameter to the conversion, a character set used for encoding of the query component.

o  Leading and trailing spaces are removed.

o  Additional characters are escaped because of HRef parsing.

The HRef-charset is determined by the context. If the context does not supply a HRef-charset, then the HRef-charset is UTF-8. For web browsers interpreting HTML, it is determined as follows:

If the HRef came from a script (e.g.
as an argument to a method)
   The HRef-charset is the script's character encoding.

If the HRef came from a DOM node (e.g. from an element)
   The node has a Document, and the HRef-charset is the Document's character encoding.

If the HRef had a character encoding defined when the HRef was created or defined
   The HRef-charset is as defined.

The following steps define the mapping from an HRef into a URI-reference.

1. Strip leading and trailing instances of the space (U+0020) character.

2. Apply the algorithm in Section 3.1, mapping the result to a string of characters in the URI repertoire.

3. If the result begins with either of:

   *  a string matching the production, followed by "://"

   *  the string "//"

   then percent-encode any left or right square brackets (U+005B, U+005D, "[" and "]") following the first occurrence of "/", "?", or "#" which follows the first occurrence of "//".

4. Otherwise, percent-encode all left and right square brackets.

5. Percent-encode all occurrences of U+0023 (Number sign, "#") after the first.

6. Parse the result using the production for URI-reference in RFC 3986 [RFC3986].

7. If the result doesn't match production then perform no further action. Otherwise, parsing was successful. Parsing the encoded URI is needed to accommodate two changes to the IRI to URI mapping.

8. If there is a component and a component and the port given by the component is the same as the default port defined for the protocol given by the component, then replace the in the parsed result with the component. (NOTE: Is this step necessary? Well-defined? Only used for HTTP?)

9. If the HRef-charset is UTF-8, or if there is no query component, or if the query component contains no percent-encodings, no further processing is necessary.
   However, if the HRef-charset is not UTF-8 and there is a query component in the parsed results, then the query string is translated into the HRef-charset before percent-encoding. This can be accomplished as:

   1. Decode any hex-encoded components of the portion of the URI matching the production (which will yield a UTF-8 encoded sequence of characters.)

   2. Decode the UTF-8 encoding to create a sequence of (abstract) characters.

   3. Encode the resulting character sequence into a sequence of octets as specified by the HRef-charset; any characters which cannot be expressed in HRef-charset should be replaced with an (ASCII) '?'.

   4. Percent-encode the resulting set of octets.

8. URI/IRI Processing Guidelines (Informative)

This informative section provides guidelines for supporting IRIs in the same software components and operations that currently process URIs: Software interfaces that handle URIs, software that allows users to enter URIs, software that creates or generates URIs, software that displays URIs, formats and protocols that transport URIs, and software that interprets URIs. These may all require modification before functioning properly with IRIs. The considerations in this section also apply to URI references and IRI references.

8.1. URI/IRI Software Interfaces

Software interfaces that handle URIs, such as URI-handling APIs and protocols transferring URIs, need interfaces and protocol elements that are designed to carry IRIs.

In case the current handling in an API or protocol is based on US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as it is compatible with US-ASCII, is in accordance with the recommendations of [RFC2277], and makes converting to URIs easy. In any case, the API or protocol definition must clearly define the character encoding to be used.
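
To illustrate why UTF-8 makes converting to URIs easy, the following sketch percent-encodes the UTF-8 octets of every non-ASCII character in an IRI. It is a deliberately simplified reading of the mapping in Section 3.1: it assumes the input is already a sequence of Unicode characters in NFC and it does not give the ireg-name part any special IDN treatment. The function name iri_to_uri is hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of each non-ASCII character.

    Simplified sketch: ASCII characters are copied unchanged, and no
    IDN-specific handling is applied to the host.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            # One "%HH" triplet per UTF-8 octet, uppercase hex digits.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)

uri = iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html")
print(uri)  # http://www.example.org/r%C3%A9sum%C3%A9.html
```

Because US-ASCII characters encode to themselves in UTF-8, an interface that already carries US-ASCII URIs can pass them through such a conversion unchanged.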
The transfer from URI-only to IRI-capable components requires no mapping, although the conversion described in Section 3.2 above may be performed. It is preferable not to perform this inverse conversion when there is a chance that this cannot be done correctly.

8.2. URI/IRI Entry

Some components allow users to enter URIs into the system by typing or dictation, for example. This software must be updated to allow for IRI entry.

A person viewing a visual representation of an IRI (as a sequence of glyphs, in some order, in some visual display) or hearing an IRI will use an entry method for characters in the user's language to input the IRI. Depending on the script and the input method used, this may be a more or less complicated process.

The process of IRI entry must ensure, as much as possible, that the restrictions defined in Section 2.2 are met. This may be done by choosing appropriate input methods or variants/settings thereof, by appropriately converting the characters being input, by eliminating characters that cannot be converted, and/or by issuing a warning or error message to the user.

As an example of variant settings, input method editors for East Asian Languages usually allow the input of Latin letters and related characters in full-width or half-width versions. For IRI input, the input method editor should be set so that it produces half-width Latin letters and punctuation and full-width Katakana.

An input field primarily or solely used for the input of URIs/IRIs may allow the user to view an IRI as it is mapped to a URI. Places where the input of IRIs is frequent may provide the possibility for viewing an IRI as mapped to a URI. This will help users when some of the software they use does not yet accept IRIs.
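
Where the input method cannot be configured, an entry component might convert width variants itself. The sketch below is one possible approach, not a requirement of this section: it uses NFKC, which maps full-width Latin letters to their ordinary forms and half-width Katakana to full-width Katakana. Note that NFKC also folds other compatibility characters (ligatures, for instance), so an implementation wanting only the width conversion would need a narrower mapping. The function name fold_width is hypothetical.

```python
import unicodedata

def fold_width(text: str) -> str:
    """Apply NFKC, which among other things normalizes width variants."""
    return unicodedata.normalize("NFKC", text)

# Full-width Latin letters U+FF45 U+FF58 U+FF41 U+FF4D U+FF50 U+FF4C U+FF45.
fullwidth_latin = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"
# Half-width Katakana KA (U+FF76).
halfwidth_ka = "\uff76"

print(fold_width(fullwidth_latin))            # example
print(fold_width(halfwidth_ka) == "\u30ab")   # True: full-width KA
```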
An IRI input component interfacing to components that handle URIs, but not IRIs, must map the IRI to a URI before passing it to these components.

For the input of IRIs with right-to-left characters, please see Section 4.3.

8.3. URI/IRI Transfer between Applications

Many applications, particularly mail user agents, try to detect URIs appearing in plain text. For this, they use some heuristics based on URI syntax. They then allow the user to click on such URIs and retrieve the corresponding resource in an appropriate (usually scheme-dependent) application.

Such applications have to be upgraded to use the IRI syntax as a base for heuristics. In particular, a non-ASCII character should not be taken as the indication of the end of an IRI. Such applications also have to make sure that they correctly convert the detected IRI from the character encoding of the document or application where the IRI appears to the character encoding used by the system-wide IRI invocation mechanism, or to a URI (according to Section 3.1) if the system-wide invocation mechanism only accepts URIs.

The clipboard is another frequently used way to transfer URIs and IRIs from one application to another. On most platforms, the clipboard is able to store and transfer text in many languages and scripts. Correctly used, the clipboard transfers characters, not bytes, which will do the right thing with IRIs.

8.4. URI/IRI Generation

Systems that offer resources through the Internet, where those resources have logical names, sometimes automatically generate URIs for the resources they offer. For example, some HTTP servers can generate a directory listing for a file directory and then respond to the generated URIs with the files.

Many legacy character encodings are in use in various file systems.
1875 Many currently deployed systems do not transform the local character 1876 representation of the underlying system before generating URIs. 1878 For maximum interoperability, systems that generate resource 1879 identifiers should make the appropriate transformations. For 1880 example, if a file system contains a file named "résumé.html", 1881 a server should expose this as "r%C3%A9sum%C3%A9.html" in 1882 a URI, which allows use of "résumé.html" in an IRI, even if 1883 locally the file name is kept in a character encoding other than 1884 UTF-8. 1886 This recommendation particularly applies to HTTP servers. For FTP 1887 servers, similar considerations apply; see [RFC2640]. 1889 8.5. URI/IRI Selection 1891 In some cases, resource owners and publishers have control over the 1892 IRIs used to identify their resources. This control is mostly 1893 exercised by controlling the resource names, such as file names, 1894 directly. 1896 In these cases, it is recommended to avoid choosing IRIs that are 1897 easily confused. For example, for US-ASCII, the lower-case ell ("l") 1898 is easily confused with the digit one ("1"), and the upper-case oh 1899 ("O") is easily confused with the digit zero ("0"). Publishers 1900 should avoid confusing users with "br0ken" or "1ame" identifiers. 1902 Outside the US-ASCII repertoire, there are many more opportunities 1903 for confusion; a complete set of guidelines is too lengthy to include 1904 here. As long as names are limited to characters from a single 1905 script, native writers of a given script or language will know best 1906 when ambiguities can appear, and how they can be avoided. What may 1907 look ambiguous to a stranger may be completely obvious to the average 1908 native user. On the other hand, in some cases, the UCS contains 1909 variants for compatibility reasons; for example, for typographic 1910 purposes. These should be avoided wherever possible. 
Although there 1911 may be exceptions, newly created resource names should generally be 1912 in NFKC [UTR15] (which means that they are also in NFC). 1914 As an example, the UCS contains the "fi" ligature at U+FB01 for 1915 compatibility reasons. Wherever possible, IRIs should use the two 1916 letters "f" and "i" rather than the "fi" ligature. An example where 1917 the latter may be used is in the query part of an IRI for an explicit 1918 search for a word written with the "fi" ligature. 1920 In certain cases, there is a chance that characters from different 1921 scripts look the same. The best known example is the similarity of 1922 the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid 1923 such cases, IRIs should be created only where all the characters in a 1924 single component are used together in a given language. This usually 1925 means that all of these characters will be from the same script, but 1926 there are languages that mix characters from different scripts (such 1927 as Japanese). This is similar to the heuristics used to distinguish 1928 between letters and numbers in the examples above. Also, for Latin, 1929 Greek, and Cyrillic, using lowercase letters results in fewer 1930 ambiguities than using uppercase letters would. 1932 8.6. Display of URIs/IRIs 1934 In situations where the rendering software is not expected to display 1935 non-ASCII parts of the IRI correctly using the available layout and 1936 font resources, these parts should be percent-encoded before being 1937 displayed. 1939 For display of Bidi IRIs, please see Section 4.1. 1941 8.7. Interpretation of URIs and IRIs 1943 Software that interprets IRIs as the names of local resources should 1944 accept IRIs in multiple forms and convert and match them with the 1945 appropriate local resource names. 1947 First, multiple representations include both IRIs in the native 1948 character encoding of the protocol and also their URI counterparts. 
1950 Second, it may include URIs constructed based on character encodings 1951 other than UTF-8. These URIs may be produced by user agents that do 1952 not conform to this specification and that use legacy character 1953 encodings to convert non-ASCII characters to URIs. Whether this is 1954 necessary, and what character encodings to cover, depends on a number 1955 of factors, such as the legacy character encodings used locally and 1956 the distribution of various versions of user agents. For example, 1957 software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in 1958 addition to UTF-8. 1960 Third, it may include additional mappings to be more user-friendly 1961 and robust against transmission errors. These would be similar to 1962 how some servers currently treat URIs as case insensitive or perform 1963 additional matching to account for spelling errors. For characters 1964 beyond the US-ASCII repertoire, this may, for example, include 1965 ignoring the accents on received IRIs or resource names. Please note 1966 that such mappings, including case mappings, are language dependent. 1968 It can be difficult to identify a resource unambiguously if too many 1969 mappings are taken into consideration. However, percent-encoded and 1970 not percent-encoded parts of IRIs can always be clearly 1971 distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes 1972 the potential for collisions lower than it may seem at first. 1974 8.8. Upgrading Strategy 1976 Where this recommendation places further constraints on software for 1977 which many instances are already deployed, it is important to 1978 introduce upgrades carefully and to be aware of the various 1979 interdependencies. 1981 If IRIs cannot be interpreted correctly, they should not be created, 1982 generated, or transported. This suggests that upgrading URI 1983 interpreting software to accept IRIs should have highest priority. 
1985 On the other hand, a single IRI is interpreted only by a single or 1986 very few interpreters that are known in advance, although it may be 1987 entered and transported very widely. 1989 Therefore, IRIs benefit most from a broad upgrade of software to be 1990 able to enter and transport IRIs. However, before an individual IRI 1991 is published, care should be taken to upgrade the corresponding 1992 interpreting software in order to cover the forms expected to be 1993 received by various versions of entry and transport software. 1995 The upgrade of generating software to generate IRIs instead of using 1996 a local character encoding should happen only after the service is 1997 upgraded to accept IRIs. Similarly, IRIs should only be generated 1998 when the service accepts IRIs and the intervening infrastructure and 1999 protocol is known to transport them safely. 2001 Software converting from URIs to IRIs for display should be upgraded 2002 only after upgraded entry software has been widely deployed to the 2003 population that will see the displayed result. 2005 Where there is a free choice of character encodings, it is often 2006 possible to reduce the effort and dependencies for upgrading to IRIs 2007 by using UTF-8 rather than another encoding. For example, when a new 2008 file-based Web server is set up, using UTF-8 as the character 2009 encoding for file names will make the transition to IRIs easier. 2010 Likewise, when a new Web form is set up using UTF-8 as the character 2011 encoding of the form page, the returned query URIs will use UTF-8 as 2012 the character encoding (unless the user, for whatever reason, changes 2013 the character encoding) and will therefore be compatible with IRIs. 2015 These recommendations, when taken together, will allow for the 2016 extension from URIs to IRIs in order to handle characters other than 2017 US-ASCII while minimizing interoperability problems. 
For 2018 considerations regarding the upgrade of URI scheme definitions, see 2019 Section 6.4. 2021 9. IANA Considerations 2023 Note to the RFC Editor: Please remove this section before 2024 publication. 2026 This document does not require any actions by IANA. 2028 10. Security Considerations 2030 The security considerations discussed in [RFC3986] also apply to 2031 IRIs. In addition, the following issues require particular care for 2032 IRIs. 2034 Incorrect encoding or decoding can lead to security problems. In 2035 particular, some UTF-8 decoders do not check against overlong byte 2036 sequences. As an example, a "/" is encoded with the byte 0x2F both 2037 in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly 2038 interpret the sequence 0xC0 0xAF as a "/". A sequence such as 2039 "%C0%AF.." may pass some security tests and then be interpreted as 2040 "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion 2041 and checking are not done in the right order, and/or if reserved 2042 characters and unreserved characters are not clearly distinguished. 2044 There are various ways in which "spoofing" can occur with IRIs. 2045 "Spoofing" means that somebody may add a resource name that looks the 2046 same or similar to the user, but that points to a different resource. 2047 The added resource may pretend to be the real resource by looking 2048 very similar but may contain all kinds of changes that may be 2049 difficult to spot and that can cause all kinds of problems. Most 2050 spoofing possibilities for IRIs are extensions of those for URIs. 2052 Spoofing can occur for various reasons. First, a user's 2053 normalization expectations or actual normalization when entering an 2054 IRI or transcoding an IRI from a legacy character encoding do not 2055 match the normalization used on the server side. Conceptually, this 2056 is no different from the problems surrounding the use of case- 2057 insensitive web servers. 
For example, a popular web page with a 2058 mixed-case name ("http://big.example.com/PopularPage.html") might be 2059 "spoofed" by someone who is able to create 2060 "http://big.example.com/popularpage.html". However, the use of 2061 unnormalized character sequences, and of additional mappings for user 2062 convenience, may increase the chance for spoofing. Protocols and 2063 servers that allow the creation of resources with names that are not 2064 normalized are particularly vulnerable to such attacks. This is an 2065 inherent security problem of the relevant protocol, server, or 2066 resource and is not specific to IRIs, but it is mentioned here for 2067 completeness. 2069 Spoofing can occur in various IRI components, such as the domain name 2070 part or a path part. For considerations specific to the domain name 2071 part, see [RFC3491]. For the path part, administrators of sites that 2072 allow independent users to create resources in the same sub area may 2073 have to be careful to check for spoofing. 2075 Spoofing can occur because in the UCS many characters look very 2076 similar. Details are discussed in Section 8.5. Again, this is very 2077 similar to spoofing possibilities on US-ASCII, e.g., using "br0ken" 2078 or "1ame" URIs. 2080 Spoofing can occur when URIs with percent-encodings based on various 2081 character encodings are accepted to deal with older user agents. In 2082 some cases, particularly for Latin-based resource names, this is 2083 usually easy to detect because UTF-8-encoded names, when interpreted 2084 and viewed as legacy character encodings, produce mostly garbage. 2086 When concurrently used character encodings have a similar structure 2087 but there are no characters that have exactly the same encoding, 2088 detection is more difficult. 2090 Spoofing can occur with bidirectional IRIs, if the restrictions in 2091 Section 4.2 are not followed. 
The same visual representation may be 2092 interpreted as different logical representations, and vice versa. It 2093 is also very important that a correct Unicode bidirectional 2094 implementation be used. 2096 The use of Legacy Extended IRIs introduces additional security 2097 issues. 2099 11. Acknowledgements 2101 For contributions to this update, we would like to thank Ian Hickson, 2102 Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, 2103 Henry S. Thompson, and the XML Core Working Group of the W3C. 2105 The discussion on the issue addressed here started a long time ago. 2106 There was a thread in the HTML working group in August 1995 (under 2107 the topic of "Globalizing URIs") and in the www-international mailing 2108 list in July 1996 (under the topic of "Internationalization and 2109 URLs"), and there were ad-hoc meetings at the Unicode conferences in 2110 September 1995 and September 1997. 2112 For contributions to the previous version of this document, RFC 3987, 2113 many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, 2114 Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim 2115 Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie 2116 Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley, 2117 Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne, 2118 Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan 2119 Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan 2120 Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris 2121 Haynes, Walter Underwood, and many others. 2123 The definition of HyperText Reference was initially produced by Ian 2124 Hickson, and further edited by Dan Connolly and C. M. 2125 Sperberg-McQueen. 2127 This document is a product of the Internationalization Working Group 2128 (I18N WG) of the World Wide Web Consortium (W3C). 
Thanks to the 2129 members of the W3C I18N Working Group and Interest Group for their 2130 contributions and their work on [CharMod]. Thanks also go to the 2131 members of many other W3C Working Groups for adopting IRIs, and to 2132 the members of the Montreal IAB Workshop on Internationalization and 2133 Localization for their review. 2135 12. Change Log 2137 Note to RFC Editor: Please completely remove this section before 2138 publication. 2140 12.1. Changes from -05 to -06 2142 o Add HyperText Reference, change abstract, acks and references for 2143 it 2145 o Add Masinter back as another editor. 2147 o Masinter integrates HRef material from HTML5 spec. 2149 o Rewrite introduction sections to modernize. 2151 12.2. Changes from -04 to -05 2153 o Updated references. 2155 o Changed IPR text to pre5378Trust200902. 2157 12.3. Changes from -03 to -04 2159 o Added explicit abbreviation for LEIRIs. 2161 o Mentioned LEIRI references. 2163 o Completed text in LEIRI section about tag characters and about 2164 specials. 2166 12.4. Changes from -02 to -03 2168 o Updated some references. 2170 o Updated Michel Suignard's coordinates. 2172 12.5. Changes from -01 to -02 2174 o Added tag range to iprivate (issue private-include-tags-115). 2176 o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs. 2178 12.6. Changes from -00 to -01 2180 o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" 2181 based on input from the W3C XML Core WG. Moved the relevant 2182 subsections to the back and promoted them to a section. 2184 o Added some text re. Legacy Extended IRIs to the security section. 2186 o Added an IANA Considerations Section. 2188 o Added this Change Log Section. 2190 o Added a section about "IRIs with Spaces/Controls" (converting from 2191 a Note in RFC 3987). 2193 12.7. Changes from RFC 3987 to -00 2195 Fixed errata (see 2196 http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987). 2198 13. References 2200 13.1. 
Normative References 2202 [ASCII] American National Standards Institute, "Coded Character 2203 Set -- 7-bit American Standard Code for Information 2204 Interchange", ANSI X3.4, 1986. 2206 [ISO10646] 2207 International Organization for Standardization, "ISO/IEC 2208 10646:2003: Information Technology - Universal Multiple- 2209 Octet Coded Character Set (UCS)", ISO Standard 10646, 2210 December 2003. 2212 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2213 Requirement Levels", BCP 14, RFC 2119, March 1997. 2215 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 2216 "Internationalizing Domain Names in Applications (IDNA)", 2217 RFC 3490, March 2003. 2219 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 2220 Profile for Internationalized Domain Names (IDN)", 2221 RFC 3491, March 2003. 2223 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 2224 10646", STD 63, RFC 3629, November 2003. 2226 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2227 Resource Identifier (URI): Generic Syntax", STD 66, 2228 RFC 3986, January 2005. 2230 [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax 2231 Specifications: ABNF", STD 68, RFC 5234, January 2008. 2233 [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard 2234 Annex #9, March 2004, 2235 . 2237 [UNIV4] The Unicode Consortium, "The Unicode Standard, Version 2238 5.1.0, defined by: The Unicode Standard, Version 5.0 2239 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as 2240 amended by Unicode 4.1.0 2241 (http://www.unicode.org/versions/Unicode5.1.0/)", 2242 April 2008. 2244 [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", 2245 Unicode Standard Annex #15, March 2008, 2246 . 2249 13.2. Informative References 2251 [BidiEx] "Examples of bidirectional IRIs", 2252 . 2254 [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. 
2255 Texin, "Character Model for the World Wide Web: Resource 2256 Identifiers", World Wide Web Consortium Candidate 2257 Recommendation, November 2004, 2258 . 2260 [Duerst97] 2261 Duerst, M., "The Properties and Promises of UTF-8", Proc. 2262 11th International Unicode Conference, San Jose , 2263 September 1997, . 2266 [Gettys] Gettys, J., "URI Model Consequences", 2267 . 2269 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 2270 Specification", World Wide Web Consortium Recommendation, 2271 December 1999, 2272 . 2274 [HTML5] Hickson, I. and D. Hyatt, "A vocabulary and associated 2275 APIs for HTML and XHTML", World Wide Web 2276 Consortium Working Draft, April 2009, 2277 . 2279 [LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended 2280 IRIs for XML resource identification", World Wide Web 2281 Consortium Note, . 2283 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2284 Extensions (MIME) Part One: Format of Internet Message 2285 Bodies", RFC 2045, November 1996. 2287 [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., 2288 Atkinson, R., Crispin, M., and P. Svanberg, "The Report of 2289 the IAB Character Set Workshop held 29 February - 1 March, 2290 1996", RFC 2130, April 1997. 2292 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 2294 [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. 2296 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 2297 Languages", BCP 18, RFC 2277, January 1998. 2299 [RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The mailto 2300 URL scheme", RFC 2368, July 1998. 2302 [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. 2304 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2305 Resource Identifiers (URI): Generic Syntax", RFC 2396, 2306 August 1998. 2308 [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, 2309 August 1998. 
2311 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 2312 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 2313 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 2315 [RFC2640] Curtin, B., "Internationalization of the File Transfer 2316 Protocol", RFC 2640, July 1999. 2318 [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke, 2319 "Guidelines for new URL Schemes", RFC 2718, November 1999. 2321 [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other 2322 Markup Languages", Unicode Technical Report #20, World 2323 Wide Web Consortium Note, June 2003, 2324 . 2326 [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking 2327 Language (XLink) Version 1.0", World Wide Web 2328 Consortium Recommendation, June 2001, 2329 . 2331 [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and 2332 F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth 2333 Edition)", World Wide Web Consortium Recommendation, 2334 August 2006, . 2336 [XMLNamespace] 2337 Bray, T., Hollander, D., Layman, A., and R. Tobin, 2338 "Namespaces in XML (Second Edition)", World Wide Web 2339 Consortium Recommendation, August 2006, 2340 . 2342 [XMLSchema] 2343 Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", 2344 World Wide Web Consortium Recommendation, May 2001, 2345 . 2347 [XPointer] 2348 Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer 2349 Framework", World Wide Web Consortium Recommendation, 2350 March 2003, 2351 . 2353 Appendix A. Design Alternatives 2355 This section briefly summarizes major design alternatives and the 2356 reasons why they were not chosen. 2358 A.1. New Scheme(s) 2360 Introducing new schemes (for example, httpi:, ftpi:,...) or a new 2361 metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:, 2362 i:ftp:,...) 
was proposed to make IRI-to-URI conversion scheme 2363 dependent or to distinguish between percent-encodings resulting from 2364 IRI-to-URI conversion and percent-encodings from legacy character 2365 encodings. 2367 New schemes are not needed to distinguish URIs from true IRIs (i.e., 2368 IRIs that contain non-ASCII characters). The benefit of being able 2369 to detect the origin of percent-encodings is marginal, as UTF-8 can 2370 be detected with very high reliability. Deploying new schemes is 2371 extremely hard, so not requiring new schemes for IRIs makes 2372 deployment of IRIs vastly easier. Making conversion scheme dependent 2373 is highly inadvisable and would be encouraged by separate schemes for 2374 IRIs. Using a uniform convention for conversion from IRIs to URIs 2375 makes IRI implementation orthogonal to the introduction of actual new 2376 schemes. 2378 A.2. Character Encodings Other Than UTF-8 2380 At an early stage, UTF-7 was considered as an alternative to UTF-8 2381 when IRIs are converted to URIs. UTF-7 would not have needed 2382 percent-encoding and in most cases would have been shorter than 2383 percent-encoded UTF-8. 2385 Using UTF-8 avoids a double layering and overloading of the use of 2386 the "+" character. UTF-8 is fully compatible with US-ASCII and has 2387 therefore been recommended by the IETF, and is being used widely. 2389 UTF-7 has never been used much and is now clearly being discouraged. 2390 Requiring implementations to convert from UTF-8 to UTF-7 and back 2391 would be an additional implementation burden. 2393 A.3. New Encoding Convention 2395 Instead of using the existing percent-encoding convention of URIs, 2396 which is based on octets, the idea was to create a new encoding 2397 convention; for example, to use "%u" to introduce UCS code points. 2399 Using the existing octet-based percent-encoding mechanism does not 2400 need an upgrade of the URI syntax and does not need corresponding 2401 server upgrades. 2403 A.4. 
Indicating Character Encodings in the URI/IRI 2405 Some proposals suggested indicating the character encodings used in 2406 a URI or IRI with some new syntactic convention in the URI itself, 2407 similar to the "charset" parameter for e-mails and Web pages. As an 2408 example, the label in square brackets in 2409 "http://www.example.org/ros[iso-8859-1]é" indicated that the 2410 following "é" had to be interpreted as iso-8859-1. 2412 If UTF-8 is used exclusively, an upgrade to the URI syntax is not 2413 needed. It avoids potentially multiple labels that have to be copied 2414 correctly in all cases, even on the side of a bus or on a napkin, 2415 leading to usability problems (and being prohibitively annoying). 2416 Exclusively using UTF-8 also reduces transcoding errors and 2417 confusion. 2419 Authors' Addresses 2421 Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever 2422 possible, for example as "Dürst" in XML and HTML.) 2423 Aoyama Gakuin University 2424 5-10-1 Fuchinobe 2425 Sagamihara, Kanagawa 229-8558 2426 Japan 2428 Phone: +81 42 759 6329 2429 Fax: +81 42 759 6495 2430 Email: mailto:duerst@it.aoyama.ac.jp 2431 URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ 2432 (Note: This is the percent-encoded form of an IRI.) 2434 Michel Suignard 2435 Unicode Consortium 2436 P.O. Box 391476 2437 Mountain View, CA 94039-1476 2438 U.S.A. 2440 Phone: +1-650-693-3921 2441 Email: mailto:michel@unicode.org 2442 URI: http://www.suignard.com 2444 Larry Masinter 2445 Adobe 2446 345 Park Ave 2447 San Jose, CA 95110 2448 U.S.A. 2450 Phone: +1-408-536-3024 2451 Email: mailto:masinter@adobe.com 2452 URI: http://larry.masinter.net