Network Working Group                                          M. Duerst
Internet-Draft                                                       W3C
Expires: May 31, 2005                                        M. Suignard
                                                   Microsoft Corporation
                                                       November 30, 2004


             Internationalized Resource Identifiers (IRIs)
                          draft-duerst-iri-11

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of section 3 of RFC 3667. By submitting this Internet-Draft, each
   author represents that any applicable patent or other IPR claims of
   which he or she is aware have been or will be disclosed, and any of
   which he or she become aware will be disclosed, in accordance with
   RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 31, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2004).

Abstract

   This document defines a new protocol element, the Internationalized
   Resource Identifier (IRI), as a complement to the Uniform Resource
   Identifier (URI). An IRI is a sequence of characters from the
   Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
   URIs is defined, which means that IRIs can be used instead of URIs
   where appropriate to identify resources.

   The approach of defining a new protocol element was chosen, instead
   of extending or changing the definition of URIs, to allow a clear
   distinction and to avoid incompatibilities with existing software.
   Guidelines for the use and deployment of IRIs in various protocols,
   formats, and software components that now deal with URIs are
   provided.

Table of Contents

   1.  Introduction
     1.1  Overview and Motivation
     1.2  Applicability
     1.3  Definitions
     1.4  Notation
   2.  IRI Syntax
     2.1  Summary of IRI Syntax
     2.2  ABNF for IRI References and IRIs
   3.  Relationship between IRIs and URIs
     3.1  Mapping of IRIs to URIs
     3.2  Converting URIs to IRIs
       3.2.1  Examples
   4.  Bidirectional IRIs for Right-to-left Languages
     4.1  Logical Storage and Visual Presentation
     4.2  Bidi IRI Structure
     4.3  Input of Bidi IRIs
     4.4  Examples
   5.  Normalization and Comparison
     5.1  Equivalence
     5.2  Preparation for Comparison
     5.3  Comparison Ladder
       5.3.1  Simple String Comparison
       5.3.2  Syntax-based Normalization
       5.3.3  Scheme-based Normalization
       5.3.4  Protocol-based Normalization
   6.  Use of IRIs
     6.1  Limitations on UCS Characters Allowed in IRIs
     6.2  Software Interfaces and Protocols
     6.3  Format of URIs and IRIs in Documents and Protocols
     6.4  Use of UTF-8 for Encoding Original Characters
     6.5  Relative IRI References
   7.  URI/IRI Processing Guidelines (informative)
     7.1  URI/IRI Software Interfaces
     7.2  URI/IRI Entry
     7.3  URI/IRI Transfer Between Applications
     7.4  URI/IRI Generation
     7.5  URI/IRI Selection
     7.6  Display of URIs/IRIs
     7.7  Interpretation of URIs and IRIs
     7.8  Upgrading Strategy
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1  Normative References
     11.2  Non-normative References
   Authors' Addresses
   A.  Design Alternatives
     A.1  New Scheme(s)
     A.2  Other Character Encodings than UTF-8
     A.3  New Encoding Convention
     A.4  Indicating Character Encodings in the URI/IRI
   Intellectual Property and Copyright Statements

1. Introduction

1.1 Overview and Motivation

   A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a
   sequence of characters chosen from a limited subset of the repertoire
   of US-ASCII [ASCII] characters.

   The characters in URIs are frequently used for representing words of
   natural languages. Such usage has many advantages: such URIs are
   easier to memorize, easier to interpret, easier to transcribe, easier
   to create, and easier to guess. For most languages other than
   English, however, the natural script uses characters other than A-Z.
   For many people, handling Latin characters is as difficult as
   handling the characters of other scripts is for people who use only
   the Latin alphabet. Many languages with non-Latin scripts have
   transcriptions to Latin letters. Such transcriptions are now often
   used in URIs, but they introduce additional ambiguities.

   The infrastructure for the appropriate handling of characters from
   local scripts is now widely deployed in local versions of operating
   system and application software. Software that can handle a wide
   variety of scripts and languages at the same time is increasingly
   widespread. Also, there are increasing numbers of protocols and
   formats that can carry a wide range of characters.

   This document defines a new protocol element, called
   Internationalized Resource Identifier (IRI), by extending the syntax
   of URIs to a much wider repertoire of characters. It also defines
   "internationalized" versions corresponding to other constructs from
   [RFCYYYY], such as URI references. The syntax of IRIs is defined in
   Section 2, and the relationship between IRIs and URIs in Section 3.

   Using characters outside of A-Z in IRIs brings with it some
   difficulties. Section 4 discusses the special case of bidirectional
   IRIs, Section 5 various forms of equivalence between IRIs, and
   Section 6 the use of IRIs in different situations. Section 7 gives
   additional informative guidelines, and Section 8 security
   considerations.

1.2 Applicability

   IRIs are designed to be compatible with recommendations for new URI
   schemes [RFC2718]. The compatibility is provided by specifying a
   well defined and deterministic mapping from the IRI character
   sequence to the functionally equivalent URI character sequence.
   Practical use of IRIs (or IRI references) in place of URIs (or URI
   references) depends on the following conditions being met:

   a) The protocol or format element where IRIs are used should be
      explicitly designated to be able to carry IRIs. That is, the
      intent is not to introduce IRIs into contexts that are not defined
      to accept them. For example, XML Schema [XMLSchema] has an
      explicit type "anyURI" that includes IRIs and IRI references.
      Therefore, IRIs and IRI references can be in attributes and
      elements of type "anyURI". On the other hand, in the HTTP
      protocol [RFC2616], the Request URI is defined as a URI, which
      means that direct use of IRIs is not allowed in HTTP requests.

   b) The protocol or format carrying the IRIs should have a mechanism
      to represent the wide range of characters used in IRIs, either
      natively or by some protocol- or format-specific escaping
      mechanism (for example numeric character references in [XML1]).

   c) The URI corresponding to the IRI in question has to encode
      original characters into octets using UTF-8. For new URI schemes,
      this is recommended in [RFC2718]. It can apply to a whole scheme
      (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN
      syntax [RFC2141]). It can apply to a specific part of a URI, such
      as the fragment identifier (e.g. [XPointer]). It can apply to a
      specific URI or part(s) thereof. For details, please see Section
      6.4.

1.3 Definitions

   The following definitions are used in this document; they follow the
   terms in [RFC2130], [RFC2277] and [ISO10646]:

   character: A member of a set of elements used for the organization,
      control, or representation of data. For example, "LATIN CAPITAL
      LETTER A" names a character.

   octet: An ordered sequence of eight bits considered as a unit

   character repertoire: A set of characters (in the mathematical sense)

   sequence of characters: A sequence (one after another) of characters

   sequence of octets: A sequence (one after another) of octets

   character encoding: A method of representing a sequence of characters
      as a sequence of octets (maybe with variants). Also, a method of
      (unambiguously) converting a sequence of octets into a sequence of
      characters.

   charset: The name of a parameter or attribute used to identify a
      character encoding.

   UCS: Universal Character Set; the coded character set defined by ISO/
      IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

   IRI reference: The term "IRI reference" denotes the common usage of
      an Internationalized Resource Identifier. An IRI reference may be
      absolute or relative. However, the "IRI" that results from such a
      reference only includes absolute IRIs; any relative IRI references
      are resolved to their absolute form. Note that in [RFC2396], URIs
      did not include fragment identifiers, but in [RFCYYYY], fragment
      identifiers are part of URIs.

   running text: Human text (paragraphs, sentences, phrases) with syntax
      according to orthographic conventions of a natural language, as
      opposed to syntax defined for ease of processing by machines
      (markup, programming languages,...).

   protocol element: Any portion of a message which affects processing
      of that message by the protocol in question.

   presentation element: Presentation form corresponding to a protocol
      element, for example using a wider range of characters.

   create (a URI or IRI): With respect to URIs and IRIs, the word
      'create' is used for the initial creation. This may be the
      initial creation of a resource with a certain identifier, or the
      initial exposition of a resource under a particular identifier.

   generate (a URI or IRI): With respect to URIs and IRIs, the word
      'generate' is used when the IRI is generated by derivation from
      other information.

1.4 Notation

   RFCs and Internet Drafts currently do not allow any characters
   outside the US-ASCII repertoire. Therefore, this document uses
   various special notations to denote such characters in examples.

   In text, characters outside US-ASCII are sometimes referenced by
   using a prefix of 'U+', followed by four to six hexadecimal digits.

   To represent characters outside US-ASCII in examples, this document
   uses two notations called 'XML Notation' and 'Bidi Notation'.

   XML Notation uses leading '&#x', trailing ';', and the hexadecimal
   number of the character in the UCS in between. Example: &#x44F;
   stands for CYRILLIC SMALL LETTER YA. In this notation, an actual
   '&' is denoted by '&amp;'.

   Bidi Notation is used for bidirectional examples: lower case letters
   stand for Latin letters or other letters that are written
   left-to-right, whereas upper case letters represent Arabic or Hebrew
   letters that are written right-to-left.

   To denote actual octets in examples (as opposed to percent-encoded
   octets), the two hex digits denoting the octet are enclosed in "<"
   and ">". For example, the octet often denoted as 0xc9 is denoted
   here as <c9>.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. IRI Syntax

   This section defines the syntax of Internationalized Resource
   Identifiers (IRIs).

   As with URIs, an IRI is defined as a sequence of characters, not as a
   sequence of octets. This definition accommodates the fact that IRIs
   may be written on paper or read over the radio as well as being
   stored or transmitted digitally.
   The same IRI may be represented as different sequences of octets in
   different protocols or documents if these protocols or documents use
   different character encodings (and/or transfer encodings). Using the
   same character encoding as the containing protocol or document
   assures that the characters in the IRI can be handled (searched,
   converted, displayed,...) in the same way as the rest of the protocol
   or document.

2.1 Summary of IRI Syntax

   IRIs are defined similarly to URIs in [RFCYYYY], but the class of
   unreserved characters is extended by adding the characters of the UCS
   (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
   limitations given in the syntax rules below and in Section 6.1.

   Otherwise, the syntax and use of components and reserved characters
   is the same as that in [RFCYYYY]. All the operations defined in
   [RFCYYYY], such as the resolution of relative references, can be
   applied to IRIs by IRI-processing software in exactly the same way as
   this is done to URIs by URI-processing software.

   Characters outside the US-ASCII repertoire are not reserved and
   therefore MUST NOT be used for syntactical purposes such as to
   delimit components in newly defined schemes. As an example, it is
   not allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because
   it is in the 'iunreserved' category, in the same way as it is not
   possible to use '-' as a delimiter, because it is in the 'unreserved'
   category in URIs.

2.2 ABNF for IRI References and IRIs

   While it might be possible to define IRI references and IRIs merely
   by their transformation to URI references and URIs, they can also be
   accepted and processed directly. Therefore, an ABNF definition for
   IRI references (which are the most general concept and the start of
   the grammar) and IRIs is given here. The syntax of this ABNF is
   described in [RFC2234].
   Character numbers are taken from the UCS, without implying any actual
   binary encoding. Terminals in the ABNF are characters, not bytes.

   The following grammar closely follows the URI grammar in [RFCYYYY],
   except that the range of unreserved characters is expanded to include
   UCS characters, with the restriction that private UCS characters can
   occur only in query parts and not elsewhere. The grammar is split
   into two parts, rules that differ from [RFCYYYY] because of the
   above-mentioned expansion, and rules that are the same as in
   [RFCYYYY]. For rules that differ from those in [RFCYYYY], the names
   of the non-terminals have been changed as follows: If the
   non-terminal contains 'URI', this has been changed to 'IRI'.
   Otherwise, an 'i' has been prefixed.

   The following rules are different from [RFCYYYY]:

   IRI            = scheme ":" ihier-part [ "?" iquery ]
                         [ "#" ifragment ]

   ihier-part     = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-rootless
                  / ipath-empty

   IRI-reference  = IRI / irelative-ref

   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty

   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name

   ireg-name      = *( iunreserved / pct-encoded / sub-delims )

   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters

   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>

   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   iquery         = *( ipchar / iprivate / "/" / "?" )

   ifragment      = *( ipchar / "/" / "?" )

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

   Some productions are ambiguous. The "first-match-wins" (a.k.a.
   "greedy") algorithm applies. For details, see [RFCYYYY].

   The following are the same as in [RFCYYYY]:

   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   port           = *DIGIT

   IP-literal     = "[" ( IPv6address / IPvFuture ) "]"

   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims
                        / ":" )

   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address

   IPv4address    = dec-octet "." dec-octet "." dec-octet
                        "." dec-octet

   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   pct-encoded    = "%" HEXDIG HEXDIG

   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   This syntax does not support IPv6 scoped addressing zone identifiers.

3. Relationship between IRIs and URIs

   IRIs are meant to replace URIs in identifying resources for
   protocols, formats and software components which use a UCS-based
   character repertoire. These protocols and components may never need
   to use URIs directly, especially when the resource identifier is used
   simply for identification purposes. However, when the resource
   identifier is used for resource retrieval, it is in many cases
   necessary to determine the associated URI because most retrieval
   mechanisms are currently defined only for URIs. In this case, IRIs
   can serve as presentation elements for URI protocol elements. An
   example would be an address bar in a Web user agent. (Additional
   rationale is given in Section 3.1.)

3.1 Mapping of IRIs to URIs

   This section defines how to map an IRI to a URI.
   Everything in this section applies also to IRI references and URI
   references, as well as components thereof (for example fragment
   identifiers).

   This mapping has two purposes:

   a) Syntactical: Many URI schemes and components define additional
      syntactical restrictions not captured in Section 2.2.
      Scheme-specific restrictions are applied to IRIs by converting
      IRIs to URIs and checking the URIs against the scheme-specific
      restrictions.

   b) Interpretational: URIs identify resources in various ways. IRIs
      also identify resources. When the IRI is used solely for
      identification purposes, it is not necessary to map the IRI to a
      URI (see Section 5). However, when an IRI is used for resource
      retrieval, the resource that the IRI locates is the same as the
      one located by the URI obtained after converting the IRI according
      to the procedure defined here. This means that there is no need
      to define resolution separately on the IRI level.

   Applications MUST map IRIs to URIs using the following two steps.

   Step 1) This step generates a UCS character sequence from the
      original IRI format. This step has three variants, depending on
      the form of the input.

      Variant A) If the IRI is written on paper or read out loud, or
         otherwise represented as a sequence of characters independent
         of any character encoding: Represent the IRI as a sequence of
         characters from the UCS normalized according to Normalization
         Form C (NFC, [UTR15]).

      Variant B) If the IRI is in some digital representation (e.g. an
         octet stream) in some known non-Unicode character encoding:
         Convert the IRI to a sequence of characters from the UCS
         normalized according to NFC.

      Variant C) If the IRI is in a Unicode-based character encoding
         (for example UTF-8 or UTF-16): Do not normalize (see Section
         5.3.2.2 for details). Apply Step 2 directly to the encoded
         Unicode character sequence.

   Step 2) For each character in 'ucschar' or 'iprivate', apply Steps
      2.1 through 2.3 below.

      2.1) Convert the character to a sequence of one or more octets
         using UTF-8 [RFC3629].

      2.2) Convert each octet to %HH, where HH is the hexadecimal
         notation of the octet value. Note that this is identical to
         the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To
         reduce variability, the hexadecimal notation SHOULD use upper
         case letters.

      2.3) Replace the original character with the resulting character
         sequence (i.e., a sequence of %HH triplets).

   The above mapping from IRIs to URIs produces URIs fully conforming to
   [RFCYYYY]. The mapping is also an identity transformation for URIs
   and is idempotent -- applying the mapping a second time will not
   change anything. Every URI is by definition an IRI.

   Infrastructure accepting IRIs MAY convert the ireg-name component of
   an IRI as follows (before Step 2 above) for schemes that are known to
   use domain names in ireg-name, but where the scheme definition does
   not allow percent-encoding for ireg-name: Replace the ireg-name part
   of the IRI by the part converted using the ToASCII operation
   specified in Section 4.1 of [RFC3490] on each dot-separated label,
   and using U+002E (FULL STOP) as a label separator, with the flag
   UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to
   FALSE for creating IRIs and set to TRUE otherwise. The ToASCII
   operation may fail, but this would mean that the IRI cannot be
   resolved. This conversion SHOULD be used when the goal is to
   maximize interoperability with legacy URI resolvers. For example,
   the IRI
   http://r&#xE9;sum&#xE9;.example.org may be converted to
   http://xn--rsum-bpad.example.org instead of
   http://r%C3%A9sum%C3%A9.example.org.

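   As an illustration only, Step 2 (the straightforward conversion, not
   the optional ToASCII conversion of ireg-name) might be sketched in
   Python as follows. The sketch assumes that Step 1 has already
   produced a Python string of UCS characters. The names UCSCHAR,
   IPRIVATE, in_ranges, and iri_to_uri are hypothetical and are not
   defined by this document; the code point ranges are copied from the
   'ucschar' and 'iprivate' rules in Section 2.2.

```python
# Illustrative sketch of Step 2 of the IRI-to-URI mapping.
# All names here are hypothetical; only the ranges come from Section 2.2.

# 'ucschar': three BMP ranges, planes 1 to 13, and most of plane 14.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR.append((0xE1000, 0xEFFFD))

# 'iprivate': the private use ranges.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def in_ranges(ch, ranges):
    """Return True if the code point of ch lies in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def iri_to_uri(iri):
    """Percent-encode every 'ucschar' or 'iprivate' character of iri."""
    out = []
    for ch in iri:
        if in_ranges(ch, UCSCHAR) or in_ranges(ch, IPRIVATE):
            # Steps 2.1 to 2.3: UTF-8 octets, each written as %HH in
            # upper case, replace the original character.
            octets = ch.encode("utf-8")
            out.append("".join("%{:02X}".format(o) for o in octets))
        else:
            # US-ASCII characters, including existing %HH triplets,
            # are copied unchanged.
            out.append(ch)
    return "".join(out)


uri = iri_to_uri("http://r\u00e9sum\u00e9.example.org")
print(uri)  # http://r%C3%A9sum%C3%A9.example.org
```

   Because the output contains only US-ASCII characters, applying
   iri_to_uri to its own result changes nothing, which matches the
   idempotence property stated above.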
   An IRI with a scheme that is known to use domain names in ireg-name,
   but where the scheme definition does not allow percent-encoding for
   ireg-name, meets scheme-specific restrictions if either the
   straightforward conversion or the conversion using the ToASCII
   operation on ireg-name results in a URI that meets the
   scheme-specific restrictions. Such an IRI resolves to the URI
   obtained after converting the IRI including using the ToASCII
   operation on ireg-name. Implementations do not need to do this
   conversion as long as they produce the same result.

   Note: The difference between Variants B and C in Step 1 (Variant B
      using normalization with NFC while Variant C not using any
      normalization) is to account for the fact that in many non-Unicode
      character encodings, some text cannot be represented directly.
      For example, Vietnam is natively written "Vi&#x1EC7;t Nam"
      (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
      in NFC, but a direct transcoding from the windows-1258 character
      encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL
      LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW),
      whereas direct transcoding of other 8-bit encodings of Vietnamese
      may lead to other representations.

   Note: The uniform treatment of the whole IRI in Step 2 above is
      important to not make processing dependent on URI scheme. See
      [Gettys] for an in-depth discussion.

   Note: In practice, the difference above will not be noticed if
      mapping from IRI to URI and resolution is tightly integrated (e.g.
      carried out in the same user agent). But conversion using
      [RFC3490] may be able to better deal with backwards compatibility
      issues in case mapping and resolution are separated, as in the
      case of using an HTTP proxy.

   Note: Internationalized Domain Names may be contained in parts of an
      IRI other than the ireg-name part.
      It is the responsibility of scheme-specific implementations (if
      the Internationalized Domain Name is part of the scheme syntax) or
      of server-side implementations (if the Internationalized Domain
      Name is part of 'iquery') to apply the necessary conversions at
      the appropriate point. Example: Trying to validate the Web page at
      http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
      http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.
      example.org, which would convert to a URI of
      http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
      example.org. The server-side implementation would be responsible
      for doing the necessary conversions in order to be able to
      retrieve the Web page.

   Infrastructure accepting IRIs MAY also deal with the printable
   characters in US-ASCII that are not allowed in URIs, namely "<", ">",
   '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2 above. If
   such characters are found but are not converted, then the conversion
   SHOULD fail. Please note that the number sign ("#"), the percent
   sign ("%"), and the square bracket characters ("[", "]") are not part
   of the above list, and MUST NOT be converted. Protocols and formats
   that have used earlier definitions of IRIs including these characters
   MAY require percent-encoding of these characters as a preprocessing
   step to extract the actual IRI from a given field. Such
   preprocessing MAY also be used by applications allowing the user to
   enter an IRI.

   Note: In this process (in Step 2.3), characters allowed in URI
      references as well as existing percent-encoded sequences are not
      encoded further. (This mapping is similar to, but different from,
      the encoding applied when including arbitrary content into some
      part of a URI.)
For example, an IRI of 603 http://www.example.org/red%09rosé#red (in XML notation) is 604 converted to 605 http://www.example.org/red%09ros%C3%A9#red, not to something like 606 http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. 608 Note: Some older software transcoding to UTF-8 may produce illegal 609 output for some input, in particular for characters outside the 610 BMP (Basic Multilingual Plane). As an example, for the following 611 IRI with non-BMP characters (in XML Notation): 612 http://example.com/𐌀𐌁𐌂 613 (the first three letters of the Old Italic alphabet) the correct 614 conversion to a URI is: 615 http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 617 3.2 Converting URIs to IRIs 619 In some situations, it may be desirable to try to convert a URI into 620 an equivalent IRI. This section gives a procedure to do such a 621 conversion. The conversion described in this section will always 622 result in an IRI which maps back to the URI that was used as an input 623 for the conversion (except for potential case differences in 624 percent-encoding and for potential percent-encoded unreserved 625 characters). However, the IRI resulting from this conversion may not 626 be exactly the same as the original IRI (if there ever was one). 628 URI to IRI conversion removes percent-encodings, but not all 629 percent-encodings can be eliminated. There are several reasons for 630 this: 632 a) Some percent-encodings are necessary to distinguish 633 percent-encoded and unencoded uses of reserved characters. 635 b) Some percent-encodings cannot be interpreted as sequences of UTF-8 636 octets. 638 (Note: The octet patterns of UTF-8 are highly regular. Therefore, 639 there is a very high probability, but no guarantee, that 640 percent-encodings that can be interpreted as sequences of UTF-8 641 octets actually originated from UTF-8. For a detailed discussion, 642 see [Duerst97].) 644 c) The conversion may result in a character that is not appropriate 645 in an IRI. 
See Section 2.2, Section 4.1, and Section 6.1 for 646 further details. 648 Conversion from a URI to an IRI is done using the following steps (or 649 any other algorithm that produces the same result): 651 1) Represent the URI as a sequence of octets in US-ASCII. 653 2) Convert all percent-encodings (% followed by two hexadecimal 654 digits) except those corresponding to '%', characters in 655 'reserved', and characters in US-ASCII not allowed in URIs, to the 656 corresponding octets. 658 3) Re-percent-encode any octet produced in Step 2 that is not part of 659 a strictly legal UTF-8 octet sequence. 661 4) Re-percent-encode all octets produced in Step 3 that in UTF-8 662 represent characters that are not appropriate according to Section 663 2.2, Section 4.1, and Section 6.1. 665 5) Interpret the resulting octet sequence as a sequence of characters 666 encoded in UTF-8. 668 This procedure will convert as many percent-encoded characters as 669 possible to characters in an IRI. Because there are some choices 670 when applying Step 4 (see Section 6.1), results may vary. 672 Conversions from URIs to IRIs MUST NOT use any other character 673 encoding than UTF-8 in Steps 3 and 4 above, even if it might be 674 possible from context to guess that another character encoding than 675 UTF-8 was used in the URI. As an example, the URI 676 http://www.example.org/r%E9sum%E9.html might with some guessing be 677 interpreted to contain two e-acute characters encoded as iso-8859-1. 678 It must not be converted to an IRI containing these e-acute 679 characters. Otherwise, the IRI will in the future be mapped to 680 http://www.example.org/r%C3%A9sum%C3%A9.html, which is a different 681 URI than http://www.example.org/r%E9sum%E9.html. 683 3.2.1 Examples 685 This section shows various examples of converting URIs to IRIs. Each 686 example shows the result after applying each of the Steps 1 to 5. 687 XML Notation is used for the final result. 
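As a non-normative illustration, Steps 1 to 5 of Section 3.2 can be sketched in Python as follows. The names uri_to_iri, _flush, _seq_len, _pct, and BIDI_FORMATTING are assumptions made for this sketch and do not come from this document. Step 4 is deliberately reduced to the bidirectional formatting characters forbidden by Section 4.1; a fuller implementation would also have to apply Section 2.2 and Section 6.1.

```python
import re

# 'unreserved' characters: the only US-ASCII octets decoded in Step 2.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Simplified Step 4 (an assumption of this sketch): LRM, RLM, LRE, RLE,
# PDF, LRO, RLO are the only characters treated as "not appropriate".
BIDI_FORMATTING = frozenset({0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E})
_TOKEN = re.compile(r"%([0-9A-Fa-f]{2})|.", re.DOTALL)


def _pct(octets):
    """Percent-encode octets using upper-case hexadecimal digits."""
    return "".join("%{:02X}".format(b) for b in octets)


def _seq_len(lead):
    """Length of the UTF-8 sequence announced by a lead octet (1 if none)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1


def _flush(run):
    """Steps 3 to 5 for one run of decoded non-ASCII octets."""
    out, i = [], 0
    while i < len(run):
        chunk = bytes(run[i:i + _seq_len(run[i])])
        try:
            char = chunk.decode("utf-8")  # strict: rejects illegal sequences
        except UnicodeDecodeError:
            out.append(_pct(run[i:i + 1]))  # Step 3: re-percent-encode
            i += 1
            continue
        # Step 4: re-percent-encode characters not appropriate in an IRI.
        out.append(_pct(chunk) if ord(char) in BIDI_FORMATTING else char)
        i += len(chunk)
    return "".join(out)


def uri_to_iri(uri):
    uri.encode("ascii")  # Step 1: the input must be pure US-ASCII
    out, run = [], bytearray()
    for m in _TOKEN.finditer(uri):
        octet = int(m.group(1), 16) if m.group(1) else None
        if octet is not None and octet >= 0x80:
            run.append(octet)  # Step 2: candidate UTF-8 octet
            continue
        out.append(_flush(run))
        run.clear()
        if octet is not None and octet in UNRESERVED:
            out.append(chr(octet))  # Step 2: decode unreserved characters
        else:
            out.append(m.group(0))  # literal, or a percent-encoding that is kept
    out.append(_flush(run))
    return "".join(out)
```

Because invalid octets and forbidden characters are re-encoded with upper-case hexadecimal digits, the case of the original percent-encoding is not necessarily preserved, which matches the behavior described for the third example in Section 3.2.1.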
689 The following example contains the sequence '%C3%BC', which is a 690 strictly legal UTF-8 sequence, and which is converted into the actual 691 character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as 692 u-umlaut). 694 1) http://www.example.org/D%C3%BCrst 696 2) http://www.example.org/D<C3><BC>rst 698 3) http://www.example.org/D<C3><BC>rst 700 4) http://www.example.org/D<C3><BC>rst 702 5) http://www.example.org/Dürst 704 The following example contains the sequence '%FC', which might 705 represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the 706 iso-8859-1 character encoding. (It might represent other characters 707 in other character encodings. For example, the octet <FC> in 708 iso-8859-5 represents U+045C CYRILLIC SMALL LETTER KJE.) Because 709 <FC> is not part of a strictly legal UTF-8 sequence, it is 710 re-percent-encoded in Step 3. 712 1) http://www.example.org/D%FCrst 714 2) http://www.example.org/D<FC>rst 716 3) http://www.example.org/D%FCrst 718 4) http://www.example.org/D%FCrst 720 5) http://www.example.org/D%FCrst 722 The following example contains '%e2%80%ae', which is the 723 percent-encoded 724 UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 725 4.1 forbids the direct use of this character in an IRI. Therefore, 726 the corresponding octets are re-percent-encoded in Step 4. This 727 example shows that the case (upper or lower) of letters used in 728 percent-encodings may not be preserved. The example also contains a 729 punycode-encoded domain name label (xn--99zt52a), which is not 730 converted. 732 1) http://xn--99zt52a.example.org/%e2%80%ae 734 2) http://xn--99zt52a.example.org/<E2><80><AE> 736 3) http://xn--99zt52a.example.org/<E2><80><AE> 738 4) http://xn--99zt52a.example.org/%E2%80%AE 740 5) http://xn--99zt52a.example.org/%E2%80%AE 742 Implementations with scheme-specific knowledge MAY convert 743 punycode-encoded domain name labels to the corresponding characters 744 using the ToUnicode procedure.
Thus, for the example above, the 745 label xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), 746 leading to the overall IRI of 747 http://納豆.example.org/%E2%80%AE 749 4. Bidirectional IRIs for Right-to-left Languages 751 Some UCS characters, such as those used in the Arabic and Hebrew 752 scripts, have an inherent right-to-left (rtl) writing direction. IRIs 753 containing such characters (called bidirectional IRIs or Bidi IRIs) 754 require additional attention because of the non-trivial relation 755 between logical representation (used for digital representation as 756 well as when reading/spelling) and visual representation (used for 757 display/printing). 759 Because of the complex interaction between the logical 760 representation, the visual representation, and the syntax of a Bidi 761 IRI, a balance is needed between various requirements. The main 762 requirements are: 764 1) user-predictable conversion between visual and logical 765 representation; 767 2) the ability to include a wide range of characters in various parts 768 of the IRI; 770 3) minor or no changes or restrictions for implementations. 772 4.1 Logical Storage and Visual Presentation 774 When stored or transmitted in digital representation, bidirectional 775 IRIs MUST be in full logical order, and MUST conform to the IRI 776 syntax rules (which include the rules relevant to their scheme). 777 This ensures that bidirectional IRIs can be processed in the same way 778 as other IRIs. 780 When rendered, bidirectional IRIs MUST be rendered using the Unicode 781 Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be 782 rendered in the same way as they would be rendered if they were in a 783 left-to-right embedding, i.e. as if they were preceded by U+202A, 784 LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP 785 DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can 786 also be done in a higher-level protocol (e.g. the dir='ltr' 787 attribute in HTML).
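As a non-normative illustration, the left-to-right embedding described above can be sketched in Python as follows. The function names embed_for_display and strip_embedding are assumptions made for this sketch. The embedding characters belong only to the displayed text; they are removed again before the string is treated as an IRI.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed_for_display(iri):
    """Wrap an IRI (in logical order) in a left-to-right embedding."""
    return LRE + iri + PDF


def strip_embedding(text):
    """Recover the IRI from displayed text wrapped by embed_for_display."""
    if text.startswith(LRE) and text.endswith(PDF):
        return text[1:-1]
    return text
```

A caller would pass the result of embed_for_display to the rendering layer only, and keep the unwrapped string for storage, transmission, and comparison.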
789 There is no requirement to actually use the above embedding if the 790 display is still the same without the embedding. For example, a 791 bidirectional IRI in a text with left-to-right base directionality 792 (such as used for English or Cyrillic) that is preceded and followed 793 by whitespace and strong left-to-right characters does not need an 794 embedding. Also, a bidirectional relative IRI reference that only 795 contains strong right-to-left characters and weak characters and that 796 starts and ends with a strong right-to-left character and appears in 797 a text with right-to-left base directionality (such as used for 798 Arabic or Hebrew) and is preceded and followed by whitespace and 799 strong characters does not need an embedding. 801 In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM) may be 802 sufficient to force the correct display behavior. However, the 803 details of the Unicode Bidirectional Algorithm are not always easy to 804 understand. Implementers are strongly advised to err on the side of 805 caution and to use embedding in all cases where they are not 806 completely sure that the display behavior is unaffected without the 807 embedding. 809 The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits 810 higher-level protocols to influence bidirectional rendering. Such 811 changes by higher-level protocols MUST NOT be used if they change the 812 rendering of IRIs. 814 The bidirectional formatting characters that may be used before or 815 after the IRI to ensure correct display are themselves not part of 816 the IRI. IRIs MUST NOT contain bidirectional formatting characters 817 (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual 818 rendering of the IRI, but do not themselves appear visually. It 819 would therefore not be possible to correctly input an IRI with such 820 characters. 822 4.2 Bidi IRI Structure 824 The Unicode Bidirectional Algorithm is designed mainly for running 825 text.
To make sure that it does not affect the rendering of 826 bidirectional IRIs too much, some restrictions on bidirectional IRIs 827 are necessary. These restrictions are given in terms of delimiters 828 (structural characters, mostly punctuation such as '@', '.', ':', 829 '/') and components (usually consisting mostly of letters and 830 digits). 832 The following syntax rules from Section 2.2 correspond to components 833 for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, 834 isegment-nz, isegment-nz-nc, iquery, and ifragment. 836 Specifications that define the syntax of any of the above components 837 MAY divide them further and define smaller parts to be components 838 according to this document. As an example, the restrictions of 839 [RFC3490] on bidirectional domain names correspond to treating each 840 label of a domain name as a component for those schemes where 841 ireg-name is a domain name. Even where the components are not 842 defined formally, it may be helpful to think about some syntax in 843 terms of components and to apply the relevant restrictions. For 844 example, for the usual name/value syntax in query parts, it is 845 convenient to treat each name and each value as a component. As 846 another example, the extensions in a resource name can be treated as 847 separate components. 849 For each component, the following restrictions apply: 851 1) A component SHOULD NOT use both right-to-left and left-to-right 852 characters. 854 2) A component using right-to-left characters SHOULD start and end 855 with right-to-left characters. 857 The above restrictions are given as shoulds, rather than as musts. 858 For IRIs that are never presented visually, they are not relevant. 859 However, for IRIs in general, they are very important for ensuring 860 consistent conversion between visual presentation and logical 861 representation, in both directions.
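As a non-normative illustration, the two restrictions above can be checked for a single component with a short Python sketch. The function name bidi_problems is an assumption made for this sketch, as is the choice to treat Unicode bidirectional classes R and AL as right-to-left characters and class L as left-to-right characters; splitting an IRI into components is left to the caller.

```python
import unicodedata


def bidi_problems(component):
    """Return the Section 4.2 restrictions that a component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    is_rtl = [c in ("R", "AL") for c in classes]
    has_rtl = any(is_rtl)
    has_ltr = "L" in classes

    problems = []
    if has_rtl and has_ltr:
        # Restriction 1: no mixing of directions within one component.
        problems.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (is_rtl[0] and is_rtl[-1]):
        # Restriction 2: an rtl component starts and ends with rtl characters.
        problems.append("does not start and end with right-to-left characters")
    return problems
```

Since the restrictions are recommendations rather than absolute requirements, the sketch reports problems instead of rejecting the component; a caller might, for instance, fall back to URI notation for the whole component as described in Section 3.1.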
863 Note: In some components, the above restrictions may actually be 864 strictly enforced. For example, [RFC3490] requires that these 865 restrictions apply to the labels of a host name for those schemes 866 where ireg-name is a host name. In some other components, for 867 example path components, following these restrictions may not be 868 too difficult. For other components, such as parts of the query 869 part, it may be very difficult to enforce the restrictions, 870 because the values of query parameters may be arbitrary character 871 sequences. 873 If the above restrictions cannot be satisfied otherwise, the affected 874 component can always be mapped to URI notation as described in 875 Section 3.1. Please note that the whole component needs to be mapped 876 (see also Example 9 below). 878 4.3 Input of Bidi IRIs 880 Bidi input methods MUST generate Bidi IRIs in logical order while 881 rendering them according to Section 4.1. During input, rendering 882 SHOULD be updated after every new character that is input to avoid 883 end user confusion. 885 4.4 Examples 887 This section gives examples of bidirectional IRIs, in Bidi Notation. 888 It shows legal IRIs with the relationship between logical and visual 889 representation, and explains how certain phenomena in this 890 relationship may look strange to somebody not familiar with 891 bidirectional behavior, but familiar to users of Arabic and Hebrew. 892 It also shows what happens if the restrictions given in Section 4.2 893 are not followed. The examples below can be seen at [BidiEx], in 894 Arabic, Hebrew, and Bidi Notation variants. 896 To read the bidi text in the examples, read the visual representation 897 from left to right until you encounter a block of rtl text. Read the 898 rtl block (including slashes and other special characters) from right 899 to left, then continue at the next unread ltr character. 
901 Example 1: A single component with rtl characters is inverted: 902 logical representation: http://ab.CDEFGH.ij/kl/mn/op.html 903 visual representation: http://ab.HGFEDC.ij/kl/mn/op.html 904 Components can be read one-by-one, and each component can be read in 905 its natural direction. 907 Example 2: More than one consecutive component with rtl characters is 908 inverted as a whole: 909 logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html 910 visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html 911 A sequence of rtl components is read rtl, in the same way as a 912 sequence of rtl words is read rtl in a bidi text. 914 Example 3: All components of an IRI (except for the scheme) are rtl. 915 All rtl components are inverted overall: 916 logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV 917 visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA 918 The whole IRI (except the scheme) is read rtl. Delimiters between 919 rtl components stay between the respective components; delimiters 920 between ltr and rtl components don't move. 922 Example 4: Several sequences of rtl components are each inverted on 923 their own: 925 logical representation: http://AB.CD.ef/gh/IJ/KL.html 926 visual representation: http://DC.BA.ef/gh/LK/JI.html 927 Each sequence of rtl components is read rtl, in the same way as each 928 sequence of rtl words in an ltr text is read rtl. 930 Example 5: Example 2, applied to components of different kinds: 931 logical representation: http://ab.cd.EF/GH/ij/kl.html 932 visual representation: http://ab.cd.HG/FE/ij/kl.html 933 The inversion of the domain name label and the path component may be 934 unexpected, but is consistent with other bidi behavior. For 935 reassurance that the domain component really is "ab.cd.EF", it may be 936 helpful to read aloud the visual representation following the bidi 937 algorithm. After "http://ab.cd." one reads the RTL block 938 "E-F-slash-G-H", which corresponds to the logical representation. 
940 Example 6: Same as example 5, with more rtl components: 941 logical representation: http://ab.CD.EF/GH/IJ/kl.html 942 visual representation: http://ab.JI/HG/FE.DC/kl.html 943 The inversion of the domain name labels and the path components may 944 be easier to identify because the delimiters also move. 946 Example 7: A single rtl component with included digits: 947 logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html 948 visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html 949 Numbers are written ltr in all cases, but are treated as an 950 additional embedding inside a run of rtl characters. This is 951 completely consistent with usual bidirectional text. 953 Example 8 (not allowed): Numbers at the start or end of a rtl 954 component: 955 logical representation: http://ab.cd.ef/GH1/2IJ/KL.html 956 visual representation: http://ab.cd.ef/LK/JI1/2HG.html 957 The sequence '1/2' is interpreted by the bidi algorithm as a 958 fraction, fragmenting the components and leading to confusion. There 959 are other characters that are interpreted in a special way close to 960 numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. 962 Example 9 (not allowed): The numbers in the previous example are 963 percent-encoded: 964 logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, 965 visual representation (Hebrew): http://ab.cd.ef/%31HG/LK/JI%32.html 966 visual representation (Arabic): http://ab.cd.ef/31%HG/%LK/JI32.html 967 Depending on whether the upper-case letters represent Arabic or 968 Hebrew, the visual representation is different. 970 Example 10 (allowed, but not recommended): 971 logical representation: http://ab.CDEFGH.123/kl/mn/op.html 972 visual representation: http://ab.123.HGFEDC/kl/mn/op.html 973 Components consisting of only numbers are allowed (it would be rather 974 difficult to prohibit them), but may interact with adjacent RTL 975 components in ways that are not easy to predict. 977 5. 
Normalization and Comparison 979 Note: The structure and much of the material for this section are 980 taken from Section 6 of [RFCYYYY]; the differences are due to the 981 specifics of IRIs. 983 One of the most common operations on IRIs is simple comparison: 984 determining if two IRIs are equivalent without using the IRIs or the 985 mapped URIs to access their respective resource(s). A comparison is 986 performed every time a response cache is accessed, a browser checks 987 its history to color a link, or an XML parser processes tags within a 988 namespace. Extensive normalization prior to comparison of IRIs may 989 be used by spiders and indexing engines to prune a search space or 990 reduce duplication of request actions and response storage. 992 IRI comparison is performed with respect to some particular purpose, 993 and implementations with differing purposes will often be subject to 994 differing design trade-offs with regard to how much effort should be 995 spent in reducing aliased identifiers. This section describes a 996 variety of methods that may be used to compare IRIs, the trade-offs 997 between them, and the types of applications that might use them. 999 5.1 Equivalence 1001 Since IRIs exist to identify resources, presumably they should be 1002 considered equivalent when they identify the same resource. However, 1003 such a definition of equivalence is not of much practical use, since 1004 there is no way for an implementation to compare two resources that 1005 are not under its own control. For this reason, determination of 1006 equivalence or difference of IRIs is based on string comparison, 1007 perhaps augmented by reference to additional rules provided by URI 1008 scheme definitions. We use the terms "different" and "equivalent" to 1009 describe the possible outcomes of such comparisons, but there are 1010 many application-dependent versions of equivalence.
1012 Even though it is possible to determine that two IRIs are equivalent, 1013 IRI comparison is not sufficient to determine if two IRIs identify 1014 different resources. For example, an owner of two different domain 1015 names could decide to serve the same resource from both, resulting in 1016 two different IRIs. Therefore, comparison methods are designed to 1017 minimize false negatives while strictly avoiding false positives. 1019 In testing for equivalence, applications should not directly compare 1020 relative references; the references should be converted to their 1021 respective target IRIs before comparison. When IRIs are being 1022 compared for the purpose of selecting (or avoiding) a network action, 1023 such as retrieval of a representation, fragment components (if any) 1024 should be excluded from the comparison. 1026 Applications using IRIs as identity tokens with no relationship to a 1027 protocol MUST use the Simple String Comparison (see Section 5.3.1). 1028 All other applications MUST select one of the comparison practices 1029 from the Comparison Ladder (see Section 5.3) or, after IRI-to-URI 1030 conversion, select one of the comparison practices from the URI 1031 comparison ladder ([RFCYYYY], Section 6.2). 1033 5.2 Preparation for Comparison 1035 Any kind of IRI comparison REQUIRES that all escapings or encodings 1036 in the protocol or format that carries an IRI be resolved. This is 1037 usually done when parsing the protocol or format. Examples of such 1038 escapings or encodings are entities and numeric character references 1039 in [HTML4] and [XML1].
As an example, http://example.org/ros&eacute; 1040 (in HTML), http://example.org/ros&#233; (in HTML or XML), and 1041 http://example.org/ros&#xE9; (in HTML or XML) all get resolved into 1042 what is denoted in this document (see Section 1.4) as 1043 http://example.org/ros&#xE9; (the "&#xE9;" here standing for the 1044 actual e-acute character, to compensate for the fact that this 1045 document cannot contain non-ASCII characters). 1047 Similar considerations apply to encodings such as Transfer Codings in 1048 HTTP (see [RFC2616]) and Content Transfer Encodings in MIME [RFC2045], 1049 although in these cases, the encoding is not based on characters, but 1050 on octets, and additional care is required to make sure that 1051 characters, and not just arbitrary octets, are compared (see Section 1052 5.3.1). 1054 5.3 Comparison Ladder 1056 A variety of methods are used in practice to test IRI equivalence. 1057 These methods fall into a range, distinguished by the amount of 1058 processing required and the degree to which the probability of false 1059 negatives is reduced. As noted above, false negatives cannot be 1060 eliminated. In practice, their probability can be reduced, but this 1061 reduction requires more processing and is not cost-effective for all 1062 applications. 1064 If this range of comparison practices is considered as a ladder, the 1065 following discussion will climb the ladder, starting with those 1066 practices that are cheap but have a relatively higher chance of 1067 producing false negatives, and proceeding to those that have higher 1068 computational cost and lower risk of false negatives. 1070 5.3.1 Simple String Comparison 1072 If two IRIs, considered as character strings, are identical, then it 1073 is safe to conclude that they are equivalent.
This type of 1074 equivalence test has very low computational cost and is in wide use 1075 in a variety of applications, particularly in the domain of parsing 1076 and when a definitive answer to the question of IRI equivalence is 1077 needed that is independent of the scheme used and can be calculated 1078 quickly and without accessing a network. An example of such a case 1079 is XML Namespaces ([XMLNamespace]). 1081 Testing strings for equivalence requires some basic precautions. 1082 This procedure is often referred to as "bit-for-bit" or 1083 "byte-for-byte" comparison, which is potentially misleading. Testing 1084 of strings for equality is normally based on pairwise comparison of 1085 the characters that make up the strings, starting from the first and 1086 proceeding until both strings are exhausted and all characters found 1087 to be equal, a pair of characters compares unequal, or one of the 1088 strings is exhausted before the other. 1090 Such character comparisons require that each pair of characters be 1091 put in comparable encoding form. For example, should one IRI be 1092 stored in a byte array in UTF-8 encoding form, and the second be in a 1093 UTF-16 encoding form, bit-for-bit comparisons applied naively will 1094 produce errors. It is better to speak of equality on a 1095 character-for-character rather than byte-for-byte or bit-for-bit 1096 basis. In practical terms, character-by-character comparisons should 1097 be done codepoint-by-codepoint after conversion to a common character 1098 encoding form. When comparing character-by-character, the comparison 1099 function MUST NOT map IRIs to URIs, because such a mapping would 1100 create additional spurious equivalences. It follows that IRIs SHOULD 1101 NOT be modified when being transported if there is any chance that 1102 this IRI might be used as an identifier. 1104 False negatives are caused by the production and use of IRI aliases. 
1105 Unnecessary aliases can be reduced, regardless of the comparison 1106 method, by consistently providing IRI references in an 1107 already-normalized form (i.e., a form identical to what would be 1108 produced after normalization is applied, as described below). 1109 Protocols and data formats often choose to limit some IRI comparisons 1110 to simple string comparison, based on the theory that people and 1111 implementations will, in their own best interest, be consistent in 1112 providing IRI references, or at least consistent enough to negate any 1113 efficiency that might be obtained from further normalization. 1115 5.3.2 Syntax-based Normalization 1117 Implementations may use logic based on the definitions provided by 1118 this specification to reduce the probability of false negatives. 1119 Such processing is moderately higher in cost than 1120 character-for-character string comparison. For example, an 1121 application using this approach could reasonably consider the 1122 following two IRIs equivalent: 1124 example://a/b/c/%7Bfoo%7D/rosé 1125 eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 1127 Web user agents, such as browsers, typically apply this type of IRI 1128 normalization when determining whether a cached response is 1129 available. Syntax-based normalization includes such techniques as 1130 case normalization, character normalization, percent-encoding 1131 normalization, and removal of dot-segments. 1133 5.3.2.1 Case Normalization 1135 For all IRIs, the hexadecimal digits within a percent-encoding 1136 triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore 1137 should be normalized to use uppercase letters for the digits A-F. 1139 When an IRI uses components of the generic syntax, the component 1140 syntax equivalence rules always apply; namely, that the scheme and 1141 US-ASCII only host are case-insensitive and therefore should be 1142 normalized to lowercase. For example, the URI 1143 "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
1144 Case equivalence for non-ASCII characters in IRI components that are 1145 IDNs is discussed in Section 5.3.3. The other generic syntax 1146 components are assumed to be case-sensitive unless specifically 1147 defined otherwise by the scheme. 1149 Creating schemes that allow case-insensitive syntax components 1150 containing non-US-ASCII characters should be avoided because such a 1151 case normalization may be culturally dependent and is always a complex 1152 operation. The only exception concerns non-ASCII host names for 1153 which the character normalization includes a mapping step derived 1154 from case folding. 1156 5.3.2.2 Character Normalization 1158 The Unicode Standard [UNIV4] defines various equivalences between 1159 sequences of characters for various purposes. Unicode Standard Annex 1160 #15 [UTR15] defines various Normalization Forms for these 1161 equivalences, in particular Normalization Form C (NFC, Canonical 1162 Decomposition, followed by Canonical Composition) and Normalization 1163 Form KC (NFKC, Compatibility Decomposition, followed by Canonical 1164 Composition). 1166 Equivalence of IRIs MUST rely on the assumption that IRIs are 1167 appropriately pre-character-normalized, rather than applying 1168 character normalization when comparing two IRIs. The exceptions are 1169 conversion from a non-digital form, and conversion from a 1170 non-UCS-based character encoding to a UCS-based character encoding. 1171 In these cases, NFC or a normalizing transcoder using NFC MUST be 1172 used for interoperability. To avoid false negatives and problems 1173 with transcoding, IRIs SHOULD be created using NFC. Using NFKC may 1174 avoid even more problems, for example by choosing half-width Latin 1175 letters instead of full-width, and full-width Katakana instead of 1176 half-width. 1178 As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML 1179 Notation) is in NFC. On the other hand, 1180 http://www.example.org/re&#x301;sume&#x301;.html is not in NFC.
The 1181 former uses precombined e-acute characters; the latter uses 'e' 1182 characters followed by combining acute accents. Both usages are 1183 defined to be canonically equivalent in [UNIV4]. 1185 Note: Because it is unknown how a particular sequence of characters 1186 is being treated with respect to character normalization, it would 1187 be inappropriate to allow third parties to normalize an IRI 1188 arbitrarily. This does not contradict the recommendation that 1189 when a resource is created, its IRI should be as 1190 character-normalized as possible (i.e. NFC or even NFKC). This 1191 is similar to the upper-case/lower-case problems in 1192 URIs. 1193 Some parts of a URI are case-insensitive (domain name). For 1194 others, it is unclear whether they are case-sensitive or 1195 case-insensitive, or something in between (e.g. case-sensitive, 1196 but if the wrong case is used, a multiple choice selection is 1197 provided instead of a direct negative result). The best recipe is 1198 that the creator uses a reasonable capitalization, and when 1199 transferring the URI, that capitalization is never changed. 1201 Various IRI schemes may allow the usage of Internationalized Domain 1202 Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. 1203 Character Normalization also applies to IDNs, as discussed in Section 1204 5.3.3. 1206 5.3.2.3 Percent-Encoding Normalization 1208 The percent-encoding mechanism (Section 2.1 of [RFCYYYY]) is a 1209 frequent source of variance among otherwise identical IRIs. In 1210 addition to the case normalization issue noted above, some IRI 1211 producers percent-encode octets that do not require percent-encoding, 1212 resulting in IRIs that are equivalent to their non-encoded 1213 counterparts. Such IRIs should be normalized by decoding any 1214 percent-encoded octet sequence that corresponds to an unreserved 1215 character, as described in Section 2.3 of [RFCYYYY].
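As a non-normative illustration, the percent-encoding normalization described above, combined with the case normalization of hexadecimal digits from Section 5.3.2.1, can be sketched in Python as follows. The function name normalize_percent_encoding is an assumption made for this sketch; the set of unreserved characters is the one from the generic URI syntax (letters, digits, "-", ".", "_", "~").

```python
import re

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_percent_encoding(iri):
    """Decode percent-encoded unreserved characters; upper-case the rest."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # e.g. %7E or %7e becomes "~"
        return match.group(0).upper()  # e.g. %7b becomes %7B

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, iri)
```

The sketch leaves percent-encoded reserved characters and non-ASCII octets encoded, so it does not change which resource the IRI identifies. It is intended for local comparison only; the original form of an IRI should be kept when it is passed on.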
1217 For actual resolution, differences in percent-encoding (except for 1218 the percent-encoding of reserved characters) MUST always result in 1219 the same resource. For example, http://example.org/~user, 1220 http://example.org/%7euser and http://example.org/%7Euser must 1221 resolve to the same resource. 1223 If this kind of equivalence is to be tested, the percent-encoding of 1224 both IRIs to be compared has to be aligned, for example by converting 1225 both IRIs to URIs (see Section 3.1), eliminating escape differences 1226 in the resulting URIs, and making sure that the case of the 1227 hexadecimal characters in the percent-encoding is always the same 1228 (preferably upper case). If the IRI is to be passed to another 1229 application, or used further in some other way, its original form 1230 MUST be preserved; the conversion described here should be performed 1231 only for the purpose of local comparison. 1233 5.3.2.4 Path Segment Normalization 1235 The complete path segments "." and ".." are intended only for use 1236 within relative references (Section 4.1 of [RFCYYYY]) and are removed 1237 as part of the reference resolution process (Section 5.2 of 1238 [RFCYYYY]). However, some implementations may incorrectly assume 1239 that reference resolution is not necessary when the reference is 1240 already an IRI, and thus fail to remove dot-segments when they occur 1241 in non-relative paths. IRI normalizers should remove dot-segments by 1242 applying the remove_dot_segments algorithm to the path, as described 1243 in Section 5.2.4 of [RFCYYYY]. 1245 5.3.3 Scheme-based Normalization 1247 The syntax and semantics of IRIs vary from scheme to scheme, as 1248 described by the defining specification for each scheme. 1249 Implementations may use scheme-specific rules, at further processing 1250 cost, to reduce the probability of false negatives. 
For example, 1251 since the "http" scheme makes use of an authority component, has a 1252 default port of "80", and defines an empty path to be equivalent to 1253 "/", the following four IRIs are equivalent: 1255 http://example.com 1256 http://example.com/ 1257 http://example.com:/ 1258 http://example.com:80/ 1260 In general, an IRI that uses the generic syntax for authority with an 1261 empty path should be normalized to a path of "/"; likewise, an 1262 explicit ":port", where the port is empty or the default for the 1263 scheme, is equivalent to one where the port and its ":" delimiter are 1264 elided, and thus should be removed by scheme-based normalization. 1265 For example, the second IRI above is the normal form for the "http" 1266 scheme. 1268 Another case where normalization varies by scheme is in the handling 1269 of an empty authority component or empty host subcomponent. For many 1270 scheme specifications, an empty authority or host is considered an 1271 error; for others, it is considered equivalent to "localhost" or the 1272 end-user's host. When a scheme defines a default for authority and 1273 an IRI reference to that default is desired, the reference should be 1274 normalized to an empty authority for the sake of uniformity, brevity, 1275 and internationalization. If, however, either the userinfo or port 1276 subcomponent is non-empty, then the host should be given explicitly 1277 even if it matches the default. 1279 Normalization should not remove delimiters when their associated 1280 component is empty unless licensed to do so by the scheme 1281 specification. For example, the IRI "http://example.com/?" cannot be 1282 assumed to be equivalent to any of the examples above. Likewise, the 1283 presence or absence of delimiters within a userinfo subcomponent is 1284 usually significant to its interpretation. 
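As a non-normative sketch of the "http" rules described above, the following Python function drops an empty or default port and replaces an empty path with "/", while leaving every other delimiter (such as a trailing "?") untouched. The function name normalize_http_iri is hypothetical, and the sketch deliberately ignores all other normalizations (host case, percent-encoding, IDN handling).

```python
def normalize_http_iri(iri: str) -> str:
    """Hypothetical helper: scheme-based normalization for "http" only."""
    scheme, separator, rest = iri.partition("://")
    if not separator or scheme.lower() != "http":
        # Not an "http" IRI with an authority: leave it unchanged.
        return iri

    # The authority ends at the first "/", "?" or "#", or at the end.
    cut = len(rest)
    for delimiter in "/?#":
        index = rest.find(delimiter)
        if index != -1:
            cut = min(cut, index)
    authority, tail = rest[:cut], rest[cut:]

    # Remove ":port" when the port is empty or the default ("80").
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host

    # An empty path is equivalent to "/" for this scheme.
    if not tail.startswith("/"):
        tail = "/" + tail

    return "http://" + authority + tail


if __name__ == "__main__":
    for candidate in ("http://example.com",
                      "http://example.com/",
                      "http://example.com:/",
                      "http://example.com:80/",
                      "http://example.com/?"):
        print(candidate, "->", normalize_http_iri(candidate))
```

The first four inputs map to the same string, whereas "http://example.com/?" keeps its "?" delimiter and therefore remains distinct.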
The fragment component is 1285 not subject to any scheme-based normalization; thus, two IRIs that 1286 differ only by the suffix "#" are considered different regardless of 1287 the scheme. 1289 Some IRI schemes may allow the usage of Internationalized Domain 1290 Names (IDN) [RFC3490] either in their ireg-name part or elsewhere. 1291 When in use in IRIs, those names SHOULD be validated using the 1292 ToASCII operation defined in [RFC3490], with the flags 1293 "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an 1294 invalid IDN cannot successfully be resolved. Validated IDN 1295 components of IRIs SHOULD be character normalized using the Nameprep 1296 process [RFC3491]; however, for legibility purposes, they SHOULD NOT 1297 be converted into ASCII Compatible Encoding (ACE). 1299 Scheme-based normalization may also consider IDN components and their 1300 conversions to punycode as equivalent. As an example, 1301 http://résumé.example.org may be considered equivalent to 1302 http://xn--rsum-bpad.example.org 1304 Other scheme-specific normalizations are possible. 1306 5.3.4 Protocol-based Normalization 1308 Web spiders, for which substantial effort to reduce the incidence of 1309 false negatives is often cost-effective, are observed to implement 1310 even more aggressive techniques in IRI comparison. For example, if 1311 they observe that an IRI such as 1313 http://example.com/data 1315 redirects to an IRI differing only in the trailing slash 1317 http://example.com/data/ 1319 they will likely regard the two as equivalent in the future. This 1320 kind of technique is only appropriate when equivalence is clearly 1321 indicated by both the result of accessing the resources and the 1322 common conventions of their scheme's dereference algorithm (in this 1323 case, use of redirection by HTTP origin servers to avoid problems 1324 with relative references). 1326 6. 
Use of IRIs 1328 6.1 Limitations on UCS Characters Allowed in IRIs 1330 This section discusses limitations on characters and character 1331 sequences usable for IRIs beyond those given in Section 2.2 and 1332 Section 4.1. The considerations in this section are relevant when 1333 creating IRIs and when converting from URIs to IRIs. 1335 a) The repertoire of characters allowed in each IRI component is 1336 limited by the definition of that component. For example, the 1337 definition of the scheme component does not allow characters 1338 beyond US-ASCII. 1340 (Note: In accordance with URI practice, generic IRI software 1341 cannot and should not check for such limitations.) 1343 b) The UCS contains many areas of characters for which there are 1344 strong visual look-alikes. Because of the likelihood of 1345 transcription errors, these also should be avoided. This includes 1346 the full-width equivalents of Latin characters, half-width 1347 Katakana characters for Japanese, and many others. This also 1348 includes many look-alikes of "space", "delims", and "unwise", 1349 characters excluded in [RFC3491]. 1351 Additional information is available from [UNIXML]. [UNIXML] is 1352 written in the context of running text rather than in the context of 1353 identifiers. Nevertheless, it discusses many of the categories of 1354 characters not appropriate for IRIs. 1356 6.2 Software Interfaces and Protocols 1358 Although an IRI is defined as a sequence of characters, software 1359 interfaces for URIs typically function on sequences of octets or 1360 other kinds of code units. Thus, software interfaces and protocols 1361 MUST define which character encoding is used. 1363 Intermediate software interfaces between IRI-capable components and 1364 URI-only components MUST map the IRIs per Section 3.1, when 1365 transferring from IRI-capable to URI-only components. Such a mapping 1366 SHOULD be applied as late as possible. 
It SHOULD NOT be applied 1367 between components that are known to be able to handle IRIs. 1369 6.3 Format of URIs and IRIs in Documents and Protocols 1371 Document formats that transport URIs may need to be upgraded to allow 1372 the transport of IRIs. In those cases where the document as a whole 1373 has a native character encoding, IRIs MUST also be encoded in this 1374 character encoding, and converted accordingly by a parser or 1375 interpreter. IRI characters that are not expressible in the native 1376 character encoding SHOULD be escaped using the escaping conventions 1377 of the document format if such conventions are available. 1378 Alternatively, they MAY be percent-encoded according to Section 3.1. 1379 For example, in HTML or XML, numeric character references SHOULD be 1380 used. If a document as a whole has a native character encoding, and 1381 that character encoding is not UTF-8, then IRIs MUST NOT be placed 1382 into the document in the UTF-8 character encoding. 1384 Note: Some formats already accommodate IRIs, although they use 1385 different terminology. HTML 4.0 [HTML4] defines the conversion from 1386 IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink 1387 [XLink], and XML Schema [XMLSchema] and specifications based upon 1388 them allow IRIs. Also, it is expected that all relevant new W3C 1389 formats and protocols will be required to handle IRIs [CharMod]. 1391 6.4 Use of UTF-8 for Encoding Original Characters 1393 This section discusses details and gives examples for point c) in 1394 Section 1.2. In order to be able to use IRIs, the URI corresponding 1395 to the IRI in question has to encode original characters into octets 1396 using UTF-8. This can be specified for all URIs of a URI scheme, or 1397 can apply to individual URIs for schemes that do not specify how to 1398 encode original characters. It can apply to the whole URI, or only 1399 some part. 
For background information on encoding characters into 1400 URIs, see also Section 2.5 of [RFCYYYY]. 1402 For new URI schemes, using UTF-8 is recommended in [RFC2718]. 1403 Examples where UTF-8 is already used are the URN syntax [RFC2141], 1404 IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, 1405 because the HTTP URL scheme does not specify how to encode original 1406 characters, only some HTTP URLs can have corresponding but different 1407 IRIs. 1409 For example, for a document with a URI of 1410 http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to 1411 construct a corresponding IRI (in XML notation, see Section 1.4): 1412 http://www.example.org/résumé.html (é stands for the 1413 e-acute character, and %C3%A9 is the UTF-8 encoded and 1414 percent-encoded representation of that character). On the other 1415 hand, for a document with a URI of 1416 http://www.example.org/r%E9sum%E9.html, the percent-encoded octets 1417 cannot be converted to actual characters in an IRI, because the 1418 percent-encoding is not based on UTF-8. 1420 This means that for most URI schemes, there is no need to upgrade 1421 their scheme definition in order for them to work with IRIs. The 1422 main case where upgrading a scheme definition makes sense is when a 1423 scheme definition, or a particular component of a scheme, is strictly 1424 limited to the use of US-ASCII characters with no provision to 1425 include non-ASCII characters/octets via percent-encoding, or if a 1426 scheme definition currently uses highly scheme-specific provisions 1427 for the encoding of non-ASCII characters. An example of such a 1428 scheme might be the mailto: scheme [RFC2368]. 1430 This specification does not upgrade any scheme specifications in any 1431 way; this has to be done separately.
Also, it should be noted that 1432 there is no such thing as an "IRI scheme"; all IRIs use URI schemes, 1433 and all URI schemes can be used with IRIs, even though in some cases 1434 only by using URIs directly as IRIs, without any conversion. 1436 URI schemes can impose restrictions on the syntax of scheme-specific 1437 URIs, i.e. URIs that are admissible under the generic URI syntax 1438 [RFCYYYY] may not be admissible due to narrower syntactic constraints 1439 imposed by a URI scheme specification. URI scheme definitions cannot 1440 broaden the syntactic restrictions of the generic URI syntax, 1441 otherwise it would be possible to generate URIs that satisfied the 1442 scheme-specific syntactic constraints without satisfying the 1443 syntactic constraints of the generic URI syntax. However, additional 1444 syntactic constraints imposed by URI scheme specifications are 1445 applicable to IRIs, since the corresponding URI resulting from the 1446 mapping defined in Section 3.1 MUST be a valid URI under the 1447 syntactic restrictions of generic URI syntax and any narrower 1448 restrictions imposed by the corresponding URI scheme specification. 1450 The requirement for the use of UTF-8 applies to all parts of a URI 1451 (with the potential exception of the ireg-name part, see Section 1452 3.1). However, it is possible that the capability of IRIs to 1453 represent a wide range of characters directly is used just in some 1454 parts of the IRI (or IRI reference). The other parts of the IRI may 1455 only contain US-ASCII characters, or they may not be based on UTF-8. 1456 They may be based on another character encoding, or they may directly 1457 encode raw binary data (see also [RFC2397]). 1459 For example, it is possible to have a URI reference of 1460 http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the 1461 document name is encoded in iso-8859-1 based on server settings, but 1462 the fragment identifier is encoded in UTF-8 according to [XPointer].
1463 The IRI corresponding to the above URI would be (in XML notation) 1464 http://www.example.org/r%E9sum%E9.xml#résumé. 1466 Similar considerations apply to query parts. The functionality of 1467 IRIs (namely to be able to include non-ASCII characters) can only be 1468 used if the query part is encoded in UTF-8. 1470 6.5 Relative IRI References 1472 Processing of relative IRI references against a base is handled 1473 straightforwardly; the algorithms of [RFCYYYY] can be applied 1474 directly, treating the characters additionally allowed in IRI 1475 references in the same way as unreserved characters in URI 1476 references. 1478 7. URI/IRI Processing Guidelines (informative) 1480 This informative section provides guidelines for supporting IRIs in 1481 the same software components and operations that currently process 1482 URIs: software interfaces that handle URIs, software that allows 1483 users to enter URIs, software that creates or generates URIs, 1484 software that displays URIs, formats and protocols that transport 1485 URIs, and software that interprets URIs. These may all require more 1486 or less modification before functioning properly with IRIs. The 1487 considerations in this section also apply to URI references and IRI 1488 references. 1490 7.1 URI/IRI Software Interfaces 1492 Software interfaces that handle URIs, such as URI-handling APIs and 1493 protocols transferring URIs, need interfaces and protocol elements 1494 that are designed to carry IRIs. 1496 In case the current handling in an API or protocol is based on 1497 US-ASCII, UTF-8 is recommended as the character encoding for IRIs, 1498 because this is compatible with US-ASCII, is in accordance with the 1499 recommendations of [RFC2277], and makes it easy to convert to URIs 1500 where necessary. In any case, the API or protocol definition must 1501 clearly define the character encoding to be used. 
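To illustrate why UTF-8 makes the conversion to URIs easy, the core of the mapping in Section 3.1 can be sketched in a few lines of Python. This is a simplified, non-normative sketch: the function name iri_to_uri is an assumption of this example, the input is assumed to be a string of Unicode characters that is already suitably normalized, and the optional ToASCII treatment of the ireg-name part is omitted. ASCII characters, including existing percent-encodings, are passed through unchanged.

```python
def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-encode non-ASCII characters as UTF-8."""
    pieces = []
    for char in iri:
        if ord(char) < 0x80:
            # US-ASCII characters (including "%XX" sequences that are
            # already present) are copied through untouched.
            pieces.append(char)
        else:
            # Encode the character in UTF-8 and percent-encode each
            # resulting octet with upper-case hexadecimal digits.
            for octet in char.encode("utf-8"):
                pieces.append("%{:02X}".format(octet))
    return "".join(pieces)


if __name__ == "__main__":
    print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
    print(iri_to_uri("http://www.example.org/r%E9sum%E9.xml#r\u00e9sum\u00e9"))
```

Because existing percent-encodings are left alone, a reference whose path uses a legacy encoding and whose fragment contains non-ASCII characters maps to a URI in which only the fragment is newly percent-encoded.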
1503 The transfer from URI-only to IRI-capable components requires no 1504 mapping, although the conversion described in Section 3.2 above may 1505 be performed. It is preferable not to perform this inverse 1506 conversion when there is a chance that this cannot be done correctly. 1508 7.2 URI/IRI Entry 1510 There are components that allow users to enter URIs into the system, 1511 for example by typing or dictation. This software must be updated to 1512 allow for IRI entry. 1514 A person viewing a visual representation of an IRI (as a sequence of 1515 glyphs, in some order, in some visual display) or hearing an IRI, 1516 will use an entry method for characters in the user's language to 1517 input the IRI. Depending on the script and the input method used, 1518 this may be a more or less complicated process. 1520 The process of IRI entry must ensure, as far as possible, that the 1521 restrictions defined in Section 2.2 are met. This may be done by 1522 choosing appropriate input methods or variants/settings thereof, by 1523 appropriately converting the characters being input, by eliminating 1524 characters that cannot be converted, and/or by issuing a warning or 1525 error message to the user. 1527 As an example of variant settings, input method editors for East 1528 Asian Languages usually allow the input of Latin letters and related 1529 characters in full-width or half-width versions. For IRI input, the 1530 input method editor should be set so that it produces half-width 1531 Latin letters and punctuation, and full-width Katakana. 1533 An input field primarily or only used for the input of URIs/IRIs may 1534 allow the user to view an IRI as mapped to a URI. Places where the 1535 input of IRIs is frequent may provide the possibility for viewing an 1536 IRI as mapped to a URI. This will help users when some of the 1537 software they use does not yet accept IRIs.
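The conversion of input characters mentioned earlier in this section can be illustrated with Unicode compatibility normalization. The sketch below assumes Python's standard unicodedata module; the function name normalize_iri_input is hypothetical. NFKC maps full-width Latin letters to their US-ASCII counterparts and half-width Katakana to full-width Katakana, which matches the input method settings suggested above. It also folds other compatibility characters (for example ligatures), so whether to apply it automatically or merely to warn the user remains a decision for the entry component.

```python
import unicodedata


def normalize_iri_input(text: str) -> str:
    """Hypothetical helper: fold compatibility variants typed by a user."""
    # NFKC replaces compatibility characters (full-width Latin,
    # half-width Katakana, ligatures, ...) by their preferred forms
    # and then composes the result.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    typed = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"  # full-width "example"
    print(normalize_iri_input(typed))
    print(normalize_iri_input("\uff76"))  # half-width Katakana KA
```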
1539 An IRI input component that interfaces to components that handle 1540 URIs, but not IRIs, must map the IRI to a URI before passing it to 1541 such a component. 1543 For the input of IRIs with right-to-left characters, please see 1544 Section 4.3. 1546 7.3 URI/IRI Transfer Between Applications 1548 Many applications, in particular many mail user agents, try to detect 1549 URIs appearing in plain text. For this, they use some heuristics 1550 based on URI syntax. They then allow the user to click on such URIs 1551 and retrieve the corresponding resource in an appropriate (usually 1552 scheme-dependent) application. 1554 Such applications have to be upgraded to use the IRI syntax rather 1555 than the URI syntax as a base for heuristics. In particular, a 1556 non-ASCII character should not be taken as the indication of the end 1557 of an IRI. Such applications also have to make sure that they 1558 correctly convert the detected IRI from the character encoding of the 1559 document or application where the IRI appears to the character 1560 encoding used by the system-wide IRI invocation mechanism, or to a 1561 URI (according to Section 3.1) if the system-wide invocation 1562 mechanism only accepts URIs. 1564 The clipboard is another frequently used way to transfer URIs and 1565 IRIs from one application to another. On most platforms, the 1566 clipboard is able to store and transfer text in many languages and 1567 scripts. Correctly used, the clipboard transfers characters, not 1568 bytes, which will do the right thing with IRIs. 1570 7.4 URI/IRI Generation 1572 Systems that offer resources through the Internet, where those 1573 resources have logical names, sometimes automatically generate URIs 1574 for the resources they offer. For example, some HTTP servers can 1575 generate a directory listing for a file directory, and then respond 1576 to the generated URIs with the files. 1578 Many legacy character encodings are in use in various file systems. 
1579 Many currently deployed systems do not transform the local character 1580 representation of the underlying system before generating URIs. 1582 For maximum interoperability, systems that generate resource 1583 identifiers should do the appropriate transformations. For example, 1584 if a file system contains a file named résumé.html, a 1585 server should expose this as r%C3%A9sum%C3%A9.html in a URI, which 1586 allows the use of résumé.html in an IRI, even if the file name 1587 locally is kept in a character encoding other than UTF-8. 1589 This recommendation in particular applies to HTTP servers. For FTP 1590 servers, similar considerations apply, see in particular [RFC2640]. 1592 7.5 URI/IRI Selection 1594 In some cases, resource owners and publishers have control over the 1595 IRIs used to identify their resources. Such control is mostly 1596 exercised by controlling the resource names, such as file names, 1597 directly. 1599 In such cases, it is recommended to avoid choosing IRIs that are 1600 easily confused. For example, for US-ASCII, the lower-case ell "l" 1601 is easily confused with the digit one "1", and the upper-case oh "O" 1602 is easily confused with the digit zero "0". Publishers should avoid 1603 confusing users with "br0ken" or "1ame" identifiers. 1605 Outside of the US-ASCII repertoire, there are many more opportunities 1606 for confusion; a complete set of guidelines is too lengthy to include 1607 here. As long as names are limited to characters from a single 1608 script, native writers of a given script or language will know best 1609 when ambiguities can appear, and how they can be avoided. What may 1610 look ambiguous to a stranger may be completely obvious to the average 1611 native user. On the other hand, in some cases, the UCS contains 1612 variants for compatibility reasons, for example for typographic 1613 purposes. These should be avoided wherever possible.
Although there 1614 may be exceptions, in general newly created resource names should be 1615 in NFKC [UTR15] (which means that they are also in NFC). 1617 As an example, the UCS contains the 'fi' ligature at U+FB01 for 1618 compatibility reasons. Wherever possible, IRIs should use the two 1619 letters 'f' and 'i' rather than the 'fi' ligature. An example where 1620 the latter may be used is in the query part of an IRI for an explicit 1621 search for a word written containing the 'fi' ligature. 1623 In certain cases, there is a chance that characters from different 1624 scripts look the same. The best known example is the Latin 'A', the 1625 Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs 1626 should be created where all the characters in a single component are 1627 used together in a given language. This usually means that all these 1628 characters will be from the same script, but there are languages that 1629 mix characters from different scripts (such as Japanese). This is 1630 similar to the heuristics used to distinguish between letters and 1631 numbers in the examples above. Also, for Latin, Greek, and Cyrillic, 1632 using lower-case letters results in fewer ambiguities than using 1633 upper-case letters. 1635 7.6 Display of URIs/IRIs 1637 In situations where the rendering software is not expected to display 1638 non-ASCII parts of the IRI correctly using the available layout and 1639 font resources, these parts should be percent-encoded before being 1640 displayed. 1642 For display of Bidi IRIs, please see Section 4.1. 1644 7.7 Interpretation of URIs and IRIs 1646 Software that interprets IRIs as the names of local resources should 1647 accept IRIs in multiple forms, and convert and match them with the 1648 appropriate local resource names. 1650 First, multiple representations include both IRIs in the native 1651 character encoding of the protocol and also their URI counterparts. 
1653 Second, it may include URIs constructed based on other character 1654 encodings than UTF-8. Such URIs may be produced by user agents that 1655 do not conform to this specification and use legacy character 1656 encodings to convert non-ASCII characters to URIs. Whether this is 1657 necessary and what character encodings to cover, depends on a number 1658 of factors, such as the legacy character encodings used locally and 1659 the distribution of various versions of user agents. For example, 1660 software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in 1661 addition to UTF-8. 1663 Third, it may include additional mappings to be more user-friendly 1664 and robust against transmission errors. These would be similar to 1665 how currently some servers treat URIs as case-insensitive, or perform 1666 additional matching to account for spelling errors. For characters 1667 beyond the US-ASCII repertoire, this may for example include ignoring 1668 the accents on received IRIs or resource names where appropriate. 1669 Please note that such mappings, including case mappings, are 1670 language-dependent. 1672 It can be difficult to unambiguously identify a resource if too many 1673 mappings are taken into consideration. However, percent-encoded and 1674 not percent-encoded parts of IRIs can always clearly be 1675 distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes 1676 the potential for collisions lower than it may seem at first sight. 1678 7.8 Upgrading Strategy 1680 Where this recommendation places further constraints on software for 1681 which many instances are already deployed, it is important to 1682 introduce upgrades carefully, and to be aware of the various 1683 interdependencies. 1685 If IRIs cannot be interpreted correctly, they should not be created, 1686 generated, or transported. This suggests that upgrading URI 1687 interpreting software to accept IRIs should have highest priority. 
1689 On the other hand, a single IRI is interpreted only by a single or 1690 very few interpreters that are known in advance, while it may be 1691 entered and transported very widely. 1693 Therefore, IRIs benefit most from a broad upgrade of software to be 1694 able to enter and transport IRIs, but before publishing any 1695 individual IRI, care should be taken to upgrade the corresponding 1696 interpreting software in order to cover the forms expected to be 1697 received by various versions of entry and transport software. 1699 The upgrade of generating software to generate IRIs instead of using 1700 a local character encoding should happen only after the service is 1701 upgraded to accept IRIs. Similarly, IRIs should only be generated 1702 when the service accepts IRIs and the intervening infrastructure and 1703 protocol is known to transport them safely. 1705 Software converting from URIs to IRIs for display should be upgraded 1706 only after upgraded entry software has been widely deployed to the 1707 population that will see the displayed result. 1709 It is often possible to reduce the effort and dependencies for 1710 upgrading to IRIs by using UTF-8 rather than another character 1711 encoding where there is a free choice of character encodings. For 1712 example, when setting up a new file-based Web server, using UTF-8 as 1713 the character encoding for file names will make the transition to 1714 IRIs easier. Likewise, when setting up a new Web form using UTF-8 as 1715 the character encoding of the form page, the returned query URIs will 1716 use UTF-8 as the character encoding (unless the user, for whatever 1717 reason, changes the character encoding) and will therefore be 1718 compatible with IRIs. 1720 These recommendations, when taken together, will allow for the 1721 extension from URIs to IRIs in order to handle characters other than 1722 US-ASCII while minimizing interoperability problems. 
For 1723 considerations regarding the upgrade of URI scheme definitions, 1724 please see Section 6.4. 1726 8. Security Considerations 1728 The security considerations discussed in [RFCYYYY] also apply to 1729 IRIs. In addition, the following issues require particular care for 1730 IRIs. 1732 Incorrect encoding or decoding can lead to security problems. In 1733 particular, some UTF-8 decoders do not check against overlong byte 1734 sequences. As an example, a '/' is encoded with the byte 0x2F both 1735 in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly 1736 interpret the sequence 0xC0 0xAF as a '/'. A sequence such as 1737 '%C0%AF..' may pass some security tests and then be interpreted as '/ 1738 ..' in a path if UTF-8 decoders are fault-tolerant, if conversion and 1739 checking are not done in the right order, and/or if reserved 1740 characters and unreserved characters are not clearly distinguished. 1742 There are various ways in which "spoofing" can occur with IRIs. 1743 "Spoofing" means that somebody may add a resource name that looks the 1744 same or similar to the user, but points to a different resource. The 1745 added resource may pretend to be the real resource by looking very 1746 similar, but may contain all kinds of changes that may be difficult 1747 to spot and can cause all kinds of problems. Most spoofing 1748 possibilities for IRIs are extensions of those for URIs. 1750 Spoofing can occur for various reasons. A first reason is that 1751 normalization expectations of a user or actual normalization when 1752 entering an IRI, or when transcoding an IRI from a legacy character 1753 encoding, do not match the normalization used on the server side. 1754 Conceptually, this is no different from the problems surrounding the 1755 use of case-insensitive web servers. 
For example, a popular web page 1756 with a mixed case name (http://big.example.com/PopularPage.html) 1757 might be "spoofed" by someone who is able to create 1758 http://big.example.com/popularpage.html. However, the use of 1759 unnormalized character sequences, and of additional mappings for user 1760 convenience, may increase the chance for spoofing. Protocols and 1761 servers that allow the creation of resources with names that are not 1762 normalized are particularly vulnerable to such attacks. This is an 1763 inherent security problem of the relevant protocol, server, or 1764 resource, and not specific to IRIs, but mentioned here for 1765 completeness. 1767 Spoofing can occur in various IRI components, such as the domain name 1768 part or a path part. For considerations specific to the domain name 1769 part, see [RFC3491]. For the path part, administrators of sites 1770 which allow independent users to create resources in the same subarea 1771 may need to be careful to check for spoofing. 1773 Spoofing can occur because in the UCS, there are many characters that 1774 look very similar. Details are discussed in Section 7.5. Again, 1775 this is very similar to spoofing possibilities on US-ASCII, e.g. 1776 using 'br0ken' or '1ame' URIs. 1778 Spoofing can occur when URIs with percent-encodings based on various 1779 character encodings are accepted to deal with older user agents. In 1780 some cases, in particular for Latin-based resource names, this is 1781 usually easy to detect because UTF-8-encoded names, when interpreted 1782 and viewed as legacy character encodings, produce mostly garbage. In 1783 other cases, when concurrently used character encodings have a 1784 similar structure, but there are no characters that have exactly the 1785 same encoding, detection is more difficult. 1787 Spoofing can occur with bidirectional IRIs, if the restrictions in 1788 Section 4.2 are not followed. 
The same visual representation may be 1789 interpreted as different logical representations, and vice versa. It 1790 is also very important that a correct Unicode bidirectional 1791 implementation is used. 1793 9. IANA Considerations 1795 This document has no actions for IANA. 1797 10. Acknowledgements 1799 We would like to thank Larry Masinter for his work as coauthor of 1800 many earlier versions of this document (draft-masinter-url-i18n-xx). 1802 The discussion on the issue addressed here started a long time 1803 ago. There was a thread in the HTML working group in August 1995 1804 (under the topic of "Globalizing URIs") and in the www-international 1805 mailing list in July 1996 (under the topic of "Internationalization 1806 and URLs"), and ad-hoc meetings at the Unicode conferences in 1807 September 1995 and September 1997. 1809 Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, 1810 Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim 1811 Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie 1812 Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley, 1813 Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne, 1814 Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan 1815 Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan 1816 Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris 1817 Haynes, Walter Underwood, and many others for help with understanding 1818 the issues and possible solutions, and getting the details right. 1820 This document is a product of the Internationalization Working Group 1821 (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the 1822 members of the W3C I18N Working Group and Interest Group for their 1823 contributions and their work on [CharMod].
Thanks also go to the 1824 members of many other W3C Working Groups for adopting IRIs, and to 1825 the members of the Montreal IAB Workshop on Internationalization and 1826 Localization for their review. 1828 11. References 1830 11.1 Normative References 1832 [ASCII] American National Standards Institute, "Coded Character 1833 Set -- 7-bit American Standard Code for Information 1834 Interchange", ANSI X3.4, 1986. 1836 [ISO10646] 1837 International Organization for Standardization, "ISO/IEC 1838 10646:2003: Information Technology - Universal 1839 Multiple-Octet Coded Character Set (UCS)", ISO Standard 1840 10646, December 2003. 1842 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1843 Requirement Levels", BCP 14, RFC 2119, March 1997. 1845 [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 1846 Specifications: ABNF", RFC 2234, November 1997. 1848 [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, 1849 "Internationalizing Domain Names in Applications (IDNA)", 1850 RFC 3490, March 2003. 1852 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1853 Profile for Internationalized Domain Names (IDN)", RFC 1854 3491, March 2003. 1856 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1857 10646", STD 63, RFC 3629, November 2003. 1859 [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 1860 Resource Identifier (URI): Generic Syntax (Note to the RFC 1861 Editor: Please update this reference with the RFC 1862 resulting from draft-fielding-uri-rfc2396bis-xx.txt, and 1863 remove this Note)", draft-fielding-uri-rfc2396bis-07 (work 1864 in progress), April 2004. 1866 [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard 1867 Annex #9, March 2004, 1868 . 1870 [UNIV4] The Unicode Consortium, "The Unicode Standard, Version 1871 4.0.1, defined by: The Unicode Standard, Version 4.0 1872 (Reading, MA, Addison-Wesley, 2003. 
ISBN 0-321-18578-1), 1873 as amended by Unicode 4.0.1 1874 (http://www.unicode.org/versions/Unicode4.0.1/)", March 1875 2004. 1877 [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", 1878 Unicode Standard Annex #15, April 2003, 1879 . 1882 11.2 Non-normative References 1884 [BidiEx] "Examples of bidirectional IRIs", 1885 . 1887 [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. 1888 Texin, "Character Model for the World Wide Web", World 1889 Wide Web Consortium Working Draft, February 2004, 1890 . 1892 [Duerst97] 1893 Duerst, M., "The Properties and Promises of UTF-8", Proc. 1894 11th International Unicode Conference, San Jose, 1895 September 1997, 1896 . 1899 [Gettys] Gettys, J., "URI Model Consequences", 1900 . 1902 [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 1903 Specification", World Wide Web Consortium Recommendation, 1904 December 1999, 1905 . 1908 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 1909 Extensions (MIME) Part One: Format of Internet Message 1910 Bodies", RFC 2045, November 1996. 1912 [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., 1913 Atkinson, R., Crispin, M. and P. Svanberg, "The Report of 1914 the IAB Character Set Workshop held 29 February - 1 March, 1915 1996", RFC 2130, April 1997. 1917 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 1919 [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. 1921 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1922 Languages", BCP 18, RFC 2277, January 1998. 1924 [RFC2368] Hoffman, P., Masinter, L. and J. Zawinski, "The mailto URL 1925 scheme", RFC 2368, July 1998. 1927 [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. 1929 [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 1930 Resource Identifiers (URI): Generic Syntax", RFC 2396, 1931 August 1998. 1933 [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August 1934 1998. 
1936 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., 1937 Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext 1938 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 1940 [RFC2640] Curtin, B., "Internationalization of the File Transfer 1941 Protocol", RFC 2640, July 1999. 1943 [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, 1944 "Guidelines for new URL Schemes", RFC 2718, November 1999. 1946 [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other 1947 Markup Languages", Unicode Technical Report #20, World 1948 Wide Web Consortium Note, February 2002, 1949 . 1951 [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking 1952 Language (XLink) Version 1.0", World Wide Web Consortium 1953 Recommendation, June 2001, 1954 . 1956 [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E. and 1957 F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third 1958 Edition)", World Wide Web Consortium Recommendation, 1959 February 2004, 1960 . 1962 [XMLNamespace] 1963 Bray, T., Hollander, D. and A. Layman, "Namespaces in 1964 XML", World Wide Web Consortium Recommendation, January 1965 1999, . 1967 [XMLSchema] 1968 Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", 1969 World Wide Web Consortium Recommendation, May 2001, 1970 . 1972 [XPointer] 1973 Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer 1974 Framework", World Wide Web Consortium Recommendation, 1975 March 2003, 1976 . 1978 Authors' Addresses 1980 Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever 1981 possible, for example as "Dürst" in XML and HTML.) 1982 World Wide Web Consortium 1983 5322 Endo 1984 Fujisawa, Kanagawa 252-8520 1985 Japan 1987 Phone: +81 466 49 1170 1988 Fax: +81 466 49 1171 1989 EMail: mailto:duerst@w3.org 1990 URI: http://www.w3.org/People/D%C3%BCrst/ 1991 (Note: This is the percent-encoded form of an IRI.) 1993 Michel Suignard 1994 Microsoft Corporation 1995 One Microsoft Way 1996 Redmond, WA 98052 1997 U.S.A. 
1999 Phone: +1 425 882-8080 2000 EMail: mailto:michelsu@microsoft.com 2001 URI: http://www.suignard.com 2003 Appendix A. Design Alternatives 2005 This section briefly summarizes major design alternatives and the 2006 reasons why they were not chosen. 2008 Appendix A.1 New Scheme(s) 2010 Introducing new schemes (for example httpi:, ftpi:,...) or a new 2011 metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:, 2012 i:ftp:,...) was proposed to make IRI-to-URI conversion 2013 scheme-dependent or to distinguish between percent-encodings 2014 resulting from IRI-to-URI conversion and percent-encodings from 2015 legacy character encodings. 2017 New schemes are not needed to distinguish URIs from true IRIs (i.e. 2018 IRIs that contain non-ASCII characters). The benefit of being able 2019 to detect the origin of percent-encodings is marginal, because UTF-8 2020 can be detected with very high reliability. Deploying new schemes is 2021 extremely hard, so not requiring new schemes for IRIs makes 2022 deployment of IRIs vastly easier. Making conversion scheme-dependent 2023 is highly inadvisable, and would be encouraged by separate schemes 2024 for IRIs. Using a uniform convention for conversion from IRIs to 2025 URIs makes IRI implementation orthogonal to the introduction of 2026 actual new schemes. 2028 Appendix A.2 Other Character Encodings than UTF-8 2030 At an early stage, UTF-7 was considered as an alternative to UTF-8 2031 when converting IRIs to URIs. UTF-7 would not have needed 2032 percent-encoding, and would in most cases have been shorter than 2033 percent-encoded UTF-8. 2035 Using UTF-8 avoids a double layering and overloading of the use of 2036 the "+" character. UTF-8 is fully compatible with US-ASCII, and has 2037 therefore been recommended by the IETF, and is being used widely, 2038 while UTF-7 has never been used much and is now clearly being 2039 discouraged. 
Requiring implementations to convert from UTF-8 to 2040 UTF-7 and back would be an additional implementation burden. 2042 Appendix A.3 New Encoding Convention 2044 Instead of using the existing percent-encoding convention of URIs, 2045 which is based on octets, the idea was to create a new encoding 2046 convention, for example to use '%u' to introduce UCS code points. 2048 Using the existing octet-based percent-encoding mechanism does not 2049 need an upgrade of the URI syntax, and does not need corresponding 2050 server upgrades. 2052 Appendix A.4 Indicating Character Encodings in the URI/IRI 2054 Some proposals suggested indicating the character encodings used in 2055 a URI or IRI with some new syntactic convention in the URI itself, 2056 similar to the 'charset' parameter for emails and Web pages. As an 2057 example, the label in square brackets in 2058 http://www.example.org/ros[iso-8859-1]é indicated that the 2059 following é had to be interpreted as iso-8859-1. 2061 Using UTF-8 only does not need an upgrade to the URI syntax. It 2062 avoids potentially multiple labels that have to be copied correctly 2063 in all cases, even on the side of a bus or on a napkin, leading to 2064 usability problems to the extent of being prohibitively annoying. 2065 Using UTF-8 only also reduces transcoding errors and confusion. 2067 Intellectual Property Statement 2069 The IETF takes no position regarding the validity or scope of any 2070 Intellectual Property Rights or other rights that might be claimed to 2071 pertain to the implementation or use of the technology described in 2072 this document or the extent to which any license under such rights 2073 might or might not be available; nor does it represent that it has 2074 made any independent effort to identify any such rights. Information 2075 on the procedures with respect to rights in RFC documents can be 2076 found in BCP 78 and BCP 79. 
2078 Copies of IPR disclosures made to the IETF Secretariat and any 2079 assurances of licenses to be made available, or the result of an 2080 attempt made to obtain a general license or permission for the use of 2081 such proprietary rights by implementers or users of this 2082 specification can be obtained from the IETF on-line IPR repository at 2083 http://www.ietf.org/ipr. 2085 The IETF invites any interested party to bring to its attention any 2086 copyrights, patents or patent applications, or other proprietary 2087 rights that may cover technology that may be required to implement 2088 this standard. Please address the information to the IETF at 2089 ietf-ipr@ietf.org. 2091 Disclaimer of Validity 2093 This document and the information contained herein are provided on an 2094 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 2095 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 2096 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 2097 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 2098 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 2099 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 2101 Copyright Statement 2103 Copyright (C) The Internet Society (2004). This document is subject 2104 to the rights, licenses and restrictions contained in BCP 78, and 2105 except as set forth therein, the authors retain all their rights. 2107 Acknowledgment 2109 Funding for the RFC Editor function is currently provided by the 2110 Internet Society.