idnits 2.17.1 draft-ietf-iri-3987bis-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the document. -- The draft header indicates that this document obsoletes RFC3987, but the abstract doesn't seem to directly say this. It does mention RFC3987 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (August 12, 2011) is 4635 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2045' is defined on line 2055, but no explicit reference was found in the text == Unused Reference: 'XMLNamespace' is defined on line 2122, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646' ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9' -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6' -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15' -- Obsolete informational reference (is this intentional?): RFC 1738 (Obsoleted by RFC 4248, RFC 4266) -- Obsolete informational reference (is this intentional?): RFC 2141 (Obsoleted by RFC 8141) -- Obsolete informational reference (is this intentional?): RFC 2192 (Obsoleted by RFC 5092) -- Obsolete informational reference (is this intentional?): RFC 2368 (Obsoleted by RFC 6068) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) Summary: 2 errors (**), 0 flaws (~~), 6 warnings (==), 14 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internationalized Resource Identifiers M. 
Duerst 3 (iri) Aoyama Gakuin University 4 Internet-Draft M. Suignard 5 Obsoletes: 3987 (if approved) Unicode Consortium 6 Intended status: Standards Track L. Masinter 7 Expires: February 13, 2012 Adobe 8 August 12, 2011 10 Internationalized Resource Identifiers (IRIs) 11 draft-ietf-iri-3987bis-06 13 Abstract 15 This document defines the Internationalized Resource Identifier (IRI) 16 protocol element, as an extension of the Uniform Resource Identifier 17 (URI). An IRI is a sequence of characters from the Universal 18 Character Set (Unicode/ISO 10646). Grammar and processing rules are 19 given for IRIs and related syntactic forms. 21 In addition, this document provides named additional rule sets for 22 processing otherwise invalid IRIs, in a way that supports other 23 specifications that wish to mandate common behavior for 'error' 24 handling. In particular, rules used in some XML languages (LEIRI) 25 and web applications are given. 27 Defining IRI as a new protocol element (rather than updating or 28 extending the definition of URI) allows independent orderly 29 transitions: other protocols and languages that use URIs must 30 explicitly choose to allow IRIs. 32 Guidelines are provided for the use and deployment of IRIs and 33 related protocol elements when revising protocols, formats, and 34 software components that currently deal only with URIs. 36 RFC Editor: Please remove the next paragraph before publication. 38 This document is intended to update RFC 3987 and move towards IETF 39 Draft Standard. For discussion and comments on this draft, please 40 join the IETF IRI WG by subscribing to the mailing list 41 public-iri@w3.org. For a list of open issues, please see the issue 42 tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1. 43 For a list of individual edits, please see the change history at 44 http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. 
46 Status of this Memo 48 This Internet-Draft is submitted in full conformance with the 49 provisions of BCP 78 and BCP 79. 51 Internet-Drafts are working documents of the Internet Engineering 52 Task Force (IETF). Note that other groups may also distribute 53 working documents as Internet-Drafts. The list of current Internet- 54 Drafts is at http://datatracker.ietf.org/drafts/current/. 56 Internet-Drafts are draft documents valid for a maximum of six months 57 and may be updated, replaced, or obsoleted by other documents at any 58 time. It is inappropriate to use Internet-Drafts as reference 59 material or to cite them other than as "work in progress." 61 This Internet-Draft will expire on February 13, 2012. 63 Copyright Notice 65 Copyright (c) 2011 IETF Trust and the persons identified as the 66 document authors. All rights reserved. 68 This document is subject to BCP 78 and the IETF Trust's Legal 69 Provisions Relating to IETF Documents 70 (http://trustee.ietf.org/license-info) in effect on the date of 71 publication of this document. Please review these documents 72 carefully, as they describe your rights and restrictions with respect 73 to this document. Code Components extracted from this document must 74 include Simplified BSD License text as described in Section 4.e of 75 the Trust Legal Provisions and are provided without warranty as 76 described in the Simplified BSD License. 78 This document may contain material from IETF Documents or IETF 79 Contributions published or made publicly available before November 80 10, 2008. The person(s) controlling the copyright in some of this 81 material may not have granted the IETF Trust the right to allow 82 modifications of such material outside the IETF Standards Process. 
83 Without obtaining an adequate license from the person(s) controlling 84 the copyright in such materials, this document may not be modified 85 outside the IETF Standards Process, and derivative works of it may 86 not be created outside the IETF Standards Process, except to format 87 it for publication as an RFC or to translate it into languages other 88 than English. 90 Table of Contents 92 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 93 1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5 94 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 95 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7 96 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9 97 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 98 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10 99 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 100 3. Processing IRIs and related protocol elements . . . . . . . . 13 101 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14 102 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14 103 3.3. General percent-encoding of IRI components . . . . . . . 15 104 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 15 105 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 15 106 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 16 107 3.4.3. Additional Considerations . . . . . . . . . . . . . . 16 108 3.5. Mapping query components . . . . . . . . . . . . . . . . 17 109 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 17 110 3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 17 111 3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 19 112 4. Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 20 113 4.1. Logical Storage and Visual Presentation . . . . . . . . . 21 114 4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 22 115 4.3. 
Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 23 116 4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 23 117 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 118 5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 25 119 5.2. Software Interfaces and Protocols . . . . . . . . . . . . 26 120 5.3. Format of URIs and IRIs in Documents and Protocols . . . 26 121 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 26 122 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 28 123 6. Liberal Handling of Otherwise Invalid IRIs . . . . . . . . . . 28 124 6.1. LEIRI Processing . . . . . . . . . . . . . . . . . . . . 29 125 6.2. Web Address Processing . . . . . . . . . . . . . . . . . 29 126 6.3. Characters Not Allowed in IRIs . . . . . . . . . . . . . 31 127 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 33 128 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 33 129 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 33 130 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 34 131 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34 132 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 35 133 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 36 134 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 36 135 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 37 136 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38 137 9. Security Considerations . . . . . . . . . . . . . . . . . . . 38 138 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39 139 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 40 140 11.1. Major restructuring of IRI processing model . . . . . . . 40 141 11.1.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 40 142 11.1.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 40 143 11.1.3. Extension of Syntax . . . . 
. . . . . . . . . . . . . 41 144 11.1.4. More to be added . . . . . . . . . . . . . . . . . . . 41 145 11.2. Change Log . . . . . . . . . . . . . . . . . . . . . . . 41 146 11.2.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 41 147 11.2.2. Changes from draft-duerst-iri-bis-07 to 148 draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 41 149 11.2.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 41 150 11.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 41 151 11.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 42 152 11.5. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 42 153 11.6. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 42 154 11.7. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 42 155 11.8. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 42 156 11.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 42 157 11.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 43 158 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 43 159 12.1. Normative References . . . . . . . . . . . . . . . . . . 43 160 12.2. Informative References . . . . . . . . . . . . . . . . . 44 161 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 46 163 1. Introduction 165 1.1. Overview and Motivation 167 A Uniform Resource Identifier (URI) is defined in [RFC3986] as a 168 sequence of characters chosen from a limited subset of the repertoire 169 of US-ASCII [ASCII] characters. 171 The characters in URIs are frequently used for representing words of 172 natural languages. This usage has many advantages: Such URIs are 173 easier to memorize, easier to interpret, easier to transcribe, easier 174 to create, and easier to guess. For most languages other than 175 English, however, the natural script uses characters other than A - 176 Z. 
For many people, handling Latin characters is as difficult as 177 handling the characters of other scripts is for those who use only 178 the Latin alphabet. Many languages with non-Latin scripts are 179 transcribed with Latin letters. These transcriptions are now often 180 used in URIs, but they introduce additional difficulties. 182 The infrastructure for the appropriate handling of characters from 183 additional scripts is now widely deployed in operating system and 184 application software. Software that can handle a wide variety of 185 scripts and languages at the same time is increasingly common. Also, 186 an increasing number of protocols and formats can carry a wide range 187 of characters. 189 URIs are used both as a protocol element (for transmission and 190 processing by software) and also a presentation element (for display 191 and handling by people who read, interpret, coin, or guess them). 192 The transition between these roles is more difficult and complex when 193 dealing with the larger set of characters than allowed for URIs in 194 [RFC3986]. 196 This document defines the protocol element called Internationalized 197 Resource Identifier (IRI), which allows applications of URIs to be 198 extended to use resource identifiers that have a much wider 199 repertoire of characters. It also provides corresponding 200 "internationalized" versions of other constructs from [RFC3986], such 201 as URI references. The syntax of IRIs is defined in Section 2. 203 Using characters outside of A - Z in IRIs adds a number of 204 difficulties. Section 4 discusses the special case of bidirectional 205 IRIs using characters from scripts written right-to-left. Section 5 206 discusses the use of IRIs in different situations. Section 7 gives 207 additional informative guidelines. Section 9 discusses IRI-specific 208 security considerations. 210 When originally defining IRIs, several design alternatives were 211 considered. 
Historically interested readers can find an overview in 212 Appendix A of [RFC3987]. For some additional background on the 213 design of URIs and IRIs, please also see [Gettys]. 215 1.2. Applicability 217 IRIs are designed to allow protocols and software that deal with URIs 218 to be updated to handle IRIs. A "URI scheme" (as defined by 219 [RFC3986] and registered through the IANA process defined in 220 [RFC4395bis]) also serves as an "IRI scheme". Processing of IRIs is 221 accomplished by extending the URI syntax while retaining (and not 222 expanding) the set of "reserved" characters, such that the syntax for 223 any URI scheme may be extended to allow non-ASCII characters. In 224 addition, following parsing of an IRI, it is possible to construct a 225 corresponding URI by first encoding characters outside of the allowed 226 URI range and then reassembling the components. 228 Practical use of IRI forms in place of URI forms depends on the 229 following conditions being met: 231 a. A protocol or format element MUST be explicitly designated to be 232 able to carry IRIs. The intent is to avoid introducing IRIs into 233 contexts that are not defined to accept them. For example, XML 234 schema [XMLSchema] has an explicit type "anyURI" that includes 235 IRIs and IRI references. Therefore, IRIs and IRI references can 236 be in attributes and elements of type "anyURI". On the other 237 hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is 238 defined as a URI, which means that direct use of IRIs is not 239 allowed in HTTP requests. 241 b. The protocol or format carrying the IRIs MUST have a mechanism to 242 represent the wide range of characters used in IRIs, either 243 natively or by some protocol- or format-specific escaping 244 mechanism (for example, numeric character references in [XML1]). 246 c. 
The URI scheme definition, if it explicitly allows a percent sign 247 ("%") in any syntactic component, SHOULD define the interpretation 248 of sequences of percent-encoded octets (using "%XX" hex octets) as 249 octets from sequences of UTF-8 encoded strings; this is recommended 250 in the guidelines for registering new schemes, [RFC4395bis]. For 251 example, this is the practice for IMAP URLs [RFC2192], POP URLs 252 [RFC2384] and the URN syntax [RFC2141]. Note that use of 253 percent-encoding may also be restricted in some situations; for 254 example, URI schemes that disallow percent-encoding might still be 255 used with a fragment identifier which is percent-encoded (e.g., 256 [XPointer]). See Section 5.4 for further discussion. 258 1.3. Definitions 260 The following definitions are used in this document; they follow the 261 terms in [RFC2130], [RFC2277], and [ISO10646]. 263 character: A member of a set of elements used for the organization, 264 control, or representation of data. For example, "LATIN CAPITAL 265 LETTER A" names a character. 267 octet: An ordered sequence of eight bits considered as a unit. 269 character repertoire: A set of characters (set in the mathematical 270 sense). 272 sequence of characters: A sequence of characters (one after 273 another). 275 sequence of octets: A sequence of octets (one after another). 277 character encoding: A method of representing a sequence of 278 characters as a sequence of octets (maybe with variants). Also, a 279 method of (unambiguously) converting a sequence of octets into a 280 sequence of characters. 282 charset: The name of a parameter or attribute used to identify a 283 character encoding. 285 UCS: Universal Character Set. The coded character set defined by 286 ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6]. 288 IRI reference: Denotes the common usage of an Internationalized 289 Resource Identifier. An IRI reference may be absolute or 290 relative. 
However, the "IRI" that results from such a reference 291 only includes absolute IRIs; any relative IRI references are 292 resolved to their absolute form. Note that in [RFC2396] URIs did 293 not include fragment identifiers, but in [RFC3986] fragment 294 identifiers are part of URIs. 296 URL: The term "URL" was originally used [RFC1738] for roughly what 297 is now called a "URI". Books, software and documentation often 298 refer to URIs and IRIs using the "URL" term. Some usages 299 restrict "URL" to those URIs which are not URNs. Because of the 300 ambiguity of the term, using the term "URL" is NOT RECOMMENDED in 301 formal documents. 303 LEIRI (Legacy Extended IRI) processing: This term was used in 304 various XML specifications to refer to strings that, although not 305 valid IRIs, were acceptable input to the processing rules in 306 Section 6.1. 308 (Web Address, Hypertext Reference, HREF): These terms have been 309 added in this document for convenience, to allow other 310 specifications to refer to those strings that, although not valid 311 IRIs, are acceptable input to the processing rules in Section 6.2. 312 This usage corresponds to the parsing rules of some popular web 313 browsing applications. ISSUE: Need to find a good name/ 314 abbreviation for these. 316 running text: Human text (paragraphs, sentences, phrases) with 317 syntax according to orthographic conventions of a natural 318 language, as opposed to syntax defined for ease of processing by 319 machines (e.g., markup, programming languages). 321 protocol element: Any portion of a message that affects processing 322 of that message by the protocol in question. 324 presentation element: A presentation form corresponding to a 325 protocol element; for example, using a wider range of characters. 327 create (a URI or IRI): With respect to URIs and IRIs, the term is 328 used for the initial creation. 
This may be the initial creation 329 of a resource with a certain identifier, or the initial exposition 330 of a resource under a particular identifier. 332 generate (a URI or IRI): With respect to URIs and IRIs, the term is 333 used when the identifier is generated by derivation from other 334 information. 336 parsed URI component: When a URI processor parses a URI (following 337 the generic syntax or a scheme-specific syntax), the result is a 338 set of parsed URI components, each of which has a type 339 (corresponding to the syntactic definition) and a sequence of URI 340 characters. 342 parsed IRI component: When an IRI processor parses an IRI directly, 343 following the general syntax or a scheme-specific syntax, the 344 result is a set of parsed IRI components, each of which has a type 345 (corresponding to the syntactic definition) and a sequence of IRI 346 characters. (This definition is analogous to "parsed URI 347 component".) 349 IRI scheme: A URI scheme may also be known as an "IRI scheme" if the 350 scheme's syntax has been extended to allow non-US-ASCII characters 351 according to the rules in this document. 353 1.4. Notation 355 RFCs and Internet Drafts currently do not allow any characters 356 outside the US-ASCII repertoire. Therefore, this document uses 357 various special notations to denote such characters in examples. 359 In text, characters outside US-ASCII are sometimes referenced by 360 using a prefix of 'U+', followed by four to six hexadecimal digits. 362 To represent characters outside US-ASCII in examples, this document 363 uses two notations: 'XML Notation' and 'Bidi Notation'. 365 XML Notation uses a leading '&#x', a trailing ';', and the 366 hexadecimal number of the character in the UCS in between. For 367 example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this 368 notation, an actual '&' is denoted by '&amp;'. 
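The XML Notation described above can be illustrated with a small conversion sketch. This is not part of the specification; the function names from_xml_notation and to_xml_notation are hypothetical, and the code assumes that only the '&#x...;' and '&amp;' forms need to be handled.

```python
import re

# Matches either a hexadecimal character reference (&#xHHHH;) or '&amp;'.
_XML_NOTATION = re.compile(r"&(?:#x([0-9A-Fa-f]+)|amp);")


def from_xml_notation(text: str) -> str:
    """Expand '&#xHHHH;' and '&amp;' into the characters they denote."""
    def expand(match: "re.Match[str]") -> str:
        hex_digits = match.group(1)
        # group(1) is None when the match was '&amp;'.
        return "&" if hex_digits is None else chr(int(hex_digits, 16))
    return _XML_NOTATION.sub(expand, text)


def to_xml_notation(text: str) -> str:
    """Write non-US-ASCII characters as '&#xHHHH;' and '&' as '&amp;'."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ord(ch) > 0x7F:
            out.append("&#x{:X};".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

Handling both forms in a single pass keeps an escaped ampersand followed by '#x' from being expanded a second time.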
370 Bidi Notation is used for bidirectional examples: Lower case letters 371 stand for Latin letters or other letters that are written left to 372 right, whereas upper case letters represent Arabic or Hebrew letters 373 that are written right to left. 375 To denote actual octets in examples (as opposed to percent-encoded 376 octets), the two hex digits denoting the octet are enclosed in "<" 377 and ">". For example, the octet often denoted as 0xc9 is denoted 378 here as <c9>. 380 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 381 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 382 and "OPTIONAL" are to be interpreted as described in [RFC2119]. 384 2. IRI Syntax 386 This section defines the syntax of Internationalized Resource 387 Identifiers (IRIs). 389 As with URIs, an IRI is defined as a sequence of characters, not as a 390 sequence of octets. This definition accommodates the fact that IRIs 391 may be written on paper or read over the radio as well as stored or 392 transmitted digitally. The same IRI might be represented as 393 different sequences of octets in different protocols or documents if 394 these protocols or documents use different character encodings 395 (and/or transfer encodings). Using the same character encoding as 396 the containing protocol or document ensures that the characters in 397 the IRI can be handled (e.g., searched, converted, displayed) in the 398 same way as the rest of the protocol or document. 400 2.1. Summary of IRI Syntax 402 The IRI syntax extends the URI syntax in [RFC3986] by extending the 403 class of unreserved characters, primarily by adding the characters of 404 the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject 405 to the limitations given in the syntax rules below and in 406 Section 5.1. 408 The syntax and use of components and reserved characters is the same 409 as that in [RFC3986]. 
Each "URI scheme" thus also functions as an 410 "IRI scheme", in that scheme-specific parsing rules for URIs of a 411 scheme are extended to allow parsing of IRIs using the same 412 parsing rules. 414 All the operations defined in [RFC3986], such as the resolution of 415 relative references, can be applied to IRIs by IRI-processing 416 software in exactly the same way as they are for URIs by URI- 417 processing software. 419 Characters outside the US-ASCII repertoire MUST NOT be reserved and 420 therefore MUST NOT be used for syntactical purposes, such as to 421 delimit components in newly defined schemes. For example, U+00A2, 422 CENT SIGN, is not allowed as a delimiter in IRIs, because it is in 423 the 'iunreserved' category. This is similar to the fact that it is 424 not possible to use '-' as a delimiter in URIs, because it is in the 425 'unreserved' category. 427 2.2. ABNF for IRI References and IRIs 429 An ABNF definition for IRI references (which are the most general 430 concept and the start of the grammar) and IRIs is given here. The 431 syntax of this ABNF is described in [STD68]. Character numbers are 432 taken from the UCS, without implying any actual binary encoding. 433 Terminals in the ABNF are characters, not octets. 435 The following grammar closely follows the URI grammar in [RFC3986], 436 except that the range of unreserved characters is expanded to include 437 UCS characters, with the restriction that private UCS characters can 438 occur only in query parts. The grammar is split into two parts: 439 Rules that differ from [RFC3986] because of the above-mentioned 440 expansion, and rules that are the same as those in [RFC3986]. For 441 rules that are different than those in [RFC3986], the names of the 442 non-terminals have been changed as follows. If the non-terminal 443 contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' 444 has been prefixed. 
The rule has been introduced in order 445 to be able to reference it from other parts of the document.

447 The following rules are different from those in [RFC3986]:

449   IRI            = scheme ":" ihier-part [ "?" iquery ]
450                    [ "#" ifragment ]

452   ihier-part     = "//" iauthority ipath-abempty
453                  / ipath-absolute
454                  / ipath-rootless
455                  / ipath-empty

457   IRI-reference  = IRI / irelative-ref

459   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

461   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

463   irelative-part = "//" iauthority ipath-abempty
464                  / ipath-absolute
465                  / ipath-noscheme
466                  / ipath-empty

468   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
469   iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
470   ihost          = IP-literal / IPv4address / ireg-name

472   pct-form       = pct-encoded

474   ireg-name      = *( iunreserved / sub-delims )

476   ipath          = ipath-abempty   ; begins with "/" or is empty
477                  / ipath-absolute  ; begins with "/" but not "//"
478                  / ipath-noscheme  ; begins with a non-colon segment
479                  / ipath-rootless  ; begins with a segment
480                  / ipath-empty     ; zero characters

482   ipath-abempty  = *( path-sep isegment )
483   ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
484   ipath-noscheme = isegment-nz-nc *( path-sep isegment )
485   ipath-rootless = isegment-nz *( path-sep isegment )
486   ipath-empty    = 0<ipchar>
487   path-sep       = "/"

489   isegment       = *ipchar
490   isegment-nz    = 1*ipchar
491   isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
492                  / "@" )
493                  ; non-zero-length segment without any colon ":"

495   ipchar         = iunreserved / pct-form / sub-delims / ":"
496                  / "@"

498   iquery         = *( ipchar / iprivate / "/" / "?" )

500   ifragment      = *( ipchar / "/" / "?" )

502   iunreserved    = ALPHA / DIGIT / "-" / "."
                 / "_" / "~" / ucschar

504   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
505                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
506                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
507                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
508                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
509                  / %xD0000-DFFFD / %xE1000-EFFFD

511   iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
512                  / %x100000-10FFFD

514 Some productions are ambiguous. The "first-match-wins" (a.k.a. 515 "greedy") algorithm applies. For details, see [RFC3986].

517 The following rules are the same as those in [RFC3986]:

519   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

521   port           = *DIGIT

523   IP-literal     = "[" ( IPv6address / IPvFuture ) "]"

525   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

527   IPv6address    =                            6( h16 ":" ) ls32
528                  /                       "::" 5( h16 ":" ) ls32
529                  / [               h16 ] "::" 4( h16 ":" ) ls32
530                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
531                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
532                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
533                  / [ *4( h16 ":" ) h16 ] "::"              ls32
534                  / [ *5( h16 ":" ) h16 ] "::"              h16
535                  / [ *6( h16 ":" ) h16 ] "::"

537   h16            = 1*4HEXDIG
538   ls32           = ( h16 ":" h16 ) / IPv4address

540   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

542   dec-octet      = DIGIT                 ; 0-9
543                  / %x31-39 DIGIT         ; 10-99
544                  / "1" 2DIGIT            ; 100-199
545                  / "2" %x30-34 DIGIT     ; 200-249
546                  / "25" %x30-35          ; 250-255

548   pct-encoded    = "%" HEXDIG HEXDIG

550   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
551   reserved       = gen-delims / sub-delims
552   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
553   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
554                  / "*" / "+" / "," / ";" / "="

556 This syntax does not support IPv6 scoped addressing zone identifiers.

558 3. Processing IRIs and related protocol elements

560 IRIs are meant to replace URIs in identifying resources within new 561 versions of protocols, formats, and software components that use a 562 UCS-based character repertoire. 
Protocols and components may use and 563 process IRIs directly. However, there are still numerous systems and 564 protocols which only accept URIs or components of parsed URIs; that 565 is, they only accept sequences of characters within the subset of US- 566 ASCII characters allowed in URIs. 568 This section defines specific processing steps for IRI consumers 569 which establish the relationship between the string given and the 570 interpreted derivatives. These processing steps apply to both IRIs 571 and IRI references (i.e., absolute or relative forms); for IRIs, some 572 steps are scheme specific. 574 3.1. Converting to UCS 576 Input that is already in a Unicode form (i.e., a sequence of Unicode 577 characters or an octet-stream representing a Unicode-based character 578 encoding such as UTF-8 or UTF-16) should be left as is and not 579 normalized or changed. 581 An IRI or IRI reference is a sequence of characters from the UCS. 582 For resource identifiers that are not already in a Unicode form (as 583 when written on paper, read aloud, or represented in a text stream 584 using a legacy character encoding), convert the IRI to Unicode. Note 585 that some character encodings or transcriptions can be converted to 586 or represented by more than one sequence of Unicode characters. 587 Ideally the resulting IRI would use a normalized form, such as 588 Unicode Normalization Form C [UTR15], since that ensures a stable, 589 consistent representation that is most likely to produce the intended 590 results. Implementers and users are cautioned that, while 591 denormalized character sequences are valid, they might be difficult 592 for other users or processes to reproduce and might lead to 593 unexpected results. 595 In other cases (written on paper, read aloud, or otherwise 596 represented independent of any character encoding) represent the IRI 597 as a sequence of characters from the UCS normalized according to 598 Unicode Normalization Form C (NFC, [UTR15]). 600 3.2. 
Parse the IRI into IRI components

Parse the IRI, either as a relative reference (no scheme) or using
scheme-specific processing (according to the scheme given); the
result is a set of parsed IRI components.

NOTE: The result of parsing into components will correspond to
substrings of the IRI that may be accessible via an API. For example,
in [HTML5], the protocol components of interest are SCHEME (scheme),
HOST (ireg-name), PORT (port), the PATH (ipath after the initial
"/"), QUERY (iquery), FRAGMENT (ifragment), and AUTHORITY
(iauthority).

Subsequent processing rules are sometimes used to define other
syntactic components. For example, [HTML5] defines APIs for IRI
processing; in these APIs:

HOSTSPECIFIC  the substring that follows the substring matched by the
   iauthority production, or the whole string if the iauthority
   production wasn't matched.

HOSTPORT  if there is a scheme component and a port component and the
   port given by the port component is different from the default
   port defined for the protocol given by the scheme component, then
   HOSTPORT is the substring that starts with the substring matched
   by the host production and ends with the substring matched by the
   port production, and includes the colon in between the two.
   Otherwise, it is the same as the host component.

3.3. General percent-encoding of IRI components

Except as noted in the following subsections, IRI components are
mapped to the equivalent URI components by percent-encoding those
characters not allowed in URIs. Previous processing steps will have
removed some characters, and the interpretation of reserved
characters will have already been done (with the syntactic reserved
characters outside of the IRI component). This mapping is defined
for all sequences of Unicode characters, whether or not they are
valid for the component in question.

For each character which is not allowed anywhere in a valid URI, apply
the following steps.

Convert to UTF-8  Convert the character to a sequence of one or more
   octets using UTF-8 [RFC3629].

Percent encode  Convert each octet of this sequence to %HH, where HH
   is the hexadecimal notation of the octet value. The hexadecimal
   notation SHOULD use uppercase letters. (This is the general URI
   percent-encoding mechanism in Section 2.1 of [RFC3986].)

Note that the mapping is an identity transformation for parsed URI
components of valid URIs, and is idempotent: applying the mapping a
second time will not change anything.

3.4. Mapping ireg-name

3.4.1. Mapping using Percent-Encoding

The ireg-name component SHOULD be converted according to the general
procedure for percent-encoding of IRI components described in
Section 3.3.

For example, the IRI
"http://résumé.example.org"
will be converted to
"http://r%C3%A9sum%C3%A9.example.org".

This conversion for ireg-name is in line with Section 3.2.2 of
[RFC3986], which does not mandate a particular registered name lookup
technology. For further background, see [RFC6055] and [Gettys].

3.4.2. Mapping using Punycode

The ireg-name component MAY also be converted as follows:

Replace the ireg-name part of the IRI by the part converted using the
Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
on each dot-separated label, and by using U+002E (FULL STOP) as a
label separator. This procedure may fail, which means that the IRI
cannot be resolved; if the domain name conversion fails, then the
entire IRI conversion fails. Processors that have no mechanism for
signalling a failure MAY instead substitute an otherwise invalid host
name, although such processing SHOULD be avoided.
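
As an illustration only, the two ireg-name mappings above (general
percent-encoding per Section 3.3 and the Punycode-based alternative)
might be sketched in Python as follows. The function names are
hypothetical. Python's built-in "idna" codec implements the older
IDNA2003 rules of [RFC3490], not the lookup procedure of [RFC5891], so
it is used here purely as a stand-in.

```python
import string

# Characters that may appear somewhere in a valid URI (RFC 3986):
# unreserved, gen-delims, sub-delims, plus "%" for existing escapes.
URI_CHARS = frozenset(
    string.ascii_letters + string.digits
    + "-._~" + ":/?#[]@" + "!$&'()*+,;=" + "%"
)


def pct_encode_component(component: str) -> str:
    """Percent-encode every character not allowed anywhere in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Step 1: convert to UTF-8; step 2: emit %HH in uppercase hex.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def ireg_name_to_ascii(ireg_name: str) -> str:
    """Convert each dot-separated label with the stdlib "idna" codec.

    Assumption: IDNA2003 (the stdlib codec) stands in for RFC 5891.
    Raises UnicodeError if a label cannot be converted, in which case
    the caller should treat the whole IRI conversion as failed.
    """
    return ireg_name.encode("idna").decode("ascii")


if __name__ == "__main__":
    host = "r\u00e9sum\u00e9.example.org"
    print(pct_encode_component(host))
    print(ireg_name_to_ascii(host))
```

Because "%" is in the allowed set, existing percent-encoded sequences
pass through unchanged, which is what makes the first function
idempotent.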

For example, the IRI
"http://résumé.example.org"
MAY be converted to
"http://xn--rsum-bpad.example.org".

This conversion for ireg-name will be better able to deal with legacy
infrastructure that cannot handle percent-encoding in domain names.

3.4.3. Additional Considerations

Note: Domain Names may appear in parts of an IRI other than the
   ireg-name part. It is the responsibility of scheme-specific
   implementations (if the Internationalized Domain Name is part of
   the scheme syntax) or of server-side implementations (if the
   Internationalized Domain Name is part of 'iquery') to apply the
   necessary conversions at the appropriate point. Example: Trying
   to validate the Web page at
   http://résumé.example.org would lead to an IRI of
   http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org,
   which would convert to a URI of
   http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org.
   The server-side implementation is responsible for making the
   necessary conversions to be able to retrieve the Web page.

Note: In this process, characters allowed in URI references and
   existing percent-encoded sequences are not encoded further. (This
   mapping is similar to, but different from, the encoding applied
   when arbitrary content is included in some part of a URI.) For
   example, an IRI of
   "http://www.example.org/red%09rosé#red" (in XML notation) is
   converted to
   "http://www.example.org/red%09ros%C3%A9#red", not to something
   like
   "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

3.5. Mapping query components

((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed
HTTP infrastructure, the following special case applies for schemes
"http" and "https" and IRIs whose origin has a document charset other
than one which is UCS-based (e.g., UTF-8 or UTF-16).
In such a case, the "query" component of an IRI is mapped into a URI
by using the document charset rather than UTF-8 as the binary
representation before pct-encoding. This mapping is not applied for
any other scheme or component.

3.6. Mapping IRIs to URIs

The canonical mapping from an IRI to a URI is defined by applying the
mapping above (from IRI to URI components) and then reassembling a
URI from the parsed URI components using the original punctuation
that delimited the IRI components.

3.7. Converting URIs to IRIs

In some situations, for presentation and further processing, it is
desirable to convert a URI into an equivalent IRI in which natural
characters are represented directly rather than percent-encoded. Of
course, every URI is already an IRI in its own right without any
conversion, and in general there may be more than one way to perform
such a conversion. This section gives one such procedure.

The conversion described in this section, if given a valid URI, will
result in an IRI that maps back to the URI used as an input for the
conversion (except for potential case differences in percent-encoding
and for potential percent-encoded unreserved characters). However,
the IRI resulting from this conversion may differ from the original
IRI (if there ever was one).

URI-to-IRI conversion removes percent-encodings, but not all
percent-encodings can be eliminated. There are several reasons for
this:

1. Some percent-encodings are necessary to distinguish
   percent-encoded and unencoded uses of reserved characters.

2. Some percent-encodings cannot be interpreted as sequences of UTF-8
   octets.

   (Note: The octet patterns of UTF-8 are highly regular. Therefore,
   there is a very high probability, but no guarantee, that
   percent-encodings that can be interpreted as sequences of UTF-8
   octets actually originated from UTF-8. For a detailed discussion,
   see [Duerst97].)

3.
The conversion may result in a character that is not appropriate 773 in an IRI. See Section 2.2, Section 4.1, and Section 5.1 for 774 further details. 776 4. IRI to URI conversion has different rules for dealing with domain 777 names and query parameters. 779 Conversion from a URI to an IRI MAY be done by using the following 780 steps: 782 1. Represent the URI as a sequence of octets in US-ASCII. 784 2. Convert all percent-encodings ("%" followed by two hexadecimal 785 digits) to the corresponding octets, except those corresponding to 786 "%", characters in "reserved", and characters in US-ASCII not 787 allowed in URIs. 789 3. Re-percent-encode any octet produced in step 2 that is not part of 790 a strictly legal UTF-8 octet sequence. 792 4. Re-percent-encode all octets produced in step 3 that in UTF-8 793 represent characters that are not appropriate according to 794 Section 2.2, Section 4.1, and Section 5.1. 796 5. Interpret the resulting octet sequence as a sequence of characters 797 encoded in UTF-8. 799 6. URIs known to contain domain names in the reg-name component 800 SHOULD convert punycode-encoded domain name labels to the 801 corresponding characters using the ToUnicode procedure. 803 This procedure will convert as many percent-encoded characters as 804 possible to characters in an IRI. Because there are some choices 805 when step 4 is applied (see Section 5.1), results may vary. 807 Conversions from URIs to IRIs MUST NOT use any character encoding 808 other than UTF-8 in steps 3 and 4, even if it might be possible to 809 guess from the context that another character encoding than UTF-8 was 810 used in the URI. For example, the URI 811 "http://www.example.org/r%E9sum%E9.html" might with some guessing be 812 interpreted to contain two e-acute characters encoded as iso-8859-1. 813 It must not be converted to an IRI containing these e-acute 814 characters. 
Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".

3.7.1. Examples

This section shows various examples of converting URIs to IRIs. Each
example shows the result after each of the steps 1 through 6 is
applied. XML Notation is used for the final result. Octets are
denoted by "<" followed by two hexadecimal digits followed by ">".

The following example contains the sequence "%C3%BC", which is a
strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut).

1. http://www.example.org/D%C3%BCrst

2. http://www.example.org/D<c3><bc>rst

3. http://www.example.org/D<c3><bc>rst

4. http://www.example.org/D<c3><bc>rst

5. http://www.example.org/Dürst

6. http://www.example.org/Dürst

The following example contains the sequence "%FC", which might
represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the
iso-8859-1 character encoding. (It might represent other characters
in other character encodings. For example, the octet <fc> in
iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because
<fc> is not part of a strictly legal UTF-8 sequence, it is
re-percent-encoded in step 3.

1. http://www.example.org/D%FCrst

2. http://www.example.org/D<fc>rst

3. http://www.example.org/D%FCrst

4. http://www.example.org/D%FCrst

5. http://www.example.org/D%FCrst

6. http://www.example.org/D%FCrst

The following example contains "%e2%80%ae", which is the
percent-encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT
OVERRIDE. Section 4.1 forbids the direct use of this character in an
IRI. Therefore, the corresponding octets are re-percent-encoded in
step 4. This example shows that the case (upper- or lowercase) of
letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label
(xn--99zt52a), which is not converted.

1. http://xn--99zt52a.example.org/%e2%80%ae

2. http://xn--99zt52a.example.org/<e2><80><ae>

3. http://xn--99zt52a.example.org/<e2><80><ae>

4. http://xn--99zt52a.example.org/%E2%80%AE

5. http://xn--99zt52a.example.org/%E2%80%AE

6. http://納豆.example.org/%E2%80%AE

Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this
note.))

4. Bidirectional IRIs for Right-to-Left Languages

Some UCS characters, such as those used in the Arabic and Hebrew
scripts, have an inherent right-to-left (rtl) writing direction.
IRIs containing these characters (called bidirectional IRIs or Bidi
IRIs) require additional attention because of the non-trivial
relation between logical representation (used for digital
representation and for reading/spelling) and visual representation
(used for display/printing).

Because of the complex interaction between the logical
representation, the visual representation, and the syntax of a Bidi
IRI, a balance is needed between various requirements. The main
requirements are

1. user-predictable conversion between visual and logical
   representation;

2. the ability to include a wide range of characters in various parts
   of the IRI; and

3. minor or no changes or restrictions for implementations.

4.1. Logical Storage and Visual Presentation

When stored or transmitted in digital representation, bidirectional
IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which includes the rules relevant to their scheme). This
ensures that bidirectional IRIs can be processed in the same way as
other IRIs.

Bidirectional IRIs MUST be rendered by using the Unicode
Bidirectional Algorithm [UNIV6], [UNI9].
Bidirectional IRIs MUST be 921 rendered in the same way as they would be if they were in a left-to- 922 right embedding; i.e., as if they were preceded by U+202A, LEFT-TO- 923 RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL 924 FORMATTING (PDF). Setting the embedding direction can also be done 925 in a higher-level protocol (e.g., the dir='ltr' attribute in HTML). 927 There is no requirement to use the above embedding if the display is 928 still the same without the embedding. For example, a bidirectional 929 IRI in a text with left-to-right base directionality (such as used 930 for English or Cyrillic) that is preceded and followed by whitespace 931 and strong left-to-right characters does not need an embedding. 932 Also, a bidirectional relative IRI reference that only contains 933 strong right-to-left characters and weak characters and that starts 934 and ends with a strong right-to-left character and appears in a text 935 with right-to-left base directionality (such as used for Arabic or 936 Hebrew) and is preceded and followed by whitespace and strong 937 characters does not need an embedding. 939 In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be 940 sufficient to force the correct display behavior. However, the 941 details of the Unicode Bidirectional algorithm are not always easy to 942 understand. Implementers are strongly advised to err on the side of 943 caution and to use embedding in all cases where they are not 944 completely sure that the display behavior is unaffected without the 945 embedding. 947 The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits 948 higher-level protocols to influence bidirectional rendering. Such 949 changes by higher-level protocols MUST NOT be used if they change the 950 rendering of IRIs. 952 The bidirectional formatting characters that may be used before or 953 after the IRI to ensure correct display are not themselves part of 954 the IRI. 
IRIs MUST NOT contain bidirectional formatting characters 955 (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual 956 rendering of the IRI but do not appear themselves. It would 957 therefore not be possible to input an IRI with such characters 958 correctly. 960 4.2. Bidi IRI Structure 962 The Unicode Bidirectional Algorithm is designed mainly for running 963 text. To make sure that it does not affect the rendering of 964 bidirectional IRIs too much, some restrictions on bidirectional IRIs 965 are necessary. These restrictions are given in terms of delimiters 966 (structural characters, mostly punctuation such as "@", ".", ":", and 967 "/") and components (usually consisting mostly of letters and 968 digits). 970 The following syntax rules from Section 2.2 correspond to components 971 for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, 972 isegment-nz, isegment-nz-nc, ireg-name, iquery, and ifragment. 974 Specifications that define the syntax of any of the above components 975 MAY divide them further and define smaller parts to be components 976 according to this document. As an example, the restrictions of 977 [RFC3490] on bidirectional domain names correspond to treating each 978 label of a domain name as a component for schemes with ireg-name as a 979 domain name. Even where the components are not defined formally, it 980 may be helpful to think about some syntax in terms of components and 981 to apply the relevant restrictions. For example, for the usual name/ 982 value syntax in query parts, it is convenient to treat each name and 983 each value as a component. As another example, the extensions in a 984 resource name can be treated as separate components. 986 For each component, the following restrictions apply: 988 1. A component SHOULD NOT use both right-to-left and left-to-right 989 characters. 991 2. A component using right-to-left characters SHOULD start and end 992 with right-to-left characters. 
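
The two per-component restrictions above can be checked mechanically.
The following Python sketch is one possible approach, not a normative
algorithm. The function name is hypothetical, and the sketch assumes
that "right-to-left characters" means Unicode bidirectional classes R
and AL and that "left-to-right characters" means class L, as reported
by the standard unicodedata module.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # assumption: strong right-to-left classes
LTR_CLASSES = {"L"}        # assumption: strong left-to-right class


def check_bidi_component(component: str) -> list[str]:
    """Return the restrictions of this section that a component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)

    problems = []
    # Restriction 1: do not mix rtl and ltr characters in one component.
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: an rtl component starts and ends with rtl characters.
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        problems.append("does not start and end with a right-to-left character")
    return problems


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"
    for candidate in (hebrew, "abc", "a" + hebrew, hebrew + "1"):
        print(ascii(candidate), check_bidi_component(candidate))
```

A component ending in a digit, as in Example 8 of Section 4.4, is
reported under the second restriction only, because digits are
neither strong left-to-right nor right-to-left characters.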
994 The above restrictions are given as "SHOULD"s, rather than as 995 "MUST"s. For IRIs that are never presented visually, they are not 996 relevant. However, for IRIs in general, they are very important to 997 ensure consistent conversion between visual presentation and logical 998 representation, in both directions. 1000 Note: In some components, the above restrictions may actually be 1001 strictly enforced. For example, [RFC3490] requires that these 1002 restrictions apply to the labels of a host name for those schemes 1003 where ireg-name is a host name. In some other components (for 1004 example, path components) following these restrictions may not be 1005 too difficult. For other components, such as parts of the query 1006 part, it may be very difficult to enforce the restrictions because 1007 the values of query parameters may be arbitrary character 1008 sequences. 1010 If the above restrictions cannot be satisfied otherwise, the affected 1011 component can always be mapped to URI notation as described in 1012 Section 3.3. Please note that the whole component has to be mapped 1013 (see also Example 9 below). 1015 4.3. Input of Bidi IRIs 1017 Bidi input methods MUST generate Bidi IRIs in logical order while 1018 rendering them according to Section 4.1. During input, rendering 1019 SHOULD be updated after every new character is input to avoid end- 1020 user confusion. 1022 4.4. Examples 1024 This section gives examples of bidirectional IRIs, in Bidi Notation. 1025 It shows legal IRIs with the relationship between logical and visual 1026 representation and explains how certain phenomena in this 1027 relationship may look strange to somebody not familiar with 1028 bidirectional behavior, but familiar to users of Arabic and Hebrew. 1029 It also shows what happens if the restrictions given in Section 4.2 1030 are not followed. The examples below can be seen at [BidiEx], in 1031 Arabic, Hebrew, and Bidi Notation variants. 
1033 To read the bidi text in the examples, read the visual representation 1034 from left to right until you encounter a block of rtl text. Read the 1035 rtl block (including slashes and other special characters) from right 1036 to left, then continue at the next unread ltr character. 1038 Example 1: A single component with rtl characters is inverted: 1039 Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html" 1040 Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html" 1041 Components can be read one by one, and each component can be read in 1042 its natural direction. 1044 Example 2: More than one consecutive component with rtl characters is 1045 inverted as a whole: 1046 Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html" 1047 Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html" 1048 A sequence of rtl components is read rtl, in the same way as a 1049 sequence of rtl words is read rtl in a bidi text. 1051 Example 3: All components of an IRI (except for the scheme) are rtl. 1052 All rtl components are inverted overall: 1053 Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV" 1054 Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA" 1055 The whole IRI (except the scheme) is read rtl. Delimiters between 1056 rtl components stay between the respective components; delimiters 1057 between ltr and rtl components don't move. 1059 Example 4: Each of several sequences of rtl components is inverted on 1060 its own: 1061 Logical representation: "http://AB.CD.ef/gh/IJ/KL.html" 1062 Visual representation: "http://DC.BA.ef/gh/LK/JI.html" 1063 Each sequence of rtl components is read rtl, in the same way as each 1064 sequence of rtl words in an ltr text is read rtl. 
1066 Example 5: Example 2, applied to components of different kinds: 1067 Logical representation: "http://ab.cd.EF/GH/ij/kl.html" 1068 Visual representation: "http://ab.cd.HG/FE/ij/kl.html" 1069 The inversion of the domain name label and the path component may be 1070 unexpected, but it is consistent with other bidi behavior. For 1071 reassurance that the domain component really is "ab.cd.EF", it may be 1072 helpful to read aloud the visual representation following the bidi 1073 algorithm. After "http://ab.cd." one reads the RTL block 1074 "E-F-slash-G-H", which corresponds to the logical representation. 1076 Example 6: Same as Example 5, with more rtl components: 1077 Logical representation: "http://ab.CD.EF/GH/IJ/kl.html" 1078 Visual representation: "http://ab.JI/HG/FE.DC/kl.html" 1079 The inversion of the domain name labels and the path components may 1080 be easier to identify because the delimiters also move. 1082 Example 7: A single rtl component includes digits: 1083 Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html" 1084 Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html" 1085 Numbers are written ltr in all cases but are treated as an additional 1086 embedding inside a run of rtl characters. This is completely 1087 consistent with usual bidirectional text. 1089 Example 8 (not allowed): Numbers are at the start or end of an rtl 1090 component: 1091 Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html" 1092 Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html" 1093 The sequence "1/2" is interpreted by the bidi algorithm as a 1094 fraction, fragmenting the components and leading to confusion. There 1095 are other characters that are interpreted in a special way close to 1096 numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":". 
1098 Example 9 (not allowed): The numbers in the previous example are 1099 percent-encoded: 1100 Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html", 1101 Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html" 1103 Example 10 (allowed but not recommended): 1104 Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html" 1105 Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html" 1106 Components consisting of only numbers are allowed (it would be rather 1107 difficult to prohibit them), but these may interact with adjacent RTL 1108 components in ways that are not easy to predict. 1110 Example 11 (allowed but not recommended): 1111 Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html" 1112 Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html" 1113 Components consisting of numbers and left-to-right characters are 1114 allowed, but these may interact with adjacent RTL components in ways 1115 that are not easy to predict. 1117 5. Use of IRIs 1119 5.1. Limitations on UCS Characters Allowed in IRIs 1121 This section discusses limitations on characters and character 1122 sequences usable for IRIs beyond those given in Section 2.2 and 1123 Section 4.1. The considerations in this section are relevant when 1124 IRIs are created and when URIs are converted to IRIs. 1126 a. The repertoire of characters allowed in each IRI component is 1127 limited by the definition of that component. For example, the 1128 definition of the scheme component does not allow characters 1129 beyond US-ASCII. 1131 (Note: In accordance with URI practice, generic IRI software 1132 cannot and should not check for such limitations.) 1134 b. The UCS contains many areas of characters for which there are 1135 strong visual look-alikes. Because of the likelihood of 1136 transcription errors, these also should be avoided. This includes 1137 the full-width equivalents of Latin characters, half-width 1138 Katakana characters for Japanese, and many others. 
It also 1139 includes many look-alikes of "space", "delims", and "unwise", 1140 characters excluded in [RFC3491]. 1142 Additional information is available from [UNIXML]. [UNIXML] is 1143 written in the context of running text rather than in that of 1144 identifiers. Nevertheless, it discusses many of the categories of 1145 characters not appropriate for IRIs. 1147 5.2. Software Interfaces and Protocols 1149 Although an IRI is defined as a sequence of characters, software 1150 interfaces for URIs typically function on sequences of octets or 1151 other kinds of code units. Thus, software interfaces and protocols 1152 MUST define which character encoding is used. 1154 Intermediate software interfaces between IRI-capable components and 1155 URI-only components MUST map the IRIs per Section 3.6, when 1156 transferring from IRI-capable to URI-only components. This mapping 1157 SHOULD be applied as late as possible. It SHOULD NOT be applied 1158 between components that are known to be able to handle IRIs. 1160 5.3. Format of URIs and IRIs in Documents and Protocols 1162 Document formats that transport URIs may have to be upgraded to allow 1163 the transport of IRIs. In cases where the document as a whole has a 1164 native character encoding, IRIs MUST also be encoded in this 1165 character encoding and converted accordingly by a parser or 1166 interpreter. IRI characters not expressible in the native character 1167 encoding SHOULD be escaped by using the escaping conventions of the 1168 document format if such conventions are available. Alternatively, 1169 they MAY be percent-encoded according to Section 3.6. For example, 1170 in HTML or XML, numeric character references SHOULD be used. If a 1171 document as a whole has a native character encoding and that 1172 character encoding is not UTF-8, then IRIs MUST NOT be placed into 1173 the document in the UTF-8 character encoding. 
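
As a hedged illustration of the escaping rule above: assuming an HTML
or XML document whose native character encoding is iso-8859-1, IRI
characters that the encoding cannot express could be written as
numeric character references. Python's standard "xmlcharrefreplace"
error handler does exactly this. The function name and the sample IRI
are hypothetical.

```python
def encode_iri_for_document(iri: str, document_encoding: str) -> bytes:
    """Encode an IRI in the document's native encoding, escaping the rest."""
    # Characters the encoding can express are encoded directly; all
    # others become decimal numeric character references (&#NNNN;).
    return iri.encode(document_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    iri = "http://www.example.org/r\u00e9sum\u00e9/\u7d0d\u8c46"
    print(encode_iri_for_document(iri, "iso-8859-1"))
```

In this sketch the e-acute characters are stored directly as
iso-8859-1 octets, while U+7D0D and U+8C46 are emitted as references.
Any "&" already present in the IRI still needs the document format's
own escaping.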
1175 ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs, 1176 although they use different terminology. HTML 4.0 [HTML4] defines 1177 the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 1178 [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications 1179 based upon them allow IRIs. Also, it is expected that all relevant 1180 new W3C formats and protocols will be required to handle IRIs 1181 [CharMod]. 1183 5.4. Use of UTF-8 for Encoding Original Characters 1185 This section discusses details and gives examples for point c) in 1186 Section 1.2. To be able to use IRIs, the URI corresponding to the 1187 IRI in question has to encode original characters into octets by 1188 using UTF-8. This can be specified for all URIs of a URI scheme or 1189 can apply to individual URIs for schemes that do not specify how to 1190 encode original characters. It can apply to the whole URI, or only 1191 to some part. For background information on encoding characters into 1192 URIs, see also Section 2.5 of [RFC3986]. 1194 For new URI schemes, using UTF-8 is recommended in [RFC4395bis]. 1195 Examples where UTF-8 is already used are the URN syntax [RFC2141], 1196 IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, 1197 because the HTTP URI scheme does not specify how to encode original 1198 characters, only some HTTP URLs can have corresponding but different 1199 IRIs. 1201 For example, for a document with a URI of 1202 "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to 1203 construct a corresponding IRI (in XML notation, see Section 1.4): 1204 "http://www.example.org/résumé.html" ("é" stands for 1205 the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent- 1206 encoded representation of that character). 
On the other hand, for a 1207 document with a URI of "http://www.example.org/r%E9sum%E9.html", the 1208 percent-encoding octets cannot be converted to actual characters in 1209 an IRI, as the percent-encoding is not based on UTF-8. 1211 For most URI schemes, there is no need to upgrade their scheme 1212 definition in order for them to work with IRIs. The main case where 1213 upgrading makes sense is when a scheme definition, or a particular 1214 component of a scheme, is strictly limited to the use of US-ASCII 1215 characters with no provision to include non-ASCII characters/octets 1216 via percent-encoding, or if a scheme definition currently uses highly 1217 scheme-specific provisions for the encoding of non-ASCII characters. 1218 An example of this is the mailto: scheme [RFC2368]. 1220 This specification updates the IANA registry of URI schemes to note 1221 their applicability to IRIs, see Section 8. All IRIs use URI 1222 schemes, and all URIs with URI schemes can be used as IRIs, even 1223 though in some cases only by using URIs directly as IRIs, without any 1224 conversion. 1226 Scheme definitions can impose restrictions on the syntax of scheme- 1227 specific URIs; i.e., URIs that are admissible under the generic URI 1228 syntax [RFC3986] may not be admissible due to narrower syntactic 1229 constraints imposed by a URI scheme specification. URI scheme 1230 definitions cannot broaden the syntactic restrictions of the generic 1231 URI syntax; otherwise, it would be possible to generate URIs that 1232 satisfied the scheme-specific syntactic constraints without 1233 satisfying the syntactic constraints of the generic URI syntax. 
1234 However, additional syntactic constraints imposed by URI scheme 1235 specifications are applicable to IRI, as the corresponding URI 1236 resulting from the mapping defined in Section 3.6 MUST be a valid URI 1237 under the syntactic restrictions of generic URI syntax and any 1238 narrower restrictions imposed by the corresponding URI scheme 1239 specification. 1241 The requirement for the use of UTF-8 generally applies to all parts 1242 of a URI. However, it is possible that the capability of IRIs to 1243 represent a wide range of characters directly is used just in some 1244 parts of the IRI (or IRI reference). The other parts of the IRI may 1245 only contain US-ASCII characters, or they may not be based on UTF-8. 1246 They may be based on another character encoding, or they may directly 1247 encode raw binary data (see also [RFC2397]). 1249 For example, it is possible to have a URI reference of 1250 "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the 1251 document name is encoded in iso-8859-1 based on server settings, but 1252 where the fragment identifier is encoded in UTF-8 according to 1253 [XPointer]. The IRI corresponding to the above URI would be (in XML 1254 notation) 1255 "http://www.example.org/r%E9sum%E9.xml#résumé". 1257 Similar considerations apply to query parts. The functionality of 1258 IRIs (namely, to be able to include non-ASCII characters) can only be 1259 used if the query part is encoded in UTF-8. 1261 5.5. Relative IRI References 1263 Processing of relative IRI references against a base is handled 1264 straightforwardly; the algorithms of [RFC3986] can be applied 1265 directly, treating the characters additionally allowed in IRI 1266 references in the same way that unreserved characters are in URI 1267 references. 1269 6. Liberal Handling of Otherwise Invalid IRIs 1271 (EDITOR NOTE: This Section may move to an appendix.) 
Some technical 1272 specifications and widely-deployed software have allowed additional 1273 variations and extensions of IRIs to be used in syntactic components. 1274 This section describes two widely-used preprocessing agreements. 1275 Other technical specifications may wish to reference a syntactic 1276 component which is "a valid IRI or a string that will map to a valid 1277 IRI after this preprocessing algorithm". These two variants are 1278 known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address 1279 [HTML5]. 1281 Future technical specifications SHOULD NOT allow conforming producers 1282 to produce, or conforming content to contain, such forms, as they are 1283 not interoperable with other IRI-consuming software. 1285 6.1. LEIRI Processing 1287 This section defines Legacy Extended IRIs (LEIRIs). The syntax of 1288 Legacy Extended IRIs is the same as that for , except 1289 that the ucschar production is replaced by the leiri-ucschar 1290 production: 1292 leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" 1293 / "\" / "^" / "`" / %x0-1F / %x7F-D7FF 1294 / %xE000-FFFD / %x10000-10FFFF 1296 Among other extensions, processors based on this specification also 1297 did not enforce the restriction on bidirectional formatting 1298 characters in Section 4.1, and the iprivate production becomes 1299 redundant. 1301 To convert a string allowed as a LEIRI to an IRI, each character 1302 allowed in leiri-ucschar but not in ucschar must be percent-encoded 1303 as described in Section 3.3. 1305 6.2. Web Address Processing 1307 Many popular web browsers have taken the approach of being quite 1308 liberal in what is accepted as a "URL" or its relative forms. This 1309 section describes their behavior in terms of a preprocessor which 1310 maps strings into the IRI space for subsequent parsing and 1311 interpretation as an IRI.
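As a rough illustration of such a preprocessor, the LEIRI-to-IRI conversion of Section 6.1 might be sketched in Python as shown here. This is only a sketch under stated assumptions, not a normative algorithm: the names UCSCHAR, LEIRI_LITERALS, LEIRI_RANGES, needs_escaping, and leiri_to_iri are hypothetical, the ucschar ranges are copied from RFC 3987 and assumed to be unchanged in this document, and percent-encoding is taken to mean conversion to UTF-8 followed by a "%HH" escape (uppercase hexadecimal) for each octet.

```python
# Hypothetical sketch of the Section 6.1 LEIRI-to-IRI conversion.
# Assumption: the ucschar ranges are those given in RFC 3987.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]  # planes 1-13
    + [(0xE1000, 0xEFFFD)]
)

# leiri-ucschar as listed in Section 6.1: ten literal characters plus ranges.
LEIRI_LITERALS = set(' <>"{}|\\^`')
LEIRI_RANGES = [(0x0, 0x1F), (0x7F, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def _in_ranges(cp, ranges):
    """Return True if code point cp falls inside any (low, high) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def needs_escaping(ch):
    """Return True if ch is allowed by leiri-ucschar but not by ucschar."""
    cp = ord(ch)
    in_leiri = ch in LEIRI_LITERALS or _in_ranges(cp, LEIRI_RANGES)
    return in_leiri and not _in_ranges(cp, UCSCHAR)


def leiri_to_iri(leiri):
    """Percent-encode (UTF-8, then %HH per octet) each LEIRI-only character."""
    out = []
    for ch in leiri:
        if needs_escaping(ch):
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)  # everything else is passed through untouched
    return "".join(out)
```

With these assumptions, leiri_to_iri("http://example.org/a b") yields "http://example.org/a%20b", while a character already permitted by ucschar, such as U+00E9, passes through unchanged. The sketch deliberately does not parse the string into components, so it says nothing about whether the result is otherwise a valid IRI.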
1313 In some situations, it might be appropriate to describe the syntax 1314 that a liberal consumer implementation might accept as a "Web 1315 Address" or "Hypertext Reference" or "HREF". However, technical 1316 specifications SHOULD restrict the syntactic form allowed by 1317 compliant producers to the IRI or IRI reference syntax defined in 1318 this document even if they want to mandate this processing. 1320 Summary: 1322 o Leading and trailing whitespace is removed. 1324 o Some additional characters are removed. 1326 o Some additional characters are allowed and escaped (as with 1327 LEIRI). 1329 o If interpreting an IRI as a URI, the pct-encoding of the query 1330 component of the parsed URI component depends on operational 1331 context. 1333 Each string provided may have an associated charset (called the HRef- 1334 charset here); this defaults to UTF-8. For web browsers interpreting 1335 HTML, the document charset of a string is determined as follows: 1337 If the string came from a script (e.g., as an argument to a method): 1338 The HRef-charset is the script's charset. 1340 If the string came from a DOM node (e.g., from an element): The node 1341 has a Document, and the HRef-charset is the Document's character 1342 encoding. 1344 If the string had an HRef-charset defined when the string was created 1345 or defined: The HRef-charset is as defined. 1347 If the resulting HRef-charset is a Unicode-based character encoding 1348 (e.g., UTF-16), then use UTF-8 instead. 1350 The syntax for Web Addresses is obtained by replacing the 'ucschar', 1351 pct-form, path-sep, and ifragment rules with the href-ucschar, href- 1352 pct-form, href-path-sep, and href-ifragment rules below. In 1353 addition, some characters are stripped. 1355 href-ucschar = " " / "<" / ">" / DQUOTE / "{" / "}" / "|" 1356 / "\" / "^" / "`" / %x0-1F / %x7F-D7FF 1357 / %xE000-FFFD / %x10000-10FFFF 1358 href-pct-form = pct-encoded / "%" 1359 href-path-sep = "/" / "\" 1360 href-ifragment = *( ipchar / "/" / "?"
/ "#" ) ; adding "#" 1361 href-strip = 1363 (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT 1364 SENTENCE) browsers did not enforce the restriction on bidirectional 1365 formatting characters in Section 4.1, and the iprivate production 1366 becomes redundant. 1368 'Web Address processing' requires the following additional 1369 preprocessing steps: 1371 1. Leading and trailing instances of space (U+0020), CR (U+000D), LF 1372 (U+000A), and TAB (U+0009) characters are removed. 1374 2. Strip all characters in href-strip. 1376 3. Percent-encode all characters in href-ucschar not in ucschar. 1378 4. Replace occurrences of "%" not followed by two hexadecimal digits 1379 by "%25". 1381 5. Convert backslashes ('\') matching href-path-sep to forward 1382 slashes ('/'). 1384 6.3. Characters Not Allowed in IRIs 1386 This section provides a list of the groups of characters and code 1387 points that are allowed by LEIRI or HREF but are not allowed in IRIs 1388 or are allowed in IRIs only in the query part. For each group of 1389 characters, advice on the usage of these characters is also given, 1390 concentrating on the reasons why they are excluded from IRI use. 1392 Space (U+0020): Some formats and applications use space as a 1393 delimiter, e.g. for items in a list. Appendix C of [RFC3986] also 1394 mentions that white space may have to be added when displaying or 1395 printing long URIs; the same applies to long IRIs. This means 1396 that spaces can disappear, or can cause what is intended as a 1397 single IRI or IRI reference to be treated as two or more separate 1398 IRIs. 1400 Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix 1401 C of [RFC3986] suggests the use of double-quotes 1402 ("http://example.com/") and angle brackets (<http://example.com/>) 1403 as delimiters for URIs in plain text. These conventions are often 1404 used, and also apply to IRIs.
Using these characters in strings 1405 intended to be IRIs would result in the IRIs being cut off at the 1406 wrong place. 1408 Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" 1409 (U+007B), "|" (U+007C), and "}" (U+007D): These characters 1410 were originally excluded from URIs because the respective 1411 codepoints are assigned to different graphic characters in some 1412 7-bit or 8-bit encodings. Despite the move to Unicode, some of 1413 these characters are still occasionally displayed differently on 1414 some systems, e.g., U+005C may appear as a Japanese Yen symbol. 1415 Also, the fact that these characters are not used 1416 in URIs or IRIs has encouraged their use outside URIs or IRIs in 1417 contexts that may include URIs or IRIs. If a string with such a 1418 character were used as an IRI in such a context, it would likely 1419 be interpreted piecemeal. 1421 The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - 1422 #x9F): There is generally no way to transmit these characters 1423 reliably as text outside of a charset encoding. Even when in 1424 encoded form, many software components silently filter out some of 1425 these characters, or may stop processing altogether when 1426 encountering some of them. These characters may affect text 1427 display in subtle, unnoticeable ways or in drastic, global, and 1428 irreversible ways depending on the hardware and software involved. 1429 The use of some of these characters would allow malicious users to 1430 manipulate the display of an IRI and its context in many 1431 situations. 1433 Bidi formatting characters (U+200E, U+200F, U+202A-202E): These 1434 characters affect the display ordering of characters. If IRIs 1435 were allowed to contain these characters and the resulting visual 1436 display were transcribed, they could not be converted back to 1437 electronic form (logical order) unambiguously.
These characters, 1438 if allowed in IRIs, might allow malicious users to manipulate the 1439 display of an IRI and its context. 1441 Specials (U+FFF0-FFFD): These code points provide functionality 1442 beyond that useful in an IRI, for example byte order 1443 identification, annotation, and replacements for unknown 1444 characters and objects. Their use and interpretation in an IRI 1445 would serve no purpose and might lead to confusing display 1446 variations. 1448 Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000- 1449 10FFFD): Display and interpretation of these code points is by 1450 definition undefined without private agreement. Therefore, these 1451 code points are not suited for use on the Internet. They are not 1452 interoperable and may have unpredictable effects. 1454 Tags (U+E0000-E0FFF): These characters provide a way to language 1455 tag in Unicode plain text. They are not appropriate for IRIs 1456 because language information in identifiers cannot reliably be 1457 input, transmitted (e.g. on a visual medium such as paper), or 1458 recognized. 1460 Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, 1461 U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, 1462 U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, 1463 U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, 1464 U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as 1465 non-characters. Applications may use some of them internally, but 1466 are not prepared to interchange them. 1468 LEIRI preprocessing disallowed some code points and code units: 1470 Surrogate code units (D800-DFFF): These do not represent Unicode 1471 codepoints. 1473 7.
URI/IRI Processing Guidelines (Informative) 1475 This informative section provides guidelines for supporting IRIs in 1476 the same software components and operations that currently process 1477 URIs: Software interfaces that handle URIs, software that allows 1478 users to enter URIs, software that creates or generates URIs, 1479 software that displays URIs, formats and protocols that transport 1480 URIs, and software that interprets URIs. These may all require 1481 modification before functioning properly with IRIs. The 1482 considerations in this section also apply to URI references and IRI 1483 references. 1485 7.1. URI/IRI Software Interfaces 1487 Software interfaces that handle URIs, such as URI-handling APIs and 1488 protocols transferring URIs, need interfaces and protocol elements 1489 that are designed to carry IRIs. 1491 In case the current handling in an API or protocol is based on US- 1492 ASCII, UTF-8 is recommended as the character encoding for IRIs, as it 1493 is compatible with US-ASCII, is in accordance with the 1494 recommendations of [RFC2277], and makes converting to URIs easy. In 1495 any case, the API or protocol definition must clearly define the 1496 character encoding to be used. 1498 The transfer from URI-only to IRI-capable components requires no 1499 mapping, although the conversion described in Section 3.7 above may 1500 be performed. It is preferable not to perform this inverse 1501 conversion unless it is certain this can be done correctly. 1503 7.2. URI/IRI Entry 1505 Some components allow users to enter URIs into the system by typing 1506 or dictation, for example. This software must be updated to allow 1507 for IRI entry. 1509 A person viewing a visual representation of an IRI (as a sequence of 1510 glyphs, in some order, in some visual display) or hearing an IRI will 1511 use an entry method for characters in the user's language to input 1512 the IRI. 
Depending on the script and the input method used, this may 1513 be a more or less complicated process. 1515 The process of IRI entry must ensure, as much as possible, that the 1516 restrictions defined in Section 2.2 are met. This may be done by 1517 choosing appropriate input methods or variants/settings thereof, by 1518 appropriately converting the characters being input, by eliminating 1519 characters that cannot be converted, and/or by issuing a warning or 1520 error message to the user. 1522 As an example of variant settings, input method editors for East 1523 Asian Languages usually allow the input of Latin letters and related 1524 characters in full-width or half-width versions. For IRI input, the 1525 input method editor should be set so that it produces half-width 1526 Latin letters and punctuation and full-width Katakana. 1528 An input field primarily or solely used for the input of URIs/IRIs 1529 might allow the user to view an IRI as it is mapped to a URI. Places 1530 where the input of IRIs is frequent may provide the possibility for 1531 viewing an IRI as mapped to a URI. This will help users when some of 1532 the software they use does not yet accept IRIs. 1534 An IRI input component interfacing to components that handle URIs, 1535 but not IRIs, must map the IRI to a URI before passing it to these 1536 components. 1538 For the input of IRIs with right-to-left characters, please see 1539 Section 4.3. 1541 7.3. URI/IRI Transfer between Applications 1543 Many applications (for example, mail user agents) try to detect URIs 1544 appearing in plain text. For this, they use some heuristics based on 1545 URI syntax. They then allow the user to click on such URIs and 1546 retrieve the corresponding resource in an appropriate (usually 1547 scheme-dependent) application. 1549 Such applications would need to be upgraded, in order to use the IRI 1550 syntax as a base for heuristics. 
In particular, a non-ASCII 1551 character should not be taken as the indication of the end of an IRI. 1552 Such applications also would need to make sure that they correctly 1553 convert the detected IRI from the character encoding of the document 1554 or application where the IRI appears, to the character encoding used 1555 by the system-wide IRI invocation mechanism, or to a URI (according 1556 to Section 3.6) if the system-wide invocation mechanism only accepts 1557 URIs. 1559 The clipboard is another frequently used way to transfer URIs and 1560 IRIs from one application to another. On most platforms, the 1561 clipboard is able to store and transfer text in many languages and 1562 scripts. Correctly used, the clipboard transfers characters, not 1563 octets, which will do the right thing with IRIs. 1565 7.4. URI/IRI Generation 1567 Systems that offer resources through the Internet, where those 1568 resources have logical names, sometimes automatically generate URIs 1569 for the resources they offer. For example, some HTTP servers can 1570 generate a directory listing for a file directory and then respond to 1571 the generated URIs with the files. 1573 Many legacy character encodings are in use in various file systems. 1574 Many currently deployed systems do not transform the local character 1575 representation of the underlying system before generating URIs. 1577 For maximum interoperability, systems that generate resource 1578 identifiers should make the appropriate transformations. For 1579 example, if a file system contains a file named "résumé.html", 1580 a server should expose this as "r%C3%A9sum%C3%A9.html" in 1581 a URI, which allows use of "résumé.html" in an IRI, even if 1582 locally the file name is kept in a character encoding other than 1583 UTF-8. 1585 This recommendation particularly applies to HTTP servers. For FTP 1586 servers, similar considerations apply; see [RFC2640]. 1588 7.5.
URI/IRI Selection 1590 In some cases, resource owners and publishers have control over the 1591 IRIs used to identify their resources. This control is mostly 1592 executed by controlling the resource names, such as file names, 1593 directly. 1595 In these cases, it is recommended to avoid choosing IRIs that are 1596 easily confused. For example, for US-ASCII, the lower-case ell ("l") 1597 is easily confused with the digit one ("1"), and the upper-case oh 1598 ("O") is easily confused with the digit zero ("0"). Publishers 1599 should avoid confusing users with "br0ken" or "1ame" identifiers. 1601 Outside the US-ASCII repertoire, there are many more opportunities 1602 for confusion; a complete set of guidelines is too lengthy to include 1603 here. As long as names are limited to characters from a single 1604 script, native writers of a given script or language will know best 1605 when ambiguities can appear, and how they can be avoided. What may 1606 look ambiguous to a stranger may be completely obvious to the average 1607 native user. On the other hand, in some cases, the UCS contains 1608 variants for compatibility reasons; for example, for typographic 1609 purposes. These should be avoided wherever possible. Although there 1610 may be exceptions, newly created resource names should generally be 1611 in NFKC [UTR15] (which means that they are also in NFC). 1613 As an example, the UCS contains the "fi" ligature at U+FB01 for 1614 compatibility reasons. Wherever possible, IRIs should use the two 1615 letters "f" and "i" rather than the "fi" ligature. An example where 1616 the latter may be used is in the query part of an IRI for an explicit 1617 search for a word written containing the "fi" ligature. 1619 In certain cases, there is a chance that characters from different 1620 scripts look the same. The best known example is the similarity of 1621 the Latin "A", the Greek "Alpha", and the Cyrillic "A". 
To avoid 1622 such cases, IRIs should only be created where all the characters in a 1623 single component are used together in a given language. This usually 1624 means that all of these characters will be from the same script, but 1625 there are languages that mix characters from different scripts (such 1626 as Japanese). This is similar to the heuristics used to distinguish 1627 between letters and numbers in the examples above. Also, for Latin, 1628 Greek, and Cyrillic, using lowercase letters results in fewer 1629 ambiguities than using uppercase letters would. 1631 7.6. Display of URIs/IRIs 1633 In situations where the rendering software is not expected to display 1634 non-ASCII parts of the IRI correctly using the available layout and 1635 font resources, these parts should be percent-encoded before being 1636 displayed. 1638 For display of Bidi IRIs, please see Section 4.1. 1640 7.7. Interpretation of URIs and IRIs 1642 Software that interprets IRIs as the names of local resources should 1643 accept IRIs in multiple forms and convert and match them with the 1644 appropriate local resource names. 1646 First, multiple representations include both IRIs in the native 1647 character encoding of the protocol and also their URI counterparts. 1649 Second, it may include URIs constructed based on character encodings 1650 other than UTF-8. These URIs may be produced by user agents that do 1651 not conform to this specification and that use legacy character 1652 encodings to convert non-ASCII characters to URIs. Whether this is 1653 necessary, and what character encodings to cover, depends on a number 1654 of factors, such as the legacy character encodings used locally and 1655 the distribution of various versions of user agents. For example, 1656 software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in 1657 addition to UTF-8. 1659 Third, it may include additional mappings to be more user-friendly 1660 and robust against transmission errors. 
These would be similar to 1661 how some servers currently treat URIs as case insensitive or perform 1662 additional matching to account for spelling errors. For characters 1663 beyond the US-ASCII repertoire, this may, for example, include 1664 ignoring the accents on received IRIs or resource names. Please note 1665 that such mappings, including case mappings, are language dependent. 1667 It can be difficult to identify a resource unambiguously if too many 1668 mappings are taken into consideration. However, percent-encoded and 1669 not percent-encoded parts of IRIs can always be clearly 1670 distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes 1671 the potential for collisions lower than it may seem at first. 1673 7.8. Upgrading Strategy 1675 Where this recommendation places further constraints on software for 1676 which many instances are already deployed, it is important to 1677 introduce upgrades carefully and to be aware of the various 1678 interdependencies. 1680 If IRIs cannot be interpreted correctly, they should not be created, 1681 generated, or transported. This suggests that upgrading URI 1682 interpreting software to accept IRIs should have highest priority. 1684 On the other hand, a single IRI is interpreted only by a single or 1685 very few interpreters that are known in advance, although it may be 1686 entered and transported very widely. 1688 Therefore, IRIs benefit most from a broad upgrade of software to be 1689 able to enter and transport IRIs. However, before an individual IRI 1690 is published, care should be taken to upgrade the corresponding 1691 interpreting software in order to cover the forms expected to be 1692 received by various versions of entry and transport software. 1694 The upgrade of generating software to generate IRIs instead of using 1695 a local character encoding should happen only after the service is 1696 upgraded to accept IRIs. 
Similarly, IRIs should only be generated 1697 when the service accepts IRIs and the intervening infrastructure and 1698 protocols are known to transport them safely. 1700 Software converting from URIs to IRIs for display should be upgraded 1701 only after upgraded entry software has been widely deployed to the 1702 population that will see the displayed result. 1704 Where there is a free choice of character encodings, it is often 1705 possible to reduce the effort and dependencies for upgrading to IRIs 1706 by using UTF-8 rather than another encoding. For example, when a new 1707 file-based Web server is set up, using UTF-8 as the character 1708 encoding for file names will make the transition to IRIs easier. 1709 Likewise, when a new Web form is set up using UTF-8 as the character 1710 encoding of the form page, the returned query URIs will use UTF-8 as 1711 the character encoding (unless the user, for whatever reason, changes 1712 the character encoding) and will therefore be compatible with IRIs. 1714 These recommendations, when taken together, will allow for the 1715 extension from URIs to IRIs in order to handle characters other than 1716 US-ASCII while minimizing interoperability problems. For 1717 considerations regarding the upgrade of URI scheme definitions, see 1718 Section 5.4. 1720 8. IANA Considerations 1722 RFC Editor and IANA note: Please replace RFC XXXX with the number of 1723 this document when it issues as an RFC. 1725 IANA maintains a registry of "URI schemes". A "URI scheme" also 1726 serves as an "IRI scheme". 1728 To clarify that the URI scheme registration process also applies to 1729 IRIs, change the description of the "URI schemes" registry header to 1730 say "[RFC4395] defines an IANA-maintained registry of URI Schemes. 1731 These registries include the Permanent and Provisional URI Schemes. 1732 RFC XXXX updates this registry to designate that schemes may also 1733 indicate their usability as IRI schemes."
1735 Update "per RFC 4395" to "per RFC 4395 and RFC XXXX". 1737 9. Security Considerations 1739 The security considerations discussed in [RFC3986] also apply to 1740 IRIs. In addition, the following issues require particular care for 1741 IRIs. 1743 Incorrect encoding or decoding can lead to security problems. For 1744 example, some UTF-8 decoders do not check against overlong byte 1745 sequences. See [UTR36] Section 3 for details. 1747 There are serious difficulties with relying on a human to verify that 1748 an IRI (whether presented visually or aurally) is the same as 1749 another IRI or is the one intended. These problems exist with ASCII- 1750 only URIs (bl00mberg.com vs. bloomberg.com) but are strongly 1751 exacerbated when using the much larger character repertoire of 1752 Unicode. For details, see Section 2 of [UTR36]. Using 1753 administrative and technical means to reduce the availability of such 1754 exploits is possible, but they are difficult to eliminate altogether. 1755 User agents SHOULD NOT rely on visual or perceptual comparison or 1756 verification of IRIs as a means of validating or assuring safety, 1757 correctness or appropriateness of an IRI. Other means of presenting 1758 users with the validity, safety, or appropriateness of visited sites 1759 are being developed in the browser community as an alternative means 1760 of avoiding these difficulties. 1762 Besides the large character repertoire of Unicode, reasons for 1763 confusion include different forms of normalization and different 1764 normalization expectations, use of percent-encoding with various 1765 legacy encodings, and bidirectionality issues. See also [UTR36]. 1767 Confusion can occur in various IRI components, such as the domain 1768 name part or the path part, or between IRI components. For 1769 considerations specific to the domain name part, see [RFC5890].
For 1770 considerations specific to particular protocols or schemes, see the 1771 security sections of the relevant specifications and registration 1772 templates. Administrators of sites that allow independent users to 1773 create resources in the same sub area have to be careful. Details 1774 are discussed in Section 7.5. 1776 Confusion can occur with bidirectional IRIs, if the restrictions in 1777 Section 4.2 are not followed. The same visual representation may be 1778 interpreted as different logical representations, and vice versa. It 1779 is also very important that a correct Unicode bidirectional 1780 implementation be used. 1782 The characters additionally allowed in Legacy Extended IRIs introduce 1783 additional security issues. For details, see Section 6.3. 1785 10. Acknowledgements 1787 This document was derived from [RFC3987]; the acknowledgments from 1788 that specification still apply. 1790 We would like to thank Ian Hickson, Michael Sperberg-McQueen, and Dan 1791 Connolly for their work on HyperText References, and Norman Walsh, 1792 Richard Tobin, Henry S. Thompson, John Cowan, Paul Grosso, and the XML 1793 Core Working Group of the W3C for their work on LEIRIs. 1795 In addition, this document was influenced by contributions from (in 1796 no particular order) Chris Lilley, Bjoern Hoehrmann, Felix Sasaki, 1797 Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp, 1798 Matitiahu Allouche, Richard Ishida, Addison Phillips, Jonathan 1799 Rosenne, Najib Tounsi, Debbie Garside, Mark Davis, Sarmad Hussain, 1800 Ted Hardie, Konrad Lanz, Thomas Roessler, Lisa Dusseault, Julian 1801 Reschke, Giovanni Campagna, Anne van Kesteren, Mark Nottingham, Erik 1802 van der Poel, Marcin Hanclik, Marcos Caceres, Roy Fielding, Greg 1803 Wilkins, Pieter Hintjens, Daniel R. Tobias, Marko Martin, Maciej 1804 Stachowiak, Wil Tan, Yui Naruse, Michael A.
Puls II, Dave Thaler, 1805 Tom Petch, John Klensin, Shawn Steele, Peter Saint-Andre, Geoffrey 1806 Sneddon, Chris Weber, Alex Melnikov, Slim Amamou, S. Moonesamy, Tim 1807 Berners-Lee, Yaron Goland, Sam Ruby, Adam Barth, Abdulrahman I. 1808 ALGhadir, Aharon Lanin, Thomas Milo, Murray Sargent, Marc Blanchet, 1809 and Mykyta Yevstifeyev. 1811 11. Main Changes Since RFC 3987 1813 This section describes the main changes since [RFC3987]. 1815 11.1. Major restructuring of IRI processing model 1817 Major restructuring of IRI processing model to make scheme-specific 1818 translation necessary to handle IDNA requirements and for consistency 1819 with web implementations. 1821 Starting with an IRI, you want one of: 1823 a IRI components (IRI parsed into UTF-8 pieces) 1825 b URI components (URI parsed into ASCII pieces, encoded correctly) 1827 c whole URI (for passing on to some other system that wants whole 1828 URIs) 1830 11.1.1. OLD WAY 1832 1. Pct-encode the whole thing to a URI. (c1) If you want a 1833 (maybe broken) whole URI, you might stop here. 1835 2. Parse the URI into URI components. (b1) If you want (maybe 1836 broken) URI components, stop here. 1838 3. Decode the components (undoing the pct-encoding). (a) If you want 1839 IRI components, stop here. 1841 4. Reencode: either using a different encoding for some components (for 1842 domain names, and query components in web pages, which depends on 1843 the component, scheme and context), and otherwise using pct- 1844 encoding. (b2) If you want (good) URI components, stop here. 1846 5. Reassemble the reencoded components. (c2) If you want a (*good*) 1847 whole URI, stop here. 1849 11.1.2. NEW WAY 1851 1. Parse the IRI into IRI components using the generic syntax. (a) 1852 If you want IRI components, stop here. 1854 2. Encode each component, using pct-encoding, IDN encoding, or 1855 special query part encoding depending on the component, scheme, or 1856 context. (b) If you want URI components, stop here. 1858 3.
Reassemble a whole URI from the URI components. (c) If you want a 1859 whole URI, stop here. 1861 11.1.3. Extension of Syntax 1863 Added the tag range (U+E0000-E0FFF) to the iprivate production. Some 1864 IRIs generated with the new syntax may fail to pass very strict 1865 checks relying on the old syntax. But characters in this range 1866 should be extremely infrequent anyway. 1868 11.1.4. More to be added 1870 TODO: There are more main changes that need to be documented in this 1871 section. 1873 11.2. Change Log 1875 Note to RFC Editor: Please completely remove this section before 1876 publication. 1878 11.2.1. Changes after draft-ietf-iri-3987bis-01 1880 Changes from draft-ietf-iri-3987bis-01 onwards are available as 1881 changesets in the IETF tools subversion repository at http:// 1882 trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/ 1883 draft-ietf-iri-3987bis.xml. 1885 11.2.2. Changes from draft-duerst-iri-bis-07 to 1886 draft-ietf-iri-3987bis-00 1888 Changed draft name, date, last paragraph of abstract, and titles in 1889 change log, and added this section in moving from 1890 draft-duerst-iri-bis-07 (personal submission) to 1891 draft-ietf-iri-3987bis-00 (WG document). 1893 11.2.3. Changes from -06 to -07 of draft-duerst-iri-bis 1895 Major restructuring of the processing model, see Section 11.1. 1897 11.3. Changes from -00 to -01 1899 o Removed 'mailto:' before mail addresses of authors. 1901 o Added "" as right side of 'href-strip' rule. Fixed 1902 '|' to '/' for alternatives. 1904 11.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 1906 o Add HyperText Reference, change abstract, acks and references for 1907 it 1909 o Add Masinter back as another editor. 1911 o Masinter integrates HRef material from HTML5 spec. 1913 o Rewrite introduction sections to modernize. 1915 11.5. Changes from -04 to -05 of draft-duerst-iri-bis 1917 o Updated references. 1919 o Changed IPR text to pre5378Trust200902. 1921 11.6.
Changes from -03 to -04 of draft-duerst-iri-bis 1923 o Added explicit abbreviation for LEIRIs. 1925 o Mentioned LEIRI references. 1927 o Completed text in LEIRI section about tag characters and about 1928 specials. 1930 11.7. Changes from -02 to -03 of draft-duerst-iri-bis 1932 o Updated some references. 1934 o Updated Michel Suignard's coordinates. 1936 11.8. Changes from -01 to -02 of draft-duerst-iri-bis 1938 o Added tag range to iprivate (issue private-include-tags-115). 1940 o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs. 1942 11.9. Changes from -00 to -01 of draft-duerst-iri-bis 1944 o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" 1945 based on input from the W3C XML Core WG. Moved the relevant 1946 subsections to the back and promoted them to a section. 1948 o Added some text re. Legacy Extended IRIs to the security section. 1950 o Added an IANA Consideration Section. 1952 o Added this Change Log Section. 1954 o Added a section about "IRIs with Spaces/Controls" (converting from 1955 a Note in RFC 3987). 1957 11.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis 1959 Fixed errata (see 1960 http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987). 1962 12. References 1964 12.1. Normative References 1966 [ASCII] American National Standards Institute, "Coded Character 1967 Set -- 7-bit American Standard Code for Information 1968 Interchange", ANSI X3.4, 1986. 1970 [ISO10646] 1971 International Organization for Standardization, "ISO/IEC 1972 10646:2003: Information Technology - Universal Multiple- 1973 Octet Coded Character Set (UCS)", ISO Standard 10646, 1974 December 2003. 1976 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1977 Requirement Levels", BCP 14, RFC 2119, March 1997. 1979 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1980 "Internationalizing Domain Names in Applications (IDNA)", 1981 RFC 3490, March 2003. 1983 [RFC3491] Hoffman, P. and M.
Blanchet, "Nameprep: A Stringprep 1984 Profile for Internationalized Domain Names (IDN)", 1985 RFC 3491, March 2003. 1987 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1988 10646", STD 63, RFC 3629, November 2003. 1990 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1991 Resource Identifier (URI): Generic Syntax", STD 66, 1992 RFC 3986, January 2005. 1994 [RFC5890] Klensin, J., "Internationalized Domain Names for 1995 Applications (IDNA): Definitions and Document Framework", 1996 RFC 5890, August 2010. 1998 [RFC5891] Klensin, J., "Internationalized Domain Names in 1999 Applications (IDNA): Protocol", RFC 5891, August 2010. 2001 [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax 2002 Specifications: ABNF", STD 68, RFC 5234, January 2008. 2004 [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard 2005 Annex #9, March 2004, 2006 . 2008 [UNIV6] The Unicode Consortium, "The Unicode Standard, Version 2009 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 2010 ISBN 978-1-936213-01-6)", October 2010. 2012 [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", 2013 Unicode Standard Annex #15, March 2008, 2014 . 2017 12.2. Informative References 2019 [BidiEx] "Examples of bidirectional IRIs", 2020 . 2022 [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. 2023 Texin, "Character Model for the World Wide Web: Resource 2024 Identifiers", World Wide Web Consortium Candidate 2025 Recommendation, November 2004, 2026 . 2028 [Duerst97] 2029 Duerst, M., "The Properties and Promises of UTF-8", Proc. 2030 11th International Unicode Conference, San Jose, 2031 September 1997, . 2034 [Gettys] Gettys, J., "URI Model Consequences", 2035 . 2037 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 2038 Specification", World Wide Web Consortium Recommendation, 2039 December 1999, 2040 . 2042 [HTML5] Hickson, I. and D.
Hyatt, "A vocabulary and associated 2043 APIs for HTML and XHTML", World Wide Web 2044 Consortium Working Draft, April 2009, 2045 . 2047 [LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended 2048 IRIs for XML resource identification", World Wide Web 2049 Consortium Note, November 2008, 2050 . 2052 [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform 2053 Resource Locators (URL)", RFC 1738, December 1994. 2055 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2056 Extensions (MIME) Part One: Format of Internet Message 2057 Bodies", RFC 2045, November 1996. 2059 [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., 2060 Atkinson, R., Crispin, M., and P. Svanberg, "The Report of 2061 the IAB Character Set Workshop held 29 February - 1 March, 2062 1996", RFC 2130, April 1997. 2064 [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. 2066 [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. 2068 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 2069 Languages", BCP 18, RFC 2277, January 1998. 2071 [RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The mailto 2072 URL scheme", RFC 2368, July 1998. 2074 [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. 2076 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2077 Resource Identifiers (URI): Generic Syntax", RFC 2396, 2078 August 1998. 2080 [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, 2081 August 1998. 2083 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 2084 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 2085 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 2087 [RFC2640] Curtin, B., "Internationalization of the File Transfer 2088 Protocol", RFC 2640, July 1999. 2090 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource 2091 Identifiers (IRIs)", RFC 3987, January 2005. 2093 [RFC4395bis] 2094 Hansen, T., Hardie, T., and L. 
Masinter, "Guidelines and 2095 Registration Procedures for New URI/IRI Schemes", 2096 draft-hansen-iri-4395bis-irireg-00 (work in progress), 2097 September 2010. 2099 [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on 2100 Encodings for Internationalized Domain Names", RFC 6055, 2101 February 2011. 2103 [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other 2104 Markup Languages", Unicode Technical Report #20, World 2105 Wide Web Consortium Note, June 2003, 2106 . 2108 [UTR36] Davis, M. and M. Suignard, "Unicode Security 2109 Considerations", Unicode Technical Report #36, 2110 August 2010, . 2112 [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking 2113 Language (XLink) Version 1.0", World Wide Web 2114 Consortium Recommendation, June 2001, 2115 . 2117 [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and 2118 F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth 2119 Edition)", World Wide Web Consortium Recommendation, 2120 August 2006, . 2122 [XMLNamespace] 2123 Bray, T., Hollander, D., Layman, A., and R. Tobin, 2124 "Namespaces in XML (Second Edition)", World Wide Web 2125 Consortium Recommendation, August 2006, 2126 . 2128 [XMLSchema] 2129 Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", 2130 World Wide Web Consortium Recommendation, May 2001, 2131 . 2133 [XPointer] 2134 Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer 2135 Framework", World Wide Web Consortium Recommendation, 2136 March 2003, 2137 . 2139 Authors' Addresses 2141 Martin Duerst 2142 Aoyama Gakuin University 2143 5-10-1 Fuchinobe 2144 Sagamihara, Kanagawa 229-8558 2145 Japan 2147 Phone: +81 42 759 6329 2148 Fax: +81 42 759 6495 2149 Email: duerst@it.aoyama.ac.jp 2150 URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ 2152 Michel Suignard 2153 Unicode Consortium 2154 P.O. Box 391476 2155 Mountain View, CA 94039-1476 2156 U.S.A.
2158 Phone: +1-650-693-3921 2159 Email: michel@unicode.org 2160 URI: http://www.suignard.com 2162 Larry Masinter 2163 Adobe 2164 345 Park Ave 2165 San Jose, CA 95110 2166 U.S.A. 2168 Phone: +1-408-536-3024 2169 Email: masinter@adobe.com 2170 URI: http://larry.masinter.net