idnits 2.17.1

draft-weber-iri-guidelines-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 11, 2011) is 4644 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC5137' is mentioned on line 103, but not defined

  == Missing Reference: 'RFC3896' is mentioned on line 186, but not defined

  == Missing Reference: 'RFC4395' is mentioned on line 360, but not defined

  ** Obsolete undefined reference: RFC 4395 (Obsoleted by RFC 7595)

  == Missing Reference: 'RFC6068' is mentioned on line 381, but not defined

  == Missing Reference: 'RFC2397' is mentioned on line 385, but not defined

  == Unused Reference: '8' is defined on line 452, but no explicit reference
     was found in the text

  == Outdated reference: A later version (-13) exists of
     draft-ietf-iri-3987bis-05

     Summary: 1 error (**), 0 flaws (~~), 8 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

2    Network Working Group                                           C. Weber
3    Internet-Draft                                           Casaba Security
4    Intended status: Standards Track                           July 11, 2011
5    Expires: January 12, 2012

7     Guidelines for Implementers of Internationalized Resource Identifiers
8                                     (IRIs)
9                         draft-weber-iri-guidelines-01

11   Abstract

13      Some members of the implementation community have expressed confusion
14      about the rules and algorithms for processing Internationalized
15      Resource Identifiers (IRIs).  This document aims to clarify these
16      matters and improve interoperability around IRI processing by
17      summarizing the steps required to prepare and parse arbitrary Unicode
18      strings as Internationalized Resource Identifiers.  Further goals of
19      this document are to define limited scheme-specific rules around IRI
20      processing and to define the steps required for producing the
21      canonical form of an IRI.

23   Status of this Memo

25      This Internet-Draft is submitted in full conformance with the
26      provisions of BCP 78 and BCP 79.

28      Internet-Drafts are working documents of the Internet Engineering
29      Task Force (IETF).  Note that other groups may also distribute
30      working documents as Internet-Drafts.  The list of current Internet-
31      Drafts is at http://datatracker.ietf.org/drafts/current/.

33      Internet-Drafts are draft documents valid for a maximum of six months
34      and may be updated, replaced, or obsoleted by other documents at any
35      time.  It is inappropriate to use Internet-Drafts as reference
36      material or to cite them other than as "work in progress."

38      This Internet-Draft will expire on January 12, 2012.

40   Copyright Notice

42      Copyright (c) 2011 IETF Trust and the persons identified as the
43      document authors.  All rights reserved.

45      This document is subject to BCP 78 and the IETF Trust's Legal
46      Provisions Relating to IETF Documents
47      (http://trustee.ietf.org/license-info) in effect on the date of
48      publication of this document.  Please review these documents
49      carefully, as they describe your rights and restrictions with respect
50      to this document.  Code Components extracted from this document must
51      include Simplified BSD License text as described in Section 4.e of
52      the Trust Legal Provisions and are provided without warranty as
53      described in the Simplified BSD License.

55   Table of Contents

57      1.  Introduction
58      2.  Terminology
59      3.  Sources
60      4.  Pre-processing Arbitrary Unicode Strings
61      5.  Parsing Unicode Strings into IRI Components
62        5.1.  Identify the scheme
63        5.2.  Identify the authority
64          5.2.1.  Identify the userinfo
65          5.2.2.  Identify the host
66          5.2.3.  Identify the port
67        5.3.  Identify the path
68        5.4.  Identify the query
69        5.5.  Identify the fragment
70      6.  Scheme-Specific Processing
71        6.1.  http
72        6.2.  javascript
73        6.3.  mailto
74        6.4.  data
75      7.  IRI Canonicalization
76        7.1.  Producing a valid URI from an IRI
77      8.  Security Considerations
78      9.  IANA Considerations
79      10. Acknowledgements
80      11. References
81        11.1.  Informative References
82        11.2.  Normative References
83      Author's Address

85   1.  Introduction

87      Internationalized Resource Identifiers (IRIs) extend the Uniform
88      Resource Identifier (URI) specification [RFC3986] by opening up the
89      authority, path, query, and fragment components to the character
90      space available in Unicode/ISO 10646.  Arbitrary Unicode strings may
91      be prepared and parsed into IRI sub-components which map directly to
92      the same URI sub-components.

94      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
95      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
96      "OPTIONAL" in this document are to be interpreted as described in
97      [RFC2119].

99   2.  Terminology

101     This section defines terminology used in this document.  Unicode
102     characters are referred to using the notation described in section 3
103     of [RFC5137].  A Unicode code point is represented as U+NNNN where
104     the NNNN string consists of the code point's hexadecimal numbers.

106     reference-string
107        The original and unprocessed input string being considered as an IRI
108        reference.

110     pre-processed-reference-string
111        The reference-string that has been through the pre-processing steps.

113     relative reference
114        The term "relative reference" used in this document can be
115        interpreted according to the rules in section 4.2 of [RFC3986] using
116        the additionally allowed characters that [RFC3987] permits in each
117        component.

119     percent-encode
120        Convert each octet of a sequence to %HH, where HH is the hexadecimal
121        notation of the octet value.

123  3.  Sources

125     This document makes reference to the following sources of information
126     about the parsing of IRIs:

128     1:  Internationalized Resource Identifiers (IRIs) as specified in
129         http://tools.ietf.org/html/draft-ietf-iri-3987bis-05 [RFC3987]
130     2:  Parsing URLs for Fun and Profit,
131         http://tools.ietf.org/html/draft-abarth-url-01 [2]
132     3:  HTML Living Standard: URLs, http://www.whatwg.org/specs/web-apps/
133         current-work/multipage/urls.html#urls [3]
134     4:  Change proposal for ISSUE-56,
135         http://lists.w3.org/Archives/Public/public-html/2010Jul/0036.html
136         [4]
137     5:  URIs, URLs, and URNs: Clarifications and Recommendations 1.0,
138         http://www.w3.org/TR/uri-clarification/ [5]
139     6:  HTML CHANGE PROPOSAL; change definition of URL to normative
140         reference to IRIBIS,
141         http://lists.w3.org/Archives/Public/public-html/2010Feb/0882.html
142         [6]
143     7:  BIDI URL Display, https://docs.google.com/document/d/
144         1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en [7]

146  4.  Pre-processing Arbitrary Unicode Strings

148     This section describes the pre-processing steps required to prepare
149     an arbitrary Unicode reference-string for later parsing into IRI sub-
150     components.

152     1.  Remove leading and trailing instances of ASCII whitespace
153         characters U+0020 SPACE, U+000D CARRIAGE RETURN (CR), U+000A LINE
154         FEED (LF), and U+0009 CHARACTER TABULATION from the string.  Note
155         that Unicode has many more characters that are considered
156         whitespace, none of which are affected or considered in this
157         rule.
158         Reference: See Section 7.2 of [RFC3987] for details.
159     2.  If more than one reference is allowed, split the string into
160         substrings on blocks of contiguous U+0020 SPACE characters.  Each
161         one of these substrings is an independent reference-string and
162         will be processed individually.  If more than one reference is
163         not allowed, either remove blocks of contiguous whitespace or
164         replace each U+0020 SPACE with a single percent-encoded U+0020
165         SPACE, written as "%20", depending on what is required for the
166         current context.
167         Reference: Mentioned in [4].
168     3.  If the current string is not already in a Unicode encoding, then
169         transcode the string to a Unicode encoding such as UTF-8,
170         UTF-16, or UTF-32.
171         Reference: Section 3.1 of [RFC3987].
172     4.  TODO: Should numerical character references be replaced with
173         their corresponding character? e.g. would become

176         during this step.  Or should this step of unescaping be limited
177         to the set?

179     This is the pre-processed-reference-string ready for parsing.

181  5.  Parsing Unicode Strings into IRI Components

183     With an arbitrary IRI string that has been through pre-processing,
184     referred to as the "pre-processed-reference-string", this section
185     describes the subsequent process of parsing the string into its five
186     major IRI sub-components using rules defined by [RFC3986] (using an
187     algorithm equivalent to Appendix B of [RFC3986]) but with updated
188     ABNF of [RFC3987].  These rules are summarized here.
189     Reference: Section 3.2 of [RFC3987].

191  5.1.  Identify the scheme

193     If the current string does not contain a ":" U+003A then the string
194     does not contain a scheme and the pre-processed-reference-string may
195     be handled as a relative reference according to the rules in section
196     4.2 of [RFC3986] using the additionally allowed characters that
197     [RFC3987] permits.

199     o  Continue to "Identify the path"
200     o  Abort further scheme processing

202     If the first character of the string is not an ALPHA then this is
203     not a valid scheme and the pre-processed-reference-string may be
204     handled as a relative reference.

206     o  Continue to "Identify the path"
207     o  Abort further scheme processing

209     Consume all characters up to but not including the first occurrence
210     of ":" U+003A.  If the consumed substring contains any characters
211     other than < ALPHA / DIGIT / "+" / "-" / "." > then it is not a valid
212     scheme and the pre-processed-reference-string may be handled as a
213     relative reference.

215     o  Continue to "Identify the path"
216     o  Abort further scheme processing

218     The consumed substring at this point is the scheme.  Skip over the
219     ":" U+003A character.  Continue parsing the remaining string as an
220     authority part.

222  5.2.  Identify the authority

224     The URI authority component may contain userinfo, a host, and a port.
225     If the current string does not begin with the two characters "//"
226     U+002F U+002F then the string is not an authority and may be handled
227     as a path sub-component.

229     o  Continue to "Identify the path"
230     o  Abort further authority processing

232     Consume up to the first occurrence of any one of the authority's
233     terminating characters "/" U+002F, "?" U+003F, "#" U+0023, or the
234     end of the string.  This is the authority, also known as the
235     iauthority under IRI RFC3987.  Continue further parsing of the
236     authority to identify the userinfo, host, and port parts.

238  5.2.1.  Identify the userinfo

240     The userinfo may come in the form of a username and password rendered
241     as "user:password".  The userinfo part may be parsed according to the
242     rules of RFC3986 Section 3.2.1 with the updated ABNF for iuserinfo in
243     RFC3987.

245     If the authority does not contain a "@" U+0040 then the string does
246     not contain a userinfo part and the authority may be parsed for a
247     host and port part.

249     o  Continue to "Identify the host"
250     o  Abort further userinfo processing

252     From the beginning of the authority, consume each character up to but
253     not including the first occurrence of "@" U+0040.  This is the
254     userinfo.
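
As an informal illustration that is not part of the original draft, the
steps described so far (whitespace trimming from Section 4, then
identification of the scheme, authority, and userinfo) could be sketched in
Python roughly as follows.  The function names preprocess, split_scheme, and
split_authority are hypothetical, transcoding and the multiple-reference case
are left out, and ALPHA and DIGIT are assumed to be the ASCII ranges defined
by RFC 3986.

```python
import string

# Section 4, step 1: only these four ASCII whitespace characters are trimmed.
ASCII_WS = " \r\n\t"  # U+0020, U+000D, U+000A, U+0009
# Assumption: ALPHA / DIGIT are the ASCII ranges from RFC 3986.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def preprocess(reference_string):
    """Trim leading and trailing ASCII whitespace; other Unicode
    whitespace (for example U+00A0) is deliberately left alone."""
    return reference_string.strip(ASCII_WS)


def split_scheme(s):
    """Return (scheme, rest). scheme is None when the string should be
    handled as a relative reference (Section 5.1)."""
    colon = s.find(":")
    if colon == -1:
        return None, s                      # no ":" at all
    candidate = s[:colon]
    if not candidate or candidate[0] not in string.ascii_letters:
        return None, s                      # first character is not ALPHA
    if any(ch not in SCHEME_CHARS for ch in candidate):
        return None, s                      # character outside the scheme set
    return candidate, s[colon + 1:]         # skip over the ":"


def split_authority(s):
    """Return (userinfo, hostport, rest) for the text after the scheme.
    userinfo and hostport are None when there is no authority."""
    if not s.startswith("//"):
        return None, None, s                # treat as a path (Section 5.2)
    end = len(s)
    for terminator in "/?#":
        pos = s.find(terminator, 2)
        if pos != -1:
            end = min(end, pos)
    authority, rest = s[2:end], s[end:]
    at = authority.find("@")                # first "@" ends the userinfo
    if at == -1:
        return None, authority, rest
    return authority[:at], authority[at + 1:], rest
```

For example, under these assumptions
split_authority("//user:pw@example.org:8080/p?q#f") yields the userinfo
"user:pw", leaves "example.org:8080" for host and port parsing, and leaves
"/p?q#f" for the path, query, and fragment steps.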

256     Further parsing of the userinfo is scheme-specific.

258     Skip over the first occurrence of "@" U+0040 following the userinfo,
259     and continue parsing the remaining authority to identify the host.

261  5.2.2.  Identify the host

263     The host part of authority may contain an IP-literal, IPv4address, or
264     a reg-name according to the ABNF rules of RFC3986 updated to support
265     Unicode characters in the ireg-name as described in RFC3987.  Consume
266     all characters up to but not including the last ":" U+003A character
267     or the end of authority.

269     If this substring is determined to be an IP-literal or IPv4address,
270     then the consumed characters are the host.

272     o  If the authority did contain a ":" then continue parsing the
273        remaining authority including the ":" character according to the
274        "Identify the port" section
275     o  Abort further host processing

277     Else the substring is determined to be an ireg-name according to the
278     ABNF naming convention from RFC3987.  This is the host.

280     o  If the authority did contain a ":" then the remaining authority
281        including the ":" character will be processed according to the
282        "Identify the port" section.
283     o  Continue processing the host

285     The host SHOULD be processed according to the rules of IDNA2008, but
286     MAY be processed according to UTS46 or IDNA2003.  If the host is in
287     DNS Internet dot-notation then its labels SHOULD be converted to
288     Punycode.  This is the host.
289     TODO: Error handling.  Mention leaving the host name in pure Unicode
290     form for intranet/local name scenarios that don't use DNS, e.g.
291     WINS?

293  5.2.3.  Identify the port

295     Further processing SHOULD skip the first occurrence of ":" U+003A and
296     consume the remaining characters.  If these characters are not *DIGIT
297     then the port is invalid.

299     o  Continue to "Identify the path"
300     o  Abort further port processing

302     Else this is the port.

304  5.3.  Identify the path

306     Consume the remaining pre-processed-reference-string up to but not
307     including the first occurrence of a terminating character "?"
308     U+003F, "#" U+0023, or the end of the string.

310     Percent-encode all characters present from the ucschar list.

312     If the path contains any characters not allowed by the ABNF of
313     RFC3987 Section 2.2 or Section 7.2 then replace those characters with
314     their percent-encoding.

316     This is the path.

318     If the terminating character was "?" then process the remaining
319     string including the leading "?" according to "Identify the query".

321     If the terminating character was "#" then process the remaining
322     string including the leading "#" according to "Identify the
323     fragment".

325     TODO: Handling of special characters "/" and "\".

327  5.4.  Identify the query

329     Consume the remaining string starting with the leading "?" and up to
330     but not including the first occurrence of "#" or the end of the
331     string.

333     Percent-encode all characters present from the ucschar list.

335     If the query contains any characters not allowed by the ABNF of
336     [RFC3987] Section 2.2 or the lists in Section 7.2 then replace those
337     characters with their percent-encoding.

339     This is the query component.

341     If the terminating character was "#" then process the remaining
342     string including the leading "#" according to "Identify the
343     fragment".

345     TODO: Handling of special characters "&", "?", "=", and "/"

347  5.5.  Identify the fragment

349     Consume the remaining string starting with the leading "#" and to the
350     end of the string.

352     Percent-encode all characters present from the ucschar list.

354     This is the fragment.

356     TODO: Handling of special characters "?" and "/"

358  6.  Scheme-Specific Processing

360     TODO Apply limited scheme-specific rules.  Reference [RFC4395]

362  6.1.  http

364     (NOTE: Taken directly from [RFC3987]) For compatibility with existing
365     deployed HTTP infrastructure, the following special case applies for
366     schemes "http" and "https" and IRIs whose origin has a document
367     charset other than one which is UCS-based (e.g., UTF-8 or UTF-16).
368     In such a case, the "query" component of an IRI is mapped into a URI
369     by using the document charset rather than UTF-8 as the binary
370     representation before pct-encoding.  This mapping is not applied for
371     any other scheme or component.
372     Reference: Section 3.5 and 7.2 of [RFC3987].

374  6.2.  javascript

376     TODO reference?
377     http://tools.ietf.org/html/draft-hoehrmann-javascript-scheme-03

379  6.3.  mailto

381     TODO reference [RFC6068]

383  6.4.  data

385     TODO reference [RFC2397]

387  7.  IRI Canonicalization

389     To follow.

391     [NOTE: Call out Special Case here or earlier?  The U+005C should be
392     either percent-encoded or converted to U+002F.  Of course, folks who
393     want to name such files will want to use the escaped form of \ in
394     order for their site to work in other browsers.]

396  7.1.  Producing a valid URI from an IRI

398     For each character which is not allowed anywhere in a valid URI,
399     apply the following steps.
400     Reference: Section 3.5 of [RFC3987].
401     o  Convert the IRI to the UTF-8 encoding, i.e., convert the character
402        to a sequence of one or more octets using UTF-8 [RFC3629].
403        TODO: What about the query and fragment components?  Should the
404     o  For each IRI component, percent-encode each UTF-8 octet
405        representing each character that is not allowed in the same URI
406        component.  In general this will include the set of characters
407        specified by ucschar of [RFC3987].

409  8.  Security Considerations

411     To follow.

413  9.  IANA Considerations

415     This document has no actions for the IANA.

417  10.  Acknowledgements

419     Many thanks to Mykyta Yevstifeyev, Addison Phillips, Mark Davis, Anne
420     van Kesteren, Adam Barth, Martin Duerst, and Julian Reschke for their
421     feedback.

423  11.  References

425  11.1.  Informative References

427     [2]        Barth, A., "How Browsers Process URLs",
428                draft-abarth-url-01 (work in progress), April 2011.

431     [3]        Hickson, I., "HTML Living Standard: URLs", WHATWG ?, 2011.

435     [4]        Fielding, R., "Change proposal for ISSUE-56", July 2010, <
436                http://lists.w3.org/Archives/Public/public-html/2010Jul/
437                0036.html>.

439     [5]        W3C, "URIs, URLs, and URNs: Clarifications and
440                Recommendations 1.0", W3C Note 21 September 2001,
441                September 2001.

443     [6]        Masinter, L., "HTML CHANGE PROPOSAL; change definition of
444                URL to normative reference to IRIBIS", February 2010.

448     [7]        Davis, M., "Revision of UBA for improved display of URL/
449                IRIs", May 2011.

452     [8]        Davis, M. and M. Duerst, "URLs", April 2003.

456  11.2.  Normative References

458     [RFC3987]  Duerst, M., Suignard, M., and L. Masinter,
459                "Internationalized Resource Identifiers (IRIs)",
460                draft-ietf-iri-3987bis-05 (work in progress), March 2011.

464     [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
465                Resource Identifier (URI)", Internet-Standard rfc3986,
466                January 2005.

468     [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
469                10646", Internet-Standard rfc3629, November 2003.

472     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
473                Requirement Levels", BCP 14, RFC 2119, March 1997.

475  Author's Address

477     Chris Weber
478     Casaba Security
479     16625 Redmond Wa, Suite M348
480     Redmond, WA 98052
481     USA

483     Email: chris@lookout.net
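
Illustrative sketch (not part of the original draft).  The path, query, and
fragment steps of Sections 5.3 through 5.5, together with the UTF-8
percent-encoding step of Section 7.1, could be approximated in Python as
shown below.  The function names are hypothetical.  Two simplifications are
assumed: the "?" and "#" delimiters are not kept as part of the returned
components, and every non-ASCII character is treated as if it were in the
ucschar set, so ASCII characters that the ABNF disallows are not handled.

```python
def split_path_query_fragment(s):
    """Return (path, query, fragment); query and fragment are None when
    the corresponding delimiter is absent."""
    fragment = None
    hash_pos = s.find("#")            # the fragment runs to the end of string
    if hash_pos != -1:
        s, fragment = s[:hash_pos], s[hash_pos + 1:]
    query = None
    q_pos = s.find("?")               # first "?" before any "#" ends the path
    if q_pos != -1:
        s, query = s[:q_pos], s[q_pos + 1:]
    return s, query, fragment


def percent_encode_non_ascii(component):
    """Percent-encode the UTF-8 octets of each non-ASCII character.
    Simplification: all non-ASCII characters are encoded, which is a
    superset of ucschar; ASCII characters pass through unchanged."""
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


path, query, fragment = split_path_query_fragment("/r\u00e9sum\u00e9?x=1#top")
encoded_path = percent_encode_non_ascii(path)   # "/r%C3%A9sum%C3%A9"
```

Splitting on "#" first and then on "?" gives the same components as the
sequential description in Section 5, because a "?" that appears after the
first "#" belongs to the fragment rather than starting a query.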