iri                                                             A. Barth
Internet-Draft                                              Google, Inc.
Intended status: Standards Track                          April 23, 2011
Expires: October 25, 2011


                    Parsing URLs for Fun and Profit
                          draft-abarth-url-01

Abstract

   This document contains a precise specification of how browsers
   process URLs.  The behavior specified in this document might or might
   not match any particular browser, but browsers might be well-served
   by adopting the behavior defined herein.

Editorial Note (To be removed by RFC Editor)

   If you have suggestions for improving this document, please send
   email to .  Further Working Group information is available from .

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 25, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the BSD
   License.

Table of Contents

   1.  Open Issues
   2.  Definitions
   3.  Parsing a URL
     3.1.  Finding the scheme
     3.2.  Finding the authority, path, query, and fragment
     3.3.  Finding the user-info, host, and port
     3.4.  Find the user name and password
   4.  Resolving a string relative to a base URL
     4.1.  Resolving a string as a relative URL
     4.2.  Resolving a string as a scheme-relative URL
     4.3.  Resolving a string as an authority-relative URL
     4.4.  Resolving a string as a path-relative URL
     4.5.  Resolving a string as a query-relative URL
     4.6.  Resolving a string as a fragment-relative URL
   5.  Canonicalizing a URL
     5.1.  Canonicalizing a Scheme
     5.2.  Canonicalizing a User-Info
     5.3.  Canonicalizing a Host
       5.3.1.  Host Escape Normalization
     5.4.  Canonicalizing a Path
     5.5.  Canonicalizing a Query
     5.6.  Canonicalizing a Fragment
   Appendix A.  Acknowledgements
   Author's Address

1.  Open Issues

   Browsers parse URLs differently depending on which operating system
   they're running on.  The problem is that they want to do sensible
   things for file paths, but file paths look different on Windows and
   Unix systems.

   How should we handle cases where browsers disagree with the regular
   expression in RFC 3986?  Currently, this document aims to describe
   how browsers behave, but we'll likely need to compare that to RFC
   3986 at some point.  Some specific differences that have been brought
   up on the mailing list:

   o  http:///example.com/

   o  http://example.com;

2.  Definitions

   A control character is a character whose value is less than or equal
   to U+0020 (" ").

   A slash character is either U+???? ("/") or U+???? ("\").  TODO:
   There's some question as to whether this is necessary for non-file
   URLs.

   An authority terminating character is either a slash character,
   U+???? ("?"), U+???? ("#"), or U+???? (";").  TODO: Why is ";" on
   this list?

   During a parsing algorithm, the remaining string is the characters of
   the input that have not yet been consumed.

3.  Parsing a URL

   Given a string of characters, consume all leading and trailing
   control characters.

   Find the scheme, as described in Section ??.

   If the algorithm for finding the scheme determines that the URL is
   invalid:

      ->  Abort these steps.

   If the scheme is a single upper or lower case ASCII character (TODO:
   Just ALPHA?):

      ->  TODO: Windows drive specs!

   If the scheme is an ASCII case-insensitive match for "file":

      ->  TODO: File URLs!

   If the scheme is an ASCII case-insensitive match for "mailto":

      ->  TODO: I think mailto URLs are special, but more testing is
          required.

   If the scheme is hierarchical:

      ->  In the after-scheme, if any, find the authority, path, query,
          and fragment, as described in Section ??.

      ->  In the authority, if any, find the user-info, host, and port,
          as described in Section ??.

      ->  In the user-info, if any, find the user name and password, as
          described in Section ??.

      ->  Abort these steps.

   The remaining string is the path.  TODO: This might not be the best
   approach.  We need to do more testing of data and javascript URLs.

3.1.  Finding the scheme

   If the remaining string does not contain a ":" character:

      ->  The URL is invalid.

      ->  Abort these steps.

   Consume characters up to, but not including, the first ":" character.
   These characters are the scheme.

   Consume the ":" character.

   The remaining characters are the after-scheme.

3.2.  Finding the authority, path, query, and fragment

   Consume any number of slash characters.

   If the remaining string does not contain any authority terminating
   characters:

      ->  The remaining string is the authority.

      ->  Abort these steps.

   Consume characters up to, but not including, the first authority
   terminating character.  The consumed characters are the authority.

   If the remaining string does not contain a "?" character or a "#"
   character:

      ->  The remaining string is the path.

      ->  Abort these steps.

   Consume characters up to, but not including, the first "?" or "#"
   character.  The consumed characters are the path.

   If the first character of the remaining string is a "?" character:

      ->  Consume the "?" character.

      ->  If the remaining string does not contain a "#" character:

          ->  The remaining string is the query.

          ->  Abort these steps.

      ->  Consume characters up to, but not including, the first "#"
          character.  The consumed characters are the query.

   Consume the "#" character.

   The remaining string is the fragment.

3.3.  Finding the user-info, host, and port

   If the remaining string contains an "@" character:

      ->  Consume characters up to, but not including, the *last* "@"
          character.
          The consumed characters are the user-info.

      ->  Consume the "@" character.

   If the remaining string does not contain a ":" character:

      ->  The remaining string is the host.

      ->  Abort these steps.

   If the first character of the remaining string is a "[" character,
   the remaining string contains a "]" character, and the last ":"
   character in the remaining string occurs before the last "]"
   character in the remaining string:

      ->  The remaining string is the host.

      ->  Abort these steps.

   Consume characters up to, but not including, the last ":" character.
   The consumed characters are the host.

   Consume the ":" character.

   The remaining string is the port.

3.4.  Find the user name and password

   If the remaining string does not contain a ":" character:

      ->  The remaining string is the user name.

      ->  Abort these steps.

   Consume characters up to, but not including, the first ":" character.
   The consumed characters are the user name.

   Consume the ":" character.

   The remaining string is the password.

4.  Resolving a string relative to a base URL

   Given a string relative-url and a ParsedURL base-url, find the scheme
   of relative-url.

   TODO: We probably need to trim leading and trailing control
   characters.

   If relative-url is an invalid URL:

      ->  The resolved URL is relative-url resolved as a relative URL.

      ->  Abort these steps.

   If relative-url's scheme contains any characters which are not "valid
   scheme characters" (TODO: Define valid scheme characters):

      ->  The resolved URL is relative-url resolved as a relative URL.

      ->  Abort these steps.

   If base-url's scheme is an ASCII case-insensitive match for
   relative-url's scheme and the shared scheme is hierarchical:

      ->  The resolved URL is relative-url's after-scheme resolved as a
          relative URL.

      ->  Abort these steps.

   The resolved URL is relative-url parsed as an absolute URL.

4.1.  Resolving a string as a relative URL

   Given a string relative-url and a ParsedURL base-url, determine the
   resolved URL as follows:

   TODO: If base-url's scheme is not hierarchical, we can't resolve as a
   relative URL.  We'll probably want to return an invalid URL.  Check
   what happens when resolving an empty string as a relative URL with a
   non-hierarchical base.

   If relative-url is empty:

      ->  The resolved URL is identical to base-url, with the fragment,
          if any, removed.

      ->  Abort these steps.

   If the first character of relative-url is a slash character:

      ->  If relative-url has at least two characters and the second
          character is also a slash character:

          ->  The resolved URL is relative-url resolved as a
              scheme-relative URL.

          Otherwise:

          ->  The resolved URL is relative-url resolved as an
              authority-relative URL.

      ->  Abort these steps.

   If the first character of relative-url is a "?" character:

      ->  The resolved URL is relative-url resolved as a query-relative
          URL.

      ->  Abort these steps.

   If the first character of relative-url is a "#" character:

      ->  The resolved URL is relative-url resolved as a
          fragment-relative URL.

      ->  Abort these steps.

   TODO: Think about the case where the relative-url is empty.

   The resolved URL is relative-url resolved as a path-relative URL.

4.2.  Resolving a string as a scheme-relative URL

   Given a string relative-url and a ParsedURL base-url, let
   resolved-url be

   o  base-url's scheme

   o  concatenated with ":",

   o  concatenated with relative-url.

   The resolved URL is resolved-url parsed as an absolute URL.

4.3.  Resolving a string as an authority-relative URL

   Given a string relative-url and a ParsedURL base-url, let
   resolved-url be

   o  base-url's scheme

   o  concatenated with "://",

   o  concatenated with base-url's authority,

   o  concatenated with relative-url.

   The resolved URL is resolved-url parsed as an absolute URL.

4.4.  Resolving a string as a path-relative URL

   TODO: Can the first character of relative-url be a slash character at
   this point?

   TODO: Can we assume base-url is canonicalized here so that it always
   has at least one "/" character?

   Let the directory-name be the characters of the base-url's path up to
   and including the last slash character.

   Let resolved-url be

   o  base-url's scheme

   o  concatenated with "://",

   o  concatenated with base-url's authority,

   o  concatenated with directory-name,

   o  concatenated with relative-url.

   The resolved URL is resolved-url parsed as an absolute URL.

4.5.  Resolving a string as a query-relative URL

   Given a string relative-url and a ParsedURL base-url, let
   resolved-url be

   o  base-url's scheme

   o  concatenated with "://",

   o  concatenated with base-url's authority,

   o  concatenated with base-url's path,

   o  concatenated with relative-url.

   The resolved URL is resolved-url parsed as an absolute URL.

4.6.  Resolving a string as a fragment-relative URL

   Given a string relative-url and a ParsedURL base-url, let
   resolved-url be

   o  base-url's scheme

   o  concatenated with "://",

   o  concatenated with base-url's authority,

   o  concatenated with base-url's path,

   o  concatenated with "?",

   o  concatenated with base-url's query,

   o  concatenated with relative-url.

   The resolved URL is resolved-url parsed as an absolute URL.

5.  Canonicalizing a URL

   This section describes how to construct a canonical version of a
   parsed URL string.  TODO: We probably should mention somewhere that
   there is *not* a unique canonicalization for every URL.

   Given parsed URL original-url, if original-url is invalid:

      ->  Abort these steps.

   TODO: Handle file URLs.

   If the scheme is hierarchical:

      Output the canonicalized scheme (as described in Section ??).

      Output "://".

      If the user-info is non-empty:

         Output the canonicalized user-info (as described in Section
         ??).

         Output "@".

      Output the canonicalized host (as described in Section ??).

      Let the canonicalized-port be the canonicalized port (as described
      in Section ??).

      If the canonicalized-port is non-empty and is not the default port
      for the scheme:

         Output ":".

         Output the canonicalized-port.

      Output the canonicalized path (as described in Section ??).

      Let the canonicalized-query be the canonicalized query (as
      described in Section ??).

      If the canonicalized-query is non-empty (TODO: Distinguish between
      empty and non-existent queries):

         Output "?".

         Output the canonicalized-query.

      Let the canonicalized-fragment be the canonicalized fragment (as
      described in Section ??).

      If the canonicalized-fragment is non-empty (TODO: Distinguish
      between empty and non-existent fragments):

         Output "#".

         Output the canonicalized-fragment.

5.1.  Canonicalizing a Scheme

   If the first character of the scheme is not in ALPHA, the scheme is
   invalid.

   Process each character of the scheme in sequence:

      If the current character is among ALPHA, DIGIT, "+", "-", and ".":

         ->  Output the current character.

      Otherwise, if the current character is "%":

         ->  The scheme is invalid.

         ->  Output the current character.

      Otherwise:

         ->  The scheme is invalid.

         ->  Output the utf8-percent-escaping of the current character.

5.2.  Canonicalizing a User-Info

   Process each character of the user name in sequence:

      If the current character is among TODO:

         ->  Output the current character.

      Otherwise:

         ->  Output the utf8-percent-escaping of the current character.

   If there is no password or if the password is empty:

      ->  Abort these steps.

   Output ":".

   Process each character of the password in sequence:

      If the current character is among TODO:

         ->  Output the current character.

      Otherwise:

         ->  Output the utf8-percent-escaping of the current character.

5.3.  Canonicalizing a Host

   TODO: Handle IP addresses.

   Let unicode-host be the host-escape-normalized host (see Section ??).

   Output the result of applying the IDNA to-ascii algorithm to the
   unicode-host.  TODO: Properly reference IDNA's to-ascii algorithm (we
   might need a wrapper like we do in the cookie spec).

5.3.1.  Host Escape Normalization

   host-escaped = U+0000-U+002A / U+002C / U+002F / U+003B-U+0040 /
                  U+005C / U+005E / U+0060 / U+007B-U+007F

   Process each character of the host in sequence:

      If the current character is "%":

         ->  TODO: Handle percent-unescaping.

      If the current character matches host-escaped:

         ->  Output the utf8-percent-escaping of the current character.

      Otherwise, if the current character matches ALPHA:

         ->  Output the current character converted to lower case.

      Otherwise:

         ->  Output the current character.

5.4.  Canonicalizing a Path

   TODO: Do we need to ensure that paths always start with a slash
   character?

   If the path is empty:

      ->  Output "/" and abort these steps.

   path-escaped = U+0000-U+0020 / U+0022-U+0023 / U+0025 / U+003C /
                  U+003E / U+003F / U+005C / U+005E / U+0060 /
                  U+007B-U+007D / U+007F
   path-unescaped = "-" / DIGIT / ALPHA / "_" / "~"

   Process each character of the path in sequence:

      If the current character matches path-escaped or is greater than
      or equal to U+0080:

         ->  Output the utf8-percent-escaping of the current character.

      Otherwise, if the current character is ".":

         ->  TODO: Handle "." collapsing.

      Otherwise, if the current character is "\":

         ->  Output "/".

      Otherwise, if the current character is "%":

         ->  TODO: Handle percent-unescaping.

      Otherwise:

         ->  Output the current character.

5.5.  Canonicalizing a Query

   TODO: Handle the ambient encoding case.

   Process each character of the query in sequence:

      If the current character is among TODO:

         ->  Output the current character.

      Otherwise:

         ->  Output the utf8-percent-escaping of the current character.
             TODO: We need to handle the goofy query escaping format.

5.6.  Canonicalizing a Fragment

   Process each character of the fragment in sequence:

      If the current character has a Unicode value greater than or equal
      to U+0020:

         ->  Output the current character.

      Otherwise:

         ->  Output the utf8-percent-escaping of the current character.

   Note: The above algorithm results in the canonicalized fragment
   containing non-US-ASCII characters.

Appendix A.  Acknowledgements

   TODO

Author's Address

   Adam Barth
   Google, Inc.

   Email: ietf@adambarth.com
   URI:   http://www.adambarth.com/
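
   Illustrative sketch (non-normative): the parsing steps for finding
   the scheme, the authority, path, query, and fragment, the user-info,
   host, and port, and the user name and password might be sketched in
   Python roughly as follows.  The ParsedURL class and the function
   names first_index, split_authority, and parse_hierarchical are
   hypothetical and are not defined by this document.  The sketch
   assumes the scheme is hierarchical and leaves out the cases marked
   TODO above (Windows drive specs, file URLs, and mailto URLs).

```python
from dataclasses import dataclass

# Slash characters and authority terminating characters, following the
# Definitions section of this document.
SLASHES = "/\\"
AUTHORITY_TERMINATORS = SLASHES + "?#;"
# Control characters: every character up to and including U+0020.
CONTROL_CHARACTERS = "".join(map(chr, range(0x21)))


@dataclass
class ParsedURL:
    # Hypothetical container; field names follow the terms used in the
    # parsing sections, but the class is an assumption of this sketch.
    scheme: str = ""
    user_name: str = ""
    password: str = ""
    host: str = ""
    port: str = ""
    path: str = ""
    query: str = ""
    fragment: str = ""


def first_index(text, chars):
    """Index of the first character of text found in chars, else len(text)."""
    return next((i for i, c in enumerate(text) if c in chars), len(text))


def split_authority(authority):
    """Return (user name, password, host, port) for an authority string."""
    user_name = password = port = ""
    if "@" in authority:
        # The user-info runs up to the *last* "@" character.
        user_info, authority = authority.rsplit("@", 1)
        # The user name runs up to the *first* ":" character.
        user_name, _, password = user_info.partition(":")
    host = authority
    if ":" in authority:
        bracketed = authority.startswith("[") and "]" in authority
        # If the last ":" sits before the last "]", it is part of a
        # bracketed host and there is no port.
        if not (bracketed and authority.rfind(":") < authority.rfind("]")):
            host, port = authority.rsplit(":", 1)
    return user_name, password, host, port


def parse_hierarchical(url):
    """Parse url assuming a hierarchical scheme; return None if invalid."""
    url = url.strip(CONTROL_CHARACTERS)
    if ":" not in url:
        return None  # No ":" character: the URL is invalid.
    scheme, _, rest = url.partition(":")
    parsed = ParsedURL(scheme=scheme)
    # Consume any number of slash characters.
    rest = rest.lstrip(SLASHES)
    # The authority ends at the first authority terminating character.
    end = first_index(rest, AUTHORITY_TERMINATORS)
    authority, rest = rest[:end], rest[end:]
    (parsed.user_name, parsed.password,
     parsed.host, parsed.port) = split_authority(authority)
    # The path ends at the first "?" or "#" character.
    end = first_index(rest, "?#")
    parsed.path, rest = rest[:end], rest[end:]
    if rest.startswith("?"):
        # The query runs up to the first "#"; the rest is the fragment.
        parsed.query, _, parsed.fragment = rest[1:].partition("#")
    elif rest.startswith("#"):
        parsed.fragment = rest[1:]
    return parsed
```

   Under these assumptions, parse_hierarchical("http:///example.com/")
   yields the host "example.com", because any number of leading slash
   characters is consumed before the authority is read.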