idnits 2.17.1 

draft-abarth-url-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Introduction section.

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 3 instances of too long lines in the document, the longest one
     being 3 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (April 23, 2011) is 4751 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

     No issues found here.

     Summary: 5 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	iri                                                             A. Barth
3	Internet-Draft                                              Google, Inc.
4	Intended status: Standards Track                          April 23, 2011
5	Expires: October 25, 2011

7	                    Parsing URLs for Fun and Profit
8	                          draft-abarth-url-01

10	Abstract

12	   This document contains a precise specification of how browsers
13	   process URLs.  The behavior specified in this document might or might
14	   not match any particular browser, but browsers might be well-served
15	   by adopting the behavior defined herein.

17	Editorial Note (To be removed by RFC Editor)

19	   If you have suggestions for improving this document, please send
20	   email to <mailto:public-iri@w3.org>.  Further Working Group
21	   information is available from <https://tools.ietf.org/wg/iri/>.

23	Status of this Memo

25	   This Internet-Draft is submitted to IETF in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF), its areas, and its working groups.  Note that
30	   other groups may also distribute working documents as Internet-
31	   Drafts.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   The list of current Internet-Drafts can be accessed at
39	   http://www.ietf.org/ietf/1id-abstracts.txt.

41	   The list of Internet-Draft Shadow Directories can be accessed at
42	   http://www.ietf.org/shadow.html.

44	   This Internet-Draft will expire on October 25, 2011.

46	Copyright Notice

48	   Copyright (c) 2011 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (http://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the BSD License.

61	Table of Contents

63	   1.  Open Issues
64	   2.  Definitions
65	   3.  Parsing a URL
66	     3.1.  Finding the scheme
67	     3.2.  Finding the authority, path, query, and fragment
68	     3.3.  Finding the user-info, host, and port
69	     3.4.  Find the user name and password
70	   4.  Resolving a string relative to a base URL
71	     4.1.  Resolving a string as a relative URL
72	     4.2.  Resolving a string as a scheme-relative URL
73	     4.3.  Resolving a string as an authority-relative URL
74	     4.4.  Resolving a string as a path-relative URL
75	     4.5.  Resolving a string as a query-relative URL
76	     4.6.  Resolving a string as a fragment-relative URL
77	   5.  Canonicalizing a URL
78	     5.1.  Canonicalizing a Scheme
79	     5.2.  Canonicalizing a User-Info
80	     5.3.  Canonicalizing a Host
81	       5.3.1.  Host Escape Normalization
82	     5.4.  Canonicalizing a Path
83	     5.5.  Canonicalizing a Query
84	     5.6.  Canonicalizing a Fragment
85	   Appendix A.  Acknowledgements
86	   Author's Address

88	1.  Open Issues

90	   Browsers parse URLs differently depending on which operating system
91	   they're running on.  The problem is that they want to do sensible
92	   things for file paths, but file paths look different on Windows and
93	   Unix systems.

95	   How should we handle cases where browsers disagree with the regular
96	   expression in RFC 3986?  Currently, this document aims to describe
97	   how browsers behave, but we'll likely need to compare that to RFC
98	   3986 at some point.  Some specific differences that have been brought
99	   up on the mailing list:

101	   o  http:///example.com/

103	   o  http://example.com;

105	2.  Definitions

107	   A control character is a character whose value is less than or equal
108	   to U+0020 (" ").

110	   A slash character is either U+002F ("/") or U+005C ("\").  TODO:
111	   There's some question as to whether this is necessary for non-file
112	   URLs.

114	   An authority terminating character is a slash character, U+003F
115	   ("?"), U+0023 ("#"), or U+003B (";").  TODO: Why is ";" on this list?

117	   During a parsing algorithm, the remaining string is the characters of
118	   the input that have not yet been consumed.
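
	   As a minimal sketch of these definitions, the character classes
	   could be written in Python as follows.  The names is_control,
	   SLASH_CHARS, AUTHORITY_TERMINATORS, and trim_control are
	   hypothetical and not part of this specification; trim_control
	   illustrates consuming leading and trailing control characters.

```python
# Hypothetical helper names; only the character sets come from the text.
SLASH_CHARS = "/\\"                           # "/" and "\"
AUTHORITY_TERMINATORS = SLASH_CHARS + "?#;"   # slash characters, "?", "#", ";"


def is_control(ch: str) -> bool:
    """A control character has a value less than or equal to U+0020."""
    return ord(ch) <= 0x20


def trim_control(s: str) -> str:
    """Drop leading and trailing control characters from s."""
    start, end = 0, len(s)
    while start < end and is_control(s[start]):
        start += 1
    while end > start and is_control(s[end - 1]):
        end -= 1
    return s[start:end]
```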

120	3.  Parsing a URL

122	   Given a string of characters, consume all leading and trailing
123	   control characters.

125	   Find the scheme, as described in Section 3.1.

127	   If the algorithm for finding the scheme determines that the URL is
128	   invalid:

130	      -> Abort these steps.

132	   If the scheme is a single upper or lower case ASCII character (TODO:
133	   Just ALPHA?):

135	      -> TODO: Windows drive specs!

137	   If the scheme is an ASCII case-insensitive match for "file":

139	      -> TODO: File URLs!

141	   If the scheme is an ASCII case-insensitive match for "mailto":

143	      -> TODO: I think mailto URLs are special, but more testing is
144	      required.

146	   If the scheme is hierarchical:

148	      -> In the after-scheme, if any, find the authority, path, query,
149	      and fragment, as described in Section 3.2.

151	      -> In the authority, if any, find the user-info, host, and port,
152	      as described in Section 3.3.

154	      -> In the user-info, if any, find the user name and password, as
155	      described in Section 3.4.

157	      -> Abort these steps.

159	   The remaining string is the path.  TODO: This might not be the best
160	   approach.  We need to do more testing of data and javascript URLs.

162	3.1.  Finding the scheme

164	   If the remaining string does not contain a ":" character:

166	      -> The URL is invalid.

168	      -> Abort these steps.

170	   Consume characters up to, but not including, the first ":" character.
171	   These characters are the scheme.

173	   Consume the ":" character.

175	   The remaining characters are the after-scheme.
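
	   The steps above can be sketched in Python as follows.  The
	   function name find_scheme and the use of None to signal an
	   invalid URL are assumptions made for illustration only.

```python
from typing import Optional, Tuple


def find_scheme(remaining: str) -> Optional[Tuple[str, str]]:
    """Return (scheme, after_scheme), or None when the URL is invalid."""
    if ":" not in remaining:
        return None  # the URL is invalid
    # Everything before the first ":" is the scheme; the ":" itself is
    # consumed; the rest is the after-scheme.
    scheme, _, after_scheme = remaining.partition(":")
    return scheme, after_scheme
```

	   For example, find_scheme("http://example.com/") yields the scheme
	   "http" and the after-scheme "//example.com/".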

177	3.2.  Finding the authority, path, query, and fragment

179	   Consume any number of slash characters.

181	   If the remaining string does not contain any authority terminating
182	   characters:

184	      -> The remaining string is the authority.

186	      -> Abort these steps.

188	   Consume characters up to, but not including, the first authority
189	   terminating character.  The consumed characters are the authority.

191	   If the remaining string does not contain a "?" character or a "#"
192	   character:

194	      -> The remaining string is the path.

196	      -> Abort these steps.

198	   Consume characters up to, but not including, the first "?" or "#"
199	   character.  The consumed characters are the path.

201	   If the first character of the remaining string is a "?" character:

203	      -> Consume the "?" character.

205	      -> If the remaining string does not contain a "#" character:

207	         -> The remaining string is the query.

209	         -> Abort these steps.

211	      -> Consume characters up to, but not including, the first "#"
212	      character.  The consumed characters are the query.

214	   Consume the "#" character.

216	   The remaining string is the fragment.
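
	   A minimal Python sketch of this section follows.  The function
	   name split_after_scheme and the dictionary it returns are
	   hypothetical; components that the algorithm never assigns are
	   simply absent from the result.

```python
import re
from typing import Dict


def split_after_scheme(after_scheme: str) -> Dict[str, str]:
    """Split an after-scheme into authority, path, query, and fragment."""
    parts: Dict[str, str] = {}
    # Consume any number of slash characters ("/" or "\").
    remaining = after_scheme.lstrip("/\\")

    # The authority runs up to the first authority terminating character.
    m = re.search(r"[/\\?#;]", remaining)
    if m is None:
        parts["authority"] = remaining
        return parts
    parts["authority"], remaining = remaining[:m.start()], remaining[m.start():]

    # The path runs up to the first "?" or "#".
    m = re.search(r"[?#]", remaining)
    if m is None:
        parts["path"] = remaining
        return parts
    parts["path"], remaining = remaining[:m.start()], remaining[m.start():]

    if remaining.startswith("?"):
        remaining = remaining[1:]          # consume the "?"
        if "#" not in remaining:
            parts["query"] = remaining
            return parts
        cut = remaining.index("#")
        parts["query"], remaining = remaining[:cut], remaining[cut:]

    parts["fragment"] = remaining[1:]      # consume the "#"
    return parts
```

	   Under this sketch, the after-scheme "///example.com/" from the
	   Open Issues list produces the authority "example.com" and the
	   path "/", because all leading slash characters are consumed.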

218	3.3.  Finding the user-info, host, and port

220	   If the remaining string contains an "@" character:

222	      -> Consume characters up to, but not including, the *last* "@"
223	      character.  The consumed characters are the user-info.

225	      -> Consume the "@" character.

227	   If the remaining string does not contain a ":" character:

229	      -> The remaining string is the host.

231	      -> Abort these steps.

233	   If the first character of the remaining string is a "[" character,
234	   the remaining string contains a "]" character, and the last ":"
235	   character in the remaining string occurs before the last "]"
236	   character in the remaining string:

238	      -> The remaining string is the host.

240	      -> Abort these steps.

242	   Consume characters up to, but not including, the last ":" character.
243	   The consumed characters are the host.

245	   Consume the ":" character.

247	   The remaining string is the port.
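
	   The following Python sketch mirrors these steps.  The function
	   name split_authority and the keys of the returned dictionary are
	   hypothetical.  Note that the split happens at the *last* "@" and
	   the *last* ":", and that a bracketed host whose final ":" lies
	   inside the brackets is kept whole.

```python
from typing import Dict


def split_authority(authority: str) -> Dict[str, str]:
    """Split an authority into user_info, host, and port."""
    parts: Dict[str, str] = {}
    remaining = authority

    if "@" in remaining:
        cut = remaining.rindex("@")        # the *last* "@"
        parts["user_info"] = remaining[:cut]
        remaining = remaining[cut + 1:]    # consume the "@"

    if ":" not in remaining:
        parts["host"] = remaining
        return parts

    # A bracketed host such as "[::1]" with no port after the "]".
    if (remaining.startswith("[") and "]" in remaining
            and remaining.rindex(":") < remaining.rindex("]")):
        parts["host"] = remaining
        return parts

    cut = remaining.rindex(":")            # the last ":"
    parts["host"] = remaining[:cut]
    parts["port"] = remaining[cut + 1:]    # consume the ":"
    return parts
```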

249	3.4.  Find the user name and password

251	   If the remaining string does not contain a ":" character:

253	      -> The remaining string is the user name.

255	      -> Abort these steps.

257	   Consume characters up to, but not including, the first ":" character.
258	   The consumed characters are the user name.

260	   Consume the ":" character.

262	   The remaining string is the password.
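
	   As a short illustration, the split at the first ":" might be
	   written as below.  The name split_user_info is hypothetical, and
	   returning None for a missing password is an assumption; the text
	   above does not say how an absent password is represented.

```python
from typing import Optional, Tuple


def split_user_info(user_info: str) -> Tuple[str, Optional[str]]:
    """Return (user_name, password); password is None when there is no ":"."""
    if ":" not in user_info:
        return user_info, None
    # Split at the *first* ":"; later ":" characters stay in the password.
    user_name, _, password = user_info.partition(":")
    return user_name, password
```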

264	4.  Resolving a string relative to a base URL

266	   Given a string relative-url and a ParsedURL base-url, find the scheme
267	   of relative-url.

269	   TODO: We probably need to trim leading and trailing control
270	   characters.

272	   If relative-url is an invalid URL:

274	      -> The resolved URL is relative-url resolved as a relative URL.

276	      -> Abort these steps.

278	   If relative-url's scheme contains any characters which are not "valid
279	   scheme characters" (TODO: Define valid scheme characters):

281	      -> The resolved URL is relative-url resolved as a relative URL.

283	      -> Abort these steps.

285	   If base-url's scheme is an ASCII case-insensitive match for relative-
286	   url's scheme and the shared scheme is hierarchical:

288	      -> The resolved URL is relative-url's after-scheme resolved as a
289	      relative URL.

291	      -> Abort these steps.

293	   The resolved URL is relative-url parsed as an absolute URL.

295	4.1.  Resolving a string as a relative URL

297	   Given a string relative-url and a ParsedURL base-url, determine the
298	   resolved URL as follows:

300	   TODO: If base-url's scheme is not hierarchical, we can't resolve as a
301	   relative URL.  We'll probably want to return an invalid URL.  Check
302	   what happens when resolving an empty string as a relative URL with a
303	   non-hierarchical base.

305	   If relative-url is empty:

307	      -> The resolved URL is identical to base-url, with the fragment,
308	      if any, removed.

310	      -> Abort these steps.

312	   If the first character of relative-url is a slash character:

314	      -> If relative-url has at least two characters and the second
315	      character is also a slash character:

317	         -> The resolved URL is relative-url resolved as a scheme-
318	         relative URL.

320	      Otherwise:

322	         -> The resolved URL is relative-url resolved as an authority-
323	         relative URL.

325	      -> Abort these steps.

327	   If the first character of relative-url is a "?" character:

329	      -> The resolved URL is relative-url resolved as a query-relative
330	      URL.

332	      -> Abort these steps.

334	   If the first character of relative-url is a "#" character:

336	      -> The resolved URL is relative-url resolved as a fragment-
337	      relative URL.

339	      -> Abort these steps.

341	   TODO: Think about the case where the relative-url is empty.

343	   The resolved URL is relative-url resolved as a path-relative URL.
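
	   The dispatch described in this section depends only on the first
	   one or two characters of relative-url.  A minimal Python sketch
	   follows; the function name classify_relative and the returned
	   labels are hypothetical, and the sketch only selects the kind of
	   resolution without performing it.

```python
SLASH_CHARS = "/\\"  # "/" and "\" are both slash characters


def classify_relative(relative_url: str) -> str:
    """Name the kind of resolution that applies to relative_url."""
    if relative_url == "":
        return "empty"                     # base-url without its fragment
    if relative_url[0] in SLASH_CHARS:
        if len(relative_url) >= 2 and relative_url[1] in SLASH_CHARS:
            return "scheme-relative"
        return "authority-relative"
    if relative_url[0] == "?":
        return "query-relative"
    if relative_url[0] == "#":
        return "fragment-relative"
    return "path-relative"
```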

345	4.2.  Resolving a string as a scheme-relative URL

347	   Given a string relative-url and a ParsedURL base-url, let resolved-
348	   url be

350	   o  base-url's scheme

352	   o  concatenated with ":",

354	   o  concatenated with relative-url.

356	   The resolved URL is resolved-url parsed as an absolute URL.

358	4.3.  Resolving a string as an authority-relative URL

360	   Given a string relative-url and a ParsedURL base-url, let resolved-
361	   url be

363	   o  base-url's scheme

365	   o  concatenated with "://",

367	   o  concatenated with base-url's authority,

369	   o  concatenated with relative-url.

371	   The resolved URL is resolved-url parsed as an absolute URL.

373	4.4.  Resolving a string as a path-relative URL

375	   TODO: Can the first character of relative-url be a slash character at
376	   this point?

378	   TODO: Can we assume base-url is canonicalized here so that it always
379	   has at least one "/" character?

381	   Let the directory-name be the characters of the base-url's path up to
382	   and including the last slash character.

384	   Let resolved-url be

386	   o  base-url's scheme

388	   o  concatenated with "://",

390	   o  concatenated with base-url's authority,

392	   o  concatenated with directory-name,

394	   o  concatenated with relative-url.

396	   The resolved URL is resolved-url parsed as an absolute URL.

398	4.5.  Resolving a string as a query-relative URL

400	   Given a string relative-url and a ParsedURL base-url, let resolved-
401	   url be

403	   o  base-url's scheme
404	   o  concatenated with "://",

406	   o  concatenated with base-url's authority,

408	   o  concatenated with base-url's path,

410	   o  concatenated with relative-url.

412	   The resolved URL is resolved-url parsed as an absolute URL.

414	4.6.  Resolving a string as a fragment-relative URL

416	   Given a string relative-url and a ParsedURL base-url, let resolved-
417	   url be

419	   o  base-url's scheme

421	   o  concatenated with "://",

423	   o  concatenated with base-url's authority,

425	   o  concatenated with base-url's path,

427	   o  concatenated with "?",

429	   o  concatenated with base-url's query,

431	   o  concatenated with relative-url.

433	   The resolved URL is resolved-url parsed as an absolute URL.
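
	   The concatenations in Sections 4.2 through 4.6 can be sketched
	   together in Python.  The Base tuple, the function name
	   build_resolved, and the kind labels are hypothetical.  The sketch
	   only builds the resolved-url string; parsing that string as an
	   absolute URL is a separate step.

```python
from typing import NamedTuple


class Base(NamedTuple):
    """Hypothetical stand-in for the parts of a ParsedURL used here."""
    scheme: str
    authority: str
    path: str
    query: str


def build_resolved(kind: str, relative_url: str, base: Base) -> str:
    """Build the resolved-url string for one kind of relative URL."""
    prefix = base.scheme + "://" + base.authority
    if kind == "scheme-relative":
        return base.scheme + ":" + relative_url
    if kind == "authority-relative":
        return prefix + relative_url
    if kind == "path-relative":
        # directory-name: the path up to and including the last slash
        # character ("/" or "\"); empty if the path contains none.
        last = max(base.path.rfind("/"), base.path.rfind("\\"))
        return prefix + base.path[:last + 1] + relative_url
    if kind == "query-relative":
        return prefix + base.path + relative_url
    if kind == "fragment-relative":
        # As written in Section 4.6, "?" is added even for an empty query.
        return prefix + base.path + "?" + base.query + relative_url
    raise ValueError("unknown kind: " + kind)
```

	   With a base of http://example.com/a/b?x=1, the path-relative
	   string "c" gives "http://example.com/a/c" and the
	   fragment-relative string "#top" gives
	   "http://example.com/a/b?x=1#top".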

435	5.  Canonicalizing a URL

437	   This section describes how to construct a canonical version of a
438	   parsed URL string.  TODO: We probably should mention somewhere that
439	   there is *not* a unique canonicalization for every URL.

441	   Given parsed URL original-url, if original-url is invalid:

443	      -> Abort these steps.

445	   TODO: Handle file URLs.

447	   If the scheme is hierarchical:

449	      Output the canonicalized scheme (as described in Section 5.1).

451	      Output "://".

453	      If the user-info is non-empty:

455	         Output the canonicalized user-info (as described in Section
456	         5.2).

458	         Output "@".

460	      Output the canonicalized host (as described in Section 5.3).

462	      Let the canonicalized-port be the canonicalized port (as described
463	      in Section ??).

465	      If the canonicalized-port is non-empty and is not the default port
466	      for the scheme:

468	         Output ":".

470	         Output the canonicalized-port.

472	      Output the canonicalized path (as described in Section 5.4).

474	      Let the canonicalized-query be the canonicalized query (as
475	      described in Section 5.5).

477	      If the canonicalized-query is non-empty (TODO: Distinguish between
478	      empty and non-existent queries):

480	         Output "?".

482	         Output the canonicalized-query.

484	      Let the canonicalized-fragment be the canonicalized fragment (as
485	      described in Section 5.6).

487	      If the canonicalized-fragment is non-empty (TODO: Distinguish
488	      between empty and non-existent fragments):

490	         Output "#".

492	         Output the canonicalized-fragment.
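
	   The output order for a hierarchical URL can be sketched in Python
	   as follows.  The function name assemble and the DEFAULT_PORTS
	   table are assumptions: this document does not list default
	   ports, so the table holds only two well-known values for
	   illustration.  All arguments are taken to be already
	   canonicalized.

```python
# Illustrative subset only; this table is not defined by the text above.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def assemble(scheme: str, host: str, path: str, user_info: str = "",
             port: str = "", query: str = "", fragment: str = "") -> str:
    """Join already-canonicalized components in the order given above."""
    out = [scheme, "://"]
    if user_info:
        out += [user_info, "@"]
    out.append(host)
    # The port is written only when non-empty and not the scheme default.
    if port and port != DEFAULT_PORTS.get(scheme):
        out += [":", port]
    out.append(path)
    if query:
        out += ["?", query]
    if fragment:
        out += ["#", fragment]
    return "".join(out)
```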

494	5.1.  Canonicalizing a Scheme

496	   If the first character of the scheme is not in ALPHA, the scheme is
497	   invalid.

499	   Process each character of the scheme in sequence:

501	      If the current character is among ALPHA, DIGIT, "+", "-", and ".":

503	         -> Output the current character.

505	      Otherwise, if the current character is "%":

507	         -> The scheme is invalid.

509	         -> Output the current character.

511	      Otherwise:

513	         -> The scheme is invalid.

515	         -> Output the utf8-percent-escaping of the current character.
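
	   A Python sketch of scheme canonicalization follows.  This
	   document does not define "utf8-percent-escaping"; the sketch
	   assumes it means writing each UTF-8 byte of the character as "%"
	   followed by two upper-case hexadecimal digits.  The function
	   names are hypothetical, and treating an empty scheme as invalid
	   is likewise an assumption.

```python
import string
from typing import Tuple

SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def utf8_percent_escape(ch: str) -> str:
    """Assumed meaning: each UTF-8 byte of ch written as %XX."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def canonicalize_scheme(scheme: str) -> Tuple[str, bool]:
    """Return (output, is_valid) for a scheme."""
    valid = bool(scheme) and scheme[0] in string.ascii_letters
    out = []
    for ch in scheme:
        if ch in SCHEME_CHARS:
            out.append(ch)
        elif ch == "%":
            valid = False
            out.append(ch)                 # "%" is output unchanged
        else:
            valid = False
            out.append(utf8_percent_escape(ch))
    return "".join(out), valid
```

	   As in the steps above, output is still produced for an invalid
	   scheme, and the case of letters is left unchanged.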

517	5.2.  Canonicalizing a User-Info

519	   Process each character of the user name in sequence:

521	      If the current character is among TODO:

523	         -> Output the current character.

525	      Otherwise:

527	         -> Output the utf8-percent-escaping of the current character.

529	   If there is no password or if the password is empty:

531	      -> Abort these steps.

533	   Output ":".

535	   Process each character of the password in sequence:

537	      If the current character is among TODO:

539	         -> Output the current character.

541	      Otherwise:

543	         -> Output the utf8-percent-escaping of the current character.

545	5.3.  Canonicalizing a Host

547	   TODO: Handle IP addresses.

549	   Let unicode-host be the host-escape-normalized host (see Section 5.3.1).

551	   Output the result of applying the IDNA to-ascii algorithm to the
552	   unicode-host.  TODO: Properly reference IDNA's to-ascii algorithm (we
553	   might need a wrapper like we do in the cookie spec).

555	5.3.1.  Host Escape Normalization

557	host-escaped   = U+0000-U+002A / U+002C / U+002F / U+003B-U+0040 / U+005C /
558	                 U+005E / U+0060 / U+007B-U+007F

560	   Process each character of the host in sequence:

562	      If the current character is "%":

564	         -> TODO: Handle percent-unescaping.

566	      If the current character matches host-escaped:

568	         -> Output the utf8-percent-escaping of the current character.

570	      Otherwise, if the current character matches ALPHA:

572	         -> Output the current character converted to lower case.

574	      Otherwise:

576	         -> Output the current character.
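
	   The host-escaped ranges and the loop above can be sketched in
	   Python.  Percent-unescaping is still a TODO in this section, so
	   the sketch omits it; as a consequence "%" (U+0025), which falls
	   within host-escaped, is escaped like any other member of that
	   set.  The function names are hypothetical, and
	   utf8-percent-escaping is assumed to mean %XX for each UTF-8 byte.

```python
import string


def utf8_percent_escape(ch: str) -> str:
    """Assumed meaning: each UTF-8 byte of ch written as %XX."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def in_host_escaped(cp: int) -> bool:
    """True if code point cp matches the host-escaped production."""
    return (cp <= 0x2A
            or cp in (0x2C, 0x2F, 0x5C, 0x5E, 0x60)
            or 0x3B <= cp <= 0x40
            or 0x7B <= cp <= 0x7F)


def host_escape_normalize(host: str) -> str:
    """Escape host-escaped characters and lower-case ALPHA."""
    out = []
    for ch in host:
        if in_host_escaped(ord(ch)):
            out.append(utf8_percent_escape(ch))
        elif ch in string.ascii_letters:
            out.append(ch.lower())
        else:
            out.append(ch)                 # includes non-ASCII characters
    return "".join(out)
```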

578	5.4.  Canonicalizing a Path

580	   TODO: Do we need to ensure that paths always start with a slash
581	   character?

583	   If the path is empty:

585	      -> Output "/" and abort these steps.

587	path-escaped   = U+0000-U+0020 / U+0022-U+0023 / U+0025 / U+003C / U+003E /
588	                 U+003F / U+005C / U+005E / U+0060 / U+007B-U+007D / U+007F
589	path-unescaped = "-" / DIGIT / ALPHA / "_" / "~"

591	   Process each character of the path in sequence:

593	      If the current character matches path-escaped or is greater than
594	      or equal to U+0080:

596	         -> Output the utf8-percent-escaping of the current character.

598	      Otherwise, if the current character is ".":

600	         -> TODO: Handle "." collapsing.

602	      Otherwise, if the current character is "\":

604	         -> Output "/".

606	      Otherwise, if the current character is "%":

608	         -> TODO: Handle percent-unescaping.

610	      Otherwise:

612	         -> Output the current character.
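
	   A literal Python reading of the steps above is sketched below.
	   Two points are assumptions of the sketch rather than statements
	   of this document: "." collapsing and percent-unescaping are
	   TODOs, so "." is output unchanged; and because "\" (U+005C) and
	   "%" (U+0025) both match path-escaped, the first test escapes them
	   before the later branches for those characters are reached.  The
	   function names are hypothetical, and utf8-percent-escaping is
	   assumed to mean %XX for each UTF-8 byte.

```python
def utf8_percent_escape(ch: str) -> str:
    """Assumed meaning: each UTF-8 byte of ch written as %XX."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def in_path_escaped(cp: int) -> bool:
    """True if code point cp matches the path-escaped production."""
    return (cp <= 0x20
            or cp in (0x22, 0x23, 0x25, 0x3C, 0x3E, 0x3F,
                      0x5C, 0x5E, 0x60, 0x7F)
            or 0x7B <= cp <= 0x7D)


def canonicalize_path(path: str) -> str:
    """Apply the tests in the order they are listed above."""
    if path == "":
        return "/"
    out = []
    for ch in path:
        cp = ord(ch)
        if in_path_escaped(cp) or cp >= 0x80:
            # Also catches "\" and "%", which are in path-escaped.
            out.append(utf8_percent_escape(ch))
        else:
            # "." lands here because collapsing is not yet specified.
            out.append(ch)
    return "".join(out)
```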

614	5.5.  Canonicalizing a Query

616	   TODO: Handle the ambient encoding case.

618	   Process each character of the query in sequence:

620	      If the current character is among TODO:

622	         -> Output the current character.

624	      Otherwise:

626	         -> Output the utf8-percent-escaping of the current character.
627	         TODO: We need to handle the goofy query escaping format.

629	5.6.  Canonicalizing a Fragment

631	   Process each character of the fragment in sequence:

633	      If the current character has a Unicode value greater than or equal
634	      to U+0020:

636	         -> Output the current character.

638	      Otherwise:

640	         -> Output the utf8-percent-escaping of the current character.

642	   Note: The above algorithm leaves non-US-ASCII characters unescaped
643	   in the canonicalized fragment.
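
	   A minimal Python sketch of fragment canonicalization follows.
	   The function names are hypothetical, and utf8-percent-escaping is
	   assumed to mean %XX for each UTF-8 byte, which this document does
	   not define.

```python
def utf8_percent_escape(ch: str) -> str:
    """Assumed meaning: each UTF-8 byte of ch written as %XX."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def canonicalize_fragment(fragment: str) -> str:
    """Escape only characters below U+0020; keep everything else."""
    return "".join(
        ch if ord(ch) >= 0x20 else utf8_percent_escape(ch)
        for ch in fragment
    )
```

	   For example, a tab character becomes "%09", while a space
	   (U+0020) and characters such as U+00E9 pass through unchanged.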

645	Appendix A.  Acknowledgements

647	   TODO

649	Author's Address

651	   Adam Barth
652	   Google, Inc.

654	   Email: ietf@adambarth.com
655	   URI:   http://www.adambarth.com/