iri                                                             A. Barth
Internet-Draft                                              Google, Inc.
Intended status: Informational                          November 9, 2010
Expires: May 13, 2011


How Browsers Process URLs
draft-abarth-url-00

Abstract

This document contains a precise specification of how browsers process URLs. The behavior specified in this document might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein.

Editorial Note (To be removed by RFC Editor)

If you have suggestions for improving this document, please send email to public-iri@w3.org. Further Working Group information is available from https://tools.ietf.org/wg/iri/.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This Internet-Draft will expire on May 13, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



Table of Contents

1.  Open Issues
2.  Definitions
3.  Parsing a URL
    3.1.  Finding the scheme
    3.2.  Finding the authority, path, query, and fragment
    3.3.  Finding the user information, host, and port
    3.4.  Find the user name and password
4.  Resolving a string relative to a base URL
    4.1.  Resolving a string as a relative URL
    4.2.  Resolving a string as a scheme-relative URL
    4.3.  Resolving a string as an authority-relative URL
    4.4.  Resolving a string as a query-relative URL
    4.5.  Resolving a string as a fragment-relative URL
    4.6.  Resolving a string as a path-relative URL
Appendix A.  Acknowledgements
Author's Address





1.  Open Issues

Browsers parse URLs differently depending on which operating system they're running on. The problem is that they want to do sensible things for file paths, but file paths look different on Windows and Unix systems.

How should we handle cases where browsers disagree with the regular expression in RFC 3986? Currently, this document aims to describe how browsers behave, but we'll likely need to compare that behavior to RFC 3986 at some point. Some specific differences that have been brought up on the mailing list:




2.  Definitions

A /control character/ is a character whose value is less than or equal to U+0020 (" ").

A /slash character/ is either U+002F ("/") or U+005C ("\").

An /authority terminating character/ is either a slash character, U+003F ("?"), U+0023 ("#"), or U+003B (";").

During a parsing algorithm, the /remaining string/ consists of the characters of the input that have not yet been consumed.
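
As an illustration only, the character classes defined above could be expressed as small predicates. The function and constant names here are hypothetical and do not appear in this document.

```python
# Sketch of the character classes from the Definitions section.
# All names are illustrative assumptions, not part of the draft.

SLASH_CHARACTERS = frozenset("/\\")
AUTHORITY_TERMINATORS = SLASH_CHARACTERS | frozenset("?#;")


def is_control_character(ch: str) -> bool:
    """True if ch is at or below U+0020. Note that this includes the space."""
    return ord(ch) <= 0x20


def is_slash_character(ch: str) -> bool:
    """True for "/" and for the backslash."""
    return ch in SLASH_CHARACTERS


def is_authority_terminating_character(ch: str) -> bool:
    """True for a slash character, "?", "#", or ";"."""
    return ch in AUTHORITY_TERMINATORS
```

Note that, under the definition above, the space character itself counts as a control character, so trimming control characters also trims ordinary spaces.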




3.  Parsing a URL

Given a string of characters, find the scheme, as described in Section 3.1.

If the URL is invalid:

-> Abort these steps.

If the scheme is a single uppercase or lowercase ASCII letter:

-> TODO: Windows drive specs!

If the scheme is an ASCII case-insensitive match for "file":

-> TODO: File URLs!

If the scheme is an ASCII case-insensitive match for "mailto":

-> TODO: I think mailto URLs are special, but more testing is required.

If the scheme is hierarchical:

-> In the after-scheme, if any, find the authority, path, query, and fragment, as described in Section 3.2.

-> In the authority, if any, find the user information, host, and port, as described in Section 3.3.

-> In the user-info, if any, find the user name and password, as described in Section 3.4.

-> Abort these steps.

The remaining string is the /path/.




3.1.  Finding the scheme

Consume all leading and trailing control characters.

If the remaining string does not contain a ":" character:

-> The URL is invalid.

-> Abort these steps.

Consume characters up to, but not including, the first ":" character. These characters are the /scheme/.

Consume the ":" character.

The remaining characters are the /after-scheme/.
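
The steps above can be sketched as follows. This is a minimal illustration, not a normative implementation; the function name find_scheme and the choice to signal an invalid URL by returning None are assumptions made for the example.

```python
from typing import Optional, Tuple


def find_scheme(url: str) -> Optional[Tuple[str, str]]:
    """Return (scheme, after_scheme), or None if the URL is invalid.

    Hypothetical helper: the name and return convention are assumptions.
    """
    # Consume leading and trailing characters at or below U+0020.
    start, end = 0, len(url)
    while start < end and ord(url[start]) <= 0x20:
        start += 1
    while end > start and ord(url[end - 1]) <= 0x20:
        end -= 1
    remaining = url[start:end]

    # Split at the first ":"; without one, the URL is invalid.
    scheme, colon, after_scheme = remaining.partition(":")
    if not colon:
        return None
    return scheme, after_scheme
```

For example, find_scheme("  http://example.com/\n") yields ("http", "//example.com/"), while a string with no ":" character yields None. Note that this step does not check whether the scheme contains only sensible characters.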




3.2.  Finding the authority, path, query, and fragment

Consume any number of slash characters.

If the remaining string does not contain any authority terminating characters:

-> The remaining string is the /authority/.

-> Abort these steps.

Consume characters up to, but not including, the first authority terminating character. The consumed characters are the /authority/.

If the remaining string does not contain a "?" character or a "#" character:

-> The remaining string is the /path/.

-> Abort these steps.

Consume characters up to, but not including, the first "?" or "#" character. The consumed characters are the /path/.

If the first character of the remaining string is a "?" character:

-> Consume the "?" character.

-> If the remaining string does not contain a "#" character:

-> -> The remaining string is the /query/.

-> -> Abort these steps.

-> Consume characters up to, but not including, the first "#" character. The consumed characters are the /query/.

Consume the "#" character.

The remaining string is the /fragment/.
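
A minimal sketch of this algorithm follows, under the assumption that components which are never assigned are reported as None. The function name split_after_scheme and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional


def split_after_scheme(after_scheme: str) -> Dict[str, Optional[str]]:
    """Split an after-scheme into authority, path, query, and fragment.

    Hypothetical helper: components that are never assigned stay None.
    """
    parts: Dict[str, Optional[str]] = {
        "authority": None, "path": None, "query": None, "fragment": None,
    }

    # Consume any number of leading slash characters ("/" or backslash).
    remaining = after_scheme.lstrip("/\\")

    # The authority runs up to the first authority terminating character.
    end = next((i for i, ch in enumerate(remaining) if ch in "/\\?#;"), None)
    if end is None:
        parts["authority"] = remaining
        return parts
    parts["authority"], remaining = remaining[:end], remaining[end:]

    # The path runs up to the first "?" or "#".
    end = next((i for i, ch in enumerate(remaining) if ch in "?#"), None)
    if end is None:
        parts["path"] = remaining
        return parts
    parts["path"], remaining = remaining[:end], remaining[end:]

    if remaining.startswith("?"):
        # Consume the "?" and split the rest at the first "#", if any.
        query, hash_mark, fragment = remaining[1:].partition("#")
        parts["query"] = query
        if hash_mark:
            parts["fragment"] = fragment
        return parts

    # Otherwise the remaining string starts with "#".
    parts["fragment"] = remaining[1:]
    return parts
```

With this sketch, "//example.com/a/b?x=1#top" splits into the authority "example.com", the path "/a/b", the query "x=1", and the fragment "top". Because ";" terminates the authority, an input such as "//host;params" yields the path ";params".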




3.3.  Finding the user information, host, and port

If the remaining string contains an "@" character:

-> Consume characters up to, but not including, the *last* "@" character. The consumed characters are the /user-info/.

-> Consume the "@" character.

If the remaining string does not contain a ":" character:

-> The remaining string is the /host/.

-> Abort these steps.

If the first character of the remaining string is a "[" character, the remaining string contains a "]" character, and the last ":" character in the remaining string occurs before the last "]" character in the remaining string:

-> The remaining string is the /host/.

-> Abort these steps.

Consume characters up to, but not including, the last ":" character. The consumed characters are the /host/.

Consume the ":" character.

The remaining string is the /port/.
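
The following sketch illustrates these steps, assuming the input is the authority found earlier and that absent components are reported as None. The name split_authority is hypothetical.

```python
from typing import Optional, Tuple


def split_authority(
    authority: str,
) -> Tuple[Optional[str], str, Optional[str]]:
    """Return (user_info, host, port) for an authority string.

    Hypothetical helper: absent components are reported as None.
    """
    user_info: Optional[str] = None
    remaining = authority

    # Everything before the *last* "@" is the user-info.
    if "@" in remaining:
        user_info, _, remaining = remaining.rpartition("@")

    last_colon = remaining.rfind(":")
    if last_colon == -1:
        return user_info, remaining, None

    # A bracketed host whose last ":" lies inside the brackets has no port.
    if (
        remaining.startswith("[")
        and "]" in remaining
        and last_colon < remaining.rfind("]")
    ):
        return user_info, remaining, None

    # Otherwise the last ":" separates the host from the port.
    return user_info, remaining[:last_colon], remaining[last_colon + 1:]
```

The bracket check keeps a host such as "[::1]" intact, while "[::1]:80" still splits into the host "[::1]" and the port "80". Splitting at the last "@" means that "a@b@host" yields the user-info "a@b".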




3.4.  Find the user name and password

If the remaining string does not contain a ":" character:

-> The remaining string is the /user/.

-> Abort these steps.

Consume characters up to, but not including, the first ":" character. The consumed characters are the /user/.

Consume the ":" character.

The remaining string is the /password/.
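
As a brief illustration, the split at the first ":" might look like this. The name split_user_info is hypothetical, and reporting a missing password as None is an assumption of the sketch.

```python
from typing import Optional, Tuple


def split_user_info(user_info: str) -> Tuple[str, Optional[str]]:
    """Return (user, password); password is None when there is no ":"."""
    # Split at the *first* ":", so later colons belong to the password.
    user, colon, password = user_info.partition(":")
    if not colon:
        return user, None
    return user, password
```

Because the split happens at the first ":" character, "alice:a:b" yields the user "alice" and the password "a:b".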




4.  Resolving a string relative to a base URL

Given a string /spec/ and a ParsedURL /base-url/, find the scheme of spec.

TODO: We probably need to trim leading and trailing control characters.

If spec is an invalid URL:

-> The resolved URL is spec resolved as a relative URL.

-> Abort these steps.

If spec's scheme contains any characters which are not "valid scheme characters" (TODO: Define valid scheme characters):

-> The resolved URL is spec resolved as a relative URL.

-> Abort these steps.

If base-url's scheme is an ASCII case-insensitive match for spec's scheme and the shared scheme is hierarchical:

-> The resolved URL is spec's after-scheme resolved as a relative URL.

-> Abort these steps.

The resolved URL is spec parsed as an absolute URL.
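
The decision procedure above can be sketched as a function that reports which rule applies and to which text. This document does not yet define "valid scheme characters" or which schemes are hierarchical, so the sketch fills both in with loudly labeled assumptions: the scheme character set is borrowed from RFC 3986, and the set of hierarchical schemes is a hypothetical placeholder. Trimming of control characters, which is still a TODO above, is not performed.

```python
import string
from typing import Tuple

# ASSUMPTION: the draft leaves "valid scheme characters" undefined.
# This set is borrowed from RFC 3986 (letters, digits, "+", "-", ".").
VALID_SCHEME_CHARACTERS = frozenset(
    string.ascii_letters + string.digits + "+-."
)

# ASSUMPTION: a hypothetical placeholder; the draft does not list
# which schemes are hierarchical.
HIERARCHICAL_SCHEMES = frozenset({"http", "https", "ftp"})


def choose_resolution(spec: str, base_scheme: str) -> Tuple[str, str]:
    """Return (rule, text), where rule is "relative" or "absolute".

    Hypothetical helper: text is the string the chosen rule applies to.
    """
    scheme, colon, after_scheme = spec.partition(":")

    # No ":" at all: spec is an invalid URL, so resolve it as relative.
    if not colon:
        return "relative", spec

    # A scheme with invalid characters also falls back to relative.
    if any(ch not in VALID_SCHEME_CHARACTERS for ch in scheme):
        return "relative", spec

    # Same hierarchical scheme as the base: resolve the after-scheme.
    same_scheme = scheme.lower() == base_scheme.lower()
    if same_scheme and scheme.lower() in HIERARCHICAL_SCHEMES:
        return "relative", after_scheme

    return "absolute", spec
```

Under these assumptions, "http:page.html" against an "HTTP" base is treated as the relative string "page.html", whereas "mailto:a@b" against the same base is parsed as an absolute URL.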




4.1.  Resolving a string as a relative URL

Given a string /spec/ and a ParsedURL /base-url/...

TODO: If base-url's scheme is not hierarchical, we can't resolve as a relative URL. We'll probably want to return an invalid URL. Check what happens when resolving an empty string as a relative URL with a non-hierarchical base.

If spec is empty:

-> The resolved URL is identical to base-url, with the fragment, if any, removed.

-> Abort these steps.

If the first character of spec is a slash character:

-> If spec has at least two characters and the second character is also a slash character:

-> -> The resolved URL is spec resolved as a scheme-relative URL.

-> Otherwise:

-> -> The resolved URL is spec resolved as an authority-relative URL.

-> Abort these steps.

If the first character of spec is a "?" character:

-> The resolved URL is spec resolved as a query-relative URL.

-> Abort these steps.

If the first character of spec is a "#" character:

-> The resolved URL is spec resolved as a fragment-relative URL.

-> Abort these steps.

The resolved URL is spec resolved as a path-relative URL.
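
The case analysis above depends only on the first one or two characters of spec, which the following sketch makes explicit. The function name classify_relative and the returned labels are hypothetical; the sketch only names the case and does not perform the resolution itself.

```python
def classify_relative(spec: str) -> str:
    """Name the relative-resolution case that applies to spec.

    Hypothetical helper: the labels mirror the section titles.
    """
    if spec == "":
        return "empty"  # resolves to base-url without its fragment

    if spec[0] in "/\\":
        # Two leading slash characters select the scheme-relative case.
        if len(spec) >= 2 and spec[1] in "/\\":
            return "scheme-relative"
        return "authority-relative"

    if spec[0] == "?":
        return "query-relative"
    if spec[0] == "#":
        return "fragment-relative"
    return "path-relative"
```

Because a backslash counts as a slash character, a string such as "\/host" falls into the scheme-relative case just as "//host" does.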




4.2.  Resolving a string as a scheme-relative URL

Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.




4.3.  Resolving a string as an authority-relative URL

Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.




4.4.  Resolving a string as a query-relative URL

Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.




4.5.  Resolving a string as a fragment-relative URL

Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.




4.6.  Resolving a string as a path-relative URL

TODO: Handle path-relative URLs. This requires a bunch of path and dot semantics.




Appendix A.  Acknowledgements

TODO




Author's Address

  Adam Barth
  Google, Inc.
Email:  ietf@adambarth.com
URI:  http://www.adambarth.com/