iri A. Barth
Internet-Draft Google, Inc.
Intended status: Standards Track April 24, 2011
Expires: October 26, 2011

Parsing URLs for Fun and Profit
draft-abarth-url-01

Abstract

This document contains a precise specification of how browsers process URLs. The behavior specified in this document might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein.

Editorial Note (To be removed by RFC Editor)

If you have suggestions for improving this document, please send email to mailto:public-iri@w3.org. Further Working Group information is available from https://tools.ietf.org/wg/iri/.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 26, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Open Issues

Browsers parse URLs differently depending on which operating system they're running on. The problem is that they want to do sensible things for file paths, but file paths look different on Windows and Unix systems.

How should we handle cases where browsers disagree with the regular expression in RFC 3986? Currently, this document aims to describe how browsers behave, but we'll likely need to compare that to RFC 3986 at some point. Some specific differences that have been brought up on the mailing list:

2. Definitions

A control character is a character whose value is less than or equal to U+0020 (" ").

A slash character is either U+???? ("/") or U+???? ("\"). TODO: There's some question as to whether this is necessary for non-file URLs.

An authority terminating character is either a slash character, U+???? ("?"), U+???? ("#"), or U+???? (";"). TODO: Why is ";" on this list?

During a parsing algorithm, the remaining string is the characters of the input that have not yet been consumed.
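
The character classes defined above can be expressed as simple predicates. The following is a minimal sketch, not part of the specification; the function names are hypothetical and each function assumes it is given exactly one character.

```python
# Hypothetical helper names; the draft defines the terms but not an API.

def is_control_character(ch: str) -> bool:
    """True if ch is at or below U+0020 (space)."""
    return ord(ch) <= 0x20


def is_slash_character(ch: str) -> bool:
    """True for "/" or "\\"."""
    return ch in {"/", "\\"}


def is_authority_terminating_character(ch: str) -> bool:
    """True for a slash character, "?", "#", or ";"."""
    return is_slash_character(ch) or ch in {"?", "#", ";"}
```

Note that, under this definition, the space character itself counts as a control character.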

3. Parsing a URL

Given a string of characters, consume all leading and trailing control characters.

Find the scheme, as described in Section ??.

If the algorithm for finding the scheme determines that the URL is invalid:

If the scheme is a single upper or lower case ASCII character (TODO: Just ALPHA?):

If the scheme is an ASCII case-insensitive match for "file":

If the scheme is an ASCII case-insensitive match for "mailto":

If the scheme is hierarchical:

The remaining string is the path. TODO: This might not be the best approach. We need to do more testing of data and javascript URLs.
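
The first step of this section, consuming leading and trailing control characters, might be sketched as follows. The function name is hypothetical, and the sketch relies only on the definition of a control character in Section 2 (any character at or below U+0020).

```python
def trim_control_characters(s: str) -> str:
    """Drop leading and trailing characters whose value is <= U+0020."""
    start, end = 0, len(s)
    # Advance past leading control characters.
    while start < end and ord(s[start]) <= 0x20:
        start += 1
    # Back up over trailing control characters.
    while end > start and ord(s[end - 1]) <= 0x20:
        end -= 1
    return s[start:end]
```

Control characters in the interior of the string are left untouched by this step.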

3.1. Finding the scheme

If the remaining string does not contain a ":" character:

Consume characters up to, but not including, the first ":" character. These characters are the scheme.

Consume the ":" character.

The remaining characters are the after-scheme.
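
A minimal sketch of the steps above, with a hypothetical function name. The draft does not show what happens when the string contains no ":" character; returning None in that case is an assumption made for illustration.

```python
from typing import Optional, Tuple


def find_scheme(remaining: str) -> Optional[Tuple[str, str]]:
    """Return (scheme, after_scheme), splitting at the first ":"."""
    index = remaining.find(":")
    if index == -1:
        # Assumption: the elided branch is treated as "no scheme found".
        return None
    # Characters before the first ":" are the scheme; the ":" itself is
    # consumed and everything after it is the after-scheme.
    return remaining[:index], remaining[index + 1:]
```

Only the first ":" matters here, so later ":" characters (for example, one preceding a port) stay in the after-scheme.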

3.2. Finding the authority, path, query, and fragment

Consume any number of slash characters.

If the remaining string does not contain any authority terminating characters:

Consume characters up to, but not including, the first authority terminating character. The consumed characters are authority.

If the remaining string does not contain a "?" character or a "#" character:

Consume characters up to, but not including, the first "?" or "#" character. The consumed characters are the path.

If the first character of the remaining string is a "?" character:

Consume the "#" character.

The remaining string is the fragment.
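
The steps above might be sketched as follows. Several conditional branches in this section have no body in this draft, so the sketch fills them with assumptions, each marked in a comment: a string with no terminator is taken to be entirely authority (or entirely path), and the query is taken to run from "?" to the first "#". All names are hypothetical.

```python
from typing import Dict

AUTHORITY_TERMINATORS = "/\\?#;"


def first_index_of_any(s: str, chars: str) -> int:
    """Index of the first character of s that appears in chars, or -1."""
    for i, ch in enumerate(s):
        if ch in chars:
            return i
    return -1


def split_after_scheme(after_scheme: str) -> Dict[str, str]:
    parts = {"authority": "", "path": "", "query": "", "fragment": ""}
    # Consume any number of leading slash characters ("/" or "\").
    remaining = after_scheme.lstrip("/\\")

    end = first_index_of_any(remaining, AUTHORITY_TERMINATORS)
    if end == -1:
        parts["authority"] = remaining  # assumption: elided branch
        return parts
    parts["authority"], remaining = remaining[:end], remaining[end:]

    end = first_index_of_any(remaining, "?#")
    if end == -1:
        parts["path"] = remaining  # assumption: elided branch
        return parts
    parts["path"], remaining = remaining[:end], remaining[end:]

    if remaining.startswith("?"):
        # Assumption: the query runs up to the first "#", if any.
        end = remaining.find("#")
        if end == -1:
            parts["query"] = remaining[1:]
            return parts
        parts["query"], remaining = remaining[1:end], remaining[end:]

    # Consume the "#"; the rest is the fragment.
    parts["fragment"] = remaining[1:]
    return parts
```

Because ";" is an authority terminating character, an input such as "//host;x" yields the authority "host" and the path ";x" under this sketch.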

3.3. Finding the user-info, host, and port

If the remaining string contains an "@" character:

If the remaining string does not contain a ":" character:

If the first character of the remaining string is a "[" character, the remaining string contains a "]" character, and the last ":" character in the remaining string occurs before the last "]" character in the remaining string:

Consume characters up to, but not including, the last ":" character. The consumed characters are the host.

Consume the ":" character.

The remaining string is the port.
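
The host and port steps might be sketched as follows, assuming any user-info has already been consumed. The bodies of the "no ':' character" and bracketed-host branches are not shown in this draft; treating the whole remaining string as the host in both cases is an assumption. The function name is hypothetical.

```python
from typing import Optional, Tuple


def split_host_port(remaining: str) -> Tuple[str, Optional[str]]:
    """Return (host, port); port is None when no port is present."""
    last_colon = remaining.rfind(":")
    if last_colon == -1:
        return remaining, None  # assumption: elided branch
    if (remaining.startswith("[") and "]" in remaining
            and last_colon < remaining.rfind("]")):
        # The last ":" lies inside the brackets, so it is not a port
        # separator. Assumption: the whole string is the host.
        return remaining, None
    # Split at the last ":"; the ":" itself is consumed.
    return remaining[:last_colon], remaining[last_colon + 1:]
```

Splitting at the last ":" rather than the first is what allows a bracketed host containing ":" characters to be followed by a port.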

3.4. Find the user name and password

If the remaining string does not contain a ":" character:

Consume characters up to, but not including, the first ":" character. The consumed characters are the user name.

Consume the ":" character.

The remaining string is the password.
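
A minimal sketch of the steps above, with a hypothetical function name. The draft does not show the body of the "no ':' character" branch; treating the whole string as the user name with no password is an assumption.

```python
from typing import Optional, Tuple


def split_user_info(user_info: str) -> Tuple[str, Optional[str]]:
    """Return (user_name, password), splitting at the first ":"."""
    index = user_info.find(":")
    if index == -1:
        return user_info, None  # assumption: elided branch
    # The ":" is consumed; everything after it is the password.
    return user_info[:index], user_info[index + 1:]
```

Since the split is at the first ":", any later ":" characters become part of the password.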

4. Resolving a string relative to a base URL

Given a string relative-url and a ParsedURL base-url, find the scheme of relative-url.

TODO: We probably need to trim leading and trailing control characters.

If relative-url is an invalid URL:

If relative-url's scheme contains any characters which are not "valid scheme characters" (TODO: Define valid scheme characters):

If base-url's scheme is an ASCII case-insensitive match for relative-url's scheme and the shared scheme is hierarchical:

The resolved URL is relative-url parsed as an absolute URL.

4.1. Resolving a string as a relative URL

Given a string relative-url and a ParsedURL base-url, determine the resolved URL as follows:

TODO: If base-url's scheme is not hierarchical, we can't resolve as a relative URL. We'll probably want to return an invalid URL. Check what happens when resolving an empty string as a relative URL with a non-hierarchical base.

If relative-url is empty:

If the first character of relative-url is a slash character:

If the first character of relative-url is a "?" character:

If the first character of relative-url is a "#" character:

TODO: Think about the case where the relative-url is empty.

The resolved URL is relative-url resolved as a path-relative URL.

4.2. Resolving a string as a scheme-relative URL

Given a string relative-url and a ParsedURL base-url, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.

4.3. Resolving a string as an authority-relative URL

Given a string relative-url and a ParsedURL base-url, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.

4.4. Resolving a string as a path-relative URL

TODO: Can the first character of relative-url be a slash character at this point?

TODO: Can we assume base-url is canonicalized here so that it always has at least one "/" character?

Let the directory-name be the characters of the base-url's path up to and including the last slash character.

Let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.
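
The directory-name step in this section might be sketched as follows. The function name is hypothetical, and returning the empty string when the path contains no slash character is an assumption tied to the open TODO above.

```python
def directory_name(base_path: str) -> str:
    """Characters of base_path up to and including the last slash character."""
    # A slash character is either "/" or "\", so take whichever occurs last.
    last = max(base_path.rfind("/"), base_path.rfind("\\"))
    # If neither occurs, last is -1 and the slice is empty (assumption).
    return base_path[:last + 1]
```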

4.5. Resolving a string as a query-relative URL

Given a string relative-url and a ParsedURL base-url, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.

4.6. Resolving a string as a fragment-relative URL

Given a string relative-url and a ParsedURL base-url, let resolved-url be

The resolved URL is resolved-url parsed as an absolute URL.

5. Canonicalizing a URL

This section describes how to construct a canonical version of a parsed URL string. TODO: We probably should mention somewhere that there is *not* a unique canonicalization for every URL.

Given parsed URL original-url, if original-url is invalid:

TODO: Handle file URLs.

If the scheme is hierarchical:

5.1. Canonicalizing a Scheme

If the first character of the scheme is not in ALPHA, the scheme is invalid.
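
The first-character check above might be sketched as follows, assuming ALPHA means the ASCII letters A-Z and a-z as in RFC 5234. The function name is hypothetical, and treating an empty scheme as failing the check is an assumption.

```python
import string


def scheme_starts_with_alpha(scheme: str) -> bool:
    """False when the scheme is empty or its first character is not in ALPHA."""
    return bool(scheme) and scheme[0] in string.ascii_letters
```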

Process each character of the scheme in sequence:

5.2. Canonicalizing a User-Info

Process each character of the username in sequence:

If there is no password or if the password is empty:

Output ":".

Process each character of the password in sequence:

5.3. Canonicalizing a Host

TODO: Handle IP addresses.

Let unicode-host be the host-escape-normalized host (see Section ??).

Output the result of applying the IDNA to-ascii algorithm to the unicode-host. TODO: Properly reference IDNA's to-ascii algorithm (we might need a wrapper like we do in the cookie spec).

5.3.1. Host Escape Normalization

host-escaped   = U+0000-U+002A / U+002C / U+002F / U+003B-U+0040 / U+005C /
                 U+005E / U+0060 / U+007B-U+007F
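
The host-escaped production above can be checked with a small membership test. This sketch only encodes the listed code point ranges; what the algorithm does with a matching character is not shown in this draft and is not modelled here. The names are hypothetical.

```python
# Inclusive code point ranges copied from the host-escaped production.
HOST_ESCAPED_RANGES = [
    (0x00, 0x2A), (0x2C, 0x2C), (0x2F, 0x2F), (0x3B, 0x40),
    (0x5C, 0x5C), (0x5E, 0x5E), (0x60, 0x60), (0x7B, 0x7F),
]


def is_host_escaped(ch: str) -> bool:
    """True if the single character ch falls in the host-escaped set."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HOST_ESCAPED_RANGES)
```

Under these ranges "+", "-", ".", ":", "[" and "]" are outside the set, while "%", "/" and "@" are inside it.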
            

Process each character of the host in sequence:

5.4. Canonicalizing a Path

TODO: Do we need to ensure that paths always start with a slash character?

If the path is empty:

path-escaped   = U+0000-U+0020 / U+0022-U+0023 / U+0025 / U+003C / U+003E /
                 U+003F / U+005C / U+005E / U+0060 / U+007B-U+007D / U+007F
path-unescaped = "-" / DIGIT / ALPHA / "_" / "~"
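
The two productions above can be encoded as membership tests. This sketch assumes DIGIT and ALPHA are the ASCII digits and letters as in RFC 5234; it does not model the per-character processing steps, which are not shown in this draft. The names are hypothetical.

```python
import string

# Inclusive code point ranges copied from the path-escaped production.
PATH_ESCAPED_RANGES = [
    (0x00, 0x20), (0x22, 0x23), (0x25, 0x25), (0x3C, 0x3C),
    (0x3E, 0x3E), (0x3F, 0x3F), (0x5C, 0x5C), (0x5E, 0x5E),
    (0x60, 0x60), (0x7B, 0x7D), (0x7F, 0x7F),
]

# "-" / DIGIT / ALPHA / "_" / "~"
PATH_UNESCAPED = set("-_~" + string.digits + string.ascii_letters)


def is_path_escaped(ch: str) -> bool:
    """True if the single character ch falls in the path-escaped set."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PATH_ESCAPED_RANGES)


def is_path_unescaped(ch: str) -> bool:
    """True if the single character ch is in the path-unescaped set."""
    return ch in PATH_UNESCAPED
```

The two sets are disjoint but not exhaustive: a character such as "." or "!" belongs to neither.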
          

Process each character of the path in sequence:

5.5. Canonicalizing a Query

TODO: Handle the ambient encoding case.

Process each character of the query in sequence:

5.6. Canonicalizing a Fragment

Process each character of the fragment in sequence:

Note: The above algorithm results in the canonicalized fragment containing non-US-ASCII characters.

6. References

Appendix A. Acknowledgements

TODO

Author's Address

Adam Barth
Google, Inc.

EMail: ietf@adambarth.com
URI:   http://www.adambarth.com/