< draft-fielding-uri-syntax-02.txt   draft-fielding-uri-syntax-03.txt >
Network Working Group T. Berners-Lee, MIT/LCS Network Working Group T. Berners-Lee, MIT/LCS
INTERNET-DRAFT R. Fielding, U.C. Irvine INTERNET-DRAFT R. Fielding, U.C. Irvine
draft-fielding-uri-syntax-02 L. Masinter, Xerox Corporation draft-fielding-uri-syntax-03 L. Masinter, Xerox Corporation
Expires six months after publication date March 4, 1998 Expires six months after publication date June 4, 1998
Uniform Resource Identifiers (URI): Generic Syntax Uniform Resource Identifiers (URI): Generic Syntax
Status of this Memo Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts. working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as ``work in as reference material or to cite them other than as ``work in
progress.'' progress.''
To learn the current status of any Internet-Draft, please check the To view the entire list of current Internet-Drafts, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
(Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Coast), or ftp.isi.edu (US West Coast). Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).
Instructions to RFC Editor: This document will obsolete RFC 1738 and Instructions to RFC Editor: This document will obsolete RFC 1738 and
RFC 1808. If the new version of the MHTML proposed standard is RFC 1808. If the new version of the MHTML proposed standard is
ready for publication at the same time as this document, please ready for publication at the same time as this document, please
change all references to RFC 2110 to refer to its new version. change all references to RFC 2110 to refer to its new version.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (1998). All Rights Reserved. Copyright (C) The Internet Society (1998). All Rights Reserved.
Abstract Abstract
A Uniform Resource Identifier (URI) is a compact string of characters A Uniform Resource Identifier (URI) is a compact string of characters
for identifying an abstract or physical resource. This document for identifying an abstract or physical resource. This document
defines the general syntax of URIs, including both absolute and defines the generic syntax of URI, including both absolute and
relative forms, and guidelines for their use; it revises and replaces relative forms, and guidelines for their use; it revises and replaces
the generic definitions in RFC 1738 and RFC 1808. the generic definitions in RFC 1738 and RFC 1808.
This document defines a grammar that is a superset of all valid URI,
such that an implementation can parse the common components of a URI
reference without knowing the scheme-specific requirements of every
possible identifier type. This document does not define a generative
grammar for URI; that task will be performed by the individual
specifications of each URI scheme.
1. Introduction 1. Introduction
Uniform Resource Identifiers (URIs) provide a simple and extensible Uniform Resource Identifiers (URI) provide a simple and extensible
means for identifying a resource. This specification of URI syntax means for identifying a resource. This specification of URI syntax
and semantics is derived from concepts introduced by the World Wide and semantics is derived from concepts introduced by the World Wide
Web global information initiative, whose use of such objects dates Web global information initiative, whose use of such objects dates
from 1990 and is described in "Universal Resource Identifiers in WWW" from 1990 and is described in "Universal Resource Identifiers in WWW"
[RFC1630]. The specification of URIs is designed to meet the [RFC1630]. The specification of URI is designed to meet the
recommendations laid out in "Functional Recommendations for Internet recommendations laid out in "Functional Recommendations for Internet
Resource Locators" [RFC1736] and "Functional Requirements for Uniform Resource Locators" [RFC1736] and "Functional Requirements for Uniform
Resource Names" [RFC1737]. Resource Names" [RFC1737].
This document updates and merges "Uniform Resource Locators" This document updates and merges "Uniform Resource Locators"
[RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in
order to define a single, general syntax for all URIs. It excludes order to define a single, generic syntax for all URI. It excludes
those portions of RFC 1738 that defined the specific syntax of those portions of RFC 1738 that defined the specific syntax of
individual URL schemes; those portions will be updated as separate individual URL schemes; those portions will be updated as separate
documents, as will the process for registration of new URI schemes. documents, as will the process for registration of new URI schemes.
This document does not discuss the issues and recommendations for This document does not discuss the issues and recommendations for
dealing with characters outside of the US-ASCII character set dealing with characters outside of the US-ASCII character set
[ASCII]; those recommendations are discussed in a separate document. [ASCII]; those recommendations are discussed in a separate document.
All significant changes from the prior RFCs are noted in Appendix G. All significant changes from the prior RFCs are noted in Appendix G.
1.1 Overview of URIs 1.1 Overview of URI
URIs are characterized by the following definitions: URI are characterized by the following definitions:
Uniform Uniform
Uniformity provides several benefits: it allows different types Uniformity provides several benefits: it allows different types
of resource identifiers to be used in the same context, even of resource identifiers to be used in the same context, even
when the mechanisms used to access those resources may differ; when the mechanisms used to access those resources may differ;
it allows uniform semantic interpretation of common syntactic it allows uniform semantic interpretation of common syntactic
conventions across different types of resource identifiers; it conventions across different types of resource identifiers; it
allows introduction of new types of resource identifiers allows introduction of new types of resource identifiers
without interfering with the way that existing identifiers are without interfering with the way that existing identifiers are
used; and, it allows the identifiers to be reused in many used; and, it allows the identifiers to be reused in many
skipping to change at line 102 skipping to change at line 109
The resource is the conceptual mapping to an entity or set of The resource is the conceptual mapping to an entity or set of
entities, not necessarily the entity which corresponds to that entities, not necessarily the entity which corresponds to that
mapping at any particular instance in time. Thus, a resource mapping at any particular instance in time. Thus, a resource
can remain constant even when its content---the entities to can remain constant even when its content---the entities to
which it currently corresponds---changes over time, provided which it currently corresponds---changes over time, provided
that the conceptual mapping is not changed in the process. that the conceptual mapping is not changed in the process.
Identifier Identifier
An identifier is an object that can act as a reference to An identifier is an object that can act as a reference to
something that has identity. In the case of URIs, the object something that has identity. In the case of URI, the object
is a sequence of characters with a restricted syntax. is a sequence of characters with a restricted syntax.
Having identified a resource, a system may perform a variety of Having identified a resource, a system may perform a variety of
operations on the resource, as might be characterized by such words operations on the resource, as might be characterized by such words
as `access', `update', `replace', or `find attributes'. as `access', `update', `replace', or `find attributes'.
1.2. URI, URL, and URN 1.2. URI, URL, and URN
A URI can be further classified as a locator, a name, or both. The A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URI term "Uniform Resource Locator" (URL) refers to the subset of URI
that identify resources via a representation of their primary access that identify resources via a representation of their primary access
mechanism (e.g., their network "location"), rather than identifying mechanism (e.g., their network "location"), rather than identifying
the resource by name or by some other attribute(s) of that resource. the resource by name or by some other attribute(s) of that resource.
The term "Uniform Resource Name" (URN) refers to the subset of URI The term "Uniform Resource Name" (URN) refers to the subset of URI
that are required to remain globally unique and persistent even when that are required to remain globally unique and persistent even when
the resource ceases to exist or becomes unavailable. the resource ceases to exist or becomes unavailable.
The URI scheme (Section 3.1) defines the namespace of the URI, and The URI scheme (Section 3.1) defines the namespace of the URI, and
thus may further restrict the syntax and semantics of identifiers thus may further restrict the syntax and semantics of identifiers
using that scheme. This specification defines those elements of the using that scheme. This specification defines those elements of the
URI syntax which are either required of all URI schemes or are common URI syntax that are either required of all URI schemes or are common
to many URI schemes. It thus defines the syntax and semantics that to many URI schemes. It thus defines the syntax and semantics that
are needed to implement a scheme-independent parsing mechanism for are needed to implement a scheme-independent parsing mechanism for
URI references, such that the scheme-dependent handling of a URI can URI references, such that the scheme-dependent handling of a URI can
be postponed until the scheme-dependent semantics are needed. We use be postponed until the scheme-dependent semantics are needed. We use
the term URL below when describing syntax or semantics that only the term URL below when describing syntax or semantics that only
apply to locators. apply to locators.
Although many URL schemes are named after protocols, this does not Although many URL schemes are named after protocols, this does not
imply that the only way to access the URL's resource is via the named imply that the only way to access the URL's resource is via the named
protocol. Gateways, proxies, caches, and name resolution services protocol. Gateways, proxies, caches, and name resolution services
might be used to access some resources, independent of the protocol might be used to access some resources, independent of the protocol
of their origin, and the resolution of some URLs may require the use of their origin, and the resolution of some URL may require the use
of more than one protocol (e.g., both DNS and HTTP are typically used of more than one protocol (e.g., both DNS and HTTP are typically used
to access an "http" URL's resource when it can't be found in a local to access an "http" URL's resource when it can't be found in a local
cache). cache).
A URN differs from a URL in that its primary purpose is persistent A URN differs from a URL in that its primary purpose is persistent
labeling of a resource with an identifier. That identifier is drawn labeling of a resource with an identifier. That identifier is drawn
from one of a set of defined namespaces, each of which has its own from one of a set of defined namespaces, each of which has its own
set name structure and assignment procedures. The "urn" scheme has set name structure and assignment procedures. The "urn" scheme has
been reserved to establish the requirements for a standardized URN been reserved to establish the requirements for a standardized URN
namespace, as defined in "URN Syntax" [RFC2141] and its related namespace, as defined in "URN Syntax" [RFC2141] and its related
specifications. specifications.
Most of the examples in this specification demonstrate URLs, since Most of the examples in this specification demonstrate URL, since
they allow the most varied use of the syntax and often have a they allow the most varied use of the syntax and often have a
hierarchical namespace. A parser of the URI syntax is capable of hierarchical namespace. A parser of the URI syntax is capable of
parsing both URL and URN references as a generic URI; once the scheme parsing both URL and URN references as a generic URI; once the scheme
is determined, the scheme-specific parsing can be performed on the is determined, the scheme-specific parsing can be performed on the
generic URI components. In other words, the URI syntax is a superset generic URI components. In other words, the URI syntax is a superset
of the syntax of all URI schemes. of the syntax of all URI schemes.
1.3. Example URIs 1.3. Example URI
The following examples illustrate URIs which are in common use. The following examples illustrate URI that are in common use.
ftp://ftp.is.co.za/rfc/rfc1808.txt ftp://ftp.is.co.za/rfc/rfc1808.txt
-- ftp scheme for File Transfer Protocol services -- ftp scheme for File Transfer Protocol services
gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
-- gopher scheme for Gopher and Gopher+ Protocol services -- gopher scheme for Gopher and Gopher+ Protocol services
http://www.math.uio.no/faq/compression-faq/part1.html http://www.math.uio.no/faq/compression-faq/part1.html
-- http scheme for Hypertext Transfer Protocol services -- http scheme for Hypertext Transfer Protocol services
mailto:mduerst@ifi.unizh.ch mailto:mduerst@ifi.unizh.ch
-- mailto scheme for electronic mail addresses -- mailto scheme for electronic mail addresses
news:comp.infosystems.www.servers.unix news:comp.infosystems.www.servers.unix
-- news scheme for USENET news groups and articles -- news scheme for USENET news groups and articles
telnet://melvyl.ucop.edu/ telnet://melvyl.ucop.edu/
-- telnet scheme for interactive services via the TELNET Protocol -- telnet scheme for interactive services via the TELNET Protocol
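As an informal illustration (not part of either draft), the scheme-independent parse described in Section 1.2 can be sketched with Python's standard urllib.parse module. That module follows later revisions of the generic syntax, so the results are an approximation of the grammar in this document; the helper name generic_parts is hypothetical.

```python
from urllib.parse import urlsplit

# The example URI listed in Section 1.3.
EXAMPLES = [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles",
    "http://www.math.uio.no/faq/compression-faq/part1.html",
    "mailto:mduerst@ifi.unizh.ch",
    "news:comp.infosystems.www.servers.unix",
    "telnet://melvyl.ucop.edu/",
]


def generic_parts(uri):
    """Split a URI into (scheme, authority, path) without scheme-specific rules."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(generic_parts(uri))
```

For the mailto and news examples the authority is empty and the remainder of the URI is returned as the path; any scheme-specific handling would be applied after this generic step.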
1.4. Hierarchical URIs and Relative Forms 1.4. Hierarchical URI and Relative Forms
An absolute identifier refers to a resource independent of the An absolute identifier refers to a resource independent of the
context in which the identifier is used. In contrast, a relative context in which the identifier is used. In contrast, a relative
identifier refers to a resource by describing the difference within a identifier refers to a resource by describing the difference within a
hierarchical namespace between the current context and an absolute hierarchical namespace between the current context and an absolute
identifier of the resource. identifier of the resource.
Some URI schemes support a hierarchical naming system, where the Some URI schemes support a hierarchical naming system, where the
hierarchy of the name is denoted by a "/" delimiter separating the hierarchy of the name is denoted by a "/" delimiter separating the
components in the scheme. This document defines a scheme-independent components in the scheme. This document defines a scheme-independent
`relative' form of URI reference that can be used in conjunction with `relative' form of URI reference that can be used in conjunction with
a `base' URI (of a hierarchical scheme) to produce another URI. The a `base' URI (of a hierarchical scheme) to produce another URI. The
syntax of hierarchical URIs is described in Section 3; the relative syntax of hierarchical URI is described in Section 3; the relative
URI calculation is described in Section 5. URI calculation is described in Section 5.
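A rough sketch of how a relative reference combines with a base URI, using Python's standard urllib.parse.urljoin. The base URI and the helper name resolve are assumptions made for illustration, and urljoin implements a later revision of the resolution rules than the Section 5 algorithm of this draft.

```python
from urllib.parse import urljoin

# Hypothetical base URI of a hierarchical scheme, chosen for illustration.
BASE = "http://a/b/c/d"


def resolve(relative, base=BASE):
    """Combine a relative reference with a base URI to produce another URI."""
    return urljoin(base, relative)


if __name__ == "__main__":
    for ref in ("g", "../g", "/g"):
        print(ref, "->", resolve(ref))
```

An absolute identifier passed to resolve is returned unchanged, since it does not depend on the context supplied by the base.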
1.5. URI Transcribability 1.5. URI Transcribability
The URI syntax was designed with global transcribability as one of The URI syntax was designed with global transcribability as one of
its main concerns. A URI is a sequence of characters from a very its main concerns. A URI is a sequence of characters from a very
limited set, i.e., the letters of the basic Latin alphabet, digits, limited set, i.e., the letters of the basic Latin alphabet, digits,
and a few special characters. A URI may be represented in a and a few special characters. A URI may be represented in a
variety of ways: e.g., ink on paper, pixels on a screen, or a variety of ways: e.g., ink on paper, pixels on a screen, or a
sequence of octets in a coded character set. The interpretation of sequence of octets in a coded character set. The interpretation of
skipping to change at line 219 skipping to change at line 226
for the research site on a napkin. Upon returning home, Sam takes for the research site on a napkin. Upon returning home, Sam takes
out the napkin and types the URI into a computer, which then out the napkin and types the URI into a computer, which then
retrieves the information to which Kim referred. retrieves the information to which Kim referred.
There are several design concerns revealed by the scenario: There are several design concerns revealed by the scenario:
o A URI is a sequence of characters, which is not always o A URI is a sequence of characters, which is not always
represented as a sequence of octets. represented as a sequence of octets.
o A URI may be transcribed from a non-network source, and thus o A URI may be transcribed from a non-network source, and thus
should consist of characters which are most likely to be able should consist of characters that are most likely to be able
to be typed into a computer, within the constraints imposed by to be typed into a computer, within the constraints imposed by
keyboards (and related input devices) across languages and keyboards (and related input devices) across languages and
locales. locales.
o A URI often needs to be remembered by people, and it is easier o A URI often needs to be remembered by people, and it is easier
for people to remember a URI when it consists of meaningful for people to remember a URI when it consists of meaningful
components. components.
These design concerns are not always in alignment. For example, it These design concerns are not always in alignment. For example, it
is often the case that the most meaningful name for a URI component is often the case that the most meaningful name for a URI component
would require characters which cannot be typed into some systems. would require characters that cannot be typed into some systems.
The ability to transcribe the resource identifier from one medium to The ability to transcribe the resource identifier from one medium to
another was considered more important than having its URI consist another was considered more important than having its URI consist
of the most meaningful of components. In local and regional of the most meaningful of components. In local and regional
contexts and with improving technology, users might benefit from contexts and with improving technology, users might benefit from
being able to use a wider range of characters; such use is not being able to use a wider range of characters; such use is not
defined in this document. defined in this document.
1.6. Syntax Notation and Common Elements 1.6. Syntax Notation and Common Elements
This document uses two conventions to describe and define the syntax This document uses two conventions to describe and define the syntax
skipping to change at line 261 skipping to change at line 268
The second convention is a BNF-like grammar, used to define the The second convention is a BNF-like grammar, used to define the
formal URI syntax. The grammar is that of [RFC822], except that formal URI syntax. The grammar is that of [RFC822], except that
"|" is used to designate alternatives. Briefly, rules are separated "|" is used to designate alternatives. Briefly, rules are separated
from definitions by an equal sign "=", indentation is used to continue a from definitions by an equal sign "=", indentation is used to continue a
rule definition over more than one line, literals are quoted with "", rule definition over more than one line, literals are quoted with "",
parentheses "(" and ")" are used to group elements, optional elements parentheses "(" and ")" are used to group elements, optional elements
are enclosed in "[" and "]" brackets, and elements may be preceded are enclosed in "[" and "]" brackets, and elements may be preceded
with <n>* to designate n or more repetitions of the following with <n>* to designate n or more repetitions of the following
element; n defaults to 0. element; n defaults to 0.
Unlike many specifications which use a BNF-like grammar to define the Unlike many specifications that use a BNF-like grammar to define the
bytes (octets) allowed by a protocol, the URI grammar is defined in bytes (octets) allowed by a protocol, the URI grammar is defined in
terms of characters. Each literal in the grammar corresponds to the terms of characters. Each literal in the grammar corresponds to the
character it represents, rather than to the octet encoding of that character it represents, rather than to the octet encoding of that
character in any particular coded character set. How a URI is character in any particular coded character set. How a URI is
represented in terms of bits and bytes on the wire is dependent upon represented in terms of bits and bytes on the wire is dependent upon
the character encoding of the protocol used to transport it, or the the character encoding of the protocol used to transport it, or the
charset of the document which contains it. charset of the document which contains it.
The following definitions are common to many elements: The following definitions are common to many elements:
skipping to change at line 291 skipping to change at line 298
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9" "8" | "9"
alphanum = alpha | digit alphanum = alpha | digit
The complete URI syntax is collected in Appendix A. The complete URI syntax is collected in Appendix A.
2. URI Characters and Escape Sequences 2. URI Characters and Escape Sequences
URIs consist of a restricted set of characters, primarily chosen to URI consist of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as non-computer communications. Characters used conventionally as
delimiters around URIs were excluded. The restricted set of delimiters around URI were excluded. The restricted set of
characters, consisting of digits, letters, and a few graphic symbols, characters, consisting of digits, letters, and a few graphic symbols,
was chosen from those common to most of the character encodings was chosen from those common to most of the character encodings
and input facilities available to Internet users. and input facilities available to Internet users.
uric = reserved | unreserved | escaped
Within a URI, characters are either used as delimiters, or to Within a URI, characters are either used as delimiters, or to
represent strings of data (octets) within the delimited portions. represent strings of data (octets) within the delimited portions.
Octets are either represented directly by a character (using the Octets are either represented directly by a character (using the
US-ASCII character for that octet [ASCII]) or by an escape encoding. US-ASCII character for that octet [ASCII]) or by an escape encoding.
This representation is elaborated below. This representation is elaborated below.
2.1 URIs and non-ASCII characters 2.1 URI and non-ASCII characters
The relationship between URIs and characters has been a source of The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character" the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit (as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, byte). There are two mappings, one from URI characters to octets,
and a second from octets to original characters: and a second from octets to original characters:
URI character sequence->octet sequence->original character sequence URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URIs might be "transported" by means that of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read are not through a computer network, e.g., printed on paper, read
over the radio, etc. over the radio, etc.
A URI scheme may define a mapping from URI characters to octets; A URI scheme may define a mapping from URI characters to octets;
whether this is done depends on the scheme. Commonly, within a whether this is done depends on the scheme. Commonly, within a
delimited component of a URI, a sequence of characters may be delimited component of a URI, a sequence of characters may be
used to represent a sequence of octets. For example, the character used to represent a sequence of octets. For example, the character
"a" represents the octet 97 (decimal), while the character sequence "a" represents the octet 97 (decimal), while the character sequence
"%", "0", "a" represents the octet 10 (decimal). "%", "0", "a" represents the octet 10 (decimal).
skipping to change at line 353 skipping to change at line 362
however, the situation is more difficult. Internet protocols that however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, are expected to provide some way of identifying the charset used,
if there might be more than one [RFC2277]. However, there is if there might be more than one [RFC2277]. However, there is
currently no provision within the generic URI syntax to accomplish currently no provision within the generic URI syntax to accomplish
this identification. An individual URI scheme may require a single this identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the charset, define a default charset, or provide a way to indicate the
charset used. charset used.
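The two mappings of Section 2.1 can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the charsets used in the example (ISO-8859-1 and UTF-8) are assumptions standing in for whatever a given URI scheme might specify.

```python
from urllib.parse import unquote_to_bytes


def to_octets(uri_chars):
    """First mapping: URI character sequence -> octet sequence."""
    return unquote_to_bytes(uri_chars)


def to_original(octets, charset):
    """Second mapping: octet sequence -> original character sequence.

    The charset is not carried by the generic URI syntax; here the caller
    must supply it, as a scheme-defined or out-of-band value would.
    """
    return octets.decode(charset)


if __name__ == "__main__":
    print(list(to_octets("a")))      # the octet for the character "a"
    print(list(to_octets("%0a")))    # the octet for the sequence "%", "0", "a"
    print(to_original(to_octets("caf%E9"), "iso-8859-1"))
    print(to_original(to_octets("caf%C3%A9"), "utf-8"))
```

The same original characters yield different URI character sequences under the two charsets, which is why the octets alone are not enough to recover the original text.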
It is expected that a systematic treatment of character encoding It is expected that a systematic treatment of character encoding
within URIs will be developed as a future modification of this within URI will be developed as a future modification of this
specification. specification.
2.2. Reserved Characters 2.2. Reserved Characters
Many URIs include components consisting of, or delimited by, certain Many URI include components consisting of, or delimited by, certain
special characters. These characters are called "reserved", since special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before reserved purpose, then the conflicting data must be escaped before
forming the URI. forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | "," "$" | ","
The "reserved" syntax class above refers to those characters which The "reserved" syntax class above refers to those characters that
are allowed within a URI, but which may not be allowed within a are allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3. delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts. Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding. replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters 2.3. Unreserved Characters
Data characters which are allowed in a URI but do not have a reserved Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and letters, decimal digits, and a limited set of punctuation marks and
symbols. symbols.
unreserved = alphanum | mark unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used of the URI, but this should not be done unless the URI is being used
in a context which does not allow the unescaped character to appear. in a context that does not allow the unescaped character to appear.
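The "reserved", "mark", and "unreserved" classes of Sections 2.2 and 2.3 can be transcribed into character sets as a minimal sketch. The names RESERVED, MARK, UNRESERVED, and classify are hypothetical; the set members are copied from the grammar rules above, with "alphanum" taken to be the US-ASCII letters and digits.

```python
import string

# reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
RESERVED = set(";/?:@&=+$,")

# mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
MARK = set("-_.!~*'()")

# unreserved = alphanum | mark
UNRESERVED = set(string.ascii_letters + string.digits) | MARK


def classify(ch):
    """Return the syntax class of a single character."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Anything else is in neither class, e.g. a space or the "%" that
    # introduces an escape sequence.
    return "neither"


if __name__ == "__main__":
    for ch in "/~ %":
        print(repr(ch), classify(ch))
```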
2.4. Escape Sequences 2.4. Escape Sequences
Data must be escaped if it does not have a representation using an Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond unreserved character; this includes data that does not correspond
to a printable character of the US-ASCII coded character set, or to a printable character of the US-ASCII coded character set, or
that corresponds to any US-ASCII character that is disallowed, as that corresponds to any US-ASCII character that is disallowed, as
explained below. explained below.
2.4.1. Escaped Encoding 2.4.1. Escaped Encoding
skipping to change at line 419 skipping to change at line 428
escaped = "%" hex hex escaped = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f" "a" | "b" | "c" | "d" | "e" | "f"
2.4.2. When to Escape and Unescape 2.4.2. When to Escape and Unescape
A URI is always in an "escaped" form, since escaping or unescaping A URI is always in an "escaped" form, since escaping or unescaping
a completed URI might change its semantics. Normally, the only a completed URI might change its semantics. Normally, the only
time escape encodings can safely be made is when the URI is being time escape encodings can safely be made is when the URI is being
created from its component parts; each component may have its own created from its component parts; each component may have its own
set of characters which are reserved, so only the mechanism set of characters that are reserved, so only the mechanism
responsible for generating or interpreting that component can responsible for generating or interpreting that component can
determine whether or not escaping a character will change its determine whether or not escaping a character will change its
semantics. Likewise, a URI must be separated into its components semantics. Likewise, a URI must be separated into its components
before the escaped characters within those components can be safely before the escaped characters within those components can be safely
decoded. decoded.
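The ordering constraint above (escape while assembling, split before decoding) can be illustrated with a small sketch. It assumes a path made of simple segments without parameters, and it escapes every octet that is not <unreserved>, which is more conservative than the grammar requires. The function names are hypothetical and are not part of this specification.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
_ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def escape_component(data: bytes) -> str:
    """Escape every octet of one component that is not <unreserved>."""
    out = []
    for octet in data:
        ch = chr(octet)
        out.append(ch if ch in UNRESERVED else "%{:02X}".format(octet))
    return "".join(out)


def unescape_component(text: str) -> bytes:
    """Decode the "%" hex hex sequences of one already-separated component."""
    out = bytearray()
    i = 0
    while i < len(text):
        m = _ESCAPED.match(text, i)
        if m:
            out.append(int(m.group(1), 16))
            i = m.end()
        else:
            # A "%" not followed by two hex digits is passed through as-is.
            out.extend(text[i].encode("ascii"))
            i += 1
    return bytes(out)


def build_path(segments):
    """Escape each segment on its own, then join with the "/" delimiter."""
    return "/" + "/".join(escape_component(s) for s in segments)


def split_path(path: str):
    """Separate on "/" first, and only then decode each segment."""
    return [unescape_component(seg) for seg in path.split("/")[1:]]
```

With these helpers, build_path([b"a/b", b"c d"]) yields "/a%2Fb/c%20d", and split_path recovers the two original segments. Decoding the whole string before splitting would instead produce three segments, which is the change of semantics the paragraph above warns about.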
In some cases, data that could be represented by an unreserved In some cases, data that could be represented by an unreserved
character may appear escaped; for example, some of the unreserved character may appear escaped; for example, some of the unreserved
"mark" characters are automatically escaped by some systems. If the "mark" characters are automatically escaped by some systems. If the
given URI scheme defines a canonicalization algorithm, then given URI scheme defines a canonicalization algorithm, then
skipping to change at line 445 skipping to change at line 454
being the escape indicator, it must be escaped as "%25" in order to being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI. Implementers should be careful not to be used as data within a URI. Implementers should be careful not to
escape or unescape the same string more than once, since unescaping escape or unescape the same string more than once, since unescaping
an already unescaped string might lead to misinterpreting a percent an already unescaped string might lead to misinterpreting a percent
data character as another escaped character, or vice versa in the data character as another escaped character, or vice versa in the
case of escaping an already escaped string. case of escaping an already escaped string.
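The hazard of escaping or unescaping twice can be demonstrated with the quote and unquote functions of Python's standard urllib.parse module. That module follows a later revision of the URI syntax rather than this draft, so it is used here only as a convenient illustration; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

# Data that happens to contain a literal percent sign followed by hex digits.
data = "%41"

escaped_once = quote(data, safe="")           # correct: "%2541"
escaped_twice = quote(escaped_once, safe="")  # wrong: "%252541"

unescaped_once = unquote(escaped_once)        # correct: "%41"
unescaped_twice = unquote(unescaped_once)     # wrong: "A"
```

After one round trip the original data is recovered. A second unescape silently turns the percent data character into a different character, and a second escape produces a string that no longer decodes to the original in one step.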
2.4.3. Excluded US-ASCII Characters 2.4.3. Excluded US-ASCII Characters
Although they are disallowed within the URI syntax, we include here Although they are disallowed within the URI syntax, we include here
a description of those US-ASCII characters which have been excluded a description of those US-ASCII characters that have been excluded
and the reasons for their exclusion. and the reasons for their exclusion.
The control characters in the US-ASCII coded character set are not The control characters in the US-ASCII coded character set are not
used within a URI, both because they are non-printable and because used within a URI, both because they are non-printable and because
they are likely to be misinterpreted by some control mechanisms. they are likely to be misinterpreted by some control mechanisms.
control = <US-ASCII coded characters 00-1F and 7F hexadecimal> control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
The space character is excluded because significant spaces may The space character is excluded because significant spaces may
disappear and insignificant spaces may be introduced when URIs are disappear and insignificant spaces may be introduced when URI are
transcribed or typeset or subjected to the treatment of transcribed or typeset or subjected to the treatment of
word-processing programs. Whitespace is also used to delimit URIs word-processing programs. Whitespace is also used to delimit URI
in many contexts. in many contexts.
space = <US-ASCII coded character 20 hexadecimal> space = <US-ASCII coded character 20 hexadecimal>
The angle-bracket "<" and ">" and double-quote (") characters are The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URIs excluded because they are often used as the delimiters around URI
in text documents and protocol fields. The character "#" is in text documents and protocol fields. The character "#" is
excluded because it is used to delimit a URI from a fragment excluded because it is used to delimit a URI from a fragment
identifier in URI references (Section 4). The percent character "%" identifier in URI references (Section 4). The percent character "%"
is excluded because it is used for the encoding of escaped is excluded because it is used for the encoding of escaped
characters. characters.
delims = "<" | ">" | "#" | "%" | <"> delims = "<" | ">" | "#" | "%" | <">
Other characters are excluded because gateways and other transport Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are agents are known to sometimes modify such characters, or they are
used as delimiters. used as delimiters.
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Data corresponding to excluded characters must be escaped in order Data corresponding to excluded characters must be escaped in order
to be properly represented within a URI. to be properly represented within a URI.
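The four excluded sets above can be collected into a single lookup for checking raw data before it is escaped. This is a sketch only; the names EXCLUDED and find_excluded are hypothetical. It is meant for unescaped data, since "%" and "#" legitimately appear as syntax in a finished URI reference.

```python
# <control>: US-ASCII 00-1F and 7F hexadecimal.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
# <space>: US-ASCII 20 hexadecimal.
SPACE = {" "}
# <delims> and <unwise>, as listed in the rules above.
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")

EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE


def find_excluded(data: str):
    """Return (index, character) pairs for every excluded character in data."""
    return [(i, ch) for i, ch in enumerate(data) if ch in EXCLUDED]
```

An empty result means the data contains no excluded US-ASCII characters; it does not by itself mean the data is free of characters that are reserved within a particular component.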
3. URI Syntactic Components 3. URI Syntactic Components
The URI syntax is dependent upon the scheme. In general, absolute The URI syntax is dependent upon the scheme. In general, absolute
URIs are written as follows: URI are written as follows:
<scheme>:<scheme-specific-part> <scheme>:<scheme-specific-part>
An absolute URI contains the name of the scheme being used (<scheme>) An absolute URI contains the name of the scheme being used (<scheme>)
followed by a colon (":") and then a string (the <scheme-specific- followed by a colon (":") and then a string (the <scheme-specific-
part>) whose interpretation depends on the scheme. part>) whose interpretation depends on the scheme.
The URI syntax does not require that the scheme-specific-part have The URI syntax does not require that the scheme-specific-part have
any general structure or set of semantics which is common among all any general structure or set of semantics which is common among all
URIs. However, a subset of URIs do share a common syntax for URI. However, a subset of URI do share a common syntax for
representing hierarchical relationships within the namespace. This representing hierarchical relationships within the namespace. This
"generic-URI" syntax consists of a sequence of four main components: "generic URI" syntax consists of a sequence of four main components:
<scheme>://<authority><path>?<query> <scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a particular URI. each of which, except <scheme>, may be absent from a particular URI.
For example, some URI schemes do not allow an <authority> component, For example, some URI schemes do not allow an <authority> component,
and others do not use a <query> component. and others do not use a <query> component.
absoluteURI = generic-URI | opaque-URI absoluteURI = scheme ":" ( hier_part | opaque_part )
opaque-URI = scheme ":" *uric
generic-URI = scheme ":" relativeURI
The separation of the URI grammar into <generic-URI> and <opaque-URI>
is redundant, since both rules will successfully parse any string of
<uric> characters. The distinction is simply to clarify that a
parser of relative URI references (Section 5) will view a URI as a
generic-URI, whereas a handler of absolute references need only view
it as an opaque-URI.
URIs which are hierarchical in nature use the slash "/" character for URI that are hierarchical in nature use the slash "/" character for
separating hierarchical components. For some file systems, a "/" separating hierarchical components. For some file systems, a "/"
character (used to denote the hierarchical structure of a URI) is the character (used to denote the hierarchical structure of a URI) is the
delimiter used to construct a file name hierarchy, and thus the URI delimiter used to construct a file name hierarchy, and thus the URI
path will look similar to a file pathname. This does NOT imply that path will look similar to a file pathname. This does NOT imply that
the resource is a file or that the URI maps to an actual filesystem the resource is a file or that the URI maps to an actual filesystem
pathname. pathname.
hier_part = ( net_path | abs_path ) [ "?" query ]
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments
URI that do not make use of the slash "/" character for separating
hierarchical components are considered opaque by the generic URI
parser.
opaque_part = uric_no_slash *uric
uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
"&" | "=" | "+" | "$" | ","
We use the term <path> to refer to both the <abs_path> and
<opaque_part> constructs, since they are mutually exclusive for any
given URI and can be parsed as a single component.
3.1. Scheme Component 3.1. Scheme Component
Just as there are many different methods of access to resources, Just as there are many different methods of access to resources,
there are a variety of schemes for identifying such resources. The there are a variety of schemes for identifying such resources. The
URI syntax consists of a sequence of components separated by reserved URI syntax consists of a sequence of components separated by reserved
characters, with the first component defining the semantics for the characters, with the first component defining the semantics for the
remainder of the URI string. remainder of the URI string.
Scheme names consist of a sequence of characters beginning with a Scheme names consist of a sequence of characters beginning with a
lower case letter and followed by any combination of lower case lower case letter and followed by any combination of lower case
letters, digits, plus ("+"), period ("."), or hyphen ("-"). For letters, digits, plus ("+"), period ("."), or hyphen ("-"). For
resiliency, programs interpreting URIs should treat upper case resiliency, programs interpreting URI should treat upper case
letters as equivalent to lower case in scheme names (e.g., allow letters as equivalent to lower case in scheme names (e.g., allow
"HTTP" as well as "http"). "HTTP" as well as "http").
scheme = alpha *( alpha | digit | "+" | "-" | "." ) scheme = alpha *( alpha | digit | "+" | "-" | "." )
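The <scheme> rule translates directly into a regular expression. The sketch below assumes that <alpha> covers both upper and lower case US-ASCII letters, and it folds the result to lower case as the resiliency advice above suggests. The name split_scheme is illustrative.

```python
import re

# alpha *( alpha | digit | "+" | "-" | "." ) followed by the ":" delimiter.
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")


def split_scheme(ref: str):
    """Return (scheme, rest); scheme is None when ref has no scheme prefix."""
    m = _SCHEME.match(ref)
    if m is None:
        return None, ref
    return m.group(1).lower(), ref[m.end():]
```

For example, split_scheme("HTTP://example.com/") gives ("http", "//example.com/"), while a reference such as "this/that" has no scheme and is returned unchanged.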
Relative URI references are distinguished from absolute URIs in that Relative URI references are distinguished from absolute URI in that
they do not begin with a scheme name. Instead, the scheme is they do not begin with a scheme name. Instead, the scheme is
inherited from the base URI, as described in Section 5.2. inherited from the base URI, as described in Section 5.2.
3.2. Authority Component 3.2. Authority Component
Many URI schemes include a top hierarchical element for a naming Many URI schemes include a top hierarchical element for a naming
authority, such that the namespace defined by the remainder of the authority, such that the namespace defined by the remainder of the
URI is governed by that authority. This authority component is URI is governed by that authority. This authority component is
typically defined by an Internet-based server or a scheme-specific typically defined by an Internet-based server or a scheme-specific
registry of naming authorities. registry of naming authorities.
skipping to change at line 597 skipping to change at line 614
server = [ [ userinfo "@" ] hostport ] server = [ [ userinfo "@" ] hostport ]
The user information, if present, is followed by a commercial The user information, if present, is followed by a commercial
at-sign "@". at-sign "@".
userinfo = *( unreserved | escaped | userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," ) ";" | ":" | "&" | "=" | "+" | "$" | "," )
Some URL schemes use the format "user:password" in the userinfo Some URL schemes use the format "user:password" in the userinfo
field. This practice is NOT RECOMMENDED, because the passing of field. This practice is NOT RECOMMENDED, because the passing of
authentication information in clear text (such as URIs) has proven to authentication information in clear text (such as URI) has proven to
be a security risk in almost every case where it has been used. be a security risk in almost every case where it has been used.
The host is a domain name of a network host, or its IPv4 address as The host is a domain name of a network host, or its IPv4 address as
a set of four decimal digit groups separated by ".". Literal IPv6 a set of four decimal digit groups separated by ".". Literal IPv6
addresses are not supported. addresses are not supported.
hostport = host [ ":" port ] hostport = host [ ":" port ]
host = hostname | IPv4address host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ] hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
skipping to change at line 640 skipping to change at line 657
number may optionally be supplied, in decimal, separated from the number may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the default port number is host by a colon. If the port is omitted, the default port number is
assumed. assumed.
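The <server> and <hostport> rules can be taken apart with two string splits. This is a minimal sketch that performs no validation of the host or port; split_server is a hypothetical helper name. It relies on the fact that neither <userinfo> nor <host> may contain an unescaped "@".

```python
def split_server(server: str):
    """Split a <server> string into (userinfo, host, port).

    userinfo and port are None when their delimiter is absent.
    """
    userinfo, at, hostport = server.rpartition("@")
    if not at:
        userinfo = None
    # <userinfo> may contain ":", so the port is split from hostport only.
    host, colon, port = hostport.partition(":")
    return userinfo, host, (port if colon else None)
```

For instance, split_server("user@example.com:8080") returns ("user", "example.com", "8080"). A port of None indicates that the scheme's default port applies.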
3.3. Path Component 3.3. Path Component
The path component contains data, specific to the authority (or the The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority. within the scope of that scheme and authority.
path = [ "/" ] path_segments path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment ) path_segments = segment *( "/" segment )
segment = *pchar *( ";" param ) segment = *pchar *( ";" param )
param = *pchar param = *pchar
pchar = unreserved | escaped | pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | "," ":" | "@" | "&" | "=" | "+" | "$" | ","
The path may consist of a sequence of path segments separated by a The path may consist of a sequence of path segments separated by a
single slash "/" character. Within a path segment, the characters single slash "/" character. Within a path segment, the characters
skipping to change at line 671 skipping to change at line 688
query = *uric query = *uric
Within a query component, the characters ";", "/", "?", ":", "@", Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved. "&", "=", "+", ",", and "$" are reserved.
4. URI References 4. URI References
The term "URI-reference" is used here to denote the common usage of The term "URI-reference" is used here to denote the common usage of
a resource identifier. A URI reference may be absolute or relative, a resource identifier. A URI reference may be absolute or relative,
and may have additional information attached in the form of a and may have additional information attached in the form of a
fragment identifier. However, "the URI" which results from such a fragment identifier. However, "the URI" that results from such a
reference includes only the absolute URI after the fragment reference includes only the absolute URI after the fragment
identifier (if any) is removed and after any relative URI is identifier (if any) is removed and after any relative URI is
resolved to its absolute form. Although it is possible to limit resolved to its absolute form. Although it is possible to limit
the discussion of URI syntax and semantics to that of the absolute the discussion of URI syntax and semantics to that of the absolute
result, most usage of URIs is within general URI references, and it result, most usage of URI is within general URI references, and it
is impossible to obtain the URI from such a reference without also is impossible to obtain the URI from such a reference without also
parsing the fragment and resolving the relative form. parsing the fragment and resolving the relative form.
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
The syntax for relative URIs is a shortened form of that for absolute The syntax for relative URI is a shortened form of that for absolute
URIs, where some prefix of the URI is missing and certain path URI, where some prefix of the URI is missing and certain path
components ("." and "..") have a special meaning when interpreting a components ("." and "..") have a special meaning when interpreting a
relative path. The relative URI syntax is defined in Section 5. relative path. The relative URI syntax is defined in Section 5.
4.1. Fragment Identifier 4.1. Fragment Identifier
When a URI reference is used to perform a retrieval action on the When a URI reference is used to perform a retrieval action on the
identified resource, the optional fragment identifier, separated from identified resource, the optional fragment identifier, separated from
the URI by a crosshatch ("#") character, consists of additional the URI by a crosshatch ("#") character, consists of additional
reference information to be interpreted by the user agent after the reference information to be interpreted by the user agent after the
retrieval action has been successfully completed. As such, it is not retrieval action has been successfully completed. As such, it is not
part of a URI, but is often used in conjunction with a URI. part of a URI, but is often used in conjunction with a URI.
fragment = *uric fragment = *uric
The semantics of a fragment identifier is a property of the data The semantics of a fragment identifier is a property of the data
resulting from a retrieval action, regardless of the type of URI used resulting from a retrieval action, regardless of the type of URI used
in the reference. Therefore, the format and interpretation of in the reference. Therefore, the format and interpretation of
fragment identifiers is dependent on the media type [RFC2046] of the fragment identifiers is dependent on the media type [RFC2046] of the
retrieval result. The character restrictions described in Section 2 retrieval result. The character restrictions described in Section 2
for URIs also apply to the fragment in a URI-reference. Individual for URI also apply to the fragment in a URI-reference. Individual
media types may define additional restrictions or structure within media types may define additional restrictions or structure within
the fragment for specifying different types of "partial views" that the fragment for specifying different types of "partial views" that
can be identified within that media type. can be identified within that media type.
A fragment identifier is only meaningful when a URI reference is A fragment identifier is only meaningful when a URI reference is
intended for retrieval and the result of that retrieval is a document intended for retrieval and the result of that retrieval is a document
for which the identified fragment is consistently defined. for which the identified fragment is consistently defined.
4.2. Same-document References 4.2. Same-document References
A URI reference which does not contain a URI is a reference to the A URI reference that does not contain a URI is a reference to the
current document. In other words, an empty URI reference within a current document. In other words, an empty URI reference within a
document is interpreted as a reference to the start of that document, document is interpreted as a reference to the start of that document,
and a reference containing only a fragment identifier is a reference and a reference containing only a fragment identifier is a reference
to the identified fragment of that document. Traversal of such a to the identified fragment of that document. Traversal of such a
reference should not result in an additional retrieval action. reference should not result in an additional retrieval action.
However, if the URI reference occurs in a context that is always However, if the URI reference occurs in a context that is always
intended to result in a new request, as in the case of HTML's FORM intended to result in a new request, as in the case of HTML's FORM
element, then an empty URI reference represents the base URI of the element, then an empty URI reference represents the base URI of the
current document and should be replaced by that URI when transformed current document and should be replaced by that URI when transformed
into a request. into a request.
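Separating the fragment and recognizing a same-document reference requires only a split at the first "#". The following sketch uses hypothetical function names and treats a missing fragment (None) as distinct from an empty one.

```python
def split_fragment(ref: str):
    """Return (uri_part, fragment); fragment is None when no "#" is present."""
    uri_part, hash_mark, fragment = ref.partition("#")
    return uri_part, (fragment if hash_mark else None)


def is_same_document(ref: str) -> bool:
    """True when the reference contains no URI, only an optional fragment."""
    uri_part, _ = split_fragment(ref)
    return uri_part == ""
```

Both the empty reference "" and a fragment-only reference such as "#top" are reported as same-document references, so a user agent following this sketch would not issue a new retrieval for them outside of contexts like the FORM case described above.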
4.3. Parsing a URI Reference 4.3. Parsing a URI Reference
A URI reference is typically parsed according to the four main A URI reference is typically parsed according to the four main
components and fragment identifier in order to determine what components and fragment identifier in order to determine what
components are present and whether the reference is relative or components are present and whether the reference is relative or
absolute. The individual components are then parsed for their absolute. The individual components are then parsed for their
subparts and to verify their validity. A reference is parsed as if subparts and, if not opaque, to verify their validity.
it is a generic-URI, even though it might be considered opaque by
later processes.
Although the BNF defines what is allowed in each component, it is Although the BNF defines what is allowed in each component, it is
ambiguous in terms of differentiating between an authority component ambiguous in terms of differentiating between an authority component
and a path component that begins with two slash characters. The and a path component that begins with two slash characters. The
greedy algorithm is used for disambiguation: the left-most matching greedy algorithm is used for disambiguation: the left-most matching
rule soaks up as much of the URI reference string as it is capable of rule soaks up as much of the URI reference string as it is capable of
matching. In other words, the authority component wins. matching. In other words, the authority component wins.
Readers familiar with regular expressions should see Appendix B for a Readers familiar with regular expressions should see Appendix B for a
concrete parsing example and test oracle. concrete parsing example and test oracle.
5. Relative URI References 5. Relative URI References
It is often the case that a group or "tree" of documents has been It is often the case that a group or "tree" of documents has been
constructed to serve a common purpose; the vast majority of URIs in constructed to serve a common purpose; the vast majority of URI in
these documents point to resources within the tree rather than these documents point to resources within the tree rather than
outside of it. Similarly, documents located at a particular site outside of it. Similarly, documents located at a particular site
are much more likely to refer to other resources at that site than are much more likely to refer to other resources at that site than
to resources at remote sites. to resources at remote sites.
Relative addressing of URLs allows document trees to be partially Relative addressing of URI allows document trees to be partially
independent of their location and access scheme. For instance, it is independent of their location and access scheme. For instance, it is
possible for a single set of hypertext documents to be simultaneously possible for a single set of hypertext documents to be simultaneously
accessible and traversable via each of the "file", "http", and "ftp" accessible and traversable via each of the "file", "http", and "ftp"
schemes if the documents refer to each other using relative URIs. schemes if the documents refer to each other using relative URI.
Furthermore, such document trees can be moved, as a whole, without Furthermore, such document trees can be moved, as a whole, without
changing any of the relative references. Experience within the WWW changing any of the relative references. Experience within the WWW
has demonstrated that the ability to perform relative referencing has demonstrated that the ability to perform relative referencing
is necessary for the long-term usability of embedded URLs. is necessary for the long-term usability of embedded URI.
relativeURI = net_path | abs_path | rel_path The syntax for relative URI takes advantage of the <hier_part> syntax
of <absoluteURI> (Section 3) in order to express a reference that is
relative to the namespace of another hierarchical URI.
A relative reference beginning with two slash characters is termed a relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
network-path reference. Such references are rarely used.
net_path = "//" authority [ abs_path ] A relative reference beginning with two slash characters is termed a
network-path reference, as defined by <net_path> in Section 3. Such
references are rarely used.
A relative reference beginning with a single slash character is A relative reference beginning with a single slash character is
termed an absolute-path reference. termed an absolute-path reference, as defined by <abs_path> in
Section 3.
abs_path = "/" rel_path
A relative reference which does not begin with a scheme name or a A relative reference that does not begin with a scheme name or a
slash character is termed a relative-path reference. slash character is termed a relative-path reference.
rel_path = [ path_segments ] [ "?" query ] rel_path = rel_segment [ abs_path ]
rel_segment = 1*( unreserved | escaped |
";" | "@" | "&" | "=" | "+" | "$" | "," )
Within a relative-path reference, the complete path segments "." and Within a relative-path reference, the complete path segments "." and
".." have special meanings: "the current hierarchy level" and "the ".." have special meanings: "the current hierarchy level" and "the
level above this hierarchy level", respectively. Although this is level above this hierarchy level", respectively. Although this is
very similar to their use within Unix-based filesystems to indicate very similar to their use within Unix-based filesystems to indicate
directory levels, these path components are only considered special directory levels, these path components are only considered special
when resolving a relative-path reference to its absolute form when resolving a relative-path reference to its absolute form
(Section 5.2). (Section 5.2).
Authors should be aware that a path segment which contains a colon Authors should be aware that a path segment which contains a colon
character cannot be used as the first segment of a relative URI path character cannot be used as the first segment of a relative URI path
(e.g., "this:that"), because it would be mistaken for a scheme name. (e.g., "this:that"), because it would be mistaken for a scheme name.
It is therefore necessary to precede such segments with other It is therefore necessary to precede such segments with other
segments (e.g., "./this:that") in order for them to be referenced as segments (e.g., "./this:that") in order for them to be referenced as
a relative path. a relative path.
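The colon pitfall can be checked and worked around programmatically. The sketch below assumes the <scheme> syntax of Section 3.1; looks_absolute and protect_relative_path are hypothetical helper names.

```python
import re

# A leading <scheme> followed by ":" makes a reference look absolute.
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def looks_absolute(ref: str) -> bool:
    """True if a parser would read the start of ref as a scheme name."""
    return _SCHEME_PREFIX.match(ref) is not None


def protect_relative_path(path: str) -> str:
    """Prefix "./" when the first segment of a relative path has a colon."""
    first_segment = path.split("/", 1)[0]
    if ":" in first_segment:
        return "./" + path
    return path
```

Here looks_absolute("this:that") is true, whereas the protected form produced by protect_relative_path("this:that"), namely "./this:that", is not mistaken for an absolute URI. Paths whose colon appears only in a later segment are returned unchanged.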
It is not necessary for all URIs within a given scheme to be It is not necessary for all URI within a given scheme to be
restricted to the generic-URI syntax, since the hierarchical restricted to the <hier_part> syntax, since the hierarchical
properties of that syntax are only necessary when relative URIs are properties of that syntax are only necessary when relative URI are
used within a particular document. Documents can only make use of used within a particular document. Documents can only make use of
relative URIs when their base URI fits within the generic-URI syntax. relative URI when their base URI fits within the <hier_part> syntax.
It is assumed that any document which contains a relative reference It is assumed that any document which contains a relative reference
will also have a base URI that obeys the syntax. In other words, will also have a base URI that obeys the syntax. In other words,
relative URIs cannot be used within a document that has an unsuitable relative URI cannot be used within a document that has an unsuitable
base URI. base URI.
Some URI schemes do not allow a hierarchical syntax matching the Some URI schemes do not allow a hierarchical syntax matching the
generic-URI syntax, and thus cannot use relative references. <hier_part> syntax, and thus cannot use relative references.
5.1. Establishing a Base URI 5.1. Establishing a Base URI
The term "relative URI" implies that there exists some absolute "base The term "relative URI" implies that there exists some absolute "base
URI" against which the relative reference is applied. Indeed, the URI" against which the relative reference is applied. Indeed, the
base URI is necessary to define the semantics of any relative URI base URI is necessary to define the semantics of any relative URI
reference; without it, a relative reference is meaningless. In order reference; without it, a relative reference is meaningless. In order
for relative URIs to be usable within a document, the base URI of for relative URI to be usable within a document, the base URI of
that document must be known to the parser. that document must be known to the parser.
The base URI of a document can be established in one of four ways, The base URI of a document can be established in one of four ways,
listed below in order of precedence. The order of precedence can be listed below in order of precedence. The order of precedence can be
thought of in terms of layers, where the innermost defined base URI thought of in terms of layers, where the innermost defined base URI
has the highest precedence. This can be visualized graphically as: has the highest precedence. This can be visualized graphically as:
.----------------------------------------------------------. .----------------------------------------------------------.
| .----------------------------------------------------. | | .----------------------------------------------------. |
| | .----------------------------------------------. | | | | .----------------------------------------------. | |
skipping to change at line 893 skipping to change at line 913
5.1.4. Default Base URI 5.1.4. Default Base URI
If none of the conditions described in Sections 5.1.1--5.1.3 apply, If none of the conditions described in Sections 5.1.1--5.1.3 apply,
then the base URI is defined by the context of the application. then the base URI is defined by the context of the application.
Since this definition is necessarily application-dependent, failing Since this definition is necessarily application-dependent, failing
to define the base URI using one of the other methods may result in to define the base URI using one of the other methods may result in
the same content being interpreted differently by different types of the same content being interpreted differently by different types of
application. application.
It is the responsibility of the distributor(s) of a document It is the responsibility of the distributor(s) of a document
containing relative URIs to ensure that the base URI for that containing relative URI to ensure that the base URI for that
document can be established. It must be emphasized that relative document can be established. It must be emphasized that relative
URIs cannot be used reliably in situations where the document's URI cannot be used reliably in situations where the document's
base URI is not well-defined. base URI is not well-defined.
5.2. Resolving Relative References to Absolute Form 5.2. Resolving Relative References to Absolute Form
This section describes an example algorithm for resolving URI This section describes an example algorithm for resolving URI
references which might be relative to a given base URI. references that might be relative to a given base URI.
The base URI is established according to the rules of Section 5.1 and The base URI is established according to the rules of Section 5.1 and
parsed into the four main components as described in Section 3. parsed into the four main components as described in Section 3.
Note that only the scheme component is required to be present in the Note that only the scheme component is required to be present in the
base URI; the other components may be empty or undefined. A base URI; the other components may be empty or undefined. A
component is undefined if its preceding separator does not appear in component is undefined if its preceding separator does not appear in
the URI reference; the path component is never undefined, though it the URI reference; the path component is never undefined, though it
may be empty. The base URI's query component is not used by the may be empty. The base URI's query component is not used by the
resolution algorithm and may be discarded. resolution algorithm and may be discarded.
skipping to change at line 928 skipping to change at line 948
query components are undefined, then it is a reference to the query components are undefined, then it is a reference to the
current document and we are done. Otherwise, the reference URI's current document and we are done. Otherwise, the reference URI's
query and fragment components are defined as found (or not found) query and fragment components are defined as found (or not found)
within the URI reference and not inherited from the base URI. within the URI reference and not inherited from the base URI.
3) If the scheme component is defined, indicating that the reference 3) If the scheme component is defined, indicating that the reference
starts with a scheme name, then the reference is interpreted as an starts with a scheme name, then the reference is interpreted as an
absolute URI and we are done. Otherwise, the reference URI's absolute URI and we are done. Otherwise, the reference URI's
scheme is inherited from the base URI's scheme component. scheme is inherited from the base URI's scheme component.
Due to a loophole in prior specifications [RFC1630], some parsers
allow the scheme name to be present in a relative URI if it is the
same as the base URI scheme. Unfortunately, this can conflict
with the correct parsing of non-hierarchical URI. For backwards
compatibility, an implementation may work around such references
by removing the scheme if it matches that of the base URI and the
scheme is known to always use the <hier_part> syntax. The parser
can then continue with the steps below for the remainder of the
reference components. Validating parsers should mark such a
misformed relative reference as an error.
4) If the authority component is defined, then the reference is a 4) If the authority component is defined, then the reference is a
network-path and we skip to step 7. Otherwise, the reference network-path and we skip to step 7. Otherwise, the reference
URI's authority is inherited from the base URI's authority URI's authority is inherited from the base URI's authority
component, which will also be undefined if the URI scheme does not component, which will also be undefined if the URI scheme does not
use an authority component. use an authority component.
5) If the path component begins with a slash character ("/"), then 5) If the path component begins with a slash character ("/"), then
the reference is an absolute-path and we skip to step 7. the reference is an absolute-path and we skip to step 7.
6) If this step is reached, then we are resolving a relative-path 6) If this step is reached, then we are resolving a relative-path
skipping to change at line 1025 skipping to change at line 1056
reference's query component from its path component before merging reference's query component from its path component before merging
the base and reference paths in step 6 above. This may result in the base and reference paths in step 6 above. This may result in
a loss of information if the query component contains the strings a loss of information if the query component contains the strings
"/../" or "/./". "/../" or "/./".
Resolution examples are provided in Appendix C. Resolution examples are provided in Appendix C.
6. URI Normalization and Equivalence 6. URI Normalization and Equivalence
In many cases, different URI strings may actually identify the In many cases, different URI strings may actually identify the
identical resource. For example, the host names used in URLs are identical resource. For example, the host names used in URL are
actually case insensitive, and the URL <http://www.XEROX.com> is actually case insensitive, and the URL <http://www.XEROX.com> is
equivalent to <http://www.xerox.com>. In general, the rules for equivalent to <http://www.xerox.com>. In general, the rules for
equivalence and definition of a normal form, if any, are scheme equivalence and definition of a normal form, if any, are scheme
dependent. When a scheme uses elements of the common syntax, it dependent. When a scheme uses elements of the common syntax, it
will also use the common syntax equivalence rules, namely that the will also use the common syntax equivalence rules, namely that the
scheme and hostname are case insensitive and a URL with an explicit scheme and hostname are case insensitive and a URL with an explicit
":port", where the port is the default for the scheme, is equivalent ":port", where the port is the default for the scheme, is equivalent
to one where the port is elided. to one where the port is elided.
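The common syntax equivalence rules above can be sketched in code. The following is a minimal illustration only, not part of the specification: the function name and the table of default ports are assumptions made for this example, and each scheme's own specification remains the authority for its default port and for any further equivalence rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports, for illustration only; the real defaults are
# defined by each scheme's own specification.
DEFAULT_PORTS = {"http": "80", "ftp": "21", "gopher": "70"}


def normalize(url):
    """Lowercase the scheme and hostname and drop an explicit default port.

    Path, query, and fragment are left untouched, since their
    equivalence rules are scheme dependent.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Split "userinfo@host:port" by hand so that userinfo keeps its case.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    host = host.lower()
    if colon and port == DEFAULT_PORTS.get(scheme):
        colon, port = "", ""
    netloc = userinfo + at + host + colon + port
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
```

Under these assumptions, two URLs can be compared for equivalence under the common rules by comparing their normalized forms.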
7. Security Considerations 7. Security Considerations
skipping to change at line 1054 skipping to change at line 1085
resource in question. A specific URI scheme may include additional resource in question. A specific URI scheme may include additional
semantics, such as name persistence, if those semantics are required semantics, such as name persistence, if those semantics are required
of all naming authorities for that scheme. of all naming authorities for that scheme.
It is sometimes possible to construct a URL such that an attempt to It is sometimes possible to construct a URL such that an attempt to
perform a seemingly harmless, idempotent operation, such as the perform a seemingly harmless, idempotent operation, such as the
retrieval of an entity associated with the resource, will in fact retrieval of an entity associated with the resource, will in fact
cause a possibly damaging remote operation to occur. The unsafe URL cause a possibly damaging remote operation to occur. The unsafe URL
is typically constructed by specifying a port number other than that is typically constructed by specifying a port number other than that
reserved for the network protocol in question. The client reserved for the network protocol in question. The client
unwittingly contacts a site which is in fact running a different unwittingly contacts a site that is in fact running a different
protocol. The content of the URL contains instructions which, when protocol. The content of the URL contains instructions that, when
interpreted according to this other protocol, cause an unexpected interpreted according to this other protocol, cause an unexpected
operation. An example has been the use of gopher URLs to cause an operation. An example has been the use of a gopher URL to cause an
unintended or impersonating message to be sent via an SMTP server. unintended or impersonating message to be sent via an SMTP server.
Caution should be used when using any URL which specifies a port Caution should be used when using any URL that specifies a port
number other than the default for the protocol, especially when it number other than the default for the protocol, especially when it
is a number within the reserved space. is a number within the reserved space.
Care should be taken when URLs contain escaped delimiters for a Care should be taken when a URL contains escaped delimiters for a
given protocol (for example, CR and LF characters for telnet given protocol (for example, CR and LF characters for telnet
protocols) that these are not unescaped before transmission. This protocols) that these are not unescaped before transmission. This
might violate the protocol, but avoids the potential for such might violate the protocol, but avoids the potential for such
characters to be used to simulate an extra operation or parameter characters to be used to simulate an extra operation or parameter
in that protocol, which might lead to an unexpected and possibly in that protocol, which might lead to an unexpected and possibly
harmful remote operation to be performed. harmful remote operation to be performed.
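One way an application might act on this advice is to check for escaped line delimiters before deciding whether to unescape a URL. The sketch below is illustrative only; the function name is hypothetical, and it checks only for escaped CR ("%0D") and LF ("%0A"), which are the delimiters named in the example above. Other protocols may have other delimiters.

```python
import re

# Matches the escaped forms of CR (%0D) and LF (%0A) in either hex case.
ESCAPED_CR_LF = re.compile(r"%0[dDaA]")


def has_escaped_line_break(url):
    """Return True if the URL contains an escaped CR or LF character."""
    return ESCAPED_CR_LF.search(url) is not None
```

A caller could refuse to unescape such a URL, or pass the escaped form through unchanged, rather than risk injecting an extra protocol line.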
It is clearly unwise to use a URL that contains a password which is It is clearly unwise to use a URL that contains a password which is
intended to be secret. In particular, the use of a password within intended to be secret. In particular, the use of a password within
the 'userinfo' component of a URL is strongly disrecommended except the 'userinfo' component of a URL is strongly disrecommended except
skipping to change at line 1155 skipping to change at line 1186
Cambridge, MA 02139 Cambridge, MA 02139
Fax: +1(617)258-8682 Fax: +1(617)258-8682
EMail: timbl@w3.org EMail: timbl@w3.org
Roy T. Fielding Roy T. Fielding
Department of Information and Computer Science Department of Information and Computer Science
University of California, Irvine University of California, Irvine
Irvine, CA 92697-3425 Irvine, CA 92697-3425
Fax: +1(714)824-1715 Fax: +1(949)824-1715
EMail: fielding@ics.uci.edu EMail: fielding@ics.uci.edu
Larry Masinter Larry Masinter
Xerox PARC Xerox PARC
3333 Coyote Hill Road 3333 Coyote Hill Road
Palo Alto, CA 94034 Palo Alto, CA 94034
Fax: +1(415)812-4333 Fax: +1(415)812-4333
EMail: masinter@parc.xerox.com EMail: masinter@parc.xerox.com
Appendices Appendices
A. Collected BNF for URIs A. Collected BNF for URI
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
absoluteURI = generic-URI | opaque-URI absoluteURI = scheme ":" ( hier_part | opaque_part )
opaque-URI = scheme ":" *uric relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
generic-URI = scheme ":" relativeURI
hier_part = ( net_path | abs_path ) [ "?" query ]
opaque_part = uric_no_slash *uric
uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
"&" | "=" | "+" | "$" | ","
relativeURI = net_path | abs_path | rel_path
net_path = "//" authority [ abs_path ] net_path = "//" authority [ abs_path ]
abs_path = "/" rel_path abs_path = "/" path_segments
rel_path = [ path_segments ] [ "?" query ] rel_path = rel_segment [ abs_path ]
rel_segment = 1*( unreserved | escaped |
";" | "@" | "&" | "=" | "+" | "$" | "," )
scheme = alpha *( alpha | digit | "+" | "-" | "." ) scheme = alpha *( alpha | digit | "+" | "-" | "." )
authority = server | reg_name authority = server | reg_name
reg_name = 1*( unreserved | escaped | "$" | "," | reg_name = 1*( unreserved | escaped | "$" | "," |
";" | ":" | "@" | "&" | "=" | "+" ) ";" | ":" | "@" | "&" | "=" | "+" )
server = [ [ userinfo "@" ] hostport ] server = [ [ userinfo "@" ] hostport ]
userinfo = *( unreserved | escaped | userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," ) ";" | ":" | "&" | "=" | "+" | "$" | "," )
hostport = host [ ":" port ] hostport = host [ ":" port ]
host = hostname | IPv4address host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ] hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum toplabel = alpha | alpha *( alphanum | "-" ) alphanum
IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
port = *digit port = *digit
path = [ "/" ] path_segments path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment ) path_segments = segment *( "/" segment )
segment = *pchar *( ";" param ) segment = *pchar *( ";" param )
param = *pchar param = *pchar
pchar = unreserved | escaped | pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | "," ":" | "@" | "&" | "=" | "+" | "$" | ","
query = *uric query = *uric
fragment = *uric fragment = *uric
skipping to change at line 1235 skipping to change at line 1273
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9" "8" | "9"
B. Parsing a URI Reference with a Regular Expression B. Parsing a URI Reference with a Regular Expression
As described in Section 4.3, the generic-URI syntax is not sufficient As described in Section 4.3, the generic URI syntax is not sufficient
to disambiguate the components of some forms of URI. Since the to disambiguate the components of some forms of URI. Since the
"greedy algorithm" described in that section is identical to the "greedy algorithm" described in that section is identical to the
disambiguation method used by POSIX regular expressions, it is disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the natural and commonplace to use a regular expression for parsing the
potential four components and fragment identifier of a URI reference. potential four components and fragment identifier of a URI reference.
The following line is the regular expression for breaking down a URI The following line is the regular expression for breaking down a URI
reference into its components. reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
skipping to change at line 1286 skipping to change at line 1324
and, going in the opposite direction, we can recreate a URI reference and, going in the opposite direction, we can recreate a URI reference
from its components using the algorithm in step 7 of Section 5.2. from its components using the algorithm in step 7 of Section 5.2.
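As an illustration, the regular expression can be applied as follows. This is a sketch under stated assumptions, not a normative parser: the function names are invented for this example, and Python's "re" module uses a backtracking matcher rather than POSIX leftmost-longest matching, although for this particular expression the captured groups are expected to be the same. Counting the parentheses of the expression, groups 2, 4, 5, 7, and 9 hold the scheme, authority, path, query, and fragment. A group that did not participate in the match is reported as None, which corresponds to an undefined component.

```python
import re

URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref):
    """Return (scheme, authority, path, query, fragment).

    Undefined components are None; the path is never None but may be
    the empty string.
    """
    match = URI_RE.match(ref)  # every part is optional, so this always matches
    return match.group(2, 4, 5, 7, 9)


def recompose(scheme, authority, path, query, fragment):
    """Rebuild a URI reference, emitting a separator only for defined parts."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

Keeping None distinct from the empty string lets a reference such as "g?" (empty query) survive a round trip unchanged.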
C. Examples of Resolving Relative URI References C. Examples of Resolving Relative URI References
Within an object with a well-defined base URI of Within an object with a well-defined base URI of
http://a/b/c/d;p?q http://a/b/c/d;p?q
the relative URIs would be resolved as follows: the relative URI would be resolved as follows:
C.1. Normal Examples C.1. Normal Examples
g:h = g:h g:h = g:h
g = http://a/b/c/g g = http://a/b/c/g
./g = http://a/b/c/g ./g = http://a/b/c/g
g/ = http://a/b/c/g/ g/ = http://a/b/c/g/
/g = http://a/g /g = http://a/g
//g = http://g //g = http://g
?y = http://a/b/c/?y ?y = http://a/b/c/?y
skipping to change at line 1358 skipping to change at line 1396
nonsensical forms of the "." and ".." complete path segments. nonsensical forms of the "." and ".." complete path segments.
./../g = http://a/b/g ./../g = http://a/b/g
./g/. = http://a/b/c/g/ ./g/. = http://a/b/c/g/
g/./h = http://a/b/c/g/h g/./h = http://a/b/c/g/h
g/../h = http://a/b/c/h g/../h = http://a/b/c/h
g;x=1/./y = http://a/b/c/g;x=1/y g;x=1/./y = http://a/b/c/g;x=1/y
g;x=1/../y = http://a/b/c/y g;x=1/../y = http://a/b/c/y
All client applications remove the query component from the base URI All client applications remove the query component from the base URI
before resolving relative URIs. However, some applications fail to before resolving relative URI. However, some applications fail to
separate the reference's query and/or fragment components from a separate the reference's query and/or fragment components from a
relative path before merging it with the base path. This error is relative path before merging it with the base path. This error is
rarely noticed, since typical usage of a fragment never includes the rarely noticed, since typical usage of a fragment never includes the
hierarchy ("/") character, and the query component is not normally hierarchy ("/") character, and the query component is not normally
used within relative references. used within relative references.
g?y/./x = http://a/b/c/g?y/./x g?y/./x = http://a/b/c/g?y/./x
g?y/../x = http://a/b/c/g?y/../x g?y/../x = http://a/b/c/g?y/../x
g#s/./x = http://a/b/c/g#s/./x g#s/./x = http://a/b/c/g#s/./x
g#s/../x = http://a/b/c/g#s/../x g#s/../x = http://a/b/c/g#s/../x
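The abnormal examples above can be checked mechanically against an existing resolver. The sketch below uses urljoin from Python's standard urllib.parse module as that resolver; this is an assumption made for illustration, since that function follows the later RFC 3986 algorithm rather than this draft. The two agree on the references listed here, but they are known to differ elsewhere (for example, for the reference "?y", RFC 3986 retains the last segment of the base path). The helper name is hypothetical.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# Reference -> expected absolute form, copied from the examples above.
EXPECTED = {
    "./../g": "http://a/b/g",
    "./g/.": "http://a/b/c/g/",
    "g/./h": "http://a/b/c/g/h",
    "g/../h": "http://a/b/c/h",
    "g;x=1/./y": "http://a/b/c/g;x=1/y",
    "g;x=1/../y": "http://a/b/c/y",
    "g?y/./x": "http://a/b/c/g?y/./x",
    "g?y/../x": "http://a/b/c/g?y/../x",
    "g#s/./x": "http://a/b/c/g#s/./x",
    "g#s/../x": "http://a/b/c/g#s/../x",
}


def mismatches(base=BASE, expected=EXPECTED):
    """Return {reference: actual} for every example the resolver gets wrong."""
    wrong = {}
    for ref, want in expected.items():
        got = urljoin(base, ref)
        if got != want:
            wrong[ref] = got
    return wrong
```

An empty result indicates that the resolver separates the query and fragment from the relative path before merging, as described above.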
Some parsers allow the scheme name to be present in a relative URI Some parsers allow the scheme name to be present in a relative URI
if it is the same as the base URI scheme. This is considered to be if it is the same as the base URI scheme. This is considered to be
a loophole in prior specifications of partial URIs [RFC1630]. Its a loophole in prior specifications of partial URI [RFC1630]. Its
use should be avoided. use should be avoided.
http:g = http:g http:g = http:g ; for validating parsers
http: = http: | http://a/b/c/g ; for backwards compatibility
D. Embedding the Base URI in HTML documents D. Embedding the Base URI in HTML documents
It is useful to consider an example of how the base URI of a It is useful to consider an example of how the base URI of a
document can be embedded within the document's content. In this document can be embedded within the document's content. In this
appendix, we describe how documents written in the Hypertext Markup appendix, we describe how documents written in the Hypertext Markup
Language (HTML) [RFC1866] can include an embedded base URI. This Language (HTML) [RFC1866] can include an embedded base URI. This
appendix does not form a part of the URI specification and should not appendix does not form a part of the URI specification and should not
be considered as anything more than a descriptive example. be considered as anything more than a descriptive example.
HTML defines a special element "BASE" which, when present in the HTML defines a special element "BASE" which, when present in the
"HEAD" portion of a document, signals that the parser should use "HEAD" portion of a document, signals that the parser should use
the BASE element's "HREF" attribute as the base URI for resolving the BASE element's "HREF" attribute as the base URI for resolving
any relative URIs. The "HREF" attribute must be an absolute URI. any relative URI. The "HREF" attribute must be an absolute URI.
Note that, in HTML, element and attribute names are Note that, in HTML, element and attribute names are
case-insensitive. For example: case-insensitive. For example:
<!doctype html public "-//IETF//DTD HTML//EN"> <!doctype html public "-//IETF//DTD HTML//EN">
<HTML><HEAD> <HTML><HEAD>
<TITLE>An example HTML document</TITLE> <TITLE>An example HTML document</TITLE>
<BASE href="http://www.ics.uci.edu/Test/a/b/c"> <BASE href="http://www.ics.uci.edu/Test/a/b/c">
</HEAD><BODY> </HEAD><BODY>
... <A href="../x">a hypertext anchor</A> ... ... <A href="../x">a hypertext anchor</A> ...
</BODY></HTML> </BODY></HTML>
A parser reading the example document should interpret the given A parser reading the example document should interpret the given
relative URI "../x" as representing the absolute URI relative URI "../x" as representing the absolute URI
<http://www.ics.uci.edu/Test/a/x> <http://www.ics.uci.edu/Test/a/x>
regardless of the context in which the example document was regardless of the context in which the example document was
obtained. obtained.
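The behavior described in this appendix can be sketched with Python's standard html.parser and urllib.parse modules. The class name and the retrieval URI used below are assumptions for illustration; the sketch handles only the BASE and A elements and is not a complete HTML processor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkCollector(HTMLParser):
    """Collect absolute forms of A href values, honoring a BASE element."""

    def __init__(self, retrieval_uri):
        super().__init__()
        self.base = retrieval_uri  # used only if no BASE element is seen
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names, which matches
        # the case-insensitivity noted above.
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base":
            self.base = href
        elif tag == "a":
            self.links.append(urljoin(self.base, href))


DOCUMENT = """<!doctype html public "-//IETF//DTD HTML//EN">
<HTML><HEAD>
<TITLE>An example HTML document</TITLE>
<BASE href="http://www.ics.uci.edu/Test/a/b/c">
</HEAD><BODY>
... <A href="../x">a hypertext anchor</A> ...
</BODY></HTML>
"""

collector = BaseAwareLinkCollector("http://elsewhere.example/dir/page.html")
collector.feed(DOCUMENT)
```

After feeding the example document, the collected link is resolved against the embedded base URI rather than the (hypothetical) retrieval URI.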
E. Recommendations for Delimiting URIs in Context E. Recommendations for Delimiting URI in Context
URIs are often transmitted through formats which do not provide a URI are often transmitted through formats that do not provide a
clear context for their interpretation. For example, there are clear context for their interpretation. For example, there are
many occasions when URIs are included in plain text; examples many occasions when URI are included in plain text; examples
include text sent in electronic mail, USENET news messages, and, include text sent in electronic mail, USENET news messages, and,
most importantly, printed on paper. In such cases, it is important most importantly, printed on paper. In such cases, it is important
to be able to delimit the URI from the rest of the text, and in to be able to delimit the URI from the rest of the text, and in
particular from punctuation marks that might be mistaken for part particular from punctuation marks that might be mistaken for part
of the URI. of the URI.
In practice, URIs are delimited in a variety of ways, but usually In practice, URI are delimited in a variety of ways, but usually
within double-quotes "http://test.com/", angle brackets within double-quotes "http://test.com/", angle brackets
<http://test.com/>, or just using whitespace <http://test.com/>, or just using whitespace
http://test.com/ http://test.com/
These wrappers do not form part of the URI. These wrappers do not form part of the URI.
In the case where a fragment identifier is associated with a URI In the case where a fragment identifier is associated with a URI
reference, the fragment would be placed within the brackets as well reference, the fragment would be placed within the brackets as well
(separated from the URI with a "#" character). (separated from the URI with a "#" character).
In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
may need to be added to break long URIs across lines. The may need to be added to break long URI across lines. The
whitespace should be ignored when extracting the URI. whitespace should be ignored when extracting the URI.
No whitespace should be introduced after a hyphen ("-") character. No whitespace should be introduced after a hyphen ("-") character.
Because some typesetters and printers may (erroneously) introduce a Because some typesetters and printers may (erroneously) introduce a
hyphen at the end of line when breaking a line, the interpreter of a hyphen at the end of line when breaking a line, the interpreter of a
URI containing a line break immediately after a hyphen should ignore URI containing a line break immediately after a hyphen should ignore
all unescaped whitespace around the line break, and should be aware all unescaped whitespace around the line break, and should be aware
that the hyphen may or may not actually be part of the URI. that the hyphen may or may not actually be part of the URI.
Using <> angle brackets around each URI is especially recommended Using <> angle brackets around each URI is especially recommended
as a delimiting style for URIs that contain whitespace. as a delimiting style for URI that contain whitespace.
The prefix "URL:" (with or without a trailing space) was The prefix "URL:" (with or without a trailing space) was
recommended as a way to help distinguish a URL from other recommended as a way to help distinguish a URL from other
bracketed designators, although this is not common in practice. bracketed designators, although this is not common in practice.
For robustness, software that accepts user-typed URIs should For robustness, software that accepts user-typed URI should
attempt to recognize and strip both delimiters and embedded attempt to recognize and strip both delimiters and embedded
whitespace. whitespace.
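A minimal sketch of such stripping is shown below. The function name is hypothetical, and the sketch makes simplifying assumptions: it expects a single delimited URI as input (it does not search running text), it recognizes only double-quote and angle-bracket wrappers and the "URL:" prefix, and it does not attempt to guess whether a hyphen before a line break belongs to the URI.

```python
import re


def extract_uri(text):
    """Strip wrappers, an optional "URL:" prefix, and embedded whitespace."""
    s = text.strip()
    # Remove one layer of matching delimiters, if present.
    if len(s) >= 2 and (
        (s[0] == "<" and s[-1] == ">") or (s[0] == '"' and s[-1] == '"')
    ):
        s = s[1:-1].strip()
    # Remove the "URL:" prefix (with or without a trailing space).
    if s[:4].upper() == "URL:":
        s = s[4:]
    # Whitespace added to break a long URI across lines is ignored.
    return re.sub(r"\s+", "", s)
```

For instance, a bracketed reference broken across two lines of plain text is returned as a single unbroken string.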
For example, the text: For example, the text:
Yes, Jim, I found it under "http://www.w3.org/Addressing/", Yes, Jim, I found it under "http://www.w3.org/Addressing/",
but you can probably pick it up from <ftp://ds.internic. but you can probably pick it up from <ftp://ds.internic.
net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
ietf/uri/historical.html#WARNING>. ietf/uri/historical.html#WARNING>.
skipping to change at line 1505 skipping to change at line 1543
G. Summary of Non-editorial Changes G. Summary of Non-editorial Changes
G.1. Additions G.1. Additions
Section 4 (URI References) was added to stem the confusion Section 4 (URI References) was added to stem the confusion
regarding "what is a URI" and how to describe fragment identifiers regarding "what is a URI" and how to describe fragment identifiers
given that they are not part of the URI, but are part of the URI given that they are not part of the URI, but are part of the URI
syntax and parsing concerns. In addition, it provides a reference syntax and parsing concerns. In addition, it provides a reference
definition for use by other IETF specifications (HTML, HTTP, etc.) definition for use by other IETF specifications (HTML, HTTP, etc.)
which have previously attempted to redefine the URI syntax in order that have previously attempted to redefine the URI syntax in order
to account for the presence of fragment identifiers in URI to account for the presence of fragment identifiers in URI
references. references.
Section 2.4 was rewritten to clarify a number of misinterpretations Section 2.4 was rewritten to clarify a number of misinterpretations
and to leave room for fully internationalized URIs. and to leave room for fully internationalized URI.
Appendix F on abbreviated URLs was added to describe the shortened Appendix F on abbreviated URLs was added to describe the shortened
references often seen on television and magazine advertisements and references often seen on television and magazine advertisements and
explain why they are not used in other contexts. explain why they are not used in other contexts.
G.2. Modifications from both RFC 1738 and RFC 1808 G.2. Modifications from both RFC 1738 and RFC 1808
Changed to URI syntax instead of just URL. Changed to URI syntax instead of just URL.
Confusion regarding the terms "character encoding", the URI Confusion regarding the terms "character encoding", the URI
skipping to change at line 1533 skipping to change at line 1571
names regarding the character sets have been changed to more names regarding the character sets have been changed to more
accurately describe their purpose and to encompass all "characters" accurately describe their purpose and to encompass all "characters"
rather than just US-ASCII octets. Unless otherwise noted here, rather than just US-ASCII octets. Unless otherwise noted here,
these modifications do not affect the URI syntax. these modifications do not affect the URI syntax.
Both RFC 1738 and RFC 1808 refer to the "reserved" set of Both RFC 1738 and RFC 1808 refer to the "reserved" set of
characters as if URI-interpreting software were limited to a single characters as if URI-interpreting software were limited to a single
set of characters with a reserved purpose (i.e., as meaning set of characters with a reserved purpose (i.e., as meaning
something other than the data to which the characters correspond), something other than the data to which the characters correspond),
and that this set was fixed by the URI scheme. However, this has and that this set was fixed by the URI scheme. However, this has
not been true in practice; any character which is interpreted not been true in practice; any character that is interpreted
differently when it is escaped is, in effect, reserved. differently when it is escaped is, in effect, reserved.
Furthermore, the interpreting engine on an HTTP server is often Furthermore, the interpreting engine on an HTTP server is often
dependent on the resource, not just the URI scheme. The dependent on the resource, not just the URI scheme. The
description of reserved characters has been changed accordingly. description of reserved characters has been changed accordingly.
The plus "+", dollar "$", and comma "," characters have been added to The plus "+", dollar "$", and comma "," characters have been added to
those in the "reserved" set, since they are treated as reserved those in the "reserved" set, since they are treated as reserved
within the query component. within the query component.
The tilde "~" character was added to those in the "unreserved" set, The tilde "~" character was added to those in the "unreserved" set,
since it is extensively used on the Internet in spite of the since it is extensively used on the Internet in spite of the
difficulty to transcribe it with some keyboards. difficulty to transcribe it with some keyboards.
The syntax for URI scheme has been changed to require that all
schemes begin with an alpha character.
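The scheme rule in the collected BNF (an alpha followed by any number of alpha, digit, "+", "-", or "." characters) can be expressed as a simple pattern. The sketch below is illustrative only, and the function name is an assumption.

```python
import re

# alpha *( alpha | digit | "+" | "-" | "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name):
    """Return True if name matches the scheme production."""
    return SCHEME_RE.fullmatch(name) is not None
```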
The "user:password" form in the previous BNF was changed to The "user:password" form in the previous BNF was changed to
a "userinfo" token, and the possibility that it might be a "userinfo" token, and the possibility that it might be
"user:password" made scheme specific. In particular, the use "user:password" made scheme specific. In particular, the use
of passwords in the clear is not even suggested by the syntax. of passwords in the clear is not even suggested by the syntax.
The question-mark "?" character was removed from the set of allowed The question-mark "?" character was removed from the set of allowed
characters for the userinfo in the authority component, since characters for the userinfo in the authority component, since
testing showed that many applications treat it as reserved for testing showed that many applications treat it as reserved for
separating the query component from the rest of the URI. separating the query component from the rest of the URI.
skipping to change at line 1568 skipping to change at line 1609
reserved within the authority component, since several new schemes reserved within the authority component, since several new schemes
are using it as a separator within userinfo to indicate the type are using it as a separator within userinfo to indicate the type
of user authentication. of user authentication.
RFC 1738 specified that the path was separated from the authority RFC 1738 specified that the path was separated from the authority
portion of a URI by a slash. RFC 1808 followed suit, but with a portion of a URI by a slash. RFC 1808 followed suit, but with a
fudge of carrying around the separator as a "prefix" in order to fudge of carrying around the separator as a "prefix" in order to
describe the parsing algorithm. RFC 1630 never had this problem, describe the parsing algorithm. RFC 1630 never had this problem,
since it considered the slash to be part of the path. In writing since it considered the slash to be part of the path. In writing
this specification, it was found to be impossible to accurately this specification, it was found to be impossible to accurately
describe and retain the difference between the two URIs describe and retain the difference between the two URI
<foo:/bar> and <foo:bar> <foo:/bar> and <foo:bar>
without either considering the slash to be part of the path (as without either considering the slash to be part of the path (as
corresponds to actual practice) or creating a separate component just corresponds to actual practice) or creating a separate component just
to hold that slash. We chose the former. to hold that slash. We chose the former.
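As an illustration only (it is not part of either draft revision), the distinction can be observed with Python's standard urllib.parse.urlsplit, which likewise treats the slash as part of the path. The variable names are arbitrary.

```python
from urllib.parse import urlsplit

# The leading slash is kept inside the path component, so the two
# references remain distinguishable after parsing.
with_slash = urlsplit("foo:/bar")
without_slash = urlsplit("foo:bar")

print(with_slash.scheme, with_slash.path)
print(without_slash.scheme, without_slash.path)
```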
G.3. Modifications from RFC 1738 G.3. Modifications from RFC 1738
The definition of specific URL schemes and their scheme-specific The definition of specific URL schemes and their scheme-specific
syntax and semantics has been moved to separate documents. syntax and semantics has been moved to separate documents.
The URL host was defined as a fully-qualified domain name. However, The URL host was defined as a fully-qualified domain name. However,
many URLs are used without fully-qualified domain names (in contexts many URLs are used without fully-qualified domain names (in contexts
for which the full qualification is not necessary), without any host for which the full qualification is not necessary), without any host
(as in some file URLs), or with a host of "localhost". (as in some file URLs), or with a host of "localhost".
The URL port is now *digit instead of 1*digit, since systems are The URL port is now *digit instead of 1*digit, since systems are
expected to handle the case where the ":" separator between host and expected to handle the case where the ":" separator between host and
port is supplied without a port. port is supplied without a port.
The recommendations for delimiting URIs in context (Appendix E) have The recommendations for delimiting URI in context (Appendix E) have
been adjusted to reflect current practice. been adjusted to reflect current practice.
G.4. Modifications from RFC 1808 G.4. Modifications from RFC 1808
RFC 1808 (Section 4) defined an empty URL reference (a reference RFC 1808 (Section 4) defined an empty URL reference (a reference
containing nothing aside from the fragment identifier) as being a containing nothing aside from the fragment identifier) as being a
reference to the base URL. Unfortunately, that definition could be reference to the base URL. Unfortunately, that definition could be
interpreted, upon selection of such a reference, as a new retrieval interpreted, upon selection of such a reference, as a new retrieval
action on that resource. Since the normal intent of such references action on that resource. Since the normal intent of such references
is for the user agent to change its view of the current document to is for the user agent to change its view of the current document to
the beginning of the specified fragment within that document, not to the beginning of the specified fragment within that document, not to
make an additional request of the resource, a description of how to make an additional request of the resource, a description of how to
correctly interpret an empty reference has been added in Section 4. correctly interpret an empty reference has been added in Section 4.
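The intended behavior can be sketched as follows. This is a simplified illustration, not the algorithm text of Section 4: the function name interpret_reference and the returned labels are hypothetical, and a real user agent would go on to resolve the non-empty case against the base URI.

```python
def interpret_reference(reference):
    """Classify a URI reference as same-document or as needing retrieval.

    A reference whose URI part (everything before "#") is empty refers to
    the current document, so no new request should be made for it.
    """
    uri_part, has_fragment, fragment = reference.partition("#")
    fragment = fragment if has_fragment else None
    if uri_part == "":
        # Empty reference: only change the view within the current document.
        return ("same-document", fragment)
    # Anything else must be resolved and possibly retrieved.
    return ("retrieve", uri_part, fragment)
```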
The description of the mythical Base header field has been replaced The description of the mythical Base header field has been replaced
with a reference to the Content-Location header field defined by with a reference to the Content-Location header field defined by
MHTML [RFC2110]. MHTML [RFC2110].
RFC 1808 described various schemes as either having or not having the RFC 1808 described various schemes as either having or not having the
properties of the generic-URI syntax. However, the only requirement properties of the generic URI syntax. However, the only requirement
is that the particular document containing the relative references is that the particular document containing the relative references
have a base URI which abides by the generic-URI syntax, regardless of have a base URI that abides by the generic URI syntax, regardless of
the URI scheme, so the associated description has been updated to the URI scheme, so the associated description has been updated to
reflect that. reflect that.
The BNF term <net_loc> has been replaced with <authority>, since the The BNF term <net_loc> has been replaced with <authority>, since the
latter more accurately describes its use and purpose. Likewise, the latter more accurately describes its use and purpose. Likewise, the
authority is no longer restricted to the IP server syntax. authority is no longer restricted to the IP server syntax.
Extensive testing of current client applications demonstrated that Extensive testing of current client applications demonstrated that
the majority of deployed systems do not use the ";" character to the majority of deployed systems do not use the ";" character to
indicate trailing parameter information, and that the presence of a indicate trailing parameter information, and that the presence of a
semicolon in a path segment does not affect the relative parsing of semicolon in a path segment does not affect the relative parsing of
that segment. Therefore, parameters have been removed as a separate that segment. Therefore, parameters have been removed as a separate
component and may now appear in any path segment. Their influence component and may now appear in any path segment. Their influence
has been removed from the algorithm for resolving a relative URI has been removed from the algorithm for resolving a relative URI
reference. The resolution examples in Appendix C have been modified reference. The resolution examples in Appendix C have been modified
to reflect this change. to reflect this change.
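A small sketch of per-segment parameters, assuming a path whose segments are separated by "/" and whose parameters within a segment are separated by ";". The helper name split_segments is hypothetical and no unescaping is attempted.

```python
def split_segments(path):
    """Split a path into (segment, [params]) pairs.

    Parameters may follow any segment, not only the last one, and they do
    not take part in splitting the path on "/".
    """
    result = []
    for raw_segment in path.split("/"):
        name, *params = raw_segment.split(";")
        result.append((name, params))
    return result


print(split_segments("/a;x=1/b;y;z/c"))
```

Because the semicolons stay inside their segments, relative resolution can operate on the "/"-separated segments without looking at the parameters at all.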
Implementations are now allowed to work around misformed relative
references that are prefixed by the same scheme as the base URI,
but only for schemes known to use the <hier_part> syntax.
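One possible shape for such a workaround is sketched below. The function name strip_matching_scheme and the default set of hierarchical schemes are assumptions chosen for the example; the drafts do not enumerate which schemes qualify.

```python
import re

# scheme = alpha *( alpha | digit | "+" | "-" | "." ), followed by ":"
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")


def strip_matching_scheme(reference, base_scheme,
                          hier_schemes=frozenset({"http", "ftp"})):
    """Drop a redundant scheme prefix so the reference parses as relative.

    The prefix is removed only when it equals the base URI's scheme AND
    that scheme is in hier_schemes (an assumed, caller-supplied list of
    schemes known to use the <hier_part> syntax). Otherwise the reference
    is returned unchanged and should be treated as written.
    """
    match = _SCHEME.match(reference)
    if match is None:
        return reference
    scheme = match.group(1).lower()
    if scheme == base_scheme.lower() and scheme in hier_schemes:
        return reference[match.end():]
    return reference
```

With these defaults, "http:g" against an "http" base becomes the relative reference "g", while "mailto:g" against a "mailto" base is left alone because mailto is not in the assumed hierarchical set.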
H. Full Copyright Statement H. Full Copyright Statement
Copyright (C) The Internet Society (1998). All Rights Reserved. Copyright (C) The Internet Society (1998). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
 End of changes. 104 change blocks. 
127 lines changed or deleted 172 lines changed or added
