| < draft-fielding-uri-syntax-02.txt | draft-fielding-uri-syntax-03.txt > | |||
|---|---|---|---|---|
| Network Working Group T. Berners-Lee, MIT/LCS | Network Working Group T. Berners-Lee, MIT/LCS | |||
| INTERNET-DRAFT R. Fielding, U.C. Irvine | INTERNET-DRAFT R. Fielding, U.C. Irvine | |||
| draft-fielding-uri-syntax-02 L. Masinter, Xerox Corporation | draft-fielding-uri-syntax-03 L. Masinter, Xerox Corporation | |||
| Expires six months after publication date March 4, 1998 | Expires six months after publication date June 4, 1998 | |||
| Uniform Resource Identifiers (URI): Generic Syntax | Uniform Resource Identifiers (URI): Generic Syntax | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft. Internet-Drafts are working | This document is an Internet-Draft. Internet-Drafts are working | |||
| documents of the Internet Engineering Task Force (IETF), its areas, | documents of the Internet Engineering Task Force (IETF), its areas, | |||
| and its working groups. Note that other groups may also distribute | and its working groups. Note that other groups may also distribute | |||
| working documents as Internet-Drafts. | working documents as Internet-Drafts. | |||
| Internet-Drafts are draft documents valid for a maximum of six | Internet-Drafts are draft documents valid for a maximum of six | |||
| months and may be updated, replaced, or obsoleted by other | months and may be updated, replaced, or obsoleted by other | |||
| documents at any time. It is inappropriate to use Internet-Drafts | documents at any time. It is inappropriate to use Internet-Drafts | |||
| as reference material or to cite them other than as ``work in | as reference material or to cite them other than as ``work in | |||
| progress.'' | progress.'' | |||
| To learn the current status of any Internet-Draft, please check the | To view the entire list of current Internet-Drafts, please check the | |||
| ``1id-abstracts.txt'' listing contained in the Internet-Drafts | "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow | |||
| Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net | Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern | |||
| (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East | Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific | |||
| Coast), or ftp.isi.edu (US West Coast). | Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). | |||
| Instructions to RFC Editor: This document will obsolete RFC 1738 and | Instructions to RFC Editor: This document will obsolete RFC 1738 and | |||
| RFC 1808. If the new version of the MHTML proposed standard is | RFC 1808. If the new version of the MHTML proposed standard is | |||
| ready for publication at the same time as this document, please | ready for publication at the same time as this document, please | |||
| change all references to RFC 2110 to refer to its new version. | change all references to RFC 2110 to refer to its new version. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (1998). All Rights Reserved. | Copyright (C) The Internet Society (1998). All Rights Reserved. | |||
| Abstract | Abstract | |||
| A Uniform Resource Identifier (URI) is a compact string of characters | A Uniform Resource Identifier (URI) is a compact string of characters | |||
| for identifying an abstract or physical resource. This document | for identifying an abstract or physical resource. This document | |||
| defines the general syntax of URIs, including both absolute and | defines the generic syntax of URI, including both absolute and | |||
| relative forms, and guidelines for their use; it revises and replaces | relative forms, and guidelines for their use; it revises and replaces | |||
| the generic definitions in RFC 1738 and RFC 1808. | the generic definitions in RFC 1738 and RFC 1808. | |||
| This document defines a grammar that is a superset of all valid URI, | ||||
| such that an implementation can parse the common components of a URI | ||||
| reference without knowing the scheme-specific requirements of every | ||||
| possible identifier type. This document does not define a generative | ||||
| grammar for URI; that task will be performed by the individual | ||||
| specifications of each URI scheme. | ||||
| 1. Introduction | 1. Introduction | |||
| Uniform Resource Identifiers (URIs) provide a simple and extensible | Uniform Resource Identifiers (URI) provide a simple and extensible | |||
| means for identifying a resource. This specification of URI syntax | means for identifying a resource. This specification of URI syntax | |||
| and semantics is derived from concepts introduced by the World Wide | and semantics is derived from concepts introduced by the World Wide | |||
| Web global information initiative, whose use of such objects dates | Web global information initiative, whose use of such objects dates | |||
| from 1990 and is described in "Universal Resource Identifiers in WWW" | from 1990 and is described in "Universal Resource Identifiers in WWW" | |||
| [RFC1630]. The specification of URIs is designed to meet the | [RFC1630]. The specification of URI is designed to meet the | |||
| recommendations laid out in "Functional Recommendations for Internet | recommendations laid out in "Functional Recommendations for Internet | |||
| Resource Locators" [RFC1736] and "Functional Requirements for Uniform | Resource Locators" [RFC1736] and "Functional Requirements for Uniform | |||
| Resource Names" [RFC1737]. | Resource Names" [RFC1737]. | |||
| This document updates and merges "Uniform Resource Locators" | This document updates and merges "Uniform Resource Locators" | |||
| [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in | [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in | |||
| order to define a single, general syntax for all URIs. It excludes | order to define a single, generic syntax for all URI. It excludes | |||
| those portions of RFC 1738 that defined the specific syntax of | those portions of RFC 1738 that defined the specific syntax of | |||
| individual URL schemes; those portions will be updated as separate | individual URL schemes; those portions will be updated as separate | |||
| documents, as will the process for registration of new URI schemes. | documents, as will the process for registration of new URI schemes. | |||
| This document does not discuss the issues and recommendations for | This document does not discuss the issues and recommendations for | |||
| dealing with characters outside of the US-ASCII character set | dealing with characters outside of the US-ASCII character set | |||
| [ASCII]; those recommendations are discussed in a separate document. | [ASCII]; those recommendations are discussed in a separate document. | |||
| All significant changes from the prior RFCs are noted in Appendix G. | All significant changes from the prior RFCs are noted in Appendix G. | |||
| 1.1 Overview of URIs | 1.1 Overview of URI | |||
| URIs are characterized by the following definitions: | URI are characterized by the following definitions: | |||
| Uniform | Uniform | |||
| Uniformity provides several benefits: it allows different types | Uniformity provides several benefits: it allows different types | |||
| of resource identifiers to be used in the same context, even | of resource identifiers to be used in the same context, even | |||
| when the mechanisms used to access those resources may differ; | when the mechanisms used to access those resources may differ; | |||
| it allows uniform semantic interpretation of common syntactic | it allows uniform semantic interpretation of common syntactic | |||
| conventions across different types of resource identifiers; it | conventions across different types of resource identifiers; it | |||
| allows introduction of new types of resource identifiers | allows introduction of new types of resource identifiers | |||
| without interfering with the way that existing identifiers are | without interfering with the way that existing identifiers are | |||
| used; and, it allows the identifiers to be reused in many | used; and, it allows the identifiers to be reused in many | |||
| skipping to change at line 102 | skipping to change at line 109 | |||
| The resource is the conceptual mapping to an entity or set of | The resource is the conceptual mapping to an entity or set of | |||
| entities, not necessarily the entity which corresponds to that | entities, not necessarily the entity which corresponds to that | |||
| mapping at any particular instance in time. Thus, a resource | mapping at any particular instance in time. Thus, a resource | |||
| can remain constant even when its content---the entities to | can remain constant even when its content---the entities to | |||
| which it currently corresponds---changes over time, provided | which it currently corresponds---changes over time, provided | |||
| that the conceptual mapping is not changed in the process. | that the conceptual mapping is not changed in the process. | |||
| Identifier | Identifier | |||
| An identifier is an object that can act as a reference to | An identifier is an object that can act as a reference to | |||
| something that has identity. In the case of URIs, the object | something that has identity. In the case of URI, the object | |||
| is a sequence of characters with a restricted syntax. | is a sequence of characters with a restricted syntax. | |||
| Having identified a resource, a system may perform a variety of | Having identified a resource, a system may perform a variety of | |||
| operations on the resource, as might be characterized by such words | operations on the resource, as might be characterized by such words | |||
| as `access', `update', `replace', or `find attributes'. | as `access', `update', `replace', or `find attributes'. | |||
| 1.2. URI, URL, and URN | 1.2. URI, URL, and URN | |||
| A URI can be further classified as a locator, a name, or both. The | A URI can be further classified as a locator, a name, or both. The | |||
| term "Uniform Resource Locator" (URL) refers to the subset of URI | term "Uniform Resource Locator" (URL) refers to the subset of URI | |||
| that identify resources via a representation of their primary access | that identify resources via a representation of their primary access | |||
| mechanism (e.g., their network "location"), rather than identifying | mechanism (e.g., their network "location"), rather than identifying | |||
| the resource by name or by some other attribute(s) of that resource. | the resource by name or by some other attribute(s) of that resource. | |||
| The term "Uniform Resource Name" (URN) refers to the subset of URI | The term "Uniform Resource Name" (URN) refers to the subset of URI | |||
| that are required to remain globally unique and persistent even when | that are required to remain globally unique and persistent even when | |||
| the resource ceases to exist or becomes unavailable. | the resource ceases to exist or becomes unavailable. | |||
| The URI scheme (Section 3.1) defines the namespace of the URI, and | The URI scheme (Section 3.1) defines the namespace of the URI, and | |||
| thus may further restrict the syntax and semantics of identifiers | thus may further restrict the syntax and semantics of identifiers | |||
| using that scheme. This specification defines those elements of the | using that scheme. This specification defines those elements of the | |||
| URI syntax which are either required of all URI schemes or are common | URI syntax that are either required of all URI schemes or are common | |||
| to many URI schemes. It thus defines the syntax and semantics that | to many URI schemes. It thus defines the syntax and semantics that | |||
| are needed to implement a scheme-independent parsing mechanism for | are needed to implement a scheme-independent parsing mechanism for | |||
| URI references, such that the scheme-dependent handling of a URI can | URI references, such that the scheme-dependent handling of a URI can | |||
| be postponed until the scheme-dependent semantics are needed. We use | be postponed until the scheme-dependent semantics are needed. We use | |||
| the term URL below when describing syntax or semantics that only | the term URL below when describing syntax or semantics that only | |||
| apply to locators. | apply to locators. | |||
| Although many URL schemes are named after protocols, this does not | Although many URL schemes are named after protocols, this does not | |||
| imply that the only way to access the URL's resource is via the named | imply that the only way to access the URL's resource is via the named | |||
| protocol. Gateways, proxies, caches, and name resolution services | protocol. Gateways, proxies, caches, and name resolution services | |||
| might be used to access some resources, independent of the protocol | might be used to access some resources, independent of the protocol | |||
| of their origin, and the resolution of some URLs may require the use | of their origin, and the resolution of some URL may require the use | |||
| of more than one protocol (e.g., both DNS and HTTP are typically used | of more than one protocol (e.g., both DNS and HTTP are typically used | |||
| to access an "http" URL's resource when it can't be found in a local | to access an "http" URL's resource when it can't be found in a local | |||
| cache). | cache). | |||
| A URN differs from a URL in that its primary purpose is persistent | A URN differs from a URL in that its primary purpose is persistent | |||
| labeling of a resource with an identifier. That identifier is drawn | labeling of a resource with an identifier. That identifier is drawn | |||
| from one of a set of defined namespaces, each of which has its own | from one of a set of defined namespaces, each of which has its own | |||
| set name structure and assignment procedures. The "urn" scheme has | set name structure and assignment procedures. The "urn" scheme has | |||
| been reserved to establish the requirements for a standardized URN | been reserved to establish the requirements for a standardized URN | |||
| namespace, as defined in "URN Syntax" [RFC2141] and its related | namespace, as defined in "URN Syntax" [RFC2141] and its related | |||
| specifications. | specifications. | |||
| Most of the examples in this specification demonstrate URLs, since | Most of the examples in this specification demonstrate URL, since | |||
| they allow the most varied use of the syntax and often have a | they allow the most varied use of the syntax and often have a | |||
| hierarchical namespace. A parser of the URI syntax is capable of | hierarchical namespace. A parser of the URI syntax is capable of | |||
| parsing both URL and URN references as a generic URI; once the scheme | parsing both URL and URN references as a generic URI; once the scheme | |||
| is determined, the scheme-specific parsing can be performed on the | is determined, the scheme-specific parsing can be performed on the | |||
| generic URI components. In other words, the URI syntax is a superset | generic URI components. In other words, the URI syntax is a superset | |||
| of the syntax of all URI schemes. | of the syntax of all URI schemes. | |||
| 1.3. Example URIs | 1.3. Example URI | |||
| The following examples illustrate URIs which are in common use. | The following examples illustrate URI that are in common use. | |||
| ftp://ftp.is.co.za/rfc/rfc1808.txt | ftp://ftp.is.co.za/rfc/rfc1808.txt | |||
| -- ftp scheme for File Transfer Protocol services | -- ftp scheme for File Transfer Protocol services | |||
| gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles | gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles | |||
| -- gopher scheme for Gopher and Gopher+ Protocol services | -- gopher scheme for Gopher and Gopher+ Protocol services | |||
| http://www.math.uio.no/faq/compression-faq/part1.html | http://www.math.uio.no/faq/compression-faq/part1.html | |||
| -- http scheme for Hypertext Transfer Protocol services | -- http scheme for Hypertext Transfer Protocol services | |||
| mailto:mduerst@ifi.unizh.ch | mailto:mduerst@ifi.unizh.ch | |||
| -- mailto scheme for electronic mail addresses | -- mailto scheme for electronic mail addresses | |||
| news:comp.infosystems.www.servers.unix | news:comp.infosystems.www.servers.unix | |||
| -- news scheme for USENET news groups and articles | -- news scheme for USENET news groups and articles | |||
| telnet://melvyl.ucop.edu/ | telnet://melvyl.ucop.edu/ | |||
| -- telnet scheme for interactive services via the TELNET Protocol | -- telnet scheme for interactive services via the TELNET Protocol | |||
| 1.4. Hierarchical URIs and Relative Forms | 1.4. Hierarchical URI and Relative Forms | |||
| An absolute identifier refers to a resource independent of the | An absolute identifier refers to a resource independent of the | |||
| context in which the identifier is used. In contrast, a relative | context in which the identifier is used. In contrast, a relative | |||
| identifier refers to a resource by describing the difference within a | identifier refers to a resource by describing the difference within a | |||
| hierarchical namespace between the current context and an absolute | hierarchical namespace between the current context and an absolute | |||
| identifier of the resource. | identifier of the resource. | |||
| Some URI schemes support a hierarchical naming system, where the | Some URI schemes support a hierarchical naming system, where the | |||
| hierarchy of the name is denoted by a "/" delimiter separating the | hierarchy of the name is denoted by a "/" delimiter separating the | |||
| components in the scheme. This document defines a scheme-independent | components in the scheme. This document defines a scheme-independent | |||
| `relative' form of URI reference that can be used in conjunction with | `relative' form of URI reference that can be used in conjunction with | |||
| a `base' URI (of a hierarchical scheme) to produce another URI. The | a `base' URI (of a hierarchical scheme) to produce another URI. The | |||
| syntax of hierarchical URIs is described in Section 3; the relative | syntax of hierarchical URI is described in Section 3; the relative | |||
| URI calculation is described in Section 5. | URI calculation is described in Section 5. | |||
| 1.5. URI Transcribability | 1.5. URI Transcribability | |||
| The URI syntax was designed with global transcribability as one of | The URI syntax was designed with global transcribability as one of | |||
| its main concerns. A URI is a sequence of characters from a very | its main concerns. A URI is a sequence of characters from a very | |||
| limited set, i.e. the letters of the basic Latin alphabet, digits, | limited set, i.e. the letters of the basic Latin alphabet, digits, | |||
| and a few special characters. A URI may be represented in a | and a few special characters. A URI may be represented in a | |||
| variety of ways: e.g., ink on paper, pixels on a screen, or a | variety of ways: e.g., ink on paper, pixels on a screen, or a | |||
| sequence of octets in a coded character set. The interpretation of | sequence of octets in a coded character set. The interpretation of | |||
| skipping to change at line 219 | skipping to change at line 226 | |||
| for the research site on a napkin. Upon returning home, Sam takes | for the research site on a napkin. Upon returning home, Sam takes | |||
| out the napkin and types the URI into a computer, which then | out the napkin and types the URI into a computer, which then | |||
| retrieves the information to which Kim referred. | retrieves the information to which Kim referred. | |||
| There are several design concerns revealed by the scenario: | There are several design concerns revealed by the scenario: | |||
| o A URI is a sequence of characters, which is not always | o A URI is a sequence of characters, which is not always | |||
| represented as a sequence of octets. | represented as a sequence of octets. | |||
| o A URI may be transcribed from a non-network source, and thus | o A URI may be transcribed from a non-network source, and thus | |||
| should consist of characters which are most likely to be able | should consist of characters that are most likely to be able | |||
| to be typed into a computer, within the constraints imposed by | to be typed into a computer, within the constraints imposed by | |||
| keyboards (and related input devices) across languages and | keyboards (and related input devices) across languages and | |||
| locales. | locales. | |||
| o A URI often needs to be remembered by people, and it is easier | o A URI often needs to be remembered by people, and it is easier | |||
| for people to remember a URI when it consists of meaningful | for people to remember a URI when it consists of meaningful | |||
| components. | components. | |||
| These design concerns are not always in alignment. For example, it | These design concerns are not always in alignment. For example, it | |||
| is often the case that the most meaningful name for a URI component | is often the case that the most meaningful name for a URI component | |||
| would require characters which cannot be typed into some systems. | would require characters that cannot be typed into some systems. | |||
| The ability to transcribe the resource identifier from one medium to | The ability to transcribe the resource identifier from one medium to | |||
| another was considered more important than having its URI consist | another was considered more important than having its URI consist | |||
| of the most meaningful of components. In local and regional | of the most meaningful of components. In local and regional | |||
| contexts and with improving technology, users might benefit from | contexts and with improving technology, users might benefit from | |||
| being able to use a wider range of characters; such use is not | being able to use a wider range of characters; such use is not | |||
| defined in this document. | defined in this document. | |||
| 1.6. Syntax Notation and Common Elements | 1.6. Syntax Notation and Common Elements | |||
| This document uses two conventions to describe and define the syntax | This document uses two conventions to describe and define the syntax | |||
| skipping to change at line 261 | skipping to change at line 268 | |||
| The second convention is a BNF-like grammar, used to define the | The second convention is a BNF-like grammar, used to define the | |||
| formal URI syntax. The grammar is that of [RFC822], except that | formal URI syntax. The grammar is that of [RFC822], except that | |||
| "|" is used to designate alternatives. Briefly, rules are separated | "|" is used to designate alternatives. Briefly, rules are separated | |||
| from definitions by an equal "=", indentation is used to continue a | from definitions by an equal "=", indentation is used to continue a | |||
| rule definition over more than one line, literals are quoted with "", | rule definition over more than one line, literals are quoted with "", | |||
| parentheses "(" and ")" are used to group elements, optional elements | parentheses "(" and ")" are used to group elements, optional elements | |||
| are enclosed in "[" and "]" brackets, and elements may be preceded | are enclosed in "[" and "]" brackets, and elements may be preceded | |||
| with <n>* to designate n or more repetitions of the following | with <n>* to designate n or more repetitions of the following | |||
| element; n defaults to 0. | element; n defaults to 0. | |||
| Unlike many specifications which use a BNF-like grammar to define the | Unlike many specifications that use a BNF-like grammar to define the | |||
| bytes (octets) allowed by a protocol, the URI grammar is defined in | bytes (octets) allowed by a protocol, the URI grammar is defined in | |||
| terms of characters. Each literal in the grammar corresponds to the | terms of characters. Each literal in the grammar corresponds to the | |||
| character it represents, rather than to the octet encoding of that | character it represents, rather than to the octet encoding of that | |||
| character in any particular coded character set. How a URI is | character in any particular coded character set. How a URI is | |||
| represented in terms of bits and bytes on the wire is dependent upon | represented in terms of bits and bytes on the wire is dependent upon | |||
| the character encoding of the protocol used to transport it, or the | the character encoding of the protocol used to transport it, or the | |||
| charset of the document which contains it. | charset of the document which contains it. | |||
| The following definitions are common to many elements: | The following definitions are common to many elements: | |||
| skipping to change at line 291 | skipping to change at line 298 | |||
| digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | | digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | | |||
| "8" | "9" | "8" | "9" | |||
| alphanum = alpha | digit | alphanum = alpha | digit | |||
| The complete URI syntax is collected in Appendix A. | The complete URI syntax is collected in Appendix A. | |||
| 2. URI Characters and Escape Sequences | 2. URI Characters and Escape Sequences | |||
| URIs consist of a restricted set of characters, primarily chosen to | URI consist of a restricted set of characters, primarily chosen to | |||
| aid transcribability and usability both in computer systems and in | aid transcribability and usability both in computer systems and in | |||
| non-computer communications. Characters used conventionally as | non-computer communications. Characters used conventionally as | |||
| delimiters around URIs were excluded. The restricted set of | delimiters around URI were excluded. The restricted set of | |||
| characters consists of digits, letters, and a few graphic symbols | characters consists of digits, letters, and a few graphic symbols | |||
| chosen from those common to most of the character encodings | chosen from those common to most of the character encodings | |||
| and input facilities available to Internet users. | and input facilities available to Internet users. | |||
| uric = reserved | unreserved | escaped | ||||
| Within a URI, characters are either used as delimiters, or to | Within a URI, characters are either used as delimiters, or to | |||
| represent strings of data (octets) within the delimited portions. | represent strings of data (octets) within the delimited portions. | |||
| Octets are either represented directly by a character (using the | Octets are either represented directly by a character (using the | |||
| US-ASCII character for that octet [ASCII]) or by an escape encoding. | US-ASCII character for that octet [ASCII]) or by an escape encoding. | |||
| This representation is elaborated below. | This representation is elaborated below. | |||
| 2.1 URIs and non-ASCII characters | 2.1 URI and non-ASCII characters | |||
| The relationship between URIs and characters has been a source of | The relationship between URI and characters has been a source of | |||
| confusion for characters that are not part of US-ASCII. To describe | confusion for characters that are not part of US-ASCII. To describe | |||
| the relationship, it is useful to distinguish between a "character" | the relationship, it is useful to distinguish between a "character" | |||
| (as a distinguishable semantic entity) and an "octet" (an 8-bit | (as a distinguishable semantic entity) and an "octet" (an 8-bit | |||
| byte). There are two mappings, one from URI characters to octets, | byte). There are two mappings, one from URI characters to octets, | |||
| and a second from octets to original characters: | and a second from octets to original characters: | |||
| URI character sequence->octet sequence->original character sequence | URI character sequence->octet sequence->original character sequence | |||
| A URI is represented as a sequence of characters, not as a sequence | A URI is represented as a sequence of characters, not as a sequence | |||
| of octets. That is because URIs might be "transported" by means that | of octets. That is because URI might be "transported" by means that | |||
| are not through a computer network, e.g., printed on paper, read | are not through a computer network, e.g., printed on paper, read | |||
| over the radio, etc. | over the radio, etc. | |||
| A URI scheme may define a mapping from URI characters to octets; | A URI scheme may define a mapping from URI characters to octets; | |||
| whether this is done depends on the scheme. Commonly, within a | whether this is done depends on the scheme. Commonly, within a | |||
| delimited component of a URI, a sequence of characters may be | delimited component of a URI, a sequence of characters may be | |||
| used to represent a sequence of octets. For example, the character | used to represent a sequence of octets. For example, the character | |||
| "a" represents the octet 97 (decimal), while the character sequence | "a" represents the octet 97 (decimal), while the character sequence | |||
| "%", "0", "a" represents the octet 10 (decimal). | "%", "0", "a" represents the octet 10 (decimal). | |||
| skipping to change at line 353 | skipping to change at line 362 | |||
| however, the situation is more difficult. Internet protocols that | however, the situation is more difficult. Internet protocols that | |||
| transmit octet sequences intended to represent character sequences | transmit octet sequences intended to represent character sequences | |||
| are expected to provide some way of identifying the charset used, | are expected to provide some way of identifying the charset used, | |||
| if there might be more than one [RFC2277]. However, there is | if there might be more than one [RFC2277]. However, there is | |||
| currently no provision within the generic URI syntax to accomplish | currently no provision within the generic URI syntax to accomplish | |||
| this identification. An individual URI scheme may require a single | this identification. An individual URI scheme may require a single | |||
| charset, define a default charset, or provide a way to indicate the | charset, define a default charset, or provide a way to indicate the | |||
| charset used. | charset used. | |||
| It is expected that a systematic treatment of character encoding | It is expected that a systematic treatment of character encoding | |||
| within URIs will be developed as a future modification of this | within URI will be developed as a future modification of this | |||
| specification. | specification. | |||
| 2.2. Reserved Characters | 2.2. Reserved Characters | |||
| Many URIs include components consisting of, or delimited by, certain | Many URI include components consisting of, or delimited by, certain | |||
| special characters. These characters are called "reserved", since | special characters. These characters are called "reserved", since | |||
| their usage within the URI component is limited to their reserved | their usage within the URI component is limited to their reserved | |||
| purpose. If the data for a URI component would conflict with the | purpose. If the data for a URI component would conflict with the | |||
| reserved purpose, then the conflicting data must be escaped before | reserved purpose, then the conflicting data must be escaped before | |||
| forming the URI. | forming the URI. | |||
| reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | | reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | | |||
| "$" | "," | "$" | "," | |||
| The "reserved" syntax class above refers to those characters which | The "reserved" syntax class above refers to those characters that | |||
| are allowed within a URI, but which may not be allowed within a | are allowed within a URI, but which may not be allowed within a | |||
| particular component of the generic URI syntax; they are used as | particular component of the generic URI syntax; they are used as | |||
| delimiters of the components described in Section 3. | delimiters of the components described in Section 3. | |||
| Characters in the "reserved" set are not reserved in all contexts. | Characters in the "reserved" set are not reserved in all contexts. | |||
| The set of characters actually reserved within any given URI | The set of characters actually reserved within any given URI | |||
| component is defined by that component. In general, a character is | component is defined by that component. In general, a character is | |||
| reserved if the semantics of the URI changes if the character is | reserved if the semantics of the URI changes if the character is | |||
| replaced with its escaped US-ASCII encoding. | replaced with its escaped US-ASCII encoding. | |||
| 2.3. Unreserved Characters | 2.3. Unreserved Characters | |||
| Data characters which are allowed in a URI but do not have a reserved | Data characters that are allowed in a URI but do not have a reserved | |||
| purpose are called unreserved. These include upper and lower case | purpose are called unreserved. These include upper and lower case | |||
| letters, decimal digits, and a limited set of punctuation marks and | letters, decimal digits, and a limited set of punctuation marks and | |||
| symbols. | symbols. | |||
| unreserved = alphanum | mark | unreserved = alphanum | mark | |||
| mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | |||
| Unreserved characters can be escaped without changing the semantics | Unreserved characters can be escaped without changing the semantics | |||
| of the URI, but this should not be done unless the URI is being used | of the URI, but this should not be done unless the URI is being used | |||
| in a context which does not allow the unescaped character to appear. | in a context that does not allow the unescaped character to appear. | |||
| 2.4. Escape Sequences | 2.4. Escape Sequences | |||
| Data must be escaped if it does not have a representation using an | Data must be escaped if it does not have a representation using an | |||
| unreserved character; this includes data that does not correspond | unreserved character; this includes data that does not correspond | |||
| to a printable character of the US-ASCII coded character set, or | to a printable character of the US-ASCII coded character set, or | |||
| that corresponds to any US-ASCII character that is disallowed, as | that corresponds to any US-ASCII character that is disallowed, as | |||
| explained below. | explained below. | |||
| 2.4.1. Escaped Encoding | 2.4.1. Escaped Encoding | |||
| skipping to change at line 419 ¶ | skipping to change at line 428 ¶ | |||
| escaped = "%" hex hex | escaped = "%" hex hex | |||
| hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | | hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | | |||
| "a" | "b" | "c" | "d" | "e" | "f" | "a" | "b" | "c" | "d" | "e" | "f" | |||
| 2.4.2. When to Escape and Unescape | 2.4.2. When to Escape and Unescape | |||
| A URI is always in an "escaped" form, since escaping or unescaping | A URI is always in an "escaped" form, since escaping or unescaping | |||
| a completed URI might change its semantics. Normally, the only | a completed URI might change its semantics. Normally, the only | |||
| time escape encodings can safely be made is when the URI is being | time escape encodings can safely be made is when the URI is being | |||
| created from its component parts; each component may have its own | created from its component parts; each component may have its own | |||
| set of characters which are reserved, so only the mechanism | set of characters that are reserved, so only the mechanism | |||
| responsible for generating or interpreting that component can | responsible for generating or interpreting that component can | |||
| determine whether or not escaping a character will change its | determine whether or not escaping a character will change its | |||
| semantics. Likewise, a URI must be separated into its components | semantics. Likewise, a URI must be separated into its components | |||
| before the escaped characters within those components can be safely | before the escaped characters within those components can be safely | |||
| decoded. | decoded. | |||
| In some cases, data that could be represented by an unreserved | In some cases, data that could be represented by an unreserved | |||
| character may appear escaped; for example, some of the unreserved | character may appear escaped; for example, some of the unreserved | |||
| "mark" characters are automatically escaped by some systems. If the | "mark" characters are automatically escaped by some systems. If the | |||
| given URI scheme defines a canonicalization algorithm, then | given URI scheme defines a canonicalization algorithm, then | |||
| skipping to change at line 445 ¶ | skipping to change at line 454 ¶ | |||
| being the escape indicator, it must be escaped as "%25" in order to | being the escape indicator, it must be escaped as "%25" in order to | |||
| be used as data within a URI. Implementers should be careful not to | be used as data within a URI. Implementers should be careful not to | |||
| escape or unescape the same string more than once, since unescaping | escape or unescape the same string more than once, since unescaping | |||
| an already unescaped string might lead to misinterpreting a percent | an already unescaped string might lead to misinterpreting a percent | |||
| data character as another escaped character, or vice versa in the | data character as another escaped character, or vice versa in the | |||
| case of escaping an already escaped string. | case of escaping an already escaped string. | |||
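
The double-escaping hazard described in Section 2.4.2 can be illustrated with a short Python sketch. It uses the standard library's `urllib.parse.quote` and `unquote` as stand-ins for a generic escaper and unescaper; that choice and the sample strings are assumptions of this illustration, not part of the draft.

```python
from urllib.parse import quote, unquote

# Escaping is done once, on component data, before the URI is assembled.
data = "100% sure"
once = quote(data, safe="")    # "%" becomes "%25", space becomes "%20"
twice = quote(once, safe="")   # escaping again corrupts the data

# Unescaping is done once, after the URI has been split into components.
literal = "%41"                       # data that really contains "%41"
escaped = quote(literal, safe="")     # "%2541"
decoded_once = unquote(escaped)       # back to "%41", as intended
decoded_twice = unquote(decoded_once) # misread as an escape for "A"

print(once, twice, decoded_once, decoded_twice)
```

The second escape turns "%25" into "%2525", and the second unescape turns the data "%41" into "A". This is the misinterpretation the paragraph above warns about.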
| 2.4.3. Excluded US-ASCII Characters | 2.4.3. Excluded US-ASCII Characters | |||
| Although they are disallowed within the URI syntax, we include here | Although they are disallowed within the URI syntax, we include here | |||
| a description of those US-ASCII characters which have been excluded | a description of those US-ASCII characters that have been excluded | |||
| and the reasons for their exclusion. | and the reasons for their exclusion. | |||
| The control characters in the US-ASCII coded character set are not | The control characters in the US-ASCII coded character set are not | |||
| used within a URI, both because they are non-printable and because | used within a URI, both because they are non-printable and because | |||
| they are likely to be misinterpreted by some control mechanisms. | they are likely to be misinterpreted by some control mechanisms. | |||
| control = <US-ASCII coded characters 00-1F and 7F hexadecimal> | control = <US-ASCII coded characters 00-1F and 7F hexadecimal> | |||
| The space character is excluded because significant spaces may | The space character is excluded because significant spaces may | |||
| disappear and insignificant spaces may be introduced when URIs are | disappear and insignificant spaces may be introduced when URI are | |||
| transcribed or typeset or subjected to the treatment of | transcribed or typeset or subjected to the treatment of | |||
| word-processing programs. Whitespace is also used to delimit URIs | word-processing programs. Whitespace is also used to delimit URI | |||
| in many contexts. | in many contexts. | |||
| space = <US-ASCII coded character 20 hexadecimal> | space = <US-ASCII coded character 20 hexadecimal> | |||
| The angle-bracket "<" and ">" and double-quote (") characters are | The angle-bracket "<" and ">" and double-quote (") characters are | |||
| excluded because they are often used as the delimiters around URIs | excluded because they are often used as the delimiters around URI | |||
| in text documents and protocol fields. The character "#" is | in text documents and protocol fields. The character "#" is | |||
| excluded because it is used to delimit a URI from a fragment | excluded because it is used to delimit a URI from a fragment | |||
| identifier in URI references (Section 4). The percent character "%" | identifier in URI references (Section 4). The percent character "%" | |||
| is excluded because it is used for the encoding of escaped | is excluded because it is used for the encoding of escaped | |||
| characters. | characters. | |||
| delims = "<" | ">" | "#" | "%" | <"> | delims = "<" | ">" | "#" | "%" | <"> | |||
| Other characters are excluded because gateways and other transport | Other characters are excluded because gateways and other transport | |||
| agents are known to sometimes modify such characters, or they are | agents are known to sometimes modify such characters, or they are | |||
| used as delimiters. | used as delimiters. | |||
| unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" | unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" | |||
| Data corresponding to excluded characters must be escaped in order | Data corresponding to excluded characters must be escaped in order | |||
| to be properly represented within a URI. | to be properly represented within a URI. | |||
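
As a rough sketch of Sections 2.2 through 2.4.3, the character classes can be written as Python sets and used to escape component data. The function name `escape_component`, its `keep` parameter, and the use of UTF-8 for characters outside US-ASCII are assumptions made for illustration; the draft itself leaves the charset question open.

```python
import string

# Character classes transcribed from the grammar in Sections 2.2 and 2.3.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK

def escape_component(data: str, keep: str = "") -> str:
    """Escape every character that is not unreserved.

    `keep` lists reserved characters that the calling component wants
    to leave unescaped because they act as delimiters there.
    Non-ASCII characters are encoded as UTF-8 (an assumption).
    """
    out = []
    for ch in data:
        if ch in UNRESERVED or ch in keep:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

print(escape_component("a b<c>"))        # space and delims are escaped
print(escape_component("x/y", keep="/")) # "/" kept as a path delimiter
```

Passing the delimiters through `keep` reflects the rule that only the component's own generator knows which reserved characters are significant in that component.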
| 3. URI Syntactic Components | 3. URI Syntactic Components | |||
| The URI syntax is dependent upon the scheme. In general, absolute | The URI syntax is dependent upon the scheme. In general, absolute | |||
| URIs are written as follows: | URI are written as follows: | |||
| <scheme>:<scheme-specific-part> | <scheme>:<scheme-specific-part> | |||
| An absolute URI contains the name of the scheme being used (<scheme>) | An absolute URI contains the name of the scheme being used (<scheme>) | |||
| followed by a colon (":") and then a string (the <scheme-specific- | followed by a colon (":") and then a string (the <scheme-specific- | |||
| part>) whose interpretation depends on the scheme. | part>) whose interpretation depends on the scheme. | |||
| The URI syntax does not require that the scheme-specific-part have | The URI syntax does not require that the scheme-specific-part have | |||
| any general structure or set of semantics which is common among all | any general structure or set of semantics which is common among all | |||
| URIs. However, a subset of URIs do share a common syntax for | URI. However, a subset of URI do share a common syntax for | |||
| representing hierarchical relationships within the namespace. This | representing hierarchical relationships within the namespace. This | |||
| "generic-URI" syntax consists of a sequence of four main components: | "generic URI" syntax consists of a sequence of four main components: | |||
| <scheme>://<authority><path>?<query> | <scheme>://<authority><path>?<query> | |||
| each of which, except <scheme>, may be absent from a particular URI. | each of which, except <scheme>, may be absent from a particular URI. | |||
| For example, some URI schemes do not allow an <authority> component, | For example, some URI schemes do not allow an <authority> component, | |||
| and others do not use a <query> component. | and others do not use a <query> component. | |||
| absoluteURI = generic-URI | opaque-URI | absoluteURI = scheme ":" ( hier_part | opaque_part ) | |||
| opaque-URI = scheme ":" *uric | ||||
| generic-URI = scheme ":" relativeURI | ||||
| The separation of the URI grammar into <generic-URI> and <opaque-URI> | ||||
| is redundant, since both rules will successfully parse any string of | ||||
| <uric> characters. The distinction is simply to clarify that a | ||||
| parser of relative URI references (Section 5) will view a URI as a | ||||
| generic-URI, whereas a handler of absolute references need only view | ||||
| it as an opaque-URI. | ||||
| URIs which are hierarchical in nature use the slash "/" character for | URI that are hierarchical in nature use the slash "/" character for | |||
| separating hierarchical components. For some file systems, a "/" | separating hierarchical components. For some file systems, a "/" | |||
| character (used to denote the hierarchical structure of a URI) is the | character (used to denote the hierarchical structure of a URI) is the | |||
| delimiter used to construct a file name hierarchy, and thus the URI | delimiter used to construct a file name hierarchy, and thus the URI | |||
| path will look similar to a file pathname. This does NOT imply that | path will look similar to a file pathname. This does NOT imply that | |||
| the resource is a file or that the URI maps to an actual filesystem | the resource is a file or that the URI maps to an actual filesystem | |||
| pathname. | pathname. | |||
| hier_part = ( net_path | abs_path ) [ "?" query ] | ||||
| net_path = "//" authority [ abs_path ] | ||||
| abs_path = "/" path_segments | ||||
| URI that do not make use of the slash "/" character for separating | ||||
| hierarchical components are considered opaque by the generic URI | ||||
| parser. | ||||
| opaque_part = uric_no_slash *uric | ||||
| uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | | ||||
| "&" | "=" | "+" | "$" | "," | ||||
| We use the term <path> to refer to both the <abs_path> and | ||||
| <opaque_part> constructs, since they are mutually exclusive for any | ||||
| given URI and can be parsed as a single component. | ||||
| 3.1. Scheme Component | 3.1. Scheme Component | |||
| Just as there are many different methods of access to resources, | Just as there are many different methods of access to resources, | |||
| there are a variety of schemes for identifying such resources. The | there are a variety of schemes for identifying such resources. The | |||
| URI syntax consists of a sequence of components separated by reserved | URI syntax consists of a sequence of components separated by reserved | |||
| characters, with the first component defining the semantics for the | characters, with the first component defining the semantics for the | |||
| remainder of the URI string. | remainder of the URI string. | |||
| Scheme names consist of a sequence of characters beginning with a | Scheme names consist of a sequence of characters beginning with a | |||
| lower case letter and followed by any combination of lower case | lower case letter and followed by any combination of lower case | |||
| letters, digits, plus ("+"), period ("."), or hyphen ("-"). For | letters, digits, plus ("+"), period ("."), or hyphen ("-"). For | |||
| resiliency, programs interpreting URIs should treat upper case | resiliency, programs interpreting URI should treat upper case | |||
| letters as equivalent to lower case in scheme names (e.g., allow | letters as equivalent to lower case in scheme names (e.g., allow | |||
| "HTTP" as well as "http"). | "HTTP" as well as "http"). | |||
| scheme = alpha *( alpha | digit | "+" | "-" | "." ) | scheme = alpha *( alpha | digit | "+" | "-" | "." ) | |||
| Relative URI references are distinguished from absolute URIs in that | Relative URI references are distinguished from absolute URI in that | |||
| they do not begin with a scheme name. Instead, the scheme is | they do not begin with a scheme name. Instead, the scheme is | |||
| inherited from the base URI, as described in Section 5.2. | inherited from the base URI, as described in Section 5.2. | |||
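
The scheme rule in Section 3.1 can be sketched as a regular expression. The helper name `split_scheme` is hypothetical; the sketch accepts upper case letters and folds them to lower case, following the resiliency advice in the text.

```python
import re

# scheme = alpha *( alpha | digit | "+" | "-" | "." ), followed by ":"
SCHEME_RE = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")

def split_scheme(reference: str):
    """Return (scheme, rest); scheme is None for a relative reference."""
    m = SCHEME_RE.match(reference)
    if m:
        return m.group(1).lower(), reference[m.end():]
    return None, reference

print(split_scheme("HTTP://example.com/"))
print(split_scheme("../a/b"))
print(split_scheme("this:that"))    # parsed as scheme "this"
print(split_scheme("./this:that"))  # relative: "." cannot start a scheme
```

The last two calls show why a first relative segment containing a colon has to be written with a leading "./".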
| 3.2. Authority Component | 3.2. Authority Component | |||
| Many URI schemes include a top hierarchical element for a naming | Many URI schemes include a top hierarchical element for a naming | |||
| authority, such that the namespace defined by the remainder of the | authority, such that the namespace defined by the remainder of the | |||
| URI is governed by that authority. This authority component is | URI is governed by that authority. This authority component is | |||
| typically defined by an Internet-based server or a scheme-specific | typically defined by an Internet-based server or a scheme-specific | |||
| registry of naming authorities. | registry of naming authorities. | |||
| skipping to change at line 597 ¶ | skipping to change at line 614 ¶ | |||
| server = [ [ userinfo "@" ] hostport ] | server = [ [ userinfo "@" ] hostport ] | |||
| The user information, if present, is followed by a commercial | The user information, if present, is followed by a commercial | |||
| at-sign "@". | at-sign "@". | |||
| userinfo = *( unreserved | escaped | | userinfo = *( unreserved | escaped | | |||
| ";" | ":" | "&" | "=" | "+" | "$" | "," ) | ";" | ":" | "&" | "=" | "+" | "$" | "," ) | |||
| Some URL schemes use the format "user:password" in the userinfo | Some URL schemes use the format "user:password" in the userinfo | |||
| field. This practice is NOT RECOMMENDED, because the passing of | field. This practice is NOT RECOMMENDED, because the passing of | |||
| authentication information in clear text (such as URIs) has proven to | authentication information in clear text (such as URI) has proven to | |||
| be a security risk in almost every case where it has been used. | be a security risk in almost every case where it has been used. | |||
| The host is a domain name of a network host, or its IPv4 address as | The host is a domain name of a network host, or its IPv4 address as | |||
| a set of four decimal digit groups separated by ".". Literal IPv6 | a set of four decimal digit groups separated by ".". Literal IPv6 | |||
| addresses are not supported. | addresses are not supported. | |||
| hostport = host [ ":" port ] | hostport = host [ ":" port ] | |||
| host = hostname | IPv4address | host = hostname | IPv4address | |||
| hostname = *( domainlabel "." ) toplabel [ "." ] | hostname = *( domainlabel "." ) toplabel [ "." ] | |||
| domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum | domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum | |||
| skipping to change at line 640 ¶ | skipping to change at line 657 ¶ | |||
| number may optionally be supplied, in decimal, separated from the | number may optionally be supplied, in decimal, separated from the | |||
| host by a colon. If the port is omitted, the default port number is | host by a colon. If the port is omitted, the default port number is | |||
| assumed. | assumed. | |||
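
A minimal sketch of splitting the server form of an authority into userinfo, host, and port might look as follows. The function name `split_authority` is hypothetical, and the sketch relies on two points from the text: "@" cannot appear unescaped inside userinfo, and literal IPv6 addresses are not supported, so a colon in the hostport can only introduce a port.

```python
def split_authority(authority: str):
    """Split `[ userinfo "@" ] host [ ":" port ]` into its parts."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None          # no "@" means no userinfo at all
    host, _, port = hostport.partition(":")
    # A missing port is reported as None; the scheme's default applies.
    return userinfo, host, (int(port) if port else None)

print(split_authority("example.com"))
print(split_authority("user@example.com:8080"))
```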
| 3.3. Path Component | 3.3. Path Component | |||
| The path component contains data, specific to the authority (or the | The path component contains data, specific to the authority (or the | |||
| scheme if there is no authority component), identifying the resource | scheme if there is no authority component), identifying the resource | |||
| within the scope of that scheme and authority. | within the scope of that scheme and authority. | |||
| path = [ "/" ] path_segments | path = [ abs_path | opaque_part ] | |||
| path_segments = segment *( "/" segment ) | path_segments = segment *( "/" segment ) | |||
| segment = *pchar *( ";" param ) | segment = *pchar *( ";" param ) | |||
| param = *pchar | param = *pchar | |||
| pchar = unreserved | escaped | | pchar = unreserved | escaped | | |||
| ":" | "@" | "&" | "=" | "+" | "$" | "," | ":" | "@" | "&" | "=" | "+" | "$" | "," | |||
| The path may consist of a sequence of path segments separated by a | The path may consist of a sequence of path segments separated by a | |||
| single slash "/" character. Within a path segment, the characters | single slash "/" character. Within a path segment, the characters | |||
| skipping to change at line 671 ¶ | skipping to change at line 688 ¶ | |||
| query = *uric | query = *uric | |||
| Within a query component, the characters ";", "/", "?", ":", "@", | Within a query component, the characters ";", "/", "?", ":", "@", | |||
| "&", "=", "+", ",", and "$" are reserved. | "&", "=", "+", ",", and "$" are reserved. | |||
| 4. URI References | 4. URI References | |||
| The term "URI-reference" is used here to denote the common usage of | The term "URI-reference" is used here to denote the common usage of | |||
| a resource identifier. A URI reference may be absolute or relative, | a resource identifier. A URI reference may be absolute or relative, | |||
| and may have additional information attached in the form of a | and may have additional information attached in the form of a | |||
| fragment identifier. However, "the URI" which results from such a | fragment identifier. However, "the URI" that results from such a | |||
| reference includes only the absolute URI after the fragment | reference includes only the absolute URI after the fragment | |||
| identifier (if any) is removed and after any relative URI is | identifier (if any) is removed and after any relative URI is | |||
| resolved to its absolute form. Although it is possible to limit | resolved to its absolute form. Although it is possible to limit | |||
| the discussion of URI syntax and semantics to that of the absolute | the discussion of URI syntax and semantics to that of the absolute | |||
| result, most usage of URIs is within general URI references, and it | result, most usage of URI is within general URI references, and it | |||
| is impossible to obtain the URI from such a reference without also | is impossible to obtain the URI from such a reference without also | |||
| parsing the fragment and resolving the relative form. | parsing the fragment and resolving the relative form. | |||
| URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] | URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] | |||
| The syntax for relative URIs is a shortened form of that for absolute | The syntax for relative URI is a shortened form of that for absolute | |||
| URIs, where some prefix of the URI is missing and certain path | URI, where some prefix of the URI is missing and certain path | |||
| components ("." and "..") have a special meaning when interpreting a | components ("." and "..") have a special meaning when interpreting a | |||
| relative path. The relative URI syntax is defined in Section 5. | relative path. The relative URI syntax is defined in Section 5. | |||
| 4.1. Fragment Identifier | 4.1. Fragment Identifier | |||
| When a URI reference is used to perform a retrieval action on the | When a URI reference is used to perform a retrieval action on the | |||
| identified resource, the optional fragment identifier, separated from | identified resource, the optional fragment identifier, separated from | |||
| the URI by a crosshatch ("#") character, consists of additional | the URI by a crosshatch ("#") character, consists of additional | |||
| reference information to be interpreted by the user agent after the | reference information to be interpreted by the user agent after the | |||
| retrieval action has been successfully completed. As such, it is not | retrieval action has been successfully completed. As such, it is not | |||
| part of a URI, but is often used in conjunction with a URI. | part of a URI, but is often used in conjunction with a URI. | |||
| fragment = *uric | fragment = *uric | |||
| The semantics of a fragment identifier is a property of the data | The semantics of a fragment identifier is a property of the data | |||
| resulting from a retrieval action, regardless of the type of URI used | resulting from a retrieval action, regardless of the type of URI used | |||
| in the reference. Therefore, the format and interpretation of | in the reference. Therefore, the format and interpretation of | |||
| fragment identifiers are dependent on the media type [RFC2046] of the | fragment identifiers are dependent on the media type [RFC2046] of the | |||
| retrieval result. The character restrictions described in Section 2 | retrieval result. The character restrictions described in Section 2 | |||
| for URIs also apply to the fragment in a URI-reference. Individual | for URI also apply to the fragment in a URI-reference. Individual | |||
| media types may define additional restrictions or structure within | media types may define additional restrictions or structure within | |||
| the fragment for specifying different types of "partial views" that | the fragment for specifying different types of "partial views" that | |||
| can be identified within that media type. | can be identified within that media type. | |||
| A fragment identifier is only meaningful when a URI reference is | A fragment identifier is only meaningful when a URI reference is | |||
| intended for retrieval and the result of that retrieval is a document | intended for retrieval and the result of that retrieval is a document | |||
| for which the identified fragment is consistently defined. | for which the identified fragment is consistently defined. | |||
| 4.2. Same-document References | 4.2. Same-document References | |||
| A URI reference which does not contain a URI is a reference to the | A URI reference that does not contain a URI is a reference to the | |||
| current document. In other words, an empty URI reference within a | current document. In other words, an empty URI reference within a | |||
| document is interpreted as a reference to the start of that document, | document is interpreted as a reference to the start of that document, | |||
| and a reference containing only a fragment identifier is a reference | and a reference containing only a fragment identifier is a reference | |||
| to the identified fragment of that document. Traversal of such a | to the identified fragment of that document. Traversal of such a | |||
| reference should not result in an additional retrieval action. | reference should not result in an additional retrieval action. | |||
| However, if the URI reference occurs in a context that is always | However, if the URI reference occurs in a context that is always | |||
| intended to result in a new request, as in the case of HTML's FORM | intended to result in a new request, as in the case of HTML's FORM | |||
| element, then an empty URI reference represents the base URI of the | element, then an empty URI reference represents the base URI of the | |||
| current document and should be replaced by that URI when transformed | current document and should be replaced by that URI when transformed | |||
| into a request. | into a request. | |||
| 4.3. Parsing a URI Reference | 4.3. Parsing a URI Reference | |||
| A URI reference is typically parsed according to the four main | A URI reference is typically parsed according to the four main | |||
| components and fragment identifier in order to determine what | components and fragment identifier in order to determine what | |||
| components are present and whether the reference is relative or | components are present and whether the reference is relative or | |||
| absolute. The individual components are then parsed for their | absolute. The individual components are then parsed for their | |||
| subparts and to verify their validity. A reference is parsed as if | subparts and, if not opaque, to verify their validity. | |||
| it is a generic-URI, even though it might be considered opaque by | ||||
| later processes. | ||||
| Although the BNF defines what is allowed in each component, it is | Although the BNF defines what is allowed in each component, it is | |||
| ambiguous in terms of differentiating between an authority component | ambiguous in terms of differentiating between an authority component | |||
| and a path component that begins with two slash characters. The | and a path component that begins with two slash characters. The | |||
| greedy algorithm is used for disambiguation: the left-most matching | greedy algorithm is used for disambiguation: the left-most matching | |||
| rule soaks up as much of the URI reference string as it is capable of | rule soaks up as much of the URI reference string as it is capable of | |||
| matching. In other words, the authority component wins. | matching. In other words, the authority component wins. | |||
| Readers familiar with regular expressions should see Appendix B for a | Readers familiar with regular expressions should see Appendix B for a | |||
| concrete parsing example and test oracle. | concrete parsing example and test oracle. | |||
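
The greedy, left-to-right parse described in Section 4.3 can be sketched with a single regular expression. The expression below is the one published in Appendix B of RFC 2396; it is assumed here to match the one in this draft's Appendix B, which is not shown in this excerpt. The function name `split_reference` is hypothetical.

```python
import re

# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_REF_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_reference(reference: str) -> dict:
    """Split a URI reference into its main components and fragment."""
    m = URI_REF_RE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_reference("//example.org//x"))
```

In the second call the leading "//" is taken as introducing an authority, so "example.org" is the authority and "//x" is the path. This is the "authority component wins" rule in practice. Absent components come back as `None`, which distinguishes them from components that are present but empty.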
| 5. Relative URI References | 5. Relative URI References | |||
| It is often the case that a group or "tree" of documents has been | It is often the case that a group or "tree" of documents has been | |||
| constructed to serve a common purpose; the vast majority of URIs in | constructed to serve a common purpose; the vast majority of URI in | |||
| these documents point to resources within the tree rather than | these documents point to resources within the tree rather than | |||
| outside of it. Similarly, documents located at a particular site | outside of it. Similarly, documents located at a particular site | |||
| are much more likely to refer to other resources at that site than | are much more likely to refer to other resources at that site than | |||
| to resources at remote sites. | to resources at remote sites. | |||
| Relative addressing of URLs allows document trees to be partially | Relative addressing of URI allows document trees to be partially | |||
| independent of their location and access scheme. For instance, it is | independent of their location and access scheme. For instance, it is | |||
| possible for a single set of hypertext documents to be simultaneously | possible for a single set of hypertext documents to be simultaneously | |||
| accessible and traversable via each of the "file", "http", and "ftp" | accessible and traversable via each of the "file", "http", and "ftp" | |||
| schemes if the documents refer to each other using relative URIs. | schemes if the documents refer to each other using relative URI. | |||
| Furthermore, such document trees can be moved, as a whole, without | Furthermore, such document trees can be moved, as a whole, without | |||
| changing any of the relative references. Experience within the WWW | changing any of the relative references. Experience within the WWW | |||
| has demonstrated that the ability to perform relative referencing | has demonstrated that the ability to perform relative referencing | |||
| is necessary for the long-term usability of embedded URLs. | is necessary for the long-term usability of embedded URI. | |||
| relativeURI = net_path | abs_path | rel_path | The syntax for relative URI takes advantage of the <hier_part> syntax | |||
| of <absoluteURI> (Section 3) in order to express a reference that is | ||||
| relative to the namespace of another hierarchical URI. | ||||
| A relative reference beginning with two slash characters is termed a | relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] | |||
| network-path reference. Such references are rarely used. | ||||
| net_path = "//" authority [ abs_path ] | A relative reference beginning with two slash characters is termed a | |||
| network-path reference, as defined by <net_path> in Section 3. Such | ||||
| references are rarely used. | ||||
| A relative reference beginning with a single slash character is | A relative reference beginning with a single slash character is | |||
| termed an absolute-path reference. | termed an absolute-path reference, as defined by <abs_path> in | |||
| Section 3. | ||||
| abs_path = "/" rel_path | ||||
| A relative reference which does not begin with a scheme name or a | A relative reference that does not begin with a scheme name or a | |||
| slash character is termed a relative-path reference. | slash character is termed a relative-path reference. | |||
| rel_path = [ path_segments ] [ "?" query ] | rel_path = rel_segment [ abs_path ] | |||
| rel_segment = 1*( unreserved | escaped | | ||||
| ";" | "@" | "&" | "=" | "+" | "$" | "," ) | ||||
| Within a relative-path reference, the complete path segments "." and | Within a relative-path reference, the complete path segments "." and | |||
| ".." have special meanings: "the current hierarchy level" and "the | ".." have special meanings: "the current hierarchy level" and "the | |||
| level above this hierarchy level", respectively. Although this is | level above this hierarchy level", respectively. Although this is | |||
| very similar to their use within Unix-based filesystems to indicate | very similar to their use within Unix-based filesystems to indicate | |||
| directory levels, these path components are only considered special | directory levels, these path components are only considered special | |||
| when resolving a relative-path reference to its absolute form | when resolving a relative-path reference to its absolute form | |||
| (Section 5.2). | (Section 5.2). | |||
| Authors should be aware that a path segment which contains a colon | Authors should be aware that a path segment which contains a colon | |||
| character cannot be used as the first segment of a relative URI path | character cannot be used as the first segment of a relative URI path | |||
| (e.g., "this:that"), because it would be mistaken for a scheme name. | (e.g., "this:that"), because it would be mistaken for a scheme name. | |||
| It is therefore necessary to precede such segments with other | It is therefore necessary to precede such segments with other | |||
| segments (e.g., "./this:that") in order for them to be referenced as | segments (e.g., "./this:that") in order for them to be referenced as | |||
| a relative path. | a relative path. | |||
| It is not necessary for all URIs within a given scheme to be | It is not necessary for all URI within a given scheme to be | |||
| restricted to the generic-URI syntax, since the hierarchical | restricted to the <hier_part> syntax, since the hierarchical | |||
| properties of that syntax are only necessary when relative URIs are | properties of that syntax are only necessary when relative URI are | |||
| used within a particular document. Documents can only make use of | used within a particular document. Documents can only make use of | |||
| relative URIs when their base URI fits within the generic-URI syntax. | relative URI when their base URI fits within the <hier_part> syntax. | |||
| It is assumed that any document which contains a relative reference | It is assumed that any document which contains a relative reference | |||
| will also have a base URI that obeys the syntax. In other words, | will also have a base URI that obeys the syntax. In other words, | |||
| relative URIs cannot be used within a document that has an unsuitable | relative URI cannot be used within a document that has an unsuitable | |||
| base URI. | base URI. | |||
| Some URI schemes do not allow a hierarchical syntax matching the | Some URI schemes do not allow a hierarchical syntax matching the | |||
| generic-URI syntax, and thus cannot use relative references. | <hier_part> syntax, and thus cannot use relative references. | |||
| 5.1. Establishing a Base URI | 5.1. Establishing a Base URI | |||
| The term "relative URI" implies that there exists some absolute "base | The term "relative URI" implies that there exists some absolute "base | |||
| URI" against which the relative reference is applied. Indeed, the | URI" against which the relative reference is applied. Indeed, the | |||
| base URI is necessary to define the semantics of any relative URI | base URI is necessary to define the semantics of any relative URI | |||
| reference; without it, a relative reference is meaningless. In order | reference; without it, a relative reference is meaningless. In order | |||
| for relative URIs to be usable within a document, the base URI of | for relative URI to be usable within a document, the base URI of | |||
| that document must be known to the parser. | that document must be known to the parser. | |||
| The base URI of a document can be established in one of four ways, | The base URI of a document can be established in one of four ways, | |||
| listed below in order of precedence. The order of precedence can be | listed below in order of precedence. The order of precedence can be | |||
| thought of in terms of layers, where the innermost defined base URI | thought of in terms of layers, where the innermost defined base URI | |||
| has the highest precedence. This can be visualized graphically as: | has the highest precedence. This can be visualized graphically as: | |||
| .----------------------------------------------------------. | .----------------------------------------------------------. | |||
| | .----------------------------------------------------. | | | .----------------------------------------------------. | | |||
| | | .----------------------------------------------. | | | | | .----------------------------------------------. | | | |||
| skipping to change at line 893 | skipping to change at line 913 | |||
| 5.1.4. Default Base URI | 5.1.4. Default Base URI | |||
| If none of the conditions described in Sections 5.1.1--5.1.3 apply, | If none of the conditions described in Sections 5.1.1--5.1.3 apply, | |||
| then the base URI is defined by the context of the application. | then the base URI is defined by the context of the application. | |||
| Since this definition is necessarily application-dependent, failing | Since this definition is necessarily application-dependent, failing | |||
| to define the base URI using one of the other methods may result in | to define the base URI using one of the other methods may result in | |||
| the same content being interpreted differently by different types of | the same content being interpreted differently by different types of | |||
| application. | application. | |||
| It is the responsibility of the distributor(s) of a document | It is the responsibility of the distributor(s) of a document | |||
| containing relative URIs to ensure that the base URI for that | containing relative URI to ensure that the base URI for that | |||
| document can be established. It must be emphasized that relative | document can be established. It must be emphasized that relative | |||
| URIs cannot be used reliably in situations where the document's | URI cannot be used reliably in situations where the document's | |||
| base URI is not well-defined. | base URI is not well-defined. | |||
| 5.2. Resolving Relative References to Absolute Form | 5.2. Resolving Relative References to Absolute Form | |||
| This section describes an example algorithm for resolving URI | This section describes an example algorithm for resolving URI | |||
| references which might be relative to a given base URI. | references that might be relative to a given base URI. | |||
| The base URI is established according to the rules of Section 5.1 and | The base URI is established according to the rules of Section 5.1 and | |||
| parsed into the four main components as described in Section 3. | parsed into the four main components as described in Section 3. | |||
| Note that only the scheme component is required to be present in the | Note that only the scheme component is required to be present in the | |||
| base URI; the other components may be empty or undefined. A | base URI; the other components may be empty or undefined. A | |||
| component is undefined if its preceding separator does not appear in | component is undefined if its preceding separator does not appear in | |||
| the URI reference; the path component is never undefined, though it | the URI reference; the path component is never undefined, though it | |||
| may be empty. The base URI's query component is not used by the | may be empty. The base URI's query component is not used by the | |||
| resolution algorithm and may be discarded. | resolution algorithm and may be discarded. | |||
| skipping to change at line 928 | skipping to change at line 948 | |||
| query components are undefined, then it is a reference to the | query components are undefined, then it is a reference to the | |||
| current document and we are done. Otherwise, the reference URI's | current document and we are done. Otherwise, the reference URI's | |||
| query and fragment components are defined as found (or not found) | query and fragment components are defined as found (or not found) | |||
| within the URI reference and not inherited from the base URI. | within the URI reference and not inherited from the base URI. | |||
| 3) If the scheme component is defined, indicating that the reference | 3) If the scheme component is defined, indicating that the reference | |||
| starts with a scheme name, then the reference is interpreted as an | starts with a scheme name, then the reference is interpreted as an | |||
| absolute URI and we are done. Otherwise, the reference URI's | absolute URI and we are done. Otherwise, the reference URI's | |||
| scheme is inherited from the base URI's scheme component. | scheme is inherited from the base URI's scheme component. | |||
| Due to a loophole in prior specifications [RFC1630], some parsers | ||||
| allow the scheme name to be present in a relative URI if it is the | ||||
| same as the base URI scheme. Unfortunately, this can conflict | ||||
| with the correct parsing of non-hierarchical URI. For backwards | ||||
| compatibility, an implementation may work around such references | ||||
| by removing the scheme if it matches that of the base URI and the | ||||
| scheme is known to always use the <hier_part> syntax. The parser | ||||
| can then continue with the steps below for the remainder of the | ||||
| reference components. Validating parsers should mark such a | ||||
| | malformed relative reference as an error. | |||
| 4) If the authority component is defined, then the reference is a | 4) If the authority component is defined, then the reference is a | |||
| network-path and we skip to step 7. Otherwise, the reference | network-path and we skip to step 7. Otherwise, the reference | |||
| URI's authority is inherited from the base URI's authority | URI's authority is inherited from the base URI's authority | |||
| component, which will also be undefined if the URI scheme does not | component, which will also be undefined if the URI scheme does not | |||
| use an authority component. | use an authority component. | |||
| 5) If the path component begins with a slash character ("/"), then | 5) If the path component begins with a slash character ("/"), then | |||
| the reference is an absolute-path and we skip to step 7. | the reference is an absolute-path and we skip to step 7. | |||
| 6) If this step is reached, then we are resolving a relative-path | 6) If this step is reached, then we are resolving a relative-path | |||
| skipping to change at line 1025 | skipping to change at line 1056 | |||
| reference's query component from its path component before merging | reference's query component from its path component before merging | |||
| the base and reference paths in step 6 above. This may result in | the base and reference paths in step 6 above. This may result in | |||
| a loss of information if the query component contains the strings | a loss of information if the query component contains the strings | |||
| "/../" or "/./". | "/../" or "/./". | |||
| Resolution examples are provided in Appendix C. | Resolution examples are provided in Appendix C. | |||
| 6. URI Normalization and Equivalence | 6. URI Normalization and Equivalence | |||
| In many cases, different URI strings may actually identify the | In many cases, different URI strings may actually identify the | |||
| identical resource. For example, the host names used in URLs are | identical resource. For example, the host names used in URL are | |||
| actually case insensitive, and the URL <http://www.XEROX.com> is | actually case insensitive, and the URL <http://www.XEROX.com> is | |||
| equivalent to <http://www.xerox.com>. In general, the rules for | equivalent to <http://www.xerox.com>. In general, the rules for | |||
| equivalence and definition of a normal form, if any, are scheme | equivalence and definition of a normal form, if any, are scheme | |||
| dependent. When a scheme uses elements of the common syntax, it | dependent. When a scheme uses elements of the common syntax, it | |||
| will also use the common syntax equivalence rules, namely that the | will also use the common syntax equivalence rules, namely that the | |||
| scheme and hostname are case insensitive and a URL with an explicit | scheme and hostname are case insensitive and a URL with an explicit | |||
| ":port", where the port is the default for the scheme, is equivalent | ":port", where the port is the default for the scheme, is equivalent | |||
| to one where the port is elided. | to one where the port is elided. | |||
| 7. Security Considerations | 7. Security Considerations | |||
| skipping to change at line 1054 | skipping to change at line 1085 | |||
| resource in question. A specific URI scheme may include additional | resource in question. A specific URI scheme may include additional | |||
| semantics, such as name persistence, if those semantics are required | semantics, such as name persistence, if those semantics are required | |||
| of all naming authorities for that scheme. | of all naming authorities for that scheme. | |||
| It is sometimes possible to construct a URL such that an attempt to | It is sometimes possible to construct a URL such that an attempt to | |||
| perform a seemingly harmless, idempotent operation, such as the | perform a seemingly harmless, idempotent operation, such as the | |||
| retrieval of an entity associated with the resource, will in fact | retrieval of an entity associated with the resource, will in fact | |||
| cause a possibly damaging remote operation to occur. The unsafe URL | cause a possibly damaging remote operation to occur. The unsafe URL | |||
| is typically constructed by specifying a port number other than that | is typically constructed by specifying a port number other than that | |||
| reserved for the network protocol in question. The client | reserved for the network protocol in question. The client | |||
| unwittingly contacts a site which is in fact running a different | unwittingly contacts a site that is in fact running a different | |||
| protocol. The content of the URL contains instructions which, when | protocol. The content of the URL contains instructions that, when | |||
| interpreted according to this other protocol, cause an unexpected | interpreted according to this other protocol, cause an unexpected | |||
| operation. An example has been the use of gopher URLs to cause an | operation. An example has been the use of a gopher URL to cause an | |||
| unintended or impersonating message to be sent via an SMTP server. | unintended or impersonating message to be sent via an SMTP server. | |||
| Caution should be used when using any URL which specifies a port | Caution should be used when using any URL that specifies a port | |||
| number other than the default for the protocol, especially when it | number other than the default for the protocol, especially when it | |||
| is a number within the reserved space. | is a number within the reserved space. | |||
| Care should be taken when URLs contain escaped delimiters for a | Care should be taken when a URL contains escaped delimiters for a | |||
| given protocol (for example, CR and LF characters for telnet | given protocol (for example, CR and LF characters for telnet | |||
| protocols) that these are not unescaped before transmission. This | protocols) that these are not unescaped before transmission. This | |||
| might violate the protocol, but avoids the potential for such | might violate the protocol, but avoids the potential for such | |||
| characters to be used to simulate an extra operation or parameter | characters to be used to simulate an extra operation or parameter | |||
| in that protocol, which might lead to an unexpected and possibly | in that protocol, which might lead to an unexpected and possibly | |||
| harmful remote operation being performed. | harmful remote operation being performed. | |||
| It is clearly unwise to use a URL that contains a password which is | It is clearly unwise to use a URL that contains a password which is | |||
| intended to be secret. In particular, the use of a password within | intended to be secret. In particular, the use of a password within | |||
| the 'userinfo' component of a URL is strongly disrecommended except | the 'userinfo' component of a URL is strongly disrecommended except | |||
| skipping to change at line 1155 | skipping to change at line 1186 | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| Fax: +1(617)258-8682 | Fax: +1(617)258-8682 | |||
| EMail: timbl@w3.org | EMail: timbl@w3.org | |||
| Roy T. Fielding | Roy T. Fielding | |||
| Department of Information and Computer Science | Department of Information and Computer Science | |||
| University of California, Irvine | University of California, Irvine | |||
| Irvine, CA 92697-3425 | Irvine, CA 92697-3425 | |||
| Fax: +1(714)824-1715 | Fax: +1(949)824-1715 | |||
| EMail: fielding@ics.uci.edu | EMail: fielding@ics.uci.edu | |||
| Larry Masinter | Larry Masinter | |||
| Xerox PARC | Xerox PARC | |||
| 3333 Coyote Hill Road | 3333 Coyote Hill Road | |||
| Palo Alto, CA 94034 | Palo Alto, CA 94034 | |||
| Fax: +1(415)812-4333 | Fax: +1(415)812-4333 | |||
| EMail: masinter@parc.xerox.com | EMail: masinter@parc.xerox.com | |||
| Appendices | Appendices | |||
| A. Collected BNF for URIs | A. Collected BNF for URI | |||
| URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] | URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] | |||
| absoluteURI = generic-URI | opaque-URI | absoluteURI = scheme ":" ( hier_part | opaque_part ) | |||
| opaque-URI = scheme ":" *uric | relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] | |||
| generic-URI = scheme ":" relativeURI | ||||
| hier_part = ( net_path | abs_path ) [ "?" query ] | ||||
| opaque_part = uric_no_slash *uric | ||||
| uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | | ||||
| "&" | "=" | "+" | "$" | "," | ||||
| relativeURI = net_path | abs_path | rel_path | ||||
| net_path = "//" authority [ abs_path ] | net_path = "//" authority [ abs_path ] | |||
| abs_path = "/" rel_path | abs_path = "/" path_segments | |||
| rel_path = [ path_segments ] [ "?" query ] | rel_path = rel_segment [ abs_path ] | |||
| rel_segment = 1*( unreserved | escaped | | ||||
| ";" | "@" | "&" | "=" | "+" | "$" | "," ) | ||||
| scheme = alpha *( alpha | digit | "+" | "-" | "." ) | scheme = alpha *( alpha | digit | "+" | "-" | "." ) | |||
| authority = server | reg_name | authority = server | reg_name | |||
| reg_name = 1*( unreserved | escaped | "$" | "," | | reg_name = 1*( unreserved | escaped | "$" | "," | | |||
| ";" | ":" | "@" | "&" | "=" | "+" ) | ";" | ":" | "@" | "&" | "=" | "+" ) | |||
| server = [ [ userinfo "@" ] hostport ] | server = [ [ userinfo "@" ] hostport ] | |||
| userinfo = *( unreserved | escaped | | userinfo = *( unreserved | escaped | | |||
| ";" | ":" | "&" | "=" | "+" | "$" | "," ) | ";" | ":" | "&" | "=" | "+" | "$" | "," ) | |||
| hostport = host [ ":" port ] | hostport = host [ ":" port ] | |||
| host = hostname | IPv4address | host = hostname | IPv4address | |||
| hostname = *( domainlabel "." ) toplabel [ "." ] | hostname = *( domainlabel "." ) toplabel [ "." ] | |||
| domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum | domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum | |||
| toplabel = alpha | alpha *( alphanum | "-" ) alphanum | toplabel = alpha | alpha *( alphanum | "-" ) alphanum | |||
| IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit | IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit | |||
| port = *digit | port = *digit | |||
| path = [ "/" ] path_segments | path = [ abs_path | opaque_part ] | |||
| path_segments = segment *( "/" segment ) | path_segments = segment *( "/" segment ) | |||
| segment = *pchar *( ";" param ) | segment = *pchar *( ";" param ) | |||
| param = *pchar | param = *pchar | |||
| pchar = unreserved | escaped | | pchar = unreserved | escaped | | |||
| ":" | "@" | "&" | "=" | "+" | "$" | "," | ":" | "@" | "&" | "=" | "+" | "$" | "," | |||
| query = *uric | query = *uric | |||
| fragment = *uric | fragment = *uric | |||
| skipping to change at line 1235 | skipping to change at line 1273 | |||
| "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | | |||
| "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | |||
| upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | | upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | | |||
| "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | | |||
| "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | |||
| digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | | digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | | |||
| "8" | "9" | "8" | "9" | |||
| B. Parsing a URI Reference with a Regular Expression | B. Parsing a URI Reference with a Regular Expression | |||
| As described in Section 4.3, the generic-URI syntax is not sufficient | As described in Section 4.3, the generic URI syntax is not sufficient | |||
| to disambiguate the components of some forms of URI. Since the | to disambiguate the components of some forms of URI. Since the | |||
| "greedy algorithm" described in that section is identical to the | "greedy algorithm" described in that section is identical to the | |||
| disambiguation method used by POSIX regular expressions, it is | disambiguation method used by POSIX regular expressions, it is | |||
| natural and commonplace to use a regular expression for parsing the | natural and commonplace to use a regular expression for parsing the | |||
| potential four components and fragment identifier of a URI reference. | potential four components and fragment identifier of a URI reference. | |||
| The following line is the regular expression for breaking down a URI | The following line is the regular expression for breaking down a URI | |||
| reference into its components. | reference into its components. | |||
| ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | |||
| skipping to change at line 1286 | skipping to change at line 1324 | |||
| and, going in the opposite direction, we can recreate a URI reference | and, going in the opposite direction, we can recreate a URI reference | |||
| from its components using the algorithm in step 7 of Section 5.2. | from its components using the algorithm in step 7 of Section 5.2. | |||
| C. Examples of Resolving Relative URI References | C. Examples of Resolving Relative URI References | |||
| Within an object with a well-defined base URI of | Within an object with a well-defined base URI of | |||
| http://a/b/c/d;p?q | http://a/b/c/d;p?q | |||
| the relative URIs would be resolved as follows: | the relative URI would be resolved as follows: | |||
| C.1. Normal Examples | C.1. Normal Examples | |||
| g:h = g:h | g:h = g:h | |||
| g = http://a/b/c/g | g = http://a/b/c/g | |||
| ./g = http://a/b/c/g | ./g = http://a/b/c/g | |||
| g/ = http://a/b/c/g/ | g/ = http://a/b/c/g/ | |||
| /g = http://a/g | /g = http://a/g | |||
| //g = http://g | //g = http://g | |||
| ?y = http://a/b/c/?y | ?y = http://a/b/c/?y | |||
| skipping to change at line 1358 | skipping to change at line 1396 | |||
| nonsensical forms of the "." and ".." complete path segments. | nonsensical forms of the "." and ".." complete path segments. | |||
| ./../g = http://a/b/g | ./../g = http://a/b/g | |||
| ./g/. = http://a/b/c/g/ | ./g/. = http://a/b/c/g/ | |||
| g/./h = http://a/b/c/g/h | g/./h = http://a/b/c/g/h | |||
| g/../h = http://a/b/c/h | g/../h = http://a/b/c/h | |||
| g;x=1/./y = http://a/b/c/g;x=1/y | g;x=1/./y = http://a/b/c/g;x=1/y | |||
| g;x=1/../y = http://a/b/c/y | g;x=1/../y = http://a/b/c/y | |||
| All client applications remove the query component from the base URI | All client applications remove the query component from the base URI | |||
| before resolving relative URIs. However, some applications fail to | before resolving relative URI. However, some applications fail to | |||
| separate the reference's query and/or fragment components from a | separate the reference's query and/or fragment components from a | |||
| relative path before merging it with the base path. This error is | relative path before merging it with the base path. This error is | |||
| rarely noticed, since typical usage of a fragment never includes the | rarely noticed, since typical usage of a fragment never includes the | |||
| hierarchy ("/") character, and the query component is not normally | hierarchy ("/") character, and the query component is not normally | |||
| used within relative references. | used within relative references. | |||
| g?y/./x = http://a/b/c/g?y/./x | g?y/./x = http://a/b/c/g?y/./x | |||
| g?y/../x = http://a/b/c/g?y/../x | g?y/../x = http://a/b/c/g?y/../x | |||
| g#s/./x = http://a/b/c/g#s/./x | g#s/./x = http://a/b/c/g#s/./x | |||
| g#s/../x = http://a/b/c/g#s/../x | g#s/../x = http://a/b/c/g#s/../x | |||
| Some parsers allow the scheme name to be present in a relative URI | Some parsers allow the scheme name to be present in a relative URI | |||
| if it is the same as the base URI scheme. This is considered to be | if it is the same as the base URI scheme. This is considered to be | |||
| a loophole in prior specifications of partial URIs [RFC1630]. Its | a loophole in prior specifications of partial URI [RFC1630]. Its | |||
| use should be avoided. | use should be avoided. | |||
| http:g = http:g | http:g = http:g ; for validating parsers | |||
| http: = http: | | http://a/b/c/g ; for backwards compatibility | |||
| D. Embedding the Base URI in HTML documents | D. Embedding the Base URI in HTML documents | |||
| It is useful to consider an example of how the base URI of a | It is useful to consider an example of how the base URI of a | |||
| document can be embedded within the document's content. In this | document can be embedded within the document's content. In this | |||
| appendix, we describe how documents written in the Hypertext Markup | appendix, we describe how documents written in the Hypertext Markup | |||
| Language (HTML) [RFC1866] can include an embedded base URI. This | Language (HTML) [RFC1866] can include an embedded base URI. This | |||
| appendix does not form a part of the URI specification and should not | appendix does not form a part of the URI specification and should not | |||
| be considered as anything more than a descriptive example. | be considered as anything more than a descriptive example. | |||
| HTML defines a special element "BASE" which, when present in the | HTML defines a special element "BASE" which, when present in the | |||
| "HEAD" portion of a document, signals that the parser should use | "HEAD" portion of a document, signals that the parser should use | |||
| the BASE element's "HREF" attribute as the base URI for resolving | the BASE element's "HREF" attribute as the base URI for resolving | |||
| any relative URIs. The "HREF" attribute must be an absolute URI. | any relative URI. The "HREF" attribute must be an absolute URI. | |||
| Note that, in HTML, element and attribute names are | Note that, in HTML, element and attribute names are | |||
| case-insensitive. For example: | case-insensitive. For example: | |||
| <!doctype html public "-//IETF//DTD HTML//EN"> | <!doctype html public "-//IETF//DTD HTML//EN"> | |||
| <HTML><HEAD> | <HTML><HEAD> | |||
| <TITLE>An example HTML document</TITLE> | <TITLE>An example HTML document</TITLE> | |||
| <BASE href="http://www.ics.uci.edu/Test/a/b/c"> | <BASE href="http://www.ics.uci.edu/Test/a/b/c"> | |||
| </HEAD><BODY> | </HEAD><BODY> | |||
| ... <A href="../x">a hypertext anchor</A> ... | ... <A href="../x">a hypertext anchor</A> ... | |||
| </BODY></HTML> | </BODY></HTML> | |||
| A parser reading the example document should interpret the given | A parser reading the example document should interpret the given | |||
| relative URI "../x" as representing the absolute URI | relative URI "../x" as representing the absolute URI | |||
| <http://www.ics.uci.edu/Test/a/x> | <http://www.ics.uci.edu/Test/a/x> | |||
| regardless of the context in which the example document was | regardless of the context in which the example document was | |||
| obtained. | obtained. | |||
| E. Recommendations for Delimiting URIs in Context | E. Recommendations for Delimiting URI in Context | |||
| URIs are often transmitted through formats which do not provide a | URI are often transmitted through formats that do not provide a | |||
| clear context for their interpretation. For example, there are | clear context for their interpretation. For example, there are | |||
| many occasions when URIs are included in plain text; examples | many occasions when URI are included in plain text; examples | |||
| include text sent in electronic mail, USENET news messages, and, | include text sent in electronic mail, USENET news messages, and, | |||
| most importantly, printed on paper. In such cases, it is important | most importantly, printed on paper. In such cases, it is important | |||
| to be able to delimit the URI from the rest of the text, and in | to be able to delimit the URI from the rest of the text, and in | |||
| particular from punctuation marks that might be mistaken for part | particular from punctuation marks that might be mistaken for part | |||
| of the URI. | of the URI. | |||
| In practice, URIs are delimited in a variety of ways, but usually | In practice, URI are delimited in a variety of ways, but usually | |||
| within double-quotes "http://test.com/", angle brackets | within double-quotes "http://test.com/", angle brackets | |||
| <http://test.com/>, or just using whitespace | <http://test.com/>, or just using whitespace | |||
| http://test.com/ | http://test.com/ | |||
| These wrappers do not form part of the URI. | These wrappers do not form part of the URI. | |||
| In the case where a fragment identifier is associated with a URI | In the case where a fragment identifier is associated with a URI | |||
| reference, the fragment would be placed within the brackets as well | reference, the fragment would be placed within the brackets as well | |||
| (separated from the URI with a "#" character). | (separated from the URI with a "#" character). | |||
| In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) | In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) | |||
| may need to be added to break long URIs across lines. The | may need to be added to break long URI across lines. The | |||
| whitespace should be ignored when extracting the URI. | whitespace should be ignored when extracting the URI. | |||
| No whitespace should be introduced after a hyphen ("-") character. | No whitespace should be introduced after a hyphen ("-") character. | |||
| Because some typesetters and printers may (erroneously) introduce a | Because some typesetters and printers may (erroneously) introduce a | |||
| hyphen at the end of line when breaking a line, the interpreter of a | hyphen at the end of line when breaking a line, the interpreter of a | |||
| URI containing a line break immediately after a hyphen should ignore | URI containing a line break immediately after a hyphen should ignore | |||
| all unescaped whitespace around the line break, and should be aware | all unescaped whitespace around the line break, and should be aware | |||
| that the hyphen may or may not actually be part of the URI. | that the hyphen may or may not actually be part of the URI. | |||
| Using <> angle brackets around each URI is especially recommended | Using <> angle brackets around each URI is especially recommended | |||
| as a delimiting style for URIs that contain whitespace. | as a delimiting style for URI that contain whitespace. | |||
| The prefix "URL:" (with or without a trailing space) was | The prefix "URL:" (with or without a trailing space) was | |||
| recommended as a way to help distinguish a URL from other | recommended as a way to help distinguish a URL from other | |||
| bracketed designators, although this is not common in practice. | bracketed designators, although this is not common in practice. | |||
| For robustness, software that accepts user-typed URIs should | For robustness, software that accepts user-typed URI should | |||
| attempt to recognize and strip both delimiters and embedded | attempt to recognize and strip both delimiters and embedded | |||
| whitespace. | whitespace. | |||
| For example, the text: | For example, the text: | |||
| Yes, Jim, I found it under "http://www.w3.org/Addressing/", | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |||
| but you can probably pick it up from <ftp://ds.internic. | but you can probably pick it up from <ftp://ds.internic. | |||
| net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |||
| ietf/uri/historical.html#WARNING>. | ietf/uri/historical.html#WARNING>. | |||
| skipping to change at line 1505 | skipping to change at line 1543 | |||
| G. Summary of Non-editorial Changes | G. Summary of Non-editorial Changes | |||
| G.1. Additions | G.1. Additions | |||
| Section 4 (URI References) was added to stem the confusion | Section 4 (URI References) was added to stem the confusion | |||
| regarding "what is a URI" and how to describe fragment identifiers | regarding "what is a URI" and how to describe fragment identifiers | |||
| given that they are not part of the URI, but are part of the URI | given that they are not part of the URI, but are part of the URI | |||
| syntax and parsing concerns. In addition, it provides a reference | syntax and parsing concerns. In addition, it provides a reference | |||
| definition for use by other IETF specifications (HTML, HTTP, etc.) | definition for use by other IETF specifications (HTML, HTTP, etc.) | |||
| which have previously attempted to redefine the URI syntax in order | that have previously attempted to redefine the URI syntax in order | |||
| to account for the presence of fragment identifiers in URI | to account for the presence of fragment identifiers in URI | |||
| references. | references. | |||
| Section 2.4 was rewritten to clarify a number of misinterpretations | Section 2.4 was rewritten to clarify a number of misinterpretations | |||
| and to leave room for fully internationalized URIs. | and to leave room for fully internationalized URI. | |||
| Appendix F on abbreviated URLs was added to describe the shortened | Appendix F on abbreviated URLs was added to describe the shortened | |||
| references often seen on television and magazine advertisements and | references often seen on television and magazine advertisements and | |||
| explain why they are not used in other contexts. | explain why they are not used in other contexts. | |||
| G.2. Modifications from both RFC 1738 and RFC 1808 | G.2. Modifications from both RFC 1738 and RFC 1808 | |||
| Changed to URI syntax instead of just URL. | Changed to URI syntax instead of just URL. | |||
| Confusion regarding the terms "character encoding", the URI | Confusion regarding the terms "character encoding", the URI | |||
| skipping to change at line 1533 | skipping to change at line 1571 | |||
| names regarding the character sets have been changed to more | names regarding the character sets have been changed to more | |||
| accurately describe their purpose and to encompass all "characters" | accurately describe their purpose and to encompass all "characters" | |||
| rather than just US-ASCII octets. Unless otherwise noted here, | rather than just US-ASCII octets. Unless otherwise noted here, | |||
| these modifications do not affect the URI syntax. | these modifications do not affect the URI syntax. | |||
| Both RFC 1738 and RFC 1808 refer to the "reserved" set of | Both RFC 1738 and RFC 1808 refer to the "reserved" set of | |||
| characters as if URI-interpreting software were limited to a single | characters as if URI-interpreting software were limited to a single | |||
| set of characters with a reserved purpose (i.e., as meaning | set of characters with a reserved purpose (i.e., as meaning | |||
| something other than the data to which the characters correspond), | something other than the data to which the characters correspond), | |||
| and that this set was fixed by the URI scheme. However, this has | and that this set was fixed by the URI scheme. However, this has | |||
| not been true in practice; any character which is interpreted | not been true in practice; any character that is interpreted | |||
| differently when it is escaped is, in effect, reserved. | differently when it is escaped is, in effect, reserved. | |||
| Furthermore, the interpreting engine on an HTTP server is often | Furthermore, the interpreting engine on an HTTP server is often | |||
| dependent on the resource, not just the URI scheme. The | dependent on the resource, not just the URI scheme. The | |||
| description of reserved characters has been changed accordingly. | description of reserved characters has been changed accordingly. | |||
| The plus "+", dollar "$", and comma "," characters have been added to | The plus "+", dollar "$", and comma "," characters have been added to | |||
| those in the "reserved" set, since they are treated as reserved | those in the "reserved" set, since they are treated as reserved | |||
| within the query component. | within the query component. | |||
| The tilde "~" character was added to those in the "unreserved" set, | The tilde "~" character was added to those in the "unreserved" set, | |||
| since it is extensively used on the Internet in spite of the | since it is extensively used on the Internet in spite of the | |||
| difficulty of transcribing it with some keyboards. | difficulty of transcribing it with some keyboards. | |||
| The syntax for URI scheme has been changed to require that all | ||||
| schemes begin with an alpha character. | ||||
| The "user:password" form in the previous BNF was changed to | The "user:password" form in the previous BNF was changed to | |||
| a "userinfo" token, and the possibility that it might be | a "userinfo" token, and the possibility that it might be | |||
| "user:password" made scheme specific. In particular, the use | "user:password" made scheme specific. In particular, the use | |||
| of passwords in the clear is not even suggested by the syntax. | of passwords in the clear is not even suggested by the syntax. | |||
| The question-mark "?" character was removed from the set of allowed | The question-mark "?" character was removed from the set of allowed | |||
| characters for the userinfo in the authority component, since | characters for the userinfo in the authority component, since | |||
| testing showed that many applications treat it as reserved for | testing showed that many applications treat it as reserved for | |||
| separating the query component from the rest of the URI. | separating the query component from the rest of the URI. | |||
| skipping to change at line 1568 | skipping to change at line 1609 | |||
| reserved within the authority component, since several new schemes | reserved within the authority component, since several new schemes | |||
| are using it as a separator within userinfo to indicate the type | are using it as a separator within userinfo to indicate the type | |||
| of user authentication. | of user authentication. | |||
| RFC 1738 specified that the path was separated from the authority | RFC 1738 specified that the path was separated from the authority | |||
| portion of a URI by a slash. RFC 1808 followed suit, but with a | portion of a URI by a slash. RFC 1808 followed suit, but with a | |||
| fudge of carrying around the separator as a "prefix" in order to | fudge of carrying around the separator as a "prefix" in order to | |||
| describe the parsing algorithm. RFC 1630 never had this problem, | describe the parsing algorithm. RFC 1630 never had this problem, | |||
| since it considered the slash to be part of the path. In writing | since it considered the slash to be part of the path. In writing | |||
| this specification, it was found to be impossible to accurately | this specification, it was found to be impossible to accurately | |||
| describe and retain the difference between the two URIs | describe and retain the difference between the two URI | |||
| <foo:/bar> and <foo:bar> | <foo:/bar> and <foo:bar> | |||
| without either considering the slash to be part of the path (as | without either considering the slash to be part of the path (as | |||
| corresponds to actual practice) or creating a separate component just | corresponds to actual practice) or creating a separate component just | |||
| to hold that slash. We chose the former. | to hold that slash. We chose the former. | |||
| G.3. Modifications from RFC 1738 | G.3. Modifications from RFC 1738 | |||
| The definition of specific URL schemes and their scheme-specific | The definition of specific URL schemes and their scheme-specific | |||
| syntax and semantics has been moved to separate documents. | syntax and semantics has been moved to separate documents. | |||
| The URL host was defined as a fully-qualified domain name. However, | The URL host was defined as a fully-qualified domain name. However, | |||
| many URLs are used without fully-qualified domain names (in contexts | many URLs are used without fully-qualified domain names (in contexts | |||
| for which the full qualification is not necessary), without any host | for which the full qualification is not necessary), without any host | |||
| (as in some file URLs), or with a host of "localhost". | (as in some file URLs), or with a host of "localhost". | |||
| The URL port is now *digit instead of 1*digit, since systems are | The URL port is now *digit instead of 1*digit, since systems are | |||
| expected to handle the case where the ":" separator between host and | expected to handle the case where the ":" separator between host and | |||
| port is supplied without a port. | port is supplied without a port. | |||
| The recommendations for delimiting URIs in context (Appendix E) have | The recommendations for delimiting URI in context (Appendix E) have | |||
| been adjusted to reflect current practice. | been adjusted to reflect current practice. | |||
| G.4. Modifications from RFC 1808 | G.4. Modifications from RFC 1808 | |||
| RFC 1808 (Section 4) defined an empty URL reference (a reference | RFC 1808 (Section 4) defined an empty URL reference (a reference | |||
| containing nothing aside from the fragment identifier) as being a | containing nothing aside from the fragment identifier) as being a | |||
| reference to the base URL. Unfortunately, that definition could be | reference to the base URL. Unfortunately, that definition could be | |||
| interpreted, upon selection of such a reference, as a new retrieval | interpreted, upon selection of such a reference, as a new retrieval | |||
| action on that resource. Since the normal intent of such references | action on that resource. Since the normal intent of such references | |||
| is for the user agent to change its view of the current document to | is for the user agent to change its view of the current document to | |||
| the beginning of the specified fragment within that document, not to | the beginning of the specified fragment within that document, not to | |||
| make an additional request of the resource, a description of how to | make an additional request of the resource, a description of how to | |||
| correctly interpret an empty reference has been added in Section 4. | correctly interpret an empty reference has been added in Section 4. | |||
| The description of the mythical Base header field has been replaced | The description of the mythical Base header field has been replaced | |||
| with a reference to the Content-Location header field defined by | with a reference to the Content-Location header field defined by | |||
| MHTML [RFC2110]. | MHTML [RFC2110]. | |||
| RFC 1808 described various schemes as either having or not having the | RFC 1808 described various schemes as either having or not having the | |||
| properties of the generic-URI syntax. However, the only requirement | properties of the generic URI syntax. However, the only requirement | |||
| is that the particular document containing the relative references | is that the particular document containing the relative references | |||
| have a base URI which abides by the generic-URI syntax, regardless of | have a base URI that abides by the generic URI syntax, regardless of | |||
| the URI scheme, so the associated description has been updated to | the URI scheme, so the associated description has been updated to | |||
| reflect that. | reflect that. | |||
| The BNF term <net_loc> has been replaced with <authority>, since the | The BNF term <net_loc> has been replaced with <authority>, since the | |||
| latter more accurately describes its use and purpose. Likewise, the | latter more accurately describes its use and purpose. Likewise, the | |||
| authority is no longer restricted to the IP server syntax. | authority is no longer restricted to the IP server syntax. | |||
| Extensive testing of current client applications demonstrated that | Extensive testing of current client applications demonstrated that | |||
| the majority of deployed systems do not use the ";" character to | the majority of deployed systems do not use the ";" character to | |||
| indicate trailing parameter information, and that the presence of a | indicate trailing parameter information, and that the presence of a | |||
| semicolon in a path segment does not affect the relative parsing of | semicolon in a path segment does not affect the relative parsing of | |||
| that segment. Therefore, parameters have been removed as a separate | that segment. Therefore, parameters have been removed as a separate | |||
| component and may now appear in any path segment. Their influence | component and may now appear in any path segment. Their influence | |||
| has been removed from the algorithm for resolving a relative URI | has been removed from the algorithm for resolving a relative URI | |||
| reference. The resolution examples in Appendix C have been modified | reference. The resolution examples in Appendix C have been modified | |||
| to reflect this change. | to reflect this change. | |||
| Implementations are now allowed to work around misformed relative | ||||
| references that are prefixed by the same scheme as the base URI, | ||||
| but only for schemes known to use the <hier_part> syntax. | ||||
| H. Full Copyright Statement | H. Full Copyright Statement | |||
| Copyright (C) The Internet Society (1998). All Rights Reserved. | Copyright (C) The Internet Society (1998). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| included on all such copies and derivative works. However, this | included on all such copies and derivative works. However, this | |||
| End of changes. 104 change blocks. | ||||
| 127 lines changed or deleted | 172 lines changed or added | |||
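The delimiting guidance in the text above (wrappers such as double-quotes and angle brackets are not part of the URI, whitespace added to break a long URI across lines is ignored, and a "URL:" prefix may appear inside the brackets) can be illustrated with a minimal sketch. The function names `clean_uri` and `extract_uris` are hypothetical and do not come from the draft. The sketch also assumes that anything found inside a pair of wrappers is meant to be a URI, which will not hold for arbitrary quoted text.

```python
import re

# Hypothetical helpers, not defined by the draft: they only illustrate the
# delimiting conventions described in the text above.

# Matches <...> or "..." wrappers; the wrapped text may span several lines.
_DELIMITED = re.compile(r'<([^<>]*)>|"([^"]*)"')
# Optional "URL:" prefix, with or without trailing whitespace.
_URL_PREFIX = re.compile(r'^URL:\s*', re.IGNORECASE)
_WHITESPACE = re.compile(r'\s+')


def clean_uri(raw: str) -> str:
    """Strip an optional "URL:" prefix and all embedded whitespace."""
    raw = raw.strip()
    raw = _URL_PREFIX.sub("", raw)
    return _WHITESPACE.sub("", raw)


def extract_uris(text: str) -> list:
    """Return the cleaned content of every <...> or "..." wrapper, in order.

    Assumption: every wrapped span is intended to be a URI.
    """
    uris = []
    for match in _DELIMITED.finditer(text):
        # Exactly one of the two groups is set, depending on the wrapper.
        raw = match.group(1) if match.group(1) is not None else match.group(2)
        uris.append(clean_uri(raw))
    return uris


if __name__ == "__main__":
    sample = (
        'Yes, Jim, I found it under "http://www.w3.org/Addressing/",\n'
        "but you can probably pick it up from <ftp://ds.internic.\n"
        "net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/\n"
        "ietf/uri/historical.html#WARNING>."
    )
    for uri in extract_uris(sample):
        print(uri)
```

The sketch does not try to resolve the hyphen case discussed above: when a line break follows a hyphen, it removes the whitespace and keeps the hyphen, so a caller would still need to consider that the hyphen may have been introduced by a typesetter.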