| draft-fielding-uri-rfc2396bis-01.txt | draft-fielding-uri-rfc2396bis-02.txt | |||
|---|---|---|---|---|
| Network Working Group T. Berners-Lee | Network Working Group T. Berners-Lee | |||
| Internet-Draft MIT/LCS | Internet-Draft MIT/LCS | |||
| Updates: 1738 (if approved) R. Fielding | Updates: 1738 (if approved) R. Fielding | |||
| Obsoletes: 2732, 2396, 1808 (if approved) Day Software | Obsoletes: 2732, 2396, 1808 (if approved) Day Software | |||
| Expires: September 1, 2003 L. Masinter | L. Masinter | |||
| Adobe | Expires: November 21, 2003 Adobe | |||
| March 3, 2003 | May 23, 2003 | |||
| Uniform Resource Identifier (URI): Generic Syntax | Uniform Resource Identifier (URI): Generic Syntax | |||
| draft-fielding-uri-rfc2396bis-01 | draft-fielding-uri-rfc2396bis-02 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that other | Task Force (IETF), its areas, and its working groups. Note that other | |||
| groups may also distribute working documents as Internet-Drafts. | groups may also distribute working documents as Internet-Drafts. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt. | <http://www.ietf.org/ietf/1id-abstracts.txt>. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | <http://www.ietf.org/shadow.html>. | |||
| This Internet-Draft will expire on September 1, 2003. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||
| Abstract | Abstract | |||
| A Uniform Resource Identifier (URI) is a compact string of characters | A Uniform Resource Identifier (URI) is a compact string of characters | |||
| for identifying an abstract or physical resource. This document | for identifying an abstract or physical resource. This document | |||
| defines the generic syntax of a URI, including both absolute and | defines the generic syntax of a URI, including both absolute and | |||
| skipping to change at page 2, line 11 | skipping to change at page 2, line 9 | |||
| such that an implementation can parse the common components of a URI | such that an implementation can parse the common components of a URI | |||
| reference without knowing the scheme-specific requirements of every | reference without knowing the scheme-specific requirements of every | |||
| possible identifier type. This document does not define a generative | possible identifier type. This document does not define a generative | |||
| grammar for all URIs; that task will be performed by the individual | grammar for all URIs; that task will be performed by the individual | |||
| specifications of each URI scheme. | specifications of each URI scheme. | |||
| Editorial Note | Editorial Note | |||
| Discussion of this draft and comments to the editors should be sent | Discussion of this draft and comments to the editors should be sent | |||
| to the uri@w3.org mailing list. An issues list and version history | to the uri@w3.org mailing list. An issues list and version history | |||
| is available at <http://www.apache.org/~fielding/uri/rev-2002/>. | is available at <http://www.apache.org/~fielding/uri/rev-2002/ | |||
| issues.html>. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 5 | 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.3 Example URIs . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.4 Hierarchical URIs and Relative Forms . . . . . . . . . . . . 6 | 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.5 URI Transcribability . . . . . . . . . . . . . . . . . . . . 7 | 1.2 Design Considerations . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.6 Syntax Notation and Common Elements . . . . . . . . . . . . 8 | 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. URI Characters and Escape Sequences . . . . . . . . . . . . 9 | 1.2.2 Separating Identification from Interaction . . . . . . . . . 7 | |||
| 2.1 URIs and non-ASCII characters . . . . . . . . . . . . . . . 9 | 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . . . . 9 | |||
| 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . . 9 | ||||
| 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . 10 | ||||
| 2.1 Encoding of Characters . . . . . . . . . . . . . . . . . . . 10 | ||||
| 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 10 | 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 10 | |||
| 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 11 | 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.4 Escape Sequences . . . . . . . . . . . . . . . . . . . . . . 11 | 2.4 Escaped Characters . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 2.4.1 Escaped Encoding . . . . . . . . . . . . . . . . . . . . . . 11 | 2.4.1 Escaped Encoding . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 2.4.2 When to Escape and Unescape . . . . . . . . . . . . . . . . 11 | 2.4.2 When to Escape and Unescape . . . . . . . . . . . . . . . . 12 | |||
| 2.4.3 Excluded US-ASCII Characters . . . . . . . . . . . . . . . . 12 | 2.5 Excluded Characters . . . . . . . . . . . . . . . . . . . . 13 | |||
| 3. URI Syntactic Components . . . . . . . . . . . . . . . . . . 14 | 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 3.1 Scheme Component . . . . . . . . . . . . . . . . . . . . . . 15 | 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 3.2 Authority Component . . . . . . . . . . . . . . . . . . . . 15 | 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.2.1 Registry-based Naming Authority . . . . . . . . . . . . . . 16 | 3.2.1 User Information . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.2.2 Server-based Naming Authority . . . . . . . . . . . . . . . 16 | 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.3 Path Component . . . . . . . . . . . . . . . . . . . . . . . 18 | 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 3.4 Query Component . . . . . . . . . . . . . . . . . . . . . . 19 | 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 | |||
| 4. URI References . . . . . . . . . . . . . . . . . . . . . . . 20 | 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 4.1 Fragment Identifier . . . . . . . . . . . . . . . . . . . . 20 | 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 4.2 Same-document References . . . . . . . . . . . . . . . . . . 21 | 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 4.3 Parsing a URI Reference . . . . . . . . . . . . . . . . . . 21 | 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 5. Relative URI References . . . . . . . . . . . . . . . . . . 22 | 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 23 | 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 24 | 4.4 Same-document Reference . . . . . . . . . . . . . . . . . . 23 | |||
| 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 24 | 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 25 | 5. Relative Resolution . . . . . . . . . . . . . . . . . . . . 25 | |||
| 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 25 | 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 25 | |||
| 5.2 Resolving Relative References to Absolute Form . . . . . . . 25 | 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 26 | |||
| 6. URI Normalization and Comparison . . . . . . . . . . . . . . 29 | 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 26 | |||
| 6.1 URI Equivalence . . . . . . . . . . . . . . . . . . . . . . 29 | 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 27 | |||
| 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 29 | 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 30 | 5.2 Obtaining the Referenced URI . . . . . . . . . . . . . . . . 27 | |||
| 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 31 | 5.3 Recomposition of a Parsed URI . . . . . . . . . . . . . . . 29 | |||
| 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 32 | 5.4 Examples of Relative Resolution . . . . . . . . . . . . . . 30 | |||
| 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 32 | 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 30 | |||
| 6.3 Good Practice When Using URIs . . . . . . . . . . . . . . . 32 | 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 31 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . 34 | 6. Normalization and Comparison . . . . . . . . . . . . . . . . 33 | |||
| 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 34 | 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 34 | 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 7.3 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 35 | 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 34 | |||
| 7.4 Sensitive Information . . . . . . . . . . . . . . . . . . . 35 | 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 35 | |||
| 7.5 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 36 | 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 36 | |||
| 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 37 | 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 36 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . 38 | 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . 39 | 7. Security Considerations . . . . . . . . . . . . . . . . . . 38 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 40 | 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 38 | |||
| A. Collected BNF for URI . . . . . . . . . . . . . . . . . . . 42 | 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 38 | |||
| B. Parsing a URI Reference with a Regular Expression . . . . . 43 | 7.3 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 39 | |||
| C. Examples of Resolving Relative URI References . . . . . . . 44 | 7.4 Sensitive Information . . . . . . . . . . . . . . . . . . . 39 | |||
| C.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 44 | 7.5 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 39 | |||
| C.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 44 | 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 41 | |||
| D. Embedding the Base URI in HTML documents . . . . . . . . . . 46 | Normative References . . . . . . . . . . . . . . . . . . . . 42 | |||
| E. Recommendations for Delimiting URI in Context . . . . . . . 47 | Informative References . . . . . . . . . . . . . . . . . . . 43 | |||
| F. Abbreviated URIs . . . . . . . . . . . . . . . . . . . . . . 49 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 45 | |||
| G. Summary of Non-editorial Changes . . . . . . . . . . . . . . 50 | A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . 46 | |||
| G.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 50 | B. Parsing a URI Reference with a Regular Expression . . . . . 47 | |||
| G.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 50 | C. Embedding the Base URI in HTML documents . . . . . . . . . . 48 | |||
| Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 | D. Delimiting a URI in Context . . . . . . . . . . . . . . . . 49 | |||
| Intellectual Property and Copyright Statements . . . . . . . 55 | E. Summary of Non-editorial Changes . . . . . . . . . . . . . . 51 | |||
| E.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 51 | ||||
| E.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 51 | ||||
| Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 | ||||
| Intellectual Property and Copyright Statements . . . . . . . 57 | ||||
| 1. Introduction | 1. Introduction | |||
| A Uniform Resource Identifier (URI) provides a simple and extensible | A Uniform Resource Identifier (URI) provides a simple and extensible | |||
| means for identifying a resource. This specification of URI syntax | means for identifying a resource. This specification of URI syntax | |||
| and semantics is derived from concepts introduced by the World Wide | and semantics is derived from concepts introduced by the World Wide | |||
| Web global information initiative, whose use of such objects dates | Web global information initiative, whose use of such identifiers | |||
| from 1990 and is described in "Universal Resource Identifiers in WWW" | dates from 1990 and is described in "Universal Resource Identifiers | |||
| [RFC1630], and is designed to meet the recommendations laid out in | in WWW" [RFC1630], and is designed to meet the recommendations laid | |||
| "Functional Recommendations for Internet Resource Locators" [RFC1736] | out in "Functional Recommendations for Internet Resource Locators" | |||
| and "Functional Requirements for Uniform Resource Names" [RFC1737]. | [RFC1736] and "Functional Requirements for Uniform Resource Names" | |||
| [RFC1737]. | ||||
| This document obsoletes [RFC2396], which merged "Uniform Resource | This document obsoletes [RFC2396], which merged "Uniform Resource | |||
| Locators" [RFC1738] and "Relative Uniform Resource Locators" | Locators" [RFC1738] and "Relative Uniform Resource Locators" | |||
| [RFC1808] in order to define a single, generic syntax for all URIs. | [RFC1808] in order to define a single, generic syntax for all URIs. | |||
| It excludes those portions of RFC 1738 that defined the specific | It excludes those portions of RFC 1738 that defined the specific | |||
| syntax of individual URI schemes; those portions will be updated as | syntax of individual URI schemes; those portions will be updated as | |||
| separate documents. The process for registration of new URI schemes | separate documents. The process for registration of new URI schemes | |||
| is defined separately by [RFC2717]. | is defined separately by [RFC2717]. | |||
| All significant changes from RFC 2396 are noted in Appendix G. | All significant changes from RFC 2396 are noted in Appendix G. | |||
| 1.1 Overview of URIs | 1.1 Overview of URIs | |||
| URIs are characterized by the following definitions: | URIs are characterized as follows: | |||
| Uniform | Uniform | |||
| Uniformity provides several benefits: it allows different types of | Uniformity provides several benefits: it allows different types of | |||
| resource identifiers to be used in the same context, even when the | resource identifiers to be used in the same context, even when the | |||
| mechanisms used to access those resources may differ; it allows | mechanisms used to access those resources may differ; it allows | |||
| uniform semantic interpretation of common syntactic conventions | uniform semantic interpretation of common syntactic conventions | |||
| across different types of resource identifiers; it allows | across different types of resource identifiers; it allows | |||
| introduction of new types of resource identifiers without | introduction of new types of resource identifiers without | |||
| interfering with the way that existing identifiers are used; and, | interfering with the way that existing identifiers are used; and, | |||
| it allows the identifiers to be reused in many different contexts, | it allows the identifiers to be reused in many different contexts, | |||
| thus permitting new applications or protocols to leverage a | thus permitting new applications or protocols to leverage a | |||
| pre-existing, large, and widely-used set of resource identifiers. | pre-existing, large, and widely-used set of resource identifiers. | |||
| Resource | Resource | |||
| A resource can be anything that has identity. Familiar examples | Anything that can be named or described can be a resource. | |||
| include an electronic document, an image, a service (e.g., | Familiar examples include an electronic document, an image, a | |||
| "today's weather report for Los Angeles"), and a collection of | service (e.g., "today's weather report for Los Angeles"), and a | |||
| other resources. Not all resources are network "retrievable"; | collection of other resources. A resource is not necessarily | |||
| e.g., human beings, corporations, and bound books in a library can | accessible via the Internet; e.g., human beings, corporations, and | |||
| also be considered resources. | bound books in a library can also be resources. Likewise, abstract | |||
| concepts can be resources, such as the operators and operands of a | ||||
| The resource is the conceptual mapping to an entity or set of | mathematical equation or the types of a relationship (e.g., | |||
| entities, not necessarily the entity which corresponds to that | "parent" or "employee"). | |||
| mapping at any particular instance in time. Thus, a resource can | ||||
| remain constant even when its content---the entities to which it | ||||
| currently corresponds---changes over time, provided that the | ||||
| conceptual mapping is not changed in the process. | ||||
| Identifier | Identifier | |||
| An identifier is an object that can act as a reference to | An identifier embodies the information required to distinguish | |||
| something that has identity. In the case of a URI, the object is | what is being identified from all other things within its scope of | |||
| a sequence of characters with a restricted syntax. | identification. | |||
| Having identified a resource, a system may perform a variety of | ||||
| operations on the resource, as might be characterized by such words | ||||
| as `access', `update', `replace', or `find attributes'. | ||||
| 1.2 URI, URL, and URN | ||||
| A URI can be further classified as a locator, a name, or both. The | A URI is an identifier that consists of a sequence of characters | |||
| term "Uniform Resource Locator" (URL) refers to the subset of URIs | matching the restricted syntax defined by this specification. A URI | |||
| that, in addition to identifying the resource, provide a means of | can be used to refer to a resource. This specification does not | |||
| locating the resource by describing its primary access mechanism | place any limits on the nature of a resource or the reasons why an | |||
| (e.g., its network "location"). The term "Uniform Resource Name" | application might wish to refer to a resource. URIs have a global | |||
| (URN) refers to the subset of URIs that are required to remain | scope and should be interpreted consistently regardless of context, | |||
| globally unique and persistent even when the resource ceases to exist | but that interpretation may be defined in relation to the user's | |||
| or becomes unavailable. | context (e.g., "http://localhost/" refers to a resource that is | |||
| relative to the user's network interface and yet not specific to any | ||||
| one user). | ||||
| An individual scheme does not need to be cast into one of a discrete | 1.1.1 Generic Syntax | |||
| set of URI types such as "URL", "URN", "URC", etc. Any given URI | ||||
| scheme may define subspaces that have the characteristics of a name, | ||||
| a locator, or both, often depending on the persistence and care in | ||||
| the assignment of identifiers by the naming authority, rather than on | ||||
| any quality of the URI scheme. For that reason, this specification | ||||
| deprecates use of the terms URL or URN to distinguish between | ||||
| schemes, instead using the term URI throughout. | ||||
| Each URI scheme (Section 3.1) defines the namespace of the URI, and | Each URI begins with a scheme name, as defined in Section 3.1, that | |||
| thus may further restrict the syntax and semantics of identifiers | refers to a specification for assigning identifiers within that | |||
| using that scheme. This specification defines those elements of the | scheme. As such, the URI syntax is a federated and extensible naming | |||
| URI syntax that are either required of all URI schemes or are common | system wherein each scheme's specification may further restrict the | |||
| to many URI schemes. It thus defines the syntax and semantics that | syntax and semantics of identifiers using that scheme. | |||
| are needed to implement a scheme-independent parsing mechanism for | ||||
| URI references, such that the scheme-dependent handling of a URI can | ||||
| be postponed until the scheme-dependent semantics are needed. | ||||
| Although many URI schemes are named after protocols, this does not | This specification defines those elements of the URI syntax that are | |||
| imply that use of such a URI will result in access to the resource | required of all URI schemes or are common to many URI schemes. It | |||
| via the named protocol. URIs are often used in contexts that are | thus defines the syntax and semantics that are needed to implement a | |||
| purely for identification, just like any other identifier. Even when | scheme-independent parsing mechanism for URI references, such that | |||
| a URI is used to obtain a representation of a resource, that access | the scheme-dependent handling of a URI can be postponed until the | |||
| might be through gateways, proxies, caches, and name resolution | scheme-dependent semantics are needed. Likewise, protocols and data | |||
| services that are independent of the protocol of the resource origin, | formats that make use of URI references can refer to this | |||
| and the resolution of some URIs may require the use of more than one | specification as defining the range of syntax allowed for all URIs, | |||
| protocol (e.g., both DNS and HTTP are typically used to access an | including those schemes that have yet to be defined. | |||
| "http" URI's resource when it can't be found in a local cache). | ||||
| A parser of the generic URI syntax is capable of parsing any URI | A parser of the generic URI syntax is capable of parsing any URI | |||
| reference into its major components; once the scheme is determined, | reference into its major components; once the scheme is determined, | |||
| further scheme-specific parsing can be performed on the components. | further scheme-specific parsing can be performed on the components. | |||
| In other words, the URI generic syntax is a superset of the syntax of | In other words, the URI generic syntax is a superset of the syntax of | |||
| all URI schemes. | all URI schemes. | |||
| 1.3 Example URIs | 1.1.2 Examples | |||
| The following examples illustrate URIs that are in common use. | The following examples illustrate URIs that are in common use. | |||
| ftp://ftp.is.co.za/rfc/rfc1808.txt | ftp://ftp.is.co.za/rfc/rfc1808.txt | |||
| -- ftp scheme for File Transfer Protocol services | -- ftp scheme for File Transfer Protocol services | |||
| gopher://gopher.tc.umn.edu:70/11/Mailing%20Lists/ | gopher://gopher.tc.umn.edu:70/11/Mailing%20Lists/ | |||
| -- gopher scheme for Gopher and Gopher+ Protocol services | -- gopher scheme for Gopher and Gopher+ Protocol services | |||
| http://www.ietf.org/rfc/rfc2396.txt | http://www.ietf.org/rfc/rfc2396.txt | |||
| skipping to change at page 6, line 40 | skipping to change at page 6, line 27 | |||
| mailto:John.Doe@example.com | mailto:John.Doe@example.com | |||
| -- mailto scheme for electronic mail addresses | -- mailto scheme for electronic mail addresses | |||
| news:comp.infosystems.www.servers.unix | news:comp.infosystems.www.servers.unix | |||
| -- news scheme for USENET news groups and articles | -- news scheme for USENET news groups and articles | |||
| telnet://melvyl.ucop.edu/ | telnet://melvyl.ucop.edu/ | |||
| -- telnet scheme for interactive TELNET services | -- telnet scheme for interactive TELNET services | |||
| 1.4 Hierarchical URIs and Relative Forms | 1.1.3 URI, URL, and URN | |||
| An absolute identifier refers to a resource independent of the | A URI can be further classified as a locator, a name, or both. The | |||
| context in which the identifier is used. In contrast, a relative | term "Uniform Resource Locator" (URL) refers to the subset of URIs | |||
| identifier refers to a resource by describing the difference within a | that, in addition to identifying the resource, provide a means of | |||
| hierarchical namespace between the current context and an absolute | locating the resource by describing its primary access mechanism | |||
| identifier of the resource. | (e.g., its network "location"). The term "Uniform Resource Name" | |||
| (URN) refers to the subset of URIs that are required to remain | ||||
| globally unique and persistent even when the resource ceases to exist | ||||
| or becomes unavailable. | ||||
| Some URI schemes support a hierarchical naming system, where the | An individual scheme does not need to be classified as being just one | |||
| hierarchy of the name is denoted by a "/" delimiter separating the | of "name" or "locator". Instances of URIs from any given scheme may | |||
| components in the scheme. This document defines a scheme-independent | have the characteristics of names or locators or both, often | |||
| `relative' form of URI reference that can be used in conjunction with | depending on the persistence and care in the assignment of | |||
| a `base' URI of a hierarchical scheme to produce the `absolute' URI | identifiers by the naming authority, rather than any quality of the | |||
| form of the reference. The syntax of a hierarchical URI is described | scheme. This specification deprecates use of the term "URN" for | |||
| in Section 3; the relative URI calculation is described in Section 5. | anything but URIs in the "urn" scheme [RFC2141]. This specification | |||
| also deprecates the term "URL". | ||||
| 1.5 URI Transcribability | 1.2 Design Considerations | |||
| The URI syntax was designed with global transcribability as one of | 1.2.1 Transcription | |||
| its main concerns. A URI is a sequence of characters from a very | ||||
| limited set, i.e. the letters of the basic Latin alphabet, digits, | The URI syntax has been designed with global transcription as one of | |||
| its main considerations. A URI is a sequence of characters from a | ||||
| very limited set: the letters of the basic Latin alphabet, digits, | ||||
| and a few special characters. A URI may be represented in a variety | and a few special characters. A URI may be represented in a variety | |||
| of ways: e.g., ink on paper, pixels on a screen, or a sequence of | of ways: e.g., ink on paper, pixels on a screen, or a sequence of | |||
| octets in a coded character set. The interpretation of a URI depends | octets in a coded character set. The interpretation of a URI depends | |||
| only on the characters used and not how those characters are | only on the characters used and not how those characters are | |||
| represented in a network protocol. | represented in a network protocol. | |||
| The goal of transcribability can be described by a simple scenario. | The goal of transcription can be described by a simple scenario. | |||
| Imagine two colleagues, Sam and Kim, sitting in a pub at an | Imagine two colleagues, Sam and Kim, sitting in a pub at an | |||
| international conference and exchanging research ideas. Sam asks Kim | international conference and exchanging research ideas. Sam asks Kim | |||
| for a location to get more information, so Kim writes the URI for the | for a location to get more information, so Kim writes the URI for the | |||
| research site on a napkin. Upon returning home, Sam takes out the | research site on a napkin. Upon returning home, Sam takes out the | |||
| napkin and types the URI into a computer, which then retrieves the | napkin and types the URI into a computer, which then retrieves the | |||
| information to which Kim referred. | information to which Kim referred. | |||
| There are several design concerns revealed by the scenario: | There are several design considerations revealed by the scenario: | |||
| o A URI is a sequence of characters, which is not always represented | o A URI is a sequence of characters that is not always represented | |||
| as a sequence of octets. | as a sequence of octets. | |||
| o A URI may be transcribed from a non-network source, and thus | o A URI might be transcribed from a non-network source, and thus | |||
| should consist of characters that are most likely to be able to be | should consist of characters that are most likely to be able to be | |||
| typed into a computer, within the constraints imposed by keyboards | entered into a computer, within the constraints imposed by | |||
| (and related input devices) across languages and locales. | keyboards (and related input devices) across languages and | |||
| locales. | ||||
| o A URI often needs to be remembered by people, and it is easier for | o A URI often needs to be remembered by people, and it is easier for | |||
| people to remember a URI when it consists of meaningful | people to remember a URI when it consists of meaningful or | |||
| components. | familiar components. | |||
| These design concerns are not always in alignment. For example, it | These design considerations are not always in alignment. For | |||
| is often the case that the most meaningful name for a URI component | example, it is often the case that the most meaningful name for a URI | |||
| would require characters that cannot be typed into some systems. The | component would require characters that cannot be typed into some | |||
| ability to transcribe the resource identifier from one medium to | systems. The ability to transcribe a resource identifier from one | |||
| another was considered more important than having its URI consist of | medium to another has been considered more important than having a | |||
| the most meaningful of components. In local and regional contexts | URI consist of the most meaningful of components. In local or | |||
| and with improving technology, users might benefit from being able to | regional contexts and with improving technology, users might benefit | |||
| use a wider range of characters; such use is not defined in this | from being able to use a wider range of characters; such use is not | |||
| document. | defined in this document. | |||
| 1.6 Syntax Notation and Common Elements | 1.2.2 Separating Identification from Interaction | |||
| This document uses two conventions to describe and define the syntax | A common misunderstanding of URIs is that they are only used to refer | |||
| for URI. The first, called the layout form, is a general description | to accessible resources. In fact, the URI alone only provides | |||
| of the order of components and component separators, as in | identification; access to the resource is neither guaranteed nor | |||
| implied by the presence of a URI. Instead, an operation (if any) | ||||
| associated with a URI reference is defined by the protocol element, | ||||
| data format attribute, or natural language text in which it appears. | ||||
| <first>/<second>;<third>?<fourth> | Given a URI, a system may attempt to perform a variety of operations | |||
| on the resource, as might be characterized by such words as "denote", | ||||
| "access", "update", "replace", or "find attributes". Such operations | ||||
| are defined by the protocols that make use of URIs, not by this | ||||
| specification. However, we do use a few general terms for describing | ||||
| common operations on URIs. URI "resolution" is the process of | ||||
| determining an access mechanism and the appropriate parameters | ||||
| necessary to dereference a URI; such resolution may require several | ||||
| iterations. Using that access mechanism to perform some action on | ||||
| the URI's resource is termed a "dereference" of the URI. | ||||
| The component names are enclosed in angle-brackets and any characters | When URIs are used within information systems to identify sources of | |||
| outside angle-brackets are literal separators. Whitespace should be | information, the most common form of URI dereference is "retrieval": | |||
| ignored. These descriptions are used informally and do not define | making use of a URI in order to retrieve a representation of its | |||
| the syntax requirements. | associated resource. A "representation" is a sequence of octets, | |||
| along with metadata describing those octets, that constitutes a | ||||
| record of the state of the resource at the time that the | ||||
| representation is generated. Retrieval is achieved by a process that | ||||
| might include using the URI as a cache key to check for a locally | ||||
| cached representation, resolution of the URI to determine an | ||||
| appropriate access mechanism (if any), and dereference of the URI for | ||||
| the sake of applying a retrieval operation. | ||||
| The second convention is a formal grammar defined using the Augmented | URI references in information systems are designed to be | |||
| Backus-Naur Form (ABNF) notation of [RFC2234]. Although the ABNF | late-binding: the result of an access is generally determined at the | |||
| defines syntax in terms of the ASCII character encoding [ASCII], the | time it is accessed and may vary over time or due to other aspects of | |||
| URI syntax should be interpreted in terms of the character that the | the interaction. When an author creates a reference to such a | |||
| ASCII-encoded octet represents, rather than the octet encoding | resource, they do so with the intention that the reference be used in | |||
| itself. How a URI is represented in terms of bits and bytes on the | the future; what is being identified is not some specific result that | |||
| wire is dependent upon the character encoding of the protocol used to | was obtained in the past, but rather some characteristic that is | |||
| transport it, or the charset of the document that contains it. | expected to be true for future results. In such cases, the resource | |||
| referred to by the URI is actually a sameness of characteristics as | ||||
| observed over time, perhaps elucidated by additional comments or | ||||
| assertions made by the resource provider. | ||||
| The complete URI syntax is collected in Appendix A. | Although many URI schemes are named after protocols, this does not | |||
| imply that use of such a URI will result in access to the resource | ||||
| via the named protocol. URIs are often used simply for the sake of | ||||
| identification. Even when a URI is used to retrieve a representation | ||||
| of a resource, that access might be through gateways, proxies, | ||||
| caches, and name resolution services that are independent of the | ||||
| protocol associated with the scheme name, and the resolution of some | ||||
| URIs may require the use of more than one protocol (e.g., both DNS | ||||
| and HTTP are typically used to access an "http" URI's origin server | ||||
| when a representation isn't found in a local cache). | ||||
| 2. URI Characters and Escape Sequences | 1.2.3 Hierarchical Identifiers | |||
| A URI consists of a restricted set of characters, primarily chosen | The URI syntax is organized hierarchically, with components listed in | |||
| to aid transcribability and usability both in computer systems and in | decreasing order from left to right. For some URI schemes, the | |||
| non-computer communications. Characters used conventionally as | visible hierarchy is limited to the scheme itself: everything after | |||
| delimiters around a URI are excluded. The restricted set of | the scheme component delimiter is considered opaque to URI | |||
| characters consists of digits, letters, and a few graphic symbols | processing. Other URI schemes make the hierarchy explicit and visible | |||
| chosen from those common to most of the character encodings and input | to generic parsing algorithms. | |||
| facilities available to Internet users. | ||||
| uric = reserved / unreserved / escaped | The URI syntax reserves the slash ("/"), question-mark ("?"), and | |||
| crosshatch ("#") characters for the purpose of delimiting components | ||||
| that are significant to the generic parser's hierarchical | ||||
| interpretation of an identifier. In addition to aiding the | ||||
| readability of such identifiers through the consistent use of | ||||
| familiar syntax, this uniform representation of hierarchy across | ||||
| naming schemes allows scheme-independent references to be made | ||||
| relative to that hierarchy. | ||||
| Within a URI, characters are either used as delimiters or to | An "absolute" URI refers to a resource independent of the naming | |||
| represent strings of data (octets) within the delimited portions. | hierarchy in which the identifier is used. In contrast, a "relative" | |||
| Octets are either represented directly by a character (using the | URI refers to a resource by describing the difference within a | |||
| US-ASCII character for that octet [ASCII]) or by an escape encoding. | hierarchical name space between the current context and an absolute | |||
| This representation is elaborated below. | URI of the resource. Section 4.2 defines a scheme-independent form | |||
| of relative URI reference that can be used in conjunction with a base | ||||
| URI of a hierarchical scheme to produce the absolute URI form of that | ||||
| reference. | ||||
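
As a rough illustration of the absolute/relative distinction in the -02 text, the sketch below resolves a few relative references against a base URI with Python's `urllib.parse.urljoin`. The base URI and the references are invented for the example, and `urljoin` implements Python's own reading of the resolution rules, so edge cases may differ from the algorithm this draft specifies.

```python
from urllib.parse import urljoin

# Hypothetical base URI of a hierarchical scheme (not taken from the draft).
BASE = "http://example.com/a/b/c"


def resolve(reference: str, base: str = BASE) -> str:
    """Return the absolute form of `reference` relative to `base`."""
    return urljoin(base, reference)


# Relative references describe a difference from the current context.
print(resolve("d"))      # sibling of "c"
print(resolve("../x"))   # one level up, then "x"
print(resolve("/top"))   # path replaced, authority kept

# An absolute URI is independent of the context in which it is used.
print(resolve("mailto:kim@example.com"))
```

The last call returns its argument unchanged, which is the practical meaning of "independent of the naming hierarchy in which the identifier is used".
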
| 2.1 URIs and non-ASCII characters | 1.3 Syntax Notation | |||
| The relationship between URIs and characters has been a source of | This document uses the Augmented Backus-Naur Form (ABNF) notation of | |||
| confusion for characters that are not part of US-ASCII. To describe | [RFC2234] to define the URI syntax. Although the ABNF defines syntax | |||
| the relationship, it is useful to distinguish between a "character" | in terms of the US-ASCII character encoding [ASCII], the URI syntax | |||
| (as a distinguishable semantic entity) and an "octet" (an 8-bit | should be interpreted in terms of the character that the | |||
| byte). There are two mappings, one from URI characters to octets, and | ASCII-encoded octet represents, rather than the octet encoding | |||
| a second from octets to original characters: | itself. How a URI is represented in terms of bits and bytes on the | |||
| wire is dependent upon the character encoding of the protocol used to | ||||
| transport it, or the charset of the document that contains it. | ||||
| URI character sequence->octet sequence->original character sequence | The following core ABNF productions are used by this specification as | |||
| defined by Section 6.1 of [RFC2234]: ALPHA, CR, CTL, DIGIT, DQUOTE, | ||||
| HEXDIG, LF, OCTET, and SP. The complete URI syntax is collected in | ||||
| Appendix A. | ||||
| A URI is represented as a sequence of characters, not as a sequence | 2. Characters | |||
| of octets. That is because a URI might be "transported" by means that | ||||
| are not through a computer network, e.g., printed on paper, read over | ||||
| the radio, etc. | ||||
| Within a delimited component of a URI, a sequence of characters is | A URI consists of a restricted set of characters, primarily chosen | |||
| used to represent a sequence of octets. For example, the character | to aid transcription and usability both in computer systems and in | |||
| "a" represents the octet 97 (decimal), while the character sequence | non-computer communications. Characters used conventionally as | |||
| "%", "0", "a" represents the octet 10 (decimal). | delimiters around a URI are excluded. The set of URI characters | |||
| consists of digits, letters, and a few graphic symbols chosen from | ||||
| those common to most of the character encodings and input facilities | ||||
| available to Internet users. | ||||
| There is a second translation for some resources: the sequence of | uric = reserved / unreserved / escaped | |||
| octets defined by a component of the URI is subsequently used to | ||||
| represent a sequence of characters. A 'charset' defines this mapping. | ||||
| There are many charsets in use in Internet protocols. For example, | ||||
| UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences | ||||
| of characters in the repertoire of ISO 10646. | ||||
| In the simplest case, the original character sequence contains only | Within a URI, reserved characters are used to delimit syntax | |||
| characters that are defined in US-ASCII, and the two levels of | components, unreserved characters are used to describe registered | |||
| mapping are simple and easily invertible: each 'original character' | names, and unreserved, non-delimiting reserved, and escaped | |||
| is represented as the octet for the US-ASCII code for it, which is, | characters are used to represent strings of data (1*OCTET) within the | |||
| in turn, represented as either the US-ASCII character, or else the | components. | |||
| "%" escape sequence for that octet. | ||||
| For original character sequences that contain non-ASCII characters, | 2.1 Encoding of Characters | |||
| however, the situation is more difficult. Internet protocols that | ||||
| transmit octet sequences intended to represent character sequences | ||||
| are expected to provide some way of identifying the charset used, if | ||||
| there might be more than one [RFC2277]. However, there is currently | ||||
| no provision within the generic URI syntax to accomplish this | ||||
| identification. An individual URI scheme may require a single | ||||
| charset, define a default charset, or provide a way to indicate the | ||||
| charset used. For example, a new scheme "foo" might be defined such | ||||
| that any escaped octet is keyed to the UTF-8 encoding in order to | ||||
| determine the corresponding Unicode character. | ||||
| It is expected that a systematic treatment of character encoding | As described above (Section 1.3), the URI syntax is defined in terms | |||
| within URIs will be developed as a future modification of this | of characters by reference to the US-ASCII encoding of characters to | |||
| specification. | octets. This specification does not mandate the use of any | |||
| particular mapping between its character set and the octets used to | ||||
| store or transmit those characters. | ||||
| URI characters representing strings of data within a component may, | ||||
| if allowed by the component production, represent an arbitrary | ||||
| sequence of octets. For example, portions of a given URI might | ||||
| correspond to a filename on a non-ASCII file system, a query on | ||||
| non-ASCII data, numeric coordinates on a map, etc. Some URI schemes | ||||
| define a specific encoding of raw data to US-ASCII characters as part | ||||
| of their scheme-specific requirements. Most URI schemes represent | ||||
| data octets by the US-ASCII character corresponding to that octet, | ||||
| either directly in the form of the character's glyph or by use of an | ||||
| escape triplet (Section 2.4). | ||||
| When a URI scheme defines a component that represents textual data | ||||
| consisting of characters from the Unicode (ISO 10646) character set, | ||||
| we recommend that the data first be encoded as octets according to | ||||
| the UTF-8 [UTF-8] character encoding, and that any octets not | ||||
| in the unreserved character set then be escaped. | ||||
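
The -02 recommendation (encode as UTF-8, then escape every octet outside the unreserved set) can be sketched with `urllib.parse.quote`. The helper name `escape_text` is an assumption made for this example; the `safe` argument lists the draft's "mark" characters that `quote` would otherwise escape.

```python
from urllib.parse import quote

# quote() always leaves ALPHA, DIGIT, "-", "_" and "." unescaped; the
# remaining "mark" characters of this draft are listed here explicitly.
EXTRA_MARKS = "!~*'()"


def escape_text(text: str) -> str:
    """Encode `text` as UTF-8, then escape all octets that do not
    correspond to an unreserved character (hypothetical helper)."""
    return quote(text, safe=EXTRA_MARKS, encoding="utf-8")


print(escape_text("résumé (v2)"))  # r%C3%A9sum%C3%A9%20(v2)
```

Reserved characters such as "/" are escaped as well, so a helper like this is only suitable for the data inside a single component, not for a complete URI.
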
| 2.2 Reserved Characters | 2.2 Reserved Characters | |||
| Many URI include components consisting of, or delimited by, certain | URIs include components and sub-components that are delimited by | |||
| special characters. These characters are called "reserved", since | certain special characters. These characters are called "reserved", | |||
| their usage within the URI component is limited to their reserved | since their usage within a URI component is limited to their reserved | |||
| purpose. If the data for a URI component would conflict with the | purpose within that component. If data for a URI component would | |||
| reserved purpose, then the conflicting data must be escaped before | conflict with the reserved purpose, then the conflicting data must be | |||
| forming the URI. | escaped (Section 2.4) before forming the URI. | |||
| reserved = "[" / "]" / ";" / "/" / "?" / | reserved = "/" / "?" / "#" / "[" / "]" / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| The "reserved" syntax class above refers to those characters that are | Reserved characters are used as delimiters of the generic URI | |||
| allowed within a URI, but which may not be allowed within a | components described in Section 3, as well as within those components | |||
| particular component of the generic URI syntax; they are used as | for delimiting sub-components. A component's ABNF syntax rule will | |||
| delimiters of the components described in Section 3. | not use the "reserved" production directly; instead, each rule lists | |||
| those reserved characters that are allowed within that component. | ||||
| Allowed reserved characters that are not assigned a sub-component | ||||
| delimiter role by this specification should be considered reserved | ||||
| for special use by whatever software generates the URI (i.e., they | ||||
| may be used to delimit or indicate information that is significant to | ||||
| interpretation of the identifier, but that significance is outside | ||||
| the scope of this specification). Outside of the URI's origin, a | ||||
| reserved character cannot be escaped without fear of changing how it | ||||
| will be interpreted; likewise, an escaped octet that corresponds to a | ||||
| reserved character cannot be unescaped outside the software that is | ||||
| responsible for interpreting it during URI resolution. | ||||
| Characters in the "reserved" set are not reserved in all contexts. | The slash ("/"), question-mark ("?"), and crosshatch ("#") characters | |||
| The set of characters actually reserved within any given URI | are reserved in all URI for the purpose of delimiting components that | |||
| component is defined by that component. In general, a character is | are significant to the generic parser's hierarchical interpretation | |||
| reserved if the semantics of the URI changes if the character is | of an identifier. The hierarchical prefix of a URI, wherein the | |||
| replaced with its escaped US-ASCII encoding. | slash ("/") character signifies a hierarchy delimiter, extends from | |||
| the scheme (Section 3.1) through to the first question-mark ("?"), | ||||
| crosshatch ("#"), or the end of the URI string. In other words, the | ||||
| slash ("/") character is not treated as a hierarchical separator | ||||
| within the query (Section 3.4) and fragment (Section 3.5) components | ||||
| of a URI, but is still considered reserved within those components | ||||
| for purposes outside the scope of this specification. | ||||
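
A minimal sketch of the rule in the -02 text: the hierarchical prefix ends at the first "?" or "#", and only within that prefix does "/" act as a hierarchy delimiter. The function name and the example URI are assumptions for illustration.

```python
def split_hierarchical(uri: str):
    """Split `uri` into (hierarchical prefix, query, fragment).

    The fragment is cut first so that a "?" appearing after "#" stays
    inside the fragment. Absent components come back as empty strings,
    so this sketch does not distinguish "absent" from "empty".
    """
    rest, _, fragment = uri.partition("#")
    prefix, _, query = rest.partition("?")
    return prefix, query, fragment


prefix, query, fragment = split_hierarchical(
    "http://example.com/a/b?x=/y#frag/z"
)
print(prefix)    # http://example.com/a/b
print(query)     # x=/y    (the "/" here is not a hierarchy delimiter)
print(fragment)  # frag/z
```
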
| 2.3 Unreserved Characters | 2.3 Unreserved Characters | |||
| Data characters that are allowed in a URI but do not have a reserved | Data characters that are allowed in a URI but do not have a reserved | |||
| purpose are called unreserved. These include upper and lower case | purpose are called unreserved. These include uppercase and lowercase | |||
| letters, decimal digits, and a limited set of punctuation marks and | letters, decimal digits, and a limited set of punctuation marks and | |||
| symbols. | symbols. | |||
| unreserved = ALPHA / DIGIT / mark | unreserved = ALPHA / DIGIT / mark | |||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")" | mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")" | |||
| Unreserved characters can be escaped without changing the semantics | Unreserved characters can be escaped without changing the semantics | |||
| of the URI, but this should not be done unless the URI is being used | of a URI, but this should not be done unless the URI is being used in | |||
| in a context that does not allow the unescaped character to appear. | a context that does not allow the unescaped character to appear. URI | |||
| URI normalization processes may unescape sequences in the ranges of | normalization processes may unescape sequences in the ranges of ALPHA | |||
| ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), underscore (%5F), or | (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), underscore | |||
| tilde (%7E) without fear of creating a conflict, but unescaping the | (%5F), or tilde (%7E) without fear of creating a conflict, but | |||
| other mark characters is usually counterproductive. | unescaping the other mark characters is usually counterproductive. | |||
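
The normalization step described above could look roughly like the following. It unescapes only triplets that decode to ALPHA, DIGIT, hyphen, underscore, or tilde (the -02 list; -01 omits hyphen) and leaves every other triplet untouched. The names `SAFE_TO_UNESCAPE` and `unescape_safe` are invented for this sketch.

```python
import re
import string

# Characters the -02 text says can be unescaped "without fear of
# creating a conflict".
SAFE_TO_UNESCAPE = set(string.ascii_letters + string.digits + "-_~")

_ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_safe(uri: str) -> str:
    """Unescape only the triplets that decode to a safe character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in SAFE_TO_UNESCAPE else match.group(0)

    # A single left-to-right pass: the output is never rescanned, so an
    # escaped percent ("%25") cannot trigger a second round of decoding.
    return _ESCAPED.sub(replace, uri)


print(unescape_safe("/%7Euser/%41%2Fb%2d"))  # /~user/A%2Fb-
```
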
| 2.4 Escape Sequences | 2.4 Escaped Characters | |||
| Data must be escaped if it does not have a representation using an | Data must be escaped if it does not have a representation using an | |||
| unreserved character; this includes data that does not correspond to | unreserved character; this includes data that does not correspond to | |||
| a printable character of the US-ASCII coded character set, or that | a printable character of the US-ASCII coded character set or | |||
| corresponds to any US-ASCII character that is disallowed, as | corresponds to a US-ASCII character that delimits the component from | |||
| explained below. | others, is reserved in that component for delimiting sub-components, | |||
| or is excluded from any use within a URI (Section 2.5). | ||||
| 2.4.1 Escaped Encoding | 2.4.1 Escaped Encoding | |||
| An escaped octet is encoded as a character triplet, consisting of | An escaped octet is encoded as a character triplet, consisting of | |||
| the percent character "%" followed by the two hexadecimal digits | the percent character "%" followed by the two hexadecimal digits | |||
| representing the octet code. For example, "%20" is the escaped | representing that octet's numeric value. For example, "%20" is the | |||
| encoding for the US-ASCII space character. | escaped encoding for the US-ASCII space character (SP). This is | |||
| sometimes referred to as "percent-encoding" the octet. | ||||
| escaped = "%" HEXDIG HEXDIG | escaped = "%" HEXDIG HEXDIG | |||
| The uppercase hexadecimal digits 'A' through 'F' are equivalent to | ||||
| the lowercase digits 'a' through 'f', respectively. Two URIs that | ||||
| differ only in the case of hexadecimal digits used in escaped octets | ||||
| are equivalent. For consistency, we recommend that uppercase digits | ||||
| be used by URI generators and normalizers. | ||||
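
For illustration, the triplet encoding and the uppercase recommendation in the -02 text can be expressed as two small helpers. Both function names are assumptions of this sketch.

```python
import re

_ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")


def escape_octet(octet: int) -> str:
    """Encode one octet (0-255) as "%" followed by two uppercase
    hexadecimal digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    return "%{:02X}".format(octet)


def uppercase_escapes(uri: str) -> str:
    """Rewrite every escape triplet with uppercase hex digits, leaving
    all other characters exactly as they were."""
    return _ESCAPED.sub(lambda m: m.group(0).upper(), uri)


print(escape_octet(0x20))             # %20
print(uppercase_escapes("a%2fb%7e"))  # a%2Fb%7E
```

Because the two spellings are equivalent, a comparison routine could apply `uppercase_escapes` to both URIs before comparing them.
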
| 2.4.2 When to Escape and Unescape | 2.4.2 When to Escape and Unescape | |||
| A URI is always in an "escaped" form, since escaping or unescaping a | Under normal circumstances, the only time that characters within a | |||
| completed URI might change its semantics. Normally, the only time | URI string are escaped is during the process of generating the URI | |||
| escape encodings can safely be made is when the URI is being created | from its component parts. Each component may have its own set of | |||
| from its component parts; each component may have its own set of | ||||
| characters that are reserved, so only the mechanism responsible for | characters that are reserved, so only the mechanism responsible for | |||
| generating or interpreting that component can determine whether or | generating or interpreting that component can determine whether or | |||
| not escaping a character will change its semantics. Likewise, a URI | not escaping a character will change its semantics. The exception is | |||
| must be separated into its components before the escaped characters | when a URI is being used within a context where the unreserved "mark" | |||
| within those components can be safely decoded. | characters might need to be escaped, such as when used for a | |||
| command-line argument or within a single-quoted attribute. | ||||
| Once generated, a URI is always in an escaped form. When a URI is | ||||
| resolved, the components significant to that scheme-specific | ||||
| resolution process (if any) must be parsed and separated before the | ||||
| escaped characters within those components can be safely unescaped. | ||||
| In some cases, data that could be represented by an unreserved | In some cases, data that could be represented by an unreserved | |||
| character may appear escaped; for example, some of the unreserved | character may appear escaped; for example, some of the unreserved | |||
| "mark" characters are automatically escaped by some systems. If the | "mark" characters are automatically escaped by some systems. A URI | |||
| given URI scheme defines a canonicalization algorithm, then | normalizer may unescape escaped octets that are represented by | |||
| unreserved characters may be unescaped according to that algorithm. | characters in the unreserved set. For example, "%7E" is sometimes | |||
| For example, "%7e" is sometimes used instead of "~" in an http URI | used instead of tilde ("~") in an "http" URI path and can be | |||
| path, but the two are equivalent for an http URI. | converted to "~" without changing the interpretation of the URI. | |||
| Because the percent "%" character always has the reserved purpose of | Because the percent ("%") character serves as the escape indicator, | |||
| being the escape indicator, it must be escaped as "%25" in order to | it must be escaped as "%25" in order for that octet to be used as | |||
| be used as data within a URI. Implementers should be careful not to | data within a URI. Implementers should be careful not to escape or | |||
| escape or unescape the same string more than once, since unescaping | unescape the same string more than once, since unescaping an already | |||
| an already unescaped string might lead to misinterpreting a percent | unescaped string might lead to misinterpreting a percent data | |||
| data character as another escaped character, or vice versa in the | character as another escaped character, or vice versa in the case of | |||
| case of escaping an already escaped string. | escaping an already escaped string. | |||
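
The hazard of escaping or unescaping twice is easy to reproduce. The sketch below uses `urllib.parse.quote` and `unquote` on an invented data string that already contains a literal percent character.

```python
from urllib.parse import quote, unquote

# Literal data: the six characters 1, 0, 0, %, 2, 5 (not an escape).
data = "100%25"

once = quote(data, safe="")   # correct: escaped exactly once
twice = quote(once, safe="")  # bug: escaping an already escaped string

print(once)   # 100%2525
print(twice)  # 100%252525

# Unescaping once restores the data; unescaping again misreads the
# percent data character as the start of another escape.
print(unquote(once))           # 100%25
print(unquote(unquote(once)))  # 100%
```
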
| 2.4.3 Excluded US-ASCII Characters | 2.5 Excluded Characters | |||
| Although they are disallowed within the URI syntax, we include here a | Although they are disallowed within the URI syntax, we include here | |||
| description of those US-ASCII characters that have been excluded and | a description of those characters that have been excluded and the | |||
| the reasons for their exclusion. | reasons for their exclusion. | |||
| excluded = invisible / delims / unwise | ||||
| The control characters (CTL) in the US-ASCII coded character set are | The control characters (CTL) in the US-ASCII coded character set are | |||
| not used within a URI, both because they are non-printable and | not used within a URI, both because they are non-printable and | |||
| because they are likely to be misinterpreted by some control | because they are likely to be misinterpreted by some control | |||
| mechanisms. | mechanisms. The space character (SP) is excluded because significant | |||
| spaces may disappear and insignificant spaces may be introduced when | ||||
| The space character (SP) is excluded because significant spaces may | a URI is transcribed, typeset, or subjected to the treatment of | |||
| disappear and insignificant spaces may be introduced when a URI is | ||||
| transcribed or typeset or subjected to the treatment of | ||||
| word-processing programs. Whitespace is also used to delimit a URI | word-processing programs. Whitespace is also used to delimit a URI | |||
| in many contexts. | in many contexts. Characters outside the US-ASCII set are excluded as | |||
| well. | ||||
| The angle-bracket "<" and ">" and double-quote (") characters are | invisible = CTL / SP / %x80-FF | |||
| The angle-bracket ("<" and ">") and double-quote (") characters are | ||||
| excluded because they are often used as the delimiters around a URI | excluded because they are often used as the delimiters around a URI | |||
| in text documents and protocol fields. The character "#" is excluded | in text documents and protocol fields. The percent character ("%") | |||
| because it is used to delimit a URI from a fragment identifier in a | is excluded because it is used for the encoding of escaped (Section | |||
| URI reference (Section 4). The percent character "%" is excluded | 2.4) characters. | |||
| because it is used for the encoding of escaped characters. | ||||
| delims = "<" / ">" / "#" / "%" / DQUOTE | delims = "<" / ">" / "%" / DQUOTE | |||
| Other characters are excluded because gateways and other transport | Other characters are excluded because gateways and other transport | |||
| agents are known to sometimes modify such characters, or they are | agents are known to sometimes modify such characters. | |||
| used as delimiters. | ||||
| unwise = "{" / "}" / "|" / "\" / "^" / "`" | unwise = "{" / "}" / "|" / "\" / "^" / "`" | |||
| Data corresponding to excluded characters must be escaped in order to | Data octets corresponding to excluded characters must be escaped in | |||
| be properly represented within a URI. | order to be represented within a URI. | |||
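
As a sketch of the -02 productions, the check below reports whether a data octet corresponds to an excluded character and therefore has to be escaped. It assumes CTL means %x00-1F and %x7F as in [RFC2234]; the names are invented for the example.

```python
# delims = "<" / ">" / "%" / DQUOTE   (the -02 column; -01 also has "#")
DELIMS = set(b'<>%"')
# unwise = "{" / "}" / "|" / "\" / "^" / "`"
UNWISE = set(b"{}|\\^`")


def is_excluded(octet: int) -> bool:
    """True if `octet` matches invisible / delims / unwise."""
    # invisible = CTL / SP / %x80-FF
    invisible = octet <= 0x20 or octet == 0x7F or octet >= 0x80
    return invisible or octet in DELIMS or octet in UNWISE


print(is_excluded(ord(" ")))  # True: SP
print(is_excluded(ord("%")))  # True: must appear as %25 when it is data
print(is_excluded(ord("a")))  # False: unreserved
```

The function classifies raw data octets only. In a finished URI a "%" is expected wherever an escape triplet begins, so the check should not be run over an already escaped string.
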
| 3. URI Syntactic Components | ||||
| The URI syntax is dependent upon the scheme. In general, absolute | ||||
| URIs are written as follows: | ||||
| <scheme>:<scheme-specific-part> | ||||
| An absolute URI contains the name of the scheme being used (<scheme>) | ||||
| followed by a colon (":") and then a string (the | ||||
| <scheme-specific-part>) whose interpretation depends on the scheme. | ||||
| The URI syntax does not require that the scheme-specific-part have | ||||
| any general structure or set of semantics which is common among all | ||||
| URIs. However, a subset of URI do share a common syntax for | ||||
| representing hierarchical relationships within the namespace. This | ||||
| "generic URI" syntax consists of a sequence of four main components: | ||||
| <scheme>://<authority><path>?<query> | ||||
| each of which, except <scheme>, may be absent from a particular URI. | ||||
| For example, some URI schemes do not allow an <authority> component, | ||||
| and others do not use a <query> component. | ||||
| absolute-URI = scheme ":" ( hier-part / opaque-part ) | ||||
| URIs that are hierarchical in nature use the slash "/" character for | ||||
| separating hierarchical components. For some file systems, a "/" | ||||
| character (used to denote the hierarchical structure of a URI) is the | ||||
| delimiter used to construct a file name hierarchy, and thus the URI | ||||
| path will look similar to a file pathname. This does NOT imply that | ||||
| the resource is a file or that the URI maps to an actual filesystem | ||||
| pathname. | ||||
| hier-part = [ net-path / abs-path ] [ "?" query ] | 3. Syntax Components | |||
| net-path = "//" authority [ abs-path ] | The generic URI syntax consists of a hierarchical sequence of | |||
| components referred to as the scheme, authority, path, query, and | ||||
| fragment. | ||||
| abs-path = "/" path-segments | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |||
| URIs that do not make use of the slash "/" character for separating | hier-part = net-path / abs-path / rel-path | |||
| hierarchical components are considered opaque by the generic URI | ||||
| parser. | ||||
| opaque-part = uric-no-slash *uric | net-path = "//" authority [ abs-path ] | |||
| abs-path = "/" path-segments | ||||
| rel-path = path-segments | ||||
| uric-no-slash = unreserved / escaped / "[" / "]" / ";" / "?" / | The scheme and path components are required, though path may be empty | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | (no characters). An ABNF-driven parser of hier-part will find that | |||
| the three productions in the rule are ambiguous: they are | ||||
| disambiguated by the "first-match-wins" (a.k.a. "greedy") algorithm. | ||||
| In other words, if the string begins with two slash characters | ||||
| ("//"), then it is a net-path; if it begins with only one slash | ||||
| character, then it is an abs-path; otherwise, it is a rel-path. Note | ||||
| that rel-path does not necessarily contain any slash ("/") | ||||
| characters; a non-hierarchical path will be treated as opaque data by | ||||
| a generic URI parser. | ||||
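
The first-match-wins disambiguation described in the -02 text amounts to testing the alternatives in order. A minimal sketch, with an invented function name, applied to the part of the URI that follows the scheme's colon:

```python
def classify_hier_part(hier_part: str) -> str:
    """Name the hier-part production that matches, testing the
    alternatives in the order net-path, abs-path, rel-path."""
    if hier_part.startswith("//"):
        return "net-path"
    if hier_part.startswith("/"):
        return "abs-path"
    # Anything else, including the empty string, is a rel-path; it need
    # not contain any "/" and may be opaque to a generic parser.
    return "rel-path"


print(classify_hier_part("//example.com/a"))  # net-path
print(classify_hier_part("/a/b"))             # abs-path
print(classify_hier_part("kim@example.com"))  # rel-path
```
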
| We use the term <path> to refer to both the <abs-path> and | The authority component is only present when a string matches the | |||
| <opaque-part> constructs, since they are mutually exclusive for any | net-path production. Since the presence of an authority component | |||
| given URI and can be parsed as a single component. | restricts the remaining syntax for path, we have not included a | |||
| specific "path" rule in the syntax. Instead, what we refer to as the | ||||
| URI path is that part of the parsed URI string matching the abs-path | ||||
| or rel-path production in the syntax above, since they are mutually | ||||
| exclusive for any given URI and can be parsed as a single component. | ||||
| 3.1 Scheme Component | 3.1 Scheme | |||
| Just as there are many different methods of access to resources, | Each URI begins with a scheme name that refers to a specification for | |||
| there are a variety of schemes for identifying such resources. The | assigning identifiers within that scheme. As such, the URI syntax is | |||
| URI syntax consists of a sequence of components separated by reserved | a federated and extensible naming system wherein each scheme's | |||
| characters, with the first component defining the semantics for the | specification may further restrict the syntax and semantics of | |||
| remainder of the URI string. | identifiers using that scheme. | |||
| Scheme names consist of a sequence of characters beginning with a | Scheme names consist of a sequence of characters beginning with a | |||
| lower case letter and followed by any combination of lower case | letter and followed by any combination of letters, digits, plus | |||
| letters, digits, plus ("+"), period ("."), or hyphen ("-"). For | ("+"), period ("."), or hyphen ("-"). Although scheme is | |||
| resiliency, programs interpreting a URI should treat upper case | case-insensitive, the canonical form is lowercase and documents that | |||
| letters as equivalent to lower case in scheme names (e.g., allow | specify schemes must do so using lowercase letters. An | |||
| "HTTP" as well as "http"). | implementation should accept uppercase letters as equivalent to | |||
| lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | ||||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | the sake of robustness, but should only generate lowercase scheme | |||
| names, for consistency. | ||||
| Relative URI references are distinguished from absolute URI in that | ||||
| they do not begin with a scheme name. Instead, the scheme is | ||||
| inherited from the base URI, as described in Section 5.2. | ||||
| 3.2 Authority Component | ||||
| Many URI schemes include a top hierarchical element for a naming | ||||
| authority, such that the namespace defined by the remainder of the | ||||
| URI is governed by that authority. This authority component is | ||||
| typically defined by an Internet-based server or a scheme-specific | ||||
| registry of naming authorities. | ||||
| authority = server / reg-name | ||||
| The authority component is preceded by a double slash "//" and is | ||||
| terminated by the next slash "/", question-mark "?", or by the end of | ||||
| the URI. Within the authority component, the characters ";", ":", | ||||
| "@", "?", "/", "[", and "]" are reserved. | ||||
| An authority component is not required for a URI scheme to make use | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| of relative references. A base URI without an authority component | ||||
| implies that any relative reference will also be without an authority | ||||
| component. | ||||
| 3.2.1 Registry-based Naming Authority | Individual schemes are not specified by this document. The process | |||
| for registration of new URI schemes is defined separately by | ||||
| [RFC2717]. The scheme registry maintains the mapping between scheme | ||||
| names and their specifications. | ||||
| The structure of a registry-based naming authority is specific to | 3.2 Authority | |||
| the URI scheme, but constrained to the allowed characters for an | ||||
| authority component. | ||||
| reg-name = 1*( unreserved / escaped / ";" / | Many URI schemes include a hierarchical element for a naming | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," ) | authority, such that governance of the name space defined by the | |||
| remainder of the URI is delegated to that authority (which may, in | ||||
| turn, delegate it further). The generic syntax provides a common | ||||
| means for distinguishing an authority based on a registered domain | ||||
| name or server address, along with optional port and user | ||||
| information. | ||||
| 3.2.2 Server-based Naming Authority | The authority component is preceded by a double slash ("//") and is | |||
| terminated by the next slash ("/"), question-mark ("?"), or | ||||
| crosshatch ("#") character, or by the end of the URI. | ||||
| URI schemes that involve the direct use of an IP-based protocol to a | authority = [ userinfo "@" ] host [ ":" port ] | |||
| specified server on the Internet use a common syntax for the server | ||||
| component of the URI's scheme-specific data: | ||||
| <userinfo>@<host>:<port> | The parts "<userinfo>@" and ":<port>" may be omitted. | |||
| where <userinfo> may consist of a user name and, optionally, | Some schemes do not allow the userinfo and/or port sub-components. | |||
| scheme-specific information about how to gain authorization to access | When presented with a URI that violates one or more scheme-specific | |||
| the server. The parts "<userinfo>@" and ":<port>" may be omitted. If | restrictions, the scheme-specific URI resolution process should flag | |||
| <host> is omitted, the default host is defined by the scheme-specific | the reference as an error rather than ignore the unused parts; doing | |||
| semantics of the URI (e.g., the "file" URI scheme defaults to | so reduces the number of equivalent URIs and helps detect abuses of | |||
| "localhost", whereas the "http" URI scheme does not allow host to be | the generic syntax that might indicate the URI has been constructed | |||
| omitted). | to mislead the user (Section 7.5). | |||
| server = [ [ userinfo "@" ] hostport ] | 3.2.1 User Information | |||
| The user information, if present, is followed by a commercial | The userinfo sub-component may consist of a user name and, | |||
| at-sign "@". | optionally, scheme-specific information about how to gain | |||
| authorization to access the server. The user information, if | ||||
| present, is followed by a commercial at-sign ("@") that delimits it | ||||
| from the host. | ||||
| userinfo = *( unreserved / escaped / ";" / | userinfo = *( unreserved / escaped / ";" / | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | ":" / "&" / "=" / "+" / "$" / "," ) | |||
| Some URI schemes use the format "user:password" in the userinfo | Some URI schemes use the format "user:password" in the userinfo | |||
| field. This practice is NOT RECOMMENDED, because the passing of | field. This practice is NOT RECOMMENDED, because the passing of | |||
| authentication information in clear text has proven to be a security | authentication information in clear text has proven to be a security | |||
| risk in almost every case where it has been used. Note also that | risk in almost every case where it has been used. Note also that | |||
| userinfo which is crafted to look like a trusted domain name might be | userinfo might be crafted to look like a trusted domain name in order | |||
| used to mislead users, as described in Section 7.5. | to mislead users, as described in Section 7.5. | |||
| The server is identified by a network host --- as described by an | 3.2.2 Host | |||
| IPv6 literal encapsulated within square brackets, an IPv4 address in | ||||
| dotted-decimal form, or a domain name --- and an optional port | ||||
| number. The server's port, if any is required by the URI scheme, can | ||||
| be specified by a port number in decimal following the host and | ||||
| delimited from it by a colon (":") character. If no explicit port | ||||
| number is given, the default port number, as defined by the URI | ||||
| scheme, is assumed. The type of network port identified by the URI | ||||
| (e.g., TCP, UDP, SCTP, etc.) is defined by the scheme-specific | ||||
| semantics of the URI scheme. | ||||
| hostport = host [ ":" port ] | The host sub-component of authority is identified by an IPv6 literal | |||
| host = IPv6reference / IPv4address / hostname | encapsulated within square brackets, an IPv4 address in | |||
| port = *DIGIT | dotted-decimal form, or a domain name. | |||
| host = [ IPv6reference / IPv4address / hostname ] | ||||
| If host is omitted, a default may be defined by the scheme-specific | ||||
| semantics of the URI. For example, the "file" URI scheme defaults to | ||||
| "localhost", whereas the "http" URI scheme does not allow host to be | ||||
| omitted. | ||||
| The production for host is ambiguous because it does not completely | ||||
| distinguish between an IPv4address and a hostname. Again, the | ||||
| "first-match-wins" algorithm applies: If host matches the production | ||||
| for IPv4address, then it should be considered an IPv4 address literal | ||||
| and not a hostname. | ||||
| A hostname takes the form described in Section 3 of [RFC1034] and | A hostname takes the form described in Section 3 of [RFC1034] and | |||
| Section 2.1 of [RFC1123]: a sequence of domain labels separated by | Section 2.1 of [RFC1123]: a sequence of domain labels separated by | |||
| ".", each domain label starting and ending with an alphanumeric | ".", each domain label starting and ending with an alphanumeric | |||
| character and possibly also containing "-" characters. The rightmost | character and possibly also containing "-" characters. The rightmost | |||
| domain label of a fully qualified domain name will never start with a | domain label of a fully qualified domain name may be followed by a | |||
| digit, thus syntactically distinguishing domain names from IPv4 | single "." if it is necessary to distinguish between the complete | |||
| addresses, and may be followed by a single "." if it is necessary to | domain name and some local domain. | |||
| distinguish between the complete domain name and any local domain. | ||||
| hostname = domainlabel qualified | hostname = domainlabel qualified | |||
| qualified = *( "." domainlabel ) [ "." toplabel "." ] | qualified = *( "." domainlabel ) [ "." ] | |||
| domainlabel = alphanum [ 0*61( alphanum \| "-" ) alphanum ] | domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | |||
| toplabel = alpha [ 0*61( alphanum \| "-" ) alphanum ] | alphanum = ALPHA / DIGIT | |||
| alphanum = ALPHA / DIGIT | ||||
| A host identified by an IPv4 literal address is represented in | A host identified by an IPv4 literal address is represented in | |||
| dotted-decimal notation (a sequence of four decimal numbers in the | dotted-decimal notation (a sequence of four decimal numbers in the | |||
| range 0 to 255, separated by "."), as described in [RFC1123] by | range 0 to 255, separated by "."), as described in [RFC1123] by | |||
| reference to [RFC0952]. Note that other forms of dotted notation may | reference to [RFC0952]. Note that other forms of dotted notation may | |||
| be interpreted on some platforms, as described in Section 7.3, but | be interpreted on some platforms, as described in Section 7.3, but | |||
| only the dotted-decimal form of four octets is allowed by this | only the dotted-decimal form of four octets is allowed by this | |||
| grammar. | grammar. | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT / ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| ( %x31-39 DIGIT ) / ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| ( "1" 2DIGIT ) / ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| ( "2" %x30-34 DIGIT ) / ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| ( "25" %x30-35 ) ; 250-255 | / "25" %x30-35 ; 250-255 | |||
| A host identified by an IPv6 literal address [RFC2373] is | A host identified by an IPv6 literal address [RFC3513] is | |||
| distinguished by enclosing the IPv6 literal within square-brakets | distinguished by enclosing the IPv6 literal within square-brackets | |||
| ("[" and "]"). This is the only place where square-bracket | ("[" and "]"). This is the only place where square-bracket | |||
| characters are allowed in the hierarchical URI syntax. | characters are allowed in the URI syntax. | |||
| IPv6reference = "[" IPv6address "]" | IPv6reference = "[" IPv6address "]" | |||
| IPv6address = ( 6( h4 ":" ) ls32 ) | IPv6address = 6( h4 ":" ) ls32 | |||
| / ( "::" 5( h4 ":" ) ls32 ) | / "::" 5( h4 ":" ) ls32 | |||
| / ( [ h4 ] "::" 4( h4 ":" ) ls32 ) | / [ h4 ] "::" 4( h4 ":" ) ls32 | |||
| / ( [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 ) | / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | |||
| / ( [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 ) | / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | |||
| / ( [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 ) | / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | |||
| / ( [ *4( h4 ":" ) h4 ] "::" ls32 ) | / [ *4( h4 ":" ) h4 ] "::" ls32 | |||
| / ( [ *5( h4 ":" ) h4 ] "::" h4 ) | / [ *5( h4 ":" ) h4 ] "::" h4 | |||
| / ( [ *6( h4 ":" ) h4 ] "::" ) | / [ *6( h4 ":" ) h4 ] "::" | |||
| ls32 = ( h4 ":" h4 ) / IPv4address | ls32 = ( h4 ":" h4 ) / IPv4address | |||
| ; least-significant 32 bits of address | ; least-significant 32 bits of address | |||
| h4 = 1*4HEXDIG | h4 = 1*4HEXDIG | |||
| 3.3 Path Component | The presence of host within a URI does not imply that the scheme | |||
| requires access to the given host on the Internet. In many cases, | ||||
| the host syntax is used only for the sake of reusing the existing | ||||
| registration process created and deployed for DNS, thus obtaining a | ||||
| globally unique name without the cost of deploying another registry. | ||||
| However, such use comes with its own costs: domain name ownership may | ||||
| change over time for reasons not anticipated by the URI creator. | ||||
| The path component contains data, specific to the authority (or the | 3.2.3 Port | |||
| scheme if there is no authority component), identifying the resource | ||||
| within the scope of that scheme and authority. | ||||
| path = [ abs-path / opaque-part ] | The port sub-component of authority is designated by an optional | |||
| port number in decimal following the host and delimited from it by a | ||||
| single colon (":") character. | ||||
| port = *DIGIT | ||||
| If port is omitted, a default may be defined by the scheme-specific | ||||
| semantics of the URI. Likewise, the type of network port designated | ||||
| by the port number (e.g., TCP, UDP, SCTP, etc.) is defined by the URI | ||||
| scheme. For example, the "http" URI scheme defines a default of TCP | ||||
| port 80. | ||||
| 3.3 Path | ||||
| The path component contains hierarchical data that, along with data | ||||
| in the optional query (Section 3.4) component, serves to identify a | ||||
| resource within the scope of that URI's scheme and naming authority | ||||
| (if any). There is no specific "path" syntax production in the | ||||
| generic URI syntax. Instead, what we refer to as the URI path is | ||||
| that part of the parsed URI string matching either the abs-path or | ||||
| the rel-path production, since they are mutually exclusive for any | ||||
| given URI and can be parsed as a single component. The path is | ||||
| terminated by the first question-mark ("?") or crosshatch ("#") | ||||
| character, or by the end of the URI. | ||||
| path-segments = segment *( "/" segment ) | path-segments = segment *( "/" segment ) | |||
| segment = *pchar | segment = *pchar | |||
| pchar = unreserved / escaped / ";" / | pchar = unreserved / escaped / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| The path may consist of a sequence of path segments separated by a | The path consists of a sequence of path segments separated by a slash | |||
| single slash "/" character. Within a path segment, the characters | ("/") character. A path is always defined for a URI, though the | |||
| "/", ";", "=", and "?" are reserved. The semicolon (";") and equals | defined path may be empty (zero length) or opaque (not containing any | |||
| ("=") characters have the reserved purpose of delimiting parameters | "/" delimiters). For example, the URI <mailto:fred@example.com> has | |||
| and parameter values within a path segment. However, parameters are | a path of "fred@example.com". | |||
| not significant to the parsing of relative references. | ||||
| 3.4 Query Component | Within a path segment, the semicolon (";") and equals ("=") reserved | |||
| characters are often used for delimiting parameters and parameter | ||||
| values applicable to that segment. The comma (",") reserved | ||||
| character is often used for similar purposes. For example, one URI | ||||
| generator might use a segment like "name;v=1.1" to indicate a | ||||
| reference to version 1.1 of "name", whereas another might use a | ||||
| segment like "name,1.1" to indicate the same. Parameter types may be | ||||
| defined by scheme-specific semantics, but in most cases the meaning | ||||
| of a parameter is specific to the URI originator. Parameters are not | ||||
| significant to the parsing of relative references. | ||||
| The query component is a string of information to be interpreted by | The path segments "." and ".." are defined for relative reference | |||
| the resource. | within the path name hierarchy. They are intended for use at the | |||
| beginning of a relative path reference (Section 4.2) for indicating | ||||
| relative position within the hierarchical tree of names, with a | ||||
| similar effect to how they are used within some operating systems' | ||||
| file directory structure to indicate the current directory and parent | ||||
| directory, respectively. Unlike a file system, however, these | ||||
| dot-segments are only interpreted within the URI path hierarchy and | ||||
| must be removed as part of the URI normalization or resolution | ||||
| process, in accordance with the process described in Section 5.2. | ||||
| query = *( pchar / "/" / "?" ) | 3.4 Query | |||
| Within a query component, the characters ";", "/", "?", ":", "@", | The query component contains non-hierarchical data that, along with | |||
| "&", "=", "+", ",", and "$" are reserved. | data in the path (Section 3.3) component, serves to identify a | |||
| resource within the scope of that URI's scheme and naming authority | ||||
| (if any). The query component is indicated by the first question-mark | ||||
| ("?") character and terminated by a crosshatch ("#") character or by | ||||
| the end of the URI. | ||||
| 4. URI References | query = *( pchar / "/" / "?" ) | |||
| The term "URI-reference" is used here to denote the common usage of | The characters slash ("/") and question-mark ("?") are allowed to | |||
| a resource identifier. A URI reference may be absolute or relative, | represent data within the query component, but such use is | |||
| and may have additional information attached in the form of a | discouraged; incorrect implementations of relative URI resolution | |||
| fragment identifier. However, "the URI" that results from such a | often fail to distinguish them from hierarchical separators, thus | |||
| reference includes only the absolute URI after the fragment | resulting in non-interoperable results while parsing relative | |||
| identifier (if any) is removed and after any relative URI is resolved | references. However, since query components are often used to carry | |||
| to its absolute form. Although it is possible to limit the | identifying information in the form of "key=value" pairs, and one | |||
| discussion of URI syntax and semantics to that of the absolute | frequently used value is a reference to another URI, it is sometimes | |||
| result, most usage of URI is within general URI references, and it is | better for usability to include those characters unescaped. | |||
| impossible to obtain the URI from such a reference without also | ||||
| parsing the fragment and resolving the relative form. | ||||
| URI-reference = [ absolute-URI / relative-URI ] [ "#" fragment ] | 3.5 Fragment | |||
| Many protocol elements allow only the absolute form of a URI with an | The fragment identifier component allows indirect identification of | |||
| optional fragment identifier. | a secondary resource by reference to a primary resource and | |||
| additional identifying information that is selective within that | ||||
| resource. The identified secondary resource may be some portion or | ||||
| subset of the primary resource, some view on representations of the | ||||
| primary resource, or some other resource that is merely named within | ||||
| the primary resource. A fragment identifier component is indicated | ||||
| by the presence of a crosshatch ("#") character and terminated by the | ||||
| end of the URI string. | ||||
| absolute-URI-reference = absolute-URI [ "#" fragment ] | fragment = *( pchar / "/" / "?" ) | |||
| The syntax for a relative URI is a shortened form of that for an | The semantics of a fragment identifier are defined by the set of | |||
| absolute URI, where some prefix of the URI is missing and certain | representations that might result from a retrieval action on the | |||
| path components ("." and "..") have a special meaning when, and only | primary resource. Therefore, the format and interpretation of a | |||
| when, interpreting a relative path. The relative URI syntax is | fragment identifier component is dependent on the media type | |||
| defined in Section 5. | [RFC2046] of a potential retrieval result. Individual media types | |||
| may define their own restrictions on, or structure within, the | ||||
| fragment identifier syntax for specifying different types of subsets, | ||||
| views, or external references that are identifiable as fragments by | ||||
| that media type. If the primary resource is represented by multiple | ||||
| media types, as is often the case for resources whose representation | ||||
| is selected based on attributes of the retrieval request, then | ||||
| interpretation of the given fragment identifier must be consistent | ||||
| across all of those media types in order for it to be viable as an | ||||
| identifier. | ||||
| 4.1 Fragment Identifier | As with any URI, use of a fragment identifier component does not | |||
| imply that a retrieval action will take place. A URI with a fragment | ||||
| identifier may be used to refer to the secondary resource without any | ||||
| implication that the primary resource is accessible. However, if | ||||
| that URI is used in a context that does call for retrieval and is not | ||||
| a same-document reference (Section 4.4), the fragment identifier is | ||||
| only valid as a reference if a retrieval action on the primary | ||||
| resource succeeds and results in a representation that defines the | ||||
| fragment. | ||||
| When a URI reference is used to perform a retrieval action on the | Fragment identifiers have a special role in information systems as | |||
| identified resource, the optional fragment identifier, separated from | the primary form of client-side indirect referencing, allowing an | |||
| the URI by a crosshatch ("#") character, consists of additional | author to specifically identify those aspects of an existing resource | |||
| reference information to be interpreted by the user agent after the | that are only indirectly provided by the resource owner. As such, | |||
| retrieval action has been successfully completed. As such, it is not | interpretation of the fragment identifier during a retrieval action | |||
| part of a URI, but is often used in conjunction with a URI. | is performed solely by the user agent; the fragment identifier is not | |||
| passed to other systems during the process of retrieval. Although | ||||
| this is often perceived to be a loss of information, particularly in | ||||
| regards to accurate redirection of references as content moves over | ||||
| time, it also serves to prevent information providers from denying | ||||
| reference authors the right to selectively refer to information | ||||
| within a resource. | ||||
| fragment = *( pchar / "/" / "?" ) | The characters slash ("/") and question-mark ("?") are allowed to | |||
| represent data within the fragment identifier, but such use is | ||||
| discouraged for the same reasons as described above for query. | ||||
| The semantics of a fragment identifier is a property of the data | 4. Usage | |||
| resulting from a retrieval action, regardless of the type of URI used | ||||
| in the reference. Therefore, the format and interpretation of | ||||
| fragment identifiers is dependent on the media type [RFC2046] of the | ||||
| retrieval result. The character restrictions described in Section 2 | ||||
| for a URI also apply to the fragment in a URI-reference. Individual | ||||
| media types may define additional restrictions or structure within | ||||
| the fragment for specifying different types of "partial views" that | ||||
| can be identified within that media type. | ||||
| A fragment identifier is only meaningful when a URI reference is | When applications make reference to a URI, they do not always use the | |||
| intended for retrieval and the result of that retrieval is a document | full form of reference defined by the "URI" syntax production. In | |||
| for which the identified fragment is consistently defined. | order to save space and take advantage of hierarchical locality, many | |||
| Internet protocol elements and media type formats allow an | ||||
| abbreviation of a URI, while others restrict the syntax to a | ||||
| particular form of URI. We define the most common forms of reference | ||||
| syntax in this specification because they impact and depend upon the | ||||
| design of the generic syntax, requiring a uniform parsing algorithm | ||||
| in order to be interpreted consistently. | ||||
| 4.2 Same-document References | 4.1 URI Reference | |||
| A URI reference that does not contain a URI is a reference to the | The ABNF rule URI-reference is used to denote the most common usage | |||
| current document. In other words, an empty URI reference within a | of a resource identifier. | |||
| document is interpreted as a reference to the start of that document, | ||||
| and a reference containing only a fragment identifier is a reference | ||||
| to the identified fragment of that document. Traversal of such a | ||||
| reference should not result in an additional retrieval action. | ||||
| However, if the URI reference occurs in a context that is always | ||||
| intended to result in a new request, as in the case of HTML's FORM | ||||
| element [HTML], then an empty URI reference represents the base URI | ||||
| of the current document and should be replaced by that URI when | ||||
| transformed into a request. | ||||
| 4.3 Parsing a URI Reference | URI-reference = URI / relative-URI | |||
| A URI reference is typically parsed according to the four main | A URI-reference may be absolute or relative: if the reference | |||
| components and fragment identifier in order to determine what | string's prefix matches the syntax of a scheme followed by its colon | |||
| components are present and whether the reference is relative or | separator, then the reference is a URI rather than a relative-URI. | |||
| absolute. The individual components are then parsed for their | ||||
| subparts and, if not opaque, to verify their validity. | ||||
| Although the BNF defines what is allowed in each component, it is | A URI-reference is typically parsed first into the five URI | |||
| ambiguous in terms of differentiating between an authority component | components, in order to determine what components are present and | |||
| and a path component that begins with two slash characters. The | whether the reference is relative or absolute, and then each | |||
| greedy algorithm is used for disambiguation: the left-most matching | component is parsed for its subparts and their validation. The ABNF | |||
| rule soaks up as much of the URI reference string as it is capable of | of URI-reference, along with the "first-match-wins" disambiguation | |||
| matching. In other words, the authority component wins. | rule, is sufficient to define a validating parser for the generic | |||
| syntax. Readers familiar with regular expressions should see | ||||
| Appendix B for an example of a non-validating URI-reference parser | ||||
| that will take any given string and extract the URI components. | ||||
| Readers familiar with regular expressions should see Appendix B for a | 4.2 Relative URI | |||
| concrete parsing example and test oracle. | ||||
| 5. Relative URI References | A relative URI reference takes advantage of the hier-part syntax | |||
| (Section 3) in order to express a reference that is relative to the | ||||
| name space of another hierarchical URI. | ||||
| relative-URI = hier-part [ "?" query ] [ "#" fragment ] | ||||
| The URI referred to by a relative URI reference is obtained by | ||||
| applying the relative resolution algorithm of Section 5. | ||||
| A relative reference that begins with two slash characters is termed | ||||
| a network-path reference; such references are rarely used. A relative | ||||
| reference that begins with a single slash character is termed an | ||||
| absolute-path reference. A relative reference that does not begin | ||||
| with a slash character is termed a relative-path reference. | ||||
| A path segment that contains a colon character (e.g., "this:that") | ||||
| cannot be used as the first segment of a relative-path reference | ||||
| because it might be mistaken for a scheme name. Such a segment must | ||||
| be preceded by a dot-segment (e.g., "./this:that") to make a | ||||
| relative-path reference. | ||||
| 4.3 Absolute URI | ||||
| Some protocol elements allow only the absolute form of a URI without | ||||
| a fragment identifier. For example, defining the base URI for later | ||||
| use by relative references calls for an absolute-URI production that | ||||
| does not allow a fragment. | ||||
| absolute-URI = scheme ":" hier-part [ "?" query ] | ||||
| 4.4 Same-document Reference | ||||
| When a URI reference occurring within a document or message refers to | ||||
| a URI that is, aside from its fragment component (if any), identical | ||||
| to the base URI (Section 5), that reference is called a | ||||
| "same-document" reference. The most frequent examples of | ||||
| same-document references are relative references that are empty or | ||||
| include only the crosshatch ("#") separator followed by a fragment | ||||
| identifier. | ||||
| When a same-document reference is dereferenced for the purpose of a | ||||
| retrieval action, the target of that reference is defined to be | ||||
| within that current document or message; the dereference should not | ||||
| result in a new retrieval. | ||||
| 4.5 Suffix Reference | ||||
| The URI syntax is designed for unambiguous reference to resources and | ||||
| extensibility via the URI scheme. However, as URI identification and | ||||
| usage have become commonplace, traditional media (television, radio, | ||||
| newspapers, billboards, etc.) have increasingly used a suffix of the | ||||
| URI as a reference, consisting of only the authority and path | ||||
| portions of the URI, such as | ||||
| www.w3.org/Addressing/ | ||||
| or simply the DNS hostname on its own. Such references are primarily | ||||
| intended for human interpretation rather than machine, with the | ||||
| assumption that context-based heuristics are sufficient to complete | ||||
| the URI (e.g., most hostnames beginning with "www" are likely to have | ||||
| a URI prefix of "http://"). Although there is no standard set of | ||||
| heuristics for disambiguating a URI suffix, many client | ||||
| implementations allow them to be entered by the user and | ||||
| heuristically resolved. It should be noted that such heuristics may | ||||
| change over time, particularly when new URI schemes are introduced. | ||||
| Since a URI suffix has the same syntax as a relative path reference, | ||||
| a suffix reference cannot be used in contexts where relative URIs are | ||||
| expected. This limits use of suffix references to those places where | ||||
| there is no defined base URI, such as dialog boxes and off-line | ||||
| advertisements. | ||||
| 5. Relative Resolution | ||||
| It is often the case that a group or "tree" of documents has been | It is often the case that a group or "tree" of documents has been | |||
| constructed to serve a common purpose; the vast majority of URIs in | constructed to serve a common purpose; the vast majority of URIs in | |||
| these documents point to resources within the tree rather than | these documents point to resources within the tree rather than | |||
| outside of it. Similarly, documents located at a particular site are | outside of it. Similarly, documents located at a particular site are | |||
| much more likely to refer to other resources at that site than to | much more likely to refer to other resources at that site than to | |||
| resources at remote sites. | resources at remote sites. | |||
| Relative addressing of URIs allows document trees to be partially | Relative referencing of URIs allows document trees to be partially | |||
| independent of their location and access scheme. For instance, it is | independent of their location and access scheme. For instance, it is | |||
| possible for a single set of hypertext documents to be simultaneously | possible for a single set of hypertext documents to be simultaneously | |||
| accessible and traversable via each of the "file", "http", and "ftp" | accessible and traversable via each of the "file", "http", and "ftp" | |||
| schemes if the documents refer to each other using relative URIs. | schemes if the documents refer to each other using relative URIs. | |||
| Furthermore, such document trees can be moved, as a whole, without | Furthermore, such document trees can be moved, as a whole, without | |||
| changing any of the relative references. Experience within the WWW | changing any of the relative references. Experience within the WWW | |||
| has demonstrated that the ability to perform relative referencing is | has demonstrated that the ability to perform relative referencing is | |||
| necessary for the long-term usability of embedded URIs. | necessary for the long-term usability of embedded URIs. | |||
| The relative URI syntax takes advantage of the <hier-part> syntax of | ||||
| <absolute-URI> (Section 3) in order to express a reference that is | ||||
| relative to the namespace of another hierarchical URI. | ||||
| relative-URI = [ net-path / abs-path / rel-path ] [ "?" query ] | ||||
| A relative reference beginning with two slash characters is termed a | ||||
| network-path reference, as defined by <net-path> in Section 3. Such | ||||
| references are rarely used. | ||||
| A relative reference beginning with a single slash character is | ||||
| termed an absolute-path reference, as defined by <abs-path> in | ||||
| Section 3. | ||||
| A relative reference that does not begin with a scheme name or a | ||||
| slash character is termed a relative-path reference. | ||||
| rel-path = rel-segment [ abs-path ] | ||||
| rel-segment = 1*( unreserved / escaped / ";" / | ||||
| "@" / "&" / "=" / "+" / "$" / "," ) | ||||
| Within a relative-path reference, the complete path segments "." and | ||||
| ".." have special meanings: "the current hierarchy level" and "the | ||||
| level above this hierarchy level", respectively. Although this is | ||||
| very similar to their use within Unix-based filesystems to indicate | ||||
| directory levels, these path components are only considered special | ||||
| when resolving a relative-path reference to its absolute form | ||||
| (Section 5.2). | ||||
| Authors should be aware that a path segment which contains a colon | ||||
| character cannot be used as the first segment of a relative URI path | ||||
| (e.g., "this:that"), because it would be mistaken for a scheme name. | ||||
| It is therefore necessary to precede such segments with other | ||||
| segments (e.g., "./this:that") in order for them to be referenced as | ||||
| a relative path. | ||||
| It is not necessary for all URIs within a given scheme to be | ||||
| restricted to the <hier-part> syntax, since the hierarchical | ||||
| properties of that syntax are only necessary when a relative URI is | ||||
| used within a particular document. Documents can only make use of a | ||||
| relative URI when their base URI fits within the <hier-part> syntax. | ||||
| It is assumed that any document which contains a relative reference | ||||
| will also have a base URI that obeys the syntax. In other words, a | ||||
| relative URI cannot be used within a document that has an unsuitable | ||||
| base URI. | ||||
| Some URI schemes do not allow a hierarchical syntax matching the | ||||
| <hier-part> syntax, and thus cannot use relative references. | ||||
| 5.1 Establishing a Base URI | 5.1 Establishing a Base URI | |||
| The term "relative URI" implies that there exists some absolute "base | The term "relative URI" implies that there exists some absolute "base | |||
| URI" against which the relative reference is applied. Indeed, the | URI" against which the relative reference is applied. Indeed, the | |||
| base URI is necessary to define the semantics of any relative URI | base URI is necessary to define the semantics of any relative URI | |||
| reference; without it, a relative reference is meaningless. In order | reference; without it, a relative reference is meaningless. In order | |||
| for a relative URI to be usable within a document, the base URI of that | for a relative URI to be usable within a document, the base URI of that | |||
| document must be known to the parser. | document must be known to the parser. | |||
| A document that contains relative references must have a base URI | ||||
| that contains a hierarchical path component. In other words, a | ||||
| relative-URI cannot be used within a document that has an unsuitable | ||||
| base URI. Some URI schemes do not allow a hierarchical path component | ||||
| and are thus restricted to full URI references. | ||||
| An authority component is not required for a URI scheme to make use | ||||
| of relative references. A base URI without an authority component | ||||
| implies that any relative reference will also be without an authority | ||||
| component. | ||||
| The base URI of a document can be established in one of four ways, | The base URI of a document can be established in one of four ways, | |||
| listed below in order of precedence. The order of precedence can be | listed below in order of precedence. The order of precedence can be | |||
| thought of in terms of layers, where the innermost defined base URI | thought of in terms of layers, where the innermost defined base URI | |||
| has the highest precedence. This can be visualized graphically as: | has the highest precedence. This can be visualized graphically as: | |||
| .----------------------------------------------------------. | .----------------------------------------------------------. | |||
| | .----------------------------------------------------. | | | .----------------------------------------------------. | | |||
| | | .----------------------------------------------. | | | | | .----------------------------------------------. | | | |||
| | | | .----------------------------------------. | | | | | | | .----------------------------------------. | | | | |||
| | | | | .----------------------------------. | | | | | | | | | .----------------------------------. | | | | | |||
| skipping to change at page 24, line 42 ¶ | skipping to change at page 26, line 37 ¶ | |||
| It is beyond the scope of this document to specify how, for each | It is beyond the scope of this document to specify how, for each | |||
| media type, the base URI can be embedded. It is assumed that user | media type, the base URI can be embedded. It is assumed that user | |||
| agents manipulating such media types will be able to obtain the | agents manipulating such media types will be able to obtain the | |||
| appropriate syntax from that media type's specification. An example | appropriate syntax from that media type's specification. An example | |||
| of how the base URI can be embedded in the Hypertext Markup Language | of how the base URI can be embedded in the Hypertext Markup Language | |||
| (HTML) [HTML] is provided in Appendix D. | (HTML) [HTML] is provided in Appendix D. | |||
| A mechanism for embedding the base URI within MIME container types | A mechanism for embedding the base URI within MIME container types | |||
| (e.g., the message and multipart types) is defined by MHTML | (e.g., the message and multipart types) is defined by MHTML | |||
| [RFC2110]. Protocols that do not use the MIME message header syntax, | [RFC2110]. Protocols that do not use the MIME message header syntax, | |||
| but which do allow some form of tagged metainformation to be included | but do allow some form of tagged metadata to be included within | |||
| within messages, may define their own syntax for defining the base | messages, may define their own syntax for defining the base URI as | |||
| URI as part of a message. | part of a message. | |||
| 5.1.2 Base URI from the Encapsulating Entity | 5.1.2 Base URI from the Encapsulating Entity | |||
| If no base URI is embedded, the base URI of a document is defined by | If no base URI is embedded, the base URI of a document is defined by | |||
| the document's retrieval context. For a document that is enclosed | the document's retrieval context. For a document that is enclosed | |||
| within another entity (such as a message or another document), the | within another entity (such as a message or another document), the | |||
| retrieval context is that entity; thus, the default base URI of the | retrieval context is that entity; thus, the default base URI of the | |||
| document is the base URI of the entity in which the document is | document is the base URI of the entity in which the document is | |||
| encapsulated. | encapsulated. | |||
| skipping to change at page 25, line 18 ¶ | skipping to change at page 27, line 16 ¶ | |||
| If no base URI is embedded and the document is not encapsulated | If no base URI is embedded and the document is not encapsulated | |||
| within some other entity (e.g., the top level of a composite entity), | within some other entity (e.g., the top level of a composite entity), | |||
| then, if a URI was used to retrieve the base document, that URI shall | then, if a URI was used to retrieve the base document, that URI shall | |||
| be considered the base URI. Note that if the retrieval was the | be considered the base URI. Note that if the retrieval was the | |||
| result of a redirected request, the last URI used (i.e., that which | result of a redirected request, the last URI used (i.e., that which | |||
| resulted in the actual retrieval of the document) is the base URI. | resulted in the actual retrieval of the document) is the base URI. | |||
| 5.1.4 Default Base URI | 5.1.4 Default Base URI | |||
| If none of the conditions described in Sections 5.1.1--5.1.3 apply, | If none of the conditions described above apply, then the base URI | |||
| then the base URI is defined by the context of the application. Since | is defined by the context of the application. Since this definition | |||
| this definition is necessarily application-dependent, failing to | is necessarily application-dependent, failing to define the base URI | |||
| define the base URI using one of the other methods may result in the | using one of the other methods may result in the same content being | |||
| same content being interpreted differently by different types of | interpreted differently by different types of application. | |||
| application. | ||||
| It is the responsibility of the distributor(s) of a document | It is the responsibility of the distributor(s) of a document | |||
| containing a relative URI to ensure that the base URI for that | containing a relative URI to ensure that the base URI for that | |||
| document can be established. It must be emphasized that a relative | document can be established. It must be emphasized that a relative | |||
| URI cannot be used reliably in situations where the document's base | URI cannot be used reliably in situations where the document's base | |||
| URI is not well-defined. | URI is not well-defined. | |||
| 5.2 Resolving Relative References to Absolute Form | 5.2 Obtaining the Referenced URI | |||
| This section describes an example algorithm for resolving URI | This section describes an example algorithm for resolving URI | |||
| references that might be relative to a given base URI. The algorithm | references that might be relative to a given base URI. The algorithm | |||
| is intended to provide a definitive result that can be used to test | is intended to provide a definitive result that can be used to test | |||
| the output of other implementations. Implementation of the algorithm | the output of other implementations. Implementation of the algorithm | |||
| itself is not required, but the result given by an implementation | itself is not required, but the result given by an implementation | |||
| must match the result that would be given by this algorithm. | must match the result that would be given by this algorithm. | |||
| The base URI is established according to the rules of Section 5.1 and | The base URI (Base) is established according to the rules of Section | |||
| parsed into the four main components as described in Section 3. Note | 5.1 and parsed into the five main components described in Section 3. | |||
| that only the scheme component is required to be present in the base | Note that only the scheme component is required to be present in the | |||
| URI; the other components may be empty or undefined. A component is | base URI; the other components may be empty or undefined. A | |||
| undefined if its preceding separator does not appear in the URI | component is undefined if its preceding separator does not appear in | |||
| reference; the path component is never undefined, though it may be | the URI reference; the path component is never undefined, though it | |||
| empty. The base URI's query component is not used by the resolution | may be empty. | |||
| algorithm and may be discarded. | ||||
| For each URI reference (R), the following pseudocode describes an | For each URI reference (R), the following pseudocode describes an | |||
| algorithm for transforming R into its target (T), which is either an | algorithm for transforming R into its target URI (T): | |||
| absolute URI or the current document, and R's optional fragment: | ||||
| (R.scheme, R.authority, R.path, R.query, fragment) = parse(R); | (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | |||
| -- The URI reference is parsed into the four components and | -- The URI reference is parsed into the five URI components | |||
| -- fragment identifier, as described in Section 4.3. | ||||
| if ((not validating) and (R.scheme == Base.scheme)) then | if ((not validating) and (R.scheme == Base.scheme)) then | |||
| -- A non-validating parser may ignore a scheme in the | -- A non-validating parser may ignore a scheme in the | |||
| -- reference if it is identical to the base URI's scheme. | -- reference if it is identical to the base URI's scheme. | |||
| undefine(R.scheme); | undefine(R.scheme); | |||
| endif; | endif; | |||
| if defined(R.scheme) then | if defined(R.scheme) then | |||
| T.scheme = R.scheme; | T.scheme = R.scheme; | |||
| T.authority = R.authority; | T.authority = R.authority; | |||
| T.path = R.path; | T.path = R.path; | |||
| T.query = R.query; | T.query = R.query; | |||
| else | else | |||
| if defined(R.authority) then | if defined(R.authority) then | |||
| T.authority = R.authority; | T.authority = R.authority; | |||
| T.path = R.path; | T.path = R.path; | |||
| T.query = R.query; | T.query = R.query; | |||
| else | else | |||
| if (R.path == "") then | if (R.path == "") then | |||
| T.path = Base.path; | ||||
| if defined(R.query) then | if defined(R.query) then | |||
| T.path = Base.path; | ||||
| T.query = R.query; | T.query = R.query; | |||
| else | else | |||
| -- An empty reference refers to the current document | T.query = Base.query; | |||
| return (current-document, fragment); | ||||
| endif; | endif; | |||
| else | else | |||
| if (R.path starts-with "/") then | if (R.path starts-with "/") then | |||
| T.path = R.path; | T.path = R.path; | |||
| else | else | |||
| T.path = merge(Base.path, R.path); | T.path = merge(Base.path, R.path); | |||
| endif; | endif; | |||
| T.query = R.query; | T.query = R.query; | |||
| endif; | endif; | |||
| T.authority = Base.authority; | T.authority = Base.authority; | |||
| endif; | endif; | |||
| T.scheme = Base.scheme; | T.scheme = Base.scheme; | |||
| endif; | endif; | |||
| return (T, fragment); | T.fragment = R.fragment; | |||
| The pseudocode above refers to a merge routine for merging a | The pseudocode above refers to a merge routine for merging a | |||
| relative-path reference with the path of the base URI to obtain the | relative-path reference with the path of the base URI to obtain the | |||
| target path. Although there are many ways to do this, we will | target path. Although there are many ways to do this, we will | |||
| describe a simple method using a separate string buffer: | describe a simple method using a separate string buffer: | |||
| 1. All but the last segment of the base URI's path component is | 1. All but the last segment of the base URI's path component is | |||
| copied to the buffer. In other words, any characters after the | copied to the buffer. In other words, any characters after the | |||
| last (right-most) slash character, if any, are excluded. If the | last (right-most) slash character, if any, are excluded. If the | |||
| base URI's path component is the empty string, then a single | base URI's path component is the empty string, then a single | |||
| skipping to change at page 27, line 32 ¶ | skipping to change at page 29, line 26 ¶ | |||
| removing the leftmost matching pattern on each iteration, until | removing the leftmost matching pattern on each iteration, until | |||
| no matching pattern remains. | no matching pattern remains. | |||
| 6. If the buffer string ends with "<segment>/..", where <segment> is | 6. If the buffer string ends with "<segment>/..", where <segment> is | |||
| a complete path segment not equal to "..", that "<segment>/.." is | a complete path segment not equal to "..", that "<segment>/.." is | |||
| removed. | removed. | |||
| 7. If the resulting buffer string still begins with one or more | 7. If the resulting buffer string still begins with one or more | |||
| complete path segments of "..", then the reference is considered | complete path segments of "..", then the reference is considered | |||
| to be in error. Implementations may handle this error by | to be in error. Implementations may handle this error by | |||
| retaining these components in the resolved path (i.e., treating | removing them from the resolved path (i.e., discarding relative | |||
| them as part of the final URI), by removing them from the | levels above the root) or by avoiding traversal of the reference. | |||
| resolved path (i.e., discarding relative levels above the root), | ||||
| or by avoiding traversal of the reference. | ||||
| 8. The remaining buffer string is the target URI's path component. | 8. The remaining buffer string is the target URI's path component. | |||
| Some systems may find it more efficient to implement the merge | Some systems may find it more efficient to implement the merge | |||
| algorithm as a pair of path segment stacks being merged, rather than | algorithm as a pair of path segment stacks being merged, rather than | |||
| as a series of string pattern replacements. | as a series of string pattern replacements. | |||
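
The resolution pseudocode and the merge routine can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it follows the right-hand (-02) column with a validating parser, uses None for an undefined component, and assumes the base path begins with a slash. The names parse, merge, and resolve are illustrative, and the splitting regular expression is the commonly cited generic URI pattern rather than anything defined in this section. The merge uses a segment stack instead of the string-buffer steps, and ".." segments above the root are discarded (one of the error-handling options listed in step 7). The case of an empty base path is not handled, since that step is elided in this excerpt.

```python
import re

# Generic URI-splitting pattern (an assumption of this sketch).
_URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def parse(ref):
    """Split a reference into (scheme, authority, path, query, fragment).

    A component whose separator is absent comes back as None (undefined);
    the path is never None, though it may be the empty string.
    """
    m = _URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge(base_path, ref_path):
    """Merge a relative path with the base path using a segment stack."""
    # Keep everything up to and including the right-most slash of the base.
    buffer = base_path[: base_path.rfind("/") + 1] + ref_path
    segments = buffer.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if len(out) > 1:  # out[0] is the empty segment before the root "/"
                out.pop()
            # otherwise the reference climbs above the root: discard it
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def resolve(base, ref):
    """Return the target components T for reference R against Base."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    r_scheme, r_auth, r_path, r_query, r_frag = parse(ref)
    if r_scheme is not None:
        return (r_scheme, r_auth, r_path, r_query, r_frag)
    if r_auth is not None:
        return (b_scheme, r_auth, r_path, r_query, r_frag)
    if r_path == "":
        t_path = b_path
        t_query = r_query if r_query is not None else b_query
    else:
        t_path = r_path if r_path.startswith("/") else merge(b_path, r_path)
        t_query = r_query
    return (b_scheme, b_auth, t_path, t_query, r_frag)
```

For example, resolve("http://a/b/c/d;p?q", "../g") yields the components ("http", "a", "/b/g", None, None). As in the pseudocode, a reference path that already begins with a slash is copied without dot-segment removal.
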
| Note: Some WWW client applications will fail to separate the | Note: Some WWW client applications will fail to separate the | |||
| reference's query component from its path component before merging | reference's query component from its path component before merging | |||
| the base and reference paths. This may result in a loss of | the base and reference paths. This may result in a loss of | |||
| information if the query component contains the strings "/../" or | information if the query component contains the strings "/../" or | |||
| "/./". | "/./". | |||
| The resulting target URI components and fragment can be recombined to | 5.3 Recomposition of a Parsed URI | |||
| provide the absolute form of the URI reference. Using pseudocode, | ||||
| this would be: | Parsed URI components can be recombined to obtain the referenced URI. | |||
| Using pseudocode, this would be: | ||||
| result = "" | result = "" | |||
| if defined(T.scheme) then | if defined(T.scheme) then | |||
| append T.scheme to result; | append T.scheme to result; | |||
| append ":" to result; | append ":" to result; | |||
| endif; | endif; | |||
| if defined(T.authority) then | if defined(T.authority) then | |||
| append "//" to result; | append "//" to result; | |||
| append T.authority to result; | append T.authority to result; | |||
| endif; | endif; | |||
| append T.path to result; | append T.path to result; | |||
| if defined(T.query) then | if defined(T.query) then | |||
| append "?" to result; | append "?" to result; | |||
| append T.query to result; | append T.query to result; | |||
| endif; | endif; | |||
| if defined(fragment) then | if defined(T.fragment) then | |||
| append "#" to result; | append "#" to result; | |||
| append fragment to result; | append T.fragment to result; | |||
| endif; | endif; | |||
| return result; | return result; | |||
| Note that we must be careful to preserve the distinction between a | Note that we are careful to preserve the distinction between a | |||
| component that is undefined, meaning that its separator was not | component that is undefined, meaning that its separator was not | |||
| present in the reference, and a component that is empty, meaning that | present in the reference, and a component that is empty, meaning that | |||
| the separator was present and was immediately followed by the next | the separator was present and was immediately followed by the next | |||
| component separator or the end of the reference. | component separator or the end of the reference. | |||
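
The recomposition pseudocode translates almost line for line into Python. The sketch below is illustrative only: the function name recompose is hypothetical, and None is assumed to represent an undefined component, while the empty string represents a component that is present but empty.

```python
def recompose(scheme, authority, path, query, fragment):
    """Recombine parsed components into a URI reference string.

    None means "undefined" (separator absent); "" means "empty"
    (separator present, nothing after it). The path is never None.
    """
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

Keeping None and "" distinct is what preserves a trailing "?" or "#": recompose("http", "a", "/b", "", None) keeps the empty query as "http://a/b?", whereas passing None for the query would produce "http://a/b".
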
| Resolution examples are provided in Appendix C. | 5.4 Examples of Relative Resolution | |||
| 6. URI Normalization and Comparison | Within an object with a well-defined base URI of | |||
| http://a/b/c/d;p?q | ||||
| a relative URI reference would be resolved as follows: | ||||
| 5.4.1 Normal Examples | ||||
| "g:h" = "g:h" | ||||
| "g" = "http://a/b/c/g" | ||||
| "./g" = "http://a/b/c/g" | ||||
| "g/" = "http://a/b/c/g/" | ||||
| "/g" = "http://a/g" | ||||
| "//g" = "http://g" | ||||
| "?y" = "http://a/b/c/d;p?y" | ||||
| "g?y" = "http://a/b/c/g?y" | ||||
| "#s" = "http://a/b/c/d;p?q#s" | ||||
| "g#s" = "http://a/b/c/g#s" | ||||
| "g?y#s" = "http://a/b/c/g?y#s" | ||||
| ";x" = "http://a/b/c/;x" | ||||
| "g;x" = "http://a/b/c/g;x" | ||||
| "g;x?y#s" = "http://a/b/c/g;x?y#s" | ||||
| "." = "http://a/b/c/" | ||||
| "./" = "http://a/b/c/" | ||||
| ".." = "http://a/b/" | ||||
| "../" = "http://a/b/" | ||||
| "../g" = "http://a/b/g" | ||||
| "../.." = "http://a/" | ||||
| "../../" = "http://a/" | ||||
| "../../g" = "http://a/g" | ||||
| 5.4.2 Abnormal Examples | ||||
| Although the following abnormal examples are unlikely to occur in | ||||
| normal practice, all URI parsers should be capable of resolving them | ||||
| consistently. Each example uses the same base as above. | ||||
| An empty reference refers to the current base URI. | ||||
| "" = "http://a/b/c/d;p?q" | ||||
| Parsers must be careful in handling the case where there are more | ||||
| relative path ".." segments than there are hierarchical levels in the | ||||
| base URI's path. Note that the ".." syntax cannot be used to change | ||||
| the authority component of a URI. | ||||
| "../../../g" = "http://a/g" | ||||
| "../../../../g" = "http://a/g" | ||||
| Similarly, parsers should remove the dot-segments "." and ".." when | ||||
| they are complete components of a path, but not when they are only | ||||
| part of a segment. | ||||
| "/./g" = "http://a/g" | ||||
| "/../g" = "http://a/g" | ||||
| "g." = "http://a/b/c/g." | ||||
| ".g" = "http://a/b/c/.g" | ||||
| "g.." = "http://a/b/c/g.." | ||||
| "..g" = "http://a/b/c/..g" | ||||
| Less likely are cases where the relative URI uses unnecessary or | ||||
| nonsensical forms of the "." and ".." complete path segments. | ||||
| "./../g" = "http://a/b/g" | ||||
| "./g/." = "http://a/b/c/g/" | ||||
| "g/./h" = "http://a/b/c/g/h" | ||||
| "g/../h" = "http://a/b/c/h" | ||||
| "g;x=1/./y" = "http://a/b/c/g;x=1/y" | ||||
| "g;x=1/../y" = "http://a/b/c/y" | ||||
| Some applications fail to separate the reference's query and/or | ||||
| fragment components from a relative path before merging it with the | ||||
| base path. This error is rarely noticed, since typical usage of a | ||||
| fragment never includes the hierarchy ("/") character, and the query | ||||
| component is not normally used within relative references. | ||||
| "g?y/./x" = "http://a/b/c/g?y/./x" | ||||
| "g?y/../x" = "http://a/b/c/g?y/../x" | ||||
| "g#s/./x" = "http://a/b/c/g#s/./x" | ||||
| "g#s/../x" = "http://a/b/c/g#s/../x" | ||||
| Some parsers allow the scheme name to be present in a relative URI if | ||||
| it is the same as the base URI scheme. This is considered to be a | ||||
| loophole in prior specifications of partial URI [RFC1630]. Its use | ||||
| should be avoided, but is allowed for backward compatibility. | ||||
| "http:g" = "http:g" ; for validating parsers | ||||
| / "http://a/b/c/g" ; for backward compatibility | ||||
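
Since the examples are meant for testing the output of other implementations, a small harness can run a subset of them against an existing resolver. The sketch below uses urljoin from Python's standard urllib.parse module as the implementation under test. That choice, the helper name mismatches, and the selection of examples are assumptions of this illustration and not part of the draft.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# A subset of the normal examples listed above.
EXAMPLES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "g?y": "http://a/b/c/g?y",
    "g#s": "http://a/b/c/g#s",
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
}


def mismatches(base, table):
    """Return {reference: actual result} for every example that disagrees."""
    failed = {}
    for ref, expected in table.items():
        actual = urljoin(base, ref)
        if actual != expected:
            failed[ref] = actual
    return failed
```

An empty dictionary from mismatches(BASE, EXAMPLES) means the resolver agrees with every listed example. The abnormal examples can be added to the table in the same way.
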
| 6. Normalization and Comparison | ||||
| One of the most common operations on URIs is simple comparison: | One of the most common operations on URIs is simple comparison: | |||
| determining if two URIs are equivalent without using the URIs to | determining if two URIs are equivalent without using the URIs to | |||
| access their respective resource(s). A comparison is performed every | access their respective resource(s). A comparison is performed every | |||
| time a response cache is accessed, a browser checks its history to | time a response cache is accessed, a browser checks its history to | |||
| color a link, or an XML parser processes tags within a namespace. | color a link, or an XML parser processes tags within a namespace. | |||
| Extensive normalization prior to comparison of URIs is often used by | Extensive normalization prior to comparison of URIs is often used by | |||
| spiders and indexing engines to prune a search space or reduce | spiders and indexing engines to prune a search space or reduce | |||
| duplication of request actions and response storage. | duplication of request actions and response storage. | |||
| URI comparison is performed with respect to some particular purpose, | URI comparison is performed with respect to some particular purpose, | |||
| and software with differing purposes will often be subject to | and software with differing purposes will often be subject to | |||
| differing design trade-offs with regard to how much effort should be | differing design trade-offs with regard to how much effort should be | |||
| spent in reducing duplicate identifiers. This section describes a | spent in reducing duplicate identifiers. This section describes a | |||
| variety of methods that may be used to compare URIs, the trade-offs | variety of methods that may be used to compare URIs, the trade-offs | |||
| between them, and the types of applications that might use them. | between them, and the types of applications that might use them. | |||
| 6.1 URI Equivalence | 6.1 Equivalence | |||
| Since URIs exist to identify resources, presumably they should be | Since URIs exist to identify resources, presumably they should be | |||
| considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
| such a definition of equivalence is not of much practical use, since | such a definition of equivalence is not of much practical use, since | |||
| there is no way for software to compare two resources without | there is no way for software to compare two resources without | |||
| knowledge of their origin. For this reason, determination of | knowledge of their origin. For this reason, determination of | |||
| equivalence or difference of URIs is based on string comparison, | equivalence or difference of URIs is based on string comparison, | |||
| perhaps augmented by reference to additional rules provided by URI | perhaps augmented by reference to additional rules provided by URI | |||
| scheme definitions. We use the terms "different" and "equivalent" to | scheme definitions. We use the terms "different" and "equivalent" to | |||
| describe the possible outcomes of such comparisons, but there are | describe the possible outcomes of such comparisons, but there are | |||
| skipping to change at page 29, line 46 ¶ | skipping to change at page 33, line 46 ¶ | |||
| Even though it is possible to determine that two URIs are equivalent, | Even though it is possible to determine that two URIs are equivalent, | |||
| it is never possible to be sure that two URIs identify different | it is never possible to be sure that two URIs identify different | |||
| resources. Therefore, comparison methods are designed to minimize | resources. Therefore, comparison methods are designed to minimize | |||
| false negatives while strictly avoiding false positives. | false negatives while strictly avoiding false positives. | |||
| In testing for equivalence, it is generally unwise to directly | In testing for equivalence, it is generally unwise to directly | |||
| compare relative URI references; they should be converted to their | compare relative URI references; they should be converted to their | |||
| absolute forms before comparison. Furthermore, when URI references | absolute forms before comparison. Furthermore, when URI references | |||
| are being compared for the purpose of selecting (or avoiding) a | are being compared for the purpose of selecting (or avoiding) a | |||
| network action, such as retrieval of a representation, it is often | network action, such as retrieval of a representation, it is often | |||
| necessary to separate fragment identifiers from the URIs prior to | necessary to remove fragment identifiers from the URIs prior to | |||
| comparison. | comparison. | |||
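
The two precautions in the paragraph above, resolving relative references first and dropping fragments when the comparison guards a network action, can be sketched as follows. This is a hedged illustration: the helper name same_retrieval_target is hypothetical, and Python's urllib.parse functions urljoin and urldefrag stand in for whatever resolver an application actually uses.

```python
from urllib.parse import urldefrag, urljoin


def same_retrieval_target(base, ref_a, ref_b):
    """Compare two references for the purpose of a retrieval action.

    Each reference is first made absolute against the base URI, then its
    fragment identifier is removed, and only then are the strings compared.
    """
    abs_a, _fragment_a = urldefrag(urljoin(base, ref_a))
    abs_b, _fragment_b = urldefrag(urljoin(base, ref_b))
    return abs_a == abs_b
```

With a base of "http://a/b/c/d", the references "g#top" and "./g#bottom" compare as equivalent under this test, although a direct string comparison of the two relative references would report them as different.
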
| 6.2 Comparison Ladder | 6.2 Comparison Ladder | |||
| A variety of methods are used in practice to test URI equivalence. | A variety of methods are used in practice to test URI equivalence. | |||
| These methods fall into a range, distinguished by the amount of | These methods fall into a range, distinguished by the amount of | |||
| processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
| negatives is reduced. As noted above, false negatives cannot in | negatives is reduced. As noted above, false negatives cannot in | |||
| principle be eliminated. In practice, their probability can be | principle be eliminated. In practice, their probability can be | |||
| reduced, but this reduction requires more processing and is not | reduced, but this reduction requires more processing and is not | |||
| skipping to change at page 30, line 29 ¶ | skipping to change at page 34, line 29 ¶ | |||
| is safe to conclude that they are equivalent. This type of | is safe to conclude that they are equivalent. This type of | |||
| equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
| in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing. | |||
| Testing strings for equivalence requires some basic precautions. This | Testing strings for equivalence requires some basic precautions. This | |||
| procedure is often referred to as "bit-for-bit" or "byte-for-byte" | procedure is often referred to as "bit-for-bit" or "byte-for-byte" | |||
| comparison, which is potentially misleading. Testing of strings for | comparison, which is potentially misleading. Testing of strings for | |||
| equality is normally based on pairwise comparison of the characters | equality is normally based on pairwise comparison of the characters | |||
| that make up the strings, starting from the first and proceeding | that make up the strings, starting from the first and proceeding | |||
| until both strings are exhausted and all characters found to be | until both strings are exhausted and all characters found to be | |||
| equal, or a pair of characters compares unequal or one of the strings | equal, a pair of characters compares unequal, or one of the strings | |||
| is exhausted before the other. | is exhausted before the other. | |||
| Such character comparisons require that each pair of characters be | Such character comparisons require that each pair of characters be | |||
| put in comparable form. For example, should one URI be stored in a | put in comparable form. For example, should one URI be stored in a | |||
| byte array in EBCDIC encoding, and the second be in a Java String | byte array in EBCDIC encoding, and the second be in a Java String | |||
| object, bit-for-bit comparisons applied naively will produce both | object, bit-for-bit comparisons applied naively will produce both | |||
| false-positive and false-negative errors. Thus, in principle, it is | false-positive and false-negative errors. Thus, in principle, it is | |||
| better to speak of equality on a character-for-character rather than | better to speak of equality on a character-for-character rather than | |||
| byte-for-byte or bit-for-bit basis. | byte-for-byte or bit-for-bit basis. | |||
| Unicode defines a character as being identified by number | Unicode defines a character as being identified by number | |||
| ("codepoint") with an associated bundle of visual and other | ("codepoint") with an associated bundle of visual and other | |||
| semantics. At the software level, it is not practical to compare | semantics. At the software level, it is not practical to compare | |||
| semantic bundles, so in practical terms, character-by-character | semantic bundles, so in practical terms, character-by-character | |||
| comparisons are done codepoint-by-codepoint. | comparisons are done codepoint-by-codepoint. | |||
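The point about comparable form can be illustrated with a minimal sketch. It assumes Python, and it assumes the "cp037" codec as a stand-in for "EBCDIC"; the function name same_characters is hypothetical and not part of this specification. Both inputs are reduced to sequences of codepoints before being compared.

```python
def same_characters(raw: bytes, raw_encoding: str, text: str) -> bool:
    """Compare a byte-encoded URI with a string URI, codepoint by codepoint."""
    decoded = raw.decode(raw_encoding)
    return [ord(c) for c in decoded] == [ord(c) for c in text]


uri = "example://a/b/c"
# Assumption: cp037 is used here as one concrete EBCDIC code page.
ebcdic_bytes = uri.encode("cp037")

# The raw octets differ from the ASCII octets, so a naive byte-for-byte
# test would report a false negative; the codepoint comparison does not.
octets_differ = ebcdic_bytes != uri.encode("ascii")
characters_match = same_characters(ebcdic_bytes, "cp037", uri)
```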
| 6.2.2 Syntax-based Normalization | 6.2.2 Syntax-based Normalization | |||
| Software may use logic based on the definitions provided by this | Software may use logic based on the definitions provided by this | |||
| specification to reduce the probability of false negatives. Such | specification to reduce the probability of false negatives. Such | |||
| processing is (moderately) higher in cost than | processing is moderately higher in cost than character-for-character | |||
| character-for-character string comparison. For example, an | string comparison. For example, an application using this approach | |||
| application using this approach could reasonably consider the | could reasonably consider the following two URIs equivalent: | |||
| following two URIs equivalent: | ||||
| example://a/b/c/%7A | example://a/b/c/%7A | |||
| eXAMPLE://a/./b/../b/c/%7a | eXAMPLE://a/./b/../b/c/%7a | |||
| Web user agents, such as browsers, typically apply this type of URI | Web user agents, such as browsers, typically apply this type of URI | |||
| normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
| available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
| case normalization, escape normalization, and removal of leftover | case normalization, escape normalization, and removal of leftover | |||
| relative path segments. | relative path segments. | |||
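As a rough sketch of the removal of leftover relative path segments, the following Python function collapses "." and ".." segments in a path that begins with "/". The name remove_dot_segments is hypothetical. Discarding a ".." that would climb above the root is a simplifying assumption of this sketch, not a rule taken from this specification.

```python
def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments in a path that starts with "/"."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Drop the previous segment, but never the leading empty one
            # (assumption: excess ".." segments are silently discarded).
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." leaves the path ending in "/".
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)
```

Applied to the path of the second example URI above, "/a/./b/../b/c/%7a", this yields "/a/b/c/%7a"; the escape itself is left for escape normalization.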
| 6.2.2.1 Case Normalization | 6.2.2.1 Case Normalization | |||
| When a URI scheme uses elements of the common syntax, it will also | When a URI scheme uses components of the generic syntax, it will also | |||
| use the common syntax equivalence rules, namely that the scheme and | use the common syntax equivalence rules, namely that the scheme and | |||
| hostname are case insensitive and therefore can be normailized to | hostname are case insensitive and therefore can be normalized to | |||
| lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | |||
| equivalent to <http://www.example.com/>. | equivalent to <http://www.example.com/>. | |||
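A minimal sketch of case normalization, assuming Python's urllib.parse module for splitting the URI. The function name normalize_case is hypothetical. Only the scheme and the host (with port) are lowercased; userinfo, path, query, and fragment are left untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(uri: str) -> str:
    """Lowercase the scheme and the host, leaving other components as-is."""
    parts = urlsplit(uri)
    # Keep any userinfo exactly as given; only the host[:port] is folded.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )
```

With this sketch, normalize_case("HTTP://www.EXAMPLE.com/") returns "http://www.example.com/".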
| 6.2.2.2 Escape Normalization | 6.2.2.2 Escape Normalization | |||
| The %-escape mechanism described in Section 2.4 is a frequent source | The percent-escape mechanism described in Section 2.4 is a frequent | |||
| of variance among otherwise identical URIs. One cause is the choice | source of variance among otherwise identical URIs. One cause is the | |||
| of upper-case or lower-case letters for the hexadecimal digits within | choice of uppercase or lowercase letters for the hexadecimal digits | |||
| the escape sequence (e.g., "%3a" versus "%3A"). Such sequences are | within the escape sequence (e.g., "%3a" versus "%3A"). Such sequences | |||
| always equivalent; for the sake of uniformity, URI generators and | are always equivalent; for the sake of uniformity, URI generators and | |||
| normalizers are strongly encouraged to use upper-case letters for the | normalizers are strongly encouraged to use uppercase letters for the | |||
| hex digits A-F. | hex digits A-F. | |||
| Only characters that are excluded from or reserved within the URI | Only characters that are excluded from or reserved within the URI | |||
| syntax must be escaped when used as data. However, some URI | syntax must be escaped when used as data. However, some URI | |||
| generators go beyond that and escape characters that do not require | generators go beyond that and escape characters that do not require | |||
| escaping, resulting in URIs that are equivalent to their unescaped | escaping, resulting in URIs that are equivalent to their unescaped | |||
| counterparts. Such URIs can be normalized by unescaping sequences | counterparts. Such URIs can be normalized by unescaping sequences | |||
| that represent the unreserved characters, as described in Section | that represent the unreserved characters, as described in Section | |||
| 2.3. | 2.3. | |||
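The two forms of escape normalization described above can be sketched as follows. The function name normalize_escapes is hypothetical, and the set of unreserved characters used here (letters, digits, "-", ".", "_", "~") is an assumption made for illustration; Section 2.3 remains the authoritative list.

```python
import re
import string

# Assumption: a conservative unreserved set; see Section 2.3 for the real one.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(uri: str) -> str:
    """Unescape unreserved characters; uppercase the hex digits of the rest."""

    def fix(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return ESCAPE.sub(fix, uri)
```

Under these assumptions, "%7a" and "%7A" both become "z", while "%3a" becomes "%3A" because ":" is not in the assumed unreserved set.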
| skipping to change at page 32, line 25 ¶ | skipping to change at page 36, line 24 ¶ | |||
| the probability of false negatives. For example, Web spiders that | the probability of false negatives. For example, Web spiders that | |||
| populate most large search engines would consider the following two | populate most large search engines would consider the following two | |||
| URIs to be equivalent: | URIs to be equivalent: | |||
| http://example.com/ | http://example.com/ | |||
| http://example.com:80/ | http://example.com:80/ | |||
| This behavior is based on the rules provided by the syntax and | This behavior is based on the rules provided by the syntax and | |||
| semantics of the "http" URI scheme, which defines an empty port | semantics of the "http" URI scheme, which defines an empty port | |||
| component as being equivalent to the default TCP port for HTTP (port | component as being equivalent to the default TCP port for HTTP (port | |||
| 80). In general, a URI scheme that uses the generic syntax of | 80). In general, a URI scheme that uses the generic syntax for | |||
| hostport is defined such that a URI with an explicit ":port", where | authority is defined such that a URI with an explicit ":port", where | |||
| the port is the default for the scheme, is equivalent to one where | the port is the default for the scheme, is equivalent to one where | |||
| the port is elided. | the port is elided. | |||
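A minimal sketch of this kind of scheme-based normalization, assuming Python's urllib.parse module. The function name strip_default_port and the DEFAULT_PORTS table are hypothetical; the table lists only the "http" default named above, and a real implementation would need one entry per supported scheme.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only the default discussed in the text is listed.
DEFAULT_PORTS = {"http": "80"}


def strip_default_port(uri: str) -> str:
    """Remove an explicit port that is empty or the default for the scheme."""
    parts = urlsplit(uri)
    host, colon, port = parts.netloc.rpartition(":")
    if colon and (port == "" or port == DEFAULT_PORTS.get(parts.scheme)):
        netloc = host
    else:
        netloc = parts.netloc
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

With this sketch, "http://example.com:80/" and "http://example.com/" both map to "http://example.com/", while "http://example.com:8080/" is left unchanged.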
| 6.2.4 Protocol-based Normalization | 6.2.4 Protocol-based Normalization | |||
| Web spiders, for which substantial effort to reduce the incidence of | Web spiders, for which substantial effort to reduce the incidence of | |||
| false negatives is often cost-effective, are observed to implement | false negatives is often cost-effective, are observed to implement | |||
| even more aggressive techniques in URI comparison. For example, if | even more aggressive techniques in URI comparison. For example, if | |||
| they observe that a URI such as | they observe that a URI such as | |||
| http://example.com/data | http://example.com/data | |||
| redirects to | redirects to | |||
| http://example.com/data/ | http://example.com/data/ | |||
| they will likely regard the two as equivalent in the future. | they will likely regard the two as equivalent in the future. | |||
| Obviously, this kind of technique is only appropriate in special | Obviously, this kind of technique is only appropriate in special | |||
| situations. | situations. | |||
| 6.3 Good Practice When Using URIs | 6.3 Canonical Form | |||
| It is in the best interests of everyone to avoid false negatives in | It is in the best interests of everyone to avoid false negatives in | |||
| comparing URIs, and to only require the minimum amount of software | comparing URIs and to minimize the amount of software processing for | |||
| processing for such comparisons. Those who generate and make | such comparisons. Those who generate and make reference to URIs can | |||
| reference to URIs can reduce the cost of processing and the risk of | reduce the cost of processing and the risk of false negatives by | |||
| false negatives by consistently providing them in a form that is | consistently providing them in a form that is reasonably canonical | |||
| reasonably canonical with respect to their scheme. Specifically: | with respect to their scheme. Specifically: | |||
| Always provide the URI scheme in lower-case characters. | ||||
| Always provide the hostname, if any, in lower-case characters. | Always provide the URI scheme in lowercase characters. | |||
| Only perform %-escaping where it is essential. | Always provide the hostname, if any, in lowercase characters. | |||
| Always use upper-case A-through-F characters when %-escaping. | Only perform percent-escaping where it is essential. | |||
| Use the UTF-8 character-to-octet mapping, whenever possible. | Always use uppercase A-through-F characters when percent-escaping. | |||
| Prevent /./ and /../ from appearing in absolute URI paths. | Prevent /./ and /../ from appearing in non-relative URI paths. | |||
| The choices listed above are motivated by observations that a high | The good practices listed above are motivated by observations that a | |||
| proportion of deployed software already use these techniques in | high proportion of deployed software use these techniques for the | |||
| practice for the purposes of normalization. | purposes of normalization. | |||
| 7. Security Considerations | 7. Security Considerations | |||
| A URI does not in itself pose a security threat. However, since URIs | A URI does not in itself pose a security threat. However, since URIs | |||
| are often used to provide a compact set of instructions for access to | are often used to provide a compact set of instructions for access to | |||
| network resources, care must be taken to properly interpret the data | network resources, care must be taken to properly interpret the data | |||
| within a URI, to prevent that data from causing unintended access, | within a URI, to prevent that data from causing unintended access, | |||
| and to avoid including data that should not be revealed in plain | and to avoid including data that should not be revealed in plain | |||
| text. | text. | |||
| 7.1 Reliability and Consistency | 7.1 Reliability and Consistency | |||
| There is no guarantee that, having once used a given URI to retrieve | There is no guarantee that, having once used a given URI to retrieve | |||
| some information, that the same information will be retievable by | some information, that the same information will be retrievable by | |||
| that URI in the future. Nor is there any guarantee that the | that URI in the future. Nor is there any guarantee that the | |||
| information retrievable via that URI in the future will be observably | information retrievable via that URI in the future will be observably | |||
| similar to that retrieved in the past. The URI syntax does not | similar to that retrieved in the past. The URI syntax does not | |||
| constrain how a given scheme or authority apportions its namespace or | constrain how a given scheme or authority apportions its name space | |||
| maintains it over time. Such a guarantee can only be obtained from | or maintains it over time. Such a guarantee can only be obtained | |||
| the person(s) controlling that namespace and the resource in | from the person(s) controlling that name space and the resource in | |||
| question. A specific URI scheme may define additional semantics, | question. A specific URI scheme may define additional semantics, | |||
| such as name persistence, if those semantics are required of all | such as name persistence, if those semantics are required of all | |||
| naming authorities for that scheme. | naming authorities for that scheme. | |||
| 7.2 Malicious Construction | 7.2 Malicious Construction | |||
| It is sometimes possible to construct a URI such that an attempt to | It is sometimes possible to construct a URI such that an attempt to | |||
| perform a seemingly harmless, idempotent operation, such as the | perform a seemingly harmless, idempotent operation, such as the | |||
| retrieval of a representation associated with a resource, will in | retrieval of a representation, will in fact cause a possibly damaging | |||
| fact cause a possibly damaging remote operation to occur. The unsafe | remote operation to occur. The unsafe URI is typically constructed | |||
| URI is typically constructed by specifying a port number other than | by specifying a port number other than that reserved for the network | |||
| that reserved for the network protocol in question. The client | protocol in question. The client unwittingly contacts a site that is | |||
| unwittingly contacts a site that is in fact running a different | running a different protocol service. The content of the URI | |||
| protocol. The content of the URI contains instructions that, when | contains instructions that, when interpreted according to this other | |||
| interpreted according to this other protocol, cause an unexpected | protocol, cause an unexpected operation. An example has been the use | |||
| operation. An example has been the use of a gopher URI to cause an | of a gopher URI to cause an unintended or impersonating message to be | |||
| unintended or impersonating message to be sent via an SMTP server. | sent via an SMTP server. | |||
| Caution should be used when using any URI that specifies a TCP port | Caution should be used when dereferencing a URI that specifies a TCP | |||
| number other than the default for the protocol, especially when it is | port number other than the default for the scheme, especially when it | |||
| a number within the reserved space. | is a number within the reserved space. | |||
| Care should be taken when a URI contains escaped delimiters for a | Care should be taken when a URI contains escaped delimiters for a | |||
| given protocol (for example, CR and LF characters for telnet | given protocol (for example, CR and LF characters for telnet | |||
| protocols) that these are not unescaped before transmission. This | protocols) that these octets are not unescaped before transmission. | |||
| might violate the protocol, but avoids the potential for such | This might violate the protocol, but avoids the potential for such | |||
| characters to be used to simulate an extra operation or parameter in | characters to be used to simulate an extra operation or parameter in | |||
| that protocol, which might lead to an unexpected and possibly harmful | that protocol which might lead to an unexpected and possibly harmful | |||
| remote operation being performed. | remote operation being performed. | |||
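One cautious way to act on this advice is to detect escaped line delimiters before a URI is handed to a line-oriented protocol. The sketch below assumes Python; the function name has_escaped_line_break and the sample URIs in the comments are hypothetical. It only flags the escapes and does not decide what an application should do about them.

```python
import re

# Matches an escaped CR (%0D) or LF (%0A), in either hex-digit case.
ESCAPED_LINE_BREAK = re.compile(r"%0[AD]", re.IGNORECASE)


def has_escaped_line_break(uri: str) -> bool:
    """Report whether the URI contains an escaped CR or LF octet."""
    return ESCAPED_LINE_BREAK.search(uri) is not None


# Hypothetical inputs, for illustration only:
#   has_escaped_line_break("gopher://example.com:25/1HELO%0d%0a")  is True
#   has_escaped_line_break("http://example.com/a%20b")             is False
```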
| 7.3 Rare IP Address Formats | 7.3 Rare IP Address Formats | |||
| Although the URI syntax for IPv4address only allows the common, | Although the URI syntax for IPv4address only allows the common, | |||
| dotted-decimal form of IPv4 address literal, many implementations | dotted-decimal form of IPv4 address literal, many implementations | |||
| that process URIs make use of platform-dependent system routines, | that process URIs make use of platform-dependent system routines, | |||
| such as gethostbyname() and inet_aton(), to translate the string | such as gethostbyname() and inet_aton(), to translate the string | |||
| literal to an actual IP address. Unfortunately, such system routines | literal to an actual IP address. Unfortunately, such system routines | |||
| often allow and process a much larger set of formats than those | often allow and process a much larger set of formats than those | |||
| skipping to change at page 36, line 15 ¶ | skipping to change at page 40, line 5 ¶ | |||
| 7.5 Semantic Attacks | 7.5 Semantic Attacks | |||
| Because the userinfo component is rarely used and appears before the | Because the userinfo component is rarely used and appears before the | |||
| hostname in the authority component, it can be used to construct a | hostname in the authority component, it can be used to construct a | |||
| URI that is intended to mislead a human user by appearing to identify | URI that is intended to mislead a human user by appearing to identify | |||
| one (trusted) naming authority while actually identifying a different | one (trusted) naming authority while actually identifying a different | |||
| authority hidden behind the noise. For example | authority hidden behind the noise. For example | |||
| http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm | http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm | |||
| might lead a human user to assume that the authority is | might lead a human user to assume that the host is 'www.example.com', | |||
| 'www.example.com', whereas it is actually '10.0.0.1'. Note that the | whereas it is actually '10.0.0.1'. Note that the misleading userinfo | |||
| misleading userinfo could be much longer than the example above. | could be much longer than the example above. | |||
| A misleading URI, such as the one above, is an attack on the user's | A misleading URI, such as the one above, is an attack on the user's | |||
| preconceived notions about the meaning of a URI, rather than an | preconceived notions about the meaning of a URI, rather than an | |||
| attack on the software itself. User agents may be able to reduce the | attack on the software itself. User agents may be able to reduce the | |||
| impact of such attacks by visually distinguishing the various | impact of such attacks by visually distinguishing the various | |||
| components of the URI when rendered, such as by using a different | components of the URI when rendered, such as by using a different | |||
| color or tone to render userinfo if any is present, though there is | color or tone to render userinfo if any is present, though there is | |||
| no general panacea. More information on URI-based semantic attacks | no general panacea. More information on URI-based semantic attacks | |||
| can be found in [Siedzik]. | can be found in [Siedzik]. | |||
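The misleading example above can be taken apart mechanically. The following sketch assumes Python's urllib.parse module and shows that the text before the "@" is userinfo and that the host is the address after it. The variable names are chosen for illustration.

```python
from urllib.parse import urlsplit

uri = "http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm"
parts = urlsplit(uri)

# Everything in the authority before the last "@" is userinfo, not a host.
userinfo = parts.netloc.rpartition("@")[0]
host = parts.hostname
```

A user agent could render the two values differently, as suggested above, once they have been separated in this way.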
| 8. Acknowledgements | 8. Acknowledgments | |||
| This document is derived from RFC 2396 [RFC2396], RFC 1808 [RFC1808], | This document is derived from RFC 2396 [RFC2396], RFC 1808 [RFC1808], | |||
| and RFC 1738 [RFC1738]; the acknowledgements in those specifications | and RFC 1738 [RFC1738]; the acknowledgments in those specifications | |||
| still apply. It also incorporates the update (with corrections) for | still apply. It also incorporates the update (with corrections) for | |||
| IPv6 literals in the host syntax, as defined by Robert M. Hinden, | IPv6 literals in the host syntax, as defined by Robert M. Hinden, | |||
| Brian E. Carpenter, and Larry Masinter in [RFC2732]. In addition, | Brian E. Carpenter, and Larry Masinter in [RFC2732]. In addition, | |||
| contributions by Reese Anschultz, Tim Bray, Dan Connolly, Adam M. | contributions by Reese Anschultz, Tim Bray, Rob Cameron, Dan | |||
| Costello, Jason Diamond, Martin Duerst, Henry Holtzman, Graham Klyne, | Connolly, Adam M. Costello, Jason Diamond, Martin Duerst, Stefan | |||
| Dan Kohn, Bruce Lilly, Michael Mealling, Julian Reschke, Tomas | Eissing, Clive D.W. Feather, Pat Hayes, Henry Holtzman, Graham Klyne, | |||
| Rokicki, Miles Sabin, Ronald Tschalaer, Marc Warne, Henry Zongaro, | Dan Kohn, Bruce Lilly, Andrew Main, Michael Mealling, Julian Reschke, | |||
| and Zefram are gratefully acknowledged. | Tomas Rokicki, Miles Sabin, Ronald Tschalaer, Marc Warne, Stuart | |||
| Williams, and Henry Zongaro are gratefully acknowledged. | ||||
| Normative References | Normative References | |||
| [ASCII] American National Standards Institute, "Coded Character | [ASCII] American National Standards Institute, "Coded Character | |||
| Set -- 7-bit American Standard Code for Information | Set -- 7-bit American Standard Code for Information | |||
| Interchange", ANSI X3.4, 1986. | Interchange", ANSI X3.4, 1986. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| Non-normative References | Informative References | |||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | |||
| Unifying Syntax for the Expression of Names and Addresses | Unifying Syntax for the Expression of Names and Addresses | |||
| of Objects on the Network as used in the World-Wide Web", | of Objects on the Network as used in the World-Wide Web", | |||
| RFC 1630, June 1994. | RFC 1630, June 1994. | |||
| [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | |||
| skipping to change at page 39, line 39 ¶ | skipping to change at page 43, line 39 ¶ | |||
| Extensions (MIME) Part Two: Media Types", RFC 2046, | Extensions (MIME) Part Two: Media Types", RFC 2046, | |||
| November 1996. | November 1996. | |||
| [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | |||
| Jensen, "HTTP Extensions for Distributed Authoring -- | Jensen, "HTTP Extensions for Distributed Authoring -- | |||
| WEBDAV", RFC 2518, February 1999. | WEBDAV", RFC 2518, February 1999. | |||
| [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | |||
| host table specification", RFC 952, October 1985. | host table specification", RFC 952, October 1985. | |||
| [RFC2373] Hinden, R. and S. Deering, "IP Version 6 Addressing | [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | |||
| Architecture", RFC 2373, July 1998. | (IPv6) Addressing Architecture", RFC 3513, April 2003. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | |||
| [RFC1736] Kunze, J., "Functional Recommendations for Internet | [RFC1736] Kunze, J., "Functional Recommendations for Internet | |||
| Resource Locators", RFC 1736, February 1995. | Resource Locators", RFC 1736, February 1995. | |||
| [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for | [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for | |||
| Uniform Resource Names", RFC 1737, December 1994. | Uniform Resource Names", RFC 1737, December 1994. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | ||||
| [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | |||
| STD 13, RFC 1034, November 1987. | STD 13, RFC 1034, November 1987. | |||
| [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of | [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of | |||
| Aggregate Documents, such as HTML (MHTML)", RFC 2110, | Aggregate Documents, such as HTML (MHTML)", RFC 2110, | |||
| March 1997. | March 1997. | |||
| [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | |||
| Scheme Names", BCP 35, RFC 2717, November 1999. | Scheme Names", BCP 35, RFC 2717, November 1999. | |||
| skipping to change at page 41, line 4 ¶ | skipping to change at page 45, line 29 ¶ | |||
| Roy T. Fielding | Roy T. Fielding | |||
| Day Software | Day Software | |||
| 2 Corporate Plaza, Suite 150 | 2 Corporate Plaza, Suite 150 | |||
| Newport Beach, CA 92660 | Newport Beach, CA 92660 | |||
| USA | USA | |||
| Phone: +1-949-999-2523 | Phone: +1-949-999-2523 | |||
| Fax: +1-949-644-5064 | Fax: +1-949-644-5064 | |||
| EMail: roy.fielding@day.com | EMail: roy.fielding@day.com | |||
| URI: http://www.apache.org/~fielding/ | URI: http://www.apache.org/~fielding/ | |||
| Larry Masinter | Larry Masinter | |||
| Adobe Systems Incorporated | Adobe Systems Incorporated | |||
| 345 Park Ave | 345 Park Ave | |||
| San Jose, CA 95110 | San Jose, CA 95110 | |||
| USA | USA | |||
| Phone: +1-408-536-3024 | Phone: +1-408-536-3024 | |||
| EMail: LMM@acm.org | EMail: LMM@acm.org | |||
| URI: http://larry.masinter.net/ | URI: http://larry.masinter.net/ | |||
| Appendix A. Collected BNF for URI | Appendix A. Collected ABNF for URI | |||
| To be filled in later. | To be filled in later. | |||
| Appendix B. Parsing a URI Reference with a Regular Expression | Appendix B. Parsing a URI Reference with a Regular Expression | |||
| As described in Section 4.3, the generic URI syntax is not sufficient | Since the "first-match-wins" algorithm is identical to the "greedy" | |||
| to disambiguate the components of some forms of URI. Since the | ||||
| "greedy algorithm" described in that section is identical to the | ||||
| disambiguation method used by POSIX regular expressions, it is | disambiguation method used by POSIX regular expressions, it is | |||
| natural and commonplace to use a regular expression for parsing the | natural and commonplace to use a regular expression for parsing the | |||
| potential four components and fragment identifier of a URI reference. | potential five components of a URI reference. | |||
| The following line is the regular expression for breaking down a URI | The following line is the regular expression for breaking down a | |||
| reference into its components. | well-formed URI reference into its components. | |||
| ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | |||
| 12 3 4 5 6 7 8 9 | 12 3 4 5 6 7 8 9 | |||
| The numbers in the second line above are only to assist readability; | The numbers in the second line above are only to assist readability; | |||
| they indicate the reference points for each subexpression (i.e., each | they indicate the reference points for each subexpression (i.e., each | |||
| paired parenthesis). We refer to the value matched for subexpression | paired parenthesis). We refer to the value matched for subexpression | |||
| <n> as $<n>. For example, matching the above expression to | <n> as $<n>. For example, matching the above expression to | |||
| http://www.ics.uci.edu/pub/ietf/uri/#Related | http://www.ics.uci.edu/pub/ietf/uri/#Related | |||
| skipping to change at page 43, line 50 ¶ | skipping to change at page 47, line 48 ¶ | |||
| the case for the query component in the above example. Therefore, we | the case for the query component in the above example. Therefore, we | |||
| can determine the value of the four components and fragment as | can determine the value of the four components and fragment as | |||
| scheme = $2 | scheme = $2 | |||
| authority = $4 | authority = $4 | |||
| path = $5 | path = $5 | |||
| query = $7 | query = $7 | |||
| fragment = $9 | fragment = $9 | |||
| and, going in the opposite direction, we can recreate a URI reference | and, going in the opposite direction, we can recreate a URI reference | |||
| from its components using the algorithm of Section 5.2. | from its components using the algorithm of Section 5.3. | |||
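The mapping from subexpressions to components can be tried out with a short sketch. It assumes Python's re module, which is not a POSIX engine but is assumed to split the example above in the same way. The function name split_uri_reference is hypothetical. A component that is not present is reported as None, which corresponds to "undefined" in the discussion above.

```python
import re

# The expression from this appendix, unchanged.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference: str) -> dict:
    """Split a URI reference into its five potential components."""
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


components = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

For the example reference, the scheme is "http", the authority is "www.ics.uci.edu", the path is "/pub/ietf/uri/", the query is None, and the fragment is "Related".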
| Appendix C. Examples of Resolving Relative URI References | ||||
| Within an object with a well-defined base URI of | ||||
| http://a/b/c/d;p?q | ||||
| the relative URI would be resolved as follows: | ||||
| C.1 Normal Examples | ||||
| g:h = g:h | ||||
| g = http://a/b/c/g | ||||
| ./g = http://a/b/c/g | ||||
| g/ = http://a/b/c/g/ | ||||
| /g = http://a/g | ||||
| //g = http://g | ||||
| ?y = http://a/b/c/d;p?y | ||||
| g?y = http://a/b/c/g?y | ||||
| #s = (current document)#s | ||||
| g#s = http://a/b/c/g#s | ||||
| g?y#s = http://a/b/c/g?y#s | ||||
| ;x = http://a/b/c/;x | ||||
| g;x = http://a/b/c/g;x | ||||
| g;x?y#s = http://a/b/c/g;x?y#s | ||||
| . = http://a/b/c/ | ||||
| ./ = http://a/b/c/ | ||||
| .. = http://a/b/ | ||||
| ../ = http://a/b/ | ||||
| ../g = http://a/b/g | ||||
| ../.. = http://a/ | ||||
| ../../ = http://a/ | ||||
| ../../g = http://a/g | ||||
| C.2 Abnormal Examples | ||||
| Although the following abnormal examples are unlikely to occur in | ||||
| normal practice, all URI parsers should be capable of resolving them | ||||
| consistently. Each example uses the same base as above. | ||||
| An empty reference refers to the start of the current document. | ||||
| <> = (current document) | ||||
| Parsers must be careful in handling the case where there are more | ||||
| relative path ".." segments than there are hierarchical levels in the | ||||
| base URI's path. Note that the ".." syntax cannot be used to change | ||||
| the authority component of a URI. | ||||
| ../../../g = http://a/../g | ||||
| ../../../../g = http://a/../../g | ||||
| In practice, some implementations strip leading relative symbolic | ||||
| elements (".", "..") after applying a relative URI calculation, based | ||||
| on the theory that compensating for obvious author errors is better | ||||
| than allowing the request to fail. Thus, the above two references | ||||
| will be interpreted as "http://a/g" by some implementations. | ||||
| Similarly, parsers must avoid treating "." and ".." as special when | ||||
| they are not complete components of a relative path. | ||||
| /./g = http://a/./g | ||||
| /../g = http://a/../g | ||||
| g. = http://a/b/c/g. | ||||
| .g = http://a/b/c/.g | ||||
| g.. = http://a/b/c/g.. | ||||
| ..g = http://a/b/c/..g | ||||
| Less likely are cases where the relative URI uses unnecessary or | ||||
| nonsensical forms of the "." and ".." complete path segments. | ||||
| ./../g = http://a/b/g | ||||
| ./g/. = http://a/b/c/g/ | ||||
| g/./h = http://a/b/c/g/h | ||||
| g/../h = http://a/b/c/h | ||||
| g;x=1/./y = http://a/b/c/g;x=1/y | ||||
| g;x=1/../y = http://a/b/c/y | ||||
| Some applications fail to separate the reference's query and/or | ||||
| fragment components from a relative path before merging it with the | ||||
| base path. This error is rarely noticed, since typical usage of a | ||||
| fragment never includes the hierarchy ("/") character, and the query | ||||
| component is not normally used within relative references. | ||||
| g?y/./x = http://a/b/c/g?y/./x | ||||
| g?y/../x = http://a/b/c/g?y/../x | ||||
| g#s/./x = http://a/b/c/g#s/./x | ||||
| g#s/../x = http://a/b/c/g#s/../x | ||||
| Some parsers allow the scheme name to be present in a relative URI if | ||||
| it is the same as the base URI scheme. This is considered to be a | ||||
| loophole in prior specifications of partial URI [RFC1630]. Its use | ||||
| should be avoided, but is allowed for backwards compatibility. | ||||
| http:g = http:g ; for validating parsers | ||||
| / http://a/b/c/g ; for backwards compatibility | ||||
| Appendix D. Embedding the Base URI in HTML documents | Appendix C. Embedding the Base URI in HTML documents | |||
| It is useful to consider an example of how the base URI of a document | It is useful to consider an example of how the base URI of a document | |||
| can be embedded within the document's content. In this appendix, we | can be embedded within the document's content. In this appendix, we | |||
| describe how documents written in the Hypertext Markup Language | describe how documents written in the Hypertext Markup Language | |||
| (HTML) [HTML] can include an embedded base URI. This appendix does | (HTML) [HTML] can include an embedded base URI. This appendix does | |||
| not form a part of the URI specification and should not be considered | not form a part of the URI specification and should not be considered | |||
| as anything more than a descriptive example. | as anything more than a descriptive example. | |||
| HTML defines a special element "BASE" which, when present in the | HTML defines a special element "BASE" which, when present in the | |||
| "HEAD" portion of a document, signals that the parser should use the | "HEAD" portion of a document, signals that the parser should use the | |||
| BASE element's "HREF" attribute as the base URI for resolving any | BASE element's "HREF" attribute as the base URI for resolving any | |||
| relative URI. The "HREF" attribute must be an absolute URI. Note | relative URI. The "HREF" attribute must be an absolute URI. Note | |||
| that, in HTML, element and attribute names are case-insensitive. For | that, in HTML, element and attribute names are case-insensitive. For | |||
| example: | example: | |||
| <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"> | <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"> | |||
| <HTML><HEAD> | <HTML><HEAD> | |||
| <TITLE>An example HTML document</TITLE> | <TITLE>An example HTML document</TITLE> | |||
| <BASE href="http://www.example.com/Test/a/b/c"> | <BASE href="http://www.example.com/Test/a/b/c"> | |||
| </HEAD><BODY> | </HEAD><BODY> | |||
| ... <A href="../x">a hypertext anchor</A> ... | ... <A href="../x">a hypertext anchor</A> ... | |||
| </BODY></HTML> | </BODY></HTML> | |||
| A parser reading the example document should interpret the given | A parser reading the example document should interpret the given | |||
| relative URI "../x" as representing the absolute URI | relative URI "../x" as representing the absolute URI | |||
| <http://www.example.com/Test/a/x> | <http://www.example.com/Test/a/x> | |||
| regardless of the context in which the example document was obtained. | regardless of the context in which the example document was obtained. | |||
| Appendix E. Recommendations for Delimiting URI in Context | Appendix D. Delimiting a URI in Context | |||
| URIs are often transmitted through formats that do not provide a | URIs are often transmitted through formats that do not provide a | |||
| clear context for their interpretation. For example, there are many | clear context for their interpretation. For example, there are many | |||
| occasions when a URI is included in plain text; examples include text | occasions when a URI is included in plain text; examples include text | |||
| sent in electronic mail, USENET news messages, and, most importantly, | sent in electronic mail, USENET news messages, and, most importantly, | |||
| printed on paper. In such cases, it is important to be able to | printed on paper. In such cases, it is important to be able to | |||
| delimit the URI from the rest of the text, and in particular from | delimit the URI from the rest of the text, and in particular from | |||
| punctuation marks that might be mistaken for part of the URI. | punctuation marks that might be mistaken for part of the URI. | |||
| In practice, URI are delimited in a variety of ways, but usually | In practice, URI are delimited in a variety of ways, but usually | |||
| skipping to change at page 47, line 27 ¶ | skipping to change at page 49, line 27 ¶ | |||
| example.com/>, or just using whitespace | example.com/>, or just using whitespace | |||
| http://example.com/ | http://example.com/ | |||
| These wrappers do not form part of the URI. | These wrappers do not form part of the URI. | |||
| In the case where a fragment identifier is associated with a URI | In the case where a fragment identifier is associated with a URI | |||
| reference, the fragment would be placed within the brackets as well | reference, the fragment would be placed within the brackets as well | |||
| (separated from the URI with a "#" character). | (separated from the URI with a "#" character). | |||
| In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may | In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | |||
| need to be added to break a long URI across lines. The whitespace | need to be added to break a long URI across lines. The whitespace | |||
| should be ignored when extracting the URI. | should be ignored when extracting the URI. | |||
| No whitespace should be introduced after a hyphen ("-") character. | No whitespace should be introduced after a hyphen ("-") character. | |||
| Because some typesetters and printers may (erroneously) introduce a | Because some typesetters and printers may (erroneously) introduce a | |||
| hyphen at the end of line when breaking a line, the interpreter of a | hyphen at the end of line when breaking a line, the interpreter of a | |||
| URI containing a line break immediately after a hyphen should ignore | URI containing a line break immediately after a hyphen should ignore | |||
| all unescaped whitespace around the line break, and should be aware | all unescaped whitespace around the line break, and should be aware | |||
| that the hyphen may or may not actually be part of the URI. | that the hyphen may or may not actually be part of the URI. | |||
| skipping to change at page 48, line 9 ¶ | skipping to change at page 50, line 4 ¶ | |||
| designators, though it is not commonly used in practice and is no | designators, though it is not commonly used in practice and is no | |||
| longer recommended. | longer recommended. | |||
| For robustness, software that accepts user-typed URI should attempt | For robustness, software that accepts user-typed URI should attempt | |||
| to recognize and strip both delimiters and embedded whitespace. | to recognize and strip both delimiters and embedded whitespace. | |||
| For example, the text: | For example, the text: | |||
| Yes, Jim, I found it under "http://www.w3.org/Addressing/", | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |||
| but you can probably pick it up from <ftp://ds.internic. | but you can probably pick it up from <ftp://ds.internic. | |||
| net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |||
| ietf/uri/historical.html#WARNING>. | ietf/uri/historical.html#WARNING>. | |||
| contains the URI references | contains the URI references | |||
| http://www.w3.org/Addressing/ | http://www.w3.org/Addressing/ | |||
| ftp://ds.internic.net/rfc/ | ftp://ds.internic.net/rfc/ | |||
| http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | |||
| Appendix F. Abbreviated URIs | Appendix E. Summary of Non-editorial Changes | |||
| The URI syntax was designed for unambiguous reference to network | ||||
| resources and extensibility via the URI scheme. However, as URI | ||||
| identification and usage have become commonplace, traditional media | ||||
| (television, radio, newspapers, billboards, etc.) have increasingly | ||||
| used abbreviated URI references. That is, a reference consisting of | ||||
| only the authority and path portions of the identified resource, such | ||||
| as | ||||
| www.w3.org/Addressing/ | ||||
| or simply the DNS hostname on its own. Such references are primarily | ||||
| intended for human interpretation rather than machine, with the | ||||
| assumption that context-based heuristics are sufficient to complete | ||||
| the URI (e.g., most hostnames beginning with "www" are likely to have | ||||
| a URI prefix of "http://"). Although there is no standard set of | ||||
| heuristics for disambiguating abbreviated URI references, many client | ||||
| implementations allow them to be entered by the user and | ||||
| heuristically resolved. It should be noted that such heuristics may | ||||
| change over time, particularly when new URI schemes are introduced. | ||||
| Since an abbreviated URI has the same syntax as a relative URI path, | ||||
| abbreviated URI references cannot be used in contexts where relative | ||||
| URIs are expected. This limits the use of abbreviated URIs to places | ||||
| where there is no defined base URI, such as dialog boxes and off-line | ||||
| advertisements. | ||||
| Appendix G. Summary of Non-editorial Changes | ||||
| G.1 Additions | E.1 Additions | |||
| IPv6 literals have been added to the list of possible identifiers for | IPv6 literals have been added to the list of possible identifiers for | |||
| the host portion of a server component, as described by [RFC2732], | the host portion of a authority component, as described by [RFC2732], | |||
| with the addition of "[" and "]" to the reserved, uric, and | with the addition of "[" and "]" to the reserved and uric sets. | |||
| uric-no-slash sets. Square brackets are now specified as reserved | Square brackets are now specified as reserved within the authority | |||
| for the authority component, allowed within the opaque part of an | component and not allowed outside their use as delimiters for an | |||
| opaque URI, and not allowed in the hierarchical syntax except for | IPv6reference within host. In order to make this change without | |||
| their use as delimiters for an IPv6reference within host. In order | changing the technical definition of the path, query, and fragment | |||
| to make this change without changing the technical definition of the | components, those rules were redefined to directly specify the | |||
| path, query, and fragment components, those rules were redefined to | characters allowed rather than be defined in terms of uric. | |||
| directly specify the characters allowed rather than continuing to be | ||||
| defined in terms of uric. | ||||
| Since [RFC2732] defers to [RFC2373] for definition of an IPv6 literal | Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | |||
| address, which unfortunately has an incorrect ABNF description of | address, which unfortunately lacks an ABNF description of | |||
| IPv6address, we created a new ABNF rule for IPv6address that matches | IPv6address, we created a new ABNF rule for IPv6address that matches | |||
| the text representations defined by Section 2.2 of [RFC2373]. | the text representations defined by Section 2.2 of [RFC3513]. | |||
| Likewise, the definition of IPv4address has been improved in order to | Likewise, the definition of IPv4address has been improved in order to | |||
| limit each decimal octet to the range 0-255, and the definition of | limit each decimal octet to the range 0-255, and the definition of | |||
| hostname has been improved to better specify length limitations and | hostname has been improved to better specify length limitations and | |||
| partially-qualified domain names. | partially-qualified domain names. | |||
| Section 6 on URI normalization and comparison has been completely | Section 6 (Section 6) on URI normalization and comparison has been | |||
| rewritten and extended using input from Tim Bray and discussion | completely rewritten and extended using input from Tim Bray and | |||
| within the W3C Technical Architecture Group. | discussion within the W3C Technical Architecture Group. Likewise, | |||
| Section 2.1 on the encoding of characters has been replaced. | ||||
| G.2 Modifications from RFC 2396 | An ABNF production for URI has been introduced to correspond to the | |||
| common usage of the term: an absolute URI with optional fragment. | ||||
| E.2 Modifications from RFC 2396 | ||||
| The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | |||
| This change required all rule names that formerly included underscore | This change required all rule names that formerly included underscore | |||
| characters to be renamed with a dash instead. Likewise, absoluteURI | characters to be renamed with a dash instead. | |||
| and relativeURI have been changed to absolute-URI and relative-URI, | ||||
| respectively, for consistency. | ||||
| The ABNF of hier-part and relative-URI (Section 3) has been corrected | Section 2.2 on reserved characters has been rewritten to clearly | |||
| to allow a relative URI path to be empty. This also allows an | explain what characters are reserved, when they are reserved, and why | |||
| absolute-URI to consist of nothing after the "scheme:", as is present | they are reserved even when not used as delimiters by the generic | |||
| in practice with the "DAV:" namespace [RFC2518] and the "about:" URI | syntax. Likewise, the section on escaped characters has been | |||
| used by many browser implementations. | rewritten, and URI normalizers are now given license to unescape any | |||
| octets corresponding to unreserved characters. The crosshatch ("#") | ||||
| character has been moved back from the excluded delims to the | ||||
| reserved set. | ||||
| The ABNF for URI and URI-reference has been redesigned to make them | ||||
| more friendly to LALR parsers and significantly reduce complexity. As | ||||
| a result, the layout form of syntax description has been removed, | ||||
| along with the uric-no-slash, opaque-part, and rel-segment | ||||
| productions. All references to "opaque" URIs have been replaced with | ||||
| a better description of how the path component may be opaque to | ||||
| hierarchy. The fragment identifier has been moved back into the | ||||
| section on generic syntax components and within the URI and | ||||
| relative-URI productions, though it remains excluded from | ||||
| absolute-URI. The ambiguity regarding the parsing of URI-reference as | ||||
| a URI or a relative-URI with a colon in the first segment is now | ||||
| explained and disambiguated in the section defining relative-URI. | ||||
| The ABNF of hier-part and relative-URI has been corrected to allow a | ||||
| relative URI path to be empty. This also allows an absolute-URI to | ||||
| consist of nothing after the "scheme:", as is present in practice | ||||
| with the "DAV:" namespace [RFC2518] and the "about:" URI used by many | ||||
| browser implementations. The ambiguity regarding the parsing of | ||||
| net-path, abs-path, and rel-path is now explained and disambiguated | ||||
| in the same section. | ||||
| Registry-based naming authorities that use the hierarchical authority | ||||
| syntax component are now limited to DNS hostnames, since those have | ||||
| been the only such URIs in deployment. This change was necessary to | ||||
| enable internationalized domain names to be processed in their native | ||||
| character encodings at the application layers above URI processing. | ||||
| The reg_name, server, and hostport productions have been removed to | ||||
| simplify parsing of the URI syntax. | ||||
| The ABNF of qualified has been simplified to remove a parsing | The ABNF of qualified has been simplified to remove a parsing | |||
| ambiguity without changing the allowed syntax. | ambiguity without changing the allowed syntax. The toplabel | |||
| production has been removed because it served no useful purpose. The | ||||
| ambiguity regarding the parsing of host as IPv4address or hostname is | ||||
| now explained and disambiguated in the same section. | ||||
| The resolving relative references algorithm of [RFC2396] has been | The resolving relative references algorithm of [RFC2396] has been | |||
| rewritten using pseudocode for this revision to improve clarity and | rewritten using pseudocode for this revision to improve clarity and | |||
| fix the following issues: | fix the following issues: | |||
| o [RFC2396] section 5.2, step 6a, failed to account for a base URI | o [RFC2396] section 5.2, step 6a, failed to account for a base URI | |||
| with no path. | with no path. | |||
| o Restored the behavior of [RFC1808] where, if the the reference | o Restored the behavior of [RFC1808] where, if the reference | |||
| contains an empty path and a defined query component, then the | contains an empty path and a defined query component, then the | |||
| target URI inherits the base URI's path component. | target URI inherits the base URI's path component. | |||
| o Removed the special-case treatment of same-document references in | ||||
| favor of a section that explains that a new retrieval action | ||||
| should not be made if the target URI and base URI, excluding | ||||
| fragments, match. | ||||
| Index | Index | |||
| A | A | |||
| abs-path 14 | ABNF 9 | |||
| absolute-URI 14 | abs-path 15 | |||
| absolute-URI-reference 20 | absolute 9 | |||
| absolute-path 22 | ||||
| absolute-URI 23 | ||||
| access 7 | ||||
| alphanum 17 | alphanum 17 | |||
| authority 15 | authority 15, 16 | |||
| D | D | |||
| dec-octet 17 | dec-octet 17 | |||
| delims 12 | delims 13 | |||
| dereference 8 | ||||
| domainlabel 17 | domainlabel 17 | |||
| dot-segments 19 | ||||
| E | E | |||
| escaped 11 | escaped 12 | |||
| excluded 13 | ||||
| F | F | |||
| fragment 20 | fragment 20 | |||
| G | ||||
| generic syntax 5 | ||||
| H | H | |||
| h4 18 | h4 18 | |||
| hier-part 14 | hier-part 15 | |||
| host 16 | hierarchical 9 | |||
| host 17 | ||||
| hostname 17 | hostname 17 | |||
| hostport 16 | ||||
| I | I | |||
| identifier 5 | ||||
| invisible 13 | ||||
| IPv4 17 | IPv4 17 | |||
| IPv4address 17 | IPv4address 17 | |||
| IPv6 18 | IPv6 18 | |||
| IPv6address 18 | IPv6address 18 | |||
| IPv6reference 18 | IPv6reference 18 | |||
| L | L | |||
| locator 6 | ||||
| ls32 18 | ls32 18 | |||
| M | M | |||
| mark 11 | mark 11 | |||
| N | N | |||
| net-path 14 | name 6 | |||
| net-path 15 | ||||
| O | network-path 22 | |||
| opaque-part 14 | ||||
| P | P | |||
| path 18 | path 15, 19 | |||
| path-segments 18 | path-segments 19 | |||
| pchar 18 | pchar 19 | |||
| port 16 | port 18 | |||
| Q | Q | |||
| qualified 17 | qualified 17 | |||
| query 19 | query 20 | |||
| R | R | |||
| reg-name 16 | rel-path 15 | |||
| rel-path 22 | relative 9 | |||
| rel-segment 22 | relative-path 22 | |||
| relative-URI 22 | relative-URI 22 | |||
| representation 8 | ||||
| reserved 10 | reserved 10 | |||
| resolution 8 | ||||
| resource 4 | ||||
| retrieval 8 | ||||
| S | S | |||
| same-document 23 | ||||
| sameness 8 | ||||
| scheme 15 | scheme 15 | |||
| segment 18 | segment 19 | |||
| server 16 | suffix 23 | |||
| T | T | |||
| toplabel 17 | transcription 6 | |||
| U | U | |||
| uniform 4 | ||||
| unreserved 11 | unreserved 11 | |||
| unwise 12 | unwise 13 | |||
| URI grammar | URI grammar | |||
| abs-path 14 | abs-path 15 | |||
| absolute-URI 14 | absolute-URI 23 | |||
| absolute-URI-reference 20 | ALPHA 9 | |||
| alphanum 17 | alphanum 17 | |||
| authority 15 | authority 15, 16 | |||
| CR 9 | ||||
| CTL 9 | ||||
| dec-octet 17 | dec-octet 17 | |||
| delims 12 | DIGIT 9 | |||
| domainlabel 17 | domainlabel 17 | |||
| escaped 11 | DQUOTE 9 | |||
| fragment 20 | escaped 12 | |||
| fragment 15, 20, 22 | ||||
| h4 18 | h4 18 | |||
| hier-part 14 | HEXDIG 9 | |||
| host 17 | hier-part 15, 22, 23 | |||
| host 16, 17 | ||||
| hostname 17 | hostname 17 | |||
| hostport 17 | ||||
| IPv4address 17 | IPv4address 17 | |||
| IPv6address 18 | IPv6address 18 | |||
| IPv6reference 18 | IPv6reference 18 | |||
| LF 9 | ||||
| ls32 18 | ls32 18 | |||
| mark 11 | mark 11 | |||
| net-path 14 | net-path 15 | |||
| opaque-part 14 | OCTET 9 | |||
| path 18 | path-segments 15, 19 | |||
| path-segments 18 | pchar 19, 20, 20 | |||
| pchar 18 | port 16, 18 | |||
| port 17 | ||||
| qualified 17 | qualified 17 | |||
| query 19 | query 15, 20, 22, 23 | |||
| reg-name 16 | rel-path 15 | |||
| rel-path 22 | relative-URI 22, 22 | |||
| rel-segment 22 | reserved 11 | |||
| relative-URI 22 | scheme 15, 16, 23 | |||
| reserved 10 | segment 19 | |||
| scheme 15 | SP 9 | |||
| segment 18 | ||||
| server 16 | ||||
| toplabel 17 | ||||
| unreserved 11 | unreserved 11 | |||
| unwise 12 | URI 15, 22 | |||
| URI-reference 20 | URI-reference 22 | |||
| uric 9 | uric 10 | |||
| uric-no-slash 14 | userinfo 16, 16 | |||
| userinfo 16 | URI 15 | |||
| URI-reference 20 | URI-reference 22 | |||
| uric 9 | uric 10 | |||
| uric-no-slash 14 | URL 6 | |||
| URN 6 | ||||
| userinfo 16 | userinfo 16 | |||
| Intellectual Property Statement | Intellectual Property Statement | |||
| The IETF takes no position regarding the validity or scope of any | The IETF takes no position regarding the validity or scope of any | |||
| intellectual property or other rights that might be claimed to | intellectual property or other rights that might be claimed to | |||
| pertain to the implementation or use of the technology described in | pertain to the implementation or use of the technology described in | |||
| this document or the extent to which any license under such rights | this document or the extent to which any license under such rights | |||
| might or might not be available; neither does it represent that it | might or might not be available; neither does it represent that it | |||
| has made any effort to identify any such rights. Information on the | has made any effort to identify any such rights. Information on the | |||
| End of changes. 230 change blocks. | ||||
| 919 lines changed or deleted | 1080 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/
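
The appendix on delimiting a URI in context (Appendix E in -01, Appendix D in -02) says that software accepting URIs from plain text should recognize and strip both delimiters and embedded whitespace. The following is a minimal Python sketch of that recommendation, not part of either draft. The function name `extract_wrapped_uris` and the regular expression are assumptions made for illustration; the sketch handles only angle-bracket and double-quote wrappers and does not attempt the hyphen-at-line-break heuristic the appendix describes.

```python
import re

# Hypothetical helper: a URI wrapped in <...> or "..." may span several lines.
_WRAPPED = re.compile(r'<([^<>]*)>|"([^"]*)"')


def extract_wrapped_uris(text):
    """Return URI references found inside <...> or "..." wrappers in text."""
    found = []
    for match in _WRAPPED.finditer(text):
        # Exactly one of the two groups is set, depending on the wrapper used.
        candidate = match.group(1) if match.group(1) is not None else match.group(2)
        # Whitespace added to break a long URI across lines is not part of it.
        candidate = re.sub(r"\s+", "", candidate)
        if candidate:
            found.append(candidate)
    return found


# The example text from the appendix, with its original line breaks.
sample = '''Yes, Jim, I found it under "http://www.w3.org/Addressing/",
but you can probably pick it up from <ftp://ds.internic.
net/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
ietf/uri/historical.html#WARNING>.'''

for uri in extract_wrapped_uris(sample):
    print(uri)
```

Run against the appendix's example text, the sketch yields the same three URI references the appendix lists. Because it treats any double-quoted span as a candidate, a real implementation would also need to check that the extracted string is a plausible URI reference before using it.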