| < draft-fielding-uri-rfc2396bis-03.txt | draft-fielding-uri-rfc2396bis-04.txt > | |||
|---|---|---|---|---|
| Network Working Group T. Berners-Lee | Network Working Group T. Berners-Lee | |||
| Internet-Draft MIT/LCS | Internet-Draft MIT/LCS | |||
| Updates: 1738 (if approved) R. Fielding | Updates: 1738 (if approved) R. Fielding | |||
| Obsoletes: 2732, 2396, 1808 (if approved) Day Software | Obsoletes: 2732, 2396, 1808 (if approved) Day Software | |||
| L. Masinter | Expires: August 16, 2004 L. Masinter | |||
| Expires: December 5, 2003 Adobe | Adobe | |||
| June 6, 2003 | February 16, 2004 | |||
| Uniform Resource Identifier (URI): Generic Syntax | Uniform Resource Identifier (URI): Generic Syntax | |||
| draft-fielding-uri-rfc2396bis-03 | draft-fielding-uri-rfc2396bis-04 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that other | Task Force (IETF), its areas, and its working groups. Note that other | |||
| groups may also distribute working documents as Internet-Drafts. | groups may also distribute working documents as Internet-Drafts. | |||
| skipping to change at page 1, line 34 ¶ | skipping to change at page 1, line 33 ¶ | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| <http://www.ietf.org/ietf/1id-abstracts.txt>. | <http://www.ietf.org/ietf/1id-abstracts.txt>. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| <http://www.ietf.org/shadow.html>. | <http://www.ietf.org/shadow.html>. | |||
| | This Internet-Draft will expire on August 16, 2004. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| Abstract | Abstract | |||
| A Uniform Resource Identifier (URI) is a compact string of characters | A Uniform Resource Identifier (URI) is a compact string of characters | |||
| for identifying an abstract or physical resource. This specification | for identifying an abstract or physical resource. This specification | |||
| defines the generic URI syntax and a process for resolving URI | defines the generic URI syntax and a process for resolving URI | |||
| references that might be in relative form, along with guidelines and | references that might be in relative form, along with guidelines and | |||
| security considerations for the use of URIs on the Internet. | security considerations for the use of URIs on the Internet. | |||
| The URI syntax defines a grammar that is a superset of all valid | The URI syntax defines a grammar that is a superset of all valid | |||
| URIs, such that an implementation can parse the common components of | URIs, such that an implementation can parse the common components of | |||
| a URI reference without knowing the scheme-specific requirements of | a URI reference without knowing the scheme-specific requirements of | |||
| every possible identifier. This specification does not define a | every possible identifier. This specification does not define a | |||
| generative grammar for URIs; that task is performed by the individual | generative grammar for URIs; that task is performed by the individual | |||
| specifications of each URI scheme. | specifications of each URI scheme. | |||
| Editorial Note | Editorial Note | |||
| Discussion of this draft and comments to the editors should be sent | Discussion of this draft and comments to the editors should be sent | |||
| to the uri@w3.org mailing list. An issues list and version history | to the uri@w3.org mailing list. An issues list and version history | |||
| is available at <http://www.apache.org/~fielding/uri/rev-2002/ | is available at <http://gbiv.com/protocols/uri/rev-2002/issues.html>. | |||
| issues.html>. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 6 | 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.2 Design Considerations . . . . . . . . . . . . . . . . . . . 6 | 1.2 Design Considerations . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.2.2 Separating Identification from Interaction . . . . . . . . . 7 | 1.2.2 Separating Identification from Interaction . . . . . . . . . 7 | |||
| 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . . . . 8 | 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . . . . 9 | |||
| 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . . 9 | 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . 11 | 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.1 Encoding of Characters . . . . . . . . . . . . . . . . . . . 11 | 2.1 Percent Encoding . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 11 | 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 12 | |||
| 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 12 | 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 12 | |||
| 2.4 Escaped Characters . . . . . . . . . . . . . . . . . . . . . 13 | 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . . 13 | |||
| 2.4.1 Escaped Encoding . . . . . . . . . . . . . . . . . . . . . . 13 | 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 2.4.2 When to Escape and Unescape . . . . . . . . . . . . . . . . 13 | 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 2.5 Excluded Characters . . . . . . . . . . . . . . . . . . . . 14 | 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . 16 | 3.2.1 User Information . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 | 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . . 17 | ||||
| 3.2.1 User Information . . . . . . . . . . . . . . . . . . . . . . 18 | ||||
| 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | ||||
| 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 | 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . . 24 | 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . . 24 | 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . . 25 | 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 4.4 Same-document Reference . . . . . . . . . . . . . . . . . . 25 | 4.4 Same-document Reference . . . . . . . . . . . . . . . . . . 25 | |||
| 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . . 25 | 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . 27 | 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . 27 | |||
| 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 27 | 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 27 | |||
| 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 27 | 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 27 | |||
| 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 28 | 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 28 | |||
| 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 28 | 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 28 | |||
| 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 28 | 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 5.2 Obtaining the Referenced URI . . . . . . . . . . . . . . . . 28 | 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . . 28 | |||
| 5.3 Recomposition of a Parsed URI . . . . . . . . . . . . . . . 31 | 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . . . . 29 | |||
| 5.4 Reference Resolution Examples . . . . . . . . . . . . . . . 32 | 5.2.2 Transform References . . . . . . . . . . . . . . . . . . . . 29 | |||
| 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 32 | 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . . . . 30 | |||
| 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 32 | 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . . . . 30 | |||
| | 5.3 Component Recomposition . . . . . . . . . . . . . . . . . . 32 | |||
| | 5.4 Reference Resolution Examples . . . . . . . . . . . . . . . 33 | |||
| | 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| | 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 6. Normalization and Comparison . . . . . . . . . . . . . . . . 35 | 6. Normalization and Comparison . . . . . . . . . . . . . . . . 35 | |||
| 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 35 | 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 35 | |||
| 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 35 | 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 36 | |||
| 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 36 | 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 36 | |||
| 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 37 | 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 37 | |||
| 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 38 | 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 38 | |||
| 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 38 | 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 39 | |||
| 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . 38 | 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . 39 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . 40 | 7. Security Considerations . . . . . . . . . . . . . . . . . . 41 | |||
| 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 40 | 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 41 | |||
| 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 40 | 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 41 | |||
| 7.3 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 41 | 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . . 42 | |||
| 7.4 Sensitive Information . . . . . . . . . . . . . . . . . . . 41 | 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 42 | |||
| 7.5 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 41 | 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . . 43 | |||
| 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 43 | 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 43 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . 44 | 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 45 | |||
| Informative References . . . . . . . . . . . . . . . . . . . 45 | Normative References . . . . . . . . . . . . . . . . . . . . 46 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 47 | Informative References . . . . . . . . . . . . . . . . . . . 47 | |||
| A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . 48 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 48 | |||
| B. Parsing a URI Reference with a Regular Expression . . . . . 50 | A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . 50 | |||
| C. Delimiting a URI in Context . . . . . . . . . . . . . . . . 51 | B. Parsing a URI Reference with a Regular Expression . . . . . 52 | |||
| D. Summary of Non-editorial Changes . . . . . . . . . . . . . . 53 | C. Delimiting a URI in Context . . . . . . . . . . . . . . . . 53 | |||
| D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 53 | D. Summary of Non-editorial Changes . . . . . . . . . . . . . . 55 | |||
| D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 53 | D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 55 | |||
| Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 | D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 55 | |||
| Intellectual Property and Copyright Statements . . . . . . . 60 | Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 | |||
| | Intellectual Property and Copyright Statements . . . . . . . 62 | |||
| 1. Introduction | 1. Introduction | |||
| A Uniform Resource Identifier (URI) provides a simple and extensible | A Uniform Resource Identifier (URI) provides a simple and extensible | |||
| means for identifying a resource. This specification of URI syntax | means for identifying a resource. This specification of URI syntax | |||
| and semantics is derived from concepts introduced by the World Wide | and semantics is derived from concepts introduced by the World Wide | |||
| Web global information initiative, whose use of such identifiers | Web global information initiative, whose use of such identifiers | |||
| dates from 1990 and is described in "Universal Resource Identifiers | dates from 1990 and is described in "Universal Resource Identifiers | |||
| in WWW" [RFC1630], and is designed to meet the recommendations laid | in WWW" [RFC1630], and is designed to meet the recommendations laid | |||
| out in "Functional Recommendations for Internet Resource Locators" | out in "Functional Recommendations for Internet Resource Locators" | |||
| [RFC1736] and "Functional Requirements for Uniform Resource Names" | [RFC1736] and "Functional Requirements for Uniform Resource Names" | |||
| [RFC1737]. | [RFC1737]. | |||
| This document obsoletes [RFC2396], which merged "Uniform Resource | This document obsoletes [RFC2396], which merged "Uniform Resource | |||
| Locators" [RFC1738] and "Relative Uniform Resource Locators" | Locators" [RFC1738] and "Relative Uniform Resource Locators" | |||
| [RFC1808] in order to define a single, generic syntax for all URIs. | [RFC1808] in order to define a single, generic syntax for all URIs. | |||
| It excludes those portions of RFC 1738 that defined the specific | It excludes those portions of RFC 1738 that defined the specific | |||
| syntax of individual URI schemes; those portions will be updated as | syntax of individual URI schemes; those portions will be updated as | |||
| separate documents. The process for registration of new URI schemes | separate documents. The process for registration of new URI schemes | |||
| is defined separately by [RFC2717]. | is defined separately by [RFC2717]. Advice for designers of new URI | |||
| | schemes can be found in [RFC2718]. | |||
| All significant changes from RFC 2396 are noted in Appendix D. | All significant changes from RFC 2396 are noted in Appendix D. | |||
| | This specification uses the terms "character" and "character | |||
| | encoding" in accordance with the definitions provided in [RFC2978]. | |||
| 1.1 Overview of URIs | 1.1 Overview of URIs | |||
| URIs are characterized as follows: | URIs are characterized as follows: | |||
| Uniform | Uniform | |||
| Uniformity provides several benefits: it allows different types of | Uniformity provides several benefits: it allows different types of | |||
| resource identifiers to be used in the same context, even when the | resource identifiers to be used in the same context, even when the | |||
| mechanisms used to access those resources may differ; it allows | mechanisms used to access those resources may differ; it allows | |||
| uniform semantic interpretation of common syntactic conventions | uniform semantic interpretation of common syntactic conventions | |||
| skipping to change at page 5, line 14 ¶ | skipping to change at page 5, line 18 ¶ | |||
| mathematical equation or the types of a relationship (e.g., | mathematical equation or the types of a relationship (e.g., | |||
| "parent" or "employee"). | "parent" or "employee"). | |||
| Identifier | Identifier | |||
| An identifier embodies the information required to distinguish | An identifier embodies the information required to distinguish | |||
| what is being identified from all other things within its scope of | what is being identified from all other things within its scope of | |||
| identification. | identification. | |||
| A URI is an identifier that consists of a sequence of characters | A URI is an identifier that consists of a sequence of characters | |||
| matching the syntax defined by the grammar rule named "URI" in | matching the syntax defined by the syntax rule named "URI" in Section | |||
| Section 3. A URI can be used to refer to a resource. This | 3. A URI can be used to refer to a resource. This specification does | |||
| specification does not place any limits on the nature of a resource | not place any limits on the nature of a resource or the reasons why | |||
| or the reasons why an application might wish to refer to a resource. | an application might wish to refer to a resource. URIs have a global | |||
| URIs have a global scope and should be interpreted consistently | scope and should be interpreted consistently regardless of context, | |||
| regardless of context, but that interpretation may be defined in | but that interpretation may be defined in relation to the user's | |||
| relation to the user's context (e.g., "http://localhost/" refers to a | context (e.g., "http://localhost/" refers to a resource that is | |||
| resource that is relative to the user's network interface and yet not | relative to the user's network interface and yet not specific to any | |||
| specific to any one user). | one user). | |||
| 1.1.1 Generic Syntax | 1.1.1 Generic Syntax | |||
| Each URI begins with a scheme name, as defined in Section 3.1, that | Each URI begins with a scheme name, as defined in Section 3.1, that | |||
| refers to a specification for assigning identifiers within that | refers to a specification for assigning identifiers within that | |||
| scheme. As such, the URI syntax is a federated and extensible naming | scheme. As such, the URI syntax is a federated and extensible naming | |||
| system wherein each scheme's specification may further restrict the | system wherein each scheme's specification may further restrict the | |||
| syntax and semantics of identifiers using that scheme. | syntax and semantics of identifiers using that scheme. | |||
| This specification defines those elements of the URI syntax that are | This specification defines those elements of the URI syntax that are | |||
| skipping to change at page 6, line 10 ¶ | skipping to change at page 6, line 10 ¶ | |||
| reference into its major components; once the scheme is determined, | reference into its major components; once the scheme is determined, | |||
| further scheme-specific parsing can be performed on the components. | further scheme-specific parsing can be performed on the components. | |||
| In other words, the URI generic syntax is a superset of the syntax of | In other words, the URI generic syntax is a superset of the syntax of | |||
| all URI schemes. | all URI schemes. | |||
| 1.1.2 Examples | 1.1.2 Examples | |||
| The following examples illustrate URIs that are in common use. | The following examples illustrate URIs that are in common use. | |||
| ftp://ftp.is.co.za/rfc/rfc1808.txt | ftp://ftp.is.co.za/rfc/rfc1808.txt | |||
| -- ftp scheme for File Transfer Protocol services | ||||
| gopher://gopher.tc.umn.edu:70/11/Mailing%20Lists/ | ||||
| -- gopher scheme for Gopher and Gopher+ Protocol services | ||||
| http://www.ietf.org/rfc/rfc2396.txt | http://www.ietf.org/rfc/rfc2396.txt | |||
| -- http scheme for Hypertext Transfer Protocol services | ||||
| mailto:John.Doe@example.com | mailto:John.Doe@example.com | |||
| -- mailto scheme for electronic mail addresses | ||||
| news:comp.infosystems.www.servers.unix | news:comp.infosystems.www.servers.unix | |||
| -- news scheme for USENET news groups and articles | ||||
| telnet://melvyl.ucop.edu/ | telnet://melvyl.ucop.edu/ | |||
| -- telnet scheme for interactive TELNET services | ||||
| 1.1.3 URI, URL, and URN | 1.1.3 URI, URL, and URN | |||
| A URI can be further classified as a locator, a name, or both. The | A URI can be further classified as a locator, a name, or both. The | |||
| term "Uniform Resource Locator" (URL) refers to the subset of URIs | term "Uniform Resource Locator" (URL) refers to the subset of URIs | |||
| that, in addition to identifying a resource, provide a means of | that, in addition to identifying a resource, provide a means of | |||
| locating the resource by describing its primary access mechanism | locating the resource by describing its primary access mechanism | |||
| (e.g., its network "location"). The term "Uniform Resource Name" | (e.g., its network "location"). The term "Uniform Resource Name" | |||
| (URN) refers to URIs under the "urn" scheme [RFC2141], which are | (URN) has been used historically to refer to both URIs under the | |||
| required to remain globally unique and persistent even when the | "urn" scheme [RFC2141], which are required to remain globally unique | |||
| resource ceases to exist or becomes unavailable. | and persistent even when the resource ceases to exist or becomes | |||
| | unavailable, and to any other URI with the properties of a name. | |||
| An individual scheme does not need to be classified as being just one | An individual scheme does not need to be classified as being just one | |||
| of "name" or "locator". Instances of URIs from any given scheme may | of "name" or "locator". Instances of URIs from any given scheme may | |||
| have the characteristics of names or locators or both, often | have the characteristics of names or locators or both, often | |||
| depending on the persistence and care in the assignment of | depending on the persistence and care in the assignment of | |||
| identifiers by the naming authority, rather than any quality of the | identifiers by the naming authority, rather than any quality of the | |||
| scheme. | scheme. Future specifications and related documentation should use | |||
| | the general term "URI", rather than the more restrictive terms URL | |||
| | and URN [RFC3305]. | |||
| 1.2 Design Considerations | 1.2 Design Considerations | |||
| 1.2.1 Transcription | 1.2.1 Transcription | |||
| The URI syntax has been designed with global transcription as one of | The URI syntax has been designed with global transcription as one of | |||
| its main considerations. A URI is a sequence of characters from a | its main considerations. A URI is a sequence of characters from a | |||
| very limited set: the letters of the basic Latin alphabet, digits, | very limited set: the letters of the basic Latin alphabet, digits, | |||
| and a few special characters. A URI may be represented in a variety | and a few special characters. A URI may be represented in a variety | |||
| of ways: e.g., ink on paper, pixels on a screen, or a sequence of | of ways: e.g., ink on paper, pixels on a screen, or a sequence of | |||
| octets in a coded character set. The interpretation of a URI depends | integers from a coded character set. The interpretation of a URI | |||
| only on the characters used and not how those characters are | depends only on the characters used and not how those characters are | |||
| represented in a network protocol. | represented in a network protocol. | |||
| The goal of transcription can be described by a simple scenario. | The goal of transcription can be described by a simple scenario. | |||
| Imagine two colleagues, Sam and Kim, sitting in a pub at an | Imagine two colleagues, Sam and Kim, sitting in a pub at an | |||
| international conference and exchanging research ideas. Sam asks Kim | international conference and exchanging research ideas. Sam asks Kim | |||
| for a location to get more information, so Kim writes the URI for the | for a location to get more information, so Kim writes the URI for the | |||
| research site on a napkin. Upon returning home, Sam takes out the | research site on a napkin. Upon returning home, Sam takes out the | |||
| napkin and types the URI into a computer, which then retrieves the | napkin and types the URI into a computer, which then retrieves the | |||
| information to which Kim referred. | information to which Kim referred. | |||
| skipping to change at page 7, line 38 ¶ | skipping to change at page 7, line 33 ¶ | |||
| o A URI often needs to be remembered by people, and it is easier for | o A URI often needs to be remembered by people, and it is easier for | |||
| people to remember a URI when it consists of meaningful or | people to remember a URI when it consists of meaningful or | |||
| familiar components. | familiar components. | |||
| These design considerations are not always in alignment. For | These design considerations are not always in alignment. For | |||
| example, it is often the case that the most meaningful name for a URI | example, it is often the case that the most meaningful name for a URI | |||
| component would require characters that cannot be typed into some | component would require characters that cannot be typed into some | |||
| systems. The ability to transcribe a resource identifier from one | systems. The ability to transcribe a resource identifier from one | |||
| medium to another has been considered more important than having a | medium to another has been considered more important than having a | |||
| URI consist of the most meaningful of components. In local or | URI consist of the most meaningful of components. | |||
| regional contexts and with improving technology, users might benefit | ||||
| from being able to use a wider range of characters; such use is not | In local or regional contexts and with improving technology, users | |||
| defined in this specification. | might benefit from being able to use a wider range of characters; | |||
| | such use is not defined in this specification. Percent-encoded | |||
| | octets (Section 2.1) may be used within a URI to represent characters | |||
| | outside the range of the US-ASCII coded character set if such | |||
| | representation is defined by the scheme or by the protocol element in | |||
| | which the URI is referenced; such a definition will specify the | |||
| | character encoding scheme used to map those characters to octets | |||
| | prior to being percent-encoded for the URI. | |||
| 1.2.2 Separating Identification from Interaction | 1.2.2 Separating Identification from Interaction | |||
| A common misunderstanding of URIs is that they are only used to refer | A common misunderstanding of URIs is that they are only used to refer | |||
| to accessible resources. In fact, the URI alone only provides | to accessible resources. In fact, the URI alone only provides | |||
| identification; access to the resource is neither guaranteed nor | identification; access to the resource is neither guaranteed nor | |||
| implied by the presence of a URI. Instead, an operation (if any) | implied by the presence of a URI. Instead, an operation (if any) | |||
| associated with a URI reference is defined by the protocol element, | associated with a URI reference is defined by the protocol element, | |||
| data format attribute, or natural language text in which it appears. | data format attribute, or natural language text in which it appears. | |||
| Given a URI, a system may attempt to perform a variety of operations | Given a URI, a system may attempt to perform a variety of operations | |||
| on the resource, as might be characterized by such words as "denote", | on the resource, as might be characterized by such words as "access", | |||
| "access", "update", "replace", or "find attributes". Such operations | "update", "replace", or "find attributes". Such operations are | |||
| are defined by the protocols that make use of URIs, not by this | defined by the protocols that make use of URIs, not by this | |||
| specification. However, we do use a few general terms for describing | specification. However, we do use a few general terms for describing | |||
| common operations on URIs. URI "resolution" is the process of | common operations on URIs. URI "resolution" is the process of | |||
| determining an access mechanism and the appropriate parameters | determining an access mechanism and the appropriate parameters | |||
| necessary to dereference a URI; such resolution may require several | necessary to dereference a URI; such resolution may require several | |||
| iterations. Use of that access mechanism to perform an action on the | iterations. To use that access mechanism to perform an action on the | |||
| URI's resource is termed a "dereference" of the URI. | URI's resource is to "dereference" the URI. | |||
| When URIs are used within information systems to identify sources of | When URIs are used within information systems to identify sources of | |||
| information, the most common form of URI dereference is "retrieval": | information, the most common form of URI dereference is "retrieval": | |||
| making use of a URI in order to retrieve a representation of its | making use of a URI in order to retrieve a representation of its | |||
| associated resource. A "representation" is a sequence of octets, | associated resource. A "representation" is a sequence of octets, | |||
| along with metadata describing those octets, that constitutes a | along with representation metadata describing those octets, that | |||
| record of the state of the resource at the time that the | constitutes a record of the state of the resource at the time that | |||
| representation is generated. Retrieval is achieved by a process that | the representation is generated. Retrieval is achieved by a process | |||
| might include using the URI as a cache key to check for a locally | that might include using the URI as a cache key to check for a | |||
| cached representation, resolution of the URI to determine an | locally cached representation, resolution of the URI to determine an | |||
| appropriate access mechanism (if any), and dereference of the URI for | appropriate access mechanism (if any), and dereference of the URI for | |||
| the sake of applying a retrieval operation. | the sake of applying a retrieval operation. Depending on the | |||
| protocols used to perform the retrieval, additional information might | ||||
| be supplied about the resource (resource metadata) and its relation | ||||
| to other resources. | ||||
| URI references in information systems are designed to be | URI references in information systems are designed to be | |||
| late-binding: the result of an access is generally determined at the | late-binding: the result of an access is generally determined at the | |||
| time it is accessed and may vary over time or due to other aspects of | time it is accessed and may vary over time or due to other aspects of | |||
| the interaction. When an author creates a reference to such a | the interaction. When an author creates a reference to such a | |||
| resource, they do so with the intention that the reference be used in | resource, they do so with the intention that the reference be used in | |||
| the future; what is being identified is not some specific result that | the future; what is being identified is not some specific result that | |||
| was obtained in the past, but rather some characteristic that is | was obtained in the past, but rather some characteristic that is | |||
| expected to be true for future results. In such cases, the resource | expected to be true for future results. In such cases, the resource | |||
| referred to by the URI is actually a sameness of characteristics as | referred to by the URI is actually a sameness of characteristics as | |||
| skipping to change at page 9, line 4 ¶ | skipping to change at page 9, line 6 ¶ | |||
| via the named protocol. URIs are often used simply for the sake of | via the named protocol. URIs are often used simply for the sake of | |||
| identification. Even when a URI is used to retrieve a representation | identification. Even when a URI is used to retrieve a representation | |||
| of a resource, that access might be through gateways, proxies, | of a resource, that access might be through gateways, proxies, | |||
| caches, and name resolution services that are independent of the | caches, and name resolution services that are independent of the | |||
| protocol associated with the scheme name, and the resolution of some | protocol associated with the scheme name, and the resolution of some | |||
| URIs may require the use of more than one protocol (e.g., both DNS | URIs may require the use of more than one protocol (e.g., both DNS | |||
| and HTTP are typically used to access an "http" URI's origin server | and HTTP are typically used to access an "http" URI's origin server | |||
| when a representation isn't found in a local cache). | when a representation isn't found in a local cache). | |||
| 1.2.3 Hierarchical Identifiers | 1.2.3 Hierarchical Identifiers | |||
| The URI syntax is organized hierarchically, with components listed in | The URI syntax is organized hierarchically, with components listed in | |||
| decreasing order from left to right. For some URI schemes, the | order of decreasing significance from left to right. For some URI | |||
| visible hierarchy is limited to the scheme itself: everything after | schemes, the visible hierarchy is limited to the scheme itself: | |||
| the scheme component delimiter is considered opaque to URI | everything after the scheme component delimiter (":") is considered | |||
| processing. Other URI schemes make the hierarchy explicit and visible | opaque to URI processing. Other URI schemes make the hierarchy | |||
| to generic parsing algorithms. | explicit and visible to generic parsing algorithms. | |||
| The URI syntax reserves the slash ("/"), question-mark ("?"), and | The generic syntax uses the slash ("/"), question mark ("?"), and | |||
| number-sign ("#") characters for the purpose of delimiting components | number sign ("#") characters for the purpose of delimiting components | |||
| that are significant to the generic parser's hierarchical | that are significant to the generic parser's hierarchical | |||
| interpretation of an identifier. In addition to aiding the | interpretation of an identifier. In addition to aiding the | |||
| readability of such identifiers through the consistent use of | readability of such identifiers through the consistent use of | |||
| familiar syntax, this uniform representation of hierarchy across | familiar syntax, this uniform representation of hierarchy across | |||
| naming schemes allows scheme-independent references to be made | naming schemes allows scheme-independent references to be made | |||
| relative to that hierarchy. | relative to that hierarchy. | |||
| It is often the case that a group or "tree" of documents has been | It is often the case that a group or "tree" of documents has been | |||
| constructed to serve a common purpose; the vast majority of URIs in | constructed to serve a common purpose, wherein the vast majority of | |||
| these documents point to resources within the tree rather than | URIs in these documents point to resources within the tree rather | |||
| outside of it. Similarly, documents located at a particular site are | than outside of it. Similarly, documents located at a particular | |||
| much more likely to refer to other resources at that site than to | site are much more likely to refer to other resources at that site | |||
| resources at remote sites. | than to resources at remote sites. Relative referencing of URIs | |||
| allows document trees to be partially independent of their location | ||||
| Relative referencing of URIs allows document trees to be partially | and access scheme. For instance, it is possible for a single set of | |||
| independent of their location and access scheme. For instance, it is | hypertext documents to be simultaneously accessible and traversable | |||
| possible for a single set of hypertext documents to be simultaneously | via each of the "file", "http", and "ftp" schemes if the documents | |||
| accessible and traversable via each of the "file", "http", and "ftp" | refer to each other using relative references. Furthermore, such | |||
| schemes if the documents refer to each other using relative | document trees can be moved, as a whole, without changing any of the | |||
| references. Furthermore, such document trees can be moved, as a | relative references. | |||
| whole, without changing any of the relative references. | ||||
| A relative URI reference (Section 4.2) refers to a resource by | A relative URI reference (Section 4.2) refers to a resource by | |||
| describing the difference within a hierarchical name space between | describing the difference within a hierarchical name space between | |||
| the current context and the target URI. The reference resolution | the reference context and the target URI. The reference resolution | |||
| algorithm, presented in Section 5, defines how such references are | algorithm, presented in Section 5, defines how such a reference is | |||
| resolved. | transformed to the target URI. Since relative references can only be | |||
| used within the context of a hierarchical URI, designers of new URI | ||||
| schemes should use a syntax consistent with the generic syntax's | ||||
| hierarchical components unless there are compelling reasons to forbid | ||||
| relative referencing within that scheme. | ||||
| All URIs are parsed by generic syntax parsers when used. A URI scheme | ||||
| that wishes to remain opaque to hierarchical processing must disallow | ||||
| the use of slash and question mark characters. However, since a | ||||
| non-relative URI reference is only modified by the generic parser if | ||||
| it contains complete path segments of "." or ".." (see Section 3.3), | ||||
| URIs may safely use "/" for other purposes if they do not allow | ||||
| dot-segments. | ||||
| 1.3 Syntax Notation | 1.3 Syntax Notation | |||
| This specification uses the Augmented Backus-Naur Form (ABNF) | This specification uses the Augmented Backus-Naur Form (ABNF) | |||
| notation of [RFC2234] to define the URI syntax. Although the ABNF | notation of [RFC2234], including the following core ABNF syntax rules | |||
| defines syntax in terms of the US-ASCII character encoding [ASCII], | defined by that specification: ALPHA (letters), CR (carriage return), | |||
| the URI syntax should be interpreted in terms of the character that | CTL (control characters), DIGIT (decimal digits), DQUOTE (double | |||
| the ASCII-encoded octet represents, rather than the octet encoding | quote), HEXDIG (hexadecimal digits), LF (line feed), and SP (space). | |||
| itself. How a URI is represented in terms of bits and bytes on the | The complete URI syntax is collected in Appendix A. | |||
| wire is dependent upon the character encoding of the protocol used to | ||||
| transport it, or the charset of the document that contains it. | ||||
| The following core ABNF productions are used by this specification as | ||||
| defined by Section 6.1 of [RFC2234]: ALPHA, CR, CTL, DIGIT, DQUOTE, | ||||
| HEXDIG, LF, OCTET, and SP. The complete URI syntax is collected in | ||||
| Appendix A. | ||||
| 2. Characters | 2. Characters | |||
| A URI consists of a restricted set of characters, primarily chosen | Although ABNF notation defines its terminal values to be non-negative | |||
| to aid transcription and usability both in computer systems and in | integers (codepoints) based on the US-ASCII coded character set | |||
| non-computer communications. Characters used conventionally as | [ASCII], we must invert that relation in order to understand the URI | |||
| delimiters around a URI are excluded. The set of URI characters | syntax, since URIs are defined as strings of characters independent | |||
| consists of digits, letters, and a few graphic symbols chosen from | of any particular encoding. Therefore, the integer values must be | |||
| those common to most of the character encodings and input facilities | mapped back to their corresponding characters via US-ASCII in order | |||
| available to Internet users. | to complete the syntax rules. | |||
| uric = reserved / unreserved / escaped | This specification does not mandate the use of any particular | |||
| character encoding scheme for mapping between URI characters and the | ||||
| octets used to store or transmit those characters. When a URI appears | ||||
| in a protocol element, the character encoding is defined by that | ||||
| protocol; absent such a definition, a URI is assumed to use the same | ||||
| character encoding as the surrounding text. | ||||
| Within a URI, reserved characters are used to delimit syntax | A URI is composed from a limited set of characters consisting of | |||
| components, unreserved characters are used to describe registered | digits, letters, and a few graphic symbols. A reserved (Section 2.2) | |||
| names, and unreserved, non-delimiting reserved, and escaped | subset of those characters may be used to delimit syntax components | |||
| characters are used to represent strings of data (1*OCTET) within the | within a URI, while the remaining characters, including both the | |||
| components. | unreserved (Section 2.3) set and those reserved characters not acting | |||
| as delimiters, define each component's data. | ||||
| 2.1 Encoding of Characters | 2.1 Percent Encoding | |||
| As described above (Section 1.3), the URI syntax is defined in terms | A percent-encoding mechanism is used to represent a data octet in a | |||
| of characters by reference to the US-ASCII encoding of characters to | component when that octet's corresponding character is outside the | |||
| octets. This specification does not mandate the use of any | allowed set or is being used as a delimiter of, or within, the | |||
| particular mapping between its character set and the octets used to | component. A percent-encoded octet is encoded as a character triplet, | |||
| store or transmit those characters. | consisting of the percent character "%" followed by the two | |||
| hexadecimal digits representing that octet's numeric value. For | ||||
| example, "%20" is the percent-encoding for the binary octet | ||||
| "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | ||||
| character (SP). | ||||
| URI characters representing strings of data within a component may, | pct-encoded = "%" HEXDIG HEXDIG | |||
| if allowed by the component production, represent an arbitrary | ||||
| sequence of octets. For example, portions of a given URI might | ||||
| correspond to a filename on a non-ASCII file system, a query on | ||||
| non-ASCII data, numeric coordinates on a map, etc. Some URI schemes | ||||
| define a specific encoding of raw data to US-ASCII characters as part | ||||
| of their scheme-specific requirements. Most URI schemes represent | ||||
| data octets by the US-ASCII character corresponding to that octet, | ||||
| either directly in the form of the character's glyph or by use of an | ||||
| escape triplet (Section 2.4). | ||||
| When a URI scheme defines a component that represents textual data | The uppercase hexadecimal digits 'A' through 'F' are equivalent to | |||
| consisting of characters from the Unicode (ISO 10646) character set, | the lowercase digits 'a' through 'f', respectively. Two URIs that | |||
| we recommend that the data be encoded first as octets according to | differ only in the case of hexadecimal digits used in percent-encoded | |||
| the UTF-8 [UTF-8] character encoding, and then escaping only those | octets are equivalent. For consistency, URI producers and | |||
| octets that are not in the unreserved character set. | normalizers should use uppercase hexadecimal digits for all | |||
| percent-encodings. | ||||
| 2.2 Reserved Characters | 2.2 Reserved Characters | |||
| URIs include components and sub-components that are delimited by | URIs include components and sub-components that are delimited by | |||
| certain special characters. These characters are called "reserved", | characters in the "reserved" set. These characters are called | |||
| since their usage within a URI component is limited to their reserved | "reserved" because they may (or may not) be defined as delimiters by | |||
| purpose within that component. If data for a URI component would | the generic syntax, by each scheme-specific syntax, or by the | |||
| conflict with the reserved purpose, then the conflicting data must be | implementation-specific syntax of a URI's dereferencing algorithm. | |||
| escaped (Section 2.4) before forming the URI. | If data for a URI component would conflict with a reserved | |||
| character's purpose as a delimiter, then the conflicting data must be | ||||
| percent-encoded before forming the URI. | ||||
| reserved = "/" / "?" / "#" / "[" / "]" / ";" / | reserved = gen-delims / sub-delims | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| Reserved characters are used as delimiters of the generic URI | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |||
| components described in Section 3, as well as within those components | ||||
| for delimiting sub-components. A component's ABNF syntax rule will | ||||
| not use the "reserved" production directly; instead, each rule lists | ||||
| those reserved characters that are allowed within that component. | ||||
| Allowed reserved characters that are not assigned a sub-component | ||||
| delimiter role by this specification should be considered reserved | ||||
| for special use by whatever software generates the URI (i.e., they | ||||
| may be used to delimit or indicate information that is significant to | ||||
| interpretation of the identifier, but that significance is outside | ||||
| the scope of this specification). Outside of the URI's origin, a | ||||
| reserved character cannot be escaped without fear of changing how it | ||||
| will be interpreted; likewise, an escaped octet that corresponds to a | ||||
| reserved character cannot be unescaped outside the software that is | ||||
| responsible for interpreting it during URI resolution. | ||||
| The slash ("/"), question-mark ("?"), and number-sign ("#") | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |||
| characters are reserved in all URIs for the purpose of delimiting | / "*" / "+" / "," / ";" / "=" | |||
| components that are significant to the generic parser's hierarchical | ||||
| interpretation of an identifier. The hierarchical prefix of a URI, | A subset of the reserved characters (gen-delims) are used as | |||
| wherein the slash ("/") character signifies a hierarchy delimiter, | delimiters of the generic URI components described in Section 3. A | |||
| extends from the scheme (Section 3.1) through to the first | component's ABNF syntax rule will not use the reserved or gen-delims | |||
| question-mark ("?"), number-sign ("#"), or the end of the URI string. | rule names directly; instead, each syntax rule lists those reserved | |||
| In other words, the slash ("/") character is not treated as a | characters that are allowed within that component (i.e., not | |||
| hierarchical separator within the query (Section 3.4) and fragment | delimiting it). The allowed reserved characters, including those in | |||
| (Section 3.5) components of a URI, but is still considered reserved | the sub-delims set and any of the gen-delims that are not a delimiter | |||
| within those components for purposes outside the scope of this | of that component, are reserved for use as sub-component delimiters | |||
| specification. | within the component. Only the most common sub-components are | |||
| defined by this specification; other sub-components may be defined by | ||||
| a URI scheme's specification, or by the implementation-specific | ||||
| syntax of a URI's dereferencing algorithm, provided that such | ||||
| sub-components are delimited by characters in that component's | ||||
| reserved set. If no such delimiting role has been assigned, then a | ||||
| reserved character appearing in a component represents the data octet | ||||
| corresponding to its encoding in US-ASCII. | ||||
| URIs that differ in the replacement of a reserved character with its | ||||
| corresponding percent-encoded octet are not equivalent. | ||||
| Percent-encoding a reserved character, or decoding a percent-encoded | ||||
| octet that corresponds to a reserved character, will change how the | ||||
| URI is interpreted by most applications. | ||||
| 2.3 Unreserved Characters | 2.3 Unreserved Characters | |||
| Characters that are allowed in a URI but do not have a reserved | Characters that are allowed in a URI but do not have a reserved | |||
| purpose are called unreserved. These include uppercase and lowercase | purpose are called unreserved. These include uppercase and lowercase | |||
| letters, decimal digits, and a limited set of punctuation marks and | letters, decimal digits, hyphen, period, underscore, and tilde. | |||
| symbols. | ||||
| unreserved = ALPHA / DIGIT / mark | ||||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")" | ||||
| Escaping unreserved characters in a URI does not change what resource | ||||
| is identified by that URI. However, it may change the result of a | ||||
| URI comparison (Section 6), potentially leading to less efficient | ||||
| actions by an application. Therefore, unreserved characters should | ||||
| not be escaped unless the URI is being used in a context that does | ||||
| not allow the unescaped character to appear. URI normalization | ||||
| processes may unescape sequences in the ranges of ALPHA (%41-%5A and | ||||
| %61-%7A), DIGIT (%30-%39), hyphen (%2D), underscore (%5F), or tilde | ||||
| (%7E) without fear of creating a conflict, but unescaping the other | ||||
| mark characters is usually counterproductive. | ||||
| 2.4 Escaped Characters | ||||
| Data must be escaped if it does not have a representation using an | ||||
| unreserved character; this includes data that does not correspond to | ||||
| a printable character of the US-ASCII coded character set or | ||||
| corresponds to a US-ASCII character that delimits the component from | ||||
| others, is reserved in that component for delimiting sub-components, | ||||
| or is excluded from any use within a URI (Section 2.5). | ||||
| 2.4.1 Escaped Encoding | ||||
| An escaped octet is encoded as a character triplet, consisting of | ||||
| the percent character "%" followed by the two hexadecimal digits | ||||
| representing that octet's numeric value. For example, "%20" is the | ||||
| escaped encoding for the binary octet "00100000" (ABNF: %x20), which | ||||
| corresponds to the US-ASCII space character (SP). This is sometimes | ||||
| referred to as "percent-encoding" the octet. | ||||
| escaped = "%" HEXDIG HEXDIG | ||||
| The uppercase hexadecimal digits 'A' through 'F' are equivalent to | ||||
| the lowercase digits 'a' through 'f', respectively. Two URIs that | ||||
| differ only in the case of hexadecimal digits used in escaped octets | ||||
| are equivalent. For consistency, we recommend that uppercase digits | ||||
| be used by URI generators and normalizers. | ||||
| 2.4.2 When to Escape and Unescape | ||||
| Under normal circumstances, the only time that characters within a | ||||
| URI string are escaped is during the process of generating the URI | ||||
| from its component parts. Each component may have its own set of | ||||
| characters that are reserved, so only the mechanism responsible for | ||||
| generating or interpreting that component can determine whether or | ||||
| not escaping a character will change its semantics. The exception is | ||||
| when a URI is being used within a context where the unreserved "mark" | ||||
| characters might need to be escaped, such as when used for a | ||||
| command-line argument or within a single-quoted attribute. | ||||
| Once generated, a URI is always in an escaped form. When a URI is | ||||
| resolved, the components significant to that scheme-specific | ||||
| resolution process (if any) must be parsed and separated before the | ||||
| escaped characters within those components can be safely unescaped. | ||||
| In some cases, data that could be represented by an unreserved | ||||
| character may appear escaped; for example, some of the unreserved | ||||
| "mark" characters are automatically escaped by some systems. A URI | ||||
| normalizer may unescape escaped octets that are represented by | ||||
| characters in the unreserved set. For example, "%7E" is sometimes | ||||
| used instead of tilde ("~") in an "http" URI path and can be | ||||
| converted to "~" without changing the interpretation of the URI. | ||||
| In all cases, a URI character is equivalent to its corresponding | ||||
| ASCII-encoded octet, even when that octet is represented as a | ||||
| percent-escape. URI characters are provided as an external ASCII | ||||
| interface for identification between systems. A system that | ||||
| internally provides identifiers in the form of a different character | ||||
| encoding, such as EBCDIC, will generally perform character | ||||
| translation of textual identifiers to UTF-8 at some internal | ||||
| interface, thus providing meaningful identifiers in ASCII even though | ||||
| the back-end identifiers are in a different encoding. Escaped octets | ||||
| must be unescaped before such a transcoding is applied. Although | ||||
| this specification does not define the character encoding of escaped | ||||
| octets outside the ASCII range, the general principle of unescaping | ||||
| before transcoding should be applied for all character encodings. | ||||
| Because the percent ("%") character serves as the escape indicator, | ||||
| it must be escaped as "%25" in order for that octet to be used as | ||||
| data within a URI. Implementers should be careful not to escape or | ||||
| unescape the same string more than once, since unescaping an already | ||||
| unescaped string might lead to misinterpreting a percent data | ||||
| character as another escaped character, or vice versa in the case of | ||||
| escaping an already escaped string. | ||||
| 2.5 Excluded Characters | ||||
| Although they are disallowed within the URI syntax, we include here | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| a description of those characters that have been excluded and the | ||||
| reasons for their exclusion. | ||||
| excluded = invisible / delims / unwise | URIs that differ in the replacement of an unreserved character with | |||
| its corresponding percent-encoded octet are equivalent: they identify | ||||
| the same resource. However, percent-encoded unreserved characters | ||||
| may change the result of some URI comparisons (Section 6), | ||||
| potentially leading to incorrect or inefficient behavior. For | ||||
| consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A | ||||
| and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore | ||||
| (%5F), or tilde (%7E) should not be created by URI producers and, | ||||
| when found in a URI, should be decoded to their corresponding | ||||
| unreserved character by URI normalizers. | ||||
| The control characters (CTL) in the US-ASCII coded character set are | 2.4 When to Encode or Decode | |||
| not used within a URI, both because they are non-printable and | ||||
| because they are likely to be misinterpreted by some control | ||||
| mechanisms. The space character (SP) is excluded because significant | ||||
| spaces may disappear and insignificant spaces may be introduced when | ||||
| a URI is transcribed, typeset, or subjected to the treatment of | ||||
| word-processing programs. Whitespace is also used to delimit a URI | ||||
| in many contexts. Characters outside the US-ASCII set are excluded as | ||||
| well. | ||||
| invisible = CTL / SP / %x80-FF | Under normal circumstances, the only time that octets within a URI | |||
| are percent-encoded is during the process of producing the URI from | ||||
| its component parts. It is during that process that an | ||||
| implementation determines which of the reserved characters are to be | ||||
| used as sub-component delimiters and which can be safely used as | ||||
| data. Once produced, a URI is always in its percent-encoded form. | ||||
| The angle-bracket ("<" and ">") and double-quote (") characters are | When a URI is dereferenced, the components and sub-components | |||
| excluded because they are often used as the delimiters around a URI | significant to the scheme-specific dereferencing process (if any) | |||
| in text documents and protocol fields. The percent character ("%") | must be parsed and separated before the percent-encoded octets within | |||
| is excluded because it is used for the encoding of escaped (Section | those components can be safely decoded, since otherwise the data may | |||
| 2.4) characters. | be mistaken for component delimiters. The only exception is for | |||
| percent-encoded octets corresponding to characters in the unreserved | ||||
| set, which can be decoded at any time. For example, the octet | ||||
| corresponding to the tilde ("~") character is often encoded as "%7E" | ||||
| by older URI processing software; the "%7E" can be replaced by "~" | ||||
| without changing its interpretation. | ||||
| delims = "<" / ">" / "%" / DQUOTE | Because the percent ("%") character serves as the indicator for | |||
| percent-encoded octets, it must be percent-encoded as "%25" in order | ||||
| for that octet to be used as data within a URI. Implementations must | ||||
| not percent-encode or decode the same string more than once, since | ||||
| decoding an already decoded string might lead to misinterpreting a | ||||
| percent data octet as the beginning of a percent-encoding, or vice | ||||
| versa in the case of percent-encoding an already percent-encoded | ||||
| string. | ||||
| Other characters are excluded because gateways and other transport | URI characters serve as an external interface for identification | |||
| agents are known to sometimes modify such characters. | between systems. A system that internally provides identifiers in | |||
| the form of a different character encoding, such as EBCDIC, will | ||||
| generally perform character translation of textual identifiers to | ||||
| UTF-8 [RFC3629] (or some other superset of the US-ASCII character | ||||
| encoding) at an internal interface, since that results in more | ||||
| meaningful identifiers than simply percent-encoding the original | ||||
| octets. When interpreting an incoming URI on such an interface, | ||||
| percent-encoded octets must be decoded before the reverse transcoding | ||||
| can be applied. | ||||
| unwise = "{" / "}" / "\|" / "\" / "^" / "`" | In some cases, the interface between a URI component and the | |||
| identifying data it has been crafted to represent is much less direct | ||||
| than a character encoding translation. For example, portions of a | ||||
| URI might reflect a query on non-ASCII data, numeric coordinates on a | ||||
| map, etc. Likewise, a URI scheme may define components with | ||||
| additional encoding requirements, such as base64, that are applied | ||||
| prior to forming the component and producing the URI. | ||||
| Data octets corresponding to excluded characters must be escaped in | When a URI scheme defines a component that represents textual data | |||
| order to be represented within a URI. | consisting of characters from the Unicode (ISO/IEC 10646-1) character | |||
| set, the data should be encoded first as octets according to the | ||||
| UTF-8 character encoding [RFC3629], and then only those octets that | ||||
| do not correspond to characters in the unreserved set should be | ||||
| percent-encoded. For example, the character A would be represented | ||||
| as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be | ||||
| represented as "%C3%80", and the character KATAKANA LETTER A would be | ||||
| represented as "%E3%82%A2". | ||||
| 3. Syntax Components | 3. Syntax Components | |||
| The generic URI syntax consists of a hierarchical sequence of | The generic URI syntax consists of a hierarchical sequence of | |||
| components referred to as the scheme, authority, path, query, and | components referred to as the scheme, authority, path, query, and | |||
| fragment. | fragment. | |||
| URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment] | |||
| hier-part = net-path / abs-path / rel-path | ||||
| net-path = "//" authority [ abs-path ] | ||||
| abs-path = "/" path-segments | ||||
| rel-path = path-segments | ||||
| The scheme and path components are required, though path may be empty | The scheme and path components are required, though path may be empty | |||
| (no characters). An ABNF-driven parser of hier-part will find that | (no characters). An ABNF-driven parser will find that the border | |||
| the three productions in the rule are ambiguous: they are | between authority and path is ambiguous; they are disambiguated by | |||
| disambiguated by the "first-match-wins" (a.k.a. "greedy") algorithm. | the "first-match-wins" (a.k.a. "greedy") algorithm. In other words, | |||
| In other words, if the string begins with two slash characters | if authority is present then the first segment of the path must be | |||
| ("//"), then it is a net-path; if it begins with only one slash | empty. | |||
| character, then it is an abs-path; otherwise, it is a rel-path. Note | ||||
| that rel-path does not necessarily contain any slash ("/") | ||||
| characters; a non-hierarchical path will be treated as opaque data by | ||||
| a generic URI parser. | ||||
| The authority component is only present when a string matches the | ||||
| net-path production. Since the presence of an authority component | ||||
| restricts the remaining syntax for path, we have not included a | ||||
| specific "path" rule in the syntax. Instead, what we refer to as the | ||||
| URI path is that part of the parsed URI string matching the abs-path | ||||
| or rel-path production in the syntax above, since they are mutually | ||||
| exclusive for any given URI and can be parsed as a single component. | ||||
| The following are two example URIs and their component parts: | The following are two example URIs and their component parts: | |||
| foo://example.com:8042/over/there?name=ferret#nose | foo://example.com:8042/over/there?name=ferret#nose | |||
| \_/ \______________/\_________/ \_________/ \__/ | \_/ \______________/\_________/ \_________/ \__/ | |||
| | | | | | | | | | | | | |||
| scheme authority path query fragment | scheme authority path query fragment | |||
| | _____________________|__ | | _____________________|__ | |||
| / \ / \ | / \ / \ | |||
| urn:example:animal:ferret:nose | urn:example:animal:ferret:nose | |||
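As a rough illustration of the five-component split, the following Python sketch uses the regular expression given in Appendix B of RFC 2396 (the document this draft would obsolete). The function name `split_uri` and the dictionary layout are assumptions made for this example; it is a "first-match-wins" split, not a validator.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI reference into its five generic components.

    A component that is absent is reported as None; the path is always
    defined, though it may be the empty string.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For the first example URI this gives scheme "foo", authority "example.com:8042", path "/over/there", query "name=ferret", and fragment "nose"; for the second it gives scheme "urn", no authority, and path "example:animal:ferret:nose".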
| skipping to change at page 17, line 15 ¶ | skipping to change at page 15, line 45 ¶ | |||
| specification may further restrict the syntax and semantics of | specification may further restrict the syntax and semantics of | |||
| identifiers using that scheme. | identifiers using that scheme. | |||
| Scheme names consist of a sequence of characters beginning with a | Scheme names consist of a sequence of characters beginning with a | |||
| letter and followed by any combination of letters, digits, plus | letter and followed by any combination of letters, digits, plus | |||
| ("+"), period ("."), or hyphen ("-"). Although scheme is | ("+"), period ("."), or hyphen ("-"). Although scheme is | |||
| case-insensitive, the canonical form is lowercase and documents that | case-insensitive, the canonical form is lowercase and documents that | |||
| specify schemes must do so using lowercase letters. An | specify schemes must do so using lowercase letters. An | |||
| implementation should accept uppercase letters as equivalent to | implementation should accept uppercase letters as equivalent to | |||
| lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | |||
| the sake of robustness, but should only generate lowercase scheme | the sake of robustness, but should only produce lowercase scheme | |||
| names, for consistency. | names, for consistency. | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
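The scheme rule translates directly into a regular expression. The sketch below is one possible reading, with hypothetical helper names `is_valid_scheme` and `normalize_scheme`: it accepts uppercase letters on input and produces lowercase on output, as the text recommends.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def is_valid_scheme(scheme: str) -> bool:
    """True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(scheme) is not None


def normalize_scheme(scheme: str) -> str:
    """Accept either case, but only produce the lowercase form."""
    if not is_valid_scheme(scheme):
        raise ValueError(f"not a valid scheme name: {scheme!r}")
    return scheme.lower()
```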
| Individual schemes are not specified by this document. The process | Individual schemes are not specified by this document. The process | |||
| for registration of new URI schemes is defined separately by | for registration of new URI schemes is defined separately by | |||
| [RFC2717]. The scheme registry maintains the mapping between scheme | [RFC2717]. The scheme registry maintains the mapping between scheme | |||
| names and their specifications. | names and their specifications. Advice for designers of new URI | |||
| schemes can be found in [RFC2718]. | ||||
| When presented with a URI that violates one or more scheme-specific | ||||
| restrictions, the scheme-specific resolution process should flag the | ||||
| reference as an error rather than ignore the unused parts; doing so | ||||
| reduces the number of equivalent URIs and helps detect abuses of the | ||||
| generic syntax that might indicate the URI has been constructed to | ||||
| mislead the user (Section 7.6). | ||||
| 3.2 Authority | 3.2 Authority | |||
| Many URI schemes include a hierarchical element for a naming | Many URI schemes include a hierarchical element for a naming | |||
| authority, such that governance of the name space defined by the | authority, such that governance of the name space defined by the | |||
| remainder of the URI is delegated to that authority (which may, in | remainder of the URI is delegated to that authority (which may, in | |||
| turn, delegate it further). The generic syntax provides a common | turn, delegate it further). The generic syntax provides a common | |||
| means for distinguishing an authority based on a registered domain | means for distinguishing an authority based on a registered name or | |||
| name or server address, along with optional port and user | server address, along with optional port and user information. | |||
| information. | ||||
| The authority component is preceded by a double slash ("//") and is | The authority component is preceded by a double slash ("//") and is | |||
| terminated by the next slash ("/"), question-mark ("?"), or | terminated by the next slash ("/"), question mark ("?"), or number | |||
| number-sign ("#") character, or by the end of the URI. | sign ("#") character, or by the end of the URI. | |||
| authority = [ userinfo "@" ] host [ ":" port ] | authority = [ userinfo "@" ] host [ ":" port ] | |||
| The parts "<userinfo>@" and ":<port>" may be omitted. | URI producers and normalizers should omit the "@" delimiter that | |||
| separates userinfo from host if the userinfo component is empty (zero | ||||
| Some schemes do not allow the userinfo and/or port sub-components. | length) and should omit the ":" delimiter that separates host from | |||
| When presented with a URI that violates one or more scheme-specific | port if the port component is empty. Some schemes do not allow the | |||
| restrictions, the scheme-specific URI resolution process should flag | userinfo and/or port sub-components. | |||
| the reference as an error rather than ignore the unused parts; doing | ||||
| so reduces the number of equivalent URIs and helps detect abuses of | ||||
| the generic syntax that might indicate the URI has been constructed | ||||
| to mislead the user (Section 7.5). | ||||
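A minimal Python sketch of splitting and re-producing the authority component follows. The function names are hypothetical, and the sketch assumes the input has already been isolated as the authority (no "/", "?", or "#"). `build_authority` applies the -04 guidance that producers omit the "@" and ":" delimiters when userinfo or port is empty.

```python
def split_authority(authority: str):
    """Return (userinfo, host, port); absent parts are None."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None                      # no "@" delimiter present
    if hostport.startswith("["):
        # Bracketed IP literal: colons inside the brackets are not
        # port delimiters.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP literal")
        host, rest = hostport[: end + 1], hostport[end + 1 :]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected data after IP literal")
        port = rest[1:] if rest else None
    else:
        host, colon, port = hostport.rpartition(":")
        if not colon:
            host, port = hostport, None      # no ":" delimiter present
    return userinfo, host, port


def build_authority(userinfo, host, port) -> str:
    """Re-produce an authority, dropping delimiters of empty parts."""
    out = host
    if userinfo:
        out = userinfo + "@" + out
    if port:
        out = out + ":" + port
    return out
```

Under these assumptions, "@example.com:" splits into an empty userinfo and an empty port, and is re-produced as "example.com".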
| 3.2.1 User Information | 3.2.1 User Information | |||
| The userinfo sub-component may consist of a user name and, | The userinfo sub-component may consist of a user name and, | |||
| optionally, scheme-specific information about how to gain | optionally, scheme-specific information about how to gain | |||
| authorization to access the server. The user information, if | authorization to access the resource. The user information, if | |||
| present, is followed by a commercial at-sign ("@") that delimits it | present, is followed by a commercial at-sign ("@") that delimits it | |||
| from the host. | from the host. | |||
| userinfo = *( unreserved / escaped / ";" / | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | ||||
| Some URI schemes use the format "user:password" in the userinfo | Use of the format "user:password" in the userinfo field is | |||
| field. This practice is NOT RECOMMENDED, because the passing of | deprecated. Applications should not render as clear text any data | |||
| after the first colon (":") character found within a userinfo | ||||
| sub-component unless such data is the empty string (indicating no | ||||
| password) or "anonymous". Applications may choose to ignore or reject | ||||
| such data when received as part of a reference, and should reject the | ||||
| storage of such data in unencrypted form. The passing of | ||||
| authentication information in clear text has proven to be a security | authentication information in clear text has proven to be a security | |||
| risk in almost every case where it has been used. Note also that | risk in almost every case where it has been used. | |||
| userinfo might be crafted to look like a trusted domain name in order | ||||
| to mislead users, as described in Section 7.5. | Applications that render a URI for the sake of user feedback, such as | |||
| in graphical hypertext browsing, should render userinfo in a way that | ||||
| is distinguished from the rest of a URI, when feasible. Such | ||||
| rendering will assist the user in cases where the userinfo has been | ||||
| misleadingly crafted to look like a trusted domain name (Section | ||||
| 7.6). | ||||
| 3.2.2 Host | 3.2.2 Host | |||
| The host sub-component of authority is identified by an IPv6 literal | The host sub-component of authority is identified by an IP literal | |||
| encapsulated within square brackets, an IPv4 address in | encapsulated within square brackets, an IPv4 address in | |||
| dotted-decimal form, or a domain name. | dotted-decimal form, or a host name. | |||
| host = [ IPv6reference / IPv4address / hostname ] | host = IP-literal / IPv4address / reg-name | |||
| If host is omitted, a default may be defined by the scheme-specific | The syntax rule for host is ambiguous because it does not completely | |||
| semantics of the URI. For example, the "file" URI scheme defaults to | distinguish between an IPv4address and a reg-name. Again, the | |||
| "localhost", whereas the "http" URI scheme does not allow host to be | "first-match-wins" algorithm applies: If host matches the rule for | |||
| omitted. | IPv4address, then it should be considered an IPv4 address literal and | |||
| not a reg-name. Although host is case-insensitive, producers and | ||||
| normalizers should use lowercase for host names and hexadecimal | ||||
| addresses for the sake of uniformity, while only using uppercase | ||||
| letters for percent-encodings. | ||||
| The production for host is ambiguous because it does not completely | A host identified by an Internet Protocol literal address, version 6 | |||
| distinguish between an IPv4address and a hostname. Again, the | [RFC3513] or later, is distinguished by enclosing the IP literal | |||
| "first-match-wins" algorithm applies: If host matches the production | within square brackets ("[" and "]"). This is the only place where | |||
| for IPv4address, then it should be considered an IPv4 address literal | square bracket characters are allowed in the URI syntax. In | |||
| and not a hostname. | anticipation of future, as-yet-undefined IP literal address formats, | |||
| an optional version flag may be used to indicate such a format | ||||
| explicitly rather than relying on heuristic determination. | ||||
| A hostname takes the form described in Section 3 of [RFC1034] and | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| Section 2.1 of [RFC1123]: a sequence of domain labels separated by | ||||
| ".", each domain label starting and ending with an alphanumeric | ||||
| character and possibly also containing "-" characters. The rightmost | ||||
| domain label of a fully qualified domain name may be followed by a | ||||
| single "." if it is necessary to distinguish between the complete | ||||
| domain name and some local domain. | ||||
| hostname = domainlabel qualified | IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| qualified = *( "." domainlabel ) [ "." ] | ||||
| domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | The version flag does not indicate the IP version; rather, it | |||
| alphanum = ALPHA / DIGIT | indicates future versions of the literal format. As such, | |||
| implementations must not provide the version flag for existing IPv4 | ||||
| and IPv6 literal addresses. If a URI containing an IP-literal that | ||||
| starts with "v" (case-insensitive), indicating that the version flag | ||||
| is present, is dereferenced by an application that does not know the | ||||
| meaning of that version flag, then the application should return an | ||||
| appropriate error for "address mechanism not supported". | ||||
| A host identified by an IPv6 literal address is represented inside | ||||
| the square brackets without a preceding version flag. The ABNF | ||||
| provided here is a translation of the text definition of an IPv6 | ||||
| literal address provided in [RFC3513]. A 128-bit IPv6 address is | ||||
| divided into eight 16-bit pieces. Each piece is represented | ||||
| numerically in case-insensitive hexadecimal, using one to four | ||||
| hexadecimal digits (leading zeroes are permitted). The eight encoded | ||||
| pieces are given most-significant first, separated by colon | ||||
| characters. Optionally, the least-significant two pieces may instead | ||||
| be represented in IPv4 address textual format. A sequence of one or | ||||
| more consecutive zero-valued 16-bit pieces within the address may be | ||||
| elided, omitting all their digits and leaving exactly two consecutive | ||||
| colons in their place to mark the elision. | ||||
| IPv6address = 6( h16 ":" ) ls32 | ||||
| / "::" 5( h16 ":" ) ls32 | ||||
| / [ h16 ] "::" 4( h16 ":" ) ls32 | ||||
| / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | ||||
| / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | ||||
| / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | ||||
| / [ *4( h16 ":" ) h16 ] "::" ls32 | ||||
| / [ *5( h16 ":" ) h16 ] "::" h16 | ||||
| / [ *6( h16 ":" ) h16 ] "::" | ||||
| ls32 = ( h16 ":" h16 ) / IPv4address | ||||
| ; least-significant 32 bits of address | ||||
| h16 = 1*4HEXDIG | ||||
| ; 16 bits of address represented in hexadecimal | ||||
| A host identified by an IPv4 literal address is represented in | A host identified by an IPv4 literal address is represented in | |||
| dotted-decimal notation (a sequence of four decimal numbers in the | dotted-decimal notation (a sequence of four decimal numbers in the | |||
| range 0 to 255, separated by "."), as described in [RFC1123] by | range 0 to 255, separated by "."), as described in [RFC1123] by | |||
| reference to [RFC0952]. Note that other forms of dotted notation may | reference to [RFC0952]. Note that other forms of dotted notation may | |||
| be interpreted on some platforms, as described in Section 7.3, but | be interpreted on some platforms, as described in Section 7.4, but | |||
| only the dotted-decimal form of four octets is allowed by this | only the dotted-decimal form of four octets is allowed by this | |||
| grammar. | grammar. | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| / %x31-39 DIGIT ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| / "2" %x30-34 DIGIT ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / "25" %x30-35 ; 250-255 | / "25" %x30-35 ; 250-255 | |||
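The dec-octet rule and the "first-match-wins" treatment of host can be sketched as below. The function name `classify_host` is hypothetical, the bracket test does not validate the literal inside the brackets, and the final branch does not check the string against the reg-name character set.

```python
import re

# dec-octet alternatives, longest first: 250-255, 200-249, 100-199,
# 10-99, 0-9. Leading zeros such as "01" are not permitted.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def classify_host(host: str) -> str:
    """Classify a host string using first-match-wins ordering."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"
    if IPV4_RE.fullmatch(host):
        return "IPv4address"
    return "reg-name"      # everything else falls through to reg-name
```

With this ordering, "192.0.2.1" is an IPv4address, while "256.0.0.1" and "192.0.2" fail the IPv4address rule and are treated as registered names.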
| A host identified by an IPv6 literal address [RFC3513] is | A host identified by a registered name is a string of characters that | |||
| distinguished by enclosing the IPv6 literal within square-brackets | is intended for lookup within a locally-defined host or service name | |||
| ("[" and "]"). This is the only place where square-bracket | registry. The most common of such registry mechanisms is the Domain | |||
| characters are allowed in the URI syntax. | Name System (DNS), as defined by Section 3 of [RFC1034] and Section | |||
| 2.1 of [RFC1123]. A DNS name consists of a sequence of domain labels | ||||
| separated by ".", each domain label starting and ending with an | ||||
| alphanumeric character and possibly also containing "-" characters. | ||||
| The rightmost domain label of a fully qualified domain name in DNS | ||||
| may be followed by a single "." and should be followed by one if it | ||||
| is necessary to distinguish between the complete domain name and some | ||||
| local domain. | ||||
| IPv6reference = "[" IPv6address "]" | reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | |||
| IPv6address = 6( h4 ":" ) ls32 | If the host component is defined and the registered name is empty | |||
| / "::" 5( h4 ":" ) ls32 | (zero length), then the name defaults to "localhost" (Section 6.2.3 | |||
| / [ h4 ] "::" 4( h4 ":" ) ls32 | discusses how this should be normalized). If "localhost" is not | |||
| / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | determined by a host name lookup, then it should be interpreted to | |||
| / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | mean the machine on which the URI is being resolved. | |||
| / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | ||||
| / [ *4( h4 ":" ) h4 ] "::" ls32 | ||||
| / [ *5( h4 ":" ) h4 ] "::" h4 | ||||
| / [ *6( h4 ":" ) h4 ] "::" | ||||
| ls32 = ( h4 ":" h4 ) / IPv4address | This specification does not mandate a particular registered name | |||
| ; least-significant 32 bits of address | lookup technology and therefore does not restrict the syntax of | |||
| reg-name beyond that necessary for interoperability. Instead, it | ||||
| delegates the issue of host name syntax conformance to the operating | ||||
| system of each application performing URI resolution, and that | ||||
| operating system decides what it will allow for the purpose of host | ||||
| identification. A URI resolution implementation might use DNS, host | ||||
| tables, yellow pages, NetInfo, WINS, or any other system for lookup | ||||
| of host and service names. However, a globally-scoped naming system, | ||||
| such as DNS fully-qualified domain names, is necessary for URIs that | ||||
| are intended to have global scope. URI producers should use host | ||||
| names that conform to the DNS syntax, even when use of DNS is not | ||||
| immediately apparent. | ||||
| h4 = 1*4HEXDIG | The reg-name syntax allows percent-encoded octets in order to | |||
| represent non-ASCII host or service names in a uniform way that is | ||||
| independent of the underlying name resolution technology; such octets | ||||
| must represent characters encoded in the UTF-8 character encoding | ||||
| [RFC3629] prior to being percent-encoded. When a non-ASCII host name | ||||
| represents an internationalized domain name intended for resolution | ||||
| via DNS, the name must be transformed to the IDNA encoding [RFC3490] | ||||
| prior to name lookup. URI producers should provide such host names in | ||||
| the IDNA encoding, rather than a percent-encoding, if they wish to | ||||
| maximize interoperability with legacy URI resolvers. | ||||
| The presence of host within a URI does not imply that the scheme | The presence of host within a URI does not imply that the scheme | |||
| requires access to the given host on the Internet. In many cases, | requires access to the given host on the Internet. In many cases, | |||
| the host syntax is used only for the sake of reusing the existing | the host syntax is used only for the sake of reusing the existing | |||
| registration process created and deployed for DNS, thus obtaining a | registration process created and deployed for DNS, thus obtaining a | |||
| globally unique name without the cost of deploying another registry. | globally unique name without the cost of deploying another registry. | |||
| However, such use comes with its own costs: domain name ownership may | However, such use comes with its own costs: domain name ownership may | |||
| change over time for reasons not anticipated by the URI creator. | change over time for reasons not anticipated by the URI producer. | |||
| 3.2.3 Port | 3.2.3 Port | |||
| The port sub-component of authority is designated by an optional | The port sub-component of authority is designated by an optional port | |||
| port number in decimal following the host and delimited from it by a | number in decimal following the host and delimited from it by a | |||
| single colon (":") character. | single colon (":") character. | |||
| port = *DIGIT | port = *DIGIT | |||
| If port is omitted, a default may be defined by the scheme-specific | A scheme may define a default port. For example, the "http" scheme | |||
| semantics of the URI. Likewise, the type of network port designated | defines a default port of "80", corresponding to its reserved TCP | |||
| by the port number (e.g., TCP, UDP, SCTP, etc.) is defined by the URI | port number. The type of port designated by the port number (e.g., | |||
| scheme. For example, the "http" URI scheme defines a default of TCP | TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers | |||
| port 80. | and normalizers should omit the port component and its ":" delimiter | |||
| if port is empty or its value would be the same as the scheme's | ||||
| default. | ||||
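The -04 guidance on omitting the port can be sketched as a small normalization step. The table `DEFAULT_PORTS` and the function `host_with_port` are hypothetical; only the "http" default of 80 comes from the text, and a real implementation would take defaults from each scheme's specification.

```python
# Assumption: a per-scheme table of default ports; only "http" -> 80
# is stated in the text above.
DEFAULT_PORTS = {"http": 80}


def host_with_port(scheme: str, host: str, port) -> str:
    """Return host[:port], omitting an empty or default port."""
    default = DEFAULT_PORTS.get(scheme.lower())
    # port = *DIGIT, so a non-empty port is assumed to be all digits.
    if not port or int(port) == default:
        return host
    return f"{host}:{port}"
```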
| 3.3 Path | 3.3 Path | |||
| The path component contains hierarchical data that, along with data | The path component contains data, usually organized in hierarchical | |||
| in the optional query (Section 3.4) component, serves to identify a | form, that, along with data in the non-hierarchical query component | |||
| resource within the scope of that URI's scheme and naming authority | (Section 3.4), serves to identify a resource within the scope of the | |||
| (if any). There is no specific "path" syntax production in the | URI's scheme and naming authority (if any). If a URI contains an | |||
| generic URI syntax. Instead, what we refer to as the URI path is | authority component, then the initial path segment must be empty | |||
| that part of the parsed URI string matching either the abs-path or | (i.e., the path must begin with a slash ("/") character or be | |||
| the rel-path production, since they are mutually exclusive for any | entirely empty). The path is terminated by the first question mark | |||
| given URI and can be parsed as a single component. The path is | ("?") or number sign ("#") character, or by the end of the URI. | |||
| terminated by the first question-mark ("?") or number-sign ("#") | ||||
| character, or by the end of the URI. | ||||
| path-segments = segment *( "/" segment ) | path = segment *( "/" segment ) | |||
| segment = *pchar | segment = *pchar | |||
| pchar = unreserved / escaped / ";" / | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| The path consists of a sequence of path segments separated by a slash | A path consists of a sequence of path segments separated by a slash | |||
| ("/") character. A path is always defined for a URI, though the | ("/") character. A path is always defined for a URI, though the | |||
| defined path may be empty (zero length) or opaque (not containing any | defined path may be empty (zero length). Use of the slash character | |||
| "/" delimiters). For example, the URI <mailto:fred@example.com> has | to indicate hierarchy is only required when a URI will be used as the | |||
| a path of "fred@example.com". | context for relative references. For example, the URI | |||
| <mailto:fred@example.com> has a path of "fred@example.com", whereas | ||||
| the URI <foo://info.example.com?fred> has an empty path. | ||||
| The path segments "." and ".." are defined for relative reference | The path segments "." and ".." are defined for relative reference | |||
| within the path name hierarchy. They are intended for use at the | within the path name hierarchy. They are intended for use at the | |||
| beginning of a relative path reference (Section 4.2) for indicating | beginning of a relative path reference (Section 4.2) for indicating | |||
| relative position within the hierarchical tree of names, with a | relative position within the hierarchical tree of names. This is | |||
| similar effect to how they are used within some operating systems' | similar to their role within some operating systems' file directory | |||
| file directory structure to indicate the current directory and parent | structure to indicate the current directory and parent directory, | |||
| directory, respectively. Unlike a file system, however, these | respectively. However, unlike a file system, these dot-segments are | |||
| dot-segments are only interpreted within the URI path hierarchy and | only interpreted within the URI path hierarchy and are removed as | |||
| are removed as part of the URI normalization or resolution process, | part of the resolution process (Section 5.2). | |||
| as described in Section 5.2. | ||||
| Aside from dot-segments in hierarchical paths, a path segment is | Aside from dot-segments in hierarchical paths, a path segment is | |||
| considered opaque by the generic syntax. URI generating applications | considered opaque by the generic syntax. URI-producing applications | |||
| often use the reserved characters allowed in segment for the purpose | often use the reserved characters allowed in a segment for the | |||
| of delimiting scheme-specific or generator-specific sub-components. | purpose of delimiting scheme-specific or dereference-handler-specific | |||
| For example, the semicolon (";") and equals ("=") reserved characters | sub-components. For example, the semicolon (";") and equals ("=") | |||
| are often used for delimiting parameters and parameter values | reserved characters are often used for delimiting parameters and | |||
| applicable to that segment. The comma (",") reserved character is | parameter values applicable to that segment. The comma (",") | |||
| often used for similar purposes. For example, one URI generator | reserved character is often used for similar purposes. For example, | |||
| might use a segment like "name;v=1.1" to indicate a reference to | one URI producer might use a segment like "name;v=1.1" to indicate a | |||
| version 1.1 of "name", whereas another might use a segment like | reference to version 1.1 of "name", whereas another might use a | |||
| "name,1.1" to indicate the same. Parameter types may be defined by | segment like "name,1.1" to indicate the same. Parameter types may be | |||
| scheme-specific semantics, but in most cases the meaning of a | defined by scheme-specific semantics, but in most cases the syntax of | |||
| parameter is specific to the URI originator. | a parameter is specific to the implementation of the URI's | |||
| dereferencing algorithm. | ||||
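To illustrate how a producer-specific convention can sit on top of opaque segments, the sketch below splits a path on "/" and then reads ";key=value" parameters from each segment. The function name and the semicolon convention are assumptions for this example; the generic syntax itself assigns no meaning to these characters.

```python
def split_segment_params(path: str):
    """Split a path into (name, params) pairs, one per segment.

    Parameters follow the hypothetical "name;key=value" convention.
    """
    result = []
    for segment in path.split("/"):
        name, *params = segment.split(";")
        # partition("=")[::2] keeps the key and the value, dropping "=".
        result.append((name, dict(p.partition("=")[::2] for p in params)))
    return result
```

For the path "/docs/name;v=1.1", the first segment is empty (as required when an authority is present), and the last segment yields the name "name" with parameter v set to "1.1".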
| 3.4 Query | 3.4 Query | |||
| The query component contains non-hierarchical data that, along with | The query component contains non-hierarchical data that, along with | |||
| data in the path (Section 3.3) component, serves to identify a | data in the path component (Section 3.3), serves to identify a | |||
| resource within the scope of that URI's scheme and naming authority | resource within the scope of the URI's scheme and naming authority | |||
| (if any). The query component is indicated by the first question-mark | (if any). The query component is indicated by the first question mark | |||
| ("?") character and terminated by a number-sign ("#") character or by | ("?") character and terminated by a number sign ("#") character or by | |||
| the end of the URI. | the end of the URI. | |||
| query = *( pchar / "/" / "?" ) | query = *( pchar / "/" / "?" ) | |||
| The characters slash ("/") and question-mark ("?") are allowed to | The characters slash ("/") and question mark ("?") may represent data | |||
| represent data within the query component, but such use is | within the query component, but should not be used as such within a | |||
| discouraged; incorrect implementations of reference resolution often | URI that is expected to be the base for relative references (Section | |||
| fail to distinguish them from hierarchical separators, thus resulting | 5.1). Incorrect implementations of reference resolution often fail | |||
| in non-interoperable results while parsing relative references. | to distinguish query data from path data when looking for | |||
| hierarchical separators, thus resulting in non-interoperable results. | ||||
| However, since query components are often used to carry identifying | However, since query components are often used to carry identifying | |||
| information in the form of "key=value" pairs, and one frequently used | information in the form of "key=value" pairs, and one frequently used | |||
| value is a reference to another URI, it is sometimes better for | value is a reference to another URI, it is sometimes better for | |||
| usability to include those characters unescaped. | usability to avoid percent-encoding those characters. | |||
| Note: Some client applications will fail to separate a reference's | ||||
| query component from its path component before merging the base | ||||
| and reference paths (Section 5.2). This may result in loss of | ||||
| information if the query component contains the strings "/../" or | ||||
| "/./". | ||||
| 3.5 Fragment | 3.5 Fragment | |||
| The fragment identifier component allows indirect identification of a | The fragment identifier component of a URI allows indirect | |||
| secondary resource by reference to a primary resource and additional | identification of a secondary resource by reference to a primary | |||
| identifying information that is selective within that resource. The | resource and additional identifying information. The identified | |||
| identified secondary resource may be some portion or subset of the | secondary resource may be some portion or subset of the primary | |||
| primary resource, some view on representations of the primary | resource, some view on representations of the primary resource, or | |||
| resource, or some other resource that is merely named within the | some other resource defined or described by those representations. A | |||
| primary resource. A fragment identifier component is indicated by | fragment identifier component is indicated by the presence of a | |||
| the presence of a number-sign ("#") character and terminated by the | number sign ("#") character and terminated by the end of the URI. | |||
| end of the URI string. | ||||
| fragment = *( pchar / "/" / "?" ) | fragment = *( pchar / "/" / "?" ) | |||
| The semantics of a fragment identifier are defined by the set of | The semantics of a fragment identifier are defined by the set of | |||
| representations that might result from a retrieval action on the | representations that might result from a retrieval action on the | |||
| primary resource. The fragment's format and resolution is therefore | primary resource. The fragment's format and resolution is therefore | |||
| dependent on the media type [RFC2046] of the retrieved | dependent on the media type [RFC2046] of a potentially retrieved | |||
| representation, even though such a retrieval is only performed if the | representation, even though such a retrieval is only performed if the | |||
| URI is dereferenced. Individual media types may define their own | URI is dereferenced. Individual media types may define their own | |||
| restrictions on, or structure within, the fragment identifier syntax | restrictions on, or structure within, the fragment identifier syntax | |||
| for specifying different types of subsets, views, or external | for specifying different types of subsets, views, or external | |||
| references that are identifiable as secondary resources by that media | references that are identifiable as secondary resources by that media | |||
| type. If the primary resource is represented by multiple media | type. If the primary resource has multiple representations, as is | |||
| types, as is often the case for resources whose representation is | often the case for resources whose representation is selected based | |||
| selected based on attributes of the retrieval request, then | on attributes of the retrieval request (a.k.a., content negotiation), | |||
| interpretation of the fragment identifier must be consistent across | then whatever is identified by the fragment should be consistent | |||
| all of those media types in order for it to be viable as an | across all of those representations: each representation should | |||
| identifier. | either define the fragment such that it corresponds to the same | |||
| secondary resource, regardless of how it is represented, or the | ||||
| fragment should be left undefined by the representation (i.e., not | ||||
| found). | ||||
| As with any URI, use of a fragment identifier component does not | As with any URI, use of a fragment identifier component does not | |||
| imply that a retrieval action will take place. A URI with a fragment | imply that a retrieval action will take place. A URI with a fragment | |||
| identifier may be used to refer to the secondary resource without any | identifier may be used to refer to the secondary resource without any | |||
| implication that the primary resource is accessible. However, if | implication that the primary resource is accessible or will ever be | |||
| that URI is used in a context that does call for retrieval and is not | accessed. | |||
| a same-document reference (Section 4.4), the fragment identifier is | ||||
| only valid as a reference if a retrieval action on the primary | ||||
| resource succeeds and results in a representation for which the | ||||
| fragment identifier is meaningful. | ||||
| Fragment identifiers have a special role in information systems as | Fragment identifiers have a special role in information systems as | |||
| the primary form of client-side indirect referencing, allowing an | the primary form of client-side indirect referencing, allowing an | |||
| author to specifically identify those aspects of an existing resource | author to specifically identify those aspects of an existing resource | |||
| that are only indirectly provided by the resource owner. As such, | that are only indirectly provided by the resource owner. As such, | |||
| interpretation of the fragment identifier during a retrieval action | interpretation of the fragment identifier during a retrieval action | |||
| is performed solely by the user agent; the fragment identifier is not | is performed solely by the user agent; the fragment identifier is not | |||
| passed to other systems during the process of retrieval. Although | passed to other systems during the process of retrieval. Although | |||
| this is often perceived to be a loss of information, particularly in | this is often perceived to be a loss of information, particularly in | |||
| regards to accurate redirection of references as content moves over | regards to accurate redirection of references as content moves over | |||
| time, it also serves to prevent information providers from denying | time, it also serves to prevent information providers from denying | |||
| reference authors the right to selectively refer to information | reference authors the right to selectively refer to information | |||
| within a resource. | within a resource. | |||
| The characters slash ("/") and question-mark ("?") are allowed to | The characters slash ("/") and question mark ("?") are allowed to | |||
| represent data within the fragment identifier, but such use is | represent data within the fragment identifier, but should not be used | |||
| discouraged for the same reasons as described above for query. | as such within a URI that is expected to be the base for relative | |||
| references (Section 5.1) for the same reasons as described above for | ||||
| query. | ||||
| 4. Usage | 4. Usage | |||
| When applications make reference to a URI, they do not always use the | When applications make reference to a URI, they do not always use the | |||
| full form of reference defined by the "URI" syntax production. In | full form of reference defined by the "URI" syntax rule. In order to | |||
| order to save space and take advantage of hierarchical locality, many | save space and take advantage of hierarchical locality, many Internet | |||
| Internet protocol elements and media type formats allow an | protocol elements and media type formats allow an abbreviation of a | |||
| abbreviation of a URI, while others restrict the syntax to a | URI, while others restrict the syntax to a particular form of URI. | |||
| particular form of URI. We define the most common forms of reference | We define the most common forms of reference syntax in this | |||
| syntax in this specification because they impact and depend upon the | specification because they impact and depend upon the design of the | |||
| design of the generic syntax, requiring a uniform parsing algorithm | generic syntax, requiring a uniform parsing algorithm in order to be | |||
| in order to be interpreted consistently. | interpreted consistently. | |||
| 4.1 URI Reference | 4.1 URI Reference | |||
| The ABNF rule URI-reference is used to denote the most common usage | URI-reference is used to denote the most common usage of a resource | |||
| of a resource identifier. | identifier. | |||
| URI-reference = URI / relative-URI | URI-reference = URI / relative-URI | |||
| A URI-reference may be relative: if the reference string's prefix | A URI-reference may be relative: if the reference's prefix matches | |||
| matches the syntax of a scheme followed by its colon separator, then | the syntax of a scheme followed by its colon separator, then the | |||
| the reference is a URI rather than a relative-URI. | reference is a URI rather than a relative-URI. | |||
| A URI-reference is typically parsed first into the five URI | A URI-reference is typically parsed first into the five URI | |||
| components, in order to determine what components are present and | components, in order to determine what components are present and | |||
| whether or not the reference is relative, and then each component is | whether or not the reference is relative, and then each component is | |||
| parsed for its subparts and their validation. The ABNF of | parsed for its subparts and their validation. The ABNF of | |||
| URI-reference, along with the "first-match-wins" disambiguation rule, | URI-reference, along with the "first-match-wins" disambiguation rule, | |||
| is sufficient to define a validating parser for the generic syntax. | is sufficient to define a validating parser for the generic syntax. | |||
| Readers familiar with regular expressions should see Appendix B for | Readers familiar with regular expressions should see Appendix B for | |||
| an example of a non-validating URI-reference parser that will take | an example of a non-validating URI-reference parser that will take | |||
| any given string and extract the URI components. | any given string and extract the URI components. | |||
| 4.2 Relative URI | 4.2 Relative URI | |||
| A relative URI reference takes advantage of the hier-part syntax | A relative URI reference takes advantage of the hierarchical syntax | |||
| (Section 3) in order to express a reference that is relative to the | (Section 1.2.3) in order to express a reference that is relative to | |||
| name space of another hierarchical URI. | the name space of another hierarchical URI. | |||
| relative-URI = hier-part [ "?" query ] [ "#" fragment ] | relative-URI = ["//" authority] path ["?" query] ["#" fragment] | |||
| The URI referred to by a relative reference is obtained by applying | The URI referred to by a relative reference, also known as the target | |||
| the reference resolution algorithm of Section 5. | URI, is obtained by applying the reference resolution algorithm of | |||
| Section 5. | ||||
| A relative reference that begins with two slash characters is termed | A relative reference that begins with two slash characters is termed | |||
| a network-path reference; such references are rarely used. A relative | a network-path reference; such references are rarely used. A relative | |||
| reference that begins with a single slash character is termed an | reference that begins with a single slash character is termed an | |||
| absolute-path reference. A relative reference that does not begin | absolute-path reference. A relative reference that does not begin | |||
| with a slash character is termed a relative-path reference. | with a slash character is termed a relative-path reference. | |||
| A path segment that contains a colon character (e.g., "this:that") | A path segment that contains a colon character (e.g., "this:that") | |||
| cannot be used as the first segment of a relative-path reference | cannot be used as the first segment of a relative-path reference | |||
| because it would be mistaken for a scheme name. Such a segment must | because it would be mistaken for a scheme name. Such a segment must | |||
| be preceded by a dot-segment (e.g., "./this:that") to make a | be preceded by a dot-segment (e.g., "./this:that") to make a | |||
| relative-path reference. | relative-path reference. | |||
| 4.3 Absolute URI | 4.3 Absolute URI | |||
| Some protocol elements allow only the absolute form of a URI without | Some protocol elements allow only the absolute form of a URI without | |||
| a fragment identifier. For example, defining the base URI for later | a fragment identifier. For example, defining a base URI for later | |||
| use by relative references calls for an absolute-URI production that | use by relative references calls for an absolute-URI syntax rule that | |||
| does not allow a fragment. | does not allow a fragment. | |||
| absolute-URI = scheme ":" hier-part [ "?" query ] | absolute-URI = scheme ":" ["//" authority] path ["?" query] | |||
| 4.4 Same-document Reference | 4.4 Same-document Reference | |||
| When a URI reference occurring within a document or message refers to | When a URI reference refers to a URI that is, aside from its fragment | |||
| a URI that is, aside from its fragment component (if any), identical | component (if any), identical to the base URI (Section 5.1), that | |||
| to the base URI (Section 5.1), that reference is called a | reference is called a "same-document" reference. The most frequent | |||
| "same-document" reference. The most frequent examples of | examples of same-document references are relative references that are | |||
| same-document references are relative references that are empty or | empty or include only the number sign ("#") separator followed by a | |||
| include only the number-sign ("#") separator followed by a fragment | fragment identifier. | |||
| identifier. | ||||
| When a same-document reference is dereferenced for the purpose of a | When a same-document reference is dereferenced for the purpose of a | |||
| retrieval action, the target of that reference is defined to be | retrieval action, the target of that reference is defined to be | |||
| within that current document or message; the dereference should not | within the same entity (representation, document, or message) as the | |||
| result in a new retrieval. | reference; therefore, a dereference should not result in a new | |||
| retrieval action. | ||||
| Normalization of the base and target URIs prior to their comparison, | ||||
| as described in Section 6.2.2 and Section 6.2.3, is allowed but | ||||
| rarely performed in practice. Normalization may increase the set of | ||||
| same-document references, which may be of benefit to some caching | ||||
| applications. As such, reference authors should not assume that a | ||||
| slightly different, though equivalent, reference URI will (or will | ||||
| not) be interpreted as a same-document reference by any given | ||||
| application. | ||||
| 4.5 Suffix Reference | 4.5 Suffix Reference | |||
| The URI syntax is designed for unambiguous reference to resources and | The URI syntax is designed for unambiguous reference to resources and | |||
| extensibility via the URI scheme. However, as URI identification and | extensibility via the URI scheme. However, as URI identification and | |||
| usage have become commonplace, traditional media (television, radio, | usage have become commonplace, traditional media (television, radio, | |||
| newspapers, billboards, etc.) have increasingly used a suffix of the | newspapers, billboards, etc.) have increasingly used a suffix of the | |||
| URI as a reference, consisting of only the authority and path | URI as a reference, consisting of only the authority and path | |||
| portions of the URI, such as | portions of the URI, such as | |||
| www.w3.org/Addressing/ | www.w3.org/Addressing/ | |||
| or simply the DNS hostname on its own. Such references are primarily | or simply a DNS registered name on its own. Such references are | |||
| intended for human interpretation rather than machine, with the | primarily intended for human interpretation, rather than for | |||
| assumption that context-based heuristics are sufficient to complete | machines, with the assumption that context-based heuristics are | |||
| the URI (e.g., most hostnames beginning with "www" are likely to have | sufficient to complete the URI (e.g., most host names beginning with | |||
| a URI prefix of "http://"). Although there is no standard set of | "www" are likely to have a URI prefix of "http://"). Although there | |||
| heuristics for disambiguating a URI suffix, many client | is no standard set of heuristics for disambiguating a URI suffix, | |||
| implementations allow them to be entered by the user and | many client implementations allow them to be entered by the user and | |||
| heuristically resolved. It should be noted that such heuristics may | heuristically resolved. | |||
| change over time, particularly when new URI schemes are introduced. | ||||
| While this practice of using suffix references is common, it should | ||||
| be avoided whenever possible and never used in situations where | ||||
| long-term references are expected. The heuristics noted above will | ||||
| change over time, particularly when a new URI scheme becomes popular, | ||||
| and are often incorrect when used out of context. Furthermore, they | ||||
| can lead to security issues along the lines of those described in | ||||
| [RFC1535]. | ||||
| Since a URI suffix has the same syntax as a relative path reference, | Since a URI suffix has the same syntax as a relative path reference, | |||
| a suffix reference cannot be used in contexts where a relative | a suffix reference cannot be used in contexts where a relative | |||
| reference is expected. As a result, suffix references are limited to | reference is expected. As a result, suffix references are limited to | |||
| those places where there is no defined base URI, such as dialog boxes | those places where there is no defined base URI, such as dialog boxes | |||
| and off-line advertisements. | and off-line advertisements. | |||
| 5. Reference Resolution | 5. Reference Resolution | |||
| This section defines the process of resolving a URI reference within | This section defines the process of resolving a URI reference within | |||
| a context that allows relative references, such that the result is a | a context that allows relative references, such that the result is a | |||
| string matching the "URI" syntax production of Section 3. | string matching the "URI" syntax rule of Section 3. | |||
| 5.1 Establishing a Base URI | 5.1 Establishing a Base URI | |||
| The term "relative" implies that there exists some "base URI" against | The term "relative" implies that there exists a "base URI" against | |||
| which the relative reference is applied. Aside from same-document | which the relative reference is applied. Aside from fragment-only | |||
| references (Section 4.4, relative references are only usable if the | references (Section 4.4), relative references are only usable when a | |||
| base URI is known. The base URI must be established by the parser | base URI is known. A base URI must be established by the parser | |||
| prior to parsing URI references that might be relative. | prior to parsing URI references that might be relative. | |||
| The base URI of a document can be established in one of four ways, | The base URI of a reference can be established in one of four ways, | |||
| listed below in order of precedence. The order of precedence can be | discussed below in order of precedence. The order of precedence can | |||
| thought of in terms of layers, where the innermost defined base URI | be thought of in terms of layers, where the innermost defined base | |||
| has the highest precedence. This can be visualized graphically as: | URI has the highest precedence. This can be visualized graphically | |||
| as: | ||||
| .----------------------------------------------------------. | .----------------------------------------------------------. | |||
| | .----------------------------------------------------. | | | .----------------------------------------------------. | | |||
| | | .----------------------------------------------. | | | | | .----------------------------------------------. | | | |||
| | | | .----------------------------------------. | | | | | | | .----------------------------------------. | | | | |||
| | | | | .----------------------------------. | | | | | | | | | .----------------------------------. | | | | | |||
| | | | | | <relative-reference> | | | | | | | | | | | <relative-reference> | | | | | | |||
| | | | | `----------------------------------' | | | | | | | | | `----------------------------------' | | | | | |||
| | | | | (5.1.1) Base URI embedded in the | | | | | | | | | (5.1.1) Base URI embedded in content | | | | | |||
| | | | | document's content | | | | | ||||
| | | | `----------------------------------------' | | | | | | | `----------------------------------------' | | | | |||
| | | | (5.1.2) Base URI of the encapsulating entity | | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | |||
| | | | (message, document, or none). | | | | | | | (message, representation, or none) | | | | |||
| | | `----------------------------------------------' | | | | | `----------------------------------------------' | | | |||
| | | (5.1.3) URI used to retrieve the entity | | | | | (5.1.3) URI used to retrieve the entity | | | |||
| | `----------------------------------------------------' | | | `----------------------------------------------------' | | |||
| | (5.1.4) Default Base URI is application-dependent | | | (5.1.4) Default Base URI (application-dependent) | | |||
| `----------------------------------------------------------' | `----------------------------------------------------------' | |||
| 5.1.1 Base URI within Document Content | 5.1.1 Base URI within Document Content | |||
| Within certain document media types, the base URI of the document can | Within certain media types, a base URI for relative references can be | |||
| be embedded within the content itself such that it can be readily | embedded within the content itself such that it can be readily | |||
| obtained by a parser. This can be useful for descriptive documents, | obtained by a parser. This can be useful for descriptive documents, | |||
| such as tables of content, which may be transmitted to others through | such as tables of content, which may be transmitted to others through | |||
| protocols other than their usual retrieval context (e.g., E-Mail or | protocols other than their usual retrieval context (e.g., E-Mail or | |||
| USENET news). | USENET news). | |||
| It is beyond the scope of this document to specify how, for each | It is beyond the scope of this specification to specify how, for each | |||
| media type, the base URI can be embedded. It is assumed that user | media type, a base URI can be embedded. The appropriate syntax, when | |||
| agents manipulating such media types will be able to obtain the | available, is described by each media type's specification. | |||
| appropriate syntax from that media type's specification. | ||||
| A mechanism for embedding the base URI within MIME container types | 5.1.2 Base URI from the Encapsulating Entity | |||
| If no base URI is embedded, the base URI is defined by the | ||||
| representation's retrieval context. For a document that is enclosed | ||||
| within another entity, such as a message or archive, the retrieval | ||||
| context is that entity; thus, the default base URI of a | ||||
| representation is the base URI of the entity in which the | ||||
| representation is encapsulated. | ||||
| A mechanism for embedding a base URI within MIME container types | ||||
| (e.g., the message and multipart types) is defined by MHTML | (e.g., the message and multipart types) is defined by MHTML | |||
| [RFC2110]. Protocols that do not use the MIME message header syntax, | [RFC2110]. Protocols that do not use the MIME message header syntax, | |||
| but do allow some form of tagged metadata to be included within | but do allow some form of tagged metadata to be included within | |||
| messages, may define their own syntax for defining the base URI as | messages, may define their own syntax for defining a base URI as part | |||
| part of a message. | of a message. | |||
| 5.1.2 Base URI from the Encapsulating Entity | ||||
| If no base URI is embedded, the base URI of a document is defined by | ||||
| the document's retrieval context. For a document that is enclosed | ||||
| within another entity (such as a message or another document), the | ||||
| retrieval context is that entity; thus, the default base URI of the | ||||
| document is the base URI of the entity in which the document is | ||||
| encapsulated. | ||||
| 5.1.3 Base URI from the Retrieval URI | 5.1.3 Base URI from the Retrieval URI | |||
| If no base URI is embedded and the document is not encapsulated | If no base URI is embedded and the representation is not encapsulated | |||
| within some other entity (e.g., the top level of a composite entity), | within some other entity, then, if a URI was used to retrieve the | |||
| then, if a URI was used to retrieve the base document, that URI shall | representation, that URI shall be considered the base URI. Note that | |||
| be considered the base URI. Note that if the retrieval was the | if the retrieval was the result of a redirected request, the last URI | |||
| result of a redirected request, the last URI used (i.e., that which | used (i.e., the URI that resulted in the actual retrieval of the | |||
| resulted in the actual retrieval of the document) is the base URI. | representation) is the base URI. | |||
| 5.1.4 Default Base URI | 5.1.4 Default Base URI | |||
| If none of the conditions described in above apply, then the base URI | If none of the conditions described above apply, then the base URI is | |||
| is defined by the context of the application. Since this definition | defined by the context of the application. Since this definition is | |||
| is necessarily application-dependent, failing to define the base URI | necessarily application-dependent, failing to define a base URI using | |||
| using one of the other methods may result in the same content being | one of the other methods may result in the same content being | |||
| interpreted differently by different types of application. | interpreted differently by different types of application. | |||
| It is the responsibility of the distributor(s) of a document | A sender of a representation containing relative references is | |||
| containing a relative reference to ensure that the base URI for that | responsible for ensuring that a base URI for those references can be | |||
| document can be established. It must be emphasized that a relative | established. Aside from fragment-only references, relative references | |||
| reference, aside from a same-document reference, cannot be used | can only be used reliably in situations where the base URI is | |||
| reliably in situations where the document's base URI is not | ||||
| well-defined. | well-defined. | |||
| 5.2 Obtaining the Referenced URI | 5.2 Relative Resolution | |||
| This section describes an example algorithm for resolving URI | This section describes an algorithm for converting a URI reference | |||
| references that might be relative to a given base URI. The algorithm | that might be relative to a given base URI into the parsed components | |||
| is intended to provide a definitive result that can be used to test | of the reference's target. The components can then be recomposed, as | |||
| the output of other implementations. Implementation of the algorithm | described in Section 5.3, to form the target URI. This algorithm | |||
| itself is not required, but the result given by an implementation | provides definitive results that can be used to test the output of | |||
| must match the result that would be given by this algorithm. | other implementations. Applications may implement relative reference | |||
| resolution using some other algorithm, provided that the results | ||||
| match what would be given by this algorithm. | ||||
| The base URI (Base) is established according to the rules of Section | 5.2.1 Pre-parse the Base URI | |||
| 5.1 and parsed into the five main components described in Section 3. | ||||
| Note that only the scheme component is required to be present in the | The base URI (Base) is established according to the procedure of | |||
| base URI; the other components may be empty or undefined. A | Section 5.1 and parsed into the five main components described in | |||
| component is undefined if its preceding separator does not appear in | Section 3. Note that only the scheme component is required to be | |||
| the URI reference; the path component is never undefined, though it | present in a base URI; the other components may be empty or | |||
| may be empty. The algorithm assumes that the base URI is well-formed | undefined. A component is undefined if its associated delimiter does | |||
| and does not contain dot-segments in its path. | not appear in the URI reference; the path component is never | |||
| undefined, though it may be empty. | ||||
| Normalization of the base URI, as described in Section 6.2.2 and | ||||
| Section 6.2.3, is optional. A URI reference must be transformed to | ||||
| its target URI before it can be normalized. | ||||
| 5.2.2 Transform References | ||||
| For each URI reference (R), the following pseudocode describes an | For each URI reference (R), the following pseudocode describes an | |||
| algorithm for transforming R into its target URI (T): | algorithm for transforming R into its target URI (T): | |||
| -- The URI reference is parsed into the five URI components | -- The URI reference is parsed into the five URI components | |||
| -- | -- | |||
| (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | |||
| -- A non-strict parser may ignore a scheme in the reference | -- A non-strict parser may ignore a scheme in the reference | |||
| -- if it is identical to the base URI's scheme. | -- if it is identical to the base URI's scheme. | |||
| skipping to change at page 30, line 18 ¶ | skipping to change at page 30, line 25 ¶ | |||
| endif; | endif; | |||
| T.query = R.query; | T.query = R.query; | |||
| endif; | endif; | |||
| T.authority = Base.authority; | T.authority = Base.authority; | |||
| endif; | endif; | |||
| T.scheme = Base.scheme; | T.scheme = Base.scheme; | |||
| endif; | endif; | |||
| T.fragment = R.fragment; | T.fragment = R.fragment; | |||
| The pseudocode above refers to a merge routine for merging a | 5.2.3 Merge Paths | |||
| The pseudocode above refers to a "merge" routine for merging a | ||||
| relative-path reference with the path of the base URI. This is | relative-path reference with the path of the base URI. This is | |||
| accomplished as follows: | accomplished as follows: | |||
| o If the base URI's path is empty, then return a string consisting | o If the base URI has a defined authority component and an empty | |||
| of "/" concatenated with the reference's path component; | path, then return a string consisting of "/" concatenated with the | |||
| otherwise, | reference's path; otherwise, | |||
| o If the base URI's path is non-hierarchical, as indicated by not | ||||
| beginning with a slash, then return a string consisting of the | ||||
| reference's path component; otherwise, | ||||
| o Return a string consisting of the reference's path component | o Return a string consisting of the reference's path component | |||
| appended to all but the last segment of the base URI's path (i.e., | appended to all but the last segment of the base URI's path (i.e., | |||
| any characters after the right-most "/" in the base URI path are | excluding any characters after the right-most "/" in the base URI | |||
| excluded). | path, or excluding the entire base URI path if it does not contain | |||
| any "/" characters). | ||||
| The pseudocode also refers to a remove_dot_segments routine for | 5.2.4 Remove Dot Segments | |||
| The pseudocode also refers to a "remove_dot_segments" routine for | ||||
| interpreting and removing the special "." and ".." complete path | interpreting and removing the special "." and ".." complete path | |||
| segments from a referenced path. This is done after the path is | segments from a referenced path. This is done after the path is | |||
| extracted from a reference, whether or not the path was relative, in | extracted from a reference, whether or not the path was relative, in | |||
| order to remove any invalid or extraneous dot-segments prior to | order to remove any invalid or extraneous dot-segments prior to | |||
| forming the target URI. Although there are many ways to accomplish | forming the target URI. Although there are many ways to accomplish | |||
| this removal process, we describe a simple method using a separate | this removal process, we describe a simple method using two string | |||
| string buffer: | buffers. | |||
| 1. The buffer is initialized with the unprocessed path component. | 1. The input buffer is initialized with the now-appended path | |||
| components and the output buffer is initialized to the empty | ||||
| string. | ||||
| 2. If the buffer begins with "./" or "../", the "." or ".." segment | 2. Replace any prefix of "./" or "../" at the beginning of the input | |||
| is removed. | buffer with "/". | |||
| 3. All occurrences of "/./" in the buffer are replaced with "/". | 3. While the input buffer is not empty, loop: | |||
| 4. If the buffer ends with "/.", the "." is removed. | 1. If the input buffer begins with a prefix of "/./" or "/.", | |||
| where "." is a complete path segment, then replace that | ||||
| prefix with "/"; otherwise | ||||
| 5. All occurrences of "/<segment>/../" in the buffer, where ".." and | 2. If the input buffer begins with a prefix of "/../" or "/..", | |||
| <segment> are complete path segments, are iteratively replaced | where ".." is a complete path segment, then replace that | |||
| with "/" in order from left to right until no matching pattern | prefix with "/" and remove the last segment and its preceding | |||
| remains. If the buffer ends with "/<segment>/..", that is also | "/" (if any) from the output buffer; otherwise | |||
| replaced with "/". Note that <segment> may be empty. | ||||
| 6. All prefixes of "<segment>/../" in the buffer, where ".." and | 3. Remove the first segment and its preceding "/" (if any) from | |||
| <segment> are complete path segments, are iteratively replaced | the input buffer and append them to the output buffer. | |||
| with "/" in order from left to right until no matching pattern | ||||
| remains. If the buffer ends with "<segment>/..", that is also | ||||
| replaced with "/". Note that <segment> may be empty. | ||||
| 7. The remaining buffer is returned as the result of | 4. Finally, the output buffer is returned as the result of | |||
| remove_dot_segments. | remove_dot_segments. | |||
| Some systems may find it more efficient to implement the | The following illustrates how the above steps are applied for two | |||
| remove_dot_segments algorithm as a stack of path segments being | example merged paths, showing the state of the two buffers after each | |||
| compressed, rather than as a series of string pattern replacements. | step. | |||
| 5.3 Recomposition of a Parsed URI | STEP OUTPUT BUFFER INPUT BUFFER | |||
| 1 : /a/b/c/./../../g | ||||
| 3c: /a /b/c/./../../g | ||||
| 3c: /a/b /c/./../../g | ||||
| 3c: /a/b/c /./../../g | ||||
| 3a: /a/b/c /../../g | ||||
| 3b: /a/b /../g | ||||
| 3b: /a /g | ||||
| 3c: /a/g | ||||
| STEP OUTPUT BUFFER INPUT BUFFER | ||||
| 1 : mid/content=5/../6 | ||||
| 3c: mid /content=5/../6 | ||||
| 3c: mid/content=5 /../6 | ||||
| 3b: mid /6 | ||||
| 3c: mid/6 | ||||
| Some applications may find it more efficient to implement the | ||||
| remove_dot_segments algorithm using two segment stacks rather than | ||||
| strings. | ||||
| Note: Some client applications will fail to separate a reference's | ||||
| query component from its path component before merging the base | ||||
| and reference paths. This may result in loss of information if | ||||
| the query component contains the strings "/../" or "/./". | ||||
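
The two-buffer procedure in the -04 column can be sketched in Python as follows. This is only an illustrative reading of the steps as written in this revision, not a reference implementation; the function name matches the draft's remove_dot_segments, while the variable names and the handling of an output buffer that contains no "/" are assumptions.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of the -04 two-buffer algorithm (steps 1 to 4)."""
    inp = path   # step 1: input buffer holds the merged path
    out = ""     # step 1: output buffer starts empty

    # Step 2: a leading "./" or "../" is replaced with "/".
    if inp.startswith("../"):
        inp = "/" + inp[3:]
    elif inp.startswith("./"):
        inp = "/" + inp[2:]

    # Step 3: consume the input buffer one prefix at a time.
    while inp:
        if inp.startswith("/./"):
            inp = "/" + inp[3:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/" (if any).
            cut = out.rfind("/")
            out = out[:cut] if cut >= 0 else ""
        else:
            # Move the first segment, with its preceding "/", to the output.
            start = 1 if inp.startswith("/") else 0
            end = inp.find("/", start)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]

    # Step 4: the output buffer is the result.
    return out
```

Applied to the two merged paths traced in the example above, the sketch yields "/a/g" and "mid/6" respectively.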
| 5.3 Component Recomposition | ||||
| Parsed URI components can be recomposed to obtain the corresponding | Parsed URI components can be recomposed to obtain the corresponding | |||
| URI reference string. Using pseudocode, this would be: | URI reference string. Using pseudocode, this would be: | |||
| result = "" | result = "" | |||
| if defined(scheme) then | if defined(scheme) then | |||
| append scheme to result; | append scheme to result; | |||
| append ":" to result; | append ":" to result; | |||
| endif; | endif; | |||
| skipping to change at page 32, line 4 ¶ | skipping to change at page 32, line 42 ¶ | |||
| if defined(query) then | if defined(query) then | |||
| append "?" to result; | append "?" to result; | |||
| append query to result; | append query to result; | |||
| endif; | endif; | |||
| if defined(fragment) then | if defined(fragment) then | |||
| append "#" to result; | append "#" to result; | |||
| append fragment to result; | append fragment to result; | |||
| endif; | endif; | |||
| return result; | return result; | |||
| Note that we are careful to preserve the distinction between a | Note that we are careful to preserve the distinction between a | |||
| component that is undefined, meaning that its separator was not | component that is undefined, meaning that its separator was not | |||
| present in the reference, and a component that is empty, meaning that | present in the reference, and a component that is empty, meaning that | |||
| the separator was present and was immediately followed by the next | the separator was present and was immediately followed by the next | |||
| component separator or the end of the reference. | component separator or the end of the reference. | |||
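
A minimal Python sketch of the recomposition pseudocode follows, using None for an undefined component and the empty string for an empty one. The function name recompose is hypothetical. The authority and path steps fall in the part of the pseudocode that this diff skips over, so their form here (authority preceded by "//", path always appended) is an assumption based on the generic syntax.

```python
from typing import Optional


def recompose(scheme: Optional[str],
              authority: Optional[str],
              path: str,
              query: Optional[str],
              fragment: Optional[str]) -> str:
    """Rebuild a URI reference string from parsed components.

    None means "undefined" (separator absent); "" means "empty"
    (separator present, nothing after it).
    """
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:      # assumed step, elided in the diff
        result += "//" + authority
    result += path                 # assumed step, elided in the diff
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

With this convention, recompose(None, None, "g", "", None) gives "g?" while recompose(None, None, "g", None, None) gives "g", which keeps the undefined/empty distinction described above.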
| 5.4 Reference Resolution Examples | 5.4 Reference Resolution Examples | |||
| Within an object with a well-defined base URI of | Within a representation with a well-defined base URI of | |||
| http://a/b/c/d;p?q | http://a/b/c/d;p?q | |||
| a relative URI reference would be resolved as follows: | a relative URI reference is transformed to its target URI as follows. | |||
| 5.4.1 Normal Examples | 5.4.1 Normal Examples | |||
| "g:h" = "g:h" | "g:h" = "g:h" | |||
| "g" = "http://a/b/c/g" | "g" = "http://a/b/c/g" | |||
| "./g" = "http://a/b/c/g" | "./g" = "http://a/b/c/g" | |||
| "g/" = "http://a/b/c/g/" | "g/" = "http://a/b/c/g/" | |||
| "/g" = "http://a/g" | "/g" = "http://a/g" | |||
| "//g" = "http://g" | "//g" = "http://g" | |||
| "?y" = "http://a/b/c/d;p?y" | "?y" = "http://a/b/c/d;p?y" | |||
| "g?y" = "http://a/b/c/g?y" | "g?y" = "http://a/b/c/g?y" | |||
| "#s" = "http://a/b/c/d;p?q#s" | "#s" = "http://a/b/c/d;p?q#s" | |||
| "g#s" = "http://a/b/c/g#s" | "g#s" = "http://a/b/c/g#s" | |||
| "g?y#s" = "http://a/b/c/g?y#s" | "g?y#s" = "http://a/b/c/g?y#s" | |||
| ";x" = "http://a/b/c/;x" | ";x" = "http://a/b/c/;x" | |||
| "g;x" = "http://a/b/c/g;x" | "g;x" = "http://a/b/c/g;x" | |||
| "g;x?y#s" = "http://a/b/c/g;x?y#s" | "g;x?y#s" = "http://a/b/c/g;x?y#s" | |||
| "" = "http://a/b/c/d;p?q" | ||||
| "." = "http://a/b/c/" | "." = "http://a/b/c/" | |||
| "./" = "http://a/b/c/" | "./" = "http://a/b/c/" | |||
| ".." = "http://a/b/" | ".." = "http://a/b/" | |||
| "../" = "http://a/b/" | "../" = "http://a/b/" | |||
| "../g" = "http://a/b/g" | "../g" = "http://a/b/g" | |||
| "../.." = "http://a/" | "../.." = "http://a/" | |||
| "../../" = "http://a/" | "../../" = "http://a/" | |||
| "../../g" = "http://a/g" | "../../g" = "http://a/g" | |||
| 5.4.2 Abnormal Examples | 5.4.2 Abnormal Examples | |||
| Although the following abnormal examples are unlikely to occur in | Although the following abnormal examples are unlikely to occur in | |||
| normal practice, all URI parsers should be capable of resolving them | normal practice, all URI parsers should be capable of resolving them | |||
| consistently. Each example uses the same base as above. | consistently. Each example uses the same base as above. | |||
| An empty reference refers to the current base URI. | Parsers must be careful in handling cases where there are more | |||
| "" = "http://a/b/c/d;p?q" | ||||
| Parsers must be careful in handling the case where there are more | ||||
| relative path ".." segments than there are hierarchical levels in the | relative path ".." segments than there are hierarchical levels in the | |||
| base URI's path. Note that the ".." syntax cannot be used to change | base URI's path. Note that the ".." syntax cannot be used to change | |||
| the authority component of a URI. | the authority component of a URI. | |||
| "../../../g" = "http://a/g" | "../../../g" = "http://a/g" | |||
| "../../../../g" = "http://a/g" | "../../../../g" = "http://a/g" | |||
| Similarly, parsers must remove the dot-segments "." and ".." when | Similarly, parsers must remove the dot-segments "." and ".." when | |||
| they are complete components of a path, but not when they are only | they are complete components of a path, but not when they are only | |||
| part of a segment. | part of a segment. | |||
| "/./g" = "http://a/g" | "/./g" = "http://a/g" | |||
| "/../g" = "http://a/g" | "/../g" = "http://a/g" | |||
| "g." = "http://a/b/c/g." | "g." = "http://a/b/c/g." | |||
| ".g" = "http://a/b/c/.g" | ".g" = "http://a/b/c/.g" | |||
| "g.." = "http://a/b/c/g.." | "g.." = "http://a/b/c/g.." | |||
| "..g" = "http://a/b/c/..g" | "..g" = "http://a/b/c/..g" | |||
| Less likely are cases where the relative URI uses unnecessary or | Less likely are cases where the relative URI reference uses | |||
| nonsensical forms of the "." and ".." complete path segments. | unnecessary or nonsensical forms of the "." and ".." complete path | |||
| segments. | ||||
| "./../g" = "http://a/b/g" | "./../g" = "http://a/b/g" | |||
| "./g/." = "http://a/b/c/g/" | "./g/." = "http://a/b/c/g/" | |||
| "g/./h" = "http://a/b/c/g/h" | "g/./h" = "http://a/b/c/g/h" | |||
| "g/../h" = "http://a/b/c/h" | "g/../h" = "http://a/b/c/h" | |||
| "g;x=1/./y" = "http://a/b/c/g;x=1/y" | "g;x=1/./y" = "http://a/b/c/g;x=1/y" | |||
| "g;x=1/../y" = "http://a/b/c/y" | "g;x=1/../y" = "http://a/b/c/y" | |||
| Some applications fail to separate the reference's query and/or | Some applications fail to separate the reference's query and/or | |||
| fragment components from a relative path before merging it with the | fragment components from a relative path before merging it with the | |||
| base path and removing dot-segments. This error is rarely noticed, | base path and removing dot-segments. This error is rarely noticed, | |||
| since typical usage of a fragment never includes the hierarchy ("/") | since typical usage of a fragment never includes the hierarchy ("/") | |||
| character, and the query component is not normally used within | character, and the query component is not normally used within | |||
| relative references. | relative references. | |||
| "g?y/./x" = "http://a/b/c/g?y/./x" | "g?y/./x" = "http://a/b/c/g?y/./x" | |||
| "g?y/../x" = "http://a/b/c/g?y/../x" | "g?y/../x" = "http://a/b/c/g?y/../x" | |||
| "g#s/./x" = "http://a/b/c/g#s/./x" | "g#s/./x" = "http://a/b/c/g#s/./x" | |||
| "g#s/../x" = "http://a/b/c/g#s/../x" | "g#s/../x" = "http://a/b/c/g#s/../x" | |||
| Some parsers allow the scheme name to be present in a relative URI if | Some parsers allow the scheme name to be present in a relative URI | |||
| it is the same as the base URI scheme. This is considered to be a | reference if it is the same as the base URI scheme. This is | |||
| loophole in prior specifications of partial URI [RFC1630]. Its use | considered to be a loophole in prior specifications of partial URI | |||
| should be avoided, but is allowed for backward compatibility. | [RFC1630]. Its use should be avoided, but is allowed for backward | |||
| compatibility. | ||||
| "http:g" = "http:g" ; for strict parsers | "http:g" = "http:g" ; for strict parsers | |||
| / "http://a/b/c/g" ; for backward compatibility | / "http://a/b/c/g" ; for backward compatibility | |||
| 6. Normalization and Comparison | 6. Normalization and Comparison | |||
| One of the most common operations on URIs is simple comparison: | One of the most common operations on URIs is simple comparison: | |||
| determining if two URIs are equivalent without using the URIs to | determining if two URIs are equivalent without using the URIs to | |||
| access their respective resource(s). A comparison is performed every | access their respective resource(s). A comparison is performed every | |||
| time a response cache is accessed, a browser checks its history to | time a response cache is accessed, a browser checks its history to | |||
| skipping to change at page 35, line 29 ¶ | skipping to change at page 35, line 29 ¶ | |||
| spent in reducing duplicate identifiers. This section describes a | spent in reducing duplicate identifiers. This section describes a | |||
| variety of methods that may be used to compare URIs, the trade-offs | variety of methods that may be used to compare URIs, the trade-offs | |||
| between them, and the types of applications that might use them. | between them, and the types of applications that might use them. | |||
| 6.1 Equivalence | 6.1 Equivalence | |||
| Since URIs exist to identify resources, presumably they should be | Since URIs exist to identify resources, presumably they should be | |||
| considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
| such a definition of equivalence is not of much practical use, since | such a definition of equivalence is not of much practical use, since | |||
| there is no way for software to compare two resources without | there is no way for software to compare two resources without | |||
| knowledge of their origin. For this reason, determination of | knowledge of the implementation-specific syntax of each URI's | |||
| dereferencing algorithm. For this reason, determination of | ||||
| equivalence or difference of URIs is based on string comparison, | equivalence or difference of URIs is based on string comparison, | |||
| perhaps augmented by reference to additional rules provided by URI | perhaps augmented by reference to additional rules provided by URI | |||
| scheme definitions. We use the terms "different" and "equivalent" to | scheme definitions. We use the terms "different" and "equivalent" to | |||
| describe the possible outcomes of such comparisons, but there are | describe the possible outcomes of such comparisons, but there are | |||
| many application-dependent versions of equivalence. | many application-dependent versions of equivalence. | |||
| Even though it is possible to determine that two URIs are equivalent, | Even though it is possible to determine that two URIs are equivalent, | |||
| it is never possible to be sure that two URIs identify different | it is never possible to be sure that two URIs identify different | |||
| resources. Therefore, comparison methods are designed to minimize | resources. For example, an owner of two different domain names could | |||
| false negatives while strictly avoiding false positives. | decide to serve the same resource from both, resulting in two | |||
| different URIs. Therefore, comparison methods are designed to | ||||
| minimize false negatives while strictly avoiding false positives. | ||||
| In testing for equivalence, it is generally unwise to directly | In testing for equivalence, applications should not directly compare | |||
| compare relative URI references; they should be converted to their | relative URI references; the references should be converted to their | |||
| absolute forms before comparison. Furthermore, when URI references | target URI forms before comparison. When URIs are being compared for | |||
| are being compared for the purpose of selecting (or avoiding) a | the purpose of selecting (or avoiding) a network action, such as | |||
| network action, such as retrieval of a representation, it is often | retrieval of a representation, the fragment components (if any) | |||
| necessary to remove fragment identifiers from the URIs prior to | should be excluded from the comparison. | |||
| comparison. | ||||
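
As a rough illustration of the last paragraph, the following Python sketch resolves two references against a base and drops any fragment before comparing them. It relies on the standard library's urljoin and urldefrag; the function name same_for_retrieval is hypothetical, and plain string equality after resolution is only the cheapest rung of the comparison ladder described in Section 6.2.

```python
from urllib.parse import urldefrag, urljoin


def same_for_retrieval(base: str, ref_a: str, ref_b: str) -> bool:
    """Compare two references for the purpose of a network action."""
    # Convert each reference to its absolute (target) form first.
    abs_a = urljoin(base, ref_a)
    abs_b = urljoin(base, ref_b)
    # Exclude the fragment component, if any, from the comparison.
    url_a, _fragment_a = urldefrag(abs_a)
    url_b, _fragment_b = urldefrag(abs_b)
    return url_a == url_b
```

For the base http://a/b/c/d;p?q used in Section 5.4, the references "g#s" and "./g" compare as equivalent under this sketch, while "g" and "../g" do not.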
| 6.2 Comparison Ladder | 6.2 Comparison Ladder | |||
| A variety of methods are used in practice to test URI equivalence. | A variety of methods are used in practice to test URI equivalence. | |||
| These methods fall into a range, distinguished by the amount of | These methods fall into a range, distinguished by the amount of | |||
| processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
| negatives is reduced. As noted above, false negatives cannot in | negatives is reduced. As noted above, false negatives cannot in | |||
| principle be eliminated. In practice, their probability can be | principle be eliminated. In practice, their probability can be | |||
| reduced, but this reduction requires more processing and is not | reduced, but this reduction requires more processing and is not | |||
| cost-effective for all applications. | cost-effective for all applications. | |||
| If this range of comparison practices is considered as a ladder, the | If this range of comparison practices is considered as a ladder, the | |||
| following discussion will climb the ladder, starting with those that | following discussion will climb the ladder, starting with those | |||
| are cheap but have a relatively higher chance of producing false | practices that are cheap but have a relatively higher chance of | |||
| negatives, and proceeding to those that have higher computational | producing false negatives, and proceeding to those that have higher | |||
| cost and lower risk of false negatives. | computational cost and lower risk of false negatives. | |||
| 6.2.1 Simple String Comparison | 6.2.1 Simple String Comparison | |||
| If two URIs, considered as character strings, are identical, then it | If two URIs, considered as character strings, are identical, then it | |||
| is safe to conclude that they are equivalent. This type of | is safe to conclude that they are equivalent. This type of | |||
| equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
| in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing. | |||
| Testing strings for equivalence requires some basic precautions. This | Testing strings for equivalence requires some basic precautions. This | |||
| procedure is often referred to as "bit-for-bit" or "byte-for-byte" | procedure is often referred to as "bit-for-bit" or "byte-for-byte" | |||
| comparison, which is potentially misleading. Testing of strings for | comparison, which is potentially misleading. Testing of strings for | |||
| equality is normally based on pairwise comparison of the characters | equality is normally based on pairwise comparison of the characters | |||
| that make up the strings, starting from the first and proceeding | that make up the strings, starting from the first and proceeding | |||
| until both strings are exhausted and all characters found to be | until both strings are exhausted and all characters found to be | |||
| equal, a pair of characters compares unequal, or one of the strings | equal, a pair of characters compares unequal, or one of the strings | |||
| is exhausted before the other. | is exhausted before the other. | |||
| Such character comparisons require that each pair of characters be | Such character comparisons require that each pair of characters be | |||
| put in comparable form. For example, should one URI be stored in a | put in comparable form. For example, should one URI be stored in a | |||
| byte array in EBCDIC encoding, and the second be in a Java String | byte array in EBCDIC encoding, and the second be in a Java String | |||
| object, bit-for-bit comparisons applied naively will produce both | object (UTF-16), bit-for-bit comparisons applied naively will produce | |||
| false-positive and false-negative errors. Thus, in principle, it is | both false-positive and false-negative errors. It is better to speak | |||
| better to speak of equality on a character-for-character rather than | of equality on a character-for-character rather than byte-for-byte or | |||
| byte-for-byte or bit-for-bit basis. | bit-for-bit basis. In practical terms, character-by-character | |||
| comparisons should be done codepoint-by-codepoint after conversion to | ||||
| Unicode defines a character as being identified by number | a common character encoding. | |||
| ("codepoint") with an associated bundle of visual and other | ||||
| semantics. At the software level, it is not practical to compare | ||||
| semantic bundles, so in practical terms, character-by-character | ||||
| comparisons are done codepoint-by-codepoint. | ||||
| 6.2.2 Syntax-based Normalization | 6.2.2 Syntax-based Normalization | |||
| Software may use logic based on the definitions provided by this | Software may use logic based on the definitions provided by this | |||
| specification to reduce the probability of false negatives. Such | specification to reduce the probability of false negatives. Such | |||
| processing is moderately higher in cost than character-for-character | processing is moderately higher in cost than character-for-character | |||
| string comparison. For example, an application using this approach | string comparison. For example, an application using this approach | |||
| could reasonably consider the following two URIs equivalent: | could reasonably consider the following two URIs equivalent: | |||
| example://a/b/c/%7A | example://a/b/c/%7Bfoo%7D | |||
| eXAMPLE://a/./b/../b/c/%7a | eXAMPLE://a/./b/../b/%63/%7bfoo%7d | |||
| Web user agents, such as browsers, typically apply this type of URI | Web user agents, such as browsers, typically apply this type of URI | |||
| normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
| available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
| case normalization, escape normalization, and removal of | case normalization, encoding normalization, empty-component | |||
| dot-segments. | normalization, and removal of dot-segments. | |||
| 6.2.2.1 Case Normalization | 6.2.2.1 Case Normalization | |||
| When a URI scheme uses components of the generic syntax, it will also | When a URI scheme uses components of the generic syntax, it will also | |||
| use the common syntax equivalence rules, namely that the scheme and | use the common syntax equivalence rules, namely that the scheme and | |||
| hostname are case insensitive and therefore can be normalized to | host are case-insensitive and therefore should be normalized to | |||
| lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | |||
| equivalent to <http://www.example.com/>. | equivalent to <http://www.example.com/>. Applications should not | |||
| assume anything about the case sensitivity of other URI components, | ||||
| since that is dependent on the implementation used to handle a | ||||
| dereference. | ||||
| 6.2.2.2 Escape Normalization | The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" | |||
| versus "%3A") are case-insensitive and therefore should be normalized | ||||
| to use uppercase letters for the digits A-F. | ||||
| The percent-escape mechanism described in Section 2.4 is a frequent | 6.2.2.2 Encoding Normalization | |||
| source of variance among otherwise identical URIs. One cause is the | ||||
| choice of uppercase or lowercase letters for the hexadecimal digits | ||||
| within the escape sequence (e.g., "%3a" versus "%3A"). Such sequences | ||||
| are always equivalent; for the sake of uniformity, URI generators and | ||||
| normalizers are strongly encouraged to use uppercase letters for the | ||||
| hex digits A-F. | ||||
| Only characters that are excluded from or reserved within the URI | The percent-encoding mechanism (Section 2.1) is a frequent source of | |||
| syntax must be escaped when used as data. However, some URI | variance among otherwise identical URIs. In addition to the | |||
| generators go beyond that and escape characters that do not require | case-insensitivity issue noted above, some URI producers | |||
| escaping, resulting in URIs that are equivalent to their unescaped | percent-encode octets that do not require percent-encoding, resulting | |||
| counterparts. Such URIs can be normalized by unescaping sequences | in URIs that are equivalent to their non-encoded counterparts. Such | |||
| that represent the unreserved characters, as described in Section | URIs should be normalized by decoding any percent-encoded octet that | |||
| 2.3. | corresponds to an unreserved character, as described in Section 2.3. | |||
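
The two percent-encoding rules (uppercase hex digits, decode unreserved characters) can be sketched in Python as below. The function name is hypothetical, and the unreserved set used here (letters, digits, "-", ".", "_", "~") is an assumption; Section 2.3 of the draft is the authority on which characters qualify.

```python
import re
import string

# Assumed unreserved set; check against Section 2.3 before relying on it.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(text: str) -> str:
    """Decode triplets for unreserved characters; uppercase the rest."""
    def fix(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                       # needless encoding: decode it
        return "%" + match.group(1).upper()   # keep encoded, uppercase hex
    return _TRIPLET.sub(fix, text)
```

For the path of the second example URI in Section 6.2.2, "/a/./b/../b/%63/%7bfoo%7d", this produces "/a/./b/../b/c/%7Bfoo%7D"; removing the dot-segments is a separate step.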
| 6.2.2.3 Path Segment Normalization | 6.2.2.3 Empty-component Normalization | |||
| Components of the generic URI syntax are delimited from other | ||||
| components by optional separators. For example, a query component is | ||||
| separated from the path by a question mark ("?") and a port | ||||
| sub-component is separated from host by a colon (":"). A URI in | ||||
| which a delimiter is present and the (sub-)component it delimits is | ||||
| empty is equivalent to the same URI without that delimiter. For | ||||
| example, the following are all equivalent: | ||||
| ftp://example.com/ | ||||
| ftp://example.com:/ | ||||
| ftp://@example.com:/ | ||||
| ftp://@example.com:/? | ||||
| ftp://@example.com:/?# | ||||
| URI producers and normalizers should omit a delimiter if the | ||||
| component it delimits is empty, as exemplified by the first URI | ||||
| above, with one exception: a double-slash delimiter indicating an | ||||
| authority component should not be removed, even when the authority is | ||||
| empty, since doing so can lead to misinterpreting the path. | ||||
| 6.2.2.4 Path Segment Normalization | ||||
| The complete path segments "." and ".." have a special meaning within | The complete path segments "." and ".." have a special meaning within | |||
| hierarchical URI schemes. As such, they should not appear in | hierarchical URI schemes. As such, they should not appear in | |||
| absolute paths; if they are found, they can be removed by applying | absolute paths; if they are found, they can be removed by applying | |||
| the remove_dot_segments algorithm to the path, as described in | the remove_dot_segments algorithm to the path, as described in | |||
| Section 5.2. | Section 5.2. | |||
| 6.2.3 Scheme-based Normalization | 6.2.3 Scheme-based Normalization | |||
| The syntax and semantics of URIs vary from scheme to scheme, as | The syntax and semantics of URIs vary from scheme to scheme, as | |||
| described by the defining specification for each scheme. Software | described by the defining specification for each scheme. Software | |||
| may use scheme-specific rules, at further processing cost, to reduce | may use scheme-specific rules, at further processing cost, to reduce | |||
| the probability of false negatives. For example, Web spiders that | the probability of false negatives. For example, since the "http" | |||
| populate most large search engines would consider the following two | scheme makes use of an authority component, has a default port of | |||
| URIs to be equivalent: | "80", and defines an empty path to be equivalent to "/", the | |||
| following four URIs are equivalent: | ||||
| http://example.com | ||||
| http://example.com/ | http://example.com/ | |||
| http://example.com:/ | ||||
| http://example.com:80/ | http://example.com:80/ | |||
| This behavior is based on the rules provided by the syntax and | In general, a URI that uses the generic syntax for authority with an | |||
| semantics of the "http" URI scheme, which defines an empty port | empty path should be normalized to a path of "/"; likewise, an | |||
| component as being equivalent to the default TCP port for HTTP (port | explicit ":port", where the port is empty or the default for the | |||
| 80). In general, a URI scheme that uses the generic syntax for | scheme, is equivalent to one where the port and its ":" delimiter are | |||
| authority is defined such that a URI with an explicit ":port", where | elided. In other words, the second of the above URI examples is the | |||
| the port is the default for the scheme, is equivalent to one where | normal form for the "http" scheme. | |||
| the port is elided. | ||||
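
The "http" rules above might be sketched in Python as follows. The function name normalize_http and the decision to leave non-http URIs untouched are assumptions made for the example; the sketch covers only the lowercase scheme and host, the empty or default port, and the empty path, and ignores every other normalization step.

```python
from urllib.parse import urlsplit, urlunsplit

HTTP_DEFAULT_PORT = "80"


def normalize_http(uri: str) -> str:
    """Apply the scheme-based rules for "http" described in the text."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme != "http":
        return uri  # other schemes define their own rules

    # Split the authority by hand so an empty port ("host:") is visible.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    head, colon, tail = hostport.rpartition(":")
    if colon and (tail == "" or tail.isdigit()):
        host, port = head, tail
    else:
        host, port = hostport, ""   # no port, or an IP literal in brackets

    authority = userinfo + at + host.lower()
    if port not in ("", HTTP_DEFAULT_PORT):
        authority += ":" + port     # keep only a non-default port

    path = parts.path or "/"        # empty path is equivalent to "/"
    return urlunsplit((scheme, authority, path, parts.query, parts.fragment))
```

Under these assumptions, all four example URIs listed in the -04 column normalize to http://example.com/, whereas a URI with port 8080 keeps its port.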
| Another case where normalization varies by scheme is in the handling | ||||
| of an empty authority component. For many scheme specifications, an | ||||
| empty authority is considered an error; for others, it is considered | ||||
| equivalent to "localhost". For the sake of uniformity, future scheme | ||||
| specifications should define an empty authority as being equivalent | ||||
| to "localhost", and URI producers and normalizers should use | ||||
| "localhost" instead of an empty authority. | ||||
| 6.2.4 Protocol-based Normalization | 6.2.4 Protocol-based Normalization | |||
| Web spiders, for which substantial effort to reduce the incidence of | Web spiders, for which substantial effort to reduce the incidence of | |||
| false negatives is often cost-effective, are observed to implement | false negatives is often cost-effective, are observed to implement | |||
| even more aggressive techniques in URI comparison. For example, if | even more aggressive techniques in URI comparison. For example, if | |||
| they observe that a URI such as | they observe that a URI such as | |||
| http://example.com/data | http://example.com/data | |||
| redirects to a URI differing only in the trailing slash | redirects to a URI differing only in the trailing slash | |||
| http://example.com/data/ | http://example.com/data/ | |||
| they will likely regard the two as equivalent in the future. | they will likely regard the two as equivalent in the future. This | |||
| Obviously, this kind of technique is only appropriate in special | kind of technique is only appropriate when equivalence is clearly | |||
| situations. | indicated by both the result of accessing the resources and the | |||
| common conventions of their scheme's dereference algorithm (in this | ||||
| case, use of redirection by HTTP origin servers to avoid problems | ||||
| with relative references). | ||||
| 6.3 Canonical Form | 6.3 Canonical Form | |||
| It is in the best interests of everyone to avoid false-negatives in | It is in the best interests of everyone to avoid false-negatives in | |||
| comparing URIs and to minimize the amount of software processing for | comparing URIs and to minimize the amount of software processing for | |||
| such comparisons. Those who generate and make reference to URIs can | such comparisons. Those who produce and make reference to URIs can | |||
| reduce the cost of processing and the risk of false negatives by | reduce the cost of processing and the risk of false negatives by | |||
| consistently providing them in a form that is reasonably canonical | consistently providing them in a form that is reasonably canonical | |||
| with respect to their scheme. Specifically: | with respect to their scheme. Specifically: | |||
| o Always provide the URI scheme in lowercase characters. | o Always provide the URI scheme in lowercase characters. | |||
| o Always provide the hostname, if any, in lowercase characters. | o Always provide the host, if any, in lowercase characters. | |||
| o Only perform percent-escaping where it is essential. | o Only perform percent-encoding where it is essential. | |||
| o Always use uppercase A-through-F characters when percent-escaping. | o Always use uppercase A-through-F characters when percent-encoding. | |||
| o Prevent /./ and /../ from appearing in non-relative URI paths. | o Prevent /./ and /../ from appearing in non-relative URI paths. | |||
| The good practices listed above are motivated by deployed software | o Omit delimiters when their associated (sub-)component is empty. | |||
| that frequently use these techniques for the purposes of | ||||
| normalization. | o For schemes that define an empty authority to be equivalent to | |||
| "localhost", use "localhost". | ||||
| o For schemes that define an empty path to be equivalent to a path | ||||
| of "/", use "/". | ||||
| 7. Security Considerations | 7. Security Considerations | |||
| A URI does not in itself pose a security threat. However, since URIs | A URI does not in itself pose a security threat. However, since URIs | |||
| are often used to provide a compact set of instructions for access to | are often used to provide a compact set of instructions for access to | |||
| network resources, care must be taken to properly interpret the data | network resources, care must be taken to properly interpret the data | |||
| within a URI, to prevent that data from causing unintended access, | within a URI, to prevent that data from causing unintended access, | |||
| and to avoid including data that should not be revealed in plain | and to avoid including data that should not be revealed in plain | |||
| text. | text. | |||
| skipping to change at page 40, line 36 ¶ | skipping to change at page 41, line 36 ¶ | |||
| scheme. | scheme. | |||
| 7.2 Malicious Construction | 7.2 Malicious Construction | |||
| It is sometimes possible to construct a URI such that an attempt to | It is sometimes possible to construct a URI such that an attempt to | |||
| perform a seemingly harmless, idempotent operation, such as the | perform a seemingly harmless, idempotent operation, such as the | |||
| retrieval of a representation, will in fact cause a possibly damaging | retrieval of a representation, will in fact cause a possibly damaging | |||
| remote operation to occur. The unsafe URI is typically constructed | remote operation to occur. The unsafe URI is typically constructed | |||
| by specifying a port number other than that reserved for the network | by specifying a port number other than that reserved for the network | |||
| protocol in question. The client unwittingly contacts a site that is | protocol in question. The client unwittingly contacts a site that is | |||
| running a different protocol service. The content of the URI | running a different protocol service and data within the URI contains | |||
| contains instructions that, when interpreted according to this other | instructions that, when interpreted according to this other protocol, | |||
| protocol, cause an unexpected operation. An example has been the use | cause an unexpected operation. A frequent example of such abuse has | |||
| of a gopher URI to cause an unintended or impersonating message to be | been the use of a protocol-based scheme with a port component of | |||
| sent via a SMTP server. | "25", thereby fooling user agent software into sending an unintended | |||
| or impersonating message via an SMTP server. | ||||
| Caution should be used when dereferencing a URI that specifies a TCP | Applications should prevent dereference of a URI that specifies a TCP | |||
| port number other than the default for the scheme, especially when it | port number within the "well-known port" range (0 - 1023) unless the | |||
| is a number within the reserved space. | protocol being used to dereference that URI is compatible with the | |||
| protocol expected on that well-known port. Although IANA maintains a | ||||
| registry of well-known ports, applications should make such | ||||
| restrictions user-configurable to avoid preventing the deployment of | ||||
| new services. | ||||
| Care should be taken when a URI contains escaped delimiters for a | When a URI contains percent-encoded octets that match the delimiters | |||
| given protocol (for example, CR and LF characters for telnet | for a given resolution or dereference protocol (for example, CR and | |||
| protocols) that these octets are not unescaped before transmission. | LF characters for the TELNET protocol), such percent-encoded octets | |||
| This might violate the protocol, but avoids the potential for such | must not be decoded before transmission across that protocol. | |||
| characters to be used to simulate an extra operation or parameter in | Transfer of the percent-encoding, which might violate the protocol, | |||
| that protocol which might lead to an unexpected and possibly harmful | is less harmful than allowing decoded octets to be interpreted as | |||
| remote operation being performed. | additional operations or parameters, perhaps triggering an unexpected | |||
| and possibly harmful remote operation. | ||||
| 7.3 Rare IP Address Formats | 7.3 Back-end Transcoding | |||
| When a URI is dereferenced, the data within it is often parsed by | ||||
| both the user agent and one or more servers. In HTTP, for example, a | ||||
| typical user agent will parse a URI into its five major components, | ||||
| access the authority's server, and send it the data within the | ||||
| authority, path, and query components. A typical server will take | ||||
| that information, parse the path into segments and the query into | ||||
| key/value pairs, and then invoke implementation-specific handlers to | ||||
| respond to the request. As a result, a common security concern for | ||||
| server implementations that handle a URI, either as a whole or split | ||||
| into separate components, is proper interpretation of the octet data | ||||
| represented by the characters and percent-encodings within that URI. | ||||
| Percent-encoded octets must be decoded at some point during the | ||||
| dereference process. Applications must split the URI into its | ||||
| components and sub-components prior to decoding the octets, since | ||||
| otherwise the decoded octets might be mistaken for delimiters. | ||||
| Security checks of the data within a URI should be applied after | ||||
| decoding the octets. Note, however, that the "%00" percent-encoding | ||||
| (NUL) may require special handling and should be rejected if the | ||||
| application is not expecting to receive raw data within a component. | ||||
| Special care should be taken when the URI path interpretation process | ||||
| involves the use of a back-end filesystem or related system | ||||
| functions. Filesystems typically assign an operational meaning to | ||||
| special characters, such as the "/", "\", ":", "[", and "]" | ||||
| characters, and special device names like ".", "..", "...", "aux", | ||||
| "lpt", etc. In some cases, merely testing for the existence of such a | ||||
| name will cause the operating system to pause or invoke unrelated | ||||
| system calls, leading to significant security concerns regarding | ||||
| denial of service and unintended data transfer. It would be | ||||
| impossible for this specification to list all such significant | ||||
| characters and device names; implementers should research the | ||||
| reserved names and characters for the types of storage device that | ||||
| may be attached to their application and restrict the use of data | ||||
| obtained from URI components accordingly. | ||||
| 7.4 Rare IP Address Formats | ||||
| Although the URI syntax for IPv4address only allows the common, | Although the URI syntax for IPv4address only allows the common, | |||
| dotted-decimal form of IPv4 address literal, many implementations | dotted-decimal form of IPv4 address literal, many implementations | |||
| that process URIs make use of platform-dependent system routines, | that process URIs make use of platform-dependent system routines, | |||
| such as gethostbyname() and inet_aton(), to translate the string | such as gethostbyname() and inet_aton(), to translate the string | |||
| literal to an actual IP address. Unfortunately, such system routines | literal to an actual IP address. Unfortunately, such system routines | |||
| often allow and process a much larger set of formats than those | often allow and process a much larger set of formats than those | |||
| described in Section 3.2.2. | described in Section 3.2.2. | |||
| For example, many implementations allow dotted forms of three | For example, many implementations allow dotted forms of three | |||
| skipping to change at page 41, line 32 ¶ | skipping to change at page 43, line 32 ¶ | |||
| directly in the network address. Adding further to the confusion, | directly in the network address. Adding further to the confusion, | |||
| some implementations allow each dotted part to be interpreted as | some implementations allow each dotted part to be interpreted as | |||
| decimal, octal, or hexadecimal, as specified in the C language (i.e., | decimal, octal, or hexadecimal, as specified in the C language (i.e., | |||
| a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 | a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 | |||
| implies octal; otherwise, the number is interpreted as decimal). | implies octal; otherwise, the number is interpreted as decimal). | |||
| These additional IP address formats are not allowed in the URI syntax | These additional IP address formats are not allowed in the URI syntax | |||
| due to differences between platform implementations. However, they | due to differences between platform implementations. However, they | |||
| can become a security concern if an application attempts to filter | can become a security concern if an application attempts to filter | |||
| access to resources based on the IP address in string literal format. | access to resources based on the IP address in string literal format. | |||
| If such filtering is performed, it is recommended that literals be | If such filtering is performed, literals should be converted to | |||
| converted to numeric form and filtered based on the numeric value, | numeric form and filtered based on the numeric value, rather than a | |||
| rather than a prefix or suffix of the string form. | prefix or suffix of the string form. | |||
| 7.4 Sensitive Information | 7.5 Sensitive Information | |||
| It is clearly unwise to use a URI that contains a password which is | URI producers should not provide a URI that contains a username or | |||
| intended to be secret. In particular, the use of a password within | password which is intended to be secret: URIs are frequently | |||
| the userinfo component of a URI is strongly discouraged except in | displayed by browsers, stored in clear text bookmarks, and logged by | |||
| those rare cases where the 'password' parameter is intended to be | user agent history and intermediary applications (proxies). A | |||
| public. | password appearing within the userinfo component is deprecated and | |||
| should be considered an error (or simply ignored) except in those | ||||
| rare cases where the 'password' parameter is intended to be public. | ||||
| 7.5 Semantic Attacks | 7.6 Semantic Attacks | |||
| Because the userinfo component is rarely used and appears before the | Because the userinfo sub-component is rarely used and appears before | |||
| hostname in the authority component, it can be used to construct a | the host in the authority component, it can be used to construct a | |||
| URI that is intended to mislead a human user by appearing to identify | URI that is intended to mislead a human user by appearing to identify | |||
| one (trusted) naming authority while actually identifying a different | one (trusted) naming authority while actually identifying a different | |||
| authority hidden behind the noise. For example | authority hidden behind the noise. For example | |||
| http://www.example.com&story=breaking_news@10.0.0.1/top_story.htm | ftp://ftp.example.com&story=breaking_news@10.0.0.1/top_story.htm | |||
| might lead a human user to assume that the host is 'www.example.com', | might lead a human user to assume that the host is | |||
| whereas it is actually '10.0.0.1'. Note that the misleading userinfo | 'ftp.example.com', whereas it is actually '10.0.0.1'. Note that | |||
| could be much longer than the example above. | a misleading userinfo sub-component could be much longer than the | |||
| example above. | ||||
| A misleading URI, such as the one above, is an attack on the user's | A misleading URI, such as the one above, is an attack on the user's | |||
| preconceived notions about the meaning of a URI, rather than an | preconceived notions about the meaning of a URI, rather than an | |||
| attack on the software itself. User agents may be able to reduce the | attack on the software itself. User agents may be able to reduce the | |||
| impact of such attacks by visually distinguishing the various | impact of such attacks by distinguishing the various components of | |||
| components of the URI when rendered, such as by using a different | the URI when rendered, such as by using a different color or tone to | |||
| color or tone to render userinfo if any is present, though there is | render userinfo if any is present, though there is no general | |||
| no general panacea. More information on URI-based semantic attacks | panacea. More information on URI-based semantic attacks can be found | |||
| can be found in [Siedzik]. | in [Siedzik]. | |||
| 8. Acknowledgments | 8. Acknowledgments | |||
| This specification is derived from RFC 2396 [RFC2396], RFC 1808 | This specification is derived from RFC 2396 [RFC2396], RFC 1808 | |||
| [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those | [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those | |||
| documents still apply. It also incorporates the update (with | documents still apply. It also incorporates the update (with | |||
| corrections) for IPv6 literals in the host syntax, as defined by | corrections) for IPv6 literals in the host syntax, as defined by | |||
| Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | |||
| [RFC2732]. In addition, contributions by Reese Anschultz, Tim Bray, | [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | |||
| Rob Cameron, Dan Connolly, Adam M. Costello, John Cowan, Jason | Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | |||
| Diamond, Martin Duerst, Stefan Eissing, Clive D.W. Feather, Pat | Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | |||
| Hayes, Henry Holtzman, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew | Duerst, Stefan Eissing, Clive D.W. Feather, Tony Hammond, Pat Hayes, | |||
| Main, Michael Mealling, Julian Reschke, Tomas Rokicki, Miles Sabin, | Henry Holtzman, Ian B. Jacobs, Michael Kay, John C. Klensin, Graham | |||
| Ronald Tschalaer, Marc Warne, Stuart Williams, and Henry Zongaro are | Klyne, Dan Kohn, Bruce Lilly, Andrew Main, Ira McDonald, Michael | |||
| gratefully acknowledged. | Mealling, Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, | |||
| Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, Stuart | ||||
| Williams, and Henry Zongaro are gratefully acknowledged. | ||||
| Normative References | Normative References | |||
| [ASCII] American National Standards Institute, "Coded Character | [ASCII] American National Standards Institute, "Coded Character | |||
| Set -- 7-bit American Standard Code for Information | Set -- 7-bit American Standard Code for Information | |||
| Interchange", ANSI X3.4, 1986. | Interchange", ANSI X3.4, 1986. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | ||||
| 10646", STD 63, RFC 3629, November 2003. | ||||
| Informative References | Informative References | |||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | |||
| Languages", BCP 18, RFC 2277, January 1998. | host table specification", RFC 952, October 1985. | |||
| [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | ||||
| STD 13, RFC 1034, November 1987. | ||||
| [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | ||||
| and Support", STD 3, RFC 1123, October 1989. | ||||
| [RFC1535] Gavron, E., "A Security Problem and Proposed Correction | ||||
| With Widely Deployed DNS Software", RFC 1535, October | ||||
| 1993. | ||||
| [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | |||
| Unifying Syntax for the Expression of Names and Addresses | Unifying Syntax for the Expression of Names and Addresses | |||
| of Objects on the Network as used in the World-Wide Web", | of Objects on the Network as used in the World-Wide Web", | |||
| RFC 1630, June 1994. | RFC 1630, June 1994. | |||
| [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | [RFC1736] Kunze, J., "Functional Recommendations for Internet | |||
| Resource Locators (URL)", RFC 1738, December 1994. | Resource Locators", RFC 1736, February 1995. | |||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for | |||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | Uniform Resource Names", RFC 1737, December 1994. | |||
| August 1998. | ||||
| [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | |||
| and Support", STD 3, RFC 1123, October 1989. | Resource Locators (URL)", RFC 1738, December 1994. | |||
| [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC | [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC | |||
| 1808, June 1995. | 1808, June 1995. | |||
| [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |||
| Extensions (MIME) Part Two: Media Types", RFC 2046, | Extensions (MIME) Part Two: Media Types", RFC 2046, | |||
| November 1996. | November 1996. | |||
| [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of | ||||
| Aggregate Documents, such as HTML (MHTML)", RFC 2110, | ||||
| March 1997. | ||||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | ||||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | ||||
| Languages", BCP 18, RFC 2277, January 1998. | ||||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | ||||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | ||||
| August 1998. | ||||
| [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | |||
| Jensen, "HTTP Extensions for Distributed Authoring -- | Jensen, "HTTP Extensions for Distributed Authoring -- | |||
| WEBDAV", RFC 2518, February 1999. | WEBDAV", RFC 2518, February 1999. | |||
| [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | |||
| host table specification", RFC 952, October 1985. | Scheme Names", BCP 35, RFC 2717, November 1999. | |||
| [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | |||
| (IPv6) Addressing Architecture", RFC 3513, April 2003. | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | |||
| [RFC1736] Kunze, J., "Functional Recommendations for Internet | [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration | |||
| Resource Locators", RFC 1736, February 1995. | Procedures", BCP 19, RFC 2978, October 2000. | |||
| [RFC1737] Masinter, L. and K. Sollins, "Functional Requirements for | ||||
| Uniform Resource Names", RFC 1737, December 1994. | ||||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | ||||
| [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint W3C/ | |||
| STD 13, RFC 1034, November 1987. | IETF URI Planning Interest Group: Uniform Resource | |||
| Identifiers (URIs), URLs, and Uniform Resource Names | ||||
| (URNs): Clarifications and Recommendations", RFC 3305, | ||||
| August 2002. | ||||
| [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of | [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| Aggregate Documents, such as HTML (MHTML)", RFC 2110, | "Internationalizing Domain Names in Applications (IDNA)", | |||
| March 1997. | RFC 3490, March 2003. | |||
| [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | |||
| Scheme Names", BCP 35, RFC 2717, November 1999. | (IPv6) Addressing Architecture", RFC 3513, April 2003. | |||
| [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April | [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April | |||
| 2001. | 2001, <http://www.giac.org/practical/gsec/ | |||
| Richard_Siedzik_GSEC.pdf>. | ||||
| [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO | ||||
| 10646", RFC 2279, January 1998. | ||||
| Authors' Addresses | Authors' Addresses | |||
| Tim Berners-Lee | Tim Berners-Lee | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| MIT/LCS, Room NE43-356 | MIT/LCS, Room NE43-356 | |||
| 200 Technology Square | 200 Technology Square | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| USA | USA | |||
| Phone: +1-617-253-5702 | Phone: +1-617-253-5702 | |||
| Fax: +1-617-258-5999 | Fax: +1-617-258-5999 | |||
| EMail: timbl@w3.org | EMail: timbl@w3.org | |||
| URI: http://www.w3.org/People/Berners-Lee/ | URI: http://www.w3.org/People/Berners-Lee/ | |||
| Roy T. Fielding | Roy T. Fielding | |||
| Day Software | Day Software | |||
| 2 Corporate Plaza, Suite 150 | 5251 California Ave., Suite 110 | |||
| Newport Beach, CA 92660 | Irvine, CA 92612-3074 | |||
| USA | USA | |||
| Phone: +1-949-999-2523 | Phone: +1-949-679-2960 | |||
| Fax: +1-949-644-5064 | Fax: +1-949-679-2972 | |||
| EMail: roy.fielding@day.com | EMail: fielding@gbiv.com | |||
| URI: http://www.apache.org/~fielding/ | URI: http://roy.gbiv.com/ | |||
| Larry Masinter | Larry Masinter | |||
| Adobe Systems Incorporated | Adobe Systems Incorporated | |||
| 345 Park Ave | 345 Park Ave | |||
| San Jose, CA 95110 | San Jose, CA 95110 | |||
| USA | USA | |||
| Phone: +1-408-536-3024 | Phone: +1-408-536-3024 | |||
| EMail: LMM@acm.org | EMail: LMM@acm.org | |||
| URI: http://larry.masinter.net/ | URI: http://larry.masinter.net/ | |||
| Appendix A. Collected ABNF for URI | Appendix A. Collected ABNF for URI | |||
| abs-path = "/" path-segments | URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment] | |||
| absolute-URI = scheme ":" hier-part [ "?" query ] | ||||
| alphanum = ALPHA / DIGIT | ||||
| authority = [ userinfo "@" ] host [ ":" port ] | URI-reference = URI / relative-URI | |||
| dec-octet = DIGIT ; 0-9 | relative-URI = ["//" authority] path ["?" query] ["#" fragment] | |||
| / %x31-39 DIGIT ; 10-99 | ||||
| / "1" 2DIGIT ; 100-199 | ||||
| / "2" %x30-34 DIGIT ; 200-249 | ||||
| / "25" %x30-35 ; 250-255 | ||||
| domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | absolute-URI = scheme ":" ["//" authority] path ["?" query] | |||
| escaped = "%" HEXDIG HEXDIG | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| fragment = *( pchar / "/" / "?" ) | authority = [ userinfo "@" ] host [ ":" port ] | |||
| userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | ||||
| host = IP-literal / IPv4address / reg-name | ||||
| port = *DIGIT | ||||
| h4 = 1*4HEXDIG | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| hier-part = net-path / abs-path / rel-path | IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| host = [ IPv6reference / IPv4address / hostname ] | IPv6address = 6( h16 ":" ) ls32 | |||
| / "::" 5( h16 ":" ) ls32 | ||||
| / [ h16 ] "::" 4( h16 ":" ) ls32 | ||||
| / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | ||||
| / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | ||||
| / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | ||||
| / [ *4( h16 ":" ) h16 ] "::" ls32 | ||||
| / [ *5( h16 ":" ) h16 ] "::" h16 | ||||
| / [ *6( h16 ":" ) h16 ] "::" | ||||
| hostname = domainlabel qualified | h16 = 1*4HEXDIG | |||
| ls32 = ( h16 ":" h16 ) / IPv4address | ||||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| IPv6address = 6( h4 ":" ) ls32 | dec-octet = DIGIT ; 0-9 | |||
| / "::" 5( h4 ":" ) ls32 | / %x31-39 DIGIT ; 10-99 | |||
| / [ h4 ] "::" 4( h4 ":" ) ls32 | / "1" 2DIGIT ; 100-199 | |||
| / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | / "25" %x30-35 ; 250-255 | |||
| / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | ||||
| / [ *4( h4 ":" ) h4 ] "::" ls32 | ||||
| / [ *5( h4 ":" ) h4 ] "::" h4 | ||||
| / [ *6( h4 ":" ) h4 ] "::" | ||||
| IPv6reference = "[" IPv6address "]" | ||||
| ls32 = ( h4 ":" h4 ) / IPv4address | ||||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")" | ||||
| net-path = "//" authority [ abs-path ] | ||||
| path-segments = segment *( "/" segment ) | ||||
| pchar = unreserved / escaped / ";" / | ||||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| port = *DIGIT | ||||
| qualified = *( "." domainlabel ) [ "." ] | ||||
| query = *( pchar / "/" / "?" ) | ||||
| rel-path = path-segments | ||||
| relative-URI = hier-part [ "?" query ] [ "#" fragment ] | ||||
| reserved = "/" / "?" / "#" / "[" / "]" / ";" / | ||||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | |||
| path = segment *( "/" segment ) | ||||
| segment = *pchar | segment = *pchar | |||
| unreserved = ALPHA / DIGIT / mark | query = *( pchar / "/" / "?" ) | |||
| fragment = *( pchar / "/" / "?" ) | ||||
| URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | ||||
| URI-reference = URI / relative-URI | pct-encoded = "%" HEXDIG HEXDIG | |||
| uric = reserved / unreserved / escaped | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |||
| userinfo = *( unreserved / escaped / ";" / | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | reserved = gen-delims / sub-delims | |||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | ||||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | ||||
| / "*" / "+" / "," / ";" / "=" | ||||
| Appendix B. Parsing a URI Reference with a Regular Expression | Appendix B. Parsing a URI Reference with a Regular Expression | |||
| Since the "first-match-wins" algorithm is identical to the "greedy" | Since the "first-match-wins" algorithm is identical to the "greedy" | |||
| disambiguation method used by POSIX regular expressions, it is | disambiguation method used by POSIX regular expressions, it is | |||
| natural and commonplace to use a regular expression for parsing the | natural and commonplace to use a regular expression for parsing the | |||
| potential five components of a URI reference. | potential five components of a URI reference. | |||
| The following line is the regular expression for breaking down a | The following line is the regular expression for breaking down a | |||
| well-formed URI reference into its components. | well-formed URI reference into its components. | |||
| skipping to change at page 51, line 15 ¶ | skipping to change at page 53, line 15 ¶ | |||
| Appendix C. Delimiting a URI in Context | Appendix C. Delimiting a URI in Context | |||
| URIs are often transmitted through formats that do not provide a | URIs are often transmitted through formats that do not provide a | |||
| clear context for their interpretation. For example, there are many | clear context for their interpretation. For example, there are many | |||
| occasions when a URI is included in plain text; examples include text | occasions when a URI is included in plain text; examples include text | |||
| sent in electronic mail, USENET news messages, and, most importantly, | sent in electronic mail, USENET news messages, and, most importantly, | |||
| printed on paper. In such cases, it is important to be able to | printed on paper. In such cases, it is important to be able to | |||
| delimit the URI from the rest of the text, and in particular from | delimit the URI from the rest of the text, and in particular from | |||
| punctuation marks that might be mistaken for part of the URI. | punctuation marks that might be mistaken for part of the URI. | |||
| In practice, URI are delimited in a variety of ways, but usually | In practice, URIs are delimited in a variety of ways, but usually | |||
| within double-quotes "http://example.com/", angle brackets <http:// | within double-quotes "http://example.com/", angle brackets <http:// | |||
| example.com/>, or just using whitespace | example.com/>, or just using whitespace | |||
| http://example.com/ | http://example.com/ | |||
| These wrappers do not form part of the URI. | These wrappers do not form part of the URI. | |||
| In the case where a fragment identifier is associated with a URI | ||||
| reference, the fragment would be placed within the brackets as well | ||||
| (separated from the URI with a "#" character). | ||||
| In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | |||
| need to be added to break a long URI across lines. The whitespace | need to be added to break a long URI across lines. The whitespace | |||
| should be ignored when extracting the URI. | should be ignored when extracting the URI. | |||
| No whitespace should be introduced after a hyphen ("-") character. | No whitespace should be introduced after a hyphen ("-") character. | |||
| Because some typesetters and printers may (erroneously) introduce a | Because some typesetters and printers may (erroneously) introduce a | |||
| hyphen at the end of line when breaking a line, the interpreter of a | hyphen at the end of line when breaking a line, the interpreter of a | |||
| URI containing a line break immediately after a hyphen should ignore | URI containing a line break immediately after a hyphen should ignore | |||
| all unescaped whitespace around the line break, and should be aware | all whitespace around the line break, and should be aware that the | |||
| that the hyphen may or may not actually be part of the URI. | hyphen may or may not actually be part of the URI. | |||
| Using <> angle brackets around each URI is especially recommended as | Using <> angle brackets around each URI is especially recommended as | |||
| a delimiting style for a URI that contains whitespace. | a delimiting style for a reference that contains embedded whitespace. | |||
| The prefix "URL:" (with or without a trailing space) was formerly | The prefix "URL:" (with or without a trailing space) was formerly | |||
| recommended as a way to help distinguish a URI from other bracketed | recommended as a way to help distinguish a URI from other bracketed | |||
| designators, though it is not commonly used in practice and is no | designators, though it is not commonly used in practice and is no | |||
| longer recommended. | longer recommended. | |||
| For robustness, software that accepts user-typed URIs should attempt | For robustness, software that accepts user-typed URIs should attempt | |||
| to recognize and strip both delimiters and embedded whitespace. | to recognize and strip both delimiters and embedded whitespace. | |||
| For example, the text: | For example, the text: | |||
| Yes, Jim, I found it under "http://www.w3.org/Addressing/", | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |||
| but you can probably pick it up from <ftp://ds.internic. | but you can probably pick it up from <ftp://foo.example. | |||
| net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |||
| ietf/uri/historical.html#WARNING>. | ietf/uri/historical.html#WARNING>. | |||
| contains the URI references | contains the URI references | |||
| http://www.w3.org/Addressing/ | http://www.w3.org/Addressing/ | |||
| ftp://ds.internic.net/rfc/ | ftp://foo.example.com/rfc/ | |||
| http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | |||
| Appendix D. Summary of Non-editorial Changes | Appendix D. Summary of Non-editorial Changes | |||
| D.1 Additions | D.1 Additions | |||
| IPv6 literals have been added to the list of possible identifiers for | IPv6 (and later) literals have been added to the list of possible | |||
| the host portion of an authority component, as described by [RFC2732], | identifiers for the host portion of an authority component, as | |||
| with the addition of "[" and "]" to the reserved and uric sets. | described by [RFC2732], with the addition of "[" and "]" to the | |||
| Square brackets are now specified as reserved within the authority | reserved set and a version flag to anticipate future versions of IP | |||
| component and not allowed outside their use as delimiters for an | literals. Square brackets are now specified as reserved within the | |||
| IPv6reference within host. In order to make this change without | authority component and not allowed outside their use as delimiters | |||
| for an IP literal within host. In order to make this change without | ||||
| changing the technical definition of the path, query, and fragment | changing the technical definition of the path, query, and fragment | |||
| components, those rules were redefined to directly specify the | components, those rules were redefined to directly specify the | |||
| characters allowed rather than be defined in terms of uric. | characters allowed rather than be defined in terms of uric. | |||
| Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | |||
| address, which unfortunately lacks an ABNF description of | address, which unfortunately lacks an ABNF description of | |||
| IPv6address, we created a new ABNF rule for IPv6address that matches | IPv6address, we created a new ABNF rule for IPv6address that matches | |||
| the text representations defined by Section 2.2 of [RFC3513]. | the text representations defined by Section 2.2 of [RFC3513]. | |||
| Likewise, the definition of IPv4address has been improved in order to | Likewise, the definition of IPv4address has been improved in order to | |||
| limit each decimal octet to the range 0-255, and the definition of | limit each decimal octet to the range 0-255. | |||
| hostname has been improved to better specify length limitations and | ||||
| partially-qualified domain names. | ||||
| Section 6 on URI normalization and comparison has been | Section 6 on URI normalization and comparison has been | |||
| completely rewritten and extended using input from Tim Bray and | completely rewritten and extended using input from Tim Bray and | |||
| discussion within the W3C Technical Architecture Group. Likewise, | discussion within the W3C Technical Architecture Group. | |||
| Section 2.1 on the encoding of characters has been replaced. | ||||
| An ABNF production for URI has been introduced to correspond to the | An ABNF rule for URI has been introduced to correspond to the common | |||
| common usage of the term: an absolute URI with optional fragment. | usage of the term: an absolute URI with optional fragment. | |||
| D.2 Modifications from RFC 2396 | D.2 Modifications from RFC 2396 | |||
| The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | |||
| This change required all rule names that formerly included underscore | This change required all rule names that formerly included underscore | |||
| characters to be renamed with a dash instead. | characters to be renamed with a dash instead. | |||
| Section 2.2 on reserved characters has been rewritten to clearly | Section 2 on characters has been rewritten to explain what characters | |||
| explain what characters are reserved, when they are reserved, and why | are reserved, when they are reserved, and why they are reserved even | |||
| they are reserved even when not used as delimiters by the generic | when not used as delimiters by the generic syntax. The mark | |||
| syntax. Likewise, the section on escaped characters has been | characters that are typically unsafe to decode, including the | |||
| rewritten, and URI normalizers are now given license to unescape any | exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open | |||
| octets corresponding to unreserved characters. The number-sign ("#") | and close parentheses ("(" and ")"), have been moved to the reserved | |||
| character has been moved back from the excluded delims to the | set in order to clarify the distinction between reserved and | |||
| reserved set. | unreserved and hopefully answer the most common question of scheme | |||
| designers. Likewise, the section on percent-encoded characters has | ||||
| been rewritten, and URI normalizers are now given license to decode | ||||
| any percent-encoded octets corresponding to unreserved characters. | ||||
| In general, the terms "escaped" and "unescaped" have been replaced | ||||
| with "percent-encoded" and "decoded", respectively, to reduce | ||||
| confusion with other forms of escape mechanisms. | ||||
| The ABNF for URI and URI-reference has been redesigned to make them | The ABNF for URI and URI-reference has been redesigned to make them | |||
| more friendly to LALR parsers and significantly reduce complexity. As | more friendly to LALR parsers and significantly reduce complexity. As | |||
| a result, the layout form of syntax description has been removed, | a result, the layout form of syntax description has been removed, | |||
| along with the uric-no-slash, opaque-part, and rel-segment | along with the uric, uric_no_slash, hier_part, opaque_part, net_path, | |||
| productions. All references to "opaque" URIs have been replaced with | abs_path, rel_path, path_segments, rel_segment, and mark rules. All | |||
| a better description of how the path component may be opaque to | references to "opaque" URIs have been replaced with a better | |||
| hierarchy. The fragment identifier has been moved back into the | description of how the path component may be opaque to hierarchy. The | |||
| section on generic syntax components and within the URI and | ambiguity regarding the parsing of URI-reference as a URI or a | |||
| relative-URI productions, though it remains excluded from | relative-URI with a colon in the first segment is now explained and | |||
| absolute-URI. The ambiguity regarding the parsing of URI-reference as | disambiguated in the section defining relative-URI. | |||
| a URI or a relative-URI with a colon in the first segment is now | ||||
| explained and disambiguated in the section defining relative-URI. | ||||
| The ABNF of hier-part and relative-URI has been corrected to allow a | The fragment identifier has been moved back into the section on | |||
| relative URI path to be empty. This also allows an absolute-URI to | generic syntax components and within the URI and relative-URI rules, | |||
| consist of nothing after the "scheme:", as is present in practice | though it remains excluded from absolute-URI. The number sign ("#") | |||
| with the "DAV:" namespace [RFC2518] and the "about:" URI used by many | character has been moved back to the reserved set as a result of | |||
| browser implementations. The ambiguity regarding the parsing of | reintegrating the fragment syntax. | |||
| net-path, abs-path, and rel-path is now explained and disambiguated | ||||
| in the same section. | ||||
| Registry-based naming authorities that use the generic syntax | The ABNF has been corrected to allow a relative path to be empty. | |||
| authority component are now limited to DNS hostnames, since those | This also allows an absolute-URI to consist of nothing after the | |||
| have been the only such URIs in deployment. This change was | "scheme:", as is present in practice with the "dav:" namespace | |||
| necessary to enable internationalized domain names to be processed in | [RFC2518] and the "about:" scheme used internally by many WWW browser | |||
| their native character encodings at the application layers above URI | implementations. The ambiguity regarding the boundary between | |||
| processing. The reg_name, server, and hostport productions have been | authority and path is now explained and disambiguated in the same | |||
| removed to simplify parsing of the URI syntax. | section. | |||
| The ABNF of qualified has been simplified to remove a parsing | Registry-based naming authorities that use the generic syntax are now | |||
| ambiguity without changing the allowed syntax. The toplabel | defined within the host rule and limited to 255 path characters. This | |||
| production has been removed because it served no useful purpose. The | change allows current implementations, where whatever name provided | |||
| ambiguity regarding the parsing of host as IPv4address or hostname is | is simply fed to the local name resolution mechanism, to be | |||
| now explained and disambiguated in the same section. | consistent with the specification and removes the need to re-specify | |||
| DNS name formats here. It also allows the host component to contain | ||||
| percent-encoded octets, which is necessary to enable | ||||
| internationalized domain names to be provided in URIs, processed in | ||||
| their native character encodings at the application layers above URI | ||||
| processing, and passed to an IDNA library as a registered name in the | ||||
| UTF-8 character encoding. The server, hostport, hostname, | ||||
| domainlabel, toplabel, and alphanum rules have been removed. | ||||
| The algorithm of [RFC2396] for resolving relative references has been | The algorithm of [RFC2396] for resolving relative references has been | |||
| rewritten using pseudocode for this revision to improve clarity and | rewritten using pseudocode for this revision to improve clarity and | |||
| fix the following issues: | fix the following issues: | |||
| o [RFC2396] section 5.2, step 6a, failed to account for a base URI | o [RFC2396] section 5.2, step 6a, failed to account for a base URI | |||
| with no path. | with no path. | |||
| o Restored the behavior of [RFC1808] where, if the reference | o Restored the behavior of [RFC1808] where, if the reference | |||
| contains an empty path and a defined query component, then the | contains an empty path and a defined query component, then the | |||
| target URI inherits the base URI's path component. | target URI inherits the base URI's path component. | |||
| o Removed the special-case treatment of same-document references in | o Removed the special-case treatment of same-document references | |||
| favor of a section that explains that a new retrieval action | within the URI parser in favor of a section that explains when a | |||
| should not be made if the target URI and base URI, excluding | reference should be interpreted by a dereferencing engine as a | |||
| fragments, match. This change has no impact on user agent | same-document reference: when the target URI and base URI, | |||
| behavior aside from how the resolved reference might be described | excluding fragments, match. This change does not modify the | |||
| to the user. | behavior of existing same-document references as defined by RFC | |||
| 2396 (fragment-only references); it merely adds the same-document | ||||
| distinction to other references that refer to the base URI and | ||||
| simplifies the interface between applications and their URI | ||||
| parsers, as is consistent with the internal architecture of | ||||
| deployed URI processing implementations. | ||||
| o Separated the path merge routine into two routines: merge, for | o Separated the path merge routine into two routines: merge, for | |||
| describing combination of the base URI path with a relative-path | describing combination of the base URI path with a relative-path | |||
| reference, and remove_dot_segments, for describing how to remove | reference, and remove_dot_segments, for describing how to remove | |||
| the special "." and ".." segments from a composed path. The | the special "." and ".." segments from a composed path. The | |||
| remove_dot_segments algorithm is now applied to all URI reference | remove_dot_segments algorithm is now applied to all URI reference | |||
| paths in order to match common implementations and improve the | paths in order to match common implementations and improve the | |||
| normalization of URIs in practice. This change only impacts the | normalization of URIs in practice. This change only impacts the | |||
| parsing of abnormal references and same-scheme references wherein | parsing of abnormal references and same-scheme references wherein | |||
| the base URI has a non-hierarchical path. | the base URI has a non-hierarchical path. | |||
| Index | Index | |||
| A | A | |||
| ABNF 9 | ABNF 10 | |||
| abs-path 16 | ||||
| absolute 25 | absolute 25 | |||
| absolute-path 24 | absolute-path 24 | |||
| absolute-URI 25 | absolute-URI 25 | |||
| access 7 | access 7 | |||
| alphanum 18 | authority 15, 16 | |||
| authority 16, 17 | ||||
| B | B | |||
| base URI 27 | base URI 27 | |||
| C | ||||
| characters 11 | ||||
| D | D | |||
| dec-octet 19 | dec-octet 18 | |||
| delims 15 | ||||
| dereference 7 | dereference 7 | |||
| domainlabel 18 | ||||
| dot-segments 20 | dot-segments 20 | |||
| E | ||||
| escaped 13 | ||||
| excluded 14 | ||||
| F | F | |||
| fragment 22 | fragment 22 | |||
| G | G | |||
| gen-delims 12 | ||||
| generic syntax 5 | generic syntax 5 | |||
| H | H | |||
| h4 19 | h16 17 | |||
| hier-part 16 | hierarchical 9 | |||
| hierarchical 8 | host 17 | |||
| host 18 | ||||
| hostname 18 | ||||
| I | I | |||
| identifier 5 | identifier 5 | |||
| invisible 14 | IP-literal 17 | |||
| IPv4 19 | IPv4 18 | |||
| IPv4address 19 | IPv4address 18 | |||
| IPv6 19 | IPv6 17 | |||
| IPv6address 19 | IPv6address 17 | |||
| IPv6reference 19 | IPvFuture 17 | |||
| L | L | |||
| locator 6 | locator 6 | |||
| ls32 19 | ls32 17 | |||
| M | M | |||
| mark 12 | ||||
| merge 30 | merge 30 | |||
| N | N | |||
| name 6 | name 6 | |||
| net-path 16 | ||||
| network-path 24 | network-path 24 | |||
| P | P | |||
| path 16, 20 | path 15, 20 | |||
| path-segments 20 | ||||
| pchar 20 | pchar 20 | |||
| pct-encoded 11 | ||||
| percent-encoding 11 | ||||
| port 20 | port 20 | |||
| Q | Q | |||
| qualified 18 | ||||
| query 21 | query 21 | |||
| R | R | |||
| rel-path 16 | reg-name 19 | |||
| registered name 19 | ||||
| relative 9, 27 | relative 9, 27 | |||
| relative-path 24 | relative-path 24 | |||
| relative-URI 24 | relative-URI 24 | |||
| remove_dot_segments 30 | remove_dot_segments 30 | |||
| representation 8 | representation 8 | |||
| reserved 11 | reserved 12 | |||
| resolution 7, 27 | resolution 7, 27 | |||
| resource 4 | resource 4 | |||
| retrieval 8 | retrieval 8 | |||
| S | S | |||
| same-document 25 | same-document 25 | |||
| sameness 8 | sameness 8 | |||
| scheme 16 | scheme 15 | |||
| segment 20 | segment 20 | |||
| sub-delims 12 | ||||
| suffix 25 | suffix 25 | |||
| T | T | |||
| transcription 6 | transcription 6 | |||
| U | U | |||
| uniform 4 | uniform 4 | |||
| unreserved 12 | unreserved 12 | |||
| unwise 15 | ||||
| URI grammar | URI grammar | |||
| abs-path 16 | ||||
| absolute-URI 25 | absolute-URI 25 | |||
| ALPHA 9 | ALPHA 10 | |||
| alphanum 18 | authority 15, 16 | |||
| authority 16, 17 | CR 10 | |||
| CR 9 | CTL 10 | |||
| CTL 9 | dec-octet 18 | |||
| dec-octet 19 | DIGIT 10 | |||
| DIGIT 9 | DQUOTE 10 | |||
| domainlabel 18 | fragment 15, 22, 24 | |||
| DQUOTE 9 | gen-delims 12 | |||
| escaped 13 | h16 18 | |||
| fragment 16, 22, 24 | HEXDIG 10 | |||
| h4 19 | host 16, 17 | |||
| HEXDIG 9 | IP-literal 17 | |||
| hier-part 16, 24, 25 | IPv4address 18 | |||
| host 17, 18 | IPv6address 17, 18 | |||
| hostname 18 | IPvFuture 17 | |||
| IPv4address 19 | LF 10 | |||
| IPv6address 19 | ls32 18 | |||
| IPv6reference 19 | ||||
| LF 9 | ||||
| ls32 19 | ||||
| mark 12 | mark 12 | |||
| net-path 16 | OCTET 10 | |||
| OCTET 9 | path 15 | |||
| path-segments 16, 20 | path-segments 20 | |||
| pchar 20, 21, 22 | pchar 20, 21, 22 | |||
| port 17, 20 | pct-encoded 11 | |||
| qualified 18 | port 16, 20 | |||
| query 16, 21, 24, 25 | query 15, 21, 24, 25 | |||
| rel-path 16 | reg-name 19 | |||
| relative-URI 24, 24 | relative-URI 24, 24 | |||
| reserved 12 | reserved 12 | |||
| scheme 16, 17, 25 | scheme 15, 15, 25 | |||
| segment 20 | segment 20 | |||
| SP 9 | SP 10 | |||
| sub-delims 12 | ||||
| unreserved 12 | unreserved 12 | |||
| URI 16, 24 | URI 15, 24 | |||
| URI-reference 24 | URI-reference 24 | |||
| uric 11 | userinfo 16, 16 | |||
| userinfo 17, 18 | URI 15 | |||
| URI 16 | ||||
| URI-reference 24 | URI-reference 24 | |||
| uric 11 | ||||
| URL 6 | URL 6 | |||
| URN 6 | URN 6 | |||
| userinfo 18 | userinfo 16 | |||
| Intellectual Property Statement | Intellectual Property Statement | |||
| The IETF takes no position regarding the validity or scope of any | The IETF takes no position regarding the validity or scope of any | |||
| intellectual property or other rights that might be claimed to | intellectual property or other rights that might be claimed to | |||
| pertain to the implementation or use of the technology described in | pertain to the implementation or use of the technology described in | |||
| this document or the extent to which any license under such rights | this document or the extent to which any license under such rights | |||
| might or might not be available; neither does it represent that it | might or might not be available; neither does it represent that it | |||
| has made any effort to identify any such rights. Information on the | has made any effort to identify any such rights. Information on the | |||
| IETF's procedures with respect to rights in standards-track and | IETF's procedures with respect to rights in standards-track and | |||
| skipping to change at page 60, line 29 ¶ | skipping to change at page 61, line 29 ¶ | |||
| be obtained from the IETF Secretariat. | be obtained from the IETF Secretariat. | |||
| The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
| copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
| rights which may cover technology that may be required to practice | rights which may cover technology that may be required to practice | |||
| this standard. Please address the information to the IETF Executive | this standard. Please address the information to the IETF Executive | |||
| Director. | Director. | |||
| Full Copyright Statement | Full Copyright Statement | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| included on all such copies and derivative works. However, this | included on all such copies and derivative works. However, this | |||
| document itself may not be modified in any way, such as by removing | document itself may not be modified in any way, such as by removing | |||
| the copyright notice or references to the Internet Society or other | the copyright notice or references to the Internet Society or other | |||
| Internet organizations, except as needed for the purpose of | Internet organizations, except as needed for the purpose of | |||
| skipping to change at page 61, line 7 ¶ | skipping to change at page 62, line 7 ¶ | |||
| The limited permissions granted above are perpetual and will not be | The limited permissions granted above are perpetual and will not be | |||
| revoked by the Internet Society or its successors or assignees. | revoked by the Internet Society or its successors or assignees. | |||
| This document and the information contained herein is provided on an | This document and the information contained herein is provided on an | |||
| "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | |||
| TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | |||
| BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | |||
| HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | |||
| MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | |||
| Acknowledgement | Acknowledgment | |||
| Funding for the RFC Editor function is currently provided by the | Funding for the RFC Editor function is currently provided by the | |||
| Internet Society. | Internet Society. | |||
| End of changes. 263 change blocks. | ||||
| 972 lines changed or deleted | 1118 lines changed or added | |||
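
Appendix D.2 of the -04 text says that URI normalizers may decode any percent-encoded octets corresponding to unreserved characters, and that "!", "*", "'", "(" and ")" have moved to the reserved set. The following is a minimal sketch of that normalization step, not text from either draft. The function name decode_unreserved is hypothetical, and the unreserved set used here (letters, digits, "-", ".", "_", "~") is an assumption inferred from the change description rather than copied from the draft's ABNF.

```python
import re
import string

# Assumption: the unreserved set after the change described in D.2, i.e.
# ALPHA / DIGIT plus the mark characters that were NOT moved to reserved.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

# A percent-encoded octet: "%" followed by exactly two hex digits.
_PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only those percent-encoded octets that map to unreserved
    characters; every other triplet is left exactly as written."""

    def _replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Octets above 0x7F never land in UNRESERVED, so they stay encoded.
        return char if char in UNRESERVED else match.group(0)

    return _PCT_ENCODED.sub(_replace, uri)


if __name__ == "__main__":
    print(decode_unreserved("http://example.com/%7Euser/%41%2Fb%21"))
```

Under these assumptions "%7E" and "%41" are decoded to "~" and "A", while "%2F" and "%21" stay encoded because "/" and "!" are reserved.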