| < draft-fielding-uri-rfc2396bis-04.txt | draft-fielding-uri-rfc2396bis-05.txt > | |||
|---|---|---|---|---|
| Network Working Group T. Berners-Lee | Network Working Group T. Berners-Lee | |||
| Internet-Draft MIT/LCS | Internet-Draft W3C/MIT | |||
| Updates: 1738 (if approved) R. Fielding | Updates: 1738 (if approved) R. Fielding | |||
| Obsoletes: 2732, 2396, 1808 (if approved) Day Software | Obsoletes: 2732, 2396, 1808 (if approved) Day Software | |||
| Expires: August 16, 2004 L. Masinter | L. Masinter | |||
| Adobe | Expires: October 15, 2004 Adobe | |||
| February 16, 2004 | April 16, 2004 | |||
| Uniform Resource Identifier (URI): Generic Syntax | Uniform Resource Identifier (URI): Generic Syntax | |||
| draft-fielding-uri-rfc2396bis-04 | draft-fielding-uri-rfc2396bis-05 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | By submitting this Internet-Draft, I certify that any applicable | |||
| all provisions of Section 10 of RFC2026. | patent or other IPR claims of which I am aware have been disclosed, | |||
| | and any of which I become aware will be disclosed, in accordance with | |||
| | RFC 3668. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that other | Task Force (IETF), its areas, and its working groups. Note that other | |||
| groups may also distribute working documents as Internet-Drafts. | groups may also distribute working documents as Internet-Drafts. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| <http://www.ietf.org/ietf/1id-abstracts.txt>. | <http://www.ietf.org/ietf/1id-abstracts.txt>. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| <http://www.ietf.org/shadow.html>. | <http://www.ietf.org/shadow.html>. | |||
| This Internet-Draft will expire on August 16, 2004. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2004). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| Abstract | Abstract | |||
| A Uniform Resource Identifier (URI) is a compact string of characters | A Uniform Resource Identifier (URI) is a compact sequence of | |||
| for identifying an abstract or physical resource. This specification | characters for identifying an abstract or physical resource. This | |||
| defines the generic URI syntax and a process for resolving URI | specification defines the generic URI syntax and a process for | |||
| references that might be in relative form, along with guidelines and | resolving URI references that might be in relative form, along with | |||
| security considerations for the use of URIs on the Internet. | guidelines and security considerations for the use of URIs on the | |||
| | Internet. | |||
| The URI syntax defines a grammar that is a superset of all valid | The URI syntax defines a grammar that is a superset of all valid | |||
| URIs, such that an implementation can parse the common components of | URIs, such that an implementation can parse the common components of | |||
| a URI reference without knowing the scheme-specific requirements of | a URI reference without knowing the scheme-specific requirements of | |||
| every possible identifier. This specification does not define a | every possible identifier. This specification does not define a | |||
| generative grammar for URIs; that task is performed by the individual | generative grammar for URIs; that task is performed by the individual | |||
| specifications of each URI scheme. | specifications of each URI scheme. | |||
| Editorial Note | Editorial Note | |||
| Discussion of this draft and comments to the editors should be sent | Discussion of this draft and comments to the editors should be sent | |||
| to the uri@w3.org mailing list. An issues list and version history | to the uri@w3.org mailing list. An issues list and version history | |||
| is available at <http://gbiv.com/protocols/uri/rev-2002/issues.html>. | is available at <http://gbiv.com/protocols/uri/rev-2002/issues.html>. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . . . . 6 | 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . 6 | |||
| 1.2 Design Considerations . . . . . . . . . . . . . . . . . . . 6 | 1.2 Design Considerations . . . . . . . . . . . . . . . . . . 7 | |||
| 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . 7 | |||
| 1.2.2 Separating Identification from Interaction . . . . . . . . . 7 | 1.2.2 Separating Identification from Interaction . . . . . . 8 | |||
| 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . . . . 9 | 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . 9 | |||
| 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . . 10 | 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . 11 | 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 2.1 Percent Encoding . . . . . . . . . . . . . . . . . . . . . . 11 | 2.1 Percent-Encoding . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . . 12 | 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . 11 | |||
| 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . . 12 | 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . 12 | |||
| 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . . 13 | 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . 13 | |||
| 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . 15 | 2.5 Identifying Data . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . . 16 | 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 3.2.1 User Information . . . . . . . . . . . . . . . . . . . . . . 16 | 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 | 3.2.1 User Information . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 | 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . . 24 | 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . . 24 | 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . . 25 | 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 4.4 Same-document Reference . . . . . . . . . . . . . . . . . . 25 | 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . . 25 | 4.4 Same-document Reference . . . . . . . . . . . . . . . . . 25 | |||
| 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . 27 | 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . . 27 | ||||
| 5.1.1 Base URI within Document Content . . . . . . . . . . . . . . 27 | ||||
| 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . . . . 28 | ||||
| 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . . . . 28 | ||||
| 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . . . . 28 | ||||
| 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . . 28 | ||||
| 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . . . . 29 | ||||
| 5.2.2 Transform References . . . . . . . . . . . . . . . . . . . . 29 | ||||
| 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . . . . 30 | ||||
| 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . . . . 30 | ||||
| 5.3 Component Recomposition . . . . . . . . . . . . . . . . . . 32 | ||||
| 5.4 Reference Resolution Examples . . . . . . . . . . . . . . . 33 | ||||
| 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . . . . 33 | ||||
| 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . . . . 33 | ||||
| 6. Normalization and Comparison . . . . . . . . . . . . . . . . 35 | ||||
| 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 35 | ||||
| 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . . 36 | ||||
| 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 36 | ||||
| 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . . . . 37 | ||||
| 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . . . . 38 | ||||
| 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . . . . 39 | ||||
| 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . . 39 | ||||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . 41 | ||||
| 7.1 Reliability and Consistency . . . . . . . . . . . . . . . . 41 | ||||
| 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . . 41 | ||||
| 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . . 42 | ||||
| 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . . 42 | ||||
| 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . . 43 | ||||
| 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . . 43 | ||||
| 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 45 | ||||
| Normative References . . . . . . . . . . . . . . . . . . . . 46 | ||||
| Informative References . . . . . . . . . . . . . . . . . . . 47 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 48 | ||||
| A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . 50 | ||||
| B. Parsing a URI Reference with a Regular Expression . . . . . 52 | ||||
| C. Delimiting a URI in Context . . . . . . . . . . . . . . . . 53 | ||||
| D. Summary of Non-editorial Changes . . . . . . . . . . . . . . 55 | ||||
| D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . . 55 | ||||
| D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . . 55 | ||||
| Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 | ||||
| Intellectual Property and Copyright Statements . . . . . . . 62 | ||||
| 1. Introduction | 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 27 | |||
| | 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . 27 | |||
| | 5.1.1 Base URI Embedded in Content . . . . . . . . . . . . . 27 | |||
| | 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . 28 | |||
| | 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . 28 | |||
| | 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . 28 | |||
| | 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . 29 | |||
| | 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . 29 | |||
| | 5.2.2 Transform References . . . . . . . . . . . . . . . . . 29 | |||
| | 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . 30 | |||
| | 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . 31 | |||
| | 5.3 Component Recomposition . . . . . . . . . . . . . . . . . 33 | |||
| | 5.4 Reference Resolution Examples . . . . . . . . . . . . . . 34 | |||
| | 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . 34 | |||
| | 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . 34 | |||
| | 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 36 | |||
| | 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
| | 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . 37 | |||
| | 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . 37 | |||
| | 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . 37 | |||
| | 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . 38 | |||
| | 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . 39 | |||
| | 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . 40 | |||
| | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 40 | |||
| | 7.1 Reliability and Consistency . . . . . . . . . . . . . . . 40 | |||
| | 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . 41 | |||
| | 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . 41 | |||
| | 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . 42 | |||
| | 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . 43 | |||
| | 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . 43 | |||
| | 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 44 | |||
| | 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 45 | |||
| | 9.1 Normative References . . . . . . . . . . . . . . . . . . . . 45 | |||
| | 9.2 Informative References . . . . . . . . . . . . . . . . . . . 45 | |||
| | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 47 | |||
| | A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 48 | |||
| | B. Parsing a URI Reference with a Regular Expression . . . . . . 50 | |||
| | C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51 | |||
| | D. Summary of Non-editorial Changes . . . . . . . . . . . . . . . 52 | |||
| | D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . 52 | |||
| | D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . 53 | |||
| | Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 | |||
| | Intellectual Property and Copyright Statements . . . . . . . . 58 | |||
| | 1. Introduction | |||
| A Uniform Resource Identifier (URI) provides a simple and extensible | A Uniform Resource Identifier (URI) provides a simple and extensible | |||
| means for identifying a resource. This specification of URI syntax | means for identifying a resource. This specification of URI syntax | |||
| and semantics is derived from concepts introduced by the World Wide | and semantics is derived from concepts introduced by the World Wide | |||
| Web global information initiative, whose use of such identifiers | Web global information initiative, whose use of such identifiers | |||
| dates from 1990 and is described in "Universal Resource Identifiers | dates from 1990 and is described in "Universal Resource Identifiers | |||
| in WWW" [RFC1630], and is designed to meet the recommendations laid | in WWW" [RFC1630], and is designed to meet the recommendations laid | |||
| out in "Functional Recommendations for Internet Resource Locators" | out in "Functional Recommendations for Internet Resource Locators" | |||
| [RFC1736] and "Functional Requirements for Uniform Resource Names" | [RFC1736] and "Functional Requirements for Uniform Resource Names" | |||
| [RFC1737]. | [RFC1737]. | |||
| skipping to change at page 4, line 28 ¶ | skipping to change at page 4, line 28 ¶ | |||
| Locators" [RFC1738] and "Relative Uniform Resource Locators" | Locators" [RFC1738] and "Relative Uniform Resource Locators" | |||
| [RFC1808] in order to define a single, generic syntax for all URIs. | [RFC1808] in order to define a single, generic syntax for all URIs. | |||
| It excludes those portions of RFC 1738 that defined the specific | It excludes those portions of RFC 1738 that defined the specific | |||
| syntax of individual URI schemes; those portions will be updated as | syntax of individual URI schemes; those portions will be updated as | |||
| separate documents. The process for registration of new URI schemes | separate documents. The process for registration of new URI schemes | |||
| is defined separately by [RFC2717]. Advice for designers of new URI | is defined separately by [RFC2717]. Advice for designers of new URI | |||
| schemes can be found in [RFC2718]. | schemes can be found in [RFC2718]. | |||
| All significant changes from RFC 2396 are noted in Appendix D. | All significant changes from RFC 2396 are noted in Appendix D. | |||
| This specification uses the terms "character" and "character | This specification uses the terms "character" and "coded character | |||
| encoding" in accordance with the definitions provided in [RFC2978]. | set" in accordance with the definitions provided in [RFC2978], and | |||
| | "character encoding" in place of what [RFC2978] refers to as a | |||
| | "charset". | |||
| 1.1 Overview of URIs | 1.1 Overview of URIs | |||
| URIs are characterized as follows: | URIs are characterized as follows: | |||
| Uniform | Uniform | |||
| Uniformity provides several benefits: it allows different types of | Uniformity provides several benefits: it allows different types of | |||
| resource identifiers to be used in the same context, even when the | resource identifiers to be used in the same context, even when the | |||
| mechanisms used to access those resources may differ; it allows | mechanisms used to access those resources may differ; it allows | |||
| uniform semantic interpretation of common syntactic conventions | uniform semantic interpretation of common syntactic conventions | |||
| across different types of resource identifiers; it allows | across different types of resource identifiers; it allows | |||
| introduction of new types of resource identifiers without | introduction of new types of resource identifiers without | |||
| interfering with the way that existing identifiers are used; and, | interfering with the way that existing identifiers are used; and, | |||
| it allows the identifiers to be reused in many different contexts, | it allows the identifiers to be reused in many different contexts, | |||
| thus permitting new applications or protocols to leverage a | thus permitting new applications or protocols to leverage a | |||
| pre-existing, large, and widely-used set of resource identifiers. | pre-existing, large, and widely-used set of resource identifiers. | |||
| Resource | Resource | |||
| Anything that has been named or described can be a resource. | ||||
| | Anything that can be named or described can be a resource. | |||
| Familiar examples include an electronic document, an image, a | Familiar examples include an electronic document, an image, a | |||
| service (e.g., "today's weather report for Los Angeles"), and a | service (e.g., "today's weather report for Los Angeles"), and a | |||
| collection of other resources. A resource is not necessarily | collection of other resources. A resource is not necessarily | |||
| accessible via the Internet; e.g., human beings, corporations, and | accessible via the Internet; e.g., human beings, corporations, and | |||
| bound books in a library can also be resources. Likewise, abstract | bound books in a library can also be resources. Likewise, abstract | |||
| concepts can be resources, such as the operators and operands of a | concepts can be resources, such as the operators and operands of a | |||
| mathematical equation or the types of a relationship (e.g., | mathematical equation, the types of a relationship (e.g., "parent" | |||
| "parent" or "employee"). | or "employee"), or numeric values (e.g., zero, one, and infinity). | |||
| | These things are called resources because they each can be | |||
| | considered a source of supply or support, or an available means, | |||
| | for some system, where such systems may be as diverse as the World | |||
| | Wide Web, a filesystem, an ontological graph, a theorem prover, or | |||
| | some other form of system for the direct or indirect observation | |||
| | and/or manipulation of resources. Note that "supply" is not | |||
| | necessary for a thing to be considered a resource: the ability to | |||
| | simply refer to that thing is often sufficient to support the | |||
| | operation of a given system. | |||
| Identifier | Identifier | |||
| An identifier embodies the information required to distinguish | An identifier embodies the information required to distinguish | |||
| what is being identified from all other things within its scope of | what is being identified from all other things within its scope of | |||
| identification. | identification. Our use of the terms "identify" and "identifying" | |||
| | refer to this process of distinguishing from many to one; they | |||
| | should not be mistaken as an assumption that the identifier | |||
| | defines the identity of what is referenced, though that may be the | |||
| | case for some identifiers. | |||
| A URI is an identifier that consists of a sequence of characters | A URI is an identifier that consists of a sequence of characters | |||
| matching the syntax defined by the syntax rule named "URI" in Section | matching the syntax rule named <URI> in Section 3. A URI can be used | |||
| 3. A URI can be used to refer to a resource. This specification does | to refer to a resource. This specification does not place any limits | |||
| not place any limits on the nature of a resource or the reasons why | on the nature of a resource, the reasons why an application might | |||
| an application might wish to refer to a resource. URIs have a global | wish to refer to a resource, or the kinds of system that might use | |||
| scope and should be interpreted consistently regardless of context, | URIs for the sake of identifying resources. | |||
| but that interpretation may be defined in relation to the user's | ||||
| context (e.g., "http://localhost/" refers to a resource that is | ||||
| relative to the user's network interface and yet not specific to any | ||||
| one user). | ||||
| 1.1.1 Generic Syntax | URIs have a global scope and must be interpreted consistently | |||
| | regardless of context, though the result of that interpretation may | |||
| | be in relation to the end-user's context. For example, "http:// | |||
| | localhost/" has the same interpretation for every user of that | |||
| | reference, even though the network interface corresponding to | |||
| | "localhost" may be different for each end-user: interpretation is | |||
| | independent of access. However, an action made on the basis of that | |||
| | reference will take place in relation to the end-user's context, | |||
| | which implies that an action intended to refer to a single, globally | |||
| | unique thing must use a URI that distinguishes that resource from all | |||
| | other things. URIs that identify in relation to the end-user's local | |||
| | context should only be used when the context itself is a defining | |||
| | aspect of the resource, such as when an on-line Linux manual refers | |||
| | to a file on the end-user's filesystem (e.g., "file:///etc/hosts"). | |||
| | 1.1.1 Generic Syntax | |||
| Each URI begins with a scheme name, as defined in Section 3.1, that | Each URI begins with a scheme name, as defined in Section 3.1, that | |||
| refers to a specification for assigning identifiers within that | refers to a specification for assigning identifiers within that | |||
| scheme. As such, the URI syntax is a federated and extensible naming | scheme. As such, the URI syntax is a federated and extensible naming | |||
| system wherein each scheme's specification may further restrict the | system wherein each scheme's specification may further restrict the | |||
| syntax and semantics of identifiers using that scheme. | syntax and semantics of identifiers using that scheme. | |||
| This specification defines those elements of the URI syntax that are | This specification defines those elements of the URI syntax that are | |||
| required of all URI schemes or are common to many URI schemes. It | required of all URI schemes or are common to many URI schemes. It | |||
| thus defines the syntax and semantics that are needed to implement a | thus defines the syntax and semantics that are needed to implement a | |||
| skipping to change at page 6, line 5 ¶ | skipping to change at page 6, line 29 ¶ | |||
| formats that make use of URI references can refer to this | formats that make use of URI references can refer to this | |||
| specification as defining the range of syntax allowed for all URIs, | specification as defining the range of syntax allowed for all URIs, | |||
| including those schemes that have yet to be defined. | including those schemes that have yet to be defined. | |||
| A parser of the generic URI syntax is capable of parsing any URI | A parser of the generic URI syntax is capable of parsing any URI | |||
| reference into its major components; once the scheme is determined, | reference into its major components; once the scheme is determined, | |||
| further scheme-specific parsing can be performed on the components. | further scheme-specific parsing can be performed on the components. | |||
| In other words, the URI generic syntax is a superset of the syntax of | In other words, the URI generic syntax is a superset of the syntax of | |||
| all URI schemes. | all URI schemes. | |||
| 1.1.2 Examples | 1.1.2 Examples | |||
| The following examples illustrate URIs that are in common use. | The following examples illustrate URIs that are in common use. | |||
| ftp://ftp.is.co.za/rfc/rfc1808.txt | ftp://ftp.is.co.za/rfc/rfc1808.txt | |||
| http://www.ietf.org/rfc/rfc2396.txt | http://www.ietf.org/rfc/rfc2396.txt | |||
| mailto:John.Doe@example.com | mailto:John.Doe@example.com | |||
| news:comp.infosystems.www.servers.unix | news:comp.infosystems.www.servers.unix | |||
| telnet://melvyl.ucop.edu/ | telnet://melvyl.ucop.edu/ | |||
| 1.1.3 URI, URL, and URN | 1.1.3 URI, URL, and URN | |||
| A URI can be further classified as a locator, a name, or both. The | A URI can be further classified as a locator, a name, or both. The | |||
| term "Uniform Resource Locator" (URL) refers to the subset of URIs | term "Uniform Resource Locator" (URL) refers to the subset of URIs | |||
| that, in addition to identifying a resource, provide a means of | that, in addition to identifying a resource, provide a means of | |||
| locating the resource by describing its primary access mechanism | locating the resource by describing its primary access mechanism | |||
| (e.g., its network "location"). The term "Uniform Resource Name" | (e.g., its network "location"). The term "Uniform Resource Name" | |||
| (URN) has been used historically to refer to both URIs under the | (URN) has been used historically to refer to both URIs under the | |||
| "urn" scheme [RFC2141], which are required to remain globally unique | "urn" scheme [RFC2141], which are required to remain globally unique | |||
| and persistent even when the resource ceases to exist or becomes | and persistent even when the resource ceases to exist or becomes | |||
| unavailable, and to any other URI with the properties of a name. | unavailable, and to any other URI with the properties of a name. | |||
| An individual scheme does not need to be classified as being just one | An individual scheme does not need to be classified as being just one | |||
| of "name" or "locator". Instances of URIs from any given scheme may | of "name" or "locator". Instances of URIs from any given scheme may | |||
| have the characteristics of names or locators or both, often | have the characteristics of names or locators or both, often | |||
| depending on the persistence and care in the assignment of | depending on the persistence and care in the assignment of | |||
| identifiers by the naming authority, rather than any quality of the | identifiers by the naming authority, rather than any quality of the | |||
| scheme. Future specifications and related documentation should use | scheme. Future specifications and related documentation should use | |||
| the general term "URI", rather than the more restrictive terms URL | the general term "URI", rather than the more restrictive terms URL | |||
| and URN [RFC3305]. | and URN [RFC3305]. | |||
| 1.2 Design Considerations | 1.2 Design Considerations | |||
| 1.2.1 Transcription | 1.2.1 Transcription | |||
| The URI syntax has been designed with global transcription as one of | The URI syntax has been designed with global transcription as one of | |||
| its main considerations. A URI is a sequence of characters from a | its main considerations. A URI is a sequence of characters from a | |||
| very limited set: the letters of the basic Latin alphabet, digits, | very limited set: the letters of the basic Latin alphabet, digits, | |||
| and a few special characters. A URI may be represented in a variety | and a few special characters. A URI may be represented in a variety | |||
| of ways: e.g., ink on paper, pixels on a screen, or a sequence of | of ways: e.g., ink on paper, pixels on a screen, or a sequence of | |||
| integers from a coded character set. The interpretation of a URI | character encoding octets. The interpretation of a URI depends only | |||
| depends only on the characters used and not how those characters are | on the characters used and not how those characters are represented | |||
| represented in a network protocol. | in a network protocol. | |||
| The goal of transcription can be described by a simple scenario. | The goal of transcription can be described by a simple scenario. | |||
| Imagine two colleagues, Sam and Kim, sitting in a pub at an | Imagine two colleagues, Sam and Kim, sitting in a pub at an | |||
| international conference and exchanging research ideas. Sam asks Kim | international conference and exchanging research ideas. Sam asks Kim | |||
| for a location to get more information, so Kim writes the URI for the | for a location to get more information, so Kim writes the URI for the | |||
| research site on a napkin. Upon returning home, Sam takes out the | research site on a napkin. Upon returning home, Sam takes out the | |||
| napkin and types the URI into a computer, which then retrieves the | napkin and types the URI into a computer, which then retrieves the | |||
| information to which Kim referred. | information to which Kim referred. | |||
| There are several design considerations revealed by the scenario: | There are several design considerations revealed by the scenario: | |||
| skipping to change at page 7, line 37 ¶ | skipping to change at page 8, line 10 ¶ | |||
| These design considerations are not always in alignment. For | These design considerations are not always in alignment. For | |||
| example, it is often the case that the most meaningful name for a URI | example, it is often the case that the most meaningful name for a URI | |||
| component would require characters that cannot be typed into some | component would require characters that cannot be typed into some | |||
| systems. The ability to transcribe a resource identifier from one | systems. The ability to transcribe a resource identifier from one | |||
| medium to another has been considered more important than having a | medium to another has been considered more important than having a | |||
| URI consist of the most meaningful of components. | URI consist of the most meaningful of components. | |||
| In local or regional contexts and with improving technology, users | In local or regional contexts and with improving technology, users | |||
| might benefit from being able to use a wider range of characters; | might benefit from being able to use a wider range of characters; | |||
| such use is not defined in this specification. Percent-encoded | such use is not defined by this specification. Percent-encoded | |||
| octets (Section 2.1) may be used within a URI to represent characters | octets (Section 2.1) may be used within a URI to represent characters | |||
| outside the range of the US-ASCII coded character set if such | outside the range of the US-ASCII coded character set if such | |||
| representation is defined by the scheme or by the protocol element in | representation is allowed by the scheme or by the protocol element in | |||
| which the URI is referenced; such a definition will specify the | which the URI is referenced; such a definition should specify the | |||
| character encoding scheme used to map those characters to octets | character encoding used to map those characters to octets prior to | |||
| prior to being percent-encoded for the URI. | being percent-encoded for the URI. | |||
| 1.2.2 Separating Identification from Interaction | 1.2.2 Separating Identification from Interaction | |||
| A common misunderstanding of URIs is that they are only used to refer | A common misunderstanding of URIs is that they are only used to refer | |||
| to accessible resources. In fact, the URI alone only provides | to accessible resources. In fact, the URI alone only provides | |||
| identification; access to the resource is neither guaranteed nor | identification; access to the resource is neither guaranteed nor | |||
| implied by the presence of a URI. Instead, an operation (if any) | implied by the presence of a URI. Instead, an operation (if any) | |||
| associated with a URI reference is defined by the protocol element, | associated with a URI reference is defined by the protocol element, | |||
| data format attribute, or natural language text in which it appears. | data format attribute, or natural language text in which it appears. | |||
| Given a URI, a system may attempt to perform a variety of operations | Given a URI, a system may attempt to perform a variety of operations | |||
| on the resource, as might be characterized by such words as "access", | on the resource, as might be characterized by such words as "access", | |||
| skipping to change at page 8, line 34 ¶ | skipping to change at page 9, line 8 ¶ | |||
| locally cached representation, resolution of the URI to determine an | locally cached representation, resolution of the URI to determine an | |||
| appropriate access mechanism (if any), and dereference of the URI for | appropriate access mechanism (if any), and dereference of the URI for | |||
| the sake of applying a retrieval operation. Depending on the | the sake of applying a retrieval operation. Depending on the | |||
| protocols used to perform the retrieval, additional information might | protocols used to perform the retrieval, additional information might | |||
| be supplied about the resource (resource metadata) and its relation | be supplied about the resource (resource metadata) and its relation | |||
| to other resources. | to other resources. | |||
| URI references in information systems are designed to be | URI references in information systems are designed to be | |||
| late-binding: the result of an access is generally determined at the | late-binding: the result of an access is generally determined at the | |||
| time it is accessed and may vary over time or due to other aspects of | time it is accessed and may vary over time or due to other aspects of | |||
| the interaction. When an author creates a reference to such a | the interaction. Such references are created in order to be used | |||
| resource, they do so with the intention that the reference be used in | in the future: what is being identified is not some specific result | |||
| the future; what is being identified is not some specific result that | that was obtained in the past, but rather some characteristic that is | |||
| was obtained in the past, but rather some characteristic that is | ||||
| expected to be true for future results. In such cases, the resource | expected to be true for future results. In such cases, the resource | |||
| referred to by the URI is actually a sameness of characteristics as | referred to by the URI is actually a sameness of characteristics as | |||
| observed over time, perhaps elucidated by additional comments or | observed over time, perhaps elucidated by additional comments or | |||
| assertions made by the resource provider. | assertions made by the resource provider. | |||
| Although many URI schemes are named after protocols, this does not | Although many URI schemes are named after protocols, this does not | |||
| imply that use of such a URI will result in access to the resource | imply that use of such a URI will result in access to the resource | |||
| via the named protocol. URIs are often used simply for the sake of | via the named protocol. URIs are often used simply for the sake of | |||
| identification. Even when a URI is used to retrieve a representation | identification. Even when a URI is used to retrieve a representation | |||
| of a resource, that access might be through gateways, proxies, | of a resource, that access might be through gateways, proxies, | |||
| caches, and name resolution services that are independent of the | caches, and name resolution services that are independent of the | |||
| protocol associated with the scheme name, and the resolution of some | protocol associated with the scheme name, and the resolution of some | |||
| URIs may require the use of more than one protocol (e.g., both DNS | URIs may require the use of more than one protocol (e.g., both DNS | |||
| and HTTP are typically used to access an "http" URI's origin server | and HTTP are typically used to access an "http" URI's origin server | |||
| when a representation isn't found in a local cache). | when a representation isn't found in a local cache). | |||
| 1.2.3 Hierarchical Identifiers | 1.2.3 Hierarchical Identifiers | |||
| The URI syntax is organized hierarchically, with components listed in | The URI syntax is organized hierarchically, with components listed in | |||
| order of decreasing significance from left to right. For some URI | order of decreasing significance from left to right. For some URI | |||
| schemes, the visible hierarchy is limited to the scheme itself: | schemes, the visible hierarchy is limited to the scheme itself: | |||
| everything after the scheme component delimiter (":") is considered | everything after the scheme component delimiter (":") is considered | |||
| opaque to URI processing. Other URI schemes make the hierarchy | opaque to URI processing. Other URI schemes make the hierarchy | |||
| explicit and visible to generic parsing algorithms. | explicit and visible to generic parsing algorithms. | |||
| The generic syntax uses the slash ("/"), question mark ("?"), and | The generic syntax uses the slash ("/"), question mark ("?"), and | |||
| number sign ("#") characters for the purpose of delimiting components | number sign ("#") characters for the purpose of delimiting components | |||
| skipping to change at page 9, line 49 ¶ | skipping to change at page 10, line 22 ¶ | |||
| the reference context and the target URI. The reference resolution | the reference context and the target URI. The reference resolution | |||
| algorithm, presented in Section 5, defines how such a reference is | algorithm, presented in Section 5, defines how such a reference is | |||
| transformed to the target URI. Since relative references can only be | transformed to the target URI. Since relative references can only be | |||
| used within the context of a hierarchical URI, designers of new URI | used within the context of a hierarchical URI, designers of new URI | |||
| schemes should use a syntax consistent with the generic syntax's | schemes should use a syntax consistent with the generic syntax's | |||
| hierarchical components unless there are compelling reasons to forbid | hierarchical components unless there are compelling reasons to forbid | |||
| relative referencing within that scheme. | relative referencing within that scheme. | |||
| All URIs are parsed by generic syntax parsers when used. A URI scheme | All URIs are parsed by generic syntax parsers when used. A URI scheme | |||
| that wishes to remain opaque to hierarchical processing must disallow | that wishes to remain opaque to hierarchical processing must disallow | |||
| the use of slash and question mark characters. However, since a | the use of slash and question mark characters. However, since a URI | |||
| non-relative URI reference is only modified by the generic parser if | reference is only modified by the generic parser if it contains a | |||
| it contains complete path segments of "." or ".." (see Section 3.3), | dot-segment (a complete path segment of "." or "..", as described in | |||
| URIs may safely use "/" for other purposes if they do not allow | Section 3.3), URI schemes may safely use "/" for other purposes if | |||
| dot-segments. | they do not allow dot-segments. | |||
| 1.3 Syntax Notation | 1.3 Syntax Notation | |||
| This specification uses the Augmented Backus-Naur Form (ABNF) | This specification uses the Augmented Backus-Naur Form (ABNF) | |||
| notation of [RFC2234], including the following core ABNF syntax rules | notation of [RFC2234], including the following core ABNF syntax rules | |||
| defined by that specification: ALPHA (letters), CR (carriage return), | defined by that specification: ALPHA (letters), CR (carriage return), | |||
| CTL (control characters), DIGIT (decimal digits), DQUOTE (double | DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal | |||
| quote), HEXDIG (hexadecimal digits), LF (line feed), and SP (space). | digits), LF (line feed), and SP (space). The complete URI syntax is | |||
| The complete URI syntax is collected in Appendix A. | collected in Appendix A. | |||
| 2. Characters | ||||
| Although ABNF notation defines its terminal values to be non-negative | 2. Characters | |||
| integers (codepoints) based on the US-ASCII coded character set | ||||
| [ASCII], we must invert that relation in order to understand the URI | ||||
| syntax, since URIs are defined as strings of characters independent | ||||
| of any particular encoding. Therefore, the integer values must be | ||||
| mapped back to their corresponding characters via US-ASCII in order | ||||
| to complete the syntax rules. | ||||
| This specification does not mandate the use of any particular | The URI syntax provides a method of encoding data, presumably for the | |||
| character encoding scheme for mapping between URI characters and the | sake of identifying a resource, as a sequence of characters. The URI | |||
| octets used to store or transmit those characters. When a URI appears | characters are, in turn, frequently encoded as octets for transport | |||
| in a protocol element, the character encoding is defined by that | or presentation. This specification does not mandate any particular | |||
| protocol; absent such a definition, a URI is assumed to use the same | character encoding for mapping between URI characters and the octets | |||
| used to store or transmit those characters. When a URI appears in a | ||||
| protocol element, the character encoding is defined by that protocol; | ||||
| absent such a definition, a URI is assumed to be in the same | ||||
| character encoding as the surrounding text. | character encoding as the surrounding text. | |||
| The ABNF notation defines its terminal values to be non-negative | ||||
| integers (codepoints) based on the US-ASCII coded character set | ||||
| [ASCII]. Since a URI is a sequence of characters, we must invert | ||||
| that relation in order to understand the URI syntax. Therefore, the | ||||
| integer values used by the ABNF must be mapped back to their | ||||
| corresponding characters via US-ASCII in order to complete the syntax | ||||
| rules. | ||||
| A URI is composed from a limited set of characters consisting of | A URI is composed from a limited set of characters consisting of | |||
| digits, letters, and a few graphic symbols. A reserved (Section 2.2) | digits, letters, and a few graphic symbols. A reserved subset of | |||
| subset of those characters may be used to delimit syntax components | those characters may be used to delimit syntax components within a | |||
| within a URI, while the remaining characters, including both the | URI, while the remaining characters, including both the unreserved | |||
| unreserved (Section 2.3) set and those reserved characters not acting | set and those reserved characters not acting as delimiters, define | |||
| as delimiters, define each component's data. | each component's identifying data. | |||
| 2.1 Percent Encoding | 2.1 Percent-Encoding | |||
| A percent-encoding mechanism is used to represent a data octet in a | A percent-encoding mechanism is used to represent a data octet in a | |||
| component when that octet's corresponding character is outside the | component when that octet's corresponding character is outside the | |||
| allowed set or is being used as a delimiter of, or within, the | allowed set or is being used as a delimiter of, or within, the | |||
| component. A percent-encoded octet is encoded as a character triplet, | component. A percent-encoded octet is encoded as a character triplet, | |||
| consisting of the percent character "%" followed by the two | consisting of the percent character "%" followed by the two | |||
| hexadecimal digits representing that octet's numeric value. For | hexadecimal digits representing that octet's numeric value. For | |||
| example, "%20" is the percent-encoding for the binary octet | example, "%20" is the percent-encoding for the binary octet | |||
| "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | |||
| character (SP). | character (SP). Section 2.4 describes when percent-encoding and | |||
| decoding is applied. | ||||
| pct-encoded = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| The uppercase hexadecimal digits 'A' through 'F' are equivalent to | The uppercase hexadecimal digits 'A' through 'F' are equivalent to | |||
| the lowercase digits 'a' through 'f', respectively. Two URIs that | the lowercase digits 'a' through 'f', respectively. Two URIs that | |||
| differ only in the case of hexadecimal digits used in percent-encoded | differ only in the case of hexadecimal digits used in percent-encoded | |||
| octets are equivalent. For consistency, URI producers and | octets are equivalent. For consistency, URI producers and | |||
| normalizers should use uppercase hexadecimal digits for all | normalizers should use uppercase hexadecimal digits for all | |||
| percent-encodings. | percent-encodings. | |||
| 2.2 Reserved Characters | 2.2 Reserved Characters | |||
| URIs include components and sub-components that are delimited by | URIs include components and subcomponents that are delimited by | |||
| characters in the "reserved" set. These characters are called | characters in the "reserved" set. These characters are called | |||
| "reserved" because they may (or may not) be defined as delimiters by | "reserved" because they may (or may not) be defined as delimiters by | |||
| the generic syntax, by each scheme-specific syntax, or by the | the generic syntax, by each scheme-specific syntax, or by the | |||
| implementation-specific syntax of a URI's dereferencing algorithm. | implementation-specific syntax of a URI's dereferencing algorithm. | |||
| If data for a URI component would conflict with a reserved | If data for a URI component would conflict with a reserved | |||
| character's purpose as a delimiter, then the conflicting data must be | character's purpose as a delimiter, then the conflicting data must be | |||
| percent-encoded before forming the URI. | percent-encoded before forming the URI. | |||
| reserved = gen-delims / sub-delims | reserved = gen-delims / sub-delims | |||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |||
| / "*" / "+" / "," / ";" / "=" | / "*" / "+" / "," / ";" / "=" | |||
| A subset of the reserved characters (gen-delims) is used as | The purpose of reserved characters is to provide a set of delimiting | |||
| delimiters of the generic URI components described in Section 3. A | characters that are distinguishable from other data within a URI. | |||
| component's ABNF syntax rule will not use the reserved or gen-delims | ||||
| rule names directly; instead, each syntax rule lists those reserved | ||||
| characters that are allowed within that component (i.e., not | ||||
| delimiting it). The allowed reserved characters, including those in | ||||
| the sub-delims set and any of the gen-delims that are not a delimiter | ||||
| of that component, are reserved for use as sub-component delimiters | ||||
| within the component. Only the most common sub-components are | ||||
| defined by this specification; other sub-components may be defined by | ||||
| a URI scheme's specification, or by the implementation-specific | ||||
| syntax of a URI's dereferencing algorithm, provided that such | ||||
| sub-components are delimited by characters in that component's | ||||
| reserved set. If no such delimiting role has been assigned, then a | ||||
| reserved character appearing in a component represents the data octet | ||||
| corresponding to its encoding in US-ASCII. | ||||
| URIs that differ in the replacement of a reserved character with its | URIs that differ in the replacement of a reserved character with its | |||
| corresponding percent-encoded octet are not equivalent. | corresponding percent-encoded octet are not equivalent. | |||
| Percent-encoding a reserved character, or decoding a percent-encoded | Percent-encoding a reserved character, or decoding a percent-encoded | |||
| octet that corresponds to a reserved character, will change how the | octet that corresponds to a reserved character, will change how the | |||
| URI is interpreted by most applications. | URI is interpreted by most applications. Thus, characters in the | |||
| reserved set are protected from normalization and are therefore safe | ||||
| to be used by scheme-specific and producer-specific algorithms for | ||||
| delimiting data subcomponents within a URI. | ||||
| 2.3 Unreserved Characters | A subset of the reserved characters (gen-delims) is used as | |||
| delimiters of the generic URI components described in Section 3. A | ||||
| component's ABNF syntax rule will not use the reserved or gen-delims | ||||
| rule names directly; instead, each syntax rule lists the characters | ||||
| allowed within that component (i.e., not delimiting it) and any of | ||||
| those characters that are also in the reserved set are "reserved" for | ||||
| use as subcomponent delimiters within the component. Only the most | ||||
| common subcomponents are defined by this specification; other | ||||
| subcomponents may be defined by a URI scheme's specification, or by | ||||
| the implementation-specific syntax of a URI's dereferencing | ||||
| algorithm, provided that such subcomponents are delimited by | ||||
| characters in the reserved set allowed within that component. | ||||
| URI producing applications should percent-encode data octets that | ||||
| correspond to characters in the reserved set. However, if a reserved | ||||
| character is found in a URI component and no delimiting role is known | ||||
| for that character, then it should be interpreted as representing the | ||||
| data octet corresponding to that character's encoding in US-ASCII. | ||||
| 2.3 Unreserved Characters | ||||
| Characters that are allowed in a URI but do not have a reserved | Characters that are allowed in a URI but do not have a reserved | |||
| purpose are called unreserved. These include uppercase and lowercase | purpose are called unreserved. These include uppercase and lowercase | |||
| letters, decimal digits, hyphen, period, underscore, and tilde. | letters, decimal digits, hyphen, period, underscore, and tilde. | |||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| URIs that differ in the replacement of an unreserved character with | URIs that differ in the replacement of an unreserved character with | |||
| its corresponding percent-encoded octet are equivalent: they identify | its corresponding percent-encoded octet are equivalent: they identify | |||
| the same resource. However, percent-encoded unreserved characters | the same resource. However, percent-encoded unreserved characters | |||
| may change the result of some URI comparisons (Section 6), | may change the result of some URI comparisons (Section 6), | |||
| potentially leading to incorrect or inefficient behavior. For | potentially leading to incorrect or inefficient behavior. For | |||
| consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A | consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A | |||
| and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore | and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore | |||
| (%5F), or tilde (%7E) should not be created by URI producers and, | (%5F), or tilde (%7E) should not be created by URI producers and, | |||
| when found in a URI, should be decoded to their corresponding | when found in a URI, should be decoded to their corresponding | |||
| unreserved character by URI normalizers. | unreserved character by URI normalizers. | |||
| 2.4 When to Encode or Decode | 2.4 When to Encode or Decode | |||
| Under normal circumstances, the only time that octets within a URI | Under normal circumstances, the only time that octets within a URI | |||
| are percent-encoded is during the process of producing the URI from | are percent-encoded is during the process of producing the URI from | |||
| its component parts. It is during that process that an | its component parts. It is during that process that an | |||
| implementation determines which of the reserved characters are to be | implementation determines which of the reserved characters are to be | |||
| used as sub-component delimiters and which can be safely used as | used as subcomponent delimiters and which can be safely used as data. | |||
| data. Once produced, a URI is always in its percent-encoded form. | Once produced, a URI is always in its percent-encoded form. | |||
| When a URI is dereferenced, the components and sub-components | When a URI is dereferenced, the components and subcomponents | |||
| significant to the scheme-specific dereferencing process (if any) | significant to the scheme-specific dereferencing process (if any) | |||
| must be parsed and separated before the percent-encoded octets within | must be parsed and separated before the percent-encoded octets within | |||
| those components can be safely decoded, since otherwise the data may | those components can be safely decoded, since otherwise the data may | |||
| be mistaken for component delimiters. The only exception is for | be mistaken for component delimiters. The only exception is for | |||
| percent-encoded octets corresponding to characters in the unreserved | percent-encoded octets corresponding to characters in the unreserved | |||
| set, which can be decoded at any time. For example, the octet | set, which can be decoded at any time. For example, the octet | |||
| corresponding to the tilde ("~") character is often encoded as "%7E" | corresponding to the tilde ("~") character is often encoded as "%7E" | |||
| by older URI processing software; the "%7E" can be replaced by "~" | by older URI processing software; the "%7E" can be replaced by "~" | |||
| without changing its interpretation. | without changing its interpretation. | |||
| Because the percent ("%") character serves as the indicator for | Because the percent ("%") character serves as the indicator for | |||
| percent-encoded octets, it must be percent-encoded as "%25" in order | percent-encoded octets, it must be percent-encoded as "%25" in order | |||
| for that octet to be used as data within a URI. Implementations must | for that octet to be used as data within a URI. Implementations must | |||
| not percent-encode or decode the same string more than once, since | not percent-encode or decode the same string more than once, since | |||
| decoding an already decoded string might lead to misinterpreting a | decoding an already decoded string might lead to misinterpreting a | |||
| percent data octet as the beginning of a percent-encoding, or vice | percent data octet as the beginning of a percent-encoding, or vice | |||
| versa in the case of percent-encoding an already percent-encoded | versa in the case of percent-encoding an already percent-encoded | |||
| string. | string. | |||
| URI characters serve as an external interface for identification | 2.5 Identifying Data | |||
| between systems. A system that internally provides identifiers in | ||||
| the form of a different character encoding, such as EBCDIC, will | ||||
| generally perform character translation of textual identifiers to | ||||
| UTF-8 [RFC3629] (or some other superset of the US-ASCII character | ||||
| encoding) at an internal interface, since that results in more | ||||
| meaningful identifiers than simply percent-encoding the original | ||||
| octets. When interpreting an incoming URI on such an interface, | ||||
| percent-encoded octets must be decoded before the reverse transcoding | ||||
| can be applied. | ||||
| In some cases, the interface between a URI component and the | URI characters provide identifying data for each of the URI | |||
| identifying data it has been crafted to represent is much less direct | components, serving as an external interface for identification | |||
| than a character encoding translation. For example, portions of a | between systems. Although the presence and nature of the URI | |||
| URI might reflect a query on non-ASCII data, numeric coordinates on a | production interface is hidden from clients that use its URIs, and | |||
| map, etc. Likewise, a URI scheme may define components with | thus beyond the scope of the interoperability requirements defined by | |||
| additional encoding requirements, such as base64, that are applied | this specification, it is a frequent source of confusion and errors | |||
| prior to forming the component and producing the URI. | in the interpretation of URI character issues. Implementers need to | |||
| be aware that there are multiple character encodings involved in the | ||||
| production and transmission of URIs: local name and data encoding, | ||||
| public interface encoding, URI character encoding, data format | ||||
| encoding, and protocol encoding. | ||||
| When a URI scheme defines a component that represents textual data | The first encoding of identifying data is the one in which the local | |||
| consisting of characters from the Unicode (ISO/IEC 10646-1) character | names or data are stored. URI producing applications (a.k.a., origin | |||
| set, the data should be encoded first as octets according to the | servers) will typically use the local encoding as the basis for | |||
| UTF-8 character encoding [RFC3629], and then only those octets that | producing meaningful names. The URI producer will transform the | |||
| do not correspond to characters in the unreserved set should be | local encoding to one that is suitable for a public interface, and | |||
| then transform the public interface encoding into the restricted set | ||||
| of URI characters (reserved, unreserved, and percent-encodings). | ||||
| Those characters are, in turn, encoded as octets to be used as a | ||||
| reference within a data format (e.g., a document charset), and such | ||||
| data formats are often subsequently encoded for transmission over | ||||
| Internet protocols. | ||||
| For most systems, an unreserved character appearing within a URI | ||||
| component is interpreted as representing the data octet corresponding | ||||
| to that character's encoding in US-ASCII. Consumers of URIs assume | ||||
| that the letter "X" corresponds to the octet "01011000", and there is | ||||
| no harm in making that assumption even when it is incorrect. A | ||||
| system that internally provides identifiers in the form of a | ||||
| different character encoding, such as EBCDIC, will generally perform | ||||
| character translation of textual identifiers to UTF-8 [RFC3629] (or | ||||
| some other superset of the US-ASCII character encoding) at an | ||||
| internal interface, thereby providing more meaningful identifiers | ||||
| than simply percent-encoding the original octets. | ||||
| For example, consider an information service that provides data, | ||||
| stored locally using an EBCDIC-based filesystem, to clients on the | ||||
| Internet through an HTTP server. When an author creates a file on | ||||
| that filesystem with the name "Laguna Beach", their expectation is | ||||
| that the "http" URI corresponding to that resource would also contain | ||||
| the meaningful string "Laguna%20Beach". If, however, that server | ||||
| produces URIs using an overly simplistic raw octet mapping, then the | ||||
| result would be a URI containing | ||||
| "%D3%81%87%A4%95%81@%C2%85%81%83%88". An internal transcoding | ||||
| interface fixes that problem by transcoding the local name to a | ||||
| superset of US-ASCII prior to producing the URI. Naturally, proper | ||||
| interpretation of an incoming URI on such an interface requires that | ||||
| percent-encoded octets be decoded (e.g., "%20" to SP) before the | ||||
| reverse transcoding is applied to obtain the local name. | ||||
| In some cases, the internal interface between a URI component and the | ||||
| identifying data that it has been crafted to represent is much less | ||||
| direct than a character encoding translation. For example, portions | ||||
| of a URI might reflect a query on non-ASCII data, numeric coordinates | ||||
| on a map, etc. Likewise, a URI scheme may define components with | ||||
| additional encoding requirements that are applied prior to forming | ||||
| the component and producing the URI. | ||||
| When a new URI scheme defines a component that represents textual | ||||
| data consisting of characters from the Unicode (ISO/IEC 10646-1) | ||||
| character set, the data should be encoded first as octets according | ||||
| to the UTF-8 character encoding [RFC3629], and then only those octets | ||||
| that do not correspond to characters in the unreserved set should be | ||||
| percent-encoded. For example, the character A would be represented | percent-encoded. For example, the character A would be represented | |||
| as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be | as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be | |||
| represented as "%C3%80", and the character KATAKANA LETTER A would be | represented as "%C3%80", and the character KATAKANA LETTER A would be | |||
| represented as "%E3%82%A2". | represented as "%E3%82%A2". | |||
| 3. Syntax Components | 3. Syntax Components | |||
| The generic URI syntax consists of a hierarchical sequence of | The generic URI syntax consists of a hierarchical sequence of | |||
| components referred to as the scheme, authority, path, query, and | components referred to as the scheme, authority, path, query, and | |||
| fragment. | fragment. | |||
| URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment] | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |||
| hier-part = "//" authority path-abempty | ||||
| / path-abs | ||||
| / path-rootless | ||||
| / path-empty | ||||
| The scheme and path components are required, though path may be empty | The scheme and path components are required, though path may be empty | |||
| (no characters). An ABNF-driven parser will find that the border | (no characters). When authority is present, the path must either be | |||
| between authority and path is ambiguous; they are disambiguated by | empty or begin with a slash ("/") character. When authority is not | |||
| the "first-match-wins" (a.k.a. "greedy") algorithm. In other words, | present, the path cannot begin with two slash characters ("//"). | |||
| if authority is present then the first segment of the path must be | These restrictions result in five different ABNF rules for a path | |||
| empty. | (Section 3.3), only one of which will match any given URI reference. | |||
| The following are two example URIs and their component parts: | The following are two example URIs and their component parts: | |||
| foo://example.com:8042/over/there?name=ferret#nose | foo://example.com:8042/over/there?name=ferret#nose | |||
| \_/ \______________/\_________/ \_________/ \__/ | \_/ \______________/\_________/ \_________/ \__/ | |||
| | | | | | | | | | | | | |||
| scheme authority path query fragment | scheme authority path query fragment | |||
| | _____________________|__ | | _____________________|__ | |||
| / \ / \ | / \ / \ | |||
| urn:example:animal:ferret:nose | urn:example:animal:ferret:nose | |||
| 3.1 Scheme | 3.1 Scheme | |||
| Each URI begins with a scheme name that refers to a specification for | Each URI begins with a scheme name that refers to a specification for | |||
| assigning identifiers within that scheme. As such, the URI syntax is | assigning identifiers within that scheme. As such, the URI syntax is | |||
| a federated and extensible naming system wherein each scheme's | a federated and extensible naming system wherein each scheme's | |||
| specification may further restrict the syntax and semantics of | specification may further restrict the syntax and semantics of | |||
| identifiers using that scheme. | identifiers using that scheme. | |||
| Scheme names consist of a sequence of characters beginning with a | Scheme names consist of a sequence of characters beginning with a | |||
| letter and followed by any combination of letters, digits, plus | letter and followed by any combination of letters, digits, plus | |||
| ("+"), period ("."), or hyphen ("-"). Although scheme is | ("+"), period ("."), or hyphen ("-"). Although scheme is | |||
| skipping to change at page 16, line 4 | skipping to change at page 16, line 19 | |||
| specify schemes must do so using lowercase letters. An | specify schemes must do so using lowercase letters. An | |||
| implementation should accept uppercase letters as equivalent to | implementation should accept uppercase letters as equivalent to | |||
| lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for | |||
| the sake of robustness, but should only produce lowercase scheme | the sake of robustness, but should only produce lowercase scheme | |||
| names, for consistency. | names, for consistency. | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| Individual schemes are not specified by this document. The process | Individual schemes are not specified by this document. The process | |||
| for registration of new URI schemes is defined separately by | for registration of new URI schemes is defined separately by | |||
| [RFC2717]. The scheme registry maintains the mapping between scheme | [RFC2717]. The scheme registry maintains the mapping between scheme | |||
| names and their specifications. Advice for designers of new URI | names and their specifications. Advice for designers of new URI | |||
| schemes can be found in [RFC2718]. | schemes can be found in [RFC2718]. | |||
| When presented with a URI that violates one or more scheme-specific | When presented with a URI that violates one or more scheme-specific | |||
| restrictions, the scheme-specific resolution process should flag the | restrictions, the scheme-specific resolution process should flag the | |||
| reference as an error rather than ignore the unused parts; doing so | reference as an error rather than ignore the unused parts; doing so | |||
| reduces the number of equivalent URIs and helps detect abuses of the | reduces the number of equivalent URIs and helps detect abuses of the | |||
| generic syntax that might indicate the URI has been constructed to | generic syntax that might indicate the URI has been constructed to | |||
| mislead the user (Section 7.6). | mislead the user (Section 7.6). | |||
| 3.2 Authority | 3.2 Authority | |||
| Many URI schemes include a hierarchical element for a naming | Many URI schemes include a hierarchical element for a naming | |||
| authority, such that governance of the name space defined by the | authority, such that governance of the name space defined by the | |||
| remainder of the URI is delegated to that authority (which may, in | remainder of the URI is delegated to that authority (which may, in | |||
| turn, delegate it further). The generic syntax provides a common | turn, delegate it further). The generic syntax provides a common | |||
| means for distinguishing an authority based on a registered name or | means for distinguishing an authority based on a registered name or | |||
| server address, along with optional port and user information. | server address, along with optional port and user information. | |||
| The authority component is preceded by a double slash ("//") and is | The authority component is preceded by a double slash ("//") and is | |||
| terminated by the next slash ("/"), question mark ("?"), or number | terminated by the next slash ("/"), question mark ("?"), or number | |||
| sign ("#") character, or by the end of the URI. | sign ("#") character, or by the end of the URI. | |||
| authority = [ userinfo "@" ] host [ ":" port ] | authority = [ userinfo "@" ] host [ ":" port ] | |||
| URI producers and normalizers should omit the "@" delimiter that | URI producers and normalizers should omit the ":" delimiter that | |||
| separates userinfo from host if the userinfo component is empty (zero | separates host from port if the port component is empty. Some schemes | |||
| length) and should omit the ":" delimiter that separates host from | do not allow the userinfo and/or port subcomponents. | |||
| port if the port component is empty. Some schemes do not allow the | ||||
| userinfo and/or port sub-components. | ||||
| 3.2.1 User Information | If a URI contains an authority component, then the path component | |||
| must either be empty or begin with a slash ("/") character. | ||||
| Non-validating parsers (those that merely separate a URI reference | ||||
| into its major components) will often ignore the subcomponent | ||||
| structure of authority, treating it as an opaque string from the | ||||
| double-slash to the first terminating delimiter, until such time as | ||||
| the URI is dereferenced. | ||||
| The userinfo sub-component may consist of a user name and, | 3.2.1 User Information | |||
| optionally, scheme-specific information about how to gain | ||||
| authorization to access the resource. The user information, if | The userinfo subcomponent may consist of a user name and, optionally, | |||
| present, is followed by a commercial at-sign ("@") that delimits it | scheme-specific information about how to gain authorization to access | |||
| from the host. | the resource. The user information, if present, is followed by a | |||
| commercial at-sign ("@") that delimits it from the host. | ||||
| userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |||
| Use of the format "user:password" in the userinfo field is | Use of the format "user:password" in the userinfo field is | |||
| deprecated. Applications should not render as clear text any data | deprecated. Applications should not render as clear text any data | |||
| after the first colon (":") character found within a userinfo | after the first colon (":") character found within a userinfo | |||
| sub-component unless such data is the empty string (indicating no | subcomponent unless the data after the colon is the empty string | |||
| password) or "anonymous". Applications may choose to ignore or reject | (indicating no password). Applications may choose to ignore or reject | |||
| such data when received as part of a reference, and should reject the | such data when received as part of a reference, and should reject the | |||
| storage of such data in unencrypted form. The passing of | storage of such data in unencrypted form. The passing of | |||
| authentication information in clear text has proven to be a security | authentication information in clear text has proven to be a security | |||
| risk in almost every case where it has been used. | risk in almost every case where it has been used. | |||
| Applications that render a URI for the sake of user feedback, such as | Applications that render a URI for the sake of user feedback, such as | |||
| in graphical hypertext browsing, should render userinfo in a way that | in graphical hypertext browsing, should render userinfo in a way that | |||
| is distinguished from the rest of a URI, when feasible. Such | is distinguished from the rest of a URI, when feasible. Such | |||
| rendering will assist the user in cases where the userinfo has been | rendering will assist the user in cases where the userinfo has been | |||
| misleadingly crafted to look like a trusted domain name (Section | misleadingly crafted to look like a trusted domain name (Section | |||
| 7.6). | 7.6). | |||
| 3.2.2 Host | 3.2.2 Host | |||
| The host sub-component of authority is identified by an IP literal | The host subcomponent of authority is identified by an IP literal | |||
| encapsulated within square brackets, an IPv4 address in | encapsulated within square brackets, an IPv4 address in | |||
| dotted-decimal form, or a host name. | dotted-decimal form, or a registered name. The host subcomponent is | |||
| case-insensitive. The presence of a host subcomponent within a URI | ||||
| does not imply that the scheme requires access to the given host on | ||||
| the Internet. In many cases, the host syntax is used only for the | ||||
| sake of reusing the existing registration process created and | ||||
| deployed for DNS, thus obtaining a globally unique name without the | ||||
| cost of deploying another registry. However, such use comes with its | ||||
| own costs: domain name ownership may change over time for reasons not | ||||
| anticipated by the URI producer. In other cases, the data within the | ||||
| host component identifies a registered name that has nothing to do | ||||
| with an Internet host. We use the name "host" for the ABNF rule | ||||
| because that is its most common purpose, not its only purpose, and | ||||
| thus should not be considered as semantically limiting the data | ||||
| within it. | ||||
| host = IP-literal / IPv4address / reg-name | host = IP-literal / IPv4address / reg-name | |||
| The syntax rule for host is ambiguous because it does not completely | The syntax rule for host is ambiguous because it does not completely | |||
| distinguish between an IPv4address and a reg-name. Again, the | distinguish between an IPv4address and a reg-name. In order to | |||
| "first-match-wins" algorithm applies: If host matches the rule for | disambiguate the syntax, we apply the "first-match-wins" algorithm: | |||
| IPv4address, then it should be considered an IPv4 address literal and | If host matches the rule for IPv4address, then it should be | |||
| not a reg-name. Although host is case-insensitive, producers and | considered an IPv4 address literal and not a reg-name. Although host | |||
| normalizers should use lowercase for host names and hexadecimal | is case-insensitive, producers and normalizers should use lowercase | |||
| addresses for the sake of uniformity, while only using uppercase | for registered names and hexadecimal addresses for the sake of | |||
| letters for percent-encodings. | uniformity, while only using uppercase letters for percent-encodings. | |||
| A host identified by an Internet Protocol literal address, version 6 | A host identified by an Internet Protocol literal address, version 6 | |||
| [RFC3513] or later, is distinguished by enclosing the IP literal | [RFC3513] or later, is distinguished by enclosing the IP literal | |||
| within square brackets ("[" and "]"). This is the only place where | within square brackets ("[" and "]"). This is the only place where | |||
| square bracket characters are allowed in the URI syntax. In | square bracket characters are allowed in the URI syntax. In | |||
| anticipation of future, as-yet-undefined IP literal address formats, | anticipation of future, as-yet-undefined IP literal address formats, | |||
| an optional version flag may be used to indicate such a format | an optional version flag may be used to indicate such a format | |||
| explicitly rather than relying on heuristic determination. | explicitly rather than relying on heuristic determination. | |||
| IP-literal = "[" ( IPv6address / IPvFuture ) "]" | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| The version flag does not indicate the IP version; rather, it | The version flag does not indicate the IP version; rather, it | |||
| indicates future versions of the literal format. As such, | indicates future versions of the literal format. As such, | |||
| implementations must not provide the version flag for existing IPv4 | implementations must not provide the version flag for existing IPv4 | |||
| and IPv6 literal addresses. If a URI containing an IP-literal that | and IPv6 literal addresses. If a URI containing an IP-literal that | |||
| starts with "v" (case-insensitive), indicating that the version flag | starts with "v" (case-insensitive), indicating that the version flag | |||
| is present, is dereferenced by an application that does not know the | is present, is dereferenced by an application that does not know the | |||
| meaning of that version flag, then the application should return an | meaning of that version flag, then the application should return an | |||
| appropriate error for "address mechanism not supported". | appropriate error for "address mechanism not supported". | |||
| skipping to change at page 19, line 5 | skipping to change at page 19, line 37 | |||
| grammar. | grammar. | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| / %x31-39 DIGIT ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| / "2" %x30-34 DIGIT ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / "25" %x30-35 ; 250-255 | / "25" %x30-35 ; 250-255 | |||
| A host identified by a registered name is a string of characters that | A host identified by a registered name is a sequence of characters | |||
| is intended for lookup within a locally-defined host or service name | that is usually intended for lookup within a locally-defined host or | |||
| registry. The most common of such registry mechanisms is the Domain | service name registry, though the URI's scheme-specific semantics may | |||
| Name System (DNS), as defined by Section 3 of [RFC1034] and Section | require that a specific registry (or fixed name table) be used | |||
| 2.1 of [RFC1123]. A DNS name consists of a sequence of domain labels | instead. The most common name registry mechanism is the Domain Name | |||
| System (DNS). A registered name intended for lookup in the DNS uses | ||||
| the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of | ||||
| [RFC1123]. Such a name consists of a sequence of domain labels | ||||
| separated by ".", each domain label starting and ending with an | separated by ".", each domain label starting and ending with an | |||
| alphanumeric character and possibly also containing "-" characters. | alphanumeric character and possibly also containing "-" characters. | |||
| The rightmost domain label of a fully qualified domain name in DNS | The rightmost domain label of a fully qualified domain name in DNS | |||
| may be followed by a single "." and should be followed by one if it | may be followed by a single "." and should be followed by one if it | |||
| is necessary to distinguish between the complete domain name and some | is necessary to distinguish between the complete domain name and some | |||
| local domain. | local domain. | |||
| reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | |||
| If the host component is defined and the registered name is empty | If the URI scheme defines a default for host, then that default | |||
| (zero length), then the name defaults to "localhost" (Section 6.2.3 | applies when the host subcomponent is undefined or when the | |||
| discusses how this should be normalized). If "localhost" is not | registered name is empty (zero length). For example, the "file" URI | |||
| determined by a host name lookup, then it should be interpreted to | scheme is defined such that no authority, an empty host, and | |||
| mean the machine on which the URI is being resolved. | "localhost" all mean the end-user's machine, whereas the "http" | |||
| scheme considers a missing authority or empty host to be invalid. | ||||
| This specification does not mandate a particular registered name | This specification does not mandate a particular registered name | |||
| lookup technology and therefore does not restrict the syntax of | lookup technology and therefore does not restrict the syntax of | |||
| reg-name beyond that necessary for interoperability. Instead, it | reg-name beyond that necessary for interoperability. Instead, it | |||
| delegates the issue of host name syntax conformance to the operating | delegates the issue of registered name syntax conformance to the | |||
| system of each application performing URI resolution, and that | operating system of each application performing URI resolution, and | |||
| operating system decides what it will allow for the purpose of host | that operating system decides what it will allow for the purpose of | |||
| identification. A URI resolution implementation might use DNS, host | host identification. A URI resolution implementation might use DNS, | |||
| tables, yellow pages, NetInfo, WINS, or any other system for lookup | host tables, yellow pages, NetInfo, WINS, or any other system for | |||
| of host and service names. However, a globally-scoped naming system, | lookup of registered names. However, a globally-scoped naming system, | |||
| such as DNS fully-qualified domain names, is necessary for URIs that | such as DNS fully-qualified domain names, is necessary for URIs that | |||
| are intended to have global scope. URI producers should use host | are intended to have global scope. URI producers should use names | |||
| names that conform to the DNS syntax, even when use of DNS is not | that conform to the DNS syntax, even when use of DNS is not | |||
| immediately apparent. | immediately apparent. | |||
| The reg-name syntax allows percent-encoded octets in order to | The reg-name syntax allows percent-encoded octets in order to | |||
| represent non-ASCII host or service names in a uniform way that is | represent non-ASCII registered names in a uniform way that is | |||
| independent of the underlying name resolution technology; such octets | independent of the underlying name resolution technology; such | |||
| must represent characters encoded in the UTF-8 character encoding | non-ASCII characters must first be encoded according to UTF-8 | |||
| [RFC3629] prior to being percent-encoded. When a non-ASCII host name | [RFC3629] and then each octet of the corresponding UTF-8 sequence | |||
| represents an internationalized domain name intended for resolution | must be percent-encoded to be represented as URI characters. URI | |||
| via DNS, the name must be transformed to the IDNA encoding [RFC3490] | producing applications must not use percent-encoding in host unless | |||
| prior to name lookup. URI producers should provide such host names in | it is used to represent a UTF-8 character sequence. When a non-ASCII | |||
| the IDNA encoding, rather than a percent-encoding, if they wish to | registered name represents an internationalized domain name intended | |||
| maximize interoperability with legacy URI resolvers. | for resolution via the DNS, the name must be transformed to the IDNA | |||
| encoding [RFC3490] prior to name lookup. URI producers should | ||||
| The presence of host within a URI does not imply that the scheme | provide such registered names in the IDNA encoding, rather than a | |||
| requires access to the given host on the Internet. In many cases, | percent-encoding, if they wish to maximize interoperability with | |||
| the host syntax is used only for the sake of reusing the existing | legacy URI resolvers. | |||
| registration process created and deployed for DNS, thus obtaining a | ||||
| globally unique name without the cost of deploying another registry. | ||||
| However, such use comes with its own costs: domain name ownership may | ||||
| change over time for reasons not anticipated by the URI producer. | ||||
| 3.2.3 Port | 3.2.3 Port | |||
| The port sub-component of authority is designated by an optional port | The port subcomponent of authority is designated by an optional port | |||
| number in decimal following the host and delimited from it by a | number in decimal following the host and delimited from it by a | |||
| single colon (":") character. | single colon (":") character. | |||
| port = *DIGIT | port = *DIGIT | |||
| A scheme may define a default port. For example, the "http" scheme | A scheme may define a default port. For example, the "http" scheme | |||
| defines a default port of "80", corresponding to its reserved TCP | defines a default port of "80", corresponding to its reserved TCP | |||
| port number. The type of port designated by the port number (e.g., | port number. The type of port designated by the port number (e.g., | |||
| TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers | TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers | |||
| and normalizers should omit the port component and its ":" delimiter | and normalizers should omit the port component and its ":" delimiter | |||
| if port is empty or its value would be the same as the scheme's | if port is empty or its value would be the same as the scheme's | |||
| default. | default. | |||
| 3.3 Path | 3.3 Path | |||
| The path component contains data, usually organized in hierarchical | The path component contains data, usually organized in hierarchical | |||
| form, that, along with data in the non-hierarchical query component | form, that, along with data in the non-hierarchical query component | |||
| (Section 3.4), serves to identify a resource within the scope of the | (Section 3.4), serves to identify a resource within the scope of the | |||
| URI's scheme and naming authority (if any). If a URI contains an | URI's scheme and naming authority (if any). The path is terminated by | |||
| authority component, then the initial path segment must be empty | the first question mark ("?") or number sign ("#") character, or by | |||
| (i.e., the path must begin with a slash ("/") character or be | the end of the URI. | |||
| entirely empty). The path is terminated by the first question mark | ||||
| ("?") or number sign ("#") character, or by the end of the URI. | If a URI contains an authority component, then the path component | |||
| must either be empty or begin with a slash ("/") character. If a URI | ||||
| does not contain an authority component, then the path cannot begin | ||||
| with two slash characters ("//"). In addition, a URI reference | ||||
| (Section 4.1) may begin with a relative path, in which case the first | ||||
| path segment cannot contain a colon (":") character. The ABNF | ||||
| requires five separate rules to disambiguate these cases, only one of | ||||
| which will match a given URI reference. We use the generic term | ||||
| "path component" to describe the URI substring that is matched by the | ||||
| parser to one of these rules. | ||||
| path = path-abempty ; begins with "/" or is empty | ||||
| / path-abs ; begins with "/" but not "//" | ||||
| / path-noscheme ; begins with a non-colon segment | ||||
| / path-rootless ; begins with a segment | ||||
| / path-empty ; zero characters | ||||
| path-abempty = *( "/" segment ) | ||||
| path-abs = "/" [ segment-nz *( "/" segment ) ] | ||||
| path-noscheme = segment-nzc *( "/" segment ) | ||||
| path-rootless = segment-nz *( "/" segment ) | ||||
| path-empty = 0<pchar> | ||||
| path = segment *( "/" segment ) | ||||
| segment = *pchar | segment = *pchar | |||
| segment-nz = 1*pchar | ||||
| segment-nzc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | ||||
| pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |||
| A path consists of a sequence of path segments separated by a slash | A path consists of a sequence of path segments separated by a slash | |||
| ("/") character. A path is always defined for a URI, though the | ("/") character. A path is always defined for a URI, though the | |||
| defined path may be empty (zero length). Use of the slash character | defined path may be empty (zero length). Use of the slash character | |||
| to indicate hierarchy is only required when a URI will be used as the | to indicate hierarchy is only required when a URI will be used as the | |||
| context for relative references. For example, the URI | context for relative references. For example, the URI | |||
| <mailto:fred@example.com> has a path of "fred@example.com", whereas | <mailto:fred@example.com> has a path of "fred@example.com", whereas | |||
| the URI <foo://info.example.com?fred> has an empty path. | the URI <foo://info.example.com?fred> has an empty path. | |||
| The path segments "." and ".." are defined for relative reference | The path segments "." and "..", also known as dot-segments, are | |||
| within the path name hierarchy. They are intended for use at the | defined for relative reference within the path name hierarchy. They | |||
| beginning of a relative path reference (Section 4.2) for indicating | are intended for use at the beginning of a relative path reference | |||
| relative position within the hierarchical tree of names. This is | (Section 4.2) for indicating relative position within the | |||
| similar to their role within some operating systems' file directory | hierarchical tree of names. This is similar to their role within | |||
| structure to indicate the current directory and parent directory, | some operating systems' file directory structure to indicate the | |||
| respectively. However, unlike a file system, these dot-segments are | current directory and parent directory, respectively. However, unlike | |||
| only interpreted within the URI path hierarchy and are removed as | a file system, these dot-segments are only interpreted within the URI | |||
| part of the resolution process (Section 5.2). | path hierarchy and are removed as part of the resolution process | |||
| (Section 5.2). | ||||
| Aside from dot-segments in hierarchical paths, a path segment is | Aside from dot-segments in hierarchical paths, a path segment is | |||
| considered opaque by the generic syntax. URI-producing applications | considered opaque by the generic syntax. URI-producing applications | |||
| often use the reserved characters allowed in a segment for the | often use the reserved characters allowed in a segment for the | |||
| purpose of delimiting scheme-specific or dereference-handler-specific | purpose of delimiting scheme-specific or dereference-handler-specific | |||
| sub-components. For example, the semicolon (";") and equals ("=") | subcomponents. For example, the semicolon (";") and equals ("=") | |||
| reserved characters are often used for delimiting parameters and | reserved characters are often used for delimiting parameters and | |||
| parameter values applicable to that segment. The comma (",") | parameter values applicable to that segment. The comma (",") | |||
| reserved character is often used for similar purposes. For example, | reserved character is often used for similar purposes. For example, | |||
| one URI producer might use a segment like "name;v=1.1" to indicate a | one URI producer might use a segment like "name;v=1.1" to indicate a | |||
| reference to version 1.1 of "name", whereas another might use a | reference to version 1.1 of "name", whereas another might use a | |||
| segment like "name,1.1" to indicate the same. Parameter types may be | segment like "name,1.1" to indicate the same. Parameter types may be | |||
| defined by scheme-specific semantics, but in most cases the syntax of | defined by scheme-specific semantics, but in most cases the syntax of | |||
| a parameter is specific to the implementation of the URI's | a parameter is specific to the implementation of the URI's | |||
| dereferencing algorithm. | dereferencing algorithm. | |||
| 3.4 Query | 3.4 Query | |||
| The query component contains non-hierarchical data that, along with | The query component contains non-hierarchical data that, along with | |||
| data in the path component (Section 3.3), serves to identify a | data in the path component (Section 3.3), serves to identify a | |||
| resource within the scope of the URI's scheme and naming authority | resource within the scope of the URI's scheme and naming authority | |||
| (if any). The query component is indicated by the first question mark | (if any). The query component is indicated by the first question mark | |||
| ("?") character and terminated by a number sign ("#") character or by | ("?") character and terminated by a number sign ("#") character or by | |||
| the end of the URI. | the end of the URI. | |||
| query = *( pchar / "/" / "?" ) | query = *( pchar / "/" / "?" ) | |||
| The characters slash ("/") and question mark ("?") may represent data | The characters slash ("/") and question mark ("?") may represent data | |||
| within the query component, but should not be used as such within a | within the query component. Beware that some older, erroneous | |||
| URI that is expected to be the base for relative references (Section | implementations do not handle such URIs correctly when they are used | |||
| 5.1). Incorrect implementations of reference resolution often fail | as the base for relative references (Section 5.1), apparently because | |||
| to distinguish query data from path data when looking for | they fail to distinguish query data from path data when looking | |||
| hierarchical separators, thus resulting in non-interoperable results. | for hierarchical separators. However, since query components are | |||
| However, since query components are often used to carry identifying | often used to carry identifying information in the form of | |||
| information in the form of "key=value" pairs, and one frequently used | "key=value" pairs, and one frequently used value is a reference to | |||
| value is a reference to another URI, it is sometimes better for | another URI, it is sometimes better for usability to avoid | |||
| usability to avoid percent-encoding those characters. | percent-encoding those characters. | |||
| 3.5 Fragment | 3.5 Fragment | |||
| The fragment identifier component of a URI allows indirect | The fragment identifier component of a URI allows indirect | |||
| identification of a secondary resource by reference to a primary | identification of a secondary resource by reference to a primary | |||
| resource and additional identifying information. The identified | resource and additional identifying information. The identified | |||
| secondary resource may be some portion or subset of the primary | secondary resource may be some portion or subset of the primary | |||
| resource, some view on representations of the primary resource, or | resource, some view on representations of the primary resource, or | |||
| some other resource defined or described by those representations. A | some other resource defined or described by those representations. A | |||
| fragment identifier component is indicated by the presence of a | fragment identifier component is indicated by the presence of a | |||
| number sign ("#") character and terminated by the end of the URI. | number sign ("#") character and terminated by the end of the URI. | |||
| fragment = *( pchar / "/" / "?" ) | fragment = *( pchar / "/" / "?" ) | |||
| The semantics of a fragment identifier are defined by the set of | The semantics of a fragment identifier are defined by the set of | |||
| representations that might result from a retrieval action on the | representations that might result from a retrieval action on the | |||
| primary resource. The fragment's format and resolution is therefore | primary resource. The fragment's format and resolution is therefore | |||
| dependent on the media type [RFC2046] of a potentially retrieved | dependent on the media type [RFC2046] of a potentially retrieved | |||
| representation, even though such a retrieval is only performed if the | representation, even though such a retrieval is only performed if the | |||
| URI is dereferenced. Individual media types may define their own | URI is dereferenced. If no such representation exists, then the | |||
| restrictions on, or structure within, the fragment identifier syntax | semantics of the fragment are considered unknown and, effectively, | |||
| for specifying different types of subsets, views, or external | unconstrained. Fragment identifier semantics are independent of the | |||
| references that are identifiable as secondary resources by that media | URI scheme and thus cannot be redefined by scheme specifications. | |||
| type. If the primary resource has multiple representations, as is | ||||
| often the case for resources whose representation is selected based | Individual media types may define their own restrictions on, or | |||
| on attributes of the retrieval request (a.k.a., content negotiation), | structure within, the fragment identifier syntax for specifying | |||
| then whatever is identified by the fragment should be consistent | different types of subsets, views, or external references that are | |||
| across all of those representations: each representation should | identifiable as secondary resources by that media type. If the | |||
| either define the fragment such that it corresponds to the same | primary resource has multiple representations, as is often the case | |||
| secondary resource, regardless of how it is represented, or the | for resources whose representation is selected based on attributes of | |||
| fragment should be left undefined by the representation (i.e., not | the retrieval request (a.k.a., content negotiation), then whatever is | |||
| found). | identified by the fragment should be consistent across all of those | |||
| | representations: each representation should either define the | |||
| | fragment such that it corresponds to the same secondary resource, | |||
| | regardless of how it is represented, or the fragment should be left | |||
| | undefined by the representation (i.e., not found). | |||
| As with any URI, use of a fragment identifier component does not | As with any URI, use of a fragment identifier component does not | |||
| imply that a retrieval action will take place. A URI with a fragment | imply that a retrieval action will take place. A URI with a fragment | |||
| identifier may be used to refer to the secondary resource without any | identifier may be used to refer to the secondary resource without any | |||
| implication that the primary resource is accessible or will ever be | implication that the primary resource is accessible or will ever be | |||
| accessed. | accessed. | |||
| Fragment identifiers have a special role in information systems as | Fragment identifiers have a special role in information systems as | |||
| the primary form of client-side indirect referencing, allowing an | the primary form of client-side indirect referencing, allowing an | |||
| author to specifically identify those aspects of an existing resource | author to specifically identify those aspects of an existing resource | |||
| that are only indirectly provided by the resource owner. As such, | that are only indirectly provided by the resource owner. As such, the | |||
| interpretation of the fragment identifier during a retrieval action | fragment identifier is not used in the scheme-specific processing of | |||
| is performed solely by the user agent; the fragment identifier is not | a URI; instead, the fragment identifier is separated from the rest of | |||
| passed to other systems during the process of retrieval. Although | the URI prior to a dereference, and thus the identifying information | |||
| this is often perceived to be a loss of information, particularly in | within the fragment itself is dereferenced solely by the user agent | |||
| regards to accurate redirection of references as content moves over | and regardless of the URI scheme. Although this separate handling is | |||
| time, it also serves to prevent information providers from denying | often perceived to be a loss of information, particularly in regards | |||
| reference authors the right to selectively refer to information | to accurate redirection of references as resources move over time, it | |||
| within a resource. | also serves to prevent information providers from denying reference | |||
| | authors the right to selectively refer to information within a | |||
| | resource. Indirect referencing also provides additional flexibility | |||
| | and extensibility to systems that use URIs, since new media types are | |||
| | easier to define and deploy than new schemes of identification. | |||
| The characters slash ("/") and question mark ("?") are allowed to | The characters slash ("/") and question mark ("?") are allowed to | |||
| represent data within the fragment identifier, but should not be used | represent data within the fragment identifier. Beware that some | |||
| as such within a URI that is expected to be the base for relative | older, erroneous implementations do not handle such URIs correctly | |||
| references (Section 5.1) for the same reasons as described above for | when they are used as the base for relative references (Section 5.1). | |||
| query. | ||||
| 4. Usage | 4. Usage | |||
| When applications make reference to a URI, they do not always use the | When applications make reference to a URI, they do not always use the | |||
| full form of reference defined by the "URI" syntax rule. In order to | full form of reference defined by the "URI" syntax rule. In order to | |||
| save space and take advantage of hierarchical locality, many Internet | save space and take advantage of hierarchical locality, many Internet | |||
| protocol elements and media type formats allow an abbreviation of a | protocol elements and media type formats allow an abbreviation of a | |||
| URI, while others restrict the syntax to a particular form of URI. | URI, while others restrict the syntax to a particular form of URI. | |||
| We define the most common forms of reference syntax in this | We define the most common forms of reference syntax in this | |||
| specification because they impact and depend upon the design of the | specification because they impact and depend upon the design of the | |||
| generic syntax, requiring a uniform parsing algorithm in order to be | generic syntax, requiring a uniform parsing algorithm in order to be | |||
| interpreted consistently. | interpreted consistently. | |||
| 4.1 URI Reference | 4.1 URI Reference | |||
| URI-reference is used to denote the most common usage of a resource | URI-reference is used to denote the most common usage of a resource | |||
| identifier. | identifier. | |||
| URI-reference = URI / relative-URI | URI-reference = URI / relative-URI | |||
| A URI-reference may be relative: if the reference's prefix matches | A URI-reference may be relative: if the reference's prefix matches | |||
| the syntax of a scheme followed by its colon separator, then the | the syntax of a scheme followed by its colon separator, then the | |||
| reference is a URI rather than a relative-URI. | reference is a URI rather than a relative-URI. | |||
| A URI-reference is typically parsed first into the five URI | A URI-reference is typically parsed first into the five URI | |||
| components, in order to determine what components are present and | components, in order to determine what components are present and | |||
| whether or not the reference is relative, and then each component is | whether or not the reference is relative, after which each component | |||
| parsed for its subparts and their validation. The ABNF of | is parsed for its subparts and their validation. The ABNF of | |||
| URI-reference, along with the "first-match-wins" disambiguation rule, | URI-reference, along with the "first-match-wins" disambiguation rule, | |||
| is sufficient to define a validating parser for the generic syntax. | is sufficient to define a validating parser for the generic syntax. | |||
| Readers familiar with regular expressions should see Appendix B for | Readers familiar with regular expressions should see Appendix B for | |||
| an example of a non-validating URI-reference parser that will take | an example of a non-validating URI-reference parser that will take | |||
| any given string and extract the URI components. | any given string and extract the URI components. | |||
| 4.2 Relative URI | 4.2 Relative URI | |||
| A relative URI reference takes advantage of the hierarchical syntax | A relative URI reference takes advantage of the hierarchical syntax | |||
| (Section 1.2.3) in order to express a reference that is relative to | (Section 1.2.3) in order to express a reference that is relative to | |||
| the name space of another hierarchical URI. | the name space of another hierarchical URI. | |||
| relative-URI = ["//" authority] path ["?" query] ["#" fragment] | relative-URI = relative-part [ "?" query ] [ "#" fragment ] | |||
| | relative-part = "//" authority path-abempty | |||
| | / path-abs | |||
| | / path-noscheme | |||
| | / path-empty | |||
| The URI referred to by a relative reference, also known as the target | The URI referred to by a relative reference, also known as the target | |||
| URI, is obtained by applying the reference resolution algorithm of | URI, is obtained by applying the reference resolution algorithm of | |||
| Section 5. | Section 5. | |||
| A relative reference that begins with two slash characters is termed | A relative reference that begins with two slash characters is termed | |||
| a network-path reference; such references are rarely used. A relative | a network-path reference; such references are rarely used. A relative | |||
| reference that begins with a single slash character is termed an | reference that begins with a single slash character is termed an | |||
| absolute-path reference. A relative reference that does not begin | absolute-path reference. A relative reference that does not begin | |||
| with a slash character is termed a relative-path reference. | with a slash character is termed a relative-path reference. | |||
| A path segment that contains a colon character (e.g., "this:that") | A path segment that contains a colon character (e.g., "this:that") | |||
| cannot be used as the first segment of a relative-path reference | cannot be used as the first segment of a relative-path reference | |||
| because it would be mistaken for a scheme name. Such a segment must | because it would be mistaken for a scheme name. Such a segment must | |||
| be preceded by a dot-segment (e.g., "./this:that") to make a | be preceded by a dot-segment (e.g., "./this:that") to make a | |||
| relative-path reference. | relative-path reference. | |||
| 4.3 Absolute URI | 4.3 Absolute URI | |||
| Some protocol elements allow only the absolute form of a URI without | Some protocol elements allow only the absolute form of a URI without | |||
| a fragment identifier. For example, defining a base URI for later | a fragment identifier. For example, defining a base URI for later | |||
| use by relative references calls for an absolute-URI syntax rule that | use by relative references calls for an absolute-URI syntax rule that | |||
| does not allow a fragment. | does not allow a fragment. | |||
| absolute-URI = scheme ":" ["//" authority] path ["?" query] | absolute-URI = scheme ":" hier-part [ "?" query ] | |||
| 4.4 Same-document Reference | 4.4 Same-document Reference | |||
| When a URI reference refers to a URI that is, aside from its fragment | When a URI reference refers to a URI that is, aside from its fragment | |||
| component (if any), identical to the base URI (Section 5.1), that | component (if any), identical to the base URI (Section 5.1), that | |||
| reference is called a "same-document" reference. The most frequent | reference is called a "same-document" reference. The most frequent | |||
| examples of same-document references are relative references that are | examples of same-document references are relative references that are | |||
| empty or include only the number sign ("#") separator followed by a | empty or include only the number sign ("#") separator followed by a | |||
| fragment identifier. | fragment identifier. | |||
| When a same-document reference is dereferenced for the purpose of a | When a same-document reference is dereferenced for the purpose of a | |||
| retrieval action, the target of that reference is defined to be | retrieval action, the target of that reference is defined to be | |||
| skipping to change at page 25, line 46 ¶ | skipping to change at page 26, line 20 ¶ | |||
| Normalization of the base and target URIs prior to their comparison, | Normalization of the base and target URIs prior to their comparison, | |||
| as described in Section 6.2.2 and Section 6.2.3, is allowed but | as described in Section 6.2.2 and Section 6.2.3, is allowed but | |||
| rarely performed in practice. Normalization may increase the set of | rarely performed in practice. Normalization may increase the set of | |||
| same-document references, which may be of benefit to some caching | same-document references, which may be of benefit to some caching | |||
| applications. As such, reference authors should not assume that a | applications. As such, reference authors should not assume that a | |||
| slightly different, though equivalent, reference URI will (or will | slightly different, though equivalent, reference URI will (or will | |||
| not) be interpreted as a same-document reference by any given | not) be interpreted as a same-document reference by any given | |||
| application. | application. | |||
| 4.5 Suffix Reference | 4.5 Suffix Reference | |||
| The URI syntax is designed for unambiguous reference to resources and | The URI syntax is designed for unambiguous reference to resources and | |||
| extensibility via the URI scheme. However, as URI identification and | extensibility via the URI scheme. However, as URI identification and | |||
| usage have become commonplace, traditional media (television, radio, | usage have become commonplace, traditional media (television, radio, | |||
| newspapers, billboards, etc.) have increasingly used a suffix of the | newspapers, billboards, etc.) have increasingly used a suffix of the | |||
| URI as a reference, consisting of only the authority and path | URI as a reference, consisting of only the authority and path | |||
| portions of the URI, such as | portions of the URI, such as | |||
| www.w3.org/Addressing/ | www.w3.org/Addressing/ | |||
| or simply a DNS registered name on its own. Such references are | or simply a DNS registered name on its own. Such references are | |||
| primarily intended for human interpretation, rather than for | primarily intended for human interpretation, rather than for | |||
| machines, with the assumption that context-based heuristics are | machines, with the assumption that context-based heuristics are | |||
| sufficient to complete the URI (e.g., most host names beginning with | sufficient to complete the URI (e.g., most registered names beginning | |||
| "www" are likely to have a URI prefix of "http://"). Although there | with "www" are likely to have a URI prefix of "http://"). Although | |||
| is no standard set of heuristics for disambiguating a URI suffix, | there is no standard set of heuristics for disambiguating a URI | |||
| many client implementations allow them to be entered by the user and | suffix, many client implementations allow them to be entered by the | |||
| heuristically resolved. | user and heuristically resolved. | |||
| While this practice of using suffix references is common, it should | While this practice of using suffix references is common, it should | |||
| be avoided whenever possible and never used in situations where | be avoided whenever possible and never used in situations where | |||
| long-term references are expected. The heuristics noted above will | long-term references are expected. The heuristics noted above will | |||
| change over time, particularly when a new URI scheme becomes popular, | change over time, particularly when a new URI scheme becomes popular, | |||
| and are often incorrect when used out of context. Furthermore, they | and are often incorrect when used out of context. Furthermore, they | |||
| can lead to security issues along the lines of those described in | can lead to security issues along the lines of those described in | |||
| [RFC1535]. | [RFC1535]. | |||
| Since a URI suffix has the same syntax as a relative path reference, | Since a URI suffix has the same syntax as a relative path reference, | |||
| a suffix reference cannot be used in contexts where a relative | a suffix reference cannot be used in contexts where a relative | |||
| reference is expected. As a result, suffix references are limited to | reference is expected. As a result, suffix references are limited to | |||
| those places where there is no defined base URI, such as dialog boxes | those places where there is no defined base URI, such as dialog boxes | |||
| and off-line advertisements. | and off-line advertisements. | |||
| 5. Reference Resolution | 5. Reference Resolution | |||
| This section defines the process of resolving a URI reference within | This section defines the process of resolving a URI reference within | |||
| a context that allows relative references, such that the result is a | a context that allows relative references, such that the result is a | |||
| string matching the "URI" syntax rule of Section 3. | string matching the "URI" syntax rule of Section 3. | |||
| 5.1 Establishing a Base URI | 5.1 Establishing a Base URI | |||
| The term "relative" implies that there exists a "base URI" against | The term "relative" implies that there exists a "base URI" against | |||
| which the relative reference is applied. Aside from fragment-only | which the relative reference is applied. Aside from fragment-only | |||
| references (Section 4.4), relative references are only usable when a | references (Section 4.4), relative references are only usable when a | |||
| base URI is known. A base URI must be established by the parser | base URI is known. A base URI must be established by the parser | |||
| prior to parsing URI references that might be relative. | prior to parsing URI references that might be relative. | |||
| The base URI of a reference can be established in one of four ways, | The base URI of a reference can be established in one of four ways, | |||
| discussed below in order of precedence. The order of precedence can | discussed below in order of precedence. The order of precedence can | |||
| be thought of in terms of layers, where the innermost defined base | be thought of in terms of layers, where the innermost defined base | |||
| skipping to change at page 27, line 42 ¶ | skipping to change at page 27, line 42 ¶ | |||
| | | | | (5.1.1) Base URI embedded in content | | | | | | | | | (5.1.1) Base URI embedded in content | | | | | |||
| | | | `----------------------------------------' | | | | | | | `----------------------------------------' | | | | |||
| | | | (5.1.2) Base URI of the encapsulating entity | | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | |||
| | | | (message, representation, or none) | | | | | | | (message, representation, or none) | | | | |||
| | | `----------------------------------------------' | | | | | `----------------------------------------------' | | | |||
| | | (5.1.3) URI used to retrieve the entity | | | | | (5.1.3) URI used to retrieve the entity | | | |||
| | `----------------------------------------------------' | | | `----------------------------------------------------' | | |||
| | (5.1.4) Default Base URI (application-dependent) | | | (5.1.4) Default Base URI (application-dependent) | | |||
| `----------------------------------------------------------' | `----------------------------------------------------------' | |||
| 5.1.1 Base URI within Document Content | 5.1.1 Base URI Embedded in Content | |||
| Within certain media types, a base URI for relative references can be | Within certain media types, a base URI for relative references can be | |||
| embedded within the content itself such that it can be readily | embedded within the content itself such that it can be readily | |||
| obtained by a parser. This can be useful for descriptive documents, | obtained by a parser. This can be useful for descriptive documents, | |||
| such as tables of content, which may be transmitted to others through | such as tables of content, which may be transmitted to others through | |||
| protocols other than their usual retrieval context (e.g., E-Mail or | protocols other than their usual retrieval context (e.g., E-Mail or | |||
| USENET news). | USENET news). | |||
| It is beyond the scope of this specification to specify how, for each | It is beyond the scope of this specification to specify how, for each | |||
| media type, a base URI can be embedded. The appropriate syntax, when | media type, a base URI can be embedded. The appropriate syntax, when | |||
| available, is described by each media type's specification. | available, is described by the data format specification associated | |||
| | with each media type. | |||
| 5.1.2 Base URI from the Encapsulating Entity | 5.1.2 Base URI from the Encapsulating Entity | |||
| If no base URI is embedded, the base URI is defined by the | If no base URI is embedded, the base URI is defined by the | |||
| representation's retrieval context. For a document that is enclosed | representation's retrieval context. For a document that is enclosed | |||
| within another entity, such as a message or archive, the retrieval | within another entity, such as a message or archive, the retrieval | |||
| context is that entity; thus, the default base URI of a | context is that entity; thus, the default base URI of a | |||
| representation is the base URI of the entity in which the | representation is the base URI of the entity in which the | |||
| representation is encapsulated. | representation is encapsulated. | |||
| A mechanism for embedding a base URI within MIME container types | A mechanism for embedding a base URI within MIME container types | |||
| (e.g., the message and multipart types) is defined by MHTML | (e.g., the message and multipart types) is defined by MHTML | |||
| [RFC2110]. Protocols that do not use the MIME message header syntax, | [RFC2557]. Protocols that do not use the MIME message header syntax, | |||
| but do allow some form of tagged metadata to be included within | but do allow some form of tagged metadata to be included within | |||
| messages, may define their own syntax for defining a base URI as part | messages, may define their own syntax for defining a base URI as part | |||
| of a message. | of a message. | |||
| 5.1.3 Base URI from the Retrieval URI | 5.1.3 Base URI from the Retrieval URI | |||
| If no base URI is embedded and the representation is not encapsulated | If no base URI is embedded and the representation is not encapsulated | |||
| within some other entity, then, if a URI was used to retrieve the | within some other entity, then, if a URI was used to retrieve the | |||
| representation, that URI shall be considered the base URI. Note that | representation, that URI shall be considered the base URI. Note that | |||
| if the retrieval was the result of a redirected request, the last URI | if the retrieval was the result of a redirected request, the last URI | |||
| used (i.e., the URI that resulted in the actual retrieval of the | used (i.e., the URI that resulted in the actual retrieval of the | |||
| representation) is the base URI. | representation) is the base URI. | |||
| 5.1.4 Default Base URI | 5.1.4 Default Base URI | |||
| If none of the conditions described above apply, then the base URI is | If none of the conditions described above apply, then the base URI is | |||
| defined by the context of the application. Since this definition is | defined by the context of the application. Since this definition is | |||
| necessarily application-dependent, failing to define a base URI using | necessarily application-dependent, failing to define a base URI using | |||
| one of the other methods may result in the same content being | one of the other methods may result in the same content being | |||
| interpreted differently by different types of application. | interpreted differently by different types of application. | |||
| A sender of a representation containing relative references is | A sender of a representation containing relative references is | |||
| responsible for ensuring that a base URI for those references can be | responsible for ensuring that a base URI for those references can be | |||
| established. Aside from fragment-only references, relative references | established. Aside from fragment-only references, relative references | |||
| can only be used reliably in situations where the base URI is | can only be used reliably in situations where the base URI is | |||
| well-defined. | well-defined. | |||
| 5.2 Relative Resolution | 5.2 Relative Resolution | |||
| This section describes an algorithm for converting a URI reference | This section describes an algorithm for converting a URI reference | |||
| that might be relative to a given base URI into the parsed componets | that might be relative to a given base URI into the parsed components | |||
| of the reference's target. The components can then be recomposed, as | of the reference's target. The components can then be recomposed, as | |||
| described in Section 5.3, to form the target URI. This algorithm | described in Section 5.3, to form the target URI. This algorithm | |||
| provides definitive results that can be used to test the output of | provides definitive results that can be used to test the output of | |||
| other implementations. Applications may implement relative reference | other implementations. Applications may implement relative reference | |||
| resolution using some other algorithm, provided that the results | resolution using some other algorithm, provided that the results | |||
| match what would be given by this algorithm. | match what would be given by this algorithm. | |||
| 5.2.1 Pre-parse the Base URI | 5.2.1 Pre-parse the Base URI | |||
| The base URI (Base) is established according to the procedure of | The base URI (Base) is established according to the procedure of | |||
| Section 5.1 and parsed into the five main components described in | Section 5.1 and parsed into the five main components described in | |||
| Section 3. Note that only the scheme component is required to be | Section 3. Note that only the scheme component is required to be | |||
| present in a base URI; the other components may be empty or | present in a base URI; the other components may be empty or | |||
| undefined. A component is undefined if its associated delimiter does | undefined. A component is undefined if its associated delimiter does | |||
| not appear in the URI reference; the path component is never | not appear in the URI reference; the path component is never | |||
| undefined, though it may be empty. | undefined, though it may be empty. | |||
| Normalization of the base URI, as described in Section 6.2.2 and | Normalization of the base URI, as described in Section 6.2.2 and | |||
| Section 6.2.3, is optional. A URI reference must be transformed to | Section 6.2.3, is optional. A URI reference must be transformed to | |||
| its target URI before it can be normalized. | its target URI before it can be normalized. | |||
| 5.2.2 Transform References | 5.2.2 Transform References | |||
| For each URI reference (R), the following pseudocode describes an | For each URI reference (R), the following pseudocode describes an | |||
| algorithm for transforming R into its target URI (T): | algorithm for transforming R into its target URI (T): | |||
| -- The URI reference is parsed into the five URI components | -- The URI reference is parsed into the five URI components | |||
| -- | -- | |||
| (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | |||
| -- A non-strict parser may ignore a scheme in the reference | -- A non-strict parser may ignore a scheme in the reference | |||
| -- if it is identical to the base URI's scheme. | -- if it is identical to the base URI's scheme. | |||
| skipping to change at page 30, line 25 ¶ | skipping to change at page 30, line 38 ¶ | |||
| endif; | endif; | |||
| T.query = R.query; | T.query = R.query; | |||
| endif; | endif; | |||
| T.authority = Base.authority; | T.authority = Base.authority; | |||
| endif; | endif; | |||
| T.scheme = Base.scheme; | T.scheme = Base.scheme; | |||
| endif; | endif; | |||
| T.fragment = R.fragment; | T.fragment = R.fragment; | |||
| 5.2.3 Merge Paths | 5.2.3 Merge Paths | |||
| The pseudocode above refers to a "merge" routine for merging a | The pseudocode above refers to a "merge" routine for merging a | |||
| relative-path reference with the path of the base URI. This is | relative-path reference with the path of the base URI. This is | |||
| accomplished as follows: | accomplished as follows: | |||
| o If the base URI has a defined authority component and an empty | o If the base URI has a defined authority component and an empty | |||
| path, then return a string consisting of "/" concatenated with the | path, then return a string consisting of "/" concatenated with the | |||
| reference's path; otherwise, | reference's path; otherwise, | |||
| o Return a string consisting of the reference's path component | o Return a string consisting of the reference's path component | |||
| appended to all but the last segment of the base URI's path (i.e., | appended to all but the last segment of the base URI's path (i.e., | |||
| excluding any characters after the right-most "/" in the base URI | excluding any characters after the right-most "/" in the base URI | |||
| path, or excluding the entire base URI path if it does not contain | path, or excluding the entire base URI path if it does not contain | |||
| any "/" characters). | any "/" characters). | |||
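The two merge cases above can be sketched in Python. This is an illustrative sketch, not part of either draft: the function name `merge` follows the pseudocode, while the parameter names and the use of `None` to represent an undefined authority are assumptions made for the example.

```python
def merge(base_authority, base_path, ref_path):
    """Merge a relative-path reference with the base URI's path.

    base_authority is None when the base URI has no authority component
    (an assumption of this sketch; the draft only says "undefined").
    """
    # Case 1: defined authority and empty base path -> "/" + reference path.
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Case 2: keep everything up to and including the right-most "/" of the
    # base path. rfind() returns -1 when there is no "/", so the slice is
    # empty and the entire base path is excluded.
    return base_path[: base_path.rfind("/") + 1] + ref_path
```

The result of `merge` is not yet a final path: it may still contain dot-segments, which are removed in the next step.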
| 5.2.4 Remove Dot Segments | 5.2.4 Remove Dot Segments | |||
| The pseudocode also refers to a "remove_dot_segments" routine for | The pseudocode also refers to a "remove_dot_segments" routine for | |||
| interpreting and removing the special "." and ".." complete path | interpreting and removing the special "." and ".." complete path | |||
| segments from a referenced path. This is done after the path is | segments from a referenced path. This is done after the path is | |||
| extracted from a reference, whether or not the path was relative, in | extracted from a reference, whether or not the path was relative, in | |||
| order to remove any invalid or extraneous dot-segments prior to | order to remove any invalid or extraneous dot-segments prior to | |||
| forming the target URI. Although there are many ways to accomplish | forming the target URI. Although there are many ways to accomplish | |||
| this removal process, we describe a simple method using a two string | this removal process, we describe a simple method using two string | |||
| buffers. | buffers. | |||
| 1. The input buffer is initialized with the now-appended path | 1. The input buffer is initialized with the now-appended path | |||
| components and the output buffer is initialized to the empty | components and the output buffer is initialized to the empty | |||
| string. | string. | |||
| 2. Replace any prefix of "./" or "../" at the beginning of the input | 2. While the input buffer is not empty, loop: | |||
| buffer with "/". | ||||
| 3. While the input buffer is not empty, loop: | a. If the input buffer begins with a prefix of "../" or "./", | |||
| | then remove that prefix from the input buffer; otherwise, | |||
| 1. If the input buffer begins with a prefix of "/./" or "/.", | b. If the input buffer begins with a prefix of "/./" or "/.", | |||
| where "." is a complete path segment, then replace that | where "." is a complete path segment, then replace that | |||
| prefix with "/"; otherwise | prefix with "/" in the input buffer; otherwise, | |||
| 2. If the input buffer begins with a prefix of "/../" or "/..", | c. If the input buffer begins with a prefix of "/../" or "/..", | |||
| where ".." is a complete path segment, then replace that | where ".." is a complete path segment, then replace that | |||
| prefix with "/" and remove the last segment and its preceding | prefix with "/" in the input buffer and remove the last | |||
| "/" (if any) from the output buffer; otherwise | segment and its preceding "/" (if any) from the output | |||
| | buffer; otherwise, | |||
| 3. Remove the first segment and its preceding "/" (if any) from | d. If the input buffer consists only of "." or "..", then remove | |||
| the input buffer and append them to the output buffer. | that from the input buffer; otherwise, | |||
| 4. Finally, the output buffer is returned as the result of | e. Move the first path segment in the input buffer to the end of | |||
| | the output buffer, including the initial "/" character (if | |||
| | any) and any subsequent characters up to, but not including, | |||
| | the next "/" character or the end of the input buffer. | |||
| | 3. Finally, the output buffer is returned as the result of | |||
| remove_dot_segments. | remove_dot_segments. | |||
| | Note that dot-segments are intended for use in URI references to | |||
| | express an identifier relative to the hierarchy of names in the base | |||
| | URI. The remove_dot_segments algorithm respects that hierarchy by | |||
| | removing extra dot-segments rather than treating them as an error or | |||
| | leaving them to be misinterpreted by dereference implementations. | |||
| The following illustrates how the above steps are applied for two | The following illustrates how the above steps are applied for two | |||
| example merged paths, showing the state of the two buffers after each | example merged paths, showing the state of the two buffers after each | |||
| step. | step. | |||
| STEP OUTPUT BUFFER INPUT BUFFER | STEP OUTPUT BUFFER INPUT BUFFER | |||
| 1 : /a/b/c/./../../g | 1 : /a/b/c/./../../g | |||
| 3c: /a /b/c/./../../g | 2e: /a /b/c/./../../g | |||
| 3c: /a/b /c/./../../g | 2e: /a/b /c/./../../g | |||
| 3c: /a/b/c /./../../g | 2e: /a/b/c /./../../g | |||
| 3a: /a/b/c /../../g | 2b: /a/b/c /../../g | |||
| 3b: /a/b /../g | 2c: /a/b /../g | |||
| 3b: /a /g | 2c: /a /g | |||
| 3c: /a/g | 2e: /a/g | |||
| STEP OUTPUT BUFFER INPUT BUFFER | STEP OUTPUT BUFFER INPUT BUFFER | |||
| 1 : mid/content=5/../6 | 1 : mid/content=5/../6 | |||
| 3c: mid /content=5/../6 | 2e: mid /content=5/../6 | |||
| 3c: mid/content=5 /../6 | 2e: mid/content=5 /../6 | |||
| 3b: mid /6 | 2c: mid /6 | |||
| 3c: mid/6 | 2e: mid/6 | |||
| Some applications may find it more efficient to implement the | Some applications may find it more efficient to implement the | |||
| remove_dot_segments algorithm using two segment stacks rather than | remove_dot_segments algorithm using two segment stacks rather than | |||
| strings. | strings. | |||
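As a rough illustration, the two-buffer method can be written in Python as follows. The sketch follows the step lettering of the -05 column (2a through 2e); the variable names are assumptions of the example, and it is not meant as a reference implementation.

```python
def remove_dot_segments(path):
    """Remove "." and ".." complete path segments using two string buffers."""
    inp, out = path, ""  # step 1: input holds the path, output starts empty

    def drop_last_segment(buf):
        # Remove the last segment and its preceding "/" (if any).
        return buf[: max(buf.rfind("/"), 0)]

    while inp:  # step 2
        if inp.startswith("../"):        # 2a
            inp = inp[3:]
        elif inp.startswith("./"):       # 2a
            inp = inp[2:]
        elif inp.startswith("/./"):      # 2b: replace "/./" with "/"
            inp = inp[2:]
        elif inp == "/.":                # 2b: "." is the final segment
            inp = "/"
        elif inp.startswith("/../"):     # 2c: replace "/../" with "/"
            inp = inp[3:]
            out = drop_last_segment(out)
        elif inp == "/..":               # 2c: ".." is the final segment
            inp = "/"
            out = drop_last_segment(out)
        elif inp in (".", ".."):         # 2d
            inp = ""
        else:                            # 2e: move the first segment across
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]
    return out  # step 3
```

Applied to the two example paths traced above, the sketch yields "/a/g" and "mid/6" respectively.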
| Note: Some client applications will fail to separate a reference's | Note: Beware that some older, erroneous implementations will fail | |||
| query component from its path component before merging the base | to separate a reference's query component from its path component | |||
| and reference paths. This may result in loss of information if | prior to merging the base and reference paths, resulting in an | |||
| the query component contains the strings "/../" or "/./". | interoperability failure if the query component contains the | |||
| | strings "/../" or "/./". | |||
| 5.3 Component Recomposition | 5.3 Component Recomposition | |||
| Parsed URI components can be recomposed to obtain the corresponding | Parsed URI components can be recomposed to obtain the corresponding | |||
| URI reference string. Using pseudocode, this would be: | URI reference string. Using pseudocode, this would be: | |||
| result = "" | result = "" | |||
| if defined(scheme) then | if defined(scheme) then | |||
| append scheme to result; | append scheme to result; | |||
| append ":" to result; | append ":" to result; | |||
| endif; | endif; | |||
| skipping to change at page 33, line 5 ¶ | skipping to change at page 34, line 5 ¶ | |||
| endif; | endif; | |||
| return result; | return result; | |||
| Note that we are careful to preserve the distinction between a | Note that we are careful to preserve the distinction between a | |||
| component that is undefined, meaning that its separator was not | component that is undefined, meaning that its separator was not | |||
| present in the reference, and a component that is empty, meaning that | present in the reference, and a component that is empty, meaning that | |||
| the separator was present and was immediately followed by the next | the separator was present and was immediately followed by the next | |||
| component separator or the end of the reference. | component separator or the end of the reference. | |||
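The undefined-versus-empty distinction can be made concrete with a small Python sketch. Part of the pseudocode is elided in this comparison, so the handling of authority, path, query, and fragment below is an assumption modeled on the visible scheme branch; `None` stands for "undefined" and the function name `recompose` is hypothetical.

```python
def recompose(scheme, authority, path, query, fragment):
    """Rebuild a URI reference string from its five parsed components.

    None means the component is undefined (its delimiter was absent);
    "" means the delimiter was present but the component is empty.
    The path is never undefined, though it may be empty.
    """
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

With this representation, an empty query keeps its "?" delimiter on output, whereas an undefined query produces no delimiter at all.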
| 5.4 Reference Resolution Examples | 5.4 Reference Resolution Examples | |||
| Within a representation with a well-defined base URI of | Within a representation with a well-defined base URI of | |||
| http://a/b/c/d;p?q | http://a/b/c/d;p?q | |||
| a relative URI reference is transformed to its target URI as follows. | a relative URI reference is transformed to its target URI as follows. | |||
| 5.4.1 Normal Examples | 5.4.1 Normal Examples | |||
| "g:h" = "g:h" | "g:h" = "g:h" | |||
| "g" = "http://a/b/c/g" | "g" = "http://a/b/c/g" | |||
| "./g" = "http://a/b/c/g" | "./g" = "http://a/b/c/g" | |||
| "g/" = "http://a/b/c/g/" | "g/" = "http://a/b/c/g/" | |||
| "/g" = "http://a/g" | "/g" = "http://a/g" | |||
| "//g" = "http://g" | "//g" = "http://g" | |||
| "?y" = "http://a/b/c/d;p?y" | "?y" = "http://a/b/c/d;p?y" | |||
| "g?y" = "http://a/b/c/g?y" | "g?y" = "http://a/b/c/g?y" | |||
| "#s" = "http://a/b/c/d;p?q#s" | "#s" = "http://a/b/c/d;p?q#s" | |||
| skipping to change at page 33, line 39 ¶ | skipping to change at page 34, line 39 ¶ | |||
| "" = "http://a/b/c/d;p?q" | "" = "http://a/b/c/d;p?q" | |||
| "." = "http://a/b/c/" | "." = "http://a/b/c/" | |||
| "./" = "http://a/b/c/" | "./" = "http://a/b/c/" | |||
| ".." = "http://a/b/" | ".." = "http://a/b/" | |||
| "../" = "http://a/b/" | "../" = "http://a/b/" | |||
| "../g" = "http://a/b/g" | "../g" = "http://a/b/g" | |||
| "../.." = "http://a/" | "../.." = "http://a/" | |||
| "../../" = "http://a/" | "../../" = "http://a/" | |||
| "../../g" = "http://a/g" | "../../g" = "http://a/g" | |||
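One way to exercise examples like these is to run them through an existing resolver. The sketch below uses Python's standard `urllib.parse.urljoin`, on the assumption that a recent Python 3 release is used; the table name `NORMAL_EXAMPLES` is hypothetical, and only the examples visible above are listed.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (reference, expected target URI) pairs copied from the examples above.
NORMAL_EXAMPLES = [
    ("g:h", "g:h"),
    ("g", "http://a/b/c/g"),
    ("./g", "http://a/b/c/g"),
    ("g/", "http://a/b/c/g/"),
    ("/g", "http://a/g"),
    ("//g", "http://g"),
    ("?y", "http://a/b/c/d;p?y"),
    ("g?y", "http://a/b/c/g?y"),
    ("#s", "http://a/b/c/d;p?q#s"),
    ("", "http://a/b/c/d;p?q"),
    (".", "http://a/b/c/"),
    ("./", "http://a/b/c/"),
    ("..", "http://a/b/"),
    ("../", "http://a/b/"),
    ("../g", "http://a/b/g"),
    ("../..", "http://a/"),
    ("../../", "http://a/"),
    ("../../g", "http://a/g"),
]

def check_examples():
    """Return the (reference, expected, actual) triples that do not match."""
    return [
        (ref, expected, urljoin(BASE, ref))
        for ref, expected in NORMAL_EXAMPLES
        if urljoin(BASE, ref) != expected
    ]
```

An empty list from `check_examples()` indicates that the resolver agrees with every listed example.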
| 5.4.2 Abnormal Examples | 5.4.2 Abnormal Examples | |||
| Although the following abnormal examples are unlikely to occur in | Although the following abnormal examples are unlikely to occur in | |||
| normal practice, all URI parsers should be capable of resolving them | normal practice, all URI parsers should be capable of resolving them | |||
| consistently. Each example uses the same base as above. | consistently. Each example uses the same base as above. | |||
| Parsers must be careful in handling cases where there are more | Parsers must be careful in handling cases where there are more | |||
| relative path ".." segments than there are hierarchical levels in the | relative path ".." segments than there are hierarchical levels in the | |||
| base URI's path. Note that the ".." syntax cannot be used to change | base URI's path. Note that the ".." syntax cannot be used to change | |||
| the authority component of a URI. | the authority component of a URI. | |||
| skipping to change at page 35, line 5 ¶ | skipping to change at page 36, line 5 ¶ | |||
| Some parsers allow the scheme name to be present in a relative URI | Some parsers allow the scheme name to be present in a relative URI | |||
| reference if it is the same as the base URI scheme. This is | reference if it is the same as the base URI scheme. This is | |||
| considered to be a loophole in prior specifications of partial URI | considered to be a loophole in prior specifications of partial URI | |||
| [RFC1630]. Its use should be avoided, but is allowed for backward | [RFC1630]. Its use should be avoided, but is allowed for backward | |||
| compatibility. | compatibility. | |||
| "http:g" = "http:g" ; for strict parsers | "http:g" = "http:g" ; for strict parsers | |||
| / "http://a/b/c/g" ; for backward compatibility | / "http://a/b/c/g" ; for backward compatibility | |||
| 6. Normalization and Comparison | 6. Normalization and Comparison | |||
| One of the most common operations on URIs is simple comparison: | One of the most common operations on URIs is simple comparison: | |||
| determining if two URIs are equivalent without using the URIs to | determining if two URIs are equivalent without using the URIs to | |||
| access their respective resource(s). A comparison is performed every | access their respective resource(s). A comparison is performed every | |||
| time a response cache is accessed, a browser checks its history to | time a response cache is accessed, a browser checks its history to | |||
| color a link, or an XML parser processes tags within a namespace. | color a link, or an XML parser processes tags within a namespace. | |||
| Extensive normalization prior to comparison of URIs is often used by | Extensive normalization prior to comparison of URIs is often used by | |||
| spiders and indexing engines to prune a search space or reduce | spiders and indexing engines to prune a search space or reduce | |||
| duplication of request actions and response storage. | duplication of request actions and response storage. | |||
| URI comparison is performed in respect to some particular purpose, | URI comparison is performed in respect to some particular purpose, | |||
| and software with differing purposes will often be subject to | and software with differing purposes will often be subject to | |||
| differing design trade-offs in regards to how much effort should be | differing design trade-offs in regards to how much effort should be | |||
| spent in reducing duplicate identifiers. This section describes a | spent in reducing duplicate identifiers. This section describes a | |||
| variety of methods that may be used to compare URIs, the trade-offs | variety of methods that may be used to compare URIs, the trade-offs | |||
| between them, and the types of applications that might use them. | between them, and the types of applications that might use them. A | |||
| | canonical form for URI references is defined to reduce the occurrence | |||
| | of false negative comparisons. | |||
| 6.1 Equivalence | 6.1 Equivalence | |||
| Since URIs exist to identify resources, presumably they should be | Since URIs exist to identify resources, presumably they should be | |||
| considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
| such a definition of equivalence is not of much practical use, since | such a definition of equivalence is not of much practical use, since | |||
| there is no way for software to compare two resources without | there is no way for software to compare two resources without | |||
| knowledge of the implementation-specific syntax of each URI's | knowledge of the implementation-specific syntax of each URI's | |||
| dereferencing algorithm. For this reason, determination of | dereferencing algorithm. For this reason, determination of | |||
| equivalence or difference of URIs is based on string comparison, | equivalence or difference of URIs is based on string comparison, | |||
| perhaps augmented by reference to additional rules provided by URI | perhaps augmented by reference to additional rules provided by URI | |||
| scheme definitions. We use the terms "different" and "equivalent" to | scheme definitions. We use the terms "different" and "equivalent" to | |||
| skipping to change at page 36, line 5 ¶ | skipping to change at page 37, line 5 ¶ | |||
| different URIs. Therefore, comparison methods are designed to | different URIs. Therefore, comparison methods are designed to | |||
| minimize false negatives while strictly avoiding false positives. | minimize false negatives while strictly avoiding false positives. | |||
| In testing for equivalence, applications should not directly compare | In testing for equivalence, applications should not directly compare | |||
| relative URI references; the references should be converted to their | relative URI references; the references should be converted to their | |||
| target URI forms before comparison. When URIs are being compared for | target URI forms before comparison. When URIs are being compared for | |||
| the purpose of selecting (or avoiding) a network action, such as | the purpose of selecting (or avoiding) a network action, such as | |||
| retrieval of a representation, the fragment components (if any) | retrieval of a representation, the fragment components (if any) | |||
| should be excluded from the comparison. | should be excluded from the comparison. | |||
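A minimal Python sketch of this advice, assuming the standard `urllib.parse` module is an acceptable resolver: both references are first converted to target URIs, and the fragments are then dropped before a plain string comparison. The function name `same_for_retrieval` is hypothetical.

```python
from urllib.parse import urldefrag, urljoin

def same_for_retrieval(base, ref1, ref2):
    """Compare two references for the purpose of choosing a network action."""
    # Resolve each reference against the base URI first; relative
    # references are never compared directly.
    target1 = urljoin(base, ref1)
    target2 = urljoin(base, ref2)
    # Exclude the fragment components (if any) from the comparison.
    return urldefrag(target1).url == urldefrag(target2).url
```

Note that this only performs simple string comparison on the resolved URIs; it does not apply any of the further normalizations a given application might want.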
| 6.2 Comparison Ladder | 6.2 Comparison Ladder | |||
| A variety of methods are used in practice to test URI equivalence. | A variety of methods are used in practice to test URI equivalence. | |||
| These methods fall into a range, distinguished by the amount of | These methods fall into a range, distinguished by the amount of | |||
| processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
| negatives is reduced. As noted above, false negatives cannot in | negatives is reduced. As noted above, false negatives cannot in | |||
| principle be eliminated. In practice, their probability can be | principle be eliminated. In practice, their probability can be | |||
| reduced, but this reduction requires more processing and is not | reduced, but this reduction requires more processing and is not | |||
| cost-effective for all applications. | cost-effective for all applications. | |||
| If this range of comparison practices is considered as a ladder, the | If this range of comparison practices is considered as a ladder, the | |||
| following discussion will climb the ladder, starting with those | following discussion will climb the ladder, starting with those | |||
| practices that are cheap but have a relatively higher chance of | practices that are cheap but have a relatively higher chance of | |||
| producing false negatives, and proceeding to those that have higher | producing false negatives, and proceeding to those that have higher | |||
| computational cost and lower risk of false negatives. | computational cost and lower risk of false negatives. | |||
| 6.2.1 Simple String Comparison | 6.2.1 Simple String Comparison | |||
| If two URIs, considered as character strings, are identical, then it | If two URIs, considered as character strings, are identical, then it | |||
| is safe to conclude that they are equivalent. This type of | is safe to conclude that they are equivalent. This type of | |||
| equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
| in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing. | |||
| Testing strings for equivalence requires some basic precautions. This | Testing strings for equivalence requires some basic precautions. This | |||
| procedure is often referred to as "bit-for-bit" or "byte-for-byte" | procedure is often referred to as "bit-for-bit" or "byte-for-byte" | |||
| comparison, which is potentially misleading. Testing of strings for | comparison, which is potentially misleading. Testing of strings for | |||
| equality is normally based on pairwise comparison of the characters | equality is normally based on pairwise comparison of the characters | |||
| that make up the strings, starting from the first and proceeding | that make up the strings, starting from the first and proceeding | |||
| until both strings are exhausted and all characters found to be | until both strings are exhausted and all characters found to be | |||
| equal, a pair of characters compares unequal, or one of the strings | equal, a pair of characters compares unequal, or one of the strings | |||
| is exhausted before the other. | is exhausted before the other. | |||
| Such character comparisons require that each pair of characters be | Such character comparisons require that each pair of characters be | |||
| put in comparable form. For example, should one URI be stored in a | put in comparable form. For example, should one URI be stored in a | |||
| byte array in EBCDIC encoding, and the second be in a Java String | byte array in EBCDIC encoding, and the second be in a Java String | |||
| object (UTF-16), bit-for-bit comparisons applied naively will produce | object (UTF-16), bit-for-bit comparisons applied naively will produce | |||
| both false-positive and false-negative errors. It is better to speak | errors. It is better to speak of equality on a | |||
| of equality on a character-for-character rather than byte-for-byte or | character-for-character rather than byte-for-byte or bit-for-bit | |||
| bit-for-bit basis. In practical terms, character-by-character | basis. In practical terms, character-by-character comparisons should | |||
| comparisons should be done codepoint-by-codepoint after conversion to | be done codepoint-by-codepoint after conversion to a common character | |||
| a common character encoding. | encoding. | |||
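The difference between byte-for-byte and character-for-character comparison can be shown with a short Python sketch. It assumes Python's `cp037` codec as a stand-in for "EBCDIC" and big-endian UTF-16 as the second storage form; the function name `same_characters` is hypothetical.

```python
def same_characters(a_bytes, a_encoding, b_bytes, b_encoding):
    """Compare two stored URIs codepoint-by-codepoint.

    Both byte sequences are decoded to Python str objects (sequences of
    Unicode code points) before comparison.
    """
    return a_bytes.decode(a_encoding) == b_bytes.decode(b_encoding)

uri = "http://example.com/"
as_ebcdic = uri.encode("cp037")      # one EBCDIC code page
as_utf16 = uri.encode("utf-16-be")   # UTF-16 without a byte order mark

# The raw bytes differ even though the characters are identical.
bytes_equal = as_ebcdic == as_utf16
chars_equal = same_characters(as_ebcdic, "cp037", as_utf16, "utf-16-be")
```

Here `bytes_equal` is false while `chars_equal` is true, which is the false negative that a naive byte comparison would produce.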
| 6.2.2 Syntax-based Normalization | 6.2.2 Syntax-based Normalization | |||
| Software may use logic based on the definitions provided by this | Software may use logic based on the definitions provided by this | |||
| specification to reduce the probability of false negatives. Such | specification to reduce the probability of false negatives. Such | |||
| processing is moderately higher in cost than character-for-character | processing is moderately higher in cost than character-for-character | |||
| string comparison. For example, an application using this approach | string comparison. For example, an application using this approach | |||
| could reasonably consider the following two URIs equivalent: | could reasonably consider the following two URIs equivalent: | |||
| example://a/b/c/%7Bfoo%7D | example://a/b/c/%7Bfoo%7D | |||
| eXAMPLE://a/./b/../b/%63/%7bfoo%7d | eXAMPLE://a/./b/../b/%63/%7bfoo%7d | |||
| Web user agents, such as browsers, typically apply this type of URI | Web user agents, such as browsers, typically apply this type of URI | |||
| normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
| available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
| case normalization, encoding normalization, empty-component | case normalization, percent-encoding normalization, and removal of | |||
| normalization, and removal of dot-segments. | dot-segments. | |||
| 6.2.2.1 Case Normalization | 6.2.2.1 Case Normalization | |||
| When a URI scheme uses components of the generic syntax, it will also | When a URI scheme uses components of the generic syntax, it will also | |||
| use the common syntax equivalence rules, namely that the scheme and | use the common syntax equivalence rules, namely that the scheme and | |||
| host are case-insensitive and therefore should be normalized to | host are case-insensitive and therefore should be normalized to | |||
| lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | |||
| equivalent to <http://www.example.com/>. Applications should not | equivalent to <http://www.example.com/>. Applications should not | |||
| assume anything about the case sensitivity of other URI components, | assume anything about the case sensitivity of other URI components, | |||
| since that is dependent on the implementation used to handle a | since that is dependent on the implementation used to handle a | |||
| dereference. | dereference. | |||
| The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" | The hexadecimal digits within a percent-encoding triplet (e.g., "%3a" | |||
| versus "%3A") are case-insensitive and therefore should be normalized | versus "%3A") are case-insensitive and therefore should be normalized | |||
| to use uppercase letters for the digits A-F. | to use uppercase letters for the digits A-F. | |||
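The two case rules can be sketched in Python as below. This is an assumption-laden illustration rather than a complete normalizer: it relies on `urllib.parse.urlsplit` for parsing (which drops an empty "?" or "#" delimiter on recomposition), it lowercases only the part of the authority after any userinfo, and the name `normalize_case` is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")

def normalize_case(uri):
    """Lowercase scheme and host; uppercase hex digits in percent-encodings."""
    parts = urlsplit(uri)
    # Leave userinfo untouched; only the host (and the numeric port) are
    # lowercased, since other components may be case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    rebuilt = urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )
    # Uppercase the hex digits of every percent-encoded triplet last, so the
    # host lowercasing above does not undo it.
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), rebuilt)
```

The path, query, and fragment are passed through unchanged apart from the hex digits, in line with the caution about not assuming case-insensitivity elsewhere.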
| 6.2.2.2 Encoding Normalization | 6.2.2.2 Percent-Encoding Normalization | |||
| The percent-encoding mechanism (Section 2.1) is a frequent source of | The percent-encoding mechanism (Section 2.1) is a frequent source of | |||
| variance among otherwise identical URIs. In addition to the | variance among otherwise identical URIs. In addition to the | |||
| case-insensitivity issue noted above, some URI producers | case-insensitivity issue noted above, some URI producers | |||
| percent-encode octets that do not require percent-encoding, resulting | percent-encode octets that do not require percent-encoding, resulting | |||
| in URIs that are equivalent to their non-encoded counterparts. Such | in URIs that are equivalent to their non-encoded counterparts. Such | |||
| URIs should be normalized by decoding any percent-encoded octet that | URIs should be normalized by decoding any percent-encoded octet that | |||
| corresponds to an unreserved character, as described in Section 2.3. | corresponds to an unreserved character, as described in Section 2.3. | |||
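A short Python sketch of this normalization follows. It assumes the unreserved set consists of ASCII letters, digits, "-", ".", "_", and "~"; Section 2.3 of the draft is the authority on the exact set, and the name `decode_unreserved` is hypothetical.

```python
import re
import string

# Assumed unreserved set; consult Section 2.3 for the definitive list.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_unreserved(uri):
    """Decode percent-encoded octets that correspond to unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Only unreserved characters are decoded; every other triplet is
        # left exactly as it was found.
        return char if char in UNRESERVED else match.group(0)
    return _PCT_OCTET.sub(replace, uri)
```

Triplets for reserved characters, such as "%2F", and for non-ASCII octets are left encoded, since decoding them could change the meaning of the URI.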
| 6.2.2.3 Empty-component Normalization | 6.2.2.3 Path Segment Normalization | |||
| Components of the generic URI syntax are delimited from other | ||||
| components by optional separators. For example, a query component is | ||||
| separated from the path by a question mark ("?") and a port | ||||
| sub-component is separated from host by a colon (":"). A URI in | ||||
| which a delimiter is present and the (sub-)component it delimits is | ||||
| empty is equivalent to the same URI without that delimiter. For | ||||
| example, the following are all equivalent: | ||||
| ftp://example.com/ | ||||
| ftp://example.com:/ | ||||
| ftp://@example.com:/ | ||||
| ftp://@example.com:/? | ||||
| ftp://@example.com:/?# | ||||
| URI producers and normalizers should omit a delimiter if the | ||||
| component it delimits is empty, as exemplified by the first URI | ||||
| above, with one exception: a double-slash delimiter indicating an | ||||
| authority component should not be removed, even when the authority is | ||||
| empty, since doing so can lead to misinterpreting the path. | ||||
| 6.2.2.4 Path Segment Normalization | ||||
| The complete path segments "." and ".." have a special meaning within | The complete path segments "." and ".." have a special meaning within | |||
| hierarchical URI schemes. As such, they should not appear in | hierarchical URI schemes. As such, they should not appear in | |||
| absolute paths; if they are found, they can be removed by applying | absolute paths; if they are found, they can be removed by applying | |||
| the remove_dot_segments algorithm to the path, as described in | the remove_dot_segments algorithm to the path, as described in | |||
| Section 5.2. | Section 5.2. | |||
| 6.2.3 Scheme-based Normalization | 6.2.3 Scheme-based Normalization | |||
| The syntax and semantics of URIs vary from scheme to scheme, as | The syntax and semantics of URIs vary from scheme to scheme, as | |||
| described by the defining specification for each scheme. Software | described by the defining specification for each scheme. Software | |||
| may use scheme-specific rules, at further processing cost, to reduce | may use scheme-specific rules, at further processing cost, to reduce | |||
| the probability of false negatives. For example, since the "http" | the probability of false negatives. For example, since the "http" | |||
| scheme makes use of an authority component, has a default port of | scheme makes use of an authority component, has a default port of | |||
| "80", and defines an empty path to be equivalent to "/", the | "80", and defines an empty path to be equivalent to "/", the | |||
| following four URIs are equivalent: | following four URIs are equivalent: | |||
| http://example.com | http://example.com | |||
| skipping to change at page 39, line 9 ¶ | skipping to change at page 39, line 21 ¶ | |||
| http://example.com:80/ | http://example.com:80/ | |||
| In general, a URI that uses the generic syntax for authority with an | In general, a URI that uses the generic syntax for authority with an | |||
| empty path should be normalized to a path of "/"; likewise, an | empty path should be normalized to a path of "/"; likewise, an | |||
| explicit ":port", where the port is empty or the default for the | explicit ":port", where the port is empty or the default for the | |||
| scheme, is equivalent to one where the port and its ":" delimiter are | scheme, is equivalent to one where the port and its ":" delimiter are | |||
| elided. In other words, the second of the above URI examples is the | elided. In other words, the second of the above URI examples is the | |||
| normal form for the "http" scheme. | normal form for the "http" scheme. | |||
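For the "http" scheme specifically, the rules in this paragraph can be sketched in Python. The sketch assumes the host is not an IPv6 literal (so the first ":" in the host-and-port part starts the port), handles only the "http" scheme, and uses the hypothetical name `normalize_http`.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_http(uri):
    """Apply http-specific normalization: default port and empty path."""
    parts = urlsplit(uri)
    if parts.scheme != "http":
        return uri  # other schemes have their own rules
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, _colon, port = hostport.partition(":")
    # An empty port or the default port 80 is elided with its ":" delimiter.
    if port not in ("", "80"):
        host = host + ":" + port
    # An empty path is equivalent to "/" when an authority is present.
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme, userinfo + at + host, path, parts.query, parts.fragment)
    )
```

Under these assumptions, URIs that differ only in an empty or default port, or in an empty versus "/" path, all map to the same string.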
| Another case where normalization varies by scheme is in the handling | Another case where normalization varies by scheme is in the handling | |||
| of an empty authority component. For many scheme specifications, an | of an empty authority component or empty host subcomponent. For many | |||
| empty authority is considered an error; for others, it is considered | scheme specifications, an empty authority or host is considered an | |||
| equivalent to "localhost". For the sake of uniformity, future scheme | error; for others, it is considered equivalent to "localhost" or the | |||
| specifications should define an empty authority as being equivalent | end-user's host. When a scheme defines a default for authority and a | |||
| to "localhost", and URI producers and normalizers should use | URI reference to that default is desired, the reference should have | |||
| "localhost" instead of an empty authority. | an empty authority for the sake of uniformity, brevity, and | |||
| | internationalization. If, however, either the userinfo or port | |||
| | subcomponent is non-empty, then the host should be given explicitly | |||
| | even if it matches the default. | |||
| 6.2.4 Protocol-based Normalization | 6.2.4 Protocol-based Normalization | |||
| Web spiders, for which substantial effort to reduce the incidence of | Web spiders, for which substantial effort to reduce the incidence of | |||
| false negatives is often cost-effective, are observed to implement | false negatives is often cost-effective, are observed to implement | |||
| even more aggressive techniques in URI comparison. For example, if | even more aggressive techniques in URI comparison. For example, if | |||
| they observe that a URI such as | they observe that a URI such as | |||
| http://example.com/data | http://example.com/data | |||
| redirects to a URI differing only in the trailing slash | redirects to a URI differing only in the trailing slash | |||
| http://example.com/data/ | http://example.com/data/ | |||
| they will likely regard the two as equivalent in the future. This | they will likely regard the two as equivalent in the future. This | |||
| kind of technique is only appropriate when equivalence is clearly | kind of technique is only appropriate when equivalence is clearly | |||
| indicated by both the result of accessing the resources and the | indicated by both the result of accessing the resources and the | |||
| common conventions of their scheme's dereference algorithm (in this | common conventions of their scheme's dereference algorithm (in this | |||
| case, use of redirection by HTTP origin servers to avoid problems | case, use of redirection by HTTP origin servers to avoid problems | |||
| with relative references). | with relative references). | |||
| 6.3 Canonical Form | 6.3 Canonical Form | |||
| It is in the best interests of everyone to avoid false-negatives in | It is in the best interests of everyone concerned to avoid | |||
| comparing URIs and to minimize the amount of software processing for | false-negatives in comparing URIs and to minimize the amount of | |||
| such comparisons. Those who produce and make reference to URIs can | software processing for such comparisons. Those who produce and make | |||
| reduce the cost of processing and the risk of false negatives by | reference to URIs can reduce the cost of processing and the risk of | |||
| consistently providing them in a form that is reasonably canonical | false negatives by consistently providing them in a form that is | |||
| with respect to their scheme. Specifically: | reasonably canonical with respect to their scheme. Specifically: | |||
| o Always provide the URI scheme in lowercase characters. | o Always provide the URI scheme in lowercase characters. | |||
| o Always provide the host, if any, in lowercase characters. | o Always provide the host, if any, in lowercase characters. | |||
| o Only perform percent-encoding where it is essential. | o Only perform percent-encoding where it is essential. | |||
| o Always use uppercase A-through-F characters when percent-encoding. | o Always use uppercase A-through-F characters when percent-encoding. | |||
| o Prevent /./ and /../ from appearing in non-relative URI paths. | o Prevent dot-segments appearing in non-relative URI paths. | |||
| o Omit delimiters when their associated (sub-)component is empty. | ||||
| o For schemes that define an empty authority to be equivalent to | o For schemes that define a default authority, use an empty | |||
| "localhost", use "localhost". | authority if the default is desired. | |||
| o For schemes that define an empty path to be equivalent to a path | o For schemes that define an empty path to be equivalent to a path | |||
| of "/", use "/". | of "/", use "/". | |||
| 7. Security Considerations | 7. Security Considerations | |||
| A URI does not in itself pose a security threat. However, since URIs | A URI does not in itself pose a security threat. However, since URIs | |||
| are often used to provide a compact set of instructions for access to | are often used to provide a compact set of instructions for access to | |||
| network resources, care must be taken to properly interpret the data | network resources, care must be taken to properly interpret the data | |||
| within a URI, to prevent that data from causing unintended access, | within a URI, to prevent that data from causing unintended access, | |||
| and to avoid including data that should not be revealed in plain | and to avoid including data that should not be revealed in plain | |||
| text. | text. | |||
| 7.1 Reliability and Consistency | 7.1 Reliability and Consistency | |||
| There is no guarantee that, having once used a given URI to retrieve | There is no guarantee that, having once used a given URI to retrieve | |||
| some information, the same information will be retrievable by that | some information, the same information will be retrievable by that | |||
| URI in the future. Nor is there any guarantee that the information | URI in the future. Nor is there any guarantee that the information | |||
| retrievable via that URI in the future will be observably similar to | retrievable via that URI in the future will be observably similar to | |||
| that retrieved in the past. The URI syntax does not constrain how a | that retrieved in the past. The URI syntax does not constrain how a | |||
| given scheme or authority apportions its name space or maintains it | given scheme or authority apportions its name space or maintains it | |||
| over time. Such a guarantee can only be obtained from the person(s) | over time. Such a guarantee can only be obtained from the person(s) | |||
| controlling that name space and the resource in question. A specific | controlling that name space and the resource in question. A specific | |||
| URI scheme may define additional semantics, such as name persistence, | URI scheme may define additional semantics, such as name persistence, | |||
| if those semantics are required of all naming authorities for that | if those semantics are required of all naming authorities for that | |||
| scheme. | scheme. | |||
| 7.2 Malicious Construction | 7.2 Malicious Construction | |||
| It is sometimes possible to construct a URI such that an attempt to | It is sometimes possible to construct a URI such that an attempt to | |||
| perform a seemingly harmless, idempotent operation, such as the | perform a seemingly harmless, idempotent operation, such as the | |||
| retrieval of a representation, will in fact cause a possibly damaging | retrieval of a representation, will in fact cause a possibly damaging | |||
| remote operation to occur. The unsafe URI is typically constructed | remote operation to occur. The unsafe URI is typically constructed | |||
| by specifying a port number other than that reserved for the network | by specifying a port number other than that reserved for the network | |||
| protocol in question. The client unwittingly contacts a site that is | protocol in question. The client unwittingly contacts a site that is | |||
| running a different protocol service and data within the URI contains | running a different protocol service and data within the URI contains | |||
| instructions that, when interpreted according to this other protocol, | instructions that, when interpreted according to this other protocol, | |||
| cause an unexpected operation. A frequent example of such abuse has | cause an unexpected operation. A frequent example of such abuse has | |||
| skipping to change at page 42, line 11 | skipping to change at page 41, line 37 | |||
| When a URI contains percent-encoded octets that match the delimiters | When a URI contains percent-encoded octets that match the delimiters | |||
| for a given resolution or dereference protocol (for example, CR and | for a given resolution or dereference protocol (for example, CR and | |||
| LF characters for the TELNET protocol), such percent-encoded octets | LF characters for the TELNET protocol), such percent-encoded octets | |||
| must not be decoded before transmission across that protocol. | must not be decoded before transmission across that protocol. | |||
| Transfer of the percent-encoding, which might violate the protocol, | Transfer of the percent-encoding, which might violate the protocol, | |||
| is less harmful than allowing decoded octets to be interpreted as | is less harmful than allowing decoded octets to be interpreted as | |||
| additional operations or parameters, perhaps triggering an unexpected | additional operations or parameters, perhaps triggering an unexpected | |||
| and possibly harmful remote operation. | and possibly harmful remote operation. | |||
| 7.3 Back-end Transcoding | 7.3 Back-end Transcoding | |||
| When a URI is dereferenced, the data within it is often parsed by | When a URI is dereferenced, the data within it is often parsed by | |||
| both the user agent and one or more servers. In HTTP, for example, a | both the user agent and one or more servers. In HTTP, for example, a | |||
| typical user agent will parse a URI into its five major components, | typical user agent will parse a URI into its five major components, | |||
| access the authority's server, and send it the data within the | access the authority's server, and send it the data within the | |||
| authority, path, and query components. A typical server will take | authority, path, and query components. A typical server will take | |||
| that information, parse the path into segments and the query into | that information, parse the path into segments and the query into | |||
| key/value pairs, and then invoke implementation-specific handlers to | key/value pairs, and then invoke implementation-specific handlers to | |||
| respond to the request. As a result, a common security concern for | respond to the request. As a result, a common security concern for | |||
| server implementations that handle a URI, either as a whole or split | server implementations that handle a URI, either as a whole or split | |||
| into separate components, is proper interpretation of the octet data | into separate components, is proper interpretation of the octet data | |||
| represented by the characters and percent-encodings within that URI. | represented by the characters and percent-encodings within that URI. | |||
| Percent-encoded octets must be decoded at some point during the | Percent-encoded octets must be decoded at some point during the | |||
| dereference process. Applications must split the URI into its | dereference process. Applications must split the URI into its | |||
| components and sub-components prior to decoding the octets, since | components and subcomponents prior to decoding the octets, since | |||
| otherwise the decoded octets might be mistaken for delimiters. | otherwise the decoded octets might be mistaken for delimiters. | |||
| Security checks of the data within a URI should be applied after | Security checks of the data within a URI should be applied after | |||
| decoding the octets. Note, however, that the "%00" percent-encoding | decoding the octets. Note, however, that the "%00" percent-encoding | |||
| (NUL) may require special handling and should be rejected if the | (NUL) may require special handling and should be rejected if the | |||
| application is not expecting to receive raw data within a component. | application is not expecting to receive raw data within a component. | |||
| Special care should be taken when the URI path interpretation process | Special care should be taken when the URI path interpretation process | |||
| involves the use of a back-end filesystem or related system | involves the use of a back-end filesystem or related system | |||
| functions. Filesystems typically assign an operational meaning to | functions. Filesystems typically assign an operational meaning to | |||
| special characters, such as the "/", "\", ":", "[", and "]" | special characters, such as the "/", "\", ":", "[", and "]" | |||
| skipping to change at page 43, line 5 | skipping to change at page 42, line 29 | |||
| "lpt", etc. In some cases, merely testing for the existence of such a | "lpt", etc. In some cases, merely testing for the existence of such a | |||
| name will cause the operating system to pause or invoke unrelated | name will cause the operating system to pause or invoke unrelated | |||
| system calls, leading to significant security concerns regarding | system calls, leading to significant security concerns regarding | |||
| denial of service and unintended data transfer. It would be | denial of service and unintended data transfer. It would be | |||
| impossible for this specification to list all such significant | impossible for this specification to list all such significant | |||
| characters and device names; implementers should research the | characters and device names; implementers should research the | |||
| reserved names and characters for the types of storage device that | reserved names and characters for the types of storage device that | |||
| may be attached to their application and restrict the use of data | may be attached to their application and restrict the use of data | |||
| obtained from URI components accordingly. | obtained from URI components accordingly. | |||
| 7.4 Rare IP Address Formats | 7.4 Rare IP Address Formats | |||
| Although the URI syntax for IPv4address only allows the common, | Although the URI syntax for IPv4address only allows the common, | |||
| dotted-decimal form of IPv4 address literal, many implementations | dotted-decimal form of IPv4 address literal, many implementations | |||
| that process URIs make use of platform-dependent system routines, | that process URIs make use of platform-dependent system routines, | |||
| such as gethostbyname() and inet_aton(), to translate the string | such as gethostbyname() and inet_aton(), to translate the string | |||
| literal to an actual IP address. Unfortunately, such system routines | literal to an actual IP address. Unfortunately, such system routines | |||
| often allow and process a much larger set of formats than those | often allow and process a much larger set of formats than those | |||
| described in Section 3.2.2. | described in Section 3.2.2. | |||
| For example, many implementations allow dotted forms of three | For example, many implementations allow dotted forms of three | |||
| skipping to change at page 43, line 36 | skipping to change at page 43, line 13 | |||
| implies octal; otherwise, the number is interpreted as decimal). | implies octal; otherwise, the number is interpreted as decimal). | |||
| These additional IP address formats are not allowed in the URI syntax | These additional IP address formats are not allowed in the URI syntax | |||
| due to differences between platform implementations. However, they | due to differences between platform implementations. However, they | |||
| can become a security concern if an application attempts to filter | can become a security concern if an application attempts to filter | |||
| access to resources based on the IP address in string literal format. | access to resources based on the IP address in string literal format. | |||
| If such filtering is performed, literals should be converted to | If such filtering is performed, literals should be converted to | |||
| numeric form and filtered based on the numeric value, rather than a | numeric form and filtered based on the numeric value, rather than a | |||
| prefix or suffix of the string form. | prefix or suffix of the string form. | |||
| 7.5 Sensitive Information | 7.5 Sensitive Information | |||
| URI producers should not provide a URI that contains a username or | URI producers should not provide a URI that contains a username or | |||
| password which is intended to be secret: URIs are frequently | password which is intended to be secret: URIs are frequently | |||
| displayed by browsers, stored in clear text bookmarks, and logged by | displayed by browsers, stored in clear text bookmarks, and logged by | |||
| user agent history and intermediary applications (proxies). A | user agent history and intermediary applications (proxies). A | |||
| password appearing within the userinfo component is deprecated and | password appearing within the userinfo component is deprecated and | |||
| should be considered an error (or simply ignored) except in those | should be considered an error (or simply ignored) except in those | |||
| rare cases where the 'password' parameter is intended to be public. | rare cases where the 'password' parameter is intended to be public. | |||
| 7.6 Semantic Attacks | 7.6 Semantic Attacks | |||
| Because the userinfo sub-component is rarely used and appears before | Because the userinfo subcomponent is rarely used and appears before | |||
| the host in the authority component, it can be used to construct a | the host in the authority component, it can be used to construct a | |||
| URI that is intended to mislead a human user by appearing to identify | URI that is intended to mislead a human user by appearing to identify | |||
| one (trusted) naming authority while actually identifying a different | one (trusted) naming authority while actually identifying a different | |||
| authority hidden behind the noise. For example | authority hidden behind the noise. For example | |||
| ftp://ftp.example.com&story=breaking_news@10.0.0.1/top_story.htm | ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm | |||
| might lead a human user to assume that the host is | might lead a human user to assume that the host is 'cnn.example.com', | |||
| 'trusted.example.com', whereas it is actually '10.0.0.1'. Note that | whereas it is actually '10.0.0.1'. Note that a misleading userinfo | |||
| a misleading userinfo sub-component could be much longer than the | subcomponent could be much longer than the example above. | |||
| example above. | ||||
| A misleading URI, such as the one above, is an attack on the user's | A misleading URI, such as the one above, is an attack on the user's | |||
| preconceived notions about the meaning of a URI, rather than an | preconceived notions about the meaning of a URI, rather than an | |||
| attack on the software itself. User agents may be able to reduce the | attack on the software itself. User agents may be able to reduce the | |||
| impact of such attacks by distinguishing the various components of | impact of such attacks by distinguishing the various components of | |||
| the URI when rendered, such as by using a different color or tone to | the URI when rendered, such as by using a different color or tone to | |||
| render userinfo if any is present, though there is no general | render userinfo if any is present, though there is no general | |||
| panacea. More information on URI-based semantic attacks can be found | panacea. More information on URI-based semantic attacks can be found | |||
| in [Siedzik]. | in [Siedzik]. | |||
| 8. Acknowledgments | 8. Acknowledgments | |||
| This specification is derived from RFC 2396 [RFC2396], RFC 1808 | This specification is derived from RFC 2396 [RFC2396], RFC 1808 | |||
| [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those | [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those | |||
| documents still apply. It also incorporates the update (with | documents still apply. It also incorporates the update (with | |||
| corrections) for IPv6 literals in the host syntax, as defined by | corrections) for IPv6 literals in the host syntax, as defined by | |||
| Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | |||
| [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | |||
| Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | |||
| Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | |||
| Duerst, Stefan Eissing, Clive D.W. Feather, Tony Hammond, Pat Hayes, | Duerst, Stefan Eissing, Clive D.W. Feather, Tony Hammond, Pat Hayes, | |||
| Henry Holtzman, Ian B. Jacobs, Michael Kay, John C. Klensin, Graham | Henry Holtzman, Ian B. Jacobs, Michael Kay, John C. Klensin, Graham | |||
| Klyne, Dan Kohn, Bruce Lilly, Andrew Main, Ira McDonald, Michael | Klyne, Dan Kohn, Bruce Lilly, Andrew Main, Ira McDonald, Michael | |||
| Mealling, Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, | Mealling, Ray Merkert, Stephen Pollei, Julian Reschke, Tomas Rokicki, | |||
| Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, Stuart | Miles Sabin, Kai Schaetzl, Mark Thomson, Ronald Tschalaer, Norm | |||
| Williams, and Henry Zongaro are gratefully acknowledged. | Walsh, Marc Warne, Stuart Williams, and Henry Zongaro are gratefully | |||
| acknowledged. | ||||
| Normative References | 9. References | |||
| 9.1 Normative References | ||||
| [ASCII] American National Standards Institute, "Coded Character | [ASCII] American National Standards Institute, "Coded Character | |||
| Set -- 7-bit American Standard Code for Information | Set -- 7-bit American Standard Code for Information | |||
| Interchange", ANSI X3.4, 1986. | Interchange", ANSI X3.4, 1986. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", STD 63, RFC 3629, November 2003. | 10646", STD 63, RFC 3629, November 2003. | |||
| Informative References | 9.2 Informative References | |||
| [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | [RFC0952] Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet | |||
| host table specification", RFC 952, October 1985. | host table specification", RFC 952, October 1985. | |||
| [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | |||
| STD 13, RFC 1034, November 1987. | STD 13, RFC 1034, November 1987. | |||
| [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | |||
| and Support", STD 3, RFC 1123, October 1989. | and Support", STD 3, RFC 1123, October 1989. | |||
| skipping to change at page 47, line 41 | skipping to change at page 46, line 5 | |||
| [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | [RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform | |||
| Resource Locators (URL)", RFC 1738, December 1994. | Resource Locators (URL)", RFC 1738, December 1994. | |||
| [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC | [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC | |||
| 1808, June 1995. | 1808, June 1995. | |||
| [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |||
| Extensions (MIME) Part Two: Media Types", RFC 2046, | Extensions (MIME) Part Two: Media Types", RFC 2046, | |||
| November 1996. | November 1996. | |||
| [RFC2110] Palme, J. and A. Hopmann, "MIME E-mail Encapsulation of | ||||
| Aggregate Documents, such as HTML (MHTML)", RFC 2110, | ||||
| March 1997. | ||||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | Resource Identifiers (URI): Generic Syntax", RFC 2396, | |||
| August 1998. | August 1998. | |||
| [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D. | |||
| Jensen, "HTTP Extensions for Distributed Authoring -- | Jensen, "HTTP Extensions for Distributed Authoring -- | |||
| WEBDAV", RFC 2518, February 1999. | WEBDAV", RFC 2518, February 1999. | |||
| [RFC2557] Palme, F., Hopmann, A., Shelness, N. and E. Stefferud, | ||||
| "MIME Encapsulation of Aggregate Documents, such as HTML | ||||
| (MHTML)", RFC 2557, March 1999. | ||||
| [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | [RFC2717] Petke, R. and I. King, "Registration Procedures for URL | |||
| Scheme Names", BCP 35, RFC 2717, November 1999. | Scheme Names", BCP 35, RFC 2717, November 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | |||
| "Guidelines for new URL Schemes", RFC 2718, November 1999. | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | |||
| [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration | [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration | |||
| skipping to change at page 49, line 9 | skipping to change at page 47, line 9 | |||
| (IPv6) Addressing Architecture", RFC 3513, April 2003. | (IPv6) Addressing Architecture", RFC 3513, April 2003. | |||
| [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April | [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", April | |||
| 2001, <http://www.giac.org/practical/gsec/ | 2001, <http://www.giac.org/practical/gsec/ | |||
| Richard_Siedzik_GSEC.pdf>. | Richard_Siedzik_GSEC.pdf>. | |||
| Authors' Addresses | Authors' Addresses | |||
| Tim Berners-Lee | Tim Berners-Lee | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| MIT/LCS, Room NE43-356 | Massachusetts Institute of Technology | |||
| 200 Technology Square | 77 Massachusetts Avenue | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| USA | USA | |||
| Phone: +1-617-253-5702 | Phone: +1-617-253-5702 | |||
| Fax: +1-617-258-5999 | Fax: +1-617-258-5999 | |||
| EMail: timbl@w3.org | EMail: timbl@w3.org | |||
| URI: http://www.w3.org/People/Berners-Lee/ | URI: http://www.w3.org/People/Berners-Lee/ | |||
| Roy T. Fielding | Roy T. Fielding | |||
| Day Software | Day Software | |||
| skipping to change at page 50, line 5 | skipping to change at page 48, line 5 | |||
| Larry Masinter | Larry Masinter | |||
| Adobe Systems Incorporated | Adobe Systems Incorporated | |||
| 345 Park Ave | 345 Park Ave | |||
| San Jose, CA 95110 | San Jose, CA 95110 | |||
| USA | USA | |||
| Phone: +1-408-536-3024 | Phone: +1-408-536-3024 | |||
| EMail: LMM@acm.org | EMail: LMM@acm.org | |||
| URI: http://larry.masinter.net/ | URI: http://larry.masinter.net/ | |||
| Appendix A. Collected ABNF for URI | Appendix A. Collected ABNF for URI | |||
| URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment] | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |||
| hier-part = "//" authority path-abempty | ||||
| / path-abs | ||||
| / path-rootless | ||||
| / path-empty | ||||
| URI-reference = URI / relative-URI | URI-reference = URI / relative-URI | |||
| relative-URI = ["//" authority] path ["?" query] ["#" fragment] | absolute-URI = scheme ":" hier-part [ "?" query ] | |||
| absolute-URI = scheme ":" ["//" authority] path ["?" query] | relative-URI = relative-part [ "?" query ] [ "#" fragment ] | |||
| relative-part = "//" authority path-abempty | ||||
| / path-abs | ||||
| / path-noscheme | ||||
| / path-empty | ||||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| authority = [ userinfo "@" ] host [ ":" port ] | authority = [ userinfo "@" ] host [ ":" port ] | |||
| userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |||
| host = IP-literal / IPv4address / reg-name | host = IP-literal / IPv4address / reg-name | |||
| port = *DIGIT | port = *DIGIT | |||
| IP-literal = "[" ( IPv6address / IPvFuture ) "]" | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |||
| IPv6address = 6( h16 ":" ) ls32 | IPv6address = 6( h16 ":" ) ls32 | |||
| / "::" 5( h16 ":" ) ls32 | / "::" 5( h16 ":" ) ls32 | |||
| / [ h16 ] "::" 4( h16 ":" ) ls32 | / [ h16 ] "::" 4( h16 ":" ) ls32 | |||
| / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |||
| / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |||
| / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | |||
| / [ *4( h16 ":" ) h16 ] "::" ls32 | / [ *4( h16 ":" ) h16 ] "::" ls32 | |||
| / [ *5( h16 ":" ) h16 ] "::" h16 | / [ *5( h16 ":" ) h16 ] "::" h16 | |||
| / [ *6( h16 ":" ) h16 ] "::" | / [ *6( h16 ":" ) h16 ] "::" | |||
| h16 = 1*4HEXDIG | h16 = 1*4HEXDIG | |||
| ls32 = ( h16 ":" h16 ) / IPv4address | ls32 = ( h16 ":" h16 ) / IPv4address | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| / %x31-39 DIGIT ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| / "2" %x30-34 DIGIT ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / "25" %x30-35 ; 250-255 | / "25" %x30-35 ; 250-255 | |||
| reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | reg-name = 0*255( unreserved / pct-encoded / sub-delims ) | |||
| path = segment *( "/" segment ) | path = path-abempty ; begins with "/" or is empty | |||
| / path-abs ; begins with "/" but not "//" | ||||
| / path-noscheme ; begins with a non-colon segment | ||||
| / path-rootless ; begins with a segment | ||||
| / path-empty ; zero characters | ||||
| path-abempty = *( "/" segment ) | ||||
| path-abs = "/" [ segment-nz *( "/" segment ) ] | ||||
| path-noscheme = segment-nzc *( "/" segment ) | ||||
| path-rootless = segment-nz *( "/" segment ) | ||||
| path-empty = 0<pchar> | ||||
| segment = *pchar | segment = *pchar | |||
| segment-nz = 1*pchar | ||||
| segment-nzc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | ||||
| pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | ||||
| query = *( pchar / "/" / "?" ) | query = *( pchar / "/" / "?" ) | |||
| fragment = *( pchar / "/" / "?" ) | fragment = *( pchar / "/" / "?" ) | |||
| pct-encoded = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | ||||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| reserved = gen-delims / sub-delims | reserved = gen-delims / sub-delims | |||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |||
| / "*" / "+" / "," / ";" / "=" | / "*" / "+" / "," / ";" / "=" | |||
| Appendix B. Parsing a URI Reference with a Regular Expression | Appendix B. Parsing a URI Reference with a Regular Expression | |||
| Since the "first-match-wins" algorithm is identical to the "greedy" | Since the "first-match-wins" algorithm is identical to the "greedy" | |||
| disambiguation method used by POSIX regular expressions, it is | disambiguation method used by POSIX regular expressions, it is | |||
| natural and commonplace to use a regular expression for parsing the | natural and commonplace to use a regular expression for parsing the | |||
| potential five components of a URI reference. | potential five components of a URI reference. | |||
| The following line is the regular expression for breaking down a | The following line is the regular expression for breaking down a | |||
| well-formed URI reference into its components. | well-formed URI reference into its components. | |||
| ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | |||
| skipping to change at page 53, line 5 ¶ | skipping to change at page 51, line 5 ¶ | |||
| scheme = $2 | scheme = $2 | |||
| authority = $4 | authority = $4 | |||
| path = $5 | path = $5 | |||
| query = $7 | query = $7 | |||
| fragment = $9 | fragment = $9 | |||
| and, going in the opposite direction, we can recreate a URI reference | and, going in the opposite direction, we can recreate a URI reference | |||
| from its components using the algorithm of Section 5.3. | from its components using the algorithm of Section 5.3. | |||
| Appendix C. Delimiting a URI in Context | Appendix C. Delimiting a URI in Context | |||
| URIs are often transmitted through formats that do not provide a | URIs are often transmitted through formats that do not provide a | |||
| clear context for their interpretation. For example, there are many | clear context for their interpretation. For example, there are many | |||
| occasions when a URI is included in plain text; examples include text | occasions when a URI is included in plain text; examples include text | |||
| sent in electronic mail, USENET news messages, and, most importantly, | sent in electronic mail, USENET news messages, and, most importantly, | |||
| printed on paper. In such cases, it is important to be able to | printed on paper. In such cases, it is important to be able to | |||
| delimit the URI from the rest of the text, and in particular from | delimit the URI from the rest of the text, and in particular from | |||
| punctuation marks that might be mistaken for part of the URI. | punctuation marks that might be mistaken for part of the URI. | |||
| In practice, URIs are delimited in a variety of ways, but usually | In practice, URIs are delimited in a variety of ways, but usually | |||
| skipping to change at page 54, line 4 ¶ | skipping to change at page 52, line 13 ¶ | |||
| to recognize and strip both delimiters and embedded whitespace. | to recognize and strip both delimiters and embedded whitespace. | |||
| For example, the text: | For example, the text: | |||
| Yes, Jim, I found it under "http://www.w3.org/Addressing/", | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |||
| but you can probably pick it up from <ftp://foo.example. | but you can probably pick it up from <ftp://foo.example. | |||
| com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |||
| ietf/uri/historical.html#WARNING>. | ietf/uri/historical.html#WARNING>. | |||
| contains the URI references | contains the URI references | |||
| http://www.w3.org/Addressing/ | http://www.w3.org/Addressing/ | |||
| ftp://foo.example.com/rfc/ | ftp://foo.example.com/rfc/ | |||
| http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | |||
| Appendix D. Summary of Non-editorial Changes | Appendix D. Summary of Non-editorial Changes | |||
| D.1 Additions | D.1 Additions | |||
| IPv6 (and later) literals have been added to the list of possible | IPv6 (and later) literals have been added to the list of possible | |||
| identifiers for the host portion of an authority component, as | identifiers for the host portion of an authority component, as | |||
| described by [RFC2732], with the addition of "[" and "]" to the | described by [RFC2732], with the addition of "[" and "]" to the | |||
| reserved set and a version flag to anticipate future versions of IP | reserved set and a version flag to anticipate future versions of IP | |||
| literals. Square brackets are now specified as reserved within the | literals. Square brackets are now specified as reserved within the | |||
| authority component and not allowed outside their use as delimiters | authority component and not allowed outside their use as delimiters | |||
| for an IP literal within host. In order to make this change without | for an IP literal within host. In order to make this change without | |||
| changing the technical definition of the path, query, and fragment | changing the technical definition of the path, query, and fragment | |||
| components, those rules were redefined to directly specify the | components, those rules were redefined to directly specify the | |||
| characters allowed rather than be defined in terms of uric. | characters allowed rather than be defined in terms of uric. | |||
| Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | |||
| address, which unfortunately lacks an ABNF description of | address, which unfortunately lacks an ABNF description of | |||
| IPv6address, we created a new ABNF rule for IPv6address that matches | IPv6address, we created a new ABNF rule for IPv6address that matches | |||
| the text representations defined by Section 2.2 of [RFC3513]. | the text representations defined by Section 2.2 of [RFC3513]. | |||
| Likewise, the definition of IPv4address has been improved in order to | Likewise, the definition of IPv4address has been improved in order to | |||
| limit each decimal octet to the range 0-255. | limit each decimal octet to the range 0-255. | |||
| Section 6 on URI normalization and comparison has been | Section 6 on URI normalization and comparison has been | |||
| completely rewritten and extended using input from Tim Bray and | completely rewritten and extended using input from Tim Bray and | |||
| discussion within the W3C Technical Architecture Group. | discussion within the W3C Technical Architecture Group. | |||
| An ABNF rule for URI has been introduced to correspond to the common | An ABNF rule for URI has been introduced to correspond to the common | |||
| usage of the term: an absolute URI with optional fragment. | usage of the term: an absolute URI with optional fragment. | |||
| D.2 Modifications from RFC 2396 | D.2 Modifications from RFC 2396 | |||
| The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234]. | |||
| This change required all rule names that formerly included underscore | This change required all rule names that formerly included underscore | |||
| characters to be renamed with a dash instead. | characters to be renamed with a dash instead. | |||
| Section 2 on characters has been rewritten to explain what characters | Section 2 on characters has been rewritten to explain what characters | |||
| are reserved, when they are reserved, and why they are reserved even | are reserved, when they are reserved, and why they are reserved even | |||
| when not used as delimiters by the generic syntax. The mark | when not used as delimiters by the generic syntax. The mark | |||
| characters that are typically unsafe to decode, including the | characters that are typically unsafe to decode, including the | |||
| exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open | exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open | |||
| skipping to change at page 56, line 7 ¶ | skipping to change at page 53, line 27 ¶ | |||
| set in order to clarify the distinction between reserved and | set in order to clarify the distinction between reserved and | |||
| unreserved and hopefully answer the most common question of scheme | unreserved and hopefully answer the most common question of scheme | |||
| designers. Likewise, the section on percent-encoded characters has | designers. Likewise, the section on percent-encoded characters has | |||
| been rewritten, and URI normalizers are now given license to decode | been rewritten, and URI normalizers are now given license to decode | |||
| any percent-encoded octets corresponding to unreserved characters. | any percent-encoded octets corresponding to unreserved characters. | |||
| In general, the terms "escaped" and "unescaped" have been replaced | In general, the terms "escaped" and "unescaped" have been replaced | |||
| with "percent-encoded" and "decoded", respectively, to reduce | with "percent-encoded" and "decoded", respectively, to reduce | |||
| confusion with other forms of escape mechanisms. | confusion with other forms of escape mechanisms. | |||
| The ABNF for URI and URI-reference has been redesigned to make them | The ABNF for URI and URI-reference has been redesigned to make them | |||
| more friendly to LALR parsers and significantly reduce complexity. As | more friendly to LALR parsers and reduce complexity. As a result, the | |||
| a result, the layout form of syntax description has been removed, | layout form of syntax description has been removed, along with the | |||
| along with the uric, uric_no_slash, hier_part, opaque_part, net_path, | uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, | |||
| abs_path, rel_path, path_segments, rel_segment, and mark rules. All | path_segments, rel_segment, and mark rules. All references to | |||
| references to "opaque" URIs have been replaced with a better | "opaque" URIs have been replaced with a better description of how the | |||
| description of how the path component may be opaque to hierarchy. The | path component may be opaque to hierarchy. The ambiguity regarding | |||
| ambiguity regarding the parsing of URI-reference as a URI or a | the parsing of URI-reference as a URI or a relative-URI with a colon | |||
| relative-URI with a colon in the first segment is now explained and | in the first segment has been eliminated through the use of five | |||
| disambiguated in the section defining relative-URI. | separate path matching rules. | |||
| The fragment identifier has been moved back into the section on | The fragment identifier has been moved back into the section on | |||
| generic syntax components and within the URI and relative-URI rules, | generic syntax components and within the URI and relative-URI rules, | |||
| though it remains excluded from absolute-URI. The number sign ("#") | though it remains excluded from absolute-URI. The number sign ("#") | |||
| character has been moved back to the reserved set as a result of | character has been moved back to the reserved set as a result of | |||
| reintegrating the fragment syntax. | reintegrating the fragment syntax. | |||
| The ABNF has been corrected to allow a relative path to be empty. | The ABNF has been corrected to allow a relative path to be empty. | |||
| This also allows an absolute-URI to consist of nothing after the | This also allows an absolute-URI to consist of nothing after the | |||
| "scheme:", as is present in practice with the "dav:" namespace | "scheme:", as is present in practice with the "dav:" namespace | |||
| [RFC2518] and the "about:" scheme used internally by many WWW browser | [RFC2518] and the "about:" scheme used internally by many WWW browser | |||
| implementations. The ambiguity regarding the boundary between | implementations. The ambiguity regarding the boundary between | |||
| authority and path is now explained and disambiguated in the same | authority and path has been eliminated through the use of five | |||
| section. | separate path matching rules. | |||
| Registry-based naming authorities that use the generic syntax are now | Registry-based naming authorities that use the generic syntax are now | |||
| defined within the host rule and limited to 255 path characters. This | defined within the host rule and limited to 255 path characters. This | |||
| change allows current implementations, where whatever name provided | change allows current implementations, where whatever name provided | |||
| is simply fed to the local name resolution mechanism, to be | is simply fed to the local name resolution mechanism, to be | |||
| consistent with the specification and removes the need to re-specify | consistent with the specification and removes the need to re-specify | |||
| DNS name formats here. It also allows the host component to contain | DNS name formats here. It also allows the host component to contain | |||
| percent-encoded octets, which is necessary to enable | percent-encoded octets, which is necessary to enable | |||
| internationalized domain names to be provided in URIs, processed in | internationalized domain names to be provided in URIs, processed in | |||
| their native character encodings at the application layers above URI | their native character encodings at the application layers above URI | |||
| skipping to change at page 58, line 10 ¶ | skipping to change at page 54, line 10 ¶ | |||
| paths in order to match common implementations and improve the | paths in order to match common implementations and improve the | |||
| normalization of URIs in practice. This change only impacts the | normalization of URIs in practice. This change only impacts the | |||
| parsing of abnormal references and same-scheme references wherein | parsing of abnormal references and same-scheme references wherein | |||
| the base URI has a non-hierarchical path. | the base URI has a non-hierarchical path. | |||
| Index | Index | |||
| A | A | |||
| ABNF 10 | ABNF 10 | |||
| absolute 25 | absolute 25 | |||
| absolute-path 24 | absolute-path 25 | |||
| absolute-URI 25 | absolute-URI 25 | |||
| access 7 | access 8 | |||
| authority 15, 16 | authority 15, 16 | |||
| B | B | |||
| base URI 27 | base URI 27 | |||
| C | C | |||
| characters 11 | character encoding 4 | |||
| character 4 | ||||
| characters 10 | ||||
| coded character set 4 | ||||
| D | D | |||
| dec-octet 18 | dec-octet 19 | |||
| dereference 7 | dereference 8 | |||
| dot-segments 20 | dot-segments 21 | |||
| F | F | |||
| fragment 22 | fragment 15, 23 | |||
| G | G | |||
| gen-delims 12 | gen-delims 11 | |||
| generic syntax 5 | generic syntax 6 | |||
| H | H | |||
| h16 17 | h16 18 | |||
| hier-part 15 | ||||
| hierarchical 9 | hierarchical 9 | |||
| host 17 | host 17 | |||
| I | I | |||
| identifier 5 | identifier 5 | |||
| IP-literal 17 | IP-literal 18 | |||
| IPv4 18 | IPv4 19 | |||
| IPv4address 18 | IPv4address 19 | |||
| IPv6 17 | IPv6 18 | |||
| IPv6address 17 | IPv6address 18 | |||
| IPvFuture 17 | IPvFuture 18 | |||
| L | L | |||
| locator 6 | locator 6 | |||
| ls32 17 | ls32 18 | |||
| M | M | |||
| merge 30 | merge 30 | |||
| N | N | |||
| name 6 | name 6 | |||
| network-path 24 | network-path 25 | |||
| P | P | |||
| path 15, 20 | path 15, 21 | |||
| pchar 20 | path-abempty 21 | |||
| path-abs 21 | ||||
| path-empty 21 | ||||
| path-noscheme 21 | ||||
| path-rootless 21 | ||||
| path-abempty 15 | ||||
| path-abs 15 | ||||
| path-empty 15 | ||||
| path-rootless 15 | ||||
| pchar 21 | ||||
| pct-encoded 11 | pct-encoded 11 | |||
| percent-encoding 11 | percent-encoding 11 | |||
| port 20 | port 20 | |||
| Q | Q | |||
| query 21 | query 15, 22 | |||
| R | R | |||
| reg-name 19 | reg-name 19 | |||
| registered name 19 | registered name 19 | |||
| relative 9, 27 | relative 9, 27 | |||
| relative-path 24 | relative-path 25 | |||
| relative-URI 24 | relative-URI 25 | |||
| remove_dot_segments 30 | remove_dot_segments 30, 31 | |||
| representation 8 | representation 8 | |||
| reserved 12 | reserved 11 | |||
| resolution 7, 27 | resolution 8, 27 | |||
| resource 4 | resource 4 | |||
| retrieval 8 | retrieval 8 | |||
| S | S | |||
| same-document 25 | same-document 25 | |||
| sameness 8 | sameness 8 | |||
| scheme 15 | scheme 15, 15 | |||
| segment 20 | segment 21 | |||
| sub-delims 12 | segment-nz 21 | |||
| suffix 25 | segment-nzc 21 | |||
| sub-delims 11 | ||||
| suffix 26 | ||||
| T | T | |||
| transcription 6 | transcription 7 | |||
| U | U | |||
| uniform 4 | uniform 4 | |||
| unreserved 12 | unreserved 12 | |||
| URI grammar | URI grammar | |||
| absolute-URI 25 | absolute-URI 25 | |||
| ALPHA 10 | ALPHA 10 | |||
| authority 15, 16 | authority 15, 16 | |||
| CR 10 | CR 10 | |||
| CTL 10 | dec-octet 19 | |||
| dec-octet 18 | ||||
| DIGIT 10 | DIGIT 10 | |||
| DQUOTE 10 | DQUOTE 10 | |||
| fragment 15, 22, 24 | fragment 15, 23, 25 | |||
| gen-delims 12 | gen-delims 11 | |||
| h16 18 | h16 18 | |||
| HEXDIG 10 | HEXDIG 10 | |||
| hier-part 15 | ||||
| host 16, 17 | host 16, 17 | |||
| IP-literal 17 | IP-literal 18 | |||
| IPv4address 18 | IPv4address 19 | |||
| IPv6address 17, 18 | IPv6address 18 | |||
| IPvFuture 17 | IPvFuture 18 | |||
| LF 10 | LF 10 | |||
| ls32 18 | ls32 18 | |||
| mark 12 | mark 12 | |||
| OCTET 10 | OCTET 10 | |||
| path 15 | path 21 | |||
| path-segments 20 | path-abempty 15, 21 | |||
| pchar 20, 21, 22 | path-abs 15, 21 | |||
| path-empty 15, 21 | ||||
| path-noscheme 21 | ||||
| path-rootless 15, 21 | ||||
| pchar 21, 22, 23 | ||||
| pct-encoded 11 | pct-encoded 11 | |||
| port 16, 20 | port 16, 20 | |||
| query 15, 21, 24, 25 | query 15, 22, 25 | |||
| reg-name 19 | reg-name 19 | |||
| relative-URI 24, 24 | relative-URI 24, 25 | |||
| reserved 12 | reserved 11 | |||
| scheme 15, 15, 25 | scheme 15, 16, 25 | |||
| segment 20 | segment 21 | |||
| segment-nz 21 | ||||
| segment-nzc 21 | ||||
| SP 10 | SP 10 | |||
| sub-delims 12 | sub-delims 11 | |||
| unreserved 12 | unreserved 12 | |||
| URI 15, 24 | URI 15, 24 | |||
| URI-reference 24 | URI-reference 24 | |||
| userinfo 16, 16 | userinfo 16, 17 | |||
| URI 15 | URI 15 | |||
| URI-reference 24 | URI-reference 24 | |||
| URL 6 | URL 6 | |||
| URN 6 | URN 6 | |||
| userinfo 16 | userinfo 17 | |||
| Intellectual Property Statement | Intellectual Property Statement | |||
| The IETF takes no position regarding the validity or scope of any | The IETF takes no position regarding the validity or scope of any | |||
| intellectual property or other rights that might be claimed to | Intellectual Property Rights or other rights that might be claimed to | |||
| pertain to the implementation or use of the technology described in | pertain to the implementation or use of the technology described in | |||
| this document or the extent to which any license under such rights | this document or the extent to which any license under such rights | |||
| might or might not be available; neither does it represent that it | might or might not be available; nor does it represent that it has | |||
| has made any effort to identify any such rights. Information on the | made any independent effort to identify any such rights. Information | |||
| IETF's procedures with respect to rights in standards-track and | on the IETF's procedures with respect to rights in IETF Documents can | |||
| standards-related documentation can be found in BCP-11. Copies of | be found in BCP 78 and BCP 79. | |||
| claims of rights made available for publication and any assurances of | ||||
| licenses to be made available, or the result of an attempt made to | Copies of IPR disclosures made to the IETF Secretariat and any | |||
| obtain a general license or permission for the use of such | assurances of licenses to be made available, or the result of an | |||
| proprietary rights by implementors or users of this specification can | attempt made to obtain a general license or permission for the use of | |||
| be obtained from the IETF Secretariat. | such proprietary rights by implementers or users of this | |||
| specification can be obtained from the IETF on-line IPR repository at | ||||
| http://www.ietf.org/ipr. | ||||
| The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
| copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
| rights which may cover technology that may be required to practice | rights that may cover technology that may be required to implement | |||
| this standard. Please address the information to the IETF Executive | this standard. Please address the information to the IETF at | |||
| Director. | ietf-ipr@ietf.org. | |||
| Full Copyright Statement | ||||
| Copyright (C) The Internet Society (2004). All Rights Reserved. | Disclaimer of Validity | |||
| This document and translations of it may be copied and furnished to | This document and the information contained herein are provided on an | |||
| others, and derivative works that comment on or otherwise explain it | "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | |||
| or assist in its implementation may be prepared, copied, published | OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET | |||
| and distributed, in whole or in part, without restriction of any | ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, | |||
| kind, provided that the above copyright notice and this paragraph are | INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE | |||
| included on all such copies and derivative works. However, this | INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED | |||
| document itself may not be modified in any way, such as by removing | WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | |||
| the copyright notice or references to the Internet Society or other | ||||
| Internet organizations, except as needed for the purpose of | ||||
| developing Internet standards in which case the procedures for | ||||
| copyrights defined in the Internet Standards process must be | ||||
| followed, or as required to translate it into languages other than | ||||
| English. | ||||
| The limited permissions granted above are perpetual and will not be | Copyright Statement | |||
| revoked by the Internet Society or its successors or assignees. | ||||
| This document and the information contained herein is provided on an | Copyright (C) The Internet Society (2004). This document is subject | |||
| "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | to the rights, licenses and restrictions contained in BCP 78, and | |||
| TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | except as set forth therein, the authors retain all their rights. | |||
| BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | ||||
| HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | ||||
| MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||
| Acknowledgment | Acknowledgment | |||
| Funding for the RFC Editor function is currently provided by the | Funding for the RFC Editor function is currently provided by the | |||
| Internet Society. | Internet Society. | |||
| End of changes. 214 change blocks. | ||||
| 586 lines changed or deleted | 753 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
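
As an illustration of Appendix B, the following Python sketch applies the regular expression shown there and then rebuilds the reference from its components. The names URI_REFERENCE, split_uri_reference, and recompose are hypothetical, chosen only for this example. The recomposition order (scheme, authority, path, query, fragment) is an assumption about the algorithm the draft refers to in Section 5.3, and a group that does not participate in the match is treated here as an undefined component.

```python
import re

# Regular expression from Appendix B. Every part is optional, so it
# matches any input string, including the empty string.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref):
    """Split a URI reference into its five potential components.

    A component that is absent is reported as None; a component that is
    present but empty (for example the query in "foo?") is reported as "".
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def recompose(parts):
    """Rebuild a URI reference string from its components.

    Assumed order: scheme, authority, path, query, fragment, with each
    delimiter emitted only when the component is defined (not None).
    """
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["authority"] is not None:
        result += "//" + parts["authority"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result


if __name__ == "__main__":
    ref = "http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING"
    parts = split_uri_reference(ref)
    print(parts)
    print(recompose(parts) == ref)
```

Keeping None distinct from the empty string is what lets the round trip preserve references such as "foo?" or "dav:", where a delimiter is present but the component after it is empty. The expression does not validate its input; it only separates the components of a reference that is assumed to be well-formed.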