idnits 2.17.1 draft-fielding-uri-rfc2396bis-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3667, Section 5.1 on line 17. -- Found old boilerplate from RFC 3978, Section 5.5 on line 2646. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2630. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2636. ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line 2652), which is fine, but *also* found old RFC 2026, Section 10.4C, paragraph 1 text on line 35. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate instead of verbatim RFC 3978 boilerplate. After 6 May 2005, submission of drafts without verbatim RFC 3978 boilerplate is not accepted. The following non-3978 patterns matched text found in the document. 
That text should be removed or replaced: By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the document. -- The draft header indicates that this document obsoletes RFC2732, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC2396, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document obsoletes RFC1808, but the abstract doesn't seem to mention this, which it should. -- The draft header indicates that this document updates RFC1738, but the abstract doesn't seem to mention this, which it should. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 700 has weird spacing: '... query frag...' 
(Using the creation date from RFC1738, updated by this document, for RFC5378 checks: 1994-12-01) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (April 16, 2004) is 7314 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2277' is defined on line 2076, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 
'ASCII' ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234) -- Obsolete informational reference (is this intentional?): RFC 1738 (Obsoleted by RFC 4248, RFC 4266) -- Obsolete informational reference (is this intentional?): RFC 1808 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2141 (Obsoleted by RFC 8141) -- Obsolete informational reference (is this intentional?): RFC 2396 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 2518 (Obsoleted by RFC 4918) -- Obsolete informational reference (is this intentional?): RFC 2717 (Obsoleted by RFC 4395) -- Obsolete informational reference (is this intentional?): RFC 2718 (Obsoleted by RFC 4395) -- Obsolete informational reference (is this intentional?): RFC 2732 (Obsoleted by RFC 3986) -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) -- Obsolete informational reference (is this intentional?): RFC 3513 (Obsoleted by RFC 4291) Summary: 11 errors (**), 0 flaws (~~), 5 warnings (==), 21 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group T. Berners-Lee 2 Internet-Draft W3C/MIT 3 Updates: 1738 (if approved) R. Fielding 4 Obsoletes: 2732, 2396, 1808 (if approved) Day Software 5 L. Masinter 6 Expires: October 15, 2004 Adobe 7 April 16, 2004 9 Uniform Resource Identifier (URI): Generic Syntax 10 draft-fielding-uri-rfc2396bis-05 12 Status of this Memo 14 By submitting this Internet-Draft, I certify that any applicable 15 patent or other IPR claims of which I am aware have been disclosed, 16 and any of which I become aware will be disclosed, in accordance with 17 RFC 3668. 19 Internet-Drafts are working documents of the Internet Engineering 20 Task Force (IETF), its areas, and its working groups. 
Note that other 21 groups may also distribute working documents as Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 . 30 The list of Internet-Draft Shadow Directories can be accessed at 31 . 33 Copyright Notice 35 Copyright (C) The Internet Society (2004). All Rights Reserved. 37 Abstract 39 A Uniform Resource Identifier (URI) is a compact sequence of 40 characters for identifying an abstract or physical resource. This 41 specification defines the generic URI syntax and a process for 42 resolving URI references that might be in relative form, along with 43 guidelines and security considerations for the use of URIs on the 44 Internet. 46 The URI syntax defines a grammar that is a superset of all valid 47 URIs, such that an implementation can parse the common components of 48 a URI reference without knowing the scheme-specific requirements of 49 every possible identifier. This specification does not define a 50 generative grammar for URIs; that task is performed by the individual 51 specifications of each URI scheme. 53 Editorial Note 55 Discussion of this draft and comments to the editors should be sent 56 to the uri@w3.org mailing list. An issues list and version history 57 is available at . 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.1 Overview of URIs . . . . . . . . . . . . . . . . . . . . . 4 63 1.1.1 Generic Syntax . . . . . . . . . . . . . . . . . . . . 6 64 1.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . 6 65 1.1.3 URI, URL, and URN . . . . . . . . . . . . . . . . . . 6 66 1.2 Design Considerations . . . . . . . . . . . . . . . . . . 7 67 1.2.1 Transcription . . . . . . . . . . . . . . . . . . . . 
7 68 1.2.2 Separating Identification from Interaction . . . . . . 8 69 1.2.3 Hierarchical Identifiers . . . . . . . . . . . . . . . 9 70 1.3 Syntax Notation . . . . . . . . . . . . . . . . . . . . . 10 71 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 10 72 2.1 Percent-Encoding . . . . . . . . . . . . . . . . . . . . . 11 73 2.2 Reserved Characters . . . . . . . . . . . . . . . . . . . 11 74 2.3 Unreserved Characters . . . . . . . . . . . . . . . . . . 12 75 2.4 When to Encode or Decode . . . . . . . . . . . . . . . . . 13 76 2.5 Identifying Data . . . . . . . . . . . . . . . . . . . . . 13 77 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 15 78 3.1 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 15 79 3.2 Authority . . . . . . . . . . . . . . . . . . . . . . . . 16 80 3.2.1 User Information . . . . . . . . . . . . . . . . . . . 17 81 3.2.2 Host . . . . . . . . . . . . . . . . . . . . . . . . . 17 82 3.2.3 Port . . . . . . . . . . . . . . . . . . . . . . . . . 20 83 3.3 Path . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 84 3.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . 22 85 3.5 Fragment . . . . . . . . . . . . . . . . . . . . . . . . . 23 86 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 87 4.1 URI Reference . . . . . . . . . . . . . . . . . . . . . . 24 88 4.2 Relative URI . . . . . . . . . . . . . . . . . . . . . . . 25 89 4.3 Absolute URI . . . . . . . . . . . . . . . . . . . . . . . 25 90 4.4 Same-document Reference . . . . . . . . . . . . . . . . . 25 91 4.5 Suffix Reference . . . . . . . . . . . . . . . . . . . . . 26 93 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 27 94 5.1 Establishing a Base URI . . . . . . . . . . . . . . . . . 27 95 5.1.1 Base URI Embedded in Content . . . . . . . . . . . . . 27 96 5.1.2 Base URI from the Encapsulating Entity . . . . . . . . 28 97 5.1.3 Base URI from the Retrieval URI . . . . . . . . . . . 
28 98 5.1.4 Default Base URI . . . . . . . . . . . . . . . . . . . 28 99 5.2 Relative Resolution . . . . . . . . . . . . . . . . . . . 29 100 5.2.1 Pre-parse the Base URI . . . . . . . . . . . . . . . . 29 101 5.2.2 Transform References . . . . . . . . . . . . . . . . . 29 102 5.2.3 Merge Paths . . . . . . . . . . . . . . . . . . . . . 30 103 5.2.4 Remove Dot Segments . . . . . . . . . . . . . . . . . 31 104 5.3 Component Recomposition . . . . . . . . . . . . . . . . . 33 105 5.4 Reference Resolution Examples . . . . . . . . . . . . . . 34 106 5.4.1 Normal Examples . . . . . . . . . . . . . . . . . . . 34 107 5.4.2 Abnormal Examples . . . . . . . . . . . . . . . . . . 34 108 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 36 109 6.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . 36 110 6.2 Comparison Ladder . . . . . . . . . . . . . . . . . . . . 37 111 6.2.1 Simple String Comparison . . . . . . . . . . . . . . . 37 112 6.2.2 Syntax-based Normalization . . . . . . . . . . . . . . 37 113 6.2.3 Scheme-based Normalization . . . . . . . . . . . . . . 38 114 6.2.4 Protocol-based Normalization . . . . . . . . . . . . . 39 115 6.3 Canonical Form . . . . . . . . . . . . . . . . . . . . . . 40 116 7. Security Considerations . . . . . . . . . . . . . . . . . . . 40 117 7.1 Reliability and Consistency . . . . . . . . . . . . . . . 40 118 7.2 Malicious Construction . . . . . . . . . . . . . . . . . . 41 119 7.3 Back-end Transcoding . . . . . . . . . . . . . . . . . . . 41 120 7.4 Rare IP Address Formats . . . . . . . . . . . . . . . . . 42 121 7.5 Sensitive Information . . . . . . . . . . . . . . . . . . 43 122 7.6 Semantic Attacks . . . . . . . . . . . . . . . . . . . . . 43 123 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 44 124 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 45 125 9.1 Normative References . . . . . . . . . . . . . . . . . . . . 45 126 9.2 Informative References . . . . . . . . . . . . . . . 
. . . . 45 127 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 47 128 A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 48 129 B. Parsing a URI Reference with a Regular Expression . . . . . . 50 130 C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51 131 D. Summary of Non-editorial Changes . . . . . . . . . . . . . . . 52 132 D.1 Additions . . . . . . . . . . . . . . . . . . . . . . . . 52 133 D.2 Modifications from RFC 2396 . . . . . . . . . . . . . . . 53 134 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 135 Intellectual Property and Copyright Statements . . . . . . . . 58 137 1. Introduction 139 A Uniform Resource Identifier (URI) provides a simple and extensible 140 means for identifying a resource. This specification of URI syntax 141 and semantics is derived from concepts introduced by the World Wide 142 Web global information initiative, whose use of such identifiers 143 dates from 1990 and is described in "Universal Resource Identifiers 144 in WWW" [RFC1630], and is designed to meet the recommendations laid 145 out in "Functional Recommendations for Internet Resource Locators" 146 [RFC1736] and "Functional Requirements for Uniform Resource Names" 147 [RFC1737]. 149 This document obsoletes [RFC2396], which merged "Uniform Resource 150 Locators" [RFC1738] and "Relative Uniform Resource Locators" 151 [RFC1808] in order to define a single, generic syntax for all URIs. 152 It excludes those portions of RFC 1738 that defined the specific 153 syntax of individual URI schemes; those portions will be updated as 154 separate documents. The process for registration of new URI schemes 155 is defined separately by [RFC2717]. Advice for designers of new URI 156 schemes can be found in [RFC2718]. 158 All significant changes from RFC 2396 are noted in Appendix D. 
160 This specification uses the terms "character" and "coded character 161 set" in accordance with the definitions provided in [RFC2978], and 162 "character encoding" in place of what [RFC2978] refers to as a 163 "charset". 165 1.1 Overview of URIs 167 URIs are characterized as follows: 169 Uniform 170 Uniformity provides several benefits: it allows different types of 171 resource identifiers to be used in the same context, even when the 172 mechanisms used to access those resources may differ; it allows 173 uniform semantic interpretation of common syntactic conventions 174 across different types of resource identifiers; it allows 175 introduction of new types of resource identifiers without 176 interfering with the way that existing identifiers are used; and, 177 it allows the identifiers to be reused in many different contexts, 178 thus permitting new applications or protocols to leverage a 179 pre-existing, large, and widely-used set of resource identifiers. 181 Resource 182 Anything that has been named or described can be a resource. 183 Familiar examples include an electronic document, an image, a 184 service (e.g., "today's weather report for Los Angeles"), and a 185 collection of other resources. A resource is not necessarily 186 accessible via the Internet; e.g., human beings, corporations, and 187 bound books in a library can also be resources. Likewise, abstract 188 concepts can be resources, such as the operators and operands of a 189 mathematical equation, the types of a relationship (e.g., "parent" 190 or "employee"), or numeric values (e.g., zero, one, and infinity). 191 These things are called resources because they each can be 192 considered a source of supply or support, or an available means, 193 for some system, where such systems may be as diverse as the World 194 Wide Web, a filesystem, an ontological graph, a theorem prover, or 195 some other form of system for the direct or indirect observation 196 and/or manipulation of resources. 
Note that "supply" is not 197 necessary for a thing to be considered a resource: the ability to 198 simply refer to that thing is often sufficient to support the 199 operation of a given system. 201 Identifier 202 An identifier embodies the information required to distinguish 203 what is being identified from all other things within its scope of 204 identification. Our use of the terms "identify" and "identifying" 205 refers to this process of distinguishing from many to one; they 206 should not be mistaken as an assumption that the identifier 207 defines the identity of what is referenced, though that may be the 208 case for some identifiers. 210 A URI is an identifier that consists of a sequence of characters 211 matching the syntax rule named in Section 3. A URI can be used 212 to refer to a resource. This specification does not place any limits 213 on the nature of a resource, the reasons why an application might 214 wish to refer to a resource, or the kinds of system that might use 215 URIs for the sake of identifying resources. 217 URIs have a global scope and must be interpreted consistently 218 regardless of context, though the result of that interpretation may 219 be in relation to the end-user's context. For example, "http:// 220 localhost/" has the same interpretation for every user of that 221 reference, even though the network interface corresponding to 222 "localhost" may be different for each end-user: interpretation is 223 independent of access. However, an action made on the basis of that 224 reference will take place in relation to the end-user's context, 225 which implies that an action intended to refer to a single, globally 226 unique thing must use a URI that distinguishes that resource from all 227 other things. 
URIs that identify in relation to the end-user's local 228 context should only be used when the context itself is a defining 229 aspect of the resource, such as when an on-line Linux manual refers 230 to a file on the end-user's filesystem (e.g., "file:///etc/hosts"). 232 1.1.1 Generic Syntax 234 Each URI begins with a scheme name, as defined in Section 3.1, that 235 refers to a specification for assigning identifiers within that 236 scheme. As such, the URI syntax is a federated and extensible naming 237 system wherein each scheme's specification may further restrict the 238 syntax and semantics of identifiers using that scheme. 240 This specification defines those elements of the URI syntax that are 241 required of all URI schemes or are common to many URI schemes. It 242 thus defines the syntax and semantics that are needed to implement a 243 scheme-independent parsing mechanism for URI references, such that 244 the scheme-dependent handling of a URI can be postponed until the 245 scheme-dependent semantics are needed. Likewise, protocols and data 246 formats that make use of URI references can refer to this 247 specification as defining the range of syntax allowed for all URIs, 248 including those schemes that have yet to be defined. 250 A parser of the generic URI syntax is capable of parsing any URI 251 reference into its major components; once the scheme is determined, 252 further scheme-specific parsing can be performed on the components. 253 In other words, the URI generic syntax is a superset of the syntax of 254 all URI schemes. 256 1.1.2 Examples 258 The following examples illustrate URIs that are in common use. 260 ftp://ftp.is.co.za/rfc/rfc1808.txt 262 http://www.ietf.org/rfc/rfc2396.txt 264 mailto:John.Doe@example.com 266 news:comp.infosystems.www.servers.unix 268 telnet://melvyl.ucop.edu/ 270 1.1.3 URI, URL, and URN 272 A URI can be further classified as a locator, a name, or both. 
The 273 term "Uniform Resource Locator" (URL) refers to the subset of URIs 274 that, in addition to identifying a resource, provide a means of 275 locating the resource by describing its primary access mechanism 276 (e.g., its network "location"). The term "Uniform Resource Name" 277 (URN) has been used historically to refer to both URIs under the 278 "urn" scheme [RFC2141], which are required to remain globally unique 279 and persistent even when the resource ceases to exist or becomes 280 unavailable, and to any other URI with the properties of a name. 282 An individual scheme does not need to be classified as being just one 283 of "name" or "locator". Instances of URIs from any given scheme may 284 have the characteristics of names or locators or both, often 285 depending on the persistence and care in the assignment of 286 identifiers by the naming authority, rather than any quality of the 287 scheme. Future specifications and related documentation should use 288 the general term "URI", rather than the more restrictive terms URL 289 and URN [RFC3305]. 291 1.2 Design Considerations 293 1.2.1 Transcription 295 The URI syntax has been designed with global transcription as one of 296 its main considerations. A URI is a sequence of characters from a 297 very limited set: the letters of the basic Latin alphabet, digits, 298 and a few special characters. A URI may be represented in a variety 299 of ways: e.g., ink on paper, pixels on a screen, or a sequence of 300 character encoding octets. The interpretation of a URI depends only 301 on the characters used and not how those characters are represented 302 in a network protocol. 304 The goal of transcription can be described by a simple scenario. 305 Imagine two colleagues, Sam and Kim, sitting in a pub at an 306 international conference and exchanging research ideas. Sam asks Kim 307 for a location to get more information, so Kim writes the URI for the 308 research site on a napkin. 
Upon returning home, Sam takes out the 309 napkin and types the URI into a computer, which then retrieves the 310 information to which Kim referred. 312 There are several design considerations revealed by the scenario: 314 o A URI is a sequence of characters that is not always represented 315 as a sequence of octets. 317 o A URI might be transcribed from a non-network source, and thus 318 should consist of characters that are most likely to be able to be 319 entered into a computer, within the constraints imposed by 320 keyboards (and related input devices) across languages and 321 locales. 323 o A URI often needs to be remembered by people, and it is easier for 324 people to remember a URI when it consists of meaningful or 325 familiar components. 327 These design considerations are not always in alignment. For 328 example, it is often the case that the most meaningful name for a URI 329 component would require characters that cannot be typed into some 330 systems. The ability to transcribe a resource identifier from one 331 medium to another has been considered more important than having a 332 URI consist of the most meaningful of components. 334 In local or regional contexts and with improving technology, users 335 might benefit from being able to use a wider range of characters; 336 such use is not defined by this specification. Percent-encoded 337 octets (Section 2.1) may be used within a URI to represent characters 338 outside the range of the US-ASCII coded character set if such 339 representation is allowed by the scheme or by the protocol element in 340 which the URI is referenced; such a definition should specify the 341 character encoding used to map those characters to octets prior to 342 being percent-encoded for the URI. 344 1.2.2 Separating Identification from Interaction 346 A common misunderstanding of URIs is that they are only used to refer 347 to accessible resources. 
In fact, the URI alone only provides 348 identification; access to the resource is neither guaranteed nor 349 implied by the presence of a URI. Instead, an operation (if any) 350 associated with a URI reference is defined by the protocol element, 351 data format attribute, or natural language text in which it appears. 353 Given a URI, a system may attempt to perform a variety of operations 354 on the resource, as might be characterized by such words as "access", 355 "update", "replace", or "find attributes". Such operations are 356 defined by the protocols that make use of URIs, not by this 357 specification. However, we do use a few general terms for describing 358 common operations on URIs. URI "resolution" is the process of 359 determining an access mechanism and the appropriate parameters 360 necessary to dereference a URI; such resolution may require several 361 iterations. To use that access mechanism to perform an action on the 362 URI's resource is to "dereference" the URI. 364 When URIs are used within information systems to identify sources of 365 information, the most common form of URI dereference is "retrieval": 366 making use of a URI in order to retrieve a representation of its 367 associated resource. A "representation" is a sequence of octets, 368 along with representation metadata describing those octets, that 369 constitutes a record of the state of the resource at the time that 370 the representation is generated. Retrieval is achieved by a process 371 that might include using the URI as a cache key to check for a 372 locally cached representation, resolution of the URI to determine an 373 appropriate access mechanism (if any), and dereference of the URI for 374 the sake of applying a retrieval operation. Depending on the 375 protocols used to perform the retrieval, additional information might 376 be supplied about the resource (resource metadata) and its relation 377 to other resources. 
379 URI references in information systems are designed to be 380 late-binding: the result of an access is generally determined at the 381 time it is accessed and may vary over time or due to other aspects of 382 the interaction. Such references are created in order to be used 383 in the future: what is being identified is not some specific result 384 that was obtained in the past, but rather some characteristic that is 385 expected to be true for future results. In such cases, the resource 386 referred to by the URI is actually a sameness of characteristics as 387 observed over time, perhaps elucidated by additional comments or 388 assertions made by the resource provider. 390 Although many URI schemes are named after protocols, this does not 391 imply that use of such a URI will result in access to the resource 392 via the named protocol. URIs are often used simply for the sake of 393 identification. Even when a URI is used to retrieve a representation 394 of a resource, that access might be through gateways, proxies, 395 caches, and name resolution services that are independent of the 396 protocol associated with the scheme name, and the resolution of some 397 URIs may require the use of more than one protocol (e.g., both DNS 398 and HTTP are typically used to access an "http" URI's origin server 399 when a representation isn't found in a local cache). 401 1.2.3 Hierarchical Identifiers 403 The URI syntax is organized hierarchically, with components listed in 404 order of decreasing significance from left to right. For some URI 405 schemes, the visible hierarchy is limited to the scheme itself: 406 everything after the scheme component delimiter (":") is considered 407 opaque to URI processing. Other URI schemes make the hierarchy 408 explicit and visible to generic parsing algorithms. 
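As a rough illustration of the distinction drawn above, the following sketch runs a generic parser over two of the URIs from Section 1.1.2. It assumes Python's standard urllib.parse module as the generic parser, which is only one possible implementation and is not part of this specification; the helper name describe() is hypothetical.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Return (scheme, authority, path) as seen by a generic parser.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


# Hierarchy is explicit: authority and path are visible to the parser.
print(describe("ftp://ftp.is.co.za/rfc/rfc1808.txt"))

# Hierarchy stops at the scheme: the rest is treated as opaque data.
print(describe("mailto:John.Doe@example.com"))
```

In the first case the parser reports an authority and a slash-delimited path. In the second it reports only the scheme and an undivided remainder, leaving any further interpretation to scheme-specific processing.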
410 The generic syntax uses the slash ("/"), question mark ("?"), and 411 number sign ("#") characters for the purpose of delimiting components 412 that are significant to the generic parser's hierarchical 413 interpretation of an identifier. In addition to aiding the 414 readability of such identifiers through the consistent use of 415 familiar syntax, this uniform representation of hierarchy across 416 naming schemes allows scheme-independent references to be made 417 relative to that hierarchy. 419 It is often the case that a group or "tree" of documents has been 420 constructed to serve a common purpose, wherein the vast majority of 421 URIs in these documents point to resources within the tree rather 422 than outside of it. Similarly, documents located at a particular 423 site are much more likely to refer to other resources at that site 424 than to resources at remote sites. Relative referencing of URIs 425 allows document trees to be partially independent of their location 426 and access scheme. For instance, it is possible for a single set of 427 hypertext documents to be simultaneously accessible and traversable 428 via each of the "file", "http", and "ftp" schemes if the documents 429 refer to each other using relative references. Furthermore, such 430 document trees can be moved, as a whole, without changing any of the 431 relative references. 433 A relative URI reference (Section 4.2) refers to a resource by 434 describing the difference within a hierarchical name space between 435 the reference context and the target URI. The reference resolution 436 algorithm, presented in Section 5, defines how such a reference is 437 transformed to the target URI. Since relative references can only be 438 used within the context of a hierarchical URI, designers of new URI 439 schemes should use a syntax consistent with the generic syntax's 440 hierarchical components unless there are compelling reasons to forbid 441 relative referencing within that scheme. 
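The location independence described above can be sketched with a short example. It assumes Python's urllib.parse.urljoin as the resolver (one implementation of relative resolution, not the normative algorithm of Section 5); the host name example.org, the document paths, and the helper resolve_against() are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_against(bases, reference):
    """Resolve one relative reference against several base URIs.

    Hypothetical helper: returns the target URI for each base.
    """
    return [urljoin(base, reference) for base in bases]


# The same document tree, reachable through two different schemes.
bases = [
    "http://example.org/docs/a/index.html",
    "ftp://example.org/docs/a/index.html",
]

# One relative reference, unchanged, works under both bases.
for target in resolve_against(bases, "../b/page.html"):
    print(target)
```

The reference "../b/page.html" is never rewritten; only the base changes, which is why a tree of documents linked by relative references can be moved or served through another scheme as a whole.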
443 All URIs are parsed by generic syntax parsers when used. A URI scheme 444 that wishes to remain opaque to hierarchical processing must disallow 445 the use of slash and question mark characters. However, since a URI 446 reference is only modified by the generic parser if it contains a 447 dot-segment (a complete path segment of "." or "..", as described in 448 Section 3.3), URI schemes may safely use "/" for other purposes if 449 they do not allow dot-segments. 451 1.3 Syntax Notation 453 This specification uses the Augmented Backus-Naur Form (ABNF) 454 notation of [RFC2234], including the following core ABNF syntax rules 455 defined by that specification: ALPHA (letters), CR (carriage return), 456 DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal 457 digits), LF (line feed), and SP (space). The complete URI syntax is 458 collected in Appendix A. 460 2. Characters 462 The URI syntax provides a method of encoding data, presumably for the 463 sake of identifying a resource, as a sequence of characters. The URI 464 characters are, in turn, frequently encoded as octets for transport 465 or presentation. This specification does not mandate any particular 466 character encoding for mapping between URI characters and the octets 467 used to store or transmit those characters. When a URI appears in a 468 protocol element, the character encoding is defined by that protocol; 469 absent such a definition, a URI is assumed to be in the same 470 character encoding as the surrounding text. 472 The ABNF notation defines its terminal values to be non-negative 473 integers (codepoints) based on the US-ASCII coded character set 474 [ASCII]. Since a URI is a sequence of characters, we must invert 475 that relation in order to understand the URI syntax. Therefore, the 476 integer values used by the ABNF must be mapped back to their 477 corresponding characters via US-ASCII in order to complete the syntax 478 rules. 
480 A URI is composed from a limited set of characters consisting of 481 digits, letters, and a few graphic symbols. A reserved subset of 482 those characters may be used to delimit syntax components within a 483 URI, while the remaining characters, including both the unreserved 484 set and those reserved characters not acting as delimiters, define 485 each component's identifying data. 487 2.1 Percent-Encoding 489 A percent-encoding mechanism is used to represent a data octet in a 490 component when that octet's corresponding character is outside the 491 allowed set or is being used as a delimiter of, or within, the 492 component. A percent-encoded octet is encoded as a character triplet, 493 consisting of the percent character "%" followed by the two 494 hexadecimal digits representing that octet's numeric value. For 495 example, "%20" is the percent-encoding for the binary octet 496 "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space 497 character (SP). Section 2.4 describes when percent-encoding and 498 decoding are applied. 500 pct-encoded = "%" HEXDIG HEXDIG 502 The uppercase hexadecimal digits 'A' through 'F' are equivalent to 503 the lowercase digits 'a' through 'f', respectively. Two URIs that 504 differ only in the case of hexadecimal digits used in percent-encoded 505 octets are equivalent. For consistency, URI producers and 506 normalizers should use uppercase hexadecimal digits for all 507 percent-encodings. 509 2.2 Reserved Characters 511 URIs include components and subcomponents that are delimited by 512 characters in the "reserved" set. These characters are called 513 "reserved" because they may (or may not) be defined as delimiters by 514 the generic syntax, by each scheme-specific syntax, or by the 515 implementation-specific syntax of a URI's dereferencing algorithm. 
516 If data for a URI component would conflict with a reserved 517 character's purpose as a delimiter, then the conflicting data must be 518 percent-encoded before forming the URI. 520 reserved = gen-delims / sub-delims 521 gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" 523 sub-delims = "!" / "$" / "&" / "'" / "(" / ")" 524 / "*" / "+" / "," / ";" / "=" 526 The purpose of reserved characters is to provide a set of delimiting 527 characters that are distinguishable from other data within a URI. 528 URIs that differ in the replacement of a reserved character with its 529 corresponding percent-encoded octet are not equivalent. 530 Percent-encoding a reserved character, or decoding a percent-encoded 531 octet that corresponds to a reserved character, will change how the 532 URI is interpreted by most applications. Thus, characters in the 533 reserved set are protected from normalization and are therefore safe 534 to be used by scheme-specific and producer-specific algorithms for 535 delimiting data subcomponents within a URI. 537 A subset of the reserved characters (gen-delims) are used as 538 delimiters of the generic URI components described in Section 3. A 539 component's ABNF syntax rule will not use the reserved or gen-delims 540 rule names directly; instead, each syntax rule lists the characters 541 allowed within that component (i.e., not delimiting it) and any of 542 those characters that are also in the reserved set are "reserved" for 543 use as subcomponent delimiters within the component. Only the most 544 common subcomponents are defined by this specification; other 545 subcomponents may be defined by a URI scheme's specification, or by 546 the implementation-specific syntax of a URI's dereferencing 547 algorithm, provided that such subcomponents are delimited by 548 characters in the reserved set allowed within that component. 550 URI producing applications should percent-encode data octets that 551 correspond to characters in the reserved set. 
However, if a reserved 552 character is found in a URI component and no delimiting role is known 553 for that character, then it should be interpreted as representing the 554 data octet corresponding to that character's encoding in US-ASCII. 556 2.3 Unreserved Characters 558 Characters that are allowed in a URI but do not have a reserved 559 purpose are called unreserved. These include uppercase and lowercase 560 letters, decimal digits, hyphen, period, underscore, and tilde. 562 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" 564 URIs that differ in the replacement of an unreserved character with 565 its corresponding percent-encoded octet are equivalent: they identify 566 the same resource. However, percent-encoded unreserved characters 567 may change the result of some URI comparisons (Section 6), 568 potentially leading to incorrect or inefficient behavior. For 569 consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A 570 and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore 571 (%5F), or tilde (%7E) should not be created by URI producers and, 572 when found in a URI, should be decoded to their corresponding 573 unreserved character by URI normalizers. 575 2.4 When to Encode or Decode 577 Under normal circumstances, the only time that octets within a URI 578 are percent-encoded is during the process of producing the URI from 579 its component parts. It is during that process that an 580 implementation determines which of the reserved characters are to be 581 used as subcomponent delimiters and which can be safely used as data. 582 Once produced, a URI is always in its percent-encoded form. 584 When a URI is dereferenced, the components and subcomponents 585 significant to the scheme-specific dereferencing process (if any) 586 must be parsed and separated before the percent-encoded octets within 587 those components can be safely decoded, since otherwise the data may 588 be mistaken for component delimiters. 
The only exception is for 589 percent-encoded octets corresponding to characters in the unreserved 590 set, which can be decoded at any time. For example, the octet 591 corresponding to the tilde ("~") character is often encoded as "%7E" 592 by older URI processing software; the "%7E" can be replaced by "~" 593 without changing its interpretation. 595 Because the percent ("%") character serves as the indicator for 596 percent-encoded octets, it must be percent-encoded as "%25" in order 597 for that octet to be used as data within a URI. Implementations must 598 not percent-encode or decode the same string more than once, since 599 decoding an already decoded string might lead to misinterpreting a 600 percent data octet as the beginning of a percent-encoding, or vice 601 versa in the case of percent-encoding an already percent-encoded 602 string. 604 2.5 Identifying Data 606 URI characters provide identifying data for each of the URI 607 components, serving as an external interface for identification 608 between systems. Although the presence and nature of the URI 609 production interface is hidden from clients that use its URIs, and 610 thus beyond the scope of the interoperability requirements defined by 611 this specification, it is a frequent source of confusion and errors 612 in the interpretation of URI character issues. Implementers need to 613 be aware that there are multiple character encodings involved in the 614 production and transmission of URIs: local name and data encoding, 615 public interface encoding, URI character encoding, data format 616 encoding, and protocol encoding. 618 The first encoding of identifying data is the one in which the local 619 names or data are stored. URI producing applications (a.k.a., origin 620 servers) will typically use the local encoding as the basis for 621 producing meaningful names. 
The URI producer will transform the 622 local encoding to one that is suitable for a public interface, and 623 then transform the public interface encoding into the restricted set 624 of URI characters (reserved, unreserved, and percent-encodings). 625 Those characters are, in turn, encoded as octets to be used as a 626 reference within a data format (e.g., a document charset), and such 627 data formats are often subsequently encoded for transmission over 628 Internet protocols. 630 For most systems, an unreserved character appearing within a URI 631 component is interpreted as representing the data octet corresponding 632 to that character's encoding in US-ASCII. Consumers of URIs assume 633 that the letter "X" corresponds to the octet "01011000", and there is 634 no harm in making that assumption even when it is incorrect. A 635 system that internally provides identifiers in the form of a 636 different character encoding, such as EBCDIC, will generally perform 637 character translation of textual identifiers to UTF-8 [RFC3629] (or 638 some other superset of the US-ASCII character encoding) at an 639 internal interface, thereby providing more meaningful identifiers 640 than simply percent-encoding the original octets. 642 For example, consider an information service that provides data, 643 stored locally using an EBCDIC-based filesystem, to clients on the 644 Internet through an HTTP server. When an author creates a file on 645 that filesystem with the name "Laguna Beach", their expectation is 646 that the "http" URI corresponding to that resource would also contain 647 the meaningful string "Laguna%20Beach". If, however, that server 648 produces URIs using an overly-simplistic raw octet mapping, then the 649 result would be a URI containing 650 "%D3%81%87%A4%95%81@%C2%85%81%83%88". An internal transcoding 651 interface fixes that problem by transcoding the local name to a 652 superset of US-ASCII prior to producing the URI. 
Naturally, proper 653 interpretation of an incoming URI on such an interface requires that 654 percent-encoded octets be decoded (e.g., "%20" to SP) before the 655 reverse transcoding is applied to obtain the local name. 657 In some cases, the internal interface between a URI component and the 658 identifying data that it has been crafted to represent is much less 659 direct than a character encoding translation. For example, portions 660 of a URI might reflect a query on non-ASCII data, numeric coordinates 661 on a map, etc. Likewise, a URI scheme may define components with 662 additional encoding requirements that are applied prior to forming 663 the component and producing the URI. 665 When a new URI scheme defines a component that represents textual 666 data consisting of characters from the Unicode (ISO/IEC 10646-1) 667 character set, the data should be encoded first as octets according 668 to the UTF-8 character encoding [RFC3629], and then only those octets 669 that do not correspond to characters in the unreserved set should be 670 percent-encoded. For example, the character A would be represented 671 as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be 672 represented as "%C3%80", and the character KATAKANA LETTER A would be 673 represented as "%E3%82%A2". 675 3. Syntax Components 677 The generic URI syntax consists of a hierarchical sequence of 678 components referred to as the scheme, authority, path, query, and 679 fragment. 681 URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 683 hier-part = "//" authority path-abempty 684 / path-abs 685 / path-rootless 686 / path-empty 688 The scheme and path components are required, though path may be empty 689 (no characters). When authority is present, the path must either be 690 empty or begin with a slash ("/") character. When authority is not 691 present, the path cannot begin with two slash characters ("//"). 
692 These restrictions result in five different ABNF rules for a path 693 (Section 3.3), only one of which will match any given URI reference. 695 The following are two example URIs and their component parts:
697          foo://example.com:8042/over/there?name=ferret#nose
698          \_/   \______________/\_________/ \_________/ \__/
699           |           |            |            |        |
700       scheme     authority       path        query   fragment
701           |   _____________________|__
702          / \ /                        \
703          urn:example:animal:ferret:nose
705 3.1 Scheme 707 Each URI begins with a scheme name that refers to a specification for 708 assigning identifiers within that scheme. As such, the URI syntax is 709 a federated and extensible naming system wherein each scheme's 710 specification may further restrict the syntax and semantics of 711 identifiers using that scheme. 713 Scheme names consist of a sequence of characters beginning with a 714 letter and followed by any combination of letters, digits, plus 715 ("+"), period ("."), or hyphen ("-"). Although scheme is 716 case-insensitive, the canonical form is lowercase and documents that 717 specify schemes must do so using lowercase letters. An 718 implementation should accept uppercase letters as equivalent to 719 lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for 720 the sake of robustness, but should only produce lowercase scheme 721 names, for consistency. 723 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) 725 Individual schemes are not specified by this document. The process 726 for registration of new URI schemes is defined separately by 727 [RFC2717]. The scheme registry maintains the mapping between scheme 728 names and their specifications. Advice for designers of new URI 729 schemes can be found in [RFC2718].
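The scheme rule in Section 3.1 is small enough to check mechanically. The following Python is a minimal sketch of that check together with the recommended lowercase normalization. The names SCHEME_RE, is_scheme, and normalize_scheme are illustrative assumptions and do not come from this specification.

```python
import re

# Mirrors the ABNF: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# The character classes are spelled out so that only US-ASCII letters match.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_scheme(name: str) -> bool:
    """Return True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(name) is not None


def normalize_scheme(name: str) -> str:
    """Accept any letter case on input, but produce lowercase on output."""
    if not is_scheme(name):
        raise ValueError(f"not a valid scheme name: {name!r}")
    return name.lower()
```

Under these assumptions, normalize_scheme("HTTP") yields "http", while a name such as "1http" is rejected because a scheme must begin with a letter.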
731 When presented with a URI that violates one or more scheme-specific 732 restrictions, the scheme-specific resolution process should flag the 733 reference as an error rather than ignore the unused parts; doing so 734 reduces the number of equivalent URIs and helps detect abuses of the 735 generic syntax that might indicate the URI has been constructed to 736 mislead the user (Section 7.6). 738 3.2 Authority 740 Many URI schemes include a hierarchical element for a naming 741 authority, such that governance of the name space defined by the 742 remainder of the URI is delegated to that authority (which may, in 743 turn, delegate it further). The generic syntax provides a common 744 means for distinguishing an authority based on a registered name or 745 server address, along with optional port and user information. 747 The authority component is preceded by a double slash ("//") and is 748 terminated by the next slash ("/"), question mark ("?"), or number 749 sign ("#") character, or by the end of the URI. 751 authority = [ userinfo "@" ] host [ ":" port ] 753 URI producers and normalizers should omit the ":" delimiter that 754 separates host from port if the port component is empty. Some schemes 755 do not allow the userinfo and/or port subcomponents. 757 If a URI contains an authority component, then the path component 758 must either be empty or begin with a slash ("/") character. 759 Non-validating parsers (those that merely separate a URI reference 760 into its major components) will often ignore the subcomponent 761 structure of authority, treating it as an opaque string from the 762 double-slash to the first terminating delimiter, until such time as 763 the URI is dereferenced. 765 3.2.1 User Information 767 The userinfo subcomponent may consist of a user name and, optionally, 768 scheme-specific information about how to gain authorization to access 769 the resource. 
The user information, if present, is followed by a 770 commercial at-sign ("@") that delimits it from the host. 772 userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) 774 Use of the format "user:password" in the userinfo field is 775 deprecated. Applications should not render as clear text any data 776 after the first colon (":") character found within a userinfo 777 subcomponent unless the data after the colon is the empty string 778 (indicating no password). Applications may choose to ignore or reject 779 such data when received as part of a reference, and should reject the 780 storage of such data in unencrypted form. The passing of 781 authentication information in clear text has proven to be a security 782 risk in almost every case where it has been used. 784 Applications that render a URI for the sake of user feedback, such as 785 in graphical hypertext browsing, should render userinfo in a way that 786 is distinguished from the rest of a URI, when feasible. Such 787 rendering will assist the user in cases where the userinfo has been 788 misleadingly crafted to look like a trusted domain name (Section 789 7.6). 791 3.2.2 Host 793 The host subcomponent of authority is identified by an IP literal 794 encapsulated within square brackets, an IPv4 address in 795 dotted-decimal form, or a registered name. The host subcomponent is 796 case-insensitive. The presence of a host subcomponent within a URI 797 does not imply that the scheme requires access to the given host on 798 the Internet. In many cases, the host syntax is used only for the 799 sake of reusing the existing registration process created and 800 deployed for DNS, thus obtaining a globally unique name without the 801 cost of deploying another registry. However, such use comes with its 802 own costs: domain name ownership may change over time for reasons not 803 anticipated by the URI producer. 
In other cases, the data within the 804 host component identifies a registered name that has nothing to do 805 with an Internet host. We use the name "host" for the ABNF rule 806 because that is its most common purpose, not its only purpose, and 807 thus should not be considered as semantically limiting the data 808 within it. 810 host = IP-literal / IPv4address / reg-name 812 The syntax rule for host is ambiguous because it does not completely 813 distinguish between an IPv4address and a reg-name. In order to 814 disambiguate the syntax, we apply the "first-match-wins" algorithm: 815 If host matches the rule for IPv4address, then it should be 816 considered an IPv4 address literal and not a reg-name. Although host 817 is case-insensitive, producers and normalizers should use lowercase 818 for registered names and hexadecimal addresses for the sake of 819 uniformity, while only using uppercase letters for percent-encodings. 821 A host identified by an Internet Protocol literal address, version 6 822 [RFC3513] or later, is distinguished by enclosing the IP literal 823 within square brackets ("[" and "]"). This is the only place where 824 square bracket characters are allowed in the URI syntax. In 825 anticipation of future, as-yet-undefined IP literal address formats, 826 an optional version flag may be used to indicate such a format 827 explicitly rather than relying on heuristic determination. 829 IP-literal = "[" ( IPv6address / IPvFuture ) "]" 831 IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) 833 The version flag does not indicate the IP version; rather, it 834 indicates future versions of the literal format. As such, 835 implementations must not provide the version flag for existing IPv4 836 and IPv6 literal addresses.
If a URI containing an IP-literal that 837 starts with "v" (case-insensitive), indicating that the version flag 838 is present, is dereferenced by an application that does not know the 839 meaning of that version flag, then the application should return an 840 appropriate error for "address mechanism not supported". 842 A host identified by an IPv6 literal address is represented inside 843 the square brackets without a preceding version flag. The ABNF 844 provided here is a translation of the text definition of an IPv6 845 literal address provided in [RFC3513]. A 128-bit IPv6 address is 846 divided into eight 16-bit pieces. Each piece is represented 847 numerically in case-insensitive hexadecimal, using one to four 848 hexadecimal digits (leading zeroes are permitted). The eight encoded 849 pieces are given most-significant first, separated by colon 850 characters. Optionally, the least-significant two pieces may instead 851 be represented in IPv4 address textual format. A sequence of one or 852 more consecutive zero-valued 16-bit pieces within the address may be 853 elided, omitting all their digits and leaving exactly two consecutive 854 colons in their place to mark the elision. 856 IPv6address = 6( h16 ":" ) ls32 857 / "::" 5( h16 ":" ) ls32 858 / [ h16 ] "::" 4( h16 ":" ) ls32 859 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 860 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 861 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 862 / [ *4( h16 ":" ) h16 ] "::" ls32 863 / [ *5( h16 ":" ) h16 ] "::" h16 864 / [ *6( h16 ":" ) h16 ] "::" 866 ls32 = ( h16 ":" h16 ) / IPv4address 867 ; least-significant 32 bits of address 869 h16 = 1*4HEXDIG 870 ; 16 bits of address represented in hexadecimal 872 A host identified by an IPv4 literal address is represented in 873 dotted-decimal notation (a sequence of four decimal numbers in the 874 range 0 to 255, separated by "."), as described in [RFC1123] by 875 reference to [RFC0952]. 
Note that other forms of dotted notation may 876 be interpreted on some platforms, as described in Section 7.4, but 877 only the dotted-decimal form of four octets is allowed by this 878 grammar. 880 IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet 882 dec-octet = DIGIT ; 0-9 883 / %x31-39 DIGIT ; 10-99 884 / "1" 2DIGIT ; 100-199 885 / "2" %x30-34 DIGIT ; 200-249 886 / "25" %x30-35 ; 250-255 888 A host identified by a registered name is a sequence of characters 889 that is usually intended for lookup within a locally-defined host or 890 service name registry, though the URI's scheme-specific semantics may 891 require that a specific registry (or fixed name table) be used 892 instead. The most common name registry mechanism is the Domain Name 893 System (DNS). A registered name intended for lookup in the DNS uses 894 the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of 895 [RFC1123]. Such a name consists of a sequence of domain labels 896 separated by ".", each domain label starting and ending with an 897 alphanumeric character and possibly also containing "-" characters. 898 The rightmost domain label of a fully qualified domain name in DNS 899 may be followed by a single "." and should be followed by one if it 900 is necessary to distinguish between the complete domain name and some 901 local domain. 903 reg-name = 0*255( unreserved / pct-encoded / sub-delims ) 905 If the URI scheme defines a default for host, then that default 906 applies when the host subcomponent is undefined or when the 907 registered name is empty (zero length). For example, the "file" URI 908 scheme is defined such that no authority, an empty host, and 909 "localhost" all mean the end-user's machine, whereas the "http" 910 scheme considers a missing authority or empty host to be invalid. 
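The "first-match-wins" rule for host and the dec-octet grammar can be sketched as follows. This is an illustration under stated assumptions, not a validating parser: the names DEC_OCTET, IPV4_RE, and classify_host are hypothetical, the contents of a bracketed IP literal are not checked, and the characters of a reg-name are not validated.

```python
import re

# dec-octet alternatives, longest first: 250-255, 200-249, 100-199, 10-99, 0-9.
# Leading zeros (e.g. "01") are not matched, as in the ABNF.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def classify_host(host: str) -> str:
    """Classify a host string by trying the host alternatives in order."""
    if host.startswith("[") and host.endswith("]"):
        # Assumption: bracket contents are left unvalidated in this sketch.
        return "IP-literal"
    if IPV4_RE.fullmatch(host):
        # First match wins: a dotted-decimal string is never a reg-name.
        return "IPv4address"
    return "reg-name"
```

Note that a string such as "256.0.0.1" falls through to reg-name here, since it does not match IPv4address but consists only of characters that reg-name allows.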
912 This specification does not mandate a particular registered name 913 lookup technology and therefore does not restrict the syntax of 914 reg-name beyond that necessary for interoperability. Instead, it 915 delegates the issue of registered name syntax conformance to the 916 operating system of each application performing URI resolution, and 917 that operating system decides what it will allow for the purpose of 918 host identification. A URI resolution implementation might use DNS, 919 host tables, yellow pages, NetInfo, WINS, or any other system for 920 lookup of registered names. However, a globally-scoped naming system, 921 such as DNS fully-qualified domain names, is necessary for URIs that 922 are intended to have global scope. URI producers should use names 923 that conform to the DNS syntax, even when use of DNS is not 924 immediately apparent. 926 The reg-name syntax allows percent-encoded octets in order to 927 represent non-ASCII registered names in a uniform way that is 928 independent of the underlying name resolution technology; such 929 non-ASCII characters must first be encoded according to UTF-8 930 [RFC3629] and then each octet of the corresponding UTF-8 sequence 931 must be percent-encoded to be represented as URI characters. URI 932 producing applications must not use percent-encoding in host unless 933 it is used to represent a UTF-8 character sequence. When a non-ASCII 934 registered name represents an internationalized domain name intended 935 for resolution via the DNS, the name must be transformed to the IDNA 936 encoding [RFC3490] prior to name lookup. URI producers should 937 provide such registered names in the IDNA encoding, rather than a 938 percent-encoding, if they wish to maximize interoperability with 939 legacy URI resolvers. 941 3.2.3 Port 943 The port subcomponent of authority is designated by an optional port 944 number in decimal following the host and delimited from it by a 945 single colon (":") character. 
947 port = *DIGIT 949 A scheme may define a default port. For example, the "http" scheme 950 defines a default port of "80", corresponding to its reserved TCP 951 port number. The type of port designated by the port number (e.g., 952 TCP, UDP, SCTP, etc.) is defined by the URI scheme. URI producers 953 and normalizers should omit the port component and its ":" delimiter 954 if port is empty or its value would be the same as the scheme's 955 default. 957 3.3 Path 959 The path component contains data, usually organized in hierarchical 960 form, that, along with data in the non-hierarchical query component 961 (Section 3.4), serves to identify a resource within the scope of the 962 URI's scheme and naming authority (if any). The path is terminated by 963 the first question mark ("?") or number sign ("#") character, or by 964 the end of the URI. 966 If a URI contains an authority component, then the path component 967 must either be empty or begin with a slash ("/") character. If a URI 968 does not contain an authority component, then the path cannot begin 969 with two slash characters ("//"). In addition, a URI reference 970 (Section 4.1) may begin with a relative path, in which case the first 971 path segment cannot contain a colon (":") character. The ABNF 972 requires five separate rules to disambiguate these cases, only one of 973 which will match a given URI reference. We use the generic term 974 "path component" to describe the URI substring that is matched by the 975 parser to one of these rules. 
977 path = path-abempty ; begins with "/" or is empty 978 / path-abs ; begins with "/" but not "//" 979 / path-noscheme ; begins with a non-colon segment 980 / path-rootless ; begins with a segment 981 / path-empty ; zero characters 983 path-abempty = *( "/" segment ) 984 path-abs = "/" [ segment-nz *( "/" segment ) ] 985 path-noscheme = segment-nzc *( "/" segment ) 986 path-rootless = segment-nz *( "/" segment ) 987 path-empty = 0 989 segment = *pchar 990 segment-nz = 1*pchar 991 segment-nzc = 1*( unreserved / pct-encoded / sub-delims / "@" ) 993 pchar = unreserved / pct-encoded / sub-delims / ":" / "@" 995 A path consists of a sequence of path segments separated by a slash 996 ("/") character. A path is always defined for a URI, though the 997 defined path may be empty (zero length). Use of the slash character 998 to indicate hierarchy is only required when a URI will be used as the 999 context for relative references. For example, the URI 1000 has a path of "fred@example.com", whereas 1001 the URI has an empty path. 1003 The path segments "." and "..", also known as dot-segments, are 1004 defined for relative reference within the path name hierarchy. They 1005 are intended for use at the beginning of a relative path reference 1006 (Section 4.2) for indicating relative position within the 1007 hierarchical tree of names. This is similar to their role within 1008 some operating systems' file directory structure to indicate the 1009 current directory and parent directory, respectively. However, unlike 1010 a file system, these dot-segments are only interpreted within the URI 1011 path hierarchy and are removed as part of the resolution process 1012 (Section 5.2). 1014 Aside from dot-segments in hierarchical paths, a path segment is 1015 considered opaque by the generic syntax. URI-producing applications 1016 often use the reserved characters allowed in a segment for the 1017 purpose of delimiting scheme-specific or dereference-handler-specific 1018 subcomponents. 
For example, the semicolon (";") and equals ("=") 1019 reserved characters are often used for delimiting parameters and 1020 parameter values applicable to that segment. The comma (",") 1021 reserved character is often used for similar purposes. For example, 1022 one URI producer might use a segment like "name;v=1.1" to indicate a 1023 reference to version 1.1 of "name", whereas another might use a 1024 segment like "name,1.1" to indicate the same. Parameter types may be 1025 defined by scheme-specific semantics, but in most cases the syntax of 1026 a parameter is specific to the implementation of the URI's 1027 dereferencing algorithm. 1029 3.4 Query 1031 The query component contains non-hierarchical data that, along with 1032 data in the path component (Section 3.3), serves to identify a 1033 resource within the scope of the URI's scheme and naming authority 1034 (if any). The query component is indicated by the first question mark 1035 ("?") character and terminated by a number sign ("#") character or by 1036 the end of the URI. 1038 query = *( pchar / "/" / "?" ) 1040 The characters slash ("/") and question mark ("?") may represent data 1041 within the query component. Beware that some older, erroneous 1042 implementations do not handle such URIs correctly when they are used 1043 as the base for relative references (Section 5.1), apparently because 1044 they fail to distinguish query data from path data when looking 1045 for hierarchical separators. However, since query components are 1046 often used to carry identifying information in the form of 1047 "key=value" pairs, and one frequently used value is a reference to 1048 another URI, it is sometimes better for usability to avoid 1049 percent-encoding those characters. 1051 3.5 Fragment 1053 The fragment identifier component of a URI allows indirect 1054 identification of a secondary resource by reference to a primary 1055 resource and additional identifying information.
The identified 1056 secondary resource may be some portion or subset of the primary 1057 resource, some view on representations of the primary resource, or 1058 some other resource defined or described by those representations. A 1059 fragment identifier component is indicated by the presence of a 1060 number sign ("#") character and terminated by the end of the URI. 1062 fragment = *( pchar / "/" / "?" ) 1064 The semantics of a fragment identifier are defined by the set of 1065 representations that might result from a retrieval action on the 1066 primary resource. The fragment's format and resolution is therefore 1067 dependent on the media type [RFC2046] of a potentially retrieved 1068 representation, even though such a retrieval is only performed if the 1069 URI is dereferenced. If no such representation exists, then the 1070 semantics of the fragment are considered unknown and, effectively, 1071 unconstrained. Fragment identifier semantics are independent of the 1072 URI scheme and thus cannot be redefined by scheme specifications. 1074 Individual media types may define their own restrictions on, or 1075 structure within, the fragment identifier syntax for specifying 1076 different types of subsets, views, or external references that are 1077 identifiable as secondary resources by that media type. If the 1078 primary resource has multiple representations, as is often the case 1079 for resources whose representation is selected based on attributes of 1080 the retrieval request (a.k.a., content negotiation), then whatever is 1081 identified by the fragment should be consistent across all of those 1082 representations: each representation should either define the 1083 fragment such that it corresponds to the same secondary resource, 1084 regardless of how it is represented, or the fragment should be left 1085 undefined by the representation (i.e., not found). 
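The delimiting rule stated in Section 3.5 (a fragment begins after the first number sign and runs to the end of the URI) can be sketched in a few lines of Python. The function name split_fragment and the second sample URI are illustrative assumptions. The sketch only separates the fragment and does not interpret it, since that depends on the media type.

```python
from typing import Optional, Tuple


def split_fragment(uri_reference: str) -> Tuple[str, Optional[str]]:
    """Split a reference at the first "#".

    Returns (rest, fragment). The fragment is None when no "#" is present
    and the empty string when "#" is the final character.
    """
    rest, separator, fragment = uri_reference.partition("#")
    return rest, (fragment if separator else None)


# Slash and question mark are ordinary data inside a fragment.
rest, fragment = split_fragment("http://example.com/a#b/c?d")
```

Because only the first "#" delimits, everything after it, including "/" and "?", stays in the fragment.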
1087 As with any URI, use of a fragment identifier component does not 1088 imply that a retrieval action will take place. A URI with a fragment 1089 identifier may be used to refer to the secondary resource without any 1090 implication that the primary resource is accessible or will ever be 1091 accessed. 1093 Fragment identifiers have a special role in information systems as 1094 the primary form of client-side indirect referencing, allowing an 1095 author to specifically identify those aspects of an existing resource 1096 that are only indirectly provided by the resource owner. As such, the 1097 fragment identifier is not used in the scheme-specific processing of 1098 a URI; instead, the fragment identifier is separated from the rest of 1099 the URI prior to a dereference, and thus the identifying information 1100 within the fragment itself is dereferenced solely by the user agent 1101 and regardless of the URI scheme. Although this separate handling is 1102 often perceived to be a loss of information, particularly in regards 1103 to accurate redirection of references as resources move over time, it 1104 also serves to prevent information providers from denying reference 1105 authors the right to selectively refer to information within a 1106 resource. Indirect referencing also provides additional flexibility 1107 and extensibility to systems that use URIs, since new media types are 1108 easier to define and deploy than new schemes of identification. 1110 The characters slash ("/") and question mark ("?") are allowed to 1111 represent data within the fragment identifier. Beware that some 1112 older, erroneous implementations do not handle such URIs correctly 1113 when they are used as the base for relative references (Section 5.1). 1115 4. Usage 1117 When applications make reference to a URI, they do not always use the 1118 full form of reference defined by the "URI" syntax rule. 
   In order to save space and take advantage of hierarchical locality,
   many Internet protocol elements and media type formats allow an
   abbreviation of a URI, while others restrict the syntax to a
   particular form of URI. We define the most common forms of reference
   syntax in this specification because they impact and depend upon the
   design of the generic syntax, requiring a uniform parsing algorithm
   in order to be interpreted consistently.

4.1  URI Reference

   URI-reference is used to denote the most common usage of a resource
   identifier.

      URI-reference = URI / relative-URI

   A URI-reference may be relative: if the reference's prefix matches
   the syntax of a scheme followed by its colon separator, then the
   reference is a URI rather than a relative-URI.

   A URI-reference is typically parsed first into the five URI
   components, in order to determine what components are present and
   whether or not the reference is relative, after which each component
   is parsed for its subparts and their validation. The ABNF of
   URI-reference, along with the "first-match-wins" disambiguation rule,
   is sufficient to define a validating parser for the generic syntax.
   Readers familiar with regular expressions should see Appendix B for
   an example of a non-validating URI-reference parser that will take
   any given string and extract the URI components.

4.2  Relative URI

   A relative URI reference takes advantage of the hierarchical syntax
   (Section 1.2.3) in order to express a reference that is relative to
   the name space of another hierarchical URI.

      relative-URI  = relative-part [ "?" query ] [ "#" fragment ]

      relative-part = "//" authority path-abempty
                    / path-abs
                    / path-noscheme
                    / path-empty

   The URI referred to by a relative reference, also known as the target
   URI, is obtained by applying the reference resolution algorithm of
   Section 5.
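   The first parsing pass described in Section 4.1 (splitting a
   reference into its five components before any validation) might be
   sketched as follows. The regular expression is the non-validating
   pattern commonly given for this purpose and is assumed here to
   correspond to the one in Appendix B; the name split_reference and the
   use of None for an undefined component are illustrative choices, not
   part of this specification.

```python
import re

# Non-validating pattern: groups 2, 4, 5, 7 and 9 hold the scheme,
# authority, path, query and fragment respectively.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def split_reference(reference):
    """Split any string into the five generic components.

    A component whose delimiter is absent is reported as None
    (undefined); the path is never None, though it may be empty.
    """
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


# A reference is relative exactly when no scheme was found.
for ref in ("http://a/b/c/d;p?q", "//g", "g?y#s", "./this:that"):
    parts = split_reference(ref)
    kind = "URI" if parts["scheme"] is not None else "relative-URI"
    print(ref, kind, parts)
```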

   A relative reference that begins with two slash characters is termed
   a network-path reference; such references are rarely used. A relative
   reference that begins with a single slash character is termed an
   absolute-path reference. A relative reference that does not begin
   with a slash character is termed a relative-path reference.

   A path segment that contains a colon character (e.g., "this:that")
   cannot be used as the first segment of a relative-path reference
   because it would be mistaken for a scheme name. Such a segment must
   be preceded by a dot-segment (e.g., "./this:that") to make a
   relative-path reference.

4.3  Absolute URI

   Some protocol elements allow only the absolute form of a URI without
   a fragment identifier. For example, defining a base URI for later
   use by relative references calls for an absolute-URI syntax rule that
   does not allow a fragment.

      absolute-URI  = scheme ":" hier-part [ "?" query ]

4.4  Same-document Reference

   When a URI reference refers to a URI that is, aside from its fragment
   component (if any), identical to the base URI (Section 5.1), that
   reference is called a "same-document" reference. The most frequent
   examples of same-document references are relative references that are
   empty or include only the number sign ("#") separator followed by a
   fragment identifier.

   When a same-document reference is dereferenced for the purpose of a
   retrieval action, the target of that reference is defined to be
   within the same entity (representation, document, or message) as the
   reference; therefore, a dereference should not result in a new
   retrieval action.

   Normalization of the base and target URIs prior to their comparison,
   as described in Section 6.2.2 and Section 6.2.3, is allowed but
   rarely performed in practice.
   Normalization may increase the set of same-document references, which
   may be of benefit to some caching applications. As such, reference
   authors should not assume that a slightly different, though
   equivalent, reference URI will (or will not) be interpreted as a
   same-document reference by any given application.

4.5  Suffix Reference

   The URI syntax is designed for unambiguous reference to resources and
   extensibility via the URI scheme. However, as URI identification and
   usage have become commonplace, traditional media (television, radio,
   newspapers, billboards, etc.) have increasingly used a suffix of the
   URI as a reference, consisting of only the authority and path
   portions of the URI, such as

      www.w3.org/Addressing/

   or simply a DNS registered name on its own. Such references are
   primarily intended for human interpretation, rather than for
   machines, with the assumption that context-based heuristics are
   sufficient to complete the URI (e.g., most registered names beginning
   with "www" are likely to have a URI prefix of "http://"). Although
   there is no standard set of heuristics for disambiguating a URI
   suffix, many client implementations allow them to be entered by the
   user and heuristically resolved.

   While this practice of using suffix references is common, it should
   be avoided whenever possible and never used in situations where
   long-term references are expected. The heuristics noted above will
   change over time, particularly when a new URI scheme becomes popular,
   and are often incorrect when used out of context. Furthermore, they
   can lead to security issues along the lines of those described in
   [RFC1535].

   Since a URI suffix has the same syntax as a relative path reference,
   a suffix reference cannot be used in contexts where a relative
   reference is expected.
   As a result, suffix references are limited to those places where
   there is no defined base URI, such as dialog boxes and off-line
   advertisements.

5.  Reference Resolution

   This section defines the process of resolving a URI reference within
   a context that allows relative references, such that the result is a
   string matching the "URI" syntax rule of Section 3.

5.1  Establishing a Base URI

   The term "relative" implies that there exists a "base URI" against
   which the relative reference is applied. Aside from fragment-only
   references (Section 4.4), relative references are only usable when a
   base URI is known. A base URI must be established by the parser
   prior to parsing URI references that might be relative.

   The base URI of a reference can be established in one of four ways,
   discussed below in order of precedence. The order of precedence can
   be thought of in terms of layers, where the innermost defined base
   URI has the highest precedence. This can be visualized graphically
   as:

      .----------------------------------------------------------.
      |  .----------------------------------------------------.  |
      |  |  .----------------------------------------------.  |  |
      |  |  |  .----------------------------------------.  |  |  |
      |  |  |  |  .----------------------------------.  |  |  |  |
      |  |  |  |  |                                  |  |  |  |  |
      |  |  |  |  `----------------------------------'  |  |  |  |
      |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
      |  |  |  `----------------------------------------'  |  |  |
      |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
      |  |  |         (message, representation, or none)   |  |  |
      |  |  `----------------------------------------------'  |  |
      |  | (5.1.3) URI used to retrieve the entity            |  |
      |  `----------------------------------------------------'  |
      | (5.1.4) Default Base URI (application-dependent)         |
      `----------------------------------------------------------'

5.1.1  Base URI Embedded in Content

   Within certain media types, a base URI for relative references can be
   embedded within the content itself such that it can be readily
   obtained by a parser. This can be useful for descriptive documents,
   such as tables of contents, which may be transmitted to others
   through protocols other than their usual retrieval context (e.g.,
   E-Mail or USENET news).

   It is beyond the scope of this specification to specify how, for each
   media type, a base URI can be embedded. The appropriate syntax, when
   available, is described by the data format specification associated
   with each media type.

5.1.2  Base URI from the Encapsulating Entity

   If no base URI is embedded, the base URI is defined by the
   representation's retrieval context. For a document that is enclosed
   within another entity, such as a message or archive, the retrieval
   context is that entity; thus, the default base URI of a
   representation is the base URI of the entity in which the
   representation is encapsulated.

   A mechanism for embedding a base URI within MIME container types
   (e.g., the message and multipart types) is defined by MHTML
   [RFC2557].
   Protocols that do not use the MIME message header syntax, but do
   allow some form of tagged metadata to be included within messages,
   may define their own syntax for defining a base URI as part of a
   message.

5.1.3  Base URI from the Retrieval URI

   If no base URI is embedded and the representation is not encapsulated
   within some other entity, then, if a URI was used to retrieve the
   representation, that URI shall be considered the base URI. Note that
   if the retrieval was the result of a redirected request, the last URI
   used (i.e., the URI that resulted in the actual retrieval of the
   representation) is the base URI.

5.1.4  Default Base URI

   If none of the conditions described above apply, then the base URI is
   defined by the context of the application. Since this definition is
   necessarily application-dependent, failing to define a base URI using
   one of the other methods may result in the same content being
   interpreted differently by different types of application.

   A sender of a representation containing relative references is
   responsible for ensuring that a base URI for those references can be
   established. Aside from fragment-only references, relative references
   can only be used reliably in situations where the base URI is
   well-defined.

5.2  Relative Resolution

   This section describes an algorithm for converting a URI reference
   that might be relative to a given base URI into the parsed components
   of the reference's target. The components can then be recomposed, as
   described in Section 5.3, to form the target URI. This algorithm
   provides definitive results that can be used to test the output of
   other implementations. Applications may implement relative reference
   resolution using some other algorithm, provided that the results
   match what would be given by this algorithm.

5.2.1  Pre-parse the Base URI

   The base URI (Base) is established according to the procedure of
   Section 5.1 and parsed into the five main components described in
   Section 3. Note that only the scheme component is required to be
   present in a base URI; the other components may be empty or
   undefined. A component is undefined if its associated delimiter does
   not appear in the URI reference; the path component is never
   undefined, though it may be empty.

   Normalization of the base URI, as described in Section 6.2.2 and
   Section 6.2.3, is optional. A URI reference must be transformed to
   its target URI before it can be normalized.

5.2.2  Transform References

   For each URI reference (R), the following pseudocode describes an
   algorithm for transforming R into its target URI (T):

      -- The URI reference is parsed into the five URI components
      --
      (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);

      -- A non-strict parser may ignore a scheme in the reference
      -- if it is identical to the base URI's scheme.
      --
      if ((not strict) and (R.scheme == Base.scheme)) then
         undefine(R.scheme);
      endif;
      if defined(R.scheme) then
         T.scheme    = R.scheme;
         T.authority = R.authority;
         T.path      = remove_dot_segments(R.path);
         T.query     = R.query;
      else
         if defined(R.authority) then
            T.authority = R.authority;
            T.path      = remove_dot_segments(R.path);
            T.query     = R.query;
         else
            if (R.path == "") then
               T.path = Base.path;
               if defined(R.query) then
                  T.query = R.query;
               else
                  T.query = Base.query;
               endif;
            else
               if (R.path starts-with "/") then
                  T.path = remove_dot_segments(R.path);
               else
                  T.path = merge(Base.path, R.path);
                  T.path = remove_dot_segments(T.path);
               endif;
               T.query = R.query;
            endif;
            T.authority = Base.authority;
         endif;
         T.scheme = Base.scheme;
      endif;

      T.fragment = R.fragment;

5.2.3  Merge Paths

   The pseudocode above refers to a "merge" routine for merging a
   relative-path reference with the path of the base URI. This is
   accomplished as follows:

   o  If the base URI has a defined authority component and an empty
      path, then return a string consisting of "/" concatenated with the
      reference's path; otherwise,

   o  Return a string consisting of the reference's path component
      appended to all but the last segment of the base URI's path (i.e.,
      excluding any characters after the right-most "/" in the base URI
      path, or excluding the entire base URI path if it does not contain
      any "/" characters).

5.2.4  Remove Dot Segments

   The pseudocode also refers to a "remove_dot_segments" routine for
   interpreting and removing the special "." and ".." complete path
   segments from a referenced path.
   This is done after the path is extracted from a reference, whether or
   not the path was relative, in order to remove any invalid or
   extraneous dot-segments prior to forming the target URI. Although
   there are many ways to accomplish this removal process, we describe a
   simple method using two string buffers.

   1.  The input buffer is initialized with the now-appended path
       components and the output buffer is initialized to the empty
       string.

   2.  While the input buffer is not empty, loop:

       a.  If the input buffer begins with a prefix of "../" or "./",
           then remove that prefix from the input buffer; otherwise,

       b.  If the input buffer begins with a prefix of "/./" or "/.",
           where "." is a complete path segment, then replace that
           prefix with "/" in the input buffer; otherwise,

       c.  If the input buffer begins with a prefix of "/../" or "/..",
           where ".." is a complete path segment, then replace that
           prefix with "/" in the input buffer and remove the last
           segment and its preceding "/" (if any) from the output
           buffer; otherwise,

       d.  If the input buffer consists only of "." or "..", then remove
           that from the input buffer; otherwise,

       e.  Move the first path segment in the input buffer to the end of
           the output buffer, including the initial "/" character (if
           any) and any subsequent characters up to, but not including,
           the next "/" character or the end of the input buffer.

   3.  Finally, the output buffer is returned as the result of
       remove_dot_segments.

   Note that dot-segments are intended for use in URI references to
   express an identifier relative to the hierarchy of names in the base
   URI. The remove_dot_segments algorithm respects that hierarchy by
   removing extra dot-segments rather than treating them as an error or
   leaving them to be misinterpreted by dereference implementations.
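   The two-buffer method of steps 1 through 3 above can be sketched in
   Python as follows. This is one possible transcription, not a
   normative implementation; the helper name _drop_last_segment is
   hypothetical, and "where . is a complete path segment" is read as
   "the prefix is followed by "/" or by the end of the input buffer".

```python
def _drop_last_segment(output):
    """Remove the last segment and its preceding "/" (if any)."""
    cut = output.rfind("/")
    return output[:cut] if cut != -1 else ""


def remove_dot_segments(path):
    inp, out = path, ""                      # step 1
    while inp:                               # step 2
        if inp.startswith("../"):            # 2a
            inp = inp[3:]
        elif inp.startswith("./"):           # 2a
            inp = inp[2:]
        elif inp.startswith("/./"):          # 2b: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":                    # 2b: "/." becomes "/"
            inp = "/"
        elif inp.startswith("/../"):         # 2c: "/../" becomes "/"
            inp = inp[3:]
            out = _drop_last_segment(out)
        elif inp == "/..":                   # 2c: "/.." becomes "/"
            inp = "/"
            out = _drop_last_segment(out)
        elif inp in (".", ".."):             # 2d
            inp = ""
        else:                                # 2e: move one segment
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]
    return out                               # step 3


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("mid/content=5/../6"))
```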

   The following illustrates how the above steps are applied for two
   example merged paths, showing the state of the two buffers after each
   step.

      STEP   OUTPUT BUFFER         INPUT BUFFER

       1 :                         /a/b/c/./../../g
       2e:   /a                    /b/c/./../../g
       2e:   /a/b                  /c/./../../g
       2e:   /a/b/c                /./../../g
       2b:   /a/b/c                /../../g
       2c:   /a/b                  /../g
       2c:   /a                    /g
       2e:   /a/g

      STEP   OUTPUT BUFFER         INPUT BUFFER

       1 :                         mid/content=5/../6
       2e:   mid                   /content=5/../6
       2e:   mid/content=5         /../6
       2c:   mid                   /6
       2e:   mid/6

   Some applications may find it more efficient to implement the
   remove_dot_segments algorithm using two segment stacks rather than
   strings.

      Note: Beware that some older, erroneous implementations will fail
      to separate a reference's query component from its path component
      prior to merging the base and reference paths, resulting in an
      interoperability failure if the query component contains the
      strings "/../" or "/./".

5.3  Component Recomposition

   Parsed URI components can be recomposed to obtain the corresponding
   URI reference string. Using pseudocode, this would be:

      result = ""

      if defined(scheme) then
         append scheme to result;
         append ":" to result;
      endif;

      if defined(authority) then
         append "//" to result;
         append authority to result;
      endif;

      append path to result;

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

      if defined(fragment) then
         append "#" to result;
         append fragment to result;
      endif;

      return result;

   Note that we are careful to preserve the distinction between a
   component that is undefined, meaning that its separator was not
   present in the reference, and a component that is empty, meaning that
   the separator was present and was immediately followed by the next
   component separator or the end of the reference.
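   As an illustration of the undefined-versus-empty distinction, the
   recomposition pseudocode might be transcribed into Python as below.
   Representing an undefined component by None and an empty one by ""
   is an assumption of this sketch, as is the function name recompose.

```python
def recompose(scheme, authority, path, query, fragment):
    """Rebuild a URI reference string from its five components.

    None means "undefined" (no delimiter is emitted); the empty string
    means "empty" (the delimiter is emitted with nothing after it).
    """
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(recompose("http", "a", "/b/c/g", "y", "s"))
print(recompose(None, None, "g", "", None))    # empty query keeps "?"
print(recompose(None, None, "g", None, None))  # undefined query
```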

5.4  Reference Resolution Examples

   Within a representation with a well-defined base URI of

      http://a/b/c/d;p?q

   a relative URI reference is transformed to its target URI as follows.

5.4.1  Normal Examples

      "g:h"           =  "g:h"
      "g"             =  "http://a/b/c/g"
      "./g"           =  "http://a/b/c/g"
      "g/"            =  "http://a/b/c/g/"
      "/g"            =  "http://a/g"
      "//g"           =  "http://g"
      "?y"            =  "http://a/b/c/d;p?y"
      "g?y"           =  "http://a/b/c/g?y"
      "#s"            =  "http://a/b/c/d;p?q#s"
      "g#s"           =  "http://a/b/c/g#s"
      "g?y#s"         =  "http://a/b/c/g?y#s"
      ";x"            =  "http://a/b/c/;x"
      "g;x"           =  "http://a/b/c/g;x"
      "g;x?y#s"       =  "http://a/b/c/g;x?y#s"
      ""              =  "http://a/b/c/d;p?q"
      "."             =  "http://a/b/c/"
      "./"            =  "http://a/b/c/"
      ".."            =  "http://a/b/"
      "../"           =  "http://a/b/"
      "../g"          =  "http://a/b/g"
      "../.."         =  "http://a/"
      "../../"        =  "http://a/"
      "../../g"       =  "http://a/g"

5.4.2  Abnormal Examples

   Although the following abnormal examples are unlikely to occur in
   normal practice, all URI parsers should be capable of resolving them
   consistently. Each example uses the same base as above.

   Parsers must be careful in handling cases where there are more
   relative path ".." segments than there are hierarchical levels in the
   base URI's path. Note that the ".." syntax cannot be used to change
   the authority component of a URI.

      "../../../g"    =  "http://a/g"
      "../../../../g" =  "http://a/g"

   Similarly, parsers must remove the dot-segments "." and ".." when
   they are complete components of a path, but not when they are only
   part of a segment.

      "/./g"          =  "http://a/g"
      "/../g"         =  "http://a/g"
      "g."            =  "http://a/b/c/g."
      ".g"            =  "http://a/b/c/.g"
      "g.."           =  "http://a/b/c/g.."
      "..g"           =  "http://a/b/c/..g"

   Less likely are cases where the relative URI reference uses
   unnecessary or nonsensical forms of the "." and ".." complete path
   segments.

      "./../g"        =  "http://a/b/g"
      "./g/."         =  "http://a/b/c/g/"
      "g/./h"         =  "http://a/b/c/g/h"
      "g/../h"        =  "http://a/b/c/h"
      "g;x=1/./y"     =  "http://a/b/c/g;x=1/y"
      "g;x=1/../y"    =  "http://a/b/c/y"

   Some applications fail to separate the reference's query and/or
   fragment components from a relative path before merging it with the
   base path and removing dot-segments. This error is rarely noticed,
   since typical usage of a fragment never includes the hierarchy ("/")
   character, and the query component is not normally used within
   relative references.

      "g?y/./x"       =  "http://a/b/c/g?y/./x"
      "g?y/../x"      =  "http://a/b/c/g?y/../x"
      "g#s/./x"       =  "http://a/b/c/g#s/./x"
      "g#s/../x"      =  "http://a/b/c/g#s/../x"

   Some parsers allow the scheme name to be present in a relative URI
   reference if it is the same as the base URI scheme. This is
   considered to be a loophole in prior specifications of partial URI
   [RFC1630]. Its use should be avoided, but is allowed for backward
   compatibility.

      "http:g"        =  "http:g"         ; for strict parsers
                      /  "http://a/b/c/g" ; for backward compatibility

6.  Normalization and Comparison

   One of the most common operations on URIs is simple comparison:
   determining if two URIs are equivalent without using the URIs to
   access their respective resource(s). A comparison is performed every
   time a response cache is accessed, a browser checks its history to
   color a link, or an XML parser processes tags within a namespace.
   Extensive normalization prior to comparison of URIs is often used by
   spiders and indexing engines to prune a search space or reduce
   duplication of request actions and response storage.

   URI comparison is performed with respect to some particular purpose,
   and software with differing purposes will often be subject to
   differing design trade-offs in regard to how much effort should be
   spent in reducing duplicate identifiers. This section describes a
   variety of methods that may be used to compare URIs, the trade-offs
   between them, and the types of applications that might use them. A
   canonical form for URI references is defined to reduce the occurrence
   of false negative comparisons.

6.1  Equivalence

   Since URIs exist to identify resources, presumably they should be
   considered equivalent when they identify the same resource. However,
   such a definition of equivalence is not of much practical use, since
   there is no way for software to compare two resources without
   knowledge of the implementation-specific syntax of each URI's
   dereferencing algorithm. For this reason, determination of
   equivalence or difference of URIs is based on string comparison,
   perhaps augmented by reference to additional rules provided by URI
   scheme definitions. We use the terms "different" and "equivalent" to
   describe the possible outcomes of such comparisons, but there are
   many application-dependent versions of equivalence.

   Even though it is possible to determine that two URIs are equivalent,
   it is never possible to be sure that two URIs identify different
   resources. For example, an owner of two different domain names could
   decide to serve the same resource from both, resulting in two
   different URIs. Therefore, comparison methods are designed to
   minimize false negatives while strictly avoiding false positives.

   In testing for equivalence, applications should not directly compare
   relative URI references; the references should be converted to their
   target URI forms before comparison.
   When URIs are being compared for the purpose of selecting (or
   avoiding) a network action, such as retrieval of a representation,
   the fragment components (if any) should be excluded from the
   comparison.

6.2  Comparison Ladder

   A variety of methods are used in practice to test URI equivalence.
   These methods fall into a range, distinguished by the amount of
   processing required and the degree to which the probability of false
   negatives is reduced. As noted above, false negatives cannot in
   principle be eliminated. In practice, their probability can be
   reduced, but this reduction requires more processing and is not
   cost-effective for all applications.

   If this range of comparison practices is considered as a ladder, the
   following discussion will climb the ladder, starting with those
   practices that are cheap but have a relatively higher chance of
   producing false negatives, and proceeding to those that have higher
   computational cost and lower risk of false negatives.

6.2.1  Simple String Comparison

   If two URIs, considered as character strings, are identical, then it
   is safe to conclude that they are equivalent. This type of
   equivalence test has very low computational cost and is in wide use
   in a variety of applications, particularly in the domain of parsing.

   Testing strings for equivalence requires some basic precautions. This
   procedure is often referred to as "bit-for-bit" or "byte-for-byte"
   comparison, which is potentially misleading. Testing of strings for
   equality is normally based on pairwise comparison of the characters
   that make up the strings, starting from the first and proceeding
   until both strings are exhausted and all characters found to be
   equal, a pair of characters compares unequal, or one of the strings
   is exhausted before the other.

   Such character comparisons require that each pair of characters be
   put in comparable form. For example, should one URI be stored in a
   byte array in EBCDIC encoding, and the second be in a Java String
   object (UTF-16), bit-for-bit comparisons applied naively will produce
   errors. It is better to speak of equality on a
   character-for-character rather than byte-for-byte or bit-for-bit
   basis. In practical terms, character-by-character comparisons should
   be done codepoint-by-codepoint after conversion to a common character
   encoding.

6.2.2  Syntax-based Normalization

   Software may use logic based on the definitions provided by this
   specification to reduce the probability of false negatives. Such
   processing is moderately higher in cost than character-for-character
   string comparison. For example, an application using this approach
   could reasonably consider the following two URIs equivalent:

      example://a/b/c/%7Bfoo%7D
      eXAMPLE://a/./b/../b/%63/%7bfoo%7d

   Web user agents, such as browsers, typically apply this type of URI
   normalization when determining whether a cached response is
   available. Syntax-based normalization includes such techniques as
   case normalization, percent-encoding normalization, and removal of
   dot-segments.

6.2.2.1  Case Normalization

   When a URI scheme uses components of the generic syntax, it will also
   use the common syntax equivalence rules, namely that the scheme and
   host are case-insensitive and therefore should be normalized to
   lowercase. For example, the URI "HTTP://www.EXAMPLE.com/" is
   equivalent to "http://www.example.com/". Applications should not
   assume anything about the case sensitivity of other URI components,
   since that is dependent on the implementation used to handle a
   dereference.
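   A minimal sketch of scheme and host case normalization follows. It
   assumes the authority has the generic form [userinfo "@"] host
   [":" port], leaves the userinfo and everything after the authority
   untouched, and does not treat percent-encoded triplets inside the
   host specially. The name normalize_case is hypothetical.

```python
import re

# scheme (optional), authority (optional), then everything else verbatim.
_PREFIX = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)


def normalize_case(uri):
    """Lowercase the scheme and host; leave all other components alone."""
    scheme, authority, rest = _PREFIX.match(uri).groups()
    result = ""
    if scheme is not None:
        result += scheme.lower() + ":"
    if authority is not None:
        # userinfo may be case-sensitive, so only host[:port] is lowered.
        userinfo, at, hostport = authority.rpartition("@")
        result += "//" + userinfo + at + hostport.lower()
    return result + rest


print(normalize_case("HTTP://User@www.EXAMPLE.com:80/Some/Path?Q#Frag"))
```

   The path, query, and fragment are passed through unchanged because,
   as noted above, their case sensitivity cannot be assumed.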

   The hexadecimal digits within a percent-encoding triplet (e.g., "%3a"
   versus "%3A") are case-insensitive and therefore should be normalized
   to use uppercase letters for the digits A-F.

6.2.2.2  Percent-Encoding Normalization

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs. In addition to the
   case-insensitivity issue noted above, some URI producers
   percent-encode octets that do not require percent-encoding, resulting
   in URIs that are equivalent to their non-encoded counterparts. Such
   URIs should be normalized by decoding any percent-encoded octet that
   corresponds to an unreserved character, as described in Section 2.3.

6.2.2.3  Path Segment Normalization

   The complete path segments "." and ".." have a special meaning within
   hierarchical URI schemes. As such, they should not appear in
   absolute paths; if they are found, they can be removed by applying
   the remove_dot_segments algorithm to the path, as described in
   Section 5.2.4.

6.2.3  Scheme-based Normalization

   The syntax and semantics of URIs vary from scheme to scheme, as
   described by the defining specification for each scheme. Software
   may use scheme-specific rules, at further processing cost, to reduce
   the probability of false negatives. For example, since the "http"
   scheme makes use of an authority component, has a default port of
   "80", and defines an empty path to be equivalent to "/", the
   following four URIs are equivalent:

      http://example.com
      http://example.com/
      http://example.com:/
      http://example.com:80/

   In general, a URI that uses the generic syntax for authority with an
   empty path should be normalized to a path of "/"; likewise, an
   explicit ":port", where the port is empty or the default for the
   scheme, is equivalent to one where the port and its ":" delimiter are
   elided.
   In other words, the second of the above URI examples is the normal
   form for the "http" scheme.

   Another case where normalization varies by scheme is in the handling
   of an empty authority component or empty host subcomponent. For many
   scheme specifications, an empty authority or host is considered an
   error; for others, it is considered equivalent to "localhost" or the
   end-user's host. When a scheme defines a default for authority and a
   URI reference to that default is desired, the reference should have
   an empty authority for the sake of uniformity, brevity, and
   internationalization. If, however, either the userinfo or port
   subcomponent is non-empty, then the host should be given explicitly
   even if it matches the default.

6.2.4  Protocol-based Normalization

   Web spiders, for which substantial effort to reduce the incidence of
   false negatives is often cost-effective, are observed to implement
   even more aggressive techniques in URI comparison. For example, if
   they observe that a URI such as

      http://example.com/data

   redirects to a URI differing only in the trailing slash

      http://example.com/data/

   they will likely regard the two as equivalent in the future. This
   kind of technique is only appropriate when equivalence is clearly
   indicated by both the result of accessing the resources and the
   common conventions of their scheme's dereference algorithm (in this
   case, use of redirection by HTTP origin servers to avoid problems
   with relative references).

6.3  Canonical Form

   It is in the best interests of everyone concerned to avoid false
   negatives in comparing URIs and to minimize the amount of software
   processing for such comparisons.
   Those who produce and make reference to URIs can reduce the cost of
   processing and the risk of false negatives by consistently providing
   them in a form that is reasonably canonical with respect to their
   scheme. Specifically:

   o  Always provide the URI scheme in lowercase characters.

   o  Always provide the host, if any, in lowercase characters.

   o  Only perform percent-encoding where it is essential.

   o  Always use uppercase A-through-F characters when percent-encoding.

   o  Prevent dot-segments appearing in non-relative URI paths.

   o  For schemes that define a default authority, use an empty
      authority if the default is desired.

   o  For schemes that define an empty path to be equivalent to a path
      of "/", use "/".

7.  Security Considerations

   A URI does not in itself pose a security threat. However, since URIs
   are often used to provide a compact set of instructions for access to
   network resources, care must be taken to properly interpret the data
   within a URI, to prevent that data from causing unintended access,
   and to avoid including data that should not be revealed in plain
   text.

7.1  Reliability and Consistency

   There is no guarantee that, having once used a given URI to retrieve
   some information, the same information will be retrievable by that
   URI in the future. Nor is there any guarantee that the information
   retrievable via that URI in the future will be observably similar to
   that retrieved in the past. The URI syntax does not constrain how a
   given scheme or authority apportions its name space or maintains it
   over time. Such a guarantee can only be obtained from the person(s)
   controlling that name space and the resource in question. A specific
   URI scheme may define additional semantics, such as name persistence,
   if those semantics are required of all naming authorities for that
   scheme.

7.2 Malicious Construction

   It is sometimes possible to construct a URI such that an attempt to
   perform a seemingly harmless, idempotent operation, such as the
   retrieval of a representation, will in fact cause a possibly damaging
   remote operation to occur. The unsafe URI is typically constructed
   by specifying a port number other than that reserved for the network
   protocol in question. The client unwittingly contacts a site that is
   running a different protocol service and data within the URI contains
   instructions that, when interpreted according to this other protocol,
   cause an unexpected operation. A frequent example of such abuse has
   been the use of a protocol-based scheme with a port component of
   "25", thereby fooling user agent software into sending an unintended
   or impersonating message via an SMTP server.

   Applications should prevent dereference of a URI that specifies a TCP
   port number within the "well-known port" range (0 - 1023) unless the
   protocol being used to dereference that URI is compatible with the
   protocol expected on that well-known port. Although IANA maintains a
   registry of well-known ports, applications should make such
   restrictions user-configurable to avoid preventing the deployment of
   new services.

   When a URI contains percent-encoded octets that match the delimiters
   for a given resolution or dereference protocol (for example, CR and
   LF characters for the TELNET protocol), such percent-encoded octets
   must not be decoded before transmission across that protocol.
   Transfer of the percent-encoding, which might violate the protocol,
   is less harmful than allowing decoded octets to be interpreted as
   additional operations or parameters, perhaps triggering an unexpected
   and possibly harmful remote operation.
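
   As a rough illustration of the rule on percent-encoded delimiters,
   the following Python sketch reports which percent-encodings in a URI
   component would decode to a protocol delimiter, so that a caller can
   keep the component in its encoded form. The function names and the
   default delimiter set (CR and LF) are assumptions made for this
   example; they are not defined by this specification.

```python
import re

# Matches one percent-encoded octet such as "%0D" (hex digits in either case).
_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def encoded_delimiters(component, delimiters=b"\r\n"):
    """Return the percent-encodings in `component` that decode to a delimiter.

    `delimiters` is a hypothetical, protocol-specific set of octets; CR and
    LF are the default only because the text names them for TELNET.
    """
    found = []
    for match in _PCT_OCTET.finditer(component):
        octet = int(match.group(1), 16)
        if octet in delimiters:  # membership test of an int in a bytes object
            found.append(match.group(0))
    return found


def safe_to_decode(component, delimiters=b"\r\n"):
    """True only if decoding `component` cannot introduce a delimiter octet."""
    return not encoded_delimiters(component, delimiters)
```

   A caller might test safe_to_decode() before decoding and otherwise
   transmit the component unchanged. Note that "%250D" is not flagged,
   since a single decoding pass yields the three characters "%0D" rather
   than a CR octet.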

7.3 Back-end Transcoding

   When a URI is dereferenced, the data within it is often parsed by
   both the user agent and one or more servers. In HTTP, for example, a
   typical user agent will parse a URI into its five major components,
   access the authority's server, and send it the data within the
   authority, path, and query components. A typical server will take
   that information, parse the path into segments and the query into
   key/value pairs, and then invoke implementation-specific handlers to
   respond to the request. As a result, a common security concern for
   server implementations that handle a URI, either as a whole or split
   into separate components, is proper interpretation of the octet data
   represented by the characters and percent-encodings within that URI.

   Percent-encoded octets must be decoded at some point during the
   dereference process. Applications must split the URI into its
   components and subcomponents prior to decoding the octets, since
   otherwise the decoded octets might be mistaken for delimiters.
   Security checks of the data within a URI should be applied after
   decoding the octets. Note, however, that the "%00" percent-encoding
   (NUL) may require special handling and should be rejected if the
   application is not expecting to receive raw data within a component.

   Special care should be taken when the URI path interpretation process
   involves the use of a back-end filesystem or related system
   functions. Filesystems typically assign an operational meaning to
   special characters, such as the "/", "\", ":", "[", and "]"
   characters, and special device names like ".", "..", "...", "aux",
   "lpt", etc.
   In some cases, merely testing for the existence of such a
   name will cause the operating system to pause or invoke unrelated
   system calls, leading to significant security concerns regarding
   denial of service and unintended data transfer. It would be
   impossible for this specification to list all such significant
   characters and device names; implementers should research the
   reserved names and characters for the types of storage device that
   may be attached to their application and restrict the use of data
   obtained from URI components accordingly.

7.4 Rare IP Address Formats

   Although the URI syntax for IPv4address only allows the common,
   dotted-decimal form of IPv4 address literal, many implementations
   that process URIs make use of platform-dependent system routines,
   such as gethostbyname() and inet_aton(), to translate the string
   literal to an actual IP address. Unfortunately, such system routines
   often allow and process a much larger set of formats than those
   described in Section 3.2.2.

   For example, many implementations allow dotted forms of three
   numbers, wherein the last part is interpreted as a 16-bit quantity
   and placed in the right-most two bytes of the network address (e.g.,
   a Class B network). Likewise, a dotted form of two numbers means the
   last part is interpreted as a 24-bit quantity and placed in the
   right-most three bytes of the network address (Class A), and a single
   number (without dots) is interpreted as a 32-bit quantity and stored
   directly in the network address. Adding further to the confusion,
   some implementations allow each dotted part to be interpreted as
   decimal, octal, or hexadecimal, as specified in the C language (i.e.,
   a leading 0x or 0X implies hexadecimal; otherwise, a leading 0
   implies octal; otherwise, the number is interpreted as decimal).
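
   The interpretation rules just described can be sketched in Python as
   follows. This is only an approximation of what the text attributes to
   routines such as inet_aton(); real platform routines differ in their
   details, and the names legacy_ipv4_to_int and is_loopback are
   hypothetical helpers introduced for this example. The sketch shows
   several different string literals collapsing to one numeric value,
   and a check made on that value rather than on the string.

```python
import ipaddress
import re

# One dotted part: C-style hexadecimal, octal (leading 0), or decimal.
_PART = re.compile(r"0[xX][0-9A-Fa-f]+|0[0-7]*|[1-9][0-9]*")


def _c_number(part):
    """Parse one part using the C prefix rules described in the text."""
    if not _PART.fullmatch(part):
        raise ValueError(f"not a C-style number: {part!r}")
    if part[:2] in ("0x", "0X"):
        return int(part[2:], 16)
    if part[0] == "0":
        return int(part, 8)  # "0" and "017" both parse as octal
    return int(part, 10)


def legacy_ipv4_to_int(text):
    """Convert a 1- to 4-part legacy IPv4 literal to a 32-bit integer."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 dotted parts: {text!r}")
    values = [_c_number(p) for p in parts]
    # Every part except the last occupies exactly one octet.
    if any(v > 0xFF for v in values[:-1]):
        raise ValueError(f"leading part out of range: {text!r}")
    # The last part fills the remaining 8, 16, 24, or 32 bits.
    tail_bits = 8 * (5 - len(parts))
    if values[-1] >= 1 << tail_bits:
        raise ValueError(f"last part out of range: {text!r}")
    address = 0
    for v in values[:-1]:
        address = (address << 8) | v
    return (address << tail_bits) | values[-1]


def is_loopback(text):
    """Filter on the numeric value, not on a prefix of the string."""
    address = ipaddress.IPv4Address(legacy_ipv4_to_int(text))
    return address in ipaddress.ip_network("127.0.0.0/8")
```

   Under these rules the literals "127.0.0.1", "127.1", "0x7f.1",
   "017700000001", and "2130706433" all denote the same address, which a
   filter that compares string prefixes would not detect.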

   These additional IP address formats are not allowed in the URI syntax
   due to differences between platform implementations. However, they
   can become a security concern if an application attempts to filter
   access to resources based on the IP address in string literal format.
   If such filtering is performed, literals should be converted to
   numeric form and filtered based on the numeric value, rather than a
   prefix or suffix of the string form.

7.5 Sensitive Information

   URI producers should not provide a URI that contains a username or
   password which is intended to be secret: URIs are frequently
   displayed by browsers, stored in clear text bookmarks, and logged by
   user agent history and intermediary applications (proxies). A
   password appearing within the userinfo component is deprecated and
   should be considered an error (or simply ignored) except in those
   rare cases where the 'password' parameter is intended to be public.

7.6 Semantic Attacks

   Because the userinfo subcomponent is rarely used and appears before
   the host in the authority component, it can be used to construct a
   URI that is intended to mislead a human user by appearing to identify
   one (trusted) naming authority while actually identifying a different
   authority hidden behind the noise. For example

      ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm

   might lead a human user to assume that the host is 'cnn.example.com',
   whereas it is actually '10.0.0.1'. Note that a misleading userinfo
   subcomponent could be much longer than the example above.

   A misleading URI, such as the one above, is an attack on the user's
   preconceived notions about the meaning of a URI, rather than an
   attack on the software itself.
   User agents may be able to reduce the
   impact of such attacks by distinguishing the various components of
   the URI when rendered, such as by using a different color or tone to
   render userinfo if any is present, though there is no general
   panacea. More information on URI-based semantic attacks can be found
   in [Siedzik].

8. Acknowledgments

   This specification is derived from RFC 2396 [RFC2396], RFC 1808
   [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those
   documents still apply. It also incorporates the update (with
   corrections) for IPv6 literals in the host syntax, as defined by
   Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
   [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz,
   Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
   Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
   Duerst, Stefan Eissing, Clive D.W. Feather, Tony Hammond, Pat Hayes,
   Henry Holtzman, Ian B. Jacobs, Michael Kay, John C. Klensin, Graham
   Klyne, Dan Kohn, Bruce Lilly, Andrew Main, Ira McDonald, Michael
   Mealling, Ray Merkert, Stephen Pollei, Julian Reschke, Tomas Rokicki,
   Miles Sabin, Kai Schaetzl, Mark Thomson, Ronald Tschalaer, Norm
   Walsh, Marc Warne, Stuart Williams, and Henry Zongaro are gratefully
   acknowledged.

9. References

9.1 Normative References

   [ASCII]    American National Standards Institute, "Coded Character
              Set -- 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

   [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 2234, November 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

9.2 Informative References

   [RFC0952]  Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
              With Widely Deployed DNS Software", RFC 1535, October
              1993.

   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
              Unifying Syntax for the Expression of Names and Addresses
              of Objects on the Network as used in the World-Wide Web",
              RFC 1630, June 1994.

   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
              Resource Locators", RFC 1736, February 1995.

   [RFC1737]  Masinter, L. and K. Sollins, "Functional Requirements for
              Uniform Resource Names", RFC 1737, December 1994.

   [RFC1738]  Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
              Resource Locators (URL)", RFC 1738, December 1994.

   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators", RFC
              1808, June 1995.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC2396]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D.
              Jensen, "HTTP Extensions for Distributed Authoring --
              WEBDAV", RFC 2518, February 1999.

   [RFC2557]  Palme, F., Hopmann, A., Shelness, N. and E. Stefferud,
              "MIME Encapsulation of Aggregate Documents, such as HTML
              (MHTML)", RFC 2557, March 1999.

   [RFC2717]  Petke, R. and I.
              King, "Registration Procedures for URL
              Scheme Names", BCP 35, RFC 2717, November 1999.

   [RFC2718]  Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
              "Guidelines for new URL Schemes", RFC 2718, November 1999.

   [RFC2732]  Hinden, R., Carpenter, B. and L. Masinter, "Format for
              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.

   [RFC2978]  Freed, N. and J. Postel, "IANA Charset Registration
              Procedures", BCP 19, RFC 2978, October 2000.

   [RFC3305]  Mealling, M. and R. Denenberg, "Report from the Joint W3C/
              IETF URI Planning Interest Group: Uniform Resource
              Identifiers (URIs), URLs, and Uniform Resource Names
              (URNs): Clarifications and Recommendations", RFC 3305,
              August 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P. and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3513]  Hinden, R. and S. Deering, "Internet Protocol Version 6
              (IPv6) Addressing Architecture", RFC 3513, April 2003.

   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?", April
              2001.

Authors' Addresses

   Tim Berners-Lee
   World Wide Web Consortium
   Massachusetts Institute of Technology
   77 Massachusetts Avenue
   Cambridge, MA 02139
   USA

   Phone: +1-617-253-5702
   Fax:   +1-617-258-5999
   EMail: timbl@w3.org
   URI:   http://www.w3.org/People/Berners-Lee/

   Roy T. Fielding
   Day Software
   5251 California Ave., Suite 110
   Irvine, CA 92612-3074
   USA

   Phone: +1-949-679-2960
   Fax:   +1-949-679-2972
   EMail: fielding@gbiv.com
   URI:   http://roy.gbiv.com/

   Larry Masinter
   Adobe Systems Incorporated
   345 Park Ave
   San Jose, CA 95110
   USA

   Phone: +1-408-536-3024
   EMail: LMM@acm.org
   URI:   http://larry.masinter.net/

Appendix A. Collected ABNF for URI

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-abs
                 / path-rootless
                 / path-empty

   URI-reference = URI / relative-URI

   absolute-URI  = scheme ":" hier-part [ "?" query ]

   relative-URI  = relative-part [ "?" query ] [ "#" fragment ]

   relative-part = "//" authority path-abempty
                 / path-abs
                 / path-noscheme
                 / path-empty

   scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   authority     = [ userinfo "@" ] host [ ":" port ]
   userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
   host          = IP-literal / IPv4address / reg-name
   port          = *DIGIT

   IP-literal    = "[" ( IPv6address / IPvFuture ) "]"

   IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address   =                            6( h16 ":" ) ls32
                 /                       "::" 5( h16 ":" ) ls32
                 / [               h16 ] "::" 4( h16 ":" ) ls32
                 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                 / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                 / [ *4( h16 ":" ) h16 ] "::"              ls32
                 / [ *5( h16 ":" ) h16 ] "::"              h16
                 / [ *6( h16 ":" ) h16 ] "::"

   h16           = 1*4HEXDIG
   ls32          = ( h16 ":" h16 ) / IPv4address

   IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet
   dec-octet     = DIGIT                 ; 0-9
                 / %x31-39 DIGIT         ; 10-99
                 / "1" 2DIGIT            ; 100-199
                 / "2" %x30-34 DIGIT     ; 200-249
                 / "25" %x30-35          ; 250-255

   reg-name      = 0*255( unreserved / pct-encoded / sub-delims )

   path          = path-abempty    ; begins with "/" or is empty
                 / path-abs        ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-abs      = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nzc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nzc   = 1*( unreserved / pct-encoded / sub-delims / "@" )

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Appendix B. Parsing a URI Reference with a Regular Expression

   Since the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking down a
   well-formed URI reference into its components.

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis). We refer to the value matched for subexpression
   <n> as $<n>.
   For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example. Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

   and, going in the opposite direction, we can recreate a URI reference
   from its components using the algorithm of Section 5.3.

Appendix C. Delimiting a URI in Context

   URIs are often transmitted through formats that do not provide a
   clear context for their interpretation. For example, there are many
   occasions when a URI is included in plain text; examples include text
   sent in electronic mail, USENET news messages, and, most importantly,
   printed on paper. In such cases, it is important to be able to
   delimit the URI from the rest of the text, and in particular from
   punctuation marks that might be mistaken for part of the URI.

   In practice, URIs are delimited in a variety of ways, but usually
   within double-quotes "http://example.com/", angle brackets
   <http://example.com/>, or just using whitespace

      http://example.com/

   These wrappers do not form part of the URI.

   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
   need to be added to break a long URI across lines. The whitespace
   should be ignored when extracting the URI.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of line when breaking a line, the interpreter of a
   URI containing a line break immediately after a hyphen should ignore
   all whitespace around the line break, and should be aware that the
   hyphen may or may not actually be part of the URI.

   Using <> angle brackets around each URI is especially recommended as
   a delimiting style for a reference that contains embedded whitespace.

   The prefix "URL:" (with or without a trailing space) was formerly
   recommended as a way to help distinguish a URI from other bracketed
   designators, though it is not commonly used in practice and is no
   longer recommended.

   For robustness, software that accepts a user-typed URI should attempt
   to recognize and strip both delimiters and embedded whitespace.

   For example, the text:

      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
      but you can probably pick it up from <ftp://foo.example.com/rfc/>.
      Note the warning in
      <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>.

   contains the URI references

      http://www.w3.org/Addressing/
      ftp://foo.example.com/rfc/
      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING

Appendix D. Summary of Non-editorial Changes

D.1 Additions

   IPv6 (and later) literals have been added to the list of possible
   identifiers for the host portion of an authority component, as
   described by [RFC2732], with the addition of "[" and "]" to the
   reserved set and a version flag to anticipate future versions of IP
   literals. Square brackets are now specified as reserved within the
   authority component and not allowed outside their use as delimiters
   for an IP literal within host. In order to make this change without
   changing the technical definition of the path, query, and fragment
   components, those rules were redefined to directly specify the
   characters allowed rather than be defined in terms of uric.

   Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
   address, which unfortunately lacks an ABNF description of
   IPv6address, we created a new ABNF rule for IPv6address that matches
   the text representations defined by Section 2.2 of [RFC3513].
   Likewise, the definition of IPv4address has been improved in order to
   limit each decimal octet to the range 0-255.

   Section 6 on URI normalization and comparison has been
   completely rewritten and extended using input from Tim Bray and
   discussion within the W3C Technical Architecture Group.

   An ABNF rule for URI has been introduced to correspond to the common
   usage of the term: an absolute URI with optional fragment.

D.2 Modifications from RFC 2396

   The ad-hoc BNF syntax has been replaced with the ABNF of [RFC2234].
   This change required all rule names that formerly included underscore
   characters to be renamed with a dash instead.

   Section 2 on characters has been rewritten to explain what characters
   are reserved, when they are reserved, and why they are reserved even
   when not used as delimiters by the generic syntax. The mark
   characters that are typically unsafe to decode, including the
   exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open
   and close parentheses ("(" and ")"), have been moved to the reserved
   set in order to clarify the distinction between reserved and
   unreserved and hopefully answer the most common question of scheme
   designers. Likewise, the section on percent-encoded characters has
   been rewritten, and URI normalizers are now given license to decode
   any percent-encoded octets corresponding to unreserved characters.
   In general, the terms "escaped" and "unescaped" have been replaced
   with "percent-encoded" and "decoded", respectively, to reduce
   confusion with other forms of escape mechanisms.
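
   To make the license given to normalizers concrete, the following
   Python sketch decodes only those percent-encoded octets that
   correspond to unreserved characters and rewrites every remaining
   percent-encoding with uppercase hexadecimal digits. The function name
   normalize_pct_encoding is hypothetical, and the sketch is one
   possible reading of the rule rather than a required algorithm.

```python
import re
import string

# The unreserved set from the collected ABNF: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct_encoding(text):
    """Decode percent-encoded unreserved characters; uppercase the rest."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # safe to decode: cannot act as a delimiter
        return "%" + match.group(1).upper()  # keep encoded, canonical case

    return _PCT_OCTET.sub(fix, text)
```

   For instance, "%7Euser/%2fa%41" becomes "~user/%2FaA": the tilde and
   the letter "A" are unreserved and are decoded, while the encoded
   slash stays encoded because decoding it would change how the path is
   split into segments.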

   The ABNF for URI and URI-reference has been redesigned to make them
   more friendly to LALR parsers and reduce complexity. As a result, the
   layout form of syntax description has been removed, along with the
   uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
   path_segments, rel_segment, and mark rules. All references to
   "opaque" URIs have been replaced with a better description of how the
   path component may be opaque to hierarchy. The ambiguity regarding
   the parsing of URI-reference as a URI or a relative-URI with a colon
   in the first segment has been eliminated through the use of five
   separate path matching rules.

   The fragment identifier has been moved back into the section on
   generic syntax components and within the URI and relative-URI rules,
   though it remains excluded from absolute-URI. The number sign ("#")
   character has been moved back to the reserved set as a result of
   reintegrating the fragment syntax.

   The ABNF has been corrected to allow a relative path to be empty.
   This also allows an absolute-URI to consist of nothing after the
   "scheme:", as is present in practice with the "dav:" namespace
   [RFC2518] and the "about:" scheme used internally by many WWW browser
   implementations. The ambiguity regarding the boundary between
   authority and path has been eliminated through the use of five
   separate path matching rules.

   Registry-based naming authorities that use the generic syntax are now
   defined within the host rule and limited to 255 path characters. This
   change allows current implementations, where whatever name provided
   is simply fed to the local name resolution mechanism, to be
   consistent with the specification and removes the need to re-specify
   DNS name formats here.
   It also allows the host component to contain
   percent-encoded octets, which is necessary to enable
   internationalized domain names to be provided in URIs, processed in
   their native character encodings at the application layers above URI
   processing, and passed to an IDNA library as a registered name in the
   UTF-8 character encoding. The server, hostport, hostname,
   domainlabel, toplabel, and alphanum rules have been removed.

   The resolving relative references algorithm of [RFC2396] has been
   rewritten using pseudocode for this revision to improve clarity and
   fix the following issues:

   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
      with no path.

   o  Restored the behavior of [RFC1808] where, if the reference
      contains an empty path and a defined query component, then the
      target URI inherits the base URI's path component.

   o  Removed the special-case treatment of same-document references
      within the URI parser in favor of a section that explains when a
      reference should be interpreted by a dereferencing engine as a
      same-document reference: when the target URI and base URI,
      excluding fragments, match. This change does not modify the
      behavior of existing same-document references as defined by RFC
      2396 (fragment-only references); it merely adds the same-document
      distinction to other references that refer to the base URI and
      simplifies the interface between applications and their URI
      parsers, as is consistent with the internal architecture of
      deployed URI processing implementations.

   o  Separated the path merge routine into two routines: merge, for
      describing combination of the base URI path with a relative-path
      reference, and remove_dot_segments, for describing how to remove
      the special "." and ".." segments from a composed path.
      The
      remove_dot_segments algorithm is now applied to all URI reference
      paths in order to match common implementations and improve the
      normalization of URIs in practice. This change only impacts the
      parsing of abnormal references and same-scheme references wherein
      the base URI has a non-hierarchical path.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.