idnits 2.17.1 draft-fielding-url-syntax-09.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 1594 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. 
RFC 2119 keyword, line 581: '... practice is NOT RECOMMENDED, because ...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 21, 1997) is 9682 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2045' is defined on line 1031, but no explicit reference was found in the text == Unused Reference: 'RFC2046' is defined on line 1035, but no explicit reference was found in the text == Unused Reference: 'ASCII' is defined on line 1054, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 1630 ** Obsolete normative reference: RFC 1738 (Obsoleted by RFC 4248, RFC 4266) ** Obsolete normative reference: RFC 1866 (Obsoleted by RFC 2854) ** Obsolete normative reference: RFC 822 (Obsoleted by RFC 2822) ** Obsolete normative reference: RFC 1808 (Obsoleted by RFC 3986) ** Downref: Normative reference to an Informational RFC: RFC 1736 ** Obsolete normative reference: RFC 2110 (Obsoleted by RFC 2557) ** Downref: 
Normative reference to an Informational RFC: RFC 1737 -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' Summary: 18 errors (**), 0 flaws (~~), 6 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group T. Berners-Lee, MIT/LCS 2 INTERNET-DRAFT R. Fielding, U.C. Irvine 3 draft-fielding-url-syntax-09 L. Masinter, Xerox Corporation 4 Expires six months after publication date October 21, 1997 6 Uniform Resource Locators (URL): Generic Syntax and Semantics 8 Status of this Memo 10 This document is an Internet-Draft. Internet-Drafts are working 11 documents of the Internet Engineering Task Force (IETF), its areas, 12 and its working groups. Note that other groups may also distribute 13 working documents as Internet-Drafts. 15 Internet-Drafts are draft documents valid for a maximum of six 16 months and may be updated, replaced, or obsoleted by other 17 documents at any time. It is inappropriate to use Internet-Drafts 18 as reference material or to cite them other than as ``work in 19 progress.'' 21 To learn the current status of any Internet-Draft, please check the 22 ``1id-abstracts.txt'' listing contained in the Internet-Drafts 23 Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net 24 (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East 25 Coast), or ftp.isi.edu (US West Coast). 27 Abstract 29 A Uniform Resource Locator (URL) is a compact string representation 30 of a location for use in identifying an abstract or physical 31 resource. This document defines the general syntax and semantics 32 of URLs, including both absolute and relative locators, and 33 guidelines for their use; it revises and replaces the generic 34 definitions in RFC 1738 and RFC 1808. 36 1. 
Introduction 38 Uniform Resource Locators (URLs) provide a simple and extensible 39 means for identifying a resource by its location. This 40 specification of URL syntax and semantics is derived from concepts 41 introduced by the World Wide Web global information initiative, 42 whose use of such objects dates from 1990 and is described in 43 "Universal Resource Identifiers in WWW" [RFC1630]. The 44 specification of URLs is designed to meet the recommendations laid 45 out in "Functional Recommendations for Internet Resource Locators" 46 [RFC1736]. 48 This document updates and merges "Uniform Resource Locators" 49 [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in 50 order to define a single, general syntax for all URLs. It excludes 51 those portions of RFC 1738 that defined the specific syntax of 52 individual URL schemes; those portions will be updated as separate 53 documents, as will the process for registration of new URL schemes. 54 This document does not discuss the issues and recommendation for 55 dealing with characters outside of the US-ASCII character set; 56 those recommendations are discussed in a separate document. 58 All significant changes from the prior RFCs are noted in Appendix G. 60 1.1 Overview of URLs 62 URLs are characterized by the following definitions: 64 Uniform 65 Uniformity of syntax and semantics allows the mechanism for 66 referencing resources to be independent of the mechanism used 67 to locate those resources and the operations applied to those 68 resources once they have been located. New types of resources, 69 access mechanisms, and operations can be introduced without 70 changing the protocols and data formats that use URLs. 71 Uniformity of syntax means that the same locator is used 72 independent of the locale, character representation, or 73 system type of the user entering the URL. 75 Resource 76 A resource can be anything that has identity. 
Familiar 77 examples include an electronic document, an image, a service 78 (e.g., "today's weather report for Los Angeles"), and a 79 collection of other resources. Not all resources are network 80 "retrievable"; e.g., human beings, corporations, and bound 81 books in a library can also be considered resources. 83 The resource is the conceptual mapping to an entity or set of 84 entities, not necessarily the entity which corresponds to that 85 mapping at any particular instance in time. Thus, a resource 86 can remain constant even when its content---the entities to 87 which it currently corresponds---changes over time, provided 88 that the conceptual mapping is not changed in the process. 90 Locator 91 A locator is an object that identifies a resource by its 92 location. In the case of URLs, the object is a sequence of 93 characters with a restricted syntax. An absolute locator 94 identifies a location independent of any context, whereas a 95 relative locator identifies a location relative to the 96 context in which it is found. 98 URLs are used to `locate' resources by providing an abstract 99 identification of the resource location. Having located a resource, 100 a system may perform a variety of operations on the resource, as 101 might be characterized by such words as `access', `update', 102 `replace', or `find attributes'. 104 1.2. URL, URN, and URI 106 URLs are a subset of Uniform Resource Identifiers (URI), which also 107 includes the notion of Uniform Resource Names (URN). A URN differs 108 from a URL in that it identifies a resource in a 109 location-independent fashion (see [RFC1737]). This specification 110 restricts its discussion to URLs. The syntax and semantics of other 111 URIs are defined by a separate set of specifications, although 112 it is expected that any URI notation would have a compatible syntax. 114 1.3. Example URLs 116 The following examples illustrate URLs which are in common use. 
118 ftp://ftp.is.co.za/rfc/rfc1808.txt 119 -- ftp scheme for File Transfer Protocol services 121 gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles 122 -- gopher scheme for Gopher and Gopher+ Protocol services 124 http://www.math.uio.no/faq/compression-faq/part1.html 125 -- http scheme for Hypertext Transfer Protocol services 127 mailto:mduerst@ifi.unizh.ch 128 -- mailto scheme for electronic mail addresses 130 news:comp.infosystems.www.servers.unix 131 -- news scheme for USENET news groups and articles 133 telnet://melvyl.ucop.edu/ 134 -- telnet scheme for interactive services via the TELNET Protocol 136 Many URL schemes have been defined. The scheme defines the 137 namespace of the URL. Although many URL schemes are named after 138 protocols, this does not imply that the only way to access the 139 URL's resource is via the named protocol. Gateways, proxies, 140 caches, and name resolution services might be used to access some 141 resources, independent of the protocol of their origin, and the 142 resolution of some URLs may require the use of more than one 143 protocol (e.g., both DNS and HTTP are typically used to access an 144 "http" URL's resource when it can't be found in a local cache). 146 1.4. Hierarchical URLs and Relative Forms 148 URL schemes may support a hierarchical naming system, where the 149 hierarchy of the name is denoted by a "/" delimiter separating the 150 components in the scheme. There is a `relative' form of URL reference 151 which is used in conjunction with a `base' URL (of a hierarchical 152 scheme) to produce another URL. The syntax of hierarchical URLs is 153 described in Section 4, and the relative URL calculation is described 154 in Section 5. 156 1.5. URL Transcribability 158 The URL syntax was designed with global transcribability as one of 159 its main concerns. A URL is a sequence of characters from a very 160 limited set, i.e. the letters of the basic Latin alphabet, digits, 161 and a few special characters. 
A URL may be represented in a 162 variety of ways: e.g., ink on paper, pixels on a screen, or a 163 sequence of octets in a coded character set. The interpretation of 164 a URL depends only on the characters used and not how those 165 characters are represented in a network protocol. 167 The goal of transcribability can be described by a simple scenario. 168 Imagine two colleagues, Sam and Kim, sitting in a pub at an 169 international conference and exchanging research ideas. Sam asks 170 Kim for a location to get more information, so Kim writes the URL 171 for the research site on a napkin. Upon returning home, Sam takes 172 out the napkin and types the URL into a computer, which then 173 retrieves the information to which Kim referred. 175 There are several design concerns revealed by the scenario: 177 o A URL is a sequence of characters, which is not always 178 represented as a sequence of octets. 180 o A URL may be transcribed from a non-network source, and thus 181 should consist of characters which are most likely to be able 182 to be typed into a computer, within the constraints imposed by 183 keyboards (and related input devices) across languages and 184 locales. 186 o A URL often needs to be remembered by people, and it is easier 187 for people to remember a URL when it consists of meaningful 188 components. 190 These design concerns are not always in alignment. For example, it 191 is often the case that the most meaningful name for a URL component 192 would require characters which cannot be typed into some systems. 193 The ability to transcribe the resource location from one medium to 194 another was considered more important than having its URL consist 195 of the most meaningful of components. In local and regional 196 contexts and with improving technology, users might benefit from 197 being able to use a wider range of characters; such use is not 198 defined in this document. 200 1.6. 
Syntax Notation and Common Elements 202 This document uses two conventions to describe and define the syntax 203 for Uniform Resource Locators. The first, called the layout form, is 204 a general description of the order of components and component 205 separators, as in 207 <first>/<second>;<third>?<fourth> 209 The component names are enclosed in angle-brackets and any characters 210 outside angle-brackets are literal separators. Whitespace should be 211 ignored. These descriptions are used informally and do not define 212 the syntax requirements. 214 The second convention is a BNF-like grammar, used to define the 215 formal URL syntax. The grammar is that of [RFC822], except that 216 "|" is used to designate alternatives. Briefly, rules are separated 217 from definitions by an equal "=", indentation is used to continue a 218 rule definition over more than one line, literals are quoted with "", 219 parentheses "(" and ")" are used to group elements, optional elements 220 are enclosed in "[" and "]" brackets, and elements may be preceded 221 with <n>* to designate n or more repetitions of the following 222 element; n defaults to 0. 224 Unlike many specifications which use a BNF-like grammar to define the 225 bytes (octets) allowed by a protocol, the URL grammar is defined in 226 terms of characters. Each literal in the grammar corresponds to the 227 character it represents, rather than to the octet encoding of that 228 character in any particular coded character set. How a URL is 229 represented in terms of bits and bytes on the wire is dependent upon 230 the character encoding of the protocol used to transport it, or the 231 charset of the document which contains it. 
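As an informal illustration of the point that the grammar is defined over characters rather than octets, the following Python sketch encodes one of the example URLs from Section 1.3 in three coded character sets. The choice of encodings ("ascii", "utf-16-be", and the EBCDIC code page "cp037") is an assumption made for this example only and is not taken from this document.

```python
# Sketch only: the same URL character sequence, carried as different octets.
# The three codec names are illustrative assumptions, not part of the draft.
url = "http://www.math.uio.no/faq/compression-faq/part1.html"

ascii_octets = url.encode("ascii")       # one octet per character
utf16_octets = url.encode("utf-16-be")   # two octets per character
ebcdic_octets = url.encode("cp037")      # EBCDIC: different octet values

# The octet sequences differ from one another...
octets_differ = ascii_octets != ebcdic_octets != utf16_octets

# ...but each decodes back to the identical sequence of URL characters.
round_trips = [
    ascii_octets.decode("ascii"),
    utf16_octets.decode("utf-16-be"),
    ebcdic_octets.decode("cp037"),
]
```

In each case the URL itself is unchanged; only its representation on the wire or in a document differs.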
233 The following definitions are common to many elements: 235 alpha = lowalpha | upalpha 237 lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | 238 "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | 239 "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" 241 upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | 242 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | 243 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" 245 digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | 246 "8" | "9" 248 alphanum = alpha | digit 250 The complete URL syntax is collected in Appendix A. 252 2. URL Characters and Escape Sequences 254 URLs consist of a restricted set of characters, primarily chosen to 255 aid transcribability and usability both in computer systems and in 256 non-computer communications. Characters used conventionally as 257 delimiters around URLs were excluded. The restricted set of 258 characters, consisting of digits, letters, and a few graphic symbols, 259 was chosen from those common to most of the character encodings 260 and input facilities available to Internet users. 262 Within a URL, characters are either used as delimiters, or to 263 represent strings of data (octets) within the delimited portions. 264 Octets are either represented directly by a character (using the 265 US-ASCII character for that octet) or by an escape encoding. This 266 representation is elaborated below. 268 2.1 URLs and non-ASCII characters 270 While URLs are sequences of characters and those characters are 271 used (within delimited sections) to represent sequences of octets, 272 in some cases those sequences of octets are used (via a 'charset' 273 or character encoding scheme) to represent sequences of characters: 275 URL char. sequence <-> octet sequence <-> original char. 
sequence 277 In cases where the original character sequence contains characters 278 that are strictly within the set of characters defined in the 279 US-ASCII character set, the mapping is simple: each original 280 character is translated into the US-ASCII code for it, and 281 subsequently represented either as the same character, or as an 282 escape sequence. 284 In general practice, many different character encoding schemes are 285 used in the second mapping (between sequences of represented 286 characters and sequences of octets) and there is generally no 287 representation in the URL itself of which mapping was used. While 288 there is a strong desire to provide for a general and uniform 289 mapping between more general scripts and URLs, the standard for 290 such use is outside of the scope of this document. 292 More systematic treatment of character encoding within URLs is 293 currently under development. 295 2.2. Reserved Characters 297 Many URLs include components consisting of or delimited by, certain 298 special characters. These characters are called "reserved", since 299 their usage within the URL component is limited to their reserved 300 purpose. If the data for a URL component would conflict with the 301 reserved purpose, then the conflicting data must be escaped before 302 forming the URL. 304 reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" 306 The "reserved" syntax class above refers to those characters which 307 are allowed within a URL, but which may not be allowed within a 308 particular component of the generic URL syntax; they are used as 309 delimiters of the components described in Section 4.3. 311 Characters in the "reserved" set are not reserved in all contexts. 312 The set of characters actually reserved within any given URL 313 component is defined by that component. In general, a character is 314 reserved if the semantics of the URL changes if the character is 315 replaced with its escaped US-ASCII encoding. 317 2.3. 
Unreserved Characters 319 Data characters which are allowed in a URL but do not have a reserved 320 purpose are called unreserved. These include upper and lower case 321 letters, decimal digits, and a limited set of punctuation marks and 322 symbols. 324 unreserved = alphanum | mark 326 mark = "$" | "-" | "_" | "." | "!" | "~" | 327 "*" | "'" | "(" | ")" | "," 329 Unreserved characters can be escaped without changing the semantics 330 of the URL, but this should not be done unless the URL is being used 331 in a context which does not allow the unescaped character to appear. 333 2.4. Escape Sequences 335 Data must be escaped if it does not have a representation using an 336 unreserved character; this includes data that does not correspond 337 to a printable character of the US-ASCII coded character set, or 338 that corresponds to any US-ASCII character that is disallowed, as 339 explained below. 341 2.4.1. Escaped Encoding 343 An escaped octet is encoded as a character triplet, consisting 344 of the percent character "%" followed by the two hexadecimal digits 345 representing the octet code. For example, "%20" is the escaped 346 encoding for the US-ASCII space character. 348 escaped = "%" hex hex 349 hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | 350 "a" | "b" | "c" | "d" | "e" | "f" 352 2.4.2. When to Escape and Unescape 354 A URL is always in an "escaped" form, since escaping or unescaping 355 a completed URL might change its semantics. Normally, the only 356 time escape encodings can safely be made is when the URL is being 357 created from its component parts; each component may have its own 358 set of characters which are reserved, so only the mechanism 359 responsible for generating or interpreting that component can 360 determine whether or not escaping a character will change its 361 semantics. Likewise, a URL must be separated into its components 362 before the escaped characters within those components can be safely 363 decoded. 
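The rule above (escape while assembling a URL from its parts, and split into components before unescaping) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it relies on the standard library functions urllib.parse.quote and urllib.parse.unquote, which follow a later specification and escape some characters that this document lists as unreserved "mark" characters; the helper name build_path and the sample segment values are hypothetical.

```python
from urllib.parse import quote, unquote

def build_path(segments):
    """Hypothetical helper: escape each segment, then join with "/".

    Escaping happens per component, before the URL is formed, so a "/"
    that is data inside a segment cannot be confused with a delimiter.
    """
    return "/" + "/".join(quote(segment, safe="") for segment in segments)

# Sample data (made up for illustration): one segment contains a "/",
# another contains "%" and a space.
segments = ["reports", "a/b", "50% off"]
path = build_path(segments)

# Safe order: separate into components first, then unescape each one.
decoded_after_split = [unquote(part) for part in path.split("/")[1:]]

# Unsafe order: unescape the completed path, then split. The escaped "/"
# becomes a delimiter and the original segment boundaries are lost.
decoded_before_split = unquote(path).split("/")[1:]
```

Here decoded_after_split recovers the original three segments, while decoded_before_split yields four, which is the change of semantics the text warns about.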
365 In some cases, data that could be represented by an unreserved 366 character may appear escaped; for example, some of the unreserved 367 "mark" characters are automatically escaped by some systems. It is 368 safe to unescape these within the body of a URL. For example, 369 "%7e" is sometimes used instead of "~" in an http URL path, but the 370 two can be used interchangeably. 372 Because the percent "%" character always has the reserved purpose of 373 being the escape indicator, it must be escaped as "%25" in order to 374 be used as data within a URL. Implementers should be careful not to 375 escape or unescape the same string more than once, since unescaping 376 an already unescaped string might lead to misinterpreting a percent 377 data character as another escaped character, or vice versa in the 378 case of escaping an already escaped string. 380 2.4.3. Excluded US-ASCII Characters 382 Although they are disallowed within the URL syntax, we include here 383 a description of those US-ASCII characters which have been excluded 384 and the reasons for their exclusion. 386 The control characters in the US-ASCII coded character set are not 387 used within a URL, both because they are non-printable and because 388 they are likely to be misinterpreted by some control mechanisms. 390 control = <US-ASCII coded characters 00-1F and 7F hexadecimal> 392 The space character is excluded because significant spaces may 393 disappear and insignificant spaces may be introduced when URLs are 394 transcribed or typeset or subjected to the treatment of 395 word-processing programs. Whitespace is also used to delimit URLs 396 in many contexts. 398 space = <US-ASCII coded character 20 hexadecimal> 400 The angle-bracket "<" and ">" and double-quote (") characters are 401 excluded because they are often used as the delimiters around URLs 402 in text documents and protocol fields. The character "#" is 403 excluded because it is used to delimit a URL from a fragment 404 identifier in URL references (Section 3). 
The percent character "%" 405 is excluded because it is used for the encoding of escaped 406 characters. 408 delims = "<" | ">" | "#" | "%" | <"> 410 Other characters are excluded because gateways and other transport 411 agents are known to sometimes modify such characters, or they are 412 used as delimiters. 414 unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" 416 Data corresponding to excluded characters must be escaped in order 417 to be properly represented within a URL. 419 3. URL-based references and URLs 421 In practice, resource locators consist not only of complete URLs, 422 but other resource references which contain either an absolute 423 or relative URL form, and may be followed by a fragment identifier. 424 The terminology around the use of URLs has been confusing. 426 The term "URL-reference" is used here to denote the common usage of 427 a resource locator. A URL reference may be absolute or relative, 428 and may have additional information attached in the form of a 429 fragment identifier. However, "the URL" which results from such a 430 reference includes only the absolute URL after the fragment 431 identifier (if any) is removed and after any relative URL is 432 resolved to its absolute form. Although it is possible to limit 433 the discussion of URL syntax and semantics to that of the absolute 434 result, most usage of URLs is within general URL references, and it 435 is impossible to obtain the URL from such a reference without also 436 parsing the fragment and resolving the relative form. 438 URL-reference = [ absoluteURL | relativeURL ] [ "#" fragment ] 440 The syntax for relative URLs is a shortened form of that for absolute 441 URLs, where some prefix of the URL is missing and certain path 442 components ("." and "..") have a special meaning when interpreting a 443 relative path. 
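The URL-reference rule above can be illustrated with a short Python sketch that separates the fragment from a reference and then resolves what remains against a base URL. It is only a sketch: urllib.parse.urldefrag and urllib.parse.urljoin implement a later specification than this draft, the function name resolve_reference is hypothetical, and the relative references and fragment names used are made up for the example (the base URL is one of the examples from Section 1.3).

```python
from urllib.parse import urldefrag, urljoin

def resolve_reference(base_url, reference):
    """Hypothetical helper: return (absolute URL, fragment) for a reference.

    The fragment is removed first because it is not part of the URL;
    the remainder, which may be absolute, relative, or empty, is then
    resolved against the base URL.
    """
    url_part, fragment = urldefrag(reference)
    absolute = urljoin(base_url, url_part)
    return absolute, fragment

base = "http://www.math.uio.no/faq/compression-faq/part1.html"

# A relative reference with a fragment identifier.
with_fragment = resolve_reference(base, "part2.html#intro")

# A relative path using the special ".." component.
parent = resolve_reference(base, "../index.html")

# A fragment-only reference: the URL part is empty, so the result
# refers to the base document itself.
same_document = resolve_reference(base, "#top")
```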
445 When a URL reference is used to perform a retrieval action on the 446 identified resource, the optional fragment identifier, separated from 447 the URL by a crosshatch ("#") character, consists of additional 448 reference information to be interpreted by the user agent after the 449 retrieval action has been successfully completed. As such, it is not 450 part of a URL, but is often used in conjunction with a URL. The 451 format and interpretation of fragment identifiers is dependent on the 452 media type of the retrieval result. 454 fragment = *urlc 456 A URL reference which does not contain a URL is a reference to the 457 current document. In other words, an empty URL reference within a 458 document is interpreted as a reference to the start of that document, 459 and a reference containing only a fragment identifier is a reference 460 to the identified fragment of that document. Traversal of such a 461 reference should not result in an additional retrieval action. 462 However, if the URL reference occurs in a context that is always 463 intended to result in a new request, as in the case of HTML's FORM 464 element, then an empty URL reference represents the base URL of the 465 current document and should be replaced by that URL when transformed 466 into a request. 468 4. Generic URL Syntax 470 4.1. Scheme 472 Just as there are many different methods of access to resources, 473 there are a variety of schemes for describing the location of such 474 resources. The URL syntax consists of a sequence of components 475 separated by reserved characters, with the first component defining 476 the semantics for the remainder of the URL string. 478 In general, absolute URLs are written as follows: 480 <scheme>:<scheme-specific-part> 482 An absolute URL contains the name of the scheme being used (<scheme>) 483 followed by a colon (":") and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme. 486 Scheme names consist of a sequence of characters. 
The lower case 487 letters "a"--"z", digits, and the characters plus ("+"), period 488 ("."), and hyphen ("-") are allowed. For resiliency, programs 489 interpreting URLs should treat upper case letters as equivalent to 490 lower case in scheme names (e.g., allow "HTTP" as well as "http"). 492 scheme = 1*( alpha | digit | "+" | "-" | "." ) 494 Relative URL references are distinguished from absolute URLs in that 495 they do not begin with a scheme name. Instead, the scheme is 496 inherited from the base URL, as described in Section 5.2. 498 4.2. Opaque and Hierarchical URLs 500 The URL syntax does not require that the scheme-specific-part have 501 any general structure or set of semantics which is common among all 502 URLs. However, a subset of URLs do share a common syntax for 503 representing hierarchical relationships within the locator namespace. 504 This generic-URL syntax is used in interpreting relative URLs. 506 absoluteURL = generic-URL | opaque-URL 508 opaque-URL = scheme ":" *urlc 510 generic-URL = scheme ":" relativeURL 512 The separation of the URL grammar into <generic-URL> and 513 <opaque-URL> is redundant, since both rules will successfully parse any string of 514 characters. The distinction is simply to clarify that a 515 parser of relative URL references (Section 5) will view a URL as a 516 generic-URL, whereas a handler of absolute references need only view 517 it as an opaque-URL. 519 URLs which are hierarchical in nature use the slash "/" character for 520 separating hierarchical components. For some file systems, a "/" 521 character (used to denote the hierarchical structure of a URL) is the 522 delimiter used to construct a file name hierarchy, and thus the URL 523 path will look similar to a file pathname. This does NOT imply that 524 the resource is a file or that the URL maps to an actual filesystem 525 pathname. 527 4.3. URL Syntactic Components 529 The URL syntax is dependent upon the scheme. Some schemes use 530 reserved characters like "?" 
and ";" to indicate special components, 531 while others just consider them to be part of the path. However, 532 most URL schemes use a common sequence of four main components to 533 define the location of a resource 535 <scheme>://<site><path>?<query> 537 each of which, except <scheme>, may be absent from a particular URL. 538 For example, some URL schemes do not allow a <site> component, and 539 others do not use a <query> component. 541 4.3.1. Site Component 543 Many URL schemes include a top hierarchical element for a naming 544 authority, such that the namespace defined by the remainder of the 545 URL is governed by that authority. This <site> component is 546 typically defined by an Internet-based server or a scheme-specific 547 registry of naming authorities. 549 site = server | authority 551 The <site> component is preceded by a double slash "//" and is 552 terminated by the next slash "/", question-mark "?", or by the end of 553 the URL. Within the <site> component, the characters ":", "@", "?", 554 and "/" are reserved. 556 The structure of a registry-based naming authority is specific to the 557 URL scheme, but constrained to the allowed characters for <authority>. 559 authority = *( unreserved | escaped | 560 ";" | ":" | "@" | "&" | "=" | "+" ) 562 URL schemes that involve the direct use of an IP-based protocol to a 563 specified server on the Internet use a common syntax for the <server> 564 component of the URL's scheme-specific data: 566 <userinfo>@<host>:<port> 568 where <userinfo> may consist of a user name and, optionally, 569 scheme-specific information about how to gain authorization to access 570 the server. The parts "<userinfo>@" and ":<port>" may be omitted. 572 server = [ [ userinfo "@" ] hostport ] 574 The user information, if present, is followed by a commercial 575 at-sign "@". 577 userinfo = *( unreserved | escaped | ":" | ";" | 578 "&" | "=" | "+" ) 580 Some URL schemes use the format "user:password" in the <userinfo> 581 field. 
This practice is NOT RECOMMENDED, because the passing of 582 authentication information in clear text (such as URLs) has proven to 583 be a security risk in almost every case where it has been used. 585 The host is a domain name of a network host, or its IPv4 address as 586 a set of four decimal digit groups separated by ".". Literal IPv6 587 addresses are not supported. 589 hostport = host [ ":" port ] 590 host = hostname | IPv4address 591 hostname = *( domainlabel "." ) toplabel [ "." ] 592 domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum 593 toplabel = alpha | alpha *( alphanum | "-" ) alphanum 594 IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit 595 port = *digit 597 Hostnames take the form described in Section 3 of [RFC1034] and 598 Section 2.1 of [RFC1123]: a sequence of domain labels separated by 599 ".", each domain label starting and ending with an alphanumeric 600 character and possibly also containing "-" characters. The rightmost 601 domain label of a fully qualified domain name will never start with a 602 digit, thus syntactically distinguishing domain names from IPv4 603 addresses, and may be followed by a single "." if it is necessary to 604 distinguish between the complete domain name and any local domain. 605 To actually be "Uniform" as a resource locator, a URL hostname should 606 be a fully qualified domain name. In practice, however, the host 607 component may be a local domain literal. 609 Note: A suitable representation for including a literal IPv6 610 address as the host part of a URL is desired, but has not yet 611 been determined or implemented in practice. 613 The port is the network port number for the server. Most schemes 614 designate protocols that have a default port number. Another port 615 number may optionally be supplied, in decimal, separated from the 616 host by a colon. If the port is omitted, the default port number is 617 assumed. 
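The server syntax described above (optional user information, a host, and an optional port that falls back to a scheme default) can be sketched in Python as shown below. This is a simplified illustration, not a validating parser: the names split_server, effective_port, and DEFAULT_PORTS are hypothetical, the default port table lists only a few well-known assignments chosen for the example, and the user name "user" and port 2121 are made-up values.

```python
# Illustrative subset of default ports; an assumption for this sketch,
# not a table defined by this document.
DEFAULT_PORTS = {"ftp": 21, "telnet": 23, "gopher": 70, "http": 80}

def split_server(server):
    """Split a server string into (userinfo, host, port).

    userinfo is None when no "@" is present; port is None when it is
    omitted or empty. No validation of the host or port syntax is done.
    """
    userinfo, at_sign, hostport = server.rpartition("@")
    if not at_sign:
        userinfo = None
    host, _, port_text = hostport.partition(":")
    port = int(port_text) if port_text else None
    return userinfo, host, port

def effective_port(scheme, port):
    """Return the explicit port, else the scheme default (None if unknown)."""
    if port is not None:
        return port
    return DEFAULT_PORTS.get(scheme.lower())

full = split_server("user@ftp.is.co.za:2121")
host_only = split_server("www.math.uio.no")
http_port = effective_port("HTTP", host_only[2])
```

Lower-casing the scheme before the lookup follows the recommendation in Section 4.1 that upper case letters in scheme names be treated as equivalent to lower case.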
619 A site component is not required for a URL scheme to make use of 620 relative references. A base URL without a site component implies 621 that any relative reference will also be without a site component. 623 4.3.2. Path Component 625 The path component contains data, specific to the site (or the scheme 626 if there is no site component), identifying the resource within the 627 scope of that scheme and site. 629 path = [ "/" ] path_segments 631 path_segments = segment *( "/" segment ) 632 segment = *pchar *( ";" param ) 633 param = *pchar 635 pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" 637 The path may consist of a sequence of path segments separated by a 638 single slash "/" character. Within a path segment, the characters 639 "/", ";", "=", and "?" are reserved. Each path segment may include a 640 sequence of parameters, indicated by the semicolon ";" character. 641 The parameters are not significant to the parsing of relative 642 references. 644 4.3.3. Query Component 646 The query component is a string of information to be interpreted by 647 the resource. 649 query = *urlc 651 Within a query component, the characters "/", "&", "=", and "+" are 652 reserved. 654 4.4. Parsing a URL Reference 656 A URL reference is typically parsed according to the four main 657 components in order to determine what components are present and 658 whether or not the reference is relative or absolute. The individual 659 components are then parsed for their subparts and to verify their 660 validity. A reference is parsed as if it is a generic-URL, even 661 though it might be considered opaque by later processes. 663 Although the BNF defines what is allowed in each component, it is 664 ambiguous in terms of differentiating between a site component and 665 a path component that begins with two slash characters. 
The greedy 666 algorithm is used for disambiguation: the left-most matching rule 667 soaks up as much of the URL reference string as it is capable of 668 matching. In other words, the site component wins. 670 Readers familiar with regular expressions should see Appendix B for a 671 concrete parsing example and test oracle. 673 5. Relative URL References 675 It is often the case that a group or "tree" of documents has been 676 constructed to serve a common purpose; the vast majority of URLs in 677 these documents point to locations within the tree rather than 678 outside of it. Similarly, documents located at a particular site 679 are much more likely to refer to other resources at that site than 680 to resources at remote sites. 682 Relative addressing of URLs allows document trees to be partially 683 independent of their location and access scheme. For instance, it is 684 possible for a single set of hypertext documents to be simultaneously 685 accessible and traversable via each of the "file", "http", and "ftp" 686 schemes if the documents refer to each other using relative URLs. 687 Furthermore, such document trees can be moved, as a whole, without 688 changing any of the relative references. Experience within the WWW 689 has demonstrated that the ability to perform relative referencing 690 is necessary for the long-term usability of embedded URLs. 692 relativeURL = net_path | abs_path | rel_path 694 A relative reference beginning with two slash characters is termed a 695 network-path reference. Such references are rarely used. 697 net_path = "//" site [ abs_path ] 699 A relative reference beginning with a single slash character is 700 termed an absolute-path reference. 702 abs_path = "/" rel_path 704 A relative reference which does not begin with a scheme name or a 705 slash character is termed a relative-path reference. 707 rel_path = [ path_segments ] [ "?" query ] 709 Within a relative-path reference, the complete path segments "." and 710 ".." 
have special meanings: "the current hierarchy level" and "the 711 level above this hierarchy level", respectively. Although this is 712 very similar to their use within Unix-based filesystems to indicate 713 directory levels, these path components are only considered special 714 when resolving a relative-path reference to its absolute form 715 (Section 5.2). 717 Authors should be aware that a path segment which contains a colon 718 character cannot be used as the first segment of a relative URL path 719 (e.g., "this:that"), because it would be mistaken for a scheme name. 720 It is therefore necessary to precede such segments with other 721 segments (e.g., "./this:that") in order for them to be referenced as 722 a relative path. 724 It is not necessary for all URLs within a given scheme to be 725 restricted to the generic-URL syntax, since the hierarchical 726 properties of that syntax are only necessary when relative URLs are 727 used within a particular document. Documents can only make use of 728 relative URLs when their base URL fits within the generic-URL syntax. 729 It is assumed that any document which contains a relative reference 730 will also have a base URL that obeys the syntax. In other words, 731 relative URLs cannot be used within a document that has an unsuitable 732 base URL. 734 5.1. Establishing a Base URL 736 The term "relative URL" implies that there exists some absolute "base 737 URL" against which the relative reference is applied. Indeed, the 738 base URL is necessary to define the semantics of any relative URL 739 reference; without it, a relative reference is meaningless. In order 740 for relative URLs to be usable within a document, the base URL of 741 that document must be known to the parser. 743 The base URL of a document can be established in one of four ways, 744 listed below in order of precedence. The order of precedence can be 745 thought of in terms of layers, where the innermost defined base URL 746 has the highest precedence. 
This can be visualized graphically as: 748 .----------------------------------------------------------. 749 | .----------------------------------------------------. | 750 | | .----------------------------------------------. | | 751 | | | .----------------------------------------. | | | 752 | | | | .----------------------------------. | | | | 753 | | | | | | | | | | 754 | | | | `----------------------------------' | | | | 755 | | | | (5.1.1) Base URL embedded in the | | | | 756 | | | | document's content | | | | 757 | | | `----------------------------------------' | | | 758 | | | (5.1.2) Base URL of the encapsulating entity | | | 759 | | | (message, document, or none). | | | 760 | | `----------------------------------------------' | | 761 | | (5.1.3) URL used to retrieve the entity | | 762 | `----------------------------------------------------' | 763 | (5.1.4) Default Base URL is application-dependent | 764 `----------------------------------------------------------' 766 5.1.1. Base URL within Document Content 768 Within certain document media types, the base URL of the document can 769 be embedded within the content itself such that it can be readily 770 obtained by a parser. This can be useful for descriptive documents, 771 such as tables of content, which may be transmitted to others through 772 protocols other than their usual retrieval context (e.g., E-Mail or 773 USENET news). 775 It is beyond the scope of this document to specify how, for each 776 media type, the base URL can be embedded. It is assumed that user 777 agents manipulating such media types will be able to obtain the 778 appropriate syntax from that media type's specification. An example 779 of how the base URL can be embedded in the Hypertext Markup Language 780 (HTML) [RFC1866] is provided in Appendix D. 782 A mechanism for embedding the base URL within MIME container types 783 (e.g., the message and multipart types) is defined by MHTML 784 [RFC2110]. 
Protocols that do not use the MIME message header syntax, 785 but which do allow some form of tagged metainformation to be included 786 within messages, may define their own syntax for defining the base 787 URL as part of a message. 789 5.1.2. Base URL from the Encapsulating Entity 791 If no base URL is embedded, the base URL of a document is defined by 792 the document's retrieval context. For a document that is enclosed 793 within another entity (such as a message or another document), the 794 retrieval context is that entity; thus, the default base URL of the 795 document is the base URL of the entity in which the document is 796 encapsulated. 798 5.1.3. Base URL from the Retrieval URL 800 If no base URL is embedded and the document is not encapsulated 801 within some other entity (e.g., the top level of a composite entity), 802 then, if a URL was used to retrieve the base document, that URL shall 803 be considered the base URL. Note that if the retrieval was the 804 result of a redirected request, the last URL used (i.e., that which 805 resulted in the actual retrieval of the document) is the base URL. 807 5.1.4. Default Base URL 809 If none of the conditions described in Sections 5.1.1--5.1.3 apply, 810 then the base URL is defined by the context of the application. 811 Since this definition is necessarily application-dependent, failing 812 to define the base URL using one of the other methods may result in 813 the same content being interpreted differently by different types of 814 application. 816 It is the responsibility of the distributor(s) of a document 817 containing relative URLs to ensure that the base URL for that 818 document can be established. It must be emphasized that relative 819 URLs cannot be used reliably in situations where the document's 820 base URL is not well-defined. 822 5.2. 
Resolving Relative References to Absolute Form 824 This section describes an example algorithm for resolving URL 825 references which might be relative to a given base URL. 827 The base URL is established according to the rules of Section 5.1 and 828 parsed into the four main components as described in Section 4.4. 829 Note that only the scheme component is required to be present in the 830 base URL; the other components may be empty or undefined. A 831 component is undefined if its preceding separator does not appear in 832 the URL reference; the path component is never undefined, though it 833 may be empty. The base URL's query component is not used by the 834 resolution algorithm and may be discarded. 836 For each URL reference, the following steps are performed in order: 838 1) The URL reference is parsed into the potential four components and 839 fragment identifier, as described in Section 4.4. 841 2) If the path component is empty and the scheme, site, and query 842 components are undefined, then it is a reference to the current 843 document and we are done. Otherwise, the reference URL's query 844 and fragment components are defined as found (or not found) within 845 the URL reference and not inherited from the base URL. 847 3) If the scheme component is defined, indicating that the reference 848 starts with a scheme name, then the reference is interpreted as an 849 absolute URL and we are done. Otherwise, the reference URL's 850 scheme is inherited from the base URL's scheme component. 852 4) If the site component is defined, then the reference is a 853 network-path and we skip to step 7. Otherwise, the reference 854 URL's site is inherited from the base URL's site component, 855 which will also be undefined if the URL scheme does not use a 856 site component. 858 5) If the path component begins with a slash character ("/"), then 859 the reference is an absolute-path and we skip to step 7. 
   6) If this step is reached, then we are resolving a relative-path
      reference.  The relative path needs to be merged with the base
      URL's path.  Although there are many ways to do this, we will
      describe a simple method using a separate string buffer.

      a) All but the last segment of the base URL's path component are
         copied to the buffer.  In other words, any characters after the
         last (right-most) slash character, if any, are excluded.

      b) The reference's path component is appended to the buffer
         string.

      c) All occurrences of "./", where "." is a complete path segment,
         are removed from the buffer string.

      d) If the buffer string ends with "." as a complete path segment,
         that "." is removed.

      e) All occurrences of "<segment>/../", where <segment> is a
         complete path segment not equal to "..", are removed from the
         buffer string.  Removal of these path segments is performed
         iteratively, removing the leftmost matching pattern on each
         iteration, until no matching pattern remains.

      f) If the buffer string ends with "<segment>/..", where <segment>
         is a complete path segment not equal to "..", that
         "<segment>/.." is removed.

      g) If the resulting buffer string still begins with one or more
         complete path segments of "..", then the reference is
         considered to be in error.  Implementations may handle this
         error by retaining these components in the resolved path
         (i.e., treating them as part of the final URL), by removing
         them from the resolved path (i.e., discarding relative levels
         above the root), or by avoiding traversal of the reference.

      h) The remaining buffer string is the reference URL's new path
         component.

   7) The resulting URL components, including any inherited from the
      base URL, are recombined to give the absolute form of the URL
      reference.
      Using pseudocode, this would be

         result = ""

         if scheme is defined then
             append scheme to result
             append ":" to result

         if site is defined then
             append "//" to result
             append site to result

         append path to result

         if query is defined then
             append "?" to result
             append query to result

         if fragment is defined then
             append "#" to result
             append fragment to result

         return result

      Note that we must be careful to preserve the distinction between a
      component that is undefined, meaning that its separator was not
      present in the reference, and a component that is empty, meaning
      that the separator was present and was immediately followed by the
      next component separator or the end of the reference.

   The above algorithm is intended to provide an example by which the
   output of implementations can be tested -- implementation of the
   algorithm itself is not required.  For example, some systems may find
   it more efficient to implement step 6 as a pair of segment stacks
   being merged, rather than as a series of string pattern replacements.

      Note: Some WWW client applications will fail to separate the
      reference's query component from its path component before merging
      the base and reference paths in step 6 above.  This may result in
      a loss of information if the query component contains the strings
      "/../" or "/./".

   Resolution examples are provided in Appendix C.

6. URL Normalization and Equivalence

   In many cases, different URL strings may actually identify the same
   resource.  For example, the host names used in URLs are case
   insensitive, so two URLs that differ only in the case of their host
   names are equivalent.  In general, the rules for equivalence and
   definition of a normal form, if any, are scheme dependent.
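   As a rough illustration of such a comparison, the Python sketch
   below builds a comparison key by lower-casing the host name and by
   dropping a port that equals the scheme's default, relying on the
   rule of Section 4.3.1 that an omitted port means the default port.
   It is not part of this specification.  The function names and the
   small DEFAULT_PORTS table are assumptions made for the example, and
   the sketch only applies to URLs whose site component uses the server
   form.

```python
import re

# Illustrative table only: default ports are defined by each scheme.
DEFAULT_PORTS = {"http": "80", "ftp": "21", "gopher": "70"}

# scheme "://" site, followed by whatever remains (path, query, fragment).
_WITH_SITE = re.compile(r"([^:/?#]+)://([^/?#]*)(.*)", re.DOTALL)


def comparison_key(url):
    """Return a string that is equal for URLs this sketch treats as equivalent."""
    match = _WITH_SITE.fullmatch(url)
    if match is None:
        # No site component: nothing that this sketch knows how to normalize.
        return url
    scheme, site, rest = match.groups()
    userinfo, at, hostport = site.rpartition("@")
    host, _, port = hostport.partition(":")
    if port == DEFAULT_PORTS.get(scheme.lower()):
        # An explicit default port is equivalent to an elided one.
        port = ""
    hostport = host.lower() + (":" + port if port else "")
    # Everything after the site component is left untouched.
    return f"{scheme}://{userinfo}{at}{hostport}{rest}"


def equivalent(url_a, url_b):
    return comparison_key(url_a) == comparison_key(url_b)


print(equivalent("http://WWW.W3.ORG/Addressing/",
                 "http://www.w3.org:80/Addressing/"))
```

   Only the host and port are touched; the path is compared as written,
   since any further equivalence is a matter for the individual scheme.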
When a scheme uses elements of the common syntax, it 954 will also use the common syntax equivalence rules, namely that host 955 name is case independent, and a URL with an explicit ":port", where 956 the port is the default for the scheme, is equivalent to one 957 where the port is elided. 959 7. Security Considerations 961 A URL does not in itself pose a security threat. Users should beware 962 that there is no general guarantee that a URL, which at one time 963 located a given resource, will continue to do so. Nor is there any 964 guarantee that a URL will not locate a different resource at some 965 later point in time, due to the lack of any constraint on how a given 966 site apportions its namespace. Such a guarantee can only be 967 obtained from the person(s) controlling that namespace and the 968 resource in question. 970 It is sometimes possible to construct a URL such that an attempt to 971 perform a seemingly harmless, idempotent operation, such as the 972 retrieval of an entity associated with the resource, will in fact 973 cause a possibly damaging remote operation to occur. The unsafe URL 974 is typically constructed by specifying a port number other than that 975 reserved for the network protocol in question. The client 976 unwittingly contacts a site which is in fact running a different 977 protocol. The content of the URL contains instructions which, when 978 interpreted according to this other protocol, cause an unexpected 979 operation. An example has been the use of gopher URLs to cause an 980 unintended or impersonating message to be sent via a SMTP server. 982 Caution should be used when using any URL which specifies a port 983 number other than the default for the protocol, especially when it 984 is a number within the reserved space. 986 Care should be taken when URLs contain escaped delimiters for a 987 given protocol (for example, CR and LF characters for telnet 988 protocols) that these are not unescaped before transmission. 
This 989 might violate the protocol, but avoids the potential for such 990 characters to be used to simulate an extra operation or parameter 991 in that protocol, which might lead to an unexpected and possibly 992 harmful remote operation to be performed. 994 It is clearly unwise to use a URL that contains a password which is 995 intended to be secret. In particular, the use of a password within 996 the "site" component of a URL is strongly disrecommended except 997 in those rare cases where the 'password' parameter is intended 998 to be public. 1000 8. Acknowledgements 1002 This document was derived from RFC 1738 [RFC1738] and RFC 1808 1003 [RFC1808]; the acknowledgements in those specifications still 1004 apply. In addition, contributions by Lauren Wood, Martin Duerst, 1005 Gisle Aas, Martijn Koster, Ryan Moats, Foteos Macrides and 1006 Dave Kristol are gratefully acknowledged. 1008 9. References 1010 [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A 1011 Unifying Syntax for the Expression of Names and Addresses of 1012 Objects on the Network as used in the World-Wide Web", RFC 1630, 1013 CERN, June 1994. 1015 [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors, 1016 "Uniform Resource Locators (URL)", RFC 1738, CERN, Xerox 1017 Corporation, University of Minnesota, December 1994. 1019 [RFC1866] Berners-Lee T., and D. Connolly, "HyperText Markup Language 1020 Specification -- 2.0", RFC 1866, MIT/W3C, November 1995. 1022 [RFC1123] Braden, R., Editor, "Requirements for Internet Hosts -- 1023 Application and Support", STD 3, RFC 1123, IETF, October 1989. 1025 [RFC822] Crocker, D., "Standard for the Format of ARPA Internet Text 1026 Messages", STD 11, RFC 822, UDEL, August 1982. 1028 [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC 1808, 1029 UC Irvine, June 1995. 1031 [RFC2045] N. Freed & N. 
Borenstein, "Multipurpose Internet Mail
     Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
     2045, November 1996.

[RFC2046] Freed, N., and N. Borenstein, "Multipurpose Internet Mail
     Extensions (MIME) Part Two: Media Types", RFC 2046, Innosoft,
     Bellcore, November 1996.

[RFC1736] Kunze, J., "Functional Recommendations for Internet Resource
     Locators", RFC 1736, IS&T, UC Berkeley, February 1995.

[RFC1034] Mockapetris, P., "Domain Names - Concepts and Facilities",
     STD 13, RFC 1034, USC/Information Sciences Institute, November
     1987.

[RFC2110] Palme, J., and A. Hopmann, "MIME E-mail Encapsulation of
     Aggregate Documents, such as HTML (MHTML)", RFC 2110, Stockholm
     University/KTH, Microsoft Corporation, March 1997.

[RFC1737] Sollins, K., and L. Masinter, "Functional Requirements for
     Uniform Resource Names", RFC 1737, MIT/LCS, Xerox Corporation,
     December 1994.

[ASCII] US-ASCII, "Coded Character Set -- 7-bit American Standard Code
     for Information Interchange", ANSI X3.4-1986.

10. Notices

   Copyright (C) The Internet Society 1997.  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
However, this 1067 document itself may not be modified in any way, such as by removing 1068 the copyright notice or references to the Internet Society or other 1069 Internet organizations, except as needed for the purpose of 1070 developing Internet standards in which case the procedures for 1071 copyrights defined in the Internet Standards process must be 1072 followed, or as required to translate it into languages other than 1073 English. 1075 The limited permissions granted above are perpetual and will not be 1076 revoked by the Internet Society or its successors or assigns. 1078 This document and the information contained herein is provided on an 1079 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1080 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1081 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1082 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1083 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1085 The IETF takes no position regarding the validity or scope of any 1086 intellectual property or other rights that might be claimed to 1087 pertain to the implementation or use of the technology described in 1088 this document or the extent to which any license under such rights 1089 might or might not be available; neither does it represent that it 1090 has made any effort to identify any such rights. Information on the 1091 IETF's procedures with respect to rights in standards-track and 1092 standards-related documentation can be found in BCP-11. Copies of 1093 claims of rights made available for publication and any assurances of 1094 licenses to be made available, or the result of an attempt made to 1095 obtain a general license or permission for the use of such 1096 proprietary rights by implementors or users of this specification can 1097 be obtained from the IETF Secretariat. 
   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.

11. Authors' Addresses

   Tim Berners-Lee
   World Wide Web Consortium
   MIT Laboratory for Computer Science, NE43-356
   545 Technology Square
   Cambridge, MA 02139

   Fax: +1(617)258-8682
   EMail: timbl@w3.org

   Roy T. Fielding
   Department of Information and Computer Science
   University of California, Irvine
   Irvine, CA 92697-3425

   Fax: +1(714)824-1715
   EMail: fielding@ics.uci.edu

   Larry Masinter
   Xerox PARC
   3333 Coyote Hill Road
   Palo Alto, CA 94034

   Fax: +1(415)812-4333
   EMail: masinter@parc.xerox.com

Appendices

A. Collected BNF for URLs

      URL-reference = [ absoluteURL | relativeURL ] [ "#" fragment ]
      absoluteURL   = generic-URL | opaque-URL
      opaque-URL    = scheme ":" *urlc
      generic-URL   = scheme ":" relativeURL

      relativeURL   = net_path | abs_path | rel_path
      net_path      = "//" site [ abs_path ]
      abs_path      = "/" rel_path
      rel_path      = [ path_segments ] [ "?" query ]

      scheme        = 1*( alpha | digit | "+" | "-" | "." )

      site          = server | authority

      authority     = *( unreserved | escaped |
                         ";" | ":" | "@" | "&" | "=" | "+" )

      server        = [ [ userinfo "@" ] hostport ]
      userinfo      = *( unreserved | escaped | ":" | ";" | "&" |
                         "=" | "+" )
      hostport      = host [ ":" port ]
      host          = hostname | IPv4address
      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
      IPv4address   = 1*digit "." 1*digit "." 1*digit "."
                      1*digit
      port          = *digit

      path          = [ "/" ] path_segments
      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar
      pchar         = unreserved | escaped | ":" | "@" | "&" | "=" | "+"

      query         = *urlc

      fragment      = *urlc

      urlc          = reserved | unreserved | escaped
      reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
      unreserved    = alpha | digit | mark
      mark          = "$" | "-" | "_" | "." | "!" | "~" |
                      "*" | "'" | "(" | ")" | ","

      escaped       = "%" hex hex
      hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                      "a" | "b" | "c" | "d" | "e" | "f"

      alphanum      = alpha | digit
      alpha         = lowalpha | upalpha

      lowalpha      = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                      "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                      "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
      upalpha       = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                      "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                      "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
      digit         = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                      "8" | "9"

B. Parsing a URL Reference with a Regular Expression

   As described in Section 4.4, the generic-URL syntax is not sufficient
   to disambiguate the components of some forms of URL.  Since the
   "greedy algorithm" described in that section is identical to the
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential four components and fragment identifier of a URL reference.

   The following line is the regular expression for breaking-down a URL
   reference into its components.

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).
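   The expression can be used as-is with most regular-expression
   libraries.  The Python sketch below is offered only as an
   illustration and is not part of this specification: it applies the
   expression to a reference, picks out subexpressions 2, 4, 5, 7, and
   9, and then recombines them following step 7 of Section 5.2.  The
   names parse_reference and recombine are assumptions made for the
   example.  An unmatched subexpression is reported by the library as
   None, which the sketch uses to represent an undefined component, as
   opposed to an empty one.

```python
import re

# The regular expression given above, unchanged.
URL_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def parse_reference(reference):
    """Split a URL reference into its components (None means undefined)."""
    # Every part of the expression is optional, so the match cannot fail.
    match = URL_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "site": match.group(4),
        "path": match.group(5),      # never undefined, but may be empty
        "query": match.group(7),
        "fragment": match.group(9),
    }


def recombine(parts):
    """Rebuild the reference, following the pseudocode of Section 5.2, step 7."""
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["site"] is not None:
        result += "//" + parts["site"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result


parts = parse_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
print(recombine(parts))
```

   Keeping None distinct from the empty string is what allows a
   reference such as "g?" (an empty but defined query) to survive the
   round trip unchanged.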
   We refer to the value matched for subexpression <n> as $<n>.  For
   example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      site      = $4
      path      = $5
      query     = $7
      fragment  = $9

   and, going in the opposite direction, we can recreate a URL reference
   from its components using the algorithm in step 7 of Section 5.2.

C. Examples of Resolving Relative URL References

   Within an object with a well-defined base URL of

      http://a/b/c/d;p?q

   the relative URLs would be resolved as follows:

C.1. Normal Examples

      g:h           =  g:h
      g             =  http://a/b/c/g
      ./g           =  http://a/b/c/g
      g/            =  http://a/b/c/g/
      /g            =  http://a/g
      //g           =  http://g
      ?y            =  http://a/b/c/?y
      g?y           =  http://a/b/c/g?y
      #s            =  (current document)#s
      g#s           =  http://a/b/c/g#s
      g?y#s         =  http://a/b/c/g?y#s
      ;x            =  http://a/b/c/;x
      g;x           =  http://a/b/c/g;x
      g;x?y#s       =  http://a/b/c/g;x?y#s
      .             =  http://a/b/c/
      ./            =  http://a/b/c/
      ..            =  http://a/b/
      ../           =  http://a/b/
      ../g          =  http://a/b/g
      ../..         =  http://a/
      ../../        =  http://a/
      ../../g       =  http://a/g

C.2. Abnormal Examples

   Although the following abnormal examples are unlikely to occur in
   normal practice, all URL parsers should be capable of resolving them
   consistently.  Each example uses the same base as above.

   An empty reference refers to the start of the current document.
1284 <> = (current document) 1286 Parsers must be careful in handling the case where there are more 1287 relative path ".." segments than there are hierarchical levels in 1288 the base URL's path. Note that the ".." syntax cannot be used to 1289 change the site component of a URL. 1291 ../../../g = http://a/../g 1292 ../../../../g = http://a/../../g 1294 In practice, some implementations strip leading relative symbolic 1295 elements (".", "..") after applying a relative URL calculation, based 1296 on the theory that compensating for obvious author errors is better 1297 than allowing the request to fail. Thus, the above two references 1298 will be interpreted as "http://a/g" by some implementations. 1300 Similarly, parsers must avoid treating "." and ".." as special when 1301 they are not complete components of a relative path. 1303 /./g = http://a/./g 1304 /../g = http://a/../g 1305 g. = http://a/b/c/g. 1306 .g = http://a/b/c/.g 1307 g.. = http://a/b/c/g.. 1308 ..g = http://a/b/c/..g 1310 Less likely are cases where the relative URL uses unnecessary or 1311 nonsensical forms of the "." and ".." complete path segments. 1313 ./../g = http://a/b/g 1314 ./g/. = http://a/b/c/g/ 1315 g/./h = http://a/b/c/g/h 1316 g/../h = http://a/b/c/h 1317 g;x=1/./y = http://a/b/c/g;x=1/y 1318 g;x=1/../y = http://a/b/c/y 1320 All client applications remove the query component from the base URL 1321 before resolving relative URLs. However, some applications fail to 1322 separate the reference's query and/or fragment components from a 1323 relative path before merging it with the base path. This error is 1324 rarely noticed, since typical usage of a fragment never includes the 1325 hierarchy ("/") character, and the query component is not normally 1326 used within relative references. 
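   One way to avoid this error is to split off the fragment and query
   before any path merging takes place.  The Python sketch below follows
   the steps of Section 5.2 in that order.  It is an illustration only,
   not part of this specification: the function names are assumptions
   made for the example, step 6 is implemented with a segment stack
   rather than string replacements (which Section 5.2 permits), leading
   ".." segments that cannot be removed are retained (one of the options
   in step 6g), and a reference to the current document is simply
   returned as given.

```python
import re

# The expression from Appendix B; groups 2, 4, 5, 7 and 9 are the components.
_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(reference):
    """Step 1: return (scheme, site, path, query, fragment); None is undefined."""
    m = _REFERENCE.match(reference)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge_paths(base_path, reference_path):
    """Step 6, using a segment stack instead of string replacements."""
    # 6a, 6b: drop the base path's last segment, then append the reference path.
    buffer = base_path[: base_path.rfind("/") + 1] + reference_path
    rooted = buffer.startswith("/")
    segments = buffer.split("/")[1:] if rooted else buffer.split("/")
    stack = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # 6c, 6d: "." is dropped; a trailing "." leaves a trailing slash.
            if is_last:
                stack.append("")
        elif segment == ".." and stack and stack[-1] != "..":
            # 6e, 6f: ".." cancels the segment that precedes it.
            stack.pop()
            if is_last:
                stack.append("")
        else:
            # Ordinary segments, plus any ".." with nothing to cancel (6g).
            stack.append(segment)
    return ("/" if rooted else "") + "/".join(stack)


def resolve(base, reference):
    """Resolve a URL reference against an absolute base URL."""
    base_scheme, base_site, base_path, _, _ = split_reference(base)
    scheme, site, path, query, fragment = split_reference(reference)
    if path == "" and scheme is None and site is None and query is None:
        return reference                  # step 2: the current document
    if scheme is not None:
        return reference                  # step 3: already absolute
    scheme = base_scheme
    if site is None:                      # step 4
        site = base_site
        if not path.startswith("/"):      # step 5
            path = merge_paths(base_path, path)   # step 6
    result = scheme + ":"                 # step 7
    if site is not None:
        result += "//" + site
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(resolve("http://a/b/c/d;p?q", "../g"))
```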
      g?y/./x       =  http://a/b/c/g?y/./x
      g?y/../x      =  http://a/b/c/g?y/../x
      g#s/./x       =  http://a/b/c/g#s/./x
      g#s/../x      =  http://a/b/c/g#s/../x

   Some parsers allow the scheme name to be present in a relative URL
   if it is the same as the base URL scheme.  This is considered to be
   a loophole in prior specifications of partial URLs [RFC1630].  Its
   use should be avoided.

      http:g        =  http:g
      http:         =  http:

D. Embedding the Base URL in HTML documents

   It is useful to consider an example of how the base URL of a
   document can be embedded within the document's content.  In this
   appendix, we describe how documents written in the Hypertext Markup
   Language (HTML) [RFC1866] can include an embedded base URL.  This
   appendix does not form a part of the URL specification and should not
   be considered as anything more than a descriptive example.

   HTML defines a special element "BASE" which, when present in the
   "HEAD" portion of a document, signals that the parser should use
   the BASE element's "HREF" attribute as the base URL for resolving
   any relative URLs.  The "HREF" attribute must be an absolute URL.
   Note that, in HTML, element and attribute names are
   case-insensitive.  For example:

      An example HTML document
      ... a hypertext anchor ...

   A parser reading the example document should interpret the given
   relative URL "../x" as representing the absolute URL

   regardless of the context in which the example document was
   obtained.

E. Recommendations for Delimiting URLs in Context

   URLs are often transmitted through formats which do not provide a
   clear context for their interpretation.  For example, there are
   many occasions when URLs are included in plain text; examples
   include text sent in electronic mail, USENET news messages, and,
   most importantly, printed on paper.
   In such cases, it is important
   to be able to delimit the URL from the rest of the text, and in
   particular from punctuation marks that might be mistaken for part
   of the URL.

   In practice, URLs are delimited in a variety of ways, but usually
   within double-quotes "http://test.com/", angle brackets
   <http://test.com/>, or just using whitespace

      http://test.com/

   These wrappers do not form part of the URL.

   In the case where a fragment identifier is associated with a URL
   reference, the fragment would be placed within the brackets as well
   (separated from the URL with a "#" character).

   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
   may need to be added to break long URLs across lines. The
   whitespace should be ignored when extracting the URL.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of a line when breaking it, the interpreter of a
   URL containing a line break immediately after a hyphen should ignore
   all unescaped whitespace around the line break, and should be aware
   that the hyphen may or may not actually be part of the URL.

   Using <> angle brackets around each URL is especially recommended
   as a delimiting style for URLs that contain whitespace.

   The prefix "URL:" (with or without a trailing space) was
   recommended as a way to help distinguish a URL from other
   bracketed designators, although this is not common in practice.

   For robustness, software that accepts user-typed URLs should
   attempt to recognize and strip both delimiters and embedded
   whitespace.

   For example, the text:

      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
      but you can probably pick it up from <ftp://ds.internic.net/rfc/>.
      Note the warning in
      <http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING>.

   contains the URL references

      http://www.w3.org/Addressing/
      ftp://ds.internic.net/rfc/
      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING

F. Abbreviated URLs

   The URL syntax was designed for unambiguous reference to network
   resources and extensibility via the URL scheme. However, as URL
   identification and usage have become commonplace, traditional media
   (television, radio, newspapers, billboards, etc.) have increasingly
   used abbreviated URL references: references consisting of only the
   site and path portions of the identified resource, such as

      www.w3.org/Addressing/

   or simply the DNS hostname on its own. Such references are primarily
   intended for human rather than machine interpretation, with the
   assumption that context-based heuristics are sufficient to complete
   the URL (e.g., most hostnames beginning with "www" are likely to have
   a URL prefix of "http://"). Although there is no standard set of
   heuristics for disambiguating abbreviated URL references, many
   client implementations allow them to be entered by the user and
   heuristically resolved. It should be noted that such heuristics may
   change over time, particularly when new URL schemes are introduced.

   Since an abbreviated URL has the same syntax as a relative URL path,
   abbreviated URL references cannot be used in contexts where relative
   URLs are expected. This limits the use of abbreviated URLs to places
   where there is no defined base URL, such as dialog boxes and off-line
   advertisements.

G. Summary of Non-editorial Changes

G.1. Additions

   Section 3 (URL References) was added to stem the confusion
   regarding "what is a URL" and how to describe fragment identifiers
   given that they are not part of the URL, but are part of the URL
   syntax and parsing concerns.
   In addition, it provides a reference
   definition for use by other IETF specifications (HTML, HTTP, etc.)
   which have previously attempted to redefine the URL syntax in order
   to account for the presence of fragment identifiers in URL
   references.

   Section 2.4 was rewritten to clarify a number of misinterpretations
   and to leave room for fully internationalized URLs.

   Appendix F on abbreviated URLs was added to describe the shortened
   references often seen in television and magazine advertisements and
   to explain why they are not used in other contexts.

G.2. Modifications from both RFC 1738 and RFC 1808

   Confusion regarding the terms "character encoding", the URL
   "character set", and the escaping of characters with %
   equivalents has (hopefully) been reduced. Many of the BNF rule
   names regarding the character sets have been changed to more
   accurately describe their purpose and to encompass all "characters"
   rather than just US-ASCII octets. Unless otherwise noted here,
   these modifications do not affect the URL syntax.

   Both RFC 1738 and RFC 1808 refer to the "reserved" set of
   characters as if URL-interpreting software were limited to a single
   set of characters with a reserved purpose (i.e., as meaning
   something other than the data to which the characters correspond),
   and that this set was fixed by the URL scheme. However, this has
   not been true in practice; any character which is interpreted
   differently when it is escaped is, in effect, reserved.
   Furthermore, the interpreting engine on an HTTP server is often
   dependent on the resource, not just the URL scheme. The
   description of reserved characters has been changed accordingly.

   The plus "+" character was added to those in the "reserved" set,
   since it is treated as reserved within some URL components.
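
   One familiar case of a character that is "interpreted differently
   when it is escaped" can be sketched in Python. The sketch assumes
   the HTML form-encoding convention for query data, which is not
   defined by this document, and uses Python's standard urllib.parse
   module purely as an illustration; the sample strings are made up.

```python
from urllib.parse import unquote, unquote_plus

# Form-style decoding (an HTML convention, assumed here): a literal "+"
# stands for a space, while the escaped form "%2B" stands for a plus sign.
form_literal = unquote_plus("a+b")     # "a b"
form_escaped = unquote_plus("a%2Bb")   # "a+b"

# Plain unescaping gives "+" no special meaning, so both spellings agree.
plain_literal = unquote("a+b")         # "a+b"
plain_escaped = unquote("a%2Bb")       # "a+b"

print(form_literal != form_escaped)    # True: "+" acts as reserved here
print(plain_literal == plain_escaped)  # True: "+" is ordinary data here
```

   Under the first interpretation, escaping the "+" changes the meaning
   of the data, which is the sense in which the character is reserved.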

   The tilde "~" character was added to those in the "unreserved" set,
   since it is extensively used on the Internet in spite of the
   difficulty of transcribing it with some keyboards.

   The "user:password" form in the previous BNF was changed to
   a "userinfo" token, and the possibility that it might be
   "user:password" was made scheme-specific. In particular, the use
   of passwords in the clear is not even suggested by the syntax.

   The question-mark "?" character was removed from the set of allowed
   characters for the userinfo in the site component, since
   testing showed that many applications treat it as reserved for
   separating the query component from the rest of the URL.

   RFC 1738 specified that the path was separated from the site
   portion of a URL by a slash. RFC 1808 followed suit, but with a
   fudge of carrying around the separator as a "prefix" in order to
   describe the parsing algorithm. RFC 1630 never had this problem,
   since it considered the slash to be part of the path. In writing
   this specification, it was found to be impossible to accurately
   describe and retain the difference between the two URLs
      and
   without either considering the slash to be part of the path (as
   corresponds to actual practice) or creating a separate component just
   to hold that slash. We chose the former.

G.3. Modifications from RFC 1738

   The definition of specific URL schemes and their scheme-specific
   syntax and semantics has been moved to separate documents.

   The URL host was defined as a fully-qualified domain name. However,
   many URLs are used without fully-qualified domain names (in contexts
   for which the full qualification is not necessary), without any host
   (as in some file URLs), or with a host of "localhost".
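
   The three host forms just mentioned can be seen with a small
   sketch. It uses Python's standard urllib.parse module, an
   independent implementation chosen only for illustration, and the
   three sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical sample URLs, one per case named in the text above.
examples = [
    "http://intranet/index.html",   # host that is not fully qualified
    "file:///etc/motd",             # no host at all
    "http://localhost/",            # host of "localhost"
]

# hostname is None when the URL carries no host.
hosts = [urlsplit(u).hostname for u in examples]
print(hosts)  # ['intranet', None, 'localhost']
```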

   The URL port is now *digit instead of 1*digit, since systems are
   expected to handle the case where the ":" separator between host and
   port is supplied without a port.

   The recommendations for delimiting URLs in context (Appendix E) have
   been adjusted to reflect current practice.

G.4. Modifications from RFC 1808

   RFC 1808 (Section 4) defined an empty URL reference (a reference
   containing nothing aside from the fragment identifier) as being a
   reference to the base URL. Unfortunately, that definition could be
   interpreted, upon selection of such a reference, as a new retrieval
   action on that resource. Since the normal intent of such references
   is for the user agent to change its view of the current document to
   the beginning of the specified fragment within that document, not to
   make an additional request of the resource, a description of how to
   correctly interpret an empty reference has been added in Section 3.

   The description of the mythical Base header field has been replaced
   with a reference to the Content-Base and Content-Location header
   fields defined by MHTML [RFC2110].

   RFC 1808 described various schemes as either having or not having the
   properties of the generic-URL syntax. However, the only requirement
   is that the particular document containing the relative references
   have a base URL which abides by the generic-URL syntax, regardless of
   the URL scheme, so the associated description has been updated to
   reflect that.

   The BNF term has been replaced with , since the
   latter more accurately describes its use and purpose. Likewise, the
   site is no longer restricted to the IP server syntax.

   Extensive testing of current client applications demonstrated that
   the majority of deployed systems do not use the ";" character to
   indicate trailing parameter information, and that the presence of a
   semicolon in a path segment does not affect the relative parsing of
   that segment. Therefore, parameters have been removed as a separate
   component and may now appear in any path segment. Their influence
   has been removed from the algorithm for resolving a relative URL
   reference. The resolution examples in Appendix C have been modified
   to reflect this change.
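
   The effect of this change can be illustrated with a minimal sketch
   in Python. It is not the algorithm text of this specification: the
   function name merge_paths and the base path "/b/c/d" are assumptions
   made for the example, and only relative-path references are handled.
   The ";" character receives no special treatment, and excess ".."
   segments are kept rather than dropped, as in the Appendix C
   examples.

```python
def merge_paths(base_path: str, rel_path: str) -> str:
    """Merge a relative path into a base path, segment by segment.

    Hypothetical helper: a segment such as "g;x=1" is treated like any
    other segment, so ";" has no influence on "." and ".." handling.
    """
    # Drop the last segment of the base path, then append the reference.
    segments = base_path.split("/")[:-1] + rel_path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")          # trailing "." leaves a trailing "/"
        elif seg == ".." and len(out) > 1 and out[-1] != "..":
            out.pop()                   # ".." cancels the preceding segment
            if last:
                out.append("")
        else:
            out.append(seg)             # includes excess ".." segments
    return "/".join(out)


print(merge_paths("/b/c/d", "g;x=1/./y"))    # /b/c/g;x=1/y
print(merge_paths("/b/c/d", "g;x=1/../y"))   # /b/c/y
print(merge_paths("/b/c/d", "../../../g"))   # /../g
```

   In the second call the whole "g;x=1" segment is removed by the "..",
   which matches the "g;x=1/../y" entry in the examples.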