idnits 2.17.1 draft-fielding-uri-syntax-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 1713 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. 
RFC 2119 keyword, line 613: '... practice is NOT RECOMMENDED, because ...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 4, 1998) is 9457 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '
<CODE BEGINS>' and '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Downref: Normative reference to an Informational RFC: RFC 1630

  ** Obsolete normative reference: RFC 1738 (Obsoleted by RFC 4248, RFC 4266)

  ** Obsolete normative reference: RFC 1866 (Obsoleted by RFC 2854)

  ** Obsolete normative reference: RFC  822 (Obsoleted by RFC 2822)

  ** Obsolete normative reference: RFC 1808 (Obsoleted by RFC 3986)

  ** Downref: Normative reference to an Informational RFC: RFC 1736

  ** Obsolete normative reference: RFC 2141 (Obsoleted by RFC 8141)

  ** Obsolete normative reference: RFC 2110 (Obsoleted by RFC 2557)

  ** Downref: Normative reference to an Informational RFC: RFC 1737

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  ** Obsolete normative reference: RFC 2279 (ref. 'UTF-8') (Obsoleted by RFC
     3629)


     Summary: 20 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                            T. Berners-Lee, MIT/LCS
2	INTERNET-DRAFT                                 R. Fielding,  U.C. Irvine
3	draft-fielding-uri-syntax-03              L. Masinter, Xerox Corporation
4	Expires six months after publication date                   June 4, 1998

6	          Uniform Resource Identifiers (URI): Generic Syntax

8	Status of this Memo

10	   This document is an Internet-Draft.  Internet-Drafts are working
11	   documents of the Internet Engineering Task Force (IETF), its areas,
12	   and its working groups.  Note that other groups may also distribute
13	   working documents as Internet-Drafts.

15	   Internet-Drafts are draft documents valid for a maximum of six
16	   months and may be updated, replaced, or obsoleted by other
17	   documents at any time.  It is inappropriate to use Internet-Drafts
18	   as reference material or to cite them other than as ``work in
19	   progress.''

21	   To view the entire list of current Internet-Drafts, please check the
22	   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
23	   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
24	   Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
25	   Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

27	   Instructions to RFC Editor: This document will obsolete RFC 1738 and
28	   RFC 1808.  If the new version of the MHTML proposed standard is
29	   ready for publication at the same time as this document, please
30	   change all references to RFC 2110 to refer to its new version.

32	Copyright Notice

34	   Copyright (C) The Internet Society (1998).  All Rights Reserved.

36	Abstract

38	   A Uniform Resource Identifier (URI) is a compact string of characters
39	   for identifying an abstract or physical resource.  This document
40	   defines the generic syntax of URI, including both absolute and
41	   relative forms, and guidelines for their use; it revises and replaces
42	   the generic definitions in RFC 1738 and RFC 1808.

44	   This document defines a grammar that is a superset of all valid URI,
45	   such that an implementation can parse the common components of a URI
46	   reference without knowing the scheme-specific requirements of every
47	   possible identifier type.  This document does not define a generative
48	   grammar for URI; that task will be performed by the individual
49	   specifications of each URI scheme.

51	1. Introduction

53	   Uniform Resource Identifiers (URI) provide a simple and extensible
54	   means for identifying a resource.  This specification of URI syntax
55	   and semantics is derived from concepts introduced by the World Wide
56	   Web global information initiative, whose use of such objects dates
57	   from 1990 and is described in "Universal Resource Identifiers in WWW"
58	   [RFC1630].  The specification of URI is designed to meet the
59	   recommendations laid out in "Functional Recommendations for Internet
60	   Resource Locators" [RFC1736] and "Functional Requirements for Uniform
61	   Resource Names" [RFC1737].

63	   This document updates and merges "Uniform Resource Locators"
64	   [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in
65	   order to define a single, generic syntax for all URI.  It excludes
66	   those portions of RFC 1738 that defined the specific syntax of
67	   individual URL schemes; those portions will be updated as separate
68	   documents, as will the process for registration of new URI schemes.
69	   This document does not discuss the issues and recommendations for
70	   dealing with characters outside of the US-ASCII character set
71	   [ASCII]; those recommendations are discussed in a separate document.

73	   All significant changes from the prior RFCs are noted in Appendix G.

75	1.1. Overview of URI

77	   URI are characterized by the following definitions:

79	      Uniform
80	         Uniformity provides several benefits: it allows different types
81	         of resource identifiers to be used in the same context, even
82	         when the mechanisms used to access those resources may differ;
83	         it allows uniform semantic interpretation of common syntactic
84	         conventions across different types of resource identifiers; it
85	         allows introduction of new types of resource identifiers
86	         without interfering with the way that existing identifiers are
87	         used; and, it allows the identifiers to be reused in many
88	         different contexts, thus permitting new applications or
89	         protocols to leverage a pre-existing, large, and widely-used
90	         set of resource identifiers.

92	      Resource
93	         A resource can be anything that has identity.  Familiar
94	         examples include an electronic document, an image, a service
95	         (e.g., "today's weather report for Los Angeles"), and a
96	         collection of other resources.  Not all resources are network
97	         "retrievable"; e.g., human beings, corporations, and bound
98	         books in a library can also be considered resources.

100	         The resource is the conceptual mapping to an entity or set of
101	         entities, not necessarily the entity which corresponds to that
102	         mapping at any particular instance in time.  Thus, a resource
103	         can remain constant even when its content---the entities to
104	         which it currently corresponds---changes over time, provided
105	         that the conceptual mapping is not changed in the process.

107	      Identifier
108	         An identifier is an object that can act as a reference to
109	         something that has identity.  In the case of URI, the object
110	         is a sequence of characters with a restricted syntax.

112	   Having identified a resource, a system may perform a variety of
113	   operations on the resource, as might be characterized by such words
114	   as `access', `update', `replace', or `find attributes'.

116	1.2. URI, URL, and URN

118	   A URI can be further classified as a locator, a name, or both.  The
119	   term "Uniform Resource Locator" (URL) refers to the subset of URI
120	   that identify resources via a representation of their primary access
121	   mechanism (e.g., their network "location"), rather than identifying
122	   the resource by name or by some other attribute(s) of that resource.
123	   The term "Uniform Resource Name" (URN) refers to the subset of URI
124	   that are required to remain globally unique and persistent even when
125	   the resource ceases to exist or becomes unavailable.

127	   The URI scheme (Section 3.1) defines the namespace of the URI, and
128	   thus may further restrict the syntax and semantics of identifiers
129	   using that scheme.  This specification defines those elements of the
130	   URI syntax that are either required of all URI schemes or are common
131	   to many URI schemes.  It thus defines the syntax and semantics that
132	   are needed to implement a scheme-independent parsing mechanism for
133	   URI references, such that the scheme-dependent handling of a URI can
134	   be postponed until the scheme-dependent semantics are needed.  We use
135	   the term URL below when describing syntax or semantics that only
136	   apply to locators.

138	   Although many URL schemes are named after protocols, this does not
139	   imply that the only way to access the URL's resource is via the named
140	   protocol.  Gateways, proxies, caches, and name resolution services
141	   might be used to access some resources, independent of the protocol
142	   of their origin, and the resolution of some URL may require the use
143	   of more than one protocol (e.g., both DNS and HTTP are typically used
144	   to access an "http" URL's resource when it can't be found in a local
145	   cache).

147	   A URN differs from a URL in that its primary purpose is persistent
148	   labeling of a resource with an identifier.  That identifier is drawn
149	   from one of a set of defined namespaces, each of which has its own
150	   set name structure and assignment procedures.  The "urn" scheme has
151	   been reserved to establish the requirements for a standardized URN
152	   namespace, as defined in "URN Syntax" [RFC2141] and its related
153	   specifications.

155	   Most of the examples in this specification demonstrate URL, since
156	   they allow the most varied use of the syntax and often have a
157	   hierarchical namespace.  A parser of the URI syntax is capable of
158	   parsing both URL and URN references as a generic URI; once the scheme
159	   is determined, the scheme-specific parsing can be performed on the
160	   generic URI components.  In other words, the URI syntax is a superset
161	   of the syntax of all URI schemes.

163	1.3. Example URI

165	   The following examples illustrate URI that are in common use.

167	   ftp://ftp.is.co.za/rfc/rfc1808.txt
168	      -- ftp scheme for File Transfer Protocol services

170	   gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
171	      -- gopher scheme for Gopher and Gopher+ Protocol services

173	   http://www.math.uio.no/faq/compression-faq/part1.html
174	      -- http scheme for Hypertext Transfer Protocol services

176	   mailto:mduerst@ifi.unizh.ch
177	      -- mailto scheme for electronic mail addresses

179	   news:comp.infosystems.www.servers.unix
180	      -- news scheme for USENET news groups and articles

182	   telnet://melvyl.ucop.edu/
183	      -- telnet scheme for interactive services via the TELNET Protocol

185	1.4. Hierarchical URI and Relative Forms

187	   An absolute identifier refers to a resource independent of the
188	   context in which the identifier is used.  In contrast, a relative
189	   identifier refers to a resource by describing the difference within a
190	   hierarchical namespace between the current context and an absolute
191	   identifier of the resource.

193	   Some URI schemes support a hierarchical naming system, where the
194	   hierarchy of the name is denoted by a "/" delimiter separating the
195	   components in the scheme. This document defines a scheme-independent
196	   `relative' form of URI reference that can be used in conjunction with
197	   a `base' URI (of a hierarchical scheme) to produce another URI. The
198	   syntax of hierarchical URI is described in Section 3; the relative
199	   URI calculation is described in Section 5.

201	1.5. URI Transcribability

203	   The URI syntax was designed with global transcribability as one of
204	   its main concerns. A URI is a sequence of characters from a very
205	   limited set, i.e., the letters of the basic Latin alphabet, digits,
206	   and a few special characters.  A URI may be represented in a
207	   variety of ways: e.g., ink on paper, pixels on a screen, or a
208	   sequence of octets in a coded character set.  The interpretation of
209	   a URI depends only on the characters used and not how those
210	   characters are represented in a network protocol.

212	   The goal of transcribability can be described by a simple scenario.
213	   Imagine two colleagues, Sam and Kim, sitting in a pub at an
214	   international conference and exchanging research ideas.  Sam asks
215	   Kim for a location to get more information, so Kim writes the URI
216	   for the research site on a napkin.  Upon returning home, Sam takes
217	   out the napkin and types the URI into a computer, which then
218	   retrieves the information to which Kim referred.

220	   There are several design concerns revealed by the scenario:

222	      o  A URI is a sequence of characters, which is not always
223	         represented as a sequence of octets.

225	      o  A URI may be transcribed from a non-network source, and thus
226	         should consist of characters that are most likely to be able
227	         to be typed into a computer, within the constraints imposed by
228	         keyboards (and related input devices) across languages and
229	         locales.

231	      o  A URI often needs to be remembered by people, and it is easier
232	         for people to remember a URI when it consists of meaningful
233	         components.

235	   These design concerns are not always in alignment.  For example, it
236	   is often the case that the most meaningful name for a URI component
237	   would require characters that cannot be typed into some systems.
238	   The ability to transcribe the resource identifier from one medium to
239	   another was considered more important than having its URI consist
240	   of the most meaningful of components.  In local and regional
241	   contexts and with improving technology, users might benefit from
242	   being able to use a wider range of characters; such use is not
243	   defined in this document.

245	1.6. Syntax Notation and Common Elements

247	   This document uses two conventions to describe and define the syntax
248	   for URI.  The first, called the layout form, is a general description
249	   of the order of components and component separators, as in

251	      <first>/<second>;<third>?<fourth>

253	   The component names are enclosed in angle-brackets and any characters
254	   outside angle-brackets are literal separators.  Whitespace should be
255	   ignored.  These descriptions are used informally and do not define
256	   the syntax requirements.

258	   The second convention is a BNF-like grammar, used to define the
259	   formal URI syntax.  The grammar is that of [RFC822], except that
260	   "|" is used to designate alternatives.  Briefly, rules are separated
261	   from definitions by an equal "=", indentation is used to continue a
262	   rule definition over more than one line, literals are quoted with "",
263	   parentheses "(" and ")" are used to group elements, optional elements
264	   are enclosed in "[" and "]" brackets, and elements may be preceded
265	   with <n>* to designate n or more repetitions of the following
266	   element; n defaults to 0.

268	   Unlike many specifications that use a BNF-like grammar to define the
269	   bytes (octets) allowed by a protocol, the URI grammar is defined in
270	   terms of characters.  Each literal in the grammar corresponds to the
271	   character it represents, rather than to the octet encoding of that
272	   character in any particular coded character set.  How a URI is
273	   represented in terms of bits and bytes on the wire is dependent upon
274	   the character encoding of the protocol used to transport it, or the
275	   charset of the document which contains it.

277	   The following definitions are common to many elements:

279	      alpha    = lowalpha | upalpha

281	      lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
282	                 "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
283	                 "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

285	      upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
286	                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
287	                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

289	      digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
290	                 "8" | "9"

292	      alphanum = alpha | digit

294	   The complete URI syntax is collected in Appendix A.

296	2. URI Characters and Escape Sequences

298	   URI consist of a restricted set of characters, primarily chosen to
299	   aid transcribability and usability both in computer systems and in
300	   non-computer communications. Characters used conventionally as
301	   delimiters around URI were excluded.  The restricted set of
302	   characters consists of digits, letters, and a few graphic symbols
303	   chosen from those common to most of the character encodings
304	   and input facilities available to Internet users.

306	      uric          = reserved | unreserved | escaped

308	   Within a URI, characters are either used as delimiters, or to
309	   represent strings of data (octets) within the delimited portions.
310	   Octets are either represented directly by a character (using the
311	   US-ASCII character for that octet [ASCII]) or by an escape encoding.
312	   This representation is elaborated below.

314	2.1. URI and non-ASCII characters

316	   The relationship between URI and characters has been a source of
317	   confusion for characters that are not part of US-ASCII. To describe
318	   the relationship, it is useful to distinguish between a "character"
319	   (as a distinguishable semantic entity) and an "octet" (an 8-bit
320	   byte). There are two mappings, one from URI characters to octets,
321	   and a second from octets to original characters:

323	   URI character sequence->octet sequence->original character sequence

325	   A URI is represented as a sequence of characters, not as a sequence
326	   of octets. That is because URI might be "transported" by means that
327	   are not through a computer network, e.g., printed on paper, read
328	   over the radio, etc.

330	   A URI scheme may define a mapping from URI characters to octets;
331	   whether this is done depends on the scheme. Commonly, within a
332	   delimited component of a URI, a sequence of characters may be
333	   used to represent a sequence of octets. For example, the character
334	   "a" represents the octet 97 (decimal), while the character sequence
335	   "%", "0", "a" represents the octet 10 (decimal).
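
   As an illustration only (it is not part of this specification), the
   mapping from URI characters to octets described above can be sketched
   in Python.  The sketch assumes the "%" escape convention of Section
   2.4.1 and relies on the standard library function
   urllib.parse.unquote_to_bytes; the function name uri_chars_to_octets
   is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def uri_chars_to_octets(component: str) -> bytes:
    """Map the characters of one URI component to the octets they represent."""
    return unquote_to_bytes(component)


print(list(uri_chars_to_octets("a")))    # [97]
print(list(uri_chars_to_octets("%0a")))  # [10]
```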

337	   There is a second translation for some resources: the sequence of
338	   octets defined by a component of the URI is subsequently used to
339	   represent a sequence of characters. A 'charset' defines this
340	   mapping. There are many charsets in use in Internet protocols. For
341	   example, UTF-8 [UTF-8] defines a mapping from sequences of octets to
342	   sequences of characters in the repertoire of ISO 10646.
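
   The two mappings can be chained, as in the following Python sketch.
   It is a minimal illustration under the assumption that the caller
   already knows which charset applies, since the generic syntax does not
   convey it.  The function name component_to_text and the sample
   components are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def component_to_text(component: str, charset: str) -> str:
    """Apply both mappings to one URI component."""
    octets = unquote_to_bytes(component)  # URI characters -> octets
    return octets.decode(charset)         # octets -> original characters


# The same original text (ending in U+00E9) under two different charsets.
print(component_to_text("caf%C3%A9", "utf-8") == "caf\u00e9")    # True
print(component_to_text("caf%E9", "iso-8859-1") == "caf\u00e9")  # True
```

   The example shows why the charset matters: the two URI components
   differ, yet both represent the same original character sequence.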

344	   In the simplest case, the original character sequence contains
345	   only characters that are defined in US-ASCII, and the two levels of
346	   mapping are simple and easily invertible: each 'original character'
347	   is represented as the octet for the US-ASCII code for it, which is,
348	   in turn, represented as either the US-ASCII character, or else the
349	   "%" escape sequence for that octet.

351	   For original character sequences that contain non-ASCII characters,
352	   however, the situation is more difficult. Internet protocols that
353	   transmit octet sequences intended to represent character sequences
354	   are expected to provide some way of identifying the charset used,
355	   if there might be more than one [RFC2277].  However, there is
356	   currently no provision within the generic URI syntax to accomplish
357	   this identification. An individual URI scheme may require a single
358	   charset, define a default charset, or provide a way to indicate the
359	   charset used.

361	   It is expected that a systematic treatment of character encoding
362	   within URI will be developed as a future modification of this
363	   specification.

365	2.2. Reserved Characters

367	   Many URI include components consisting of, or delimited by, certain
368	   special characters.  These characters are called "reserved", since
369	   their usage within the URI component is limited to their reserved
370	   purpose.  If the data for a URI component would conflict with the
371	   reserved purpose, then the conflicting data must be escaped before
372	   forming the URI.

374	      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
375	                    "$" | ","

377	   The "reserved" syntax class above refers to those characters that
378	   are allowed within a URI, but which may not be allowed within a
379	   particular component of the generic URI syntax; they are used as
380	   delimiters of the components described in Section 3.

382	   Characters in the "reserved" set are not reserved in all contexts.
383	   The set of characters actually reserved within any given URI
384	   component is defined by that component. In general, a character is
385	   reserved if the semantics of the URI changes if the character is
386	   replaced with its escaped US-ASCII encoding.
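
   The rule that conflicting data must be escaped before the URI is
   formed can be sketched as follows.  This is a hedged illustration,
   not a normative algorithm: the helper escape_conflicts and its
   reserved_here parameter are hypothetical, the data is assumed to be
   US-ASCII, and only the characters named by the caller (plus "%",
   which always needs escaping) are replaced.

```python
def escape_conflicts(data: str, reserved_here: str) -> str:
    """Escape each character of data that is reserved in the target component.

    The percent character is always escaped, since it introduces escapes.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if (ch in reserved_here or ch == "%") else ch
        for ch in data
    )


# Hypothetical component in which "/" and "?" act as delimiters.
print(escape_conflicts("either/or?", reserved_here="/?"))  # either%2For%3F
```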

388	2.3. Unreserved Characters

390	   Data characters that are allowed in a URI but do not have a reserved
391	   purpose are called unreserved.  These include upper and lower case
392	   letters, decimal digits, and a limited set of punctuation marks and
393	   symbols.

395	      unreserved  = alphanum | mark

397	      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

399	   Unreserved characters can be escaped without changing the semantics
400	   of the URI, but this should not be done unless the URI is being used
401	   in a context that does not allow the unescaped character to appear.
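
   For illustration, the "reserved" and "unreserved" classes can be
   written down as Python sets and used to classify single characters.
   The sets are transcribed from the grammar in Sections 1.6, 2.2, and
   2.3; the function name classify and the label "other" are
   assumptions of this sketch.

```python
import string

# Character classes transcribed from the grammar.
ALPHANUM = set(string.ascii_letters + string.digits)
MARK = set("-_.!~*'()")
UNRESERVED = ALPHANUM | MARK
RESERVED = set(";/?:@&=+$,")


def classify(ch: str) -> str:
    """Name the syntax class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    return "other"  # not allowed unescaped, or is the "%" escape indicator


print([classify(c) for c in "a~/ "])
# ['unreserved', 'unreserved', 'reserved', 'other']
```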

403	2.4. Escape Sequences

405	   Data must be escaped if it does not have a representation using an
406	   unreserved character; this includes data that does not correspond
407	   to a printable character of the US-ASCII coded character set, or
408	   that corresponds to any US-ASCII character that is disallowed, as
409	   explained below.

411	2.4.1. Escaped Encoding

413	   An escaped octet is encoded as a character triplet, consisting
414	   of the percent character "%" followed by the two hexadecimal digits
415	   representing the octet code. For example, "%20" is the escaped
416	   encoding for the US-ASCII space character.

418	      escaped     = "%" hex hex
419	      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
420	                            "a" | "b" | "c" | "d" | "e" | "f"
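
   A minimal Python sketch of the "escaped" production follows.  The
   names ESCAPED and escape_octet are hypothetical.  The choice of
   upper-case hexadecimal digits on output is an assumption of the
   sketch, since the grammar accepts either case.

```python
import re

# escaped = "%" hex hex
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")


def escape_octet(octet: int) -> str:
    """Encode one octet as a "%" character followed by two hex digits."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet must be in the range 0..255")
    return "%{:02X}".format(octet)


print(escape_octet(0x20))                   # %20
print(bool(ESCAPED.fullmatch("%7e")))       # True
print(bool(ESCAPED.fullmatch("%G0")))       # False
```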

422	2.4.2. When to Escape and Unescape

424	   A URI is always in an "escaped" form, since escaping or unescaping
425	   a completed URI might change its semantics.  Normally, the only
426	   time escape encodings can safely be made is when the URI is being
427	   created from its component parts; each component may have its own
428	   set of characters that are reserved, so only the mechanism
429	   responsible for generating or interpreting that component can
430	   determine whether or not escaping a character will change its
431	   semantics. Likewise, a URI must be separated into its components
432	   before the escaped characters within those components can be safely
433	   decoded.
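
   The need to separate components before decoding can be seen in a
   short Python sketch.  The path below is hypothetical; it contains an
   escaped "/" that is data within one segment, not a delimiter.  The
   sketch uses urllib.parse.unquote from the standard library.

```python
from urllib.parse import unquote

path = "/docs/a%2Fb/readme"  # hypothetical path; "%2F" is data, not a delimiter

# Safe: split on the delimiter first, then decode each segment.
segments = [unquote(s) for s in path.split("/")[1:]]
print(segments)  # ['docs', 'a/b', 'readme']

# Unsafe: decoding first turns the escaped "/" into a delimiter.
wrong = unquote(path).split("/")[1:]
print(wrong)     # ['docs', 'a', 'b', 'readme']
```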

435	   In some cases, data that could be represented by an unreserved
436	   character may appear escaped; for example, some of the unreserved
437	   "mark" characters are automatically escaped by some systems.  If the
438	   given URI scheme defines a canonicalization algorithm, then
439	   unreserved characters may be unescaped according to that algorithm.
440	   For example, "%7e" is sometimes used instead of "~" in an http URL
441	   path, but the two are equivalent for an http URL.

443	   Because the percent "%" character always has the reserved purpose of
444	   being the escape indicator, it must be escaped as "%25" in order to
445	   be used as data within a URI.  Implementers should be careful not to
446	   escape or unescape the same string more than once, since unescaping
447	   an already unescaped string might lead to misinterpreting a percent
448	   data character as another escaped character, or vice versa in the
449	   case of escaping an already escaped string.
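
   The hazard of escaping or unescaping twice can be demonstrated with
   the Python standard library functions urllib.parse.quote and
   urllib.parse.unquote.  The sample data string is hypothetical, and
   passing safe="" to quote is simply a choice made for this sketch.

```python
from urllib.parse import quote, unquote

data = "%41"                     # three data characters: "%", "4", "1"

once = quote(data, safe="")      # the percent character becomes "%25"
print(once)                      # %2541
print(unquote(once))             # %41  (the original data, as intended)

# Unescaping a second time misreads the data "%41" as an escape for "A".
print(unquote(unquote(once)))    # A

# Escaping a second time escapes the escape indicator itself.
print(quote(once, safe=""))      # %252541
```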

451	2.4.3. Excluded US-ASCII Characters

453	   Although they are disallowed within the URI syntax, we include here
454	   a description of those US-ASCII characters that have been excluded
455	   and the reasons for their exclusion.

457	   The control characters in the US-ASCII coded character set are not
458	   used within a URI, both because they are non-printable and because
459	   they are likely to be misinterpreted by some control mechanisms.

461	   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>

463	   The space character is excluded because significant spaces may
464	   disappear and insignificant spaces may be introduced when URI are
465	   transcribed or typeset or subjected to the treatment of
466	   word-processing programs.  Whitespace is also used to delimit URI
467	   in many contexts.

469	   space       = <US-ASCII coded character 20 hexadecimal>

471	   The angle-bracket "<" and ">" and double-quote (") characters are
472	   excluded because they are often used as the delimiters around URI
473	   in text documents and protocol fields.  The character "#" is
474	   excluded because it is used to delimit a URI from a fragment
475	   identifier in URI references (Section 4). The percent character "%"
476	   is excluded because it is used for the encoding of escaped
477	   characters.

479	   delims      = "<" | ">" | "#" | "%" | <">

481	   Other characters are excluded because gateways and other transport
482	   agents are known to sometimes modify such characters, or they are
483	   used as delimiters.

485	   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

487	   Data corresponding to excluded characters must be escaped in order
488	   to be properly represented within a URI.
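
   As a hedged illustration, the excluded classes above can be collected
   into Python sets and used to report characters that would have to be
   escaped.  The helper find_excluded is hypothetical.  It treats a "%"
   followed by two hexadecimal digits as an escape rather than as an
   excluded character, which is an assumption of this sketch.

```python
import re

CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")
EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE

ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def find_excluded(text: str) -> list:
    """Return the excluded characters in text, skipping well-formed escapes."""
    without_escapes = ESCAPE.sub("", text)
    return [ch for ch in without_escapes if ch in EXCLUDED]


print(find_excluded("Los Angeles"))    # [' ']
print(find_excluded("Los%20Angeles"))  # []
print(find_excluded("a<b>"))           # ['<', '>']
```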

490	3. URI Syntactic Components

492	   The URI syntax is dependent upon the scheme.  In general, absolute
493	   URI are written as follows:

495	      <scheme>:<scheme-specific-part>

497	   An absolute URI contains the name of the scheme being used (<scheme>)
498	   followed by a colon (":") and then a string (the <scheme-specific-part>)
499	   whose interpretation depends on the scheme.

501	   The URI syntax does not require that the scheme-specific-part have
502	   any general structure or set of semantics which is common among all
503	   URI.  However, a subset of URI do share a common syntax for
504	   representing hierarchical relationships within the namespace.  This
505	   "generic URI" syntax consists of a sequence of four main components:

507	      <scheme>://<authority><path>?<query>

509	   each of which, except <scheme>, may be absent from a particular URI.
510	   For example, some URI schemes do not allow an <authority> component,
511	   and others do not use a <query> component.

513	      absoluteURI   = scheme ":" ( hier_part | opaque_part )

515	   URI that are hierarchical in nature use the slash "/" character for
516	   separating hierarchical components.  For some file systems, a "/"
517	   character (used to denote the hierarchical structure of a URI) is the
518	   delimiter used to construct a file name hierarchy, and thus the URI
519	   path will look similar to a file pathname.  This does NOT imply that
520	   the resource is a file or that the URI maps to an actual filesystem
521	   pathname.

523	      hier_part     = ( net_path | abs_path ) [ "?" query ]

525	      net_path      = "//" authority [ abs_path ]

527	      abs_path      = "/"  path_segments

529	   URI that do not make use of the slash "/" character for separating
530	   hierarchical components are considered opaque by the generic URI
531	   parser.

533	      opaque_part   = uric_no_slash *uric

535	      uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
536	                      "&" | "=" | "+" | "$" | ","

538	   We use the term <path> to refer to both the <abs_path> and
539	   <opaque_part> constructs, since they are mutually exclusive for any
540	   given URI and can be parsed as a single component.
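
   The first step of a scheme-independent parser can be sketched in
   Python as follows.  This is an illustration under simplifying
   assumptions: the function name split_absolute is hypothetical, the
   scheme is not validated, and fragment identifiers are not handled.
   The test on the leading "/" follows from the grammar, since both
   net_path and abs_path begin with a slash and opaque_part cannot.

```python
def split_absolute(uri: str):
    """Split an absolute URI into (scheme, kind, rest)."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("an absolute URI needs a scheme followed by ':'")
    kind = "hier_part" if rest.startswith("/") else "opaque_part"
    return scheme, kind, rest


print(split_absolute("mailto:mduerst@ifi.unizh.ch"))
# ('mailto', 'opaque_part', 'mduerst@ifi.unizh.ch')
print(split_absolute("ftp://ftp.is.co.za/rfc/rfc1808.txt"))
# ('ftp', 'hier_part', '//ftp.is.co.za/rfc/rfc1808.txt')
```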

542	3.1. Scheme Component

544	   Just as there are many different methods of access to resources,
545	   there are a variety of schemes for identifying such resources.  The
546	   URI syntax consists of a sequence of components separated by reserved
547	   characters, with the first component defining the semantics for the
548	   remainder of the URI string.

550	   Scheme names consist of a sequence of characters beginning with a
551	   lower case letter and followed by any combination of lower case
552	   letters, digits, plus ("+"), period ("."), or hyphen ("-").  For
553	   resiliency, programs interpreting URI should treat upper case
554	   letters as equivalent to lower case in scheme names (e.g., allow
555	   "HTTP" as well as "http").

557	      scheme        = alpha *( alpha | digit | "+" | "-" | "." )
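
   A minimal Python sketch of this rule follows.  It accepts upper case
   letters, as recommended above for resiliency, and folds the result
   to lower case.  The names SCHEME_RE and normalize_scheme are
   hypothetical.

```python
import re

# scheme = alpha *( alpha | digit | "+" | "-" | "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def normalize_scheme(scheme: str) -> str:
    """Validate a scheme name and return it in lower case."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"invalid scheme name: {scheme!r}")
    return scheme.lower()
```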

559	   Relative URI references are distinguished from absolute URI in that
560	   they do not begin with a scheme name.  Instead, the scheme is
561	   inherited from the base URI, as described in Section 5.2.

563	3.2. Authority Component

565	   Many URI schemes include a top hierarchical element for a naming
566	   authority, such that the namespace defined by the remainder of the
567	   URI is governed by that authority.  This authority component is
568	   typically defined by an Internet-based server or a scheme-specific
569	   registry of naming authorities.

571	      authority     = server | reg_name

573	   The authority component is preceded by a double slash "//" and is
574	   terminated by the next slash "/", question-mark "?", or by the end of
575	   the URI.  Within the authority component, the characters ";", ":",
576	   "@", "?", and "/" are reserved.

578	   An authority component is not required for a URI scheme to make use
579	   of relative references.  A base URI without an authority component
580	   implies that any relative reference will also be without an authority
581	   component.

583	3.2.1. Registry-based Naming Authority

585	   The structure of a registry-based naming authority is specific to the
586	   URI scheme, but constrained to the allowed characters for an
587	   authority component.

589	      reg_name      = 1*( unreserved | escaped | "$" | "," |
590	                          ";" | ":" | "@" | "&" | "=" | "+" )

592	3.2.2. Server-based Naming Authority

594	   URL schemes that involve the direct use of an IP-based protocol to a
595	   specified server on the Internet use a common syntax for the server
596	   component of the URI's scheme-specific data:

      <userinfo>@<host>:<port>

   where <userinfo> may consist of a user name and, optionally,
   scheme-specific information about how to gain authorization to access
   the server.  The parts "<userinfo>@" and ":<port>" may be omitted.

604	      server        = [ [ userinfo "@" ] hostport ]

606	   The user information, if present, is followed by a commercial
607	   at-sign "@".

609	      userinfo      = *( unreserved | escaped |
610	                         ";" | ":" | "&" | "=" | "+" | "$" | "," )

612	   Some URL schemes use the format "user:password" in the userinfo
613	   field. This practice is NOT RECOMMENDED, because the passing of
614	   authentication information in clear text (such as URI) has proven to
615	   be a security risk in almost every case where it has been used.

617	   The host is a domain name of a network host, or its IPv4 address as
618	   a set of four decimal digit groups separated by ".".  Literal IPv6
619	   addresses are not supported.

621	      hostport      = host [ ":" port ]
622	      host          = hostname | IPv4address
623	      hostname      = *( domainlabel "." ) toplabel [ "." ]
624	      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
625	      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
626	      IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
627	      port          = *digit

629	   Hostnames take the form described in Section 3 of [RFC1034] and
630	   Section 2.1 of [RFC1123]: a sequence of domain labels separated by
631	   ".", each domain label starting and ending with an alphanumeric
632	   character and possibly also containing "-" characters.  The rightmost
633	   domain label of a fully qualified domain name will never start with a
634	   digit, thus syntactically distinguishing domain names from IPv4
635	   addresses, and may be followed by a single "." if it is necessary to
636	   distinguish between the complete domain name and any local domain.
637	   To actually be "Uniform" as a resource locator, a URL hostname should
638	   be a fully qualified domain name.  In practice, however, the host
639	   component may be a local domain literal.

641	      Note: A suitable representation for including a literal IPv6
642	      address as the host part of a URL is desired, but has not yet
643	      been determined or implemented in practice.

645	   The port is the network port number for the server.  Most schemes
646	   designate protocols that have a default port number.  Another port
647	   number may optionally be supplied, in decimal, separated from the
648	   host by a colon.  If the port is omitted, the default port number is
649	   assumed.
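
   The server syntax can be taken apart with two splits, because
   userinfo cannot contain "@" and host cannot contain ":".  The
   following Python sketch assumes the authority is server-based.  The
   names Server and parse_server are hypothetical, and the host is not
   checked against the hostname or IPv4address rules.

```python
import re
from typing import NamedTuple, Optional


class Server(NamedTuple):
    userinfo: Optional[str]  # None when no "@" is present
    host: str
    port: Optional[str]      # None when no ":" follows the host


def parse_server(server: str) -> Server:
    """Split a server-based authority into userinfo, host, and port."""
    userinfo, at, hostport = server.partition("@")
    if not at:
        # No "@" at all: the whole string is the hostport.
        userinfo, hostport = None, server
    host, colon, port = hostport.partition(":")
    if not colon:
        return Server(userinfo, host, None)
    # port = *digit, so an empty port after ":" is syntactically allowed.
    if not re.fullmatch(r"[0-9]*", port):
        raise ValueError(f"port must be decimal digits: {port!r}")
    return Server(userinfo, host, port)
```

   A caller that receives None for the port would then apply the
   default port of the scheme.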

651	3.3. Path Component

653	   The path component contains data, specific to the authority (or the
654	   scheme if there is no authority component), identifying the resource
655	   within the scope of that scheme and authority.

657	      path          = [ abs_path | opaque_part ]

659	      path_segments = segment *( "/" segment )
660	      segment       = *pchar *( ";" param )
661	      param         = *pchar

663	      pchar         = unreserved | escaped |
664	                      ":" | "@" | "&" | "=" | "+" | "$" | ","

666	   The path may consist of a sequence of path segments separated by a
667	   single slash "/" character.  Within a path segment, the characters
668	   "/", ";", "=", and "?" are reserved.  Each path segment may include a
669	   sequence of parameters, indicated by the semicolon ";" character.
670	   The parameters are not significant to the parsing of relative
671	   references.
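
   As an illustration, an abs_path can be broken into segments and
   their parameters as sketched below in Python.  The function name
   split_path_segments is hypothetical, and no character validation or
   unescaping is attempted.

```python
from typing import List, Tuple


def split_path_segments(abs_path: str) -> List[Tuple[str, List[str]]]:
    """Return (segment name, parameters) pairs for an abs_path."""
    if not abs_path.startswith("/"):
        raise ValueError("abs_path must begin with '/'")
    result = []
    # path_segments = segment *( "/" segment )
    for segment in abs_path[1:].split("/"):
        # segment = *pchar *( ";" param )
        name, *params = segment.split(";")
        result.append((name, params))
    return result
```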

673	3.4. Query Component

675	   The query component is a string of information to be interpreted by
676	   the resource.

678	      query         = *uric

680	   Within a query component, the characters ";", "/", "?", ":", "@",
681	   "&", "=", "+", ",", and "$" are reserved.

683	4. URI References

685	   The term "URI-reference" is used here to denote the common usage of
686	   a resource identifier.  A URI reference may be absolute or relative,
687	   and may have additional information attached in the form of a
688	   fragment identifier.  However, "the URI" that results from such a
689	   reference includes only the absolute URI after the fragment
690	   identifier (if any) is removed and after any relative URI is
691	   resolved to its absolute form.  Although it is possible to limit
692	   the discussion of URI syntax and semantics to that of the absolute
693	   result, most usage of URI is within general URI references, and it
694	   is impossible to obtain the URI from such a reference without also
695	   parsing the fragment and resolving the relative form.

697	      URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

699	   The syntax for relative URI is a shortened form of that for absolute
700	   URI, where some prefix of the URI is missing and certain path
701	   components ("." and "..") have a special meaning when interpreting a
702	   relative path.  The relative URI syntax is defined in Section 5.

704	4.1. Fragment Identifier

706	   When a URI reference is used to perform a retrieval action on the
707	   identified resource, the optional fragment identifier, separated from
708	   the URI by a crosshatch ("#") character, consists of additional
709	   reference information to be interpreted by the user agent after the
710	   retrieval action has been successfully completed.  As such, it is not
711	   part of a URI, but is often used in conjunction with a URI.

713	      fragment      = *uric

715	   The semantics of a fragment identifier is a property of the data
716	   resulting from a retrieval action, regardless of the type of URI used
717	   in the reference.  Therefore, the format and interpretation of
718	   fragment identifiers is dependent on the media type [RFC2046] of the
719	   retrieval result.  The character restrictions described in Section 2
720	   for URI also apply to the fragment in a URI-reference.  Individual
721	   media types may define additional restrictions or structure within
722	   the fragment for specifying different types of "partial views" that
723	   can be identified within that media type.

725	   A fragment identifier is only meaningful when a URI reference is
726	   intended for retrieval and the result of that retrieval is a document
727	   for which the identified fragment is consistently defined.

729	4.2. Same-document References

731	   A URI reference that does not contain a URI is a reference to the
732	   current document.  In other words, an empty URI reference within a
733	   document is interpreted as a reference to the start of that document,
734	   and a reference containing only a fragment identifier is a reference
735	   to the identified fragment of that document.  Traversal of such a
736	   reference should not result in an additional retrieval action.
737	   However, if the URI reference occurs in a context that is always
738	   intended to result in a new request, as in the case of HTML's FORM
739	   element, then an empty URI reference represents the base URI of the
740	   current document and should be replaced by that URI when transformed
741	   into a request.
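
   Under the rule above, a same-document reference is either empty or
   consists only of a fragment identifier.  A small Python check, with
   the hypothetical name is_same_document_reference, might read:

```python
def is_same_document_reference(reference: str) -> bool:
    """True when the reference contains no URI, only an optional fragment."""
    return reference == "" or reference.startswith("#")
```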

743	4.3. Parsing a URI Reference

745	   A URI reference is typically parsed according to the four main
746	   components and fragment identifier in order to determine what
747	   components are present and whether the reference is relative or
748	   absolute.  The individual components are then parsed for their
749	   subparts and, if not opaque, to verify their validity.

751	   Although the BNF defines what is allowed in each component, it is
752	   ambiguous in terms of differentiating between an authority component
753	   and a path component that begins with two slash characters.  The
754	   greedy algorithm is used for disambiguation: the left-most matching
755	   rule soaks up as much of the URI reference string as it is capable of
756	   matching.  In other words, the authority component wins.

758	   Readers familiar with regular expressions should see Appendix B for a
759	   concrete parsing example and test oracle.
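
   The greedy split can be sketched with a single regular expression
   in Python.  The expression below is written for this illustration
   and is not claimed to be identical to the one in Appendix B; the
   names URI_REF_RE, UriReference, and parse_uri_reference are
   hypothetical.  A component is None when its separator is absent
   (undefined) and an empty string when the separator is present but
   nothing follows it.

```python
import re
from typing import NamedTuple, Optional

URI_REF_RE = re.compile(
    r"(?:([^:/?#]+):)?"   # scheme, up to the first ":"
    r"(?://([^/?#]*))?"   # authority, after "//"
    r"([^?#]*)"           # path (never undefined, may be empty)
    r"(?:\?([^#]*))?"     # query, after "?"
    r"(?:#(.*))?",        # fragment, after "#"
    re.DOTALL,
)


class UriReference(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def parse_uri_reference(reference: str) -> UriReference:
    """Split a URI reference into its components without validating them."""
    return UriReference(*URI_REF_RE.fullmatch(reference).groups())
```

   Because the authority rule is tried before the path rule, a
   reference such as "//g" yields an authority of "g" and an empty
   path, which is the disambiguation described above.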

761	5. Relative URI References

763	   It is often the case that a group or "tree" of documents has been
764	   constructed to serve a common purpose; the vast majority of URI in
765	   these documents point to resources within the tree rather than
766	   outside of it.  Similarly, documents located at a particular site
767	   are much more likely to refer to other resources at that site than
768	   to resources at remote sites.

770	   Relative addressing of URI allows document trees to be partially
771	   independent of their location and access scheme.  For instance, it is
772	   possible for a single set of hypertext documents to be simultaneously
773	   accessible and traversable via each of the "file", "http", and "ftp"
774	   schemes if the documents refer to each other using relative URI.
775	   Furthermore, such document trees can be moved, as a whole, without
776	   changing any of the relative references.  Experience within the WWW
777	   has demonstrated that the ability to perform relative referencing
778	   is necessary for the long-term usability of embedded URI.

   The syntax for relative URI takes advantage of the <hier_part> syntax
   of <absoluteURI> (Section 3) in order to express a reference that is
   relative to the namespace of another hierarchical URI.

784	      relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]

   A relative reference beginning with two slash characters is termed a
   network-path reference, as defined by <net_path> in Section 3.  Such
   references are rarely used.

   A relative reference beginning with a single slash character is
   termed an absolute-path reference, as defined by <abs_path> in
   Section 3.

794	   A relative reference that does not begin with a scheme name or a
795	   slash character is termed a relative-path reference.

797	      rel_path      = rel_segment [ abs_path ]

799	      rel_segment   = 1*( unreserved | escaped |
800	                          ";" | "@" | "&" | "=" | "+" | "$" | "," )

802	   Within a relative-path reference, the complete path segments "." and
803	   ".." have special meanings: "the current hierarchy level" and "the
804	   level above this hierarchy level", respectively.  Although this is
805	   very similar to their use within Unix-based filesystems to indicate
806	   directory levels, these path components are only considered special
807	   when resolving a relative-path reference to its absolute form
808	   (Section 5.2).

810	   Authors should be aware that a path segment which contains a colon
811	   character cannot be used as the first segment of a relative URI path
812	   (e.g., "this:that"), because it would be mistaken for a scheme name.
813	   It is therefore necessary to precede such segments with other
814	   segments (e.g., "./this:that") in order for them to be referenced as
815	   a relative path.
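
   A generator of relative references could apply this rule as in the
   following Python sketch.  The function name as_relative_path is
   hypothetical.

```python
def as_relative_path(path: str) -> str:
    """Prefix "./" when the first segment would be mistaken for a scheme."""
    first_segment = path.split("/", 1)[0]
    if ":" in first_segment:
        return "./" + path
    return path
```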

   It is not necessary for all URI within a given scheme to be
   restricted to the <hier_part> syntax, since the hierarchical
   properties of that syntax are only necessary when relative URI are
   used within a particular document.  Documents can only make use of
   relative URI when their base URI fits within the <hier_part> syntax.
   It is assumed that any document which contains a relative reference
   will also have a base URI that obeys the syntax.  In other words,
   relative URI cannot be used within a document that has an unsuitable
   base URI.

   Some URI schemes do not allow a hierarchical syntax matching the
   <hier_part> syntax, and thus cannot use relative references.

830	5.1. Establishing a Base URI

832	   The term "relative URI" implies that there exists some absolute "base
833	   URI" against which the relative reference is applied.  Indeed, the
834	   base URI is necessary to define the semantics of any relative URI
835	   reference; without it, a relative reference is meaningless.  In order
836	   for relative URI to be usable within a document, the base URI of
837	   that document must be known to the parser.

839	   The base URI of a document can be established in one of four ways,
840	   listed below in order of precedence.  The order of precedence can be
841	   thought of in terms of layers, where the innermost defined base URI
842	   has the highest precedence.  This can be visualized graphically as:

      .----------------------------------------------------------.
      |  .----------------------------------------------------.  |
      |  |  .----------------------------------------------.  |  |
      |  |  |  .----------------------------------------.  |  |  |
      |  |  |  |  .----------------------------------.  |  |  |  |
      |  |  |  |  |       <relative_reference>       |  |  |  |  |
      |  |  |  |  `----------------------------------'  |  |  |  |
      |  |  |  | (5.1.1) Base URI embedded in the       |  |  |  |
      |  |  |  |         document's content             |  |  |  |
      |  |  |  `----------------------------------------'  |  |  |
      |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
      |  |  |         (message, document, or none).        |  |  |
      |  |  `----------------------------------------------'  |  |
      |  | (5.1.3) URI used to retrieve the entity            |  |
      |  `----------------------------------------------------'  |
      | (5.1.4) Default Base URI is application-dependent        |
      `----------------------------------------------------------'

862	5.1.1. Base URI within Document Content

864	   Within certain document media types, the base URI of the document can
865	   be embedded within the content itself such that it can be readily
866	   obtained by a parser.  This can be useful for descriptive documents,
867	   such as tables of content, which may be transmitted to others through
868	   protocols other than their usual retrieval context (e.g., E-Mail or
869	   USENET news).

871	   It is beyond the scope of this document to specify how, for each
872	   media type, the base URI can be embedded.  It is assumed that user
873	   agents manipulating such media types will be able to obtain the
874	   appropriate syntax from that media type's specification.  An example
875	   of how the base URI can be embedded in the Hypertext Markup Language
876	   (HTML) [RFC1866] is provided in Appendix D.

878	   A mechanism for embedding the base URI within MIME container types
879	   (e.g., the message and multipart types) is defined by MHTML
880	   [RFC2110].  Protocols that do not use the MIME message header syntax,
881	   but which do allow some form of tagged metainformation to be included
882	   within messages, may define their own syntax for defining the base
883	   URI as part of a message.

885	5.1.2. Base URI from the Encapsulating Entity

887	   If no base URI is embedded, the base URI of a document is defined by
888	   the document's retrieval context.  For a document that is enclosed
889	   within another entity (such as a message or another document), the
890	   retrieval context is that entity; thus, the default base URI of the
891	   document is the base URI of the entity in which the document is
892	   encapsulated.

894	5.1.3. Base URI from the Retrieval URI

896	   If no base URI is embedded and the document is not encapsulated
897	   within some other entity (e.g., the top level of a composite entity),
898	   then, if a URI was used to retrieve the base document, that URI shall
899	   be considered the base URI.  Note that if the retrieval was the
900	   result of a redirected request, the last URI used (i.e., that which
901	   resulted in the actual retrieval of the document) is the base URI.

903	5.1.4. Default Base URI

905	   If none of the conditions described in Sections 5.1.1--5.1.3 apply,
906	   then the base URI is defined by the context of the application.
907	   Since this definition is necessarily application-dependent, failing
908	   to define the base URI using one of the other methods may result in
909	   the same content being interpreted differently by different types of
910	   application.
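
   The order of precedence in Sections 5.1.1 through 5.1.4 amounts to
   taking the first source that is defined.  A minimal Python sketch,
   in which the function name establish_base_uri and its parameter
   names are hypothetical and each argument is None when that source
   is unavailable:

```python
from typing import Optional


def establish_base_uri(
    embedded: Optional[str] = None,          # 5.1.1: within document content
    encapsulating: Optional[str] = None,     # 5.1.2: enclosing entity
    retrieval: Optional[str] = None,         # 5.1.3: URI used for retrieval
    application_default: Optional[str] = None,  # 5.1.4: application context
) -> Optional[str]:
    """Return the base URI with the highest precedence, or None."""
    for candidate in (embedded, encapsulating, retrieval, application_default):
        if candidate is not None:
            return candidate
    return None
```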

912	   It is the responsibility of the distributor(s) of a document
913	   containing relative URI to ensure that the base URI for that
914	   document can be established.  It must be emphasized that relative
915	   URI cannot be used reliably in situations where the document's
916	   base URI is not well-defined.

918	5.2. Resolving Relative References to Absolute Form

920	   This section describes an example algorithm for resolving URI
921	   references that might be relative to a given base URI.

923	   The base URI is established according to the rules of Section 5.1 and
924	   parsed into the four main components as described in Section 3.
925	   Note that only the scheme component is required to be present in the
926	   base URI; the other components may be empty or undefined.  A
927	   component is undefined if its preceding separator does not appear in
928	   the URI reference; the path component is never undefined, though it
929	   may be empty.  The base URI's query component is not used by the
930	   resolution algorithm and may be discarded.

932	   For each URI reference, the following steps are performed in order:

934	   1) The URI reference is parsed into the potential four components and
935	      fragment identifier, as described in Section 4.3.

937	   2) If the path component is empty and the scheme, authority, and
938	      query components are undefined, then it is a reference to the
939	      current document and we are done.  Otherwise, the reference URI's
940	      query and fragment components are defined as found (or not found)
941	      within the URI reference and not inherited from the base URI.

943	   3) If the scheme component is defined, indicating that the reference
944	      starts with a scheme name, then the reference is interpreted as an
945	      absolute URI and we are done.  Otherwise, the reference URI's
946	      scheme is inherited from the base URI's scheme component.

      Due to a loophole in prior specifications [RFC1630], some parsers
      allow the scheme name to be present in a relative URI if it is the
      same as the base URI scheme.  Unfortunately, this can conflict
      with the correct parsing of non-hierarchical URI.  For backwards
      compatibility, an implementation may work around such references
      by removing the scheme if it matches that of the base URI and the
      scheme is known to always use the <hier_part> syntax.  The parser
      can then continue with the steps below for the remainder of the
      reference components.  Validating parsers should mark such a
      malformed relative reference as an error.

959	   4) If the authority component is defined, then the reference is a
960	      network-path and we skip to step 7.  Otherwise, the reference
961	      URI's authority is inherited from the base URI's authority
962	      component, which will also be undefined if the URI scheme does not
963	      use an authority component.

965	   5) If the path component begins with a slash character ("/"), then
966	      the reference is an absolute-path and we skip to step 7.

968	   6) If this step is reached, then we are resolving a relative-path
969	      reference.  The relative path needs to be merged with the base
970	      URI's path.  Although there are many ways to do this, we will
971	      describe a simple method using a separate string buffer.

973	      a) All but the last segment of the base URI's path component is
974	         copied to the buffer.  In other words, any characters after the
975	         last (right-most) slash character, if any, are excluded.

977	      b) The reference's path component is appended to the buffer
978	         string.

980	      c) All occurrences of "./", where "." is a complete path segment,
981	         are removed from the buffer string.

983	      d) If the buffer string ends with "." as a complete path segment,
984	         that "." is removed.

      e) All occurrences of "<segment>/../", where <segment> is a
         complete path segment not equal to "..", are removed from the
         buffer string.  Removal of these path segments is performed
         iteratively, removing the leftmost matching pattern on each
         iteration, until no matching pattern remains.

      f) If the buffer string ends with "<segment>/..", where <segment>
         is a complete path segment not equal to "..", that
         "<segment>/.." is removed.

996	      g) If the resulting buffer string still begins with one or more
997	         complete path segments of "..", then the reference is
998	         considered to be in error.  Implementations may handle this
999	         error by retaining these components in the resolved path
1000	         (i.e., treating them as part of the final URI), by removing
1001	         them from the resolved path (i.e., discarding relative levels
1002	         above the root), or by avoiding traversal of the reference.

1004	      h) The remaining buffer string is the reference URI's new path
1005	         component.

1007	   7) The resulting URI components, including any inherited from the
1008	      base URI, are recombined to give the absolute form of the URI
1009	      reference.  Using pseudocode, this would be

         result = ""

         if scheme is defined then
             append scheme to result
             append ":" to result

         if authority is defined then
             append "//" to result
             append authority to result

         append path to result

         if query is defined then
             append "?" to result
             append query to result

         if fragment is defined then
             append "#" to result
             append fragment to result

         return result

1033	      Note that we must be careful to preserve the distinction between a
1034	      component that is undefined, meaning that its separator was not
1035	      present in the reference, and a component that is empty, meaning
1036	      that the separator was present and was immediately followed by the
1037	      next component separator or the end of the reference.

1039	   The above algorithm is intended to provide an example by which the
1040	   output of implementations can be tested -- implementation of the
1041	   algorithm itself is not required.  For example, some systems may find
1042	   it more efficient to implement step 6 as a pair of segment stacks
1043	   being merged, rather than as a series of string pattern replacements.
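
   The steps above can be sketched in Python using the string buffer
   method of step 6.  This is an illustration only, under several
   assumptions: the names merge_paths and resolve and the regular
   expressions are hypothetical, a same-document reference (step 2) is
   returned unchanged, the backwards-compatibility workaround in step 3
   is not applied, and leading ".." segments (step 6g) are retained in
   the result.

```python
import re

# Splits a URI reference into scheme, authority, path, query, fragment.
# A group is None when its separator is absent (component undefined).
_REF = re.compile(
    r"(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?",
    re.DOTALL,
)
# "(?<![^/])" anchors a match at the start of a complete path segment.
_DOT_SLASH = re.compile(r"(?<![^/])\./")
_SEG_UP_SLASH = re.compile(r"(?<![^/])(?!\.\./)[^/]+/\.\./")
_SEG_UP_END = re.compile(r"(?<![^/])(?!\.\./)[^/]+/\.\.\Z")


def merge_paths(base_path: str, ref_path: str) -> str:
    """Step 6: merge a relative path with the base URI's path."""
    buf = base_path[: base_path.rfind("/") + 1]   # 6a: drop last segment
    buf += ref_path                                # 6b: append reference
    buf = _DOT_SLASH.sub("", buf)                  # 6c: remove "./"
    if buf == "." or buf.endswith("/."):           # 6d: remove trailing "."
        buf = buf[:-1]
    while True:                                    # 6e: leftmost first
        buf, removed = _SEG_UP_SLASH.subn("", buf, count=1)
        if not removed:
            break
    buf = _SEG_UP_END.sub("", buf, count=1)        # 6f: trailing "seg/.."
    return buf                                     # 6g: ".." kept; 6h


def resolve(base: str, reference: str) -> str:
    """Resolve a URI reference against an absolute base URI."""
    b_scheme, b_authority, b_path, _, _ = _REF.fullmatch(base).groups()
    scheme, authority, path, query, fragment = (
        _REF.fullmatch(reference).groups()         # step 1
    )
    if path == "" and scheme is None and authority is None and query is None:
        return reference                           # step 2: current document
    if scheme is not None:
        return reference                           # step 3: already absolute
    scheme = b_scheme
    if authority is None:                          # step 4
        authority = b_authority
        if not path.startswith("/"):               # step 5
            path = merge_paths(b_path, path)       # step 6
    result = ""                                    # step 7: recombine
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

   With the base URI "http://a/b/c/d;p?q", this sketch resolves "../g"
   to "http://a/b/g" and "g;x?y#s" to "http://a/b/c/g;x?y#s".  It
   follows step 6a literally, so a base URI with an authority but an
   empty path is not given any special treatment.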

1045	      Note: Some WWW client applications will fail to separate the
1046	      reference's query component from its path component before merging
1047	      the base and reference paths in step 6 above.  This may result in
1048	      a loss of information if the query component contains the strings
1049	      "/../" or "/./".

1051	   Resolution examples are provided in Appendix C.

1053	6. URI Normalization and Equivalence

   In many cases, different URI strings may actually identify the
   same resource.  For example, the host names used in URL are case
   insensitive, so two URL that differ only in the case of the host
   name are equivalent.  In general, the rules for equivalence and
   definition of a normal form, if any, are scheme-dependent.  When a
   scheme uses elements of the common syntax, it will also use the
   common syntax equivalence rules, namely that the scheme and
   hostname are case insensitive and a URL with an explicit ":port",
   where the port is the default for the scheme, is equivalent to one
   where the port is elided.
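
   The common syntax equivalence rules can be sketched in Python as a
   normalization of the scheme, host, and port.  The names
   DEFAULT_PORTS and normalize_server_parts are hypothetical, and the
   small table of default ports is included only for illustration.

```python
from typing import Optional, Tuple

# Illustrative subset of scheme default ports (an assumption of this sketch).
DEFAULT_PORTS = {"http": "80", "ftp": "21", "gopher": "70"}


def normalize_server_parts(
    scheme: str, host: str, port: Optional[str] = None
) -> Tuple[str, str, Optional[str]]:
    """Fold case in scheme and host; elide a port equal to the default."""
    scheme = scheme.lower()
    host = host.lower()
    if port is not None and port == DEFAULT_PORTS.get(scheme):
        port = None
    return scheme, host, port
```

   Two URL are then equivalent under these rules when their normalized
   parts compare equal and their remaining components are equivalent
   according to the scheme.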

1066	7. Security Considerations

1068	   A URI does not in itself pose a security threat.  Users should beware
1069	   that there is no general guarantee that a URL, which at one time
1070	   located a given resource, will continue to do so.  Nor is there any
1071	   guarantee that a URL will not locate a different resource at some
1072	   later point in time, due to the lack of any constraint on how a given
1073	   authority apportions its namespace.  Such a guarantee can only be
1074	   obtained from the person(s) controlling that namespace and the
1075	   resource in question.  A specific URI scheme may include additional
1076	   semantics, such as name persistence, if those semantics are required
1077	   of all naming authorities for that scheme.

1079	   It is sometimes possible to construct a URL such that an attempt to
1080	   perform a seemingly harmless, idempotent operation, such as the
1081	   retrieval of an entity associated with the resource, will in fact
1082	   cause a possibly damaging remote operation to occur.  The unsafe URL
1083	   is typically constructed by specifying a port number other than that
1084	   reserved for the network protocol in question.  The client
1085	   unwittingly contacts a site that is in fact running a different
1086	   protocol.  The content of the URL contains instructions that, when
1087	   interpreted according to this other protocol, cause an unexpected
1088	   operation.  An example has been the use of a gopher URL to cause an
1089	   unintended or impersonating message to be sent via a SMTP server.

1091	   Caution should be used when using any URL that specifies a port
1092	   number other than the default for the protocol, especially when it
1093	   is a number within the reserved space.

   Care should be taken when a URL contains escaped delimiters for a
   given protocol (for example, CR and LF characters for telnet
   protocols) that these are not unescaped before transmission.  This
   might violate the protocol, but avoids the potential for such
   characters to be used to simulate an extra operation or parameter
   in that protocol, which might lead to an unexpected and possibly
   harmful remote operation being performed.
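
   As one possible safeguard, a client could detect escaped CR and LF
   characters before deciding how to handle a URL.  The Python sketch
   below only detects them; what to do with such a URL is left to the
   application.  The name contains_escaped_line_break is hypothetical.

```python
import re

# "%0D" is an escaped CR and "%0A" an escaped LF; hex digits may be
# written in either case.
_ESCAPED_CR_LF = re.compile(r"%0[AaDd]")


def contains_escaped_line_break(url: str) -> bool:
    """True when the URL contains an escaped CR or LF character."""
    return _ESCAPED_CR_LF.search(url) is not None
```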

   It is clearly unwise to use a URL that contains a password which is
   intended to be secret.  In particular, the use of a password within
   the 'userinfo' component of a URL is strongly discouraged except
   in those rare cases where the 'password' parameter is intended
   to be public.

1109	8. Acknowledgements

1111	   This document was derived from RFC 1738 [RFC1738] and RFC 1808
1112	   [RFC1808]; the acknowledgements in those specifications still
1113	   apply.  In addition, contributions by Gisle Aas, Martin Beet,
1114	   Martin Duerst, Jim Gettys, Martijn Koster, Dave Kristol,
1115	   Daniel LaLiberte, Foteos Macrides, James Marshall, Ryan Moats,
1116	   Keith Moore, and Lauren Wood are gratefully acknowledged.

1118	9. References

1120	[RFC2277] Alvestrand, H. "IETF Policy on Character Sets and Languages",
1121	   BCP 18, RFC 2277, UNINETT, January 1998.

1123	[RFC1630] Berners-Lee, T. "Universal Resource Identifiers in WWW: A
1124	   Unifying Syntax for the Expression of Names and Addresses of
1125	   Objects on the Network as used in the World-Wide Web", RFC 1630,
1126	   CERN, June 1994.

1128	[RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors.
1129	   "Uniform Resource Locators (URL)", RFC 1738, CERN, Xerox
1130	   Corporation, University of Minnesota, December 1994.

[RFC1866] Berners-Lee, T., and D. Connolly. "HyperText Markup Language
   Specification -- 2.0", RFC 1866, MIT/W3C, November 1995.

1135	[RFC1123] Braden, R., Editor. "Requirements for Internet Hosts --
1136	   Application and Support", STD 3, RFC 1123, IETF, October 1989.

1138	[RFC822]  Crocker, D. "Standard for the Format of ARPA Internet Text
1139	   Messages", STD 11, RFC 822, UDEL, August 1982.

1141	[RFC1808] Fielding, R. "Relative Uniform Resource Locators", RFC 1808,
1142	   UC Irvine, June 1995.

1144	[RFC2046] Freed, N., and N. Borenstein. "Multipurpose Internet Mail
1145	   Extensions (MIME) Part Two: Media Types", RFC 2046, Innosoft,
1146	   First Virtual, November 1996.

1148	[RFC1736] Kunze, J. "Functional Recommendations for Internet Resource
1149	   Locators", RFC 1736, IS&T, UC Berkeley, February 1995.

1151	[RFC2141] Moats, R. "URN Syntax", RFC 2141, AT&T, May 1997.

1153	[RFC1034] Mockapetris, P. "Domain Names - Concepts and Facilities",
1154	   STD 13, RFC 1034, USC/Information Sciences Institute, November 1987.

1156	[RFC2110] Palme, J., and A. Hopmann. "MIME E-mail Encapsulation of
1157	   Aggregate Documents, such as HTML (MHTML)", RFC 2110, Stockholm
1158	   University/KTH, Microsoft Corporation, March 1997.

1160	[RFC1737] Sollins, K., and L. Masinter. "Functional Requirements for
1161	   Uniform Resource Names", RFC 1737, MIT/LCS, Xerox Corporation,
1162	   December 1994.

1164	[ASCII] US-ASCII. "Coded Character Set -- 7-bit American Standard Code
1165	   for Information Interchange", ANSI X3.4-1986.

1167	[UTF-8] Yergeau, F. "UTF-8, a transformation format of ISO 10646",
1168	   RFC 2279, Alis Technologies, January 1998.

1170	10. Authors' Addresses

1172	   Tim Berners-Lee
1173	   World Wide Web Consortium
1174	   MIT Laboratory for Computer Science, NE43-356
1175	   545 Technology Square
1176	   Cambridge, MA 02139

1178	   Fax: +1(617)258-8682
1179	   EMail: timbl@w3.org

1181	   Roy T. Fielding
1182	   Department of Information and Computer Science
1183	   University of California, Irvine
1184	   Irvine, CA  92697-3425

1186	   Fax: +1(949)824-1715
1187	   EMail: fielding@ics.uci.edu

1189	   Larry Masinter
1190	   Xerox PARC
1191	   3333 Coyote Hill Road
1192	   Palo Alto, CA 94034

1194	   Fax: +1(415)812-4333
1195	   EMail: masinter@parc.xerox.com

1197	Appendices

1199	A. Collected BNF for URI

1201	      URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
1202	      absoluteURI   = scheme ":" ( hier_part | opaque_part )
1203	      relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]

1205	      hier_part     = ( net_path | abs_path ) [ "?" query ]
1206	      opaque_part   = uric_no_slash *uric

1208	      uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
1209	                      "&" | "=" | "+" | "$" | ","

1211	      net_path      = "//" authority [ abs_path ]
1212	      abs_path      = "/"  path_segments
1213	      rel_path      = rel_segment [ abs_path ]

1215	      rel_segment   = 1*( unreserved | escaped |
1216	                          ";" | "@" | "&" | "=" | "+" | "$" | "," )

1218	      scheme        = alpha *( alpha | digit | "+" | "-" | "." )

1220	      authority     = server | reg_name

1222	      reg_name      = 1*( unreserved | escaped | "$" | "," |
1223	                          ";" | ":" | "@" | "&" | "=" | "+" )

1225	      server        = [ [ userinfo "@" ] hostport ]
1226	      userinfo      = *( unreserved | escaped |
1227	                         ";" | ":" | "&" | "=" | "+" | "$" | "," )

1229	      hostport      = host [ ":" port ]
1230	      host          = hostname | IPv4address
1231	      hostname      = *( domainlabel "." ) toplabel [ "." ]
1232	      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
1233	      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
1234	      IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
1235	      port          = *digit

1237	      path          = [ abs_path | opaque_part ]
1238	      path_segments = segment *( "/" segment )
1239	      segment       = *pchar *( ";" param )
1240	      param         = *pchar
1241	      pchar         = unreserved | escaped |
1242	                      ":" | "@" | "&" | "=" | "+" | "$" | ","

1244	      query         = *uric

1246	      fragment      = *uric

1248	      uric          = reserved | unreserved | escaped
1249	      reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
1250	                      "$" | ","
1251	      unreserved    = alphanum | mark
1252	      mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
1253	                      "(" | ")"

1255	      escaped       = "%" hex hex
1256	      hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
1257	                              "a" | "b" | "c" | "d" | "e" | "f"

1259	      alphanum      = alpha | digit
1260	      alpha         = lowalpha | upalpha

1262	      lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
1263	                 "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
1264	                 "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
1265	      upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
1266	                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
1267	                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
1268	      digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
1269	                 "8" | "9"
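
   As an informal illustration, a few of the rules above can be
   transcribed into Python regular expressions.  This is only a sketch:
   the names SCHEME, ABS_PATH, is_scheme, and is_abs_path are ours, and
   only the scheme and abs_path rules (with the rules they depend on)
   are covered.

```python
import re

# unreserved = alphanum | mark
UNRESERVED = r"[A-Za-z0-9\-_.!~*'()]"
# escaped = "%" hex hex
ESCAPED = r"%[0-9A-Fa-f]{2}"
# pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
PCHAR = rf"(?:{UNRESERVED}|{ESCAPED}|[:@&=+$,])"
# segment = *pchar *( ";" param ), where param = *pchar
SEGMENT = rf"{PCHAR}*(?:;{PCHAR}*)*"

# scheme = alpha *( alpha | digit | "+" | "-" | "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")
# abs_path = "/" path_segments, path_segments = segment *( "/" segment )
ABS_PATH = re.compile(rf"/{SEGMENT}(?:/{SEGMENT})*")


def is_scheme(text):
    """True if the whole string matches the scheme rule."""
    return SCHEME.fullmatch(text) is not None


def is_abs_path(text):
    """True if the whole string matches the abs_path rule."""
    return ABS_PATH.fullmatch(text) is not None
```

   For instance, is_abs_path("/b/c/d;p") holds, while a path containing
   an unescaped space or an incomplete "%" escape is rejected.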

1271	B. Parsing a URI Reference with a Regular Expression

1273	   As described in Section 4.3, the generic URI syntax is not sufficient
1274	   to disambiguate the components of some forms of URI.  Since the
1275	   "greedy algorithm" described in that section is identical to the
1276	   disambiguation method used by POSIX regular expressions, it is
1277	   natural and commonplace to use a regular expression for parsing the
1278	   potential four components and fragment identifier of a URI reference.

1280	   The following line is the regular expression for breaking down a URI
1281	   reference into its components.

1283	      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
1284	       12            3  4          5       6  7        8 9

1286	   The numbers in the second line above are only to assist readability;
1287	   they indicate the reference points for each subexpression (i.e., each
1288	   paired parenthesis).  We refer to the value matched for subexpression
1289	   <n> as $<n>.  For example, matching the above expression to

1291	      http://www.ics.uci.edu/pub/ietf/uri/#Related

1293	   results in the following subexpression matches:

1295	      $1 = http:
1296	      $2 = http
1297	      $3 = //www.ics.uci.edu
1298	      $4 = www.ics.uci.edu
1299	      $5 = /pub/ietf/uri/
1300	      $6 = <undefined>
1301	      $7 = <undefined>
1302	      $8 = #Related
1303	      $9 = Related

1305	   where <undefined> indicates that the component is not present, as is
1306	   the case for the query component in the above example.  Therefore, we
1307	   can determine the value of the four components and fragment as

1309	      scheme    = $2
1310	      authority = $4
1311	      path      = $5
1312	      query     = $7
1313	      fragment  = $9

1315	   and, going in the opposite direction, we can recreate a URI reference
1316	   from its components using the algorithm in step 7 of Section 5.2.
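
   The expression can be tried directly in most regular expression
   libraries.  The following sketch uses Python's "re" module; the
   function name split_uri_reference is ours, and an absent component
   is reported as None.

```python
import re

# The regular expression from this appendix, unchanged.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Split a URI reference into its four components and fragment."""
    # Every part of the expression is optional, so a match always exists.
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

   Here parts["query"] is None because the reference has no query
   component, while parts["path"] is "/pub/ietf/uri/".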

1318	C. Examples of Resolving Relative URI References

1320	   Within an object with a well-defined base URI of

1322	      http://a/b/c/d;p?q

1324	   the relative URI would be resolved as follows:

1326	C.1.  Normal Examples

1328	      g:h           =  g:h
1329	      g             =  http://a/b/c/g
1330	      ./g           =  http://a/b/c/g
1331	      g/            =  http://a/b/c/g/
1332	      /g            =  http://a/g
1333	      //g           =  http://g
1334	      ?y            =  http://a/b/c/?y
1335	      g?y           =  http://a/b/c/g?y
1336	      #s            =  (current document)#s
1337	      g#s           =  http://a/b/c/g#s
1338	      g?y#s         =  http://a/b/c/g?y#s
1339	      ;x            =  http://a/b/c/;x
1340	      g;x           =  http://a/b/c/g;x
1341	      g;x?y#s       =  http://a/b/c/g;x?y#s
1342	      .             =  http://a/b/c/
1343	      ./            =  http://a/b/c/
1344	      ..            =  http://a/b/
1345	      ../           =  http://a/b/
1346	      ../g          =  http://a/b/g
1347	      ../..         =  http://a/
1348	      ../../        =  http://a/
1349	      ../../g       =  http://a/g
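
   Several of these results can be checked against an existing
   resolver.  The sketch below uses urljoin from Python's standard
   urllib.parse module as a stand-in.  That function was not written
   against this document and is known to treat some references (for
   example "?y") differently, so only a subset of the table on which
   the two agree is included.  The names BASE, EXPECTED, and mismatches
   are ours.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# A subset of the normal examples, copied from the table above.
EXPECTED = {
    "g:h": "g:h",
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "g?y": "http://a/b/c/g?y",
    "g#s": "http://a/b/c/g#s",
    "g?y#s": "http://a/b/c/g?y#s",
    "g;x": "http://a/b/c/g;x",
    ".": "http://a/b/c/",
    "./": "http://a/b/c/",
    "..": "http://a/b/",
    "../": "http://a/b/",
    "../g": "http://a/b/g",
    "../..": "http://a/",
    "../../": "http://a/",
    "../../g": "http://a/g",
}


def mismatches():
    """Return the references whose resolved form differs from the table."""
    return [ref for ref, want in EXPECTED.items() if urljoin(BASE, ref) != want]
```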

1351	C.2.  Abnormal Examples

1353	   Although the following abnormal examples are unlikely to occur in
1354	   normal practice, all URI parsers should be capable of resolving them
1355	   consistently.  Each example uses the same base as above.

1357	   An empty reference refers to the start of the current document.

1359	      <>            =  (current document)

1361	   Parsers must be careful in handling the case where there are more
1362	   relative path ".." segments than there are hierarchical levels in
1363	   the base URI's path.  Note that the ".." syntax cannot be used to
1364	   change the authority component of a URI.

1366	      ../../../g    =  http://a/../g
1367	      ../../../../g =  http://a/../../g

1369	   In practice, some implementations strip leading relative symbolic
1370	   elements (".", "..") after applying a relative URI calculation, based
1371	   on the theory that compensating for obvious author errors is better
1372	   than allowing the request to fail.  Thus, the above two references
1373	   will be interpreted as "http://a/g" by some implementations.

1375	   Similarly, parsers must avoid treating "." and ".." as special when
1376	   they are not complete components of a relative path.

1378	      /./g          =  http://a/./g
1379	      /../g         =  http://a/../g
1380	      g.            =  http://a/b/c/g.
1381	      .g            =  http://a/b/c/.g
1382	      g..           =  http://a/b/c/g..
1383	      ..g           =  http://a/b/c/..g

1385	   Less likely are cases where the relative URI uses unnecessary or
1386	   nonsensical forms of the "." and ".." complete path segments.

1388	      ./../g        =  http://a/b/g
1389	      ./g/.         =  http://a/b/c/g/
1390	      g/./h         =  http://a/b/c/g/h
1391	      g/../h        =  http://a/b/c/h
1392	      g;x=1/./y     =  http://a/b/c/g;x=1/y
1393	      g;x=1/../y    =  http://a/b/c/y
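
   One way to reproduce the path results in this section is to merge
   the two paths and then cancel complete "." and ".." segments, while
   leaving any excess ".." in place.  The sketch below is our own
   illustration, not a restatement of the algorithm in Section 5.2.
   The function name merge_paths is hypothetical, and it applies only
   to references that are relative paths (a reference beginning with
   "/" is not merged with the base path at all).

```python
def merge_paths(base_path, rel_path):
    """Merge a relative path with a base path and resolve dot segments."""
    # Keep the base path up to and including its last "/", then append.
    buffer = base_path[: base_path.rfind("/") + 1] + rel_path
    segments = buffer.split("/")
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # A complete "." segment is dropped; keep the trailing slash.
            if is_last:
                out.append("")
        elif segment == ".." and len(out) > 1 and out[-1] != "..":
            # "<segment>/.." cancels out, but never above the root.
            out.pop()
            if is_last:
                out.append("")
        else:
            # Ordinary segments, and ".." with nothing left to cancel.
            out.append(segment)
    return "/".join(out)
```

   With the base path "/b/c/d;p", merge_paths returns "/b/c/y" for
   "g;x=1/../y" and "/../g" for "../../../g", matching the examples.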

1395	   All client applications remove the query component from the base URI
1396	   before resolving relative URI.  However, some applications fail to
1397	   separate the reference's query and/or fragment components from a
1398	   relative path before merging it with the base path.  This error is
1399	   rarely noticed, since typical usage of a fragment never includes the
1400	   hierarchy ("/") character, and the query component is not normally
1401	   used within relative references.

1403	      g?y/./x       =  http://a/b/c/g?y/./x
1404	      g?y/../x      =  http://a/b/c/g?y/../x
1405	      g#s/./x       =  http://a/b/c/g#s/./x
1406	      g#s/../x      =  http://a/b/c/g#s/../x
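
   The error described above is avoided by splitting the reference
   before any path merging takes place.  A minimal sketch, assuming the
   tail of the regular expression from Appendix B and a hypothetical
   function name split_relative:

```python
import re

# Path, then optional "?query", then optional "#fragment".
REFERENCE = re.compile(r"^([^?#]*)(\?([^#]*))?(#(.*))?$", re.DOTALL)


def split_relative(reference):
    """Return (path, query, fragment); absent parts are None."""
    match = REFERENCE.match(reference)
    return match.group(1), match.group(3), match.group(5)


# Only the path "g" takes part in merging; "/../x" stays in the query.
path, query, fragment = split_relative("g?y/../x")
```

   Only the first element of the result is merged with the base path;
   the query and fragment are reattached afterwards, untouched.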

1408	   Some parsers allow the scheme name to be present in a relative URI
1409	   if it is the same as the base URI scheme.  This is considered to be
1410	   a loophole in prior specifications of partial URI [RFC1630]. Its
1411	   use should be avoided.

1413	      http:g        =  http:g           ; for validating parsers
1414	                    |  http://a/b/c/g   ; for backwards compatibility

1416	D. Embedding the Base URI in HTML documents

1418	   It is useful to consider an example of how the base URI of a
1419	   document can be embedded within the document's content.  In this
1420	   appendix, we describe how documents written in the Hypertext Markup
1421	   Language (HTML) [RFC1866] can include an embedded base URI.  This
1422	   appendix does not form a part of the URI specification and should not
1423	   be considered as anything more than a descriptive example.

1425	   HTML defines a special element "BASE" which, when present in the
1426	   "HEAD" portion of a document, signals that the parser should use
1427	   the BASE element's "HREF" attribute as the base URI for resolving
1428	   any relative URI.  The "HREF" attribute must be an absolute URI.
1429	   Note that, in HTML, element and attribute names are
1430	   case-insensitive.  For example:

1432	      
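1433	      <HTML><HEAD>
1434	      <TITLE>An example HTML document</TITLE>
1435	      <BASE href="http://www.ics.uci.edu/Test/a/b/c">
1436	      </HEAD><BODY>
1437	      ... <A href="../x">a hypertext anchor</A> ...
1438	      </BODY></HTML>

1440	   A parser reading the example document should interpret the given
1441	   relative URI "../x" as representing the absolute URI

1443	      <URL:http://www.ics.uci.edu/Test/a/x>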

1445	   regardless of the context in which the example document was
1446	   obtained.
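
   As a rough illustration of this behavior, the following sketch uses
   Python's standard html.parser and urllib.parse modules to find a
   BASE element and resolve a reference against it.  The class name
   BaseFinder, the function resolve_in_document, and the example.org
   base URI are all invented for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Record the HREF of the first BASE element encountered."""

    def __init__(self):
        super().__init__()
        self.base = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names, which
        # matches the case-insensitivity of HTML noted above.
        if tag == "base" and self.base is None:
            self.base = dict(attrs).get("href")


# Hypothetical document; the base URI is invented for this sketch.
DOCUMENT = """<HTML><HEAD>
<TITLE>An example HTML document</TITLE>
<BASE HREF="http://example.org/Test/a/b/c">
</HEAD><BODY>
... <A HREF="../x">a hypertext anchor</A> ...
</BODY></HTML>"""


def resolve_in_document(document, reference):
    """Resolve a reference against the document's embedded base URI."""
    finder = BaseFinder()
    finder.feed(document)
    if finder.base is None:
        raise ValueError("no BASE element found")
    return urljoin(finder.base, reference)
```

   Resolving "../x" in this document yields
   "http://example.org/Test/a/x", whatever URI the document itself was
   retrieved from.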

1448	E. Recommendations for Delimiting URI in Context

1450	   URI are often transmitted through formats that do not provide a
1451	   clear context for their interpretation.  For example, there are
1452	   many occasions when URI are included in plain text; examples
1453	   include text sent in electronic mail, USENET news messages, and,
1454	   most importantly, printed on paper.  In such cases, it is important
1455	   to be able to delimit the URI from the rest of the text, and in
1456	   particular from punctuation marks that might be mistaken for part
1457	   of the URI.

1459	   In practice, URI are delimited in a variety of ways, but usually
1460	   within double-quotes "http://test.com/", angle brackets
1461	   <http://test.com/>, or just using whitespace

1463	                  http://test.com/

1465	   These wrappers do not form part of the URI.

1467	   In the case where a fragment identifier is associated with a URI
1468	   reference, the fragment would be placed within the brackets as well
1469	   (separated from the URI with a "#" character).

1471	   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
1472	   may need to be added to break long URI across lines. The
1473	   whitespace should be ignored when extracting the URI.

1475	   No whitespace should be introduced after a hyphen ("-") character.
1476	   Because some typesetters and printers may (erroneously) introduce a
1477	   hyphen at the end of line when breaking a line, the interpreter of a
1478	   URI containing a line break immediately after a hyphen should ignore
1479	   all unescaped whitespace around the line break, and should be aware
1480	   that the hyphen may or may not actually be part of the URI.

1482	   Using <> angle brackets around each URI is especially recommended
1483	   as a delimiting style for URI that contain whitespace.

1485	   The prefix "URL:" (with or without a trailing space) was
1486	   recommended as a way to help distinguish a URL from other
1487	   bracketed designators, although this is not common in practice.

1489	   For robustness, software that accepts user-typed URI should
1490	   attempt to recognize and strip both delimiters and embedded
1491	   whitespace.

1493	   For example, the text:

1495	      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
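1496	      but you can probably pick it up from <ftp://ds.internic.
1497	      net/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
1498	      ietf/uri/historical.html#WARNING>.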

1500	   contains the URI references

1502	      http://www.w3.org/Addressing/
1503	      ftp://ds.internic.net/rfc/
1504	      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
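
   A minimal sketch of such stripping, in Python, is given below.  It
   handles a single candidate that has already been located in the
   text; finding candidates in running prose, and the hyphenation case
   discussed above, are left out.  The function name strip_delimiters
   is ours.

```python
import re


def strip_delimiters(raw):
    """Remove wrappers and embedded whitespace from a delimited URI."""
    text = raw.strip()
    # Drop one matching pair of double quotes or angle brackets.
    if len(text) >= 2 and (text[0], text[-1]) in {('"', '"'), ("<", ">")}:
        text = text[1:-1].strip()
    # Drop the optional "URL:" prefix (with or without a trailing space).
    if text.startswith("URL:"):
        text = text[4:]
    # Whitespace added to break a long URI across lines is not part of it.
    return re.sub(r"\s+", "", text)


uri = strip_delimiters("<ftp://ds.internic.\n      net/rfc/>")
```

   Here uri is "ftp://ds.internic.net/rfc/": the angle brackets and the
   line break inside them have been discarded.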

1506	F. Abbreviated URLs

1508	   The URL syntax was designed for unambiguous reference to network
1509	   resources and extensibility via the URL scheme.  However, as URL
1510	   identification and usage have become commonplace, traditional media
1511	   (television, radio, newspapers, billboards, etc.) have increasingly
1512	   used abbreviated URL references.  That is, a reference consisting of
1513	   only the authority and path portions of the identified resource,
1514	   such as

1516	      www.w3.org/Addressing/

1518	   or simply the DNS hostname on its own.  Such references are primarily
1519	   intended for human interpretation rather than machine, with the
1520	   assumption that context-based heuristics are sufficient to complete
1521	   the URL (e.g., most hostnames beginning with "www" are likely to have
1522	   a URL prefix of "http://").  Although there is no standard set of
1523	   heuristics for disambiguating abbreviated URL references, many
1524	   client implementations allow them to be entered by the user and
1525	   heuristically resolved.  It should be noted that such heuristics may
1526	   change over time, particularly when new URL schemes are introduced.
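
   Purely as an illustration of the kind of heuristic a client might
   apply, the following Python sketch completes a reference whose
   hostname begins with "www." and leaves anything it cannot classify
   unresolved.  The function name complete_abbreviated and the choice
   of rules are our assumptions, not a recommended or standard set.

```python
import re

# scheme ":" as defined in Appendix A.
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*:")


def complete_abbreviated(reference):
    """Guess a full URL for an abbreviated reference, or return None."""
    # Heuristic from the text: "www" hostnames most likely mean http.
    if reference.lower().startswith("www."):
        return "http://" + reference
    # Already has a scheme: nothing to complete.
    if SCHEME_PREFIX.match(reference):
        return reference
    # No rule applies; the caller must decide what to do.
    return None
```

   A real client would carry more rules, and the set would need to be
   revisited as new schemes appear.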

1528	   Since an abbreviated URL has the same syntax as a relative URL path,
1529	   abbreviated URL references cannot be used in contexts where relative
1530	   URLs are expected.  This limits the use of abbreviated URLs to places
1531	   where there is no defined base URL, such as dialog boxes and off-line
1532	   advertisements.

1534	G. Summary of Non-editorial Changes

1536	G.1. Additions

1538	   Section 4 (URI References) was added to stem the confusion
1539	   regarding "what is a URI" and how to describe fragment identifiers
1540	   given that they are not part of the URI, but are part of the URI
1541	   syntax and parsing concerns.  In addition, it provides a reference
1542	   definition for use by other IETF specifications (HTML, HTTP, etc.)
1543	   that have previously attempted to redefine the URI syntax in order
1544	   to account for the presence of fragment identifiers in URI
1545	   references.

1547	   Section 2.4 was rewritten to clarify a number of misinterpretations
1548	   and to leave room for fully internationalized URI.

1550	   Appendix F on abbreviated URLs was added to describe the shortened
1551	   references often seen in television and magazine advertisements and
1552	   explain why they are not used in other contexts.

1554	G.2. Modifications from both RFC 1738 and RFC 1808

1556	   Changed to URI syntax instead of just URL.

1558	   Confusion regarding the terms "character encoding", the URI
1559	   "character set", and the escaping of characters with %
1560	   equivalents has (hopefully) been reduced.  Many of the BNF rule
1561	   names regarding the character sets have been changed to more
1562	   accurately describe their purpose and to encompass all "characters"
1563	   rather than just US-ASCII octets.  Unless otherwise noted here,
1564	   these modifications do not affect the URI syntax.

1566	   Both RFC 1738 and RFC 1808 refer to the "reserved" set of
1567	   characters as if URI-interpreting software were limited to a single
1568	   set of characters with a reserved purpose (i.e., as meaning
1569	   something other than the data to which the characters correspond),
1570	   and that this set was fixed by the URI scheme.  However, this has
1571	   not been true in practice; any character that is interpreted
1572	   differently when it is escaped is, in effect, reserved.
1573	   Furthermore, the interpreting engine on an HTTP server is often
1574	   dependent on the resource, not just the URI scheme.  The
1575	   description of reserved characters has been changed accordingly.

1577	   The plus "+", dollar "$", and comma "," characters have been added to
1578	   those in the "reserved" set, since they are treated as reserved
1579	   within the query component.

1581	   The tilde "~" character was added to those in the "unreserved" set,
1582	   since it is extensively used on the Internet in spite of the
1583	   difficulty of transcribing it on some keyboards.

1585	   The syntax for URI scheme has been changed to require that all
1586	   schemes begin with an alpha character.

1588	   The "user:password" form in the previous BNF was changed to
1589	   a "userinfo" token, and the possibility that it might be
1590	   "user:password" made scheme specific. In particular, the use
1591	   of passwords in the clear is not even suggested by the syntax.

1593	   The question-mark "?" character was removed from the set of allowed
1594	   characters for the userinfo in the authority component, since
1595	   testing showed that many applications treat it as reserved for
1596	   separating the query component from the rest of the URI.

1598	   The semicolon ";" character was added to those stated as being
1599	   reserved within the authority component, since several new schemes
1600	   are using it as a separator within userinfo to indicate the type
1601	   of user authentication.

1603	   RFC 1738 specified that the path was separated from the authority
1604	   portion of a URI by a slash.  RFC 1808 followed suit, but with a
1605	   fudge of carrying around the separator as a "prefix" in order to
1606	   describe the parsing algorithm.  RFC 1630 never had this problem,
1607	   since it considered the slash to be part of the path.  In writing
1608	   this specification, it was found to be impossible to accurately
1609	   describe and retain the difference between the two URI
1610	      <foo:/bar>   and   <foo:bar>
1611	   without either considering the slash to be part of the path (as
1612	   corresponds to actual practice) or creating a separate component just
1613	   to hold that slash.  We chose the former.

1615	G.3. Modifications from RFC 1738

1617	   The definition of specific URL schemes and their scheme-specific
1618	   syntax and semantics has been moved to separate documents.

1620	   The URL host was defined as a fully-qualified domain name.  However,
1621	   many URLs are used without fully-qualified domain names (in contexts
1622	   for which the full qualification is not necessary), without any host
1623	   (as in some file URLs), or with a host of "localhost".

1625	   The URL port is now *digit instead of 1*digit, since systems are
1626	   expected to handle the case where the ":" separator between host and
1627	   port is supplied without a port.

1629	   The recommendations for delimiting URI in context (Appendix E) have
1630	   been adjusted to reflect current practice.

1632	G.4. Modifications from RFC 1808

1634	   RFC 1808 (Section 4) defined an empty URL reference (a reference
1635	   containing nothing aside from the fragment identifier) as being a
1636	   reference to the base URL.  Unfortunately, that definition could be
1637	   interpreted, upon selection of such a reference, as a new retrieval
1638	   action on that resource.  Since the normal intent of such references
1639	   is for the user agent to change its view of the current document to
1640	   the beginning of the specified fragment within that document, not to
1641	   make an additional request of the resource, a description of how to
1642	   correctly interpret an empty reference has been added in Section 4.

1644	   The description of the mythical Base header field has been replaced
1645	   with a reference to the Content-Location header field defined by
1646	   MHTML [RFC2110].

1648	   RFC 1808 described various schemes as either having or not having the
1649	   properties of the generic URI syntax.  However, the only requirement
1650	   is that the particular document containing the relative references
1651	   have a base URI that abides by the generic URI syntax, regardless of
1652	   the URI scheme, so the associated description has been updated to
1653	   reflect that.

1655	   The BNF term <net_loc> has been replaced with <authority>, since the
1656	   latter more accurately describes its use and purpose.  Likewise, the
1657	   authority is no longer restricted to the IP server syntax.

1659	   Extensive testing of current client applications demonstrated that
1660	   the majority of deployed systems do not use the ";" character to
1661	   indicate trailing parameter information, and that the presence of a
1662	   semicolon in a path segment does not affect the relative parsing of
1663	   that segment.  Therefore, parameters have been removed as a separate
1664	   component and may now appear in any path segment.  Their influence
1665	   has been removed from the algorithm for resolving a relative URI
1666	   reference.  The resolution examples in Appendix C have been modified
1667	   to reflect this change.

1669	   Implementations are now allowed to work around malformed relative
1670	   references that are prefixed by the same scheme as the base URI,
1671	   but only for schemes known to use the <hier_part> syntax.

1673	H. Full Copyright Statement

1675	   Copyright (C) The Internet Society (1998).  All Rights Reserved.

1677	   This document and translations of it may be copied and furnished to
1678	   others, and derivative works that comment on or otherwise explain it
1679	   or assist in its implementation may be prepared, copied, published
1680	   and distributed, in whole or in part, without restriction of any
1681	   kind, provided that the above copyright notice and this paragraph are
1682	   included on all such copies and derivative works.  However, this
1683	   document itself may not be modified in any way, such as by removing
1684	   the copyright notice or references to the Internet Society or other
1685	   Internet organizations, except as needed for the purpose of
1686	   developing Internet standards in which case the procedures for
1687	   copyrights defined in the Internet Standards process must be
1688	   followed, or as required to translate it into languages other than
1689	   English.

1691	   The limited permissions granted above are perpetual and will not be
1692	   revoked by the Internet Society or its successors or assigns.

1694	   This document and the information contained herein is provided on an
1695	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1696	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1697	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1698	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1699	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.