idnits 2.17.1 

draft-fielding-uri-rfc2396bis-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1.a on line 20.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 2773.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2750.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2757.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2763.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement. 

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        This document is an Internet-Draft and is subject to all provisions of
        Section 3 of RFC 3667.

        By submitting this Internet-Draft, each author represents that any
        applicable patent or other IPR claims of which he or she is aware
        have been or will be disclosed, and any of which he or she
        becomes aware will be disclosed, in accordance with Section 6 of
        BCP 79.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 58
     longer pages, the longest (page 7) being 72 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 61 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  -- The draft header indicates that this document obsoletes RFC2732, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC2396, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document obsoletes RFC1808, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC1738, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 735 has weird spacing: '...  query   frag...'

     (Using the creation date from RFC1738, updated by this document, for
     RFC5378 checks: 1994-12-01)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (September 25, 2004) is 7152 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UCS'

  -- Obsolete informational reference (is this intentional?): RFC 2717 (ref.
     'BCP35') (Obsoleted by RFC 4395)

  -- Obsolete informational reference (is this intentional?): RFC 1738
     (Obsoleted by RFC 4248, RFC 4266)

  -- Obsolete informational reference (is this intentional?): RFC 1808
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

  -- Obsolete informational reference (is this intentional?): RFC 2396
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 2518
     (Obsoleted by RFC 4918)

  -- Obsolete informational reference (is this intentional?): RFC 2718
     (Obsoleted by RFC 4395)

  -- Obsolete informational reference (is this intentional?): RFC 2732
     (Obsoleted by RFC 3986)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3513
     (Obsoleted by RFC 4291)


     Summary: 8 errors (**), 0 flaws (~~), 6 warnings (==), 23 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     T. Berners-Lee
3	Internet-Draft                                                   W3C/MIT
4	Updates: 1738 (if approved)                                  R. Fielding
5	Obsoletes: 2732, 2396, 1808 (if approved)                   Day Software
6	                                                             L. Masinter
7	Expires: March 26, 2005                                            Adobe
8	                                                      September 25, 2004

10	           Uniform Resource Identifier (URI): Generic Syntax
11	                    draft-fielding-uri-rfc2396bis-07

13	Status of this Memo

15	   This document is an Internet-Draft and is subject to all provisions
16	   of section 3 of RFC 3667.  By submitting this Internet-Draft, each
17	   author represents that any applicable patent or other IPR claims of
18	   which he or she is aware have been or will be disclosed, and any of
19	   which he or she becomes aware will be disclosed, in accordance with
20	   RFC 3668.

22	   Internet-Drafts are working documents of the Internet Engineering
23	   Task Force (IETF), its areas, and its working groups.  Note that
24	   other groups may also distribute working documents as
25	   Internet-Drafts.

27	   Internet-Drafts are draft documents valid for a maximum of six months
28	   and may be updated, replaced, or obsoleted by other documents at any
29	   time.  It is inappropriate to use Internet-Drafts as reference
30	   material or to cite them other than as "work in progress."

32	   The list of current Internet-Drafts can be accessed at
33	   <http://www.ietf.org/ietf/1id-abstracts.txt>.

35	   The list of Internet-Draft Shadow Directories can be accessed at
36	   <http://www.ietf.org/shadow.html>.

38	Copyright Notice

40	   Copyright (C) The Internet Society (2004).

42	Abstract

44	   A Uniform Resource Identifier (URI) is a compact sequence of
45	   characters for identifying an abstract or physical resource.  This
46	   specification defines the generic URI syntax and a process for
47	   resolving URI references that might be in relative form, along with
48	   guidelines and security considerations for the use of URIs on the
49	   Internet.  The URI syntax defines a grammar that is a superset of all
50	   valid URIs, such that an implementation can parse the common
51	   components of a URI reference without knowing the scheme-specific
52	   requirements of every possible identifier.  This specification does
53	   not define a generative grammar for URIs; that task is performed by
54	   the individual specifications of each URI scheme.

56	Editorial Note

58	   Discussion of this draft and comments to the editors should be sent
59	   to the uri@w3.org mailing list.  An issues list and version history
60	   are available at <http://gbiv.com/protocols/uri/rev-2002/issues.html>.

62	Table of Contents

64	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
65	     1.1   Overview of URIs . . . . . . . . . . . . . . . . . . . . .  4
66	       1.1.1   Generic Syntax . . . . . . . . . . . . . . . . . . . .  6
67	       1.1.2   Examples . . . . . . . . . . . . . . . . . . . . . . .  7
68	       1.1.3   URI, URL, and URN  . . . . . . . . . . . . . . . . . .  7
69	     1.2   Design Considerations  . . . . . . . . . . . . . . . . . .  7
70	       1.2.1   Transcription  . . . . . . . . . . . . . . . . . . . .  7
71	       1.2.2   Separating Identification from Interaction . . . . . .  9
72	       1.2.3   Hierarchical Identifiers . . . . . . . . . . . . . . . 10
73	     1.3   Syntax Notation  . . . . . . . . . . . . . . . . . . . . . 11
74	   2.  Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11
75	     2.1   Percent-Encoding . . . . . . . . . . . . . . . . . . . . . 12
76	     2.2   Reserved Characters  . . . . . . . . . . . . . . . . . . . 12
77	     2.3   Unreserved Characters  . . . . . . . . . . . . . . . . . . 13
78	     2.4   When to Encode or Decode . . . . . . . . . . . . . . . . . 13
79	     2.5   Identifying Data . . . . . . . . . . . . . . . . . . . . . 14
80	   3.  Syntax Components  . . . . . . . . . . . . . . . . . . . . . . 16
81	     3.1   Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 16
82	     3.2   Authority  . . . . . . . . . . . . . . . . . . . . . . . . 17
83	       3.2.1   User Information . . . . . . . . . . . . . . . . . . . 18
84	       3.2.2   Host . . . . . . . . . . . . . . . . . . . . . . . . . 18
85	       3.2.3   Port . . . . . . . . . . . . . . . . . . . . . . . . . 21
86	     3.3   Path . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
87	     3.4   Query  . . . . . . . . . . . . . . . . . . . . . . . . . . 23
88	     3.5   Fragment . . . . . . . . . . . . . . . . . . . . . . . . . 24
89	   4.  Usage  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
90	     4.1   URI Reference  . . . . . . . . . . . . . . . . . . . . . . 25
91	     4.2   Relative Reference . . . . . . . . . . . . . . . . . . . . 26
92	     4.3   Absolute URI . . . . . . . . . . . . . . . . . . . . . . . 26
93	     4.4   Same-document Reference  . . . . . . . . . . . . . . . . . 27
94	     4.5   Suffix Reference . . . . . . . . . . . . . . . . . . . . . 27

96	   5.  Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28
97	     5.1   Establishing a Base URI  . . . . . . . . . . . . . . . . . 28
98	       5.1.1   Base URI Embedded in Content . . . . . . . . . . . . . 29
99	       5.1.2   Base URI from the Encapsulating Entity . . . . . . . . 29
100	       5.1.3   Base URI from the Retrieval URI  . . . . . . . . . . . 30
101	       5.1.4   Default Base URI . . . . . . . . . . . . . . . . . . . 30
102	     5.2   Relative Resolution  . . . . . . . . . . . . . . . . . . . 30
103	       5.2.1   Pre-parse the Base URI . . . . . . . . . . . . . . . . 30
104	       5.2.2   Transform References . . . . . . . . . . . . . . . . . 31
105	       5.2.3   Merge Paths  . . . . . . . . . . . . . . . . . . . . . 32
106	       5.2.4   Remove Dot Segments  . . . . . . . . . . . . . . . . . 32
107	     5.3   Component Recomposition  . . . . . . . . . . . . . . . . . 34
108	     5.4   Reference Resolution Examples  . . . . . . . . . . . . . . 34
109	       5.4.1   Normal Examples  . . . . . . . . . . . . . . . . . . . 35
110	       5.4.2   Abnormal Examples  . . . . . . . . . . . . . . . . . . 35
111	   6.  Normalization and Comparison . . . . . . . . . . . . . . . . . 36
112	     6.1   Equivalence  . . . . . . . . . . . . . . . . . . . . . . . 37
113	     6.2   Comparison Ladder  . . . . . . . . . . . . . . . . . . . . 37
114	       6.2.1   Simple String Comparison . . . . . . . . . . . . . . . 38
115	       6.2.2   Syntax-based Normalization . . . . . . . . . . . . . . 39
116	       6.2.3   Scheme-based Normalization . . . . . . . . . . . . . . 40
117	       6.2.4   Protocol-based Normalization . . . . . . . . . . . . . 41
118	   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 41
119	     7.1   Reliability and Consistency  . . . . . . . . . . . . . . . 41
120	     7.2   Malicious Construction . . . . . . . . . . . . . . . . . . 42
121	     7.3   Back-end Transcoding . . . . . . . . . . . . . . . . . . . 42
122	     7.4   Rare IP Address Formats  . . . . . . . . . . . . . . . . . 43
123	     7.5   Sensitive Information  . . . . . . . . . . . . . . . . . . 44
124	     7.6   Semantic Attacks . . . . . . . . . . . . . . . . . . . . . 44
125	   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 45
126	   9.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 45
127	   10.   References . . . . . . . . . . . . . . . . . . . . . . . . . 46
128	   10.1  Normative References . . . . . . . . . . . . . . . . . . . . 46
129	   10.2  Informative References . . . . . . . . . . . . . . . . . . . 46
130	       Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 48
131	   A.  Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49
132	   B.  Parsing a URI Reference with a Regular Expression  . . . . . . 51
133	   C.  Delimiting a URI in Context  . . . . . . . . . . . . . . . . . 52
134	   D.  Changes from RFC 2396  . . . . . . . . . . . . . . . . . . . . 53
135	     D.1   Additions  . . . . . . . . . . . . . . . . . . . . . . . . 53
136	     D.2   Modifications  . . . . . . . . . . . . . . . . . . . . . . 54
137	   E.  Instructions to RFC Editor . . . . . . . . . . . . . . . . . . 56
138	       Index  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
139	       Intellectual Property and Copyright Statements . . . . . . . . 61

141	1.  Introduction

143	   A Uniform Resource Identifier (URI) provides a simple and extensible
144	   means for identifying a resource.  This specification of URI syntax
145	   and semantics is derived from concepts introduced by the World Wide
146	   Web global information initiative, whose use of such identifiers
147	   dates from 1990 and is described in "Universal Resource Identifiers
148	   in WWW" [RFC1630], and is designed to meet the recommendations laid
149	   out in "Functional Recommendations for Internet Resource Locators"
150	   [RFC1736] and "Functional Requirements for Uniform Resource Names"
151	   [RFC1737].

153	   This document obsoletes [RFC2396], which merged "Uniform Resource
154	   Locators" [RFC1738] and "Relative Uniform Resource Locators"
155	   [RFC1808] in order to define a single, generic syntax for all URIs.
156	   It contains the updates from, and obsoletes, [RFC2732], which
157	   introduced syntax for IPv6 addresses.  It excludes those portions of
158	   RFC 1738 that defined the specific syntax of individual URI schemes;
159	   those portions will be updated as separate documents.  The process
160	   for registration of new URI schemes is defined separately by [BCP35].
161	   Advice for designers of new URI schemes can be found in [RFC2718].

163	   All significant changes from RFC 2396 are noted in Appendix D.

165	   This specification uses the terms "character" and "coded character
166	   set" in accordance with the definitions provided in [BCP19], and
167	   "character encoding" in place of what [BCP19] refers to as a
168	   "charset".

170	1.1  Overview of URIs

172	   URIs are characterized as follows:

174	   Uniform

176	      Uniformity provides several benefits: it allows different types of
177	      resource identifiers to be used in the same context, even when the
178	      mechanisms used to access those resources may differ; it allows
179	      uniform semantic interpretation of common syntactic conventions
180	      across different types of resource identifiers; it allows
181	      introduction of new types of resource identifiers without
182	      interfering with the way that existing identifiers are used; and
183	      it allows the identifiers to be reused in many different contexts,
184	      thus permitting new applications or protocols to leverage a
185	      pre-existing, large, and widely used set of resource identifiers.

187	   Resource

189	      This specification does not limit the scope of what might be a
190	      resource; rather, the term "resource" is used in a general sense
191	      for whatever might be identified by a URI.  Familiar examples
192	      include an electronic document, an image, a source of information
193	      with consistent purpose (e.g., "today's weather report for Los
194	      Angeles"), a service (e.g., an HTTP to SMS gateway), a collection
195	      of other resources, and so on.  A resource is not necessarily
196	      accessible via the Internet; e.g., human beings, corporations, and
197	      bound books in a library can also be resources.  Likewise,
198	      abstract concepts can be resources, such as the operators and
199	      operands of a mathematical equation, the types of a relationship
200	      (e.g., "parent" or "employee"), or numeric values (e.g., zero,
201	      one, and infinity).

203	   Identifier

205	      An identifier embodies the information required to distinguish
206	      what is being identified from all other things within its scope of
207	      identification.  Our use of the terms "identify" and "identifying"
208	      refers to this purpose of distinguishing one resource from all
209	      other resources, regardless of how that purpose is accomplished
210	      (e.g., by name, address, context, etc.).  These terms should not
211	      be mistaken as an assumption that an identifier defines or
212	      embodies the identity of what is referenced, though that may be
213	      the case for some identifiers.  Nor should it be assumed that a
214	      system using URIs will access the resource identified: in many
215	      cases, URIs are used to denote resources without any intention
216	      that they be accessed.  Likewise, the "one" resource identified
217	      might not be singular in nature (e.g., a resource might be a named
218	      set or a mapping that varies over time).

220	   A URI is an identifier, consisting of a sequence of characters
221	   matching the syntax rule named <URI> in Section 3, that enables
222	   uniform identification of resources via a separately defined,
223	   extensible set of naming schemes (Section 3.1).  How that
224	   identification is accomplished, assigned, or enabled is delegated to
225	   each scheme specification.

227	   This specification does not place any limits on the nature of a
228	   resource, the reasons why an application might wish to refer to a
229	   resource, or the kinds of systems that might use URIs for the sake of
230	   identifying resources.  This specification does not require that a
231	   URI persists in identifying the same resource over all time, though
232	   that is a common goal of all URI schemes.  Nevertheless, nothing in
233	   this specification prevents an application from limiting itself to
234	   particular types of resources, or to a subset of URIs that maintains
235	   characteristics desired by that application.

237	   URIs have a global scope and are interpreted consistently regardless
238	   of context, though the result of that interpretation may be in
239	   relation to the end-user's context.  For example, "http://localhost/"
240	   has the same interpretation for every user of that reference, even
241	   though the network interface corresponding to "localhost" may be
242	   different for each end-user: interpretation is independent of access.
243	   However, an action made on the basis of that reference will take
244	   place in relation to the end-user's context, which implies that an
245	   action intended to refer to a single, globally unique thing must use
246	   a URI that distinguishes that resource from all other things.  URIs
247	   that identify in relation to the end-user's local context should only
248	   be used when the context itself is a defining aspect of the resource,
249	   such as when an on-line help manual refers to a file on the
250	   end-user's filesystem (e.g., "file:///etc/hosts").

252	1.1.1  Generic Syntax

254	   Each URI begins with a scheme name, as defined in Section 3.1, that
255	   refers to a specification for assigning identifiers within that
256	   scheme.  As such, the URI syntax is a federated and extensible naming
257	   system wherein each scheme's specification may further restrict the
258	   syntax and semantics of identifiers using that scheme.

260	   This specification defines those elements of the URI syntax that are
261	   required of all URI schemes or are common to many URI schemes.  It
262	   thus defines the syntax and semantics that are needed to implement a
263	   scheme-independent parsing mechanism for URI references, such that
264	   the scheme-dependent handling of a URI can be postponed until the
265	   scheme-dependent semantics are needed.  Likewise, protocols and data
266	   formats that make use of URI references can refer to this
267	   specification as defining the range of syntax allowed for all URIs,
268	   including those schemes that have yet to be defined, thus decoupling
269	   the evolution of identification schemes from the evolution of
270	   protocols, data formats, and implementations that make use of URIs.

272	   A parser of the generic URI syntax is capable of parsing any URI
273	   reference into its major components; once the scheme is determined,
274	   further scheme-specific parsing can be performed on the components.
275	   In other words, the URI generic syntax is a superset of the syntax of
276	   all URI schemes.
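
   As an illustrative sketch (not part of this specification), the
   scheme-independent parsing described above might look as follows in
   Python.  The regular expression is the one given in Appendix B of
   RFC 2396, which Appendix B of this document is assumed to carry
   forward unchanged; the function name split_uri_reference is
   hypothetical.

```python
import re

# Regular expression from Appendix B of RFC 2396 (assumed unchanged here).
# It splits any URI reference into its five generic components.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Return (scheme, authority, path, query, fragment).

    A component that is absent from the reference is returned as None;
    the path is always present, though it may be the empty string.
    """
    match = URI_REFERENCE.match(reference)
    # Groups 2, 4, 5, 7, and 9 hold the components without delimiters.
    return match.group(2, 4, 5, 7, 9)


if __name__ == "__main__":
    print(split_uri_reference("http://www.ietf.org/rfc/rfc2396.txt"))
    print(split_uri_reference("mailto:John.Doe@example.com"))
```

   A caller would hand the resulting components to scheme-specific
   code only after inspecting the scheme, which keeps the generic
   parser unaware of any individual scheme's rules.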

278	1.1.2  Examples

280	   The following example URIs illustrate several URI schemes and
281	   variations in their common syntax components:

283	      ftp://ftp.is.co.za/rfc/rfc1808.txt

285	      http://www.ietf.org/rfc/rfc2396.txt

287	      ldap://[2001:db8::7]/c=GB?objectClass?one

289	      mailto:John.Doe@example.com

291	      news:comp.infosystems.www.servers.unix

293	      tel:+1-816-555-1212

295	      telnet://192.0.2.16:80/

297	      urn:oasis:names:specification:docbook:dtd:xml:4.1.2
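
   The variation among these examples can be seen by running them
   through a generic parser.  The sketch below uses urlsplit from
   Python's standard urllib.parse module, which is a choice made for
   illustration and is not referenced by this specification; note that
   urlsplit reports an absent authority as an empty string.

```python
from urllib.parse import urlsplit

# The example URIs listed in this section.
EXAMPLES = [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "http://www.ietf.org/rfc/rfc2396.txt",
    "ldap://[2001:db8::7]/c=GB?objectClass?one",
    "mailto:John.Doe@example.com",
    "news:comp.infosystems.www.servers.unix",
    "tel:+1-816-555-1212",
    "telnet://192.0.2.16:80/",
    "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
]


def summarize(uri):
    """Return (scheme, authority, path, query) for one URI."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.netloc, parts.path, parts.query)


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(summarize(uri))
```

   Only the "ftp", "http", "ldap", and "telnet" examples carry an
   authority component; in the others, everything after the scheme
   delimiter is returned as the path.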

299	1.1.3  URI, URL, and URN

301	   A URI can be further classified as a locator, a name, or both.  The
302	   term "Uniform Resource Locator" (URL) refers to the subset of URIs
303	   that, in addition to identifying a resource, provide a means of
304	   locating the resource by describing its primary access mechanism
305	   (e.g., its network "location").  The term "Uniform Resource Name"
306	   (URN) has been used historically to refer both to URIs under the
307	   "urn" scheme [RFC2141], which are required to remain globally unique
308	   and persistent even when the resource ceases to exist or becomes
309	   unavailable, and to any other URI with the properties of a name.

311	   An individual scheme does not need to be classified as being just one
312	   of "name" or "locator".  Instances of URIs from any given scheme may
313	   have the characteristics of names or locators or both, often
314	   depending on the persistence and care in the assignment of
315	   identifiers by the naming authority, rather than any quality of the
316	   scheme.  Future specifications and related documentation should use
317	   the general term "URI", rather than the more restrictive terms URL
318	   and URN [RFC3305].

320	1.2  Design Considerations

322	1.2.1  Transcription

324	   The URI syntax has been designed with global transcription as one of
325	   its main considerations.  A URI is a sequence of characters from a
326	   very limited set: the letters of the basic Latin alphabet, digits,
327	   and a few special characters.  A URI may be represented in a variety
328	   of ways: e.g., ink on paper, pixels on a screen, or a sequence of
329	   character encoding octets.  The interpretation of a URI depends only
330	   on the characters used and not how those characters are represented
331	   in a network protocol.

333	   The goal of transcription can be described by a simple scenario.
334	   Imagine two colleagues, Sam and Kim, sitting in a pub at an
335	   international conference and exchanging research ideas.  Sam asks Kim
336	   for a location to get more information, so Kim writes the URI for the
337	   research site on a napkin.  Upon returning home, Sam takes out the
338	   napkin and types the URI into a computer, which then retrieves the
339	   information to which Kim referred.

341	   There are several design considerations revealed by the scenario:

343	   o  A URI is a sequence of characters that is not always represented
344	      as a sequence of octets.

346	   o  A URI might be transcribed from a non-network source, and thus
347	      should consist of characters that can most likely be entered into
348	      a computer, within the constraints imposed by
349	      keyboards (and related input devices) across languages and
350	      locales.

352	   o  A URI often needs to be remembered by people, and it is easier for
353	      people to remember a URI when it consists of meaningful or
354	      familiar components.

356	   These design considerations are not always in alignment.  For
357	   example, it is often the case that the most meaningful name for a URI
358	   component would require characters that cannot be typed into some
359	   systems.  The ability to transcribe a resource identifier from one
360	   medium to another has been considered more important than having a
361	   URI consist of the most meaningful of components.

363	   In local or regional contexts and with improving technology, users
364	   might benefit from being able to use a wider range of characters;
365	   such use is not defined by this specification.  Percent-encoded
366	   octets (Section 2.1) may be used within a URI to represent characters
367	   outside the range of the US-ASCII coded character set if such
368	   representation is allowed by the scheme or by the protocol element in
369	   which the URI is referenced; such a definition should specify the
370	   character encoding used to map those characters to octets prior to
371	   being percent-encoded for the URI.
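
   The following sketch shows why the character encoding has to be
   specified.  It uses quote and unquote from Python's urllib.parse
   module (an illustrative choice, not something defined here); the
   sample word and the helper name percent_encode are assumptions
   made for the example.

```python
from urllib.parse import quote, unquote


def percent_encode(text, encoding):
    """Map text to octets using `encoding`, then percent-encode them."""
    # safe="" means every character outside the unreserved set is encoded.
    return quote(text, safe="", encoding=encoding)


if __name__ == "__main__":
    word = "r\u00e9sum\u00e9"  # contains U+00E9, outside US-ASCII
    as_utf8 = percent_encode(word, "utf-8")
    as_latin1 = percent_encode(word, "iso-8859-1")
    print(as_utf8)    # two octets per accented character
    print(as_latin1)  # one octet per accented character
    # Decoding only round-trips when the same encoding is assumed.
    print(unquote(as_latin1, encoding="iso-8859-1") == word)
    print(unquote(as_latin1, encoding="utf-8") == word)
```

   The same characters yield different percent-encoded octets under
   different encodings, so a recipient that assumes the wrong one
   cannot recover the original text.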

373	1.2.2  Separating Identification from Interaction

375	   A common misunderstanding of URIs is that they are only used to refer
376	   to accessible resources.  In fact, the URI alone only provides
377	   identification; access to the resource is neither guaranteed nor
378	   implied by the presence of a URI.  Instead, an operation (if any)
379	   associated with a URI reference is defined by the protocol element,
380	   data format attribute, or natural language text in which it appears.

382	   Given a URI, a system may attempt to perform a variety of operations
383	   on the resource, as might be characterized by such words as "access",
384	   "update", "replace", or "find attributes".  Such operations are
385	   defined by the protocols that make use of URIs, not by this
386	   specification.  However, we do use a few general terms for describing
387	   common operations on URIs.  URI "resolution" is the process of
388	   determining an access mechanism and the appropriate parameters
389	   necessary to dereference a URI; such resolution may require several
390	   iterations.  To use that access mechanism to perform an action on the
391	   URI's resource is to "dereference" the URI.

393	   When URIs are used within information retrieval systems to identify
394	   sources of information, the most common form of URI dereference is
395	   "retrieval": making use of a URI in order to retrieve a
396	   representation of its associated resource.  A "representation" is a
397	   sequence of octets, along with representation metadata describing
398	   those octets, that constitutes a record of the state of the resource
399	   at the time that the representation is generated.  Retrieval is
400	   achieved by a process that might include using the URI as a cache key
401	   to check for a locally cached representation, resolution of the URI
402	   to determine an appropriate access mechanism (if any), and
403	   dereference of the URI for the sake of applying a retrieval
404	   operation.  Depending on the protocols used to perform the retrieval,
405	   additional information might be supplied about the resource (resource
406	   metadata) and its relation to other resources.
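
   The retrieval process outlined above can be sketched as three
   steps.  Everything in this example is hypothetical: the function
   names, the dictionary-based cache, and in particular the choice of
   an access mechanism from the scheme name alone, which is a
   simplifying assumption rather than a rule of this specification.

```python
from urllib.parse import urlsplit


def retrieve(uri, cache, mechanisms):
    """Return an (octets, metadata) representation for uri, or None.

    cache      -- dict mapping URI strings to stored representations
    mechanisms -- dict mapping scheme names to callables (hypothetical)
    """
    # Step 1: use the URI itself as the cache key.
    if uri in cache:
        return cache[uri]

    # Step 2: resolution. Pick an access mechanism, if there is one.
    mechanism = mechanisms.get(urlsplit(uri).scheme)
    if mechanism is None:
        # The URI still identifies a resource; it cannot be retrieved here.
        return None

    # Step 3: dereference by applying the retrieval operation.
    representation = mechanism(uri)
    cache[uri] = representation
    return representation


def fake_http_get(uri):
    """Stand-in for a real retrieval; performs no network access."""
    return (b"<html>...</html>", {"content-type": "text/html"})


if __name__ == "__main__":
    cache = {}
    mechanisms = {"http": fake_http_get}
    print(retrieve("http://www.example.com/", cache, mechanisms))
    print(retrieve("urn:example:no-access-mechanism", cache, mechanisms))
```

   Returning None for a URI with no known access mechanism reflects
   the point made earlier in this section: identification does not
   imply that retrieval is possible.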

408	   URI references in information retrieval systems are designed to be
409	   late-binding: the result of an access is generally determined at the
410	   time it is accessed and may vary over time or due to other aspects of
411	   the interaction.  Such references are created in order to be used in
412	   the future: what is being identified is not some specific result that
413	   was obtained in the past, but rather some characteristic that is
414	   expected to be true for future results.  In such cases, the resource
415	   referred to by the URI is actually a sameness of characteristics as
416	   observed over time, perhaps elucidated by additional comments or
417	   assertions made by the resource provider.

419	   Although many URI schemes are named after protocols, this does not
420	   imply that use of such a URI will result in access to the resource
421	   via the named protocol.  URIs are often used simply for the sake of
422	   identification.  Even when a URI is used to retrieve a representation
423	   of a resource, that access might be through gateways, proxies,
424	   caches, and name resolution services that are independent of the
425	   protocol associated with the scheme name, and the resolution of some
426	   URIs may require the use of more than one protocol (e.g., both DNS
427	   and HTTP are typically used to access an "http" URI's origin server
428	   when a representation isn't found in a local cache).

430	1.2.3  Hierarchical Identifiers

432	   The URI syntax is organized hierarchically, with components listed in
433	   order of decreasing significance from left to right.  For some URI
434	   schemes, the visible hierarchy is limited to the scheme itself:
435	   everything after the scheme component delimiter (":") is considered
436	   opaque to URI processing.  Other URI schemes make the hierarchy
437	   explicit and visible to generic parsing algorithms.

439	   The generic syntax uses the slash ("/"), question mark ("?"), and
440	   number sign ("#") characters for the purpose of delimiting components
441	   that are significant to the generic parser's hierarchical
442	   interpretation of an identifier.  In addition to aiding the
443	   readability of such identifiers through the consistent use of
444	   familiar syntax, this uniform representation of hierarchy across
445	   naming schemes allows scheme-independent references to be made
446	   relative to that hierarchy.

448	   It is often the case that a group or "tree" of documents has been
449	   constructed to serve a common purpose, wherein the vast majority of
450	   URI references in these documents point to resources within the tree
451	   rather than outside of it.  Similarly, documents located at a
452	   particular site are much more likely to refer to other resources at
453	   that site than to resources at remote sites.  Relative referencing of
454	   URIs allows document trees to be partially independent of their
455	   location and access scheme.  For instance, it is possible for a
456	   single set of hypertext documents to be simultaneously accessible and
457	   traversable via each of the "file", "http", and "ftp" schemes if the
458	   documents refer to each other using relative references.
459	   Furthermore, such document trees can be moved, as a whole, without
460	   changing any of the relative references.
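
   As a rough illustration of the location independence described
   above, the following sketch resolves one relative reference against
   the same document tree published under three schemes.  The host
   name, paths, and variable names are hypothetical, and Python's
   urllib.parse.urljoin stands in for the resolution algorithm of
   Section 5.

```python
from urllib.parse import urljoin

# Hypothetical locations of one document tree under three schemes.
BASES = [
    "http://example.com/docs/guide/index.html",
    "ftp://example.com/docs/guide/index.html",
    "file:///docs/guide/index.html",
]

# The same relative reference, as it might appear inside index.html.
REFERENCE = "../img/logo.png"

# Each target keeps the scheme and authority of its own base.
targets = [urljoin(base, REFERENCE) for base in BASES]
for target in targets:
    print(target)
```

   The reference text never changes; only the base URI differs.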

462	   A relative reference (Section 4.2) refers to a resource by describing
463	   the difference within a hierarchical name space between the reference
464	   context and the target URI.  The reference resolution algorithm,
465	   presented in Section 5, defines how such a reference is transformed
466	   to the target URI.  Since relative references can only be used within
467	   the context of a hierarchical URI, designers of new URI schemes
468	   should use a syntax consistent with the generic syntax's hierarchical
469	   components unless there are compelling reasons to forbid relative
470	   referencing within that scheme.

472	      NOTE: Previous specifications used the terms "partial URI" and
473	      "relative URI" to denote a relative reference to a URI.  Since
474	      some readers misunderstood those terms to mean that relative URIs
475	      are a subset of URIs, rather than a method of referencing URIs,
476	      this specification simply refers to them as relative references.

478	   All URI references are parsed by generic syntax parsers when used.
479	   However, since hierarchical processing has no effect on an absolute
480	   URI used in a reference unless it contains one or more dot-segments
481	   (complete path segments of "." or "..", as described in Section 3.3),
482	   URI scheme specifications can define opaque identifiers by
483	   disallowing use of slash characters, question mark characters, and
484	   the URIs "scheme:." and "scheme:..".

486	1.3  Syntax Notation

488	   This specification uses the Augmented Backus-Naur Form (ABNF)
489	   notation of [RFC2234], including the following core ABNF syntax rules
490	   defined by that specification: ALPHA (letters), CR (carriage return),
491	   DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
492	   digits), LF (line feed), and SP (space).  The complete URI syntax is
493	   collected in Appendix A.

495	2.  Characters

497	   The URI syntax provides a method of encoding data, presumably for the
498	   sake of identifying a resource, as a sequence of characters.  The URI
499	   characters are, in turn, frequently encoded as octets for transport
500	   or presentation.  This specification does not mandate any particular
501	   character encoding for mapping between URI characters and the octets
502	   used to store or transmit those characters.  When a URI appears in a
503	   protocol element, the character encoding is defined by that protocol;
504	   absent such a definition, a URI is assumed to be in the same
505	   character encoding as the surrounding text.

507	   The ABNF notation defines its terminal values to be non-negative
508	   integers (codepoints) based on the US-ASCII coded character set
509	   [ASCII].  Since a URI is a sequence of characters, we must invert
510	   that relation in order to understand the URI syntax.  Therefore, the
511	   integer values used by the ABNF must be mapped back to their
512	   corresponding characters via US-ASCII in order to complete the syntax
513	   rules.

515	   A URI is composed from a limited set of characters consisting of
516	   digits, letters, and a few graphic symbols.  A reserved subset of
517	   those characters may be used to delimit syntax components within a
518	   URI, while the remaining characters, including both the unreserved
519	   set and those reserved characters not acting as delimiters, define
520	   each component's identifying data.

522	2.1  Percent-Encoding

524	   A percent-encoding mechanism is used to represent a data octet in a
525	   component when that octet's corresponding character is outside the
526	   allowed set or is being used as a delimiter of, or within, the
527	   component.  A percent-encoded octet is encoded as a character
528	   triplet, consisting of the percent character "%" followed by the two
529	   hexadecimal digits representing that octet's numeric value.  For
530	   example, "%20" is the percent-encoding for the binary octet
531	   "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
532	   character (SP).  Section 2.4 describes when percent-encoding and
533	   decoding are applied.

535	      pct-encoded = "%" HEXDIG HEXDIG

537	   The uppercase hexadecimal digits 'A' through 'F' are equivalent to
538	   the lowercase digits 'a' through 'f', respectively.  Two URIs that
539	   differ only in the case of hexadecimal digits used in percent-encoded
540	   octets are equivalent.  For consistency, URI producers and
541	   normalizers should use uppercase hexadecimal digits for all
542	   percent-encodings.
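
   A minimal sketch of the pct-encoded rule and of the uppercase
   recommendation follows.  The function names are illustrative
   assumptions and are not defined by this specification.

```python
import re

HEX_UPPER = "0123456789ABCDEF"

def pct_encode_octet(octet: int) -> str:
    """Encode one octet as "%" HEXDIG HEXDIG, using uppercase digits."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet must be in the range 0-255")
    return "%" + HEX_UPPER[octet >> 4] + HEX_UPPER[octet & 0x0F]

# Matches one percent-encoded triplet, in either hexadecimal case.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def uppercase_pct_hex(uri: str) -> str:
    """Rewrite every percent-encoded triplet to use uppercase hex digits."""
    return _PCT.sub(lambda m: "%" + m.group(1).upper(), uri)

print(pct_encode_octet(0x20))
print(uppercase_pct_hex("a%2fb%c3%80"))
```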

544	2.2  Reserved Characters

546	   URIs include components and subcomponents that are delimited by
547	   characters in the "reserved" set.  These characters are called
548	   "reserved" because they may (or may not) be defined as delimiters by
549	   the generic syntax, by each scheme-specific syntax, or by the
550	   implementation-specific syntax of a URI's dereferencing algorithm.
551	   If data for a URI component would conflict with a reserved
552	   character's purpose as a delimiter, then the conflicting data must be
553	   percent-encoded before forming the URI.

555	      reserved    = gen-delims / sub-delims

557	      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

559	      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
560	                  / "*" / "+" / "," / ";" / "="

562	   The purpose of reserved characters is to provide a set of delimiting
563	   characters that are distinguishable from other data within a URI.
564	   URIs that differ in the replacement of a reserved character with its
565	   corresponding percent-encoded octet are not equivalent.
566	   Percent-encoding a reserved character, or decoding a percent-encoded
567	   octet that corresponds to a reserved character, will change how the
568	   URI is interpreted by most applications.  Thus, characters in the
569	   reserved set are protected from normalization and are therefore safe
570	   to be used by scheme-specific and producer-specific algorithms for
571	   delimiting data subcomponents within a URI.

573	   A subset of the reserved characters (gen-delims) is used as
574	   delimiters of the generic URI components described in Section 3.  A
575	   component's ABNF syntax rule will not use the reserved or gen-delims
576	   rule names directly; instead, each syntax rule lists the characters
577	   allowed within that component (i.e., not delimiting it) and any of
578	   those characters that are also in the reserved set are "reserved" for
579	   use as subcomponent delimiters within the component.  Only the most
580	   common subcomponents are defined by this specification; other
581	   subcomponents may be defined by a URI scheme's specification, or by
582	   the implementation-specific syntax of a URI's dereferencing
583	   algorithm, provided that such subcomponents are delimited by
584	   characters in the reserved set allowed within that component.

586	   URI producing applications should percent-encode data octets that
587	   correspond to characters in the reserved set.  However, if a reserved
588	   character is found in a URI component and no delimiting role is known
589	   for that character, then it should be interpreted as representing the
590	   data octet corresponding to that character's encoding in US-ASCII.

592	2.3  Unreserved Characters

594	   Characters that are allowed in a URI but do not have a reserved
595	   purpose are called unreserved.  These include uppercase and lowercase
596	   letters, decimal digits, hyphen, period, underscore, and tilde.

598	      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

600	   URIs that differ in the replacement of an unreserved character with
601	   its corresponding percent-encoded US-ASCII octet are equivalent: they
602	   identify the same resource.  However, URI comparison implementations
603	   do not always perform normalization prior to comparison (see Section 6).
604	   For consistency, percent-encoded octets in the ranges of ALPHA
605	   (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
606	   underscore (%5F), or tilde (%7E) should not be created by URI
607	   producers and, when found in a URI, should be decoded to their
608	   corresponding unreserved character by URI normalizers.
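
   One way a normalizer might apply this rule is sketched below.  It
   decodes only those triplets whose octet corresponds to an unreserved
   character and leaves every other triplet untouched.  The function
   name decode_unreserved is an assumption made for this example.

```python
import re

# The unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that map to unreserved characters."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        # Reserved and other octets stay percent-encoded.
        return char if char in UNRESERVED else match.group(0)
    return _PCT.sub(repl, uri)

print(decode_unreserved("/%7Euser/a%2Fb%41"))
```

   Because the substitution is a single pass, the output of one
   replacement is never rescanned, so "%2541" is left unchanged.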

610	2.4  When to Encode or Decode

612	   Under normal circumstances, the only time that octets within a URI
613	   are percent-encoded is during the process of producing the URI from
614	   its component parts.  It is during that process that an
615	   implementation determines which of the reserved characters are to be
616	   used as subcomponent delimiters and which can be safely used as data.
617	   Once produced, a URI is always in its percent-encoded form.

619	   When a URI is dereferenced, the components and subcomponents
620	   significant to the scheme-specific dereferencing process (if any)
621	   must be parsed and separated before the percent-encoded octets within
622	   those components can be safely decoded, since otherwise the data may
623	   be mistaken for component delimiters.  The only exception is for
624	   percent-encoded octets corresponding to characters in the unreserved
625	   set, which can be decoded at any time.  For example, the octet
626	   corresponding to the tilde ("~") character is often encoded as "%7E"
627	   by older URI processing implementations; the "%7E" can be replaced by
628	   "~" without changing its interpretation.
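
   The ordering requirement can be illustrated with a path whose first
   segment contains an encoded slash.  This is only a sketch; the two
   helper names are hypothetical, and Python's urllib.parse.unquote is
   used as the decoder.

```python
from urllib.parse import unquote

def split_then_decode(path: str) -> list:
    """Separate the segments first, then decode each one (safe order)."""
    return [unquote(segment) for segment in path.split("/")]

def decode_then_split(path: str) -> list:
    """Decode first, then split: the encoded "/" becomes a delimiter."""
    return unquote(path).split("/")

path = "/a%2Fb/c"
print(split_then_decode(path))   # the segment "a/b" survives intact
print(decode_then_split(path))   # the data slash is mistaken for a delimiter
```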

630	   Because the percent ("%") character serves as the indicator for
631	   percent-encoded octets, it must be percent-encoded as "%25" in order
632	   for that octet to be used as data within a URI.  Implementations must
633	   not percent-encode or decode the same string more than once, since
634	   decoding an already decoded string might lead to misinterpreting a
635	   percent data octet as the beginning of a percent-encoding, or vice
636	   versa in the case of percent-encoding an already percent-encoded
637	   string.
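
   A short sketch of both failure modes, assuming the three-character
   data string "%41" is to be carried literally in a URI component.
   The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

data = "%41"                               # literal data, not an encoding

encoded_once = quote(data, safe="")        # "%" becomes "%25"
encoded_twice = quote(encoded_once, safe="")

decoded_once = unquote(encoded_once)       # recovers the original data
decoded_twice = unquote(decoded_once)      # misreads the data as an "A"

print(encoded_once, encoded_twice)
print(decoded_once, decoded_twice)
```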

639	2.5  Identifying Data

641	   URI characters provide identifying data for each of the URI
642	   components, serving as an external interface for identification
643	   between systems.  Although the presence and nature of the URI
644	   production interface is hidden from clients that use its URIs, and
645	   thus beyond the scope of the interoperability requirements defined by
646	   this specification, it is a frequent source of confusion and errors
647	   in the interpretation of URI character issues.  Implementers need to
648	   be aware that there are multiple character encodings involved in the
649	   production and transmission of URIs: local name and data encoding,
650	   public interface encoding, URI character encoding, data format
651	   encoding, and protocol encoding.

653	   The first encoding of identifying data is the one in which the local
654	   names or data are stored.  URI producing applications (e.g., origin
655	   servers) will typically use the local encoding as the basis for
656	   producing meaningful names.  The URI producer will transform the
657	   local encoding to one that is suitable for a public interface, and
658	   then transform the public interface encoding into the restricted set
659	   of URI characters (reserved, unreserved, and percent-encodings).
660	   Those characters are, in turn, encoded as octets to be used as a
661	   reference within a data format (e.g., a document charset), and such
662	   data formats are often subsequently encoded for transmission over
663	   Internet protocols.

665	   For most systems, an unreserved character appearing within a URI
666	   component is interpreted as representing the data octet corresponding
667	   to that character's encoding in US-ASCII.  Consumers of URIs assume
668	   that the letter "X" corresponds to the octet "01011000", and there is
669	   no harm in making that assumption even when it is incorrect.  A
670	   system that internally provides identifiers in the form of a
671	   different character encoding, such as EBCDIC, will generally perform
672	   character translation of textual identifiers to UTF-8 [STD63] (or
673	   some other superset of the US-ASCII character encoding) at an
674	   internal interface, thereby providing more meaningful identifiers
675	   than simply percent-encoding the original octets.

677	   For example, consider an information service that provides data,
678	   stored locally using an EBCDIC-based filesystem, to clients on the
679	   Internet through an HTTP server.  When an author creates a file on
680	   that filesystem with the name "Laguna Beach", their expectation is
681	   that the "http" URI corresponding to that resource would also contain
682	   the meaningful string "Laguna%20Beach".  If, however, that server
683	   produces URIs using an overly-simplistic raw octet mapping, then the
684	   result would be a URI containing
685	   "%D3%81%87%A4%95%81@%C2%85%81%83%88".  An internal transcoding
686	   interface fixes that problem by transcoding the local name to a
687	   superset of US-ASCII prior to producing the URI.  Naturally, proper
688	   interpretation of an incoming URI on such an interface requires that
689	   percent-encoded octets be decoded (e.g., "%20" to SP) before the
690	   reverse transcoding is applied to obtain the local name.
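
   The difference between the two mappings can be reproduced with the
   sketch below.  It assumes the EBCDIC variant is code page 037
   (Python codec "cp037"), which the text above does not specify, and
   it uses UTF-8 as the US-ASCII superset for the transcoded form.

```python
from urllib.parse import quote, unquote_to_bytes

local_name = "Laguna Beach"

# Octets as stored locally (assumption: EBCDIC code page 037).
ebcdic_octets = local_name.encode("cp037")

# Overly simplistic raw octet mapping.  The EBCDIC space is octet 0x40,
# which reads as "@" in US-ASCII and is left unencoded here.
raw_mapping = quote(ebcdic_octets, safe="@")

# Transcode to a superset of US-ASCII first, then percent-encode.
transcoded = quote(local_name.encode("utf-8"), safe="")

# Incoming direction for the raw mapping: decode the percent-encoded
# octets first, then reverse the transcoding.
recovered = unquote_to_bytes(raw_mapping).decode("cp037")

print(raw_mapping)
print(transcoded)
print(recovered)
```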

692	   In some cases, the internal interface between a URI component and the
693	   identifying data that it has been crafted to represent is much less
694	   direct than a character encoding translation.  For example, portions
695	   of a URI might reflect a query on non-ASCII data, numeric coordinates
696	   on a map, etc.  Likewise, a URI scheme may define components with
697	   additional encoding requirements that are applied prior to forming
698	   the component and producing the URI.

700	   When a new URI scheme defines a component that represents textual
701	   data consisting of characters from the Unicode character set [UCS],
702	   the data should be encoded first as octets according to the UTF-8
703	   character encoding [STD63], and then only those octets that do not
704	   correspond to characters in the unreserved set should be
705	   percent-encoded.  For example, the character A would be represented
706	   as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
707	   represented as "%C3%80", and the character KATAKANA LETTER A would be
708	   represented as "%E3%82%A2".
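
   The three examples above can be checked with a small sketch that
   encodes text as UTF-8 and then percent-encodes every octet outside
   the unreserved set.  The function name encode_text is an assumption
   made for this example.

```python
# Octet values of the unreserved characters (Section 2.3).
UNRESERVED_OCTETS = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def encode_text(text: str) -> str:
    """UTF-8 encode, then percent-encode all non-unreserved octets."""
    pieces = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED_OCTETS:
            pieces.append(chr(octet))
        else:
            pieces.append("%{:02X}".format(octet))
    return "".join(pieces)

print(encode_text("A"))         # LATIN CAPITAL LETTER A
print(encode_text("\u00C0"))    # LATIN CAPITAL LETTER A WITH GRAVE
print(encode_text("\u30A2"))    # KATAKANA LETTER A
```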

710	3.  Syntax Components

712	   The generic URI syntax consists of a hierarchical sequence of
713	   components referred to as the scheme, authority, path, query, and
714	   fragment.

716	      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

718	      hier-part   = "//" authority path-abempty
719	                  / path-absolute
720	                  / path-rootless
721	                  / path-empty

723	   The scheme and path components are required, though path may be empty
724	   (no characters).  When authority is present, the path must either be
725	   empty or begin with a slash ("/") character.  When authority is not
726	   present, the path cannot begin with two slash characters ("//").
727	   These restrictions result in five different ABNF rules for a path
728	   (Section 3.3), only one of which will match any given URI reference.

730	   The following are two example URIs and their component parts:

732	         foo://example.com:8042/over/there?name=ferret#nose
733	         \_/   \______________/\_________/ \_________/ \__/
734	          |           |            |            |        |
735	       scheme     authority       path        query   fragment
736	          |   _____________________|__
737	         / \ /                        \
738	         urn:example:animal:ferret:nose
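
   The split shown in the diagram can be approximated with a
   non-validating regular expression that looks only at the generic
   delimiters.  This is a sketch: the names URI_RE and split_uri are
   assumptions, and the expression does not check the characters inside
   each component.

```python
import re

URI_RE = re.compile(
    r"^(?:([^:/?#]+):)?"    # scheme, up to the first ":"
    r"(?://([^/?#]*))?"     # authority, after "//"
    r"([^?#]*)"             # path (always defined, may be empty)
    r"(?:\?([^#]*))?"       # query, after "?"
    r"(?:#(.*))?$"          # fragment, after "#"
)

def split_uri(uri: str) -> dict:
    """Separate a URI into its five generic components."""
    names = ("scheme", "authority", "path", "query", "fragment")
    return dict(zip(names, URI_RE.match(uri).groups()))

print(split_uri("foo://example.com:8042/over/there?name=ferret#nose"))
print(split_uri("urn:example:animal:ferret:nose"))
```

   A component that is absent is reported as None, which keeps it
   distinct from a component that is present but empty.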

740	3.1  Scheme

742	   Each URI begins with a scheme name that refers to a specification for
743	   assigning identifiers within that scheme.  As such, the URI syntax is
744	   a federated and extensible naming system wherein each scheme's
745	   specification may further restrict the syntax and semantics of
746	   identifiers using that scheme.

748	   Scheme names consist of a sequence of characters beginning with a
749	   letter and followed by any combination of letters, digits, plus
750	   ("+"), period ("."), or hyphen ("-").  Although scheme is
751	   case-insensitive, the canonical form is lowercase and documents that
752	   specify schemes must do so using lowercase letters.  An
753	   implementation should accept uppercase letters as equivalent to
754	   lowercase in scheme names (e.g., allow "HTTP" as well as "http"), for
755	   the sake of robustness, but should only produce lowercase scheme
756	   names, for consistency.

758	      scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
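
   A possible check for the scheme rule, together with the lowercase
   normalization recommended above, is sketched here.  The name
   normalize_scheme is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

def normalize_scheme(scheme: str) -> str:
    """Accept either case on input, but produce only lowercase."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError("not a valid scheme name: %r" % scheme)
    return scheme.lower()

print(normalize_scheme("HTTP"))
print(normalize_scheme("svn+ssh"))
```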

760	   Individual schemes are not specified by this document.  The process
761	   for registration of new URI schemes is defined separately by [BCP35].
762	   The scheme registry maintains the mapping between scheme names and
763	   their specifications.  Advice for designers of new URI schemes can be
764	   found in [RFC2718].  URI scheme specifications must define their own
765	   syntax such that all strings matching their scheme-specific syntax
766	   will also match the <absolute-URI> grammar, as described in
767	   Section 4.3.

769	   When presented with a URI that violates one or more scheme-specific
770	   restrictions, the scheme-specific resolution process should flag the
771	   reference as an error rather than ignore the unused parts; doing so
772	   reduces the number of equivalent URIs and helps detect abuses of the
773	   generic syntax that might indicate the URI has been constructed to
774	   mislead the user (Section 7.6).

776	3.2  Authority

778	   Many URI schemes include a hierarchical element for a naming
779	   authority, such that governance of the name space defined by the
780	   remainder of the URI is delegated to that authority (which may, in
781	   turn, delegate it further).  The generic syntax provides a common
782	   means for distinguishing an authority based on a registered name or
783	   server address, along with optional port and user information.

785	   The authority component is preceded by a double slash ("//") and is
786	   terminated by the next slash ("/"), question mark ("?"), or number
787	   sign ("#") character, or by the end of the URI.

789	      authority   = [ userinfo "@" ] host [ ":" port ]

791	   URI producers and normalizers should omit the ":" delimiter that
792	   separates host from port if the port component is empty.  Some
793	   schemes do not allow the userinfo and/or port subcomponents.

795	   If a URI contains an authority component, then the path component
796	   must either be empty or begin with a slash ("/") character.
797	   Non-validating parsers (those that merely separate a URI reference
798	   into its major components) will often ignore the subcomponent
799	   structure of authority, treating it as an opaque string from the
800	   double-slash to the first terminating delimiter, until such time as
801	   the URI is dereferenced.

803	3.2.1  User Information

805	   The userinfo subcomponent may consist of a user name and, optionally,
806	   scheme-specific information about how to gain authorization to access
807	   the resource.  The user information, if present, is followed by a
808	   commercial at-sign ("@") that delimits it from the host.

810	      userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

812	   Use of the format "user:password" in the userinfo field is
813	   deprecated.  Applications should not render as clear text any data
814	   after the first colon (":") character found within a userinfo
815	   subcomponent unless the data after the colon is the empty string
816	   (indicating no password).  Applications may choose to ignore or
817	   reject such data when received as part of a reference, and should
818	   reject the storage of such data in unencrypted form.  The passing of
819	   authentication information in clear text has proven to be a security
820	   risk in almost every case where it has been used.

822	   Applications that render a URI for the sake of user feedback, such as
823	   in graphical hypertext browsing, should render userinfo in a way that
824	   is distinguished from the rest of a URI, when feasible.  Such
825	   rendering will assist the user in cases where the userinfo has been
826	   misleadingly crafted to look like a trusted domain name
827	   (Section 7.6).

829	3.2.2  Host

831	   The host subcomponent of authority is identified by an IP literal
832	   encapsulated within square brackets, an IPv4 address in
833	   dotted-decimal form, or a registered name.  The host subcomponent is
834	   case-insensitive.  The presence of a host subcomponent within a URI
835	   does not imply that the scheme requires access to the given host on
836	   the Internet.  In many cases, the host syntax is used only for the
837	   sake of reusing the existing registration process created and
838	   deployed for DNS, thus obtaining a globally unique name without the
839	   cost of deploying another registry.  However, such use comes with its
840	   own costs: domain name ownership may change over time for reasons not
841	   anticipated by the URI producer.  In other cases, the data within the
842	   host component identifies a registered name that has nothing to do
843	   with an Internet host.  We use the name "host" for the ABNF rule
844	   because that is its most common purpose, not its only purpose, and
845	   thus should not be considered as semantically limiting the data
846	   within it.

848	      host        = IP-literal / IPv4address / reg-name

850	   The syntax rule for host is ambiguous because it does not completely
851	   distinguish between an IPv4address and a reg-name.  In order to
852	   disambiguate the syntax, we apply the "first-match-wins" algorithm:
853	   If host matches the rule for IPv4address, then it should be
854	   considered an IPv4 address literal and not a reg-name.  Although host
855	   is case-insensitive, producers and normalizers should use lowercase
856	   for registered names and hexadecimal addresses for the sake of
857	   uniformity, while only using uppercase letters for percent-encodings.

859	   A host identified by an Internet Protocol literal address, version 6
860	   [RFC3513] or later, is distinguished by enclosing the IP literal
861	   within square brackets ("[" and "]").  This is the only place where
862	   square bracket characters are allowed in the URI syntax.  In
863	   anticipation of future, as-yet-undefined IP literal address formats,
864	   an optional version flag may be used to indicate such a format
865	   explicitly rather than relying on heuristic determination.

867	      IP-literal = "[" ( IPv6address / IPvFuture  ) "]"

869	      IPvFuture  = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

871	   The version flag does not indicate the IP version; rather, it
872	   indicates future versions of the literal format.  As such,
873	   implementations must not provide the version flag for existing IPv4
874	   and IPv6 literal addresses.  If a URI containing an IP-literal that
875	   starts with "v" (case-insensitive), indicating that the version flag
876	   is present, is dereferenced by an application that does not know the
877	   meaning of that version flag, then the application should return an
878	   appropriate error for "address mechanism not supported".

880	   A host identified by an IPv6 literal address is represented inside
881	   the square brackets without a preceding version flag.  The ABNF
882	   provided here is a translation of the text definition of an IPv6
883	   literal address provided in [RFC3513].  A 128-bit IPv6 address is
884	   divided into eight 16-bit pieces.  Each piece is represented
885	   numerically in case-insensitive hexadecimal, using one to four
886	   hexadecimal digits (leading zeroes are permitted).  The eight encoded
887	   pieces are given most-significant first, separated by colon
888	   characters.  Optionally, the least-significant two pieces may instead
889	   be represented in IPv4 address textual format.  A sequence of one or
890	   more consecutive zero-valued 16-bit pieces within the address may be
891	   elided, omitting all their digits and leaving exactly two consecutive
892	   colons in their place to mark the elision.

894	      IPv6address =                            6( h16 ":" ) ls32
895	                  /                       "::" 5( h16 ":" ) ls32
896	                  / [               h16 ] "::" 4( h16 ":" ) ls32
897	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
898	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
899	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
900	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
901	                  / [ *5( h16 ":" ) h16 ] "::"              h16
902	                  / [ *6( h16 ":" ) h16 ] "::"

904	      ls32        = ( h16 ":" h16 ) / IPv4address
905	                  ; least-significant 32 bits of address

907	      h16         = 1*4HEXDIG
908	                  ; 16 bits of address represented in hexadecimal
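
   The handling of an IP-literal might be sketched as follows.  This is
   an approximation: Python's ipaddress module is used in place of the
   ABNF above and may accept or reject a few forms differently (for
   example, recent versions also accept zone identifiers).  The
   function name parse_ip_literal is an assumption.

```python
import ipaddress

def parse_ip_literal(host: str) -> ipaddress.IPv6Address:
    """Parse a bracketed IP-literal, rejecting unknown version flags."""
    if not (host.startswith("[") and host.endswith("]")):
        raise ValueError("an IP-literal must be enclosed in brackets")
    inner = host[1:-1]
    # A leading "v" (case-insensitive) marks the IPvFuture form.
    if inner[:1] in ("v", "V"):
        raise NotImplementedError("address mechanism not supported")
    # Raises a ValueError subclass if the text is not an IPv6 address.
    return ipaddress.IPv6Address(inner)

print(parse_ip_literal("[2001:DB8:0:0:0:0:0:7]"))
print(parse_ip_literal("[::ffff:192.0.2.1]"))
```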

910	   A host identified by an IPv4 literal address is represented in
911	   dotted-decimal notation (a sequence of four decimal numbers in the
912	   range 0 to 255, separated by "."), as described in [RFC1123] by
913	   reference to [RFC0952].  Note that other forms of dotted notation may
914	   be interpreted on some platforms, as described in Section 7.4, but
915	   only the dotted-decimal form of four octets is allowed by this
916	   grammar.

918	      IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

920	      dec-octet   = DIGIT                 ; 0-9
921	                  / %x31-39 DIGIT         ; 10-99
922	                  / "1" 2DIGIT            ; 100-199
923	                  / "2" %x30-34 DIGIT     ; 200-249
924	                  / "25" %x30-35          ; 250-255
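
   The dec-octet rule and the "first-match-wins" disambiguation can be
   sketched with a regular expression.  The name classify_host is
   hypothetical, and the sketch does not validate the contents of an
   IP-literal or the characters of a reg-name.

```python
import re

# dec-octet alternatives, longest first: 250-255, 200-249, 100-199,
# 10-99, 0-9.  Leading zeroes such as "010" are not matched.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))

def classify_host(host: str) -> str:
    """Name the host alternative that applies under first-match-wins."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"
    if IPV4_RE.fullmatch(host):
        return "IPv4address"
    return "reg-name"

for host in ("192.0.2.16", "256.0.0.1", "010.0.0.1", "example.com"):
    print(host, classify_host(host))
```

   A string such as "256.0.0.1" fails the IPv4address rule and is
   therefore treated as a registered name.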

926	   A host identified by a registered name is a sequence of characters
927	   that is usually intended for lookup within a locally-defined host or
928	   service name registry, though the URI's scheme-specific semantics may
929	   require that a specific registry (or fixed name table) be used
930	   instead.  The most common name registry mechanism is the Domain Name
931	   System (DNS).  A registered name intended for lookup in the DNS uses
932	   the syntax defined in Section 3.5 of [RFC1034] and Section 2.1 of
933	   [RFC1123].  Such a name consists of a sequence of domain labels
934	   separated by ".", each domain label starting and ending with an
935	   alphanumeric character and possibly also containing "-" characters.
936	   The rightmost domain label of a fully qualified domain name in DNS
937	   may be followed by a single "." and should be followed by one if it
938	   is necessary to distinguish between the complete domain name and some
939	   local domain.

941	      reg-name    = *( unreserved / pct-encoded / sub-delims )

943	   If the URI scheme defines a default for host, then that default
944	   applies when the host subcomponent is undefined or when the
945	   registered name is empty (zero length).  For example, the "file" URI
946	   scheme is defined such that no authority, an empty host, and
947	   "localhost" all mean the end-user's machine, whereas the "http"
948	   scheme considers a missing authority or empty host to be invalid.

950	   This specification does not mandate a particular registered name
951	   lookup technology and therefore does not restrict the syntax of
952	   reg-name beyond that necessary for interoperability.  Instead, it
953	   delegates the issue of registered name syntax conformance to the
954	   operating system of each application performing URI resolution, and
955	   that operating system decides what it will allow for the purpose of
956	   host identification.  A URI resolution implementation might use DNS,
957	   host tables, yellow pages, NetInfo, WINS, or any other system for
958	   lookup of registered names.  However, a globally-scoped naming
959	   system, such as DNS fully-qualified domain names, is necessary for
960	   URIs that are intended to have global scope.  URI producers should
961	   use names that conform to the DNS syntax, even when use of DNS is not
962	   immediately apparent, and should limit such names to no more than 255
963	   characters in length.

965	   The reg-name syntax allows percent-encoded octets in order to
966	   represent non-ASCII registered names in a uniform way that is
967	   independent of the underlying name resolution technology; such
968	   non-ASCII characters must first be encoded according to UTF-8 [STD63]
969	   and then each octet of the corresponding UTF-8 sequence must be
970	   percent-encoded to be represented as URI characters.  URI producing
971	   applications must not use percent-encoding in host unless it is used
972	   to represent a UTF-8 character sequence.  When a non-ASCII registered
973	   name represents an internationalized domain name intended for
974	   resolution via the DNS, the name must be transformed to the IDNA
975	   encoding [RFC3490] prior to name lookup.  URI producers should
976	   provide such registered names in the IDNA encoding, rather than a
977	   percent-encoding, if they wish to maximize interoperability with
978	   legacy URI resolvers.
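
   The two representations of a non-ASCII registered name can be
   compared with the sketch below.  The name "bücher.example" is a
   hypothetical example, and Python's built-in "idna" codec, which
   follows [RFC3490], is assumed as the IDNA implementation.

```python
from urllib.parse import quote

name = "b\u00fccher.example"   # hypothetical internationalized name

# Percent-encoded form: UTF-8 octets, each non-ASCII octet encoded.
pct_form = quote(name.encode("utf-8"), safe="")

# IDNA form, as needed before lookup in the DNS.
idna_form = name.encode("idna").decode("ascii")

print(pct_form)
print(idna_form)
```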

980	3.2.3  Port

982	   The port subcomponent of authority is designated by an optional port
983	   number in decimal following the host and delimited from it by a
984	   single colon (":") character.

986	      port        = *DIGIT

988	   A scheme may define a default port.  For example, the "http" scheme
989	   defines a default port of "80", corresponding to its reserved TCP
990	   port number.  The type of port designated by the port number (e.g.,
991	   TCP, UDP, SCTP, etc.) is defined by the URI scheme.  URI producers
992	   and normalizers should omit the port component and its ":" delimiter
993	   if port is empty or its value would be the same as the scheme's
994	   default.
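
   A producer or normalizer might apply this rule as sketched below.
   The table holds only the "http" default stated above, and the names
   DEFAULT_PORTS and normalize_authority are assumptions.  The
   comparison is textual, so a port written as "080" is not recognized
   as the default.

```python
# Only the default given in the text; other schemes define their own.
DEFAULT_PORTS = {"http": "80"}

def normalize_authority(scheme: str, host: str, port) -> str:
    """Join host and port, omitting an empty or default port.

    port is the text after the ":" delimiter, or None if no delimiter
    was present.
    """
    if port is None or port == "":
        return host
    if port == DEFAULT_PORTS.get(scheme.lower()):
        return host
    return host + ":" + port

print(normalize_authority("http", "example.com", "80"))
print(normalize_authority("http", "example.com", "8042"))
```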

996	3.3  Path

998	   The path component contains data, usually organized in hierarchical
999	   form, that, along with data in the non-hierarchical query component
1000	   (Section 3.4), serves to identify a resource within the scope of the
1001	   URI's scheme and naming authority (if any).  The path is terminated
1002	   by the first question mark ("?") or number sign ("#") character, or
1003	   by the end of the URI.

1005	   If a URI contains an authority component, then the path component
1006	   must either be empty or begin with a slash ("/") character.  If a URI
1007	   does not contain an authority component, then the path cannot begin
1008	   with two slash characters ("//").  In addition, a URI reference
1009	   (Section 4.1) may be a relative-path reference, in which case the
1010	   first path segment cannot contain a colon (":") character.  The ABNF
1011	   requires five separate rules to disambiguate these cases, only one of
1012	   which will match the path substring within a given URI reference.  We
1013	   use the generic term "path component" to describe the URI substring
1014	   matched by the parser to one of these rules.

1016	      path          = path-abempty    ; begins with "/" or is empty
1017	                    / path-absolute   ; begins with "/" but not "//"
1018	                    / path-noscheme   ; begins with a non-colon segment
1019	                    / path-rootless   ; begins with a segment
1020	                    / path-empty      ; zero characters

1022	      path-abempty  = *( "/" segment )
1023	      path-absolute = "/" [ segment-nz *( "/" segment ) ]
1024	      path-noscheme = segment-nz-nc *( "/" segment )
1025	      path-rootless = segment-nz *( "/" segment )
1026	      path-empty    = 0<pchar>

1028	      segment       = *pchar
1029	      segment-nz    = 1*pchar
1030	      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
1031	                    ; non-zero-length segment without any colon ":"

1033	      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
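
	   The structural constraints stated before the ABNF can be checked
	   with a few string tests.  The following Python sketch is not a
	   validator (it does not check the characters of each segment), and
	   its function and parameter names are hypothetical.

```python
def path_is_allowed(path: str, has_authority: bool,
                    is_relative_ref: bool = False) -> bool:
    """Check the structural path constraints; characters are not validated."""
    if has_authority:
        # With an authority, the path is empty or begins with "/".
        return path == "" or path.startswith("/")
    if path.startswith("//"):
        # Without an authority, "//" would be read as introducing one.
        return False
    if is_relative_ref and ":" in path.split("/", 1)[0]:
        # A colon in the first segment would be read as a scheme delimiter.
        return False
    return True

print(path_is_allowed("/b/c", has_authority=True))        # True
print(path_is_allowed("b/c", has_authority=True))         # False
print(path_is_allowed("this:that", False, True))          # False
```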

1035	   A path consists of a sequence of path segments separated by a slash
1036	   ("/") character.  A path is always defined for a URI, though the
1037	   defined path may be empty (zero length).  Use of the slash character
1038	   to indicate hierarchy is only required when a URI will be used as the
1039	   context for relative references.  For example, the URI
1040	   <mailto:fred@example.com> has a path of "fred@example.com", whereas
1041	   the URI <foo://info.example.com?fred> has an empty path.

1043	   The path segments "." and "..", also known as dot-segments, are
1044	   defined for relative reference within the path name hierarchy.  They
1045	   are intended for use at the beginning of a relative-path reference
1046	   (Section 4.2) for indicating relative position within the
1047	   hierarchical tree of names.  This is similar to their role within
1048	   some operating systems' file directory structure to indicate the
1049	   current directory and parent directory, respectively.  However,
1050	   unlike a file system, these dot-segments are only interpreted within
1051	   the URI path hierarchy and are removed as part of the resolution
1052	   process (Section 5.2).

1054	   Aside from dot-segments in hierarchical paths, a path segment is
1055	   considered opaque by the generic syntax.  URI-producing applications
1056	   often use the reserved characters allowed in a segment for the
1057	   purpose of delimiting scheme-specific or dereference-handler-specific
1058	   subcomponents.  For example, the semicolon (";") and equals ("=")
1059	   reserved characters are often used for delimiting parameters and
1060	   parameter values applicable to that segment.  The comma (",")
1061	   reserved character is often used for similar purposes.  For example,
1062	   one URI producer might use a segment like "name;v=1.1" to indicate a
1063	   reference to version 1.1 of "name", whereas another might use a
1064	   segment like "name,1.1" to indicate the same.  Parameter types may be
1065	   defined by scheme-specific semantics, but in most cases the syntax of
1066	   a parameter is specific to the implementation of the URI's
1067	   dereferencing algorithm.
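
	   For instance, an application that adopts the semicolon and equals
	   convention could split a segment as sketched below in Python.
	   This is one hypothetical convention, not something the generic
	   syntax defines; the function name is an assumption.

```python
def split_segment_params(segment: str):
    """Split "name;key=value;..." into the name and a dict of parameters."""
    name, *params = segment.split(";")
    parsed = {}
    for param in params:
        key, _, value = param.partition("=")
        parsed[key] = value  # value is "" when no "=" is present
    return name, parsed

print(split_segment_params("name;v=1.1"))  # ('name', {'v': '1.1'})
```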

1069	3.4  Query

1071	   The query component contains non-hierarchical data that, along with
1072	   data in the path component (Section 3.3), serves to identify a
1073	   resource within the scope of the URI's scheme and naming authority
1074	   (if any).  The query component is indicated by the first question
1075	   mark ("?") character and terminated by a number sign ("#") character
1076	   or by the end of the URI.

1078	      query       = *( pchar / "/" / "?" )

1080	   The characters slash ("/") and question mark ("?") may represent data
1081	   within the query component.  Beware that some older, erroneous
1082	   implementations may not handle such data correctly when used as the
1083	   base URI for relative references (Section 5.1), apparently because
1084	   they fail to distinguish query data from path data when looking
1085	   for hierarchical separators.  However, since query components are
1086	   often used to carry identifying information in the form of
1087	   "key=value" pairs, and one frequently used value is a reference to
1088	   another URI, it is sometimes better for usability to avoid
1089	   percent-encoding those characters.
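
	   The usability trade-off can be seen in a short Python sketch that
	   carries a URI as a query value, once with "/" and "?" left as
	   data and once fully percent-encoded.  The key name "ref" and the
	   target URI are hypothetical.

```python
from urllib.parse import parse_qs, quote

target = "http://example.com/a/b?x"  # hypothetical URI carried as a value

# Leave ":", "/" and "?" as data within the query component.
readable = "ref=" + quote(target, safe=":/?")
# Percent-encode everything that is not unreserved.
encoded = "ref=" + quote(target, safe="")

print(readable)  # ref=http://example.com/a/b?x
print(encoded)   # ref=http%3A%2F%2Fexample.com%2Fa%2Fb%3Fx

# Both forms decode to the same value.
print(parse_qs(readable)["ref"] == parse_qs(encoded)["ref"])  # True
```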

1091	3.5  Fragment

1093	   The fragment identifier component of a URI allows indirect
1094	   identification of a secondary resource by reference to a primary
1095	   resource and additional identifying information.  The identified
1096	   secondary resource may be some portion or subset of the primary
1097	   resource, some view on representations of the primary resource, or
1098	   some other resource defined or described by those representations.  A
1099	   fragment identifier component is indicated by the presence of a
1100	   number sign ("#") character and terminated by the end of the URI.

1102	      fragment    = *( pchar / "/" / "?" )

1104	   The semantics of a fragment identifier are defined by the set of
1105	   representations that might result from a retrieval action on the
1106	   primary resource.  The fragment's format and resolution are therefore
1107	   dependent on the media type [RFC2046] of a potentially retrieved
1108	   representation, even though such a retrieval is only performed if the
1109	   URI is dereferenced.  If no such representation exists, then the
1110	   semantics of the fragment are considered unknown and, effectively,
1111	   unconstrained.  Fragment identifier semantics are independent of the
1112	   URI scheme and thus cannot be redefined by scheme specifications.

1114	   Individual media types may define their own restrictions on, or
1115	   structure within, the fragment identifier syntax for specifying
1116	   different types of subsets, views, or external references that are
1117	   identifiable as secondary resources by that media type.  If the
1118	   primary resource has multiple representations, as is often the case
1119	   for resources whose representation is selected based on attributes of
1120	   the retrieval request (a.k.a., content negotiation), then whatever is
1121	   identified by the fragment should be consistent across all of those
1122	   representations: each representation should either define the
1123	   fragment such that it corresponds to the same secondary resource,
1124	   regardless of how it is represented, or the fragment should be left
1125	   undefined by the representation (i.e., not found).

1127	   As with any URI, use of a fragment identifier component does not
1128	   imply that a retrieval action will take place.  A URI with a fragment
1129	   identifier may be used to refer to the secondary resource without any
1130	   implication that the primary resource is accessible or will ever be
1131	   accessed.

1133	   Fragment identifiers have a special role in information retrieval
1134	   systems as the primary form of client-side indirect referencing,
1135	   allowing an author to specifically identify those aspects of an
1136	   existing resource that are only indirectly provided by the resource
1137	   owner.  As such, the fragment identifier is not used in the
1138	   scheme-specific processing of a URI; instead, the fragment identifier
1139	   is separated from the rest of the URI prior to a dereference, and
1140	   thus the identifying information within the fragment itself is
1141	   dereferenced solely by the user agent and regardless of the URI
1142	   scheme.  Although this separate handling is often perceived to be a
1143	   loss of information, particularly with regard to accurate redirection
1144	   of references as resources move over time, it also serves to prevent
1145	   information providers from denying reference authors the right to
1146	   selectively refer to information within a resource.  Indirect
1147	   referencing also provides additional flexibility and extensibility to
1148	   systems that use URIs, since new media types are easier to define and
1149	   deploy than new schemes of identification.
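
	   The separation step can be sketched in Python as follows.  The
	   function name is hypothetical, and None is used here to model a
	   fragment that is undefined (no "#" present) as opposed to empty.

```python
def split_for_dereference(uri: str):
    """Separate the fragment from the part handed to scheme processing."""
    primary, sep, fragment = uri.partition("#")  # splits at the first "#"
    return primary, (fragment if sep else None)

print(split_for_dereference("http://example.com/doc?x=1#sec/2?a"))
# ('http://example.com/doc?x=1', 'sec/2?a')
```

	   Only the first element would be passed to the scheme-specific
	   dereference; the second is left for the user agent to interpret
	   against the retrieved representation.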

1151	   The characters slash ("/") and question mark ("?") are allowed to
1152	   represent data within the fragment identifier.  Beware that some
1153	   older, erroneous implementations may not handle such data correctly
1154	   when used as the base URI for relative references (Section 5.1).

1156	4.  Usage

1158	   When applications make reference to a URI, they do not always use the
1159	   full form of reference defined by the "URI" syntax rule.  In order to
1160	   save space and take advantage of hierarchical locality, many Internet
1161	   protocol elements and media type formats allow an abbreviation of a
1162	   URI, while others restrict the syntax to a particular form of URI.
1163	   We define the most common forms of reference syntax in this
1164	   specification because they impact and depend upon the design of the
1165	   generic syntax, requiring a uniform parsing algorithm in order to be
1166	   interpreted consistently.

1168	4.1  URI Reference

1170	   URI-reference is used to denote the most common usage of a resource
1171	   identifier.

1173	      URI-reference = URI / relative-ref

1175	   A URI-reference is either a URI or a relative reference.  If the
1176	   URI-reference's prefix does not match the syntax of a scheme followed
1177	   by its colon separator, then the URI-reference is a relative
1178	   reference.

1180	   A URI-reference is typically parsed first into the five URI
1181	   components, in order to determine what components are present and
1182	   whether or not the reference is relative, after which each component
1183	   is parsed for its subparts and their validation.  The ABNF of
1184	   URI-reference, along with the "first-match-wins" disambiguation rule,
1185	   is sufficient to define a validating parser for the generic syntax.
1186	   Readers familiar with regular expressions should see Appendix B for
1187	   an example of a non-validating URI-reference parser that will take
1188	   any given string and extract the URI components.
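
	   As a rough illustration, the regular expression of Appendix B can
	   be wrapped in Python as shown below.  The Parts type and the
	   function names are hypothetical; None models a component that is
	   undefined, while "" models one that is present but empty.

```python
import re
from typing import NamedTuple, Optional

# Non-validating expression, as given in Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]

def parse(reference: str) -> Parts:
    """Split any string into the five components without validating it."""
    m = URI_REFERENCE.match(reference)
    return Parts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))

def is_relative(reference: str) -> bool:
    """A reference without a scheme prefix is a relative reference."""
    return parse(reference).scheme is None

print(parse("http://a/b/c/d;p?q"))
print(is_relative("../g#s"))  # True
```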

1190	4.2  Relative Reference

1192	   A relative reference takes advantage of the hierarchical syntax
1193	   (Section 1.2.3) in order to express a URI reference relative to the
1194	   name space of another hierarchical URI.

1196	      relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

1198	      relative-part = "//" authority path-abempty
1199	                    / path-absolute
1200	                    / path-noscheme
1201	                    / path-empty

1203	   The URI referred to by a relative reference, also known as the target
1204	   URI, is obtained by applying the reference resolution algorithm of
1205	   Section 5.

1207	   A relative reference that begins with two slash characters is termed
1208	   a network-path reference; such references are rarely used.  A
1209	   relative reference that begins with a single slash character is
1210	   termed an absolute-path reference.  A relative reference that does
1211	   not begin with a slash character is termed a relative-path reference.

1213	   A path segment that contains a colon character (e.g., "this:that")
1214	   cannot be used as the first segment of a relative-path reference
1215	   because it would be mistaken for a scheme name.  Such a segment must
1216	   be preceded by a dot-segment (e.g., "./this:that") to make a
1217	   relative-path reference.
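
	   The three kinds of relative reference, and the dot-segment
	   workaround for a colon in the first segment, can be sketched in
	   Python.  Both function names are hypothetical, and the classifier
	   assumes its input is already known to be a relative reference.

```python
def classify_relative(reference: str) -> str:
    """Name the kind of relative reference by its leading slashes."""
    if reference.startswith("//"):
        return "network-path"
    if reference.startswith("/"):
        return "absolute-path"
    return "relative-path"

def protect_first_segment(path: str) -> str:
    """Prefix "./" when the first segment would be mistaken for a scheme."""
    first_segment = path.split("/", 1)[0]
    return "./" + path if ":" in first_segment else path

print(classify_relative("//g"))            # network-path
print(classify_relative("/g"))             # absolute-path
print(classify_relative("g"))              # relative-path
print(protect_first_segment("this:that"))  # ./this:that
```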

1219	4.3  Absolute URI

1221	   Some protocol elements allow only the absolute form of a URI without
1222	   a fragment identifier.  For example, defining a base URI for later
1223	   use by relative references calls for an absolute-URI syntax rule that
1224	   does not allow a fragment.

1226	      absolute-URI  = scheme ":" hier-part [ "?" query ]

1228	   URI scheme specifications must define their own syntax such that all
1229	   strings matching their scheme-specific syntax will also match the
1230	   <absolute-URI> grammar.  Scheme specifications are not responsible
1231	   for defining fragment identifier syntax or usage, regardless of its
1232	   applicability to resources identifiable via that scheme, since
1233	   fragment identification is orthogonal to scheme definition.  However,
1234	   scheme specifications are encouraged to include a wide range of
1235	   examples, including examples that show use of the scheme's URIs with
1236	   fragment identifiers when such usage is appropriate.

1238	4.4  Same-document Reference

1240	   When a URI reference refers to a URI that is, aside from its fragment
1241	   component (if any), identical to the base URI (Section 5.1), that
1242	   reference is called a "same-document" reference.  The most frequent
1243	   examples of same-document references are relative references that are
1244	   empty or include only the number sign ("#") separator followed by a
1245	   fragment identifier.

1247	   When a same-document reference is dereferenced for the purpose of a
1248	   retrieval action, the target of that reference is defined to be
1249	   within the same entity (representation, document, or message) as the
1250	   reference; therefore, a dereference should not result in a new
1251	   retrieval action.
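
	   A user agent might decide whether a new retrieval is needed with
	   a test along the following lines.  This Python sketch relies on
	   the standard library's urljoin for resolution and performs no
	   normalization; the function name is hypothetical.

```python
from urllib.parse import urljoin

def is_same_document(reference: str, base: str) -> bool:
    """True if the target equals the base, ignoring any fragment."""
    target = urljoin(base, reference)
    return target.partition("#")[0] == base.partition("#")[0]

base = "http://a/b/c/d;p?q"
print(is_same_document("", base))    # True
print(is_same_document("#s", base))  # True
print(is_same_document("g", base))   # False
```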

1253	   Normalization of the base and target URIs prior to their comparison,
1254	   as described in Section 6.2.2 and Section 6.2.3, is allowed but
1255	   rarely performed in practice.  Normalization may increase the set of
1256	   same-document references, which may be of benefit to some caching
1257	   applications.  As such, reference authors should not assume that a
1258	   slightly different, though equivalent, reference URI will (or will
1259	   not) be interpreted as a same-document reference by any given
1260	   application.

1262	4.5  Suffix Reference

1264	   The URI syntax is designed for unambiguous reference to resources and
1265	   extensibility via the URI scheme.  However, as URI identification and
1266	   usage have become commonplace, traditional media (television, radio,
1267	   newspapers, billboards, etc.) have increasingly used a suffix of the
1268	   URI as a reference, consisting of only the authority and path
1269	   portions of the URI, such as

1271	      www.w3.org/Addressing/

1273	   or simply a DNS registered name on its own.  Such references are
1274	   primarily intended for human interpretation, rather than for
1275	   machines, with the assumption that context-based heuristics are
1276	   sufficient to complete the URI (e.g., most registered names beginning
1277	   with "www" are likely to have a URI prefix of "http://").  Although
1278	   there is no standard set of heuristics for disambiguating a URI
1279	   suffix, many client implementations allow them to be entered by the
1280	   user and heuristically resolved.

1282	   While this practice of using suffix references is common, it should
1283	   be avoided whenever possible and never used in situations where
1284	   long-term references are expected.  The heuristics noted above will
1285	   change over time, particularly when a new URI scheme becomes popular,
1286	   and are often incorrect when used out of context.  Furthermore, they
1287	   can lead to security issues along the lines of those described in
1288	   [RFC1535].

1290	   Since a URI suffix has the same syntax as a relative-path reference,
1291	   a suffix reference cannot be used in contexts where a relative
1292	   reference is expected.  As a result, suffix references are limited to
1293	   those places where there is no defined base URI, such as dialog boxes
1294	   and off-line advertisements.

1296	5.  Reference Resolution

1298	   This section defines the process of resolving a URI reference within
1299	   a context that allows relative references, such that the result is a
1300	   string matching the <URI> syntax rule of Section 3.

1302	5.1  Establishing a Base URI

1304	   The term "relative" implies that there exists a "base URI" against
1305	   which the relative reference is applied.  Aside from fragment-only
1306	   references (Section 4.4), relative references are only usable when a
1307	   base URI is known.  A base URI must be established by the parser
1308	   prior to parsing URI references that might be relative.  A base URI
1309	   must conform to the <absolute-URI> syntax rule (Section 4.3): if the
1310	   base URI is obtained from a URI reference, then that reference must
1311	   be converted to absolute form and stripped of any fragment component
1312	   prior to use as a base URI.
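
	   Assuming the reference has already been resolved to absolute
	   form, the final preparation of a base URI can be sketched in
	   Python as below.  The function name is hypothetical; the scheme
	   test follows the scheme syntax of Section 3.1.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

def as_base_uri(uri: str) -> str:
    """Return uri without its fragment, rejecting non-absolute input."""
    if not SCHEME_PREFIX.match(uri):
        raise ValueError("a base URI must begin with a scheme")
    return uri.partition("#")[0]

print(as_base_uri("http://a/b/c/d;p?q#frag"))  # http://a/b/c/d;p?q
```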

1314	   The base URI of a reference can be established in one of four ways,
1315	   discussed below in order of precedence.  The order of precedence can
1316	   be thought of in terms of layers, where the innermost defined base
1317	   URI has the highest precedence.  This can be visualized graphically
1318	   as:

1320	      .----------------------------------------------------------.
1321	      |  .----------------------------------------------------.  |
1322	      |  |  .----------------------------------------------.  |  |
1323	      |  |  |  .----------------------------------------.  |  |  |
1324	      |  |  |  |  .----------------------------------.  |  |  |  |
1325	      |  |  |  |  |       <relative-reference>       |  |  |  |  |
1326	      |  |  |  |  `----------------------------------'  |  |  |  |
1327	      |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
1328	      |  |  |  `----------------------------------------'  |  |  |
1329	      |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
1330	      |  |  |         (message, representation, or none)   |  |  |
1331	      |  |  `----------------------------------------------'  |  |
1332	      |  | (5.1.3) URI used to retrieve the entity            |  |
1333	      |  `----------------------------------------------------'  |
1334	      | (5.1.4) Default Base URI (application-dependent)         |
1335	      `----------------------------------------------------------'
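
	   Read from the innermost layer outward, the precedence amounts to
	   taking the first base URI that is defined.  A minimal Python
	   sketch, with hypothetical parameter names and None standing for
	   "not defined at this layer":

```python
def establish_base(embedded=None, encapsulating=None,
                   retrieval=None, default=None):
    """Return the innermost defined base URI (5.1.1 through 5.1.4)."""
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    return None  # no base URI could be established

print(establish_base(retrieval="http://a/b/c/d;p?q",
                     default="http://example.com/"))
# http://a/b/c/d;p?q
```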

1337	5.1.1  Base URI Embedded in Content

1339	   Within certain media types, a base URI for relative references can be
1340	   embedded within the content itself such that it can be readily
1341	   obtained by a parser.  This can be useful for descriptive documents,
1342	   such as tables of contents, which may be transmitted to others through
1343	   protocols other than their usual retrieval context (e.g., E-Mail or
1344	   USENET news).

1346	   It is beyond the scope of this specification to specify how, for each
1347	   media type, a base URI can be embedded.  The appropriate syntax, when
1348	   available, is described by the data format specification associated
1349	   with each media type.

1351	5.1.2  Base URI from the Encapsulating Entity

1353	   If no base URI is embedded, the base URI is defined by the
1354	   representation's retrieval context.  For a document that is enclosed
1355	   within another entity, such as a message or archive, the retrieval
1356	   context is that entity; thus, the default base URI of a
1357	   representation is the base URI of the entity in which the
1358	   representation is encapsulated.

1360	   A mechanism for embedding a base URI within MIME container types
1361	   (e.g., the message and multipart types) is defined by MHTML
1362	   [RFC2557].  Protocols that do not use the MIME message header syntax,
1363	   but do allow some form of tagged metadata to be included within
1364	   messages, may define their own syntax for defining a base URI as part
1365	   of a message.

1367	5.1.3  Base URI from the Retrieval URI

1369	   If no base URI is embedded and the representation is not encapsulated
1370	   within some other entity, then, if a URI was used to retrieve the
1371	   representation, that URI shall be considered the base URI.  Note that
1372	   if the retrieval was the result of a redirected request, the last URI
1373	   used (i.e., the URI that resulted in the actual retrieval of the
1374	   representation) is the base URI.

1376	5.1.4  Default Base URI

1378	   If none of the conditions described above apply, then the base URI is
1379	   defined by the context of the application.  Since this definition is
1380	   necessarily application-dependent, failing to define a base URI using
1381	   one of the other methods may result in the same content being
1382	   interpreted differently by different types of applications.

1384	   A sender of a representation containing relative references is
1385	   responsible for ensuring that a base URI for those references can be
1386	   established.  Aside from fragment-only references, relative
1387	   references can only be used reliably in situations where the base URI
1388	   is well-defined.

1390	5.2  Relative Resolution

1392	   This section describes an algorithm for converting a URI reference
1393	   that might be relative to a given base URI into the parsed components
1394	   of the reference's target.  The components can then be recomposed, as
1395	   described in Section 5.3, to form the target URI.  This algorithm
1396	   provides definitive results that can be used to test the output of
1397	   other implementations.  Applications may implement relative reference
1398	   resolution using some other algorithm, provided that the results
1399	   match what would be given by this algorithm.

1401	5.2.1  Pre-parse the Base URI

1403	   The base URI (Base) is established according to the procedure of
1404	   Section 5.1 and parsed into the five main components described in
1405	   Section 3.  Note that only the scheme component is required to be
1406	   present in a base URI; the other components may be empty or
1407	   undefined.  A component is undefined if its associated delimiter does
1408	   not appear in the URI reference; the path component is never
1409	   undefined, though it may be empty.

1411	   Normalization of the base URI, as described in Section 6.2.2 and
1412	   Section 6.2.3, is optional.  A URI reference must be transformed to
1413	   its target URI before it can be normalized.

1415	5.2.2  Transform References

1417	   For each URI reference (R), the following pseudocode describes an
1418	   algorithm for transforming R into its target URI (T):

1420	      -- The URI reference is parsed into the five URI components
1421	      --
1422	      (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);

1424	      -- A non-strict parser may ignore a scheme in the reference
1425	      -- if it is identical to the base URI's scheme.
1426	      --
1427	      if ((not strict) and (R.scheme == Base.scheme)) then
1428	         undefine(R.scheme);
1429	      endif;

1431	      if defined(R.scheme) then
1432	         T.scheme    = R.scheme;
1433	         T.authority = R.authority;
1434	         T.path      = remove_dot_segments(R.path);
1435	         T.query     = R.query;
1436	      else
1437	         if defined(R.authority) then
1438	            T.authority = R.authority;
1439	            T.path      = remove_dot_segments(R.path);
1440	            T.query     = R.query;
1441	         else
1442	            if (R.path == "") then
1443	               T.path = Base.path;
1444	               if defined(R.query) then
1445	                  T.query = R.query;
1446	               else
1447	                  T.query = Base.query;
1448	               endif;
1449	            else
1450	               if (R.path starts-with "/") then
1451	                  T.path = remove_dot_segments(R.path);
1452	               else
1453	                  T.path = merge(Base.path, R.path);
1454	                  T.path = remove_dot_segments(T.path);
1455	               endif;
1456	               T.query = R.query;
1457	            endif;
1458	            T.authority = Base.authority;
1459	         endif;
1460	         T.scheme = Base.scheme;
1461	      endif;

1463	      T.fragment = R.fragment;
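
	   For readers who prefer running code, the pseudocode above can be
	   transcribed into Python roughly as follows.  This is a sketch
	   rather than a reference implementation: the parser is the
	   non-validating expression of Appendix B, None models an undefined
	   component, and the helper routines follow Sections 5.2.3, 5.2.4
	   and 5.3.  All function names are our own.

```python
import re

_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse(ref):
    m = _SPLIT.match(ref)
    return {"scheme": m.group(2), "authority": m.group(4),
            "path": m.group(5), "query": m.group(7), "fragment": m.group(9)}

def merge(base, ref_path):
    if base["authority"] is not None and base["path"] == "":
        return "/" + ref_path
    # Keep everything up to and including the right-most "/".
    return base["path"][: base["path"].rfind("/") + 1] + ref_path

def remove_dot_segments(path):
    inp, out = path, ""
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./") or inp == "/.":
            inp = "/" + inp[3:]
        elif inp.startswith("/../") or inp == "/..":
            inp = "/" + inp[4:]
            out = out[: out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):
            inp = ""
        else:
            end = inp.find("/", 1)
            end = len(inp) if end == -1 else end
            out, inp = out + inp[:end], inp[end:]
    return out

def recompose(t):
    result = ""
    if t["scheme"] is not None:
        result += t["scheme"] + ":"
    if t["authority"] is not None:
        result += "//" + t["authority"]
    result += t["path"]
    if t["query"] is not None:
        result += "?" + t["query"]
    if t["fragment"] is not None:
        result += "#" + t["fragment"]
    return result

def resolve(base_uri, reference, strict=True):
    base, r, t = parse(base_uri), parse(reference), {}
    if not strict and r["scheme"] == base["scheme"]:
        r["scheme"] = None  # non-strict: ignore a scheme equal to the base's
    if r["scheme"] is not None:
        t["scheme"], t["authority"] = r["scheme"], r["authority"]
        t["path"], t["query"] = remove_dot_segments(r["path"]), r["query"]
    else:
        if r["authority"] is not None:
            t["authority"] = r["authority"]
            t["path"], t["query"] = remove_dot_segments(r["path"]), r["query"]
        else:
            if r["path"] == "":
                t["path"] = base["path"]
                t["query"] = r["query"] if r["query"] is not None else base["query"]
            else:
                if r["path"].startswith("/"):
                    t["path"] = remove_dot_segments(r["path"])
                else:
                    t["path"] = remove_dot_segments(merge(base, r["path"]))
                t["query"] = r["query"]
            t["authority"] = base["authority"]
        t["scheme"] = base["scheme"]
    t["fragment"] = r["fragment"]
    return recompose(t)

print(resolve("http://a/b/c/d;p?q", "../g"))  # http://a/b/g
```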

1465	5.2.3  Merge Paths

1467	   The pseudocode above refers to a "merge" routine for merging a
1468	   relative-path reference with the path of the base URI.  This is
1469	   accomplished as follows:

1471	   o  If the base URI has a defined authority component and an empty
1472	      path, then return a string consisting of "/" concatenated with the
1473	      reference's path; otherwise,

1475	   o  Return a string consisting of the reference's path component
1476	      appended to all but the last segment of the base URI's path (i.e.,
1477	      excluding any characters after the right-most "/" in the base URI
1478	      path, or excluding the entire base URI path if it does not contain
1479	      any "/" characters).
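
	   The two cases can be sketched in Python as below.  The signature
	   is an assumption made for illustration: the base authority is
	   passed as None when it is undefined.

```python
def merge(base_authority, base_path: str, ref_path: str) -> str:
    """Merge a relative-path reference with the base URI's path."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # rfind() returns -1 when there is no "/", so the slice becomes
    # base_path[:0], which drops the entire base path.
    return base_path[: base_path.rfind("/") + 1] + ref_path

print(merge("a", "/b/c/d;p", "g"))  # /b/c/g
print(merge("a", "", "g"))          # /g
print(merge(None, "x", "g"))        # g
```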

1481	5.2.4  Remove Dot Segments

1483	   The pseudocode also refers to a "remove_dot_segments" routine for
1484	   interpreting and removing the special "." and ".." complete path
1485	   segments from a referenced path.  This is done after the path is
1486	   extracted from a reference, whether or not the path was relative, in
1487	   order to remove any invalid or extraneous dot-segments prior to
1488	   forming the target URI.  Although there are many ways to accomplish
1489	   this removal process, we describe a simple method using two string
1490	   buffers.

1492	   1.  The input buffer is initialized with the now-appended path
1493	       components and the output buffer is initialized to the empty
1494	       string.

1496	   2.  While the input buffer is not empty, loop:

1498	       A.  If the input buffer begins with a prefix of "../" or "./",
1499	           then remove that prefix from the input buffer; otherwise,

1501	       B.  If the input buffer begins with a prefix of "/./" or "/.",
1502	           where "." is a complete path segment, then replace that
1503	           prefix with "/" in the input buffer; otherwise,

1505	       C.  If the input buffer begins with a prefix of "/../" or "/..",
1506	           where ".." is a complete path segment, then replace that
1507	           prefix with "/" in the input buffer and remove the last
1508	           segment and its preceding "/" (if any) from the output
1509	           buffer; otherwise,

1511	       D.  If the input buffer consists only of "." or "..", then remove
1512	           that from the input buffer; otherwise,

1514	       E.  Move the first path segment in the input buffer to the end of
1515	           the output buffer, including the initial "/" character (if
1516	           any) and any subsequent characters up to, but not including,
1517	           the next "/" character or the end of the input buffer.

1519	   3.  Finally, the output buffer is returned as the result of
1520	       remove_dot_segments.

1522	   Note that dot-segments are intended for use in URI references to
1523	   express an identifier relative to the hierarchy of names in the base
1524	   URI.  The remove_dot_segments algorithm respects that hierarchy by
1525	   removing extra dot-segments rather than treating them as an error or
1526	   leaving them to be misinterpreted by dereference implementations.

1528	   The following illustrates how the above steps are applied for two
1529	   example merged paths, showing the state of the two buffers after each
1530	   step.

1532	      STEP   OUTPUT BUFFER         INPUT BUFFER

1534	       1 :                         /a/b/c/./../../g
1535	       2E:   /a                    /b/c/./../../g
1536	       2E:   /a/b                  /c/./../../g
1537	       2E:   /a/b/c                /./../../g
1538	       2B:   /a/b/c                /../../g
1539	       2C:   /a/b                  /../g
1540	       2C:   /a                    /g
1541	       2E:   /a/g

1543	      STEP   OUTPUT BUFFER         INPUT BUFFER

1545	       1 :                         mid/content=5/../6
1546	       2E:   mid                   /content=5/../6
1547	       2E:   mid/content=5         /../6
1548	       2C:   mid                   /6
1549	       2E:   mid/6
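
	   The two traces above can be reproduced with a direct
	   transcription of steps 1 through 3.  The Python sketch below uses
	   two strings as the buffers and records the step label applied at
	   each iteration; the function name and the trace format are our
	   own.

```python
def remove_dot_segments_trace(path: str):
    """Return the result and a list of (step, output, input) states."""
    inp, out = path, ""
    trace = [("1", out, inp)]
    while inp:
        if inp.startswith("../") or inp.startswith("./"):
            inp = inp[3:] if inp.startswith("../") else inp[2:]
            step = "2A"
        elif inp.startswith("/./") or inp == "/.":
            inp = "/" + inp[3:]
            step = "2B"
        elif inp.startswith("/../") or inp == "/..":
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/", if any.
            out = out[: out.rfind("/")] if "/" in out else ""
            step = "2C"
        elif inp in (".", ".."):
            inp = ""
            step = "2D"
        else:
            # Move one segment, including its leading "/", to the output.
            end = inp.find("/", 1)
            end = len(inp) if end == -1 else end
            out, inp = out + inp[:end], inp[end:]
            step = "2E"
        trace.append((step, out, inp))
    return out, trace

result, trace = remove_dot_segments_trace("/a/b/c/./../../g")
for step, out, inp in trace:
    print(f"{step:>3}:   {out:<22}{inp}")
```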

1551	   Some applications may find it more efficient to implement the
1552	   remove_dot_segments algorithm using two segment stacks rather than
1553	   strings.

1555	      Note: Beware that some older, erroneous implementations will fail
1556	      to separate a reference's query component from its path component
1557	      prior to merging the base and reference paths, resulting in an
1558	      interoperability failure if the query component contains the
1559	      strings "/../" or "/./".

1561	5.3  Component Recomposition

1563	   Parsed URI components can be recomposed to obtain the corresponding
1564	   URI reference string.  Using pseudocode, this would be:

1566	      result = "";

1568	      if defined(scheme) then
1569	         append scheme to result;
1570	         append ":" to result;
1571	      endif;

1573	      if defined(authority) then
1574	         append "//" to result;
1575	         append authority to result;
1576	      endif;

1578	      append path to result;

1580	      if defined(query) then
1581	         append "?" to result;
1582	         append query to result;
1583	      endif;

1585	      if defined(fragment) then
1586	         append "#" to result;
1587	         append fragment to result;
1588	      endif;

1590	      return result;

1592	   Note that we are careful to preserve the distinction between a
1593	   component that is undefined, meaning that its separator was not
1594	   present in the reference, and a component that is empty, meaning that
1595	   the separator was present and was immediately followed by the next
1596	   component separator or the end of the reference.
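
	   In a language with a distinct null value, that distinction is
	   easy to keep.  The following Python sketch of the recomposition
	   uses None for an undefined component and "" for an empty one;
	   the function name is hypothetical.

```python
def recompose(scheme, authority, path, query, fragment):
    """Rebuild a URI reference string from its five parsed components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path  # the path is never undefined, though it may be ""
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result

print(recompose("http", "a", "/b", None, None))  # http://a/b
print(recompose("http", "a", "/b", "", None))    # http://a/b?
```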

1598	5.4  Reference Resolution Examples

1600	   Within a representation with a well-defined base URI of

1602	      http://a/b/c/d;p?q

1604	   a relative reference is transformed to its target URI as follows.

1606	5.4.1  Normal Examples

1608	      "g:h"           =  "g:h"
1609	      "g"             =  "http://a/b/c/g"
1610	      "./g"           =  "http://a/b/c/g"
1611	      "g/"            =  "http://a/b/c/g/"
1612	      "/g"            =  "http://a/g"
1613	      "//g"           =  "http://g"
1614	      "?y"            =  "http://a/b/c/d;p?y"
1615	      "g?y"           =  "http://a/b/c/g?y"
1616	      "#s"            =  "http://a/b/c/d;p?q#s"
1617	      "g#s"           =  "http://a/b/c/g#s"
1618	      "g?y#s"         =  "http://a/b/c/g?y#s"
1619	      ";x"            =  "http://a/b/c/;x"
1620	      "g;x"           =  "http://a/b/c/g;x"
1621	      "g;x?y#s"       =  "http://a/b/c/g;x?y#s"
1622	      ""              =  "http://a/b/c/d;p?q"
1623	      "."             =  "http://a/b/c/"
1624	      "./"            =  "http://a/b/c/"
1625	      ".."            =  "http://a/b/"
1626	      "../"           =  "http://a/b/"
1627	      "../g"          =  "http://a/b/g"
1628	      "../.."         =  "http://a/"
1629	      "../../"        =  "http://a/"
1630	      "../../g"       =  "http://a/g"
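
	   Implementers can use the table above as a test suite.  As one
	   possible harness, the Python sketch below checks a subset of the
	   examples against the standard library's urljoin, on the
	   assumption that a recent Python 3 release is used; any resolver
	   with the same call shape could be substituted.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# A subset of the normal examples: reference -> expected target URI.
EXAMPLES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "?y": "http://a/b/c/d;p?y",
    "g?y": "http://a/b/c/g?y",
    "#s": "http://a/b/c/d;p?q#s",
    "": "http://a/b/c/d;p?q",
    "..": "http://a/b/",
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
}

def failures(resolver=urljoin):
    """Return the (reference, got, expected) triples that do not match."""
    return [(ref, resolver(BASE, ref), want)
            for ref, want in EXAMPLES.items()
            if resolver(BASE, ref) != want]

print(failures())
```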

1632	5.4.2  Abnormal Examples

1634	   Although the following abnormal examples are unlikely to occur in
1635	   normal practice, all URI parsers should be capable of resolving them
1636	   consistently.  Each example uses the same base as above.

1638	   Parsers must be careful in handling cases where there are more ".."
1639	   segments in a relative-path reference than there are hierarchical
1640	   levels in the base URI's path.  Note that the ".." syntax cannot be
1641	   used to change the authority component of a URI.

1643	      "../../../g"    =  "http://a/g"
1644	      "../../../../g" =  "http://a/g"

1646	   Similarly, parsers must remove the dot-segments "." and ".." when
1647	   they are complete components of a path, but not when they are only
1648	   part of a segment.

1650	      "/./g"          =  "http://a/g"
1651	      "/../g"         =  "http://a/g"
1652	      "g."            =  "http://a/b/c/g."
1653	      ".g"            =  "http://a/b/c/.g"
1654	      "g.."           =  "http://a/b/c/g.."
1655	      "..g"           =  "http://a/b/c/..g"

1657	   Less likely are cases where the relative reference uses unnecessary
1658	   or nonsensical forms of the "." and ".." complete path segments.

1660	      "./../g"        =  "http://a/b/g"
1661	      "./g/."         =  "http://a/b/c/g/"
1662	      "g/./h"         =  "http://a/b/c/g/h"
1663	      "g/../h"        =  "http://a/b/c/h"
1664	      "g;x=1/./y"     =  "http://a/b/c/g;x=1/y"
1665	      "g;x=1/../y"    =  "http://a/b/c/y"

1667	   Some applications fail to separate the reference's query and/or
1668	   fragment components from the path component before merging it with
1669	   the base path and removing dot-segments.  This error is rarely
1670	   noticed, since typical usage of a fragment never includes the
1671	   hierarchy ("/") character, and the query component is not normally
1672	   used within relative references.

1674	      "g?y/./x"       =  "http://a/b/c/g?y/./x"
1675	      "g?y/../x"      =  "http://a/b/c/g?y/../x"
1676	      "g#s/./x"       =  "http://a/b/c/g#s/./x"
1677	      "g#s/../x"      =  "http://a/b/c/g#s/../x"
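
   The separation step described above can be sketched as follows.  This
   is an illustrative helper, not part of this specification; the name
   split_reference is hypothetical.  The fragment is split off first
   because "?" is permitted within a fragment, and only the returned
   path would then be merged with the base path and have its
   dot-segments removed.

```python
def split_reference(ref):
    """Split a reference into (path, query, fragment).

    A component that is absent is returned as None, so that an empty
    component (a bare "?" or "#") stays distinguishable from a missing one.
    """
    rest, hash_mark, fragment = ref.partition("#")
    path, question_mark, query = rest.partition("?")
    return (
        path,
        query if question_mark else None,
        fragment if hash_mark else None,
    )
```

   For instance, split_reference("g?y/../x") keeps "y/../x" intact as the
   query, so the ".." inside it is never treated as a path segment.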

1679	   Some parsers allow the scheme name to be present in a relative
1680	   reference if it is the same as the base URI scheme.  This is
1681	   considered to be a loophole in prior specifications of partial URI
1682	   [RFC1630].  Its use should be avoided, but is allowed for backward
1683	   compatibility.

1685	      "http:g"        =  "http:g"         ; for strict parsers
1686	                      /  "http://a/b/c/g" ; for backward compatibility

1688	6.  Normalization and Comparison

1690	   One of the most common operations on URIs is simple comparison:
1691	   determining if two URIs are equivalent without using the URIs to
1692	   access their respective resource(s).  A comparison is performed every
1693	   time a response cache is accessed, a browser checks its history to
1694	   color a link, or an XML parser processes tags within a namespace.
1695	   Extensive normalization prior to comparison of URIs is often used by
1696	   spiders and indexing engines to prune a search space or reduce
1697	   duplication of request actions and response storage.

1699	   URI comparison is performed with respect to some particular purpose,
1700	   and implementations with differing purposes will often be subject to
1701	   differing design trade-offs in regard to how much effort should be
1702	   spent in reducing aliased identifiers.  This section describes a
1703	   variety of methods that may be used to compare URIs, the trade-offs
1704	   between them, and the types of applications that might use them.

1706	6.1  Equivalence

1708	   Since URIs exist to identify resources, presumably they should be
1709	   considered equivalent when they identify the same resource.  However,
1710	   such a definition of equivalence is not of much practical use, since
1711	   there is no way for an implementation to compare two resources that
1712	   are not under its own control.  For this reason, determination of
1713	   equivalence or difference of URIs is based on string comparison,
1714	   perhaps augmented by reference to additional rules provided by URI
1715	   scheme definitions.  We use the terms "different" and "equivalent" to
1716	   describe the possible outcomes of such comparisons, but there are
1717	   many application-dependent versions of equivalence.

1719	   Even though it is possible to determine that two URIs are equivalent,
1720	   URI comparison is not sufficient to determine if two URIs identify
1721	   different resources.  For example, an owner of two different domain
1722	   names could decide to serve the same resource from both, resulting in
1723	   two different URIs.  Therefore, comparison methods are designed to
1724	   minimize false negatives while strictly avoiding false positives.

1726	   In testing for equivalence, applications should not directly compare
1727	   relative references; the references should be converted to their
1728	   respective target URIs before comparison.  When URIs are being
1729	   compared for the purpose of selecting (or avoiding) a network action,
1730	   such as retrieval of a representation, fragment components (if any)
1731	   should be excluded from the comparison.
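
   A minimal, non-normative sketch of these two recommendations in
   Python: references are first resolved to target URIs, and the
   fragment is then dropped before the result is used as a comparison
   key for retrieval.  The function name retrieval_key is hypothetical;
   urljoin and urldefrag are from the standard urllib.parse module
   (Python 3.5 or later is assumed).

```python
from urllib.parse import urldefrag, urljoin


def retrieval_key(base, reference):
    """Resolve a reference against base and strip any fragment."""
    target = urljoin(base, reference)
    return urldefrag(target).url
```

   With the base URI of Section 5.4, the references "g#s" and "./g"
   produce the same key, since they would trigger the same retrieval.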

1733	6.2  Comparison Ladder

1735	   A variety of methods are used in practice to test URI equivalence.
1736	   These methods fall into a range, distinguished by the amount of
1737	   processing required and the degree to which the probability of false
1738	   negatives is reduced.  As noted above, false negatives cannot be
1739	   eliminated.  In practice, their probability can be reduced, but this
1740	   reduction requires more processing and is not cost-effective for all
1741	   applications.

1743	   If this range of comparison practices is considered as a ladder, the
1744	   following discussion will climb the ladder, starting with those
1745	   practices that are cheap but have a relatively higher chance of
1746	   producing false negatives, and proceeding to those that have higher
1747	   computational cost and lower risk of false negatives.

1749	6.2.1  Simple String Comparison

1751	   If two URIs, considered as character strings, are identical, then it
1752	   is safe to conclude that they are equivalent.  This type of
1753	   equivalence test has very low computational cost and is in wide use
1754	   in a variety of applications, particularly in the domain of parsing.

1756	   Testing strings for equivalence requires some basic precautions.
1757	   This procedure is often referred to as "bit-for-bit" or
1758	   "byte-for-byte" comparison, which is potentially misleading.  Testing
1759	   of strings for equality is normally based on pairwise comparison of
1760	   the characters that make up the strings, starting from the first and
1761	   proceeding until both strings are exhausted and all characters are
1762	   found to be equal, until a pair of characters compares unequal, or
1763	   until one of the strings is exhausted before the other.

1765	   Such character comparisons require that each pair of characters be
1766	   put in comparable form.  For example, should one URI be stored in a
1767	   byte array in EBCDIC encoding, and the second be in a Java String
1768	   object (UTF-16), bit-for-bit comparisons applied naively will produce
1769	   errors.  It is better to speak of equality on a
1770	   character-for-character rather than byte-for-byte or bit-for-bit
1771	   basis.  In practical terms, character-by-character comparisons should
1772	   be done codepoint-by-codepoint after conversion to a common character
1773	   encoding.
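
   The following Python sketch illustrates the point under stated
   assumptions: code page 037 stands in for "EBCDIC" and UTF-16
   little-endian octets stand in for the contents of a Java String.  The
   helper name same_uri_text is hypothetical.  Python compares str
   objects codepoint by codepoint, so decoding both inputs to str is the
   "conversion to a common character encoding" step.

```python
URI = "http://example.com/"

ebcdic_octets = URI.encode("cp037")      # EBCDIC (code page 037) octets
utf16_octets = URI.encode("utf-16-le")   # UTF-16 code units as octets


def same_uri_text(a, a_encoding, b, b_encoding):
    """Compare two encoded URIs character-for-character."""
    return a.decode(a_encoding) == b.decode(b_encoding)


naive_equal = ebcdic_octets == utf16_octets
character_equal = same_uri_text(ebcdic_octets, "cp037",
                                utf16_octets, "utf-16-le")
```

   Here the byte-for-byte test reports a difference even though both
   arrays hold the same URI; the character-for-character test does not.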

1775	   False negatives are caused by the production and use of URI aliases.
1776	   Unnecessary aliases can be reduced, regardless of the comparison
1777	   method, by consistently providing URI references in an
1778	   already-normalized form (i.e., a form identical to what would be
1779	   produced after normalization is applied, as described below).
1780	   Protocols and data formats often choose to limit some URI comparisons
1781	   to simple string comparison, based on the theory that people and
1782	   implementations will, in their own best interest, be consistent in
1783	   providing URI references, or at least consistent enough to negate any
1784	   efficiency that might be obtained from further normalization.

1786	6.2.2  Syntax-based Normalization

1788	   Implementations may use logic based on the definitions provided by
1789	   this specification to reduce the probability of false negatives.
1790	   Such processing is moderately higher in cost than
1791	   character-for-character string comparison.  For example, an
1792	   application using this approach could reasonably consider the
1793	   following two URIs equivalent:

1795	      example://a/b/c/%7Bfoo%7D
1796	      eXAMPLE://a/./b/../b/%63/%7bfoo%7d

1798	   Web user agents, such as browsers, typically apply this type of URI
1799	   normalization when determining whether a cached response is
1800	   available.  Syntax-based normalization includes such techniques as
1801	   case normalization, percent-encoding normalization, and removal of
1802	   dot-segments.

1804	6.2.2.1  Case Normalization

1806	   For all URIs, the hexadecimal digits within a percent-encoding
1807	   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1808	   should be normalized to use uppercase letters for the digits A-F.

1810	   When a URI uses components of the generic syntax, the component
1811	   syntax equivalence rules always apply; namely, that the scheme and
1812	   host are case-insensitive and therefore should be normalized to
1813	   lowercase.  For example, the URI <HTTP://www.EXAMPLE.com/> is
1814	   equivalent to <http://www.example.com/>.  The other generic syntax
1815	   components are assumed to be case-sensitive unless specifically
1816	   defined otherwise by the scheme (see Section 6.2.3).
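
   A non-normative Python sketch of case normalization follows.  It
   lowercases the scheme and the host (together with the port, which
   consists only of digits) and uppercases the hexadecimal digits of
   every percent-encoding triplet, leaving all other components
   untouched.  The function name normalize_case and the two regular
   expressions are introduced here for illustration only.

```python
import re

# scheme ":" followed by an optional "//" authority
_PREFIX = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(?://([^/?#]*))?")
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(uri):
    """Lowercase scheme and host; uppercase hex digits in %XX triplets."""
    m = _PREFIX.match(uri)
    if m:
        scheme = m.group(1).lower()
        authority = m.group(2)
        rest = uri[m.end():]
        if authority is None:
            uri = scheme + ":" + rest
        else:
            # userinfo is case-sensitive; only host[:port] is lowercased
            userinfo, at, hostport = authority.rpartition("@")
            uri = scheme + "://" + userinfo + at + hostport.lower() + rest
    # Done last so that triplets inside the host end up uppercase too.
    return _PCT.sub(lambda t: t.group(0).upper(), uri)
```

   Applied to <HTTP://www.EXAMPLE.com/>, this sketch produces
   <http://www.example.com/>.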

1818	6.2.2.2  Percent-Encoding Normalization

1820	   The percent-encoding mechanism (Section 2.1) is a frequent source of
1821	   variance among otherwise identical URIs.  In addition to the case
1822	   normalization issue noted above, some URI producers percent-encode
1823	   octets that do not require percent-encoding, resulting in URIs that
1824	   are equivalent to their non-encoded counterparts.  Such URIs should
1825	   be normalized by decoding any percent-encoded octet that corresponds
1826	   to an unreserved character, as described in Section 2.3.
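
   This step can be sketched in Python as follows (non-normative).  Only
   triplets whose octet corresponds to an unreserved character (ALPHA,
   DIGIT, "-", ".", "_", or "~") are decoded; every other triplet is
   left exactly as found.  The function name decode_unreserved is
   hypothetical.

```python
import re

_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode percent-encoded octets that map to unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else match.group(0)
    return _PCT.sub(replace, uri)
```

   For example, "%63" becomes "c" and "%7E" becomes "~", while "%7B"
   and "%2F" are preserved because "{" and "/" are not unreserved.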

1828	6.2.2.3  Path Segment Normalization

1830	   The complete path segments "." and ".." are intended only for use
1831	   within relative references (Section 4.1) and are removed as part of
1832	   the reference resolution process (Section 5.2).  However, some
1833	   deployed implementations incorrectly assume that reference resolution
1834	   is not necessary when the reference is already a URI, and thus fail
1835	   to remove dot-segments when they occur in non-relative paths.  URI
1836	   normalizers should remove dot-segments by applying the
1837	   remove_dot_segments algorithm to the path, as described in
1838	   Section 5.2.4.
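
   The following Python function is a non-normative sketch of that
   algorithm, written from the steps of Section 5.2.4 as understood
   here: an input buffer is consumed from the left while completed
   segments accumulate in an output list.  It operates on the path
   component only; the query and fragment are assumed to have been
   separated beforehand.

```python
def remove_dot_segments(path):
    """Remove "." and ".." complete path segments from a path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]               # "/./x" -> "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]               # "/../x" -> "/x"
            if output:
                output.pop()              # drop the last output segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/" if any,
            # from the input buffer to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

   Applied to the path of the second URI in Section 6.2.2,
   "/a/./b/../b/%63/%7bfoo%7d", the sketch yields "/a/b/%63/%7bfoo%7d".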

1840	6.2.3  Scheme-based Normalization

1842	   The syntax and semantics of URIs vary from scheme to scheme, as
1843	   described by the defining specification for each scheme.
1844	   Implementations may use scheme-specific rules, at further processing
1845	   cost, to reduce the probability of false negatives.  For example,
1846	   since the "http" scheme makes use of an authority component, has a
1847	   default port of "80", and defines an empty path to be equivalent to
1848	   "/", the following four URIs are equivalent:

1850	      http://example.com
1851	      http://example.com/
1852	      http://example.com:/
1853	      http://example.com:80/

1855	   In general, a URI that uses the generic syntax for authority with an
1856	   empty path should be normalized to a path of "/"; likewise, an
1857	   explicit ":port", where the port is empty or the default for the
1858	   scheme, is equivalent to one where the port and its ":" delimiter are
1859	   elided, and thus should be removed by scheme-based normalization.
1860	   For example, the second URI above is the normal form for the "http"
1861	   scheme.
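
   A non-normative Python sketch of these two rules for the "http"
   scheme follows.  It assumes that case normalization has already been
   applied (so the scheme is spelled "http") and it compares the port
   only as the literal string "80".  The function name normalize_http
   and the regular expressions are hypothetical.

```python
import re

_HTTP = re.compile(r"^http://([^/?#]*)([^?#]*)(.*)$", re.DOTALL)
_HOSTPORT = re.compile(r"^(.*?)(?::([0-9]*))?$", re.DOTALL)


def normalize_http(uri):
    """Drop an empty or default port and replace an empty path by "/"."""
    m = _HTTP.match(uri)
    if m is None:
        return uri  # not an "http" URI with an authority; left untouched
    authority, path, tail = m.groups()
    userinfo, at, hostport = authority.rpartition("@")
    host, port = _HOSTPORT.match(hostport).groups()
    if port not in (None, "", "80"):
        host = host + ":" + port
    # tail holds "?query" and/or "#fragment" and is copied verbatim,
    # so a bare "?" or "#" delimiter is never removed.
    return "http://" + userinfo + at + host + (path or "/") + tail
```

   All four URIs listed above map to <http://example.com/>, whereas
   <http://example.com/?> is returned unchanged.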

1863	   Another case where normalization varies by scheme is in the handling
1864	   of an empty authority component or empty host subcomponent.  For many
1865	   scheme specifications, an empty authority or host is considered an
1866	   error; for others, it is considered equivalent to "localhost" or the
1867	   end-user's host.  When a scheme defines a default for authority and a
1868	   URI reference to that default is desired, the reference should be
1869	   normalized to an empty authority for the sake of uniformity, brevity,
1870	   and internationalization.  If, however, either the userinfo or port
1871	   subcomponent is non-empty, then the host should be given explicitly
1872	   even if it matches the default.

1874	   Normalization should not remove delimiters when their associated
1875	   component is empty unless licensed to do so by the scheme
1876	   specification.  For example, the URI "http://example.com/?" cannot be
1877	   assumed to be equivalent to any of the examples above.  Likewise, the
1878	   presence or absence of delimiters within a userinfo subcomponent is
1879	   usually significant to its interpretation.  The fragment component is
1880	   not subject to any scheme-based normalization; thus, two URIs that
1881	   differ only by the suffix "#" are considered different regardless of
1882	   the scheme.

1884	   Some schemes define additional subcomponents that consist of
1885	   case-insensitive data, giving an implicit license to normalizers to
1886	   convert such data to a common case (e.g., all lowercase).  For
1887	   example, URI schemes that define a subcomponent of path to contain an
1888	   Internet hostname, such as the "mailto" URI scheme, cause that
1889	   subcomponent to be case-insensitive and thus subject to case
1890	   normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
1891	   "mailto:Joe@example.com" even though the generic syntax considers the
1892	   path component to be case-sensitive).

1894	   Other scheme-specific normalizations are possible.

1896	6.2.4  Protocol-based Normalization

1898	   Web spiders, for which substantial effort to reduce the incidence of
1899	   false negatives is often cost-effective, are observed to implement
1900	   even more aggressive techniques in URI comparison.  For example, if
1901	   they observe that a URI such as

1903	      http://example.com/data

1905	   redirects to a URI differing only in the trailing slash

1907	      http://example.com/data/

1909	   they will likely regard the two as equivalent in the future.  This
1910	   kind of technique is only appropriate when equivalence is clearly
1911	   indicated by both the result of accessing the resources and the
1912	   common conventions of their scheme's dereference algorithm (in this
1913	   case, use of redirection by HTTP origin servers to avoid problems
1914	   with relative references).

1916	7.  Security Considerations

1918	   A URI does not in itself pose a security threat.  However, since URIs
1919	   are often used to provide a compact set of instructions for access to
1920	   network resources, care must be taken to properly interpret the data
1921	   within a URI, to prevent that data from causing unintended access,
1922	   and to avoid including data that should not be revealed in plain
1923	   text.

1925	7.1  Reliability and Consistency

1927	   There is no guarantee that, having once used a given URI to retrieve
1928	   some information, the same information will be retrievable by that
1929	   URI in the future.  Nor is there any guarantee that the information
1930	   retrievable via that URI in the future will be observably similar to
1931	   that retrieved in the past.  The URI syntax does not constrain how a
1932	   given scheme or authority apportions its name space or maintains it
1933	   over time.  Such a guarantee can only be obtained from the person(s)
1934	   controlling that name space and the resource in question.  A specific
1935	   URI scheme may define additional semantics, such as name persistence,
1936	   if those semantics are required of all naming authorities for that
1937	   scheme.

1939	7.2  Malicious Construction

1941	   It is sometimes possible to construct a URI such that an attempt to
1942	   perform a seemingly harmless, idempotent operation, such as the
1943	   retrieval of a representation, will in fact cause a possibly damaging
1944	   remote operation to occur.  The unsafe URI is typically constructed
1945	   by specifying a port number other than that reserved for the network
1946	   protocol in question.  The client unwittingly contacts a site that is
1947	   running a different protocol service and data within the URI contains
1948	   instructions that, when interpreted according to this other protocol,
1949	   cause an unexpected operation.  A frequent example of such abuse has
1950	   been the use of a protocol-based scheme with a port component of
1951	   "25", thereby fooling user agent software into sending an unintended
1952	   or impersonating message via an SMTP server.

1954	   Applications should prevent dereference of a URI that specifies a TCP
1955	   port number within the "well-known port" range (0 - 1023) unless the
1956	   protocol being used to dereference that URI is compatible with the
1957	   protocol expected on that well-known port.  Although IANA maintains a
1958	   registry of well-known ports, applications should make such
1959	   restrictions user-configurable to avoid preventing the deployment of
1960	   new services.
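
   One possible shape for such a check is sketched below in Python.  It
   is illustrative only: the ALLOWED_PORTS table is a hypothetical,
   user-configurable policy, not a list defined by this specification
   or by IANA, and the function name port_permitted is likewise an
   assumption.

```python
WELL_KNOWN_MAX = 1023

# Hypothetical policy: scheme -> well-known ports it may be dereferenced on.
ALLOWED_PORTS = {
    "http": {80},
    "ftp": {21},
}


def port_permitted(scheme, port, allowed=ALLOWED_PORTS):
    """Return True if dereferencing scheme on port passes the policy.

    port is an int, or None when the URI carries no explicit port.
    """
    if port is None or port > WELL_KNOWN_MAX:
        return True
    return port in allowed.get(scheme, set())
```

   Under this policy an "http" URI naming port 25 is refused, while one
   naming port 8080 (outside the well-known range) is not restricted.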

1962	   When a URI contains percent-encoded octets that match the delimiters
1963	   for a given resolution or dereference protocol (for example, CR and
1964	   LF characters for the TELNET protocol), such percent-encoded octets
1965	   must not be decoded before transmission across that protocol.
1966	   Transfer of the percent-encoding, which might violate the protocol,
1967	   is less harmful than allowing decoded octets to be interpreted as
1968	   additional operations or parameters, perhaps triggering an unexpected
1969	   and possibly harmful remote operation.

1971	7.3  Back-end Transcoding

1973	   When a URI is dereferenced, the data within it is often parsed by
1974	   both the user agent and one or more servers.  In HTTP, for example, a
1975	   typical user agent will parse a URI into its five major components,
1976	   access the authority's server, and send it the data within the
1977	   authority, path, and query components.  A typical server will take
1978	   that information, parse the path into segments and the query into
1979	   key/value pairs, and then invoke implementation-specific handlers to
1980	   respond to the request.  As a result, a common security concern for
1981	   server implementations that handle a URI, either as a whole or split
1982	   into separate components, is proper interpretation of the octet data
1983	   represented by the characters and percent-encodings within that URI.

1985	   Percent-encoded octets must be decoded at some point during the
1986	   dereference process.  Applications must split the URI into its
1987	   components and subcomponents prior to decoding the octets, since
1988	   otherwise the decoded octets might be mistaken for delimiters.
1989	   Security checks of the data within a URI should be applied after
1990	   decoding the octets.  Note, however, that the "%00" percent-encoding
1991	   (NUL) may require special handling and should be rejected if the
1992	   application is not expecting to receive raw data within a component.
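
   The ordering requirement can be illustrated with a short Python
   sketch.  It assumes the common (but scheme- and application-specific)
   convention that a query holds "key=value" fields separated by "&";
   that convention, and the function name query_pairs, are assumptions
   made for this example only.  unquote is from the standard
   urllib.parse module.

```python
from urllib.parse import unquote


def query_pairs(query):
    """Split a query on its delimiters first, then decode each part."""
    pairs = []
    for field in query.split("&"):
        key, _, value = field.partition("=")
        pairs.append((unquote(key), unquote(value)))
    return pairs


QUERY = "q=a%26b%3Dc&lang=en"
correct = query_pairs(QUERY)
# Decoding before splitting lets "%26" masquerade as a field delimiter.
wrong = unquote(QUERY).split("&")
```

   Splitting first yields two fields, the first with the value "a&b=c";
   decoding first yields three fields, because the decoded "&" is
   mistaken for a delimiter.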

1994	   Special care should be taken when the URI path interpretation process
1995	   involves the use of a back-end filesystem or related system
1996	   functions.  Filesystems typically assign an operational meaning to
1997	   special characters, such as the "/", "\", ":", "[", and "]"
1998	   characters, and special device names like ".", "..", "...", "aux",
1999	   "lpt", etc.  In some cases, merely testing for the existence of such
2000	   a name will cause the operating system to pause or invoke unrelated
2001	   system calls, leading to significant security concerns regarding
2002	   denial of service and unintended data transfer.  It would be
2003	   impossible for this specification to list all such significant
2004	   characters and device names; implementers should research the
2005	   reserved names and characters for the types of storage device that
2006	   may be attached to their application and restrict the use of data
2007	   obtained from URI components accordingly.

2009	7.4  Rare IP Address Formats

2011	   Although the URI syntax for IPv4address only allows the common,
2012	   dotted-decimal form of IPv4 address literal, many implementations
2013	   that process URIs make use of platform-dependent system routines,
2014	   such as gethostbyname() and inet_aton(), to translate the string
2015	   literal to an actual IP address.  Unfortunately, such system routines
2016	   often allow and process a much larger set of formats than those
2017	   described in Section 3.2.2.

2019	   For example, many implementations allow dotted forms of three
2020	   numbers, wherein the last part is interpreted as a 16-bit quantity
2021	   and placed in the right-most two bytes of the network address (e.g.,
2022	   a Class B network).  Likewise, a dotted form of two numbers means the
2023	   last part is interpreted as a 24-bit quantity and placed in the
2024	   right-most three bytes of the network address (Class A), and a single
2025	   number (without dots) is interpreted as a 32-bit quantity and stored
2026	   directly in the network address.  Adding further to the confusion,
2027	   some implementations allow each dotted part to be interpreted as
2028	   decimal, octal, or hexadecimal, as specified in the C language (i.e.,
2029	   a leading 0x or 0X implies hexadecimal; otherwise, a leading 0
2030	   implies octal; otherwise, the number is interpreted as decimal).

2032	   These additional IP address formats are not allowed in the URI syntax
2033	   due to differences between platform implementations.  However, they
2034	   can become a security concern if an application attempts to filter
2035	   access to resources based on the IP address in string literal format.
2036	   If such filtering is performed, literals should be converted to
2037	   numeric form and filtered based on the numeric value, rather than a
2038	   prefix or suffix of the string form.
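
   A non-normative Python sketch of numeric filtering follows, using the
   standard ipaddress module.  The BLOCKED network is a hypothetical
   policy and the function name literal_is_blocked is an assumption.
   ipaddress.IPv4Address accepts only four dotted decimal parts when
   given a string, so the rare two-part, three-part, and single-number
   forms are rejected rather than silently reinterpreted.

```python
import ipaddress

BLOCKED = ipaddress.ip_network("10.0.0.0/8")  # hypothetical filter policy


def literal_is_blocked(host):
    """Return True if the IPv4 literal lies inside BLOCKED.

    Raises ValueError if host is not a four-part dotted-decimal literal.
    """
    address = ipaddress.IPv4Address(host)
    return address in BLOCKED
```

   By contrast, a string-prefix test such as host.startswith("10")
   would wrongly match "100.0.0.1", which is outside the blocked
   network; the numeric test above does not.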

2040	7.5  Sensitive Information

2042	   URI producers should not provide a URI that contains a username or
2043	   password that is intended to be secret: URIs are frequently
2044	   displayed by browsers, stored in clear text bookmarks, and logged by
2045	   user agent history and intermediary applications (proxies).  A
2046	   password appearing within the userinfo component is deprecated and
2047	   should be considered an error (or simply ignored) except in those
2048	   rare cases where the 'password' parameter is intended to be public.
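
   An application that chooses to ignore such a password, for example
   before displaying or logging a URI, might do so along the lines of
   the following Python sketch.  The function name strip_password is
   hypothetical, and the input is assumed to be an authority component
   that has already been separated from the rest of the URI.

```python
def strip_password(authority):
    """Drop everything after the first ":" within the userinfo, if any."""
    userinfo, at, hostport = authority.rpartition("@")
    user = userinfo.partition(":")[0]
    return user + at + hostport
```

   The host and port are untouched, so "joe:secret@example.com:21"
   becomes "joe@example.com:21".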

2050	7.6  Semantic Attacks

2052	   Because the userinfo subcomponent is rarely used and appears before
2053	   the host in the authority component, it can be used to construct a
2054	   URI that is intended to mislead a human user by appearing to identify
2055	   one (trusted) naming authority while actually identifying a different
2056	   authority hidden behind the noise.  For example

2058	      ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm

2060	   might lead a human user to assume that the host is 'cnn.example.com',
2061	   whereas it is actually '10.0.0.1'.  Note that a misleading userinfo
2062	   subcomponent could be much longer than the example above.

2064	   A misleading URI, such as the one above, is an attack on the user's
2065	   preconceived notions about the meaning of a URI, rather than an
2066	   attack on the software itself.  User agents may be able to reduce the
2067	   impact of such attacks by distinguishing the various components of
2068	   the URI when rendered, such as by using a different color or tone to
2069	   render userinfo if any is present, though there is no general
2070	   panacea.  More information on URI-based semantic attacks can be found
2071	   in [Siedzik].
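
   The decomposition a user agent would need in order to render the
   components distinctly can be sketched in Python as follows
   (non-normative).  The function name split_authority is hypothetical;
   urlsplit is from the standard urllib.parse module.

```python
from urllib.parse import urlsplit

MISLEADING = "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"


def split_authority(uri):
    """Return (userinfo, host); userinfo is None when no "@" is present."""
    parts = urlsplit(uri)
    userinfo, at, _ = parts.netloc.rpartition("@")
    return (userinfo if at else None), parts.hostname


USERINFO, HOST = split_authority(MISLEADING)
```

   For the example above, HOST is "10.0.0.1" and the text that resembles
   a trusted host name is confined to USERINFO, which could then be
   rendered in a different color or tone.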

2073	8.  IANA Considerations

2075	   URI scheme names, as defined by <scheme> in Section 3.1, form a
2076	   registered name space that is managed by IANA according to the
2077	   procedures defined in [BCP35].  No IANA actions are required by this
2078	   document.

2080	9.  Acknowledgments

2082	   This specification is derived from RFC 2396 [RFC2396], RFC 1808
2083	   [RFC1808], and RFC 1738 [RFC1738]; the acknowledgments in those
2084	   documents still apply.  It also incorporates the update (with
2085	   corrections) for IPv6 literals in the host syntax, as defined by
2086	   Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
2087	   [RFC2732].  In addition, contributions by Gisle Aas, Reese Anschultz,
2088	   Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
2089	   Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
2090	   Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
2091	   Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
2092	   Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
2093	   Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
2094	   Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
2095	   Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
2096	   Stuart Williams, and Henry Zongaro are gratefully acknowledged.

2098	10.  References

2100	10.1  Normative References

2102	   [ASCII]    American National Standards Institute, "Coded Character
2103	              Set -- 7-bit American Standard Code for Information
2104	              Interchange", ANSI X3.4, 1986.

2106	   [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
2107	              Specifications: ABNF", RFC 2234, November 1997.

2109	   [STD63]    Yergeau, F., "UTF-8, a transformation format of ISO
2110	              10646", STD 63, RFC 3629, November 2003.

2112	   [UCS]      International Organization for Standardization,
2113	              "Information Technology - Universal Multiple-Octet Coded
2114	              Character Set (UCS)", ISO/IEC 10646:2003, December 2003.

2116	10.2  Informative References

2118	   [BCP19]    Freed, N. and J. Postel, "IANA Charset Registration
2119	              Procedures", BCP 19, RFC 2978, October 2000.

2121	   [BCP35]    Petke, R. and I. King, "Registration Procedures for URL
2122	              Scheme Names", BCP 35, RFC 2717, November 1999.

2124	   [RFC0952]  Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet
2125	              host table specification", RFC 952, October 1985.

2127	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
2128	              STD 13, RFC 1034, November 1987.

2130	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
2131	              and Support", STD 3, RFC 1123, October 1989.

2133	   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
2134	              With Widely Deployed DNS Software", RFC 1535, October
2135	              1993.

2137	   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
2138	              Unifying Syntax for the Expression of Names and Addresses
2139	              of Objects on the Network as used in the World-Wide Web",
2140	              RFC 1630, June 1994.

2142	   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
2143	              Resource Locators", RFC 1736, February 1995.

2145	   [RFC1737]  Masinter, L. and K. Sollins, "Functional Requirements for
2146	              Uniform Resource Names", RFC 1737, December 1994.

2148	   [RFC1738]  Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
2149	              Resource Locators (URL)", RFC 1738, December 1994.

2151	   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators", RFC
2152	              1808, June 1995.

2154	   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
2155	              Extensions (MIME) Part Two: Media Types", RFC 2046,
2156	              November 1996.

2158	   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

2160	   [RFC2396]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
2161	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
2162	              August 1998.

2164	   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S. and D.
2165	              Jensen, "HTTP Extensions for Distributed Authoring --
2166	              WEBDAV", RFC 2518, February 1999.

2168	   [RFC2557]  Palme, F., Hopmann, A., Shelness, N. and E. Stefferud,
2169	              "MIME Encapsulation of Aggregate Documents, such as HTML
2170	              (MHTML)", RFC 2557, March 1999.

2172	   [RFC2718]  Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
2173	              "Guidelines for new URL Schemes", RFC 2718, November 1999.

2175	   [RFC2732]  Hinden, R., Carpenter, B. and L. Masinter, "Format for
2176	              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.

2178	   [RFC3305]  Mealling, M. and R. Denenberg, "Report from the Joint W3C/
2179	              IETF URI Planning Interest Group: Uniform Resource
2180	              Identifiers (URIs), URLs, and Uniform Resource Names
2181	              (URNs): Clarifications and Recommendations", RFC 3305,
2182	              August 2002.

2184	   [RFC3490]  Faltstrom, P., Hoffman, P. and A. Costello,
2185	              "Internationalizing Domain Names in Applications (IDNA)",
2186	              RFC 3490, March 2003.

2188	   [RFC3513]  Hinden, R. and S. Deering, "Internet Protocol Version 6
2189	              (IPv6) Addressing Architecture", RFC 3513, April 2003.

2191	   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?",
2192	              April 2001, <http://www.giac.org/practical/gsec/
2193	              Richard_Siedzik_GSEC.pdf>.

2195	Authors' Addresses

2197	   Tim Berners-Lee
2198	   World Wide Web Consortium
2199	   Massachusetts Institute of Technology
2200	   77 Massachusetts Avenue
2201	   Cambridge, MA  02139
2202	   USA

2204	   Phone: +1-617-253-5702
2205	   Fax:   +1-617-258-5999
2206	   EMail: timbl@w3.org
2207	   URI:   http://www.w3.org/People/Berners-Lee/

2209	   Roy T. Fielding
2210	   Day Software
2211	   5251 California Ave., Suite 110
2212	   Irvine, CA  92617
2213	   USA

2215	   Phone: +1-949-679-2960
2216	   Fax:   +1-949-679-2972
2217	   EMail: fielding@gbiv.com
2218	   URI:   http://roy.gbiv.com/

2220	   Larry Masinter
2221	   Adobe Systems Incorporated
2222	   345 Park Ave
2223	   San Jose, CA  95110
2224	   USA

2226	   Phone: +1-408-536-3024
2227	   EMail: LMM@acm.org
2228	   URI:   http://larry.masinter.net/

2230	Appendix A.  Collected ABNF for URI

2232	    URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

2234	    hier-part     = "//" authority path-abempty
2235	                  / path-absolute
2236	                  / path-rootless
2237	                  / path-empty

2239	    URI-reference = URI / relative-ref

2241	    absolute-URI  = scheme ":" hier-part [ "?" query ]

2243	    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

2245	    relative-part = "//" authority path-abempty
2246	                  / path-absolute
2247	                  / path-noscheme
2248	                  / path-empty

2250	    scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

2252	    authority     = [ userinfo "@" ] host [ ":" port ]
2253	    userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
2254	    host          = IP-literal / IPv4address / reg-name
2255	    port          = *DIGIT

2257	    IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"

2259	    IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

2261	    IPv6address   =                            6( h16 ":" ) ls32
2262	                  /                       "::" 5( h16 ":" ) ls32
2263	                  / [               h16 ] "::" 4( h16 ":" ) ls32
2264	                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
2265	                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
2266	                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
2267	                  / [ *4( h16 ":" ) h16 ] "::"              ls32
2268	                  / [ *5( h16 ":" ) h16 ] "::"              h16
2269	                  / [ *6( h16 ":" ) h16 ] "::"

2271	    h16           = 1*4HEXDIG
2272	    ls32          = ( h16 ":" h16 ) / IPv4address
2273	    IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet

2275	    dec-octet     = DIGIT                 ; 0-9
2276	                  / %x31-39 DIGIT         ; 10-99
2277	                  / "1" 2DIGIT            ; 100-199
2278	                  / "2" %x30-34 DIGIT     ; 200-249
2279	                  / "25" %x30-35          ; 250-255

2281	    reg-name      = *( unreserved / pct-encoded / sub-delims )

2283	    path          = path-abempty    ; begins with "/" or is empty
2284	                  / path-absolute   ; begins with "/" but not "//"
2285	                  / path-noscheme   ; begins with a non-colon segment
2286	                  / path-rootless   ; begins with a segment
2287	                  / path-empty      ; zero characters

2289	    path-abempty  = *( "/" segment )
2290	    path-absolute = "/" [ segment-nz *( "/" segment ) ]
2291	    path-noscheme = segment-nz-nc *( "/" segment )
2292	    path-rootless = segment-nz *( "/" segment )
2293	    path-empty    = 0<pchar>

2295	    segment       = *pchar
2296	    segment-nz    = 1*pchar
2297	    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
2298	                  ; non-zero-length segment without any colon ":"

2300	    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

2302	    query         = *( pchar / "/" / "?" )
2303	    fragment      = *( pchar / "/" / "?" )

2305	    pct-encoded   = "%" HEXDIG HEXDIG

2307	    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
2308	    reserved      = gen-delims / sub-delims
2309	    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
2310	    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
2311	                  / "*" / "+" / "," / ";" / "="
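
   The rules above translate fairly directly into matchers.  As a
   non-normative illustration, the following Python sketch transcribes
   the dec-octet and IPv4address rules into a regular expression.  The
   names DEC_OCTET, IPV4ADDRESS, and is_ipv4address are chosen for this
   example only and are not defined by this specification.

```python
import re

# dec-octet, with the ABNF alternatives reordered longest-first:
#   250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4address(text):
    """Return True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```

   Under this sketch, "192.0.2.16" is accepted, whereas "256.1.1.1" and
   "01.2.3.4" are rejected, since no dec-octet alternative admits a
   value above 255 or a leading zero.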

2313	Appendix B.  Parsing a URI Reference with a Regular Expression

2315	   Since the "first-match-wins" algorithm is identical to the "greedy"
2316	   disambiguation method used by POSIX regular expressions, it is
2317	   natural and commonplace to use a regular expression for parsing the
2318	   potential five components of a URI reference.

2320	   The following line is the regular expression for breaking down a
2321	   well-formed URI reference into its components.

2323	      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
2324	       12            3  4          5       6  7        8 9

2326	   The numbers in the second line above are only to assist readability;
2327	   they indicate the reference points for each subexpression (i.e., each
2328	   paired parenthesis).  We refer to the value matched for subexpression
2329	   <n> as $<n>.  For example, matching the above expression to

2331	      http://www.ics.uci.edu/pub/ietf/uri/#Related

2333	   results in the following subexpression matches:

2335	      $1 = http:
2336	      $2 = http
2337	      $3 = //www.ics.uci.edu
2338	      $4 = www.ics.uci.edu
2339	      $5 = /pub/ietf/uri/
2340	      $6 = <undefined>
2341	      $7 = <undefined>
2342	      $8 = #Related
2343	      $9 = Related

2345	   where <undefined> indicates that the component is not present, as is
2346	   the case for the query component in the above example.  Therefore, we
2347	   can determine the value of the five components as

2349	      scheme    = $2
2350	      authority = $4
2351	      path      = $5
2352	      query     = $7
2353	      fragment  = $9

2355	   and, going in the opposite direction, we can recreate a URI reference
2356	   from its components using the algorithm of Section 5.3.
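
   As a non-normative illustration, the expression above can be used
   unchanged with Python's "re" module.  That module is a backtracking
   engine rather than a POSIX one, but for this particular expression
   it should yield the same subexpression matches as those listed
   above.  The function names split_uri_reference and recompose are
   assumptions of this sketch; recompose only illustrates the general
   idea of reassembling the components and is not a substitute for the
   algorithm of Section 5.3.

```python
import re

# The expression from this appendix, unchanged.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Return the tuple (scheme, authority, path, query, fragment).

    A component that is not present is returned as None, which plays
    the role of <undefined> in the text above.  The path is always a
    string, possibly empty.
    """
    match = URI_REFERENCE.match(reference)
    # Every group except the path is optional and the path may be
    # empty, so match() never returns None here.
    return match.group(2, 4, 5, 7, 9)


def recompose(scheme, authority, path, query, fragment):
    """Reassemble a URI reference from its five components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

   Keeping None distinct from the empty string lets the sketch
   preserve the difference between an absent component and an empty
   one; for example, "http://a?" splits into an empty query rather
   than an undefined one and recomposes to the same string.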

2358	Appendix C.  Delimiting a URI in Context

2360	   URIs are often transmitted through formats that do not provide a
2361	   clear context for their interpretation.  For example, there are many
2362	   occasions when a URI is included in plain text; examples include text
2363	   sent in electronic mail, USENET news messages, and, most importantly,
2364	   printed on paper.  In such cases, it is important to be able to
2365	   delimit the URI from the rest of the text, and in particular from
2366	   punctuation marks that might be mistaken for part of the URI.

2368	   In practice, URIs are delimited in a variety of ways, but usually
2369	   within double-quotes "http://example.com/", angle brackets
2370	   <http://example.com/>, or just by using whitespace:

2372	      http://example.com/

2374	   These wrappers do not form part of the URI.

2376	   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
2377	   need to be added to break a long URI across lines.  The whitespace
2378	   should be ignored when extracting the URI.

2380	   No whitespace should be introduced after a hyphen ("-") character.
2381	   Because some typesetters and printers may (erroneously) introduce a
2382	   hyphen at the end of line when breaking a line, the interpreter of a
2383	   URI containing a line break immediately after a hyphen should ignore
2384	   all whitespace around the line break, and should be aware that the
2385	   hyphen may or may not actually be part of the URI.

2387	   Using <> angle brackets around each URI is especially recommended as
2388	   a delimiting style for a reference that contains embedded whitespace.

2390	   The prefix "URL:" (with or without a trailing space) was formerly
2391	   recommended as a way to help distinguish a URI from other bracketed
2392	   designators, though it is not commonly used in practice and is no
2393	   longer recommended.

2395	   For robustness, software that accepts user-typed URIs should attempt
2396	   to recognize and strip both delimiters and embedded whitespace.

2398	   For example, the text:

2400	      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
2401	      but you can probably pick it up from <ftp://foo.example.
2402	      com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
2403	      ietf/uri/historical.html#WARNING>.

2405	   contains the URI references

2407	      http://www.w3.org/Addressing/
2408	      ftp://foo.example.com/rfc/
2409	      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
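
   The extraction shown above can be sketched in Python as follows.
   This is a minimal, non-normative illustration: it assumes that
   every span enclosed in double-quotes or angle brackets is a URI
   reference, which real text will not always honor, and it does not
   attempt to handle whitespace-delimited URIs or the hyphenation case
   described earlier.  The names DELIMITED and extract_delimited_uris
   are chosen for this example only.

```python
import re

# A span enclosed in double-quotes, or a span enclosed in angle
# brackets.  Both negated classes also match line breaks, so a URI
# broken across lines is captured as a single span.
DELIMITED = re.compile(r'"([^"]*)"|<([^<>]*)>')


def extract_delimited_uris(text):
    """Return the delimited spans in text with all whitespace removed."""
    found = []
    for match in DELIMITED.finditer(text):
        # Exactly one of the two groups takes part in each match.
        span = match.group(1) if match.group(1) is not None else match.group(2)
        # Drop the embedded spaces, tabs, and line breaks.
        found.append("".join(span.split()))
    return found
```

   Applied to the example text above, this sketch returns the three
   URI references listed, in the order in which they appear.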

2411	Appendix D.  Changes from RFC 2396

2413	D.1  Additions

2415	   An ABNF rule for URI has been introduced to correspond to one common
2416	   usage of the term: an absolute URI with optional fragment.

2418	   IPv6 (and later) literals have been added to the list of possible
2419	   identifiers for the host portion of an authority component, as
2420	   described by [RFC2732], with the addition of "[" and "]" to the
2421	   reserved set and a version flag to anticipate future versions of IP
2422	   literals.  Square brackets are now specified as reserved within the
2423	   authority component and not allowed outside their use as delimiters
2424	   for an IP literal within host.  In order to make this change without
2425	   changing the technical definition of the path, query, and fragment
2426	   components, those rules were redefined to directly specify the
2427	   characters allowed.

2429	   Since [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
2430	   address, which unfortunately lacks an ABNF description of
2431	   IPv6address, we created a new ABNF rule for IPv6address that matches
2432	   the text representations defined by Section 2.2 of [RFC3513].
2433	   Likewise, the definition of IPv4address has been improved in order to
2434	   limit each decimal octet to the range 0-255.

2436	   Section 6 on URI normalization and comparison has been completely
2437	   rewritten and extended using input from Tim Bray and discussion
2438	   within the W3C Technical Architecture Group.

2440	D.2  Modifications

2442	   The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
2443	   [RFC2234].  This change required all rule names that formerly
2444	   included underscore characters to be renamed with a dash instead.  In
2445	   addition, a number of syntax rules have been eliminated or simplified
2446	   to make the overall grammar more comprehensible.  Specifications that
2447	   refer to the obsolete grammar rules may be understood by replacing
2448	   those rules according to the following table:

2450	   +----------------+--------------------------------------------------+
2451	   | obsolete rule  | translation                                      |
2452	   +----------------+--------------------------------------------------+
2453	   | absoluteURI    | absolute-URI                                     |
2454	   | relativeURI    | relative-part [ "?" query ]                      |
2455	   | hier_part      | ( "//" authority path-abempty /                  |
2456	   |                |   path-absolute ) [ "?" query ]                  |
2457	   |                |                                                  |
2458	   | opaque_part    | path-rootless [ "?" query ]                      |
2459	   | net_path       | "//" authority path-abempty                      |
2460	   | abs_path       | path-absolute                                    |
2461	   | rel_path       | path-rootless                                    |
2462	   | rel_segment    | segment-nz-nc                                    |
2463	   | reg_name       | reg-name                                         |
2464	   | server         | authority                                        |
2465	   | hostport       | host [ ":" port ]                                |
2466	   | hostname       | reg-name                                         |
2467	   | path_segments  | path-abempty                                     |
2468	   | param          | *<pchar excluding ";">                           |
2469	   |                |                                                  |
2470	   | uric           | unreserved / pct-encoded / ";" / "?" / ":"       |
2471	   |                |  / "@" / "&" / "=" / "+" / "$" / "," / "/"       |
2472	   |                |                                                  |
2473	   | uric_no_slash  | unreserved / pct-encoded / ";" / "?" / ":"       |
2474	   |                |  / "@" / "&" / "=" / "+" / "$" / ","             |
2475	   |                |                                                  |
2476	   | mark           | "-" / "_" / "." / "!" / "~" / "*" / "'"          |
2477	   |                |  / "(" / ")"                                     |
2478	   |                |                                                  |
2479	   | escaped        | pct-encoded                                      |
2480	   | hex            | HEXDIG                                           |
2481	   | alphanum       | ALPHA / DIGIT                                    |
2482	   +----------------+--------------------------------------------------+

2484	   Use of the above obsolete rules for the definition of scheme-specific
2485	   syntax is deprecated.

2487	   Section 2 on characters has been rewritten to explain what characters
2488	   are reserved, when they are reserved, and why they are reserved even
2489	   when not used as delimiters by the generic syntax.  The mark
2490	   characters that are typically unsafe to decode, including the
2491	   exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open
2492	   and close parentheses ("(" and ")"), have been moved to the reserved
2493	   set in order to clarify the distinction between reserved and
2494	   unreserved and hopefully answer the most common question of scheme
2495	   designers.  Likewise, the section on percent-encoded characters has
2496	   been rewritten, and URI normalizers are now given license to decode
2497	   any percent-encoded octets corresponding to unreserved characters.
2498	   In general, the terms "escaped" and "unescaped" have been replaced
2499	   with "percent-encoded" and "decoded", respectively, to reduce
2500	   confusion with other forms of escape mechanisms.
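
   The license given to normalizers can be illustrated with a short,
   non-normative Python sketch.  It decodes only those percent-encoded
   octets that correspond to characters in the unreserved set and
   leaves every other triplet exactly as written.  The names
   UNRESERVED, PCT_ENCODED, and decode_unreserved are assumptions of
   this example.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode percent-encoded octets that map to unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved characters and non-ASCII octets stay encoded.
        return match.group(0)

    return PCT_ENCODED.sub(replace, uri)
```

   For instance, "%7Euser/%41%2Fb" becomes "~user/A%2Fb": the tilde
   and the letter are unreserved and are decoded, while "%2F" stays
   encoded because "/" is a reserved delimiter.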

2502	   The ABNF for URI and URI-reference has been redesigned to make them
2503	   more friendly to LALR parsers and reduce complexity.  As a result,
2504	   the layout form of syntax description has been removed, along with
2505	   the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
2506	   path_segments, rel_segment, and mark rules.  All references to
2507	   "opaque" URIs have been replaced with a better description of how the
2508	   path component may be opaque to hierarchy.  The relativeURI rule has
2509	   been replaced with relative-ref to avoid unnecessary confusion over
2510	   whether or not they are a subset of URI.  The ambiguity regarding the
2511	   parsing of URI-reference as a URI or a relative-ref with a colon in
2512	   the first segment has been eliminated through the use of five
2513	   separate path matching rules.

2515	   The fragment identifier has been moved back into the section on
2516	   generic syntax components and within the URI and relative-ref rules,
2517	   though it remains excluded from absolute-URI.  The number sign ("#")
2518	   character has been moved back to the reserved set as a result of
2519	   reintegrating the fragment syntax.

2521	   The ABNF has been corrected to allow the path component to be empty.
2522	   This also allows an absolute-URI to consist of nothing after the
2523	   "scheme:", as is present in practice with the "dav:" namespace
2524	   [RFC2518] and the "about:" scheme used internally by many WWW browser
2525	   implementations.  The ambiguity regarding the boundary between
2526	   authority and path has been eliminated through the use of five
2527	   separate path matching rules.

2529	   Registry-based naming authorities that use the generic syntax are now
2530	   defined within the host rule.  This change allows current
2531	   implementations, where whatever name is provided is simply fed to
2532	   the local name resolution mechanism, to be consistent with the
2533	   specification and removes the need to re-specify DNS name formats
2534	   here.  It also allows the host component to contain percent-encoded
2535	   octets, which is necessary to enable internationalized domain names
2536	   to be provided in URIs, processed in their native character encodings
2537	   at the application layers above URI processing, and passed to an IDNA
2538	   library as a registered name in the UTF-8 character encoding.  The
2539	   server, hostport, hostname, domainlabel, toplabel, and alphanum rules
2540	   have been removed.

2542	   The algorithm of [RFC2396] for resolving relative references has
2543	   been rewritten in pseudocode for this revision to improve clarity
2544	   and fix the following issues:

2546	   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
2547	      with no path.

2549	   o  Restored the behavior of [RFC1808] where, if the reference
2550	      contains an empty path and a defined query component, then the
2551	      target URI inherits the base URI's path component.

2553	   o  The determination of whether a URI reference is a same-document
2554	      reference has been decoupled from the URI parser, simplifying the
2555	      URI processing interface within applications in a way consistent
2556	      with the internal architecture of deployed URI processing
2557	      implementations.  The determination is now based on comparison to
2558	      the base URI after transforming a reference to absolute form,
2559	      rather than on the format of the reference itself.  This change
2560	      may result in more references being considered "same-document"
2561	      under this specification than would be under the rules given in
2562	      RFC 2396, especially when normalization is used to reduce aliases.
2563	      However, it does not change the status of existing same-document
2564	      references.

2566	   o  Separated the path merge routine into two routines: merge, for
2567	      describing combination of the base URI path with a relative-path
2568	      reference, and remove_dot_segments, for describing how to remove
2569	      the special "." and ".." segments from a composed path.  The
2570	      remove_dot_segments algorithm is now applied to all URI reference
2571	      paths in order to match common implementations and improve the
2572	      normalization of URIs in practice.  This change only impacts the
2573	      parsing of abnormal references and same-scheme references wherein
2574	      the base URI has a non-hierarchical path.
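
   The division of labor between the two routines named in the last
   item can be sketched in Python as follows.  This is a non-normative
   illustration written from the behavior described here; the
   pseudocode in the body of this document remains the definition.
   The parameter names, and the use of a boolean to indicate whether
   the base URI has an authority component, are assumptions of this
   sketch.

```python
def merge(base_has_authority, base_path, reference_path):
    """Combine a base URI path with a relative-path reference."""
    if base_has_authority and base_path == "":
        # A base URI with an authority but no path.
        return "/" + reference_path
    # Keep the base path up to and including its last "/", if any.
    return base_path[: base_path.rfind("/") + 1] + reference_path


def remove_dot_segments(path):
    """Remove the special "." and ".." segments from a path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/" if any,
            # from the input to the output.
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)
```

   With a base path of "/b/c/d;p" and the reference path "../g", merge
   yields "/b/c/../g" and remove_dot_segments then reduces it to
   "/b/g".  Keeping the two steps separate allows remove_dot_segments
   to be applied to paths that were never merged.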

2576	Appendix E.  Instructions to RFC Editor

2578	   Prior to publication as an RFC, please remove this section and the
2579	   "Editorial Note" that appears after the Abstract.  If [BCP35] or any
2580	   of the normative references are updated prior to publication, the
2581	   associated reference in this document can be safely updated as well.
2582	   This document has been produced using the xml2rfc tool set; the XML
2583	   version can be obtained via the URI listed in the editorial note.

2585	Index

2587	A
2588	   ABNF  11
2589	   absolute  26
2590	   absolute-path  26
2591	   absolute-URI  26
2592	   access  9
2593	   authority  16, 17

2595	B
2596	   base URI  28

2598	C
2599	   character encoding  4
2600	   character  4
2601	   characters  11
2602	   coded character set  4

2604	D
2605	   dec-octet  20
2606	   dereference  9
2607	   dot-segments  22

2609	F
2610	   fragment  16, 24

2612	G
2613	   gen-delims  12
2614	   generic syntax  6

2616	H
2617	   h16  19
2618	   hier-part  16
2619	   hierarchical  10
2620	   host  18

2622	I
2623	   identifier  5
2624	   IP-literal  19
2625	   IPv4  20
2626	   IPv4address  20
2627	   IPv6  19
2628	   IPv6address  19, 20
2629	   IPvFuture  19

2631	L
2632	   locator  7
2633	   ls32  19

2635	M
2636	   merge  32

2638	N
2639	   name  7
2640	   network-path  26

2642	P
2643	   path  16, 22
2644	      path-abempty  22
2645	      path-absolute  22
2646	      path-empty  22
2647	      path-noscheme  22
2648	      path-rootless  22
2649	   path-abempty  16
2650	   path-absolute  16
2651	   path-empty  16
2652	   path-rootless  16
2653	   pchar  22
2654	   pct-encoded  12
2655	   percent-encoding  12
2656	   port  21

2658	Q
2659	   query  16, 23

2661	R
2662	   reg-name  20
2663	   registered name  20
2664	   relative  10, 28
2665	   relative-path  26
2666	   relative-ref  26
2667	   remove_dot_segments  32
2668	   representation  9
2669	   reserved  12
2670	   resolution  9, 28
2671	   resource  5
2672	   retrieval  9

2674	S
2675	   same-document  27
2676	   sameness  9
2677	   scheme  16
2678	   segment  22
2679	      segment-nz  22
2680	      segment-nz-nc  22
2681	   sub-delims  12
2682	   suffix  27

2684	T
2685	   transcription  7

2687	U
2688	   uniform  4
2689	   unreserved  13
2690	   URI grammar
2691	      absolute-URI  26
2692	      ALPHA  11
2693	      authority  16, 17
2694	      CR  11
2695	      dec-octet  20
2696	      DIGIT  11
2697	      DQUOTE  11
2698	      fragment  16, 24, 26
2699	      gen-delims  12
2700	      h16  19
2701	      HEXDIG  11
2702	      hier-part  16
2703	      host  17, 18
2704	      IP-literal  19
2705	      IPv4address  20
2706	      IPv6address  19, 20
2707	      IPvFuture  19
2708	      LF  11
2709	      ls32  19
2710	      mark  13
2711	      OCTET  11
2712	      path  22
2713	      path-abempty  16, 22
2714	      path-absolute  16, 22
2715	      path-empty  16, 22
2716	      path-noscheme  22
2717	      path-rootless  16, 22
2718	      pchar  22, 23, 24
2719	      pct-encoded  12
2720	      port  17, 21
2721	      query  16, 23, 26
2722	      reg-name  20
2723	      relative-ref  25, 26
2724	      reserved  12
2725	      scheme  16, 26
2726	      segment  22
2727	      segment-nz  22
2728	      segment-nz-nc  22
2729	      SP  11
2730	      sub-delims  12
2731	      unreserved  13
2732	      URI  16, 25
2733	      URI-reference  25
2734	      userinfo  17, 18
2735	   URI  16
2736	   URI-reference  25
2737	   URL  7
2738	   URN  7
2739	   userinfo  17, 18

2741	Intellectual Property Statement

2743	   The IETF takes no position regarding the validity or scope of any
2744	   Intellectual Property Rights or other rights that might be claimed to
2745	   pertain to the implementation or use of the technology described in
2746	   this document or the extent to which any license under such rights
2747	   might or might not be available; nor does it represent that it has
2748	   made any independent effort to identify any such rights.  Information
2749	   on the procedures with respect to rights in RFC documents can be
2750	   found in BCP 78 and BCP 79.

2752	   Copies of IPR disclosures made to the IETF Secretariat and any
2753	   assurances of licenses to be made available, or the result of an
2754	   attempt made to obtain a general license or permission for the use of
2755	   such proprietary rights by implementers or users of this
2756	   specification can be obtained from the IETF on-line IPR repository at
2757	   http://www.ietf.org/ipr.

2759	   The IETF invites any interested party to bring to its attention any
2760	   copyrights, patents or patent applications, or other proprietary
2761	   rights that may cover technology that may be required to implement
2762	   this standard.  Please address the information to the IETF at
2763	   ietf-ipr@ietf.org.

2765	Disclaimer of Validity

2767	   This document and the information contained herein are provided on an
2768	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2769	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2770	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2771	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2772	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2773	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2775	Copyright Statement

2777	   Copyright (C) The Internet Society (2004).  This document is subject
2778	   to the rights, licenses and restrictions contained in BCP 78, and
2779	   except as set forth therein, the authors retain all their rights.

2781	Acknowledgment

2783	   Funding for the RFC Editor function is currently provided by the
2784	   Internet Society.