INTERNET-DRAFT                                          Martin Hamilton
draft-hamilton-indexing-00.txt                  Loughborough University
Expires in six months                                  Daniel LaLiberte
                       National Center for Supercomputing Applications
                                                              June 1996

      Experimental HTTP methods to support indexing and searching
               Filename: draft-hamilton-indexing-00.txt

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Abstract

This document briefly outlines current approaches to indexing and
searching, proposes some experimental mechanisms which might be
deployed within HTTP [1] in support of these activities, and
concludes with a discussion of the issues raised.

The key features which are seen as desirable are a standardized way
of providing a local search capability on the information being made
available by an HTTP server, and a way of reducing both the bandwidth
consumed by indexing agents and the amount of work done by HTTP
servers during the indexing process.

1. Introduction

As the number of HTTP servers deployed has increased, providing
searchable indexes of the information which they make available has
itself become a growth industry. As a result there are now a large
number of "web crawlers", "web wanderers" and suchlike.

These indexing agents typically act independently of each other, and
do not share the information which they retrieve from the servers
being indexed. This can be a major cause of frustration for the
server maintainers, who see multiple requests for the same
information coming from different indexers. It also results in a
large amount of redundant network traffic, with these repeated
requests for the same objects, and the objects themselves, often
travelling over the same physical and routing infrastructure. To
minimize the problems which arise from this behaviour, a number of
techniques may be used, e.g. caching proxy servers, conditional "GET"
requests, restricting transfers to objects which can usefully be
indexed (such as HTML [2] documents), and the robots exclusion
convention [3].
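
For illustration, a conditional "GET" allows an indexing agent to
avoid re-fetching an object which has not changed since its last
visit. This exchange is not part of the proposal; the host, path and
date are hypothetical:

   GET /index.html HTTP/1.1
   Host: www.lut.ac.uk
   If-Modified-Since: Mon, 1 Apr 1996 07:34:31 GMT

   HTTP/1.1 304 Not Modified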

From the server administrator's point of view it would be preferable
that the HTTP servers being indexed were capable of generating
indexing information in a standardized format themselves. Better yet
if this information were made available in as bandwidth friendly a
manner as possible, e.g. using compression, and sending only the
indexing information for those objects which have changed since the
indexing agent's last visit. This would facilitate diverse
approaches to indexing the Web, such as regional and subject-based
indexes.

It is also desirable that HTTP servers support a native search
method, in order that (where a suitable search back end is available)
HTTP clients may carry out a search of the information provided by an
HTTP server in a standardized manner. Current approaches to local
searching typically involve running one or more third party search
and retrieval tools in addition to the basic HTTP server. It is
usually the case that search results may only be returned as an HTML
document, whereas a structured format intended specifically for
delivering search results would be preferable. This could add
greatly to the flexibility of the World-Wide Web, e.g. by making it
possible to write hyperlinks in HTML documents which cause searches
to be carried out, using the results of web crawler searches to
expand searches to HTTP servers where relevant documents were found,
and so on.

2. Additional HTTP methods

Of course, these indexing and searching capabilities need not be
provided for within HTTP. A number of networked search and retrieval
protocols already exist, and several approaches exist for building
local indexes of the information made available by HTTP servers.
Unfortunately, since these are usually third party products, extra
work is required to obtain, install and configure them. This is not
going to happen unless the server maintainers are sufficiently
motivated to devote extra time and effort to the tasks involved.

Ideally, the HTTP server package would itself provide some degree of
indexing and searching support, perhaps just by bundling third party
software. Unfortunately, these features tend to be seen as `value
added', and may only be available at a price. By redefining the HTTP
base line to include support for them, it is hoped that the spread of
these technologies can be encouraged, and that free software
developers at least will implement built-in support as a standard
feature.

The normal HTTP content negotiation features may be used in any
request/response pair. In particular, the "If-Modified-Since:"
request header should be used to indicate that the indexing agent is
only interested in objects which have been created or modified since
the date specified. The request/response pair of "Accept-Encoding:"
and "Content-Encoding:" should be used to indicate whether
compression is desired, and if so, the preferred compression
algorithm.

In the following examples, "C:" is used to indicate the client side
of the conversation and "S:" the server side; the client and server
sides are separated by a blank line for clarity.

2.1 The COLLECTIONS method

The COLLECTIONS method provides a means for HTTP clients to determine
which collections of information are made available by the HTTP
server. This may then be used, for example by the SEARCH and META
methods, to localize activity to a particular collection.
Implementors should note that this collection selection is in
addition to the virtual host selection provided by the HTTP "Host:"
header.

In COLLECTIONS requests, the Request-URI (to use the jargon of [1])
component of the HTTP request should be an asterisk "*", which
specifies that the scope of the request is all collections of
information made available by the server. Alternatively, the
Request-URI may be the URI of a particular collection, in which case
the request is for all subcollections of the identified collection,
i.e. a recursive traversal is implied.

It is assumed that these Request-URIs would likely be in the same
namespace used by the server for regular HTTP requests. This would
be in accordance with the general practice of indicating hierarchy in
HTTP URLs using the forward slash character "/".

e.g.

   C: COLLECTIONS * HTTP/1.1
   C: Accept: application/x-whois-data
   C: Accept-Encoding: gzip, compress
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]

Essentially, all the information which is strictly speaking required
at this stage is a list of the URIs of the relevant collections of
information. The META method may be used to discover further
information about individual collections or elements of collections.

Since collections themselves may be objects, such as Unix
directories, it is desirable that the Request-URI be able to refer to
either the collection object itself or the objects which form the
collection. To distinguish between these two roles, we suggest that
an asterisk "*" may be used to disambiguate between a Request-URI
which identifies a collection object and one which identifies the
objects forming the collection - e.g. "/departments/co/" might refer
to the collection object, and "/departments/co/*" to the objects
which form the collection.
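
For instance, a client wishing to enumerate only the subcollections
of a particular collection might issue the following request. This
exchange is hypothetical, constructed by analogy with the example
above:

   C: COLLECTIONS /departments/co/ HTTP/1.1
   C: Accept: application/x-whois-data
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]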

2.2 The META method

The META method is drawn from the Collector/Gatherer protocol used by
the Harvest software [4]. It may be used to make a request for
indexing information about a particular collection of information, or
a request for indexing information about an individual object within
the collection.

The scope of the request may be indicated via the Request-URI.

e.g.

   C: META * HTTP/1.1
   C: Accept: application/x-rdm, application/x-ldif
   C: Accept-Encoding: gzip, compress
   C: If-Modified-Since: Mon, 1 Apr 1996 07:34:31 GMT
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK metadata follows
   S: Content-type: application/x-rdm
   S:
   S: [...etc...]

Since some servers might want indexing to be done by an associated
server, rather than doing it themselves, a request for indexing
information (or by extension searching services) might reasonably be
redirected to another server, as sketched below.
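
Such a redirection might use the ordinary HTTP redirect machinery.
This exchange is hypothetical, and the delegate server name is
invented for illustration:

   C: META * HTTP/1.1
   C: Accept: application/x-rdm
   C: Host: www.lut.ac.uk
   C:

   S: 302 Moved Temporarily
   S: Location: http://index.lut.ac.uk/
   S: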

2.3 The SEARCH method

The SEARCH method embeds a query in the HTTP headers component of the
request, using the search syntax defined for the WHOIS++ protocol
[5].

The Request-URI for a SEARCH request should be either "*", for the
server as a whole, or the URI of a collection. The parameters of the
search should be carried in additional header lines. The query
header specifies which elements of the collection should be selected,
just as for the META request.

e.g.

   C: SEARCH /departments/co HTTP/1.1
   C: Accept: application/x-whois-data, text/html
   C: Host: www.lut.ac.uk
   C: Query: keywords=venona
   C:

   S: 200 OK search results follow
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]

WHOIS++ requests normally fit onto a single line, and no state is
preserved between requests. Consequently, embedding WHOIS++ requests
within HTTP requests does not add greatly to implementation
complexity.

3. Discussion

There is no widespread agreement on the form which the indexing
information retrieved by web crawlers would take, and it may be the
case that different web crawlers are looking for different types of
information. As the number of indexing agents deployed on the
Internet continues to grow, it seems possible that they will
eventually proliferate to the point where it becomes infeasible to
retrieve the full content of each and every indexed object from each
and every HTTP server.

This said, distributing the indexing load amongst a number of servers
which pooled their results would be one way around this problem,
splitting the indexing load along geographical and topological lines.
To put some perspective on this discussion, the need to do this does
not yet appear to have arisen.

On the format of indexing information there is something of a
dichotomy between those who see the indexing information as a long
term catalogue entry, perhaps to be generated by hand, and those who
see it merely as an interchange format between two programs, which
may be generated automatically. Ideally the same format would be
useful in both situations, but in practice it may be difficult to
isolate a sufficiently small subset of a rich cataloguing format for
machine use.

Consequently, this document will not make any proposals about the
format of the indexing information. By extension, it will not
propose a default format for search results.

However, it seems reasonable that clients be able to request that
search results be returned formatted as HTML, though this in itself
is not a particularly meaningful concept, since there are a variety
of languages which all claim to be HTML based. A tractable approach
for implementors would be to return HTML 2 unless the server is aware
of more advanced HTML features supported by the client. Currently,
much of this feature negotiation is based upon the value of the HTTP
"User-Agent:" header, but it is hoped that a more sophisticated
mechanism will eventually be developed.

The use of the WHOIS++ search syntax is based on the observation that
most Internet based search and retrieval protocols provide little
more than an attribute/value based search capability. WHOIS++
manages to offer a simple yet flexible search capability in arguably
the simplest and most readily implemented manner. Other protocols
typically add extra complexity in delivering requests and responses,
e.g. by using binary encodings, management type features which are
rarely exercised over wide area networks, and features to aid in the
management of result sets, which are desirable but add to
implementation complexity.

This document has suggested that search requests be presented using a
new HTTP method, primarily so as to avoid confusion when dealing with
servers which do not support searching. This approach has the
disadvantage that there is a large installed base of clients which
would not understand the new method, a large proportion of which have
no way of supporting new HTTP methods.

An alternative strategy would be to implement searches embedded
within GET requests. This would complicate processing of the GET
request, but not require any changes on the part of the client. It
would also allow searches to be written in HTML documents without any
changes to the HTML syntax - they would simply appear as regular
URLs. Searches which required a new HTTP method would presumably
have to be delineated by an additional component in the HTML anchor
tag.

This problem does not arise with the collection of indexing
information, since the number of agents performing the collection
will be comparatively small, and there is no perceived benefit from
being able to write HTML documents which include pointers to indexing
information - rather the opposite, in fact.

In a future development, the HTTP Protocol Extension Protocol [6]
could provide a means for HTTP/1.1 based applications which use these
HTTP extensions to share information about supported options, version
numbers, and so on. For example, the "Protocol:" header might be
used to indicate an alternative query language instead of the simple
WHOIS++ attribute-value syntax, but we suggest that the WHOIS++
syntax should be supported by every implementation of the SEARCH
method to provide a common base-line.

A sample PEP enabled SEARCH...

   C: SEARCH * HTTP/1.1
   C: Accept: application/x-whois-data, text/html
   C: Host: www.lut.ac.uk
   C: Protocol: {ftp://ftp.internic.net/rfc/rfc1835.txt {str req}}
   C: Query: keywords=venona
   C:

   S: 200 OK search results follow
   S: Content-type: application/x-whois-data
   S: Protocol: {ftp://ftp.internic.net/rfc/rfc1835.txt {str req}}
   S:
   S: [...etc...]

It may be noted that the three experimental methods proposed in this
document are very similar, differing essentially in the scope of the
information to which they apply. It may be desirable to collapse at
least the COLLECTIONS and META requests down to a single request,
using an extra HTTP header, say "Scope:", to indicate the scope of
the message; one possible form is sketched below.
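
For instance, the COLLECTIONS request shown earlier might be recast
as a META request whose "Scope:" header restricts the response to the
list of collections. This is purely speculative; neither the header
nor its values are defined in this document:

   C: META * HTTP/1.1
   C: Accept: application/x-whois-data
   C: Scope: collections
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]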

4. Security considerations

Most Internet protocols which deal with distributed indexing and
searching are careful to note the dangers of allowing unrestricted
access to the server. This is normally on the grounds that
unscrupulous clients may make off with the entire collection of
information - perhaps resulting in a breach of users' privacy, in the
case of White Pages servers.

In the web crawler environment, these general considerations do not
apply, since the entire collection of information is already "up for
grabs" to any person or agent willing to perform a traversal of the
server. Similarly, it is not likely to be a privacy problem if
searches yield a large number of results.

One exception, which should be noted by implementors, is that it is
common practice to have some private information on a public HTTP
server - perhaps limiting access to it on the basis of passwords, IP
addresses, network numbers, or domain names. These restrictions
should be considered when preparing indexing information or search
results, so as to avoid revealing private information to the Internet
as a whole.

It should also be noted that many of these access control mechanisms
are too trivial to be used over wide area networks such as the
Internet. Domain names and IP addresses are readily forged,
passwords are readily sniffed, and connections are readily hijacked.
Strong cryptographic authentication and session level encryption
should be used in any cases where security is a major concern.

5. Conclusions

There can be no doubt that the measures proposed in this document are
implementable - in fact they have already been implemented and
deployed, though on nothing like the scale of HTTP. It is a matter
for debate whether they are needed or desirable as additions to HTTP,
but it is clear that the additional functionality added to HTTP for
search support would come at some implementation cost. Indexing
support would be trivial to implement, once the issue of formatting
had been resolved.

6. Acknowledgements

Thanks to Jon Knight, Liam Quinn, Mike Schwartz, and <> for their
comments on draft versions of this document.

This work was supported by grants from the UK Electronic Libraries
Programme (eLib) and the European Commission's Telematics for
Research Programme.

The Harvest software was developed by the Internet Research Task
Force Research Group on Resource Discovery, with support from the
Advanced Research Projects Agency, the Air Force Office of Scientific
Research, the National Science Foundation, Hughes Aircraft Company,
Sun Microsystems' Collaborative Research Program, and the University
of Colorado.

7. References

Request For Comments (RFC) and Internet Draft documents are available
from numerous mirror sites.
"Harvest: A 414 Scalable, Customizable Discovery and Access Sys- 415 tem", Technical Report CU-CS-732-94, Department of 416 Computer Science, University of Colorado, Boulder, 417 August 1994. 418 421 [5] P. Deutsch, R. Schoultz, P. Faltstrom & C. Weider. 422 "Architecture of the WHOIS++ service", RFC 1835. 423 August 1995. 425 [6] R. Khare. "PEP: An Extension Mechanism for 426 HTTP/1.1", Internet Draft (work in progress). 427 February 1996. 429 8. Authors' Addresses 431 Martin Hamilton 432 Department of Computer Studies 433 Loughborough University of Technology 434 Leics. LE11 3TU, UK 436 Email: m.t.hamilton@lut.ac.uk 438 Daniel LaLiberte 439 National Center for Supercomputing Applications 440 152 CAB 441 605 E Springfield 442 Champaign, IL 61820 444 Email: liberte@ncsa.uiuc.edu 446 This Internet Draft expires XXXX, 1996.