INTERNET-DRAFT                                          Martin Hamilton
draft-hamilton-indexing-00.txt                  Loughborough University
Expires in six months                                  Daniel LaLiberte
                       National Center for Supercomputing Applications
                                                              June 1996

      Experimental HTTP methods to support indexing and searching
               Filename: draft-hamilton-indexing-00.txt

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Abstract

This document briefly outlines current approaches to indexing and
searching, proposes some experimental mechanisms which might be
deployed within HTTP [1] in support of these activities, and
concludes with a discussion of the issues raised.

The key features which are seen as desirable are a standardized way
of providing a local search capability on the information being made
available by an HTTP server, and a way of reducing both the bandwidth
consumed by indexing agents and the amount of work done by HTTP
servers during the indexing process.

1. Introduction

As the number of HTTP servers deployed has increased, providing
searchable indexes of the information which they make available has
itself become a growth industry. As a result there are now a large
number of "web crawlers", "web wanderers" and suchlike.

These indexing agents typically act independently of each other, and
do not share the information which they retrieve from the servers
being indexed. This can be a major cause of frustration for the
server maintainers, who see multiple requests for the same
information coming from different indexers. It also results in a
large amount of redundant network traffic, with these repeated
requests for the same objects, and the objects themselves, often
travelling over the same physical and routing infrastructure. To
minimize the problems which arise from this behaviour, a number of
techniques may be used, e.g. caching proxy servers, conditional "GET"
requests, restricting transfers to objects which can usefully be
indexed (such as HTML [2] documents), and the robots exclusion
convention [3].
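
For illustration, a conditional "GET" allows an indexing agent to
avoid re-fetching an object which has not changed since its last
visit. This exchange is not part of the proposal; the host, path and
date are hypothetical:

   GET /index.html HTTP/1.1
   Host: www.lut.ac.uk
   If-Modified-Since: Mon, 1 Apr 1996 07:34:31 GMT

   HTTP/1.1 304 Not Modified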

From the server administrator's point of view it would be preferable
that the HTTP servers being indexed were capable of generating
indexing information in a standardized format themselves. Better yet
if this information were made available in as bandwidth friendly a
manner as possible, e.g. using compression, and sending only the
indexing information for those objects which have changed since the
indexing agent's last visit. This would facilitate diverse
approaches to indexing the Web, such as regional and subject-based
indexes.

It is also desirable that HTTP servers support a native search
method, in order that (where a suitable search back end is available)
HTTP clients may carry out a search of the information provided by an
HTTP server in a standardized manner. Current approaches to local
searching typically involve running one or more third party search
and retrieval tools in addition to the basic HTTP server. It is
usually the case that search results may only be returned as an HTML
document, whereas a structured format intended specifically for
delivering search results would be preferable. This could add
greatly to the flexibility of the World-Wide Web, e.g. by making it
possible to write hyperlinks in HTML documents which cause searches
to be carried out, using the results of web crawler searches to
expand searches to HTTP servers where relevant documents were found,
and so on.

2. Additional HTTP methods

Of course, these indexing and searching capabilities need not be
provided for within HTTP. A number of networked search and retrieval
protocols already exist, and several approaches exist for building
local indexes of the information made available by HTTP servers.
Unfortunately, since these are usually third party products, extra
work is required to obtain, install and configure them. This is not
going to happen unless the server maintainers are sufficiently
motivated to devote extra time and effort to the tasks involved.

Ideally, the HTTP server package would itself provide some degree of
indexing and searching support, perhaps just by bundling third party
software. Unfortunately, these features tend to be seen as `value
added', and may only be available at a price. By redefining the HTTP
base line to include support for them, it is hoped that the spread of
these technologies can be encouraged, and that free software
developers at least will implement built-in support as a standard
feature.

The normal HTTP content negotiation features may be used in any
request/response pair. In particular, the "If-Modified-Since:"
request header should be used to indicate that the indexing agent is
only interested in objects which have been created or modified since
the date specified. The request/response pair of "Accept-Encoding:"
and "Content-Encoding:" should be used to indicate whether
compression is desired, and if so, the preferred compression
algorithm.

In the following examples, "C:" is used to indicate the client side
of the conversation and "S:" the server side; the client and server
sides are separated by a blank line for clarity.

2.1 The COLLECTIONS method

The COLLECTIONS method provides a means for HTTP clients to determine
which collections of information are made available by the HTTP
server. This may then be used, for example by the SEARCH and META
methods, to localize activity to a particular collection.
Implementors should note that this collection selection is in
addition to the virtual host selection provided by the HTTP "Host:"
header.

In COLLECTIONS requests, the Request-URI (to use the jargon of [1])
component of the HTTP request should be an asterisk "*", which
specifies that the scope of the request is all collections of
information made available by the server. Alternatively, the
Request-URI may be the URI of a particular collection, in which case
the request is for all subcollections of the identified collection,
i.e. a recursive traversal is implied.

It is assumed that these Request-URIs would likely be in the same
namespace used by the server for regular HTTP requests. This would
be in accordance with the general practice of indicating hierarchy in
HTTP URLs using the forward slash character "/".

e.g.

   C: COLLECTIONS * HTTP/1.1
   C: Accept: application/x-whois-data
   C: Accept-Encoding: gzip, compress
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]

Essentially, all the information which is strictly speaking required
at this stage is a list of the URIs of the relevant collections of
information. The META method may be used to discover further
information about individual collections or elements of collections.

Since collections themselves may be objects, such as Unix
directories, it is desirable that the Request-URI be able to refer to
either the collection object itself or the objects which form the
collection. To distinguish between these two roles, we suggest that
an asterisk "*" may be used to disambiguate between a Request-URI
which identifies a collection object and one which identifies the
objects forming the collection - e.g. "/departments/co/" might refer
to the collection object, and "/departments/co/*" to the objects
which form the collection.
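
For instance, a client wishing to enumerate only the subcollections
of a particular collection might issue the following request. This
exchange is hypothetical, constructed by analogy with the example
above:

   C: COLLECTIONS /departments/co/ HTTP/1.1
   C: Accept: application/x-whois-data
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]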

2.2 The META method

The META method is drawn from the Collector/Gatherer protocol used by
the Harvest software [4]. It may be used to make a request for
indexing information about a particular collection of information, or
a request for indexing information about an individual object within
the collection.

The scope of the request may be indicated via the Request-URI.

e.g.

   C: META * HTTP/1.1
   C: Accept: application/x-rdm, application/x-ldif
   C: Accept-Encoding: gzip, compress
   C: If-Modified-Since: Mon, 1 Apr 1996 07:34:31 GMT
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK metadata follows
   S: Content-type: application/x-rdm
   S:
   S: [...etc...]

Since some servers might want indexing to be done by an associated
server, rather than doing it themselves, a request for indexing
information (or by extension searching services) might reasonably be
redirected to another server, as sketched below.
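
Such a redirection might use the ordinary HTTP redirect machinery.
This exchange is hypothetical, and the delegate server name is
invented for illustration:

   C: META * HTTP/1.1
   C: Accept: application/x-rdm
   C: Host: www.lut.ac.uk
   C:

   S: 302 Moved Temporarily
   S: Location: http://index.lut.ac.uk/
   S: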

2.3 The SEARCH method

The SEARCH method embeds a query in the HTTP headers component of the
request, using the search syntax defined for the WHOIS++ protocol
[5].

The Request-URI for a SEARCH request should be either "*", for the
server as a whole, or the URI of a collection. The parameters of the
search should be carried in additional header lines. The query
header specifies which elements of the collection should be selected,
just as for the META request.

e.g.

   C: SEARCH /departments/co HTTP/1.1
   C: Accept: application/x-whois-data, text/html
   C: Host: www.lut.ac.uk
   C: Query: keywords=venona
   C:

   S: 200 OK search results follow
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]

WHOIS++ requests normally fit onto a single line, and no state is
preserved between requests. Consequently, embedding WHOIS++ requests
within HTTP requests does not add greatly to implementation
complexity.

3. Discussion

There is no widespread agreement on the form which the indexing
information retrieved by web crawlers would take, and it may be the
case that different web crawlers are looking for different types of
information. As the number of indexing agents deployed on the
Internet continues to grow, it seems possible that they will
eventually proliferate to the point where it becomes infeasible to
retrieve the full content of each and every indexed object from each
and every HTTP server.

This said, distributing the indexing load amongst a number of servers
which pooled their results would be one way around this problem,
splitting the indexing load along geographical and topological lines.
To put some perspective on this discussion, the need to do this does
not yet appear to have arisen.

On the format of indexing information there is something of a
dichotomy between those who see the indexing information as a long
term catalogue entry, perhaps to be generated by hand, and those who
see it merely as an interchange format between two programs, which
may be generated automatically. Ideally the same format would be
useful in both situations, but in practice it may be difficult to
isolate a sufficiently small subset of a rich cataloguing format for
machine use.

Consequently, this document will not make any proposals about the
format of the indexing information. By extension, it will not
propose a default format for search results.

However, it seems reasonable that clients be able to request that
search results be returned formatted as HTML, though this in itself
is not a particularly meaningful concept, since there are a variety
of languages which all claim to be HTML based. A tractable approach
for implementors would be to return HTML 2 unless the server is aware
of more advanced HTML features supported by the client. Currently,
much of this feature negotiation is based upon the value of the HTTP
"User-Agent:" header, but it is hoped that a more sophisticated
mechanism will eventually be developed.

The use of the WHOIS++ search syntax is based on the observation that
most Internet based search and retrieval protocols provide little
more than an attribute/value based search capability. WHOIS++
manages to offer a simple yet flexible search capability in arguably
the simplest and most readily implemented manner. Other protocols
typically add extra complexity in delivering requests and responses,
e.g. by using binary encodings, management type features which are
rarely exercised over wide area networks, and features to aid in the
management of result sets, which are desirable but add to
implementation complexity.

This document has suggested that search requests be presented using a
new HTTP method, primarily so as to avoid confusion when dealing with
servers which do not support searching. This approach has the
disadvantage that there is a large installed base of clients which
would not understand the new method, a large proportion of which have
no way of supporting new HTTP methods.

An alternative strategy would be to implement searches embedded
within GET requests. This would complicate processing of the GET
request, but not require any changes on the part of the client. It
would also allow searches to be written in HTML documents without any
changes to the HTML syntax - they would simply appear as regular
URLs. Searches which required a new HTTP method would presumably
have to be delineated by an additional component in the HTML anchor
tag.

This problem does not arise with the collection of indexing
information, since the number of agents performing the collection
will be comparatively small, and there is no perceived benefit from
being able to write HTML documents which include pointers to indexing
information - rather the opposite, in fact.

In a future development, the HTTP Protocol Extension Protocol [6]
could provide a means for HTTP/1.1 based applications which use these
HTTP extensions to share information about supported options, version
numbers, and so on. For example, the "Protocol:" header might be
used to indicate an alternative query language instead of the simple
WHOIS++ attribute-value syntax, but we suggest that the WHOIS++
syntax should be supported by every implementation of the SEARCH
method to provide a common base-line.

A sample PEP enabled SEARCH...

   C: SEARCH * HTTP/1.1
   C: Accept: application/x-whois-data, text/html
   C: Host: www.lut.ac.uk
   C: Protocol: {ftp://ftp.internic.net/rfc/rfc1835.txt {str req}}
   C: Query: keywords=venona
   C:

   S: 200 OK search results follow
   S: Content-type: application/x-whois-data
   S: Protocol: {ftp://ftp.internic.net/rfc/rfc1835.txt {str req}}
   S:
   S: [...etc...]

It may be noted that the three experimental methods proposed in this
document are very similar, differing essentially in the scope of the
information to which they apply. It may be desirable to collapse at
least the COLLECTIONS and META requests down to a single request,
using an extra HTTP header, say "Scope:", to indicate the scope of
the message; one possible form is sketched below.
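
For instance, the COLLECTIONS request shown earlier might be recast
as a META request whose "Scope:" header restricts the response to the
list of collections. This is purely speculative; neither the header
nor its values are defined in this document:

   C: META * HTTP/1.1
   C: Accept: application/x-whois-data
   C: Scope: collections
   C: Host: www.lut.ac.uk
   C:

   S: 200 OK collection info follows
   S: Content-type: application/x-whois-data
   S:
   S: [...etc...]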

4. Security considerations

Most Internet protocols which deal with distributed indexing and
searching are careful to note the dangers of allowing unrestricted
access to the server. This is normally on the grounds that
unscrupulous clients may make off with the entire collection of
information - perhaps resulting in a breach of users' privacy, in the
case of White Pages servers.

In the web crawler environment, these general considerations do not
apply, since the entire collection of information is already "up for
grabs" to any person or agent willing to perform a traversal of the
server. Similarly, it is not likely to be a privacy problem if
searches yield a large number of results.

One exception, which should be noted by implementors, is that it is
common practice to have some private information on a public HTTP
server - perhaps limiting access to it on the basis of passwords, IP
addresses, network numbers, or domain names. These restrictions
should be considered when preparing indexing information or search
results, so as to avoid revealing private information to the Internet
as a whole.

It should also be noted that many of these access control mechanisms
are too trivial to be used over wide area networks such as the
Internet. Domain names and IP addresses are readily forged,
passwords are readily sniffed, and connections are readily hijacked.
Strong cryptographic authentication and session level encryption
should be used in any cases where security is a major concern.

5. Conclusions

There can be no doubt that the measures proposed in this document are
implementable - in fact they have already been implemented and
deployed, though on nothing like the scale of HTTP. It is a matter
for debate whether they are needed or desirable as additions to HTTP,
but it is clear that the additional functionality added to HTTP for
search support would come at some implementation cost. Indexing
support would be trivial to implement, once the issue of formatting
had been resolved.

6. Acknowledgements

Thanks to Jon Knight, Liam Quinn, Mike Schwartz, and <> for their
comments on draft versions of this document.

This work was supported by grants from the UK Electronic Libraries
Programme (eLib) and the European Commission's Telematics for
Research Programme.

The Harvest software was developed by the Internet Research Task
Force Research Group on Resource Discovery, with support from the
Advanced Research Projects Agency, the Air Force Office of Scientific
Research, the National Science Foundation, Hughes Aircraft Company,
Sun Microsystems' Collaborative Research Program, and the University
of Colorado.

7. References

Request For Comments (RFC) and Internet Draft documents are available
from numerous mirror sites.
"Harvest: A 414 Scalable, Customizable Discovery and Access Sys- 415 tem", Technical Report CU-CS-732-94, Department of 416 Computer Science, University of Colorado, Boulder, 417 August 1994. 418 421 [5] P. Deutsch, R. Schoultz, P. Faltstrom & C. Weider. 422 "Architecture of the WHOIS++ service", RFC 1835. 423 August 1995. 425 [6] R. Khare. "PEP: An Extension Mechanism for 426 HTTP/1.1", Internet Draft (work in progress). 427 February 1996. 429 8. Authors' Addresses 431 Martin Hamilton 432 Department of Computer Studies 433 Loughborough University of Technology 434 Leics. LE11 3TU, UK 436 Email: m.t.hamilton@lut.ac.uk 438 Daniel LaLiberte 439 National Center for Supercomputing Applications 440 152 CAB 441 605 E Springfield 442 Champaign, IL 61820 444 Email: liberte@ncsa.uiuc.edu 446 This Internet Draft expires XXXX, 1996.