Network Working Group                                       R. Di Cosmo
INTERNET DRAFT                                               ENS France
Category: Experimental                              P.E. Martinez Lopez
                                                         UNLP Argentina
                                                           January 1998


         Distributed Robots: a Technology for Fast Web Indexing


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

1. Abstract

   We propose a protocol, the Remote Update Protocol (RUP), for
   cooperation between Web servers and Web robots, in order to increase
   the reliability of Web indexing and to decrease the load on both the
   server and the robot side.  If the servers conform to the RUP
   protocol, the task of the robot appears to be distributed among the
   servers it consults; for that reason we choose to call this note
   "Distributed Robots".

2. Introduction

   Web robots are programs that automatically traverse the Web's
   hypertext structure, mainly in order to perform indexing and/or
   maintenance tasks [3].  Usually, the robot connects to the servers
   to retrieve and index the relevant documents.  Due to communication
   latency, the exponential growth in the number of Web servers,
   multi-headed servers and various other factors, the task of indexing
   the whole Web is a daunting one: a short back-of-the-envelope
   calculation, assuming a 10-second delay between requests to avoid
   overloading the servers and an estimated 100 million URLs to index,
   and ignoring the bandwidth necessary to transfer the actual data,
   shows that a single robot would need more than 30 years to index the
   Web, and that even hundreds of independent robots examining disjoint
   partitions of the Web would still need several months.  Considering
   the widely variable lifetime of URLs, this means that the snapshot
   taken by a Web search robot is doomed to be a rather old one, so
   that the probability of getting dead URLs as the result of a search
   on a Web index is quite high, and bound to increase steadily, unless
   some radical change in indexing technology occurs.  The purpose of
   the present draft is to propose such a new technology, via a public
   protocol for Remote Updates.
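   For concreteness, the arithmetic behind this estimate, using the
   figures above and optimistically assuming a perfectly even split of
   the URL space among cooperating robots, is simply:

      100,000,000 URLs x 10 s = 10^9 s, about 31.7 years  (1 robot)
      10^9 s / 100 robots     = 10^7 s, about 116 days    (100 robots)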
   The key observation is that, as always, work should be done where it
   costs less: checking what is new on a Web server is best done by the
   Web server itself, not by an external search engine.  Better still,
   checking for modifications of the server's file system is a task
   that many Webmasters already perform on their own, for security and
   administrative reasons, on a daily basis.  Hence, it is the Web
   server that should notify registered robots, on a periodic basis, of
   relevant modifications, and provide unregistered robots with the
   ability to query for modifications that occurred over a designated
   span of time, thus taking a relevant part of the workload off the
   robots.  Also, the Web server is best placed to know whether some
   URLs in its domain are not to be indexed by a given robot (such as
   synthetic ones, local temporary links, etc.), and this information
   is already available on the site through the /robots.txt file,
   covered by the Robot Exclusion Protocol (REP) [1,2].  By combining
   this local information (modification logs and exclusion preferences)
   with a registration mechanism for indexing robots, we obtain the
   following advantages:

   *  lower server load: registered robots will no longer crush the
      server with bursts of GET or HEAD HTTP requests covering the
      whole server's URL addressing space;

   *  lower robot load: registered robots will only have to retrieve
      modified URLs;

   *  lower bandwidth usage: besides drastically reducing the bandwidth
      abuse due to indexing bursts, the remote update protocol may
      further reduce bandwidth usage by sending modification
      information back to robots in e-mail messages (which use
      store-and-forward delivery instead of long-range TCP/IP
      connections);

   *  increased index liveness: the remote update mechanism allows a
      robot to maintain more up-to-date indexes, and to discover
      modification patterns that support sound reindexing policies
      (such as tagging "hot" servers with higher reindexing priorities,
      while using lower priorities for relatively stable ones).

   It is worth noting that what we are proposing is the Web equivalent
   of replacing the polling of an input device with interrupt-driven
   technology, with similar benefits.

3. Protocol description

   We now present the components of the remote update protocol for the
   communication between servers and robots.  It consists of

   *  a registration protocol,

   *  an interaction protocol for performing further actions such as
      unregistering, modifying preferences, or requesting update
      information on the fly (this last action is also suitable for
      unregistered robots), and

   *  a text format for the communication of update information.

   In order to describe data formats, we use formal expressions with
   the following conventions:

   *  characters in typewriter font should appear verbatim in the data
      text;

   *  names in italics are variables that should be replaced by a
      proper value;

   *  any text enclosed between [ ]* symbols can appear zero, one or
      more times;

   *  any text enclosed between { } symbols can appear at most once
      (that is, it is optional);

   *  the letters NL are used to indicate end-of-line, and are system
      dependent.
3.1 Registration protocol

   At present, when a robot wants data from a Web server, it accesses
   the server and retrieves the information it wants.  If it is
   willing, it can retrieve the /robots.txt file and respect the
   guidance provided there.  In our protocol, the first action the
   robot must take is to register with the server.  The registration of
   a robot involves its identification to the server and the
   communication of its preferences (latency between updates, for
   example).  The server should accept this registration - after an
   optional authentication of the robot - initialize the data for the
   communication with the robot, and give the robot back a key needed
   for further operations such as unregistering or changing preference
   information.  To accomplish this, servers implementing the RUP
   protocol will have in their root WWW directory a file /rupinfo.txt,
   containing information about the registration procedure (which takes
   place via a CGI script) and the implemented features of the protocol
   (such as valid values for latency, exclusion and inclusion policies,
   etc.).

3.1.1 /rupinfo.txt data format

   The file /rupinfo.txt has the following syntax:

      RUP-CGI: cgiurl NL
      {Authentifier: authmethod NL}
      {Latency: latvalue[, latvalue]* NL}
      {Exclude: urlpatvalue[, urlpatvalue]* NL}
      {IncludeOnly: urlpatvalue[, urlpatvalue]* NL}
      {MaxQuerySpan: integer-latvalue NL}

   where

   *  cgiurl is the URL of the CGI script implementing the RUP protocol
      on the server;

   *  authmethod is a verification scheme used to determine the robot's
      identity.  For the time being, we do not provide any support for
      robot authentication, so the only valid value is currently none,
      but a future version of the protocol may add new values;

   *  latvalue is a value for accepted latencies (common values are
      day, week, month);

   *  urlpatvalue is a pattern of a URL, expressed using regular
      expression syntax;

   *  integer is a positive number.

   The RUP-CGI field indicates the URL of the CGI script that should be
   run in order to perform any action of the RUP protocol.  The robots
   will communicate with the RUP server by issuing HTTP requests to
   cgiurl, preferably using the POST CGI method (although the GET
   method should also be honored); the possible actions are described
   in the subsequent sections.  The Latency field indicates the
   possible magnitudes for the interval between notifications.  The
   Exclude and IncludeOnly fields are lists of accepted URL patterns
   that the robot may want to include in its exclude or includeonly
   list in the registration phase (the default is none).  The
   MaxQuerySpan field indicates how long the server keeps the
   modification information; it is used by registered or unregistered
   robots in order to know how far in the past they can obtain update
   information from the server.
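   As a non-normative illustration, a server that accepts daily, weekly
   and monthly notification latencies, keeps six months of modification
   logs, and offers exclusion patterns for its CGI and temporary areas
   might publish a /rupinfo.txt such as the following (the CGI URL and
   the patterns are invented for this example):

      RUP-CGI: http://www.example.org/cgi-bin/rup
      Authentifier: none
      Latency: day, week, month
      Exclude: /cgi-bin/.*, /tmp/.*
      MaxQuerySpan: 6-month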
3.1.2 Registration phase

   A robot willing to register with a server will retrieve the
   /rupinfo.txt file to find out which CGI script to call for
   registration and which preference values the server supports.  It
   will then issue an HTTP request to the cgiurl found there, with the
   following set of key/value pairs as arguments:

      Action=Register
      RobotEmail=email-address
      {Latency={integer-}latvalue}
      {Exclude=urlpatvalue[ urlpatvalue]*}
      {IncludeOnly=urlpatvalue[ urlpatvalue]*}

   where

   *  email-address is a valid e-mail address at which the robot wants
      to be notified of changes;

   *  Latency indicates the time the robot wants to wait between two
      successive reports (where latvalue is chosen from the Latency
      field of /rupinfo.txt);

   *  Exclude indicates that the robot wants to be informed of all
      changes except those affecting files matching the listed path
      patterns;

   *  IncludeOnly indicates that the robot only wants to monitor
      changes to files matching the listed path patterns.  This is
      especially suitable for allowing single users to monitor changes
      to specific sets of pages, if the server supports the
      registration of single users.

   The only required value is the e-mail address of the robot, and only
   one of Exclude and IncludeOnly is allowed, not both.  After
   authentication of the robot's identity, the server will answer this
   request with an HTML document containing the line

      RegistrationKey: string

   and status code 200 (OK).  This key will be required for any further
   modification of the robot's preferences record stored by the server.
   If the key is lost, human intervention will be required.  If an
   error occurs (for example, if the e-mail address is not a valid
   one), the answer will be an error status code, typically 400 (Bad
   Request) or 501 (Not Implemented).  After registration, a robot no
   longer needs to interact with the server via the RUP protocol during
   normal operation, because the server will send the modification
   updates to the registered e-mail address according to the latency
   preference and include/exclude directives given by the robot.
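   As a non-normative illustration (reusing the invented cgiurl of the
   previous example), a robot that wants a weekly report on everything
   outside the /private area of the server could register with the
   key/value pairs

      Action=Register
      RobotEmail=robot@indexer.example.com
      Latency=1-week
      Exclude=/private/.*

   to which the server, if it accepts the registration, might reply
   with a 200 (OK) HTML document containing a line such as

      RegistrationKey: a7f3c90b21

   where the key value is an arbitrary string chosen by the server (the
   one shown here is invented).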
3.2 Data format for modification updates

   The modification updates are MIME-compliant e-mail messages whose
   content has the following format:

      SequenceNumber: integer NL
      URLBase: URL NL
      NL
      [
      New[date]: filepath[, filepath]* NL |
      Change[date]: filepath[, filepath]* NL |
      Delete[date]: filepath[, filepath]* NL
      ]*

   In other words, each of these messages is composed of a header of
   two lines giving the sequence number of the report and the URL base
   of the site (the server address), then a blank line, then a sequence
   of lines indicating the changes in the server's contents.  Each line
   in the sequence is of one of three types: a New line, a Change line,
   or a Delete line.  A New line with date date indicates that the
   listed filepaths were created on that date.  Similarly, a Change
   line indicates that the filepaths were changed on the indicated
   date, and a Delete line indicates that they were deleted on the
   indicated date.  The sequence number is used to detect lost update
   reports, as explained in 3.3.3.

3.3 Further interactions

3.3.1 Canceling a registration

   If for any reason a robot no longer needs a server's information, it
   can cancel its registration by issuing an HTTP request with the
   following key/value pairs:

      Action=Unregister
      RegistrationKey=string
      RegisteredRobotEmail=email-address

   The effect of this command is that the server will erase the robot
   from its database and stop sending it reports of changes.

3.3.2 Modifying preferences

   A robot can modify its preferences on the server at any moment by
   issuing an HTTP request with the following key/value pairs:

      Action=ModifyPreferences
      RegistrationKey=string
      RegisteredRobotEmail=email-address
      {RobotEmail=email-address}
      {Latency={integer-}latvalue}
      {Exclude=urlpatvalue[ urlpatvalue]*}
      {IncludeOnly=urlpatvalue[ urlpatvalue]*}

   On receiving this command, the server checks whether a robot
   registered with the e-mail address given in the RegisteredRobotEmail
   field is in its database and whether the registration key is valid.
   If so, it modifies the preference items given in the request,
   leaving the other items unchanged; otherwise it returns an error
   status code, typically 400 (Bad Request).

3.3.3 Discovering and handling errors

   A given robot should receive notifications from a server within a
   fixed period of time - the latency time set by the robot.  For that
   reason, if the latency time has expired and no message arrives
   within a reasonable amount of time, the robot can consult the
   SequenceNumber stored on the server to check whether a report was
   lost - that is, the server sent it, but it did not arrive.  This is
   done by issuing an HTTP request with the following key/value pairs:

      Action=GetSequenceNumber
      RegistrationKey=string
      RegisteredRobotEmail=email-address

   which will produce either an error or an HTML document containing
   the line SequenceNumber=integer.  In the case of a lost report, the
   robot can either get the lost data using the unregistered-robot part
   of the protocol or, if this is not supported or the required data
   extends too far into the past for the server to satisfy the request,
   completely reindex the site.  After that, normal operation is
   resumed.

3.3.4 Requesting an online modification update

   A server may maintain the modification log in a format suitable for
   answering unregistered queries asking for a selected part of the
   modification log.  In this case, it is highly advisable to make this
   information available even to robots that are not yet registered,
   via a simple query protocol.  This is done by issuing an HTTP
   request with the following key/value pairs:

      Action=GetIndex
      Span=integer-latvalue

   whose answer is either an error or an HTML document with the same
   data format as the content of the periodic update mail messages sent
   to registered robots (see 3.2), containing the requested
   modification information.  This method can also be used by
   registered robots to recover a lost update report, as described in
   3.3.3.
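   As a non-normative illustration, an unregistered robot asking the
   server for the changes of the last two weeks would send the pairs

      Action=GetIndex
      Span=2-week

   and a server whose base URL is http://www.example.org might answer
   with a document such as

      SequenceNumber: 42
      URLBase: http://www.example.org

      New[1998-01-10]: /papers/rup.html, /papers/rup.ps
      Change[1998-01-12]: /index.html
      Delete[1998-01-15]: /old/notes.html

   where the sequence number, the paths and the date syntax are
   invented for the example (the protocol does not prescribe a
   particular date format).  The same data format is used in the
   periodic update e-mail messages sent to registered robots (see 3.2).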
4. Security Considerations

   The protocol proposed in this memo relies on the HTTP, CGI and
   e-mail protocols for data transfer and does not change the
   underlying security issues involved in those protocols.  The only
   new security issue raised concerns the risk of malicious
   modification of a Web indexer's preferences record, which we counter
   in a simple way by providing a unique identification key for every
   Web indexer.  More robust solutions could be provided by means of a
   stronger authentication protocol, for which a hook is provided in
   the protocol; the choice of the authentication method, which is out
   of the scope of the present memo, therefore does not alter the
   proposed protocol in any way.

5. Conclusions

   In the current state of the art for Web robots, the daunting amount
   of work involved in their task is a real barrier to achieving
   completeness.  For that reason, some form of distribution of the
   workload is needed.  In this note we have presented a protocol that
   can be used to distribute the task of indexing the Web.  Each server
   cooperates with the robots by preparing reports of the changes in
   its contents, so that the robot does not have to discover those
   changes by itself.  The protocol described does not impose any
   significant overhead on the servers, considering that the needed
   modification logs are usually maintained for administrative reasons
   anyway, and if this technology spreads, the information the robots
   can gather and maintain will be much more accurate than it is at
   present.  One could also think of allowing not only change
   monitoring but also indexing tasks to be performed on the server in
   idle time: it would be quite easy to add support for such a feature,
   for example by allowing transmission of executable code during the
   registration phase and allowing proprietary data to be included in
   the periodic report.  Nevertheless, due to the varying nature of the
   indexing algorithms used by different robots, this would require
   robot-dependent code (or even robot-supplied code) to be executed on
   the server, which would increase the overhead and raise fundamental
   security issues that would hinder easy deployment of the protocol.
   We therefore decided not to include support for this feature in this
   version of the protocol, but we may do so in the future.  A
   prototype RUP implementation on the server side is currently being
   tested and will be made available as a reference implementation with
   the final version of this note.

6. References

   [1] M. Koster, "A Standard for Robot Exclusion", June 1994.
       http://info.webcrawler.com/mak/projects/robots/norobots.html

   [2] C. P. Kollar, J. R. R. Leavitt, M. Mauldin, "Robot Exclusion
       Standard Revisited", June 1996.
       http://www.kollar.com/robots.html

   [3] M. Koster, "The Web Robots Pages", 1995.
       http://info.webcrawler.com/mak/projects/robots/robots.html

7. Author Information

   Roberto Di Cosmo
   LIENS-DMI
   Ecole Normale Superieure
   45, Rue d'Ulm
   75230 Paris CEDEX 05
   France

   E-mail: Roberto.Di.Cosmo@ens.fr

   Pablo Ernesto Martinez Lopez
   LIFIA
   Universidad Nacional de La Plata
   CC.11, Correo Central
   La Plata
   Argentina

   E-mail: fidel@info.unlp.edu.ar