Network Working Group                                        R. Di Cosmo
Internet-Draft                                                 ENS France
Category: Experimental                               P.E. Martinez Lopez
Expires: September 1998                                    UNLP Argentina
                                                             January 1998


         Distributed Robots: a Technology for Fast Web Indexing


Status of This Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1998). All Rights Reserved.

1. Abstract

We propose a protocol (the Remote Update Protocol, RUP) for cooperation between Web servers and Web Robots, in order to increase the reliability of Web indexing and to decrease the load on both the server and the robot side. If the servers conform to the RUP protocol, the task of the Robot will appear to be distributed among the servers that it consults; for that reason we choose to call this note "Distributed Robots".

2. Introduction

Web Robots are programs that automatically traverse the Web's hypertext structure, mainly in order to perform indexing and/or maintenance tasks [3]. Usually, the Robot connects to the servers in order to retrieve and index the relevant documents.

Due to communication latency, the exponential growth in the number of Web servers, multi-headed servers and various other factors, the task of indexing the whole Web is a daunting one: a short back-of-the-envelope calculation, assuming a 10-second delay between requests to avoid overloading the servers and an average of 100 million URLs to index, and even ignoring the bandwidth necessary to transfer the actual data, shows that we would need more than 30 years to index the Web using a single robot, and several months even supposing that hundreds of independent robots examine disjoint partitions of the Web. Considering the widely variable lifetime of URLs, this means that the snapshot taken by a Web search robot is doomed to be a rather old one, so that the probability of getting dead URLs as the result of a search on a Web index is quite high, and bound to increase steadily, unless some radical change in indexing technology occurs.

The purpose of the present draft is to propose such a new technology, via a public protocol for Remote Updates. The key observation is that, as always, work should be done where it costs less: checking what is new on a Web server is best done by the Web server itself, not by an external search engine. Better still, checking modifications of the server's file system is a task that is already performed daily, on their own initiative, by many Webmasters, for security and administrative reasons.
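
As a rough illustration (and not part of the protocol itself), the following Python sketch shows the kind of nightly scan such a site might already run; the document root path and the 24-hour window are assumptions of the sketch, not of this memo.

   # Hypothetical nightly scan: list the files under the document root
   # that were created or modified during the last 24 hours.
   import os, time

   DOC_ROOT = "/var/www/htdocs"          # assumed document root
   ONE_DAY  = 24 * 60 * 60

   def modified_since(root, seconds):
       cutoff = time.time() - seconds
       changed = []
       for dirpath, _dirnames, filenames in os.walk(root):
           for name in filenames:
               path = os.path.join(dirpath, name)
               try:
                   if os.path.getmtime(path) >= cutoff:
                       changed.append(path)
               except OSError:
                   pass                  # file removed while scanning
       return changed

   if __name__ == "__main__":
       for path in modified_since(DOC_ROOT, ONE_DAY):
           print(path)
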
Hence, it is the web server that should notify registered robots, on a periodic basis, of relevant modifications, and provide unregistered robots with the ability to query for modifications that occurred over a designated span of time, thus taking a significant part of the workload off the robots. Also, the web server is in the best position to know whether some URLs in its domain are not to be indexed by a given robot (such as synthetic ones, or local temporary links), and this information is already available on the site through the /robots.txt file, covered by the Robot Exclusion Protocol (REP) [1,2]. Combining this local information (modification logs and exclusion preferences) with a registration mechanism for indexing robots, we obtain the following advantages:

* lower server load: registered robots will no longer overwhelm the server with bursts of GET or HEAD HTTP requests covering the whole server's URL addressing space;

* lower robot load: registered robots will only have to retrieve modified URLs;

* lower bandwidth usage: besides drastically reducing the bandwidth abuse due to indexing bursts, the remote update protocol may further reduce bandwidth usage by sending modification information back to robots in e-mail messages (which use store-and-forward methods instead of long-range TCP/IP connections);

* increased index liveness: the remote update mechanism makes it possible to maintain more up-to-date indexes, and to discover modification patterns that will allow the robot to apply sound reindexing policies (such as tagging "hot" servers with higher reindexing priorities, while using lower priorities for relatively stable ones).

It is worth noting that what we are proposing is the Web equivalent of replacing the polling of an input device with interrupt-driven technology, with similar benefits.

3. Protocol description

We now present the components of the remote update protocol for the communication between servers and robots. It consists of:

* a registration protocol,

* an interaction protocol for performing further actions such as unregistering, modifying preferences or requesting update information on the fly (this last being suitable for unregistered robots), and

* a text format for the communication of update information.

In order to describe data formats, we will use formal expressions with the following conventions:

* characters in typewriter font should appear verbatim in the data text;

* names in italics are variables that should be replaced by a proper value;

* any text enclosed between [ ]* symbols can appear zero, one or more times;

* any text enclosed between { } symbols can appear at most once (that is, it is optional);

* the letters NL are used to indicate end-of-line, and are system dependent.

3.1 Registration protocol

At present, when a Robot wants data from a Web server, it accesses the server and retrieves the information it wants. If it is willing, it can retrieve the /robots.txt file and respect the guidance provided there.

In our protocol, the first action the Robot must take is to register with the server. The registration of a robot involves its identification to the server, and the communication of its preferences (latency between updates, for example). The server should accept this registration - after an optional authentication of the robot - initialize the data for the communication with the robot, and give the robot back a key needed for further operations such as unregistering or changing preference information.
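
To make the server side of this exchange concrete, here is a minimal, hypothetical sketch of a CGI registration handler written in Python; the request fields anticipate those defined in Section 3.1.2 below, and the database location, key length and default latency are assumptions of the sketch rather than requirements of the protocol. Only the POST case of the Register action is shown.

   #!/usr/bin/env python3
   # Hypothetical CGI handler for Action=Register (see Section 3.1.2).
   import os, sys, secrets, shelve
   import urllib.parse

   def reply(status, body):
       sys.stdout.write("Status: %s\r\nContent-Type: text/html\r\n\r\n" % status)
       sys.stdout.write("<html><body>%s</body></html>\n" % body)

   def main():
       length = int(os.environ.get("CONTENT_LENGTH") or 0)
       form = urllib.parse.parse_qs(sys.stdin.read(length))
       if form.get("Action", [""])[0] != "Register":
           reply("400 Bad Request", "Unsupported action")
           return
       email = form.get("RobotEmail", [""])[0]
       if "@" not in email:
           reply("400 Bad Request", "Invalid RobotEmail")
           return
       key = secrets.token_hex(16)                    # registration key
       with shelve.open("/var/lib/rup/robots") as db: # assumed location
           db[email] = {
               "key": key,
               "Latency": form.get("Latency", ["week"])[0],
               "Exclude": form.get("Exclude", [""])[0].split(),
               "IncludeOnly": form.get("IncludeOnly", [""])[0].split(),
           }
       reply("200 OK", "RegistrationKey: %s" % key)

   main()

A complete handler would also honour GET requests and implement the Unregister, ModifyPreferences and GetSequenceNumber actions of Section 3.3.
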
To accomplish this, servers implementing the RUP protocol will have in their root WWW directory a file /rupinfo.txt, containing information about the registration procedure (which will take place via a CGI script) and the implemented features of the protocol (such as valid values for latency, exclusion and inclusion policies, etc.).

3.1.1 /rupinfo.txt data format

The file /rupinfo.txt has the following syntax:

   RUP-CGI: cgiurl NL
   {Authentifier: authmethod NL}
   {Latency: latvalue[, latvalue]* NL}
   {Exclude: urlpatvalue[, urlpatvalue]* NL}
   {IncludeOnly: urlpatvalue[, urlpatvalue]* NL}
   {MaxQuerySpan: integer-latvalue NL}

where

* cgiurl is the URL of the CGI script implementing the RUP protocol on the server;

* authmethod is a verification scheme to determine the robot's identity. For the time being, we do not provide any support for robot authentication, so the only valid value is currently none; a future version of the protocol may add new values;

* latvalue is a value for accepted latencies (common values are day, week, month);

* urlpatvalue is a pattern of a URL, expressed using regular expression syntax;

* integer is a positive number.

The RUP-CGI field indicates the URL of the CGI script that should be run in order to perform any action of the RUP protocol. The robots will communicate with the RUP server by issuing HTTP requests to cgiurl, preferably using the POST CGI method (but the GET method should also be honored); the possible actions are described in the following sections.

The Latency field indicates the possible magnitudes for the interval between notifications.

The Exclude and IncludeOnly fields are lists of accepted patterns of URLs that the robot may want to include in the exclude and includeonly lists in the registration phase (the default is none).

The MaxQuerySpan field indicates how long the server keeps the modification information; it is used by registered or unregistered robots in order to know how far in the past they can obtain update information from the server.

3.1.2 Registration phase

A robot willing to register with a server will retrieve the /rupinfo.txt file to find which CGI script to call for registration and the preference values supported by the server. It will then issue an HTTP request to the cgiurl found there, with the following set of key/value pairs as arguments:

   Action=Register
   RobotEmail=email-address
   {Latency={integer-}latvalue}
   {Exclude=urlpatvalue[ urlpatvalue]*}
   {IncludeOnly=urlpatvalue[ urlpatvalue]*}

where

* email-address is a valid e-mail address at which the robot wants to be notified of changes;

* Latency indicates the time the robot wants to wait between two successive reports (where latvalue is chosen from the Latency field in /rupinfo.txt);

* Exclude indicates that the robot wants to be informed of all changes, except those affecting files matching the listed path patterns;

* IncludeOnly indicates that the robot only wants to monitor changes on files matching the listed path patterns. This is especially suitable for allowing single users to monitor changes on specific sets of pages, if the server supports registration of single users.

The only required value is the e-mail address of the robot, and at most one of Exclude and IncludeOnly is allowed, not both.

After authentication of the robot's identity, the server will answer this request with an HTML document containing the line

   RegistrationKey: string

with status code 200 (OK).
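
For illustration, a hedged robot-side sketch of this registration phase is given below (Python). The server name, the robot's e-mail address and the chosen latency are placeholders, error handling is omitted, and the registration key is assumed to be a simple alphanumeric string.

   # Hypothetical robot-side registration against a RUP server.
   import re
   import urllib.parse
   import urllib.request

   def register(server, robot_email, latency="week"):
       # 1. Fetch /rupinfo.txt and find the CGI script implementing RUP.
       info = urllib.request.urlopen("http://%s/rupinfo.txt" % server).read().decode()
       cgiurl = re.search(r"^RUP-CGI:\s*(\S+)", info, re.M).group(1)
       # 2. POST the registration request of Section 3.1.2.
       data = urllib.parse.urlencode({
           "Action": "Register",
           "RobotEmail": robot_email,
           "Latency": latency,
       }).encode()
       reply = urllib.request.urlopen(cgiurl, data).read().decode()
       # 3. Extract and return the registration key for later interactions.
       return re.search(r"RegistrationKey:\s*(\w+)", reply).group(1)

   # key = register("www.example.org", "robot@indexer.example.com")
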
This key will be required for any further modification of the robot's preferences record stored by the server. In case the key is lost, human intervention will be required. If an error occurs (for example, if the e-mail address is not a valid one), the answer will be an error status code - typically 400 (Bad Request) or 501 (Not Implemented).

After registration, a robot no longer needs to interact with the server via the RUP protocol during normal operation, because the server will send out the modification updates to the registered e-mail address according to the latency preferences and include/exclude directives given by the robot.

3.2 Data format for modification updates

The modification updates will be MIME-compliant e-mail messages whose content has the following format:

   SequenceNumber: integer NL
   URLBase: URL NL
   NL
   [ New[date]: filepath[, filepath]* NL
   | Change[date]: filepath[, filepath]* NL
   | Delete[date]: filepath[, filepath]* NL ]*

In other words, each of these reports is composed of a header of two lines giving the sequence number of the report and the URL base of the site (the server address), then a blank line, then a sequence of lines indicating the changes in the server contents. Each line in the sequence is of one of three types: a New line, a Change line, or a Delete line. A New line with date date indicates that the filepaths were created on that date. Similarly, a Change line indicates that the filepaths were changed on the indicated date, and a Delete line indicates that they were deleted on the indicated date. The sequence number is used to detect lost update reports, as explained in 3.3.3.

3.3 Further interactions

3.3.1 Canceling a registration

If for any reason a robot has no further need of a server's information, it can cancel its registration by issuing an HTTP request with the following key/value pairs:

   Action=Unregister
   RegistrationKey=string
   RegisteredRobotEmail=email-address

The effect of this command is that the server will erase the robot from its database and stop sending it reports of its changes.

3.3.2 Modifying preferences

A robot can modify its preferences on the server at any moment by issuing an HTTP request with the following key/value pairs:

   Action=ModifyPreferences
   RegistrationKey=string
   RegisteredRobotEmail=email-address
   {RobotEmail=email-address}
   {Latency={integer-}latvalue}
   {Exclude=urlpatvalue[ urlpatvalue]*}
   {IncludeOnly=urlpatvalue[ urlpatvalue]*}

The effect of this command is that the server will check whether the robot registered with the e-mail address given in the RegisteredRobotEmail field is in its database, and whether the registration key is valid. If so, it modifies the robot's preference items given in the request, leaving the other items unchanged; otherwise it will return an error status code, typically 400 (Bad Request).

3.3.3 Discovering and handling errors

A given Robot should receive notifications from a server within a fixed period of time - the latency time set by the Robot. For that reason, if the latency time has expired and no message arrives within a reasonable amount of time, the Robot can consult the SequenceNumber stored on the server to check whether any report was lost - that is, the server sent it, but it did not arrive. This is done by issuing an HTTP request with the following key/value pairs:

   Action=GetSequenceNumber
   RegistrationKey=string
   RegisteredRobotEmail=email-address

which will produce either an error or an HTML document containing the line SequenceNumber=integer.
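
To make the report format of Section 3.2 and the sequence-number check just described concrete, here is a minimal parsing sketch that a robot might apply to the body of an update message or of a GetIndex reply; the date format and all names below are assumptions of the sketch.

   # Hypothetical parser for the update reports of Section 3.2.
   import re

   EVENT = re.compile(r"^(New|Change|Delete)\[([^\]]*)\]:\s*(.*)$")

   def parse_report(text):
       lines = text.splitlines()
       seqno = int(re.match(r"SequenceNumber:\s*(\d+)", lines[0]).group(1))
       base  = re.match(r"URLBase:\s*(\S+)", lines[1]).group(1)
       events = []
       for line in lines[3:]:                   # skip the blank separator
           m = EVENT.match(line)
           if m:
               kind, date, paths = m.groups()
               events.append((kind, date, [p.strip() for p in paths.split(",")]))
       return seqno, base, events

   def report_was_lost(last_seen, seqno):
       # A gap in sequence numbers means at least one report never arrived,
       # and the robot should fall back on GetIndex (3.3.4) or reindex.
       return seqno != last_seen + 1
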
In the case of a lost report, the Robot can either get the lost data using the unregistered-robot part of the protocol, or, if this is not supported or the required data extends too far into the past for the server to satisfy the request, completely reindex the site. After that, normal operation is resumed.

3.3.4 Requesting an online modification update

A server may maintain the modification log in a format suitable for answering unregistered queries asking for a selected part of the modification log. In this case, it is highly advisable to make this information available even to robots not yet registered, via a simple query protocol. This is done by issuing an HTTP request with the following key/value pairs:

   Action=GetIndex
   Span=integer-latvalue

whose answer is either an error or an HTML document in the same data format as the content of the periodic update mail messages sent to registered robots (see 3.2), with the requested modification information. This method can also be used by registered robots to recover a lost update report, as described in 3.3.3.

4. Security Considerations

The protocol proposed in this memo relies on the HTTP, CGI and e-mail protocols for data transfer and does not change the underlying security issues involved in those protocols. The only new security issue raised concerns the risk of malicious modification of a web indexer's preferences record, which we counter in a simple way by providing a unique identification key for every web indexer. More robust solutions could be provided by means of a stronger authentication protocol, for which a hook is provided; the choice of the authentication method, which is outside the scope of the present memo, therefore does not alter the proposed protocol in any way.

5. Conclusions

In the current state of the art for Web Robots, the enormous amount of work involved in their task is a real barrier to achieving completeness. For that reason, some form of distribution of the workload is needed. In this note we have presented a protocol that can be used to distribute the task of indexing the Web. Each server cooperates with the Robots, preparing reports of changes in its contents, so that the Robot does not have to figure out those changes by itself. The protocol described does not impose any significant overhead on the servers, considering that the needed modification logs are usually maintained for administrative reasons anyway; and if this technology spreads, we will surely find that the information the Robots can gather and maintain will be much more accurate than it is at present.

One could think of allowing not only change-monitoring tasks, but also indexing tasks, to be performed on the server in idle time: it is quite easy to add support for such a feature, for example by allowing transmission of executable code during the registration phase and allowing proprietary data to be included in the periodic report. Nevertheless, due to the varying nature of the indexing algorithms used by different robots, this would require robot-dependent code (or even robot-supplied code) to be executed on the server, which would increase the overhead and raise fundamental security issues that would prevent easy deployment of the protocol. So we decided not to include support for this feature in this version of the protocol, but we may do so in the future.
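
As a closing illustration of how little client code the protocol requires, here is a hedged sketch of the unregistered query of Section 3.3.4; the CGI URL and the two-week span are placeholders, and the reply body is assumed, for simplicity, to carry the report as plain text readable by a parser such as the one sketched under Section 3.3.3.

   # Hypothetical unregistered query (Section 3.3.4): ask a RUP server
   # for the modifications of the last two weeks.
   import urllib.parse
   import urllib.request

   def get_index(cgiurl, span="2-week"):
       data = urllib.parse.urlencode({"Action": "GetIndex", "Span": span}).encode()
       return urllib.request.urlopen(cgiurl, data).read().decode()

   # seqno, base, events = parse_report(get_index("http://www.example.org/cgi-bin/rup"))
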
A prototype RUP implementation on the server side is currently being tested and will be made available as a reference implementation with the final version of this note.

6. References

[1] M. Koster, "A Standard for Robot Exclusion", June 1994. http://info.webcrawler.com/mak/projects/robots/norobots.html

[2] C. P. Kollar, J. R. R. Leavitt, M. Mauldin, "Robot Exclusion Standard Revisited", June 1996. http://www.kollar.com/robots.html

[3] M. Koster, "The Web Robots Pages", 1995. http://info.webcrawler.com/mak/projects/robots/robots.html

7. Author Information

Roberto Di Cosmo
LIENS-DMI
Ecole Normale Superieure
45, Rue d'Ulm
75230 Paris CEDEX 05
France
E-mail: Roberto.Di.Cosmo@ens.fr

Pablo Ernesto Martinez Lopez
LIFIA
Universidad Nacional de La Plata
CC.11, Correo Central
La Plata
Argentina
E-mail: fidel@info.unlp.edu.ar