Network Working Group                                       R. Di Cosmo
INTERNET DRAFT                                               ENS France
Category: Experimental                              P.E. Martinez Lopez
                                                         UNLP Argentina
                                                           January 1998


         Distributed Robots: a Technology for Fast Web Indexing


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

1. Abstract

   We propose a protocol, the Remote Update Protocol (RUP), for
   cooperation between Web servers and Web robots, in order to increase
   the reliability of Web indexing and to decrease the load on both the
   server and the robot side.  If the servers conform to the RUP
   protocol, the task of the robot appears to be distributed among the
   servers it consults; for that reason we choose to call this note
   "Distributed Robots".

2. Introduction

   Web robots are programs that automatically traverse the Web's
   hypertext structure, mainly in order to perform indexing and/or
   maintenance tasks [3].  Usually, the robot connects to the servers
   to retrieve and index the relevant documents.  Due to communication
   latency, the exponential growth in the number of Web servers,
   multi-headed servers and various other factors, the task of indexing
   the whole Web is a daunting one: a short back-of-the-envelope
   calculation, assuming a 10-second delay between requests to avoid
   overloading the servers and an estimated 100 million URLs to index,
   and ignoring the bandwidth necessary to transfer the actual data,
   shows that a single robot would need more than 30 years to index the
   Web, and that even hundreds of independent robots examining disjoint
   partitions of the Web would still need several months.  Considering
   the widely variable lifetime of URLs, this means that the snapshot
   taken by a Web search robot is doomed to be a rather old one, so
   that the probability of getting dead URLs as the result of a search
   on a Web index is quite high, and bound to increase steadily, unless
   some radical change in indexing technology occurs.  The purpose of
   the present draft is to propose such a new technology, via a public
   protocol for Remote Updates.
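   For concreteness, the arithmetic behind this estimate, using the
   figures above and optimistically assuming a perfectly even split of
   the URL space among cooperating robots, is simply:

      100,000,000 URLs x 10 s = 10^9 s, about 31.7 years  (1 robot)
      10^9 s / 100 robots     = 10^7 s, about 116 days    (100 robots)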
   The key observation is that, as always, work should be done where it
   costs less: checking what is new on a Web server is best done by the
   Web server itself, not by an external search engine.  Better still,
   checking for modifications of the server's file system is a task
   that many Webmasters already perform on their own, for security and
   administrative reasons, on a daily basis.  Hence, it is the Web
   server that should notify registered robots, on a periodic basis, of
   relevant modifications, and provide unregistered robots with the
   ability to query for modifications that occurred over a designated
   span of time, thus taking a relevant part of the workload off the
   robots.  Also, the Web server is best placed to know whether some
   URLs in its domain are not to be indexed by a given robot (such as
   synthetic ones, local temporary links, etc.), and this information
   is already available on the site through the /robots.txt file,
   covered by the Robot Exclusion Protocol (REP) [1,2].  By combining
   this local information (modification logs and exclusion preferences)
   with a registration mechanism for indexing robots, we obtain the
   following advantages:

   *  lower server load: registered robots will no longer crush the
      server with bursts of GET or HEAD HTTP requests covering the
      whole server's URL addressing space;

   *  lower robot load: registered robots will only have to retrieve
      modified URLs;

   *  lower bandwidth usage: besides drastically reducing the bandwidth
      abuse due to indexing bursts, the remote update protocol may
      further reduce bandwidth usage by sending modification
      information back to robots in e-mail messages (which use
      store-and-forward delivery instead of long-range TCP/IP
      connections);

   *  increased index liveness: the remote update mechanism allows a
      robot to maintain more up-to-date indexes, and to discover
      modification patterns that support sound reindexing policies
      (such as tagging "hot" servers with higher reindexing priorities,
      while using lower priorities for relatively stable ones).

   It is worth noting that what we are proposing is the Web equivalent
   of replacing the polling of an input device with interrupt-driven
   technology, with similar benefits.

3. Protocol description

   We now present the components of the remote update protocol for the
   communication between servers and robots.  It consists of

   *  a registration protocol,

   *  an interaction protocol for performing further actions such as
      unregistering, modifying preferences, or requesting update
      information on the fly (this last action is also suitable for
      unregistered robots), and

   *  a text format for the communication of update information.

   In order to describe data formats, we use formal expressions with
   the following conventions:

   *  characters in typewriter font should appear verbatim in the data
      text;

   *  names in italics are variables that should be replaced by a
      proper value;

   *  any text enclosed between [ ]* symbols can appear zero, one or
      more times;

   *  any text enclosed between { } symbols can appear at most once
      (that is, it is optional);

   *  the letters NL are used to indicate end-of-line, and are system
      dependent.
3.1 Registration protocol

   At present, when a robot wants data from a Web server, it accesses
   the server and retrieves the information it wants.  If it is
   willing, it can retrieve the /robots.txt file and respect the
   guidance provided there.  In our protocol, the first action the
   robot must take is to register with the server.  The registration of
   a robot involves its identification to the server and the
   communication of its preferences (latency between updates, for
   example).  The server should accept this registration - after an
   optional authentication of the robot - initialize the data for the
   communication with the robot, and give the robot back a key needed
   for further operations such as unregistering or changing preference
   information.  To accomplish this, servers implementing the RUP
   protocol will have in their root WWW directory a file /rupinfo.txt,
   containing information about the registration procedure (which takes
   place via a CGI script) and the implemented features of the protocol
   (such as valid values for latency, exclusion and inclusion policies,
   etc.).

3.1.1 /rupinfo.txt data format

   The file /rupinfo.txt has the following syntax:

      RUP-CGI: cgiurl NL
      {Authentifier: authmethod NL}
      {Latency: latvalue[, latvalue]* NL}
      {Exclude: urlpatvalue[, urlpatvalue]* NL}
      {IncludeOnly: urlpatvalue[, urlpatvalue]* NL}
      {MaxQuerySpan: integer-latvalue NL}

   where

   *  cgiurl is the URL of the CGI script implementing the RUP protocol
      on the server;

   *  authmethod is a verification scheme used to determine the robot's
      identity.  For the time being, we do not provide any support for
      robot authentication, so the only valid value is currently none,
      but a future version of the protocol may add new values;

   *  latvalue is a value for accepted latencies (common values are
      day, week, month);

   *  urlpatvalue is a pattern of a URL, expressed using regular
      expression syntax;

   *  integer is a positive number.

   The RUP-CGI field indicates the URL of the CGI script that should be
   run in order to perform any action of the RUP protocol.  The robots
   will communicate with the RUP server by issuing HTTP requests to
   cgiurl, preferably using the POST CGI method (although the GET
   method should also be honored); the possible actions are described
   in the subsequent sections.  The Latency field indicates the
   possible magnitudes for the interval between notifications.  The
   Exclude and IncludeOnly fields are lists of accepted URL patterns
   that the robot may want to include in its exclude or includeonly
   list in the registration phase (the default is none).  The
   MaxQuerySpan field indicates how long the server keeps the
   modification information; it is used by registered or unregistered
   robots in order to know how far in the past they can obtain update
   information from the server.
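   As a non-normative illustration, a server that accepts daily, weekly
   and monthly notification latencies, keeps six months of modification
   logs, and offers exclusion patterns for its CGI and temporary areas
   might publish a /rupinfo.txt such as the following (the CGI URL and
   the patterns are invented for this example):

      RUP-CGI: http://www.example.org/cgi-bin/rup
      Authentifier: none
      Latency: day, week, month
      Exclude: /cgi-bin/.*, /tmp/.*
      MaxQuerySpan: 6-month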
3.1.2 Registration phase

   A robot willing to register with a server will retrieve the
   /rupinfo.txt file to find out which CGI script to call for
   registration and which preference values the server supports.  It
   will then issue an HTTP request to the cgiurl found there, with the
   following set of key/value pairs as arguments:

      Action=Register
      RobotEmail=email-address
      {Latency={integer-}latvalue}
      {Exclude=urlpatvalue[ urlpatvalue]*}
      {IncludeOnly=urlpatvalue[ urlpatvalue]*}

   where

   *  email-address is a valid e-mail address at which the robot wants
      to be notified of changes;

   *  Latency indicates the time the robot wants to wait between two
      successive reports (where latvalue is chosen from the Latency
      field of /rupinfo.txt);

   *  Exclude indicates that the robot wants to be informed of all
      changes except those affecting files matching the listed path
      patterns;

   *  IncludeOnly indicates that the robot only wants to monitor
      changes to files matching the listed path patterns.  This is
      especially suitable for allowing single users to monitor changes
      to specific sets of pages, if the server supports the
      registration of single users.

   The only required value is the e-mail address of the robot, and only
   one of Exclude and IncludeOnly is allowed, not both.  After
   authentication of the robot's identity, the server will answer this
   request with an HTML document containing the line

      RegistrationKey: string

   and status code 200 (OK).  This key will be required for any further
   modification of the robot's preferences record stored by the server.
   If the key is lost, human intervention will be required.  If an
   error occurs (for example, if the e-mail address is not a valid
   one), the answer will be an error status code, typically 400 (Bad
   Request) or 501 (Not Implemented).  After registration, a robot no
   longer needs to interact with the server via the RUP protocol during
   normal operation, because the server will send the modification
   updates to the registered e-mail address according to the latency
   preference and include/exclude directives given by the robot.
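   As a non-normative illustration (reusing the invented cgiurl of the
   previous example), a robot that wants a weekly report on everything
   outside the /private area of the server could register with the
   key/value pairs

      Action=Register
      RobotEmail=robot@indexer.example.com
      Latency=1-week
      Exclude=/private/.*

   to which the server, if it accepts the registration, might reply
   with a 200 (OK) HTML document containing a line such as

      RegistrationKey: a7f3c90b21

   where the key value is an arbitrary string chosen by the server (the
   one shown here is invented).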
3.2 Data format for modification updates

   The modification updates are MIME-compliant e-mail messages whose
   content has the following format:

      SequenceNumber: integer NL
      URLBase: URL NL
      NL
      [
      New[date]: filepath[, filepath]* NL |
      Change[date]: filepath[, filepath]* NL |
      Delete[date]: filepath[, filepath]* NL
      ]*

   In other words, each of these messages is composed of a header of
   two lines giving the sequence number of the report and the URL base
   of the site (the server address), then a blank line, then a sequence
   of lines indicating the changes in the server's contents.  Each line
   in the sequence is of one of three types: a New line, a Change line,
   or a Delete line.  A New line with date date indicates that the
   listed filepaths were created on that date.  Similarly, a Change
   line indicates that the filepaths were changed on the indicated
   date, and a Delete line indicates that they were deleted on the
   indicated date.  The sequence number is used to detect lost update
   reports, as explained in 3.3.3.

3.3 Further interactions

3.3.1 Canceling a registration

   If for any reason a robot no longer needs a server's information, it
   can cancel its registration by issuing an HTTP request with the
   following key/value pairs:

      Action=Unregister
      RegistrationKey=string
      RegisteredRobotEmail=email-address

   The effect of this command is that the server will erase the robot
   from its database and stop sending it reports of changes.

3.3.2 Modifying preferences

   A robot can modify its preferences on the server at any moment by
   issuing an HTTP request with the following key/value pairs:

      Action=ModifyPreferences
      RegistrationKey=string
      RegisteredRobotEmail=email-address
      {RobotEmail=email-address}
      {Latency={integer-}latvalue}
      {Exclude=urlpatvalue[ urlpatvalue]*}
      {IncludeOnly=urlpatvalue[ urlpatvalue]*}

   On receiving this command, the server checks whether a robot
   registered with the e-mail address given in the RegisteredRobotEmail
   field is in its database and whether the registration key is valid.
   If so, it modifies the preference items given in the request,
   leaving the other items unchanged; otherwise it returns an error
   status code, typically 400 (Bad Request).

3.3.3 Discovering and handling errors

   A given robot should receive notifications from a server within a
   fixed period of time - the latency time set by the robot.  For that
   reason, if the latency time has expired and no message arrives
   within a reasonable amount of time, the robot can consult the
   SequenceNumber stored on the server to check whether a report was
   lost - that is, the server sent it, but it did not arrive.  This is
   done by issuing an HTTP request with the following key/value pairs:

      Action=GetSequenceNumber
      RegistrationKey=string
      RegisteredRobotEmail=email-address

   which will produce either an error or an HTML document containing
   the line SequenceNumber=integer.  In the case of a lost report, the
   robot can either get the lost data using the unregistered-robot part
   of the protocol or, if this is not supported or the required data
   extends too far into the past for the server to satisfy the request,
   completely reindex the site.  After that, normal operation is
   resumed.

3.3.4 Requesting an online modification update

   A server may maintain the modification log in a format suitable for
   answering unregistered queries asking for a selected part of the
   modification log.  In this case, it is highly advisable to make this
   information available even to robots that are not yet registered,
   via a simple query protocol.  This is done by issuing an HTTP
   request with the following key/value pairs:

      Action=GetIndex
      Span=integer-latvalue

   whose answer is either an error or an HTML document with the same
   data format as the content of the periodic update mail messages sent
   to registered robots (see 3.2), containing the requested
   modification information.  This method can also be used by
   registered robots to recover a lost update report, as described in
   3.3.3.
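   As a non-normative illustration, an unregistered robot asking the
   server for the changes of the last two weeks would send the pairs

      Action=GetIndex
      Span=2-week

   and a server whose base URL is http://www.example.org might answer
   with a document such as

      SequenceNumber: 42
      URLBase: http://www.example.org

      New[1998-01-10]: /papers/rup.html, /papers/rup.ps
      Change[1998-01-12]: /index.html
      Delete[1998-01-15]: /old/notes.html

   where the sequence number, the paths and the date syntax are
   invented for the example (the protocol does not prescribe a
   particular date format).  The same data format is used in the
   periodic update e-mail messages sent to registered robots (see 3.2).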
4. Security Considerations

   The protocol proposed in this memo relies on the HTTP, CGI and
   e-mail protocols for data transfer and does not change the
   underlying security issues involved in those protocols.  The only
   new security issue raised concerns the risk of malicious
   modification of a Web indexer's preferences record, which we counter
   in a simple way by providing a unique identification key for every
   Web indexer.  More robust solutions could be provided by means of a
   stronger authentication protocol, for which a hook is provided in
   the protocol; the choice of the authentication method, which is out
   of the scope of the present memo, therefore does not alter the
   proposed protocol in any way.

5. Conclusions

   In the current state of the art for Web robots, the daunting amount
   of work involved in their task is a real barrier to achieving
   completeness.  For that reason, some form of distribution of the
   workload is needed.  In this note we have presented a protocol that
   can be used to distribute the task of indexing the Web.  Each server
   cooperates with the robots by preparing reports of the changes in
   its contents, so that the robot does not have to discover those
   changes by itself.  The protocol described does not impose any
   significant overhead on the servers, considering that the needed
   modification logs are usually maintained for administrative reasons
   anyway, and if this technology spreads, the information the robots
   can gather and maintain will be much more accurate than it is at
   present.  One could also think of allowing not only change
   monitoring but also indexing tasks to be performed on the server in
   idle time: it would be quite easy to add support for such a feature,
   for example by allowing transmission of executable code during the
   registration phase and allowing proprietary data to be included in
   the periodic report.  Nevertheless, due to the varying nature of the
   indexing algorithms used by different robots, this would require
   robot-dependent code (or even robot-supplied code) to be executed on
   the server, which would increase the overhead and raise fundamental
   security issues that would hinder easy deployment of the protocol.
   We therefore decided not to include support for this feature in this
   version of the protocol, but we may do so in the future.  A
   prototype RUP implementation on the server side is currently being
   tested and will be made available as a reference implementation with
   the final version of this note.

6. References

   [1] M. Koster, "A Standard for Robot Exclusion", June 1994.
       http://info.webcrawler.com/mak/projects/robots/norobots.html

   [2] C. P. Kollar, J. R. R. Leavitt, M. Mauldin, "Robot Exclusion
       Standard Revisited", June 1996.
       http://www.kollar.com/robots.html

   [3] M. Koster, "The Web Robots Pages", 1995.
       http://info.webcrawler.com/mak/projects/robots/robots.html

7. Author Information

   Roberto Di Cosmo
   LIENS-DMI
   Ecole Normale Superieure
   45, Rue d'Ulm
   75230 Paris CEDEX 05
   France

   E-mail: Roberto.Di.Cosmo@ens.fr

   Pablo Ernesto Martinez Lopez
   LIFIA
   Universidad Nacional de La Plata
   CC.11, Correo Central
   La Plata
   Argentina

   E-mail: fidel@info.unlp.edu.ar