Internet Draft
John Pritchard
Columbia U Computer Science
Expires June 1996
21 November 1996

Efficient HyperLink Maintenance for HTTP

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to John Pritchard at

Abstract

Hyperlink maintenance allows robots and servers to cooperate in propagating the effects of daily changes in the millions of resource locations in the wwweb. Here, we propose developing the definitions of the LINK and UNLINK methods, which have been defined for HTTP since RFC 1945 but remain largely unimplemented and unused. We believe that the only reason these methods have not been employed is that they remain too loosely defined and implicitly too inefficient. A new syntax and semantics simplify implementation and improve utility.

Author's address

John Pritchard
315 W 82nd Street, #4
New York, NY 10024

Contents

   1. Introduction
   2. Link Terminology
   3. Implementation Terminology
   4. Current HTTP Link Management Protocol
   5. Some linking practices
   6. Proposed Facility
   7. Methods
      1. LINK
      2. UNLINK
      3. UNLINKR
      4. LINKMOD
   8. Implementation
   9. Idempotency
  10. Security Considerations
  11. Syntax
  12. References

1. Introduction

The HTTP protocol has recognized the importance of link management since HTTP/1.0, RFC 1945 [1]. However, the methods defined in HTTP/1.0 are limited and remain largely unimplemented. The existing link concept is defined irrespective of direction, i.e., reference or resource, and so leaves too much semantically implied. The revised methods define simple and efficient syntax and semantics for a complete hyperlink management protocol within HTTP.

Dangling links are an increasing problem on a large and growing wwweb. Messages like the following are common:

   The URL which you entered, ... , was not found on this server. You may have entered it incorrectly, or it may no longer exist. If you arrived here by clicking on a link in another page, please tell that page's owner/administrator that the link no longer exists.

This one resulted from a URL stored in a popular search engine.

A solution is readily available in defining HTTP's LINK and UNLINK methods with syntax and semantics that effectively and efficiently provide for hyperlink maintenance.

Hyperlink maintenance implies communication, processing and storage costs. The proposed methods reduce processing by carrying both ends of a link in the request syntax, so that their semantics never imply searching on the part of the receiver of a call. The proposed methods' semantics also match storage requirements to the HTML LINK tag concept. Storage space is not required on behalf of robots for implementation.

The protocol detailed here is currently being implemented in an HTTP/1.1 compliant, commercial wwweb server and agent platform under the extensions provisions of that specification. This protocol is a result of that effort.

2. Link Terminology

In this context we refer exclusively to links that are Uniform Resource Locators, see URL [4] and [2]. URLs are Uniform Resource Identifiers, URIs [1], pointing to particular resources without variation per user identity, class or input, or other particularly perishable or localized circumstances.

A link has two end points, one in an HTML anchor or other URL reference, and the other in the HTTP service providing access to a resource via a reference. The source end of a link is the client or anchor end, sometimes called the tail, and the target end of a link is the resource end, sometimes called the head.

   source: anchor, reference, tail
   target: resource, head, server, named anchor

Usage of source and target includes direct reference to documents, to reference locators (URLs), or to the services (hosts) at the respective ends of a link.

For discussing efficiency, we describe a shorter URI as coarser, and a longer one as finer. The comparison could be made for URIs into the same sub-wwweb, for example

   http://www.target.com/some/long/path/    A
   http://www.target.com/some/path/         B

B is coarser than A. If a coarser URI replaces a finer one, the implication of clobbered namespaces arises, as well as a greater potential need for link modifications.

Remember that handling URLs, or particular resource locators, implies that for each link there is an unlink.

3. Implementation Terminology

In agreement with the HTTP specification documents and RFC 1123 [1], we employ must, shall or required to indicate implementation syntax or semantics that are not optional for software conforming to this specification, should for recommended features and may for optional features.

Please note that this draft does not constitute a modification of any standard, RFC, or draft document but a proposal for review by the HTTP Working Group and the internet administration and development community.

4. Current HTTP Link Management Protocol

The LINK and UNLINK methods are described in HTTP/1.1 [2] draft seven, sections 19.6.1.2 and 19.6.1.3, respectively. In short, the link and unlink request lines include method names and a request URI.
The specification [2] states (section 5.3):

   The LINK method establishes one or more Link relationships between the existing resource identified by the Request-URI and other existing resources.

   The UNLINK method removes one or more Link relationships from the existing resource identified by the Request-URI. These relationships may have been established using the LINK method or by any other method supporting the Link header. The removal of a link to a resource does not imply that the resource ceases to exist or becomes inaccessible for future references.

Because neither request provides both the source and the target of a link, implementing the current methods implies looking up the other end of the link. Link source or unlink target information is required in request headers, or on the request line, to allow a valuable optimization -- eliminating excess searching or indexing.

5. Some linking practices

Hyperlink maintenance methods are required for wwweb organization and must be interoperable across wwweb servers and robots in order to be effective. Robots and wanderers maintain catalogs of URI references and hypertext. Currently, unlink maintenance of these catalogs is largely manual.

A new facility for the Robot Exclusion Standard (RES), or "/robots.txt" [6], is currently under consideration for informing robots of changes to a server's sub-web, but it doesn't address the server-to-server case that most links fall into. The passive existence of a link directive instrument on a server would require every server to get the linking directives from every other server and apply them heuristically to try to weed out broken links. This is untenable for broad use because of its communication and processing requirements and the complexity of implementation. RES is useful for directing searches on subwebs by robots and is fairly widely employed by search engines and other robots.

The URN [7] proposal is another idea that is sometimes mentioned but isn't really relevant. It creates a hierarchical global namespace for resources, and is designed for resources with extensive lifetimes, not for the ordinary class of information. Named linking would be extremely useful for putting hyperlinks into this document for reference material. With a particular URN namespace, the reader would potentially find the closest copy, perhaps a local copy of an RFC or Internet-Draft document, rather than simply using the link to the USA East Coast repository provided here. But even URNs may not be appropriate for drafts with six month lifetimes.

WWWeb meta information and versioning are important in this context, as the proposed link maintenance extensions could benefit from mutual implementation in a wwweb server's object management system in conjunction with "Version management with meta-level links via HTTP/1.1" [3]. Content level links (see the "Link" content header in HTTP/1.0 [1] and the LINK entity in HTML 2.0 [8]) provide a default storage mechanism for link maintenance information.

6. Proposed Facility

Required semantics are very limited. Implementing systems are required only to support the LINK call and to dispose cleanly of other calls. This simple, lightweight form doesn't require storage overhead on robots, crawlers, etc.

The cost of employing this automation is lower than might first be imagined, as link changes with coarser effects are rarer than link changes with finer effects. Unlinks potentially occur for each link, without matching coarse URIs into fine URLs.

If the wwweb server maintains a table of LINKs for the target document, it can issue UNLINKs to delete or revise others' information when the location changes or is deleted. So the average cost in simple network calls and table size is linear in the number of links. The ratio of unlink calls generated to link calls received depends entirely on the server site characteristics.
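
The table described above can be pictured with a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation referred to in this draft: the class name ReverseLinkTable, its method names and the example.* hosts are hypothetical, and the notification is spelled UNLINKR, following the target-to-source method defined in section 7. Storing sources in a set keeps repeated LINK calls from duplicating data, and a moved target yields exactly one notification per stored link.

```python
from collections import defaultdict


class ReverseLinkTable:
    """Hypothetical per-server table: target URL -> known source URLs."""

    def __init__(self):
        # A set per target, so an identical LINK received twice is stored once.
        self._sources = defaultdict(set)

    def record_link(self, source_url, target_url):
        """Store the reverse link announced by a LINK call."""
        self._sources[target_url].add(source_url)

    def size(self):
        """Number of stored (source, target) pairs."""
        return sum(len(s) for s in self._sources.values())

    def target_moved(self, target_url, replacement_url=None):
        """Forget a target and return one request line per stored source."""
        calls = []
        for source_url in sorted(self._sources.pop(target_url, set())):
            parts = ["UNLINKR", source_url, target_url]
            if replacement_url is not None:
                parts.append(replacement_url)  # optional Repl-Target-URL
            calls.append(" ".join(parts))
        return calls


# Hypothetical sources linking to one target document.
table = ReverseLinkTable()
table.record_link("http://a.example/page.html", "http://www.target.com/doc.html")
table.record_link("http://a.example/page.html", "http://www.target.com/doc.html")
table.record_link("http://b.example/index.html", "http://www.target.com/doc.html")
before = table.size()

calls = table.target_moved(
    "http://www.target.com/doc.html", "http://www.target.com/new/doc.html"
)
for call in calls:
    print(call)
```

The number of request lines returned equals the number of stored links for the moved target, which is the linear cost noted above. Whether the table should also carry the stored sources over to the replacement URL is left open here, since the draft does not say.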
The table for a particular doc.html would store link source info, or reverse links. The UNLINK call is made to the host at the source end of the link, with the source and target links, so that it can handle the request with minimal overhead. The LINK call is made to the host serving the target when the reference locator is used in a link-source document.

Although HTML [8] defines LINK entities, in practice one doesn't want the wwweb server to download its link set with each HTML document -- if for no other reason than minimizing general bandwidth consumption.

7. Methods

1. LINK

Linking provides for subsequent link modifications from the target to the source. Links change at their target side, so the link establishment between two HTTP implementing systems needs to allow the target side to tell the source side when a link URL has changed.

The LINKMOD option tells the target end of the link that LINKMOD calls should be made to the source end.

The target maintains a table of source links associated with particular resources, so that if their URIs change the target can notify the source.

   LINK Source-URL Target-URL
   LINK Source-URL Target-URL LINKMOD

Request

The source tells the target that a URL to the target has been stored at the source.

Reply

The target will accept LINK calls with 200 Ok unless the Target-URL is invalid. In this case it will respond with a 417 Invalid target URI. If the LINKMOD option is requested but not enabled, the 207 No Linkmod reply will be generated.

2. UNLINK

UNLINK removes previous LINK information. A source tells a target that the source referenced in a prior LINK call no longer exists or has moved.

   UNLINK Source-URL Target-URL
   UNLINK Source-URL Target-URL Repl-Source-URL

Request

The source notifies the target that the source link has changed. Optionally, the source may specify a replacement source URL.

Reply

The target replies with 200 Ok unless the source has specified invalid source or target URLs.
In the case of erroneous source or target URIs, the target replies with one of 416 Invalid source URI or 417 Invalid target URI. The invalid target reply may indicate only that UNLINKR is not supported by the target or source system. The invalid source reply occurs when there is no such source link information known to the target.

3. UNLINKR

This method allows the target to inform the source that a link has changed. It specifies that the first argument refers to a source link that the receiver stores and the second argument refers to a target link from that source. It would be redundant with the UNLINK method if the semantics of UNLINK included determining whether the recipient of the call is the source or the target. For UNLINK, the receiver is the target end, and with UNLINKR, the receiver is the source end.

   UNLINKR Source-URL Target-URL
   UNLINKR Source-URL Target-URL Repl-Target-URL

Request

The target notifies the source that the Target-URL referenced from location Source-URL is no longer valid. The target optionally provides the source with a replacement target URL.

Reply

The source replies with 200 Ok unless the target has specified invalid target or source URLs. In the case of erroneous target or source URIs, the source replies with one of 416 Invalid source URI or 417 Invalid target URI. The invalid source reply may indicate only that UNLINK is not supported by the source or target system. The invalid target reply occurs when there is no such target link information known to the source.

4. LINKMOD

A LINKMOD call could notify robots that a page has been updated. This would require that LINK be extended with an optional request for LINKMOD calls. LINKMOD would be accepted by robots and crawlers in addition to UNLINK. The source will react according to its need for this information.

   LINKMOD Source-URL Target-URL

Request

The target informs the source that the resource at Target-URL has been modified.
Reply

The source replies with 200 Ok unless the target has specified invalid target or source URLs. In the case of erroneous target or source URIs, the source replies with one of 416 Invalid source URI or 417 Invalid target URI. The invalid source reply may indicate only that UNLINK is not supported by the source or target system. The invalid target reply occurs when there is no such target link information known to the source.

8. Implementation

We can divide all classes of HTTP-implementing software into two categories for specifying implementation requirements.

The first is the class of systems that maintain no link references (no HTML or URL catalogs) in their internal data. These have no implementation requirements.

The second is systems that maintain link references in HTML or URL catalog data. These include wwweb servers and search engines. The implementation must include LINK and may implement UNLINK, UNLINKR and LINKMOD. If it implements only LINK, it must reply with an Ok status code to any UNLINK, UNLINKR and LINKMOD calls it receives.

9. Idempotency

All of these methods are idempotent. Successive identical calls have the same effect as a single call. However, this requires that LINK be implemented so as not to replicate identical data. Please refer to RFCs 1738 [4] and 1808 [5] and HTTP/1.1 [2] Section 3.2.3 "URI Comparison" for information on determining when a LINK request should be discarded to preserve idempotency.

10. Security Considerations

Calls to the UNLINK and UNLINKR methods should be either manually reviewed, or automated and restricted to trusted or authenticated hosts.

At least robot-level spamming would be segmented into the LINKMOD domain until people used UNLINK or the variation based on replicating pages, i.e., UNLINK .

11. Syntax

The syntax employs an induction operator, "=" (parser), and a deduction operator ":" (compiler). Literals are double quoted. Alternatives succeed "|".
Where noted in ";" line comments, a syntactic variable may be defined in HTTP/1.1 [2]. Two linebreaks terminate a clause; any amount of whitespace is equivalent to a single token separator.

   Method = "LINK" | "UNLINK" | "UNLINKR" | "LINKMOD"

   Request = Link-Request-Line
           | Unlink-Request-Line
           | UnlinkR-Request-Line
           | LinkMod-Request-Line
             *( general-header )        ; HTTP/1.1 07 4.5
             CRLF

   Link-Request-Line = "LINK" Source-URL Target-URL
                     | "LINK" Source-URL Target-URL "LINKMOD"

   Unlink-Request-Line = "UNLINK" Source-URL Target-URL
                       | "UNLINK" Source-URL Target-URL Repl-Source-URL

   UnlinkR-Request-Line = "UNLINKR" Source-URL Target-URL
                        | "UNLINKR" Source-URL Target-URL Repl-Target-URL

   LinkMod-Request-Line = "LINKMOD" Source-URL Target-URL

   Source-URL : URL                     ; RFC 1738 Resource Locator

   Target-URL : URL

   Repl-Target-URL : URL                ; Suggested Link Replacement

   Repl-Source-URL : URL                ; Suggested Link Replacement

   Response = Status-Line               ; As HTTP/1.1

   Status-Code = "200"                  ; Ok
               | "207"                  ; No Linkmod
               | "400"                  ; Bad Request
               | "404"                  ; Not found
               | "416"                  ; Invalid source URI
               | "417"                  ; Invalid target URI
               | "500"                  ; Internal Server Error

12. References

   1. Hypertext Transfer Protocol -- HTTP/1.0, rfc1945, T. Berners-Lee, R. Fielding, H. Frystyk, May 1996

   2. Hypertext Transfer Protocol -- HTTP/1.1, draft-ietf-http-v11-spec-07, R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee, August 1996

   3. Version management with meta-level links via HTTP/1.1, draft-ota-http-version-00, K. Ota, K. Takahashi, K. Sekiya, November 1996

   4. Uniform Resource Locators (URL), rfc1738, T. Berners-Lee, L. Masinter, M. McCahill, December 1994

   5. Relative Uniform Resource Locators, rfc1808, R. Fielding, June 1995

   6. Robot Exclusion Standard, norobots.html, Martijn Koster

   7. A Framework for the Assignment and Resolution of Uniform Resource Names, draft-daigle-urnframework-00, Leslie L. Daigle, June 1996

   8. Hypertext Markup Language - 2.0, draft-ietf-html-spec-06, T. Berners-Lee, D. Connolly, September 1995
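
As an illustration of the request-line grammar in section 11, the following Python sketch splits a request line into its tokens and applies the per-method argument rules. It is a sketch under stated assumptions rather than a reference parser: the function name parse_request_line and the example.* hosts are hypothetical, URLs are not validated against RFC 1738, and signalling a malformed line with a ValueError carrying "400 Bad Request" is an assumption of the sketch, not something the draft prescribes.

```python
METHODS = {"LINK", "UNLINK", "UNLINKR", "LINKMOD"}


def parse_request_line(line):
    """Return (method, source_url, target_url, option) or raise ValueError.

    `option` is "LINKMOD" for LINK, a replacement URL for UNLINK/UNLINKR,
    or None when the optional fourth token is absent.
    """
    # Any amount of whitespace counts as a single token separator.
    tokens = line.split()
    if len(tokens) not in (3, 4) or tokens[0] not in METHODS:
        raise ValueError("400 Bad Request")

    method, source_url, target_url = tokens[:3]
    option = tokens[3] if len(tokens) == 4 else None

    # LINK accepts only the literal "LINKMOD" as its fourth token.
    if method == "LINK" and option not in (None, "LINKMOD"):
        raise ValueError("400 Bad Request")
    # LINKMOD takes exactly two URLs.
    if method == "LINKMOD" and option is not None:
        raise ValueError("400 Bad Request")

    return method, source_url, target_url, option


# Hypothetical request line from a source announcing a stored link.
parsed = parse_request_line(
    "LINK http://a.example/page.html   http://www.target.com/doc.html LINKMOD"
)
print(parsed)
```

Keeping the optional fourth token as a single field mirrors the grammar, where each method has at most one optional trailing element.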