Internet Draft                                       John Pritchard
<draft-pritchard-http-links-00>         Columbia U Computer Science

Expires June 1996                                  21 November 1996

                  Efficient HyperLink Maintenance for HTTP

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of
the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to cite them
other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Distribution of this document is unlimited. Please send comments to John
Pritchard at <jdp@cs.columbia.edu>

Abstract

Hyperlink maintenance allows robots and servers to cooperate in propagating
the effects of daily changes in the millions of resource locations in the
wwweb. Here, we propose refining the definitions of the LINK and UNLINK
methods, which have been defined for HTTP since RFC 1945 but remain largely
unimplemented and unused. We believe that the only reason these methods have
not been employed is that they remain too loosely defined and, by
implication, too inefficient. A new syntax and semantics simplify
implementation and improve utility.

Author's address

John Pritchard
315 W 82nd Street, #4
New York, NY 10024

<jdp@cs.columbia.edu>

Contents

  1. Introduction

  2. Link Terminology

  3. Implementation Terminology

  4. Current HTTP Link Management Protocol

  5. Some linking practices

  6. Proposed Facility

  7. Methods

       1. LINK

       2. UNLINK

       3. UNLINKR

       4. LINKMOD

  8. Implementation

  9. Idempotency

 10. Security Considerations

 11. Syntax

 12. References

  1. Introduction

     The HTTP protocol has recognized the importance of link management
     since HTTP/1.0 (RFC 1945 [1]). However, the methods defined in HTTP/1.0
     are limited and remain largely unimplemented. The existing link concept
     is defined irrespective of direction, i.e., without distinguishing the
     reference end from the resource end, and so leaves too much
     semantically implied. The revised methods define simple and efficient
     syntax and semantics for a complete hyperlink management protocol
     within HTTP.

     Dangling links are a growing problem on a large and expanding wwweb.
     Messages like the following are common:

       The URL which you entered, ... , was not found on this server.
       You may have entered it incorrectly, or it may no longer exist.
       If you arrived here by clicking on a link in another page,
       please tell that page's owner/administrator that the link no
       longer exists.


     This one resulted from following a URL stored in a popular search
     engine. A solution is readily available: define HTTP's LINK and UNLINK
     methods with syntax and semantics that effectively and efficiently
     provide for hyperlink maintenance.

     Hyperlink maintenance implies communication, processing and storage
     costs. The proposed methods reduce processing through their syntax:
     each call names both ends of the link, so the semantics never require
     the receiver of a call to search for the other end. The proposed
     methods' semantics also match storage requirements to the HTML LINK
     tag concept. No storage space is required on behalf of robots for
     implementation.

     The protocol detailed here is currently being implemented in an
     HTTP/1.1 compliant, commercial wwweb server and agent platform under
     the extensions provisions of that specification. This protocol has been
     realized as the result of that effort.

  2. Link Terminology

     In this context we refer exclusively to links that are Uniform Resource
     Locators (URLs); see [4] and [2]. URLs are Uniform Resource Identifiers
     (URIs) [1] that point to particular resources without variation per
     user identity, class or input, or other particularly perishable or
     localized circumstances.

     A link has two end points, one in an HTML anchor or otherwise a URL
     reference, and the other in the HTTP service providing access to a
     resource via a reference. The source end of a link is the client or
     anchor end, sometimes the tail, and the target end of a link is the
     resource end, sometimes the head.

       source: anchor, reference, tail

       target: resource, head, server, named anchor

     The terms source and target may refer directly to documents, to
     reference locators (URLs), or to the services (hosts) at the
     respective ends of a link.

     For discussing efficiency, we describe a shorter URI as coarser, and a
     longer one finer. The comparison could be made for URIs into the same
     sub-wwweb, for example

        http://www.target.com/some/long/path/     A

        http://www.target.com/some/path/          B

     B is coarser than A. If a coarser URI replaces a finer one, the
     implication of clobbered namespaces arises as well as a greater
     potential need for link modifications. Remember that handling URLs, or
     particular resource locators, implies that for each link there's an
     unlink.
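
     The coarser/finer comparison can be sketched in a few lines. This is
     only an illustration: it assumes that "shorter" is measured in path
     segments and that two URIs are comparable only when they name the same
     scheme and host. The function names are hypothetical and not part of
     the proposal.

```python
from urllib.parse import urlsplit


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def is_coarser(a, b):
    """True if a and b are in the same sub-wwweb and a has fewer segments.

    Assumption: "same sub-wwweb" means same scheme and host.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    if (pa.scheme, pa.netloc.lower()) != (pb.scheme, pb.netloc.lower()):
        return False  # the comparison is only meaningful within one host
    return len(path_segments(a)) < len(path_segments(b))


A = "http://www.target.com/some/long/path/"
B = "http://www.target.com/some/path/"
print(is_coarser(B, A))  # B is coarser than A
```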

  3. Implementation Terminology

     In agreement with the HTTP specification documents [1] [2] and RFC
     1123, we employ must, shall or required to indicate implementation
     syntax or semantics that are not optional for software conforming to
     this specification, should for recommended features and may for
     optional features.

     Please note that this draft does not constitute a modification of any
     standard, RFC, or draft document. It is a proposal for review by the
     HTTP Working Group and the Internet administration and development
     community.

  4. Current HTTP Link Management Protocol

     The LINK and UNLINK methods are described in HTTP/1.1 [2] draft seven,
     sections 19.6.1.2 and 19.6.1.3, respectively. In short, the link and
     unlink request lines include method names and a request URI.

     The specification [2] states (section 5.3)

          The LINK method establishes one or more Link relationships
          between the existing resource identified by the Request-URI
          and other existing resources.

          The UNLINK method removes one or more Link relationships
          from the existing resource identified by the Request-URI.
          These relationships may have been established using the
          LINK method or by any other method supporting the Link
          header. The removal of a link to a resource does not imply
          that the resource ceases to exist or becomes inaccessible
          for future references.

     Because the current methods do not supply both the source and the
     target of a link when LINKing or UNLINKing, implementing them implies
     looking up the other end of the link. Supplying the source and target
     together, in request headers or on the request line, allows a valuable
     optimization: eliminating excess searching or indexing.

  5. Some linking practices

     Hyperlink maintenance methods are required for wwweb organization and
     must be interoperable across wwweb servers and robots in order to be
     effective. Robots and wanderers maintain catalogs of URI references and
     hypertext. Currently, unlink maintenance of these catalogs is largely
     manual. A new facility for informing robots of changes to a server's
     sub-web is currently under consideration for the Robot Exclusion
     Standard (RES) or "/robots.txt" [6], but it doesn't address the server
     to server case that most links fall into. The passive existence of a
     link directive instrument on a server would require every server to
     get the linking directives from every other server and apply them
     heuristically to try to weed out broken links. This is untenable for
     broad use because of its communication and processing requirements and
     the complexity of implementation. RES is useful for directing searches
     on subwebs by robots and is fairly widely employed by search engines
     and other robots.

     The URN [7] proposal is another idea that is sometimes mentioned but
     really isn't relevant. It creates a hierarchical global namespace for
     resources, and is designed for resources with extensive lifetimes, and
     not the ordinary class of information. Named linking would be extremely
     useful for putting hyperlinks into this document for reference
     material. With a particular URN namespace, the reader would potentially
     find the closest copy, perhaps a local copy of an RFC or Internet-Draft
     document, rather than simply use the link provided to the USA East
     Coast repository provided here. But even URN may not be appropriate for
     drafts with six month lifetimes.

     WWWeb meta information and versioning are important in this context as
     the proposed link maintenance extensions could benefit from mutual
     implementation in a wwweb server's object management system in
     conjunction with "Version management with meta-level links via
     HTTP/1.1" [3]. Content level links (see "Link" content header in
     HTTP/1.0 [1] and LINK element in HTML 2.0 [8]) provide a default storage
     mechanism for link maintenance information.

  6. Proposed Facility

     Required semantics are very limited. Only support for the LINK call,
     and clean disposal of other calls, is required by implementing systems.

     This simple, lightweight form doesn't require storage overhead on
     robots, crawlers, etc.

     The cost of employing this automation is lower than might first be
     imagined as link changes with coarser effects are rarer than link
     changes with finer effects. Unlinks potentially occur for each link,
     without matching coarse URIs into fine URLs.

     If the wwweb server maintains a table of LINKs for the target document,
     it can issue UNLINKs to delete or revise others' information when the
     location changes or is deleted. So the average cost in simple network
     calls and table size is linear in the number of links. The ratio of
     unlink calls generated to link calls received depends entirely on the
     server site characteristics.

     The table for a particular doc.html would store link source info, or
     reverse links. The UNLINK call is made to the host in the source end of
     the link, with the source and target links so that it can handle the
     request with minimal overhead. The LINK call is made to the host
     serving the target when the reference locator is used in a link-source
     document.
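
     The reverse-link table described above might be sketched as follows.
     This is an illustration under assumptions, not the implementation
     referred to in the introduction: the class and method names are
     hypothetical, the table lives in memory, and the notices are returned
     as request lines rather than sent. The request name follows section 7,
     which calls the target-to-source form UNLINKR.

```python
from collections import defaultdict


class LinkTable:
    """Reverse links kept by a target server: target URL -> source URLs."""

    def __init__(self):
        self._sources = defaultdict(set)

    def record_link(self, source_url, target_url):
        """Store the source end of a received LINK call (a set, so no duplicates)."""
        self._sources[target_url].add(source_url)

    def notices_for_move(self, target_url, new_target_url=None):
        """Return one UNLINKR request line per recorded source.

        Called when target_url is deleted (new_target_url is None) or moved.
        Cost is linear in the number of links recorded for the target.
        """
        lines = []
        for source_url in sorted(self._sources.pop(target_url, set())):
            parts = ["UNLINKR", source_url, target_url]
            if new_target_url is not None:
                parts.append(new_target_url)
                # Keep the reverse link under the new location.
                self._sources[new_target_url].add(source_url)
            lines.append(" ".join(parts))
        return lines
```

     A server would call record_link for each LINK it accepts and
     notices_for_move when doc.html changes location, then deliver each
     returned line to the host named in its Source-URL.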

     Although HTML [8] defines LINK elements, in practice one doesn't want
     the wwweb server to download its link set with each HTML document -- if
     for no other reason than minimizing general bandwidth consumption.

  7. Methods

       1. LINK

          Linking provides for subsequent link modifications from the target
          to the source. Links change at their target side, so the link
          establishment between two HTTP implementing systems needs to allow
          the target side to tell the source side when a link URL has
          changed.

          The LINKMOD option tells the target end of the link that LINKMOD
          calls should be made to the source end.

          The target maintains a table of source links associated with
          particular resources so that if their URIs change the target can
          notify the source.

            LINK Source-URL Target-URL

            LINK Source-URL Target-URL LINKMOD

               Request
                    The source tells the target that a URL to the target
                    has been stored at the source.

               Reply
                    The target will accept LINK calls with 200 Ok unless
                    the Target-URL is invalid. In this case it will respond
                    with a 417 Invalid target URI. If the LINKMOD option is
                    requested but not enabled, the 207 No Linkmod reply will
                    be generated.
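
          The reply rules for LINK can be sketched as a small handler on the
          target side. The function name, the dictionary layout and the
          linkmod_enabled switch are assumptions made for illustration. The
          text above does not say whether a link is still recorded when 207
          is returned; this sketch assumes that it is.

```python
def handle_link(table, resources, source_url, target_url,
                want_linkmod=False, linkmod_enabled=False):
    """Handle one LINK call on the target side.

    table:     dict mapping target URL -> {source URL: linkmod flag}
    resources: set of URLs this server actually serves
    Returns a (status code, reason phrase) pair.
    """
    if target_url not in resources:
        return 417, "Invalid target URI"
    # Record the reverse link; remember whether LINKMOD calls will be sent.
    table.setdefault(target_url, {})[source_url] = (
        want_linkmod and linkmod_enabled)
    if want_linkmod and not linkmod_enabled:
        return 207, "No Linkmod"  # assumption: the link itself is still stored
    return 200, "Ok"
```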

       2. UNLINK

          UNLINK removes previous LINK information. A source tells a target
          that the previous source referenced in a prior LINK call no longer
          exists or has moved.

            UNLINK Source-URL Target-URL

            UNLINK Source-URL Target-URL Repl-Source-URL

               Request
                    The source notifies the target that the source link has
                    changed. Optionally, the source may specify a
                    replacement source URL.

               Reply
                    The target replies with 200 Ok unless the source has
                    specified invalid source or target URLs. In the case of
                    erroneous source or target URIs, the target replies with
                    one of 416 Invalid source URI or 417 Invalid target URI.
                    The invalid target reply may indicate only that UNLINKR
                    has not been supported by the target or source system.
                    The invalid source reply occurs when there is no such
                    source link information known to the target.
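
          A target-side handler for UNLINK might look like the following
          sketch. The names and the set-based table are hypothetical. The
          order in which the two error checks are made is an assumption, as
          the text does not specify it.

```python
def handle_unlink(table, resources, source_url, target_url,
                  repl_source_url=None):
    """Handle one UNLINK call on the target side.

    table:     dict mapping target URL -> set of source URLs
    resources: set of URLs this server actually serves
    Returns a (status code, reason phrase) pair.
    """
    if target_url not in resources:
        return 417, "Invalid target URI"
    sources = table.get(target_url, set())
    if source_url not in sources:
        # No such source link information is known to the target.
        return 416, "Invalid source URI"
    sources.discard(source_url)
    if repl_source_url is not None:
        sources.add(repl_source_url)  # the source document has moved
    return 200, "Ok"
```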

       3. UNLINKR

          This method allows the target to inform the source that a link has
          changed. It specifies that the first argument refers to a source
          link that the receiver stores and the second argument refers to a
          target link from that source. It would be redundant with UNLINK if
          the semantics of UNLINK included determining whether the recipient
          of the call is the source or the target.

          For UNLINK, the receiver is the target end, and with UNLINKR, the
          receiver is the source end.

            UNLINKR Source-URL Target-URL

            UNLINKR Source-URL Target-URL Repl-Target-URL

               Request
                    The target notifies the source that the Target-URL
                    referenced from location Source-URL is no longer valid.
                    The target optionally provides the source with a
                    replacement target URL.

               Reply
                    The source replies with 200 Ok unless the target has
                    specified invalid target or source URLs. In the case of
                    erroneous target or source URIs, the source replies with
                    one of 417 Invalid target URI or 416 Invalid source URI.
                    The invalid source reply may indicate only that UNLINK
                    has not been supported by the source or target system.
                    The invalid target reply occurs when there is no such
                    target link information known to the source.
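
          On the source side, UNLINKR amounts to dropping or rewriting one
          stored reference. The sketch below assumes a hypothetical catalog
          that maps each source document to the set of target URLs it
          references, and takes its status codes from the table in section
          11 (416 for the source, 417 for the target).

```python
def handle_unlinkr(catalog, source_url, target_url, repl_target_url=None):
    """Handle one UNLINKR call on the source side.

    catalog: dict mapping source URL -> set of target URLs it references
    Returns a (status code, reason phrase) pair.
    """
    targets = catalog.get(source_url)
    if targets is None:
        return 416, "Invalid source URI"
    if target_url not in targets:
        # No such target link information is known to the source.
        return 417, "Invalid target URI"
    targets.remove(target_url)
    if repl_target_url is not None:
        targets.add(repl_target_url)  # follow the resource to its new URL
    return 200, "Ok"
```

          A robot would apply the same change to its index entry, and a
          server would additionally rewrite the anchor in the stored
          document.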

       4. LINKMOD

          A LINKMOD call could notify robots that a page has been updated.
          This requires that LINK be extended with an optional request for
          LINKMOD calls.

          LINKMOD would be accepted by robots and crawlers in addition to
          UNLINK. The source will react according to its need for this
          information.

            LINKMOD Source-URL Target-URL

               Request
                    The target informs the source that the Target-URL has
                    been modified.

               Reply
                    The source replies with 200 Ok unless the target has
                    specified invalid target or source URLs. In the case of
                    erroneous target or source URIs, the source replies with
                    one of 417 Invalid target URI or 416 Invalid source URI.
                    The invalid source reply may indicate only that UNLINK
                    has not been supported by the source or target system.
                    The invalid target reply occurs when there is no such
                    target link information known to the source.
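
          How a source reacts to LINKMOD is left to the source. One possible
          reaction for a robot, shown here only as a sketch, is to schedule
          the modified target for another visit. The catalog layout and the
          recrawl list are hypothetical, and the status codes follow the
          table in section 11.

```python
def handle_linkmod(catalog, recrawl, source_url, target_url):
    """Handle one LINKMOD call on the source (robot) side.

    catalog: dict mapping source URL -> set of target URLs it references
    recrawl: list of target URLs waiting to be fetched again
    Returns a (status code, reason phrase) pair.
    """
    if source_url not in catalog:
        return 416, "Invalid source URI"
    if target_url not in catalog[source_url]:
        return 417, "Invalid target URI"
    if target_url not in recrawl:
        recrawl.append(target_url)  # repeated notices queue the URL once
    return 200, "Ok"
```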

  8. Implementation

     We can divide all classes of HTTP-implementing software into two
     categories for specifying implementation requirements. The first is the
     class of systems that maintain no link references (no HTML or URL
     catalogs) in their internal data. These have no implementation
     requirements.

     The second is systems that maintain link references in HTML or URL
     catalog data. These include wwweb servers and search engines.

     The implementation must include LINK and may implement UNLINK, UNLINKR
     and LINKMOD. If it is only implementing LINK, it must reply with an Ok
     status code to any UNLINK, UNLINKR and LINKMOD calls it receives.
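
     The minimum requirement can be sketched as a dispatcher that handles
     LINK and cleanly disposes of the other three calls. The function name
     and the link_handler callback are hypothetical, and the 400 reply for
     unrecognized methods is an assumption drawn from the status codes
     listed in section 11.

```python
OPTIONAL_METHODS = ("UNLINK", "UNLINKR", "LINKMOD")


def dispatch(method, args, link_handler):
    """Dispatch one call in a system that implements only LINK.

    link_handler is called with the request arguments and must return a
    (status code, reason phrase) pair.
    """
    if method == "LINK":
        return link_handler(*args)
    if method in OPTIONAL_METHODS:
        return 200, "Ok"  # accepted and discarded without any processing
    return 400, "Bad Request"
```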

  9. Idempotency

     All of these methods are idempotent. Successive identical calls have
     the same effect as a single call. However, this requires that LINK be
     implemented so as not to replicate identical data. Please refer to RFCs
     1738 [4] and 1808 [5] and HTTP/1.1 [2] Section 3.2.3 "URI Comparison"
     for information on determining when a LINK request should be discarded
     in preserving idempotency.
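
     One way to keep LINK idempotent is to normalize both URLs before
     storing the pair in a set. The sketch below is a simplification, not a
     full rendering of the comparison rules in the cited documents: it
     assumes http URLs with plain host names, lowercases the scheme and
     host, drops an explicit port 80, and treats an empty path as "/". The
     function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce an http URL to a canonical form for comparison (simplified)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the host
    port = parts.port
    netloc = host if port in (None, 80) else f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query,
                       parts.fragment))


def record_link(links, source_url, target_url):
    """Add a link to the set; return False if it was already recorded."""
    key = (normalize(source_url), normalize(target_url))
    if key in links:
        return False  # identical LINK request: discard it
    links.add(key)
    return True
```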

 10. Security Considerations

     The UNLINK and UNLINKR methods' calls should be manually reviewed or
     automated and secured for trusted or authenticated hosts.
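
     As a sketch of this recommendation, a receiver could apply calls from
     trusted hosts automatically and hold all others for manual review. How
     the calling host is identified or authenticated is outside this
     example. The function name, the trusted_hosts set and the review queue
     are hypothetical.

```python
def screen_unlink(request_line, caller_host, trusted_hosts, review_queue):
    """Decide how to treat one UNLINK or UNLINKR call.

    trusted_hosts: set of lowercase host names allowed to act automatically
    review_queue:  list collecting (host, request line) pairs for an operator
    Returns "apply" or "review".
    """
    if caller_host.lower() in trusted_hosts:
        return "apply"
    review_queue.append((caller_host, request_line))
    return "review"
```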

     At least robot-level spamming would be confined to the LINKMOD domain
     until people used UNLINK <target> <target> or the variation based on
     replicating pages, i.e., UNLINK <target> <copy of target>.

 11. Syntax

     The syntax employs an induction operator, "=" (parser), and a deduction
     operator, ":" (compiler). Literals are double quoted. Alternatives
     follow "|". Where noted in ";" line comments, a syntactic variable may
     be defined in HTTP/1.1 [2]. Two linebreaks terminate a clause, and any
     amount of whitespace is equivalent to a single token separator.

             Method        = "LINK"
                           | "UNLINK"
                           | "UNLINKR"
                           | "LINKMOD"

             Request       = ( Link-Request-Line
                             | Unlink-Request-Line
                             | UnlinkR-Request-Line
                             | LinkMod-Request-Line )
                           *( general-header  )      ; HTTP/1.1 07 4.5
                           CRLF

             Link-Request-Line
               = "LINK" Source-URL Target-URL
               | "LINK" Source-URL Target-URL "LINKMOD"

             Unlink-Request-Line
               = "UNLINK" Source-URL Target-URL
               | "UNLINK" Source-URL Target-URL Repl-Source-URL

             UnlinkR-Request-Line
               = "UNLINKR" Source-URL Target-URL
               | "UNLINKR" Source-URL Target-URL Repl-Target-URL

             LinkMod-Request-Line
               = "LINKMOD" Source-URL Target-URL

             Source-URL    : URL     ; RFC 1738 Resource Locator

             Target-URL    : URL

             Repl-Target-URL
               : URL                 ; Suggested Link Replacement

             Repl-Source-URL
               : URL                 ; Suggested Link Replacement

             Response      = Status-Line  ; As HTTP/1.1

             Status-Code   = "200"   ; Ok
                           | "207"   ; No Linkmod
                           | "400"   ; Bad Request
                           | "404"   ; Not found
                           | "416"   ; Invalid source URI
                           | "417"   ; Invalid target URI
                           | "500"   ; Internal Server Error
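
     The four request-line forms above can be recognized with a short
     parser. This is a sketch under assumptions: it handles the request
     line only, treats any run of whitespace as one separator, and does not
     validate that the arguments are well-formed URLs. The function name
     and the shape of the returned dictionary are hypothetical.

```python
METHODS = {"LINK", "UNLINK", "UNLINKR", "LINKMOD"}


def parse_request_line(line):
    """Parse one request line into a dict, or raise ValueError."""
    tokens = line.split()  # any amount of whitespace separates tokens
    if len(tokens) < 3 or tokens[0] not in METHODS:
        raise ValueError("not a link maintenance request: %r" % line)
    method, source, target, extra = (tokens[0], tokens[1], tokens[2],
                                     tokens[3:])
    parsed = {"method": method, "source": source, "target": target,
              "linkmod": False, "replacement": None}
    if method == "LINK" and extra == ["LINKMOD"]:
        parsed["linkmod"] = True  # LINK Source-URL Target-URL LINKMOD
    elif method in ("UNLINK", "UNLINKR") and len(extra) == 1:
        parsed["replacement"] = extra[0]  # Repl-Source-URL or Repl-Target-URL
    elif extra:
        raise ValueError("unexpected trailing tokens: %r" % extra)
    return parsed
```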

 12. References

       1. Hypertext Transfer Protocol -- HTTP/1.0
          rfc1945
          T. Berners-Lee, R. Fielding, H. Frystyk
          May 1996

       2. Hypertext Transfer Protocol -- HTTP/1.1
          draft-ietf-http-v11-spec-07
          R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, T. Berners-Lee
          August 1996

       3. Version management with meta-level links via HTTP/1.1
          draft-ota-http-version-00
          K. Ota, K. Takahashi, K. Sekiya
          November 1996

       4. Uniform Resource Locators (URL)
          rfc1738
          T. Berners-Lee, L. Masinter, M. McCahill
          December 1994

       5. Relative Uniform Resource Locators
          rfc1808
          R. Fielding
          June 1995

       6. Robot Exclusion Standard
          norobots.html
          Martijn Koster

       7. A Framework for the Assignment and Resolution of Uniform Resource
          Names
          draft-daigle-urnframework-00
          Leslie L. Daigle
          June 1996

       8. Hypertext Markup Language - 2.0
          draft-ietf-html-spec-06
          T. Berners-Lee, D. Connolly
          September 1995