INTERNET-DRAFT                                               Jim Davis
draft-davis-dasl-requirements-00.txt                 Xerox Corporation
Sept 17, 1998                                             Saveen Reddy
Expires Mar 17, 1999                             Microsoft Corporation
                                                          Judith Slein
                                                     Xerox Corporation


Requirements for DAV Searching and Locating
  
  
Status of this Memo
  
  This document is an Internet draft. Internet drafts are working
  documents of the Internet Engineering Task Force (IETF), its areas and
  its working groups. Note that other groups may also distribute working
  information as Internet drafts.
  
  Internet Drafts are draft documents valid for a maximum of six months
  and can be updated, replaced or obsoleted by other documents at any
  time. It is inappropriate to use Internet drafts as reference material
  or to cite them other than as "work in progress".
 
  To view the entire list of current Internet-Drafts, please check
  the "1id-abstracts.txt" listing contained in the Internet-Drafts
  Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
  (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
  (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu
  (US West Coast).  Further information about the IETF can be
  found at URL: http://www.ietf.org/ .
 
  Distribution of this document is unlimited. Please send comments to
  the mailing list at www-webdav-dasl@w3.org, which may be joined by
  sending a message with subject "subscribe" to
  www-webdav-dasl-request@w3.org.
  
  Discussions on the list are archived at
  http://www.w3.org/pub/WWW/Archives/Public/www-webdav-dasl .

Abstract
  
  The Distributed Authoring and Versioning protocol [WEBDAV] defines
  simple mechanisms to assign and retrieve values for properties. This
  document presents requirements for a WebDAV extension to support
  efficient searching for resources based on WebDAV properties and
  content. These requirements are intended to be the basis for the DAV
  Searching and Locating (DASL) protocol.


Davis et al.                                                   [Page 1]

Internet Draft                 DASL Req                                

1. Introduction

Motivation for DASL
  
  WebDAV and HTTP provide support for client-side search, but not
  server-side search. The GET method defined in [HTTP] allows clients to
  retrieve a resource's content; the PROPFIND method defined in [WEBDAV]
  allows clients to retrieve a resource's properties. Having retrieved a
  resource's properties and / or content, the client can compare them to
  its search criteria to determine whether the resource is of interest.
  Although this client-side searching is logically sufficient, and
  requires no modifications to the server, it comes at a significant
  cost, because it makes inefficient use of network resources. A client
  must retrieve properties and content for each resource under
  consideration. Furthermore, it does not take advantage of server
  intelligence. Servers capable of searching can use sophisticated
  mechanisms to generate results: internal caching of intermediate
  search results, content-indexing, etc.
  
  Even simple, common queries may expose these limitations. Consider the
  query "find all text files modified during the last week." When such a
  query is extended to a large number of clients searching against a
  single server, the limitations become more apparent. Client-side
  searching has difficulties scaling in these cases.
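
  As an illustration only (not part of any proposed protocol), the cost
  of client-side searching for the query above can be modelled in
  Python. The FakeServer class and its propfind method are hypothetical
  stand-ins for one PROPFIND round trip per resource; the property
  names follow the DAV getcontenttype and getlastmodified properties.

```python
from datetime import datetime, timedelta


class FakeServer:
    """Hypothetical stand-in for a WebDAV server; counts round trips."""

    def __init__(self, resources):
        self.resources = resources  # maps URL -> dict of properties
        self.requests = 0

    def list_urls(self):
        return list(self.resources)

    def propfind(self, url):
        # Models one PROPFIND round trip for a single resource.
        self.requests += 1
        return self.resources[url]


def client_side_search(server, now):
    """Find text resources modified during the last week, client side."""
    cutoff = now - timedelta(days=7)
    hits = []
    for url in server.list_urls():
        props = server.propfind(url)  # every resource must be fetched
        if (props["getcontenttype"].startswith("text/")
                and props["getlastmodified"] >= cutoff):
            hits.append(url)
    return hits


now = datetime(1998, 9, 17)
server = FakeServer({
    "/a.txt": {"getcontenttype": "text/plain",
               "getlastmodified": datetime(1998, 9, 15)},
    "/b.gif": {"getcontenttype": "image/gif",
               "getlastmodified": datetime(1998, 9, 16)},
    "/c.txt": {"getcontenttype": "text/plain",
               "getlastmodified": datetime(1998, 8, 1)},
})
hits = client_side_search(server, now)
```

  In this model the number of requests grows with the number of
  resources in the scope, not with the number of matches, which is the
  scaling problem described above.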
  
  DASL allows for server-side searching. Server-side searching allows
  the client to formulate a query and have the server perform the task
  of selecting the resources that fit the criteria. This overcomes both
  of the limitations of client-side searching described above. The
  benefit is a searching solution that scales; the cost is that the
  server software becomes more complex.
  
  This document presents requirements for any possible protocol that
  might be proposed for DASL. In this document, the word "MUST" refers
  to features or behavior which, if absent from a proposed protocol,
  would make it unusable for DASL. The word "SHOULD" refers to features
  that are desirable, but not necessary. These requirements come from
  considerations of the scenarios presented in [SCENARIOS], from the
  need to support the WebDAV object model, the use of HTTP, and general
  IETF rules. We have provided rationale for those requirements whose
  justification is not obvious.

2. Terminology
  
  scope
       a set of resources to be searched.
  criteria
       an expression against which each resource in the search scope is
       evaluated.
  result set
       a set of records, one for each resource for which the search
       criteria evaluated to True.
  record
       a description of a resource. A result record is a set of
       properties, and possibly other descriptive information.
  result
       A result is a result set, optionally augmented with other
       information describing the search as a whole.
  result record definition
       a specification of the set of properties to be returned in the
       result record.
  sort specification
       a specification of an ordering on the result records in the
       result set.
  search modifier
       an instruction that governs the execution of the query but is not
       part of the search scope, result record definition, the search
       criteria, or the sort specification. An example of a search
       modifier is one that controls how much time the server can spend
       on the query before giving a response.
  query
       A query is a combination of a search scope, search criteria,
       result record definition, sort specification, and a search
       modifier.
  query grammar
       a set of definitions of XML elements, attributes, and constraints
       on their relations and values that defines a set of queries and
       the intended semantics.
  schema
       a listing, for any given grammar and scope, of the properties and
       operators that may be used in a query with that grammar and
       scope.
  hit highlighting
       a specification of the location(s) within a resource containing
       text that matched a content query. It allows clients to provide
       visual cues that help a user identify the segments of a text
       resource that caused it to match a content-based query.
  paged results
       a mechanism that allows a client to request that the server
       return a subset of the result set rather than the entire set. In
       subsequent calls to the server, additional results from the same
       query can be requested. Paged results are intended to improve the
       performance and manageability of search results.
  In addition to the terms defined above, this document uses terminology
  consistent with [HTTP] and [WEBDAV].
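
  The relationship between these terms can be pictured as a small data
  structure. The following Python sketch is purely illustrative: the
  Query class, its field names, and the run function are assumptions
  made for this example and do not correspond to any DASL grammar.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Query:
    """Hypothetical container mirroring the parts of a query."""
    scope: List[str]                               # resources to search
    criteria: Callable[[Dict[str, object]], bool]  # evaluated per resource
    record_definition: List[str]                   # properties to return
    sort_key: Optional[str] = None                 # sort specification
    modifiers: Dict[str, object] = field(default_factory=dict)


def run(query, store):
    """Evaluate a query against an in-memory store (URL -> properties)."""
    matches = [url for url in query.scope
               if url in store and query.criteria(store[url])]
    if query.sort_key is not None:
        # Assumes the sort property is defined on every matching resource.
        matches.sort(key=lambda url: store[url][query.sort_key])
    # Each result record holds only the requested properties.
    return [{"href": url,
             **{p: store[url].get(p) for p in query.record_definition}}
            for url in matches]


store = {
    "/x": {"size": 5, "author": "A"},
    "/y": {"size": 50, "author": "B"},
    "/z": {"size": 2, "author": "C"},
}
query = Query(scope=["/x", "/y", "/z"],
              criteria=lambda props: props["size"] < 10,
              record_definition=["author"],
              sort_key="size")
result_set = run(query, store)
```

  Here the criteria refer only to size, the result records contain only
  author, and the sort specification orders the records by size.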
  
  Requirements are divided into five categories, and numbered within
  each category. The categories are Scope, Criteria, Results, Other,
  and Discovery.


3. Requirements: Scope
  
  
  S1:  It MUST be possible to specify at least one resource in the
  scope. It SHOULD be possible to specify a set of distinct, unrelated
  resources in the scope.
  
  
  S2:  It MUST be possible to specify a WebDAV collection as a scope.
  
  
  S3:  It SHOULD be possible to specify other types of resources in a
  scope.
       Rationale: A client might wish to determine whether a given
       resource was of interest without transferring it.
  
  
  S4:  When the scope is a collection, it MUST be possible to specify
  the depth.
       Users often intend to scope their searches either to the
       immediate children of a container or to extend the search
       recursively to all of the container's descendants. Furthermore,
       depth control is needed to prevent servers from performing
       unnecessary work.
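
       A minimal sketch of depth control, assuming a collection
       hierarchy held in memory as a mapping from each collection to
       its immediate children. The function name and the convention
       that None means unlimited depth are assumptions of this example,
       not proposals for the protocol.

```python
def resources_in_scope(tree, collection, depth):
    """Return members of `collection` down to `depth` levels.

    `tree` maps a collection URL to its immediate children. `depth` is
    1 for immediate children only, or None for an unlimited search.
    """
    found = []
    frontier = [(collection, 0)]
    while frontier:
        url, level = frontier.pop(0)
        if depth is not None and level >= depth:
            continue  # do not descend further: avoids unnecessary work
        for child in tree.get(url, []):
            found.append(child)
            frontier.append((child, level + 1))
    return found


tree = {
    "/docs/": ["/docs/a.txt", "/docs/old/"],
    "/docs/old/": ["/docs/old/b.txt"],
}
children_only = resources_in_scope(tree, "/docs/", 1)
everything = resources_in_scope(tree, "/docs/", None)
```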
  

4. Requirements: Criteria
  
  
Criteria generalities
  
  
  C1:  It MUST be possible to search properties in a query. It MUST be
  possible to search both DAV-defined and application-defined properties
  in a query.
       Further requirements for properties are below.
  
  
  C2:  It MUST be possible to search content in a query.
       Note that at this writing, unlike property searches, there is no
       single widely accepted semantics for content-based queries.
       Further requirements for content criteria are below.
  
  
  C3:  It MUST be possible to search both properties and content in a
  single query.
  
  
  C4:  It MUST be possible to combine criteria with Boolean operators
  (i.e. and, or, not)
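
       As a sketch of how C1 through C4 fit together, criteria can be
       modelled as an expression tree that mixes property tests and
       content tests under Boolean operators. The nested-tuple syntax,
       the operator names "prop-eq" and "content-contains", and the
       resource layout are all hypothetical and chosen only for this
       example.

```python
def matches(expr, resource):
    """Evaluate a nested-tuple criteria expression against one resource.

    `resource` is a dict with "props" (a dict) and "content" (a string);
    both the tuple syntax and this layout are assumptions of the sketch.
    """
    op = expr[0]
    if op == "and":
        return all(matches(e, resource) for e in expr[1:])
    if op == "or":
        return any(matches(e, resource) for e in expr[1:])
    if op == "not":
        return not matches(expr[1], resource)
    if op == "prop-eq":
        return resource["props"].get(expr[1]) == expr[2]
    if op == "content-contains":
        return expr[1] in resource["content"]
    raise ValueError("unknown operator: %r" % (op,))


resource = {
    "props": {"author": "Davis"},
    "content": "requirements for searching",
}
both = ("and",
        ("prop-eq", "author", "Davis"),
        ("not", ("content-contains", "locking")))
either = ("or",
          ("prop-eq", "author", "Reddy"),
          ("content-contains", "locking"))
both_result = matches(both, resource)
either_result = matches(either, resource)
```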
  

Criteria for properties
  
  
  C5:  It MUST be possible to include undefined properties in a query
  without error.
       Rationale: This arises from the property model of DAV. Unlike
       the more familiar relational model, DAV does not define tables or
       schema for resources, hence there is no guarantee that all
       properties will be defined for all resources. Moreover, DAV
       allows a client to store arbitrary properties on arbitrary
       resources. Therefore DASL must support queries that use
       properties that are not defined on all resources in the scope. If
       such a query failed, there would be no way to locate the desired
       resources.
  
  
  C5.1:  It MUST be possible to test whether a property is defined.
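
       One way to picture C5 and C5.1, assuming resources are modelled
       as Python dictionaries of properties. Treating a comparison
       against an undefined property as a non-match is an assumption of
       this sketch; the requirement itself says only that the query
       must not fail.

```python
def is_defined(props, name):
    """C5.1: test whether a property is defined on a resource."""
    return name in props


def prop_equals(props, name, value):
    """C5: comparing an undefined property is not an error.

    In this sketch an undefined property simply does not match.
    """
    if not is_defined(props, name):
        return False
    return props[name] == value


resources = {
    "/a": {"author": "Davis", "reviewer": "Slein"},
    "/b": {"title": "Notes"},  # no author or reviewer defined
}
by_author = [url for url, props in resources.items()
             if prop_equals(props, "author", "Davis")]
has_reviewer = [url for url, props in resources.items()
                if is_defined(props, "reviewer")]
```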
  
  
  C6.1:  It MUST be possible to compare property values to constant
  values.
  
  
  C6.2.1:  It SHOULD be possible to compare property values to other
  properties of the same resource.
  
  
  C6.2.2:  It SHOULD be possible to compare property values to other
  properties of other resources.
       Note that this may involve a "join". We do not expect the first
       version of the DASL protocol to meet this requirement.
  
  
  C6.3: It SHOULD be possible to compare property values to results of
  expressions.
  
  
  C6.4:  It MUST be possible to match property values against patterns,
  supporting at a minimum string-ending wildcards. More powerful
  patterns, such as those defined by the SQL "like" operator or Unix
  regular expressions, are desirable.
       The minimum is necessary to enable DASL to locate resources by
       content type, e.g. to locate all image files by comparison with
       "image/*". More powerful comparisons are useful when strings
       encode structured data such as times or lists.
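
       The minimum case can be sketched in a few lines of Python. The
       function name below is hypothetical. The second half uses shell
       style patterns from the Python standard library fnmatch module
       as one example of a more powerful matcher; it is neither SQL
       "like" nor a Unix regular expression.

```python
from fnmatch import fnmatchcase


def ends_with_wildcard_match(value, pattern):
    """Match `value` against a pattern whose only wildcard is a final '*'."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


content_types = ["image/gif", "image/jpeg", "text/plain"]
images = [t for t in content_types
          if ends_with_wildcard_match(t, "image/*")]

# A richer pattern language: '?' matches exactly one character.
dated = fnmatchcase("1998-09-17", "1998-??-17")
```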
  
  
  C7.1:  It MUST be possible to use equality operators.


  C7.2:  It MUST be possible to use relative operators.
  
  
  C8:  It MUST be possible to specify case sensitivity.
       Note this does not say that all DASL servers must support both
       case-sensitive and case-insensitive comparisons, but only that
       the protocol must be able to express a client's preference, and
       define behavior in the case where the server cannot support that
       preference.
  
  
  C9:  It MUST be possible to specify language-specific definitions for
  string comparison and sorting.
       Different cultures define different rules for string comparison,
       e.g. for collating sequence and for significance of diacritics.
       Cross-language comparison is out of scope for DASL, but
       comparisons within the same language must be done with the
       appropriate semantics.
  

Requirements: Criteria for content searches
  
  
  C10:  It MUST be possible to search content of any text media type.
  For DASL, "searching content" means locating sequences of characters
  in the contents of the resource.
       DASL defines no specific requirements for searching for structure
       within text media types (e.g. for finding character strings
       within HTML tags). DASL defines no requirements for searching
       other media types that might contain text (e.g. subtypes of
       application). Searching non-text media types (e.g. images, audio)
       is out of scope for DASL.
  
  
  C11.1:  It SHOULD be possible to search for words that are within a
  specified number of words of each other.
       This is often called 'near' search.
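
       A minimal sketch of a 'near' test in Python. The tokenization
       (lower-cased runs of letters and digits) and the measure of
       distance (difference between word positions) are assumptions of
       this example; this document does not define either.

```python
import re


def near(text, word_a, word_b, max_distance):
    """Return True if word_a and word_b occur within max_distance words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


text = ("The server performs the search and returns a result set "
        "to the client")
close = near(text, "server", "search", 3)
far = near(text, "server", "client", 3)
```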
  
  
  C11.2:  It SHOULD be possible to search for words that occur within
  the same grammatical context, e.g. same phrase, sentence, or
  paragraph.
       This is sometimes called 'in' search.
  
  
  C12.1:  It SHOULD be possible for a client to control whether a
  content search uses a stemming comparison.


  C12.2:  It SHOULD be possible for a client to request comparisons
  using phonetic similarity (e.g. soundex).
  
  
  C12.3:  It SHOULD be possible for the client to request keyword
  expansion (thesaurus expansion).
  
  
  C13:  It SHOULD be possible for a client to conduct a relevance
  search. In such a search, the query consists of a set of words
  (perhaps an entire resource), and the result is a list of resources
  whose contents most closely resemble the query, sorted in decreasing
  order of resemblance.
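
       As a rough illustration, resemblance can be approximated by the
       cosine similarity of word counts. This measure is chosen here
       only for brevity and is an assumption of the sketch; this
       document does not prescribe how resemblance is computed, and
       real servers may use quite different techniques.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words (runs of letters and digits)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def relevance_search(query_text, documents):
    """Return URLs of resembling documents, most similar first."""
    q = word_counts(query_text)
    scored = [(cosine(q, word_counts(text)), url)
              for url, text in documents.items()]
    scored = [(score, url) for score, url in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


documents = {
    "/a": "searching and locating resources",
    "/b": "locking resources",
    "/c": "audio codecs",
}
ranked = relevance_search("searching resources", documents)
```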
  

5. Requirements: Results
  
  
  R1:  It MUST be possible to specify a sorting for the result set.
  
  
  R2:  It MUST be possible to specify a set of properties to be returned
  in the result records, distinct from the properties in criteria.
       For example, a query might ask for "the authors of those
       documents under 10K in size". In this case, the criterion relates
       only to the size, but the desired result record contains only the
       author.
  
  
  R3:  It MUST be possible for a client to request limits on the
  resources consumed in creating or transmitting the result set.
       Some queries can potentially return very large result sets.
       Clients that are good citizens will voluntarily limit the size of
       such results. In addition, some servers may charge money for
       queries.
  
  
  R3.1:  It MUST be possible for a client to limit the number of records
  in the result set.
       This is the most meaningful unit of resource consumption to the
       client.
  
  
  R4:  It MUST be possible for the server to return fewer result records
  than match the criteria.
       "Client proposes, server disposes".
  
  
  R5:  It MUST be possible for a client to request paged results.
       Paged retrieval is necessary if result sets are very large and if
       clients must also present a responsive interface to a user. Note
       that this requirement is silent about whether a server implements
       paged results by storing results from a query or recalculating
       them as needed.
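
       A minimal sketch of paging from the client's point of view. The
       page function and its start-index convention are hypothetical.
       For simplicity the sketch slices an already computed list, which
       corresponds to only one of the two server strategies mentioned
       above.

```python
def page(result_set, start, page_size):
    """Return one page of results and the start index of the next page.

    The next index is None when no further results remain.
    """
    chunk = result_set[start:start + page_size]
    next_start = start + page_size
    if next_start >= len(result_set):
        next_start = None
    return chunk, next_start


result_set = ["/r%d" % i for i in range(5)]
pages = []
start = 0
while start is not None:
    chunk, start = page(result_set, start, 2)
    pages.append(chunk)
```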
  

6. Requirements: Other
  
  
  O1:  It MUST be possible to support multiple query grammars.
       Rationale: A particular query grammar may not expose all the
       useful searching functionality of a server. Clients should be
       allowed to query a server using any grammar that takes advantage
       of those special server capabilities. This requirement also
       allows DASL to define an initial limited query grammar which
       meets all the mandatory requirements without needing to address
       all the desirable, but non-mandatory requirements.
  
  
  O2:  It MUST be possible to extend the basic grammar defined by DASL.
  
  
  O3:  It MUST be possible for the server to redirect a query.
       This is useful when a server is not able to search a given scope,
       but can refer the client to another server which is able to
       search the scope.
  
  
  O4:  It SHOULD be possible for the client to request hit highlighting.
  

7. Requirements: Discovery
  
  
  D1:  It MUST be possible for a client to discover the set of query
  grammars supported by a server.
  
  
  D2:  It MUST be possible for a client to discover the schema supported
  by a server for a particular grammar with a particular scope.
       Note that the schema may differ depending on the scope. Query
       schema discovery allows a client to use optional operators
       supported by a server.
  
  
  D3:  It SHOULD be possible for a client to determine information about
  the properties within a scope.
       This information can enable a user interface to help a user to
       construct a valid query, for example by providing meaningful
       names for properties, constraints on values, hints about data
       type, and so on, or information about expected performance, for
       example whether a property is indexed (and hence more quickly
       searched).
  

8. External Requirements
  
  DASL MUST describe how to perform searches on internationalized
  content and properties. This is in keeping with IETF policy.
  
  Information intended for user comprehension MUST conform to the IETF
  Character Set Policy [CHAR].
  
  The WebDAV working group is currently addressing the standardization
  of mechanisms for authors to submit variants and versions of
  resources, and of means of exposing access control. DASL should
  provide mechanisms that can query for variants, versions, and access
  control, but of course cannot do so until they are defined. Likewise,
  DASL may contribute requirements to access control (e.g. control over
  querying).

9. Related Work
  
  
  Z39.50: "Information Retrieval (Z39.50): Application Service
  Definition and Protocol Specification".
  http://lcweb.loc.gov/z3950/agency/
  
  Z39.50 Profile for Simple Distributed Search and Ranked Retrieval
  http://lcweb.loc.gov/z3950/agency/profiles/zdsr.html
  
  The STARTS Protocol
  http://www-db.stanford.edu/~gravano/starts.html
  
  The Harvest Information Discovery and Access System
  http://mordor.transarc.com/afs/transarc.com/public/trg/Harvest/

10. References
  
  [CHAR] H.T. Alvestrand, "IETF Policy on Character Sets and Languages",
  June 1997, internet-draft, work-in-progress,
  draft-alvestrand-charset-policy-02.txt.
  
  [HTTP] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T.
  Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C.
  Irvine, DEC, MIT/LCS, January 1997.
  
  [SCENARIOS] R. Henderson et al., "Scenarios for DAV Searching and
  Locating", not yet an Internet-Draft.
  
  [WEBDAV] Y. Y. Goland, E. J. Whitehead, Jr., A. Faizi, S. R. Carter,
  D. Jensen, "Extensions for Distributed Authoring and Versioning on the
  World Wide Web", April, 1998. internet-draft, work-in-progress,
  draft-ietf-webdav-protocol-08.txt.

11. Authors' Addresses
  
  Jim Davis
  Xerox Corporation
  3333 Coyote Hill Road
  Palo Alto, CA 94304
  Email: jdavis@parc.xerox.com
  
  Saveen Reddy
  Microsoft Corporation
  One Microsoft Way
  Redmond, WA 98052-6399
  Email: saveenr@microsoft.com
  
  Judith Slein
  Xerox Corporation
  800 Phillips Road 105-50C
  Webster, NY 14580
  Email: slein@wrc.xerox.com

