INTERNET-DRAFT                                               Jim Davis
draft-davis-dasl-requirements-00.txt                 Xerox Corporation
Sept 17, 1998                                             Saveen Reddy
Expires Mar 17, 1999                             Microsoft Corporation
                                                          Judith Slein
                                                     Xerox Corporation


             Requirements for DAV Searching and Locating

Status of this Memo

This document is an Internet draft. Internet drafts are working
documents of the Internet Engineering Task Force (IETF), its areas and
its working groups. Note that other groups may also distribute working
information as Internet drafts.

Internet Drafts are draft documents valid for a maximum of six months
and can be updated, replaced or obsoleted by other documents at any
time. It is inappropriate to use Internet drafts as reference material
or to cite them other than as "work in progress".

To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe),
ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim),
ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). Further
information about the IETF can be found at URL: http://www.ietf.org/ .

Distribution of this document is unlimited. Please send comments to
the mailing list at www-webdav-dasl@w3.org, which may be joined by
sending a message with subject "subscribe" to
www-webdav-dasl-request@w3.org. Discussions on the list are archived
at http://www.w3.org/pub/WWW/Archives/Public/www-webdav-dasl .

Abstract

The Distributed Authoring and Versioning protocol [WEBDAV] defines
simple mechanisms to assign and retrieve values for properties. This
document presents requirements for a WebDAV extension to support
efficient searching for resources based on WEBDAV properties and
content. These requirements are intended to be the basis for the DAV
Searching and Locating (DASL) protocol.

Davis et al.                                                  [Page 1]

Internet Draft                     DASL Req

1.
Introduction

Motivation for DASL

WEBDAV and HTTP provide support for client-side search, but not
server-side search. The GET method defined in [HTTP] allows clients to
retrieve a resource's content; the PROPFIND method defined in [WEBDAV]
allows clients to retrieve a resource's properties. Having retrieved a
resource's properties and/or content, the client can compare them to
its search criteria to determine whether the resource is of interest.

Although this client-side searching is logically sufficient, and
requires no modifications to the server, it comes at a significant
cost, because it makes inefficient use of network resources. A client
must retrieve properties and content for each resource under
consideration. Furthermore, it does not take advantage of server
intelligence. Servers capable of searching can use sophisticated
mechanisms to generate results: internal caching of intermediate
search results, content-indexing, etc.

Even simple, common queries may expose these limitations. Consider the
query "find all text files modified during the last week." When such a
query is extended to a large number of clients searching against a
single server, the limitations become more apparent. Client-side
searching has difficulties scaling in these cases.

DASL allows for server-side searching. Server-side searching allows
the client to formulate a query and have the server perform the task
of selecting the resources that fit the criteria. This overcomes both
of the limitations of client-side searching described above. The
benefit is a searching solution that scales; the cost is that the
server software becomes more complex.

This document presents requirements for any possible protocol that
might be proposed for DASL. In this document, the word "MUST" refers
to features or behavior which, if absent from a proposed protocol,
would make it unusable for DASL. The word "SHOULD" refers to features
that are desirable, but not necessary.
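
As an illustration of the network cost described under "Motivation for
DASL", the following is a minimal sketch in Python of the example
query "find all text files modified during the last week". It is not
part of any proposed protocol. The in-memory RESOURCES table, the
function names, and the fixed date NOW are all assumptions made for
the example; the dictionary lookups merely stand in for property
retrieval. The sketch counts how many property sets cross the network
in each approach.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the resources held by a server.
# The property names mirror WebDAV's getcontenttype and getlastmodified.
NOW = datetime(1998, 9, 17)
RESOURCES = {
    "/docs/a.txt": {"getcontenttype": "text/plain",
                    "getlastmodified": NOW - timedelta(days=2)},
    "/docs/b.gif": {"getcontenttype": "image/gif",
                    "getlastmodified": NOW - timedelta(days=1)},
    "/docs/c.txt": {"getcontenttype": "text/plain",
                    "getlastmodified": NOW - timedelta(days=30)},
    "/docs/d.html": {"getcontenttype": "text/html",
                     "getlastmodified": NOW - timedelta(days=3)},
}


def matches(props, now):
    """Criteria: a text media type modified within the last seven days."""
    age = now - props["getlastmodified"]
    return (props["getcontenttype"].startswith("text/")
            and timedelta(days=7) >= age)


def client_side_search(now):
    """The client retrieves every resource's properties, then filters."""
    transferred = 0
    hits = []
    for url in sorted(RESOURCES):
        props = RESOURCES[url]      # stands in for retrieving properties
        transferred += 1            # one property set sent to the client
        if matches(props, now):
            hits.append(url)
    return hits, transferred


def server_side_search(now):
    """The server filters; only matching records are sent to the client."""
    hits = [url for url in sorted(RESOURCES) if matches(RESOURCES[url], now)]
    return hits, len(hits)
```

Both functions select the same resources, but in the client-side case
the number of property sets transferred grows with the size of the
scope, while in the server-side case it grows only with the size of
the result set.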
These requirements come from consideration of the scenarios presented
in [SCENARIOS], from the need to support the WebDAV object model, from
the use of HTTP, and from general IETF rules. We have provided
rationale for those requirements whose justification is not obvious.

2. Terminology

scope
   a set of resources to be searched.

criteria
   an expression against which each resource in the search scope is
   evaluated.

result set
   a set of records, one for each resource for which the search
   criteria evaluated to True.

record
   a description of a resource. A result record is a set of
   properties, and possibly other descriptive information.

result
   a result set, optionally augmented with other information
   describing the search as a whole.

result record definition
   a specification of the set of properties to be returned in the
   result record.

sort specification
   a specification of an ordering on the result records in the result
   set.

search modifier
   an instruction that governs the execution of the query but is not
   part of the search scope, the result record definition, the search
   criteria, or the sort specification. An example of a search
   modifier is one that controls how much time the server can spend on
   the query before giving a response.

query
   a combination of a search scope, search criteria, result record
   definition, sort specification, and a search modifier.

query grammar
   a set of definitions of XML elements, attributes, and constraints
   on their relations and values that defines a set of queries and the
   intended semantics.

schema
   a listing, for any given grammar and scope, of the properties and
   operators that may be used in a query with that grammar and scope.

hit highlighting
   a specification of the location(s) within a resource containing
   text that matched a content query. It allows clients to provide
   visual cues to a user to identify the segments in a text resource
   that cause it to match a content-based query.
paged results
   a mechanism that allows a client to request that the server return
   a subset of the result set rather than the entire set. In
   subsequent calls to the server, additional results from the same
   query can be requested. Paged results are intended to improve the
   performance and manageability of search results.

In addition to the terms defined above, this document uses terminology
consistent with [HTTP] and [WEBDAV].

Requirements are divided into five categories, and numbered within
each category. The categories are Scope, Criteria, Results, Other and
Discovery.

3. Requirements: Scope

S1: It MUST be possible to specify at least one resource in the scope.
It SHOULD be possible to specify a set of distinct, unrelated
resources in the scope.

S2: It MUST be possible to specify a WebDAV collection as a scope.

S3: It SHOULD be possible to specify other types of resources in a
scope.

Rationale: A client might wish to determine whether a given resource
is of interest without transferring it.

S4: When the scope is a collection, it MUST be possible to specify the
depth.

Users often intend either to restrict a search to the immediate
children of a container or to extend it recursively to all of the
container's descendants. Furthermore, depth control is needed to
prevent servers from performing unnecessary work.

4. Requirements: Criteria

Criteria generalities

C1: It MUST be possible to search properties in a query. It MUST be
possible to search both DAV-defined and application-defined properties
in a query.

Further requirements for properties are below.

C2: It MUST be possible to search content in a query.

Note that at this writing, unlike property searches, there is no
single widely accepted semantics for content-based queries. Further
requirements for content criteria are below.
C3: It MUST be possible to search both properties and content in a
single query.

C4: It MUST be possible to combine criteria with Boolean operators
(i.e. and, or, not).

Criteria for properties

C5: It MUST be possible to include undefined properties in a query
without error.

Rationale: This arises from the property model of DAV. Unlike the more
familiar relational model, DAV does not define tables or schema for
resources, hence there is no guarantee that all properties will be
defined for all resources. Moreover, DAV allows a client to store
arbitrary properties on arbitrary resources. Therefore DASL must
support queries that use properties that are not defined on all
resources in the scope. If such a query failed, there would be no way
to locate the desired resources.

C5.1: It MUST be possible to test whether a property is defined.

C6.1: It MUST be possible to compare property values to constant
values.

C6.2.1: It SHOULD be possible to compare property values to other
properties of the same resource.

C6.2.2: It SHOULD be possible to compare property values to other
properties of other resources.

Note that this may involve a "join". We do not expect the first
version of the DASL protocol to meet this requirement.

C6.3: It SHOULD be possible to compare property values to results of
expressions.

C6.4: It MUST be possible to match property values with pattern
matching that supports, at a minimum, string-ending wildcards. More
powerful patterns, such as those defined by the SQL "like" operator or
Unix regular expressions, are desirable.

The minimum is necessary to enable DASL to locate resources by content
type, e.g. to locate all image files by comparison with "image/*".
More powerful comparisons are useful when strings encode structured
data such as times or lists.

C7.1: It MUST be possible to use equality operators.

C7.2: It MUST be possible to use relative operators.
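
The property criteria above (C4 through C7.2) can be pictured with a
minimal sketch in Python. It is an illustration only, not a proposed
grammar: every function name, the operator keywords, and the sample
RESOURCES table are hypothetical, and shell-style wildcards from the
Python standard library stand in for whatever pattern syntax a DASL
grammar might define. The sketch also assumes simple two-valued logic,
in which a comparison on an undefined property is a non-match.

```python
import operator
from fnmatch import fnmatchcase

# Hypothetical operator keywords for equality (C7.1) and relative (C7.2)
# comparisons.
_OPERATORS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
              "le": operator.le, "gt": operator.gt, "ge": operator.ge}


def is_defined(name):
    """C5.1: true when the property is defined on the resource."""
    return lambda props: name in props


def compare(name, op, constant):
    """C6.1: compare a property value to a constant value."""
    test = _OPERATORS[op]

    def criterion(props):
        # C5: an undefined property is a non-match, never an error.
        return name in props and test(props[name], constant)
    return criterion


def like(name, pattern):
    """C6.4: wildcard match on a property value, e.g. "image/*"."""
    def criterion(props):
        return name in props and fnmatchcase(str(props[name]), pattern)
    return criterion


def and_(*criteria):
    """C4: all of the given criteria must hold."""
    return lambda props: all(c(props) for c in criteria)


def or_(*criteria):
    """C4: at least one of the given criteria must hold."""
    return lambda props: any(c(props) for c in criteria)


def not_(criterion):
    """C4: the given criterion must not hold."""
    return lambda props: not criterion(props)


def search(resources, criterion):
    """Return the sorted URLs of the resources that satisfy the criterion."""
    return sorted(url for url, props in resources.items() if criterion(props))


# Sample data: note that "/d" defines neither content type nor length.
RESOURCES = {
    "/a.gif": {"getcontenttype": "image/gif", "getcontentlength": 5000},
    "/b.png": {"getcontenttype": "image/png", "getcontentlength": 20000,
               "author": "Davis"},
    "/c.txt": {"getcontenttype": "text/plain", "getcontentlength": 300},
    "/d": {"author": "Reddy"},
}

# "All images under 10K in size."
small_images = and_(like("getcontenttype", "image/*"),
                    compare("getcontentlength", "lt", 10240))
```

Calling search(RESOURCES, small_images) evaluates the query against
every resource, including "/d", without raising an error. One
consequence of the two-valued assumption is that not_() applied to a
comparison matches resources that lack the property altogether; a real
grammar would need to state explicitly which behavior it intends.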
C8: It MUST be possible to specify case sensitivity.

Note this does not say that all DASL servers must support both
case-sensitive and case-insensitive comparisons, but only that the
protocol must be able to express a client's preference, and define
behavior in the case where the server cannot support that preference.

C9: It MUST be possible to specify language-specific definitions for
string comparison and sorting.

Different cultures define different rules for string comparison, e.g.
for collating sequence and for significance of diacritics.
Cross-language comparison is out of scope for DASL, but comparisons
within the same language must be done with the appropriate semantics.

Criteria for content searches

C10: It MUST be possible to search content of any text media type.

The definition of "searching content" for DASL means locating
sequences of characters in the contents of the resource. DASL defines
no specific requirements for searching for structure within text media
types (e.g. for finding character strings within HTML tags). DASL
defines no requirements for searching other media types that might
contain text (e.g. subtypes of application). Searching non-text media
types (e.g. images, audio) is out of scope for DASL.

C11.1: It SHOULD be possible to search for words that are within a
specified number of words of each other.

This is often called 'near' search.

C11.2: It SHOULD be possible to search for words that occur within the
same grammatical context, e.g. same phrase, sentence, or paragraph.

This is sometimes called 'in' search.

C12.1: It SHOULD be possible for a client to control whether a content
search uses a stemming comparison.

C12.2: It SHOULD be possible for a client to request comparisons using
phonetic similarity (e.g. Soundex).

C12.3: It SHOULD be possible for the client to request keyword
expansion (thesaurus expansion).
C13: It SHOULD be possible for a client to conduct a relevance search.

In such a search, the query consists of a set of words (perhaps an
entire resource), and the result is a list of resources whose contents
most closely resemble the query, sorted in decreasing order of
resemblance.

5. Requirements: Results

R1: It MUST be possible to specify a sorting for the result set.

R2: It MUST be possible to specify a set of properties to be returned
in the result records, distinct from the properties in criteria.

For example, a query might ask for "the authors of those documents
under 10K in size". In this case, the criterion relates only to the
size, but the desired result record contains only the author.

R3: It MUST be possible for a client to request limits on the
resources consumed in creating or transmitting the result set.

Some queries can potentially return very large result sets. Clients
that are good citizens will voluntarily limit the size of such
results. In addition, some servers may charge money for queries.

R3.1: It MUST be possible for a client to limit the number of records
in the result set.

This is the most meaningful unit of resource consumption to the
client.

R4: It MUST be possible for the server to return fewer result records
than match the criteria.

"Client proposes, server disposes".

R5: It MUST be possible for a client to request paged results.

Paged retrieval is necessary if result sets are very large and if
clients must also present a responsive interface to a user. Note that
this requirement is silent about whether a server implements paged
results by storing results from a query or recalculating them as
needed.

6. Requirements: Other

O1: It MUST be possible to support multiple query grammars.

Rationale: A particular query grammar may not expose all the useful
searching functionality of a server.
Clients should be allowed to query a server using any grammar that
takes advantage of those special server capabilities. This requirement
also allows DASL to define an initial limited query grammar which
meets all the mandatory requirements without needing to address all
the desirable, but non-mandatory, requirements.

O2: It MUST be possible to extend the basic grammar defined by DASL.

O3: It MUST be possible for the server to redirect a query.

This is useful when a server is not able to search a given scope, but
can refer the client to another server which is able to search the
scope.

O4: It SHOULD be possible for the client to request hit highlighting.

7. Requirements: Discovery

D1: It MUST be possible for a client to discover the set of query
grammars supported by a server.

D2: It MUST be possible for a client to discover the schema supported
by a server for a particular grammar with a particular scope.

Note that the schema may differ depending on the scope. Query schema
discovery allows a client to use optional operators supported by a
server.

D3: It SHOULD be possible for a client to determine information about
the properties within a scope.

This information can enable a user interface to help a user to
construct a valid query, for example by providing meaningful names for
properties, constraints on values, hints about data type, and so on,
or information about expected performance, for example whether a
property is indexed (and hence more quickly searched).

8. External Requirements

DASL MUST describe how to perform searches on internationalized
content and properties. This is in keeping with IETF policy.

Information intended for user comprehension MUST conform to the IETF
Character Set Policy [CHAR].

The WebDAV working group is currently addressing the standardization
of mechanisms for authors to submit variants and versions of
resources, and of means of exposing access control.
DASL should provide mechanisms that can query for variants, versions,
and access control, but of course cannot do so until they are defined.
Likewise, DASL may contribute requirements to access control (e.g.
control over querying).

9. Related Work

Z39.50: "Information Retrieval (Z39.50): Application Service
Definition and Protocol Specification".
http://lcweb.loc.gov/z3950/agency/

Z39.50 Profile for Simple Distributed Search and Ranked Retrieval
http://lcweb.loc.gov/z3950/agency/profiles/zdsr.html

The STARTS Protocol
http://www-db.stanford.edu/~gravano/starts.html

The Harvest Information Discovery and Access System
http://mordor.transarc.com/afs/transarc.com/public/trg/Harvest/

10. References

[CHAR] H.T. Alvestrand, "IETF Policy on Character Sets and Languages",
June 1997, internet-draft, work-in-progress,
draft-alvestrand-charset-policy-02.txt.

[HTTP] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T.
Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C.
Irvine, DEC, MIT/LCS, January 1997.

[SCENARIOS] Henderson, R. et al., "Scenarios for DAV Searching and
Locating", not yet even an ID.

[WEBDAV] Y. Y. Goland, E. J. Whitehead, Jr., A. Faizi, S. R. Carter,
D. Jensen, "Extensions for Distributed Authoring and Versioning on the
World Wide Web", April 1998, internet-draft, work-in-progress,
draft-ietf-webdav-protocol-08.txt.

11. Authors' Addresses

Jim Davis
Xerox Corporation
3333 Coyote Hill Road
Palo Alto, CA 94304
Email: jdavis@parc.xerox.com

Saveen Reddy
Microsoft Corporation
One Microsoft Way
Redmond WA, 9085-6933
Email: saveenr@microsoft.com

Judith Slein
Xerox Corporation
800 Phillips Road 105-50C
Webster, NY 14580
Email: slein@wrc.xerox.com