INTERNET-DRAFT                                                 N. Ballou
Expires: May 15, 1997                                          Microsoft
                                                       November 15, 1996


                  NNTP Full-text Search Enhancements


1. Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

2. Abstract

This document describes a set of enhancements to the Network News
Transfer Protocol [NNTP-977] that allow full-text searching of news
articles in multiple newsgroups. The proposed SEARCH command supports
functionality similar to the [IMAP4] SEARCH command, minus
user-specific search keys (i.e., ANSWERED, DRAFT, FLAGGED, KEYWORD,
NEW, OLD, RECENT, SEEN) and minus search keys based on headers that do
not exist in news (i.e., CC, BCC, TO).

The availability of the extensions described here will be advertised by
the server using the extension negotiation mechanism described in the
new NNTP protocol specification currently being developed [NNTP-NEW].

3. Introduction

The NNTP SEARCH command is sent from the client to the server to
specify and initiate a full-text search on articles in one or more
newsgroups. The NNTP SEARCH command is a subset of the [IMAP4] SEARCH
command; the user-property and mail-specific header search keys are not
present in NNTP SEARCH.
The result of an NNTP SEARCH is OVER data, as specified in [NNTP-NEW],
for each article that satisfies the search criteria.

4. SEARCH Command Description

Arguments:  optional character set specification
            searching criteria (one or more)

Responses:  224 overview information follows
            412 no newsgroup selected
            421 no results found
            462 error performing search
            501 command syntax error
            502 no permission

The SEARCH command searches the newsgroup for articles that match the
given searching criteria. Searching criteria consist of one or more
search keys. If there are articles that match the search criteria, the
server responds with code 224 and returns OVER data for each matching
article in the same format as described in [NNTP-NEW]. The article ID
is only meaningful when searching either the current newsgroup or a
single newsgroup.

A response of 421 indicates that there are no articles that match the
search criteria. A response of 501 indicates a syntax error in the
search criteria. A response of 502 indicates that the user does not
have permission to search one or more of the specified newsgroups. If
the search criteria did not specify a newsgroup, and there is no
current newsgroup (i.e., set using the NNTP GROUP command), then the
server returns the error code 412, indicating that no newsgroup has
been specified. A response of 462 indicates that the server encountered
an error when processing the search.

When multiple keys are specified, the result is the intersection (AND
function) of all the messages that match those keys. For example, the
criteria FROM "SMITH" SINCE 1-Feb-1994 refer to all articles from Smith
that were placed in the newsgroup since February 1, 1994. A search key
may also be a parenthesized list of one or more search keys (e.g., for
use with the OR and NOT keys).

Server implementations MAY exclude [MIME-1] body parts with terminal
content types other than TEXT and MESSAGE from consideration in SEARCH
matching.
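
As a non-normative illustration of the AND, OR, and NOT semantics
described above, the following Python sketch evaluates search keys
against articles held in memory. The Article class, its field names,
the tuple encoding of search keys (including the "LIST" tag used for a
parenthesized list), and the contains(), matches(), and search()
helpers are assumptions made for this example and are not part of the
protocol. String keys are matched here as case-insensitive substrings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical in-memory article; the field names are assumptions.
    sender: str
    subject: str
    body: str
    arrived: date  # date the article was placed in the newsgroup


def contains(field, needle):
    # Case-insensitive substring match for string-valued keys.
    return needle.lower() in field.lower()


def matches(article, key):
    # A key is a tuple: (name, argument...). A parenthesized list of
    # keys is encoded here as ("LIST", key, key, ...) and is ANDed.
    name = key[0]
    if name == "ALL":
        return True
    if name == "FROM":
        return contains(article.sender, key[1])
    if name == "SUBJECT":
        return contains(article.subject, key[1])
    if name == "BODY":
        return contains(article.body, key[1])
    if name == "SINCE":
        return article.arrived >= key[1]
    if name == "BEFORE":
        return article.arrived < key[1]
    if name == "NOT":
        return not matches(article, key[1])
    if name == "OR":
        return matches(article, key[1]) or matches(article, key[2])
    if name == "LIST":
        return all(matches(article, k) for k in key[1:])
    raise ValueError("unsupported search key: %r" % (name,))


def search(articles, keys):
    # Multiple top-level keys are intersected (AND function).
    return [a for a in articles if all(matches(a, k) for k in keys)]


articles = [
    Article("John Smith", "Hello", "first post", date(1996, 11, 3)),
    Article("Jane Doe", "Re: Hello", "reply", date(1996, 11, 4)),
    Article("J. Smith", "Old news", "archive", date(1993, 12, 31)),
]

# Equivalent of: SEARCH FROM "SMITH" SINCE 1-Feb-1994
hits = search(articles, [("FROM", "SMITH"), ("SINCE", date(1994, 2, 1))])
print([a.subject for a in hits])
```

Only the first sample article satisfies both keys. Replacing the key
list with a single ("OR", key1, key2) tuple selects the union of the
two keys instead of their intersection.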

The optional character set specification consists of the word "CHARSET"
followed by a registered MIME character set. It indicates the character
set of the strings that appear in the search criteria. [MIME-2] strings
that appear in RFC 822/MIME message headers, and [MIME-1] content
transfer encodings, MUST be decoded before matching. Except for
US-ASCII, it is not required that any particular character set be
supported. If the server does not support the specified character set,
it MUST return a tagged NO response (not a BAD).

In all search keys that use strings, a message matches the key if the
string is a substring of the field. The matching is case-insensitive.

The defined search keys are as follows. Refer to the Formal Syntax
section for the precise syntactic definitions of the arguments.

               Messages with article identifiers corresponding to the
               specified message sequence number set. This is only
               relevant for searches on a single newsgroup.

ALL            All messages in the newsgroup; the default initial key
               for ANDing.

BEFORE         Messages whose internal date is earlier than the
               specified date.

BODY           Messages that contain the specified string in the body
               of the message.

FROM           Messages that contain the specified string in the
               envelope structure's FROM field.

HEADER         Messages that have a header with the specified
               field-name (as defined in [RFC-822]) and that contain
               the specified string in the [RFC-822] field-body.

LARGER         Messages with an RFC822.SIZE larger than the specified
               number of octets.

NEWSGROUP      Messages in the specified newsgroup. The string can be
               a fully-qualified newsgroup name, a partial newsgroup
               name that ends with the substring ".*" (i.e., search
               the newsgroup hierarchy), or the string "*" (i.e.,
               search all newsgroups).

NOT            Messages that do not match the specified search key.

ON             Messages whose internal date is within the specified
               date.

OR             Messages that match either search key.

SENTBEFORE     Messages whose [RFC-822] Date: header is earlier than
               the specified date.

SENTON         Messages whose [RFC-822] Date: header is within the
               specified date.

SENTSINCE      Messages whose [RFC-822] Date: header is within or
               later than the specified date.

SINCE          Messages whose internal date is within or later than
               the specified date.

SMALLER        Messages with an RFC822.SIZE smaller than the specified
               number of octets.

SUBJECT        Messages that contain the specified string in the
               envelope structure's SUBJECT field.

TEXT           Messages that contain the specified string in the
               header or body of the message.

UID            Messages with message identifiers corresponding to the
               specified message identifier set. Can only be used when
               searching a single newsgroup.

Example:

   C: SEARCH FROM "Smith" SINCE 1-Feb-1994
   S: 573 Hello "John Smith" Sun, 03 Nov 1996 \
      14:25:05 -0800 <01cbc9d5f3c70$eab9a2cd@xyz.com> 4080 33
   S: .

5. Formal Syntax

The search query syntax is derived from the search syntax defined for
the IMAP4 protocol. It differs somewhat because of the way
international character sets need to be encoded.

This document defines one exception to the 7bit character set
restriction for commands in [NNTP-977]: the 8bit ISO-8859-1 character
set is allowed in unencoded form in search strings. This is allowed
because it simplifies handling of this widely used character set
without requiring support for arbitrary binary data.

The following syntax specification uses the augmented Backus-Naur Form
(BNF) notation as specified in [RFC-822], with one exception: the
delimiter used with the "#" construct is a single space (SPACE) and not
one or more commas.

Except as noted otherwise, all alphabetic characters are
case-insensitive. The use of upper or lower case characters to define
token strings is for editorial clarity only. Implementations MUST
accept these strings in a case-insensitive fashion.
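
As a non-normative illustration of the case-insensitivity rule, the
following Python sketch parses a date argument of the form used in the
example above (1-Feb-1994), accepting the month abbreviation in any
letter case and the date either bare or enclosed in double quotes. The
function name parse_search_date and the choice to raise ValueError on
malformed input are assumptions made for this example; they are not
defined by this document.

```python
import re
from datetime import date

# Month abbreviations, compared case-insensitively.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# One or two day digits, a three-letter month, a four-digit year.
_DATE_TEXT = re.compile(r"([0-9]{1,2})-([A-Za-z]{3})-([0-9]{4})")


def parse_search_date(token):
    # Accept the date either bare or enclosed in double quotes.
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        token = token[1:-1]
    m = _DATE_TEXT.fullmatch(token)
    if m is None:
        raise ValueError("malformed date: %r" % (token,))
    day, month_name, year = m.groups()
    if month_name.lower() not in MONTHS:
        raise ValueError("unknown month: %r" % (month_name,))
    month = MONTHS.index(month_name.lower()) + 1
    # date() itself raises ValueError for impossible days (31-Feb).
    return date(int(year), month, int(day))


print(parse_search_date("1-Feb-1994"))
print(parse_search_date('"01-FEB-1994"'))
```

Both calls yield the same calendar date, since neither the letter case
of the month nor the optional quoting changes the value.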

astring         ::= atom / string

atom            ::= 1*ATOM_CHAR

ATOM_CHAR       ::=

atom_specials   ::= "(" / ")" / "{" / SPACE / CTL / list_wildcards /
                    quoted_specials

CHAR            ::=

CHAR8           ::=

CRLF            ::= CR LF

CTL             ::=

date            ::= date_text / <"> date_text <">

date_day        ::= 1*2digit
                    ;; Day of month

date_day_fixed  ::= (SPACE digit) / 2digit
                    ;; Fixed-format version of date_day

date_month      ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
                    "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

date_text       ::= date_day "-" date_month "-" date_year

date_year       ::= 4digit

date_time       ::= <"> date_day_fixed "-" date_month "-" date_year
                    SPACE time SPACE zone <">

digit           ::= "0" / digit_nz

digit_nz        ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
                    "9"

header_fld_name ::= sstring

list_wildcards  ::= "%" / "*"

literal         ::= "{" number "}" CRLF *CHAR8
                    ;; Number represents the number of CHAR8 octets

mstring         ::= A MIME-2 encoded string.

number          ::= 1*digit
                    ;; Unsigned 32-bit integer
                    ;; (0 <= n < 4,294,967,296)

nz_number       ::= digit_nz *digit
                    ;; Non-zero unsigned 32-bit integer
                    ;; (0 < n < 4,294,967,296)

quoted          ::= <"> *QUOTED_CHAR <">

QUOTED_CHAR     ::= / "\" quoted_specials

quoted_specials ::= <"> / "\"

search          ::= "SEARCH" SPACE ["CHARSET" SPACE astring SPACE]
                    1#search_key
                    ;; [CHARSET] MUST be registered with IANA

search_key      ::= "ALL" /
                    "BEFORE" SPACE date /
                    "BODY" SPACE sstring /
                    "FROM" SPACE sstring /
                    "ON" SPACE date /
                    "SINCE" SPACE date /
                    "SUBJECT" SPACE sstring /
                    "TEXT" SPACE sstring /
                    "TO" SPACE sstring /
                    "HEADER" SPACE header_fld_name SPACE sstring /
                    "LARGER" SPACE number /
                    "NOT" SPACE search_key /
                    "OR" SPACE search_key SPACE search_key /
                    "SENTBEFORE" SPACE date /
                    "SENTON" SPACE date /
                    "SENTSINCE" SPACE date /
                    "SMALLER" SPACE number /
                    "UID" SPACE set /
                    set /
                    "(" 1#search_key ")"

sequence_num    ::= nz_number / "*"
                    ;; * is the largest number in use. For message
                    ;; sequence numbers, it is the number of messages
                    ;; in the mailbox.
                    ;; For unique identifiers, it is the unique
                    ;; identifier of the last message in the mailbox.

set             ::= sequence_num / (sequence_num ":" sequence_num) /
                    (set "," set)
                    ;; Identifies a set of messages. For message
                    ;; sequence numbers, these are consecutive
                    ;; numbers from 1 to the number of messages in
                    ;; the mailbox.
                    ;; Comma delimits individual numbers, colon
                    ;; delimits between two numbers inclusive.
                    ;; Example: 2,4:7,9,12:* is 2,4,5,6,7,9,12,13,
                    ;; 14,15 for a mailbox with 15 messages.

SPACE           ::=

sstring         ::= <"> astring <"> / <"> mstring <">

string          ::= quoted / literal

TEXT_CHAR       ::=

time            ::= 2digit ":" 2digit ":" 2digit
                    ;; Hours minutes seconds

7. Bibliography

[NNTP-977]  Kantor, B., and P. Lapsley, Network News Transfer
            Protocol, RFC 977, February 1986.

[NNTP-NEW]  Barber, S., Network News Transfer Protocol,
            Internet-Draft, September 1996.

[IMAP4]     Crispin, M., Internet Message Access Protocol - Version 4,
            RFC 1730, December 1994.

[MIME-1]    Borenstein, N., and N. Freed, MIME (Multipurpose Internet
            Mail Extensions) Part One: Mechanisms for Specifying and
            Describing the Format of Internet Message Bodies,
            RFC 1521, Bellcore, Innosoft, September 1993.

[MIME-2]    Moore, K., MIME (Multipurpose Internet Mail Extensions)
            Part Two: Message Header Extensions for Non-ASCII Text,
            RFC 1522, University of Tennessee, September 1993.

8. Author's Address

Nat Ballou
Microsoft
One Microsoft Way
Redmond, WA 98052
USA

Phone: +1 206-703-0574
Email: natba@microsoft.com

This Internet Draft expires April xx, 1997.