Network Working Group Chris Weider INTERNET-DRAFT Bunyip Information Systems February 1996 The Common Indexing Protocol Status of this memo: The author describes a protocol designed to provide common indexing for the currently deployed Internet Directory Services. This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft. This Internet Draft expires August 20, 1996. Part 0: Differences between this and the WHOIS++ Indexing protocol This protocol is designed to be more flexible than version 1.0 of the WHOIS++ Indexing protocol. The version number on this protocol is 2.0, as when this stabilizes it will become version 2.0 of the WHOIS++ Index protocol as well. It adds the ability to specify tokenization method for the strings generated for the centroids, weighting capabilities, and the ability to specify which original host a specific term was derived from. These capabilities should provide a much more robust indexing protocol. Part 1: Introduction This protocol is designed to allow general indexing from most of the attribute- value based directory services. It is an extension of the current WHOIS++ indexing protocol. Also, this protocol is designed to be implemented in such a way as to allow many different types of protocols (X.500, WHOIS++, SOLO, LDAP, etc) to use a basic server which speaks all these protocols. Part 2: Using the Indexing Architecture The basic software architecture consists of APIs to back-end databases, an indexing 'object' which speaks the indexing protocol, and a variety of interfaces for various query protocols. Backend Databases Indexing Object Protocol Front End Client or query protocols _________ __________ __________ | | | | | |<-CIP-| WHOIS++ |<-WHOIS++- | SQL or |<---API---->| Indexer | |_________| | Z39.50 or| |__________| | GDBM or | ^ __________ |__________| |----CIP-| | | SOLO |<-SOLO- |_________| Each Indexer participates in a forward knowledge mesh as shown below. Part 3: Protocol Functionality and components of the Index Service 3.1 Base data servers Most directory services today specify only the query language, the information model, and the server responses for their servers. Most also use a basic 'template-based' information model, in which each entry consists of a set of attribute-value pairs. Thus the basic service can be provided by a wide variety of databases and directory services. However, to participate in the Index Service, that underlying database must also be able to generate a 'centroid', or some other type of forward knowledge, for the data it serves. Connections out from the indexing service to the base data servers will be accomplished using URLs for the various end protocols. This will avoid the need to rewrite the data from its native formats. 3.2. Centroids as forward knowledge The centroid of a server is comprised of a list of the templates and attributes used by that server, and a word list for each attribute. The word list for a given attribute contains one occurrence of every word which appears at least once in that attribute in some record in that server's data, and nothing else. For example, if a server contains exactly three records, as follows: Record 1 Record 2 Template: User Template: User First Name: John First Name: Joe Last Name: Smith Last Name: Smith Favourite Drink: Labatt Beer Favourite Drink: Molson Beer Record 3 Template: Domain Domain Name: foo.edu Contact Name: Mike Foobar the centroid for this server would be Template: User First Name: Joe -John Last Name: Smith Favourite Drink: Beer -Labatt -Molson Template: Domain Domain Name: foo.edu Contact Name: Mike -Foobar It is this information which is handed up the tree to provide forward knowledge. As we mention above, this may not turn out to be the ideal solution for forward knowledge, and we suspect that there may be a number of different sets of forward knowledge used in the Index Service. However, the indexing architecture is in a very real sense independent of what types of forward knowledge are handed around, and it is entirely possible to build a unified directory which uses many types of forward knowledge. 3.3 Other types of forward information There are several other types of forward information that might be useful in an indexing service. The first is untokenized values for the given attributes, as opposed to the tokenized values given in the centroid. A second type is forward information generated by a typical query; this can be used for replication of databases or of specific records in a database. A third type is forward information which specifies from which server a given value was obtained. All of these are given in the protocol. 3.4. Index servers and Index server Architecture A index server collects and collates the centroids (or other forward knowledge) of either a number of base servers or of a number of other index servers. An index server must be able to generate a centroid for the information it contains. In addition, an index server can index any other server it wishes, which allows one base level server (or index server) to participate in many hierarchies in the directory mesh. 3.4.1. Queries to index servers An index server will take a query in standard format, search its collections of centroids and other forward information, determine which servers hold records which may fill that query, and then notifies the user's client of the next servers to contact to submit the query (referral in the X.500 model). An index server can also contain primary data of its own; and thus act a both an index server and a base level server. In this case, the index server's response to a query may be a mix of records and referral pointers. Protocol-specific use of an index server (for example, issuing a WHOIS++ query to the index service) will require a protocol-specific front end to the index server, which is responsible for translating the query and formatting the output as required. 3.4.2. Index server distribution model and forward knowledge propogation The diagram on the next page illustrates how a mesh of index servers might be created for a set of base servers. Although it looks like a hierarchy, the protocols allow (for example) server A to be indexed by both server D and by server H. base level index index servers servers servers for for base level lower-level servers index servers _______ | | | A |__ |_______| \ _______ \----------| | _______ | D |__ ______ | | /----------|_______| \ | | | B |__/ \----------| | |_______| | F | /----------|______| / _______ _______ / | | | |- | C |--------------| E | |_______| |_______|- \ \ _______ \ ______ | | \----------| | | G |--------------------------------------| H | |_______| |______| Figure 1: Sample layout of the Index Service mesh _______________________________________________________________________________ In the portion of the index tree shown above, base servers A and B hand their centroids up to index server D, base server C hands its centroid up to index server E, and index servers D and E hand their centroids up to index server F. Servers E and G also hand their centroids up to H. The number of levels of index servers, and the number of index servers at each level, will depend on the number of base servers deployed, and the response time of individual layers of the server tree. These numbers will have to be determined in the field. 3.4.3. Forward knowledge propogation and changes to forward knowledge Forward knowledge propogation is initiated by an authenticated POLL command (sec. 4.2). The format of the POLL command allows the poller to request the forward knowledge of any or all templates and attributes held by the polled server. After the polled server has authenticated the poller, it determines which of the requested forward knowledge the poller is allowed to request, and then issues a CENTROID-CHANGES report (sec. 4.3) to transmit the data. When the poller receives the CENTROID-CHANGES report, it can authenticate the pollee to determine whether to add the new changes to its data. Additionally, if a given pollee knows what pollers hold forward knowledge from the pollee, it can signal to those pollers the fact that its information has changed by issuing a DATA-CHANGED command. The poller can then determine if and when to issue a new POLL request to get the updated information. The DATA-CHANGED command is included in this protocol to allow 'interactive' updating of critical information. 3.4.4. Forward knowledge propogation and mesh traversal When an index server issues a POLL request, it may indicate to the polled server what relationship it has to the polled. This information can be used to help traverse the directory mesh. Two fields are specified in the current proposal to transmit the relationship information, although it is expected that richer relationship information will be shared in future revisions of this protocol. One field used for this information is the Hierarchy field, and can take on three values. The first is 'topology', which indicates that the indexing server is at a higher level in the network topology (e.g. indexes the whole regional ISP). The second is 'geographical', which indicates that the polling server covers a geographical area subsuming the pollee. The third is 'administrative', which indicates that the indexing server covers an administrative domain subsuming the pollee. The second field used for this information is the Description field, which contains the DESCRIBE record of the polling server. This allows users to obtain richer metainformation for the directory mesh, enabling them to expand queries more effectively. 3.4.5 Loop control Since there are no a priori restrictions on which servers may poll which other servers, and since a given server may participate in many sub-meshes, mechanisms must be installed to allow the detection of cycles in the polling relationships. This is accomplished in the current protocol by including a hop-count on polling relationships. Each time a polled server generates forward information, it informs the polling server about its current hopcount, which is the maximum of the hopcounts of all the servers it polls, plus 1. A base level server (one which polls no other servers) will have a hopcount of 0. When a server decides to poll a new server, if its hopcount goes up, then it must information all the other servers which poll it about its new hopcount. A maximum hopcount (8 in the current version) will help the servers detect polling loops. A second approach to loop detection is to do all the work in the client; which would determine which new referrals have already appeared in the referral list, and then simply iterate the referral process until there are no new servers to ask. An algorithm to accomplish this in WHOIS++ is detailed in [Faltstrom 95]. 3.4.6. Query handling and passing algorithms When an index server receives a query, it searches its collection of forward knowledge and determines which servers hold records which may fill that query. As this service becomes widely deployed, it is expected that some index servers may specialize in indexing certain template types or perhaps even certain fields within those templates. If an index server obtains a match with the query _for those template fields and attributes the server indexes_, it is to be considered a match for the purpose of forwarding the query. 3.4.7. Query referral Query referral is the process of informing a client which servers to contact next to resolve a query. The syntax for notifying a client is outlined in section 4.5. A query can specify the 'trace' option, which causes each server which receives the query to send its server handle and an identification string to the client. 3.5 Security considerations In the opinion of this author, until a generally accepted Internet wide security service is available (or until a web of such services reaches into most of the Internet) all security applied to queries should be done by the base server. Propogating this information through the common index mesh will run immediately into the problems of common authentication, access control, and incommensurable security features. Thus any index information propogated to this service should be considered unsecured. Server to Server authentication is provided, however. 4: Protocol Specification for the Index Service: The syntax for each protocol component is listed below. In addition, each section contains a listing of which of these attributes is required and optional for each of the components. All timestamps must be in the format YYYYMMDDHHMM+ZZZZ or YYYYMMDDHHMM-ZZZZ, using a 24 hour clock, where ZZZZ indicates the difference between the local clock and GMT. 4.1. Data changed syntax The data changed template look like this: # DATA-CHANGED Version-number: // version number of index service software, used to insure // compatibility. Current value is 2.0 Time-of-latest-centroid-change: // time stamp of latest forward information // change,GMT Time-of-message-generation: // time when this message was generated, GMT Server-handle: // IANA unique identifier for this server Host-Name: // Host name of this server (current name) Host-Port: // Port number of this server (current port) Best-time-to-poll: // For heavily used servers, this will identify when // the server is likely to be lightly loaded // so that response to the poll will be speedy, GMT Authentication-type: // Type of authentication used by server, or NONE Authentication-data: // data for authentication # END // This line must be used to terminate the data changed message Required/optional table Version-Number REQUIRED Time-of-latest-centroid-change REQUIRED Time-of-message-generation REQUIRED Server-handle REQUIRED Host-Name REQUIRED Host-Port REQUIRED Best-time-to-poll OPTIONAL Authentication-type OPTIONAL Authentication-data OPTIONAL 4.2. Polling syntax # POLL: Version-number: // version number of poller's index software, used to // insure compatibility Type-of-poll: // type of forward data requested. CENTROID and QUERY // are the only one currently defined Poll-scope: // Selects bounds within which data will be returned. See note. Start-time: // give me all the centroid changes starting at this time, GMT End-time: // ending at this time, GMT Template: // a standard whois++ template name, or the keyword ALL, for a // full update. Field: // used to limit centroid update information to specific fields, // is either a specific field name, a list of field names, // or the keyword ALL Server-handle: // IANA unique identifier for the polling server. // this handle may optionally be cached by the polled // server to announce future changes Host-Name: // Host name of the polling server. Host-Port: // Port number of the polling server. Hierarchy: // This field indicates the relationship which the poller // bears to the pollee. Typical values might include // 'Topology', 'Geographical", or "Administrative" Description: // This field contains the DESCRIBE record of the // polling server Tokenization-type: // The tokenization algorithm used // Can be one of: "tokens", "records" or "attributes". // Default is "tokens", which means space-delimited values. Options: // Can be used to request the WEIGHT, HANDLE, and/or HOST information // for the returned values Content-Transfer-Encoding: // What encoding type, standard values... Authentication-type: // Type of authentication used by poller, or NONE Authentication-data: // Data for authentication # END // This line must by used to terminate the poll message For poll type CENTROID, the allowable values for Poll Scope are FULL and RELATIVE. Support of the FULL value is required, this provides a complete listing of the centroid or other forward information. RELATIVE indicates that these are the relative changes in the centroid since the last report to the polling server. The allowable values for OPTION are WEIGHT, HANDLE, and HOST. Support for the HANDLE and HOST values are required. HANDLE indicates that each attribute value must be listed with the server handle of the server from which this value was obtained by the polled server; HOST indicates that each attribute value must be listed with the host name and port number of the server from which this value was obtained. WEIGHT is optional, and allows each value to be assigned a relative weight according to a defined and specified weighting scheme. This value is included for future clarification. Since a weighting scheme will need to be identified, WEIGHT will take additional scheme identifiers in a syntax to be determined. For poll type QUERY, the allowable values for Poll Scope are a blank line, which indicates that all records are to be returned, or a valid WHOIS++ query, which indicates that just those records which satisfy the query are to be returned. N.B. Security considerations may require additional authentication for successful response to the Blank Line Poll Scope. This value has been included for server replication. As there are different types of tokenization available in future versions of this protocol, and as there may well be organizations which wish to construct servers which each index the same set of servers, but wish to do it in slightly different fashions, the command POLLED-FOR can be used to determine all the servers the current server polls. Required/Optional Table Version-Number REQUIRED, value is 2.0 Type-Of-Poll REQUIRED, values CENTROID and QUERY are required Poll-scope REQUIRED If Type-of-poll is CENTROID, FULL is required, RELATIVE is optional If Type-of-poll is QUERY, Blank line is required, and WHOIS++-type queries are required Start-time OPTIONAL End-Time OPTIONAL Template REQUIRED Field REQUIRED Server-handle REQUIRED Host-Name REQUIRED Host-Port REQUIRED Hierarchy OPTIONAL Description OPTIONAL Tokenization-Type REQUIRED, value TOKENS is required, RECORDS and ATTRIBUTES are optional Options OPTIONAL Content-Tranfer-Encoding OPTIONAL. If not given, default is 8-bit Available values are 8-bit, Quoted-Printable, and Base64 Authentication-Type: OPTIONAL Authentication-data: OPTIONAL Example of a POLL command: # POLL: Version-number: 2.0 Type-of-poll: CENTROID Poll-scope: FULL Start-time: 199501281030+0100 Template: ALL Field: ALL Server-handle: BUNYIP01 Host-Name: services.bunyip.com Host-Port: 7070 Hierarchy: Geographical Tokenization-type: TOKENS # END 4.3. Centroid change report As the centroid change report contains nested multiply-occuring blocks, each multiply occurring block is surrounded *in this paper* by curly braces '{', '}'. These curly braces are NOT part of the syntax, they are for identification purposes only. The syntax of a Data: item is either a list of values (words or other phases, depending on the tokenization value), one value per line, with the syntax: -word + weight (if requested) + server handle (if requested) + host (if requested) + port (if requested) or the keyword: * The weight, handle, host, and port are not required, but are expected to be used by advanced servers. The weight is the relative weight of the value for weighting servers, and the handle, host, and port values are to indicate from which host the polled server received the value. This allows a polling server to construct direct pointers to the servers lower in the mesh rather than adding an additional level of indirection. The keyword ANY as the only item of a Data: list means that any value for this field should be treated as a hit by the indexing server. The field Any-field: needs more explanation than can be given in the body of the syntax description below. It can take two values, True or False. If the value is True, the pollee is indicating that there are fields in this template which are not being exported to the polling server, but wishes to treat as a hit. Thus, when the polling server gets a query which has a term requesting a field not in this list for this template, the polling server will treat that term as a 'hit'. If the value is False, the pollee is indicating that there are no other fields for this template which should be treated as a hit. This field is required because the basic model for the WHOIS++ query syntax requires that the results of each search term be 'and'ed together. This field allows polled servers to export data only for non-sensitive fields, yet still get referrals of queries which contain sensitive terms. # CENTROID-CHANGES Version-number: // version number of pollee's index software, used to // insure compatibility Start-time: // change list starting time, GMT End-time: // change list ending time, GMT Server-handle: // IANA unique identifier of the responding server Hop-Count: // One more than the largest value the polled server has received // when polling other servers. If the polled server is a leaf , // server, hop-count should be zero. The current maximum value // (Feb. 95) is 8. Options: // Which options the polled server was able to satisfy. Values are // WEIGHT, HANDLE, and HOST Authentication-type: // Type of authentication used by pollee, or NONE Authentication-data: // Data for authentication Compression-type: // Type of compression used on the data, or NONE Size-of-compressed-data: // size of compressed data if compression is used Operation: // One of 3 keywords: ADD, DELETE, FULL // ADD - add these entries to the centroid for this server // DELETE - delete these entries from the centroid of this // server // FULL - the full centroid as of end-time follows Tokenization-type: // The tokenization algorithm used // Can be one of: "tokens", "records" or "attributes". // Default is "tokens". { // The multiply occurring template block starts here # BEGIN TEMPLATE Template: // a standard whois++ template name Any-field: // TRUE or FALSE. See beginning of 6.3 for explanation. { // the template contains multiple field blocks # BEGIN FIELD Field: // a field name within that template Content-Transfer-Encoding: // If existing, one of Base64 or Quoted-Printable Data: // Either the keyword *, or // the value list itself, one per line, cr/lf terminated, // each line starting with a dash character ('-'). // Each value may be optionally followed by other lines containing // weight, handle, and/or host information, these other lines begin // with the plus sign (+) and a tag which states which information // is placed on this line. Allowable tags are , , , // and # END FIELD } // the field ends with END FIELD # END TEMPLATE } // the template block ends with END TEMPLATE # END CENTROID-CHANGES // This line must be used to terminate the centroid // change report For each template, all fields must be listed, or queries will not be referred correctly. Required/Optional table Version-number REQUIRED, value is 2.0 Start-time REQUIRED (even if the centroid type is FULL) End-time REQUIRED (even if the centroid type is FULL) Server-handle REQUIRED Hop-Count REQUIRED Options OPTIONAL If the polling server has requested options a polled server is unable to satisfy, an error message will be generated Authentication-Type OPTIONAL Authentication-Data OPTIONAL Compression-type OPTIONAL Size-of-compressed-data OPTIONAL (even if compression is used) Operation REQUIRED, Support for all three values is required Tokenization-type REQUIRED #BEGIN TEMPLATE REQUIRED Template REQUIRED Any-field REQUIRED #BEGIN FIELD REQUIRED Field REQUIRED Content-Transfer-Encoding OPTIONAL, same values as above Data REQUIRED #END FIELD REQUIRED #END TEMPLATE REQUIRED #END CENTROID-CHANGES REQUIRED Example: # CENTROID-CHANGES Version-number: 2.0 Start-time: 197001010000+0100 End-time: 199503012336+0100 Server-handle: BUNYIP01 Hop-Count: 3 Tokenization-Type: TOKENS # BEGIN TEMPLATE Template: USER Any-field: TRUE # BEGIN FIELD Field: Name Data: Patrik + NADA01 + ui.nada.kth.se + 7070 -Faltstrom + NADA01 + ui.nada.kth.se + 7070 -Malin -Linnerborg #END FIELD #BEGIN FIELD Field: Email Data: paf@bunyip.com -malin.linnerborg@paf.se # END FIELD # END TEMPLATE # END CENTROID-CHANGES 4.4 QUERY and POLLEES responses The response to a QUERY command is done in WHOIS++ format. 4.5. Query referral # SERVERS-TO-ASK Version-number: // version number of index software, used to insure // compatibility Body-of-Query: // the original query goes here Next-Servers: // A list of servers to ask next, //As each server can be identified either by a handle or // a host and port combination (or both), the syntax for //reporting the information for a given server is to use // a dash (-) in column 1 to indicate the start of the // information for a new server, and a plus sign (+) to // indicate the continuation information. Also, the tags // , , and should be used to indicate // which values are returned, just as in the data reporting // for the centroid changes. # END SERVERS-TO-ASK Required/Optional table Version-number REQUIRED, value should be 2.0 Body-of-query OPTIONAL Next-Servers REQUIRED, take a value of NONE if there are no appropriate servers. 5. Client-Server Interaction Version 2.0 of the component of the Common Indexing Protocol which specifies the protocol front end to Indexing object interaction is identical to the March 1995 version of the WHOIS++ protocol. As more sophisticated weighting and indexing techniques are used, this will probably need to be expanded. 6: Reply Codes In addition to the reply codes listed in [Deutsch 95] for the basic WHOIS++ client/server interaction, the following reply codes are used by the Common Indexing Protocol. 113 Requested method not available Unable to provide a requested tokenization, compression, or transfer encoding method. Contacted server will send requested data in different format. 114 Requested option not available Unable to provide a requested option in CENTROID-CHANGES. No options have been used but raw data will be transmitted. 227 Update request acknowledged A DATA-CHANGED transmission has been accepted and logged for further action. 503 Required attribute missing A REQUIRED attribute is missing in an interaction. 504 Desired server unreachable The desired server is unreachable. 505 Desired server unavailable The desired server fails to respond to requests, but host is still reachable. 7: References [Deutsch 95] Deutsch, P., Patrik Faltstrom, Rickard Schoultz, and Chris Weider, Architecture of the WHOIS++ Service, RFC 1835, Proposed Standard, March 1995 [Faltstrom 95] Faltstrom, Patrik, Rickard Schoultz, and Chris Weider, "How to interact with a WHOIS++ mesh", RFC 1914, Proposed Standard, November 1995