idnits 2.17.1 

draft-ietf-find-cip-arch-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-19) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 4
     longer pages, the longest (page 7) being 60 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 2 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** There is 1 instance of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'Transport' is mentioned on line 368, but not defined

  ** Downref: Normative reference to an Historic RFC: RFC 1913

  ** Downref: Normative reference to an Historic RFC: RFC 1914

  -- Possible downref: Non-RFC (?) normative reference: ref. 'CIP-MIME'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'CIP-TRANSPORT'


     Summary: 11 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	FIND Working Group                                              J. Allen
2	Internet Draft                                Bunyip Information Systems
3	<draft-ietf-find-cip-arch-01.txt>                       Michael Mealling
4	21 November 1997                                 Network Solutions, Inc.
5	Expire in six months

7	         The Architecture of the Common Indexing Protocol (CIP)

9	Status of this Memo

11	   This document is an Internet-Draft.  Internet-Drafts are working
12	   documents of the Internet Engineering Task Force (IETF), its areas,
13	   and its working groups.  Note that other groups may also distribute
14	   working documents as Internet-Drafts.

16	   Internet-Drafts are draft documents valid for a maximum of six months
17	   and may be updated, replaced, or obsoleted by other documents at any
18	   time.  It is inappropriate to use Internet-Drafts as reference
19	   material or to cite them other than as "work in progress."

21	   To learn the current status of any Internet-Draft, please check the
22	   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
23	   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
24	   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
25	   ftp.isi.edu (US West Coast).

27	Abstract

29	      The Common Indexing Protocol (CIP) is used to pass indexing
30	      information from server to server in order to facilitate query
31	      routing. Query routing is the process of redirecting and
32	      replicating queries through a distributed database system towards
33	      servers holding the desired results. This document describes the
34	      CIP framework, including its architecture and the protocol
35	      specifics of exchanging indices.

37	1. Introduction

39	1.1. History and Motivation

41	   The Common Indexing Protocol (CIP) is an evolution and refinement of
42	   distributed indexing concepts first introduced in the Whois++
43	   Directory Service [RFC1913, RFC1914]. While indexing proved useful in
44	   that system to promote query routing, the centroid index object which
45	   is passed among Whois++ servers is specifically designed for
46	   template-based databases search-able by token-based matching.  With
47	   alternative index objects, the index-passing technology will prove
48	   useful to many more application domains, not simply Directory
49	   Services and those applications which can be cast into the form of
50	   template collections.

52	   The indexing part of Whois++ is integrated with the data access
53	   protocol. The goal in designing CIP is to extract the indexing
54	   portion of Whois++, while abstracting the index objects to apply more
55	   broadly to information retrieval. In addition, another kind of
56	   technology reuse has been undertaken by converting the ad-hoc data
57	   representations used by Whois++ into structures based on the MIME
58	   specification for structured Internet mail.

60	   Whois++ used a version number field in centroid objects to facilitate
61	   future growth. The initial version was "1". Version 1 of CIP (then
62	   embedded in Whois++, and not referred to separately as CIP) had
63	   support for only ISO-8859-1 characters, and for only the centroid
64	   index object type.

66	   Version 2 of the Whois++ centroid was used in the Digger software by
67	   Bunyip Information Systems to notify recipients that the centroid
68	   carried extra character set information. Digger's centroids can carry
69	   UTF-8 encoded 16-bit Unicode characters, or ISO-8859-1 characters,
70	   determined by a field in the headers.

72	   This specification is for CIP version 3 (CIPv3). Version 3 is a major
73	   overhaul to the protocol, though through a short negotiation
74	   sequence, CIP version 3 and earlier servers can interoperate in an
75	   index-passing mesh.

77	1.2 CIP's place in the Information Retrieval world

79	   CIP facilitates query routing. CIP is a protocol used between servers
80	   in a network to pass hints which make data access by clients at a
81	   later date more efficient. Query routing is the act of redirecting
82	   and replicating queries through a distributed database system towards
83	   the servers holding the actual results via reference to indexing
84	   information.

86	   CIP is a "backend" protocol -- it is implemented in and "spoken"
87	   among only network servers. These same servers must also speak some
88	   kind of data access protocol to communicate with clients. During
89	   query resolution in the native protocol implementation, the server
90	   will refer to the indexing information collected by the CIP
91	   implementation for guidance on how to route the query.

93	   Data access protocols used with CIP must have some provision for
94	   control information in the form of a referral. The syntax and
95	   semantics of these referrals are outside the scope of this
96	   specification.

98	2. Related Documents

100	   This document is one of three documents. This document describes the
101	   fundamental concepts and framework of CIP.

103	   The document "MIME Object Definitions for the Common Indexing
104	   Protocol" [CIP-MIME] describes the MIME objects that make up the
105	   items that are passed by the transport system.

107	   Requirements and examples of several transport systems are specified
108	   in the "CIP Transport Protocols" [CIP-TRANSPORT] document.

110	   A second set of documents describes the various specifications for
111	   specific index types.

113	3. Architecture

115	3.1 CIP in the Information Retrieval World

117	3.1.1 Information Retrieval in the Abstract

119	   In order to better understand how CIP fits into the information
120	   retrieval world, we need to first understand the unifying abstract
121	   features of existing information retrieval technology. Next, we
122	   discuss why adding indexing technology to this model results in a
123	   system capable of query routing, and why query routing is useful.

125	   An abstract view of the client/server data retrieval process includes
126	   data sets and data access protocols. An individual server is
127	   responsible for handling queries over a fixed domain of data. For the
128	   purposes of CIP, we call this domain of data the dataset. Clients
129	   make searches in the dataset and retrieve parts of it via a data
130	   access protocol. There are many data access protocols, each optimized
131	   for the data in question. For instance, LDAP and Whois++ are access
132	   protocols that reflect the needs of the directory services
133	   application domain. Other data access protocols include HTTP and
134	   Z39.50.

136	3.1.2 Indexing Information Facilitates Query Routing

138	   The above description reflects a world without indexing, where no
139	   server knows about any other server. In some cases (as with X.500
140	   referrals, and HTTP redirects) a server will, as part of its reply,
141	   implicate another server in the process of resolving the query.
142	   However, those servers generate replies based solely on their local
143	   knowledge. When indexing information is introduced into a server's
144	   local database, the server now knows not only answers based on the
145	   local dataset, but also answers based on external indices. These
146	   indices come from peer servers, via an indexing protocol. CIP is one
147	   such indexing protocol.

149	   Replies based on index information may not be the complete answer.
150	   After all, an index is not a replicated version of the remote
151	   dataset, but a possibly reduced version of it. Thus, in addition to
152	   giving complete replies from the local dataset, the server may give
153	   referrals to other datasets. These referrals are the core feature
154	   necessary for effective query routing. When servers use CIP to pass
155	   indices to one another, they make a kind of investment. At the
156	   cost of some resources to create, transmit and store the indices,
157	   query routing becomes possible.

159	   Query Routing is the process of replicating and moving a query closer
160	   to datasets which can satisfy the query. In some distributed systems,
161	   widely distributed searches must be accomplished by replicating the
162	   query to all sub-datasets. This approach can be wasteful of resources
163	   both in the network, and on the servers, and is thus sometimes
164	   explicitly disabled. Using indexing in such a system opens the door
165	   to more efficient distributed searching.

167	   While CIP-equipped servers provide the referrals necessary to make
168	   query routing work, it's always the client's responsibility to
169	   collate, filter, and chase the referrals it receives. This gives the
170	   end-user (or agent, in the case that there's no human user involved
171	   in the search) greatest control over the query resolution process.
172	   The cost of the added client complexity is weighed against the
173	   benefits of total control over query resolution. In some cases, it
174	   may also be possible to decouple the referral chasing from the client
175	   by introducing a proxy, allowing existing simple clients to make use
176	   of query routing. Such a proxy would transparently resolve referrals
177	   into concrete results before returning them to the simple-minded
178	   client.

180	3.1.3 Abstracting the CIP index object

182	   As useful as indices seem, the fact remains that not all queries can
183	   benefit from the same type of index. For example, say the index
184	   consists of a simple list of keywords. With such an index, it is
185	   impossible to answer queries about whether two keywords were near one
186	   another, or if a keyword was present in a certain context (for
187	   instance, in the title).
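
   The limitation described above can be illustrated with a minimal
   sketch. It assumes a hypothetical index that is nothing more than a
   flat set of keywords; the function names and sample data are
   illustrative only and are not part of CIP.

```python
def build_keyword_index(documents):
    """Collapse documents into a flat set of lowercase keywords (lossy)."""
    index = set()
    for text in documents:
        index.update(text.lower().split())
    return index


def may_contain(index, keyword):
    """Answer the only question a flat keyword index supports: presence."""
    return keyword.lower() in index


# Hypothetical sample data.
docs = ["common indexing protocol", "query routing in a mesh"]
index = build_keyword_index(docs)

# Both words are present, but the index no longer records which document
# each came from, where it appeared, or what it was adjacent to.
both_present = may_contain(index, "common") and may_contain(index, "mesh")
```

   Because the set keeps no positions or field names, a proximity or
   "in the title" query cannot be answered from it, which is why the
   index design has to follow the application domain.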

189	   Because of the need for application domain specific indices, CIP
190	   index objects are abstract; they must be defined by a separate
191	   specification. The basic protocols for moving index objects are
192	   widely applicable, but the specific design of the index, and the
193	   structure of the mesh of servers which pass a particular type of
194	   index is dependent on the application domain. This document describes
195	   only the protocols for moving indices among servers. Companion
196	   documents describe initial index objects.

198	   The requirements that index type specifications must address are
199	   specified in the [CIP-MIME] document.

201	3.2 Architectural Details

203	   CIP implements index passing, providing the forward knowledge
204	   necessary to generate the referrals used for query routing. The core
205	   of the protocol is the index object. In the following sections, the
206	   structure of the index objects themselves is presented. Next, how and
207	   why indices are passed from server to server is discussed. Finally,
208	   the circumstances under which a server may synthesize an index object
209	   based on incoming ones are discussed.

211	3.2.1 The CIP Index Object

213	   A CIP index object is composed of two parts, the header and the
214	   payload. The header contains metadata necessary to process and make
215	   use of the index object being transmitted. The actual index resides
216	   in the payload.

218	   Three particular headers warrant specific mention at this point.
219	   The "type" of the index object selects one of many distinct CIP index
220	   object specifications which define exactly how the index blocks are
221	   to be created, parsed and used to facilitate query routing.  Another
222	   header of note is the "DSI", or Dataset Identifier, which uniquely
223	   identifies the dataset from which the index was created.  Another
224	   header that is crucial for generating referrals is the "Base-URI".
225	   The URI (or URI's) contained in this header form the basis of any
226	   referrals generated based on this index block. The URI is also
227	   used as input during the index aggregation process to constrain
228	   the kinds of aggregation possible, due to multiprotocol constraints.
229	   The exact syntax of these headers will be specified in the CIP MIME
230	   specification document [CIP-MIME].

232	   The payload is opaque to CIP itself. It is defined exclusively by the
233	   index object specification associated with the object's MIME type.
234	   Specifications on how to parse and use the payload are published
235	   separately as "CIP index object specifications". This abstract
236	   definition of the index object forms the basis of CIP's applicability
237	   to indexing needs across multiple application domains.

239	   A precise definition of the content and form of a CIP index block can
240	   be found in the Protocol document [CIP-MIME].
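
   As a rough illustration of the header/payload split, the following
   sketch models an index object in memory. The class name, field
   names, and example values are assumptions made for this example;
   the actual header syntax is defined in [CIP-MIME], not here.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class IndexObject:
    """Hypothetical in-memory model of a CIP index object."""
    index_type: str              # selects the index object specification
    dsi: str                     # Dataset Identifier of the source dataset
    base_uris: Tuple[str, ...]   # one or more URIs used to build referrals
    payload: bytes               # opaque to CIP; parsed per index_type

    def protocols(self):
        """Return the set of URI schemes named by the Base-URIs."""
        return {urlsplit(uri).scheme.lower() for uri in self.base_uris}


# Placeholder values; the DSI string and URIs are invented.
example = IndexObject(
    index_type="centroid",
    dsi="example-dataset-1",
    base_uris=("ldap://ldap.example.org/o=Example",
               "http://www.example.org/search"),
    payload=b"...type-specific index data...",
)
```

   Keeping the payload as uninterpreted bytes mirrors the statement that
   the payload is opaque to CIP itself.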

242	3.2.2 Moving Index Objects: How to Build a Mesh

244	   Indices are transmitted among servers participating in a CIP mesh. By
245	   distributing this information in anticipation of a query, efficient,
246	   accurate query routing is possible at the time a query arrives.

248	   A CIP mesh is a set of CIP servers which pass indices of the same
249	   type among themselves. Typically, a mesh is arranged in a
250	   hierarchical tree fashion, with servers nearer the root of the tree
251	   having larger and more comprehensive indices. See Figure 1.
252	   However, a CIP mesh is explicitly allowed to have lateral links in
253	   it, and there may be more than one part of the mesh that has the
254	   properties of a "root". Mesh administrators are encouraged to avoid
255	   loops in the system, but they are not obliged to maintain a strict
256	   tree structure. Clients wishing to completely resolve all referrals
257	   they receive should protect against referral loops while attempting
258	   to traverse the mesh to avoid wasting time and network resources.
259	   See the section on "Navigating the Mesh" for a discussion of this.

261	     base level	            index                    index
262	     directory             servers                  servers
263	      servers                for                      for
264	                          base level               lower-level
265	                           servers                index servers
266	     _______
267	    |       |
268	    |   A   |__
269	    |_______|  \            _______
270	                \---CIP----|       |
271	     _______               |   D   |__
272	    |       |   /---CIP----|_______|  \             ------
273	    |   B   |__/                       \--CIP------|      |
274	    |_______|                                      |  F   |
275	                                       /--CIP------|______|
276	                                      /
277	     _______                _______  /
278	    |       |              |       |-
279	    |   C   |-------CIP----|   E   |
280	    |_______|              |_______|-
281	                                |    \
282	                                r     \
283	     _______                    e      \            ______
284	    |       |                   f       \--CIP-----|      |
285	    |   G   |-------CIP---------e------------------|  H   |
286	    |_______|                   r                  |______|
287	            \--referral---|     r      --referral-/

289	                          |     a     |

291	                          |     l     |

293	                          \ 3   | 2   | 1

295	                            \--------/

297	                            |        |

299	                            | client |

301	                            |        |

303	                             --------

305	             Figure 1: Sample layout of the Index Service mesh

307	   All indices passed in a given mesh are assumed, as of this writing,
308	   to be of the same type (i.e. governed by the same CIP index object
309	   specification). It may be possible to create gateways between meshes
310	   carrying different index objects, but at this time that process is
311	   undefined and declared to be outside the scope of this specification.

313	   In the case where a CIP server receives an index of a type that
314	   it does not understand it _can_ pass that index forward untouched.
315	   In the case where a server implementation decides not to accept
316	   unknown indices it should return an appropriate error message to
317	   the server sending the index. This behavior is to allow mesh
318	   implementations to attempt heterogeneous meshes. As stated
319	   above heterogeneous meshes are considered to be ill defined and as
320	   such should be considered dangerous. I.e. "Here be dragons".

322	   Experience suggests that this index passing activity should take
323	   place among CIP servers as a parallel (and possibly lower-priority)
324	   job to their primary job of answering queries. Index objects travel
325	   among CIP servers by protocol exchanges explicitly defined in this
326	   document, not via the server's native protocol. This distinction is
327	   important, and bears repeating:

329	         Queries are answered (and referrals are sent) via the native
330	         data access protocol.

332	         Index objects are transferred via alternative means, as defined
333	         by this document.

335	   When two servers cooperate to move indexing information, the pair are
336	   said to be in a "polling relationship". The server that holds the
337	   data of interest, and generates the index is called the "polled
338	   server".  The other server, which is the one that collects the
339	   generated index, is the "polling server".

341	   In a polling relationship, the polled server is responsible for
342	   notifying the polling server when it has a new index that the polling
343	   server might be interested in. In response, the polling server may
344	   immediately pick up the index object, or it may schedule a job to
345	   pick up a copy of the new index at a more convenient time. But,
346	   a polling server is not required to wait on the polled server to
347	   notify it of changes. The polling server can request a new
348	   index at any time.
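
   The polling relationship can be sketched as two cooperating
   objects. This is a simplified in-process model, not the wire
   protocol: the class and method names are hypothetical, and a version
   counter stands in for a real index object.

```python
class PolledServer:
    """Holds the data and generates the index."""

    def __init__(self, name):
        self.name = name
        self.index_version = 0
        self.pollers = []

    def register(self, poller):
        self.pollers.append(poller)

    def rebuild_index(self):
        # A new index exists, so notify every polling server.
        self.index_version += 1
        for poller in self.pollers:
            poller.notify(self)

    def fetch_index(self):
        return (self.name, self.index_version)


class PollingServer:
    """Collects indices, either immediately or at a later time."""

    def __init__(self, fetch_immediately=True):
        self.fetch_immediately = fetch_immediately
        self.pending = []
        self.indices = {}

    def notify(self, polled):
        if self.fetch_immediately:
            self.poll(polled)
        else:
            self.pending.append(polled)  # pick it up when convenient

    def poll(self, polled):
        # May also be called at any time, without a prior notification.
        name, version = polled.fetch_index()
        self.indices[name] = version

    def run_scheduled(self):
        while self.pending:
            self.poll(self.pending.pop())


leaf = PolledServer("leaf")
eager = PollingServer(fetch_immediately=True)
lazy = PollingServer(fetch_immediately=False)
leaf.register(eager)
leaf.register(lazy)
leaf.rebuild_index()
```

   After the notification, "eager" already holds the new index while
   "lazy" holds it only once run_scheduled() has been called.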

350	   Independent of the symmetric polling relationship, there's another
351	   way that servers can pass indices using CIP. In an "index pushing"
352	   relationship, a CIP server simply sends the index to a peer whenever
353	   necessary, and allows the receiver to handle the index object as it
354	   chooses. The receiving server may refuse it, may accept it, then
355	   silently discard it, may accept only portions of it (by accepting it
356	   as is, then filtering it), or may accept it without question.

358	   The index pushing relationship is intended for use by dumb leaf nodes
359	   which simply want to make their index available to the global mesh of
360	   servers, but have no interest in implementing the complete CIP
361	   transaction protocol. It lowers the barriers to entry for CIP leaf
362	   nodes. For more information on participating in a CIP mesh in this
363	   restricted manner, see the section below on "Protocol Conformance".

365	   CIP index passing operations take place across reliable transport
366	   mechanisms, including both TCP connections and Internet mail
367	   messages. The precise mechanisms are described in the Transport
368	   document [CIP-TRANSPORT].

398	3.2.3 Index Object Synthesis

400	   From the preceding discussion, it should be clear that indexing
401	   servers read and write index objects as they pass them around the
402	   mesh. However, a CIP server need not simply pass the in-bound indices
403	   through as the out-bound ones. While it's always permissible to pass
404	   an index object through to other servers, a server may choose to
405	   aggregate two or more of them, thereby reducing redundancy in the
406	   index, at the cost of longer referral chains.

408	   A basic premise of index passing is that even while collapsing a body
409	   of data into an index by lossy compression methods, hints useful to
410	   routing queries will survive in the resulting index. Since the index
411	   is not a complete copy of the original dataset, it contains less
412	   information. Index objects can be passed along unchanged, but as more
413	   and more information collects in the resulting index object,
414	   redundancy will creep in again, and it may prove useful to apply the
415	   compression again, by aggregating two or more index objects into one.

417	   This kind of aggregation should be performed without compromising the
418	   ability to correctly route queries while avoiding excessive numbers
419	   of missed results. The acceptable likelihood of false negatives must
420	   be established on a per-application-domain basis, and is controlled
421	   by the granularity of the index and the aggregation rules defined for
422	   it by the particular specification.

424	   However, when CIP is used in a multi-protocol application domain,
425	   such as a Directory Service (with contenders including Whois++, LDAP,
426	   and Ph), things get significantly trickier. The fundamental problem
427	   is to avoid forcing a referral chain to pass through part of the mesh
428	   which does not support the protocol by which that client made the
429	   query. If this ever happens, the client loses access to any hits
430	   beyond that point in the referral chain, since it cannot resolve the
431	   referral in its native data access protocol. This is a failure of
432	   query routing, which should be avoided.

434	   In addition to multi-protocol considerations, server managers may
435	   choose not to allow index object aggregation for performance reasons.
436	   As referral chains lengthen, a client needs to perform more
437	   transactions to resolve a query. As the number of transactions
438	   increases, so do the user-perceived delays, the system loads, and the
439	   global bandwidth demands. In general, there's a tradeoff between
440	   aggressive aggregation (which leads to reductions in the indexing
441	   overhead) and aggressive referral chain optimization. This tradeoff,
442	   which is also sensitive to the particular application domain, needs
443	   to be explored more in actual operational situations.

445	   Conceptually, a CIP index server has several index objects on hand at
446	   any given time. If it holds data in addition to indexing information,
447	   the server has an index object formed from its own data, called the
448	   "local index". It may have one or more indices from remote servers
449	   which it has collected via the index passing mechanisms. These are
450	   called "in-bound indices".

452	         Implementor's Note: It may not be necessary to keep all of
453	         these structures intact and distinct in the local database. It
454	         is also not required to keep the out-bound index (or indices)
455	         built and ready to distribute at all times. The previous
456	         paragraph merely introduces a useful model for expressing the
457	         aggregation rules. Implementors are free to model index objects
458	         internally however they see fit.

460	   The following two rules control how a CIP server formulates its
461	   outgoing indices:

463	      1. An index server may pass any of the index objects in its
464	         local index and its in-bound indices through unchanged to
465	         polling servers.
466	      2. If and only if the following three conditions are true, an
467	         index server can aggregate two or more index objects into a
468	         single new index object, to be added to the set of out-bound
469	         indices.

471	         a. Each index object to be aggregated covers exactly the
472	            same set of protocols, as defined by the scheme component of
473	            the Base-URI's in each index object.

475	         b. The index server supports every one of the data access
476	            protocols represented by the Base-URI's in the index objects
477	            to be aggregated.

479	         c. The specification for the index object type specified
480	            by the type header of the index objects explicitly
481	            defines the aggregation operation.

483	         The resulting index object must have Base-URI's characteristic
484	         of the local server for each protocol it supports. The outgoing
485	         objects should have the DSI of the local server.
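
   The three conditions of rule 2 can be expressed as a single
   predicate. The sketch below is one possible reading of the rules,
   not a normative algorithm: index objects are modeled as plain
   dictionaries with assumed keys "type" and "base_uris", and the set
   of types whose specifications define aggregation is supplied by the
   caller.

```python
from urllib.parse import urlsplit


def schemes(base_uris):
    """Protocols covered by an index object, taken from URI schemes."""
    return {urlsplit(uri).scheme.lower() for uri in base_uris}


def can_aggregate(index_objects, supported_protocols, aggregatable_types):
    """Return True if rule 2 permits aggregating these index objects."""
    if len(index_objects) < 2:
        return False
    covered = [schemes(obj["base_uris"]) for obj in index_objects]
    # (a) every object covers exactly the same set of protocols
    same_protocols = all(c == covered[0] for c in covered)
    # (b) the local server supports every protocol represented
    server_supports_all = set().union(*covered) <= set(supported_protocols)
    # (c) the index type specification defines aggregation
    types_allow = all(obj["type"] in aggregatable_types
                      for obj in index_objects)
    return same_protocols and server_supports_all and types_allow


# Hypothetical index objects for illustration.
a = {"type": "centroid", "base_uris": ["ldap://a.example.org/"]}
b = {"type": "centroid", "base_uris": ["ldap://b.example.org/"]}
c = {"type": "centroid", "base_uris": ["http://c.example.org/"]}
```

   With these inputs, a and b may be aggregated by a server that speaks
   LDAP, while a and c may not, because they cover different protocols.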

487	4. Navigating the mesh

489	   With the CIP infrastructure in place to manage index objects, the
490	   only problem remaining is how to successfully use the indexing
491	   information to do efficient searches. CIP facilitates query routing,
492	   which is essentially a client activity. A client connects to one
493	   server, which redirects the query to servers "closer to" the answer.
494	   This redirection message is called a referral.

496	4.1 The Referral

498	   The concept of a referral and the mechanism for deciding when they
499	   should be issued is described by CIP. However, the referral itself
500	   must be transferred to the client in the native protocol, so its
501	   syntax is not directly a CIP issue. The mechanism for deciding that a
502	   referral needs to be made and generating that referral resides in
503	   the CIP implementation in the server. The mechanism for sending
504	   the referral to the client resides in the server's native protocol
505	   implementation.

507	   A referral is made when a search against the index objects held by
508	   the server shows that there may be hits available in one of the
509	   datasets represented by those index objects. If more than one index
510	   object indicates that a referral must be generated to a given
511	   dataset, the server should generate only one referral to the given
512	   dataset, as the client may not be able to detect duplicates.
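
   The duplicate suppression described above can be sketched as
   follows. This is a minimal Python illustration, not a required
   implementation. The function name and the shape of its input, a
   sequence of (index object identifier, dataset DSI) pairs for the
   index objects that matched the query, are assumptions made for the
   example.

```python
def datasets_to_refer(matching_index_objects):
    """Return each dataset DSI once, in the order it was first matched.

    matching_index_objects: iterable of (index_object_id, dataset_dsi)
    pairs (hypothetical representation of the search result).
    """
    seen = set()
    ordered = []
    for _object_id, dsi in matching_index_objects:
        if dsi not in seen:  # a second hit for the same dataset adds nothing
            seen.add(dsi)
            ordered.append(dsi)
    return ordered
```

   The server then generates one referral for each DSI in the
   returned list.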

514	   Though the format of the referral is dependent on the native
515	   protocol(s) of the CIP server, the baseline contents of the referral
516	   are constant across all protocols. At the least, a DSI and a URI must
517	   be returned.  The DSI is the DSI associated with the dataset which
518	   caused the hit.  This must be presented to the client so that it can
519	   avoid referral loops. The Base-URI parameter which travels along with
520	   index objects is used to provide the other required part of a referral.

522	   The additional information in the Base-URI may be necessary for the
523	   server receiving the referred query to correctly handle it. A good
524	   example of this is an LDAP server, which needs a base X.500
525	   distinguished name from which to search. When an LDAP server sends a
526	   centroid-format index object up to a CIP indexing server, it sends a
527	   Base-URI along with the name of the X.500 subtree for which the index
528	   was made. When a referral is made, the Base-URI is passed back to the
529	   client so that it can pass it to the original LDAP server.
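
   To make the LDAP example concrete, the following Python sketch
   shows how a client might recover the server and the search base
   from a Base-URI returned in a referral. It assumes the Base-URI is
   an LDAP URL whose path carries the distinguished name. The function
   name and the example host are hypothetical, and the fallback to
   port 389 is the conventional LDAP port rather than anything CIP
   specifies.

```python
from urllib.parse import urlsplit, unquote


def ldap_search_base(base_uri):
    """Split an LDAP Base-URI into (host, port, base distinguished name)."""
    parts = urlsplit(base_uri)
    if parts.scheme != "ldap":
        raise ValueError("not an LDAP Base-URI: %r" % base_uri)
    # The distinguished name is the URL path without its leading "/".
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    return parts.hostname, parts.port or 389, unquote(path)
```

   For instance, ldap_search_base("ldap://ldap.example.org/o=Example")
   gives the client the host "ldap.example.org" and the subtree
   "o=Example" from which the referred query should search.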

531	   As usual, in addition to sending the DSI, a DSI-Description header
532	   can be optionally sent. Because a client may attempt to check with
533	   the user before chasing the referral, and because this string is the
534	   friendliest representation of the DSI that CIP has to offer, it
535	   should be included in referrals when available (i.e. when it was sent
536	   along with the index object).
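
   The baseline contents of a referral discussed in this section can
   be pictured as a small record. The Python sketch below is only an
   illustration: the Referral class, the make_referral function, and
   the use of a dictionary keyed by header name are assumptions, and
   the actual encoding is that of the native protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Referral:
    dsi: str                               # required: lets the client avoid loops
    uri: str                               # required: taken from the Base-URI
    dsi_description: Optional[str] = None  # optional, for presentation to the user

    def label(self):
        """Friendliest available name for the dataset."""
        return self.dsi_description or self.dsi


def make_referral(headers):
    """Build a referral from the headers that arrived with an index object."""
    return Referral(
        dsi=headers["DSI"],
        uri=headers["Base-URI"],
        dsi_description=headers.get("DSI-Description"),
    )
```

   A client that checks with the user before chasing a referral would
   display label(), which falls back to the bare DSI when no
   DSI-Description was sent along with the index object.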

538	4.2 Cross-protocol Mappings

540	   Each data access protocol which uses CIP will need a clearly defined
541	   set of rules to map queries in the native protocol to searches
542	   against an index object. These rules will vary according to the data
543	   domain. In principle, this could create a bit of a scaling
544	   difficulty; for N protocols and M data domains, there would be N x M
545	   mappings required. In practice, this should not be the case, since
546	   some access protocols will be wholly unsuited to some data domains.
547	   Consider, for example, an LDAP server trying to make a search in an
548	   index object composed from unorganized text-based pages. What
549	   would the results be? How would the client make sense of the results?

551	   However, as pre-existing protocols are connected to CIP, and as new
552	   ones are developed to work with CIP, this issue must be examined. In
553	   the case of Whois++ and the CENTROID index type, there is an
554	   extremely close mapping, since the two were designed together. When
555	   hooking LDAP to the CENTROID index type, it will be necessary to map
556	   the attribute names used in the LDAP system to attribute names which
557	   are already being used in the CENTROID mesh. It will also be
558	   necessary to tokenize the LDAP queries under the same rules as the
559	   CENTROID indexing policy, so that searches will take place correctly.
560	   These application- and protocol-specific actions must be specified in
561	   the index object specification, as discussed in the [CIP-MIME]
562	   document.
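
   The two steps named above for LDAP, attribute name mapping and
   tokenization, might look like the following Python sketch. The
   mapping table and the tokenization rule are invented for the
   example. Real values would come from the index object
   specification for the mesh in question.

```python
import re

# Hypothetical mapping from LDAP attribute names to the attribute
# names already in use in a CENTROID mesh.
LDAP_TO_CENTROID = {"cn": "name", "mail": "email", "o": "organization"}


def tokenize(value):
    """Hypothetical indexing policy: lower-case, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def ldap_assertion_to_index_terms(attribute, value):
    """Map one LDAP (attribute, value) assertion to (attribute, token) terms."""
    centroid_attribute = LDAP_TO_CENTROID.get(attribute.lower())
    if centroid_attribute is None:
        return []  # no mapping: this assertion cannot be checked in the index
    return [(centroid_attribute, token) for token in tokenize(value)]
```

   The essential point is that the query side applies the same
   tokenization as the indexing side. Otherwise tokens that are
   present in the index would never match.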

564	4.3 Moving through the mesh

566	   From a client's point of view, CIP simply pushes all the "hard work"
567	   onto its shoulders. After all, it's the client which needs to track
568	   down the real data.  While this is true, it's very misleading.
569	   Because the client has control over the query routing process, the
570	   client has total control over the size of the result set, the speed
571	   with which the query progresses, and the depth of the search.

573	   The simplest client implementation simply provides referrals to the
574	   user in a raw, ready-to-reuse form, without attempting to follow
575	   them. For instance, one Whois++ client, which interacts with the user
576	   via a Web-based form, simply makes referrals into HTML hypertext
577	   links. Encoded in the link via the HTML forms interface GET encoding
578	   rules is the data of the referral: the hostname, port, and query. If
579	   a user chooses to follow the referral link, they execute a new search
580	   on the new host. A more savvy client might present the referrals to
581	   the user and ask which should be followed. And, assuming appropriate
582	   limits were placed on search time and bandwidth usage, it might be
583	   reasonable to program a client to follow all referrals automatically.
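
   The Web-based approach can be sketched in Python as follows. The
   function, the gateway path, and the form field names "host",
   "port" and "query" are assumptions for illustration. They are not
   taken from any particular Whois++ client.

```python
from html import escape
from urllib.parse import urlencode


def referral_link(gateway_url, host, port, query, text):
    """Render one referral as an HTML link using GET form encoding."""
    params = urlencode({"host": host, "port": port, "query": query})
    href = "%s?%s" % (gateway_url, params)
    # Escape both the attribute value and the visible link text.
    return '<a href="%s">%s</a>' % (escape(href, quote=True), escape(text))
```

   Following such a link simply submits a new search to the gateway,
   this time directed at the referred-to host.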

585	   When following all referrals, a client must show a bit of
586	   intelligence.  Remember that the mesh is defined as an interconnected
587	   graph of CIP servers. This graph may have cycles, which could cause
588	   an infinite loop of referrals, wasting the servers' time and the
589	   client's too. When faced with the job of tracking down all referrals,
590	   a client must use some form of a mesh traversal algorithm. Such an
591	   algorithm has been documented for use with Whois++ in RFC-1914. The
592	   same algorithm can be easily used with this version of CIP. In
593	   Whois++ the equivalent of a DSI is called a handle. With this
594	   substitution, the Whois++ mesh traversal algorithm works unchanged
595	   with CIP.
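
   The following Python sketch illustrates the general idea of loop
   avoidance by remembering which DSIs have already been seen. It is
   a generic breadth-first traversal written for this example, not a
   transcription of the algorithm in RFC-1914. The ask callable, the
   (DSI, URI) pair representation, and the max_servers limit are
   assumptions.

```python
from collections import deque


def traverse_mesh(entry, ask, max_servers=50):
    """Follow referrals breadth-first, querying each DSI at most once.

    entry: (dsi, uri) of the first server queried.
    ask:   hypothetical callable; ask(uri) returns (hits, referrals),
           where referrals is a list of (dsi, uri) pairs.
    """
    seen = {entry[0]}
    queue = deque([entry])
    results = []
    visited = 0
    while queue and visited < max_servers:  # bound the total work
        _dsi, uri = queue.popleft()
        visited += 1
        hits, referrals = ask(uri)
        results.extend(hits)
        for ref_dsi, ref_uri in referrals:
            if ref_dsi not in seen:  # skip datasets already queued or queried
                seen.add(ref_dsi)
                queue.append((ref_dsi, ref_uri))
    return results
```

   Because a DSI is recorded when it is first queued, a cycle in the
   mesh cannot cause the same server to be queried twice.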

597	   Finally, the mesh entry point (i.e. the first server queried) can
598	   have an impact on the success of the query. To avoid scaling issues,
599	   it is not acceptable to use a single "root" node, and force all
600	   clients to connect to it. Instead, clients should connect to a
601	   reasonably well connected (with respect to the CIP mesh, not the
602	   Internet infrastructure) local server. If no match can be made from
603	   this entry point, the client can expand the search by asking the
604	   original server who polls it. In general, those servers will have a
605	   better "vantage point" on the mesh, and will turn up answers that the
606	   initial search didn't. The mechanism for dynamically determining the
607	   mesh structure like this exists, but is not documented here for
608	   brevity. See RFC-1913 for more information on the POLLED-BY and
609	   POLLED-FOR commands.
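
   The widening strategy can be sketched in Python as follows. The
   search and polled_by callables are hypothetical stand-ins for the
   native query and for whatever mechanism reports the servers that
   poll a given server. Their names and signatures are assumptions
   made for this example.

```python
def search_with_fallback(entry, search, polled_by):
    """Query the local entry point first; widen only if it finds nothing.

    search(server)    -> list of hits (hypothetical)
    polled_by(server) -> list of servers that poll `server` (hypothetical)
    """
    hits = list(search(entry))
    if hits:
        return hits
    # No match locally: try the servers with a wider view of the mesh.
    for server in polled_by(entry):
        hits.extend(search(server))
    return hits
```

   Starting locally and widening only on failure keeps most queries
   away from the best-connected servers, in line with the scaling
   concern raised above.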

611	   It still should be noted that, while these mesh operations are
612	   important to optimizing the searches that a client should make,
613	   the client still speaks its native protocol. This information must
614	   be communicated to the client without causing the client to have
615	   to understand CIP.

617	5. Security Considerations

619	   In this section, we discuss the security considerations necessary
620	   when making use of this specification. There are at least two levels
621	   at which security considerations come into play. Indexing information
622	   can leak undesirable amounts of proprietary information, unless
623	   carefully controlled. At a more fundamental level, the CIP protocol
624	   itself requires external security services to operate in a safe
625	   manner. Both topics are covered below.

627	5.1 Secure Indexing

629	   CIP is designed to index all kinds of data. Some of this data might
630	   be considered valuable, proprietary, or even highly sensitive by the
631	   data maintainer. Take, for example, a human resources database.
632	   Certain public bits of data, in moderation, can be very helpful for a
633	   company to make public. However, the database in its entirety is a
634	   very valuable asset, which the company must protect. Much experience
635	   has been gained in the directory service community over the years as
636	   to how best to walk this fine line between completely revealing the
637	   database and making useful pieces of it available.

639	   Another example where security becomes a problem is for a data
640	   publisher who'd like to participate in a CIP mesh. The data that
641	   publisher creates and manages is the prime asset of the company.
642	   There is a financial incentive to participate in a CIP mesh, since
643	   exporting indices of the data will make it more likely that people
644	   will search the database. (Making profit off of the search activity
645	   is left as an exercise to the entrepreneur.) Once again, the index
646	   must be designed carefully to protect the database while providing a
647	   useful synopsis of the data.

649	   One of the basic premises of CIP is that data providers will be
650	   willing to provide indices of their data to peer indexing servers.
651	   Unless they are carefully constructed, these indices could constitute
652	   a threat to the security of the database. Thus, security of the data
653	   must be a prime consideration when developing a new index object
654	   type. The risk of reverse engineering a database based only on the
655	   index exported from it must be kept to a level consistent with the
656	   value of the data and the need for fine-grained indexing.

658	Acknowledgments

660	   Thanks to the many helpful members of the FIND working group for
661	   discussions leading to this specification.

663	   Specific acknowledgment is given to Jeff Allen formerly of Bunyip
664	   Information Systems. His original version of these documents helped
665	   enormously in crystallizing the debate and consensus. Most of the
666	   actual text in this document was originally authored by Jeff.

668	Authors' Addresses

670	   Jeff R. Allen                      Michael Mealling
671	   Bunyip Information Systems, Inc.   Network Solutions, Inc.
672	   310 Ste-Catherine West, Suite 300  505 Huntmar Park Drive
673	   Montreal, Quebec H2X 2A1           Herndon, VA 22070
674	   Canada

676	   Phone: +1-514-875-8611             Phone: (703) 742-0400
677	   EMail: jeff@bunyip.com             Email: michael.mealling@RWhois.net

679	References

681	   [RFC1913]
682	      Weider, C., Fullton, J., S. Spero, "Architecture of the Whois++
683	      Index Service", Bunyip, CNIDR, EIT, February 1996.

685	   [RFC1914]
686	      Faltstrom, P., Schoultz, R., C. Weider, "How to Interact with a
687	      Whois++ Mesh", Bunyip, KTHNOC, February 1996.

689	   [CIP-MIME]
690	      Allen, J., M. Mealling, "MIME Object Definitions for the Common
691	      Indexing Protocol (CIP)", IETF FIND WG, June 1997.

693	   [CIP-TRANSPORT]
694	      Allen, J., P. Leach, "CIP Transport Protocols", WebTV, Microsoft,
695	      June 1997.

697	Appendix A: Glossary

699	   application domain:
700	           A problem domain to which CIP is applied which has
701	           indexing requirements which are not subsumed by any
702	           existing problem domain. Separate application domains
703	           require separate index object specifications, and
704	           potentially separate CIP meshes.  See index object
705	           specification.

707	   centroid:
708	           An index object type used with Whois++. In CIP
709	           versions before version 3, the index was not
710	           extensible, and could only take the form of a
711	           centroid. A centroid is a list of (template name,
712	           attribute name, token) tuples with duplicate headers
713	           removed.

715	   dataset:
716	           A collection of data (real or virtual) over which an
717	           index is created. When a CIP server aggregates two or
718	           more indices, the resultant index represents the
719	           index from a "virtual dataset", spanning the previous
720	           two datasets.

722	   Dataset Identifier:
723	           An identifier chosen from any part of the ISO/CCITT
724	           OID space which uniquely identifies a given dataset
725	           among all datasets indexed by CIP.

727	   DSI:
728	           See Dataset Identifier.

730	   DSI-description:
731	           A human readable string optionally carried along with
732	           DSIs to make them more user-friendly. See Dataset
733	           Identifier.

735	   index object:
736	           The embodiment of the indices passed by CIP. An index
737	           object consists of some control attributes and an
738	           opaque payload.

740	   index object specification:
741	           A document describing an index object type for use
742	           with the CIP system described in this document. See
743	           index object and payload.

745	   index pushing:
746	           The act of presenting, unsolicited, an index to a
747	           peer CIP server.

749	   MIME:
750	           See Multipurpose Internet Mail Extensions.

752	   Multipurpose Internet Mail Extensions:
753	           A set of rules for encoding Internet Mail messages
754	           that gives them richer structure. CIP uses MIME rules
755	           to simplify object encoding issues. MIME is specified
756	           in RFC-1521 and RFC-1522.

758	   payload:
759	           The application domain specific indexing information
760	           stored inside an index object. The format of the
761	           payload is specified externally to this document, and
762	           depends on the type of the containing index object.

764	   polled server:
765	           A CIP server which receives a request to generate and
766	           pass an index to a peer server.

768	   polling server:
769	           A CIP server which generates a request to a peer
770	           server for its index.

772	   referral chain:
773	           The set of referrals generated by the process of
774	           routing a query. See query routing.

776	   query routing:
777	           Based on reference to indexing information,
778	           redirecting and replicating queries through a
779	           distributed database system towards the servers
780	           holding the actual results.
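
   As an illustration of the centroid entry in the glossary above, the
   following Python sketch builds a duplicate-free list of (template
   name, attribute name, token) tuples from a few records. The record
   layout, the function name, and the whitespace tokenization are
   assumptions made for the example.

```python
def build_centroid(records, tokenize=str.split):
    """Return the sorted, duplicate-free centroid tuples for some records.

    records: iterable of (template_name, {attribute_name: value}) pairs
             (hypothetical in-memory layout).
    """
    centroid = set()  # a set discards duplicate tuples
    for template, attributes in records:
        for attribute, value in attributes.items():
            for token in tokenize(value):
                centroid.add((template, attribute, token))
    return sorted(centroid)
```

   Two records that share a token under the same template and
   attribute contribute only one tuple, which is what keeps the index
   smaller than the data it summarizes.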

782	         This document expires 6 months from November 1997.