FIND Working Group                                              J. Allen
Internet Draft                                Bunyip Information Systems
                                                        Michael Mealling
21 November 1997                                 Network Solutions, Inc.
Expire in six months

         The Architecture of the Common Indexing Protocol (CIP)

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Abstract

The Common Indexing Protocol (CIP) is used to pass indexing
information from server to server in order to facilitate query
routing. Query routing is the process of redirecting and
replicating queries through a distributed database system towards
servers holding the desired results. This document describes the
CIP framework, including its architecture and the protocol
specifics of exchanging indices.

1. Introduction

1.1 History and Motivation

The Common Indexing Protocol (CIP) is an evolution and refinement of
distributed indexing concepts first introduced in the Whois++
Directory Service [RFC1913, RFC1914]. While indexing proved useful in
that system to promote query routing, the centroid index object that
is passed among Whois++ servers is specifically designed for
template-based databases searchable by token-based matching. With
alternative index objects, the index-passing technology will prove
useful to many more application domains, not simply Directory
Services and those applications which can be cast into the form of
template collections.

The indexing part of Whois++ is integrated with the data access
protocol. The goal in designing CIP is to extract the indexing
portion of Whois++, while abstracting the index objects to apply more
broadly to information retrieval. In addition, another kind of
technology reuse has been undertaken by converting the ad-hoc data
representations used by Whois++ into structures based on the MIME
specification for structured Internet mail.

Whois++ used a version number field in centroid objects to facilitate
future growth. The initial version was "1". Version 1 of CIP (then
embedded in Whois++, and not referred to separately as CIP) had
support for only ISO-8859-1 characters, and for only the centroid
index object type.

Version 2 of the Whois++ centroid was used in the Digger software by
Bunyip Information Systems to notify recipients that the centroid
carried extra character set information. Digger's centroids can carry
UTF-8 encoded 16-bit Unicode characters, or ISO-8859-1 characters,
determined by a field in the headers.

This specification is for CIP version 3 (CIPv3). Version 3 is a major
overhaul of the protocol; nevertheless, through a short negotiation
sequence, CIP version 3 and earlier servers can interoperate in an
index-passing mesh.

1.2 CIP's Place in the Information Retrieval World

CIP facilitates query routing. CIP is a protocol used between servers
in a network to pass hints which make data access by clients at a
later date more efficient. Query routing is the act of redirecting
and replicating queries through a distributed database system towards
the servers holding the actual results via reference to indexing
information.

CIP is a "backend" protocol -- it is implemented in and "spoken"
among only network servers. These same servers must also speak some
kind of data access protocol to communicate with clients. During
query resolution in the native protocol implementation, the server
will refer to the indexing information collected by the CIP
implementation for guidance on how to route the query.

Data access protocols used with CIP must have some provision for
control information in the form of a referral. The syntax and
semantics of these referrals are outside the scope of this
specification.

2. Related Documents

This document is one of three documents. This document describes the
fundamental concepts and framework of CIP.

The document "MIME Object Definitions for the Common Indexing
Protocol" [CIP-MIME] describes the MIME objects that make up the
items that are passed by the transport system.

Requirements and examples of several transport systems are specified
in the "CIP Transport Protocols" [CIP-TRANSPORT] document.

A second set of documents describes the specifications for specific
index types.

3. Architecture

3.1 CIP in the Information Retrieval World

3.1.1 Information Retrieval in the Abstract

In order to better understand how CIP fits into the information
retrieval world, we need to first understand the unifying abstract
features of existing information retrieval technology. Next, we
discuss why adding indexing technology to this model results in a
system capable of query routing, and why query routing is useful.

An abstract view of the client/server data retrieval process includes
data sets and data access protocols. An individual server is
responsible for handling queries over a fixed domain of data. For the
purposes of CIP, we call this domain of data the dataset. Clients
make searches in the dataset and retrieve parts of it via a data
access protocol. There are many data access protocols, each optimized
for the data in question. For instance, LDAP and Whois++ are access
protocols that reflect the needs of the directory services
application domain. Other data access protocols include HTTP and
Z39.50.

3.1.2 Indexing Information Facilitates Query Routing

The above description reflects a world without indexing, where no
server knows about any other server. In some cases (as with X.500
referrals and HTTP redirects) a server will, as part of its reply,
implicate another server in the process of resolving the query.
However, those servers generate replies based solely on their local
knowledge. When indexing information is introduced into a server's
local database, the server now knows not only answers based on the
local dataset, but also answers based on external indices.
These indices come from peer servers, via an indexing protocol. CIP
is one such indexing protocol.

Replies based on index information may not be the complete answer.
After all, an index is not a replicated version of the remote
dataset, but a possibly reduced version of it. Thus, in addition to
giving complete replies from the local dataset, the server may give
referrals to other datasets. These referrals are the core feature
necessary for effective query routing. When CIP is used to pass
indices from server to server, the servers make a kind of investment.
At the cost of some resources to create, transmit and store the
indices, query routing becomes possible.

Query routing is the process of replicating and moving a query closer
to datasets which can satisfy the query. In some distributed systems,
widely distributed searches must be accomplished by replicating the
query to all sub-datasets. This approach can be wasteful of resources
both in the network, and on the servers, and is thus sometimes
explicitly disabled. Using indexing in such a system opens the door
to more efficient distributed searching.

While CIP-equipped servers provide the referrals necessary to make
query routing work, it's always the client's responsibility to
collate, filter, and chase the referrals it receives. This gives the
end-user (or agent, in the case that there's no human user involved
in the search) greatest control over the query resolution process.
The cost of the added client complexity is weighed against the
benefits of total control over query resolution. In some cases, it
may also be possible to decouple the referral chasing from the client
by introducing a proxy, allowing existing simple clients to make use
of query routing. Such a proxy would transparently resolve referrals
into concrete results before returning them to the simple-minded
client.
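
The referral chasing described above can be sketched in a few lines of
Python. This is an illustration only, under simplifying assumptions:
the MESH table, the answer() and resolve() helpers, the server names,
and the single-token queries are all hypothetical and are not defined
by CIP. A real client or proxy would speak its native data access
protocol and extract referrals from that protocol's replies.

```python
from collections import deque

# Hypothetical in-memory stand-in for three servers. "records" is the
# local dataset (token -> results); "indices" holds the token sets
# received from peer servers via index passing.
MESH = {
    "root": {
        "records": {},
        "indices": {"east": {"allen", "cip"}, "west": {"smith", "cip"}},
    },
    "east": {
        "records": {"allen": ["allen@east"], "cip": ["cip@east"]},
        "indices": {"west": {"smith", "cip"}},
    },
    "west": {
        "records": {"smith": ["smith@west"], "cip": ["cip@west"]},
        "indices": {"east": {"allen", "cip"}},
    },
}


def answer(server, token):
    """Return (hits, referrals) for one server: complete replies from the
    local dataset plus referrals to peers whose index mentions the token."""
    node = MESH[server]
    hits = list(node["records"].get(token, []))
    referrals = [
        peer for peer, index in node["indices"].items() if token in index
    ]
    return hits, referrals


def resolve(entry, token):
    """Client- or proxy-side loop: collate hits and chase each referral
    once, remembering visited servers so that cycles terminate."""
    seen = {entry}
    pending = deque([entry])
    results = []
    while pending:
        server = pending.popleft()
        hits, referrals = answer(server, token)
        results.extend(hits)
        for peer in referrals:
            if peer not in seen:
                seen.add(peer)
                pending.append(peer)
    return results


print(resolve("root", "allen"))  # ['allen@east']
print(resolve("root", "cip"))    # ['cip@east', 'cip@west']
```

In this toy mesh "east" and "west" index each other, so the second
query would loop forever without the "seen" set. Keeping that
bookkeeping on the client (or proxy) side matches the division of
labour described above: servers only hand out referrals.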

3.1.3 Abstracting the CIP Index Object

As useful as indices seem, the fact remains that not all queries can
benefit from the same type of index. For example, suppose the index
consists of a simple list of keywords. With such an index, it is
impossible to answer queries about whether two keywords were near one
another, or if a keyword was present in a certain context (for
instance, in the title).

Because of the need for application-domain-specific indices, CIP
index objects are abstract; they must be defined by a separate
specification. The basic protocols for moving index objects are
widely applicable, but the specific design of the index, and the
structure of the mesh of servers which pass a particular type of
index, are dependent on the application domain. This document
describes only the protocols for moving indices among servers.
Companion documents describe initial index objects.

The requirements that index type specifications must address are
specified in the [CIP-MIME] document.

3.2 Architectural Details

CIP implements index passing, providing the forward knowledge
necessary to generate the referrals used for query routing. The core
of the protocol is the index object. In the following sections, the
structure of the index objects themselves is presented. Next, how and
why indices are passed from server to server is discussed. Finally,
the circumstances under which a server may synthesize an index object
based on incoming ones are discussed.

3.2.1 The CIP Index Object

A CIP index object is composed of two parts, the header and the
payload. The header contains metadata necessary to process and make
use of the index object being transmitted. The actual index resides
in the payload.

Three particular headers warrant specific mention at this point.
The "type" of the index object selects one of many distinct CIP index
object specifications which define exactly how the index blocks are
to be created, parsed and used to facilitate query routing. Another
header of note is the "DSI", or Dataset Identifier, which uniquely
identifies the dataset from which the index was created. The third
header, which is crucial for generating referrals, is the "Base-URI".
The URI (or URIs) contained in this header form the basis of any
referrals generated based on this index block. The URI is also
used as input during the index aggregation process to constrain
the kinds of aggregation possible, due to multiprotocol constraints.
The exact syntax of these headers will be specified in the CIP MIME
specification document [CIP-MIME].

The payload is opaque to CIP itself. It is defined exclusively by the
index object specification associated with the object's MIME type.
Specifications on how to parse and use the payload are published
separately as "CIP index object specifications". This abstract
definition of the index object forms the basis of CIP's applicability
to indexing needs across multiple application domains.

A precise definition of the content and form of a CIP index block can
be found in the MIME object definitions document [CIP-MIME].

3.2.2 Moving Index Objects: How to Build a Mesh

Indices are transmitted among servers participating in a CIP mesh. By
distributing this information in anticipation of a query, efficient,
accurate query routing is possible at the time a query arrives.

A CIP mesh is a set of CIP servers which pass indices of the same
type among themselves. Typically, a mesh is arranged in a
hierarchical tree fashion, with servers nearer the root of the tree
having larger and more comprehensive indices. See Figure 1.
However, a CIP mesh is explicitly allowed to have lateral links in
it, and there may be more than one part of the mesh that has the
properties of a "root". Mesh administrators are encouraged to avoid
loops in the system, but they are not obliged to maintain a strict
tree structure. Clients wishing to completely resolve all referrals
they receive should protect against referral loops while attempting
to traverse the mesh to avoid wasting time and network resources.
See the section on "Navigating the Mesh" for a discussion of this.

   base level             index               index
   directory              servers             servers
   servers                for                 for
                          base level          lower-level
                          servers             index servers
    _______
   |       |
   |   A   |__
   |_______|  \            _______
               \---CIP----|       |
    _______               |   D   |__
   |       |   /---CIP----|_______|  \             ------
   |   B   |__/                       \--CIP------|      |
   |_______|                                      |  F   |
                                      /--CIP------|______|
                                     /
    _______                _______  /
   |       |              |       |-
   |   C   |-------CIP----|   E   |
   |_______|              |_______|-
                               |    \
                               r     \
    _______                    e      \            ______
   |       |                   f       \--CIP-----|      |
   |   G   |-------CIP---------e------------------|  H   |
   |_______|                   r                  |______|
           \--referral---|     r      --referral-/
                         |     a     |
                         |     l     |
                          \ 3  |  2  | 1
                           \--------/
                           |        |
                           | client |
                           |        |
                            --------

         Figure 1: Sample layout of the Index Service mesh

All indices passed in a given mesh are assumed, as of this writing,
to be of the same type (i.e. governed by the same CIP index object
specification). It may be possible to create gateways between meshes
carrying different index objects, but at this time that process is
undefined and declared to be outside the scope of this specification.

In the case where a CIP server receives an index of a type that
it does not understand, it _can_ pass that index forward untouched.
In the case where a server implementation decides not to accept
unknown indices, it should return an appropriate error message to
the server sending the index.
This behavior is to allow mesh
implementations to attempt heterogeneous meshes. As stated above,
heterogeneous meshes are considered to be ill-defined and as such
should be considered dangerous, i.e. "Here be dragons".

Experience suggests that this index passing activity should take
place among CIP servers as a parallel (and possibly lower-priority)
job to their primary job of answering queries. Index objects travel
among CIP servers by protocol exchanges explicitly defined in this
document, not via the server's native protocol. This distinction is
important, and bears repeating:

   Queries are answered (and referrals are sent) via the native
   data access protocol.

   Index objects are transferred via alternative means, as defined
   by this document.

When two servers cooperate to move indexing information, the pair are
said to be in a "polling relationship". The server that holds the
data of interest, and generates the index, is called the "polled
server". The other server, which is the one that collects the
generated index, is the "polling server".

In a polling relationship, the polled server is responsible for
notifying the polling server when it has a new index that the polling
server might be interested in. In response, the polling server may
immediately pick up the index object, or it may schedule a job to
pick up a copy of the new index at a more convenient time. However,
a polling server is not required to wait for the polled server to
notify it of changes. The polling server can request a new
index at any time.

Independent of the symmetric polling relationship, there's another
way that servers can pass indices using CIP. In an "index pushing"
relationship, a CIP server simply sends the index to a peer whenever
necessary, and allows the receiver to handle the index object as it
chooses.
The receiving server may refuse it, may accept it and then
silently discard it, may accept only portions of it (by accepting it
as is, then filtering it), or may accept it without question.

The index pushing relationship is intended for use by dumb leaf nodes
which simply want to make their index available to the global mesh of
servers, but have no interest in implementing the complete CIP
transaction protocol. It lowers the barriers to entry for CIP leaf
nodes. For more information on participating in a CIP mesh in this
restricted manner, see the section below on "Protocol Conformance".

CIP index passing operations take place across reliable transport
mechanisms, including both TCP connections and Internet mail
messages. The precise mechanisms are described in the Transport
document [CIP-TRANSPORT].

3.2.3 Index Object Synthesis

From the preceding discussion, it should be clear that indexing
servers read and write index objects as they pass them around the
mesh. However, a CIP server need not simply pass the in-bound indices
through as the out-bound ones. While it's always permissible to pass
an index object through to other servers, a server may choose to
aggregate two or more of them, thereby reducing redundancy in the
index, at the cost of longer referral chains.

A basic premise of index passing is that even while collapsing a body
of data into an index by lossy compression methods, hints useful to
routing queries will survive in the resulting index. Since the index
is not a complete copy of the original dataset, it contains less
information. Index objects can be passed along unchanged, but as more
and more information collects in the resulting index object,
redundancy will creep in again, and it may prove useful to apply the
compression again, by aggregating two or more index objects into one.

This kind of aggregation should be performed without compromising the
ability to correctly route queries while avoiding excessive numbers
of missed results. The acceptable likelihood of false negatives must
be established on a per-application-domain basis, and is controlled
by the granularity of the index and the aggregation rules defined for
it by the particular specification.
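
As a rough illustration of this trade-off, the following Python sketch
treats an index object as a plain set of tokens and aggregates by set
union. Both choices are assumptions made for the example: CIP leaves
the index format and the aggregation operation to each index object
specification, and the helper names build_index(), aggregate() and
may_hold() are hypothetical.

```python
def build_index(records):
    """Collapse a dataset into a lossy index: its set of distinct tokens."""
    return {token.lower() for record in records for token in record.split()}


def aggregate(*indices):
    """Merge several index objects into one by set union (assumed rule)."""
    merged = set()
    for index in indices:
        merged |= index
    return merged


def may_hold(index, query):
    """True if every query token occurs in the index, i.e. a referral
    toward the indexed dataset(s) would be issued for this query."""
    return all(token.lower() in index for token in query.split())


index_a = build_index(["alice smith montreal", "carol smith montreal"])
index_b = build_index(["bob jones montreal"])
merged = aggregate(index_a, index_b)

# Token counts before and after aggregation.
print(len(index_a) + len(index_b), len(merged))  # 7 6

# A query that neither dataset can satisfy still matches the merged index.
print(may_hold(index_a, "alice jones"), may_hold(merged, "alice jones"))  # False True
```

With this particular rule every token present in an input survives, so
aggregation adds no missed results; the redundancy removed is the
shared token "montreal". The price is precision: the query "alice
jones" matches the merged index although neither dataset contains both
tokens, so a client would chase a referral that yields nothing. Other
index types and aggregation rules may behave differently.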

When CIP is used in a multi-protocol application domain,
such as a Directory Service (with contenders including Whois++, LDAP,
and Ph), aggregation gets significantly trickier. The fundamental
problem is to avoid forcing a referral chain to pass through part of
the mesh which does not support the protocol by which that client
made the query. If this ever happens, the client loses access to any
hits beyond that point in the referral chain, since it cannot resolve
the referral in its native data access protocol. This is a failure of
query routing, which should be avoided.

In addition to multi-protocol considerations, server managers may
choose not to allow index object aggregation for performance reasons.
As referral chains lengthen, a client needs to perform more
transactions to resolve a query. As the number of transactions
increases, so do the user-perceived delays, the system loads, and the
global bandwidth demands. In general, there's a tradeoff between
aggressive aggregation (which leads to reductions in the indexing
overhead) and aggressive referral chain optimization. This tradeoff,
which is also sensitive to the particular application domain, needs
to be explored more in actual operational situations.

Conceptually, a CIP index server has several index objects on hand at
any given time. If it holds data in addition to indexing information,
the server has an index object formed from its own data, called the
"local index". It may have one or more indices from remote servers
which it has collected via the index passing mechanisms. These are
called "in-bound indices".

   Implementor's Note: It may not be necessary to keep all of
   these structures intact and distinct in the local database. It
   is also not required to keep the out-bound index (or indices)
   built and ready to distribute at all times.
   The previous paragraph merely introduces a useful model for
   expressing the aggregation rules. Implementors are free to model
   index objects internally however they see fit.

The following two rules control how a CIP server formulates its
outgoing indices:

   1. An index server may pass any of the index objects in its
      local index and its in-bound indices through unchanged to
      polling servers.

   2. If and only if the following three conditions are true, an
      index server can aggregate two or more index objects into a
      single new index object, to be added to the set of out-bound
      indices.

      a. Each index object to be aggregated covers exactly the
         same set of protocols, as defined by the scheme component
         of the Base-URIs in each index object.

      b. The index server supports every one of the data access
         protocols represented by the Base-URIs in the index objects
         to be aggregated.

      c. The specification for the index object type specified
         by the type header of the index objects explicitly
         defines the aggregation operation.

   The resulting index object must have Base-URIs characteristic
   of the local server for each protocol it supports. The outgoing
   objects should have the DSI of the local server.

4. Navigating the Mesh

With the CIP infrastructure in place to manage index objects, the
only problem remaining is how to successfully use the indexing
information to do efficient searches. CIP facilitates query routing,
which is essentially a client activity. A client connects to one
server, which redirects the query to servers "closer to" the answer.
This redirection message is called a referral.

4.1 The Referral

The concept of a referral and the mechanism for deciding when one
should be issued are described by CIP. However, the referral itself
must be transferred to the client in the native protocol, so its
syntax is not directly a CIP issue.
The mechanism for deciding that a
referral needs to be made and generating that referral resides in
the CIP implementation in the server. The mechanism for sending
the referral to the client resides in the server's native protocol
implementation.

A referral is made when a search against the index objects held by
the server shows that there may be hits available in one of the
datasets represented by those index objects. If more than one index
object indicates that a referral must be generated to a given
dataset, the server should generate only one referral to the given
dataset, as the client may not be able to detect duplicates.

Though the format of the referral is dependent on the native
protocol(s) of the CIP server, the baseline contents of the referral
are constant across all protocols. At the least, a DSI and a URI must
be returned. The DSI is the DSI associated with the dataset which
caused the hit. This must be presented to the client so that it can
avoid referral loops. The Base-URI parameter which travels along with
index objects is used to provide the other required part of a
referral.

The additional information in the Base-URI may be necessary for the
server receiving the referred query to correctly handle it. A good
example of this is an LDAP server, which needs a base X.500
distinguished name from which to search. When an LDAP server sends a
centroid-format index object up to a CIP indexing server, it sends a
Base-URI along with the name of the X.500 subtree for which the index
was made. When a referral is made, the Base-URI is passed back to the
client so that it can pass it to the original LDAP server.

As usual, in addition to sending the DSI, a DSI-Description header
can optionally be sent.
Because a client may attempt to check with
the user before chasing the referral, and because this string is the
friendliest representation of the DSI that CIP has to offer, it
should be included in referrals when available (i.e. when it was sent
along with the index object).

4.2 Cross-protocol Mappings

Each data access protocol which uses CIP will need a clearly defined
set of rules to map queries in the native protocol to searches
against an index object. These rules will vary according to the data
domain. In principle, this could create a bit of a scaling
difficulty; for N protocols and M data domains, there would be N x M
mappings required. In practice, this should not be the case, since
some access protocols will be wholly unsuited to some data domains.
Consider, for example, an LDAP server trying to make a search in an
index object composed from unorganized text-based pages. What
would the results be? How would the client make sense of the results?

However, as pre-existing protocols are connected to CIP, and as new
ones are developed to work with CIP, this issue must be examined. In
the case of Whois++ and the CENTROID index type, there is an
extremely close mapping, since the two were designed together. When
hooking LDAP to the CENTROID index type, it will be necessary to map
the attribute names used in the LDAP system to attribute names which
are already being used in the CENTROID mesh. It will also be
necessary to tokenize the LDAP queries under the same rules as the
CENTROID indexing policy, so that searches will take place correctly.
These application- and protocol-specific actions must be specified in
the index object specification, as discussed in the [CIP-MIME]
document.

4.3 Moving through the mesh

From a client's point of view, CIP simply pushes all the "hard work"
onto its shoulders.
After all, it's the client which needs to track down the real data.
While this is true, it's very misleading. Because the client has
control over the query routing process, the client has total control
over the size of the result set, the speed with which the query
progresses, and the depth of the search.

The simplest client implementation simply provides referrals to the
user in a raw, ready-to-reuse form, without attempting to follow
them. For instance, one Whois++ client, which interacts with the user
via a Web-based form, simply makes referrals into HTML hypertext
links. Encoded in the link via the HTML forms interface GET encoding
rules is the data of the referral: the hostname, port, and query. If
a user chooses to follow the referral link, they execute a new search
on the new host. A more savvy client might present the referrals to
the user and ask which should be followed. And, assuming appropriate
limits were placed on search time and bandwidth usage, it might be
reasonable to program a client to follow all referrals automatically.

When following all referrals, a client must show a bit of
intelligence. Remember that the mesh is defined as an interconnected
graph of CIP servers. This graph may have cycles, which could cause
an infinite loop of referrals, wasting the servers' time and the
client's too. When faced with the job of tracking down all referrals,
a client must use some form of mesh traversal algorithm. Such an
algorithm has been documented for use with Whois++ in RFC-1914. The
same algorithm can be easily used with this version of CIP. In
Whois++ the equivalent of a DSI is called a handle. With this
substitution, the Whois++ mesh traversal algorithm works unchanged
with CIP.

Finally, the mesh entry point (i.e. the first server queried) can
have an impact on the success of the query.
To avoid scaling issues, it is not acceptable to use a single "root"
node and force all clients to connect to it. Instead, clients should
connect to a reasonably well connected (with respect to the CIP mesh,
not the Internet infrastructure) local server. If no match can be
made from this entry point, the client can expand the search by
asking the original server which servers poll it. In general, those
servers will have a better "vantage point" on the mesh, and will turn
up answers that the initial search didn't. The mechanism for
dynamically determining the mesh structure like this exists, but is
not documented here for brevity. See RFC-1913 for more information on
the POLLED-BY and POLLED-FOR commands.

It should still be noted that, while these mesh operations are
important to optimizing the searches that a client should make, the
client still speaks its native protocol. This information must be
communicated to the client without requiring the client to
understand CIP.

5. Security Considerations

In this section, we discuss the security considerations necessary
when making use of this specification. There are at least two levels
at which security considerations come into play. Indexing information
can leak undesirable amounts of proprietary information, unless
carefully controlled. At a more fundamental level, the CIP protocol
itself requires external security services to operate in a safe
manner. Both topics are covered below.

5.1 Secure Indexing

CIP is designed to index all kinds of data. Some of this data might
be considered valuable, proprietary, or even highly sensitive by the
data maintainer. Take, for example, a human resources database.
Certain public bits of data, in moderation, can be very helpful for a
company to make public. However, the database in its entirety is a
very valuable asset, which the company must protect.
Much experience has been gained in the directory service community
over the years as to how best to walk this fine line between
completely revealing the database and making useful pieces of it
available.

Another example where security becomes a problem is for a data
publisher who'd like to participate in a CIP mesh. The data that
publisher creates and manages is the prime asset of the company.
There is a financial incentive to participate in a CIP mesh, since
exporting indices of the data will make it more likely that people
will search the publisher's database. (Making profit off of the
search activity is left as an exercise to the entrepreneur.) Once
again, the index must be designed carefully to protect the database
while providing a useful synopsis of the data.

One of the basic premises of CIP is that data providers will be
willing to provide indices of their data to peer indexing servers.
Unless they are carefully constructed, these indices could constitute
a threat to the security of the database. Thus, security of the data
must be a prime consideration when developing a new index object
type. The risk of reverse engineering a database based only on the
index exported from it must be kept to a level consistent with the
value of the data and the need for fine-grained indexing.

Acknowledgments

Thanks to the many helpful members of the FIND working group for
discussions leading to this specification.

Specific acknowledgment is given to Jeff Allen, formerly of Bunyip
Information Systems. His original version of these documents helped
enormously in crystallizing the debate and consensus. Most of the
actual text in this document was originally authored by Jeff.

Authors' Addresses

Jeff R. Allen                        Michael Mealling
Bunyip Information Systems, Inc.     Network Solutions, Inc.
310 Ste-Catherine West, Suite 300    505 Huntmar Park Drive
Montreal, Quebec H2X 2A1             Herndon, VA 22070
Canada

Phone: +1-514-875-8611               Phone: (703) 742-0400
EMail: jeff@bunyip.com               EMail: michael.mealling@RWhois.net

References

[RFC1913]
   Weider, C., Fullton, J., and S. Spero, "Architecture of the
   Whois++ Index Service", RFC 1913, Bunyip, CNIDR, EIT,
   February 1996.

[RFC1914]
   Faltstrom, P., Schoultz, R., and C. Weider, "How to Interact with
   a Whois++ Mesh", RFC 1914, Bunyip, KTHNOC, February 1996.

[CIP-MIME]
   Allen, J., and M. Mealling, "MIME Object Definitions for the
   Common Indexing Protocol (CIP)", IETF FIND WG, June 1997.

[CIP-TRANSPORT]
   Allen, J., and P. Leach, "CIP Transport Protocols", WebTV,
   Microsoft, June 1997.

Appendix A: Glossary

application domain:
   A problem domain to which CIP is applied which has indexing
   requirements which are not subsumed by any existing problem
   domain. Separate application domains require separate index
   object specifications, and potentially separate CIP meshes. See
   index object specification.

centroid:
   An index object type used with Whois++. In CIP versions before
   version 3, the index was not extensible, and could only take the
   form of a centroid. A centroid is a list of (template name,
   attribute name, token) tuples with duplicate headers removed.

dataset:
   A collection of data (real or virtual) over which an index is
   created. When a CIP server aggregates two or more indices, the
   resultant index represents the index from a "virtual dataset",
   spanning the previous two datasets.

Dataset Identifier:
   An identifier chosen from any part of the ISO/CCITT OID space
   which uniquely identifies a given dataset among all datasets
   indexed by CIP.

DSI:
   See Dataset Identifier.

DSI-description:
   A human-readable string optionally carried along with DSIs to
   make them more user-friendly.
   See Dataset Identifier.

index object:
   The embodiment of the indices passed by CIP. An index object
   consists of some control attributes and an opaque payload.

index object specification:
   A document describing an index object type for use with the CIP
   system described in this document. See index object and payload.

index pushing:
   The act of presenting, unsolicited, an index to a peer CIP
   server.

MIME:
   See Multipurpose Internet Mail Extensions.

Multipurpose Internet Mail Extensions:
   A set of rules for encoding Internet Mail messages that gives
   them richer structure. CIP uses MIME rules to simplify object
   encoding issues. MIME is specified in RFC-1521 and RFC-1522.

payload:
   The application domain specific indexing information stored
   inside an index object. The format of the payload is specified
   externally to this document, and depends on the type of the
   containing index object.

polled server:
   A CIP server which receives a request to generate and pass an
   index to a peer server.

polling server:
   A CIP server which generates a request to a peer server for its
   index.

referral chain:
   The set of referrals generated by the process of routing a query.
   See query routing.

query routing:
   Based on reference to indexing information, redirecting and
   replicating queries through a distributed database system towards
   the servers holding the actual results.

This document expires 6 months from November 1997.
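
As a non-normative illustration of the referral loop avoidance
described in Section 4.3, the following Python sketch follows
referrals breadth-first and contacts each dataset at most once by
remembering the DSIs it has already seen. It is not the RFC-1914
algorithm itself, only a minimal sketch in the same spirit. The names
Referral, traverse, send, and demo_send, the cip:// URIs, the OID
values, and the max_servers limit are all assumptions made for the
example; in a real client, send would be provided by the client's
native protocol implementation.

```python
from collections import deque
from typing import Callable, List, NamedTuple, Set, Tuple


class Referral(NamedTuple):
    dsi: str               # Dataset Identifier (an OID) of the dataset
    uri: str               # where to send the referred query
    description: str = ""  # optional DSI-Description, for display


# Hypothetical transport hook: (uri, query) -> (hits, referrals).
SendFn = Callable[[str, str], Tuple[List[str], List[Referral]]]


def traverse(entry: Referral, query: str, send: SendFn,
             max_servers: int = 50) -> List[str]:
    """Follow referrals breadth-first, chasing each DSI at most once."""
    hits: List[str] = []
    seen: Set[str] = {entry.dsi}   # DSIs already queued or queried
    pending = deque([entry])
    contacted = 0
    while pending and contacted < max_servers:
        ref = pending.popleft()
        contacted += 1
        found, referrals = send(ref.uri, query)
        hits.extend(found)
        for r in referrals:
            if r.dsi not in seen:  # skip duplicates and cycles
                seen.add(r.dsi)
                pending.append(r)
    return hits


# A three-server demo mesh with cycles (all values are made up).
REF_A = Referral("1.3.6.1.4.1.99999.1", "cip://a.example", "Dataset A")
REF_B = Referral("1.3.6.1.4.1.99999.2", "cip://b.example", "Dataset B")
REF_C = Referral("1.3.6.1.4.1.99999.3", "cip://c.example", "Dataset C")

DEMO_MESH = {
    "cip://a.example": ([], [REF_B, REF_C]),
    "cip://b.example": (["b-hit"], [REF_C, REF_A]),
    "cip://c.example": (["c-hit"], [REF_A]),
}


def demo_send(uri: str, query: str) -> Tuple[List[str], List[Referral]]:
    """Stand-in for a native-protocol query; ignores the query text."""
    return DEMO_MESH[uri]


results = traverse(REF_A, "smith", demo_send)
```

The max_servers bound stands in for the limits on search time and
bandwidth that Section 4.3 assumes for a client that follows all
referrals automatically.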