Network Working Group                                            M. Wahl
INTERNET-DRAFT                                       Critical Angle Inc.
Expires in six months from                               8 February 1997

                 A CIP-based Centroid Exchange for LDAP
                      draft-ietf-find-ldapc-00.txt

Status of this Memo

This document is an Internet Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Please note that this document reflects experimental software, and is incomplete. The underlying specification is likely to change slightly.

Abstract

This document describes how an LDAP server (the supplier) may transmit, through an out-of-band email, index information or attributes of its naming context to another LDAP server (the consumer). The consumer server will make use of this information when determining whether the supplier server is likely to have entries in that naming context which match a particular search filter. This assists the consumer in processing subtree searches in distributed directories.

1. Goals

The primary goal of this specification is to allow an LDAP-capable server with a large number of subordinate references to more efficiently perform a subtree search operation, without having to chain or refer sub-operations for each subordinate reference.

Secondary goals of this specification are:

- allow servers to efficiently handle all shapes of search filters in common use, even distressful searches like

     (|(cn=joe bloggs)(sn=joe bloggs)(uid=joe bloggs)
       (cn=*joe bloggs*)(sn=*joe bloggs*)(uid=*joe bloggs*)
       (cn~=joe bloggs)(sn~=joe bloggs)(uid~=joe bloggs))

- not require any modifications to the internals of servers holding subordinate contexts;
- allow the organizations which maintain these contexts to control bulk retrieval of the data, and to schedule when it may be retrieved;
- ensure that it is possible to protect against unauthorized disclosure of bulk directory data while in transit, and provide some protection against spoofing attacks.

The following are NON-goals of this version of the specification:

- exchange of index information without prior agreement (e.g. trawling);
- negotiation of agreements (separate document for this);
- updates of non-complete naming contexts;
- allowing the consumer to poll for updates;
- alias entries and/or non-hierarchical models;
- representation of access control or transfer of non-publicly-accessible attributes.

2. Introduction

This document defines a centroid-based index for a subtree of the directory. It is carried by an update MIME object.

The supplier server is an LDAP server which masters or shadows a naming context. In this protocol it acts as a "simple CIP leaf server". The supplier server will at intervals mail the update to the consumer server.

The consumer server is a different LDAP server with a subordinate or cross reference to the supplier server, which makes use of this update to determine how to route queries.

The centroid-based index consists of processed attribute values from entries. The supplier may send the complete attribute values, or if this would violate data protection laws, only approximate match codes of values, which are nonreversible.
As more processing is performed on values, the size of the index is reduced, as is its usefulness to the consumer.

The specification allows for both total and incremental updates to be sent.

This specification is intended for "push" environments, where there are many (tens of thousands of) naming contexts and a small number (dozens) of index consumers. The naming contexts are as a whole static; however, it is desirable for changes to take effect rapidly. (If the consumer were instead to start one poll per second to cover 100,000 suppliers, then if a person moved from one naming context to another, in the worst case the index information would be out of date for more than a day, making that person unlocatable.)

Thus it is not intended to represent index information in the directory itself. The index information is expected to be of interest to only a very small portion of users of the directory, and for legal reasons should only be visible to authorized servers. Furthermore the size of the index information is often proportional to the size of the rest of the DIT.

3. Agreement

An agreement is established between the organization which administers the supplier and the organization which administers the consumer, which specifies how the servers will communicate.

The agreement contains the following:

- "version": The version of the agreement and the index type. This specification describes the index type "x-ldap-centroid-1".
- "baseobject": The Distinguished Name of the prefix entry of the supplier's subtree.
- "scope": The subset of information in the supplier's subtree which the update information will index. For this version of the specification, the scope is always "subtree": the base object and all entries down to the leaves of the tree, including any subordinate naming contexts.
- "dsi": An OID which uniquely identifies the subtree and scope.
- "supplier": The hostname and listening LDAP port number of the supplier server, as well as any alternative servers holding that same naming contexts, in case the supplier is unavailable. - "consumeraddr": This is a URI of the "mailto:" form, with the RFC 822 email address of the consumer server. Subsequent versions of the specification may allow other forms of URI, so that the consumer may retrieve the update via the WWW, FTP or a Common Indexing Protocol. - "updateinterval": The maximum duration in seconds between occurances of the supplier server generating an update. If the consumer server has not received an update from the supplier server after waiting this long since the previous update, it is likely that the index information is now out of date. A typical value for a server with frequent updates would be 604800 seconds, or every week. Servers whose DITs are only modified annually could have a much longer update interval. - "securityoption": Whether and how the supplier server should sign and encrypt the update before sending it to the consumer server. Options for this version of the specification are "none": the update is sent in plaintext "PGP/MIME": the update is digitally signed and encrypted using PGP "Fortezza": the update is digitally signed and encrypted using Fortezza It is recommended that the "PGP/MIME" option be used when exchanging sensitive information across public networks, and both the supplier and consumer have PGP keys. The "Fortezza" option is intended for use in environments where security protocols are based on Fortezza-compatible devices. - Security Credentials: The long-term cryptographic credentials used for key exchange and authentication of the consumer and supplier servers, if a security option was selected. For "PGP/MIME", this will be the trusted public keys of both servers. For "Fortezza", this will be the certificate paths of both servers to a common point of trust. 4. 
Content Type The update consists of a MIME object of type application/cip-index-object. The parameters are: - "type": this has value "x-ldap-centroid-1". - "dsi": the DSI from the agreement. - "base-uri". A set of URIs, each of the "ldap:" form, separated by spaces. In each URI, the hostname/portno must be distinct, and based on the "supplier" part of the agreement. Each URI must have on the RHS the base object distinguished name. The payload is mostly textual data but may include bytes with the high bit set. The quoted-printable content-transfer-encoding is recommend to be used if there are any bytes with the high bit set, otherwise no transfer encoding is needed. This object may be encapsulated in a wrapper content (such as multipart/signed) or be encrypted as part of the security procedures. The resulting content will in this version of the specification be emailed. For example, an email without any security transformations may resemble: From: supplier@sup.com Date: Thu, 16 Jan 1997 13:50:37 -0500 Message-Id: <199701161850.NAA29295@sup.com> To: consumer@consumer.com <<-- from consumer server address Reply-To: supplier-admin@sup.com MIME-Version: 1.0 Content-Type: application/cip-index-object; type=x-ldap-centroid-1; dsi=1.3.6.1.4.1.1466.85.85.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16; base-uri="ldap://sup.com/dc=sup,dc=com ldap://alt.com/dc=sup,dc=com" ...payload here... The payload is series of CRLF-terminated lines. Each line is in the UTF-8 encoding of the Unicode (ISO-10646 BMP) character set. No other character sets are permitted by this version of the specification. Some supplier servers may only be able to generate the printable US-ASCII subset, but all consumer servers must be able to handle the full range of Unicode characters. The "x-ldap-centroid-1" index payload begins with a header section, which is followed by one or more attribute sections. Each section is separated from the others by a blank line. 
Multiple blank lines together are treated the same as a single blank line. Lines which begin with a '#' character are ignored.

4.1. Header Section

The header section consists of one or more lines in "type:value" format. The following types are defined:

- "version": This line must always be present, and have the value "1" for this version of the specification.
- "updatetype": This line must always be present. It takes as the value either "total" or "incremental". The first update sent by a supplier server to a consumer server for a DSI must be a "total" update.
- "thisupdate": This line must always be present. The value is an ASN.1 GeneralizedTime in UTC (Z suffix) of the time at which the supplier constructed this update.
- "lastupdate": This line must be present if the "updatetype" line has the value "incremental". It is an ASN.1 GeneralizedTime in UTC (Zulu), which is the value of the "thisupdate" of the previous update sent to the consumer.
- "contextsize": This line may be present at the supplier's option. The value is a number, which is the approximate total number of entries in the subtree. This information is provided for statistical purposes only.
- "attributetypes": This line may be present any number of times with distinct values at the supplier's option. The value is the string encoding of an LDAP AttributeTypeDescription. This allows the supplier to include privately defined or nonstandard attributes in the update. The supplier generally should not include in the header
    - descriptions of attribute types defined in X.500 or RFC 1274,
    - descriptions of attribute types for which it will not be including index information (other than presence),
    - descriptions of attribute types with privately-defined syntaxes,
    - descriptions of attribute types whose syntax is "DirectoryString" or "IA5String" and matching rule is "caseIgnoreMatch" or "caseIgnoreIA5Match".
  For each attribute description the supplier provides, the syntax and equality rule names must be included.
  The consumer may wish to make use of the oid, name, syntax and equality fields when processing queries using the information from the update. If the consumer does not recognize an attribute type in the update, and it is not defined in this header, the consumer must treat it as having an unknown oid, DirectoryString syntax, and caseIgnoreMatch equality matching.
- "chopbefore": This line may be present any number of times with distinct values at the supplier's option. The value is a Distinguished Name in LDAP format. The entry with this name and all its subordinates have been excluded from the generation of the index information. Typically the entry will be a context prefix or an administrative point.
  Note: It is assumed that either the consumer will establish a separate agreement to cover each excluded area, or these areas are to be ignored during searching.
- "chopafter": Similar to the "chopbefore" option, except that the index information does include the attributes of this entry, but not its subordinates. (E.g. the subordinate entries are held in a QUIPU DSA.)

An example header would be:

   version: 1
   updatetype: total
   thisupdate: 199701121341Z

4.2. Attribute Sections

This section is present any number of times following the header in an x-ldap-centroid-1 index object. Each section is separated from the others by a blank line. Each section corresponds to one attribute type.

The first line of the section consists of three parts separated by colons: the attribute type name, the match form, and the tokenization rule.

The match form is one of the following:

- equality: the tokens correspond to values of that attribute present in entries in the subtree.
- approx: the tokens correspond to approximate match codes of all values of that attribute present in entries in the subtree. The attribute must be of the DirectoryString or IA5String syntax.
- presence: the token indicates whether there is at least one value of that attribute present in any entry in the subtree.
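
The payload layout described so far (blank-line separated sections, '#' comment lines, "type:value" header lines, and a three-part first line in each attribute section) can be illustrated with a short parser. This is only a sketch under stated assumptions, not part of the specification: the function name parse_payload and the returned data shapes are hypothetical, token lines are returned raw (modifiers and escapes are not interpreted), and the section's first line is split from the right so that attribute options such as ";binary" stay attached to the attribute name.

```python
MATCH_FORMS = {"equality", "approx", "presence"}


def parse_payload(text):
    """Split an x-ldap-centroid-1 payload into header lines and sections.

    Returns (header, sections): header is a list of (type, value) pairs,
    because some header types may repeat; sections maps
    (attribute, match_form) to (tokenization_rule, raw_token_lines).
    """
    blocks, current = [], []
    for line in text.splitlines():          # splitlines() also handles CRLF
        if line.startswith("#"):            # comment lines are ignored
            continue
        if line.strip() == "":              # one or more blank lines end a block
            if current:
                blocks.append(current)
                current = []
            continue
        current.append(line)
    if current:
        blocks.append(current)
    if not blocks:
        raise ValueError("empty payload")

    header = []
    for line in blocks[0]:
        key, _, value = line.partition(":")  # split on the first colon only
        header.append((key.strip(), value.strip()))

    sections = {}
    for block in blocks[1:]:
        # attribute : match form : tokenization rule, split from the right
        attr, form, rule = block[0].strip().rsplit(":", 2)
        if form not in MATCH_FORMS:
            raise ValueError(f"unknown match form: {form!r}")
        sections[(attr, form)] = (rule, block[1:])
    return header, sections
```

A consumer would typically run something like this on the decoded MIME body and then check the "version" and "updatetype" header values before using the sections.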

The supplier should use the equality match form for attributes whose values are DirectoryString, IA5String or OID, which are useful for filtering, and which the supplier is willing to disclose to the consumer. If the supplier is not willing to disclose the values, the supplier should use the approx match form. For all other attributes on which the supplier permits searching, but which would not be useful to include in a centroid (because they are large, binary or do not compress well), the presence match form should be used.

Tokenization rules define the transformation from attribute values to tokens.

The "flat" tokenization rule is always applied. First, the attribute value is converted to an LDAP string. Non-string values (e.g. photo and audio) are discarded. Leading and trailing spaces are removed. Multiple consecutive spaces and non-printing characters are replaced by a single space. Lower case ASCII letters are converted to upper case. ASCII characters 0-31 and 127 are removed.

If the match form is "approx", the "soundex" rule is also applied. Each word (separated by spaces) is replaced by its SOUNDEX code and becomes its own token.

If the match form is "presence", the "presence" rule is applied. It generates only one token: "*" if there is any value.

If a character "#", "-", "+", ";" or "\" occurs in the token, each is preceded by a "\" character.

The supplier may append modifiers to each token. Modifiers are separated from the tokens (and each other) by a semicolon. One modifier is described here:

   ;ec=: An approximate count of the number of entries in which this token occurs.

Modifiers are optional and consumers may choose to ignore them.

If the "updatetype" is "total", each of the lines in the section after the first is one token. The order of lines is unimportant.

If the "updatetype" is "incremental", each of the lines starts with either a "+" or a "-" character, indicating that the token should be added or removed from the tokens of this attribute.
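
The "flat" tokenization rule and the escaping rule described earlier in this section can be illustrated as follows. This is a sketch under one possible reading of the rule, not a normative definition: it assumes that any run of spaces and ASCII control characters (0-31 and 127) collapses to a single space, so that no control characters remain afterwards. The function names flat_token, escape_token and token_line are hypothetical, and the "soundex" rule is not covered.

```python
import re
import string

# Only ASCII letters are upper-cased; other Unicode characters are left alone.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
# Assumption: runs of spaces and ASCII control characters become one space.
_RUNS = re.compile(r"[\x00-\x20\x7f]+")
# Characters that must be preceded by a backslash inside a token.
_SPECIAL = re.compile(r"([#\-+;\\])")


def flat_token(value):
    """Apply the 'flat' rule to a string-valued attribute value."""
    collapsed = _RUNS.sub(" ", value).strip(" ")
    return collapsed.translate(_ASCII_UPPER)


def escape_token(token):
    """Precede each of # - + ; and backslash with a backslash."""
    return _SPECIAL.sub(r"\\\1", token)


def token_line(value, entry_count=None):
    """Build one token line, optionally with the ;ec= modifier."""
    line = escape_token(flat_token(value))
    if entry_count is not None:
        line += f";ec={entry_count}"
    return line
```

For instance, token_line("Smith-Jones", 2) yields the token SMITH\-JONES followed by the modifier ;ec=2. A consumer applies the same flat_token step to the value presented in a search filter before comparing it against the tokens.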
Lines must be processed in the order they occur. Modifiers on a "-" token are ignored.

For example, if the subtree contained:

   dn:dc=sup,dc=com
   objectclass:top
   objectclass:domain
   dc:sup

   dn:cn=Joe Bloggs,dc=sup,dc=com
   objectclass:top
   objectclass:person
   objectclass:strongAuthenticationUser
   cn:Joe Bloggs
   cn:Joseph Bloggs
   sn:Bloggs
   userCertificate;binary::0281...

   dn:cn=Mary Bloggs,dc=sup,dc=com
   objectclass:top
   objectclass:person
   cn:Mary Bloggs
   sn:Bloggs

And the supplier generated equality match form for "sn", "objectclass" and "description", approximate match form with soundex for "cn", and presence match form for "userCertificate;binary", the payload might resemble:

   version:1
   updatetype: total
   thisupdate: 199701121341Z

   sn:equality:flat
   BLOGGS;ec=2

   objectclass:equality:flat
   TOP;ec=3
   DOMAIN;ec=1
   PERSON;ec=2
   STRONGAUTHENTICATIONUSER;ec=1

   description:equality:flat

   cn:approx:soundex
   J2;ec=1
   J3;ec=1          -- these are not the right codes
   B256;ec=1

   userCertificate;binary:presence:presence
   *;ec=1

If the subtree was subsequently modified to add a value to an entry

   description: seldom seen

then the next update might resemble:

   version:1
   updatetype: incremental
   thisupdate: 199701131339Z
   lastupdate: 199701121341Z

   description:equality:flat
   +SELDOM SEEN;ec=1

5. Aggregation

Aggregation may be performed to combine an index of a naming context with indices of ALL subordinate naming contexts. It cannot be usefully performed if there is missing index information for one or more subordinates.

When combining centroids, if an attribute centroid is provided by one index but not by any others, the attribute may be replaced by a presence index.

Chopafter and chopbefore lists are merged.

The base URIs are changed to point to the aggregating server and its alternatives. The DSI is distinct from that of any subordinate naming context's DSIs.

If all subordinate indexes included a contextsize header, then the aggregate may also have a contextsize header.
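
The merging rules above can be sketched as follows. This is an illustration only: the in-memory representation (a dictionary with "sections", "chopbefore", "chopafter" and an optional "contextsize") and the function name aggregate are hypothetical. Adding the approximate ";ec=" counts together and summing the contextsize values are assumptions of the sketch, and the optional replacement of a partially provided attribute by a presence index is not implemented.

```python
from collections import Counter


def aggregate(centroids):
    """Merge parsed centroids into one aggregate centroid.

    Each centroid is a dict with:
      "sections":    {(attribute, match_form): {token: entry_count}}
      "chopbefore":  iterable of DNs (optional)
      "chopafter":   iterable of DNs (optional)
      "contextsize": int (optional)
    """
    sections = {}
    chopbefore, chopafter = set(), set()
    sizes = []
    for centroid in centroids:
        for key, tokens in centroid["sections"].items():
            # Union of tokens; counts for the same token are added (assumption).
            sections.setdefault(key, Counter()).update(tokens)
        chopbefore |= set(centroid.get("chopbefore", ()))
        chopafter |= set(centroid.get("chopafter", ()))
        sizes.append(centroid.get("contextsize"))

    result = {
        "sections": {key: dict(tokens) for key, tokens in sections.items()},
        "chopbefore": chopbefore,
        "chopafter": chopafter,
    }
    # A contextsize is only produced if every input supplied one.
    if sizes and all(size is not None for size in sizes):
        result["contextsize"] = sum(sizes)
    return result
```

The aggregating server would still need to assign a new DSI and rewrite the base URIs itself; those steps depend on local configuration and are left out here.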

For example, suppose a server holds a naming context C=US with the following entry:

   dn:c=US
   objectclass: top
   objectclass: country
   c: US

It has two subordinate references "o=Foo,c=US" and "o=Bar,c=US". It consumes an x-ldap-centroid-1 from the Foo supplier:

   version: 1
   updatetype: total
   thisupdate: 199701131339Z

   objectclass:equality:flat
   TOP;ec=2
   ORGANIZATION;ec=1
   PERSON;ec=1

   o:equality:flat
   FOO;ec=1

   cn:equality:flat
   SHARON META;ec=1

   sn:equality:flat
   META;ec=1

And an x-ldap-centroid-1 from the Bar supplier:

   version: 1
   updatetype: total
   thisupdate: 199701131339Z

   o:equality:flat
   BAR;ec=1

   cn:equality:flat
   JEFF RUSSELL;ec=1
   JEFFREY RUSSELL;ec=1
   PENELOPE JONES;ec=1

   sn:equality:flat
   RUSSELL;ec=1
   JONES;ec=1

   uid:approx:soundex
   J311;ec=1
   J12;ec=1

   objectclass:equality:flat
   TOP;ec=3
   ORGANIZATION;ec=1
   PERSON;ec=2
   BARSPECIFICPERSON;ec=2

To aggregate, it first converts its own entry into a centroid. Since the server can determine that it is the only centroid without an o attribute, it can create an "o:equality:flat" section for itself with no values. This will prevent the o attribute information from being lost.

Only the Bar server sent the uid attribute. Since the Foo server did not include a "uid" attribute, the server derives that there are no uid values for the Foo server.

It then merges the o, cn, sn, uid and objectclass sections, to build the resulting centroid.

   version: 1
   completeness: index

   o:equality:flat
   FOO;ec=1
   BAR;ec=1

   cn:equality:flat
   SHARON META;ec=1
   JEFF RUSSELL;ec=1
   JEFFREY RUSSELL;ec=1
   PENELOPE JONES;ec=1

   sn:equality:flat
   META;ec=1
   RUSSELL;ec=1
   JONES;ec=1

   uid:approx:soundex
   J311;ec=1
   J12;ec=1

   objectclass:equality:flat
   TOP;ec=6
   ORGANIZATION;ec=2
   PERSON;ec=3
   BARSPECIFICPERSON;ec=2
   COUNTRY;ec=1

Finally, the server generates a new DSI. It can transfer this object to other servers as the index for C=US and all subordinates.

6. Use of Index Objects in the Consumer

This procedure is followed for each data set held by the consumer LDAP server, when a search operation is to be performed that includes the subtree in its scope. Index information is not used during bind password validation or comparison operations.

If the target object of the search is below the subtree prefix, then the operation is chained or referred to the supplier (or processed with a shadow copy), regardless of the contents of the index information.

7. Filter Evaluation

If the base object of a search is superior to the subtree prefix, the index information is used to make a routing decision.

The consumer will evaluate the client's search filter. The result will be one of the following outcomes:

- LIKELY: there is a good possibility that there is at least one matching entry in the naming context.
- POSSIBLE: there may or may not be any matching entries in the naming context.
- UNLIKELY: the only matching entries would be those which were modified or added to the naming context subsequently to the index information being generated.
- UNINDEXED: the supplying server will likely not have any matching entries as it does not allow searching on one or more attribute types referenced in the filter. This occurs if any of the attributes in the search filter are not represented in the centroid.

The consumer server should first chain or return referrals for subordinate naming contexts in which the evaluation returned LIKELY. If the time and size limits permit, the server should then chain or return referrals for subordinate naming contexts in which the evaluation returned POSSIBLE. The server should ignore subordinate contexts which were UNLIKELY or UNINDEXED.

Generated referrals will be URIs of the LDAP form, in which the base object is the name of the naming context and the scope is subtree. There may be multiple URIs in a referral, if there are alternate servers for this naming context.

7.1. "and" filter

Evaluate each of the component filters. If any filter returned "UNLIKELY", return "UNLIKELY". If any filter returned "POSSIBLE", return "POSSIBLE". Otherwise all filters returned "LIKELY", so return "LIKELY".

7.2. "or" filter

Evaluate each of the component filters. If all filters returned "UNLIKELY", return "UNLIKELY". If all filters returned "LIKELY", return "LIKELY". Otherwise return "POSSIBLE".

7.3. "equalityMatch" filter

If there is an attribute section for the presented attribute type of the equality match form, then tokenize the presented value and compare. If the presented value matches, return "LIKELY", otherwise return "UNLIKELY".

If there is an attribute section for the presented attribute type of the approx match form, then tokenize the presented value and compare. If the approximated presented value matches, return "POSSIBLE", otherwise return "UNLIKELY".

If there is an attribute section for the presented attribute type of the presence match form, then if the "*" token is present, return "POSSIBLE", otherwise return "UNLIKELY".

7.4. "substrings" filter

If there is an attribute section for the presented attribute type of the equality match form, then tokenize the filter values and apply the filter against each token. If there is a match, return "LIKELY", otherwise return "UNLIKELY".

If there is an attribute section for the presented attribute type of the approx match form, then tokenize the filter values and compare each against each token. If there is a match, return "LIKELY", otherwise return "POSSIBLE".

If there is an attribute section for the presented attribute type of the presence match form, then if the "*" token is present, return "POSSIBLE", otherwise return "UNLIKELY".

7.5. "approxMatch" filter

If there is an attribute section for the presented attribute type of the equality match form, then apply the filter against each token. If there is a match, return "LIKELY", otherwise return "UNLIKELY".

If there is an attribute section for the presented attribute type of the approx match form, then compare each tokenized word in the presented value against the tokens. If there is at least one matching word, return "LIKELY", otherwise return "UNLIKELY".

If there is an attribute section for the presented attribute type of the presence match form, then if the "*" token is present, return "POSSIBLE", otherwise return "UNLIKELY".

7.6. "present" filter

If there are any values or tokens of the presented attribute type in the index object, return "POSSIBLE", otherwise return "UNLIKELY".

7.7. "not", "greaterOrEqual", "lessOrEqual" and "extensibleMatch" filters

Return "UNLIKELY".

8. Recommendations

To be written.

9. Security Considerations

This specification provides a way to transfer directory information from one server to another. This may consist of white pages data, which is protected by privacy laws in many countries. Depending on the requirements of the data suppliers, a number of index encoding options are available, which provide a range of non-reversibility, at a cost of usefulness for the consumer.

The specification recommends that a digital signature be applied and the data be encrypted before being transferred to the consumer. This will allow the consumer to verify the source of the data, and to ensure that unauthorized parties are not able to access the data while in transit.

The specification is designed to work in environments where there is an agreement between the index supplier and consumer. This may be based on a legal or contractual agreement between the two parties, which defines the protections the consumer must provide to the index information.

10. Author's Address

   Mark Wahl
   Critical Angle Inc.
   4815 W Braker Lane #502-385
   Austin, TX 78759
   USA

   EMail: M.Wahl@critical-angle.com

Bibliography

   [CIP]
   [LDAP]
   [LDIF]
   [Fortezza]
   [UTF8] RFC 2044