INTERNET-DRAFT
John C. Klensin
May 28, 2001
Expires November 2001

A Search-based access model for the DNS
draft-klensin-dns-search-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This document supplements a companion document [DNSROLE] on the role of the DNS relative to the uses to which it is being put and is intended to start laying the groundwork for a specific proposal. Both documents, their successors, and closely-related issues, can be discussed on the mailing list at ietf-i18n-dns-directory@imc.org. See http://www.imc.org/ietf-i18n-dns-directory/ for subscription and archival information.

Copyright Notice

Copyright (C) The Internet Society (2000). All Rights Reserved.

0. Abstract

This memo discusses strategies for supporting "DNS searching" -- finding of names in the DNS by a layered mechanism that permits fuzzy matching, selection that uses attributes or facets, and use of descriptive terms. Demand for these facilities appears to be increasing with growth in the Internet (and especially the web) and with requirements to move beyond the restricted subset of ASCII names that have been the traditional contents of DNS "Class=IN".
This document proposes a three-level system for access to DNS names in which the upper two levels involve search, rather than lookup, functions. It also discusses some of the issues and challenges in completing the design of, and deploying, such a system.

1. Introduction and Executive Summary

The notion of "DNS searching" is somewhat of an oxymoron: the DNS is structured to only perform exact lookups of structured label strings. But, as discussed elsewhere, there is considerable demand for searching facilities -- partial and fuzzy matching, selection that uses attributes or facets, and searching using descriptive terms -- and that demand appears to be increasing with growth in the Internet (and especially the web) and with requirements to move beyond the restricted subset of ASCII names that have been the traditional contents of DNS Class=IN.

This document proposes a three-level system for access to DNS names in which the upper two levels involve search, rather than lookup, functions. It also discusses some of the issues and challenges in completing the design of, and deploying, such a system.

It has been suggested that introducing a "directory" or "keywords" into, or above, the DNS could be used as a solution to the IDN problem and, often, several others. Probing statements about "directories" often quickly demonstrates that their advocates don't agree on what they mean.

This section outlines a three-layer search/lookup model (adding two layers to the DNS, i.e., constructing a three-layer model, rather than continuing with the single one we have today). Those layers consist of the current DNS, a search-capable layer using an extremely simple set of facets, and a layer capable of broader search approaches in a localized context. It is intended as a strawman for criticism and development, rather than as a specific proposal. I.e., the details are left for WG efforts.
The document suggests that it is better to add two layers to the DNS -- constructing a three-layer model -- rather than just one. And I'm going to try to avoid using the word "directory" as if it meant anything, but I presume readers will insert that term when it matches their perceptions.

This document is a preliminary proposal -- a framework and fodder for a working group or design team -- rather than a complete specification or even an approximation to one.

2. A three (or four)-layer environment.

The material below suggests three or more layers:

   (1) The DNS, with the existing lookup mechanisms
   (2) A restricted, facet-based, search system.
   (3) Commercial, localized, and potentially topic-specific, search environments.
   (4) Something else?

2.1. Layer one: Identifiers -- a lookup system and the DNS.

In this model, the DNS remains largely as is (see section 3.3ff) or, perhaps, a bit closer to its original purpose and assumption than the direction in which it has evolved in recent years. I.e., it is a distributed database, with precise lookups, whose lookup keys are identifiers for Internet hosts and other objects. We give up the notion that these identifiers should also serve as human-useful names or at least try to do so.

As an aside, note that some people have suggested that we should dehumanize DNS names entirely, e.g., prohibit the registration and use of any name that can be found in any dictionary for any language that can be represented in the DNS-acceptable character set. This proposal doesn't include that idea. But it is absent primarily because I don't think the transition process is worth the time it would take to explore rather than because it has no appeal.

The goal at this layer is relatively simple, unique, identifiers. It is probably desirable that these identifiers be able to have some human mnemonic value, but less important that they be tightly bound to real-world names and descriptions.
The inputs and outputs at this layer are as they are in the DNS today, although modifications to accommodate non-hosttable format names there remain possible if that is deemed important.

2.2 Layer two: Names -- a faceted search system with a small number of facets.

Much of the current burden borne by the DNS would appear to be better localized in a search system that contains names and a small number of facets/attributes. This burden includes a whole range of non-identifier goals and constraints: names that a user can understand and find and that have significant mnemonic value, names with trademark implications, a wide variety of naming systems and, in general, helping people find the things for which they are looking. It is critical that the number of attributes be constrained to a minimal set -- and that other attributes, especially those of special interest, be deferred to the third layer.

It is probably most useful to think about this layer in terms of a structured, multifaceted, multihierarchical, thesaurus-like database with search capability (Cf. ISO IS 5127-1 and IS 5127-6 [THES]), rather than as a "directory" in the sense of X.500 and its derivatives and antagonists.

The key question is what facets to use once the commercial product requirements are removed (to layer three, see below). It appears to me that, to satisfy the critical name-uniqueness and real world pressures on the DNS, candidates might be:

   name (IS 10646, see below)

   language (presumably per RFC 3066)

   geographical location (country, and/or for some federal countries, country/province ("state"); granularity is important; there may be a case for an additional facet in a coordinate system)

   network location (If we can figure out what that means and how to express it in a canonical way.)

   industry category code (For companies, presumably derived from the Madrid Treaty [MADRID] list; the list would need to be extended to deal with non-commercial organizations and entities and for identifying resources and services associated with people.)

This typology gives the trademark view of the world somewhat more precedence in looking at name conflict issues than one might like in principle. But, in practice, one of the key issues we have encountered in trying to store "names", rather than identifiers, in the DNS is that the process unreasonably flattens the space. That "Joe's Auto Repair" and "Joe's Pizza" can co-exist in the same geographical area without conflict or confusion and that "Joe's Pizza" in one area can co-exist with "Joe's Pizza" in another, again without conflict or confusion, are the consequence of the way we name and identify things in the real world. Most trademark rules are the consequence of those naming systems, not their cause.

It is not intended that this level act as a white pages service for people. Doing so leads down several slippery slopes at once, including heightened privacy concerns and a stronger requirement for URL targets rather than DNS label ones (see below).

The names in this environment can reasonably be written in IS 10646 codes or some recoding of them. Since we would be starting more or less from scratch, we could select lengths and codings for maximum efficiency and utility, not to meet the constraints of existing software. In such a context, this author has a slight bias for direct UCS-4 coding, rather than ASCII-compatible ("ACE") codes; compressed, null-octet-eliminating, systems such as UTF-8; or surrogate introducers to hold things to 16 bits. The loss in transport efficiency is likely to be more than compensated for by gains in cleanliness and equal treatment of all scripts. But that issue is separate from the main and important design arguments of this document.
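The transport-efficiency tradeoff mentioned above can be made concrete with a small calculation. The following is only an illustrative sketch: the sample names are invented for this example, Python's "utf-32-be" codec is used as a stand-in for direct UCS-4 coding, and "utf-16-be" stands in for the surrogate-based 16-bit approach. ACE codings are omitted because their details were not settled in this document.

```python
# Compare the octet cost of candidate codings for a few sample names.
# The sample names are hypothetical and are not taken from the draft.
SAMPLE_NAMES = [
    "Joe's Pizza",                     # ASCII only
    "Caf\u00e9 Z\u00fcrich",           # Latin script with two accented letters
    "\u6771\u4eac\u30bf\u30ef\u30fc",  # five CJK/Katakana code points
]


def encoded_sizes(name: str) -> dict:
    """Return the number of octets each coding needs for `name`."""
    return {
        "code_points": len(name),
        # Fixed four octets per code point (no byte-order mark).
        "ucs4": len(name.encode("utf-32-be")),
        # Two octets per code point, four when a surrogate pair is needed.
        "utf16": len(name.encode("utf-16-be")),
        # One to four octets per code point, depending on the script.
        "utf8": len(name.encode("utf-8")),
    }


if __name__ == "__main__":
    for sample in SAMPLE_NAMES:
        print(repr(sample), encoded_sizes(sample))
```

Under these assumptions the fixed-width coding costs the same four octets per character for every script, while the variable-width codings favor some scripts over others, which is the "equal treatment" point made above.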
The work done for "nameprep" in the IDN WG is almost certainly relevant to determining which names to actually store in the database. But the stakes are lower here than the "get it right or fail completely" constraint of the DNS lookup environment: one can imagine search mechanisms that would apply a more liberal set of matching rules (and/or localized and language-specific ones) than the rules used to encode names (much like recent applications protocols that explicitly distinguish between the formats one is permitted to send and those one is expected to accept (Cf. [RFC2822])).

As is common with systems of this type, we would anticipate the possibility of searching on any of the attributes and that searching on free-text strings would not be exact (i.e., near-match responses could be returned using any of several algorithms, with the user making choices). As is equally common, we should think about user interfaces that store both queries and response sets so that the responses could be used offline and refreshed when the client systems were attached to the Internet.

In summary, the goal at this layer is to provide unique tuples of human-recognizable (not just mnemonic) names, but names that are unique within a context, rather than a global system based on the names alone.

The inputs at this layer are search values for one or more of the facets. The outputs are still controversial, but would appear to best be the full facet set of the matched tuple(s) and one or more DNS names. One of many interesting questions is whether this layer should pass through and return the DNS records themselves (labels, class, type, and target) or whether it should return names (labels) and let the applications do the DNS lookups.

Another possibility is to return one or more URLs (or more general URIs?) rather than DNS names. Doing so increases flexibility but at the cost of greater complexity and risk of recursion problems.
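To make the layer-two inputs and outputs more tangible, here is a minimal sketch of a faceted search that matches non-name facets exactly and the name facet approximately, returning the full matched tuples together with a DNS name. Everything in it is an assumption made for illustration: the record layout, the facet labels, the example entries and domain names, the 0.6 threshold, and the use of Python's difflib as the near-match algorithm. The text above deliberately leaves the choice of matching algorithm open.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entry:
    """One hypothetical layer-two tuple plus its layer-one identifier."""
    name: str       # human-recognizable name
    language: str   # e.g. an RFC 3066 tag
    location: str   # e.g. an IS 3166 country code
    category: str   # illustrative industry category label
    dns_name: str   # the DNS name returned to the application


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; one of many possible algorithms."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def search(entries, name=None, threshold=0.6, **facets):
    """Return entries whose facets match exactly and whose name is a near match.

    Results are ordered best match first so a user can choose among them.
    """
    hits = []
    for entry in entries:
        # Non-name facets are treated as controlled vocabularies: exact match.
        if any(getattr(entry, key) != value for key, value in facets.items()):
            continue
        score = 1.0 if name is None else similarity(name, entry.name)
        if score >= threshold:
            hits.append((score, entry))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [entry for _, entry in hits]


ENTRIES = [
    Entry("Joe's Pizza", "en", "GB", "restaurant", "joes-pizza-gb.example"),
    Entry("Joe's Pizza", "en", "US", "restaurant", "joes-pizza-us.example"),
    Entry("Joe's Auto Repair", "en", "GB", "auto-repair", "joes-auto.example"),
]

if __name__ == "__main__":
    # A misspelled query still finds the intended entry within its context.
    for match in search(ENTRIES, name="Joes Piza", location="GB"):
        print(match)
```

Note that the two "Joe's Pizza" entries coexist because the tuples, not the names alone, are unique. The sketch returns DNS names and leaves the lookup to the application, which is only one of the output options discussed above.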
One possibility would be to create a URI for DNS record information and use it to abstract this return information into something applications can then specify or decode as appropriate. Use of this would need to be carefully structured to avoid complex problems, but might be a reasonable approach.

Experience with the DNS and other distributed databases also argues persuasively that these records are not forever. Unless there are no local copying and caching mechanisms (which seems unlikely and hard to enforce), some type of time to live (TTL) or other expiration or reverification mechanism will be needed.

2.3. Layer three: locality and/or content-domain-specific lookup mechanisms.

The problem with the second-layer model is that there are a number of usability and marketplace pressures for naming systems that offer finer granularity and better match user needs. Interestingly, those systems which have been included in experiments or partially deployed (see, e.g., [RFC2345], [Netword], and [RealNames]) have demonstrated that these systems require contextual localization, not a single global environment. There are many causes for this, but the need for very specific searches that are geographic-area, topic-area, or language or culturally specific tends to dominate the list.

The issue is perhaps illustrated by an example. Suppose the granularity of an entry at the second level is

   {"Joe's", "UK", Restaurant,... }

Now, I might want to create a business around a restaurant directory for Bristol. I would probably want to construct a database that contained exact locations, type of food, menu information, prices, etc., and permit people to query it that way. That type of product bears a strong relationship to traditional yellow pages services: the right attributes to collect and the right way to organize them will differ by topic (e.g., "menu" has no obvious analogy in an automobile repair shop) and the business models are fairly established.
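The restaurant-directory example can be sketched as a small topic-specific service that keeps its own attributes and answers queries with layer-two tuples rather than DNS names. This is a sketch under stated assumptions, not a proposed schema: the class names, the attribute set (cuisine, street, price), and the sample listings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerTwoName:
    """The coarse second-level tuple, as in {"Joe's", "UK", Restaurant, ...}."""
    name: str
    location: str
    category: str


@dataclass(frozen=True)
class Listing:
    """Topic-specific attributes held only by this layer-three service."""
    target: LayerTwoName  # what the service hands back to the caller
    cuisine: str
    street: str
    typical_price: float  # hypothetical price of a typical meal


def find_restaurants(listings: List[Listing],
                     cuisine: Optional[str] = None,
                     budget: Optional[float] = None) -> List[LayerTwoName]:
    """Filter on local attributes and return layer-two names for the matches."""
    results = []
    for item in listings:
        if cuisine is not None and item.cuisine != cuisine:
            continue
        if budget is not None and item.typical_price > budget:
            continue
        results.append(item.target)
    return results


BRISTOL = [
    Listing(LayerTwoName("Joe's", "UK", "Restaurant"), "pizza", "Park Street", 12.0),
    Listing(LayerTwoName("Maria's", "UK", "Restaurant"), "tapas", "King Street", 25.0),
]

if __name__ == "__main__":
    print(find_restaurants(BRISTOL, cuisine="pizza", budget=15.0))
```

A directory for automobile repair shops would carry a different attribute set but could return the same kind of layer-two tuple, which is what keeps names and references portable across competing services.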
One can imagine many different types of keyword and (yellow pages-like) directory services at this level, using different types of protocol mechanisms as well as different types of database content and schema. But those services are nearly ideal candidates for competition: there is no requirement that either the providers or the services be global or unique or even highly standardized. Having all three layers bound to the same data sources -- inheriting values from them if one wants to think about it that way -- would provide a degree of consistency that might be very attractive to users, so there are clearly issues here that will need to be worked out in the marketplace.

Inputs at the third layer will differ by service: one can imagine free-text interfaces and menus (but see section 2.4) as well as systems that more closely resemble faceted search terms. Outputs will normally be layer-two names or strings to preserve name and reference portability, or might be URIs containing such names.

Summary: Just as the monohierarchical identifier-lookup system at the first (bottom) level should be supplemented by a multilingual, multifaceted, multihierarchy search system at the second, that second level system should be supplemented by a collection of localized, subject- and topic-specific systems at the third. These third-level systems need not be centrally coordinated in any way, although some similarity of function and interface would almost certainly make them more consistent for users and easier to market.

2.4. A layer above the third: free-text searching applications.

The approaches described above omit one set of techniques used today: "web searches" on full text or its equivalent. These systems have an important role (and, similar to the third level, there seems no particular advantage to trying to standardize them worldwide).
But their disadvantage, if seen as a DNS surrogate or replacement, is that they have difficulty distinguishing between the name of something, a pointer to it, and a reference or discussion of it or how it works. If, for example, one is looking for a web site for a company, the third level would presumably find that site. The second (or even the DNS) might find it with some guessing, but this fourth level would (as web search engines do today) probably not distinguish the company's site from sites that reference the company or its products. Layer three produces information that is explicitly bound to the query, i.e., what one is looking for, while a search engine returns values that also include sites where the subject of the query might have been mentioned.

3 Context and directions

3.1 The data search and access model

It is interesting that recent IETF "directory" work has focused on accessing mechanisms without worrying intensely about the underlying database content, maintenance, and update issues. Those issues seem to be the harder ones, i.e., the difference between LDAP and CNRP may make less difference than how we structure, maintain, and distribute the relevant data. Of course, that does not suggest that the work is not important or that it isn't required. And, to deploy the model suggested above, we will need to deal with a pair of uncomfortable problems:

   * CNRP looks interesting, but has not been widely implemented or deployed in production.

   * LDAP is widely deployed, but primarily in implementations that contain sufficient extensions and special features to be non-interoperable.

If we are going to choose -- and layer two certainly implies a choice -- we need to figure out how to do that.

3.2 Implications of uniqueness of name structures at the second layer.
The IAB's discussion of DNS root uniqueness [RFC2826] argues that DNS names must be unique, i.e., that there must not be alternate or surrogate root structures if the Internet is to survive as a seamless whole and be universally addressable and accessible. Even with imprecise matching, similar arguments apply at level two, especially if this is the first level at which names in natural languages (hence including multilingual names), rather than constrained identifiers, appear.

Because the name structures at the second level still must be unique, some mechanism for registries or structuring of names will be necessary to avoid conflicts. The problem is somewhat easier than the ones encountered by ICANN and its associated groups because the very structuring of the names and attributes creates opportunities for dividing up responsibilities, but the registration problems exist nonetheless and will need to be resolved.

3.3 Deployment against DNS base

As with the "new class" approach to DNS changes [NEWCLASS], the approach outlined here does not require any changes to the existing installed DNS base. But, like all solutions to the multilingual name issues, it requires changes to all relevant applications. The notion of moving from lookup to searching does imply that we will need, not merely to change the code that calls the name resolution system, but to rethink the UIs of those applications.

3.4 Older applications

To fully realize internationalized naming requires changing all applications to understand the new method, whatever it is. Older applications will see distorted and unfriendly names under some systems, and no names at all under others. The environment contemplated here is a "no names" one -- applications that have not been upgraded will not see internationalized names or other natural-language phrases.
The advantages of this are that it avoids confusion and, as with the original host table to DNS conversion, provides an incentive to convert old applications to make newer naming styles visible. None of these transitions are ever easy, but it may be worth going through this one to get things right, rather than investing a large fraction of the pain to get a solution that doesn't quite do the job.

3.5 Why not just a keyword system

As suggested above, the term "keyword system" is used to refer to many different things. Many would fit nicely into the layer three environment, but most of the existing proposals put them directly on top of the DNS, or skip the DNS entirely and go directly to IP addresses. The difficulty with these systems is that they either must be localized (e.g., a different system or database for each language, country, or smaller locality) or they don't scale well. In particular, they eventually either suffer from the "all the good names are taken" problem of which the DNS is frequently accused or exhibit poor search precision.

4 Summary

The solution to the "multilingual DNS" problem, and to a series of other limitations of the DNS relative to today's expectations for naming and searching, lies in solutions targeted to those problems, rather than superimposing additional mechanisms on the DNS in ways that, we hope, will not cause problems with older programs and unconverted infrastructure. Inserting new layers avoids those risks and permits a clean solution that is adapted to the problems, rather than the limitations imposed by existing properties of the DNS.

5 IANA Considerations and related topics

At layer two, it is difficult to think about how the system might function successfully without controlled vocabularies for each of the non-name facets. As discussed in section 2.2, we have already established one such registry (bound to an ISO standard), and mechanisms for utilizing it, with RFC 3066.
The Madrid agreement provides classifications for types of businesses, but we would need to extend the registry for names that are not business-related. The two locational attributes are somewhat vague at this point, but controlled vocabularies would presumably be needed, and should, if possible, be drawn from stable, non-IETF, work (e.g., IS 3166-1 and 3166-2 might provide a foundation, and possibly a complete list, for the location vocabulary).

Curiously, there is no technical reason why the names themselves must be unique: that is one of the attractions of a model like this over attempting to overload the DNS. If conflicts or confusion occur, those are standard civil (marketplace or trademark) issues that can be resolved in their own environments, rather than posing special Internet problems.

6 Security Considerations

Additional layers of naming, searching, and databases imply addition of opportunities for compromising those databases and mechanisms. Part of the challenge with the model implied here is to determine how to secure and authenticate those databases and access (especially modify access) to them. The good news is that, since the functions are new, we should be able to design security mechanisms in, rather than -- as with the DNS -- have to try to graft them on to a structure not designed for them.

7 References

[DNSROLE] Klensin, John. "Role of the Domain Name System", work in progress, draft-klensin-dns-role-...

[MADRID]

[Netword] http://corp.netword.com/ -- real reference needed.

[NEWCLASS] Klensin, John, "Internationalizing the DNS -- A New Class", work in progress, draft-klensin-i18n-newclass-...

[RealNames] http://www.realnames.com/ -- real reference needed.

[RFC882] Mockapetris, P.V., "Domain names: Concepts and facilities". RFC 882. Nov-01-1983.

[RFC883] Mockapetris, P.V. "Domain names: Implementation specification", RFC 883. Nov-01-1983.

[RFC1035] Mockapetris, P.V. "Domain names - implementation and specification", RFC 1035. Nov-01-1987.
[RFC2345] Klensin, J, T. Wolf, G. Oglesby. "Domain Names and Company Name Retrieval", RFC 2345. May 1998.

[RFC2822] Resnick, P., Editor. "Internet Message Format", RFC 2822. April 2001.

[RFC2826] IAB. "IAB Technical Comment on the Unique DNS Root", RFC 2826. May 2000.

[RFC3066] Alvestrand, H. "Tags for the Identification of Languages", RFC 3066. January 2001.

[THES] IS 5127-1, IS 5127-2.

[WAIS] M. St. Pierre, J. Fullton, K. Gamiel, J. Goldman, B. Kahle, J. Kunze, H. Morris, F. Schiettecatte. "WAIS over Z39.50-1988", RFC 1625. June 1994.

[Z39] Z39.50, IS 23950.

8 Acknowledgements

This document, and the related notes, are the result of thinking that has come together and evolved since before the issue of internationalized access to domain names came onto the IETF's radar. Discussions with a number of people have led to refinements in the approach or the text, even though some of them might not recognize their contributions or agree with the conclusions I have drawn from them (indeed, some of those discussions were rooted in challenges to the general ideas expressed here). Particularly important suggestions have come from, or arisen out of conversations with, Harald Alvestrand, Rob Austein, Fred Baker, Eric Brunner-Williams, Randy Bush, Vint Cerf, Kilnam Chon, Dave Crocker, Leslie Daigle, Patrik Fältström, Michael Froomkin, Francis Gurry, Paul Hoffman, Kenny Huang, Mao Wei, Michael Mealing, Gary Oglesby, Qian Huilin, James Seng, Theresa Swinehart, Len Tower, and Zita Wenzel as well as some long-ago conversations with Jon Postel and J.C.R. Licklider.

9 Author's Address

John C Klensin
AT&T Labs
99 Bedford St, 4th floor
Boston, MA 02111 USA
+1 617 574 3076
klensin@att.com

Expires November 2001