INTERNET-DRAFT
John C. Klensin
November 20, 2001
Expires May 2002

A Search-based access model for the DNS
draft-klensin-dns-search-02.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This document supplements a companion document [DNSROLE] on the role of the DNS relative to the uses to which it is being put and is intended to start laying the groundwork for a specific proposal. Both documents, their successors, and closely-related issues, can be discussed on the mailing list at ietf-irnss@lists.elistx.com
See http://lists.elistx.com/ob/adm.pl for subscription and archival information.

Copyright Notice

Copyright (C) The Internet Society (2000). All Rights Reserved.

0. Abstract

This memo discusses strategies for supporting "DNS searching" -- finding of names in the DNS, or references that will ultimately point to DNS names, by a mechanism layered above the DNS itself that permits fuzzy matching, selection that uses attributes or facets, and use of descriptive terms. Demand for these facilities appears to be increasing with growth in the Internet (and especially the web) and with requirements to move beyond the restricted subset of ASCII names that have been the traditional contents of DNS "Class=IN".
This document proposes a three-level system for access to DNS names in which the upper two levels involve search, rather than lookup (exactly known target), functions. It also discusses some of the issues and challenges in completing the design of, and deploying, such a system.

Table of Contents

0. Abstract
1. Introduction and Executive Summary
2. A three (or four) search-layer environment.
   2.1. Search Layer One: Identifiers -- a lookup system and the DNS.
      2.1.1. The facets
      2.1.2. The name string
      2.1.3. Case matching
      2.1.4. More complex character matching
      2.1.5. Query formation and specification
   2.2. Search Layer Two: Names -- a faceted search system with a small number of facets.
   2.3. Search Layer three: locality and/or content-domain-specific lookup mechanisms.
   2.4. A search layer above the third: free-text searching applications.
   2.5. Database and searching differentiation
3. Context and directions
   3.1 The data search and access model
   3.2 Uniqueness of name structures at the second search layer.
      3.2.1 The case for unique names
      3.2.2 Non-unique names
      3.2.3 The middle ground
   3.3 Sources for controlled-vocabulary facets ("attributes")
   3.4 Deployment against the existing DNS base
   3.5 Thoughts about User Interfaces (UIs)
   3.6 Older applications
4. Comparisons to existing and proposed technology
   4.1 The IDN Strawman
   4.2 "Keyword" systems
   4.3 Client-side and server-side solutions
5. Comments on business models
6. Summary
7. IANA Considerations and related topics
8. Security Considerations
9. References
10. Acknowledgements
11. Author's Address

1. Introduction and Executive Summary

The notion of "DNS searching" is somewhat of an oxymoron: the DNS is structured to only perform exact lookups of structured strings of labels.
But, as discussed elsewhere, there is considerable demand for searching facilities -- partial and fuzzy matching, selection that uses attributes or facets, and searching using descriptive terms-- and that demand appears to be increasing with growth in the Internet (and especially the web) and with requirements to move beyond the restricted subset of ASCII names that have been the traditional contents of DNS Class=IN. This document proposes a three-level system for access to DNS names in which the upper two levels involve search, rather than lookup, functions. It also discusses some of the issues and challenges in completing the design of, and deploying, such a system. These types of services are unnecessary as long as the problem is defined as "get non-ASCII identifiers into the DNS, but keep to a well-specified set of characters and usage so they retain strict identifier properties". Such approaches do not, as discussed in [DNSROLE], solve the problem as perceived by many people. One non-technical way of looking at this is that the DNS is fundamentally downward-facing: it is designed to support references to network and host resources. Users want something upward-facing, i.e., that provides natural-language terminology and searching for resources of interest. And, as the IAB has pointed out [RFC2825], even if "fixing the DNS" did the job, it would be the easy part: the harder problem is considering and adjusting the applications and applications-level user interfaces. It has been suggested that introducing a "directory" or "keywords" into, or above, the DNS could be used as a solution to the IDN problem and, often, several others. Probing statements about "directories" often quickly demonstrates that their advocates don't agree on what they mean. This section outlines a three-layer search/lookup model (adding two layers to the one provided by the DNS, i.e., constructing a three-layer model, rather than continuing with the single one we have today). 
Those layers consist of the current DNS, a faceted search-capable layer using an extremely simple set of facets, and a layer capable of broader search approaches in a localized context. It is intended as a strawman for criticism and development, rather than as a specific proposal. I.e., the details are left for WG efforts.

As a terminology issue, the "layers" described here are probably best thought of as sublayers of the applications layer, with actual user-facing applications lying yet above them. The term "search layer" has been used below where it appears to be needed for clarity or emphasis, and "sublayer" and "level" are sometimes used interchangeably with it: suggestions for better terminology would be welcomed.

At the two "above DNS" sublayers, international ("universal") character sets and scripts are assumed and part of this initial design. Since actual or applications-applied DNS restrictions are not being inherited upward into these sublayers, coding can be chosen for maximum utility and balance among language groups. E.g., native UCS-4 could be used as an alternative to a secondary encoding form such as UTF-8 or an ASCII-compatible recoding.

This document is intended to evolve into a framework and model for the layered search system, rather than a complete specification or even an approximation to one. It is complemented (for sublayer two) by [Mealling-SLS], which discusses a CNRP-based implementation model for the middle sublayer and the more keyword-focused model [Arrouye]. We believe the latter to best be described as sublayer three services but the authors disagree, seeing them in a sublayer two context as well. Additional documents are expected to be developed that describe other aspects of both sublayers.

2. A three (or four) search-layer environment.

The material below suggests three or more sublayers for name lookup and search:

(1) The DNS, with the existing lookup mechanisms and a single global name space in which names are unique.
Names are placed in the DNS by those who wish to use those names themselves (e.g., for identifying hosts and resources within a home, an enterprise, or cooperating groups of organizations). The DNS was never designed for searching for, or querying of, an identifier by someone who does not already know what it is. A useful analogy has been drawn between DNS names and variable names in a programming language [Austein].

(2) A restricted, facet-based, search system. This system still preserves a global name space, but name strings are not expected to be unique and the set of facet values for a given entity may not be (see section 3.2). Names are placed into this second-sublayer system by those who want to be found, or want the names or resources to be found, by others. The assumptions are neither that those others will know exactly what name they are trying to access (where the DNS requires precise knowledge of names or _very_ good guessing) nor that names will be unique (where the DNS requires uniqueness). But the search activity is still based on names (and attributes), not topics. It may be useful to think of this layer as similar to "white pages" services. This comparison is discussed in more detail below.

(3) Commercial, localized, and potentially topic-specific, search environments. These environments utilize multiple, localized, name spaces. These would typically be localized by language or (physical or political) geography, but might be structured around, e.g., specific subject matter. Names are placed in this sublayer by those who wish them to be found within a topic area context (or language or locality or combination of them). Because the environments are localized, different search terms and levels of granularity can be used in different search sites and name spaces. It may be useful to think of this layer as similar to "yellow pages" services. Again, the comparison is discussed in more detail below.

(4) Something else?

2.1. Search Layer One: Identifiers -- a lookup system and the DNS.

In this model, the DNS remains largely as is (see section 3.4ff) or, perhaps, a bit closer to its original purpose and assumptions than the direction in which it has evolved in recent years. I.e., it is a distributed database, with precise lookups, whose lookup keys are identifiers for Internet hosts and other objects. We give up the notion that these identifiers should also serve as human-useful names or at least try to abandon that notion.

As an aside, note that some people have suggested that we should dehumanize DNS names entirely, e.g., prohibit the registration and use of any name that can be found in any dictionary for any language that can be represented in the DNS-acceptable character set. This proposal doesn't include that idea. But it is absent primarily because it does not appear that the transition process would be worth the time it would take to explore, rather than because it has no appeal.

The goal at this sublayer is relatively simple, unique, identifiers. It is probably desirable that these identifiers be able to have some human mnemonic value, but less important that they be tightly bound to real-world names and descriptions.

The inputs and outputs at this layer remain as they are in the DNS today, although modifications to accommodate non-hostname names (i.e., names not restricted to the ASCII-based "letter-digit-hyphen" (LDH) format traditionally used in Internet applications [HOSTNAME]) there remain possible if that is deemed important for mnemonic or other purposes.

2.2. Search Layer Two: Names -- a faceted search system with a small number of facets.

Much of the current burden borne by the DNS would appear to be better focused on a search system that contains names and a small number of attributes represented in name facets.
That DNS burden includes a wide range of non-identifier goals and constraints: names that a user can understand and find and that have significant mnemonic value, names with trademark implications, a wide variety of naming systems and, in general, helping people find the things for which they are looking. It is critical that the number of attributes be constrained to a minimal set, and that other attributes, especially those of special interest, be deferred to the third search layer. The term "attribute" is used here and below to identify the controlled vocabulary or rule-defined facets as distinct from the free-form "name-string".

It is probably most useful to think about this layer in terms of a structured, multifaceted, multihierarchical, thesaurus-like database with search capability (Cf. ISO IS 5127-1 and IS 5127-6 [THES]), rather than as a "directory" in the sense of X.500 and its derivatives and antagonists.

2.1.1. The facets

A key question is what facets to use once the major commercial product requirements are removed (to search layer three, see below). It appears to me that, to satisfy the critical name-uniqueness and real world pressures on the DNS, candidates for identifying facets might be:

name-string
   Characters from IS 10646, see below.

language
   Presumably codes as specified in RFC 3066.

geographical location
   Country, and/or for some federal countries, country/province ("state"). Granularity is important and there may be a case for an additional facet based in a coordinate system or for a two-level facet.

network location
   If we can figure out what that means and how to express it in a canonical way.

industry category code
   For companies, presumably derived from some existing official list such as the WIPO Nice Agreement list [WIPO-NICE]. The list would presumably require extension in some way to deal with non-commercial organizations and entities and to identify resources and services associated with people.
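
As a rough illustration of the candidate facet list above, a second-layer record might be modeled as follows. This is only a sketch: the class name, field names, and the specific codings (an RFC 3066 tag, a two-letter country code, a Nice class number) are assumptions made for the example and are not specified by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetedName:
    """One hypothetical layer-two record: a name-string plus controlled facets."""
    name_string: str                        # free-form, IS 10646 characters
    language: Optional[str] = None          # assumed: an RFC 3066 tag such as "en-GB"
    geo_location: Optional[str] = None      # assumed: country or country/province code
    network_location: Optional[str] = None  # meaning left open by this document
    industry_code: Optional[str] = None     # assumed: a Nice class number as a string


# Two records that share a locality but differ in name and industry facet.
repair = FacetedName("Joe's Auto Repair", "en-GB", "GB", None, "37")
pizza = FacetedName("Joe's Pizza", "en-GB", "GB", None, "43")

# The same name-string with a different location facet is a different record.
pizza_us = FacetedName("Joe's Pizza", "en-US", "US", None, "43")
```

The point of the sketch is only that the record is the whole tuple: the name-string alone identifies nothing, and a facet left as None stands for a blank value.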
This typology gives the trademark view of the world somewhat more precedence in looking at name conflict issues than one might like in principle. But, in practice, one of the key issues we have encountered in trying to store "names", rather than identifiers, in the DNS is that the process unreasonably flattens the space, not only from a technical standpoint but from a usability one. That "Joe's Auto Repair" and "Joe's Pizza" can co-exist in the same geographical area without conflict or confusion and that "Joe's Pizza" in one area can co-exist with "Joe's Pizza" in another, again without conflict or confusion, are the consequence of the way we name and identify things in the real world. Most trademark rules are the consequence of those naming systems, not their cause, and many perceived conflicts between the trademark system and DNS usage are the result of this flattening.

It is not intended that this level act as a white pages service for people. Doing so leads down several slippery slopes at once, including heightened privacy concerns and a stronger requirement for URL targets rather than DNS label ones (see below).

The general intent is that the list of facets be fixed by protocol and that possible values for each facet be controlled vocabularies, not necessarily (and probably not) controlled from the same source (see section 3.2). We would hope to utilize existing terminology lists where possible. For a particular record (i.e., a name and its set of attributes), and especially if requirements for uniqueness can be bypassed or relaxed, the selection (from the controlled vocabularies) of particular facet values would be the responsibility of the entity registering the names.

In other words, someone registering a "name" in this system would select values for each of the facets from the controlled vocabulary for that facet as part of the process of placing the name into a database.
It is important to note that the registration of that name would include all of the associated facets, although the vocabularies for all of the facets other than the "name-string" would be drawn from specific, external lists (controlled vocabularies or rules). It would not be desirable, and probably would not be feasible, for registrants to record their names in independent, facet-based, databases with one facet per database. There is also no magic in the proposed system. Names are placed in the system with particular facet sets because a registrant wants them there. A registrant who wishes to have a given name-string associated with different facet values (e.g., to identify different locations or lines of business) will make multiple registrations. While all faceted name strings would contain the same facets, there is no technical reason why one or more of these might not have a blank (or "missing") value, presumably causing a match to any search term for that facet. More important, searching for a name might omit one or more facets from the search, again matching any value that actually appeared in the database. It should be clear that there is significantly more information (from the values of the facets) at this layer than there is in the DNS. 2.1.2. The name string The names in this environment can reasonably be written in IS 10646 codes or some recoding of them. Since we would be starting more or less from scratch, we could select lengths and codings for maximum efficiency and utility, not to meet the constraints of existing software. In such a context, this author has a slight bias for direct UCS-4 coding. This is in preference to ASCII-compatible ("ACE") codes; compressed, null-octet-eliminating, systems such as UTF-8; or surrogate introducers to hold things to 16 bits. The loss in transport efficiency is likely to be more than compensated for by gains in cleanliness and equal treatment of all scripts. 
And, if compression is needed, it is perhaps better to do it at the string level rather than the character one. But that issue is separate from the main and important design arguments of this document. The work done to define "nameprep" and, later, the set of "stringprep" functions [NAMEPREP], in the IDN WG is almost certainly relevant to determining which names to actually store in the database. But the stakes are lower here than the "get it right or fail completely" constraint of the DNS lookup environment: one can imagine search mechanisms that would apply a more liberal set of matching rules (and/or localized and language-specific ones) than the rules used to encode names (much like recent applications protocols that explicitly distinguish between the formats one is permitted to send and those one is expected to accept (Cf. [RFC2822])). At the same time, it would be sensible to permit short phrases as these "names", something which is not generally possible in the DNS (or in the IDN proposals). The necessity, in the DNS, to turn, e.g., "Lower Slobbovian University" into "LowerSlobbovianUniversity.edu", and hope case will be preserved (or "lowerslobbovian.edu", or worse) is, ultimately, just another example of the unfortunate match between the identifiers of DNS and real-world naming systems. So we would assume that it is a design requirement to make it possible to use "Lower Slobbovian University" and "University of Lower Slobbovia" as stored names. 2.1.3. Case matching In the system proposed here, case-matching should be treated as just another case of fuzzy searching and matching, not a relationship with unique status. As discussed below, in all cases, the user (or her agent) would provide a string, some subset of facets, and search-method specifications as input, and would receive a set of matching results, in the form in which they are stored in the database. 
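
The idea of matching liberally while returning names exactly as stored can be sketched as below. This is a minimal sketch under stated assumptions: Unicode NFKC normalization plus Python's str.casefold() stand in for whatever matching rules would actually be chosen (they are not nameprep or stringprep), and the function names and the two-entry database are hypothetical.

```python
import unicodedata
from typing import List

# Names are kept in the database exactly as registered.
STORED_NAMES = ["Lower Slobbovian University", "University of Lower Slobbovia"]


def match_key(text: str) -> str:
    """Fold a string for comparison only; the stored form is never altered."""
    return unicodedata.normalize("NFKC", text).casefold()


def find(query: str) -> List[str]:
    """Return stored names whose folded form equals the folded query."""
    key = match_key(query)
    return [name for name in STORED_NAMES if match_key(name) == key]


print(find("lower slobbovian university"))
# -> ['Lower Slobbovian University']
```

Because folding is applied on the server side to both the query and the stored name at comparison time, no information is lost from the registered form, which is the property argued for in the text.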
Case matching -- treating upper and lower case letters as identical -- is another historical DNS property that does not have a simple and unambiguous interpretation in the real world of non-ASCII character sets and a range of language applications. Some scripts contain glyph forms that clearly represent two cases, some scripts clearly do not have case distinctions, and, as the IDN WG has discovered, there are character-matching requirements in some languages (e.g., equality of Simplified and Traditional Chinese [CNDC], see below) for which the appropriateness of an analogy to case-matching has caused considerable controversy, not least because of the apparent absence of a set of mapping tables that cover all of the possible character pairs. The IDN WG has also discovered that, even for scripts with clear case distinctions, the matching rules sometimes differ by geographical locality.

It is not yet completely clear how case matching should best be handled, but one thing that appears completely clear is that the model the IDN group seems to be creating is not desirable. That model essentially results in different rules being applied to different scripts: case matching in some situations, none in others, and some but not all characters in yet other cases. This may possibly be the best compromise given the combination of the constraints of the DNS with the idiosyncrasies of Unicode, but, without the DNS constraints, we should strive to treat all languages and scripts in as nearly an identical way as possible.

While there are other options, it would appear to be better to handle case-matching on the server, as it is done in the DNS.
As with other searching variants, it should be possible to return the form of the name as stored in the database while finding it using any of the user-acceptable variations (use of client-side string preparation for both the stored name and query formation, as an IDN-DNS seems to require, loses information that some people consider important). Case-matching in the proposed faceted system could be applied (or not) as dictated either by a heuristic using the combination of the language facet and a query containing the preferred location-context of the user (see below). Or there could be an explicit query flag (or indicator carrying more than one bit of information). This author tends to prefer the latter because of a profound distrust of heuristics, but the question requires additional study. 2.1.4. More complex character matching The case-matching strategy applies to more complex cases of character matching as well. If one can establish sufficient context, and specify the types of expanded matching to be used, and permit multiple variants to be returned to the application, then one could support matching of similar-appearing characters (e.g., Latin "A" and Greek Alpha), or Latin-derived and Cyrillic-derived scripts for Serbo-Croatian, or, perhaps most important, mapping between Traditional and Simplified Chinese. 2.1.5. Query formation and specification As is common with systems of this type, we would anticipate the possibility of searching on any of the attributes and that searching on free-text strings would not be exact (i.e., near-match responses could be returned using any of several algorithms, with the user making choices). One could also imagine distance function calculations on appropriately structured restricted-vocabulary facets being implemented in some search engines. 
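
One way to picture inexact matching on the name-string is a similarity score with a caller-supplied threshold. The sketch below uses Python's difflib.SequenceMatcher purely as a stand-in for "any of several algorithms"; the function names, the sample names, and the 0.8 threshold are assumptions for illustration, not part of this proposal.

```python
from difflib import SequenceMatcher
from typing import List

NAMES = [
    "Lower Slobbovian University",
    "University of Lower Slobbovia",
    "Joe's Pizza",
]


def similarity(a: str, b: str) -> float:
    """Score two strings between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def near_matches(query: str, names: List[str], threshold: float = 0.8) -> List[str]:
    """Return names scoring at or above the threshold, best match first."""
    scored = sorted(((similarity(query, n), n) for n in names), reverse=True)
    return [name for score, name in scored if score >= threshold]


# A misspelled query still finds the intended stored name.
result = near_matches("Lower Slobovian Universty", NAMES)
```

The threshold plays the role of the "how fuzzy" rule statement that a query would carry; lowering it widens the candidate set that is handed back to the user for a choice.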
As is equally common, we should think about user interfaces that store both queries and response sets so that the responses could be used offline and refreshed when the client systems were attached to the Internet. At the same time, we would assume that a search without at least some approximation to a name string would rarely be productive and would expect search systems to be optimized accordingly. In summary, the goal at this layer is to provide tuples of human-recognizable (not just mnemonic) facets (names and attributes), but names that are relevant within the context set by the attributes, rather than a global system based on the names alone. The input at this layer is a query consisting of search values for one or more of the facets, plus information to control the search. E.g., to the extent that designers of search protocols can provide the proper tools and terminology, one would expect the query to be accompanied by rule statements about how much "fuzziness" was permitted, how "distant" names might be from the chosen ones and still be selected, whether character set or language translation (or even phonetic recognition) was to be applied (and whether translation was to be restricted to a small group of languages or made more general) and so forth. The outputs are still being discussed, but would appear to best be the full facet set of the matched tuple(s) (more than one such set if multiple tuples match) and one or more DNS names associated with each tuple. These DNS names, of course, have the same uniqueness properties of the DNS itself: while a query, or full set of matching facets, could match (and return) multiple DNS names, nothing would make the DNS names less unique than they are today (i.e., as the DNS requires). One of many interesting questions is whether this layer should pass through and return the DNS records themselves (labels, class, type, and target) or whether it should return names (labels) and let the applications do the DNS lookups. 
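
The input and output described above can be sketched as a small in-memory search. Everything here is hypothetical: the facet keys, the sample records, and the ".example" DNS names are invented for illustration, and name matching is exact only to keep the sketch short. It shows a query that supplies values for some facets, blank stored facets that match anything, and results that carry the full facet set plus the associated DNS names.

```python
from typing import Dict, List, Optional, Tuple

Facets = Dict[str, Optional[str]]
Record = Tuple[Facets, List[str]]  # (full facet set, associated DNS names)

DATABASE: List[Record] = [
    ({"name": "Joe's Pizza", "language": "en", "geo": "GB", "industry": "43"},
     ["joes-pizza-uk.example"]),
    ({"name": "Joe's Pizza", "language": "en", "geo": "US", "industry": "43"},
     ["joes-pizza-us.example", "www.joes-pizza-us.example"]),
    # A blank (None) facet value matches any search term for that facet.
    ({"name": "Joe's Auto Repair", "language": "en", "geo": None, "industry": "37"},
     ["joes-auto.example"]),
]


def facet_matches(stored: Optional[str], wanted: str) -> bool:
    """A blank stored value matches anything; otherwise require equality."""
    return stored is None or stored == wanted


def search(query: Dict[str, str]) -> List[Record]:
    """Facets omitted from the query are unconstrained."""
    return [
        (facets, dns_names)
        for facets, dns_names in DATABASE
        if all(facet_matches(facets.get(key), value) for key, value in query.items())
    ]


# Name alone matches two tuples; adding a location facet narrows it to one.
all_pizza = search({"name": "Joe's Pizza"})
us_pizza = search({"name": "Joe's Pizza", "geo": "US"})
```

Note that one tuple can return several DNS names, and several tuples can match one query, without making any individual DNS name less unique.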
Another possibility is to return one or more URLs (or more general URIs?) rather than DNS names. Doing so increases flexibility but at the cost of greater complexity and risk of recursion problems. Still another possibility would be to create a URI [RFC-URI] for DNS record information and use it to abstract this return information into something applications can then specify or decode as appropriate. Use of this would need to be carefully structured to avoid complex problems (e.g., recursion in either this system, the DNS, or both), but might be a reasonable approach. If the output is either a DNS name or a URI, if the DNS is extended, as is being discussed in the IDN WG, the process of looking up DNS names that emerge from the sublayer two search would presumably go through the extended process, e.g., stringprep and IDNA or their descendants. Experience with the DNS and other distributed databases also argues persuasively that these records are not forever. Unless there are no local copying and caching mechanisms (which seems unlikely and hard to enforce), some type of time to live (TTL) or other expiration or reverification mechanism will be needed. 2.3. Search Layer three: locality and/or content-domain-specific data and mechanisms. The problem with the second-search-layer model is that there are a number of usability and marketplace pressures for naming systems that offer finer granularity and better match user needs. For many purposes, users want localized, not global, systems. This has been confirmed in those systems which have been included in experiments or partially deployed (see, e.g., [RFC2345], [Netword], and [RealNames]), which require contextual localization, not a single global environment. There are many causes for this, but requirements for very specific searches that are geographic-area, topic-area, or language or culturally specific, tend to dominate the list. The issue is perhaps illustrated by an example. 
Suppose the granularity of an entry at the second level is

{"Joe's", "UK", Restaurant,... }

Now, I might want to create a business around a restaurant directory for Bristol. I would probably want to construct a database that contained exact locations, type of food, menu information, prices, etc., and permit people to query it that way. That type of product bears a strong relationship to traditional yellow pages services: the best attributes to collect and the optimal way to organize them will differ by topic (e.g., "menu" has no obvious analogy in an automobile repair shop) and the business models are fairly established. Part of the history of those business models is the observation that, when there are competing yellow pages services (or guidebooks, or other, similar services), those who consistently make better (and "more accurate") choices of categories and keywords tend, other things being equal, to be judged "better" and to capture larger market share.

One can imagine many different types of keyword and (yellow pages-like) directory services at this level, using different types of protocol mechanisms as well as different types of database content and schema. But those services are nearly ideal candidates for competition: there is no requirement that either the providers or the services be global or unique or even highly standardized. Having all three search layers bound to the same data sources --inheriting values from them if one wants to think about it that way-- would provide a degree of consistency that might be very attractive to users, so there are clearly issues here that will need to be worked out in the marketplace.

Directories of these types are, of course, common and widespread outside the Internet. There is no shortage --some would say there is a surplus-- of directories and guides to resources and services of particular types and in particular areas.
Some are supported by advertising or placement fees from the resource owners, some by book sales or fees charged to users, and others by a combination. Most of these directories and guides publish year after year and seem profitable. Inputs at the third search layer will differ by service: one can imagine free-text interfaces and menus (but see section 2.4) as well as systems that more closely resemble faceted search terms. Outputs will normally be search-layer-two names or strings to preserve name and reference portability, or might be URIs containing such names. Summary: Just as the monohierarchical identifier-lookup system at the first (bottom, DNS) level should be supplemented by a multilingual, multifaceted, multihierarchy search system at the second, that second level system should be supplemented by a collection of localized, subject- and topic- specific systems at the third. These third-level systems need not be centrally coordinated in any way, although some similarity of function and interface would almost certainly make them more consistent for users and easier to market. 2.4. A search layer above the third: free-text searching applications. The approaches described above omit one set of techniques used today: "web searches" on full text or its equivalent. These systems have an important role (and, similar to the third level, there seems no particular advantage to trying to standardize them worldwide). But their disadvantage, if seen as a DNS surrogate or replacement, is that they have difficulty distinguishing between the name of something, a pointer to it, and a reference to, or discussion of, it or how it works. The other systems discussed in this document are all "directories" in the sense that someone must make an explicit decision to put an entry in a database; they are not full text searching systems or analogues of them. 
If, for example, one is looking for a web site for a company, the third level would presumably find that site (assuming the company wanted to be found). The second (or even the DNS) might find it with some guessing, but this fourth level would (as web search engines do today) probably not reliably distinguish the company's site from sites that reference the company or its products. Search layer three produces information that is explicitly bound to the query, i.e., what one is looking for, while a search engine returns values that also include sites where the subject of the query might have been mentioned.

2.5. Database and searching differentiation

In both sublayers two and three, but especially in two, we assume that "compiling databases" (i.e., registry and, if appropriate, registrar functions) and "designing and building search functions and providing search services" are separate. It would be necessary to have database interfaces be sufficiently general and well-specified that referrals were possible and different search services could rest on top of them, but we would expect some search services to be much more extensive than others and for their vendors to seek increased compensation for those more extensive services. In many cases, the market would eventually sort out the optimal combinations of capabilities and costs.

Ultimately, the term "fuzzy search", used extensively in this document and elsewhere, is handwaving. Whether heuristic or deterministic, one must devise, for each facet, systems for determining whether matches have occurred and, for inexact matches, whether the combination of query term and database entity are "close enough" together to be candidates for being returned as responses. We can imagine phonetic matching as well as character-string matching, application of contextual rules as well as simple character-pair rules for matching of Traditional and Simplified Chinese, and similar rules for matching of Kanji and kana strings.
And we would presume that users, or their agents, would be able to control such decisions by choice of search providers, configuration, or choices on a per-search basis.

3. Context and directions

3.1 The data search and access model

It is interesting that recent IETF "directory" work has focused on accessing mechanisms without worrying intensely about the underlying database content, maintenance, and update issues. Those latter issues seem to be the harder ones, i.e., the difference between LDAP and CNRP may make less difference than how we structure, maintain, match, and distribute the relevant data. Of course, that does not suggest that work on accessing mechanisms is not important or that it isn't required. And, to deploy the model suggested above, we will need to deal with a pair of uncomfortable problems:

* CNRP looks interesting, but has not been widely implemented or deployed in production.

* LDAP is widely deployed, but primarily in implementations that contain sufficient extensions and special features to be non-interoperable. Effective referral mechanisms have also not been clearly standardized in LDAP, and this might provide a barrier.

Some readers of earlier drafts have also suggested that the history of LDAP points to local extensions that will result in inconsistent search behavior, while CNRP may be better specified (or at least closer to a clean slate). If we are going to choose -- and search layer two certainly implies a choice -- we need to figure out how to do that.

3.2 Uniqueness of name structures at the second search layer.

There are cases to be made both for and against uniqueness of names (more precisely, of the combination of the name-string facet and all of the other facets) at this sublayer, and even a partial middle ground, in which names are unique within a registry namespace, but there are mechanisms for identifying such spaces so that the names are unique across the Internet.
The community should address the tradeoffs because no position is ideal; summaries of the extreme positions are below. In none of these cases is it necessary, or even desirable, that the name-string itself (without the additional "attribute" facet values) be unique.

3.2.1 The case for unique names

The IAB's discussion of DNS root uniqueness [RFC2826] argues that DNS names must be unique, i.e., that there must not be alternate or surrogate root structures if the Internet is to survive as a seamless whole and be universally addressable and accessible. Even with imprecise matching, similar arguments may apply at level two, especially if this is the first level at which names in natural languages (hence including multilingual names), rather than constrained identifiers, appear. The mathematical arguments aside, the main argument for uniqueness is that a given combination of name-string and facets will yield exactly one logical host (or equivalent). If this is not the case, it seems inevitable that users will be faced with choices they need to resolve even when they have an exact match for a full set of facets. Because the name structures at the second level, in this case, still must be unique, some mechanism for registries or structuring of names will be necessary to avoid conflicts. The problem is somewhat easier than the ones encountered by ICANN and its associated groups because the very structuring of the names and attributes creates opportunities for dividing up responsibilities, but the registration problems exist nonetheless and will need to be resolved.

3.2.2 Non-unique names

Conversely, one could have multiple appearances of the same set of facets (including the name-string), such that an exact match could still yield multiple "hits". This would have the advantage of eliminating all requirements for monopoly registries or [other] technical mechanisms for guaranteeing that name conflicts did not occur.
The disadvantage is that it would force more user choices or heuristics, and at least some errors in which the wrong host or site was identified would be almost inevitable. If it turned out that most user queries occurred at sublayer three or four, rather than directly at this sublayer, that issue might not be significant. Were extensive use of per-user (or per-group) local directories ("bookmarks", "favorites", etc.) to evolve, they might also make the difficulties with non-uniqueness insignificant. This would be especially likely if these directories contained not only a keyword and (DNS name or URI) target, but also a stored form of the search used so that local data could be recalculated and replenished. See section 3.5 for some related discussion.

Such "bookmarks" can be thought of as a local cache of queries and responses with sufficient information both to locate immediately a target associated with the user's perception of what was looked for and to "refresh" the search if circumstances changed or values timed out. In the presence of a particular query, a client system would presumably check for a matching bookmark. If one was not found, the layer two search would be performed, yielding values that might require user intervention for selection. Once selected, the search, the full set of facets returned, the DNS names or URIs, and any TTL information would be stored (possibly using a user-supplied name or tag) and the resource accessed via the appropriate DNS name or URI. If the search or tag was found in the cache, checks would be made for the values being current and then the DNS name or URI used directly, without going back through the search procedure.

3.2.3 The middle ground

A proposal was made in the initial version of [Mealling-SLS] that an additional facet could be added to represent the registry which records the names.
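A minimal sketch of how such a registry-identifying facet might enter the name structure is given below. The class and field names, the facet names ("language", "country"), and the registry identifiers and target host names are all hypothetical; nothing here is taken from [Mealling-SLS].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerTwoName:
    """A search-layer-two name: registry facet, name-string, other facets."""
    registry_id: str      # hypothetical registry-identifying facet
    name_string: str      # need not be unique by itself
    facets: tuple         # sorted (facet, value) pairs


class Registry:
    """Keeps names unique within one registry only."""

    def __init__(self, registry_id: str):
        self.registry_id = registry_id
        self._entries = {}

    def register(self, name_string: str, facets: dict, target: str) -> LayerTwoName:
        key = LayerTwoName(self.registry_id, name_string,
                           tuple(sorted(facets.items())))
        if key in self._entries:
            # Same name-string and facet values, same registry: a conflict.
            raise ValueError("name already registered in this registry")
        self._entries[key] = target
        return key


if __name__ == "__main__":
    facets = {"language": "en", "country": "US"}
    first = Registry("r-7f3a").register("Acme", facets, "host1.example.com")
    second = Registry("r-c019").register("Acme", facets, "host2.example.net")
    print(first != second)  # the registry facet alone distinguishes them
```

In this sketch the same name-string and facet values can be recorded by two registries without conflict, because the registry identifier is part of the key.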
If this were done, names could be kept unique within registries and would be globally unique as long as the registry-identifying facet had a unique value for each registry. There would be no need to restrict the number of registries in this model or resolve naming disputes among them -- each one could have a unique, randomly-generated and assigned identifier -- so the approach could provide some degree of technical uniqueness while still preserving most of the benefits of the non-unique approach. This model could, of course, be deployed at a "registrar" level instead, just by changing the assignment of the identifier facet from value-per-registry to value-per-registrar. Other variations are, of course, possible.

3.3. Sources for controlled-vocabulary facets ("attributes")

We anticipate that most of the sublayer two facets other than the name-string itself will have values chosen from controlled vocabularies, i.e., the user-registrants will be able to select whatever values seem to match their needs, but only from pre-defined lists of possible values. These are not intended as free-text entities; to make them free text would push the second-sublayer system toward the lowered precision of Internet search engines and other free-text search environments. The facet values that are not populated from controlled vocabularies will be determined by deterministic and unambiguous rules. For example, if one of these attributes is a geographic location that uses a coordinate scheme, the definition of the coordinate scheme should be sufficient to yield a predictable and exact value. The question, then, is how to establish the vocabulary lists and write the defining rules. It has been something of an Internet tradition, building on Jon Postel's principles for registration and registries, to try to avoid having IETF or IANA become embroiled in controversies about names, their ownership, propriety of using them, and so on.
The use of IS 3166-1 alpha-2 codes as the basis for "country code" top-level domain names (see [RFC1591]) is just one instance of the application of this principle. Following this tradition, facets should be chosen, in part, on the basis of availability of pre-existing, well-known lists of names and authorities or, at worst, the ability to identify relatively non-controversial authorities who can quickly establish such lists.

3.4. Deployment against the existing DNS base

As with the "new class" approach to DNS changes [NEWCLASS], the approach outlined here does not require any changes to the existing installed DNS base. But, like all solutions to the multilingual name issues, it requires changes to all relevant applications. The notion of moving from lookup to searching does imply that we will need, not merely to change the code that calls the name resolution system, but to rethink the UIs of those applications.

3.5 Thoughts about user interfaces (UIs)

There are many possible models for user interfaces to be used with a system of the type proposed here. The IETF should, as usual, remain agnostic about them. At the same time, some notions about possible user interfaces are important to demonstrate that the concepts are practical and to inform the design of protocol interfaces. So, with the understanding that other approaches are possible, and may be preferable: As discussions on both DNS "searching" and multilingual names, and the general model presented here, have evolved, it has become apparent to some observers that these approaches would be best realized in conjunction with user-specific directories or memory with refresh capability, whether modeled on a local directory, or cache, or history file, or something else. It has been surmised [WJR ref?]
that the behavior of typical users is to spend most of their time using or referencing known services and hosts (whether web sites, hosts used in email addresses, or other services) and much less time "searching" for unknown resources. If this is actually the case, then a typical reference should involve a DNS "name to address" lookup only, even though it would be desirable for the DNS name to not be visible to that user. The user might reasonably see his or her original collection of search terms, or a name assigned to that search or its results, but actual searching would take place only as a first-time activity or in the process of refreshing the search and results (at user request or, perhaps, automatically).

3.6 Older applications

To fully realize the benefits of internationalized naming requires changing all applications to understand the new method, whatever it is. Even the "internationalize the DNS" proposals are subject to this principle. Older applications will see distorted and unfriendly names under some systems, and no names at all under others (some approaches might cause some application implementations to fail entirely). The environment contemplated here is a "no international names in old applications", i.e., "no new names without upgrading", one -- applications that have not been upgraded will not see internationalized names or other natural-language phrases, nor coded surrogates for them. The advantages of a "no names without upgrading" approach are that it avoids confusion and the risk, however slight, of catastrophe. As with the original host table to DNS conversion, it provides an incentive to convert old applications to make newer naming styles, and newer names, visible. None of these transitions are ever easy, but it may be worth going through this one to get things right, rather than investing a large fraction of the pain to get a solution that doesn't quite do the job.

4.
Comparisons to existing and proposed technology

4.1 The IDN Strawman

After the IETF IDN working group came into being, its work rapidly converged on the assumption that internationalized name referencing issues and requirements -- including the requirements, not heretofore satisfied even for ASCII-based names, to be able to search for things using the DNS -- could be achieved by placing non-ASCII identifiers into the DNS itself, in some coded form. These identifiers have commonly been described as "multilingual names", further complicating the work program and consensus-seeking process in that working group. Many of the problems associated with trying to overload the DNS in this way have been described in [DNSROLE]. And that document, and the experience from which it is drawn, predict that the IDN WG effort will ultimately fail if it goes down paths that require sensitivity to the characteristics of particular languages, rather than just an expanded set of characters to be used in identifiers. As implied in the [DNSROLE] document, consideration of language-related issues and their appropriate handling was one of the primary motivations for the model developed here. However, at least from the viewpoint of this author, one important question remains: assuming that the IDN WG's work can be appropriately narrowed down to characters and identifiers, does the value of local-language identifiers justify putting non-ASCII strings into the DNS even if end users never see them? We argue in section 2.1 that it is not necessary and poses some risks. However, the "variables in programming languages" analogy and the "local directory or cache" approach, both outlined above, suggest that such names would be extremely useful. And, if one believes the model outlined here, or any competing "keyword" model, will achieve wide deployment and use, the needs and perspectives of such systems should condition the evaluation of IDN WG-produced alternatives.
So there is a serious and complex set of engineering (and, realistically, political) tradeoffs to be evaluated in making the decision as to whether wide deployment of some version of the IDN work is appropriate.

4.2 "Keyword" systems

As suggested above, the term "keyword system" is used to refer to many different things. Many would fit nicely into the third sublayer environment, but most of the existing proposals put them directly on top of the DNS, or skip the DNS entirely and go directly to IP addresses. The difficulty with these systems is that they either must be localized (e.g., a different system or database for each language, country, or smaller locality) or they don't scale well. In particular, they eventually suffer from either the "all the good names are taken" problem (of which the DNS is frequently accused) or they are very vulnerable to poor retrieval precision properties as the number of names (or keyword combinations) in the name space grows large. Adapting keyword systems to operate locally and as part of the third sublayer model proposed here would appear to be the best way forward for such systems. It has been observed that what users really want is localization, and locally-oriented keyword systems could satisfy much of that requirement. And keyword systems would be strengthened by being placed on a base of use- and language-sensitive naming and searching, rather than on the low-context, monohierarchical, DNS.

Other types of keyword systems are really special cases of the sublayer two search service, presumably with keywords combined into a phrase that can be interpreted, if appropriate, with permutation rules in a search service and, probably, some of the other facets left out of searches. Some further analysis, as to whether what is optimally desirable is a set of unordered keywords, or an ordered phrase that might contain such keywords, seems called for. Different answers could, of course, be implemented at different layers of this model.
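The difference between the two interpretations can be made concrete with a small sketch. It assumes whitespace-separated keywords and case-insensitive comparison, both of which are simplifications; the function names and the sample entry are hypothetical and are not drawn from any existing keyword system.

```python
def normalize(text: str) -> list[str]:
    """Split a phrase into case-folded keywords (a simplifying assumption)."""
    return text.casefold().split()


def matches_unordered(query: str, entry: str) -> bool:
    """Unordered keywords: every query keyword appears somewhere in the entry."""
    return set(normalize(query)) <= set(normalize(entry))


def matches_ordered(query: str, entry: str) -> bool:
    """Ordered phrase: the query keywords appear consecutively, in order."""
    q, e = normalize(query), normalize(entry)
    return any(e[i:i + len(q)] == q for i in range(len(e) - len(q) + 1))


if __name__ == "__main__":
    entry = "Acme Widget Company"
    print(matches_unordered("widget acme", entry))  # True: order is ignored
    print(matches_ordered("widget acme", entry))    # False: order matters
    print(matches_ordered("acme widget", entry))    # True
```

The unordered form behaves like a search service applying permutation rules on the user's behalf, while the ordered form treats the phrase more like a name.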
4.3 Client-side and server-side solutions

The key approaches being considered in the IDN WG are essentially client implementations, applied to names before they are placed in the DNS. This contrasts with the existing use and protocols of DNS in which, e.g., string matching is done on the server. Ignoring speed of deployment (which can be argued either way), the advantage of client-side implementations is that they don't require changes to the DNS fabric itself (and therefore minimize the risk of damaging existing applications that rely on that fabric). Because the sublayer two and three mechanisms do not rely on the DNS for any searching or matching activities, and are completely new, server-side implementations are again feasible: applications will require modification to access these services (just as they would to support a client-side implementation), but older, unmodified, applications will not touch them at all. Server-side implementations have several advantages over client-side ones. If something complicated is being done, it is often possible to apply more computer resources, or larger tables, on a server, and to update those resources and tables more easily if needed. And server-side implementations tend to yield more uniformity of behavior relative to having a potentially wide mix of client implementations.

5. Comments on business models

Historically, the IETF has had even less desire to involve itself with business models than it has with user interfaces (see section 3.5). But the approach outlined here, and the protocol and operational proposals that will derive from it, face a particular challenge: the DNS works well for its intended purpose (something we don't intend to change) and arguably works at least tolerably for some purposes, including as a search engine, for which it was not intended.
Many of us see its quality and capabilities, when used as a search (or, more accurately, "guessing") engine, deteriorating, but collapse, if it occurs, is still in the future. There are also considerable vested interests -- both economic and policy control -- associated with the current DNS structure and arrangements. The ability to produce and deploy a different model, especially one that requires new work in several areas, against that backdrop will be challenging at best. Unless there are clear business models for doing so, the odds of success are quite low. So this section outlines some of the business issues and models not covered elsewhere in this document. As with the user interface discussion, it is not intended to be definitive: some of these models may fail and others may be more attractive. But it is intended to provide a sufficient demonstration of concept that, perhaps, the technical ideas can be taken seriously.

We observe that a telephone system analogy may be helpful. With the telephone system, there are registries, described as national numbering databases, that record which numbers are in use and by whom. There are white pages services which, given locale and some other information (e.g., whether business or residential in some areas) and a near or exact match to a name, provide name to number lookup. And there are yellow pages services, with precise categories and organization differing somewhat from one location to another. Organizations make money at all three levels, but the greatest aggregate income occurs with the yellow pages services.

At each of sublayers two and three, there are multiple services. Some of these would probably need to be operated as public goods, spreading costs over the producers of other services. Others would presumably be directly profitable.
5.1 Sublayer two - faceted global searching

5.1.1 Facet listings and identification

For the attribute facets that rely on controlled vocabularies, some organizational structure would be required to oversee those vocabularies. As suggested elsewhere, the ideal would be to use pre-existing organizations and pre-existing lists (the WIPO classification of goods and services [NICE] is an example of such a list, as would be the IS 3166-1 list traditionally used for country code domain names). Where such lists did not exist, it would be necessary to build arrangements for them. The maintenance of such vocabularies would, from an Internet standpoint, be a public good.

5.1.2 Registration and searching

Actual registrations would be required for names and their attributes with, as mentioned above, multiple registrations when an individual, organization, or business wished to be registered with more than one attribute set. The economic model would presumably parallel the current registrar and registry business, with a charge for registration (since there is no intrinsic requirement for a single registry, registry services might well be competitive, eliminating the need for models that separate registries and registrars). However, lookup and search activities would be more flexible than the DNS, with extended services, including character set transposition, language translation, and potentially more extensive search variations being potential areas on which providers could compete, using fee for service or subscription models to support costs.

5.2 Sublayer three - localized databases and searching

As mentioned above, yellow pages and publication of directories and guidebooks are traditionally where the money has been made. The analogies apply: one could imagine charging for entering information into the databases, or for searching, or for information delivered, or all three of these. And all have been used for papers and related databases.

6.
Summary

The solution to the "multilingual DNS" problem, and to a series of other limitations of the DNS relative to today's expectations for naming and searching, lies in solutions targeted to those problems, rather than superimposing additional mechanisms on the DNS in ways that, those who advocate them hope, will not cause problems with older programs and unconverted infrastructure. Inserting new search layers avoids those risks and permits a clean solution that is adapted to the problems, rather than the limitations imposed by existing properties of the DNS.

7. IANA Considerations and related topics

At search layer two, it is difficult to think about how the system might function successfully without controlled vocabularies for each of the non-name facets. As discussed in section 2.2, we have already established one such registry (bound to an ISO standard), and mechanisms for utilizing it, with RFC 3066. The Madrid agreement and its predecessors [MADRID, NICE] provide classifications for types of businesses, but we would need to extend the registry for names that are not business-related. The two locational attributes are somewhat vague at this point, but controlled vocabularies would presumably be needed, and should, if possible, be drawn from stable, non-IETF, work (e.g., IS 3166-1 and 3166-2 might provide a foundation, and possibly a complete list, for the location vocabulary). Curiously, there is no technical reason why the name-strings themselves must be unique: that is one of the attractions of a model like this over attempting to overload the DNS. If conflicts or confusion occur, those are standard civil (marketplace or trademark) issues that can be resolved in their own environments, rather than posing special Internet problems.

8. Security Considerations

Additional layers of naming, searching, and databases imply addition of opportunities for compromising those databases and mechanisms.
Part of the challenge with the model implied here is to determine how to secure and authenticate those databases and access (especially modify access) to them. The good news is that, since the functions are new, we should be able to design security mechanisms in, rather than -- as with the DNS -- have to try to graft them on to a structure not designed for them.

9. References

Most of the references in this document are to examples of approaches to the systems outlined here, or provide additional information about the context of some of the suggestions, or are included to give credit for particular ideas or to better identify earlier and approaches. None of those references are normative in the protocol sense typically used in the IETF.

9.1. Normative References

[RFC882] Mockapetris, P.V., "Domain names: Concepts and facilities". RFC 882. Nov-01-1983.

[RFC883] Mockapetris, P.V. "Domain names: Implementation specification", RFC 883. Nov-01-1983.

[RFC1035] Mockapetris, P.V. "Domain names - implementation and specification", RFC 1035. Nov-01-1987.

[RFC2826] IAB. "IAB Technical Comment on the Unique DNS Root", RFC 2826. May 2000.

[RFC3066] Alvestrand, H. "Tags for the Identification of Languages", RFC 3066. January 2001.

9.2. Non-normative References

[Arrouye] Arrouye, Yves, et al. Work in progress, draft-arrouye-kls-00.txt, and unpublished BOF proposal.

[Austein] Austein, Rob. Private communication.

[CDNC] One or more of the TC<->SC works in progress, to be supplied.

[DNSROLE] Klensin, J., "Role of the Domain Name System", work in progress, draft-klensin-dns-role-01.txt.

[HOSTNAME] Harrenstien, K., M.K. Stahl, E.J. Feinler. "Hostname Server", RFC 0953, Oct-01-1985. Also Braden, R., ed. "Requirements for Internet Hosts - Application and Support", RFC 1123, October 1989.

[MADRID]

[Mealling-SLS] Mealling, M and L Daigle, "Service Lookup System (SLS)", work in progress, draft-mealling-sls-00.txt.

[NAMEPREP] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Host Names", work in progress, draft-ietf-idn-nameprep-04.txt

[WIPO-NICE] World Intellectual Property Organization, "Nice Agreement concerning the International Classification of Goods and Services for the Purposes of the Registration of Marks", June 1957.

[Netword] http://corp.netword.com/ -- real reference needed.

[NEWCLASS] Klensin, John, "Internationalizing the DNS -- A New Class", work in progress, draft-klensin-i18n-newclass-...

[RealNames] http://www.realnames.com/ -- real reference needed.

[RFC1591] Postel, J. "Domain Name System Structure and Delegation", RFC 1591, March 1994.

[RFC2345] Klensin, J, T. Wolf, G. Oglesby. "Domain Names and Company Name Retrieval", RFC 2345. May 1998. It is perhaps worth noting that, as in the case of many RFCs, descriptions of this work were widely circulated in draft form and discussed for a year or two before being published as an RFC.

[RFC2822] Resnick, P., Editor. "Internet Message Format", RFC 2822. April 2001.

[RFC2825] IAB, L. Daigle, ed. "A Tangled Web: Issues of I18N, Domain Names, and the Other Internet protocols", RFC 2825. May 2000.

[THES] IS 5127-1, IS 5127-2.

[RFC-URI] Berners-Lee, T., R. Fielding, L. Masinter. "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396. August 1998.

[WAIS] M. St. Pierre, J. Fullton, K. Gamiel, J. Goldman, B. Kahle, J. Kunze, H. Morris, F. Schiettecatte. "WAIS over Z39.50-1988", RFC 1625. June 1994.

[Z39] Z39.50, IS 23950.

10. Acknowledgements

This document, and the related notes, are the result of thinking that has come together and evolved since before the issue of internationalized access to domain names came onto the IETF's radar.
Discussions with a number of people have led to refinements in the approach or the text, even though some of them might not recognize their contributions or agree with the conclusions I have drawn from them (indeed, some of those discussions were rooted in challenges to the general ideas expressed here). Particularly important suggestions have come from, or arisen out of conversations with, Harald Alvestrand, Rob Austein, Fred Baker, Christine Borgman, Eric Brunner-Williams, Randy Bush, Vint Cerf, Kilnam Chon, Dave Crocker, Leslie Daigle, Patrik Faltstrom, Michael Froomkin, Francis Gurry, Marti Hearst, Paul Hoffman, Kenny Huang, Mao Wei, Michael Mealling, Gary Oglesby, Mike Padlipsky, Qian Huilin, James Seng, Theresa Swinehart, Tan Tin Wee, Len Tower, and Zita Wenzel, as well as some memorable long-ago conversations with Jon Postel and J.C.R. Licklider.

11. Author's Address

John C Klensin
AT&T Labs
99 Bedford St, 4th floor
Boston, MA 02111
USA
+1 617 574 3076
klensin@att.com

Expires May 2002