INTERNET-DRAFT                                John C. Klensin
May 28, 2001
Expires November 2001


			   A Search-based access model for the DNS
				   draft-klensin-dns-search-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document supplements a companion document [DNSROLE] on the role
of the DNS relative to the uses to which it is being put and is
intended to start laying the groundwork for a specific proposal.
Both documents, their successors, and closely-related issues, can be
discussed on the mailing list at ietf-i18n-dns-directory@imc.org.
See http://www.imc.org/ietf-i18n-dns-directory/ for subscription and
archival information.


Copyright Notice

Copyright (C) The Internet Society (2000).  All Rights Reserved.


0. Abstract

This memo discusses strategies for supporting "DNS searching" --
finding of names in the DNS by a layered mechanism that permits fuzzy
matching, selection that uses attributes or facets, and use of
descriptive terms. Demand for these facilities appears to be
increasing with growth in the Internet (and especially the web) and
with requirements to move beyond the restricted subset of ASCII names
that have been the traditional contents of DNS "Class=IN".  This
document proposes a three-level system for access to DNS names in
which the upper two levels involve search, rather than lookup,
functions. It also discusses some of the issues and challenges in
completing the design of, and deploying, such a system.

1. Introduction and Executive Summary

The notion of "DNS searching" is somewhat of an oxymoron: the DNS is
structured to only perform exact lookups of structured label strings.
But, as discussed elsewhere, there is considerable demand for
searching facilities -- partial and fuzzy matching, selection that
uses attributes or facets, and searching using descriptive terms--
and that demand appears to be increasing with growth in the Internet
(and especially the web) and with requirements to move beyond the
restricted subset of ASCII names that have been the traditional
contents of DNS Class=IN.  This document proposes a three-level
system for access to DNS names in which the upper two levels involve
search, rather than lookup, functions. It also discusses some of the
issues and challenges in completing the design of, and deploying,
such a system.

It has been suggested that introducing a "directory" or "keywords"
into, or above, the DNS could be used as a solution to the IDN
problem and, often, several others.  Probing statements about
"directories" often quickly demonstrates that their advocates don't
agree on what they mean.  This section outlines a three-layer
search/lookup model (adding two layers to the DNS, i.e., constructing
a three-layer model, rather than continuing with the single one we
have today).  Those layers consist of the current DNS, a
search-capable layer using an extremely simple set of facets, and a
layer capable of broader search approaches in a localized context. It
is intended as a strawman for criticism and development, rather than
as a specific proposal.  I.e., the details are left for WG efforts.

The document suggests that it is better to add two layers to the DNS
--constructing a three-layer model-- rather than just one.  And I'm
going to try to avoid using the word "directory" as if it meant
anything, but I presume readers will insert that term when it matches
their perceptions.

This document is a preliminary proposal -- a framework and fodder for
a working group or design team-- rather than a complete specification
or even an approximation to one.

2. A three (or four) -layer environment.

The material below suggests three or more layers:

   (1) The DNS, with the existing lookup mechanisms

   (2) A restricted, facet-based, search system.

   (3) Commercial, localized, and potentially topic-specific, search
   environments. 

   (4) Something else?


2.1.  Layer one: Identifiers -- a lookup system and the DNS.

In this model, the DNS remains largely as is (see section 3.3ff) or,
perhaps, a bit closer to its original purpose and assumption than the
direction in which it has evolved in recent years.  I.e., it is a
distributed database, with precise lookups, whose lookup keys are
identifiers for Internet hosts and other objects.  We give up, or at
least try to give up, the notion that these identifiers should also
serve as human-useful names.

   As an aside, note that some people have suggested that we
   should dehumanize DNS names entirely, e.g., prohibit the
   registration and use of any name that can be found in any
   dictionary for any language that can be represented in the
   DNS-acceptable character set.  This proposal doesn't
   include that idea.  But it is absent primarily because I
   don't think the transition process is worth the time it
   would take to explore rather than because it has no appeal.

The goal at this layer is relatively simple, unique, identifiers.  It
is probably desirable that these identifiers be able to have some
human mnemonic value, but less important that they be tightly bound
to real-world names and descriptions.

The inputs and outputs at this layer are as they are in the DNS
today, although modifications to accommodate non-hosttable format
names there remain possible if that is deemed important.
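
A minimal sketch of the lookup-only behavior at this layer follows.
It models a zone as an in-memory table; the table contents and the
function name lookup_exact are hypothetical and are not part of the
DNS or of this proposal.  The only point is that a query either
matches a stored identifier exactly (letter case aside) or fails,
with no near-match fallback.

```python
# Hypothetical in-memory stand-in for a DNS zone; the names and
# addresses are illustrative only (example.com, 192.0.2.0/24).
ZONE = {
    "www.example.com": "192.0.2.10",
    "mail.example.com": "192.0.2.25",
}


def lookup_exact(name):
    """Return the stored value for name, or None if nothing matches exactly.

    Comparison ignores letter case and trailing dots, but nothing else:
    a near miss is simply a failed lookup.
    """
    key = name.rstrip(".").lower()
    return ZONE.get(key)


if __name__ == "__main__":
    print(lookup_exact("WWW.Example.COM."))   # exact match, case aside
    print(lookup_exact("ww.example.com"))     # one character off: no answer
```

Anything more forgiving than this belongs, in this model, to the
layers above.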


2.2 Layer two: Names -- a faceted search system with a small number
of facets.

Much of the current burden borne by the DNS would appear to be better
localized in a search system that contains names and a small number
of facets/attributes.  This burden includes a whole range of
non-identifier goals and constraints: names that a user can
understand and find and that have significant mnemonic value, names
with trademark implications, a wide variety of naming systems and, in
general, helping people find the things for which they are looking.
It is critical that the number of attributes be constrained to a
minimal set --and that other attributes, especially those of special
interest, be deferred to the third layer. 

It is probably most useful to think about this layer in terms of a
structured, multifaceted, multihierarchical, thesaurus-like database
with search capability (Cf. ISO IS 5127-1 and IS 5127-6 [THES]),
rather than as a "directory" in the sense of X.500 and its
derivatives and antagonists.

The key question is what facets to use once the commercial product
requirements are removed (to layer three, see below).  It appears to
me that, to satisfy the critical name-uniqueness and real-world
pressures on the DNS, candidates might be

     name (IS 10646, see below)
     language (presumably per RFC 3066)
     geographical location (country and/or, for some federal
        countries, country/province ("state"); granularity is
        important, and there may be a case for an additional facet
        in a coordinate system)
     network location (if we can figure out what that means
        and how to express it in a canonical way)
     industry category code (for companies, presumably derived
        from the Madrid Treaty [MADRID] list; the list would
        need to be extended to deal with non-commercial
        organizations and entities and for identifying
        resources and services associated with people)

This typology gives the trademark view of the world somewhat more
precedence in looking at name conflict issues than one might like in
principle.  But, in practice, one of the key issues we have
encountered in trying to store "names", rather than identifiers, in
the DNS is that the process unreasonably flattens the space.  That
"Joe's Auto Repair" and "Joe's Pizza" can co-exist in the same
geographical area without conflict or confusion and that "Joe's
Pizza" in one area can co-exist with "Joe's Pizza" in another, again
without conflict or confusion, are the consequence of the way we name
and identify things in the real world.  Most trademark rules are the
consequence of those naming systems, not their cause.
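
The flattening effect can be illustrated with a small sketch.  The
field names, the sample entries, and the category strings are
assumptions made for this example only (the network-location facet is
omitted for brevity); nothing here is a proposed schema.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One layer-two tuple; the field names are assumptions for this sketch."""
    name: str
    language: str   # e.g. an RFC 3066 tag
    country: str    # e.g. a country code
    locality: str   # finer-grained geographical location
    category: str   # placeholder for an industry category code


entries = [
    Entry("Joe's Pizza", "en", "US", "Boston", "restaurant"),
    Entry("Joe's Pizza", "en", "US", "Chicago", "restaurant"),
    Entry("Joe's Auto Repair", "en", "US", "Boston", "auto-repair"),
    Entry("Joe's", "en", "US", "Boston", "restaurant"),
    Entry("Joe's", "en", "US", "Boston", "auto-repair"),
]

# Flattened view: only the name is available to tell entries apart.
distinct_names = {e.name for e in entries}

# Faceted view: the whole tuple is the key, so entries that share a name
# remain distinct as long as some other facet differs.
distinct_tuples = set(entries)

if __name__ == "__main__":
    print(len(entries), "entries,", len(distinct_names), "distinct names,",
          len(distinct_tuples), "distinct tuples")
```

With the name as the only key, the five businesses collapse onto
three strings; with the facets retained, all five remain
distinguishable.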

It is not intended that this level act as a white pages service for
people.  Doing so leads down several slippery slopes at once,
including heightened privacy concerns and a stronger requirement for
URL targets rather than DNS label ones (see below).

The names in this environment can reasonably be written in IS 10646
codes or some recoding of them.  Since we would be starting more or
less from scratch, we could select lengths and codings for maximum
efficiency and utility, not to meet the constraints of existing
software.  In such a context, this author has a slight bias for
direct UCS-4 coding, rather than ASCII-compatible ("ACE") codes;
compressed, null-octet-eliminating, systems such as UTF-8; or
surrogate introducers to hold things to 16 bits.  The loss in
transport efficiency is likely to be more than compensated for by
gains in cleanliness and equal treatment of all scripts.  But that
issue is separate from the main and important design arguments of
this document.
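
The size of the transport cost can be made concrete with a short
sketch.  It uses Python's "utf-32-be" codec as a stand-in for direct
UCS-4 coding (an assumption; the two agree for the characters used
here), and the sample strings are illustrative only.

```python
def encoded_sizes(name):
    """Return the byte length of name in three encodings of IS 10646."""
    return {
        "UTF-8": len(name.encode("utf-8")),
        "UTF-16": len(name.encode("utf-16-be")),
        "UCS-4": len(name.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # An ASCII name, a Latin name with one non-ASCII letter, and a single
    # character outside the 16-bit range (U+10400), which UTF-16 must
    # represent as a surrogate pair.
    for sample in ("Joe's", "Z\u00fcrich", "\U00010400"):
        print(repr(sample), encoded_sizes(sample))
```

For ASCII-heavy names the four-octet form is several times larger
than UTF-8, while every character, in every script, costs the same
four octets; that uniformity is the "equal treatment" referred to
above.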

The work done for "nameprep" in the IDN WG is almost certainly
relevant to determining which names to actually store in the
database.  But the stakes are lower here than the "get it right or
fail completely" constraint of the DNS lookup environment: one can
imagine search mechanisms that would apply a more liberal set of
matching rules (and/or localized and language-specific ones) than the
rules used to encode names (much like recent applications protocols
that explicitly distinguish between the formats one is permitted to
send and those one is expected to accept (Cf. [RFC2822])).
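
The distinction between a strict storage form and a more liberal
matching rule might be sketched as follows.  The normalization used
for storage (NFKC plus case folding) is only a stand-in and is not
nameprep; the function names and the accent-stripping rule are
assumptions chosen for illustration.

```python
import unicodedata


def stored_form(name):
    """Strict canonical form used when a name is entered in the database.

    NFKC plus case folding is only a stand-in here; it is not nameprep.
    """
    return unicodedata.normalize("NFKC", name).casefold()


def match_key(name):
    """Looser key used at search time: also ignore accents and punctuation."""
    decomposed = unicodedata.normalize("NFKD", stored_form(name))
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and (ch.isalnum() or ch.isspace())
    )


def liberal_match(query, stored_names):
    """Return the stored names whose loose key equals that of the query."""
    wanted = match_key(query)
    return [n for n in stored_names if match_key(n) == wanted]


if __name__ == "__main__":
    names = [stored_form("Caf\u00e9 M\u00fcller"), stored_form("Cafe Miller")]
    print(liberal_match("CAFE MULLER", names))
```

The stored forms stay distinct and precise, while a query typed
without accents still finds the accented entry.  A localized or
language-specific service could substitute a different match_key
without touching what is stored.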

As is common with systems of this type, we would anticipate the
possibility of searching on any of the attributes and that searching
on free-text strings would not be exact (i.e., near-match responses
could be returned using any of several algorithms, with the user
making choices).  As is equally common, we should think about user
interfaces that store both queries and response sets so that the
responses could be used offline and refreshed when the client systems
were attached to the Internet.
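
As a rough sketch of such a search, the fragment below filters on
exact facet values and applies approximate matching to the free-text
name using Python's standard difflib module.  The record layout, the
sample data, the function name search, and the 0.6 similarity cutoff
are all assumptions for illustration; any of several matching
algorithms could take difflib's place.

```python
import difflib

# Hypothetical layer-two records: facets plus the DNS name they lead to.
RECORDS = [
    {"name": "joe's pizza", "country": "US", "category": "restaurant",
     "dns": "joespizza.example.com"},
    {"name": "joe's auto repair", "country": "US", "category": "auto-repair",
     "dns": "joesauto.example.net"},
    {"name": "joe's pizza", "country": "GB", "category": "restaurant",
     "dns": "joes-pizza.example.org"},
]


def search(name=None, cutoff=0.6, **facets):
    """Return records matching every given facet exactly and, if a name is
    given, approximately (difflib similarity ratio >= cutoff)."""
    hits = [r for r in RECORDS
            if all(r.get(k) == v for k, v in facets.items())]
    if name is not None:
        candidates = {r["name"] for r in hits}
        close = difflib.get_close_matches(name.lower(), candidates,
                                          n=len(RECORDS), cutoff=cutoff)
        hits = [r for r in hits if r["name"] in close]
    return hits


if __name__ == "__main__":
    # A misspelled name narrowed by one facet.
    for record in search("joes piza", country="GB"):
        print(record)
```

A misspelled query without facets returns every plausible candidate
and leaves the choice to the user; adding a facet value narrows the
set.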

In summary, the goal at this layer is to provide unique tuples of
human-recognizable (not just mnemonic) names, but names that are
unique within a context, rather than a global system based on the
names alone.

The inputs at this layer are search values for one or more of the
facets.  The outputs are still controversial, but would appear to
best be the full facet set of the matched tuple(s) and one or more
DNS names.  One of many interesting questions is whether this layer
should pass through and return the DNS records themselves (labels,
class, type, and target) or whether it should return names (labels)
and let the applications do the DNS lookups.  Another possibility is
to return one or more URLs (or more general URIs?) rather than DNS
names.  Doing so increases flexibility but at the cost of greater
complexity and risk of recursion problems.

One possibility would be to create a URI for DNS record information
and use it to abstract this return information into something
applications can then specify or decode as appropriate.  Use of this
would need to be carefully structured to avoid complex problems, but
might be a reasonable approach.

Experience with the DNS and other distributed databases also argues
persuasively that these records are not forever.  Unless there are no
local copying and caching mechanisms (which seems unlikely and hard
to enforce), some type of time to live (TTL) or other expiration or
reverification mechanism will be needed.
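
One way such an expiration mechanism could look on the client side
is sketched below.  The class name TTLCache, its methods, and the
injectable clock are assumptions for illustration; the sketch says
nothing about how a TTL value would be carried in a protocol.

```python
import time


class TTLCache:
    """Local copy of search results that expires after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}

    def put(self, query, results):
        """Remember results for query until ttl seconds from now."""
        self._store[query] = (self.clock() + self.ttl, results)

    def get(self, query):
        """Return cached results, or None if absent or expired."""
        item = self._store.get(query)
        if item is None:
            return None
        expires_at, results = item
        if self.clock() >= expires_at:
            del self._store[query]   # stale: caller must re-verify
            return None
        return results


if __name__ == "__main__":
    cache = TTLCache(ttl=3600)
    cache.put(("joe's", "UK"), ["joes.example.org"])
    print(cache.get(("joe's", "UK")))
```

A None result tells the caller to repeat the search, which is the
reverification step; offline use simply means answering from the
cache until entries expire.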

2.3.  Layer three: locality and/or content-domain-specific
lookup mechanisms.

The problem with the second-layer model is that there are a number of
usability and marketplace pressures for naming systems that offer
finer granularity and better match user needs.  Interestingly, those
systems which have been included in experiments or partially deployed
(see, e.g., [RFC2345], [Netword], and [RealNames]) have demonstrated
that these systems require contextual localization, not a single
global environment.  There are many causes for this, but needs for
very specific searches (specific to a geographic area, a topic area,
or a language or culture) tend to dominate the list.

The issue is perhaps illustrated by an example.  Suppose the
granularity of an entry at the second level is

  {"Joe's", "UK", Restaurant,... } 

Now, I might want to create a business around a restaurant directory
for Bristol.  I would probably want to construct a database that
contained exact locations, type of food, menu information, prices,
etc., and permit people to query it that way.  That type of product
bears a strong relationship to traditional yellow pages services: the
right attributes to collect and the right way to organize them will
differ by topic (e.g., "menu" has no obvious analogue in an automobile
repair shop) and the business models are fairly established.

One can imagine many different types of keyword and (yellow
pages-like) directory services at this level, using different types
of protocol mechanisms as well as different types of database content
and schema.  But those services are nearly ideal candidates for
competition: there is no requirement that either the providers or the
services be global or unique or even highly standardized.  Having all
three layers bound to the same data sources --inheriting values from
them if one wants to think about it that way-- would provide a degree
of consistency that might be very attractive to users, so there are
clearly issues here that will need to be worked out in the
marketplace.

Inputs at the third layer will differ by service: one can imagine
free-text interfaces and menus (but see section 2.4) as well as
systems that more closely resemble faceted search terms.  Outputs
will normally be layer-two names or strings to preserve name and
reference portability, or might be URIs containing such names.
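
To make the relationship between the layers concrete, here is a
sketch of a hypothetical third-layer restaurant guide.  The
attributes (cuisine, price, area), the sample listings, and the
function name find_restaurants are invented for this example; the
point is only that the service accepts its own kind of query and
answers with layer-two tuples rather than with DNS names.

```python
# Hypothetical layer-three records for a Bristol restaurant guide.  Each
# carries service-specific attributes plus a reference to a layer-two tuple.
LISTINGS = [
    {"cuisine": "pizza", "max_price": 12, "area": "Clifton",
     "layer2": ("Joe's", "en", "UK", "Restaurant")},
    {"cuisine": "curry", "max_price": 15, "area": "Easton",
     "layer2": ("Raj's", "en", "UK", "Restaurant")},
    {"cuisine": "pizza", "max_price": 25, "area": "Harbourside",
     "layer2": ("Bella", "en", "UK", "Restaurant")},
]


def find_restaurants(cuisine, budget):
    """Return the layer-two tuples of listings that serve the given cuisine
    at or under the given price."""
    return [item["layer2"] for item in LISTINGS
            if item["cuisine"] == cuisine and item["max_price"] <= budget]


if __name__ == "__main__":
    print(find_restaurants("pizza", 15))
```

Because the answer is a layer-two reference, the guide's database
need not change when the restaurant's DNS name does.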

Summary: Just as the monohierarchical identifier-lookup system at the
first (bottom) level should be supplemented by a multilingual,
multifaceted, multihierarchy search system at the second, that second
level system should be supplemented by a collection of localized,
subject- and topic- specific systems at the third.  These third-level
systems need not be centrally coordinated in any way, although some
similarity of function and interface would almost certainly make them
more consistent for users and easier to market.  

2.4. A layer above the third: free-text searching applications. 

The approaches described above omit one set of techniques used today:
"web searches" on full text or its equivalent.  These systems have an
important role (and, similar to the third level, there seems no
particular advantage to trying to standardize them worldwide).  But
their disadvantage, if seen as a DNS surrogate or replacement, is
that they have difficulty distinguishing between the name of
something, a pointer to it, and a reference or discussion of it or
how it works.  

If, for example, one is looking for a web site for a company, the
third level would presumably find that site.  The second (or even the
DNS) might find it with some guessing, but this fourth level would
(as web search engines do today) probably not distinguish the
company's site from sites that reference the company or its products.

Layer three produces information that is explicitly bound to the
query, i.e., what one is looking for, while a search engine returns
values that also include sites where the subject of the query might
have been mentioned.


3 Context and directions

3.1 The data search and access model

It is interesting that recent IETF "directory" work has focused on
accessing mechanisms without worrying intensely about the underlying
database content, maintenance, and update issues.  Those issues seem
to be the harder ones, i.e., the choice between LDAP and CNRP may
matter less than how we structure, maintain, and distribute
the relevant data.

Of course, that does not suggest that the work is not
important or that it isn't required.  And, to deploy the model
suggested above, we will need to deal with a pair of
uncomfortable problems:

     * CNRP looks interesting, but has not been widely implemented or
	 deployed in production. 

     * LDAP is widely deployed, but primarily in implementations that
	 contain sufficient extensions and special features to be
	 non-interoperable. 

If we are going to choose -- and layer two certainly implies a
choice-- we need to figure out how to do that.


3.2 Implications of uniqueness of name structures at the
second layer.

The IAB's discussion of DNS root uniqueness [RFC2826] argues that DNS
names must be unique, i.e., that there must not be alternate or
surrogate root structures if the Internet is to survive as a seamless
whole and be universally addressable and accessible.   Even with
imprecise matching, similar arguments apply at level two, especially
if this is the first level at which names in natural languages (hence
including multilingual names), rather than constrained identifiers,
appear.

Because the name structures at the second level still must be unique,
some mechanism for registries or structuring of names will be
necessary to avoid conflicts.  The problem is somewhat easier than
the ones encountered by ICANN and its associated groups because the
very structuring of the names and attributes creates opportunities
for dividing up responsibilities, but the registration problems exist
nonetheless and will need to be resolved.
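
A toy sketch of how the facet structure could divide registration
responsibility follows.  Partitioning by the country facet, the class
name Registry, and the four-field tuple are assumptions made for
illustration only; the actual division of responsibilities is one of
the open problems noted above.

```python
class Registry:
    """Toy registry that delegates by the country facet and refuses
    duplicate tuples within each delegated partition."""

    def __init__(self):
        self._partitions = {}   # country code -> set of registered tuples

    def register(self, name, language, country, category):
        """Record the tuple; return False if it is already registered."""
        entry = (name.casefold(), language, country, category)
        partition = self._partitions.setdefault(country, set())
        if entry in partition:
            return False        # conflict: this exact tuple is taken
        partition.add(entry)
        return True


if __name__ == "__main__":
    registry = Registry()
    print(registry.register("Joe's", "en", "UK", "restaurant"))
    print(registry.register("JOE'S", "en", "UK", "restaurant"))
    print(registry.register("Joe's", "en", "US", "restaurant"))
```

Each partition can be operated independently, since a conflict can
only arise between tuples that share the delegating facet.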

3.3 Deployment against DNS base

As with the "new class" approach to DNS changes [NEWCLASS], the
approach outlined here does not require any changes to the
existing installed DNS base.  But, like all solutions to the
multilingual name issues, it requires changes to all relevant
applications.  The notion of moving from lookup to searching
does imply that we will need, not merely to change the code
that calls the name resolution system, but to rethink the UIs
of those applications.

3.4 Older applications

To fully realize internationalized naming requires changing all
applications to understand the new method, whatever it is.  Older
applications will see distorted and unfriendly names under some
systems, and no names at all under others.  The environment
contemplated here is a "no names" one -- applications that have not
been upgraded will not see internationalized names or other
natural-language phrases.  The advantages of this are that it avoids
confusion and, as with the original host table to DNS conversion,
provides an incentive to convert old applications to make newer
naming styles visible.  None of these transitions are ever easy, but
it may be worth going through this one to get things right, rather
than investing a large fraction of the pain to get a solution that
doesn't quite do the job.


3.5 Why not just a keyword system

As suggested above, the term "keyword system" is used to refer to
many different things.  Many would fit nicely into the layer three
environment, but most of the existing proposals put them directly on
top of the DNS, or skip the DNS entirely and go directly to IP
addresses.  The difficulty with these systems is that they either
must be localized (e.g., a different system or database for each
language, country, or smaller locality) or they don't scale well.  In
particular, they eventually either suffer from the "all the good
names are taken" problem of which the DNS is frequently accused or
yield poor search precision.


4  Summary

The solution to the "multilingual DNS" problem, and to a series of
other limitations of the DNS relative to today's expectations for
naming and searching, lies in solutions targeted to those problems,
rather than in superimposing additional mechanisms on the DNS in ways
that, we hope, will not cause problems with older programs and
unconverted infrastructure.  Inserting new layers avoids those risks
and permits a clean solution that is adapted to the problems, rather
than the limitations imposed by existing properties of the DNS.


5 IANA Considerations and related topics

At layer two, it is difficult to think about how the system might
function successfully without controlled vocabularies for each of the
non-name facets.  As discussed in section 2.2, we have already
established one such registry (bound to an ISO standard), and
mechanisms for utilizing it, with RFC 3066.  The Madrid agreement
provides classifications for types of businesses, but we would need
to extend the registry for names that are not business-related.  The
two locational attributes are somewhat vague at this point, but
controlled vocabularies would presumably be needed, and should, if
possible, be drawn from stable, non-IETF, work (e.g., IS 3166-1 and
3166-2 might provide a foundation, and possibly a complete list, for
the location vocabulary).  Curiously, there is no technical reason why
the names themselves must be unique: that is one of the attractions
of a model like this over attempting to overload the DNS.  If
conflicts or confusion occur, those are standard civil (marketplace
or trademark) issues that can be resolved in their own environments,
rather than posing special Internet problems.


6 Security Considerations

Additional layers of naming, searching, and databases imply
additional opportunities for compromising those databases and
mechanisms.
Part of the challenge with the model implied here is to determine how
to secure and authenticate those databases and access (especially
modify access) to them.  The good news is that, since the functions
are new, we should be able to design security mechanisms in, rather
than --as with the DNS-- have to try to graft them on to a structure
not designed for them.

7 References

[DNSROLE] Klensin, John. "Role of the Domain Name System",
work in progress, draft-klensin-dns-role-...

[MADRID]

[Netword] http://corp.netword.com/ -- real reference needed.

[NEWCLASS] Klensin, John, "Internationalizing the DNS -- A New
Class", work in progress, draft-klensin-i18n-newclass-...

[RealNames] http://www.realnames.com/ -- real reference needed.

[RFC882] Mockapetris, P.V., "Domain names: Concepts and facilities".
RFC 882.  Nov-01-1983.

[RFC883] Mockapetris, P.V. "Domain names: Implementation
specification", RFC 883. Nov-01-1983.

[RFC1035] Mockapetris, P.V. "Domain names - implementation and
specification", RFC 1035. Nov-01-1987.

[RFC2345] Klensin, J, T. Wolf, G.  Oglesby. "Domain Names and Company
Name Retrieval", RFC 2345. May 1998.

[RFC2822] Resnick, P., Editor. "Internet Message Format", RFC 2822.
April 2001. 

[RFC2826] IAB. "IAB Technical Comment on the Unique DNS Root", RFC
2826.  May 2000.

[RFC3066] Alvestrand, H. "Tags for the Identification of Languages",
RFC 3066. January 2001.

[THES] IS 5127-1, IS 5127-2.

[WAIS] M. St. Pierre, J. Fullton, K. Gamiel, J.  Goldman, B.
Kahle, J. Kunze, H. Morris, F. Schiettecatte.  "WAIS over
Z39.50-1988", RFC 1625.  June 1994.

[Z39] Z39.50, IS 23950.

8 Acknowledgements

This document, and the related notes, are the result of thinking that
has come together and evolved since before the issue of
internationalized access to domain names came onto the IETF's radar.
Discussions with a number of people have led to refinements in the
approach or the text, even though some of them might not recognize
their contributions or agree with the conclusions I have drawn from
them (indeed, some of those discussions were rooted in challenges to
the general ideas expressed here).  Particularly important
suggestions have come from, or arisen out of conversations with,
Harald Alvestrand, Rob Austein, Fred Baker, Eric Brunner-Williams,
Randy Bush, Vint Cerf, Kilnam Chon, Dave Crocker, Leslie Daigle,
Patrik Fältström, Michael Froomkin, Francis Gurry, Paul Hoffman,
Kenny Huang, Mao Wei, Michael Mealing, Gary Oglesby, Qian Huilin,
James Seng, Theresa Swinehart, Len Tower, and Zita Wenzel as well as
some long-ago conversations with Jon Postel and J.C.R. Licklider.


9 Author's Address

John C Klensin
AT&T Labs
99 Bedford St, 4th floor
Boston, MA 02111 USA
+1 617 574 3076
klensin@att.com
