< draft-rfced-info-moats-02.txt   draft-rfced-info-moats-03.txt >
Internet-Draft Ryan Moats Internet-Draft Ryan Moats
draft-rfced-info-moats-02.txt Rick Huber draft-rfced-info-moats-03.txt Rick Huber
Expires in six months AT&T Expires in six months AT&T
October 1998 December 1998
Building Directories from DNS: Experiences from WWWSeeker Building Directories from DNS: Experiences from WWWSeeker
Filename: draft-rfced-info-moats-02.txt Filename: draft-rfced-info-moats-03.txt
Status of This Memo Status of This Memo
This document is an Internet-Draft. Internet-Drafts are working This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its documents of the Internet Engineering Task Force (IETF), its
areas, and its working groups. Note that other groups may also areas, and its working groups. Note that other groups may also
distribute working documents as Internet-Drafts. distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other months and may be updated, replaced, or obsoleted by other
skipping to change at page 1, line 34 skipping to change at page 1, line 34
To learn the current status of any Internet-Draft, please check To learn the current status of any Internet-Draft, please check
the ``1id-abstracts.txt'' listing contained in the Internet- the ``1id-abstracts.txt'' listing contained in the Internet-
Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net
(Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East (Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US East
Coast), or ftp.isi.edu (US West Coast). Coast), or ftp.isi.edu (US West Coast).
Abstract Abstract
There has been much discussion and several documents written about There has been much discussion and several documents written about
the need for an Internet Directory. Recently, this discussion has the need for an Internet Directory. Recently, this discussion has
focussed on ways to discover an organization's domain name without focused on ways to discover an organization's domain name without
relying on use of DNS as a directory service. This draft discusses relying on use of DNS as a directory service. This draft discusses
lessons that were learned during InterNIC Directory and Database lessons that were learned during InterNIC Directory and Database
Services' development and operation of WWWSeeker, an application that Services' development and operation of WWWSeeker, an application that
finds a web site given information about the name and location of an finds a web site given information about the name and location of an
organization. The back end database that drives this application was organization. The back end database that drives this application was
built from information obtained from domain registries via WHOIS and built from information obtained from domain registries via WHOIS and
other protocols. We present this information to help future other protocols. We present this information to help future
implementors avoid some of the blind alleys that we have already implementors avoid some of the blind alleys that we have already
explored. This work builds on the Netfind system that was created by explored. This work builds on the Netfind system that was created by
Mike Schwartz and his team at the University of Colorado at Boulder Mike Schwartz and his team at the University of Colorado at Boulder
[1]. [1].
INTERNET DRAFT Building Directories from DNS: Experiences from INTERNET DRAFT Building Directories from DNS: Experiences from
WWWSeeker October 1998 WWWSeeker December 1998
1. Introduction 1. Introduction
Over time, there have been several RFCs [2, 3, 4] about approaches Over time, there have been several RFCs [2, 3, 4] about approaches
for providing Internet Directories. Many of the earlier documents for providing Internet Directories. Many of the earlier documents
discussed white pages directories that supply mappings from a discussed white pages directories that supply mappings from a
person's name to their telephone number, email address, etc. person's name to their telephone number, email address, etc.
More recently, there has been discussion of directories that map from More recently, there has been discussion of directories that map from
a company name to a domain name or web site. Many people are using a company name to a domain name or web site. Many people are using
skipping to change at page 3, line 5 skipping to change at page 3, line 5
2. Directory Population 2. Directory Population
2.1 What to do? 2.1 What to do?
There are two issues in populating a directory: finding all the There are two issues in populating a directory: finding all the
domain names (building the skeleton) and associating those domains domain names (building the skeleton) and associating those domains
with entities (adding the meat). These two issues are discussed with entities (adding the meat). These two issues are discussed
below: below:
2.2 Building the skeleton 2.2 Building the skeleton
In "building the skeleton," it is popular to suggest using a variant In "building the skeleton," it is popular to suggest using a variant
of a "tree walk" to determine the domains that need to be added to of a "tree walk" to determine the domains that need to be added to
the directory. Our experience is that this is neither a reasonable the directory. Our experience is that this is neither a reasonable
nor an efficient proposal for maintaining such a directory. Except nor an efficient proposal for maintaining such a directory. Except
for some infrequent and long-standing DNS surveys [5], DNS "tree for some infrequent and long-standing DNS surveys [5], DNS "tree
walks" tend to be discouraged by the Internet community, especially walks" tend to be discouraged by the Internet community, especially
given that the frequency of DNS changes would require a new tree walk given that the frequency of DNS changes would require a new tree walk
monthly (if not more often). Instead, our experience has shown that monthly (if not more often). Instead, our experience has shown that
data on allocated DNS domains can usually be retrieved in bulk data on allocated DNS domains can usually be retrieved in bulk
fashion with FTP, HTTP, or Gopher (we have used each of these for fashion with FTP, HTTP, or Gopher (we have used each of these for
particular TLDs). This has the added advantage of both "building the particular TLDs). This has the added advantage of both "building the
skeleton" and "adding the meat" at the same time. skeleton" and "adding the meat" at the same time. Our favorite
method for finding a server that has allocated DNS domain information
is to start with the list maintained at
http://www.alldomains.com/countryindex.html and go from there.
Before this was available, it was necessary to hunt for a registry
using trial and error.
When maintaining the database, existing domains may be verified via When maintaining the database, existing domains may be verified via
direct DNS lookups rather than a "tree walk." "Tree walks" should direct DNS lookups rather than a "tree walk." "Tree walks" should
therefore be the choice of last resort for directory population, and therefore be the choice of last resort for directory population, and
bulk retrieval should be used whenever possible. bulk retrieval should be used whenever possible.
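As an illustration of verifying known domains by direct lookup instead of a tree walk, here is a minimal Python sketch. It is not the WWWSeeker code. The function name verify_domains and the injectable resolve parameter are assumptions made for this example. It also assumes that an address lookup is an acceptable stand-in for "the domain still exists"; a delegated domain with no address record at its apex would be reported as missing, and an NS or SOA query (which needs a resolver library outside the Python standard library) would be more faithful.

```python
import socket


def verify_domains(domains, resolve=socket.gethostbyname):
    """Return {domain: bool} using one direct lookup per already-known domain.

    Assumption: a successful address lookup means the domain still exists.
    `resolve` is injectable so the logic can be exercised without a network.
    """
    status = {}
    for name in domains:
        try:
            resolve(name)
            status[name] = True
        except OSError:  # socket.gaierror and socket.herror derive from OSError
            status[name] = False
    return status
```

Each name costs one query against the ordinary resolver path, so the load on any single name server stays far below that of a zone-by-zone walk.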
2.3 Adding the meat 2.3 Adding the meat
A possibility for populating a directory ("adding the meat") is to A possibility for populating a directory ("adding the meat") is to
use an automated system (like a spider) that uses the WHOIS protocol use an automated system that makes repeated queries using the WHOIS
to gather information about the organization that owns a domain. At protocol to gather information about the organization that owns a
the conclusion of the InterNIC Directory and Database Services domain. The queries would be made against a WHOIS server located
project, our backend database contained about 2.9 million records with the above method. At the conclusion of the InterNIC Directory
built from data that could be retrieved via WHOIS. The entire and Database Services project, our backend database contained about
database contained 3.25 million records, with the additional records 2.9 million records built from data that could be retrieved via
coming from sources other than WHOIS. WHOIS. The entire database contained 3.25 million records, with the
additional records coming from sources other than WHOIS.
In our experience this information contains many factual and In our experience this information contains many factual and
typographical errors and requires further examination and processing typographical errors and requires further examination and processing
to improve its quality. Further, TLD registrars that support WHOIS to improve its quality. Further, TLD registrars that support WHOIS
typically only support WHOIS information for second level domains typically only support WHOIS information for second level domains
(e.g. ne.us) as opposed to lower level domains (e.g. (e.g. ne.us) as opposed to lower level domains (e.g.
windrose.omaha.ne.us). Also, there are TLDs without registrars, TLDs windrose.omaha.ne.us). Also, there are TLDs without registrars, TLDs
without WHOIS support, and still other TLDs that use other methods without WHOIS support, and still other TLDs that use other methods
(HTTP, FTP, gopher) for providing organizational information. Based (HTTP, FTP, gopher) for providing organizational information. Based
on our experience, an implementor of an internet directory needs to on our experience, an implementor of an internet directory needs to
support multiple protocols for directory population. A WHOIS spider support multiple protocols for directory population. An automated
is necessary, but isn't enough. WHOIS search tool is necessary, but isn't enough.
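A single automated WHOIS query can be sketched as follows. This is a hedged Python illustration, not the tool used for WWWSeeker. WHOIS runs over TCP port 43: the client sends one query line terminated by CRLF and reads until the server closes the connection. The names build_whois_request and whois_lookup, the connect parameter, and the server name in the usage note are all hypothetical.

```python
import socket


def build_whois_request(domain):
    """A WHOIS request is the query text followed by CRLF."""
    return domain.strip().encode("ascii") + b"\r\n"


def whois_lookup(domain, server, port=43, timeout=10.0,
                 connect=socket.create_connection):
    """Send one WHOIS query and return the raw reply as text.

    `connect` is injectable (an assumption of this sketch) so the function
    can be tested without contacting a real registry.
    """
    with connect((server, port), timeout) as conn:
        conn.sendall(build_whois_request(domain))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    # Decode permissively: replies are free text and need not be clean ASCII.
    return b"".join(chunks).decode("latin-1")
```

A call such as whois_lookup("ne.us", "whois.example.net") returns unstructured text whose layout differs from registry to registry, which is one reason the retrieved data needs the further processing described above.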
3. Directory Updating: Full Rebuilds vs Incremental Updates 3. Directory Updating: Full Rebuilds vs Incremental Updates
Given the size of our database in April 1998 when it was last Given the size of our database in April 1998 when it was last
generated, a complete rebuild of the database that is available from generated, a complete rebuild of the database that is available from
WHOIS lookups would require between 11.6 million and 14.5 million WHOIS lookups would require between 134.2 to 167.8 days just for
seconds of time just for WHOIS lookups from a Sun SPARCstation 20. WHOIS lookups from a Sun SPARCstation 20. This estimate does not
This estimate does not include other considerations (for example, include other considerations (for example, inverting the token tree
inverting the token tree required about 24 hours processing time on a required about 24 hours processing time on a Sun SPARCstation 20)
Sun SPARCstation 20) that would increase the amount of time to that would increase the amount of time to rebuild the entire
rebuild the entire database. database.
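The two versions state the same estimate in different units: seconds in -02, days in -03. The short Python check below shows the conversion. The per-lookup figure it derives (about 4 to 5 seconds per query) is an inference from dividing the totals by the 2.9 million WHOIS-derived records mentioned in section 2.3; it is not a number stated in either draft.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Totals from the -02 text.
low_seconds, high_seconds = 11.6e6, 14.5e6

# Record count from section 2.3 (records retrievable via WHOIS).
whois_records = 2.9e6

low_days = seconds_to_days(low_seconds)
high_days = seconds_to_days(high_seconds)

# Inferred, not stated in the draft: average seconds per WHOIS lookup.
low_per_lookup = low_seconds / whois_records
high_per_lookup = high_seconds / whois_records

print(f"{low_days:.2f} to {high_days:.2f} days")
print(f"{low_per_lookup:.1f} to {high_per_lookup:.1f} seconds per lookup")
```

The lower bound works out to a little over 134.2 days, so the -03 figure appears to be truncated to one decimal place, not rounded.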
Whether this is feasible depends on the frequency of database updates Whether this is feasible depends on the frequency of database updates
provided. Because of the rate of growth of allocated domain names provided. Because of the rate of growth of allocated domain names
(150K-200K new allocated domains per month in early 1998), we (150K-200K new allocated domains per month in early 1998), we
provided monthly updates of the database. To rebuild the database provided monthly updates of the database. To rebuild the database
each month (based on the above time estimate) would require between 3 each month (based on the above time estimate) would require between 3
and 5 machines to be dedicated full time (independent of machine and 5 machines to be dedicated full time (independent of machine
architecture). Instead, we checkpointed the allocated domain list architecture). Instead, we checkpointed the allocated domain list
and rebuilt on an incremental basis during one weekend of the month. and rebuilt on an incremental basis during one weekend of the month.
This allowed us to complete the update on between 1 and 4 machines (3 This allowed us to complete the update on between 1 and 4 machines (3
skipping to change at page 4, line 46 skipping to change at page 4, line 50
database as a monolithic structure. Given past growth, it is not database as a monolithic structure. Given past growth, it is not
clear at what point migrating to a distributed directory becomes clear at what point migrating to a distributed directory becomes
actually necessary to support customer queries. Our last database actually necessary to support customer queries. Our last database
contained over 3.25 million records in a flat ASCII file. Searching contained over 3.25 million records in a flat ASCII file. Searching
was done via a PERL script of an inverted tree (also produced by a was done via a PERL script of an inverted tree (also produced by a
PERL script). While admittedly primitive, this configuration PERL script). While admittedly primitive, this configuration
supported over 200,000 database queries per month from our production supported over 200,000 database queries per month from our production
servers. servers.
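The draft does not describe the layout of the flat file or of the inverted tree, so the following Python sketch shows only the general idea of searching flat records through an inverted index. It is not the authors' Perl implementation. The pipe-delimited record layout, the sample records, and the names tokenize, build_index, and search are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record numbers that contain it."""
    index = defaultdict(set)
    for record_no, record in enumerate(records):
        for token in tokenize(record):
            index[token].add(record_no)
    return index


def search(index, records, query):
    """Return the records that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [records[i] for i in sorted(hits)]


# Hypothetical records: domain|organization|location
records = [
    "example.com|Example Widgets|Omaha NE",
    "example.org|Example Foundation|Boston MA",
    "example.net|Prairie Widgets|Omaha NE",
]
index = build_index(records)
matches = search(index, records, "widgets omaha")
```

With this structure a query touches only the token sets it names, so the cost of a lookup depends on those sets and not on the size of the whole file, which fits the observation that growth mainly costs disk space.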
Increasing the database size only requires more disk space to hold Increasing the database size only requires more disk space to hold
the database and inverted tree. Of course, using database technology the database and inverted tree. Of course, using database technology
would probably improve performance and scalability, but we had not would probably improve performance and scalability, but we had not
reached the point where this technology was required. reached the point where this technology was required.
5. Security 5. Security
The underlying data for the type of directory discussed in this The underlying data for the type of directory discussed in this
document is already generally available through WHOIS, DNS, and other document is already generally available through WHOIS, DNS, and other
standard interfaces. No new information is made available by using standard interfaces. No new information is made available by using
these techniques though many types of search become much easier. To these techniques though many types of search become much easier. To
the extent that easier access to this data makes it easier to find the extent that easier access to this data makes it easier to find
specific sites or machines to attack, security may be decreased. specific sites or machines to attack, security may be decreased.
The protocols discussed here do not have built-in security features. The protocols discussed here do not have built-in security features.
If one source machine is spoofed while the directory data is being If one source machine is spoofed while the directory data is being
gathered, substantial amounts of incorrect and misleading data could gathered, substantial amounts of incorrect and misleading data could
be pulled in to the directory and be spread to a wider audience. be pulled in to the directory and be spread to a wider audience.
skipping to change at page 5, line 41 skipping to change at page 5, line 44
Request For Comments (RFC) documents are available at Request For Comments (RFC) documents are available at
http://info.internet.isi.edu/1/in-notes/rfc and from numerous mirror http://info.internet.isi.edu/1/in-notes/rfc and from numerous mirror
sites. sites.
[1] M. F. Schwartz, C. Pu. "Applying an Information [1] M. F. Schwartz, C. Pu. "Applying an Information
Gathering Architecture to Netfind: A White Pages Gathering Architecture to Netfind: A White Pages
Tool for a Changing and Growing Internet," Univer- Tool for a Changing and Growing Internet," Univer-
sity of Colorado Technical Report CU-CS-656-93. sity of Colorado Technical Report CU-CS-656-93.
December 1993, revised July 1994. December 1993, revised July 1994.
<URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Netfind.Gathering <URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Netfind
.txt.Z>
[2] K. Sollins, Plan for Internet Directory Services, [2] K. Sollins, Plan for Internet Directory Services,
RFC 1107, July 1989. RFC 1107, July 1989.
[3] S. Hardcastle-Kille, E. Huizer, V.Cerf, R. Hobby, [3] S. Hardcastle-Kille, E. Huizer, V.Cerf, R. Hobby,
S. Kent, A Strategic Plan for Deploying an Internet S. Kent, A Strategic Plan for Deploying an Internet
X.500 Directory Service, RFC 1430, February 1993. X.500 Directory Service, RFC 1430, February 1993.
[4] J. Postel & C. Anderson, White Pages Meeting [4] J. Postel & C. Anderson, White Pages Meeting
Report, RFC 1588, February 1994. Report, RFC 1588, February 1994.
[5] M. Lottor, "Network Wizards Internet Domain Sur- [5] M. Lottor, "Network Wizards Internet Domain Sur-
vey," available from vey," available from
http://www.nw.com/zone/WWW/top.html http://www.nw.com/zone/WWW/top.html
8. Authors' addresses 8. Authors' addresses
Ryan Moats Rick Huber Ryan Moats Rick Huber
AT&T AT&T AT&T AT&T
15621 Drexel Circle Room C3-3B30, 200 Laurel Ave. South 15621 Drexel Circle Room C3-3B30, 200 Laurel Ave. South
Omaha, NE 68135-2358 Middletown, NJ 07748 Omaha, NE 68135-2358 Middletown, NJ 07748
USA USA USA USA
EMail: jayhawk@att.com Email: rvh@att.com EMail: jayhawk@att.com Email: rvh@att.com
 End of changes. 16 change blocks. 
38 lines changed or deleted 41 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/