Internet-Draft                                                Ryan Moats
draft-rfced-info-moats-03.txt                                 Rick Huber
Expires in six months                                               AT&T
                                                           December 1998

       Building Directories from DNS: Experiences from WWWSeeker
                Filename: draft-rfced-info-moats-03.txt

Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast), or
   ftp.isi.edu (US West Coast).
Abstract

   There has been much discussion and several documents written about
   the need for an Internet Directory.  Recently, this discussion has
   focused on ways to discover an organization's domain name without
   relying on use of DNS as a directory service.  This draft discusses
   lessons that were learned during InterNIC Directory and Database
   Services' development and operation of WWWSeeker, an application that
   finds a web site given information about the name and location of an
   organization.  The back end database that drives this application was
   built from information obtained from domain registries via WHOIS and
   other protocols.  We present this information to help future
   implementors avoid some of the blind alleys that we have already
   explored.  This work builds on the Netfind system that was created by
   Mike Schwartz and his team at the University of Colorado at Boulder
   [1].

1. Introduction

   Over time, there have been several RFCs [2, 3, 4] about approaches
   for providing Internet Directories.  Many of the earlier documents
   discussed white pages directories that supply mappings from a
   person's name to their telephone number, email address, etc.

   More recently, there has been discussion of directories that map from
   a company name to a domain name or web site.  Many people are using
   DNS as a directory today to find this type of information about a
   given company.  Typically when DNS is used, users guess the domain
   name of the company they are looking for and then prepend "www.".
   This makes it highly desirable for a company to have an easily
   guessable name.

   There are two major problems here.  As the number of assigned names
   increases, it becomes more difficult to get an easily guessable name.
   Also, the TLD must be guessed as well as the name.
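The guessing strategy described above amounts to generating candidate
"www."-prefixed FQDNs from a company name across a set of TLDs.  A
minimal sketch follows; the name-squashing rule and the TLD list are
purely illustrative assumptions, not anything the InterNIC service did:

```python
# Sketch of the "guess the name, prepend www." strategy described in
# the text.  The company name and TLD list below are illustrative only.

def candidate_sites(company, tlds=(".com", ".net", ".org")):
    """Generate the www-prefixed FQDNs a user might guess."""
    name = company.lower().replace(" ", "")  # naive name squashing
    return ["www." + name + tld for tld in tlds]

print(candidate_sites("Example Widgets"))
```

As the text notes, the candidate list grows with every TLD in general
use, which is exactly why guessing scales poorly.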
   While many users just guess ".COM" as the "default" TLD today, there
   are many two-letter country code top-level domains in current use as
   well as other gTLDs (.NET, .ORG, and possibly .EDU), with the
   prospect of additional gTLDs in the future.  As the number of TLDs in
   general use increases, guessing gets more difficult.

   Between July 1996 and our shutdown in March 1998, the InterNIC
   Directory and Database Services project maintained the Netfind search
   engine [1] and the associated database that maps organization
   information to domain names.  This database thus acted as the type of
   Internet directory that associates company names with domain names.
   We also built WWWSeeker, a system that used the Netfind database to
   find web sites associated with a given organization.  The experience
   gained from maintaining and growing this database provides valuable
   insight into the issues of providing a directory service.  We present
   it here to allow future implementors to avoid some of the blind
   alleys that we have already explored.

2. Directory Population

2.1 What to do?

   There are two issues in populating a directory: finding all the
   domain names (building the skeleton) and associating those domains
   with entities (adding the meat).  These two issues are discussed
   below.

2.2 Building the skeleton

   In "building the skeleton," it is popular to suggest using a variant
   of a "tree walk" to determine the domains that need to be added to
   the directory.  Our experience is that this is neither a reasonable
   nor an efficient way to maintain such a directory.  Except for some
   infrequent and long-standing DNS surveys [5], DNS "tree walks" tend
   to be discouraged by the Internet community, especially given that
   the frequency of DNS changes would require a new tree walk monthly
   (if not more often).
   Instead, our experience has shown that data on allocated DNS domains
   can usually be retrieved in bulk fashion with FTP, HTTP, or Gopher
   (we have used each of these for particular TLDs).  This has the added
   advantage of both "building the skeleton" and "adding the meat" at
   the same time.  Our favorite method for finding a server that has
   allocated DNS domain information is to start with the list maintained
   at http://www.alldomains.com/countryindex.html and go from there.
   Before this list was available, it was necessary to hunt for a
   registry by trial and error.

   When maintaining the database, existing domains may be verified via
   direct DNS lookups rather than a "tree walk."  "Tree walks" should
   therefore be the choice of last resort for directory population, and
   bulk retrieval should be used whenever possible.

2.3 Adding the meat

   A possibility for populating a directory ("adding the meat") is to
   use an automated system that makes repeated queries using the WHOIS
   protocol to gather information about the organization that owns a
   domain.  The queries would be made against a WHOIS server located
   with the above method.  At the conclusion of the InterNIC Directory
   and Database Services project, our backend database contained about
   2.9 million records built from data that could be retrieved via
   WHOIS.  The entire database contained 3.25 million records, with the
   additional records coming from sources other than WHOIS.

   In our experience this information contains many factual and
   typographical errors and requires further examination and processing
   to improve its quality.  Further, TLD registrars that support WHOIS
   typically only support WHOIS information for second-level domains
   (e.g., ne.us) as opposed to lower-level domains (e.g.,
   windrose.omaha.ne.us).
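An automated WHOIS client of the sort described above can be sketched
as follows.  This is only an illustration: the server name in the usage
comment is an assumption, and, as the text notes, record formats vary
widely between registries, so any field parser is necessarily
best-effort:

```python
import socket

def whois_query(domain, server, port=43, timeout=10):
    """Send a WHOIS query (domain name followed by CRLF, per the
    classic protocol) and return the raw text of the response."""
    with socket.create_connection((server, port), timeout=timeout) as s:
        s.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1", "replace")

def extract_field(response, label):
    """Pull the first 'Label: value' line from a WHOIS response.
    Registries label fields differently, so this is best-effort."""
    for line in response.splitlines():
        if line.lower().startswith(label.lower() + ":"):
            return line.split(":", 1)[1].strip()
    return None

# Usage (requires network access; the server name is illustrative):
#   text = whois_query("example.com", "whois.internic.net")
#   org = extract_field(text, "Registrant")
```

A production harvester would add retries, rate limiting, and per-TLD
parsing rules, since (as noted above) some TLDs have no WHOIS service
at all.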
   Also, there are TLDs without registrars, TLDs without WHOIS support,
   and still other TLDs that use other methods (HTTP, FTP, Gopher) for
   providing organizational information.  Based on our experience, an
   implementor of an Internet directory needs to support multiple
   protocols for directory population.  An automated WHOIS search tool
   is necessary, but it isn't enough.

3. Directory Updating: Full Rebuilds vs. Incremental Updates

   Given the size of our database in April 1998 when it was last
   generated, a complete rebuild of the database that is available from
   WHOIS lookups would require between 134.2 and 167.8 days just for
   WHOIS lookups from a Sun SPARCstation 20.  This estimate does not
   include other considerations (for example, inverting the token tree
   required about 24 hours of processing time on a Sun SPARCstation 20)
   that would increase the amount of time needed to rebuild the entire
   database.

   Whether this is feasible depends on the frequency of database updates
   provided.  Because of the rate of growth of allocated domain names
   (150K-200K new allocated domains per month in early 1998), we
   provided monthly updates of the database.  To rebuild the database
   each month (based on the above time estimate) would require between 3
   and 5 machines to be dedicated full time (independent of machine
   architecture).  Instead, we checkpointed the allocated domain list
   and rebuilt on an incremental basis during one weekend of the month.
   This allowed us to complete the update on between 1 and 4 machines (3
   Sun SPARCstation 20s and a dual-processor SPARCserver 690) without
   full dedication over a couple of days.
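The checkpoint-and-increment scheme described above amounts to diffing
the current allocated-domain list against the last checkpoint, so that
only the changed entries (rather than the whole database) need fresh
WHOIS lookups.  A minimal sketch, with illustrative domain names:

```python
def incremental_update(checkpoint, current):
    """Compare a checkpointed domain list against the current one.
    Returns (to_add, to_retire): only these entries need new lookups,
    instead of rebuilding the entire database."""
    old, new = set(checkpoint), set(current)
    return sorted(new - old), sorted(old - new)

added, removed = incremental_update(
    ["alpha.com", "beta.org"],      # last month's checkpoint
    ["alpha.com", "gamma.net"],     # current allocated-domain list
)
print(added, removed)
```

This captures the tradeoff the text describes: records untouched by the
diff keep their possibly stale data until a periodic refresh revisits
them.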
   Further, by coupling incremental updates with a periodic refresh of
   existing data (which can be done during another part of the month and
   doesn't require full dedication of machine hardware), older records
   would be periodically updated when the underlying information
   changes.  The tradeoff is timeliness and accuracy of data (some data
   in the database may be old) against hardware and processing costs.

4. Directory Presentation: Distributed vs. Monolithic

   While a distributed directory is a desirable goal, we maintained our
   database as a monolithic structure.  Given past growth, it is not
   clear at what point migrating to a distributed directory becomes
   necessary to support customer queries.  Our last database contained
   over 3.25 million records in a flat ASCII file.  Searching was done
   via a Perl script against an inverted tree (also produced by a Perl
   script).  While admittedly primitive, this configuration supported
   over 200,000 database queries per month from our production servers.

   Increasing the database size only requires more disk space to hold
   the database and the inverted tree.  Of course, using database
   technology would probably improve performance and scalability, but we
   had not reached the point where this technology was required.

5. Security

   The underlying data for the type of directory discussed in this
   document is already generally available through WHOIS, DNS, and other
   standard interfaces.  No new information is made available by using
   these techniques, though many types of search become much easier.  To
   the extent that easier access to this data makes it easier to find
   specific sites or machines to attack, security may be decreased.

   The protocols discussed here do not have built-in security features.
   If one source machine is spoofed while the directory data is being
   gathered, substantial amounts of incorrect and misleading data could
   be pulled into the directory and spread to a wider audience.

   In general, building a directory from registry data will not open any
   new security holes, since the data is already available to the
   public.  Existing security and accuracy problems with the data
   sources are, however, likely to be amplified.

6. Acknowledgments

   The work described in this document was partially supported by the
   National Science Foundation under Cooperative Agreement NCR-9218179.

7. References

   Request For Comments (RFC) documents are available at
   http://info.internet.isi.edu/1/in-notes/rfc and from numerous mirror
   sites.

   [1] Schwartz, M. F. and C. Pu, "Applying an Information Gathering
       Architecture to Netfind: A White Pages Tool for a Changing and
       Growing Internet", University of Colorado Technical Report
       CU-CS-656-93, December 1993, revised July 1994.

   [2] Sollins, K., "A Plan for Internet Directory Services", RFC 1107,
       July 1989.

   [3] Hardcastle-Kille, S., Huizer, E., Cerf, V., Hobby, R., and S.
       Kent, "A Strategic Plan for Deploying an Internet X.500
       Directory Service", RFC 1430, February 1993.

   [4] Postel, J. and C. Anderson, "White Pages Meeting Report",
       RFC 1588, February 1994.