idnits 2.17.1 

draft-klensin-name-filters-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 347 has weird spacing: '...   of  for an ...'

  == Line 385 has weird spacing: '...m.  See  for d...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (September 5, 2003) is 7539 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'ASCII' is mentioned on line 128, but not defined

  == Missing Reference: 'IS3166' is mentioned on line 176, but not defined

  == Unused Reference: 'RFC1535' is defined on line 613, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1738' is defined on line 617, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2396' is defined on line 629, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2616' is defined on line 633, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3490' is defined on line 643, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3491' is defined on line 647, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3492' is defined on line 651, but no explicit
     reference was found in the text

  == Unused Reference: 'JET' is defined on line 670, but no explicit
     reference was found in the text

  == Unused Reference: 'RegRestr' is defined on line 678, but no explicit
     reference was found in the text

  ** Downref: Normative reference to an Informational RFC: RFC 1535

  ** Obsolete normative reference: RFC 1738 (Obsoleted by RFC 4248, RFC 4266)

  ** Obsolete normative reference: RFC 1866 (Obsoleted by RFC 2854)

  ** Obsolete normative reference: RFC 2368 (Obsoleted by RFC 6068)

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Obsolete normative reference: RFC 2821 (Obsoleted by RFC 5321)

  ** Obsolete normative reference: RFC 2822 (Obsoleted by RFC 5322)

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  == Outdated reference: A later version (-05) exists of
     draft-jseng-idn-admin-03

  == Outdated reference: A later version (-08) exists of
     draft-klensin-reg-guidelines-00


     Summary: 12 errors (**), 0 flaws (~~), 17 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft                                         September 5, 2003
4	Expires: March 5, 2004

6	   User Interface Evaluation and Filtering of Internet Addresses and
7	              Locators -or- Syntaxes for Common Namespaces
8	                   draft-klensin-name-filters-03.txt

10	Status of this Memo

12	   This document is an Internet-Draft and is in full conformance with
13	   all provisions of Section 10 of RFC2026.

15	   Internet-Drafts are working documents of the Internet Engineering
16	   Task Force (IETF), its areas, and its working groups. Note that other
17	   groups may also distribute working documents as Internet-Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time. It is inappropriate to use Internet-Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt.

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html.

30	   This Internet-Draft will expire on March 5, 2004.

32	Copyright Notice

34	   Copyright (C) The Internet Society (2003). All Rights Reserved.

36	Abstract

38	   Many Internet applications have been designed to deduce top-level
39	   domains (or other domain name labels) from partial information. The
40	   introduction of new top-level domains, especially non-country-code
41	   ones, has exposed flaws in some of the methods used by these
42	   applications. These flaws make it more difficult, or impossible, for
43	   users of the applications to access the full Internet.  This memo
44	   discusses some of the techniques that have been used and gives some
45	   guidance for minimizing their negative impact as the domain name
46	   environment evolves. This document draws summaries of the applicable
47	   rules together in one place and supplies references to the actual
48	   standards.

50	Table of Contents

52	   1.  Introduction
53	   2.  Restrictions on domain (DNS) names
54	   3.  Restrictions on email addresses
55	   4.  URLs and URIs
56	   4.1 URI syntax definitions and issues
57	   4.2 The HTTP URL
58	   4.3 The MAILTO URL
59	   4.4 Guessing domain names in web contexts
60	   5.  Implications of internationalization
61	   6.  Summary
62	   7.  Security considerations
63	   8.  Acknowledgements
64	       Normative References
65	       Non-normative References
66	       Author's Address
67	       Intellectual Property and Copyright Statements

69	1. Introduction

71	   Designers of user interfaces to Internet applications have often
72	   found it useful to examine user-provided values for validity before
73	   passing them to the Internet tools themselves.  This type of test,
74	   most commonly involving syntax checks or application of other rules
75	   to domain names, email addresses, or "web addresses" (URLs or,
76	   occasionally, extended URI forms (see Section 4)) may enable
77	   better-quality diagnostics for the user than might be available from
78	   the protocol itself.  Local validity tests on values are also thought
79	   to improve the efficiency of back-office processing programs and to
80	   reduce load on the protocols themselves.  Certainly they are
81	   consistent with the well-established principle that it is better to
82	   detect errors as early as possible.

84	   The tests must, however, be made correctly or at least safely.  If
85	   criteria are applied that do not match the protocols, users will be
86	   inconvenienced, addresses and sites will effectively become
87	   inaccessible to some groups, and business and communications
88	   opportunities will be lost.  Experience in recent years indicates
89	   that syntax tests are often performed incorrectly and that tests for
90	   top-level domain names are applied using obsolete lists and
91	   conventions.  We assume that most of these incorrect tests are the
92	   result of inability to conveniently locate exact definitions for the
93	   criteria to be applied. This document draws summaries of the
94	   applicable rules together in one place and supplies references to the
95	   actual standards.  It does not add anything to those standards; it
96	   merely draws the information together into a form that may be more
97	   accessible.

99	   Many experts on Internet protocols believe that tests and rules of
100	   these sorts should be avoided in applications and that the tests in
101	   the protocols and back-office systems should be relied on instead.
102	   Certainly implementations of the protocols cannot assume that the
103	   data passed to them will be valid.  Unless the standards specify
104	   particular behavior, this document takes no position on whether or
105	   not the testing is desirable.  It only identifies the correct tests
106	   to be made if tests are to be applied.

108	   The sections that follow discuss domain names, email addresses, and
109	   URLs.

111	2. Restrictions on domain (DNS) names

113	   The authoritative definitions of the format and syntax of domain
114	   names appear in RFCs 1035 [RFC1035], 1123 [RFC1123], and 2181
115	   [RFC2181].

117	   Any characters, or combination of bits (as octets), are permitted in
118	   DNS names.  However, there is a preferred form that is required by
119	   most applications.  This preferred form has been the only one
120	   permitted in the names of top-level domains, or TLDs. In general, it
121	   is also the only form permitted in most second-level names registered
122	   in TLDs although some names that are normally not seen by users obey
123	   other rules.  It derives from the original ARPANET rules for naming
124	   of hosts (i.e., the "hostname" rule) and is perhaps better described
125	   as the "LDH rule", after the characters that it permits.  The LDH
126	   rule, as updated, provides that the labels (words or strings
127	   separated by periods) that make up a domain name must consist only of
128	   the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen.
129	   No other symbols or punctuation characters are permitted, nor is
130	   blank space.  If the hyphen is used, it is not permitted to appear at
131	   either the beginning or end of a label.  There is an additional rule
132	   that essentially requires that top-level domain names not be
133	   all-numeric.
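
	   The LDH rule summarized above can be sketched as a pair of small
	   checks.  This is a minimal illustration only, not part of any
	   standard; the function names are hypothetical, and length limits
	   are deliberately left out.

```python
import re

# One LDH label: ASCII letters, digits and hyphens only, with no hyphen
# at either end.  Length limits are deliberately not checked here.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")


def is_ldh_label(label):
    """Return True if `label` satisfies the LDH rule as summarized above."""
    return _LDH_LABEL.match(label) is not None


def tld_is_acceptable(label):
    """A top-level label must be LDH and must not be all-numeric."""
    return is_ldh_label(label) and not label.isdigit()
```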

135	   When it is necessary to express labels with non-character octets, or
136	   to embed periods within labels, there is a mechanism for keying them
137	   in that utilizes an escape sequence.  RFC 1035 [RFC1035] should be
138	   consulted if that mechanism is needed (most common applications,
139	   including email and the Web, will generally not permit those escaped
140	   strings).  A special encoding is now available for non-ASCII
141	   characters; see the brief discussion in Section 5.

143	   Most Internet applications that reference other hosts or systems
144	   assume they will be supplied with "fully-qualified" domain names,
145	   i.e., ones that include all of the labels leading to the root,
146	   including the TLD name.  Those fully-qualified domain names are then
147	   passed to either the domain name resolution protocol itself or to the
148	   remote systems.  Consequently, purported DNS names to be used in
149	   applications and to locate resources generally must contain at least
150	   one period (".") character.  Those that do not are either invalid or
151	   require the application to supply additional information.  Of course,
152	   this principle does not apply when the purpose of the application is
153	   to process or query TLD names themselves.  The DNS specification also
154	   permits a trailing period to be used to denote the root, e.g.,
155	   "a.b.c" and "a.b.c." are equivalent, but the latter is more explicit
156	   and is required to be accepted by applications.  This convention is
157	   especially important when a TLD name is being referred to directly.
158	   For example, while ".COM" has become the popular terminology for
159	   referring to that top-level domain, "COM." would be strictly and
160	   technically correct in talking about the DNS, since it shows that
161	   "COM" is a top-level domain name.

163	   There is a long history of applications moving beyond the "one or
164	   more periods" test to trying to verify that a valid TLD name is
165	   actually present.  They have done this either by applying some
166	   heuristics to the form of the name or by consulting a local list of
167	   valid names.  The historical heuristics are no longer effective.  If
168	   one is to keep a local list, much more effort must be devoted to
169	   keeping it up-to-date than was the case several years ago.

171	   The heuristics were based on the observation that, since the DNS was
172	   first deployed, all top-level domain names were two, three, or four
173	   characters in length.  All two-character names were associated with
174	   "country code" domains, with the specific labels (with a few early
175	   exceptions), drawn from the ISO list of codes for countries and
176	   similar entities [IS3166].  The three-letter names were "generic"
177	   TLDs, whose function was not country-specific.  And there was exactly
178	   one four-letter TLD, the infrastructure domain "ARPA." [RFC1591].
179	   These length-dependent rules were, however, conventions, rather than
180	   anything on which the protocols depended.

182	   Before the mid-1990s, lists of valid top-level domain names changed
183	   infrequently.  New country codes were gradually, and then more
184	   rapidly, added as the Internet expanded, but the list of generic
185	   domains did not change at all between the establishment of the "INT."
186	   domain and ICANN's allocation of new generic TLDs in 2000.  Some
187	   application developers responded by assuming that any two-letter
188	   domain name could be valid as a TLD, but that the list of generic
189	   TLDs was fixed and could be kept locally and tested.  Several of
190	   these assumptions changed as ICANN started to allocate new top-level
191	   domains: one two-letter domain that does not appear in the ISO 3166-1
192	   table was tentatively approved, and new domains were created with
193	   three, four, and even six letter codes.
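
	   To make the failure concrete, the sketch below reproduces a
	   length-based heuristic of the kind described above and applies it
	   to some current TLD names.  The function name is hypothetical and
	   the heuristic is shown only as an example of what not to rely on.

```python
def legacy_tld_guess(tld):
    """Obsolete length-based heuristic, shown only to illustrate its failure."""
    tld = tld.lower()
    # Two letters: assumed country code.  Three letters: assumed generic.
    # "arpa" was the single four-letter exception.
    return len(tld) in (2, 3) or tld == "arpa"


if __name__ == "__main__":
    for name in ("com", "uk", "arpa", "info", "aero", "museum"):
        print(name, legacy_tld_guess(name))
```

	   The heuristic accepts "com", "uk", and "arpa" but rejects "info",
	   "aero", and "museum", all of which are valid TLD names.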

195	   As of the first quarter of 2003, the list of valid, non-country,
196	   top-level domains was .AERO, .BIZ, .COM, .COOP, .EDU, .GOV, .INFO,
197	   .INT, .MIL, .MUSEUM, .NAME, .NET, .ORG, .PRO, and .ARPA.  ICANN is
198	   expected to expand that list at regular intervals, so the list that
199	   appears here should not be used in testing.  Instead, systems that
200	   filter by testing top-level domain names should regularly check the
201	   current list of TLDs (both "generic" and country-code-related)
202	   published by IANA at http://www.iana.org/domain-names.htm.  It is
203	   likely that the better strategy has now become to make the "at least
204	   one period" test, to verify LDH conformance (including verification
205	   that the apparent TLD name is not all-numeric), and then to use the
206	   DNS to determine domain name validity, rather than trying to maintain
207	   a local list of valid TLD names.

209	   A DNS label may be no more than 63 octets long.  This is in the form
210	   actually stored; if a non-ASCII label is converted to encoded
211	   "punycode" form (see Section 5), the length of that form may
212	   restrict the number of actual characters (in the original
213	   character set) that can be accommodated.  A complete,
214	   fully-qualified, domain name must not exceed 255 octets.
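
	   The syntax-only checks discussed in this section (at least one
	   period, LDH conformance, a TLD label that is not all-numeric, and
	   the length limits) might be combined as follows.  This is a sketch
	   under stated assumptions: the function name and messages are
	   hypothetical, lengths are measured on the textual form, and a name
	   that passes still has to be looked up in the DNS to learn whether
	   it exists.

```python
import re

# One LDH label: letters, digits, hyphens; no hyphen at either end.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")

MAX_LABEL_OCTETS = 63
MAX_NAME_OCTETS = 255


def domain_syntax_problem(name):
    """Return None if `name` passes the syntax pre-checks, else a reason."""
    if name.endswith("."):
        name = name[:-1]  # "a.b.c." and "a.b.c" are equivalent
    if len(name) > MAX_NAME_OCTETS:
        return "name longer than 255 octets"
    labels = name.split(".")
    if len(labels) < 2:
        return "no period: not a fully-qualified name"
    for label in labels:
        if len(label) > MAX_LABEL_OCTETS:
            return "label longer than 63 octets"
        if not _LDH_LABEL.match(label):
            return "label violates the LDH rule"
    if labels[-1].isdigit():
        return "top-level label is all-numeric"
    return None
```

	   No list of TLD names is consulted, so nothing here needs updating
	   when new top-level domains are created.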

216	   Some additional mechanisms for guessing correct domain names when
217	   incomplete information is provided have been developed for use with
218	   the web and are discussed in Section 4.4.

220	3. Restrictions on email addresses

222	   Reference documents: RFC 2821 [RFC2821] and RFC 2822 [RFC2822]

224	   Contemporary email addresses consist of a "local part" separated from
225	   a "domain part" (a fully-qualified domain name) by an at-sign ("@").
226	   The syntax of the domain part corresponds to that in the previous
227	   section.  The concerns identified in that section about filtering and
228	   lists of names apply to the domain names used in an email context as
229	   well.  The domain name can also be replaced by an IP address in
230	   square brackets, but that form is strongly discouraged except for
231	   testing and troubleshooting purposes.

233	   The local part may appear using the quoting conventions described
234	   below. The quoted forms are rarely used in practice, but are required
235	   for some legitimate purposes.  Hence, they should not be rejected in
236	   filtering routines but, instead, should be passed to the email system
237	   for evaluation by the destination host.

239	   The exact rule is that any ASCII character, including control
240	   characters, may appear quoted, or in a quoted string.  When quoting
241	   is needed, the backslash character is used to quote the following
242	   character.  For example

244	      Abc\@def@example.com

246	   is a valid form of an email address.  Blank spaces may also appear,
247	   as in

249	      Fred\ Bloggs@example.com

251	   The backslash character may also be used to quote itself, e.g.,

253	      Joe.\\Blow@example.com

255	   In addition to quoting using the backslash character, conventional
256	   double-quote characters may be used to surround strings. For example

258	      "Abc@def"@example.com

260	      "Fred Bloggs"@example.com

262	   are alternate forms of the first two examples above.  These quoted
263	   forms are rarely recommended, and are uncommon in practice, but, as
264	   discussed above, must be supported by applications that are
265	   processing email addresses.  In particular, the quoted forms often
266	   appear in the context of addresses associated with transitions from
267	   other systems and contexts; those transitional requirements do still
268	   arise and, since a system that accepts a user-provided email address
269	   cannot "know" whether that address is associated with a legacy
270	   system, the address forms must be accepted and passed into the email
271	   environment.

273	   Without quotes, local-parts may consist of any combination of
274	   alphabetic characters, digits, or any of the special characters

276	      ! # $ % & ' * + - / = ? ^ _ ` { | } ~

278	   The period (".") may also appear, but not at the start or end of the
279	   local part, and not twice or more in succession.  Stated differently,
280	   any ASCII graphic (printing) character other than the at-sign ("@"),
281	   backslash, double quote, comma, colon, semicolon, parentheses, angle
282	   brackets, or square brackets may appear without quoting.  Any of the
283	   excluded characters that appear must be quoted.  Forms such as

285	      user+mailbox@example.com

287	      customer/department=shipping@example.com

289	      $A12345@example.com

291	      !def!xyz%abc@example.com

293	      _somename@example.com

295	   are valid and are seen fairly regularly, but any of the characters
296	   listed above are permitted.  In the context of local parts,
297	   apostrophe ("'") and grave accent ("`") are ordinary characters, not
298	   quoting characters.  Some of the characters listed above are used in
299	   conventions about routing or other types of special handling by some
300	   receiving hosts. But, since there is no way to know whether the
301	   remote host is using those conventions or just treating these
302	   characters as normal text, sending programs (and programs evaluating
303	   address validity) must simply accept the strings and pass them on.

305	   In addition to restrictions on syntax, there is a length limit on
306	   email addresses.  That limit is a maximum of 64 characters (octets)
307	   in the "local part" (before the "@") and a maximum of 255 characters
308	   (octets) in the domain part (after the "@") for a total length of 320
309	   characters.  Systems that handle email should be prepared to process
310	   addresses which are that long, even though they are rarely
311	   encountered.
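
	   A filter that follows the guidance in this section might look
	   like the sketch below.  It checks only the length limits and the
	   unquoted local-part syntax, and it passes quoted forms through
	   unexamined so that the mail system can evaluate them.  The
	   function name and the three result strings are hypothetical, and
	   the domain part is not examined beyond its length.

```python
import re

# Characters allowed in an unquoted local part, besides the period.
_ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"
# Runs of those characters separated by single periods.
_DOT_ATOM = re.compile(r"[{0}]+(?:\.[{0}]+)*\Z".format(_ATEXT))


def classify_address(address):
    """Return "ok", "quoted" (pass through unchecked), or "reject"."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return "reject"
    if len(local.encode("utf-8")) > 64 or len(domain.encode("utf-8")) > 255:
        return "reject"
    if '"' in local or "\\" in local:
        return "quoted"  # leave evaluation to the email system
    return "ok" if _DOT_ATOM.match(local) else "reject"
```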

313	4. URLs and URIs

315	4.1 URI syntax definitions and issues

317	   The syntax for URLs (Uniform Resource Locators) is specified in RFC
318	   1738 [RFC1738]; that for the more general "URI" (Uniform Resource
319	   Identifier) is specified in RFC 2396 [RFC2396].  The URI syntax is
320	   extremely general, with considerable variations permitted according
321	   to the type of "scheme" (e.g., "http", "ftp", "mailto") in use.
322	   While the general syntax rules of RFC 2396 can be used to perform
323	   syntax checks, they are general enough -- essentially only specifying
324	   the separation of the scheme name and "scheme specific part" with a
325	   colon (":") and excluding some characters that must be escaped if
326	   used -- to provide little significant filtering or validation power.

328	   The following characters are reserved in many URIs -- they must
329	   either be used for their URI-intended purpose or be encoded.  Some
330	   particular schemes may either broaden or relax these restrictions
331	   (see the following sections for URLs applicable to "web pages" and
332	   electronic mail), or apply them only to particular URI component
333	   parts.

335	      ; / ? : @ & = + $ ,

337	   In addition, control characters, the space character, the
338	   double-quote (") character, and the following special characters

340	      < > # %

342	   are generally forbidden and must either be avoided or escaped.

344	   When it is necessary to encode these, or other, characters, the
345	   method used is to replace each with a percent-sign ("%") followed by
346	   two hexadecimal digits representing its octet value.  See section
347	   2.4.1 of RFC 2396 [RFC2396] for an exact definition.  Unless it is
348	   used as a delimiter of the URI scheme itself, any character may
349	   optionally be encoded this way; systems that are testing URI syntax
350	   should be prepared for these encodings to appear in any component of
351	   the URI except the scheme name itself.
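
	   The encoding just described, and a test that tolerates it, can be
	   sketched as follows.  The function names are hypothetical, and the
	   use of UTF-8 to obtain octet values for non-ASCII characters is an
	   assumption of this sketch, not something the text above specifies.

```python
import re

# A "%" that is NOT followed by exactly two hexadecimal digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def percent_encode_char(ch):
    """Encode one character as %XX for each octet of its UTF-8 form."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def escapes_well_formed(text):
    """Return True if every "%" in `text` is followed by two hex digits."""
    return _BAD_ESCAPE.search(text) is None
```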

353	   A "generic URI" syntax is specified and is more restrictive, but
354	   using it to test URI strings requires that one know whether or not
355	   the particular scheme in use obeys that syntax.  Consequently,
356	   applications that intend to check or validate URIs should normally
357	   identify the scheme name and then apply scheme-specific tests.  The
358	   rules for two of those -- HTTP [RFC1866] and MAILTO [RFC2368] URLs --
359	   are discussed below, but the author of an application which intends
360	   to make very precise checks, or to reject particular syntax rather
361	   than just warning the user, should consult the relevant
362	   scheme-definition documents for precise syntax and relationships.

364	4.2 The HTTP URL

366	   Absolute HTTP URLs consist of the scheme name, a host name (expressed
367	   as a domain name or IP address), an optional port number, and then,
368	   optionally, a path, a search part, and a fragment identifier.  These
369	   are separated, respectively, by a colon and the two slashes that
370	   precede the host name, a colon, a slash, a question mark, and a hash
371	   mark ("#").  So we have

373	      http://host:port/path?search#fragment

375	      http://host/path/

377	      http://host/path#fragment

379	      http://host/path?search

381	      http://host

383	   and other variations on that form.  There is also a "relative" form,
384	   but it almost never appears in text that a user might, e.g., enter
385	   into a form.  See RFC 2616 [RFC2616] for details.

387	   The characters

389	      / ; ?

391	   are reserved within the path and search parts and must be encoded;
392	   the first of these may be used unencoded, and is often used, within
393	   the path to designate hierarchy.
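
	   The components of an absolute HTTP URL can be separated with the
	   Python standard library's urllib.parse.urlsplit, as in the sketch
	   below.  The wrapper function and its dictionary keys are
	   hypothetical; the sketch separates components and does not
	   validate the host name or the reserved characters.

```python
from urllib.parse import urlsplit


def http_url_parts(url):
    """Split an absolute HTTP URL into its parts, or return None."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return None
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return None
    return {
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "search": parts.query,
        "fragment": parts.fragment,
    }
```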

395	4.3 The MAILTO URL

397	   MAILTO is a URL type whose content is an email address.  It can be
398	   used to encode any of the email address formats discussed in
399	   Section 3 above.  It can also support multiple
400	   addresses and inclusion of headers (e.g., Subject lines) within the
401	   body of the URL.   MAILTO is authoritatively defined in RFC 2368
402	   [RFC2368]; anyone expecting to accept, and test, multiple addresses
403	   or mail header or body formats should consult that document
404	   carefully.

406	   In accepting text for, or validating, a MAILTO URL, it is important
407	   to note that, while it can be used to encode any valid email address,
408	   it is not sufficient to copy an email address into a MAILTO URL since
409	   email addresses may include a number of characters that are invalid
410	   in, or have reserved uses for, URLs.    Those characters must be
411	   encoded, as outlined in Section 4.1 above, when
412	   the addresses are mapped into the URL form.  Conversely, addresses in
413	   MAILTO URLs cannot be copied directly into email contexts, since few
414	   email programs will reverse the decodings (and doing so might be
415	   interpreted as a protocol violation).

417	   The following characters may appear in MAILTO URLs only with the
418	   specific defined meanings given.  If they appear in an email address
419	   (i.e., for some other purpose), they must be encoded:

421	   :              The colon in "mailto:"

423	   <>#"%{}|\^~`   These characters are "unsafe" in any URL, and must
424	                  always be encoded.

426	   The following characters must also be encoded if they appear in a
427	   MAILTO URL

429	   ?&=            Used to delimit headers and their values when these
430	                  are encoded into URLs.

432	   Some examples may be helpful:

434	   +----------------------+----------------------+---------------------+
435	   |     Email address    |      MAILTO URL      |               Notes |
436	   +----------------------+----------------------+---------------------+
437	   |    Joe@example.com   | mailto:joe@example.c |                   1 |
438	   |                      |          om          |                     |
439	   |                      |                      |                     |
440	   | user+mailbox@example |        mailto:       |                   2 |
441	   |         .com         | user%2Bmailbox@examp |                     |
442	   |                      |        le.com        |                     |
443	   |                      |                      |                     |
444	   | customer/department= | mailto:customer%2Fde |                   3 |
445	   | shipping@example.com | partment=shipping@ex |                     |
446	   |                      |       ample.com      |                     |
447	   |                      |                      |                     |
448	   |  $A12345@example.com | mailto:$A12345@examp |                   4 |
449	   |                      |        le.com        |                     |
450	   |                      |                      |                     |
451	   | !def!xyz%abc@example | mailto:!def!xyz%25ab |                   5 |
452	   |         .com         |     c@example.com    |                     |
453	   |                      |                      |                     |
454	   | _somename@example.co | mailto:_somename@exa |                   4 |
455	   |           m          |       mple.com       |                     |
456	   +----------------------+----------------------+---------------------+

458	                                Table 1

460	   Notes on Table

462	   1.  No characters appear in the email address that require escaping,
463	       so the body of the MAILTO URL is identical to the email address.

465	   2.  There is actually some uncertainty as to whether or not the "+"
466	       character requires escaping in MAILTO URLs (the standards are
467	       not precisely clear).   But, since any character in the address
468	       specification may optionally be encoded, it is probably safer to
469	       encode it.

471	   3.  The "/" character is generally reserved in URLs, and must be
472	       encoded as %2F.

474	   4.  Neither the "$" nor the "_" character is given any special
475	       interpretation in MAILTO URLs, so neither need be encoded.

477	   5.  While the "!" character has no special interpretation, the "%"
478	       character is used to introduce encoded sequences and hence it
479	       must always be encoded.
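
	   A conservative mapping from an email address to a MAILTO URL can
	   be sketched as follows.  Because any character may optionally be
	   encoded, the sketch encodes everything except letters, digits, and
	   a short list of characters the notes above treat as safe; that
	   list and the function names are assumptions of the sketch.  It
	   also encodes "=", following the rule given above for "?&=", so its
	   output for the third table row contains "%3D".

```python
import string

# Characters left unencoded; everything else becomes %XX.
_SAFE = frozenset(string.ascii_letters + string.digits + "$!_.-")


def _encode(text):
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        out.append(ch if ch in _SAFE else "%{:02X}".format(octet))
    return "".join(out)


def mailto_url(address):
    """Map an email address to a MAILTO URL, encoding conservatively."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no at-sign in address")
    # Only the final "@" separates local part from domain; any earlier
    # "@" belongs to the local part and is encoded.
    return "mailto:" + _encode(local) + "@" + _encode(domain)
```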

481	4.4 Guessing domain names in web contexts

483	   Several web browsers have adopted a practice that permits an
484	   incomplete domain name to be used as input instead of a complete URL.
485	   This has, for example, permitted users to type "microsoft" and have
486	   the browser interpret the input as "http://www.microsoft.com/".
487	   Other browser versions have gone even further, trying to build DNS
488	   names up through a series of heuristics, testing each variation in
489	   turn to see if it appears in the DNS, and accepting the first one
490	   found as the intended domain name.  If this approach is to be used,
491	   it is often critical that the browser recognize the complete list of
492	   TLDs.  If an incomplete list is used, complete domain names may not
493	   be recognized as such and the system may try to turn them into
494	   completely different names.  For example, "example.aero" is a
495	   fully-qualified name, since "AERO." is a TLD name.  But, if the
496	   system doesn't recognize "AERO" as a TLD name, it is likely to try to
497	   look up "example.aero.com" and "www.example.aero.com" (and then fail
498	   or find the wrong host), rather than simply looking up the
499	   user-supplied name.

501	   As discussed in Section 2 above, there are
502	   dangers associated with software that attempts to "know" the list of
503	   top-level domain names locally and take advantage of that knowledge.
504	   These name-guessing heuristics are another example of that situation:
505	   if the lists are up-to-date and used carefully, the systems in which
506	   they are embedded may provide an easier, and more attractive,
507	   experience for at least some users.  But finding the wrong host, or
508	   being unable to find a host even when its name is precisely known,
509	   constitute bad experiences by any measure.
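
	   One way to avoid depending on a local TLD list is to try the name
	   exactly as the user supplied it before any guessing.  The sketch
	   below only orders the candidates; it performs no DNS lookups.  The
	   function name, the default suffix list, and the "www." prefix are
	   hypothetical choices made for illustration.

```python
def lookup_candidates(user_input, suffixes=("com",)):
    """Return names to try, in order; `suffixes` is a hypothetical list."""
    name = user_input.rstrip(".")
    candidates = []
    if "." in name:
        # A name containing a period may already be fully qualified,
        # so it is always tried first, whatever its last label is.
        candidates.append(name)
    for suffix in suffixes:
        candidates.append(name + "." + suffix)
        candidates.append("www." + name + "." + suffix)
    return candidates
```

	   With this ordering, "example.aero" is looked up as typed before
	   "example.aero.com" is ever considered.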

511	   More generally, there have been bad experiences with attempts to
512	   "complete" domain names by adding additional information to them.
513	   These issues are described in some detail in RFC 1535 [RFC1535].

515	5. Implications of internationalization

517	   The IETF has adopted a series of proposals ([RFC3490] - [RFC3492])
518	   whose purpose is to permit encoding internationalized (i.e.,
519	   non-ASCII) names in the DNS.  The primary standard, and the group
520	   generically, are known as "IDNA".  The strings actually stored in
521	   the DNS are encoded: each label begins with "xn--" followed by the
522	   encoded string.  Applications should be prepared to accept and
523	   process both the encoded form and the use of local, and potentially
524	   other, characters as appropriate to local systems and circumstances.
525	   The encoded strings conform to the "LDH rule" (see "Restrictions on
526	   domain (DNS) names"), so they should not raise any separate issues.

528	   The IDNA specification describes the exact process to be used to
529	   validate a name or encoded string.  The process is sufficiently
530	   complex that shortcuts or heuristics, especially for versions of
531	   labels written directly in Unicode or other coded character sets, are
532	   likely to fail and cause problems.  In particular, the strings cannot
533	   be validated with syntax or semantic rules of any of the usual sorts:
534	   syntax validity is defined only in terms of the result of executing a
535	   particular function.
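
   As a hedged illustration of validity being defined by the result of
   a function rather than by a syntax rule, the following Python sketch
   simply runs the conversion and reports whether it succeeded.  It
   assumes Python's built-in "idna" codec, which implements the RFC
   3490 conversions; the helper name to_ace is hypothetical, and other
   environments will expose the operation differently.

```python
def to_ace(name):
    """Return the encoded ("xn--") form of name, or None if conversion fails.

    No pattern matching is attempted: the name is acceptable exactly
    when the conversion function completes without error.
    """
    try:
        return name.encode("idna").decode("ascii")
    except UnicodeError:
        return None


# "\u00fc" is LATIN SMALL LETTER U WITH DIAERESIS.
print(to_ace("b\u00fccher.example"))    # xn--bcher-kva.example
print(to_ace("a" * 64 + ".example"))    # None (label exceeds 63 octets)
```

   The point of the sketch is the shape of the test: the application
   calls the conversion and inspects the outcome instead of trying to
   approximate the rules locally.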

537	   In addition to the restrictions imposed by the protocols themselves,
538	   many domains are implementing rules about just which non-ASCII names
539	   they will permit to be registered (see, e.g., [JET], [RegRestr]).
540	   This work is still relatively new, and the rules and conventions are
541	   likely to be different for each domain, or at least each language
542	   or script group.  Attempting to test for those rules in a client
543	   program to see if a user-supplied name might possibly exist in the
544	   relevant domain would almost certainly be ill-advised.

546	   One quick, local, test, however, may be reasonable: as of the time of
547	   this writing, there should be no instances of labels in the DNS that
548	   start with two characters, followed by two hyphens, where the two
549	   characters are not "xn" (in, of course, either upper or lower case).
550	   Such label strings, if they appear, are probably erroneous or
551	   obsolete, and it may be reasonable to at least warn the user about
552	   them.
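
   The quick local test described above might be sketched as follows.
   The function name suspicious_labels is hypothetical, and the sketch
   only identifies labels worth a warning; it makes no claim about
   whether the name exists.

```python
def suspicious_labels(name):
    """Return labels whose third and fourth characters are "--" but
    whose first two characters are not "xn" (in either case)."""
    flagged = []
    for label in name.split("."):
        has_prefix_shape = len(label) >= 4 and label[2:4] == "--"
        if has_prefix_shape and label[:2].lower() != "xn":
            flagged.append(label)
    return flagged


print(suspicious_labels("bq--abc.example"))        # ['bq--abc']
print(suspicious_labels("XN--bcher-kva.example"))  # []
```

   A caller would typically warn the user about any label returned
   rather than reject the name outright.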

554	   There is ongoing work in the IETF and elsewhere to define
555	   internationalized formats for use in other protocols, including email
556	   addresses.  Those forms may or may not conform to existing rules for
557	   ASCII-only identifiers; anyone designing evaluators or filters should
558	   watch that work closely.

560	6. Summary

562	   When an application accepts a string from the user and ultimately
563	   passes it on to an API for a protocol, the desirability of testing or
564	   filtering the text in any way not required by the protocol itself is
565	   hotly debated.  If it must divide the string into its components, or
566	   otherwise interpret it, it obviously must make at least enough tests
567	   to validate that process.  With, e.g., domain names or email
568	   addresses that can be passed on untouched, the appropriateness of
569	   trying to figure out which ones are valid and which ones are not
570	   requires a more complex decision, one that should include
571	   considerations of how to make exactly the correct tests and to keep
572	   information that changes and evolves up-to-date.  Making the test
573	   incorrectly, or with obsolete information, can be extremely
574	   frustrating for potential correspondents or customers and may harm
575	   desired relationships.

577	7. Security considerations

579	   Since this document merely summarizes the requirements of existing
580	   standards, it does not introduce any new security issues.  However,
581	   many of the techniques that motivate the document raise important
582	   security concerns of their own.  Rejecting valid forms of domain
583	   names, email addresses, or URIs often denies service to the user of
584	   those entities. Worse, guessing at the user's intent when an
585	   incomplete address, or other string, is given can result in
586	   compromises to privacy or accuracy of reference if the wrong target
587	   is found and returned.  From a security standpoint, the optimum
588	   behavior is probably to never guess, but, instead, to force the user
589	   to specify exactly what is wanted.  When that position involves a
590	   tradeoff with an acceptable user experience, good judgment should be
591	   used and the fact that it is a tradeoff recognized.

593	8. Acknowledgements

595	   The author would like to express his appreciation for helpful
596	   comments from Harald Alvestrand, Eric A. Hall, and the RFC Editor,
597	   and for partial support of this work from SITA.  Responsibility for
598	   any errors remains, of course, with the author.

600	   The first Internet-Draft on this subject was posted in February 2003.
601	   The document was submitted to the RFC Editor on 20 June 2003,
602	   returned for revisions on 19 August, and resubmitted on 5 September
603	   2003.

605	Normative References

607	   [RFC1035]  Mockapetris, P., "Domain names - implementation and
608	              specification", RFC 1035, STD 13, November 1987.

610	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
611	              and Support", RFC 1123, STD 3, October 1989.

613	   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
614	              With Widely Deployed DNS Software", RFC 1535, October
615	              1993.

617	   [RFC1738]  Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform
618	              Resource Locators (URL)", RFC 1738, December 1994.

620	   [RFC1866]  Berners-Lee, T. and D. Connolly, "Hypertext Markup
621	              Language - 2.0", RFC 1866, November 1995.

623	   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
624	              Specification", RFC 2181, July 1997.

626	   [RFC2368]  Hoffman, P., Masinter, L. and J. Zawinski, "The mailto URL
627	              scheme", RFC 2368, July 1998.

629	   [RFC2396]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
630	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
631	              August 1998.

633	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Nielsen, H.,
634	              Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext
635	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

637	   [RFC2821]  Klensin, J., "Simple Mail Transfer Protocol", RFC 2821,
638	              April 2001.

640	   [RFC2822]  Resnick, P., "Internet Message Format", RFC 2822, April
641	              2001.

643	   [RFC3490]  Faltstrom, P., Hoffman, P. and A. Costello,
644	              "Internationalizing Domain Names in Applications (IDNA)",
645	              RFC 3490, March 2003.

647	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
648	              Profile for Internationalized Domain Names (IDN)", RFC
649	              3491, March 2003.

651	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
652	              for Internationalized Domain Names in Applications
653	              (IDNA)", RFC 3492, March 2003.

655	   [refs.ASCII]
656	              American National Standards Institute (formerly United
657	              States of America Standards Institute), "USA Code for
658	              Information Interchange", ANSI X3.4-1968.  ANSI X3.4-1968
659	              has been replaced by newer versions with slight
660	              modifications, but the 1968 version remains definitive for
661	              the Internet.

663	Non-normative References

665	   [ISO.3166.1988]
666	              International Organization for Standardization, "Codes for
667	              the representation of names of countries, 3rd edition",
668	              ISO Standard 3166, August 1988.

670	   [JET]      Seng, J., "Internationalized Domain Names Registration and
671	              Administration Guideline for Chinese, Japanese and
672	              Korean", draft-jseng-idn-admin-03 (work in progress), June
673	              2003.

675	   [RFC1591]  Postel, J., "Domain Name System Structure and Delegation",
676	              RFC 1591, March 1994.

678	   [RegRestr]
679	              Klensin, J., "Registration Restrictions on
680	              Internationalized Domain Names -- An Overview",
681	              draft-klensin-reg-guidelines-00 (work in progress), June
682	              2003.

684	Author's Address

686	   John C Klensin
687	   1770 Massachusetts Ave, #322
688	   Cambridge, MA  02140
689	   USA

691	   Phone: +1 617 491 5735
692	   EMail: john-ietf@jck.com

694	Intellectual Property Statement

696	   The IETF takes no position regarding the validity or scope of any
697	   intellectual property or other rights that might be claimed to
698	   pertain to the implementation or use of the technology described in
699	   this document or the extent to which any license under such rights
700	   might or might not be available; neither does it represent that it
701	   has made any effort to identify any such rights. Information on the
702	   IETF's procedures with respect to rights in standards-track and
703	   standards-related documentation can be found in BCP-11. Copies of
704	   claims of rights made available for publication and any assurances of
705	   licenses to be made available, or the result of an attempt made to
706	   obtain a general license or permission for the use of such
707	   proprietary rights by implementors or users of this specification can
708	   be obtained from the IETF Secretariat.

710	   The IETF invites any interested party to bring to its attention any
711	   copyrights, patents or patent applications, or other proprietary
712	   rights which may cover technology that may be required to practice
713	   this standard. Please address the information to the IETF Executive
714	   Director.

716	Full Copyright Statement

718	   Copyright (C) The Internet Society (2003). All Rights Reserved.

720	   This document and translations of it may be copied and furnished to
721	   others, and derivative works that comment on or otherwise explain it
722	   or assist in its implementation may be prepared, copied, published
723	   and distributed, in whole or in part, without restriction of any
724	   kind, provided that the above copyright notice and this paragraph are
725	   included on all such copies and derivative works. However, this
726	   document itself may not be modified in any way, such as by removing
727	   the copyright notice or references to the Internet Society or other
728	   Internet organizations, except as needed for the purpose of
729	   developing Internet standards in which case the procedures for
730	   copyrights defined in the Internet Standards process must be
731	   followed, or as required to translate it into languages other than
732	   English.

734	   The limited permissions granted above are perpetual and will not be
735	   revoked by the Internet Society or its successors or assignees.

737	   This document and the information contained herein is provided on an
738	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
739	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
740	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
741	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
742	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

744	Acknowledgment

746	   Funding for the RFC Editor function is currently provided by the
747	   Internet Society.