idnits 2.17.1 

draft-duerst-dns-i18n-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-16) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 440 has weird spacing: '...ork for  hostn...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 1998) is 9407 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'RFC1730' on line 661 looks like a reference

  -- Missing reference section? 'Kle96' on line 644 looks like a reference

  -- Missing reference section? 'ISO3166' on line 636 looks like a reference

  -- Missing reference section? 'ASCII' on line 621 looks like a reference

  -- Missing reference section? 'ISO10646' on line 639 looks like a reference

  -- Missing reference section? 'RFC1530' on line 153 looks like a reference

  -- Missing reference section? 'RFC1522' on line 654 looks like a reference

  -- Missing reference section? 'Unicode' on line 676 looks like a reference

  -- Missing reference section? 'RFCIAB' on line 671 looks like a reference

  -- Missing reference section? 'RFC2044' on line 668 looks like a reference

  -- Missing reference section? 'RFC1642' on line 658 looks like a reference

  -- Missing reference section? 'HTML-I18N' on line 628 looks like a reference

  -- Missing reference section? 'Yer96' on line 679 looks like a reference

  -- Missing reference section? 'RFC1738' on line 665 looks like a reference

  -- Missing reference section? 'Dillon96' on line 624 looks like a reference

  -- Missing reference section? 'RFC1034' on line 648 looks like a reference

  -- Missing reference section? 'RFC1035' on line 651 looks like a reference


     Summary: 7 errors (**), 0 flaws (~~), 3 warnings (==), 19 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Draft                                                M. Duerst
3	<draft-duerst-dns-i18n-02.txt>                          Keio University
4	Expires in six months                                         July 1998

6	                  Internationalization of Domain Names

8	Status of this Memo

10	   This document is an Internet-Draft.  Internet-Drafts are working doc-
11	   uments of the Internet Engineering Task Force (IETF), its areas, and
12	   its working groups. Note that other groups may also distribute work-
13	   ing documents as Internet-Drafts.

15	   Internet-Drafts are draft documents valid for a maximum of six
16	   months. Internet-Drafts may be updated, replaced, or obsoleted by
17	   other documents at any time.  It is not appropriate to use Internet-
18	   Drafts as reference material or to cite them other than as a "working
19	   draft" or "work in progress".

21	   To learn the current status of any Internet-Draft, please check the
22	   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
23	   Directories on ftp.ietf.org (US East Coast), nic.nordu.net
24	   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
25	   Rim).

27	   Distribution of this document is unlimited.  Please send comments to
28	   the author at <mduerst@w3.org>.

30	Abstract

32	   Internet domain names are currently limited to a very restricted
33	   character set. This document proposes the introduction of a new
34	   "zero-level" domain (ZLD) to allow the use of arbitrary characters
35	   from the Universal Character Set (ISO 10646/Unicode) in domain names.
36	   The proposal is fully backwards compatible and does not need any
37	   changes to DNS. Version 02 is reissued without changes just to
38	   keep this draft available.

40	Table of contents

42	   0. Change History ................................................. 2
43	     0.8 Changes Made from Version 01 to Version 02 .................. 2
44	     0.9 Changes Made from Version 00 to Version 01 .................. 2
45	   1. Introduction ................................................... 3
46	     1.1 Motivation .................................................. 3
47	     1.2 Notational Conventions ...................................... 4
48	   2. The Hidden Zero Level Domain ................................... 4
49	   3. Encoding International Characters .............................. 5
50	     3.1 Encoding Requirements ....................................... 5
51	     3.2 Encoding Definition ......................................... 5
52	     3.3 Encoding Example ............................................ 7
53	     3.4 Length Considerations ....................................... 8
54	   4. Usage Considerations ........................................... 8
55	     4.1 General Usage ............................................... 8
56	     4.2 Usage Restrictions .......................................... 9
57	     4.3 Domain Name Creation ....................................... 10
58	     4.4 Usage in URLs .............................................. 12
59	   5. Alternate Proposals ........................................... 13
60	     5.1 The Dillon Proposal ........................................ 13
61	     5.2 Using a Separate Lookup Service ............................ 13
62	   6. Generic Considerations ........................................ 14
63	     5.1 Security Considerations .................................... 14
64	     5.2 Internationalization Considerations ........................ 14
65	   Acknowledgements ................................................. 14
66	   Bibliography ..................................................... 15
67	   Author's Address ................................................. 16

70	0. Change History

72	0.8 Changes Made from Version 01 to Version 02

74	   No significant changes; reissued to make it available officially.
75	   Changed author's address.

77	   Changes deferred to future versions (if ever):
78	   -  Decide on ZLD name (.i or .i18n.int or something else)
79	   -  Decide on casing solution
80	   -  Decide on exact syntax
81	   -  Proposals for experimental setup

83	0.9 Changes Made from Version 00 to Version 01
84	   -  Minor rewrites and clarifications

86	   -  Added the following references: [RFC1730], [Kle96], [ISO3166],
87	      [iNORM]

89	   -  Slightly expanded discussion about casing

91	   -  Added some variant proposals for syntax

93	   -  Added some explanations about different kinds of name parallelism

95	   -  Added some explanation about independent addition of internation-
96	      alized names in subdomains without bothering higher-level domains

98	   -  Added some explanations about tools needed for support, and the
99	      MX/CNAME problem

101	   -  Change to RFC1123 (numbers allowed at beginning of labels)

103	1. Introduction

105	1.1 Motivation

107	   The lower layers of the Internet do not discriminate any language or
108	   script. On the application level, however, the historical dominance
109	   of the US and the ASCII character set [ASCII] as a lowest common
110	   denominator have led to limitations. The process of removing these
111	   limitations is called internationalization (abbreviated i18n).  One
112	   example of the above-mentioned limitations is domain names [RFC1034,
113	   RFC1035], where only the letters of the basic Latin alphabet (case-
114	   insensitive), the decimal digits, and the hyphen are allowed.

116	   While such restrictions are convenient if a domain name is intended
117	   to be used by arbitrary people around the globe, there may be very
118	   good reasons for using aliases that are easier to remember or type
119	   in a local context. This is similar to traditional mail addresses,
120	   where both local scripts and conventions and the Latin script can be
121	   used.

123	   There are many good reasons for domain name i18n, and some arguments
124	   that are brought forward against such an extension. This document,
125	   however, does not discuss the pros and cons of domain name i18n. It
126	   proposes and discusses a solution and therefore eliminates one of the
127	   most often heard arguments against it, namely "it cannot be done".

129	   The solution proposed in this document consists of the introduction
130	   of a new "zero-level" domain building the root of a new domain
131	   branch, and an encoding of the Universal Character Set (UCS)
132	   [ISO10646] into the limited character set of domain names.

134	1.2 Notational Conventions

136	   In the domain name examples in this document, characters of the basic
137	   Latin alphabet (expressible in ASCII) are denoted with lower case
138	   letters. Upper case letters are used to represent characters outside
139	   ASCII, such as accented characters of the Latin alphabet, characters
140	   of other alphabets and syllabaries, ideographic characters, and vari-
141	   ous signs.

143	2. The Hidden Zero Level Domain

145	   The domain name system uses the domain "in-addr.arpa" to convert
146	   internet addresses back to domain names. One way to view this is to
147	   say that in-addr.arpa forms the root of a separate hierarchy.  This
148	   hierarchy has been made part of the main domain name hierarchy just
149	   for implementation convenience. While syntactically, in-addr.arpa is
150	   a second level domain (SLD), functionally it is a zero level domain
151	   (ZLD) in the same way as "." is a ZLD.  A similar example of a ZLD is
152	   the domain tpc.int, which provides a hierarchy of the global phone
153	   numbering system [RFC1530] for services such as paging and printing
154	   to fax machines.

156	   For domain name i18n to work inside the tight restrictions of domain
157	   name syntax, one has to define an encoding that maps strings of UCS
158	   characters to strings of characters allowable in domain names, and a
159	   means to distinguish domain names that are the result of such an
160	   encoding from ordinary domain names.

162	   This document proposes to create a new ZLD to distinguish encoded
163	   i18n domain names from traditional domain names.  This domain would
164	   be hidden from the user in the same way as a user does not see in-
165	   addr.arpa.  This domain could be called "i18n.arpa" (although the use
166	   of arpa in this context is definitely not appropriate), simply
167	   "i18n", or even just "i". Below, we are using "i" for shortness,
168	   while we leave the decision on the actual name to further discussion.


174	3. Encoding International Characters

176	3.1 Encoding Requirements

178	   Until quite recently, the thought of going beyond ASCII for something
179	   such as domain names failed because of the lack of a single encom-
180	   passing character set for the scripts and languages of the world.
181	   Tagging techniques such as those used in MIME headers [RFC1522] would
182	   be much too clumsy for domain names.

184	   The definition of ISO 10646 [ISO10646], codepoint by codepoint iden-
185	   tical with Unicode [Unicode], provides a single Universal Character
186	   Set (UCS).  A recent report [RFCIAB] clearly recommends to base the
187	   i18n of the Internet on these standards.

189	   An encoding for i18n domain names therefore has to take the charac-
190	   ters of ISO 10646/Unicode as a starting point.  The full four-byte
191	   (31 bit) form of UCS, called UCS4, should be used. A limitation to
192	   the two-byte form (UCS2), which allows only for the encoding of the
193	   Base Multilingual Plane, is too restricting.

195	   For the mapping between UCS4 and the strongly limited character set
196	   of domain names, the following constraints have to be considered:

198	   -  The structure of domain names, and therefore the "dot", have to be
199	      conserved. Encoding is done for individual labels.

201	   -  Individual labels in domain names allow the basic Latin alphabet
202	      (monocase, 26 letters), decimal digits, and the "-" inside the
203	      label.  The capacity per octet is therefore limited to somewhat
204	      above 5 bits.

206	   -  There is neither a need nor a possibility to preserve any characters.

208	   -  Frequent characters (i.e. ASCII, alphabetic, UCS2, in that order)
209	      should be encoded relatively compactly. A variable-length encoding
210	      (similar to UTF-8) seems desirable.

212	3.2 Encoding Definition

214	   Several encodings for UCS, so called UCS Transform Formats, exist
215	   already, namely UTF-8 [RFC2044], UTF-7 [RFC1642], and UTF-16 [Uni-
216	   code]. Unfortunately, none of them is suitable for our purposes. We
217	   therefore use the following encoding:

219	   -  To accommodate the slanted probability distribution of characters
220	      in UCS4, a variable-length encoding is used.

222	   -  Each target letter encodes 5 bits of information.  Four bits of
223	      information encode character data, the fifth bit is used to indi-
224	      cate continuation of the variable-length encoding.

226	   -  Continuation is indicated by distinguishing the initial letter
227	      from the subsequent letter.

229	   -  Leading four-bit groups of binary value 0000 of UCS4 characters
230	      are discarded, except for the last TWO groups (i.e. the last
231	      octet).  This means that ASCII and Latin-1 characters need two
232	      target letters, the main alphabets up to and including Tibetan
233	      need three target letters, the rest of the characters in the BMP
234	      need four target letters, all except the last (private) plane in
235	      the UTF-16/Surrogates area [Unicode] need five target letters, and
236	      so on.

238	   -  The letters representing the various bit groups in the various
239	      positions are chosen according to the following table:

241	        Nibble Value   Initial   Subsequent
242	        Hex  Binary
243	        0    0000      G         0
244	        1    0001      H         1
245	        2    0010      I         2
246	        3    0011      J         3
247	        4    0100      K         4
248	        5    0101      L         5
249	        6    0110      M         6
250	        7    0111      N         7
251	        8    1000      O         8
252	        9    1001      P         9
253	        A    1010      Q         A
254	        B    1011      R         B
255	        C    1100      S         C
256	        D    1101      T         D
257	        E    1110      U         E
258	        F    1111      V         F
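
   The rules and the table above can be sketched in Python as follows.
   This is only an illustration under the assumption that the table is
   used exactly as printed; the names INITIAL, encode_char, and
   encode_label are hypothetical and not part of this proposal.

```python
# Initial letters from the table above, indexed by nibble value 0..15.
# (Hypothetical helper names; the draft defines only the table itself.)
INITIAL = "GHIJKLMNOPQRSTUV"


def encode_char(ch: str) -> str:
    """Encode one UCS character as an initial letter plus hex digits."""
    # Hex digits without leading zero groups, padded to two digits so
    # that the last octet is always kept, as the rules above require.
    nibbles = format(ord(ch), "X").zfill(2)
    # The first nibble is mapped through the "Initial" column; the
    # "Subsequent" column is identical to the hexadecimal digits.
    return INITIAL[int(nibbles[0], 16)] + nibbles[1:]


def encode_label(label: str) -> str:
    """Encode every character of a single label (dots are not handled)."""
    return "".join(encode_char(ch) for ch in label)
```

   With this sketch, the letter "f" (U+0066) becomes "M6" and the
   character U+60C5 becomes "M0C5", in line with the examples given
   elsewhere in this document.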

260	   [Should we try to eliminate "I" and "O" from initial? "I" might be
261	   eliminated because then an algorithm can more easily detect ".i". "O"
262	   could lead to some confusion with "0".  What other protocols are
263	   there that might be able to use a similar solution, but that might
264	   have other restrictions for the initial letters? Proposal to run ini-
265	   tial range from H to X. Extracting the initial bits then becomes ^
266	   'H'.  Proposal to have a special convention for all-ASCII labels
267	   (start label with one of the letters not used above).]

269	   Please note that this solution has the following interesting proper-
270	   ties:

272	   -  For subsequent positions, there is an equivalence between the hex-
273	      adecimal value of the character code and the target letter used.
274	      This assures easy conversion and checking.

276	   -  The absence of digits from the "initial" column, and the fact that
277	      the hyphen is not used, assures that the resulting string conforms
278	      to domain name syntax.

280	   -  Raw sorting of encoded and unencoded domain names is equivalent.

282	   -  The boundaries of characters can always be detected easily.
283	      (While this is important for representations that are used inter-
284	      nally for text editing, it is actually not very important here,
285	      because tools for editing can be assumed to use a more straight-
286	      forward representation internally.)

288	   -  Unless control characters are allowed, the target string will
289	      never actually contain a G.
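
   The boundary-detection property can be illustrated with a small
   decoder sketch in Python. It assumes that encoded labels may arrive
   in either case (DNS being case-insensitive) and are upper-cased
   first; the function name decode_label is hypothetical.

```python
INITIAL = "GHIJKLMNOPQRSTUV"      # a letter from this set starts a character
SUBSEQUENT = "0123456789ABCDEF"   # a letter from this set continues it


def decode_label(label: str) -> str:
    """Decode one encoded label back into a string of UCS characters."""
    chars = []
    code = None  # code point being assembled, None before the first letter
    for letter in label.upper():
        if letter in INITIAL:
            # An initial letter always marks a character boundary.
            if code is not None:
                chars.append(chr(code))
            code = INITIAL.index(letter)
        elif letter in SUBSEQUENT and code is not None:
            # Subsequent letters are plain hexadecimal digits.
            code = code * 16 + SUBSEQUENT.index(letter)
        else:
            raise ValueError(f"unexpected letter {letter!r} in encoded label")
    if code is not None:
        chars.append(chr(code))
    return "".join(chars)
```

   Because the two letter sets are disjoint, the decoder never needs to
   look ahead to find where one character ends and the next begins.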

291	3.3 Encoding Example

293	   As an example, the current domain

295	        is.s.u-tokyo.ac.jp

297	   with the components standing for information science, science, the
298	   University of Tokyo, academic, and Japan, might in future be repre-
299	   sented by

301	        JOUHOU.RI.TOUDAI.GAKU.NIHON

303	   (a transliteration of the kanji that would probably be chosen to rep-
304	   resent the same domain). Writing each character in U+HHHH notation as
305	   in [Unicode], this results in the following (given for reference
306	   only, not the actual encoding or something being typed in by the
307	   user):

309	        U+60c5U+5831.U+7406.U+6771U+5927.U+5b66.U+65e5U+672c

311	   The software handling internationalized domain names will translate
312	   this, according to the above specifications, before submitting it to
313	   the DNS resolver, to:

315	        M0C5L831.N406.M771L927.LB66.M5E5M72C.i
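
   The translation of this example can be reproduced with the following
   Python sketch. It is an illustration only; the function name
   encode_domain is hypothetical, and the ZLD name "i" is the
   placeholder used in this document, not a decided value.

```python
INITIAL = "GHIJKLMNOPQRSTUV"  # initial letters for nibble values 0..15


def encode_domain(name: str, zld: str = "i") -> str:
    """Encode each label of an i18n domain name and append the ZLD."""
    labels = []
    for label in name.split("."):
        encoded = ""
        for ch in label:
            # Hex digits of the code point, keeping at least the last octet.
            nibbles = format(ord(ch), "X").zfill(2)
            encoded += INITIAL[int(nibbles[0], 16)] + nibbles[1:]
        labels.append(encoded)
    return ".".join(labels) + "." + zld


# The example domain, written with escapes for the kanji.
name = "\u60c5\u5831.\u7406.\u6771\u5927.\u5b66.\u65e5\u672c"
print(encode_domain(name))  # M0C5L831.N406.M771L927.LB66.M5E5M72C.i
```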

317	3.4 Length Considerations

319	   DNS allows for a maximum of 63 positions in each part, and for 255
320	   positions for the overall domain name including dots.  This allows up
321	   to 15 ideographs, or up to 21 letters e.g.  from the Hebrew or Arabic
322	   alphabet, in a label.  While this does not allow for the same margin
323	   as in the case of ASCII domain names, it should still be quite suffi-
324	   cient.  [Problems could only surface for languages that use very long
325	   words or terms and don't know any kind of abbreviations or similar
326	   shortening devices. Do these exist?  An Icelandic expert asserted
327	   Icelandic is not a problem.]  DNS contains a compression scheme that
328	   avoids sending the same trailing portion of a domain name twice in
329	   the same transmission. Long domain names are therefore not that much
330	   of a concern.
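
   The figures above follow from simple arithmetic, sketched here in
   Python. The helper names are hypothetical; the only inputs taken
   from this document are the 63-position label limit and the rule
   that at least two target letters are used per character.

```python
MAX_LABEL_LENGTH = 63  # DNS limit on positions per label


def letters_needed(code_point: int) -> int:
    """Target letters for one character: one per hex digit, at least two."""
    return max(2, len(format(code_point, "X")))


def max_characters_per_label(code_point: int) -> int:
    """How many characters of this size fit into a single label."""
    return MAX_LABEL_LENGTH // letters_needed(code_point)


print(max_characters_per_label(0x60C5))    # ideograph, 4 letters: 15
print(max_characters_per_label(0x05D0))    # Hebrew letter, 3 letters: 21
print(max_characters_per_label(ord("a")))  # ASCII letter, 2 letters: 31
```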

332	4. Usage Considerations

334	4.1 General Usage

336	   To implement this proposal, neither DNS servers nor resolvers need
337	   changes.  These programs will only deal with the encoded form of the
338	   domain name with the .i suffix. Software that wants to offer an
339	   internationalized user interface (for example a web browser) is
340	   responsible for the necessary conversions. It will analyze the domain
341	   name, call the resolver directly if the domain name conforms to the
342	   domain name syntax restrictions, and otherwise encode the name
343	   according to the specifications of Section 3.2 and append the .i suf-
344	   fix before calling the resolver.  New implementations of resolvers
345	   will of course offer a companion function to gethostbyname accepting
346	   an ISO 10646/Unicode string as input.
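
   The decision made by such software can be sketched in Python as
   follows. This is a minimal illustration, not a resolver: the
   function name to_resolver_name is hypothetical, the syntax check is
   a simplified reading of the traditional label rules (letters,
   digits, and inner hyphens, at most 63 positions), and no DNS query
   is actually issued.

```python
import re

INITIAL = "GHIJKLMNOPQRSTUV"  # initial letters for nibble values 0..15

# Simplified traditional label syntax (an assumption of this sketch).
LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def to_resolver_name(name: str) -> str:
    """Return the name that would be handed to an unchanged resolver."""
    labels = name.split(".")
    if all(LABEL.fullmatch(label) for label in labels):
        # Conforms to the existing syntax: pass it through untouched.
        return name
    encoded = []
    for label in labels:
        # Encode every label, including all-ASCII ones such as "fr".
        hex_forms = (format(ord(ch), "X").zfill(2) for ch in label)
        encoded.append("".join(INITIAL[int(h[0], 16)] + h[1:] for h in hex_forms))
    return ".".join(encoded) + ".i"


print(to_resolver_name("is.s.u-tokyo.ac.jp"))  # is.s.u-tokyo.ac.jp
print(to_resolver_name("caf\u00e9.fr"))        # M3M1M6U9.M6N2.i
```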

348	   For domain name administrators, the main tool that will be needed is
349	   a program to compile files configuring zones from a UTF-8 notation
350	   (or any other suitable encoding) to the encoding described in Section
351	   3.2. Utility tools will include a corresponding decompiler, checkers
352	   for various kinds of internationalization-related errors, and tools
353	   for managing syntactic parallelism (see Section 4.3).

355	4.2 Usage Restrictions

357	   While this proposal in theory allows control characters such
358	   as BEL or NUL or symbols such as arrows and smileys in domain names,
359	   such characters should clearly be excluded from domain names. Whether
360	   this has to be explicitly specified or whether the difficulty of typing
361	   these characters on any keyboard of the world will limit their use
362	   has to be discussed. One approach is to start with a very restricted
363	   subset and gradually relax it; the other is to allow almost anything
364	   and to rely on common sense. Anyway, such specifications should go
365	   into a separate document to allow easy updates.

367	   A related point is the question of equivalence. For historical rea-
368	   sons, ISO 10646/Unicode contain a considerable number of compatibility
369	   characters and allow more than one representation for characters with
370	   diacritics. To guarantee smooth interoperability in these and related
371	   cases, additional restrictions or the definition of some form of nor-
372	   malization seem necessary.  However, this is a general problem
373	   affecting all areas where ISO 10646/Unicode is used in identifiers,
374	   and should therefore be addressed in a generic way.  See [iNORM] for
375	   an initial proposal.

377	   Equally related is the problem of case equivalence.  Users can very
378	   well distinguish between upper case and lower case.  Also, casing in
379	   an i18n context is not as straightforward as for ASCII, so that case
380	   equivalence is best avoided.  Problems therefore result not from the
381	   fact that case is distinguished for i18n domain names, but from the
382	   fact that existing domain names do not distinguish case. Where it is
383	   impossible to distinguish between next.com and NeXT.com, the same two
384	   subdomains would easily be distinguishable if subordinate to an i18n
385	   domain.  There are several possible solutions. One is to try to grad-
386	   ually migrate from a case-insensitive solution to a case-sensitive
387	   solution even for ASCII. Another is to allow case-sensitivity only
388	   beyond ASCII. Another is to restrict anything beyond ASCII to lower-
389	   case only (lowercase distinguishes better than uppercase, and is also
390	   generally used for ASCII domain names).

392	   A problem that also has to be discussed and solved is bidirectional-
393	   ity.  Arabic and Hebrew characters are written right-to-left, and the
394	   mixture with other characters results in a divergence between logical
395	   and graphical sequence. See [HTML-I18N] for more explanations.  The
396	   proposal of [Yer96] for dealing with bidirectionality in URLs could
397	   probably be applied to domain names. Anyway, there should be a gen-
398	   eral solution for identifiers, not a DNS-specific solution.

400	4.3 Domain Name Creation

402	   The ".i" ZLD should be created as such to allow the internationaliza-
403	   tion of domain names. Rules for creating subdomains inside ".i"
404	   should follow the established rules for the creation of functionally
405	   equivalent domains in the existing domain hierarchy, and should
406	   evolve in parallel.

408	   For the actual domain hierarchy, the amount of parallelism between
409	   the current ASCII-oriented hierarchy and some internationalized hier-
410	   archy depends on various factors.  In some cases, two fully parallel
411	   hierarchies may emerge.  In other cases, if more than one script or
412	   language is used locally, more than two parallel hierarchies may
413	   emerge.  Some nodes, e.g. in intranets, may only appear in an i18n
414	   hierarchy, whereas others may only appear in the current hierarchy.
415	   In some cases, the peculiarities of scripts, languages, cultures, and
416	   the local marketplace may lead to completely different hierarchies.

418	   Also, one has to be aware that there may be several kinds of paral-
419	   lelisms. The first one is called syntactic parallelism.  If there is
420	   a domain XXXX.yy.zz and a domain vvvv.yy.zz, then the domain yy.zz
421	   will have to exist in the traditional DNS hierarchy as well as
422	   within the hierarchy starting at the .i ZLD, with appropriate encod-
423	   ing.

425	   The second type of parallelism is called transcription parallelism.
426	   It results from transcription or transliteration relations between ASCII
427	   domain names and domain names in other scripts.

429	   The third type of parallelism is called semantic parallelism.  It
430	   results from translating elements of a domain name from one language
431	   to another, possibly also changing the script or set of used charac-
432	   ters.

434	   On the host level, parallelism means that there are two names for the
435	   same host. Conventions should exist to decide whether the parallel
436	   names should have separate IP addresses or not (A record or CNAME
437	   record).  With separate IP addresses, address to name lookup is easy,
438	   otherwise it needs special precautions to be able to find all names
439	   corresponding to a given host address.  Another detail entering this
440	   consideration is that MX records only work for hostnames/domains,
441	   not for CNAME aliases.  This at least has the consequence that alias
442	   resolution for internationalized mail addresses has to occur before
443	   MX record lookup.

445	   When discussing and applying the rules for creating domain names,
446	   some peculiarities of i18n domain names should be carefully consid-
447	   ered:

449	   -  Depending on the script, reasonable lengths for domain name parts
450	      may differ greatly. For ideographic scripts, a part may often be
451	      only a one-letter code. Established rules for lengths may need
452	      adaptation. For example, a rule for country TLDs could read: one
453	      ideographic character or two other characters.

455	   -  If the number of generic TLDs (.com, .edu, .org, .net) is kept
456	      low, then it may be feasible to restrict i18n TLDs to country
457	      TLDs.

459	   -  There are no ISO 3166 [ISO3166] two-letter codes in scripts other
460	      than Latin.  I18n domain names for countries will have to be
461	      designed from scratch.

463	   -  The names of some countries or regions may pose greater political
464	      problems when expressed in the native script than when expressed
465	      in 2-letter ISO 3166 codes.

467	   -  I18n country domain names should in principle only be created in
468	      those scripts that are used locally. There is probably little use
469	      in creating an Arabic domain name for China, for example.

471	   -  In those cases where domain names are open to a wide range of
472	      applicants, a special procedure for accepting applications should
473	      be used so that a reasonable-quality fit between ASCII domain
474	      names and i18n domain names results where desired.  This would
475	      probably be done by establishing a period of about a month for
476	      applications inside an i18n domain newly created as a parallel for
477	      an existing domain, and resolving the detected conflicts.  For
478	      syntactically parallel domain names, the owners should always be
479	      the same. Administration may be split in some cases to account for
480	      the necessary linguistic knowledge.  For domain names with tran-
481	      scription parallelism and semantic parallelism, the question of
482	      owner identity should depend on the real-life situation (trade-
483	      marks,...).

485	   -  It will be desirable to have internationalized subdomains in non-
486	      internationalized TLDs. As an example, many companies in France
487	      may want to register an accented version of their company name,
488	      while remaining under the .fr TLD. For this, .fr would have to be
489	      reregistered as .M6N2.i. Accented and other internationalized sub-
490	      domains would go below .M6N2.i, whereas unaccented ones would go
491	      below .fr in its plain form.

493	   -  To generalize the above case, one may need to create a requirement
494	      that any domain name registry would have to register and manage
495	      syntactically parallel domain names below the .i ZLD upon request
496	      to allow registration of i18n domain names in arbitrary subdo-
497	      mains.  An alternative to this is to organize domain name search
498	      so that e.g. in a search for XXXXXX.fr, if M6N2.i is not found in
499	      .i, the name server for .fr is queried for XXXXXX.M6N2.i (with
500	      XXXXXX appropriately encoded).  This convention would allow lower-
501	      level domains to introduce internationalized subdomains without
502	      depending on higher-level domains.

504	4.4 Usage in URLs

506	   According to current definitions, URLs encode sequences of octets
507	   into a sequence of characters from a character set that is almost as
508	   limited as the character set of domain names [RFC1738].  This is
509	   clearly not satisfactory for i18n.

511	   Internationalizing URLs, i.e. assigning character semantics to the
512	   encoded octets, can either be done separately for each part and/or
513	   scheme, or in a uniform way. Doing it separately has the serious
514	   disadvantage that software providing user interfaces for URLs in gen-
515	   eral would have to know about all the different i18n solutions of the
516	   different parts and schemes. Many of these solutions may not even be
517	   known yet.

519	   It is therefore preferable to decide on a single and consistent
520	   solution for URL internationalization. The most promising candidate
521	   [Yer96] is UTF-8 [RFC2044], an ASCII-compatible encoding of UCS4.

524	   Therefore, a URL containing the domain name of the example of
525	   Section 3.3 should not be written as:

527	        ftp://M0C5L831.N406.M771L927.LB66.M5E5M72C.i

529	   (although this will also work) but rather

531	        ftp://%e6%83%85%e5%a0%b1.%e7%90%86.%e6%9d%b1%e5%a4%a7.
532	             %e5%ad%a6.%e6%97%a5%e6%9c%ac

534	   In this canonical form, the trailing .i is absent, and the octets can
535	   be reconstructed from the %HH-encoding and interpreted as UTF-8 by
536	   generic URL software. The software part dealing with domain names
537	   will carry out the conversion to the .i form.
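
   The canonical URL form above can be reproduced with a short sketch.
   It assumes that only ASCII letters, digits, and the hyphen are left
   unencoded within a label, and that dots separate labels; the
   function names are chosen for this illustration only. The
   conversion to the .i form is not shown, since it depends on the
   encoding defined in Section 3.2.

```python
from urllib.parse import unquote_to_bytes


def label_to_url_form(label):
    """%HH-encode the UTF-8 octets of one domain name label."""
    out = []
    for octet in label.encode("utf-8"):
        ch = chr(octet)
        if octet < 0x80 and (ch.isalnum() or ch == "-"):
            out.append(ch)                       # plain ASCII stays as is
        else:
            out.append("%{:02x}".format(octet))  # everything else as %HH
    return "".join(out)


def host_to_url_form(host):
    """Encode each label separately, keeping the dots between them."""
    return ".".join(label_to_url_form(part) for part in host.split("."))


def url_form_to_host(encoded):
    """Reconstruct the octets from %HH and interpret them as UTF-8."""
    return unquote_to_bytes(encoded).decode("utf-8")


# The characters of the example domain name, written as escapes so
# that this listing itself stays within ASCII.
host = "\u60c5\u5831.\u7406.\u6771\u5927.\u5b66.\u65e5\u672c"
encoded = host_to_url_form(host)
print("ftp://" + encoded)
assert url_form_to_host(encoded) == host
```

   The printed URL matches the second form shown above, written on a
   single line.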

539	5. Alternate Proposals

541	5.1 The Dillon Proposal

543	   The proposal of Michael Dillon [Dillon96] is also based on encoding
544	   Unicode into the limited character set of domain names. Each part
545	   is flagged as internationalized by an initial hyphen. Because
546	   this does not fully conform to the syntax of existing domain names,
547	   it is questionable whether it is backwards-compatible. On the other
548	   hand, this has the advantage that local i18n domain names can be
549	   installed easily without cooperation by the manager of the superdo-
550	   main.

552	   A variable-length scheme with base 36 is used that can encode up to
553	   1610 characters, far too few for Chinese or Japanese.
554	   Characters assumed not to be used in i18n domain names are excluded,
555	   i.e. only one case is allowed for basic Latin characters.  This means
556	   that large tables have to be worked out carefully to convert between
557	   ISO 10646/Unicode and the actual number that is encoded with base 36.

560	5.2 Using a Separate Lookup Service

562	   Instead of using a special encoding and burdening DNS with i18n, one
563	   could build and use a separate lookup service for i18n domain names.
564	   Instead of converting to UCS4 and encoding according to Section 3.2,
565	   and then calling the DNS resolver, a program would contact this new
566	   service when seeing a domain name with characters outside the allowed
567	   range.

569	   Such solutions have various problems. There are many directory ser-
570	   vices and proposals for how to use them in a way similar to DNS. For
571	   an overview and a specific proposal, see [Kle96].  However, while
572	   there are many proposals, no real service exists that contains the
573	   necessary data and offers the wide installed base and distributed
574	   updating of DNS.

576	   Most directory service proposals also do not offer uniqueness.
577	   Defining unique names again for a separate service will duplicate
578	   much of the work done for DNS. If uniqueness is not guaranteed, the
579	   user is burdened with additional selection steps.

581	   Using a separate lookup service for the internationalization of
582	   domain names also results in more complex implementations than the
583	   proposal made in this draft. Contrary to what some people might
584	   expect, the use of a separate lookup service also does not solve a
585	   capacity problem with DNS, because there is no such problem, nor will
586	   one be created with the introduction of i18n domain names.

588	6. Generic Considerations

590	6.1 Security Considerations

592	   This proposal is not believed to raise any security considerations
593	   beyond those of the current use of the domain name system.

595	6.2 Internationalization Considerations

597	   This proposal addresses internationalization as such. The main addi-
598	   tional consideration with respect to internationalization may be the
599	   indication of language. However, for concise identifiers such as
600	   domain names, language tagging would be too much of a burden and
601	   would create complex dependencies with semantics.

603	        NOTE -- This section is introduced based on a recommenda-
604	        tion in [RFCIAB]. A similar section addressing internation-
605	        alization should be included in all application level
606	        internet drafts and RFCs.

608	Acknowledgements

610	   I am grateful in particular to the following persons for their advice
611	   or criticism: Bert Bos, Lori Brownell, Michael Dillon, Donald E.
612	   Eastlake 3rd, David Goldsmith, Larry Masinter, Ryan Moats, Keith
613	   Moore, Thorvardur Kari Olafson, Erik van der Poel, Jurgen Schwertl,
614	   Paul A. Vixie, Francois Yergeau, and others.

616	Internet Draft    Internationalization of Domain Names         July 1997

619	Bibliography

621	   [ASCII]        Coded Character Set -- 7-Bit American Standard Code
622	                  for Information Interchange, ANSI X3.4-1986.

624	   [Dillon96]     M. Dillon, "Multilingual Domain Names", Memra Software
625	                  Inc., November 1996 (circulated Dec. 6, 1996 on iahc-
626	                  discuss@iahc.org).

628	   [HTML-I18N]    F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
629	                  nationalization of the Hypertext Markup Language",
630	                  Work in progress (draft-ietf-html-i18n-05.txt), August
631	                  1996.

633	   [iNORM]        M. Duerst, "Normalization of Internationalized Identi-
634	                  fiers", draft-duerst-i18n-norm-00.txt, July 1997.

636	   [ISO3166]      ISO 3166, "Code for the representation of names of
637	                  countries", ISO 3166:1993.

639	   [ISO10646]     ISO/IEC 10646-1:1993. International standard -- Infor-
640	                  mation technology -- Universal multiple-octet coded
641	                  character Set (UCS) -- Part 1: Architecture and basic
642	                  multilingual plane.

644	   [Kle96]        J. Klensin and T. Wolf, Jr., "Domain Names and Company
645	                  Name Retrieval", Work in progress (draft-klensin-tld-
646	                  whois-01.txt), November 1996.

648	   [RFC1034]      P. Mockapetris, "Domain Names - Concepts and Facili-
649	                  ties", ISI, Nov. 1987.

651	   [RFC1035]      P. Mockapetris, "Domain Names - Implementation and
652	                  Specification", ISI, Nov. 1987.

654	   [RFC1522]      K. Moore, "MIME (Multipurpose Internet Mail Exten-
655	                  sions) Part Two: Message Header Extensions for Non-
656	                  ASCII Text", University of Tennessee, September 1993.

658	   [RFC1642]      D. Goldsmith, M. Davis, "UTF-7: A Mail-safe Transfor-
659	                  mation Format of Unicode", Taligent Inc., July 1994.

661	   [RFC1730]      C. Malamud and M. Rose, "Principles of Operation for
662	                  the TPC.INT Subdomain: General Principles and Policy",
663	                  Internet Multicasting Service, October 1993.

665	   [RFC1738]      T. Berners-Lee, L. Masinter, and M. McCahill,
666	                  "Uniform Resource Locators (URL)", CERN, Dec. 1994.

668	   [RFC2044]      F. Yergeau, "UTF-8, A Transformation Format of Unicode
669	                  and ISO 10646", Alis Technologies, October 1996.

671	   [RFCIAB]       C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
672	                  Atkinson, M. Crispin, P. Svanberg, "Report from the
673	                  IAB Character Set Workshop", October 1996 (currently
674	                  available as draft-weider-iab-char-wrkshop-00.txt).

676	   [Unicode]      The Unicode Consortium, "The Unicode Standard, Version
677	                  2.0", Addison-Wesley, Reading, MA, 1996.

679	   [Yer96]        F. Yergeau, "Internationalization of URLs", Alis Tech-
680	                  nologies,
681	                  <http://www.alis.com:8085/~yergeau/url-00.html>.

684	Author's Address

686	   Martin J. Duerst
687	   World Wide Web Consortium
688	   Keio Research Institute at SFC
689	   Keio University
690	   5322 Endo
691	   Fujisawa
692	   252-8520 Japan

694	   Tel: +81 466 49 11 70
695	   E-mail: mduerst@w3.org

697	     NOTE -- Please write the author's name with u-Umlaut wherever
698	     possible, e.g. in HTML as D&uuml;rst.