Internet Draft                                                 M. Duerst
draft-duerst-dns-i18n-01.txt                        University of Zurich
Expires in six months                                          July 1997

                  Internationalization of Domain Names

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months.
Internet-Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet-Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress".

To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Distribution of this document is unlimited. Please send comments to
the author at .

Abstract

Internet domain names are currently limited to a very restricted
character set. This document proposes the introduction of a new
"zero-level" domain (ZLD) to allow the use of arbitrary characters
from the Universal Character Set (ISO 10646/Unicode) in domain names.
The proposal is fully backwards compatible and does not need any
changes to DNS.

Table of contents

0. Change History
   0.8 Changes (to be) Made from Version 01 to Version 02 (or later)
   0.9 Changes Made from Version 00 to Version 01
1. Introduction
   1.1 Motivation
   1.2 Notational Conventions
2. The Hidden Zero Level Domain
3. Encoding International Characters
   3.1 Encoding Requirements
   3.2 Encoding Definition
   3.3 Encoding Example
   3.4 Length Considerations
4. Usage Considerations
   4.1 General Usage
   4.2 Usage Restrictions
   4.3 Domain Name Creation
   4.4 Usage in URLs
5. Alternate Proposals
   5.1 The Dillon Proposal
   5.2 Using a Separate Lookup Service
6. Generic Considerations
   6.1 Security Considerations
   6.2 Internationalization Considerations
Acknowledgements
Bibliography
Author's Address

0. Change History

0.8 Changes (to be) Made from Version 01 to Version 02 (or later)

- Decide on ZLD name (.i or .i18n.int or something else)

- Decide on casing solution

- Decide on exact syntax

- Proposals for experimental setup

0.9 Changes Made from Version 00 to Version 01

- Minor rewrites and clarifications

- Added the following references: [RFC1530], [Kle96], [ISO3166],
  [iNORM]

- Slightly expanded discussion about casing

- Added some variant proposals for syntax

- Added some explanations about different kinds of name parallelism

- Added some explanation about independent addition of
  internationalized names in subdomains without bothering higher-level
  domains

- Added some explanations about tools needed for support, and the
  MX/CNAME problem

- Change to RFC1123 (numbers allowed at beginning of labels)

1. Introduction

1.1 Motivation

The lower layers of the Internet do not discriminate against any
language or script. On the application level, however, the historical
dominance of the US and the ASCII character set [ASCII] as a lowest
common denominator have led to limitations. The process of removing
these limitations is called internationalization (abbreviated i18n).
One example of these limitations is domain names [RFC1034, RFC1035],
where only the letters of the basic Latin alphabet (case-insensitive),
the decimal digits, and the hyphen are allowed.

While such restrictions are convenient if a domain name is intended to
be used by arbitrary people around the globe, there may be very good
reasons for using aliases that are easier to remember or type in a
local context. This is similar to traditional mail addresses, where
both local scripts and conventions and the Latin script can be used.

There are many good reasons for domain name i18n, and some arguments
that are brought forward against such an extension. This document,
however, does not discuss the pros and cons of domain name i18n. It
proposes and discusses a solution and therefore eliminates one of the
most often heard arguments against it, namely "it cannot be done".

The solution proposed in this document consists of the introduction of
a new "zero-level" domain forming the root of a new domain branch, and
an encoding of the Universal Character Set (UCS) [ISO10646] into the
limited character set of domain names.

1.2 Notational Conventions

In the domain name examples in this document, characters of the basic
Latin alphabet (expressible in ASCII) are denoted with lower case
letters. Upper case letters are used to represent characters outside
ASCII, such as accented characters of the Latin alphabet, characters
of other alphabets and syllabaries, ideographic characters, and
various signs.

2. The Hidden Zero Level Domain

The domain name system uses the domain "in-addr.arpa" to convert
internet addresses back to domain names. One way to view this is to
say that in-addr.arpa forms the root of a separate hierarchy. This
hierarchy has been made part of the main domain name hierarchy just
for implementation convenience.
While syntactically, in-addr.arpa is a second level domain (SLD),
functionally it is a zero level domain (ZLD) in the same way as "." is
a ZLD. A similar example of a ZLD is the domain tpc.int, which
provides a hierarchy of the global phone numbering system [RFC1530]
for services such as paging and printing to fax machines.

For domain name i18n to work inside the tight restrictions of domain
name syntax, one has to define an encoding that maps strings of UCS
characters to strings of characters allowable in domain names, and a
means to distinguish domain names that are the result of such an
encoding from ordinary domain names.

This document proposes to create a new ZLD to distinguish encoded i18n
domain names from traditional domain names. This domain would be
hidden from the user in the same way as a user does not see
in-addr.arpa. This domain could be called "i18n.arpa" (although the
use of arpa in this context is definitely not appropriate), simply
"i18n", or even just "i". Below, we use "i" for brevity, and leave the
decision on the actual name to further discussion.

3. Encoding International Characters

3.1 Encoding Requirements

Until quite recently, the thought of going beyond ASCII for something
such as domain names failed because of the lack of a single
encompassing character set for the scripts and languages of the world.
Tagging techniques such as those used in MIME headers [RFC1522] would
be much too clumsy for domain names.

The definition of ISO 10646 [ISO10646], codepoint by codepoint
identical with Unicode [Unicode], provides a single Universal
Character Set (UCS). A recent report [RFCIAB] clearly recommends
basing the i18n of the Internet on these standards.

An encoding for i18n domain names therefore has to take the characters
of ISO 10646/Unicode as a starting point.
The full four-byte (31 bit) form of UCS, called UCS4, should be used.
A limitation to the two-byte form (UCS2), which allows only for the
encoding of the Basic Multilingual Plane (BMP), is too restrictive.

For the mapping between UCS4 and the strongly limited character set of
domain names, the following constraints have to be considered:

- The structure of domain names, and therefore the "dot", has to be
  conserved. Encoding is done for individual labels.

- Individual labels in domain names allow the basic Latin alphabet
  (monocase, 26 letters), decimal digits, and the "-" inside the
  label. The capacity per octet is therefore limited to somewhat
  above 5 bits.

- There is neither a need nor a possibility to preserve any
  characters.

- Frequent characters (i.e. ASCII, alphabetic, UCS2, in that order)
  should be encoded relatively compactly. A variable-length encoding
  (similar to UTF-8) seems desirable.

3.2 Encoding Definition

Several encodings for UCS, so-called UCS Transformation Formats, exist
already, namely UTF-8 [RFC2044], UTF-7 [RFC1642], and UTF-16
[Unicode]. Unfortunately, none of them is suitable for our purposes.
We therefore use the following encoding:

- To accommodate the skewed probability distribution of characters in
  UCS4, a variable-length encoding is used.

- Each target letter encodes 5 bits of information. Four bits of
  information encode character data; the fifth bit is used to
  indicate continuation of the variable-length encoding.

- Continuation is indicated by distinguishing the initial letter from
  the subsequent letters.

- Leading four-bit groups of binary value 0000 of UCS4 characters are
  discarded, except for the last TWO groups (i.e. the last octet).
  This means that ASCII and Latin-1 characters need two target
  letters, the main alphabets up to and including Tibetan need three
  target letters, the rest of the characters in the BMP need four
  target letters, all except the last (private) plane in the
  UTF-16/Surrogates area [Unicode] need five target letters, and so
  on.

- The letters representing the various bit groups in the various
  positions are chosen according to the following table:

     Nibble Value      Initial   Subsequent
     Hex    Binary
      0      0000         G          0
      1      0001         H          1
      2      0010         I          2
      3      0011         J          3
      4      0100         K          4
      5      0101         L          5
      6      0110         M          6
      7      0111         N          7
      8      1000         O          8
      9      1001         P          9
      A      1010         Q          A
      B      1011         R          B
      C      1100         S          C
      D      1101         T          D
      E      1110         U          E
      F      1111         V          F

[Should we try to eliminate "I" and "O" from initial? "I" might be
eliminated because then an algorithm can more easily detect ".i". "O"
could lead to some confusion with "0". What other protocols are there
that might be able to use a similar solution, but that might have
other restrictions for the initial letters? Proposal to run initial
range from H to X. Extracting the initial bits then becomes ^ 'H'.
Proposal to have a special convention for all-ASCII labels (start
label with one of the letters not used above).]

Please note that this solution has the following interesting
properties:

- For subsequent positions, there is an equivalence between the
  hexadecimal value of the character code and the target letter used.
  This assures easy conversion and checking.

- The absence of digits from the "initial" column, and the fact that
  the hyphen is not used, assures that the resulting string conforms
  to domain name syntax.

- Raw sorting of encoded and unencoded domain names is equivalent.

- The boundaries of characters can always be detected easily.
  (While this is important for representations that are used
  internally for text editing, it is actually not very important
  here, because tools for editing can be assumed to use a more
  straightforward representation internally.)

- Unless control characters are allowed, the target string will never
  actually contain a G.

3.3 Encoding Example

As an example, the current domain

   is.s.u-tokyo.ac.jp

with the components standing for information science, science, the
University of Tokyo, academic, and Japan, might in future be
represented by

   JOUHOU.RI.TOUDAI.GAKU.NIHON

(a transliteration of the kanji that would probably be chosen to
represent the same domain). Writing each character in U+HHHH notation
as in [Unicode], this results in the following (given for reference
only, not the actual encoding or something being typed in by the
user):

   U+60c5U+5831.U+7406.U+6771U+5927.U+5b66.U+65e5U+672c

The software handling internationalized domain names will translate
this, according to the above specifications, before submitting it to
the DNS resolver, to:

   M0C5L831.N406.M771L927.LB66.M5E5M72C.i

3.4 Length Considerations

DNS allows for a maximum of 63 positions in each part, and for 255
positions for the overall domain name including dots. This allows up
to 15 ideographs (four target letters each), or up to 21 letters
(three target letters each) e.g. from the Hebrew or Arabic alphabet,
in a label. While this does not allow for the same margin as in the
case of ASCII domain names, it should still be quite sufficient.
[Problems could only surface for languages that use very long words or
terms and don't know any kind of abbreviations or similar shortening
devices. Do these exist? An Icelandic expert asserted that Icelandic
is not a problem.] DNS contains a compression scheme that avoids
sending the same trailing portion of a domain name twice in the same
transmission.
Long domain names are therefore not that much of a concern.

4. Usage Considerations

4.1 General Usage

To implement this proposal, neither DNS servers nor resolvers need
changes. These programs will only deal with the encoded form of the
domain name with the .i suffix. Software that wants to offer an
internationalized user interface (for example a web browser) is
responsible for the necessary conversions. It will analyze the domain
name, call the resolver directly if the domain name conforms to the
domain name syntax restrictions, and otherwise encode the name
according to the specifications of Section 3.2 and append the .i
suffix before calling the resolver. New implementations of resolvers
will of course offer a companion function to gethostbyname accepting
an ISO 10646/Unicode string as input.

For domain name administrators, the main tool that will be needed is a
program to compile files configuring zones from a UTF-8 notation (or
any other suitable encoding) to the encoding described in Section 3.2.
Utility tools will include a corresponding decompiler, checkers for
various kinds of internationalization-related errors, and tools for
managing syntactic parallelism (see Section 4.3).

4.2 Usage Restrictions

While this proposal in theory allows control characters such as BEL or
NUL, or symbols such as arrows and smileys, in domain names, such
characters should clearly be excluded from domain names. Whether this
has to be explicitly specified, or whether the difficulty of typing
these characters on any keyboard of the world will limit their use,
has to be discussed. One approach is to start with a very restricted
subset and gradually relax it; the other is to allow almost anything
and to rely on common sense. In either case, such specifications
should go into a separate document to allow easy updates.

A related point is the question of equivalence.
For historical reasons, ISO 10646/Unicode contains a considerable
number of compatibility characters and allows more than one
representation for characters with diacritics. To guarantee smooth
interoperability in these and related cases, additional restrictions
or the definition of some form of normalization seem necessary.
However, this is a general problem affecting all areas where ISO
10646/Unicode is used in identifiers, and should therefore be
addressed in a generic way. See [iNORM] for an initial proposal.

Equally related is the problem of case equivalence. Users can very
well distinguish between upper case and lower case. Also, casing in an
i18n context is not as straightforward as for ASCII, so that case
equivalence is best avoided. Problems therefore result not from the
fact that case is distinguished for i18n domain names, but from the
fact that existing domain names do not distinguish case. Where it is
impossible to distinguish between next.com and NeXT.com, the same two
subdomains would easily be distinguishable if subordinate to an i18n
domain. There are several possible solutions. One is to try to
gradually migrate from a case-insensitive solution to a case-sensitive
solution even for ASCII. Another is to allow case-sensitivity only
beyond ASCII. Another is to restrict anything beyond ASCII to
lowercase only (lowercase distinguishes better than uppercase, and is
also generally used for ASCII domain names).

A problem that also has to be discussed and solved is
bidirectionality. Arabic and Hebrew characters are written
right-to-left, and the mixture with other characters results in a
divergence between logical and graphical sequence. See [HTML-I18N] for
more explanations. The proposal of [Yer96] for dealing with
bidirectionality in URLs could probably be applied to domain names.
In any case, there should be a general solution for identifiers, not a
DNS-specific solution.

4.3 Domain Name Creation

The ".i" ZLD should be created as such to allow the
internationalization of domain names. Rules for creating subdomains
inside ".i" should follow the established rules for the creation of
functionally equivalent domains in the existing domain hierarchy, and
should evolve in parallel.

For the actual domain hierarchy, the amount of parallelism between the
current ASCII-oriented hierarchy and some internationalized hierarchy
depends on various factors. In some cases, two fully parallel
hierarchies may emerge. In other cases, if more than one script or
language is used locally, more than two parallel hierarchies may
emerge. Some nodes, e.g. in intranets, may only appear in an i18n
hierarchy, whereas others may only appear in the current hierarchy. In
some cases, the peculiarities of scripts, languages, cultures, and the
local marketplace may lead to completely different hierarchies.

Also, one has to be aware that there may be several kinds of
parallelism. The first one is called syntactic parallelism. If there
is a domain XXXX.yy.zz and a domain vvvv.yy.zz, then the domain yy.zz
will have to exist both in the traditional DNS hierarchy and within
the hierarchy starting at the .i ZLD, with appropriate encoding.

The second type of parallelism is called transcription parallelism. It
results from transcription or transliteration relations between ASCII
domain names and domain names in other scripts.

The third type of parallelism is called semantic parallelism. It
results from translating elements of a domain name from one language
to another, possibly also changing the script or set of used
characters.

On the host level, parallelism means that there are two names for the
same host.
Conventions should exist to decide whether the parallel names should
have separate IP addresses or not (A record or CNAME record). With
separate IP addresses, address-to-name lookup is easy; otherwise
special precautions are needed to be able to find all names
corresponding to a given host address. Another detail entering this
consideration is that MX records only work for hostnames/domains, not
for CNAME aliases. This at least has the consequence that alias
resolution for internationalized mail addresses has to occur before MX
record lookup.

When discussing and applying the rules for creating domain names, some
peculiarities of i18n domain names should be carefully considered:

- Depending on the script, reasonable lengths for domain name parts
  may differ greatly. For ideographic scripts, a part may often be
  only a one-letter code. Established rules for lengths may need
  adaptation. For example, a rule for country TLDs could read: one
  ideographic character or two other characters.

- If the number of generic TLDs (.com, .edu, .org, .net) is kept low,
  then it may be feasible to restrict i18n TLDs to country TLDs.

- There are no ISO 3166 [ISO3166] two-letter codes in scripts other
  than Latin. I18n domain names for countries will have to be
  designed from scratch.

- The names of some countries or regions may pose greater political
  problems when expressed in the native script than when expressed in
  2-letter ISO 3166 codes.

- I18n country domain names should in principle only be created in
  those scripts that are used locally. There is probably little use
  in creating an Arabic domain name for China, for example.

- In those cases where domain names are open to a wide range of
  applicants, a special procedure for accepting applications should
  be used so that a reasonable-quality fit between ASCII domain names
  and i18n domain names results where desired. This would probably be
  done by establishing a period of about a month for applications
  inside an i18n domain newly created as a parallel for an existing
  domain, and resolving the detected conflicts. For syntactically
  parallel domain names, the owners should always be the same.
  Administration may be split in some cases to account for the
  necessary linguistic knowledge. For domain names with transcription
  parallelism and semantic parallelism, the question of owner
  identity should depend on the real-life situation
  (trademarks,...).

- It will be desirable to have internationalized subdomains in
  non-internationalized TLDs. As an example, many companies in France
  may want to register an accented version of their company name,
  while remaining under the .fr TLD. For this, .fr would have to be
  reregistered as .M6N2.i. Accented and other internationalized
  subdomains would go below .M6N2.i, whereas unaccented ones would go
  below .fr in its plain form.

- To generalize the above case, one may need to create a requirement
  that any domain name registry would have to register and manage
  syntactically parallel domain names below the .i ZLD upon request
  to allow registration of i18n domain names in arbitrary subdomains.
  An alternative to this is to organize domain name search so that
  e.g. in a search for XXXXXX.fr, if M6N2.i is not found in .i, the
  name server for .fr is queried for XXXXXX.M6N2.i (with XXXXXX
  appropriately encoded). This convention would allow lower-level
  domains to introduce internationalized subdomains without depending
  on higher-level domains.
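
As an illustration of how names such as M6N2.i in the items above
arise, the following Python sketch implements the encoding of Section
3.2 together with the decision described in Section 4.1 (pass plain
names through, encode everything else and append the ZLD). It is a
sketch under stated assumptions, not part of the proposal: the
function names are hypothetical, the ZLD is assumed to be "i" as in
the rest of this document, and no normalization, casing, or character
restrictions (Section 4.2) are applied.

```python
import re

# Letters from the table in Section 3.2, indexed by nibble value 0..15.
INITIAL = "GHIJKLMNOPQRSTUV"      # first target letter of each character
SUBSEQUENT = "0123456789ABCDEF"   # all following target letters

# Traditional label syntax: letters, digits, and "-" inside the label.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def encode_char(code_point: int) -> str:
    """Encode one UCS4 code point (hypothetical helper name)."""
    # Leading zero nibbles are dropped, but the last octet is kept whole.
    hex_digits = format(code_point, "X").rjust(2, "0")
    return INITIAL[int(hex_digits[0], 16)] + hex_digits[1:]


def encode_label(label: str) -> str:
    """Encode a single label, character by character."""
    return "".join(encode_char(ord(ch)) for ch in label)


def decode_label(encoded: str) -> str:
    """Reverse encode_label; an initial letter starts a new character."""
    code_points = []
    for letter in encoded.upper():
        if letter in INITIAL:
            code_points.append(INITIAL.index(letter))
        elif letter in SUBSEQUENT and code_points:
            code_points[-1] = code_points[-1] * 16 + SUBSEQUENT.index(letter)
        else:
            raise ValueError("not a valid encoded label: " + encoded)
    return "".join(chr(cp) for cp in code_points)


def resolver_name(name: str, zld: str = "i") -> str:
    """Return the name to hand to an unchanged DNS resolver."""
    labels = name.split(".")
    if all(LDH_LABEL.fullmatch(label) for label in labels):
        return name  # conforms to existing syntax, use as is
    # Otherwise every label is encoded, including plain ones such as "fr".
    return ".".join(encode_label(label) for label in labels) + "." + zld


if __name__ == "__main__":
    print(encode_label("fr"))               # M6N2
    print(resolver_name("caf\u00e9.fr"))    # M3M1M6U9.M6N2.i
```

With this sketch, a hypothetical accented name "caf\u00e9" under .fr
becomes M3M1M6U9.M6N2.i, which shows why .fr has to be reachable as
M6N2.i for such a name to resolve. The example of Section 3.3 is
reproduced the same way, and every encoded label still has to fit the
63-position limit of Section 3.4.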

4.4 Usage in URLs

According to current definitions, URLs encode sequences of octets into
a sequence of characters from a character set that is almost as
limited as the character set of domain names [RFC1738]. This is
clearly not satisfactory for i18n.

Internationalizing URLs, i.e. assigning character semantics to the
encoded octets, can either be done separately for each part and/or
scheme, or in a uniform way. Doing it separately has the serious
disadvantage that software providing user interfaces for URLs in
general would have to know about all the different i18n solutions of
the different parts and schemes. Many of these solutions may not even
be known yet.

It is therefore definitely more advantageous to decide on a single and
consistent solution for URL internationalization. The most valuable
candidate [Yer96], for many reasons, is UTF-8 [RFC2044], an
ASCII-compatible encoding of UCS4.

Therefore, a URL containing the domain name of the example of Section
3.3 should not be written as:

   ftp://M0C5L831.N406.M771L927.LB66.M5E5M72C.i

(although this will also work) but rather

   ftp://%e6%83%85%e5%a0%b1.%e7%90%86.%e6%9d%b1%e5%a4%a7.%e5%ad%a6.%e6%97%a5%e6%9c%ac

In this canonical form, the trailing .i is absent, and the octets can
be reconstructed from the %HH-encoding and interpreted as UTF-8 by
generic URL software. The software part dealing with domain names will
carry out the conversion to the .i form.

5. Alternate Proposals

5.1 The Dillon Proposal

The proposal of Michael Dillon [Dillon96] is also based on encoding
Unicode into the limited character set of domain names. Encoded names
are distinguished for each part, using the hyphen in initial position.
Because this does not fully conform to the syntax of existing domain
names, it is questionable whether it is backwards-compatible.
On the other hand, this has the advantage that local i18n domain names
can be installed easily without cooperation by the manager of the
superdomain.

A variable-length scheme with base 36 is used that can encode up to
1610 characters, which is entirely insufficient for Chinese or
Japanese. Characters assumed not to be used in i18n domain names are
excluded, i.e. only one case is allowed for basic Latin characters.
This means that large tables have to be worked out carefully to
convert between ISO 10646/Unicode and the actual number that is
encoded with base 36.

5.2 Using a Separate Lookup Service

Instead of using a special encoding and burdening DNS with i18n, one
could build and use a separate lookup service for i18n domain names.
Instead of converting to UCS4 and encoding according to Section 3.2,
and then calling the DNS resolver, a program would contact this new
service when seeing a domain name with characters outside the allowed
range.

Such solutions have various problems. There are many directory
services and proposals for how to use them in a way similar to DNS.
For an overview and a specific proposal, see [Kle96]. However, while
there are many proposals, a real service containing the necessary data
and providing the wide installed base and distributed updating as in
DNS does not exist.

Most directory service proposals also do not offer uniqueness.
Defining unique names again for a separate service will duplicate much
of the work done for DNS. If uniqueness is not guaranteed, the user is
burdened with additional selection steps.

Using a separate lookup service for the internationalization of domain
names also results in more complex implementations than the proposal
made in this draft.
Contrary to what some people might 576 expect, the use of a separate lookup service also does not solve a 577 capacity problem with DNS, because there is no such problem, nor will 578 one be created with the introduction of i18n domain names. 580 6. Generic Considerations 582 6.1 Security Considerations 584 This proposal is believed not to raise any other security considera- 585 tions than the current use of the domain name system. 587 6.2 Internationalization Considerations 589 This proposal addresses internationalization as such. The main addi- 590 tional consideration with respect to internationalization may be the 591 indication of language. However, for concise identifiers such as 592 domain names, language tagging would be too much of a burden and 593 would create complex dependencies with semantics. 595 NOTE -- This section is introduced based on a recommenda- 596 tion in [RFCIAB]. A similar section addressing internation- 597 alization should be included in all application level 598 internet drafts and RFCs. 600 Acknowledgements 602 I am grateful in particular to the following persons for their advice 603 or criticism: Bert Bos, Lori Brownell, Michael Dillon, Donald E. 604 Eastlake 3rd, David Goldsmith, Larry Masinter, Ryan Moats, Keith 605 Moore, Thorvardur Kari Olafson, Erik van der Poel, Jurgen Schwertl, 606 Paul A. Vixie, Francois Yergeau, 608 Bibliography 610 [ASCII] Coded Character Set -- 7-Bit American Standard Code 611 for Information Interchange, ANSI X3.4-1986. 613 [Dillon96] M. Dillon, "Multilingual Domain Names", Memra Software 614 Inc., November 1996 (circulated Dec. 6, 1996 on iahc- 615 discuss@iahc.org). 617 [HTML-I18N] F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter- 618 nationalization of the Hypertext Markup Language", 619 Work in progress (draft-ietf-html-i18n-05.txt), August 620 1996. 622 [iNORM] M. Duerst, "Normalization of Internationalized Identi- 623 fiers", draft-duerst-i18n-norm-00.txt, July 1997. 
625  [ISO3166]    ISO 3166, "Code for the representation of names of
626               countries", ISO 3166:1993.

628  [ISO10646]   ISO/IEC 10646-1:1993. International standard --
629               Information technology -- Universal multiple-octet coded
630               character Set (UCS) -- Part 1: Architecture and basic
631               multilingual plane.

633  [Kle96]      J. Klensin and T. Wolf, Jr., "Domain Names and Company
634               Name Retrieval", Work in progress
635               (draft-klensin-tld-whois-01.txt), November 1996.

637  [RFC1034]    P. Mockapetris, "Domain Names - Concepts and
638               Facilities", ISI, Nov. 1987.

640  [RFC1035]    P. Mockapetris, "Domain Names - Implementation and
641               Specification", ISI, Nov. 1987.

643  [RFC1522]    K. Moore, "MIME (Multipurpose Internet Mail
644               Extensions) Part Two: Message Header Extensions for
645               Non-ASCII Text", University of Tennessee, September 1993.

647  [RFC1642]    D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
648               Transformation Format of Unicode", Taligent Inc., July 1994.

650  [RFC1730]    C. Malamud and M. Rose, "Principles of Operation for
651               the TPC.INT Subdomain: General Principles and Policy",
652               Internet Multicasting Service, October 1993.

654  [RFC1738]    T. Berners-Lee, L. Masinter, and M. McCahill,
655               "Uniform Resource Locators (URL)", CERN, Dec. 1994.

657  [RFC2044]    F. Yergeau, "UTF-8, A Transformation Format of Unicode
658               and ISO 10646", Alis Technologies, October 1996.

660  [RFCIAB]     C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
661               Atkinson, M. Crispin, P. Svanberg, "Report from the
662               IAB Character Set Workshop", October 1996 (currently
663               available as draft-weider-iab-char-wrkshop-00.txt).

665  [Unicode]    The Unicode Consortium, "The Unicode Standard, Version
666               2.0", Addison-Wesley, Reading, MA, 1996.

668  [Yer96]      F. Yergeau, "Internationalization of URLs", Alis
669               Technologies,
670               .

672  Author's Address

674  Martin J. Duerst
675  Multimedia-Laboratory
676  Department of Computer Science
677  University of Zurich
678  Winterthurerstrasse 190
679  CH-8057 Zurich
680  Switzerland

682  Tel: +41 1 257 43 16
683  Fax: +41 1 363 00 35
684  E-mail: mduerst@ifi.unizh.ch

686     NOTE -- Please write the author's name with u-Umlaut wherever
687     possible, e.g. in HTML as Dürst.
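
   NOTE -- The following is a minimal sketch, not part of the proposal
   itself, of the %HH form described in Section 4.4. It assumes
   Python 3. The function names to_url_host and from_url_host are
   hypothetical, and the example name is given as the UCS characters
   obtained by interpreting the octets of the Section 4.4 URL as UTF-8.
   The conversion to and from the .i form of Section 3.2 is not shown.

```python
from urllib.parse import unquote


def to_url_host(name: str) -> str:
    """Percent-encode the UTF-8 octets of each non-ASCII character.

    ASCII characters, including the dots between labels, are left as-is.
    Lowercase hex digits are used to match the example in Section 4.4.
    (Hypothetical helper; it does not validate domain name syntax.)
    """
    out = []
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02x}" for octet in ch.encode("utf-8"))
    return "".join(out)


def from_url_host(host: str) -> str:
    """Reconstruct the octets from %HH and interpret them as UTF-8."""
    return unquote(host, encoding="utf-8", errors="strict")


# The domain name of the Section 4.4 example, written with \u escapes
# so that the sketch itself stays within ASCII.
EXAMPLE = "\u60c5\u5831.\u7406.\u6771\u5927.\u5b66.\u65e5\u672c"

if __name__ == "__main__":
    host = to_url_host(EXAMPLE)
    print("ftp://" + host)
    # Decoding the %HH form yields the original name again.
    assert from_url_host(host) == EXAMPLE
```

   For the example name, the printed URL is the second URL of
   Section 4.4, written there across two lines. Generic URL software
   needs only the decoding step; the part dealing with domain names
   would then convert the result to the .i form.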