idnits 2.17.1 draft-duerst-dns-i18n-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-16) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. 
** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 440 has weird spacing: '...ork for hostn...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 1998) is 9407 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? 'RFC1730' on line 661 looks like a reference -- Missing reference section? 'Kle96' on line 644 looks like a reference -- Missing reference section? 'ISO3166' on line 636 looks like a reference -- Missing reference section? 'ASCII' on line 621 looks like a reference -- Missing reference section? 'ISO10646' on line 639 looks like a reference -- Missing reference section? 
'RFC1530' on line 153 looks like a reference -- Missing reference section? 'RFC1522' on line 654 looks like a reference -- Missing reference section? 'Unicode' on line 676 looks like a reference -- Missing reference section? 'RFCIAB' on line 671 looks like a reference -- Missing reference section? 'RFC2044' on line 668 looks like a reference -- Missing reference section? 'RFC1642' on line 658 looks like a reference -- Missing reference section? 'HTML-I18N' on line 628 looks like a reference -- Missing reference section? 'Yer96' on line 679 looks like a reference -- Missing reference section? 'RFC1738' on line 665 looks like a reference -- Missing reference section? 'Dillon96' on line 624 looks like a reference -- Missing reference section? 'RFC1034' on line 648 looks like a reference -- Missing reference section? 'RFC1035' on line 651 looks like a reference Summary: 7 errors (**), 0 flaws (~~), 3 warnings (==), 19 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft M. Duerst 3 Keio University 4 Expires in six months July 1998 6 Internationalization of Domain Names 8 Status of this Memo 10 This document is an Internet-Draft. Internet-Drafts are working doc- 11 uments of the Internet Engineering Task Force (IETF), its areas, and 12 its working groups. Note that other groups may also distribute work- 13 ing documents as Internet-Drafts. 15 Internet-Drafts are draft documents valid for a maximum of six 16 months. Internet-Drafts may be updated, replaced, or obsoleted by 17 other documents at any time. It is not appropriate to use Internet- 18 Drafts as reference material or to cite them other than as a "working 19 draft" or "work in progress". 
21 To learn the current status of any Internet-Draft, please check the 22 1id-abstracts.txt listing contained in the Internet-Drafts Shadow 23 Directories on ftp.ietf.org (US East Coast), nic.nordu.net 24 (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific 25 Rim). 27 Distribution of this document is unlimited. Please send comments to 28 the author at . 30 Abstract 32 Internet domain names are currently limited to a very restricted 33 character set. This document proposes the introduction of a new 34 "zero-level" domain (ZLD) to allow the use of arbitrary characters 35 from the Universal Character Set (ISO 10646/Unicode) in domain names. 36 The proposal is fully backwards compatible and does not need any 37 changes to DNS. Version 02 is reissued without changes just to 38 keep this draft available. 40 Table of contents 42 0. Change History ................................................. 2 43 0.8 Changes Made from Version 01 to Version 02 .................. 2 44 0.9 Changes Made from Version 00 to Version 01 .................. 2 45 1. Introduction ................................................... 3 46 1.1 Motivation .................................................. 3 47 1.2 Notational Conventions ...................................... 4 48 2. The Hidden Zero Level Domain ................................... 4 49 3. Encoding International Characters .............................. 5 50 3.1 Encoding Requirements ....................................... 5 51 3.2 Encoding Definition ......................................... 5 52 3.3 Encoding Example ............................................ 7 53 3.4 Length Considerations ....................................... 8 54 4. Usage Considerations ........................................... 8 55 4.1 General Usage ............................................... 8 56 4.2 Usage Restrictions .......................................... 9 57 4.3 Domain Name Creation ....................................... 
10 58 4.4 Usage in URLs .............................................. 12 59 5. Alternate Proposals ........................................... 13 60 5.1 The Dillon Proposal ........................................ 13 61 5.2 Using a Separate Lookup Service ............................ 13 62 6. Generic Considerations ........................................ 14 63 6.1 Security Considerations .................................... 14 64 6.2 Internationalization Considerations ........................ 14 65 Acknowledgements ................................................. 14 66 Bibliography ..................................................... 15 67 Author's Address ................................................. 68 16 70 0. Change History 72 0.8 Changes Made from Version 01 to Version 02 74 No significant changes; reissued to make it available officially. 75 Changed author's address. 77 Changes deferred to future versions (if ever): 78 - Decide on ZLD name (.i or .i18n.int or something else) 79 - Decide on casing solution 80 - Decide on exact syntax 81 - Proposals for experimental setup 83 0.9 Changes Made from Version 00 to Version 01 84 - Minor rewrites and clarifications 86 - Added the following references: [RFC1730], [Kle96], [ISO3166], 87 [iNORM] 89 - Slightly expanded discussion about casing 91 - Added some variant proposals for syntax 93 - Added some explanations about different kinds of name parallelism 95 - Added some explanation about independent addition of internation- 96 alized names in subdomains without bothering higher-level domains 98 - Added some explanations about tools needed for support, and the 99 MX/CNAME problem 101 - Change to RFC1123 (numbers allowed at beginning of labels) 103 1. Introduction 105 1.1 Motivation 107 The lower layers of the Internet do not discriminate against any language or 108 script. 
On the application level, however, the historical dominance 109 of the US and the ASCII character set [ASCII] as a lowest common 110 denominator have led to limitations. The process of removing these 111 limitations is called internationalization (abbreviated i18n). One 112 example of the above-mentioned limitations is domain names [RFC1034, 113 RFC1035], where only the letters of the basic Latin alphabet (case- 114 insensitive), the decimal digits, and the hyphen are allowed. 116 While such restrictions are convenient if a domain name is intended 117 to be used by arbitrary people around the globe, there may be very 118 good reasons for using aliases that are easier to remember or type 119 in a local context. This is similar to traditional mail addresses, 120 where both local scripts and conventions and the Latin script can be 121 used. 123 There are many good reasons for domain name i18n, and some arguments 124 that are brought forward against such an extension. This document, 125 however, does not discuss the pros and cons of domain name i18n. It 126 proposes and discusses a solution and therefore eliminates one of the 127 most often heard arguments against, namely "it cannot be done". 129 The solution proposed in this document consists of the introduction 130 of a new "zero-level" domain building the root of a new domain 131 branch, and an encoding of the Universal Character Set (UCS) 132 [ISO10646] into the limited character set of domain names. 134 1.2 Notational Conventions 136 In the domain name examples in this document, characters of the basic 137 Latin alphabet (expressible in ASCII) are denoted with lower case 138 letters. Upper case letters are used to represent characters outside 139 ASCII, such as accented characters of the Latin alphabet, characters 140 of other alphabets and syllabaries, ideographic characters, and vari- 141 ous signs. 143 2. 
The Hidden Zero Level Domain 145 The domain name system uses the domain "in-addr.arpa" to convert 146 internet addresses back to domain names. One way to view this is to 147 say that in-addr.arpa forms the root of a separate hierarchy. This 148 hierarchy has been made part of the main domain name hierarchy just 149 for implementation convenience. While syntactically, in-addr.arpa is 150 a second level domain (SLD), functionally it is a zero level domain 151 (ZLD) in the same way as "." is a ZLD. A similar example of a ZLD is 152 the domain tpc.int, which provides a hierarchy of the global phone 153 numbering system [RFC1530] for services such as paging and printing 154 to fax machines. 156 For domain name i18n to work inside the tight restrictions of domain 157 name syntax, one has to define an encoding that maps strings of UCS 158 characters to strings of characters allowable in domain names, and a 159 means to distinguish domain names that are the result of such an 160 encoding from ordinary domain names. 162 This document proposes to create a new ZLD to distinguish encoded 163 i18n domain names from traditional domain names. This domain would 164 be hidden from the user in the same way as a user does not see in- 165 addr.arpa. This domain could be called "i18n.arpa" (although the use 166 of arpa in this context is definitely not appropriate), simply 167 "i18n", or even just "i". Below, we are using "i" for shortness, 168 while we leave the decision on the actual name to further 169 discussion. 171 Internet Draft Internationalization of Domain Names July 172 1997 174 3. Encoding International Characters 176 3.1 Encoding Requirements 178 Until quite recently, the thought of going beyond ASCII for something 179 such as domain names failed because of the lack of a single encom- 180 passing character set for the scripts and languages of the world. 181 Tagging techniques such as those used in MIME headers [RFC1522] would 182 be much too clumsy for domain names. 
184 The definition of ISO 10646 [ISO10646], codepoint by codepoint iden- 185 tical with Unicode [Unicode], provides a single Universal Character 186 Set (UCS). A recent report [RFCIAB] clearly recommends to base the 187 i18n of the Internet on these standards. 189 An encoding for i18n domain names therefore has to take the charac- 190 ters of ISO 10646/Unicode as a starting point. The full four-byte 191 (31 bit) form of UCS, called UCS4, should be used. A limitation to 192 the two-byte form (UCS2), which allows only for the encoding of the 193 Base Multilingual Plane, is too restricting. 195 For the mapping between UCS4 and the strongly limited character set 196 of domain names, the following constraints have to be considered: 198 - The structure of domain names, and therefore the "dot", have to be 199 conserved. Encoding is done for individual labels. 201 - Individual labels in domain names allow the basic Latin alphabet 202 (monocase, 26 letters), decimal digits, and the "-" inside the 203 label. The capacity per octet is therefore limited to somewhat 204 above 5 bits. 206 - There is no need nor possibility to preserve any characters. 208 - Frequent characters (i.e. ASCII, alphabetic, UCS2, in that order) 209 should be encoded relatively compactly. A variable-length encoding 210 (similar to UTF-8) seems desirable. 212 3.2 Encoding Definition 214 Several encodings for UCS, so called UCS Transform Formats, exist 215 already, namely UTF-8 [RFC2044], UTF-7 [RFC1642], and UTF-16 [Uni- 216 code]. Unfortunately, none of them is suitable for our purposes. We 217 therefore use the following encoding: 219 - To accommodate the slanted probability distribution of characters 220 in UCS4, a variable-length encoding is used. 222 - Each target letter encodes 5 bits of information. Four bits of 223 information encode character data, the fifth bit is used to indi- 224 cate continuation of the variable-length encoding. 
226 - Continuation is indicated by distinguishing the initial letter 227 from the subsequent letter. 229 - Leading four-bit groups of binary value 0000 of UCS4 characters 230 are discarded, except for the last TWO groups (i.e. the last 231 octet). This means that ASCII and Latin-1 characters need two 232 target letters, the main alphabets up to and including Tibetan 233 need three target letters, the rest of the characters in the BMP 234 need four target letters, all except the last (private) plane in 235 the UTF-16/Surrogates area [Unicode] need five target letters, and 236 so on. 238 - The letters representing the various bit groups in the various 239 positions are chosen according to the following table: 241 Nibble Value Initial Subsequent 242 Hex Binary 243 0 0000 G 0 244 1 0001 H 1 245 2 0010 I 2 246 3 0011 J 3 247 4 0100 K 4 248 5 0101 L 5 249 6 0110 M 6 250 7 0111 N 7 251 8 1000 O 8 252 9 1001 P 9 253 A 1010 Q A 254 B 1011 R B 255 C 1100 S C 256 D 1101 T D 257 E 1110 U E 258 F 1111 V F 260 [Should we try to eliminate "I" and "O" from initial? "I" might be 261 eliminated because then an algorithm can more easily detect ".i". "O" 262 could lead to some confusion with "0". What other protocols are 263 there that might be able to use a similar solution, but that might 264 have other restrictions for the initial letters? Proposal to run ini- 265 tial range from H to X. Extracting the initial bits then becomes ^ 266 'H'. Proposal to have a special convention for all-ASCII labels 267 (start label with one of the letters not used above).] 269 Please note that this solution has the following interesting proper- 270 ties: 272 - For subsequent positions, there is an equivalence between the hex- 273 adecimal value of the character code and the target letter used. 274 This assures easy conversion and checking. 
276 - The absence of digits from the "initial" column, and the fact that 277 the hyphen is not used, assures that the resulting string conforms 278 to domain name syntax. 280 - Raw sorting of encoded and unencoded domain names is equivalent. 282 - The boundaries of characters can always be detected easily. 283 (While this is important for representations that are used inter- 284 nally for text editing, it is actually not very important here, 285 because tools for editing can be assumed to use a more straight- 286 forward representation internally.) 288 - Unless control characters are allowed, the target string will 289 never actually contain a G. 291 3.3 Encoding Example 293 As an example, the current domain 295 is.s.u-tokyo.ac.jp 297 with the components standing for information science, science, the 298 University of Tokyo, academic, and Japan, might in future be repre- 299 sented by 301 JOUHOU.RI.TOUDAI.GAKU.NIHON 303 (a transliteration of the kanji that might probably be chosen to rep- 304 resent the same domain). Writing each character in U+HHHH notation as 305 in [Unicode], this results in the following (given for reference 306 only, not the actual encoding or something being typed in by the 307 user): 309 U+60c5U+5831.U+7406.U+6771U+5927.U+5b66.U+65e5U+672c 311 The software handling internationalized domain names will translate 312 this, according to the above specifications, before submitting it to 313 the DNS resolver, to: 315 M0C5L831.N406.M771L927.LB66.M5E5M72C.i 317 3.4 Length Considerations 319 DNS allows for a maximum of 63 positions in each part, and for 255 320 positions for the overall domain name including dots. This allows up 321 to 15 ideographs, or up to 21 letters e.g. from the Hebrew or Arabic 322 alphabet, in a label. While this does not allow for the same margin 323 as in the case of ASCII domain names, it should still be quite suffi- 324 cient. 
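As an illustration of Sections 3.2 to 3.4, the following Python sketch applies the nibble table to the example above and checks the per-label character counts just given. It is not part of the proposal: the function names are hypothetical, "i" is only the placeholder ZLD name used in this document, and upper-case output is assumed because casing and exact syntax are still open.

```python
# Illustrative sketch only; all function names are hypothetical and the
# draft itself leaves the ZLD name, casing and exact syntax undecided.

INITIAL = "GHIJKLMNOPQRSTUV"      # "Initial" column of the table (nibble 0..F)
SUBSEQUENT = "0123456789ABCDEF"   # "Subsequent" column: plain hex digits
ZLD = "i"                         # placeholder zero-level domain name


def encode_char(code_point: int) -> str:
    """Encode one UCS code point as an initial letter plus hex digits."""
    # Leading zero nibbles are dropped, but the last two nibbles
    # (the last octet) are always kept.
    nibbles = format(code_point, "02X")
    return INITIAL[int(nibbles[0], 16)] + nibbles[1:]


def encode_label(label: str) -> str:
    """Encode a single label (the text between two dots)."""
    return "".join(encode_char(ord(ch)) for ch in label)


def encode_name(name: str) -> str:
    """Encode each label separately and append the ZLD."""
    labels = [encode_label(label) for label in name.split(".")]
    return ".".join(labels + [ZLD])


def decode_label(encoded: str) -> str:
    """Inverse of encode_label; an initial letter starts a new character."""
    code_points = []
    for ch in encoded.upper():
        if ch in INITIAL:
            code_points.append(INITIAL.index(ch))
        elif ch in SUBSEQUENT and code_points:
            code_points[-1] = code_points[-1] * 16 + int(ch, 16)
        else:
            raise ValueError(f"unexpected character {ch!r} in {encoded!r}")
    return "".join(chr(cp) for cp in code_points)


def max_chars_per_label(code_point: int, limit: int = 63) -> int:
    """How many copies of this character fit into one 63-position label."""
    return limit // len(encode_char(code_point))


example = "\u60c5\u5831.\u7406.\u6771\u5927.\u5b66.\u65e5\u672c"
print(encode_name(example))        # M0C5L831.N406.M771L927.LB66.M5E5M72C.i
print(max_chars_per_label(0x60C5)) # 15 (ideograph, four target letters)
print(max_chars_per_label(0x05D0)) # 21 (Hebrew letter, three target letters)
```

Note that Python's chr() only covers code points up to U+10FFFF, so this sketch does not decode the full 31-bit UCS4 range that the encoding itself permits.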
[Problems could only surface for languages that use very long 325 words or terms and don't know any kind of abbreviations or similar 326 shortening devices. Do these exist? An Icelandic expert asserted that 327 Icelandic is not a problem.] DNS contains a compression scheme that 328 avoids sending the same trailing portion of a domain name twice in 329 the same transmission. Long domain names are therefore not that much 330 of a concern. 332 4. Usage Considerations 334 4.1 General Usage 336 To implement this proposal, neither DNS servers nor resolvers need 337 changes. These programs will only deal with the encoded form of the 338 domain name with the .i suffix. Software that wants to offer an 339 internationalized user interface (for example a web browser) is 340 responsible for the necessary conversions. It will analyze the domain 341 name, call the resolver directly if the domain name conforms to the 342 domain name syntax restrictions, and otherwise encode the name 343 according to the specifications of Section 3.2 and append the .i suf- 344 fix before calling the resolver. New implementations of resolvers 345 will of course offer a companion function to gethostbyname accepting 346 an ISO 10646/Unicode string as input. 348 For domain name administrators, the main tool that will be needed is 349 a program to compile files configuring zones from a UTF-8 notation 350 (or any other suitable encoding) to the encoding described in Section 351 3.2. Utility tools will include a corresponding decompiler, checkers 352 for various kinds of internationalization-related errors, and tools 353 for managing syntactic parallelism (see Section 4.3). 355 4.2 Usage Restrictions 357 While this proposal in theory allows control characters such 358 as BEL or NUL, or symbols such as arrows and smileys, in domain names, 359 such characters should clearly be excluded from domain names. 
Whether 360 this has to be explicitly specified or whether the difficulty of typing 361 these characters on any keyboard of the world will limit their use 362 has to be discussed. One approach is to start with a very restricted 363 subset and gradually relax it; the other is to allow almost anything 364 and to rely on common sense. Anyway, such specifications should go 365 into a separate document to allow easy updates. 367 A related point is the question of equivalence. For historical rea- 368 sons, ISO 10646/Unicode contain a considerable number of compatibility 369 characters and allow more than one representation for characters with 370 diacritics. To guarantee smooth interoperability in these and related 371 cases, additional restrictions or the definition of some form of nor- 372 malization seem necessary. However, this is a general problem 373 affecting all areas where ISO 10646/Unicode is used in identifiers, 374 and should therefore be addressed in a generic way. See [iNORM] for 375 an initial proposal. 377 Equally related is the problem of case equivalence. Users can very 378 well distinguish between upper case and lower case. Also, casing in 379 an i18n context is not as straightforward as for ASCII, so that case 380 equivalence is best avoided. Problems therefore result not from the 381 fact that case is distinguished for i18n domain names, but from the 382 fact that existing domain names do not distinguish case. Where it is 383 impossible to distinguish between next.com and NeXT.com, the same two 384 subdomains would easily be distinguishable if subordinate to an i18n 385 domain. There are several possible solutions. One is to try to grad- 386 ually migrate from a case-insensitive solution to a case-sensitive 387 solution even for ASCII. Another is to allow case-sensitivity only 388 beyond ASCII. 
Another is to restrict anything beyond ASCII to lower- 389 case only (lowercase distinguishes better than uppercase, and is also 390 generally used for ASCII domain names). 392 A problem that also has to be discussed and solved is bidirectional- 393 ity. Arabic and Hebrew characters are written right-to-left, and the 394 mixture with other characters results in a divergence between logical 395 and graphical sequence. See [HTML-I18N] for more explanations. The 396 proposal of [Yer96] for dealing with bidirectionality in URLs could 397 probably be applied to domain names. Anyway, there should be a gen- 398 eral solution for identifiers, not a DNS-specific solution. 400 4.3 Domain Name Creation 402 The ".i" ZLD should be created as such to allow the internationaliza- 403 tion of domain names. Rules for creating subdomains inside ".i" 404 should follow the established rules for the creation of functionally 405 equivalent domains in the existing domain hierarchy, and should 406 evolve in parallel. 408 For the actual domain hierarchy, the amount of parallelism between 409 the current ASCII-oriented hierarchy and some internationalized hier- 410 archy depends on various factors. In some cases, two fully parallel 411 hierarchies may emerge. In other cases, if more than one script or 412 language is used locally, more than two parallel hierarchies may 413 emerge. Some nodes, e.g. in intranets, may only appear in an i18n 414 hierarchy, whereas others may only appear in the current hierarchy. 415 In some cases, the peculiarities of scripts, languages, cultures, and 416 the local marketplace may lead to completely different hierarchies. 418 Also, one has to be aware that there may be several kinds of paral- 419 lelisms. The first one is called syntactic parallelism. 
If there is 420 a domain XXXX.yy.zz and a domain vvvv.yy.zz, then the domain yy.zz 421 will have to exist both in the traditional DNS hierarchy as well as 422 within the hierarchy starting at the .i ZLD, with appropriate encod- 423 ing. 425 The second type of parallelism is called transcription parallelism. 426 It results by transcribing or transliterating relations between ASCII 427 domain names and domain names in other scripts. 429 The third type of parallelism is called semantic parallelism. It 430 results from translating elements of a domain name from one language 431 to another, possibly also changing the script or set of used charac- 432 ters. 434 On the host level, parallelism means that there are two names for the 435 same host. Conventions should exist to decide whether the parallel 436 names should have separate IP addresses or not (A record or CNAME 437 record). With separate IP addresses, address to name lookup is easy, 438 otherwise it needs special precautions to be able to find all names 439 corresponding to a given host address. Another detail entering this 440 consideration is that MX records only work for hostnames/domains, 441 not for CNAME aliases. This at least has the consequence that alias 442 resolution for internationalized mail addresses has to occur before 443 MX record lookup. 445 When discussing and applying the rules for creating domain names, 446 some peculiarities of i18n domain names should be carefully consid- 447 ered: 449 - Depending on the script, reasonable lengths for domain name parts 450 may differ greatly. For ideographic scripts, a part may often be 451 only a one-letter code. Established rules for lengths may need 452 adaptation. For example, a rule for country TLDs could read: one 453 ideographic character or two other characters. 455 - If the number of generic TLDs (.com, .edu, .org, .net) is kept 456 low, then it may be feasible to restrict i18n TLDs to country 457 TLDs. 
459 - There are no ISO 3166 [ISO3166] two-letter codes in scripts other 460 than Latin. I18n domain names for countries will have to be 461 designed from scratch. 463 - The names of some countries or regions may pose greater political 464 problems when expressed in the native script than when expressed 465 in 2-letter ISO 3166 codes. 467 - I18n country domain names should in principle only be created in 468 those scripts that are used locally. There is probably little use 469 in creating an Arabic domain name for China, for example. 471 - In those cases where domain names are open to a wide range of 472 applicants, a special procedure for accepting applications should 473 be used so that a reasonable-quality fit between ASCII domain 474 names and i18n domain names results where desired. This would 475 probably be done by establishing a period of about a month for 476 applications inside an i18n domain newly created as a parallel for 477 an existing domain, and resolving the detected conflicts. For 478 syntactically parallel domain names, the owners should always be 479 the same. Administration may be split in some cases to account for 480 the necessary linguistic knowledge. For domain names with tran- 481 scription parallelism and semantic parallelism, the question of 482 owner identity should depend on the real-life situation (trade- 483 marks,...). 485 - It will be desirable to have internationalized subdomains in non- 486 internationalized TLDs. As an example, many companies in France 487 may want to register an accented version of their company name, 488 while remaining under the .fr TLD. For this, .fr would have to be 489 reregistered as .M6N2.i. Accented and other internationalized sub- 490 domains would go below .M6N2.i, whereas unaccented ones would go 491 below .fr in its plain form. 
493 - To generalize the above case, one may need to create a requirement 494 that any domain name registry would have to register and manage 495 syntactically parallel domain names below the .i ZLD upon request 496 to allow registration of i18n domain names in arbitrary subdo- 497 mains. An alternative to this is to organize domain name search 498 so that e.g. in a search for XXXXXX.fr, if M6N2.i is not found in 499 .i, the name server for .fr is queried for XXXXXX.M6N2.i (with 500 XXXXXX appropriately encoded). This convention would allow lower- 501 level domains to introduce internationalized subdomains without 502 depending on higher-level domains. 504 4.4 Usage in URLs 506 According to current definitions, URLs encode sequences of octets 507 into a sequence of characters from a character set that is almost as 508 limited as the character set of domain names [RFC1738]. This is 509 clearly not satisfying for i18n. 511 Internationalizing URLs, i.e. assigning character semantics to the 512 encoded octets, can either be done separately for each part and/or 513 scheme, or in a uniform way. Doing it separately has the serious 514 disadvantage that software providing user interfaces for URLs in gen- 515 eral would have to know about all the different i18n solutions of the 516 different parts and schemes. Many of these solutions may not even be 517 known yet. 519 It is therefore definitely more advantageous to decide on a single 520 and consistent solution for URL internationalization. The most valu- 521 able candidate [Yer96], for many reasons, is UTF-8 [RFC2044], an 522 ASCII-compatible encoding of UCS4. 524 Therefore, a URL containing the domain name of the example of Sec- 525 tion 3.3 should not be written as: 527 ftp://M0C5L831.N406.M771L927.LB66.M5E5M72C.i 529 (although this will also work) but rather 531 ftp://%e6%83%85%e5%a0%b1.%e7%90%86.%e6%9d%b1%e5%a4%a7. 
532 %e5%ad%a6.%e6%97%a5%e6%9c%ac 534 In this canonical form, the trailing .i is absent, and the octets can 535 be reconstructed from the %HH-encoding and interpreted as UTF-8 by 536 generic URL software. The software part dealing with domain names 537 will carry out the conversion to the .i form. 539 5. Alternate Proposals 541 5.1 The Dillon Proposal 543 The proposal of Michael Dillon [Dillon96] is also based on encoding 544 Unicode into the limited character set of domain names. Distinction 545 is done for each part, using the hyphen in initial position. Because 546 this does not fully conform to the syntax of existing domain names, 547 it is questionable whether it is backwards-compatible. On the other 548 hand, this has the advantage that local i18n domain names can be 549 installed easily without cooperation by the manager of the superdo- 550 main. 552 A variable-length scheme with base 36 is used that can encode up to 553 1610 characters, absolutely insufficient for Chinese or Japanese. 554 Characters assumed not to be used in i18n domain names are excluded, 555 i.e. only one case is allowed for basic Latin characters. This means 556 that large tables have to be worked out carefully to convert between 557 ISO 10646/Unicode and the actual number that is encoded with base 558 36. 560 5.2 Using a Separate Lookup Service 562 Instead of using a special encoding and burdening DNS with i18n, one 563 could build and use a separate lookup service for i18n domain names. 564 Instead of converting to UCS4 and encoding according to Section 3.2, 565 and then calling the DNS resolver, a program would contact this new 566 service when seeing a domain name with characters outside the allowed 567 range. 569 Such solutions have various problems. There are many directory ser- 570 vices and proposals for how to use them in a way similar to DNS. For 571 an overview and a specific proposal, see [Kle96]. 
However, while 572 there are many proposals, a real service containing the necessary 573 data and providing the wide installed base and distributed updating 574 as in DNS does not exist. 576 Most directory service proposals also do not offer uniqueness. 577 Defining unique names again for a separate service will duplicate 578 much of the work done for DNS. If uniqueness is not guaranteed, the 579 user is burdened with additional selection steps. 581 Using a separate lookup service for the internationalization of 582 domain names also results in more complex implementations than the 583 proposal made in this draft. Contrary to what some people might 584 expect, the use of a separate lookup service also does not solve a 585 capacity problem with DNS, because there is no such problem, nor will 586 one be created with the introduction of i18n domain names. 588 6. Generic Considerations 590 6.1 Security Considerations 592 This proposal is not believed to raise any security considera- 593 tions beyond those of the current use of the domain name system. 595 6.2 Internationalization Considerations 597 This proposal addresses internationalization as such. The main addi- 598 tional consideration with respect to internationalization may be the 599 indication of language. However, for concise identifiers such as 600 domain names, language tagging would be too much of a burden and 601 would create complex dependencies with semantics. 603 NOTE -- This section is introduced based on a recommenda- 604 tion in [RFCIAB]. A similar section addressing internation- 605 alization should be included in all application-level 606 internet drafts and RFCs. 608 Acknowledgements 610 I am grateful in particular to the following persons for their advice 611 or criticism: Bert Bos, Lori Brownell, Michael Dillon, Donald E. 612 Eastlake 3rd, David Goldsmith, Larry Masinter, Ryan Moats, Keith 613 Moore, Thorvardur Kari Olafson, Erik van der Poel, Jurgen Schwertl, 614 Paul A.
Vixie, Francois Yergeau, and others. 616 Internet Draft Internationalization of Domain Names July 1997 619 Bibliography 621 [ASCII] Coded Character Set -- 7-Bit American Standard Code 622 for Information Interchange, ANSI X3.4-1986. 624 [Dillon96] M. Dillon, "Multilingual Domain Names", Memra Software 625 Inc., November 1996 (circulated Dec. 6, 1996 on iahc- 626 discuss@iahc.org). 628 [HTML-I18N] F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter- 629 nationalization of the Hypertext Markup Language", 630 Work in progress (draft-ietf-html-i18n-05.txt), August 631 1996. 633 [iNORM] M. Duerst, "Normalization of Internationalized Identi- 634 fiers", draft-duerst-i18n-norm-00.txt, July 1997. 636 [ISO3166] ISO 3166, "Code for the representation of names of 637 countries", ISO 3166:1993. 639 [ISO10646] ISO/IEC 10646-1:1993. International standard -- Infor- 640 mation technology -- Universal multiple-octet coded 641 character Set (UCS) -- Part 1: Architecture and basic 642 multilingual plane. 644 [Kle96] J. Klensin and T. Wolf, Jr., "Domain Names and Company 645 Name Retrieval", Work in progress (draft-klensin-tld- 646 whois-01.txt), November 1996. 648 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facili- 649 ties", ISI, Nov. 1987. 651 [RFC1035] P. Mockapetris, "Domain Names - Implementation and 652 Specification", ISI, Nov. 1987. 654 [RFC1522] K. Moore, "MIME (Multipurpose Internet Mail Exten- 655 sions) Part Two: Message Header Extensions for Non- 656 ASCII Text", University of Tennessee, September 1993. 658 [RFC1642] D. Goldsmith, M. Davis, "UTF-7: A Mail-safe Transfor- 659 mation Format of Unicode", Taligent Inc., July 1994. 661 [RFC1730] C. Malamud and M. Rose, "Principles of Operation for 662 the TPC.INT Subdomain: General Principles and Policy", 663 Internet Multicasting Service, October 1993. 665 [RFC1738] T. Berners-Lee, L. Masinter, and M. McCahill, 666 "Uniform Resource Locators (URL)", CERN, Dec. 1994. 668 [RFC2044] F.
Yergeau, "UTF-8, A Transformation Format of Unicode 669 and ISO 10646", Alis Technologies, October 1996. 671 [RFCIAB] C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. 672 Atkinson, M. Crispin, P. Svanberg, "Report from the 673 IAB Character Set Workshop", October 1996 (currently 674 available as draft-weider-iab-char-wrkshop-00.txt). 676 [Unicode] The Unicode Consortium, "The Unicode Standard, Version 677 2.0", Addison-Wesley, Reading, MA, 1996. 679 [Yer96] F. Yergeau, "Internationalization of URLs", Alis Tech- 680 nologies, 681 = 682 . 684 Author's Address 686 Martin J. Duerst 687 World Wide Web Consortium 688 Keio Research Institute at SFC 689 Keio University 690 5322 Endo 691 Fujisawa 692 252-8520 Japan 694 Tel: +81 466 49 11 70 695 E-mail: mduerst@w3.org 697 NOTE -- Please write the author's name with u-Umlaut wherever 698 possible, e.g. in HTML as Dürst.