Internet Draft                                       Lee Ming Tseng
                                                     Jan Ming Ho
01 Feb 2002                                          Kenny Huang
expires 01 August 2002


      Phased Implementation for Internationalized Domain Names
                          in Applications


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   A copy of this particular draft is also archived at
   http://www.twnic.net.tw

Abstract

   This document proposes a phased implementation for IDNA
   (Internationalized Domain Names in Applications). The DNS
   infrastructure is critical to the operation of the Internet, so the
   implementation of IDNA shall be carefully considered and examined.
   IDN infrastructure shall be deployed step by step to ensure the
   reliability of the new infrastructure. To meet this requirement for
   incremental change, this document divides the implementation of
   IDNA into phases.

1 Terminology

   The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
   and "MAY" in this document are to be interpreted as described in
   RFC 2119 [7].

   A "code point" is an integer value associated with a character in a
   coded character set.

   "TC" is an abbreviation for Traditional Chinese.

   "SC" is an abbreviation for Simplified Chinese.

   "CDN" is an acronym for Chinese Domain Name. It denotes an
   internationalized domain name that contains at least one Chinese
   character.
   For the scope of Chinese characters, refer to ISO/IEC
   10646-1:2000(E) (second edition, 2000-09-15) [8]: if a character is
   marked "C and G-Hanzi-T", it MUST be considered a Chinese
   character. This definition does not imply that the character is not
   also a character of other countries that use Han ideographs.

2 Proposed Phased implementation of IDNA

   The IDN Working Group has decided to use Unicode as the basis for
   IDN services. This document proposes a two-phase implementation of
   IDNA, consisting of a Bootstrapping Phase and a Mature Phase, as
   described below.

2.1 Bootstrapping Phase

   In the bootstrapping phase, the code points listed in Appendix B
   shall be prohibited until the user community requests an update.
   Section 3 describes how Appendix B was formed.

                     +------+
                     | User |
                     +------+
                        ^
                        | Input and display: local interface methods
                        | (pen, keyboard, glowing phosphorus, ...)
    +-------------------|-------------------------------+
    |                   v                               |
    |          +-----------------------------+          |
    |          | NamePrep                    |          |
    |          |   1 Mapping                 |          |
    |          |   2 Normalization           |          |
    |          |   3 Prohibited Output       |          |
    |          +-----------------------------+          |
    |                   ^                               |
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          | Extended Prohibited Output  |          |
    |          +-----------------------------+          |
    |                   ^                               |
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          | Punycode[5]                 |          |
    |          +-----------------------------+          |
    |                   ^        ^                      | End system
    |                   |        |                      |
    | Call to resolver: |        | Application-specific |
    |              ACE  |        | protocol:            |
    |                   v        | predefined by the    |
    |           +----------+     | protocol or defaults |
    |           | Resolver |     | to ACE               |
    |           +----------+     |                      |
    |                 ^          |                      |
    +-----------------|----------|----------------------+
        DNS protocol: |          |
                  ACE |          |
                      v          v
           +-------------+    +---------------------+
           | DNS servers |    | Application servers |
           +-------------+    +---------------------+

   Table 1. IDNA architecture [4] with extended prohibited output
            module.

2.2 Mature Phase

   The phased implementation of IDNA shall remain flexible enough to
   allow future revision.
   Unknown code points are sent to the extended prohibited output
   module. Valid code points, on the other hand, will never be
   prohibited. A future version of IDNA simply removes the prohibition
   on the code points listed in Appendix B, which results in the same
   IDNA that is currently under consideration.

                     +------+
                     | User |
                     +------+
                        ^
                        | Input and display: local interface methods
                        | (pen, keyboard, glowing phosphorus, ...)
    +-------------------|-------------------------------+
    |                   v                               |
    |          +-----------------------------+          |
    |          | NamePrep                    |          |
    |          |   1 Mapping                 |          |
    |          |   2 Normalization           |          |
    |          |   3 Prohibited Output       |          |
    |          +-----------------------------+          |
    |                   ^                               |
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          | Punycode[5]                 |          |
    |          +-----------------------------+          |
    |                   ^        ^                      | End system
    |                   |        |                      |
    | Call to resolver: |        | Application-specific |
    |              ACE  |        | protocol:            |
    |                   v        | predefined by the    |
    |           +----------+     | protocol or defaults |
    |           | Resolver |     | to ACE               |
    |           +----------+     |                      |
    |                 ^          |                      |
    +-----------------|----------|----------------------+
        DNS protocol: |          |
                  ACE |          |
                      v          v
           +-------------+    +---------------------+
           | DNS servers |    | Application servers |
           +-------------+    +---------------------+

   Table 2. IDNA architecture [4].

3 Extended Prohibited Output

   This section specifies how the extended prohibition table
   (Appendix B) is used. The code points listed in Appendix B are
   proposed by the authors. Appendix B covers a subset of the Han code
   points, which may be used in Japan, Korea, Taiwan and China. The
   subsections below explain why the code points in Appendix B were
   selected. Implementations MUST be based on Appendix B, not on the
   descriptions in this section. The lists in Appendix B MUST be used
   by implementations of this specification.

3.1 Equivalent matching

   Some character sets, such as the Han code points, raise the issue
   of equivalent matching. Han characters are used in many countries
   in Asia.
   For a single written language, two Han characters are said to be
   variants of each other if they have the same meaning and the same
   pronunciation. In other words, they are expected to match as
   equivalent characters. However, the variant relation can be either
   context sensitive or context free [1][2]. It is also true that a
   variant relation that holds in one country may not exist in other
   countries. Since the Han ideographs form an open set, the set is
   still growing even today. What makes the matter even more
   complicated is that the number of Han character variants differs
   across versions of Unicode. The number of unified Han characters is
   21,204 in Unicode 2.0, 27,786 in Unicode 3.0, and 70,207 in
   Unicode 3.1 [6]. The larger the Unicode repertoire, the larger the
   set of associated variants.

   We note that some dictionaries of variants exist. However, at the
   time this document was prepared, no organization had undertaken an
   international effort to standardize variants on the basis of
   Unicode.

   We also recognize that one does not have to consider the existence
   of variants if names are nothing but identifiers. However, if a
   name itself is a product with commercial value, as is the case in
   domain name services, then the ambiguity that variants introduce
   into the delegation and resolution processes must be minimized. A
   domain name service that is unable to minimize these ambiguities
   will cause serious consumer protection problems.

   One possible solution to the Han variant problem is to standardize,
   for a given subset of Han characters, a variant relation that is
   context free and holds for all nations or regions. The purpose is
   not to provide a complete solution to the Han variant problem,
   given that the set of Chinese characters is open. Instead, the
   purpose is to define a maximal set of equivalent variants such that
   ambiguity in a name service can be minimized at a reasonable cost
   by a low-level mechanism like IDNA.
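
   As a minimal sketch of the character-level matching that such a
   context-free variant relation would permit, the Python fragment
   below folds every character to one representative of its variant
   class before comparing two labels. The two-entry table and the
   names fold_variants and equivalent are hypothetical and chosen only
   for illustration; the table is not taken from any standardized
   variant list.

```python
# Hypothetical variant table (illustration only): each key is folded to
# the representative code point given as its value.
VARIANT_TO_REPRESENTATIVE = {
    0x56FD: 0x570B,  # SC form U+56FD is folded to TC form U+570B
    0x5B66: 0x5B78,  # SC form U+5B66 is folded to TC form U+5B78
}


def fold_variants(label):
    """Replace every character by the representative of its variant class.

    Characters without an entry in the table are left unchanged.
    """
    return "".join(
        chr(VARIANT_TO_REPRESENTATIVE.get(ord(ch), ord(ch))) for ch in label
    )


def equivalent(label_a, label_b):
    """Context-free comparison: two labels match if their folded forms match."""
    return fold_variants(label_a) == fold_variants(label_b)
```

   Because the relation is applied one character at a time, without
   looking at neighbouring characters, it can only express
   context-free variants; context-sensitive variants would need a
   different mechanism.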

   It is easier, and thus recommended by the authors, to define the
   variant relation on a small subset of the Han ideographs, e.g.,
   those in Unicode 2.0. In that case, Han characters beyond this code
   range should be forbidden in a domain name. Note that Han
   characters outside of Unicode 2.0 are not commonly used in daily
   life. It is also possible to work from a more recent version of
   Unicode if that can be justified. Han variants can also be
   standardized by other standardization bodies, e.g., the Unicode
   Consortium.

   Note that Han variants are a relation between characters. This
   differs from the equivalence of the words "color" and "colour",
   which is a relation between strings of characters. As mentioned
   earlier, once a variant relation is defined on a closed subset of
   the Han ideographs, character-level equivalence matching can be
   implemented in IDNA. On the other hand, intelligent matching
   algorithms can also be developed at higher layers to match
   context-sensitive and localized Han variants [15].

   The severity of an inconsistent matching rule differs among
   language communities. The requirements and importance of equivalent
   CDN were also addressed by the Chinese Domain Name Consortium
   (CDNC) and JET (Joint Engineering Team, formed by JPNIC, KRNIC,
   CNNIC, TWNIC). CDN requirements are listed in Appendix A. Until a
   set of consistent matching rules has been standardized, it is
   recommended that these controversial code points be temporarily
   prohibited in the bootstrapping phase.

3.2 Visual difficulty

   Some code points are visually impossible to differentiate and could
   lead to many user entry errors. In this case these code points can
   cause unpredictable results when queried. The issue of visual
   difficulty may exist in many scripts, but its impact on each
   language group should be evaluated specifically.

3.3 Solutions incompleteness

   It is generally accepted that the IDNA solution does not solve the
   CDN problems listed in Appendix A.
   Although the WG considered some possible solutions to the CDN
   problem, those solutions did not meet the IETF's requirements.
   Thus, this document proposes prohibiting the Han characters listed
   in Appendix B until a solution that is acceptable to the IETF can
   be found, or until it is clear that no such solution is possible.

4. Security Considerations [3]

   Any additional function in the architecture implies additional
   opportunities for compromising the mechanism.

   Another security issue arises when a user enters a name containing
   code points from the extended prohibited table, which results in a
   failure in the bootstrapping phase.

   Current applications may assume that the characters allowed in host
   names will always be the same as they are in RFC1034[16] and
   RFC1035[17]. The NamePrep[3] infrastructure vastly increases the
   number of characters available in host names. Every program that
   uses "special" characters in conjunction with host names may be
   vulnerable to attack based on the new characters allowed by the
   NamePrep[3] specification.

5 Other Considerations for Appendix B

5.1 Other scripts requirements

   Other scripts (e.g., Arabic and Hebrew) may have the same issues as
   described in the subsections of section 3. Appendix B includes, but
   is not limited to, Han code points. To expedite IDN deployment, the
   "Go fast and prohibit only the code points you understand" model is
   recommended; thus Appendix B encompasses only the major Han code
   points in this version. However, Appendix B can be extended if
   users of other scripts propose further code points.

5.2 Issues for prohibiting Han code points

   The Han code points are used in many countries and territories,
   such as Japan, Korea, China, Taiwan, Hong Kong, Macao, and
   Singapore. In addition to Han code points, Kana is used in Japan
   and Hangul is used in Korea. The proposal will temporarily prevent
   users, especially those in the above areas, from using CDN in the
   bootstrapping phase. CDN service can only be activated in the
   mature phase.
   This proposal will delay CDN services; on the other hand, it
   creates a good opportunity to pursue a more complete CDN solution.

6. Acknowledgement:

   Many people from the JET (Joint Engineering Team), CDNC (Chinese
   Domain Name Consortium) and IETF IDN Working Group contributed
   ideas that went into this document, including

      Paul Hoffman
      John Klensin
      Fred Baker
      Vincent Chen
      Hua Lin Qian
      Yang Woo Ko
      Yoshiro Yoneya
      Kazunori Konishi
      Ching Chun Hsieh
      Scott Bradner

7. Author Contact Information:

   Li Ming Tseng, Prof
   National Central University, TWNIC
   Email: tsenglm@cc.ncu.edu.tw
   Tel: +886-3-490-4421

   Jan Ming Ho, Prof
   Academia Sinica, TWNIC
   Email: hoho@iis.sinica.edu.tw
   Tel: +886-2-2788-3799 x 1803

   Kenny Huang
   AsiaInfra, Academia Sinica, TWNIC
   Email: huangk@alum.sinica.edu
   Tel: +886-2-2658-6510

8. References:

   [1]  A Complete Set of Simplified Chinese Characters, published in
        1986 by the Committee of National Language and Chinese
        Character of China.

   [2]  Dictionary of Chinese Character Variants, compiled by Mandarin
        Promotion Council of Taiwan. Version 2 was published on the
        Web in Aug 2001, http://140.111.1.40/

   [3]  Paul Hoffman, Marc Blanchet, "Stringprep Profile for
        Internationalized Host Names", 2002-Jan-09,
        draft-ietf-idn-nameprep-07.txt

   [4]  Patrik Faltstrom, Paul Hoffman, "Internationalizing Domain
        Names In Applications (IDNA)", 2002-Jan-07,
        draft-ietf-idn-idna-06.txt

   [5]  Adam Costello, "Punycode version 0.3.3", 2002-Jan-06,
        draft-ietf-idn-punycode-00

   [6]  The Unicode Consortium, "The Unicode Standard",
        http://www.unicode.org/unicode/standard/standard.html

   [7]  Scott Bradner, "Key words for use in RFCs to Indicate
        Requirement Levels", March 1997, RFC 2119

   [8]  ISO/IEC 10646-1:2000(E), International Standard - Information
        technology -- Universal Multiple-Octet Coded Character Set
        (UCS)

   [9]  H. Alvestrand, "IETF Policy on Character Sets and Languages",
        1998-Jan, RFC 2277

   [10] F. Yergeau, "UTF-8, a transformation format of ISO 10646",
        1998-Jan, RFC 2279

   [11] P. Vixie, "Extension Mechanisms for DNS (EDNS0)", 1999-Aug,
        RFC 2671

   [12] CJKV Information Processing, ISBN 1-56592-224-7

   [13] Mark Davis and Martin Duerst, "Unicode Normalization Forms",
        Unicode Technical Report 15 [UTR15]

   [14] Mark Davis, "Case Mappings", Unicode Technical Report 21
        [UTR21]

   [15] John C. Klensin, "A Search-based access model for the DNS",
        2001-Nov-16, draft-klensin-dns-search-02d.txt

   [16] Paul Mockapetris, "Domain names - concepts and facilities",
        1987-Nov, RFC1034

   [17] Paul Mockapetris, "Domain names - implementation and
        specification", 1987-Nov, RFC1035

Appendix A CDN Requirements:

   The original list of CDN requirements was derived from the
   consensus of the 7th JET meeting, held on Nov 19th, 2001 in
   Beijing. The requirements for traditional and simplified Chinese
   domain names include:

   (1) Traditional/Simplified CDN solution MUST be consistent for all
       CDN users, including but not limited to end users and
       administrators.

   (2) The need to do multiple registrations and delegation for an
       equivalent CDN MUST be minimized. There MUST be only one
       registration for equivalent S-CDN. The delegation(s) for an
       equivalent CDN MUST be consistent.

   (3) Equivalent S-CDN MUST be treated as equivalent in IDN
       comparison.

   (4) There SHOULD be a consistent mechanism to validate CDN. The
       validation algorithm of CDN MAY be revised.

   (5) Applications that support CDN MAY display the equivalent S-CDN
       to users, in the following priority order: user preference
       first, then the default original form, and lastly ACE as a
       fallback.

   (6) Implementation of IDN that supports CDN MUST preserve the
       original form of CDN.

   (7) IDN requirements MUST accommodate CDN user requirements.

Appendix B. Extended Prohibited Code Point List

   ----- Start Extended Prohibited Table -----
   4E00-9FAF
   3400-4DBF
   ----- End Extended Prohibited Table -----
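
   The following Python fragment is a minimal sketch of how an
   implementation might load the table above and apply it in the two
   phases. The names parse_ranges, is_extended_prohibited and
   first_prohibited, as well as the bootstrapping flag, are
   hypothetical and not part of this specification. The sketch assumes
   that each table entry is an inclusive range of hexadecimal code
   points and that the check runs on the output of NamePrep, which is
   not shown here.

```python
# Table text copied from Appendix B; each entry is read as FIRST-LAST.
APPENDIX_B_TABLE = """
4E00-9FAF
3400-4DBF
"""


def parse_ranges(table):
    """Parse 'FIRST-LAST' hexadecimal entries into (first, last) integer pairs."""
    ranges = []
    for entry in table.split():
        first, _, last = entry.partition("-")
        ranges.append((int(first, 16), int(last, 16)))
    return ranges


EXTENDED_PROHIBITED = parse_ranges(APPENDIX_B_TABLE)


def is_extended_prohibited(code_point):
    """True if the code point falls inside any range of the table (inclusive)."""
    return any(first <= code_point <= last
               for first, last in EXTENDED_PROHIBITED)


def first_prohibited(label, bootstrapping=True):
    """Return the first extended-prohibited code point in label, or None.

    In the mature phase (bootstrapping=False) the extended check is
    skipped entirely, so the function always returns None.
    """
    if not bootstrapping:
        return None
    for ch in label:
        if is_extended_prohibited(ord(ch)):
            return ord(ch)
    return None
```

   In this sketch, moving from the bootstrapping phase to the mature
   phase amounts to calling first_prohibited with bootstrapping set to
   False, which mirrors the removal of the extended prohibited output
   module from the architecture.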