Internet Draft                                   Authors: Li Ming TSENG
                                                          Jan Ming HO
30 Mar 2001                                               Hua Lin QIAN
Expires 30 Sep 2001                                       Kenny HUANG
                                                  Editor: James SENG

      Internationalized Domain Names and Unique Identifiers/Names

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Abstract

One of the biggest technical challenges of Internationalized Domain
Names (IDN) is determining whether two given domain names match. The
current approach to this problem is a process known as [NAMEPREP].
This document describes an alternative view of, and solution to, the
IDN matching problem.

1. Introduction

The Chinese Domain Name Consortium (CDNC) has taken a keen interest
in IDN, in particular in the use of Chinese script in domain names.
CDNC was formed by the regional registries (CNNIC, TWNIC, HKNIC and
MONIC) and has been experimenting with a Chinese Domain Name System
for many months.

The primary motivation for this proposal is the lack of support for
Traditional and Simplified Chinese in NAMEPREP. See [HAN] for a
discussion of Traditional/Simplified Han Ideograph problems.
In addition, drawing on the operational experience of the registries
and on the examinations and developments within CDNC, this proposal
reduces the operational and deployment cost from a TLD manager's
perspective.

Backward compatibility, interoperability, scalability, security,
operation and deployment are all elements that must be considered as
criteria when designing an internationalized domain name system.

2. Background on Legacy Encoding

The most popular Chinese character set used in Taiwan is the industry
standard "BIG5", and the corresponding one in China is "GBK". BIG5
contains primarily Traditional Chinese characters and GBK contains
Simplified Chinese characters. In addition, the Chinese government
has mandated that all Chinese software in China must support
GB18030, a new standard that supersedes GBK.

Both BIG5 and GBK are widely used in China, Taiwan, Hong Kong and
Macao and are supported by many operating systems, including Windows.
Supporting these encodings in IDN is therefore essential from a
geographical perspective.

3. An overview of current proposals and their problems

3.1. ASCII Compatible Encoding (ACE)

The need to support ACE in IDN has been extensively discussed in the
IDN Working Group. Backward compatibility is the strongest advantage
of ACE. Deploying ACE neither affects the existing naming
infrastructure nor risks damaging current Internet applications. For
moving the current Internet to a multilingual infrastructure, ACE is
the most appropriate bridging solution.

Although ACE has these advantages, most users' systems work in a
local encoding. Users do not want to download special software or
upgrade their software in order to use a multilingual domain name
system. Supporting native encodings without altering the user's
software has therefore become an important issue for TLD managers.

3.2.
NAMEPREP

The design goal of [NAMEPREP] is to allow users to enter host names
in applications and have the highest chance of getting the name
correct. The NAMEPREP process comprises three basic steps, namely
"MAP", "NORMALIZATION" and "PROHIB". The MAP and NORMALIZATION steps
aim to reduce the number of possible representations of a domain name
that should be equivalent. These steps are based upon Unicode
Technical Reports [UTR15] and [UTR21].

However, NAMEPREP fails when there are multiple representations of
the same domain name and matching depends on language and context.
Of particular interest to us, Traditional and Simplified Chinese
ideographs cannot be handled by NAMEPREP.

4. Alternative view to the problem space

While the IDN WG has been working very hard on ACE and NAMEPREP for
IDN, it is apparent that there is another view of these problems that
may give us a different approach and solution.

First, NAMEPREP assumes that an IDN is an ISO10646/Unicode string. In
reality, an IDN is most often encoded in a legacy encoding, and an
additional step has to be taken to convert it to ISO10646/Unicode.

Beyond its backward compatibility, ACE is also an identifier string
for an IDN. The NAMEPREP process, in turn, unifies the various
possible representations of an IDN into a single "unique name" for
matching purposes.

In other words, we have the following conceptual model:

+-------+     +---------+ (ISO10646)
|XYZ.COM|-->--|Transcode|-->------------+
+-------+     +---------+      +----------------+     +---------------+
    : (Legacy)           ...---|NAMEPREP/Unified|-->--|ACE/unique name|
+-------+     +---------+      +----------------+     +---------------+
|xyz.com|-->--|Transcode|-->------------+
+-------+     +---------+ (ISO10646)

5. Proposal

Given this alternative view of IDN, we can derive another set of
solutions using a directory concept.

+-------+       +---------+
|XYZ.COM|-->----|         |
+-------+       |         |     +---------------+
    :   (Legacy)|Directory|-->--|ACE/unique name|
+-------+       |         |     +---------------+
|xyz.com|-->----|         |
+-------+       +---------+

The purpose of this directory system is to list all the possible
representations of an IDN and unify them into a unique name. This
unique name could be the ACE of the most common representation or the
NAMEPREPPED ACE. The content of the directory is built up at
registration time: registrants have to provide a list of equivalent
representations of the domain names they register.

However, there is still the question of which directory to use. In
this document, we examine a few different solutions.

5.1. LDAP as Directory

Lightweight Directory Access Protocol [LDAP] is one of the most
widely used directory protocols. In LDAP, there is a concept of
hierarchy similar to the DNS hierarchy. Hence, it is possible to
distribute the content of the directory across various LDAP servers
for scalability and authority control.

For example, each registry that wishes to deploy IDN may set up an
LDAP server and register it with a "root" LDAP server.

The IDN query process would then look something like this:

   a. The user inputs an IDN into an application
   b. The application does an LDAP query to look up the unique name
   c. The application uses the unique name to do a DNS lookup

Advantages:
   - encapsulates the problem in the representation layer and at
     registration time
   - able to handle unification problems

Disadvantages:
   - requires all applications to be upgraded
   - additional LDAP lookup overhead
   - policy issues with the "root" LDAP server
   - requires access to LDAP servers to function, i.e. cannot work
     offline

5.2. CNRP as Directory

Common Name Resolution Protocol [CNRP] is a newly developed IETF
protocol for common name resolution. In CNRP, there is no concept of
hierarchy, but there is a referral scheme.
Hence, it is possible to build a distributed directory system whose
servers refer to one another.

The IDN query process would then look something like this:

   a. The user inputs an IDN into an application
   b. The application does a CNRP query to look up the unique name
   c. The application uses the unique name to do a DNS lookup

Advantages:
   - encapsulates the problem in the representation layer and at
     registration time
   - able to handle unification problems
   - no policy issues with a "root" CNRP server

Disadvantages:
   - requires all applications to be upgraded
   - additional CNRP lookup overhead, and no assurance that the
     unique name can be located
   - requires access to CNRP servers to function, i.e. cannot work
     offline

5.3. DNS as Directory

Domain Name System [DNS] is a widely established distributed lookup
directory. It has an existing hierarchical structure, and its
resource records are distributed. In theory, the DNS is able to
handle 8-bit binary strings.

The IDN query process would then look something like this:

   a. The user inputs an IDN into an application
   b. The application does a DNS query for the name, which returns
      the unique name together with the Resource Records of the
      unique name

Advantages:
   - encapsulates the problem in the representation layer and at
     registration time
   - able to handle unification problems
   - existing "root" DNS servers with an existing hierarchy
   - does not require all applications to be upgraded

Disadvantages:
   - unknown behavior of applications that cannot handle 8-bit names
   - unknown behavior of servers and caching software that cannot
     handle 8-bit names

6. Solution

CDNC's operational experience is that it is difficult to get
application developers to upgrade, difficult to get users to download
new applications, and so on. Using DNS as a directory would therefore
be the fastest approach to deploying IDN for our users.

6.1.
Zone file

Because there are multiple encodings, and multiple representations of
the same name even within one encoding, a single domain name
corresponds to multiple binary strings (e.g. ML1, ML2, ML3, ML4).
Hence, we would create the following Resource Records within the name
server:

   ML1     UNAME   ACE1
   ML2     UNAME   ACE1
   ML3     UNAME   ACE1
   ML4     UNAME   ACE1
   ACE1    IN A    1.2.3.4
           IN A    1.2.3.4

A "UNAME" Resource Record is shown here. In practice, it could be
CNAME (except that CNAME is unable to handle MX).

6.2. Advantages

The strongest advantages of this solution are:

   a. It does not require our users to download any special software
      or upgrade their software, since it handles the user's native
      encoding directly.
   b. It will work immediately for ccTLDs that wish to offer
      ML.ccTLD services, without any changes at the user client.
   c. It remains compatible with the IDNA approach so long as we keep
      the unique name equivalent to the NAMEPREPPED ACE.
   d. It uses the existing DNS hierarchy.

6.3. Potential Loopholes

There are many loopholes within this solution that we need to take
note of:

   a. Some "smart" localized browsers will send out the "wrong"
      binary string because they handle encodings differently. For
      example, English Internet Explorer will not be able to handle
      Chinese double-byte legacy encodings properly.
   b. While Chinese has a handful (usually 2 to 3) of representation
      forms for a single IDN, other languages may have much more
      complicated representations for which this approach may not be
      suitable. For example, if case-folding for Latin characters is
      done using this solution, a string of 32 characters will
      require 2^32 entries in the DNS. This could, however, be solved
      by some other means.
   c. It might be possible to construct a binary string in some
      legacy encoding that has the same binary representation as
      another domain name (a.k.a. binary collision).
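
The zone-file scheme of Section 6.1 and the entry-count concern in
loophole (b) can be illustrated with a minimal Python sketch. It is
not part of the proposal: plain dictionaries stand in for the name
server, and the sample label, the unique name "ace1", the address
192.0.2.4, the use of UTF-8 as one of the encodings, and the function
names are all assumptions made for the example.

```python
# Sketch only: dictionaries stand in for the zone of Section 6.1.
# UNIQUE_NAME, ADDRESS and the sample label are hypothetical.

UNIQUE_NAME = b"ace1"          # stands in for ACE1
ADDRESS = "192.0.2.4"          # documentation address, not from the draft

TRADITIONAL = "\u4e2d\u570b"   # a label in its Traditional Chinese form
SIMPLIFIED = "\u4e2d\u56fd"    # the same label in its Simplified form


def build_zone():
    """Return (uname, a_records) for one registered name."""
    variants = [
        TRADITIONAL.encode("big5"),    # ML1
        TRADITIONAL.encode("utf-8"),   # ML2 (UTF-8 is an assumption)
        SIMPLIFIED.encode("gbk"),      # ML3
        SIMPLIFIED.encode("utf-8"),    # ML4 (UTF-8 is an assumption)
    ]
    # Every binary variant points at the same unique name.
    uname = {label: UNIQUE_NAME for label in variants}
    a_records = {UNIQUE_NAME: ADDRESS}
    return uname, a_records


def resolve(label, uname, a_records):
    """Map a raw 8-bit label to its unique name, then to an address."""
    unique = uname.get(label, label)
    return a_records.get(unique)


def case_variant_count(label):
    """Number of upper/lower-case spellings of a label."""
    cased = sum(1 for ch in label if ch.lower() != ch.upper())
    return 2 ** cased


if __name__ == "__main__":
    uname, a_records = build_zone()
    for label in uname:
        print(label, "->", resolve(label, uname, a_records))
    print(case_variant_count("a" * 32))
```

Each registered variant costs one extra record, which is manageable
for the two to three forms typical of Chinese names. The last
function shows why the same table-driven approach does not scale to
Latin case-folding: a label with n cased letters has 2^n spellings.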

Acknowledgement

Author(s)

   Li Ming Tseng, Prof
   National Central University, TWNIC
   Email: tsenglm@cc.ncu.edu.tw
   Tel: +886-3-490-4421

   Jan Ming Ho, Prof
   Academia Sinica, TWNIC
   Email: hoho@iis.sinica.edu.tw
   Tel: +886-2-2788-3799 x 1803

   Hua Lin Qian, Prof
   Chinese Academy of Science, CNNIC
   Email: hlqian@ns.cnc.ac.cn
   Tel: +86-10-6256-9960

   Kenny Huang
   Asia Infra International Ltd, TWNIC
   Email: huangk@alum.sinica.edu
   Tel: +886-2-2658-6510

Editor:

   James SENG
   i-DNS.net International
   8 Temasek Boulevard
   Suntec Tower Three #24-02
   Singapore 038988
   Email: jseng@i-dns.net
   Tel: +65-2486-188

Reference

   [IDNREQ]   Requirements of Internationalized Domain Names,
              Zita Wenzel, James Seng, draft-ietf-idn-requirements

   [NAMEPREP] Preparation of Internationalized Host Names,
              P. Hoffman, M. Blanchet, draft-ietf-idn-nameprep

   [HAN]      Han Ideograph (CJK) for Internationalized Domain Names,
              J. Seng, Y. Yoneya, K. Huang, K. Kim, draft-ietf-idn-cjk

   [LDAP]     Lightweight Directory Access Protocol (v3),
              M. Wahl, T. Howes, S. Kille, rfc2251.txt

   [CNRP]     Common Name Resolution Protocol,
              N. Popp, M. Mealling, M. Moseley, draft-ietf-cnrp

   [DNS]      Domain Names - Implementation and Specification,
              P. Mockapetris, RFC1035

   [CJKV]     CJKV Information Processing, ISBN 1-56592-224-7

   [UTR15]    Unicode Normalization Forms, Mark Davis and
              Martin Duerst, Unicode Technical Report 15.

   [UTR21]    Case Mappings, Mark Davis, Unicode Technical Report 21.