Internet Draft                                    Lee Ming Tseng
<draft-tseng-idn-piidna-00.txt>                      Jan Ming Ho
01 Feb 2002                                         Kenny Huang 
expires  01 August 2002                                        

Phased Implementation for Internationalized Domain Names in Applications

Status of this Memo

    This document is an Internet-Draft and is in full conformance 
    with all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet 
    Engineering Task Force (IETF), its areas, and its working 
    groups. Note that other groups may also distribute working 
    documents as Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of
    six months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use Internet-
    Drafts as reference material or to cite them other than as
    "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    A copy of this particular draft is also archived at 
    http://www.twnic.net.tw

Abstract

This document proposes a phased implementation of IDNA 
(Internationalized Domain Names in Applications). The DNS infrastructure 
is critical to the operation of the Internet, so the implementation of 
IDNA shall be carefully considered and examined, and the IDN 
infrastructure shall be deployed step by step to ensure the reliability 
of the new infrastructure. The phased implementation proposed here is 
intended to meet this requirement for incremental change.

1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119 [7].

A "code point" is an integer value associated with a character in a coded
character set.

"TC" is an abbreviation for Traditional Chinese.

"SC" is an abbreviation for Simplified Chinese.

"CDN" is defined as an acronym of Chinese Domain Name that represents 
an internationalized domain name containing at least one Chinese 
character. The scope of "Chinese character" is defined with reference to 
ISO/IEC 10646-1:2000(E) [second edition 2000-09-15] [8]: if a character 
is marked "C and G-Hanzi-T", it MUST be considered a Chinese character. 
This definition does not imply that the character is not also a 
character of other countries that use Han ideographs.


2 Proposed Phased implementation of IDNA

The IDN Working Group has decided to use Unicode as the basis for
enabling IDN services. This document proposes a two-phase 
implementation of IDNA, consisting of a Bootstrapping Phase and a 
Mature Phase, as described below.

2.1 Bootstrapping Phase

In the bootstrapping phase, the code points listed in Appendix B 
shall be prohibited until an update is requested by the user 
community. Section 3 describes how Appendix B was formed.

                     +------+
                     | User |
                     +------+
                        ^
                        | Input and display: local interface methods
                        | (pen, keyboard, glowing phosphorus, ...)
    +-------------------|-------------------------------+
    |                   v                               |
    |          +-----------------------------+          |
    |          |        NamePrep             |          |
    |          |  1 Mapping                  |          |
    |          |  2 Normalization            |          |
    |          |  3 Prohibited Output        |          |
    |          +-----------------------------+          |
    |                   ^                               |
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          |  Extended Prohibited Output |          |
    |          +-----------------------------+          |
    |                   ^                               |   
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          |        Punycode[5]          |          |
    |          +-----------------------------+          |
    |                    ^       ^                      | End system
    |                    |       |                      |
    |  Call to resolver: |       | Application-specific |
    |               ACE  |       | protocol:            |
    |                    v       | predefined by the    |
    |           +----------+     | protocol or defaults |
    |           | Resolver |     | to ACE               |
    |           +----------+     |                      |
    |                 ^          |                      |
    +-----------------|----------|----------------------+
        DNS protocol: |          |
                  ACE |          |
                      v          v
           +-------------+    +---------------------+
           | DNS servers |    | Application servers |
           +-------------+    +---------------------+

Table 1. IDNA architecture [4] with extended prohibited output module.
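
As a non-normative illustration, the end-system flow in Table 1 might
be sketched as follows. The sketch assumes Python's standard
encodings.idna module as a stand-in for the NamePrep [3] and
Punycode [5] boxes; that module implements later published forms of
those specifications, including an "xn--" ACE prefix that this document
does not specify. The names EXTENDED_PROHIBITED,
is_extended_prohibited and to_ace_bootstrapping are hypothetical.

```python
import encodings.idna

# Inclusive code point ranges copied from Appendix B.
EXTENDED_PROHIBITED = ((0x4E00, 0x9FAF), (0x3400, 0x4DBF))


def is_extended_prohibited(ch):
    """Return True if ch falls in a range listed in Appendix B."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in EXTENDED_PROHIBITED)


def to_ace_bootstrapping(label):
    """NamePrep, then Extended Prohibited Output, then ACE conversion."""
    # NamePrep box (stand-in implementation, an assumption of this sketch).
    prepared = encodings.idna.nameprep(label)
    # Extended Prohibited Output box: reject any Appendix B code point.
    for ch in prepared:
        if is_extended_prohibited(ch):
            raise ValueError(
                "U+%04X is prohibited in the bootstrapping phase" % ord(ch))
    # ACE box (stand-in; ToASCII re-applies NamePrep internally).
    return encodings.idna.ToASCII(prepared)
```

Under these assumptions a label without Han code points, such as
"bücher", is converted to ACE as usual, while a label containing
U+4E2D is rejected before it reaches the resolver.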


2.2 Mature Phase 

The phased implementation of IDNA shall remain flexible enough to 
allow future revision. Unknown code points will be sent to the extended 
prohibited output module. Valid code points, on the other hand, will 
never be prohibited. A future version of IDNA simply removes 
the prohibition on the code points listed in Appendix B, resulting
in the same IDNA that is currently under consideration. 

                     +------+
                     | User |
                     +------+
                        ^
                        | Input and display: local interface methods
                        | (pen, keyboard, glowing phosphorus, ...)
    +-------------------|-------------------------------+
    |                   v                               |
    |          +-----------------------------+          |
    |          |        NamePrep             |          |
    |          |  1 Mapping                  |          |
    |          |  2 Normalization            |          |
    |          |  3 Prohibited Output        |          |
    |          +-----------------------------+          |
    |                   ^                               |
    |                   |                               |
    |                   v                               |
    |          +-----------------------------+          |
    |          |        Punycode[5]          |          |
    |          +-----------------------------+          |
    |                    ^       ^                      | End system
    |                    |       |                      |
    |  Call to resolver: |       | Application-specific |
    |               ACE  |       | protocol:            |
    |                    v       | predefined by the    |
    |           +----------+     | protocol or defaults |
    |           | Resolver |     | to ACE               |
    |           +----------+     |                      |
    |                 ^          |                      |
    +-----------------|----------|----------------------+
        DNS protocol: |          |
                  ACE |          |
                      v          v
           +-------------+    +---------------------+
           | DNS servers |    | Application servers |
           +-------------+    +---------------------+

Table 2. IDNA architecture [4].
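
The relationship between the two phases can be illustrated by making
the extended prohibition table a parameter: the mature phase is then
the same conversion with an empty table. This is a sketch only. It
assumes Python's standard encodings.idna module as a stand-in for
NamePrep [3] and Punycode [5], and the names BOOTSTRAPPING_TABLE,
MATURE_TABLE and to_ace are hypothetical.

```python
import encodings.idna

# Appendix B ranges (inclusive) for the bootstrapping phase.
BOOTSTRAPPING_TABLE = ((0x4E00, 0x9FAF), (0x3400, 0x4DBF))
# Mature phase: the prohibition on Appendix B code points is removed.
MATURE_TABLE = ()


def to_ace(label, extended_prohibited):
    """Convert a label to ACE, honouring the given prohibition table."""
    prepared = encodings.idna.nameprep(label)  # stand-in for NamePrep
    for ch in prepared:
        cp = ord(ch)
        if any(low <= cp <= high for low, high in extended_prohibited):
            raise ValueError("U+%04X is currently prohibited" % cp)
    return encodings.idna.ToASCII(prepared)    # stand-in for the ACE step
```

With MATURE_TABLE the function returns exactly what the unmodified
conversion returns, which mirrors the statement that the mature phase
is plain IDNA.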


3 Extended Prohibited Output

This section specifies how the extended prohibition table 
(Appendix B) is used. The code points listed in Appendix B 
are proposed by the authors. Appendix B covers a subset of Han
code points, which may be used in Japan, Korea, Taiwan and
China. 

The subsections below describe why the code points in Appendix B 
were selected. Implementations of this specification MUST be based 
on the lists in Appendix B, not on the descriptions in this section.

3.1 Equivalent matching

Some character sets, such as the Han code points, raise the issue of 
equivalent matching. Han characters are used in many countries in Asia. 
Within a single written language, two Han characters are said to be 
variants of each other if they have the same meaning and the same 
pronunciation. In other words, they are expected to match as 
equivalent characters. However, a variant relation can be either 
context sensitive or context free [1][2].

It is also true that a variant relation that holds in one country may 
not exist in other countries. Because the set of Han ideographs is open, 
it is still growing today. The situation is complicated further by the 
number of Han characters, and hence of variants, in different versions 
of Unicode. The number of unified Han characters is 21,204 in Unicode 
2.0, 27,786 in Unicode 3.0, and 70,207 in Unicode 3.1 [6]. The larger 
the Unicode repertoire, the larger the set of associated variants. 
The authors are aware of some dictionaries of variants, but no 
organization had undertaken an international effort to standardize 
variants based on Unicode at the time this document was prepared.

We also recognize that one does not have to consider the existence of 
variants if names are nothing but identifiers. But, if a name itself 
is a product with commercial value as is the case in domain name 
services, then the ambiguity introduced by the variants into delegation 
and resolution processes must be minimized. A domain name service 
which is unable to minimize these ambiguities will cause serious 
consumer protection problems.

One possible solution to the Han variant problem is to standardize 
a variant relation, which is context free and holds for all nations 
or regions, with respect to a given subset of Han characters. The 
purpose is not to provide a complete solution to the Han variant 
problem, given that the set of Chinese characters is open. Instead, 
the purpose is to define a maximal set of equivalent variants such that
ambiguity in a name service can be minimized at a reasonable cost by 
a low-level mechanism like IDNA. It is easier, and is therefore 
recommended by the authors, to define the variant relation on a small 
subset of Han ideographs, e.g., those in Unicode 2.0. In that case, Han 
characters beyond this code range should be forbidden in a domain name. 
Note that Han characters outside Unicode 2.0 are not commonly used in 
daily life. It is also possible to work from a more recent version of 
Unicode if that can be justified. Han variants can also be standardized 
by other standardization bodies, e.g., the Unicode Consortium. 
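
The closed-subset idea can be illustrated with a small check that
permits Han characters only when they fall inside a chosen range. The
sketch below is illustrative and is not the Appendix B table. It
assumes that the unified Han repertoire of Unicode 2.0 is the range
U+4E00..U+9FA5 and ignores compatibility ideographs; it also uses the
character name reported by Python's unicodedata module as a rough
test for "is a Han character". The function names are hypothetical.

```python
import unicodedata

# Assumption: unified Han ideographs of Unicode 2.0 (inclusive range).
CLOSED_SUBSET = (0x4E00, 0x9FA5)


def is_han(ch):
    """Rough test: the character is named as a CJK unified ideograph."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def in_closed_subset(ch):
    """Return True if ch lies inside the assumed closed subset."""
    return CLOSED_SUBSET[0] <= ord(ch) <= CLOSED_SUBSET[1]


def label_permitted(label):
    """Forbid Han characters outside the closed subset; allow the rest."""
    return all(in_closed_subset(ch) for ch in label if is_han(ch))
```

A label made only of non-Han characters passes unchanged, because the
restriction applies to Han characters alone.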

Note that Han variants are a relation between characters. This differs 
from the equivalence of the words "color" and "colour", which is a 
relation between strings of characters.

As mentioned earlier, once a variant relation is defined on a closed 
subset of Han ideographs, character-level equivalence matching 
can be implemented in IDNA. In addition, intelligent matching 
algorithms can be developed at higher layers to match 
context-sensitive and localized Han variants [15].
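
Character-level equivalence matching over a context-free variant
relation can be sketched as a per-character fold followed by an
ordinary string comparison. The two-entry table below is purely
illustrative and is not a proposed or standardized variant table; the
names VARIANT_TABLE, fold_variants and equivalent are hypothetical.

```python
# Illustrative only: each variant maps to one chosen representative.
VARIANT_TABLE = {
    "\u56fd": "\u570b",  # U+56FD (SC form) -> U+570B (TC form)
    "\u5b66": "\u5b78",  # U+5B66 (SC form) -> U+5B78 (TC form)
}


def fold_variants(label):
    """Replace every character by its representative, one at a time."""
    return "".join(VARIANT_TABLE.get(ch, ch) for ch in label)


def equivalent(a, b):
    """Two labels match if they are identical after folding."""
    return fold_variants(a) == fold_variants(b)
```

Because the fold works one character at a time, it cannot express a
string-level relation such as "color" and "colour", nor a variant
relation that depends on context.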

The severity of an inconsistent matching rule differs among 
language communities. The requirements for, and the importance of, 
equivalent CDNs have also been addressed by the Chinese Domain Name 
Consortium (CDNC) and JET (Joint Engineering Team, formed by JPNIC, 
KRNIC, CNNIC, TWNIC). The CDN requirements are listed in Appendix A. 
Until a set of consistent matching rules is standardized, it is 
recommended that these controversial code points be temporarily 
prohibited in the bootstrapping phase.

3.2 Visual difficulty

Some code points are visually impossible to differentiate and 
could lead to many user entry errors. Such code points can cause 
unpredictable results when queried. Visual difficulty may exist 
in many scripts, but its impact on different language groups
should be evaluated individually. 
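
As an illustration of the problem, the sketch below lists the code
points behind a label so that two labels that look alike on screen can
be told apart. The example pair (Latin small letter a and Cyrillic
small letter a) is chosen by the editor for illustration and is not
taken from Appendix B; the function name describe is hypothetical.

```python
import unicodedata


def describe(label):
    """List each character as 'U+XXXX NAME' for visual inspection."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in label]


latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, rendered much like the Latin letter

print(describe(latin_a))
print(describe(cyrillic_a))
print(latin_a == cyrillic_a)  # the two labels do not compare equal
```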

3.3 Solutions incompleteness

It is generally accepted that the IDNA solution does not solve the
CDN problems that are listed in Appendix A. Although the WG considered 
some possible solutions to the CDN problem, those solutions did not 
meet the IETF's requirements. Thus, this document proposes prohibiting 
the Han characters listed in Appendix B until a solution that is 
acceptable to the IETF can be found, or until it is clear that no 
such solution is possible.


4. Security Considerations [3]

Additional functions in the architecture imply additional opportunities
for compromising the mechanism. Another security issue is that a user 
who enters a name containing code points from the extended prohibited 
table will experience a failure in the bootstrapping phase. 

Current applications may assume that the characters allowed in host
names will always be the same as they are in RFC 1034 [16] and 
RFC 1035 [17]. The NamePrep [3] infrastructure vastly increases the 
number of characters available in host names. Every program that uses 
"special" characters in conjunction with host names may be vulnerable 
to attack based on the new characters allowed by the NamePrep [3] 
specification.
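
The assumption that such applications make can be written down as a
check for the traditional letter-digit-hyphen form of a host name
label. The sketch below is illustrative; it follows the preferred
name syntax of RFC 1035 [17] as commonly relaxed to allow a leading
digit, and the names LDH_LABEL and is_ldh_label are hypothetical.

```python
import re

# Letters, digits and hyphen; 1 to 63 characters; no leading or
# trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label):
    """Return True if the whole label has the traditional LDH form."""
    return LDH_LABEL.fullmatch(label) is not None
```

An ACE-encoded label passes this check, while the same name in its
internationalized form does not, so code that handles names before
ACE conversion cannot rely on the old character repertoire.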


5 Other Considerations for Appendix B

5.1 Other scripts requirements

Other scripts (e.g., Arabic and Hebrew) may have the same
issues as described in the subsections of Section 3. Appendix B
may include, but is not limited to, Han code points. To expedite IDN 
deployment, a "Go fast and prohibit only the code points you understand" 
model is recommended; thus Appendix B encompasses only the major Han 
code points in this version.

However, Appendix B can be extended if other code points are
proposed by users of other scripts.


5.2 Issues for prohibiting Han code points

Han code points are used in many countries and territories, such
as Japan, Korea, China, Taiwan, Hong Kong, Macao and Singapore.
In addition to Han code points, Kana is used in Japan and Hangul is 
used in Korea. This proposal will temporarily prevent users, especially 
those in the above areas, from using CDNs in the bootstrapping phase. 
CDN service can only be activated in the mature phase. This proposal 
will therefore delay CDN services; on the other hand, it creates a good 
opportunity to pursue a more complete CDN solution. 


6. Acknowledgement:

Many people from the JET (Joint Engineering Team), the CDNC (Chinese
Domain Name Consortium) and the IETF IDN Working Group contributed ideas
that went into this document, including:

Paul Hoffman
John Klensin
Fred Baker
Vincent Chen
Hua Lin Qian
Yang Woo Ko
Yoshiro Yoneya
Kazunori Konishi
Ching Chun Hsieh
Scott Bradner


7. Author Contact Information:

Li Ming Tseng, Prof
National Central University, TWNIC
Email: tsenglm@cc.ncu.edu.tw
Tel: +886-3-490-4421

Jan Ming Ho, Prof
Academia Sinica, TWNIC
Email: hoho@iis.sinica.edu.tw
Tel: +886-2-2788-3799 x 1803

Kenny Huang
AsiaInfra, Academia Sinica, TWNIC
Email: huangk@alum.sinica.edu
Tel: +886-2-2658-6510


8. References:

[1] A Complete Set of Simplified Chinese Characters, published 
in 1986 by the Committee of National Language and Chinese 
Character of China.
 
[2] Dictionary of Chinese Character Variants, compiled by Mandarin 
Promotion Council of Taiwan. Version 2 was published in Aug 2001 
on the Web site http://140.111.1.40/
 
[3] Paul Hoffman, Marc Blanchet, "Stringprep Profile for 
Internationalized Host Names", 2002-Jan-09, 
draft-ietf-idn-nameprep-07.txt
 
[4] Patrik Faltstrom, Paul Hoffman, "Internationalizing Domain Names 
In Applications (IDNA)", 2002-Jan-07, draft-ietf-idn-idna-06.txt
 
[5] Adam Costello, "Punycode version 0.3.3", 2002-Jan-06, 
draft-ietf-idn-punycode-00 
 
[6] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.
 
[7] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
 
[8] ISO/IEC 10646-1:2000(E). International Standard - Information 
technology -- universal Multiple-Octet Coded Character Set (UCS)

[9] H. Alvestrand, "IETF Policy on Character Sets and Languages", 
1998-Jan, RFC 2277

[10] F. Yergeau, "UTF-8, a transformation format of ISO 10646", 
1998-Jan, RFC 2279

[11] P. Vixie, "Extension Mechanisms for DNS (EDNS0)", 1999-Aug, 
RFC 2671

[12] CJKV Information Processing, ISBN 1-56592-224-7

[13] Unicode Normalization Forms, Mark Davis and Martin Duerst,
Unicode Technical Report 15 [UTR15].

[14] Case Mappings, Mark Davis, Unicode Technical Report 21 [UTR21].

[15] John C. Klensin, "A Search-based access model for the DNS", 
2001-Nov-16, draft-klensin-dns-search-02d.txt

[16] Paul Mockapetris, "Domain names - concepts and facilities", 
1987-Nov, RFC1034

[17] Paul Mockapetris, "Domain names - implementation and 
specification", 1987-Nov, RFC1035


Appendix A  CDN Requirements:

The original list of CDN requirements was derived from the consensus 
of the 7th JET meeting, held on Nov 19th, 2001 in Beijing. The 
requirements for traditional and simplified Chinese domain names 
include:

(1) Traditional/Simplified CDN solution MUST be consistent for all 
    CDN users, including but not limited to end users and 
    administrators.

(2) The need to do multiple registrations and delegation for an 
    equivalent CDN MUST be minimized. There MUST be only one 
    registration for equivalent S-CDN. The delegation(s) for an 
    equivalent CDN MUST be consistent.

(3) Equivalent S-CDN MUST be treated as equivalent in IDN comparison.

(4) There SHOULD be a consistent mechanism to validate CDN. The 
    validation algorithm of CDN MAY be revised.

(5) Applications that support CDN MAY display the equivalent S-CDN 
    to users depending on the priority order of user preference 
    followed by default original form and then lastly ACE fallback.

(6) Implementation of IDN that supports CDN MUST preserve the 
    original form of CDN.

(7) IDN requirements MUST accommodate CDN user requirements.


Appendix B. Extended Prohibited Code Point List

----- Start Extended Prohibited Table -----
4E00-9FAF
3400-4DBF
----- End Extended Prohibited Table -----
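
Because implementations are required to work from the table above
rather than from the prose in Section 3, it may help to read the table
mechanically. The following sketch parses the table text into
inclusive integer ranges. The handling of a line holding a single code
point is an assumption, since the table in this version contains
ranges only, and the names APPENDIX_B and parse_table are
hypothetical.

```python
APPENDIX_B = """\
----- Start Extended Prohibited Table -----
4E00-9FAF
3400-4DBF
----- End Extended Prohibited Table -----
"""


def parse_table(text):
    """Return a list of (low, high) inclusive code point ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the start/end marker lines.
        if not line or line.startswith("-----"):
            continue
        low, _, high = line.partition("-")
        # Assumption: a line with no hyphen names a single code point.
        ranges.append((int(low, 16), int(high or low, 16)))
    return ranges


print(parse_table(APPENDIX_B))
```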