idnits 2.17.1

draft-ietf-idn-step-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == There is 1 instance of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 460 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 2 instances of too long lines in the document, the longest one
     being 7 characters in excess of 72.

  ** There are 86 instances of lines with control characters in the document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. (The document does seem to have the
     reference to RFC 2119 which the ID-Checklist requires).

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.
     If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal
     Provisions document at https://trustee.ietf.org/license-info for more
     information.)

  -- The document date (May 29, 2001) is 8362 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'IDNComp' is mentioned on line 154, but not defined

  == Missing Reference: 'RFC1766' is mentioned on line 65, but not defined

  ** Obsolete undefined reference: RFC 1766 (Obsoleted by RFC 3066, RFC 3282)

  == Missing Reference: 'Unicode3' is mentioned on line 128, but not defined

  == Missing Reference: 'ACE' is mentioned on line 148, but not defined

  == Missing Reference: 'A-Za-z0-9' is mentioned on line 297, but not defined

  == Missing Reference: 'P1' is mentioned on line 310, but not defined

  == Missing Reference: 'L1' is mentioned on line 310, but not defined

  == Missing Reference: 'P2' is mentioned on line 310, but not defined

  == Missing Reference: 'L2' is mentioned on line 310, but not defined

  == Missing Reference: 'Py' is mentioned on line 310, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 317

  == Unused Reference: 'ASCII' is defined on line 393, but no explicit
     reference was found in the text

  == Unused Reference: 'IDNCOMP' is defined on line 397, but no explicit
     reference was found in the text

  == Unused Reference: 'UNICODE' is defined on line 422, but no explicit
     reference was found in the text

  == Unused Reference: 'UNICODE30' is defined on line 425, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  == Outdated reference: A later version (-01) exists of
     draft-ietf-idn-compare-00

  -- Possible downref: Normative reference to a draft: ref.
     'IDNCOMP'

  -- Possible downref: Normative reference to a draft: ref. 'IDNReq'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Dictionary79'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Macmillan93'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE30'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Ye95'

     Summary: 6 errors (**), 0 flaws (~~), 19 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1 Internet Draft                                                Liana Ye
2 draft-ietf-idn-step-00.txt                                     Y&D ISG
3 May 29, 2001
4 Expires in six months (November 2001)

6 StepCode- A User Access Oriented IDN Encoding

8 Status of this memo

10 This document is an Internet-Draft and is in full conformance with
11 all provisions of Section 10 of RFC2026.

13 Internet-Drafts are working documents of the Internet Engineering
14 Task Force (IETF), its areas, and its working groups. Note that
15 other groups may also distribute working documents as
16 Internet-Drafts.

18 Internet-Drafts are draft documents valid for a maximum of six
19 months and may be updated, replaced, or obsoleted by other documents
20 at any time. It is inappropriate to use Internet-Drafts as reference
21 material or to cite them other than as "work in progress."

23 The list of current Internet-Drafts can be accessed at
24 http://www.ietf.org/ietf/1id-abstracts.txt

26 The list of Internet-Draft Shadow Directories can be accessed
27 at http://www.ietf.org/shadow.html.

29 Abstract

31 This document describes a method for mapping end-user input
32 into the Unicode library, representing Chinese characters in host
33 name parts in a fashion that can be extended beyond
34 Unicode symbols and is completely compatible with the current DNS.
35 It is a potential candidate for an ASCII-Compatible Encoding (ACE)
36 for internationalized hostnames. This method is based
37 on the concept, widely used by users, of denoting a Chinese character
38 by its phonetic elements. Users register their IDN names with such an
39 extended phonetic description (English speakers can register a
40 CJK glyph for their blessing too), so that an IDN name in either the
41 traditional or the simplified character set can be effective with
42 both user communities, depending on servers. It also allows standard
43 variations for compressing and security filtering of that
44 information. The method can be extended to other scripts as long
45 as they need more than the 26 basic ASCII letters to be mapped.

47 1. Introduction

49 1.1 Context

51 There is a strong world-wide desire to use characters other than
52 plain ASCII in hostnames. Hostnames have become the equivalent of
53 business or product names for many services on the Internet, so
54 there is a need to make them usable by people whose native scripts
55 are not representable by ASCII. The requirements for
56 internationalizing hostnames are described in the IDN WG's
57 requirements document, [IDNReq].

59 The IDN WG's comparison document [IDNComp] describes three potential
60 main architectures for IDN: arch-1 (just send binary), arch-2 (send
61 binary or ACE), and arch-3 (just send ACE). StepCode is an ACE that
62 can be used with protocols that match arch-2 or arch-3. Hopefully,
63 this ACE may render arch-2 unnecessary. It is an ACE that either a
64 server or a client can generate and select according to a string's
65 language/script tag/label [RFC1766].
Because it does not attempt any
66 particular optimization or compression of string patterns, the
67 average encoded length of a Chinese character in [ISO10646] is about
68 13 characters, enough to code [four Chinese characters, not
    recoverable in this copy].com (I expect that 20% of readers
69 can see this, and the rest will be at a loss) as:
70 xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0.com with 54 octets.
71 In the code, the first group of digits specifies the tone of each
72 glyph and, by its length, the number of [ISO10646] codepoints (glyphs).
73 The remaining digits specify the layout of the parts of each
74 glyph in the string above.

76 The StepCode protocol has the following features:

78 - There is exactly one way to convert internationalized host parts
79 to and from StepCode strings with a script tag/label. It permits
80 different script tags to access the same glyph in [ISO10646],
81 much like a search method into a book library, where the
82 codepoints of [ISO10646] are analogous to ISBN book numbers. Host
83 name part uniqueness is preserved. If there is a difference in the
84 code, it is considered a user input error.

86 - Host parts contain no international glyphs, only US-ASCII.

88 - Names using StepCode have lengths proportionate to the number
89 of glyphs (from ISO 10646) in the names themselves plus the
90 script tag. However, the StepCode for the most frequently used glyphs
91 in the table is shortened significantly, such as for Cyrillic.

93 - This specification allows standard compression or security
94 treatment compatible with existing hostnames.

96 It is important to note that the following sections contain many
97 normative statements with "MUST" and "MUST NOT". Any implementation
98 that does not follow these statements exactly is likely to cause
99 damage to the Internet by creating non-unique representations of
100 hostnames.
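
The digit groups described in Section 1.1 can be illustrated with a small Python sketch. It splits the example string into the leading spelling, the tone digits (one per glyph), and the particle stems with their layout digits. The function name `split_stepcode` and the shape of its return value are assumptions made for illustration only; the draft defines no reference implementation. The sketch does not split the spelling into syllables and consults no phonetic table.

```python
import re


def split_stepcode(unit):
    """Split one StepCode unit (no script tag, no ".com") into its parts.

    Returns (spelling, tones, glyphs): the leading letters, one tone digit
    per glyph, and for each glyph a list of (stem, layout) pairs.
    Hypothetical helper, not defined by the draft.
    """
    m = re.fullmatch(r"([a-z]+)([0-9]+)([a-z0-9]*)", unit.lower())
    if m is None:
        raise ValueError("not a StepCode unit: %r" % unit)
    spelling, tone_digits, rest = m.groups()
    tones = [int(d) for d in tone_digits]

    # Each particle is a run of letters followed by at most one layout digit.
    pairs = re.findall(r"([a-z]+)([0-9]?)", rest)
    if "".join(stem + layout for stem, layout in pairs) != rest:
        raise ValueError("malformed particle section: %r" % rest)

    glyphs, current = [], []
    for stem, layout in pairs:
        current.append((stem, int(layout) if layout else None))
        if layout == "0":  # layout digit 0 closes a stem or a spelling
            glyphs.append(current)
            current = []
    if current:  # shortened form: the last particle has no layout digit
        glyphs.append(current)
    return spelling, tones, glyphs


spelling, tones, glyphs = split_stepcode(
    "xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0")
print(spelling, tones, len(glyphs))
```

For the example string this yields four tone digits and four particle groups, matching the four glyphs of the name.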

102 1.2 Author's Disclaimer

104 This document was written for the convenience of the IDN WG, in
105 case someone believes that there are no agreeable mechanisms for
106 referencing internationalized names without converting [ISO10646]
107 codepoints. The author believes that the purpose of [ISO10646] is to
108 establish a symbol library, and that there are better ways to do high
109 frequency symbol accessing. Display a symbol onto
110 no DNS-based approach can solve the "IDN" problem as it
111 is hoped by users and company/enterprise domain name
112 registrants, and it is possible to avoid adding another coding format,
113 which would further complicate the DNS and risk unknown problems and
114 incompatibilities.

116 1.3 Terminology

118 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
119 and "MAY" in this document are to be interpreted as described in
120 [RFC2119].

122 Hexadecimal values are shown preceded by "0x". For example,
123 "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values
124 are shown preceded by "0b". For example, a nine-bit value might
125 be shown as "0b101101111".

127 Examples in this document use the notation from the Unicode Standard
128 [Unicode3] as well as the ISO 10646 names. For example, the letter
129 "a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".

131 StepCode converts strings at a client site with internationalized
132 characters into strings of US-ASCII that are acceptable as host
133 name parts in current DNS host naming usage. The former are called
134 "pre-converted", with "glyph" denoting a symbol represented by one
135 codepoint in [ISO10646] and "glyphs" a string of such symbols; the
136 latter are called "post-converted".

138 The protocol contains one procedure and calls for standardizing
139 a minimum number of glyphs of a script using the same script
140 tag. Glyphs in this minimum glyph set are called
141 "particles".

143 The protocol uses US-ASCII to denote the phonetic elements of a
144 script and calls for standardizing such a mapping for each
145 script tag. The phonetic elements of a glyph are called the "spelling"
146 of the glyph; those of a particle are called its "stem".

148 The protocol specifies an ASCII Compatible [ACE] Encoding
149 method, using Chinese script as an example to demonstrate its
150 features; it is referred to here as an "ACE" process.

152 1.4 IDN summary

154 Using the terminology in [IDNComp], StepCode specifies an ACE format
155 for arch-2 (send binary or ACE), and arch-3 (just send ACE).

157 The length characteristics of StepCode, discussed above
158 (1.1 Context), vary with users' choices among
159 many factors. StepCode fits well with existing compression and
160 security treatments.

162 It calls for standardizing the phonetic elements and minimum glyph set
163 within each user community (labeled by script tag), while
164 asking the Internet industry to enforce the standard and to provide
165 cross-references from the different script tags into the Unicode
    standard.

167 2. Host Part Transformation

169 According to [STD13], host parts must be case-insensitive, start and
170 end with a letter or digit, and contain only letters, digits, and
171 the hyphen character ("-"). This, of course, excludes any
172 internationalized characters, as well as many other characters in
173 the ASCII character repertoire. Further, domain name parts must be
174 63 octets or shorter in length.

176 2.1 Name tagging

178 All post-converted name parts that contain internationalized
179 characters begin with the string "gl-p-", where "gl" denotes the
180 glyph set or script encoded as specified by [RFC2277], and
181 "p" denotes the phonetic standard used; the string SHOULD be reserved
182 with IANA. The string "gl-p-" will allow 674 scripts and 24 phonetic
183 standards of each to be encoded. The assignment of "gl-p-" shall
184 be defined in future versions of this draft.
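
The [STD13] constraints from Section 2 and the "gl-p-" tag from Section 2.1 can be sketched as two small Python checks. This is a minimal sketch under assumptions: the draft leaves the assignment of "gl-p-" to future versions, so the sketch simply takes "gl" to be two ASCII letters and "p" to be one ASCII letter. The names `is_std13_label` and `split_tag`, and the sample tag value "zh-p-", are hypothetical.

```python
import re

# Letters, digits and hyphens only; must start and end with a letter or digit.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

# Assumed tag shape: two letters (script), one letter (phonetic standard).
_TAG = re.compile(r"([A-Za-z]{2})-([A-Za-z])-")


def is_std13_label(label):
    """True if the host part meets the [STD13] rules quoted in Section 2."""
    return len(label) <= 63 and _LABEL.fullmatch(label) is not None


def split_tag(label):
    """Return (script, phonetic, remainder), or None if no tag is present."""
    m = _TAG.match(label)
    if m is None:
        return None
    return m.group(1).lower(), m.group(2).lower(), label[m.end():]


print(is_std13_label("zh-p-xin1qin"))  # "zh-p-" is a made-up tag value
print(split_tag("zh-p-xin1qin"))
print(split_tag("example"))
```

Because a tagged name part is itself a valid [STD13] host part, both checks can succeed for the same string; the tag only signals how the part is to be interpreted.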

186 Note that a zone administrator MAY still choose to use "gl-p-" at
187 the beginning of a hostname part even if that part does not contain
188 internationalized characters. Zone administrators MAY create
189 host part names that begin with "gl-p-" for which no conversion
190 is done; display systems SHOULD then not convert such parts back to
191 internationalized characters for display.

193 2.2 Converting an internationalized name to an ACE name part

195 To convert a string of internationalized characters into an ACE name
196 part, the following steps MUST be performed in the exact order of
197 the subsections given here.

199 If a name part consists exclusively of characters that conform to
200 the hostname requirements in [STD13] or the string "gl-p-",
201 the name MUST NOT be converted to StepCode. That is, a name part
202 that can be represented without StepCode MUST NOT be changed.
203 This absolute requirement prevents:
204 1. double encoding by both a client handling user keyboard input
205    and a server provider;
206 2. corruption of existing registered domain names;
207 3. two different encodings for a single DNS
208    registered hostname;
209 4. interference with registered glyphs that have more than one
210    phonetic standard, as in Chinese script.

212 If any checking for prohibited name parts (such as for
213 prohibited characters, case-folding, or canonicalization) is to be
214 done, it MUST be done after the conversion to an ACE name
215 part, as specified in [nameprep].

217 Characters outside the first plane of characters (those with
218 codepoints above U+FFFF) MUST be represented using surrogates, as
219 described in the UTF-16 description in ISO 10646.

221 The input name string consists of characters from the ISO 10646
222 character set in big-endian UTF-16 encoding. This is the
223 pre-converted string.
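
The input representation described in Section 2.2 (big-endian UTF-16, with surrogates for codepoints above U+FFFF) can be checked with Python's standard "utf-16-be" codec. The helper name `to_utf16be_units` is an assumption introduced for this sketch.

```python
def to_utf16be_units(name):
    """Return the 16-bit code units of `name` in big-endian UTF-16."""
    data = name.encode("utf-16-be")  # this codec writes no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


# U+0061 LATIN SMALL LETTER A fits in one 16-bit unit.
print([hex(u) for u in to_utf16be_units("a")])            # ['0x61']

# U+20000 lies above U+FFFF and is carried as a surrogate pair.
print([hex(u) for u in to_utf16be_units("\U00020000")])   # ['0xd840', '0xdc00']
```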

225 2.2.1 Check the input string for disallowed names

227 If the input string consists only of characters that conform to the
228 hostname requirements in [STD13], or the input string has
229 a null language tag, the conversion MUST stop with an error.

231 2.2.2 Represent glyphs by their spelling and particle layout.

233 2.2.2.1 StepCode definition for digits

235 Tone marks [Macmillan93]
236   0   no tone
237   1   flat/macron (-)
238   2   rise/acute (/)
239   3   dip/breve (v)
240   4   drop/grave (\)
241   5   throw/circumflex (^)
242   6   thrill/tilde (~)
243   7   dieresis (..)
244   8   cedilla (hook)

246 Particle layout [Ye95]
247   0   end of a stem or a spelling
248   1   to its left
249   2   to its below
250   3   to its inside (an enclosure particle)
251   4   to its outside (normally a center divider)

253 2.2.2.2 StepCode phonetic symbol tables

255 2.2.2.2.1 Chinese

257 Note: This is a list extracted from [Dictionary79] and
258 other sources.

260   Pinyin   Wade    Zhuyin
261   b        p       b
262   p        p/      p
263   m        m       m
264   f        f       f
265   d        t       d
266   t        t/      t
267   n        n       n
268   l        l       l
269   g        k       g
270   k        k/      k
271   h        h       h
272   j        ch      j
273   q        ch/     q
274   x        hs      x
275   zh       ch      zh
276   ch       ch/     ch
277   sh       sh      sh
278   r        j       r
279   z        ts      z
280   c        ts/     c
281   s        s       s
282   a        a       a
283   o        u       o
284   e        e^      e
285   e^       eh      ei
286   i        uh      i
287   u        u       u
288   uo       o       oo
289   u..      u..     u..
290   y        y       y
291   w        w       w
292   v (' spelling separator)

294 2.2.2.2.2 Other scripts to come

296 2.2.3 StepCode Format
297 Format Definition: A StepCode unit is a string of [A-Za-z0-9]
298 characters without any white space (BLANK) in between. For each StepCode unit,
299 data elements are written in "" where the element is required,
300 in [] where the element is optional, and with / where the data is selectable.

302 Sx stands for Spelling of xth glyph;
303 Tx stands for Tone of xth glyph;
304 Py stands for Stem for yth particle;
305 Ly stands for Layout relation from y to y+1;
306 Px.y stands for Stem for xth glyph and its yth particle;
307 Lx.y stands for Layout relation from xth glyph and its y to y+1.

309 2.2.3.1.
One glyph
310 "S""T"[P1][L1][P2][L2]...[Py][0/BLANK]

312 Example: xin1qin
313          xin1qin1jin0

315 2.2.3.2. Glyphs
316 "S1S2S3...Sx""T1T2...Tx"[P1.1][L1.1][P1.2][L1.2]...[P1.y][0]
317                         [P2.1][L2.1][P2.2][L2.2]...[P2.y][0]
318                         ...
319                         [Px.1][Lx.1][Px.2][Lx.2]...[Px.y][0/BLANK]

321 Example with four glyphs:
322 xinzhuqinghua1212
323 xinzhuqinghua1212qin1jin0ge1ge0shui1qing0
324 xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua
325 xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0

327 The four StepCodes are equivalent; which one is used depends on where
328 the name is registered, the size of the database, and whether
329 similar hostnames exist that it conflicts with.

331 2.3. StepCode Encoding Process

333 Either StepCode is obtained from Unicode through
334 a code lookup table, combining the codes of individual glyphs into
335 a glyphs code as shown in 2.2.

337 Or it is input directly from a keyboard, in which case an input
338 processing module is necessary to verify the correctness of the
    intended glyphs.

340 Prepend "gl-p-" or the name of the conversion table used as script tag
341 to the post-converted string; finish. This is the hostname part that
342 can be used in DNS registration as well as resolution.

344 Go through [nameprep], checking for prohibited characters,
345 case-folding, or canonicalization.

347 2.4. Converting a StepCode hostname part to an internationalized name

349 The process has three steps, with the script tag left untouched:

351 Step 1. If a domain name part contains no script tag, then go to Step 3;
352         otherwise enable the conversion table named "glp" from StepCode
353         to Unicode or other code, and obtain the corresponding entries.
354 Step 2. If the corresponding entry is there, then go to Step 3;
355         otherwise decompose the post-converted string into the number
356         of individual glyphs specified in the "T" field;
357         search for each glyph;
358         if any of the glyphs is not found,
359         compose an error message,
360         requesting the missing glyph to be supplied
361         from the sender.
362 Step 3. Display the available glyphs; any missing glyph is shown as
363         its StepCode.

365 3. Security Considerations

367 Much of the security of the Internet relies on the DNS. Thus, any
368 change to the characteristics of the DNS can change the security of
369 much of the Internet. Thus, StepCode makes no changes to the DNS itself.

371 Hostnames are used by users to connect to Internet servers. The
372 security of the Internet would be compromised if a user entering a
373 single internationalized name could be connected to different
374 servers based on different interpretations of the internationalized
375 hostname. Thus the restriction of DNS names to a small symbol set is
376 necessary and effective, whereas adding any other data format, such as
377 UTF-8, only opens the door to security complications.

379 4. Internationalization considerations

381 StepCode is designed so that every internationalized hostname part can
382 be represented as one and only one DNS-compatible string. If there
383 are two different ways to obtain the same glyph on a display device,
384 then they are still two distinct hostnames, with no bearing on
385 security. If there is any way to follow the steps in this
386 document and get two or more different results, it is an error in
387 the domain name registration process, in which one domain name registry
388 fails to update the other registries' servers with a newly registered
389 and well-researched hostname.

391 5. References

393 [ASCII] American National Standards Institute (formerly United
394 States of America Standards Institute), X3.4, 1968, "USA Code for
395 Information Interchange". (ANSI X3.4-1968)

397 [IDNCOMP] "Comparison of Internationalized Domain Name Proposals",
398 draft-ietf-idn-compare-00.txt, June 2000, P. Hoffman.

400 [IDNReq] Zita Wenzel and James Seng, "Requirements of
401 Internationalized Domain Names", draft-ietf-idn-requirements. May 2001.

403 [ISO10646] ISO/IEC 10646-1:2000 (note that an amendment 1 is in
404 preparation), ISO/IEC 10646-2 (in preparation), plus
405 corrigenda and amendments to these standards.

407 [Dictionary79] Beijing Foreign Language Dept., "A Chinese-English
408 Dictionary", 1979, BK# 9017.810.

410 [Macmillan93] The Macmillan Visual Desk Reference, 1993,
411 ISBN 0-02-531310-x.

413 [RFC2277] "IETF Policy on Character Sets and Languages",
414 rfc2277.txt, January 1998, H. Alvestrand.

416 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
417 Requirement Levels", March 1997, RFC 2119.

419 [STD13] Paul Mockapetris, "Domain names - implementation and
420 specification", November 1987, STD 13 (RFC 1035).

422 [UNICODE] The Unicode Consortium, "The Unicode Standard". Described at
423 http://www.unicode.org/unicode/standard/versions/.

425 [UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
426 3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
427 10646-1:2000. Described at http://www.unicode.org/unicode/
428 standard/versions/Unicode3.0.html.

430 [Ye95] Liana Ye, "A Language Oriented Chinese Encoding for
431 Multilingual Computing Environments", in "Proceedings of the 1995
432 International Conference on Computer Processing of Oriental
433 Languages", Page 323.

435 6. Acknowledgements

437 The author has reused existing IDN draft documents and language as
438 much as possible, to demonstrate deep respect for the work that has
439 been done by members of this working group.

441 7. IANA Considerations

443 This document requires IANA action to make language tags available
444 and to register each tag and its sub-field for the phonetic system
445 used.

447 8. Author Contact Information

449 Liana Ye
450 Y&D ISG
451 2607 Read Ave.
452 Belmont, CA 94002, USA.
453 (650) 592-7092
454 liana.ydisg@juno.com

456 Expires November 2001