Internet Engineering Task Force (IETF)                       Mark Welter
INTERNET-DRAFT                                        Brian W. Spolarich
draft-ietf-idn-dude-01.txt                                   WALID, Inc.
March 02, 2001                                Expires September 02, 2001

              DUDE: Differential Unicode Domain Encoding

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

The distribution of this document is unlimited.

Copyright (c) The Internet Society (2000). All Rights Reserved.

Abstract

This document describes a transformation method for representing
Unicode character codepoints in host name parts in a fashion that is
completely compatible with the current Domain Name System. It provides
for very efficient representation of typical Unicode sequences as
host name parts, while preserving simplicity. It is proposed as a
potential candidate for an ASCII-Compatible Encoding (ACE) for
supporting the deployment of an internationalized Domain Name System.

Table of Contents

1. Introduction
  1.1 Terminology
2. Hostname Part Transformation
  2.1 Post-Converted Name Prefix
  2.2 Radix Selection
  2.3 Hostname Preparation
  2.4 Definitions
  2.5 DUDE Encoding
    2.5.1 Extended Variable Length Hex Encoding
    2.5.2 DUDE Compression Algorithm
    2.5.3 Forward Transformation Algorithm
  2.6 DUDE Decoding
    2.6.1 Extended Variable Length Hex Decoding
    2.6.2 DUDE Decompression Algorithm
    2.6.3 Reverse Transformation Algorithm
3. Examples
4. Optional Case Preservation
5. Security Considerations
6. Change History
7. References

1. Introduction

DUDE describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
character set (whose character code assignments are synchronized
with Unicode [UNICODE3]), and the procedures for using this scheme
to transform host name parts containing Unicode character sequences
into sequences that are compatible with the current DNS protocol
[STD13]. As such, it satisfies the definition of a 'charset' as
defined in [IDNREQ].

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".

DUDE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted". This specification defines both
a forward and a reverse transformation algorithm.

2. Hostname Part Transformation

According to [STD13], hostname parts must start and end with a letter
or digit, and contain only letters, digits, and the hyphen character
("-"). This, of course, excludes most characters used by non-English
speakers, as well as many other characters in the ASCII character
repertoire. Further, domain name parts must be 63 octets or shorter
in length.

2.1 Post-Converted Name Prefix

This document defines the string 'dq--' as a prefix to identify
DUDE-encoded sequences. For the purposes of comparison in the IDN
Working Group activities, the 'dq--' prefix should be used solely to
identify DUDE sequences. However, should this document proceed beyond
draft status, the prefix should be changed to whatever prefix, if any,
is the final consensus of the IDN working group.

Note that the prepending of a fixed identifier sequence is only one
mechanism for differentiating ASCII character encoded international
domain names from 'ordinary' domain names. One method, as proposed in
[IDNRACE], is to include a character prefix or suffix that does not
appear in any name in any zone file. A second method is to insert a
domain component which pushes off any international names one or more
levels deeper into the DNS hierarchy. There are trade-offs between
these two methods which are independent of the Unicode to ASCII
transcoding method finally chosen. We do not address the international
vs. 'ordinary' name differentiation issue in this paper.

2.2 Radix Selection

There are many proposed methods for representing Unicode characters
within the allowed target character set, which can be split into groups
on the basis of the underlying radix. We have chosen a method with
radix 16 because both UTF-32 and ASCII are represented by even multiples
of four bits.
This allows a Unicode character to be encoded as a whole number of
ASCII characters, and permits easier manipulation of the resulting
encoded data by humans.

2.3 Hostname Preparation

The hostname part is assumed to have at least one character disallowed
by [STD13], and it is assumed that it has been processed for logically
equivalent character mapping, filtering of disallowed characters (if
any), and compatibility composition/decomposition before presentation
to the DUDE conversion algorithm.

While it is possible to invent a transcoding mechanism that relies
on certain Unicode characters being deemed illegal within domain names
and hence available to the transcoding mechanism for improving encoding
efficiency, we feel that such a proposal would complicate matters
excessively.

2.4 Definitions

For clarity:

   'integer' is an unsigned binary quantity;
   'byte' is an 8-bit integer quantity;
   'nibble' is a 4-bit integer quantity.

2.5 DUDE Encoding

The idea behind this scheme is to provide compression by encoding only
the contiguous least significant nibbles of a character that differ
from the preceding character. The scheme uses a variant of the variable
length hex encoding described in [IDNDUERST] and elsewhere. Because
leading zero nibbles are encoded rather than suppressed, the decoder can
recover the differential length. The encoding is, with some practice,
easy to perform manually.

2.5.1 Extended Variable Length Hex Encoding

The variable length hex encoding algorithm was introduced by Duerst in
[IDNDUERST]. It encodes an integer value in a slight modification of
traditional hexadecimal notation, the difference being that the most
significant digit is represented with an alternate set of "digits":
'g' through 'v' are used to represent 0 through 15. The result is a
variable length encoding which can efficiently represent integers of
arbitrary length.
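As an illustration of the basic encoding just described (before any
extension), the following is a minimal sketch and is not taken from
[IDNDUERST]. The function name vlhex_encode is hypothetical. It assumes
that leading zero nibbles are suppressed and that the most significant
remaining nibble is mapped into 'g' through 'v'.

```cpp
#include <string>

// Hypothetical helper: basic variable length hex encoding of v.
// Leading zero nibbles are suppressed.  The most significant remaining
// nibble is drawn from "ghijklmnopqrstuv"; all following nibbles are
// drawn from "0123456789abcdef".
std::string vlhex_encode(unsigned int v)
{
    static const char *lead = "ghijklmnopqrstuv";
    static const char *hex  = "0123456789abcdef";

    int n = 1;                              // significant nibbles in v
    for (unsigned int t = v >> 4; t != 0; t >>= 4)
        n++;

    std::string out;
    out += lead[(v >> (4 * (n - 1))) & 0x0F];   // most significant nibble
    for (int i = n - 2; i >= 0; i--)            // remaining nibbles
        out += hex[(v >> (4 * i)) & 0x0F];
    return out;
}
```

Under these assumptions, vlhex_encode(0x645) yields "m45" and
vlhex_encode(0) yields "g". Because the lead alphabet and the hex
alphabet do not overlap, a decoder can tell where each encoded integer
begins without any separator.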

This specification extends the variable length hex encoding algorithm
to support the compression scheme defined below by potentially not
suppressing leading zero nibbles.

The extended variable length nibble encoding of an integer, C,
to length N, is defined as follows:

   1. Let I be the Nth least significant nibble of C, that is, the
      most significant of the N nibbles to be encoded;

   2. Emit the Ith character of the sequence [ghijklmnopqrstuv];

   3. Continue from the most to least significant, encoding each
      remaining nibble J by emitting the Jth character of the
      sequence [0123456789abcdef].

2.5.2 DUDE Compression Algorithm

   1. Let PREV = 0;

   2. If there are no more characters in the input, terminate
      successfully;

   3. Let C be the next character in the input;

   4. If C != '-', then go to step 6;

   5. Consume the input character, emit '-', and go to step 2;

   6. Let D be the result of PREV exclusive ORed with C;

   7. Find the least positive value N such that
      D bitwise ANDed with M is zero,
      where M = the bitwise complement of (16**N) - 1;

   8. Let V be C ANDed with the bitwise complement of M;

   9. Variable length hex encode V to length N and emit the result;

   10. Consume the input character, let PREV = C, and go to step 2.

2.5.3 Forward Transformation Algorithm

The DUDE transformation algorithm accepts a string in UTF-32
[UNICODE3] format as input. It is assumed that prior nameprep
processing has disallowed the private use code points in
0x100000 through 0x10FFFF, so that we are left with the task of
encoding 20 bit integers. The encoding algorithm is as follows:

   1. Break the hostname string into dot-separated hostname parts.
      For each hostname part which contains one or more characters
      disallowed by [STD13], perform steps 2 and 3 below;

   2. Compress the hostname part using the method described in section
      2.5.2 above, and encode using the encoding described in section
      2.5.1;

   3. Prepend the post-converted name prefix 'dq--' (see section 2.1
      above) to the resulting string.

2.6 DUDE Decoding

2.6.1 Extended Variable Length Hex Decoding

Decoding extended variable length hex encoded strings is identical
to standard variable length hex decoding, and is defined as follows:

   1. Let CL be the lower case of the first input character.
      If CL is not in the set [ghijklmnopqrstuv],
         return error,
      else
         consume the input character;

   2. Let R = CL - 'g',
      Let N = 1;

   3. If no more input characters exist, go to step 9;

   4. Let CL be the lower case of the next input character;

   5. If CL is not in the set [0123456789abcdef], go to step 9;

   6. Consume the next input character,
      Let N = N + 1,
      Let R = R * 16;

   7. If CL is in the set [0123456789],
         then let R = R + (CL - '0')
         else let R = R + (CL - 'a') + 10;

   8. Go to step 3;

   9. Let MASK be the bitwise complement of (16**N) - 1;

   10. Return decoded result R as well as MASK.

2.6.2 DUDE Decompression Algorithm

   1. Let PREV = 0;

   2. If there are no more input characters then terminate
      successfully;

   3. Let C be the next input character;

   4. If C == '-', append '-' to the result string, consume the
      character, and go to step 2;

   5. Let VPART, MASK be the next extended variable length hex decoded
      value and mask;

   6. If VPART > 0xFFFFF then return error status;

   7. Let CU = (PREV bitwise-AND MASK) + VPART,
      Let PREV = CU;

   8. Append the UTF-32 character CU to the result string;

   9. Go to step 2.

2.6.3 Reverse Transformation Algorithm

   1. Break the string into dot-separated components and apply steps
      2 and 3 to each component that begins with the post-converted
      name prefix;

   2. Remove the post-converted name prefix 'dq--' (see section 2.1);

   3. Decompress the component using the decompression algorithm
      described above (which in turn invokes the decoding algorithm
      also described above);

   4. Concatenate the decoded segments with dot separators and return.

3. Examples

The examples below illustrate the encoding algorithm. Allowed RFC1035
characters, including period [U+002E] and dash [U+002D], are shown as
literals in the UTF-16 version of the example. DUDE is compared to
LACE as proposed in [IDNLACE]. A comprehensive comparison of ACE
proposals is outside of the scope of this document. However, we believe
that DUDE shows a good balance between efficiency (resulting in shorter
ACE sequences for typical names) and complexity.

3.1 'www.walid.com' [Arabic]:

   UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
           U+0634 U+0631 U+0643 U+0629

   DUDE: dq--m45oij9.dq--m48kqif.dq--m34hk3i9

   LACE: bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe

3.2 'Abugazalah-Intellectual-Property.com' [Arabic]:

   UTF-16: U+0623 U+0628 U+0648 U+063A U+0632 U+0627 U+0644 U+0629 -
           U+0644 U+0644 U+0645 U+0644 U+0643 U+064A U+0629 - U+0627
           U+0644 U+0641 U+0643 U+0631 U+064A U+0629 . U+0634 U+0631
           U+0643 U+0629

   DUDE: dq--m23ok8jaii7k4i9-m44klkjqi9-m27k4hjj1kai9.dq--m34hk3i9

   LACE: bq--badcgkcihizcorbjaeac2bygircekrcdjiuqcabna4dcorcbimyuuki.
         bq--aqddimkdfe

3.3 'King-Hussain.person.jr' [Arabic]:

   UTF-16: U+0627 U+0644 U+0645 U+0644 U+0643 - U+062D U+0633 U+064A
           U+0646 . U+0634 U+062E U+0635 . U+0627 U+0644 U+0623 U+0631
           U+062F U+0646

   DUDE: dq--m27k4lkj-m2dj3kam.dq--m34iej5.dq--m27k4i3j1ifk6

   LACE: bq--audcorcfirbqcabnaudegljtjjda.bq--amddilrv.
         bq--aydcorbdgexum

3.4 'Jordanian-Dental-Center.com.jr' [Arabic]:

   UTF-16: U+0645 U+0631 U+0643 U+0632 - U+0627 U+0644 U+0623 U+0631
           U+062F U+0646 - U+0644 U+0644 U+0623 U+0633 U+0646 U+0627
           U+0646 . U+0634 U+0631 U+0643 U+0629 . U+0627 U+0644 U+0623
           U+0631 U+062F U+0646

   DUDE: dq--m45j1k3j2-m27k4i3j1ifk6-m44ki3j3k6i7k6.dq--m34hk3i9.
         dq--m27k4i3j1ifk6

   LACE: bq--aqdekmkdgiaqaligaytuiizrf5dacabna4deirbdgndcorq.
         bq--aqddimkdfe.bq--aydcorbdgexum

3.5 'Mahindra.com' [Hindi]:

   UTF-16: U+092E U+0939 U+093F U+0928 U+094D U+0926 U+094D U+0930
           U+093E . U+0935 U+094D U+092F U+093E U+092A U+093E U+0930

   DUDE: dq--p2ej9vi8kdi6kdj0u.dq--p35kdifjeiajeg

   LACE: bq--bees4oj7fbgsmtjqhy.bq--a4etktjphyvd4ma

3.6 'Webdunia.com' [Hindi]:

   UTF-16: U+0935 U+0947 U+092C U+0926 U+0941 U+0928 U+093F U+092F
           U+093E . U+0935 U+094D U+092F U+093E U+092A U+093E U+0930

   DUDE: dq--p35k7icmk1i8jfifje.dq--p35kdifjeiajeg

   LACE: bq--beetkrzmezasqpzphy.bq--a4etktjphyvd4ma

3.7 'Chinese Finance.com' [Traditional Chinese]:

   UTF-16: U+4E2D U+83EF U+8CA1 U+7D93 . c o m

   DUDE: dq--ke2do3efsa1nd93.com

   LACE: bq--75hc3a7prsqx3ey.com

3.8 'Chinese Readers.net' [Chinese]:

   UTF-16: U+842C U+7DAD U+8B80 U+8005 . U+7DB2 U+7D61

   DUDE: dq--o42cndadob80g05.dq--ndb2m1

   LACE: bq--76ccy7nnroaiabi.bq--aj63eyi

3.9 'Russian-Standard.com.ru' [Russian]:

   UTF-16: U+0440 U+0443 U+0441 U+0441 U+043A U+0438 U+0439 -
           U+0441 U+0442 U+0430 U+043D U+0434 U+0430 U+0440 U+0442 .
           U+043A U+043E U+043C . U+0440 U+0444

   DUDE: dq--k40jhhjaop-k3ausk1ij0tkgk0i.dq--k3aus.dq--k40k

   LACE: bq--a4ceaq2bie5dqoibaawqqbcbiiyd2nbqibba.bq--amcdupr4.
         bq--aiceara

3.10 'Vladimir-Putin.person.ru' [Russian]:

   UTF-16: U+0432 U+043B U+0430 U+0434 U+0438 U+043C U+0438 U+0440 -
           U+043F U+0443 U+0442 U+0438 U+043D . U+043B U+0438 U+0447
           U+043D U+043E U+0441 U+0442 U+044C . U+0440 U+0444 U+0020

   DUDE: dq--k32rgkosok0-k3fk3ij8t.dq--k3bok7jduk1is.dq--k40k

   LACE: bq--bacdeozqgq4dyocaaeac2bieh5bueob5.
         bq--bacdwochhu7ecqsm.bq--aiceara

4. Optional Case Preservation

An extension to the DUDE concept recognizes that the first
character emitted by the variable length hex encoding algorithm is
always alphabetic. We encode the case (if any) of the original Unicode
character in the case of the initial "hex" character.
Because the DNS performs case-insensitive comparisons, mixed case
international domain names behave in exactly the same way as
traditional domain names. In particular, this enables reverse lookups
to return names in the preferred case.

In contrast to other proposals as of this writing, such a
case-preserving version of DUDE will interoperate with the
non-case-preserving version.

Despite the foregoing, we feel that the additional complexity of
tracking character case through the nameprep processing is not
warranted by the marginal utility of the result.

5. Security Considerations

Much of the security of the Internet relies on the DNS and any
change to the characteristics of the DNS may change the security of
much of the Internet. Therefore DUDE makes no changes to the DNS itself.

DUDE is designed so that distinct Unicode sequences map to distinct
domain name sequences (modulo the Unicode and DNS equivalence rules).
Therefore use of DUDE with DNS will not negatively affect security below
the application level.

If an application has security reliance on the Unicode string S,
produced by an inverse ACE transformation of a name T, the application
must verify that the nameprepped and ACE encoded result of S is
DNS-equivalent to T.

6. Change History

The statement that we intended to submit a Nameprep draft was removed in
light of the changes made between the first and second nameprep drafts.

The details of DUDE extensions for case preservation etc. have been
removed. Basic DUDE was changed to operate over the relevant 20 bit
UTF-32 code points.

Examples have been extended.

ACE security issues were clarified.

7. References

[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare;

[IDNRACE] Paul Hoffman, "RACE: Row-Based ASCII Compatible Encoding for
IDN", draft-ietf-idn-race;

[IDNLACE] Mark Davis, "LACE: Length-Based ASCII Compatible Encoding for
IDN", draft-ietf-idn-lace;

[IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement;

[IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep;

[IDNDUERST] M. Duerst, "Internationalization of Domain Names",
draft-duerst-dns-i18n;

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
a technical corrigendum have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. 17 other amendments are
currently at various stages of standardization;

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119;

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035);

[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
.

A. Acknowledgements

The structure (and some of the structural text) of this document is
intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace-00)
by Mark Davis and Paul Hoffman.

B. IANA Considerations

There are no IANA considerations in this document.

C. Author Contact Information

Mark Welter
Brian W. Spolarich
WALID, Inc.
State Technology Park
2245 S. State St.
Ann Arbor, MI 48104
+1-734-822-2020

mwelter@walid.com
briansp@walid.com

D. DUDE C++ Implementation

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <limits.h>

#define IDN_ERROR INT_MIN

#define DUDETAG "dq--"

typedef unsigned int uchar_t;

/*
   idn_isRFC1035 returns true if every character of the input is a
   letter, digit, hyphen or period.
*/
bool idn_isRFC1035(const uchar_t * in, int len)
{
    const uchar_t * end = in + len;

    while (in < end)
    {
        // strchr() also matches the terminating NUL, so reject 0 here
        if ((*in == 0) || (*in > 127) ||
            !strchr("abcdefghijklmnopqrstuvwxyz0123456789-.", tolower(*in)))
            return false;
        in++;
    }
    return true;
}

static const char *hexchar = "0123456789abcdef";
static const char *leadchar = "ghijklmnopqrstuv";

/*
   dudehex -- convert an integer, v, into n DUDE hex characters.
   The result is placed in ostr. The buffer ends at the byte before
   eop, and false is returned to indicate insufficient buffer space.
*/
static bool dudehex(char * & ostr, const char * eop,
                    unsigned int v, int n)
{
    if ((ostr + n) >= eop)
        return false;

    n--;    // convert to zero origin

    *ostr++ = leadchar[(v >> (n << 2)) & 0x0F];

    while (n > 0)
    {
        n--;
        *ostr++ = hexchar[(v >> (n << 2)) & 0x0F];
    }
    return true;
}

/*
   idn_dudeseg converts istr, a utf-32 domain name segment into DUDE.
   eip points at the character after the input segment.
   ostr points at an output buffer which ends just before eop.
   If there is insufficient buffer space, the function returns false.
   Invalid surrogate sequences will also cause a return of false.
*/
static bool idn_dudeseg(const uchar_t * istr, const uchar_t * eip,
                        char * & ostr, char * eop)
{
    const uchar_t * ip = istr;
    unsigned p = 0;

    while (ip < eip)
    {
        if (*ip == '-')
        {
            if (ostr >= eop)        // no room left for the hyphen
                return false;
            *ostr++ = '-';
        }
        else    // if (validnc(*ip))
        {
            unsigned int c = *ip;

            unsigned d = p ^ c;     // d now has the difference (xor)
                                    // between the current and previous char

            int n = 1;              // Count the number of significant nibbles
            while (d >>= 4)
                n++;

            if (!dudehex(ostr, eop, c, n))
                return false;       // insufficient buffer space
            p = c;
        }
        ip++;
    }
    *ostr = 0;
    return true;
}

/*
   idn_UTF32toDUDE converts a UTF-32 domain name into DUDE.
   in, a UTF-32 vector of length inlen is the input domain name.
   outstr is a char output buffer of length outmax.
   On success, the number of output characters is returned.
   On failure, a negative number is returned.

   It is assumed that the input has been nameprepped.

   If this routine is used in a registration context, segment and
   overall length restrictions must be checked by the user.
*/

int idn_UTF32toDUDE(const uchar_t * in, int inlen, char *outstr, int outmax)
{
    const uchar_t *ip = in;
    const uchar_t *eip = in + inlen;
    const uchar_t *ep = ip;
    char *op = outstr;
    char *eop = outstr + outmax - 1;

    while (ip < eip)
    {
        ep = ip;
        while ((ep < eip) && (*ep != '.'))
            ep++;

        if (idn_isRFC1035(ip, ep - ip))
        {
            // already a legal host name part: copy verbatim, no tag
            if ((ep - ip) >= (eop - op))
            {
                *outstr = '\0';
                return IDN_ERROR;
            }
            while (ip < ep)
                *op++ = *ip++;
        }
        else
        {
            const char * tagp = DUDETAG;    // prefix the segment
            while (*tagp)                   // with the tag (dq--)
            {
                if (op >= eop)
                {
                    *outstr = '\0';
                    return IDN_ERROR;
                }
                *op++ = *tagp++;
            }

            if (!idn_dudeseg(ip, ep, op, eop))
            {
                *outstr = '\0';
                return IDN_ERROR;
            }
        }

        if (op >= eop)  // check for output buffer overflow
        {
            *outstr = '\0';
            return IDN_ERROR;
        }
        if (ep < eip)
            *op++ = *ep;    // copy '.'

        ip = ep + 1;
    }

    *op = '\0';

    return (int) (op - outstr);     // characters written, excluding NUL
}

/*
   idn_DUDEsegtoUTF32 converts instr, a DUDE encoded domain name segment
   (without the tag), into UTF-32.
   inlen is the number of characters in the input segment.
   outstr points at an output buffer with room for maxlen code points.
   On success, the number of code points written is returned.
   If there is insufficient buffer space, IDN_ERROR is returned.
*/
static int idn_DUDEsegtoUTF32(const char * instr, int inlen,
                              uchar_t * outstr, int maxlen)
{
    const char * ip = instr;
    const char * eip = instr + inlen;
    uchar_t * op = outstr;
    uchar_t * eop = op + maxlen - 1;

    unsigned prev = 0;

    while (ip < eip)
    {
        if (*ip == '-')
        {
            if (op >= eop)
                return IDN_ERROR;
            *op++ = '-';
            ip++;                   // consume the hyphen
        }
        else
        {
            char c0 = tolower((unsigned char) *ip);
            if ((c0 < 'g') || (c0 > 'v'))
                return IDN_ERROR;   // not a valid lead character

            ip++;

            unsigned r = c0 - 'g';
            int n = 1;
            while (ip < eip)
            {
                char cl = tolower((unsigned char) *ip);
                if ((cl >= '0') && (cl <= '9'))
                {
                    r <<= 4;
                    r += cl - '0';
                }
                else if ((cl >= 'a') && (cl <= 'f'))
                {
                    r <<= 4;
                    r += (cl - 'a') + 10;
                }
                else
                    break;

                ip++;
                n++;
            }

            // code points are 20 bit integers: at most five nibbles
            if ((n > 5) || (r > 0x0fffff))
                return IDN_ERROR;

            unsigned mask = ~0u << (n << 2);

            unsigned cu = (prev & mask) + r;
            prev = cu;

            if (op >= eop)
                return IDN_ERROR;
            *op++ = cu;
        }
    }
    *op = '\0';
    return (int) (op - outstr);
}

/*
   idn_DUDEtoUTF32 converts a DUDE domain name into UTF-32.
   Segments that do not carry the tag are copied verbatim.
   On success, the number of code points written is returned.
   On failure, a negative number is returned.
*/
int idn_DUDEtoUTF32(const char * in, int inlen, uchar_t * outstr, int outmax)
{
    const char *ip = in;
    const char *eip = in + inlen;
    const char *ep = ip;
    uchar_t *op = outstr;
    uchar_t *eop = outstr + outmax - 1;

    while (ip < eip)
    {
        ep = ip;
        while ((ep < eip) && (*ep != '.'))
            ep++;

        const char * tip = ip;
        const char * tagp = DUDETAG;
        while (*tagp && (tip < ep) &&
               (tolower(*tagp) == tolower((unsigned char) *tip)))
        {
            tip++;
            tagp++;
        }

        if (*tagp)
        {   // tag doesn't match, copy segment verbatim
            while (ip < ep)
            {
                if (op >= eop)
                    return IDN_ERROR;
                *op++ = *ip++;
            }
        }
        else
        {
            ip = tip;
            int rv = idn_DUDEsegtoUTF32(ip, ep - ip, op, eop - op);

            if (rv < 0)
                return IDN_ERROR;

            op += rv;
        }

        if (ep < eip)               // copy the '.' separator, if any
        {
            if (op >= eop)
                return IDN_ERROR;
            *op++ = *ep;
        }

        ip = ep + 1;
    }

    *op = '\0';                     // op never passes eop above

    return (int) (op - outstr);     // code points written, excluding NUL
}

/*
   DUDE test driver
*/

void printres(const char *title, int rv, char *buff);
void printres(const char *title, int rv, uchar_t *buff);

int main(int argc, char *argv[])
{
    char inbuff[512];

    while (fgets(inbuff, sizeof(inbuff), stdin))
    {
        char cbuff[128];
        uchar_t wbuff[128];
        uchar_t iwbuff[128];
        uchar_t *wsp = wbuff;
        unsigned int in;
        int nr;

        // each input line holds whitespace-separated hex code points
        char * inp = inbuff;
        while ((wsp < wbuff + 128) &&
               (sscanf(inp, "%x%n", &in, &nr) > 0))
        {
            inp += nr;
            *wsp++ = in;
        }
        fprintf(stdout, "\n");

        int rv;
        rv = idn_UTF32toDUDE(wbuff, wsp - wbuff, cbuff, sizeof(cbuff));
        printres("toDUDE", rv, cbuff);

        if (rv >= 0)
        {
            // the output size is counted in elements, not bytes
            rv = idn_DUDEtoUTF32(cbuff, rv, iwbuff,
                                 sizeof(iwbuff) / sizeof(iwbuff[0]));
            printres("toUTF32", rv, iwbuff);
        }

    }
    return 0;
}

void printres(const char *title, int rv, char *buff)
{
    fprintf(stdout, "%s (%d) : ", title, rv);
    if (rv >= 0)
    {
        unsigned char *dp = (unsigned char *) buff;
        while (*dp)
        {
            fprintf(stdout, "%c", *dp++);
        }
    }
    fprintf(stdout, "\n");
}

void printres(const char *title, int rv, uchar_t *buff)
{
    fprintf(stdout, "%s (%d) : ", title, rv);
    if (rv >= 0)
    {
        uchar_t *dp = buff;
        while (*dp)
        {
            fprintf(stdout, " %05x", *dp++);
        }
    }
    fprintf(stdout, "\n");
}
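
For readers who want to check a single label by hand, the following is
a compact sketch of the per-label compression and decompression steps
of sections 2.5.2 and 2.6.2. It is separate from the listing above and
is not a replacement for it. The names codepoints, dude_compress and
dude_decompress are hypothetical; std::string and std::vector stand in
for the raw buffers; the 'dq--' prefix and the splitting on dots are
omitted. It assumes, as in the algorithm text, that a hyphen leaves the
previous-character state unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<unsigned int> codepoints;   // UTF-32 code points

// Hypothetical helper: compress one hostname part (no prefix added).
std::string dude_compress(const codepoints &in)
{
    static const char *lead = "ghijklmnopqrstuv";
    static const char *hex  = "0123456789abcdef";
    std::string out;
    unsigned int prev = 0;

    for (std::size_t i = 0; i < in.size(); i++) {
        unsigned int c = in[i];
        if (c == '-') {             // hyphen passes through, prev is kept
            out += '-';
            continue;
        }
        int n = 1;                  // low nibbles that differ from prev
        for (unsigned int d = (prev ^ c) >> 4; d != 0; d >>= 4)
            n++;
        out += lead[(c >> (4 * (n - 1))) & 0x0F];
        for (int k = n - 2; k >= 0; k--)
            out += hex[(c >> (4 * k)) & 0x0F];
        prev = c;
    }
    return out;
}

// Hypothetical helper: inverse of dude_compress.
// Returns false on malformed input.
bool dude_decompress(const std::string &in, codepoints &out)
{
    unsigned int prev = 0;
    std::size_t i = 0;
    out.clear();

    while (i < in.size()) {
        if (in[i] == '-') {
            out.push_back('-');
            i++;
            continue;
        }
        int c = std::tolower((unsigned char) in[i]);
        if (c < 'g' || c > 'v')
            return false;           // every value starts with g..v
        unsigned int r = c - 'g';
        int n = 1;
        i++;
        while (i < in.size()) {     // collect following hex digits
            int h = std::tolower((unsigned char) in[i]);
            if (h >= '0' && h <= '9')
                r = (r << 4) + (h - '0');
            else if (h >= 'a' && h <= 'f')
                r = (r << 4) + (h - 'a' + 10);
            else
                break;
            i++;
            n++;
        }
        if (n > 5)
            return false;           // more than 20 bits
        unsigned int mask = ~0u << (4 * n);
        prev = (prev & mask) | r;   // keep high nibbles of prev
        out.push_back(prev);
    }
    return true;
}
```

Under these assumptions, the first label of example 3.1 (U+0645 U+0648
U+0642 U+0639) compresses to "m45oij9", and decompressing that string
yields the same four code points. Only the first code point needs three
characters; each later one differs from its predecessor in one or two
low nibbles.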