INTERNET-DRAFT                                             Mark Welter
draft-ietf-idn-dude-02.txt                          Brian W. Spolarich
Expires 2001-Dec-07                                   Adam M. Costello
                                                           2001-Jun-07

            Differential Unicode Domain Encoding (DUDE)

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments to
    the authors or to the idn working group at idn@ops.ietf.org.

Abstract

    DUDE is a reversible transformation from a sequence of nonnegative
    integer values to a sequence of letters, digits, and hyphens (LDH
    characters). DUDE provides a simple and efficient ASCII-Compatible
    Encoding (ACE) of Unicode strings [UNICODE] for use with
    Internationalized Domain Names [IDN] [IDNA].

Contents

    1. Introduction
    2. Terminology
    3. Overview
    4. Base-32 characters
    5. Encoding procedure
    6. Decoding procedure
    7. Example strings
    8. Security considerations
    9. References
    A. Acknowledgements
    B. Author contact information
    C. Mixed-case annotation
    D. Differences from draft-ietf-idn-dude-01
    E. Example implementation

1. Introduction

    The IDNA draft [IDNA] describes an architecture for supporting
    internationalized domain names. Each label of a domain name may
    begin with a special prefix, in which case the remainder of the
    label is an ASCII-Compatible Encoding (ACE) of a Unicode string
    satisfying certain constraints. For the details of the constraints,
    see [IDNA] and [NAMEPREP03].
    The prefix has not yet been specified, but see
    http://www.i-d-n.net/ for prefixes to be used for testing
    and experimentation.

    DUDE is intended to be used as an ACE within IDNA, and has been
    designed to have the following features:

      * Completeness: Every sequence of nonnegative integers maps to an
        LDH string. Restrictions on which integers are allowed, and on
        sequence length, may be imposed by higher layers.

      * Uniqueness: Every sequence of nonnegative integers maps to at
        most one LDH string.

      * Reversibility: Any Unicode string mapped to an LDH string can
        be recovered from that LDH string.

      * Efficient encoding: The ratio of encoded size to original size
        is small. This is important in the context of domain names
        because [RFC1034] restricts the length of a domain label to 63
        characters.

      * Simplicity: The encoding and decoding algorithms are reasonably
        simple to implement. The goals of efficiency and simplicity are
        at odds; DUDE places greater emphasis on simplicity.

    An optional feature is described in appendix C "Mixed-case
    annotation".

2. Terminology

    The key words "must", "shall", "required", "should", "recommended",
    and "may" in this document are to be interpreted as described in
    RFC 2119 [RFC2119].

    LDH characters are the letters A-Z and a-z, the digits 0-9, and
    hyphen-minus.

    A quartet is a sequence of four bits (also known as a nibble or
    nybble).

    A quintet is a sequence of five bits.

    Hexadecimal values are shown preceded by "0x". For example, 0x60
    is decimal 96.

    As in the Unicode Standard [UNICODE], Unicode code points are
    denoted by "U+" followed by four to six hexadecimal digits, while a
    range of code points is denoted by two hexadecimal numbers separated
    by "..", with no prefixes.

    XOR means bitwise exclusive or.
    Given two nonnegative integer values A and B, A XOR B is the
    nonnegative integer value whose binary representation is 1 in
    whichever places the binary representations of A and B disagree,
    and 0 wherever they agree. For the purpose of applying this rule,
    recall that an integer's representation begins with an infinite
    number of unwritten zeros. In some programming languages, care may
    need to be taken that A and B are stored in variables of the same
    type and size.

3. Overview

    DUDE encodes a sequence of nonnegative integral values as a sequence
    of LDH characters, although implementations will of course need to
    represent the output characters somehow, typically as ASCII octets.
    When DUDE is used to encode Unicode characters, the input values are
    Unicode code points (integral values in the range 0..10FFFF, but not
    D800..DFFF, which are reserved for use by UTF-16).

    Each value in the input sequence is represented by one or more LDH
    characters in the encoded string. The value 0x2D is represented
    by hyphen-minus (U+002D). Each non-hyphen-minus character in
    the encoded string represents a quintet. A sequence of quintets
    represents the bitwise XOR between each non-0x2D integer and the
    previous one.

4. Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    A decoder must accept both the uppercase and lowercase forms of
    the base-32 characters (including mixtures of both forms). An
    encoder should output only lowercase forms or only uppercase forms
    (unless it uses the feature described in appendix C "Mixed-case
    annotation").

5. Encoding procedure

    All ordering of bits, quartets, and quintets is big-endian (most
    significant first).
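
    The base-32 table of section 4 and the big-endian ordering stated
    above can be sketched in C as follows. This is a minimal
    illustration, not part of the specification: the names
    sketch_base32 and sketch_diff_to_base32 are hypothetical, and the
    string literal assumes an ASCII-compatible execution character set
    (the example implementation in appendix E avoids that assumption).

```c
#include <assert.h>
#include <stddef.h>

/* Base-32 alphabet from section 4, indexed by quintet value 0..31. */
/* Assumption: the execution character set is ASCII-compatible.     */
static const char sketch_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical helper: writes the base-32 characters for one value */
/* (for example a diff) into out, null-terminated, and returns how  */
/* many characters were written, not counting the terminator.       */
/* out must have room for 2*sizeof(unsigned long) + 1 characters.   */
static size_t sketch_diff_to_base32(unsigned long diff, char out[])
{
  unsigned long tmp;
  size_t k, j;

  assert(out != NULL);

  /* Count the quartets needed (at least one): */
  for (tmp = diff >> 4, k = 1; tmp != 0; tmp >>= 4) ++k;

  /* Fill from the least significant quartet backwards, so that */
  /* the most significant quartet ends up first (big-endian).   */
  out[k] = 0;
  out[k - 1] = sketch_base32[diff & 0xF];            /* last: 0xxxx   */
  for (j = 2; j <= k; ++j) {
    diff >>= 4;
    out[k - j] = sketch_base32[0x10 | (diff & 0xF)]; /* others: 1xxxx */
  }

  return k;
}
```

    For example, 0x60 XOR 0x33 is 0x53, whose quartets 5 and 3 become
    the quintets 10101 and 00011, so sketch_diff_to_base32(0x53, buf)
    yields "xd", the first two characters of example N in section 7.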

    let prev = 0x60
    for each input integer n (in order) do begin
      if n == 0x2D then output hyphen-minus
      else begin
        let diff = prev XOR n
        represent diff in base 16 as a sequence of quartets,
          as few as are sufficient (but at least one)
        prepend 0 to the last quartet and 1 to each of the others
        output a base-32 character corresponding to each quintet
        let prev = n
      end
    end

    If an encoder encounters an input value larger than expected (for
    example, the largest Unicode code point is U+10FFFF, and nameprep
    [NAMEPREP03] can never output a code point larger than U+EFFFD),
    the encoder may either encode the value correctly, or may fail, but
    it must not produce incorrect output. The encoder must fail if it
    encounters a negative input value.

6. Decoding procedure

    let prev = 0x60
    while the input string is not exhausted do begin
      if the next character is hyphen-minus
      then consume it and output 0x2D
      else begin
        consume characters and convert them to quintets until
          encountering a quintet whose first bit is 0
        fail upon encountering a non-base-32 character or end-of-input
        strip the first bit of each quintet
        concatenate the resulting quartets to form diff
        let prev = prev XOR diff
        output prev
      end
    end
    encode the output sequence and compare it to the input string
    fail if they do not match (case-insensitively)

    The comparison at the end is necessary to guarantee the uniqueness
    property (there cannot be two distinct encoded strings representing
    the same sequence of integers). This check also frees the decoder
    from having to check for overflow while decoding the base-32
    characters. (If the decoder is one step of a larger decoding
    process, it may be possible to defer the re-encoding and comparison
    to the end of that larger decoding process.)

7. Example strings

    The first several examples are nonsense strings of mostly unassigned
    code points intended to exercise the corner cases of the algorithm.

    (A) u+0061
        DUDE: b

    (B) u+2C7EF u+2C7EF
        DUDE: u6z2ra

    (C) u+1752B u+1752A
        DUDE: tzxwmb

    (D) u+63AB1 u+63ABA
        DUDE: yv47bm

    (E) u+261AF u+261BF
        DUDE: uyt6rta

    (F) u+C3A31 u+C3A8C
        DUDE: 6v4xb5p

    (G) u+09F44 u+0954C
        DUDE: 39ue4si

    (H) u+8D1A3 u+8C8A3
        DUDE: 27t6dt3sa

    (I) u+6C2B6 u+CC266
        DUDE: y6u7g4ss7a

    (J) u+002D u+002D u+002D u+E848F
        DUDE: ---82w8r

    (K) u+BD08E u+002D u+002D u+002D
        DUDE: 57s8q---

    (L) u+A9A24 u+002D u+002D u+002D u+C05B7
        DUDE: 434we---y393d

    (M) u+7FFFFFFF
        DUDE: z999993r or explicit failure

    The next several examples are realistic Unicode strings that could
    be used in domain names. They exhibit single-row text, two-row
    text, ideographic text, and mixtures thereof. These examples are
    names of Japanese television programs, music artists, and songs,
    merely because one of the authors happened to have them handy.

    (N) 3b (Latin, kanji)
        u+0033 u+5E74 u+0062 u+7D44 u+91D1 u+516B u+5148 u+751F
        DUDE: xdx8whx8tgz7ug863f6s5kuduwxh

    (O) -with-super-monkeys (Latin, kanji, hyphens)
        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
        u+0068 u+002D u+0073 u+0075 u+0070 u+0065 u+0072 u+002D u+006D
        u+006F u+006E u+006B u+0065 u+0079 u+0073
        DUDE: x58jupu8nuy6gt99m-yssctqtptn-tmgftfth-trcbfqtnk

    (P) majikoi5 (Latin, hiragana, kanji)
        u+006D u+0061 u+006A u+0069 u+3067 u+006B u+006F u+0069 u+3059
        u+308B u+0035 u+79D2 u+524D
        DUDE: pnmdvssqvssnegvsva7cvs5qz38hu53r

    (Q) de (Latin, katakana)
        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
        DUDE: vs5bezgxrvs3ibvs2qtiud

    (R) (hiragana, katakana)
        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
        DUDE: vsvpvd7hypuivf4q

8. Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority. If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another. Therefore DUDE is designed
    so that each Unicode string has a unique encoding.

    However, there can still be multiple Unicode representations of the
    "same" text, for various definitions of "same". This problem is
    addressed to some extent by the Unicode standard under the topic of
    canonicalization, and this work is leveraged for domain names by
    "nameprep" [NAMEPREP03].

9. References

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)", draft-ietf-idn-idna-01.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
    "Requirements for Internet Hosts -- Application and Support",
    1989-Oct, RFC 1123.

    [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
    Requirement Levels", 1997-Mar, RFC 2119.

    [SFS] David Mazieres et al, "Self-certifying File System",
    http://www.fs.net/.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

A. Acknowledgements

    The basic encoding of integers to quartets to quintets to base-32
    comes from earlier IETF work by Martin Duerst. DUDE uses a slight
    variation on the idea.

    Paul Hoffman provided helpful comments on this document.

    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
    from SFS [SFS].

B. Author contact information

    Mark Welter
    Brian W. Spolarich
    WALID, Inc.
    State Technology Park
    2245 S. State St.
    Ann Arbor, MI 48104
    +1 734 822 2020

    Adam M. Costello
    University of California, Berkeley
    http://www.cs.berkeley.edu/~amc/

C. Mixed-case annotation

    In order to use DUDE to represent case-insensitive Unicode strings,
    higher layers need to case-fold the Unicode strings prior to DUDE
    encoding. The encoded string can, however, use mixed-case base-32
    (rather than all-lowercase or all-uppercase as recommended in
    section 4 "Base-32 characters") as an annotation telling how to
    convert the folded Unicode string into a mixed-case Unicode string
    for display purposes.

    Each Unicode code point (unless it is U+002D hyphen-minus) is
    represented by a sequence of base-32 characters, the last of which
    is always a letter (as opposed to a digit). If that letter is
    uppercase, it is a suggestion that the Unicode character be mapped
    to uppercase (if possible); if the letter is lowercase, it is a
    suggestion that the Unicode character be mapped to lowercase (if
    possible).

    DUDE encoders and decoders are not required to support these
    annotations, and higher layers need not use them.

    Example: In order to suggest that example O in section 7 "Example
    strings" be displayed as:

        -with-SUPER-MONKEYS

    one could capitalize the DUDE encoding as:

        x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK

D. Differences from draft-ietf-idn-dude-01

    Four changes have been made since draft-ietf-idn-dude-01 (DUDE-01):

      1) DUDE-01 computed the XOR of each integer with the previous one
         in order to decide how many bits of each integer to encode, but
         now the XOR itself is encoded, so there is no need for a mask.

      2) DUDE-01 made the first quintet of each sequence different from
         the rest, while now it is the last quintet that differs, so it's
         easier for the decoder to detect the end of the sequence.

      3) The base-32 map has changed to avoid 0, 1, o, and l, to help
         humans avoid transcription errors.

      4) The initial value of the previous code point has changed from 0
         to 0x60, making the encodings of a few domain names shorter and
         none longer.

E. Example implementation

    /******************************************/
    /* dude.c 0.2.3 (2001-May-31-Thu)         */
    /* Adam M. Costello                       */
    /******************************************/

    /* This is ANSI C code (C89) implementing */
    /* DUDE (draft-ietf-idn-dude-02).         */

    /************************************************************/
    /* Public interface (would normally go in its own .h file): */

    #include <limits.h>

    enum dude_status {
      dude_success,
      dude_bad_input,
      dude_big_output  /* Output would exceed the space provided. */
    };

    enum case_sensitivity { case_sensitive, case_insensitive };

    #if UINT_MAX >= 0x1FFFFF
    typedef unsigned int u_code_point;
    #else
    typedef unsigned long u_code_point;
    #endif

    enum dude_status dude_encode(
      unsigned int input_length,
      const u_code_point input[],
      const unsigned char uppercase_flags[],
      unsigned int *output_size,
      char output[] );

    /* dude_encode() converts Unicode to DUDE (without any */
    /* signature).
       The input must be represented as an array */
    /* of Unicode code points (not code units; surrogate pairs */
    /* are not allowed), and the output will be represented as */
    /* null-terminated ASCII. The input_length is the number of code */
    /* points in the input. The output_size is an in/out argument: */
    /* the caller must pass in the maximum number of characters */
    /* that may be output (including the terminating null), and on */
    /* successful return it will contain the number of characters */
    /* actually output (including the terminating null, so it will be */
    /* one more than strlen() would return, which is why it is called */
    /* output_size rather than output_length). The uppercase_flags */
    /* array must hold input_length boolean values, where nonzero */
    /* means the corresponding Unicode character should be forced */
    /* to uppercase after being decoded, and zero means it is */
    /* caseless or should be forced to lowercase. Alternatively, */
    /* uppercase_flags may be a null pointer, which is equivalent */
    /* to all zeros. The encoder always outputs lowercase base-32 */
    /* characters except when nonzero values of uppercase_flags */
    /* require otherwise. The return value may be any of the */
    /* dude_status values defined above; if not dude_success, then */
    /* output_size and output may contain garbage. On success, the */
    /* encoder will never need to write an output_size greater than */
    /* input_length*k+1 if all the input code points are less than 1 */
    /* << (4*k), because of how the encoding is defined. */

    enum dude_status dude_decode(
      enum case_sensitivity case_sensitivity,
      char scratch_space[],
      const char input[],
      unsigned int *output_length,
      u_code_point output[],
      unsigned char uppercase_flags[] );

    /* dude_decode() converts DUDE (without any signature) to */
    /* Unicode.
       The input must be represented as null-terminated */
    /* ASCII, and the output will be represented as an array of */
    /* Unicode code points. The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it */
    /* must be case_sensitive if case-sensitive comparisons are */
    /* allowed on encoded strings, case_insensitive otherwise. */
    /* The scratch_space must point to space at least as large */
    /* as the input, which will get overwritten (this allows the */
    /* decoder to avoid calling malloc()). The output_length is */
    /* an in/out argument: the caller must pass in the maximum */
    /* number of code points that may be output, and on successful */
    /* return it will contain the actual number of code points */
    /* output. The uppercase_flags array must have room for at */
    /* least output_length values, or it may be a null pointer if */
    /* the case information is not needed. A nonzero flag indicates */
    /* that the corresponding Unicode character should be forced to */
    /* uppercase by the caller, while zero means it is caseless or */
    /* should be forced to lowercase. The return value may be any */
    /* of the dude_status values defined above; if not dude_success, */
    /* then output_length, output, and uppercase_flags may contain */
    /* garbage. On success, the decoder will never need to write */
    /* an output_length greater than the length of the input (not */
    /* counting the null terminator), because of how the encoding is */
    /* defined. */

    /**********************************************************/
    /* Implementation (would normally go in its own .c file): */

    #include <string.h>

    /* Character utilities: */

    /* base32[q] is the lowercase base-32 character representing */
    /* the number q from the range 0 to 31.
       Note that we cannot */
    /* use string literals for ASCII characters because an ANSI C */
    /* compiler does not necessarily use ASCII. */

    static const char base32[] = {
      97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
      109, 110,                                               /* m-n */
      112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
      50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
    };

    /* base32_decode(c) returns the value of a base-32 character, in the */
    /* range 0 to 31, or the constant base32_invalid if c is not a valid */
    /* base-32 character. */

    enum { base32_invalid = 32 };

    static unsigned int base32_decode(char c)
    {
      if (c < 50) return base32_invalid;
      if (c <= 57) return c - 26;
      if (c < 97) c += 32;
      if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
      return c - 97 - (c > 108) - (c > 111);
    }

    /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
    /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
    /* then ASCII A-Z are considered equal to a-z respectively.
    */

    static int unequal( enum case_sensitivity case_sensitivity,
                        const char s1[], const char s2[] )
    {
      char c1, c2;

      if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

      for (;;) {
        c1 = *s1;
        c2 = *s2;
        if (c1 >= 65 && c1 <= 90) c1 += 32;
        if (c2 >= 65 && c2 <= 90) c2 += 32;
        if (c1 != c2) return 1;
        if (c1 == 0) return 0;
        ++s1, ++s2;
      }
    }

    /* Encoder: */

    enum dude_status dude_encode(
      unsigned int input_length,
      const u_code_point input[],
      const unsigned char uppercase_flags[],
      unsigned int *output_size,
      char output[] )
    {
      unsigned int max_out, in, out, k, j;
      u_code_point prev, codept, diff, tmp;
      char shift;

      prev = 0x60;
      max_out = *output_size;

      for (in = out = 0; in < input_length; ++in) {

        /* At the start of each iteration, in and out are the number of */
        /* items already input/output, or equivalently, the indices of */
        /* the next items to be input/output. */

        codept = input[in];

        if (codept == 0x2D) {
          /* Hyphen-minus stands for itself. */
          if (max_out - out < 1) return dude_big_output;
          output[out++] = 0x2D;
          continue;
        }

        diff = prev ^ codept;

        /* Compute the number of base-32 characters (k): */
        for (tmp = diff >> 4, k = 1; tmp != 0; ++k, tmp >>= 4);

        if (max_out - out < k) return dude_big_output;
        shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
        /* shift controls the case of the last base-32 digit. */

        /* Each quintet has the form 1xxxx except the last is 0xxxx. */
        /* Computing the base-32 digits in reverse order is easiest.
        */

        out += k;
        output[out - 1] = base32[diff & 0xF] - shift;

        for (j = 2; j <= k; ++j) {
          diff >>= 4;
          output[out - j] = base32[0x10 | (diff & 0xF)];
        }

        prev = codept;
      }

      /* Append the null terminator: */
      if (max_out - out < 1) return dude_big_output;
      output[out++] = 0;

      *output_size = out;
      return dude_success;
    }

    /* Decoder: */

    enum dude_status dude_decode(
      enum case_sensitivity case_sensitivity,
      char scratch_space[],
      const char input[],
      unsigned int *output_length,
      u_code_point output[],
      unsigned char uppercase_flags[] )
    {
      u_code_point prev, q, diff;
      char c;
      unsigned int max_out, in, out, scratch_size;
      enum dude_status status;

      prev = 0x60;
      max_out = *output_length;

      for (c = input[in = 0], out = 0; c != 0; c = input[++in], ++out) {

        /* At the start of each iteration, in and out are the number of */
        /* items already input/output, or equivalently, the indices of */
        /* the next items to be input/output. */

        if (max_out - out < 1) return dude_big_output;

        if (c == 0x2D) output[out] = c;  /* hyphen-minus is literal */
        else {
          /* Base-32 sequence.
             Decode quintets until 0xxxx is found: */

          for (diff = 0; ; c = input[++in]) {
            q = base32_decode(c);
            if (q == base32_invalid) return dude_bad_input;
            diff = (diff << 4) | (q & 0xF);
            if (q >> 4 == 0) break;
          }

          prev = output[out] = prev ^ diff;
        }

        /* Case of last character determines uppercase flag: */
        if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
      }

      /* Enforce the uniqueness of the encoding by re-encoding */
      /* the output and comparing the result to the input: */

      scratch_size = ++in;
      status = dude_encode(out, output, uppercase_flags,
                           &scratch_size, scratch_space);
      if (status != dude_success || scratch_size != in ||
          unequal(case_sensitivity, scratch_space, input)
         ) return dude_bad_input;

      *output_length = out;
      return dude_success;
    }

    /******************************************************************/
    /* Wrapper for testing (would normally go in a separate .c file): */

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* For testing, we'll just set some compile-time limits rather than */
    /* use malloc(), and set a compile-time option rather than using a */
    /* command-line option.
    */

    enum {
      unicode_max_length = 256,
      ace_max_size = 256,
      test_case_sensitivity = case_insensitive
                              /* suitable for host names */
    };

    static void usage(char **argv)
    {
      fprintf(stderr,
        "%s -e reads code points and writes a DUDE string.\n"
        "%s -d reads a DUDE string and writes code points.\n"
        "Input and output are plain text in the native character set.\n"
        "Code points are in the form u+hex separated by whitespace.\n"
        "A DUDE string is a newline-terminated sequence of LDH characters\n"
        "(without any signature).\n"
        "The case of the u in u+hex is the force-to-uppercase flag.\n"
        , argv[0], argv[0]);
      exit(EXIT_FAILURE);
    }

    static void fail(const char *msg)
    {
      fputs(msg,stderr);
      exit(EXIT_FAILURE);
    }

    static const char too_big[] =
      "input or output is too large, recompile with larger limits\n";
    static const char invalid_input[] = "invalid input\n";
    static const char io_error[] = "I/O error\n";

    /* The following string is used to convert LDH */
    /* characters between ASCII and the native charset: */

    static const char ldh_ascii[] =
      "................"
      "................"
      ".............-.."
      "0123456789......"
      ".ABCDEFGHIJKLMNO"
      "PQRSTUVWXYZ....."
      ".abcdefghijklmno"
      "pqrstuvwxyz";

    int main(int argc, char **argv)
    {
      enum dude_status status;
      int r;
      char *p;

      if (argc != 2) usage(argv);
      if (argv[1][0] != '-') usage(argv);
      if (argv[1][2] != 0) usage(argv);

      if (argv[1][1] == 'e') {
        u_code_point input[unicode_max_length];
        unsigned long codept;
        unsigned char uppercase_flags[unicode_max_length];
        char output[ace_max_size], uplus[3];
        unsigned int input_length, output_size, i;

        /* Read the input code points: */

        input_length = 0;

        for (;;) {
          r = scanf("%2s%lx", uplus, &codept);
          if (ferror(stdin)) fail(io_error);
          if (r == EOF || r == 0) break;

          if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
            fail(invalid_input);
          }

          if (input_length == unicode_max_length) fail(too_big);

          if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
          else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
          else fail(invalid_input);

          input[input_length++] = codept;
        }

        /* Encode: */

        output_size = ace_max_size;
        status = dude_encode(input_length, input, uppercase_flags,
                             &output_size, output);
        if (status == dude_bad_input) fail(invalid_input);
        if (status == dude_big_output) fail(too_big);
        assert(status == dude_success);

        /* Convert to native charset and output: */

        for (p = output; *p != 0; ++p) {
          i = *p;
          assert(i <= 122 && ldh_ascii[i] != '.');
          *p = ldh_ascii[i];
        }

        r = puts(output);
        if (r == EOF) fail(io_error);
        return EXIT_SUCCESS;
      }

      if (argv[1][1] == 'd') {
        char input[ace_max_size], scratch[ace_max_size], *pp;
        u_code_point output[unicode_max_length];
        unsigned char uppercase_flags[unicode_max_length];
        unsigned int input_length, output_length, i;

        /* Read the DUDE input string and convert to ASCII: */

        fgets(input, ace_max_size, stdin);
        if (ferror(stdin)) fail(io_error);
        if (feof(stdin)) fail(invalid_input);
        input_length = strlen(input);
        if (input[input_length - 1] != '\n') fail(too_big);
        input[--input_length] = 0;

        for (p = input; *p != 0; ++p) {
          pp = strchr(ldh_ascii, *p);
          if (pp == 0) fail(invalid_input);
          *p = pp - ldh_ascii;
        }

        /* Decode: */

        output_length = unicode_max_length;
        status = dude_decode(test_case_sensitivity, scratch, input,
                             &output_length, output, uppercase_flags);
        if (status == dude_bad_input) fail(invalid_input);
        if (status == dude_big_output) fail(too_big);
        assert(status == dude_success);

        /* Output the result: */

        for (i = 0; i < output_length; ++i) {
          r = printf("%s+%04lX\n",
                     uppercase_flags[i] ? "U" : "u",
                     (unsigned long) output[i] );
          if (r < 0) fail(io_error);
        }

        return EXIT_SUCCESS;
      }

      usage(argv);
      return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
    }
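
    As a compact companion to the decoding procedure in section 6, the
    quintet loop can also be sketched on its own. This is only an
    illustration under stated assumptions, not a conforming decoder:
    the names sketch_alphabet, sketch_quintet, and sketch_decode are
    hypothetical, the execution character set is assumed to be
    ASCII-compatible, and the sketch deliberately omits the final
    re-encoding comparison and any overflow check, so it does not
    enforce the uniqueness property.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Base-32 alphabet from section 4, indexed by quintet value 0..31. */
/* Assumption: the execution character set is ASCII-compatible.     */
static const char sketch_alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical helper: returns the value (0..31) of a base-32 */
/* character in either case, or 32 if c is not a base-32       */
/* character (including the null terminator).                  */
static unsigned int sketch_quintet(char c)
{
  const char *p;

  if (c == 0) return 32;
  if (c >= 'A' && c <= 'Z') c = (char)(c - 'A' + 'a');
  p = strchr(sketch_alphabet, c);
  return p ? (unsigned int)(p - sketch_alphabet) : 32;
}

/* Hypothetical helper: decodes the null-terminated string s into  */
/* out and returns the number of integers written, or (size_t)-1   */
/* on malformed input or if more than max_out values are needed.   */
/* Omitted on purpose: the re-encode-and-compare step of section 6. */
static size_t sketch_decode(const char s[], unsigned long out[],
                            size_t max_out)
{
  unsigned long prev = 0x60, diff;
  unsigned int q;
  size_t n = 0;

  assert(s != NULL && out != NULL);

  while (*s != 0) {
    if (n == max_out) return (size_t)-1;

    if (*s == '-') {
      out[n++] = 0x2D;  /* hyphen-minus is literal; prev is unchanged */
      ++s;
      continue;
    }

    diff = 0;
    do {
      q = sketch_quintet(*s++);
      if (q == 32) return (size_t)-1;  /* bad character or end of input */
      diff = (diff << 4) | (q & 0xF);  /* strip the first bit, append   */
    } while (q & 0x10);                /* 1xxxx: more quintets follow   */

    prev ^= diff;
    out[n++] = prev;
  }

  return n;
}
```

    For example, sketch_decode("---82w8r", v, 8) yields the four
    values 0x2D, 0x2D, 0x2D, 0xE848F of example J in section 7, while
    a truncated sequence such as "u6" is rejected because its last
    quintet still has the continuation bit set.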