idnits 2.17.1

draft-ietf-idn-altdude-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The abstract seems to contain references ([UNICODE], [DUDE01],
     [RFC1123], [RFC952], [IDN]), which it shouldn't. Please replace those
     with straight textual mentions of the documents in question.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '--out' is mentioned on line 778, but not defined

  == Missing Reference: '-1' is mentioned on line 836, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 898

  -- Looks like a reference, but probably isn't: '1' on line 951

  -- Looks like a reference, but probably isn't: '2' on line 899

  == Unused Reference: 'RFC1034' is defined on line 551, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEM00'

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEO00'

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-01

  -- Possible downref: Normative reference to a draft: ref. 'DUDE01'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC 952

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5'

     Summary: 6 errors (**), 0 flaws (~~), 6 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1   INTERNET-DRAFT                                         Adam M. Costello
2   draft-ietf-idn-altdude-00.txt                               2001-Mar-19
3   Expires 2001-Sep-19

5   AltDUDE version 0.0.2

7   Status of this Memo

9   This document is an Internet-Draft and is in full conformance with
10  all provisions of Section 10 of RFC2026.

12  Internet-Drafts are working documents of the Internet Engineering
13  Task Force (IETF), its areas, and its working groups. Note
14  that other groups may also distribute working documents as
15  Internet-Drafts.

17  Internet-Drafts are draft documents valid for a maximum of six
18  months and may be updated, replaced, or obsoleted by other documents
19  at any time. It is inappropriate to use Internet-Drafts as
20  reference material or to cite them other than as "work in progress."

22  The list of current Internet-Drafts can be accessed at
23  http://www.ietf.org/ietf/1id-abstracts.txt

25  The list of Internet-Draft Shadow Directories can be accessed at
26  http://www.ietf.org/shadow.html

28  Distribution of this document is unlimited. Please send comments
29  to the author at amc@cs.berkeley.edu, or to the idn working
30  group at idn@ops.ietf.org. A non-paginated (and possibly
31  newer) version of this specification may be available at
32  http://www.cs.berkeley.edu/~amc/charset/altdude

34  Abstract

36  DUDE [DUDE01] by Mark Welter and Brian Spolarich is an
37  ASCII-Compatible Encoding (ACE) of Unicode strings, and AltDUDE is a
38  slight variation on it that is conceptually simpler.

40  AltDUDE is a reversible map from a sequence of unsigned integers
41  (intended to be Unicode code points) to a sequence of letters (A-Z,
42  a-z), digits (0-9), and hyphen-minus (-), henceforth called LDH
43  characters. Such a map might be useful for internationalized domain
44  names [IDN], because host name labels are currently restricted to
45  LDH characters by [RFC952] and [RFC1123].

47  Besides domain names, there might also be other contexts where it is
48  useful to transform Unicode [UNICODE] code points (or any unsigned
49  integers that exhibit locality) into "safe" (delimiter-free)
50  ASCII characters. (If other contexts consider hyphen-minus to be
51  unsafe, it can trivially be eliminated, or replaced by a different
52  character, like underscore.)

53  Contents

55      Differences from DUDE
56      Features
57      Name
58      Overview
59      Base-32 characters
60      Encoding procedure
61      Decoding procedure
62      Signature
63      Case sensitivity models
64      Comparison with other ACEs
65      Example strings
66      Security considerations
67      Credits
68      References
69      Author
70      Example implementation

72  Differences from DUDE

74  AltDUDE differs from DUDE in four respects:

76  1) DUDE computes the XOR of each integer and the previous in order
77     to decide how many bits of each integer to encode, whereas
78     AltDUDE encodes the XOR itself, so there is no need for a mask.

80  2) DUDE makes the first quintet of each sequence different from the
81     rest, while AltDUDE makes the last quintet different, so it's
82     easier for the decoder to detect the end of the sequence.

84  3) AltDUDE uses a base-32 map that avoids 0, 1, o, and l, to help
85     humans avoid transcription errors.

87  4) AltDUDE uses 96 rather than 0 as the initial value of the
88     previous code point. For domain names, this makes a few
89     encodings one character shorter and makes none longer.

91  Features

93  Uniqueness: Every sequence of integers maps to at most one LDH
94  string.

96  Completeness: Every sequence of integers maps to an LDH string.
97  Restrictions on which integers are allowed, and on sequence length,
98  may be imposed by higher layers.

100 Efficient encoding: The ratio of encoded size to original size is
101 small. This is important in the context of domain names because
102 [RFC1034] restricts the length of a domain label to 63 characters.

104 Simplicity: The encoding and decoding algorithms are reasonably
105 simple to implement. The goals of efficiency and simplicity are at
106 odds; AltDUDE places greater emphasis on simplicity.

108 Case-preservation: If a Unicode string has been case-folded prior
109 to encoding, it is possible to record the case information in the
110 case of the letters in the encoding, allowing a mixed-case Unicode
111 string to be recovered if desired, but a case-insensitive comparison
112 of two encoded strings is equivalent to a case-insensitive
113 comparison of the Unicode strings. This feature is optional; see
114 section "Case sensitivity models".

116 Name

118 AltDUDE is a working name that should be changed if it is adopted.
119 Rather than waste good names on experimental proposals, let's
120 wait until one proposal is chosen, then assign it a good name.
121 Suggestions (assuming the primary use is in domain names):

123     DUDE (if the DUDE authors wish to adopt this algorithm)
124     UniHost
125     UTF-D ("D" for "domain names")
126     UTF-33 (there are 33 characters in the output repertoire)

128 Overview

130 AltDUDE encodes unsigned integers as characters, although
131 implementations will of course need to represent the output
132 characters somehow, usually as bytes or other code units. When
133 AltDUDE is used to encode Unicode characters, the integers are the
134 corresponding Unicode code points, not UTF-16 surrogates.

136 Each integer is represented by an integral number of characters in
137 the encoded string. There is no intermediate bit string or octet
138 string.

140 Integers with value 45 are represented by hyphen-minus characters
141 (45 is the Unicode code point for hyphen-minus). Each
142 non-hyphen-minus character in the encoded string represents five
143 bits (a "quintet"). A sequence of quintets represents the bitwise
144 XOR between each non-45 integer and the previous one.

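As a rough illustration of the paragraphs above, the following C sketch
encodes a sequence of integers by XORing each one with its predecessor
and emitting the result as quintets. The names sketch_base32 and
sketch_encode are hypothetical and are not part of this draft's example
implementation. The sketch ignores case flags and assumes an ASCII
execution character set, so it is a simplified reading of the Overview
rather than a reference encoder.

```c
#include <assert.h>
#include <stddef.h>

/* Base-32 alphabet of this draft (0, 1, o and l are omitted).     */
/* Assumption: the compiler's execution character set is ASCII.    */
static const char sketch_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* sketch_encode() is a hypothetical helper for illustration only. */
/* It writes the encoding of in[0..len-1] to out as a              */
/* null-terminated string and returns the number of characters     */
/* written (not counting the terminator), or -1 if cap is too      */
/* small. Case flags are not handled: output is always lowercase.  */
static long sketch_encode(const unsigned long *in, size_t len,
                          char *out, size_t cap)
{
  unsigned long prev = 96, diff, tmp;
  size_t pos = 0, n, i, k;

  assert(out != NULL && cap > 0);

  for (i = 0; i < len; ++i) {
    if (in[i] == 45) {
      /* hyphen-minus stands for itself; prev is left unchanged */
      if (cap - pos < 2) return -1;
      out[pos++] = '-';
      continue;
    }

    diff = prev ^ in[i];

    /* n = number of nybbles needed for diff (at least one) */
    for (n = 1, tmp = diff >> 4; tmp != 0; tmp >>= 4) ++n;
    if (cap - pos < n + 1) return -1;

    /* Fill the quintets from last to first. Only the last */
    /* quintet of a sequence has its high bit clear.       */
    for (k = n; k-- > 0; diff >>= 4) {
      out[pos + k] =
        sketch_base32[(k == n - 1 ? 0 : 0x10) | (diff & 0xF)];
    }

    pos += n;
    prev = in[i];
  }

  out[pos] = '\0';
  return (long) pos;
}
```

For instance, under these assumptions, passing the code points of
example (S), "-> $1.00 <-" (45 62 32 36 49 46 48 48 32 60 45), should
yield "-xqtqetftrtqatatn-", the string listed for that example.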
146 The exception for 45 and hyphen-minus is useful for domain names,
147 but could be dropped in other contexts, or replaced by a different
148 exception.

150 Base-32 characters

152     "a" =  0 = 0x00 = 00000     "s" = 16 = 0x10 = 10000
153     "b" =  1 = 0x01 = 00001     "t" = 17 = 0x11 = 10001
154     "c" =  2 = 0x02 = 00010     "u" = 18 = 0x12 = 10010
155     "d" =  3 = 0x03 = 00011     "v" = 19 = 0x13 = 10011
156     "e" =  4 = 0x04 = 00100     "w" = 20 = 0x14 = 10100
157     "f" =  5 = 0x05 = 00101     "x" = 21 = 0x15 = 10101
158     "g" =  6 = 0x06 = 00110     "y" = 22 = 0x16 = 10110
159     "h" =  7 = 0x07 = 00111     "z" = 23 = 0x17 = 10111
160     "i" =  8 = 0x08 = 01000     "2" = 24 = 0x18 = 11000
161     "j" =  9 = 0x09 = 01001     "3" = 25 = 0x19 = 11001
162     "k" = 10 = 0x0A = 01010     "4" = 26 = 0x1A = 11010
163     "m" = 11 = 0x0B = 01011     "5" = 27 = 0x1B = 11011
164     "n" = 12 = 0x0C = 01100     "6" = 28 = 0x1C = 11100
165     "p" = 13 = 0x0D = 01101     "7" = 29 = 0x1D = 11101
166     "q" = 14 = 0x0E = 01110     "8" = 30 = 0x1E = 11110
167     "r" = 15 = 0x0F = 01111     "9" = 31 = 0x1F = 11111

169 The digits "0" and "1" and the letters "o" and "l" are not used, to
170 avoid transcription errors.

172 All decoders must recognize both the uppercase and lowercase
173 forms of the base-32 characters. The case may or may not convey
174 information, as described in section "Case sensitivity models".

176 Encoding procedure

178 All ordering of nybbles and quintets is big-endian (most significant
179 first). A nybble is 4 bits. XOR is bitwise exclusive or.

181     let prev = 96
182     for each input integer n (in order) do begin
183       if n == 45 then output hyphen-minus
184       else begin
185         let diff = prev XOR n
186         extract the least significant nybbles of diff, as few as are
187         sufficient to hold all the nonzero bits (but at least one)
188         prepend 0 to the last nybble and 1 to the rest
189         output base-32 characters corresponding to the quintets
190         let prev = n
191       end
192     end

194 The encoder must either correctly handle all integer values that can
195 be represented in the type of its input, or it must check whether
196 the input contains values that it cannot handle and return an error
197 if so. Under no circumstances may it produce incorrect output.

199 Decoding procedure

201     let prev = 96
202     while the input string is not exhausted do begin
203       if the next character is hyphen-minus then output 45
204       else begin
205         input characters and convert them to quintets until
206         encountering a quintet beginning with 0
207         fail upon encountering a non-base-32 character or end-of-input
208         strip the first bit of each quintet
209         concatenate the resulting nybbles to form diff
210         let prev = prev XOR diff
211         output prev
212       end
213     end
214     encode the output sequence and compare it to the input string
215     fail if they are not equal

217 The comparison at the end must be case-insensitive if ACEs are
218 always compared case-insensitively (which is true of domain names),
219 case-sensitive otherwise. See also section "Case sensitivity
220 models". This check is necessary to guarantee the uniqueness
221 property (there cannot be two distinct encoded strings representing
222 the same sequence of integers). This check also frees the decoder
223 from having to check for overflow while decoding the base-32
224 characters.

226 Signature

228 The issue of how to distinguish ACE strings from unencoded strings
229 is largely orthogonal to the encoding scheme itself, and is
230 therefore not specified here.
    In the context of domain name labels,
231 a standard prefix and/or suffix (chosen to be unlikely to occur
232 naturally) would presumably be attached to ACE labels. (In that
233 case, it would probably be good to forbid the encoding of Unicode
234 strings that appear to match the signature, to avoid confusing
235 humans about whether they are looking at a Unicode string or an ACE
236 string.)

238 In order to use AltDUDE in domain names, the choice of signature
239 must be mindful of the requirement in [RFC952] that labels never
240 begin or end with hyphen-minus. The raw encoded string will
241 begin or end with a hyphen-minus iff the Unicode string does. If
242 the Unicode strings are forbidden from beginning or ending with
243 hyphen-minus (which seems prudent anyway), then there is no problem.
244 Otherwise, the signature must consist of both a prefix and a suffix.

246 It appears that "---" is extremely rare in domain names; among the
247 four-character prefixes of all the second-level domains under .com,
248 .net, and .org, "---" never appears at all. Therefore, perhaps the
249 signature should be of the form ?--- (prefix) or ---? (suffix),
250 where ? could be "u" for Unicode, or "i" for internationalized, or
251 "a" for ACE, or maybe "q" or "z" because they are rare.

253 Case sensitivity models

255 The higher layer must choose one of the following four models.

257 Models suitable for domain names:

259 * Case-insensitive: Before a string is encoded, all its non-LDH
260   characters must be case-folded so that any strings differing
261   only in case become the same string (for example, strings could
262   be forced to lowercase). Folding LDH characters is optional.
263   The case of base-32 characters and literal-mode characters is
264   arbitrary and not significant. Comparisons between encoded
265   strings must be case-insensitive. The original case of non-LDH
266   characters cannot be recovered from the encoded string.

268 * Case-preserving: The case of the Unicode characters is not
269   considered significant, but it can be preserved and recovered,
270   just like in non-internationalized host names. Before a string
271   is encoded, all its non-LDH characters must be case-folded
272   as in the previous model. LDH characters are naturally able
273   to retain their case attributes because they are encoded
274   literally. The case attribute of a non-LDH character is
275   recorded in the last of the base-32 characters that represent
276   it, which is guaranteed to be a letter rather than a digit.
277   If the base-32 character is uppercase, it means the Unicode
278   character is caseless or should be forced to uppercase after
279   being decoded (which is a no-op if the case folding already
280   forces to uppercase). If the base-32 character is lowercase,
281   it means the Unicode character is caseless or should be forced
282   to lowercase after being decoded (which is a no-op if the case
283   folding already forces to lowercase). The case of the other
284   base-32 characters in a multi-quintet encoding is arbitrary
285   and not significant. Only uppercase and lowercase attributes
286   can be recorded, not titlecase. Comparisons between encoded
287   strings must be case-insensitive, and are equivalent to
288   case-insensitive comparisons between the Unicode strings. The
289   intended mixed-case Unicode string can be recovered as long as
290   the encoded characters are unaltered, but altering the case of
291   the encoded characters is not harmful--it merely alters the case
292   of the Unicode characters, and such a change is not considered
293   significant.

295   In this model, the input to the encoder and the output of the
296   decoder can be the unfolded Unicode string (in which case the
297   encoder and decoder are responsible for performing the case
298   folding and recovery), or can be the folded Unicode string
299   accompanied by separate case information (in which case the
300   higher layer is responsible for performing the case folding and
301   recovery). Whichever layer performs the case recovery must
302   first verify that the Unicode string is properly folded, to
303   guarantee the uniqueness of the encoding.

305   It is not very difficult to extend the nameprep algorithm
306   [NAMEPREP03] to remember case information.

308 The case-insensitive and case-preserving models are interoperable.
309 If a domain name passes from a case-preserving entity to a
310 case-insensitive entity, the case information will be lost, but
311 the domain name will still be equivalent. This phenomenon already
312 occurs with non-internationalized domain names.

314 Models unsuitable for domain names, but possibly useful in other
315 contexts:

317 * Case-sensitive: Unicode strings may contain both uppercase and
318   lowercase characters, which are not folded. Base-32 characters
319   must be lowercase. Comparisons between encoded strings must be
320   case-sensitive.

322 * Case-flexible: Like case-preserving, except that the choice
323   of whether the case of the Unicode characters is considered
324   significant is deferred. Therefore, base-32 characters must
325   be lowercase, except for those used to indicate uppercase
326   Unicode characters. Comparisons between encoded strings may be
327   case-sensitive or case-insensitive, and such comparisons are
328   equivalent to the corresponding comparisons between the Unicode
329   strings.

331 Comparison with other ACEs

333 The differences between AltDUDE and DUDE were given in section
334 "Differences from DUDE".
    For a comparison between DUDE and other
335 ACEs, please see the AMC-ACE-O specification [AMCACEO00].

337 Example strings

339 The first several examples are all translations of the sentence "Why
340 can't they just speak in <language>?" (courtesy of Michael Kaplan's
341 "provincial" page [PROVINCIAL]). Word breaks and punctuation have
342 been removed, as is often done in domain names.

344 (A) Arabic (Egyptian):
345     U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
346     U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

348     AltDUDE: yueqpcycrcyjhbpznpitjycxf

350 (B) Chinese (simplified):
351     U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

353     AltDUDE: w85gvk7g9k2iwf6x9j6x7ju54k

355 (C) Czech: Pročprostěnemluvíčesky

357     č = U+010D
358     ě = U+011B
359     í = U+00ED

360     AltDUDE: tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc

362 (D) Hebrew:
363     U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
364     U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
365     U+05D1 U+05E8 U+05D9 U+05EA

367     AltDUDE: x5nckajvjpvnpenqpcvjvbevrvdvjvbvd

369 (E) Hindi:
370     U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
371     U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
372     U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
373     U+0939 U+0948 U+0902 (Devanagari)

375     AltDUDE: 3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzq\
376     zbwk

378 (F) Japanese:
379     U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
380     U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
381     (kanji and hiragana)

383     AltDUDE: vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf

385 (G) Korean:
386     U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
387     U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
388     U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables)

390     AltDUDE: 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8si\
391     tusauiyz5i23az96iz6ze3xaz2td96ry3si

393 (H) Russian:
394     U+041F U+043E
    U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
395     U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
396     U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
397     U+0438 (Cyrillic)

399     AltDUDE: wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc

401 (I) Spanish: PorquénopuedensimplementehablarenEspañol

403     é = U+00E9
404     ñ = U+00F1

406     AltDUDE: tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb\
407     3a3qd

409 (J) Taiwanese:
410     U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

412     AltDUDE: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k

414 (K) Vietnamese:
415     Tạisaohọkhôngthểchỉ\
416     nóitiếngViệt
417     COMBINING DOT BELOW = U+0323
418     ô = U+00F4
419     ê = U+00EA
420     COMBINING HOOK ABOVE = U+0309
421     COMBINING ACUTE ACCENT = U+0301

423     AltDUDE: tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
424     yitptp2dv8mvyrjtBtr2dv6jvxh

426 The next several examples are all names of Japanese music artists,
427 song titles, and TV programs, just because the author happens to
428 have them handy (but Japanese is useful for providing examples
429 of single-row text, two-row text, ideographic text, and various
430 mixtures thereof).

432 (L) 3B (Japanese TV program title) 434 = U+5E74 (kanji) 435 = U+7D44 (kanji) 436 = U+91D1 U+516B U+5148 U+751F (kanji) 438 AltDUDE: xdx8whx8tGz7ug863f6s5kuduwxh 440 (M) -with-SUPER-MONKEYS (Japanese music group name) 442 = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji) 444 AltDUDE: x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK 446 (N) Hello-Another-Way- (Japanese song title) 448 = U+305D U+308C U+305E U+308C U+306E (hiragana) 449 = U+5834 U+6240 (kanji) 451 AltDUDE: Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze 453 (O) 2 (Japanese TV program title) 455 = U+3072 U+3068 U+3064 (hiragana) 456 = U+5C4B U+6839 (kanji) 457 = U+306E (hiragana) 458 = U+4E0B (kanji) 460 AltDUDE: vstctkny6urvwzcx2xhz8yfw8vj 462 (P) MajiKoi5 (Japanese song title) 464 = U+3067 (hiragana) 465 = U+3059 U+308B (hiragana) 466 = U+79D2 U+524D (kanji) 468 AltDUDE: PnmdvssqvssNegvsva7cvs5qz38hu53r 470 (Q) de (Japanese song title) 472 = U+30D1 U+30D5 U+30A3 U+30FC (katakana) 473 = U+30EB U+30F3 U+30D0 (katakana) 474 AltDUDE: vs5bezgxrvs3ibvs2qtiud 476 (R) (Japanese song title) 478 = U+305D U+306E (hiragana) 479 = U+30B9 U+30D4 U+30FC U+30C9 (katakana) 480 = U+3067 (hiragana) 482 AltDUDE: vsvpvd7hypuivf4q 484 The last example is an ASCII string that breaks not only the 485 existing rules for host name labels but also the rules proposed in 486 [NAMEPREP03] for internationalized domain names. 488 (S) -> $1.00 <- 490 AltDUDE: -xqtqetftrtqatatn- 492 Security considerations 494 Users expect each domain name in DNS to be controlled by a single 495 authority. If a Unicode string intended for use as a domain label 496 could map to multiple ACE labels, then an internationalized domain 497 name could map to multiple ACE domain names, each controlled by 498 a different authority, some of which could be spoofs that hijack 499 service requests intended for another. Therefore AltDUDE is 500 designed so that each Unicode string has a unique encoding. 
502 However, there can still be multiple Unicode representations of the
503 "same" text, for various definitions of "same". This problem is
504 addressed to some extent by the Unicode standard under the topic
505 of canonicalization, but some text strings may be misleading or
506 ambiguous to humans when used as domain names, such as strings
507 containing dots, slashes, at-signs, etc. These issues are being
508 further studied under the topic of "nameprep" [NAMEPREP03].

510 Credits

512 AltDUDE reuses a number of preexisting techniques.

514 The basic encoding of integers to nybbles to quintets to base-32
515 comes from UTF-5 [UTF5], and the particular variant used here comes
516 from AMC-ACE-M [AMCACEM00].

518 The idea of avoiding 0, 1, o, and l in base-32 strings was taken
519 from SFS [SFS].

521 From DUDE (of which the latest version is [DUDE01]) comes the idea
522 of encoding differences between successive integers. The idea
523 of using the alphabetic case of base-32 characters to record the
524 desired case of the Unicode characters was suggested by this author,
525 but in DUDE it was first applied to the UTF-5-style encoding.

527 References

529 [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
530 draft-ietf-idn-amc-ace-m-00.

532 [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19,
533 draft-ietf-idn-amc-ace-o-00.

535 [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
536 Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

538 [IDN] Internationalized Domain Names (IETF working group),
539 http://www.i-d-n.net/, idn@ops.ietf.org.

541 [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
542 of Internationalized Host Names", 2001-Feb-24,
543 draft-ietf-idn-nameprep-03.

545 [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
546 http://www.trigeminal.com/samples/provincial.html.

548 [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
549 Table Specification", 1985-Oct, RFC 952.

551 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
552 1987-Nov, RFC 1034.

554 [RFC1123] Internet Engineering Task Force, R. Braden (editor),
555 "Requirements for Internet Hosts -- Application and Support",
556 1989-Oct, RFC 1123.

558 [SFS] David Mazieres et al, "Self-certifying File System",
559 http://www.fs.net/.

561 [UNICODE] The Unicode Consortium, "The Unicode Standard",
562 http://www.unicode.org/unicode/standard/standard.html.

564 [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
565 Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

567 Author

569 Adam M. Costello
570 http://www.cs.berkeley.edu/~amc/

572 See also the authors of DUDE [DUDE01].

574 Example implementation

576 /******************************************/
577 /* altdude.c 0.0.2 (2001-Mar-19-Sun) */
578 /* Adam M. Costello */
579 /******************************************/

581 /* This is ANSI C code (C89) implementing AltDUDE */
582 /* (draft-ietf-idn-altdude-00), a simplified variant */
583 /* of DUDE (draft-ietf-idn-dude-01). */

584 /************************************************************/
585 /* Public interface (would normally go in its own .h file): */

587 #include <limits.h>

589 enum altdude_status {
590   altdude_success,
591   altdude_invalid_input,
592   altdude_output_too_big
593 };

595 enum case_sensitivity { case_sensitive, case_insensitive };

597 #if UINT_MAX >= 0x1FFFFF
598 typedef unsigned int u_code_point;
599 #else
600 typedef unsigned long u_code_point;
601 #endif

603 enum altdude_status altdude_encode(
604   unsigned int input_length,
605   const u_code_point *input,
606   const unsigned char *uppercase_flags,
607   unsigned int *output_size,
608   char *output );

610 /* altdude_encode() converts Unicode to AltDUDE (without any */
611 /* signature).
    The input must be represented as an array */
612 /* of Unicode code points (not code units; surrogate pairs */
613 /* are not allowed), and the output will be represented as */
614 /* null-terminated ASCII. The input_length is the number of code */
615 /* points in the input. The output_size is an in/out argument: */
616 /* the caller must pass in the maximum number of characters */
617 /* that may be output (including the terminating null), and on */
618 /* successful return it will contain the number of characters */
619 /* actually output (including the terminating null, so it will be */
620 /* one more than strlen() would return, which is why it is called */
621 /* output_size rather than output_length). The uppercase_flags */
622 /* array must hold input_length boolean values, where nonzero */
623 /* means the corresponding Unicode character should be forced */
624 /* to uppercase after being decoded, and zero means it is */
625 /* caseless or should be forced to lowercase. Alternatively, */
626 /* uppercase_flags may be a null pointer, which is equivalent */
627 /* to all zeros. The encoder always outputs lower case base-32 */
628 /* characters except when nonzero values of uppercase_flags */
629 /* require otherwise. The return value may be any of the */
630 /* altdude_status values defined above; if not altdude_success, */
631 /* then output_size and output may contain garbage. On success, */
632 /* the encoder will never need to write an output_size greater */
633 /* than input_length*k+1 if all the input code points are less */
634 /* than 1 << (4*k), because of how the encoding is defined. */

635 enum altdude_status altdude_decode(
636   enum case_sensitivity case_sensitivity,
637   char *scratch_space,
638   const char *input,
639   unsigned int *output_length,
640   u_code_point *output,
641   unsigned char *uppercase_flags );

643 /* altdude_decode() converts AltDUDE (without any signature) to */
644 /* Unicode.
    The input must be represented as null-terminated */
645 /* ASCII, and the output will be represented as an array of */
646 /* Unicode code points. The case_sensitivity argument influences */
647 /* the check on the well-formedness of the input string; it */
648 /* must be case_sensitive if case-sensitive comparisons are */
649 /* allowed on encoded strings, case_insensitive otherwise. */
650 /* The scratch_space must point to space at least as large */
651 /* as the input, which will get overwritten (this allows the */
652 /* decoder to avoid calling malloc()). The output_length is */
653 /* an in/out argument: the caller must pass in the maximum */
654 /* number of code points that may be output, and on successful */
655 /* return it will contain the actual number of code points */
656 /* output. The uppercase_flags array must have room for at least */
657 /* output_length values, or it may be a null pointer if the case */
658 /* information is not needed. A nonzero flag indicates that the */
659 /* corresponding Unicode character should be forced to uppercase */
660 /* by the caller, while zero means it is caseless or should be */
661 /* forced to lowercase. The return value may be any of the */
662 /* altdude_status values defined above; if not altdude_success, */
663 /* then output_length, output, and uppercase_flags may contain */
664 /* garbage. On success, the decoder will never need to write */
665 /* an output_length greater than the length of the input (not */
666 /* counting the null terminator), because of how the encoding is */
667 /* defined. */

669 /**********************************************************/
670 /* Implementation (would normally go in its own .c file): */

672 #include <string.h>

674 /* Character utilities: */

676 /* is_AtoZ(c) returns 1 if c is an */
677 /* uppercase ASCII letter, zero otherwise.
    */

679 static unsigned char is_AtoZ(char c)
680 {
681   return c >= 65 && c <= 90;
682 }

684 /* base32[n] is the lowercase base-32 character representing */
685 /* the number n from the range 0 to 31. Note that we cannot */
686 /* use string literals for ASCII characters because an ANSI C */
687 /* compiler does not necessarily use ASCII. */

688 static const char base32[] = {
689   97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */
690   109, 110, /* m-n */
691   112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */
692   50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */
693 };

695 /* base32_decode(c) returns the value of a base-32 character, in the */
696 /* range 0 to 31, or the constant base32_invalid if c is not a valid */
697 /* base-32 character. */

699 enum { base32_invalid = 32 };

701 static unsigned int base32_decode(char c)
702 {
703   if (c < 50) return base32_invalid;
704   if (c <= 57) return c - 26;
705   if (c < 97) c += 32;
706   if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
707   return c - 97 - (c > 108) - (c > 111);
708 }

710 /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
711 /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
712 /* then ASCII A-Z are considered equal to a-z respectively. */

714 static int unequal(
715   enum case_sensitivity case_sensitivity, const char *s1, const char *s2 )
716 {
717   char c1, c2;

719   if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

721   for (;;) {
722     c1 = *s1;
723     c2 = *s2;
724     if (c1 >= 65 && c1 <= 90) c1 += 32;
725     if (c2 >= 65 && c2 <= 90) c2 += 32;
726     if (c1 != c2) return 1;
727     if (c1 == 0) return 0;
728     ++s1, ++s2;
729   }
730 }

732 /* altdude_initial_code_point is the initial value of the */
733 /* "previous" code point, before the first code point.
*/ 735 static const u_code_point altdude_initial_code_point = 96; 736 /* Encoder: */ 738 enum altdude_status altdude_encode( 739 unsigned int input_length, 740 const u_code_point *input, 741 const unsigned char *uppercase_flags, 742 unsigned int *output_size, 743 char *output ) 744 { 745 unsigned int next_in, next_out, max_out, n, out; 746 u_code_point prev, codept, diff, tmp; 747 char shift; 749 prev = altdude_initial_code_point; 750 max_out = *output_size; 751 next_out = 0; 753 for (next_in = 0; next_in < input_length; ++next_in) { 754 codept = input[next_in]; 756 if (codept == 45) { 757 /* hyphen-minus stands for itself */ 758 if (max_out - next_out < 1) return altdude_output_too_big; 759 output[next_out++] = 45; 760 continue; 761 } 763 shift = uppercase_flags && uppercase_flags[next_in] ? 32 : 0; 764 /* shift will determine the case of the last base-32 digit */ 765 diff = prev ^ codept; 766 for (tmp = diff >> 4, n = 1; tmp != 0; ++n, tmp >>= 4); 767 /* n is the number of base-32 digits */ 768 if (max_out - next_out < n) return altdude_output_too_big; 770 /* Computing the base-32 digits in reverse order is easiest. */ 771 /* Only the last base-32 digit has the high bit clear. 
*/ 773 out = next_out + n - 1; 774 output[out] = base32[diff & 0xF] - shift; 776 while (out > next_out) { 777 diff >>= 4; 778 output[--out] = base32[0x10 | (diff & 0xF)]; 779 } 781 next_out += n; 782 prev = codept; 783 } 785 /* null terminator: */ 786 if (max_out - next_out < 1) return altdude_output_too_big; 787 output[next_out++] = 0; 788 *output_size = next_out; 789 return altdude_success; 790 } 791 /* Decoder: */ 793 enum altdude_status altdude_decode( 794 enum case_sensitivity case_sensitivity, 795 char *scratch_space, 796 const char *input, 797 unsigned int *output_length, 798 u_code_point *output, 799 unsigned char *uppercase_flags ) 800 { 801 u_code_point prev, q, diff; 802 const char *in; 803 char c; 804 unsigned int next_out, max_out, input_size, scratch_size; 805 enum altdude_status status; 807 prev = altdude_initial_code_point; 808 max_out = *output_length; 809 next_out = 0; 810 in = input; 812 for (c = *in; c != 0; ) { 813 if (max_out - next_out < 1) return altdude_output_too_big; 815 if (c == 45) { 816 /* hyphen-minus stands for itself */ 817 output[next_out] = 45; 818 if (uppercase_flags) uppercase_flags[next_out] = 0; 819 ++next_out; 820 c = *++in; 821 continue; 822 } 824 /* Base-32 sequence: */ 826 diff = 0; 828 do { 829 q = base32_decode(c); 830 if (q == base32_invalid) return altdude_invalid_input; 831 diff = (diff << 4) | (q & 0xF); 832 c = *++in; 833 } while (q >> 4 == 1); 835 /* case of last digit determines uppercase flag: */ 836 if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]); 837 prev = output[next_out++] = prev ^ diff; 838 } 839 /* Re-encode the output and compare to the input: */ 841 input_size = in - input + 1; 842 scratch_size = input_size; 843 status = altdude_encode(next_out, output, uppercase_flags, 844 &scratch_size, scratch_space); 845 if (status != altdude_success || 846 scratch_size != input_size || 847 unequal(case_sensitivity, scratch_space, input) 848 ) return altdude_invalid_input; 850 *output_length = 
next_out; 851 return altdude_success; 852 } 854 /******************************************************************/ 855 /* Wrapper for testing (would normally go in a separate .c file): */ 857 #include 858 #include 859 #include 860 #include 862 /* For testing, we'll just set some compile-time limits rather than */ 863 /* use malloc(), and set a compile-time option rather than using a */ 864 /* command-line option. */ 866 enum { 867 unicode_max_length = 256, 868 ace_max_size = 256, 869 test_case_sensitivity = case_insensitive /* suitable for host names */ 870 }; 872 static void usage(char **argv) 873 { 874 fprintf(stderr, 875 "%s -e reads big-endian UTF-32 and writes AltDUDE ASCII.\n" 876 "%s -d reads AltDUDE ASCII and writes big-endian UTF-32.\n" 877 "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n" 878 , argv[0], argv[0]); 879 exit(EXIT_FAILURE); 880 } 882 static void fail(const char *msg) 883 { 884 fputs(msg,stderr); 885 exit(EXIT_FAILURE); 886 } 888 static const char too_big[] = 889 "input or output is too large, recompile with larger limits\n"; 890 static const char invalid_input[] = "invalid input\n"; 891 static const char io_error[] = "I/O error\n"; 892 int main(int argc, char **argv) 893 { 894 enum altdude_status status; 895 int r; 897 if (argc != 2) usage(argv); 898 if (argv[1][0] != '-') usage(argv); 899 if (argv[1][2] != '\0') usage(argv); 901 if (argv[1][1] == 'e') { 902 u_code_point input[unicode_max_length]; 903 unsigned char uppercase_flags[unicode_max_length]; 904 char output[ace_max_size]; 905 unsigned int input_length, output_size; 906 int c0, c1, c2, c3; 908 /* Read the UTF-32 input string: */ 910 input_length = 0; 912 for (;;) { 913 c0 = getchar(); 914 c1 = getchar(); 915 c2 = getchar(); 916 c3 = getchar(); 917 if (ferror(stdin)) fail(io_error); 919 if (c1 == EOF || c2 == EOF || c3 == EOF) { 920 if (c0 != EOF) fail("input not a multiple of 4 bytes\n"); 921 break; 922 } 924 if (input_length == unicode_max_length) fail(too_big); 
      if ((c0 != 0 && c0 != 0x80)
          || c1 < 0 || c1 > 0x10
          || c2 < 0 || c2 > 0xFF
          || c3 < 0 || c3 > 0xFF ) {
        fail(invalid_input);
      }

      input[input_length] = ((u_code_point) c1 << 16) |
        ((u_code_point) c2 << 8) | (u_code_point) c3;
      uppercase_flags[input_length] = (c0 >> 7);
      ++input_length;
    }

    /* Encode, and output the result: */

    output_size = ace_max_size;
    status = altdude_encode(input_length, input, uppercase_flags,
                            &output_size, output);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);
    r = fputs(output,stdout);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size];
    u_code_point output[unicode_max_length], codept;
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int output_length, i;

    /* Read the AltDUDE ASCII input string: */

    /* fgets() leaves the buffer untouched if it hits end-of-file */
    /* before reading anything, so start with an empty string.    */
    input[0] = 0;
    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (!feof(stdin)) fail(too_big);

    /* Decode, and output the result: */

    output_length = unicode_max_length;
    status = altdude_decode(test_case_sensitivity, scratch, input,
                            &output_length, output, uppercase_flags);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);

    for (i = 0; i < output_length; ++i) {
      r = putchar(uppercase_flags[i] ? 0x80 : 0);
      if (r == EOF) fail(io_error);
      codept = output[i];
      r = putchar(codept >> 16);
      if (r == EOF) fail(io_error);
      r = putchar((codept >> 8) & 0xFF);
      if (r == EOF) fail(io_error);
      r = putchar(codept & 0xFF);
      if (r == EOF) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

INTERNET-DRAFT expires 2001-Sep-19
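The core step of the encoder above (XOR the code point with the previous one, then emit the difference four bits at a time, setting the high bit on every base-32 digit except the last) may be easier to follow in isolation. The following is only an illustrative sketch, not part of the reference implementation: the names delta_to_base32 and digits32 are hypothetical, the uppercase flag and the hyphen-minus special case are omitted, and, unlike the reference code, it assumes the C execution character set is ASCII so that a string literal can serve as the digit table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical digit table: same order as base32[] in the      */
/* reference code (a-k, m-n, p-z, 2-9).  Assumes an ASCII       */
/* execution character set, which the reference code does not.  */
static const char digits32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical helper: writes the base-32 digits that encode   */
/* the difference between prev and codept into buf, followed by */
/* a null terminator, and returns the number of digits.  buf    */
/* must have room for at least 9 chars.                         */
static size_t delta_to_base32(unsigned long prev, unsigned long codept,
                              char *buf)
{
  unsigned long diff = prev ^ codept, tmp;
  size_t n = 1, i;

  /* One digit per 4 bits of diff, and at least one digit. */
  for (tmp = diff >> 4; tmp != 0; tmp >>= 4) ++n;
  assert(n <= 8);  /* code points up to 0x10FFFF need at most 6 */

  /* Fill in reverse: the last digit has the high bit clear, */
  /* every earlier digit has it set.                         */
  i = n - 1;
  buf[i] = digits32[diff & 0xF];
  while (i > 0) {
    diff >>= 4;
    buf[--i] = digits32[0x10 | (diff & 0xF)];
  }

  buf[n] = '\0';
  return n;
}
```

For example, starting from the initial "previous" value 96, the code point 0x61 has difference 1 and yields the single digit "b"; following 0x61 with 0xE9 gives difference 0x88 and the two digits "2i".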