INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-m-00.txt                              2001-Feb-12
Expires 2001-Aug-14

                        AMC-ACE-M version 0.1.0

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited.  Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org.  A non-paginated (and possibly
    newer) version of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m

Abstract

    AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE]
    characters to a sequence of letters (A-Z, a-z), digits (0-9), and
    hyphen-minus (-), henceforth called LDH characters.  Such a map
    (called an "ASCII-Compatible Encoding", or ACE) might be useful for
    internationalized domain names [IDN], because host name labels are
    currently restricted to LDH characters by [RFC952] and [RFC1123].

    AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient
    but complex) and DUDE [DUDE00] (which is simple and provides case
    preservation).  AMC-ACE-M is much simpler than BRACE but similarly
    efficient, and provides case preservation like DUDE.

    Besides domain names, there might also be other contexts where it is
    useful to transform Unicode characters into "safe" (delimiter-free)
    ASCII characters.  (If other contexts consider hyphen-minus to be
    unsafe, a different character could be used to play its role, like
    underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding procedure
    Decoding procedure
    Signature
    Case sensitivity models
    Comparison with RACE, BRACE, LACE, and DUDE
    Example strings
    Security considerations
    References
    Author
    Example implementation

Features

    Uniqueness:  Every Unicode string maps to at most one LDH string.

    Completeness:  Every Unicode string maps to an LDH string.
    Restrictions on which Unicode strings are allowed, and on length,
    may be imposed by higher layers.

    Efficient encoding:  The ratio of encoded size to original size is
    small for all Unicode strings.  This is important in the context
    of domain names because [RFC1034] restricts the length of a domain
    label to 63 characters.

    Simplicity:  The encoding and decoding algorithms are reasonably
    simple to implement.  The goals of efficiency and simplicity are at
    odds; AMC-ACE-M aims at a good balance between them.

    Case-preservation:  If the Unicode string has been case-folded prior
    to encoding, it is possible to record the case information in the
    case of the letters in the encoding, allowing a mixed-case Unicode
    string to be recovered if desired.  A case-insensitive comparison
    of two encoded strings nevertheless remains equivalent to a
    case-insensitive comparison of the Unicode strings.  This feature
    is optional; see section "Case sensitivity models".

    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
    in the Unicode string are represented as themselves in the label.
    This comes for free because it is usually the most efficient
    encoding anyway.

Name

    AMC-ACE-M is a working name that should be changed if it is adopted.
    (The M merely indicates that it is the thirteenth ACE devised by
    this author.  BRACE was the third.  D through L did not deliver
    enough efficiency to justify their complexity.)
    Rather than waste
    good names on experimental proposals, let's wait until one proposal
    is chosen, then assign it a good name.  Suggestions (assuming the
    primary use is in domain names):

        UniHost
        UTF-A ("A" for "ASCII" or "alphanumeric",
               but unfortunately UTF-A sounds like UTF-8)
        UTF-H ("H" for "host names",
               but unfortunately UTF-H sounds like UTF-8)
        UTF-D ("D" for "domain names")
        NUDE  (Normal Unicode Domain Encoding)

Overview

    AMC-ACE-M maps characters to characters--it does not consume or
    produce code points, code units, or bytes, although the algorithm
    makes use of code points, and implementations will of course need to
    represent the input and output characters somehow, usually as bytes
    or other code units.

    Each character in the Unicode string is represented by an
    integral number of characters in the encoded string.  There is no
    intermediate bit string or octet string.

    The encoded string alternates between two modes: literal mode and
    base-32 mode.  LDH characters in the Unicode string are encoded
    literally, except that hyphen-minus is doubled.  Non-LDH characters
    in the Unicode string are encoded using base-32, in which each
    character of the encoded string represents five bits (a "quintet").
    A non-paired hyphen-minus in the encoded string indicates a mode
    change.

    In base-32 mode a group of one to five quintets is used to
    represent a number, which is added to an offset to yield a
    Unicode code point, which in turn represents a Unicode character.
    (Surrogates, which are code units used by UTF-16 in pairs to
    refer to code points, are not used and not allowed in AMC-ACE-M.)
    Similarities between the code points are exploited to make the
    encoding more compact.
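
    As an informal illustration of the split described in the overview,
    the following Python sketch labels each input character according
    to how it would be carried in the encoded string.  The function
    names is_ldh and classify are hypothetical and are not part of this
    specification; the sketch does not perform any encoding.

```python
def is_ldh(ch):
    """True for letters A-Z, a-z, digits 0-9, and hyphen-minus."""
    return ch == "-" or (ch.isascii() and ch.isalnum())


def classify(text):
    """Label each character with the treatment the overview describes."""
    labels = []
    for ch in text:
        if ch == "-":
            labels.append("doubled hyphen")   # hyphen-minus is written twice
        elif is_ldh(ch):
            labels.append("literal")          # copied as itself
        else:
            labels.append("base-32")          # carried as one to five quintets
    return labels


print(classify("a-\u00e9"))
```

    For the three-character input "a", hyphen-minus, U+00E9, the sketch
    reports a literal character, a doubled hyphen, and a base-32
    character, in that order.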

Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase
    forms of the base-32 characters.  The case may or may not convey
    information, as described in section "Case sensitivity models".

Encoding procedure

    The encoder first examines the Unicode string and chooses some
    parameters.  It writes these parameters into the output string, then
    proceeds to encode each Unicode character, one at a time.  The exact
    sequence of steps is given below.  All ordering of bits and quintets
    is big-endian (most significant first).  The >> and << operators
    used below mean bit shift, as in C.  For >> there is no question of
    logical versus arithmetic shift because AMC-ACE-M makes no use of
    negative numbers.

    0) Determine the Unicode code point for each non-LDH character in
       the Unicode string.  Since LDH characters are encoded literally,
       their code points are not needed.  Depending on how the Unicode
       string is presented to the encoder, this step may be a no-op.

    1) Verify that there are no invalid code points in the input;
       that is, none exceed 0x10FFFF (the highest code point in the
       Unicode code space) and none are in the range D800..DFFF
       (surrogates).

    2) Determine the most populous row:  Row n is defined as the 256
       code points starting with n << 8, except that this definition
       would make rows D8..DF useless, because they would contain only
       surrogates.  Therefore AMC-ACE-M defines rows D8..DF to be the
       following non-aligned blocks of 256 code points:

           row D8 = 0020..011F
           row D9 = 005B..015A
           row DA = 007B..017A
           row DB = 00A0..019F
           row DC = 00C0..01BF
           row DD = 00DF..01DE
           row DE = 0134..0233
           row DF = 0270..036F

       (Rationale:  Whereas almost every small script is confined to
       a single row, the Latin script is split across a few rows,
       and the row boundaries are not especially convenient for many
       languages.)

       Determine the row containing the most non-LDH input code points,
       breaking ties in favor of smaller-numbered rows.  (If a code
       point appears multiple times in the input, it counts multiple
       times.  This applies to steps 3 and 4 also.)  Call it row B.
       Let offsetB be the first code point of row B.

    3) Determine the most populous 16-window:  For each n in 0..31 let
       offset = ((offsetB >> 3) + n) << 3 and count the number of code
       points in the range offset through offset + 0xF.  Let A be the
       value of n that maximizes this count, breaking ties in favor
       of smaller values of n, and let offsetA be the corresponding
       offset.

    4) Determine the most populous 20k-window:  If the input is empty,
       then let C = 0.  Otherwise, for each input code point, let n =
       code_point >> 11, and count the number of non-LDH input code
       points that are not in row B and are in the range (n << 11)
       through (n << 11) + 0x4FFF.
       Determine the value of n that
       maximizes the count, breaking ties in favor of smaller values of
       n, and let C be that value.

    5) Choose a style:  One of the base-32 codes used in step 7.3 has
       two variants, and so base-32 mode is subdivided into two styles,
       narrow and wide, depending on which variant is used.  Compute
       the total number of base-32 characters that would be produced
       if narrow style were used, and the number if wide style were
       used.  The easiest way to do this is to mimic the logic of steps
       6 and 7.3.  Use whichever style would produce fewer base-32
       characters.  In case of a tie, use narrow style.

    6) Encode the parameters.  If narrow style is used, then let
       offsetC = (offsetB >> 12) << 12, and encode B and A as three or
       four base-32 characters:

           00bbb bbbbb aaaaa                 if B <= 0xFF
           01bbb bbbbb bbbbb aaaaa           otherwise

       If wide style is used, then let offsetC = C << 11, and encode B
       and C as three or five base-32 characters:

           10bbb bbbbb ccccc                 if B <= 0xFF and C <= 0x1F
           11bbb bbbbb bbbbb ccccc ccccc     otherwise

    7) Encode each input character in turn, using the first of the
       following cases that applies.  The mode is initially base-32.

       7.1) The character is a hyphen-minus (U+002D).  Encode it as
            two hyphen-minuses.

       7.2) The character is an LDH character.  If in base-32 mode
            then output a hyphen-minus and switch to literal mode.
            Copy the character to the output.

       7.3) The character is a non-LDH character.  If in literal
            mode then output a hyphen-minus and switch to base-32
            mode.  Encode the character's code point using the
            first of the following cases that applies.  Square
            brackets enclose quintets that can be used to record
            the upper/lowercase attribute of the Unicode character
            (because the corresponding base-32 characters are
            guaranteed to be letters rather than digits) (see section
            "Case sensitivity models").

            7.3.1) Narrow style was chosen and the code point is in
                   the range offsetA through offsetA + 0xF.  Subtract
                   offsetA and encode the difference as a single
                   base-32 character:

                       [0xxxx]

            7.3.2) The code point is in the range offsetB through
                   offsetB + 0xFF.  Subtract offsetB and encode the
                   difference as two base-32 characters:

                       1xxxx [0xxxx]

            7.3.3) The code point is in the range offsetC through
                   offsetC + 0xFFF.  Subtract offsetC and encode the
                   difference as three base-32 characters:

                       1xxxx 1xxxx [0xxxx]

            7.3.4) Wide style was chosen and the code point is in
                   the range offsetC + 0x1000 through offsetC +
                   0x4FFF.  Subtract offsetC + 0x1000 and encode the
                   difference as three base-32 characters:

                       [0xxxx] xxxxx xxxxx

            7.3.5) The code point is in the range 0 through 0xFFFF.
                   Encode it as four base-32 characters:

                       1xxxx 1xxxx 1xxxx [0xxxx]

            7.3.6) If we've come this far, the code point must be
                   in the range 0x10000 through 0x10FFFF.  Subtract
                   0x10000 and encode the difference as five base-32
                   characters:

                       1xxxx 1xxxx 1xxxx 1xxxx [0xxxx]

Decoding procedure

    The details of the decoding procedure are implied by the encoding
    procedure.  The overall sequence of steps is as follows.

    1) Undo the encoder's step 6:  From the first few base-32
       characters, determine whether narrow or wide style is used, and
       determine the offsets.

    2) Set the mode to base-32.  For each remaining input character, use
       the first of the following cases that applies:

       2.1) The character is a hyphen-minus, and the following
            character is also a hyphen-minus.  Consume them both and
            output a hyphen-minus.

       2.2) The character is a hyphen-minus.  Consume it and toggle
            the mode flag.

       2.3) The current mode is literal.  Consume the input character
            and output it.

       2.4) Interpret the input character and up to four of its
            successors as base-32.
            Consume characters until one is
            found whose value has the form 0xxxx.  That is the one
            that carries the upper/lowercase information.  Remember
            the length of the code.  If the length is one and wide
            style is being used, consume two more characters.
            Decode the base-32 characters into an integer, add the
            appropriate offset (which depends on the remembered code
            length), and output the Unicode character corresponding to
            the resulting code point.

            If the case-flexible or case-preserving model is being
            used (see section "Case sensitivity models"), the decoder
            must either perform the case conversion as it is decoding,
            or construct a separate record of the case information to
            accompany the output string.

    3) Before returning the output (be it a string or a string plus
       case information), the decoder must invoke the encoder on it,
       and compare the result to the input string.  The comparison
       must be case-sensitive if the case-sensitive or case-flexible
       model is being used, case-insensitive if the case-insensitive
       or case-preserving model is being used.  If the two strings do
       not match, it is an error.  This check is necessary to guarantee
       the uniqueness property (there cannot be two distinct encoded
       strings representing the same Unicode string).

    If the decoder at any time encounters an unexpected character, or
    unexpected end of input, then the input is invalid.

Signature

    The issue of how to distinguish ACE strings from unencoded strings
    is largely orthogonal to the encoding scheme itself, and is
    therefore not specified here.  In the context of domain name labels,
    a standard prefix and/or suffix (chosen to be unlikely to occur
    naturally) would presumably be attached to ACE labels.
    (In that
    case, it would probably be good to forbid the encoding of Unicode
    strings that appear to match the signature, to avoid confusing
    humans about whether they are looking at a Unicode string or an ACE
    string.)

    In order to use AMC-ACE-M in domain names, the choice of signature
    must be mindful of the requirement in [RFC952] that labels never
    begin or end with hyphen-minus.  The raw encoded string will never
    begin with a hyphen-minus, and will end with a hyphen-minus iff the
    Unicode string ends with a hyphen-minus.  The easiest solution is
    to use a suffix as the signature.  Alternatively, if the Unicode
    strings were forbidden from ending with a hyphen-minus, a prefix
    could be used.

    It appears that "---" is extremely rare in domain names; among the
    four-character prefixes of all the second-level domains under .com,
    .net, and .org, "---" never appears at all.  Therefore, perhaps the
    signature should be of the form ?--- (prefix) or ---? (suffix),
    where ? could be "u" for Unicode, or "i" for internationalized, or
    "a" for ACE, or maybe "q" or "z" because they are rare.

Case sensitivity models

    The higher layer must choose one of the following four models.

    Models suitable for domain names:

      * Case-insensitive:  Before a string is encoded, all its non-LDH
        characters must be case-folded so that any strings differing
        only in case become the same string (for example, strings could
        be forced to lowercase).  Folding LDH characters is optional.
        The case of base-32 characters and literal-mode characters is
        arbitrary and not significant.  Comparisons between encoded
        strings must be case-insensitive.  The original case of non-LDH
        characters cannot be recovered from the encoded string.

      * Case-preserving:  The case of the Unicode characters is not
        considered significant, but it can be preserved and recovered,
        just like in non-internationalized host names.
        Before a string
        is encoded, all its non-LDH characters must be case-folded
        as in the previous model.  LDH characters are naturally able
        to retain their case attributes because they are encoded
        literally.  The case attribute of a non-LDH character is
        recorded in one of the base-32 characters that represent
        it (section "Encoding procedure" tells which one).  If the
        base-32 character is uppercase, it means the Unicode character
        is caseless or should be forced to uppercase after being
        decoded (which is a no-op if the case folding already forces
        to uppercase).  If the base-32 character is lowercase, it
        means the Unicode character is caseless or should be forced to
        lowercase after being decoded (which is a no-op if the case
        folding already forces to lowercase).  The case of the other
        base-32 characters in a multi-quintet encoding is arbitrary
        and not significant.  Only uppercase and lowercase attributes
        can be recorded, not titlecase.  Comparisons between encoded
        strings must be case-insensitive, and are equivalent to
        case-insensitive comparisons between the Unicode strings.  The
        intended mixed-case Unicode string can be recovered as long as
        the encoded characters are unaltered, but altering the case of
        the encoded characters is not harmful--it merely alters the case
        of the Unicode characters, and such a change is not considered
        significant.

        In this model, the input to the encoder and the output of the
        decoder can be the unfolded Unicode string (in which case the
        encoder and decoder are responsible for performing the case
        folding and recovery), or can be the folded Unicode string
        accompanied by separate case information (in which case the
        higher layer is responsible for performing the case folding and
        recovery).
        Whichever layer performs the case recovery must
        first verify that the Unicode string is properly folded, to
        guarantee the uniqueness of the encoding.

        It is easy to extend the nameprep algorithm [NAMEPREP02] to
        remember case information.  It merely requires an additional
        bit to be associated with each output code point in the mapping
        table.

    The case-insensitive and case-preserving models are interoperable.
    If a domain name passes from a case-preserving entity to a
    case-insensitive entity, the case information will be lost, but
    the domain name will still be equivalent.  This phenomenon already
    occurs with non-internationalized domain names.

    Models unsuitable for domain names, but possibly useful in other
    contexts:

      * Case-sensitive:  Unicode strings may contain both uppercase and
        lowercase characters, which are not folded.  Base-32 characters
        must be lowercase.  Comparisons between encoded strings must be
        case-sensitive.

      * Case-flexible:  Like case-preserving, except that the choice
        of whether the case of the Unicode characters is considered
        significant is deferred.  Therefore, base-32 characters must
        be lowercase, except for those used to indicate uppercase
        Unicode characters.  Comparisons between encoded strings may be
        case-sensitive or case-insensitive, and such comparisons are
        equivalent to the corresponding comparisons between the Unicode
        strings.

Comparison with RACE, BRACE, LACE, and DUDE

    In this section we compare AMC-ACE-M and four other ACEs:  RACE
    [RACE03], BRACE [BRACE00], LACE [LACE01], and Extended DUDE
    [DUDE00].  We do not include SACE [SACE], UTF-5 [UTF5], or UTF-6
    [UTF6] in the comparison, because SACE appears obviously too
    complex, UTF-5 appears obviously too inefficient, and UTF-6 can
    never be more efficient than its similarly simple successor, DUDE.

    Case preservation support:

        DUDE, AMC-ACE-M:  all characters
        BRACE:            only the letters A-Z, a-z
        RACE, LACE:       none

    RACE, BRACE, and LACE transform the Unicode string to an
    intermediate bit string, then into a base-32 string, so there is no
    particular alignment between the base-32 characters and the Unicode
    characters.  DUDE and AMC-ACE-M do not have this intermediate stage,
    and enforce alignment between the base-32 characters and the Unicode
    characters, which facilitates the case preservation.

    Complexity is hard to measure.  This author would subjectively
    describe the complexity of the algorithms as:

        RACE, LACE, DUDE:  fairly simple but not trivial
        AMC-ACE-M:         moderate
        BRACE:             complex

    The complexity of AMC-ACE-M is in the number of rules, but the
    individual rules are not very complex, and they are generally
    non-interacting.

    The relative efficiency of the various algorithms is suggested
    by the sizes of the encodings in section "Example strings".  For
    each ACE there is a graph below showing a horizontal bar for
    each example string, representing the ACE length divided by the
    minimum length among all the ACEs for that example string (so the
    ratio is at least 1).  Example R is excluded because it violates
    nameprep [NAMEPREP02].  The other example strings all use different
    languages, except that there are several Japanese examples.  To
    avoid skewing the results, each graph collapses all the Japanese
    ratios into a single bar representing the median ratio.  A ratio r
    is represented by a bar of length r/0.04 characters.  Since the bar
    will always be at least 1/0.04 = 25 characters long, we show the
    first 25 characters as "O" and the rest as "@".  The bars are sorted
    so that the graph looks like a cumulative distribution.  Each bar
    is labeled with the language of the corresponding example string.
    (The difference between the Chinese and Taiwanese strings is that
    the former uses simplified characters.)

    RACE:
        Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@
        Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
        Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
        Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
        Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
        Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
        Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@
        Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
        Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@
        Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@@@@@@@@

    LACE:
        Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
        Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
        Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
        Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
        Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
        Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
        Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@
        Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@

    DUDE:
        Russian     OOOOOOOOOOOOOOOOOOOOOOOOO
        Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO
        Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@
        Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
        Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
        Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
        Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
        Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
        Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
        Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@

    AMC-ACE-M:
        Czech       OOOOOOOOOOOOOOOOOOOOOOOOO
        Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO
        Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO
        Korean      OOOOOOOOOOOOOOOOOOOOOOOOO
        Russian     OOOOOOOOOOOOOOOOOOOOOOOOO
        Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO
        Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO
        Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO
        Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@
        Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
        Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@

    BRACE:
        Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO
        Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO
        Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO
        Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO
        Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO
        Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@
        Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@
        Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@
        Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@
        Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@
        Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@

    These results suggest that DUDE is preferable to RACE and LACE,
    because it has similar simplicity, better support for case
    preservation, and is somewhat more efficient.

    The results also suggest that AMC-ACE-M is preferable to BRACE,
    because it has similar efficiency, better support for case
    preservation, and is simpler.

    DUDE and AMC-ACE-M have equal support for case preservation, but
    AMC-ACE-M offers significantly better efficiency, at the cost of
    significantly greater complexity, so choosing between them entails a
    value judgement.

Example strings

    In the ACE encodings below, signatures (like "bq--" for RACE) are
    not shown.  Non-LDH characters in the Unicode string are forced to
    lowercase before being encoded using BRACE, RACE, and LACE.  For
    RACE and LACE, the letters A-Z are likewise forced to lowercase.
    UTF-8 and UTF-16 are included for length comparisons, with non-ASCII
    bytes shown as "?".  AMC-ACE-M is abbreviated AMC-M.  Backslashes
    show where line breaks have been inserted in ACE strings too long
    for one line.  The RACE and LACE encodings are courtesy of Mark
    Davis's online UTF converter [UTFCONV] (slightly modified to remove
    the length restrictions).
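
    As a cross-check on the AMC-M strings in this section, the encoding
    procedure (steps 1 through 7) can be sketched in Python roughly as
    follows.  This is an informal sketch, not the example
    implementation listed in the contents.  All function and variable
    names are hypothetical.  It emits lowercase base-32 characters only
    (no case preservation), it assumes that steps 3 and 4 count only
    non-LDH code points, and it assumes that the candidate values of n
    in step 4 are taken from the non-LDH code points; the text of those
    steps leaves these details open.

```python
ALPHABET = "abcdefghijkmnpqrstuvwxyz23456789"    # base-32 values 0..31
SPECIAL_ROWS = {0xD8: 0x0020, 0xD9: 0x005B, 0xDA: 0x007B, 0xDB: 0x00A0,
                0xDC: 0x00C0, 0xDD: 0x00DF, 0xDE: 0x0134, 0xDF: 0x0270}


def is_ldh(ch):
    return ch == "-" or (ch.isascii() and ch.isalnum())


def nibbles(value, n):
    """n quintets of the form 1xxxx ... 1xxxx 0xxxx, most significant first."""
    head = [0x10 | ((value >> (4 * i)) & 0xF) for i in range(n - 1, 0, -1)]
    return head + [value & 0xF]


def code_for(cp, narrow, off_a, off_b, off_c):
    """Step 7.3: quintets for one non-LDH code point (first case that applies)."""
    if narrow and 0 <= cp - off_a <= 0xF:
        return nibbles(cp - off_a, 1)
    if 0 <= cp - off_b <= 0xFF:
        return nibbles(cp - off_b, 2)
    if 0 <= cp - off_c <= 0xFFF:
        return nibbles(cp - off_c, 3)
    if not narrow and 0x1000 <= cp - off_c <= 0x4FFF:
        v = cp - off_c - 0x1000
        return [v >> 10, (v >> 5) & 31, v & 31]
    if cp <= 0xFFFF:
        return nibbles(cp, 4)
    return nibbles(cp - 0x10000, 5)


def encode(text):
    cps = [ord(ch) for ch in text if not is_ldh(ch)]
    if any(0xD800 <= cp <= 0xDFFF for cp in cps):            # step 1
        raise ValueError("surrogate code point in input")

    def row_start(r):
        return SPECIAL_ROWS.get(r, r << 8)

    # Step 2: max() keeps the first maximum, so ties go to the smaller row.
    rows = sorted({0} | {cp >> 8 for cp in cps} | set(SPECIAL_ROWS))
    b = max(rows, key=lambda r: sum(0 <= cp - row_start(r) <= 0xFF for cp in cps))
    off_b = row_start(b)

    def window(n):
        return ((off_b >> 3) + n) << 3

    # Step 3: most populous 16-window.
    a = max(range(32), key=lambda n: sum(0 <= cp - window(n) <= 0xF for cp in cps))

    # Step 4: most populous 20k-window among code points outside row B.
    outside = [cp for cp in cps if not 0 <= cp - off_b <= 0xFF]
    c = 0
    if cps:
        c = max(sorted({cp >> 11 for cp in cps}),
                key=lambda n: sum(0 <= cp - (n << 11) <= 0x4FFF for cp in outside))

    def render(narrow):
        """Steps 6 and 7 for one style; returns (base-32 count, output)."""
        if narrow:
            off_c = (off_b >> 12) << 12
            head = ([b >> 5, b & 31, a] if b <= 0xFF else
                    [0x08 | (b >> 10), (b >> 5) & 31, b & 31, a])
        else:
            off_c = c << 11
            head = ([0x10 | (b >> 5), b & 31, c] if b <= 0xFF and c <= 0x1F else
                    [0x18 | (b >> 10), (b >> 5) & 31, b & 31, c >> 5, c & 31])
        out, count, literal = [ALPHABET[q] for q in head], len(head), False
        for ch in text:
            if ch == "-":                                    # 7.1
                out.append("--")
            elif is_ldh(ch):                                 # 7.2
                if not literal:
                    out.append("-")
                literal = True
                out.append(ch)
            else:                                            # 7.3
                if literal:
                    out.append("-")
                literal = False
                code = code_for(ord(ch), narrow, window(a), off_b, off_c)
                out += [ALPHABET[q] for q in code]
                count += len(code)
        return count, "".join(out)

    # Step 5: fewer base-32 characters wins; min() keeps narrow on a tie.
    return min(render(True), render(False), key=lambda r: r[0])[1]


print(encode("\u305d\u306e\u30b9\u30d4\u30fc\u30c9\u3067"))
```

    Fed the seven code points listed for example (G) below, the sketch
    yields "bsmfyq5j7e9n6jr", which agrees with the AMC-M line of that
    example.  It computes both styles in full rather than only counting
    characters, which is wasteful but keeps step 5 short.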
615 The first several examples are all names of Japanese music artists, 616 song titles, and TV programs, just because the author happens to 617 have them handy (but Japanese is useful for providing examples 618 of single-row text, two-row text, ideographic text, and various 619 mixtures thereof). 621 (A) 3B (Japanese TV program title) 623 = U+5E74 (kanji) 624 = U+7D44 (kanji) 625 = U+91D1 U+516B U+5148 U+751F (kanji) 627 UTF-16: ???????????????? 628 UTF-8: 3???B??????????????? 629 AMC-M: utk-3-8ze-B-hkenqtymwifi9 630 BRACE: u-3-ygj-b-ynb6gjc7pp4k5p5w 631 DUDE: j3le74G062nd44p1d1l16bk8n51f 632 RACE: 3aadgxtuabrh2rer2fiwwukioupq 633 LACE: 74adgxtuabrh2rer2fiwwukioupq 635 (B) -with-SUPER-MONKEYS (Japanese music group name) 637 = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji) 639 UTF-8: ??????????????????-with-SUPER-MONKEYS 640 AMC-M: u5m2j4etwif6q2zf---with--SUPER--MONKEYS 641 BRACE: uvj7fuaqcahy982xa---with--SUPER--MONKEYS 642 DUDE: lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J 643 UTF-16: ???????????????????????????????????????????????? 644 LACE: ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y 645 RACE: 3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\ 646 nabwqa3yanyagwadfab4qa4y 648 (C) Hello-Another-Way- (Japanese song title) 650 = U+305D U+308C U+305E U+308C U+306E (hiragana) 651 = U+5834 U+6240 (kanji) 653 UTF-8: Hello-Another-Way-????????????????????? 654 BRACE: ji7-Hello--Another--Way---v3jhaefvd2ufj62 655 AMC-M: bsk-Hello--Another--Way---p2nq2nyqx2veyuwa 656 DUDE: M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240 657 UTF-16: ?????????????????????????????????????????????????? 658 LACE: ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea 659 RACE: 3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\ 660 zaawtaxjqrqyf4memgbxfqndcia 662 (D) 2 (Japanese TV program title) 664 = U+3072 U+3068 U+3064 (hiragana) 665 = U+5C4B U+6839 (kanji) 666 = U+306E (hiragana) 667 = U+4E0B (kanji) 669 UTF-16: ???????????????? 
670 UTF-8: ?????????????????????2 671 AMC-M: bsnzciex6wmy2vjqw8sm-2 672 BRACE: ji96u56uwbhf2wqxnw4s-2 673 DUDE: j072m8klc4bm839j06eke0bg032 674 RACE: 3ayhemdigbsfys3iheyg4tqlaaza 675 LACE: 74yhemdigbsfys3iheyg4tqlaaza 676 (E) MajiKoi5 (Japanese song title) 678 = U+3067 (hiragana) 679 = U+3059 U+308B (hiragana) 680 = U+79D2 U+524D (kanji) 682 UTF-8: Maji???Koi??????5?????? 683 UTF-16: ?????????????????????????? 684 AMC-M: bsm-Maji-r-Koi-b2m-5-z37cxuwp 685 BRACE: ji8-Maji-g-Koi-qe7x-5-wx7p6ma 686 DUDE: Mdhqpj067G06bvpj059obg035n9d2l24d 687 RACE: 3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq 688 LACE: 74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq 690 (F) de (Japanese song title) 692 = U+30D1 U+30D5 U+30A3 U+30FC (katakana) 693 = U+30EB U+30F3 U+30D0 (katakana) 695 UTF-16: ?????????????? 696 BRACE: 3iu8pazt-de-pygi 697 AMC-M: bs3jp4d9n-de-8m9di 698 RACE: gdi5li7475sp6zpl6pia 699 DUDE: j0d1lq3vcg064lj0ebv3t0 700 UTF-8: ????????????de????????? 701 LACE: aqyndvnd7qbaazdfamyox46q 703 (G) (Japanese song title) 705 = U+305D U+306E (hiragana) 706 = U+30B9 U+30D4 U+30FC U+30C9 (katakana) 707 = U+3067 (hiragana) 709 RACE: gbow5oou7tewo 710 UTF-16: ?????????????? 711 BRACE: bidprdmp9wt7mi 712 LACE: a4yf23vz2t6mszy 713 AMC-M: bsmfyq5j7e9n6jr 714 DUDE: j05dmer9t4vcs9m7 715 UTF-8: ????????????????????? 717 The next several examples are all translations of the sentence "Why 718 can't they just speak in <language>?" (courtesy of Michael Kaplan's 719 "provincial" page [PROVINCIAL]). Word breaks and punctuation have 720 been removed, as is often done in domain names. 722 (H) Arabic (Egyptian): 723 U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644 724 U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F 726 DUDE: m44qnli7oqk3kloj4phi8kahf 727 BRACE: 28akcjwcmp3ciwb4t3ngd4nbaz 728 AMC-M: agiekhfuhuiukdefivevjvbuiktr 729 RACE: azceur2fe4ucuq2eivediojrfbfb6 730 LACE: cedeisshiutsqksdircuqnbzgeueuhy 731 UTF-16: ??????????????????????????????????
732 UTF-8: ?????????????????????????????????? 733 (I) Chinese (simplified): 734 U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587 736 UTF-16: ?????????????????? 737 BRACE: kgcqqsgp26i5h4zn7req5i 738 AMC-M: uqj7g8nvk6awispn9wupdnh 739 DUDE: ked6ucjas0k8gdobf4ke2dm587 740 UTF-8: ??????????????????????????? 741 LACE: azhnn3b2ybea2aml6qau4libmwdq 742 RACE: 3bhnmtxmjy5e5qcojbha3c7ujywwlby 744 (J) Czech: Proprostnemluvesky 746 = U+010D 747 = U+011B 748 = U+00ED 750 UTF-8: Pro??prost??nemluv????esky 751 AMC-M: g26-Pro-p-prost-9m-nemluv-6pp-esky 752 BRACE: i32-Pro-u-prost-8y-nemluv-29f3n-esky 753 DUDE: N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9 754 UTF-16: ???????????????????????????????????????????? 755 LACE: amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\ 756 lpe 757 RACE: ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\ 758 7mx7xh73l754q 760 (K) Hebrew: 761 U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8 762 U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2 763 U+05D1 U+05E8 U+05D9 U+05EA 765 AMC-M: af4nqeep8e8jfinaqdb8ijp8cb8ij8k 766 DUDE: ldcukktu4pt5osgujhu8t9tu2t1u8t9ua 767 BRACE: 27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c 768 RACE: axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u 769 LACE: cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q 770 UTF-8: ???????????????????????????????????????????? 771 UTF-16: ???????????????????????????????????????????? 
773 (L) Hindi: 774 U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D 775 U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939 776 U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947 777 U+0939 U+0948 U+0902 (Devanagari) 779 BRACE: 2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32 780 RACE: bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai 781 LACE: dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba 782 AMC-M: ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc 783 DUDE: p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\ 784 8g2 785 UTF-16: ???????????????????????????????????????????????????????\ 786 ????? 787 UTF-8: ???????????????????????????????????????????????????????\ 788 ??????????????????????????????????? 789 (M) Korean: 790 U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774 791 U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74 792 U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables) 794 UTF-16: ???????????????????????????????????????????????? 795 UTF-8: ???????????????????????????????????????????????????????\ 796 ????????????????? 
797 AMC-M: yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\ 798 p6aqi23ieemweywn 799 BRACE: y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\ 800 gmnsqydmk7im56arju6pt82 801 LACE: 77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ 802 exj2mlpfzzcyjrsely5ck4ta 803 RACE: 3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ 804 exj2mlpfzzcyjrsely5ck4ta 805 DUDE: s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\ 806 574lcr2e4q74s5bcr9c8g98s88bn44qe4c 808 (N) Russian: 809 U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E 810 U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440 811 U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A 812 U+0438 (Cyrillic) 814 DUDE: K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao 815 AMC-M: aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca 816 BRACE: 269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2 817 RACE: aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa 818 LACE: dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory 819 UTF-16: ???????????????????????????????????????????????????????\ 820 ??? 821 UTF-8: ??????????????????????????????????????????????????????? 822 ??? 824 (O) Spanish: PorqunopuedensimplementehablarenEspaol 826 = U+00E9 827 = U+00F1 829 UTF-8: Porqu??nopuedensimplementehablarenEspa??ol 830 AMC-M: aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol 831 BRACE: 22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol 832 DUDE: N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\ 833 1mfs 834 RACE: abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\ 835 fonygd4lpnq 836 LACE: faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\ 837 omvzxayprn5wa 838 UTF-16: ???????????????????????????????????????????????????????\ 839 ????????????????????????? 841 (P) Taiwanese: 842 U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587 843 UTF-16: ?????????????????? 844 UTF-8: ??????????????????????????? 
845 AMC-M: uqj7g2tbgtu6a385pspnxkupdnh 846 BRACE: kgcqui49gatc2wyrn8y7cndgte9 847 RACE: 3bhnmuaroize5qe6xvha3cvkjywwlby 848 LACE: 75hnmuaroize5qe6xvha3cvkjywwlby 849 DUDE: ked6l011n232kec0pebdke0doaaake2dm587 851 (Q) Vietnamese: 852 Taisaohokhngthchi\ 853 noitingVit 855 = U+0323 856 = U+00F4 857 = U+00EA 858 = U+0309 859 = U+0301 861 UTF-8: Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t 862 AMC-M: ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\ 863 b-ngVi-s8kud-t 864 BRACE: i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\ 865 9c29-ngVi-25p48-t 866 UTF-16: ???????????????????????????????????????????????????????\ 867 ????????????????????? 868 DUDE: N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\ 869 9n4m9uaj01g6enN6m9uaj23g74 870 LACE: aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\ 871 ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi 872 RACE: ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\ 873 kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\ 874 3u 876 The last example is an ASCII string that breaks not only the 877 existing rules for host name labels but also the rules proposed in 878 [NAMEPREP02] for internationalized domain names. 880 (R) -> $1.00 <- 882 UTF-8: -> $1.00 <- 883 DUDE: -jei0kj1iej0gi0jc- 884 RACE: aawt4ibegexdambahqwq 885 LACE: bmac2praeqys4mbqea6c2 886 UTF-16: ?????????????????????? 887 AMC-M: aae--vqae-1-q-00-avn-- 888 BRACE: 229--t2b4-1-w-00-i9i-- 890 Security considerations 892 Users expect each domain name in DNS to be controlled by a single 893 authority. If a Unicode string intended for use as a domain label 894 could map to multiple ACE labels, then an internationalized domain 895 name could map to multiple ACE domain names, each controlled by 896 a different authority, some of which could be spoofs that hijack 897 service requests intended for another. Therefore AMC-ACE-M is 898 designed so that each Unicode string has a unique encoding. 
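One small ingredient of this property can be checked in isolation: each base-32 value should correspond to exactly one lowercase character, and the decoder should map that character (or, for letters, its uppercase form) back to the same value. The sketch below copies the base32 table and base32_decode() from the example implementation in this document. The helpers check_base32_roundtrip() and count_valid_base32_chars() are hypothetical names introduced here for illustration. This check alone does not establish that whole labels are unique; the example decoder additionally re-encodes its output and compares the result with its input.

```c
/* Table and decoder copied from the example implementation. */

static const unsigned char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

enum { base32_invalid = 32 };

static unsigned int base32_decode(unsigned char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* Hypothetical helper: returns 1 if every value 0..31 decodes back */
/* to itself from its lowercase character and, for letters, from    */
/* the corresponding uppercase character; returns 0 otherwise.      */
int check_base32_roundtrip(void)
{
  unsigned int n;

  for (n = 0; n < 32; ++n) {
    unsigned char lower = base32[n];
    if (base32_decode(lower) != n) return 0;
    if (lower >= 97 && lower <= 122 &&
        base32_decode((unsigned char) (lower - 32)) != n) return 0;
  }

  return 1;
}

/* Hypothetical helper: counts the byte values 1..255 that the */
/* decoder accepts as base-32 characters.                      */
unsigned int count_valid_base32_chars(void)
{
  unsigned int c, count = 0;

  for (c = 1; c < 256; ++c) {
    if (base32_decode((unsigned char) c) != base32_invalid) ++count;
  }

  return count;
}
```

With this table the decoder accepts 56 byte values: the digits 2-9 plus the 24 letters other than "l" and "o" in either case. Because letters are accepted in both cases, a decoder that allowed mixed-case input without further checks could admit several spellings of one label, which is presumably why the example decoder takes a case_sensitivity argument for its final comparison.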
900 However, there can still be multiple Unicode representations of the 901 "same" text, for various definitions of "same". This problem is 902 addressed to some extent by the Unicode standard under the topic 903 of canonicalization, but some text strings may be misleading or 904 ambiguous to humans when used as domain names, such as strings 905 containing dots, slashes, at-signs, etc. These issues are being 906 further studied under the topic of "nameprep" [NAMEPREP02]. 908 References 910 [ACEID01] Yoshiro Yoneya, Naomasa Maruyama, "Proposal for 911 a determining process of ACE identifier", 2000-Dec-19, 912 draft-ietf-idn-aceid-01. 914 [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based 915 ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19, 916 draft-ietf-idn-brace-00. 918 [DUDE00] Brian Spolarich, Mark Welter, "DUDE: Differential Unicode 919 Domain Encoding", 2000-Nov-21, draft-ietf-idn-dude-00. 921 [IDN] Internationalized Domain Names (IETF working group), 922 http://www.i-d-n.net/, idn@ops.ietf.org. 924 [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII 925 Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01. 927 [NAMEPREP02] Paul Hoffman, Marc Blanchet, "Preparation 928 of Internationalized Host Names", 2001-Jan-17, 929 draft-ietf-idn-nameprep-02. 931 [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", 932 http://www.trigeminal.com/samples/provincial.html. 934 [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding 935 for IDN", 2000-Nov-28, draft-ietf-idn-race-03. 937 [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host 938 Table Specification", 1985-Oct, RFC 952. 940 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", 941 1987-Nov, RFC 1034. 943 [RFC1123] Internet Engineering Task Force, R. Braden (editor), 944 "Requirements for Internet Hosts -- Application and Support", 945 1989-Oct, RFC 1123. 
947 [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)", 948 draft-ietf-idn-sace-*. 950 [UNICODE] The Unicode Consortium, "The Unicode Standard", 951 http://www.unicode.org/unicode/standard/standard.html. 953 [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a 954 Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*. 956 [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another 957 ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*. 959 [UTFCONV] Mark Davis, "UTF Converter", 960 http://www.macchiato.com/unicode/convert.html. 962 Author 964 Adam M. Costello 965 http://www.cs.berkeley.edu/~amc/ 967 Example implementation 969 /******************************************/ 970 /* amc-ace-m.c 0.1.0 (2001-Feb-12-Mon) */ 971 /* Adam M. Costello */ 972 /******************************************/ 974 /* This is ANSI C code implementing AMC-ACE-M version 0.1.*. */ 976 /************************************************************/ 977 /* Public interface (would normally go in its own .h file): */ 979 #include <limits.h> 981 enum amc_ace_status { 982 amc_ace_success, 983 amc_ace_invalid_input, 984 amc_ace_output_too_big 985 }; 987 enum case_sensitivity { case_sensitive, case_insensitive }; 989 #if UINT_MAX >= 0x10FFFF 990 typedef unsigned int u_code_point; 991 #else 992 typedef unsigned long u_code_point; 993 #endif 995 int amc_ace_m_encode( 996 unsigned int input_length, 997 const u_code_point *input, 998 const unsigned char *uppercase_flags, 999 unsigned int *output_size, 1000 unsigned char *output ); 1001 /* amc_ace_m_encode() converts Unicode to AMC-ACE-M. The input */ 1002 /* must be represented as an array of Unicode code points */ 1003 /* (not code units; surrogate pairs are not allowed), and the */ 1004 /* output will be represented as null-terminated ASCII. The */ 1005 /* input_length is the number of code points in the input.
The */ 1006 /* output_size is an in/out argument: the caller must pass */ 1007 /* in the maximum number of characters that may be output */ 1008 /* (including the terminating null), and on successful return */ 1009 /* it will contain the number of characters actually output */ 1010 /* (including the terminating null, so it will be one more than */ 1011 /* strlen() would return, which is why it is called output_size */ 1012 /* rather than output_length). The uppercase_flags array must */ 1013 /* hold input_length boolean values, where nonzero means the */ 1014 /* corresponding Unicode character should be forced to uppercase */ 1015 /* after being decoded, and zero means it is caseless or should */ 1016 /* be forced to lowercase. Alternatively, uppercase_flags may */ 1017 /* be a null pointer, which is equivalent to all zeros. The */ 1018 /* letters a-z and A-Z are always encoded literally, regardless */ 1019 /* of the corresponding flags. The encoder always outputs */ 1020 /* lowercase base-32 characters except when nonzero values */ 1021 /* of uppercase_flags require otherwise, so the encoder is */ 1022 /* compatible with any of the case models. The return value */ 1023 /* may be any of the amc_ace_status values defined above; if */ 1024 /* not amc_ace_success, then output_size and output may contain */ 1025 /* garbage. On success, the encoder will never need to write an */ 1026 /* output_size greater than input_length*5+6, because of how the */ 1027 /* encoding is defined. */ 1029 int amc_ace_m_decode( 1030 enum case_sensitivity case_sensitivity, 1031 unsigned char *scratch_space, 1032 const unsigned char *input, 1033 unsigned int *output_length, 1034 u_code_point *output, 1035 unsigned char *uppercase_flags ); 1036 /* amc_ace_m_decode() converts AMC-ACE-M to Unicode. The input */ 1037 /* must be represented as null-terminated ASCII, and the output */ 1038 /* will be represented as an array of Unicode code points. 
*/ 1039 /* The case_sensitivity argument influences the check on the */ 1040 /* well-formedness of the input string; it must be case_sensitive */ 1041 /* if case-sensitive comparisons are allowed on encoded strings, */ 1042 /* case_insensitive otherwise (see also section "Case sensitivity */ 1043 /* models" of the AMC-ACE-M specification). The scratch_space */ 1044 /* must point to space at least as large as the input, which will */ 1045 /* get overwritten (this allows the decoder to avoid calling */ 1046 /* malloc()). The output_length is an in/out argument: the */ 1047 /* caller must pass in the maximum number of code points that */ 1048 /* may be output, and on successful return it will contain the */ 1049 /* actual number of code points output. The uppercase_flags */ 1050 /* array must have room for at least output_length values, or it */ 1051 /* may be a null pointer if the case information is not needed. */ 1052 /* A nonzero flag indicates that the corresponding Unicode */ 1053 /* character should be forced to uppercase by the caller, while */ 1054 /* zero means it is caseless or should be forced to lowercase. */ 1055 /* The letters a-z and A-Z are output already in the proper case, */ 1056 /* but their flags will be set appropriately so that applying the */ 1057 /* flags would be harmless. The return value may be any of the */ 1058 /* amc_ace_status values defined above; if not amc_ace_success, */ 1059 /* then output_length, output, and uppercase_flags may contain */ 1060 /* garbage. On success, the decoder will never need to write */ 1061 /* an output_length greater than the length of the input (not */ 1062 /* counting the null terminator), because of how the encoding is */ 1063 /* defined. 
*/ 1065 /**********************************************************/ 1066 /* Implementation (would normally go in its own .c file): */ 1068 #include <string.h> 1070 /* Character utilities: */ 1072 /* is_ldh(codept) returns 1 if the code point represents an LDH */ 1073 /* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */ 1075 static int is_ldh(u_code_point codept) 1076 { 1077 if (codept == 45) return 1; 1078 if (codept < 48) return 0; 1079 if (codept <= 57) return 1; 1080 if (codept < 65) return 0; 1081 if (codept <= 90) return 1; 1082 if (codept < 97) return 0; 1083 if (codept <= 122) return 1; 1084 return 0; 1085 } 1087 /* is_AtoZ(c) returns 1 if c is an */ 1088 /* uppercase ASCII letter, zero otherwise. */ 1089 static unsigned char is_AtoZ(unsigned char c) 1090 { 1091 return c >= 65 && c <= 90; 1092 } 1094 /* special_row_offset[n] holds the offset of the */ 1095 /* bottom of special row 0xD8 + n, where n is in 0..7. */ 1097 static u_code_point special_row_offset[] = 1098 { 0x0020, 0x005B, 0x007B, 0x00A0, 0x00C0, 0x00DF, 0x0134, 0x0270 }; 1100 /* base32[n] is the lowercase base-32 character representing */ 1101 /* the number n from the range 0 to 31. Note that we cannot */ 1102 /* use string literals for ASCII characters because an ANSI C */ 1103 /* compiler does not necessarily use ASCII. */ 1105 static const unsigned char base32[] = { 1106 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ 1107 109, 110, /* m-n */ 1108 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ 1109 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ 1110 }; 1112 /* base32_decode(c) returns the value of a base-32 character, in the */ 1113 /* range 0 to 31, or the constant base32_invalid if c is not a valid */ 1114 /* base-32 character.
*/ 1116 enum { base32_invalid = 32 }; 1118 static unsigned int base32_decode(unsigned char c) 1119 { 1120 if (c < 50) return base32_invalid; 1121 if (c <= 57) return c - 26; 1122 if (c < 97) c += 32; 1123 if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; 1124 return c - 97 - (c > 108) - (c > 111); 1125 } 1127 /* unequal(case_sensitivity,a1,a2,n) returns 0 if the arrays */ 1128 /* a1 and a2 are equal in the first n positions, 1 otherwise. */ 1129 /* If case_sensitivity is case_insensitive, then ASCII A-Z are */ 1130 /* considered equal to a-z respectively. */ 1132 static int unequal( 1133 enum case_sensitivity case_sensitivity, 1134 const unsigned char *a1, 1135 const unsigned char *a2, 1136 unsigned int n ) 1137 { 1138 const unsigned char *end; 1139 unsigned char c1, c2; 1140 if (case_sensitivity != case_insensitive) return memcmp(a1,a2,n); 1142 for (end = a1 + n; a1 < end; ++a1, ++a2) { 1143 c1 = *a1; 1144 c2 = *a2; 1145 if (c1 >= 65 && c1 <= 90) c1 += 32; 1146 if (c2 >= 65 && c2 <= 90) c2 += 32; 1147 if (c1 != c2) return 1; 1148 } 1150 return 0; 1151 } 1153 /* Encoder: */ 1155 int amc_ace_m_encode( 1156 unsigned int input_length, 1157 const u_code_point *input, 1158 const unsigned char *uppercase_flags, 1159 unsigned int *output_size, 1160 unsigned char *output ) 1161 { 1162 unsigned int literal, wide; /* boolean */ 1163 u_code_point codept, n, diff, morebits; 1164 u_code_point A, B, C, offsetA, offsetB, offsetC, offset; 1165 const u_code_point *input_end, *p, *pp; 1166 unsigned int count, max, next_in, next_out, max_out, codelen, i; 1167 unsigned char c; 1169 input_end = input + input_length; 1171 /* 1) Verify that only valid code points appear: */ 1173 for (p = input; p < input_end; ++p) { 1174 if (*p >> 11 == 0x1B || *p > 0x10FFFF) return amc_ace_invalid_input; 1175 } 1177 /* 2) Determine the most populous row: B and offsetB */ 1179 /* first check the special rows: */ 1181 B = 0xD8; 1182 offsetB = special_row_offset[0]; 1183 max = 0; 1185 
for (n = 0; n < 8; ++n) { 1186 offset = special_row_offset[n]; 1187 count = 0; 1189 for (p = input; p < input_end; ++p) { 1190 if (*p - offset <= 0xFF && !is_ldh(*p)) ++count; 1191 } 1192 if (count > max) { 1193 B = 0xD8 + n; 1194 offsetB = offset; 1195 max = count; 1196 } 1197 } 1199 /* now check the regular rows: */ 1201 for (pp = input; pp < input_end; ++pp) { 1202 n = *pp >> 8; 1203 count = 0; 1205 for (p = input; p < input_end; ++p) { 1206 if (*p >> 8 == n && !is_ldh(*p)) ++count; 1207 } 1209 if (count > max || (count == max && n < B)) { 1210 B = n; 1211 offsetB = n << 8; 1212 max = count; 1213 } 1214 } 1216 /* 3) Determine the most populous 16-window: A and offsetA */ 1218 A = 0; 1219 max = 0; 1221 for (n = 0; n <= 0x1F; ++n) { 1222 offset = ((offsetB >> 3) + n) << 3; 1223 count = 0; 1225 for (p = input; p < input_end; ++p) { 1226 if (*p - offset <= 0xF && !is_ldh(*p)) ++count; 1227 } 1229 if (count > max) { 1230 A = n; 1231 offsetA = offset; 1232 max = count; 1233 } 1234 } 1236 /* 4) Determine the most populous 20k-window: C */ 1238 C = 0; 1239 max = 0; 1241 for (pp = input; pp < input_end; ++pp) { 1242 count = 0; 1243 n = *pp >> 11; 1244 offset = n << 11; 1246 for (p = input; p < input_end; ++p) { 1247 if (*p - offset <= 0x4FFF && !is_ldh(*p)) ++count; 1248 if (count > max || (count == max && n < C)) { 1249 C = n; 1250 max = count; 1251 } 1252 } 1253 } 1255 /* 5) Determine the style to use: wide or narrow */ 1257 /* if narrow style were used: */ 1259 offsetC = (offsetB >> 12) << 12; 1260 count = 3 + (B > 0xFF); 1262 for (p = input; p < input_end; ++p) { 1263 if (is_ldh(*p)) { } 1264 else if (*p - offsetA <= 0xF) count += 1; 1265 else if (*p - offsetB <= 0xFF) count += 2; 1266 else if (*p - offsetC <= 0xFFF) count += 3; 1267 else if (*p <= 0xFFFF) count += 4; 1268 else count += 5; 1269 } 1271 max = count; 1273 /* if wide style were used: */ 1275 offsetC = C << 11; 1276 count = B <= 0xFF && C <= 0x1F ? 
3 : 5; 1278 for (p = input; p < input_end; ++p) { 1279 if (is_ldh(*p)) { } 1280 else if (*p - offsetB <= 0xFF) count += 2; 1281 else if (*p - offsetC <= 0x4FFF) count += 3; 1282 else if (*p <= 0xFFFF) count += 4; 1283 else count += 5; 1284 } 1286 wide = (count < max); 1288 /* 6) Initialize offsetC, and encode the style and offsets: */ 1290 max_out = *output_size; 1291 next_out = 0; 1293 if (wide) { 1294 offsetC = C << 11; 1295 if (B <= 0xFF && C <= 0x1F) { 1296 if (max_out - next_out < 3) return amc_ace_output_too_big; 1297 output[next_out++] = base32[0x10 | (B >> 5)]; 1298 output[next_out++] = base32[B & 0x1F]; 1299 output[next_out++] = base32[C]; 1300 } 1301 else { 1302 if (max_out - next_out < 5) return amc_ace_output_too_big; 1303 output[next_out++] = base32[0x18 | (B >> 10)]; 1304 output[next_out++] = base32[(B >> 5) & 0x1F]; 1305 output[next_out++] = base32[B & 0x1F]; 1306 output[next_out++] = base32[C >> 5]; 1307 output[next_out++] = base32[C & 0x1F]; 1308 } 1309 } 1310 else { 1311 offsetC = (offsetB >> 12) << 12; 1313 if (B <= 0xFF) { 1314 if (max_out - next_out < 3) return amc_ace_output_too_big; 1315 output[next_out++] = base32[B >> 5]; 1316 output[next_out++] = base32[B & 0x1F]; 1317 } 1318 else { 1319 if (max_out - next_out < 4) return amc_ace_output_too_big; 1320 output[next_out++] = base32[8 | (B >> 10)]; 1321 output[next_out++] = base32[(B >> 5) & 0x1F]; 1322 output[next_out++] = base32[B & 0x1F]; 1323 } 1325 output[next_out++] = base32[A]; 1326 } 1328 /* 7) Main encoding loop: */ 1330 literal = 0; 1332 for (next_in = 0; next_in < input_length; ++next_in) { 1333 codept = input[next_in]; 1335 if (codept == 45 /* hyphen-minus */) { 1336 /* case 7.1 */ 1337 if (max_out - next_out < 2) return amc_ace_output_too_big; 1338 output[next_out++] = 45; 1339 output[next_out++] = 45; 1340 continue; 1341 } 1343 if (is_ldh(codept)) { 1344 /* case 7.2 */ 1345 if (!literal) { 1346 if (max_out - next_out < 1) return amc_ace_output_too_big; 1347 output[next_out++] = 
45; 1348 literal = 1; 1349 } 1350 if (max_out - next_out < 1) return amc_ace_output_too_big; 1351 output[next_out++] = codept; 1352 continue; 1353 } 1355 /* case 7.3 */ 1357 if (literal) { 1358 if (max_out - next_out < 1) return amc_ace_output_too_big; 1359 output[next_out++] = 45; 1360 literal = 0; 1361 } 1363 if (!wide) { 1364 diff = codept - offsetA; 1366 if (diff <= 0xF) { 1367 /* case 7.3.1 */ 1368 codelen = 1; 1369 goto encoder_base32_bottom; 1370 } 1371 } 1373 diff = codept - offsetB; 1375 if (diff <= 0xFF) { 1376 /* case 7.3.2 */ 1377 codelen = 2; 1378 goto encoder_base32_bottom; 1379 } 1381 diff = codept - offsetC; 1383 if (diff <= 0xFFF) { 1384 /* case 7.3.3 */ 1385 codelen = 3; 1386 goto encoder_base32_bottom; 1387 } 1389 if (wide) { 1390 diff = codept - offsetC - 0x1000; 1392 if (diff <= 0x3FFF) { 1393 /* case 7.3.4 */ 1394 codelen = 1; 1395 morebits = diff & 0x3FF; 1396 diff >>= 10; 1397 goto encoder_base32_bottom; 1398 } 1399 } 1401 if (codept <= 0xFFFF) { 1402 /* case 7.3.5 */ 1403 diff = codept; 1404 codelen = 4; 1405 goto encoder_base32_bottom; 1406 } 1407 /* case 7.3.6 */ 1408 diff = codept - 0x10000; 1409 codelen = 5; 1411 encoder_base32_bottom: /* output diff as n base-32 digits: */ 1412 if (max_out - next_out < codelen) return amc_ace_output_too_big; 1413 i = codelen - 1; 1414 c = base32[diff & 0xF]; 1415 if (uppercase_flags && uppercase_flags[next_in]) c -= 32; 1416 output[next_out + i] = c; 1418 while (i > 0) { 1419 diff >>= 4; 1420 output[next_out + --i] = base32[0x10 | (diff & 0xF)]; 1421 } 1423 next_out += codelen; 1425 if (wide && codelen == 1) { 1426 /* case 7.3.4 */ 1427 if (max_out - next_out < 2) return amc_ace_output_too_big; 1428 output[next_out++] = base32[morebits >> 5]; 1429 output[next_out++] = base32[morebits & 0x1F]; 1430 } 1431 } 1433 /* null terminator: */ 1434 if (max_out - next_out < 1) return amc_ace_output_too_big; 1435 output[next_out++] = 0; 1436 *output_size = next_out; 1437 return amc_ace_success; 1438 } 1440 /* 
Decoder: */ 1442 int amc_ace_m_decode( 1443 enum case_sensitivity case_sensitivity, 1444 unsigned char *scratch_space, 1445 const unsigned char *input, 1446 unsigned int *output_length, 1447 u_code_point *output, 1448 unsigned char *uppercase_flags ) 1449 { 1450 unsigned int literal, wide, large; /* boolean */ 1451 const unsigned char *next_in; 1452 unsigned char c; 1453 unsigned int next_out, max_out, codelen, input_size, scratch_size; 1454 u_code_point q, B, offsets[6], diff, offset; 1455 enum amc_ace_status status; 1456 /* 1) Decode the style and offsets: */ 1458 next_in = input; 1459 q = base32_decode(*next_in++); 1460 if (q == base32_invalid) return amc_ace_invalid_input; 1461 wide = q >> 4; 1462 large = (q >> 3) & 1; 1463 B = q & 7; 1464 q = base32_decode(*next_in++); 1465 if (q == base32_invalid) return amc_ace_invalid_input; 1466 B = (B << 5) | q; 1468 if (large) { 1469 q = base32_decode(*next_in++); 1470 if (q == base32_invalid) return amc_ace_invalid_input; 1471 B = (B << 5) | q; 1472 } 1474 /* offsets[codelen] is for base-32 codes with codelen characters */ 1475 /* (not counting the extra two in wide-style 0xxxx xxxxx xxxxx) */ 1477 offsets[2] = B >> 3 == 0x1B ? 
special_row_offset[B & 7] : B << 8; 1478 q = base32_decode(*next_in++); 1479 if (q == base32_invalid) return amc_ace_invalid_input; 1481 if (!wide) { 1482 offsets[1] = ((offsets[2] >> 3) + q) << 3; 1483 offsets[3] = (offsets[2] >> 12) << 12; 1484 } 1485 else { 1486 offset = q << 11; 1488 if (large) { 1489 q = base32_decode(*next_in++); 1490 if (q == base32_invalid) return amc_ace_invalid_input; 1491 offset = (offset << 5) | q; 1492 } 1494 offsets[3] = offset; 1495 offsets[1] = offset + 0x1000; 1496 } 1498 offsets[4] = 0; 1499 offsets[5] = 0x10000; 1501 /* 2) Main decoding loop: */ 1503 max_out = *output_length; 1504 next_out = 0; 1505 literal = 0; 1507 for (;;) { 1508 c = *next_in++; 1509 if (!c) break; 1510 if (c == 45 /* hyphen-minus */) { 1511 if (*next_in == 45) { 1512 /* case 2.1: "--" decodes to "-" */ 1513 ++next_in; 1514 if (max_out - next_out < 1) return amc_ace_output_too_big; 1515 if (uppercase_flags) uppercase_flags[next_out] = 0; 1516 output[next_out++] = 45; 1517 continue; 1518 } 1520 /* case 2.2: unpaired hyphen-minus toggles mode */ 1521 literal = !literal; 1522 continue; 1523 } 1525 if (!is_ldh(c)) return amc_ace_invalid_input; 1526 if (max_out - next_out < 1) return amc_ace_output_too_big; 1528 if (literal) { 1529 /* case 2.3: literal letter/digit */ 1530 if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c); 1531 output[next_out++] = c; 1532 continue; 1533 } 1535 /* case 2.4: base-32 sequence */ 1537 diff = 0; 1538 codelen = 1; 1540 for (;;) { 1541 q = base32_decode(c); 1542 if (q == base32_invalid) return amc_ace_invalid_input; 1543 diff = (diff << 4) | (q & 0xF); 1544 if ((q & 0x10) == 0) break; 1545 if (++codelen > 5) return amc_ace_invalid_input; 1546 c = *next_in++; 1547 } 1549 /* Now codelen is the number of input characters read, */ 1550 /* and c is the character holding the uppercase flag. 
*/ 1552 if (wide && codelen == 1) { 1553 q = base32_decode(*next_in++); 1554 if (q == base32_invalid) return amc_ace_invalid_input; 1555 diff = (diff << 5) | q; 1556 q = base32_decode(*next_in++); 1557 if (q == base32_invalid) return amc_ace_invalid_input; 1558 diff = (diff << 5) | q; 1559 } 1561 offset = offsets[codelen]; 1562 if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c); 1563 output[next_out++] = offset + diff; 1564 } 1565 /* 3) Re-encode the output and compare to the input: */ 1567 input_size = next_in - input; 1568 scratch_size = input_size; 1569 status = amc_ace_m_encode(next_out, output, uppercase_flags, 1570 &scratch_size, scratch_space); 1571 if (status != amc_ace_success || 1572 scratch_size != input_size || 1573 unequal(case_sensitivity, scratch_space, input, input_size) 1574 ) return amc_ace_invalid_input; 1575 *output_length = next_out; 1576 return amc_ace_success; 1577 } 1579 /******************************************************************/ 1580 /* Wrapper for testing (would normally go in a separate .c file): */ 1582 #include <assert.h> 1583 #include <stdio.h> 1584 #include <stdlib.h> 1585 #include <string.h> 1587 /* For testing, we'll just set some compile-time limits rather than */ 1588 /* use malloc(), and set a compile-time option rather than using a */ 1589 /* command-line option.
*/ 1591 enum { 1592 unicode_max_length = 256, 1593 ace_max_size = 256, 1594 test_case_sensitivity = case_insensitive 1595 }; 1597 static void usage(char **argv) 1598 { 1599 fprintf(stderr, 1600 "%s -e reads big-endian UTF-32 and writes AMC-ACE-M ASCII.\n" 1601 "%s -d reads AMC-ACE-M ASCII and writes big-endian UTF-32.\n" 1602 "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n" 1603 , argv[0], argv[0]); 1604 exit(EXIT_FAILURE); 1605 } 1607 static void fail(const char *msg) 1608 { 1609 fputs(msg,stderr); 1610 exit(EXIT_FAILURE); 1611 } 1613 static const char too_large[] = 1614 "input or output is too large, recompile with larger limits\n"; 1616 static const char invalid_input[] = "invalid input\n"; 1617 int main(int argc, char **argv) 1618 { 1619 enum amc_ace_status status; 1621 if (argc != 2) usage(argv); 1622 if (argv[1][0] != '-') usage(argv); 1623 if (argv[1][2] != '\0') usage(argv); 1625 if (argv[1][1] == 'e') { 1626 u_code_point input[unicode_max_length]; 1627 unsigned char uppercase_flags[unicode_max_length]; 1628 unsigned char output[ace_max_size]; 1629 unsigned int input_length, output_size; 1630 int c0, c1, c2, c3; 1632 /* Read the UTF-32 input string: */ 1634 input_length = 0; 1636 for (;;) { 1637 c0 = getchar(); 1638 c1 = getchar(); 1639 c2 = getchar(); 1640 c3 = getchar(); 1642 if (c1 == EOF || c2 == EOF || c3 == EOF) { 1643 if (c0 != EOF) fail("input not a multiple of 4 bytes\n"); 1644 break; 1645 } 1647 if (input_length == unicode_max_length) fail(too_large); 1649 if ((c0 != 0 && c0 != 0x80) 1650 || c1 < 0 || c1 > 0x10 1651 || c2 < 0 || c2 > 0xFF 1652 || c3 < 0 || c3 > 0xFF ) { 1653 fail(invalid_input); 1654 } 1656 input[input_length] = ((u_code_point) c1 << 16) | 1657 ((u_code_point) c2 << 8) | (u_code_point) c3; 1658 uppercase_flags[input_length] = (c0 >> 7); 1659 ++input_length; 1660 } 1662 /* Encode, and output the result: */ 1664 output_size = ace_max_size; 1665 status = amc_ace_m_encode(input_length, input, uppercase_flags, 1666 
&output_size, output); 1667 if (status == amc_ace_invalid_input) fail(invalid_input); 1668 if (status == amc_ace_output_too_big) fail(too_large); 1669 assert(status == amc_ace_success); 1670 fputs((char *) output, stdout); 1671 return EXIT_SUCCESS; 1672 } 1673 if (argv[1][1] == 'd') { 1674 unsigned char input[ace_max_size], scratch[ace_max_size]; 1675 u_code_point output[unicode_max_length], codept; 1676 unsigned char uppercase_flags[unicode_max_length]; 1677 unsigned int output_length, i; 1678 size_t n; 1680 /* Read the AMC-ACE-M ASCII input string: */ 1682 n = fread(input, 1, ace_max_size, stdin); 1683 if (n == ace_max_size) fail(too_large); 1684 input[n] = 0; 1686 /* Decode, and output the result: */ 1688 output_length = unicode_max_length; 1689 status = amc_ace_m_decode(test_case_sensitivity, scratch, input, 1690 &output_length, output, uppercase_flags); 1691 if (status == amc_ace_invalid_input) fail(invalid_input); 1692 if (status == amc_ace_output_too_big) fail(too_large); 1693 assert(status == 0); 1695 for (i = 0; i < output_length; ++i) { 1696 putchar(uppercase_flags[i] ? 0x80 : 0); 1697 codept = output[i]; 1698 putchar(codept >> 16); 1699 putchar((codept >> 8) & 0xFF); 1700 putchar(codept & 0xFF); 1701 } 1703 return EXIT_SUCCESS; 1704 } 1706 usage(argv); 1707 return EXIT_SUCCESS; /* not reached, but quiets a compiler warning */ 1708 } 1710 INTERNET-DRAFT expires 2001-Aug-12