INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-r-00.txt                              2001-Mar-27
Expires 2001-Sep-27

                        AMC-ACE-R version 0.0.0

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org.
    A non-paginated (and possibly newer) version of this specification
    may be available at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-r

Abstract

    AMC-ACE-R is a reversible map from a sequence of Unicode [UNICODE]
    code points to a sequence of letters (A-Z, a-z), digits (0-9),
    and hyphen-minus (-), henceforth called LDH characters. Such a
    map might be useful for an "ASCII-Compatible Encoding" (ACE) for
    internationalized domain names [IDN], because host name labels are
    currently restricted to LDH characters by [RFC952] and [RFC1123].

    AMC-ACE-R is similar to AMC-ACE-O [AMCACEO00] but is simpler and not
    quite as efficient.

    Besides domain names, there might also be other contexts where it is
    useful to transform Unicode characters into "safe" (delimiter-free)
    ASCII characters. (If other contexts consider hyphen-minus to be
    unsafe, a different character could be used to play its role, like
    underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding and decoding algorithms
    Signature
    Case sensitivity models
    Comparison with RACE, BRACE, LACE, AltDUDE, AMC-ACE-M, AMC-ACE-O
    Example strings
    Security considerations
    Credits
    References
    Author
    Example implementation

Features

    Uniqueness: Every Unicode string maps to at most one LDH string.

    Completeness: Every Unicode string maps to an LDH string.
    Restrictions on which Unicode strings are allowed, and on length,
    may be imposed by higher layers.

    Efficient encoding: The ratio of encoded size to original size is
    small for all Unicode strings. This is important in the context
    of domain names because [RFC1034] restricts the length of a domain
    label to 63 characters.

    Simplicity: The encoding and decoding algorithms are reasonably
    simple to implement. The goals of efficiency and simplicity are at
    odds; AMC-ACE-R aims at a good balance between them.

    Case-preservation: If the Unicode string has been case-folded prior
    to encoding, it is possible to record the case information in the
    case of the letters in the encoding, allowing a mixed-case Unicode
    string to be recovered if desired, but a case-insensitive comparison
    of two encoded strings is equivalent to a case-insensitive
    comparison of the Unicode strings. This feature is optional; see
    section "Case sensitivity models".

    Readability: The letters A-Z and a-z and the digits 0-9 appearing
    in the Unicode string are represented as themselves in the label.
    This comes for free because it is usually the most efficient
    encoding anyway.

Name

    AMC-ACE-R is a working name that should be changed if it is adopted.
    (The R merely indicates that it is the eighteenth ACE devised by
    this author. BRACE was the third. D-L, N, P, and Q were not worth
    releasing.) Rather than waste good names on experimental proposals,
    let's wait until one proposal is chosen, then assign it a good name.
    Suggestions (assuming the primary use is in domain names):

        UniHost
        UTF-D ("D" for "domain names")
        NUDE (Normal Unicode Domain Encoding)

    A name that makes no reference to domain names:

        UTF-37 (there are 37 characters in the output repertoire)

Overview

    AMC-ACE-R maps a sequence of Unicode code points to a sequence of
    LDH characters. The encoder input and decoder output are arrays of
    code points, not characters, bytes, or code units (in particular,
    not UTF-16 surrogates). Formally, the encoder output and decoder
    input are character strings, not code points, code units, or bytes,
    although implementations will of course need to represent the
    characters somehow, usually as bytes or other code units.

    Each Unicode code point is represented by an integral number of
    characters in the encoded string. There is no intermediate bit
    string or octet string.
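
    As a rough illustration of this alignment, the Python sketch below
    splits a run of base-32 characters into one group per code point.
    It assumes the alphabet given in section "Base-32 characters" and
    the convention from the encoding algorithm that the last quintet of
    each code point has a leading 0 bit while the others have a leading
    1. The helper name split_base32_run is hypothetical and is not part
    of this specification.

```python
# Base-32 alphabet from section "Base-32 characters"; the index of a
# character in this string is its quintet value.
B32 = "abcdefghijkmnpqrstuvwxyz23456789"


def split_base32_run(run):
    """Split a base-32 run into one group of characters per code point.

    Assumption: a quintet below 16 (leading bit 0) ends the group.
    Raises ValueError on a non-base-32 character or a truncated group.
    """
    groups, current = [], ""
    for ch in run.lower():
        current += ch
        if B32.index(ch) < 16:  # leading bit 0: last quintet of a code point
            groups.append(current)
            current = ""
    if current:
        raise ValueError("run ends in the middle of a code point")
    return groups


print(split_base32_run("ywekhfuhui"))
```

    Because every group is a whole number of characters, no bits of one
    code point ever share a character with bits of another.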

    The encoded string alternates between two modes: literal mode and
    base-32 mode. Unicode code points representing LDH characters
    are encoded as those LDH characters, except that hyphen-minus is
    doubled. Other Unicode code points are encoded using base-32, in
    which each character of the encoded string represents five bits
    (a "quintet"). A non-paired hyphen-minus in the encoded string
    indicates a mode change.

    In base-32 mode a variable-length code sequence of one to five
    quintets represents a delta, which is added to a reference point to
    yield a Unicode code point. There are five reference points, one
    for each code length, three of which continually change during the
    encoding/decoding process.

Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase
    forms of the base-32 characters. The case may or may not convey
    information, as described in section "Case sensitivity models".

Encoding and decoding algorithms

    The algorithms are given below as commented pseudocode.
    All ordering of bits and quintets is big-endian (most significant
    first). The >> and << operators used below mean bit shift, as in
    C. For >> there is no question of logical versus arithmetic shift
    because AMC-ACE-R never needs to right-shift a negative value.
    As in C, "continue" means terminate the current iteration of the
    innermost loop, "break" means terminate the innermost loop, and
    "return" means terminate the current function.

    shared variables:  # All others are local to each function.
      array refpoint[1..5]  # refpoint[k] is for sequences of length k

    function update_refpoints(history[first..latest]):
      # Adapt refpoint[1..3] based on the code points seen so far.
      for k = 1 to 3 do begin
        let b = k << 2
        if latest == first  # history holds only the first code point
        then let refpoint[k] = (history[latest] >> b) << b
        else for i = latest - 1 down to first do begin
          if history[i] represents an LDH character then continue
          if (refpoint[k] XOR history[i]) >> b == 0 then break
          if (history[latest] XOR history[i]) >> b == 0 then begin
            let refpoint[k] = (history[latest] >> b) << b
            return
          end
        end
      end

    function encode(input[first..last]):
      let refpoint[1..5] = 0x60, 0, 0, 0, 0x10000
      let output = the empty string
      let literal = false
      for i = first to last do begin
        if input[i] == 0x2D then append two hyphen-minuses to output
        else if input[i] represents an LDH character then begin
          if not literal then append hyphen-minus to output
          let literal = true
          append the character represented by input[i] to output
        end
        else begin
          if literal then append hyphen-minus to output
          let literal = false
          for k = 1 to infinity do begin
            let delta = input[i] - refpoint[k]
            if delta >= 0 and delta >> (4*k) == 0 then break
          end
          extract the k least significant nybbles of delta
          prepend 0 to the last nybble and 1 to the rest
          output base-32 characters corresponding to the quintets
          update_refpoints(input[first..i])
        end
      end
      return output

    function decode(input string):
      let refpoint[1..5] = 0x60, 0, 0, 0, 0x10000
      let output = the empty array
      let literal = false
      while not end-of-input do begin
        if the next character is hyphen-minus then begin
          consume the character
          if the next character is also hyphen-minus
          then consume it and append 0x2D to output
          else toggle literal
        end
        else if literal then consume the character and output it
        else begin
          consume characters and convert them to quintets until
            encountering a quintet beginning with 0
          fail upon encountering a non-base-32 character or end-of-input
          let k = the number of quintets obtained
          strip the first bit of each quintet
          concatenate the resulting nybbles to form delta
          append refpoint[k] + delta to output
          update_refpoints(output)
        end
      end
      let check = encode(output)
      if check != the input string then fail
      return output

    The comparison at the end of decode() must be case-insensitive
    if ACEs are always compared case-insensitively (which is true of
    domain names), case-sensitive otherwise (see also section "Case
    sensitivity models"). This check is necessary to guarantee the
    uniqueness property, that there cannot be two distinct encoded
    strings representing the same sequence of integers. This check also
    frees the decoder from having to check for overflow while decoding
    the base-32 characters.

Signature

    The issue of how to distinguish ACE strings from unencoded strings
    is largely orthogonal to the encoding scheme itself, and is
    therefore not specified here. In the context of domain name labels,
    a standard prefix and/or suffix (chosen to be unlikely to occur
    naturally) would presumably be attached to ACE labels.
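
    Leaving the signature aside, the encoder and decoder described in
    section "Encoding and decoding algorithms" can be sketched in Python
    roughly as follows. This is a minimal sketch under stated
    assumptions, not the author's example implementation: the helper
    is_ldh is hypothetical, the final check in decode() is done
    case-insensitively (the domain-name case), and the reference points
    are treated as being initialized from the very first code point of
    the input, which is the reading that reproduces the AMC-R strings
    in section "Example strings".

```python
B32 = "abcdefghijkmnpqrstuvwxyz23456789"  # index = quintet value


def is_ldh(cp):
    """Hypothetical helper: True for ASCII letters, digits, hyphen-minus."""
    return cp < 0x80 and (chr(cp).isalnum() or cp == 0x2D)


def update_refpoints(refpoint, history):
    """Adapt refpoint[0..2] (code lengths 1..3) to the code points so far."""
    latest = history[-1]
    for k in (1, 2, 3):
        b = k << 2
        if len(history) == 1:  # assumption: special case for the first code point
            refpoint[k - 1] = (latest >> b) << b
            continue
        for prev in reversed(history[:-1]):
            if is_ldh(prev):
                continue
            if (refpoint[k - 1] ^ prev) >> b == 0:
                break
            if (latest ^ prev) >> b == 0:
                refpoint[k - 1] = (latest >> b) << b
                return


def encode(codepoints):
    """Map a list of code points to an LDH string (no signature attached)."""
    refpoint = [0x60, 0, 0, 0, 0x10000]
    out, literal = [], False
    for i, cp in enumerate(codepoints):
        if cp == 0x2D:
            out.append("--")  # hyphen-minus is doubled in either mode
        elif is_ldh(cp):
            if not literal:
                out.append("-")  # switch to literal mode
            literal = True
            out.append(chr(cp))
        else:
            if literal:
                out.append("-")  # switch back to base-32 mode
            literal = False
            for k in range(1, 6):
                delta = cp - refpoint[k - 1]
                if delta >= 0 and delta >> (4 * k) == 0:
                    break
            else:
                raise ValueError("code point cannot be represented")
            for j in range(k - 1, -1, -1):
                nybble = (delta >> (4 * j)) & 0xF
                out.append(B32[nybble | (0x10 if j else 0)])  # flag 0 on last quintet
            update_refpoints(refpoint, codepoints[: i + 1])
    return "".join(out)


def decode(text):
    """Inverse of encode(); raises ValueError on malformed or non-canonical input."""
    refpoint = [0x60, 0, 0, 0, 0x10000]
    out, literal, i = [], False, 0
    while i < len(text):
        if text[i] == "-":
            i += 1
            if i < len(text) and text[i] == "-":
                out.append(0x2D)
                i += 1
            else:
                literal = not literal
        elif literal:
            out.append(ord(text[i]))
            i += 1
        else:
            delta, k = 0, 0
            while True:
                if i >= len(text):
                    raise ValueError("truncated base-32 sequence")
                q = B32.find(text[i].lower())
                if q < 0 or k == 5:
                    raise ValueError("invalid base-32 sequence")
                delta, k, i = (delta << 4) | (q & 0xF), k + 1, i + 1
                if q < 0x10:  # leading bit 0: last quintet of this code point
                    break
            out.append(refpoint[k - 1] + delta)
            update_refpoints(refpoint, out)
    if encode(out).lower() != text.lower():  # case-insensitive, as for domain names
        raise ValueError("not a canonical encoding")
    return out
```

    Applied to the code points of example (A) in section "Example
    strings", encode() in this sketch yields
    ywekhfuhuikwdwefivevjbuiwktr, the AMC-R string listed there, and
    decode() maps that string back to the same code points. The
    re-encoding at the end of decode() is what rejects inputs such as
    "b", which would otherwise decode to the LDH character "a".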

    In order to use AMC-ACE-R in domain names, the choice of signature
    must be mindful of the requirement in [RFC952] that labels never
    begin or end with hyphen-minus. Since the raw encoded string
    sometimes begins with a hyphen-minus, the signature must include
    a prefix that does not begin with hyphen-minus. If the Unicode
    strings are forbidden from ending with hyphen-minus (which seems
    prudent anyway), then the raw encoded string will never end with
    hyphen-minus; otherwise, the signature must include a suffix as well
    as a prefix.

    It appears that "---" is extremely rare in domain names; among the
    four-character prefixes of all the second-level domains under .com,
    .net, and .org, "---" never appears at all. Therefore, perhaps the
    signature should be of the form "?---", where ? could be "u" for
    Unicode, or "i" for internationalized, or "a" for ACE, or maybe "q"
    or "z" because they are rare.

Case sensitivity models

    The higher layer must choose one of the following four models.

    Models suitable for domain names:

      * Case-insensitive: Before a string is encoded, all its non-LDH
        characters must be case-folded so that any strings differing
        only in case become the same string (for example, strings could
        be forced to lowercase). Folding LDH characters is optional.
        The case of base-32 characters and literal-mode characters is
        arbitrary and not significant. Comparisons between encoded
        strings must be case-insensitive. The original case of non-LDH
        characters cannot be recovered from the encoded string.

      * Case-preserving: The case of the Unicode characters is not
        considered significant, but it can be preserved and recovered,
        just like in non-internationalized host names. Before a string
        is encoded, all its non-LDH characters must be case-folded
        as in the previous model. LDH characters are naturally able
        to retain their case attributes because they are encoded
        literally. The case attribute of a non-LDH character is
        recorded in the last of the base-32 characters that represent
        it, which is guaranteed to be a letter rather than a digit.
        If the base-32 character is uppercase, it means the Unicode
        character is caseless or should be forced to uppercase after
        being decoded (which is a no-op if the case folding already
        forces to uppercase). If the base-32 character is lowercase,
        it means the Unicode character is caseless or should be forced
        to lowercase after being decoded (which is a no-op if the case
        folding already forces to lowercase). The case of the other
        base-32 characters in a multi-quintet encoding is arbitrary
        and not significant. Only uppercase and lowercase attributes
        can be recorded, not titlecase. Comparisons between encoded
        strings must be case-insensitive, and are equivalent to
        case-insensitive comparisons between the Unicode strings. The
        intended mixed-case Unicode string can be recovered as long as
        the encoded characters are unaltered, but altering the case of
        the encoded characters is not harmful--it merely alters the case
        of the Unicode characters, and such a change is not considered
        significant.

        In this model, the input to the encoder and the output of the
        decoder can be the unfolded Unicode string (in which case the
        encoder and decoder are responsible for performing the case
        folding and recovery), or can be the folded Unicode string
        accompanied by separate case information (in which case the
        higher layer is responsible for performing the case folding and
        recovery). Whichever layer performs the case recovery must
        first verify that the Unicode string is properly folded, to
        guarantee the uniqueness of the encoding.

        It should not be very difficult to extend the nameprep algorithm
        [NAMEPREP03] to remember case information; it could be done by
        adding flags to the mapping tables.

    The case-insensitive and case-preserving models are interoperable.
    If a domain name passes from a case-preserving entity to a
    case-insensitive entity, the case information may be lost, but the
    domain name will still be equivalent. This phenomenon already
    occurs with non-internationalized domain names.

    Models unsuitable for domain names, but possibly useful in other
    contexts:

      * Case-sensitive: Unicode strings may contain both uppercase and
        lowercase characters, which are not folded. Base-32 characters
        must be lowercase. Comparisons between encoded strings must be
        case-sensitive.

      * Case-flexible: Like case-preserving, except that the choice
        of whether the case of the Unicode characters is considered
        significant is deferred. Therefore, base-32 characters must
        be lowercase, except for those used to indicate uppercase
        Unicode characters. Comparisons between encoded strings may be
        case-sensitive or case-insensitive, and such comparisons are
        equivalent to the corresponding comparisons between the Unicode
        strings.

Comparison with RACE, BRACE, LACE, AltDUDE, AMC-ACE-M, AMC-ACE-O

    In this section we compare AMC-ACE-R and six other ACEs: RACE
    [RACE03], BRACE [BRACE00], LACE [LACE01], AltDUDE [AltDUDE00],
    AMC-ACE-M [AMCACEM00], and AMC-ACE-O [AMCACEO00]. We do not include
    SACE [SACE], UTF-5 [UTF5], UTF-6 [UTF6], or DUDE [DUDE01] in the
    comparison, because SACE appears obviously too complex, UTF-5
    appears obviously too inefficient, UTF-6 can never be more efficient
    than its similarly simple successor DUDE, and DUDE is almost
    identical to AltDUDE.

    Complexity is hard to measure. This author would subjectively
    describe the complexity of the algorithms as:

        LACE, AltDUDE:    simple but not trivial
        RACE, AMC-ACE-R:  less simple
        AMC-ACE-O:        moderate
        AMC-ACE-M:        fairly complex
        BRACE:            complex

    AMC-ACE-R is similar to AMC-ACE-O, but is considerably simpler
    because it does not calculate the most useful reference points
    beforehand, encode them, and decode them. Instead, it uses a simple
    heuristic to set the reference points adaptively based on the code
    points that have been seen so far.

    Implementations can be long and straightforward, or short and
    subtle, but for whatever it's worth, here are the code sizes of
    four of the algorithms that were implemented by this author in
    similar styles:

        AltDUDE:    130 lines  @@@@@@@@@@@@@@@@@@@
        AMC-ACE-R:  171 lines  @@@@@@@@@@@@@@@@@@@@@@@@
        AMC-ACE-O:  232 lines  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        AMC-ACE-M:  324 lines  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    (Not counted in the code sizes are blank lines, lines containing
    only comments or only a single brace, and wrapper code for testing.
    BRACE was also implemented by this author, but it was a less general
    implementation, with bounded input and output sizes.)

    If a different implementation style were to alter the code sizes
    additively, or multiplicatively, or a combination thereof, AMC-ACE-O
    would remain about halfway between AltDUDE and AMC-ACE-M, and
    AMC-ACE-R would remain closer to AltDUDE than to AMC-ACE-O.

    Case preservation support:

        AltDUDE, AMC-ACE-M/O/R:  all characters
        BRACE:                   only the letters A-Z, a-z
        RACE, LACE:              none

    RACE, BRACE, and LACE transform the Unicode string to an
    intermediate bit string, then into a base-32 string, so there is no
    particular alignment between the base-32 characters and the Unicode
    characters.
    AltDUDE and AMC-ACE-M/O/R do not have this intermediate
    stage, and enforce alignment between the base-32 characters and the
    Unicode characters, which facilitates the case preservation.

    The relative efficiency of the various algorithms is suggested
    by the sizes of the encodings in section "Example strings". The
    lengths of examples A-K (which are the same sentence translated
    into languages from a variety of language families using a variety
    of scripts) are shown graphically below for each ACE, scaled by a
    factor of 0.4 so they fit on one line, and sorted so they look like
    a cumulative distribution. The fictional "Super-ACE" encodes its
    input using whichever of the other seven ACEs is shortest for that
    input.

    RACE:
      A Arabic       29  @@@@@@@@@@@@
      B Chinese      31  @@@@@@@@@@@@
      J Taiwanese    31  @@@@@@@@@@@@
      D Hebrew       37  @@@@@@@@@@@@@@@
      H Russian      47  @@@@@@@@@@@@@@@@@@@
      E Hindi        50  @@@@@@@@@@@@@@@@@@@@
      F Japanese     60  @@@@@@@@@@@@@@@@@@@@@@@@
      I Spanish      66  @@@@@@@@@@@@@@@@@@@@@@@@@@
      C Czech        68  @@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       79  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese  112  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    LACE:
      B Chinese      28  @@@@@@@@@@@
      A Arabic       31  @@@@@@@@@@@@
      J Taiwanese    31  @@@@@@@@@@@@
      D Hebrew       39  @@@@@@@@@@@@@@@@
      H Russian      48  @@@@@@@@@@@@@@@@@@@
      E Hindi        52  @@@@@@@@@@@@@@@@@@@@@
      F Japanese     52  @@@@@@@@@@@@@@@@@@@@@
      C Czech        58  @@@@@@@@@@@@@@@@@@@@@@@
      I Spanish      68  @@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       79  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese  109  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    AltDUDE:
      A Arabic       25  @@@@@@@@@@
      B Chinese      26  @@@@@@@@@@
      D Hebrew       33  @@@@@@@@@@@@@
      J Taiwanese    36  @@@@@@@@@@@@@@
      H Russian      38  @@@@@@@@@@@@@@@
      C Czech        43  @@@@@@@@@@@@@@@@@
      F Japanese     49  @@@@@@@@@@@@@@@@@@@@
      E Hindi        58  @@@@@@@@@@@@@@@@@@@@@@@
      I Spanish      59  @@@@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese   81  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       89  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    AMC-ACE-R:
      B Chinese      24  @@@@@@@@@@
      A Arabic       28  @@@@@@@@@@@
      J Taiwanese    30  @@@@@@@@@@@@
      D Hebrew       32  @@@@@@@@@@@@@
      C Czech        36  @@@@@@@@@@@@@@
      H Russian      40  @@@@@@@@@@@@@@@@
      F Japanese     42  @@@@@@@@@@@@@@@@@
      I Spanish      47  @@@@@@@@@@@@@@@@@@@
      E Hindi        55  @@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese   70  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       89  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    AMC-ACE-O:
      B Chinese      24  @@@@@@@@@@
      A Arabic       28  @@@@@@@@@@@
      J Taiwanese    30  @@@@@@@@@@@@
      D Hebrew       31  @@@@@@@@@@@@
      C Czech        34  @@@@@@@@@@@@@@
      H Russian      40  @@@@@@@@@@@@@@@@
      F Japanese     41  @@@@@@@@@@@@@@@@
      I Spanish      49  @@@@@@@@@@@@@@@@@@@@
      E Hindi        54  @@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese   69  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       80  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    BRACE:
      B Chinese      22  @@@@@@@@@
      A Arabic       26  @@@@@@@@@@
      J Taiwanese    27  @@@@@@@@@@@
      D Hebrew       33  @@@@@@@@@@@@@
      C Czech        36  @@@@@@@@@@@@@@
      F Japanese     40  @@@@@@@@@@@@@@@@
      H Russian      42  @@@@@@@@@@@@@@@@@
      E Hindi        45  @@@@@@@@@@@@@@@@@@
      I Spanish      48  @@@@@@@@@@@@@@@@@@@
      K Vietnamese   72  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       78  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    AMC-ACE-M:
      B Chinese      23  @@@@@@@@@
      J Taiwanese    27  @@@@@@@@@@@
      A Arabic       28  @@@@@@@@@@@
      D Hebrew       31  @@@@@@@@@@@@
      C Czech        34  @@@@@@@@@@@@@@
      H Russian      38  @@@@@@@@@@@@@@@
      F Japanese     42  @@@@@@@@@@@@@@@@@
      I Spanish      48  @@@@@@@@@@@@@@@@@@@
      E Hindi        54  @@@@@@@@@@@@@@@@@@@@@@
      K Vietnamese   69  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       71  @@@@@@@@@@@@@@@@@@@@@@@@@@@@

    Super-ACE:
      B Chinese      22  @@@@@@@@@
      A Arabic       25  @@@@@@@@@@
      J Taiwanese    27  @@@@@@@@@@@
      D Hebrew       31  @@@@@@@@@@@@
      C Czech        34  @@@@@@@@@@@@@@
      H Russian      38  @@@@@@@@@@@@@@@
      F Japanese     40  @@@@@@@@@@@@@@@@
      E Hindi        45  @@@@@@@@@@@@@@@@@@
      I Spanish      47  @@@@@@@@@@@@@@@@@@@
      K Vietnamese   69  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      G Korean       71  @@@@@@@@@@@@@@@@@@@@@@@@@@@@

    totals:
      RACE:       610  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      LACE:       595  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AltDUDE:    537  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-R:  493  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-O:  480  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      BRACE:      469  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-M:  465  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      Super-ACE:  449  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    worst cases:
      RACE:       112  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      LACE:       109  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AltDUDE:     89  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-R:   89  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-O:   80  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      BRACE:       78  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-M:   71  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      Super-ACE:   71  @@@@@@@@@@@@@@@@@@@@@@@@@@@@

    The totals and worst cases above give more weight to languages
    that produce longer encodings, which arguably yields a good metric
    (because being efficient for easy languages is arguably less
    important than being efficient for difficult languages). We can
    alternatively give each language equal weight by dividing each
    output length by the corresponding Super-ACE output length.
    This method yields:

    totals:
      RACE:       14.9  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      LACE:       14.5  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AltDUDE:    13.0  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-R:  12.0  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-O:  11.8  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-M:  11.4  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      BRACE:      11.4  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      Super-ACE:  11.0  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    worst cases:
      RACE:       2.00  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      LACE:       1.71  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AltDUDE:    1.33  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-R:  1.25  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-O:  1.20  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      AMC-ACE-M:  1.20  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      BRACE:      1.11  @@@@@@@@@@@@@@@@@@@@@@@@@@@@
      Super-ACE:  1.00  @@@@@@@@@@@@@@@@@@@@@@@@@

    No matter which way we average, the results suggest that AltDUDE is
    preferable to RACE and LACE, because it is no more complex, is more
    efficient, and has better support for case preservation.

    The results also suggest that AMC-ACE-M is preferable to BRACE,
    because it has similar efficiency, is a little simpler, and has
    better support for case preservation.

    AltDUDE, AMC-ACE-R, AMC-ACE-O, and AMC-ACE-M are progressively
    more complex and more efficient, and have equal support for case
    preservation. The choice depends on how much efficiency is required
    and how much complexity is acceptable.

    The efficiency gaps between AMC-ACE-M, AMC-ACE-O, and AMC-ACE-R are
    mostly due to the Korean (Hangul) string. Of the 15 characters
    by which the AMC-ACE-M total beats the AMC-ACE-O total, 9 come
    from the Korean string. Similarly, of the 13 characters by which
    the AMC-ACE-O total beats the AMC-ACE-R total, 9 come from the
    Korean string.
The large increases in complexity from AMC-ACE-R to 601 -O to -M yield significant efficiency gains for Korean, but only 602 very small gains for the other languages. More sample strings 603 from more languages need to be tried before one can conclude that 604 Korean is the only significant beneficiary, but if it is, then this 605 author would suggest that AMC-ACE-R is preferable to -O and -M, with 606 apologies to Korean speakers. 608 That would leave a choice between AltDUDE and AMC-ACE-R, the latter 609 being somewhat more complex and somewhat more efficient. 611 Example strings 613 In the ACE encodings below, signatures (like "bq--" for RACE) are 614 not shown. Non-LDH characters in the Unicode string are forced to 615 lowercase before being encoded. For RACE, LACE, and AltDUDE, the 616 letters A-Z are likewise forced to lowercase. UTF-8 and UTF-16 are 617 included for length comparisons, with non-ASCII bytes shown as "?". 618 AMC-ACE-* and AltDUDE are abbreviated AMC-* and ADUDE. Backslashes 619 show where line breaks have been inserted in ACE strings too long 620 for one line. The RACE and LACE encodings are courtesy of Mark 621 Davis's online UTF converter [UTFCONV] (slightly modified to remove 622 the length restrictions). 624 The first several examples are all translations of the sentence "Why 625 can't they just speak in ?" (courtesy of Michael Kaplan's 626 "provincial" page [PROVINCIAL]). Word breaks and punctuation have 627 been removed, as is often done in domain names. 629 (A) Arabic (Egyptian): 630 U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644 631 U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F 633 ADUDE: yueqpcycrcyjhbpznpitjycxf 634 BRACE: 28akcjwcmp3ciwb4t3ngd4nbaz 635 AMC-R: ywekhfuhuikwdwefivevjbuiwktr 636 AMC-O: ageekhfuhuiukdefivevjvbuiktr 637 AMC-M: agiekhfuhuiukdefivevjvbuiktr 638 RACE: azceur2fe4ucuq2eivediojrfbfb6 639 LACE: cedeisshiutsqksdircuqnbzgeueuhy 640 UTF-16: ?????????????????????????????????? 
        UTF-8:  ??????????????????????????????????

    (B) Chinese (simplified):
        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

        UTF-16: ??????????????????
        BRACE:  kgcqqsgp26i5h4zn7req5i
        AMC-M:  uqj7g8nvk6awispn9wupdnh
        AMC-R:  w87g8nvk6awisp259eupyx2h
        AMC-O:  eqpg8nvk6awisp259eupyx2h
        ADUDE:  w85gvk7g9k2iwf6x9j6x7ju54k
        UTF-8:  ???????????????????????????
        LACE:   azhnn3b2ybea2aml6qau4libmwdq
        RACE:   3bhnmtxmjy5e5qcojbha3c7ujywwlby

    (C) Czech: Proprostnemluvesky

        = U+010D
        = U+011B
        = U+00ED

        UTF-8:  Pro??prost??nemluv????esky
        AMC-O:  piq-Pro-p-prost-9m-nemluv-6pp-esky
        AMC-M:  g26-Pro-p-prost-9m-nemluv-6pp-esky
        AMC-R:  -Pro-tsp-prost-ttm-nemluv-s8psp-esky
        BRACE:  i32-Pro-u-prost-8y-nemluv-29f3n-esky
        ADUDE:  tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc
        UTF-16: ????????????????????????????????????????????
        LACE:   amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\
                lpe
        RACE:   ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\
                7mx7xh73l754q

    (D) Hebrew:
        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
        U+05D1 U+05E8 U+05D9 U+05EA

        AMC-O:  afpnqeep8e8jfinaqdb8ijp8cb8ij8k
        AMC-M:  af4nqeep8e8jfinaqdb8ijp8cb8ij8k
        AMC-R:  x7nqeep8e8j7f7inaqdb8ijp8cb8ij8k
        ADUDE:  x5nckajvjpvnpenqpcvjvbevrvdvjvbvd
        BRACE:  27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c
        RACE:   axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u
        LACE:   cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q
        UTF-8:  ????????????????????????????????????????????
        UTF-16: ????????????????????????????????????????????

    (E) Hindi:
        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
        U+0939 U+0948 U+0902 (Devanagari)

        BRACE:  2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32
        RACE:   bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai
        LACE:   dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba
        AMC-O:  ajeurvjvcmthvjvruipugatfpurmscuivjascunmvcvitfuehvjisc
        AMC-M:  ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc
        AMC-R:  3urvjvcwmthjruiwpugwatfwpurmscuivjascunmvcvitfuewhjwisc
        ADUDE:  3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzqz\
                bwk
        UTF-16: ???????????????????????????????????????????????????????\
                ?????
        UTF-8:  ???????????????????????????????????????????????????????\
                ???????????????????????????????????

    (F) Japanese:
        U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
        U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
        (kanji and hiragana)

        UTF-16: ????????????????????????????????????
        BRACE:  ji8nr5zj8uqth7v97mjchakwcg7dqemw88nj5gbe
        AMC-O:  gvagkxnzr3dkx8fzun243q3c24zbxhgwr2nkweqwm
        AMC-R:  vsykxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm
        AMC-M:  bsnkxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm
        ADUDE:  vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf
        LACE:   auyguxd7snvaczpfaftsyamktyatbeqbrjyqqmcxmzhyy2senzfq
        UTF-8:  ??????????????????????????????????????????????????????
        RACE:   3aygumc4gb7tbezqnjs6kzzmrkpdbeukoeyfomdggbhtbdbqniyeimd\
                ogbfq

    (G) Korean:
        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables)

        UTF-16: ????????????????????????????????????????????????
        UTF-8:  ???????????????????????????????????????????????????????\
                ?????????????????
        AMC-M:  yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\
                p6aqi23ieemweywn
        BRACE:  y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\
                gmnsqydmk7im56arju6pt82
        LACE:   77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
                exj2mlpfzzcyjrsely5ck4ta
        RACE:   3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
                exj2mlpfzzcyjrsely5ck4ta
        AMC-O:  m6hwq6tvi466exi44ia6s4nz2neze7xxn47yp6x5e3znze7xze7xxnu\
                8e4ze6x5n36is3i622mwe48wn
        ADUDE:  6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
                usauiyz5i23az96iz6ze3xaz2td96ry3si
        AMC-R:  6tvi466ezxi544i5w8a6s4nz2nw8e6zze7xxn47yp6x5e53znze7xze\
                7xxn5u8e54ze6x5n36is3i622m6zwe48wn

    (H) Russian:
        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
        U+0438 (Cyrillic)

        ADUDE:  wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc
        AMC-M:  aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca
        AMC-R:  wvRqwhfnwdgfqpipfdqcqwawrcvrvqwawdbbvkvi
        AMC-O:  aedRqwhfnwdgfqpipfdqcqwawrwcrqwawdwbwbki
        BRACE:  269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2
        RACE:   aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa
        LACE:   dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory
        UTF-16: ?????????????????????????????????????????????????????\
                ???
        UTF-8:  ?????????????????????????????????????????????????????\
                ???

    (I) Spanish: PorqunopuedensimplementehablarenEspaol

        = U+00E9
        = U+00F1

        UTF-8:  Porqu??nopuedensimplementehablarenEspa??ol
        AMC-R:  -Porqu-8j-nopuedensimplementehablarenEspa-9b-ol
        AMC-M:  aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol
        BRACE:  22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol
        AMC-O:  aaq-Porqu-j-nopuedensimplementehablarenEspa-9b-ol
        ADUDE:  tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb3\
                a3qd
        RACE:   abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\
                fonygd4lpnq
        LACE:   faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\
                omvzxayprn5wa
        UTF-16: ???????????????????????????????????????????????????????\
                ?????????????????????????

    (J) Taiwanese:
        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

        UTF-16: ??????????????????
        UTF-8:  ???????????????????????????
        AMC-M:  uqj7g2tbgtu6a385pspnxkupdnh
        BRACE:  kgcqui49gatc2wyrn8y7cndgte9
        AMC-R:  w87gxstbzuvc6a385psp244kupyx2h
        AMC-O:  eqpgxstbzuvc6a385psp244kupyx2h
        RACE:   3bhnmuaroize5qe6xvha3cvkjywwlby
        LACE:   75hnmuaroize5qe6xvha3cvkjywwlby
        ADUDE:  w85gt86huuudv69c7szp7s5a6w4h6w2hu54k

    (K) Vietnamese:
        Taisaohokhngthchi\
        noitingVit

        = U+0323
        = U+00F4
        = U+00EA
        = U+0309
        = U+0301

        UTF-8:  Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t
        AMC-O:  aava-Ta-vud-isaoho-vud-kh-9e-ngth-8kj-chi-j-no-b-iti-8k\
                b-ngVi-8kvud-t
        AMC-M:  ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\
                b-ngVi-s8kud-t
        AMC-R:  -Ta-vud-isaoho-vud-kh-9e-ngth-8kvsj-chi-vsj-no-b-iti-s8\
                kb-ngVi-s8kud-t
        BRACE:  i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\
                9c29-ngVi-25p48-t
        UTF-16: ???????????????????????????????????????????????????????\
                ?????????????????????
        ADUDE:  tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqvy\
                itptp2dv8mvyrjtBtr2dv6jvxh
        LACE:   aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\
                ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi
        RACE:   ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\
                kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\
                3u

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3B (Japanese TV program title)

        = U+5E74 (kanji)
        = U+7D44 (kanji)
        = U+91D1 U+516B U+5148 U+751F (kanji)

        UTF-16: ????????????????
        UTF-8:  3???B???????????????
        AMC-M:  utk-3-8ze-B-hkenqtymwifi9
        BRACE:  u-3-ygj-b-ynb6gjc7pp4k5p5w
        AMC-O:  fb8h-3-e-B-z7we3t7bymwizxtr
        ADUDE:  xdx8whx8tGz7ug863f6s5kuduwxh
        RACE:   3aadgxtuabrh2rer2fiwwukioupq
        LACE:   74adgxtuabrh2rer2fiwwukioupq
        AMC-R:  -3-x8ze-B-z7we3t7bxtymtwizxtr

    (M) -with-SUPER-MONKEYS (Japanese music group name)

        = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji)

        UTF-8:  ???????????????-with-SUPER-MONKEYS
        AMC-M:  u5m2j4etwif6q2zf---with--SUPER--MONKEYS
        AMC-R:  x52j4e3wiz92qyszf---with--SUPER--MONKEYS
        AMC-O:  fmij4e3wiz92qyszf---with--SUPER--MONKEYS
        BRACE:  uvj7fuaqcahy982xa---with--SUPER--MONKEYS
        ADUDE:  x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK
        UTF-16: ????????????????????????????????????????????????
        LACE:   ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y
        RACE:   3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\
                nabwqa3yanyagwadfab4qa4y

    (N) Hello-Another-Way- (Japanese song title)

        = U+305D U+308C U+305E U+308C U+306E (hiragana)
        = U+5834 U+6240 (kanji)

        UTF-8:  Hello-Another-Way-?????????????????????
        BRACE:  ji7-Hello--Another--Way---v3jhaefvd2ufj62
        AMC-O:  daf-Hello--Another--Way---p2nq2nyqx2veyuwa
        AMC-M:  bsk-Hello--Another--Way---p2nq2nyqx2veyuwa
        ADUDE:  Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze
        AMC-R:  -Hello--Another--Way---vsxpvs2nxq2nyqx2veyuwa
        UTF-16: ??????????????????????????????????????????????????
        LACE:   ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea
        RACE:   3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\
                zaawtaxjqrqyf4memgbxfqndcia

    (O) 2 (Japanese TV program title)

        = U+3072 U+3068 U+3064 (hiragana)
        = U+5C4B U+6839 (kanji)
        = U+306E (hiragana)
        = U+4E0B (kanji)

        UTF-16: ????????????????
        UTF-8:  ?????????????????????2
        AMC-O:  dagzciex6wmy2vjqw8sm-2
        AMC-M:  bsnzciex6wmy2vjqw8sm-2
        BRACE:  ji96u56uwbhf2wqxnw4s-2
        AMC-R:  vszcyiyex6wmy2vjqw8sm-2
        ADUDE:  vstctkny6urvwzcx2xhz8yfw8vj
        RACE:   3ayhemdigbsfys3iheyg4tqlaaza
        LACE:   74yhemdigbsfys3iheyg4tqlaaza

    (P) MajiKoi5 (Japanese song title)

        = U+3067 (hiragana)
        = U+3059 U+308B (hiragana)
        = U+79D2 U+524D (kanji)

        UTF-8:  Maji???Koi??????5??????
        UTF-16: ??????????????????????????
        AMC-M:  bsm-Maji-r-Koi-b2m-5-z37cxuwp
        BRACE:  ji8-Maji-g-Koi-qe7x-5-wx7p6ma
        AMC-O:  dag-Maji-h-Koi-xj2m-5-z37cxuwp
        ADUDE:  PnmdvssqvssNegvsva7cvs5qz38hu53r
        AMC-R:  -Maji-vsyh-Koi-vsxj2m-5-z37cxuwp
        RACE:   3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq
        LACE:   74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq

    (Q) de (Japanese song title)

        = U+30D1 U+30D5 U+30A3 U+30FC (katakana)
        = U+30EB U+30F3 U+30D0 (katakana)

        UTF-16: ??????????????????
        BRACE:  3iu8pazt-de-pygi
        AMC-O:  dapbf4d9n-de-8m9da
        AMC-M:  bs3jp4d9n-de-8m9di
        AMC-R:  vs7bf4d9n-de-8m9d7a
        RACE:   gdi5li7475sp6zpl6pia
        ADUDE:  vs5bezgxrvs3ibvs2qtiud
        UTF-8:  ????????????de?????????
        LACE:   aqyndvnd7qbaazdfamyox46q

    (R) (Japanese song title)

        = U+305D U+306E (hiragana)
        = U+30B9 U+30D4 U+30FC U+30C9 (katakana)
        = U+3067 (hiragana)

        RACE:   gbow5oou7tewo
        UTF-16: ??????????????
        BRACE:  bidprdmp9wt7mi
        LACE:   a4yf23vz2t6mszy
        AMC-O:  dagxpq5j7e9n6jh
        AMC-M:  bsmfyq5j7e9n6jr
        ADUDE:  vsvpvd7hypuivf4q
        AMC-R:  vsxpyq5j7e9n6jyh
        UTF-8:  ?????????????????????

    The last example is an ASCII string that breaks not only the
    existing rules for host name labels but also the rules proposed in
    [NAMEPREP03] for internationalized domain names.

    (S) -> $1.00 <-

        UTF-8:  -> $1.00 <-
        ADUDE:  -xqtqetftrtqatatn-
        RACE:   aawt4ibegexdambahqwq
        LACE:   bmac2praeqys4mbqea6c2
        AMC-R:  --vquaue-1-q-00-avn--
        UTF-16: ??????????????????????
        AMC-O:  aac--vqae-1-q-00-avn--
        AMC-M:  aae--vqae-1-q-00-avn--
        BRACE:  229--t2b4-1-w-00-i9i--

Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority.  If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another.  Therefore AMC-ACE-R is
    designed so that each Unicode string has a unique encoding.

    However, there can still be multiple Unicode representations of the
    "same" text, for various definitions of "same".  This problem is
    addressed to some extent by the Unicode standard under the topic of
    canonicalization, and this work is leveraged for domain names by
    "nameprep" [NAMEPREP03].

Credits

    AMC-ACE-R reuses a number of preexisting techniques.

    The basic encoding of integers to nybbles to quintets to base-32
    comes from UTF-5 [UTF5], and the particular variant used here comes
    from AMC-ACE-M [AMCACEM00].
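
    The integer-to-quintet step credited above can be sketched in
    isolation.  The fragment below is only an illustration written for
    this note: the names quintet_alphabet and encode_quintets are
    hypothetical and do not appear in the example implementation.  It
    assumes the variant visible in that implementation, where each
    base-32 character carries one nybble and every character except the
    last has the high (fifth) bit set.

```c
/* The 32-character alphabet a-k, m-n, p-z, 2-9 (no l, o, 0, or 1), */
/* given as ASCII codes in the same order as the base32[] table in  */
/* the example implementation.                                      */
static const char quintet_alphabet[32] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* encode_quintets() is a hypothetical helper (an assumption of    */
/* this sketch).  It writes value as exactly ndigits base-32       */
/* characters plus a null terminator, so out needs ndigits+1       */
/* bytes.  Returns 0 on success, or -1 if ndigits is outside 1..5  */
/* or value does not fit in ndigits nybbles.                       */
static int encode_quintets(unsigned long value, unsigned int ndigits,
                           char *out)
{
  unsigned int i;

  if (ndigits < 1 || ndigits > 5) return -1;
  if ((value >> (4 * ndigits)) != 0) return -1;

  /* The last digit has the high bit clear: */
  out[ndigits - 1] = quintet_alphabet[value & 0xF];

  /* Every earlier digit has the high bit set: */
  for (i = ndigits - 1; i-- > 0; ) {
    value >>= 4;
    out[i] = quintet_alphabet[0x10 | (value & 0xF)];
  }

  out[ndigits] = 0;
  return 0;
}
```

    Encoding 0x644 in three digits this way gives "ywe", which agrees
    with the first three characters of the AMC-R row of example (A),
    whose first code point is U+0644.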

    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
    from SFS [SFS].

    The idea of encoding deltas from reference points was taken from
    RACE (of which the latest version is [RACE03]), which may have
    gotten the idea from Unicode Technical Standard #6 [UTS6].

    The idea of switching between literal mode and base-32 mode comes
    from BRACE [BRACE00].

    The general idea of using the alphabetic case of base-32 characters
    to record the desired case of the Unicode characters was suggested
    by this author, and first applied to the UTF-5-style encoding in
    DUDE (of which the latest version is [DUDE01]).

    The heuristic used to adapt the reference points based on past code
    points is new in AMC-ACE-R.

References

    [AltDUDE00] Adam Costello, "AltDUDE version 0.0.2", 2001-Mar-19,
    draft-ietf-idn-altdude-00.

    [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
    draft-ietf-idn-amc-ace-m-00.

    [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19,
    draft-ietf-idn-amc-ace-o-00.

    [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based
    ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19,
    draft-ietf-idn-brace-00.

    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII
    Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
    http://www.trigeminal.com/samples/provincial.html.

    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
    "Requirements for Internet Hosts -- Application and Support",
    1989-Oct, RFC 1123.

    [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)",
    draft-ietf-idn-sace-*.

    [SFS] David Mazieres et al., "Self-certifying File System",
    http://www.fs.net/.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

    [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another
    ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*.

    [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed,
    Mark Davis, Asmus Freytag, "Unicode Technical Standard
    #6: A Standard Compression Scheme for Unicode",
    http://www.unicode.org/unicode/reports/tr6/.

    [UTFCONV] Mark Davis, "UTF Converter",
    http://www.macchiato.com/unicode/convert.html.

Author

    Adam M. Costello
    http://www.cs.berkeley.edu/~amc/

Example implementation

/******************************************/
/* amc-ace-r.c 0.0.0 (2001-Mar-27-Tue)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-R version 0.0.*.
*/

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_invalid_input,
  amc_ace_big_output
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x10FFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output );

/* amc_ace_r_encode() converts Unicode to AMC-ACE-R (without */
/* any signature).  The input must be represented as an array */
/* of Unicode code points (not code units; surrogate pairs */
/* are not allowed), and the output will be represented as */
/* null-terminated ASCII.  The input_length is the number of */
/* code points in the input.  The output_size is an in/out */
/* argument: the caller must pass in the maximum number of */
/* characters that may be output (including the terminating */
/* null), and on successful return it will contain the number of */
/* characters actually output (including the terminating null, */
/* so it will be one more than strlen() would return, which is */
/* why it is called output_size rather than output_length).  The */
/* uppercase_flags array must hold input_length boolean values, */
/* where nonzero means the corresponding Unicode character should */
/* be forced to uppercase after being decoded, and zero means it */
/* is caseless or should be forced to lowercase.  Alternatively, */
/* uppercase_flags may be a null pointer, which is equivalent */
/* to all zeros.  The letters a-z and A-Z are always encoded */
/* literally, regardless of the corresponding flags.  The encoder */
/* always outputs lowercase base-32 characters except when */
/* nonzero values of uppercase_flags require otherwise.  The */
/* return value may be any of the amc_ace_status values defined */
/* above; if not amc_ace_success, then output_size and output may */
/* contain garbage.  On success, the encoder will never need to */
/* write an output_size greater than input_length*5+1, because of */
/* how the encoding is defined. */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags );

/* amc_ace_r_decode() converts AMC-ACE-R (without any signature) */
/* to Unicode.  The input must be represented as null-terminated */
/* ASCII, and the output will be represented as an array of */
/* Unicode code points.  The case_sensitivity argument influences */
/* the check on the well-formedness of the input string; it */
/* must be case_sensitive if case-sensitive comparisons are */
/* allowed on encoded strings, case_insensitive otherwise. */
/* The scratch_space must point to space at least as large */
/* as the input, which will get overwritten (this allows the */
/* decoder to avoid calling malloc()).  The output_length is */
/* an in/out argument: the caller must pass in the maximum */
/* number of code points that may be output, and on successful */
/* return it will contain the actual number of code points */
/* output.  The uppercase_flags array must have room for at */
/* least output_length values, or it may be a null pointer */
/* if the case information is not needed.  A nonzero flag */
/* indicates that the corresponding Unicode character should */
/* be forced to uppercase by the caller, while zero means it */
/* is caseless or should be forced to lowercase.  The letters */
/* a-z and A-Z are output already in the proper case, but their */
/* flags will be set appropriately so that applying the flags */
/* would be harmless.  The return value may be any of the */
/* amc_ace_status values defined above; if not amc_ace_success, */
/* then output_length, output, and uppercase_flags may contain */
/* garbage.  On success, the decoder will never need to write */
/* an output_length greater than the length of the input (not */
/* counting the null terminator), because of how the encoding is */
/* defined. */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* is_ldh(codept) returns 1 if codept is an ASCII letter, */
/* digit, or hyphen-minus, zero otherwise. */

static int is_ldh(u_code_point codept)
{
  return codept > 122 ? 0 :
         codept >= 97 ? 1 :
         codept > 90 ? 0 :
         codept >= 65 ? 1 :
         codept > 57 ? 0 :
         codept >= 48 ? 1 :
         codept == 45 ;
}

/* is_AtoZ(c) returns 1 if c is an */
/* uppercase ASCII letter, zero otherwise. */

static unsigned char is_AtoZ(char c)
{
  return c >= 65 && c <= 90;
}

/* base32[n] is the lowercase base-32 character representing */
/* the number n from the range 0 to 31.  Note that we cannot */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII. */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character. */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive, */
/* then ASCII A-Z are considered equal to a-z respectively. */

static int unequal( enum case_sensitivity case_sensitivity,
                    const char *s1, const char *s2 )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* update_refpoints(refpoint,history,latest) updates refpoint[1..3] */
/* based on the history, where history[latest] is the latest code */
/* point. */

void update_refpoints( u_code_point *refpoint,
                       const u_code_point *history, unsigned int latest )
{
  unsigned int k, b, i;

  for (k = 1; k <= 3; ++k) {
    b = k << 2;
    if (latest == 0) refpoint[k] = (history[0] >> b) << b;
    else for (i = latest; i-- > 0; ) {
      if (is_ldh(history[i])) continue;
      if ((refpoint[k] ^ history[i]) >> b == 0) break;

      if ((history[latest] ^ history[i]) >> b == 0) {
        refpoint[k] = (history[latest] >> b) << b;
        return;
      }
    }
  }
}

/* Main encode function: */

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output )
{
  unsigned int max_out, next_out, literal, i, k, out;
  u_code_point codept, delta;
  char shift;
  u_code_point refpoint[6] = {0, 0x60, 0, 0, 0, 0x10000};

  max_out = *output_size;
  next_out = 0;
  literal = 0;

  for (i = 0; i < input_length; ++i) {
    codept = input[i];
    if (codept > 0x10FFFF) return amc_ace_invalid_input;

    if (codept == 0x2D) {
      /* hyphen-minus is doubled, so two characters of room */
      /* are needed, not one */
      if (max_out - next_out < 2) return amc_ace_big_output;
      output[next_out++] = 0x2D;
      output[next_out++] = 0x2D;
    }
    else if (is_ldh(codept)) {
      /* encode LDH character literally */
      if (max_out - next_out < 1 + !literal) return amc_ace_big_output;
      /* switch to literal mode if necessary: */
      if (!literal) output[next_out++] = 0x2D;
      literal = 1;
      output[next_out++] = codept;
    }
    else {
      /* encode non-LDH character using base-32 */

      shift = uppercase_flags && uppercase_flags[i] ? 32 : 0;
      /* shift will determine the case of the last base-32 digit */

      for (k = 1; ; ++k) {
        delta = codept - refpoint[k];
        if (delta >> (4*k) == 0) break;
      }

      /* We will encode delta as k base-32 digits. */

      if (max_out - next_out < k + literal) return amc_ace_big_output;
      /* switch to base-32 mode if necessary: */
      if (literal) output[next_out++] = 0x2D;
      literal = 0;

      /* Computing the base-32 digits in reverse order is easiest. */
      /* Only the last base-32 digit has the high bit clear. */

      out = next_out + k - 1;
      output[out] = base32[delta & 0xF] - shift;

      while (out > next_out) {
        delta >>= 4;
        output[--out] = base32[0x10 | (delta & 0xF)];
      }

      next_out += k;
      update_refpoints(refpoint,input,i);
    }
  }

  /* null terminator: */
  if (max_out - next_out < 1) return amc_ace_big_output;
  output[next_out++] = 0;
  *output_size = next_out;
  return amc_ace_success;
}

/* Main decode function: */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags )
{
  u_code_point q, delta;
  const char *in, *first;
  char c;
  unsigned int next_out, max_out, literal, input_size, scratch_size;
  enum amc_ace_status status;
  u_code_point refpoint[6] = {0, 0x60, 0, 0, 0, 0x10000};

  max_out = *output_length;
  next_out = 0;
  in = input;
  literal = 0;

  for (c = *in; c != 0; ) {
    if (c == 45 && in[1] != 45) {
      /* unpaired hyphen-minus toggles mode */
      literal = !literal;
      c = *++in;
      continue;
    }

    if (max_out - next_out < 1) return amc_ace_big_output;

    if (c == 45) {
      /* double hyphen-minus represents a hyphen-minus */
      if (uppercase_flags) uppercase_flags[next_out] = 0;
      output[next_out] = 45;
      c = *(in += 2);
    }
    else {
      if (literal) {
        /* literal mode: the character is itself the code point */
        output[next_out] = c;
        c = *++in;
      }
      else {
        /* Base-32 sequence: */

        delta = 0;
        first = in;

        do {
          /* A sequence has at most 5 digits; a longer run */
          /* would index past the end of refpoint[]. */
          if (in - first >= 5) return amc_ace_invalid_input;
          q = base32_decode(c);
          if (q == base32_invalid) return amc_ace_invalid_input;
          delta = (delta << 4) | (q & 0xF);
          c = *++in;
        } while (q >> 4 == 1);

        output[next_out] = refpoint[in - first] + delta;
        update_refpoints(refpoint, output, next_out);
      }

      /* case of last digit determines uppercase flag: */
      if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]);
    }

    ++next_out;
  }

  /* Re-encode the output and compare to the input: */

  input_size = in - input + 1;
  scratch_size = input_size;
  status = amc_ace_r_encode(next_out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success ||
      scratch_size != input_size ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_invalid_input;
  *output_length = next_out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option. */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive  /* good for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads big-endian UTF-32 and writes AMC-ACE-R ASCII.\n"
    "%s -d reads AMC-ACE-R ASCII and writes big-endian UTF-32.\n"
    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;

  if (argc != 2) usage(argv);
  /* The option must be exactly two characters; checking argv[1][1] */
  /* first avoids reading past the end of a bare "-". */
  if (argv[1][0] != '-' || argv[1][1] == '\0' || argv[1][2] != '\0')
    usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size];
    unsigned int input_length, output_size;
    int c0, c1, c2, c3;

    /* Read the UTF-32 input string: */

    input_length = 0;

    for (;;) {
      c0 = getchar();
      c1 = getchar();
      c2 = getchar();
      c3 = getchar();
      if (ferror(stdin)) fail(io_error);

      if (c1 == EOF || c2 == EOF || c3 == EOF) {
        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
        break;
      }

      if (input_length == unicode_max_length) fail(too_big);

      if ((c0 != 0 && c0 != 0x80)
          || c1 < 0 || c1 > 0x10
          || c2 < 0 || c2 > 0xFF
          || c3 < 0 || c3 > 0xFF ) {
        fail(invalid_input);
      }

      input[input_length] = ((u_code_point) c1 << 16) |
                            ((u_code_point) c2 << 8) |
                            (u_code_point) c3 ;
      uppercase_flags[input_length] = (c0 >> 7);
      ++input_length;
    }

    /* Encode, and output the result: */

    output_size = ace_max_size;
    status = amc_ace_r_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_invalid_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);
    r = fputs(output,stdout);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size];
    u_code_point output[unicode_max_length], codept;
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int output_length, i;

    /* Read the AMC-ACE-R ASCII input string: */

    /* fgets() leaves the buffer untouched on empty input, */
    /* so make sure it holds an empty string in that case. */
    if (fgets(input, ace_max_size, stdin) == NULL) input[0] = '\0';
    if (ferror(stdin)) fail(io_error);
    if (!feof(stdin)) fail(too_big);

    /* Decode, and output the result: */

    output_length = unicode_max_length;
    status = amc_ace_r_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_invalid_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    for (i = 0; i < output_length; ++i) {
      r = putchar(uppercase_flags[i] ? 0x80 : 0);
      if (r == EOF) fail(io_error);
      codept = output[i];
      r = putchar(codept >> 16);
      if (r == EOF) fail(io_error);
      r = putchar((codept >> 8) & 0xFF);
      if (r == EOF) fail(io_error);
      r = putchar(codept & 0xFF);
      if (r == EOF) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

INTERNET-DRAFT expires 2001-Sep-27
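
The test wrapper's -e mode reads "extended" big-endian UTF-32, with bit
31 of each 32-bit unit used as the force-to-uppercase flag.  As a rough
illustration of how such input could be prepared, the sketch below packs
code points and flags into that byte layout.  The helper name
pack_extended_utf32 is hypothetical and is not part of the example
implementation; the layout is taken only from the wrapper's usage
message and its input checks.

```c
/* pack_extended_utf32() is a hypothetical helper (an assumption of  */
/* this sketch).  For each code point it writes four bytes to out:   */
/* 0x80 or 0x00 (the uppercase flag in bit 31), then the code point  */
/* in big-endian order.  out must have room for 4*length bytes.      */
/* uppercase_flags may be a null pointer, meaning all zeros.         */
/* Returns 0 on success, or -1 if a code point exceeds 0x10FFFF.     */
static int pack_extended_utf32(const unsigned long *codepts,
                               const unsigned char *uppercase_flags,
                               unsigned int length,
                               unsigned char *out)
{
  unsigned int i;

  for (i = 0; i < length; ++i) {
    unsigned long cp = codepts[i];

    if (cp > 0x10FFFFUL) return -1;

    out[4*i]     = (uppercase_flags && uppercase_flags[i]) ? 0x80 : 0;
    out[4*i + 1] = (unsigned char) (cp >> 16);
    out[4*i + 2] = (unsigned char) ((cp >> 8) & 0xFF);
    out[4*i + 3] = (unsigned char) (cp & 0xFF);
  }

  return 0;
}
```

Writing the packed bytes to a file and feeding that file to the wrapper
with -e would then exercise the encoder; the wrapper rejects any unit
whose first byte is neither 0x00 nor 0x80.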