idnits 2.17.1 draft-ietf-idn-amc-ace-o-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 2 longer pages, the longest (page 13) being 59 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 21 instances of too long lines in the document, the longest one being 6 characters in excess of 72. ** The abstract seems to contain references ([UNICODE], [AMCACEM00], [RFC1123], [RFC952], [IDN]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '2' on line 1643 -- Looks like a reference, but probably isn't: '1' on line 1695 -- Looks like a reference, but probably isn't: '3' on line 1534 -- Looks like a reference, but probably isn't: '6' on line 1392 -- Looks like a reference, but probably isn't: '0' on line 1642 -- Looks like a reference, but probably isn't: '4' on line 1534 == Missing Reference: '--i' is mentioned on line 1330, but not defined -- Looks like a reference, but probably isn't: '5' on line 1533 == Missing Reference: '-1' is mentioned on line 1576, but not defined == Unused Reference: 'LACE01' is defined on line 1049, but no explicit reference was found in the text == Outdated reference: A later version (-10) exists of draft-ietf-idn-nameprep-03 -- Possible downref: Normative reference to a draft: ref. 'RACE03' -- Possible downref: Normative reference to a draft: ref. 'AltDUDE00' -- Possible downref: Normative reference to a draft: ref. 'AMCACEM00' -- Possible downref: Normative reference to a draft: ref. 'BRACE00' == Outdated reference: A later version (-02) exists of draft-ietf-idn-dude-01 -- Possible downref: Normative reference to a draft: ref. 'DUDE01' -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN' -- Possible downref: Normative reference to a draft: ref. 'LACE01' -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL' ** Downref: Normative reference to an Unknown state RFC: RFC 952 -- No information found for draft-ietf-idn-sace- - is the name correct? -- Possible downref: Normative reference to a draft: ref. 'SACE' -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS' -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE' -- No information found for draft-jseng-utf5- - is the name correct? -- Possible downref: Normative reference to a draft: ref. 'UTF5' -- No information found for draft-ietf-idn-utf6- - is the name correct? -- Possible downref: Normative reference to a draft: ref. 'UTF6' -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS6' -- Possible downref: Non-RFC (?) normative reference: ref. 'UTFCONV' Summary: 6 errors (**), 0 flaws (~~), 7 warnings (==), 28 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 INTERNET-DRAFT Adam M. Costello 2 draft-ietf-idn-amc-ace-o-00.txt 2001-Mar-19 3 Expires 2001-Sep-19 5 AMC-ACE-O version 0.0.3 7 Status of this Memo 9 This document is an Internet-Draft and is in full conformance with 10 all provisions of Section 10 of RFC2026. 12 Internet-Drafts are working documents of the Internet Engineering 13 Task Force (IETF), its areas, and its working groups. Note 14 that other groups may also distribute working documents as 15 Internet-Drafts. 17 Internet-Drafts are draft documents valid for a maximum of six 18 months and may be updated, replaced, or obsoleted by other documents 19 at any time. It is inappropriate to use Internet-Drafts as 20 reference material or to cite them other than as "work in progress." 22 The list of current Internet-Drafts can be accessed at 23 http://www.ietf.org/ietf/1id-abstracts.txt 25 The list of Internet-Draft Shadow Directories can be accessed at 26 http://www.ietf.org/shadow.html 28 Distribution of this document is unlimited. Please send comments 29 to the author at amc@cs.berkeley.edu, or to the idn working 30 group at idn@ops.ietf.org. A non-paginated (and possibly 31 newer) version of this specification may be available at 32 http://www.cs.berkeley.edu/~amc/charset/amc-ace-o 34 Abstract 36 AMC-ACE-O is a reversible map from a sequence of Unicode [UNICODE] 37 characters to a sequence of letters (A-Z, a-z), digits (0-9), and 38 hyphen-minus (-), henceforth called LDH characters. Such a map 39 (called an "ASCII-Compatible Encoding", or ACE) might be useful for 40 internationalized domain names [IDN], because host name labels are 41 currently restricted to LDH characters by [RFC952] and [RFC1123]. 43 AMC-ACE-O is similar to AMC-ACE-M [AMCACEM00] but is simpler and 44 slightly less efficient. 46 Besides domain names, there might also be other contexts where it is 47 useful to transform Unicode characters into "safe" (delimiter-free) 48 ASCII characters. (If other contexts consider hyphen-minus to be 49 unsafe, a different character could be used to play its role, like 50 underscore.) 51 Contents 53 Features 54 Name 55 Overview 56 Base-32 characters 57 Encoding and decoding algorithms 58 Signature 59 Case sensitivity models 60 Comparison with RACE, BRACE, LACE, DUDE, AMC-ACE-M 61 Example strings 62 Security considerations 63 Credits 64 References 65 Author 66 Example implementation 68 Features 70 Uniqueness: Every Unicode string maps to at most one LDH string. 72 Completeness: Every Unicode string maps to an LDH string. 73 Restrictions on which Unicode strings are allowed, and on length, 74 may be imposed by higher layers. 76 Efficient encoding: The ratio of encoded size to original size is 77 small for all Unicode strings. This is important in the context 78 of domain names because [RFC1034] restricts the length of a domain 79 label to 63 characters. 81 Simplicity: The encoding and decoding algorithms are reasonably 82 simple to implement. The goals of efficiency and simplicity are at 83 odds; AMC-ACE-O aims at a good balance between them. 85 Case-preservation: If the Unicode string has been case-folded prior 86 to encoding, it is possible to record the case information in the 87 case of the letters in the encoding, allowing a mixed-case Unicode 88 string to be recovered if desired, but a case-insensitive comparison 89 of two encoded strings is equivalent to a case-insensitive 90 comparison of the Unicode strings. This feature is optional; see 91 section "Case sensitivity models". 93 Readability: The letters A-Z and a-z and the digits 0-9 appearing 94 in the Unicode string are represented as themselves in the label. 95 This comes for free because it usually the most efficient encoding 96 anyway. 98 Name 100 AMC-ACE-O is a working name that should be changed if it is adopted. 101 (The O merely indicates that it is the fifteenth ACE devised by this 102 author. BRACE was the third. D-L and N did not deliver enough 103 efficiency to justify their complexity.) Rather than waste good 104 names on experimental proposals, let's wait until one proposal is 105 chosen, then assign it a good name. Suggestions (assuming the 106 primary use is in domain names): 108 UniHost 109 UTF-D ("D" for "domain names") 110 UTF-37 (there are 37 characters in the output repertoire) 111 NUDE (Normal Unicode Domain Encoding) 113 Overview 115 AMC-ACE-O maps characters to characters--it does not consume or 116 produce code points, code units, or bytes, although the algorithm 117 makes use of code points, and implementations will of course need to 118 represent the input and output characters somehow, usually as bytes 119 or other code units. 121 Each character in the Unicode string is represented by an 122 integral number of characters in the encoded string. There is no 123 intermediate bit string or octet string. 125 The encoded string alternates between two modes: literal mode and 126 base-32 mode. LDH characters in the Unicode string are encoded 127 literally, except that hyphen-minus is doubled. Non-LDH characters 128 in the Unicode string are encoded using base-32, in which each 129 character of the encoded string represents five bits (a "quintet"). 130 A non-paired hyphen-minus in the encoded string indicates a mode 131 change. 133 In base-32 mode a variable-length code sequence of one to five 134 quintets represents a delta, which is added to a reference point 135 to yield a Unicode code point, which in turn represents a Unicode 136 character. (Surrogates, which are code units used by UTF-16 in 137 pairs to refer to code points, are not used with AMC-ACE-O.) There 138 is one reference point for each code length; they are chosen by the 139 encoder based on the input string and declared at the beginning 140 of the encoded string, and never change. Locality among the code 141 points is discovered and exploited by the encoder to make the 142 encoding more compact. 144 Base-32 characters 146 "a" = 0 = 0x00 = 00000 "s" = 16 = 0x10 = 10000 147 "b" = 1 = 0x01 = 00001 "t" = 17 = 0x11 = 10001 148 "c" = 2 = 0x02 = 00010 "u" = 18 = 0x12 = 10010 149 "d" = 3 = 0x03 = 00011 "v" = 19 = 0x13 = 10011 150 "e" = 4 = 0x04 = 00100 "w" = 20 = 0x14 = 10100 151 "f" = 5 = 0x05 = 00101 "x" = 21 = 0x15 = 10101 152 "g" = 6 = 0x06 = 00110 "y" = 22 = 0x16 = 10110 153 "h" = 7 = 0x07 = 00111 "z" = 23 = 0x17 = 10111 154 "i" = 8 = 0x08 = 01000 "2" = 24 = 0x18 = 11000 155 "j" = 9 = 0x09 = 01001 "3" = 25 = 0x19 = 11001 156 "k" = 10 = 0x0A = 01010 "4" = 26 = 0x1A = 11010 157 "m" = 11 = 0x0B = 01011 "5" = 27 = 0x1B = 11011 158 "n" = 12 = 0x0C = 01100 "6" = 28 = 0x1C = 11100 159 "p" = 13 = 0x0D = 01101 "7" = 29 = 0x1D = 11101 160 "q" = 14 = 0x0E = 01110 "8" = 30 = 0x1E = 11110 161 "r" = 15 = 0x0F = 01111 "9" = 31 = 0x1F = 11111 163 The digits "0" and "1" and the letters "o" and "l" are not used, to 164 avoid transcription errors. 166 All decoders must recognize both the uppercase and lowercase 167 forms of the base-32 characters. The case may or may not convey 168 information, as described in section "Case sensitivity models". 170 Encoding and decoding algorithms 172 The algorithms are given below as commented pseudocode. All 173 ordering of bits and quintets is big-endian (most significant 174 first). The >> and << operators used below mean bit shift, as in 175 C. For >> there is no question of logical versus arithmetic shift 176 because AMC-ACE-O makes no use of negative numbers. 178 primitives: 179 to_codepoint() # maps a character to a Unicode code point 180 from_codepoint() # maps a Unicode code point to a character 181 # These are no-ops if the implementation represents characters 182 # using Unicode code points. 184 subroutine names: 185 encode # main encoding function 186 decode # main decoding function 187 find_refpoint # scan the reference points for a suitable one 188 encode_point # encode one code point as base-32 189 decode_point # decode one code point from base-32 190 choose_refpoints # choose good reference points for the input 191 census # used by choose_refpoints 192 encode_refpoints # encode the reference points 193 decode_refpoints # decode the reference points 194 bootstrap # used by en/decode_refpoints 196 shared variables: # All others are local to each subroutine. 197 the input/output strings taken/returned by encode() and decode() 198 array refpoint[1..5] # refpoint[k] is for sequences of length k 199 # The rest are used only by the encoder: 200 array prefix[1..3] # prefix[k] is used to encode refpoint[k] 201 integers best_count, best_refpoint 203 constants: 204 array special_refpoint[0..7] = 205 0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270 206 # Generally, prefix[k] << (4*k) == refpoint[k], 207 # but for prefix[2] == 0xD8..0xDF, refpoint[2] == 208 # special_refpoint[0..7] respectively. These prefixes would 209 # not otherwise be used because they correspond to surrogates. 210 # These special reference points are used to assist the Latin 211 # script because, unlike almost every other small script, 212 # Latin is split across multiple rows with inconvenient 213 # boundaries, and therefore has a hard time compressing well. 215 function encode(input string): 216 if any input character's codepoint is outside 0..10FFFF then fail 217 # Too-large values could cause array bounds errors later. 218 choose_refpoints() 219 encode_refpoints() 220 let literal = false 221 for each character in the input string (in order) do begin 222 if the character is hyphen-minus then output two hyphen-minuses 223 else if the character is an LDH character then begin 224 if not literal then output hyphen-minus and toggle literal 225 output the character 226 end 227 else begin 228 if literal then output hyphen-minus and toggle literal 229 encode_point(to_codepoint(character)) 230 end 231 end 232 return the output string 234 function decode(input string): 235 decode_refpoints() 236 let literal = false 237 while not end-of-input do begin 238 if the next character is hyphen-minus then begin 239 consume the character 240 if the next character is hyphen-minus then consume it 241 else toggle literal 242 end 243 else if literal then consume character and output it 244 else output from_codepoint(decode_point()) 245 end 246 let check = encode(the output string) 247 if check != the input string then fail 248 # This comparison must be case-insensitive if ACEs are always 249 # compared case-insensitively (which is true of domain names), 250 # case-sensitive otherwise. See also section "Case sensitivity 251 # models". This check is necessary to guarantee the uniqueness 252 # property (there cannot be two distinct encoded strings 253 # representing the same Unicode string). 254 return the output string 256 function find_refpoint(start,n): 257 let i = start 258 while n < refpoint[k] or (n - refpoint[k]) >> (4*k) != 0 259 do increment i 260 return i 262 procedure encode_point(n): 263 let k = find_refpoint(1,n) 264 let delta = n - refpoint[k] 265 extract the k least significant nybbles of delta 266 # A nybble is 4 bits. 267 prepend 0 to the last nybble and prepend 1 to the rest 268 output the base-32 characters corresponding to the quintets 269 function decode_point(): 270 input characters and convert them to quintets until a quintet 271 beginning with 0 is obtained (expect at most four quintets 272 beginning with 1) 273 fail upon encountering anything unexpected 274 let k = the number of quintets obtained 275 strip the first bit of each quintet 276 concatenate the resulting nybbles to form delta 277 return refpoint[k] + delta 279 procedure encode_refpoints(): 280 # refpoint[4..5] always end up as 0 and 0x10000. 281 # refpoint[1..3] are implied by prefix[1..3], which are encoded 282 # in reverse order because that often yields a compact encoding. 283 let refpoint[1..2] = 0, 0x10 284 for k = 3 down to 1 do begin 285 encode_point(prefix[k]) 286 bootstrap(k, prefix[k]) 287 end 289 procedure decode_refpoints(): 290 let refpoint[1..5] = 0, 0x10, 0, 0, 0x10000 291 for k = 3 down to 1 do bootstrap(k, decode_point()) 293 procedure bootstrap(k,p): 294 # The prefixes need to be left-shifted to become reference 295 # points. As this happens, the current reference points often 296 # become helpful for encoding/decoding the next prefix. 297 for j = 4 down to 2 do let refpoint[j] = refpoint[j-1] << 4 298 if k == 2 and 0xD8 <= p <= 0xDF 299 then let refpoint[1] = special_refpoint[p - 0xD8] >> 4 300 else let refpoint[1] = p << 4 302 procedure choose_refpoints(): 303 # First choose refpoint[1] so that it will be used as often as 304 # possible, then choose refpoint[2] similarly, then refpoint[3]. 305 let refpoint[1..5] = 0, 0, 0, 0, 0x10000 306 let prefix[1..3] = 0, 0, 0, 0 307 for k = 1 to 3 do begin 308 let best_count = 0 309 let best_refpoint = 0 310 # Try the input code point prefixes, then the special prefixes: 311 for each input character in order 312 do census(k, to_codepoint(character) >> (4*k)) 313 if k == 2 then for i = 0 to 7 do census(k, 0xD8 + i) 314 if k == 3 then census(k,0xD) 315 let refpoint[k] = best_refpoint 316 end 317 function census(k,p): 318 # Determine how many times the reference point corresponding to 319 # prefix p would be used to encode input characters and other 320 # reference points if it were chosen as refpoint[k], and update 321 # best_count, best_refpoint, and prefix[k] accordingly. 322 if k == 2 and 0xD8 <= p <= 0xDF 323 then let refpoint[k] = special_refpoint[p - 0xD8] 324 else let refpoint[k] = p << (4*k) 325 let count = the number of non-LDH input characters for which 326 find_refpoint(1, to_codepoint(character)) == k 327 # Don't forget the non-LDH requirement. 328 increment count once for each i such that 1 <= i <= k and 329 find_refpoint(i+1, prefix[i] << (4*i)) == k 330 if count > best_count then begin 331 let best_count = count 332 let best_refpoint = refpoint[k] 333 let prefix[k] = p 334 end 336 Signature 338 The issue of how to distinguish ACE strings from unencoded strings 339 is largely orthogonal to the encoding scheme itself, and is 340 therefore not specified here. In the context of domain name labels, 341 a standard prefix and/or suffix (chosen to be unlikely to occur 342 naturally) would presumably be attached to ACE labels. (In that 343 case, it would probably be good to forbid the encoding of Unicode 344 strings that appear to match the signature, to avoid confusing 345 humans about whether they are looking at a Unicode string or an ACE 346 string.) 348 In order to use AMC-ACE-O in domain names, the choice of signature 349 must be mindful of the requirement in [RFC952] that labels never 350 begin or end with hyphen-minus. The raw encoded string will never 351 begin with a hyphen-minus, and will end with a hyphen-minus iff the 352 Unicode string ends with a hyphen-minus. If the Unicode strings 353 are forbidden from ending with hyphen-minus (which seems prudent 354 anyway), then there is no problem. Otherwise, AMC-ACE-O would need 355 to use a suffix as the signature. 357 It appears that "---" is extremely rare in domain names; among the 358 four-character prefixes of all the second-level domains under .com, 359 .net, and .org, "---" never appears at all. Therefore, perhaps the 360 signature should be of the form ?--- (prefix) or ---? (suffix), 361 where ? could be "u" for Unicode, or "i" for internationalized, or 362 "a" for ACE, or maybe "q" or "z" because they are rare. 364 Case sensitivity models 366 The higher layer must choose one of the following four models. 368 Models suitable for domain names: 370 * Case-insensitive: Before a string is encoded, all its non-LDH 371 characters must be case-folded so that any strings differing 372 only in case become the same string (for example, strings could 373 be forced to lowercase). Folding LDH characters is optional. 374 The case of base-32 characters and literal-mode characters is 375 arbitrary and not significant. Comparisons between encoded 376 strings must be case-insensitive. The original case of non-LDH 377 characters cannot be recovered from the encoded string. 379 * Case-preserving: The case of the Unicode characters is not 380 considered significant, but it can be preserved and recovered, 381 just like in non-internationalized host names. Before a string 382 is encoded, all its non-LDH characters must be case-folded 383 as in the previous model. LDH characters are naturally able 384 to retain their case attributes because they are encoded 385 literally. The case attribute of a non-LDH character is 386 recorded in the last of the base-32 characters that represent 387 it, which is guaranteed to be a letter rather than a digit. 388 If the base-32 character is uppercase, it means the Unicode 389 character is caseless or should be forced to uppercase after 390 being decoded (which is a no-op if the case folding already 391 forces to uppercase). If the base-32 character is lowercase, 392 it means the Unicode character is caseless or should be forced 393 to lowercase after being decoded (which is a no-op if the case 394 folding already forces to lowercase). The case of the other 395 base-32 characters in a multi-quintet encoding is arbitrary 396 and not significant. Only uppercase and lowercase attributes 397 can be recorded, not titlecase. Comparisons between encoded 398 strings must be case-insensitive, and are equivalent to 399 case-insensitive comparisons between the Unicode strings. The 400 intended mixed-case Unicode string can be recovered as long as 401 the encoded characters are unaltered, but altering the case of 402 the encoded characters is not harmful--it merely alters the case 403 of the Unicode characters, and such a change is not considered 404 significant. 406 In this model, the input to the encoder and the output of the 407 decoder can be the unfolded Unicode string (in which case the 408 encoder and decoder are responsible for performing the case 409 folding and recovery), or can be the folded Unicode string 410 accompanied by separate case information (in which case the 411 higher layer is responsible for performing the case folding and 412 recovery). Whichever layer performs the case recovery must 413 first verify that the Unicode string is properly folded, to 414 guarantee the uniqueness of the encoding. 416 It is not very difficult to extend the nameprep algorithm 417 [NAMEPREP03] to remember case information. 419 The case-insensitive and case-preserving models are interoperable. 420 If a domain name passes from a case-preserving entity to a 421 case-insensitive entity, the case information will be lost, but 422 the domain name will still be equivalent. This phenomenon already 423 occurs with non-internationalized domain names. 425 Models unsuitable for domain names, but possibly useful in other 426 contexts: 428 * Case-sensitive: Unicode strings may contain both uppercase and 429 lowercase characters, which are not folded. Base-32 characters 430 must be lowercase. Comparisons between encoded strings must be 431 case-sensitive. 433 * Case-flexible: Like case-preserving, except that the choice 434 of whether the case of the Unicode characters is considered 435 significant is deferred. Therefore, base-32 characters must 436 be lowercase, except for those used to indicate uppercase 437 Unicode characters. Comparisons between encoded strings may be 438 case-sensitive or case-insensitive, and such comparisons are 439 equivalent to the corresponding comparisons between the Unicode 440 strings. 442 Comparison with RACE, BRACE, LACE, DUDE, AMC-ACE-M 444 In this section we compare AMC-ACE-O and five other ACEs: RACE 445 [RACE03], BRACE [BRACE00], LACE [LACE01], DUDE [DUDE01], and 446 AMC-ACE-M [AMCACEM00]. We do not include SACE [SACE], UTF-5 [UTF5], 447 or UTF-6 [UTF6] in the comparison, because SACE appears obviously 448 too complex, UTF-5 appears obviously too inefficient, and UTF-6 can 449 never be more efficient than its similarly simple successor, DUDE. 451 Complexity is hard to measure. This author would subjectively 452 describe the complexity of the algorithms as: 454 RACE, LACE, DUDE: fairly simple but not trivial 455 AMC-ACE-O: moderate 456 AMC-ACE-M: fairly complex 457 BRACE: complex 459 AMC-ACE-O is very similar to AMC-ACE-M, but is simpler because it 460 discards the "wide" encoding style, and uses a different method for 461 choosing and encoding the reference points that has fewer special 462 cases and more reuse of logic already needed for encoding the 463 Unicode characters. 465 Implementations can be long and straightforward, or short and 466 subtle, but for whatever it's worth, here are the code sizes of 467 three of the algorithms that were implemented by this author in 468 similar styles: 470 AltDUDE: 130 lines @@@@@@@@@@@@@@@@@@@ 471 DUDE: 135 lines @@@@@@@@@@@@@@@@@@@ 472 AMC-ACE-O: 234 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 473 AMC-ACE-M: 324 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 474 (AltDUDE [AltDUDE00] is a variant of DUDE that is not worth 475 including separately in the rest of the comparison because it is 476 practically identical to DUDE. Not counted in the code sizes are 477 blank lines, lines containing only comments or only a single brace, 478 and wrapper code for testing. BRACE was also implemented by this 479 author, but it was a less general implementation, with bounded input 480 and output sizes.) 482 If a different implementation style were to alter the code sizes 483 additively, or multiplicatively, or a combination thereof, AMC-ACE-O 484 would remain about halfway between DUDE and AMC-ACE-M. 486 Case preservation support: 488 DUDE, AMC-ACE-M, AMC-ACE-O: all characters 489 BRACE: only the letters A-Z, a-z 490 RACE, LACE: none 492 RACE, BRACE, and LACE transform the Unicode string to an 493 intermediate bit string, then into a base-32 string, so there is 494 no particular alignment between the base-32 characters and the 495 Unicode characters. DUDE, AMC-ACE-M, and AMC-ACE-O do not have 496 this intermediate stage, and enforce alignment between the base-32 497 characters and the Unicode characters, which facilitates the case 498 preservation. 500 The relative efficiency of the various algorithms is suggested 501 by the sizes of the encodings in section "Example strings". The 502 lengths of examples A-K (which are the same sentence translated into 503 a languages from a variety of language families using a variety 504 of scripts) are shown graphically below for each ACE, scaled by a 505 factor of 0.4 so they fit on one line, and sorted so they look like 506 a cummulative distribution. The fictional "Super-ACE" encodes its 507 input using whichever of the other six ACEs is shortest for that 508 input. 510 RACE: 511 A Arabic 29 @@@@@@@@@@@@ 512 B Chinese 31 @@@@@@@@@@@@ 513 J Taiwanese 31 @@@@@@@@@@@@ 514 D Hebrew 37 @@@@@@@@@@@@@@@ 515 H Russian 47 @@@@@@@@@@@@@@@@@@@ 516 E Hindi 50 @@@@@@@@@@@@@@@@@@@@ 517 F Japanese 60 @@@@@@@@@@@@@@@@@@@@@@@@ 518 I Spanish 66 @@@@@@@@@@@@@@@@@@@@@@@@@@ 519 C Czech 68 @@@@@@@@@@@@@@@@@@@@@@@@@@@ 520 G Korean 79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 521 K Vietnamese 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 522 LACE: 523 B Chinese 28 @@@@@@@@@@@ 524 A Arabic 31 @@@@@@@@@@@@ 525 J Taiwanese 31 @@@@@@@@@@@@ 526 D Hebrew 39 @@@@@@@@@@@@@@@@ 527 H Russian 48 @@@@@@@@@@@@@@@@@@@ 528 E Hindi 52 @@@@@@@@@@@@@@@@@@@@@ 529 F Japanese 52 @@@@@@@@@@@@@@@@@@@@@ 530 C Czech 58 @@@@@@@@@@@@@@@@@@@@@@@ 531 I Spanish 68 @@@@@@@@@@@@@@@@@@@@@@@@@@@ 532 G Korean 79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 533 K Vietnamese 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 535 DUDE: 536 A Arabic 25 @@@@@@@@@@ 537 B Chinese 26 @@@@@@@@@@ 538 D Hebrew 33 @@@@@@@@@@@@@ 539 J Taiwanese 36 @@@@@@@@@@@@@@ 540 H Russian 38 @@@@@@@@@@@@@@@ 541 C Czech 43 @@@@@@@@@@@@@@@@@ 542 F Japanese 49 @@@@@@@@@@@@@@@@@@@@ 543 E Hindi 58 @@@@@@@@@@@@@@@@@@@@@@@ 544 I Spanish 59 @@@@@@@@@@@@@@@@@@@@@@@@ 545 K Vietnamese 81 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 546 G Korean 89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 548 AMC-ACE-O: 549 B Chinese 24 @@@@@@@@@@ 550 A Arabic 28 @@@@@@@@@@@ 551 J Taiwanese 30 @@@@@@@@@@@@ 552 D Hebrew 31 @@@@@@@@@@@@ 553 C Czech 34 @@@@@@@@@@@@@@ 554 H Russian 40 @@@@@@@@@@@@@@@@ 555 F Japanese 41 @@@@@@@@@@@@@@@@ 556 I Spanish 49 @@@@@@@@@@@@@@@@@@@@ 557 E Hindi 54 @@@@@@@@@@@@@@@@@@@@@@ 558 K Vietnamese 69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 559 G Korean 80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 561 BRACE: 562 B Chinese 22 @@@@@@@@@ 563 A Arabic 26 @@@@@@@@@@ 564 J Taiwanese 27 @@@@@@@@@@@ 565 D Hebrew 33 @@@@@@@@@@@@@ 566 C Czech 36 @@@@@@@@@@@@@@ 567 F Japanese 40 @@@@@@@@@@@@@@@@ 568 H Russian 42 @@@@@@@@@@@@@@@@@ 569 E Hindi 45 @@@@@@@@@@@@@@@@@@ 570 I Spanish 48 @@@@@@@@@@@@@@@@@@@ 571 K Vietnamese 72 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 572 G Korean 78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 573 AMC-ACE-M: 574 B Chinese 23 @@@@@@@@@ 575 J Taiwanese 27 @@@@@@@@@@@ 576 A Arabic 28 @@@@@@@@@@@ 577 D Hebrew 31 @@@@@@@@@@@@ 578 C Czech 34 @@@@@@@@@@@@@@ 579 H Russian 38 @@@@@@@@@@@@@@@ 580 F Japanese 42 @@@@@@@@@@@@@@@@@ 581 I Spanish 48 @@@@@@@@@@@@@@@@@@@ 582 E Hindi 54 @@@@@@@@@@@@@@@@@@@@@@ 583 K Vietnamese 69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 584 G Korean 71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 586 Super-ACE: 587 B Chinese 22 @@@@@@@@@ 588 A Arabic 25 @@@@@@@@@@ 589 J Taiwanese 27 @@@@@@@@@@@ 590 D Hebrew 31 @@@@@@@@@@@@ 591 C Czech 34 @@@@@@@@@@@@@@ 592 H Russian 38 @@@@@@@@@@@@@@@ 593 F Japanese 40 @@@@@@@@@@@@@@@@ 594 E Hindi 45 @@@@@@@@@@@@@@@@@@ 595 I Spanish 48 @@@@@@@@@@@@@@@@@@@ 596 K Vietnamese 69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 597 G Korean 71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 599 totals: 600 RACE: 610 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 601 LACE: 595 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 602 DUDE: 537 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 603 AMC-ACE-O: 480 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 604 BRACE: 469 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 605 AMC-ACE-M: 465 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 606 Super-ACE: 450 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 608 worst cases: 609 RACE: 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 610 LACE: 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 611 DUDE: 89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 612 AMC-ACE-O: 80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 613 BRACE: 78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 614 AMC-ACE-M: 71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 615 Super-ACE: 71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 617 The totals and worst cases above give more weight to languages 618 that produce longer encodings, which arguably yields a good metric 619 (because being efficient for easy languages is arguably less 620 important than being efficient for difficult languages). We can 621 alternatively give each language equal weight by dividing each 622 output length by the corresponding Super-ACE output length. This 623 method yields: 625 totals: 626 RACE: 14.9 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 627 LACE: 14.5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 628 DUDE: 13.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 629 AMC-ACE-O: 11.7 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 630 BRACE: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 631 AMC-ACE-M: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 632 Super-ACE: 11.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 634 worst cases: 635 RACE: 2.00 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 636 LACE: 1.71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 637 DUDE: 1.33 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 638 AMC-ACE-O: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 639 AMC-ACE-M: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 640 BRACE: 1.11 @@@@@@@@@@@@@@@@@@@@@@@@@@@@ 641 Super-ACE: 1.00 @@@@@@@@@@@@@@@@@@@@@@@@@ 643 No matter which way we average, the results suggest that DUDE is 644 preferrable to RACE and LACE, because it has similar simplicity, is 645 more efficient, and has better support for case preservation. 647 The results also suggest that AMC-ACE-M is preferrable to BRACE, 648 because it has similar efficiency, is a little simpler, and has 649 better support for case preservation. 651 DUDE, AMC-ACE-O, and AMC-ACE-M are progressively more complex and 652 more efficient, and have equal support for case preservation. The 653 choice depends on how much efficiency is required and how much 654 complexity is acceptable. 656 The efficiency gap between AMC-ACE-M and AMC-ACE-O is mostly due 657 to the Korean (Hangul) string. Of the 15 characters by which the 658 AMC-ACE-M total beats the AMC-ACE-O total, 9 come from that string, 659 for which AMC-ACE-M had an output length of 71, compared to about 80 660 for all the other ACEs. 662 Example strings 664 In the ACE encodings below, signatures (like "bq--" for RACE) are 665 not shown. Non-LDH characters in the Unicode string are forced to 666 lowercase before being encoded using BRACE, RACE, and LACE. For 667 RACE and LACE, the letters A-Z are likewise forced to lowercase. 668 UTF-8 and UTF-16 are included for length comparisons, with non-ASCII 669 bytes shown as "?". AMC-ACE-M and AMC-ACE-O are abbreviated AMC-M 670 and AMC-O. Backslashes show where line breaks have been inserted in 671 ACE strings too long for one line. The RACE and LACE encodings are 672 courtesy of Mark Davis's online UTF converter [UTFCONV] (slightly 673 modified to remove the length restrictions). 675 The first several examples are all translations of the sentence "Why 676 can't they just speak in ?" (courtesy of Michael Kaplan's 677 "provincial" page [PROVINCIAL]). Word breaks and punctuation have 678 been removed, as is often done in domain names. 680 (A) Arabic (Egyptian): 681 U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644 682 U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F 683 DUDE: m44qnli7oqk3kloj4phi8kahf 684 BRACE: 28akcjwcmp3ciwb4t3ngd4nbaz 685 AMC-O: ageekhfuhuiukdefivevjvbuiktr 686 AMC-M: agiekhfuhuiukdefivevjvbuiktr 687 RACE: azceur2fe4ucuq2eivediojrfbfb6 688 LACE: cedeisshiutsqksdircuqnbzgeueuhy 689 UTF-16: ?????????????????????????????????? 690 UTF-8: ?????????????????????????????????? 692 (B) Chinese (simplified): 693 U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587 695 UTF-16: ?????????????????? 696 BRACE: kgcqqsgp26i5h4zn7req5i 697 AMC-M: uqj7g8nvk6awispn9wupdnh 698 AMC-O: eqpg8nvk6awisp259eupyx2h 699 DUDE: ked6ucjas0k8gdobf4ke2dm587 700 UTF-8: ??????????????????????????? 701 LACE: azhnn3b2ybea2aml6qau4libmwdq 702 RACE: 3bhnmtxmjy5e5qcojbha3c7ujywwlby 704 (C) Czech: Proprostnemluvesky 706 = U+010D 707 = U+011B 708 = U+00ED 710 UTF-8: Pro??prost??nemluv????esky 711 AMC-O: piq-Pro-p-prost-9m-nemluv-6pp-esky 712 AMC-M: g26-Pro-p-prost-9m-nemluv-6pp-esky 713 BRACE: i32-Pro-u-prost-8y-nemluv-29f3n-esky 714 DUDE: N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9 715 UTF-16: ???????????????????????????????????????????? 716 LACE: amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\ 717 lpe 718 RACE: ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\ 719 7mx7xh73l754q 721 (D) Hebrew: 722 U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8 723 U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2 724 U+05D1 U+05E8 U+05D9 U+05EA 726 AMC-O: afpnqeep8e8jfinaqdb8ijp8cb8ij8k 727 AMC-M: af4nqeep8e8jfinaqdb8ijp8cb8ij8k 728 DUDE: ldcukktu4pt5osgujhu8t9tu2t1u8t9ua 729 BRACE: 27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c 730 RACE: axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u 731 LACE: cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q 732 UTF-8: ???????????????????????????????????????????? 733 UTF-16: ???????????????????????????????????????????? 735 (E) Hindi: 736 U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D 737 U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939 738 U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947 739 U+0939 U+0948 U+0902 (Devanagari) 740 BRACE: 2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32 741 RACE: bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai 742 LACE: dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba 743 AMC-O: ajeurvjvcmthvjvruipugatfpurmscuivjascunmvcvitfuehvjisc 744 AMC-M: ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc 745 DUDE: p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\ 746 8g2 747 UTF-16: ???????????????????????????????????????????????????????\ 748 ????? 749 UTF-8: ???????????????????????????????????????????????????????\ 750 ??????????????????????????????????? 752 (F) Japanese: 753 U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092 754 U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B 755 (kanji and hiragana) 757 UTF-16: ???????????????????????????????????? 758 BRACE: ji8nr5zj8uqth7v97mjchakwcg7dqemw88nj5gbe 759 AMC-O: gvagkxnzr3dkx8fzun243q3c24zbxhgwr2nkweqwm 760 AMC-M: bsnkxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm 761 DUDE: j06alcnfp3mam5e5n2coa9ej092oa71j057m6kfocmak4mekb 762 LACE: auyguxd7snvaczpfaftsyamktyatbeqbrjyqqmcxmzhyy2senzfq 763 UTF-8: ?????????????????????????????????????????????????????? 764 RACE: 3aygumc4gb7tbezqnjs6kzzmrkpdbeukoeyfomdggbhtbdbqniyeimd\ 765 ogbfq 767 (G) Korean: 768 U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774 769 U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74 770 U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables) 772 UTF-16: ???????????????????????????????????????????????? 773 UTF-8: ???????????????????????????????????????????????????????\ 774 ????????????????? 775 AMC-M: yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\ 776 p6aqi23ieemweywn 777 BRACE: y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\ 778 gmnsqydmk7im56arju6pt82 779 LACE: 77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ 780 exj2mlpfzzcyjrsely5ck4ta 781 RACE: 3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ 782 exj2mlpfzzcyjrsely5ck4ta 783 AMC-O: m6hwq6tvi466exi44ia6s4nz2neze7xxn47yp6x5e3znze7xze7xxnu\ 784 8e4ze6x5n36is3i622mwe48wn 785 DUDE: s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\ 786 574lcr2e4q74s5bcr9c8g98s88bn44qe4c 788 (H) Russian: 789 U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E 790 U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440 791 U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A 792 U+0438 (Cyrillic) 793 DUDE: K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao 794 AMC-M: aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca 795 AMC-O: aedRqwhfnwdgfqpipfdqcqwawrwcrqwawdwbwbki 796 BRACE: 269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2 797 RACE: aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa 798 LACE: dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory 799 UTF-16: ???????????????????????????????????????????????????????\ 800 ??? 801 UTF-8: ??????????????????????????????????????????????????????? 802 ??? 804 (I) Spanish: PorqunopuedensimplementehablarenEspaol 806 = U+00E9 807 = U+00F1 809 UTF-8: Porqu??nopuedensimplementehablarenEspa??ol 810 AMC-M: aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol 811 BRACE: 22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol 812 AMC-O: aaq-Porqu-j-nopuedensimplementehablarenEspa-9b-ol 813 DUDE: N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\ 814 1mfs 815 RACE: abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\ 816 fonygd4lpnq 817 LACE: faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\ 818 omvzxayprn5wa 819 UTF-16: ???????????????????????????????????????????????????????\ 820 ????????????????????????? 822 (J) Taiwanese: 823 U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587 825 UTF-16: ?????????????????? 826 UTF-8: ??????????????????????????? 827 AMC-M: uqj7g2tbgtu6a385pspnxkupdnh 828 BRACE: kgcqui49gatc2wyrn8y7cndgte9 829 AMC-O: eqpgxstbzuvc6a385psp244kupyx2h 830 RACE: 3bhnmuaroize5qe6xvha3cvkjywwlby 831 LACE: 75hnmuaroize5qe6xvha3cvkjywwlby 832 DUDE: ked6l011n232kec0pebdke0doaaake2dm587 834 (K) Vietnamese: 835 Taisaohokhngthchi\ 836 noitingVit 838 = U+0323 839 = U+00F4 840 = U+00EA 841 = U+0309 842 = U+0301 843 UTF-8: Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t 844 AMC-O: aava-Ta-vud-isaoho-vud-kh-9e-ngth-8kj-chi-j-no-b-iti-8k\ 845 b-ngVi-8kvud-t 846 AMC-M: ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\ 847 b-ngVi-s8kud-t 848 BRACE: i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\ 849 9c29-ngVi-25p48-t 850 UTF-16: ???????????????????????????????????????????????????????\ 851 ????????????????????? 852 DUDE: N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\ 853 9n4m9uaj01g6enN6m9uaj23g74 854 LACE: aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\ 855 ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi 856 RACE: ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\ 857 kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\ 858 3u 860 The next several examples are all names of Japanese music artists, 861 song titles, and TV programs, just because the author happens to 862 have them handy (but Japanese is useful for providing examples 863 of single-row text, two-row text, ideographic text, and various 864 mixtures thereof). 866 (L) 3B (Japanese TV program title) 868 = U+5E74 (kanji) 869 = U+7D44 (kanji) 870 = U+91D1 U+516B U+5148 U+751F (kanji) 872 UTF-16: ???????????????? 873 UTF-8: 3???B??????????????? 874 AMC-M: utk-3-8ze-B-hkenqtymwifi9 875 BRACE: u-3-ygj-b-ynb6gjc7pp4k5p5w 876 AMC-O: fb8h-3-e-B-z7we3t7bymwizxtr 877 DUDE: j3le74G062nd44p1d1l16bk8n51f 878 RACE: 3aadgxtuabrh2rer2fiwwukioupq 879 LACE: 74adgxtuabrh2rer2fiwwukioupq 881 (M) -with-SUPER-MONKEYS (Japanese music group name) 883 = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji) 885 UTF-8: ??????????????????-with-SUPER-MONKEYS 886 AMC-M: u5m2j4etwif6q2zf---with--SUPER--MONKEYS 887 AMC-O: fmij4e3wiz92qyszf---with--SUPER--MONKEYS 888 BRACE: uvj7fuaqcahy982xa---with--SUPER--MONKEYS 889 DUDE: lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J 890 UTF-16: ???????????????????????????????????????????????? 891 LACE: ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y 892 RACE: 3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\ 893 nabwqa3yanyagwadfab4qa4y 895 (N) Hello-Another-Way- (Japanese song title) 897 = U+305D U+308C U+305E U+308C U+306E (hiragana) 898 = U+5834 U+6240 (kanji) 899 UTF-8: Hello-Another-Way-????????????????????? 900 BRACE: ji7-Hello--Another--Way---v3jhaefvd2ufj62 901 AMC-O: daf-Hello--Another--Way---p2nq2nyqx2veyuwa 902 AMC-M: bsk-Hello--Another--Way---p2nq2nyqx2veyuwa 903 DUDE: M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240 904 UTF-16: ?????????????????????????????????????????????????? 905 LACE: ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea 906 RACE: 3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\ 907 zaawtaxjqrqyf4memgbxfqndcia 909 (O) 2 (Japanese TV program title) 911 = U+3072 U+3068 U+3064 (hiragana) 912 = U+5C4B U+6839 (kanji) 913 = U+306E (hiragana) 914 = U+4E0B (kanji) 916 UTF-16: ???????????????? 917 UTF-8: ?????????????????????2 918 AMC-O: dagzciex6wmy2vjqw8sm-2 919 AMC-M: bsnzciex6wmy2vjqw8sm-2 920 BRACE: ji96u56uwbhf2wqxnw4s-2 921 DUDE: j072m8klc4bm839j06eke0bg032 922 RACE: 3ayhemdigbsfys3iheyg4tqlaaza 923 LACE: 74yhemdigbsfys3iheyg4tqlaaza 925 (P) MajiKoi5 (Japanese song title) 927 = U+3067 (hiragana) 928 = U+3059 U+308B (hiragana) 929 = U+79D2 U+524D (kanji) 931 UTF-8: Maji???Koi??????5?????? 932 UTF-16: ?????????????????????????? 933 AMC-M: bsm-Maji-r-Koi-b2m-5-z37cxuwp 934 BRACE: ji8-Maji-g-Koi-qe7x-5-wx7p6ma 935 AMC-O: dag-Maji-h-Koi-xj2m-5-z37cxuwp 936 DUDE: Mdhqpj067G06bvpj059obg035n9d2l24d 937 RACE: 3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq 938 LACE: 74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq 940 (Q) de (Japanese song title) 942 = U+30D1 U+30D5 U+30A3 U+30FC (katakana) 943 = U+30EB U+30F3 U+30D0 (katakana) 945 UTF-16: ?????????????? 946 BRACE: 3iu8pazt-de-pygi 947 AMC-O: dapbf4d9n-de-8m9da 948 AMC-M: bs3jp4d9n-de-8m9di 949 RACE: gdi5li7475sp6zpl6pia 950 DUDE: j0d1lq3vcg064lj0ebv3t0 951 UTF-8: ????????????de????????? 952 LACE: aqyndvnd7qbaazdfamyox46q 953 (R) (Japanese song title) 955 = U+305D U+306E (hiragana) 956 = U+30B9 U+30D4 U+30FC U+30C9 (katakana) 957 = U+3067 (hiragana) 959 RACE: gbow5oou7tewo 960 UTF-16: ?????????????? 961 BRACE: bidprdmp9wt7mi 962 LACE: a4yf23vz2t6mszy 963 AMC-O: dagxpq5j7e9n6jh 964 AMC-M: bsmfyq5j7e9n6jr 965 DUDE: j05dmer9t4vcs9m7 966 UTF-8: ????????????????????? 968 The last example is an ASCII string that breaks not only the 969 existing rules for host name labels but also the rules proposed in 970 [NAMEPREP03] for internationalized domain names. 972 (S) -> $1.00 <- 974 UTF-8: -> $1.00 <- 975 DUDE: -jei0kj1iej0gi0jc- 976 RACE: aawt4ibegexdambahqwq 977 LACE: bmac2praeqys4mbqea6c2 978 UTF-16: ?????????????????????? 979 AMC-O: aac--vqae-1-q-00-avn-- 980 AMC-M: aae--vqae-1-q-00-avn-- 981 BRACE: 229--t2b4-1-w-00-i9i-- 983 Security considerations 985 Users expect each domain name in DNS to be controlled by a single 986 authority. If a Unicode string intended for use as a domain label 987 could map to multiple ACE labels, then an internationalized domain 988 name could map to multiple ACE domain names, each controlled by 989 a different authority, some of which could be spoofs that hijack 990 service requests intended for another. Therefore AMC-ACE-O is 991 designed so that each Unicode string has a unique encoding. 993 However, there can still be multiple Unicode representations of the 994 "same" text, for various definitions of "same". This problem is 995 addressed to some extent by the Unicode standard under the topic 996 of canonicalization, but some text strings may be misleading or 997 ambiguous to humans when used as domain names, such as strings 998 containing dots, slashes, at-signs, etc. These issues are being 999 further studied under the topic of "nameprep" [NAMEPREP03]. 1001 Credits 1003 AMC-ACE-O reuses a number of preexisting techniques. 1005 The basic encoding of integers to nybbles to quintets to base-32 1006 comes from UTF-5 [UTF5], and the particular variant used here comes 1007 from AMC-ACE-M [AMCACEM00]. 1009 The idea of avoiding 0, 1, o, and l in base-32 strings was taken 1010 from SFS [SFS]. 1012 The idea of encoding deltas from reference points declared at the 1013 beginning of the encoded string was taken from RACE (of which 1014 the latest version is [RACE03]), which may have gotten the idea 1015 from Unicode Technical Standard #6 [UTS6]. The latter also uses 1016 predefined reference points in the Latin range. 1018 From BRACE [BRACE00] comes the idea of switching between literal 1019 mode and base-32 mode, and the technique of counting how many code 1020 points fall within a window (as opposed to checking whether all do). 1022 The general idea of using the alphabetic case of base-32 characters 1023 to record the desired case of the Unicode characters was suggested 1024 by this author, and first applied to the UTF-5-style encoding in 1025 DUDE (of which the latest version is [DUDE01]). 1027 The bootstrapping method of encoding reference points, which does 1028 not require them to nest but takes advantage of nesting when it 1029 occurs, is new in AMC-ACE-O. 1031 References 1033 [AltDUDE00] Adam Costello, "AltDUDE version 0.0.2", 2001-Mar-19, 1034 draft-ietf-idn-altdude-00. 1036 [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12, 1037 draft-ietf-idn-amc-ace-m-00. 1039 [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based 1040 ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19, 1041 draft-ietf-idn-brace-00. 1043 [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode 1044 Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01. 1046 [IDN] Internationalized Domain Names (IETF working group), 1047 http://www.i-d-n.net/, idn@ops.ietf.org. 1049 [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII 1050 Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01. 1052 [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation 1053 of Internationalized Host Names", 2001-Feb-24, 1054 draft-ietf-idn-nameprep-03. 1056 [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", 1057 http://www.trigeminal.com/samples/provincial.html. 1059 [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding 1060 for IDN", 2000-Nov-28, draft-ietf-idn-race-03. 1062 [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host 1063 Table Specification", 1985-Oct, RFC 952. 1065 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", 1066 1987-Nov, RFC 1034. 1068 [RFC1123] Internet Engineering Task Force, R. Braden (editor), 1069 "Requirements for Internet Hosts -- Application and Support", 1070 1989-Oct, RFC 1123. 1072 [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)", 1073 draft-ietf-idn-sace-*. 1075 [SFS] David Mazieres et al, "Self-certifying File System", 1076 http://www.fs.net/. 1078 [UNICODE] The Unicode Consortium, "The Unicode Standard", 1079 http://www.unicode.org/unicode/standard/standard.html. 1081 [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a 1082 Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*. 1084 [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another 1085 ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*. 1087 [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed, 1088 Mark Davis, Asmus Freytag, "Unicode Technical Standard 1089 #6: A Standard Compression Scheme for Unicode", 1090 http://www.unicode.org/unicode/reports/tr6/. 1092 [UTFCONV] Mark Davis, "UTF Converter", 1093 http://www.macchiato.com/unicode/convert.html. 1095 Author 1097 Adam M. Costello 1098 http://www.cs.berkeley.edu/~amc/ 1100 Example implementation 1102 /******************************************/ 1103 /* amc-ace-o.c 0.0.0 (2001-Mar-17-Sat) */ 1104 /* Adam M. Costello */ 1105 /******************************************/ 1107 /* This is ANSI C code (C89) implementing AMC-ACE-O version 0.0.*. */ 1109 /************************************************************/ 1110 /* Public interface (would normally go in its own .h file): */ 1112 #include 1114 enum amc_ace_status { 1115 amc_ace_success, 1116 amc_ace_invalid_input, 1117 amc_ace_output_too_big 1118 }; 1119 enum case_sensitivity { case_sensitive, case_insensitive }; 1121 #if UINT_MAX >= 0x10FFFF 1122 typedef unsigned int u_code_point; 1123 #else 1124 typedef unsigned long u_code_point; 1125 #endif 1127 enum amc_ace_status amc_ace_o_encode( 1128 unsigned int input_length, 1129 const u_code_point *input, 1130 const unsigned char *uppercase_flags, 1131 unsigned int *output_size, 1132 char *output ); 1134 /* amc_ace_o_encode() converts Unicode to AMC-ACE-O. The input */ 1135 /* must be represented as an array of Unicode code points */ 1136 /* (not code units; surrogate pairs are not allowed), and the */ 1137 /* output will be represented as null-terminated ASCII. The */ 1138 /* input_length is the number of code points in the input. The */ 1139 /* output_size is an in/out argument: the caller must pass */ 1140 /* in the maximum number of characters that may be output */ 1141 /* (including the terminating null), and on successful return */ 1142 /* it will contain the number of characters actually output */ 1143 /* (including the terminating null, so it will be one more than */ 1144 /* strlen() would return, which is why it is called output_size */ 1145 /* rather than output_length). The uppercase_flags array must */ 1146 /* hold input_length boolean values, where nonzero means the */ 1147 /* corresponding Unicode character should be forced to uppercase */ 1148 /* after being decoded, and zero means it is caseless or should */ 1149 /* be forced to lowercase. Alternatively, uppercase_flags may */ 1150 /* be a null pointer, which is equivalent to all zeros. The */ 1151 /* letters a-z and A-Z are always encoded literally, regardless */ 1152 /* of the corresponding flags. The encoder always outputs */ 1153 /* lowercase base-32 characters except when nonzero values */ 1154 /* of uppercase_flags require otherwise, so the encoder is */ 1155 /* compatible with any of the case models. The return value */ 1156 /* may be any of the amc_ace_status values defined above; if */ 1157 /* not amc_ace_success, then output_size and output may contain */ 1158 /* garbage. On success, the encoder will never need to write an */ 1159 /* output_size greater than input_length*5+10, because of how the */ 1160 /* encoding is defined. */ 1162 enum amc_ace_status amc_ace_o_decode( 1163 enum case_sensitivity case_sensitivity, 1164 char *scratch_space, 1165 const char *input, 1166 unsigned int *output_length, 1167 u_code_point *output, 1168 unsigned char *uppercase_flags ); 1169 /* amc_ace_o_decode() converts AMC-ACE-O to Unicode. The input */ 1170 /* must be represented as null-terminated ASCII, and the output */ 1171 /* will be represented as an array of Unicode code points. */ 1172 /* The case_sensitivity argument influences the check on the */ 1173 /* well-formedness of the input string; it must be case_sensitive */ 1174 /* if case-sensitive comparisons are allowed on encoded strings, */ 1175 /* case_insensitive otherwise (see also section "Case sensitivity */ 1176 /* models" of the AMC-ACE-O specification). The scratch_space */ 1177 /* must point to space at least as large as the input, which will */ 1178 /* get overwritten (this allows the decoder to avoid calling */ 1179 /* malloc()). The output_length is an in/out argument: the */ 1180 /* caller must pass in the maximum number of code points that */ 1181 /* may be output, and on successful return it will contain the */ 1182 /* actual number of code points output. The uppercase_flags */ 1183 /* array must have room for at least output_length values, or it */ 1184 /* may be a null pointer if the case information is not needed. */ 1185 /* A nonzero flag indicates that the corresponding Unicode */ 1186 /* character should be forced to uppercase by the caller, while */ 1187 /* zero means it is caseless or should be forced to lowercase. */ 1188 /* The letters a-z and A-Z are output already in the proper case, */ 1189 /* but their flags will be set appropriately so that applying the */ 1190 /* flags would be harmless. The return value may be any of the */ 1191 /* amc_ace_status values defined above; if not amc_ace_success, */ 1192 /* then output_length, output, and uppercase_flags may contain */ 1193 /* garbage. On success, the decoder will never need to write */ 1194 /* an output_length greater than the length of the input (not */ 1195 /* counting the null terminator), because of how the encoding is */ 1196 /* defined. */ 1198 /**********************************************************/ 1199 /* Implementation (would normally go in its own .c file): */ 1201 #include 1203 /* is_ldh(codept) returns 1 if the code point represents an LDH */ 1204 /* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */ 1206 static int is_ldh(u_code_point codept) 1207 { 1208 return codept > 122 ? 0 : 1209 codept >= 97 ? 1 : 1210 codept > 90 ? 0 : 1211 codept >= 65 ? 1 : 1212 codept > 57 ? 0 : 1213 codept >= 48 ? 1 : 1214 codept == 45 ; 1215 } 1217 /* is_AtoZ(c) returns 1 if c is an */ 1218 /* uppercase ASCII letter, zero otherwise. */ 1220 static unsigned char is_AtoZ(char c) 1221 { 1222 return c >= 65 && c <= 90; 1223 } 1224 /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */ 1225 /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */ 1226 /* then ASCII A-Z are considered equal to a-z respectively. */ 1228 static int unequal( 1229 enum case_sensitivity case_sensitivity, const char *s1, const char *s2 ) 1230 { 1231 char c1, c2; 1233 if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0; 1235 for (;;) { 1236 c1 = *s1; 1237 c2 = *s2; 1238 if (c1 >= 65 && c1 <= 90) c1 += 32; 1239 if (c2 >= 65 && c2 <= 90) c2 += 32; 1240 if (c1 != c2) return 1; 1241 if (c1 == 0) return 0; 1242 ++s1, ++s2; 1243 } 1244 } 1246 /* base32[q] is the lowercase base-32 character representing */ 1247 /* the number q from the range 0 to 31. Note that we cannot */ 1248 /* use string literals for ASCII characters because an ANSI C */ 1249 /* compiler does not necessarily use ASCII. */ 1251 static const char base32[] = { 1252 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ 1253 109, 110, /* m-n */ 1254 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ 1255 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ 1256 }; 1258 /* base32_decode(c) returns the value of a base-32 character, in the */ 1259 /* range 0 to 31, or the constant base32_invalid if c is not a valid */ 1260 /* base-32 character. */ 1262 enum { base32_invalid = 32 }; 1264 static unsigned int base32_decode(char c) 1265 { 1266 if (c < 50) return base32_invalid; 1267 if (c <= 57) return c - 26; 1268 if (c < 97) c += 32; 1269 if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; 1270 return c - 97 - (c > 108) - (c > 111); 1271 } 1273 /* The decoder_state and encoder_state structures contains */ 1274 /* variables that are shared among several of the functions below. */ 1276 struct decoder_state { 1277 const char *in_next; /* unread part of ACE input */ 1278 u_code_point refpoint[6]; /* reference points, [0] unused */ 1279 }; 1280 struct encoder_state { 1281 char *out_next, *out_end; /* unwritten part of ACE output */ 1282 const u_code_point *in_start, *in_end; /* entire Unicode input */ 1283 u_code_point refpoint[6], prefix[4]; /* reference points and prefixes */ 1284 unsigned int best_count; /* max found so far by census() */ 1285 u_code_point best_refpoint; /* corresponding reference point */ 1286 }; 1288 /* refpoint[k] is for base-32 sequences of length k, and prefix[k] */ 1289 /* is used to encode refpoint[k]. Generally, prefix[k] << (4*k) */ 1290 /* == refpoint[k], but for prefix[2] == 0xD8 + i, where i = 0..7, */ 1291 /* refpoint[2] == special_refpoint[i]. These prefixes, which would */ 1292 /* otherwise correspond to surrogates, are instead used to encode */ 1293 /* special reference points that help the Latin script compress */ 1294 /* better, because unlike most other small scripts it is split */ 1295 /* across multiple rows with inconvenient boundaries. */ 1297 static const u_code_point special_refpoint[] = 1298 { 0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270 }; 1300 /* find_refpoint(refpoint,start,n) scans the refpoint array, starting */ 1301 /* at position start, for a reference point suitable for encoding n, */ 1302 /* and returns the index of the first match. */ 1304 unsigned int find_refpoint( 1305 u_code_point refpoint[6], u_code_point start, u_code_point n ) 1306 { 1307 while ((n - refpoint[start]) >> (4*start) != 0) ++start; 1308 return start; 1309 } 1311 /* encode_point(state,n) encodes n as a sequence of base-32 */ 1312 /* characters representing a delta from a reference point. The */ 1313 /* delta divided into a big-endian sequence of nybbles; each nybble */ 1314 /* is expanded to a quintet with a highest bit of 0 for the last */ 1315 /* nybble, 1 for the others; and the quintets are mapped to base-32 */ 1316 /* characters. Returns amc_ace_success or amc_ace_output_too_big. */ 1318 enum amc_ace_status encode_point(struct encoder_state *state, u_code_point n) 1319 { 1320 unsigned int k, i; 1322 k = find_refpoint(state->refpoint, 1, n); 1323 if (state->out_end - state->out_next < k) return amc_ace_output_too_big; 1324 n -= state->refpoint[k]; 1325 i = k - 1; 1326 state->out_next[i] = base32[n & 0xF]; 1328 while (i > 0) { 1329 n >>= 4; 1330 state->out_next[--i] = base32[0x10 | (n & 0xF)]; 1331 } 1333 state->out_next += k; 1334 return amc_ace_success; 1335 } 1336 /* decode_point(state,n) is the reverse of encode_point(): it */ 1337 /* consumes base-32 characters and writes the code point into */ 1338 /* *n. Returns amc_ace_success or amc_ace_invalid_input. */ 1340 enum amc_ace_status decode_point(struct decoder_state *state, u_code_point *n) 1341 { 1342 u_code_point q, delta = 0; 1343 unsigned int k = 0; 1345 do { 1346 if (k >= 5) return amc_ace_invalid_input; 1347 q = base32_decode(state->in_next[k++]); 1348 if (q == base32_invalid) return amc_ace_invalid_input; 1349 delta = (delta << 4) | (q & 0xF); 1350 } while (q > 0xF); 1352 state->in_next += k; 1353 *n = state->refpoint[k] + delta; 1354 return amc_ace_success; 1355 } 1357 /* census(state,k,p) sets refpoint[k] to the reference point */ 1358 /* corresponding to prefix p, then calculates how many times */ 1359 /* that reference point would get used, and sets prefix[k] to */ 1360 /* p if the result exceeds the previous maximum. */ 1362 void census(struct encoder_state *state, unsigned int k, u_code_point p) 1363 { 1364 unsigned int count, i; 1365 u_code_point *refpoint = state->refpoint, *prefix = state->prefix; 1366 const u_code_point *in; 1368 refpoint[k] = 1369 k == 2 && p - 0xD8 <= 7 ? special_refpoint[p - 0xD8] : p << (4*k); 1371 /* count times used to encode input code points: */ 1373 for (count = 0, in = state->in_start; in < state->in_end; ++in) { 1374 if (!is_ldh(*in) && find_refpoint(refpoint, 1, *in) == k) ++count; 1375 } 1377 /* count times used to encode other reference points: */ 1379 for (i = 1; i < k; ++i) { 1380 if (find_refpoint(refpoint, i+1, prefix[i] << (4*i)) == k) ++count; 1381 } 1383 if (count > state->best_count) { 1384 state->best_count = count; 1385 state->best_refpoint = refpoint[k]; 1386 prefix[k] = p; 1387 } 1388 } 1390 /* bootstrap(refpoint,k,p) adjusts the existing reference points so */ 1391 /* they can be used for encoding/decoding another reference point. */ 1392 void bootstrap(u_code_point refpoint[6], unsigned int k, u_code_point p) 1393 { 1394 unsigned int j; 1396 for (j = 4; j >= 2; --j) refpoint[j] = refpoint[j-1] << 4; 1397 refpoint[1] = 1398 k == 2 && p - 0xD8 <= 7 ? special_refpoint[p - 0xD8] >> 4 : p << 4; 1399 } 1401 /* Main encode function: */ 1403 enum amc_ace_status amc_ace_o_encode( 1404 unsigned int input_length, 1405 const u_code_point *input, 1406 const unsigned char *uppercase_flags, 1407 unsigned int *output_size, 1408 char *output ) 1409 { 1410 struct encoder_state dummy = {0} /* all zeros */, *state = &dummy; 1411 const u_code_point *in, *in_end; 1412 char *out_end; 1413 unsigned int k, i; 1414 u_code_point codept; 1415 enum amc_ace_status status; 1416 unsigned int literal; /* boolean */ 1418 /* Initialization: */ 1420 state->out_next = output; 1421 state->out_end = out_end = output + *output_size; 1422 state->in_start = input; 1423 state->in_end = in_end = input + input_length; 1425 /* Verify that all code points are in 0..10FFFF: */ 1427 for (in = input; in < in_end; ++in) { 1428 if (*in > 0x10FFFF) return amc_ace_invalid_input; 1429 } 1431 /* Choose the reference points: Choose refpoint[1] so that it will */ 1432 /* be used as often as possible, then choose refpoint[2] similarly, */ 1433 /* then refpoint[3]. */ 1435 state->refpoint[5] = 0x10000; 1436 /* refpoint[1..4] and prefix[1..3] are already 0 */ 1438 for (k = 1; k <= 3; ++k) { 1439 state->best_count = 0; 1440 state->best_refpoint = 0; 1441 /* Try prefixes of the input code points, then the special prefixes: */ 1442 for (in = input; in < in_end; ++in) census(state, k, *in >> (4*k)); 1443 if (k == 2) for (i = 0; i <= 7; ++i) census(state, k, 0xD8 + i); 1444 if (k == 3) census(state, k, 0xD); 1445 state->refpoint[k] = state->best_refpoint; 1446 } 1447 /* Encode the reference points: */ 1449 state->refpoint[1] = 0; 1450 state->refpoint[2] = 0x10; 1452 for (k = 3; k >= 1; --k) { 1453 status = encode_point(state, state->prefix[k]); 1454 if (status != amc_ace_success) return status; 1455 bootstrap(state->refpoint, k, state->prefix[k]); 1456 } 1458 /* Main encoding loop: */ 1460 literal = 0; 1462 for (i = 0; i < input_length; ++i) { 1463 codept = input[i]; 1465 if (codept == 45) { 1466 /* hyphen-minus is doubled */ 1467 if (state->out_end - state->out_next < 2) return amc_ace_output_too_big; 1468 *state->out_next++ = 45; 1469 *state->out_next++ = 45; 1470 } 1471 else if (is_ldh(codept)) { 1472 /* encode LDH character literally */ 1474 if (!literal) { 1475 /* switch to literal mode by outputting hyphen-minus */ 1476 if (out_end - state->out_next < 1) return amc_ace_output_too_big; 1477 *state->out_next++ = 45; 1478 literal = 1; 1479 } 1481 if (out_end - state->out_next < 1) return amc_ace_output_too_big; 1482 *state->out_next++ = codept; 1483 } 1484 else { 1485 /* encode non-LDH character using base-32 */ 1487 if (literal) { 1488 /* switch to base-32 mode by outputting hyphen-minus */ 1489 if (out_end - state->out_next < 1) return amc_ace_output_too_big; 1490 *state->out_next++ = 45; 1491 literal = 0; 1492 } 1494 status = encode_point(state,codept); 1495 if (status != amc_ace_success) return status; 1496 /* the last base-32 character can record the uppercase flag: */ 1497 if (uppercase_flags && uppercase_flags[i]) state->out_next[-1] -= 32; 1498 } 1499 } 1500 /* null terminator: */ 1501 if (out_end - state->out_next < 1) return amc_ace_output_too_big; 1502 *state->out_next++ = 0; 1503 *output_size = state->out_next - output; 1504 return amc_ace_success; 1505 } 1507 /* Main decode function: */ 1509 enum amc_ace_status amc_ace_o_decode( 1510 enum case_sensitivity case_sensitivity, 1511 char *scratch_space, 1512 const char *input, 1513 unsigned int *output_length, 1514 u_code_point *output, 1515 unsigned char *uppercase_flags ) 1516 { 1517 struct decoder_state dummy = {0} /* all zeros */, *state = &dummy; 1518 unsigned int k, next_out, max_out, input_size, scratch_size; 1519 enum amc_ace_status status; 1520 u_code_point p; 1521 unsigned int literal; /* boolean */ 1522 char c; 1524 /* Initialization: */ 1526 state->in_next = input; 1527 next_out = 0; 1528 max_out = *output_length; 1530 /* Decode the reference points: */ 1532 state->refpoint[2] = 0x10; 1533 state->refpoint[5] = 0x10000; 1534 /* refpoint[1,3,4] are already 0 */ 1536 for (k = 3; k >= 1; --k) { 1537 status = decode_point(state, &p); 1538 if (status != amc_ace_success) return status; 1539 bootstrap(state->refpoint, k, p); 1540 } 1542 /* Main decoding loop: */ 1544 literal = 0; 1546 for (;;) { 1547 c = *state->in_next; 1548 if (c == 0) break; 1549 if (c == 45 /* hyphen-minus */) { 1550 if (*++state->in_next == 45) { 1551 /* double hyphen-minus represents a hyphen-minus */ 1552 ++state->in_next; 1553 if (max_out - next_out < 1) return amc_ace_output_too_big; 1554 if (uppercase_flags) uppercase_flags[next_out] = 0; 1555 output[next_out++] = 45; 1556 } 1557 else { 1558 /* unpaired hyphen-minus toggles mode */ 1559 literal = !literal; 1560 } 1561 } 1562 else { 1563 if (literal) { 1564 /* copy literal character to the output */ 1565 ++state->in_next; 1566 if (max_out - next_out < 1) return amc_ace_output_too_big; 1567 output[next_out] = c; 1568 } 1569 else { 1570 /* decode one base-32 code point */ 1571 status = decode_point(state, output + next_out); 1572 if (status != amc_ace_success) return status; 1573 } 1575 if (uppercase_flags) { 1576 uppercase_flags[next_out] = is_AtoZ(state->in_next[-1]); 1577 } 1579 ++next_out; 1580 } 1581 } 1583 /* Re-encode the output and compare to the input: */ 1585 input_size = state->in_next - input + 1; 1586 scratch_size = input_size; 1587 status = amc_ace_o_encode(next_out, output, uppercase_flags, 1588 &scratch_size, scratch_space); 1589 if (status != amc_ace_success || 1590 scratch_size != input_size || 1591 unequal(case_sensitivity, scratch_space, input) 1592 ) return amc_ace_invalid_input; 1594 *output_length = next_out; 1595 return amc_ace_success; 1596 } 1598 /******************************************************************/ 1599 /* Wrapper for testing (would normally go in a separate .c file): */ 1601 #include 1602 #include 1603 #include 1604 #include 1605 /* For testing, we'll just set some compile-time limits rather than */ 1606 /* use malloc(), and set a compile-time option rather than using a */ 1607 /* command-line option. */ 1609 enum { 1610 unicode_max_length = 256, 1611 ace_max_size = 256, 1612 test_case_sensitivity = case_insensitive /* suitable for host names */ 1613 }; 1615 static void usage(char **argv) 1616 { 1617 fprintf(stderr, 1618 "%s -e reads big-endian UTF-32 and writes AMC-ACE-O ASCII.\n" 1619 "%s -d reads AMC-ACE-O ASCII and writes big-endian UTF-32.\n" 1620 "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n" 1621 , argv[0], argv[0]); 1622 exit(EXIT_FAILURE); 1623 } 1625 static void fail(const char *msg) 1626 { 1627 fputs(msg,stderr); 1628 exit(EXIT_FAILURE); 1629 } 1631 static const char too_big[] = 1632 "input or output is too large, recompile with larger limits\n"; 1633 static const char invalid_input[] = "invalid input\n"; 1634 static const char io_error[] = "I/O error\n"; 1636 int main(int argc, char **argv) 1637 { 1638 enum amc_ace_status status; 1639 int r; 1641 if (argc != 2) usage(argv); 1642 if (argv[1][0] != '-') usage(argv); 1643 if (argv[1][2] != '\0') usage(argv); 1645 if (argv[1][1] == 'e') { 1646 u_code_point input[unicode_max_length]; 1647 unsigned char uppercase_flags[unicode_max_length]; 1648 char output[ace_max_size]; 1649 unsigned int input_length, output_size; 1650 int c0, c1, c2, c3; 1651 /* Read the UTF-32 input string: */ 1653 input_length = 0; 1655 for (;;) { 1656 c0 = getchar(); 1657 c1 = getchar(); 1658 c2 = getchar(); 1659 c3 = getchar(); 1660 if (ferror(stdin)) fail(io_error); 1662 if (c1 == EOF || c2 == EOF || c3 == EOF) { 1663 if (c0 != EOF) fail("input not a multiple of 4 bytes\n"); 1664 break; 1665 } 1667 if (input_length == unicode_max_length) fail(too_big); 1669 if ((c0 != 0 && c0 != 0x80) 1670 || c1 < 0 || c1 > 0x10 1671 || c2 < 0 || c2 > 0xFF 1672 || c3 < 0 || c3 > 0xFF ) { 1673 fail(invalid_input); 1674 } 1676 input[input_length] = ((u_code_point) c1 << 16) | 1677 ((u_code_point) c2 << 8) | (u_code_point) c3; 1678 uppercase_flags[input_length] = (c0 >> 7); 1679 ++input_length; 1680 } 1682 /* Encode, and output the result: */ 1684 output_size = ace_max_size; 1685 status = amc_ace_o_encode(input_length, input, uppercase_flags, 1686 &output_size, output); 1687 if (status == amc_ace_invalid_input) fail(invalid_input); 1688 if (status == amc_ace_output_too_big) fail(too_big); 1689 assert(status == amc_ace_success); 1690 r = fputs(output,stdout); 1691 if (r == EOF) fail(io_error); 1692 return EXIT_SUCCESS; 1693 } 1695 if (argv[1][1] == 'd') { 1696 char input[ace_max_size], scratch[ace_max_size]; 1697 u_code_point output[unicode_max_length], codept; 1698 unsigned char uppercase_flags[unicode_max_length]; 1699 unsigned int output_length, i; 1701 /* Read the AMC-ACE-O ASCII input string: */ 1703 fgets(input, ace_max_size, stdin); 1704 if (ferror(stdin)) fail(io_error); 1705 if (!feof(stdin)) fail(too_big); 1706 /* Decode, and output the result: */ 1708 output_length = unicode_max_length; 1709 status = amc_ace_o_decode(test_case_sensitivity, scratch, input, 1710 &output_length, output, uppercase_flags); 1711 if (status == amc_ace_invalid_input) fail(invalid_input); 1712 if (status == amc_ace_output_too_big) fail(too_big); 1713 assert(status == 0); 1715 for (i = 0; i < output_length; ++i) { 1716 r = putchar(uppercase_flags[i] ? 0x80 : 0); 1717 if (r == EOF) fail(io_error); 1718 codept = output[i]; 1719 r = putchar(codept >> 16); 1720 if (r == EOF) fail(io_error); 1721 r = putchar((codept >> 8) & 0xFF); 1722 if (r == EOF) fail(io_error); 1723 r = putchar(codept & 0xFF); 1724 if (r == EOF) fail(io_error); 1725 } 1727 return EXIT_SUCCESS; 1728 } 1730 usage(argv); 1731 return EXIT_SUCCESS; /* not reached, but quiets a compiler warning */ 1732 } 1734 INTERNET-DRAFT expires 2001-Sep-19