INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-z-00.txt                              2001-Aug-16
Expires 2002-Feb-16

                        AMC-ACE-Z version 0.3.0

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org. A non-paginated (and possibly
    newer) version of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/

Abstract

    AMC-ACE-Z is a simple and efficient ASCII-Compatible Encoding (ACE)
    designed for use with Internationalized Domain Names [IDN] [IDNA].
    It uniquely and reversibly transforms a Unicode string [UNICODE]
    into an ASCII string. ASCII characters in the Unicode string are
    represented literally, and non-ASCII characters are represented
    by ASCII characters that are allowed in hostname labels (letters,
    digits, and hyphens). Bootstring is a general algorithm that allows
    a string of basic code points to uniquely represent any string of
    code points drawn from a larger set. AMC-ACE-Z is an instance of
    Bootstring that uses particular parameter values appropriate for
    IDNA and uses an IDNA signature prefix (or suffix). This document
    specifies Bootstring and the parameter values for AMC-ACE-Z.

Contents

    1. Introduction
    2. Terminology
    3. Bootstring description
        3.1 Basic code point segregation
        3.2 Insertion unsort coding
        3.3 Generalized variable-length integers
        3.4 Bias adaptation
    4. Bootstring parameters
    5. Parameter values for AMC-ACE-Z
    6. Bootstring algorithms
        6.1 Bias adaptation function
        6.2 Decoding procedure
        6.3 Encoding procedure
    7. AMC-ACE-Z example strings
    8. Security considerations
    9. References
    A. Author contact information
    B. Mixed-case annotation
    C. AMC-ACE-Z sample implementation

1. Introduction

    The IDNA draft [IDNA] describes an architecture for supporting
    internationalized domain names. Each label of a domain name may
    contain a special prefix (or suffix), in which case the rest of the
    label is an ASCII-Compatible Encoding (ACE) of a Unicode string
    satisfying certain constraints. For the details of the constraints,
    see [IDNA] and [NAMEPREP03]. The prefix has not yet been specified,
    but see http://www.i-d-n.net/ for prefixes to be used for testing
    and experimentation.

    Bootstring has been designed to have the following features:

      * Completeness: Every extended string (sequence of arbitrary code
        points) can be represented by a basic string (sequence of basic
        code points). Restrictions on what strings are allowed, and on
        length, may be imposed by higher layers.

      * Uniqueness: There is at most one basic string that represents a
        given extended string.

      * Reversibility: Any extended string mapped to a basic string can
        be recovered from that basic string.

      * Efficient encoding: The ratio of basic string length to
        extended string length is small. This is important in the
        context of domain names because RFC 1034 [RFC1034] restricts
        the length of a domain label to 63 characters.

      * Simplicity: The encoding and decoding algorithms are reasonably
        simple to implement. The goals of efficiency and simplicity are
        at odds; Bootstring aims at a good balance between them.

      * Readability: Basic code points appearing in the extended
        string are represented as themselves in the basic string. This
        comes for free because it makes the encoding more efficient on
        average.

    In addition, AMC-ACE-Z can support an optional feature described in
    appendix B "Mixed-case annotation".

    AMC-ACE-Z is a working name that should be changed if it is adopted.
    (The Z merely indicates that it is the twenty-sixth ACE devised by
    this author. Most were not worth releasing.)

2. Terminology

    The key words "must", "shall", "required", "should", "recommended",
    and "may" in this document are to be interpreted as described in RFC
    2119 [RFC2119].

    As in the Unicode Standard [UNICODE], Unicode code points are
    denoted by "U+" followed by four to six hexadecimal digits, while a
    range of code points is denoted by two hexadecimal numbers separated
    by "..", with no prefixes.

    The operators div and mod perform integer division; (x div y) is the
    quotient of x divided by y, discarding the remainder, and (x mod y)
    is the remainder, so (x div y) * y + (x mod y) == x. Bootstring
    uses these operators only with nonnegative operands, so the quotient
    and remainder are always nonnegative.

    The ?: operator is a conditional; (x ? y : z) means y if x is true,
    z if x is false. It is just like "if x then y else z" except that y
    and z are expressions rather than statements.

    The "break" statement jumps out of the innermost loop (as in C).

3. Bootstring description

    Bootstring represents an arbitrary sequence of code points (the
    "extended string") as a sequence of basic code points (the
    "basic string"). This section describes the representation.
    Section 6 "Bootstring algorithms" presents the algorithms as
    pseudocode.

3.1 Basic code point segregation

    All basic code points appearing in the extended string are
    represented literally at the beginning of the basic string, in their
    original order, followed by a delimiter if (and only if) the number
    of basic code points is nonzero. The delimiter is a particular
    basic code point, which never appears in the remainder of the basic
    string. The decoder can therefore find the end of the literal
    portion (if there is one) by scanning for the last delimiter.

3.2 Insertion unsort coding

    The remainder of the basic string (after the last delimiter if there
    is one) represents a sequence of nonnegative integral deltas as
    generalized variable-length integers, described in section 3.3. The
    meaning of the deltas is best understood in terms of the decoder.

    The decoder builds the extended string incrementally. Initially,
    the extended string is a copy of the literal portion of the basic
    string (excluding the last delimiter). Each delta causes the
    decoder to insert a code point into the extended string according
    to the following procedure. There are two state variables: a
    code point n, and an index i that ranges from zero (which is the
    first position of the extended string) to the current length of
    the extended string (which refers to a potential position beyond
    the current end). The decoder advances the state monotonically
    (never returning to an earlier state) by taking steps only upward.
    Each step increments i, except when i already equals the length
    of the extended string, in which case a step resets i to zero
    and increments n. For each delta (in order), the decoder takes
    delta steps upward, then inserts the value n into the extended
    string at position i, then increments i (to skip over the code
    point just inserted). (An implementation should not take each
    step individually, but should instead use division and remainder
    calculations to advance by delta steps all at once.) It is an error
    if the inserted code point is a basic code point (because basic code
    points must be segregated as described in section 3.1).

    The encoder's main task is to derive the sequence of deltas that
    will cause the decoder to construct the desired string. It can do
    this by repeatedly scanning the extended string for the next code
    point that the decoder would need to insert, and counting the number
    of steps the decoder would need to take, mindful of the fact that
    the decoder will be stepping over only those code points that have
    already been inserted. Section 6.3 "Encoding procedure" gives a
    precise algorithm.

3.3 Generalized variable-length integers

    In a conventional integer representation the base is the number of
    distinct symbols for digits, whose values are 0 through base-1. Let
    digit_0 denote the least significant digit, digit_1 the next least
    significant, and so on. The value represented is the sum over j of
    digit_j * w(j), where w(j) = base^j is the weight (scale factor)
    for position j. For example, in the base 8 integer 437, the digits
    are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
    7 + 3*8 + 4*64 = 287. This representation has two disadvantages:
    First, there are multiple encodings of each value (because there
    can be extra zeros in the most significant positions), which
    is inconvenient when unique encodings are required. Second,
    the integer is not self-delimiting, so if multiple integers are
    concatenated the boundaries between them are lost.

    The generalized variable-length representation solves these two
    problems. The digit values are still 0 through base-1, but now
    the integer is self-delimiting by means of thresholds t(j), each
    of which is in the range 0 through base-1. Exactly one digit, the
    most significant, satisfies digit_j < t(j). Therefore, if several
    integers are concatenated, it is easy to separate them, starting
    with the first if they are little-endian (least significant digit
    first), or starting with the last if they are big-endian (most
    significant digit first). As before, the value is the sum over j of
    digit_j * w(j), but the weights are different:

        w(0) = 1
        w(j) = w(j-1) * (base - t(j-1)) for j > 0

    For example, consider the little-endian sequence of base 8 digits
    734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
    implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
    = 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is
    not less than 3, but 4 is less than 5, so 4 must be the last digit.
    The value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is
    251, with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation
    is very similar to decoding a conventional integer: Start with a
    current value of N = 0 and a weight w = 1. Fetch the next digit d
    and increase N by d * w. If d is less than the current threshold
    (t) then stop, otherwise increase w by a factor of (base - t),
    update t for the next position, and repeat.

    Encoding this representation is similar to encoding a conventional
    integer: If N < t then output one digit for N and stop, otherwise
    output the digit for t + ((N - t) mod (base - t)), then replace N
    with (N - t) div (base - t), update t for the next position, and
    repeat.

    For any particular set of values of t(j), there is exactly one
    generalized variable-length representation of each nonnegative
    integral value.
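
    As a non-normative illustration, the decoding and encoding steps
    just described can be sketched in C with the thresholds supplied
    as an explicit array rather than derived from a bias. The names
    gvli_decode and gvli_encode, the array-based interface, and the
    omission of overflow checks are assumptions of this sketch and are
    not part of Bootstring.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one little-endian generalized variable-length integer from  */
/* digits[0..ndigits-1] using thresholds t[0..ndigits-1].  Stores the */
/* value in *value and returns the number of digits consumed, or 0 if */
/* the digits run out before a terminating digit (digit < t(j)).      */
/* Overflow checks are omitted to keep the sketch short.              */
size_t gvli_decode(const unsigned *digits, size_t ndigits,
                   const unsigned *t, unsigned base,
                   unsigned long *value)
{
    unsigned long n = 0, w = 1;
    size_t j;

    for (j = 0; j < ndigits; ++j) {
        assert(t[j] < base && digits[j] < base);
        n += digits[j] * w;              /* add digit_j * w(j)        */
        if (digits[j] < t[j]) {          /* most significant digit    */
            *value = n;
            return j + 1;
        }
        w *= base - t[j];                /* w(j+1) = w(j)*(base-t(j)) */
    }
    return 0;
}

/* Encode n into digits[] using at most nt thresholds.  Returns the   */
/* number of digits written, or 0 if nt thresholds are not enough.    */
size_t gvli_encode(unsigned long n, const unsigned *t, size_t nt,
                   unsigned base, unsigned *digits)
{
    size_t j;

    for (j = 0; j < nt; ++j) {
        assert(t[j] < base);
        if (n < t[j]) {                  /* final digit ends the run  */
            digits[j] = (unsigned)n;
            return j + 1;
        }
        digits[j] = t[j] + (unsigned)((n - t[j]) % (base - t[j]));
        n = (n - t[j]) / (base - t[j]);
    }
    return 0;
}
```

    With base 8 and thresholds 2, 3, 5, 5, 5, 5, decoding the digits
    7 3 4 2 5 1 with this sketch yields 145 and then 62, matching the
    worked example above, and encoding 145 and 62 gives the same
    digits back.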

    Bootstring uses little-endian ordering so that the deltas can be
    separated starting with the first. The t(j) values are defined in
    terms of the constants base, tmin, and tmax, and a state variable
    called bias:

        t(j) = base * (j + 1) - bias,
        clamped to the range tmin through tmax

    (The clamping means that if the formula yields a value less than
    tmin or greater than tmax, then t(j) = tmin or tmax, respectively.)
    These t(j) values cause the representation to favor integers within
    a particular range determined by the bias.

3.4 Bias adaptation

    After each delta is encoded or decoded, bias is set for the next
    delta as follows:

      1. Delta is scaled in order to avoid overflow in the next step:

           let delta = delta div 2

         But when this is the very first delta, the divisor is not 2, but
         instead a constant called damp. This compensates for the fact
         that the second delta is usually much smaller than the first.

      2. Delta is increased to compensate for the fact that the next
         delta will be inserting into a longer string:

           let delta = delta + (delta div numpoints)

         numpoints is the total number of code points encoded/decoded so
         far (including the one corresponding to this delta itself, and
         including the basic code points).

      3. Delta is repeatedly divided until it falls within a threshold,
         to predict the minimum number of digits needed to represent the
         next delta:

           while delta > ((base - tmin) * tmax) div 2
           do let delta = delta div (base - tmin)

      4. The bias is set:

           let bias =
             (base * the number of divisions performed in step 3) +
             (((base - tmin + 1) * delta) div (delta + skew))

    The motivation for this procedure is that the current delta provides
    a hint about the likely size of the next delta, and so t(j) is
    set to tmax for the more significant digits starting with the one
    expected to be last, tmin for the less significant digits up through
    the one expected to be third-last, and somewhere between tmin and
    tmax for the digit expected to be second-last (balancing the hope of
    the expected-last digit being unnecessary against the danger of it
    being insufficient).

4. Bootstring parameters

    Given a set of basic code points, one must be designated as
    the delimiter. The base can be no greater than the number of
    distinguishable basic code points remaining. The values 0 through
    base-1 must be associated with non-delimiter basic code points.
    In some cases multiple code points must represent the same value;
    for example, uppercase and lowercase versions of a letter must be
    equivalent if basic strings are case-insensitive.

    The initial value of n must be no greater than the minimum non-basic
    code point that could appear in extended strings.

    The remaining five parameters (tmin, tmax, skew, damp, and the
    initial value of bias) must satisfy the following constraints:

        0 <= tmin <= tmax <= base-1
        skew >= 1
        damp >= 2
        initial_bias mod base <= base - tmin

    Provided the constraints are satisfied, these five parameters affect
    efficiency but not correctness. They should be chosen empirically.

    If support for mixed-case annotation is desired (see appendix B),
    make sure that the code points corresponding to 0 through tmax-1 all
    have both uppercase and lowercase forms.

5. Parameter values for AMC-ACE-Z

    AMC-ACE-Z uses the following Bootstring parameter values:

        base         = 36
        tmin         = 1
        tmax         = 26
        skew         = 38
        damp         = 700
        initial_bias = 72
        initial_n    = U+0080

    In AMC-ACE-Z, code points are Unicode code points [UNICODE], that
    is, integers in the range 0..10FFFF, but not D800..DFFF, which are
    reserved for use by UTF-16. The basic code points are the ASCII
    code points (0..7F), some of which have values associated with them:

        U+002D (-)   = delimiter
        41..5A (A-Z) = 0 to 25, respectively
        61..7A (a-z) = 0 to 25, respectively
        30..39 (0-9) = 26 to 35, respectively

    Using hyphen-minus as the delimiter implies that the ACE can end
    with a hyphen-minus only if the Unicode string consists entirely
    of basic code points, but IDNA forbids such strings from being
    ACE-encoded. Furthermore, the ACE can begin with a hyphen-minus
    only if the Unicode string does, which is forbidden by IDNA.
    Therefore IDNA using AMC-ACE-Z, regardless of whether the signature
    is a prefix or a suffix, conforms to the RFC 952 requirement that
    hostname labels neither begin nor end with a hyphen-minus [RFC952].

    A decoder must recognize the letters in both uppercase and lowercase
    forms (including mixtures of both forms). An encoder should output
    only uppercase forms or only lowercase forms, unless it uses
    mixed-case annotation (see appendix B).

    Presumably most users will not manually copy ACEs by writing or
    typing them (as opposed to letting computers do it via cut & paste),
    but those that do will need to be alert to the potential visual
    ambiguity between the following sets of characters:

        G 6
        I l 1
        O 0
        S 5
        U V
        Z 2

    Such ambiguities are usually resolved by context, but in an ACE
    there is no context apparent to humans.

6. Bootstring algorithms

    Some parts of the pseudocode can be omitted if the parameters
    satisfy certain conditions (for which AMC-ACE-Z qualifies). These
    parts are enclosed in {braces}, and notes immediately following the
    pseudocode explain the conditions under which they may be omitted.

6.1 Bias adaptation function

    function adapt(delta,numpoints,firsttime):
      let delta = delta div (firsttime ? damp : 2)
      let delta = delta + (delta div numpoints)
      let k = 0
      while delta > ((base - tmin) * tmax) div 2
        do let delta = delta div (base - tmin) and let k = k + base
      return k + (((base - tmin + 1) * delta) div (delta + skew))

6.2 Decoding procedure

    let n = initial_n
    let i = 0
    let bias = initial_bias
    let output = an empty string indexed from 0
    consume all code points before the last delimiter (if there is one)
      and copy them to output, fail on any non-basic code point
    if more than zero code points were consumed then consume one more
    while the input is not exhausted do begin
      let oldi = i
      let w = 1
      for k = base to infinity in steps of base do begin
        consume a code point, fail on end-of-input
        let digit = the code point's value, fail if it has no value
        let i = i + digit * w, fail on overflow
        let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias
        if digit < t then break
        let w = w * (base - t), fail on overflow
      end
      let bias = adapt(i - oldi, length(output) + 1, oldi == 0)
      let n = n + i div (length(output) + 1), fail on overflow
      let i = i mod (length(output) + 1)
      {if n is a basic code point then fail}
      insert n into output at position i
      increment i
    end

    The statement enclosed in braces (checking whether n is a basic
    code point) may be omitted if initial_n exceeds all basic code
    points (which is true for AMC-ACE-Z), because n is never less than
    initial_n.
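
    The decoding procedure above, together with the bias adaptation
    function of section 6.1, can be sketched in C as follows. This is
    a non-normative sketch and is not the sample implementation of
    appendix C: the names acez_decode and digit_value, the interface
    using unsigned long code points, and the -1 failure return are
    assumptions made for illustration. The brace-enclosed check is
    omitted, as allowed for AMC-ACE-Z, and the sketch does not verify
    that each decoded value is a valid Unicode code point.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* AMC-ACE-Z parameter values from section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 0x80, DELIMITER = '-' };

/* Bias adaptation function (section 6.1). */
unsigned long adapt(unsigned long delta, unsigned long numpoints,
                    int firsttime)
{
    unsigned long k = 0;

    delta = delta / (firsttime ? DAMP : 2);
    delta += delta / numpoints;
    while (delta > ((BASE - TMIN) * TMAX) / 2) {
        delta /= BASE - TMIN;
        k += BASE;
    }
    return k + ((BASE - TMIN + 1) * delta) / (delta + SKEW);
}

/* Digit value of an ASCII code point, or -1 if it has none. */
int digit_value(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= '0' && c <= '9') return c - '0' + 26;
    return -1;
}

/* Decoding procedure (section 6.2).  Returns the number of code */
/* points stored in out[], or -1 on any failure.                 */
long acez_decode(const char *in, size_t inlen,
                 unsigned long *out, size_t outmax)
{
    unsigned long n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
    unsigned long oldi, w, k, t;
    size_t outlen = 0, pos, b = 0, j;
    int d;

    assert(in != NULL && out != NULL);
    for (j = 0; j < inlen; ++j)          /* locate the last delimiter */
        if (in[j] == DELIMITER) b = j;
    for (j = 0; j < b; ++j) {            /* copy the literal portion  */
        if ((unsigned char)in[j] >= 0x80 || outlen >= outmax) return -1;
        out[outlen++] = (unsigned char)in[j];
    }
    pos = b > 0 ? b + 1 : 0;             /* consume the delimiter too */
    while (pos < inlen) {
        oldi = i;
        w = 1;
        for (k = BASE; ; k += BASE) {
            if (pos >= inlen) return -1;              /* end of input */
            d = digit_value((unsigned char)in[pos++]);
            if (d < 0) return -1;                     /* no value     */
            if ((unsigned long)d > (ULONG_MAX - i) / w) return -1;
            i += (unsigned long)d * w;
            t = k <= bias ? TMIN : k - bias > TMAX ? TMAX : k - bias;
            if ((unsigned long)d < t) break;
            if (w > ULONG_MAX / (BASE - t)) return -1;
            w *= BASE - t;
        }
        bias = adapt(i - oldi, outlen + 1, oldi == 0);
        if (i / (outlen + 1) > ULONG_MAX - n) return -1;
        n += i / (outlen + 1);
        i %= outlen + 1;
        if (outlen >= outmax) return -1;
        for (j = outlen; j > i; --j)     /* open a gap at position i  */
            out[j] = out[j - 1];
        out[i++] = n;
        ++outlen;
    }
    return (long)outlen;
}
```

    Applied to the encoding given for example (I) in section 7, this
    sketch produces 40 code points, with U+00E9 at index 5 and U+00F1
    at index 37.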

    Because the decoder state can only advance monotonically, and there
    is only one representation of any delta, there is therefore only
    one encoded string that can represent a given sequence of integers.
    The only error conditions are invalid code points, unexpected
    end-of-input, overflow (attempts to compute values that exceed the
    maximum value of an integer variable), and basic code points encoded
    using deltas instead of appearing literally. If the decoder fails
    on these errors as shown above, then it cannot produce the same
    output for two distinct inputs, and hence it need not re-encode its
    output to verify that it matches the input.

    The assignment of t, where t is clamped to the range tmin through
    tmax, does not handle the case where bias < k < bias + tmin, but
    that is impossible because of the way bias is computed and because
    of the constraints on the parameters.

    If the programming language does not provide overflow detection,
    the following technique can be used. Suppose A, B, and C are
    representable nonnegative integers and C is nonzero. Then A + B
    overflows if and only if B > maxint - A, and A + (B * C) overflows
    if and only if B > (maxint - A) div C. See appendix C "AMC-ACE-Z
    sample implementation" for demonstrations of this technique.
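
    The two overflow tests just stated can be written as small C
    helpers. This is an illustrative sketch only: the helper names
    add_checked and muladd_checked, and the choice of unsigned long
    (so that maxint is ULONG_MAX), are assumptions and do not come
    from the sample implementation.

```c
#include <assert.h>
#include <limits.h>

/* Stores a + b in *result and returns 1, or returns 0 (leaving    */
/* *result untouched) if the sum would exceed ULONG_MAX.           */
int add_checked(unsigned long a, unsigned long b,
                unsigned long *result)
{
    if (b > ULONG_MAX - a) return 0;       /* B > maxint - A         */
    *result = a + b;
    return 1;
}

/* Stores a + b * c in *result and returns 1, or returns 0 if the  */
/* value would exceed ULONG_MAX.  As in the text, c must be        */
/* nonzero.                                                        */
int muladd_checked(unsigned long a, unsigned long b,
                   unsigned long c, unsigned long *result)
{
    assert(c != 0);
    if (b > (ULONG_MAX - a) / c) return 0; /* B > (maxint - A) div C */
    *result = a + b * c;
    return 1;
}
```

    In the decoder, the update "let i = i + digit * w" corresponds to
    muladd_checked(i, digit, w, &i); the tests are performed before
    the arithmetic, so no intermediate value ever wraps around.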

6.3 Encoding procedure

    let n = initial_n
    let delta = 0
    let bias = initial_bias
    let h = b = the number of basic code points in the input
    copy them to the output in order, followed by a delimiter if b > 0
    {if the input contains a non-basic code point < n then fail}
    while h < length(input) do begin
      let m = the minimum {non-basic} code point >= n in the input
      let delta = delta + (m - n) * (h + 1), fail on overflow
      let n = m
      for each code point m in the input (in order) do begin
        if m < n {or m is basic} then increment delta, fail on overflow
        if m == n then begin
          let q = delta
          for k = base to infinity in steps of base do begin
            let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias
            if q < t then break
            output the code point for digit t + ((q - t) mod (base - t))
            let q = (q - t) div (base - t)
          end
          output the code point for digit q
          let bias = adapt(delta, h + 1, h == b)
          let delta = 0
          increment h
        end
      end
      increment delta and n
    end

    The full statement enclosed in braces (checking whether the input
    contains a non-basic code point less than n) can be omitted if all
    code points less than initial_n are basic code points (which is true
    for AMC-ACE-Z if code points are unsigned).

    The brace-enclosed conditions "non-basic" and "or m is basic" can be
    omitted if initial_n exceeds all basic code points (which is true
    for AMC-ACE-Z), because the code point being tested is never less
    than initial_n.

    The checks for overflow are necessary to avoid producing invalid
    output when the input contains very large values or is very long.
    Wider integer variables can handle more extreme inputs. For
    AMC-ACE-Z, 26-bit unsigned integers are sufficient, because any
    string that required a 27-bit delta would have to exceed either
    the code point limit (0..10FFFF) or the label length limit (63
    characters).
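
    The encoding procedure can likewise be sketched in C. As before,
    this is a non-normative sketch rather than the appendix C code:
    the names acez_encode and digit_char, the unsigned long interface,
    the lowercase-only digit output, and the -1 failure return are
    assumptions made for illustration. All brace-enclosed parts are
    omitted, as permitted for AMC-ACE-Z, and the input is not checked
    for values outside the Unicode range.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* AMC-ACE-Z parameter values from section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 0x80, DELIMITER = '-' };

/* Bias adaptation function (section 6.1). */
unsigned long adapt(unsigned long delta, unsigned long numpoints,
                    int firsttime)
{
    unsigned long k = 0;

    delta = delta / (firsttime ? DAMP : 2);
    delta += delta / numpoints;
    while (delta > ((BASE - TMIN) * TMAX) / 2) {
        delta /= BASE - TMIN;
        k += BASE;
    }
    return k + ((BASE - TMIN + 1) * delta) / (delta + SKEW);
}

/* ASCII code point for digit value d (0..35), lowercase letters. */
char digit_char(unsigned long d)
{
    return (char)(d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Encoding procedure (section 6.3).  Returns the number of ASCII */
/* code points stored in out[] (no terminator), or -1 on failure. */
long acez_encode(const unsigned long *in, size_t inlen,
                 char *out, size_t outmax)
{
    unsigned long n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
    unsigned long m, q, k, t;
    size_t h, b, j, outlen = 0;

    assert(in != NULL && out != NULL);
    for (j = 0; j < inlen; ++j)          /* copy basic code points   */
        if (in[j] < 0x80) {
            if (outlen >= outmax) return -1;
            out[outlen++] = (char)in[j];
        }
    h = b = outlen;
    if (b > 0) {
        if (outlen >= outmax) return -1;
        out[outlen++] = DELIMITER;
    }
    while (h < inlen) {
        m = ULONG_MAX;                   /* minimum code point >= n  */
        for (j = 0; j < inlen; ++j)
            if (in[j] >= n && in[j] < m) m = in[j];
        if (m - n > (ULONG_MAX - delta) / (h + 1)) return -1;
        delta += (m - n) * (h + 1);
        n = m;
        for (j = 0; j < inlen; ++j) {
            if (in[j] < n && ++delta == 0) return -1;
            if (in[j] != n) continue;
            q = delta;                   /* emit delta as digits     */
            for (k = BASE; ; k += BASE) {
                t = k <= bias ? TMIN : k - bias > TMAX ? TMAX : k - bias;
                if (q < t) break;
                if (outlen >= outmax) return -1;
                out[outlen++] = digit_char(t + (q - t) % (BASE - t));
                q = (q - t) / (BASE - t);
            }
            if (outlen >= outmax) return -1;
            out[outlen++] = digit_char(q);
            bias = adapt(delta, h + 1, h == b);
            delta = 0;
            ++h;
        }
        ++delta;
        ++n;
    }
    return (long)outlen;
}
```

    For the six code points U+0062 U+00FC U+0063 U+0068 U+0065 U+0072
    this sketch writes the nine characters "bcher-kva"; decoding that
    string as described in section 6.2 recovers the original code
    points.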

    The increment of delta at the bottom of the outer loop cannot
    overflow because delta < length(input) before the increment, and
    length(input) is already assumed to be representable. The increment
    of n could overflow, but only if h == length(input), in which case
    the procedure is finished anyway.

7. AMC-ACE-Z example strings

    In the AMC-ACE-Z encodings below, the IDNA signature prefix is not
    shown. AMC-ACE-Z is abbreviated AMC-Z. Backslashes show where line
    breaks have been inserted in strings too long for one line.

    The first several examples are all translations of the sentence "Why
    can't they just speak in <language>?" (courtesy of Michael Kaplan's
    "provincial" page [PROVINCIAL]). Word breaks and punctuation have
    been removed, as is often done in domain names.

    (A) Arabic (Egyptian):
        u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
        u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
        AMC-Z: egbpdaj6bu4bxfgehfvwxn

    (B) Chinese (simplified):
        u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
        AMC-Z: ihqwcrb4cv8a8dqg056pqjye

    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
        u+0065 u+0073 u+006B u+0079
        AMC-Z: Proprostnemluvesky-uyb24dma41a

    (D) Hebrew:
        u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
        u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
        u+05D1 u+05E8 u+05D9 u+05EA
        AMC-Z: 4dbcagdahymbxekheh6e0a7fei0b

    (E) Hindi (Devanagari):
        u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
        u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
        u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
        u+0939 u+0948 u+0902
        AMC-Z: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

    (F) Japanese (kanji and hiragana):
        u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
        u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
        AMC-Z: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

    (G) Korean (Hangul syllables):
        u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
        u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
        u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
        AMC-Z: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5jps\
               d879ccm6fea98c

    (H) Russian (Cyrillic):
        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
        u+0438
        AMC-Z: b1abfaaepdrnnbgefbaDotcwatmq2g4l

    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
        u+0061 u+00F1 u+006F u+006C
        AMC-Z: PorqunopuedensimplementehablarenEspaol-fmd56a

    (J) Taiwanese:
        u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
        AMC-Z: ihqwctvzc91f659drss3x8bo0yb

    (K) Vietnamese:
        T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
        <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
        U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
        u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
        u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
        U+0056 u+0069 u+1EC7 u+0074
        AMC-Z: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3B
        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
        AMC-Z: 3B-ww4c5e180e575a65lsy2b

    (M) -with-SUPER-MONKEYS
        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
        U+004F U+004E U+004B U+0045 U+0059 U+0053
        AMC-Z: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

    (N) Hello-Another-Way-
        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
        AMC-Z: Hello-Another-Way--fc4qua05auwb3674vfr0b

    (O) 2
        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
        AMC-Z: 2-u9tlzr9756bt3uc0v

    (P) MajiKoi5
        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
        u+308B u+0035 u+79D2 u+524D
        AMC-Z: MajiKoi5-783gue6qz075azm5e

    (Q) de
        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
        AMC-Z: de-jg4avhby1noc0d

    (R)
        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
        AMC-Z: d9juau41awczczp

    The last example is an ASCII string that breaks not only the
    existing rules for host name labels but also the rules proposed in
    [NAMEPREP03] for internationalized domain names.

    (S) -> $1.00 <-
        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
        u+003C u+002D
        AMC-Z: -> $1.00 <--

8. Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority. If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another. Therefore AMC-ACE-Z is
    designed so that each Unicode string has a unique encoding.
644 However, there can still be multiple Unicode representations of the 645 "same" text, for various definitions of "same". This problem is 646 addressed to some extent by the Unicode standard under the topic of 647 canonicalization, and this work is leveraged for domain names by 648 "nameprep" [NAMEPREP03]. 650 9. References 652 [IDN] Internationalized Domain Names (IETF working group), 653 http://www.i-d-n.net/, idn@ops.ietf.org. 655 [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host 656 Names In Applications (IDNA)", 2001-Jun-16, draft-ietf-idn-idna-02. 658 [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation 659 of Internationalized Host Names", 2001-Feb-24, 660 draft-ietf-idn-nameprep-03. 662 [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", 663 http://www.trigeminal.com/samples/provincial.html. 665 [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host 666 Table Specification", 1985-Oct, RFC 952. 668 [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", 669 1987-Nov, RFC 1034. 671 [UNICODE] The Unicode Consortium, "The Unicode Standard", 672 http://www.unicode.org/unicode/standard/standard.html. 674 A. Author contact information 676 Adam M. Costello 677 University of California, Berkeley 678 http://www.cs.berkeley.edu/~amc/ 680 B. Mixed-case annotation 682 In order to use AMC-ACE-Z to represent case-insensitive strings, 683 higher layers need to case-fold the strings prior to AMC-ACE-Z 684 encoding. The encoded string can, however, use mixed case as an 685 annotation telling how to convert the original folded string into a 686 mixed-case string for display purposes. 688 Basic code points are represented literally, and can therefore use 689 mixed case directly. Each non-basic code point is represented by 690 a delta, which is represented by a sequence of basic code points, 691 the last of which provides the annotation. 
    If it is uppercase,
    it is a suggestion to map the non-basic code point to uppercase
    (if possible); if it is lowercase, it is a suggestion to map the
    non-basic code point to lowercase (if possible).

    AMC-ACE-Z encoders and decoders are not required to support these
    annotations, and higher layers need not use them.

C. AMC-ACE-Z sample implementation

/******************************************/
/* amc-ace-z.c 0.3.0 (2001-Aug-07-Tue)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-Z version 0.3.x. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,   /* Input is invalid.                         */
  amc_ace_big_output,  /* Output would exceed the space provided.   */
  amc_ace_overflow     /* Input requires wider integers to process. */
};

#if UINT_MAX >= (1 << 26) - 1
typedef unsigned int amc_ace_z_uint;
#else
typedef unsigned long amc_ace_z_uint;
#endif

enum amc_ace_status amc_ace_z_encode(
  amc_ace_z_uint input_length,
  const amc_ace_z_uint input[],
  const unsigned char uppercase_flags[],
  amc_ace_z_uint *output_length,
  char output[] );

    /* amc_ace_z_encode() converts Unicode to AMC-ACE-Z (without */
    /* any signature).  The input must be represented as an array */
    /* of Unicode code points (not code units; surrogate pairs */
    /* are not allowed), and the output will be represented as an */
    /* array of ASCII code points.  The output string is *not* */
    /* null-terminated; it will contain zeros if and only if the */
    /* input contains zeros.  (Of course the caller can leave room */
    /* for a terminator and add one if needed.)  The input_length is */
    /* the number of code points in the input.
       The output_length is */
    /* an in/out argument: the caller must pass in the maximum number */
    /* of code points that may be output, and on successful return it */
    /* will contain the number of code points actually output.  The */
    /* uppercase_flags array must hold input_length boolean values, */
    /* where nonzero means the corresponding Unicode character should */
    /* be forced to uppercase after being decoded, and zero means it */
    /* is caseless or should be forced to lowercase.  Alternatively, */
    /* uppercase_flags may be a null pointer, which is equivalent to */
    /* all zeros.  ASCII code points are always encoded literally, */
    /* regardless of the corresponding flags.  The return value may */
    /* be any of the amc_ace_status values defined above except */
    /* amc_ace_bad_input; if not amc_ace_success, then output_length */
    /* and output may contain garbage. */

enum amc_ace_status amc_ace_z_decode(
  amc_ace_z_uint input_length,
  const char input[],
  amc_ace_z_uint *output_length,
  amc_ace_z_uint output[],
  unsigned char uppercase_flags[] );

    /* amc_ace_z_decode() converts AMC-ACE-Z (without any signature) */
    /* to Unicode.  The input must be represented as an array of */
    /* ASCII code points, and the output will be represented as */
    /* an array of Unicode code points.  The input_length is the */
    /* number of code points in the input.  The output_length is */
    /* an in/out argument: the caller must pass in the maximum */
    /* number of code points that may be output, and on successful */
    /* return it will contain the actual number of code points */
    /* output.  The uppercase_flags array must have room for at */
    /* least output_length values, or it may be a null pointer if */
    /* the case information is not needed.
       A nonzero flag indicates */
    /* that the corresponding Unicode character should be forced to */
    /* uppercase by the caller, while zero means it is caseless or */
    /* should be forced to lowercase.  ASCII code points are output */
    /* already in the proper case, but their flags will be set */
    /* appropriately so that applying the flags would be harmless. */
    /* The return value may be any of the amc_ace_status values */
    /* defined above; if not amc_ace_success, then output_length, */
    /* output, and uppercase_flags may contain garbage.  On success, */
    /* the decoder will never need to write an output_length greater */
    /* than input_length, because of how the encoding is defined. */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/*** Bootstring parameters for AMC-ACE-Z ***/

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };

/* basic(cp) tests whether cp is a basic code point: */
#define basic(cp) ((amc_ace_z_uint)(cp) < 0x80)

/* delim(cp) tests whether cp is a delimiter: */
#define delim(cp) ((cp) == delimiter)

/* decode_digit(cp) returns the numeric value of a basic code */
/* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp does not represent a value. */

static amc_ace_z_uint decode_digit(amc_ace_z_uint cp)
{
  return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
         cp - 97 < 26 ? cp - 97 : base;
}

/* encode_digit(d,flag) returns the basic code point whose value */
/* (when used for representing integers) is d, which must be in the */
/* range 0 to base-1.  The lowercase form is used unless flag is */
/* nonzero, in which case the uppercase form is used.  The behavior */
/* is undefined if flag is nonzero and digit d has no uppercase form.
*/

static char encode_digit(amc_ace_z_uint d, int flag)
{
  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
  /*  0..25 map to ASCII a..z or A..Z */
  /* 26..35 map to ASCII 0..9         */
}

/* flagged(bcp) tests whether a basic code point is flagged */
/* (uppercase).  The behavior is undefined if bcp is not a  */
/* basic code point.                                        */

#define flagged(bcp) ((amc_ace_z_uint)(bcp) - 65 < 26)

/*** Useful constants ***/

/* maxint is the maximum value of an amc_ace_z_uint variable: */
static const amc_ace_z_uint maxint = -1;

/* lobase and cutoff are used in the calculation of bias: */
enum { lobase = base - tmin, cutoff = lobase * tmax / 2 };

/*** Main encode function ***/

enum amc_ace_status amc_ace_z_encode(
  amc_ace_z_uint input_length,
  const amc_ace_z_uint input[],
  const unsigned char uppercase_flags[],
  amc_ace_z_uint *output_length,
  char output[] )
{
  amc_ace_z_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;

  /* Initialize the state: */

  n = initial_n;
  delta = out = 0;
  max_out = *output_length;
  bias = initial_bias;

  /* Handle the basic code points: */

  for (j = 0; j < input_length; ++j) {
    if (basic(input[j])) {
      if (max_out - out < 2) return amc_ace_big_output;
      output[out++] = input[j];
    }
    /* else if (input[j] < n) return amc_ace_bad_input; */
    /* (not needed for AMC-ACE-Z with unsigned code points) */
  }

  h = b = out;

  /* h is the number of code points that have been handled, b is the */
  /* number of basic code points, and out is the number of characters */
  /* that have been output. */

  if (b > 0) output[out++] = delimiter;

  /* Main encoding loop: */

  while (h < input_length) {
    /* All non-basic code points < n have been */
    /* handled already.
       Find the next larger one: */

    for (m = maxint, j = 0; j < input_length; ++j) {
      /* if (basic(input[j])) continue; */
      /* (not needed for AMC-ACE-Z) */
      if (input[j] >= n && input[j] < m) m = input[j];
    }

    /* Increase delta enough to advance the decoder's */
    /* <n,i> state to <m,0>, but guard against overflow: */

    if (m - n > (maxint - delta) / (h + 1)) return amc_ace_overflow;
    delta += (m - n) * (h + 1);
    n = m;

    for (j = 0; j < input_length; ++j) {
      #if 0
      if (input[j] < n || basic(input[j])) {
        if (++delta == 0) return amc_ace_overflow;
      }
      #endif
      /* AMC-ACE-Z can use this simplified version instead: */
      if (input[j] < n && ++delta == 0) return amc_ace_overflow;

      if (input[j] == n) {
        /* Represent delta as a generalized variable-length integer: */

        for (q = delta, k = base; ; k += base) {
          if (out >= max_out) return amc_ace_big_output;
          t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
          if (q < t) break;
          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
          q = (q - t) / (base - t);
        }

        output[out++] =
          encode_digit(q, uppercase_flags && uppercase_flags[j]);

        /* Adapt the bias: */
        delta = h == b ?
          delta / damp : delta >> 1;
        delta += delta / (h + 1);
        for (bias = 0; delta > cutoff; bias += base) delta /= lobase;
        bias += (lobase + 1) * delta / (delta + skew);

        delta = 0;
        ++h;
      }
    }
    ++delta, ++n;
  }

  *output_length = out;
  return amc_ace_success;
}

/*** Main decode function ***/

enum amc_ace_status amc_ace_z_decode(
  amc_ace_z_uint input_length,
  const char input[],
  amc_ace_z_uint *output_length,
  amc_ace_z_uint output[],
  unsigned char uppercase_flags[] )
{
  amc_ace_z_uint n, out, i, max_out, bias, b, j,
                 in, oldi, w, k, delta, digit, t;

  /* Initialize the state: */

  n = initial_n;
  out = i = 0;
  max_out = *output_length;
  bias = initial_bias;

  /* Handle the basic code points: Let b be the number of input code */
  /* points before the last delimiter, or 0 if there is none, then */
  /* copy the first b code points to the output. */

  for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j;
  if (b > max_out) return amc_ace_big_output;

  for (j = 0; j < b; ++j) {
    if (uppercase_flags) uppercase_flags[out] = flagged(input[j]);
    if (!basic(input[j])) return amc_ace_bad_input;
    output[out++] = input[j];
  }

  /* Main decoding loop: Start just after the last delimiter if any */
  /* basic code points were copied; start at the beginning otherwise. */

  for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {

    /* in is the index of the next character to be consumed, and */
    /* out is the number of code points in the output array. */

    /* Decode a generalized variable-length integer into delta, */
    /* which gets added to i.  The overflow checking is easier */
    /* if we increase i as we go, then subtract off its starting */
    /* value at the end to obtain delta.
    */

    for (oldi = i, w = 1, k = base; ; k += base) {
      if (in >= input_length) return amc_ace_bad_input;
      digit = decode_digit(input[in++]);
      if (digit >= base) return amc_ace_bad_input;
      if (digit > (maxint - i) / w) return amc_ace_overflow;
      i += digit * w;
      t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
      if (digit < t) break;
      if (w > maxint / (base - t)) return amc_ace_overflow;
      w *= (base - t);
    }

    /* Adapt the bias: */
    delta = oldi == 0 ? i / damp : (i - oldi) >> 1;
    delta += delta / (out + 1);
    for (bias = 0; delta > cutoff; bias += base) delta /= lobase;
    bias += (lobase + 1) * delta / (delta + skew);

    /* i was supposed to wrap around from out+1 to 0, */
    /* incrementing n each time, so we'll fix that now: */

    if (i / (out + 1) > maxint - n) return amc_ace_overflow;
    n += i / (out + 1);
    i %= (out + 1);

    /* Insert n at position i of the output: */

    /* not needed for AMC-ACE-Z: */
    /* if (decode_digit(n) <= base) return amc_ace_invalid_input; */
    if (out >= max_out) return amc_ace_big_output;

    if (uppercase_flags) {
      memmove(uppercase_flags + i + 1, uppercase_flags + i, out - i);
      /* Case of last character determines uppercase flag: */
      uppercase_flags[i] = flagged(input[in - 1]);
    }

    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
    output[i++] = n;
  }

  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option.
*/

enum {
  unicode_max_length = 256,
  ace_max_length = 256
};

static void usage(char **argv)
{
  fprintf(stderr,
    "\n"
    "%s -e reads code points and writes an AMC-ACE-Z string.\n"
    "%s -d reads an AMC-ACE-Z string and writes code points.\n"
    "\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "The AMC-ACE-Z strings do not include any signatures.\n"
    "Although the specification allows AMC-ACE-Z strings to contain\n"
    "any characters from the ASCII repertoire, this test code\n"
    "supports only the printable characters, and requires the\n"
    "AMC-ACE-Z string to be followed by a newline.\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char overflow[] = "arithmetic overflow\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert printable */
/* characters between ASCII and the native charset:  */

static const char print_ascii[] =
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  " !\"#$%&'()*+,-./"
  "0123456789:;<=>?"
  "@ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ[\\]^_"
  "`abcdefghijklmno"
  "pqrstuvwxyz{|}~\n";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  /* amc_ace_z_uint (not unsigned int) so that &output_length has */
  /* the pointer type the encode and decode functions expect.     */
  amc_ace_z_uint input_length, output_length, j;
  unsigned char uppercase_flags[unicode_max_length];

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  /* Check argv[1][1] first so a bare "-" is not read past its end: */
  if (argv[1][1] == 0 || argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    amc_ace_z_uint input[unicode_max_length];
    unsigned long codept;
    char output[ace_max_length+1], uplus[3];
    int c;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (amc_ace_z_uint)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_length = ace_max_length;
    status = amc_ace_z_encode(input_length, input, uppercase_flags,
                              &output_length, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    if (status == amc_ace_overflow) fail(overflow);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (j = 0; j < output_length; ++j) {
      c = output[j];
      assert(c >= 0 && c <= 127);
      if (print_ascii[c] == 0) fail(invalid_input);
      output[j] = print_ascii[c];
    }
    output[j] = 0;
    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_length+2], *p, *pp;
    amc_ace_z_uint output[unicode_max_length];

    /* Read the AMC-ACE-Z input string
       and convert to ASCII: */

    fgets(input, ace_max_length+2, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input) - 1;
    if (input[input_length] != '\n') fail(too_big);
    input[input_length] = 0;

    for (p = input; *p != 0; ++p) {
      pp = strchr(print_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - print_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_z_decode(input_length, input, &output_length,
                              output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    if (status == amc_ace_overflow) fail(overflow);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (j = 0; j < output_length; ++j) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[j] ? "U" : "u",
                 (unsigned long) output[j] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

INTERNET-DRAFT                                     expires 2002-Feb-16
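
    The generalized variable-length integers used by the sample
    implementation can be exercised in isolation.  The following is a
    minimal sketch, not part of the specification.  It assumes the
    parameters base = 36, tmin = 1, and tmax = 26 from the sample
    implementation and an ASCII execution character set.  The helper
    names threshold(), digit_char(), char_digit(), encode_delta(), and
    decode_delta() are hypothetical, introduced only for illustration.
    Bias adaptation is deliberately left out: the caller supplies a
    fixed bias.

```c
#include <limits.h>
#include <stddef.h>

/* Parameters copied from the sample implementation. */
enum { base = 36, tmin = 1, tmax = 26 };

/* Hypothetical helper: threshold for the digit at position k */
/* (k = base, 2*base, ...) under a fixed bias.                */
static unsigned long threshold(unsigned long k, unsigned long bias)
{
  return k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
}

/* Hypothetical helper: digit value 0..35 to lowercase ASCII. */
/* 0..25 map to a..z (97..122), 26..35 map to 0..9 (48..57).  */
static char digit_char(unsigned long d)
{
  return (char)(d < 26 ? 97 + d : 22 + d);
}

/* Hypothetical helper: ASCII code to digit value, or base if */
/* the code is not a digit.  Letter case is ignored.          */
static unsigned long char_digit(unsigned long c)
{
  return c - 48 < 10 ? c - 22 : c - 65 < 26 ? c - 65 :
         c - 97 < 26 ? c - 97 : base;
}

/* Writes delta as a generalized variable-length integer.     */
/* Returns the number of characters written, or 0 if cap is   */
/* too small (a valid encoding is never empty).               */
static size_t encode_delta(unsigned long delta, unsigned long bias,
                           char *out, size_t cap)
{
  unsigned long q = delta, k, t;
  size_t n = 0;

  for (k = base; ; k += base) {
    t = threshold(k, bias);
    if (q < t) break;
    if (n >= cap) return 0;
    out[n++] = digit_char(t + (q - t) % (base - t));
    q = (q - t) / (base - t);
  }
  if (n >= cap) return 0;
  out[n++] = digit_char(q);   /* final digit is always below t */
  return n;
}

/* Reads one integer from in[0..len-1] into *delta.  Returns  */
/* the number of characters consumed, or 0 on truncated       */
/* input, a non-digit character, or arithmetic overflow.      */
static size_t decode_delta(const char *in, size_t len,
                           unsigned long bias, unsigned long *delta)
{
  unsigned long i = 0, w = 1, k, t, digit;
  size_t pos = 0;

  for (k = base; ; k += base) {
    if (pos >= len) return 0;
    digit = char_digit((unsigned char)in[pos++]);
    if (digit >= base) return 0;
    if (digit > (ULONG_MAX - i) / w) return 0;
    i += digit * w;
    t = threshold(k, bias);
    if (digit < t) break;
    if (w > ULONG_MAX / (base - t)) return 0;
    w *= base - t;
  }
  *delta = i;
  return pos;
}
```

    With a bias of 72 (the value of initial_bias), the thresholds for
    the first three digit positions are 1, 1, and 26, so under this
    sketch a delta of 36 is written as the three digits 1, 1, 0, that
    is, "bba", and decode_delta() applied to "bba" yields 36 again.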