idnits 2.17.1

draft-ietf-idn-amc-ace-z-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 2
     longer pages, the longest (page 9) being 59 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([UNICODE], [IDNA], [IDN]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 805 has weird spacing: '... return cp - ...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'NAMEPREP' is mentioned on line 78, but not defined

  == Missing Reference: 'RFC2119' is mentioned on line 121, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 1111

  -- Looks like a reference, but probably isn't: '1' on line 1141

  -- Looks like a reference, but probably isn't: '2' on line 1087

  -- Looks like a reference, but probably isn't: '3' on line 1092

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  == Outdated reference: A later version (-13) exists of
     draft-ietf-idn-idna-02

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'


     Summary: 6 errors (**), 0 flaws (~~), 7 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-amc-ace-z-00.txt                              2001-Aug-16
3	Expires 2002-Feb-16

5	                         AMC-ACE-Z version 0.3.0

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/

34	Abstract

36	    AMC-ACE-Z is a simple and efficient ASCII-Compatible Encoding (ACE)
37	    designed for use with Internationalized Domain Names [IDN] [IDNA].
38	    It uniquely and reversibly transforms a Unicode string [UNICODE]
39	    into an ASCII string.  ASCII characters in the Unicode string are
40	    represented literally, and non-ASCII characters are represented
41	    by ASCII characters that are allowed in hostname labels (letters,
42	    digits, and hyphens).  Bootstring is a general algorithm that allows
43	    a string of basic code points to uniquely represent any string of
44	    code points drawn from a larger set.  AMC-ACE-Z is an instance of
45	    Bootstring that uses particular parameter values appropriate for
46	    IDNA and uses an IDNA signature prefix (or suffix).  This document
47	    specifies Bootstring and the parameter values for AMC-ACE-Z.

49	Contents

51	     1. Introduction
52	     2. Terminology
53	     3. Bootstring description
54	         3.1 Basic code point segregation
55	         3.2 Insertion unsort coding
56	         3.3 Generalized variable-length integers
57	         3.4 Bias adaptation
58	     4. Bootstring parameters
59	     5. Parameter values for AMC-ACE-Z
60	     6. Bootstring algorithms
61	         6.1 Bias adaptation function
62	         6.2 Decoding procedure
63	         6.3 Encoding procedure
64	     7. AMC-ACE-Z example strings
65	     8. Security considerations
66	     9. References
67	     A. Author contact information
68	     B. Mixed-case annotation
69	     C. AMC-ACE-Z sample implementation

71	1. Introduction

73	    The IDNA draft [IDNA] describes an architecture for supporting
74	    internationalized domain names.  Each label of a domain name may
75	    contain a special prefix (or suffix), in which case the rest of the
76	    label is an ASCII-Compatible Encoding (ACE) of a Unicode string
77	    satisfying certain constraints.  For the details of the constraints,
78	    see [IDNA] and [NAMEPREP].  The prefix has not yet been specified,
79	    but see http://www.i-d-n.net/ for prefixes to be used for testing
80	    and experimentation.

82	    Bootstring has been designed to have the following features:

84	      * Completeness:  Every extended string (sequence of arbitrary code
85	        points) can be represented by a basic string (sequence of basic
86	        code points).  Restrictions on what strings are allowed, and on
87	        length, may be imposed by higher layers.

89	      * Uniqueness:  There is at most one basic string that represents a
90	        given extended string.

92	      * Reversibility:  Any extended string mapped to a basic string can
93	        be recovered from that basic string.

95	      * Efficient encoding:  The ratio of basic string length to
96	        extended string length is small.  This is important in the
97	        context of domain names because RFC 1034 [RFC1034] restricts
98	        the length of a domain label to 63 characters.

100	      * Simplicity:  The encoding and decoding algorithms are reasonably
101	        simple to implement.  The goals of efficiency and simplicity are
102	        at odds; Bootstring aims at a good balance between them.

104	      * Readability:  Basic code points appearing in the extended
105	        string are represented as themselves in the basic string.  This
106	        comes for free because it makes the encoding more efficient on
107	        average.

109	    In addition, AMC-ACE-Z can support an optional feature described in
110	    appendix B "Mixed-case annotation".

112	    AMC-ACE-Z is a working name that should be changed if it is adopted.
113	    (The Z merely indicates that it is the twenty-sixth ACE devised by
114	    this author.  Most were not worth releasing.)

116	2. Terminology

118	    The key words "must", "shall", "required", "should", "recommended",
119	    and "may" in this document are to be interpreted as described in RFC
120	    2119 [RFC2119].

122	    As in the Unicode Standard [UNICODE], Unicode code points are
123	    denoted by "U+" followed by four to six hexadecimal digits, while a
124	    range of code points is denoted by two hexadecimal numbers separated
125	    by "..", with no prefixes.

127	    The operators div and mod perform integer division; (x div y) is the
128	    quotient of x divided by y, discarding the remainder, and (x mod y)
129	    is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring
130	    uses these operators only with nonnegative operands, so the quotient
131	    and remainder are always nonnegative.

133	    The ?: operator is a conditional; (x ? y : z) means y if x is true,
134	    z if x is false.  It is just like "if x then y else z" except that y
135	    and z are expressions rather than statements.

137	    The "break" statement jumps out of the innermost loop (as in C).

139	3. Bootstring description

141	    Bootstring represents an arbitrary sequence of code points (the
142	    "extended string") as a sequence of basic code points (the
143	    "basic string").  This section describes the representation.
144	    Section 6 "Bootstring algorithms" presents the algorithms as
145	    pseudocode.

147	3.1 Basic code point segregation

149	    All basic code points appearing in the extended string are
150	    represented literally at the beginning of the basic string, in their
151	    original order, followed by a delimiter if (and only if) the number
152	    of basic code points is nonzero.  The delimiter is a particular
153	    basic code point, which never appears in the remainder of the basic
154	    string.  The decoder can therefore find the end of the literal
155	    portion (if there is one) by scanning for the last delimiter.

157	3.2 Insertion unsort coding

159	    The remainder of the basic string (after the last delimiter if there
160	    is one) represents a sequence of nonnegative integral deltas as
161	    generalized variable-length integers, described in section 3.3.  The
162	    meaning of the deltas is best understood in terms of the decoder.

164	    The decoder builds the extended string incrementally.  Initially,
165	    the extended string is a copy of the literal portion of the basic
166	    string (excluding the last delimiter).  Each delta causes the
167	    decoder to insert a code point into the extended string according
168	    to the following procedure.  There are two state variables: a
169	    code point n, and an index i that ranges from zero (which is the
170	    first position of the extended string) to the current length of
171	    the extended string (which refers to a potential position beyond
172	    the current end).  The decoder advances the state monotonically
173	    (never returning to an earlier state) by taking steps only upward.
174	    Each step increments i, except when i already equals the length
175	    of the extended string, in which case a step resets i to zero
176	    and increments n.  For each delta (in order), the decoder takes
177	    delta steps upward, then inserts the value n into the extended
178	    string at position i, then increments i (to skip over the code
179	    point just inserted).  (An implementation should not take each
180	    step individually, but should instead use division and remainder
181	    calculations to advance by delta steps all at once.)  It is an error
182	    if the inserted code point is a basic code point (because basic code
183	    points must be segregated as described in section 3.1).
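
    The equivalence between taking delta individual steps and a single
    division/remainder calculation can be sketched as follows.  This is
    a minimal illustration in Python, not part of the specification;
    the function names are hypothetical.

```python
def advance_stepwise(n, i, length, delta):
    """Take delta single steps through the (n, i) state space."""
    for _ in range(delta):
        if i == length:
            # i already refers to the position beyond the current end:
            # reset i to zero and move on to the next code point value.
            i = 0
            n += 1
        else:
            i += 1
    return n, i


def advance_at_once(n, i, length, delta):
    """Same result using one division and one remainder."""
    i += delta
    return n + i // (length + 1), i % (length + 1)


# Both functions agree; here the string has 3 code points so far.
print(advance_stepwise(0x80, 2, 3, 10) == advance_at_once(0x80, 2, 3, 10))
```

    Each value of n has length + 1 possible positions, which is why the
    divisor is length + 1 rather than length.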

185	    The encoder's main task is to derive the sequence of deltas that
186	    will cause the decoder to construct the desired string.  It can do
187	    this by repeatedly scanning the extended string for the next code
188	    point that the decoder would need to insert, and counting the number
189	    of steps the decoder would need to take, mindful of the fact that
190	    the decoder will be stepping over only those code points that have
191	    already been inserted.  Section 6.3 "Encoding procedure" gives a
192	    precise algorithm.

194	3.3 Generalized variable-length integers

196	    In a conventional integer representation the base is the number of
197	    distinct symbols for digits, whose values are 0 through base-1.  Let
198	    digit_0 denote the least significant digit, digit_1 the next least
199	    significant, and so on.  The value represented is the sum over j of
200	    digit_j * w(j), where w(j) = base^j is the weight (scale factor)
201	    for position j.  For example, in the base 8 integer 437, the digits
202	    are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
203	    7 + 3*8 + 4*64 = 287.  This representation has two disadvantages:
204	    First, there are multiple encodings of each value (because there
205	    can be extra zeros in the most significant positions), which
206	    is inconvenient when unique encodings are required.  Second,
207	    the integer is not self-delimiting, so if multiple integers are
208	    concatenated the boundaries between them are lost.

210	    The generalized variable-length representation solves these two
211	    problems.  The digit values are still 0 through base-1, but now
212	    the integer is self-delimiting by means of thresholds t(j), each
213	    of which is in the range 0 through base-1.  Exactly one digit, the
214	    most significant, satisfies digit_j < t(j).  Therefore, if several
215	    integers are concatenated, it is easy to separate them, starting
216	    with the first if they are little-endian (least significant digit
217	    first), or starting with the last if they are big-endian (most
218	    significant digit first).  As before, the value is the sum over j of
219	    digit_j * w(j), but the weights are different:

221	        w(0) = 1
222	        w(j) = w(j-1) * (base - t(j-1)) for j > 0

224	    For example, consider the little-endian sequence of base 8 digits
225	    734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This
226	    implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
227	    = 90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is
228	    not less than 3, but 4 is less than 5, so 4 must be the last digit.
229	    The value of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is
230	    251, with value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation
231	    is very similar to decoding a conventional integer:  Start with a
232	    current value of N = 0 and a weight w = 1.  Fetch the next digit d
233	    and increase N by d * w.  If d is less than the current threshold
234	    (t) then stop, otherwise increase w by a factor of (base - t),
235	    update t for the next position, and repeat.
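
    The decoding loop just described can be sketched in Python.  This
    is an illustrative sketch only: decode_integers and
    example_thresholds are hypothetical names, and the thresholds are
    the ones from the base 8 example above rather than the t(j) values
    that Bootstring actually uses.

```python
def decode_integers(digits, thresholds, base=8):
    """Split a little-endian digit sequence into integer values."""
    values = []
    pos = 0
    while pos < len(digits):
        value, weight, j = 0, 1, 0
        while True:
            if pos >= len(digits):
                raise ValueError("digit sequence ends in mid-integer")
            d = digits[pos]
            pos += 1
            t = thresholds(j)
            value += d * weight
            if d < t:            # the most significant digit ends the integer
                break
            weight *= base - t   # w(j+1) = w(j) * (base - t(j))
            j += 1
        values.append(value)
    return values


def example_thresholds(j):
    """Thresholds 2, 3, 5, 5, 5, ... from the example in the text."""
    return (2, 3)[j] if j < 2 else 5


print(decode_integers([7, 3, 4, 2, 5, 1], example_thresholds))  # [145, 62]
```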

237	    Encoding this representation is similar to encoding a conventional
238	    integer:  If N < t then output one digit for N and stop, otherwise
239	    output the digit for t + ((N - t) mod (base - t)), then replace N
240	    with (N - t) div (base - t), update t for the next position, and
241	    repeat.
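
    A corresponding sketch of the encoding direction, again using the
    thresholds of the base 8 example from this section (the names
    encode_integer and example_thresholds are hypothetical):

```python
def encode_integer(value, thresholds, base=8):
    """Return the little-endian digits of a nonnegative integer."""
    digits = []
    j = 0
    while True:
        t = thresholds(j)
        if value < t:
            digits.append(value)          # final (most significant) digit
            return digits
        digits.append(t + (value - t) % (base - t))
        value = (value - t) // (base - t)
        j += 1


def example_thresholds(j):
    """Thresholds 2, 3, 5, 5, 5, ... from the example in the text."""
    return (2, 3)[j] if j < 2 else 5


print(encode_integer(145, example_thresholds))  # [7, 3, 4]
print(encode_integer(62, example_thresholds))   # [2, 5, 1]
```

    Every digit except the last is at least t(j), and the last is less
    than t(j), which is what makes the result self-delimiting.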

243	    For any particular set of values of t(j), there is exactly one
244	    generalized variable-length representation of each nonnegative
245	    integral value.

247	    Bootstring uses little-endian ordering so that the deltas can be
248	    separated starting with the first.  The t(j) values are defined in
249	    terms of the constants base, tmin, and tmax, and a state variable
250	    called bias:

252	        t(j) = base * (j + 1) - bias,
253	        clamped to the range tmin through tmax

255	    (The clamping means that if the formula yields a value less than
256	    tmin or greater than tmax, then t(j) = tmin or tmax, respectively.)
257	    These t(j) values cause the representation to favor integers within
258	    a particular range determined by the bias.
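
    As an illustration, the clamped threshold formula can be written
    as a small Python function.  The default values of base, tmin, and
    tmax are the AMC-ACE-Z values given in section 5; the function name
    is hypothetical.

```python
def threshold(j, bias, base=36, tmin=1, tmax=26):
    """t(j) = base * (j + 1) - bias, clamped to tmin..tmax."""
    return max(tmin, min(tmax, base * (j + 1) - bias))


# With bias = 72 the first two digit positions get tmin and the
# later ones get tmax.
print([threshold(j, 72) for j in range(4)])  # [1, 1, 26, 26]
```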

260	3.4 Bias adaptation

262	    After each delta is encoded or decoded, bias is set for the next
263	    delta as follows:

265	     1. Delta is scaled in order to avoid overflow in the next step:

267	            let delta = delta div 2

269	        But when this is the very first delta, the divisor is not 2, but
270	        instead a constant called damp.  This compensates for the fact
271	        that the second delta is usually much smaller than the first.

273	     2. Delta is increased to compensate for the fact that the next
274	        delta will be inserting into a longer string:

276	            let delta = delta + (delta div numpoints)

278	        numpoints is the total number of code points encoded/decoded so
279	        far (including the one corresponding to this delta itself, and
280	        including the basic code points).

282	     3. Delta is repeatedly divided until it falls within a threshold,
283	        to predict the minimum number of digits needed to represent the
284	        next delta:

286	            while delta > ((base - tmin) * tmax) div 2
287	            do let delta = delta div (base - tmin)

289	     4. The bias is set:

291	            let bias =
292	              (base * the number of divisions performed in step 3) +
293	              (((base - tmin + 1) * delta) div (delta + skew))

295	    The motivation for this procedure is that the current delta provides
296	    a hint about the likely size of the next delta, and so t(j) is
297	    set to tmax for the more significant digits starting with the one
298	    expected to be last, tmin for the less significant digits up through
299	    the one expected to be third-last, and somewhere between tmin and
300	    tmax for the digit expected to be second-last (balancing the hope of
301	    the expected-last digit being unnecessary against the danger of it
302	    being insufficient).

304	4. Bootstring parameters

306	    Given a set of basic code points, one must be designated as
307	    the delimiter.  The base can be no greater than the number of
308	    distinguishable basic code points remaining.  The values 0 through
309	    base-1 must be associated with non-delimiter basic code points.
310	    In some cases multiple code points must represent the same value;
311	    for example, uppercase and lowercase versions of a letter must be
312	    equivalent if basic strings are case-insensitive.

314	    The initial value of n must be no greater than the minimum non-basic
315	    code point that could appear in extended strings.

317	    The remaining five parameters (tmin, tmax, skew, damp, and the
318	    initial value of bias) must satisfy the following constraints:

320	        0 <= tmin <= tmax <= base-1
321	        skew >= 1
322	        damp >= 2
323	        initial_bias mod base <= base - tmin

325	    Provided the constraints are satisfied, these five parameters affect
326	    efficiency but not correctness.  They should be chosen empirically.

328	    If support for mixed-case annotation is desired (see appendix B),
329	    make sure that the code points corresponding to 0 through tmax-1 all
330	    have both uppercase and lowercase forms.

332	5. Parameter values for AMC-ACE-Z

334	    AMC-ACE-Z uses the following Bootstring parameter values:

336	        base         = 36
337	        tmin         = 1
338	        tmax         = 26
339	        skew         = 38
340	        damp         = 700
341	        initial_bias = 72
342	        initial_n    = U+0080
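
    These values can be checked against the constraints listed in
    section 4.  The following Python sketch is only an illustration;
    parameter_problems is a hypothetical helper, not something defined
    by this specification.

```python
def parameter_problems(base, tmin, tmax, skew, damp, initial_bias):
    """Return the violated constraints (an empty list if none)."""
    problems = []
    if not 0 <= tmin <= tmax <= base - 1:
        problems.append("need 0 <= tmin <= tmax <= base-1")
    if skew < 1:
        problems.append("need skew >= 1")
    if damp < 2:
        problems.append("need damp >= 2")
    if initial_bias % base > base - tmin:
        problems.append("need initial_bias mod base <= base - tmin")
    return problems


AMC_ACE_Z = dict(base=36, tmin=1, tmax=26, skew=38, damp=700,
                 initial_bias=72)
print(parameter_problems(**AMC_ACE_Z))  # []
```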

344	    In AMC-ACE-Z, code points are Unicode code points [UNICODE], that
345	    is, integers in the range 0..10FFFF, but not D800..DFFF, which are
346	    reserved for use by UTF-16.  The basic code points are the ASCII
347	    code points (0..7F), some of which have values associated with them:

349	        U+002D (-)   = delimiter
350	        41..5A (A-Z) = 0 to 25, respectively
351	        61..7A (a-z) = 0 to 25, respectively
352	        30..39 (0-9) = 26 to 35, respectively

354	    Using hyphen-minus as the delimiter implies that the ACE can end
355	    with a hyphen-minus only if the Unicode string consists entirely
356	    of basic code points, but IDNA forbids such strings from being
357	    ACE-encoded.  Furthermore, the ACE can begin with a hyphen-minus
358	    only if the Unicode string does, which is forbidden by IDNA.
359	    Therefore IDNA using AMC-ACE-Z, regardless of whether the signature
360	    is a prefix or a suffix, conforms to the RFC 952 requirement that
361	    hostname labels neither begin nor end with a hyphen-minus [RFC952].

363	    A decoder must recognize the letters in both uppercase and lowercase
364	    forms (including mixtures of both forms).  An encoder should output
365	    only uppercase forms or only lowercase forms, unless it uses
366	    mixed-case annotation (see appendix B).
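
    The digit assignments above might be implemented along the
    following lines.  This is a sketch; the names digit_value and
    digit_char are hypothetical, and the uppercase flag is included
    only to show that either case may be produced.

```python
DELIMITER = "-"  # U+002D


def digit_value(ch):
    """Map a basic code point to its digit value, case-insensitively."""
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError("code point has no digit value: %r" % ch)


def digit_char(value, uppercase=False):
    """Map a digit value 0..35 to a letter (0..25) or digit (26..35)."""
    if not 0 <= value <= 35:
        raise ValueError("digit value out of range")
    ch = "abcdefghijklmnopqrstuvwxyz0123456789"[value]
    return ch.upper() if uppercase else ch


print(digit_value("a"), digit_value("Z"), digit_value("0"))  # 0 25 26
```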

368	    Presumably most users will not manually copy ACEs by writing or
369	    typing them (as opposed to letting computers do it via cut & paste),
370	    but those that do will need to be alert to the potential visual
371	    ambiguity between the following sets of characters:

373	        G 6
374	        I l 1
375	        O 0
376	        S 5
377	        U V
378	        Z 2

380	    Such ambiguities are usually resolved by context, but in an ACE
381	    there is no context apparent to humans.

383	6. Bootstring algorithms

385	    Some parts of the pseudocode can be omitted if the parameters
386	    satisfy certain conditions (for which AMC-ACE-Z qualifies).  These
387	    parts are enclosed in {braces}, and notes immediately following the
388	    pseudocode explain the conditions under which they may be omitted.

390	6.1 Bias adaptation function

392	    function adapt(delta,numpoints,firsttime):
393	      let delta = delta div (firsttime ? damp : 2)
394	      let delta = delta + (delta div numpoints)
395	      let k = 0
396	      while delta > ((base - tmin) * tmax) div 2
397	      do let delta = delta div (base - tmin) and let k = k + base
398	      return k + (((base - tmin + 1) * delta) div (delta + skew))
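
    A direct transcription of this function into Python, using the
    AMC-ACE-Z parameter values from section 5, might look like the
    following.  It is a sketch for illustration, not the sample
    implementation of appendix C.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700


def adapt(delta, numpoints, firsttime):
    """Compute the bias to use for the next delta."""
    delta //= DAMP if firsttime else 2     # step 1: scale
    delta += delta // numpoints            # step 2: longer string
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN              # step 3: count divisions
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


print(adapt(700, 1, True))
```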

400	6.2 Decoding procedure

402	    let n = initial_n
403	    let i = 0
404	    let bias = initial_bias
405	    let output = an empty string indexed from 0
406	    consume all code points before the last delimiter (if there is one)
407	      and copy them to output, fail on any non-basic code point
408	    if more than zero code points were consumed then consume one more
409	    while the input is not exhausted do begin
410	      let oldi = i
411	      let w = 1
412	      for k = base to infinity in steps of base do begin
413	        consume a code point, fail on end-of-input
414	        let digit = the code point's value, fail if it has no value
415	        let i = i + digit * w, fail on overflow
416	        let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias
417	        if digit < t then break
418	        let w = w * (base - t), fail on overflow
419	      end
420	      let bias = adapt(i - oldi, length(output) + 1, oldi == 0)
421	      let n = n + i div (length(output) + 1), fail on overflow
422	      let i = i mod (length(output) + 1)
423	      {if n is a basic code point then fail}
424	      insert n into output at position i
425	      increment i
426	    end

427	    The statement enclosed in braces (checking whether n is a basic
428	    code point) may be omitted if initial_n exceeds all basic code
429	    points (which is true for AMC-ACE-Z), because n is never less than
430	    initial_n.
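
    The decoding procedure can be sketched in Python as follows.  This
    is an illustrative transcription of the pseudocode with the
    AMC-ACE-Z parameters, not the sample implementation of appendix C.
    Python integers do not overflow, so the "fail on overflow" checks
    are replaced here by a single check against 10FFFF; that
    substitution is an assumption of this sketch.  The brace-enclosed
    statement is omitted, as permitted above.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N, DELIMITER = 72, 0x80, "-"


def adapt(delta, numpoints, firsttime):
    delta //= DAMP if firsttime else 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def digit_value(ch):
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError("code point has no digit value: %r" % ch)


def decode(text):
    """Decode an AMC-ACE-Z string (without the IDNA signature)."""
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    last = text.rfind(DELIMITER)
    # The delimiter is consumed only if something precedes it.
    literal, rest = (text[:last], text[last + 1:]) if last > 0 else ("", text)
    if any(ord(c) >= 0x80 for c in literal):
        raise ValueError("non-basic code point in literal portion")
    output = [ord(c) for c in literal]
    pos = 0
    while pos < len(rest):
        oldi, w, k = i, 1, BASE
        while True:
            if pos >= len(rest):
                raise ValueError("unexpected end of input")
            digit = digit_value(rest[pos])
            pos += 1
            i += digit * w
            t = TMIN if k <= bias else TMAX if k - bias > TMAX else k - bias
            if digit < t:
                break
            w *= BASE - t
            k += BASE
        bias = adapt(i - oldi, len(output) + 1, oldi == 0)
        n += i // (len(output) + 1)
        i %= len(output) + 1
        if n > 0x10FFFF:
            raise ValueError("code point out of range")
        output.insert(i, n)
        i += 1
    return "".join(chr(cp) for cp in output)


# Example (O) from section 7.
print([hex(ord(c)) for c in decode("2-u9tlzr9756bt3uc0v")])
```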

432	    Because the decoder state can only advance monotonically, and there
433	    is only one representation of any delta, there is therefore only
434	    one encoded string that can represent a given sequence of integers.
435	    The only error conditions are invalid code points, unexpected
436	    end-of-input, overflow (attempts to compute values that exceed the
437	    maximum value of an integer variable), and basic code points encoded
438	    using deltas instead of appearing literally.  If the decoder fails
439	    on these errors as shown above, then it cannot produce the same
440	    output for two distinct inputs, and hence it need not re-encode its
441	    output to verify that it matches the input.

443	    The assignment of t, where t is clamped to the range tmin through
444	    tmax, does not handle the case where bias < k < bias + tmin, but
445	    that is impossible because of the way bias is computed and because
446	    of the constraints on the parameters.

448	    If the programming language does not provide overflow detection,
449	    the following technique can be used.  Suppose A, B, and C are
450	    representable nonnegative integers and C is nonzero.  Then A + B
451	    overflows if and only if B > maxint - A, and A + (B * C) overflows
452	    if and only if B > (maxint - A) div C.  See appendix C "AMC-ACE-Z
453	    sample implementation" for demonstrations of this technique.
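
    The two overflow tests can be illustrated in Python.  Python
    integers are unbounded, so maxint is passed explicitly here to
    stand in for the largest representable value; the function names
    are hypothetical.

```python
def add_overflows(a, b, maxint):
    """True if a + b would exceed maxint (requires 0 <= a <= maxint)."""
    return b > maxint - a


def muladd_overflows(a, b, c, maxint):
    """True if a + b * c would exceed maxint (requires c > 0)."""
    return b > (maxint - a) // c


MAXINT = 2**32 - 1
print(add_overflows(MAXINT - 1, 1, MAXINT))        # False
print(add_overflows(MAXINT, 1, MAXINT))            # True
print(muladd_overflows(5, 2**16, 2**16, MAXINT))   # True
```

    Neither test computes a value larger than maxint, which is the
    point of the technique.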

455	6.3 Encoding procedure

457	    let n = initial_n
458	    let delta = 0
459	    let bias = initial_bias
460	    let h = b = the number of basic code points in the input
461	    copy them to the output in order, followed by a delimiter if b > 0
462	    {if the input contains a non-basic code point < n then fail}
463	    while h < length(input) do begin
464	      let m = the minimum {non-basic} code point >= n in the input
465	      let delta = delta + (m - n) * (h + 1), fail on overflow
466	      let n = m
467	      for each code point m in the input (in order) do begin
468	        if m < n {or m is basic} then increment delta, fail on overflow
469	        if m == n then begin
470	          let q = delta
471	          for k = base to infinity in steps of base do begin
472	            let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias
473	            if q < t then break
474	            output the code point for digit t + ((q - t) mod (base - t))
475	            let q = (q - t) div (base - t)
476	          end
477	          output the code point for digit q
478	          let bias = adapt(delta, h + 1, h == b)
479	          let delta = 0
480	          increment h
481	        end
482	      end
483	      increment delta and n
484	    end

485	    The full statement enclosed in braces (checking whether the input
486	    contains a non-basic code point less than n) can be omitted if all
487	    code points less than initial_n are basic code points (which is true
488	    for AMC-ACE-Z if code points are unsigned).

490	    The brace-enclosed conditions "non-basic" and "or m is basic" can be
491	    omitted if initial_n exceeds all basic code points (which is true
492	    for AMC-ACE-Z), because the code point being tested is never less
493	    than initial_n.
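
    The encoding procedure can likewise be sketched in Python.  This
    is an illustrative transcription with the AMC-ACE-Z parameters,
    not the sample implementation of appendix C.  All brace-enclosed
    parts are omitted as permitted above, the overflow checks are
    omitted because Python integers are unbounded, and only lowercase
    digits are produced (no mixed-case annotation).

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N, DELIMITER = 72, 0x80, "-"
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta, numpoints, firsttime):
    delta //= DAMP if firsttime else 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def encode(text):
    """Encode a Unicode string as AMC-ACE-Z (without the signature)."""
    code_points = [ord(c) for c in text]
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    output = [chr(cp) for cp in code_points if cp < 0x80]
    h = b = len(output)          # number of code points handled so far
    if b > 0:
        output.append(DELIMITER)
    while h < len(code_points):
        m = min(cp for cp in code_points if cp >= n)
        delta += (m - n) * (h + 1)
        n = m
        for cp in code_points:
            if cp < n:
                delta += 1
            if cp == n:
                q, k = delta, BASE
                while True:
                    t = (TMIN if k <= bias
                         else TMAX if k - bias > TMAX else k - bias)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(output)


# Example (O) from section 7.
print(encode("\u3072\u3068\u3064\u5c4b\u6839\u306e\u4e0b2"))
```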

495	    The checks for overflow are necessary to avoid producing invalid
496	    output when the input contains very large values or is very long.
497	    Wider integer variables can handle more extreme inputs.  For
498	    AMC-ACE-Z, 26-bit unsigned integers are sufficient, because any
499	    string that required a 27-bit delta would have to exceed either
500	    the code point limit (0..10FFFF) or the label length limit (63
501	    characters).

503	    The increment of delta at the bottom of the outer loop cannot
504	    overflow because delta < length(input) before the increment, and
505	    length(input) is already assumed to be representable.  The increment
506	    of n could overflow, but only if h == length(input), in which case
507	    the procedure is finished anyway.

509	7. AMC-ACE-Z example strings

511	    In the AMC-ACE-Z encodings below, the IDNA signature prefix is not
512	    shown.  AMC-ACE-Z is abbreviated AMC-Z.  Backslashes show where line
513	    breaks have been inserted in strings too long for one line.

515	    The first several examples are all translations of the sentence "Why
516	    can't they just speak in <language>?" (courtesy of Michael Kaplan's
517	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
518	    been removed, as is often done in domain names.

520	    (A) Arabic (Egyptian):
521	        u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
522	        u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
523	        AMC-Z:  egbpdaj6bu4bxfgehfvwxn

525	    (B) Chinese (simplified):
526	        u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
527	        AMC-Z:  ihqwcrb4cv8a8dqg056pqjye

529	    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
530	        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
531	        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
532	        u+0065 u+0073 u+006B u+0079
533	        AMC-Z:  Proprostnemluvesky-uyb24dma41a

535	    (D) Hebrew:
536	        u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
537	        u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
538	        u+05D1 u+05E8 u+05D9 u+05EA
539	        AMC-Z:  4dbcagdahymbxekheh6e0a7fei0b

540	    (E) Hindi (Devanagari):
541	        u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
542	        u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
543	        u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
544	        u+0939 u+0948 u+0902
545	        AMC-Z:  i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

547	    (F) Japanese (kanji and hiragana):
548	        u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
549	        u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
550	        AMC-Z:  n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

552	    (G) Korean (Hangul syllables):
553	        u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
554	        u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
555	        u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
556	        AMC-Z:  989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5jps\
557	                d879ccm6fea98c

559	    (H) Russian (Cyrillic):
560	        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
561	        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
562	        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
563	        u+0438
564	        AMC-Z:  b1abfaaepdrnnbgefbaDotcwatmq2g4l

566	    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
567	        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
568	        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
569	        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
570	        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
571	        u+0061 u+00F1 u+006F u+006C
572	        AMC-Z:  PorqunopuedensimplementehablarenEspaol-fmd56a

574	    (J) Taiwanese:
575	        u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
576	        AMC-Z:  ihqwctvzc91f659drss3x8bo0yb

578	    (K) Vietnamese:
579	        T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
580	        <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
581	        U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
582	        u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
583	        u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
584	        U+0056 u+0069 u+1EC7 u+0074
585	        AMC-Z:  TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3年B組金八先生
        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
        AMC-Z:  3B-ww4c5e180e575a65lsy2b

    (M) 安室奈美恵-with-SUPER-MONKEYS
        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
        U+004F U+004E U+004B U+0045 U+0059 U+0053
        AMC-Z:  -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

    (N) Hello-Another-Way-それぞれの場所
        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
        AMC-Z:  Hello-Another-Way--fc4qua05auwb3674vfr0b
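
    Example (N) shows why the decoder looks for the last delimiter
    rather than the first: the literal portion itself contains hyphens.
    The following is a minimal sketch of that split, assuming ASCII
    input held in a C string.  The name basic_prefix_length is
    hypothetical and is not part of the sample implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of literal (basic) code */
/* points at the start of an encoded label, that is, the index of  */
/* the last delimiter '-' (0x2D), or 0 if there is no delimiter.   */
static size_t basic_prefix_length(const char *label)
{
  const char *last;

  assert(label != NULL);
  last = strrchr(label, '-');
  return last ? (size_t) (last - label) : 0;
}
```

    Applied to the AMC-Z string of example (N), the helper reports 18
    literal code points ("Hello-Another-Way-"); the final hyphen is
    the delimiter, and the deltas follow it.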

608	    (O) 2
609	        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
610	        AMC-Z:  2-u9tlzr9756bt3uc0v

612	    (P) MajiKoi5
613	        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
614	        u+308B u+0035 u+79D2 u+524D
615	        AMC-Z:  MajiKoi5-783gue6qz075azm5e

617	    (Q) de
618	        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
619	        AMC-Z:  de-jg4avhby1noc0d

621	    (R) 
622	        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
623	        AMC-Z:  d9juau41awczczp

625	    The last example is an ASCII string that breaks not only the
626	    existing rules for host name labels but also the rules proposed in
627	    [NAMEPREP03] for internationalized domain names.

629	    (S) -> $1.00 <-
630	        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
631	        u+003C u+002D
632	        AMC-Z:  -> $1.00 <--
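
    Example (S) contains only basic code points, so its encoding is
    the input followed by a single delimiter and no deltas.  A minimal
    sketch of just that special case follows, assuming a
    null-terminated ASCII input.  The function encode_all_basic is
    hypothetical; the sample implementation in appendix C handles this
    case as part of the general encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper covering only the all-basic case: copies the */
/* input and, if it is nonempty, appends one delimiter.  Returns 1  */
/* on success, or 0 if a byte is not basic (>= 0x80) or the result  */
/* (with its terminator) does not fit in out_size bytes.            */
static int encode_all_basic(const char *in, char *out, size_t out_size)
{
  size_t n, j;

  assert(in != NULL && out != NULL);
  n = strlen(in);
  if (n + 2 > out_size) return 0;

  for (j = 0;  j < n;  ++j) {
    if ((unsigned char) in[j] >= 0x80) return 0;
    out[j] = in[j];
  }

  if (n > 0) out[n++] = '-';   /* delimiter only if something was copied */
  out[n] = '\0';
  return 1;
}
```

    Unlike the sample encoder, this sketch null-terminates its output
    for convenience.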

8. Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority.  If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another.  Therefore AMC-ACE-Z is
    designed so that each Unicode string has a unique encoding.

    However, there can still be multiple Unicode representations of the
    "same" text, for various definitions of "same".  This problem is
    addressed to some extent by the Unicode standard under the topic of
    canonicalization, and this work is leveraged for domain names by
    "nameprep" [NAMEPREP03].

9. References

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)", 2001-Jun-16, draft-ietf-idn-idna-02.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
    http://www.trigeminal.com/samples/provincial.html.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

A. Author contact information

    Adam M. Costello
    University of California, Berkeley
    http://www.cs.berkeley.edu/~amc/

B. Mixed-case annotation

    In order to use AMC-ACE-Z to represent case-insensitive strings,
    higher layers need to case-fold the strings prior to AMC-ACE-Z
    encoding.  The encoded string can, however, use mixed case as an
    annotation telling how to convert the original folded string into a
    mixed-case string for display purposes.

    Basic code points are represented literally, and can therefore use
    mixed case directly.  Each non-basic code point is represented by
    a delta, which is represented by a sequence of basic code points,
    the last of which provides the annotation.  If it is uppercase,
    it is a suggestion to map the non-basic code point to uppercase
    (if possible); if it is lowercase, it is a suggestion to map the
    non-basic code point to lowercase (if possible).

    AMC-ACE-Z encoders and decoders are not required to support these
    annotations, and higher layers need not use them.
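
    The annotation works because a digit's numeric value does not
    depend on its case, which leaves the case of the final digit of
    each delta free to carry one bit.  The following minimal sketch
    illustrates this, assuming ASCII values as the sample
    implementation does.  The names digit_value and suggests_uppercase
    are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: numeric value of a digit.  Letters of either */
/* case map to 0..25 and '0'..'9' map to 26..35; anything else maps  */
/* to 36 (not a digit).  Unsigned wraparound makes each range test   */
/* a single comparison.                                              */
static unsigned digit_value(unsigned c)
{
  if (c - 48 < 10) return c - 22;   /* '0'..'9' */
  if (c - 65 < 26) return c - 65;   /* 'A'..'Z' */
  if (c - 97 < 26) return c - 97;   /* 'a'..'z' */
  return 36;
}

/* Hypothetical helper: annotation carried by the final digit of a  */
/* delta.  Nonzero suggests mapping the decoded code point to       */
/* uppercase; zero suggests lowercase (or the point is caseless).   */
static int suggests_uppercase(unsigned c)
{
  assert(c < 0x80);
  return c - 65 < 26;
}
```

    For instance, 'D' and 'd' both have the value 3, so a decoder
    that ignores annotations produces the same code points either
    way; only suggests_uppercase distinguishes them.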

C. AMC-ACE-Z sample implementation

/******************************************/
/* amc-ace-z.c 0.3.0 (2001-Aug-07-Tue)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-Z version 0.3.x. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,   /* Input is invalid.                         */
  amc_ace_big_output,  /* Output would exceed the space provided.   */
  amc_ace_overflow     /* Input requires wider integers to process. */
};

#if UINT_MAX >= (1 << 26) - 1
typedef unsigned int amc_ace_z_uint;
#else
typedef unsigned long amc_ace_z_uint;
#endif

enum amc_ace_status amc_ace_z_encode(
  amc_ace_z_uint input_length,
  const amc_ace_z_uint input[],
  const unsigned char uppercase_flags[],
  amc_ace_z_uint *output_length,
  char output[] );

    /* amc_ace_z_encode() converts Unicode to AMC-ACE-Z (without      */
    /* any signature).  The input must be represented as an array     */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as an     */
    /* array of ASCII code points.  The output string is *not*        */
    /* null-terminated; it will contain zeros if and only if the      */
    /* input contains zeros.  (Of course the caller can leave room    */
    /* for a terminator and add one if needed.)  The input_length is  */
    /* the number of code points in the input.  The output_length is  */
    /* an in/out argument: the caller must pass in the maximum number */
    /* of code points that may be output, and on successful return it */
    /* will contain the number of code points actually output.  The   */
    /* uppercase_flags array must hold input_length boolean values,   */
    /* where nonzero means the corresponding Unicode character should */
    /* be forced to uppercase after being decoded, and zero means it  */
    /* is caseless or should be forced to lowercase.  Alternatively,  */
    /* uppercase_flags may be a null pointer, which is equivalent to  */
    /* all zeros.  ASCII code points are always encoded literally,    */
    /* regardless of the corresponding flags.  The return value may   */
    /* be any of the amc_ace_status values defined above except       */
    /* amc_ace_bad_input; if not amc_ace_success, then output_length  */
    /* and output may contain garbage.                                */

enum amc_ace_status amc_ace_z_decode(
  amc_ace_z_uint input_length,
  const char input[],
  amc_ace_z_uint *output_length,
  amc_ace_z_uint output[],
  unsigned char uppercase_flags[] );

    /* amc_ace_z_decode() converts AMC-ACE-Z (without any signature) */
    /* to Unicode.  The input must be represented as an array of     */
    /* ASCII code points, and the output will be represented as      */
    /* an array of Unicode code points.  The input_length is the     */
    /* number of code points in the input.  The output_length is     */
    /* an in/out argument: the caller must pass in the maximum       */
    /* number of code points that may be output, and on successful   */
    /* return it will contain the actual number of code points       */
    /* output.  The uppercase_flags array must have room for at      */
    /* least output_length values, or it may be a null pointer if    */
    /* the case information is not needed.  A nonzero flag indicates */
    /* that the corresponding Unicode character should be forced to  */
    /* uppercase by the caller, while zero means it is caseless or   */
    /* should be forced to lowercase.  ASCII code points are output  */
    /* already in the proper case, but their flags will be set       */
    /* appropriately so that applying the flags would be harmless.   */
    /* The return value may be any of the amc_ace_status values      */
    /* defined above; if not amc_ace_success, then output_length,    */
    /* output, and uppercase_flags may contain garbage.  On success, */
    /* the decoder will never need to write an output_length greater */
    /* than input_length, because of how the encoding is defined.    */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/*** Bootstring parameters for AMC-ACE-Z ***/

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };

/* basic(cp) tests whether cp is a basic code point: */
#define basic(cp) ((amc_ace_z_uint)(cp) < 0x80)

/* delim(cp) tests whether cp is a delimiter: */
#define delim(cp) ((cp) == delimiter)

/* decode_digit(cp) returns the numeric value of a basic code */
/* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp does not represent a value.          */

static amc_ace_z_uint decode_digit(amc_ace_z_uint cp)
{
  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
          cp - 97 < 26 ? cp - 97 :  base;
}

/* encode_digit(d,flag) returns the basic code point whose value      */
/* (when used for representing integers) is d, which must be in the   */
/* range 0 to base-1.  The lowercase form is used unless flag is      */
/* nonzero, in which case the uppercase form is used.  The behavior   */
/* is undefined if flag is nonzero and digit d has no uppercase form. */

static char encode_digit(amc_ace_z_uint d, int flag)
{
  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
  /*  0..25 map to ASCII a..z or A..Z */
  /* 26..35 map to ASCII 0..9         */
}

/* flagged(bcp) tests whether a basic code point is flagged */
/* (uppercase).  The behavior is undefined if bcp is not a  */
/* basic code point.                                        */

#define flagged(bcp) ((amc_ace_z_uint)(bcp) - 65 < 26)

/*** Useful constants ***/

/* maxint is the maximum value of an amc_ace_z_uint variable: */
static const amc_ace_z_uint maxint = -1;

/* lobase and cutoff are used in the calculation of bias: */
enum { lobase = base - tmin, cutoff = lobase * tmax / 2 };

/*** Main encode function ***/

enum amc_ace_status amc_ace_z_encode(
  amc_ace_z_uint input_length,
  const amc_ace_z_uint input[],
  const unsigned char uppercase_flags[],
  amc_ace_z_uint *output_length,
  char output[] )
{
  amc_ace_z_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;

  /* Initialize the state: */

  n = initial_n;
  delta = out = 0;
  max_out = *output_length;
  bias = initial_bias;

  /* Handle the basic code points: */

  for (j = 0;  j < input_length;  ++j) {
    if (basic(input[j])) {
      if (max_out - out < 2) return amc_ace_big_output;
      output[out++] = input[j];
    }
    /* else if (input[j] < n) return amc_ace_bad_input; */
    /* (not needed for AMC-ACE-Z with unsigned code points) */
  }

  h = b = out;

  /* h is the number of code points that have been handled, b is the  */
  /* number of basic code points, and out is the number of characters */
  /* that have been output.                                           */

  if (b > 0) output[out++] = delimiter;

  /* Main encoding loop: */

  while (h < input_length) {
    /* All non-basic code points < n have been     */
    /* handled already.  Find the next larger one: */

    for (m = maxint, j = 0;  j < input_length;  ++j) {
      /* if (basic(input[j])) continue; */
      /* (not needed for AMC-ACE-Z) */
      if (input[j] >= n && input[j] < m) m = input[j];
    }

    /* Increase delta enough to advance the decoder's    */
    /* <n,i> state to <m,0>, but guard against overflow: */

    if (m - n > (maxint - delta) / (h + 1)) return amc_ace_overflow;
    delta += (m - n) * (h + 1);
    n = m;

    for (j = 0;  j < input_length;  ++j) {
      #if 0
      if (input[j] < n || basic(input[j])) {
        if (++delta == 0) return amc_ace_overflow;
      }
      #endif
      /* AMC-ACE-Z can use this simplified version instead: */
      if (input[j] < n && ++delta == 0) return amc_ace_overflow;

      if (input[j] == n) {
        /* Represent delta as a generalized variable-length integer: */

        for (q = delta, k = base;  ;  k += base) {
          if (out >= max_out) return amc_ace_big_output;
          t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
          if (q < t) break;
          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
          q = (q - t) / (base - t);
        }

        output[out++] =
          encode_digit(q, uppercase_flags && uppercase_flags[j]);

        /* Adapt the bias: */
        delta = h == b ? delta / damp : delta >> 1;
        delta += delta / (h + 1);
        for (bias = 0;  delta > cutoff;  bias += base) delta /= lobase;
        bias += (lobase + 1) * delta / (delta + skew);

        delta = 0;
        ++h;
      }
    }

    ++delta, ++n;
  }

  *output_length = out;
  return amc_ace_success;
}

/*** Main decode function ***/

enum amc_ace_status amc_ace_z_decode(
  amc_ace_z_uint input_length,
  const char input[],
  amc_ace_z_uint *output_length,
  amc_ace_z_uint output[],
  unsigned char uppercase_flags[] )
{
  amc_ace_z_uint n, out, i, max_out, bias, b, j,
                 in, oldi, w, k, delta, digit, t;

  /* Initialize the state: */

  n = initial_n;
  out = i = 0;
  max_out = *output_length;
  bias = initial_bias;

  /* Handle the basic code points:  Let b be the number of input code */
  /* points before the last delimiter, or 0 if there is none, then    */
  /* copy the first b code points to the output.                      */

  for (b = j = 0;  j < input_length;  ++j) if (delim(input[j])) b = j;
  if (b > max_out) return amc_ace_big_output;

  for (j = 0;  j < b;  ++j) {
    if (uppercase_flags) uppercase_flags[out] = flagged(input[j]);
    if (!basic(input[j])) return amc_ace_bad_input;
    output[out++] = input[j];
  }

  /* Main decoding loop:  Start just after the last delimiter if any  */
  /* basic code points were copied; start at the beginning otherwise. */

  for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {

    /* in is the index of the next character to be consumed, and */
    /* out is the number of code points in the output array.     */

    /* Decode a generalized variable-length integer into delta,  */
    /* which gets added to i.  The overflow checking is easier   */
    /* if we increase i as we go, then subtract off its starting */
    /* value at the end to obtain delta.                         */

    for (oldi = i, w = 1, k = base;  ;  k += base) {
      if (in >= input_length) return amc_ace_bad_input;
      digit = decode_digit(input[in++]);
      if (digit >= base) return amc_ace_bad_input;
      if (digit > (maxint - i) / w) return amc_ace_overflow;
      i += digit * w;
      t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
      if (digit < t) break;
      if (w > maxint / (base - t)) return amc_ace_overflow;
      w *= (base - t);
    }

    /* Adapt the bias: */
    delta = oldi == 0 ? i / damp : (i - oldi) >> 1;
    delta += delta / (out + 1);
    for (bias = 0;  delta > cutoff;  bias += base) delta /= lobase;
    bias += (lobase + 1) * delta / (delta + skew);

    /* i was supposed to wrap around from out+1 to 0,   */
    /* incrementing n each time, so we'll fix that now: */

    if (i / (out + 1) > maxint - n) return amc_ace_overflow;
    n += i / (out + 1);
    i %= (out + 1);

    /* Insert n at position i of the output: */

    /* not needed for AMC-ACE-Z: */
    /* if (decode_digit(n) <= base) return amc_ace_bad_input; */
    if (out >= max_out) return amc_ace_big_output;

    if (uppercase_flags) {
      memmove(uppercase_flags + i + 1, uppercase_flags + i, out - i);
      /* Case of last character determines uppercase flag: */
      uppercase_flags[i] = flagged(input[in - 1]);
    }

    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
    output[i++] = n;
  }

  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_length = 256
};

static void usage(char **argv)
{
  fprintf(stderr,
    "\n"
    "%s -e reads code points and writes an AMC-ACE-Z string.\n"
    "%s -d reads an AMC-ACE-Z string and writes code points.\n"
    "\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "The AMC-ACE-Z strings do not include any signatures.\n"
    "Although the specification allows AMC-ACE-Z strings to contain\n"
    "any characters from the ASCII repertoire, this test code\n"
    "supports only the printable characters, and requires the\n"
    "AMC-ACE-Z string to be followed by a newline.\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char overflow[] = "arithmetic overflow\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert printable */
/* characters between ASCII and the native charset:  */

static const char print_ascii[] =
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  " !\"#$%&'()*+,-./"
  "0123456789:;<=>?"
  "@ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ[\\]^_"
  "`abcdefghijklmno"
  "pqrstuvwxyz{|}~\n";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  /* Lengths use amc_ace_z_uint so that &output_length has the */
  /* pointer type the encode and decode functions expect.      */
  amc_ace_z_uint input_length, output_length, j;
  unsigned char uppercase_flags[unicode_max_length];

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    amc_ace_z_uint input[unicode_max_length];
    unsigned long codept;
    char output[ace_max_length+1], uplus[3];
    int c;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (amc_ace_z_uint)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_length = ace_max_length;
    status = amc_ace_z_encode(input_length, input, uppercase_flags,
                              &output_length, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    if (status == amc_ace_overflow) fail(overflow);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (j = 0;  j < output_length;  ++j) {
      c = output[j];
      assert(c >= 0 && c <= 127);
      if (print_ascii[c] == 0) fail(invalid_input);
      output[j] = print_ascii[c];
    }

    output[j] = 0;
    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_length+2], *p, *pp;
    amc_ace_z_uint output[unicode_max_length];

    /* Read the AMC-ACE-Z input string and convert to ASCII: */

    fgets(input, ace_max_length+2, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input) - 1;
    if (input[input_length] != '\n') fail(too_big);
    input[input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(print_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - print_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_z_decode(input_length, input, &output_length,
                              output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    if (status == amc_ace_overflow) fail(overflow);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (j = 0;  j < output_length;  ++j) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[j] ? "U" : "u",
                 (unsigned long) output[j] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
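
    The generalized variable-length integer step of the encoder can
    be exercised in isolation.  The following minimal sketch repeats
    the threshold rule from the listing above for a single delta and a
    fixed bias; it performs no bias adaptation and writes no case
    annotation.  The names vl_digit and vl_encode are hypothetical and
    carry a vl_ prefix only to avoid clashing with the sample code.

```c
#include <assert.h>
#include <stddef.h>

enum { vl_base = 36, vl_tmin = 1, vl_tmax = 26 };

/* Hypothetical helper: digit value 0..35 to lowercase ASCII. */
/* 0..25 map to a..z and 26..35 map to 0..9.                  */
static char vl_digit(unsigned long d)
{
  assert(d < vl_base);
  return (char) (d + 22 + 75 * (d < 26));
}

/* Hypothetical helper: writes delta as a generalized variable-length */
/* integer for the given bias, followed by a null terminator.         */
/* Returns the number of digits written, or 0 if they do not fit in   */
/* out_size bytes.                                                    */
static size_t vl_encode(unsigned long delta, unsigned long bias,
                        char *out, size_t out_size)
{
  unsigned long q = delta, k, t;
  size_t n = 0;

  assert(out != NULL);

  for (k = vl_base;  ;  k += vl_base) {
    /* Need room for at least one more digit plus the terminator: */
    if (n + 2 > out_size) return 0;
    /* Threshold for this position, clamped to tmin..tmax: */
    t = k <= bias ? vl_tmin : k - bias >= vl_tmax ? vl_tmax : k - bias;
    if (q < t) break;
    out[n++] = vl_digit(t + (q - t) % (vl_base - t));
    q = (q - t) / (vl_base - t);
  }

  out[n++] = vl_digit(q);   /* final digit is below the threshold */
  out[n] = '\0';
  return n;
}
```

    With the initial bias of 72, a delta of 124 (the first delta for
    an input consisting of the single code point u+00FC, since
    0xFC - 0x80 = 124) comes out as the three digits "tda": the
    thresholds at the first two positions are tmin, and the third
    digit falls below tmax and ends the integer.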

                   INTERNET-DRAFT expires 2002-Feb-16