idnits 2.17.1

draft-ietf-idn-amc-ace-r-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 23) being 59 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([UNICODE], [RFC1123],
     [AMCACEO00], [RFC952], [IDN]), which it shouldn't.  Please replace those
     with straight textual mentions of the documents in question.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1461

  -- Looks like a reference, but probably isn't: '6' on line 1350

  == Missing Reference: '--out' is mentioned on line 1320, but not defined

  -- Looks like a reference, but probably isn't: '1' on line 1515

  == Missing Reference: '-1' is mentioned on line 1396, but not defined

  -- Looks like a reference, but probably isn't: '2' on line 1462

  == Unused Reference: 'AltDUDE00' is defined on line 997, but no explicit
     reference was found in the text

  == Unused Reference: 'LACE01' is defined on line 1016, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Normative reference to a draft: ref. 'RACE03' 

  -- Possible downref: Normative reference to a draft: ref. 'AltDUDE00' 

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEM00' 

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEO00' 

  -- Possible downref: Normative reference to a draft: ref. 'BRACE00' 

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-01

  -- Possible downref: Normative reference to a draft: ref. 'DUDE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Normative reference to a draft: ref. 'LACE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- No information found for draft-ietf-idn-sace- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'SACE' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5' 

  -- No information found for draft-ietf-idn-utf6- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF6' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTFCONV'


     Summary: 5 errors (**), 0 flaws (~~), 8 warnings (==), 26 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-amc-ace-r-00.txt                              2001-Mar-27
3	Expires 2001-Sep-27

5	                         AMC-ACE-R version 0.0.0

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/amc-ace-r

34	Abstract

36	    AMC-ACE-R is a reversible map from a sequence of Unicode [UNICODE]
37	    code points to a sequence of letters (A-Z, a-z), digits (0-9),
38	    and hyphen-minus (-), henceforth called LDH characters.  Such a
39	    map might be useful for an "ASCII-Compatible Encoding" (ACE) for
40	    internationalized domain names [IDN], because host name labels are
41	    currently restricted to LDH characters by [RFC952] and [RFC1123].

43	    AMC-ACE-R is similar to AMC-ACE-O [AMCACEO00] but is simpler and not
44	    quite as efficient.

46	    Besides domain names, there might also be other contexts where it is
47	    useful to transform Unicode characters into "safe" (delimiter-free)
48	    ASCII characters.  (If other contexts consider hyphen-minus to be
49	    unsafe, a different character could be used to play its role, like
50	    underscore.)
51	Contents

53	    Features
54	    Name
55	    Overview
56	    Base-32 characters
57	    Encoding and decoding algorithms
58	    Signature
59	    Case sensitivity models
60	    Comparison with RACE, BRACE, LACE, AltDUDE, AMC-ACE-M, AMC-ACE-O
61	    Example strings
62	    Security considerations
63	    Credits
64	    References
65	    Author
66	    Example implementation

68	Features

70	    Uniqueness:  Every Unicode string maps to at most one LDH string.

72	    Completeness:  Every Unicode string maps to an LDH string.
73	    Restrictions on which Unicode strings are allowed, and on length,
74	    may be imposed by higher layers.

76	    Efficient encoding:  The ratio of encoded size to original size is
77	    small for all Unicode strings.  This is important in the context
78	    of domain names because [RFC1034] restricts the length of a domain
79	    label to 63 characters.

81	    Simplicity:  The encoding and decoding algorithms are reasonably
82	    simple to implement.  The goals of efficiency and simplicity are at
83	    odds; AMC-ACE-R aims at a good balance between them.

85	    Case-preservation:  If the Unicode string has been case-folded prior
86	    to encoding, it is possible to record the case information in the
87	    case of the letters in the encoding, allowing a mixed-case Unicode
88	    string to be recovered if desired, but a case-insensitive comparison
89	    of two encoded strings is equivalent to a case-insensitive
90	    comparison of the Unicode strings.  This feature is optional; see
91	    section "Case sensitivity models".

93	    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
94	    in the Unicode string are represented as themselves in the label.
95	    This comes for free because it is usually the most efficient encoding
96	    anyway.

98	Name

100	    AMC-ACE-R is a working name that should be changed if it is adopted.
101	    (The R merely indicates that it is the eighteenth ACE devised by
102	    this author.  BRACE was the third.  D-L, N, P, and Q were not worth
103	    releasing.)  Rather than waste good names on experimental proposals,
104	    let's wait until one proposal is chosen, then assign it a good name.
105	    Suggestions (assuming the primary use is in domain names):

107	        UniHost
108	        UTF-D ("D" for "domain names")
109	        NUDE (Normal Unicode Domain Encoding)

111	    A name that makes no reference to domain names:

113	        UTF-37 (there are 37 characters in the output repertoire)

115	Overview

117	    AMC-ACE-R maps a sequence of Unicode code points to a sequence of
118	    LDH characters.  The encoder input and decoder output are arrays of
119	    code points, not characters, bytes, or code units (in particular,
120	    not UTF-16 surrogates).  Formally, the encoder output and decoder
121	    input are character strings, not code points, code units, or bytes,
122	    although implementations will of course need to represent the
123	    characters somehow, usually as bytes or other code units.

125	    Each Unicode code point is represented by an integral number of
126	    characters in the encoded string.  There is no intermediate bit
127	    string or octet string.

129	    The encoded string alternates between two modes: literal mode and
130	    base-32 mode.  Unicode code points representing LDH characters
131	    are encoded as those LDH characters, except that hyphen-minus is
132	    doubled.  Other Unicode code points are encoded using base-32, in
133	    which each character of the encoded string represents five bits
134	    (a "quintet").  A non-paired hyphen-minus in the encoded string
135	    indicates a mode change.

137	    In base-32 mode a variable-length code sequence of one to five
138	    quintets represents a delta, which is added to a reference point to
139	    yield a Unicode code point.  There are five reference points, one
140	    for each code length, three of which continually change during the
141	    encoding/decoding process.
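
As a rough illustration of the variable-length code described above, the following Python sketch packs a delta into one to five quintets and unpacks it again.  The helper names to_quintets and from_quintets are illustrative only and do not appear in this specification; the layout (a leading 1 bit on every quintet except the last) follows the pseudocode in section "Encoding and decoding algorithms".

```python
def to_quintets(delta, k):
    """Split delta into k nybbles and flag every quintet but the last."""
    if not (1 <= k <= 5 and 0 <= delta < 1 << (4 * k)):
        raise ValueError("delta does not fit in k nybbles")
    quintets = []
    for j in range(k - 1, -1, -1):              # most significant nybble first
        nybble = (delta >> (4 * j)) & 0xF
        quintets.append(nybble | (0x10 if j else 0x00))
    return quintets


def from_quintets(quintets):
    """Drop the leading flag bit of each quintet and rejoin the nybbles."""
    delta = 0
    for q in quintets:
        delta = (delta << 4) | (q & 0xF)
    return delta


# A delta of 0xE9 needs two nybbles: 0xE (flagged, 0x1E) then 0x9 (last).
print(to_quintets(0xE9, 2))          # [30, 9]
print(hex(from_quintets([30, 9])))   # 0xe9
```

Because only the final quintet of a code has a leading 0 bit, a decoder can find the end of each code without any separate length field.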

143	Base-32 characters

145	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
146	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
147	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
148	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
149	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
150	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
151	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
152	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
153	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
154	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
155	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
156	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
157	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
158	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
159	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
160	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

162	    The digits "0" and "1" and the letters "o" and "l" are not used, to
163	    avoid transcription errors.

165	    All decoders must recognize both the uppercase and lowercase
166	    forms of the base-32 characters.  The case may or may not convey
167	    information, as described in section "Case sensitivity models".
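
The table above can be captured in a single string, as in the Python sketch below.  The names BASE32, quintet_to_char, and char_to_quintet are assumptions made for this illustration.  The sketch also checks a property that the case-preserving model relies on: quintets 0 through 15, the only ones that can end a code, all map to letters.

```python
# Position in the string is the quintet value; "l", "o", "0", "1" are absent.
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def quintet_to_char(q):
    """Return the lowercase base-32 character for a quintet in 0..31."""
    if not 0 <= q < 32:
        raise ValueError("quintet out of range")
    return BASE32[q]


def char_to_quintet(ch):
    """Accept either case, as all decoders must."""
    q = BASE32.find(ch.lower()) if len(ch) == 1 else -1
    if q < 0:
        raise ValueError("not a base-32 character: %r" % ch)
    return q


print(char_to_quintet("s"), char_to_quintet("S"))       # 16 16
print(all(BASE32[q].isalpha() for q in range(16)))      # True
```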

169	Encoding and decoding algorithms

171	    The algorithms are given below as commented pseudocode.  All
172	    ordering of bits and quintets is big-endian (most significant
173	    first).  The >> and << operators used below mean bit shift, as in
174	    C.  For >> there is no question of logical versus arithmetic shift
175	    because AMC-ACE-R never needs to right-shift a negative value.
176	    As in C, "continue" means terminate the current iteration of the
177	    innermost loop, "break" means terminate the innermost loop, and
178	    "return" means terminate the current function.

180	    shared variables:  # All others are local to each function.
181	      array refpoint[1..5]  # refpoint[k] is for sequences of length k

183	    function update_refpoints(history[first..latest]):
184	      # Adapt refpoint[1..3] based on the code points seen so far.
185	      for k = 1 to 3 do begin
186	        let b = k << 2
187	        if latest - first == 1
188	        then let refpoint[k] = (history[latest] >> b) << b
189	        else for i = latest - 1 down to first do begin
190	          if history[i] represents an LDH character then continue
191	          if (refpoint[k] XOR history[i]) >> b == 0 then break
192	          if (history[latest] XOR history[i]) >> b == 0 then begin
193	            let refpoint[k] = (history[latest] >> b) << b
194	            return
195	          end
196	        end
197	      end
198	    function encode(input[first..last]):
199	      let refpoint[1..5] = 0x60, 0, 0, 0, 0x10000
200	      let output = the empty string
201	      let literal = false
202	      for i = first to last do begin
203	        if input[i] == 0x2D then append two hyphen-minuses to output
204	        else if input[i] represents an LDH character then begin
205	          if not literal then append hyphen-minus to output
206	          let literal = true
207	          append the character represented by input[i] to output
208	        end
209	        else begin
210	          if literal then append hyphen-minus to output
211	          let literal = false
212	          for k = 1 to infinity do begin
213	            let delta = input[i] - refpoint[k]
214	            if delta >= 0 and delta >> (4*k) == 0 then break
215	          end
216	          extract the k least significant nybbles of delta
217	          prepend 0 to the last nybble and 1 to the rest
218	          output base-32 characters corresponding to the quintets
219	          update_refpoints(input[first..i])
220	        end
221	      end
222	      return output

224	    function decode(input string):
225	      let refpoint[1..5] = 0x60, 0, 0, 0, 0x10000
226	      let output = the empty array
227	      let literal = false
228	      while not end-of-input do begin
229	        if the next character is hyphen-minus then begin
230	          consume the character
231	          if the next character is also hyphen-minus
232	          then consume it and append 0x2D to output
233	          else toggle literal
234	        end
235	        else if literal then consume the character and append its code point to output
236	        else begin
237	          consume characters and convert them to quintets until
238	            encountering a quintet beginning with 0
239	          fail upon encountering a non-base-32 character or end-of-input
240	          let k = the number of quintets obtained
241	          strip the first bit of each quintet
242	          concatenate the resulting nybbles to form delta
243	          append refpoint[k] + delta to output
244	          update_refpoints(output)
245	        end
246	      end
247	      let check = encode(output)
248	      if check != the input string then fail
249	      return output
250	    The comparison at the end of decode() must be case-insensitive
251	    if ACEs are always compared case-insensitively (which is true of
252	    domain names), case-sensitive otherwise (see also section "Case
253	    sensitivity models").  This check is necessary to guarantee the
254	    uniqueness property, that there cannot be two distinct encoded
255	    strings representing the same sequence of integers.  This check also
256	    frees the decoder from having to check for overflow while decoding
257	    the base-32 characters.
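
The pseudocode above can be transcribed fairly directly into Python.  The following is a minimal sketch, not the example implementation that accompanies this document: the helper is_ldh, the 1-based refpoint list, and the guard against codes longer than five quintets are choices made for this illustration, and the final comparison is case-insensitive on the assumption that the strings are used as domain names.

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"
INITIAL_REFPOINTS = [None, 0x60, 0, 0, 0, 0x10000]   # index 0 unused (1-based)


def is_ldh(cp):
    return cp < 0x80 and (chr(cp).isalnum() or cp == 0x2D)


def update_refpoints(refpoint, history):
    latest = len(history) - 1                # history[0..latest]
    for k in (1, 2, 3):
        b = k << 2
        if latest == 1:                      # "latest - first == 1"
            refpoint[k] = (history[latest] >> b) << b
            continue
        for i in range(latest - 1, -1, -1):
            if is_ldh(history[i]):
                continue
            if (refpoint[k] ^ history[i]) >> b == 0:
                break
            if (history[latest] ^ history[i]) >> b == 0:
                refpoint[k] = (history[latest] >> b) << b
                return


def encode(cps):
    refpoint = list(INITIAL_REFPOINTS)
    out, literal = [], False
    for i, cp in enumerate(cps):
        if cp == 0x2D:
            out.append("--")                 # hyphen-minus is doubled
        elif is_ldh(cp):
            if not literal:
                out.append("-")              # switch to literal mode
            literal = True
            out.append(chr(cp))
        else:
            if literal:
                out.append("-")              # switch to base-32 mode
            literal = False
            for k in range(1, 6):
                delta = cp - refpoint[k]
                if delta >= 0 and delta >> (4 * k) == 0:
                    break
            else:
                raise ValueError("code point out of range")
            for j in range(k - 1, -1, -1):
                nybble = (delta >> (4 * j)) & 0xF
                out.append(BASE32[nybble | (0x10 if j else 0)])
            update_refpoints(refpoint, cps[: i + 1])
    return "".join(out)


def decode(text):
    refpoint = list(INITIAL_REFPOINTS)
    out, literal, pos = [], False, 0
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == "-":
            if text[pos:pos + 1] == "-":
                pos += 1
                out.append(0x2D)
            else:
                literal = not literal
        elif literal:
            out.append(ord(ch))
        else:
            delta, k = 0, 0
            while True:
                q = BASE32.find(ch.lower())
                if q < 0 or k == 5:          # k == 5: more than five quintets
                    raise ValueError("invalid base-32 sequence")
                delta, k = (delta << 4) | (q & 0xF), k + 1
                if q < 16:                   # leading 0 bit ends the code
                    break
                if pos == len(text):
                    raise ValueError("truncated input")
                ch = text[pos]
                pos += 1
            out.append(refpoint[k] + delta)
            update_refpoints(refpoint, out)
    if encode(out).lower() != text.lower():  # uniqueness check
        raise ValueError("non-canonical encoding")
    return out


print(encode([ord(c) for c in "büc"]))       # -b-9n-c
```

In this sketch the string "s8j" decodes to the single code point U+00E9, but re-encoding that code point yields "8j", so decode("s8j") is rejected by the final check.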

259	Signature

261	    The issue of how to distinguish ACE strings from unencoded strings
262	    is largely orthogonal to the encoding scheme itself, and is
263	    therefore not specified here.  In the context of domain name labels,
264	    a standard prefix and/or suffix (chosen to be unlikely to occur
265	    naturally) would presumably be attached to ACE labels.

267	    In order to use AMC-ACE-R in domain names, the choice of signature
268	    must be mindful of the requirement in [RFC952] that labels never
269	    begin or end with hyphen-minus.  Since the raw encoded string
270	    sometimes begins with a hyphen-minus, the signature must include
271	    a prefix that does not begin with hyphen-minus.  If the Unicode
272	    strings are forbidden from ending with hyphen-minus (which seems
273	    prudent anyway), then the raw encoded string will never end with
274	    hyphen-minus; otherwise, the signature must include a suffix as well
275	    as a prefix.

277	    It appears that "---" is extremely rare in domain names; among the
278	    four-character prefixes of all the second-level domains under .com,
279	    .net, and .org, "---" never appears at all.  Therefore, perhaps the
280	    signature should be of the form "?---", where ? could be "u" for
281	    Unicode, or "i" for internationalized, or "a" for ACE, or maybe "q"
282	    or "z" because they are rare.
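
Since no signature is specified here, the Python sketch below is purely hypothetical: it assumes the prefix "u---" (one of the forms suggested above) and no suffix, and shows how a higher layer might attach and recognize it while respecting the hyphen-minus rule of [RFC952] and the 63-character label limit of [RFC1034].  The names PREFIX, to_label, and from_label are inventions of this example.

```python
PREFIX = "u---"     # hypothetical signature; this document does not fix one
MAX_LABEL = 63      # label length limit from [RFC1034]


def to_label(raw):
    """Attach the assumed prefix to a raw encoded string."""
    label = PREFIX + raw
    if label.endswith("-"):
        # Happens only if the Unicode string ended with hyphen-minus;
        # a suffix would then be needed as well.
        raise ValueError("label would end with hyphen-minus")
    if len(label) > MAX_LABEL:
        raise ValueError("label longer than 63 characters")
    return label


def from_label(label):
    """Return the raw encoded string, or None if the label is not an ACE."""
    if label[:len(PREFIX)].lower() != PREFIX:
        return None
    return label[len(PREFIX):]


print(to_label("-b-9n-c"))        # u----b-9n-c
print(from_label("example"))      # None
```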

284	Case sensitivity models

286	    The higher layer must choose one of the following four models.

288	    Models suitable for domain names:

290	      * Case-insensitive:  Before a string is encoded, all its non-LDH
291	        characters must be case-folded so that any strings differing
292	        only in case become the same string (for example, strings could
293	        be forced to lowercase).  Folding LDH characters is optional.
294	        The case of base-32 characters and literal-mode characters is
295	        arbitrary and not significant.  Comparisons between encoded
296	        strings must be case-insensitive.  The original case of non-LDH
297	        characters cannot be recovered from the encoded string.

299	      * Case-preserving:  The case of the Unicode characters is not
300	        considered significant, but it can be preserved and recovered,
301	        just like in non-internationalized host names.  Before a string
302	        is encoded, all its non-LDH characters must be case-folded
303	        as in the previous model.  LDH characters are naturally able
304	        to retain their case attributes because they are encoded
305	        literally.  The case attribute of a non-LDH character is
306	        recorded in the last of the base-32 characters that represent
307	        it, which is guaranteed to be a letter rather than a digit.
308	        If the base-32 character is uppercase, it means the Unicode
309	        character is caseless or should be forced to uppercase after
310	        being decoded (which is a no-op if the case folding already
311	        forces to uppercase).  If the base-32 character is lowercase,
312	        it means the Unicode character is caseless or should be forced
313	        to lowercase after being decoded (which is a no-op if the case
314	        folding already forces to lowercase).  The case of the other
315	        base-32 characters in a multi-quintet encoding is arbitrary
316	        and not significant.  Only uppercase and lowercase attributes
317	        can be recorded, not titlecase.  Comparisons between encoded
318	        strings must be case-insensitive, and are equivalent to
319	        case-insensitive comparisons between the Unicode strings.  The
320	        intended mixed-case Unicode string can be recovered as long as
321	        the encoded characters are unaltered, but altering the case of
322	        the encoded characters is not harmful--it merely alters the case
323	        of the Unicode characters, and such a change is not considered
324	        significant.

326	        In this model, the input to the encoder and the output of the
327	        decoder can be the unfolded Unicode string (in which case the
328	        encoder and decoder are responsible for performing the case
329	        folding and recovery), or can be the folded Unicode string
330	        accompanied by separate case information (in which case the
331	        higher layer is responsible for performing the case folding and
332	        recovery).  Whichever layer performs the case recovery must
333	        first verify that the Unicode string is properly folded, to
334	        guarantee the uniqueness of the encoding.

336	        It should not be very difficult to extend the nameprep algorithm
337	        [NAMEPREP03] to remember case information; it could be done by
338	        adding flags to the mapping tables.

340	    The case-insensitive and case-preserving models are interoperable.
341	    If a domain name passes from a case-preserving entity to a
342	    case-insensitive entity, the case information may be lost, but the
343	    domain name will still be equivalent.  This phenomenon already
344	    occurs with non-internationalized domain names.
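
To make the case-preserving model concrete, the Python sketch below records and reads back a case attribute in the last base-32 character of one code.  The function names are illustrative, and lowercasing the other characters of the code is merely one permitted choice, since their case is arbitrary in this model.

```python
def mark_case(code, upper):
    """code holds the base-32 characters that encode one non-LDH code point.

    The last character is always a letter ("a".."r"), so it can carry case.
    """
    body, last = code[:-1].lower(), code[-1]
    return body + (last.upper() if upper else last.lower())


def wants_upper(code):
    """True if the decoded character should be forced to uppercase."""
    return code[-1].isupper()


marked = mark_case("8j", upper=True)
print(marked)                            # 8J
print(wants_upper(marked))               # True
print(marked.lower() == "8j")            # True
```

The last line shows why the two domain-name models interoperate: a case-insensitive comparison ignores the recorded attribute.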

346	    Models unsuitable for domain names, but possibly useful in other
347	    contexts:

349	      * Case-sensitive:  Unicode strings may contain both uppercase and
350	        lowercase characters, which are not folded.  Base-32 characters
351	        must be lowercase.  Comparisons between encoded strings must be
352	        case-sensitive.

354	      * Case-flexible:  Like case-preserving, except that the choice
355	        of whether the case of the Unicode characters is considered
356	        significant is deferred.  Therefore, base-32 characters must
357	        be lowercase, except for those used to indicate uppercase
358	        Unicode characters.  Comparisons between encoded strings may be
359	        case-sensitive or case-insensitive, and such comparisons are
360	        equivalent to the corresponding comparisons between the Unicode
361	        strings.

363	Comparison with RACE, BRACE, LACE, AltDUDE, AMC-ACE-M, AMC-ACE-O

365	    In this section we compare AMC-ACE-R and six other ACEs: RACE
366	    [RACE03], BRACE [BRACE00], LACE [LACE01], AltDUDE [AltDUDE00],
367	    AMC-ACE-M [AMCACEM00], and AMC-ACE-O [AMCACEO00].  We do not include
368	    SACE [SACE], UTF-5 [UTF5], UTF-6 [UTF6], or DUDE [DUDE01] in the
369	    comparison, because SACE appears obviously too complex, UTF-5
370	    appears obviously too inefficient, UTF-6 can never be more efficient
371	    than its similarly simple successor DUDE, and DUDE is almost
372	    identical to AltDUDE.

374	    Complexity is hard to measure.  This author would subjectively
375	    describe the complexity of the algorithms as:

377	          LACE, AltDUDE: simple but not trivial
378	        RACE, AMC-ACE-R: less simple
379	              AMC-ACE-O: moderate
380	              AMC-ACE-M: fairly complex
381	                  BRACE: complex

383	    AMC-ACE-R is similar to AMC-ACE-O, but is considerably simpler
384	    because it does not calculate the most useful reference points
385	    beforehand, encode them, and decode them.  Instead, it uses a simple
386	    heuristic to set the reference points adaptively based on the code
387	    points that have been seen so far.

389	    Implementations can be long and straightforward, or short and
390	    subtle, but for whatever it's worth, here are the code sizes of
391	    four of the algorithms that were implemented by this author in
392	    similar styles:

394	      AltDUDE: 130 lines @@@@@@@@@@@@@@@@@@@
395	    AMC-ACE-R: 171 lines @@@@@@@@@@@@@@@@@@@@@@@@
396	    AMC-ACE-O: 232 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
397	    AMC-ACE-M: 324 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

399	    (Not counted in the code sizes are blank lines, lines containing
400	    only comments or only a single brace, and wrapper code for testing.
401	    BRACE was also implemented by this author, but it was a less general
402	    implementation, with bounded input and output sizes.)

404	    If a different implementation style were to alter the code sizes
405	    additively, or multiplicatively, or a combination thereof, AMC-ACE-O
406	    would remain about halfway between AltDUDE and AMC-ACE-M, and
407	    AMC-ACE-R would remain closer to AltDUDE than to AMC-ACE-O.

409	    Case preservation support:

411	        AltDUDE, AMC-ACE-M/O/R:  all characters
412	                         BRACE:  only the letters A-Z, a-z
413	                    RACE, LACE:  none

415	    RACE, BRACE, and LACE transform the Unicode string to an
416	    intermediate bit string, then into a base-32 string, so there is no
417	    particular alignment between the base-32 characters and the Unicode
418	    characters.  AltDUDE and AMC-ACE-M/O/R do not have this intermediate
419	    stage, and enforce alignment between the base-32 characters and the
420	    Unicode characters, which facilitates the case preservation.

422	    The relative efficiency of the various algorithms is suggested
423	    by the sizes of the encodings in section "Example strings".  The
424	    lengths of examples A-K (which are the same sentence translated into
425	    languages from a variety of language families using a variety
426	    of scripts) are shown graphically below for each ACE, scaled by a
427	    factor of 0.4 so they fit on one line, and sorted so they look like
428	    a cumulative distribution.  The fictional "Super-ACE" encodes its
429	    input using whichever of the other seven ACEs is shortest for that
430	    input.

432	    RACE:
433	      A Arabic      29 @@@@@@@@@@@@
434	      B Chinese     31 @@@@@@@@@@@@
435	      J Taiwanese   31 @@@@@@@@@@@@
436	      D Hebrew      37 @@@@@@@@@@@@@@@
437	      H Russian     47 @@@@@@@@@@@@@@@@@@@
438	      E Hindi       50 @@@@@@@@@@@@@@@@@@@@
439	      F Japanese    60 @@@@@@@@@@@@@@@@@@@@@@@@
440	      I Spanish     66 @@@@@@@@@@@@@@@@@@@@@@@@@@
441	      C Czech       68 @@@@@@@@@@@@@@@@@@@@@@@@@@@
442	      G Korean      79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
443	      K Vietnamese 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

445	    LACE:
446	      B Chinese     28 @@@@@@@@@@@
447	      A Arabic      31 @@@@@@@@@@@@
448	      J Taiwanese   31 @@@@@@@@@@@@
449	      D Hebrew      39 @@@@@@@@@@@@@@@@
450	      H Russian     48 @@@@@@@@@@@@@@@@@@@
451	      E Hindi       52 @@@@@@@@@@@@@@@@@@@@@
452	      F Japanese    52 @@@@@@@@@@@@@@@@@@@@@
453	      C Czech       58 @@@@@@@@@@@@@@@@@@@@@@@
454	      I Spanish     68 @@@@@@@@@@@@@@@@@@@@@@@@@@@
455	      G Korean      79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
456	      K Vietnamese 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
457	    AltDUDE:
458	      A Arabic      25 @@@@@@@@@@
459	      B Chinese     26 @@@@@@@@@@
460	      D Hebrew      33 @@@@@@@@@@@@@
461	      J Taiwanese   36 @@@@@@@@@@@@@@
462	      H Russian     38 @@@@@@@@@@@@@@@
463	      C Czech       43 @@@@@@@@@@@@@@@@@
464	      F Japanese    49 @@@@@@@@@@@@@@@@@@@@
465	      E Hindi       58 @@@@@@@@@@@@@@@@@@@@@@@
466	      I Spanish     59 @@@@@@@@@@@@@@@@@@@@@@@@
467	      K Vietnamese  81 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
468	      G Korean      89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

470	    AMC-ACE-R:
471	      B Chinese     24 @@@@@@@@@@
472	      A Arabic      28 @@@@@@@@@@@
473	      J Taiwanese   30 @@@@@@@@@@@@
474	      D Hebrew      32 @@@@@@@@@@@@@
475	      C Czech       36 @@@@@@@@@@@@@@
476	      H Russian     40 @@@@@@@@@@@@@@@@
477	      F Japanese    42 @@@@@@@@@@@@@@@@@
478	      I Spanish     47 @@@@@@@@@@@@@@@@@@@
479	      E Hindi       55 @@@@@@@@@@@@@@@@@@@@@@
480	      K Vietnamese  70 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
481	      G Korean      89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

483	    AMC-ACE-O:
484	      B Chinese     24 @@@@@@@@@@
485	      A Arabic      28 @@@@@@@@@@@
486	      J Taiwanese   30 @@@@@@@@@@@@
487	      D Hebrew      31 @@@@@@@@@@@@
488	      C Czech       34 @@@@@@@@@@@@@@
489	      H Russian     40 @@@@@@@@@@@@@@@@
490	      F Japanese    41 @@@@@@@@@@@@@@@@
491	      I Spanish     49 @@@@@@@@@@@@@@@@@@@@
492	      E Hindi       54 @@@@@@@@@@@@@@@@@@@@@@
493	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
494	      G Korean      80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

496	    BRACE:
497	      B Chinese     22 @@@@@@@@@
498	      A Arabic      26 @@@@@@@@@@
499	      J Taiwanese   27 @@@@@@@@@@@
500	      D Hebrew      33 @@@@@@@@@@@@@
501	      C Czech       36 @@@@@@@@@@@@@@
502	      F Japanese    40 @@@@@@@@@@@@@@@@
503	      H Russian     42 @@@@@@@@@@@@@@@@@
504	      E Hindi       45 @@@@@@@@@@@@@@@@@@
505	      I Spanish     48 @@@@@@@@@@@@@@@@@@@
506	      K Vietnamese  72 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
507	      G Korean      78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
508	    AMC-ACE-M:
509	      B Chinese     23 @@@@@@@@@
510	      J Taiwanese   27 @@@@@@@@@@@
511	      A Arabic      28 @@@@@@@@@@@
512	      D Hebrew      31 @@@@@@@@@@@@
513	      C Czech       34 @@@@@@@@@@@@@@
514	      H Russian     38 @@@@@@@@@@@@@@@
515	      F Japanese    42 @@@@@@@@@@@@@@@@@
516	      I Spanish     48 @@@@@@@@@@@@@@@@@@@
517	      E Hindi       54 @@@@@@@@@@@@@@@@@@@@@@
518	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
519	      G Korean      71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

521	    Super-ACE:
522	      B Chinese     22 @@@@@@@@@
523	      A Arabic      25 @@@@@@@@@@
524	      J Taiwanese   27 @@@@@@@@@@@
525	      D Hebrew      31 @@@@@@@@@@@@
526	      C Czech       34 @@@@@@@@@@@@@@
527	      H Russian     38 @@@@@@@@@@@@@@@
528	      F Japanese    40 @@@@@@@@@@@@@@@@
529	      E Hindi       45 @@@@@@@@@@@@@@@@@@
530	      I Spanish     47 @@@@@@@@@@@@@@@@@@@
531	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
532	      G Korean      71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

534	    totals:
535	             RACE: 610 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
536	             LACE: 595 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
537	          AltDUDE: 537 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
538	        AMC-ACE-R: 493 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
539	        AMC-ACE-O: 480 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
540	            BRACE: 469 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
541	        AMC-ACE-M: 465 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
542	        Super-ACE: 449 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

544	    worst cases:
545	             RACE: 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
546	             LACE: 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
547	          AltDUDE:  89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
548	        AMC-ACE-R:  89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
549	        AMC-ACE-O:  80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
550	            BRACE:  78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
551	        AMC-ACE-M:  71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
552	        Super-ACE:  71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

    The totals and worst cases above give more weight to languages
    that produce longer encodings.  That is arguably a good metric,
    because being efficient for difficult languages matters more than
    being efficient for easy ones.  Alternatively, we can give each
    language equal weight by dividing each output length by the
    corresponding Super-ACE output length.  This method yields:

562	    totals:
563	           RACE: 14.9 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
564	           LACE: 14.5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
565	        AltDUDE: 13.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
566	      AMC-ACE-R: 12.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
567	      AMC-ACE-O: 11.8 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
568	      AMC-ACE-M: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
569	          BRACE: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
570	      Super-ACE: 11.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

572	    worst cases:
573	           RACE: 2.00 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
574	           LACE: 1.71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
575	        AltDUDE: 1.33 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
576	      AMC-ACE-R: 1.25 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
577	      AMC-ACE-O: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
578	      AMC-ACE-M: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
579	          BRACE: 1.11 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
580	      Super-ACE: 1.00 @@@@@@@@@@@@@@@@@@@@@@@@@
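
    The equal-weight figures can be reproduced with a few lines of C.
    The following is a minimal sketch, not part of any of the drafts
    compared here; the names normalized_total() and normalized_worst()
    are hypothetical, and every baseline length is assumed to be
    nonzero.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of the draft.  len[i] is the output */
/* length of one encoding for sample string i, and base[i] is the     */
/* Super-ACE output length for the same string (assumed nonzero).     */

/* Sum of the per-language ratios (the "totals" figure). */
double normalized_total(const unsigned int *len,
                        const unsigned int *base, size_t n)
{
  double total = 0.0;
  size_t i;

  for (i = 0;  i < n;  ++i) total += (double) len[i] / (double) base[i];
  return total;
}

/* Largest per-language ratio (the "worst cases" figure). */
double normalized_worst(const unsigned int *len,
                        const unsigned int *base, size_t n)
{
  double worst = 0.0, ratio;
  size_t i;

  for (i = 0;  i < n;  ++i) {
    ratio = (double) len[i] / (double) base[i];
    if (ratio > worst) worst = ratio;
  }
  return worst;
}
```

    For instance, comparing the Super-ACE lengths listed above (22, 25,
    27, 31, 34, 38, 40, 45, 47, 69, 71) with themselves gives a total
    of 11.0 and a worst case of 1.00, which is the Super-ACE row of
    each table.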

    No matter which way we average, the results suggest that AltDUDE is
    preferable to RACE and LACE, because it is no more complex, is more
    efficient, and has better support for case preservation.

    The results also suggest that AMC-ACE-M is preferable to BRACE,
    because it has similar efficiency, is a little simpler, and has
    better support for case preservation.

590	    AltDUDE, AMC-ACE-R, AMC-ACE-O, and AMC-ACE-M are progressively
591	    more complex and more efficient, and have equal support for case
592	    preservation.  The choice depends on how much efficiency is required
593	    and how much complexity is acceptable.

595	    The efficiency gaps between AMC-ACE-M, AMC-ACE-O, and AMC-ACE-R are
596	    mostly due to the Korean (Hangul) string.  Of the 15 characters
597	    by which the AMC-ACE-M total beats the AMC-ACE-O total, 9 come
598	    from the Korean string.  Similarly, of the 13 characters by which
599	    the AMC-ACE-O total beats the AMC-ACE-R total, 9 come from the
600	    Korean string.  The large increases in complexity from AMC-ACE-R to
601	    -O to -M yield significant efficiency gains for Korean, but only
602	    very small gains for the other languages.  More sample strings
603	    from more languages need to be tried before one can conclude that
604	    Korean is the only significant beneficiary, but if it is, then this
605	    author would suggest that AMC-ACE-R is preferable to -O and -M, with
606	    apologies to Korean speakers.

608	    That would leave a choice between AltDUDE and AMC-ACE-R, the latter
609	    being somewhat more complex and somewhat more efficient.

611	Example strings

613	    In the ACE encodings below, signatures (like "bq--" for RACE) are
614	    not shown.  Non-LDH characters in the Unicode string are forced to
615	    lowercase before being encoded.  For RACE, LACE, and AltDUDE, the
616	    letters A-Z are likewise forced to lowercase.  UTF-8 and UTF-16 are
617	    included for length comparisons, with non-ASCII bytes shown as "?".
618	    AMC-ACE-* and AltDUDE are abbreviated AMC-* and ADUDE.  Backslashes
619	    show where line breaks have been inserted in ACE strings too long
620	    for one line.  The RACE and LACE encodings are courtesy of Mark
621	    Davis's online UTF converter [UTFCONV] (slightly modified to remove
622	    the length restrictions).
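
    The lengths of the UTF-8 and UTF-16 rows follow directly from the
    code points.  The sketch below is illustrative only and is not part
    of the draft; the names utf8_bytes(), utf16_bytes(), and
    total_bytes() are hypothetical, and the input is assumed to consist
    of valid code points with no surrogates.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of the draft.  Code points are */
/* assumed to lie in 0..0x10FFFF and not to be surrogates.       */

/* Number of bytes needed to encode one code point in UTF-8. */
unsigned int utf8_bytes(unsigned long cp)
{
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000UL ? 3 : 4;
}

/* Number of bytes needed in UTF-16 (two per 16-bit code unit). */
unsigned int utf16_bytes(unsigned long cp)
{
  return cp < 0x10000UL ? 2 : 4;
}

/* Total size of n code points; per_cp is one of the functions above. */
size_t total_bytes(const unsigned long *s, size_t n,
                   unsigned int (*per_cp)(unsigned long))
{
  size_t total = 0, i;

  for (i = 0;  i < n;  ++i) total += per_cp(s[i]);
  return total;
}
```

    For example (B), the nine code points between U+4E0D and U+8BF4
    give 27 bytes of UTF-8 and 18 bytes of UTF-16, which is the number
    of "?" characters shown in those rows.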

    The first several examples are all translations of the sentence "Why
    can't they just speak in <language>?" (courtesy of Michael Kaplan's
    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
    been removed, as is often done in domain names.

629	    (A) Arabic (Egyptian):
630	        U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
631	        U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

633	        ADUDE:  yueqpcycrcyjhbpznpitjycxf
634	        BRACE:  28akcjwcmp3ciwb4t3ngd4nbaz
635	        AMC-R:  ywekhfuhuikwdwefivevjbuiwktr
636	        AMC-O:  ageekhfuhuiukdefivevjvbuiktr
637	        AMC-M:  agiekhfuhuiukdefivevjvbuiktr
638	        RACE:   azceur2fe4ucuq2eivediojrfbfb6
639	        LACE:   cedeisshiutsqksdircuqnbzgeueuhy
640	        UTF-16: ??????????????????????????????????
641	        UTF-8:  ??????????????????????????????????

643	    (B) Chinese (simplified):
644	        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

646	        UTF-16: ??????????????????
647	        BRACE:  kgcqqsgp26i5h4zn7req5i
648	        AMC-M:  uqj7g8nvk6awispn9wupdnh
649	        AMC-R:  w87g8nvk6awisp259eupyx2h
650	        AMC-O:  eqpg8nvk6awisp259eupyx2h
651	        ADUDE:  w85gvk7g9k2iwf6x9j6x7ju54k
652	        UTF-8:  ???????????????????????????
653	        LACE:   azhnn3b2ybea2aml6qau4libmwdq
654	        RACE:   3bhnmtxmjy5e5qcojbha3c7ujywwlby

    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky

        <ccaron> = U+010D
        <ecaron> = U+011B
        <iacute> = U+00ED

661	        UTF-8:  Pro??prost??nemluv????esky
662	        AMC-O:  piq-Pro-p-prost-9m-nemluv-6pp-esky
663	        AMC-M:  g26-Pro-p-prost-9m-nemluv-6pp-esky
664	        AMC-R:  -Pro-tsp-prost-ttm-nemluv-s8psp-esky
665	        BRACE:  i32-Pro-u-prost-8y-nemluv-29f3n-esky
666	        ADUDE:  tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc
667	        UTF-16: ????????????????????????????????????????????
668	        LACE:   amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\
669	                lpe
670	        RACE:   ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\
671	                7mx7xh73l754q

673	    (D) Hebrew:
674	        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
675	        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
676	        U+05D1 U+05E8 U+05D9 U+05EA

678	        AMC-O:  afpnqeep8e8jfinaqdb8ijp8cb8ij8k
679	        AMC-M:  af4nqeep8e8jfinaqdb8ijp8cb8ij8k
680	        AMC-R:  x7nqeep8e8j7f7inaqdb8ijp8cb8ij8k
681	        ADUDE:  x5nckajvjpvnpenqpcvjvbevrvdvjvbvd
682	        BRACE:  27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c
683	        RACE:   axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u
684	        LACE:   cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q
685	        UTF-8:  ????????????????????????????????????????????
686	        UTF-16: ????????????????????????????????????????????

688	    (E) Hindi:
689	        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
690	        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
691	        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
692	        U+0939 U+0948 U+0902  (Devanagari)

694	        BRACE:  2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32
695	        RACE:   bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai
696	        LACE:   dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba
697	        AMC-O:  ajeurvjvcmthvjvruipugatfpurmscuivjascunmvcvitfuehvjisc
698	        AMC-M:  ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc
699	        AMC-R:  3urvjvcwmthjruiwpugwatfwpurmscuivjascunmvcvitfuewhjwisc
700	        ADUDE:  3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzqz\
701	                bwk
702	        UTF-16: ???????????????????????????????????????????????????????\
703	                ?????
704	        UTF-8:  ???????????????????????????????????????????????????????\
705	                ???????????????????????????????????

707	    (F) Japanese:
708	        U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
709	        U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
710	        (kanji and hiragana)
711	        UTF-16: ????????????????????????????????????
712	        BRACE:  ji8nr5zj8uqth7v97mjchakwcg7dqemw88nj5gbe
713	        AMC-O:  gvagkxnzr3dkx8fzun243q3c24zbxhgwr2nkweqwm
714	        AMC-R:  vsykxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm
715	        AMC-M:  bsnkxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm
716	        ADUDE:  vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf
717	        LACE:   auyguxd7snvaczpfaftsyamktyatbeqbrjyqqmcxmzhyy2senzfq
718	        UTF-8:  ??????????????????????????????????????????????????????
719	        RACE:   3aygumc4gb7tbezqnjs6kzzmrkpdbeukoeyfomdggbhtbdbqniyeimd\
720	                ogbfq

722	    (G) Korean:
723	        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
724	        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
725	        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C  (Hangul syllables)

727	        UTF-16: ????????????????????????????????????????????????
728	        UTF-8:  ???????????????????????????????????????????????????????\
729	                ?????????????????
730	        AMC-M:  yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\
731	                p6aqi23ieemweywn
732	        BRACE:  y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\
733	                gmnsqydmk7im56arju6pt82
734	        LACE:   77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
735	                exj2mlpfzzcyjrsely5ck4ta
736	        RACE:   3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
737	                exj2mlpfzzcyjrsely5ck4ta
738	        AMC-O:  m6hwq6tvi466exi44ia6s4nz2neze7xxn47yp6x5e3znze7xze7xxnu\
739	                8e4ze6x5n36is3i622mwe48wn
740	        ADUDE:  6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
741	                usauiyz5i23az96iz6ze3xaz2td96ry3si
742	        AMC-R:  6tvi466ezxi544i5w8a6s4nz2nw8e6zze7xxn47yp6x5e53znze7xze\
743	                7xxn5u8e54ze6x5n36is3i622m6zwe48wn

745	    (H) Russian:
746	        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
747	        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
748	        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
749	        U+0438  (Cyrillic)

751	        ADUDE:  wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc
752	        AMC-M:  aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca
753	        AMC-R:  wvRqwhfnwdgfqpipfdqcqwawrcvrvqwawdbbvkvi
754	        AMC-O:  aedRqwhfnwdgfqpipfdqcqwawrwcrqwawdwbwbki
755	        BRACE:  269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2
756	        RACE:   aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa
757	        LACE:   dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory
758	        UTF-16: ???????????????????????????????????????????????????????\
759	                ???
760	        UTF-8:  ???????????????????????????????????????????????????????
761	                ???

    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol

        <eacute> = U+00E9
        <ntilde> = U+00F1

767	        UTF-8:  Porqu??nopuedensimplementehablarenEspa??ol
768	        AMC-R:  -Porqu-8j-nopuedensimplementehablarenEspa-9b-ol
769	        AMC-M:  aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol
770	        BRACE:  22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol
771	        AMC-O:  aaq-Porqu-j-nopuedensimplementehablarenEspa-9b-ol
772	        ADUDE:  tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb3\
773	                a3qd
774	        RACE:   abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\
775	                fonygd4lpnq
776	        LACE:   faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\
777	                omvzxayprn5wa
778	        UTF-16: ???????????????????????????????????????????????????????\
779	                ?????????????????????????

781	    (J) Taiwanese:
782	        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

784	        UTF-16: ??????????????????
785	        UTF-8:  ???????????????????????????
786	        AMC-M:  uqj7g2tbgtu6a385pspnxkupdnh
787	        BRACE:  kgcqui49gatc2wyrn8y7cndgte9
788	        AMC-R:  w87gxstbzuvc6a385psp244kupyx2h
789	        AMC-O:  eqpgxstbzuvc6a385psp244kupyx2h
790	        RACE:   3bhnmuaroize5qe6xvha3cvkjywwlby
791	        LACE:   75hnmuaroize5qe6xvha3cvkjywwlby
792	        ADUDE:  w85gt86huuudv69c7szp7s5a6w4h6w2hu54k

794	    (K) Vietnamese:
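        Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
        <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t

         <dotbelow> = U+0323
            <ocirc> = U+00F4
            <ecirc> = U+00EA
        <hookabove> = U+0309
            <acute> = U+0301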

804	        UTF-8:  Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t
805	        AMC-O:  aava-Ta-vud-isaoho-vud-kh-9e-ngth-8kj-chi-j-no-b-iti-8k\
806	                b-ngVi-8kvud-t
807	        AMC-M:  ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\
808	                b-ngVi-s8kud-t
809	        AMC-R:  -Ta-vud-isaoho-vud-kh-9e-ngth-8kvsj-chi-vsj-no-b-iti-s8\
810	                kb-ngVi-s8kud-t
811	        BRACE:  i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\
812	                9c29-ngVi-25p48-t
813	        UTF-16: ???????????????????????????????????????????????????????\
814	                ?????????????????????
815	        ADUDE:  tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqvy\
816	                itptp2dv8mvyrjtBtr2dv6jvxh
817	        LACE:   aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\
818	                ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi
819	        RACE:   ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\
820	                kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\
821	                3u

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3<nen>B<gumi><kinpachi><sensei>  (Japanese TV program title)

                     <nen> = U+5E74                       (kanji)
                    <gumi> = U+7D44                       (kanji)
        <kinpachi><sensei> = U+91D1 U+516B U+5148 U+751F  (kanji)

834	        UTF-16: ????????????????
835	        UTF-8:  3???B???????????????
836	        AMC-M:  utk-3-8ze-B-hkenqtymwifi9
837	        BRACE:  u-3-ygj-b-ynb6gjc7pp4k5p5w
838	        AMC-O:  fb8h-3-e-B-z7we3t7bymwizxtr
839	        ADUDE:  xdx8whx8tGz7ug863f6s5kuduwxh
840	        RACE:   3aadgxtuabrh2rer2fiwwukioupq
841	        LACE:   74adgxtuabrh2rer2fiwwukioupq
842	        AMC-R:  -3-x8ze-B-z7we3t7bxtymtwizxtr

    (M) <amuro><namie>-with-SUPER-MONKEYS  (Japanese music group name)

        <amuro><namie> = U+5B89 U+5BA4 U+5948 U+7F8E U+6075  (kanji)

848	        UTF-8:  ??????????????????-with-SUPER-MONKEYS
849	        AMC-M:  u5m2j4etwif6q2zf---with--SUPER--MONKEYS
850	        AMC-R:  x52j4e3wiz92qyszf---with--SUPER--MONKEYS
851	        AMC-O:  fmij4e3wiz92qyszf---with--SUPER--MONKEYS
852	        BRACE:  uvj7fuaqcahy982xa---with--SUPER--MONKEYS
853	        ADUDE:  x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK
854	        UTF-16: ????????????????????????????????????????????????
855	        LACE:   ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y
856	        RACE:   3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\
857	                nabwqa3yanyagwadfab4qa4y

    (N) Hello-Another-Way-<sorezore><no><basho>  (Japanese song title)

        <sorezore><no> = U+305D U+308C U+305E U+308C U+306E  (hiragana)
               <basho> = U+5834 U+6240                       (kanji)

864	        UTF-8:  Hello-Another-Way-?????????????????????
865	        BRACE:  ji7-Hello--Another--Way---v3jhaefvd2ufj62
866	        AMC-O:  daf-Hello--Another--Way---p2nq2nyqx2veyuwa
867	        AMC-M:  bsk-Hello--Another--Way---p2nq2nyqx2veyuwa
868	        ADUDE:  Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze
869	        AMC-R:  -Hello--Another--Way---vsxpvs2nxq2nyqx2veyuwa
870	        UTF-16: ??????????????????????????????????????????????????
871	        LACE:   ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea
872	        RACE:   3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\
873	                zaawtaxjqrqyf4memgbxfqndcia

    (O) <hitotsu><yane><no><shita>2  (Japanese TV program title)

        <hitotsu> = U+3072 U+3068 U+3064  (hiragana)
           <yane> = U+5C4B U+6839         (kanji)
             <no> = U+306E                (hiragana)
          <shita> = U+4E0B                (kanji)

881	        UTF-16: ????????????????
882	        UTF-8:  ?????????????????????2
883	        AMC-O:  dagzciex6wmy2vjqw8sm-2
884	        AMC-M:  bsnzciex6wmy2vjqw8sm-2
885	        BRACE:  ji96u56uwbhf2wqxnw4s-2
886	        AMC-R:  vszcyiyex6wmy2vjqw8sm-2
887	        ADUDE:  vstctkny6urvwzcx2xhz8yfw8vj
888	        RACE:   3ayhemdigbsfys3iheyg4tqlaaza
889	        LACE:   74yhemdigbsfys3iheyg4tqlaaza

    (P) Maji<de>Koi<suru>5<byou><mae>  (Japanese song title)

               <de> = U+3067         (hiragana)
             <suru> = U+3059 U+308B  (hiragana)
        <byou><mae> = U+79D2 U+524D  (kanji)

897	        UTF-8:  Maji???Koi??????5??????
898	        UTF-16: ??????????????????????????
899	        AMC-M:  bsm-Maji-r-Koi-b2m-5-z37cxuwp
900	        BRACE:  ji8-Maji-g-Koi-qe7x-5-wx7p6ma
901	        AMC-O:  dag-Maji-h-Koi-xj2m-5-z37cxuwp
902	        ADUDE:  PnmdvssqvssNegvsva7cvs5qz38hu53r
903	        AMC-R:  -Maji-vsyh-Koi-vsxj2m-5-z37cxuwp
904	        RACE:   3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq
905	        LACE:   74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq

    (Q) <pafii>de<runba>  (Japanese song title)

        <pafii> = U+30D1 U+30D5 U+30A3 U+30FC  (katakana)
        <runba> = U+30EB U+30F3 U+30D0         (katakana)

912	        UTF-16: ??????????????
913	        BRACE:  3iu8pazt-de-pygi
914	        AMC-O:  dapbf4d9n-de-8m9da
915	        AMC-M:  bs3jp4d9n-de-8m9di
916	        AMC-R:  vs7bf4d9n-de-8m9d7a
917	        RACE:   gdi5li7475sp6zpl6pia
918	        ADUDE:  vs5bezgxrvs3ibvs2qtiud
919	        UTF-8:  ????????????de?????????
920	        LACE:   aqyndvnd7qbaazdfamyox46q

    (R) <sono><supiido><de>  (Japanese song title)

           <sono> = U+305D U+306E                (hiragana)
        <supiido> = U+30B9 U+30D4 U+30FC U+30C9  (katakana)
             <de> = U+3067                       (hiragana)

927	        RACE:   gbow5oou7tewo
928	        UTF-16: ??????????????
929	        BRACE:  bidprdmp9wt7mi
930	        LACE:   a4yf23vz2t6mszy
931	        AMC-O:  dagxpq5j7e9n6jh
932	        AMC-M:  bsmfyq5j7e9n6jr
933	        ADUDE:  vsvpvd7hypuivf4q
934	        AMC-R:  vsxpyq5j7e9n6jyh
935	        UTF-8:  ?????????????????????

937	    The last example is an ASCII string that breaks not only the
938	    existing rules for host name labels but also the rules proposed in
939	    [NAMEPREP03] for internationalized domain names.

941	    (S) -> $1.00 <-

943	        UTF-8:  -> $1.00 <-
944	        ADUDE:  -xqtqetftrtqatatn-
945	        RACE:   aawt4ibegexdambahqwq
946	        LACE:   bmac2praeqys4mbqea6c2
947	        AMC-R:  --vquaue-1-q-00-avn--
948	        UTF-16: ??????????????????????
949	        AMC-O:  aac--vqae-1-q-00-avn--
950	        AMC-M:  aae--vqae-1-q-00-avn--
951	        BRACE:  229--t2b4-1-w-00-i9i--

953	Security considerations

955	    Users expect each domain name in DNS to be controlled by a single
956	    authority.  If a Unicode string intended for use as a domain label
957	    could map to multiple ACE labels, then an internationalized domain
958	    name could map to multiple ACE domain names, each controlled by
959	    a different authority, some of which could be spoofs that hijack
960	    service requests intended for another.  Therefore AMC-ACE-R is
961	    designed so that each Unicode string has a unique encoding.

963	    However, there can still be multiple Unicode representations of the
964	    "same" text, for various definitions of "same".  This problem is
965	    addressed to some extent by the Unicode standard under the topic of
966	    canonicalization, and this work is leveraged for domain names by
967	    "nameprep" [NAMEPREP03].

969	Credits

971	    AMC-ACE-R reuses a number of preexisting techniques.

973	    The basic encoding of integers to nybbles to quintets to base-32
974	    comes from UTF-5 [UTF5], and the particular variant used here comes
975	    from AMC-ACE-M [AMCACEM00].

977	    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
978	    from SFS [SFS].
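
    The resulting digit set can be generated mechanically.  The
    following is a minimal sketch, not part of the draft;
    build_base32_alphabet() is a hypothetical name.  Unlike the example
    implementation, which stores numeric character codes, this sketch
    uses character constants and therefore assumes that a-z are
    contiguous in the execution character set, as they are in ASCII.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the draft.  Fills out[0..31] with */
/* the base-32 digits and out[32] with a terminating null: the        */
/* letters a-z without l and o, followed by the digits 2-9, so that   */
/* 0, 1, o, and l never appear.  Assumes a-z are contiguous (ASCII).  */
void build_base32_alphabet(char *out)
{
  size_t n = 0;
  char c;

  for (c = 'a';  c <= 'z';  ++c) {
    if (c != 'l' && c != 'o') out[n++] = c;
  }
  for (c = '2';  c <= '9';  ++c) out[n++] = c;
  out[n] = 0;
}
```

    The result is "abcdefghijkmnpqrstuvwxyz23456789", the same order as
    the base32[] table in the example implementation.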

980	    The idea of encoding deltas from reference points was taken from
981	    RACE (of which the latest version is [RACE03]), which may have
982	    gotten the idea from Unicode Technical Standard #6 [UTS6].
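
    The arithmetic behind a reference point is simple.  The sketch
    below is illustrative only; window_base(), same_window(), and
    delta_from() are hypothetical names.  The masking mirrors the
    expressions used by update_refpoints() in the example
    implementation, but the sketch does not reproduce the AMC-ACE-R
    encoding itself.

```c
/* Hypothetical helpers, not part of the draft. */

/* Clear the low `bits` bits of cp, giving the start of its window. */
unsigned long window_base(unsigned long cp, unsigned int bits)
{
  return (cp >> bits) << bits;
}

/* Nonzero if a and b agree in every bit above the low `bits` bits. */
int same_window(unsigned long a, unsigned long b, unsigned int bits)
{
  return ((a ^ b) >> bits) == 0;
}

/* Offset of cp from a reference point; meaningful when cp >= ref. */
unsigned long delta_from(unsigned long cp, unsigned long ref)
{
  return cp - ref;
}
```

    For example, U+0644 and U+064A share the 16-code-point window that
    starts at U+0640, so relative to that reference point they become
    the deltas 4 and 10, each of which fits in four bits.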

984	    The idea of switching between literal mode and base-32 mode comes
985	    from BRACE [BRACE00].

987	    The general idea of using the alphabetic case of base-32 characters
988	    to record the desired case of the Unicode characters was suggested
989	    by this author, and first applied to the UTF-5-style encoding in
990	    DUDE (of which the latest version is [DUDE01]).

992	    The heuristic used to adapt the reference points based on past code
993	    points is new in AMC-ACE-R.

995	References

997	    [AltDUDE00] Adam Costello, "AltDUDE version 0.0.2", 2001-Mar-19,
998	    draft-ietf-idn-altdude-00.

1000	    [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
1001	    draft-ietf-idn-amc-ace-m-00.

1003	    [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19,
1004	    draft-ietf-idn-amc-ace-o-00.

1006	    [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based
1007	    ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19,
1008	    draft-ietf-idn-brace-00.

1010	    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
1011	    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

1013	    [IDN] Internationalized Domain Names (IETF working group),
1014	    http://www.i-d-n.net/, idn@ops.ietf.org.

1016	    [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII
1017	    Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01.

1019	    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
1020	    of Internationalized Host Names", 2001-Feb-24,
1021	    draft-ietf-idn-nameprep-03.

1023	    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
1024	    http://www.trigeminal.com/samples/provincial.html.

1026	    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
1027	    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

1029	    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
1030	    Table Specification", 1985-Oct, RFC 952.

1032	    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
1033	    1987-Nov, RFC 1034.

1035	    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
1036	    "Requirements for Internet Hosts -- Application and Support",
1037	    1989-Oct, RFC 1123.

1039	    [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)",
1040	    draft-ietf-idn-sace-*.

1042	    [SFS] David Mazieres et al, "Self-certifying File System",
1043	    http://www.fs.net/.

1045	    [UNICODE] The Unicode Consortium, "The Unicode Standard",
1046	    http://www.unicode.org/unicode/standard/standard.html.

1048	    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
1049	    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

1051	    [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another
1052	    ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*.

1054	    [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed,
1055	    Mark Davis, Asmus Freytag, "Unicode Technical Standard
1056	    #6: A Standard Compression Scheme for Unicode",
1057	    http://www.unicode.org/unicode/reports/tr6/.

1059	    [UTFCONV] Mark Davis, "UTF Converter",
1060	    http://www.macchiato.com/unicode/convert.html.

1062	Author

1064	    Adam M. Costello 
1065	    http://www.cs.berkeley.edu/~amc/

1067	Example implementation
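
    A caller of the encoder needs an output buffer of a suitable size.
    The following sketch is not part of the draft;
    amc_ace_r_max_output_size() is a hypothetical helper based on the
    bound stated in the interface comments (input_length*5+1), with an
    added guard against unsigned overflow.

```c
#include <limits.h>

/* Hypothetical helper, not part of the draft.  Returns the largest */
/* output_size the encoder could need for input_length code points  */
/* (input_length*5+1, which includes the terminating null), or 0 if */
/* that value would not fit in an unsigned int.                     */
unsigned int amc_ace_r_max_output_size(unsigned int input_length)
{
  if (input_length > (UINT_MAX - 1) / 5) return 0;
  return input_length * 5 + 1;
}
```

    A caller could allocate that many chars, store the same value in
    *output_size, and then call amc_ace_r_encode().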

/******************************************/
/* amc-ace-r.c 0.0.0 (2001-Mar-27-Tue)    */
/* Adam M. Costello  */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-R version 0.0.*. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_invalid_input,
  amc_ace_big_output
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x10FFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output );

    /* amc_ace_r_encode() converts Unicode to AMC-ACE-R (without      */
    /* any signature).  The input must be represented as an array     */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII.  The input_length is the number of      */
    /* code points in the input.  The output_size is an in/out        */
    /* argument: the caller must pass in the maximum number of        */
    /* characters that may be output (including the terminating       */
    /* null), and on successful return it will contain the number of  */
    /* characters actually output (including the terminating null,    */
    /* so it will be one more than strlen() would return, which is    */
    /* why it is called output_size rather than output_length).  The  */
    /* uppercase_flags array must hold input_length boolean values,   */
    /* where nonzero means the corresponding Unicode character should */
    /* be forced to uppercase after being decoded, and zero means it  */
    /* is caseless or should be forced to lowercase.  Alternatively,  */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros.  The letters a-z and A-Z are always encoded      */
    /* literally, regardless of the corresponding flags.  The encoder */
    /* always outputs lowercase base-32 characters except when        */
    /* nonzero values of uppercase_flags require otherwise.  The      */
    /* return value may be any of the amc_ace_status values defined   */
    /* above; if not amc_ace_success, then output_size and output may */
    /* contain garbage.  On success, the encoder will never need to   */
    /* write an output_size greater than input_length*5+1, because of */
    /* how the encoding is defined.                                   */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags );

    /* amc_ace_r_decode() converts AMC-ACE-R (without any signature)  */
    /* to Unicode.  The input must be represented as null-terminated  */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points.  The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()).  The output_length is      */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output.  The uppercase_flags array must have room for at       */
    /* least output_length values, or it may be a null pointer        */
    /* if the case information is not needed.  A nonzero flag         */
    /* indicates that the corresponding Unicode character should      */
    /* be forced to uppercase by the caller, while zero means it      */
    /* is caseless or should be forced to lowercase.  The letters     */
    /* a-z and A-Z are output already in the proper case, but their   */
    /* flags will be set appropriately so that applying the flags     */
    /* would be harmless.  The return value may be any of the         */
    /* amc_ace_status values defined above; if not amc_ace_success,   */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage.  On success, the decoder will never need to write     */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* is_ldh(codept) returns 1 if codept is an ASCII letter, */
/* digit, or hyphen-minus (LDH), zero otherwise.          */

static int is_ldh(u_code_point codept)
{
  return codept >  122 ? 0 :
         codept >=  97 ? 1 :
         codept >   90 ? 0 :
         codept >=  65 ? 1 :
         codept >   57 ? 0 :
         codept >=  48 ? 1 :
         codept ==  45      ;
}

/* is_AtoZ(c) returns 1 if c is an         */
/* uppercase ASCII letter, zero otherwise. */

static unsigned char is_AtoZ(char c)
{
  return c >= 65 && c <= 90;
}

/* base32[n] is the lowercase base-32 character representing  */
/* the number n from the range 0 to 31.  Note that we cannot  */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.                   */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.                                                */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
/* then ASCII A-Z are considered equal to a-z respectively.           */

static int unequal( enum case_sensitivity case_sensitivity,
  const char *s1, const char *s2 )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* update_refpoints(refpoint,history,latest) updates refpoint[1..3] */
/* based on the history, where history[latest] is the latest code   */
/* point.                                                           */

void update_refpoints( u_code_point *refpoint,
  const u_code_point *history, unsigned int latest )
{
  unsigned int k, b, i;
  for (k = 1;  k <= 3;  ++k) {
    b = k << 2;
    if (latest == 0) refpoint[k] = (history[0] >> b) << b;
    else for (i = latest;  i-- > 0; ) {
      if (is_ldh(history[i])) continue;
      if ((refpoint[k] ^ history[i]) >> b == 0) break;

      if ((history[latest] ^ history[i]) >> b == 0) {
        refpoint[k] = (history[latest] >> b) << b;
        return;
      }
    }
  }
}

/* Main encode function: */

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output )
{
  unsigned int max_out, next_out, literal, i, k, out;
  u_code_point codept, delta;
  char shift;
  u_code_point refpoint[6] = {0, 0x60, 0, 0, 0, 0x10000};

  max_out = *output_size;
  next_out = 0;
  literal = 0;

  for (i = 0;  i < input_length;  ++i) {
    codept = input[i];
    if (codept > 0x10FFFF) return amc_ace_invalid_input;

    if (codept == 0x2D) {
      /* hyphen-minus is doubled, so two output characters are needed */
      if (max_out - next_out < 2) return amc_ace_big_output;
      output[next_out++] = 0x2D;
      output[next_out++] = 0x2D;
    }
    else if (is_ldh(codept)) {
      /* encode LDH character literally */
      if (max_out - next_out < 1 + !literal) return amc_ace_big_output;
      /* switch to literal mode if necessary: */
      if (!literal) output[next_out++] = 0x2D;
      literal = 1;
      output[next_out++] = codept;
    }
    else {
      /* encode non-LDH character using base-32 */

      shift = uppercase_flags && uppercase_flags[i] ? 32 : 0;
      /* shift will determine the case of the last base-32 digit */
      for (k = 1; ; ++k) {
        delta = codept - refpoint[k];
        if (delta >> (4*k) == 0) break;
      }

      /* We will encode delta as k base-32 digits. */

      if (max_out - next_out < k + literal) return amc_ace_big_output;
      /* switch to base-32 mode if necessary: */
      if (literal) output[next_out++] = 0x2D;
      literal = 0;

      /* Computing the base-32 digits in reverse order is easiest. */
      /* Only the last base-32 digit has the high bit clear.       */

      out = next_out + k - 1;
      output[out] = base32[delta & 0xF] - shift;

      while (out > next_out) {
        delta >>= 4;
        output[--out] = base32[0x10 | (delta & 0xF)];
      }

      next_out += k;
      update_refpoints(refpoint,input,i);
    }
  }

  /* null terminator: */
  if (max_out - next_out < 1) return amc_ace_big_output;
  output[next_out++] = 0;
  *output_size = next_out;
  return amc_ace_success;
}

/* Main decode function: */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags )
{
  u_code_point q, delta;
  const char *in, *first;
  char c;
  unsigned int next_out, max_out, literal, input_size, scratch_size;
  enum amc_ace_status status;
  u_code_point refpoint[6] = {0, 0x60, 0, 0, 0, 0x10000};

  max_out = *output_length;
  next_out = 0;
  in = input;
  literal = 0;
  for (c = *in;  c != 0; ) {
    if (c == 45 && in[1] != 45) {
      /* unpaired hyphen-minus toggles mode */
      literal = !literal;
      c = *++in;
      continue;
    }

    if (max_out - next_out < 1) return amc_ace_big_output;

    if (c == 45) {
      /* double hyphen-minus represents a hyphen-minus */
      if (uppercase_flags) uppercase_flags[next_out] = 0;
      output[next_out] = 45;
      c = *(in += 2);
    }
    else {
      if (literal) {
        /* decode one literal code point */
        output[next_out] = c;
        c = *++in;
      }
      else {
        /* Base-32 sequence: */

        delta = 0;
        first = in;

        do {
          q = base32_decode(c);
          if (q == base32_invalid) return amc_ace_invalid_input;
          delta = (delta << 4) | (q & 0xF);
          c = *++in;
        } while (q >> 4 == 1);

        /* refpoint[] is defined only for sequences of 1 to 5 digits */
        if (in - first > 5) return amc_ace_invalid_input;

        output[next_out] = refpoint[in - first] + delta;
        update_refpoints(refpoint, output, next_out);
      }

      /* case of last digit determines uppercase flag: */
      if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]);
    }

    ++next_out;
  }

  /* Re-encode the output and compare to the input: */

  input_size = in - input + 1;
  scratch_size = input_size;
  status = amc_ace_r_encode(next_out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success ||
      scratch_size != input_size ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_invalid_input;
  *output_length = next_out;
  return amc_ace_success;
}
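
The base-32 digit handling shared by the encoder and decoder above can
be studied in isolation.  The following stand-alone sketch is
illustrative only and is not part of the reference implementation.  The
names digit_of, value_of, emit_digits, and parse_digits are assumptions
introduced here; the digit table and the arithmetic are copied from the
listing.  The caller is assumed to pass k >= 1 and a buffer of at least
k characters.

```c
/* Illustrative sketch; none of these identifiers appear in the listing. */

/* The 32 digits as ASCII codes: a-k, m-n, p-z, 2-9. */
static const char digit_of[32] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
  109, 110,
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
  50, 51, 52, 53, 54, 55, 56, 57
};

/* Value of an ASCII base-32 digit (either case), or 32 if invalid. */
static unsigned int value_of(int c)
{
  if (c >= 50 && c <= 57) return (unsigned int)(c - 26);
  if (c >= 65 && c <= 90) c += 32;
  if (c < 97 || c > 122 || c == 108 || c == 111) return 32;
  return (unsigned int)(c - 97 - (c > 108) - (c > 111));
}

/* Write delta as k digits, most significant first.  Every digit but */
/* the last carries the 0x10 continuation bit; the case of the last  */
/* digit records the uppercase flag.                                 */
static void emit_digits(unsigned long delta, unsigned int k,
                        int uppercase, char *out)
{
  unsigned int i = k - 1;
  out[i] = (char)(digit_of[delta & 0xF] - (uppercase ? 32 : 0));
  while (i > 0) {
    delta >>= 4;
    out[--i] = digit_of[0x10 | (delta & 0xF)];
  }
}

/* Read digits until one without the continuation bit.  Returns the */
/* number of digits consumed, or 0 if an invalid digit is found.    */
static unsigned int parse_digits(const char *in, unsigned long *delta,
                                 int *uppercase)
{
  unsigned int n = 0, q;
  *delta = 0;
  do {
    q = value_of(in[n]);
    if (q == 32) return 0;
    *delta = (*delta << 4) | (q & 0xF);
    ++n;
  } while (q >> 4 == 1);
  *uppercase = (in[n - 1] >= 65 && in[n - 1] <= 90);
  return n;
}
```

For example, with the initial reference points a first code point of
U+00E9 does not fit in one digit relative to refpoint[1] = 0x60, so it
is written as the two-digit delta 0xE9 from refpoint[2] = 0:
emit_digits(0xE9, 2, 0, buf) stores the ASCII codes 56 and 106 ("8j"),
and parse_digits recovers the delta 0xE9 from those two characters.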

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive  /* good for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads big-endian UTF-32 and writes AMC-ACE-R ASCII.\n"
    "%s -d reads AMC-ACE-R ASCII and writes big-endian UTF-32.\n"
    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  /* check argv[1][1] first so a bare "-" is not read past its end */
  if (argv[1][1] == '\0' || argv[1][2] != '\0') usage(argv);
  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size];
    unsigned int input_length, output_size;
    int c0, c1, c2, c3;

    /* Read the UTF-32 input string: */

    input_length = 0;

    for (;;) {
      c0 = getchar();
      c1 = getchar();
      c2 = getchar();
      c3 = getchar();
      if (ferror(stdin)) fail(io_error);

      if (c1 == EOF || c2 == EOF || c3 == EOF) {
        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
        break;
      }

      if (input_length == unicode_max_length) fail(too_big);

      if ((c0 != 0 && c0 != 0x80)
          || c1 < 0 || c1 > 0x10
          || c2 < 0 || c2 > 0xFF
          || c3 < 0 || c3 > 0xFF ) {
        fail(invalid_input);
      }

      input[input_length] = ((u_code_point) c1 << 16) |
                            ((u_code_point) c2 <<  8) |
                             (u_code_point) c3         ;
      uppercase_flags[input_length] = (c0 >> 7);
      ++input_length;
    }

    /* Encode, and output the result: */

    output_size = ace_max_size;
    status = amc_ace_r_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_invalid_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);
    r = fputs(output,stdout);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size];
    u_code_point output[unicode_max_length], codept;
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int output_length, i;

    /* Read the AMC-ACE-R ASCII input string: */

    /* fgets() leaves the buffer untouched on empty input, so */
    /* terminate it explicitly in that case.                  */
    if (fgets(input, ace_max_size, stdin) == NULL) input[0] = 0;
    if (ferror(stdin)) fail(io_error);
    if (!feof(stdin)) fail(too_big);

    /* Decode, and output the result: */

    output_length = unicode_max_length;
    status = amc_ace_r_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_invalid_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    for (i = 0;  i < output_length;  ++i) {
      r = putchar(uppercase_flags[i] ? 0x80 : 0);
      if (r == EOF) fail(io_error);
      codept = output[i];
      r = putchar(codept >> 16);
      if (r == EOF) fail(io_error);
      r = putchar((codept >> 8) & 0xFF);
      if (r == EOF) fail(io_error);
      r = putchar(codept & 0xFF);
      if (r == EOF) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
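
The wrapper's -e mode expects each code point as four bytes in
big-endian order, with bit 31 serving as the force-to-uppercase flag.
Preparing such input by hand is error-prone, so the following sketch
shows one way to pack a code point into that form.  It is illustrative
only; pack_utf32be and its parameters are names assumed here and do not
appear in the listing.

```c
/* Illustrative helper; pack_utf32be is not part of the listing. */

/* Pack one code point into the wrapper's extended big-endian UTF-32 */
/* form: the first byte is 0x80 if the uppercase flag is set, else 0. */
/* Returns 0 on success, or -1 if the code point exceeds 0x10FFFF.    */
static int pack_utf32be(unsigned long codept, int uppercase,
                        unsigned char out[4])
{
  if (codept > 0x10FFFFUL) return -1;
  out[0] = uppercase ? 0x80 : 0x00;
  out[1] = (unsigned char)((codept >> 16) & 0xFF);
  out[2] = (unsigned char)((codept >> 8) & 0xFF);
  out[3] = (unsigned char)(codept & 0xFF);
  return 0;
}
```

A test driver could call this once per code point and write the four
bytes to a file opened in binary mode, then supply that file on the
wrapper's standard input.  The resulting bytes satisfy the checks the
wrapper applies: a first byte of 0 or 0x80 and a second byte no greater
than 0x10.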

                   INTERNET-DRAFT expires 2001-Sep-27