idnits 2.17.1 draft-ietf-idn-amc-ace-o-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 2 longer pages, the longest (page 13) being 59 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 21 instances of too long lines in the document, the longest one being 6 characters in excess of 72. ** The abstract seems to contain references ([UNICODE], [AMCACEM00], [RFC1123], [RFC952], [IDN]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '
' and
     '' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '2' on line 1643

  -- Looks like a reference, but probably isn't: '1' on line 1695

  -- Looks like a reference, but probably isn't: '3' on line 1534

  -- Looks like a reference, but probably isn't: '6' on line 1392

  -- Looks like a reference, but probably isn't: '0' on line 1642

  -- Looks like a reference, but probably isn't: '4' on line 1534

  == Missing Reference: '--i' is mentioned on line 1330, but not defined

  -- Looks like a reference, but probably isn't: '5' on line 1533

  == Missing Reference: '-1' is mentioned on line 1576, but not defined

  == Unused Reference: 'LACE01' is defined on line 1049, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Normative reference to a draft: ref. 'RACE03' 

  -- Possible downref: Normative reference to a draft: ref. 'AltDUDE00' 

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEM00' 

  -- Possible downref: Normative reference to a draft: ref. 'BRACE00' 

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-01

  -- Possible downref: Normative reference to a draft: ref. 'DUDE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Normative reference to a draft: ref. 'LACE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- No information found for draft-ietf-idn-sace- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'SACE' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5' 

  -- No information found for draft-ietf-idn-utf6- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF6' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS6'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTFCONV'


     Summary: 6 errors (**), 0 flaws (~~), 7 warnings (==), 28 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-amc-ace-o-00.txt                              2001-Mar-19
3	Expires 2001-Sep-19

5	                         AMC-ACE-O version 0.0.3

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/amc-ace-o

34	Abstract

36	    AMC-ACE-O is a reversible map from a sequence of Unicode [UNICODE]
37	    characters to a sequence of letters (A-Z, a-z), digits (0-9), and
38	    hyphen-minus (-), henceforth called LDH characters.  Such a map
39	    (called an "ASCII-Compatible Encoding", or ACE) might be useful for
40	    internationalized domain names [IDN], because host name labels are
41	    currently restricted to LDH characters by [RFC952] and [RFC1123].

43	    AMC-ACE-O is similar to AMC-ACE-M [AMCACEM00] but is simpler and
44	    slightly less efficient.

46	    Besides domain names, there might also be other contexts where it is
47	    useful to transform Unicode characters into "safe" (delimiter-free)
48	    ASCII characters.  (If other contexts consider hyphen-minus to be
49	    unsafe, a different character could be used to play its role, like
50	    underscore.)
51	Contents

53	    Features
54	    Name
55	    Overview
56	    Base-32 characters
57	    Encoding and decoding algorithms
58	    Signature
59	    Case sensitivity models
60	    Comparison with RACE, BRACE, LACE, DUDE, AMC-ACE-M
61	    Example strings
62	    Security considerations
63	    Credits
64	    References
65	    Author
66	    Example implementation

68	Features

70	    Uniqueness:  Every Unicode string maps to at most one LDH string.

72	    Completeness:  Every Unicode string maps to an LDH string.
73	    Restrictions on which Unicode strings are allowed, and on length,
74	    may be imposed by higher layers.

76	    Efficient encoding:  The ratio of encoded size to original size is
77	    small for all Unicode strings.  This is important in the context
78	    of domain names because [RFC1034] restricts the length of a domain
79	    label to 63 characters.

81	    Simplicity:  The encoding and decoding algorithms are reasonably
82	    simple to implement.  The goals of efficiency and simplicity are at
83	    odds; AMC-ACE-O aims at a good balance between them.

85	    Case-preservation:  If the Unicode string has been case-folded prior
86	    to encoding, it is possible to record the case information in the
87	    case of the letters in the encoding, allowing a mixed-case Unicode
88	    string to be recovered if desired, but a case-insensitive comparison
89	    of two encoded strings is equivalent to a case-insensitive
90	    comparison of the Unicode strings.  This feature is optional; see
91	    section "Case sensitivity models".

93	    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
94	    in the Unicode string are represented as themselves in the label.
95	    This comes for free because it usually the most efficient encoding
96	    anyway.

98	Name

100	    AMC-ACE-O is a working name that should be changed if it is adopted.
101	    (The O merely indicates that it is the fifteenth ACE devised by this
102	    author.  BRACE was the third.  D-L and N did not deliver enough
103	    efficiency to justify their complexity.)  Rather than waste good
104	    names on experimental proposals, let's wait until one proposal is
105	    chosen, then assign it a good name.  Suggestions (assuming the
106	    primary use is in domain names):

108	        UniHost
109	        UTF-D ("D" for "domain names")
110	        UTF-37 (there are 37 characters in the output repertoire)
111	        NUDE (Normal Unicode Domain Encoding)

113	Overview

115	    AMC-ACE-O maps characters to characters--it does not consume or
116	    produce code points, code units, or bytes, although the algorithm
117	    makes use of code points, and implementations will of course need to
118	    represent the input and output characters somehow, usually as bytes
119	    or other code units.

121	    Each character in the Unicode string is represented by an
122	    integral number of characters in the encoded string.  There is no
123	    intermediate bit string or octet string.

125	    The encoded string alternates between two modes: literal mode and
126	    base-32 mode.  LDH characters in the Unicode string are encoded
127	    literally, except that hyphen-minus is doubled.  Non-LDH characters
128	    in the Unicode string are encoded using base-32, in which each
129	    character of the encoded string represents five bits (a "quintet").
130	    A non-paired hyphen-minus in the encoded string indicates a mode
131	    change.

133	    In base-32 mode a variable-length code sequence of one to five
134	    quintets represents a delta, which is added to a reference point
135	    to yield a Unicode code point, which in turn represents a Unicode
136	    character.  (Surrogates, which are code units used by UTF-16 in
137	    pairs to refer to code points, are not used with AMC-ACE-O.)  There
138	    is one reference point for each code length; they are chosen by the
139	    encoder based on the input string and declared at the beginning
140	    of the encoded string, and never change.  Locality among the code
141	    points is discovered and exploited by the encoder to make the
142	    encoding more compact.

144	Base-32 characters

146	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
147	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
148	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
149	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
150	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
151	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
152	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
153	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
154	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
155	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
156	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
157	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
158	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
159	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
160	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
161	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

163	    The digits "0" and "1" and the letters "o" and "l" are not used, to
164	    avoid transcription errors.

166	    All decoders must recognize both the uppercase and lowercase
167	    forms of the base-32 characters.  The case may or may not convey
168	    information, as described in section "Case sensitivity models".

170	Encoding and decoding algorithms

172	    The algorithms are given below as commented pseudocode.  All
173	    ordering of bits and quintets is big-endian (most significant
174	    first).  The >> and << operators used below mean bit shift, as in
175	    C.  For >> there is no question of logical versus arithmetic shift
176	    because AMC-ACE-O makes no use of negative numbers.

178	    primitives:
179	      to_codepoint()   # maps a character to a Unicode code point
180	      from_codepoint() # maps a Unicode code point to a character
181	        # These are no-ops if the implementation represents characters
182	        # using Unicode code points.

184	    subroutine names:
185	      encode           # main encoding function
186	      decode           # main decoding function
187	      find_refpoint    # scan the reference points for a suitable one
188	      encode_point     # encode one code point as base-32
189	      decode_point     # decode one code point from base-32
190	      choose_refpoints # choose good reference points for the input
191	      census           # used by choose_refpoints
192	      encode_refpoints # encode the reference points
193	      decode_refpoints # decode the reference points
194	      bootstrap        # used by en/decode_refpoints

196	    shared variables:  # All others are local to each subroutine.
197	      the input/output strings taken/returned by encode() and decode()
198	      array refpoint[1..5]  # refpoint[k] is for sequences of length k
199	      # The rest are used only by the encoder:
200	      array prefix[1..3]    # prefix[k] is used to encode refpoint[k]
201	      integers best_count, best_refpoint

203	    constants:
204	      array special_refpoint[0..7] =
205	        0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270
206	          # Generally, prefix[k] << (4*k) == refpoint[k],
207	          # but for prefix[2] == 0xD8..0xDF, refpoint[2] ==
208	          # special_refpoint[0..7] respectively.  These prefixes would
209	          # not otherwise be used because they correspond to surrogates.
210	          # These special reference points are used to assist the Latin
211	          # script because, unlike almost every other small script,
212	          # Latin is split across multiple rows with inconvenient
213	          # boundaries, and therefore has a hard time compressing well.

215	    function encode(input string):
216	      if any input character's codepoint is outside 0..10FFFF then fail
217	        # Too-large values could cause array bounds errors later.
218	      choose_refpoints()
219	      encode_refpoints()
220	      let literal = false
221	      for each character in the input string (in order) do begin
222	        if the character is hyphen-minus then output two hyphen-minuses
223	        else if the character is an LDH character then begin
224	          if not literal then output hyphen-minus and toggle literal
225	          output the character
226	        end
227	        else begin
228	          if literal then output hyphen-minus and toggle literal
229	          encode_point(to_codepoint(character))
230	        end
231	      end
232	      return the output string

234	    function decode(input string):
235	      decode_refpoints()
236	      let literal = false
237	      while not end-of-input do begin
238	        if the next character is hyphen-minus then begin
239	          consume the character
240	          if the next character is hyphen-minus then consume it
241	          else toggle literal
242	        end
243	        else if literal then consume character and output it
244	        else output from_codepoint(decode_point())
245	      end
246	      let check = encode(the output string)
247	      if check != the input string then fail
248	        # This comparison must be case-insensitive if ACEs are always
249	        # compared case-insensitively (which is true of domain names),
250	        # case-sensitive otherwise.  See also section "Case sensitivity
251	        # models".  This check is necessary to guarantee the uniqueness
252	        # property (there cannot be two distinct encoded strings
253	        # representing the same Unicode string).
254	      return the output string

256	    function find_refpoint(start,n):
257	      let i = start
258	      while n < refpoint[k] or (n - refpoint[k]) >> (4*k) != 0
259	        do increment i
260	      return i

262	    procedure encode_point(n):
263	      let k = find_refpoint(1,n)
264	      let delta = n - refpoint[k]
265	      extract the k least significant nybbles of delta
266	        # A nybble is 4 bits.
267	      prepend 0 to the last nybble and prepend 1 to the rest
268	      output the base-32 characters corresponding to the quintets
269	    function decode_point():
270	      input characters and convert them to quintets until a quintet
271	        beginning with 0 is obtained (expect at most four quintets
272	        beginning with 1)
273	      fail upon encountering anything unexpected
274	      let k = the number of quintets obtained
275	      strip the first bit of each quintet
276	      concatenate the resulting nybbles to form delta
277	      return refpoint[k] + delta

279	    procedure encode_refpoints():
280	        # refpoint[4..5] always end up as 0 and 0x10000.
281	        # refpoint[1..3] are implied by prefix[1..3], which are encoded
282	        # in reverse order because that often yields a compact encoding.
283	      let refpoint[1..2] = 0, 0x10
284	      for k = 3 down to 1 do begin
285	        encode_point(prefix[k])
286	        bootstrap(k, prefix[k])
287	      end

289	    procedure decode_refpoints():
290	      let refpoint[1..5] = 0, 0x10, 0, 0, 0x10000
291	      for k = 3 down to 1 do bootstrap(k, decode_point())

293	    procedure bootstrap(k,p):
294	        # The prefixes need to be left-shifted to become reference
295	        # points.  As this happens, the current reference points often
296	        # become helpful for encoding/decoding the next prefix.
297	      for j = 4 down to 2 do let refpoint[j] = refpoint[j-1] << 4
298	      if k == 2 and 0xD8 <= p <= 0xDF
299	      then let refpoint[1] = special_refpoint[p - 0xD8] >> 4
300	      else let refpoint[1] = p << 4

302	    procedure choose_refpoints():
303	        # First choose refpoint[1] so that it will be used as often as
304	        # possible, then choose refpoint[2] similarly, then refpoint[3].
305	      let refpoint[1..5] = 0, 0, 0, 0, 0x10000
306	      let prefix[1..3] = 0, 0, 0, 0
307	      for k = 1 to 3 do begin
308	        let best_count = 0
309	        let best_refpoint = 0
310	        # Try the input code point prefixes, then the special prefixes:
311	        for each input character in order
312	          do census(k, to_codepoint(character) >> (4*k))
313	        if k == 2 then for i = 0 to 7 do census(k, 0xD8 + i)
314	        if k == 3 then census(k,0xD)
315	        let refpoint[k] = best_refpoint
316	      end
317	    function census(k,p):
318	        # Determine how many times the reference point corresponding to
319	        # prefix p would be used to encode input characters and other
320	        # reference points if it were chosen as refpoint[k], and update
321	        # best_count, best_refpoint, and prefix[k] accordingly.
322	      if k == 2 and 0xD8 <= p <= 0xDF
323	      then let refpoint[k] = special_refpoint[p - 0xD8]
324	      else let refpoint[k] = p << (4*k)
325	      let count = the number of non-LDH input characters for which
326	        find_refpoint(1, to_codepoint(character)) == k
327	        # Don't forget the non-LDH requirement.
328	      increment count once for each i such that 1 <= i <= k and
329	        find_refpoint(i+1, prefix[i] << (4*i)) == k
330	      if count > best_count then begin
331	        let best_count = count
332	        let best_refpoint = refpoint[k]
333	        let prefix[k] = p
334	      end

336	Signature

338	    The issue of how to distinguish ACE strings from unencoded strings
339	    is largely orthogonal to the encoding scheme itself, and is
340	    therefore not specified here.  In the context of domain name labels,
341	    a standard prefix and/or suffix (chosen to be unlikely to occur
342	    naturally) would presumably be attached to ACE labels.  (In that
343	    case, it would probably be good to forbid the encoding of Unicode
344	    strings that appear to match the signature, to avoid confusing
345	    humans about whether they are looking at a Unicode string or an ACE
346	    string.)

348	    In order to use AMC-ACE-O in domain names, the choice of signature
349	    must be mindful of the requirement in [RFC952] that labels never
350	    begin or end with hyphen-minus.  The raw encoded string will never
351	    begin with a hyphen-minus, and will end with a hyphen-minus iff the
352	    Unicode string ends with a hyphen-minus.  If the Unicode strings
353	    are forbidden from ending with hyphen-minus (which seems prudent
354	    anyway), then there is no problem.  Otherwise, AMC-ACE-O would need
355	    to use a suffix as the signature.

357	    It appears that "---" is extremely rare in domain names; among the
358	    four-character prefixes of all the second-level domains under .com,
359	    .net, and .org, "---" never appears at all.  Therefore, perhaps the
360	    signature should be of the form ?--- (prefix) or ---? (suffix),
361	    where ? could be "u" for Unicode, or "i" for internationalized, or
362	    "a" for ACE, or maybe "q" or "z" because they are rare.

364	Case sensitivity models

366	    The higher layer must choose one of the following four models.

368	    Models suitable for domain names:

370	      * Case-insensitive:  Before a string is encoded, all its non-LDH
371	        characters must be case-folded so that any strings differing
372	        only in case become the same string (for example, strings could
373	        be forced to lowercase).  Folding LDH characters is optional.
374	        The case of base-32 characters and literal-mode characters is
375	        arbitrary and not significant.  Comparisons between encoded
376	        strings must be case-insensitive.  The original case of non-LDH
377	        characters cannot be recovered from the encoded string.

379	      * Case-preserving:  The case of the Unicode characters is not
380	        considered significant, but it can be preserved and recovered,
381	        just like in non-internationalized host names.  Before a string
382	        is encoded, all its non-LDH characters must be case-folded
383	        as in the previous model.  LDH characters are naturally able
384	        to retain their case attributes because they are encoded
385	        literally.  The case attribute of a non-LDH character is
386	        recorded in the last of the base-32 characters that represent
387	        it, which is guaranteed to be a letter rather than a digit.
388	        If the base-32 character is uppercase, it means the Unicode
389	        character is caseless or should be forced to uppercase after
390	        being decoded (which is a no-op if the case folding already
391	        forces to uppercase).  If the base-32 character is lowercase,
392	        it means the Unicode character is caseless or should be forced
393	        to lowercase after being decoded (which is a no-op if the case
394	        folding already forces to lowercase).  The case of the other
395	        base-32 characters in a multi-quintet encoding is arbitrary
396	        and not significant.  Only uppercase and lowercase attributes
397	        can be recorded, not titlecase.  Comparisons between encoded
398	        strings must be case-insensitive, and are equivalent to
399	        case-insensitive comparisons between the Unicode strings.  The
400	        intended mixed-case Unicode string can be recovered as long as
401	        the encoded characters are unaltered, but altering the case of
402	        the encoded characters is not harmful--it merely alters the case
403	        of the Unicode characters, and such a change is not considered
404	        significant.

406	        In this model, the input to the encoder and the output of the
407	        decoder can be the unfolded Unicode string (in which case the
408	        encoder and decoder are responsible for performing the case
409	        folding and recovery), or can be the folded Unicode string
410	        accompanied by separate case information (in which case the
411	        higher layer is responsible for performing the case folding and
412	        recovery).  Whichever layer performs the case recovery must
413	        first verify that the Unicode string is properly folded, to
414	        guarantee the uniqueness of the encoding.

416	        It is not very difficult to extend the nameprep algorithm
417	        [NAMEPREP03] to remember case information.

419	    The case-insensitive and case-preserving models are interoperable.
420	    If a domain name passes from a case-preserving entity to a
421	    case-insensitive entity, the case information will be lost, but
422	    the domain name will still be equivalent.  This phenomenon already
423	    occurs with non-internationalized domain names.

425	    Models unsuitable for domain names, but possibly useful in other
426	    contexts:

428	      * Case-sensitive:  Unicode strings may contain both uppercase and
429	        lowercase characters, which are not folded.  Base-32 characters
430	        must be lowercase.  Comparisons between encoded strings must be
431	        case-sensitive.

433	      * Case-flexible:  Like case-preserving, except that the choice
434	        of whether the case of the Unicode characters is considered
435	        significant is deferred.  Therefore, base-32 characters must
436	        be lowercase, except for those used to indicate uppercase
437	        Unicode characters.  Comparisons between encoded strings may be
438	        case-sensitive or case-insensitive, and such comparisons are
439	        equivalent to the corresponding comparisons between the Unicode
440	        strings.

442	Comparison with RACE, BRACE, LACE, DUDE, AMC-ACE-M

444	    In this section we compare AMC-ACE-O and five other ACEs: RACE
445	    [RACE03], BRACE [BRACE00], LACE [LACE01], DUDE [DUDE01], and
446	    AMC-ACE-M [AMCACEM00].  We do not include SACE [SACE], UTF-5 [UTF5],
447	    or UTF-6 [UTF6] in the comparison, because SACE appears obviously
448	    too complex, UTF-5 appears obviously too inefficient, and UTF-6 can
449	    never be more efficient than its similarly simple successor, DUDE.

451	    Complexity is hard to measure.  This author would subjectively
452	    describe the complexity of the algorithms as:

454	        RACE, LACE, DUDE: fairly simple but not trivial
455	               AMC-ACE-O: moderate
456	               AMC-ACE-M: fairly complex
457	                   BRACE: complex

459	    AMC-ACE-O is very similar to AMC-ACE-M, but is simpler because it
460	    discards the "wide" encoding style, and uses a different method for
461	    choosing and encoding the reference points that has fewer special
462	    cases and more reuse of logic already needed for encoding the
463	    Unicode characters.

465	    Implementations can be long and straightforward, or short and
466	    subtle, but for whatever it's worth, here are the code sizes of
467	    three of the algorithms that were implemented by this author in
468	    similar styles:

470	      AltDUDE: 130 lines @@@@@@@@@@@@@@@@@@@
471	         DUDE: 135 lines @@@@@@@@@@@@@@@@@@@
472	    AMC-ACE-O: 234 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
473	    AMC-ACE-M: 324 lines @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
474	    (AltDUDE [AltDUDE00] is a variant of DUDE that is not worth
475	    including separately in the rest of the comparison because it is
476	    practically identical to DUDE.  Not counted in the code sizes are
477	    blank lines, lines containing only comments or only a single brace,
478	    and wrapper code for testing.  BRACE was also implemented by this
479	    author, but it was a less general implementation, with bounded input
480	    and output sizes.)

482	    If a different implementation style were to alter the code sizes
483	    additively, or multiplicatively, or a combination thereof, AMC-ACE-O
484	    would remain about halfway between DUDE and AMC-ACE-M.

486	    Case preservation support:

488	        DUDE, AMC-ACE-M, AMC-ACE-O:  all characters
489	                             BRACE:  only the letters A-Z, a-z
490	                        RACE, LACE:  none

492	    RACE, BRACE, and LACE transform the Unicode string to an
493	    intermediate bit string, then into a base-32 string, so there is
494	    no particular alignment between the base-32 characters and the
495	    Unicode characters.  DUDE, AMC-ACE-M, and AMC-ACE-O do not have
496	    this intermediate stage, and enforce alignment between the base-32
497	    characters and the Unicode characters, which facilitates the case
498	    preservation.

500	    The relative efficiency of the various algorithms is suggested
501	    by the sizes of the encodings in section "Example strings".  The
502	    lengths of examples A-K (which are the same sentence translated into
503	    a languages from a variety of language families using a variety
504	    of scripts) are shown graphically below for each ACE, scaled by a
505	    factor of 0.4 so they fit on one line, and sorted so they look like
506	    a cummulative distribution.  The fictional "Super-ACE" encodes its
507	    input using whichever of the other six ACEs is shortest for that
508	    input.

510	    RACE:
511	      A Arabic      29 @@@@@@@@@@@@
512	      B Chinese     31 @@@@@@@@@@@@
513	      J Taiwanese   31 @@@@@@@@@@@@
514	      D Hebrew      37 @@@@@@@@@@@@@@@
515	      H Russian     47 @@@@@@@@@@@@@@@@@@@
516	      E Hindi       50 @@@@@@@@@@@@@@@@@@@@
517	      F Japanese    60 @@@@@@@@@@@@@@@@@@@@@@@@
518	      I Spanish     66 @@@@@@@@@@@@@@@@@@@@@@@@@@
519	      C Czech       68 @@@@@@@@@@@@@@@@@@@@@@@@@@@
520	      G Korean      79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
521	      K Vietnamese 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
522	    LACE:
523	      B Chinese     28 @@@@@@@@@@@
524	      A Arabic      31 @@@@@@@@@@@@
525	      J Taiwanese   31 @@@@@@@@@@@@
526	      D Hebrew      39 @@@@@@@@@@@@@@@@
527	      H Russian     48 @@@@@@@@@@@@@@@@@@@
528	      E Hindi       52 @@@@@@@@@@@@@@@@@@@@@
529	      F Japanese    52 @@@@@@@@@@@@@@@@@@@@@
530	      C Czech       58 @@@@@@@@@@@@@@@@@@@@@@@
531	      I Spanish     68 @@@@@@@@@@@@@@@@@@@@@@@@@@@
532	      G Korean      79 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
533	      K Vietnamese 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

535	    DUDE:
536	      A Arabic      25 @@@@@@@@@@
537	      B Chinese     26 @@@@@@@@@@
538	      D Hebrew      33 @@@@@@@@@@@@@
539	      J Taiwanese   36 @@@@@@@@@@@@@@
540	      H Russian     38 @@@@@@@@@@@@@@@
541	      C Czech       43 @@@@@@@@@@@@@@@@@
542	      F Japanese    49 @@@@@@@@@@@@@@@@@@@@
543	      E Hindi       58 @@@@@@@@@@@@@@@@@@@@@@@
544	      I Spanish     59 @@@@@@@@@@@@@@@@@@@@@@@@
545	      K Vietnamese  81 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
546	      G Korean      89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

548	    AMC-ACE-O:
549	      B Chinese     24 @@@@@@@@@@
550	      A Arabic      28 @@@@@@@@@@@
551	      J Taiwanese   30 @@@@@@@@@@@@
552	      D Hebrew      31 @@@@@@@@@@@@
553	      C Czech       34 @@@@@@@@@@@@@@
554	      H Russian     40 @@@@@@@@@@@@@@@@
555	      F Japanese    41 @@@@@@@@@@@@@@@@
556	      I Spanish     49 @@@@@@@@@@@@@@@@@@@@
557	      E Hindi       54 @@@@@@@@@@@@@@@@@@@@@@
558	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
559	      G Korean      80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

561	    BRACE:
562	      B Chinese     22 @@@@@@@@@
563	      A Arabic      26 @@@@@@@@@@
564	      J Taiwanese   27 @@@@@@@@@@@
565	      D Hebrew      33 @@@@@@@@@@@@@
566	      C Czech       36 @@@@@@@@@@@@@@
567	      F Japanese    40 @@@@@@@@@@@@@@@@
568	      H Russian     42 @@@@@@@@@@@@@@@@@
569	      E Hindi       45 @@@@@@@@@@@@@@@@@@
570	      I Spanish     48 @@@@@@@@@@@@@@@@@@@
571	      K Vietnamese  72 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
572	      G Korean      78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
573	    AMC-ACE-M:
574	      B Chinese     23 @@@@@@@@@
575	      J Taiwanese   27 @@@@@@@@@@@
576	      A Arabic      28 @@@@@@@@@@@
577	      D Hebrew      31 @@@@@@@@@@@@
578	      C Czech       34 @@@@@@@@@@@@@@
579	      H Russian     38 @@@@@@@@@@@@@@@
580	      F Japanese    42 @@@@@@@@@@@@@@@@@
581	      I Spanish     48 @@@@@@@@@@@@@@@@@@@
582	      E Hindi       54 @@@@@@@@@@@@@@@@@@@@@@
583	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
584	      G Korean      71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

586	    Super-ACE:
587	      B Chinese     22 @@@@@@@@@
588	      A Arabic      25 @@@@@@@@@@
589	      J Taiwanese   27 @@@@@@@@@@@
590	      D Hebrew      31 @@@@@@@@@@@@
591	      C Czech       34 @@@@@@@@@@@@@@
592	      H Russian     38 @@@@@@@@@@@@@@@
593	      F Japanese    40 @@@@@@@@@@@@@@@@
594	      E Hindi       45 @@@@@@@@@@@@@@@@@@
595	      I Spanish     48 @@@@@@@@@@@@@@@@@@@
596	      K Vietnamese  69 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
597	      G Korean      71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

599	    totals:
600	             RACE: 610 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
601	             LACE: 595 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
602	             DUDE: 537 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
603	        AMC-ACE-O: 480 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
604	            BRACE: 469 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
605	        AMC-ACE-M: 465 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
606	        Super-ACE: 450 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

608	    worst cases:
609	             RACE: 112 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
610	             LACE: 109 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
611	             DUDE:  89 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
612	        AMC-ACE-O:  80 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
613	            BRACE:  78 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
614	        AMC-ACE-M:  71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
615	        Super-ACE:  71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@

617	    The totals and worst cases above give more weight to languages
618	    that produce longer encodings, which arguably yields a good metric
619	    (because being efficient for easy languages is arguably less
620	    important than being efficient for difficult languages).  We can
621	    alternatively give each language equal weight by dividing each
622	    output length by the corresponding Super-ACE output length.  This
623	    method yields:

625	    totals:
626	           RACE: 14.9 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
627	           LACE: 14.5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
628	           DUDE: 13.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
629	      AMC-ACE-O: 11.7 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
630	          BRACE: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
631	      AMC-ACE-M: 11.4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
632	      Super-ACE: 11.0 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

634	    worst cases:
635	           RACE: 2.00 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
636	           LACE: 1.71 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
637	           DUDE: 1.33 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
638	      AMC-ACE-O: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
639	      AMC-ACE-M: 1.20 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
640	          BRACE: 1.11 @@@@@@@@@@@@@@@@@@@@@@@@@@@@
641	      Super-ACE: 1.00 @@@@@@@@@@@@@@@@@@@@@@@@@

643	    No matter which way we average, the results suggest that DUDE is
644	    preferrable to RACE and LACE, because it has similar simplicity, is
645	    more efficient, and has better support for case preservation.

647	    The results also suggest that AMC-ACE-M is preferrable to BRACE,
648	    because it has similar efficiency, is a little simpler, and has
649	    better support for case preservation.

651	    DUDE, AMC-ACE-O, and AMC-ACE-M are progressively more complex and
652	    more efficient, and have equal support for case preservation.  The
653	    choice depends on how much efficiency is required and how much
654	    complexity is acceptable.

656	    The efficiency gap between AMC-ACE-M and AMC-ACE-O is mostly due
657	    to the Korean (Hangul) string.  Of the 15 characters by which the
658	    AMC-ACE-M total beats the AMC-ACE-O total, 9 come from that string,
659	    for which AMC-ACE-M had an output length of 71, compared to about 80
660	    for all the other ACEs.

662	Example strings

664	    In the ACE encodings below, signatures (like "bq--" for RACE) are
665	    not shown.  Non-LDH characters in the Unicode string are forced to
666	    lowercase before being encoded using BRACE, RACE, and LACE.  For
667	    RACE and LACE, the letters A-Z are likewise forced to lowercase.
668	    UTF-8 and UTF-16 are included for length comparisons, with non-ASCII
669	    bytes shown as "?".  AMC-ACE-M and AMC-ACE-O are abbreviated AMC-M
670	    and AMC-O.  Backslashes show where line breaks have been inserted in
671	    ACE strings too long for one line.  The RACE and LACE encodings are
672	    courtesy of Mark Davis's online UTF converter [UTFCONV] (slightly
673	    modified to remove the length restrictions).

675	    The first several examples are all translations of the sentence "Why
676	    can't they just speak in ?" (courtesy of Michael Kaplan's
677	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
678	    been removed, as is often done in domain names.

680	    (A) Arabic (Egyptian):
681	        U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
682	        U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F
683	        DUDE:   m44qnli7oqk3kloj4phi8kahf
684	        BRACE:  28akcjwcmp3ciwb4t3ngd4nbaz
685	        AMC-O:  ageekhfuhuiukdefivevjvbuiktr
686	        AMC-M:  agiekhfuhuiukdefivevjvbuiktr
687	        RACE:   azceur2fe4ucuq2eivediojrfbfb6
688	        LACE:   cedeisshiutsqksdircuqnbzgeueuhy
689	        UTF-16: ??????????????????????????????????
690	        UTF-8:  ??????????????????????????????????

692	    (B) Chinese (simplified):
693	        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

695	        UTF-16: ??????????????????
696	        BRACE:  kgcqqsgp26i5h4zn7req5i
697	        AMC-M:  uqj7g8nvk6awispn9wupdnh
698	        AMC-O:  eqpg8nvk6awisp259eupyx2h
699	        DUDE:   ked6ucjas0k8gdobf4ke2dm587
700	        UTF-8:  ???????????????????????????
701	        LACE:   azhnn3b2ybea2aml6qau4libmwdq
702	        RACE:   3bhnmtxmjy5e5qcojbha3c7ujywwlby

704	    (C) Czech: Proprostnemluvesky

706	         = U+010D
707	         = U+011B
708	         = U+00ED

710	        UTF-8:  Pro??prost??nemluv????esky
711	        AMC-O:  piq-Pro-p-prost-9m-nemluv-6pp-esky
712	        AMC-M:  g26-Pro-p-prost-9m-nemluv-6pp-esky
713	        BRACE:  i32-Pro-u-prost-8y-nemluv-29f3n-esky
714	        DUDE:   N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9
715	        UTF-16: ????????????????????????????????????????????
716	        LACE:   amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\
717	                lpe
718	        RACE:   ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\
719	                7mx7xh73l754q

721	    (D) Hebrew:
722	        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
723	        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
724	        U+05D1 U+05E8 U+05D9 U+05EA

726	        AMC-O:  afpnqeep8e8jfinaqdb8ijp8cb8ij8k
727	        AMC-M:  af4nqeep8e8jfinaqdb8ijp8cb8ij8k
728	        DUDE:   ldcukktu4pt5osgujhu8t9tu2t1u8t9ua
729	        BRACE:  27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c
730	        RACE:   axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u
731	        LACE:   cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q
732	        UTF-8:  ????????????????????????????????????????????
733	        UTF-16: ????????????????????????????????????????????

735	    (E) Hindi:
736	        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
737	        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
738	        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
739	        U+0939 U+0948 U+0902  (Devanagari)
740	        BRACE:  2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32
741	        RACE:   bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai
742	        LACE:   dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba
743	        AMC-O:  ajeurvjvcmthvjvruipugatfpurmscuivjascunmvcvitfuehvjisc
744	        AMC-M:  ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc
745	        DUDE:   p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\
746	                8g2
747	        UTF-16: ???????????????????????????????????????????????????????\
748	                ?????
749	        UTF-8:  ???????????????????????????????????????????????????????\
750	                ???????????????????????????????????

752	    (F) Japanese:
753	        U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
754	        U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
755	        (kanji and hiragana)

757	        UTF-16: ????????????????????????????????????
758	        BRACE:  ji8nr5zj8uqth7v97mjchakwcg7dqemw88nj5gbe
759	        AMC-O:  gvagkxnzr3dkx8fzun243q3c24zbxhgwr2nkweqwm
760	        AMC-M:  bsnkxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm
761	        DUDE:   j06alcnfp3mam5e5n2coa9ej092oa71j057m6kfocmak4mekb
762	        LACE:   auyguxd7snvaczpfaftsyamktyatbeqbrjyqqmcxmzhyy2senzfq
763	        UTF-8:  ??????????????????????????????????????????????????????
764	        RACE:   3aygumc4gb7tbezqnjs6kzzmrkpdbeukoeyfomdggbhtbdbqniyeimd\
765	                ogbfq

767	    (G) Korean:
768	        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
769	        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
770	        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C  (Hangul syllables)

772	        UTF-16: ????????????????????????????????????????????????
773	        UTF-8:  ???????????????????????????????????????????????????????\
774	                ?????????????????
775	        AMC-M:  yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\
776	                p6aqi23ieemweywn
777	        BRACE:  y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\
778	                gmnsqydmk7im56arju6pt82
779	        LACE:   77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
780	                exj2mlpfzzcyjrsely5ck4ta
781	        RACE:   3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
782	                exj2mlpfzzcyjrsely5ck4ta
783	        AMC-O:  m6hwq6tvi466exi44ia6s4nz2neze7xxn47yp6x5e3znze7xze7xxnu\
784	                8e4ze6x5n36is3i622mwe48wn
785	        DUDE:   s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\
786	                574lcr2e4q74s5bcr9c8g98s88bn44qe4c

788	    (H) Russian:
789	        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
790	        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
791	        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
792	        U+0438  (Cyrillic)
793	        DUDE:   K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao
794	        AMC-M:  aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca
795	        AMC-O:  aedRqwhfnwdgfqpipfdqcqwawrwcrqwawdwbwbki
796	        BRACE:  269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2
797	        RACE:   aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa
798	        LACE:   dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory
799	        UTF-16: ???????????????????????????????????????????????????????\
800	                ???
801	        UTF-8:  ???????????????????????????????????????????????????????
802	                ???

804	    (I) Spanish: PorqunopuedensimplementehablarenEspaol

806	         = U+00E9
807	         = U+00F1

809	        UTF-8:  Porqu??nopuedensimplementehablarenEspa??ol
810	        AMC-M:  aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol
811	        BRACE:  22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol
812	        AMC-O:  aaq-Porqu-j-nopuedensimplementehablarenEspa-9b-ol
813	        DUDE:   N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\
814	                1mfs
815	        RACE:   abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\
816	                fonygd4lpnq
817	        LACE:   faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\
818	                omvzxayprn5wa
819	        UTF-16: ???????????????????????????????????????????????????????\
820	                ?????????????????????????

822	    (J) Taiwanese:
823	        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

825	        UTF-16: ??????????????????
826	        UTF-8:  ???????????????????????????
827	        AMC-M:  uqj7g2tbgtu6a385pspnxkupdnh
828	        BRACE:  kgcqui49gatc2wyrn8y7cndgte9
829	        AMC-O:  eqpgxstbzuvc6a385psp244kupyx2h
830	        RACE:   3bhnmuaroize5qe6xvha3cvkjywwlby
831	        LACE:   75hnmuaroize5qe6xvha3cvkjywwlby
832	        DUDE:   ked6l011n232kec0pebdke0doaaake2dm587

834	    (K) Vietnamese:
835	        Taisaohokhngthchi\
836	        noitingVit

838	          = U+0323
839	             = U+00F4
840	             = U+00EA
841	         = U+0309
842	             = U+0301
843	        UTF-8:  Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t
844	        AMC-O:  aava-Ta-vud-isaoho-vud-kh-9e-ngth-8kj-chi-j-no-b-iti-8k\
845	                b-ngVi-8kvud-t
846	        AMC-M:  ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\
847	                b-ngVi-s8kud-t
848	        BRACE:  i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\
849	                9c29-ngVi-25p48-t
850	        UTF-16: ???????????????????????????????????????????????????????\
851	                ?????????????????????
852	        DUDE:   N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\
853	                9n4m9uaj01g6enN6m9uaj23g74
854	        LACE:   aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\
855	                ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi
856	        RACE:   ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\
857	                kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\
858	                3u

860	    The next several examples are all names of Japanese music artists,
861	    song titles, and TV programs, just because the author happens to
862	    have them handy (but Japanese is useful for providing examples
863	    of single-row text, two-row text, ideographic text, and various
864	    mixtures thereof).

866	    (L) 3B  (Japanese TV program title)

868	                      = U+5E74                       (kanji)
869	                     = U+7D44                       (kanji)
870	         = U+91D1 U+516B U+5148 U+751F  (kanji)

872	        UTF-16: ????????????????
873	        UTF-8:  3???B???????????????
874	        AMC-M:  utk-3-8ze-B-hkenqtymwifi9
875	        BRACE:  u-3-ygj-b-ynb6gjc7pp4k5p5w
876	        AMC-O:  fb8h-3-e-B-z7we3t7bymwizxtr
877	        DUDE:   j3le74G062nd44p1d1l16bk8n51f
878	        RACE:   3aadgxtuabrh2rer2fiwwukioupq
879	        LACE:   74adgxtuabrh2rer2fiwwukioupq

881	    (M) -with-SUPER-MONKEYS  (Japanese music group name)

883	         = U+5B89 U+5BA4 U+5948 U+7F8E U+6075  (kanji)

885	        UTF-8:  ??????????????????-with-SUPER-MONKEYS
886	        AMC-M:  u5m2j4etwif6q2zf---with--SUPER--MONKEYS
887	        AMC-O:  fmij4e3wiz92qyszf---with--SUPER--MONKEYS
888	        BRACE:  uvj7fuaqcahy982xa---with--SUPER--MONKEYS
889	        DUDE:   lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J
890	        UTF-16: ????????????????????????????????????????????????
891	        LACE:   ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y
892	        RACE:   3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\
893	                nabwqa3yanyagwadfab4qa4y

895	    (N) Hello-Another-Way-  (Japanese song title)

897	         = U+305D U+308C U+305E U+308C U+306E  (hiragana)
898	                = U+5834 U+6240                       (kanji)
899	        UTF-8:  Hello-Another-Way-?????????????????????
900	        BRACE:  ji7-Hello--Another--Way---v3jhaefvd2ufj62
901	        AMC-O:  daf-Hello--Another--Way---p2nq2nyqx2veyuwa
902	        AMC-M:  bsk-Hello--Another--Way---p2nq2nyqx2veyuwa
903	        DUDE:   M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240
904	        UTF-16: ??????????????????????????????????????????????????
905	        LACE:   ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea
906	        RACE:   3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\
907	                zaawtaxjqrqyf4memgbxfqndcia

909	    (O) 2  (Japanese TV program title)

911	         = U+3072 U+3068 U+3064  (hiragana)
912	            = U+5C4B U+6839         (kanji)
913	              = U+306E                (hiragana)
914	           = U+4E0B                (kanji)

916	        UTF-16: ????????????????
917	        UTF-8:  ?????????????????????2
918	        AMC-O:  dagzciex6wmy2vjqw8sm-2
919	        AMC-M:  bsnzciex6wmy2vjqw8sm-2
920	        BRACE:  ji96u56uwbhf2wqxnw4s-2
921	        DUDE:   j072m8klc4bm839j06eke0bg032
922	        RACE:   3ayhemdigbsfys3iheyg4tqlaaza
923	        LACE:   74yhemdigbsfys3iheyg4tqlaaza

925	    (P) MajiKoi5 (Japanese song title)

927	                = U+3067         (hiragana)
928	              = U+3059 U+308B  (hiragana)
929	         = U+79D2 U+524D  (kanji)

931	        UTF-8:  Maji???Koi??????5??????
932	        UTF-16: ??????????????????????????
933	        AMC-M:  bsm-Maji-r-Koi-b2m-5-z37cxuwp
934	        BRACE:  ji8-Maji-g-Koi-qe7x-5-wx7p6ma
935	        AMC-O:  dag-Maji-h-Koi-xj2m-5-z37cxuwp
936	        DUDE:   Mdhqpj067G06bvpj059obg035n9d2l24d
937	        RACE:   3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq
938	        LACE:   74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq

940	    (Q) de  (Japanese song title)

942	         = U+30D1 U+30D5 U+30A3 U+30FC  (katakana)
943	         = U+30EB U+30F3 U+30D0         (katakana)

945	        UTF-16: ??????????????
946	        BRACE:  3iu8pazt-de-pygi
947	        AMC-O:  dapbf4d9n-de-8m9da
948	        AMC-M:  bs3jp4d9n-de-8m9di
949	        RACE:   gdi5li7475sp6zpl6pia
950	        DUDE:   j0d1lq3vcg064lj0ebv3t0
951	        UTF-8:  ????????????de?????????
952	        LACE:   aqyndvnd7qbaazdfamyox46q
953	    (R)   (Japanese song title)

955	            = U+305D U+306E                (hiragana)
956	         = U+30B9 U+30D4 U+30FC U+30C9  (katakana)
957	              = U+3067                       (hiragana)

959	        RACE:   gbow5oou7tewo
960	        UTF-16: ??????????????
961	        BRACE:  bidprdmp9wt7mi
962	        LACE:   a4yf23vz2t6mszy
963	        AMC-O:  dagxpq5j7e9n6jh
964	        AMC-M:  bsmfyq5j7e9n6jr
965	        DUDE:   j05dmer9t4vcs9m7
966	        UTF-8:  ?????????????????????

968	    The last example is an ASCII string that breaks not only the
969	    existing rules for host name labels but also the rules proposed in
970	    [NAMEPREP03] for internationalized domain names.

972	    (S) -> $1.00 <-

974	        UTF-8:  -> $1.00 <-
975	        DUDE:   -jei0kj1iej0gi0jc-
976	        RACE:   aawt4ibegexdambahqwq
977	        LACE:   bmac2praeqys4mbqea6c2
978	        UTF-16: ??????????????????????
979	        AMC-O:  aac--vqae-1-q-00-avn--
980	        AMC-M:  aae--vqae-1-q-00-avn--
981	        BRACE:  229--t2b4-1-w-00-i9i--

983	Security considerations

985	    Users expect each domain name in DNS to be controlled by a single
986	    authority.  If a Unicode string intended for use as a domain label
987	    could map to multiple ACE labels, then an internationalized domain
988	    name could map to multiple ACE domain names, each controlled by
989	    a different authority, some of which could be spoofs that hijack
990	    service requests intended for another.  Therefore AMC-ACE-O is
991	    designed so that each Unicode string has a unique encoding.

993	    However, there can still be multiple Unicode representations of the
994	    "same" text, for various definitions of "same".  This problem is
995	    addressed to some extent by the Unicode standard under the topic
996	    of canonicalization, but some text strings may be misleading or
997	    ambiguous to humans when used as domain names, such as strings
998	    containing dots, slashes, at-signs, etc.  These issues are being
999	    further studied under the topic of "nameprep" [NAMEPREP03].

1001	Credits

1003	    AMC-ACE-O reuses a number of preexisting techniques.

1005	    The basic encoding of integers to nybbles to quintets to base-32
1006	    comes from UTF-5 [UTF5], and the particular variant used here comes
1007	    from AMC-ACE-M [AMCACEM00].

1009	    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
1010	    from SFS [SFS].

1012	    The idea of encoding deltas from reference points declared at the
1013	    beginning of the encoded string was taken from RACE (of which
1014	    the latest version is [RACE03]), which may have gotten the idea
1015	    from Unicode Technical Standard #6 [UTS6].  The latter also uses
1016	    predefined reference points in the Latin range.

1018	    From BRACE [BRACE00] comes the idea of switching between literal
1019	    mode and base-32 mode, and the technique of counting how many code
1020	    points fall within a window (as opposed to checking whether all do).

1022	    The general idea of using the alphabetic case of base-32 characters
1023	    to record the desired case of the Unicode characters was suggested
1024	    by this author, and first applied to the UTF-5-style encoding in
1025	    DUDE (of which the latest version is [DUDE01]).

1027	    The bootstrapping method of encoding reference points, which does
1028	    not require them to nest but takes advantage of nesting when it
1029	    occurs, is new in AMC-ACE-O.

1031	References

1033	    [AltDUDE00] Adam Costello, "AltDUDE version 0.0.2", 2001-Mar-19,
1034	    draft-ietf-idn-altdude-00.

1036	    [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
1037	    draft-ietf-idn-amc-ace-m-00.

1039	    [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based
1040	    ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19,
1041	    draft-ietf-idn-brace-00.

1043	    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
1044	    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

1046	    [IDN] Internationalized Domain Names (IETF working group),
1047	    http://www.i-d-n.net/, idn@ops.ietf.org.

1049	    [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII
1050	    Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01.

1052	    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
1053	    of Internationalized Host Names", 2001-Feb-24,
1054	    draft-ietf-idn-nameprep-03.

1056	    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
1057	    http://www.trigeminal.com/samples/provincial.html.

1059	    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
1060	    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

1062	    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
1063	    Table Specification", 1985-Oct, RFC 952.

1065	    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
1066	    1987-Nov, RFC 1034.

1068	    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
1069	    "Requirements for Internet Hosts -- Application and Support",
1070	    1989-Oct, RFC 1123.

1072	    [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)",
1073	    draft-ietf-idn-sace-*.

1075	    [SFS] David Mazieres et al, "Self-certifying File System",
1076	    http://www.fs.net/.

1078	    [UNICODE] The Unicode Consortium, "The Unicode Standard",
1079	    http://www.unicode.org/unicode/standard/standard.html.

1081	    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
1082	    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

1084	    [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another
1085	    ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*.

1087	    [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed,
1088	    Mark Davis, Asmus Freytag, "Unicode Technical Standard
1089	    #6: A Standard Compression Scheme for Unicode",
1090	    http://www.unicode.org/unicode/reports/tr6/.

1092	    [UTFCONV] Mark Davis, "UTF Converter",
1093	    http://www.macchiato.com/unicode/convert.html.

1095	Author

1097	    Adam M. Costello 
1098	    http://www.cs.berkeley.edu/~amc/

1100	Example implementation

1102	/******************************************/
1103	/* amc-ace-o.c 0.0.0 (2001-Mar-17-Sat)    */
1104	/* Adam M. Costello  */
1105	/******************************************/

1107	/* This is ANSI C code (C89) implementing AMC-ACE-O version 0.0.*. */

1109	/************************************************************/
1110	/* Public interface (would normally go in its own .h file): */

1112	#include 

1114	enum amc_ace_status {
1115	  amc_ace_success,
1116	  amc_ace_invalid_input,
1117	  amc_ace_output_too_big
1118	};
1119	enum case_sensitivity { case_sensitive, case_insensitive };

1121	#if UINT_MAX >= 0x10FFFF
1122	typedef unsigned int u_code_point;
1123	#else
1124	typedef unsigned long u_code_point;
1125	#endif

1127	enum amc_ace_status amc_ace_o_encode(
1128	  unsigned int input_length,
1129	  const u_code_point *input,
1130	  const unsigned char *uppercase_flags,
1131	  unsigned int *output_size,
1132	  char *output );

1134	    /* amc_ace_o_encode() converts Unicode to AMC-ACE-O.  The input   */
1135	    /* must be represented as an array of Unicode code points         */
1136	    /* (not code units; surrogate pairs are not allowed), and the     */
1137	    /* output will be represented as null-terminated ASCII.  The      */
1138	    /* input_length is the number of code points in the input.  The   */
1139	    /* output_size is an in/out argument: the caller must pass        */
1140	    /* in the maximum number of characters that may be output         */
1141	    /* (including the terminating null), and on successful return     */
1142	    /* it will contain the number of characters actually output       */
1143	    /* (including the terminating null, so it will be one more than   */
1144	    /* strlen() would return, which is why it is called output_size   */
1145	    /* rather than output_length).  The uppercase_flags array must    */
1146	    /* hold input_length boolean values, where nonzero means the      */
1147	    /* corresponding Unicode character should be forced to uppercase  */
1148	    /* after being decoded, and zero means it is caseless or should   */
1149	    /* be forced to lowercase.  Alternatively, uppercase_flags may    */
1150	    /* be a null pointer, which is equivalent to all zeros.  The      */
1151	    /* letters a-z and A-Z are always encoded literally, regardless   */
1152	    /* of the corresponding flags.  The encoder always outputs        */
1153	    /* lowercase base-32 characters except when nonzero values        */
1154	    /* of uppercase_flags require otherwise, so the encoder is        */
1155	    /* compatible with any of the case models.  The return value      */
1156	    /* may be any of the amc_ace_status values defined above; if      */
1157	    /* not amc_ace_success, then output_size and output may contain   */
1158	    /* garbage.  On success, the encoder will never need to write an  */
1159	    /* output_size greater than input_length*5+10, because of how the */
1160	    /* encoding is defined.                                           */

1162	enum amc_ace_status amc_ace_o_decode(
1163	  enum case_sensitivity case_sensitivity,
1164	  char *scratch_space,
1165	  const char *input,
1166	  unsigned int *output_length,
1167	  u_code_point *output,
1168	  unsigned char *uppercase_flags );
1169	    /* amc_ace_o_decode() converts AMC-ACE-O to Unicode.  The input   */
1170	    /* must be represented as null-terminated ASCII, and the output   */
1171	    /* will be represented as an array of Unicode code points.        */
1172	    /* The case_sensitivity argument influences the check on the      */
1173	    /* well-formedness of the input string; it must be case_sensitive */
1174	    /* if case-sensitive comparisons are allowed on encoded strings,  */
1175	    /* case_insensitive otherwise (see also section "Case sensitivity */
1176	    /* models" of the AMC-ACE-O specification).  The scratch_space    */
1177	    /* must point to space at least as large as the input, which will */
1178	    /* get overwritten (this allows the decoder to avoid calling      */
1179	    /* malloc()).  The output_length is an in/out argument: the       */
1180	    /* caller must pass in the maximum number of code points that     */
1181	    /* may be output, and on successful return it will contain the    */
1182	    /* actual number of code points output.  The uppercase_flags      */
1183	    /* array must have room for at least output_length values, or it  */
1184	    /* may be a null pointer if the case information is not needed.   */
1185	    /* A nonzero flag indicates that the corresponding Unicode        */
1186	    /* character should be forced to uppercase by the caller, while   */
1187	    /* zero means it is caseless or should be forced to lowercase.    */
1188	    /* The letters a-z and A-Z are output already in the proper case, */
1189	    /* but their flags will be set appropriately so that applying the */
1190	    /* flags would be harmless.  The return value may be any of the   */
1191	    /* amc_ace_status values defined above; if not amc_ace_success,   */
1192	    /* then output_length, output, and uppercase_flags may contain    */
1193	    /* garbage.  On success, the decoder will never need to write     */
1194	    /* an output_length greater than the length of the input (not     */
1195	    /* counting the null terminator), because of how the encoding is  */
1196	    /* defined.                                                       */

1198	/**********************************************************/
1199	/* Implementation (would normally go in its own .c file): */

1201	#include 

1203	/* is_ldh(codept) returns 1 if the code point represents an LDH   */
1204	/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */

1206	static int is_ldh(u_code_point codept)
1207	{
1208	  return codept >  122 ? 0 :
1209	         codept >=  97 ? 1 :
1210	         codept >   90 ? 0 :
1211	         codept >=  65 ? 1 :
1212	         codept >   57 ? 0 :
1213	         codept >=  48 ? 1 :
1214	         codept ==  45      ;
1215	}

1217	/* is_AtoZ(c) returns 1 if c is an         */
1218	/* uppercase ASCII letter, zero otherwise. */

1220	static unsigned char is_AtoZ(char c)
1221	{
1222	  return c >= 65 && c <= 90;
1223	}
1224	/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
1225	/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
1226	/* then ASCII A-Z are considered equal to a-z respectively.           */

1228	static int unequal(
1229	  enum case_sensitivity case_sensitivity, const char *s1, const char *s2 )
1230	{
1231	  char c1, c2;

1233	  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

1235	  for (;;) {
1236	    c1 = *s1;
1237	    c2 = *s2;
1238	    if (c1 >= 65 && c1 <= 90) c1 += 32;
1239	    if (c2 >= 65 && c2 <= 90) c2 += 32;
1240	    if (c1 != c2) return 1;
1241	    if (c1 == 0) return 0;
1242	    ++s1, ++s2;
1243	  }
1244	}

1246	/* base32[q] is the lowercase base-32 character representing  */
1247	/* the number q from the range 0 to 31.  Note that we cannot  */
1248	/* use string literals for ASCII characters because an ANSI C */
1249	/* compiler does not necessarily use ASCII.                   */

1251	static const char base32[] = {
1252	  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
1253	  109, 110,                                               /* m-n */
1254	  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
1255	  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
1256	};

1258	/* base32_decode(c) returns the value of a base-32 character, in the */
1259	/* range 0 to 31, or the constant base32_invalid if c is not a valid */
1260	/* base-32 character.                                                */

1262	enum { base32_invalid = 32 };

1264	static unsigned int base32_decode(char c)
1265	{
1266	  if (c < 50) return base32_invalid;
1267	  if (c <= 57) return c - 26;
1268	  if (c < 97) c += 32;
1269	  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
1270	  return c - 97 - (c > 108) - (c > 111);
1271	}

1273	/* The decoder_state and encoder_state structures contains         */
1274	/* variables that are shared among several of the functions below. */

1276	struct decoder_state {
1277	  const char *in_next;                   /* unread part of ACE input      */
1278	  u_code_point refpoint[6];              /* reference points, [0] unused  */
1279	};
1280	struct encoder_state {
1281	  char *out_next, *out_end;              /* unwritten part of ACE output  */
1282	  const u_code_point *in_start, *in_end; /* entire Unicode input          */
1283	  u_code_point refpoint[6], prefix[4];   /* reference points and prefixes */
1284	  unsigned int best_count;               /* max found so far by census()  */
1285	  u_code_point best_refpoint;            /* corresponding reference point */
1286	};

1288	/* refpoint[k] is for base-32 sequences of length k, and prefix[k]  */
1289	/* is used to encode refpoint[k].  Generally, prefix[k] << (4*k)    */
1290	/* == refpoint[k], but for prefix[2] == 0xD8 + i, where i = 0..7,   */
1291	/* refpoint[2] == special_refpoint[i].  These prefixes, which would */
1292	/* otherwise correspond to surrogates, are instead used to encode   */
1293	/* special reference points that help the Latin script compress     */
1294	/* better, because unlike most other small scripts it is split      */
1295	/* across multiple rows with inconvenient boundaries.               */

1297	static const u_code_point special_refpoint[] =
1298	  { 0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270 };

1300	/* find_refpoint(refpoint,start,n) scans the refpoint array, starting */
1301	/* at position start, for a reference point suitable for encoding n,  */
1302	/* and returns the index of the first match.                          */

1304	unsigned int find_refpoint(
1305	  u_code_point refpoint[6], u_code_point start, u_code_point n )
1306	{
1307	  while ((n - refpoint[start]) >> (4*start) != 0) ++start;
1308	  return start;
1309	}

1311	/* encode_point(state,n) encodes n as a sequence of base-32         */
1312	/* characters representing a delta from a reference point.  The     */
1313	/* delta divided into a big-endian sequence of nybbles; each nybble */
1314	/* is expanded to a quintet with a highest bit of 0 for the last    */
1315	/* nybble, 1 for the others; and the quintets are mapped to base-32 */
1316	/* characters.  Returns amc_ace_success or amc_ace_output_too_big.  */

1318	enum amc_ace_status encode_point(struct encoder_state *state, u_code_point n)
1319	{
1320	  unsigned int k, i;

1322	  k = find_refpoint(state->refpoint, 1, n);
1323	  if (state->out_end - state->out_next < k) return amc_ace_output_too_big;
1324	  n -= state->refpoint[k];
1325	  i = k - 1;
1326	  state->out_next[i] = base32[n & 0xF];

1328	  while (i > 0) {
1329	    n >>= 4;
1330	    state->out_next[--i] = base32[0x10 | (n & 0xF)];
1331	  }

1333	  state->out_next += k;
1334	  return amc_ace_success;
1335	}
1336	/* decode_point(state,n) is the reverse of encode_point(): it */
1337	/* consumes base-32 characters and writes the code point into */
1338	/* *n.  Returns amc_ace_success or amc_ace_invalid_input.     */

1340	enum amc_ace_status decode_point(struct decoder_state *state, u_code_point *n)
1341	{
1342	  u_code_point q, delta = 0;
1343	  unsigned int k = 0;

1345	  do {
1346	    if (k >= 5) return amc_ace_invalid_input;
1347	    q = base32_decode(state->in_next[k++]);
1348	    if (q == base32_invalid) return amc_ace_invalid_input;
1349	    delta = (delta << 4) | (q & 0xF);
1350	  } while (q > 0xF);

1352	  state->in_next += k;
1353	  *n = state->refpoint[k] + delta;
1354	  return amc_ace_success;
1355	}

1357	/* census(state,k,p) sets refpoint[k] to the reference point  */
1358	/* corresponding to prefix p, then calculates how many times  */
1359	/* that reference point would get used, and sets prefix[k] to */
1360	/* p if the result exceeds the previous maximum.              */

1362	void census(struct encoder_state *state, unsigned int k, u_code_point p)
1363	{
1364	  unsigned int count, i;
1365	  u_code_point *refpoint = state->refpoint, *prefix = state->prefix;
1366	  const u_code_point *in;

1368	  refpoint[k] =
1369	    k == 2 && p - 0xD8 <= 7  ?  special_refpoint[p - 0xD8]  :  p << (4*k);

1371	  /* count times used to encode input code points: */

1373	  for (count = 0, in = state->in_start;  in < state->in_end;  ++in) {
1374	    if (!is_ldh(*in) && find_refpoint(refpoint, 1, *in) == k) ++count;
1375	  }

1377	  /* count times used to encode other reference points: */

1379	  for (i = 1;  i < k;  ++i) {
1380	    if (find_refpoint(refpoint, i+1, prefix[i] << (4*i)) == k) ++count;
1381	  }

1383	  if (count > state->best_count) {
1384	    state->best_count = count;
1385	    state->best_refpoint = refpoint[k];
1386	    prefix[k] = p;
1387	  }
1388	}

1390	/* bootstrap(refpoint,k,p) adjusts the existing reference points so */
1391	/* they can be used for encoding/decoding another reference point.  */
1392	void bootstrap(u_code_point refpoint[6], unsigned int k, u_code_point p)
1393	{
1394	  unsigned int j;

1396	  for (j = 4;  j >= 2;  --j) refpoint[j] = refpoint[j-1] << 4;
1397	  refpoint[1] =
1398	    k == 2 && p - 0xD8 <= 7  ?  special_refpoint[p - 0xD8] >> 4  :  p << 4;
1399	}

1401	/* Main encode function: */

1403	enum amc_ace_status amc_ace_o_encode(
1404	  unsigned int input_length,
1405	  const u_code_point *input,
1406	  const unsigned char *uppercase_flags,
1407	  unsigned int *output_size,
1408	  char *output )
1409	{
1410	  struct encoder_state dummy = {0} /* all zeros */, *state = &dummy;
1411	  const u_code_point *in, *in_end;
1412	  char *out_end;
1413	  unsigned int k, i;
1414	  u_code_point codept;
1415	  enum amc_ace_status status;
1416	  unsigned int literal;  /* boolean */

1418	  /* Initialization: */

1420	  state->out_next = output;
1421	  state->out_end = out_end = output + *output_size;
1422	  state->in_start = input;
1423	  state->in_end = in_end = input + input_length;

1425	  /* Verify that all code points are in 0..10FFFF: */

1427	  for (in = input;  in < in_end;  ++in) {
1428	    if (*in > 0x10FFFF) return amc_ace_invalid_input;
1429	  }

1431	  /* Choose the reference points:  Choose refpoint[1] so that it will */
1432	  /* be used as often as possible, then choose refpoint[2] similarly, */
1433	  /* then refpoint[3].                                                */

1435	  state->refpoint[5] = 0x10000;
1436	  /* refpoint[1..4] and prefix[1..3] are already 0 */

1438	  for (k = 1;  k <= 3;  ++k) {
1439	    state->best_count = 0;
1440	    state->best_refpoint = 0;
1441	    /* Try prefixes of the input code points, then the special prefixes: */
1442	    for (in = input;  in < in_end;  ++in) census(state, k, *in >> (4*k));
1443	    if (k == 2) for (i = 0;  i <= 7;  ++i) census(state, k, 0xD8 + i);
1444	    if (k == 3) census(state, k, 0xD);
1445	    state->refpoint[k] = state->best_refpoint;
1446	  }
1447	  /* Encode the reference points: */

1449	  state->refpoint[1] = 0;
1450	  state->refpoint[2] = 0x10;

1452	  for (k = 3;  k >= 1;  --k) {
1453	    status = encode_point(state, state->prefix[k]);
1454	    if (status != amc_ace_success) return status;
1455	    bootstrap(state->refpoint, k, state->prefix[k]);
1456	  }

1458	  /* Main encoding loop: */

1460	  literal = 0;

1462	  for (i = 0;  i < input_length;  ++i) {
1463	    codept = input[i];

1465	    if (codept == 45) {
1466	      /* hyphen-minus is doubled */
1467	      if (state->out_end - state->out_next < 2) return amc_ace_output_too_big;
1468	      *state->out_next++ = 45;
1469	      *state->out_next++ = 45;
1470	    }
1471	    else if (is_ldh(codept)) {
1472	      /* encode LDH character literally */

1474	      if (!literal) {
1475	        /* switch to literal mode by outputting hyphen-minus */
1476	        if (out_end - state->out_next < 1) return amc_ace_output_too_big;
1477	        *state->out_next++ = 45;
1478	        literal = 1;
1479	      }

1481	      if (out_end - state->out_next < 1) return amc_ace_output_too_big;
1482	      *state->out_next++ = codept;
1483	    }
1484	    else {
1485	      /* encode non-LDH character using base-32 */

1487	      if (literal) {
1488	        /* switch to base-32 mode by outputting hyphen-minus */
1489	        if (out_end - state->out_next < 1) return amc_ace_output_too_big;
1490	        *state->out_next++ = 45;
1491	        literal = 0;
1492	      }

1494	      status = encode_point(state,codept);
1495	      if (status != amc_ace_success) return status;
1496	      /* the last base-32 character can record the uppercase flag: */
1497	      if (uppercase_flags && uppercase_flags[i]) state->out_next[-1] -= 32;
1498	    }
1499	  }
1500	  /* null terminator: */
1501	  if (out_end - state->out_next < 1) return amc_ace_output_too_big;
1502	  *state->out_next++ = 0;
1503	  *output_size = state->out_next - output;
1504	  return amc_ace_success;
1505	}

1507	/* Main decode function: */

1509	enum amc_ace_status amc_ace_o_decode(
1510	  enum case_sensitivity case_sensitivity,
1511	  char *scratch_space,
1512	  const char *input,
1513	  unsigned int *output_length,
1514	  u_code_point *output,
1515	  unsigned char *uppercase_flags )
1516	{
1517	  struct decoder_state dummy = {0} /* all zeros */, *state = &dummy;
1518	  unsigned int k, next_out, max_out, input_size, scratch_size;
1519	  enum amc_ace_status status;
1520	  u_code_point p;
1521	  unsigned int literal;  /* boolean */
1522	  char c;

1524	  /* Initialization: */

1526	  state->in_next = input;
1527	  next_out = 0;
1528	  max_out = *output_length;

1530	  /* Decode the reference points: */

1532	  state->refpoint[2] = 0x10;
1533	  state->refpoint[5] = 0x10000;
1534	  /* refpoint[1,3,4] are already 0 */

1536	  for (k = 3;  k >= 1;  --k) {
1537	    status = decode_point(state, &p);
1538	    if (status != amc_ace_success) return status;
1539	    bootstrap(state->refpoint, k, p);
1540	  }

1542	  /* Main decoding loop: */

1544	  literal = 0;

1546	  for (;;) {
1547	    c = *state->in_next;
1548	    if (c == 0) break;
1549	    if (c == 45 /* hyphen-minus */) {
1550	      if (*++state->in_next == 45) {
1551	        /* double hyphen-minus represents a hyphen-minus */
1552	        ++state->in_next;
1553	        if (max_out - next_out < 1) return amc_ace_output_too_big;
1554	        if (uppercase_flags) uppercase_flags[next_out] = 0;
1555	        output[next_out++] = 45;
1556	      }
1557	      else {
1558	        /* unpaired hyphen-minus toggles mode */
1559	        literal = !literal;
1560	      }
1561	    }
1562	    else {
1563	      if (literal) {
1564	        /* copy literal character to the output */
1565	        ++state->in_next;
1566	        if (max_out - next_out < 1) return amc_ace_output_too_big;
1567	        output[next_out] = c;
1568	      }
1569	      else {
1570	        /* decode one base-32 code point */
1571	        status = decode_point(state, output + next_out);
1572	        if (status != amc_ace_success) return status;
1573	      }

1575	      if (uppercase_flags) {
1576	        uppercase_flags[next_out] = is_AtoZ(state->in_next[-1]);
1577	      }

1579	      ++next_out;
1580	    }
1581	  }

1583	  /* Re-encode the output and compare to the input: */

1585	  input_size = state->in_next - input + 1;
1586	  scratch_size = input_size;
1587	  status = amc_ace_o_encode(next_out, output, uppercase_flags,
1588	                            &scratch_size, scratch_space);
1589	  if (status != amc_ace_success ||
1590	      scratch_size != input_size ||
1591	      unequal(case_sensitivity, scratch_space, input)
1592	     ) return amc_ace_invalid_input;

1594	  *output_length = next_out;
1595	  return amc_ace_success;
1596	}

1598	/******************************************************************/
1599	/* Wrapper for testing (would normally go in a separate .c file): */

1601	#include 
1602	#include 
1603	#include 
1604	#include 
1605	/* For testing, we'll just set some compile-time limits rather than */
1606	/* use malloc(), and set a compile-time option rather than using a  */
1607	/* command-line option.                                             */

1609	enum {
1610	  unicode_max_length = 256,
1611	  ace_max_size = 256,
1612	  test_case_sensitivity = case_insensitive  /* suitable for host names */
1613	};

1615	static void usage(char **argv)
1616	{
1617	  fprintf(stderr,
1618	    "%s -e reads big-endian UTF-32 and writes AMC-ACE-O ASCII.\n"
1619	    "%s -d reads AMC-ACE-O ASCII and writes big-endian UTF-32.\n"
1620	    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
1621	    , argv[0], argv[0]);
1622	  exit(EXIT_FAILURE);
1623	}

1625	static void fail(const char *msg)
1626	{
1627	  fputs(msg,stderr);
1628	  exit(EXIT_FAILURE);
1629	}

1631	static const char too_big[] =
1632	  "input or output is too large, recompile with larger limits\n";
1633	static const char invalid_input[] = "invalid input\n";
1634	static const char io_error[] = "I/O error\n";

1636	int main(int argc, char **argv)
1637	{
1638	  enum amc_ace_status status;
1639	  int r;

1641	  if (argc != 2) usage(argv);
1642	  if (argv[1][0] != '-') usage(argv);
1643	  if (argv[1][2] != '\0') usage(argv);

1645	  if (argv[1][1] == 'e') {
1646	    u_code_point input[unicode_max_length];
1647	    unsigned char uppercase_flags[unicode_max_length];
1648	    char output[ace_max_size];
1649	    unsigned int input_length, output_size;
1650	    int c0, c1, c2, c3;
1651	    /* Read the UTF-32 input string: */

1653	    input_length = 0;

1655	    for (;;) {
1656	      c0 = getchar();
1657	      c1 = getchar();
1658	      c2 = getchar();
1659	      c3 = getchar();
1660	      if (ferror(stdin)) fail(io_error);

1662	      if (c1 == EOF || c2 == EOF || c3 == EOF) {
1663	        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
1664	        break;
1665	      }

1667	      if (input_length == unicode_max_length) fail(too_big);

1669	      if ((c0 != 0 && c0 != 0x80)
1670	          || c1 < 0 || c1 > 0x10
1671	          || c2 < 0 || c2 > 0xFF
1672	          || c3 < 0 || c3 > 0xFF ) {
1673	        fail(invalid_input);
1674	      }

1676	      input[input_length] = ((u_code_point) c1 << 16) |
1677	                            ((u_code_point) c2 <<  8) | (u_code_point) c3;
1678	      uppercase_flags[input_length] = (c0 >> 7);
1679	      ++input_length;
1680	    }

1682	    /* Encode, and output the result: */

1684	    output_size = ace_max_size;
1685	    status = amc_ace_o_encode(input_length, input, uppercase_flags,
1686	                              &output_size, output);
1687	    if (status == amc_ace_invalid_input) fail(invalid_input);
1688	    if (status == amc_ace_output_too_big) fail(too_big);
1689	    assert(status == amc_ace_success);
1690	    r = fputs(output,stdout);
1691	    if (r == EOF) fail(io_error);
1692	    return EXIT_SUCCESS;
1693	  }

1695	  if (argv[1][1] == 'd') {
1696	    char input[ace_max_size], scratch[ace_max_size];
1697	    u_code_point output[unicode_max_length], codept;
1698	    unsigned char uppercase_flags[unicode_max_length];
1699	    unsigned int output_length, i;

1701	    /* Read the AMC-ACE-O ASCII input string: */

1703	    fgets(input, ace_max_size, stdin);
1704	    if (ferror(stdin)) fail(io_error);
1705	    if (!feof(stdin)) fail(too_big);
1706	    /* Decode, and output the result: */

1708	    output_length = unicode_max_length;
1709	    status = amc_ace_o_decode(test_case_sensitivity, scratch, input,
1710	                              &output_length, output, uppercase_flags);
1711	    if (status == amc_ace_invalid_input) fail(invalid_input);
1712	    if (status == amc_ace_output_too_big) fail(too_big);
1713	    assert(status == 0);

1715	    for (i = 0;  i < output_length;  ++i) {
1716	      r = putchar(uppercase_flags[i] ? 0x80 : 0);
1717	      if (r == EOF) fail(io_error);
1718	      codept = output[i];
1719	      r = putchar(codept >> 16);
1720	      if (r == EOF) fail(io_error);
1721	      r = putchar((codept >> 8) & 0xFF);
1722	      if (r == EOF) fail(io_error);
1723	      r = putchar(codept & 0xFF);
1724	      if (r == EOF) fail(io_error);
1725	    }

1727	    return EXIT_SUCCESS;
1728	  }

1730	  usage(argv);
1731	  return EXIT_SUCCESS;  /* not reached, but quiets a compiler warning */
1732	}

1734	                   INTERNET-DRAFT expires 2001-Sep-19