idnits 2.17.1

draft-ietf-idn-amc-ace-m-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 17) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 2 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The abstract seems to contain references ([UNICODE], [DUDE00], [RFC1123],
     [BRACE00], [RFC952], [IDN]), which it shouldn't.  Please replace those
     with straight textual mentions of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 248 has weird spacing: '...b aaaaa other...'

  == Line 254 has weird spacing: '...c ccccc other...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1622

  == Missing Reference: 'C' is mentioned on line 1299, but not defined

  == Missing Reference: 'A' is mentioned on line 1325, but not defined

  -- Looks like a reference, but probably isn't: '6' on line 1454

  -- Looks like a reference, but probably isn't: '2' on line 1623

  -- Looks like a reference, but probably isn't: '1' on line 1673

  -- Looks like a reference, but probably isn't: '3' on line 1494

  -- Looks like a reference, but probably isn't: '4' on line 1498

  -- Looks like a reference, but probably isn't: '5' on line 1499

  == Unused Reference: 'RACE03' is defined on line 934, but no explicit
     reference was found in the text

  == Unused Reference: 'UTF6' is defined on line 956, but no explicit
     reference was found in the text

  == Unused Reference: 'ACEID01' is defined on line 910, but no explicit
     reference was found in the text

  == Unused Reference: 'LACE01' is defined on line 924, but no explicit
     reference was found in the text

  == Unused Reference: 'SACE' is defined on line 947, but no explicit
     reference was found in the text

  == Unused Reference: 'UTF5' is defined on line 953, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '0xxxx'

  -- Possible downref: Normative reference to a draft: ref. 'RACE03' 

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-00

  -- Possible downref: Normative reference to a draft: ref. 'DUDE00' 

  -- No information found for draft-ietf-idn-utf6- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF6' 

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-02

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-aceid-01

  -- Possible downref: Normative reference to a draft: ref. 'ACEID01' 

  -- Possible downref: Normative reference to a draft: ref. 'BRACE00' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Normative reference to a draft: ref. 'LACE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- No information found for draft-ietf-idn-sace- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'SACE' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTFCONV'


     Summary: 6 errors (**), 0 flaws (~~), 15 warnings (==), 26 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-amc-ace-m-00.txt                              2001-Feb-12
3	Expires 2001-Aug-14

5	                         AMC-ACE-M version 0.1.0

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m

34	Abstract

36	    AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE]
37	    characters to a sequence of letters (A-Z, a-z), digits (0-9), and
38	    hyphen-minus (-), henceforth called LDH characters.  Such a map
39	    (called an "ASCII-Compatible Encoding", or ACE) might be useful for
40	    internationalized domain names [IDN], because host name labels are
41	    currently restricted to LDH characters by [RFC952] and [RFC1123].

43	    AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient
44	    but complex) and DUDE [DUDE00] (which is simple and provides case
45	    preservation).  AMC-ACE-M is much simpler than BRACE but similarly
46	    efficient, and provides case preservation like DUDE.

48	    Besides domain names, there might also be other contexts where it is
49	    useful to transform Unicode characters into "safe" (delimiter-free)
50	    ASCII characters.  (If other contexts consider hyphen-minus to be
51	    unsafe, a different character could be used to play its role, like
52	    underscore.)
53	Contents

55	    Features
56	    Name
57	    Overview
58	    Base-32 characters
59	    Encoding procedure
60	    Decoding procedure
61	    Signature
62	    Case sensitivity models
63	    Comparison with RACE, BRACE, LACE, and DUDE
64	    Example strings
65	    Security considerations
66	    References
67	    Author
68	    Example implementation

70	Features

72	    Uniqueness:  Every Unicode string maps to at most one LDH string.

74	    Completeness:  Every Unicode string maps to an LDH string.
75	    Restrictions on which Unicode strings are allowed, and on length,
76	    may be imposed by higher layers.

78	    Efficient encoding:  The ratio of encoded size to original size is
79	    small for all Unicode strings.  This is important in the context
80	    of domain names because [RFC1034] restricts the length of a domain
81	    label to 63 characters.

83	    Simplicity:  The encoding and decoding algorithms are reasonably
84	    simple to implement.  The goals of efficiency and simplicity are at
85	    odds; AMC-ACE-M aims at a good balance between them.

87	    Case-preservation:  If the Unicode string has been case-folded prior
88	    to encoding, it is possible to record the case information in the
89	    case of the letters in the encoding, allowing a mixed-case Unicode
90	    string to be recovered if desired, but a case-insensitive comparison
91	    of two encoded strings is equivalent to a case-insensitive
92	    comparison of the Unicode strings.  This feature is optional; see
93	    section "Case sensitivity models".

95	    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
96	    in the Unicode string are represented as themselves in the label.
97	    This comes for free because it is usually the most efficient
98	    encoding anyway.

100	Name

102	    AMC-ACE-M is a working name that should be changed if it is adopted.
103	    (The M merely indicates that it is the thirteenth ACE devised by
104	    this author.  BRACE was the third.  D through L did not deliver
105	    enough efficiency to justify their complexity.)  Rather than waste
106	    good names on experimental proposals, let's wait until one proposal
107	    is chosen, then assign it a good name.  Suggestions (assuming the
108	    primary use is in domain names):

110	        UniHost
111	        UTF-A ("A" for "ASCII" or "alphanumeric",
112	               but unfortunately UTF-A sounds like UTF-8)
113	        UTF-H ("H" for "host names",
114	               but unfortunately UTF-H sounds like UTF-8)
115	        UTF-D ("D" for "domain names")
116	        NUDE (Normal Unicode Domain Encoding)

118	Overview

120	    AMC-ACE-M maps characters to characters--it does not consume or
121	    produce code points, code units, or bytes, although the algorithm
122	    makes use of code points, and implementations will of course need to
123	    represent the input and output characters somehow, usually as bytes
124	    or other code units.

126	    Each character in the Unicode string is represented by an
127	    integral number of characters in the encoded string.  There is no
128	    intermediate bit string or octet string.

130	    The encoded string alternates between two modes: literal mode and
131	    base-32 mode.  LDH characters in the Unicode string are encoded
132	    literally, except that hyphen-minus is doubled.  Non-LDH characters
133	    in the Unicode string are encoded using base-32, in which each
134	    character of the encoded string represents five bits (a "quintet").
135	    A non-paired hyphen-minus in the encoded string indicates a mode
136	    change.
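The alternation between literal mode and base-32 mode described above
can be sketched in Python as follows.  This is only an illustration
under stated assumptions: the names LDH, frame, and encode_non_ldh are
hypothetical and not part of this specification, the callback stands in
for the base-32 code of one non-LDH character, and the parameter
characters that begin a real encoded string are omitted.

```python
import string

# LDH characters: letters, digits, and hyphen-minus.
LDH = set(string.ascii_letters + string.digits + "-")

def frame(text, encode_non_ldh):
    """Interleave literal and base-32 runs (hypothetical helper).

    encode_non_ldh is an assumed callback that returns the base-32
    characters for one non-LDH character.
    """
    out = []
    literal = False              # the mode is initially base-32
    for ch in text:
        if ch == "-":
            out.append("--")     # hyphen-minus is doubled in either mode
        elif ch in LDH:
            if not literal:      # a lone hyphen-minus marks a mode change
                out.append("-")
                literal = True
            out.append(ch)
        else:
            if literal:
                out.append("-")
                literal = False
            out.append(encode_non_ldh(ch))
    return "".join(out)

# "??" is a stand-in for a real base-32 code.
print(frame("\u044fab-c\u0435", lambda ch: "??"))   # prints ??-ab--c-??
```

Because the sketch leaves out the parameter characters, its output can
begin with a hyphen-minus; a complete encoded string would not.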

138	    In base-32 mode a group of one to five quintets is used to
139	    represent a number, which is added to an offset to yield a
140	    Unicode code point, which in turn represents a Unicode character.
141	    (Surrogates, which are code units used by UTF-16 in pairs to
142	    refer to code points, are not used and not allowed in AMC-ACE-M.)
143	    Similarities between the code points are exploited to make the
144	    encoding more compact.

146	Base-32 characters

148	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
149	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
150	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
151	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
152	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
153	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
154	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
155	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
156	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
157	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
158	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
159	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
160	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
161	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
162	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
163	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

165	    The digits "0" and "1" and the letters "o" and "l" are not used, to
166	    avoid transcription errors.

168	    All decoders must recognize both the uppercase and lowercase
169	    forms of the base-32 characters.  The case may or may not convey
170	    information, as described in section "Case sensitivity models".
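The table above can be expressed as a single 32-character string whose
index is the quintet value.  The following Python sketch is
illustrative only; the names BASE32, quintet_to_char, and
char_to_quintet are assumptions made here, not names defined by this
specification.

```python
# Base-32 alphabet in table order: a-k, m, n, p-z, then 2-9.
# The digits 0 and 1 and the letters o and l are deliberately absent.
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"

def quintet_to_char(value):
    """Map a 5-bit value (0..31) to its lowercase base-32 character."""
    if not 0 <= value < 32:
        raise ValueError("quintet out of range")
    return BASE32[value]

def char_to_quintet(ch):
    """Map one base-32 character, uppercase or lowercase, to its value."""
    if len(ch) != 1 or not ch.isascii():
        raise ValueError("not a base-32 character")
    value = BASE32.find(ch.lower())
    if value < 0:
        raise ValueError("not a base-32 character")
    return value
```

For example, quintet_to_char(11) gives "m", and char_to_quintet("S")
and char_to_quintet("s") both give 16.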

172	Encoding procedure

174	    The encoder first examines the Unicode string and chooses some
175	    parameters.  It writes these parameters into the output string, then
176	    proceeds to encode each Unicode character, one at a time.  The exact
177	    sequence of steps is given below.  All ordering of bits and quintets
178	    is big-endian (most significant first).  The >> and << operators
179	    used below mean bit shift, as in C.  For >> there is no question of
180	    logical versus arithmetic shift because AMC-ACE-M makes no use of
181	    negative numbers.

183	     0) Determine the Unicode code point for each non-LDH character in
184	        the Unicode string.  Since LDH characters are encoded literally,
185	        their code points are not needed.  Depending on how the Unicode
186	        string is presented to the encoder, this step may be a no-op.

188	     1) Verify that there are no invalid code points in the input;
189	        that is, none exceed 0x10FFFF (the highest code point in the
190	        Unicode code space) and none are in the range D800..DFFF
191	        (surrogates).

193	     2) Determine the most populous row:  Row n is defined as the 256
194	        code points starting with n << 8, except that this definition
195	        would makes rows D8..DF useless, because they would contain only
196	        surrogates.  Therefore AMC-ACE-M defines rows D8..DF to be the
197	        following non-aligned blocks of 256 code points:

199	            row D8 = 0020..011F
200	            row D9 = 005B..015A
201	            row DA = 007B..017A
202	            row DB = 00A0..019F
203	            row DC = 00C0..01BF
204	            row DD = 00DF..01DE
205	            row DE = 0134..0233
206	            row DF = 0270..036F

208	        (Rationale:  Whereas almost every small script is confined to
209	        a single row, the Latin script is split across a few rows,
210	        and the row boundaries are not especially convenient for many
211	        languages.)

213	        Determine the row containing the most non-LDH input code points,
214	        breaking ties in favor of smaller-numbered rows.  (If a code
215	        point appears multiple times in the input, it counts multiple
216	        times.  This applies to steps 3 and 4 also.)  Call it row B.
217	        Let offsetB be the first code point of row B.

219	     3) Determine the most populous 16-window:  For each n in 0..31 let
220	        offset = ((offsetB >> 3) + n) << 3 and count the number of code
221	        points in the range offset through offset + 0xF.  Let A be the
222	        value of n that maximizes this count, breaking ties in favor
223	        of smaller values of n, and let offsetA be the corresponding
224	        offset.

226	     4) Determine the most populous 20k-window:  If the input is empty,
227	        then let C = 0.  Otherwise, for each input code point, let n =
228	        code_point >> 11, and count the number of non-LDH input code
229	        points that are not in row B and are in the range (n << 11)
230	        through (n << 11) + 0x4FFF.  Determine the value of n that
231	        maximizes the count, breaking ties in favor of smaller values of
232	        n, and let C be that value.

234	     5) Choose a style:  One of the base-32 codes used in step 7.3 has
235	        two variants, and so base-32 mode is subdivided into two styles,
236	        narrow and wide, depending on which variant is used.  Compute
237	        the total number of base-32 characters that would be produced
238	        if narrow style were used, and the number if wide style were
239	        used.  The easiest way to do this is to mimic the logic of steps
240	        6 and 7.3.  Use whichever style would produce fewer base-32
241	        characters.  In case of a tie, use narrow style.

243	     6) Encode the parameters.  If narrow style is used, then let
244	        offsetC = (offsetB >> 12) << 12, and encode B and A as three or
245	        four base-32 characters:

247	            00bbb bbbbb aaaaa        if B <= 0xFF
248	            01bbb bbbbb bbbbb aaaaa  otherwise

250	        If wide style is used, then let offsetC = C << 11, and encode B
251	        and C as three or five base-32 characters:

253	            10bbb bbbbb ccccc              if B <= 0xFF and C <= 0x1F
254	            11bbb bbbbb bbbbb ccccc ccccc  otherwise

256	     7) Encode each input character in turn, using the first of the
257	        following cases that applies.  The mode is initially base-32.

259	         7.1) The character is a hyphen-minus (U+002D).  Encode it as
260	              two hyphen-minuses.

262	         7.2) The character is an LDH character.  If in base-32 mode
263	              then output a hyphen-minus and switch to literal mode.
264	              Copy the character to the output.

266	         7.3) The character is a non-LDH character.  If in literal
267	              mode then output a hyphen-minus and switch to base-32
268	              mode.  Encode the character's code point using the
269	              first of the following cases that applies.  Square
270	              brackets enclose quintets that can be used to record
271	              the upper/lowercase attribute of the Unicode character
272	              (because the corresponding base-32 characters are
273	              guaranteed to be letters rather than digits) (see section
274	              "Case sensitivity models").

276	               7.3.1) Narrow style was chosen and the code point is in
277	                      the range offsetA through offsetA + 0xF.  Subtract
278	                      offsetA and encode the difference as a single
279	                      base-32 character:

281	                          [0xxxx]

283	               7.3.2) The code point is in the range offsetB through
284	                      offsetB + 0xFF.  Subtract offsetB and encode the
285	                      difference as two base-32 characters:

287	                          1xxxx [0xxxx]

289	               7.3.3) The code point is in the range offsetC through
290	                      offsetC + 0xFFF.  Subtract offsetC and encode the
291	                      difference as three base-32 characters:

293	                          1xxxx 1xxxx [0xxxx]

295	               7.3.4) Wide style was chosen and the code point is in
296	                      the range offsetC + 0x1000 through offsetC +
297	                      0x4FFF.  Subtract offsetC + 0x1000 and encode the
298	                      difference as three base-32 characters:

300	                          [0xxxx] xxxxx xxxxx

302	               7.3.5) The code point is in the range 0 through 0xFFFF.
303	                      Encode it as four base-32 characters:

305	                          1xxxx 1xxxx 1xxxx [0xxxx]

307	               7.3.6) If we've come this far, the code point must be
308	                      in the range 0x10000 through 0x10FFFF.  Subtract
309	                      0x10000 and encode the difference as five base-32
310	                      characters:

312	                          1xxxx 1xxxx 1xxxx 1xxxx [0xxxx]
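Two parts of the procedure above, the row definition of step 2 and the
per-character cases of step 7.3, can be sketched in Python as follows.
This is a sketch under stated assumptions, not a reference
implementation: all function and variable names are hypothetical, the
style and offsets are assumed to have been chosen already by steps 2
through 6, the input code points are assumed to be the non-LDH code
points already validated by step 1, and the output is all lowercase
(no case information is recorded).

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"

# Step 2: rows D8..DF are redefined as non-aligned blocks; the values
# here are the first code points of those blocks.
ROW_OVERRIDES = {0xD8: 0x0020, 0xD9: 0x005B, 0xDA: 0x007B, 0xDB: 0x00A0,
                 0xDC: 0x00C0, 0xDD: 0x00DF, 0xDE: 0x0134, 0xDF: 0x0270}

def row_offset(row):
    """First code point of a row (offsetB when row is B)."""
    return ROW_OVERRIDES.get(row, row << 8)

def most_populous_row(code_points):
    """Step 2: row holding the most code points, smaller rows winning ties."""
    candidates = {0} | {cp >> 8 for cp in code_points} | set(ROW_OVERRIDES)
    def count(row):
        start = row_offset(row)
        return sum(start <= cp <= start + 0xFF for cp in code_points)
    return min(candidates, key=lambda row: (-count(row), row))

def chained(value, length):
    """Quintets of the form 1xxxx ... 1xxxx 0xxxx, four payload bits each."""
    out = []
    for i in range(length - 1, -1, -1):
        nibble = (value >> (4 * i)) & 0xF
        out.append(BASE32[(0x10 if i else 0) | nibble])
    return "".join(out)

def encode_code_point(cp, narrow, offset_a, offset_b, offset_c):
    """Step 7.3: the first applicable case wins."""
    if narrow and offset_a <= cp <= offset_a + 0xF:
        return chained(cp - offset_a, 1)                      # 7.3.1
    if offset_b <= cp <= offset_b + 0xFF:
        return chained(cp - offset_b, 2)                      # 7.3.2
    if offset_c <= cp <= offset_c + 0xFFF:
        return chained(cp - offset_c, 3)                      # 7.3.3
    if not narrow and offset_c + 0x1000 <= cp <= offset_c + 0x4FFF:
        d = cp - (offset_c + 0x1000)                          # 7.3.4
        return BASE32[d >> 10] + BASE32[(d >> 5) & 0x1F] + BASE32[d & 0x1F]
    if cp <= 0xFFFF:
        return chained(cp, 4)                                 # 7.3.5
    return chained(cp - 0x10000, 5)                           # 7.3.6
```

With narrow style, offsetA = 0x430, offsetB = 0x400, and offsetC = 0,
the code point 0x435 becomes "f" and 0x44F becomes "wr".  For the two
Latin code points 0xE9 and 0x101, most_populous_row returns 0xD8,
because that non-aligned row is the smallest-numbered row that
contains both.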

314	Decoding procedure

316	    The details of the decoding procedure are implied by the encoding
317	    procedure.  The overall sequence of steps is as follows.

319	     1) Undo the encoder's step 6:  From the first few base-32
320	        characters, determine whether narrow or wide style is used, and
321	        determine the offsets.

323	     2) Set the mode to base-32.  For each remaining input character, use
324	        the first of the following cases that applies:

326	         2.1) The character is a hyphen-minus, and the following
327	              character is also a hyphen-minus.  Consume them both and
328	              output a hyphen-minus.

330	         2.2) The character is a hyphen-minus.  Consume it and toggle
331	              the mode flag.

333	         2.3) The current mode is literal.  Consume the input character
334	              and output it.

336	         2.4) Interpret the input character and up to four of its
337	              successors as base-32.  Consume characters until one is
338	              found whose value has the form 0xxxx.  That is the one
339	              that carries the upper/lowercase information.  Remember
340	              the length of the code.  If the length is one and wide
341	              style is being used, consume two more characters.
342	              Decode the base-32 characters into an integer, add the
343	              appropriate offset (which depends on the remembered code
344	              length), and output the Unicode character corresponding to
345	              the resulting code point.

347	              If the case-flexible or case-preserving model is being
348	              used (see section "Case sensitivity models"), the decoder
349	              must either perform the case conversion as it is decoding,
350	              or construct a separate record of the case information to
351	              accompany the output string.

353	     3) Before returning the output (be it a string or a string plus
354	        case information), the decoder must invoke the encoder on it,
355	        and compare the result to the input string.  The comparison
356	        must be case-sensitive if the case-sensitive or case-flexible
357	        model is being used, case-insensitive if the case-insensitive
358	        or case-preserving model is being used.  If the two strings do
359	        not match, it is an error.  This check is necessary to guarantee
360	        the uniqueness property (there cannot be two distinct encoded
361	        strings representing the same Unicode string).

363	    If the decoder at any time encounters an unexpected character, or
364	    unexpected end of input, then the input is invalid.
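Step 2 of the decoding procedure can be sketched in Python as follows.
The sketch makes several assumptions: the function names are
hypothetical, the style and offsets are assumed to have been recovered
already by step 1, the input passed in is the part of the encoded
string that follows the parameter characters, and the re-encoding
check of step 3 and all case handling are omitted.

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"

def quintet(s, pos):
    """Value of the base-32 character at s[pos], in either case."""
    if pos >= len(s):
        raise ValueError("unexpected end of input")
    ch = s[pos]
    value = BASE32.find(ch.lower()) if ch.isascii() else -1
    if value < 0:
        raise ValueError("unexpected character")
    return value

def decode_code(s, pos, narrow, offset_a, offset_b, offset_c):
    """Step 2.4: decode one code at s[pos]; return (code_point, next_pos)."""
    number, length = 0, 0
    while True:
        if length == 5:
            raise ValueError("code longer than five quintets")
        q = quintet(s, pos)
        pos, length = pos + 1, length + 1
        number = (number << 4) | (q & 0xF)
        if q < 0x10:                 # form 0xxxx ends the chained part
            break
    if length == 1 and not narrow:   # wide style: two more full quintets
        for _ in range(2):
            number = (number << 5) | quintet(s, pos)
            pos += 1
        return offset_c + 0x1000 + number, pos
    base = (offset_a, offset_b, offset_c, 0, 0x10000)[length - 1]
    return base + number, pos

def decode_body(s, narrow, offset_a, offset_b, offset_c):
    """Steps 2.1 through 2.4 applied to everything after the parameters."""
    out, pos, literal = [], 0, False     # the mode is initially base-32
    while pos < len(s):
        if s.startswith("--", pos):      # 2.1: doubled hyphen-minus
            out.append("-")
            pos += 2
        elif s[pos] == "-":              # 2.2: mode change
            literal = not literal
            pos += 1
        elif literal:                    # 2.3: literal character
            out.append(s[pos])
            pos += 1
        else:                            # 2.4: base-32 code
            cp, pos = decode_code(s, pos, narrow,
                                  offset_a, offset_b, offset_c)
            out.append(chr(cp))
    return "".join(out)
```

With narrow style, offsetA = 0x430, offsetB = 0x400, and offsetC = 0,
decode_body("wr-ab--c-f", ...) yields U+044F, "ab-c", U+0435.  A
complete decoder would still have to re-encode the result and compare
it with the input, as step 3 requires.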

366	Signature

368	    The issue of how to distinguish ACE strings from unencoded strings
369	    is largely orthogonal to the encoding scheme itself, and is
370	    therefore not specified here.  In the context of domain name labels,
371	    a standard prefix and/or suffix (chosen to be unlikely to occur
372	    naturally) would presumably be attached to ACE labels.  (In that
373	    case, it would probably be good to forbid the encoding of Unicode
374	    strings that appear to match the signature, to avoid confusing
375	    humans about whether they are looking at a Unicode string or an ACE
376	    string.)

378	    In order to use AMC-ACE-M in domain names, the choice of signature
379	    must be mindful of the requirement in [RFC952] that labels never
380	    begin or end with hyphen-minus.  The raw encoded string will never
381	    begin with a hyphen-minus, and will end with a hyphen-minus iff the
382	    Unicode string ends with a hyphen-minus.  The easiest solution is
383	    to use a suffix as the signature.  Alternatively, if the Unicode
384	    strings were forbidden from ending with a hyphen-minus, a prefix
385	    could be used.

387	    It appears that "---" is extremely rare in domain names; among the
388	    four-character prefixes of all the second-level domains under .com,
389	    .net, and .org, "---" never appears at all.  Therefore, perhaps the
390	    signature should be of the form ?--- (prefix) or ---? (suffix),
391	    where ? could be "u" for Unicode, or "i" for internationalized, or
392	    "a" for ACE, or maybe "q" or "z" because they are rare.

394	Case sensitivity models

396	    The higher layer must choose one of the following four models.

398	    Models suitable for domain names:

400	      * Case-insensitive:  Before a string is encoded, all its non-LDH
401	        characters must be case-folded so that any strings differing
402	        only in case become the same string (for example, strings could
403	        be forced to lowercase).  Folding LDH characters is optional.
404	        The case of base-32 characters and literal-mode characters is
405	        arbitrary and not significant.  Comparisons between encoded
406	        strings must be case-insensitive.  The original case of non-LDH
407	        characters cannot be recovered from the encoded string.

409	      * Case-preserving:  The case of the Unicode characters is not
410	        considered significant, but it can be preserved and recovered,
411	        just like in non-internationalized host names.  Before a string
412	        is encoded, all its non-LDH characters must be case-folded
413	        as in the previous model.  LDH characters are naturally able
414	        to retain their case attributes because they are encoded
415	        literally.  The case attribute of a non-LDH character is
416	        recorded in one of the base-32 characters that represent
417	        it (section "Encoding procedure" tells which one).  If the
418	        base-32 character is uppercase, it means the Unicode character
419	        is caseless or should be forced to uppercase after being
420	        decoded (which is a no-op if the case folding already forces
421	        to uppercase).  If the base-32 character is lowercase, it
422	        means the Unicode character is caseless or should be forced to
423	        lowercase after being decoded (which is a no-op if the case
424	        folding already forces to lowercase).  The case of the other
425	        base-32 characters in a multi-quintet encoding is arbitrary
426	        and not significant.  Only uppercase and lowercase attributes
427	        can be recorded, not titlecase.  Comparisons between encoded
428	        strings must be case-insensitive, and are equivalent to
429	        case-insensitive comparisons between the Unicode strings.  The
430	        intended mixed-case Unicode string can be recovered as long as
431	        the encoded characters are unaltered, but altering the case of
432	        the encoded characters is not harmful--it merely alters the case
433	        of the Unicode characters, and such a change is not considered
434	        significant.

436	        In this model, the input to the encoder and the output of the
437	        decoder can be the unfolded Unicode string (in which case the
438	        encoder and decoder are responsible for performing the case
439	        folding and recovery), or can be the folded Unicode string
440	        accompanied by separate case information (in which case the
441	        higher layer is responsible for performing the case folding and
442	        recovery).  Whichever layer performs the case recovery must
443	        first verify that the Unicode string is properly folded, to
444	        guarantee the uniqueness of the encoding.

446	        It is easy to extend the nameprep algorithm [NAMEPREP02] to
447	        remember case information.  It merely requires an additional
448	        bit to be associated with each output code point in the mapping
449	        table.

451	    The case-insensitive and case-preserving models are interoperable.
452	    If a domain name passes from a case-preserving entity to a
453	    case-insensitive entity, the case information will be lost, but
454	    the domain name will still be equivalent.  This phenomenon already
455	    occurs with non-internationalized domain names.
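In the case-preserving model, the case attribute rides on the one
quintet of each code that has the form 0xxxx, which always maps to a
letter.  A minimal Python sketch, assuming hypothetical helper names
and a wide_form flag that marks the three-character code whose
bracketed quintet comes first:

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"

# Quintets of the form 0xxxx (values 0..15) always map to letters,
# so their case is free to carry one bit of information.
assert BASE32[:16].isalpha()

def flag_index(code, wide_form=False):
    """Position of the bracketed quintet within one base-32 code."""
    return 0 if wide_form else len(code) - 1

def mark_case(code, upper, wide_form=False):
    """Record the uppercase or lowercase attribute in one code."""
    i = flag_index(code, wide_form)
    flagged = code[i].upper() if upper else code[i].lower()
    return code[:i] + flagged + code[i + 1:]

def is_marked_upper(code, wide_form=False):
    """Read the case attribute back from one code."""
    return code[flag_index(code, wide_form)].isupper()
```

For example, mark_case("wr", True) gives "wR".  A case-insensitive
comparison still treats "wR" and "wr" as the same code, which is why
the two models interoperate.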

457	    Models unsuitable for domain names, but possibly useful in other
458	    contexts:

460	      * Case-sensitive:  Unicode strings may contain both uppercase and
461	        lowercase characters, which are not folded.  Base-32 characters
462	        must be lowercase.  Comparisons between encoded strings must be
463	        case-sensitive.

465	      * Case-flexible:  Like case-preserving, except that the choice
466	        of whether the case of the Unicode characters is considered
467	        significant is deferred.  Therefore, base-32 characters must
468	        be lowercase, except for those used to indicate uppercase
469	        Unicode characters.  Comparisons between encoded strings may be
470	        case-sensitive or case-insensitive, and such comparisons are
471	        equivalent to the corresponding comparisons between the Unicode
472	        strings.

474	Comparison with RACE, BRACE, LACE, and DUDE

476	    In this section we compare AMC-ACE-M and four other ACEs: RACE
477	    [RACE03], BRACE [BRACE00], LACE [LACE01], and Extended DUDE
478	    [DUDE00].  We do not include SACE [SACE], UTF-5 [UTF5], or UTF-6
479	    [UTF6] in the comparison, because SACE appears obviously too
480	    complex, UTF-5 appears obviously too inefficient, and UTF-6 can
481	    never be more efficient than its similarly simple successor, DUDE.

483	    Case preservation support:

485	        DUDE, AMC-ACE-M:  all characters
486	                  BRACE:  only the letters A-Z, a-z
487	             RACE, LACE:  none

489	    RACE, BRACE, and LACE transform the Unicode string to an
490	    intermediate bit string, then into a base-32 string, so there is no
491	    particular alignment between the base-32 characters and the Unicode
492	    characters.  DUDE and AMC-ACE-M do not have this intermediate stage,
493	    and enforce alignment between the base-32 characters and the Unicode
494	    characters, which facilitates the case preservation.

496	    Complexity is hard to measure.  This author would subjectively
497	    describe the complexity of the algorithms as:

499	        RACE, LACE, DUDE: fairly simple but not trivial
500	               AMC-ACE-M: moderate
501	                   BRACE: complex

503	    The complexity of AMC-ACE-M is in the number of rules, but the
504	    individual rules are not very complex, and they are generally
505	    non-interacting.

507	    The relative efficiency of the various algorithms is suggested
508	    by the sizes of the encodings in section "Example strings".  For
509	    each ACE there is a graph below showing a horizontal bar for
510	    each example string, representing the ACE length divided by the
511	    minimum length among all the ACEs for that example string (so the
512	    ratio is at least 1).  Example R is excluded because it violates
513	    nameprep [NAMEPREP02].  The other example strings all use different
514	    languages, except that there are several Japanese examples.  To
515	    avoid skewing the results, each graph collapses all the Japanese
516	    ratios into a single bar representing the median ratio.  A ratio r
517	    is represented by a bar of length r/0.04 characters.  Since the bar
518	    will always be at least 1/0.04 = 25 characters long, we show the
519	    first 25 characters as "O" and the rest as "@".  The bars are sorted
520	    so that the graph looks like a cumulative distribution.  Each bar
521	    is labeled with the language of the corresponding example string.
522	    (The difference between the Chinese and Taiwanese strings is that
523	    the former uses simplified characters.)
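
    The bar construction can be sketched as follows.  The function name
    is hypothetical, and rounding to the nearest character is an
    assumption; the text above fixes only the scale of 0.04 per
    character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the specification.  Writes   */
/* the bar for one ACE string into out, whose capacity out_size  */
/* includes the terminating null.  Returns the bar length, or 0  */
/* if the bar does not fit.  ace_len is the length of this ACE   */
/* string and min_len is the shortest length among all the ACEs  */
/* for the same example string.                                  */
static size_t render_bar(size_t ace_len, size_t min_len,
                         char *out, size_t out_size)
{
  size_t bar_len, i;

  assert(min_len > 0 && ace_len >= min_len);

  /* r / 0.04 equals 25 * ace_len / min_len; round to nearest */
  /* (the rounding rule is an assumption).                    */
  bar_len = (25 * ace_len + min_len / 2) / min_len;
  if (bar_len + 1 > out_size) return 0;

  /* The first 25 characters represent the ratio 1. */
  for (i = 0;  i < bar_len;  ++i) out[i] = (i < 25) ? 'O' : '@';
  out[bar_len] = '\0';

  return bar_len;
}
```

    For example (H), the shortest ACE is DUDE at 25 characters, so the
    31-character LACE encoding yields 25 "O" followed by six "@".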

525	        RACE:
526	          Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@
527	          Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
528	          Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
529	          Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
530	          Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
531	          Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
532	          Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
533	          Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@
534	          Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
535	          Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@
536	          Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@@@@@@@@

538	        LACE:
539	          Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
540	          Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
541	          Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
542	          Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
543	          Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
544	          Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
545	          Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
546	          Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
547	          Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
548	          Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@
549	          Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@

551	        DUDE:
552	          Russian     OOOOOOOOOOOOOOOOOOOOOOOOO
553	          Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO
554	          Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@
555	          Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
556	          Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
557	          Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
558	          Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
559	          Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
560	          Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
561	          Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
562	          Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@

563	        AMC-ACE-M:
564	          Czech       OOOOOOOOOOOOOOOOOOOOOOOOO
565	          Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO
566	          Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO
567	          Korean      OOOOOOOOOOOOOOOOOOOOOOOOO
568	          Russian     OOOOOOOOOOOOOOOOOOOOOOOOO
569	          Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO
570	          Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO
571	          Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO
572	          Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO@
573	          Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@@@
574	          Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@

576	        BRACE:
577	          Chinese     OOOOOOOOOOOOOOOOOOOOOOOOO
578	          Hindi       OOOOOOOOOOOOOOOOOOOOOOOOO
579	          Japanese    OOOOOOOOOOOOOOOOOOOOOOOOO
580	          Spanish     OOOOOOOOOOOOOOOOOOOOOOOOO
581	          Taiwanese   OOOOOOOOOOOOOOOOOOOOOOOOO
582	          Arabic      OOOOOOOOOOOOOOOOOOOOOOOOO@
583	          Czech       OOOOOOOOOOOOOOOOOOOOOOOOO@
584	          Vietnamese  OOOOOOOOOOOOOOOOOOOOOOOOO@
585	          Hebrew      OOOOOOOOOOOOOOOOOOOOOOOOO@@
586	          Korean      OOOOOOOOOOOOOOOOOOOOOOOOO@@
587	          Russian     OOOOOOOOOOOOOOOOOOOOOOOOO@@@

589	    These results suggest that DUDE is preferable to RACE and LACE,
590	    because it has similar simplicity, better support for case
591	    preservation, and somewhat better efficiency.

593	    The results also suggest that AMC-ACE-M is preferable to BRACE,
594	    because it has similar efficiency and better support for case
595	    preservation, and is simpler.

597	    DUDE and AMC-ACE-M have equal support for case preservation, but
598	    AMC-ACE-M offers significantly better efficiency, at the cost of
599	    significantly greater complexity, so choosing between them entails a
600	    value judgement.

602	Example strings

604	    In the ACE encodings below, signatures (like "bq--" for RACE) are
605	    not shown.  Non-LDH characters in the Unicode string are forced to
606	    lowercase before being encoded using BRACE, RACE, and LACE.  For
607	    RACE and LACE, the letters A-Z are likewise forced to lowercase.
608	    UTF-8 and UTF-16 are included for length comparisons, with non-ASCII
609	    bytes shown as "?".  AMC-ACE-M is abbreviated AMC-M.  Backslashes
610	    show where line breaks have been inserted in ACE strings too long
611	    for one line.  The RACE and LACE encodings are courtesy of Mark
612	    Davis's online UTF converter [UTFCONV] (slightly modified to remove
613	    the length restrictions).
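
    The UTF-8 and UTF-16 lengths used for comparison follow directly
    from the code point values.  A minimal sketch, with hypothetical
    helper names, assuming lengths are counted in bytes and that no
    byte order mark is included:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only. */

/* Number of bytes needed to encode one code point in UTF-8. */
static size_t utf8_bytes(unsigned long cp)
{
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

/* Number of bytes needed in UTF-16: one 16-bit code unit for  */
/* code points up to U+FFFF, a surrogate pair above that.  No  */
/* byte order mark is counted (an assumption).                 */
static size_t utf16_bytes(unsigned long cp)
{
  assert(cp <= 0x10FFFF);
  return cp < 0x10000 ? 2 : 4;
}

/* Sums the per-code-point sizes over an array of n code points. */
static void utf_lengths(const unsigned long *cps, size_t n,
                        size_t *utf8_len, size_t *utf16_len)
{
  size_t i;

  *utf8_len = 0;
  *utf16_len = 0;

  for (i = 0;  i < n;  ++i) {
    *utf8_len += utf8_bytes(cps[i]);
    *utf16_len += utf16_bytes(cps[i]);
  }
}
```

    Applied to the nine code points of example (I), all of which lie
    between U+0800 and U+FFFF, this gives 27 bytes of UTF-8 and 18
    bytes of UTF-16.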

615	    The first several examples are all names of Japanese music artists,
616	    song titles, and TV programs, just because the author happens to
617	    have them handy (but Japanese is useful for providing examples
618	    of single-row text, two-row text, ideographic text, and various
619	    mixtures thereof).

621	    (A) 3B  (Japanese TV program title)

623	                      = U+5E74                       (kanji)
624	                     = U+7D44                       (kanji)
625	         = U+91D1 U+516B U+5148 U+751F  (kanji)

627	        UTF-16: ????????????????
628	        UTF-8:  3???B???????????????
629	        AMC-M:  utk-3-8ze-B-hkenqtymwifi9
630	        BRACE:  u-3-ygj-b-ynb6gjc7pp4k5p5w
631	        DUDE:   j3le74G062nd44p1d1l16bk8n51f
632	        RACE:   3aadgxtuabrh2rer2fiwwukioupq
633	        LACE:   74adgxtuabrh2rer2fiwwukioupq

635	    (B) -with-SUPER-MONKEYS  (Japanese music group name)

637	         = U+5B89 U+5BA4 U+5948 U+7F8E U+6075  (kanji)

639	        UTF-8:  ??????????????????-with-SUPER-MONKEYS
640	        AMC-M:  u5m2j4etwif6q2zf---with--SUPER--MONKEYS
641	        BRACE:  uvj7fuaqcahy982xa---with--SUPER--MONKEYS
642	        DUDE:   lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J
643	        UTF-16: ????????????????????????????????????????????????
644	        LACE:   ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y
645	        RACE:   3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\
646	                nabwqa3yanyagwadfab4qa4y

648	    (C) Hello-Another-Way-  (Japanese song title)

650	         = U+305D U+308C U+305E U+308C U+306E  (hiragana)
651	                = U+5834 U+6240                       (kanji)

653	        UTF-8:  Hello-Another-Way-?????????????????????
654	        BRACE:  ji7-Hello--Another--Way---v3jhaefvd2ufj62
655	        AMC-M:  bsk-Hello--Another--Way---p2nq2nyqx2veyuwa
656	        DUDE:   M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240
657	        UTF-16: ??????????????????????????????????????????????????
658	        LACE:   ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea
659	        RACE:   3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\
660	                zaawtaxjqrqyf4memgbxfqndcia

662	    (D) 2  (Japanese TV program title)

664	         = U+3072 U+3068 U+3064  (hiragana)
665	            = U+5C4B U+6839         (kanji)
666	              = U+306E                (hiragana)
667	           = U+4E0B                (kanji)

669	        UTF-16: ????????????????
670	        UTF-8:  ?????????????????????2
671	        AMC-M:  bsnzciex6wmy2vjqw8sm-2
672	        BRACE:  ji96u56uwbhf2wqxnw4s-2
673	        DUDE:   j072m8klc4bm839j06eke0bg032
674	        RACE:   3ayhemdigbsfys3iheyg4tqlaaza
675	        LACE:   74yhemdigbsfys3iheyg4tqlaaza

676	    (E) MajiKoi5 (Japanese song title)

678	                = U+3067         (hiragana)
679	              = U+3059 U+308B  (hiragana)
680	         = U+79D2 U+524D  (kanji)

682	        UTF-8:  Maji???Koi??????5??????
683	        UTF-16: ??????????????????????????
684	        AMC-M:  bsm-Maji-r-Koi-b2m-5-z37cxuwp
685	        BRACE:  ji8-Maji-g-Koi-qe7x-5-wx7p6ma
686	        DUDE:   Mdhqpj067G06bvpj059obg035n9d2l24d
687	        RACE:   3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq
688	        LACE:   74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq

690	    (F) de  (Japanese song title)

692	         = U+30D1 U+30D5 U+30A3 U+30FC  (katakana)
693	         = U+30EB U+30F3 U+30D0         (katakana)

695	        UTF-16: ??????????????
696	        BRACE:  3iu8pazt-de-pygi
697	        AMC-M:  bs3jp4d9n-de-8m9di
698	        RACE:   gdi5li7475sp6zpl6pia
699	        DUDE:   j0d1lq3vcg064lj0ebv3t0
700	        UTF-8:  ????????????de?????????
701	        LACE:   aqyndvnd7qbaazdfamyox46q

703	    (G)   (Japanese song title)

705	            = U+305D U+306E                (hiragana)
706	         = U+30B9 U+30D4 U+30FC U+30C9  (katakana)
707	              = U+3067                       (hiragana)

709	        RACE:   gbow5oou7tewo
710	        UTF-16: ??????????????
711	        BRACE:  bidprdmp9wt7mi
712	        LACE:   a4yf23vz2t6mszy
713	        AMC-M:  bsmfyq5j7e9n6jr
714	        DUDE:   j05dmer9t4vcs9m7
715	        UTF-8:  ?????????????????????

717	    The next several examples are all translations of the sentence "Why
718	    can't they just speak in ?" (courtesy of Michael Kaplan's
719	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
720	    been removed, as is often done in domain names.

722	    (H) Arabic (Egyptian):
723	        U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
724	        U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

726	        DUDE:   m44qnli7oqk3kloj4phi8kahf
727	        BRACE:  28akcjwcmp3ciwb4t3ngd4nbaz
728	        AMC-M:  agiekhfuhuiukdefivevjvbuiktr
729	        RACE:   azceur2fe4ucuq2eivediojrfbfb6
730	        LACE:   cedeisshiutsqksdircuqnbzgeueuhy
731	        UTF-16: ??????????????????????????????????
732	        UTF-8:  ??????????????????????????????????

733	    (I) Chinese (simplified):
734	        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

736	        UTF-16: ??????????????????
737	        BRACE:  kgcqqsgp26i5h4zn7req5i
738	        AMC-M:  uqj7g8nvk6awispn9wupdnh
739	        DUDE:   ked6ucjas0k8gdobf4ke2dm587
740	        UTF-8:  ???????????????????????????
741	        LACE:   azhnn3b2ybea2aml6qau4libmwdq
742	        RACE:   3bhnmtxmjy5e5qcojbha3c7ujywwlby

744	    (J) Czech: Proprostnemluvesky

746	         = U+010D
747	         = U+011B
748	         = U+00ED

750	        UTF-8:  Pro??prost??nemluv????esky
751	        AMC-M:  g26-Pro-p-prost-9m-nemluv-6pp-esky
752	        BRACE:  i32-Pro-u-prost-8y-nemluv-29f3n-esky
753	        DUDE:   N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9
754	        UTF-16: ????????????????????????????????????????????
755	        LACE:   amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\
756	                lpe
757	        RACE:   ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\
758	                7mx7xh73l754q

760	    (K) Hebrew:
761	        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
762	        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
763	        U+05D1 U+05E8 U+05D9 U+05EA

765	        AMC-M:  af4nqeep8e8jfinaqdb8ijp8cb8ij8k
766	        DUDE:   ldcukktu4pt5osgujhu8t9tu2t1u8t9ua
767	        BRACE:  27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c
768	        RACE:   axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u
769	        LACE:   cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q
770	        UTF-8:  ????????????????????????????????????????????
771	        UTF-16: ????????????????????????????????????????????

773	    (L) Hindi:
774	        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
775	        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
776	        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
777	        U+0939 U+0948 U+0902  (Devanagari)

779	        BRACE:  2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32
780	        RACE:   bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai
781	        LACE:   dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba
782	        AMC-M:  ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc
783	        DUDE:   p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\
784	                8g2
785	        UTF-16: ???????????????????????????????????????????????????????\
786	                ?????
787	        UTF-8:  ???????????????????????????????????????????????????????\
788	                ???????????????????????????????????

789	    (M) Korean:
790	        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
791	        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
792	        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C  (Hangul syllables)

794	        UTF-16: ????????????????????????????????????????????????
795	        UTF-8:  ???????????????????????????????????????????????????????\
796	                ?????????????????
797	        AMC-M:  yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\
798	                p6aqi23ieemweywn
799	        BRACE:  y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\
800	                gmnsqydmk7im56arju6pt82
801	        LACE:   77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
802	                exj2mlpfzzcyjrsely5ck4ta
803	        RACE:   3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
804	                exj2mlpfzzcyjrsely5ck4ta
805	        DUDE:   s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\
806	                574lcr2e4q74s5bcr9c8g98s88bn44qe4c

808	    (N) Russian:
809	        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
810	        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
811	        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
812	        U+0438  (Cyrillic)

814	        DUDE:   K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao
815	        AMC-M:  aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca
816	        BRACE:  269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2
817	        RACE:   aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa
818	        LACE:   dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory
819	        UTF-16: ???????????????????????????????????????????????????????\
820	                ???
821	        UTF-8:  ???????????????????????????????????????????????????????
822	                ???

824	    (O) Spanish: PorqunopuedensimplementehablarenEspaol

826	         = U+00E9
827	         = U+00F1

829	        UTF-8:  Porqu??nopuedensimplementehablarenEspa??ol
830	        AMC-M:  aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol
831	        BRACE:  22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol
832	        DUDE:   N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\
833	                1mfs
834	        RACE:   abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\
835	                fonygd4lpnq
836	        LACE:   faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\
837	                omvzxayprn5wa
838	        UTF-16: ???????????????????????????????????????????????????????\
839	                ?????????????????????????

841	    (P) Taiwanese:
842	        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

843	        UTF-16: ??????????????????
844	        UTF-8:  ???????????????????????????
845	        AMC-M:  uqj7g2tbgtu6a385pspnxkupdnh
846	        BRACE:  kgcqui49gatc2wyrn8y7cndgte9
847	        RACE:   3bhnmuaroize5qe6xvha3cvkjywwlby
848	        LACE:   75hnmuaroize5qe6xvha3cvkjywwlby
849	        DUDE:   ked6l011n232kec0pebdke0doaaake2dm587

851	    (Q) Vietnamese:
852	        Taisaohokhngthchi\
853	        noitingVit

855	          = U+0323
856	             = U+00F4
857	             = U+00EA
858	         = U+0309
859	             = U+0301

861	        UTF-8:  Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t
862	        AMC-M:  ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\
863	                b-ngVi-s8kud-t
864	        BRACE:  i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\
865	                9c29-ngVi-25p48-t
866	        UTF-16: ???????????????????????????????????????????????????????\
867	                ?????????????????????
868	        DUDE:   N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\
869	                9n4m9uaj01g6enN6m9uaj23g74
870	        LACE:   aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\
871	                ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi
872	        RACE:   ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\
873	                kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\
874	                3u

876	    The last example is an ASCII string that breaks not only the
877	    existing rules for host name labels but also the rules proposed in
878	    [NAMEPREP02] for internationalized domain names.

880	    (R) -> $1.00 <-

882	        UTF-8:  -> $1.00 <-
883	        DUDE:   -jei0kj1iej0gi0jc-
884	        RACE:   aawt4ibegexdambahqwq
885	        LACE:   bmac2praeqys4mbqea6c2
886	        UTF-16: ??????????????????????
887	        AMC-M:  aae--vqae-1-q-00-avn--
888	        BRACE:  229--t2b4-1-w-00-i9i--

890	Security considerations

892	    Users expect each domain name in DNS to be controlled by a single
893	    authority.  If a Unicode string intended for use as a domain label
894	    could map to multiple ACE labels, then an internationalized domain
895	    name could map to multiple ACE domain names, each controlled by
896	    a different authority, some of which could be spoofs that hijack
897	    service requests intended for another.  Therefore AMC-ACE-M is
898	    designed so that each Unicode string has a unique encoding.

900	    However, there can still be multiple Unicode representations of the
901	    "same" text, for various definitions of "same".  This problem is
902	    addressed to some extent by the Unicode standard under the topic
903	    of canonicalization, but some text strings may be misleading or
904	    ambiguous to humans when used as domain names, such as strings
905	    containing dots, slashes, at-signs, etc.  These issues are being
906	    further studied under the topic of "nameprep" [NAMEPREP02].

908	References

910	    [ACEID01] Yoshiro Yoneya, Naomasa Maruyama, "Proposal for
911	    a determining process of ACE identifier", 2000-Dec-19,
912	    draft-ietf-idn-aceid-01.

914	    [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based
915	    ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19,
916	    draft-ietf-idn-brace-00.

918	    [DUDE00] Brian Spolarich, Mark Welter, "DUDE: Differential Unicode
919	    Domain Encoding", 2000-Nov-21, draft-ietf-idn-dude-00.

921	    [IDN] Internationalized Domain Names (IETF working group),
922	    http://www.i-d-n.net/, idn@ops.ietf.org.

924	    [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII
925	    Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01.

927	    [NAMEPREP02] Paul Hoffman, Marc Blanchet, "Preparation
928	    of Internationalized Host Names", 2001-Jan-17,
929	    draft-ietf-idn-nameprep-02.

931	    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
932	    http://www.trigeminal.com/samples/provincial.html.

934	    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
935	    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

937	    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
938	    Table Specification", 1985-Oct, RFC 952.

940	    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
941	    1987-Nov, RFC 1034.

943	    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
944	    "Requirements for Internet Hosts -- Application and Support",
945	    1989-Oct, RFC 1123.

947	    [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)",
948	    draft-ietf-idn-sace-*.

950	    [UNICODE] The Unicode Consortium, "The Unicode Standard",
951	    http://www.unicode.org/unicode/standard/standard.html.

953	    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
954	    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

956	    [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another
957	    ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*.

959	    [UTFCONV] Mark Davis, "UTF Converter",
960	    http://www.macchiato.com/unicode/convert.html.

962	Author

964	    Adam M. Costello 
965	    http://www.cs.berkeley.edu/~amc/

967	Example implementation

969	/******************************************/
970	/* amc-ace-m.c 0.1.0 (2001-Feb-12-Mon)    */
971	/* Adam M. Costello  */
972	/******************************************/

974	/* This is ANSI C code implementing AMC-ACE-M version 0.1.*. */

976	/************************************************************/
977	/* Public interface (would normally go in its own .h file): */

979	#include <limits.h>

981	enum amc_ace_status {
982	  amc_ace_success,
983	  amc_ace_invalid_input,
984	  amc_ace_output_too_big
985	};

987	enum case_sensitivity { case_sensitive, case_insensitive };

989	#if UINT_MAX >= 0x10FFFF
990	typedef unsigned int u_code_point;
991	#else
992	typedef unsigned long u_code_point;
993	#endif

995	int amc_ace_m_encode(
996	  unsigned int input_length,
997	  const u_code_point *input,
998	  const unsigned char *uppercase_flags,
999	  unsigned int *output_size,
1000	  unsigned char *output );
1001	    /* amc_ace_m_encode() converts Unicode to AMC-ACE-M.  The input  */
1002	    /* must be represented as an array of Unicode code points        */
1003	    /* (not code units; surrogate pairs are not allowed), and the    */
1004	    /* output will be represented as null-terminated ASCII.  The     */
1005	    /* input_length is the number of code points in the input.  The  */
1006	    /* output_size is an in/out argument: the caller must pass       */
1007	    /* in the maximum number of characters that may be output        */
1008	    /* (including the terminating null), and on successful return    */
1009	    /* it will contain the number of characters actually output      */
1010	    /* (including the terminating null, so it will be one more than  */
1011	    /* strlen() would return, which is why it is called output_size  */
1012	    /* rather than output_length).  The uppercase_flags array must   */
1013	    /* hold input_length boolean values, where nonzero means the     */
1014	    /* corresponding Unicode character should be forced to uppercase */
1015	    /* after being decoded, and zero means it is caseless or should  */
1016	    /* be forced to lowercase.  Alternatively, uppercase_flags may   */
1017	    /* be a null pointer, which is equivalent to all zeros.  The     */
1018	    /* letters a-z and A-Z are always encoded literally, regardless  */
1019	    /* of the corresponding flags.  The encoder always outputs       */
1020	    /* lowercase base-32 characters except when nonzero values       */
1021	    /* of uppercase_flags require otherwise, so the encoder is       */
1022	    /* compatible with any of the case models.  The return value     */
1023	    /* may be any of the amc_ace_status values defined above; if     */
1024	    /* not amc_ace_success, then output_size and output may contain  */
1025	    /* garbage.  On success, the encoder will never need to write an */
1026	    /* output_size greater than input_length*5+6, because of how the */
1027	    /* encoding is defined.                                          */

1029	int amc_ace_m_decode(
1030	  enum case_sensitivity case_sensitivity,
1031	  unsigned char *scratch_space,
1032	  const unsigned char *input,
1033	  unsigned int *output_length,
1034	  u_code_point *output,
1035	  unsigned char *uppercase_flags );
1036	    /* amc_ace_m_decode() converts AMC-ACE-M to Unicode.  The input   */
1037	    /* must be represented as null-terminated ASCII, and the output   */
1038	    /* will be represented as an array of Unicode code points.        */
1039	    /* The case_sensitivity argument influences the check on the      */
1040	    /* well-formedness of the input string; it must be case_sensitive */
1041	    /* if case-sensitive comparisons are allowed on encoded strings,  */
1042	    /* case_insensitive otherwise (see also section "Case sensitivity */
1043	    /* models" of the AMC-ACE-M specification).  The scratch_space    */
1044	    /* must point to space at least as large as the input, which will */
1045	    /* get overwritten (this allows the decoder to avoid calling      */
1046	    /* malloc()).  The output_length is an in/out argument: the       */
1047	    /* caller must pass in the maximum number of code points that     */
1048	    /* may be output, and on successful return it will contain the    */
1049	    /* actual number of code points output.  The uppercase_flags      */
1050	    /* array must have room for at least output_length values, or it  */
1051	    /* may be a null pointer if the case information is not needed.   */
1052	    /* A nonzero flag indicates that the corresponding Unicode        */
1053	    /* character should be forced to uppercase by the caller, while   */
1054	    /* zero means it is caseless or should be forced to lowercase.    */
1055	    /* The letters a-z and A-Z are output already in the proper case, */
1056	    /* but their flags will be set appropriately so that applying the */
1057	    /* flags would be harmless.  The return value may be any of the   */
1058	    /* amc_ace_status values defined above; if not amc_ace_success,   */
1059	    /* then output_length, output, and uppercase_flags may contain    */
1060	    /* garbage.  On success, the decoder will never need to write     */
1061	    /* an output_length greater than the length of the input (not     */
1062	    /* counting the null terminator), because of how the encoding is  */
1063	    /* defined.                                                       */

1065	/**********************************************************/
1066	/* Implementation (would normally go in its own .c file): */

1068	#include <string.h>

1070	/* Character utilities: */

1072	/* is_ldh(codept) returns 1 if the code point represents an LDH   */
1073	/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */

1075	static int is_ldh(u_code_point codept)
1076	{
1077	  if (codept ==  45) return 1;
1078	  if (codept <   48) return 0;
1079	  if (codept <=  57) return 1;
1080	  if (codept <   65) return 0;
1081	  if (codept <=  90) return 1;
1082	  if (codept <   97) return 0;
1083	  if (codept <= 122) return 1;
1084	  return 0;
1085	}

1087	/* is_AtoZ(c) returns 1 if c is an         */
1088	/* uppercase ASCII letter, zero otherwise. */
1089	static unsigned char is_AtoZ(unsigned char c)
1090	{
1091	  return c >= 65 && c <= 90;
1092	}

1094	/* special_row_offset[n] holds the offset of the       */
1095	/* bottom of special row 0xD8 + n, where n is in 0..7. */

1097	static u_code_point special_row_offset[] =
1098	  { 0x0020, 0x005B, 0x007B, 0x00A0, 0x00C0, 0x00DF, 0x0134, 0x0270 };

1100	/* base32[n] is the lowercase base-32 character representing  */
1101	/* the number n from the range 0 to 31.  Note that we cannot  */
1102	/* use string literals for ASCII characters because an ANSI C */
1103	/* compiler does not necessarily use ASCII.                   */

1105	static const unsigned char base32[] = {
1106	  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
1107	  109, 110,                                               /* m-n */
1108	  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
1109	  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
1110	};

1112	/* base32_decode(c) returns the value of a base-32 character, in the */
1113	/* range 0 to 31, or the constant base32_invalid if c is not a valid */
1114	/* base-32 character.                                                */

1116	enum { base32_invalid = 32 };

1118	static unsigned int base32_decode(unsigned char c)
1119	{
1120	  if (c < 50) return base32_invalid;
1121	  if (c <= 57) return c - 26;
1122	  if (c < 97) c += 32;
1123	  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
1124	  return c - 97 - (c > 108) - (c > 111);
1125	}

1127	/* unequal(case_sensitivity,a1,a2,n) returns 0 if the arrays   */
1128	/* a1 and a2 are equal in the first n positions, 1 otherwise.  */
1129	/* If case_sensitivity is case_insensitive, then ASCII A-Z are */
1130	/* considered equal to a-z respectively.                       */

1132	static int unequal(
1133	  enum case_sensitivity case_sensitivity,
1134	  const unsigned char *a1,
1135	  const unsigned char *a2,
1136	  unsigned int n )
1137	{
1138	  const unsigned char *end;
1139	  unsigned char c1, c2;
1140	  if (case_sensitivity != case_insensitive) return memcmp(a1,a2,n) != 0;

1142	  for (end = a1 + n;  a1 < end;  ++a1, ++a2) {
1143	    c1 = *a1;
1144	    c2 = *a2;
1145	    if (c1 >= 65 && c1 <= 90) c1 += 32;
1146	    if (c2 >= 65 && c2 <= 90) c2 += 32;
1147	    if (c1 != c2) return 1;
1148	  }

1150	  return 0;
1151	}

1153	/* Encoder: */

1155	int amc_ace_m_encode(
1156	  unsigned int input_length,
1157	  const u_code_point *input,
1158	  const unsigned char *uppercase_flags,
1159	  unsigned int *output_size,
1160	  unsigned char *output )
1161	{
1162	  unsigned int literal, wide;  /* boolean */
1163	  u_code_point codept, n, diff, morebits;
1164	  u_code_point A, B, C, offsetA, offsetB, offsetC, offset;
1165	  const u_code_point *input_end, *p, *pp;
1166	  unsigned int count, max, next_in, next_out, max_out, codelen, i;
1167	  unsigned char c;

1169	  input_end = input + input_length;

1171	  /* 1) Verify that only valid code points appear: */

1173	  for (p = input;  p < input_end;  ++p) {
1174	    if (*p >> 11 == 0x1B || *p > 0x10FFFF) return amc_ace_invalid_input;
1175	  }

1177	  /* 2) Determine the most populous row: B and offsetB */

1179	  /* first check the special rows: */

1181	  B = 0xD8;
1182	  offsetB = special_row_offset[0];
1183	  max = 0;

1185	  for (n = 0;  n < 8;  ++n) {
1186	    offset = special_row_offset[n];
1187	    count = 0;

1189	    for (p = input;  p < input_end;  ++p) {
1190	      if (*p - offset <= 0xFF && !is_ldh(*p)) ++count;
1191	    }
1192	    if (count > max) {
1193	      B = 0xD8 + n;
1194	      offsetB = offset;
1195	      max = count;
1196	    }
1197	  }

1199	  /* now check the regular rows: */

1201	  for (pp = input;  pp < input_end;  ++pp) {
1202	    n = *pp >> 8;
1203	    count = 0;

1205	    for (p = input;  p < input_end;  ++p) {
1206	      if (*p >> 8 == n && !is_ldh(*p)) ++count;
1207	    }

1209	    if (count > max || (count == max && n < B)) {
1210	      B = n;
1211	      offsetB = n << 8;
1212	      max = count;
1213	    }
1214	  }

1216	  /* 3) Determine the most populous 16-window: A and offsetA */

1218	  A = 0;
1219	  max = 0;

1221	  for (n = 0;  n <= 0x1F;  ++n) {
1222	    offset = ((offsetB >> 3) + n) << 3;
1223	    count = 0;

1225	    for (p = input;  p < input_end;  ++p) {
1226	      if (*p - offset <= 0xF && !is_ldh(*p)) ++count;
1227	    }

1229	    if (count > max) {
1230	      A = n;
1231	      offsetA = offset;
1232	      max = count;
1233	    }
1234	  }

1236	  /* 4) Determine the most populous 20k-window: C */

1238	  C = 0;
1239	  max = 0;

1241	  for (pp = input;  pp < input_end;  ++pp) {
1242	    count = 0;
1243	    n = *pp >> 11;
1244	    offset = n << 11;

1246	    for (p = input;  p < input_end;  ++p) {
1247	      if (*p - offset <= 0x4FFF && !is_ldh(*p)) ++count;
1248	      if (count > max || (count == max && n < C)) {
1249	        C = n;
1250	        max = count;
1251	      }
1252	    }
1253	  }

1255	  /* 5) Determine the style to use: wide or narrow */

1257	  /* if narrow style were used: */

1259	  offsetC = (offsetB >> 12) << 12;
1260	  count = 3 + (B > 0xFF);

1262	  for (p = input;  p < input_end;  ++p) {
1263	    if (is_ldh(*p)) { }
1264	    else if (*p - offsetA <= 0xF) count += 1;
1265	    else if (*p - offsetB <= 0xFF) count += 2;
1266	    else if (*p - offsetC <= 0xFFF) count += 3;
1267	    else if (*p <= 0xFFFF) count += 4;
1268	    else count += 5;
1269	  }

1271	  max = count;

1273	  /* if wide style were used: */

1275	  offsetC = C << 11;
1276	  count =  B <= 0xFF && C <= 0x1F ?  3 :  5;

1278	  for (p = input;  p < input_end;  ++p) {
1279	    if (is_ldh(*p)) { }
1280	    else if (*p - offsetB <= 0xFF) count += 2;
1281	    else if (*p - offsetC <= 0x4FFF) count += 3;
1282	    else if (*p <= 0xFFFF) count += 4;
1283	    else count += 5;
1284	  }

1286	  wide = (count < max);

1288	  /* 6) Initialize offsetC, and encode the style and offsets: */

1290	  max_out = *output_size;
1291	  next_out = 0;

1293	  if (wide) {
1294	    offsetC = C << 11;
1295	    if (B <= 0xFF && C <= 0x1F) {
1296	      if (max_out - next_out < 3) return amc_ace_output_too_big;
1297	      output[next_out++] = base32[0x10 | (B >> 5)];
1298	      output[next_out++] = base32[B & 0x1F];
1299	      output[next_out++] = base32[C];
1300	    }
1301	    else {
1302	      if (max_out - next_out < 5) return amc_ace_output_too_big;
1303	      output[next_out++] = base32[0x18 | (B >> 10)];
1304	      output[next_out++] = base32[(B >> 5) & 0x1F];
1305	      output[next_out++] = base32[B & 0x1F];
1306	      output[next_out++] = base32[C >> 5];
1307	      output[next_out++] = base32[C & 0x1F];
1308	    }
1309	  }
1310	  else {
1311	    offsetC = (offsetB >> 12) << 12;

1313	    if (B <= 0xFF) {
1314	      if (max_out - next_out < 3) return amc_ace_output_too_big;
1315	      output[next_out++] = base32[B >> 5];
1316	      output[next_out++] = base32[B & 0x1F];
1317	    }
1318	    else {
1319	      if (max_out - next_out < 4) return amc_ace_output_too_big;
1320	      output[next_out++] = base32[8 | (B >> 10)];
1321	      output[next_out++] = base32[(B >> 5) & 0x1F];
1322	      output[next_out++] = base32[B & 0x1F];
1323	    }

1325	    output[next_out++] = base32[A];
1326	  }

1328	  /* 7) Main encoding loop: */

1330	  literal = 0;

1332	  for (next_in = 0;  next_in < input_length;  ++next_in) {
1333	    codept = input[next_in];

1335	    if (codept == 45 /* hyphen-minus */) {
1336	      /* case 7.1 */
1337	      if (max_out - next_out < 2) return amc_ace_output_too_big;
1338	      output[next_out++] = 45;
1339	      output[next_out++] = 45;
1340	      continue;
1341	    }

1343	    if (is_ldh(codept)) {
1344	      /* case 7.2 */
1345	      if (!literal) {
1346	        if (max_out - next_out < 1) return amc_ace_output_too_big;
1347	        output[next_out++] = 45;
1348	        literal = 1;
1349	      }
1350	      if (max_out - next_out < 1) return amc_ace_output_too_big;
1351	      output[next_out++] = codept;
1352	      continue;
1353	    }

1355	    /* case 7.3 */

1357	    if (literal) {
1358	      if (max_out - next_out < 1) return amc_ace_output_too_big;
1359	      output[next_out++] = 45;
1360	      literal = 0;
1361	    }

1363	    if (!wide) {
1364	      diff = codept - offsetA;

1366	      if (diff <= 0xF) {
1367	        /* case 7.3.1 */
1368	        codelen = 1;
1369	        goto encoder_base32_bottom;
1370	      }
1371	    }

1373	    diff = codept - offsetB;

1375	    if (diff <= 0xFF) {
1376	      /* case 7.3.2 */
1377	      codelen = 2;
1378	      goto encoder_base32_bottom;
1379	    }

1381	    diff = codept - offsetC;

1383	    if (diff <= 0xFFF) {
1384	      /* case 7.3.3 */
1385	      codelen = 3;
1386	      goto encoder_base32_bottom;
1387	    }

1389	    if (wide) {
1390	      diff = codept - offsetC - 0x1000;

1392	      if (diff <= 0x3FFF) {
1393	        /* case 7.3.4 */
1394	        codelen = 1;
1395	        morebits = diff & 0x3FF;
1396	        diff >>= 10;
1397	        goto encoder_base32_bottom;
1398	      }
1399	    }

1401	    if (codept <= 0xFFFF) {
1402	      /* case 7.3.5 */
1403	      diff = codept;
1404	      codelen = 4;
1405	      goto encoder_base32_bottom;
1406	    }
1407	    /* case 7.3.6 */
1408	    diff = codept - 0x10000;
1409	    codelen =  5;

1411	  encoder_base32_bottom: /* output diff as codelen base-32 digits: */
1412	    if (max_out - next_out < codelen) return amc_ace_output_too_big;
1413	    i = codelen - 1;
1414	    c = base32[diff & 0xF];
1415	    if (uppercase_flags && uppercase_flags[next_in]) c -= 32;
1416	    output[next_out + i] = c;

1418	    while (i > 0) {
1419	      diff >>= 4;
1420	      output[next_out + --i] = base32[0x10 | (diff & 0xF)];
1421	    }

1423	    next_out += codelen;

1425	    if (wide && codelen == 1) {
1426	      /* case 7.3.4 */
1427	      if (max_out - next_out < 2) return amc_ace_output_too_big;
1428	      output[next_out++] = base32[morebits >> 5];
1429	      output[next_out++] = base32[morebits & 0x1F];
1430	    }
1431	  }

1433	  /* null terminator: */
1434	  if (max_out - next_out < 1) return amc_ace_output_too_big;
1435	  output[next_out++] = 0;
1436	  *output_size = next_out;
1437	  return amc_ace_success;
1438	}
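
The base-32 output step at the encoder_base32_bottom label can be read in isolation. The following is a minimal sketch, not part of the reference code: the name emit_code is hypothetical, and the 32-character alphabet is an assumption standing in for the base32[] table defined earlier in this listing.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alphabet; the real table is base32[] in the listing. */
static const char sketch_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Write diff as codelen base-32 characters into out.  Each       */
/* character carries 4 bits of diff, most significant first.      */
/* Every character except the last has bit 0x10 set, meaning      */
/* "more follows"; the last has it clear and may carry the        */
/* uppercase flag.  Returns the number of characters written.     */
static size_t emit_code(unsigned long diff, size_t codelen,
                        int uppercase, char *out)
{
  size_t i;
  char c;

  assert(codelen >= 1 && codelen <= 5);
  i = codelen - 1;
  c = sketch_base32[diff & 0xF];  /* index 0..15 is always a letter */
  if (uppercase) c -= 32;         /* ASCII lowercase to uppercase */
  out[i] = c;

  while (i > 0) {
    diff >>= 4;
    out[--i] = sketch_base32[0x10 | (diff & 0xF)];
  }

  return codelen;
}
```

Under the assumed alphabet, emit_code(0x123, 3, 0, out) writes the three characters "tud": the first two have bit 0x10 set and the last does not.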

1440	/* Decoder: */

1442	int amc_ace_m_decode(
1443	  enum case_sensitivity case_sensitivity,
1444	  unsigned char *scratch_space,
1445	  const unsigned char *input,
1446	  unsigned int *output_length,
1447	  u_code_point *output,
1448	  unsigned char *uppercase_flags )
1449	{
1450	  unsigned int literal, wide, large;  /* boolean */
1451	  const unsigned char *next_in;
1452	  unsigned char c;
1453	  unsigned int next_out, max_out, codelen, input_size, scratch_size;
1454	  u_code_point q, B, offsets[6], diff, offset;
1455	  enum amc_ace_status status;

1456	  /* 1) Decode the style and offsets: */

1458	  next_in = input;
1459	  q = base32_decode(*next_in++);
1460	  if (q == base32_invalid) return amc_ace_invalid_input;
1461	  wide = q >> 4;
1462	  large = (q >> 3) & 1;
1463	  B = q & 7;
1464	  q = base32_decode(*next_in++);
1465	  if (q == base32_invalid) return amc_ace_invalid_input;
1466	  B = (B << 5) | q;

1468	  if (large) {
1469	    q = base32_decode(*next_in++);
1470	    if (q == base32_invalid) return amc_ace_invalid_input;
1471	    B = (B << 5) | q;
1472	  }

1474	  /* offsets[codelen] is for base-32 codes with codelen characters */
1475	  /* (not counting the extra two in wide-style 0xxxx xxxxx xxxxx)  */

1477	  offsets[2] = B >> 3 == 0x1B ? special_row_offset[B & 7] : B << 8;
1478	  q = base32_decode(*next_in++);
1479	  if (q == base32_invalid) return amc_ace_invalid_input;

1481	  if (!wide) {
1482	    offsets[1] = ((offsets[2] >> 3) + q) << 3;
1483	    offsets[3] = (offsets[2] >> 12) << 12;
1484	  }
1485	  else {
1486	    offset = q << 11;

1488	    if (large) {
1489	      q = base32_decode(*next_in++);
1490	      if (q == base32_invalid) return amc_ace_invalid_input;
1491	      offset = (offset << 5) | (q << 11);
1492	    }

1494	    offsets[3] = offset;
1495	    offsets[1] = offset + 0x1000;
1496	  }

1498	  offsets[4] = 0;
1499	  offsets[5] = 0x10000;

1501	  /* 2) Main decoding loop: */

1503	  max_out = *output_length;
1504	  next_out = 0;
1505	  literal = 0;

1507	  for (;;) {
1508	    c = *next_in++;
1509	    if (!c) break;
1510	    if (c == 45 /* hyphen-minus */) {
1511	      if (*next_in == 45) {
1512	        /* case 2.1: "--" decodes to "-" */
1513	        ++next_in;
1514	        if (max_out - next_out < 1) return amc_ace_output_too_big;
1515	        if (uppercase_flags) uppercase_flags[next_out] = 0;
1516	        output[next_out++] = 45;
1517	        continue;
1518	      }

1520	      /* case 2.2: unpaired hyphen-minus toggles mode */
1521	      literal = !literal;
1522	      continue;
1523	    }

1525	    if (!is_ldh(c)) return amc_ace_invalid_input;
1526	    if (max_out - next_out < 1) return amc_ace_output_too_big;

1528	    if (literal) {
1529	      /* case 2.3: literal letter/digit */
1530	      if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
1531	      output[next_out++] = c;
1532	      continue;
1533	    }

1535	    /* case 2.4: base-32 sequence */

1537	    diff = 0;
1538	    codelen = 1;

1540	    for (;;) {
1541	      q = base32_decode(c);
1542	      if (q == base32_invalid) return amc_ace_invalid_input;
1543	      diff = (diff << 4) | (q & 0xF);
1544	      if ((q & 0x10) == 0) break;
1545	      if (++codelen > 5) return amc_ace_invalid_input;
1546	      c = *next_in++;
1547	    }

1549	    /* Now codelen is the number of input characters read, */
1550	    /* and c is the character holding the uppercase flag.  */

1552	    if (wide && codelen == 1) {
1553	      q = base32_decode(*next_in++);
1554	      if (q == base32_invalid) return amc_ace_invalid_input;
1555	      diff = (diff << 5) | q;
1556	      q = base32_decode(*next_in++);
1557	      if (q == base32_invalid) return amc_ace_invalid_input;
1558	      diff = (diff << 5) | q;
1559	    }

1561	    offset = offsets[codelen];
1562	    if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
1563	    output[next_out++] = offset + diff;
1564	  }

1565	  /* 3) Re-encode the output and compare to the input: */

1567	  input_size = next_in - input;
1568	  scratch_size = input_size;
1569	  status = amc_ace_m_encode(next_out, output, uppercase_flags,
1570	                            &scratch_size, scratch_space);
1571	  if (status != amc_ace_success ||
1572	      scratch_size != input_size ||
1573	      unequal(case_sensitivity, scratch_space, input, input_size)
1574	     ) return amc_ace_invalid_input;
1575	  *output_length = next_out;
1576	  return amc_ace_success;
1577	}
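
The inner loop of decoder case 2.4 reads one variable-length base-32 code. The sketch below isolates that step under stated assumptions: read_code, sketch_decode_char, and the alphabet are hypothetical stand-ins for the listing's base32_decode() and base32[] table, and case folding of uppercase letters is assumed because the decoder derives uppercase flags from the input characters.

```c
#include <stddef.h>
#include <string.h>

/* Assumed alphabet; the real table is base32[] in the listing. */
static const char sketch_alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Map a character to its 5-bit value, or -1 if it is not valid. */
static int sketch_decode_char(int ch)
{
  const char *pos;

  if (ch >= 'A' && ch <= 'Z') ch += 32;  /* fold ASCII uppercase */
  if (ch == 0) return -1;      /* strchr would match the terminator */
  pos = strchr(sketch_alphabet, ch);
  return pos ? (int) (pos - sketch_alphabet) : -1;
}

/* Read one code starting at s.  Characters with bit 0x10 set are */
/* followed by more characters; the first one with it clear ends  */
/* the code.  Stores the accumulated value in *diff and returns   */
/* the number of characters consumed (1..5), or 0 if invalid.     */
static size_t read_code(const char *s, unsigned long *diff)
{
  size_t codelen = 0;
  int q;

  *diff = 0;

  for (;;) {
    q = sketch_decode_char((unsigned char) s[codelen]);
    if (q < 0) return 0;
    ++codelen;
    *diff = (*diff << 4) | (unsigned long) (q & 0xF);
    if ((q & 0x10) == 0) return codelen;
    if (codelen == 5) return 0;  /* a sixth character is not allowed */
  }
}
```

The caller would then add offsets[codelen] to the value, as the decoder above does; the two extra characters of the wide-style one-character form are outside this sketch.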

1579	/******************************************************************/
1580	/* Wrapper for testing (would normally go in a separate .c file): */

1582	#include <assert.h>
1583	#include <stdio.h>
1584	#include <stdlib.h>
1585	#include <string.h>

1587	/* For testing, we'll just set some compile-time limits rather than */
1588	/* use malloc(), and set a compile-time option rather than using a  */
1589	/* command-line option.                                             */

1591	enum {
1592	  unicode_max_length = 256,
1593	  ace_max_size = 256,
1594	  test_case_sensitivity = case_insensitive
1595	};

1597	static void usage(char **argv)
1598	{
1599	  fprintf(stderr,
1600	    "%s -e reads big-endian UTF-32 and writes AMC-ACE-M ASCII.\n"
1601	    "%s -d reads AMC-ACE-M ASCII and writes big-endian UTF-32.\n"
1602	    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
1603	    , argv[0], argv[0]);
1604	  exit(EXIT_FAILURE);
1605	}

1607	static void fail(const char *msg)
1608	{
1609	  fputs(msg,stderr);
1610	  exit(EXIT_FAILURE);
1611	}

1613	static const char too_large[] =
1614	  "input or output is too large, recompile with larger limits\n";

1616	static const char invalid_input[] = "invalid input\n";

1617	int main(int argc, char **argv)
1618	{
1619	  enum amc_ace_status status;

1621	  if (argc != 2) usage(argv);
1622	  if (argv[1][0] != '-') usage(argv);
1623	  if (argv[1][1] == '\0' || argv[1][2] != '\0') usage(argv);

1625	  if (argv[1][1] == 'e') {
1626	    u_code_point input[unicode_max_length];
1627	    unsigned char uppercase_flags[unicode_max_length];
1628	    unsigned char output[ace_max_size];
1629	    unsigned int input_length, output_size;
1630	    int c0, c1, c2, c3;

1632	    /* Read the UTF-32 input string: */

1634	    input_length = 0;

1636	    for (;;) {
1637	      c0 = getchar();
1638	      c1 = getchar();
1639	      c2 = getchar();
1640	      c3 = getchar();

1642	      if (c1 == EOF || c2 == EOF || c3 == EOF) {
1643	        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
1644	        break;
1645	      }

1647	      if (input_length == unicode_max_length) fail(too_large);

1649	      if ((c0 != 0 && c0 != 0x80)
1650	          || c1 < 0 || c1 > 0x10
1651	          || c2 < 0 || c2 > 0xFF
1652	          || c3 < 0 || c3 > 0xFF ) {
1653	        fail(invalid_input);
1654	      }

1656	      input[input_length] = ((u_code_point) c1 << 16) |
1657	                            ((u_code_point) c2 <<  8) | (u_code_point) c3;
1658	      uppercase_flags[input_length] = (c0 >> 7);
1659	      ++input_length;
1660	    }

1662	    /* Encode, and output the result: */

1664	    output_size = ace_max_size;
1665	    status = amc_ace_m_encode(input_length, input, uppercase_flags,
1666	                              &output_size, output);
1667	    if (status == amc_ace_invalid_input) fail(invalid_input);
1668	    if (status == amc_ace_output_too_big) fail(too_large);
1669	    assert(status == amc_ace_success);
1670	    fputs((char *) output, stdout);
1671	    return EXIT_SUCCESS;
1672	  }
1673	  if (argv[1][1] == 'd') {
1674	    unsigned char input[ace_max_size], scratch[ace_max_size];
1675	    u_code_point output[unicode_max_length], codept;
1676	    unsigned char uppercase_flags[unicode_max_length];
1677	    unsigned int output_length, i;
1678	    size_t n;

1680	    /* Read the AMC-ACE-M ASCII input string: */

1682	    n = fread(input, 1, ace_max_size, stdin);
1683	    if (n == ace_max_size) fail(too_large);
1684	    input[n] = 0;

1686	    /* Decode, and output the result: */

1688	    output_length = unicode_max_length;
1689	    status = amc_ace_m_decode(test_case_sensitivity, scratch, input,
1690	                              &output_length, output, uppercase_flags);
1691	    if (status == amc_ace_invalid_input) fail(invalid_input);
1692	    if (status == amc_ace_output_too_big) fail(too_large);
1693	    assert(status == amc_ace_success);

1695	    for (i = 0;  i < output_length;  ++i) {
1696	      putchar(uppercase_flags[i] ? 0x80 : 0);
1697	      codept = output[i];
1698	      putchar(codept >> 16);
1699	      putchar((codept >> 8) & 0xFF);
1700	      putchar(codept & 0xFF);
1701	    }

1703	    return EXIT_SUCCESS;
1704	  }

1706	  usage(argv);
1707	  return EXIT_SUCCESS;  /* not reached, but quiets a compiler warning */
1708	}
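
The test wrapper's extended UTF-32 format (big-endian, with bit 31 as the force-to-uppercase flag) can be summarized by a small helper. This is only an illustrative sketch: unpack_unit is a hypothetical name, and it uses unsigned long where the listing uses u_code_point.

```c
/* Split one 4-byte big-endian unit of the wrapper's extended     */
/* UTF-32 into a code point and an uppercase flag.  Returns 1 on  */
/* success, or 0 if the bytes fail the same range checks that     */
/* the wrapper applies (first byte 0x00 or 0x80, second byte at   */
/* most 0x10).                                                    */
static int unpack_unit(const unsigned char b[4],
                       unsigned long *codept, int *uppercase)
{
  if ((b[0] != 0 && b[0] != 0x80) || b[1] > 0x10) return 0;

  *codept = ((unsigned long) b[1] << 16) |
            ((unsigned long) b[2] <<  8) |
             (unsigned long) b[3];
  *uppercase = b[0] >> 7;   /* bit 31 of the 32-bit unit */
  return 1;
}
```

For instance, the bytes 80 00 00 41 unpack to code point 0x41 with the uppercase flag set.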

1710	                   INTERNET-DRAFT expires 2001-Aug-12