idnits 2.17.1

draft-ietf-idn-amc-ace-r-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 3
     longer pages, the longest (page 9) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Introduction section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([UNICODE], [IDNA], [IDN]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 1106

  -- Looks like a reference, but probably isn't: '2' on line 1054

  -- Looks like a reference, but probably isn't: '6' on line 931

  -- Looks like a reference, but probably isn't: '0' on line 1078

  -- Looks like a reference, but probably isn't: '3' on line 1060

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEM' 

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEW' 

  -- Possible downref: Normative reference to a draft: ref. 'BRACE' 

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-01

  -- Possible downref: Normative reference to a draft: ref. 'DUDE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  == Outdated reference: A later version (-13) exists of
     draft-ietf-idn-idna-01

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  -- Possible downref: Normative reference to a draft: ref. 'RACE03' 

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS6'


     Summary: 6 errors (**), 0 flaws (~~), 5 warnings (==), 20 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-amc-ace-r-01.txt                              2001-May-31
3	Expires 2001-Nov-30

5	                         AMC-ACE-R version 0.2.1

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/amc-ace-r

34	Abstract

36	    AMC-ACE-R is a reversible transformation from a sequence of Unicode
37	    [UNICODE] code points to a sequence of letters, digits, and hyphens
38	    (LDH characters).  AMC-ACE-R could be used as an ASCII-Compatible
39	    Encoding (ACE) for internationalized domain names [IDN] [IDNA].

41	    Besides domain names, there might also be other contexts where it is
42	    useful to transform Unicode characters into "safe" (delimiter-free)
43	    ASCII characters.  (If other contexts consider hyphens to be
44	    unsafe, a different character could be used to play its role, like
45	    underscore.)

46	Contents

48	    Technical changes from earlier versions
49	    Features
50	    Name
51	    Terminology
52	    Description
53	    Base-32 characters
54	    Encoding and decoding algorithms
55	    Signature
56	    Mixed-case annotation
57	    Comparison with other ACEs
58	    Example strings
59	    Security considerations
60	    Acknowledgements
61	    References
62	    Author
63	    Example implementation

65	Technical changes from earlier versions

67	    From 0.0.x to 0.1.x:

69	        In update(), the test "latest - first == 1" was a bug, changed
70	        to "latest == first".

72	        In encode(), "codepoint" was a misspelling of "input[i]".

74	        Initializing refpoint[1] to 0x60 was a design flaw, because this
75	        initial value is useless for nameprep'd strings.  The initial
76	        value is changed to 0xE0.

78	        encode() now fails if input[i] is not in 0..10FFFF, in order to
79	        avoid an array bounds error.  This does not affect the encoding
80	        of valid Unicode strings.

82	    From 0.1.x to 0.2.x:

84	        The test "latest == first" tests for the first code point, but
85	        the intention was to test for the first non-LDH code point,
86	        or equivalently, for the first time update() is called.  The
87	        boolean flag "updated" has been introduced for performing the
88	        proper test.  This alters the encoding of some strings, usually
89	        for the better.

91	        The initial value of refpoint[2] has been changed from 0 to
92	        0xA0, which is more useful in light of nameprep's prohibition of
93	        non-LDH code points below 0xA0.

95	        In decode() the number of base-32 characters consumed has been
96	        limited to 5, to avoid an array bounds error on invalid input.

98	Features

100	    Completeness:  Every Unicode string maps to an LDH string.
101	    Restrictions on which Unicode strings are allowed, and on length,
102	    may be imposed by higher layers.

104	    Uniqueness:  Every Unicode string maps to at most one LDH string.

106	    Reversibility:  Any Unicode string mapped to an LDH string can be
107	    recovered from that LDH string.

109	    Efficient encoding:  The ratio of encoded size to original size is
110	    small for all Unicode strings.  This is important in the context
111	    of domain names because [RFC1034] restricts the length of a domain
112	    label to 63 characters.

114	    Simplicity:  The encoding and decoding algorithms are reasonably
115	    simple to implement.  The goals of efficiency and simplicity are at
116	    odds; AMC-ACE-R aims at a reasonable balance between them.

118	    Mixed-case annotation:  Even if the Unicode string has been
119	    case-folded prior to encoding, it is possible to use mixed case
120	    in the encoded string as an annotation telling how to convert the
121	    folded Unicode string into a mixed-case Unicode string for display
122	    purposes.  This feature is optional; see section "Mixed-case
123	    annotation".

125	    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
126	    in the Unicode string are represented as themselves in the label.
127	    This comes for free because it is usually the most efficient encoding
128	    anyway.

130	Name

132	    AMC-ACE-R is a working name that should be changed if it is adopted.
133	    (The R merely indicates that it is the eighteenth ACE devised by
134	    this author.  BRACE was the third.  Most were not worth releasing.)
135	    Rather than waste good names on experimental proposals, let's
136	    wait until one proposal is chosen, then assign it a good name.
137	    Suggestions:

139	        UniHost
140	        NUDE (Normal Unicode Domain Encoding)
141	        UTF-D ("D" for "domain names")
142	        UTF-37 (there are 37 characters in the output repertoire)

144	Terminology

146	    LDH characters are the letters A-Z and a-z, the digits 0-9, and
147	    hyphen-minus.

149	    A quartet is a sequence of four bits (also known as a nibble or
150	    nybble).

152	    A quintet is a sequence of five bits.

154	    Hexadecimal values are shown preceded by "0x".  For example, 0x60
155	    is decimal 96.

157	    As in the Unicode Standard [UNICODE], Unicode code points are
158	    denoted by "U+" followed by four to six hexadecimal digits, while a
159	    range of code points is denoted by two hexadecimal numbers separated
160	    by "..", with no prefixes.

162	    "x..y" means the range of integers x through y inclusive.

164	    "x << y" means x left-shifted by y bits (equivalent to x times
165	    2 to the power y), and "x >> y" means x right-shifted by y bits
166	    (equivalent to x divided by 2 to the power y, discarding the
167	    remainder).  These operations are used only with nonnegative
168	    integral values.

170	Description

172	    AMC-ACE-R represents a sequence of Unicode code points as a sequence
173	    of LDH characters, although implementations will also need to
174	    represent the LDH characters somehow, typically as ASCII octets.
175	    The encoder input and decoder output are arrays of Unicode code
176	    points (integral values in the range 0..10FFFF, but not D800..DFFF,
177	    which are reserved for use by UTF-16).

179	    This section describes the representation.  Section "Encoding
180	    and decoding algorithms" presents the algorithms as commented
181	    pseudocode.  There is also commented C code in section "Example
182	    implementation".

184	    The encoded string alternates between two modes: literal mode and
185	    base-32 mode.  Unicode code points representing LDH characters
186	    are encoded as those LDH characters, except that hyphen-minus is
187	    doubled.  Other Unicode code points are encoded as one or more LDH
188	    characters using base-32, in which each character of the encoded
189	    string represents a quintet according to the table in section
190	    "Base-32 characters".  A mode change is indicated by an unpaired
191	    hyphen-minus.  A pair of consecutive hyphen-minuses represents a
192	    hyphen-minus and does not change the mode.
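
    As an illustration of the hyphen-minus rule just described, the
    following sketch scans an encoded string and counts mode switches
    (unpaired hyphen-minuses) and literal hyphen-minuses (pairs).  The
    function name amc_r_scan_hyphens is hypothetical and is not part of
    this specification; pairing is done greedily from left to right,
    which is how the decoder pseudocode reads its input.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.  Counts the unpaired
   hyphen-minuses (mode switches) and the "--" pairs (literal
   hyphen-minuses) in a NUL-terminated encoded string. */
void amc_r_scan_hyphens(const char *s, int *toggles, int *hyphens)
{
    size_t i = 0;

    *toggles = 0;
    *hyphens = 0;
    while (s[i] != '\0') {
        if (s[i] == '-' && s[i + 1] == '-') {
            ++*hyphens;          /* pair: one hyphen-minus, same mode */
            i += 2;
        } else if (s[i] == '-') {
            ++*toggles;          /* unpaired: literal <-> base-32 */
            i += 1;
        } else {
            i += 1;              /* ordinary literal or base-32 char */
        }
    }
}
```

    Applied to the encoding of example (N) in section "Example
    strings", this reports two mode switches and three literal
    hyphen-minuses.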

194	    In base-32 mode a variable-length code sequence of one to five
195	    quintets represents a delta, which is added to a reference point to
196	    yield a Unicode code point.  There are five reference points, one
197	    for each code length.  The delta is represented by the lowest four
198	    bits of each quintet.  The highest bit of each quintet is 1, except
199	    for the last quintet, where it is 0, allowing the decoder to detect
200	    the end of the sequence.

202	    Code sequences:
203	        delta from reference point 1: 0xxxx
204	        delta from reference point 2: 1xxxx 0xxxx
205	        delta from reference point 3: 1xxxx 1xxxx 0xxxx
206	        delta from reference point 4: 1xxxx 1xxxx 1xxxx 0xxxx
207	        delta from reference point 5: 1xxxx 1xxxx 1xxxx 1xxxx 0xxxx
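
    The layout above can be sketched in C as follows.  This is a
    minimal illustration, not the author's implementation; the name
    amc_r_delta_to_quintets is an assumption, and the caller is assumed
    to pass 1 <= k <= 5 and a delta that fits in 4*k bits.

```c
#include <stdint.h>

/* Illustrative sketch: split delta into k quartets, most significant
   first, and turn each into a quintet.  The high bit is 1 on every
   quintet except the last, where it is 0. */
void amc_r_delta_to_quintets(uint32_t delta, int k, uint8_t out[5])
{
    int j;

    for (j = 0; j < k; ++j) {
        uint32_t quartet = (delta >> (4 * (k - 1 - j))) & 0xF;
        out[j] = (uint8_t)(quartet | (j < k - 1 ? 0x10u : 0u));
    }
}
```

    For example, delta 0x4ED6 with k = 4 yields the quintets 10100,
    11110, 11101, 00110, which the table in section "Base-32
    characters" maps to "w87g".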

209	    For reference point k, the delta is constrained by the available
210	    bits to range from 0 to (1 << (4*k)) - 1, so each reference point is
211	    the bottom of a window of 1 << (4*k) code points.  A code point is
212	    encoded as an offset into the smallest window that contains it.
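
    A sketch of the window selection, under the assumption that the
    reference points are kept in an array indexed 1..5 (element 0 is
    unused so that indices match the text).  The function name
    amc_r_choose_window is hypothetical.

```c
#include <stdint.h>

/* Illustrative sketch: return the smallest k in 1..5 whose window
   [refpoint[k], refpoint[k] + (1 << (4*k)) - 1] contains cp, and
   store the offset into that window in *delta.  Returns 0 if no
   window contains cp. */
int amc_r_choose_window(uint32_t cp, const uint32_t refpoint[6],
                        uint32_t *delta)
{
    int k;

    for (k = 1; k <= 5; ++k) {
        if (cp >= refpoint[k] && ((cp - refpoint[k]) >> (4 * k)) == 0) {
            *delta = cp - refpoint[k];
            return k;
        }
    }
    return 0;
}
```

    With the initial reference points (0xE0, 0xA0, 0, 0, 0x10000),
    U+00E9 falls in window 1 with delta 9, U+010D in window 2 with
    delta 0x6D, and U+4ED6 only in window 4.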

214	    Reference points 4 and 5 are fixed at 0 and 0x10000 respectively,
215	    so that windows 4 and 5 always cover the entire Unicode code space
216	    0..10FFFF.  The other reference points are updated whenever a code
217	    point has been encoded or decoded in base-32 mode, using the
218	    following heuristic.

220	    The latest code point is rounded down to a multiple of 0x10 to
221	    obtain a candidate for replacing reference point 1.  If a non-LDH
222	    code point falling within the candidate window has appeared more
223	    recently than one falling within the current window, then the
224	    reference point is changed.  Otherwise a similar check is performed
225	    for reference point 2 using 0x100 as the divisor, and failing that,
226	    reference point 3 is checked using 0x1000.  At most one window is
227	    changed each time, except that after the very first non-LDH code
228	    point (when there is no useful history), all three windows are
229	    changed.

231	    The initial values of the state variables are:

233	                     mode:  base-32
234	        reference point 1:  0xE0
235	        reference point 2:  0xA0
236	        reference point 3:  0
237	        reference point 4:  0
238	        reference point 5:  0x10000
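
    The state and the update heuristic described above might be
    sketched in C as follows.  The type amc_r_state and the functions
    amc_r_init and amc_r_update are names invented for this
    illustration; the logic follows the update() pseudocode in section
    "Encoding and decoding algorithms", with history[n-1] taken to be
    the code point just encoded or decoded in base-32 mode.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t refpoint[6];   /* [0] unused; [1..5] as in the text */
    int literal;            /* nonzero while in literal mode */
    int updated;            /* nonzero once update has run */
} amc_r_state;

static int is_ldh(uint32_t cp)
{
    return (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) ||
           (cp >= 0x30 && cp <= 0x39) || cp == 0x2D;
}

void amc_r_init(amc_r_state *st)
{
    static const uint32_t initial[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};
    size_t k;

    for (k = 0; k < 6; ++k) st->refpoint[k] = initial[k];
    st->literal = 0;        /* the initial mode is base-32 */
    st->updated = 0;
}

void amc_r_update(amc_r_state *st, const uint32_t *history, size_t n)
{
    uint32_t latest = history[n - 1];
    int k;

    for (k = 1; k <= 3; ++k) {
        int b = 4 * k;
        size_t i;

        if (!st->updated) {         /* first time: move all windows */
            st->refpoint[k] = (latest >> b) << b;
            continue;
        }
        for (i = n - 1; i-- > 0; ) {    /* i = n-2 down to 0 */
            if (is_ldh(history[i])) continue;
            /* Current window was hit at least as recently: keep it. */
            if (((st->refpoint[k] ^ history[i]) >> b) == 0) break;
            /* Candidate window was hit more recently: move this
               window and no others. */
            if (((latest ^ history[i]) >> b) == 0) {
                st->refpoint[k] = (latest >> b) << b;
                return;
            }
        }
    }
    st->updated = 1;
}
```

    For the input of example (S), the first non-LDH code point U+003E
    moves reference points 1..3 to 0x30, 0, 0; U+0020 changes nothing;
    and U+0024 then moves reference point 1 to 0x20, because U+0020
    (in the candidate window) is more recent than U+003E (in the
    current window).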

240	Base-32 characters

242	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
243	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
244	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
245	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
246	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
247	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
248	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
249	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
250	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
251	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
252	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
253	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
254	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
255	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
256	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
257	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

259	    The digits "0" and "1" and the letters "o" and "l" are not used, to
260	    avoid transcription errors.

262	    All decoders must recognize both the uppercase and lowercase forms
263	    of the base-32 characters (including mixtures of both forms).
264	    An encoder should output only lowercase forms or only uppercase
265	    forms unless it uses the feature described in section "Mixed-case
266	    annotation".
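
    One possible way to implement the table and the case rule above is
    a 32-character lookup string.  The names amc_r_base32,
    amc_r_quintet_to_char, and amc_r_char_to_quintet are assumptions
    made for this sketch, which also assumes an ASCII execution
    character set.

```c
#include <stddef.h>

/* Index = quintet value.  "l" and "o", "0" and "1" are absent. */
static const char amc_r_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Encoder side: lowercase output only. */
char amc_r_quintet_to_char(int q)
{
    return amc_r_base32[q & 0x1F];
}

/* Decoder side: accepts either case; returns -1 for any character
   that is not a base-32 character. */
int amc_r_char_to_quintet(char c)
{
    int q;

    if (c >= 'A' && c <= 'Z') c = (char)(c - 'A' + 'a');
    for (q = 0; q < 32; ++q)
        if (amc_r_base32[q] == c) return q;
    return -1;
}
```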

268	Encoding and decoding algorithms

270	    All ordering of bits, quartets, and quintets is big-endian (most
271	    significant first).  When subroutines alter variables that are
272	    passed in as arguments, those changes are seen by the caller after
273	    the subroutine returns.  As in C, "continue" means terminate the
274	    current iteration of the innermost loop, and "break" means terminate
275	    the innermost loop.

277	    procedure initialize(refpoint,literal,updated):
278	      let refpoint[1..5] = (0xE0, 0xA0, 0, 0, 0x10000)
279	      let literal = updated = false

281	    procedure update(refpoint, updated, history[first..latest]):
282	      # Update the reference points based on the history.
283	      for k = 1 to 3 do begin
284	        let b = 4 * k
285	        # The first time here change all the windows:
286	        if not updated
287	        then let refpoint[k] = (history[latest] >> b) << b
288	        else for i = latest - 1 down to first do begin
289	          if history[i] represents an LDH character then continue
290	          # If a code point falling in the existing window has appeared
291	          # at least as recently as one falling in the candidate window,
292	          # then leave this window unchanged and go on to the next one:
293	          if (refpoint[k] XOR history[i]) >> b == 0 then break
294	          if (history[latest] XOR history[i]) >> b == 0 then begin
295	            # A code point falling in the candidate window has appeared
296	            # more recently than one falling in the existing window, so
297	            # change this window (and no others):
298	            let refpoint[k] = (history[latest] >> b) << b
299	            goto update_end
300	          end
301	        end
302	      end
303	      update_end: let updated = true

305	    procedure encode(input[first..last]):
306	      initialize(refpoint,literal,updated)
307	      for i = first to last do begin
308	        # Check code point range to avoid array bounds errors later:
309	        if input[i] is not in 0..10FFFF then fail
310	        if input[i] == 0x2D then output two hyphen-minuses
311	        else if input[i] represents an LDH character then begin
312	          # Letter/digit is encoded literally, so get into literal mode.
313	          if not literal then output hyphen-minus
314	          let literal = true
315	          output the character represented by input[i]
316	        end
317	        else begin
318	          # Non-LDH code point is encoded in base-32.
319	          # Compute the number of base-32 characters to use:
320	          for k = 1 to infinity do begin
321	            let delta = input[i] - refpoint[k]
322	            if delta >= 0 and delta >> (4*k) == 0 then break
323	          end
324	          # Switch to base-32 mode if necessary:
325	          if literal then output hyphen-minus
326	          let literal = false
327	          represent delta in base 16 as k quartets
328	          prepend 0 to the last quartet and 1 to each of the others
329	          output a base-32 character corresponding to each quintet
330	          update(refpoint, updated, input[first..i])
331	        end
332	      end

333	    procedure decode(input string):
334	      initialize(refpoint,literal,updated)
335	      let history = the empty array
336	      while the input string is not exhausted do begin
337	        read the next character into c
338	        # Unpaired hyphen-minus toggles the mode:
339	        if c is hyphen-minus and the next character is not hyphen-minus
340	        then read the next character into c and toggle literal
341	        # Double hyphen-minus represents 0x2D:
342	        if c is hyphen-minus
343	        then read the next character and append 0x2D to history
344	        else if literal then append the code point of c to history
345	        else begin
346	          # Decode a base-32 sequence.
347	          convert c to a quintet
348	          while a quintet beginning with 0 has not been seen
349	          do read and convert up to four more characters
350	          concatenate the lowest four bits of each quintet to form delta
351	          append refpoint[number of quintets] + delta to history
352	          update(refpoint,updated,history)
353	        end
354	      end
355	      # Enforce the uniqueness of the encoding:
356	      encode history and compare it to the input string
357	      fail if they are not equal
358	      output history
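
    The encode() pseudocode above can be transcribed into C roughly
    as follows.  This is a self-contained sketch under stated
    assumptions, not the code from section "Example implementation":
    the function name amc_r_encode, its buffer-based interface, the
    return convention, and the conservative space check are all
    invented here; output is lowercase only (no mixed-case
    annotation); and an ASCII execution character set is assumed.

```c
#include <stddef.h>
#include <stdint.h>

static const char base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

static int is_ldh(uint32_t cp)
{
    return (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) ||
           (cp >= 0x30 && cp <= 0x39) || cp == 0x2D;
}

/* update() from the pseudocode; history[n-1] is the latest. */
static void update(uint32_t refpoint[6], int *updated,
                   const uint32_t *history, size_t n)
{
    uint32_t latest = history[n - 1];
    int k;

    for (k = 1; k <= 3; ++k) {
        int b = 4 * k;
        size_t i;

        if (!*updated) {
            refpoint[k] = (latest >> b) << b;
            continue;
        }
        for (i = n - 1; i-- > 0; ) {
            if (is_ldh(history[i])) continue;
            if (((refpoint[k] ^ history[i]) >> b) == 0) break;
            if (((latest ^ history[i]) >> b) == 0) {
                refpoint[k] = (latest >> b) << b;
                return;
            }
        }
    }
    *updated = 1;
}

/* Returns the encoded length (excluding the NUL), or -1 if a code
   point is out of range or the buffer may be too small. */
int amc_r_encode(const uint32_t *in, size_t n, char *out, size_t cap)
{
    uint32_t refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};
    int literal = 0, updated = 0;
    size_t i, len = 0;

    if (cap == 0) return -1;
    for (i = 0; i < n; ++i) {
        uint32_t cp = in[i], delta = 0;
        int k, j;

        /* At most 6 characters are written per code point. */
        if (cp > 0x10FFFF || len + 6 >= cap) return -1;
        if (cp == 0x2D) {
            out[len++] = '-';
            out[len++] = '-';
        } else if (is_ldh(cp)) {
            if (!literal) out[len++] = '-';
            literal = 1;
            out[len++] = (char)cp;
        } else {
            /* Windows 4 and 5 cover 0..10FFFF, so k never exceeds 5. */
            for (k = 1; k <= 5; ++k) {
                delta = cp - refpoint[k];
                if (cp >= refpoint[k] && (delta >> (4 * k)) == 0) break;
            }
            if (literal) out[len++] = '-';
            literal = 0;
            for (j = k - 1; j >= 0; --j) {
                uint32_t q = (delta >> (4 * j)) & 0xF;
                out[len++] = base32[q | (j > 0 ? 0x10u : 0u)];
            }
            update(refpoint, &updated, in, i + 1);
        }
    }
    out[len] = '\0';
    return (int)len;
}
```

    Fed the code points of examples (B), (C), and (S) from section
    "Example strings", this sketch reproduces the AMC-R strings listed
    there.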

360	    The decoder must always be prepared for premature end-of-input or
361	    invalid input characters, and must either fail immediately or forge
362	    ahead and let the comparison at the end fail.  The comparison must
363	    be case-insensitive if ACEs are always compared case-insensitively
364	    (which is true of domain names), case-sensitive otherwise.  This
365	    check is necessary to guarantee the uniqueness property (there
366	    cannot be two distinct encoded strings representing the same
367	    sequence of integers).  (If the decoder is one step of a larger
368	    decoding process, it may be possible to defer the re-encoding and
369	    comparison to the end of that larger decoding process.)
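
    A C sketch of the decoding loop follows, taking the "fail
    immediately" option for premature end-of-input and invalid
    characters.  It is an illustration under stated assumptions: the
    name amc_r_decode_unchecked and its interface are invented here,
    an ASCII execution character set is assumed, and the sketch
    deliberately omits the final re-encode-and-compare step, so by
    itself it does NOT enforce the uniqueness property; a caller
    would still have to perform that check.

```c
#include <stddef.h>
#include <stdint.h>

static const char base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

static int quintet_of(char c)   /* either case; -1 if not base-32 */
{
    int q;

    if (c >= 'A' && c <= 'Z') c = (char)(c - 'A' + 'a');
    for (q = 0; q < 32; ++q)
        if (base32[q] == c) return q;
    return -1;
}

static int is_ldh(uint32_t cp)
{
    return (cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A) ||
           (cp >= 0x30 && cp <= 0x39) || cp == 0x2D;
}

/* update() from the pseudocode; history[n-1] is the latest. */
static void update(uint32_t refpoint[6], int *updated,
                   const uint32_t *history, size_t n)
{
    uint32_t latest = history[n - 1];
    int k;

    for (k = 1; k <= 3; ++k) {
        int b = 4 * k;
        size_t i;

        if (!*updated) {
            refpoint[k] = (latest >> b) << b;
            continue;
        }
        for (i = n - 1; i-- > 0; ) {
            if (is_ldh(history[i])) continue;
            if (((refpoint[k] ^ history[i]) >> b) == 0) break;
            if (((latest ^ history[i]) >> b) == 0) {
                refpoint[k] = (latest >> b) << b;
                return;
            }
        }
    }
    *updated = 1;
}

/* Returns the number of code points written to out, or -1. */
int amc_r_decode_unchecked(const char *in, uint32_t *out, size_t cap)
{
    uint32_t refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};
    int literal = 0, updated = 0;
    size_t n = 0;

    while (*in != '\0') {
        char c = *in++;

        if (n >= cap) return -1;
        if (c == '-' && *in != '-') {       /* unpaired: toggle mode */
            if (*in == '\0') return -1;     /* premature end */
            c = *in++;
            literal = !literal;
        }
        if (c == '-') {                     /* pair: U+002D */
            ++in;
            out[n++] = 0x2D;
        } else if (literal) {
            if (!is_ldh((unsigned char)c)) return -1;
            out[n++] = (unsigned char)c;
        } else {
            uint32_t delta = 0;
            int k = 0;

            for (;;) {                      /* at most five quintets */
                int q = quintet_of(c);
                if (q < 0 || ++k > 5) return -1;
                delta = (delta << 4) | (uint32_t)(q & 0xF);
                if ((q & 0x10) == 0) break; /* high bit 0: last one */
                if (*in == '\0') return -1;
                c = *in++;
            }
            out[n++] = refpoint[k] + delta;
            update(refpoint, &updated, out, n);
        }
    }
    return (int)n;
}
```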

371	Signature

373	    The issue of how to distinguish ACE strings from unencoded strings
374	    is largely orthogonal to the encoding scheme itself, and is
375	    therefore not specified here.  In the context of domain name labels,
376	    a standard prefix and/or suffix (chosen to be unlikely to occur
377	    naturally) would presumably be attached to ACE labels.

379	    In order to use AMC-ACE-R in domain names, the choice of signature
380	    must be mindful of the requirement in [RFC952] that labels never
381	    begin or end with hyphen-minus.  Since the raw encoded string
382	    sometimes begins with a hyphen-minus, the signature must include
383	    a prefix that does not begin with hyphen-minus.  If the Unicode
384	    strings are forbidden from ending with hyphen-minus (which seems
385	    prudent anyway), then the raw encoded string will never end with
386	    hyphen-minus; otherwise, the signature must include a suffix as well
387	    as a prefix.

389	Mixed-case annotation

391	    In order to use AMC-ACE-R to represent case-insensitive Unicode
392	    strings, higher layers need to case-fold the Unicode strings prior
393	    to AMC-ACE-R encoding.  The encoded string can, however, use
394	    mixed-case base-32 (rather than all-lowercase or all-uppercase
395	    as recommended in section "Base-32 characters") as an annotation
396	    telling how to convert the folded Unicode string into a mixed-case
397	    Unicode string for display purposes.

399	    Each non-LDH code point is represented by a sequence of base-32
400	    characters, the last of which is always a letter (as opposed to
401	    a digit).  If that letter is uppercase, it is a suggestion that
402	    the Unicode character be mapped to uppercase (if possible); if the
403	    letter is lowercase, it is a suggestion that the Unicode character
404	    be mapped to lowercase (if possible).

406	    AMC-ACE-R encoders and decoders are not required to support these
407	    annotations, and higher layers need not use them.
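
    As a sketch of how an encoder might apply the annotation, the
    hypothetical helper below (amc_r_annotate is not a name from this
    specification) sets the case of the final letter of one base-32
    code sequence and leaves the other characters untouched.  An ASCII
    execution character set is assumed.

```c
#include <stddef.h>

/* Illustrative sketch: seq[0..len-1] is the base-32 sequence for a
   single non-LDH code point.  Its last character is always a letter
   because the final quintet has its high bit clear (values 0..15). */
void amc_r_annotate(char *seq, size_t len, int want_upper)
{
    char *last;

    if (len == 0) return;
    last = &seq[len - 1];
    if (want_upper && *last >= 'a' && *last <= 'z')
        *last = (char)(*last - 'a' + 'A');
    else if (!want_upper && *last >= 'A' && *last <= 'Z')
        *last = (char)(*last - 'A' + 'a');
}
```

    For instance, the sequence "wvr" encodes code point 043F from the
    initial state; with the uppercase suggestion it becomes "wvR".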

409	Comparison with other ACEs

411	    Please refer to the comparison in [AMCACEW].

413	Example strings

415	    In the ACE encodings below, no signatures are shown.  AMC-ACE-R is
416	    abbreviated AMC-R.  Backslashes show where line breaks have been
417	    inserted in strings too long for one line.

419	    The first several examples are all translations of the sentence "Why
420	    can't they just speak in <language>?" (courtesy of Michael Kaplan's
421	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
422	    been removed, as is often done in domain names.

424	    (A) Arabic (Egyptian):
425	        u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
426	        u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
427	        AMC-R:  ywekhfuhuikwdwefivevjbuiwktr

429	    (B) Chinese (simplified):
430	        u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
431	        AMC-R:  w87g8nvk6awisp259eupyx2h

433	    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
434	        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
435	        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
436	        u+0065 u+0073 u+006B u+0079
437	        AMC-R:  -Pro-yp-prost-tm-nemluv-s8pp-esky

439	    (D) Hebrew:
440	        u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
441	        u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
442	        u+05D1 u+05E8 u+05D9 u+05EA
443	        AMC-R:  x7nqeep8e8j7f7inaqdb8ijp8cb8ij8k

444	    (E) Hindi (Devanagari):
445	        u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
446	        u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
447	        u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
448	        u+0939 u+0948 u+0902
449	        AMC-R:  3urvjvcwmthjruiwpugwatfwpurmscuivjascunmvcvitfuewhjwisc

451	    (F) Japanese (kanji and hiragana):
452	        u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
453	        u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
454	        AMC-R:  vsykxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm

456	    (G) Korean (Hangul syllables):
457	        u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
458	        u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
459	        u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
460	        AMC-R:  6tvi466ezxi544i5w8a6s4nz2nw8e6zze7xxn47yp6x5e53znze7xze\
461	                7xxn5u8e54ze6x5n36is3i622m6zwe48wn

463	    (H) Russian (Cyrillic):
464	        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
465	        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
466	        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
467	        u+0438
468	        AMC-R:  wvRqwhfnwdgfqpipfdqcqwawrcvrvqwawdbbvkvi

470	    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
471	        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
472	        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
473	        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
474	        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
475	        u+0061 u+00F1 u+006F u+006C
476	        AMC-R:  -Porqu-j-nopuedensimplementehablarenEspa-9b-ol

478	    (J) Taiwanese:
479	        u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
480	        AMC-R:  w87gxstbzuvc6a385psp244kupyx2h

482	    (K) Vietnamese:
483	        Taisaohokhngthchi\
484	        noitingVit
485	        U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
486	        u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
487	        u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
488	        u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
489	        u+0323 u+0074
490	        AMC-R:  -Ta-vud-isaoho-d-kh-s9e-ngth-s8kvsj-chi-vsj-no-b-iti-s8\
491	                kb-ngVi-s8kud-t

493	    The next several examples are all names of Japanese music artists,
494	    song titles, and TV programs, just because the author happens to
495	    have them handy (but Japanese is useful for providing examples
496	    of single-row text, two-row text, ideographic text, and various
497	    mixtures thereof).

499	    (L) 3<nen>B<gumi><kinpachi><sensei>
500	        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
501	        AMC-R:  -3-x8ze-B-z7we3t7btymtwizxtr

502	    (M) <amuro><namie>-with-SUPER-MONKEYS
503	        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
504	        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
505	        U+004F U+004E U+004B U+0045 U+0059 U+0053
506	        AMC-R:  x52j4e3wiz92qyszf---with--SUPER--MONKEYS

508	    (N) Hello-Another-Way-<sorezore><no><basho>
509	        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
510	        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
511	        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
512	        AMC-R:  -Hello--Another--Way---vsxp2nq2nyqx2veyuwa

514	    (O) <hitotsu><yane><no><shita>2
515	        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
516	        AMC-R:  vszcyiyex6wmy2vjqw8sm-2

518	    (P) Maji<de>Koi<suru>5<byou><mae>
519	        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
520	        u+308B u+0035 u+79D2 u+524D
521	        AMC-R:  -Maji-vsyh-Koi-xj2m-5-z37cxuwp

523	    (Q) <pafii>de<runba>
524	        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
525	        AMC-R:  vs7bf4d9n-de-8m9d7a

527	    (R) <sono><supiido><de>
528	        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
529	        AMC-R:  vsxpyq5j7e9n6jyh

531	    The last example is an ASCII string that breaks not only the
532	    existing rules for host name labels but also the rules proposed in
533	    [NAMEPREP03] for internationalized domain names.

535	    (S) -> $1.00 <-
536	        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
537	        u+003C u+002D
538	        AMC-R:  --svquaue-1-q-00-avn--

540	Security considerations

542	    Users expect each domain name in DNS to be controlled by a single
543	    authority.  If a Unicode string intended for use as a domain label
544	    could map to multiple ACE labels, then an internationalized domain
545	    name could map to multiple ACE domain names, each controlled by
546	    a different authority, some of which could be spoofs that hijack
547	    service requests intended for another.  Therefore AMC-ACE-R is
548	    designed so that each Unicode string has a unique encoding.

550	    However, there can still be multiple Unicode representations of the
551	    "same" text, for various definitions of "same".  This problem is
552	    addressed to some extent by the Unicode standard under the topic of
553	    canonicalization, and this work is leveraged for domain names by
554	    "nameprep" [NAMEPREP03].

556	Acknowledgements

558	    AMC-ACE-R reuses a number of preexisting techniques.

560	    The basic encoding of integers to quartets to quintets to base-32
561	    comes from UTF-5 [UTF5], and the particular variant used here comes
562	    from AMC-ACE-M [AMCACEM].

564	    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
565	    from SFS [SFS].

567	    The idea of encoding deltas from reference points was taken from
568	    RACE (of which the latest version is [RACE03]), which may have
569	    gotten the idea from Unicode Technical Standard #6 [UTS6].

571	    The idea of switching between literal mode and base-32 mode comes
572	    from BRACE [BRACE].

574	    The general idea of using the alphabetic case of base-32 characters
575	    to indicate the desired case of the Unicode characters was suggested
576	    by this author, and first applied to the UTF-5-style encoding in
577	    DUDE (of which the latest version is [DUDE01]).

579	    The heuristic used to adapt the reference points based on past code
580	    points is new in AMC-ACE-R.

References

    [AMCACEM] Adam Costello, "AMC-ACE-M version 0.1.4", 2001-Apr-01,
    update of draft-ietf-idn-amc-ace-m-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m.

    [AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
    2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.

    [BRACE] Adam Costello, "BRACE: Bi-mode Row-based
    ASCII-Compatible Encoding for IDN version 0.1.2",
    2000-Sep-19, draft-ietf-idn-brace-00, version at
    http://www.cs.berkeley.edu/~amc/charset/brace.

    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)", draft-ietf-idn-idna-01.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
    http://www.trigeminal.com/samples/provincial.html.

    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [SFS] David Mazieres et al, "Self-certifying File System",
    http://www.fs.net/.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

    [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed,
    Mark Davis, Asmus Freytag, "Unicode Technical Standard
    #6: A Standard Compression Scheme for Unicode",
    http://www.unicode.org/unicode/reports/tr6/.

Author

    Adam M. Costello
    http://www.cs.berkeley.edu/~amc/

Example implementation

/******************************************/
/* amc-ace-r.c 0.2.1 (2001-May-31-Thu)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-R version 0.2.*. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,
  amc_ace_big_output  /* Output would exceed the space provided. */
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x10FFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] );

    /* amc_ace_r_encode() converts Unicode to AMC-ACE-R (without      */
    /* any signature).  The input must be represented as an array     */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII.  The input_length is the number of      */
    /* code points in the input.  The output_size is an in/out        */
    /* argument: the caller must pass in the maximum number of        */
    /* characters that may be output (including the terminating       */
    /* null), and on successful return it will contain the number of  */
    /* characters actually output (including the terminating null,    */
    /* so it will be one more than strlen() would return, which is    */
    /* why it is called output_size rather than output_length).  The  */
    /* uppercase_flags array must hold input_length boolean values,   */
    /* where nonzero means the corresponding Unicode character should */
    /* be forced to uppercase after being decoded, and zero means it  */
    /* is caseless or should be forced to lowercase.  Alternatively,  */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros.  The letters a-z and A-Z are always encoded      */
    /* literally, regardless of the corresponding flags.  The encoder */
    /* always outputs lowercase base-32 characters except when        */
    /* nonzero values of uppercase_flags require otherwise.  The      */
    /* return value may be any of the amc_ace_status values defined   */
    /* above; if not amc_ace_success, then output_size and output may */
    /* contain garbage.  On success, the encoder will never need to   */
    /* write an output_size greater than input_length*5+1, because of */
    /* how the encoding is defined.                                   */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] );

    /* amc_ace_r_decode() converts AMC-ACE-R (without any signature)  */
    /* to Unicode.  The input must be represented as null-terminated  */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points.  The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()).  The output_length is      */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output.  The uppercase_flags array must have room for at       */
    /* least output_length values, or it may be a null pointer        */
    /* if the case information is not needed.  A nonzero flag         */
    /* indicates that the corresponding Unicode character should      */
    /* be forced to uppercase by the caller, while zero means it      */
    /* is caseless or should be forced to lowercase.  The letters     */
    /* a-z and A-Z are output already in the proper case, but their   */
    /* flags will be set appropriately so that applying the flags     */
    /* would be harmless.  The return value may be any of the         */
    /* amc_ace_status values defined above; if not amc_ace_success,   */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage.  On success, the decoder will never need to write     */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* is_ldh(n) returns 1 if the code point n represents an LDH      */
/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */

static int is_ldh(u_code_point n)
{
  return n <= 122 && ( n >= 97 || n == 45 ||
         (n >= 48 && n <= 57) || (n >= 65 && n <= 90) );
}

/* base32[q] is the lowercase base-32 character representing  */
/* the number q from the range 0 to 31.  Note that we cannot  */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.                   */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.                                                */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
/* then ASCII A-Z are considered equal to a-z respectively.           */

static int unequal( enum case_sensitivity case_sensitivity,
                    const char s1[], const char s2[]        )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* update(refpoint,updated,history,latest) updates refpoint[1..3] */
/* based on the updated flag and history[0..latest].              */

static void update( u_code_point refpoint[6], unsigned int *updated,
  const u_code_point history[], unsigned int latest )
{
  unsigned int k, b, i;

  for (k = 1;  k <= 3;  ++k) {
    b = k << 2;
    /* The first time here change all the windows: */
    if (!*updated) refpoint[k] = (history[latest] >> b) << b;
    else for (i = latest;  i-- > 0; ) {
      if (is_ldh(history[i])) continue;

      /* If a code point falling in the existing window has appeared  */
      /* at least as recently as one falling in the candidate window, */
      /* then leave this window unchanged and go on to the next one:  */
      if ((refpoint[k] ^ history[i]) >> b == 0) break;

      if ((history[latest] ^ history[i]) >> b == 0) {
        /* A code point falling in the candidate window has appeared */
        /* more recently than one falling in the existing window, so */
        /* change this window (and no others):                       */

        refpoint[k] = (history[latest] >> b) << b;
        goto update_end;
      }
    }
  }

  update_end: *updated = 1;
}

/* Main encode function: */

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] )
{
  unsigned int literal, updated, max_out, in, out, k, j;
  u_code_point n, delta;
  char shift;

  /* Initialize the state: */

  u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};

  literal = updated = 0;
  max_out = *output_size;

  for (in = out = 0;  in < input_length;  ++in) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of  */
    /* the next items to be input/output.                           */

    n = input[in];
    /* Check the code point range to avoid array bounds errors later: */
    if (n > 0x10FFFF) return amc_ace_bad_input;
    if (n == 0x2D) {
      /* Hyphen-minus is doubled. */
      if (max_out - out < 2) return amc_ace_big_output;
      output[out++] = 0x2D;
      output[out++] = 0x2D;
    }
    else if (is_ldh(n)) {
      /* Encode an LDH character literally. */
      if (max_out - out < 1 + !literal) return amc_ace_big_output;
      /* Switch to literal mode if necessary: */
      if (!literal) output[out++] = 0x2D;
      literal = 1;
      output[out++] = n;
    }
    else {
      /* Encode a non-LDH character using base-32.           */
      /* First compute the number of base-32 characters (k): */

      for (k = 1; ; ++k) {
        delta = n - refpoint[k];
        if (delta >> (4*k) == 0) break;
      }

      if (max_out - out < k + literal) return amc_ace_big_output;
      /* Switch to base-32 mode if necessary: */
      if (literal) output[out++] = 0x2D;
      literal = 0;
      shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;

      /* Each quintet has the form 1xxxx except the last is 0xxxx. */
      /* Computing the base-32 digits in reverse order is easiest. */

      out += k;
      output[out - 1] = base32[delta & 0xF] - shift;

      for (j = 2;  j <= k;  ++j) {
        delta >>= 4;
        output[out - j] = base32[0x10 | (delta & 0xF)];
      }

      update(refpoint, &updated, input, in);
    }
  }

  /* Append the null terminator: */
  if (max_out - out < 1) return amc_ace_big_output;
  output[out++] = 0;

  *output_size = out;
  return amc_ace_success;
}

/* Main decode function: */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  u_code_point q, delta;
  char c;
  unsigned int literal, updated, max_out, in, out, k, scratch_size;
  enum amc_ace_status status;

  /* Initialize the state: */

  u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000};

  literal = updated = 0;
  max_out = *output_length;

  for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {

    /* At the start of each iteration, in and out are the number of   */
    /* items already input/output, or equivalently, the indices of    */
    /* the next items to be input/output. c is the same as input[in]. */

    if (c == 0x2D && input[in + 1] != 0x2D) {
      /* Unpaired hyphen-minus toggles mode. */
      literal = !literal;
      c = input[++in];
      /* A mode switch must be followed by a character; without this */
      /* check the loop would step past the null terminator.         */
      if (c == 0) return amc_ace_bad_input;
    }

    if (max_out - out < 1) return amc_ace_big_output;

    if (c == 0x2D) {
      /* Double hyphen-minus represents a hyphen-minus. */
      ++in;
      output[out] = 0x2D;
    }
    else {
      if (literal) output[out] = c;
      else {
        /* Decode a base-32 sequence.                  */
        /* First decode quintets until 0xxxx is found: */

        for (delta = 0, k = 1;  ;  c = input[++in], ++k) {
          q = base32_decode(c);
          if (q == base32_invalid || k > 5) return amc_ace_bad_input;
          delta = (delta << 4) | (q & 0xF);
          if (q >> 4 == 0) break;
        }

        output[out] = refpoint[k] + delta;
        update(refpoint, &updated, output, out);
      }
    }
    /* Case of last character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input:     */

  scratch_size = ++in;
  status = amc_ace_r_encode(out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_bad_input;

  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes an AMC-ACE-R string.\n"
    "%s -d reads an AMC-ACE-R string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "An AMC-ACE-R string is a newline-terminated sequence of LDH\n"
    "characters (without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH      */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  /* Check argv[1][1] first so that a bare "-" is never read past */
  /* its null terminator:                                         */
  if (argv[1][1] == 0 || argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }
      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = amc_ace_r_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the AMC-ACE-R input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_r_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}

                   INTERNET-DRAFT expires 2001-Nov-30