idnits 2.17.1

draft-ietf-idn-altdude-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The abstract seems to contain references ([UNICODE], [DUDE01],
     [RFC1123], [RFC952], [IDN]), which it shouldn't.  Please replace those
     with straight textual mentions of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '--out' is mentioned on line 778, but not defined

  == Missing Reference: '-1' is mentioned on line 836, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 898

  -- Looks like a reference, but probably isn't: '1' on line 951

  -- Looks like a reference, but probably isn't: '2' on line 899

  == Unused Reference: 'RFC1034' is defined on line 551, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEM00' 

  -- Possible downref: Normative reference to a draft: ref. 'AMCACEO00' 

  == Outdated reference: A later version (-02) exists of
     draft-ietf-idn-dude-01

  -- Possible downref: Normative reference to a draft: ref. 'DUDE01' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PROVINCIAL'

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- No information found for draft-jseng-utf5- - is the name correct?

  -- Possible downref: Normative reference to a draft: ref. 'UTF5' 


     Summary: 6 errors (**), 0 flaws (~~), 6 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-ietf-idn-altdude-00.txt                                2001-Mar-19
3	Expires 2001-Sep-19

5	                         AltDUDE version 0.0.2

7	Status of this Memo

9	    This document is an Internet-Draft and is in full conformance with
10	    all provisions of Section 10 of RFC2026.

12	    Internet-Drafts are working documents of the Internet Engineering
13	    Task Force (IETF), its areas, and its working groups.  Note
14	    that other groups may also distribute working documents as
15	    Internet-Drafts.

17	    Internet-Drafts are draft documents valid for a maximum of six
18	    months and may be updated, replaced, or obsoleted by other documents
19	    at any time.  It is inappropriate to use Internet-Drafts as
20	    reference material or to cite them other than as "work in progress."

22	    The list of current Internet-Drafts can be accessed at
23	    http://www.ietf.org/ietf/1id-abstracts.txt

25	    The list of Internet-Draft Shadow Directories can be accessed at
26	    http://www.ietf.org/shadow.html

28	    Distribution of this document is unlimited.  Please send comments
29	    to the author at amc@cs.berkeley.edu, or to the idn working
30	    group at idn@ops.ietf.org.  A non-paginated (and possibly
31	    newer) version of this specification may be available at
32	    http://www.cs.berkeley.edu/~amc/charset/altdude

34	Abstract

36	    DUDE [DUDE01] by Mark Welter and Brian Spolarich is an
37	    ASCII-Compatible Encoding (ACE) of Unicode strings, and AltDUDE is a
38	    slight variation on it that is conceptually simpler.

40	    AltDUDE is a reversible map from a sequence of unsigned integers
41	    (intended to be Unicode code points) to a sequence of letters (A-Z,
42	    a-z), digits (0-9), and hyphen-minus (-), henceforth called LDH
43	    characters.  Such a map might be useful for internationalized domain
44	    names [IDN], because host name labels are currently restricted to
45	    LDH characters by [RFC952] and [RFC1123].

47	    Besides domain names, there might also be other contexts where it is
48	    useful to transform Unicode [UNICODE] code points (or any unsigned
49	    integers that exhibit locality) into "safe" (delimiter-free)
50	    ASCII characters.  (If other contexts consider hyphen-minus to be
51	    unsafe, it can trivially be eliminated, or replaced by a different
52	    character, like underscore.)
53	Contents

55	    Differences from DUDE
56	    Features
57	    Name
58	    Overview
59	    Base-32 characters
60	    Encoding procedure
61	    Decoding procedure
62	    Signature
63	    Case sensitivity models
64	    Comparison with other ACEs
65	    Example strings
66	    Security considerations
67	    Credits
68	    References
69	    Author
70	    Example implementation

72	Differences from DUDE

74	    AltDUDE differs from DUDE in four respects:

76	     1) DUDE computes the XOR of each integer and the previous in order
77	        to decide how many bits of each integer to encode, whereas
78	        AltDUDE encodes the XOR itself, so there is no need for a mask.

80	     2) DUDE makes the first quintet of each sequence different from the
81	        rest, while AltDUDE makes the last quintet different, so it's
82	        easier for the decoder to detect the end of the sequence.

84	     3) AltDUDE uses a base-32 map that avoids 0, 1, o, and l, to help
85	        humans avoid transcription errors.

87	     4) AltDUDE uses 96 rather than 0 as the initial value of the
88	        previous code point.  For domain names, this makes a few
89	        encodings one character shorter and makes none longer.

91	Features

93	    Uniqueness:  Every sequence of integers maps to at most one LDH
94	    string.

96	    Completeness:  Every sequence of integers maps to an LDH string.
97	    Restrictions on which integers are allowed, and on sequence length,
98	    may be imposed by higher layers.

100	    Efficient encoding:  The ratio of encoded size to original size is
101	    small.  This is important in the context of domain names because
102	    [RFC1034] restricts the length of a domain label to 63 characters.

104	    Simplicity:  The encoding and decoding algorithms are reasonably
105	    simple to implement.  The goals of efficiency and simplicity are at
106	    odds; AltDUDE places greater emphasis on simplicity.

108	    Case-preservation:  If a Unicode string has been case-folded prior
109	    to encoding, it is possible to record the case information in the
110	    case of the letters in the encoding, allowing a mixed-case Unicode
111	    string to be recovered if desired, but a case-insensitive comparison
112	    of two encoded strings is equivalent to a case-insensitive
113	    comparison of the Unicode strings.  This feature is optional; see
114	    section "Case sensitivity models".

116	Name

118	    AltDUDE is a working name that should be changed if it is adopted.
119	    Rather than waste good names on experimental proposals, let's
120	    wait until one proposal is chosen, then assign it a good name.
121	    Suggestions (assuming the primary use is in domain names):

123	        DUDE (if the DUDE authors wish to adopt this algorithm)
124	        UniHost
125	        UTF-D ("D" for "domain names")
126	        UTF-33 (there are 33 characters in the output repertoire)

128	Overview

130	    AltDUDE encodes unsigned integers as characters, although
131	    implementations will of course need to represent the output
132	    characters somehow, usually as bytes or other code units.  When
133	    AltDUDE is used to encode Unicode characters, the integers are the
134	    corresponding Unicode code points, not UTF-16 surrogates.
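
Because the integers are code points rather than UTF-16 code units, a
caller holding UTF-16 text would first have to combine surrogate
pairs.  The following is a minimal sketch of that step, not part of
AltDUDE itself; the function name utf16_to_code_points and its
signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: converts n UTF-16 code units to code points. */
/* The out array must have room for n values.  Returns the number of */
/* code points written, or -1 if a surrogate is unpaired.            */
static long utf16_to_code_points(const unsigned short *in, size_t n,
                                 unsigned long *out)
{
  size_t i = 0, count = 0;

  while (i < n) {
    unsigned long u = in[i++];
    if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate */
      unsigned long lo;
      if (i >= n) return -1;
      lo = in[i++];
      if (lo < 0xDC00 || lo > 0xDFFF) return -1;
      u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    } else if (u >= 0xDC00 && u <= 0xDFFF) {   /* stray low surrogate */
      return -1;
    }
    out[count++] = u;
  }
  return (long) count;
}
```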

136	    Each integer is represented by an integral number of characters in
137	    the encoded string.  There is no intermediate bit string or octet
138	    string.

140	    Integers with value 45 are represented by hyphen-minus characters
141	    (45 is the Unicode code point for hyphen-minus).  Each
142	    non-hyphen-minus character in the encoded string represents five
143	    bits (a "quintet").  A sequence of quintets represents the bitwise
144	    XOR between each non-45 integer and the previous one.

146	    The exception for 45 and hyphen-minus is useful for domain names,
147	    but could be dropped in other contexts, or replaced by a different
148	    exception.

150	Base-32 characters

152	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
153	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
154	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
155	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
156	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
157	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
158	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
159	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
160	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
161	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
162	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
163	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
164	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
165	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
166	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
167	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

169	    The digits "0" and "1" and the letters "o" and "l" are not used, to
170	    avoid transcription errors.

172	    All decoders must recognize both the uppercase and lowercase
173	    forms of the base-32 characters.  The case may or may not convey
174	    information, as described in section "Case sensitivity models".
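
The table above can be held in a single 32-character string indexed
by quintet value.  This is a sketch only; the names base32_chars,
quintet_to_char, and char_to_quintet are hypothetical and do not come
from the example implementation.

```c
#include <ctype.h>
#include <string.h>

/* The 32 base-32 characters, indexed by quintet value (0..31). */
static const char base32_chars[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical helper: quintet value -> lowercase base-32 character. */
static char quintet_to_char(unsigned int q)
{
  return base32_chars[q & 0x1F];
}

/* Hypothetical helper: base-32 character (either case) -> quintet  */
/* value, or -1 if the character is not in the table.               */
static int char_to_quintet(char c)
{
  const char *p;

  if (c == '\0') return -1;   /* strchr would match the terminator */
  p = strchr(base32_chars, tolower((unsigned char) c));
  return p ? (int) (p - base32_chars) : -1;
}
```

The lookup folds its argument to lowercase first, so that both
uppercase and lowercase forms are recognized as required above.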

176	Encoding procedure

178	    All ordering of nybbles and quintets is big-endian (most significant
179	    first).  A nybble is 4 bits.  XOR is bitwise exclusive or.

181	    let prev = 96
182	    for each input integer n (in order) do begin
183	      if n == 45 then output hyphen-minus
184	      else begin
185	        let diff = prev XOR n
186	        extract the least significant nybbles of diff, as few as are
187	          sufficient to hold all the nonzero bits (but at least one)
188	        prepend 0 to the last nybble and 1 to the rest
189	        output base-32 characters corresponding to the quintets
190	        let prev = n
191	      end
192	    end

194	    The encoder must either correctly handle all integer values that can
195	    be represented in the type of its input, or it must check whether
196	    the input contains values that it cannot handle and return an error
197	    if so.  Under no circumstances may it produce incorrect output.
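
The encoding procedure above might be sketched in C as follows.  This
is an illustration under stated assumptions, not the example
implementation: the name altdude_encode_sketch and its signature are
hypothetical, the output is always lowercase (no case flags), and
every value of unsigned long is accepted as input.

```c
#include <stddef.h>

/* Hypothetical sketch of the encoding loop.  Writes a null-         */
/* terminated string to out and returns its length, or returns -1   */
/* if out_size is too small.                                         */
static long altdude_encode_sketch(const unsigned long *in, size_t n,
                                  char *out, size_t out_size)
{
  static const char base32[] = "abcdefghijkmnpqrstuvwxyz23456789";
  unsigned long prev = 96, diff;
  size_t i, len = 0;
  int shift;

  for (i = 0; i < n; ++i) {
    if (in[i] == 45) {
      if (len + 1 >= out_size) return -1;
      out[len++] = '-';
      continue;                     /* prev is left unchanged */
    }
    diff = prev ^ in[i];
    /* Find the shift of the most significant nonzero nybble. */
    for (shift = 0; (diff >> shift) > 0xF; shift += 4) {
      /* nothing to do */
    }
    /* All nybbles except the last get a 1 bit prepended. */
    for (; shift > 0; shift -= 4) {
      if (len + 1 >= out_size) return -1;
      out[len++] = base32[0x10 | ((diff >> shift) & 0xF)];
    }
    /* The last nybble gets a 0 bit prepended. */
    if (len + 1 >= out_size) return -1;
    out[len++] = base32[diff & 0xF];
    prev = in[i];
  }
  if (len >= out_size) return -1;
  out[len] = '\0';
  return (long) len;
}
```

Applied to the code points of example (S) in section "Example
strings", this sketch yields the AltDUDE string listed there.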

199	Decoding procedure

201	    let prev = 96
202	    while the input string is not exhausted do begin
203	      if the next character is hyphen-minus then output 45
204	      else begin
205	        input characters and convert them to quintets until
206	          encountering a quintet beginning with 0
207	        fail upon encountering a non-base-32 character or end-of-input
208	        strip the first bit of each quintet
209	        concatenate the resulting nybbles to form diff
210	        let prev = prev XOR diff
211	        output prev
212	      end
213	    end
214	    encode the output sequence and compare it to the input string
215	    fail if they are not equal

217	    The comparison at the end must be case-insensitive if ACEs are
218	    always compared case-insensitively (which is true of domain names),
219	    case-sensitive otherwise.  See also section "Case sensitivity
220	    models".  This check is necessary to guarantee the uniqueness
221	    property (there cannot be two distinct encoded strings representing
222	    the same sequence of integers).  This check also frees the decoder
223	    from having to check for overflow while decoding the base-32
224	    characters.

226	Signature

228	    The issue of how to distinguish ACE strings from unencoded strings
229	    is largely orthogonal to the encoding scheme itself, and is
230	    therefore not specified here.  In the context of domain name labels,
231	    a standard prefix and/or suffix (chosen to be unlikely to occur
232	    naturally) would presumably be attached to ACE labels.  (In that
233	    case, it would probably be good to forbid the encoding of Unicode
234	    strings that appear to match the signature, to avoid confusing
235	    humans about whether they are looking at a Unicode string or an ACE
236	    string.)

238	    In order to use AltDUDE in domain names, the choice of signature
239	    must be mindful of the requirement in [RFC952] that labels never
240	    begin or end with hyphen-minus.  The raw encoded string will
241	    begin or end with a hyphen-minus iff the Unicode string does.  If
242	    the Unicode strings are forbidden from beginning or ending with
243	    hyphen-minus (which seems prudent anyway), then there is no problem.
244	    Otherwise, the signature must consist of both a prefix and a suffix.

246	    It appears that "---" is extremely rare in domain names; among the
247	    four-character prefixes of all the second-level domains under .com,
248	    .net, and .org, "---" never appears at all.  Therefore, perhaps the
249	    signature should be of the form ?--- (prefix) or ---? (suffix),
250	    where ? could be "u" for Unicode, or "i" for internationalized, or
251	    "a" for ACE, or maybe "q" or "z" because they are rare.

253	Case sensitivity models

255	    The higher layer must choose one of the following four models.

257	    Models suitable for domain names:

259	      * Case-insensitive:  Before a string is encoded, all its non-LDH
260	        characters must be case-folded so that any strings differing
261	        only in case become the same string (for example, strings could
262	        be forced to lowercase).  Folding LDH characters is optional.
263	        The case of base-32 characters and literal-mode characters is
264	        arbitrary and not significant.  Comparisons between encoded
265	        strings must be case-insensitive.  The original case of non-LDH
266	        characters cannot be recovered from the encoded string.

268	      * Case-preserving:  The case of the Unicode characters is not
269	        considered significant, but it can be preserved and recovered,
270	        just like in non-internationalized host names.  Before a string
271	        is encoded, all its non-LDH characters must be case-folded
272	        as in the previous model.  LDH characters are naturally able
273	        to retain their case attributes because they are encoded
274	        literally.  The case attribute of a non-LDH character is
275	        recorded in the last of the base-32 characters that represent
276	        it, which is guaranteed to be a letter rather than a digit.
277	        If the base-32 character is uppercase, it means the Unicode
278	        character is caseless or should be forced to uppercase after
279	        being decoded (which is a no-op if the case folding already
280	        forces to uppercase).  If the base-32 character is lowercase,
281	        it means the Unicode character is caseless or should be forced
282	        to lowercase after being decoded (which is a no-op if the case
283	        folding already forces to lowercase).  The case of the other
284	        base-32 characters in a multi-quintet encoding is arbitrary
285	        and not significant.  Only uppercase and lowercase attributes
286	        can be recorded, not titlecase.  Comparisons between encoded
287	        strings must be case-insensitive, and are equivalent to
288	        case-insensitive comparisons between the Unicode strings.  The
289	        intended mixed-case Unicode string can be recovered as long as
290	        the encoded characters are unaltered, but altering the case of
291	        the encoded characters is not harmful--it merely alters the case
292	        of the Unicode characters, and such a change is not considered
293	        significant.

295	        In this model, the input to the encoder and the output of the
296	        decoder can be the unfolded Unicode string (in which case the
297	        encoder and decoder are responsible for performing the case
298	        folding and recovery), or can be the folded Unicode string
299	        accompanied by separate case information (in which case the
300	        higher layer is responsible for performing the case folding and
301	        recovery).  Whichever layer performs the case recovery must
302	        first verify that the Unicode string is properly folded, to
303	        guarantee the uniqueness of the encoding.

305	        It is not very difficult to extend the nameprep algorithm
306	        [NAMEPREP03] to remember case information.

308	    The case-insensitive and case-preserving models are interoperable.
309	    If a domain name passes from a case-preserving entity to a
310	    case-insensitive entity, the case information will be lost, but
311	    the domain name will still be equivalent.  This phenomenon already
312	    occurs with non-internationalized domain names.
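
In the case-preserving model, the case flag lives in the last base-32
character of each encoded integer.  The following sketch shows one
way to set and read that flag; the helper names altdude_apply_case
and altdude_case_flag are hypothetical, and seq is assumed to hold
exactly the base-32 characters of one encoded integer.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: record the case flag in the last base-32    */
/* character of one encoded integer (len characters at seq).        */
static void altdude_apply_case(char *seq, size_t len, int uppercase)
{
  unsigned char last;

  if (len == 0) return;
  last = (unsigned char) seq[len - 1];
  seq[len - 1] = (char) (uppercase ? toupper(last) : tolower(last));
}

/* Hypothetical helper: read the flag back (1 = force to uppercase). */
static int altdude_case_flag(const char *seq, size_t len)
{
  return len > 0 && isupper((unsigned char) seq[len - 1]) != 0;
}
```

For instance, example (H) in section "Example strings" begins with
"wxR"; the uppercase "R" appears to record that the first character,
U+041F, is an uppercase letter.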

314	    Models unsuitable for domain names, but possibly useful in other
315	    contexts:

317	      * Case-sensitive:  Unicode strings may contain both uppercase and
318	        lowercase characters, which are not folded.  Base-32 characters
319	        must be lowercase.  Comparisons between encoded strings must be
320	        case-sensitive.

322	      * Case-flexible:  Like case-preserving, except that the choice
323	        of whether the case of the Unicode characters is considered
324	        significant is deferred.  Therefore, base-32 characters must
325	        be lowercase, except for those used to indicate uppercase
326	        Unicode characters.  Comparisons between encoded strings may be
327	        case-sensitive or case-insensitive, and such comparisons are
328	        equivalent to the corresponding comparisons between the Unicode
329	        strings.

331	Comparison with other ACEs

333	    The differences between AltDUDE and DUDE were given in section
334	    "Differences from DUDE".  For a comparison between DUDE and other
335	    ACEs, please see the AMC-ACE-O specification [AMCACEO00].

337	Example strings

339	    The first several examples are all translations of the sentence "Why
340	    can't they just speak in <language>?" (courtesy of Michael Kaplan's
341	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
342	    been removed, as is often done in domain names.

344	    (A) Arabic (Egyptian):
345	        U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
346	        U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

348	        AltDUDE: yueqpcycrcyjhbpznpitjycxf

350	    (B) Chinese (simplified):
351	        U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

353	        AltDUDE: w85gvk7g9k2iwf6x9j6x7ju54k

355	    (C) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky

357	        <ccaron> = U+010D
358	        <ecaron> = U+011B
359	        <iacute> = U+00ED
360	        AltDUDE: tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc

362	    (D) Hebrew:
363	        U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
364	        U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
365	        U+05D1 U+05E8 U+05D9 U+05EA

367	        AltDUDE: x5nckajvjpvnpenqpcvjvbevrvdvjvbvd

369	    (E) Hindi:
370	        U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
371	        U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
372	        U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
373	        U+0939 U+0948 U+0902  (Devanagari)

375	        AltDUDE: 3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzq\
376	                 zbwk

378	    (F) Japanese:
379	        U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092
380	        U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B
381	        (kanji and hiragana)

383	        AltDUDE: vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf

385	    (G) Korean:
386	        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
387	        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
388	        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C  (Hangul syllables)

390	        AltDUDE: 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8si\
391	                 tusauiyz5i23az96iz6ze3xaz2td96ry3si

393	    (H) Russian:
394	        U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
395	        U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
396	        U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
397	        U+0438  (Cyrillic)

399	        AltDUDE: wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc

401	    (I) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol

403	        <eacute> = U+00E9
404	        <ntilde> = U+00F1

406	        AltDUDE: tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb\
407	                 3a3qd

409	    (J) Taiwanese:
410	        U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587

412	        AltDUDE: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k

414	    (K) Vietnamese:
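415	        Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>\
416	        chi<hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
417	        <dotbelow>  = U+0323
418	        <ocirc>     = U+00F4
419	        <ecirc>     = U+00EA
420	        <hookabove> = U+0309
421	        <acute>     = U+0301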

423	        AltDUDE: tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
424	                 yitptp2dv8mvyrjtBtr2dv6jvxh

426	    The next several examples are all names of Japanese music artists,
427	    song titles, and TV programs, just because the author happens to
428	    have them handy (but Japanese is useful for providing examples
429	    of single-row text, two-row text, ideographic text, and various
430	    mixtures thereof).

432	    (L) 3<nen>B<gumi><kinpachi><sensei>  (Japanese TV program title)

434	        <nen>              = U+5E74                       (kanji)
435	        <gumi>             = U+7D44                       (kanji)
436	        <kinpachi><sensei> = U+91D1 U+516B U+5148 U+751F  (kanji)

438	        AltDUDE: xdx8whx8tGz7ug863f6s5kuduwxh

440	    (M) <amuro><namie>-with-SUPER-MONKEYS  (Japanese music group name)

442	        <amuro><namie> = U+5B89 U+5BA4 U+5948 U+7F8E U+6075  (kanji)

444	        AltDUDE: x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK

446	    (N) Hello-Another-Way-<sorezore><no><basho>  (Japanese song title)

448	        <sorezore><no> = U+305D U+308C U+305E U+308C U+306E  (hiragana)
449	        <basho>        = U+5834 U+6240                       (kanji)

451	        AltDUDE: Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze

453	    (O) <hitotsu><yane><no><shita>2  (Japanese TV program title)

455	        <hitotsu> = U+3072 U+3068 U+3064  (hiragana)
456	        <yane>    = U+5C4B U+6839         (kanji)
457	        <no>      = U+306E                (hiragana)
458	        <shita>   = U+4E0B                (kanji)

460	        AltDUDE: vstctkny6urvwzcx2xhz8yfw8vj

462	    (P) Maji<de>Koi<suru>5<byou><mae> (Japanese song title)

464	        <de>        = U+3067         (hiragana)
465	        <suru>      = U+3059 U+308B  (hiragana)
466	        <byou><mae> = U+79D2 U+524D  (kanji)

468	        AltDUDE: PnmdvssqvssNegvsva7cvs5qz38hu53r

470	    (Q) <pafii>de<runba>  (Japanese song title)

472	        <pafii> = U+30D1 U+30D5 U+30A3 U+30FC  (katakana)
473	        <runba> = U+30EB U+30F3 U+30D0         (katakana)
474	        AltDUDE: vs5bezgxrvs3ibvs2qtiud

476	    (R) <sono><supiido><de>  (Japanese song title)

478	        <sono>    = U+305D U+306E                (hiragana)
479	        <supiido> = U+30B9 U+30D4 U+30FC U+30C9  (katakana)
480	        <de>      = U+3067                       (hiragana)

482	        AltDUDE: vsvpvd7hypuivf4q

484	    The last example is an ASCII string that breaks not only the
485	    existing rules for host name labels but also the rules proposed in
486	    [NAMEPREP03] for internationalized domain names.

488	    (S) -> $1.00 <-

490	        AltDUDE: -xqtqetftrtqatatn-

492	Security considerations

494	    Users expect each domain name in DNS to be controlled by a single
495	    authority.  If a Unicode string intended for use as a domain label
496	    could map to multiple ACE labels, then an internationalized domain
497	    name could map to multiple ACE domain names, each controlled by
498	    a different authority, some of which could be spoofs that hijack
499	    service requests intended for another.  Therefore AltDUDE is
500	    designed so that each Unicode string has a unique encoding.

502	    However, there can still be multiple Unicode representations of the
503	    "same" text, for various definitions of "same".  This problem is
504	    addressed to some extent by the Unicode standard under the topic
505	    of canonicalization, but some text strings may be misleading or
506	    ambiguous to humans when used as domain names, such as strings
507	    containing dots, slashes, at-signs, etc.  These issues are being
508	    further studied under the topic of "nameprep" [NAMEPREP03].

510	Credits

512	    AltDUDE reuses a number of preexisting techniques.

514	    The basic encoding of integers to nybbles to quintets to base-32
515	    comes from UTF-5 [UTF5], and the particular variant used here comes
516	    from AMC-ACE-M [AMCACEM00].

518	    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
519	    from SFS [SFS].

521	    From DUDE (of which the latest version is [DUDE01]) comes the idea
522	    of encoding differences between successive integers.  The idea
523	    of using the alphabetic case of base-32 characters to record the
524	    desired case of the Unicode characters was suggested by this author,
525	    but in DUDE it was first applied to the UTF-5-style encoding.

527	References

529	    [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12,
530	    draft-ietf-idn-amc-ace-m-00.

532	    [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19,
533	    draft-ietf-idn-amc-ace-o-00.

535	    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
536	    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

538	    [IDN] Internationalized Domain Names (IETF working group),
539	    http://www.i-d-n.net/, idn@ops.ietf.org.

541	    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
542	    of Internationalized Host Names", 2001-Feb-24,
543	    draft-ietf-idn-nameprep-03.

545	    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
546	    http://www.trigeminal.com/samples/provincial.html.

548	    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
549	    Table Specification", 1985-Oct, RFC 952.

551	    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
552	    1987-Nov, RFC 1034.

554	    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
555	    "Requirements for Internet Hosts -- Application and Support",
556	    1989-Oct, RFC 1123.

558	    [SFS] David Mazieres et al, "Self-certifying File System",
559	    http://www.fs.net/.

561	    [UNICODE] The Unicode Consortium, "The Unicode Standard",
562	    http://www.unicode.org/unicode/standard/standard.html.

564	    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
565	    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

567	Author

569	    Adam M. Costello <amc@cs.berkeley.edu>
570	    http://www.cs.berkeley.edu/~amc/

572	    See also the authors of DUDE [DUDE01].

574	Example implementation

576	/******************************************/
577	/* altdude.c 0.0.2 (2001-Mar-19-Sun)      */
578	/* Adam M. Costello <amc@cs.berkeley.edu> */
579	/******************************************/

581	/* This is ANSI C code (C89) implementing AltDUDE    */
582	/* (draft-ietf-idn-altdude-00), a simplified variant */
583	/* of DUDE (draft-ietf-idn-dude-01).                 */
584	/************************************************************/
585	/* Public interface (would normally go in its own .h file): */

587	#include <limits.h>

589	enum altdude_status {
590	  altdude_success,
591	  altdude_invalid_input,
592	  altdude_output_too_big
593	};

595	enum case_sensitivity { case_sensitive, case_insensitive };

597	#if UINT_MAX >= 0x1FFFFF
598	typedef unsigned int u_code_point;
599	#else
600	typedef unsigned long u_code_point;
601	#endif

603	enum altdude_status altdude_encode(
604	  unsigned int input_length,
605	  const u_code_point *input,
606	  const unsigned char *uppercase_flags,
607	  unsigned int *output_size,
608	  char *output );

    /* altdude_encode() converts Unicode to AltDUDE (without any      */
    /* signature).  The input must be represented as an array         */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII.  The input_length is the number of code */
    /* points in the input.  The output_size is an in/out argument:   */
    /* the caller must pass in the maximum number of characters       */
    /* that may be output (including the terminating null), and on    */
    /* successful return it will contain the number of characters     */
    /* actually output (including the terminating null, so it will be */
    /* one more than strlen() would return, which is why it is called */
    /* output_size rather than output_length).  The uppercase_flags   */
    /* array must hold input_length boolean values, where nonzero     */
    /* means the corresponding Unicode character should be forced     */
    /* to uppercase after being decoded, and zero means it is         */
    /* caseless or should be forced to lowercase.  Alternatively,     */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros.  The encoder always outputs lower case base-32   */
    /* characters except when nonzero values of uppercase_flags       */
    /* require otherwise.  The return value may be any of the         */
    /* altdude_status values defined above; if not altdude_success,   */
    /* then output_size and output may contain garbage.  On success,  */
    /* the encoder will never need to write an output_size greater    */
    /* than input_length*k+1 if all the input code points are less    */
    /* than 1 << (4*k) with k >= 2 (the initial "previous" code point */
    /* 96 already spans two base-32 digits), because of how the       */
    /* encoding is defined.                                           */

enum altdude_status altdude_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags );

    /* altdude_decode() converts AltDUDE (without any signature) to   */
    /* Unicode.  The input must be represented as null-terminated     */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points.  The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()).  The output_length is      */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output.  The uppercase_flags array must have room for at least */
    /* output_length values, or it may be a null pointer if the case  */
    /* information is not needed.  A nonzero flag indicates that the  */
    /* corresponding Unicode character should be forced to uppercase  */
    /* by the caller, while zero means it is caseless or should be    */
    /* forced to lowercase.  The return value may be any of the       */
    /* altdude_status values defined above; if not altdude_success,   */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage.  On success, the decoder will never need to write     */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* Character utilities: */

/* is_AtoZ(c) returns 1 if c is an         */
/* uppercase ASCII letter, zero otherwise. */

static unsigned char is_AtoZ(char c)
{
  return c >= 65 && c <= 90;
}

/* base32[n] is the lowercase base-32 character representing  */
/* the number n from the range 0 to 31.  Note that we cannot  */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.                   */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.                                                */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
/* then ASCII A-Z are considered equal to a-z respectively.           */

static int unequal(
  enum case_sensitivity case_sensitivity, const char *s1, const char *s2 )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* altdude_initial_code_point is the initial value of the */
/* "previous" code point, before the first code point.    */

static const u_code_point altdude_initial_code_point = 96;

/* Encoder: */

enum altdude_status altdude_encode(
  unsigned int input_length,
  const u_code_point *input,
  const unsigned char *uppercase_flags,
  unsigned int *output_size,
  char *output )
{
  unsigned int next_in, next_out, max_out, n, out;
  u_code_point prev, codept, diff, tmp;
  char shift;

  prev = altdude_initial_code_point;
  max_out = *output_size;
  next_out = 0;

  for (next_in = 0;  next_in < input_length;  ++next_in) {
    codept = input[next_in];

    if (codept == 45) {
      /* hyphen-minus stands for itself */
      if (max_out - next_out < 1) return altdude_output_too_big;
      output[next_out++] = 45;
      continue;
    }

    shift = uppercase_flags && uppercase_flags[next_in] ? 32 : 0;
    /* shift will determine the case of the last base-32 digit */
    diff = prev ^ codept;
    for (tmp = diff >> 4, n = 1;  tmp != 0;  ++n, tmp >>= 4);
    /* n is the number of base-32 digits */
    if (max_out - next_out < n) return altdude_output_too_big;

    /* Computing the base-32 digits in reverse order is easiest. */
    /* Only the last base-32 digit has the high bit clear.       */

    out = next_out + n - 1;
    output[out] = base32[diff & 0xF] - shift;

    while (out > next_out) {
      diff >>= 4;
      output[--out] = base32[0x10 | (diff & 0xF)];
    }

    next_out += n;
    prev = codept;
  }

  /* null terminator: */
  if (max_out - next_out < 1) return altdude_output_too_big;
  output[next_out++] = 0;
  *output_size = next_out;
  return altdude_success;
}

/* Decoder: */

enum altdude_status altdude_decode(
  enum case_sensitivity case_sensitivity,
  char *scratch_space,
  const char *input,
  unsigned int *output_length,
  u_code_point *output,
  unsigned char *uppercase_flags )
{
  u_code_point prev, q, diff;
  const char *in;
  char c;
  unsigned int next_out, max_out, input_size, scratch_size;
  enum altdude_status status;

  prev = altdude_initial_code_point;
  max_out = *output_length;
  next_out = 0;
  in = input;

  for (c = *in;  c != 0; ) {
    if (max_out - next_out < 1) return altdude_output_too_big;

    if (c == 45) {
      /* hyphen-minus stands for itself */
      output[next_out] = 45;
      if (uppercase_flags) uppercase_flags[next_out] = 0;
      ++next_out;
      c = *++in;
      continue;
    }

    /* Base-32 sequence: */

    diff = 0;

    do {
      q = base32_decode(c);
      if (q == base32_invalid) return altdude_invalid_input;
      diff = (diff << 4) | (q & 0xF);
      c = *++in;
    } while (q >> 4 == 1);

    /* case of last digit determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]);
    prev = output[next_out++] = prev ^ diff;
  }

  /* Re-encode the output and compare to the input: */

  input_size = in - input + 1;
  scratch_size = input_size;
  status = altdude_encode(next_out, output, uppercase_flags,
                          &scratch_size, scratch_space);
  if (status != altdude_success ||
      scratch_size != input_size ||
      unequal(case_sensitivity, scratch_space, input)
     ) return altdude_invalid_input;

  *output_length = next_out;
  return altdude_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive  /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads big-endian UTF-32 and writes AltDUDE ASCII.\n"
    "%s -d reads AltDUDE ASCII and writes big-endian UTF-32.\n"
    "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

int main(int argc, char **argv)
{
  enum altdude_status status;
  int r;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][1] == '\0') usage(argv);  /* bare "-": do not read past it */
  if (argv[1][2] != '\0') usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size];
    unsigned int input_length, output_size;
    int c0, c1, c2, c3;

    /* Read the UTF-32 input string: */

    input_length = 0;

    for (;;) {
      c0 = getchar();
      c1 = getchar();
      c2 = getchar();
      c3 = getchar();
      if (ferror(stdin)) fail(io_error);

      if (c1 == EOF || c2 == EOF || c3 == EOF) {
        if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
        break;
      }

      if (input_length == unicode_max_length) fail(too_big);

      if ((c0 != 0 && c0 != 0x80)
          || c1 < 0 || c1 > 0x10
          || c2 < 0 || c2 > 0xFF
          || c3 < 0 || c3 > 0xFF ) {
        fail(invalid_input);
      }

      input[input_length] = ((u_code_point) c1 << 16) |
                            ((u_code_point) c2 <<  8) | (u_code_point) c3;
      uppercase_flags[input_length] = (c0 >> 7);
      ++input_length;
    }

    /* Encode, and output the result: */

    output_size = ace_max_size;
    status = altdude_encode(input_length, input, uppercase_flags,
                            &output_size, output);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);
    r = fputs(output,stdout);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size];
    u_code_point output[unicode_max_length], codept;
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int output_length, i;

    /* Read the AltDUDE ASCII input string: */

    input[0] = 0;  /* fgets() leaves the buffer untouched at immediate EOF */
    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (!feof(stdin)) fail(too_big);

    /* Decode, and output the result: */

    output_length = unicode_max_length;
    status = altdude_decode(test_case_sensitivity, scratch, input,
                            &output_length, output, uppercase_flags);
    if (status == altdude_invalid_input) fail(invalid_input);
    if (status == altdude_output_too_big) fail(too_big);
    assert(status == altdude_success);

    for (i = 0;  i < output_length;  ++i) {
      r = putchar(uppercase_flags[i] ? 0x80 : 0);
      if (r == EOF) fail(io_error);
      codept = output[i];
      r = putchar(codept >> 16);
      if (r == EOF) fail(io_error);
      r = putchar((codept >> 8) & 0xFF);
      if (r == EOF) fail(io_error);
      r = putchar(codept & 0xFF);
      if (r == EOF) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
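
The encoder loop above XORs each code point with the previous one
(starting from 96) and writes the difference as base-32 digits, four
bits per digit, with bit 4 set on every digit except the last.  The
following is a minimal sketch of that step only, not part of the
reference listing.  The names sketch_base32 and sketch_encode are
hypothetical, case flags and status codes are omitted, and the sketch
assumes an ASCII execution character set so that a string literal can
stand in for the numeric table.

```c
#include <stddef.h>
#include <string.h>

/* Base-32 alphabet from the listing: a-k, m-n, p-z, 2-9.  A string */
/* literal is used for brevity, so this sketch ASSUMES an ASCII     */
/* execution character set (the listing avoids that assumption).    */
static const char sketch_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* sketch_encode() is a hypothetical, simplified encoder: no case   */
/* flags and no status enum.  It writes null-terminated output and  */
/* returns the number of characters written including the null,     */
/* or 0 if out_size is too small.                                   */
static size_t sketch_encode(const unsigned long *in, size_t n_in,
                            char *out, size_t out_size)
{
  unsigned long prev = 96, diff, tmp;  /* 96: initial "previous" value */
  size_t i, n, k, pos = 0;

  for (i = 0;  i < n_in;  ++i) {
    if (in[i] == 45) {
      /* hyphen-minus stands for itself and does not update prev */
      if (out_size - pos < 1) return 0;
      out[pos++] = 45;
      continue;
    }

    diff = prev ^ in[i];
    for (tmp = diff >> 4, n = 1;  tmp != 0;  ++n, tmp >>= 4);
    /* n is the number of base-32 digits needed for diff */
    if (out_size - pos < n) return 0;

    /* Fill the digits right to left.  The last digit carries the   */
    /* low four bits with bit 4 clear; earlier digits set bit 4.    */
    out[pos + n - 1] = sketch_base32[diff & 0xF];
    for (k = n - 1;  k > 0;  --k) {
      diff >>= 4;
      out[pos + k - 1] = sketch_base32[0x10 | (diff & 0xF)];
    }

    pos += n;
    prev = in[i];
  }

  if (out_size - pos < 1) return 0;
  out[pos++] = 0;
  return pos;
}
```

Under these assumptions, the code points 97, 98, 99 ("abc") give the
differences 1, 3, 1 and therefore the string "bdb".  The single code
point 0xE9 gives 0x60 ^ 0xE9 = 0x89, which is written as the two
digits "2j" and needs an output size of 3, matching input_length*k+1
for k = 2.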

                   INTERNET-DRAFT expires 2001-Sep-19