idnits 2.17.1

draft-ietf-idn-dude-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 2
     longer pages, the longest (page 12) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([UNICODE], [IDNA], [IDN]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'NAMEPREP' is mentioned on line 63, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 766

  -- Looks like a reference, but probably isn't: '1' on line 794

  -- Looks like a reference, but probably isn't: '2' on line 742

  -- Looks like a reference, but probably isn't: '3' on line 747

  == Unused Reference: 'RFC952' is defined on line 318, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1123' is defined on line 324, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  == Outdated reference: A later version (-13) exists of
     draft-ietf-idn-idna-01

  == Outdated reference: A later version (-10) exists of
     draft-ietf-idn-nameprep-03

  ** Downref: Normative reference to an Unknown state RFC: RFC  952

  -- Possible downref: Non-RFC (?) normative reference: ref. 'SFS'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'


     Summary: 6 errors (**), 0 flaws (~~), 7 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                               Mark Welter
2	draft-ietf-idn-dude-02.txt                            Brian W. Spolarich
3	Expires 2001-Dec-07                                     Adam M. Costello
4	                                                             2001-Jun-07

6	              Differential Unicode Domain Encoding (DUDE)

8	Status of this Memo

10	    This document is an Internet-Draft and is in full conformance with
11	    all provisions of Section 10 of RFC2026.

13	    Internet-Drafts are working documents of the Internet Engineering
14	    Task Force (IETF), its areas, and its working groups.  Note
15	    that other groups may also distribute working documents as
16	    Internet-Drafts.

18	    Internet-Drafts are draft documents valid for a maximum of six
19	    months and may be updated, replaced, or obsoleted by other documents
20	    at any time.  It is inappropriate to use Internet-Drafts as
21	    reference material or to cite them other than as "work in progress."

23	    The list of current Internet-Drafts can be accessed at
24	    http://www.ietf.org/ietf/1id-abstracts.txt

26	    The list of Internet-Draft Shadow Directories can be accessed at
27	    http://www.ietf.org/shadow.html

29	    Distribution of this document is unlimited.  Please send comments to
30	    the authors or to the idn working group at idn@ops.ietf.org.

32	Abstract

34	    DUDE is a reversible transformation from a sequence of nonnegative
35	    integer values to a sequence of letters, digits, and hyphens (LDH
36	    characters).  DUDE provides a simple and efficient ASCII-Compatible
37	    Encoding (ACE) of Unicode strings [UNICODE] for use with
38	    Internationalized Domain Names [IDN] [IDNA].

40	Contents

42	    1. Introduction
43	    2. Terminology
44	    3. Overview
45	    4. Base-32 characters
46	    5. Encoding procedure
47	    6. Decoding procedure
48	    7. Example strings
49	    8. Security considerations
50	    9. References
51	    A. Acknowledgements
52	    B. Author contact information
53	    C. Mixed-case annotation
54	    D. Differences from draft-ietf-idn-dude-01
55	    E. Example implementation

56	1. Introduction

58	    The IDNA draft [IDNA] describes an architecture for supporting
59	    internationalized domain names.  Each label of a domain name may
60	    begin with a special prefix, in which case the remainder of the
61	    label is an ASCII-Compatible Encoding (ACE) of a Unicode string
62	    satisfying certain constraints.  For the details of the constraints,
63	    see [IDNA] and [NAMEPREP].  The prefix has not yet been specified,
64	    but see http://www.i-d-n.net/ for prefixes to be used for testing
65	    and experimentation.

67	    DUDE is intended to be used as an ACE within IDNA, and has been
68	    designed to have the following features:

70	      * Completeness:  Every sequence of nonnegative integers maps to an
71	        LDH string.  Restrictions on which integers are allowed, and on
72	        sequence length, may be imposed by higher layers.

74	      * Uniqueness:  Every sequence of nonnegative integers maps to at
75	        most one LDH string.

77	      * Reversibility:  Any Unicode string mapped to an LDH string can
78	        be recovered from that LDH string.

80	      * Efficient encoding:  The ratio of encoded size to original size
81	        is small.  This is important in the context of domain names
82	        because [RFC1034] restricts the length of a domain label to 63
83	        characters.

85	      * Simplicity:  The encoding and decoding algorithms are reasonably
86	        simple to implement.  The goals of efficiency and simplicity are
87	        at odds; DUDE places greater emphasis on simplicity.

89	    An optional feature is described in appendix C "Mixed-case
90	    annotation".

92	2. Terminology

94	    The key words "must", "shall", "required", "should", "recommended",
95	    and "may" in this document are to be interpreted as described in
96	    RFC 2119 [RFC2119].

98	    LDH characters are the letters A-Z and a-z, the digits 0-9, and
99	    hyphen-minus.

101	    A quartet is a sequence of four bits (also known as a nibble or
102	    nybble).

104	    A quintet is a sequence of five bits.

106	    Hexadecimal values are shown preceded by "0x".  For example, 0x60
107	    is decimal 96.

109	    As in the Unicode Standard [UNICODE], Unicode code points are
110	    denoted by "U+" followed by four to six hexadecimal digits, while a
111	    range of code points is denoted by two hexadecimal numbers separated
112	    by "..", with no prefixes.

114	    XOR means bitwise exclusive or.  Given two nonnegative integer
115	    values A and B, A XOR B is the nonnegative integer value whose
116	    binary representation is 1 in whichever places the binary
117	    representations of A and B disagree, and 0 wherever they agree.
118	    For the purpose of applying this rule, recall that an integer's
119	    representation begins with an infinite number of unwritten zeros.
120	    In some programming languages, care may need to be taken that A and
121	    B are stored in variables of the same type and size.
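
    As a small illustration of the XOR definition above, the following
    C sketch keeps both operands in the same unsigned type, so that no
    bits are lost or sign-extended.  The name xor_values is
    hypothetical and is not part of DUDE.

```c
/* Illustrative sketch only; xor_values is a hypothetical helper. */

/* Returns the bitwise XOR of two nonnegative integers.  Both   */
/* operands share one unsigned type, as the text recommends.    */
unsigned long xor_values(unsigned long a, unsigned long b)
{
  return a ^ b;
}
```

    For example, xor_values(0x60, 0x61) is 0x01, and
    xor_values(0x33, 0x5E74) is 0x5E47.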

123	3. Overview

125	    DUDE encodes a sequence of nonnegative integral values as a sequence
126	    of LDH characters, although implementations will of course need to
127	    represent the output characters somehow, typically as ASCII octets.
128	    When DUDE is used to encode Unicode characters, the input values are
129	    Unicode code points (integral values in the range 0..10FFFF, but not
130	    D800..DFFF, which are reserved for use by UTF-16).

132	    Each value in the input sequence is represented by one or more LDH
133	    characters in the encoded string.  The value 0x2D is represented
134	    by hyphen-minus (U+002D).  Each non-hyphen-minus character in
135	    the encoded string represents a quintet.  A sequence of quintets
136	    represents the bitwise XOR of each non-0x2D integer with the
137	    previous non-0x2D integer (or with 0x60 for the first one).

139	4. Base-32 characters

141	        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
142	        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
143	        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
144	        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
145	        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
146	        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
147	        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
148	        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
149	        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
150	        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
151	        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
152	        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
153	        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
154	        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
155	        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
156	        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

158	    The digits "0" and "1" and the letters "o" and "l" are not used, to
159	    avoid transcription errors.

161	    A decoder must accept both the uppercase and lowercase forms of
162	    the base-32 characters (including mixtures of both forms).  An
163	    encoder should output only lowercase forms or only uppercase forms
164	    (unless it uses the feature described in appendix C "Mixed-case
165	    annotation").
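
    The table above can be sketched in C as a lookup string.  This is
    only an illustration: the names base32_alphabet, quintet_to_char,
    and char_to_quintet are hypothetical, and the sketch writes the
    alphabet as a native string literal for readability, whereas the
    example implementation in appendix E uses numeric ASCII codes.

```c
#include <ctype.h>
#include <string.h>

/* The 32 base-32 characters in value order (no "l", "o", "0", "1"). */
static const char base32_alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Returns the lowercase base-32 character for a quintet (0..31). */
char quintet_to_char(unsigned int quintet)
{
  return base32_alphabet[quintet & 0x1F];
}

/* Returns the quintet (0..31) for a base-32 character in either */
/* case, or -1 if c is not a base-32 character.                  */
int char_to_quintet(char c)
{
  const char *p;

  if (c == 0) return -1;  /* strchr would match the terminator */
  p = strchr(base32_alphabet, tolower((unsigned char) c));
  return p ? (int)(p - base32_alphabet) : -1;
}
```

    With this sketch, char_to_quintet('x') and char_to_quintet('X')
    both yield 21, while 'l', 'o', '0', and '1' are rejected.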

167	5. Encoding procedure

169	    All ordering of bits, quartets, and quintets is big-endian (most
170	    significant first).

172	    let prev = 0x60
173	    for each input integer n (in order) do begin
174	      if n == 0x2D then output hyphen-minus
175	      else begin
176	        let diff = prev XOR n
177	        represent diff in base 16 as a sequence of quartets,
178	          as few as are sufficient (but at least one)
179	        prepend 0 to the last quartet and 1 to each of the others
180	        output a base-32 character corresponding to each quintet
181	        let prev = n
182	      end
183	    end

185	    If an encoder encounters an input value larger than expected (for
186	    example, the largest Unicode code point is U+10FFFF, and nameprep
187	    [NAMEPREP03] can never output a code point larger than U+EFFFD),
188	    the encoder may either encode the value correctly, or may fail, but
189	    it must not produce incorrect output.  The encoder must fail if it
190	    encounters a negative input value.
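
    The procedure above might be sketched in C as follows.  This is a
    minimal illustration, not the example implementation of appendix
    E: the name dude_encode_sketch and its return convention are
    hypothetical, it always emits lowercase, and it assumes the native
    character set can be used for the output characters.

```c
#include <stddef.h>

/* Hypothetical sketch of the section 5 procedure.  Writes a       */
/* null-terminated lowercase DUDE string for the n values in       */
/* input[] into out[] (capacity out_size).  Returns 0 on success,  */
/* or -1 if the output does not fit.                               */
int dude_encode_sketch(const unsigned long input[], size_t n,
                       char out[], size_t out_size)
{
  static const char base32[] = "abcdefghijkmnpqrstuvwxyz23456789";
  unsigned long prev = 0x60, diff, tmp;
  size_t i, k, j, pos = 0;

  for (i = 0;  i < n;  ++i) {
    if (input[i] == 0x2D) {
      /* Hyphen-minus stands for itself and leaves prev unchanged. */
      if (out_size - pos < 2) return -1;
      out[pos++] = '-';
      continue;
    }

    diff = prev ^ input[i];

    /* Count the quartets needed for diff (at least one): */
    for (tmp = diff >> 4, k = 1;  tmp != 0;  tmp >>= 4) ++k;
    if (out_size - pos < k + 1) return -1;

    /* Fill in the quintets from last to first.  The last quintet */
    /* has a leading 0 bit; every other quintet has a leading 1.  */
    out[pos + k - 1] = base32[diff & 0xF];
    for (j = 2;  j <= k;  ++j) {
      diff >>= 4;
      out[pos + k - j] = base32[0x10 | (diff & 0xF)];
    }

    pos += k;
    prev = input[i];
  }

  if (out_size - pos < 1) return -1;
  out[pos] = 0;
  return 0;
}
```

    For the input 0x33 0x5E74 the first diff is 0x60 XOR 0x33 = 0x53,
    giving the quintets 10101 00011 ("xd"), and the second diff is
    0x33 XOR 0x5E74 = 0x5E47, giving "x8wh".  The sketch therefore
    produces "xdx8wh", which agrees with the start of example N in
    section 7.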

192	6. Decoding procedure

194	    let prev = 0x60
195	    while the input string is not exhausted do begin
196	      if the next character is hyphen-minus
197	      then consume it and output 0x2D
198	      else begin
199	        consume characters and convert them to quintets until
200	          encountering a quintet whose first bit is 0
201	        fail upon encountering a non-base-32 character or end-of-input
202	        strip the first bit of each quintet
203	        concatenate the resulting quartets to form diff
204	        let prev = prev XOR diff
205	        output prev
206	      end
207	    end
208	    encode the output sequence and compare it to the input string
209	    fail if they do not match (case-insensitively)

211	    The comparison at the end is necessary to guarantee the uniqueness
212	    property (there cannot be two distinct encoded strings representing
213	    the same sequence of integers).  This check also frees the decoder
214	    from having to check for overflow while decoding the base-32
215	    characters.  (If the decoder is one step of a larger decoding
216	    process, it may be possible to defer the re-encoding and comparison
217	    to the end of that larger decoding process.)
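
    A minimal C sketch of the decoding loop follows.  The names
    quintet_of and dude_decode_sketch are hypothetical.  As a
    simplification that is an assumption of this sketch rather than
    part of the procedure above, it does not re-encode the whole
    output; instead it recomputes, for each decoded diff, how many
    quintets an encoder would emit and fails if that differs from the
    number consumed, which is intended to reject the same inputs.  It
    does not record mixed-case annotations.

```c
#include <stddef.h>

/* Returns the value (0..31) of a base-32 character in either case, */
/* or -1 if c is not a base-32 character.                           */
static int quintet_of(char c)
{
  static const char lower[] = "abcdefghijkmnpqrstuvwxyz23456789";
  static const char upper[] = "ABCDEFGHIJKMNPQRSTUVWXYZ23456789";
  int i;

  for (i = 0;  i < 32;  ++i) {
    if (c == lower[i] || c == upper[i]) return i;
  }
  return -1;
}

/* Hypothetical sketch of the section 6 procedure.  Decodes the     */
/* null-terminated string in[] into at most max_out values in       */
/* out[].  Returns the number of values decoded, or -1 on failure.  */
long dude_decode_sketch(const char in[], unsigned long out[],
                        size_t max_out)
{
  unsigned long prev = 0x60, diff, tmp;
  size_t n = 0, used, needed;
  int q;

  while (*in != 0) {
    if (n == max_out) return -1;

    if (*in == '-') {
      ++in;
      out[n++] = 0x2D;
      continue;
    }

    /* Consume quintets until one whose first bit is 0: */
    diff = 0;
    used = 0;
    do {
      q = quintet_of(*in);
      if (q < 0) return -1;  /* bad character or end of input */
      ++in;
      diff = (diff << 4) | (unsigned long)(q & 0xF);
      ++used;
    } while (q & 0x10);

    /* Per-sequence uniqueness check: an encoder would emit */
    /* exactly "needed" quintets for this diff.             */
    for (tmp = diff >> 4, needed = 1;  tmp != 0;  tmp >>= 4) ++needed;
    if (used != needed) return -1;

    prev ^= diff;
    out[n++] = prev;
  }

  return (long) n;
}
```

    With this sketch, "b" decodes to 0x61, while "sb" is rejected:
    its quintets 10000 00001 also yield diff 0x01, but the leading
    zero quartet is redundant, so accepting it would give 0x61 two
    distinct encodings.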

219	7. Example strings

221	    The first several examples are nonsense strings of mostly unassigned
222	    code points intended to exercise the corner cases of the algorithm.

224	    (A) u+0061
225	        DUDE: b

227	    (B) u+2C7EF u+2C7EF
228	        DUDE: u6z2ra

229	    (C) u+1752B u+1752A
230	        DUDE: tzxwmb

232	    (D) u+63AB1 u+63ABA
233	        DUDE: yv47bm

235	    (E) u+261AF u+261BF
236	        DUDE: uyt6rta

238	    (F) u+C3A31 u+C3A8C
239	        DUDE: 6v4xb5p

241	    (G) u+09F44 u+0954C
242	        DUDE: 39ue4si

244	    (H) u+8D1A3 u+8C8A3
245	        DUDE: 27t6dt3sa

247	    (I) u+6C2B6 u+CC266
248	        DUDE: y6u7g4ss7a

250	    (J) u+002D u+002D u+002D u+E848F
251	        DUDE: ---82w8r

253	    (K) u+BD08E u+002D u+002D u+002D
254	        DUDE: 57s8q---

256	    (L) u+A9A24 u+002D u+002D u+002D u+C05B7
257	        DUDE: 434we---y393d

259	    (M) u+7FFFFFFF
260	        DUDE: z999993r or explicit failure

262	    The next several examples are realistic Unicode strings that could
263	    be used in domain names.  They exhibit single-row text, two-row
264	    text, ideographic text, and mixtures thereof.  These examples are
265	    names of Japanese television programs, music artists, and songs,
266	    merely because one of the authors happened to have them handy.

268	    (N) 3b  (Latin, kanji)
269	        u+0033 u+5E74 u+0062 u+7D44 u+91D1 u+516B u+5148 u+751F
270	        DUDE: xdx8whx8tgz7ug863f6s5kuduwxh

272	    (O) -with-super-monkeys  (Latin, kanji, hyphens)
273	        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
274	        u+0068 u+002D u+0073 u+0075 u+0070 u+0065 u+0072 u+002D u+006D
275	        u+006F u+006E u+006B u+0065 u+0079 u+0073
276	        DUDE: x58jupu8nuy6gt99m-yssctqtptn-tmgftfth-trcbfqtnk

278	    (P) majikoi5  (Latin, hiragana, kanji)
279	        u+006D u+0061 u+006A u+0069 u+3067 u+006B u+006F u+0069 u+3059
280	        u+308B u+0035 u+79D2 u+524D
281	        DUDE: pnmdvssqvssnegvsva7cvs5qz38hu53r

283	    (Q) de  (Latin, katakana)
284	        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
285	        DUDE: vs5bezgxrvs3ibvs2qtiud

286	    (R)   (hiragana, katakana)
287	        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
288	        DUDE: vsvpvd7hypuivf4q

290	8. Security considerations

292	    Users expect each domain name in DNS to be controlled by a single
293	    authority.  If a Unicode string intended for use as a domain label
294	    could map to multiple ACE labels, then an internationalized domain
295	    name could map to multiple ACE domain names, each controlled by
296	    a different authority, some of which could be spoofs that hijack
297	    service requests intended for another.  Therefore DUDE is designed
298	    so that each Unicode string has a unique encoding.

300	    However, there can still be multiple Unicode representations of the
301	    "same" text, for various definitions of "same".  This problem is
302	    addressed to some extent by the Unicode standard under the topic of
303	    canonicalization, and this work is leveraged for domain names by
304	    "nameprep" [NAMEPREP03].

306	9. References

308	    [IDN] Internationalized Domain Names (IETF working group),
309	    http://www.i-d-n.net/, idn@ops.ietf.org.

311	    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
312	    Names In Applications (IDNA)", draft-ietf-idn-idna-01.

314	    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
315	    of Internationalized Host Names", 2001-Feb-24,
316	    draft-ietf-idn-nameprep-03.

318	    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
319	    Table Specification", 1985-Oct, RFC 952.

321	    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
322	    1987-Nov, RFC 1034.

324	    [RFC1123] Internet Engineering Task Force, R. Braden (editor),
325	    "Requirements for Internet Hosts -- Application and Support",
326	    1989-Oct, RFC 1123.

328	    [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
329	    Requirement Levels", 1997-Mar, RFC 2119.

331	    [SFS] David Mazieres et al, "Self-certifying File System",
332	    http://www.fs.net/.

334	    [UNICODE] The Unicode Consortium, "The Unicode Standard",
335	    http://www.unicode.org/unicode/standard/standard.html.

337	A. Acknowledgements

339	    The basic encoding of integers to quartets to quintets to base-32
340	    comes from earlier IETF work by Martin Duerst.  DUDE uses a slight
341	    variation on the idea.

343	    Paul Hoffman provided helpful comments on this document.

345	    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
346	    from SFS [SFS].

348	B. Author contact information

350	    Mark Welter 
351	    Brian W. Spolarich 
352	    WALID, Inc.
353	    State Technology Park
354	    2245 S. State St.
355	    Ann Arbor, MI  48104
356	    +1 734 822 2020

358	    Adam M. Costello 
359	    University of California, Berkeley
360	    http://www.cs.berkeley.edu/~amc/

362	C. Mixed-case annotation

364	    In order to use DUDE to represent case-insensitive Unicode strings,
365	    higher layers need to case-fold the Unicode strings prior to DUDE
366	    encoding.  The encoded string can, however, use mixed-case base-32
367	    (rather than all-lowercase or all-uppercase as recommended in
368	    section 4 "Base-32 characters") as an annotation telling how to
369	    convert the folded Unicode string into a mixed-case Unicode string
370	    for display purposes.

372	    Each Unicode code point (unless it is U+002D hyphen-minus) is
373	    represented by a sequence of base-32 characters, the last of which
374	    is always a letter (as opposed to a digit).  If that letter is
375	    uppercase, it is a suggestion that the Unicode character be mapped
376	    to uppercase (if possible); if the letter is lowercase, it is a
377	    suggestion that the Unicode character be mapped to lowercase (if
378	    possible).

380	    DUDE encoders and decoders are not required to support these
381	    annotations, and higher layers need not use them.

383	    Example:  In order to suggest that example O in section 7 "Example
384	    strings" be displayed as:

386	        -with-SUPER-MONKEYS

388	    one could capitalize the DUDE encoding as:

390	        x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK
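
    Reading the annotation back can be sketched in C as follows.  The
    name dude_case_flags is hypothetical.  The sketch relies on the
    fact that the last quintet of each sequence has the form 0xxxx and
    is therefore one of the letters a-k, m, n, p, q, r (in either
    case); it assumes a well-formed DUDE string and performs no
    validation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: scans a DUDE string and records, for each   */
/* encoded code point, whether the annotation suggests uppercase.   */
/* flags[] receives 1 or 0 per code point (hyphen-minus is always   */
/* 0).  Returns the number of flags written, which is at most       */
/* max_flags.                                                       */
size_t dude_case_flags(const char s[], unsigned char flags[],
                       size_t max_flags)
{
  /* The characters for quintets 0..15, which end a sequence: */
  static const char final_lower[] = "abcdefghijkmnpqr";
  static const char final_upper[] = "ABCDEFGHIJKMNPQR";
  size_t n = 0;

  for ( ;  *s != 0 && n < max_flags;  ++s) {
    if (*s == '-') flags[n++] = 0;
    else if (strchr(final_upper, *s)) flags[n++] = 1;
    else if (strchr(final_lower, *s)) flags[n++] = 0;
    /* Any other character is a leading quintet; keep scanning. */
  }

  return n;
}
```

    Applied to the capitalized string above, the sketch reports 24
    code points, with the flag set for the five letters of "SUPER"
    and the seven letters of "MONKEYS" and clear everywhere else.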

392	D. Differences from draft-ietf-idn-dude-01

394	    Four changes have been made since draft-ietf-idn-dude-01 (DUDE-01):

396	     1) DUDE-01 computed the XOR of each integer with the previous one
397	        in order to decide how many bits of each integer to encode, but
398	        now the XOR itself is encoded, so there is no need for a mask.

400	     2) DUDE-01 made the first quintet of each sequence different from
401	        the rest, while now it is the last quintet that differs, so it's
402	        easier for the decoder to detect the end of the sequence.

404	     3) The base-32 map has changed to avoid 0, 1, o, and l, to help
405	        humans avoid transcription errors.

407	     4) The initial value of the previous code point has changed from 0
408	        to 0x60, making the encodings of a few domain names shorter and
409	        none longer.

411	E. Example implementation

413	/******************************************/
414	/* dude.c 0.2.3 (2001-May-31-Thu)         */
415	/* Adam M. Costello  */
416	/******************************************/

418	/* This is ANSI C code (C89) implementing */
419	/* DUDE (draft-ietf-idn-dude-02).         */

421	/************************************************************/
422	/* Public interface (would normally go in its own .h file): */

424	#include 

426	enum dude_status {
427	  dude_success,
428	  dude_bad_input,
429	  dude_big_output  /* Output would exceed the space provided. */
430	};

432	enum case_sensitivity { case_sensitive, case_insensitive };

434	#if UINT_MAX >= 0x1FFFFF
435	typedef unsigned int u_code_point;
436	#else
437	typedef unsigned long u_code_point;
438	#endif

440	enum dude_status dude_encode(
441	  unsigned int input_length,
442	  const u_code_point input[],
443	  const unsigned char uppercase_flags[],
444	  unsigned int *output_size,
445	  char output[] );
446	    /* dude_encode() converts Unicode to DUDE (without any            */
447	    /* signature).  The input must be represented as an array         */
448	    /* of Unicode code points (not code units; surrogate pairs        */
449	    /* are not allowed), and the output will be represented as        */
450	    /* null-terminated ASCII.  The input_length is the number of code */
451	    /* points in the input.  The output_size is an in/out argument:   */
452	    /* the caller must pass in the maximum number of characters       */
453	    /* that may be output (including the terminating null), and on    */
454	    /* successful return it will contain the number of characters     */
455	    /* actually output (including the terminating null, so it will be */
456	    /* one more than strlen() would return, which is why it is called */
457	    /* output_size rather than output_length).  The uppercase_flags   */
458	    /* array must hold input_length boolean values, where nonzero     */
459	    /* means the corresponding Unicode character should be forced     */
460	    /* to uppercase after being decoded, and zero means it is         */
461	    /* caseless or should be forced to lowercase.  Alternatively,     */
462	    /* uppercase_flags may be a null pointer, which is equivalent     */
463	    /* to all zeros.  The encoder always outputs lowercase base-32    */
464	    /* characters except when nonzero values of uppercase_flags       */
465	    /* require otherwise.  The return value may be any of the         */
466	    /* dude_status values defined above; if not dude_success, then    */
467	    /* output_size and output may contain garbage.  On success, the   */
468	    /* encoder will never need to write an output_size greater than   */
469	    /* input_length*k+1 if all the input code points are less than 1  */
470	    /* << (4*k), because of how the encoding is defined.              */

472	enum dude_status dude_decode(
473	  enum case_sensitivity case_sensitivity,
474	  char scratch_space[],
475	  const char input[],
476	  unsigned int *output_length,
477	  u_code_point output[],
478	  unsigned char uppercase_flags[] );
479	    /* dude_decode() converts DUDE (without any signature) to         */
480	    /* Unicode.  The input must be represented as null-terminated     */
481	    /* ASCII, and the output will be represented as an array of       */
482	    /* Unicode code points.  The case_sensitivity argument influences */
483	    /* the check on the well-formedness of the input string; it       */
484	    /* must be case_sensitive if case-sensitive comparisons are       */
485	    /* allowed on encoded strings, case_insensitive otherwise.        */
486	    /* The scratch_space must point to space at least as large        */
487	    /* as the input, which will get overwritten (this allows the      */
488	    /* decoder to avoid calling malloc()).  The output_length is      */
489	    /* an in/out argument: the caller must pass in the maximum        */
490	    /* number of code points that may be output, and on successful    */
491	    /* return it will contain the actual number of code points        */
492	    /* output.  The uppercase_flags array must have room for at       */
493	    /* least output_length values, or it may be a null pointer if     */
494	    /* the case information is not needed.  A nonzero flag indicates  */
495	    /* that the corresponding Unicode character should be forced to   */
496	    /* uppercase by the caller, while zero means it is caseless or    */
497	    /* should be forced to lowercase.  The return value may be any    */
498	    /* of the dude_status values defined above; if not dude_success,  */
499	    /* then output_length, output, and uppercase_flags may contain    */
500	    /* garbage.  On success, the decoder will never need to write     */
501	    /* an output_length greater than the length of the input (not     */
502	    /* counting the null terminator), because of how the encoding is  */
503	    /* defined.                                                       */

505	/**********************************************************/
506	/* Implementation (would normally go in its own .c file): */

508	#include <string.h>

510	/* Character utilities: */

512	/* base32[q] is the lowercase base-32 character representing  */
513	/* the number q from the range 0 to 31.  Note that we cannot  */
514	/* use string literals for ASCII characters because an ANSI C */
515	/* compiler does not necessarily use ASCII.                   */

517	static const char base32[] = {
518	  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
519	  109, 110,                                               /* m-n */
520	  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
521	  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
522	};

524	/* base32_decode(c) returns the value of a base-32 character, in the */
525	/* range 0 to 31, or the constant base32_invalid if c is not a valid */
526	/* base-32 character.                                                */
527	enum { base32_invalid = 32 };

529	static unsigned int base32_decode(char c)
530	{
531	  if (c < 50) return base32_invalid;
532	  if (c <= 57) return c - 26;
533	  if (c < 97) c += 32;
534	  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
535	  return c - 97 - (c > 108) - (c > 111);
536	}

538	/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
539	/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
540	/* then ASCII A-Z are considered equal to a-z respectively.           */

542	static int unequal( enum case_sensitivity case_sensitivity,
543	                    const char s1[], const char s2[]        )
544	{
545	  char c1, c2;

547	  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

549	  for (;;) {
550	    c1 = *s1;
551	    c2 = *s2;
552	    if (c1 >= 65 && c1 <= 90) c1 += 32;
553	    if (c2 >= 65 && c2 <= 90) c2 += 32;
554	    if (c1 != c2) return 1;
555	    if (c1 == 0) return 0;
556	    ++s1, ++s2;
557	  }
558	}

560	/* Encoder: */

562	enum dude_status dude_encode(
563	  unsigned int input_length,
564	  const u_code_point input[],
565	  const unsigned char uppercase_flags[],
566	  unsigned int *output_size,
567	  char output[] )
568	{
569	  unsigned int max_out, in, out, k, j;
570	  u_code_point prev, codept, diff, tmp;
571	  char shift;

573	  prev = 0x60;
574	  max_out = *output_size;

576	  for (in = out = 0;  in < input_length;  ++in) {

578	    /* At the start of each iteration, in and out are the number of */
579	    /* items already input/output, or equivalently, the indices of  */
580	    /* the next items to be input/output.                           */
581	    codept = input[in];

583	    if (codept == 0x2D) {
584	      /* Hyphen-minus stands for itself. */
585	      if (max_out - out < 1) return dude_big_output;
586	      output[out++] = 0x2D;
587	      continue;
588	    }

590	    diff = prev ^ codept;

592	    /* Compute the number of base-32 characters (k): */
593	    for (tmp = diff >> 4, k = 1;  tmp != 0;  ++k, tmp >>= 4);

595	    if (max_out - out < k) return dude_big_output;
596	    shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
597	    /* shift controls the case of the last base-32 digit. */

599	    /* Each quintet has the form 1xxxx except the last is 0xxxx. */
600	    /* Computing the base-32 digits in reverse order is easiest. */

602	    out += k;
603	    output[out - 1] = base32[diff & 0xF] - shift;

605	    for (j = 2;  j <= k;  ++j) {
606	      diff >>= 4;
607	      output[out - j] = base32[0x10 | (diff & 0xF)];
608	    }

610	    prev = codept;
611	  }

613	  /* Append the null terminator: */
614	  if (max_out - out < 1) return dude_big_output;
615	  output[out++] = 0;

617	  *output_size = out;
618	  return dude_success;
619	}

621	/* Decoder: */

623	enum dude_status dude_decode(
624	  enum case_sensitivity case_sensitivity,
625	  char scratch_space[],
626	  const char input[],
627	  unsigned int *output_length,
628	  u_code_point output[],
629	  unsigned char uppercase_flags[] )
630	{
631	  u_code_point prev, q, diff;
632	  char c;
633	  unsigned int max_out, in, out, scratch_size;
634	  enum dude_status status;

636	  prev = 0x60;
637	  max_out = *output_length;
638	  for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {

640	    /* At the start of each iteration, in and out are the number of */
641	    /* items already input/output, or equivalently, the indices of  */
642	    /* the next items to be input/output.                           */

644	    if (max_out - out < 1) return dude_big_output;

646	    if (c == 0x2D) output[out] = c;  /* hyphen-minus is literal */
647	    else {
648	      /* Base-32 sequence.  Decode quintets until 0xxxx is found: */

650	      for (diff = 0;  ;  c = input[++in]) {
651	        q = base32_decode(c);
652	        if (q == base32_invalid) return dude_bad_input;
653	        diff = (diff << 4) | (q & 0xF);
654	        if (q >> 4 == 0) break;
655	      }

657	      prev = output[out] = prev ^ diff;
658	    }

660	    /* Case of last character determines uppercase flag: */
661	    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
662	  }

664	  /* Enforce the uniqueness of the encoding by re-encoding */
665	  /* the output and comparing the result to the input:     */

667	  scratch_size = ++in;
668	  status = dude_encode(out, output, uppercase_flags,
669	                       &scratch_size, scratch_space);
670	  if (status != dude_success || scratch_size != in ||
671	      unequal(case_sensitivity, scratch_space, input)
672	     ) return dude_bad_input;

674	  *output_length = out;
675	  return dude_success;
676	}

678	/******************************************************************/
679	/* Wrapper for testing (would normally go in a separate .c file): */

681	#include <assert.h>
682	#include <stdio.h>
683	#include <stdlib.h>
684	#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */
enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes a DUDE string.\n"
    "%s -d reads a DUDE string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "A DUDE string is a newline-terminated sequence of LDH characters\n"
    "(without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg, stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH      */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum dude_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);
  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = dude_encode(input_length, input, uppercase_flags,
                         &output_size, output);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the DUDE input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    /* Guard against an empty string (e.g. a leading NUL byte) */
    /* before indexing input[input_length - 1]:                */
    if (input_length == 0) fail(invalid_input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      /* '.' is the filler in ldh_ascii, not an LDH character, */
      /* so reject it rather than let strchr() match it:       */
      if (*p == '.') fail(invalid_input);
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = dude_decode(test_case_sensitivity, scratch, input,
                         &output_length, output, uppercase_flags);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
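
The quintet-accumulation step of the decoder can be examined in isolation.  The following is a minimal sketch, not part of the reference code above: the function name sketch_decode_quintets and its error convention (returning (size_t)-1) are hypothetical, and it assumes the input has already been translated from base-32 characters to quintet values 0..31, so it does not handle the literal hyphen-minus, the uppercase flags, or the re-encoding check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.  Decodes n quintets    */
/* (values 0..31) into code points, mirroring the inner loop of the  */
/* decoder: quintets of the form 1xxxx continue a sequence, 0xxxx    */
/* ends it, and each completed difference is XORed into prev.        */
/* Returns the number of code points written, or (size_t)-1 if the   */
/* input is malformed or the output array is too small.              */
size_t sketch_decode_quintets(const unsigned char *quintets, size_t n,
                              unsigned long *out, size_t max_out)
{
  unsigned long prev = 0x60, diff;
  size_t in = 0, count = 0;

  assert(quintets != NULL && out != NULL);

  while (in < n) {
    if (count == max_out) return (size_t)-1;  /* output too small */

    diff = 0;
    for (;;) {
      unsigned char q;
      if (in == n) return (size_t)-1;  /* ended on a 1xxxx quintet */
      q = quintets[in++];
      if (q > 31) return (size_t)-1;   /* not a quintet */
      diff = (diff << 4) | (q & 0xF);  /* append the low four bits */
      if ((q >> 4) == 0) break;        /* 0xxxx ends the sequence */
    }

    prev ^= diff;
    out[count++] = prev;
  }

  return count;
}
```

For example, under these assumptions the six quintets 0x01, 0x03, 0x13, 0x10, 0x12, 0x00 decode to three code points: 0x60 ^ 0x1 = U+0061, then 0x61 ^ 0x3 = U+0062, then 0x62 ^ 0x3020 = U+3042.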

839	                   INTERNET-DRAFT expires 2001-Dec-07