idnits 2.17.1

draft-costello-rfc3492bis-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 1694 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 407 has weird spacing: '... points dig...'

  == Line 1204 has weird spacing: '... return cp - ...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1545

  -- Looks like a reference, but probably isn't: '1' on line 1576

  -- Looks like a reference, but probably isn't: '2' on line 1521

  -- Looks like a reference, but probably isn't: '3' on line 1526


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          Adam M. Costello
2	draft-costello-rfc3492bis-02.txt                             2004-Apr-14
3	Expires 2004-Oct-14

5	                 Punycode: A Bootstring encoding of Unicode
6	          for Internationalized Domain Names in Applications (IDNA)

8	Status of this Memo

10	    This document is an Internet-Draft and is in full conformance with
11	    all provisions of Section 10 of RFC2026.

13	    Internet-Drafts are working documents of the Internet Engineering
14	    Task Force (IETF), its areas, and its working groups.  Note
15	    that other groups may also distribute working documents as
16	    Internet-Drafts.

18	    Internet-Drafts are draft documents valid for a maximum of six
19	    months and may be updated, replaced, or obsoleted by other documents
20	    at any time.  It is inappropriate to use Internet-Drafts as
21	    reference material or to cite them other than as "work in progress."

23	    The list of current Internet-Drafts can be accessed at
24	    http://www.ietf.org/ietf/1id-abstracts.txt

26	    The list of Internet-Draft Shadow Directories can be accessed at
27	    http://www.ietf.org/shadow.html

29	    Distribution of this document is unlimited.

31	Abstract

33	    Punycode is a simple and efficient transfer encoding syntax designed
34	    for use with Internationalized Domain Names in Applications (IDNA).
35	    It uniquely and reversibly transforms a Unicode string into an ASCII
36	    string.  ASCII characters in the Unicode string are represented
37	    literally, and non-ASCII characters are represented by ASCII
38	    characters that are allowed in host name labels (letters, digits,
39	    and hyphens).  This document defines a general algorithm called
40	    Bootstring that allows a string of basic code points to uniquely
41	    represent any string of code points drawn from a larger set.
42	    Punycode is an instance of Bootstring that uses particular parameter
43	    values specified by this document, appropriate for IDNA.

45	Contents

47	     1. Introduction
48	         1.1 Features
49	         1.2 Interaction of protocol parts
50	     2. Terminology
51	     3. Bootstring description
52	         3.1 Basic code point segregation
53	         3.2 Insertion unsort coding
54	         3.3 Generalized variable-length integers
55	         3.4 Bias adaptation
56	     4. Bootstring parameters
57	     5. Parameter values for Punycode
58	     6. Bootstring algorithms
59	         6.1 Bias adaptation function
60	         6.2 Decoding procedure
61	         6.3 Encoding procedure
62	         6.4 Overflow handling
63	     7. Punycode examples
64	         7.1 Sample strings
65	         7.2 Decoding traces
66	         7.3 Encoding traces
67	     8. Security considerations
68	     9. References
69	         9.1 Normative references
70	         9.2 Informative references
71	     A. Mixed-case annotation
72	     B. Disclaimer and license
73	     C. Punycode sample implementation
74	     D. Changes from RFC 3492
75	     Author's address

77	1. Introduction

79	    [IDNA] describes an architecture for supporting internationalized
80	    domain names.  Labels containing non-ASCII characters can be
81	    represented by ACE labels, which begin with a special ACE prefix and
82	    contain only ASCII characters.  The remainder of the label after the
83	    prefix is a Punycode encoding of a Unicode string satisfying certain
84	    constraints.  For the details of the prefix and constraints, see
85	    [IDNA] and [NAMEPREP].

87	    Punycode is an instance of a more general algorithm called
88	    Bootstring, which allows strings composed from a small set of
89	    "basic" code points to uniquely represent any string of code points
90	    drawn from a larger set.  Punycode is Bootstring with particular
91	    parameter values appropriate for IDNA.

93	1.1 Features

95	    Bootstring has been designed to have the following features:

97	      * Completeness:  Every extended string (sequence of arbitrary code
98	        points) can be represented by a basic string (sequence of basic
99	        code points).  Restrictions on what strings are allowed, and on
100	        length, can be imposed by higher layers.

102	      * Uniqueness:  There is at most one basic string that represents a
103	        given extended string.

105	      * Reversibility:  Any extended string mapped to a basic string can
106	        be recovered from that basic string.

108	      * Efficient encoding:  The ratio of basic string length to
109	        extended string length is small.  This is important in the
110	        context of domain names because RFC 1034 [RFC1034] restricts the
111	        length of a domain label to 63 characters.

113	      * Simplicity:  The encoding and decoding algorithms are reasonably
114	        simple to implement.  The goals of efficiency and simplicity are
115	        at odds; Bootstring aims at a good balance between them.

117	      * Readability:  Basic code points appearing in the extended string
118	        are represented as themselves in the basic string (although the
119	        main purpose is to improve efficiency, not readability).

121	    Punycode can also support an additional feature that is not used
122	    by the ToASCII and ToUnicode operations of [IDNA].  When extended
123	    strings are case-folded prior to encoding, the basic string can
124	    use mixed case to tell how to convert the folded string into a
125	    mixed-case string.  See appendix A "Mixed-case annotation".

127	1.2 Interaction of protocol parts

129	    Punycode is used by the IDNA protocol [IDNA] for converting domain
130	    labels into ASCII; it is not designed for any other purpose.  It is
131	    explicitly not designed for processing arbitrary free text.

133	2. Terminology

135	    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
136	    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
137	    document are to be interpreted as described in BCP 14, RFC 2119
138	    [RFC2119].

140	    A code point is an integral value associated with a character in a
141	    coded character set.

143	    As in the Unicode Standard [UNICODE], Unicode code points are
144	    denoted by "U+" followed by four to six hexadecimal digits, while a
145	    range of code points is denoted by two hexadecimal numbers separated
146	    by "..", with no prefixes.

148	    The operators div and mod perform integer division; (x div y) is the
149	    quotient of x divided by y, discarding the remainder, and (x mod y)
150	    is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring
151	    uses these operators only with nonnegative operands, so the quotient
152	    and remainder are always nonnegative.

154	    The break statement jumps out of the innermost loop (as in C).

156	    An overflow is an attempt to compute a value that exceeds the
157	    maximum value of an integer variable.  It is assumed that all
158	    integer values from zero through the maximum value can be
159	    represented.

161	3. Bootstring description

163	    Bootstring represents an arbitrary sequence of code points (the
164	    "extended string") as a sequence of basic code points (the "basic
165	    string").  This section describes the representation.  Section 6
166	    "Bootstring algorithms" presents the algorithms as pseudocode.
167	    Sections 7.2 "Decoding traces" and 7.3 "Encoding traces" trace the
168	    algorithms for sample inputs.

170	    The following sections describe the four techniques used in
171	    Bootstring.  "Basic code point segregation" is a very simple
172	    and efficient encoding for basic code points occurring in the
173	    extended string: they are simply copied all at once.  "Insertion
174	    unsort coding" encodes the non-basic code points as deltas, and
175	    processes the code points in numerical order rather than in order of
176	    appearance, which typically results in smaller deltas.  The deltas
177	    are represented as "generalized variable-length integers", which use
178	    basic code points to represent nonnegative integers.  The parameters
179	    of this integer representation are dynamically adjusted using "bias
180	    adaptation", to improve efficiency when consecutive deltas have
181	    similar magnitudes.

183	3.1 Basic code point segregation

185	    All basic code points appearing in the extended string are
186	    represented literally at the beginning of the basic string, in their
187	    original order, followed by a delimiter if (and only if) the number
188	    of basic code points is nonzero.  At least one basic code point is
189	    designated to serve as delimiter.  Delimiters never appear in the
190	    remainder of the basic string; therefore the decoder can find the
191	    end of the literal portion (if there is one) by scanning for the
192	    last delimiter.
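
    As a minimal sketch of the segregation rule, assuming the Punycode
    choices of ASCII as the basic code points and U+002D as the only
    delimiter (the function names are illustrative, not part of the
    specification):

```python
DELIMITER = "-"  # assumption: the Punycode delimiter, U+002D


def is_basic(ch):
    # Assumption: the basic code points are the ASCII range 0..7F.
    return ord(ch) < 0x80


def literal_portion(extended):
    """Basic code points of the extended string, in original order,
    followed by a delimiter only if there is at least one of them."""
    basics = "".join(ch for ch in extended if is_basic(ch))
    return basics + DELIMITER if basics else ""


def split_basic_string(basic):
    """Split a basic string at its last delimiter into (literal, rest)."""
    last = basic.rfind(DELIMITER)
    if last <= 0:
        # No delimiter, or nothing in front of it: no literal portion.
        return "", basic
    return basic[:last], basic[last + 1:]
```

    Scanning for the last delimiter (rather than the first) is what
    allows the literal portion itself to contain hyphens.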

194	3.2 Insertion unsort coding

196	    The remainder of the basic string (after the last delimiter if there
197	    is one) represents a sequence of nonnegative integral deltas as
198	    generalized variable-length integers, described in section 3.3.  The
199	    meaning of the deltas is best understood in terms of the decoder.

201	    The decoder builds the extended string incrementally.  Initially,
202	    the extended string is a copy of the literal portion of the basic
203	    string (excluding the last delimiter).  The decoder inserts
204	    non-basic code points, one for each delta, into the extended string,
205	    ultimately arriving at the final decoded string.

207	    At the heart of this process is a state machine with two state
208	    variables: an index i and a counter n.  The index i refers to
209	    a position in the extended string; it ranges from 0 (the first
210	    position) to the current length of the extended string (which refers
211	    to a potential position beyond the current end).  If the current
212	    state is <n,i>, the next state is <n,i+1> if i is less than the
213	    length of the extended string, or <n+1,0> if i equals the length of
214	    the extended string.  In other words, each state change causes i to
215	    increment, wrapping around to zero if necessary, and n counts the
216	    number of wrap-arounds.

218	    Notice that the state always advances monotonically (there is no
219	    way for the decoder to return to an earlier state).  At each state,
220	    an insertion is either performed or not performed.  At most one
221	    insertion is performed in a given state.  An insertion inserts the
222	    value of n at position i in the extended string.  The deltas are
223	    a run-length encoding of this sequence of events: they are the
224	    lengths of the runs of non-insertion states preceding the insertion
225	    states.  Hence, for each delta, the decoder performs delta state
226	    changes, then an insertion, and then one more state change.  (An
227	    implementation need not perform each state change individually, but
228	    can instead use division and remainder calculations to compute the
229	    next insertion state directly.)  It is an error if the inserted code
230	    point is a basic code point (because basic code points were supposed
231	    to be segregated as described in section 3.1).
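
    The state machine can be sketched in two equivalent ways: one state
    change at a time, or directly with division and remainder.  The
    sketch below is illustrative only; it works on lists of integers,
    takes the deltas as already-decoded numbers, and omits the check
    for inserted basic code points.

```python
def advance(n, i, length):
    """One state change: <n,i> -> <n,i+1>, or <n+1,0> on wrap-around."""
    return (n, i + 1) if i < length else (n + 1, 0)


def decode_stepwise(literal, deltas, initial_n):
    """Perform delta state changes, an insertion, then one more change."""
    ext = list(literal)
    n, i = initial_n, 0
    for delta in deltas:
        for _ in range(delta):
            n, i = advance(n, i, len(ext))
        ext.insert(i, n)
        n, i = advance(n, i, len(ext))
    return ext


def decode_direct(literal, deltas, initial_n):
    """Same result, computing each insertion state with div and mod."""
    ext = list(literal)
    n, i = initial_n, 0
    for delta in deltas:
        i += delta
        n += i // (len(ext) + 1)   # number of wrap-arounds
        i %= len(ext) + 1          # position within the current pass
        ext.insert(i, n)
        i += 1
    return ext
```

    For example, starting from the literal portion "bcher" with n = 128,
    the single delta (252 - 128) * 6 + 1 = 745 inserts U+00FC at
    position 1.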

233	    The encoder's main task is to derive the sequence of deltas that
234	    will cause the decoder to construct the desired string.  It can do
235	    this by repeatedly scanning the extended string for the next code
236	    point that the decoder would need to insert, and counting the number
237	    of state changes the decoder would need to perform, mindful of the
238	    fact that the decoder's extended string will include only those
239	    code points that have already been inserted.  Section 6.3 "Encoding
240	    procedure" gives a precise algorithm.

242	3.3 Generalized variable-length integers

244	    In a conventional integer representation the base is the number of
245	    distinct symbols for digits, whose values are 0 through base-1.  Let
246	    digit_0 denote the least significant digit, digit_1 the next least
247	    significant, and so on.  The value represented is the sum over j of
248	    digit_j * w(j), where w(j) = base^j is the weight (scale factor)
249	    for position j.  For example, in the base 8 integer 437, the digits
250	    are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
251	    7 + 3*8 + 4*64 = 287.  This representation has two disadvantages:
252	    First, there are multiple encodings of each value (because there
253	    can be extra zeros in the most significant positions), which is
254	    inconvenient when unique encodings are needed.  Second, the integer
255	    is not self-delimiting, so if multiple integers are concatenated the
256	    boundaries between them are lost.

258	    The generalized variable-length representation solves these two
259	    problems.  The digit values are still 0 through base-1, but now
260	    the integer is self-delimiting by means of thresholds t(j), each
261	    of which is in the range 0 through base-1.  Exactly one digit, the
262	    most significant, satisfies digit_j < t(j).  Therefore, if several
263	    integers are concatenated, it is easy to separate them, starting
264	    with the first if they are little-endian (least significant digit
265	    first), or starting with the last if they are big-endian (most
266	    significant digit first).  As before, the value is the sum over j of
267	    digit_j * w(j), but the weights are different:

269	        w(0) = 1
270	        w(j) = w(j-1) * (base - t(j-1)) for j > 0

272	    For example, consider the little-endian sequence of base 8 digits
273	    734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This
274	    implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
275	    = 90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is
276	    not less than 3, but 4 is less than 5, so 4 is the last digit.  The
277	    value of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is 251,
278	    with value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation
279	    is very similar to decoding a conventional integer:  Start with a
280	    current value of N = 0 and a weight w = 1.  Fetch the next digit d
281	    and increase N by d * w.  If d is less than the current threshold
282	    (t) then stop, otherwise increase w by a factor of (base - t),
283	    update t for the next position, and repeat.
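
    The decoding steps just described can be sketched as follows.  This
    is an illustration with a fixed threshold list (the last entry is
    assumed to repeat for all later positions), not the Bootstring
    decoder itself, which derives its thresholds from the bias.

```python
def decode_integers(digits, thresholds, base):
    """Split a little-endian digit sequence into its integer values."""
    values = []
    pos = 0
    while pos < len(digits):
        value, weight, j = 0, 1, 0
        while True:
            d = digits[pos]  # IndexError if the last integer is truncated
            pos += 1
            t = thresholds[min(j, len(thresholds) - 1)]
            value += d * weight
            if d < t:        # the most significant digit ends the integer
                break
            weight *= base - t
            j += 1
        values.append(value)
    return values
```

    Applied to the digits 7 3 4 2 5 1 with base 8 and thresholds
    2, 3, 5, 5, ..., this yields the two values from the example above,
    145 and 62.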

285	    Encoding this representation is similar to encoding a conventional
286	    integer:  If N < t then output one digit for N and stop, otherwise
287	    output the digit for t + ((N - t) mod (base - t)), then replace N
288	    with (N - t) div (base - t), update t for the next position, and
289	    repeat.
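
    A corresponding sketch of the encoding direction, again with a
    fixed, illustrative threshold list whose last entry is assumed to
    repeat:

```python
def encode_integer(value, thresholds, base):
    """Return the little-endian digits of a nonnegative integer."""
    digits = []
    j = 0
    while True:
        # Assumption: the repeating last threshold is at least 1,
        # otherwise the loop would never reach "value < t".
        t = thresholds[min(j, len(thresholds) - 1)]
        if value < t:
            digits.append(value)
            return digits
        digits.append(t + (value - t) % (base - t))
        value = (value - t) // (base - t)
        j += 1
```

    With base 8 and thresholds 2, 3, 5, 5, ..., the value 145 encodes as
    the digits 7 3 4 and the value 62 as 2 5 1, matching the example in
    the decoding discussion.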

291	    For any particular set of values of t(j), there is exactly one
292	    generalized variable-length representation of each nonnegative
293	    integral value.

295	    Bootstring uses little-endian ordering so that the deltas can be
296	    separated starting with the first.  The t(j) values are defined in
297	    terms of the constants base, tmin, and tmax, and a state variable
298	    called bias:

300	        t(j) = base * (j + 1) - bias,
301	        clamped to the range tmin through tmax

303	    The clamping means that if the formula yields a value less than tmin
304	    or greater than tmax, then t(j) = tmin or tmax, respectively.  (In
305	    the pseudocode in section 6 "Bootstring algorithms", the expression
306	    base * (j + 1) is denoted by k for performance reasons.)  These
307	    t(j) values cause the representation to favor integers within a
308	    particular range determined by the bias.
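
    The threshold formula and its clamping can be written as a one-line
    function.  The parameter values in the usage note are the Punycode
    values from section 5; the function name is illustrative.

```python
def threshold(j, bias, base, tmin, tmax):
    """t(j) = base * (j + 1) - bias, clamped to the range tmin..tmax."""
    return max(tmin, min(tmax, base * (j + 1) - bias))
```

    With base = 36, tmin = 1, tmax = 26 and bias = 72, the thresholds
    for positions 0, 1, 2 are 1, 1, 26; with bias = 50 they are
    1, 22, 26.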

310	3.4 Bias adaptation

312	    After each delta is encoded or decoded, bias is set for the next
313	    delta as follows:

315	     1. Delta is scaled in order to avoid overflow in the next step:

317	            let delta = delta div 2

319	        But when this is the very first delta, the divisor is not 2, but
320	        instead a constant called damp.  This compensates for the fact
321	        that the second delta is usually much smaller than the first.

323	     2. Delta is increased to compensate for the fact that the next
324	        delta will be inserting into a longer string:

326	            let delta = delta + (delta div numpoints)

328	        numpoints is the total number of code points encoded/decoded so
329	        far (including the one corresponding to this delta itself, and
330	        including the basic code points).

332	     3. Delta is repeatedly divided until it falls within a threshold,
333	        to predict the minimum number of digits needed to represent the
334	        next delta:

336	            while delta > ((base - tmin) * tmax) div 2
337	            do let delta = delta div (base - tmin)

339	     4. The bias is set:

341	            let bias =
342	              (base * the number of divisions performed in step 3) +
343	              (((base - tmin + 1) * delta) div (delta + skew))

345	    The motivation for this procedure is that the current delta provides
346	    a hint about the likely size of the next delta, and so t(j) is
347	    set to tmax for the more significant digits starting with the one
348	    expected to be last, tmin for the less significant digits up through
349	    the one expected to be third-last, and somewhere between tmin and
350	    tmax for the digit expected to be second-last (balancing the hope of
351	    the expected-last digit being unnecessary against the danger of it
352	    being insufficient).

354	4. Bootstring parameters

356	    Given a set of basic code points, at least one needs to be
357	    designated to serve as delimiter.  The base cannot be greater than
358	    the number of distinguishable basic code points remaining.  The
359	    digit-values in the range 0 through base-1 need to be associated
360	    with distinct non-delimiter basic code points.  In some cases
361	    multiple code points need to have the same digit-value; for example,
362	    uppercase and lowercase versions of the same letter need to be
363	    equivalent if basic strings are case-insensitive.

365	    The initial value of n cannot be greater than the minimum non-basic
366	    code point that could appear in extended strings.

368	    The remaining five parameters (tmin, tmax, skew, damp, and the
369	    initial value of bias) need to satisfy the following constraints:

371	        0 <= tmin <= base-2
372	        1 <= tmax <= base-1
373	        tmin <= tmax
374	        skew >= 1
375	        damp >= 2
376	        initial_bias mod base <= base - tmin

378	    Provided the constraints are satisfied, these five parameters affect
379	    efficiency but not correctness.  They are best chosen empirically.

381	    If support for mixed-case annotation is desired (see appendix A),
382	    make sure that the code points corresponding to 0 through tmax-1 all
383	    have both uppercase and lowercase forms.

385	5. Parameter values for Punycode

387	    Punycode uses the following Bootstring parameter values:

389	        base         = 36
390	        tmin         = 1
391	        tmax         = 26
392	        skew         = 38
393	        damp         = 700
394	        initial_bias = 72
395	        initial_n    = 128 = 0x80
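
    As a quick sanity check, the constraints of section 4 can be
    evaluated mechanically for these values.  The function below is an
    illustrative sketch, not part of the specification.

```python
def check_parameters(base, tmin, tmax, skew, damp, initial_bias):
    """True if the five tunable parameters satisfy the section 4 constraints."""
    return (
        0 <= tmin <= base - 2
        and 1 <= tmax <= base - 1
        and tmin <= tmax
        and skew >= 1
        and damp >= 2
        and initial_bias % base <= base - tmin
    )


# The Punycode values listed above.
PUNYCODE_OK = check_parameters(
    base=36, tmin=1, tmax=26, skew=38, damp=700, initial_bias=72
)
```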

397	    Although the only restriction Punycode imposes on the input integers
398	    is that they be nonnegative, these parameters are especially
399	    designed to work well with Unicode [UNICODE] code points, which are
400	    integers in the range 0..10FFFF.  (Note that code points D800..DFFF
401	    do not occur in any valid Unicode string.  UTF-16 uses code units
402	    D800 through DFFF to refer to code points 10000..10FFFF.)  The
403	    basic code points are the ASCII [ASCII] code points (0..7F), of
404	    which U+002D (-) is the only delimiter, and some of the others have
405	    digit-values as follows:

407	        code points    digit-values
408	        ------------   ----------------------
409	        41..5A (A-Z) =  0 to 25, respectively
410	        61..7A (a-z) =  0 to 25, respectively
411	        30..39 (0-9) = 26 to 35, respectively
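
    The table translates directly into a pair of helper functions.  The
    names and the use of None for "no digit-value" are choices made for
    this sketch.

```python
def digit_value(cp):
    """Digit-value of an ASCII code point, or None if it has none."""
    if 0x41 <= cp <= 0x5A:      # A-Z
        return cp - 0x41
    if 0x61 <= cp <= 0x7A:      # a-z
        return cp - 0x61
    if 0x30 <= cp <= 0x39:      # 0-9
        return cp - 0x30 + 26
    return None


def digit_code_point(d, uppercase=False):
    """Basic code point for digit-value d (0..35)."""
    if d < 26:
        return d + (0x41 if uppercase else 0x61)
    return d - 26 + 0x30
```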

413	    Using hyphen-minus as the delimiter implies that the encoded string
414	    can end with a hyphen-minus only if the Unicode string consists
415	    entirely of basic code points, but IDNA forbids such strings from
416	    being encoded.  The encoded string can begin with a hyphen-minus,
417	    but IDNA prepends a prefix.  Therefore IDNA using Punycode conforms
418	    to the RFC 952 rule that host name labels neither begin nor end with
419	    a hyphen-minus [RFC952].

421	    In the non-literal portion of an encoded string, an encoder SHOULD
422	    output only uppercase forms or only lowercase forms, unless it
423	    uses mixed-case annotation (see appendix A).  But a decoder MUST
424	    recognize the letters in any mixture of uppercase and lowercase
425	    forms.

427	    Presumably most users will not manually write or type encoded
428	    strings (as opposed to cutting and pasting them), but those who do
429	    will need to be alert to the potential visual ambiguity between the
430	    following sets of characters:

432	        G 6
433	        I l 1
434	        O 0
435	        S 5
436	        U V
437	        Z 2

439	    Such ambiguities are usually resolved by context, but in a Punycode
440	    encoded string there is no context apparent to humans.

442	6. Bootstring algorithms

444	    Some parts of the pseudocode can be omitted if the parameters
445	    satisfy certain conditions (for which Punycode qualifies).  These
446	    parts are enclosed in {braces}, and notes immediately following the
447	    pseudocode explain the conditions under which they can be omitted.

449	    Formally, code points are integers, and hence the pseudocode assumes
450	    that arithmetic operations can be performed directly on code points.
451	    In some programming languages, explicit conversion between code
452	    points and integers might be necessary.

454	6.1 Bias adaptation function

456	    function adapt(delta,numpoints,firsttime):
457	      if firsttime then let delta = delta div damp
458	      else let delta = delta div 2
459	      let delta = delta + (delta div numpoints)
460	      let k = 0
461	      while delta > ((base - tmin) * tmax) div 2 do begin
462	        let delta = delta div (base - tmin)
463	        let k = k + base
464	      end
465	      return k + (((base - tmin + 1) * delta) div (delta + skew))

467	    It does not matter whether the modifications to delta and k
468	    inside adapt() affect variables of the same name inside the
469	    encoding/decoding procedures, because after calling adapt() the
470	    caller does not read those variables before overwriting them.
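
    A direct transliteration of adapt() into Python, using the Punycode
    parameter values from section 5 as module constants (an assumption
    of this sketch; the function is otherwise generic):

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700  # Punycode values


def adapt(delta, numpoints, firsttime):
    """Compute the bias for the next delta."""
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)
```

    For instance, a first delta of 745 inserted into a string that then
    has 6 code points gives a bias of 0, because 745 div 700 = 1 is too
    small to pass the loop test or to outweigh skew.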

472	    If overflow is possible within adapt() then it MUST be detected,
473	    causing the caller to fail; see section 6.4 "Overflow handling" for
474	    detection techniques.  However, if the range of the integers is
475	    great enough then overflow will be impossible and no detection will
476	    be necessary.  Overflow will be impossible within adapt() if the
477	    maximum integer value (maxint) satisfies all three of the following
478	    conditions:

480	        maxint >= (((base - tmin) * tmax) div 2) + skew
481	        maxint >= (base - tmin + 1) * (((base - tmin) * tmax) div 2)
482	        maxint >= base * log(maxint * 2.0 / tmax) / log(base - tmin)
483	                  + base - tmin

485	    For practical values of maxint and the Bootstring parameters these
486	    conditions will indeed be satisfied, and hence no overflow detection
487	    will be needed in adapt().  For Punycode, the conditions are
488	    satisfied whenever maxint >= 16380, which is true for virtually all
489	    programming languages and all platforms.
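
    The three conditions can be evaluated numerically.  The sketch
    below is illustrative; it uses floating-point logarithms for the
    third condition, as the formula above does.

```python
import math


def adapt_cannot_overflow(maxint, base, tmin, tmax, skew):
    """True if maxint satisfies all three no-overflow conditions."""
    half = ((base - tmin) * tmax) // 2
    return (
        maxint >= half + skew
        and maxint >= (base - tmin + 1) * half
        and maxint >= base * math.log(maxint * 2.0 / tmax)
        / math.log(base - tmin) + base - tmin
    )
```

    For Punycode the second condition dominates:
    (36 - 1 + 1) * ((35 * 26) div 2) = 36 * 455 = 16380.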

491	6.2 Decoding procedure

493	    let n = initial_n
494	    let i = 0
495	    let bias = initial_bias
496	    let output = an empty string indexed from 0
497	    consume all code points before the last delimiter (if there is one)
498	      and copy them to output, fail on any non-basic code point
499	    if more than zero code points were consumed then consume one more
500	      (which will be the last delimiter)
501	    while the input is not exhausted do begin
502	      let oldi = i
503	      let w = 1
504	      for k = base to infinity in steps of base do begin
505	        consume a code point, or fail if there was none to consume
506	        let digit = the code point's digit-value, fail if it has none
507	        let i = i + digit * w, fail on overflow
508	        let t = tmin if k <= bias {+ tmin}, or
509	                tmax if k >= bias + tmax, or k - bias otherwise
510	        if digit < t then break
511	        let w = w * (base - t), fail on overflow
512	      end
513	      let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
514	      let n = n + i div (length(output) + 1), fail on overflow
515	      let i = i mod (length(output) + 1)
516	      {if n is a basic code point then fail}
517	      insert n into output at position i
518	      increment i
519	    end

521	    The full statement enclosed in braces (checking whether n is a basic
522	    code point) can be omitted if initial_n exceeds all basic code
523	    points (which is true for Punycode), because n is never less than
524	    initial_n.

526	    In the assignment of t, where t is clamped to the range tmin through
527	    tmax, "+ tmin" can always be omitted.  This makes the clamping
528	    calculation incorrect when bias < k < bias + tmin, but that cannot
529	    happen because of the way bias is computed and because of the
530	    constraints on the parameters.

532	    Because the decoder state can only advance monotonically, and there
533	    is only one representation of any delta, there is therefore only
534	    one encoded string that can represent a given sequence of integers.
535	    The only error conditions are invalid code points, unexpected
536	    end-of-input, overflow, and basic code points encoded using deltas
537	    instead of appearing literally.  Because the decoder fails on these
538	    errors as shown above, it cannot produce the same output for two
539	    distinct inputs.  Without this property it would have been necessary
540	    to re-encode the output and verify that it matches the input in
541	    order to guarantee the uniqueness of the encoding.

543	    Therefore decoders MUST implement error handling, including the
544	    handling of overflow.  See also section 6.4 "Overflow handling".
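
    The decoding procedure can be sketched in Python as follows.  This
    is an illustration under stated assumptions, not the sample
    implementation of appendix C: the Punycode parameters are fixed,
    the brace-enclosed parts are omitted, and because Python integers
    do not overflow, a hypothetical MAXINT is enforced by hand.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N, DELIMITER = 72, 128, "-"
MAXINT = 0x7FFFFFFF  # assumed integer limit for this sketch


def adapt(delta, numpoints, firsttime):
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def digit_value(ch):
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    return None


def punycode_decode(text):
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    b = max(text.rfind(DELIMITER), 0)   # length of the literal portion
    output = []
    for ch in text[:b]:
        if ord(ch) >= 0x80:
            raise ValueError("non-basic code point in literal portion")
        output.append(ord(ch))
    pos = b + 1 if b > 0 else 0         # skip the last delimiter, if consumed
    while pos < len(text):
        oldi, w, k = i, 1, BASE
        while True:
            if pos >= len(text):
                raise ValueError("unexpected end of input")
            digit = digit_value(text[pos])
            pos += 1
            if digit is None:
                raise ValueError("invalid digit")
            if digit > (MAXINT - i) // w:
                raise OverflowError("delta too large")
            i += digit * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if digit < t:
                break
            if w > MAXINT // (BASE - t):
                raise OverflowError("weight too large")
            w *= BASE - t
            k += BASE
        bias = adapt(i - oldi, len(output) + 1, oldi == 0)
        if i // (len(output) + 1) > MAXINT - n:
            raise OverflowError("code point too large")
        n += i // (len(output) + 1)
        i %= len(output) + 1
        output.insert(i, n)
        i += 1
    return "".join(chr(cp) for cp in output)
```

    For example, punycode_decode("bcher-kva") yields "b", U+00FC,
    "cher", while the truncated input "bcher-kv" is rejected.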

546	6.3 Encoding procedure

548	    let n = initial_n
549	    let delta = 0
550	    let bias = initial_bias
551	    let h = b = the number of basic code points in the input
552	    copy them to the output in order, followed by a delimiter if b > 0
553	    {if the input contains a non-basic code point < n then fail}
554	    while h < length(input) do begin
555	      let m = the minimum {non-basic} code point >= n in the input
556	      let delta = delta + (m - n) * (h + 1), fail on overflow
557	      let n = m
558	      for each code point c in the input (in order) do begin
559	        if c < n {or c is basic} then increment delta, fail on overflow
560	        if c == n then begin
561	          let q = delta
562	          for k = base to infinity in steps of base do begin
563	            let t = tmin if k <= bias {+ tmin}, or
564	                    tmax if k >= bias + tmax, or k - bias otherwise
565	            if q < t then break
566	            output the code point for digit t + ((q - t) mod (base - t))
567	            let q = (q - t) div (base - t)
568	          end
569	          output the code point for digit q
570	          let bias = adapt(delta, h + 1, test h equals b?)
571	          let delta = 0
572	          increment h
573	        end
574	      end
575	      increment delta and n
576	    end

578	    The full statement enclosed in braces (checking whether the input
579	    contains a non-basic code point less than n) can be omitted if all
580	    code points less than initial_n are basic code points (which is true
581	    for Punycode if code points are unsigned).

583	    The brace-enclosed conditions "non-basic" and "or c is basic" can be
584	    omitted if initial_n exceeds all basic code points (which is true
585	    for Punycode), because the code point being tested is never less
586	    than initial_n.

588	    In the assignment of t, where t is clamped to the range tmin through
589	    tmax, "+ tmin" can always be omitted.  This makes the clamping
590	    calculation incorrect when bias < k < bias + tmin, but that cannot
591	    happen because of the way bias is computed and because of the
592	    constraints on the parameters.

594	    The checks for overflow are necessary to avoid producing invalid
595	    output when the input contains very large values or is very long.

597	    Therefore encoders MUST implement overflow handling.  See also
598	    section 6.4 "Overflow handling".

600	    The increment of delta at the bottom of the outer loop cannot
601	    overflow because delta < length(input) before the increment, and
602	    length(input) is already assumed to be representable.  The increment
603	    of n could overflow, but only if h == length(input), in which case
604	    the procedure is finished anyway.

606	6.4 Overflow handling

608	    For IDNA, 26-bit unsigned integers are sufficient to handle all
609	    valid IDNA labels without overflow, because any string that
610	    needed a 27-bit delta would have to exceed either the code point
611	    limit (0..10FFFF) or the label length limit (63 characters).
612	    However, overflow handling is necessary because the inputs are not
613	    necessarily valid IDNA labels.

615	    If the programming language does not provide overflow detection,
616	    the following technique can be used.  Suppose A, B, and C are
617	    representable nonnegative integers and C is nonzero.  Then A + B
618	    overflows if and only if B > maxint - A, and A + (B * C) overflows
619	    if and only if B > (maxint - A) div C, where maxint is the greatest
620	    integer that can be represented.  Refer to appendix C "Punycode
621	    sample implementation" for demonstrations of this technique in the C
622	    language.
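
    As a self-contained illustration of the two tests above, the
    following sketch wraps them in helper functions.  The names
    checked_add() and checked_mul_add() are hypothetical, and unsigned
    long with ULONG_MAX stands in for the integer type and maxint; the
    sample implementation in appendix C applies the same comparisons
    inline instead.

```c
#include <limits.h>

/* Illustrative sketch only; both helper names are hypothetical. */

/* Stores a + b in *result and returns 1, or returns 0 on overflow. */
static int checked_add(unsigned long a, unsigned long b,
                       unsigned long *result)
{
  if (b > ULONG_MAX - a) return 0;        /* A + B would overflow */
  *result = a + b;
  return 1;
}

/* Stores a + b * c in *result and returns 1, or returns 0 on */
/* overflow.  The division test requires c to be nonzero, so  */
/* c == 0 is handled separately (the result is then just a).  */
static int checked_mul_add(unsigned long a, unsigned long b,
                           unsigned long c, unsigned long *result)
{
  if (c == 0) { *result = a; return 1; }
  if (b > (ULONG_MAX - a) / c) return 0;  /* A + (B * C) would overflow */
  *result = a + b * c;
  return 1;
}
```

    Both tests use only subtraction and division on representable
    values, so the check itself can never overflow.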

624	    The decoding and encoding algorithms shown in sections 6.2 and
625	    6.3 handle overflow by detecting it whenever it happens.  Another
626	    approach is to enforce limits on the inputs that prevent overflow
627	    from happening.  For example, if the encoder were to verify that
628	    no input code points exceed M and that the input length does not
629	    exceed L, then no delta could ever exceed (M - initial_n) * (L + 1),
630	    and hence no overflow could occur if integer variables were capable
631	    of representing values that large.  This prevention approach would
632	    impose more restrictions on the input than the detection approach
633	    does, but might be considered simpler in some programming languages.
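
    A minimal sketch of the prevention approach follows.  The function
    names delta_bound() and within_limits() are hypothetical, and the
    sketch assumes the caller picks M >= initial_n and limits small
    enough that the bound itself fits in an unsigned long.  With
    M = 10FFFF (hexadecimal) and L = 63, the limits mentioned at the
    start of this section, the bound is 71294912, which fits in a
    32-bit unsigned integer.

```c
#include <stddef.h>

/* Illustrative sketch only; both function names are hypothetical. */

enum { initial_n = 0x80 };

/* Upper bound on any delta: (M - initial_n) * (L + 1).  Assumes  */
/* max_cp >= initial_n and that the product fits in unsigned long. */
static unsigned long delta_bound(unsigned long max_cp,
                                 unsigned long max_len)
{
  return (max_cp - initial_n) * (max_len + 1);
}

/* Returns 1 if the input respects both limits, 0 otherwise. */
static int within_limits(const unsigned long input[], size_t length,
                         unsigned long max_cp, size_t max_len)
{
  size_t j;

  if (length > max_len) return 0;
  for (j = 0;  j < length;  ++j) {
    if (input[j] > max_cp) return 0;
  }
  return 1;
}
```

    An encoder built this way would call within_limits() once up front
    and reject the input if it fails, instead of testing each addition.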

635	    In theory, the decoder could use an analogous approach, limiting the
636	    number of digits in a variable-length integer (that is, limiting the
637	    number of iterations in the innermost loop).  However, the number
638	    of digits that suffice to represent a given delta can sometimes
639	    represent much larger deltas (because of the adaptation), and hence
640	    this approach would probably need integers wider than 32 bits.

642	    Yet another approach for the decoder is to allow overflow to occur,
643	    but to check the final output string by re-encoding it and comparing
644	    to the decoder input.  If and only if they do not match (using a
645	    case-insensitive ASCII comparison), overflow has occurred.  This
646	    delayed-detection approach would not impose any more restrictions on
647	    the input than the immediate-detection approach does, and might be
648	    considered simpler in some programming languages.

650	    In fact, if the decoder is used only inside the IDNA ToUnicode
651	    operation [IDNA], then it need not check for overflow at all,
652	    because ToUnicode performs a higher level re-encoding and
653	    comparison, and a mismatch has the same consequence as if the
654	    Punycode decoder had failed.
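
    The case-insensitive ASCII comparison needed by the
    delayed-detection approach can be sketched as follows.  The
    function names are hypothetical; the strings are passed with
    explicit lengths because Punycode strings, like the output of
    punycode_encode() in appendix C, are not null-terminated.

```c
#include <stddef.h>

/* Illustrative sketch only; both function names are hypothetical. */

/* Maps ASCII A..Z (65..90) to a..z; leaves everything else alone. */
static int ascii_fold(unsigned char c)
{
  return (c >= 65 && c <= 90) ? c + 32 : c;
}

/* Returns 1 if the two strings are equal ignoring ASCII case. */
static int ascii_equal_nocase(const char *a, size_t alen,
                              const char *b, size_t blen)
{
  size_t j;

  if (alen != blen) return 0;
  for (j = 0;  j < alen;  ++j) {
    if (ascii_fold((unsigned char) a[j]) !=
        ascii_fold((unsigned char) b[j])) return 0;
  }
  return 1;
}
```

    A decoder using delayed detection would re-encode its output and
    report overflow whenever this comparison against the original
    input returns 0.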

656	7. Punycode examples

658	7.1 Sample strings

660	    In the Punycode encodings below, the ACE prefix is not shown.
661	    Backslashes show where line breaks have been inserted in strings too
662	    long for one line.  The encodings below use mixed-case annotation
663	    (see appendix A), but all-uppercase or all-lowercase for the
664	    non-literal portion would also be correct.  The code points begin
665	    with U+ or u+ to indicate the case flag, as expected by the sample
666	    Punycode implementation (see appendix C).
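
    As an illustration of the notation, the following sketch parses one
    token such as "U+0050" or "u+4ED6" into a code point and a case
    flag.  The function name parse_code_point() is hypothetical, and
    the rejection of values above 10FFFF is an assumption made here to
    keep the arithmetic from overflowing; it is not how the sample
    implementation's test wrapper is necessarily written.

```c
/* Illustrative sketch only; parse_code_point() is hypothetical. */

/* Parses "U+XXXX" (flagged) or "u+XXXX" (unflagged).  Returns 1 */
/* on success, 0 if the token is malformed or exceeds 0x10FFFF.  */
static int parse_code_point(const char *token,
                            unsigned long *cp, int *flag)
{
  unsigned long value = 0;
  const char *p;

  if ((token[0] != 'U' && token[0] != 'u') || token[1] != '+') return 0;
  if (token[2] == '\0') return 0;          /* no hexadecimal digits */

  for (p = token + 2;  *p != '\0';  ++p) {
    unsigned long d;
    if (*p >= '0' && *p <= '9') d = (unsigned long) (*p - '0');
    else if (*p >= 'A' && *p <= 'F') d = (unsigned long) (*p - 'A') + 10;
    else if (*p >= 'a' && *p <= 'f') d = (unsigned long) (*p - 'a') + 10;
    else return 0;
    value = value * 16 + d;
    if (value > 0x10FFFF) return 0;        /* also prevents overflow */
  }

  *cp = value;
  *flag = (token[0] == 'U');               /* U+ means uppercase flag */
  return 1;
}
```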

668	    The first several examples are all translations of the sentence "Why
669	    can't they just speak in <language>?" (courtesy of Michael Kaplan's
670	    "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
671	    been removed, as is often done in domain names.

673	    (A) Arabic (Egyptian):
674	        u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
675	        u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
676	        Punycode: egbpdaj6bu4bxfgehfvwxn

678	    (B) Chinese (simplified):
679	        u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
680	        Punycode: ihqwcrb4cv8a8dqg056pqjye

682	    (C) Chinese (traditional):
683	        u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
684	        Punycode: ihqwctvzc91f659drss3x8bo0yb

686	    (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
687	        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
688	        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
689	        u+0065 u+0073 u+006B u+0079
690	        Punycode: Proprostnemluvesky-uyb24dma41a

692	    (E) Hebrew:
693	        u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
694	        u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
695	        u+05D1 u+05E8 u+05D9 u+05EA
696	        Punycode: 4dbcagdahymbxekheh6e0a7fei0b

698	    (F) Hindi (Devanagari):
699	        u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
700	        u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
701	        u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
702	        u+0939 u+0948 u+0902
703	        Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd

705	    (G) Japanese (kanji and hiragana):
706	        u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
707	        u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
708	        Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa

710	    (H) Korean (Hangul syllables):
711	        u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
712	        u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
713	        u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
714	        Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
715	                  psd879ccm6fea98c

717	    (I) Russian (Cyrillic):
718	        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
719	        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
720	        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
721	        u+0438
722	        Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l

724	    (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
725	        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
726	        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
727	        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
728	        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
729	        u+0061 u+00F1 u+006F u+006C
730	        Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a

732	    (K) Vietnamese:
733	        T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
734	        <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
735	        U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
736	        u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
737	        u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
738	        U+0056 u+0069 u+1EC7 u+0074
739	        Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g

741	    The next several examples are all names of Japanese music artists,
742	    song titles, and TV programs, just because the author happens to
743	    have them handy (but Japanese is useful for providing examples
744	    of single-row text, two-row text, ideographic text, and various
745	    mixtures thereof).

747	    (L) 3<nen>B<gumi><kinpachi><sensei>
748	        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
749	        Punycode: 3B-ww4c5e180e575a65lsy2b

751	    (M) <amuro><namie>-with-SUPER-MONKEYS
752	        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
753	        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
754	        U+004F U+004E U+004B U+0045 U+0059 U+0053
755	        Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n

757	    (N) Hello-Another-Way-<sorezore><no><basho>
758	        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
759	        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
760	        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
761	        Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b

763	    (O) <hitotsu><yane><no><shita>2
764	        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
765	        Punycode: 2-u9tlzr9756bt3uc0v

767	    (P) Maji<de>Koi<suru>5<byou><mae>
768	        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
769	        u+308B u+0035 u+79D2 u+524D
770	        Punycode: MajiKoi5-783gue6qz075azm5e

772	    (Q) <pafii>de<runba>
773	        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
774	        Punycode: de-jg4avhby1noc0d

776	    (R) <sono><supiido><de>
777	        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
778	        Punycode: d9juau41awczczp

780	    The last example is an ASCII string that breaks the existing rules
781	    for host name labels.  (It is not a realistic example for IDNA,
782	    because IDNA never encodes pure ASCII labels.)

784	    (S) -> $1.00 <-
785	        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
786	        u+003C u+002D
787	        Punycode: -> $1.00 <--

789	7.2 Decoding traces

791	    In the following traces, the evolving state of the decoder is
792	    shown as a sequence of hexadecimal values, representing the code
793	    points in the extended string.  An asterisk appears just after the
794	    most recently inserted code point, indicating both n (the value
795	    preceding the asterisk) and i (the position of the value just after
796	    the asterisk).  Other numerical values are decimal.

798	    Decoding trace of example B from section 7.1:

800	    n is 128, i is 0, bias is 72
801	    input is "ihqwcrb4cv8a8dqg056pqjye"
802	    there is no delimiter, so extended string starts empty
803	    delta "ihq" decodes to 19853
804	    bias becomes 21
805	    4E0D *
806	    delta "wc" decodes to 64
807	    bias becomes 20
808	    4E0D 4E2D *
809	    delta "rb" decodes to 37
810	    bias becomes 13
811	    4E3A * 4E0D 4E2D
812	    delta "4c" decodes to 56
813	    bias becomes 17
814	    4E3A 4E48 * 4E0D 4E2D
815	    delta "v8a" decodes to 599
816	    bias becomes 32
817	    4E3A 4EC0 * 4E48 4E0D 4E2D
818	    delta "8d" decodes to 130
819	    bias becomes 23
820	    4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
821	    delta "qg" decodes to 154
822	    bias becomes 25
823	    4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
824	    delta "056p" decodes to 46301
825	    bias becomes 84
826	    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
827	    delta "qjye" decodes to 88531
828	    bias becomes 90
829	    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587

831	    Decoding trace of example L from section 7.1:

833	    n is 128, i is 0, bias is 72
834	    input is "3B-ww4c5e180e575a65lsy2b"
835	    literal portion is "3B-", so extended string starts as:
836	    0033 0042
837	    delta "ww4c" decodes to 62042
838	    bias becomes 27
839	    0033 0042 5148 *
840	    delta "5e" decodes to 139
841	    bias becomes 24
842	    0033 0042 516B * 5148
843	    delta "180e" decodes to 16683
844	    bias becomes 67
845	    0033 5E74 * 0042 516B 5148
846	    delta "575a" decodes to 34821
847	    bias becomes 82
848	    0033 5E74 0042 516B 5148 751F *
849	    delta "65l" decodes to 14592
850	    bias becomes 67
851	    0033 5E74 0042 7D44 * 516B 5148 751F
852	    delta "sy2b" decodes to 42088
853	    bias becomes 84
854	    0033 5E74 0042 7D44 91D1 * 516B 5148 751F
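
    The individual steps of these traces can be reproduced with a
    small sketch that decodes one generalized variable-length integer
    and then adapts the bias.  The function names digit_value(),
    decode_delta(), and adapt_bias() are hypothetical; the logic
    follows the same steps as decode_digit(), the inner decoding loop,
    and adapt() in appendix C, but uses unsigned long and decodes each
    delta starting from zero rather than accumulating into i.

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative sketch only; all three function names are hypothetical. */

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700 };

/* Digit value of an ASCII code point, or base if it is not a digit. */
static unsigned long digit_value(unsigned char cp)
{
  if (cp >= 48 && cp <= 57) return cp - 22u;    /* 0..9 -> 26..35 */
  if (cp >= 65 && cp <= 90) return cp - 65u;    /* A..Z ->  0..25 */
  if (cp >= 97 && cp <= 122) return cp - 97u;   /* a..z ->  0..25 */
  return base;
}

/* Decodes one delta starting at s[*pos] and advances *pos past it. */
/* Returns 1 on success, 0 on bad input or overflow.                */
static int decode_delta(const char *s, size_t len, size_t *pos,
                        unsigned long bias, unsigned long *delta)
{
  unsigned long i = 0, w = 1, k, t, d;

  for (k = base;  ;  k += base) {
    if (*pos >= len) return 0;
    d = digit_value((unsigned char) s[(*pos)++]);
    if (d >= base) return 0;
    if (d > (ULONG_MAX - i) / w) return 0;      /* i + d * w overflows */
    i += d * w;
    t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
    if (d < t) break;                           /* last digit of delta */
    if (w > ULONG_MAX / (base - t)) return 0;   /* w * (base - t) overflows */
    w *= base - t;
  }
  *delta = i;
  return 1;
}

/* Bias adaptation; numpoints is the string length after insertion. */
static unsigned long adapt_bias(unsigned long delta,
                                unsigned long numpoints, int firsttime)
{
  unsigned long k = 0;

  delta = firsttime ? delta / damp : delta / 2;
  delta += delta / numpoints;
  while (delta > ((base - tmin) * tmax) / 2) {
    delta /= base - tmin;
    k += base;
  }
  return k + (base - tmin + 1) * delta / (delta + skew);
}
```

    Applied to the start of example B with bias 72, decode_delta()
    consumes "ihq" and yields 19853, and adapt_bias(19853, 1, 1)
    yields 21; the next call consumes "wc" and yields 64, after which
    adapt_bias(64, 2, 0) yields 20, matching the first two steps of
    the trace.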

856	7.3 Encoding traces

858	    In the following traces, code point values are hexadecimal, while
859	    other numerical values are decimal.

861	    Encoding trace of example B from section 7.1:

863	    bias is 72
864	    input is:
865	    4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
866	    there are no basic code points, so no literal portion
867	    next code point to insert is 4E0D
868	    needed delta is 19853, encodes as "ihq"
869	    bias becomes 21
870	    next code point to insert is 4E2D
871	    needed delta is 64, encodes as "wc"
872	    bias becomes 20
873	    next code point to insert is 4E3A
874	    needed delta is 37, encodes as "rb"
875	    bias becomes 13
876	    next code point to insert is 4E48
877	    needed delta is 56, encodes as "4c"
878	    bias becomes 17
879	    next code point to insert is 4EC0
880	    needed delta is 599, encodes as "v8a"
881	    bias becomes 32
882	    next code point to insert is 4ED6
883	    needed delta is 130, encodes as "8d"
884	    bias becomes 23
885	    next code point to insert is 4EEC
886	    needed delta is 154, encodes as "qg"
887	    bias becomes 25
888	    next code point to insert is 6587
889	    needed delta is 46301, encodes as "056p"
890	    bias becomes 84
891	    next code point to insert is 8BF4
892	    needed delta is 88531, encodes as "qjye"
893	    bias becomes 90
894	    output is "ihqwcrb4cv8a8dqg056pqjye"

896	    Encoding trace of example L from section 7.1:

898	    bias is 72
899	    input is:
900	    0033 5E74 0042 7D44 91D1 516B 5148 751F
901	    basic code points (0033, 0042) are copied to literal portion: "3B-"
902	    next code point to insert is 5148
903	    needed delta is 62042, encodes as "ww4c"
904	    bias becomes 27
905	    next code point to insert is 516B
906	    needed delta is 139, encodes as "5e"
907	    bias becomes 24
908	    next code point to insert is 5E74
909	    needed delta is 16683, encodes as "180e"
910	    bias becomes 67
911	    next code point to insert is 751F
912	    needed delta is 34821, encodes as "575a"
913	    bias becomes 82
914	    next code point to insert is 7D44
915	    needed delta is 14592, encodes as "65l"
916	    bias becomes 67
917	    next code point to insert is 91D1
918	    needed delta is 42088, encodes as "sy2b"
919	    bias becomes 84
920	    output is "3B-ww4c5e180e575a65lsy2b"
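
    The "needed delta ... encodes as" steps above can be reproduced
    with the following sketch, which writes one delta as a generalized
    variable-length integer.  The names digit_char() and
    encode_delta() are hypothetical; the loop mirrors the inner loop
    of punycode_encode() in appendix C but always emits lowercase
    digits and returns 0 when the output buffer is too small.

```c
#include <stddef.h>

/* Illustrative sketch only; both function names are hypothetical. */

enum { base = 36, tmin = 1, tmax = 26 };

/* 0..25 map to ASCII a..z, 26..35 map to ASCII 0..9. */
static char digit_char(unsigned long d)
{
  return (char) (d < 26 ? 97 + d : 22 + d);
}

/* Writes the digits for delta into out[] and returns how many were */
/* written, or 0 if capacity is too small (at least one digit is    */
/* always needed, so 0 is never a valid count).                     */
static size_t encode_delta(unsigned long delta, unsigned long bias,
                           char out[], size_t capacity)
{
  unsigned long q = delta, k, t;
  size_t n = 0;

  for (k = base;  ;  k += base) {
    t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
    if (q < t) break;
    if (n >= capacity) return 0;
    out[n++] = digit_char(t + (q - t) % (base - t));
    q = (q - t) / (base - t);
  }

  if (n >= capacity) return 0;
  out[n++] = digit_char(q);          /* final digit, always < t */
  return n;
}
```

    For example, encode_delta(19853, 72, ...) writes "ihq" and
    encode_delta(139, 27, ...) writes "5e", as in the traces of
    examples B and L.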

922	8. Security considerations

924	    Users expect each domain name in DNS to be controlled by a single
925	    authority.  If a Unicode string intended for use as a domain label
926	    could map to multiple ACE labels, then an internationalized domain
927	    name could map to multiple ASCII domain names, each controlled by
928	    a different authority, some of which could be spoofs that hijack
929	    service requests intended for another.  Therefore Punycode is
930	    designed so that each Unicode string has a unique encoding.

932	    However, there can still be multiple Unicode representations of the
933	    "same" text, for various definitions of "same".  This problem is
934	    addressed to some extent by the Unicode standard under the topic of
935	    canonicalization, and this work is leveraged for domain names by
936	    Nameprep [NAMEPREP].

938	9. References

940	9.1 Normative references

942	   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
943	                Requirement Levels", BCP 14, RFC 2119, March 1997.

945	9.2 Informative references

947	   [ASCII]      Cerf, V., "ASCII format for Network Interchange",
948	                RFC 20, October 1969.

950	   [IDNA]       Faltstrom, P., Hoffman, P. and A. Costello,
951	                "Internationalizing Domain Names in Applications
952	                (IDNA)", draft-hoffman-rfc3490bis.

954	   [NAMEPREP]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
955	                Profile for Internationalized Domain Names (IDN)",
956	                draft-hoffman-rfc3491bis.

958	   [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
959	                http://www.trigeminal.com/samples/provincial.html.

961	   [RFC952]     Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
962	                Host Table Specification", RFC 952, October 1985.

964	   [RFC1034]    Mockapetris, P., "Domain Names - Concepts and
965	                Facilities", STD 13, RFC 1034, November 1987.

967	   [UNICODE]    The Unicode Consortium, "The Unicode Standard",
968	                http://www.unicode.org/unicode/standard/standard.html.

970	A. Mixed-case annotation

972	    In order to use Punycode to represent case-insensitive strings,
973	    higher layers need to case-fold the strings prior to Punycode
974	    encoding.  The encoded string can use mixed case as an annotation
975	    telling how to convert the folded string into a mixed-case string
976	    for display purposes.  Note, however, that mixed-case annotation
977	    is not used by the ToASCII and ToUnicode operations specified in
978	    [IDNA], and therefore implementors of IDNA can disregard this
979	    appendix.

981	    Basic code points can use mixed case directly, because the decoder
982	    copies them verbatim, leaving lowercase code points lowercase, and
983	    leaving uppercase code points uppercase.  Each non-basic code point
984	    is represented by a delta, which is represented by a sequence of
985	    basic code points, the last of which provides the annotation.  If it
986	    is uppercase, it is a suggestion to map the non-basic code point to
987	    uppercase (if possible); if it is lowercase, it is a suggestion to
988	    map the non-basic code point to lowercase (if possible).

990	    These annotations do not alter the code points returned by decoders;
991	    the annotations are returned separately, for the caller to use or
992	    ignore.  Encoders can accept annotations in addition to code points,
993	    but the annotations do not alter the output, except to influence the
994	    uppercase/lowercase form of ASCII letters.

996	    Punycode encoders and decoders need not support these annotations,
997	    and higher layers need not use them.
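
    For a decoder that does choose to report annotations, reading the
    flag of a non-basic code point amounts to testing the last basic
    code point of its delta.  The sketch below shows this under the
    assumption that the caller has already isolated the digits of one
    delta; the function name delta_is_flagged() is hypothetical.  The
    last digit of a delta is always less than the threshold t, which
    never exceeds tmax = 26, so it is always a letter and can always
    carry the annotation.

```c
#include <stddef.h>

/* Illustrative sketch only; delta_is_flagged() is hypothetical. */

/* Returns 1 if the final basic code point of an encoded delta is  */
/* an uppercase ASCII letter (65..90), meaning the corresponding   */
/* non-basic code point is flagged; returns 0 otherwise.           */
static int delta_is_flagged(const char *digits, size_t len)
{
  unsigned char last;

  if (len == 0) return 0;
  last = (unsigned char) digits[len - 1];
  return last >= 65 && last <= 90;
}
```

    Only the last digit matters: an uppercase letter earlier in the
    delta does not set the flag.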

999	B. Disclaimer and license

1001	    Regarding this entire document or any portion of it (including
1002	    the pseudocode and C code), the author makes no guarantees and
1003	    is not responsible for any damage resulting from its use.  The
1004	    author grants irrevocable permission to anyone to use, modify,
1005	    and distribute it in any way that does not diminish the rights
1006	    of anyone else to use, modify, and distribute it, provided that
1007	    redistributed derivative works do not contain misleading author or
1008	    version information.  Derivative works need not be licensed under
1009	    similar terms.

1011	C. Punycode sample implementation

1013	/*
1014	punycode-sample.c from RFC ????
1015	http://www.nicemice.net/idn/
1016	Adam M. Costello
1017	http://www.nicemice.net/amc/

1019	This is ANSI C code (C89) implementing Punycode (RFC ????).

1021	This single file contains three sections (an interface, an
1022	implementation, and a wrapper for testing) that would normally belong
1023	in three separate files (punycode.h, punycode.c, punycode-test.c), but
1024	here they are bundled into one file (punycode-sample.c) for convenient
1025	testing.  Anyone wishing to reuse this code will probably want to split
1026	it apart.

1028	*/

1030	/************************************************************/
1031	/* Public interface (would normally go in its own .h file): */

1033	#include <limits.h>
1034	#include <stddef.h>

1036	enum punycode_status {
1037	  punycode_success    = 0,
1038	  punycode_bad_input  = 1, /* Input is invalid.                       */
1039	  punycode_big_output = 2, /* Output would exceed the space provided. */
1040	  punycode_overflow   = 3  /* Wider integers needed to process input. */
1041	};

1043	/* punycode_uint needs to be unsigned and needs to be */
1044	/* at least 26 bits wide.  The particular type can be */
1045	/* specified by defining PUNYCODE_UINT, otherwise a   */
1046	/* suitable type will be chosen automatically.        */

1048	#ifdef PUNYCODE_UINT
1049	  typedef PUNYCODE_UINT punycode_uint;
1050	#elif UINT_MAX >= (1 << 26) - 1
1051	  typedef unsigned int punycode_uint;
1052	#else
1053	  typedef unsigned long punycode_uint;
1054	#endif

1056	enum punycode_status punycode_encode(
1057	  size_t,                 /* input_length  */
1058	  const punycode_uint [], /* input         */
1059	  const unsigned char [], /* case_flags    */
1060	  size_t *,               /* output_length */
1061	  char []                 /* output        */
1062	);

1064	/*
1065	    punycode_encode() converts a sequence of code points (presumed to be
1066	    Unicode code points) to Punycode.

1068	    Input arguments (to be supplied by the caller):

1070	        input_length
1071	            The number of code points in the input array and the number
1072	            of flags in the case_flags array.

1074	        input
1075	            An array of code points.  They are presumed to be Unicode
1076	            code points, but that is not strictly necessary.  The
1077	            array contains code points, not code units.  UTF-16 uses
1078	            code units D800 through DFFF to refer to code points
1079	            10000..10FFFF.  The code points D800..DFFF do not occur in
1080	            any valid Unicode string.  The code points that can occur in
1081	            Unicode strings (0..D7FF and E000..10FFFF) are also called
1082	            Unicode scalar values.

1084	        case_flags
1085	            A null pointer or an array of boolean values parallel to
1086	            the input array.  Nonzero (true, flagged) suggests that the
1087	            corresponding Unicode character be forced to uppercase after
1088	            being decoded (if possible), and zero (false, unflagged)
1089	            suggests that it be forced to lowercase (if possible).
1090	            ASCII code points (0..7F) are encoded literally, except that
1091	            ASCII letters are forced to uppercase or lowercase according
1092	            to the corresponding case flags.  If case_flags is a null
1093	            pointer then ASCII letters are left as they are, and other
1094	            code points are treated as unflagged.

1096	    Output arguments (to be filled in by the function):

1098	        output
1099	            An array of ASCII code points.  It is *not* null-terminated;
1100	            it will contain zeros if and only if the input contains
1101	            zeros.  (Of course the caller can leave room for a
1102	            terminator and add one if needed.)

1104	    Input/output arguments (to be supplied by the caller and overwritten
1105	    by the function):

1107	        output_length
1108	            The caller passes in the maximum number of ASCII code points
1109	            that it can receive.  On successful return it will contain
1110	            the number of ASCII code points actually output.

1112	    Return value:

1114	        Can be any of the punycode_status values defined above except
1115	        punycode_bad_input.  If not punycode_success, then output_length
1116	        and output might contain garbage.
1117	*/

1119	enum punycode_status punycode_decode(
1120	  size_t,           /* input_length  */
1121	  const char [],    /* input         */
1122	  size_t *,         /* output_length */
1123	  punycode_uint [], /* output        */
1124	  unsigned char []  /* case_flags    */
1125	);

1127	/*
1128	    punycode_decode() converts Punycode to a sequence of code points
1129	    (presumed to be Unicode code points).

1131	    Input arguments (to be supplied by the caller):

1133	        input_length
1134	            The number of ASCII code points in the input array.

1136	        input
1137	            An array of ASCII code points (0..7F).

1139	    Output arguments (to be filled in by the function):

1141	        output
1142	            An array of code points like the input argument of
1143	            punycode_encode() (see above).

1145	        case_flags
1146	            A null pointer (if the flags are not needed by the caller)
1147	            or an array of boolean values parallel to the output array.
1148	            Nonzero (true, flagged) suggests that the corresponding
1149	            Unicode character be forced to uppercase by the caller (if
1150	            possible), and zero (false, unflagged) suggests that it
1151	            be forced to lowercase (if possible).  ASCII code points
1152	            (0..7F) are output already in the proper case, but their
1153	            flags will be set appropriately so that applying the flags
1154	            would be harmless.

1156	    Input/output arguments (to be supplied by the caller and overwritten
1157	    by the function):

1159	        output_length
1160	            The caller passes in the maximum number of code points
1161	            that it can receive into the output array (which is also
1162	            the maximum number of flags that it can receive into the
1163	            case_flags array, if case_flags is not a null pointer).  On
1164	            successful return it will contain the number of code points
1165	            actually output (which is also the number of flags actually
1166	            output, if case_flags is not a null pointer).  The decoder
1167	            will never need to output more code points than the number
1168	            of ASCII code points in the input, because of the way the
1169	            encoding is defined.  The number of code points output
1170	            cannot exceed the maximum possible value of a punycode_uint,
1171	            even if the supplied output_length is greater than that.

1173	    Return value:

1175	        Can be any of the punycode_status values defined above.  If not
1176	        punycode_success, then output_length, output, and case_flags
1177	        might contain garbage.
1178	*/

1180	/**********************************************************/
1181	/* Implementation (would normally go in its own .c file): */

1183	#include <string.h>

1185	/* #include "punycode.h" */

1187	/*** Bootstring parameters for Punycode ***/

1189	enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1190	       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };

1192	/* basic(cp) tests whether cp is a basic code point: */
1193	#define basic(cp) ((punycode_uint)(cp) < 0x80)

1195	/* delim(cp) tests whether cp is a delimiter: */
1196	#define delim(cp) ((cp) == delimiter)

1198	/* decode_digit(cp) returns the numeric value of a basic code */
1199	/* point (for use in representing integers) in the range 0 to */
1200	/* base-1, or base if cp does not represent a value.          */

1202	static punycode_uint decode_digit(punycode_uint cp)
1203	{
1204	  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
1205	          cp - 97 < 26 ? cp - 97 :  base;
1206	}

1208	/* encode_digit(d,flag) returns the basic code point whose value      */
1209	/* (when used for representing integers) is d, which needs to be in   */
1210	/* the range 0 to base-1.  The lowercase form is used unless flag is  */
1211	/* nonzero, in which case the uppercase form is used.  The behavior   */
1212	/* is undefined if flag is nonzero and digit d has no uppercase form. */

1214	static char encode_digit(punycode_uint d, int flag)
1215	{
1216	  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1217	  /*  0..25 map to ASCII a..z or A..Z */
1218	  /* 26..35 map to ASCII 0..9         */
1219	}

1221	/* flagged(bcp) tests whether a basic code point is flagged */
1222	/* (uppercase).  The behavior is undefined if bcp is not a  */
1223	/* basic code point.                                        */

1225	#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)

1227	/* encode_basic(bcp,flag) forces a basic code point to lowercase */
1228	/* if flag is zero, uppercase if flag is nonzero, and returns    */
1229	/* the resulting code point.  The code point is unchanged if it  */
1230	/* is caseless.  The behavior is undefined if bcp is not a basic */
1231	/* code point.                                                   */

1233	static char encode_basic(punycode_uint bcp, int flag)
1234	{
1235	  bcp -= (bcp - 97 < 26) << 5;
1236	  return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1237	}

1239	/*** Platform-specific constants ***/

1241	/* maxint is the maximum value of a punycode_uint variable: */
1242	static const punycode_uint maxint = -1;
1243	/* Because maxint is unsigned, -1 becomes the maximum value. */

1245	/*** Bias adaptation function ***/

1247	static punycode_uint adapt(
1248	  punycode_uint delta, punycode_uint numpoints, int firsttime )
1249	{
1250	  punycode_uint k;

1252	  delta = firsttime ? delta / damp : delta >> 1;
1253	  /* delta >> 1 is a faster way of doing delta / 2 */
1254	  delta += delta / numpoints;

1256	  for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {
1257	    delta /= base - tmin;
1258	  }

1260	  return k + (base - tmin + 1) * delta / (delta + skew);
1261	}

1263	/*** Main encode function ***/

1265	enum punycode_status punycode_encode(
1266	  size_t input_length_orig,
1267	  const punycode_uint input[],
1268	  const unsigned char case_flags[],
1269	  size_t *output_length,
1270	  char output[] )
1271	{
1272	  punycode_uint input_length, n, delta, h, b, bias, j, m, q, k, t;
1273	  size_t out, max_out;

1275	  /* The Punycode spec assumes that the input length is the same type */
1276	  /* of integer as a code point, so we need to convert the size_t to  */
1277	  /* a punycode_uint, which could overflow.                           */

1279	  if (input_length_orig > maxint) return punycode_overflow;
1280	  input_length = (punycode_uint) input_length_orig;

1282	  /* Initialize the state: */

1284	  n = initial_n;
1285	  delta = 0;
1286	  out = 0;
1287	  max_out = *output_length;
1288	  bias = initial_bias;

1290	  /* Handle the basic code points: */

1292	  for (j = 0;  j < input_length;  ++j) {
1293	    if (basic(input[j])) {
1294	      if (max_out - out < 2) return punycode_big_output;
1295	      output[out++] = case_flags ?
1296	        encode_basic(input[j], case_flags[j]) : (char) input[j];
1297	    }
1298	    /* else if (input[j] < n) return punycode_bad_input; */
1299	    /* (not needed for Punycode with unsigned code points) */
1300	  }

1302	  h = b = (punycode_uint) out;
1303	  /* cannot overflow because out <= input_length <= maxint */

1305	  /* h is the number of code points that have been handled, b is the  */
1306	  /* number of basic code points, and out is the number of ASCII code */
1307	  /* points that have been output.                                    */

1309	  if (b > 0) output[out++] = delimiter;

1311	  /* Main encoding loop: */

1313	  while (h < input_length) {
1314	    /* All non-basic code points < n have been     */
1315	    /* handled already.  Find the next larger one: */

1317	    for (m = maxint, j = 0;  j < input_length;  ++j) {
1318	      /* if (basic(input[j])) continue; */
1319	      /* (not needed for Punycode) */
1320	      if (input[j] >= n && input[j] < m) m = input[j];
1321	    }

1323	    /* Increase delta enough to advance the decoder's    */
1324	    /* <n,i> state to <m,0>, but guard against overflow: */

1326	    if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1327	    delta += (m - n) * (h + 1);
1328	    n = m;

1330	    for (j = 0;  j < input_length;  ++j) {
1331	      /* Punycode does not need to check whether input[j] is basic: */
1332	      if (input[j] < n /* || basic(input[j]) */ ) {
1333	        if (++delta == 0) return punycode_overflow;
1334	      }

1336	      if (input[j] == n) {
1337	        /* Represent delta as a generalized variable-length integer: */

1339	        for (q = delta, k = base;  ;  k += base) {
1340	          if (out >= max_out) return punycode_big_output;
1341	          t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1342	              k >= bias + tmax ? tmax : k - bias;
1343	          if (q < t) break;
1344	          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1345	          q = (q - t) / (base - t);
1346	        }

1348	        output[out++] = encode_digit(q, case_flags && case_flags[j]);
1349	        bias = adapt(delta, h + 1, h == b);
1350	        delta = 0;
1351	        ++h;
1352	      }
1353	    }

1355	    ++delta, ++n;
1356	  }

1358	  *output_length = out;
1359	  return punycode_success;
1360	}

1362	/*** Main decode function ***/

1364	enum punycode_status punycode_decode(
1365	  size_t input_length,
1366	  const char input[],
1367	  size_t *output_length,
1368	  punycode_uint output[],
1369	  unsigned char case_flags[] )
1370	{
1371	  punycode_uint n, out, i, max_out, bias, oldi, w, k, digit, t;
1372	  size_t b, j, in;

1374	  /* Initialize the state: */

1376	  n = initial_n;
1377	  out = i = 0;
1378	  max_out = *output_length > maxint ? maxint
1379	            : (punycode_uint) *output_length;
1380	  bias = initial_bias;

1382	  /* Handle the basic code points:  Let b be the number of input code */
1383	  /* points before the last delimiter, or 0 if there is none, then    */
1384	  /* copy the first b code points to the output.                      */

1386	  for (b = j = 0;  j < input_length;  ++j)  if (delim(input[j])) b = j;
1387	  if (b > max_out) return punycode_big_output;

1389	  for (j = 0;  j < b;  ++j) {
1390	    if (case_flags) case_flags[out] = flagged(input[j]);
1391	    if (!basic(input[j])) return punycode_bad_input;
1392	    output[out++] = input[j];
1393	  }

1395	  /* Main decoding loop:  Start just after the last delimiter if any  */
1396	  /* basic code points were copied; start at the beginning otherwise. */

1398	  for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {

1400	    /* in is the index of the next ASCII code point to be consumed, */
1401	    /* and out is the number of code points in the output array.    */

1403	    /* Decode a generalized variable-length integer into delta,  */
1404	    /* which gets added to i.  The overflow checking is easier   */
1405	    /* if we increase i as we go, then subtract off its starting */
1406	    /* value at the end to obtain delta.                         */

1408	    for (oldi = i, w = 1, k = base;  ;  k += base) {
1409	      if (in >= input_length) return punycode_bad_input;
1410	      digit = decode_digit(input[in++]);
1411	      if (digit >= base) return punycode_bad_input;
1412	      if (digit > (maxint - i) / w) return punycode_overflow;
1413	      i += digit * w;
1414	      t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1415	          k >= bias + tmax ? tmax : k - bias;
1416	      if (digit < t) break;
1417	      if (w > maxint / (base - t)) return punycode_overflow;
1418	      w *= (base - t);
1419	    }

1421	    bias = adapt(i - oldi, out + 1, oldi == 0);

1423	    /* i was supposed to wrap around from out+1 to 0,   */
1424	    /* incrementing n each time, so we'll fix that now: */

1426	    if (i / (out + 1) > maxint - n) return punycode_overflow;
1427	    n += i / (out + 1);
1428	    i %= (out + 1);

1430	    /* Insert n at position i of the output: */

1432	    /* not needed for Punycode: */
1433	    /* if (basic(n)) return punycode_bad_input; */
1434	    if (out >= max_out) return punycode_big_output;

1436	    if (case_flags) {
1437	      memmove(case_flags + i + 1, case_flags + i, out - i);
1438	      /* Case of last ASCII code point determines case flag: */
1439	      case_flags[i] = flagged(input[in - 1]);
1440	    }

1442	    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1443	    output[i++] = n;
1444	  }

1446	  *output_length = (size_t) out;
1447	  /* cannot overflow because out <= old value of *output_length */
1448	  return punycode_success;
1449	}

1451	/******************************************************************/
1452	/* Wrapper for testing (would normally go in a separate .c file): */

1454	#include <assert.h>
1455	#include <stdio.h>
1456	#include <stdlib.h>
1457	#include <string.h>

1459	/* #include "punycode.h" */

1461	/* For testing, we'll just set some compile-time */
1462	/* limits rather than use malloc().              */

1464	enum {
1465	  unicode_max_length = 256,
1466	  ace_max_length = 256
1467	};

1469	static void usage(char **argv)
1470	{
1471	  fprintf(stderr,
1472	    "\n"
1473	    "%s -e reads code points and writes a Punycode string.\n"
1474	    "%s -d reads a Punycode string and writes code points.\n"
1475	    "\n"
1476	    "Input and output are plain text in the native character set.\n"
1477	    "Code points are in the form u+hex separated by whitespace.\n"
1478	    "Although the specification allows Punycode strings to contain\n"
1479	    "any characters from the ASCII repertoire, this test code\n"
1480	    "supports only the printable characters, and needs the Punycode\n"
1481	    "string to be followed by a newline.\n"
1482	    "The case of the u in u+hex is the case flag.\n"
1483	    , argv[0], argv[0]);
1484	  exit(EXIT_FAILURE);
1485	}

1487	static void fail(const char *msg)
1488	{
1489	  fputs(msg,stderr);
1490	  exit(EXIT_FAILURE);
1491	}

1493	static const char too_big[] =
1494	  "input or output is too large, recompile with larger limits\n";
1495	static const char invalid_input[] = "invalid input\n";
1496	static const char overflow[] = "arithmetic overflow\n";
1497	static const char io_error[] = "I/O error\n";

1499	/* The following string is used to convert printable */
1500	/* characters between ASCII and the native charset:  */

1502	static const char print_ascii[] =
1503	  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1504	  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1505	  " !\"#$%&'()*+,-./"
1506	  "0123456789:;<=>?"
1507	  "@ABCDEFGHIJKLMNO"
1508	  "PQRSTUVWXYZ[\\]^_"
1509	  "`abcdefghijklmno"
1510	  "pqrstuvwxyz{|}~\n";

1512	int main(int argc, char **argv)
1513	{
1514	  enum punycode_status status;
1515	  int r;
1516	  unsigned int input_length, output_length, j;
1517	  unsigned char case_flags[unicode_max_length];

1519	  if (argc != 2) usage(argv);
1520	  if (argv[1][0] != '-') usage(argv);
1521	  if (argv[1][1] == 0 || argv[1][2] != 0) usage(argv);

1523	  if (argv[1][1] == 'e') {
1524	    punycode_uint input[unicode_max_length];
1525	    unsigned long codept;
1526	    char output[ace_max_length+1], uplus[3];
1527	    int c;

1529	    /* Read the input code points: */

1531	    input_length = 0;

1533	    for (;;) {
1534	      r = scanf("%2s%lx", uplus, &codept);
1535	      if (ferror(stdin)) fail(io_error);
1536	      if (r == EOF || r == 0) break;

1538	      if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1539	        fail(invalid_input);
1540	      }

1542	      if (input_length == unicode_max_length) fail(too_big);

1544	      if (uplus[0] == 'u') case_flags[input_length] = 0;
1545	      else if (uplus[0] == 'U') case_flags[input_length] = 1;
1546	      else fail(invalid_input);

1548	      input[input_length++] = codept;
1549	    }

1551	    /* Encode: */

1553	    output_length = ace_max_length;
1554	    status = punycode_encode(input_length, input, case_flags,
1555	                             &output_length, output);
1556	    if (status == punycode_bad_input) fail(invalid_input);
1557	    if (status == punycode_big_output) fail(too_big);
1558	    if (status == punycode_overflow) fail(overflow);
1559	    assert(status == punycode_success);

1561	    /* Convert to native charset and output: */

1563	    for (j = 0;  j < output_length;  ++j) {
1564	      c = output[j];
1565	      assert(c >= 0 && c <= 127);
1566	      if (print_ascii[c] == 0) fail(invalid_input);
1567	      output[j] = print_ascii[c];
1568	    }

1570	    output[j] = 0;
1571	    r = puts(output);
1572	    if (r == EOF) fail(io_error);
1573	    return EXIT_SUCCESS;
1574	  }

1576	  if (argv[1][1] == 'd') {
1577	    char input[ace_max_length+2], *p, *pp;
1578	    punycode_uint output[unicode_max_length];

1580	    /* Read the Punycode input string and convert to ASCII: */

1582	    fgets(input, ace_max_length+2, stdin);
1583	    if (ferror(stdin)) fail(io_error);
1584	    if (feof(stdin)) fail(invalid_input);
1585	    input_length = strlen(input) - 1;
1586	    if (input[input_length] != '\n') fail(too_big);
1587	    input[input_length] = 0;

1589	    for (p = input;  *p != 0;  ++p) {
1590	      pp = strchr(print_ascii, *p);
1591	      if (pp == 0) fail(invalid_input);
1592	      *p = pp - print_ascii;
1593	    }

1595	    /* Decode: */

1597	    output_length = unicode_max_length;
1598	    status = punycode_decode(input_length, input, &output_length,
1599	                             output, case_flags);
1600	    if (status == punycode_bad_input) fail(invalid_input);
1601	    if (status == punycode_big_output) fail(too_big);
1602	    if (status == punycode_overflow) fail(overflow);
1603	    assert(status == punycode_success);

1605	    /* Output the result: */

1607	    for (j = 0;  j < output_length;  ++j) {
1608	      r = printf("%s+%04lX\n",
1609	                 case_flags[j] ? "U" : "u",
1610	                 (unsigned long) output[j] );
1611	      if (r < 0) fail(io_error);
1612	    }

1614	    return EXIT_SUCCESS;
1615	  }

1617	  usage(argv);
1618	  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
1619	}

1621	D. Changes from RFC 3492

1623	    This document is a revision of RFC 3492.  None of the changes alter
1624	    the protocol; that is, any correct implementation of RFC 3492 is
1625	    a correct implementation of this document, and vice versa.  The
1626	    changes are as follows:

1628	    At the end of section 2 "Terminology", added a statement of
1629	    the assumption that all values from zero through maxint can be
1630	    represented.  RFC 3492 relied on this assumption without stating it.

1632	    In sections 3.1 "Basic code point segregation", 4 "Bootstring
1633	    parameters", and 5 "Parameter values for Punycode", reworded the
1634	    discussion of delimiters to permit more than one code point to
1635	    serve as delimiter in Bootstring, while emphasizing that Punycode
1636	    has only one code point serving as delimiter.  The pseudocode and
1637	    sample implementation are not changed, because they already had no
1638	    dependence on the uniqueness of the delimiter.
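
    As an illustration of why the code does not depend on a unique
    delimiter, the following minimal sketch mirrors the decoder's scan
    for the last delimiter.  The names is_delimiter() and
    count_before_last_delimiter() are hypothetical and do not appear
    in the sample implementation.  A Bootstring profile with several
    delimiters would, under this sketch, change only the predicate.

```c
#include <stddef.h>

/* Hypothetical predicate, not part of the sample implementation.  */
/* Punycode's only delimiter is hyphen-minus (0x2D); a different   */
/* Bootstring profile could accept additional code points here.    */
static int is_delimiter(unsigned int cp)
{
  return cp == 0x2D;
}

/* Return the number of input code points before the last */
/* delimiter, or 0 if the input contains no delimiter.    */
static size_t count_before_last_delimiter(const char input[],
                                          size_t input_length)
{
  size_t b = 0, j;

  for (j = 0;  j < input_length;  ++j) {
    if (is_delimiter((unsigned char) input[j])) b = j;
  }

  return b;
}
```

    For the nine-character input "abc-def-x" the function returns 7,
    so the first seven code points would be copied as literals.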

1640	    In section 4 "Bootstring parameters", strengthened the constraints
1641	    to disallow tmin == base-1 and to disallow tmax == 0.  Either
1642	    of these values would be nonsensical because it would result in
1643	    infinite loops.  The Punycode parameter values already satisfied the
1644	    stronger constraints.
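
    The strengthened constraints can be sketched as a validity check.
    This is an illustration only: bootstring_params_ok() is a
    hypothetical name, the constraint list is a reading of section 4
    of RFC 3492 plus the two exclusions described above, and the
    base < 2 test is an added assumption that keeps the unsigned
    arithmetic from wrapping.

```c
/* Hypothetical check, not part of the sample implementation. */
/* Returns 1 if the parameters satisfy the constraints as     */
/* read here, 0 otherwise.                                    */
static int bootstring_params_ok(unsigned base, unsigned tmin,
                                unsigned tmax, unsigned skew,
                                unsigned damp, unsigned initial_bias)
{
  if (base < 2) return 0;          /* assumption, see lead-in    */
  if (tmin > tmax) return 0;       /* tmin <= tmax               */
  if (tmax > base - 1) return 0;   /* tmax <= base-1             */
  if (tmin == base - 1) return 0;  /* newly disallowed           */
  if (tmax == 0) return 0;         /* newly disallowed           */
  if (skew < 1) return 0;
  if (damp < 2) return 0;
  if (initial_bias % base > base - tmin) return 0;
  return 1;
}
```

    With the Punycode values (base 36, tmin 1, tmax 26, skew 38,
    damp 700, initial_bias 72) the check succeeds, for example
    bootstring_params_ok(36, 1, 26, 38, 700, 72).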

1646	    In section 5 "Parameter values for Punycode", in the paragraph
1647	    following the first list of parameter values, fixed the remark about
1648	    code points D800..DFFF, which RFC 3492 claimed were not code points
1649	    at all.  Technically they are code points, but they do not occur in
1650	    any valid Unicode string.  In the same section, in the paragraph
1651	    discussing uppercase and lowercase, clarified that the restrictions
1652	    on encoder output apply only to the non-literal portion of the
1653	    encoded string.

1655	    In section 6.1 "Bias adaptation function", added a discussion of
1656	    the possibility or impossibility of overflow in adapt(), and added a
1657	    requirement to detect overflow on hypothetical platforms with such
1658	    narrow integers that overflow becomes possible.  Such platforms are
1659	    unlikely to exist, but the requirement closes a theoretical hole in
1660	    the spec.
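
    To make the arithmetic concrete, here is a minimal sketch of the
    bias adaptation function with the Punycode parameter values
    written as constants.  The name adapt_sketch() and the use of
    unsigned long are assumptions made for this illustration; it is
    not the adapt() of appendix C.  The comments mark the one product
    in the function and the bound that the loop places on it.

```c
/* Punycode parameter values, named with a sketch_ prefix to */
/* make clear they belong to this illustration only.         */
enum { sketch_base = 36, sketch_tmin = 1, sketch_tmax = 26,
       sketch_skew = 38, sketch_damp = 700 };

static unsigned long adapt_sketch(unsigned long delta,
                                  unsigned long numpoints,
                                  int firsttime)
{
  unsigned long k;

  delta = firsttime ? delta / sketch_damp : delta / 2;
  delta += delta / numpoints;

  /* After this loop, delta <= (35 * 26) / 2 = 455. */
  for (k = 0;
       delta > ((sketch_base - sketch_tmin) * sketch_tmax) / 2;
       k += sketch_base) {
    delta /= sketch_base - sketch_tmin;
  }

  /* The product below is at most 36 * 455 = 16380. */
  return k + (sketch_base - sketch_tmin + 1) * delta
             / (delta + sketch_skew);
}
```

    For example, adapt_sketch(700, 1, 1) yields 1, and
    adapt_sketch(1000000, 10, 0) yields 105 after two passes through
    the loop.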

1662	    In sections 6.2 "Decoding procedure" and 6.3 "Encoding procedure",
1663	    added statements to emphasize that overflow handling is REQUIRED.
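
    The guard used in the decoder of appendix C, which tests before
    it multiplies and adds, can be isolated as a small helper.  This
    is a sketch under stated assumptions: add_scaled_checked() is a
    hypothetical name, and the caller is assumed to guarantee that
    *acc <= limit and w >= 1.

```c
/* Hypothetical helper, not part of the sample implementation.   */
/* Adds digit * w to *acc only if the result stays <= limit.     */
/* Returns 1 on success, or 0 on overflow with *acc unchanged.   */
/* Precondition (assumed): *acc <= limit and w >= 1.             */
static int add_scaled_checked(unsigned long *acc,
                              unsigned long digit,
                              unsigned long w,
                              unsigned long limit)
{
  /* Equivalent to digit * w > limit - *acc, but the division */
  /* cannot itself overflow.                                  */
  if (digit > (limit - *acc) / w) return 0;
  *acc += digit * w;
  return 1;
}
```

    Dividing the remaining headroom by w, rather than multiplying
    first, means that the test never computes a value larger than
    limit.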

1665	    In section 6.4 "Overflow handling", fixed the definition of maxint.
1666	    The definition in RFC 3492 was nonsense.

1668	    In the first paragraph of section 7.1 "Sample strings", clarified
1669	    the significance (or insignificance) of mixed case in the examples.

1671	    In section 9.2 "Informative references", alphabetized the
1672	    references, and updated the references to [IDNA] and [NAMEPREP].

1674	    In appendix C "Punycode sample implementation", fixed and clarified
1675	    several comments in the C code, quieted some compiler warnings
1676	    caused by implicit type conversion, and changed the interface to use
1677	    size_t rather than punycode_uint for array lengths, for consistency
1678	    with customs established by the standard C library.

1680	Author's address

1682	    Adam M. Costello
1683	    Google Inc.
1684	    http://www.nicemice.net/amc/

1686	                   INTERNET-DRAFT expires 2004-Oct-14