INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-r-01.txt                              2001-May-31
Expires 2001-Nov-30

                        AMC-ACE-R version 0.2.1

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org. A non-paginated (and possibly
    newer) version of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-r

Abstract

    AMC-ACE-R is a reversible transformation from a sequence of Unicode
    [UNICODE] code points to a sequence of letters, digits, and hyphens
    (LDH characters). AMC-ACE-R could be used as an ASCII-Compatible
    Encoding (ACE) for internationalized domain names [IDN] [IDNA].

    Besides domain names, there might also be other contexts where it is
    useful to transform Unicode characters into "safe" (delimiter-free)
    ASCII characters.
    (If other contexts consider hyphens to be
    unsafe, a different character could be used to play its role, like
    underscore.)

Contents

    Technical changes from earlier versions
    Features
    Name
    Terminology
    Description
    Base-32 characters
    Encoding and decoding algorithms
    Signature
    Mixed-case annotation
    Comparison with other ACEs
    Example strings
    Security considerations
    Acknowledgements
    References
    Author
    Example implementation

Technical changes from earlier versions

    From 0.0.x to 0.1.x:

        In update(), the test "latest - first == 1" was a bug, changed
        to "latest == first".

        In encode(), "codepoint" was a misspelling of "input[i]".

        Initializing refpoint[1] to 0x60 was a design flaw, because this
        initial value is useless for nameprep'd strings. The initial
        value is changed to 0xE0.

        encode() now fails if input[i] is not in 0..10FFFF, in order to
        avoid an array bounds error. This does not affect the encoding
        of valid Unicode strings.

    From 0.1.x to 0.2.x:

        The test "latest == first" tests for the first code point, but
        the intention was to test for the first non-LDH code point,
        or equivalently, for the first time update() is called. The
        boolean flag "updated" has been introduced for performing the
        proper test. This alters the encoding of some strings, usually
        for the better.

        The initial value of refpoint[2] has been changed from 0 to
        0xA0, which is more useful in light of nameprep's prohibition of
        non-LDH code points below 0xA0.

        In decode() the number of base-32 characters consumed has been
        limited to 5, to avoid an array bounds error on invalid input.

Features

    Completeness: Every Unicode string maps to an LDH string.
    Restrictions on which Unicode strings are allowed, and on length,
    may be imposed by higher layers.

    Uniqueness: Every Unicode string maps to at most one LDH string.

    Reversibility: Any Unicode string mapped to an LDH string can be
    recovered from that LDH string.

    Efficient encoding: The ratio of encoded size to original size is
    small for all Unicode strings. This is important in the context
    of domain names because [RFC1034] restricts the length of a domain
    label to 63 characters.

    Simplicity: The encoding and decoding algorithms are reasonably
    simple to implement. The goals of efficiency and simplicity are at
    odds; AMC-ACE-R aims at a reasonable balance between them.

    Mixed-case annotation: Even if the Unicode string has been
    case-folded prior to encoding, it is possible to use mixed case
    in the encoded string as an annotation telling how to convert the
    folded Unicode string into a mixed-case Unicode string for display
    purposes. This feature is optional; see section "Mixed-case
    annotation".

    Readability: The letters A-Z and a-z and the digits 0-9 appearing
    in the Unicode string are represented as themselves in the label.
    This comes for free because it is usually the most efficient
    encoding anyway.

Name

    AMC-ACE-R is a working name that should be changed if it is adopted.
    (The R merely indicates that it is the eighteenth ACE devised by
    this author. BRACE was the third. Most were not worth releasing.)
    Rather than waste good names on experimental proposals, let's
    wait until one proposal is chosen, then assign it a good name.
    Suggestions:

        UniHost
        NUDE (Normal Unicode Domain Encoding)
        UTF-D ("D" for "domain names")
        UTF-37 (there are 37 characters in the output repertoire)

Terminology

    LDH characters are the letters A-Z and a-z, the digits 0-9, and
    hyphen-minus.

    A quartet is a sequence of four bits (also known as a nibble or
    nybble).

    A quintet is a sequence of five bits.

    Hexadecimal values are shown preceded by "0x". For example, 0x60
    is decimal 96.

    As in the Unicode Standard [UNICODE], Unicode code points are
    denoted by "U+" followed by four to six hexadecimal digits, while a
    range of code points is denoted by two hexadecimal numbers separated
    by "..", with no prefixes.

    "x..y" means the range of integers x through y inclusive.

    "x << y" means x left-shifted by y bits (equivalent to x times
    2 to the power y), and "x >> y" means x right-shifted by y bits
    (equivalent to x divided by 2 to the power y, discarding the
    remainder). These operations are used only with nonnegative
    integral values.

Description

    AMC-ACE-R represents a sequence of Unicode code points as a sequence
    of LDH characters, although implementations will also need to
    represent the LDH characters somehow, typically as ASCII octets.
    The encoder input and decoder output are arrays of Unicode code
    points (integral values in the range 0..10FFFF, but not D800..DFFF,
    which are reserved for use by UTF-16).

    This section describes the representation. Section "Encoding
    and decoding algorithms" presents the algorithms as commented
    pseudocode. There is also commented C code in section "Example
    implementation".

    The encoded string alternates between two modes: literal mode and
    base-32 mode. Unicode code points representing LDH characters
    are encoded as those LDH characters, except that hyphen-minus is
    doubled. Other Unicode code points are encoded as one or more LDH
    characters using base-32, in which each character of the encoded
    string represents a quintet according to the table in section
    "Base-32 characters". A mode change is indicated by an unpaired
    hyphen-minus. A pair of consecutive hyphen-minuses represents a
    hyphen-minus and does not change the mode.

    In base-32 mode a variable-length code sequence of one to five
    quintets represents a delta, which is added to a reference point to
    yield a Unicode code point.
    There are five reference points, one
    for each code length. The delta is represented by the lowest four
    bits of each quintet. The highest bit of each quintet is 1, except
    for the last quintet, where it is 0, allowing the decoder to detect
    the end of the sequence.

    Code sequences:
    delta from reference point 1:  0xxxx
    delta from reference point 2:  1xxxx 0xxxx
    delta from reference point 3:  1xxxx 1xxxx 0xxxx
    delta from reference point 4:  1xxxx 1xxxx 1xxxx 0xxxx
    delta from reference point 5:  1xxxx 1xxxx 1xxxx 1xxxx 0xxxx

    For reference point k, the delta is constrained by the available
    bits to range from 0 to (1 << (4*k)) - 1, so each reference point is
    the bottom of a window of 1 << (4*k) code points. A code point is
    encoded as an offset into the smallest window that contains it.

    Reference points 4 and 5 are fixed at 0 and 0x10000 respectively,
    so that windows 4 and 5 always cover the entire Unicode code space
    0..10FFFF. The other reference points are updated whenever a code
    point has been encoded or decoded in base-32 mode, using the
    following heuristic.

    The latest code point is rounded down to a multiple of 0x10 to
    obtain a candidate for replacing reference point 1. If a non-LDH
    code point falling within the candidate window has appeared more
    recently than one falling within the current window, then the
    reference point is changed. Otherwise a similar check is performed
    for reference point 2 using 0x100 as the divisor, and failing that,
    reference point 3 is checked using 0x1000. At most one window is
    changed each time, except that after the very first non-LDH code
    point (when there is no useful history), all three windows are
    changed.
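
    The window selection and the reference-point heuristic described
    above can be sketched in C as follows. This is a minimal,
    non-normative illustration and not the example implementation
    of this draft: the names code_point, sketch_is_ldh(),
    sketch_quintets(), and sketch_update() are hypothetical, the
    refpoint array is indexed 1..5 with element 0 unused, and history
    is assumed to hold every code point seen so far, with the newest
    one at index latest.

```c
#include <stddef.h>

typedef unsigned long code_point;  /* hypothetical; at least 32 bits */

/* Hypothetical helper: 1 if n is an ASCII letter, digit, or */
/* hyphen-minus, 0 otherwise.                                */
static int sketch_is_ldh(code_point n)
{
  return n == 0x2D || (n >= 0x30 && n <= 0x39) ||
         (n >= 0x41 && n <= 0x5A) || (n >= 0x61 && n <= 0x7A);
}

/* Find the smallest window that contains cp and store the k      */
/* quintets of its code sequence, most significant first, in      */
/* quintets[0..k-1].  Returns k, or 0 if no window contains cp.   */
static int sketch_quintets(const code_point refpoint[6], code_point cp,
                           unsigned char quintets[5])
{
  int k, j;
  if (cp > 0x10FFFFUL) return 0;
  for (k = 1; k <= 5; ++k) {
    if (cp >= refpoint[k] && ((cp - refpoint[k]) >> (4 * k)) == 0) {
      code_point delta = cp - refpoint[k];
      for (j = 0; j < k; ++j) {
        unsigned char quartet =
          (unsigned char) ((delta >> (4 * (k - 1 - j))) & 0xF);
        /* High bit is 1 on every quintet except the last one. */
        quintets[j] = (unsigned char) (j < k - 1 ? 0x10 | quartet : quartet);
      }
      return k;
    }
  }
  return 0;
}

/* Adjust reference points 1..3 after history[latest] has been */
/* encoded or decoded in base-32 mode.                         */
static void sketch_update(code_point refpoint[6], int *updated,
                          const code_point history[], size_t latest)
{
  int k;
  for (k = 1; k <= 3; ++k) {
    int b = 4 * k;
    code_point candidate = (history[latest] >> b) << b;
    size_t i;
    /* First non-LDH code point: change all three windows. */
    if (!*updated) { refpoint[k] = candidate; continue; }
    /* Scan backwards over the earlier code points. */
    for (i = latest; i-- > 0; ) {
      if (sketch_is_ldh(history[i])) continue;
      /* The current window was used at least as recently: keep it. */
      if (((refpoint[k] ^ history[i]) >> b) == 0) break;
      /* The candidate window was used more recently: change this */
      /* window and no others.                                    */
      if (((history[latest] ^ history[i]) >> b) == 0) {
        refpoint[k] = candidate;
        *updated = 1;
        return;
      }
    }
  }
  *updated = 1;
}
```

    For instance, with reference points (0xE0, 0xA0, 0, 0, 0x10000),
    U+00E9 falls in window 1 with delta 9 and needs the single
    quintet 01001. Because that is the first non-LDH code point, the
    sketch then sets reference points 1..3 to 0xE0, 0, and 0, so a
    later U+00F1 no longer fits window 1 and is written as the two
    quintets 11111 00001.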

    The initial values of the state variables are:

        mode:              base-32
        reference point 1: 0xE0
        reference point 2: 0xA0
        reference point 3: 0
        reference point 4: 0
        reference point 5: 0x10000

Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase forms
    of the base-32 characters (including mixtures of both forms).
    An encoder should output only lowercase forms or only uppercase
    forms unless it uses the feature described in section "Mixed-case
    annotation".

Encoding and decoding algorithms

    All ordering of bits, quartets, and quintets is big-endian (most
    significant first). When subroutines alter variables that are
    passed in as arguments, those changes are seen by the caller after
    the subroutine returns. As in C, "continue" means terminate the
    current iteration of the innermost loop, and "break" means terminate
    the innermost loop.
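
    Before turning to the pseudocode, the table in section "Base-32
    characters" and the rule that decoders accept either case can be
    sketched in C as a lookup string. This is only an illustration
    under stated assumptions: the names sketch_base32,
    sketch_quintet_to_char(), and sketch_char_to_quintet() are
    hypothetical, and the code assumes that the compiler's execution
    character set is ASCII, an assumption that the example
    implementation in this document deliberately avoids.

```c
#include <ctype.h>
#include <string.h>

/* The 32 base-32 characters in value order (no "l", "o", "0", "1"). */
/* Assumption: string literals are ASCII in this environment.        */
static const char sketch_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Map a quintet (0..31) to its lowercase base-32 character. */
static char sketch_quintet_to_char(unsigned int q)
{
  return sketch_base32[q & 0x1FU];
}

/* Map a base-32 character of either case to its quintet value, */
/* or return -1 if the character is not in the table.           */
static int sketch_char_to_quintet(char c)
{
  const char *p;
  if (c == '\0') return -1;  /* strchr() would match the terminator */
  p = strchr(sketch_base32, tolower((unsigned char) c));
  return p != NULL ? (int) (p - sketch_base32) : -1;
}
```

    A decoder built on this sketch would treat a return value of -1
    as invalid input, and would test bit 0x10 of each returned
    quintet to find the end of a code sequence.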

    procedure initialize(refpoint,literal,updated):
      let refpoint[1..5] = (0xE0, 0xA0, 0, 0, 0x10000)
      let literal = updated = false

    procedure update(refpoint, updated, history[first..latest]):
      # Update the reference points based on the history.
      for k = 1 to 3 do begin
        let b = 4 * k
        # The first time here change all the windows:
        if not updated
        then let refpoint[k] = (history[latest] >> b) << b
        else for i = latest - 1 down to first do begin
          if history[i] represents an LDH character then continue
          # If a code point falling in the existing window has appeared
          # at least as recently as one falling in the candidate window,
          # then leave this window unchanged and go on to the next one:
          if (refpoint[k] XOR history[i]) >> b == 0 then break
          if (history[latest] XOR history[i]) >> b == 0 then begin
            # A code point falling in the candidate window has appeared
            # more recently than one falling in the existing window, so
            # change this window (and no others):
            let refpoint[k] = (history[latest] >> b) << b
            goto update_end
          end
        end
      end
      update_end: let updated = true

    procedure encode(input[first..last]):
      initialize(refpoint,literal,updated)
      for i = first to last do begin
        # Check code point range to avoid array bounds errors later:
        if input[i] is not in 0..10FFFF then fail
        if input[i] == 0x2D then output two hyphen-minuses
        else if input[i] represents an LDH character then begin
          # Letter/digit is encoded literally, so get into literal mode.
          if not literal then output hyphen-minus
          let literal = true
          output the character represented by input[i]
        end
        else begin
          # Non-LDH code point is encoded in base-32.
          # Compute the number of base-32 characters to use:
          for k = 1 to infinity do begin
            let delta = input[i] - refpoint[k]
            if delta >= 0 and delta >> (4*k) == 0 then break
          end
          # Switch to base-32 mode if necessary:
          if literal then output hyphen-minus
          let literal = false
          represent delta in base 16 as k quartets
          prepend 0 to the last quartet and 1 to each of the others
          output a base-32 character corresponding to each quintet
          update(refpoint, updated, input[first..i])
        end
      end

    procedure decode(input string):
      initialize(refpoint,literal,updated)
      let history = the empty array
      while the input string is not exhausted do begin
        read the next character into c
        # Unpaired hyphen-minus toggles the mode:
        if c is hyphen-minus and the next character is not
        then read the next character into c and toggle literal
        # Double hyphen-minus represents 0x2D:
        if c is hyphen-minus
        then read the next character and append 0x2D to history
        else if literal then append the code point of c to history
        else begin
          # Decode a base-32 sequence.
          convert c to a quintet
          while a quintet beginning with 0 has not been seen
          do read and convert up to four more characters
          concatenate the lowest four bits of each quintet to form delta
          append refpoint[number of quintets] + delta to history
          update(refpoint,updated,history)
        end
      end
      # Enforce the uniqueness of the encoding:
      encode history and compare it to the input string
      fail if they are not equal
      output history

    The decoder must always be prepared for premature end-of-input or
    invalid input characters, and must either fail immediately or forge
    ahead and let the comparison at the end fail. The comparison must
    be case-insensitive if ACEs are always compared case-insensitively
    (which is true of domain names), case-sensitive otherwise.
    This
    check is necessary to guarantee the uniqueness property (there
    cannot be two distinct encoded strings representing the same
    sequence of integers). (If the decoder is one step of a larger
    decoding process, it may be possible to defer the re-encoding and
    comparison to the end of that larger decoding process.)

Signature

    The issue of how to distinguish ACE strings from unencoded strings
    is largely orthogonal to the encoding scheme itself, and is
    therefore not specified here. In the context of domain name labels,
    a standard prefix and/or suffix (chosen to be unlikely to occur
    naturally) would presumably be attached to ACE labels.

    In order to use AMC-ACE-R in domain names, the choice of signature
    must be mindful of the requirement in [RFC952] that labels never
    begin or end with hyphen-minus. Since the raw encoded string
    sometimes begins with a hyphen-minus, the signature must include
    a prefix that does not begin with hyphen-minus. If the Unicode
    strings are forbidden from ending with hyphen-minus (which seems
    prudent anyway), then the raw encoded string will never end with
    hyphen-minus; otherwise, the signature must include a suffix as well
    as a prefix.

Mixed-case annotation

    In order to use AMC-ACE-R to represent case-insensitive Unicode
    strings, higher layers need to case-fold the Unicode strings prior
    to AMC-ACE-R encoding. The encoded string can, however, use
    mixed-case base-32 (rather than all-lowercase or all-uppercase
    as recommended in section "Base-32 characters") as an annotation
    telling how to convert the folded Unicode string into a mixed-case
    Unicode string for display purposes.

    Each non-LDH code point is represented by a sequence of base-32
    characters, the last of which is always a letter (as opposed to
    a digit).
    If that letter is uppercase, it is a suggestion that
    the Unicode character be mapped to uppercase (if possible); if the
    letter is lowercase, it is a suggestion that the Unicode character
    be mapped to lowercase (if possible).

    AMC-ACE-R encoders and decoders are not required to support these
    annotations, and higher layers need not use them.

Comparison with other ACEs

    Please refer to the comparison in [AMCACEW].

Example strings

    In the ACE encodings below, no signatures are shown. AMC-ACE-R is
    abbreviated AMC-R. Backslashes show where line breaks have been
    inserted in strings too long for one line.

    The first several examples are all translations of the sentence "Why
    can't they just speak in <language>?" (courtesy of Michael Kaplan's
    "provincial" page [PROVINCIAL]). Word breaks and punctuation have
    been removed, as is often done in domain names.

    (A) Arabic (Egyptian):
        u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
        u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
        AMC-R: ywekhfuhuikwdwefivevjbuiwktr

    (B) Chinese (simplified):
        u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
        AMC-R: w87g8nvk6awisp259eupyx2h

    (C) Czech: Proprostnemluvesky
        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
        u+0065 u+0073 u+006B u+0079
        AMC-R: -Pro-yp-prost-tm-nemluv-s8pp-esky

    (D) Hebrew:
        u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
        u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
        u+05D1 u+05E8 u+05D9 u+05EA
        AMC-R: x7nqeep8e8j7f7inaqdb8ijp8cb8ij8k

    (E) Hindi (Devanagari):
        u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
        u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
        u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
        u+0939 u+0948 u+0902
        AMC-R: 3urvjvcwmthjruiwpugwatfwpurmscuivjascunmvcvitfuewhjwisc

    (F) Japanese (kanji and hiragana):
        u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
        u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
        AMC-R: vsykxnzr3dkyx8fyzun243q3c24zbxhgwr2nkweqwm

    (G) Korean (Hangul syllables):
        u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
        u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
        u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
        AMC-R: 6tvi466ezxi544i5w8a6s4nz2nw8e6zze7xxn47yp6x5e53znze7xze\
        7xxn5u8e54ze6x5n36is3i622m6zwe48wn

    (H) Russian (Cyrillic):
        U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
        u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
        u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
        u+0438
        AMC-R: wvRqwhfnwdgfqpipfdqcqwawrcvrvqwawdbbvkvi

    (I) Spanish: PorqunopuedensimplementehablarenEspaol
        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
        u+0061 u+00F1 u+006F u+006C
        AMC-R: -Porqu-j-nopuedensimplementehablarenEspa-9b-ol

    (J) Taiwanese:
        u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
        AMC-R: w87gxstbzuvc6a385psp244kupyx2h

    (K) Vietnamese:
        Taisaohokhngthchi\
        noitingVit
        U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
        u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
        u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
        u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
        u+0323 u+0074
        AMC-R: -Ta-vud-isaoho-d-kh-s9e-ngth-s8kvsj-chi-vsj-no-b-iti-s8\
        kb-ngVi-s8kud-t

    The next several examples are all names of Japanese music artists,
    song titles, and TV programs, just because the author happens to
    have them handy (but Japanese is useful for providing examples
    of single-row text, two-row text, ideographic text, and various
    mixtures thereof).

    (L) 3B
        u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
        AMC-R: -3-x8ze-B-z7we3t7btymtwizxtr

    (M) -with-SUPER-MONKEYS
        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
        u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
        U+004F U+004E U+004B U+0045 U+0059 U+0053
        AMC-R: x52j4e3wiz92qyszf---with--SUPER--MONKEYS

    (N) Hello-Another-Way-
        U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
        u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
        u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
        AMC-R: -Hello--Another--Way---vsxp2nq2nyqx2veyuwa

    (O) 2
        u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
        AMC-R: vszcyiyex6wmy2vjqw8sm-2

    (P) MajiKoi5
        U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
        u+308B u+0035 u+79D2 u+524D
        AMC-R: -Maji-vsyh-Koi-xj2m-5-z37cxuwp

    (Q) de
        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
        AMC-R: vs7bf4d9n-de-8m9d7a

    (R)
        u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
        AMC-R: vsxpyq5j7e9n6jyh

    The last example is an ASCII string that breaks not only the
    existing rules for host name labels but also the rules proposed in
    [NAMEPREP03] for internationalized domain names.

    (S) -> $1.00 <-
        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
        u+003C u+002D
        AMC-R: --svquaue-1-q-00-avn--

Security considerations

    Users expect each domain name in DNS to be controlled by a single
    authority.
    If a Unicode string intended for use as a domain label
    could map to multiple ACE labels, then an internationalized domain
    name could map to multiple ACE domain names, each controlled by
    a different authority, some of which could be spoofs that hijack
    service requests intended for another. Therefore AMC-ACE-R is
    designed so that each Unicode string has a unique encoding.

    However, there can still be multiple Unicode representations of the
    "same" text, for various definitions of "same". This problem is
    addressed to some extent by the Unicode standard under the topic of
    canonicalization, and this work is leveraged for domain names by
    "nameprep" [NAMEPREP03].

Acknowledgements

    AMC-ACE-R reuses a number of preexisting techniques.

    The basic encoding of integers to quartets to quintets to base-32
    comes from UTF-5 [UTF5], and the particular variant used here comes
    from AMC-ACE-M [AMCACEM].

    The idea of avoiding 0, 1, o, and l in base-32 strings was taken
    from SFS [SFS].

    The idea of encoding deltas from reference points was taken from
    RACE (of which the latest version is [RACE03]), which may have
    gotten the idea from Unicode Technical Standard #6 [UTS6].

    The idea of switching between literal mode and base-32 mode comes
    from BRACE [BRACE].

    The general idea of using the alphabetic case of base-32 characters
    to indicate the desired case of the Unicode characters was suggested
    by this author, and first applied to the UTF-5-style encoding in
    DUDE (of which the latest version is [DUDE01]).

    The heuristic used to adapt the reference points based on past code
    points is new in AMC-ACE-R.

References

    [AMCACEM] Adam Costello, "AMC-ACE-M version 0.1.4", 2001-Apr-01,
    update of draft-ietf-idn-amc-ace-m-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m.

    [AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
    2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.

    [BRACE] Adam Costello, "BRACE: Bi-mode Row-based
    ASCII-Compatible Encoding for IDN version 0.1.2",
    2000-Sep-19, draft-ietf-idn-brace-00, version at
    http://www.cs.berkeley.edu/~amc/charset/brace.

    [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode
    Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01.

    [IDN] Internationalized Domain Names (IETF working group),
    http://www.i-d-n.net/, idn@ops.ietf.org.

    [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)", draft-ietf-idn-idna-01.

    [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
    of Internationalized Host Names", 2001-Feb-24,
    draft-ietf-idn-nameprep-03.

    [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
    http://www.trigeminal.com/samples/provincial.html.

    [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
    for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

    [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
    Table Specification", 1985-Oct, RFC 952.

    [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
    1987-Nov, RFC 1034.

    [SFS] David Mazieres et al, "Self-certifying File System",
    http://www.fs.net/.

    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
    Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

    [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed,
    Mark Davis, Asmus Freytag, "Unicode Technical Standard
    #6: A Standard Compression Scheme for Unicode",
    http://www.unicode.org/unicode/reports/tr6/.

Author

    Adam M. Costello
    http://www.cs.berkeley.edu/~amc/

Example implementation

/******************************************/
/* amc-ace-r.c 0.2.1 (2001-May-31-Thu)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-R version 0.2.*. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>  /* for UINT_MAX */

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,
  amc_ace_big_output  /* Output would exceed the space provided. */
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x10FFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_r_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] );

    /* amc_ace_r_encode() converts Unicode to AMC-ACE-R (without      */
    /* any signature). The input must be represented as an array      */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII. The input_length is the number of       */
    /* code points in the input. The output_size is an in/out         */
    /* argument: the caller must pass in the maximum number of        */
    /* characters that may be output (including the terminating       */
    /* null), and on successful return it will contain the number of  */
    /* characters actually output (including the terminating null,    */
    /* so it will be one more than strlen() would return, which is    */
    /* why it is called output_size rather than output_length). The   */
    /* uppercase_flags array must hold input_length boolean values,   */
    /* where nonzero means the corresponding Unicode character should */
    /* be forced to uppercase after being decoded, and zero means it  */
    /* is caseless or should be forced to lowercase. Alternatively,   */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros. The letters a-z and A-Z are always encoded       */
    /* literally, regardless of the corresponding flags. The encoder  */
    /* always outputs lowercase base-32 characters except when        */
    /* nonzero values of uppercase_flags require otherwise. The       */
    /* return value may be any of the amc_ace_status values defined   */
    /* above; if not amc_ace_success, then output_size and output may */
    /* contain garbage. On success, the encoder will never need to    */
    /* write an output_size greater than input_length*5+1, because of */
    /* how the encoding is defined.                                   */

enum amc_ace_status amc_ace_r_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] );

    /* amc_ace_r_decode() converts AMC-ACE-R (without any signature)  */
    /* to Unicode. The input must be represented as null-terminated   */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points. The case_sensitivity argument influences  */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()). The output_length is       */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output. The uppercase_flags array must have room for at        */
    /* least output_length values, or it may be a null pointer        */
    /* if the case information is not needed. A nonzero flag          */
    /* indicates that the corresponding Unicode character should      */
    /* be forced to uppercase by the caller, while zero means it      */
    /* is caseless or should be forced to lowercase. The letters      */
    /* a-z and A-Z are output already in the proper case, but their   */
    /* flags will be set appropriately so that applying the flags     */
    /* would be harmless. The return value may be any of the          */
    /* amc_ace_status values defined above; if not amc_ace_success,   */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage. On success, the decoder will never need to write      */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include

/* is_ldh(n) returns 1 if the code point n represents an LDH      */
/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */

static int is_ldh(u_code_point n)
{
  return n <= 122 && ( n >= 97 || n == 45 ||
         (n >= 48 && n <= 57) || (n >= 65 && n <= 90) );
}

/* base32[q] is the lowercase base-32 character representing  */
/* the number q from the range 0 to 31. Note that we cannot   */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.
*/ 757 static const char base32[] = { 758 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ 759 109, 110, /* m-n */ 760 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ 761 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ 762 }; 763 /* base32_decode(c) returns the value of a base-32 character, in the */ 764 /* range 0 to 31, or the constant base32_invalid if c is not a valid */ 765 /* base-32 character. */ 767 enum { base32_invalid = 32 }; 769 static unsigned int base32_decode(char c) 770 { 771 if (c < 50) return base32_invalid; 772 if (c <= 57) return c - 26; 773 if (c < 97) c += 32; 774 if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; 775 return c - 97 - (c > 108) - (c > 111); 776 } 778 /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */ 779 /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */ 780 /* then ASCII A-Z are considered equal to a-z respectively. */ 782 static int unequal( enum case_sensitivity case_sensitivity, 783 const char s1[], const char s2[] ) 784 { 785 char c1, c2; 787 if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0; 789 for (;;) { 790 c1 = *s1; 791 c2 = *s2; 792 if (c1 >= 65 && c1 <= 90) c1 += 32; 793 if (c2 >= 65 && c2 <= 90) c2 += 32; 794 if (c1 != c2) return 1; 795 if (c1 == 0) return 0; 796 ++s1, ++s2; 797 } 798 } 800 /* update(refpoint,updated,history,latest) updates refpoint[1..3] */ 801 /* based on the updated flag and history[0..latest]. 
*/ 803 static void update( u_code_point refpoint[6], unsigned int *updated, 804 const u_code_point history[], unsigned int latest ) 805 { 806 unsigned int k, b, i; 808 for (k = 1; k <= 3; ++k) { 809 b = k << 2; 810 /* The first time here change all the windows: */ 811 if (!*updated) refpoint[k] = (history[latest] >> b) << b; 812 else for (i = latest; i-- > 0; ) { 813 if (is_ldh(history[i])) continue; 815 /* If a code point falling in the existing window has appeared */ 816 /* at least as recently as one falling in the candidate window, */ 817 /* then leave this window unchanged and go on to the next one: */ 818 if ((refpoint[k] ^ history[i]) >> b == 0) break; 820 if ((history[latest] ^ history[i]) >> b == 0) { 821 /* A code point falling in the candidate window has appeared */ 822 /* more recently than one falling in the existing window, so */ 823 /* change this window (and no others): */ 825 refpoint[k] = (history[latest] >> b) << b; 826 goto update_end; 827 } 828 } 829 } 831 update_end: *updated = 1; 832 } 834 /* Main encode function: */ 836 enum amc_ace_status amc_ace_r_encode( 837 unsigned int input_length, 838 const u_code_point input[], 839 const unsigned char uppercase_flags[], 840 unsigned int *output_size, 841 char output[] ) 842 { 843 unsigned int literal, updated, max_out, in, out, k, j; 844 u_code_point n, delta; 845 char shift; 847 /* Initialize the state: */ 849 u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000}; 851 literal = updated = 0; 852 max_out = *output_size; 854 for (in = out = 0; in < input_length; ++in) { 856 /* At the start of each iteration, in and out are the number of */ 857 /* items already input/output, or equivalently, the indices of */ 858 /* the next items to be input/output. */ 860 n = input[in]; 861 /* Check the code point range to avoid array bounds errors later: */ 862 if (n > 0x10FFFF) return amc_ace_bad_input; 863 if (n == 0x2D) { 864 /* Hyphen-minus is doubled. 
*/ 865 if (max_out - out < 2) return amc_ace_big_output; 866 output[out++] = 0x2D; 867 output[out++] = 0x2D; 868 } 869 else if (is_ldh(n)) { 870 /* Encode an LDH character literally. */ 871 if (max_out - out < 1 + !literal) return amc_ace_big_output; 872 /* Switch to literal mode if necessary: */ 873 if (!literal) output[out++] = 0x2D; 874 literal = 1; 875 output[out++] = n; 876 } 877 else { 878 /* Encode a non-LDH character using base-32. */ 879 /* First compute the number of base-32 characters (k): */ 881 for (k = 1; ; ++k) { 882 delta = n - refpoint[k]; 883 if (delta >> (4*k) == 0) break; 884 } 886 if (max_out - out < k + literal) return amc_ace_big_output; 887 /* Switch to base-32 mode if necessary: */ 888 if (literal) output[out++] = 0x2D; 889 literal = 0; 890 shift = uppercase_flags && uppercase_flags[in] ? 32 : 0; 892 /* Each quintet has the form 1xxxx except the last is 0xxxx. */ 893 /* Computing the base-32 digits in reverse order is easiest. */ 895 out += k; 896 output[out - 1] = base32[delta & 0xF] - shift; 898 for (j = 2; j <= k; ++j) { 899 delta >>= 4; 900 output[out - j] = base32[0x10 | (delta & 0xF)]; 901 } 903 update(refpoint, &updated, input, in); 904 } 905 } 907 /* Append the null terminator: */ 908 if (max_out - out < 1) return amc_ace_big_output; 909 output[out++] = 0; 911 *output_size = out; 912 return amc_ace_success; 913 } 914 /* Main decode function: */ 916 enum amc_ace_status amc_ace_r_decode( 917 enum case_sensitivity case_sensitivity, 918 char scratch_space[], 919 const char input[], 920 unsigned int *output_length, 921 u_code_point output[], 922 unsigned char uppercase_flags[] ) 923 { 924 u_code_point q, delta; 925 char c; 926 unsigned int literal, updated, max_out, in, out, k, scratch_size; 927 enum amc_ace_status status; 929 /* Initialize the state: */ 931 u_code_point refpoint[6] = {0, 0xE0, 0xA0, 0, 0, 0x10000}; 933 literal = updated = 0; 934 max_out = *output_length; 936 for (c = input[in = 0], out = 0; c != 0; c = input[++in], 
       ++out) {

    /* At the start of each iteration, in and out are the number of    */
    /* items already input/output, or equivalently, the indices of     */
    /* the next items to be input/output.  c is the same as input[in]. */

    if (c == 0x2D && input[in + 1] != 0x2D) {
      /* Unpaired hyphen-minus toggles mode. */
      literal = !literal;
      c = input[++in];
    }

    if (max_out - out < 1) return amc_ace_big_output;

    if (c == 0x2D) {
      /* Double hyphen-minus represents a hyphen-minus. */
      ++in;
      output[out] = 0x2D;
    }
    else {
      if (literal) output[out] = c;
      else {
        /* Decode a base-32 sequence.                   */
        /* First decode quintets until 0xxxx is found:  */

        for (delta = 0, k = 1;  ;  c = input[++in], ++k) {
          q = base32_decode(c);
          if (q == base32_invalid || k > 5) return amc_ace_bad_input;
          delta = (delta << 4) | (q & 0xF);
          if (q >> 4 == 0) break;
        }

        output[out] = refpoint[k] + delta;
        update(refpoint, &updated, output, out);
      }
    }

    /* Case of last character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input:     */

  scratch_size = ++in;
  status = amc_ace_r_encode(out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_bad_input;

  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>  /* assert() */
#include <stdio.h>   /* fprintf(), fputs(), scanf(), fgets(), puts() */
#include <stdlib.h>  /* exit(), EXIT_FAILURE, EXIT_SUCCESS */
#include <string.h>  /* strlen(), strchr() */

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
  /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes an AMC-ACE-R string.\n"
    "%s -d reads an AMC-ACE-R string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "An AMC-ACE-R string is a newline-terminated sequence of LDH\n"
    "characters (without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH      */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }
      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = amc_ace_r_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the AMC-ACE-R input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_r_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}


                   INTERNET-DRAFT expires 2001-Nov-30