1 Internet Draft Paul Hoffman 2 draft-ietf-idn-nameprep-00.txt IMC & VPNC 3 July 3, 2000 Marc Blanchet 4 Expires in six months ViaGenie 6 Preparation of Internationalized Host Names 8 Status of this memo 10 This document is an Internet-Draft and is in full conformance with all 11 provisions of Section 10 of RFC2026. 13 Internet-Drafts are working documents of the Internet Engineering Task 14 Force (IETF), its areas, and its working groups. Note that other groups 15 may also distribute working documents as Internet-Drafts. 17 Internet-Drafts are draft documents valid for a maximum of six months 18 and may be updated, replaced, or obsoleted by other documents at any 19 time. It is inappropriate to use Internet-Drafts as reference material 20 or to cite them other than as "work in progress."
22 The list of current Internet-Drafts can be accessed at 23 http://www.ietf.org/ietf/1id-abstracts.txt 25 The list of Internet-Draft Shadow Directories can be accessed at 26 http://www.ietf.org/shadow.html. 28 Abstract 30 This document describes how to prepare internationalized host names for 31 transmission on the wire. The steps include excluding characters that 32 are prohibited from appearing in internationalized host names, changing 33 all characters that have case properties to be lowercase, and 34 normalizing the characters. Further, this document lists the prohibited 35 characters. 37 1. Introduction 39 When expanding today's DNS to include internationalized host names, 40 those new names will be handled in many parts of the DNS. The IDN 41 Working Group's requirements document [IDNReq] describes a framework for 42 domain name handling as well as requirements for the new names. The IDN 43 Working Group's comparison document [IDNComp] gives a framework for how 44 various parts of the IDN solution work together. 46 A user can enter a domain name into an application program in a myriad 47 of fashions. Depending on the input method, the characters entered in 48 the domain name may or may not be those that are allowed in 49 internationalized host names. Thus, there must be a way to canonicalize 50 the user's input before the name is resolved in the DNS. 52 It is a design goal of this document to allow users to enter host names 53 in applications and have the highest chance of getting the name correct. 54 This means that the user should not be limited to only entering exactly 55 the characters that might have been used, but should instead be able to 56 enter characters that unambiguously canonicalize to characters in the 57 desired host name. At the same time, this process must not introduce any 58 chance that two host names could be represented by two distinct strings 59 of characters that look identical to typical users.
It is also a design 60 goal to have all preprocessing of IDN done before going on the wire, so 61 that no transformation is done in the DNS server space. 63 This document describes the steps needed to convert a name part from one 64 that is entered by the user to one that can be used in the DNS. 66 1.1 Terminology 68 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and 69 "MAY" in this document are to be interpreted as described in RFC 2119 70 [RFC2119]. 72 Examples in this document use the notation from the Unicode Standard 73 [Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the 74 letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER 75 A". In the lists of prohibited characters, the "U+" is left off to make 76 the lists easier to read. 78 1.2 IDN summary 80 Using the terminology in [IDNComp], this document specifies all of the 81 prohibited characters and the canonicalization for an IDN solution. 82 Specifically, it covers the following sections from [IDNComp]: 84 prohib-1: Identical and near-identical characters 85 prohib-2: Separators 86 prohib-3: Non-displaying and non-spacing characters 87 prohib-4: Private use characters 88 prohib-5: Punctuation 89 prohib-6: Symbols 90 canon-1.2: Normalization Form KC 91 canon-2.1: Case folding in ASCII 92 canon-2.2: Case folding in non-ASCII 94 Note that this document does not cover: 95 canon-1.1: Normalization Form C 96 canon-2.3: Han folding 98 1.3 Open issues 100 This is the first draft of this document. Although there has been much 101 discussion on the WG mailing list about the topics here, there has not 102 yet been much agreement on some issues. Now that there is a document to 103 talk about, that discussion can be more focussed. 105 1.3.1 Where to do name preparation 107 Section 2.1 says to do name preparation in the resolver. An argument can 108 be made for doing name preparation in the application, before the 109 application service interface. 
An advantage of that proposal is that 110 resolvers would not need to do any name preparation. A disadvantage is 111 that applications would have to be updated each time the IDN protocol is 112 updated, such as if new characters are added to the repertoire of 113 allowed characters. It seems likely that resolvers are more easily 114 updated than all the individual applications that use internationalized 115 host names. 117 1.3.2 Choosing between normalization form C and KC 119 Much of the discussion of normalization on the WG mailing list assumed 120 that normalization form C would be used. Near the time that this 121 document was written, people started considering form KC instead of C. 122 This document uses form KC, but the reasons for doing so could be 123 contentious. 125 1.3.3 Does the prohibition catch all bad characters? 127 On the mailing list, there was discussion of doing prohibition in two steps: a 128 short list of prohibited characters before case folding in order to 129 prevent uppercase characters that have no lowercase equivalents from 130 getting through, and then a full check on the output of normalization. 131 In this draft, all checking is done before case folding, based on the 132 (possibly wrong) assumption that none of the prohibited characters will 133 re-appear after the case folding and normalization. If that assumption 134 turns out to be wrong, a check for just those problematic characters can 135 be added after normalization, or a full check against the prohibited 136 characters can be added. 138 2. Preparation Overview 140 This section describes where name preparation happens and the steps that 141 name preparation software must take. 143 2.1 Where name preparation happens 145 Part of the chart in section 1.4 of [IDNReq] looks like this: 147 +---------------+ 148 | Application | 149 +---------------+ 150 | Application service interface 151 | For ex.
GethostbyXXXX interface 152 +---------------+ 153 | Resolver | 154 +---------------+ 155 | <----- DNS service interface 156 +-------------------------------------------+ 158 In this specification, the name preparation is done in the resolver, 159 before the DNS service interface. That is, it is acceptable for software 160 in the application service interface (such as a "GetHostByName" API) to 161 pass the resolver a name that has not been prepared. However, the 162 resolver MUST prepare the name as described in this specification before 163 passing it to the DNS service interface. 165 2.2 Name preparation steps 167 The steps for preparing names are: 169 1) Input from the application service interface -- This can be done in 170 many ways and is not specified in this document 172 2) Look for prohibited input -- Check for any characters that are not 173 allowed in the input. If any are found, return an error to the 174 application service interface. This step is necessary to prevent errors 175 in the following two steps. This step fulfills prohib-1, prohib-2, 176 prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp]. 178 3) Fold case -- Change all uppercase characters into lowercase 179 characters. Design note: this step could just as easily have been 180 "change all lowercase characters into uppercase characters". However, 181 the upper-to-lower folding was chosen because most users of the Internet 182 today enter host names in lowercase. This step fulfills canon-2.1 and 183 canon-2.2 from [IDNComp]. 185 4) Canonicalize -- Normalize the characters. This step fulfils canon-1.2 186 from [IDNComp]. 188 5) Resolution of the prepared name -- This must be specified in a 189 different IDN document. 191 The above steps MUST be performed in the order given in order to comply 192 with this specification. 194 3. Prohibited Input 196 Before the text can be processed, it must be checked for prohibited 197 characters. 
There is a variety of prohibited characters, as described in 198 this section. 200 Note that one of the goals of IDN is to allow the widest possible set of 201 host names as long as those host names do not cause other problems, such 202 as possible ambiguity. Specifically, experience with current DNS names 203 has shown that there is a desire for host names that include personal 204 names, company names, and spoken phrases. A goal of this section is to 205 prohibit as few characters that might be used in these contexts as 206 possible while making sure that characters that might easily cause 207 confusion or ambiguity are prohibited. 209 Note that every character listed in this section MUST NOT be transmitted 210 on the DNS service interface. Although the checking is being performed 211 before case folding and canonicalization, those steps cannot result in 212 any of these characters if these characters are not in the input stream. 213 [[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS 214 server receives a request containing a prohibited character, then the 215 IDN protocol MUST return an error message. 217 Note that some characters listed in one section would also appear in 218 other sections. Each character is only listed once. 220 3.1 prohib-1: Identical and near-identical characters 222 Many characters in [ISO10646] are identical or nearly identical to other 223 characters. These were often included for compatibility with other 224 character sets.
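As an informal illustration of such compatibility characters, the following sketch reports the form KC equivalent of a character when one exists. It is not part of this specification. The function name compatibility_equivalent is hypothetical, and the sketch assumes Python's standard unicodedata module, whose tables follow a later Unicode version than [Unicode3]. It only detects characters that have a normalization mapping; characters that merely look alike, such as the hyphens and dashes, are not found this way.

```python
import unicodedata


def compatibility_equivalent(ch):
    """Return the form KC (NFKC) result for ch when it differs from ch, else None.

    Hypothetical helper for illustration; not defined by this document.
    """
    normalized = unicodedata.normalize("NFKC", ch)
    return normalized if normalized != ch else None


if __name__ == "__main__":
    # FULLWIDTH LATIN CAPITAL LETTER A, ROMAN NUMERAL ONE, SMALL COMMA,
    # and a plain letter that has no compatibility mapping.
    for ch in ["\uFF21", "\u2160", "\uFE50", "a"]:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"{compatibility_equivalent(ch)!r}")
```

A table built this way would still need manual review, since the list in this section also contains characters that are prohibited for visual similarity alone.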
226 The characters prohibited because they are identical or nearly identical 227 to allowed characters are: 229 00AD SOFT HYPHEN 230 00D7 MULTIPLICATION SIGN 231 01C3 LATIN LETTER RETROFLEX CLICK 232 02B0-02FF [SPACING MODIFIER LETTERS] 233 066D ARABIC FIVE POINTED STAR 234 1806 MONGOLIAN TODO SOFT HYPHEN 235 2010 HYPHEN 236 2011 NON-BREAKING HYPHEN 237 2012 FIGURE DASH 238 2013 EN DASH 239 2014 EM DASH 240 2160-217F [ROMAN NUMERALS] 241 FB1D-FB4F [HEBREW PRESENTATION FORMS] 242 FB50-FDFF [ARABIC PRESENTATION FORMS A] 243 FE20-FE2F [COMBINING HALF MARKS] 244 FE30-FE4F [CJK COMPATIBILITY FORMS] 245 FE50-FE6F [SMALL FORM VARIANTS] 246 FE70-FEFC [ARABIC PRESENTATION FORMS B] 247 FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS] 249 3.2 prohib-2: Separators 251 Horizontal and vertical spacing characters would make it unclear where a 252 host name begins and ends. The prohibited spacing characters are: 254 0020 SPACE 255 00A0 NO-BREAK SPACE 256 1680 OGHAM SPACE MARK 257 2000-200B [SPACES] 258 2028 LINE SEPARATOR 259 2029 PARAGRAPH SEPARATOR 260 202F NARROW NO-BREAK SPACE 261 3000 IDEOGRAPHIC SPACE 263 Allowing periods and period-like characters as characters within a name 264 part would also cause similar confusion. 
The prohibited periods, 265 characters that look like periods, and characters that canonicalize to a 266 period or to a period-like character are: 268 002E FULL STOP 269 06D4 ARABIC FULL STOP 270 2024 ONE DOT LEADER 271 2025 TWO DOT LEADER 272 2026 HORIZONTAL ELLIPSIS 273 2488 DIGIT ONE FULL STOP 274 2489 DIGIT TWO FULL STOP 275 248A DIGIT THREE FULL STOP 276 248B DIGIT FOUR FULL STOP 277 248C DIGIT FIVE FULL STOP 278 248D DIGIT SIX FULL STOP 279 248E DIGIT SEVEN FULL STOP 280 248F DIGIT EIGHT FULL STOP 281 2490 DIGIT NINE FULL STOP 282 2491 NUMBER TEN FULL STOP 283 2492 NUMBER ELEVEN FULL STOP 284 2493 NUMBER TWELVE FULL STOP 285 2494 NUMBER THIRTEEN FULL STOP 286 2495 NUMBER FOURTEEN FULL STOP 287 2496 NUMBER FIFTEEN FULL STOP 288 2497 NUMBER SIXTEEN FULL STOP 289 2498 NUMBER SEVENTEEN FULL STOP 290 2499 NUMBER EIGHTEEN FULL STOP 291 249A NUMBER NINETEEN FULL STOP 292 249B NUMBER TWENTY FULL STOP 293 33C2 SQUARE AM 294 33C2 SQUARE AM 295 33C7 SQUARE CO 296 33D8 SQUARE PM 297 33D8 SQUARE PM 299 3.3 prohib-3: Non-displaying and non-spacing characters 301 There are many characters that cannot be seen in the ISO 10646 character 302 set. These include control characters, non-breaking spaces, formatting 303 characters, and tagging characters. These characters would certainly 304 cause confusion if allowed in host names. 
306 0000-001F [CONTROL CHARACTERS] 307 007F DELETE 308 0080-009F [CONTROL CHARACTERS] 309 070F SYRIAC ABBREVIATION MARK 310 180B MONGOLIAN FREE VARIATION SELECTOR ONE 311 180C MONGOLIAN FREE VARIATION SELECTOR TWO 312 180D MONGOLIAN FREE VARIATION SELECTOR THREE 313 180E MONGOLIAN VOWEL SEPARATOR 314 200C ZERO WIDTH NON-JOINER 315 200D ZERO WIDTH JOINER 316 200E LEFT-TO-RIGHT MARK 317 200F RIGHT-TO-LEFT MARK 318 202A LEFT-TO-RIGHT EMBEDDING 319 202B RIGHT-TO-LEFT EMBEDDING 320 202C POP DIRECTIONAL FORMATTING 321 202D LEFT-TO-RIGHT OVERRIDE 322 202E RIGHT-TO-LEFT OVERRIDE 323 206A INHIBIT SYMMETRIC SWAPPING 324 206B ACTIVATE SYMMETRIC SWAPPING 325 206C INHIBIT ARABIC FORM SHAPING 326 206D ACTIVATE ARABIC FORM SHAPING 327 206E NATIONAL DIGIT SHAPES 328 206F NOMINAL DIGIT SHAPES 329 FEFF ZERO WIDTH NO-BREAK SPACE 330 FFF9 INTERLINEAR ANNOTATION ANCHOR 331 FFFA INTERLINEAR ANNOTATION SEPARATOR 332 FFFB INTERLINEAR ANNOTATION TERMINATOR 333 FFFC OBJECT REPLACEMENT CHARACTER 334 FFFD REPLACEMENT CHARACTER 336 3.4 prohib-4: Private use characters 338 Because private-use characters do not have defined meanings, they are 339 prohibited. The private-use characters are: 341 E000-F8FF [PRIVATE USE, PLANE 0] 343 3.5 prohib-5: Punctuation 345 The following characters are reserved or delimiters in URLs [RFC2396] 346 and [RFC2732]: 348 " # $ % & + , . / : ; < = > ? @ [ ] 350 3.5.1 Characters from URLs 352 The following punctuation characters are prohibited because they are 353 reserved or delimiters in URLs. 
355 0022 QUOTATION MARK 356 0023 NUMBER SIGN 357 0024 DOLLAR SIGN 358 0025 PERCENT SIGN 359 0026 AMPERSAND 360 002B PLUS SIGN 361 002C COMMA 362 002E FULL STOP 363 002F SOLIDUS 364 003A COLON 365 003B SEMICOLON 366 003C LESS-THAN SIGN 367 003D EQUALS SIGN 368 003E GREATER-THAN SIGN 369 003F QUESTION MARK 370 0040 COMMERCIAL AT 371 005B LEFT SQUARE BRACKET 372 005D RIGHT SQUARE BRACKET 374 3.5.2 Characters that canonicalize to characters from URLs 376 The following punctuation characters are prohibited because their 377 normalization contains one or more of the characters from section 3.5.1. 379 037E GREEK QUESTION MARK 380 2048 QUESTION EXCLAMATION MARK 381 2049 EXCLAMATION QUESTION MARK 382 207A SUPERSCRIPT PLUS SIGN 383 207C SUPERSCRIPT EQUALS SIGN 384 208A SUBSCRIPT PLUS SIGN 385 208C SUBSCRIPT EQUALS SIGN 386 2100 ACCOUNT OF 387 2101 ADDRESSED TO THE SUBJECT 388 2105 CARE OF 389 2106 CADA UNA 391 3.5.3 Characters that look like characters from URLs 393 The following are prohibited because they look indistinguishable from 394 the characters listed in section 3.5.1. 
396 037E GREEK QUESTION MARK 397 0589 ARMENIAN FULL STOP 398 060C ARABIC COMMA 399 061B ARABIC SEMICOLON 400 066A ARABIC PERCENT SIGN 401 201A SINGLE LOW-9 QUOTATION MARK 402 2030 PER MILLE SIGN 403 2031 PER TEN THOUSAND SIGN 404 2033 DOUBLE PRIME 405 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK 406 2044 FRACTION SLASH 407 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 408 203D INTERROBANG 409 3001 IDEOGRAPHIC COMMA 410 3002 IDEOGRAPHIC FULL STOP 411 3003 DITTO MARK 412 3008 LEFT ANGLE BRACKET 413 3009 RIGHT ANGLE BRACKET 414 3014 LEFT TORTOISE SHELL BRACKET 415 3015 RIGHT TORTOISE SHELL BRACKET 416 301A LEFT WHITE SQUARE BRACKET 417 301B RIGHT WHITE SQUARE BRACKET 419 3.5.4 Other punctuation 421 The following punctuation characters are prohibited because they are unlikely to 422 be used in names and may be confusing to users or to character-entry 423 processes: 425 005C REVERSE SOLIDUS 427 3.6 prohib-6: Symbols 429 [UniData] has non-normative categories for symbols. The four symbol 430 categories are: 432 Symbol, Currency: Currency symbols could appear in company names and 433 spoken phrases, so they are not prohibited. 435 Symbol, Modifier: Stand-alone modifiers might appear in personal names, 436 company names, and spoken phrases, so they are not prohibited. 438 Symbol, Math: It is very unlikely that there are any significant 439 personal names, company names, or spoken phrases that contain 440 mathematical symbols. Further, many of these symbols are the same as or 441 similar to other punctuation, thereby leading to ambiguity. For this 442 reason, math-specific symbols are prohibited.
These prohibited math 443 symbols are: 445 00AC NOT SIGN 446 00B1 PLUS-MINUS SIGN 447 2200-22FF [MATHEMATICAL OPERATORS] 449 Further, the following characters canonicalize to characters in the 450 above math list, and therefore are also prohibited: 452 00BC VULGAR FRACTION ONE QUARTER 453 00BD VULGAR FRACTION ONE HALF 454 00BE VULGAR FRACTION THREE QUARTERS 455 207B SUPERSCRIPT MINUS 456 208B SUBSCRIPT MINUS 457 2153 VULGAR FRACTION ONE THIRD 458 2154 VULGAR FRACTION TWO THIRDS 459 2155 VULGAR FRACTION ONE FIFTH 460 2156 VULGAR FRACTION TWO FIFTHS 461 2157 VULGAR FRACTION THREE FIFTHS 462 2158 VULGAR FRACTION FOUR FIFTHS 463 2159 VULGAR FRACTION ONE SIXTH 464 215A VULGAR FRACTION FIVE SIXTHS 465 215B VULGAR FRACTION ONE EIGHTH 466 215C VULGAR FRACTION THREE EIGHTHS 467 215D VULGAR FRACTION FIVE EIGHTHS 468 215E VULGAR FRACTION SEVEN EIGHTHS 469 215F FRACTION NUMERATOR ONE 470 33A7 SQUARE M OVER S 471 33A8 SQUARE M OVER S SQUARED 472 33AE SQUARE RAD OVER S 473 33AF SQUARE RAD OVER S SQUARED 474 33C6 SQUARE C OVER KG 476 Symbol, Other: This category covers a multitude of symbols, few of which 477 would ever appear in personal names, company names, and spoken phrases. 478 The rest of the prohibited symbols are: 480 2190-21FF [ARROWS] 481 2300-23FF [MISCELLANEOUS TECHNICAL] 482 2400-243F [CONTROL PICTURES] 483 2440-245F [OPTICAL CHARACTER RECOGNITION] 484 2500-257F [BOX DRAWING] 485 2580-259F [BLOCK ELEMENTS] 486 25A0-25FF [GEOMETRIC SHAPES] 487 2600-267F [MISCELLANEOUS SYMBOLS] 488 2700-27BF [DINGBATS] 489 2800-287F [BRAILLE PATTERNS] 491 3.7 Additional prohibited characters 493 3.7.1 Unassigned characters 495 All characters not yet assigned in [ISO10646] are prohibited. Although 496 this may at first seem trivial, it is extremely important because 497 characters that may be assigned in the future might have properties that 498 would cause them to be prohibited or might have case-folding properties. 
499 As is the case for all prohibited characters, if a DNS server receives a 500 request containing an unassigned character, then the IDN protocol MUST 501 return an error message. 503 3.7.2 Surrogate characters 505 So far, all proposals for binary encodings of internationalized name 506 parts have specified UTF-8 as the encoding format. In such an encoding, 507 surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings, 508 the following are prohibited: 510 D800-DFFF [SURROGATE CHARACTERS] 512 3.7.3 Uppercase characters with no lowercase mappings 514 There are many uppercase characters in [ISO10646] which do not have 515 lowercase equivalents in [UniData]. Therefore, they are prohibited on 516 input because they would get through the case mapping step while still 517 being in uppercase. 519 The characters that are prohibited on input because they are uppercase 520 but have no lowercase mappings are: 522 03D2 GREEK UPSILON WITH HOOK SYMBOL 523 03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL 524 03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL 525 04C0 CYRILLIC LETTER PALOCHKA 526 10A0-10C5 [GEORGIAN CAPITAL LETTERS] 528 Note that many characters in the range U+2100 to U+213A, the letterlike 529 symbols, also are uppercase but have no lowercase mappings. However, 530 they are not listed here because the entire range is already prohibited 531 in section 3.6. 533 3.7.4 Radicals and Ideographic Description 535 Some Han characters can be informally defined in terms of ideographic 536 descriptions. However, ideographic descriptions can lead to multiple 537 character streams leading to the same character in a fashion that does 538 not canonicalize. Thus, the radicals for ideographic description and the 539 ideographic description characters themselves are prohibited.
These 540 characters are: 542 2E80-2EFF [CJK RADICALS SUPPLEMENT] 543 2F00-2FDF [KANGXI RADICALS] 544 2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS] 546 3.8 Summary of prohibited characters 548 The following is a collected list from the previous sections. 550 0000-001F [CONTROL CHARACTERS] 551 0020 SPACE 552 0022 QUOTATION MARK 553 0023 NUMBER SIGN 554 0024 DOLLAR SIGN 555 0025 PERCENT SIGN 556 0026 AMPERSAND 557 002B PLUS SIGN 558 002C COMMA 559 002E FULL STOP 560 002E FULL STOP 561 002F SOLIDUS 562 003A COLON 563 003B SEMICOLON 564 003C LESS-THAN SIGN 565 003D EQUALS SIGN 566 003E GREATER-THAN SIGN 567 003F QUESTION MARK 568 0040 COMMERCIAL AT 569 005B LEFT SQUARE BRACKET 570 005C REVERSE SOLIDUS 571 005D RIGHT SQUARE BRACKET 572 007F DELETE 573 0080-009F [CONTROL CHARACTERS] 574 00A0 NO-BREAK SPACE 575 00AC NOT SIGN 576 00AD SOFT HYPHEN 577 00B1 PLUS-MINUS SIGN 578 00BC VULGAR FRACTION ONE QUARTER 579 00BD VULGAR FRACTION ONE HALF 580 00BE VULGAR FRACTION THREE QUARTERS 581 00D7 MULTIPLICATION SIGN 582 01C3 LATIN LETTER RETROFLEX CLICK 583 02B0-02FF [SPACING MODIFIER LETTERS] 584 037E GREEK QUESTION MARK 585 037E GREEK QUESTION MARK 586 03D2 GREEK UPSILON WITH HOOK SYMBOL 587 03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL 588 03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL 589 04C0 CYRILLIC LETTER PALOCHKA 590 0589 ARMENIAN FULL STOP 591 060C ARABIC COMMA 592 061B ARABIC SEMICOLON 593 066A ARABIC PERCENT SIGN 594 066D ARABIC FIVE POINTED STAR 595 06D4 ARABIC FULL STOP 596 070F SYRIAC ABBREVIATION MARK 597 10A0-10C5 [GEORGIAN CAPITAL LETTERS] 598 1680 OGHAM SPACE MARK 599 1806 MONGOLIAN TODO SOFT HYPHEN 600 180B MONGOLIAN FREE VARIATION SELECTOR ONE 601 180C MONGOLIAN FREE VARIATION SELECTOR TWO 602 180D MONGOLIAN FREE VARIATION SELECTOR THREE 603 180E MONGOLIAN VOWEL SEPARATOR 604 2000-200B [SPACES] 605 200C ZERO WIDTH NON-JOINER 606 200D ZERO WIDTH JOINER 607 200E LEFT-TO-RIGHT MARK 608 200F RIGHT-TO-LEFT MARK 609 2010 HYPHEN 610 2011 NON-BREAKING HYPHEN 611 
2012 FIGURE DASH 612 2013 EN DASH 613 2014 EM DASH 614 201A SINGLE LOW-9 QUOTATION MARK 615 2024 ONE DOT LEADER 616 2025 TWO DOT LEADER 617 2026 HORIZONTAL ELLIPSIS 618 2028 LINE SEPARATOR 619 2029 PARAGRAPH SEPARATOR 620 202A LEFT-TO-RIGHT EMBEDDING 621 202B RIGHT-TO-LEFT EMBEDDING 622 202C POP DIRECTIONAL FORMATTING 623 202D LEFT-TO-RIGHT OVERRIDE 624 202E RIGHT-TO-LEFT OVERRIDE 625 202F NARROW NO-BREAK SPACE 626 2030 PER MILLE SIGN 627 2031 PER TEN THOUSAND SIGN 628 2033 DOUBLE PRIME 629 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK 630 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 631 203D INTERROBANG 632 2044 FRACTION SLASH 633 2048 QUESTION EXCLAMATION MARK 634 2049 EXCLAMATION QUESTION MARK 635 206A INHIBIT SYMMETRIC SWAPPING 636 206B ACTIVATE SYMMETRIC SWAPPING 637 206C INHIBIT ARABIC FORM SHAPING 638 206D ACTIVATE ARABIC FORM SHAPING 639 206E NATIONAL DIGIT SHAPES 640 206F NOMINAL DIGIT SHAPES 641 207A SUPERSCRIPT PLUS SIGN 642 207B SUPERSCRIPT MINUS 643 207C SUPERSCRIPT EQUALS SIGN 644 208A SUBSCRIPT PLUS SIGN 645 208B SUBSCRIPT MINUS 646 208C SUBSCRIPT EQUALS SIGN 647 2100 ACCOUNT OF 648 2101 ADDRESSED TO THE SUBJECT 649 2105 CARE OF 650 2106 CADA UNA 651 2153 VULGAR FRACTION ONE THIRD 652 2154 VULGAR FRACTION TWO THIRDS 653 2155 VULGAR FRACTION ONE FIFTH 654 2156 VULGAR FRACTION TWO FIFTHS 655 2157 VULGAR FRACTION THREE FIFTHS 656 2158 VULGAR FRACTION FOUR FIFTHS 657 2159 VULGAR FRACTION ONE SIXTH 658 215A VULGAR FRACTION FIVE SIXTHS 659 215B VULGAR FRACTION ONE EIGHTH 660 215C VULGAR FRACTION THREE EIGHTHS 661 215D VULGAR FRACTION FIVE EIGHTHS 662 215E VULGAR FRACTION SEVEN EIGHTHS 663 215F FRACTION NUMERATOR ONE 664 2160-217F [ROMAN NUMERALS] 665 2190-21FF [ARROWS] 666 2200-22FF [MATHEMATICAL OPERATORS] 667 2300-23FF [MISCELLANEOUS TECHNICAL] 668 2400-243F [CONTROL PICTURES] 669 2440-245F [OPTICAL CHARACTER RECOGNITION] 670 2488 DIGIT ONE FULL STOP 671 2489 DIGIT TWO FULL STOP 672 248A DIGIT THREE FULL STOP 673 248B DIGIT FOUR FULL STOP 674 248C 
DIGIT FIVE FULL STOP 675 248D DIGIT SIX FULL STOP 676 248E DIGIT SEVEN FULL STOP 677 248F DIGIT EIGHT FULL STOP 678 2490 DIGIT NINE FULL STOP 679 2491 NUMBER TEN FULL STOP 680 2492 NUMBER ELEVEN FULL STOP 681 2493 NUMBER TWELVE FULL STOP 682 2494 NUMBER THIRTEEN FULL STOP 683 2495 NUMBER FOURTEEN FULL STOP 684 2496 NUMBER FIFTEEN FULL STOP 685 2497 NUMBER SIXTEEN FULL STOP 686 2498 NUMBER SEVENTEEN FULL STOP 687 2499 NUMBER EIGHTEEN FULL STOP 688 249A NUMBER NINETEEN FULL STOP 689 249B NUMBER TWENTY FULL STOP 690 2500-257F [BOX DRAWING] 691 2580-259F [BLOCK ELEMENTS] 692 25A0-25FF [GEOMETRIC SHAPES] 693 2600-267F [MISCELLANEOUS SYMBOLS] 694 2700-27BF [DINGBATS] 695 2800-287F [BRAILLE PATTERNS] 696 2E80-2EFF [CJK RADICALS SUPPLEMENT] 697 2F00-2FDF [KANGXI RADICALS] 698 2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS] 699 3000 IDEOGRAPHIC SPACE 700 3001 IDEOGRAPHIC COMMA 701 3002 IDEOGRAPHIC FULL STOP 702 3003 DITTO MARK 703 3008 LEFT ANGLE BRACKET 704 3009 RIGHT ANGLE BRACKET 705 33A7 SQUARE M OVER S 706 33A8 SQUARE M OVER S SQUARED 707 33AE SQUARE RAD OVER S 708 33AF SQUARE RAD OVER S SQUARED 709 33C2 SQUARE AM 710 33C2 SQUARE AM 711 33C6 SQUARE C OVER KG 712 33C7 SQUARE CO 713 33D8 SQUARE PM 714 33D8 SQUARE PM 715 D800-DFFF [SURROGATE CHARACTERS] 716 E000-F8FF [PRIVATE USE, PLANE 0] 717 FB1D-FB4F [HEBREW PRESENTATION FORMS] 718 FB50-FDFF [ARABIC PRESENTATION FORMS A] 719 FE20-FE2F [COMBINING HALF MARKS] 720 FE30-FE4F [CJK COMPATIBILITY FORMS] 721 FE50-FE6F [SMALL FORM VARIANTS] 722 FE70-FEFC [ARABIC PRESENTATION FORMS B] 723 FEFF ZERO WIDTH NO-BREAK SPACE 724 FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS] 725 FFF9 INTERLINEAR ANNOTATION ANCHOR 726 FFFA INTERLINEAR ANNOTATION SEPARATOR 727 FFFB INTERLINEAR ANNOTATION TERMINATOR 728 FFFC OBJECT REPLACEMENT CHARACTER 729 FFFD REPLACEMENT CHARACTER 730 Unassigned characters 732 4. 
Case Folding 734 After it has been verified that the input text has none of the 735 characters prohibited for case folding, the case-folding step itself is 736 quite straightforward. For each character in the input, if there is a 737 lowercase mapping for that character in [UniData], the input character 738 is changed to the mapped lowercase letter. 740 5. Canonicalization 742 After case folding, the input string is normalized using form KC, as 743 described in [UTR15]. 745 6. IDN Table Revisions 747 A table consisting of all characters allowed and prohibited and the 748 rules for case folding and canonicalization will be created based on the 749 content of [UniData] and on the content of this document. This table 750 will be the authority for implementations to follow and will be 751 normatively referenced by this document. Such a table will enable the 752 IDN protocol to have versions independent of the revisions to Unicode 753 and/or to ISO 10646 because the revision of IDN and its deployment may 754 not be in sync with revisions to Unicode and ISO 10646. 756 In a future draft of this document, IANA will be asked to keep this 757 table, with an initial version number of 1. Each new version of the 758 table will have a new, higher version number. 760 7. Security Considerations 762 Much of the security of the Internet relies on the DNS. Thus, any change 763 to the characteristics of the DNS can change the security of much of the 764 Internet. 766 Host names are used by users to connect to Internet servers. The 767 security of the Internet would be compromised if a user entering a 768 single internationalized name could be connected to different servers 769 based on different interpretations of the internationalized host name. 771 8. References 773 [IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name 774 Proposals", draft-ietf-idn-compare. 776 [IDNReq] James Seng, "Requirements of Internationalized Domain Names", 777 draft-ietf-idn-requirement.
779 [ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information 780 technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 781 1: Architecture and Basic Multilingual Plane. Five amendments and a 782 technical corrigendum have been published up to now. UTF-16 is described 783 in Annex Q, published as Amendment 1. 17 other amendments are currently 784 at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE 785 UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]] 787 [Normalize] Character Normalization in IETF Protocols, 788 draft-duerst-i18n-norm-03 790 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate 791 Requirement Levels", March 1997, RFC 2119. 793 [RFC2396] Tim Berners-Lee, et. al., "Uniform Resource Identifiers (URI): 794 Generic Syntax", August 1998, RFC 2396. 796 [RFC2732] Robert Hinden, et. al., Format for Literal IPv6 Addresses in 797 URL's, December 1999, RFC 2732. 799 [STD13] Paul Mockapetris, "Domain names - implementation and 800 specification", November 1987, STD 13 (RFC 1035). 802 [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version 803 3.0", ISBN 0-201-61633-5. Described at 804 . 806 [UniData] The Unicode Consortium. UnicodeData File. 807 . 809 [UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms. 810 Unicode Technical Report #15. 811 . 813 A. Acknowledgements 815 Many people from the IETF IDN Working Group and the Unicode Technical 816 Committee contributed ideas that went into the first draft of this 817 document. Mark Davis was particularly helpful in some of the early 818 ideas. 820 B. Changes From Previous Versions of this Draft 822 This is the -00 version, so there are no changes. 824 C. IANA Considerations 826 There are no specific IANA considerations in this draft, but there will 827 be in a future draft of this document. 829 D. 
Author Contact Information 831 Paul Hoffman 832 Internet Mail Consortium and VPN Consortium 833 127 Segre Place 834 Santa Cruz, CA 95060 USA 835 paul.hoffman@imc.org and paul.hoffman@vpnc.org 837 Marc Blanchet 838 Viagenie inc. 839 2875 boul. Laurier, bur. 300 840 Ste-Foy, Quebec, Canada, G1V 2M2 841 Marc.Blanchet@viagenie.qc.ca
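Informal sketch of the name preparation steps

The steps in section 2.2 (prohibited-input check, case folding, and form KC normalization, in that order) can be sketched roughly as follows. This is a minimal, non-normative sketch under several assumptions: the names PROHIBITED_RANGES, is_prohibited, prepare_name_part, and prohibited_in_output are hypothetical; the table is only a small subset of section 3.8; Python's str.lower() stands in for the per-character lowercase mapping of section 4 and may differ from it for a few characters; and Python's unicodedata tables follow a later Unicode version than [Unicode3].

```python
import unicodedata

# Illustrative subset of the section 3.8 table (assumption: a real
# implementation would load the complete list).
PROHIBITED_RANGES = [
    (0x0000, 0x001F),  # control characters
    (0x0020, 0x0020),  # SPACE
    (0x002E, 0x002E),  # FULL STOP
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x2000, 0x200B),  # spaces
    (0xD800, 0xDFFF),  # surrogate characters
    (0xE000, 0xF8FF),  # private use, plane 0
    (0xFF00, 0xFFEF),  # halfwidth and fullwidth forms
]


class ProhibitedCharacterError(ValueError):
    """Raised when the input contains a prohibited character (step 2)."""


def is_prohibited(ch):
    """True if ch is in the sample table or is unassigned (section 3.7.1)."""
    cp = ord(ch)
    if any(low <= cp <= high for low, high in PROHIBITED_RANGES):
        return True
    # General category "Cn" marks code points unassigned in Python's tables.
    return unicodedata.category(ch) == "Cn"


def prepare_name_part(text):
    """Apply steps 2 to 4 of section 2.2 to one name part."""
    for ch in text:
        if is_prohibited(ch):
            raise ProhibitedCharacterError(
                f"prohibited character U+{ord(ch):04X}")
    folded = text.lower()  # step 3, approximated with str.lower()
    return unicodedata.normalize("NFKC", folded)  # step 4


def prohibited_in_output(text):
    """List prohibited characters that appear only after preparation."""
    return [ch for ch in prepare_name_part(text) if is_prohibited(ch)]
```

The last function relates to the open issue in section 1.3.3. For example, U+00A8 DIAERESIS is not in the section 3.8 list, but its form KC normalization begins with U+0020 SPACE, so prohibited_in_output("\u00a8") reports a space. Under these assumptions, that suggests a check on the output of normalization may be needed.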